Tuesday, April 29, 2008

More Than One Way to Meet the Challenge: Systematic Approaches to the Capture & Preservation of Complex Digital Artifacts

Riccardo Ferrante, SI Archives; Kelly Eubank, DCR; Steve Burbeck, RAC; Mike Smorul, UMD
Part I
Many of the state's digital records are disappearing -- In the age of the Internet, it is a simple matter to get government information w/ a few mouse clicks. But what if you're interested instead in what your gov't. was doing 3 years ago? The full article will be available on the Web for a limited time.

Periodic Desktop Hardware Replacement Program with varying saves of dbase.

Web 2.0 poses special problems w/ outsiders adding to a site. Ex. of SI site - has images, streaming video, JavaScript, etc.

Possible future: Submission Info Package --> Data Management --> Archival Info Package --> data management --> Dissemination Info Package - creates a derivative for the researcher to use.

Part II - Lessons Learned: Archiving E-Mail
Rockefeller Archive Center (RAC) & SI collaborative grant project
If you had to write the email metadata, you'd write LESS EMAIL!!

Make up my mind -- which do you want? Keep EVERYTHING, Destroy EVERYTHING . . . need 2 systems working together - 1 archival and 1 email destroyer.

Storage Format = XML b/c it's OPEN, human readable, and "self-describing"; a good descriptive schema allows validity checking; many open source tools exist to create, manipulate & read XML

David Minor at DCR came up with Mail Account XML schema - [account][folder][message][header][body][attachment], etc. Coming soon to an archival computer near YOU! or more to the point ME!! Yippee!!
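
To make the nesting concrete, here is a rough sketch of what an account laid out this way might look like when built with Python's standard library. Only the six element names (account, folder, message, header, body, attachment) come from the talk; the attributes, the `field` element, and the sample data are my own guesses, not the real Mail Account schema.

```python
import xml.etree.ElementTree as ET


def build_account_xml(account_name, folders):
    """Build an <account> tree from {folder name: [message dicts]}.

    Element names follow the talk; attribute names and the <field>
    element are hypothetical, not taken from the actual schema.
    """
    account = ET.Element("account", name=account_name)
    for folder_name, messages in folders.items():
        folder = ET.SubElement(account, "folder", name=folder_name)
        for msg in messages:
            message = ET.SubElement(folder, "message")
            header = ET.SubElement(message, "header")
            for key, value in msg["headers"].items():
                # Assumed layout: one <field name="..."> per header line.
                ET.SubElement(header, "field", name=key).text = value
            ET.SubElement(message, "body").text = msg["body"]
            for filename in msg.get("attachments", []):
                # The tag only points at the attachment; the file itself
                # would be stored outside the XML.
                ET.SubElement(message, "attachment", filename=filename)
    return account


sample = {
    "Inbox": [
        {
            "headers": {"From": "a@example.org", "Subject": "Hello"},
            "body": "Test message",
            "attachments": ["report.pdf"],
        }
    ]
}
root = build_account_xml("demo-account", sample)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the result is plain XML, any XML tool can read it back, which is the "open, human readable" point made above.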

Archive viruses - no, really! You can't get them out, so just save them too. 4 GB is the biggest known email account at SI; they haven't tried to save / validate it yet, and it may not work.

Prototype EMail Conversion Results - they have converted and validated 70,000 messages in 3 test sets to the XML Mail Account schema.
SI - 5,537 messages in 232 MB of recent Outlook mail - 99.7% successfully parsed - 4 sticky that turned out to be garbage . . .
SI - 28,000 messages in 1.5 GB Outlook account - 99.975% successful, 5 unparsed
RAC - 43,778 messages in 378 MB of older eclectic mail - 98.85% successfully parsed, 74 unparsed, but improvement is clearly possible

Lessons Learned
  1. 100% success is unrealistic
  2. We CAN achieve at least 99.9% success and save the few unparsed email for human inspection
  3. And DSpace can store and retrieve it!
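
Lesson 2 amounts to "convert what you can, set the rest aside for a person to look at." A minimal sketch of that idea, using Python's built-in email package rather than the project's own parser (which I haven't seen). The rule for what counts as "unparsed" here (parser defects, or no From/Date header) is my assumption, not theirs.

```python
from email import message_from_bytes, policy


def convert_messages(raw_messages):
    """Split raw messages into (parsed, unparsed).

    unparsed holds (index, raw bytes) pairs so the original bit stream
    is kept for human inspection. The acceptance rule is hypothetical.
    """
    parsed, unparsed = [], []
    for index, raw in enumerate(raw_messages):
        try:
            msg = message_from_bytes(raw, policy=policy.default)
        except Exception:
            # Any parser failure goes to the human-inspection pile.
            unparsed.append((index, raw))
            continue
        has_basics = "From" in msg and "Date" in msg
        if msg.defects or not has_basics:
            unparsed.append((index, raw))
        else:
            parsed.append(msg)
    return parsed, unparsed


good = (
    b"From: a@example.org\r\n"
    b"Date: Tue, 29 Apr 2008 10:00:00 -0400\r\n"
    b"Subject: Hi\r\n"
    b"\r\n"
    b"Body\r\n"
)
bad = b"\x00\x01 not an email at all"

parsed, unparsed = convert_messages([good, bad])
rate = 100.0 * len(parsed) / (len(parsed) + len(unparsed))
print(f"{len(parsed)} parsed, {len(unparsed)} set aside, {rate:.1f}% success")
```

Nothing is thrown away: the unparsed pile keeps the raw bytes, so a better parser can be tried on it later.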

Part III - Chasing the E-Tiger - Electronic Mail Capture and Preservation Tool - EMCAP

Grant Web Site

EMCAP tool - open source. Client is configured to have an Archives Folder that is mapped to the DCR server. User can replicate their file folder structure there; it mimics the current drag/drop. Dropping email into the folder puts it on the DCR collection server, and an internet message parser makes an XML copy (see above).

There is a user client view and an archivist view. Allows for creation of account, administrator can change password and has field to add free text.

Parser - XML schema (above) represents all email in an account; parses header info; text attachments are converted to Unicode; leaves a tag in the schema to point to each attachment; retains all original bit streams. When saving an external file it creates a message digest with a unique identifier -- important -- this verifies that no changes were made, which has legal implications.
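
The message digest idea is easy to show in a few lines. This is only a sketch: the talk didn't say which hash algorithm EMCAP uses, so SHA-256 is my assumption, and the function names are made up for illustration.

```python
import hashlib


def make_digest(data: bytes) -> str:
    """Return a hex digest of the attachment bytes (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Recompute the digest and compare it to the one stored at capture."""
    return make_digest(data) == recorded_digest


original = b"Quarterly report, final version"
recorded = make_digest(original)  # stored alongside the attachment tag

print(is_unchanged(original, recorded))          # same bytes
print(is_unchanged(original + b"!", recorded))   # one byte added
```

Any change to the file, even a single byte, gives a different digest, which is what lets the archive show later that the saved copy was not altered.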

Next steps - development of additional .pst file import capability, finish training docs and roll out tool to state partners - one of which is KENTUCKY! Hurrah!!

Part IV - Archival Prototypes & Lessons Learned
Very technical - see Project Wiki & Papers

Comments:

Dedpepl said...

Related article: http://www.charlotte.com/409/story/623030.html