Wednesday, June 4, 2008
Digital Collections: Preservation or Access
From: Jim Lindner
Date: Mon, 2 Jun 2008 15:32:32 -0400
Digital Preservation has been successfully going on for decades -
quietly and successfully, every second, every day, for decades. This
is nothing new. What is "new" is that it has not been practiced in
this particular field - but there is no reason why we cannot, and should
not, learn from others that do it on a regular basis. Unfortunately the
more frequent mantra is the "can not do" one, usually with the same
old NASA example of the lost data - - - SEE, even NASA loses data -
another good excuse to do nothing.
Unfortunately (or fortunately) even the time for that argument has run
out. The reality is that there is no other choice - so either you
start to learn the tools and the technology or hope that your
retirement can be early enough that you will not have to (and sadly
the latter is the route that it seems that many hope for in this
field). This is not meant to be an overall indictment of the field -
rather stating what appears to be obvious - - that there are oh so
many reasons why something can not work, so of course let's not try or
certainly not learn how others do things. It does not have to be this
way. It should not be this way - there IS a choice.
Digital Preservation not possible? I say - nonsense. Close your eyes
as tight as they can be closed and maybe it will all go away. I don't
think so. If you want to learn how to preserve data - get a book on
the subject - there is no shortage... just go to the Computer Science
section of your local book store - or do I dare say - the library. You
don't even need to read an entire book - most basic Computer Science
books discuss backup and archiving strategies in a chapter or two and
in depth.
Can you imagine the reaction in a major corporation if the CEO asks
the head of IT for the annual report from five years ago and the
manager replied that it can not be retrieved because they switched to
a new version of Word Perfect and no longer can retrieve the record?
Is this really the excuse that we are using to not start using Digital
Preservation? That operating systems change and we can't play back
those 8" floppy disks any more? This is just plain silly. Industry has
been "preserving" its data for decades. Stock Exchanges can find a
single transaction among billions every week - for decades - every
one. Manufacturers can find part numbers for cars out of production
for decades - and tell you the new replacement part number - we ASSUME
these things. It is part of the way things "work". How does it work?
It is very simple - you don't wait 20 years to migrate a file - you do
NOT put it on a shelf like it is a library book for 20 years and hope
you can read it - because you will NOT be able to read that. We know
that - it is OK. You don't try strategies that work for books on
data.... why? Well because data are not books - and data requires a
different paradigm and strategy - but the really good news is that
that strategy has been defined and used, reliably - for decades - and
we all use it in our lives every single day - we depend on it - there
is no turning back this clock.
Why is it that we do not hear these worries when it comes time to use
an ATM - are you worried that the bank has not preserved your bank
balance and will erroneously give you an extra million or two - or
have banks somehow figured out how to keep track of transactions for
decades and so have a current balance? Are mistakes made - yes - we
all reconcile our bank accounts and know they do - but is that a
reason to go back to ledger books? Could we go back to ledger books
even if we wanted to? Do you hear of many banks that have lost all
their files the way NASA did - no? Why - simple - because they migrate
each and every day - they know that it is not enough to just "back up"
their files - one must keep the data current by changing as the
applications change. The idea is NOT to keep a file in the same format
for 20 years and then complain that it can not be opened. Have
operating systems changed for the banks - yes. Applications - sure.
Floppy Drives? What are those? Somehow they got over it and figured it
out. Dare I say - very quietly - shhh - - - why not just copy what
they do - - - it seems to work!!!!!
One can not think in one paradigm and operate in another. The world
HAS changed and continues to - that is a good thing. Change brings
challenges AND opportunity. While we are all sworn to preserve and
protect - have we also sworn to close our eyes to change and to not
try to learn and look around ourselves to perhaps learn better ways?
I did not take that oath. We know what does not work - and know what
can not work. When it comes time to look at this time and place - will
people wonder what took us so long to make a change that was so
obvious - - - did we have to lose so much? As we now look back to
other times and their losses - and shake our heads - so too shall
others in the future shake their heads about us - BUT in our case we
had much less of an excuse - we KNEW what did not work, and still we
persisted. The losses during our watch ARE preventable - the others
did not have those opportunities.
We are not an island. What we do is in many ways not that different
than what others do in other fields. It is time to embrace change, to
look with unfettered vision and see - and ask - and try - and yes take
the risk to fail because sometimes in innovation you will. Failure as
part of a process of innovation is an acceptable strategy, failure
with no process nor innovation is in my view - unforgivable.
Jim Lindner
Tuesday, May 6, 2008
Radio Frequency Post-Its
http://ambient.media.mit.edu/assets/_pubs/pranavIUI_quickies.pdf
Thursday, May 1, 2008
Web Analytics
Archives goals for the site
- facilitate contact
- description information
- about us
- mediate use
- promote services
What are the numbers before the web?
Google/analytics/tos.html
How do people get to the site? What are the most popular pages? What are the most popular searches? How do they move around on the site?
Referrers --> Google search to their site, try to filter out stuff. People aren't using subject guides. The home page is not the main entry point. People are going straight into holdings -- finding aids. Landing - 55% come in at a collection description and 75% leave.
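A rough sketch of how the landing-page and "come in and leave" numbers above could be worked out from raw visit data. It assumes each visit is available as an ordered list of page paths; the function name and the sample paths are made up for illustration and are not from the session or from Google Analytics.

```python
from collections import Counter


def landing_report(sessions):
    """Summarize entry pages for a list of visits.

    Each visit is a list of page paths in the order they were viewed
    (an assumed input shape, not an export format from any real tool).
    """
    visits = [s for s in sessions if s]           # ignore empty visits
    landings = Counter(s[0] for s in visits)      # first page of each visit
    bounces = Counter(s[0] for s in visits if len(s) == 1)  # one-page visits

    report = {}
    for page, entries in landings.items():
        report[page] = {
            "entries": entries,
            # share of all visits that started on this page
            "entry_share": entries / len(visits),
            # share of those visits that left without a second page
            "bounce_rate": bounces[page] / entries,
        }
    return report


if __name__ == "__main__":
    sample = [
        ["/findingaid/a"],
        ["/findingaid/a", "/contact"],
        ["/"],
        ["/findingaid/a"],
    ]
    for page, stats in landing_report(sample).items():
        print(page, stats)
```

Sorting the report by entries shows quickly whether the home page or the finding aids are the real front door.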
What are they searching for? You can replicate the search. The more complex the search, the longer a person stays on the site. Can improve the metadata based on the current searches.
Found that their assumptions were wrong. People don't use the home page. They bounce around, no obvious path. Found that digital content is NOT "value added" service, it's THE THING that people want. Google optimization matters. 66 character rule - the first 66 characters of the site title are what Google searches on. Content then location.
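A small sketch of the "content then location" idea, checked against the 66-character figure as it was given in the session (the number is taken from these notes and not verified against any search engine documentation). The function names, the separator, and the sample titles are assumptions for illustration.

```python
TITLE_LIMIT = 66  # figure quoted in the session notes, not independently verified


def build_title(content, location, separator=" | "):
    """Put the collection or page content first, the repository name second."""
    return f"{content}{separator}{location}"


def title_check(title, limit=TITLE_LIMIT):
    """Report whether a page title fits the limit and what part falls inside it."""
    return {
        "length": len(title),
        "fits": len(title) <= limit,
        "visible": title[:limit],
    }


if __name__ == "__main__":
    title = build_title("Smith Family Papers", "Example University Archives")
    print(title_check(title))
```

Putting the content first means that if a long title does get cut off, it is the repository name that is lost and not the collection name.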
U of I redesigned their pages.
- They now use breadcrumbs on all pages
- Search box is prominent - near the top
- Tabs across the top
- Contact Us in the upper right corner of every page
- login box for staff at the bottom of the page
- nothing falls off the page - everything is above the fold
- filling in the sweet spots on pages
Tuesday, April 29, 2008
The Future Belongs to Archivists
Archivists have been out in front, setting an example for our colleagues in libraries and museums. We've pioneered collection-level records, addressed our backlogs, pooled our collection descriptions in XML, and recognized synergies of unique materials and digital libraries. The rate of change will continue to accelerate. Our jobs are changing, research expectations are changing, and sometimes the way we have always done things will no longer do. We need to take risks and experiment. Yet, this so-called redefinition of archives today reinforces longstanding archival theory, standards, and practice.
The future requires us to re-examine and embrace our traditions. Our experience thinking about context, aggregate-level description, and documentation practice can enable efficiencies in the digital environment. We have selected, arranged, described, and preserved our archival collections for a primary purpose -- LONG TERM ACCESS. Now we need to disclose our collections where researchers EXPECT TO FIND THEM: on the Web. This future holds opportunities to connect with researchers in ways we have always wanted to.
Intro --
Appraisal needs to come 1st, not during processing. Don't take it all just to be on the safe side. Have a collection plan! Only look at those things that fit that plan. Don't take anything that DOESN'T fit. Field appraisal in the home with clear communication results in very few problems in practice. FAST - sampling, asking relevant questions, NO WEEDING.
E-records - need method & practice. Why should e-rec arrangement & description be any different from paper rec.? Appraisal is the same as well.
Digitization -- can find things if series and sub-series are done well & there is an understanding of provenance. Get over our fascination w/ individual docs. Perfect is the enemy of the good. More guardians in archives than the general public. Processes need to be FLEXIBLE -- ask why we do things this way DAILY!
Jennifer Schaffner - Future Belongs to Archivists
"I believe the archivists are the future, teach them well and let them lead the way . . . " Nah, she didn't sing it or play it . . .
Leading the world of information; change is the order of the day; be on the web; access is the key to survival of archives.
The Big Bang - creating the new library universe - Aussies - digital content on the rise, we collect local to present to the world. Compared access to digital collections to locked up journals, etc. Deliver the archives to the public. The digital is both the original and the backup - mindless itemitis!! Look for relationships, scan on demand, not just images. Scan documents rather than photocopy, low resolution - UT Austin is doing this. Goes to PDF, may scan entire record, not just the page(s) asked for. Don't worry too much about the metadata.
People want more stuff, no more item level metadata.
How to Introduce Change . . .
Funding - large scale access through NEH and NHPRC - CHANGE is OUR RESPONSIBILITY!!
Three things to try at home
- microfilm conversion to digital and served online
- ordinary digital camera instead of scanner!
- scan master negs that are stock in trade and get them online
Route 66 - some people like the mother road just fine
some people like the interstate better . . .
and then there are the people who just want to FLY . . .
Start Flying!
The Useful 10 Words of the 10,000 -- Describing Photographs in Words
- Write about what you see - 10 words or less
- Method of seeing - divide and conquer, start in the center, then look at quadrants for details
- Who, What, When, Where, Why
- Subject Terms - names, places, topics - who will use them, researchers, archivists, what's important to the repository, controlled vocabulary, consistent
- What story are they telling? - clothing, surroundings, attitude, your imagination
- Posed/snapshot - purpose
- Exhibit titles can be different
More Than One Way to Meet the Challenge: Systematic Approaches to the Capture & Preservation of Complex Digital Artifacts
Part I
Many of the state's digital records are disappearing -- In the age of the Internet, it is a simple matter to get government information w/ a few mouse clicks. But what if you're interested instead in what your gov't. was doing 3 years ago? The full article will be available on the Web for a limited time.
Periodic Desktop Hardware Replacement Program with varying saves of dbase.
Web 2.0 poses special problems w/outsiders adding to a site. Ex. of SI site - has images, streaming video, java script, etc.
Possible future: Submission Info Package --> Data Management --> Archival Info Package --> data management --> Dissemination Info Package - creates a derivative for the researcher to use.
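The package flow above can be sketched roughly as follows. This is only an illustration of the three package types and the hand-offs between them; the class names, fields, and the ingest/disseminate functions are assumptions and do not come from the presenters or from any particular repository system.

```python
from dataclasses import dataclass, field


@dataclass
class SubmissionPackage:
    producer: str
    files: dict  # file name -> bytes, as received from the producer


@dataclass
class ArchivalPackage:
    identifier: str
    files: dict  # the preserved originals, unchanged
    provenance: list = field(default_factory=list)


@dataclass
class DisseminationPackage:
    source: str       # identifier of the archival package it came from
    derivatives: dict  # file name -> derivative made for the researcher


def ingest(sip, identifier):
    """Turn a submission into an archival package and record where it came from."""
    return ArchivalPackage(
        identifier=identifier,
        files=dict(sip.files),
        provenance=[f"ingested from {sip.producer}"],
    )


def disseminate(aip, make_derivative):
    """Create researcher copies; the archival package itself is left untouched."""
    derivatives = {name: make_derivative(data) for name, data in aip.files.items()}
    return DisseminationPackage(source=aip.identifier, derivatives=derivatives)


if __name__ == "__main__":
    sip = SubmissionPackage("Example Agency", {"minutes.txt": b"hello world"})
    aip = ingest(sip, "aip-0001")
    dip = disseminate(aip, lambda data: data[:5])  # stand-in for a real derivative
    print(dip)
```

The point of the split is that the researcher only ever touches the derivative, while the archival copy stays as it was ingested.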
Part II - Lessons Learned: Archiving E-Mail
Rockefeller Center & SI collaborative grant project
If you had to write the email metadata, you'd write LESS EMAIL!!
Make up my mind -- which do you want? Keep EVERYTHING, Destroy EVERYTHING . . . need 2 systems working together - 1 archival and 1 email destroyer.
Storage Format = XML b/c it's OPEN, human readable, "self-describing," a good descriptive schema allows validity checking, many open source tools to create, manipulate & read xml
David Minor at DCR came up with Mail Account XML schema - [account][folder][message][header][body][attachment], etc. Coming soon to an archival computer near YOU! or more to the point ME!! Yippee!!
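To get a feel for the nesting, here is a minimal sketch that turns raw messages into XML using the element names jotted above. It is NOT the actual Mail Account schema: the attribute names, the "field" element, and both function names are assumptions made for illustration, using only the Python standard library.

```python
from email import message_from_string
from email.policy import default
import xml.etree.ElementTree as ET


def message_to_xml(raw):
    """Build a <message> element with <header>, <body> and <attachment> children."""
    msg = message_from_string(raw, policy=default)
    message = ET.Element("message")

    header = ET.SubElement(message, "header")
    for name, value in msg.items():
        # "field" is an invented element name for one header line
        ET.SubElement(header, "field", name=name).text = str(value)

    body = ET.SubElement(message, "body")
    text_part = msg.get_body(preferencelist=("plain",))
    body.text = text_part.get_content() if text_part is not None else ""

    for part in msg.iter_attachments():
        # only a pointer to the attachment is kept in the XML
        ET.SubElement(message, "attachment", filename=part.get_filename() or "")
    return message


def account_to_xml(account_name, folders):
    """folders maps a folder name to a list of raw message strings."""
    account = ET.Element("account", name=account_name)
    for folder_name, raw_messages in folders.items():
        folder = ET.SubElement(account, "folder", name=folder_name)
        for raw in raw_messages:
            folder.append(message_to_xml(raw))
    return account


if __name__ == "__main__":
    raw = "From: a@example.org\nTo: b@example.org\nSubject: Test\n\nHello\n"
    tree = account_to_xml("jdoe", {"Inbox": [raw]})
    print(ET.tostring(tree, encoding="unicode"))
```

Because the output is plain XML, it can be opened in a text editor and checked against a schema, which is the "human readable" and "validity checking" argument from the session.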
Archives viruses - no, really! You can't get them out, so just save them too. 4 gig is the biggest known email account at SI, haven't tried to save / validate it yet, it may not work.
Prototype EMail Conversion Results - they have converted and validated 70,000 messages in 3 test sets to the XML Mail Account schema. SI - 5,537 messages in 232 Mb of recent Outlook Mail - 99.7% successfully parsed - 4 sticky that turned out to be garbage . . .
SI - 28,000 messages in 1.5 Gb Outlook account - 99.975% successful, 5 unparsed
RA - 43,778 messages in 378 Mb of older eclectic mail for RAC - 98.85% successfully parsed, 74 unparsed, but improvement is clearly possible
Lessons Learned
- 100% success is unrealistic
- We CAN achieve at least 99.9% success and save the few unparsed email for human inspection
- And DSpace can store and retrieve it!
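The "set the unparsed few aside for a human" lesson can be sketched like this. The parser is passed in as a function and is assumed to raise ValueError on a message it cannot handle; that convention, the function names, and the per-thousand target parameter are all assumptions for illustration, not how the project's converter works.

```python
def triage(raw_messages, parse):
    """Parse what can be parsed; keep the rest for human inspection."""
    parsed, needs_review = [], []
    for raw in raw_messages:
        try:
            parsed.append(parse(raw))
        except ValueError:
            needs_review.append(raw)  # saved, not discarded
    return parsed, needs_review


def meets_target(total, unparsed, target_per_thousand=999):
    """True if the parsed share is at least the target (999 per 1000 = 99.9%).

    Integer arithmetic is used so there is no floating-point rounding at the edge.
    """
    return (total - unparsed) * 1000 >= total * target_per_thousand


if __name__ == "__main__":
    good, review = triage(["1", "2", "oops"], int)  # int() stands in for a parser
    print(len(good), "parsed;", len(review), "for review")
    print(meets_target(total=10000, unparsed=5))
```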
Part III - Chasing the E-Tiger - Electronic Mail Capture and Preservation Tool - EMCAP
EMCAP tool - open source, client is configured to have Archives Folder that is mapped to DCR server, User can replicate file folder structure, mimics current drag/drop, drops email into the DCR collection server, internet message parser makes xml copy (see above)
There is a user client view and an archivist view. Allows for creation of account, administrator can change password and has field to add free text.
Parser - xml schema (above) represents all email in account, parses header info, text attachments converted to Unicode, leaves tag in schema to point to attachment, retain all original bit streams, when saving an external file it creates a message digest with unique identifier -- important -- verifies that no changes were made which has legal implications
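The message digest step can be sketched as below. The notes do not say which digest algorithm EMCAP uses, so SHA-256 here is an assumption, as are the function names and the in-memory dictionary standing in for real storage.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex digest of the original bit stream (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def store_attachment(data: bytes, store: dict) -> str:
    """Save the original bytes under their digest and return it as the identifier."""
    identifier = digest(data)
    store[identifier] = data
    return identifier


def verify(identifier: str, store: dict) -> bool:
    """Recompute the digest; a mismatch means the stored bytes have changed."""
    return identifier in store and digest(store[identifier]) == identifier


if __name__ == "__main__":
    store = {}
    ident = store_attachment(b"original attachment bytes", store)
    print(ident, verify(ident, store))
```

Using the digest as the identifier means the pointer left in the XML doubles as the fixity check for the external file.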
Next steps - development of additional .pst file import capability, finish training docs and roll out tool to state partners - one of which is KENTUCKY! Hurrah!!
Part IV - Archival Prototypes & Lessons Learned
Very technical - see Project Wiki & Papers
Project Management for Archivists
My projects
- processing collections
- film preservation project
- digital projects
Things to manage
- people - different levels
- expectations
- budget
- deliverables
- time
Vision - whose vision, what happens if it changes - oh no! Have to get everyone to buy in to the vision. Think specifically.
Project manager is the goalie - put the ball back into play and keep the team working. Advocate for more resources.
How does my management style impact on the project? Be approachable. Expect stuff to go wrong on the first day. Be there to make the changes and the decisions, get it in writing. Be clear on the decisions that impact the project.
Focus on primary audience.
Mission - what we want to do http://www.ohiomemory.org/om/mission.html
Vision - what that's going to look like - http://memory.loc.gov/ammem/dli2/html/lcndlp.html http://www.cdpheritage.org/about/mission.html
Establishing goals - SMART = Specific, Measurable, Achievable, Realistic, Timebound
Communication re: what's possible. Project creep - getting ahead of yourself, take on new, bigger things as the project goes on, STAY FOCUSED! Develop a template/structure to base the next project on. Pad the timeline - stuff WILL happen.
Goals - http://www.dlib.indiana.edu/about/planning/stratPlan.shtml
Know your collection well enough to get good numbers as end goal. Do these things actually exist? Find out, don't guess.
Identify and select appropriate standards - save time and budget in the long run.
Leave a record of what you did for the next fool, ahem, archivist . . .
Money - 1st time through - the amount of money that it will take to do the very best project. Then scale back to what you can get. Ask why - why are we doing the thing and does that impact needs, level of quality, etc.?
Work flow - road map, set of relationships b/t all the steps in a project start to finish with triggers. DETAILED, the more detailed the better. Do a test of all the sections as a part of the planning process. Can you actually do the thing that you are planning?!
LC - DLP Project Planning Checklist
OCLC 12 step process:
- material check-in
- project spec sheet
- material preparation
- digital capture
- quality assurance
- metadata collection
- file naming/directories
- ocr processing
- derivative file creation
- indexing
- media burn
- review & acceptance
Quality control - builds trust of all parties. Establish and document specific criteria that define what is and is not acceptable.
Delegate - the project will be better with collaboration. You need to be able to do all the parts at some level in order to lead, answer questions, troubleshoot, etc. List on the bulletin board better than a calendar.
Evaluation - have you determined a need? Have you met the need? Have you changed lives? for the better . . . ?
Difficulty and rewards increase exponentially with number of collaborators and complexity of project.
- You must believe!
- Do a sample/pilot project
- Decisions depend on circumstance
- The only ones who don't make mistakes are those who don't DO ANYTHING . . .
DO SOMETHING!