Digital Preservation: new assessment tools

This year I collaborated with Chris Fryer of Northumberland Estates on a project under the auspices of Jisc’s SPRUCE funding. It ended up as a case study: an assessment of available digital preservation solutions. The main aim was to build outputs that would have value to smaller organisations intending to implement digital preservation on a limited budget; Chris in particular wanted something aligned very closely to his own business case and local practices.

We believe that the methodology we used on this project, if not the actual deliverables, will have some reuse value for other small organisations. There are four useful outputs in our toolkit:

  1. A requirements shopping list – a specification of what the chosen system would have to do
  2. An assessment form – the same shopping list, expressed as a scored checklist to assess a system
  3. Example(s) of assessments of real-world solutions
  4. A very simple self-assessment form for scoring organisational preparedness for digital preservation, based on ISO 16363.

The Requirements Deliverable is essentially a “shopping list” of what the chosen system has to do to perform digital preservation. It was built from a combination of:

1. The OAIS standard (somewhat selectively)
2. US National Library of Medicine 2007 specification
3. Suggestions sent by Jen Mitcham (Digital Archivist at the University of York), QA supplier to the project

We wanted to keep the specification concise, manageable and realistic so that it would meet the immediate business needs of Northumberland Estates, while still following best practice. The project team agreed that it was not necessary to adhere to every last detail of OAIS compliance. This approach might horrify purists, but it worked in this context.

The Assessment Form deliverable is a recasting of the requirements document into a form that could be used for assessing a preservation solution. We added a simple scoring range, plus a weighting methodology that gives extra weight to the “essential” requirements.
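To make the weighting concrete, here is a minimal sketch of the kind of calculation the assessment form implies; the requirements, raw scores and weights below are hypothetical, not the ones used in the project.

```python
# Minimal sketch of a weighted requirements score, using hypothetical data.
# Each requirement gets a raw score (0-3) from the assessor; "essential"
# requirements carry a higher weight than "desirable" ones.

requirements = [
    # (requirement, raw score 0-3, essential?)
    ("Generates and verifies checksums on ingest", 3, True),
    ("Records preservation actions as metadata",   2, True),
    ("Provides a browse interface for end users",  3, False),
]

ESSENTIAL_WEIGHT = 2   # hypothetical weighting
DESIRABLE_WEIGHT = 1
MAX_RAW_SCORE = 3

weighted = sum(score * (ESSENTIAL_WEIGHT if essential else DESIRABLE_WEIGHT)
               for _, score, essential in requirements)
maximum = sum(MAX_RAW_SCORE * (ESSENTIAL_WEIGHT if essential else DESIRABLE_WEIGHT)
              for _, _, essential in requirements)

print(f"Weighted score: {weighted}/{maximum} ({100 * weighted / maximum:.0f}%)")
```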

With these two deliverables, we achieved a credible specification and assessment method that is a good fit for Northumberland Estates. Our methodology shows it would be possible for any small organisation to devise their own suitable specification. It is based not exclusively on OAIS, but on the business needs of NE and a simple understanding of the user workflow.

We used our documents to assess actual solutions (I looked at Preservica, the cloud-based version of Safety Deposit Box). Using these assessments, NE stands a better chance of selecting the right system for their business needs, and using a process that can be repeated and objectively verified.

This method should be regarded as quick and easy. Since we relied on supplier information, the success of the method depends on whether that information is accurate and truthful. But it would be a good first step towards selecting a supplier. More in-depth assessments of systems are possible.

Lastly we built the cut-down ISO 16363 assessment. This was suggested by the project sponsor to compensate for the technology-heavy direction we had been heading in. ULCC prepared a simplified version of ISO 16363, retaining only those requirements considered essential for the purposes of this project.

This deliverable was explicitly intended to complement and enhance the assessment of the repository solution within the context of this project. In particular, all of the standard’s section 4 on Digital Object Management is omitted, since most of its essential detail is already expressed in the repository assessment document.

The scoring element uses the Five Organisational Stages model (Kenney / McGovern). This is a very strong model and I also used it in the preparation of AIDA and for my contributions to CARDIO.

There are already a lot of self-assessment tools available for repositories, including very thorough and comprehensive tools like TRAC and DRAMBORA. But with this quick and easy approach, we show it is possible for an organisation to perform a credible ISO self-assessment in a very short time. Users of this tool effectively conduct a mini gap analysis of their organisation, the results of which could be used as a starting point for building a business case.

Chris’s final report on the project exists as a blog post. The deliverables can be downloaded from the SPRUCE project wiki.

Digitisation Course at Salford

Recently, I delivered a one-day training course on digitisation to Digital Humanities postgraduates in Salford. Elinor Taylor of Salford University won an AHRC grant for a Research Skills Enrichment project, called Issues in the Digital Humanities: A Key Skills Package for Postgraduate Researchers, and one of its strands was about improving digitisation skills; more specifically, how best to manage a digitisation project.

Elinor was unable to find anyone who could deliver the course they wanted, and commissioned ULCC to create a bespoke one. She at first thought a workshop / hands-on event might be best, where a digitisation workflow could be aligned with a real-world case: the papers from the Working Class Movement Library which they were scanning. In the end we agreed that an overview of management principles would be better. I was asked not to dwell on scanners and cameras, since the audience for the course would mostly be outsourcing their origination work to commercial providers. Audio-visual conversion was also out of scope.

My course was structured to follow a start-to-finish narrative. Inevitably this meant spending one-third of the time discussing the planning and preparation. I’m a great believer in asking the question “why” about 15 times before beginning any project, and the same applies to digitisation. Who wants this stuff? Why digitise it? Will it improve their lives if you do? More importantly, can we make the experience for a user even better by digitising it? Then if you get a sponsor for your project, there’s the management – the multiple considerations and the groundwork that has to be done before a single image is created.

Let’s skip ahead to image creation. When it comes to producing digitised content, my archivist training always tells me it’s best practice to create “archival originals” – or “master copies” – to a very high standard of resolution, quality, and format compatibility. I favour the process used by many professional digitisers of creating RAW images in a good camera and deriving TIFF files from those RAW images. From that point it’s possible to derive accessible copies – quite often low-resolution, small-sized JPEGs – for your user base. It was also good to address my favourite subject, metadata, and to attempt to stress why it’s such an essential part of digitisation. We need the descriptive catalogue metadata, but we also need the technical metadata for validating, handling and preserving image objects – especially if we want to put them into an image management system.
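As a rough illustration of that derivation step (not the exact workflow described above), here is a minimal sketch using the Pillow imaging library to turn a TIFF master into a small JPEG access copy; the file names and sizes are hypothetical.

```python
# Minimal sketch: derive a JPEG access copy from a TIFF "archival master".
# Assumes the Pillow library (pip install Pillow); paths and sizes are hypothetical.
from PIL import Image

MASTER = "letters_0001_master.tif"   # hypothetical archival master
ACCESS = "letters_0001_access.jpg"   # hypothetical access derivative

with Image.open(MASTER) as master:
    # Record a scrap of technical metadata about the master before deriving
    print(f"Master: {master.format}, {master.size[0]}x{master.size[1]} px, mode {master.mode}")

    access = master.convert("RGB")           # JPEG cannot store alpha or 16-bit channels
    access.thumbnail((1200, 1200))           # shrink to a web-friendly size, keeping aspect ratio
    access.save(ACCESS, "JPEG", quality=85)  # lossy but small enough for browsing
```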

As to image management systems, I sensed great interest in the room as I described all the useful features that you might find in a bespoke system such as Canto or Extensis Portfolio. At one level it almost seems these systems offer everything a digitisation project could want in terms of searching and browsing, integrated editing suites, and management tools – provided, of course, we have the metadata to begin with. If I do this course again I must remember to stress the essential role of metadata: a management system can’t manage very much without it.

Another subject which generated a lot of interest and discussion was copyright. Unsurprisingly, this continues to be the concern that nearly breaks the deal for a digitisation project. Do you have the right to digitise, store, preserve, disseminate and share this material? Do you attempt to take a managed risk, or go to the effort of resolving the question of ownership? “How do Google Books get away with what they are doing?” was one question from the floor. The copyright dilemma is that the law as it currently stands does all it can to prevent copying of copyrighted material, yet for any digitisation project (for that matter, any project involving IT or the internet) archivists have to make copies and allow users to make copies too. Is copyright law lagging behind the reality of the digital world? Discuss.

Some positive feedback from students:

  • “Great technical details and non-biased advice from an expert”.
  • “A very realistic view of the amount of work involved in digitisation projects, and the importance of planning”.
  • “The presenter was excellent. It was accessible and yet in-depth”.
  • “Very straightforward methodology of doing a digitisation project – jargon explained.”

Update 29th August:

Foiled by an implementation bug

I recently attempted to web-archive an interesting website called Letters of Charlotte Mary Yonge. The creators had approached us for some preservation advice, as there was some danger of losing institutional support.

The site was built on a WordPress platform, with some functional enhancements undertaken by computer science students, to create a very useful and well-presented collection of correspondence transcripts of this influential Victorian woman writer; within the texts, important names, dates and places have been identified and are hyperlinked.

Since I’ve harvested many WordPress sites before, I added the URL to Web Curator Tool, confident of success. However, problems appeared right from the start. One concern was that the harvest was taking many hours to complete, which seemed unusual for a small text-based site with no large assets such as images or media attachments. One of my test harvests even ran up to the 3 GB limit. As I often do in such cases, I terminated the harvests to examine the log files and folder structures of what had been collected up to that point.

This revealed that a number of page requests were showing a disproportionately large size, some of them collecting over 40 MB for one page – odd, considering that the average size of a gathered page in the rest of the site was less than 50 KB. When I tried to open these 40 MB pages in the Web Curator Tool viewer, they failed badly, often yielding an Apache Tomcat error report and not rendering any viewable text at all.

These pages weren’t actually static pages as such – it might be more accurate to call them responses to a query. A typical query was

http://www.yongeletters.com/letters-1850-1859?year_id=1850

a simple script that would display all letters tagged with a year value of 1850. Again, I’ve encountered such queries in my web-archiving activities before, and they don’t usually present problems like this one.

I decided to investigate this link’s behaviour, and others like it, on the live site. The page is supposed to load truncated links to other pages; instead, it loads the same request multiple times, ad infinitum. The code is actually looping, endlessly returning the result “Letters 1 to 10 of 11”, and will never complete its task.

When this behaviour on the live site is encountered by the web harvester Heritrix, it means the harvester is likewise sent into a loop of requests that can never be completed. This is what caused the 40 MB “page bloat” for these requests.

We have two options for web-archiving in this instance; neither one is satisfactory.

  • Remove the 3 GB system limit and let the harvester keep running. However, as my aborted harvests suggested, it would probably keep running forever, and the results still would not produce readable (or useful) pages.
  • Using exclusion commands, filter out links such as the one above (a sketch of the kind of exclusion pattern required follows this list). The problem with that approach is that the harvester misses a large amount of the very content it is supposed to be collecting, and the archived version is then practically useless as a resource. To be precise, it would collect the pages with the actual transcribed letters, but the method of navigating the collection by date would fail. Since the live site only offers navigation using the dated Letter Collection links, the archived version would remain inaccessible.
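To illustrate the second option, here is a minimal sketch of the kind of exclusion pattern that would be needed, written as a plain Python regular expression rather than in Web Curator Tool’s own configuration syntax; the first test URL is the looping query quoted above.

```python
# Minimal sketch: the kind of URL pattern an exclusion filter would need to reject.
# Written as plain Python for illustration, not in Heritrix/Web Curator Tool syntax.
import re

# Reject any "letters by decade" query carrying a year_id parameter
EXCLUDE = re.compile(r"^http://www\.yongeletters\.com/letters-\d{4}-\d{4}\?year_id=\d+$")

candidates = [
    "http://www.yongeletters.com/letters-1850-1859?year_id=1850",  # looping query: reject
    "http://www.yongeletters.com/about",                           # ordinary page: keep
]

for url in candidates:
    verdict = "exclude" if EXCLUDE.match(url) else "harvest"
    print(f"{verdict:7}  {url}")
```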

This is, therefore, an example of a situation where a web site is effectively un-archivable, as it never completes executing its scripts and potentially ties the harvester up forever. The only sensible solution is for the website owners to fix and test their code (which, arguably, they should have done when developing it). Until then, a valuable resource, and all the labour that went into it, will continue to be at risk of oblivion.

BlogForever: Preservation in BlogForever – an alternative view

From the BlogForever project blog

I’d like to propose an alternative digital preservation view for the BF partners to consider.

The preservation problem is undoubtedly going to look complicated if we concentrate on the live blogosphere. It’s an environment that is full of complex behaviours and mixed content. Capturing it and replaying it presents many challenges.

But what type of content is going into the BF repository? Not the live blogosphere. What’s going in is material generated by the spider: it’s no longer the live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BF system. If you like, the spider creates a “rendition” of the live web, recast into the form of a structured XML file.
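To make the idea of a “rendition” concrete, here is a minimal sketch of what such a structured record for a single post might look like; the element names and values are invented for illustration and are not the project’s actual data model.

```python
# Minimal sketch of what a spider-produced "rendition" of a blog post might look like.
# Element names and values are invented; the real BlogForever data model defines its own.
import xml.etree.ElementTree as ET

post = ET.Element("blog_post")
ET.SubElement(post, "source_url").text = "http://example.org/blog/2012/08/a-post"  # hypothetical
ET.SubElement(post, "title").text = "A post"
ET.SubElement(post, "author").text = "Jane Blogger"
ET.SubElement(post, "published").text = "2012-08-29T10:15:00Z"
body = ET.SubElement(post, "content", {"type": "xhtml"})
body.text = "<p>Parsed, structured body text goes here.</p>"

print(ET.tostring(post, encoding="unicode"))
```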

What I propose is that these renditions of blogs should become the target of preservation. This way, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce.

If these blog renditions are preservable, then the preservation performance we would like to replicate is the behaviour of the Invenio database, and not live web behaviour. All the preservation strategy needs to do is to guarantee that our normalised objects, and the database itself, conform to the performance model.

When I say “normalised”, I mean the crawled blogs that will be recast in XML. As I’ve suggested previously, XML is already known to be a robust preservation format. We anticipate that all the non-XML content is going to be images, stylesheets, multimedia, and attachments. Preservation strategies for this type of content are already well understood in the digital preservation world, and we can adapt them.

There is already a strand of the project that is concerned with migration of the database, to ensure future access and replay on applications and platforms of the future. This in itself could feasibly form the basis of the long-term preservation strategy.

The preservation promise in our case should not be to recreate the live web, but rather to recreate the contents of the BF repository and to replicate the behaviour of the BF database. After all, that is the real value of what the project is offering: searchability, retrievability, and creating structure (parsed XML files) where there is little or no structure (the live blogosphere).

Likewise it’s important that the original order and arrangement of the blogs be supported. I would anticipate that this will be one of the possible views of the harvested content. If it’s possible for an Invenio database query to “rebuild” a blog in its original order, that would be a test of whether preservation has succeeded.
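As a purely illustrative sketch of that test (not Invenio’s actual query interface), reconstructing original order amounts to something like sorting the harvested post records by their publication dates:

```python
# Minimal sketch of the "rebuild in original order" test: given harvested post
# records (hypothetical fields), can we reconstruct the blog's chronology?
posts = [
    {"blog": "example-blog", "title": "Third post",  "published": "2012-03-01"},
    {"blog": "example-blog", "title": "First post",  "published": "2012-01-10"},
    {"blog": "example-blog", "title": "Second post", "published": "2012-02-14"},
]

rebuilt = sorted((p for p in posts if p["blog"] == "example-blog"),
                 key=lambda p: p["published"])

for p in rebuilt:
    print(p["published"], p["title"])
```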

As to PREMIS metadata: in this alternative scenario the live data in the database and the preserved data are one and the same thing. In theory, we should be able to manipulate the database to devise a PREMIS “view” of the data, with any additional fields needed to record our preservation actions on the files.

In short, I wonder whether the project is really doing “web archiving” at all – and does it matter if we aren’t?

In summary I would suggest:

  • We consider the target of preservation to be crawled blogs which have been transformed into parsed XML (I anticipate that this would not invalidate the data model).
  • We regard the spidering action as a form of “normalisation” which is an important step to transforming unmanaged blog content into a preservable package.
  • Following the performance model proposed by the National Archives of Australia, we declare that the performance we wish to replicate is that of the normalised files in the Invenio database, rather than the behaviours of individual blogs. This approach potentially makes it simpler to define “significant properties”; instead of trying to define the significant properties of millions of blogs and their objects, we could concentrate on the significant properties of our normalised files, and of Invenio.

BlogForever: BlogForever and migration

From the BlogForever project blog

Recently I have been putting together my report on the extent to which the BlogForever platform operates within the framework of the OAIS model. Inevitably, I have thought a bit about migration as one of the potential approaches we could use to preserve blog content.

Migration is the process whereby we preserve data by shifting it from one file format to another. We usually do this when the “old” format is in danger of obsolescence for a variety of reasons, while the “target” format is something we think we can depend on for a longer period of time. This strategy works well for relatively static document-like content, such as format-shifting a text file to PDF.
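As a toy illustration of that kind of format shift (assuming the third-party reportlab library; the file names are hypothetical):

```python
# Toy migration sketch: shift a plain-text file to PDF.
# Assumes the reportlab library (pip install reportlab); paths are hypothetical.
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

def migrate_text_to_pdf(txt_path: str, pdf_path: str) -> None:
    width, height = A4
    pdf = canvas.Canvas(pdf_path, pagesize=A4)
    y = height - 40
    with open(txt_path, encoding="utf-8") as source:
        for line in source:
            pdf.drawString(40, y, line.rstrip())
            y -= 14                      # move down one line
            if y < 40:                   # start a new page when the current one is full
                pdf.showPage()
                y = height - 40
    pdf.save()

migrate_text_to_pdf("minutes_1998.txt", "minutes_1998.pdf")  # hypothetical files
```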

The problem with blogs, and indeed all web content, arises when we start thinking of the content exclusively in terms of file formats. The content of a blog could be said to reside in multiple formats, not just one; and even if we format-shift all the files we gather, does that really constitute preservation?

With BlogForever, we’re going for an approach to capture and ingest which seems to have two discrete strands to it.

(1) We will be gathering and keeping the content in its “original” native formats, such as HTML, image files, CSS etc. At time of writing, the current plan is that we will have a repository record for each ingested blog post, and all its associated files (original images, CSS, PDF, etc.) will be connected with this record. These separate files will be preserved and presumably migrated over time, if some of these native formats acquire “at risk” status.

(2) We are also going to create an XML file (complete with all detected Blog Data Model elements) from each blog post we are aggregating. What interests me here is that in this strand, an archived blog is being captured and submitted as a stream of data, rather than a file format. It so happens the format for storing that data-stream is going to be XML. The CyberWatcher spider is capable of harvesting blog content by harnessing the RSS feed from a blog, and by using blog-specific monitoring technologies like blog pings; and it also performs a complex parsing of the data it finds. The end result is a large chunk of “live” blog content, stored in an XML file.
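For illustration only, the general idea of harvesting a feed and recasting its entries as structured XML might look something like the sketch below; this is not the CyberWatcher spider, it assumes the third-party feedparser library, and the element names are invented.

```python
# Minimal sketch of the general idea: read a blog's RSS/Atom feed and recast
# each entry as structured XML. This is NOT the CyberWatcher spider; it assumes
# the third-party feedparser library, and the element names are invented.
import xml.etree.ElementTree as ET
import feedparser

feed = feedparser.parse("http://example.org/blog/feed/")   # hypothetical feed URL

blog = ET.Element("blog", {"source": feed.feed.get("link", "")})
for entry in feed.entries:
    post = ET.SubElement(blog, "post")
    ET.SubElement(post, "title").text = entry.get("title", "")
    ET.SubElement(post, "url").text = entry.get("link", "")
    ET.SubElement(post, "published").text = entry.get("published", "")
    ET.SubElement(post, "summary").text = entry.get("summary", "")

ET.ElementTree(blog).write("harvested_blog.xml", encoding="utf-8", xml_declaration=True)
```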

Two things about this second strand are of interest. The first is that the spider is already performing a form of migration, or transformation, simply by the action of harvesting the blog. The second is that it’s migrating to XML, which is something we already know to be a very robust and versatile preservation format, more so even than a non-proprietary tabular format such as CSV. The added value of XML is the possibility of easily storing more complex data structures and multiple values.

If that assumption about the spider is correct, perhaps we need to start thinking of it as a transformation / validation tool. The more familiar digital preservation workflow assumes that migration will probably happen some time after the content has been ingested; what if migration is happening before ingest? We’re already actively considering the use of the preservation metadata standard PREMIS to document our preservation actions. Maybe the first place to use PREMIS is on the spider itself, picking up some technical metadata and logs on how the spider is performing. Indeed, some of the D4.1 user requirements refer to this: DR6 ‘Metadata for captured Contents’ and DR17 ‘Metadata for Blogs’.
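To illustrate what that might look like, here is a minimal sketch of a simplified, PREMIS-style event record the spider could emit for each harvest; the field values are hypothetical and the structure is deliberately not schema-complete PREMIS.

```python
# Minimal sketch: a simplified, PREMIS-style event record the spider could emit
# for each harvest/parse action. Field values are hypothetical and the record is
# deliberately not schema-complete.
import json
import uuid
from datetime import datetime, timezone

def spider_event(source_url: str, outcome: str) -> dict:
    return {
        "eventIdentifier": {"type": "UUID", "value": str(uuid.uuid4())},
        "eventType": "capture",                       # the harvest/parse action
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventDetail": "Blog post harvested via RSS and parsed to XML",
        "eventOutcome": outcome,
        "linkingObjectIdentifier": {"type": "URI", "value": source_url},
    }

print(json.dumps(spider_event("http://example.org/blog/2012/08/a-post", "success"), indent=2))
```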

We anticipate that the submitted XML is going to be further transformed in the Invenio repository via its databases, and that various metadata additions and modifications will turn it from a Submission Information Package into an Archival Information Package and a Dissemination Information Package. As far as I can see, though, the XML format remains in use throughout these processes. It feels as though the BlogForever workflow could have a credible preservation process hard-wired into it, and that (apart from making Archival Information Packages, backing up, and keeping the databases free from corruption) very little is needed from us in the way of migration interventions.

It also feels as though it would be much easier to test this methodology; the focus of the testing becomes the spider>XML>repository>database workflow, rather than a question of juggling multiple strategies and testing them against file formats and/or significant properties. Of course, migration would still need to apply to the original native file formats we have captured, and this would probably need to be part of our preservation strategy. But it’s the XML renditions which most users of BlogForever will be experiencing.

Enhancing Linnean Online: The AIDA metrics

From the Enhancing Linnean Online project blog

So far we’ve mentioned Beagrie’s metrics for measuring improvements to the management of academic research data, and the Ithaka metrics for measuring improvements to the delivery of content, particularly with regard to the operations of an organisation’s business model.

A third possibility is making use of UoL’s AIDA toolkit, a desk-based assessment method which has gone through many iterations and possible applications. Over time, we’ve shown how it could be used for digital assets, records management, and even research data (although admittedly it has never been used in anger in those situations). AIDA doesn’t attempt to measure the assets themselves; instead it measures the capability of the institution (or the owning organisation) to preserve its own digital resources.

In July 2011 we produced a detailed reworking of AIDA that could be used specifically for research data. This was part of the JISC-funded IDMP project, and the intention was that AIDA could feed into the DCC’s online assessment tool, CARDIO. The reworking was greatly assisted by the expertise of numerous external consultants, recruited from a wide range of international locations and skillsets. They fine-tuned the wording of the AIDA assessment statements to make it into a benchmarking tool with great potential.

AIDA is predicated on the notion of “continuous improvement”, and expresses its benchmarking with an adapted version of the “Five Stages” model originally developed at Cornell University by Anne Kenney and Nancy McGovern. It also uses their “Three Legs” framework to ensure that the three mainstays of digital preservation (i.e. Organisation, Technology and Resources) are properly investigated.

We think there may be some scope for applying AIDA to JISC ELO, mainly as an analysis tool or knowledge base for measuring the results of responses to questionnaires and surveys. It could assess broadly whether the Linnean Online service finds itself at a Stage Two or Stage Three. We could subsequently measure whether the enhancements, once implemented, have moved the service forward to a Stage Four or Stage Five.

This could be done with a little tweaking of the wording of the current iteration of AIDA, and through selective / partial application of its benchmarks. We think it would be a good fit for the ELO project strands which discuss Metadata, Licensing, and Preservation Policy – all of which are expressed in the Organisation leg of AIDA. The Resources leg of AIDA could be tweaked to measure improvements in the area of ELO’s Revenue Generation. One of the most salient features of AIDA is its flexibility.

Versions of the adapted AIDA toolkit can be found via the project blog, although the improved CARDIO version has not been published as yet.

Enhancing Linnean Online: The Ithaka metrics

From the Enhancing Linnean Online project blog

In our last post, we considered whether the Beagrie metrics are going to work for this project. This time, we’ll look at another JISC-related initiative, the Ithaka study on sustainability (Sustaining Digital Resources: An On-the-Ground View of Projects Today) from July 2009.

Beagrie’s metrics were of course directed at the HFE sector, and the main beneficiaries in his report are universities, researchers, staff, and students who benefit from improved scholarly access. Conversely, Ithaka takes the view that an organisation really needs a business model to underpin long-term access to its digital content and to manage the preservation of that content. They undertook 12 case studies examining such business models in various European organisations, and identified numerous key factors for success and sustainability.

The subjects of these case studies were not commercially-oriented businesses as such, but Ithaka takes a no-nonsense view of what “sustainability” means in a digital context: whatever you do, you need to cover your operating costs. One of the report’s chief interests, then, is discovering what your revenue-generating strategy is going to be. They identify metrics for success, but it’s clear that what they mean by “success” is the financial success of the resource and revenue model, and that is what is being measured.

The metrics proposed by Ithaka are very practical and tend to deal with tangibles. Broadly I see three themes to the metrics:

1. Quantitative metrics which apply to the content

  • Amount of content made available
  • Usage statistics for the website

2. Quantitative metrics which apply to the revenue model

  • Amount of budget expected to be generated by revenue strategies
  • Numbers of subscriptions raised, against the costs of generating them
  • Numbers of sales made, against the costs of generating them

3. Intangible metrics

  • Proving the value and effectiveness of a project to the host institution
  • Proving the value and effectiveness of a project to stakeholders and beneficiaries

How would these work for our project? My sense is that (1) ought to be easy enough to establish, particularly if we apply our before-and-after method here and compile some benchmark statistics (e.g. figures from the Linnean weblogs) at an early stage, which can be revisited in a few years.

As to (2), revenue generation is something we have explicitly outlined in our bid. Since the project is predicated on repository enhancements, we intend to develop these enhancements in line with existing revenue models proposed to us by the Linnean staff. Our thinking at this time is that the digitised content can be turned into an income stream by imaginative and innovative strategies for reuse of images and other digital content, which might involve licensing. As yet we haven’t discussed plans for a subscription service, or direct sales of content.

(3) is an interesting one. The immediate metric we’re thinking of applying here is how the enhanced repository features will improve the user experience. I’m also expecting that when we interview stakeholders in more detail, they can provide more wide-ranging views about “value and effectiveness”, connected with their research and scholarship. These intangibles amount to much more than just ease of navigation or speed of download, and they ought to be translatable into something of value which we can measure.

But maybe we can also look again at the host institution, and find examples of organisational goals and policies at Linnean that we could align with the enhancement programme, with a view to indicating how each enhancement can assist with a specific goal of the organisation. As Ithaka found, however, this approach works better with a large memory institution like TNA, which happens to work under a civil service structure with key performance indicators and very strong institutional targets.

All in all, the Ithaka model looks like it can work well for this project, provided we can promote the idea of a “business model” to Linnean without sounding like we’re planning some form of corporate takeover!