Enhancing Linnean Online: The Ithaka metrics

From Enhancing Linnean Online project blog

In our last post, we considered whether the Beagrie metrics are going to work for this project. This time, we’ll look at another JISC-related initiative, the Ithaka study on sustainability (Sustaining Digital Resources: An On-the-Ground View of Projects Today) from July 2009.

Beagrie’s metrics were of course directed at the HFE sector, and the main beneficiaries in his report are Universities, researchers, staff and students, through improved scholarly access. By contrast, Ithaka takes the view that an organisation really needs a business model to underpin long-term access to its digital content and to manage the preservation of that content. They undertook 12 case studies examining such business models in various European organisations, and identified numerous key factors for success and sustainability.

The subjects of these case studies were not commercially oriented businesses as such, but Ithaka takes a no-nonsense view of what “sustainability” means in a digital context: whatever you do, you need to cover your operating costs. One of the report’s chief interests, then, is discovering what your revenue-generating strategy is going to be. They identify metrics for success, but it’s clear that what they mean by “success” is the financial success of the resource and its revenue model, and that is what is being measured.

The metrics proposed by Ithaka are very practical and tend to deal with tangibles. Broadly, I see three themes in the metrics:

1. Quantitative metrics which apply to the content

  • Amount of content made available
  • Usage statistics for the website

2. Quantitative metrics which apply to the revenue model

  • Amount of budget expected to be generated by revenue strategies
  • Numbers of subscriptions raised, against the costs of generating them
  • Numbers of sales made, against the costs of generating them

3. Intangible metrics

  • Proving the value and effectiveness of a project to the host institution
  • Proving the value and effectiveness of a project to stakeholders and beneficiaries

How would these work for our project? My sense is that (1) ought to be easy enough to establish, particularly if we apply our before-and-after method here and compile some benchmark statistics (e.g. figures from the Linnean weblogs) at an early stage, which can be revisited in a few years.
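
As a rough illustration of the kind of benchmark we have in mind, a short script could tally monthly page requests from the web server logs. This is only a sketch: the log file name and the assumption of an Apache-style combined log format are mine, not anything confirmed by the Linnean systems.

```python
import re
from collections import Counter
from datetime import datetime

# Sketch only: hypothetical log file, assumed to be in Apache "combined" format.
LOG_FILE = "linnean-access.log"

# Pull the [dd/Mon/yyyy:hh:mm:ss ...] timestamp out of each request line.
TIMESTAMP = re.compile(r"\[(\d{2})/(\w{3})/(\d{4}):")

monthly_hits = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = TIMESTAMP.search(line)
        if match:
            day, month_name, year = match.groups()
            month = datetime.strptime(f"{year}-{month_name}", "%Y-%b").strftime("%Y-%m")
            monthly_hits[month] += 1

# A simple month-by-month baseline we could revisit after the enhancements.
for month, hits in sorted(monthly_hits.items()):
    print(f"{month}\t{hits}")
```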

As to (2), revenue generation is something we have explicitly outlined in our bid. Since the project is predicated on repository enhancements, we intend to develop these enhancements in line with existing revenue models proposed to us by the Linnean staff. Our thinking at this time is that the digitised content can be turned into an income stream by imaginative and innovative strategies for reuse of images and other digital content, which might involve licensing. As yet we haven’t discussed plans for a subscription service, or direct sales of content.

(3) is an interesting one. The immediate metric we’re thinking of applying here is how the enhanced repository features will improve the user experience. I’m also expecting that when we interview stakeholders in more detail, they can provide more wide-ranging views about “value and effectiveness”, connected with their research and scholarship. These intangibles amount to much more than just ease of navigation or speed of download, and they ought to be translatable into something of value which we can measure.

But maybe we can also look again at the host institution, and find examples of organisational goals and policies at Linnean that we could align with the enhancement programme, with a view to indicating how each enhancement can assist with a specific goal of the organisation. As Ithaka found, however, this approach works better with a large memory institution like TNA, which happens to work under a civil service structure with key performance indicators and very strong institutional targets.

In all, the Ithaka model looks as though it could work well for this project, provided we can promote the idea of a “business model” to Linnean without sounding as if we’re planning some form of corporate takeover!

Enhancing Linnean Online: Beagrie’s Metrics

From Enhancing Linnean Online project blog

We’re aiming to deliver a set of enhancements to Linnean, but how will we know if they worked? One of the aims of the ELO project is to measure the results of the programme of enhancements in terms of tangible benefits to Linnean and its stakeholders. We’re thinking about a framework that will enable us to measure the results of this before-and-after process.

Our thinking at the moment is that we could adapt and make use of the Beagrie metrics published in Benefits from the infrastructure projects in the JISC managing research data programme, which were devised for measuring the value of research data to an HEI.

The Institutions that Beagrie worked with were asked how their lives would improve if their research data were better managed. Data management planning is a wide-ranging process that includes preservation as one of its outcomes. Those consulted were very good at coming up with lists of potential benefits, but found it slightly harder to come up with reliable means of measuring those benefits.

Even so, the report came up with a very credible list. It was organised under the names of the stakeholders who would benefit the most. A little tinkering with that table allows us to put Linnean at the top of the list as the main beneficiary. We also know Linnean has researchers, and that they are concerned with scholarly access. This suggests a framework like the one below might work for us.

Benefits Metrics for Linnean

  • New research grant income
  • Number of research dataset publications generated
  • Number of research papers
  • Improvements over time in benchmark results
  • Cost savings/efficiencies
  • Re-use of infrastructure in new projects

Benefits Metrics for researchers

  • Increase in grant income/success rates
  • Increased visibility of research through data citation
  • Average time saved
  • Percentage improvement in range/effectiveness of research tool/software

Benefits Metrics for Scholarly Communication and Access

  • Number of citations to datasets in research articles
  • Number of citations to specific methods for research
  • Percentage increase in user communities
  • Number of service level agreements for nationally important datasets

The Institutions in the report go on to give specific instances of how these metrics apply in their case. For instance, for the “Average Time Saved” metric the Sudamih project reported:

“In an attempt to measure benefit 1 (time saved by researchers by locating and retrieving relevant research notes and information more rapidly) Sudamih asked course attendees to estimate how much of their time spent writing up their research outputs is actually spent looking for notes/files/data that they know they already have and wish to refer to. The average was 18%, although in some instances it was substantially more, especially amongst those who had already spent many years engaged in research (and presumably therefore had more material to sift through). This would indicate that there is at least considerable scope to save time (and improve research efficiency) by offering training that over the long term could improve information management practices.”

However, the report is also clear that any form of enhancements (technical, administrative, cultural) can take some time to bed down before their benefits are even visible, let alone become measurable. “Measuring benefits therefore might be best undertaken over a longer time-scale”, is one possible conclusion. That is a caveat we’ll have to bear in mind, but it doesn’t preclude us devising our own bespoke set of metrics.

Every man his own modified digital object

Today we’ve just completed our Future-Proofing study at ULCC and sent the final report to the JISC Programme Manager, with hopes of a favourable sign-off so that we can publish the results on our blog.

It was a collaboration between Kit Good, the records manager here at UoL, and me. We’re quite pleased with the results. We wanted to see if we could create preservation copies of core business documents that require permanent preservation, but do it using a very simple intervention and with zero overheads. So we worked with a simple toolkit of services and software that can plug into a network drive; we used open source migration and validation tools. Our case study sought to demonstrate the viability of this approach. Along the way we learned a lot about how the Xena digital preservation software operates, and how (combined with Open Office) it does a very credible job of producing bare-bones Archival Information Packages, and putting information into formats with improved long-term prospects.

The project has worked on a small test corpus of common Institutional digital records, performed preservation transformations on them, and conducted a systematic evaluation to ensure that the conversions worked, that the finished documents render correctly, that sufficient metadata has been generated for preservation purposes and can feasibly be extracted and stored in a database, and that the results are satisfactory and fit for purpose.

The results show us that it is possible to build a low-cost, practical preservation solution that addresses immediate preservation problems, makes use of available open source tools, and requires minimal IT support. We think the results of the case study can feasibly be used by other Institutions facing similar difficulties, and scaled up to apply to the preservation of other and more complex digital objects. It will enable non-specialist information professionals to perform certain preservation and information management tasks with a minimum of preservation-specific theoretical knowledge.

Future-Proofing won’t solve your records management problems, but it stands a chance of empowering records managers by allowing them to create preservation-worthy digital objects out of their organisation’s records, without the need for an expensive bespoke solution.

Future-Proofing: A preservation workflow using our toolkit

From the JISC Future Proofing project blog

Kit Good will describe a potential records management workflow using these open source tools; this post describes a potential preservation workflow. Since UoL does not yet have a dedicated digital preservation service, the workflow will have to remain hypothetical, but the point of it is to see whether these tools can succeed in creating preservable objects with sufficient technical metadata to ensure long-term continuity.

Ingest

The records to be preserved would arrive as Submission Information Packages (SIPs). As a first step we would want to ingest selected records into the digital archive repository we haven’t got yet. In our hypothetical scenario, this would probably be done by the records manager (much the same way he transfers records of permanent value to the University archivist). If we had a digital archivist, all we would need is an agreed methodology between that person and the records manager, for example a web-based front end like Dropbox that allows us to move the files from one place to another, together with some metadata and other documentation.

Validation

At a bare minimum, the archive service would want to run the following QA steps on the submitted records:

  1. Fixity
  2. Virus check
  3. Confirm submission formats can be supported in the repository

DROID can perform some of these steps for us with its built-in checksum, and the way it identifies formats by matching them against its signature registry. (This would need to align with a written preservation policy of some sort.) At the moment we lack a virus checker, but it would be quite feasible to use a freely available one such as Avast. This is one step where the DPSP works out nicely, with its built-in virus check and quarantine stage.
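
As a sketch of the fixity step, something like the following would walk a submission folder and write a checksum manifest for every file in it. The folder and manifest names are hypothetical, and the choice of SHA-256 is an assumption rather than an agreed policy.

```python
import hashlib
from pathlib import Path

# Sketch only: folder and manifest names are hypothetical.
SIP_FOLDER = Path("incoming_sip")
MANIFEST = SIP_FOLDER / "manifest-sha256.txt"

def checksum(path, algorithm="sha256", chunk_size=64 * 1024):
    """Return the hex digest of a file, read in chunks to cope with large records."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Write one "digest  relative-path" line per file, in the style of sha256sum.
with open(MANIFEST, "w", encoding="utf-8") as out:
    for record in sorted(SIP_FOLDER.rglob("*")):
        if record.is_file() and record != MANIFEST:
            out.write(f"{checksum(record)}  {record.relative_to(SIP_FOLDER)}\n")
```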

Next, we might want to validate the formats in more detail, since DROID identifies formats but does not validate files against their format specifications. This is where JHOVE comes in, although the limitations of that tool with regard to Office documents have already been noted.

You’ll notice that by this point we are not advocating use of the NZ Metadata Extraction Tool, since its output has been found a bit lacking; but this would be the stage at which to deploy it.

Transformation

The next step is to normalise the records using Xena. This action is as close as we get to migration in our repository, and the actions of Xena have already been described.

These steps create a little “bundle” of objects:

  • The DROID report in CSV format
  • The JHOVE output in XML format
  • The normalised Xena object in .xena format

For our purposes, this bundle represents the archival copy which in OAIS terms is an Archival Information Package (AIP). The DROID and JHOVE files are our technical metadata, while the actual data we want to keep is in the Xena file.
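
To make the “bundle” idea concrete, a minimal sketch of gathering the three outputs into one folder per record might look like this. All of the file names and locations are invented for illustration, since in practice the tools drop their outputs in different places.

```python
import shutil
from pathlib import Path

# Hypothetical locations of the three outputs described above.
outputs = {
    "droid_report.csv": Path("droid_output/droid_report.csv"),  # technical metadata (DROID)
    "jhove_report.xml": Path("jhove_output/jhove_report.xml"),  # technical metadata (JHOVE)
    "record.xena":      Path("xena_output/record.xena"),        # the normalised object itself
}

# One folder per AIP bundle; the folder name would come from an agreed identifier in practice.
aip = Path("aip_bundles/record-0001")
aip.mkdir(parents=True, exist_ok=True)

for name, source in outputs.items():
    shutil.copy2(source, aip / name)  # copy2 keeps the original file timestamps
```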

Move to archive

Step four would be to move this AIP bundle into archival storage, keeping the “original” submissions elsewhere in the managed store.

Access

A fifth step would be to re-render the Xena objects as dissemination copies, as needed when requested by a member of staff who wanted to have access to a preserved record. For this, we would open the Xena object using the Xena viewer and create a Dissemination Information Package (DIP) version by clicking the OpenOffice button to produce a readable document. (As noted previously, we can also make this OO version into a PDF).

This access step creates yet more digital objects. In fact if we look at the test results for the 12 spreadsheets in our test corpus, we have created the following items in our bundle:

  • One DROID file which analysed all 12 spreadsheets – in CSV
  • One JHOVE file which analysed all 12 spreadsheets – in XML
  • 12 normalised Xena objects – in XENA
  • A test Open Office rendition of one of the spreadsheets in ODS
  • A PDF rendering of that OO version

In a separate folder, there are 12 MET files in XML – one for each spreadsheet.

In real life we certainly wouldn’t want to keep all of these objects in the same place; the archival files must be stored separately from the dissemination versions.

Checking

Many of the above stages present opportunities for a checksum script to be run, to ensure that the objects have not been corrupted in the processes of transformation and moving them in and out of the archival store. If we wanted to go further with checking, we would re-validate the files using JHOVE.
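
As a hedged sketch of what that checking script might look like, the following re-reads a sha256sum-style manifest written at ingest and flags any object whose digest has changed. The file locations are, again, assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Sketch only: assumes a "digest  relative-path" manifest was written at ingest.
ARCHIVE = Path("aip_bundles/record-0001")
MANIFEST = ARCHIVE / "manifest-sha256.txt"

def sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

with open(MANIFEST, encoding="utf-8") as manifest:
    for line in manifest:
        expected, name = line.rstrip("\n").split("  ", 1)
        status = "OK" if sha256(ARCHIVE / name) == expected else "MISMATCH"
        print(f"{status}\t{name}")
```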

Sounds simple, doesn’t it? But there are quite a few preservation gaps with this bare-bones toolkit.

Gap #1

The technical metadata outputs from DROID and JHOVE are all “loose”, not tightly fused with the object they actually relate to. To put it another way, the processes described above have involved running several separate actions on the object that is the target for preservation. It has also created several separate outputs, which have landed in several different destinations on my PC.

We could manually move everything into a “bundle” as suggested above, but this is extra work and feels a bit insecure and unreliable. For real success, we need a method (a database, perhaps) that manages all the loose bits and pieces under a single UID, or other reliable retrieval method. Xena does create such a UID for each of its objects – it’s visible there in the dc:identifier tag. So we may stand a chance of using that element in future iterations of this work.
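
A minimal sketch of what that database of loose bits and pieces might look like, using SQLite and the dc:identifier pulled from the Xena wrapper. The Dublin Core namespace is the standard one, but the exact position of dc:identifier in the wrapper, and the file names used here, are assumptions that would need checking against real Xena output.

```python
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core namespace

def xena_identifier(xena_file):
    """Return the first dc:identifier found in a .xena wrapper (structure assumed)."""
    root = ET.parse(xena_file).getroot()
    element = root.find(f".//{{{DC_NS}}}identifier")
    return element.text.strip() if element is not None and element.text else None

db = sqlite3.connect("preservation_index.db")
db.execute("CREATE TABLE IF NOT EXISTS aip_parts (uid TEXT, role TEXT, path TEXT)")

# Hypothetical bundle: one Xena object plus its loose DROID and JHOVE outputs.
bundle = Path("aip_bundles/record-0001")
uid = xena_identifier(bundle / "record.xena")
for role, name in [("xena", "record.xena"),
                   ("droid", "droid_report.csv"),
                   ("jhove", "jhove_report.xml")]:
    db.execute("INSERT INTO aip_parts VALUES (?, ?, ?)", (uid, role, str(bundle / name)))
db.commit()
```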

Another thing may seem trivial, but JHOVE, DROID and Xena can do batch processing, and MET cannot. This results in a mismatch of outputs, and the outputs are created in different formats.

There is also some duplication among the detail of the technical metadata that has been extracted from the objects.

Gap #2

Ideally we’d like the technical metadata to be embedded within a single wrapper, along with the object itself. The Xena wrapper seems the most obvious place for this. I lack the technical ability to understand how to do it, though. What I would like is some form of XML authoring tool that enables me to write the DROID and JHOVE output directly into the Xena wrapper.
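
I haven’t worked out the proper way to do this, but as a rough sketch of the idea, a few lines of Python could append the JHOVE report as an extra element inside the Xena XML. This is purely illustrative: the file names and the container element are invented, and writing a foreign element into the wrapper may well break validation against the NAA schemas.

```python
import xml.etree.ElementTree as ET

# Sketch only: file names are hypothetical, and adding a foreign element to the
# Xena wrapper may invalidate it against the NAA schemas; untested.
wrapper = ET.parse("record.xena")
root = wrapper.getroot()

jhove_report = ET.parse("jhove_report.xml").getroot()

# Wrap the JHOVE output in a clearly labelled container element of our own devising.
container = ET.SubElement(root, "localTechnicalMetadata", {"source": "JHOVE"})
container.append(jhove_report)

wrapper.write("record-with-jhove.xena", encoding="utf-8", xml_declaration=True)
```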

Gap #3

All the steps described above are manual. Obviously if we were going to do this on a larger scale we would want to automate the actions. This is not surprising, and we knew it would be the case before we embarked on the project, but it’s good to see the extent to which our workflow remains non-joined-up.

Gap #4

Likewise, our audit trail for preservation actions is a bit distributed. Both DROID and JHOVE give us a ‘Last Modified’ date for each object processed, and Xena embeds naa:last-modified within the XML output for each object, but ideally these dates ought to be retrievable by the preservation metadata database and presented as a kind of linear chain of events. We’d also like to have a field identifying what the process was that triggered the ‘Last Modified’ date.
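
A small sketch of the kind of “linear chain of events” we mean: collect a (date, object, process) triple for each action and sort them into one timeline. The entries below are invented placeholders; in reality the dates would be harvested from the DROID report, the JHOVE report and the naa:last-modified element in the wrapper.

```python
from datetime import datetime

# Hypothetical audit events; real values would come from DROID, JHOVE and Xena output.
events = [
    {"object": "record-0001", "process": "DROID identification", "date": "2011-02-14T10:05:00"},
    {"object": "record-0001", "process": "JHOVE validation",     "date": "2011-02-14T10:07:30"},
    {"object": "record-0001", "process": "Xena normalisation",   "date": "2011-02-14T10:12:10"},
]

# Present the audit trail oldest-first, with the triggering process alongside each date.
for event in sorted(events, key=lambda e: datetime.fromisoformat(e["date"])):
    print(f'{event["date"]}  {event["object"]}  {event["process"]}')
```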

Gap #5

How do we manage the descriptive metadata in this process? In order to deliver a DIP to our consumer, they would have to know the record exists, and would want to know something useful about it to enable them to retrieve it. We have confirmed that descriptive metadata survives the Xena process, but how can we expose it to our users?

What we’re talking about here is a searchable front end to the archival store which we haven’t yet built. Kit is proposing a structured file store for his records, so maybe we need to expand on this approach and think about ways of delivering a souped-up archival catalogue for these assets.

Future-Proofing: Significant Properties in Office documents

From the JISC Future Proofing project blog

Significant properties are properties of digital objects or files which are considered to be essential for their interpretation, rendering or meaningful access. One of the aims of digital preservation is to make sure we don’t lose or corrupt these properties through our preservation actions, especially migration. For this particular project, we want to be assured that a Xena normalisation has not affected the significant properties of our electronic records.

Much has been written on this complex subject. Significant properties are to do with preserving the “performance” and continued behaviour of an object; they aren’t the same thing as the technical metadata we’ve already assessed, nor the same as descriptive metadata. For more information, see The InSPECT project, which is where we obtained advice about significant property elements for various file formats and digital object types.

However, the guidance we used for significant properties of document types was Document Metadata: Document Technical Metadata for Digital Preservation, Florida Digital Archive and Harvard University Library, 2009.

To summarise it, Florida and Harvard suggest we should be interested in system counts of pages, words, paragraphs, lines and characters in the text; use of embedded tables and graphics; the language of the document; the use of embedded fonts; and any special features in the document.
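
Since the Open Office files produced along the way are just zipped XML, many of these counts can also be read programmatically rather than through the File, Properties dialogue. Below is a sketch that pulls the meta:document-statistic attributes out of an ODF file’s meta.xml; the file name is hypothetical, and I’m assuming the standard ODF metadata namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"  # standard ODF metadata namespace

# Sketch only: file name is hypothetical.
ODF_FILE = "normalised-document.odt"

with zipfile.ZipFile(ODF_FILE) as odf:
    meta_root = ET.fromstring(odf.read("meta.xml"))

# meta:document-statistic carries the page, paragraph, word and character counts.
stats = meta_root.find(f".//{{{META_NS}}}document-statistic")
if stats is not None:
    for attribute, value in stats.attrib.items():
        # Attribute names arrive namespace-qualified, e.g. {...}word-count
        print(attribute.split("}")[-1], "=", value)
```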

When normalising MS Office documents to Open Office, the Xena process produces a block of code within the AIP wrapper. I assume the significant properties of the documents are somewhere in that block of code. But we can’t actually view them using the Xena viewer.

We can however view the properties if we look at the end of the chain of digital preservation, and go to the DIP. For documents, the Open Office rendering of a Xena AIP has already been shown to have retained many of the significant properties of the original MS Word file (see previous post). The properties are visible in Open Office, via File, Properties, Statistics:

The language property of the document is visible in Tools, Options, Language Settings (see image below). I’m less certain about the meaning or authenticity of this, and wonder if Open Office is simply restating the default language settings of my own installation of Open Office; whereas what we want is the embedded language property of the original document.

The other area where Open Office fails to satisfy us is embedded fonts. However, these features are more usually associated with PDF files, which depend for their portability on embedding fonts as part of their conversion process.

For our project, what we could do is take the Open Office normalisation and use the handy Open Office feature, a magic button which can turn any of its products into a PDF.

We could use the taskbar button to do this, but I’d rather use File, Export as PDF. (See screenshot below.) This gives me the option to select PDF/A-1a, an archival PDF format; and lossless compression for any images in the document. Both those options will give me more confidence in producing an object that is more authentic and can be preserved long-term.

This experiment yields a PDF with the following significant properties about the embedded fonts:

These are the Document Property fields which Adobe supports.

The above demonstrates how, in the chain of transformation from MS Word > Xena AIP > Open Office equivalent > Exported PDF, these significant properties may not be viewable all in one place, but they do survive the normalisation process.

Future-Proofing: DIPs from a Xena object – descriptive metadata

From the JISC Future Proofing project blog

In OAIS terms, the Xena digital object is our Archival Information Package (AIP). This is the object that would be stored and preserved in UoL’s archival storage system (if we had one).

For the purposes of this project, the records manager needs to be assured we can render and deliver a readable, authentic version of the record from the Xena object. In OAIS terms, this could be seen as a Dissemination Information Package (DIP) derived from an AIP. Among other things, the DIP package ought to include sufficient descriptive metadata.

We’ve defined a minimal set of what we would need for records management purposes (names, dates for authenticity) and for information retrieval, and criteria for assessing overall success of the transformations – such as legibility, presentation, look and feel, and basic functionality. See our previous post for the documentation of how we arrived at our criteria.

For most of the examples below we are looking at standard Xena normalisations, which can produce a DIP by the following methods:

  • Export the AIP to nearest Open Office equivalent
  • Export the AIP to native format
  • Export the AIP to a format of our choice (an option we haven’t tried as yet)

Emails are slightly more complex – see below for more detail, and previous post on emails.

Documents, spreadsheets and powerpoints

For these MS Office documents, Xena normalises by converting them to their nearest Open Office equivalent. We can view this Open Office version via the Xena Viewer simply by clicking on Show in OpenOffice.org.

In the case of a sample docx, this shows us the document with its original formatting intact. In Open Office, we can now look at File, Properties, and confirm the following descriptive metadata are intact (some of these we will revisit when we look at significant properties):

  • Title
  • Subject
  • Keywords
  • Comments
  • Company Name
  • Number of pages
  • Number of tables
  • Number of graphics
  • Number of OLE objects
  • Number of paragraphs
  • Number of words
  • Number of characters
  • Number of lines
  • Size
  • Date of creation / created by who
  • Date of modification / modified by who

One missing property is a discrete ‘Author’ field in the OO version.

PDFs

Xena Viewer has its own Adobe-like viewer for PDFs. When we look at the Document Properties for a normalised PDF, we get the following descriptive metadata:

  • File size
  • Page count
  • Title
  • Author
  • Creator (i.e. the application that created the original document)
  • Producer (i.e. the software that generated the PDF)
  • Date of creation
  • Date of modification

Images

We may not need much descriptive metadata for individual images. We looked at the converted images using a free EXIF viewer tool and the date values (creation and modification) are intact in the normalised object.
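
For anyone wanting to script that check rather than rely on a viewer, a sketch along these lines would do it with the Pillow imaging library (a reasonably recent version). The file name is hypothetical, and which date tags are present will vary by source image and by what survives normalisation.

```python
from PIL import Image, ExifTags  # Pillow

# Sketch only: file name is hypothetical.
IMAGE_FILE = "normalised-photo.jpg"

with Image.open(IMAGE_FILE) as img:
    exif = img.getexif()
    tags = dict(exif)
    # Capture dates (DateTimeOriginal etc.) live in the Exif sub-IFD, pointed to by tag 0x8769.
    tags.update(exif.get_ifd(0x8769))

# Print any date-related values (DateTime, DateTimeOriginal, DateTimeDigitized).
for tag_id, value in tags.items():
    name = ExifTags.TAGS.get(tag_id, str(tag_id))
    if "Date" in name:
        print(name, "=", value)
```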

Emails

For emails, the AIP will either be a standard Xena normalised object or a binary Xena object. Broadly, the former can export to an XML version and the latter can export back to its native format (.msg file). We’re always guaranteed to get a minimum of descriptive metadata whichever transformation we enact. This would be:

  • Title
  • Author
  • Recipient
  • Date

Results – progress so far

Below is a PDF of my table of analysis of descriptive metadata (and significant properties).

We also have Kit’s assessment of the transformations, including his comments on whether a reasonable amount of metadata is intact.

Future-Proofing: Emails in Xena

From the JISC Future Proofing project blog

For this post, I just want to look at how emails are rendered as Xena transformations. I should stress that Xena doesn’t expect you to normalise individual .msg files, as we have done, but instead convert entire Outlook PST files using the readpst plugin. Nevertheless we still got some useable results with the 23 message files in our test corpus, by running the standard normalisation action.

In our test corpus, the emails generally worked well. Almost all of them had an image file attached to the message – a way of the writer identifying their parent organisation. Two test emails had attachments (one a PDF, one a spreadsheet). Four of them contained a copy of an e-Newsletter, content which for some reason we can’t open in the Xena viewer. I’ll want to revisit that, and also try out some other emails with different attachment types.

When we open a normalised email in the Xena Viewer, we can see the following as shown in the screenshot below:

  • The text of the email message in the large Package NAA window.
  • Above this, a smaller window with metadata from the message header. (In Xena, this is unique to email as far as we can see, and it’s very useful. Why can’t it do this for properties of other documents?)
  • Below, a window displaying the technical metadata about the rendition.
  • Below the large window on the left, a small window which is like a mini-viewer for all attachments.
  • To the right of this window is technical metadata about each attachment.
  • Right at the bottom, a window showing the technical metadata about the entire package. Even from this view, it’s clear that Xena is managing an email as separate objects, each with their own small wrapper, within a larger wrapper.

The immediate problem we found here was truncation of the message in the large window. This is caused by a character rendering issue, and the content isn’t actually lost, as we’ll see when we look at another view.

In the Raw XML view (above image), we can see the XML namespaces being used for preservation of metadata from the email header (http://preservation.naa.gov.au/email/1.0) and a plaintext namespace (http://preservation.naa.gov.au/plaintext/1.0) for the email text.

If we copy the email text from here into a Notepad file, we can see the problem causing the truncation. Something in the system doesn’t like inverted commas. The tag plaintext:bad_char xml:space="preserve" tells us something is wrong.
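
This is the kind of check that could be scripted rather than done by eye. The sketch below opens the raw XML of a normalised email (the file name is invented) and counts any plaintext:bad_char elements, using the two NAA namespaces mentioned above; it only detects the problem, it doesn’t try to edit the record.

```python
import xml.etree.ElementTree as ET

EMAIL_NS = "http://preservation.naa.gov.au/email/1.0"
PLAINTEXT_NS = "http://preservation.naa.gov.au/plaintext/1.0"

# Sketch only: file name is hypothetical; could equally be pointed at the exported XML rendition.
root = ET.parse("normalised-email.xml").getroot()

# List the elements in the email namespace (exact structure assumed, not confirmed).
for element in root.iter():
    if element.tag.startswith(f"{{{EMAIL_NS}}}"):
        print("email element:", element.tag.split("}")[-1])

# Flag the character problem that causes the truncated display.
bad_chars = root.findall(f".//{{{PLAINTEXT_NS}}}bad_char")
print(f"{len(bad_chars)} plaintext:bad_char element(s) found")
```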

In the XML Tree View (image below) we can also see the email being managed as a Header and Parts.

Exporting from Xena

I tried clicking the Export button for an email object. This does the following:

  • Puts the message into XML. It now opens in a browser and appears reasonably well formatted.
  • Creates an XSL stylesheet – probably this is what formats the message correctly.
  • Detaches the attachment and stores it in an OO equivalent, or in its native format (it might be worth investigating this with more examples and seeing what comes out).

So for a single email with a small image footer and a PDF attachment, I get the following four objects:

  1. Message rendered in XML
  2. XSL stylesheet
  3. One image file
  4. One PDF file

When opened in a browser, the XML message file also contains hyperlinks to its own attachments, which has some potential use for records management purposes (provided these links can be managed and maintained properly).

So far so good. Unfortunately this XML rendition still has the bad_char problem! It would be possible to open this XML file in UltraEdit and do a find and replace action, but editing a file like this undermines the authenticity of the record, and completely goes against the grain for records managers and archivists.

The second option we have for .msg files is to normalise them using Xena’s Binary Normalisation method instead. This creates something that can’t be read at all in the Xena Viewer, but when exported, this binary object converts back into an exact original copy of the email in .msg format, with no truncation or other rendering problems. Attachments are intact too. It creates the same technical metadata, as expected.

We’ve also tried the PST method, which works using the readpst plugin that comes with Xena. This is pretty nifty, and speedy too. The normalisation process produces a Xena digital object for every single email in the PST, plus a PST directory. When exported, we get the XML rendition and stylesheet as before. So far, a few PROs and CONs with the PST method:

  • PRO1 – no truncation of the message.
  • PRO2 – lots of useable metadata.
  • CON1 – file names of the XML renditions are not meaningful. Each email becomes a numbered item in an mbox folder.
  • CON2 – attachments simply don’t work in this process.
  • CON3 – PST files are not really managed records. In fact Kit Good’s initial reaction to this was “If I could ban .pst files I would as I think they are a bit of a nightmare for records management (and data management generally). I could see the value of this in the case of archiving an email account of a senior employee (such as the VC) or eminent academic member of staff but I think it would be impractical for staff, especially as .pst files are often used as a ‘dump’ when inboxes get full.”

Future-Proofing: Xena objects

From the JISC Future Proofing project blog

In this post, I’m just going to look at the normalised Xena object, which is an Archival Information Package (AIP). It’s encoded in a file format whose extension is .xena and which is identified by Windows as a ‘Xena preserved digital object’ type.

We’ve created these for normalisations of each of our principal records types – documents, spreadsheets, PowerPoint, PDFs, images and emails.

When we click on one of these it launches the Xena Viewer application. This gives an initial view of the AIP in the default NAA Package View. This is very useful as it gives a quick visual on how well the conversion / normalisation has worked. (It seems particularly strong on rendering documents and common image formats.)

The AIP is actually quite a complex object, though. It is in fact an XML wrapper which contains the transformed object and the metadata about it. This conforms to the preservation model proposed by the Library of Congress’s Metadata Encoding and Transmission Standard (METS), which suggests that building an XML profile like this is the best way to make a preserved object self-describing over time.

The main components of a Xena AIP are (A) technical metadata about the wrapper and the conversion process, some of which is expressed as Dublin Core-compliant metadata, and some of which conforms to an XML namespace defined by the NAA; (B) a string of code that is the transformed file; and (C) a checksum.
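
To see those three components for ourselves, a short sketch can group the elements of a .xena file by namespace, which makes the split between the Dublin Core metadata, the NAA metadata and the content payload reasonably visible. The file name is hypothetical and the exact element names will differ between Xena versions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sketch only: file name is hypothetical.
root = ET.parse("record.xena").getroot()

by_namespace = defaultdict(set)
for element in root.iter():
    if element.tag.startswith("{"):
        namespace, _, local_name = element.tag[1:].partition("}")
    else:
        namespace, local_name = "(no namespace)", element.tag
    by_namespace[namespace].add(local_name)

# Typically this shows a Dublin Core group, one or more NAA groups, and the
# element that carries the (often very long) encoded content string.
for namespace, names in by_namespace.items():
    print(namespace)
    for name in sorted(names):
        print("   ", name)
```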

The Xena Viewer offers three ways of seeing the package:

1) The default NAA Package view. For documents, this shows the representation of the transformed object in a large window. Underneath that window is a smaller scrolling window which displays the transformation metadata. Underneath this is an even smaller window displaying the package signature checksum.

2) The Raw XML view. In this view we can see the entire package as a stream of XML code. This makes clear the use of XML namespaces and Dublin Core elements for the metadata.

3) The XML Tree View. This makes clear the way the AIP is structured as a package of content and metadata.

The default view changes slightly depending on object type:

  • For documents, the viewer window shows the OO representation in a manner that keeps some of the basic formatting intact. To be precise, it’s a MIME-compliant representation of the OO document.
  • For images, the viewer shows a Base64 rendering of the image file.
  • For PDF documents, the viewer integrates a set of Adobe Reader-like functions – search, zoom, properties, page navigation etc. This seems to be done by a JPedal Library GUI.
  • For Spreadsheets and Powerpoint files, the viewer doesn’t show anything in the big window (but we’ll get to this when we follow the “Show in OpenOffice” option, which will be the subject of another post).
  • For emails, the Xena Viewer has a unique view that not only displays the message content, but also the email header, and any attachments, both in separate windows. We’ll look at this again in a future post about how Xena works with emails.

As an Archival Information Package in OAIS terms, a Xena object is clearly compliant and suitable for preservation purposes. The only problem for our project is getting more of our metadata into the AIP. To put it another way, integrating more metadata that records our digital preservation actions. Ideally, we would like to be able to perform a string of actions (DROID identification of an object, virus checking, checksum etc) and integrate the accumulated metadata into a single AIP. However, this is not in scope of the project.

Future-Proofing: Testing the DPSP

From the JISC Future Proofing project blog

I did a day’s testing of the Digital Preservation Software Platform (DPSP) in December. The process has been enough to persuade me it isn’t quite right for what we intend with our project, but that’s not intended as a criticism of the system, which has many strengths.

The heart of DPSP is performing the normalisation of office documents to their nearest Open Office equivalent, using Xena. Around this, it builds a workflow that enables automated generation of checksums, quarantining of ingested records, virus checking, and deposit of converted records into an archival repository. It also generates a "manifest" file (not far removed from a transfer list), and logs all of the steps in its database.

The workflow is one that clearly suits The National Archives of Australia (NAA) and matches their practice, which involves thorough checking and QA to move objects in and out of quarantine, in and out of a normalisation stage, and finally into the repository. All of these steps require the generation of Unique IDs and folder destinations which probably match an accessioning or citation system at NAA; there’s no workaround for this, and I simply had to invent dummy IDs and folders for my purpose. The steps also oblige a user to log out of the database and log back in so as to perform a different activity; this is required three times.

This process is undoubtedly correct for a National Archives and I would expect nothing less. It just feels a bit too thorough for what we’re trying to demonstrate with the current project: our preservation policy is not yet this advanced, and there aren’t enough different staff members to cover all the functions. To put it another way, we’re trying to find a simple system that would enable one person (the records manager) to perform normalisations, and DPSP could be seen as “overkill”. I seem to recall that PANDORA, the NLA’s web-archiving system, was similarly predicated on a series of workflows where tasks were divided up among several members of staff, with extensive curation and QA.

My second concern is that multiple sets of AIPs are generated by DPSP. Presumably the intention is that the verified AIPs which end up in the repository will be the archive copies, and anything generated in the quarantine and transformation stages can eventually be discarded. However, this is not made explicit in the workflow, nor is the removal of such copies described in the manual.

My third problem is an area which I’ll have to revisit, because I must have missed something. When running the Xena transformations, DPSP creates two folders – one for “binary” and one for “normalised” objects. The distinction here is not clear to me yet. I’m also worried because out of 47 objects transformed, I ended up with only 31 objects in the “normalised” folder.

The gain with using DPSP is that we get checksums and virus checks built into the workflow, and complete audit trails too; so far, with our method of "manual" transformations with Xena, we have none of the above, although we do get checksums if we use DROID. But I found the DPSP workflow a little clunky and time-consuming, and somewhat counter-intuitive in navigating through the stages.