PASIG17 reflections: archival storage and detachable AIPs

A belated reflection on PASIG17 from me today: this time I wish to consider the questions about “storage” which emerged over the three days of the conference.

One interesting example was the Queensland Brain Institute case study, where they are serving copies of brain scan material to researchers who need it. This is bound to be of interest to those managing research data in the UK, not least because the scenario described by Christian Vanden Balck of Oracle involved such large datasets and big files – 180 TB ingested per day is just one scary statistic we heard. The tiered storage approach at Queensland was devised specifically to preserve and deliver this content; I wouldn’t have a clue how to explain it in detail to anyone, let alone how to build it, but I think it involves a judicious configuration and combination of “data movers, disc storage, tape storage and remote tape storage”. The outcomes that interest me are strategic: the right data is served to the right people at the right time, and it’s the use cases and data types that have driven this highly specific storage build. We were also told it’s very cost-effective, so I assume that means data is pretty much served on demand; perhaps this is one of the hallmarks of good archival storage. It’s certainly the opposite of active network storage, where content is made available constantly (and at a very high cost).

Use cases and users have also been at the heart of the LOCKSS distributed storage approach, as Art Pasquinelli of Stanford described in his talk. I like the idea that a University could have its own LOCKSS box to connect to this collaborative enterprise. It was encouraging to learn how this service (active since the 1990s) has expanded, and it’s much more than just a sophisticated shared storage system with multiple copies of content. Recent developments include (1) more content types being admissible than before, not just scholarly papers; (2) improved integration with other systems, such as Samvera (institutional repository software) and Heritrix (web-archiving software), which evidently means that if it’s in a BagIt or WARC wrapper, LOCKSS can ingest it somehow; and (3) better security – the claim is that LOCKSS is somehow “tamper-resistant”. Because of its distributed nature there’s no single point of failure, and because of the continual security checks – the network is “constantly polling” – it is possible for LOCKSS to somehow “repair” data. (By the way, I would love to hear more examples and case studies of what “repairing data” actually involves; I know the NDSA Levels refer to it explicitly as one of the high watermarks of good digital preservation.)
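I don’t know the internals of the LOCKSS polling protocol, but as a naive sketch of what “constantly polling” and “repairing” might mean in practice, I picture something like this: compare fixity values for the same object across replicas, and re-copy from a peer that still holds a good copy. Every name below is my own invention, not LOCKSS code, and the real protocol is far more sophisticated and tamper-resistant than a simple majority vote.

```python
import hashlib
from collections import Counter

# Hypothetical sketch only: each "node" holds its own copy of the same object.
# Real LOCKSS uses a far more sophisticated, tamper-resistant voting protocol.

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def poll_and_repair(copies: dict[str, bytes]) -> dict[str, bytes]:
    """Compare checksums across nodes; overwrite outliers with the majority copy."""
    digests = {node: sha256(data) for node, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    # Pick any node whose copy matches the majority as the repair source.
    source = next(n for n, d in digests.items() if d == majority_digest)
    for node, digest in digests.items():
        if digest != majority_digest:
            print(f"repairing {node} from {source}")
            copies[node] = copies[source]
    return copies

if __name__ == "__main__":
    copies = {
        "node-a": b"minutes-1998.pdf bytes",
        "node-b": b"minutes-1998.pdf bytes",
        "node-c": b"minutes-1998.pdf CORRUPTED",  # bit rot or tampering
    }
    poll_and_repair(copies)
```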

In both these cases, it’s not immediately clear to me whether there’s an Archival Information Package (AIP) involved, or at least an AIP as the OAIS Reference Model would define it; certainly both instances seem more complex and dynamic to me than anything the Reference Model proposes. For a very practical view on AIP storage, there was the impromptu lightning talk from Tim Gollins of National Records of Scotland. Although a self-declared OAIS-sceptic, he was advocating that we need some form of “detachable AIP”: an information package that contains the payload of data, yet is not dependent on the preservation system which created it. This pragmatic line of thought isn’t far removed from Tim’s “Parsimonious Preservation” approach; he often encourages digital archivists to think in five-year cycles, linked to procurement or hardware reviews.

Tim’s expectation is that the digital collection must outlive the system in which it’s stored. The metaphor he offered goes back to a physical building: a National Archive can move its paper archives to another building, and the indexes and catalogues will continue to work, allowing the service to continue. Can we say the same about our AIPs? Will they work in another system? Or are they dependent on metadata packages that are inextricably linked to the preservation system that authored them? And what about other services, such as the preservation database that indexes this metadata?

With my “naive user” hat on, I suppose it ought to be possible to devise a “standard” wrapper whose chief identifier is the handle – the UUID – which ought to work anywhere. Likewise, if we’re all working to standard metadata schemas, and to standard formats (such as XML or JSON) for storing that metadata, then why can’t we have detachable AIPs? Tim pointed out that among all the vendors proposing preservation systems at PASIG, not one of them agreed on the parameters of such important stages as Ingest, data management, or migration; and by parameters I mean when, how and where these should be done, and which tools should be used.
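To make that thought concrete, here is the kind of minimal, system-agnostic wrapper I have in mind: a plain JSON manifest, keyed on a UUID, listing the payload files with their fixity values and pointing at standard metadata files. This is a naive sketch of my own, with invented file names and layout – not E-ARK, and not any vendor’s format.

```python
import hashlib
import json
import uuid
from pathlib import Path

# Naive sketch of a "detachable AIP": a plain JSON manifest, keyed on a UUID,
# sitting alongside the payload and standard metadata files. Nothing here is
# a real specification; the file names and layout are invented for illustration.

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_manifest(aip_dir: Path) -> dict:
    payload = sorted((aip_dir / "payload").rglob("*"))
    return {
        "aip_id": str(uuid.uuid4()),            # the handle that should work anywhere
        "metadata": "metadata/dc.xml",          # e.g. Dublin Core descriptive metadata
        "preservation_metadata": "metadata/premis.xml",
        "payload": [
            {"path": str(p.relative_to(aip_dir)), "sha256": checksum(p)}
            for p in payload if p.is_file()
        ],
    }

if __name__ == "__main__":
    aip_dir = Path("example-aip")               # hypothetical directory layout
    manifest = build_manifest(aip_dir)
    (aip_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
```

The point being that any future system which understands JSON and the referenced schemas could pick such a package up after a migration, without any dependence on the database that created it.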

The work of the E-ARK project, which has proposed and designed standardised information packages and rules to go with them, may be very germane in this case. I suppose it’s also something we will want to consider when framing our requirements before working with any vendor.

Selection and Appraisal in the OAIS Model

Recently I attended the ARA Conference. On 31 August 2016 we heard three very useful presentations in the digital preservation strand, from Matthew Addis of Arkivum, Sarah Higgins and Sally McInnes from Wales, and Mike Quinn from Preservica. I recall asking a question about the OAIS Model, prompted by another question from a fellow archivist in the audience: can the Model accommodate the skills of selection and appraisal? My worry is that it cannot, and that the Model tends to present an over-simplified view in which the Submission Information Package (SIP) arrives in a “perfect state”, all ready to preserve, so that the process of transforming it into an Archival Information Package (AIP) can begin. Any archivist or records manager who’s ever handled a deposit or transfer of records will tell you that real life isn’t like that. As a result, the OAIS Model alienates the archivist.

I’m aware of those in our community who have advocated a stronger pre-ingest stage in OAIS. Some call it the “long tail” before Ingest. I believe there is a body of work underway to formalise the process as part of the standard: the Producer-Archive Interface Specification. And I’m aware of those contributions to the DPC OAIS wiki where suggestions are made for how to instigate it, and even automate it to some degree.

But that’s not quite what’s worrying me. Let’s get back to the basics of what we mean by Selection and Appraisal. These are core archival skills, and I think they could have tremendous value in the field of digital preservation.

The Record / Archive Series

When I worked as an archivist at the General Synod with paper records and paper archives, we would often appraise and select on a Series basis. What that means to me is that we could assess the value of the content in a contextual framework, based on other records which we knew were being created, or other archival series which we had already selected and kept in the archive. The collections strategy would be based on this approach, looking for a Series in the context of provenance. For instance, the originating body might be the Board for Social Responsibility (BSR); the record series could be “Minute Books”. We would always know to accept deposits of BSR Minutes, because we could trust these as being accurate records of the Board’s work. Likewise, if the BSR collected copies of another Board’s Minutes and Documents (e.g. The Central Board of Finance), we could apply a rule that excluded that series from accessioning, on the grounds that BSR were only receiving “copies for information”.
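To show what I mean by appraising at the Series level rather than document by document, here is how those two BSR decisions might be written down as machine-readable rules. The structure is entirely my own illustration – I’m not aware of any system that works exactly like this.

```python
# Illustrative only: encoding series-level appraisal decisions as simple rules,
# keyed on provenance (originating body) and series title, never on individual files.

APPRAISAL_RULES = [
    {"provenance": "Board for Social Responsibility", "series": "Minute Books",
     "decision": "accession", "reason": "authoritative record of the Board's work"},
    {"provenance": "Board for Social Responsibility", "series": "CBF Minutes and Documents",
     "decision": "do not accession", "reason": "copies received for information only"},
]

def appraise(provenance: str, series: str) -> str:
    """Return the decision for a whole series; individual documents are never inspected."""
    for rule in APPRAISAL_RULES:
        if rule["provenance"] == provenance and rule["series"] == series:
            return rule["decision"]
    return "refer to archivist"   # no rule yet: a human appraisal decision is needed

print(appraise("Board for Social Responsibility", "Minute Books"))  # -> accession
```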

This process I’m describing is second nature to any archives or records management professional. An understanding of context, provenance, record series: all of these things help us identify the potential value of content. Indeed, a Series model is the foundation for all Archival arrangement, and is the cornerstone of our profession. It’s extremely efficient; it saves you from having to examine every single document.

Appraisal in OAIS

I wonder how Series are expressed in the OAIS Model. I often think the Model is weighted in favour of the individual digital object rather than the record series. To put it another way, a Submission Information Package is not an ideal unit on which to carry out an appraisal. At this point you could tell me “here’s 100 related SIPs, there’s your record series”, or “we’re putting all the PDFs of our Minutes into this single SIP”. But I would still worry. Through the basic action of ingesting a SIP, we start a process in which all subsequent preservation actions centre on the individual digital object – checksums, file format identification, file format characterisation, technical metadata extraction, and preservation metadata. And of course the temptation is strong to automate these AIP-building actions, which has led us into building scripts that are entirely focused on a single characteristic – most commonly, the file format.
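To illustrate the atomisation I’m describing, here is the shape of that object-centred ingest loop as I understand it – a caricature under my own assumptions, not any particular vendor’s workflow. Every action hangs off the individual file, and the record series appears nowhere.

```python
import hashlib
from pathlib import Path

# A caricature of the object-centred ingest loop: every preservation action is
# keyed on the individual file. Note that nothing here records which archival
# series the file belongs to -- which is exactly my complaint.

def identify_format(path: Path) -> str:
    # A real workflow would call a format identification tool such as DROID or
    # Siegfried here; the file extension stands in for that step in this sketch.
    return path.suffix.lower() or "unknown"

def process_sip(sip_dir: Path) -> list[dict]:
    records = []
    for obj in sorted(sip_dir.rglob("*")):
        if not obj.is_file():
            continue
        records.append({
            "path": str(obj),
            "sha256": hashlib.sha256(obj.read_bytes()).hexdigest(),  # fixity
            "format": identify_format(obj),                          # format identification
            "size_bytes": obj.stat().st_size,                        # technical metadata
        })
    return records
```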

Where’s the record / archival series in all this? It’s difficult to make it out. Maybe it gets reinstated or reconstructed at the point of cataloguing. Even so, it’s not hard to see why archivists can feel alienated by this view of what constitutes digital preservation. The integrity and contextual meaning of a collection is being overlooked, in favour of this atomised digital-object view. OAIS, if strictly interpreted, could bypass the Series altogether in favour of an assembly line workflow that simply processes one digital object after another.

I believe we need to rediscover the value of Appraisal and Selection; I call on all archivists to come forward and re-assert their importance in the digital realm.

In the meantime, some questions: Can anyone show me a way that Appraisal and Selection can truly be incorporated in an OAIS Model workflow? Is there room for considering a new “Series Information Package”, or something similar? Am I over-stressing the atomisation of OAIS?