PASIG17 reflections: archival storage and detachable AIPs

A belated reflection on PASIG17 from me today. This time I wish to consider some aspects of “storage” which emerged from the three days of the conference.

One interesting example was the Queensland Brain Institute case study, where they are serving copies of brain scan material to researchers who need it. This is bound to be of interest to those managing research data in the UK, not least because the scenario described by Christian Vanden Balck of Oracle involved such large datasets and big files – 180 TB ingested per day is just one scary statistic we heard. The tiered storage approach at Queensland was devised exclusively to preserve and deliver this content; I wouldn’t have a clue how to explain it in detail to anyone, let alone know how to build it, but I think it involves a judicious configuration and combination of “data movers, disc storage, tape storage and remote tape storage”. The outcomes that interest me are strategic: the right data is served to the right people at the right time, and it’s the use cases, and data types, that have driven this highly specific storage build. We were also told it’s very cost-effective, so I assume that means data is pretty much served on demand; perhaps this is one of the hallmarks of good archival storage. It’s certainly the opposite of active network storage, where content is made available constantly (and at a very high cost).

Use cases and users have also been at the heart of the LOCKSS distributed storage approach, as Art Pasquinelli of Stanford described in his talk. I like the idea that a University could have its own LOCKSS box to connect to this collaborative enterprise. It was encouraging to learn how this service (active since the 1990s) has expanded, and it’s much more than just a sophisticated shared storage system with multiple copies of content. Some of the recent interesting developments include: (1) more content types are admissible than before, not just scholarly papers; (2) improved integration with other systems, such as Samvera (IR software) and Heritrix (web-archiving software), which evidently means that if it’s in a BagIt or WARC wrapper, LOCKSS can ingest it somehow; and (3) better security – the claim is that LOCKSS is somehow “tamper-resistant”. Because of its distributed nature, there’s no single point of failure, and because of the continual security checks – the network is “constantly polling” – it is possible for LOCKSS to somehow “repair” data. (By the way, I would love to hear more examples and case studies of what “repairing data” actually involves; I know the NDSA Levels refer to it explicitly as one of the high watermarks of good digital preservation.)
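For what it’s worth, here is a toy sketch in Python of the general idea of polling and repair among replicas. I should stress that this is emphatically not the LOCKSS protocol itself, which uses a far more sophisticated (and tamper-resistant) voting scheme; the node names and content below are invented purely for illustration.

    import hashlib
    from collections import Counter

    replicas = {
        "node-a": b"original content of the archived object",
        "node-b": b"original content of the archived object",
        "node-c": b"original c0ntent of the archived object",  # a silently corrupted copy
    }

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # "Polling": each node reports the checksum of its copy and the majority wins.
    votes = Counter(digest(data) for data in replicas.values())
    consensus_digest, _ = votes.most_common(1)[0]

    # "Repair": a node whose copy disagrees with the consensus replaces it with
    # a copy fetched from a peer that passed the poll.
    good_copy = next(d for d in replicas.values() if digest(d) == consensus_digest)
    for node, data in list(replicas.items()):
        if digest(data) != consensus_digest:
            print(f"{node} failed the poll; repairing from a peer")
            replicas[node] = good_copy

The point is simply that with enough independent copies and regular comparison, a damaged copy can be detected and overwritten without anyone ever noticing a loss.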

In both these cases, it’s not immediately clear to me if there’s an Archival Information Package (AIP) involved, or at least an AIP as the OAIS Reference Model would define it; certainly both instances seem more complex and dynamic to me than the Reference Model proposes. For a very practical view on AIP storage, there was the impromptu lightning talk from Tim Gollins of National Records of Scotland. Although a self-declared OAIS-sceptic, he was advocating that we need some form of “detachable AIP”: an information package that contains the payload of data, yet is not dependent on the preservation system which created it. This pragmatic line of thought probably isn’t too far removed from Tim’s “Parsimonious Preservation” approach; he’s often encouraging digital archivists to think in five-year cycles, linked to procurement or hardware reviews.

Tim’s expectation is that the digital collection must outlive the construction in which it’s stored. The metaphor he came up with in this instance goes back to a physical building. A National Archive can move its paper archives to another building, and the indexes and catalogues will continue to work, allowing the service to continue. Can we say the same about our AIPs? Will they work in another system? Or are they dependent on metadata packages that are inextricably linked to the preservation system that authored them? What about other services, such as the preservation database that indexes this metadata?

With my “naive user” hat on, I suppose it ought to be possible to devise a “standard” wrapper whose chief identifier is the handle, the UUID, which ought to work anywhere. Likewise, if we’re all working to standard metadata schemas, and standard formats (such as XML or JSON) for storing that metadata, then why can’t we have detachable AIPs? Tim pointed out that among all the vendors proposing preservation systems at PASIG, not one of them agreed on the parameters of such important stages as Ingest, data management, or migration; and by parameters I mean when, how, and where each should be done, and which tools should be used.
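To make my “naive user” musings slightly more concrete, here is a minimal sketch of what a detachable, self-describing package might look like: a UUID-named directory holding the payload, some descriptive metadata in plain JSON, and a checksum manifest. This is loosely BagIt-flavoured and entirely my own invention; it is not any vendor’s AIP format, nor E-ARK’s.

    import hashlib
    import json
    import uuid
    from pathlib import Path

    def build_package(payload_files: list[Path], descriptive_metadata: dict,
                      output_root: Path) -> Path:
        """Write a self-contained, UUID-named package under output_root."""
        package_id = str(uuid.uuid4())
        package_dir = output_root / package_id
        data_dir = package_dir / "data"
        data_dir.mkdir(parents=True)

        # Copy the payload in and record a SHA-256 checksum for each file.
        manifest = {}
        for source in payload_files:
            content = source.read_bytes()
            (data_dir / source.name).write_bytes(content)
            manifest[f"data/{source.name}"] = hashlib.sha256(content).hexdigest()

        # Metadata and manifest live in plain JSON, readable without the
        # system that created the package.
        (package_dir / "metadata.json").write_text(
            json.dumps({"identifier": package_id, **descriptive_metadata}, indent=2))
        (package_dir / "manifest-sha256.json").write_text(json.dumps(manifest, indent=2))
        return package_dir

Nothing in such a package depends on the database or software that assembled it, which is roughly what I understand Tim to mean by “detachable”.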

The work of the E-ARK project, which has proposed and designed standardised information packages and rules to go with them, may be very germane in this case. I suppose it’s also something we will want to consider when framing our requirements before working with any vendor.

PASIG17 reflections: Sheridan’s disruptive digital archive

I was very interested to hear John Sheridan, Head of Digital at The National Archives, present on this theme. He is developing new ways of thinking about archival care in relation to digital preservation. As per my previous post, when these phrases occur in the same sentence, you have my attention. He has blogged about the subject this year (for the Digital Preservation Coalition), but clearly his thinking on it is deepening all the time. Below, I reflect on three of the many points he makes concerning what he dubs the “disruptive digital archive”.

The paper metaphor is nearing end of life

Sheridan suggests “the deep-rooted nature of paper-based thinking and its influence on our thinking” needs to change and move on. “The archival catalogue is a 19th century thing, and we’ve taken it as far as we can in the 20th century”.

I love a catalogue, but I still agree; and I would extend this to electronic records management. And here I repeat an idea stated some time ago by Andrew Wilson, currently working on the E-ARK project. We (as a community) applied a paper metaphor when we built file plans for EDRM systems, and this approach didn’t work out too well. It requires a narrow insistence on a single location for each digital object, a location exactly matching the retention needs of that object. Not only is this hard work for everyone who has to do “electronic filing”, it proved not to work in practice. It’s one-dimensional, and it stems from the grand error of the paper metaphor.

I would still argue there’d be a place in digital preservation for sorting and curation, “keeping like with like” in directories, though I wouldn’t insist on micro-managing it; and, as archivists and records managers we need to make more use of two things computers can do for us.

One of them is linked aliases: allowing digital content to sit permanently in one place on the server, most likely in an order that has nothing to do with “original order”, while aliased links, or a METS catalogue, do the work of presenting a view of the content based on a logical sequence or hierarchy, one that the archivist, librarian, and user are happy with. In METS, for instance, this is done with the <FLocat> element, which records where each file actually lives.
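By way of illustration, here is a rough sketch (using Python’s standard XML library) of the sort of thing I mean: the fileSec records where the bytes actually live, via <FLocat>, while a logical structMap presents the arrangement that archivist and user actually care about. The file paths and labels are invented, and this is nowhere near a complete or validated METS document.

    import xml.etree.ElementTree as ET

    METS = "http://www.loc.gov/METS/"
    XLINK = "http://www.w3.org/1999/xlink"
    ET.register_namespace("mets", METS)
    ET.register_namespace("xlink", XLINK)

    mets = ET.Element(f"{{{METS}}}mets")

    # The fileSec records where the bytes actually live on the server...
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="preservation")
    file_el = ET.SubElement(file_grp, f"{{{METS}}}file", ID="file-001",
                            MIMETYPE="application/pdf")
    ET.SubElement(file_el, f"{{{METS}}}FLocat",
                  {f"{{{XLINK}}}href": "file:///storage/pool07/a1b2c3d4.pdf",
                   "LOCTYPE": "URL"})

    # ...while a logical structMap presents the arrangement people want to see.
    struct_map = ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="logical")
    series = ET.SubElement(struct_map, f"{{{METS}}}div", LABEL="Committee minutes",
                           TYPE="series")
    item = ET.SubElement(series, f"{{{METS}}}div", LABEL="Minutes, March 2017",
                         TYPE="item")
    ET.SubElement(item, f"{{{METS}}}fptr", FILEID="file-001")

    print(ET.tostring(mets, encoding="unicode"))

The physical location can change, or mean nothing to a human reader, and the logical view still holds together.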

The second is making use of embedded metadata in Office documents and emails. Though it’s not always possible to get these properties assigned consistently and well, doing so would allow us to view, retrieve, and sort materials in a more three-dimensional manner, which the single directory view doesn’t permit.
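As a small example of what is already there to be harvested, the core properties embedded in a Word document can be read programmatically. The sketch below uses the third-party python-docx package; the filename is, of course, invented.

    # Requires the third-party python-docx package (pip install python-docx).
    from docx import Document

    doc = Document("committee_minutes_2017-03.docx")   # a hypothetical file
    props = doc.core_properties
    print("Title:   ", props.title)
    print("Author:  ", props.author)
    print("Created: ", props.created)
    print("Keywords:", props.keywords)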

I dream of a future where both approaches will apply in ways that allow us these “faceted views” of our content, whether that’s records or digital archives.

Get over the need for tidiness

“We are too keen to retrofit information into some form of order,” said Sheridan. “In fact it is quite chaotic.” That resonates with me as much as it would with my fellow archivists who worked on the National Digital Archive of Datasets, a pioneering preservation service set up by Kevin Ashley and Ruth Vyse for TNA. When we were accessioning and cataloguing a database – yes, we did try to catalogue databases – we had to concede there is really no such thing as an “original order” when it comes to tables in a relational database. We still had to give them ISAD(G)-compliant citations, so some form of arrangement and ordering was required, but this is a limitation of ISAD(G), which I still maintain is far from ideal when it comes to describing born-digital content.

I accept Sheridan’s chaos metaphor. One day we will square this circle: we need some new means of understanding and performing arrangement that is suitable for the “truth” of digital content, and that doesn’t require massive amounts of wasteful effort.

Trust

Sheridan’s broad message was that “we need new forms of trust”. I would say that perhaps we need to embrace both new forms and old forms of trust.

In some circles we have tended to define trust in terms of the checksum – treating trust exclusively as a computer science thing. We want checksums, but they only prove that a digital object has not changed; they’re not an absolute demonstration of its trustworthiness. I think Somaya Langley has recently articulated this very issue on the DPOC blog, though I can’t find the reference just now.
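To spell out exactly what a checksum does and does not tell us, here is a minimal fixity check in Python. A match proves only that the bits are unchanged since the recorded value was taken; it says nothing about whether the object, or the recorded value, deserved our trust in the first place.

    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        """Stream the file through SHA-256 rather than loading it all at once."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    def fixity_ok(path: Path, recorded_digest: str) -> bool:
        # True means only that the bits match what was recorded at ingest;
        # it says nothing about whether the object was trustworthy to begin with.
        return sha256_of(path) == recorded_digest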

Elsewhere, we have framed the trust discussion in terms of the Trusted Digital Repository, a complex and sometimes contentious narrative. One outcome has been that demonstrating trust requires an expensive overhead of certification tick-boxing. It’s not always clear how this exercise demonstrates trust to users…see the Twitter snippet below.

Me, I’m a big fan of audit trails – and not just PREMIS, which only records what happens inside the repository. I think every step from creation to disposal should be logged in some way. I often bleat about rescuing audit trails from EDRM systems and CMS systems. And I’d love to see a return to that most despised of paper forms, the Transfer List, expressed in digital form. And I don’t just mean a manifest, though I like those too.
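To show the kind of thing I have in mind, here is a sketch of what a single entry in a digital Transfer List might record. The field names and values are my own invention for the sake of illustration; they are not PREMIS or any official schema.

    import json

    # One entry in a hypothetical digital Transfer List.
    transfer_entry = {
        "identifier": "TR-2017-0042/001",
        "original_path": "G:/Finance/Reports/annual_report_2016.docx",
        "sha256": "9f2c1a...",                 # fixity value recorded at transfer
        "creator": "Finance Department",
        "date_created": "2016-11-03",
        "date_transferred": "2017-09-12",
        "transferring_system": "Departmental shared drive",
        "notes": "Final signed-off version; earlier drafts not transferred.",
    }
    print(json.dumps(transfer_entry, indent=2))

A manifest tells you what arrived; a transfer list of this sort also tells you where it came from, who made it, and why it was selected.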

Lastly, there’s supporting documentation. We were very strong on that in the NDAD service too, a provision for which I am certain we have Ruth Vyse to thank. We didn’t just ingest a dataset; we also took in lots of surrounding reports, manuals, screenshots, data dictionaries, code bases…anything that explained more about the dataset, its owners, its creation, and its use. Naturally our scrutiny also included a survey of the IT environment that was needed to support the database in its original location.

All of this documentation, I believe, goes a long way to engendering trust, because it demonstrates the authenticity of any given digital resource. A single digital object can’t be expected to demonstrate this truth on its own account; it needs the surrounding contextual information, and multiple instances of such documentation give a kind of “triangulation” on the history. This is why the archival skill of understanding, assessing and preserving the holistic context of the resource continues to be important for digital preservation.

Conclusion

Sheridan’s call for “disruption” need not be heard as an alarmist cry, but there is a much-needed wake-up call to the archival profession in his words. It is an understatement to say that the digital environment is evolving very quickly, and we need to respond to the situation with equal alacrity.

PASIG17 reflections: archivist skills & digital preservation

Any discussion that includes “digital preservation” and “traditional archivist skills” in the same sentence always interests me. This reflects my own personal background (I trained as an archivist) but also my conviction that the skills of an archivist can have relevance to digital preservation work.

I recently asked a question along these lines after I heard Catherine Taylor, the archivist for Waddesdon Manor, give an excellent presentation at the PASIG17 Conference this month. She started out as a paper archivist and has evidently grown into the role of digital archivist with great success. Her talk was called “We can just keep it all can’t we?: managing user expectations around digital preservation and access”.

We can’t find our stuff

As Taylor told it, she was a victim of her own success; staff always depended on her to find (paper) documents which nobody else could find. The same staff apparently saw no reason why they couldn’t depend on her to find that vital email, spreadsheet, or Word document. To put it another way, they expected the “magic” of the well-organised archive to pass directly into a digital environment. My guess is that they expected that “magic” to take effect without anyone lifting a finger or expending any effort on good naming, filing, or metadata assignment. But all of that is hard work.

What’s so great about archivists?

My question to Catherine was to do with the place of archival skills in digital preservation, and how I feel they can sometimes be neglected or overlooked in many digital preservation projects. One possible scenario is that the “solution” we purchase is an IT system, so its implementation is in the hands of IT project managers. Archivists might be consulted as project suppliers; more often, I fear, they are ignored, or don’t speak up.

Catherine’s reply affirmed the value of such skills as selection and appraisal, which she believes have a role to play in assessing the overload of digital content and reducing duplication.

After the conference, I wondered to myself what other archival skills or weapons in the toolbox might help with digital preservation. A partial tag cloud might include selection, appraisal, arrangement, description, and an eye for context.

We’ve got an app for that

What tools and methods do technically-able people reach for to address issues associated with the “help us to find stuff” problem? Perhaps…

  • Automated indexing of metadata, where the work is done by machines over machine-readable text.
  • Using default metadata fields – by which I mean properties embedded in MS Word documents. These can be exposed, made sortable and searchable; SharePoint has made a whole “career” out of doing that.
  • Network drives managed by IT sysadmins alone – which can include everything from naming to deletion (but also backing up, of course).
  • De-duplication tools that can automatically find and remove duplicate files. Very often, they’re deployed as network management tools and applied to resolve what is perceived as a network storage problem. They work by recognising checksum matches or similar rules; a bare-bones sketch of the idea follows after this list.
  • Search engines – which may be powerful, but not very effective if there’s nothing to search on.
  • Artificial Intelligence (AI) tools which can be “trained” to recognise words and phrases, and thus assist (or even perform) selection and appraisal of content on a grand scale.
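As promised above, here is a bare-bones sketch of the checksum idea behind most de-duplication tools: files whose digests match are byte-for-byte identical, so only one copy strictly needs to be kept. The directory path is invented, and a real tool would of course do far more (reporting, linking, safe deletion).

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def find_duplicates(root: Path) -> dict[str, list[Path]]:
        """Group files under root by SHA-256; groups of two or more are duplicates."""
        by_digest = defaultdict(list)
        for path in root.rglob("*"):
            if path.is_file():
                by_digest[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
        return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

    # The directory path below is invented for illustration.
    for digest, paths in find_duplicates(Path("/shared/network-drive")).items():
        print(f"{len(paths)} identical copies (sha256 {digest[:12]}...):")
        for p in paths:
            print("   ", p)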

Internal user behaviours

There are some behaviours of our beloved internal staff / users which arguably contribute to the digital preservation problem in the long term. They could all be characterised as “neglect”. They include:

  • Keeping everything – if not instructed to do otherwise, and there’s enough space to do so.
  • Free-spirited file naming and metadata assignment.
  • Failure to mark secure emails as secure – which is leading to a retrospective correction problem for large government archives now.

I would contend that a shared network run on an IT-only basis, where the only management and ownership policies come from sysadmins, is likely to foster such neglect. Sysadmins might not wish to get involved in discussions of meaning, context, or use of content.

How to restore the “magic”?

I suppose we’d all love to get closer to a live network, EDRMS, or digital archive where we can all find and retrieve our content. A few suggestions occur to me…

  • Collaboration. No archivist can solve this alone, and the trend of many of the talks at PASIG was to affirm that collaboration between IT, storage, developers, archivists, librarians and repository managers is not only desirable – it is in fact the only way we’ll succeed now. This might be an indicator of how big the task ahead of us is. The 4C Project said as much.
  • Archivists must change and grow. Let’s not “junk” our skillsets; for some reason, I fear we are encouraged not to tread on IT ground, to assume that machines can do everything we can do, and that our training is worthless. Rather, we must engage with what IT systems, tools and applications can do for us, and with how they can help us put into practice the skills in that tag cloud.
  • Influence and educate staff and other users. And if we could do that in a painless way, it would be the miracle cure we’re all looking for. On the other hand, Catherine’s plan to integrate SharePoint with Preservica (with the help of the latter) is one move in the right direction: for one thing, she’s finding that the actual location of digital objects doesn’t really matter to users, so long as the links work. For reasons I can’t articulate right now, this strikes me as a significant improvement on a single shared drive sitting in a building.

Conclusion

I think archivists can afford to assert their professionalism and make their voice a little louder, where possible stepping in at all stages of the digital preservation narrative; at the same time, we mustn’t cling to the “old ways”, but rather start to discover ways in which we can update them. John Sheridan of The National Archives has already outlined an agenda of his own to do just this. I would like to see this theme taken up by other archivists, and propose a strand along these lines for discussion at the ARA Conference.