Ed The Archivist – Digital Preservation and Archives

Anti-folder, pro-searching

Chris Loftus at the University of Sheffield has detected a trend among tech giants Google and Microsoft in their cloud storage provision. They would prefer to us make more use of searches to find material, rather than store it in named folders.

With MS SharePoint at least – which is more than just cloud storage, it’s a whole collaborative environment with built-in software and numerous features – my sense is that Microsoft would be happier if we moved away from using folders. One reason for this might be because these cloud-based web-accessed environments would struggle if the pathway or URL is too long; presumably the more folders you add, the more the string grows, and you make the problem worse. So there’s a practical technical reason right there; we wanted a way to work collaboratively in the cloud, but maybe some web browsers can’t cope.

However, I also think SharePoint’s owners are trying to edge us towards taking another view of our content. This is probably based on its use of metadata. SharePoint offers a rich array of tags; one instance that springs to mind is the “Create Column” feature that enables the user to build their own metadata fields for the their content (such as Department Name) and populate it with their own content. This enables the user to create a custom view of thousands of documents, with useful fields arranged in columns. The columns can be searched, filtered, sorted, rearranged.

This could be called a “paradigm shift” by those who like such jargon…it’s a way of moving towards a “faceted view” of individual documents, based on metadata selections, not unlike the faceted views offered by Institutional Repository software (which allow browsing by year, name of author, departments; see this page for instance).

Advocates of this approach would say that this faceted view is arguably more flexible and better than the views of documents afforded by the old hierarchical folder structure in Windows, which tends to flatten out access to a single point of entry, which must be followed by drilling-down into a single route and opening more sub-folders. Anecdotally, I have heard of enthusiasts who actively welcome this future – “we’ll make folders a thing of the past!”

In doing this, perhaps Microsoft are exploiting a feature which has been present in their product for some time now, even before SharePoint. I mean document properties; when one creates a Word file, some of these properties (including dates) are generated. Some of them (Title, Comments) can be added by the user, if so inclined. Some can be auto-populated, for instance a person’s name – if the institution managed to find a way to synch Outlook address book data, or the Identity Management system, with document authoring.

Few users have ever bothered much with creating or using document properties, in my experience. It’s true they aren’t really that “visible”. If you right-click on any given file, you can see some of them. Some of them are also visible if you decide to pick certain “details” from a drop down, which then turn into columns in Windows Explorer. Successive versions of Explorer have gradually tweaked that feature. In one sense, SharePoint have found a way to expose these fields, leverage the properties even more dynamically. Did I mention SharePoint is like a gigantic database?

I might want to add that success in SharePoint metadata depends on an organisation taking the trouble to do it, and configure the system accordingly. If you don’t, SharePoint probably isn’t much of an improvement over the old Windows Explorer way. If you do want to configure it that way, I would say it’s a process that should be managed by a records manager or someone who knows about naming conventions and rules for metadata entry; I seem to be saying it’s not unlike building an old-school (paper) file registry with a controlled vocabulary. How 19th-century is that? But if that path is not followed, might there not be the risk of free-spirited column-adding and naming by individual users, resulting in metadata (and views) that are only of value to themselves.

However, I would probably be in favour of anything that moves us away from the “paper metaphor”. What I mean by this is that storing Word-processed files, spreadsheets and emails in (digital) folders has encouraged us to think we can carry on working the old pre-digital way, and imagine that we are “doing the filing” by putting pieces of paper into named folders. This has led to tremendous errors in electronic records management systems, which likewise perpetuate this paper-based myth, and create the illusion that records can be managed, sentenced and disposed on a folder basis. Any digital change offers us an opportunity to rethink the way we do things, but the paper metaphor gets in the way of that. If nothing else, SharePoint allows us a way of apprehending content that is arguably “truer” to computer science.

PDF/A and read-only in SharePoint

In the previous post I made some observations about metadata in Office Documents in the SharePoint environment. In this brief follow-up post, I will mention what happens to PDF/A files in SharePoint. This is prompted by a comment received from Özhan Saglik. As part of this, I was also moved to try an experiment with Read-Only documents.

Well, the punchline to the first one is simple – PDF/A files don’t change at all in SharePoint. At least, that’s what I found with my limited testbed. This result need not surprise us, if we consider some basic facts:

PDF/A files are designed to “open as read only”. This is part of the standard, so they can’t be (carelessly) modified by anything. Matter of fact you should get a message at the top of the document when you open it, informing you that this is the case.
SharePoint was designed to work with Microsoft products, not Adobe products. To be precise, the main purpose of working in SharePoint is the sharing and editing capability; the developers have not expended any effort in adding an editor or other app for a file format which they don’t support. On the other hand, you can upload a PDF to SharePoint and let others see it.
There is no possibility of editing PDF/A (or even plain PDFs) within the SharePoint environment. See previous remark.

The above observations are endorsed by MS employees, for instance in this forum thread.

PDF/A experiments

For the record, for this experiment I used five files which are conformant to the PDF/A 1-a flavour. I can say this with a small degree of certainty because (a) this information is reported in their metadata and (b) they passed the VeraPDF test for conformance with the standard. As with my previous experiment, I did a before-and-after profiling of their properties using Apache Tika.

I was able to upload the PDF/A files into Sharepoint, and download them; that’s as far as it goes, since (as noted) no editing of a PDF within SharePoint is possible. All the metadata remained completely intact, and the upload-download process did not change anything. Results can be seen in comparative_table_3.xlsx.

Download comparative tables as a zip file (27.9 KB)

However, one aspect still intrigues me. It’s a property – or something – which is reporting on a date and time. It displays as “Modified” in SharePoint, and also displays as ‘Date Modified’ in Windows Explorer. I am assuming, perhaps wrongly, that both of these are the same thing. If we look at this screenshot, it shows my PDF/A files after I’d dragged them in:

And after I’d downloaded them back onto my Desktop, Windows Explorer likewise reported today’s date:

These dates in no way match the modified dates of the original PDFs as reported in Windows:

What’s also interesting is that I can’t find this “Date modified” value anywhere in the Apache Tika report.

I’m assuming that, in Windows Explorer at least, this is a timestamp, or a file time value which gets written by the NFTS system. These values change in response to certain events, such as copying or drag-and-dropping. This page seems to describe what I think might be happening.

Maybe Read-Only is the key?

I was about to scribble something along the lines of “PDF/A is a robust format – and here’s why!” Perhaps then setting myself the impossible job of understanding (and explaining) how the PDF/A encoding meets some of our expectations for long-term preservation by the complex set of conditions that obtain within the standard. But what if it’s much simpler than that? Perhaps a PDF/A file is impervious to SharePoint changes for one simple reason – it’s Read-Only.

If that’s true, I wondered, then would MS Office documents behave nicely if I set their Properties to Read Only before uploading them into SharePoint, where the expected metadata changes would take place?

I decided to revisit my MS Office Documents testbed from the previous experiment and try this out. The results can be seen in comparative_table_4.xlsx. and were extremely encouraging.

For the “old style” Office documents, the only thing that has changed between upload and download is the checksum. From this evidence, perhaps we might conclude that SharePoint has simply generated a “new” digital object in some way – which is pretty much the expected behaviour, is it not?

In the case of the three Office Open XML files, SharePoint added a new Custom Field that wasn’t there before. The Content Length also changed.

However, unlike what happened in my previous experiments, all the properties and metadata remain the same, including that all-important date of creation.

Lessons learned

Documents authored to be conformant to the PDF/A standard seem to be impervious to change in SharePoint. This may mean the format is a good thing, from that point of view.
Office Documents, if set to Read Only before upload to SharePoint, are much better protected from SharePoint change than those that are not. Of course, if you start to edit the document in SharePoint it’s another matter completely. But this way we stand a chance at least of uploading an authentic “original” document with metadata and properties intact, which could be used as a ground-zero point of comparison for any subsequent changes that are made.
Ticking the Read Only box is a simple expedient which can be achieved without even opening the document (simply right-click on the object in Windows Explorer). A purist might argue that doing so constitutes making a change to the file which violates archival principles; but does it? Maybe we’re only ticking a box in properties which has no profound effect on the content. Maybe the added protection it affords us is a trade-off that is worth making.

These are not especially ground-breaking revelations, but they may help an archivist or records manager who is facing a SharePoint rollout, and provide some clues for a practical workaround which may help to mitigate this metadata loss.

Metadata and Properties In SharePoint

“No changes were made to your original file”

Today’s blog post makes a few observations about the way file formats behave. To be more specific, we’re talking about MS Office documents and what happens to them in MS SharePoint.

In my superficial way, I have noticed that when a document gets uploaded to SharePoint we can now look at it through a web browser. This is because SharePoint is predicated on the idea that we can all work in the cloud instead of being tied down to a local server. The browser view we are offered is not unlike Windows Explorer. In this view, we can see a folder structure, a file name, and also columns indicating dates (the Modified column), the Owner Name (Modified By) and Size (File Size).

I could see in this view that when a document gets uploaded, the date displayed is the date when it was added to SharePoint. This made me wonder what happened to the original Date Of Creation, something we worry about if we’re archivists or records managers. Further, I wondered if other metadata was being affected by this drag-and-drop action.

Tests

I did some tests using the Apache Tika utility, which is capable of exposing Properties in many file formats. In the case of Office documents, these properties can be a rich mix of dates, text strings, and technical metadata. I’m naively assuming these things are inscribed in the document in some way, by a combination of the application (e.g. MS Word) and the Windows file system, e.g. NFTS.

My method was to start with a small test bed of Office documents (one .doc, one .xls, and one .ppt). I got these from the OPF format corpus. I wanted to carry out a simple before-and-after comparison. First I profiled all the documents before upload, and pasted all the metadata into my table.

Then I tried three operations using SharePoint: (1) Drag and drop (2) Edit in the Browser (3) Edit in Word. The first one is simply moving (copying) the document from Windows Explorer into the SharePoint environment. The editing operations refer to the two options available: “make quick changes in the Browser”, or launch the application for more functionality. (2) suggests there is a web-based version of Word, Excel and PowerPoint which doesn’t quite have all the functions you’d expect, but still enables a user to carry out some limited edits.

After each action, there was a change to the test object. I downloaded the changed object in each case, and ran Apache Tika to see what the profiles looked like. I then pasted all the results into my table so I could make comparisons.

Click to download comparative tables

What changed?

There’s a number of changes which you can see if you download my tables. Look at comparativetable.xlsx. The most obvious and profound change is to the dates, especially the date of creation. This property remains unchanged if we just drag and drop; but when we start editing, either in browser or in application, the date of creation appears to change to the date when editing was carried out. The PowerPoint file kept its date of creation, but the other two didn’t; so now the only evidence we have of original dates is in the “Date Printed” property.

The second profound change is to the file size. In each case, this changed quite noticeably; even the act of dragging and dropping introduced a change to file size. Given the fact that the checksum is also new, it looks as though SharePoint is creating a new digital object in some way, and “injecting” it with something (I have no idea what) that makes it larger.

We’ve also tended to lose properties like Last-Author, which can get overwritten in SharePoint. In one exceptional case, there’s also a puzzling report on the page count of my Word Document, which mysteriously changes from 1 page to 29 pages.

What about newer documents?

So far so good. However, this experiment has been applied to “old” Microsoft Documents, by which I mean documents authored before the introduction of the Office Open XML standard. I thought I had better try out the same experiments with some more recent documents, and so selected a testbed of one .docx file, one pptx file, and one .xlsx file. The same before and after actions as above were carried out. Results are available in my second table, comparativetable_2.xlsx.

This time the changes were nowhere near as profound. In each case, the Date of Creation is intact, a result likely to reassure obsessive archivists like myself. There are still some minor losses but most of the elements highlighted in red are as expected (i.e. they reflect that fact that editing took place). However, SharePoint evidently still continues to “inject” something to make the files change size.

What’s going on?

One thing that might be happening here is not limited to SharePoint, but reflects Microsoft’s commitment to forward compatibility. When you launch an old MS Document in a more recent version of the application, it offers to perform file conversion for you. The user receives notification messages that this is happening. Matter of fact we received such notifications in the course of this experiment, such as these:

One result of this is that SharePoint now helpfully stores two iterations of your file for you. One of them is the “original”, the other is the “conversion”. However, the extent of reporting on changes is restricted to a rather vague generic message about “changes to the layout”. Well, if my tests indicate anything, it’s more than just a layout change.

Does any of this matter?

From a digital preservation point of view, I would say yes it does. I don’t think any of us would be too happy about a process that seems to over-write the date of creation of a resource; and more to the point doesn’t really tell us that the change is happening.

I don’t think I need stress the importance of dates for record-keeping, and other embedded properties may also add value. Indeed, one approach to digital preservation as it applies to file formats is to carry out extensive profiling of ingested files, extract and copy the metadata, and store it within the Archival Information Package. If we’re even more clever, we can parse the properties into separate fields and manage them in a preservation database.

I’m aware that the value of doing this is disputed, and that we’re continuing to have discussions and conversations about “significant properties” in our community. But if any of my observations are correct, it seems that SharePoint is performing a species of migration on our content (they call it “conversion”), and introducing changes without really telling us the extent of these changes.

The lesson, if indeed there is one, might be that “old” Office documents need some care and attention before upload to SharePoint, if these properties are important to you and your users.

Additional thoughts

If we find ourselves moving content into SharePoint, do we have to do it by a drag and drop action? To put it another way, are there other ways of moving files so we can protect these properties? Probably. One possibility is the TeraCopy tool, and another possibility is file compression.

TeraCopy is a Windows tool which offers a more sophisticated form of drag-and-drop. It evidently integrates well with Windows Explorer, although the vendors don’t claim that it works with SharePoint. While I do have the free version, I haven’t experimented with it as part of this test.

TeraCopy includes checksum verification as part of its capabilities, which is why it’s bound to appeal to digital archivists. Additionally, it claims to do something to keep date properties intact:

As to file compression, this would involve zipping up the target files into a single compressed object (e.g. .zip or .7z) and moving this into SharePoint. In some other unrelated experiments, I have found this does indeed protect the dates and other properties from any unwarranted change. However, it’s arguably pretty pointless to put a zipped object into SharePoint, as this will probably obviate against the faceted views and collaborative aspects that the platform offers.

Wanted: an underpinning model of organisational truth for the digital realm

Every so often I am privileged to spend a day in the company of fellow archivists and records managers to discuss fascinating topics that matter to our professional field. This happened recently at the University of Westminster Archives. Elaine Penn and Anna McNally organised a very good workshop on the subjects of appraisal and selection, especially in the context of born-digital records and archives. Do we need to rethink our ideas? We think we understand what they mean when applied to paper records and archives, but do we need to change or adapt when faced with their digital cousins?

For my part I talked for 25 mins on a subject I’ve been reflecting on for years, i.e. early interventions, early transfer, and anything to bridge the “disconnect” that I currently perceive between three of the important environments, i.e. the live network, the EDRMS or record system, and digital preservation storage and systems. I’m still trying to get closer to some answers. One idea, which I worked up for this talk, was the notion of a semi-current digital storage service. I’m just at the wish-list stage, and my ideas have lots of troubling gaps. I’d love to hear more from people who are building something like this. A colleague who attended tells me that University of Glasgow may have built something that overlaps with my “vision” (though a more accurate description in my case might be “wishful thinking”).

When listening to James Lappin’s excellent talk on email preservation, I noted he invoked the names of two historical figures in our field – Jenkinson and Schellenberg. I would claim that they achieved models of understanding that continue to shape our thinking about archives and records management to this day. Later I wondered out loud what is it that has made the concepts of Provenance and Original Order so effective; they have longevity, to the extent they still work now, and they bring clarity to our work whether or not you’re a sceptic of these old-school notions (and I know a lot of us are). They have achieved the status of “design classics”.

I wonder if that effectiveness is not really about archival care, nor the lifecycle of the record, nor the creation of fonds and record series, but about organisational functions. Maybe Jenkinson and Schellenberg understood something about how organisations work; and maybe it was a profound truth. Maybe we like it because it gives us insights into how and why organisations create records in the first place, and how those records turn into archives. If I am right, it may account for why archivists and records managers are so adept at understanding the complexities of institutions, organisations and departments in ways which even skilled business managers cannot. The solid grounding in these archival principles has led to intuitive skills that can simplify complex, broken, and wayward organisations. And see this earlier post, esp. #2-3, for more in this vein.

What I would like is for us to update models like this for the digital world. I want an archival / records theory that incorporates something about the wider “truth” of how computers do what they do, how they have impacted on the way we all work as organisations, and changed the ways in which records are generated. My suspicion is that it can’t be that hard to see this truth; I sense there is a simple underlying pattern to file storage, networks and applications, which could be grasped if we only see it clearly, from the holistic vantage point where it all makes sense. Further, I think it’s not really a technical thing at all. While it would probably be useful for archivists to pick up some basic rudiments of computer science along with their studies, I think what I am calling for is some sort of new model, like those of Provenance and Original Order, but something that is able to account for the digital realm in a meaningful way. It has to be simple, it has to be clear, and it has to stand the test of time (at least, for as long as computers are around!).

I say this because I sometimes doubt that we, in this loose affiliation of experts called the “digital preservation community”, have yet to reach consensus on what we think digital preservation actually is. Oh, I know we have our models, our standards, systems, and tools; but we keep on having similar debates over what we think the target of preservation is, what we think we’re doing, why, how, and what it will mean in the future. I wonder if we still lack an underpinning model of organisational truth, one that will help us make sense of all the complexity introduced by information technology. We didn’t have these profound doubts before; and whether we like them or not, we all agree on what Jenkinson and Schellenberg achieved, and we understand it. The rock music writer Lester Bangs once wrote “I can guarantee you one thing: we will never again agree on anything as we agreed on Elvis”, noting the diversity of musical culture since the early days of rock and roll. Will we ever reach accord with the meaning of digital preservation?

What does an archivist do?

This post is a response to a Tweet from Judith Dray seen recently. The plea was for a “cool, interesting, and accessible way to describe what an archivist does”.

I worked as a “traditional” archivist for the General Synod of the Church of England for about 15 years. When I say traditional, I mean I worked with paper records and archives. I could easily describe what I did in the usual terms, involving cataloguing, indexing, arrangement, description, and boxing of materials and putting them on shelves. In so doing, I would probably confirm the clichéd view that an archivist is a solitary hermit who loses themselves in the abstruse rules of provenance and original order.

But that misses the bigger picture. The work of an archivist only has any value if we put it in context. This context involves tangible things like other people, organisations, and the work and life of other people; and also involves abstract ideas, like culture, meaning, and history.

Below, I have written seven Tweetable responses to Judith Dray. But I had to unpack each Tweet into a paragraph of prose. I may have resorted to some hyperbole and rhetoric, but I like to think there is still a grain of truth in my ravings and fantasising. For this post, I have thought myself back into the past, and temporarily forgotten whatever I might know about digital preservation.

1. An archivist brings order to chaos.

Just give an archivist a random-seeming mess of unsorted papers and see how quickly that mess is transformed into an accessible collection. This is because archivist is applying sorting skills, based on their knowledge of parent collections, parent organisation, and former owners. Colleagues at the Synod sometimes assumed I was just “doing the filing”, but I think there’s more to it.

2. An archivist reflects the truth of an organisation.

If you want to know the core meaning, truth or essence of any organisation – from a business to a school to a textile factory – the archive holds the authoritative version of it. The archivist brings out that truth, through adhering to the fundamental principles of provenance (where the papers came from) and original order (how they were kept). These two principles may sound musty and boring, yet have proven surprisingly robust as a reliable method for reflecting the truth.

3. An archivist has the holistic view.

A good archivist isn’t just there at the end of the life of a record, but is there right at the start; they know the creators and understand precisely why they create the records that they do. In this way, they connect to and engage with the creating organisation in ways that surpass even the most diligent executive officer or auditor. The development of records management in the 20th century only served to strengthen this inherently archival virtue. At one stage in the 1990s, commercial companies tried to harness that rare skill and monetise it, turning it into something called “Knowledge Management”. Naturally, this failed!

4. An archivist engenders trust in their depositors.

The real value of an archivist’s role has to be seen in the context of people and agencies who use archives. Among these people, the depositors, creators and owners of the resources are key. Over time the “culture” of archives has created and diligently nurtured a trust bond, a covenant if you will, that enables depositors to place their faith in a single archivist or an entire memory institution. That trust has been hard won, but we got there through applying effective procedures for due diligence, managing and documenting every stage in the transfer of content in ways that ensured the integrity of the resource, informed by the “holistic” skill (see above).

5. An archivist enables use and re-use of the archives.

A second key group of archive users comprises the researcher, the scholar, the historian, the reader. In today’s impoverished world the beleaguered archivist has been obliged to reframe “readers” as “customers”, seeing them as an income stream, but the cultural truth is much richer. Archives don’t change; but the historian’s interpretation of the source keeps evolving all the time. If any historian seeks to validate or challenge the interpretation of another, the archives are there – waiting silently for consultation. The same resource can be used to research multiple topics, depending on the “lens” the researcher chooses to apply; there is a well-known archival resource which began life as a land survey, yet in its lifetime it has been used as statistical evidence for population studies, income distribution, place names, family history, and more.

6. An archivist can beat Google hands-down.

In our insatiable lust for faster and deeper browser searches, we sometimes tend to overlook the value of structure. Structure is something an archivist has hard-wired into their genetic code, and it’s what makes archival cataloguing a superior way of organising and presenting information concisely and meaningfully. It’s not about sticking obstinately to the arcane rules of ISAD(G) or insisting on the Fonds-Series-Item hierarchy to the point of madness, but about understanding the structure of meaning, the way that one piece of information “belongs” to another, and how we can use these relationships to bring out the inner truth of the collection. Compared to this deep understanding, any given Google search return may give the user a quick hit of satisfaction, yet it is severely fractured, lacking in context, and disconnected from the core.

7. An archivist makes history manageable.

Any given archive probably represents a very small percentage of the actual records that were created at the time; this is especially true of any 20th century collection. Archivists are able to select the core 5% from this abundance, and yet still preserve the truth of the organisation. We don’t keep “everything”, to put it another way; we keep just the right amount. The skills of appraisal and selection are among the most valuable tools we have for any society that wants to manage its collective memory, yet these skills are taken for granted and under-valued, even by archivists themselves. We can feasibly scale this up to address the challenge of digital content. At a time when the world is creating more digital data than we can store or contain, let alone preserve, the skills of selection and appraisal will be needed more than ever.

PASIG17 reflections: archival storage and detachable AIPs

A belated reflection on another aspect of PASIG17 from me today. I wish to consider aspects of “storage” which emerged from the three days of the conference.

One interesting one was the Queensland Brain Institute case study, where they are serving copies of brain scan material to researchers who need it. This is bound to be of interest to those managing research data in the UK, not least because the scenario described by Christian Vanden Balck of Oracle involved such large datasets and big files – 180 TB ingested per day is just one scary statistic we heard. The tiered storage approach at Queensland was devised exclusively to preserve and deliver this content; I wouldn’t have a clue how to explain it in detail to anyone, let alone know how to build it, but I think it involves a judicious configuration and combination of “data movers, disc storage, tape storage and remote tape storage”. The outcomes that interests me are strategic: it means the right data is served to the right people at the right time; and it’s the use cases, and data types, that have driven this highly specific storage build. We were also told it’s very cost-effective, so I assume that means that data is pretty much served on demand; perhaps this is one of the hallmarks of good archival storage. It’s certainly the opposite of active network storage, where content is made available constantly (and at a very high cost).

Use cases and users have also been at the heart of the LOCKSS distributed storage approach, as Art Pasquinelli of Stanford described in his talk. I like the idea that a University could have its own LOCKSS box to connect to this collaborative enterprise. It was encouraging to learn how this service (active since the 1990s) has expanded, and it’s much more than just a sophisticated shared storage system with multiple copies of content. Some of the recent interesting developments include (1) more content types admissible than before, not just scholarly papers. (2) Improved integration with other systems, such as Samvera (IR software) and Heritrix (web-archiving software); this evidently means that if it’s in a BagIt or WARC wrapper, LOCKSS can ingest it somehow. (3) Better security; the claim is that LOCKSS is somehow “tamper-resistant”. Because of its distributed nature, there’s no single point of failure, and because of the continual security checks – the network is “constantly polling” – it is possible for LOCKSS to somehow “repair” data. (By the way I would love to hear more examples and case studies of what “repairing data” actually involves; I know the NDSA Levels refer to it explicitly as one of the high watermarks of good digital preservation.)

In both these cases, it’s not immediately clear to me if there’s an Archival Information Package (AIP) involved, or at least an AIP as the OAIS Reference Model would define it; certainly both instances seem more complex and dynamic to me than the Reference Model has proposed. For a very practical view on AIP storage, there was the impromptu lightning-talk from Tim Gollins of National Records of Scotland. Although a self-declared OAIS-sceptic, he was advocating that we need some form of “detachable AIP”, an information package that contains the payload of data, yet is not dependent on the preservation system which created it. This pragmatic line of thought probably isn’t too far apart from Tim’s “Parsimonious Preservation” approach; he’s often encouraging digital archivists to think in five-year cycles, linked to procurement or hardware reviews.

Tim’s expectation is that the digital collection must outlive the construction in which it’s stored. The metaphor he came up with in this instance goes back to a physical building. A National Archive can move its paper archives to another building, and the indexes and catalogues will continue to work, allowing the service to continue. Can we say the same about our AIPs? Will they work in another system? Or are they dependent on metadata packages that are inextricably linked to the preservation system that authored them? What about other services, such as the preservation database that indexes this metadata?

WIth my “naive user” hat on, I suppose it ought to be possible to devise a “standard” wrapper whose chief operator is the handle, the UUID, which ought to work anywhere. Likewise if we’re all working to standard metadata schemas, and formats (such as XML or json) for storing that metadata, then why can’t we have detachable AIPs? Tim pointed out that among all the vendors proposing preservation systems at PASIG, not one of them agreed on the parameters of such important stages as Ingest, data management, or migration; and by parameters I mean when, how, and where it should be done, and which tools should be used.

The work of the E-ARK project, which has proposed and designed standardised information packages and rules to go with them, may be very germane in this case. I suppose it’s also something we will want to consider when framing our requirements before working with any vendor.

PASIG17 reflections: Sheridan’s disruptive digital archive

I was very interested to hear John Sheridan, Head of Digital at The National Archives, present on this theme. He is growing new ways of thinking about archival care in relation to digital preservation. As per my previous post, when these phrases occur in the same sentence then you have my attention. He has blogged about the subject this year (for the Digital Preservation Coalition), but clearly the subject is becoming deeper all the time. Below, I reflect on three of the many points that he makes concerning what he dubs the “disruptive digital archive”.

The paper metaphor is nearing end of life

Sheridan suggests “the deep-rooted nature of paper-based thinking and its influence on our thinking” needs to change and move on. “The archival catalogue is a 19th century thing, and we’ve taken it as far as we can in the 20th century”.

I love a catalogue, but I still agree; and I would extend this to electronic records management. And here I repeat an idea stated some time ago by Andrew Wilson, currently working on the E-ARK project. We (as a community) applied a paper metaphor when we built file plans for EDRM systems, and this approach didn’t work out too well. That approach requires a narrow insistence on single locations for digital objects, locations exactly matching against the retention needs of each object. Not only is this hard work for everyone who has to do “electronic filing”, it proved not to work in practice. It’s one-dimensional, and it stems from the grand error of the paper metaphor.

I would still argue there’d be a place in digital preservation for sorting and curation, “keeping like with like” in directories, though I wouldn’t insist on micro-managing it; and, as archivists and records managers we need to make more use of two things computers can do for us.

One of them is linked aliases; allowing the possibility for digital content sitting permanently in one place on the server, mostly likely in an order that has nothing to do with “original order”, while aliased links, or a METS catalogue, do the work of presenting a view of the content based on a logical sequence or hierarchy, one that the archivist, librarian, and user are happy with. In METS for instance, this is done with the <FLocat> element.

The second one is making use of embedded metadata in Office documents and emails. Though it’s not always possible to get these properties assigned consistently and well, doing so would allow us to view / retrieve / sort materials in a more three-dimensional manner, which the single directory view doesn’t allow us to do.

I dream of a future where both approaches will apply in ways that allow us these “faceted views” of our content, whether that’s records or digital archives.

Get over the need for tidiness

“We are too keen to retrofit information into some form of order,” said Sheridan. “In fact it is quite chaotic.” That resonates with me as much as it would with my other fellow archivists who worked on the National Digital Archive of Datasets, a pioneering preservation service set up by Kevin Ashley and Ruth Vyse for TNA. When we were accessioning and cataloguing a database – yes, we did try and catalogue databases – we had to concede there is really no such thing as an “original order” when it comes to tables in a relational database. We still had to give them ISAD(G) compliant citations, so some form of arrangement and ordering was required, but this is a limitation of ISAD(G), which I still maintain is far from ideal when it comes to describing born-digital content.

I accept Sheridan’s chaos metaphor…one day we will square this circle; we need some new means of understanding and performing arrangement that is suitable for the “truth” of digital content, and that doesn’t require massive amounts of wasteful effort.

Trust

Sheridan’s broad message was that “we need new forms of trust”. I would say that perhaps we need to embrace both new forms and old forms of trust.

In some circles we have tended to define trust in terms of the checksum – exclusively defining trust as a computer science thing. We want checksums, but they only prove that a digital object has not changed; they’re not an absolute demonstration of its trustworthiness. I think Somaya Langley has recently articulated this very issue in the DP0C blog, though I can’t find the reference just now.

Elsewhere, we have framed the trust discussion in terms of the Trusted Digital Repository, a complex and sometimes contentious narrative. One outcome has been that to demonstrate trust, an expensive overhead in terms of certification tick-boxing is required. It’s not always clear how this exercise demonstrates trust to users…see the Twitter snippet below.

Me, I’m a big fan of audit trails – and not just PREMIS, which only audits what happens in the repository. I think every step from creation to disposal should be logged in some way. I often bleat about rescuing audit trails from EDRM systems and CMS systems. And I’d love to see a return to that most despised of paper forms, the Transfer List, expressed in digital form. And I don’t just mean a manifest, though I like them too.

Lastly, there’s supporting documentation. We were very strong on that in the NDAD service too, a provision for which I am certain we have Ruth Vyse to thank. We didn’t just ingest a dataset, but also lots of surrounding reports, manuals, screenshots, data dictionaries, code bases…anything that explained more about the dataset, its owners, its creation, and its use. Naturally our scrutiny also included a survey of the IT environment that was needed to support the database in its original location.

All of this documentation, I believe, goes a long way to engendering trust, because it demonstrates the authenticity of any given digital resource. A single digital object can’t be expected to demonstrate this truth on its own account; it needs the surrounding contextual information, and multiple instances of such documentation give a kind of “triangulation” on the history. This is why the archival skill of understanding, assessing and preserving the holistic context of the resource continues to be important for digital preservation.

Conclusion

Sheridan’s call for “disruption” need not be heard as an alarmist cry, but there is a much-needed wake-up call to the archival profession in his words. It is an understatement to say that the digital environment is evolving very quickly, and we need to respond to the situation with equal alacrity.

PASIG17 reflections: archivist skills & digital preservation

Any discussion that includes “digital preservation” and “traditional archivist skills” in the same sentence always interests me. This reflects my own personal background (I trained as an archivist) but also my conviction that the skills of an archivist can have relevance to digital preservation work.

I recently asked a question along these lines after I heard Catherine Taylor, the archivist for Waddesdon Manor, give an excellent presentation at the PASIG17 Conference this month. She had started life as the paper archivist and has evidently grown into the role of digital archivist with great success. Her talk was called “We can just keep it all can’t we?: managing user expectations around digital preservation and access”.

We can’t find our stuff

As Taylor told it, she was a victim of her own success; staff always depended on her to find (paper) documents which nobody else could find. The same staff apparently saw no reason why they couldn’t depend on her to find that vital email, spreadsheet, or Word document. To put it another way, they expected the “magic” of the well-organised archive to pass directly into a digital environment. My guess is that they expected that “magic” to take effect without anyone lifting a finger or expending any effort on good naming, filing, or metadata assignment. But all of that is hard work.

What’s so great about archivists?

My question to Catherine was to do with the place of archival skills in digital preservation, and how I feel they can sometimes be neglected or overlooked in many digital preservation projects. Possible scenario is that the “solution” we purchase is an IT system, so its implementation is in the hands of IT project managers. Archivists might be consulted as project suppliers; more often, I fear they are ignored, or don’t speak up.

Catherine’s reply affirmed the value of such skills as selection and appraisal, which she believes have a role to play in assessing the overload of digital content and reducing duplication.

After the conference, I wondered to myself what other archival skills or weapons in the toolbox might help with digital preservation. A partial tag cloud might look like this:

We’ve got an app for that

What tools and methods do technically-able people reach for to address issues associated with the “help us to find stuff” problem? Perhaps…

Automated indexing of metadata, where the process is operated by machines on machine-readable text.
Using default metadata fields – by which I mean properties embedded in MS Word documents. These can be exposed, made sortable and searchable; SharePoint has made a whole “career” out of doing that.
Network drives managed by IT sysadmins alone – which can include everything from naming to deletion (but also backing up, of course).
De-duplication tools that can automatically find and remove duplicate files. Very often, they’re deployed as network management tools and applied to resolve what is perceived as a network storage problem. The way they work is based on recognition of checksum matches or similar rules.
Search engines – which may be powerful, but not very effective if there’s nothing to search on.
Artificial Intelligence (AI) tools which can be “trained” to recognise words and phrases, and thus assist (or even perform) selection and appraisal of content on a grand scale.

Internal user behaviours

There are some behaviours of our beloved internal staff / users which arguably contribute to the digital preservation problem in the long-term. They could all be characterised as “neglect”. They include:

Keeping everything – if not instructed to do otherwise, and there’s enough space to do so.
Free-spirited file naming and metadata assignment.
Failure to mark secure emails as secure – which is leading to a retrospective correction problem for large government archives now.

I would contend that a shared network run on an IT-only basis, where the only management and ownership policies come from sysadmins, is likely to foster such neglect. Sysadmins might not wish to get involved in discussions of meaning, context, or use of content.

How to restore the “magic”?

I suppose we’d all love to get closer to a live network, EDRMS, or digital archive where we can all find and retrieve our content. A few suggestions occur to me…

Collaboration. No archivist can solve this alone, and the trend of many of the talks at PASIG was to affirm that collaboration between IT, storage, developers, archivists, librarians and repository managers is not only desirable – it is in fact the only way we’ll succeed now. This might be an indicator of how big the task is ahead of us. The 4C Project said as much.
Archivists must change and grow. Let’s not “junk” our skillsets; for some reason, I fear that we are encouraged not to tread on IT ground, start to assume that machines can do everything we can do, and that our training is worthless. Rather, we must engage with what IT systems, tools and applications can do for us, how they can help us realise the results in that word cloud.
Influence and educate staff and other users. And if we could do it in a painless way, that would be one miracle cure that we’re all looking for. On the other hand, Catherine’s plan to integrate SharePoint with Preservica (with the help of the latter) is one move in the right direction: for one thing, she’s finding that the actual location of digital objects doesn’t really matter to users, so long as the links work. For reasons I can’t articulate right now, this strikes me as a significant improvement on a single shared drive sitting in a building.

Conclusion

I think archivists can afford to assert their professionalism, make their voice a little louder, where possible stepping in at in all stages of the digital preservation narrative; at the same time, we mustn’t cling to the “old ways”, rather start to discover ways in which we can update them. John Sheridan of The National Archives has already outlined an agenda of his own to do just this. I would like to see this theme taken up by other archivists, and propose a strand along these lines for discussion at the ARA Conference.

Conference on web-archiving, part 3: Albany and web-archives as Institutional Records

In my last in this series of blog posts about IIPC 2017, I’ll look at the work of Gregory Wiedeman at Albany SUNY.

He is doing two things that are sure to gladden the heart of any archivist. First, he is treating his Institutional web archives as records that require permanent preservation. Secondly, he is attempting to apply traditional archival arrangement and description to his web captures. His experiences have taught him that neither one of these things is easy.

Firstly, I’m personally delighted to hear someone say that web archives might be records; I would agree. One reason I like it is because the way mainstream web-archiving seems to have evolved is in favour of something akin to the “library” model – where a website is treated as a book, with a title, author and subject. For researchers, that might be a model that’s more aligned to their understanding and methods. Not that I am against it; I am just making an observation.

I first heard Seamus Ross make the observation “can websites be records?” some 11 years ago, and I think it is a useful way of regarding certain classes of web content, which I would encourage. When I worked on the Jisc PoWR project, one of the assumptions I made was that a University would be using its website to store, promote or promulgate content directly relevant to its institutional mission and functions. In doing this, a University starts to generate born-digital records, whether as HTML pages or PDF attachments. What I am concerned with is when these are the only copies of such records. Yet quite often we find that the University archivist is not involved in their capture, storage, or preservation.

The problem becomes even more poignant when we see how important record/archival material that was originally generated in paper form, such as a prospectus, is shifting over to the web. There may or may not be a clear cut-off point for when and how this happens. The archivist may notice that they aren’t receiving printed prospectuses any more. Who is the owner of the digital version, and how can we ensure the pages aren’t simply disposed of by the web master when expired? Later at the RESAW conference I heard a similar and even more extreme example of this unhappy scenario, from Federico Nanni and his attempts to piece together the history of the University of Bologna website.

However, none of this has stopped Gregory Wiedeman from performing his duty of archival care. He is clear: Albany is a public university, subject to state records laws; therefore, certain records must be kept. He sees the continuity between the website and existing collections at the University, even to the point where web pages have their paper equivalents; he is aware of overlap between existing printed content and web content; he knows about embedded documents and PDFs on web pages; and is aware of interactive sites, which may create transient but important records through interactions.

In aligning University web resources with a records and archives policy, Wiedeman points out one significant obstacle: a seed URL, which is the basis of web capture in the first place, is not guaranteed to be a good fit for our existing practices. To put it another way, we may struggle to map an archived website or even individual pages from it to an archival Fonds or series, or to a record series with a defined retention schedule.

Nonetheless, Wiedeman has found that traditional archives theory and practice does adapt well to working with web archives, and he is addressing such key matters as retaining context, the context of attached documents, the relationship of one web page to another, and the history of records through a documented chain of custody – of which more below.

When it comes to describing web content, Wiedeman uses the American DACS Standard, which is a subset of ISAD(G). With its focus on the intellectual content rather than the individual file format, he has found this works for large scale collections and granular access to them. His cataloguing tool is ArchivesSpace, which is DACS compliant, and which is capable of handling aggregated record collections. The access component to ArchivesSpace is able to show relations between record collections, making context visible, and showing a clear link between the creating organisation and the web resources. Further, there are visible relations between web records and paper records, which suggests Wiedeman is on the way to addressing the hybrid archive conundrum faced by many. He does this, I suggest, by trusting to the truth of the archival Fonds, which continues to exert a natural order on the archives, in spite of the vagaries of website structures and their archived snapshots.

It’s in the fine detail of capture and crawling that Wiedeman is able to create records that demonstrate provenance and authenticity. He works with Archive-It to perform his web crawls; the process creates a range of technical metadata about the crawl itself (type of crawl, result, start and end dates, recurrence, extent), which can be saved and stored as a rather large JSON file. Wiedeman retains this, and treats it as a provenance record, which makes perfect sense; it contains hard (computer-science) facts about what happened. This JSON output might not be perfect, and at time of writing Albany don’t do more than retain it and store it; there remains developer work to be done on parsing and exposing the metadata to make it more useful.

Linked to this, he maintains what I think is a stand-alone record documenting his own selection decisions, as to the depth and range of the drawl; this is linked to the published collection development policy. Archivists need to be transparent about their decisions, and they should document their actions; users need to know this, in order to make any sense of the web data. None of these concepts are new to the traditional archivist, but this is the first time I have heard the ideas so well articulated in this context, and applied so effectively to collections management of digital content.

Gregory’s work is described at https://github.com/UAlbanyArchives/describingWebArchives

Conference on web-archiving, part 2: Ian Milligan and his derived datasets

Web-archiving is a relatively young side of digital archiving, yet it has already established a formidable body of content across the world, a large corpus of materials that could be mined by researchers and historians to uncover interesting trends and patterns about the 20th and 21st centuries. One advocate and enthusiast for this is Ian Milligan from University of Waterloo’s Faculty of Arts.

His home country has many excellent web archive collections, but he feels they are under-used by scholars. One problem is that scholars might not even know the web archives exist in the first place. The second problem is many people find web archives really hard to use; quite often, search engines which interrogate the corpus don’t really match the way that a scholar wishes to retrieve information. At a very simple practical level, a search can return too many hits, the hitlist appears to be unsorted, and the results are difficult to understand.

Milligan is personally concerned at the barriers facing academics, and he’s actively working to make it easier, devising a way of serving historic web archives in ways that doesn’t require massive expertise. His project Web Archives for Longitudinal Knowledge (WALK) is aiming to create a centralised portal for access to web content. The main difference to most such approaches which I’ve seen is that he does it by building derived datasets.

As I understand it, a derived dataset is a new assemblage of data that’s been created out of a web archive. To put this in context, it might help to understand the basic building block of web-archiving is a file called a WARC. A WARC is a file format, of which the contents are effectively a large chunk of code that represents the harvesting session, all the links visited, the responses, and a representation of the web content. If you wanted to replay the WARC so that it looks like the original website, then you’d feed it to an instance of the Wayback Machine, which is programmed to read the WARC and serve it back as a rendition of the original web page.

However, Milligan is more interested in parsing WARCS. He knows they contain very useful strings of data, and he’s been working for some time on tools to do the parsing. He’s interested in text strings, dates, URLs, embedded keywords and names, and more. One such tool is Warcbase, part of this WALK project. Broadly, the process is that he would transfer data from a web archive in WARC form, and use Warcbase to create scholarly derivatives from that WARC automatically. When the results are uploaded to the Dataverse platform, the scholar now has a much more user-friendly web-archive dataset in their hands. The process is probably far more elaborate than I am making it sound, but all I know is that simple text searches are now much more rewarding and focussed; and by using a graphical interface, it’s possible to build visualisations out of data.

A few archival-ish observations occur to me about this.

What about provenance and original order? Doesn’t this derived dataset damage these fundamental “truths” of the web crawl? Well, let’s remember that the derived dataset is a new thing, another digital object; the “archival original” WARC file remains intact. If there’s any provenance information about the date and place of the actual website and the date of the crawl, that won’t be damaged in any way. If we want paper analogues, we might call this derived dataset a photocopy of the original; or perhaps it’s more like a scrapbook, if it’s created from many sources.
That made me wonder if the derived dataset could be considered a Dissemination Information Package in OAIS Reference Model terms, with the parent WARC or WARCs in the role of the Archival Information Package. I’d better leave it at that; the terms “OAIS conformance” and “web-archiving” don’t often appear in the same sentence in our community.
It seems to me rather that what Milligan is doing is exploiting the versatility of structured data. If websites are structured, and WARCs are structured, why not turn that to our advantage and see if we can create new structures? If it makes the content more accessible to otherwise alienated users, then I’m all for it. Instead of having to mine a gigantic quarry of hard granite, we have manageable building blocks of information carved out of that quarry, which can be used as needed.
The other question that crossed my mind is “how is Ian and his team deciding what information to put in these derivatives?” He did allude to that fact that they are “doing something they think Humanities Scholars would like”, and since he himself is a Humanities scholar, he has a good starting point. Scholars hate a WARC, which after all isn’t much more than raw data generated by a web crawl, but they do like data arranged in a CSV file, and text searches with meaningful results.
To play devil’s advocate, I suspect that a traditional archivist would recoil from any approach which appears to smack of bias; our job has usually been to serve the information in the most objective way possible, and the actions of arrangement and cataloguing are intended to preserve the truth of original order of the resource, and to help the user with neutral finding aids that steer them through the collection. If we do the work of creating a derived dataset, making decisions in advance about date ranges, domains, and subjects, aren’t we somehow pre-empting the research?

This may open up bigger questions than can be addressed in this blog post, and in any case I may have misunderstood Milligan’s method and intention, but it may have implications for the archive profession and how we process and work with digital content on this scale.