The meaning of Curation: Data Curation

Unlike the meaning of the term archive, I think curation is a ‘doing’ word that describes the actions of an information professional working with content or resources. The term has been applied to many sectors recently, and I sense its meaning may have come adrift from its more traditional interpretation – the work of a museum curator or an art curator, for instance. For this reason, this post will be the first of a series as we try to move towards a disambiguation of the term.

When it comes to ‘digital’, the curation term has been commonly applied in the field of research data, but it may also have some specialist uses in the field of digital preservation (for instance, the University of North Carolina at Chapel Hill offers a Certificate in Digital Curation). In this post, however, I will look at the term as it’s been applied to research data.

Research data is one of the hot topics in digital preservation just now. In the UK at least, universities are working hard at finding ways to make their datasets persist, for all kinds of reasons – compliance with research council and funder requirements, conformance with the REF (Research Excellence Framework), and other drivers (legal, reputational, etc.). The re-use of data – the ability to repurpose datasets in the future – is the very essence of research data, and this need is what makes it distinct from other digital preservation projects. This is precisely where data curation has a big part to play. In no particular order, here’s a partial list:

1. Curation provides meaningful access to data. This could be cataloguing, using structured hierarchies, standards, common terms, defined ontologies, vocabularies, thesauri. All of these could derive from library and archive standards, but the research community also has its own set of subject-specific and discipline-specific vocabularies, descriptive metadata standards and agreed thesauri. It could also involve expressing that catalogue in the form of metadata (standards, again); and operating that metadata in a system, such as a CMS or Institutional Repository software. The end result of this effort ought to be satisfied end users who can discover, find, use, and interpret the dataset.

If further unpicking is needed, I could regard those as three different (though related) skills; a skilled cataloguer doesn’t necessarily know how to recast their work into EAD or MARC XML, and may rely on a developer or other system to help them do that. On the other hand, those edges are always blurring; institutional repository software (such as EPrints) was designed to empower individual users to describe their own deposited materials, working from pre-defined metadata schemas and using drop-down menus for controlled vocabularies.
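
To make the step from cataloguing to machine-readable metadata concrete, here is a minimal sketch in Python. It is not tied to any particular repository system; the dataset, field values, and identifier are invented for illustration, and the field names follow the Dublin Core element set.

```python
# Express a catalogue entry as a simple Dublin Core-style XML record.
# All values below are invented examples, not a real dataset.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = {
    "title": "River flow measurements, 2010-2015 (example dataset)",
    "creator": "Example Research Group",
    "subject": "hydrology",          # ideally drawn from a controlled vocabulary
    "description": "Daily gauge readings collected for an example project.",
    "identifier": "example-dataset-0001",
    "rights": "CC BY 4.0",
}

root = ET.Element("metadata")
for field, value in record.items():
    element = ET.SubElement(root, f"{{{DC_NS}}}{field}")
    element.text = value

print(ET.tostring(root, encoding="unicode"))
```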

2. Curation provides enduring access to data. This implies that the access has to last for a long time. One way of increasing your chances of longevity is by working with other institutions, platforms, and collaborators. Curation may involve applying agreed interoperability standards, such as METS, an XML standard for packaging metadata so that your work can be shared with other systems (not just other human beings). Since it involves machines talking to each other, I’ve tended to regard interoperability as a step beyond cataloguing.
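
As an illustration of what such an interoperability standard looks like, here is a much-simplified skeleton of the kind of XML document METS defines: a descriptive metadata section, a list of files, and a structural map tying them together. The element names follow the METS schema, but the file names are invented and the skeleton has not been validated against the schema.

```python
# Print a skeletal METS-style document: descriptive metadata, file list,
# and structural map. Illustrative only; not a schema-validated METS file.
METS_SKELETON = """\
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <dmdSec ID="dmd-1">
    <!-- descriptive metadata (e.g. a Dublin Core record) goes here -->
  </dmdSec>
  <fileSec>
    <fileGrp>
      <file ID="file-1" MIMETYPE="text/csv">
        <FLocat LOCTYPE="URL" xlink:href="data/readings.csv"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="dataset" DMDID="dmd-1">
      <fptr FILEID="file-1"/>
    </div>
  </structMap>
</mets>
"""

print(METS_SKELETON)
```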

Another aspect of enduring access is the use of UUIDs – Universally Unique Identifiers. If I make a request through a portal or UI, I will get served something digital – a document, image, or data table. For that to happen, we need UIDs or UUIDs; they are what allow a system to “retrieve” a digital object from the server. We could call that another part of curation, a skill that must involve a developer somewhere in the service, even if the process of UID creation ends up being automated. You could regard the UID as technical metadata, but the main thing is making the system work with machine-readable elements; it’s not the same as “meaningful access”. UUIDs do it for digital objects; there’s also the ORCID system, which does it for individual researchers. Other instances, which are even more complex, involve minting DOIs for papers and datasets, making citations “endure” on the web to some degree.
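
Here is a minimal sketch of what identifier assignment might look like in practice. The storage path and the in-memory register are purely hypothetical stand-ins; a real repository would persist this mapping in a database, but the principle of retrieving an object by its UUID rather than by name or location is the same.

```python
# Assign a UUID to each deposited object and use it for retrieval.
# The "register" is just an in-memory dict for illustration.
import uuid

object_register = {}  # maps identifier -> storage path (hypothetical)

def deposit(file_path):
    """Assign a UUID to a deposited file and record it in the register."""
    object_id = str(uuid.uuid4())   # random (version 4) UUID
    object_register[object_id] = file_path
    return object_id

def retrieve(object_id):
    """Look a file up by its identifier rather than by name or location."""
    return object_register[object_id]

new_id = deposit("datasets/river-flow-2010-2015.csv")  # invented path
print(new_id, "->", retrieve(new_id))
```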

3. Curation involves organisation of data. This one is close to my archivist heart. It implies constructing a structure that sorts the data into a meaningful retrieval system and gives us intellectual control over lots of content. An important part of organisation for data is associating the dataset or datasets with the relevant scholarly publications, and other supporting documentation such as research notes, wikis, and blogs.

In the old days I would have called this building a finding aid, and invoked accessioning skills such as archival arrangement – “sorting like with like” – so that the end user would have a concise and well organised finding aid to help them understand the collection. The difference is that now we might do it with tools such as directory trees, information packages, aggregated zip or tar files, and so on. We still need the metadata to complete the task (see above) but this type of “curation” is about sorting and parsing the research project into meaningful, accessible entities.
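
As a small illustration of this kind of sorting in the digital world, the sketch below gathers an invented research project into a predictable directory layout and wraps it in a single compressed package. The folder names and files are made up for the example, and this is not a formal packaging standard (approaches such as BagIt or OAIS information packages go considerably further).

```python
# Arrange project files "like with like" and wrap them as one package.
# Layout and file names are invented; placeholder files stand in for content.
import tarfile
from pathlib import Path

package_root = Path("example-project-package")
layout = {
    "data": ["readings.csv"],
    "documentation": ["methodology.txt", "README.txt"],
    "publications": ["preprint.pdf"],
}

# Build the directory tree.
for folder, files in layout.items():
    target = package_root / folder
    target.mkdir(parents=True, exist_ok=True)
    for name in files:
        (target / name).touch()

# Aggregate the whole arrangement into a single compressed package.
with tarfile.open("example-project-package.tar.gz", "w:gz") as tar:
    tar.add(str(package_root))

print("Package written: example-project-package.tar.gz")
```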

If we get this part of curation right, we are helping future use and re-use of the dataset. If we can capture the outputs of any secondary research, they stand a better chance of being associated with the original dataset.

4. Curation is a form of lifecycle management. There is a valid interpretation of data curation that claims “Data curation is the management of data throughout its lifecycle, from creation and initial storage, to the time when it is archived for posterity or becomes obsolete and is deleted.” I would liken this to an advanced form of records management, a profession that already understands how lifecycles work and has workflows and tools for dealing with them. It’s a question of working out how to intervene, and when; if this side of curation means speaking to a researcher about their record-keeping as soon as they get their grant, then I’m all for it.
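
To show the sort of rule that records management makes explicit, here is a toy sketch of a disposition decision based on a record’s age and assessed value. The three-year threshold and the categories are invented purely for illustration; a real retention schedule would be agreed with the institution and the researcher.

```python
# A toy disposition rule: keep recent records live, send valuable older
# records to archival storage, and flag the rest for review. Thresholds
# and categories are invented for illustration only.
from datetime import date

def lifecycle_action(created, has_long_term_value):
    """Return a purely illustrative disposition decision for a record."""
    age_in_years = (date.today() - created).days / 365.25
    if age_in_years < 3:
        return "keep on the live system"
    if has_long_term_value:
        return "transfer to archival storage"
    return "review for deletion"

print(lifecycle_action(date(2015, 6, 1), has_long_term_value=True))
```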

5. Curation provides for re-use over time through activities including authentication, archiving, management, preservation, and representation. While this definition may seem to involve a large number of activities, most of them are already covered by what we would do as part of “digital preservation”, especially as defined by the OAIS Reference Model. The main emphasis for this class of resource, however, is “re-use”. What this means, and the problems of creating a re-usable dataset (i.e. a dataset that could be repurposed by another system), are too deep for this blog post, but they go beyond the idea that we could merely create an access copy.

Authentication is another disputed area, but I would like to think that proper lifecycle management (see above) would go some way to building a reliable audit trail that helps authentication; likewise the correct organisation of data (see above) will add context, history and evidence that situates the research project in a particular place and time, with an identifiable owner, which further supports authentication.

To conclude this brief overview I make two observations:

  • Though there is some commonality among the instances I looked at, there is apparently no single shared understanding of what “data curation” means within the HE community; some views will tend to promote one aspect over another, depending on their content type, collections, or user community.
  • All the definitions I looked at tend to roll all the diverse actions together in a single paragraph, as if they were related parts of the same thing. Are they? Does it imply that data curation could be done by a single professional, or do we need many skillsets contributing to this process for complete success?

The meaning of the term Archive

In this blog post I should like to disambiguate uses of the word “archive”. I have found the term is often open to misunderstandings and misinterpretation. Since I come from a traditional archivist background, I will begin with a definition whose meaning is clear to me.

At any rate, it is a definition that pre-dates computers, digital content, and the internet; the arrival of these technologies has brought us new, ambiguous meanings of the term. Some instances of this follow below. In each instance, I will be looking for whether these digital “archives” imply or offer “permanence”, a characteristic I would associate with a traditional archive.

    1. In the paper world: an archive is any collection of documents needed for long-term preservation, e.g. for historical, cultural heritage, or business purposes. It can also mean the building where such documents are permanently stored, in accordance with archival standards, or even the memory institution itself (e.g. The National Archives).
    2. In the digital world: a “digital archive” ought to refer to a specific function of a much larger process called digital preservation. This offers permanent retention, managed storage, and a means of keeping content accessible in the long term. The organisation might use a service like this for keeping content that has no current business need, but is still needed for historical or legal reasons. Therefore, the content is no longer held on a live system.
      The OAIS Reference Model uses the term “Archival Storage” to describe this, and calls it a Functional Entity of the Model; this means it can apply to the function of the organisation that makes this happen, the system that governs it, or the servers where the content is actually stored. More than just storage, it requires system logging, validation, and managed backups on a scale and frequency that exceed the average network storage arrangement. The outcome of this activity is long-term preservation of digital content.
    3. In the IT world: a sysadmin might identify a tar, zip or gz file as an “archive”. This is an accumulation of multiple files within a single wrapper. The wrapper may or may not perform a compression action on the content. The zipped “archive” is not necessarily being kept; the “archiving” action is simply the act of zipping or compressing the files.
    4. On a blog: a blog platform, such as WordPress or Google Blogger, organises its pages and posts according to date-based rules. WordPress automatically groups the content into monthly and annual listings. These listings are often called “archives”, and the word itself appears on the published blog page. In this context the word “archives” simply designates “non-current content”, in order to distinguish it from this month’s current posts. This “archive” is not necessarily backed up, or preserved; and in fact it is still accessible on the live blog.
    5. In network management: the administrator backs up content from the entire network on a regular basis. They might call this action “archiving”, and may refer to the data, the tapes/discs on which the data are stored, or even the server room as the “archive”. In this instance, it seems to me the term is used to distinguish the backups from the live network. In the event of a failure (e.g. accidental data deletion, or the need for a system restore), they would retrieve the lost data from the most recent “archive”. However: none of these “archives” are ever kept permanently. Rather, they are subject to a regular turnover and refreshment programme, meaning that the administrator only ever retains a few weeks or months of backups (a short sketch of this rotation logic follows after this list).
    6. Cloud storage services may offer services called “Data Archive” or “Cloud Archive”. In many cases this service performs the role of extended network storage, except that it might be cheaper than storing the data on your own network. Your organisation might also decide to use this cheaper method to store “non-current” content. In neither case is the data guaranteed to be preserved permanently, unless the provider explicitly states it is, or the provider is using cloud storage as part of a larger digital preservation approach.
    7. For emails: MS Outlook has a feature called AutoArchive. When run, this routine will move emails to an “archive” directory, based on rules (often associated with the age of the email) which the user can configure. The action also does a “clear out”, i.e. a deletion, of expired content, again based on rules. There is certainly no preservation taking place. This “AutoArchive” action is largely about moving email content from one part of the system to another, in line with rules. I believe a similar principle has been used to “archive” a folder or list in SharePoint, another Microsoft product. Some organisations scale up this model for email, and purchase enterprise “mail archiving” systems which apply similar age-based rules to the entire mail server. Unless explicitly applied as an additional service, no preservation takes place here either, just data compression to save space.
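
The rotation logic mentioned in point 5 might look something like the sketch below: anything older than the retention window is simply deleted, which is exactly why this kind of “archive” offers no permanence. The backup location, naming pattern and sixty-day window are invented for illustration.

```python
# Delete backups older than the retention window. Everything here
# (directory, pattern, retention period) is an invented example.
import time
from pathlib import Path

RETENTION_DAYS = 60
backup_dir = Path("/backups")   # hypothetical backup location

cutoff = time.time() - RETENTION_DAYS * 24 * 60 * 60
for backup in backup_dir.glob("*.tar.gz"):
    if backup.stat().st_mtime < cutoff:
        backup.unlink()         # expired backups are simply removed
        print("Deleted expired backup:", backup)
```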

    To summarise:

    • The term “archive” has been used in a rather diffuse manner in the IT and digital worlds, and can mean variously “compression”, “aggregated content”, “backing up”, “non-current content”, and “removal from the live environment”. While useful and necessary, none of these are guaranteed to offer the same degree of permanence as digital preservation. Of these examples, only digital preservation (implementation of which is a complex and non-trivial task) offers permanent retention, protection, and replayability of your assets.
    • If you are an archivist, content owner, or publisher: when dealing with vendors, suppliers, or IT managers, be sure you take the time to discuss and understand what is meant by the term “archive”, especially if you’re purchasing a service that includes the term in some way.

How to capture and preserve electronic newsletters in HE and beyond

This blog post is based on a real-world case study. It happens to have come from a UK Higher Education institute, but the lessons here could feasibly apply to anyone wishing to capture and preserve electronic newsletters.

The archivist reported that the staff newsletter started to manifest itself in electronic form “without warning”. Presumably they’d been collecting the paper version successfully for a number of years, then this change came along. The change was noticed when the archivist (and all staff) received the Newsletter in email form. The archivist immediately noticed the email was full of embedded links, and pictures. If this was now the definitive and only version of the newsletter, how would they capture it and preserve it?

I asked the archivist to send me a copy of the email, so I could investigate further.

It turns out the Newsletter in this case is in fact a website, or a web-based resource. It’s being hosted and managed by a company called Newsweaver, a communications software company that specialises in generating electronic newsletters and providing the means for their dissemination. They do it for quite a few UK Universities; for instance, the University of Manchester resource can be seen here. In this instance, the email noted above is simply a version of the Newsletter page, slightly recast and delivered in email form. By following the links in the example, I was soon able to see the full version of that issue of the Newsletter, and indeed the entire collection (unhelpfully labelled an “archive” – but that’s another story).

What looked at first like an email capture-and-preservation issue is more likely a case calling for a web-archiving action. Only through web-archiving would we get the full functionality of the resource. The email, for instance, contains links labelled “Read More”, which when followed take us to the parent Newsweaver site. If we simply preserved the email, we’d only have a cut-down version of the Newsletter; more importantly, the links would not work if Newsweaver.com became unavailable, or changed its URLs.

Since I am familiar with the desktop web-archiving tool HTTrack, I tried an experiment to see if I could capture the online Newsletter from the Newsweaver host. My gather failed the first time, because the resource is protected by the site’s robots.txt (more on this below), but a second gather worked when I instructed the web harvester to ignore the robots.txt file.

My trial brought in about 500-600MB of content after one hour of crawling – there is probably more content, but I decided to terminate it at that point. I now had a working copy of the entire Newsletter collection for this University. In my version, all the links work, the fonts are the same, the pictures are embedded. I would treat this as a standalone capture of the resource, by which I mean it is no longer dependent on the live web, and works as a collection of HTML pages, images and stylesheets, and can be accessed and opened by any browser.
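
For anyone wanting to script a capture like this, the sketch below shows roughly how I would drive HTTrack from Python. The URL is a placeholder, and the flags reflect my reading of HTTrack’s command-line options (-O sets the output directory, -s0 tells it not to obey robots.txt); check httrack --help on your own installation before relying on them.

```python
# Run an HTTrack gather into a dated folder, ignoring robots.txt.
# The target URL is a placeholder; verify the flags against your HTTrack docs.
import subprocess
from datetime import date

TARGET_URL = "https://example.newsweaver.com/universitynewsletter"  # placeholder
output_dir = f"newsletter-capture-{date.today().isoformat()}"       # dated snapshot

subprocess.run(
    ["httrack", TARGET_URL, "-O", output_dir, "-s0"],
    check=True,
)
print("Snapshot written to", output_dir)
```

Running a gather like this into a new dated folder each time is also a cheap way of keeping the regular snapshots discussed below distinct from one another.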

Of course, it is only a snapshot. To be completely successful, a capture and archiving strategy would need to run a gather like this on a regular basis, to capture new content as it is published. Perhaps once a year would do it, or every six months. If that works, it can be the basis of a strategy for the digital preservation of this newsletter.

Such a strategy might evolve along these lines:

  • Archivist decides to include electronic newsletters in their Selection Policy. Rationale: the University already has them in paper form. They represent an important part of University history. The collection should continue for business needs. Further, the content will have heritage value for researchers.
  • University signs up to this strategy. Hopefully, someone agrees that it’s worth paying for. The IT server manager agrees to allocate 600MB of space (or whatever) per annum for the storage of these HTTrack web captures. The archivist is allocated time from an IT developer, whose job it is to configure HTTrack and run the capture on a regular basis.
  • The above process is expressed as a formal workflow, or (to use terms an archivist would recognise) a Transfer and Accession Policy. With this agreement, names are in the frame; tasks are agreed; dates for when this should happen are put into place. The archivist doesn’t have to become a technical expert overnight, they just have to manage a Transfer / Accession process like any other.
  • Since they are “snapshots”, the annual web crawls could be reviewed – just like any other deposit of records. A decision could be made as to whether they all need to be kept, or whether it’s enough to just keep the latest snapshot. Periodic review lightens the burden on the servers.

This isn’t yet full digital preservation – it’s more about capture and management. But at least the Newsletters are not being lost. Another, later, part of the strategy is for the University to decide how it will keep these digital assets in the long term, for instance in a dedicated digital preservation repository – a service which the University might not be able to provide themselves, or even want to. But it’s a first step towards getting the material into a preservable state.

There are some other interesting considerations in this case:

The content is hosted by Newsweaver, not by the University. The name of the Institution is included in the URL, but it’s not part of the ac.uk estate. This means that an intervention is most certainly needed if the University wants to keep the content long-term. It’s not unlike the Flickr service, which merely acts as a means of hosting and distributing your content online. For the above proposed strategy to work, the archivist would probably need to speak to Newsweaver, and advise them of the plan to make annual harvests. There would need to be an agreement that robots.txt is disabled or ignored, or the harvest won’t work. There may be a way to schedule the harvest at an ideal time that won’t put undue stress on the servers.

Newsweaver might even wish to co-operate with this plan; maybe they have a means for exporting content from the back-end system that would work just as well as this pull-gather method, but then it’s likely the archivist would need additional technical support to take it further. I would be very surprised if Newsweaver claimed any IP or ownership of the content, but it would be just as well to ascertain what’s set out in the contract with the company. This adds another potential stakeholder to the mix: the editorial team who compile the University Newsletter in the first place.

Operating HTTrack may seem like a daunting prospect to an archivist. There is a simpler option, which would be to use PDFs as a target format for preservation. One approach would be to print the emails to PDFs, an operation which could be done direct from the desktop with minimal support, although a licensed copy of Adobe Acrobat would be needed. Even so, the PDF version would disappoint very quickly; the links wouldn’t work as standalone links, and would point back to the larger Newsweaver collection on the live web. That said, a PDF version would look exactly like the email version, and PDF would be more permanent than the email format.

The second PDF approach would be to capture pages from Newsweaver using Acrobat’s “Create PDF from Web Page” feature. This would yield a slightly better result than the email option above, but the links would still fail. For the full joined-up richness of the highly cross-linked Newsletter collection, web-archiving is still the best option.

To summarise the high-level issues, I suggest an Archivist needs to:

  • Define the target of preservation. In this case we thought it was an email at first, but it turns out the target is web content hosted on a domain not owned by the University.
  • Define the aspects of the Newsletter which we want to survive – such as links, images, and stylesheets.
  • Agree and sign off a coherent selection policy and transfer procedure, and get resources assigned to the main tasks.
  • Assess the costs of storing these annual captures, and tell your IT manager what you need in terms of server space.

If there’s a business case to be made to someone, the first thing to point out is the risk of leaving this resource in the hands of Newsweaver, who are great at content delivery, but may not have a preservation policy or a commitment to keep the content beyond the life of the contract.

This approach has some value as a first step towards digital preservation; it gets the archivist on the radar of the IT department, the policy owners, and finance, and wakes up the senior University staff to the risks of trusting third parties with your content. Further, if successful, it could become a staff-wide policy that individual recipients of the email can, in future, delete their copies in the knowledge that the definitive resource is being safely captured and backed up.