The meaning of Curation: Data Curation

Unlike the term archive, I think curation is a ‘doing’ word that describes the actions of an information professional working with content or resources. The term has been applied to many sectors recently, and I sense its meaning may have come adrift from its more traditional interpretation – for instance, a museum curator or an art curator. For this reason, this post will be the first of a series as we try to move towards a disambiguation of the term.

When it comes to ‘digital’, the curation term has been commonly applied in the field of research data, but it may also have some specialist uses in the field of digital preservation (for instance, the University of North Carolina at Chapel Hill offers a Certificate in Digital Curation). In this post, however, I will look at the term as it’s been applied to research data.

Research data is one of the hot topics in digital preservation just now. In the UK at least, universities are working hard at finding ways to make their datasets persist, for all kinds of reasons – compliance with research council requirements, funder requirements, conformance with the Research Excellence Framework (REF), and other drivers (legal, reputational, etc.). The re-use of data, that is, the repurposing of datasets in the future, is the very essence of research data, and this need is what makes it distinct from other digital preservation projects. This is precisely where data curation has a big part to play. In no particular order, here’s our partial list:

1. Curation provides meaningful access to data. This could be cataloguing, using structured hierarchies, standards, common terms, defined ontologies, vocabularies, thesauri. All of these could derive from library and archive standards, but the research community also has its own set of subject-specific and discipline-specific vocabularies, descriptive metadata standards and agreed thesauri. It could also involve expressing that catalogue in the form of metadata (standards, again); and operating that metadata in a system, such as a CMS or Institutional Repository software. The end result of this effort ought to be satisfied end users who can discover, find, use, and interpret the dataset.

If further unpicking is needed, I could regard those as three different (though related) skills; a skilled cataloguer doesn’t necessarily know how to recast their work into EAD or MARC XML, and may rely on a developer or other system to help them do that. On the other hand, those edges are always blurring; institutional repository software (such as EPrints) was designed to empower individual users to describe their own deposited materials, working from pre-defined metadata schemas and using drop-down menus for controlled vocabularies.
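To make the cataloguer/developer distinction concrete, here is a minimal sketch (Python, standard library only) of recasting a catalogue record into simple Dublin Core XML. The field names and values are hypothetical, and a real workflow would use a fuller schema such as EAD or MARC XML.

```python
# A minimal sketch of recasting a catalogue record into simple Dublin
# Core XML; the record's fields and values are hypothetical examples.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = {  # hypothetical catalogue entry
    "title": "River temperature dataset, 2010-2015",
    "creator": "Dr A. Researcher",
    "subject": "hydrology",
    "type": "Dataset",
}

root = ET.Element("metadata")
for field, value in record.items():
    el = ET.SubElement(root, f"{{{DC_NS}}}{field}")
    el.text = value

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

The point is less the XML itself than the division of labour: the cataloguer supplies the record, while the schema mapping is the kind of step a developer or repository system typically automates.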

2. Curation provides enduring access to data. This implies that the access has to last for a long time. One way of increasing your chances of longevity is by working with other institutions, platforms, and collaborators. Curation may involve applying agreed interoperability standards, such as METS, an XML standard that allows you to share your work with other systems (not just other human beings). Since it involves machines talking to each other, I’ve tended to regard interoperability as a step beyond cataloguing.
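As a rough illustration of what a METS wrapper looks like from the developer’s side, here is a skeletal sketch built with Python’s standard library. A real METS document carries many more required attributes (for instance, file locations use xlink:href) and would be validated against the official METS schema; the file names and IDs below are invented.

```python
# A skeletal sketch of a METS wrapper linking a structural map to a
# file entry; deliberately simplified and not schema-complete.
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def m(tag, parent=None, **attrs):
    """Create a METS-namespaced element, optionally under a parent."""
    name = f"{{{METS_NS}}}{tag}"
    if parent is None:
        return ET.Element(name, attrs)
    return ET.SubElement(parent, name, attrs)

mets = m("mets")
file_sec = m("fileSec", mets)
grp = m("fileGrp", file_sec, USE="original")
f = m("file", grp, ID="FILE001")
m("FLocat", f, LOCTYPE="URL", href="data/results.csv")  # simplified; real METS uses xlink:href

struct = m("structMap", mets, TYPE="physical")
div = m("div", struct, LABEL="Dataset")
m("fptr", div, FILEID="FILE001")  # pointer ties the structure to the file

xml_out = ET.tostring(mets, encoding="unicode")
print(xml_out)
```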

Another aspect of enduring access is the use of UUIDs – Universally Unique Identifiers. If I make a request through a portal or UI, I will get served something digital – a document, image, or data table. For that to happen, we need UIDs or UUIDs; it’s the only way a system can “retrieve” a digital object from the server. We could call that another part of curation, a skill that must involve a developer somewhere in the service, even if the process of UID creation ends up being automated. You could regard the UID as technical metadata, but the main thing is making the system work with machine-readable elements; it’s not the same as “meaningful access”. UUIDs do it for digital objects; there’s also the ORCID system, which does it for individual researchers. Other instances, which are even more complex, involve minting DOIs for papers and datasets, making citations “endure” on the web to some degree.
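For illustration, minting a UUID is close to a one-liner in most languages. This Python sketch shows a random identifier (uuid4) alongside a deterministic, name-based one (uuid5), which always yields the same identifier for the same input; the example URL is hypothetical.

```python
# A short sketch of minting identifiers with the standard library.
import uuid

# uuid4 gives a random UUID, suitable as an opaque object identifier.
object_id = uuid.uuid4()
print(object_id)

# uuid5 is name-based and deterministic: the same namespace and name
# always produce the same UUID, useful for stable, repeatable references.
stable_id = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/dataset/42")
print(stable_id)
```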

3. Curation involves organisation of data. This one is close to my archivist heart. It implies constructing a structure that sorts the data into a meaningful retrieval system and gives us intellectual control over lots of content. An important part of organisation for data is associating the dataset or datasets with the relevant scholarly publications, and other supporting documentation such as research notes, wikis, and blogs.

In the old days I would have called this building a finding aid, and invoked accessioning skills such as archival arrangement – “sorting like with like” – so that the end user would have a concise and well organised finding aid to help them understand the collection. The difference is that now we might do it with tools such as directory trees, information packages, aggregated zip or tar files, and so on. We still need the metadata to complete the task (see above) but this type of “curation” is about sorting and parsing the research project into meaningful, accessible entities.
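As a sketch of the “directory trees and aggregated tar files” approach, the following Python function (all paths and names hypothetical) bundles a project directory into a gzipped tar package together with a SHA-256 checksum manifest, loosely in the spirit of information-package conventions:

```python
# A minimal sketch of aggregating a research project into one package:
# a gzipped tar file plus a checksum manifest. Paths are hypothetical.
import hashlib
import os
import tarfile

def package_project(src_dir, out_tar):
    """Tar up src_dir and write a SHA-256 manifest alongside the data."""
    manifest_lines = []
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            manifest_lines.append(f"{digest}  {os.path.relpath(path, src_dir)}")
    # The manifest is written into the package directory before tarring,
    # so it travels inside the aggregate as well.
    manifest_path = os.path.join(src_dir, "manifest-sha256.txt")
    with open(manifest_path, "w") as fh:
        fh.write("\n".join(manifest_lines) + "\n")
    with tarfile.open(out_tar, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return manifest_path
```

The checksums give a fixity baseline for the package, which helps with the authentication concerns discussed later in this post.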

If we get this part of curation right, we are helping future use and re-use of the dataset. If we can capture the outputs of any secondary research, they stand a better chance of being associated with the original dataset.

4. Curation is a form of lifecycle management. There is a valid interpretation of data curation that claims “Data curation is the management of data throughout its lifecycle, from creation and initial storage, to the time when it is archived for posterity or becomes obsolete and is deleted.” I would liken this to an advanced form of records management, a profession that already recognises how lifecycles work, and has workflows and tools for how to deal with them. It’s a question of working out how to intervene, and when to intervene; if this side of curation means speaking to a researcher about their record-keeping as soon as they get their grant, then I’m all for it.
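As a toy illustration of lifecycle thinking, the stages and allowed transitions below are illustrative assumptions only, not an agreed records-management model; the point is that each move through the lifecycle is an explicit, controllable intervention.

```python
# A toy sketch of lifecycle stages; stage names and transition rules
# are illustrative assumptions, not a standard records-management model.
from enum import Enum

class Stage(Enum):
    CREATED = 1
    ACTIVE = 2
    ARCHIVED = 3
    DELETED = 4

# Which stage changes are permitted, per the hypothetical policy above.
TRANSITIONS = {
    Stage.CREATED: {Stage.ACTIVE},
    Stage.ACTIVE: {Stage.ARCHIVED, Stage.DELETED},
    Stage.ARCHIVED: {Stage.DELETED},
    Stage.DELETED: set(),
}

def advance(current, target):
    """Move a record to a new stage, refusing invalid transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```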

5. Curation provides for re-use over time through activities including authentication, archiving, management, preservation, and representation. While this definition may seem to involve a large number of activities, in fact most of them are already defined as things we would do as part of “digital preservation”, especially as defined by the OAIS Reference Model. The main emphasis for this class of resource, however, is “re-use”. The definition of what this means, and the problems of creating a re-usable dataset (i.e. a dataset that could be repurposed by another system), are too deep for this blog post, but they go beyond the idea that we could merely create an access copy.

Authentication is another disputed area, but I would like to think that proper lifecycle management (see above) would go some way to building a reliable audit trail that helps authentication; likewise the correct organisation of data (see above) will add context, history and evidence that situates the research project in a certain place and date, with an identifiable owner, adding further authentication.

To conclude this brief overview I make two observations:

  • Though there is some commonality among the instances I looked at, there is apparently no single shared understanding of what “data curation” means within the HE community; some views will tend to promote one aspect over another, depending on their content type, collections, or user community.
  • All the definitions I looked at tend to roll all the diverse actions together in a single paragraph, as if they were related parts of the same thing. Are they? Does it imply that data curation could be done by a single professional, or do we need many skillsets contributing to this process for complete success?

Research datasets: lockdown or snapshot?

In today’s blog post we’re going to talk about digital preservation planning in the context of research datasets. We’re planning a one-day course for research data managers, where we can help with making preservation planning decisions that intersect with and complement your research data management plan.

When we’re dealing with datasets of research materials, there’s often a question about when (and whether) it’s possible to “close” the dataset. The dataset is likely to be a cumulative entity, especially if it’s a database, continually accumulating new records and new entries. Is there ever a point at which the dataset is “finished”? If you ask a researcher, it’s likely they will say it’s an ongoing concern, and they would rather not have it taken away from them and put into an archive.

For the data manager wishing to protect and preserve this valuable data, there are two possibilities.

The first is to “lock down” the dataset

This would involve intervening at a suitable date or time, for instance at the completion of a project, and negotiating with the researcher and other stakeholders. If everyone can agree on a lockdown, it means that no further changes can be made to the dataset; no more new records added, and existing records cannot be changed.

A locked-down dataset is somewhat easier to manage in a digital preservation repository, especially if it’s not being requested for use very frequently. However, this approach doesn’t always match the needs of the institution, nor the researcher who created the content. This is where the second possibility comes into play.

The second possibility is to take “snapshots” of the dataset

This involves a capture action: abstracting records from the dataset and preserving them as a “view” of the dataset at a particular moment in time. The dataset itself remains intact, and can continue being used for live data as needed: it can still be edited and updated.

Taking dataset snapshots is a more pragmatic way of managing and preserving important research data, while meeting the needs of the majority of stakeholders. However, it also requires more effort: a strategic approach, more planning, and a certain amount of technical capability. In terms of planning, it might be feasible to take snapshots of a large and frequently-updated dataset on a regular basis, e.g. every year or every six months; doing so will tend to create reliable, well-managed views of the data.

Another valid approach would be to align the snapshot with a particular piece of research

For instance, when a research paper is published, the snapshot of the dataset should reflect the basis on which the analysis in that paper was carried out. The dataset snapshot would then act as a strong affirmation of the validity of the dataset. This is a very good approach, but requires the data manager and archivist to have a detailed knowledge of the content, and more importantly the progress of the research cycle.

The ideal scenario would be to have your researcher on board with your preservation programme, and get them signed up to a process like this; at crucial junctures in their work, they could request snapshots of the dataset, or even be empowered to take the snapshots themselves.

In terms of the technical capability for taking snapshots, it may be as simple as running an export script on a database, but it’s likely to be a more delicate and nuanced operation. The parameters of the export may have to be discussed and managed quite carefully.
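To give a flavour of what such an export script might look like, here is a simplified sketch that dumps named tables from a SQLite database into dated CSV files. The choice of SQLite and the table names are assumptions for illustration; a real export would negotiate which fields and records to include far more carefully.

```python
# A simplified sketch of a snapshot export: dump selected tables from a
# SQLite database to dated CSV files. Table names are hypothetical.
import csv
import datetime
import sqlite3

def snapshot(db_path, tables, out_prefix):
    """Write each table to '<prefix>_<table>_<date>.csv'; return the paths."""
    stamp = datetime.date.today().isoformat()
    conn = sqlite3.connect(db_path)
    paths = []
    for table in tables:
        # Table names come from a trusted, agreed list, never user input.
        cur = conn.execute(f"SELECT * FROM {table}")
        path = f"{out_prefix}_{table}_{stamp}.csv"
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow([col[0] for col in cur.description])  # header row
            writer.writerows(cur)
        paths.append(path)
    conn.close()
    return paths
```

Dating the file names is the simplest way to keep successive snapshots distinct and to make each “view” self-describing about when it was taken.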

Lastly we should add that these operations by themselves don’t constitute the entirety of digital preservation. They are both strategies to create an effective capture of a particular resource; but capture alone is not preservation.

That resource must pass into the preservation repository and undergo a series of preservation actions in order to be protected and usable in the future. There will be several variations on this scenario, as there are many ways of gathering and storing data. We know that institutions struggle with this area, and there is no single agreed “best practice.”