Research datasets: lockdown or snapshot?

In today’s blog post we’re going to talk about digital preservation planning in the context of research datasets. We’re planning a one-day course for research data managers, where we can help you make preservation planning decisions that intersect with and complement your research data management plan.

When we’re dealing with datasets of research materials, there’s often a question about when (and whether) it’s possible to “close” the dataset. The dataset is likely to be a cumulative entity, especially if it’s a database, continually accumulating new records and new entries. Is there ever a point at which the dataset is “finished”? If you ask a researcher, it’s likely they will say it’s an ongoing concern, and they would rather not have it taken away from them and put into an archive.

For the data manager wishing to protect and preserve this valuable data, there are two possibilities.

The first is to “lock down” the dataset

This would involve intervening at a suitable point, for instance at the completion of a project, and negotiating with the researcher and other stakeholders. If everyone can agree on a lockdown, it means that no further changes can be made to the dataset: no new records can be added, and existing records cannot be changed.

A locked-down dataset is somewhat easier to manage in a digital preservation repository, especially if it’s not being requested for use very frequently. However, this approach doesn’t always match the needs of the institution, nor the researcher who created the content. This is where the second possibility comes into play.

The second possibility is to take “snapshots” of the dataset

This is a capture action that abstracts records from the dataset and preserves them as a “view” of the dataset at a particular moment in time. The dataset itself remains intact and can continue to be used as live data: it can still be edited and updated.

Taking dataset snapshots is a more pragmatic way of managing and preserving important research data, while meeting the needs of the majority of stakeholders. However, it also requires more effort: a strategic approach, more planning, and a certain amount of technical capability. In terms of planning, it might be feasible to take snapshots of a large and frequently-updated dataset on a regular basis, e.g. every year or every six months; doing so will tend to create reliable, well-managed views of the data.

Another valid approach would be to align the snapshot with a particular piece of research

For instance, when a research paper is published, the snapshot of the dataset should reflect the basis on which the analysis in that paper was carried out. The snapshot would then act as a strong affirmation of the validity of that analysis. This is a very good approach, but it requires the data manager and archivist to have a detailed knowledge of the content and, more importantly, of the progress of the research cycle.

The ideal scenario would be to have your researcher on board with your preservation programme and signed up to a process like this; at crucial junctures in their work, they could request snapshots of the dataset, or even be empowered to take them themselves.

In terms of the technical capability for taking snapshots, it may be as simple as running an export script on a database, but it’s likely to be a more delicate and nuanced operation. The parameters of the export may have to be discussed and managed quite carefully.
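To make that concrete, here is a minimal sketch of what a snapshot export might look like, assuming (purely for illustration) that the dataset lives in a single SQLite database file; a PostgreSQL or MySQL setup would reach for pg_dump or mysqldump instead, and the filenames here are invented.

```python
"""Minimal snapshot sketch, assuming the research data lives in a
single SQLite database called research.db (a hypothetical example)."""
import sqlite3
from datetime import date
from pathlib import Path

SOURCE_DB = Path("research.db")      # the live dataset (assumed name)
SNAPSHOT_DIR = Path("snapshots")     # where dated snapshots accumulate

def take_snapshot() -> Path:
    """Copy the live database into a dated, read-only snapshot file."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    target = SNAPSHOT_DIR / f"research-{date.today().isoformat()}.db"

    src = sqlite3.connect(SOURCE_DB)
    dst = sqlite3.connect(target)
    src.backup(dst)                  # consistent copy even if the live DB is in use
    dst.close()
    src.close()

    target.chmod(0o444)              # mark the snapshot read-only
    return target

if __name__ == "__main__":
    print(f"Snapshot written to {take_snapshot()}")
```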

Lastly we should add that these operations by themselves don’t constitute the entirety of digital preservation. They are both strategies to create an effective capture of a particular resource; but capture alone is not preservation.

That resource must pass into the preservation repository and undergo a series of preservation actions in order to be protected and usable in the future. There will be several variations on this scenario, as there are many ways of gathering and storing data. We know that institutions struggle with this area, and there is no single agreed “best practice.”
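To give a flavour of what those preservation actions involve, here is a minimal fixity sketch: generating a checksum manifest for a captured snapshot so that future copies can be verified against it. The directory layout and manifest filename are illustrative assumptions, not a standard.

```python
"""A minimal fixity sketch: one routine preservation action a captured
snapshot might undergo on entering the repository."""
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large captures don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(capture_dir: Path) -> Path:
    """Record a checksum for every file in the capture, for later verification."""
    manifest = capture_dir / "manifest-sha256.txt"
    lines = [
        f"{sha256_of(item)}  {item.relative_to(capture_dir)}"
        for item in sorted(capture_dir.rglob("*"))
        if item.is_file() and item.name != manifest.name
    ]
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```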

Pros and cons of the JPEG2000 format for a digitisation project

In this post we want to briefly discuss some of the pros and cons of the JPEG2000 format. When it comes to selecting file formats for a digitisation project, choosing the right ones can help with continuity, longevity, and even access to the content. It all depends on the type of resource, your needs, and the needs of your users.

If you’re working with images (e.g. for digitised versions of books, texts, or photographs), there’s nothing wrong with using the TIFF standard file format for your master copies. We’re not here to advocate using the JPEG2000 format, but it does have its adherents (and its evangelists).

PROS

May save storage space

This is a compelling reason, and may be why a lot of projects opt for the JP2. Unlike the uncompressed TIFFs typically kept as masters, the JP2 supports lossless compression: it can be compressed to leave a smaller “footprint” on the server, yet lose nothing in terms of quality. How? It’s thanks to the magic of the format’s wavelet-based codec.
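As a rough illustration of the point, here is a hedged sketch of converting a TIFF master to a losslessly compressed JP2 using the Pillow library (which relies on OpenJPEG for its JPEG2000 support). The filenames are invented, and you should check that your own Pillow build includes JPEG2000 support before relying on it.

```python
"""Sketch: lossless TIFF-to-JP2 conversion with Pillow (OpenJPEG-backed).
Assumes an 8-bit RGB master; filenames are illustrative."""
from pathlib import Path
from PIL import Image

master = Path("page_001.tif")            # hypothetical uncompressed TIFF master
jp2 = master.with_suffix(".jp2")

with Image.open(master) as img:
    # irreversible=False selects the reversible (lossless) wavelet transform,
    # so the pixels can be reconstructed exactly from the smaller file.
    img.save(jp2, format="JPEG2000", irreversible=False)

print(f"TIFF: {master.stat().st_size:,} bytes -> JP2: {jp2.stat().st_size:,} bytes")
```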

Versatility

In “old school” digitisation projects, we tended to produce at least two digital objects – a high-resolution scan (the “archive” copy, as I would call it) and a low-resolution version derived from it, which we’d serve to users as the access copy. Gluttons for punishment might even create a third object, a thumbnail, to exhibit on the web page / online catalogue. By contrast, the JPEG2000 format can perform all three functions from a single object. It can do this thanks to its “scalable bitstream”: the image data is encoded in such a way that a viewer only decodes as much of it as is needed to meet the request, which could be for an image of any size.
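Continuing the sketch above, and assuming the JP2 was written with several resolution levels (Pillow’s default), this is roughly how a single object might yield an access copy and a thumbnail on demand. The reduce attribute used here is specific to Pillow’s JPEG2000 loader, so treat it as an assumption to verify against your own tooling.

```python
"""Sketch: serving different sizes from the single JP2 created above,
using the reduce attribute of Pillow's JPEG2000 loader."""
from PIL import Image

access = Image.open("page_001.jp2")
access.reduce = 2            # discard two resolution levels: decode at quarter size
access.load()
access.save("page_001_access.jpg", quality=85)   # derived access copy

thumb = Image.open("page_001.jp2")
thumb.reduce = 4             # decode at one-sixteenth size for a thumbnail
thumb.load()
thumb.save("page_001_thumb.jpg", quality=80)
```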

Open standard with ISO support

As the heading indicates, a file format that’s recognised as an International Standard (JPEG2000’s core coding system is ISO/IEC 15444-1) gives us more confidence in its longevity and in the prospects for continued support and development. An “open” standard in this instance refers to a file format whose specification has been published; this sort of documentation, although highly technical, can be useful in helping us understand (and in some cases validate) the behaviour of a file format.

CONS

Codec dependency

We mentioned the scalable bitstream and the capacity for lossless compression above as two of this format’s strengths. However, doing these things requires an extra bit of functionality above and beyond what most file formats are capable of (including the TIFF). This is the codec, which performs the compression-decompression work on the image data. It is a dependency: without the codec, the magic of the JPEG2000 won’t work. It is also a part of the format that remains something of a “black box,” a lack of transparency which may make some developers reluctant to work with the format.

Save As settings can be complex

In digitisation projects, the “Save As” action is crucial; you want your team to be producing consistent digitised resources which conform precisely to a pre-determined profile, for instance with regard to pixel size, resolution, and colour space. With a TIFF, these settings are relatively easy to apply; with the JPEG2000, there are many options and many possibilities, and it takes some expertise to select the settings that will work for your project. Both the decision-making process and the time spent applying the settings while scanning might add a burden to your project.
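One way to ease that burden is to automate the checking rather than the deciding: a small QA script can confirm that each scan matches the agreed profile. The profile values below (3000 x 4000 pixels, 400 DPI, RGB) are made-up examples; substitute whatever your project has actually specified.

```python
"""A minimal QA sketch for checking scans against a pre-agreed profile.
Profile values and the scans/ directory are illustrative assumptions."""
from pathlib import Path
from PIL import Image

PROFILE = {"size": (3000, 4000), "dpi": (400, 400), "mode": "RGB"}

def check_scan(path: Path) -> list[str]:
    """Return the ways this scan deviates from the profile (empty list = pass)."""
    problems = []
    with Image.open(path) as img:
        if img.size != PROFILE["size"]:
            problems.append(f"size {img.size}, expected {PROFILE['size']}")
        if img.mode != PROFILE["mode"]:
            problems.append(f"mode {img.mode}, expected {PROFILE['mode']}")
        dpi = img.info.get("dpi")
        if dpi is None or tuple(round(float(d)) for d in dpi) != PROFILE["dpi"]:
            problems.append(f"dpi {dpi}, expected {PROFILE['dpi']}")
    return problems

for tiff in sorted(Path("scans").glob("*.tif")):
    issues = check_scan(tiff)
    print(f"{tiff.name}: {'OK' if not issues else '; '.join(issues)}")
```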

Not yet the de facto standard

The “digital image industry,” if indeed there is such an entity, has not yet adopted JPEG2000 as a de facto standard. If you’re inclined to doubt this, look at the hardware: most digital cameras and scanners tend to save to TIFF or JPEG, not JPEG2000.

In conclusion, this post is not aiming to “sell” you on one format over another; what matters is the process of going through a series of decisions and informing yourself as best you can about the suitability of any given file format. Neither is it a case of either/or; we are aware of at least one major digitisation project that makes judicious use of both the TIFF and the JPEG2000, exploiting the salient features of each.

Scan Once for All Purposes – some cautionary tales

The acronym SOAP – Scan Once For All Purposes – has evolved over time among digitisation projects, and it’s a handy way to remember a simple rule of thumb: don’t scan content until you have a clear understanding of all the intended uses that will be made of the resource. This may seem simple, but in some projects it may have been overlooked in the rush to push digitised content out.

One reason for the SOAP rule is that digitisation is expensive. It takes money, staff time, expertise and costly hardware to turn analogue content into digital content.

Taking books or archive boxes off shelves, scanning them, and reshelving them all takes time. Scanning paper can damage the original resource, so to minimise that risk we’d only want to do it once. In some extreme cases, scanning can even destroy a resource; there are projects which have sacrificed an entire run of a print journal to the scanner, in order to allow “disaggregation”, which is a euphemistic way of saying “we cut them up with a scalpel to create separate scanner-friendly pages”.

Beyond that, there are digital and planning considerations which underline the importance of the Scan Once For All Purposes rule. To demonstrate, let’s illustrate it with some imaginary but perfectly plausible scenarios for a digitisation project, and see what the consequences of failing to plan could be.

Scenario 1
An organisation decides to scan a collection of photographs of old buses from the 1930s, because they’re so popular with the public. Unfortunately, nobody told them about the differences between file formats, so the scans end up as low-resolution compressed JPEGs scanned at 72 DPI because the web manager advised that was best for sending the images over the web.

Consequence: the only real value these JPEGs have is as access copies. If we wanted to use them for commercial purposes, such as printing, 72 DPI will prove to be ineffectual: a 6 x 4 inch photograph scanned at 72 DPI yields only around 432 x 288 pixels, far short of the roughly 300 DPI usually expected for print reproduction. Further, if a researcher wanted to examine details of the buses, there wouldn’t be enough data in the scan for a proper examination, so the chances are they would have to come in to the searchroom anyway. Result: the photographs are once again subjected to more wear and tear. And weren’t we trying to protect these photographs in some way?

Scenario 2
The organisation has another go at the project – assuming they have any money left in the budget. This time they’re better informed about the value of high-resolution scans, and the right file formats for supporting that much image data in a lossless, uncompressed manner. Unfortunately, they didn’t tell the network manager they wanted to do this.

Consequence: the library soon finds their departmental “quota” of space on the server has been exceeded three times over. Because this quota system is managed automatically in line with an IT policy, the scans are now at risk of being deleted, with a notice period of 24 hours.

Scenario 3 
The organisation succeeds in securing enough server space for the high-resolution scans. After a few months running the project, it turns out the users are not satisfied with viewing low-resolution JPEGs and demand online access to full-resolution, zoomable TIFF images. The library agrees to this, and asks the IT manager to move their TIFF scans onto the web server for this purpose.

Consequence: through constant web access and web traffic, the original TIFF files are now exposed to a strong possibility of corruption. Since they’re the only copies, the organisation is now putting an important digital asset at risk. Further, the action of serving such large files over the web – particularly for this dynamic use, involving a zoom viewer – is putting a severe strain on the organisation’s bandwidth, and costing more money.

The simple solution to all of the above imaginary scenarios could be SOAP. The ideal would be for the organisation to handle the original photographs precisely once as part of this digitisation project, and not have to re-scan them because they got it wrong first time. The scanning operation should produce a single high-quality digital image, scanned at a high resolution and encoded in a dependable, robust format. We would refer to this as the “original”.

The project could then derive further digital objects from the “original”, such as access copies stored in a lower-resolution format. However, this is not part of the scanning operation; it’s managed as part of an image manipulation stage of the project, and is entirely digital. The photographs, now completely ‘SOAPed’, are already safely back in their archive box.
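For illustration, the image manipulation stage might look something like the sketch below, with access copies and thumbnails derived in bulk from the TIFF originals; the directory names and derivative sizes are assumptions, not recommendations.

```python
"""Sketch of the purely digital derivation stage: access copies and
thumbnails produced from TIFF 'originals' that are never touched again.
Directory names and pixel sizes are illustrative assumptions."""
from pathlib import Path
from PIL import Image

ORIGINALS = Path("masters")      # high-resolution TIFF originals (kept read-only)
ACCESS = Path("access")          # web-friendly derivatives
THUMBS = Path("thumbs")

ACCESS.mkdir(exist_ok=True)
THUMBS.mkdir(exist_ok=True)

for master in sorted(ORIGINALS.glob("*.tif")):
    with Image.open(master) as img:
        rgb = img.convert("RGB")             # JPEG derivatives need 8-bit RGB

        access = rgb.copy()
        access.thumbnail((2000, 2000))       # cap the long edge at 2000 px
        access.save(ACCESS / f"{master.stem}.jpg", quality=85)

        rgb.thumbnail((300, 300))            # much smaller thumbnail
        rgb.save(THUMBS / f"{master.stem}.jpg", quality=80)
```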

The digital “originals” should now go into a safe part of the digital store. They would never be used as part of the online service, and users would not get hold of them. To meet the needs of scenario 3, the project now has to plan a routine that derives further copies from the originals; but these should be encoded in a way that makes them suitable for web access, most likely using a file format with a scalable bitstream that allows the zoom tool to work.

All of the above SOAP operations depend on the project manager having a good dialogue with the network server manager and the web manager too, a trait which such projects share with long-term digital preservation. As can be seen, a little planning will save the project money and deliver the desired results without having to scan anything twice.