When is it a good time for a file format migration?

I used to teach a one-day course on file format migration. The course advanced the idea that migration, although one of the oldest and best-understood methods of enacting digital preservation, can still carry risks of loss. To mitigate that risk, we want to make the case for use cases and acceptance criteria: good old-fashioned planning, in short.

When would it be a good time to migrate a file? And when would it be good not to migrate, or at any rate defer the decision? We can think of some plausible scenarios, and will discuss them briefly below.

We think the community has moved on now from its earlier line of thought, which was along the lines of “migrate as soon as possible, ideally at point of ingest” – the risks of careless migrations are hopefully better understood now, and we don’t want to rush into a bad decision. That said, some digital preservation systems still have an automated migration action built into the ingest routine.

Do migrate if: 

  • You don’t trust the format of the submission. The depositor may have sent you something in an obscure, obsolete, or unsupported file format. A scenario like this is likely to involve a private depositor, or an academic who insists on working in their “special” way. Obsolescence (or the imminent threat of it) is a well-established motivator for bringing out the conversion toolkit, though there are some who would disagree.
  • Your archive/repository works to a normalisation policy. This means that you limit the number of preservation formats you work with, so you convert all ingests to the standard set which you support. The policy might be to migrate all Microsoft Office documents to their OpenOffice equivalent. Indeed, this rule is built into Xena, the open-source tool from the National Archives of Australia. Normalisation may have a downside, but it can create economies in how many formats you need to commit to supporting, and may go some way to “taming” wild deposits that arrive in a variety of formats.
  • You want to provide access to the content immediately. This means creating an access copy of the resource, for instance by migrating a TIFF image to a JPEG. Some would say this doesn’t really qualify as migration, but it does involve a re-encoding action, which is why we mention it. It might be that this access copy doesn’t have to meet the same high standards as a preservation copy.

Don’t migrate if: 

  • The resource is already stored in a quality format. The deposit you are ingesting may already be encoded in a format that is widely accepted as meeting a preservation standard, in which case migration is arguably not necessary. To ascertain this and verify the format, use DROID or other identification tools. To learn about preservation standard formats, start with the Library of Congress resource Sustainability of Digital Formats.
  • There is no immediate need for you to migrate. In this scenario, you fear that the ingested content’s format may become obsolete one day, but your research (starting with the PRONOM online registry) indicates that the risk is some way off, maybe even 10-15 years away, in which case deferring the migration is your best policy. Be sure to add a “note to self” in the form of preservation metadata about this decision, and a trigger date in your database that will remind you to take action.
  • You want to migrate, but currently lack the IT skills. To this scenario we could add “you lack the tools to do migration” or even “you lack a suitable destination format”. You’ve made a search on COPTR and still come up empty. Through no fault of your own, technology has simply not yet found a reliable way to migrate the format you wish to preserve, and a tool for migration does not exist. In this instance, don’t wait for the solution – put the content into preservation storage, with a “note to self” (see above) that action will be taken at some point when the technology, tools, skills, and formats are available.
  • You have no preservation plan. This refers to your over-arching strategy which governs your approach to doing digital preservation. Part of it is an agreed action plan for what you will do when faced with particular file formats, including a detailed workflow, choice of conversion tool, and clear rationale for why you’re doing it that way. Ideally, in compiling this action plan, you will have understood the potential losses that migration can cause to the content, and the archivist (and the organisation) have signed off on how much of a “hit” is acceptable. Without a plan like this, you’re at risk of guessing which is the best migration pathway, and your decisions end up being guided by the tools (which are limited) rather than your own preservation needs.
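The two checklists above can be sketched as a small decision routine. This is only an illustration under stated assumptions, not a feature of any preservation system: the field names, the `decide` function, the order in which the checks are applied, and the five-year review interval are all invented for the example. Access copies are left out, since they sit alongside the preservation copy rather than replacing it.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Assessment:
    """Hypothetical summary of what we know about one deposit."""
    format_trusted: bool          # not obscure, obsolete or unsupported
    normalisation_required: bool  # repository policy limits the format set
    has_preservation_plan: bool   # agreed workflow, tool and rationale
    has_tools_and_skills: bool    # a reliable migration pathway exists
    obsolescence_imminent: bool   # e.g. judged from a format registry


@dataclass
class Decision:
    action: str                       # "migrate" or "defer"
    reason: str
    review_on: Optional[date] = None  # trigger date for deferred cases


def decide(a: Assessment, today: date, review_years: int = 5) -> Decision:
    """Apply the blocking 'don't migrate' checks first, then the rest."""
    # Assumed convention: review on the first of the month, N years on.
    review = date(today.year + review_years, today.month, 1)
    if not a.has_preservation_plan:
        return Decision("defer", "no preservation plan agreed", review)
    if not a.has_tools_and_skills:
        return Decision("defer", "no reliable tool, skills or target format", review)
    if a.normalisation_required:
        return Decision("migrate", "normalisation policy")
    if not a.format_trusted or a.obsolescence_imminent:
        return Decision("migrate", "format is untrusted or at risk")
    return Decision("defer", "format is acceptable and risk is some way off", review)


# The returned record is the "note to self": store it as preservation metadata.
note = decide(
    Assessment(
        format_trusted=True,
        normalisation_required=False,
        has_preservation_plan=True,
        has_tools_and_skills=True,
        obsolescence_imminent=False,
    ),
    today=date(2024, 3, 1),
)
print(note.action, note.reason, note.review_on)
```

The point of returning a record rather than a bare yes/no is that every deferral carries its reason and its trigger date with it.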

Pros and cons of the JPEG2000 format for a digitisation project

In this post we want to briefly discuss some of the pros and cons of the JPEG2000 format. When it comes to selecting file formats for a digitisation project, choosing the right ones may help with continuity and longevity, or even access to the content. It all depends on the type of resource, or your needs, or the needs of your users.

If you’re working with images (e.g. for digitised versions of books, texts, or photographs), there’s nothing wrong with using the TIFF standard file format for your master copies. We’re not here to advocate using the JPEG2000 format, but it does have its adherents (and its evangelists).

PROS

May save storage space

This is a compelling reason and may be why a lot of projects opt for the JP2. Unlike an uncompressed TIFF, it can use lossless compression. This means it can be compressed to leave a smaller “footprint” on the server, and yet not lose anything in terms of quality. How? It’s thanks to the magic codec.
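What “lossless” means in practice can be shown with a round trip: compress, decompress, and check that every byte comes back. The sketch below uses Python’s general-purpose zlib module purely as a stand-in; it is not the JPEG2000 codec, which works quite differently, and the synthetic gradient “scan” is an assumption made so the example is self-contained.

```python
import zlib

# Synthetic 256x256 greyscale "scan": a smooth horizontal gradient.
width, height = 256, 256
original = bytes(x for _ in range(height) for x in range(width))

compressed = zlib.compress(original, 9)  # 9 = strongest compression level
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("identical after round trip:", restored == original)
```

A real photograph will compress far less dramatically than this artificial gradient; the principle, a smaller file that restores to exactly the same data, is the same.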

Versatility

In “old school” digitisation projects, we tended to produce at least two digital objects – a high resolution scan (the “archive” copy, as I would call it) and a low resolution version derived from it, which we’d serve to users as the access copy. Gluttons for punishment might even create a third object, a thumbnail, to exhibit on the web page / online catalogue. By contrast, the JPEG2000 format could perform all three functions from a single object. It can do this because of the “scalable bitstream”: the image data is encoded so it only serves as much as is needed to meet the request, which could be for an image of any size.
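One way to picture the scalable bitstream is as a stack of resolutions inside the one file, each half the width and height of the one above it. The sketch below only does the arithmetic; it does not read or decode a JP2. The function names and the choice of five levels are assumptions (the number of levels is an encoding setting), and a real image server would normally scale the chosen level down to the exact size requested.

```python
import math


def resolution_levels(width, height, levels=5):
    """Sizes a decoder could reconstruct from one file, largest first."""
    return [
        (math.ceil(width / 2 ** r), math.ceil(height / 2 ** r))
        for r in range(levels + 1)
    ]


def smallest_sufficient(width, height, wanted_width, levels=5):
    """Pick the smallest stored resolution at least as wide as the request."""
    sizes = resolution_levels(width, height, levels)
    candidates = [size for size in sizes if size[0] >= wanted_width]
    # Fall back to full resolution if the request exceeds the master.
    return candidates[-1] if candidates else sizes[0]


master = (6000, 4000)
print(smallest_sufficient(*master, wanted_width=150))   # thumbnail -> (188, 125)
print(smallest_sufficient(*master, wanted_width=1200))  # access copy -> (1500, 1000)
print(smallest_sufficient(*master, wanted_width=6000))  # full size -> (6000, 4000)
```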

Open standard with ISO support

As indicated above, a file format that’s recognised as an International Standard gives us more confidence in its longevity, and the prospects for continued support and development. An “open” standard in this instance refers to a file format whose specification has been published; this sort of documentation, although highly technical, can be useful to help us understand (and in some cases validate) the behaviour of a file format.

CONS

Codec dependency

We mentioned the scalable bitstream and the capacity for lossless compression above as two of this format’s strengths. However, both require an extra bit of functionality above and beyond what most file formats are capable of (including the TIFF). This is the codec, which performs a compression-decompression action on the image data. Besides being a dependency (without the codec, the magic of the JPEG2000 won’t work), this is one part of the format which remains something of a “black box,” a lack of transparency which may make some developers reluctant to work with the format.

Save As settings can be complex

In digitisation projects, the “Save As” action is crucial; you want your team to be producing consistent digitised resources which conform precisely to a pre-determined profile, for instance with regard to pixel size, resolution, and colour space. With a TIFF, these settings are relatively easy to apply; with the JPEG2000, there are many options and many possibilities, and it requires some expertise to select the settings that will work for your project. Both the decision-making process and the time spent applying those settings while scanning might add a burden to your project.
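Whatever the format, conformance to the profile can be checked mechanically once the properties of each scan have been extracted. The sketch below assumes that extraction has already happened and that each scan is described by a plain dictionary; the profile values and field names are hypothetical, chosen only to show the shape of the check.

```python
# Hypothetical project profile: every field name and value is an assumption.
PROFILE = {
    "width_px": 4000,
    "height_px": 6000,
    "resolution_ppi": 400,
    "colour_space": "Adobe RGB (1998)",
    "bit_depth": 8,
}


def profile_mismatches(scan, profile=PROFILE):
    """Return {field: (expected, actual)} for every value that differs."""
    return {
        key: (expected, scan.get(key))
        for key, expected in profile.items()
        if scan.get(key) != expected
    }


# A scan saved at the wrong resolution is flagged; a conforming one is not.
bad_scan = {**PROFILE, "resolution_ppi": 300}
print(profile_mismatches(bad_scan))        # {'resolution_ppi': (400, 300)}
print(profile_mismatches(dict(PROFILE)))   # {}
```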

Not yet the de facto standard

The “digital image industry,” if indeed there is such an entity, has not yet adopted the JPEG2000 file format as a de facto standard. If you’re inclined to doubt this, look at the hardware; most digital cameras and digital scanners tend to save to TIFF or JPEG, not JPEG2000.

In conclusion, this post is not aiming to “sell” you on one format over another; what matters is going through a series of decisions, and informing yourself as best you can about the suitability of any given file format. Neither is it a case of either/or; we are aware of at least one major digitisation project that makes judicious use of both the TIFF and the JPEG2000, exploiting the salient features of both.

File formats…or data streams?

On 1st December Malcolm Todd of The National Archives gave a good account of the work he’s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.

My naïve thinking in the area has been characterised by the assumption that the process is rather static or linear, and that the problem we’re facing is broadly the same every time; migrate data from a format that’s about to become obsolete or unsupported, onto another format that’s stable, supported, and open. MS Word document to PDF or PDF/A…now that, I can understand!

In fact, I learned at least two ways of thinking about formats that hadn’t occurred to me before. One simple one is costs; some formats can cost more to preserve than others. This can be calculated in terms of storage costs, multiplied over time, and the costs associated with migrations to new versions of that format.
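As a rough sketch, that calculation is storage cost multiplied over time, plus the cost of each migration. The function below is my own simplification, not a model taken from the report, and every figure in the example is made up purely to show the arithmetic.

```python
def preservation_cost(size_gb, storage_cost_per_gb_year, years,
                      migrations, cost_per_migration):
    """Storage cost over the whole period plus the cost of each migration."""
    storage = size_gb * storage_cost_per_gb_year * years
    migration = migrations * cost_per_migration
    return storage + migration


# Illustrative (invented) figures: a bulky format that needs one migration
# versus a compact format that needs two over the same ten years.
bulky = preservation_cost(1000, 0.25, 10, migrations=1, cost_per_migration=500)
compact = preservation_cost(500, 0.25, 10, migrations=2, cost_per_migration=500)
print(bulky, compact)  # 3000.0 2250.0
```

Even a toy model like this shows that the cheaper format to store is not automatically the cheaper one to preserve; it depends on how often it has to be migrated.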