file formats – Ed The Archivist

PDF/A and read-only in SharePoint

In the previous post I made some observations about metadata in Office Documents in the SharePoint environment. In this brief follow-up post, I will mention what happens to PDF/A files in SharePoint. This is prompted by a comment received from Özhan Saglik. As part of this, I was also moved to try an experiment with Read-Only documents.

Well, the punchline to the first one is simple – PDF/A files don’t change at all in SharePoint. At least, that’s what I found with my limited testbed. This result need not surprise us, if we consider some basic facts:

PDF/A files are designed to “open as read only”. This is part of the standard, so they can’t be (carelessly) modified by anything. As a matter of fact, you should get a message at the top of the document when you open it, informing you that this is the case.
SharePoint was designed to work with Microsoft products, not Adobe products. To be precise, the main purpose of working in SharePoint is the sharing and editing capability; the developers have not expended any effort in adding an editor or other app for a file format which they don’t support. On the other hand, you can upload a PDF to SharePoint and let others see it.
There is no possibility of editing PDF/A (or even plain PDFs) within the SharePoint environment. See previous remark.

The above observations are endorsed by MS employees, for instance in this forum thread.

PDF/A experiments

For the record, for this experiment I used five files which are conformant to the PDF/A-1a flavour. I can say this with some degree of certainty because (a) this information is reported in their metadata and (b) they passed the veraPDF test for conformance with the standard. As with my previous experiment, I did a before-and-after profiling of their properties using Apache Tika.

I was able to upload the PDF/A files into SharePoint, and download them; that’s as far as it goes, since (as noted) no editing of a PDF within SharePoint is possible. All the metadata remained completely intact, and the upload-download process did not change anything. Results can be seen in comparative_table_3.xlsx.
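A checksum comparison is one simple way to confirm that a round trip of this kind has left a file untouched. The sketch below is only an illustration of that idea, not the procedure used for the tables above; the function names and the throwaway files are my own assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_by_round_trip(original: Path, downloaded: Path) -> bool:
    """True if the two files are bit-for-bit identical."""
    return sha256_of(original) == sha256_of(downloaded)


# Throwaway files standing in for the original and the downloaded copy.
workdir = Path(tempfile.mkdtemp())
before = workdir / "before.pdf"
after = workdir / "after.pdf"
before.write_bytes(b"stand-in for a PDF/A file")
after.write_bytes(b"stand-in for a PDF/A file")

round_trip_ok = unchanged_by_round_trip(before, after)
print(round_trip_ok)  # prints True
```

If the digests match, the bitstream is identical, whatever the file system says about dates.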

Download comparative tables as a zip file (27.9 KB)

However, one aspect still intrigues me. It’s a property – or something – which is reporting on a date and time. It displays as “Modified” in SharePoint, and also displays as ‘Date Modified’ in Windows Explorer. I am assuming, perhaps wrongly, that both of these are the same thing. If we look at this screenshot, it shows my PDF/A files after I’d dragged them in:

And after I’d downloaded them back onto my Desktop, Windows Explorer likewise reported today’s date:

These dates in no way match the modified dates of the original PDFs as reported in Windows:

What’s also interesting is that I can’t find this “Date modified” value anywhere in the Apache Tika report.

I’m assuming that, in Windows Explorer at least, this is a timestamp, or a file time value which gets written by the NTFS file system. These values change in response to certain events, such as copying or drag-and-dropping. This page seems to describe what I think might be happening.
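If that assumption is right, the modified date lives in the file system rather than inside the file, which would also explain why a content-based profile does not show it. The sketch below illustrates the general behaviour with Python's standard library on a throwaway file; it does not reproduce what SharePoint or Windows Explorer actually do, and the file names and example date are invented.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
original = workdir / "original.pdf"
original.write_bytes(b"stand-in for a PDF/A file")

# Pretend the file was last modified on 1 January 2015 (arbitrary example date).
old_time = 1420070400
os.utime(original, (old_time, old_time))

# shutil.copy copies the content only, so the copy gets a fresh timestamp.
plain_copy = Path(shutil.copy(original, workdir / "plain_copy.pdf"))
# shutil.copy2 also tries to carry the file system timestamps across.
careful_copy = Path(shutil.copy2(original, workdir / "careful_copy.pdf"))


def digest(path: Path) -> str:
    """SHA-256 of the file content, which ignores file system dates."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


for path in (original, plain_copy, careful_copy):
    print(path.name, int(path.stat().st_mtime), digest(path)[:12])
```

All three files share the same checksum; only the file system's idea of when each one was "modified" differs.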

Maybe Read-Only is the key?

I was about to scribble something along the lines of “PDF/A is a robust format – and here’s why!” Perhaps then setting myself the impossible job of understanding (and explaining) how the PDF/A encoding meets some of our expectations for long-term preservation by the complex set of conditions that obtain within the standard. But what if it’s much simpler than that? Perhaps a PDF/A file is impervious to SharePoint changes for one simple reason – it’s Read-Only.

If that’s true, I wondered, then would MS Office documents behave nicely if I set their Properties to Read Only before uploading them into SharePoint, where the expected metadata changes would take place?

I decided to revisit my MS Office Documents testbed from the previous experiment and try this out. The results can be seen in comparative_table_4.xlsx, and were extremely encouraging.

For the “old style” Office documents, the only thing that has changed between upload and download is the checksum. From this evidence, perhaps we might conclude that SharePoint has simply generated a “new” digital object in some way – which is pretty much the expected behaviour, is it not?

In the case of the three Office Open XML files, SharePoint added a new Custom Field that wasn’t there before. The Content Length also changed.

However, unlike what happened in my previous experiments, all the properties and metadata remain the same, including that all-important date of creation.
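A before-and-after comparison like this boils down to diffing two sets of properties. Here is a minimal sketch of that comparison; the field names and values are invented for illustration and are not taken from the comparative tables.

```python
def diff_profiles(before: dict, after: dict) -> dict:
    """Report which properties were added, removed, or changed."""
    added = {key: after[key] for key in after.keys() - before.keys()}
    removed = {key: before[key] for key in before.keys() - after.keys()}
    changed = {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical profiles for one Office Open XML file (all values made up).
before_upload = {
    "dcterms:created": "2012-03-01T09:15:00Z",
    "Content-Length": "15320",
}
after_download = {
    "dcterms:created": "2012-03-01T09:15:00Z",
    "Content-Length": "15874",
    "custom:ExampleField": "example value",  # hypothetical added field
}

report = diff_profiles(before_upload, after_download)
print(report)
```

In this made-up case the report shows one added field and one changed length, while the creation date does not appear at all because it is identical in both profiles.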

Lessons learned

Documents authored to be conformant to the PDF/A standard seem to be impervious to change in SharePoint. This may mean the format is a good thing, from that point of view.
Office Documents, if set to Read Only before upload to SharePoint, are much better protected from SharePoint change than those that are not. Of course, if you start to edit the document in SharePoint it’s another matter completely. But this way we stand a chance at least of uploading an authentic “original” document with metadata and properties intact, which could be used as a ground-zero point of comparison for any subsequent changes that are made.
Ticking the Read Only box is a simple expedient which can be achieved without even opening the document (simply right-click on the object in Windows Explorer). A purist might argue that doing so constitutes making a change to the file which violates archival principles; but does it? Maybe we’re only ticking a box in properties which has no profound effect on the content. Maybe the added protection it affords us is a trade-off that is worth making.
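For anyone wanting to test the purist's objection, one check is to compare a checksum of the file before and after setting the flag. The sketch below does this on a throwaway file; the function names are my own, and my understanding is that on Windows Python's chmod toggles the same Read-only attribute as the Explorer tick box, though I have not verified this against SharePoint.

```python
import hashlib
import stat
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """SHA-256 of the file content only."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mark_read_only(path: Path) -> None:
    """Clear the write permission bits, leaving the content untouched."""
    current = stat.S_IMODE(path.stat().st_mode)
    path.chmod(current & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


workdir = Path(tempfile.mkdtemp())
document = workdir / "report.docx"
document.write_bytes(b"stand-in for an Office document")

digest_before = content_digest(document)
mark_read_only(document)
digest_after = content_digest(document)

print(digest_before == digest_after)  # prints True
```

The flag is stored by the file system alongside the file, so the bytes of the document itself are the same before and after.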

These are not especially ground-breaking revelations, but they may help an archivist or records manager who is facing a SharePoint rollout, and provide some clues for a practical workaround which may help to mitigate this metadata loss.

When is it a good time for a file format migration?

I used to teach a one-day course on file format migration. The course advanced the idea that migration, although one of the oldest and best-understood methods of enacting digital preservation, can still carry risks of loss. To mitigate that loss, we want to make a case for use cases and acceptance criteria – good old-fashioned planning, in short.

When would it be a good time to migrate a file? And when would it be good not to migrate, or at any rate defer the decision? We can think of some plausible scenarios, and will discuss them briefly below.

We think the community has moved on now from its earlier line of thought, which was along the lines of “migrate as soon as possible, ideally at point of ingest” – the risks of careless migrations are hopefully better understood now, and we don’t want to rush into a bad decision. That said, some digital preservation systems still have an automated migration action built into the ingest routine.

Do migrate if:

You don’t trust the format of the submission. The depositor may have sent you something in an obscure, obsolete, or unsupported file format. A scenario like this is likely to involve a private depositor, or an academic who insists on working in their “special” way. Obsolescence (or the imminent threat of it) is a well-established motivator for bringing out the conversion toolkit, though there are some who would disagree.
Your archive/repository works to a normalisation policy. This means that you tend to limit the number of preservation formats you work with, so you convert all ingests to the standard set which you support. The policy might be to migrate all Microsoft products to their Open Office equivalent. Indeed, this rule is built into Xena, the open-source tool from National Archives of Australia. Normalisation may have a downside, but it can create economies in how many formats you need to commit to supporting, and may go some way to “taming” wild deposits that arrive in a variety of formats.
You want to provide access to the content immediately. This means creating an access copy of the resource, for instance by migrating a TIFF image to a JPEG. Some would say this doesn’t really qualify as migration, but it does involve a re-encoding action, which is why we mention it. It might be that this access copy doesn’t have to meet the same high standards as a preservation copy.

Don’t migrate if:

The resource is already stored in a quality format. The deposit you are ingesting may already be encoded in a format that is widely accepted as meeting a preservation standard, in which case migration is arguably not necessary. To ascertain this and verify the format, use DROID or other identification tools. To learn about preservation standard formats, start with the Library of Congress resource Sustainability of Digital Formats.
There is no immediate need for you to migrate. In this scenario, you fear that the ingested content’s format may become obsolete one day, but your researches (starting with the PRONOM online registry) indicate that the risk is some way off – maybe even 10-15 years away. In which case deferring the migration is your best policy. Be sure to add a “note to self” in the form of preservation metadata about this decision, and a trigger date in your database that will remind you to take action.
You want to migrate, but currently lack the IT skills. To this scenario we could add “you lack the tools to do migration” or even “you lack a suitable destination format”. You’ve made a search on COPTR and still come up empty. Through no fault of your own, technology has simply not yet found a reliable way to migrate the format you wish to preserve, and a tool for migration does not exist. In this instance, don’t wait for the solution – put the content into preservation storage, with a “note to self” (see above) that action will be taken at some point when the technology, tools, skills, and formats are available.
You have no preservation plan. This refers to your over-arching strategy which governs your approach to doing digital preservation. Part of it is an agreed action plan for what you will do when faced with particular file formats, including a detailed workflow, choice of conversion tool, and clear rationale for why you’re doing it that way. Ideally, in compiling this action plan, you will have understood the potential losses that migration can cause to the content, and the archivist (and the organisation) have signed off on how much of a “hit” is acceptable. Without a plan like this, you’re at risk of guessing which is the best migration pathway, and your decisions end up being guided by the tools (which are limited) rather than your own preservation needs.
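The “note to self” with a trigger date, mentioned in two of the scenarios above, can be as simple as one record per deferred decision. The sketch below is one possible shape for such a record; the field names, identifiers, and dates are hypothetical and not drawn from any particular preservation system or metadata standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DeferredMigration:
    """A recorded decision not to migrate yet, with a date to revisit it."""
    object_id: str
    format_name: str
    reason: str
    decided_on: date
    review_on: date

    def is_due(self, today: date) -> bool:
        return today >= self.review_on


def due_for_review(records: list, today: date) -> list:
    """Return the deferred decisions whose trigger date has arrived."""
    return [record for record in records if record.is_due(today)]


# Hypothetical entries for illustration only.
records = [
    DeferredMigration("obj-001", "Format A", "obsolescence risk judged distant",
                      date(2020, 1, 10), date(2030, 1, 10)),
    DeferredMigration("obj-002", "Format B", "no reliable migration tool yet",
                      date(2020, 1, 10), date(2022, 1, 10)),
]

due = due_for_review(records, today=date(2023, 6, 1))
print([record.object_id for record in due])  # prints ['obj-002']
```

The point is only that the decision, its rationale, and its review date are written down together, so the deferral does not quietly become permanent.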

Pros and cons of the JPEG2000 format for a digitisation project

In this post we want to briefly discuss some of the pros and cons of the JPEG2000 format. When it comes to selecting file formats for a digitisation project, choosing the right ones may help with continuity and longevity, or even access to the content. It all depends on the type of resource, or your needs, or the needs of your users.

If you’re working with images (e.g. for digitised versions of books, texts, or photographs), there’s nothing wrong with using the TIFF standard file format for your master copies. We’re not here to advocate using the JPEG2000 format, but it does have its adherents (and its evangelists).

PROS

May save storage space

This is a compelling reason and may be why a lot of projects opt for the JP2. Unlike the uncompressed TIFF typically used for master copies, it offers efficient lossless compression. This means it can be compressed to leave a smaller “footprint” on the server, and yet not lose anything in terms of quality. How? It’s thanks to the magic codec.
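To make “lossless” concrete: the compressed form is smaller, yet decompressing it returns exactly the original data. The sketch below shows that round trip using DEFLATE from Python's zlib module, which is a different and much simpler technique than the wavelet-based JPEG2000 codec; it stands in here only to illustrate the principle, and the “pixel” data is invented.

```python
import zlib

# Made-up stand-in for raw image data: a repeating gradient of byte values.
pixels = bytes(range(256)) * 64

packed = zlib.compress(pixels, 9)
restored = zlib.decompress(packed)

print(len(pixels), len(packed), restored == pixels)
```

The compressed copy takes up less room, and nothing is lost when it is expanded again.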

Versatility

In “old school” digitisation projects, we tended to produce at least two digital objects – a high resolution scan (the “archive” copy, as I would call it) and a low resolution version derived from it, which we’d serve to users as the access copy. Gluttons for punishment might even create a third object, a thumbnail, to exhibit on the web page / online catalogue. Conversely the JPEG2000 format could perform all three functions from a single object. It can do this because of the “scalable bitstream;” the image data is encoded so it only serves as much as is needed to meet the request, which could be for an image of any size.
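One way to picture the scalable bitstream is as a ladder of resolutions held inside a single file, each roughly half the width and height of the one above. The sketch below works out such a ladder; it assumes the image sits at the origin of the reference grid, and the function names, image size, and number of decomposition levels are my own illustrative choices rather than a recommended profile.

```python
def resolution_ladder(width: int, height: int, decomposition_levels: int = 5) -> list:
    """List the (width, height) at each resolution level, largest first.

    Each step down halves both dimensions, rounding up.
    """
    ladder = []
    for reduction in range(decomposition_levels + 1):
        scale = 2 ** reduction
        # -(-a // b) is integer division rounded up.
        ladder.append((-(-width // scale), -(-height // scale)))
    return ladder


def pick_level(ladder: list, min_width: int) -> tuple:
    """Smallest level that is still at least min_width wide."""
    candidates = [size for size in ladder if size[0] >= min_width]
    return candidates[-1] if candidates else ladder[0]


ladder = resolution_ladder(4000, 3000, decomposition_levels=3)
print(ladder)  # prints [(4000, 3000), (2000, 1500), (1000, 750), (500, 375)]
print(pick_level(ladder, min_width=800))  # prints (1000, 750)
```

A request for a thumbnail, an access image, or the full master can then be answered from the same object by decoding only as far down the ladder as needed.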

Open standard with ISO support

As indicated above, a file format that’s recognised as an International Standard gives us more confidence in its longevity, and the prospects for continued support and development. An “open” standard in this instance refers to a file format whose specification has been published; this sort of documentation, although highly technical, can be useful to help us understand (and in some cases validate) the behaviour of a file format.

CONS

Codec dependency

We mentioned the scalable bitstream above and the capacity for lossless compression as two of this format’s strengths. However, both require an extra bit of functionality above and beyond what most file formats are capable of (including the TIFF). This is the codec, which performs a compression-decompression action on the image data. The codec is a dependency: without it, the magic of the JPEG2000 won’t work. It is also one part of the format which remains something of a “black box,” a lack of transparency which may make some developers reluctant to work with the format.

Save As settings can be complex

In digitisation projects, the “Save As” action is crucial; you want your team to be producing consistent digitised resources which conform precisely to a pre-determined profile, for instance with regard to pixel size, resolution, and colour space. With a TIFF, these settings are relatively easy to apply; with the JPEG2000, there are many options and many possibilities, and it requires some expertise selecting the settings that will work for your project. Both the decision-making process, and the time spent applying them while scanning, might add a burden to your project.

Not yet the de facto standard

The “digital image industry,” if indeed there is such an entity, has not yet adopted the JPEG2000 file format as a de facto standard. If you’re inclined to doubt this, look at the hardware; most digital cameras and digital scanners tend to save to TIFF or JPEG, not JPEG2000.

In conclusion, this post is not aiming to “sell” you on one format over another; the process that is relevant is going through a series of decisions, and informing yourself as best you can about the suitability of any given file format. Neither is it a case of either/or; we are aware of at least one major digitisation project that makes judicious use of both the TIFF and the JPEG2000, exploiting the salient features of both.

File formats…or data streams?

On 1st December Malcolm Todd of The National Archives gave a good account of the work he’s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.

My naïve thinking in the area has been characterised by the assumption that the process is rather static or linear, and that the problem we’re facing is broadly the same every time; migrate data from a format that’s about to become obsolete or unsupported, onto another format that’s stable, supported, and open. MS Word document to PDF or PDF/A…now that, I can understand!

In fact, I learned at least two ways of thinking about formats that hadn’t occurred to me before. One simple one is costs; some formats can cost more to preserve than others. This can be calculated in terms of storage costs, multiplied over time, and the costs associated with migrations to new versions of that format.