
Ed The Archivist

Digital Preservation and Archives


Tag: metadata

PDF/A and read-only in SharePoint

In the previous post I made some observations about metadata in Office Documents in the SharePoint environment. In this brief follow-up post, I will mention what happens to PDF/A files in SharePoint. This is prompted by a comment received from Özhan Saglik. As part of this, I was also moved to try an experiment with Read-Only documents.

Well, the punchline to the first one is simple – PDF/A files don’t change at all in SharePoint. At least, that’s what I found with my limited testbed. This result need not surprise us, if we consider some basic facts:

  • PDF/A files are designed to “open as read only”. This is part of the standard, so they can’t be (carelessly) modified by anything. As a matter of fact, you should get a message at the top of the document when you open it, informing you that this is the case.
  • SharePoint was designed to work with Microsoft products, not Adobe products. To be precise, the main purpose of working in SharePoint is the sharing and editing capability; the developers have not expended any effort in adding an editor or other app for a file format which they don’t support. On the other hand, you can upload a PDF to SharePoint and let others see it.
  • There is no possibility of editing PDF/A (or even plain PDFs) within the SharePoint environment. See previous remark.

The above observations are endorsed by MS employees, for instance in this forum thread.

PDF/A experiments

For the record, for this experiment I used five files which are conformant to the PDF/A-1a flavour. I can say this with a fair degree of certainty because (a) this information is reported in their metadata and (b) they passed the VeraPDF test for conformance with the standard. As with my previous experiment, I did a before-and-after profiling of their properties using Apache Tika.
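For anyone wanting to script that validation step, a minimal sketch in Python might look like the one below. It simply shells out to the veraPDF command-line validator, so it assumes the verapdf launcher is on the PATH; the flag names and the file name are illustrative and may vary between releases.

```python
# Rough sketch: run the veraPDF command-line validator against a test file
# and print its validation report. Flags and file names are illustrative.
import subprocess

def verapdf_report(path, flavour="1a"):
    """Run the veraPDF CLI and return its validation report as plain text."""
    result = subprocess.run(
        ["verapdf", "--flavour", flavour, "--format", "text", path],
        capture_output=True, text=True,
    )
    return result.stdout

print(verapdf_report("sample_pdfa.pdf"))  # hypothetical test file
```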

I was able to upload the PDF/A files into SharePoint, and download them; that’s as far as it goes, since (as noted) no editing of a PDF within SharePoint is possible. All the metadata remained completely intact, and the upload-download process did not change anything. Results can be seen in comparative_table_3.xlsx.

Download comparative tables as a zip file (27.9 KB)

However, one aspect still intrigues me. It’s a property – or something – which is reporting on a date and time. It displays as “Modified” in SharePoint, and also displays as ‘Date Modified’ in Windows Explorer. I am assuming, perhaps wrongly, that both of these are the same thing. If we look at this screenshot, it shows my PDF/A files after I’d dragged them in:

And after I’d downloaded them back onto my Desktop, Windows Explorer likewise reported today’s date:

These dates in no way match the modified dates of the original PDFs as reported in Windows:

What’s also interesting is that I can’t find this “Date modified” value anywhere in the Apache Tika report.

I’m assuming that, in Windows Explorer at least, this is a timestamp, or a file time value which gets written by the NTFS file system. These values change in response to certain events, such as copying or drag-and-dropping. This page seems to describe what I think might be happening.
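If you want to see exactly which file times Windows is reporting, they can be read directly from the file system rather than from anything inside the document. Here is a minimal sketch using Python’s os.stat; the file name is hypothetical, and note that on Windows st_ctime is the creation time while st_mtime is the “Date modified” value that Explorer displays.

```python
# Read the file-system timestamps that Windows Explorer reports. These live
# in NTFS, not inside the document, which is why Tika never reports them.
import os
from datetime import datetime

def file_times(path):
    st = os.stat(path)
    return {
        "created (st_ctime on Windows)": datetime.fromtimestamp(st.st_ctime),
        "modified (st_mtime)": datetime.fromtimestamp(st.st_mtime),
        "accessed (st_atime)": datetime.fromtimestamp(st.st_atime),
    }

for label, stamp in file_times("sample_pdfa.pdf").items():  # hypothetical file
    print(f"{label}: {stamp}")
```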

Maybe Read-Only is the key?

I was about to scribble something along the lines of “PDF/A is a robust format – and here’s why!” Perhaps then setting myself the impossible job of understanding (and explaining) how the PDF/A encoding meets some of our expectations for long-term preservation through the complex set of conditions that the standard imposes. But what if it’s much simpler than that? Perhaps a PDF/A file is impervious to SharePoint changes for one simple reason – it’s Read-Only.

If that’s true, I wondered, would MS Office documents behave nicely if I set their Properties to Read Only before uploading them into SharePoint, where metadata changes would otherwise be expected?

I decided to revisit my MS Office Documents testbed from the previous experiment and try this out. The results can be seen in comparative_table_4.xlsx, and they were extremely encouraging.
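For the record, the Read Only flag doesn’t have to be set by hand. A small sketch like the one below could set it programmatically and record a checksum at the same time, ready for the before-and-after comparison. As far as I know, os.chmod with stat.S_IREAD is the scripted equivalent of ticking the Read-only box in the file’s Properties; the file names are made up.

```python
# Set the Windows read-only attribute on each test document before upload,
# and record a SHA-256 checksum so the downloaded copy can be compared later.
import hashlib
import os
import stat

def make_read_only(path):
    # Clearing the write bits sets the "Read-only" box you would otherwise
    # tick by hand in the file's Properties dialog.
    os.chmod(path, stat.S_IREAD)

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

for doc in ["test.doc", "test.xls", "test.ppt"]:  # hypothetical testbed files
    make_read_only(doc)
    print(doc, sha256(doc))
```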

For the “old style” Office documents, the only thing that has changed between upload and download is the checksum. From this evidence, perhaps we might conclude that SharePoint has simply generated a “new” digital object in some way – which is pretty much the expected behaviour, is it not?

In the case of the three Office Open XML files, SharePoint added a new Custom Field that wasn’t there before. The Content Length also changed.

However, unlike what happened in my previous experiments, all the properties and metadata remain the same, including that all-important date of creation.

Lessons learned

  1. Documents authored to be conformant to the PDF/A standard seem to be impervious to change in SharePoint. This may mean the format is a good thing, from that point of view.
  2. Office Documents, if set to Read Only before upload to SharePoint, are much better protected from SharePoint change than those that are not. Of course, if you start to edit the document in SharePoint it’s another matter completely. But this way we stand a chance at least of uploading an authentic “original” document with metadata and properties intact, which could be used as a ground-zero point of comparison for any subsequent changes that are made.
  3. Ticking the Read Only box is a simple expedient which can be achieved without even opening the document (simply right-click on the object in Windows Explorer). A purist might argue that doing so constitutes making a change to the file which violates archival principles; but does it? Maybe we’re only ticking a box in properties which has no profound effect on the content. Maybe the added protection it affords us is a trade-off that is worth making.

These are not especially ground-breaking revelations, but they may help an archivist or records manager who is facing a SharePoint rollout, and provide some clues for a practical workaround which may help to mitigate this metadata loss.

Author: Ed Pinsent | Posted on 5th June 2018, updated 30th April 2019 | Categories: Digital Preservation | Tags: file formats, metadata, PDF, SharePoint

Metadata and Properties In SharePoint

“No changes were made to your original file”

Today’s blog post makes a few observations about the way file formats behave. To be more specific, we’re talking about MS Office documents and what happens to them in MS SharePoint.

In my superficial way, I have noticed that when a document gets uploaded to SharePoint we can now look at it through a web browser. This is because SharePoint is predicated on the idea that we can all work in the cloud instead of being tied down to a local server. The browser view we are offered is not unlike Windows Explorer. In this view, we can see a folder structure, a file name, and also columns indicating dates (the Modified column), the Owner Name (Modified By) and Size (File Size).

I could see in this view that when a document gets uploaded, the date displayed is the date when it was added to SharePoint. This made me wonder what happened to the original Date Of Creation, something we worry about if we’re archivists or records managers. Further, I wondered if other metadata was being affected by this drag-and-drop action.

Tests

I did some tests using the Apache Tika utility, which is capable of exposing Properties in many file formats. In the case of Office documents, these properties can be a rich mix of dates, text strings, and technical metadata. I’m naively assuming these things are inscribed in the document in some way, by a combination of the application (e.g. MS Word) and the Windows file system, e.g. NTFS.

My method was to start with a small test bed of Office documents (one .doc, one .xls, and one .ppt). I got these from the OPF format corpus. I wanted to carry out a simple before-and-after comparison. First I profiled all the documents before upload, and pasted all the metadata into my table.

Then I tried three operations using SharePoint: (1) Drag and drop (2) Edit in the Browser (3) Edit in Word. The first one is simply moving (copying) the document from Windows Explorer into the SharePoint environment. The editing operations refer to the two options available: “make quick changes in the Browser”, or launch the application for more functionality. Option (2) uses the web-based versions of Word, Excel and PowerPoint, which don’t quite have all the functions you’d expect, but still enable a user to carry out some limited edits.

After each action, there was a change to the test object. I downloaded the changed object in each case, and ran Apache Tika to see what the profiles looked like. I then pasted all the results into my table so I could make comparisons.
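In case it’s useful, the comparison itself can be scripted. Here is a rough sketch that takes a Tika profile before upload and another after download, and writes a simple side-by-side table flagging every property that changed. It assumes the tika-python wrapper (which needs Java available) and invented file names; my actual tables were assembled by hand in a spreadsheet.

```python
# Build a simple comparison table from two Tika profiles: one taken before
# upload and one after download, with a flag for any property that changed.
import csv
from tika import parser  # pip install tika

def profile(path):
    """Return the metadata dictionary that Tika extracts from the file."""
    return parser.from_file(path).get("metadata", {})

before = profile("original.doc")       # hypothetical file names
after = profile("downloaded.doc")

with open("comparative_table.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Property", "Before upload", "After download", "Changed?"])
    for key in sorted(set(before) | set(after)):
        b, a = before.get(key, ""), after.get(key, "")
        writer.writerow([key, b, a, "yes" if a != b else ""])
```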

Click to download comparative tables

What changed?

There are a number of changes which you can see if you download my tables. Look at comparativetable.xlsx. The most obvious and profound change is to the dates, especially the date of creation. This property remains unchanged if we just drag and drop; but when we start editing, either in browser or in application, the date of creation appears to change to the date when editing was carried out. The PowerPoint file kept its date of creation, but the other two didn’t; so now the only evidence we have of original dates is in the “Date Printed” property.

The second profound change is to the file size. In each case, this changed quite noticeably; even the act of dragging and dropping introduced a change to file size. Given the fact that the checksum is also new, it looks as though SharePoint is creating a new digital object in some way, and “injecting” it with something (I have no idea what) that makes it larger.

We’ve also tended to lose properties like Last-Author, which can get overwritten in SharePoint. In one exceptional case, there’s also a puzzling report on the page count of my Word Document, which mysteriously changes from 1 page to 29 pages.

What about newer documents?

So far so good. However, this experiment has been applied to “old” Microsoft Documents, by which I mean documents authored before the introduction of the Office Open XML standard. I thought I had better try out the same experiments with some more recent documents, and so selected a testbed of one .docx file, one .pptx file, and one .xlsx file. The same before-and-after actions as above were carried out. Results are available in my second table, comparativetable_2.xlsx.

This time the changes were nowhere near as profound. In each case, the Date of Creation is intact, a result likely to reassure obsessive archivists like myself. There are still some minor losses, but most of the elements highlighted in red are as expected (i.e. they reflect the fact that editing took place). However, SharePoint evidently still continues to “inject” something to make the files change size.

What’s going on?

One thing that might be happening here is not limited to SharePoint, but reflects Microsoft’s commitment to forward compatibility. When you launch an old MS Document in a more recent version of the application, it offers to perform file conversion for you. The user receives notification messages that this is happening. As a matter of fact, we received such notifications in the course of this experiment, such as these:

One result of this is that SharePoint now helpfully stores two iterations of your file for you. One of them is the “original”, the other is the “conversion”. However, the extent of reporting on changes is restricted to a rather vague generic message about “changes to the layout”. Well, if my tests indicate anything, it’s more than just a layout change.

Does any of this matter?

From a digital preservation point of view, I would say yes it does. I don’t think any of us would be too happy about a process that seems to over-write the date of creation of a resource; and more to the point doesn’t really tell us that the change is happening.

I don’t think I need stress the importance of dates for record-keeping, and other embedded properties may also add value. Indeed, one approach to digital preservation as it applies to file formats is to carry out extensive profiling of ingested files, extract and copy the metadata, and store it within the Archival Information Package. If we’re even more clever, we can parse the properties into separate fields and manage them in a preservation database.
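By way of illustration, the extract-and-store step could be as modest as the sketch below: profile the file with Tika and keep the result as a JSON “sidecar” alongside it in the package. The tika-python wrapper and the file name are assumptions, and a real AIP would of course follow whatever packaging structure your repository uses.

```python
# Extract embedded properties with Tika and store them as a JSON "sidecar"
# next to the file, so the package carries a copy of the metadata even if
# the original is later altered by a platform such as SharePoint.
import json
from pathlib import Path
from tika import parser  # pip install tika

def store_sidecar(path):
    meta = parser.from_file(str(path)).get("metadata", {})
    sidecar = Path(str(path) + ".metadata.json")
    sidecar.write_text(json.dumps(meta, indent=2, default=str))
    return sidecar

print(store_sidecar("report.docx"))  # hypothetical ingested file
```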

I’m aware that the value of doing this is disputed, and that we’re continuing to have discussions and conversations about “significant properties” in our community. But if any of my observations are correct, it seems that SharePoint is performing a species of migration on our content (they call it “conversion”), and introducing changes without really telling us the extent of these changes.

The lesson, if indeed there is one, might be that “old” Office documents need some care and attention before upload to SharePoint, if these properties are important to you and your users.

Additional thoughts

If we find ourselves moving content into SharePoint, do we have to do it by a drag and drop action? To put it another way, are there other ways of moving files so we can protect these properties? Probably. One possibility is the TeraCopy tool, and another possibility is file compression.

TeraCopy is a Windows tool which offers a more sophisticated form of drag-and-drop. It evidently integrates well with Windows Explorer, although the vendors don’t claim that it works with SharePoint. While I do have the free version, I haven’t experimented with it as part of this test.

TeraCopy includes checksum verification as part of its capabilities, which is why it’s bound to appeal to digital archivists. Additionally, it claims to do something to keep date properties intact:

As to file compression, this would involve zipping up the target files into a single compressed object (e.g. .zip or .7z) and moving this into SharePoint. In some other unrelated experiments, I have found this does indeed protect the dates and other properties from any unwarranted change. However, it’s arguably pretty pointless to put a zipped object into SharePoint, as this will probably defeat the faceted views and collaborative aspects that the platform offers.
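For completeness, here is what the compression route might look like in a few lines of Python. The ZIP format stores each file’s last-modified timestamp inside the archive entry itself, which, as far as I can tell, is why the dates survive the move; the file names are invented.

```python
# Zip the target files before moving them into SharePoint. The ZIP entry for
# each file carries its last-modified timestamp, so that date at least is
# preserved regardless of what the destination file system does.
import zipfile

files_to_protect = ["minutes.doc", "budget.xls"]  # hypothetical files

with zipfile.ZipFile("transfer_package.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in files_to_protect:
        zf.write(name)  # write() records the file's mtime in the archive entry

# Listing the archive shows the stored timestamps:
with zipfile.ZipFile("transfer_package.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.date_time)
```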

Author: Ed Pinsent | Posted on 1st June 2018, updated 10th May 2019 | Categories: Digital Preservation | Tags: metadata, properties, SharePoint

Metadata creation for digitisation: counting the costs

Speaking as a traditional archivist, I love cataloguing. I never thought I’d find myself having to justify cataloguing work, but given that it’s possible to attach a cost to everything these days, I find it is a serious consideration.

Experts who understand that specialist work has a real cost will try and tell us that detailed cataloguing might be turning into a luxury we can’t afford. This post will try and consider some of the things that make it expensive, and apply the lessons to digitised content.

I’m proposing that metadata can serve two important functions. One, to make digitised content intelligible to human beings; two, to make it possible for the computer to store, manage, and process that content.

Human-readable catalogues for your digitised collections are an absolute must, whether it’s an archive catalogue written in ISAD(G), a library catalogue written in MARC21, or a resource described using Dublin Core. We have standards we can work to, and increasingly we have computer-based cataloguing tools (such as Calm, Adlib, or AtoM) that facilitate the task. I would like to think of these tools as something that helps to turn human-readable descriptions into metadata, i.e. something that a computer can store and process.

That’s great if we’re writing a catalogue from scratch, but that’s not always the case; sometimes the original resource has metadata attached to it, perhaps created by its owner. Except that person probably didn’t work to a standard, and so if we want to recycle that metadata, we might be faced with a “mapping” task, normalising their non-standard metadata to standard fields, which is both an intellectual exercise and an IT task, involving importing and exporting values between spreadsheets.
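To give a flavour of the IT side of that mapping task, a sketch like the one below could apply an agreed crosswalk from the owner’s column headings to Dublin Core elements. The headings, the mapping, and the file names here are all invented for illustration; the intellectual work is in deciding the mapping, not in the script.

```python
# Normalise owner-supplied spreadsheet columns to Dublin Core elements.
# The mapping itself is the intellectual exercise; the script just applies it.
import csv

COLUMN_MAP = {                      # hypothetical owner headings -> Dublin Core
    "Doc name": "dc:title",
    "Written by": "dc:creator",
    "When made": "dc:date",
    "What it is about": "dc:subject",
}

with open("owner_metadata.csv", newline="") as src, \
     open("normalised_metadata.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(COLUMN_MAP.values()))
    writer.writeheader()
    for row in reader:
        writer.writerow({COLUMN_MAP[k]: v for k, v in row.items()
                         if k in COLUMN_MAP})
```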

Records managers also might take an interest in a normalising process like that; describing business documents in a records management environment, ensuring the context and meaning of the content is accurate and useful. The difference is they might be applying that metadata in a live environment, rather than applying it after the fact. Anyone who’s about to embark on a SharePoint project will recognise this; one way of looking at the transition from your old Document Management System to SharePoint is to see it as a vast metadata modelling exercise. Given the amount of metadata which SharePoint can support – both for individual documents and for folders and creators – this is worth thinking about.

It’s not just about building an inventory of the resources, but wouldn’t we like to apply our cataloguing skills to help users on their journey by adding navigational elements to web pages, such as structured views, clickable links, and faceted views of the collections based on elements such as dates, names, and subjects? This is totally possible, if you regard all of these things as metadata, individual fields which can be stored in databases and manipulated by web technology.

All of these things take time and money, but the expense is in the cost of information specialisms and expertise, and hours of effort spent carrying out the work. Metadata also can be “computationally expensive,” though. What we mean by this is there’s a potential cost to your IT.

A large-scale digitisation project, particularly if it intends to get serious about metadata creation, sharing, and interoperability, will typically create a lot of pages and possibly store them in XML files. These XML files can have many purposes, including describing the resource, and expressing its relationship to other resources.

Creating lots of XML pages is a grand thing to do, but even so they can take up server space – especially when there are so many of them, even if individually each file has a small “footprint”. It also can be expensive to index that metadata, which requires database operations and processing power; and even serving metadata may have a cost attached to it, as it can be calculated as one more strain on your bandwidth.

The general conclusion here is certainly not to abandon cataloguing and metadata creation, but to be aware of the costs to your organisation, and consider ways of reducing the burden, finding economies of scale, and concentrating your effort on delivering a core of essential metadata for your digitised content. This of course involves knowing the collections, and knowing the users. But that would be the subject of another post!

Author: Ed Pinsent | Posted on 19th January 2017, updated 14th April 2019 | Categories: Digitisation | Tags: metadata

IPTC Photo Metadata Conference

The DART team will attend the IPTC Photo Metadata conference in Zagreb on 26 May 2016. The theme is “Keep Metadata Alive and Intact”. Ed Pinsent will be speaking in the morning session, which is themed on “Strongly Attached Metadata, what you need to know”.

We think the Conference will allow us to speak to various image management experts, people and organisations who manage picture libraries, who may have an interest in IPTC metadata and the management of their collections with a Digital Asset Management System (DAMS).

Sarah Saunders of Electric Lane works with a lot of these professional image management people. When she came on our DPTP Course recently, she noticed a few things:

  • There’s more to preservation of image files (e.g. TIFFs or JPEGs) than most people think
  • Elements of a possible digital preservation repository / system, and its workflow, overlapped to some degree with what she understood about the production chain for images, and the place of the DAMS, which leads to…
  • The idea (which we tend to teach on DPTP) that a preservation system doesn’t have to be a single system, but rather could repurpose existing systems (or elements of them) to arrive at a whole that is OAIS-compliant; for instance, one system performing storage, one for access, one for ingest.
  • She liked our insistence on the management of technical metadata and other useful metadata embedded in files

IPTC Photo Metadata Conference – Our Talk

From talks with Sarah there evolved the notion that I might be able to deliver a presentation which expresses some of these messages specifically targeted at image management experts. With that in mind, I’ve tried to devise a blue-sky thinking slide show that covers the following:

One – Drivers: why this audience might be interested in applying digital preservation to their image collections.

Two – How to do it for image files, involving some simple overviews of migration and technical metadata extraction. While image files will have generic technical metadata, e.g. concerning the size, resolution, and colour of the image, there is also specialist metadata. Of especial interest to this audience, we think, will be the management of IPTC metadata and EXIF metadata.

These are two specialist types of metadata which by and large only apply to digital image files. Broadly, IPTC metadata can be used to protect rights and ownership of images; and EXIF metadata records details about the hardware (camera, scanner) that was used to create the image.

Interestingly, although it’s possible to embed these metadata in some formats (e.g. TIFF, JPEG, and JPEG 2000), neither metadata type is guaranteed to survive permanently – especially if the file is migrated.
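One practical way to test that survival is simply to read the embedded metadata before and after a migration and compare. Here is a rough sketch using the Pillow library; whether the blocks can be read at all depends on the format and on Pillow’s support for it, and the file names are invented.

```python
# Read EXIF and IPTC blocks from an image with Pillow, so the same check can
# be run before and after a migration to see what survived.
from PIL import Image, IptcImagePlugin  # pip install Pillow

def read_embedded_metadata(path):
    with Image.open(path) as im:
        exif = dict(im.getexif())                     # EXIF tag id -> value
        iptc = IptcImagePlugin.getiptcinfo(im) or {}  # IPTC (record, dataset) -> value
    return exif, iptc

before_exif, before_iptc = read_embedded_metadata("master.tif")   # hypothetical
after_exif, after_iptc = read_embedded_metadata("migrated.jp2")   # hypothetical

print("EXIF tags lost:", set(before_exif) - set(after_exif))
print("IPTC fields lost:", set(before_iptc) - set(after_iptc))
```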

There’s also descriptive metadata created by a curator to help describe and identify images – names, keywords, dates. Quite often this is part of a Digital Asset Management System, and will be exposed and published online to make the images more meaningful and accessible to an audience.

Is any of this metadata useful in the long term? I would argue that it is, and maybe we need to learn how to protect it better.

Author: Ed Pinsent | Posted on 28th April 2016, updated 14th April 2019 | Categories: Events | Tags: Ed Pinsent, IPTC metadata conference, metadata

Ask not what we must do for PREMIS…

I very much enjoyed the DPC’s latest event on metadata, particularly the first half of the day which concentrated on the PREMIS preservation metadata standard. One of my interests is how I can improve my teaching when I’m training students on the Digital Preservation Training Programme on this subject. Angela Dappert’s excellent presentation and exercise, now available here, has been enormously helpful for this.

My tendency has been to introduce standards like PREMIS and METS to my eager students in a linear top-down manner, explaining data models and structure…only to find them somewhat overwhelmed by the detail and the degree of effort that seems to be required in implementing it. I sense some students get the impression that they are (a) compelled to use these standards in order to succeed at digital preservation in the first place, or (b) obliged to implement them in a certain way. The worst case would be if they assume they have to use all the fields they possibly can, to arrive at a “complete” profile of a digital object.

Angela’s common-sense approach is to turn this question on its head. To paraphrase JFK’s inaugural address, we should “Ask not what we must do for PREMIS, ask what PREMIS can do for us.” Angela puts the questions in this order:

  1. What are the entities and objects we need to describe?
  2. What metadata do we need to do that?
  3. Which standard do we use for which metadata?
  4. How do we implement the selected metadata schemas?

When you do it this way, it becomes clear in short order that the range of metadata you actually need to be collecting and storing turns out to be much more manageable. You choose what you want, then find a metadata standard that suits your needs. Further, your selection decisions – because they are aligned to your overall selection policy – will be driven to some extent by what your user community want, and what your repository can support. How much time can your staff actually spend extracting and parsing metadata, and are they really adding any value by doing it? Is there really an audit requirement that obliges you to demonstrate you have run a virus check three times?
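To show how small the footprint can be once you have decided what you actually need, here is a sketch of a single PREMIS Event (one virus check) built with Python’s standard XML tooling. The element names and namespace follow PREMIS 3 as I understand them, so treat this as illustrative rather than a validated record.

```python
# Sketch of a minimal PREMIS 3 Event record for one virus check.
# Element names follow the PREMIS 3 schema as I understand it; validate
# against the data dictionary and schema before relying on this.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PREMIS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS)

def premis_event(event_type, detail, outcome):
    ev = ET.Element(f"{{{PREMIS}}}event")
    ident = ET.SubElement(ev, f"{{{PREMIS}}}eventIdentifier")
    ET.SubElement(ident, f"{{{PREMIS}}}eventIdentifierType").text = "local"
    ET.SubElement(ident, f"{{{PREMIS}}}eventIdentifierValue").text = "event-001"
    ET.SubElement(ev, f"{{{PREMIS}}}eventType").text = event_type
    ET.SubElement(ev, f"{{{PREMIS}}}eventDateTime").text = (
        datetime.now(timezone.utc).isoformat()
    )
    detail_info = ET.SubElement(ev, f"{{{PREMIS}}}eventDetailInformation")
    ET.SubElement(detail_info, f"{{{PREMIS}}}eventDetail").text = detail
    outcome_info = ET.SubElement(ev, f"{{{PREMIS}}}eventOutcomeInformation")
    ET.SubElement(outcome_info, f"{{{PREMIS}}}eventOutcome").text = outcome
    return ev

event = premis_event("virus check", "Scanned on ingest with ClamAV", "pass")
print(ET.tostring(event, encoding="unicode"))
```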

The four lessons I jotted down, and will be adding to the DPTP course, are:

  1. Seek more information from your content creators when you need it – and don’t be afraid to ask for it!
  2. Ask the creators for a manifest of all the files in their submission. (When I was an archivist, I’d always insist on a transfer list…)
  3. Will this metadata be useful? Always ask what function you are supporting with your hard work.
  4. Analyse and understand your domain – what can your repository support?

At a stroke, Angela has shown how PREMIS is achievable when we start showing that verbose data dictionary just who’s the boss around here!

Author: Ed Pinsent | Posted on 25th April 2013, updated 1st May 2019 | Categories: Digital Archives | Tags: metadata, PREMIS

A Tab in the Ocean

I’ve been using Web Curator Tool (WCT) to curate the JISC website collection at UKWA since 2008. I’ve long been aware that the system offered me the opportunity to record a lot of metadata, in tabs called General, Annotations, Groups and Access. It’s a mix of technical metadata (about the gather / website) and descriptive metadata. It’s mainly of value to the curator who wants to keep track of what they’re doing with the website gathering; but WCT also allows us to create some descriptive metadata for exposure. At the bare minimum, we’re required to use Groups; despite its name, this component is actually a simple subject classification scheme, allowing me to tag all my websites with “Higher Education” for example. Once stored in the WCT database and rendered through Wayback Machine, this subject selection translates into this useful view of the collection.

Recently the British Library team approached all the users of the shared WCT tool. It seems that the curators involved in UKWA have been using these metadata fields slightly differently and the BL team have initiated a project to move towards more consistency. The project will involve deciding on definitive interpretations of how to use these fields, followed by a process of cleaning up legacy data stored in the system. Some of it is potentially useful, some of it not so useful; some is legacy from the earlier PANDAS phase of the project, mostly not needed, or entered into the wrong field.

As noted, a lot of this metadata is mainly to do with selection and evaluation decisions, curation information such as changes in status of the site, and as such it’s never been exposed anywhere except within WCT. However, one descriptive field will eventually end up exposed on the UKWA live site, and provide us cataloguer types an opportunity to describe the resources in more detail. It will appear on the Title Entry Page (TEP) for each instance.

I welcome any move towards exposing more descriptive metadata on the UKWA public site. I have always taken the view that the phrase which currently appears alongside a Title, “The live site may provide more information”, is not really very helpful in the context of a web archive, for three reasons: (1) we don’t want our users clicking away from UKWA; (2) the link to the live site may be dead by now; and (3) as archivists and curators, I feel strongly that we are the ones who should be providing that “more information” in the shape of a catalogue description of some kind.

The JISC project sites, as a collection, have high evidentiary value as stages in development of very specific tools, services and activities that benefit the UK Higher Education community. The sites by themselves don’t always explain their history or intentions; I would argue that a lot of rich contextual detail about the reasons these sites existed (the JISC programme under which they were developed, the dates, the staff involved, the themes, the outputs) would help interpret the collection to the users and make it more intelligible.

Author: Ed Pinsent | Posted on 24th May 2011, updated 15th April 2019 | Categories: DA Blog, Digital Archives | Tags: metadata, UKWAC, web archiving

BlogForever: Thoughts about blog data and metadata

From the BlogForever blog.

During the ArchivePress project at ULCC, we briefly considered the data and metadata generally made available with blogs and blog posts. As ArchivePress focused on the representations of blogs in newsfeeds, we examined the metadata that is generated in common, and exposed in the newsfeeds of three of the most common blog platforms, WordPress, Blogger and TypePad. Blogger and Typepad prefer the Atom newsfeed format; WordPress (particularly WordPress.com) prefers RSS (though it can be made to publish Atom feeds too). This analysis was done about a year ago, and things may have changed since, but here is a summary of what we found.

For each Blog, the following core information is available in the feeds:

WordPress (RSS) Blogger (Atom) Typepad (Atom)
Feed Unique ID NA feed/id feed/id
Blog URL rss/channel/link feed/link@rel="alternate" feed/link@rel="alternate"
Blog Title rss/channel/title feed/title feed/title
Blog Description rss/channel/description feed/subtitle feed/subtitle
Date of last update rss/channel/lastBuildDate feed/updated feed/updated
Generating software rss/channel/generator feed/generator feed/generator

For each Post, we established that the following core information is available in the newsfeeds:

WordPress (RSS) Blogger (Atom) Typepad (Atom)
Post Unique ID rss/channel/item/guid@isPermaLink feed/entry/id feed/entry/id
Post Title rss/channel/item/title feed/entry/title feed/entry/title
Post Summary rss/channel/item/description NA feed/entry/summary
Post URL rss/channel/item/link feed/entry/link@rel="alternate" feed/entry/link@rel="alternate"
Date of publication rss/channel/item/pubDate feed/entry/published feed/entry/published
Date of last update NA feed/entry/updated feed/entry/updated
Post Author rss/channel/item/dc:creator (namespace rss/xmlns:dc) feed/entry/author/name feed/entry/author/name
Post Category rss/channel/item/category feed/entry/category@term feed/entry/category@term
Post Content rss/channel/item/content:encoded (namespace rss/xmlns:content) feed/entry/content feed/entry/content
Post Comments rss/channel/item/comments feed/entry/link@rel="replies" feed/entry/link@rel="replies"
Post Comments Feed rss/channel/item/wfw:commentRss NA NA
One interesting point we noted was that neither Blogger nor Typepad published a link to a Comments Feed for each post. This made our work on ArchivePress more difficult since it was predicated on being able to easily identify the Comments feed for each post, and harvest new Comments as they were published. Obviously for blogs generated other than by WordPress, this was not going to be so easy. (Our ace developer Emanuele found some workarounds, but that’s another story.)

I think this offers us an interesting overview of the core of standard, structured blog data and metadata, in three of the leading blog platforms. This is the data structure and metadata profile that is maintained in blog databases, in one of its native forms, and I’d expect it to be present in all blog platforms, since it arguably represents the essence of blogs. I hope this will be useful background when considering the core models for data and metadata handling that will be developed for BlogForever.
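As a footnote, the same core fields can be pulled from a live feed with a generic parser. Here is a minimal sketch using the feedparser library, which normalises RSS 2.0 and Atom into one structure, so a single script covers WordPress, Blogger and Typepad feeds alike; the URL is a placeholder.

```python
# Pull the core blog and post fields from a newsfeed. feedparser normalises
# RSS 2.0 and Atom, so the same attribute names work for all three platforms.
import feedparser  # pip install feedparser

feed = feedparser.parse("https://example.wordpress.com/feed/")  # hypothetical URL

print("Blog title:", feed.feed.get("title"))
print("Blog link:", feed.feed.get("link"))
print("Last updated:", feed.feed.get("updated"))
print("Generator:", feed.feed.get("generator"))

for entry in feed.entries[:5]:
    print("-" * 40)
    print("Post ID:", entry.get("id"))
    print("Title:", entry.get("title"))
    print("URL:", entry.get("link"))
    print("Published:", entry.get("published"))
    print("Author:", entry.get("author"))
    print("Categories:", [t.get("term") for t in entry.get("tags", [])])
```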

Author: Ed Pinsent | Posted on 25th April 2011 | Categories: Web Archiving | Tags: blog, BlogForever, blogs, data, data model, European Commission, metadata, newsfeeds, RSS, web archiving
