Conference on web-archiving: reconciling two curation methods

One of the first things I did in digital preservation was a long-term web-archiving project, so I have long felt quite close to the subject. I was very pleased to attend this year’s IIPC conference at Senate House in London, which combined to great effect with the RESAW conference, ensuring wide coverage and maximum audience satisfaction in the papers and presentations.

In this short series of blog posts, I want to look at some of the interesting topics that reflect some of my own priorities and concerns as an archivist. I will attempt to draw out the wider lessons as they apply to information management generally, and readers may find something of interest that puts another slant on our orthodox notions of collection, arrangement, and cataloguing.

Government publications at the BL

Andy Jackson at the British Library is facing an interesting challenge as he attempts to build a technical infrastructure to accommodate a new and exciting approach to collections management.

The British Library has traditionally had custodial care of official Government papers. They’ve always collected them in paper form, but more recently two separate curation strands have emerged.

The first has been through web-archiving, where as part of the domain-wide crawls and targeted collection crawls, the BL has harvested entire government websites into the UK Web Archive. These harvests can include the official publications in the form of attached PDFs or born-digital documents.

The second strand involves the more conventional route followed by the curators who add to The Catalogue, i.e. the official BL union catalogue. It’s less automated, but more intensive on the quality control side; it involves manual selection, download, and cataloguing of the publication to MARC standards.

Currently, public access to the UK Web Archive and to The Catalogue are two different things. My understanding is that the BL are aiming to streamline this into a single collection discovery point, enabling end users to access digital content regardless of where it’s from, or how catalogued.

Two curation methods

Andy’s challenges include the following:

  • The two curation methods involve thinking about digital content in quite different ways. The first is more automated, and allows the possibility of data reprocessing. The second has its roots in a physical production line, with clearly defined start and end points.
  • Because of its roots in the physical world, the second method has a form of workflow management which is closely linked to the results in the catalogue itself. It seems there are elements in the database which indicate sign-off and completion of a particular stage of the work. Web crawling, conversely, resembles a continual ongoing process, and the cut-off point for completion (if indeed there is one) is harder to identify.
  • There is known to be some duplication taking place, duplication of effort and of content; to put it another way, PDFs known to be in the web archive are also being manually uploaded to the catalogue.

In response to this, Andy has been commissioned to build an over-arching “transformation layer” model that encompasses these strands of work. It’s difficult because there’s a need to get away from a traditional workflow, there are serious synchronisation issues, and the sheer volume of content is considerable.

I’m sure the issues of duplication will resonate with most readers of this blog, but there are also interesting questions about reconciling traditional cataloguing with new ways of gathering and understanding digital content. One dimension to Andy’s work is the opportunity for sourcing descriptive metadata from outside the process; he makes use of external Government catalogues to find definitive entries for the documents he finds on the web pages in PDF form, and is able to combine this information in the process. What evidently appeals to him is the use of automation to save work.
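
What might that catalogue-matching step look like in practice? Here is a minimal sketch of my own (not Andy’s actual pipeline), in Python: it assumes a locally downloaded catalogue export, and the file name, column headings and URL-based matching rule are all invented for illustration.

```python
# A sketch only: match PDFs found in a crawl against a downloaded export of
# an external government catalogue, keyed by the publication's URL. The file
# name, column headings and matching rule are assumptions for illustration.
import csv

def load_catalogue(path):
    with open(path, newline="", encoding="utf-8") as f:
        return {row["document_url"]: row for row in csv.DictReader(f)}

def enrich(harvested_pdfs, catalogue):
    """Attach the definitive catalogue entry to each harvested PDF, where one exists."""
    return [{**pdf, "descriptive_metadata": catalogue.get(pdf["source_url"])}
            for pdf in harvested_pdfs]

catalogue = load_catalogue("gov_catalogue_export.csv")
harvest = [{"source_url": "https://www.gov.uk/government/publications/example.pdf",
            "warc_record_id": "urn:uuid:0a9ecf4f-ab79-4b6b-b52a-1c9d4e1bb12f"}]
print(enrich(harvest, catalogue))
```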

Andy has posted his talks here and here.

How archival is that?

My view as an archivist (and not a librarian) would involve questions such as:

  • Is MARC cataloguing really suitable for this work? That isn’t meant as a challenge to professional librarians – I’d level the same question at ISAD(G) too, a standard with many deficiencies when it comes to describing digital content adequately. On the other hand, end-users know and love MARC, and are still evidently wedded to accessing content by subject, title, and author.
  • The issue of potential duplication bothers me as (a) it’s wasteful and (b) it increases ambiguity as to which one of several copies is the correct one. I’m also interested, as an archivist, in context and provenance; it could be there is additional valuable contextual information stored in the HTML of the web page, and embedded in the PDF properties; neither of these are guaranteed to be found, or catalogued, by the MARC method. But this raises the question, which Andy is well aware of, “what constitutes a publication”?
  • I can see how traditional cataloguers, including my fellow archivists, might find it hard to grasp the value of “reprocessing” in this context. Indeed it might even seem to cast doubts on the integrity of a web harvest if there’s all this indexing and re-indexing taking place on a digital resource. I would encourage any doubters to try and see it as a process not unlike “metadata enrichment”, a practice which is gaining ground as we try to archive more digital material; we simply can’t get it right first time, and it’s within the rules to keep adding metadata (be it descriptive / technical, hand-written or automated) as our understanding of the resource deepens, and the tools we can use keep improving.

Keep an eye out for the next blog post in this mini-series. 

When is it a good time for a file format migration?

I used to teach a one-day course on file format migration. The course advanced the idea that migration, although one of the oldest and best-understood methods of enacting digital preservation, can still carry risks of loss. To mitigate that loss, it made the case for defining use cases and acceptance criteria – good old-fashioned planning, in short.

When would it be a good time to migrate a file? And when would it be good not to migrate, or at any rate defer the decision? We can think of some plausible scenarios, and will discuss them briefly below.

We think the community has moved on from its earlier position, which was along the lines of “migrate as soon as possible, ideally at point of ingest” – the risks of careless migrations are hopefully better understood now, and we don’t want to rush into a bad decision. That said, some digital preservation systems still have an automated migration action built into the ingest routine.

Do migrate if: 

  • You don’t trust the format of the submission. The depositor may have sent you something in an obscure, obsolete, or unsupported file format. A scenario like this is likely to involve a private depositor, or an academic who insists on working in their “special” way. Obsolescence (or the imminent threat of it) is a well-established motivator for bringing out the conversion toolkit, though there are some who would disagree.
  • Your archive/repository works to a normalisation policy. This means that you tend to limit the number of preservation formats you work with, so you convert all ingests to the standard set which you support. The policy might be to migrate all Microsoft products to their OpenOffice equivalents. Indeed, this rule is built into Xena, the open-source tool from the National Archives of Australia. Normalisation may have a downside, but it can create economies in how many formats you need to commit to supporting, and may go some way to “taming” wild deposits that arrive in a variety of formats.
  • You want to provide access to the content immediately. This means creating an access copy of the resource, for instance by migrating a TIFF image to a JPEG. Some would say this doesn’t really qualify as migration, but it does involve a re-encoding action, which is why we mention it. It might be that this access copy doesn’t have to meet the same high standards as a preservation copy (see the sketch just after this list).
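
As promised above, here is a minimal sketch of that kind of access-copy derivation, using the Pillow imaging library in Python; the file names and the JPEG quality setting are assumptions, and the access copy is not expected to meet preservation standards.

```python
# A minimal sketch of deriving an access copy from a preservation master
# with Pillow. File names and the quality value are illustrative only.
from PIL import Image

with Image.open("master_scan.tif") as master:
    access = master.convert("RGB")                     # JPEG cannot hold every TIFF mode
    access.save("access_copy.jpg", "JPEG", quality=85)
```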

Don’t migrate if: 

  • The resource is already stored in a quality format. The deposit you are ingesting may already be encoded in a format that is widely accepted as meeting a preservation standard, in which case migration is arguably not necessary. To verify the format, use DROID or another identification tool. To learn about preservation-standard formats, start with the Library of Congress resource Sustainability of Digital Formats.
  • There is no immediate need for you to migrate. In this scenario, you fear that the ingested content’s format may become obsolete one day, but your research (starting with the PRONOM online registry) indicates that the risk is some way off – maybe even 10-15 years away. In which case deferring the migration is your best policy. Be sure to add a “note to self” in the form of preservation metadata about this decision, and a trigger date in your database that will remind you to take action (a minimal sketch of such a record appears after this list).
  • You want to migrate, but currently lack the IT skills. To this scenario we could add “you lack the tools to do migration” or even “you lack a suitable destination format”. You’ve made a search on COPTR and still come up empty. Through no fault of your own, technology has simply not yet found a reliable way to migrate the format you wish to preserve, and a tool for migration does not exist. In this instance, don’t wait for the solution – put the content into preservation storage, with a “note to self” (see above) that action will be taken at some point when the technology, tools, skills, and formats are available.
  • You have no preservation plan. This refers to your over-arching strategy which governs your approach to doing digital preservation. Part of it is an agreed action plan for what you will do when faced with particular file formats, including a detailed workflow, choice of conversion tool, and clear rationale for why you’re doing it that way. Ideally, in compiling this action plan, you will have understood the potential losses that migration can cause to the content, and the archivist (and the organisation) have signed off on how much of a “hit” is acceptable. Without a plan like this, you’re at risk of guessing which is the best migration pathway, and your decisions end up being guided by the tools (which are limited) rather than your own preservation needs.
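
For the “note to self” mentioned in the deferral scenario above, here is a minimal sketch of what such a record might look like; the field names are my own invention rather than a formal standard such as PREMIS, and the identifiers and dates are illustrative.

```python
# A sketch of a preservation-metadata "note to self" recording a decision to
# defer migration, with a trigger date for review. Field names, identifiers
# and dates are assumptions for illustration.
import json
from datetime import date

deferral = {
    "object_id": "accession-2017-042/report.doc",   # hypothetical object reference
    "format_puid": "fmt/40",                        # PUID reported by DROID (illustrative)
    "decision": "defer migration",
    "rationale": "PRONOM research suggests the obsolescence risk is some way off",
    "decision_date": date.today().isoformat(),
    "review_trigger_date": "2027-01-01",            # the reminder to revisit this decision
}

with open("deferral_notes.jsonl", "a", encoding="utf-8") as log:
    log.write(json.dumps(deferral) + "\n")
```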

The meaning of Curation: Data Curation

Unlike the meaning of the term archive, I think curation is a ‘doing’ word that describes the actions of an information professional working with content or resources. The term has been applied to many sectors recently, and I sense its meaning may have come adrift from its more traditional interpretation – for instance a museum curator, or an art curator. For this reason, this post will be the first of a series as we try and move towards a disambiguation of this term.

When it comes to ‘digital’, the curation term has been commonly applied in the field of research data, but it may also have some specialist uses in the field of digital preservation (for instance, the University of North Carolina at Chapel Hill offers a Certificate in Digital Curation). In this post however, I will look at the term as it’s been applied to research data.

Research data is one of the hot topics in digital preservation just now. In the UK at least, universities are working hard at finding ways to make their datasets persist, for all kinds of reasons – compliance with research council and funder requirements, conformance with the Research Excellence Framework (REF), and other drivers (legal, reputational, etc.). The re-use of data, repurposing datasets in the future, is the very essence of research data, and this need is what makes it distinct from other digital preservation projects. This is precisely where data curation has a big part to play. In no particular order, here’s our partial list:

1. Curation provides meaningful access to data. This could be cataloguing, using structured hierarchies, standards, common terms, defined ontologies, vocabularies, thesauri. All of these could derive from library and archive standards, but the research community also has its own set of subject-specific and discipline-specific vocabularies, descriptive metadata standards and agreed thesauri. It could also involve expressing that catalogue in the form of metadata (standards, again); and operating that metadata in a system, such as a CMS or Institutional Repository software. The end result of this effort ought to be satisfied end users who can discover, find, use, and interpret the dataset.

If further unpicking is needed, I could regard those as three different (though related) skills; a skilled cataloguer doesn’t necessarily know how to recast their work into EAD or MARC XML, and may rely on a developer or other system to help them do that. On the other hand, those edges are always blurring; institutional repository software (such as EPrints) was designed to empower individual users to describe their own deposited materials, working from pre-defined metadata schemas and using drop-down menus for controlled vocabularies.

2. Curation provides enduring access to data. This implies that the access has to last for a long time. One way of increasing your chances of longevity is by working with other institutions, platforms, and collaborators. Curation may involve applying agreed interoperability standards, such as METS, a standard which allows you to share your work with other systems (not just other human beings). Since it involves machines talking to each other, I’ve tended to regard interoperability as a step beyond cataloguing.

Another aspect of enduring access is the use of UUIDs – Universally Unique Identifiers. If I make a request through a portal or UI, I will get served something digital – a document, image, or data table. For that to happen, we need UIDs or UUIDs; it’s the only way a system can “retrieve” a digital object from the server. We could call that another part of curation, a skill that must involve a developer somewhere in the service, even if the process of UID creation ends up being automated. You could regard the UID as technical metadata, but the main thing is making the system work with machine-readable elements; it’s not the same as “meaningful access”. UUIDs do it for digital objects; there’s also the ORCID system, which does it for individual researchers. Other instances, which are even more complex, involve minting DOIs for papers and datasets, making citations “endure” on the web to some degree.

3. Curation involves organisation of data. This one is close to my archivist heart. It implies constructing a structure that sorts the data into a meaningful retrieval system and gives us intellectual control over lots of content. An important part of organisation for data is associating the dataset or datasets with the relevant scholarly publications, and other supporting documentation such as research notes, wikis, and blogs.

In the old days I would have called this building a finding aid, and invoked accessioning skills such as archival arrangement – “sorting like with like” – so that the end user would have a concise and well organised finding aid to help them understand the collection. The difference is that now we might do it with tools such as directory trees, information packages, aggregated zip or tar files, and so on. We still need the metadata to complete the task (see above) but this type of “curation” is about sorting and parsing the research project into meaningful, accessible entities.

If we get this part of curation right, we are helping future use and re-use of the dataset. If we can capture the outputs of any secondary research, they stand a better chance of being associated with the original dataset.
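
To make the “information package” idea above a little more concrete, here is a minimal sketch that bundles a dataset, its supporting documentation and a metadata file into a single compressed package; the paths and layout are assumptions, and a real project might prefer an established packaging convention such as BagIt.

```python
# A sketch of aggregating a research project into one information package.
# Paths and package layout are illustrative only.
import tarfile

package_contents = [
    ("project_x/data/survey_results.csv", "data/survey_results.csv"),
    ("project_x/docs/research_notes.md", "documentation/research_notes.md"),
    ("project_x/metadata.json", "metadata.json"),
]

with tarfile.open("project_x_package.tar.gz", "w:gz") as package:
    for source, arcname in package_contents:
        package.add(source, arcname=arcname)   # store under a tidy name inside the package
```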

4. Curation is a form of lifecycle management. There is a valid interpretation of data curation that claims “Data curation is the management of data throughout its lifecycle, from creation and initial storage, to the time when it is archived for posterity or becomes obsolete and is deleted.” I would liken this to an advanced form of records management, a profession that already recognises how lifecycles work, and has workflows and tools for how to deal with them. It’s a question of working out how to intervene, and when to intervene; if this side of curation means speaking to a researcher about their record-keeping as soon as they get their grant, then I’m all for it.

5. Curation provides for re-use over time through activities including authentication, archiving, management, preservation, and representation. While this definition may seem to involve a large number of activities, in fact most of them are already defined as things we would do as part of “digital preservation”, especially as defined by the OAIS Reference Model. The main emphasis for this class of resource however is “re-use”. The definition of what this means, and the problems of creating a re-usable dataset (i.e. a dataset that could be repurposed by another system) are too deep for this blog post, but they go beyond the idea that we could merely create an access copy.

Authentication is another disputed area, but I would like to think that proper lifecycle management (see above) would go some way to building a reliable audit trail that helps authentication; likewise the correct organisation of data (see above) will add context, history and evidence that situates the research project in a certain place and date, with an identifiable owner, adding further authentication.

To conclude this brief overview I make two observations:

  • Though there is some commonality among the instances I looked at, there is apparently no single shared understanding of what “data curation” means within the HE community; some views will tend to promote one aspect over another, depending on their content type, collections, or user community.
  • All the definitions I looked at tend to roll all the diverse actions together in a single paragraph, as if they were related parts of the same thing. Are they? Does it imply that data curation could be done by a single professional, or do we need many skillsets contributing to this process for complete success?

The meaning of the term Archive

In this blog post I should like to disambiguate uses of the word “archive”. I have found the term is often open to misunderstandings and misinterpretation. Since I come from a traditional archivist background, I will begin with a definition whose meaning is clear to me.

At any rate, it is a definition that pre-dates computers, digital content, and the internet; the arrival of these agencies has brought us new, ambiguous meanings of the term. Some instances of this follow below. In each instance, I will be looking for whether these digital “archives” imply or offer “permanence”, a characteristic I would associate with a traditional archive. 

    1. In the paper world: an archive is any collection of documents needed for long-term preservation, e.g. for historical, cultural heritage, or business purposes. It can also mean the building where such documents are permanently stored, in accordance with archival standards, or even the memory institution itself (e.g. The National Archives).
    2. In the digital world: a “digital archive” ought to refer to a specific function of a much larger process called digital preservation. This offers permanent retention, managed storage, and a means of keeping content accessible in the long term. The organisation might use a service like this for keeping content that has no current business need, but is still needed for historical or legal reasons. Therefore, the content is no longer held on a live system.
      The OAIS Reference Model devised the term “Archival Storage” to describe this, and calls it a Functional Entity of the Model; this means it can apply to the function of the organisation that makes this happen, the system that governs it, or the servers where the content is actually stored. More than just storage, it requires system logging, validation, and managed backups on a scale and frequency that exceed the average network storage arrangement (see the fixity-checking sketch after this list). The outcome of this activity is long-term preservation of digital content.
    3. In the IT world: a sysadmin might identify a tar, zip or gz file as an “archive”. This is an accumulation of multiple files within a single wrapper. The wrapper may or may not perform a compression action on the content. The zipped “archive” is not necessarily being kept; the “archiving” action is the act of doing the zipping / compression.
    4. On a blog: a blog platform, such as WordPress or Google Blogger, organises its pages and posts according to date-based rules. WordPress automatically builds directories to store the content in monthly and annual partitions. These directories are often called “archives”, and the word itself appears on the published blog page. In this context the word “archives” simply designates “non-current content”, in order to distinguish it from this month’s current posts. This “archive” is not necessarily backed up, or preserved; and in fact it is still accessible on the live blog.
    5. In network management: the administrator backs up content from the entire network on a regular basis. They might call this action “archiving”, and may refer to the data, the tapes/discs on which the data are stored, or even the server room as the “archive”. In this instance, it seems to me the term is used to distinguish the backups from the live network. In case of a fail (e.g. accidental data deletion, or the need for a system restore), they would retrieve the lost data from the most recent “archive”. However: none of these “archives” are ever kept permanently. Rather, they are subject to a regular turnover and refreshment programme, meaning that the administrator only ever retains a few weeks or months of backups.
    6. Cloud storage services may offer services called “Data Archive” or “Cloud Archive”. In many cases this service performs the role of extended network storage, except that it might be cheaper than storing the data on your own network. Your organisation also might decide to use this cheaper method to store “non-current” content. In neither case is the data guaranteed to be preserved permanently, unless the provider explicitly states it is, or the provider is using cloud storage as part of a larger digital preservation approach.
    7. For emails: In MS Outlook, there is a feature called AutoArchive. When run, this routine will move emails to an “archive” directory, based on rules (often associated with the age of the email) which the user can configure. The action also does a “clear out”, i.e. a deletion, of expired content, again based on rules. There is certainly no preservation taking place. This “AutoArchive” action is largely about moving email content from one part of the system to another, in line with rules. I believe a similar principle has been used to “archive” a folder or list in SharePoint, another Microsoft product. Some organisations scale up this model for email, and purchase enterprise “mail archiving” systems which apply similar age-based rules to the entire mail server. Unless explicitly applied as an additional service, there is no preservation taking place, just data compression to save space.
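
As flagged under the “digital archive” definition above, one of the activities that separates Archival Storage from ordinary network storage is validation. Here is a minimal sketch of a fixity check of that kind; the file path and the recorded checksum are illustrative.

```python
# A sketch of periodic fixity checking: re-hash stored files and compare
# against checksums recorded at ingest. Paths and checksum values are
# illustrative only.
import hashlib

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

recorded_at_ingest = {
    "archival_storage/report_2016.pdf": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

for path, expected in recorded_at_ingest.items():
    status = "OK" if sha256_of(path) == expected else "FIXITY FAILURE"
    print(path, status)
```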

To summarise:

  • The term “archive” has been used in a rather diffuse manner in the IT and digital worlds, and can mean variously “compression”, “aggregated content”, “backing up”, “non-current content”, and “removal from the live environment”. While useful and necessary, none of these are guaranteed to offer the same degree of permanence as digital preservation. Of these examples, only digital preservation (implementation of which is a complex and non-trivial task) offers permanent retention, protection, and replayability of your assets.
  • If you are an archivist, content owner, or publisher: when dealing with vendors, suppliers, or IT managers, be sure you take the time to discuss and understand what is meant by the term “archive”, especially if you’re purchasing a service that includes the term in some way.

How to capture and preserve electronic newsletters in HE and beyond

This blog post is based on a real-world case study. It happens to have come from a UK Higher Education institute, but the lessons here could feasibly apply to anyone wishing to capture and preserve electronic newsletters.

The archivist reported that the staff newsletter started to manifest itself in electronic form “without warning”. Presumably they’d been collecting the paper version successfully for a number of years, then this change came along. The change was noticed when the archivist (and all staff) received the Newsletter in email form. The archivist immediately noticed the email was full of embedded links, and pictures. If this was now the definitive and only version of the newsletter, how would they capture it and preserve it?

I asked the archivist to send me a copy of the email, so I could investigate further.

It turns out the Newsletter in this case is in fact a website, or a web-based resource. It’s being hosted and managed by a company called Newsweaver, a communications software company who specialise in a service for generating electronic newsletters, and providing means for their dissemination. They do it for quite a few UK Universities; for instance, the University of Manchester resource can be seen here. In this instance, the email noted above is simply a version of the Newsletter page, slightly recast and delivered in email form. By following the links in the example, I was soon able to see the full version of that issue of the Newsletter, and indeed the entire collection (unhelpfully labelled an “archive” – but that’s another story).

What looked at first like it might be an email capture and preserve issue is more likely to be a case calling for a web-archiving action. Only through web-archiving would we get the full functionality of the resource. The email, for instance, contains links labelled “Read More”, which when followed take us to the parent Newsweaver site. If we simply preserved the email, we’d only have a cut-down version of the Newsletter; more importantly, the links would not work if Newsweaver.com became unavailable, or changed its URLs.

Since I have familiarity with using the desktop web-archiving tool HTTrack, I tried an experiment to see if I could capture the online Newsletter from the Newsweaver host. My gather failed first time, because the resource is protected by the site robots (more on this below), but a second gather worked when I instructed the web harvester to ignore the robots.txt file.
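
For the curious, a capture like mine can also be scripted rather than run through the HTTrack interface. The sketch below calls the command-line version of HTTrack from Python; the URL is hypothetical, and the flags shown (-O for the output directory, -s0 to ignore robots.txt) should be checked against your installed version’s documentation before relying on them.

```python
# A sketch of scripting an HTTrack gather. The URL is hypothetical and the
# flags are assumptions worth verifying against the HTTrack documentation.
import subprocess
from datetime import date

capture_dir = f"newsletter_capture_{date.today().isoformat()}"

subprocess.run(
    ["httrack", "https://universityname.newsweaver.example/",  # hypothetical newsletter host
     "-O", capture_dir,                                        # output directory for this snapshot
     "-s0"],                                                   # do not obey robots.txt
    check=True,
)
```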

My trial brought in about 500-600MB of content after one hour of crawling – there is probably more content, but I decided to terminate it at that point. I now had a working copy of the entire Newsletter collection for this University. In my version, all the links work, the fonts are the same, the pictures are embedded. I would treat this as a standalone capture of the resource, by which I mean it is no longer dependent on the live web, and works as a collection of HTML pages, images and stylesheets, and can be accessed and opened by any browser.

Of course, it is only a snapshot. A capture and archiving strategy would need to run a gather like this on a regular basis to be completely successful, to capture the new content as it is published. Perhaps once a year would do it, or every six months. If that works, it can be the basis of a strategy for the digital preservation of this newsletter.

Such a strategy might evolve along these lines:

  • Archivist decides to include electronic newsletters in their Selection Policy. Rationale: the University already has them in paper form. They represent an important part of University history. The collection should continue for business needs. Further, the content will have heritage value for researchers.
  • University signs up to this strategy. Hopefully, someone agrees that it’s worth paying for. The IT server manager agrees to allocate 600MB of space (or whatever) per annum for the storage of these HTTrack web captures. The archivist is allocated time from an IT developer, whose job it is to programme HTTrack and run the capture on a regular basis.
  • The above process is expressed as a formal workflow, or (to use terms an archivist would recognise) a Transfer and Accession Policy. With this agreement, names are in the frame; tasks are agreed; dates for when this should happen are put into place. The archivist doesn’t have to become a technical expert overnight, they just have to manage a Transfer / Accession process like any other.
  • Since they are “snapshots”, the annual web crawls could be reviewed – just like any other deposit of records. A decision could be made as to whether they all need to be kept, or whether it’s enough to just keep the latest snapshot. Periodic review lightens the burden on the servers.

This isn’t yet full digital preservation – it’s more about capture and management. But at least the Newsletters are not being lost. Another, later, part of the strategy is for the University to decide how it will keep these digital assets in the long term, for instance in a dedicated digital preservation repository – a service which the University might not be able to provide themselves, or even want to. But it’s a first step towards getting the material into a preservable state.

There are some other interesting considerations in this case:

The content is hosted by Newsweaver, not by the University. The name of the Institution is included in the URL, but it’s not part of the ac.uk estate. This means that an intervention is most certainly needed, if the University wants to keep the content long-term. It’s not unlike the Flickr service, who merely act as a means of hosting and distributing your content online. For the above proposed strategy to work, the archivist would probably need to speak to Newsweaver, and advise them of the plan to make annual harvests. There would need to be an agreement that robots.txt is disabled or ignored, or the harvest won’t work. There may be a way to schedule the harvest at an ideal time that won’t put undue stress on the servers.

Newsweaver might even wish to co-operate with this plan; maybe they have a means for allowing export of content from the back-end system that would work just as well as this pull-gather method, but then it’s likely the archivist would need additional technical support to take it further. I would be very surprised if Newsweaver claimed any IP or ownership of the content, but it would be just as well to ascertain what’s set out in the contract with the company. This adds another potential stakeholder to the mix: the editorial team who compile the University Newsletter in the first place.

Operating HTTrack may seem like a daunting prospect to an archivist. There is a simpler option, which would be to use PDFs as a target format for preservation. One approach would be to print the emails to PDFs, an operation which could be done direct from the desktop with minimal support, although a licensed copy of Adobe Acrobat would be needed. Even so, the PDF version would disappoint very quickly; the links wouldn’t work as standalone links, and would point back to the larger Newsweaver collection on the live web. That said, a PDF version would look exactly like the email version, and PDF would be more permanent than the email format.

The second PDF approach would be to capture pages from Newsweaver using Acrobat’s “Create PDF from Web Page” feature. This would yield a slightly better result than the email option above, but the links would still fail. For the full joined-up richness of the highly cross-linked Newsletter collection, web-archiving is still the best option.

To summarise the high-level issues, I suggest an Archivist needs to:

  • Define the target of preservation. In this case we thought it was an email at first, but it turns out the target is web content hosted on a domain not owned by the University.
  • Define the aspects of the Newsletter which we want to survive – such as links, images, and stylesheets.
  • Agree and sign off a coherent selection policy and transfer procedure, and get resources assigned to the main tasks.
  • Assess the costs of storing these annual captures, and tell your IT manager what you need in terms of server space.

If there’s a business case to be made to someone, the first thing to point out is the risk of leaving this resource in the hands of Newsweaver, who are great at content delivery, but may not have a preservation policy or a commitment to keep the content beyond the life of the contract.

This approach has some value as a first step towards digital preservation; it gets the archivist on the radar of the IT department, the policy owners, and finance, and wakes up the senior University staff to the risks of trusting third-parties with your content. Further, if successful, it could become a staff-wide policy that individual recipients of the email can, in future, delete these in the knowledge that the definitive resource is being safely captured and backed up.

Dynamic links – what do they mean for digital preservation?

Today I’d like to think about the subject of dynamic links. I’m hoping to start off in a document management context, but it also opens up questions from a digital preservation point of view.

Very few of the ideas here are my own. Last December I heard Barbara Reed of Recordkeeping Innovation Pty Ltd speaking at the Pericles Conference Acting on Change: New Approaches and Future Practices in Digital Preservation, and on a panel about the risk assessment of complex objects. She made some insightful remarks that very much resonated with me.

She described dynamic links, or self-referential links, as machine-readable links. These are now very common to many of us, particularly if we’re choosing to work in a cloud-based environment, such as Google Drive, or more recently Office 365 or SharePoint.

These environments greatly facilitate the possibility of creating a dynamic link to a resource – and share that link with others, e.g. colleagues in your organisation, or even external partners. It’s a grand way to enable and speed up collaboration. On a drive managed by Windows Explorer, the limitation was we could only open one document at a time; collaborators often got the message “locked for editing by another user”. With these new environments, multiple editors can work simultaneously, and dynamic links help to manage it.

Dynamic links don’t always depend on cloud storage of course, and I suppose we can manage dynamic links just as well in our own local network. Spreadsheets can link to other documents, and links can be held in emails.

Well, it seems there might be a weakness in this way of working. Reed said that these kinds of links only work if the objects stay in the same place. It’s fine if nothing changes, but changes to the server configuration can affect that “same place” – the network store – quite drastically.

If that is true, then the very IT architecture itself can be a point of failure. “Links are great,” said Reed, “but they presume a stable architecture.”

Part of the weakness could be the use of URLs for creating and maintaining these links. Reed said she has worked in places where there are no protocols for unique identifiers (UIDs), and instead it was more common to use URLs, which are based on storage location.

The problem scales up to larger systems, such as an Electronic Document and Records Management System (EDRMS), and to digital repositories generally. Many an EDRMS anticipates sharing and collaboration when working with live documents, and may have a built-in link generator for this purpose.

But when a resource is moved out of its native environment, you run the risk of breaking the links. Vendors of systems often have no procedure for this, and will simply recommend a redirect mechanism. We can’t seem to keep / preserve this dynamism. “This is everyone’s working environment now,” said Reed, “and we have no clear answer.”

There is a glimmer of hope though, and it seems to involve using UUIDs instead of URLs. I wanted to understand this a bit better, so I did a small amount of research as part of a piece of consultancy I was working on; very coincidentally, the client wanted a way to maintain the stability of digital objects migrated out of an EDRMS into a digital preservation repository.

URLs vs UUIDs

From what I understand, URLs and UUIDs are two fundamentally different methods of identifying and handling digital material. The article On the utility of identification schemes for digital earth science data: an assessment and recommendations (Duerr, R.E., Downs, R.R., Tilmes, C. et al. Earth Sci Inform (2011) 4: 139. doi:10.1007/s12145-011-0083-6), offers the following definitions:

A Uniform Resource Identifier (URI, or URL) is a naming and addressing technology that uses “a compact sequence of characters to identify” World Wide Web resources.

A Universally Unique Identifier (UUID) is a 16-byte (128-bit) number, as specified by the Open Software Foundation’s Distributed Computing Environment. Written out, a UUID contains 36 characters, of which 32 are hexadecimal digits arranged in 5 hyphen-separated groups, for example:

0a9ecf4f-ab79-4b6b-b52a-1c9d4e1bb12f 

As I would understand it, this is how it applies to the subject at hand:

URLs – which is what dynamic links tend to be expressed as – will only continue to work if the objects stay in the same places, and there is a stable environment. A change to the server configuration is one profound example of something that can break this.

UUIDs are potentially a more stable way of managing locations, and require less maintenance while ensuring integrity. According to the article:

“An organization that chooses to use URIs as its identifiers will need to maintain the web domain, manage the structure of the URIs and maintain the URL redirects (Cox et al. 2010) for the long-term.” 

“Unlike DOIs or other URL-based identification schemes, UUIDs do not need to be recreated or maintained when data is migrated from one location to another.” 

What this means for digital preservation 

I think it means that digital archivists need to understand this basic difference between URLs and UUIDs, especially when communicating their migration requirements to a vendor or other supplier. Otherwise, there is a risk that this requirement will be misunderstood as a simple redirection mechanism, which it isn’t. For instance, I found online evidence that one vendor offering an export service asserts that:

“It is best to utilize a redirection mechanism to translate your old links to the current location in SharePoint or the network drive.”

Redirection feels to me like a short-term fix, one that extends the shelf-life of dynamic links, but does nothing to stabilise the volatile environment. Conversely, UUIDs will give us more stability, and will not need to be recreated or maintained in future. This approach feels closer to digital preservation; indeed I am fairly certain that a good digital preservation system manages its objects using UUIDs rather than URLs.
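
To illustrate the difference in plain terms, here is a minimal sketch (my own, not any particular vendor’s design) of objects being managed by UUID: the identifier handed out never changes, and if the storage location moves, only an internal lookup table needs updating.

```python
# A sketch of UUID-based object management: the identifier is stable, and
# only the internal mapping changes when storage locations move.
import uuid

object_locations = {}   # UUID -> current storage path (illustrative)

def register(path):
    object_id = str(uuid.uuid4())        # 36-character identifier, as described above
    object_locations[object_id] = path
    return object_id

def retrieve(object_id):
    return object_locations[object_id]   # callers never need to know the path

doc_id = register("/storage/volume1/records/report.docx")
object_locations[doc_id] = "/new_storage/volume7/records/report.docx"   # migration: the UUID survives
print(doc_id, "->", retrieve(doc_id))
```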

UUIDs might be more time-consuming or computationally expensive to create – I honestly don’t know if they are. But that 36-character reference looks like a near-unbreakable machine-readable way of identifying a resource, and I would tend to trust its longevity.

It also means that the conscientious archivist or records manager will at least want to be aware of changes to the network, or server storage, across their current organisation. IT managers may not regard these architecture changes as something that impacts on record-keeping or archival care. My worry is that it might impact quite heavily, and we might not even know about it. The message here is to be aware of this vulnerability in your environment.

READ MORE: Barbara Reed’s own account of her Pericles talk

Research datasets: lockdown or snapshot?

In today’s blog post we’re going to talk about digital preservation planning in the context of research datasets. We’re planning a one-day course for research data managers, where we can help with making preservation planning decisions that intersect with and complement your research data management plan.

When we’re dealing with datasets of research materials, there’s often a question about when (and whether) it’s possible to “close” the dataset. The dataset is likely to be a cumulative entity, especially if it’s a database, continually accumulating new records and new entries. Is there ever a point at which the dataset is “finished”? If you ask a researcher, it’s likely they will say it’s an ongoing concern, and they would rather not have it taken away from them and put into an archive.

For the data manager wishing to protect and preserve this valuable data, there are two possibilities.

The first is to “lock down” the dataset

This would involve intervening at a suitable date or time, for instance at the completion of a project, and negotiating with the researcher and other stakeholders. If everyone can agree on a lockdown, it means that no further changes can be made to the dataset; no more new records added, and existing records cannot be changed.

A locked-down dataset is somewhat easier to manage in a digital preservation repository, especially if it’s not being requested for use very frequently. However, this approach doesn’t always match the needs of the institution, nor the researcher who created the content. This is where the second possibility comes into play.

The second possibility is to take “snapshots” of the dataset

This is a capture action that involves abstracting records from the dataset, and preserving the result as a “view” of the dataset at a particular moment in time. The dataset itself remains intact, and can continue being used for live data as needed: it can still be edited and updated.

Taking dataset snapshots is a more pragmatic way of managing and preserving important research data, while meeting the needs of the majority of stakeholders. However, it also requires more effort: a strategic approach, more planning, and a certain amount of technical capability. In terms of planning, it might be feasible to take snapshots of a large and frequently-updated dataset on a regular basis, e.g. every year or every six months; doing so will tend to create reliable, well-managed views of the data.

Another valid approach would be to align the snapshot with a particular piece of research

For instance, when a research paper is published, the snapshot of the dataset should reflect the basis on which the analysis in that paper was carried out. The dataset snapshot would then act as a strong affirmation of the validity of the dataset. This is a very good approach, but requires the data manager and archivist to have a detailed knowledge of the content, and more importantly the progress of the research cycle.

The ideal scenario would be to have your researcher on board with your preservation programme, and get them signed up to a process like this; at crucial junctures in their work, they could request snapshots of the dataset, or even be empowered to perform it themselves.

In terms of the technical capability for taking snapshots, it may be as simple as running an export script on a database, but it’s likely to be a more delicate and nuanced operation. The parameters of the export may have to be discussed and managed quite carefully.
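
As a minimal sketch of the “simple” end of that spectrum, suppose the research data happens to live in a SQLite database: a dated snapshot could be exported along the following lines. The database name, table name and column ordering are assumptions, and in practice the export parameters would need to be agreed with the researcher.

```python
# A sketch of a dated snapshot export from a SQLite database to CSV.
# Database name, table name and file naming are illustrative only.
import csv
import sqlite3
from datetime import date

snapshot_file = f"dataset_snapshot_{date.today().isoformat()}.csv"

with sqlite3.connect("research_data.db") as conn:
    cursor = conn.execute("SELECT * FROM observations ORDER BY record_id")
    headers = [column[0] for column in cursor.description]
    with open(snapshot_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(headers)
        writer.writerows(cursor)
```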

Lastly we should add that these operations by themselves don’t constitute the entirety of digital preservation. They are both strategies to create an effective capture of a particular resource; but capture alone is not preservation.

That resource must pass into the preservation repository and undergo a series of preservation actions in order to be protected and usable in the future. There will be several variations on this scenario, as there are many ways of gathering and storing data. We know that institutions struggle with this area, and there is no single agreed “best practice.”

Pros and cons of the JPEG2000 format for a digitisation project

In this post we want to briefly discuss some of the pros and cons of the JPEG2000 format. When it comes to selecting file formats for a digitisation project, choosing the right ones may help with continuity and longevity, or even access to the content. It all depends on the type of resource, or your needs, or the needs of your users.

If you’re working with images (e.g. for digitised versions of books, texts, or photographs), there’s nothing wrong with using the TIFF standard file format for your master copies. We’re not here to advocate using the JPEG2000 format, but it does have its adherents (and its evangelists).

PROS

May save storage space

This is a compelling reason and may be why a lot of projects opt for the JP2. It supports lossless compression, which means the image can be compressed to leave a smaller “footprint” on the server than an uncompressed TIFF master, and yet not lose anything in terms of quality. How? It’s thanks to the magic codec.
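
If you want to test the storage claim for yourself, here is a minimal sketch using the Pillow imaging library, which depends on the OpenJPEG codec for its JPEG2000 support (a neat foretaste of the codec dependency discussed under CONS below). The file names are assumptions, and the irreversible=False option is intended to select the lossless transform – check your Pillow version’s documentation before relying on it.

```python
# A sketch of comparing an uncompressed TIFF master with a lossless JPEG2000
# copy. File names are illustrative; JPEG2000 support requires OpenJPEG.
import os
from PIL import Image

with Image.open("master_scan.tif") as img:
    img.save("master_scan.jp2", irreversible=False)   # intended: lossless (reversible) encoding

for path in ("master_scan.tif", "master_scan.jp2"):
    print(path, os.path.getsize(path) // 1024, "KB")
```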

Versatility

In “old school” digitisation projects, we tended to produce at least two digital objects – a high resolution scan (the “archive” copy, as I would call it) and a low resolution version derived from it, which we’d serve to users as the access copy. Gluttons for punishment might even create a third object, a thumbnail, to exhibit on the web page / online catalogue. By contrast, the JPEG2000 format can perform all three functions from a single object. It can do this because of the “scalable bitstream”: the image data is encoded so that it only serves as much as is needed to meet the request, which could be for an image of any size.

Open standard with ISO support

As indicated above, a file format that’s recognised as an International Standard gives us more confidence in its longevity, and the prospects for continued support and development. An “open” standard in this instance refers to a file format whose specification has been published; this sort of documentation, although highly technical, can be useful to help us understand (and in some cases validate) the behaviour of a file format.

CONS

Codec dependency

We mentioned the scalable bitstream and the capacity for lossless compression above as two of this format’s strengths. However, delivering these requires an extra piece of functionality beyond what most file formats (including the TIFF) need: the codec, which performs a compression-decompression action on the image data. This is a dependency – without the codec, the magic of the JPEG2000 won’t work – and it remains something of a “black box”, a lack of transparency which may make some developers reluctant to work with the format.

Save As settings can be complex

In digitisation projects, the “Save As” action is crucial; you want your team to be producing consistent digitised resources which conform precisely to a pre-determined profile, for instance with regard to pixel size, resolution, and colour space. With a TIFF, these settings are relatively easy to apply; with the JPEG2000, there are many options and many possibilities, and it requires some expertise to select the settings that will work for your project. Both the decision-making process and the time spent applying the settings while scanning might add a burden to your project.
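
One way to keep a team honest about those settings, whichever format you settle on, is to check the finished scans against the agreed profile. The sketch below does this with Pillow; the folder name and profile values are assumptions for illustration.

```python
# A sketch of validating scans against a pre-determined profile (pixel size,
# resolution, colour mode). Folder name and target values are illustrative.
from pathlib import Path
from PIL import Image

PROFILE = {"min_width": 3000, "mode": "RGB", "min_dpi": 300}

for scan in Path("scans_batch_01").glob("*.tif"):
    with Image.open(scan) as img:
        dpi = float(img.info.get("dpi", (0, 0))[0])
        problems = []
        if img.width < PROFILE["min_width"]:
            problems.append(f"width {img.width}px below target")
        if img.mode != PROFILE["mode"]:
            problems.append(f"colour mode is {img.mode}")
        if dpi < PROFILE["min_dpi"]:
            problems.append(f"resolution {dpi:.0f} dpi below target")
        print(scan.name, "; ".join(problems) or "conforms to profile")
```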

Not yet the de facto standard

The “digital image industry,” if indeed there is such an entity, has not yet adopted the JPEG 2000 file format as a de facto standard. If you’re inclined to doubt this, look at the hardware; most digital cameras and digital scanners tend to save to TIFF or JPEG, not JPEG2000.

In conclusion, this post is not aiming to “sell” you on one format over another; the process that is relevant is going through a series of decisions, and informing yourself as best you can about the suitability of any given file format. Neither is it a case of either/or; we are aware of at least one major digitisation project that makes judicious use of both the TIFF and the JPEG2000, exploiting the salient features of both.

Scan Once for All Purposes – some cautionary tales

The acronym SOAP – Scan Once For All Purposes – has evolved over time among digitisation projects, and it’s a handy way to remember a simple rule of thumb: don’t scan content until you have a clear understanding of all the intended uses that will be made of the resource. This may seem simple, but in some projects it may have been overlooked in the rush to push digitised content out.

One reason for the SOAP rule is that digitisation is expensive. It takes money, staff time, expertise, and costly hardware to turn analogue content into digital content.

Taking books or archive boxes off shelves, scanning them, and reshelving them all takes time. Scanning paper can damage the original resource, so to minimise that risk we’d only want to do it once. In some extreme cases, scanning can even destroy a resource; there are projects which have sacrificed an entire run of a print journal to the scanner, in order to allow “disaggregation”, which is a euphemistic way of saying “we cut them up with a scalpel to create separate scanner-friendly pages”.

Beyond that, there are digital considerations and planning considerations which prove the importance of the Scan Once For All Purposes rule. To demonstrate this, let’s try and illustrate it with some imaginary but perfectly plausible scenarios for a digitisation project, and see what the consequences could be of failing to plan.

Scenario 1
An organisation decides to scan a collection of photographs of old buses from the 1930s, because they’re so popular with the public. Unfortunately, nobody told them about the differences between file formats, so the scans end up as low-resolution compressed JPEGs scanned at 72 DPI because the web manager advised that was best for sending the images over the web.

Consequence: the only real value these JPEGs have is as access copies. If we wanted to use them for commercial purposes, such as printing, 72 DPI will prove to be ineffectual. Further, if a researcher wanted to examine details of the buses, there wouldn’t be enough data in the scan for a proper examination, so chances are they would have to come in to the searchroom anyway. Result: photographs are once again subjected to more wear and tear. And weren’t we trying to protect these photographs in some way?

Scenario 2
The organisation has another go at the project – assuming they have any money left in the budget. This time they’re better informed about the value of high-resolution scans, and the right file formats for supporting that much image data in a lossless, uncompressed manner. Unfortunately, they didn’t tell the network manager they wanted to do this.

Consequence: the library soon finds their departmental “quota” of space on the server has been exceeded three times over. Because this quota system is managed automatically in line with an IT policy, the scans are now at risk of being deleted, with a notice period of 24 hours.

Scenario 3 
The organisation succeeds in securing enough server space for the high-resolution scans. After a few months running the project, it turns out the users are not satisfied with viewing low-resolution JPEGs and demand online access to full-resolution, zoomable TIFF images. The library agrees to this, and asks the IT manager to move their TIFF scans onto the web server for this purpose.

Consequence: through constant web access and web traffic, the original TIFF files are now exposed to a strong possibility of corruption. Since they’re the only copies, the organisation is now putting an important digital asset at risk. Further, the action of serving such large files over the web – particularly for this dynamic use, involving a zoom viewer – is putting a severe strain on the organisation’s bandwidth, and costing more money.

The simple solution to all of the above imaginary scenarios could be SOAP. The ideal would be for the organisation to handle the original photographs precisely once as part of this digitisation project, and not have to re-scan them because they got it wrong first time. The scanning operation should produce a single high-quality digital image, scanned at a high resolution and encoded in a dependable, robust format. We would refer to this as the “original”.

The project could then derive further digital objects from the “original”, such as access copies stored in a lower-resolution format. However, this is not part of the scanning operation; it’s managed as part of an image manipulation stage of the project, and is totally digital. The photographs, now completely ‘SOAPed’, are already safely back in their archive box.
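
A minimal sketch of that derivation stage might look like the following, run as a batch entirely against the digital “originals” once the photographs are back in their box; the folder names, pixel size and quality setting are assumptions.

```python
# A sketch of deriving web access copies from the digital "originals".
# Folder names, sizes and quality values are illustrative only.
from pathlib import Path
from PIL import Image

originals = Path("digital_store/originals")
access_dir = Path("web/access")
access_dir.mkdir(parents=True, exist_ok=True)

for master in originals.glob("*.tif"):
    with Image.open(master) as img:
        access = img.convert("RGB")
        access.thumbnail((1600, 1600))    # shrink for web delivery, preserving aspect ratio
        access.save(access_dir / (master.stem + ".jpg"), "JPEG", quality=80)
```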

The digital “originals” should now go into a safe part of the digital store. They would never be used as part of the online service, and users would not get hold of them. To meet the needs of scenario 3, the project now has to plan a routine that derives further copies from the originals; but these should be encoded in a way that makes them suitable for web access, most likely using a file format with a scalable bitstream that allows the zoom tool to work.

All of the above SOAP operations depend on the project manager having a good dialogue with the network server manager and the web manager too – a trait which such projects share with long-term digital preservation. As can be seen, a little bit of planning will make the project more economical and deliver the desired results without having to scan anything twice.