Making Progress in Digital Preservation: Part 2 – Costs, Standards, Tools and Solutions

This one-day event on 31 October 2014 was organised by the DPC. After lunch Sarah Middleton of the DPC reported on progress from the 4C Project on the costs of curation. The big problem facing the digital preservation community is that the huge volumes of data we are expected to manage are increasing dramatically, yet our budgets are shrinking. Any investment we make must be strategic and highly targeted, and collaboration with others will be pretty much an essential feature of the future. To assist with this, the 4C project has built the Curation Exchange platform, which will allow participating institutions to share – anonymised, of course – financial data in a way that will enable the comparison of costs. The 4C project has worked very hard to advance us beyond the simple “costs model” paradigm, and this dynamic interactive tool will be a big step in the right direction.

William Kilbride then described the certification landscape, mentioning Trusted Digital Repositories, compliance with the OAIS model, the Trusted Repositories Audit & Certification checklist, and the evolution of European standards such as DIN 31644 and the Data Seal of Approval. William gave his personal endorsement to the Data Seal of Approval approach (36 organisations have completed it, and another 30 are working towards it), and suggested that we all try an exercise to see how many of its 16 elements we felt we could comply with. After ten minutes, a common lament was “there are things here beyond my control…I can’t influence my depositors!”

William went on to discuss tools for digital preservation. By happy coincidence, he had just participated in the DPC collaborative “book sprint” event for the upcoming new DPC Handbook, helping to write a chapter on this very topic. Guess what? There are now more tools for digital preservation than we know what to do with. The proliferation of tools we can use, for everything from ingest to migration to access, has reached the point where we can hardly find them any more, let alone use them. William pins his hopes on the COPTR tools registry, a user-driven wiki with brief descriptions of the functionality and purpose of hundreds of tools – but COPTR is just one of many such registries. The field is crowded with competitors such as the APARSEN Tool Repository, DCH-RP, the Library of Congress, DCEX…ironically, we may soon need a “registry of tool registries”.

Our host James Mortlock described the commercial route his firm had taken in building a bespoke digital repository and cataloguing tool. His project management process showed just how much requirements can evolve over the lifetime of a project – what they built was not what they first envisaged, but through the process they arrived at stronger ideas about how to access content.

Kurt Helfrich’s challenge was not only to unify a number of diverse web services and systems at RIBA, but also to create a seamless entity in the Cloud that could meet multiple requirements. RIBA is in a unique position to work on system platforms and their development, because of its strategic partnership with the V&A, an organisation with which it even shares some office space. The problem he faces is not just scattered teams, but one of mixed content – library and archive materials in various states of completion regarding their digitisation or cataloguing. Among his solutions, he trialled the Archivists’ Toolkit, which had served him well in California, and the open-source application Archivematica, with an attached AtoM catalogue and DuraCloud storage service. A keen adaptor of tools, Kurt proposed that we look at the POWRR tool grid, which is especially suitable for small organisations, and BitCurator, the digital forensics system from Chapel Hill.

Download the slides from this event

AIDA and repositories

The AIDA project (Assessing Institutional Digital Assets) has completed its official, funded phase, but it’s gratifying to see interest emerging in the toolkit. We possibly could have done more at ULCC to publicise and sell our work, but our ongoing partnership with the DCC on the current Research Data Management project for the JISC gives us an opportunity to make up for that. One of the planned outcomes of the RDMP work will be an integrated planning tool for use by data owners or repository managers (or indeed anyone who has a digital collection to curate) that will offer the best of DAF, DRAMBORA, LIFE2 and AIDA without requiring an Institution to compile the same profile information four times over. We have already massaged the toolkit into a proof-of-concept online version of AIDA, using MediaWiki, and this clearly signals the way forward for this kind of assessment tool.

I was recently invited to contribute a module about AIDA to Steve Hitchcock’s KeepIt project in Southampton – encouragingly, he is looking into the detail of how repositories could be used to manage digital preservation, and wants input from as many current toolkits as he can get his hands on. My experiences of the day have already been blogged. I thought I would add two other little incidents from the day that I found interesting.

The first was the repository manager whose perception was that assessment of the Institution’s workings at the highest level (for example, its technology infrastructure, business management planning process and implementation of centralised policies) was not really part of her job. So why work with AIDA at all? The main purpose of AIDA is to assess the Institution’s overall preparedness to do asset management, and the task of assessment can take an individual staff member (repository manager, records manager, librarian) to parts of the organisation they didn’t know about before. I try to present this positively, suggesting that an AIDA assessment has to be a collaborative team effort within an organisation. But our friend at Southampton reminded me that people do have these sensitivities, and that very often merely having a repository in place at all represents a hard-won struggle.

The second incident relates to my AIDA exercise, where I asked teams to apply sections of the toolkit to their own organisation. The response fed back by Miggie Pickton was memorable – her team had elected to analyse three separate organisations, applying one of the three AIDA legs (Organisation, Technology, Resources) to each. My initial feeling was that this made a complete mockery of AIDA, subjective and unvalidated as it might be; what better way to cheat a good score than by cherry-picking the best results across three institutions? However, Miggie’s observations were in fact very useful – and the scores still resulted in a wobbly three-legged stool. It seems that even if they collaborated, HE and FE institutions would still not be able to achieve the stability that is the foundation for good asset management.

File formats…or data streams?

On 1st December Malcolm Todd of The National Archives gave a good account of the work he’s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.

My naïve thinking in this area has been characterised by the assumption that the process is rather static or linear, and that the problem we’re facing is broadly the same every time: migrate data from a format that’s about to become obsolete or unsupported to another format that’s stable, supported, and open. MS Word document to PDF or PDF/A…now that, I can understand!

In fact, I learned at least two ways of thinking about formats that hadn’t occurred to me before. A simple one is cost: some formats can cost more to preserve than others. This can be calculated in terms of storage costs, multiplied over time, plus the costs associated with migrations to new versions of that format.
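To make that concrete, here is a rough back-of-the-envelope sketch in Python. The cost model (a flat storage price per GB per year plus a fixed cost per migration) and all the figures are illustrative assumptions of mine, not values from the report or the 4C work.

```python
# Illustrative sketch only: a crude cost model for keeping one format over time.
# The model (flat storage price, fixed migration cost) and all figures are
# assumptions for the sake of example.

def preservation_cost(size_gb, years, storage_cost_per_gb_year,
                      migrations, cost_per_migration):
    """Total cost of storing a collection for `years`, plus the cost of the
    format migrations expected in that period."""
    storage = size_gb * storage_cost_per_gb_year * years
    migration = migrations * cost_per_migration
    return storage + migration

# Two hypothetical formats for the same 500 GB collection over 10 years:
# one stable and open, one that needs migrating every few years.
stable_format = preservation_cost(500, 10, 0.30, migrations=1, cost_per_migration=200)
fragile_format = preservation_cost(500, 10, 0.30, migrations=4, cost_per_migration=200)

print(f"Stable format:  £{stable_format:.2f}")
print(f"Fragile format: £{fragile_format:.2f}")
```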

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption of doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep storing it. (Managing this could be seen as good housekeeping. The British Library LIFE and LIFE2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to work out whether you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space to store a bloated gather which they don’t really want.

Secondly, the above assumptions have led me to make a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is that UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.

This selection aspect prompted Maureen Pennock’s reply, which makes a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.
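By way of illustration only, here is roughly what implementing those selection decisions might look like as a set of crawl exclusion patterns for a default MediaWiki installation. The patterns are my assumptions based on standard MediaWiki URL conventions (action=history, action=edit, the Talk: and Special: namespaces) and would need checking against the wiki’s actual URL layout before use.

```python
import re

# Illustrative only: exclusion patterns for a default MediaWiki URL layout.
# A real wiki may use short URLs or a different script path, so these would
# need checking against the site actually being gathered.
EXCLUDE_PATTERNS = [
    r".*action=history.*",   # page revision histories
    r".*action=edit.*",      # edit forms
    r".*[?&]oldid=\d+.*",    # links to old versions of pages
    r".*title=Talk:.*",      # discussion pages
    r".*title=Special:.*",   # auto-generated index/admin pages, login forms
]

def should_exclude(url):
    """Return True if the URL matches any of the exclusion patterns."""
    return any(re.match(p, url) for p in EXCLUDE_PATTERNS)

print(should_exclude("http://example.org/index.php?title=Main_Page"))                 # False
print(should_exclude("http://example.org/index.php?title=Main_Page&action=history"))  # True
```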

Working with Web Curator Tool (part 1)

Keen readers may recall a post from April 2008 about my website-archiving forays with Web Curator Tool, the workflow tool used to programme Heritrix, the crawler which does the actual harvesting of websites.

Other UKWAC partners and I have since found that Heritrix sometimes has a problem, described by some as ‘collateral harvesting’: it can gather links, pages, resources, images, files and so forth from websites we don’t actually want to include in the finished archived item.

Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com, for example. Sometimes, though, it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example, if the crawler somehow finds a website full of .avi files).

I have probably become overly preoccupied with this issue, since I don’t want to increase the overheads of our sponsor (the JISC) by occupying their share of the server space with unnecessarily bloated gathers, nor clutter up the shared bandwidth by spending hours gathering pages unnecessarily.

Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the Prune Tool on the harvested site after the gather has run. The Prune Tool allows you to browse the gather’s tree structure, and to delete a single file or an entire folder full of files which you don’t want.

The other option is to apply exclusion filters to the title before the gather runs, which can be a much more effective method. You enter a little bit of regular-expression code in the ‘Exclude Filters’ box of a title’s profile; the basic building block is .*, which matches anything. .*www.aes.org.* will exclude that entire website from the gather, while .*/images/.* will exclude any path containing a folder named ‘images’.
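A small Python sketch, using made-up URLs, shows how those two filters behave. One thing worth noticing: the dots inside www.aes.org are themselves regex wildcards, so the filter matches slightly more than the literal string suggests.

```python
import re

# The two filters quoted above, exactly as they would be typed into the
# 'Exclude Filters' box. The dots in 'www.aes.org' are unescaped, so each
# one matches any character, not just a full stop.
filters = [r".*www.aes.org.*", r".*/images/.*"]

# Hypothetical URLs from a gather, for illustration only.
urls = [
    "http://www.aes.org/standards/",
    "http://www.ukoln.ac.uk/repositories/digirep/images/logo.png",
    "http://www.ukoln.ac.uk/repositories/digirep/index/Main_Page",
]

for url in urls:
    excluded = any(re.match(f, url) for f in filters)
    print("EXCLUDE" if excluded else "KEEP   ", url)
```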

So far I generally find myself making two types of exclusion:

(a) Exclusions of websites we don’t want. As noted with collateral harvesting, Heritrix is following external links from the target a little too enthusiastically. It’s easy to identify these sites with the Tree View feature in WCT. This view also lets you know the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.

(b) Exclusions of certain pages or folders within the Target which we don’t want. This is where it gets slightly trickier: we start to look in the log files of client-server requests for instances where the crawler is staying within the target but performing actions like requesting the same page over and over. This can happen with database-driven sites, CMS sites, wikis, and blogs (a rough sketch of this kind of log check follows below).
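A rough sketch of that kind of log check, in Python. The log format assumed here (one requested URL per line) is a simplification of mine; real Heritrix crawl logs carry several fields per line and would need parsing accordingly.

```python
# Illustrative sketch: scanning a harvest log for URLs requested repeatedly.
# Assumes a simplified log with one requested URL per line; real crawl logs
# have more fields and would need splitting out first.
from collections import Counter

def repeated_requests(log_path, threshold=10):
    """Return URLs requested more than `threshold` times, most frequent first."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            url = line.strip()
            if url:
                counts[url] += 1
    return [(url, n) for url, n in counts.most_common() if n > threshold]

# Hypothetical log file name, for illustration.
for url, n in repeated_requests("crawl.log"):
    print(f"{n:6d}  {url}")
```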

I believe I may have had a ‘breakthrough’ of sorts with managing collateral harvesting with at least one brand of wiki, and will report on this for my next post.

The Continuity Girl

Amanda Spencer gave an informative presentation at the UK Web-Archiving Consortium Partners Meeting on 23 July, which I happened to attend. The Web Continuity Project at TNA is a large-scale and Government-centric project, which includes a “comprehensive archiving of the government web estate by The National Archives”. Its aims are to address both “persistence” and “preservation” in a way that is seamless and robust: in many ways, “continuity” seems a very apposite concept with which to address the particular nature of web resources. It’s all about the issue of sustainable information across government.

At ULCC we’re interested to see if we can align some ‘continuity’ ideas with our PoWR project. Many of the issues facing departmental web and information managers are likely to have analogues in HE and FE institutions, and Web Continuity offers concepts and ways of working that may be worth considering, and adapting, for a web-archiving programme in a University.

A main area of focus for Web Continuity is the integrity of websites – links, navigation, consistency of presentation. The working group on this, set up by Jack Straw, found a lot of mixed practices in e-publication (some departments use attached PDFs, others HTML pages) and numerous different content management systems in use – no centralised or consistent publication method, in other words.

To achieve persistence of links, Web Continuity are making use of digital object identifiers (DOIs), which can marry a live URL to a persistent identifier. Further, they use a redirection component derived from open-source software, which can be installed on common web server applications such as Apache and Microsoft IIS. This component will “deliver the information requested by the user whether it is on the live website, or retrieved from the web archive and presented appropriately”. Of course, the redirection component only works if the domains are still being maintained, but it will do much to ensure that links persist over time.
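The gist of that behaviour can be sketched in a few lines of Python. This is emphatically not TNA’s actual component, which runs server-side on Apache or IIS; the archive URL template below is a placeholder of my own, purely for illustration.

```python
import urllib.request
import urllib.error

# Illustrative sketch only - not TNA's actual redirection component.
# The archive URL template is a made-up placeholder.
ARCHIVE_TEMPLATE = "https://webarchive.example.org/{original}"

def resolve(url, timeout=5):
    """Return the live URL if it still responds; otherwise fall back to an
    archive URL built from the original address, so the user still gets the
    information they asked for."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return url  # the live site answered, so serve it as-is
    except (urllib.error.HTTPError, urllib.error.URLError, OSError):
        return ARCHIVE_TEMPLATE.format(original=url)

print(resolve("https://www.nationalarchives.gov.uk/"))
```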

They are building a centralised registry database, which is growing into an authority record of Government websites, including other useful contextual and technical detail (which can be updated by Departmental webmasters). It is a means of auditing the website crawls that are undertaken. Such a registry approach would be well worth considering on a smaller scale for a University.

Their sitemap implementation plan involves the rollout of XML sitemaps across government. XML sitemaps can help archiving because they expose hidden content that is not linked to from navigation, as well as dynamic pages created by a CMS or database. This methodology may be something for HE and FE webmasters to consider, as it would assist with remote harvesting by an agreed third party.
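As a small illustration of the idea, the sketch below writes a minimal sitemap.xml for a handful of made-up university URLs, including a dynamic page and an unlinked report that a crawler following navigation alone might never find. In practice the URL list would be generated from the CMS or database itself.

```python
from xml.etree.ElementTree import Element, SubElement, ElementTree

# Illustrative sketch: writing a minimal XML sitemap for a set of pages,
# including ones a crawler following navigation alone would miss.
# The URLs are invented for the example.

def write_sitemap(urls, path="sitemap.xml"):
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for u in urls:
        SubElement(SubElement(urlset, "url"), "loc").text = u
    ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

write_sitemap([
    "https://www.example.ac.uk/",
    "https://www.example.ac.uk/courses?id=101",    # dynamic page generated by a CMS
    "https://www.example.ac.uk/reports/2008.html", # 'hidden' content not linked from navigation
])
```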

The intended presentation method will make it much clearer to users that they are accessing an archived page rather than a live one. Indeed, user experience has been a large driver for this project. I suppose the UK Government wants to ensure that the public can trust the information they find, and that the frustrating experience of meeting dead ends in the form of dead links is minimised. Further, it does something to address any potential liability issues arising from members of the public accessing – and possibly acting upon – outdated information.

Web Continuity Project at The National Archives

From the JISC-PoWR Project blog.


Ed and I were pleased to come across an interesting document, recently received from The National Archives, describing their Web Continuity Project. This is the latest of the many digital preservation initiatives undertaken by TNA/PRO, which began with EROS and NDAD in the mid-1990s and led to the UK Government Web Archive and other recent initiatives (many in conjunction with the BL and the JISC).

The Web Continuity Project arises from a request by Jack Straw, as leader of the House of Commons in 2007, that government departments ensure continued access to online documents. Further research revealed that:

  • Government departments are increasingly citing URLs in answer to Parliamentary Questions
  • 60% of links in Hansard to UK government websites for the period 1997 to 2006 are now broken
  • Departments vary considerably: for one, every link works; for another, every link is broken. (TNA’s own website is not immune!)


Digital preservation in a nutshell, part II

Originally published on the JISC-PoWR blog.


As Richard noted in Part I, digital preservation is a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” But what sort of digital materials might be in scope for the PoWR project?

We think it extremely likely that institutional web resources are going to include digital materials such as “records created during the day-to-day business of an organisation” and “born-digital materials created for a specific purpose”.

What we want is to “maintain access to these digital materials beyond the limits of media failure or technological change”. This leads us to consider the longevity of certain file formats, the changes undergone by proprietary software, technological obsolescence, and the migration or emulation strategies we’ll use to overcome these problems.

By migration we mean “a means of overcoming technological obsolescence by transferring digital resources from one hardware/software generation to the next.” In contrast, emulation is “a means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on future generations of computers.”

Note also that when we talk about preserving anything, “for as long as necessary” doesn’t always mean “forever”. For the purposes of the PoWR project, it may be worth us considering medium-term preservation for example, which allows “continued access to digital materials beyond changes in technology for a defined period of time, but not indefinitely.”

We also hope to consider the idea of life-cycle management. According to DPC, “The major implications for life-cycle management of digital resources is the need actively to manage the resource at each stage of its life-cycle and to recognise the inter-dependencies between each stage and commence preservation activities as early as practicable.”

From these definitions alone, it should be apparent that success in the preservation of web resources will potentially involve the participation and co-operation of a wide range of experts: information managers, asset managers, webmasters, IT specialists, system administrators, records managers, and archivists.

(All the quotations and definitions above are taken from the DPC’s online handbook.)