Updating the AIDA toolkit

This week, I have been mostly reworking and reviewing the ULCC AIDA toolkit. We’re planning to relaunch it later this year, with a new name, new scope, and new scorecard.

AIDA toolkit – a short history

The AIDA acronym stands for “Assessing Institutional Digital Assets”. Kevin Ashley and I completed this JISC-funded project in 2009, and the idea was that it could be used by any University – i.e. an Institution – to assess its own capability for managing digital assets.

At the time, AIDA was certainly intended for an HE/FE audience; that’s reflected in the “Institutional” part of the name, and in the type of digital content in scope – content likely to be familiar to anyone working in HE, such as digital libraries, research publications and digital datasets. As a matter of fact, AIDA was pressed into action as a toolkit directly relevant to the needs of Managing Research Data, as is shown by its reworking in 2011 into the CARDIO Toolkit.

I gather that CARDIO, under the auspices of Joy Davidson, HATII and the DCC, has since been quite successful; its take-up among UK Institutions as a way to measure or benchmark their own preparedness for Research Data Management perhaps indicates we were doing something right.

A new AIDA toolkit for 2016

My plan is to open up the AIDA toolkit so that it can be used by more people, apply to more content, and operate on a wider basis. In particular, I want it to apply to:

  • Not just Universities, but any Organisation that has digital content
  • Not just research / library content, but almost anything digital (the term “Digital Assets” always seemed vague to me, whereas the term “Digital Asset Management” is in fact something very specific and may refer to particular platforms and software)
  • Not just repository managers, but also archivists, records managers, and librarians working with digital content.

I’m also going to be adding a simpler scorecard element; we had one for AIDA before, but it got a little too “clever” with its elaborate weighted scores.
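
To illustrate what I mean – this is purely a sketch, with hypothetical element names, scores and weights, and not the actual AIDA scoring model – the difference is roughly this:

    # Purely illustrative: hypothetical elements, self-assessed on an assumed 1-5 scale.
    scores = {"organisation": 3, "technology": 4, "resources": 2}

    # The old, "clever" approach: per-element weights that needed tuning and explaining.
    weights = {"organisation": 0.5, "technology": 0.3, "resources": 0.2}
    weighted_score = sum(scores[k] * weights[k] for k in scores)

    # The simpler scorecard: a plain average anyone can recalculate by hand.
    simple_score = sum(scores.values()) / len(scores)

    print(f"weighted: {weighted_score:.2f}  simple: {simple_score:.2f}")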

Readers may legitimately wonder if the community really “needs” another self-assessment tool; we teach several of the known models on our Digital Preservation Training Programme, including the use of the TRAC framework for self-assessment purposes; and since doing AIDA, the excellent DPCMM has become available, and indeed the latter has influenced my thinking. The new AIDA toolkit will continue to be a free download, though, and we’re aiming to retain its overall simplicity, which we believe is one of its strengths.

A new acronym

As part of this plan, I’m keen to bring out and highlight the “Capability” and “Management” parts of the AIDA toolkit, factors which have been slightly obscured by its current name and acronym. With this in mind, I need a new name and a new acronym. The elements that must be included in the title are:

  • Assessing or Benchmarking
  • Organisational
  • Capacity or Readiness [for]
  • Management [of]
  • Digital Content

I’ve already tried feeding these combinations through various online acronym generators, and come up empty. Hence we would like to invite the wider digital preservation community, and use the power of crowd-sourcing, to collect suggestions and ideas. Simply comment below or tweet us at @dart_ulcc using the #AIDAthatsnotmyname hashtag. Naturally, the winner(s) of this little crowd-sourcing contest will receive written credit in the final relaunched AIDA toolkit.
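
For the curious, the brute force behind that search is trivial. The throwaway sketch below simply prints every choice of the alternative words and every ordering of their initial letters, much as an online acronym generator might; it is purely illustrative and not part of the toolkit.

    from itertools import permutations, product

    # One initial letter per required element, with the allowed alternatives.
    alternatives = [
        ("A", "B"),    # Assessing / Benchmarking
        ("O",),        # Organisational
        ("C", "R"),    # Capacity / Readiness
        ("M",),        # Management
        ("D",),        # Digital [Content]
    ]

    # Try every choice of alternatives in every order and collect the candidates.
    candidates = set()
    for choice in product(*alternatives):
        for order in permutations(choice):
            candidates.add("".join(order))

    for acronym in sorted(candidates):
        print(acronym)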

Every man his own modified digital object

Today we’ve just completed our Future-Proofing study at ULCC and sent the final report to the JISC Programme Manager, with hopes of a favourable sign-off so that we can publish the results on our blog.

It was a collaboration between me and Kit Good, the records manager here at UoL. We’re quite pleased with the results. We wanted to see if we could create preservation copies of core business documents that require permanent preservation, but do it using a very simple intervention and with zero overheads. So we worked with a simple toolkit of services and software that can plug into a network drive, using open-source migration and validation tools. Our case study sought to demonstrate the viability of this approach. Along the way we learned a lot about how the Xena digital preservation software operates, and how (combined with OpenOffice) it does a very credible job of producing bare-bones Archival Information Packages and putting information into formats with improved long-term prospects.

The project worked on a small test corpus of common Institutional digital records, performed preservation transformations on them, and conducted a systematic evaluation to ensure that the conversions worked, that the finished documents render correctly, that sufficient metadata has been generated for preservation purposes, and that this metadata can feasibly be extracted and stored in a database – in short, that the results are satisfactory and fit for purpose.
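
To give a flavour of the evaluation step, here is a minimal sketch of the kind of check involved – not the project’s actual scripts, and it assumes (purely for illustration) that the converted copies sit alongside the original .doc files with a .odt extension:

    import hashlib
    import sqlite3
    from pathlib import Path

    SOURCE_DIR = Path("records")     # hypothetical folder of original .doc records
    CONVERTED_EXT = ".odt"           # hypothetical extension of the converted copies

    db = sqlite3.connect("preservation_metadata.db")
    db.execute("""CREATE TABLE IF NOT EXISTS files
                  (source TEXT, converted TEXT, size_bytes INTEGER, sha1 TEXT)""")

    for source in SOURCE_DIR.glob("*.doc"):
        converted = source.with_suffix(CONVERTED_EXT)
        if not converted.exists():
            print(f"MISSING conversion for {source.name}")
            continue
        # Record bare-bones preservation metadata for the converted copy.
        digest = hashlib.sha1(converted.read_bytes()).hexdigest()
        db.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
                   (str(source), str(converted), converted.stat().st_size, digest))

    db.commit()
    db.close()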

The results show us that it is possible to build a low-cost, practical preservation solution that addresses immediate preservation problems, makes use of available open source tools, and requires minimal IT support. We think the results of the case study can feasibly be used by other Institutions facing similar difficulties, and scaled up to apply to the preservation of other, more complex digital objects. It will enable non-specialist information professionals to perform certain preservation and information management tasks with a minimum of preservation-specific theoretical knowledge.

Future-Proofing won’t solve your records management problems, but it stands a chance of empowering records managers by allowing them to create preservation-worthy digital objects out of their organisation’s records, without the need for an expensive bespoke solution.

Working with Web Curator Tool (part 1)

Keen readers may recall a post from April 2008 about my website-archiving forays with Web Curator Tool, the workflow database used for programming Heritrix, the crawler which does the actual harvesting of websites.

Other UKWAC partners and I have since found that Heritrix sometimes has a problem, described by some as ‘collateral harvesting’: it can gather links, pages, resources, images, files and so forth from websites we don’t actually want to include in the finished archived item.

Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com, for example. Sometimes, though, it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example, if the crawler somehow finds a website full of .avi files).

I have probably become overly preoccupied with this issue, since I don’t want to increase the overheads of our sponsor (JISC) by occupying their share of the server space with unnecessarily bloated gathers, nor to clutter up the shared bandwidth by spending hours gathering pages unnecessarily.

Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the Prune Tool on the harvested site after the gather has run. The Prune Tool allows you to browse the gather’s tree structure, and to delete a single file or an entire folder full of files which you don’t want.

The other option is to apply exclusion filters to the title before the gather runs. This can be a much more effective method: you enter a short pattern in the ‘Exclude Filters’ box of a title’s profile, with .* acting as a wildcard. For example, .*www.aes.org.* will exclude that entire website from the gather, while .*/images/.* will exclude any path containing a folder named ‘images’.
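
For anyone wondering how these filters behave, here is a minimal sketch that treats the two examples above as ordinary regular expressions and tests them against a few URLs (the example.ac.uk addresses are made up for illustration); it is not the WCT or Heritrix code itself:

    import re

    # The two example filters from above, treated as ordinary regular expressions.
    exclude_filters = [
        r".*www.aes.org.*",   # exclude an entire external website
        r".*/images/.*",      # exclude any path containing an 'images' folder
    ]

    candidate_urls = [
        "http://www.aes.org/index.html",
        "http://www.example.ac.uk/project/images/logo.gif",
        "http://www.example.ac.uk/project/report.html",
    ]

    for url in candidate_urls:
        excluded = any(re.match(pattern, url) for pattern in exclude_filters)
        print("EXCLUDE" if excluded else "keep   ", url)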

So far I generally find myself making two types of exclusion:

(a) Exclusions of websites we don’t want. As noted with collateral harvesting, Heritrix is following external links from the target a little too enthusiastically. It’s easy to identify these sites with the Tree View feature in WCT. This view also lets you know the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.

(b) Exclusions of certain pages or folders within the Target which we don’t want. This is where it gets slightly trickier, and we start to look in the log files of client-server requests for instances where the crawler is staying within the target but requesting the same page over and over. This can happen with database-driven sites, CMS sites, wikis, and blogs.
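
The kind of check I find myself doing can be sketched as follows. This assumes a hypothetical crawl.log in which the requested URL is the fourth whitespace-separated field, so the field index would need adjusting to suit your own logs:

    from collections import Counter

    URL_FIELD = 3   # assumed position of the URL in each log line (adjust to suit)
    counts = Counter()

    # Tally how often each URL has been requested during the gather.
    with open("crawl.log") as log:
        for line in log:
            fields = line.split()
            if len(fields) > URL_FIELD:
                counts[fields[URL_FIELD]] += 1

    # The most frequently requested URLs are good candidates for exclusion filters.
    for url, n in counts.most_common(20):
        print(f"{n:6d}  {url}")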

I believe I may have had a ‘breakthrough’ of sorts in managing collateral harvesting on at least one brand of wiki, and will report on this in my next post.

Digital preservation in a nutshell, part II

Originally published on the JISC-PoWR blog.

As Richard noted in Part I, digital preservation is a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” But what sort of digital materials might be in scope for the PoWR project?

We think it extremely likely that institutional web resources are going to include digital materials such as “records created during the day-to-day business of an organisation” and “born-digital materials created for a specific purpose”.

What we want is to “maintain access to these digital materials beyond the limits of media failure or technological change”. This leads us to consider the longevity of certain file formats, the changes undergone by proprietary software, technological obsolescence, and the migration or emulation strategies we’ll use to overcome these problems.

By migration we mean “a means of overcoming technological obsolescence by transferring digital resources from one hardware/software generation to the next.” In contrast, emulation is “a means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on future generations of computers.”

Note also that when we talk about preserving anything, “for as long as necessary” doesn’t always mean “forever”. For the purposes of the PoWR project, it may be worth considering medium-term preservation, for example, which allows “continued access to digital materials beyond changes in technology for a defined period of time, but not indefinitely.”

We also hope to consider the idea of life-cycle management. According to DPC, “The major implications for life-cycle management of digital resources is the need actively to manage the resource at each stage of its life-cycle and to recognise the inter-dependencies between each stage and commence preservation activities as early as practicable.”

From these definitions alone, it should be apparent that success in the preservation of web resources will potentially involve the participation and co-operation of a wide range of experts: information managers, asset managers, webmasters, IT specialists, system administrators, records managers, and archivists.

(All the quotations and definitions above are taken from the DPC’s online handbook.)

Web-archiving: the WCT workflow tool

This month I have been happily harvesting JISC project website content using my new toy, the Web Curator Tool. It has been rewarding to resume work on this project after a hiatus of some months; the former setup, which used PANDAS software, has been winding down since December. Who knows what valuable information and website content changes may have escaped the archiving process during these barren months?

Web Curator Tool is a web-based workflow database, one which manages the assignment of permission records, builds profiles for each ‘target’ website, and allows a certain amount of interfacing with Heritrix, the actual engine that gathers the materials. The open-source Heritrix project is being developed by the Internet Archive, whose access software (effectively the ‘Wayback Machine’) may also be deployed in the new public-facing website when it is launched in May 2008.

Although the idiosyncrasies of WCT caused me some anguish at first, largely through being removed from my ‘comfort zone’ of managing regular harvests, I suddenly turned the corner about two weeks ago. The diagnostics are starting to make sense. Through judicious ticking of boxes and refreshing of pages, I can now interrogate the database to the finest detail. I learned how to edit and save a target so as to ‘force’ a gather, thus helping to clear the backlog of scheduled gathers which had been accumulating, unbeknownst to us, since December. Most importantly, with the help of UKWAC colleagues, we’re slowly finding ways of modifying the profile so as to gather less external material (or reduce collateral harvesting, to put it another way), or to extend its reach to capture stylesheets and other content which is outside the root URL.

True, a lot of this has been trial and error, involving experimental gathers before a setting was found that would ‘take’. But WCT, unlike our previous set-up, allows the possibility of gathering a site more than once in a day. And it’s much faster. It can bring in results on some of the smaller sites in less than two minutes.

Now, 200 new instances of JISC project sites have been successfully gathered during March and April alone. A further 50 instances have been brought in from the Jan-Feb backlog. The daunting backlog of queued instances has been reduced to zero. Best of all, over 30 new JISC project websites (i.e. those which started around or after December 07) have been brought into the new system. I’ll be back in my comfort zone in no time…