Working with Web Curator Tool (part 1)

Keen readers may recall a post from April 2008 about my website-archiving forays with Web Curator Tool, the workflow database used to configure Heritrix, the crawler that does the actual harvesting of websites.

Other UKWAC partners and I have since found that Heritrix sometimes has a problem, described by some as ‘collateral harvesting’. This means it can gather links, pages, resources, images, files and so forth from websites we don’t actually want to include in the finished archived item.

Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com, for example. Sometimes, though, it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example, if the crawler somehow finds a website full of .avi files).

I have probably become overly preoccupied with this issue, since I don’t want to increase the overheads of our sponsor, JISC, by occupying their share of the server space with unnecessarily bloated gathers, nor clutter up the shared bandwidth by spending hours gathering pages we don’t need.

Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the Prune Tool on the harvested site after the gather has run. The Prune Tool allows you to browse the gather’s tree structure, and to delete a single file or an entire folder full of files which you don’t want.

The other option is to apply exclusion filters to the title before the gather runs, which can be a much more effective method. You enter a short pattern in the ‘Exclude Filters’ box of a title’s profile. The patterns are regular expressions, and the basic building block is .*, which matches any run of characters. So .*www.aes.org.* will exclude that entire website from the gather, and .*/images/.* will exclude any path containing a folder named ‘images’.
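To see how such patterns behave before committing them to a profile, they can be tried out against a few sample URLs. What follows is only a rough sketch in Python, not anything WCT itself provides. It assumes that a filter excludes a URL when the pattern matches the whole URL, and it uses Python's re module as a stand-in for the crawler's own regular expression engine. The helper names (host_exclusion, is_excluded) and the example.ac.uk addresses are made up for illustration.

```python
import re


def host_exclusion(host: str) -> str:
    """Build a whole-site exclusion pattern (hypothetical helper).

    Escaping the host means each dot matches a literal dot only;
    an unescaped dot would match any single character.
    """
    return ".*" + re.escape(host) + ".*"


def is_excluded(url: str, filters: list[str]) -> bool:
    """Return True if any filter matches the entire URL (assumed behaviour)."""
    return any(re.fullmatch(pattern, url) for pattern in filters)


# The two filters from the text; the first is built with dots escaped.
filters = [host_exclusion("www.aes.org"), ".*/images/.*"]

for url in [
    "http://www.aes.org/events/",
    "http://www.example.ac.uk/images/logo.png",
    "http://www.example.ac.uk/about.html",
]:
    print(url, "excluded" if is_excluded(url, filters) else "kept")
```

The unescaped form .*www.aes.org.* works too, but it is slightly looser than it looks, since each dot will also match any other character.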

So far I generally find myself making two types of exclusion:

(a) Exclusions of websites we don’t want. As noted with collateral harvesting, Heritrix is following external links from the target a little too enthusiastically. It’s easy to identify these sites with the Tree View feature in WCT. This view also lets you know the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.

(b) Exclusions of certain pages or folders within the Target which we don’t want. This is where it gets slightly trickier, and we start to look in the log files of client-server requests for instances where the crawler is staying within the target, but performing actions like requesting the same page over and over. This can happen with database-driven sites, CMS sites, wikis, and blogs.
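The log-reading in (b) can be partly automated. As a minimal sketch, assuming the requested URLs have already been pulled out of the log into a plain list (the log format itself is not covered here), the following Python groups requests by host and path, ignoring the query string, and reports any page requested several times. The function name repeated_pages, the threshold of 3 and the wiki.example.ac.uk addresses are all invented for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def repeated_pages(urls, threshold=3):
    """List (host + path, count) for pages requested at least `threshold` times.

    Query strings are ignored, so index.php?action=edit and
    index.php?action=history count as the same underlying page.
    """
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        counts[parts.netloc + parts.path] += 1
    return [(page, n) for page, n in counts.most_common() if n >= threshold]


# Invented sample of requested URLs, as might be extracted from a crawl log.
requested = [
    "http://wiki.example.ac.uk/index.php?title=Home",
    "http://wiki.example.ac.uk/index.php?title=Home&action=edit",
    "http://wiki.example.ac.uk/index.php?title=Home&action=history",
    "http://wiki.example.ac.uk/index.php?title=Home&oldid=12",
    "http://wiki.example.ac.uk/about.html",
]

for page, n in repeated_pages(requested):
    print(f"{page} requested {n} times")
```

Pages that turn up in this list are candidates for an exclusion filter, though each one still needs checking by eye, since some repeated requests will be for content we do want.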

I believe I may have had a ‘breakthrough’ of sorts with managing collateral harvesting with at least one brand of wiki, and will report on this in my next post.

The Continuity Girl

Amanda Spencer gave an informative presentation at the UK Web-Archiving Consortium Partners Meeting on 23 July, which I happened to attend. The Web Continuity Project at TNA is a large-scale and Government-centric project, which includes a “comprehensive archiving of the government web estate by The National Archives”. Its aims are to address both “persistence” and “preservation” in a way that is seamless and robust: in many ways, “continuity” seems a very apposite concept with which to address the particular nature of web resources. It’s all about the issue of sustainable information across government.

At ULCC we’re interested to see if we can align some ‘continuity’ ideas within the context of our PoWR project. Many of the issues facing departmental web and information managers are likely to have analogues in HE and FE institutions, and Web Continuity offers concepts and ways of working that may be worth considering and may be adaptable to a web-archiving programme in a University.

A main area of focus for Web Continuity is integrity of websites – links, navigation, consistency of presentation. The working group on this, set up by Jack Straw, found a lot of mixed practices in e-publication (some use attached PDFs, others HTML pages); and numerous different content management systems in use. No centralised or consistent publication method, in other words.

To achieve persistence of links, Web Continuity are making use of digital object identifiers (DOIs) which can marry a live URL to a persistent identifier. Further, they use a redirection component which is derived from open-source software. It can be installed on common web server applications, e.g. Apache and Microsoft IIS. This component will “deliver the information requested by the user whether it is on the live website, or retrieved from the web archive and presented appropriately”. Of course, this redirection component only works if the domains are still being maintained, but it will do much to ensure that links persist over time.
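The decision the redirection component makes can be pictured roughly as follows. This is not the Web Continuity software, which runs inside the web server, but a bare sketch of the logic in Python: serve the live page if it still exists, otherwise point the user at an archived copy. The function name resolve, the is_live check and the archive.example.org prefix are all hypothetical.

```python
def resolve(url, is_live, archive_prefix="http://archive.example.org/"):
    """Return the address to serve for a requested URL.

    `is_live` is any callable that reports whether the page still exists
    on the live site. The archive prefix is a made-up placeholder, not
    the address of a real web archive.
    """
    if is_live(url):
        return url
    return archive_prefix + url


# A stand-in for a real availability check: only these pages still exist.
live_pages = {"http://www.example.gov.uk/current-policy.html"}

for requested in [
    "http://www.example.gov.uk/current-policy.html",
    "http://www.example.gov.uk/2005-consultation.html",
]:
    print(requested, "->", resolve(requested, live_pages.__contains__))
```

Passing the availability check in as a function keeps the sketch free of any network access, so the decision itself can be followed without a live server.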

They are building a centralised registry database, which is growing into an authority record of Government websites, including other useful contextual and technical detail (which can be updated by Departmental webmasters). It is a means of auditing the website crawls that are undertaken. Such a registry approach would be well worth considering on a smaller scale for a University.

Their sitemap implementation plan involves the rollout of XML sitemaps across government. XML sitemaps can help archiving, because they help to expose hidden content that is not linked to by navigation, or dynamic pages created by a CMS or database. This methodology may be something for HE and FE webmasters to consider, as it would assist with remote harvesting by an agreed third party.
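To give an idea of how a harvester might make use of a sitemap, here is a small sketch in Python that reads the page addresses out of one, so that they could be handed to a crawler as extra starting points. It assumes a sitemap in the standard sitemaps.org format; the function name sitemap_urls and the sample addresses are invented, and nothing here is taken from the Web Continuity implementation.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every page address listed in a sitemap's <loc> elements."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# An invented two-page sitemap, including a page with no inbound links.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.ac.uk/</loc></url>
  <url><loc>http://www.example.ac.uk/reports/unlinked-report.html</loc></url>
</urlset>"""

for address in sitemap_urls(SAMPLE):
    print(address)
```

The second address stands for the kind of page a crawler following links alone would never find.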

The intended presentation method will make it much clearer to users that they are accessing an archived page instead of a live one. Indeed, user experience has been a large driver for this project. I suppose that UK Government want to ensure that the public can trust the information they find and that the frustrating experience of meeting dead-ends in the form of dead links is minimised. Further, it does something to address any potential liability issues arising from members of the public accessing, and possibly acting upon, outdated information.

Web-archiving: the WCT workflow tool

This month I have been happily harvesting JISC project website content using my new toy, the Web Curator Tool. It has been rewarding to resume work on this project after a hiatus of some months; the former setup, which used PANDAS software, has been winding down since December. Who knows what valuable information and website content changes may have escaped the archiving process during these barren months?

Web Curator Tool is a web-based workflow database, one which manages the assignment of permission records, builds profiles for each ‘target’ website, and allows a certain amount of interfacing with Heritrix, the actual engine that gathers the materials. The open-source Heritrix project is being developed by the Internet Archive, whose access software (effectively the ‘Wayback Machine’) may also be deployed in the new public-facing website when it is launched in May 2008.

Although the idiosyncrasies of WCT caused me some anguish at first, largely through being removed from my ‘comfort zone’ of managing regular harvests, I suddenly turned the corner about two weeks ago. The diagnostics are starting to make sense. Through judicious ticking of boxes and refreshing of pages, I can now interrogate the database to the finest detail. I learned how to edit and save a target so as to ‘force’ a gather, thus helping to clear the backlog of scheduled gathers which had been accumulating, unbeknownst to us, since December. Most importantly, with the help of UKWAC colleagues, we’re slowly finding ways of modifying the profile so as to gather less external material (or reduce collateral harvesting, to put it another way); or extend its reach to capture stylesheets and other content which is outside the root URL.

True, a lot of this has been trial and error, involving experimental gathers before a setting was found that would ‘take’. But WCT, unlike our previous set-up, allows the possibility of gathering a site more than once in a day. And it’s much faster. It can bring in results on some of the smaller sites in less than two minutes.

Now, 200 new instances of JISC project sites have been successfully gathered during March and April alone. A further 50 instances have been brought in from the Jan-Feb backlog. The daunting backlog of queued instances has been reduced to zero. Best of all, over 30 new JISC project websites (i.e. those which started around or after December 07) have been brought into the new system. I’ll be back in my comfort zone in no time…