How to capture and preserve electronic newsletters in HE and beyond

This blog post is based on a real-world case study. It happens to have come from a UK Higher Education institution, but the lessons here could feasibly apply to anyone wishing to capture and preserve electronic newsletters.

The archivist reported that the staff newsletter had started to manifest itself in electronic form “without warning”. Presumably they’d been collecting the paper version successfully for a number of years before this change came along. The change was noticed when the archivist (and all staff) received the Newsletter in email form. The archivist immediately noticed that the email was full of embedded links and pictures. If this was now the definitive and only version of the newsletter, how would they capture and preserve it?

I asked the archivist to send me a copy of the email, so I could investigate further.

It turns out the Newsletter in this case is in fact a website, or a web-based resource. It’s hosted and managed by a company called Newsweaver, a communications software company specialising in generating electronic newsletters and providing the means for their dissemination. They do this for quite a few UK Universities; for instance, the University of Manchester resource can be seen here. In this instance, the email noted above is simply a version of the Newsletter page, slightly recast and delivered in email form. By following the links in the example, I was soon able to see the full version of that issue of the Newsletter, and indeed the entire collection (unhelpfully labelled an “archive” – but that’s another story).

What looked at first like an email capture-and-preserve issue is more likely a case calling for web-archiving. Only through web-archiving would we get the full functionality of the resource. The email, for instance, contains links labelled “Read More”, which when followed take us to the parent Newsweaver site. If we simply preserved the email, we’d only have a cut-down version of the Newsletter; more importantly, the links would not work if Newsweaver.com became unavailable or changed its URLs.

Since I’m familiar with the desktop web-archiving tool HTTrack, I tried an experiment to see if I could capture the online Newsletter from the Newsweaver host. My first gather failed, because the resource is protected by the site’s robots.txt file (more on this below), but a second gather worked when I instructed the web harvester to ignore it.
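
For anyone wanting to try a similar experiment, the sketch below shows roughly how such a gather could be scripted in Python by shelling out to HTTrack. The target URL and output directory are placeholders, and the robots option shown is an assumption that should be checked against your own HTTrack documentation.

    # Sketch: run an HTTrack gather of a Newsweaver-hosted newsletter from a script.
    # Assumes HTTrack is installed and on the PATH. The URL and output directory are
    # hypothetical placeholders, and the -s0 option (do not obey robots.txt) should
    # be verified against your version's documentation before relying on it.
    import subprocess
    from pathlib import Path

    target_url = "http://example-university.newsweaver.co.uk/"  # hypothetical address
    output_dir = Path("newsletter-capture")
    output_dir.mkdir(exist_ok=True)

    subprocess.run(
        [
            "httrack", target_url,
            "-O", str(output_dir),  # where the mirrored site is written
            "-s0",                  # ignore robots.txt (with the host's agreement)
        ],
        check=True,
    )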

My trial brought in about 500-600MB of content after one hour of crawling – there is probably more, but I decided to terminate the gather at that point. I now had a working copy of the entire Newsletter collection for this University. In my version, all the links work, the fonts are the same, and the pictures are embedded. I would treat this as a standalone capture of the resource, by which I mean it is no longer dependent on the live web: it works as a collection of HTML pages, images and stylesheets that can be opened in any browser.

Of course, it is only a snapshot. To be completely successful, a capture and archiving strategy would need to run a gather like this on a regular basis, capturing new content as it is published. Perhaps once a year would do it, or every six months. If that works, it can form the basis of a strategy for the digital preservation of this newsletter.

Such a strategy might evolve along these lines:

  • Archivist decides to include electronic newsletters in their Selection Policy. Rationale: the University already has them in paper form. They represent an important part of University history. The collection should continue for business needs. Further, the content will have heritage value for researchers.
  • University signs up to this strategy. Hopefully, someone agrees that it’s worth paying for. The IT server manager agrees to allocate 600MB of space (or whatever) per annum for the storage of these HTTrack web captures. The archivist is allocated time from an IT developer, whose job it is to program HTTrack and run the capture on a regular basis.
  • The above process is expressed as a formal workflow, or (to use terms an archivist would recognise) a Transfer and Accession Policy. With this agreement, names are in the frame; tasks are agreed; dates for when this should happen are put into place. The archivist doesn’t have to become a technical expert overnight, they just have to manage a Transfer / Accession process like any other.
  • Since they are “snapshots”, the annual web crawls could be reviewed – just like any other deposit of records. A decision could be made as to whether they all need to be kept, or whether it’s enough to keep only the latest snapshot. Periodic review lightens the burden on the servers (see the sketch after this list).
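
As an illustration of what that periodic review might involve in practice, here is a minimal sketch that reports the size of each snapshot directory. It assumes each capture has been written to its own dated folder under a common parent – an arrangement I am inventing for the example; the decision about what to keep remains with the archivist.

    # Sketch: report the size of each snapshot directory to support a periodic review.
    # Assumes captures are stored as snapshots/2008, snapshots/2009, and so on;
    # adjust the path convention to whatever the IT developer actually uses.
    from pathlib import Path

    snapshot_root = Path("snapshots")

    def directory_size_mb(path: Path) -> float:
        """Total size of all files under `path`, in megabytes."""
        return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / (1024 * 1024)

    for snapshot in sorted(snapshot_root.iterdir()):
        if snapshot.is_dir():
            print(f"{snapshot.name}: {directory_size_mb(snapshot):.0f} MB")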

This isn’t yet full digital preservation – it’s more about capture and management. But at least the Newsletters are not being lost. Another, later, part of the strategy is for the University to decide how it will keep these digital assets in the long term, for instance in a dedicated digital preservation repository – a service which the University might not be able to provide itself, or even want to. But it’s a first step towards getting the material into a preservable state.

There are some other interesting considerations in this case:

The content is hosted by Newsweaver, not by the University. The name of the institution is included in the URL, but it’s not part of the ac.uk estate. This means that an intervention is most certainly needed if the University wants to keep the content long-term. It’s not unlike the Flickr service, which merely acts as a means of hosting and distributing your content online. For the proposed strategy to work, the archivist would probably need to speak to Newsweaver and advise them of the plan to make annual harvests. There would need to be an agreement that robots.txt is disabled or ignored, or the harvest won’t work. There may also be a way to schedule the harvest at a time that won’t put undue stress on the servers.
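
As a quick preliminary check before any such conversation, it’s easy to see whether robots.txt would in fact block a harvest. The sketch below uses Python’s standard robotparser module to ask whether a crawler would be allowed to fetch a given page; the host and page are made-up placeholders.

    # Sketch: check whether robots.txt would block a harvest of the newsletter pages.
    # The host and page below are hypothetical placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example-university.newsweaver.co.uk/robots.txt")
    rp.read()

    page = "http://example-university.newsweaver.co.uk/staffnews/latest-issue"
    print("Harvest allowed:", rp.can_fetch("*", page))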

Newsweaver might even wish to co-operate with this plan; maybe they have a means of exporting content from the back-end system that would work just as well as this pull-gather method, but then it’s likely the archivist would need additional technical support to take it further. I would be very surprised if Newsweaver claimed any IP or ownership of the content, but it would be just as well to ascertain what’s set out in the contract with the company. This adds another potential stakeholder to the mix: the editorial team who compile the University Newsletter in the first place.

Operating HTTrack may seem like a daunting prospect to an archivist. A simpler option would be to use PDF as the target format for preservation. One approach would be to print the emails to PDF, an operation which could be done directly from the desktop with minimal support, although a licensed copy of Adobe Acrobat would be needed. Even so, the PDF version would disappoint very quickly: the links wouldn’t work as standalone links, and would point back to the larger Newsweaver collection on the live web. That said, a PDF version would look exactly like the email version, and PDF would be more permanent than the email format.

The second PDF approach would be to capture pages from Newsweaver using Acrobat’s “Create PDF from Web Page” feature. This would yield a slightly better result than the email option above, but the links would still fail. For the full joined-up richness of the highly cross-linked Newsletter collection, web-archiving is still the best option.

To summarise the high-level issues, I suggest an archivist needs to:

  • Define the target of preservation. In this case we thought it was an email at first, but it turns out the target is web content hosted on a domain not owned by the University.
  • Define the aspects of the Newsletter which we want to survive – such as links, images, and stylesheets.
  • Agree and sign off a coherent selection policy and transfer procedure, and get resources assigned to the main tasks.
  • Assess the costs of storing these annual captures, and tell your IT manager what you need in terms of server space (a rough projection is sketched after this list).
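
The arithmetic behind that last point is simple enough to sketch. The 600MB per capture matches my trial gather; the growth figure below is purely an assumption to show the shape of the calculation, not a measurement.

    # Sketch: rough projection of server space needed for annual captures.
    # 600 MB per capture matches the trial gather described above; the 10% annual
    # growth figure is an assumption for illustration only.
    capture_size_mb = 600
    annual_growth = 0.10
    years = 10

    total_mb = sum(capture_size_mb * (1 + annual_growth) ** year for year in range(years))
    print(f"Estimated storage over {years} years: {total_mb / 1024:.1f} GB")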

If there’s a business case to be made to someone, the first thing to point out is the risk of leaving this resource in the hands of Newsweaver, who are great at content delivery, but may not have a preservation policy or a commitment to keep the content beyond the life of the contract.

This approach has some value as a first step towards digital preservation; it gets the archivist on the radar of the IT department, the policy owners, and finance, and wakes up senior University staff to the risks of trusting third parties with your content. Further, if successful, it could become a staff-wide policy that individual recipients of the email can, in future, delete these emails in the knowledge that the definitive resource is being safely captured and backed up.

Working with Web Curator Tool (part 1)

Keen readers may recall a post from April 2008 about my website-archiving forays with Web Curator Tool, the workflow database used to configure Heritrix, the crawler which does the actual harvesting of websites.

Other UKWAC partners and I have since found that Heritrix sometimes has a problem, described by some as ‘collateral harvesting’: it can gather links, pages, resources, images, files and so forth from websites we don’t actually want to include in the finished archived item.

Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com, for example. Sometimes, though, it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example, if the crawler somehow finds a website full of .avi files).

I have probably become overly preoccupied with this issue, since I don’t want to increase the overheads of our sponsor (JISC) by occupying their share of the server space with unnecessarily bloated gathers, nor clutter up the shared bandwidth by spending hours gathering pages unnecessarily.

Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the Prune Tool on the harvested site after the gather has run. The Prune Tool allows you to browse the gather’s tree structure, and to delete a single file or an entire folder full of files which you don’t want.

The other option is to apply exclusion filters to the title before the gather runs. This can be a much more effective method: you enter a small regular expression in the ‘Exclude Filters’ box of a title’s profile, with .* acting as the wildcard. .*www.aes.org.* will exclude that entire website from the gather, while .*/images/.* will exclude any path containing a folder named ‘images’.
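
Because these filters are just regular expressions, a candidate filter can be tested against a few sample URLs before it goes into a profile. The sketch below does this in Python; the URLs are invented examples of the sort of thing a crawl might encounter, not real log entries.

    # Sketch: test exclusion filters (regular expressions) against sample URLs
    # before adding them to a target's profile in WCT.
    import re

    exclude_filters = [r".*www\.aes\.org.*", r".*/images/.*"]

    sample_urls = [
        "http://www.aes.org/standards/",              # external site we don't want
        "http://www.example.ac.uk/images/logo.gif",   # path containing an 'images' folder
        "http://www.example.ac.uk/about/index.html",  # should be kept
    ]

    for url in sample_urls:
        excluded = any(re.fullmatch(pattern, url) for pattern in exclude_filters)
        print(f"{'EXCLUDE' if excluded else 'keep   '}  {url}")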

So far I generally find myself making two types of exclusion:

(a) Exclusions of websites we don’t want. As noted above under collateral harvesting, Heritrix follows external links from the target a little too enthusiastically. It’s easy to identify these sites with the Tree View feature in WCT, which also shows the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.

(b) Exclusions of certain pages or folders within the Target which we don’t want. This is where it gets slightly trickier, and we start to look in the log files of client-server requests for instances where the crawler stays within the target but keeps requesting the same page over and over (a sketch of this kind of log check follows below). This can happen with database-driven sites, CMS sites, wikis, and blogs.
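
The sketch below illustrates the kind of log check I mean: it counts how often each URL appears in a crawl log and flags anything requested suspiciously often. It assumes a simplified log with one requested URL per line, which is not the real Heritrix log format, so the parsing would need adapting.

    # Sketch: spot pages the crawler is requesting over and over again.
    # Assumes a simplified log file with one requested URL per line; real Heritrix
    # crawl logs carry more columns, so the parsing would need to be adapted.
    from collections import Counter

    with open("crawl.log") as log:
        counts = Counter(line.strip() for line in log if line.strip())

    threshold = 20  # arbitrary cut-off for "suspiciously often"
    for url, count in counts.most_common():
        if count >= threshold:
            print(f"{count:5d}  {url}")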

I believe I may have had a ‘breakthrough’ of sorts with managing collateral harvesting with at least one brand of wiki, and will report on this for my next post.

The Continuity Girl

Amanda Spencer gave an informative presentation at the UK Web-Archiving Consortium Partners Meeting on 23 July, which I happened to attend. The Web Continuity Project at TNA is a large-scale and Government-centric project, which includes a “comprehensive archiving of the government web estate by The National Archives”. Its aims are to address both “persistence” and “preservation” in a way that is seamless and robust: in many ways, “continuity” seems a very apposite concept with which to address the particular nature of web resources. It’s all about the issue of sustainable information across government.

At ULCC we’re interested to see if we can align some ‘continuity’ ideas within the context of our PoWR project. Many of the issues facing departmental web and information managers are likely to have analogues in HE and FE institutions, and Web Continuity offers concepts and ways of working that may be worth considering and may be adaptable to a web-archiving programme in a University.

A main area of focus for Web Continuity is the integrity of websites – links, navigation, consistency of presentation. The working group on this, set up by Jack Straw, found a lot of mixed practice in e-publication (some departments use attached PDFs, others HTML pages) and numerous different content management systems in use – no centralised or consistent publication method, in other words.

To achieve persistence of links, Web Continuity are making use of digital object identifiers (DOIs), which can marry a live URL to a persistent identifier. Further, they use a redirection component derived from open-source software, which can be installed on common web server applications, e.g. Apache and Microsoft IIS. This component will “deliver the information requested by the user whether it is on the live website, or retrieved from the web archive and presented appropriately”. Of course, the redirection component only works if the domains are still being maintained, but it will do much to ensure that links persist over time.
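
The principle behind the redirection component can be illustrated in a few lines: try the live page first, and fall back to an archived copy if it has gone. This is only my sketch of the idea, not TNA’s actual component, and the archive address format below is a hypothetical placeholder.

    # Sketch of the redirection principle: serve the live page if it still responds,
    # otherwise point the user at an archived copy. Not TNA's actual component; the
    # archive URL format below is a hypothetical placeholder.
    from urllib import request, error

    ARCHIVE_BASE = "http://webarchive.example.gov.uk/"  # hypothetical archive address

    def resolve(url: str) -> str:
        """Return the live URL if it responds, otherwise an archive URL to redirect to."""
        try:
            with request.urlopen(url, timeout=10) as response:
                if response.status == 200:
                    return url
        except error.URLError:
            pass
        return ARCHIVE_BASE + url

    print(resolve("http://www.example.gov.uk/old-consultation.html"))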

They are building a centralised registry database, which is growing into an authority record of Government websites, including other useful contextual and technical detail (which can be updated by Departmental webmasters). It is a means of auditing the website crawls that are undertaken. Such a registry approach would be well worth considering on a smaller scale for a University.

Their sitemap implementation plan involves the rollout of XML sitemaps across government. XML sitemaps can help archiving because they expose hidden content that is not linked to by navigation, or dynamic pages created by a CMS or database. This methodology may be something for HE and FE webmasters to consider, as it would assist with remote harvesting by an agreed third party.
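
For webmasters curious about what is involved, a sitemap is a very small XML file. The sketch below generates one with Python’s standard library for a couple of placeholder addresses, including the sort of dynamic page that site navigation might never link to; a real sitemap would be built from the CMS or database that actually knows about the hidden pages.

    # Sketch: generate a minimal XML sitemap listing pages a crawler might otherwise miss.
    # The URLs are placeholders for illustration only.
    import xml.etree.ElementTree as ET

    pages = [
        "http://www.example.ac.uk/",
        "http://www.example.ac.uk/news/archive?issue=42",  # dynamic page with no inbound link
    ]

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)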

The intended presentation method will make it much clearer to users that they are accessing an archived page rather than a live one. Indeed, user experience has been a large driver for this project. I suppose that UK Government want to ensure that the public can trust the information they find, and that the frustrating experience of meeting dead-ends in the form of dead links is minimised. Further, it does something to address any potential liability issues arising from members of the public accessing – and possibly acting upon – outdated information.

Web-archiving: the WCT workflow tool

This month I have been happily harvesting JISC project website content using my new toy, the Web Curator Tool. It has been rewarding to resume work on this project after a hiatus of some months; the former setup, which used PANDAS software, has been winding down since December. Who knows what valuable information and website content changes may have escaped the archiving process during these barren months?

Web Curator Tool is a web-based workflow database, one which manages the assignment of permission records, builds profiles for each ‘target’ website, and allows a certain amount of interfacing with Heritrix, the actual engine that gathers the materials. The open-source Heritrix project is being developed by the Internet Archive, whose access software (effectively the ‘Wayback Machine’) may also be deployed in the new public-facing website when it is launched in May 2008.

Although the idiosyncrasies of WCT caused me some anguish at first, largely through being removed from my ‘comfort zone’ of managing regular harvests, I suddenly turned the corner about two weeks ago. The diagnostics are starting to make sense. Through judicious ticking of boxes and refreshing of pages, I can now interrogate the database to the finest detail. I learned how to edit and save a target so as to ‘force’ a gather, thus helping to clear the backlog of scheduled gathers which had been accumulating, unbeknownst to us, since December. Most importantly, with the help of UKWAC colleagues, we’re slowly finding ways of modifying the profile so as to gather less external material (reducing collateral harvesting, to put it another way), or to extend its reach to capture stylesheets and other content outside the root URL.

True, a lot of this has been trial and error, involving experimental gathers before a setting was found that would ‘take’. But WCT, unlike our previous set-up, allows the possibility of gathering a site more than once in a day. And it’s much faster. It can bring in results on some of the smaller sites in less than two minutes.

Now, 200 new instances of JISC project sites have been successfully gathered during March and April alone. A further 50 instances have been brought in from the Jan-Feb backlog. The daunting backlog of queued instances has been reduced to zero. Best of all, over 30 new JISC project websites (i.e. those which started around or after December 07) have been brought into the new system. I’ll be back in my comfort zone in no time…