Conference on web-archiving, part 3: Albany and web-archives as Institutional Records

In the last of this series of blog posts about IIPC 2017, I’ll look at the work of Gregory Wiedeman at the University at Albany, SUNY.

He is doing two things that are sure to gladden the heart of any archivist. First, he is treating his Institutional web archives as records that require permanent preservation. Secondly, he is attempting to apply traditional archival arrangement and description to his web captures. His experiences have taught him that neither one of these things is easy.

Firstly, I’m personally delighted to hear someone say that web archives might be records; I would agree. One reason I like it is that mainstream web-archiving seems to have evolved in favour of something akin to the “library” model – where a website is treated as a book, with a title, author and subject. For researchers, that might be a model more aligned to their understanding and methods. Not that I am against it; I am just making an observation.

I first heard Seamus Ross pose the question “can websites be records?” some 11 years ago, and I think it is a useful way of regarding certain classes of web content, and one I would encourage. When I worked on the JISC PoWR project, one of the assumptions I made was that a University would be using its website to store, promote or promulgate content directly relevant to its institutional mission and functions. In doing this, a University starts to generate born-digital records, whether as HTML pages or PDF attachments. What concerns me is the case where these are the only copies of such records. Yet quite often we find that the University archivist is not involved in their capture, storage, or preservation.

The problem becomes even more acute when we see how important record/archival material that was originally generated in paper form, such as a prospectus, is shifting over to the web. There may or may not be a clear cut-off point for when and how this happens. The archivist may notice that they aren’t receiving printed prospectuses any more. Who is the owner of the digital version, and how can we ensure the pages aren’t simply disposed of by the webmaster once they have expired? Later, at the RESAW conference, I heard a similar and even more extreme example of this unhappy scenario from Federico Nanni, describing his attempts to piece together the history of the University of Bologna website.

However, none of this has stopped Gregory Wiedeman from performing his duty of archival care. He is clear: Albany is a public university, subject to state records laws; therefore, certain records must be kept. He sees the continuity between the website and existing collections at the University, even to the point where web pages have their paper equivalents; he is aware of overlap between existing printed content and web content; he knows about embedded documents and PDFs on web pages; and is aware of interactive sites, which may create transient but important records through interactions.

In aligning University web resources with a records and archives policy, Wiedeman points out one significant obstacle: a seed URL, which is the basis of web capture in the first place, is not guaranteed to be a good fit for our existing practices. To put it another way, we may struggle to map an archived website or even individual pages from it to an archival Fonds or series, or to a record series with a defined retention schedule.

Nonetheless, Wiedeman has found that traditional archives theory and practice does adapt well to working with web archives, and he is addressing such key matters as retaining context, the context of attached documents, the relationship of one web page to another, and the history of records through a documented chain of custody – of which more below.

When it comes to describing web content, Wiedeman uses the American DACS standard, which is a subset of ISAD(G). Because DACS focuses on intellectual content rather than individual file formats, he has found it works for large-scale collections while still allowing granular access to them. His cataloguing tool is ArchivesSpace, which is DACS-compliant, and which is capable of handling aggregated record collections. The access component of ArchivesSpace is able to show relations between record collections, making context visible, and showing a clear link between the creating organisation and the web resources. Further, there are visible relations between web records and paper records, which suggests Wiedeman is on the way to addressing the hybrid archive conundrum faced by many. He does this, I suggest, by trusting to the truth of the archival Fonds, which continues to exert a natural order on the archives, in spite of the vagaries of website structures and their archived snapshots.

It’s in the fine detail of capture and crawling that Wiedeman is able to create records that demonstrate provenance and authenticity. He works with Archive-It to perform his web crawls; the process creates a range of technical metadata about the crawl itself (type of crawl, result, start and end dates, recurrence, extent), which can be saved and stored as a rather large JSON file. Wiedeman retains this, and treats it as a provenance record, which makes perfect sense; it contains hard (computer-science) facts about what happened. This JSON output might not be perfect, and at time of writing Albany don’t do more than retain it and store it; there remains developer work to be done on parsing and exposing the metadata to make it more useful.
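
As a minimal sketch of what making use of that record might look like (the file name and the field names below are hypothetical, not Archive-It’s actual schema), a few lines of Python could pull a small provenance summary out of the stored JSON:

    import json

    # Load the crawl report JSON saved alongside the harvest.
    # "crawl_report.json" and the field names below are hypothetical;
    # the real Archive-It output will differ and should be checked.
    with open("crawl_report.json") as f:
        report = json.load(f)

    # Pull out the kinds of facts Wiedeman treats as provenance:
    # what sort of crawl it was, when it ran, and how much it collected.
    provenance = {
        "crawl_type": report.get("type"),
        "result": report.get("status"),
        "start": report.get("start_date"),
        "end": report.get("end_date"),
        "recurrence": report.get("frequency"),
        "extent_bytes": report.get("bytes_collected"),
    }

    # Store the summary next to the full JSON, which is kept untouched
    # as the authoritative provenance record.
    with open("crawl_provenance_summary.json", "w") as f:
        json.dump(provenance, f, indent=2)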

Linked to this, he maintains what I think is a stand-alone record documenting his own selection decisions as to the depth and range of the crawl; this is linked to the published collection development policy. Archivists need to be transparent about their decisions, and they should document their actions; users need to know this in order to make any sense of the web data. None of these concepts is new to the traditional archivist, but this is the first time I have heard the ideas so well articulated in this context, and applied so effectively to collections management of digital content.

Gregory’s work is described at https://github.com/UAlbanyArchives/describingWebArchives

Conference on web-archiving, part 2: Ian Milligan and his derived datasets

Web-archiving is a relatively young side of digital archiving, yet it has already established a formidable body of content across the world, a large corpus of materials that could be mined by researchers and historians to uncover interesting trends and patterns about the 20th and 21st centuries. One advocate and enthusiast for this is Ian Milligan from University of Waterloo’s Faculty of Arts.

His home country has many excellent web archive collections, but he feels they are under-used by scholars. One problem is that scholars might not even know the web archives exist in the first place. The second is that many people find web archives really hard to use: quite often, the search engines which interrogate the corpus don’t really match the way a scholar wishes to retrieve information. At a very simple practical level, a search can return too many hits, the hitlist appears to be unsorted, and the results are difficult to understand.

Milligan is personally concerned about the barriers facing academics, and he’s actively working to lower them, devising ways of serving historic web archives that don’t require massive expertise. His project Web Archives for Longitudinal Knowledge (WALK) aims to create a centralised portal for access to web content. The main difference from most such approaches I’ve seen is that he does it by building derived datasets.

As I understand it, a derived dataset is a new assemblage of data that’s been created out of a web archive. To put this in context, it might help to understand that the basic building block of web-archiving is a file called a WARC. A WARC is a container file format whose contents are effectively a record of the harvesting session: the links visited, the responses received, and the harvested web content itself. If you wanted to replay the WARC so that it looks like the original website, you’d feed it to an instance of the Wayback Machine, which is programmed to read the WARC and serve it back as a rendition of the original web pages.
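
For readers who have never looked inside one, here is a minimal sketch of reading a WARC with the Python warcio library (assuming warcio is installed, and with “example.warc.gz” standing in for a real capture file):

    from warcio.archiveiterator import ArchiveIterator

    # Walk through every record in a (possibly gzipped) WARC file.
    # "example.warc.gz" is a placeholder for a real capture.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            # 'response' records hold the harvested content; other types
            # (request, metadata, warcinfo) document the session itself.
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                date = record.rec_headers.get_header("WARC-Date")
                body = record.content_stream().read()
                print(uri, date, len(body), "bytes")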

However, Milligan is more interested in parsing WARCs. He knows they contain very useful strings of data, and he’s been working for some time on tools to do the parsing. He’s interested in text strings, dates, URLs, embedded keywords and names, and more. One such tool is Warcbase, part of this WALK project. Broadly, the process is that he would transfer data from a web archive in WARC form, and use Warcbase to create scholarly derivatives from that WARC automatically. When the results are uploaded to the Dataverse platform, the scholar now has a much more user-friendly web-archive dataset in their hands. The process is probably far more elaborate than I am making it sound, but all I know is that simple text searches are now much more rewarding and focussed; and by using a graphical interface, it’s possible to build visualisations out of data.
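
I don’t know the internals of Warcbase, but as a hedged illustration of the general idea, turning WARC response records into a table a scholar can actually search, the following sketch (again using warcio, with a deliberately crude tag-stripper standing in for proper text extraction) would produce a simple CSV derivative:

    import csv
    import re
    from warcio.archiveiterator import ArchiveIterator

    def crude_text(html_bytes):
        # Very rough text extraction: strip tags and collapse whitespace.
        # A real derivative pipeline would use a proper HTML parser.
        text = re.sub(rb"<[^>]+>", b" ", html_bytes)
        return re.sub(rb"\s+", b" ", text).decode("utf-8", "replace").strip()

    # Write one row per harvested HTML response: when it was crawled,
    # where it came from, and the plain text a scholar might search.
    with open("example.warc.gz", "rb") as stream, \
         open("derived_dataset.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["crawl_date", "url", "text"])
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            headers = record.http_headers
            ctype = (headers.get_header("Content-Type") if headers else "") or ""
            if "text/html" not in ctype:
                continue
            writer.writerow([
                record.rec_headers.get_header("WARC-Date"),
                record.rec_headers.get_header("WARC-Target-URI"),
                crude_text(record.content_stream().read()),
            ])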

A few archival-ish observations occur to me about this.

  • What about provenance and original order? Doesn’t this derived dataset damage these fundamental “truths” of the web crawl? Well, let’s remember that the derived dataset is a new thing, another digital object; the “archival original” WARC file remains intact. If there’s any provenance information about the date and place of the actual website and the date of the crawl, that won’t be damaged in any way. If we want paper analogues, we might call this derived dataset a photocopy of the original; or perhaps it’s more like a scrapbook, if it’s created from many sources.
  • That made me wonder if the derived dataset could be considered a Dissemination Information Package in OAIS Reference Model terms, with the parent WARC or WARCs in the role of the Archival Information Package. I’d better leave it at that; the terms “OAIS conformance” and “web-archiving” don’t often appear in the same sentence in our community.
  • It seems to me rather that what Milligan is doing is exploiting the versatility of structured data. If websites are structured, and WARCs are structured, why not turn that to our advantage and see if we can create new structures? If it makes the content more accessible to otherwise alienated users, then I’m all for it. Instead of having to mine a gigantic quarry of hard granite, we have manageable building blocks of information carved out of that quarry, which can be used as needed.
  • The other question that crossed my mind is “how are Ian and his team deciding what information to put in these derivatives?” He did allude to the fact that they are “doing something they think Humanities Scholars would like”, and since he himself is a Humanities scholar, he has a good starting point. Scholars hate a WARC, which after all isn’t much more than raw data generated by a web crawl, but they do like data arranged in a CSV file, and text searches with meaningful results.
  • To play devil’s advocate, I suspect that a traditional archivist would recoil from any approach which appears to smack of bias; our job has usually been to serve the information in the most objective way possible, and the actions of arrangement and cataloguing are intended to preserve the truth of original order of the resource, and to help the user with neutral finding aids that steer them through the collection. If we do the work of creating a derived dataset, making decisions in advance about date ranges, domains, and subjects, aren’t we somehow pre-empting the research?

This may open up bigger questions than can be addressed in this blog post, and in any case I may have misunderstood Milligan’s method and intention, but it may have implications for the archive profession and how we process and work with digital content on this scale.

Conference on web-archiving: reconciling two curation methods

One of the first things I did in digital preservation was a long-term web-archiving project, so I have long felt quite close to the subject. I was very pleased to attend this year’s IIPC conference at Senate House in London, which combined to great effect with the RESAW conference, ensuring wide coverage and maximum audience satisfaction in the papers and presentations.

In this short series of blog posts, I want to look at some of the interesting topics that reflect some of my own priorities and concerns as an archivist. I will attempt to draw out the wider lessons as they apply to information management generally, and readers may find something of interest that puts another slant on our orthodox notions of collection, arrangement, and cataloguing.

Government publications at the BL

Andy Jackson at the British Library is facing an interesting challenge as he attempts to build a technical infrastructure to accommodate a new and exciting approach to collections management.

The British Library has traditionally had custodial care of official Government papers. They’ve always collected them in paper form, but more recently two separate curation strands have emerged.

The first has been through web-archiving, where as part of the domain-wide crawls and targeted collection crawls, the BL has harvested entire government websites into the UK Web Archive. These harvests can include the official publications in the form of attached PDFs or born-digital documents.

The second strand involves the more conventional route followed by the curators who add to The Catalogue, i.e. the official BL union catalogue. It’s less automated, but more intensive on the quality control side; it involves manual selection, download, and cataloguing of the publication to MARC standards.

Currently, public access to the UK Web Archive and public access to The Catalogue are two different things. My understanding is that the BL are aiming to streamline this into a single collection discovery point, enabling end users to access digital content regardless of where it came from, or how it was catalogued.

Two curation methods

Andy’s challenges include the following:

  • The two curation methods involve thinking about digital content in quite different ways. The first one is more automated, and allows the possibility of data reprocessing. The second one has its roots in a physical production line, with clearly defined start and end points.
  • Because of its roots in the physical world, the second method has a form of workflow management which is closely linked to the results in the catalogue itself. It seems there are elements in the database which indicate sign-off and completion of a particular stage of the work. Web crawling, conversely, resembles an ongoing process, and the cut-off point for completion (if indeed there is one) is harder to identify.
  • There is known to be some duplication taking place, duplication of effort and of content; to put it another way, PDFs known to be in the web archive are also being manually uploaded to the catalogue.

In response to this, Andy has been commissioned to build an over-arching “transformation layer” model that encompasses these strands of work. It’s difficult because there’s a need to get away from a traditional workflow, there are serious syncing issues, and the sheer volume of content is considerable.

I’m sure the issues of duplication will resonate with most readers of this blog, but there are also interesting questions about reconciling traditional cataloguing with new ways of gathering and understanding digital content. One dimension to Andy’s work is the opportunity for sourcing descriptive metadata from outside the process; he makes use of external Government catalogues to find definitive entries for the documents he finds on the web pages in PDF form, and is able to combine this information in the process. What evidently appeals to him is the use of automation to save work.

Andy has posted his talks here and here.

How archival is that?

My view as an archivist (and not a librarian) would involve questions such as:

  • Is MARC cataloguing really suitable for this work? Which isn’t meant as a challenge to professional librarians – I’d level the same question at ISAD(G) too, which is a standard with many deficiencies when it comes to describing digital content adequately. On the other hand, end-users know and love MARC, and are still evidently wedded to accessing content in a subject-title-author based manner.
  • The issue of potential duplication bothers me as (a) it’s wasteful and (b) it increases ambiguity as to which one of several copies is the correct one. I’m also interested, as an archivist, in context and provenance; it could be there is additional valuable contextual information stored in the HTML of the web page, and embedded in the PDF properties; neither of these is guaranteed to be found, or catalogued, by the MARC method (I sketch a simple illustration of both checks after this list). But this raises the question, which Andy is well aware of: “what constitutes a publication?”
  • I can see how traditional cataloguers, including my fellow archivists, might find it hard to grasp the value of “reprocessing” in this context. Indeed it might even seem to cast doubts on the integrity of a web harvest if there’s all this indexing and re-indexing taking place on a digital resource. I would encourage any doubters to try and see it as a process not unlike “metadata enrichment”, a practice which is gaining ground as we try to archive more digital material; we simply can’t get it right first time, and it’s within the rules to keep adding metadata (be it descriptive / technical, hand-written or automated) as our understanding of the resource deepens, and the tools we can use keep improving.
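
Picking up the duplication and embedded-metadata points above, here is a minimal, hedged sketch of both checks: hashing a harvested PDF so that copies can be matched across the web archive and the catalogue workflow, and reading whatever descriptive properties the PDF itself carries. It assumes the pypdf library and is an illustration only, not the BL’s actual process.

    import hashlib
    from pypdf import PdfReader  # assumes pypdf is installed

    def fingerprint(path):
        # A checksum lets two workflows agree they hold the same bytes,
        # which is the simplest possible duplication check.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def embedded_properties(path):
        # Whatever descriptive metadata the publisher embedded in the PDF;
        # none of these fields are guaranteed to be present.
        info = PdfReader(path).metadata
        if info is None:
            return {}
        return {
            "title": info.title,
            "author": info.author,
            "created": info.creation_date,
        }

    pdf = "harvested_publication.pdf"  # placeholder path
    print(fingerprint(pdf))
    print(embedded_properties(pdf))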

Keep an eye out for the next blog post in this mini-series. 

Foiled by an implementation bug

I recently attempted to web-archive an interesting website called Letters of Charlotte Mary Yonge. The creators had approached us for some preservation advice, as there was some danger of losing institutional support.

The site was built on a WordPress platform, with some functional enhancements undertaken by computer science students, to create a very useful and well-presented collection of correspondence transcripts of this influential Victorian woman writer; within the texts, important names, dates and places have been identified and are hyperlinked.

Since I’ve harvested many WordPress sites before with great success, I added the URL to Web Curator Tool, confident of a good result. However, right from the start I ran into problems. One concern was that the harvest was taking many hours to complete, which seemed unusual for a small text-based site with no large assets such as images or media attachments. One of my test harvests even went up to the 3 GB limit. As I often do in such cases, I terminated the harvests to examine the log files and folder structures of what had been collected up to that point.

This revealed that a number of page requests were showing a disproportionately large size, some of them collecting over 40 MB for one page – odd, considering that the average size of a gathered page in the rest of the site was less than 50 KB. When I tried to open these 40 MB pages in the Web Curator Tool viewer, they failed badly, often yielding an Apache Tomcat error report and not rendering any viewable text at all.

These pages weren’t actually static pages as such – it might be more accurate to call them responses to a query. A typical query was

http://www.yongeletters.com/letters-1850-1859?year_id=1850

a simple script that would display all letters tagged with a year value of 1850. Again, I’ve encountered such queries in my web-archiving activities before, and they don’t usually present problems like this one.

I decided to investigate this link’s behaviour, and that of others like it, on the live site. The page is supposed to load a short, paginated list of links to other pages. Instead it loads the same request on the page multiple times, ad infinitum. The code is actually looping, endlessly returning the result “Letters 1 to 10 of 11”, and will never complete its task.

When the web harvester Heritrix encounters this behaviour on the live site, it is likewise sent into a loop of requests that can never complete. This is what caused the 40 MB “page bloat” for these requests.

We have two options for web-archiving in this instance; neither one is satisfactory.

  • Remove the 3 GB system limit and let the harvester keep running. However, as my aborted harvests suggested, it would probably keep running forever, and the results still would not produce readable (or useful) pages.
  • Using exclusion filters, filter out links such as the one above (see the illustrative filter after this list). The problem with that approach is that the harvester misses a large amount of the very content it is supposed to be collecting, and the archived version is then practically useless as a resource. To be precise, it would collect the pages with the actual transcribed letters, but the method of navigating the collection by date would fail. Since the live site only offers navigation using the dated Letter Collection links, the archived version would remain inaccessible.
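
For illustration, and following Web Curator Tool’s .* exclusion-filter convention, the second option might come down to a single pattern along these lines (untested, and dependent on the site’s real URL structure):

    .*year_id=.*

Which is precisely the problem: that one filter removes the only route into the letters that the site’s navigation offers.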

This is, therefore, an example of a situation where a web site is effectively un-archivable, as it never completes executing its scripts and potentially ties the harvester up forever. The only sensible solution is for the website owners to fix and test their code (which, arguably, they should have done when developing it). Until then, a valuable resource, and all the labour that went into it, will continue to be at risk of oblivion.

BlogForever: Preservation in BlogForever – an alternative view

From the BlogForever project blog

I’d like to propose an alternative digital preservation view for the BF partners to consider.

The preservation problem is undoubtedly going to look complicated if we concentrate on the live blogosphere. It’s an environment that is full of complex behaviours and mixed content. Capturing it and replaying it presents many challenges.

But what type of content is going into the BF repository? Not the live blogosphere. What’s going in is material generated by the spider: it’s no longer the live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BF system. If you like, the spider creates a “rendition” of the live web, recast into the form of a structured XML file.

What I propose is that these renditions of blogs should become the target of preservation. This way, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce.

If these blog renditions are preservable, then the preservation performance we would like to replicate is the behaviour of the Invenio database, and not live web behaviour. All the preservation strategy needs to do is to guarantee that our normalised objects, and the database itself, conform to the performance model.

When I say “normalised”, I mean the crawled blogs that will be recast in XML. As I’ve suggested previously, XML is already known to be a robust preservation format. We anticipate that all the non-XML content is going to be images, stylesheets, multimedia, and attachments. Preservation strategies for this type of content are already well understood in the digital preservation world, and we can adapt them.

There is already a strand of the project that is concerned with migration of the database, to ensure future access and replay on applications and platforms of the future. This in itself could feasibly form the basis of the long-term preservation strategy.

The preservation promise in our case should not be to recreate the live web, but rather to recreate the contents of the BF repository and to replicate the behaviour of the BF database. After all, that is the real value of what the project is offering: searchability, retrievability, and creating structure (parsed XML files) where there is little or no structure (the live blogosphere).

Likewise it’s important that the original order and arrangement of the blogs be supported. I would anticipate that this will be one of the possible views of the harvested content. If it’s possible for an Invenio database query to “rebuild” a blog in its original order, that would be a test of whether preservation has succeeded.

As to PREMIS metadata: in this alternative scenario the live data in the database and the preserved data are one and the same thing. In theory, we should be able to manipulate the database to devise a PREMIS “view” of the data, with any additional fields needed to record our preservation actions on the files.

In short, I wonder whether the project is really doing “web archiving” at all? And does it matter if we aren’t?

In summary I would suggest:

  • We consider the target of preservation to be crawled blogs which have been transformed into parsed XML (I anticipate that this would not invalidate the data model).
  • We regard the spidering action as a form of “normalisation” which is an important step to transforming unmanaged blog content into a preservable package.
  • Following the performance model proposed by the National Archives of Australia, we declare that the performance we wish to replicate is that of the normalised files in the Invenio database, rather than the behaviours of individual blogs. This approach potentially makes it simpler to define “significant properties”; instead of trying to define the significant properties of millions of blogs and their objects, we could concentrate on the significant properties of our normalised files, and of Invenio.

BlogForever: BlogForever and migration

From the BlogForever project blog

Recently I have been putting together my report on the extent to which the BlogForever platform operates within the framework of the OAIS model. Inevitably, I have thought a bit about migration as one of the potential approaches we could use to preserve blog content.

Migration is the process whereby we preserve data by shifting it from one file format to another. We usually do this when the “old” format is in danger of obsolescence for a variety of reasons, while the “target” format is something we think we can depend on for a longer period of time. This strategy works well for relatively static document-like content, such as format-shifting a text file to PDF.

The problem with blogs, and indeed all web content, is when we start thinking of the content exclusively in terms of file formats. The content of a blog could be said to reside in multiple formats, not just one; and even if we format-shift all the files we gather, does that really constitute preservation?

With BlogForever, we’re going for an approach to capture and ingest which seems to have two discrete strands to it.

(1) We will be gathering and keeping the content in its “original” native formats, such as HTML, images files, CSS etc. At time of writing, the current plan is that we will have a repository record for each ingested blog post and all its associated files (original images, CSS, PDF, etc.) will be connected with this record. These separate files will be preserved and presumably migrated over time, if some of these native formats acquire “at risk” status.

(2) We are also going to create an XML file (complete with all detected Blog Data Model elements) from each blog post we are aggregating. What interests me here is that in this strand, an archived blog is being captured and submitted as a stream of data, rather than a file format. It so happens the format for storing that data-stream is going to be XML. The CyberWatcher spider is capable of harvesting blog content by harnessing the RSS feed from a blog, and by using blog-specific monitoring technologies like blog pings; and it also performs a complex parsing of the data it finds. The end result is a large chunk of “live” blog content, stored in an XML file.
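
As a rough illustration of that data-stream idea (this is not the CyberWatcher spider, just a sketch using the feedparser library and Python’s standard XML tools, with illustrative element names rather than the real Blog Data Model), harvesting a blog’s RSS feed and recasting each post as a small XML rendition might look like this:

    import xml.etree.ElementTree as ET
    import feedparser  # assumes feedparser is installed

    # Fetch and parse the blog's RSS/Atom feed; the URL is a placeholder.
    feed = feedparser.parse("https://example.org/blog/feed/")

    # Recast each post as a simple XML rendition. The element names here
    # are illustrative only; BlogForever's real Blog Data Model is richer.
    root = ET.Element("blog", attrib={"source": feed.feed.get("link", "")})
    for entry in feed.entries:
        post = ET.SubElement(root, "post")
        ET.SubElement(post, "title").text = entry.get("title", "")
        ET.SubElement(post, "url").text = entry.get("link", "")
        ET.SubElement(post, "published").text = entry.get("published", "")
        ET.SubElement(post, "content").text = entry.get("summary", "")

    ET.ElementTree(root).write("blog_rendition.xml", encoding="utf-8",
                               xml_declaration=True)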

Two things are of interest here. One is that the spider is already performing a form of migration, or transformation, simply by the action of harvesting the blog. Secondly, it’s migrating to XML, which is something we already know to be a very robust and versatile preservation format, more so even than a non-proprietary tabular format such as CSV. The added value of XML is the possibility of easily storing more complex data structures and multiple values.

If that assumption about the spider is correct, perhaps we need to start thinking of it as a transformation / validation tool. The more familiar digital preservation workflow assumes that migration will probably happen some time after the content has been ingested; what if migration is happening before ingest? We’re already actively considering the use of the preservation metadata standard PREMIS to document our preservation actions. Maybe the first place to use PREMIS is on the spider itself, picking up some technical metadata and logs on the way the spider is performing. Indeed, some of the D4.1 user requirements refer to this: DR6 ‘Metadata for captured Contents’ and DR17 ‘Metadata for Blogs’.
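
To make that concrete in a hedged way, a capture by the spider could be documented as a PREMIS event. The sketch below builds a minimal event record with Python’s standard library; the element names follow my reading of the PREMIS version 3 event entity and would need checking against the current schema.

    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    PREMIS = "http://www.loc.gov/premis/v3"
    ET.register_namespace("premis", PREMIS)

    def q(tag):
        # Qualify a tag with the PREMIS namespace.
        return f"{{{PREMIS}}}{tag}"

    # A minimal "capture" event for one harvested blog post.
    event = ET.Element(q("event"))

    ident = ET.SubElement(event, q("eventIdentifier"))
    ET.SubElement(ident, q("eventIdentifierType")).text = "local"
    ET.SubElement(ident, q("eventIdentifierValue")).text = "capture-0001"

    ET.SubElement(event, q("eventType")).text = "capture"
    ET.SubElement(event, q("eventDateTime")).text = \
        datetime.now(timezone.utc).isoformat()

    detail = ET.SubElement(event, q("eventDetailInformation"))
    ET.SubElement(detail, q("eventDetail")).text = \
        "Blog post harvested via RSS by the spider and recast as XML"

    ET.ElementTree(event).write("capture_event.xml", encoding="utf-8",
                                xml_declaration=True)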

We anticipate the submitted XML is going to be further transformed in the Invenio repository via its databases, and various metadata additions and modifications will transform it from a Submission Information Package into an Archival Information Package and a Dissemination Information Package. As far as I can see though, the XML format remains in use throughout these processes. It feels as though the BlogForever workflow could have a credible preservation process hard-wired into it, and that (apart from making Archival Information Packages, backing-up and keeping the databases free from corruption) very little is needed from us in the way of migration interventions.

It also feels as though it would be much easier to test this methodology; the focus of the testing becomes the spider>XML>repository>database workflow, rather than a question of juggling multiple strategies and testing them against file formats and/or significant properties. Of course, migration would still need to apply to the original native file formats we have captured, and this would probably need to be part of our preservation strategy. But it’s the XML renditions which most users of BlogForever will be experiencing.

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library LIFE and LIFE2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out if you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resources in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led me to make a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is that UKWAC users, the target audience for the archived copy – or the Designated Community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.

This selection aspect led to Maureen Pennock’s reply, which is a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.
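
When the technology moment does come, Web Curator Tool’s .* exclusion filters are one obvious way to implement a decision like the one above. Purely for illustration (the exact patterns depend on how the MediaWiki instance builds its URLs, and they would need testing against a real gather), filters excluding edit forms, page histories, old revisions, discussion pages and the automatically generated special pages might look something like this:

    .*action=edit.*
    .*action=history.*
    .*oldid=.*
    .*Talk:.*
    .*Special:.*

Everything else, namely the finished wiki pages themselves, would still be gathered.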

Working with Web Curator Tool (part 1)

Keen readers may recall a post from April 2008 about my website-archiving forays with Web Curator Tool, the workflow tool used for programming Heritrix, the crawler which does the actual harvesting of websites.

Other UKWAC partners and I have since found that Heritrix sometimes has a problem, described by some as ‘collateral harvesting’. This means it can gather links, pages, resources, images, files and so forth from websites we don’t actually want to include in the finished archived item.

Often this problem is negligible, resulting in a few extra KB of pages from adobe.com or google.com, for example. Sometimes, though, it can result in large amounts of extraneous material, amounting to several MB or even GB of digital content (for example, if the crawler somehow finds a website full of .avi files).

I have probably become overly preoccupied with this issue, since I don’t want to increase the overheads of our sponsor, JISC, by occupying their share of the server space with unnecessarily bloated gathers, nor to clutter up the shared bandwidth by spending hours gathering pages unnecessarily.

Web Curator Tool allows us two options for dealing with collateral harvesting. One of them is to use the Prune Tool on the harvested site after the gather has run. The Prune Tool allows you to browse the gather’s tree structure, and to delete a single file or an entire folder full of files which you don’t want.

The other option is to apply exclusion filters to the title before the gather runs. This can be much more effective: you enter a little bit of code in the ‘Exclude Filters’ box of a title’s profile. The basic principle is using the pattern .* for exclusions: .*www.aes.org.* will exclude that entire website from the gather, and .*/images/.* will exclude any path containing a folder named ‘images’.

So far I generally find myself making two types of exclusion:

(a) Exclusions of websites we don’t want. As noted with collateral harvesting, Heritrix is following external links from the target a little too enthusiastically. It’s easy to identify these sites with the Tree View feature in WCT. This view also lets you know the size of the folder that has resulted from the external gathering. This has helped me make decisions; I tend to target those folders where the size is 1MB or larger.

(b) Exclusions of certain pages or folders within the Target which we don’t want. This is where it gets slightly trickier, and we start to look in the log files of client-server requests for instances where the harvester is staying within the target, but performing actions like requesting the same page over and over. This can happen with database-driven sites, CMS sites, wikis, and blogs.
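
For that kind of log work, a small script can help spot looping requests. The sketch below assumes a Heritrix-style crawl log in which the requested URI is the fourth whitespace-separated field; that assumption should be checked against your own log files.

    from collections import Counter

    counts = Counter()

    # Tally how often each URI was requested during the crawl.
    # "crawl.log" is a placeholder; adjust the field index if your
    # log layout differs.
    with open("crawl.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) > 3:
                counts[fields[3]] += 1

    # URIs fetched suspiciously often are good candidates for
    # investigation and, possibly, exclusion filters.
    for uri, n in counts.most_common(20):
        if n > 1:
            print(n, uri)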

I believe I may have had a ‘breakthrough’ of sorts in managing collateral harvesting with at least one brand of wiki, and will report on this in my next post.