Conference on web-archiving, part 3: Albany and web-archives as Institutional Records

In the last of this series of blog posts about IIPC 2017, I’ll look at the work of Gregory Wiedeman at the University at Albany, SUNY.

He is doing two things that are sure to gladden the heart of any archivist. First, he is treating his institutional web archives as records that require permanent preservation. Second, he is attempting to apply traditional archival arrangement and description to his web captures. His experience has taught him that neither of these things is easy.

On the first point, I’m personally delighted to hear someone say that web archives might be records; I would agree. One reason I like it is that mainstream web-archiving seems to have evolved in favour of something akin to the “library” model, where a website is treated as a book, with a title, author and subject. For researchers, that may be the model most closely aligned with their understanding and methods. Not that I am against it; I am just making an observation.

I first heard Seamus Ross pose the question “can websites be records?” some 11 years ago, and I think it is a useful way of regarding certain classes of web content, one I would encourage. When I worked on the Jisc PoWR project, one of the assumptions I made was that a University would be using its website to store, promote or promulgate content directly relevant to its institutional mission and functions. In doing this, a University starts to generate born-digital records, whether as HTML pages or PDF attachments. What concerns me is when these are the only copies of such records. Yet quite often we find that the University archivist is not involved in their capture, storage, or preservation.

The problem becomes even more poignant when we see how important record or archival material that was originally generated in paper form, such as a prospectus, is shifting over to the web. There may or may not be a clear cut-off point for when and how this happens. The archivist may simply notice that they aren’t receiving printed prospectuses any more. Who owns the digital version, and how can we ensure the pages aren’t simply disposed of by the webmaster once they have expired? Later, at the RESAW conference, I heard a similar and even more extreme example of this unhappy scenario from Federico Nanni and his attempts to piece together the history of the University of Bologna website.

However, none of this has stopped Gregory Wiedeman from performing his duty of archival care. He is clear: Albany is a public university, subject to state records laws; therefore, certain records must be kept. He sees the continuity between the website and existing collections at the University, even to the point where web pages have their paper equivalents; he is aware of overlap between existing printed content and web content; he knows about embedded documents and PDFs on web pages; and is aware of interactive sites, which may create transient but important records through interactions.

In aligning University web resources with a records and archives policy, Wiedeman points out one significant obstacle: a seed URL, which is the basis of web capture in the first place, is not guaranteed to be a good fit for our existing practices. To put it another way, we may struggle to map an archived website or even individual pages from it to an archival Fonds or series, or to a record series with a defined retention schedule.

Nonetheless, Wiedeman has found that traditional archival theory and practice adapt well to working with web archives, and he is addressing such key matters as retaining context, the context of attached documents, the relationship of one web page to another, and the history of records through a documented chain of custody – of which more below.

When it comes to describing web content, Wiedeman uses the American DACS standard, which is based on ISAD(G). With its focus on intellectual content rather than individual file formats, he has found that it works both for large-scale collections and for granular access to them. His cataloguing tool is ArchivesSpace, which is DACS-compliant and capable of handling aggregated record collections. The access component of ArchivesSpace is able to show relations between record collections, making context visible and showing a clear link between the creating organisation and the web resources. Further, there are visible relations between web records and paper records, which suggests Wiedeman is on the way to addressing the hybrid-archive conundrum faced by many. He does this, I suggest, by trusting to the truth of the archival Fonds, which continues to exert a natural order on the archives, in spite of the vagaries of website structures and their archived snapshots.

It’s in the fine detail of capture and crawling that Wiedeman is able to create records that demonstrate provenance and authenticity. He works with Archive-It to perform his web crawls; the process creates a range of technical metadata about the crawl itself (type of crawl, result, start and end dates, recurrence, extent), which can be saved and stored as a rather large JSON file. Wiedeman retains this and treats it as a provenance record, which makes perfect sense: it contains hard (computer-science) facts about what happened. This JSON output might not be perfect, and at the time of writing Albany do no more than retain and store it; there remains developer work to be done on parsing and exposing the metadata to make it more useful.
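To give a flavour of what that parsing might eventually look like, here is a minimal sketch in Python. The field names (“id”, “start_date”, “seeds” and so on) are my own illustrative assumptions, not the actual schema of the Archive-It report, so please treat it as an outline of the idea rather than documentation of anyone’s working system.

```python
import json
from pathlib import Path

# Illustrative sketch: pull a few provenance facts out of a saved crawl report.
# The field names below are assumptions for the purposes of the example;
# the real Archive-It JSON will have its own schema.
def summarise_crawl(report_path):
    report = json.loads(Path(report_path).read_text())
    return {
        "crawl_id": report.get("id"),
        "started": report.get("start_date"),
        "finished": report.get("end_date"),
        "status": report.get("status"),
        "seed_count": len(report.get("seeds", [])),
    }

if __name__ == "__main__":
    for key, value in summarise_crawl("crawl_report.json").items():
        print(f"{key}: {value}")
```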

Linked to this, he maintains what I think is a stand-alone record documenting his own selection decisions as to the depth and range of the crawl; this is linked to the published collection development policy. Archivists need to be transparent about their decisions, and they should document their actions; users need to know this in order to make any sense of the web data. None of these concepts is new to the traditional archivist, but this is the first time I have heard the ideas so well articulated in this context, and applied so effectively to the collections management of digital content.

Gregory’s work is described at https://github.com/UAlbanyArchives/describingWebArchives

Conference on web-archiving, part 2: Ian Milligan and his derived datasets

Web-archiving is a relatively young branch of digital archiving, yet it has already established a formidable body of content across the world: a large corpus of materials that could be mined by researchers and historians to uncover interesting trends and patterns about the 20th and 21st centuries. One advocate and enthusiast for this is Ian Milligan of the University of Waterloo’s Faculty of Arts.

His home country has many excellent web archive collections, but he feels they are under-used by scholars. One problem is that scholars might not even know the web archives exist in the first place. A second is that many people find web archives genuinely hard to use: the search engines that interrogate the corpus often don’t match the way a scholar wishes to retrieve information. At a very simple practical level, a search can return too many hits, the hit list appears to be unsorted, and the results are difficult to understand.

Milligan is personally concerned about the barriers facing academics, and he’s actively working to lower them, devising a way of serving historic web archives that doesn’t require massive expertise. His project, Web Archives for Longitudinal Knowledge (WALK), aims to create a centralised portal for access to web content. The main difference from most such approaches I’ve seen is that he does this by building derived datasets.

As I understand it, a derived dataset is a new assemblage of data created out of a web archive. To put this in context, it might help to know that the basic building block of web-archiving is a file called a WARC. A WARC is a container file format whose contents effectively record the harvesting session: the links visited, the server responses, and a representation of the captured web content. If you wanted to replay the WARC so that it looks like the original website, you’d feed it to an instance of the Wayback Machine, which is programmed to read the WARC and serve it back as a rendition of the original web pages.
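To make that structure a little more concrete, here is a short Python sketch using the open-source warcio library to walk through a WARC and list what was captured. This isn’t part of Milligan’s toolchain; it’s simply one way of peering inside the file, and it assumes a local file called example.warc.gz.

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Walk a (possibly gzipped) WARC file and list the captured responses.
# Each "response" record carries the target URI, the capture date, and the
# payload that a replay tool such as the Wayback Machine would render.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            uri = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            mime = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            print(date, mime, uri)
```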

However, Milligan is more interested in parsing WARCs. He knows they contain very useful strings of data, and he’s been working for some time on tools to do the parsing; he’s interested in text strings, dates, URLs, embedded keywords and names, and more. One such tool is Warcbase, which is part of the WALK project. Broadly, the process is that he transfers data from a web archive in WARC form and uses Warcbase to create scholarly derivatives from that WARC automatically. When the results are uploaded to the Dataverse platform, the scholar has a much more user-friendly web-archive dataset in their hands. The process is probably far more elaborate than I am making it sound, but all I know is that simple text searches become much more rewarding and focussed; and by using a graphical interface, it’s possible to build visualisations out of the data.
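Warcbase itself is a Scala and Spark toolkit, so the following is emphatically not Milligan’s code; it is my own small Python stand-in (warcio plus BeautifulSoup), intended only to illustrate what a “derived dataset” might be: the HTML pages in a WARC boiled down to a flat CSV of capture date, URL and visible text, which a scholar can open in familiar tools.

```python
import csv
from bs4 import BeautifulSoup                        # pip install beautifulsoup4
from warcio.archiveiterator import ArchiveIterator   # pip install warcio

# Illustrative only: reduce the HTML pages in a WARC to a flat CSV of
# (capture date, URL, visible text) - a crude stand-in for the kind of
# scholarly derivative that Warcbase / WALK produce.
with open("example.warc.gz", "rb") as stream, \
        open("derived_text.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["date", "url", "text"])
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        mime = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
        if "html" not in mime:
            continue
        text = BeautifulSoup(record.content_stream().read(), "html.parser").get_text(" ", strip=True)
        writer.writerow([
            record.rec_headers.get_header("WARC-Date"),
            record.rec_headers.get_header("WARC-Target-URI"),
            text[:1000],  # truncate so the CSV stays manageable
        ])
```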

A few archival-ish observations occur to me about this.

  • What about provenance and original order? Doesn’t this derived dataset damage these fundamental “truths” of the web crawl? Well, let’s remember that the derived dataset is a new thing, another digital object; the “archival original” WARC file remains intact. If there’s any provenance information about the date and place of the actual website and the date of the crawl, that won’t be damaged in any way. If we want paper analogues, we might call this derived dataset a photocopy of the original; or perhaps it’s more like a scrapbook, if it’s created from many sources.
  • That made me wonder if the derived dataset could be considered a Dissemination Information Package in OAIS Reference Model terms, with the parent WARC or WARCs in the role of the Archival Information Package. I’d better leave it at that; the terms “OAIS conformance” and “web-archiving” don’t often appear in the same sentence in our community.
  • It seems to me rather that what Milligan is doing is exploiting the versatility of structured data. If websites are structured, and WARCs are structured, why not turn that to our advantage and see if we can create new structures? If it makes the content more accessible to otherwise alienated users, then I’m all for it. Instead of having to mine a gigantic quarry of hard granite, we have manageable building blocks of information carved out of that quarry, which can be used as needed.
  • The other question that crossed my mind is “how are Ian and his team deciding what information to put in these derivatives?” He did allude to the fact that they are “doing something they think Humanities scholars would like”, and since he himself is a Humanities scholar, he has a good starting point. Scholars hate a WARC, which after all isn’t much more than raw data generated by a web crawl, but they do like data arranged in a CSV file, and text searches with meaningful results.
  • To play devil’s advocate, I suspect that a traditional archivist would recoil from any approach which appears to smack of bias; our job has usually been to serve the information in the most objective way possible, and the actions of arrangement and cataloguing are intended to preserve the truth of original order of the resource, and to help the user with neutral finding aids that steer them through the collection. If we do the work of creating a derived dataset, making decisions in advance about date ranges, domains, and subjects, aren’t we somehow pre-empting the research?

This may open up bigger questions than can be addressed in this blog post, and in any case I may have misunderstood Milligan’s method and intention, but it may have implications for the archive profession and how we process and work with digital content on this scale.

Conference on web-archiving: reconciling two curation methods

One of the first things I did in digital preservation was a long-term web-archiving project, so I have long felt quite close to the subject. I was very pleased to attend this year’s IIPC conference at Senate House in London, which combined to great effect with the RESAW conference, ensuring wide coverage and maximum audience satisfaction in the papers and presentations.

In this short series of blog posts, I want to look at some of the interesting topics that reflect some of my own priorities and concerns as an archivist. I will attempt to draw out the wider lessons as they apply to information management generally, and readers may find something of interest that puts another slant on our orthodox notions of collection, arrangement, and cataloguing.

Government publications at the BL

Andy Jackson at the British Library is facing an interesting challenge as he attempts to build a technical infrastructure to accommodate a new and exciting approach to collections management.

The British Library has traditionally had custodial care of official Government papers. They’ve always collected them in paper form, but more recently two separate curation strands have emerged.

The first has been through web-archiving: as part of the domain-wide crawls and targeted collection crawls, the BL has harvested entire government websites into the UK Web Archive. These harvests can include the official publications, in the form of attached PDFs or born-digital documents.

The second strand involves the more conventional route followed by the curators who add to The Catalogue, i.e. the official BL union catalogue. It’s less automated, but more intensive on the quality control side; it involves manual selection, download, and cataloguing of the publication to MARC standards.

Currently, public access to the UK Web Archive and to The Catalogue are two different things. My understanding is that the BL are aiming to streamline this into a single collection discovery point, enabling end users to access digital content regardless of where it came from or how it was catalogued.

Two curation methods

Andy’s challenges include the following:

  • The two curation methods involve thinking about digital content in quite different ways. The first is more automated, and allows the possibility of data reprocessing. The second has its roots in a physical production line, with clearly defined start and end points.
  • Because of its roots in the physical world, the second method has a form of workflow management which is closely linked to the results in the catalogue itself. It seems there are elements in the database which indicate sign-off and completion of a particular stage of the work. Web crawling, conversely, resembles a continual, ongoing process, and the cut-off point for completion (if indeed there is one) is harder to identify.
  • Some duplication is known to be taking place, of both effort and content; to put it another way, PDFs known to be in the web archive are also being manually uploaded to the catalogue (one simple way of detecting such overlaps is sketched below).
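To make that duplication point concrete, here is a minimal sketch of my own (not the BL’s process) for flagging PDFs that exist in both stores by comparing content checksums; the two directory names are invented for the purpose of the example.

```python
import hashlib
from pathlib import Path

# Illustrative sketch only (not BL code): flag PDFs that appear both in a
# local dump of web-archived documents and in a folder of manually
# catalogued uploads, by comparing SHA-256 checksums. Paths are invented.
def checksum(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

web_archive_pdfs = {checksum(p): p for p in Path("webarchive_pdfs").glob("*.pdf")}
catalogue_pdfs = {checksum(p): p for p in Path("catalogue_pdfs").glob("*.pdf")}

for digest, path in catalogue_pdfs.items():
    if digest in web_archive_pdfs:
        print(f"Duplicate content: {path} also held as {web_archive_pdfs[digest]}")
```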

In response to this, Andy has been commissioned to build an over-arching “transformation layer” model that encompasses both strands of work. It’s difficult because there’s a need to get away from a traditional workflow, there are serious synchronisation issues, and the sheer volume of the content is considerable.

I’m sure the issues of duplication will resonate with most readers of this blog, but there are also interesting questions about reconciling traditional cataloguing with new ways of gathering and understanding digital content. One dimension of Andy’s work is the opportunity to source descriptive metadata from outside the process: he makes use of external Government catalogues to find definitive entries for the documents he finds on web pages in PDF form, and is able to combine this information in the process. What evidently appeals to him is the use of automation to save work.
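I don’t know the detail of how that combination works in practice, but in outline it might look something like the sketch below, where an invented CSV export stands in for an external Government catalogue and harvested PDF URLs are matched against it. The column names and the matching rule are my assumptions, not the BL’s.

```python
import csv

# Purely illustrative: an invented CSV export stands in for an external
# Government catalogue; harvested PDF URLs are matched against it so that
# definitive titles and identifiers can be attached to the crawled copies.
def load_catalogue(path):
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["document_url"]: row for row in csv.DictReader(handle)}

def enrich(harvested_urls, catalogue):
    for url in harvested_urls:
        entry = catalogue.get(url, {})
        yield {
            "url": url,
            "title": entry.get("title"),
            "identifier": entry.get("identifier"),
        }

if __name__ == "__main__":
    catalogue = load_catalogue("gov_catalogue_export.csv")
    harvested = ["https://www.gov.uk/government/publications/example-report.pdf"]
    for record in enrich(harvested, catalogue):
        print(record)
```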

Andy has posted his talks here and here.

How archival is that?

My view as an archivist (and not a librarian) would involve questions such as:

  • Is MARC cataloguing really suitable for this work? This isn’t meant as a challenge to professional librarians – I’d level the same question at ISAD(G) too, which is a standard with many deficiencies when it comes to describing digital content adequately. On the other hand, end-users know and love MARC, and are still evidently wedded to accessing content in a subject-title-author manner.
  • The issue of potential duplication bothers me, as (a) it’s wasteful and (b) it increases ambiguity as to which of several copies is the correct one. I’m also interested, as an archivist, in context and provenance; there may be additional valuable contextual information stored in the HTML of the web page, and embedded in the PDF properties, and neither of these is guaranteed to be found, or catalogued, by the MARC method (see the sketch after this list). But this raises the question, of which Andy is well aware: “what constitutes a publication?”
  • I can see how traditional cataloguers, including my fellow archivists, might find it hard to grasp the value of “reprocessing” in this context. Indeed, it might even seem to cast doubt on the integrity of a web harvest if there’s all this indexing and re-indexing taking place on a digital resource. I would encourage any doubters to try to see it as a process not unlike “metadata enrichment”, a practice which is gaining ground as we try to archive more digital material; we simply can’t get it right first time, and it’s within the rules to keep adding metadata (be it descriptive or technical, hand-written or automated) as our understanding of the resource deepens and the tools we can use keep improving.
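On the contextual information mentioned above, here is a rough sketch of the kind of thing I mean, using the pypdf and BeautifulSoup libraries to pull descriptive properties out of a captured PDF and to recover the title and anchor text of the HTML page that linked to it. The file names are invented; this is an illustration of the idea, not anyone’s production workflow.

```python
from bs4 import BeautifulSoup   # pip install beautifulsoup4
from pypdf import PdfReader     # pip install pypdf

# Rough illustration only (file names invented): read the descriptive
# properties embedded in a captured PDF, then recover the page title and
# anchor text of the HTML page that linked to it.
reader = PdfReader("captured_publication.pdf")
info = reader.metadata
if info:
    print("PDF title:  ", info.title)
    print("PDF author: ", info.author)
    print("PDF subject:", info.subject)

with open("captured_page.html", encoding="utf-8") as handle:
    soup = BeautifulSoup(handle, "html.parser")

print("Page title: ", soup.title.get_text(strip=True) if soup.title else None)
for link in soup.find_all("a", href=True):
    if link["href"].endswith("captured_publication.pdf"):
        print("Anchor text:", link.get_text(strip=True))
```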

Keep an eye out for the next blog post in this mini-series.