Conference on web-archiving, part 3: Albany and web-archives as Institutional Records

In this last post in my series about IIPC 2017, I’ll look at the work of Gregory Wiedeman at the University at Albany, SUNY.

He is doing two things that are sure to gladden the heart of any archivist. First, he is treating his Institutional web archives as records that require permanent preservation. Second, he is attempting to apply traditional archival arrangement and description to his web captures. His experience has taught him that neither of these things is easy.

Firstly, I’m personally delighted to hear someone say that web archives might be records; I would agree. One reason I welcome it is that mainstream web-archiving seems to have evolved in favour of something akin to the “library” model – where a website is treated as a book, with a title, author and subject. For researchers, that may be a model better aligned with their understanding and methods. Not that I am against it; I am just making an observation.

I first heard Seamus Ross pose the question “can websites be records?” some 11 years ago, and I think it is a useful way of regarding certain classes of web content, one I would encourage. When I worked on the Jisc PoWR project, one of the assumptions I made was that a University would be using its website to store, promote or promulgate content directly relevant to its institutional mission and functions. In doing this, a University starts to generate born-digital records, whether as HTML pages or PDF attachments. My concern is with cases where these are the only copies of such records. Yet quite often we find that the University archivist is not involved in their capture, storage, or preservation.

The problem becomes even more acute when we see how important record/archival material that was originally generated in paper form, such as a prospectus, shifts over to the web. There may or may not be a clear cut-off point for when and how this happens. The archivist may simply notice that they aren’t receiving printed prospectuses any more. Who owns the digital version, and how can we ensure the pages aren’t simply disposed of by the webmaster once they expire? Later, at the RESAW conference, I heard a similar and even more extreme example of this unhappy scenario from Federico Nanni, in his attempts to piece together the history of the University of Bologna website.

However, none of this has stopped Gregory Wiedeman from performing his duty of archival care. He is clear: Albany is a public university, subject to state records laws; therefore, certain records must be kept. He sees the continuity between the website and existing collections at the University, even to the point where web pages have their paper equivalents; he is aware of overlap between existing printed content and web content; he knows about embedded documents and PDFs on web pages; and is aware of interactive sites, which may create transient but important records through interactions.

In aligning University web resources with a records and archives policy, Wiedeman points out one significant obstacle: a seed URL, which is the basis of web capture in the first place, is not guaranteed to be a good fit for our existing practices. To put it another way, we may struggle to map an archived website or even individual pages from it to an archival Fonds or series, or to a record series with a defined retention schedule.
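To make the mismatch concrete, here is a minimal sketch of a seed-to-series mapping. Everything in it – the URLs, series names and retention rules – is invented for illustration; it describes the problem, not Wiedeman’s actual practice.

```python
# Hypothetical sketch: seed URLs rarely line up one-to-one with an archival
# series or a retention schedule. All URLs, series names and retention rules
# below are invented for illustration.

SEED_TO_SERIES = {
    # one seed can feed several record series with different retention rules...
    "https://www.example.edu/registrar/": [
        {"series": "Academic Calendars", "retention": "permanent"},
        {"series": "Examination Timetables", "retention": "5 years"},
    ],
    # ...while another series may only partly live under a given seed
    "https://www.example.edu/governance/minutes/": [
        {"series": "Senate Minutes", "retention": "permanent"},
    ],
}

def series_for_url(url):
    """Return candidate record series for a captured URL by seed-prefix match."""
    candidates = []
    for seed, series_list in SEED_TO_SERIES.items():
        if url.startswith(seed):
            candidates.extend(series_list)
    return candidates  # an empty list is exactly the mapping gap described above

print(series_for_url("https://www.example.edu/registrar/timetables/2017.html"))
```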

Nonetheless, Wiedeman has found that traditional archives theory and practice does adapt well to working with web archives, and he is addressing such key matters as retaining context, the context of attached documents, the relationship of one web page to another, and the history of records through a documented chain of custody – of which more below.

When it comes to describing web content, Wiedeman uses the American DACS standard, which maps closely to ISAD(G). With its focus on intellectual content rather than individual file formats, he has found that it works both for large-scale collections and for granular access to them. His cataloguing tool is ArchivesSpace, which is DACS-compliant and capable of handling aggregated record collections. The public access component of ArchivesSpace can show relations between record collections, making context visible and showing a clear link between the creating organisation and the web resources. Further, there are visible relations between web records and paper records, which suggests Wiedeman is on the way to addressing the hybrid-archive conundrum faced by many. He does this, I suggest, by trusting to the truth of the archival Fonds, which continues to exert a natural order on the archives, in spite of the vagaries of website structures and their archived snapshots.
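For illustration, a collection-level, DACS-style description of an archived website might look something like the sketch below. The field names are my own paraphrase of DACS single-level elements, not the ArchivesSpace data model, and all the values are invented.

```python
import json

# Sketch of a collection-level, DACS-style description for an archived website.
# Field names are a paraphrase of DACS single-level elements, NOT the
# ArchivesSpace data model; all values are invented for illustration.
web_archive_description = {
    "reference_code": "UA-WEB-001",
    "repository": "University Archives (hypothetical)",
    "title": "University Website Records",
    "dates": "2012-2017 (dates of capture)",
    "extent": "approx. 40 GB of WARC files",
    "creator": "Office of Communications (hypothetical)",
    "scope_and_content": (
        "Periodic crawls of the public university website, including "
        "embedded PDFs and attached documents."
    ),
    "conditions_governing_access": "Open; delivered through a public wayback instance.",
    "related_materials": ["Printed prospectuses, 1965-2012 (paper series)"],
}

print(json.dumps(web_archive_description, indent=2))
```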

It’s in the fine detail of capture and crawling that Wiedeman is able to create records that demonstrate provenance and authenticity. He works with Archive-It to perform his web crawls; the process creates a range of technical metadata about the crawl itself (type of crawl, result, start and end dates, recurrence, extent), which can be saved and stored as a rather large JSON file. Wiedeman retains this and treats it as a provenance record, which makes perfect sense: it contains hard (computer-science) facts about what happened. This JSON output is not perfect, and at the time of writing Albany do no more than retain and store it; developer work remains to be done on parsing and exposing the metadata to make it more useful.
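As a rough sketch of where that developer work might head, the snippet below turns a retained crawl-metadata file into a one-line, human-readable provenance note. I don’t know the exact shape of the JSON Albany holds, so every field name here is an assumption.

```python
import json
from pathlib import Path

# Sketch only: turn a retained crawl-metadata file into a one-line provenance
# note. The field names ("type", "start_date", "end_date", "seed_count",
# "data_size_bytes") are assumptions; a real Archive-It export will differ,
# and mapping it properly is the outstanding developer work mentioned above.

def provenance_note(crawl_json_path):
    crawl = json.loads(Path(crawl_json_path).read_text())
    size_gb = crawl.get("data_size_bytes", 0) / 1e9
    return (
        f"Crawl type: {crawl.get('type', 'unknown')}; "
        f"ran {crawl.get('start_date', '?')} to {crawl.get('end_date', '?')}; "
        f"seeds: {crawl.get('seed_count', '?')}; extent: {size_gb:.2f} GB"
    )

# Usage (hypothetical file name):
# print(provenance_note("crawl-2017-03.json"))
```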

Linked to this, he maintains what I think is a stand-alone record documenting his own selection decisions, such as the depth and range of the crawl; this is linked to the published collection development policy. Archivists need to be transparent about their decisions, and they should document their actions; users need to know this in order to make any sense of the web data. None of these concepts is new to the traditional archivist, but this is the first time I have heard the ideas so well articulated in this context, and applied so effectively to collections management of digital content.
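A stand-alone selection record of this kind could be as simple as a small structured document. The sketch below is purely illustrative – the fields and values are my own, not Wiedeman’s – but it shows the sort of decisions (scope, depth, exclusions, rationale, policy link) that need writing down.

```python
from dataclasses import dataclass, field

# Sketch of a stand-alone selection record. Fields and values are invented;
# the point is that scope, depth, exclusions, rationale and the link to the
# published collection development policy are all documented.

@dataclass
class SelectionDecision:
    seed_url: str
    crawl_scope: str
    exclusions: list = field(default_factory=list)
    rationale: str = ""
    policy_reference: str = ""
    decided_by: str = ""
    decided_on: str = ""

decision = SelectionDecision(
    seed_url="https://www.example.edu/",
    crawl_scope="whole host, one hop beyond the seed",
    exclusions=["calendar views", "search result pages"],
    rationale="Capture the public record of the institution, not dynamic noise.",
    policy_reference="Collection development policy, section 4 (hypothetical)",
    decided_by="University Archivist",
    decided_on="2017-03-15",
)
print(decision)
```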

Gregory’s work is described at https://github.com/UAlbanyArchives/describingWebArchives

Building a Digital Preservation Strategy

IRMS ARAI Event 19 November 2015

Last week I was in Dublin where I gave a presentation for the IRMS Ireland Group at their joint meeting with ARA Ireland. It was great for me personally to address a roomful of fellow Archivists and Records Managers, and learn more about how they’re dealing with digital concerns in Ireland. I heard a lot of success stories and met some great people.

Sarah Hayes, the Chair of IRMS Ireland, heard me speak earlier this year at the Celtic Manor Hotel (the IRMS Conference) and invited me to talk at her event. As a matter of fact I had a similar invitation from IRMS Wales this year, but Sarah wanted new content from me, specifically on the subject of Building a Digital Preservation Strategy.

How to develop a digital preservation strategy

My talk on developing a digital preservation strategy made the following points:

  • Start small, and grow the service
  • You already have knowledge of your collections and users – so build on that
  • Ask yourself why you are doing digital preservation, and who will benefit
  • Build use cases
  • Determine your own organisational capacity for the task
  • Increase your metadata power
  • Determine your digital preservation strategy (or strategies) in advance of talking to IT, or a vendor

I also presented some imaginary scenarios that would address digital preservation needs incrementally and meet requirements for different audiences:

  • Bit-level preservation (access deferred)
  • Emphasis on access and users
  • Emphasis on archival care of digital objects
  • Emphasis on legal compliance
  • Emphasis on income generation

Event Highlights

In fact the whole day was themed on Digital Preservation issues. John McDonough, the Director of the National Archives of Ireland, gave encouraging reports of how they are managing electronic records by “striding up the slope of enlightenment”. There’s an expectation that public services in Ireland must be “digital by default”, with an emphasis on continual online access to archival content in digital form. John is clear that archives in Ireland “underpin citizens’ rights” and are crucial to the “development of Nation and statehood”, which fits the picture I have of Dublin’s culture – it’s a city with a very clear sense of its own identity, and history.

In terms of change management and advocacy for working digitally, Joanne Rothwell has single-handedly transformed the records management of Waterford City and County Council, using SharePoint. Her resourceful use of an alphanumeric File Index allows machine-readable links between paper records and born-digital content, thus preserving continuity of materials. She also uses SharePoint’s site-creation facility to build a virtual space for holding “non-current” records, which replicate existing file structures. It’s splendid to see sound records management practice carry across into the digital realm so successfully.

DPTP alumnus from the class of November 2011, Hugh Campbell of the Public Record Office of Northern Ireland, has developed a robust and effective workflow for the transfer, characterisation and preservation of digital content. It’s not only a model of good practice, but he’s done it all in-house with his own team, using open source tools and developer skills.

During the breaks I managed to mingle and met many other professionals in Ireland who have responded well to digital challenges. I was especially impressed by Liz Robinson, the Records Officer for the Health and Safety Authority in Ireland. We agreed that any system implementation should only proceed after a thorough planning period, where the organisation establishes its own workflows and procedures, and does proper requirements gathering. This ought to be a firm foundation in advance of purchasing and implementing a system. Sadly, we’ve both seen projects where the system drove the practice, rather than the other way around.

Plan, plan and plan again before you speak to a vendor; this was the underlying message of my ‘How to develop a digital preservation strategy’ talk, so it was nice to be singled out in one Tweet as a “particular highlight” of the day.

Archiving a wiki

On dablog recently I have put up a post with a few observations about archiving a MediaWiki site. The example is the UKOLN Repositories Research Team wiki DigiRep, selected for the JISC to add to their UKWAC collection (or to put it more accurately, pro-actively offered for archiving by DigiRep’s manager). The post illustrates a few points which we have touched on in the PoWR Handbook, which I’d like to illuminate and amplify here.

Firstly, we don’t want to gather absolutely everything that’s presented as a web page in the wiki, since the wiki contains not only the user-input content but also a large number of automatically generated pages (versioning, indexing, admin and login forms, etc). This stems from the underlying assumption about doing digital preservation, namely that it costs money to capture and store digital content, and it goes on costing money to keep on storing it. (Managing this could be seen as good housekeeping. The British Library LIFE and LIFE2 projects have devised ingenious and elaborate formulae for costing digital preservation, taking all the factors into account to enable you to figure out whether you can really afford to do it.) In my case, there are two pressing concerns: (a) I don’t want to waste time and resource in the shared gather queue while Web Curator Tool gathers hundreds of pages from DigiRep, and (b) I don’t want to commit the JISC to paying for expensive server space, storing a bloated gather which they don’t really want.

Secondly, the above assumptions have led to me making a form of selection decision, i.e. to exclude from capture those parts of the wiki I don’t want to preserve. The parts I don’t want are the edit history and the discussion pages. The reason I don’t want them is because UKWAC users, the target audience for the archived copy – or the designated user community, as OAIS calls it – probably don’t want to see them either. All they will want is to look at the finished content, the abiding record of what it was that DigiRep actually did.
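Expressed as a rule, that selection decision amounts to URL filtering. The sketch below assumes default MediaWiki URL conventions (Special: pages, Talk: namespaces, action=history and action=edit query strings); in practice the exclusions would be configured in the crawler (Web Curator Tool / Heritrix) rather than in Python, and the patterns here are illustrative only.

```python
import re

# Sketch of the selection decision as URL filtering, assuming default MediaWiki
# URL conventions. In practice the exclusions are configured in the crawler
# (Web Curator Tool / Heritrix), not in Python; the patterns are illustrative.
EXCLUDE_PATTERNS = [
    r"[?&]action=(history|edit|raw)",  # page histories and edit forms
    r"[?&]oldid=\d+",                  # links to old revisions
    r"Special:",                       # auto-generated index, admin and login pages
    r"(^|[/=])Talk:",                  # discussion pages
    r"_talk:",                         # per-namespace discussion (User_talk:, etc.)
]

def keep(url):
    """True if the URL belongs to the finished content we want to preserve."""
    return not any(re.search(pattern, url) for pattern in EXCLUDE_PATTERNS)

print(keep("http://wiki.example.ac.uk/index.php?title=Main_Page"))                 # True
print(keep("http://wiki.example.ac.uk/index.php?title=Main_Page&action=history"))  # False
```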

This selection aspect prompted a reply from Maureen Pennock, who makes a very valid point – there are some instances where people would want to look at the edit history. Who wrote what, when…and why did it change? If that change-history is retrievable from the wiki, should we not archive it? My thinking is that yes, it is valuable, but only to a certain audience. I would think the change history is massively important to the current owner-operators of DigiRep, and that as its administrators they would certainly want to access that data. But then I put on my Institutional records management hat, and start to ask them how long they really want to have access to that change history, and whether they really need to commit the Institution to its long-term (or even permanent) preservation. Indeed, could their access requirement be satisfied merely by allowing the wiki (presuming it is reasonably secure, backed-up etc.) to go on operating the way it is, as a self-documenting collaborative editing tool?

All of the above raises some interesting questions which you may want to consider if undertaking to archive a wiki in your own Institution. Who needs it, how long for, do we need to keep every bit of it, and if not then which bits can we exclude? Note that they are principally questions of policy and decision-making, and don’t involve a technology-driven solution; the technology comes in later, when you want to implement the decisions.