Digitisation – Ed The Archivist

Metadata creation for digitisation: counting the costs

Speaking as a traditional archivist, I love cataloguing. I never thought I’d find myself having to justify cataloguing work, but given that it’s possible to attach a cost to everything these days, I find it is a serious consideration.

Experts who understand that specialist work has a real cost will try and tell us that detailed cataloguing might be turning into a luxury we can’t afford. This post will try and consider some of the things that make it expensive, and apply the lessons to digitised content.

I’m proposing that metadata can serve two important functions. One, to make digitised content intelligible to human beings; two, to make it possible for the computer to store, manage, and process that content.

Human-readable catalogues for your digitised collections are an absolute must, whether it’s an archive catalogue written in ISAD(G), a library catalogue written in MARC21, or a resource described using Dublin Core. We have standards we can work to, and increasingly we have computer-based cataloguing tools (such as Calm, Adlib, or AtoM) that facilitate the task. I would like to think of these tools as something that helps to turn human-readable descriptions into metadata, i.e. something that a computer can store and process.

That’s great if we’re writing a catalogue from scratch, but that’s not always the case; sometimes the original resource has metadata attached to it, perhaps created by its owner. Except that person probably didn’t work to a standard, and so if we want to recycle that metadata, we might be faced with a “mapping” task, normalising their non-standard metadata to standard fields, which is both an intellectual exercise and an IT task, involving importing and exporting values between spreadsheets.
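To give a rough idea of what that mapping task can look like in practice, here is a minimal sketch in Python. It assumes the owner's metadata arrives as spreadsheet rows. The owner's column names and the sample row are invented for illustration; only the target names (title, creator, date) are real Dublin Core elements. Anything the crosswalk doesn't recognise is set aside for a human decision, which is the "intellectual exercise" half of the job.

```python
# Hypothetical crosswalk from an owner's home-made spreadsheet columns
# to Dublin Core elements. The left-hand names are invented examples.
CROSSWALK = {
    "Name of thing": "title",
    "Who made it": "creator",
    "When": "date",
}


def map_record(row):
    """Return (mapped, unmapped) dictionaries for one spreadsheet row."""
    mapped, unmapped = {}, {}
    for column, value in row.items():
        value = value.strip()
        if not value:
            continue  # skip empty cells rather than store blank fields
        target = CROSSWALK.get(column)
        if target is None:
            unmapped[column] = value  # no standard home yet; a person must decide
        else:
            mapped[target] = value
    return mapped, unmapped


# Invented sample row, standing in for one line of the owner's spreadsheet.
row = {
    "Name of thing": " Bus at depot ",
    "Who made it": "Unknown photographer",
    "When": "1934",
    "Shoebox no.": "12",
}
mapped, unmapped = map_record(row)
print(mapped)
print(unmapped)
```

The list of unmapped columns is often the most useful output, since it shows exactly where the owner's scheme and the standard part company.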

Records managers also might take an interest in a normalising process like that; describing business documents in a records management environment, ensuring the context and meaning of the content is accurate and useful. The difference is they might be applying that metadata in a live environment, rather than applying it after the fact. Anyone who’s about to embark on a SharePoint project will recognise this; one way of looking at the transition from your old Document Management System to SharePoint is to see it as a vast metadata modelling exercise. Given the amount of metadata which SharePoint can support – both for individual documents and for folders and creators – this is worth thinking about.

It’s not just about building an inventory of the resources, but wouldn’t we like to apply our cataloguing skills to help users on their journey by adding navigational elements to web pages, such as structured views, clickable links, and faceted views of the collections based on elements such as dates, names, and subjects? This is totally possible, if you regard all of these things as metadata, individual fields which can be stored in databases and manipulated by web technology.

All of these things take time and money, but the expense is in the cost of information specialisms and expertise, and hours of effort spent carrying out the work. Metadata also can be “computationally expensive,” though. What we mean by this is there’s a potential cost to your IT.

A large-scale digitisation project, particularly if it intends to get serious about metadata creation, sharing, and interoperability, will typically create a lot of pages and possibly store them in XML files. These XML files can have many purposes, including describing the resource, and expressing its relationship to other resources.

Creating lots of XML pages is a grand thing to do, but even so they can take up server space – especially when there are so many of them, even if individually each file has a small “footprint”. It also can be expensive to index that metadata, which requires database operations and processing power; and even serving metadata may have a cost attached to it, as it can be calculated as one more strain on your bandwidth.

The general conclusion here is certainly not to abandon cataloguing and metadata creation, but to be aware of the costs to your organisation, and consider ways of reducing the burden, finding economies of scale, and concentrating your effort on delivering a core of essential metadata for your digitised content. This of course involves knowing the collections, and knowing the users. But that would be the subject of another post!

Pros and cons of the JPEG2000 format for a digitisation project

In this post we want to briefly discuss some of the pros and cons of the JPEG2000 format. When it comes to selecting file formats for a digitisation project, choosing the right ones may help with continuity and longevity, or even access to the content. It all depends on the type of resource, or your needs, or the needs of your users.

If you’re working with images (e.g. for digitised versions of books, texts, or photographs), there’s nothing wrong with using the TIFF standard file format for your master copies. We’re not here to advocate using the JPEG2000 format, but it does have its adherents (and its evangelists).

PROS

May save storage space

This is a compelling reason and may be why a lot of projects opt for the JP2. Unlike an uncompressed TIFF, it can be stored with lossless compression. (TIFF can also be compressed losslessly, for instance with LZW, although master TIFFs are commonly left uncompressed.) This means it can be compressed to leave a smaller “footprint” on the server, and yet not lose anything in terms of quality. How? It’s thanks to the magic codec.

Versatility

In “old school” digitisation projects, we tended to produce at least two digital objects – a high resolution scan (the “archive” copy, as I would call it) and a low resolution version derived from it, which we’d serve to users as the access copy. Gluttons for punishment might even create a third object, a thumbnail, to exhibit on the web page / online catalogue. By contrast, the JPEG2000 format could perform all three functions from a single object. It can do this because of the “scalable bitstream;” the image data is encoded so it only serves as much as is needed to meet the request, which could be for an image of any size.

Open standard with ISO support

As indicated above, a file format that’s recognised as an International Standard gives us more confidence in its longevity, and the prospects for continued support and development. An “open” standard in this instance refers to a file format whose specification has been published; this sort of documentation, although highly technical, can be useful to help us understand (and in some cases validate) the behaviour of a file format.

CONS

Codec dependency

We mentioned the scalable bitstream above and the capacity for lossless compression as two of this format’s strengths. However, doing these things requires an extra bit of functionality above and beyond what most file formats are capable of (including the TIFF). This is the codec, which performs a compression-decompression action on the image data. It is a dependency: without the codec, the magic of the JPEG2000 won’t work. It is also one part of the format which remains something of a “black box,” a lack of transparency which may make some developers reluctant to work with the format.

Save As settings can be complex

In digitisation projects, the “Save As” action is crucial; you want your team to be producing consistent digitised resources which conform precisely to a pre-determined profile, for instance with regard to pixel size, resolution, and colour space. With a TIFF, these settings are relatively easy to apply; with the JPEG2000, there are many options and many possibilities, and it requires some expertise selecting the settings that will work for your project. Both the decision-making process, and the time spent applying them while scanning, might add a burden to your project.

Not yet the de facto standard

The “digital image industry,” if indeed there is such an entity, has not yet adopted the JPEG2000 file format as a de facto standard. If you’re inclined to doubt this, look at the hardware; most digital cameras and digital scanners tend to save to TIFF or JPEG, not JPEG2000.

In conclusion, this post is not aiming to “sell” you on one format over another; what matters is going through a series of decisions, and informing yourself as best you can about the suitability of any given file format. Neither is it a case of either/or; we are aware of at least one major digitisation project that makes judicious use of both the TIFF and the JPEG2000, exploiting the salient features of both.

Scan Once for All Purposes – some cautionary tales

The acronym SOAP – Scan Once For All Purposes – has evolved over time among digitisation projects, and it’s a handy way to remember a simple rule of thumb: don’t scan content until you have a clear understanding of all the intended uses that will be made of the resource. This may seem simple, but in some projects it may have been overlooked in the rush to push digitised content out.

One reason for the SOAP rule is that digitisation is expensive. It takes money, staff time, expertise and costly hardware to turn analogue content into digital content.

Taking books or archive boxes off shelves, scanning them, and reshelving them all takes time. Scanning paper can damage the original resource, so to minimise that risk we’d only want to do it once. In some extreme cases, scanning can even destroy a resource; there are projects which have sacrificed an entire run of a print journal to the scanner, in order to allow “disaggregation”, which is a euphemistic way of saying “we cut them up with a scalpel to create separate scanner-friendly pages”.

Beyond that, there are digital considerations and planning considerations which prove the importance of the Scan Once For All Purposes rule. To demonstrate this, let’s try and illustrate it with some imaginary but perfectly plausible scenarios for a digitisation project, and see what the consequences could be of failing to plan.

Scenario 1
An organisation decides to scan a collection of photographs of old buses from the 1930s, because they’re so popular with the public. Unfortunately, nobody told them about the differences between file formats, so the scans end up as low-resolution compressed JPEGs scanned at 72 DPI because the web manager advised that was best for sending the images over the web.

Consequence: the only real value these JPEGs have is as access copies. If we wanted to use them for commercial purposes, such as printing, 72 DPI will prove to be ineffectual. Further, if a researcher wanted to examine details of the buses, there wouldn’t be enough data in the scan for a proper examination, so chances are they would have to come in to the searchroom anyway. Result: photographs are once again subjected to more wear and tear. And weren’t we trying to protect these photographs in some way?
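The arithmetic behind this is simple enough to sketch in a few lines of Python. The figures are assumptions chosen for illustration: a 6 x 4 inch photograph, and 300 DPI as a commonly quoted target for print. Pixels are just inches multiplied by DPI, and the printable size is pixels divided by the print DPI.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions produced by scanning a print at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def print_size_inches(width_px, height_px, print_dpi):
    """Largest print size (in inches) the pixels support at print_dpi."""
    return width_px / print_dpi, height_px / print_dpi


# Assumed example: a 6 x 4 inch photograph scanned at 72 DPI,
# then sent to a printer that expects 300 DPI.
w_px, h_px = scan_pixels(6, 4, 72)
w_in, h_in = print_size_inches(w_px, h_px, 300)
print(w_px, h_px)   # 432 288
print(w_in, h_in)   # 1.44 0.96
```

In other words, under these assumptions the 72 DPI scan only holds enough data for a print about an inch and a half wide, which is why it can't do much beyond serving as an access copy.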

Scenario 2
The organisation has another go at the project – assuming they have any money left in the budget. This time they’re better informed about the value of high-resolution scans, and the right file formats for supporting that much image data in a lossless, uncompressed manner. Unfortunately, they didn’t tell the network manager they wanted to do this.

Consequence: the library soon finds their departmental “quota” of space on the server has been exceeded three times over. Because this quota system is managed automatically in line with an IT policy, the scans are now at risk of being deleted, with a notice period of 24 hours.
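A back-of-envelope estimate, shared with the network manager in advance, might have avoided this. The sketch below uses assumed figures throughout (6 x 4 inch photographs, 600 DPI, 24-bit colour, a collection of 5,000 items) and ignores file headers and embedded metadata, so treat the result as a floor rather than a precise number.

```python
def uncompressed_bytes(width_in, height_in, dpi, channels=3, bits_per_channel=8):
    """Approximate size of the raw pixel data for one uncompressed scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels * bits_per_channel // 8


# Assumed figures, for illustration only.
per_image = uncompressed_bytes(6, 4, 600)   # one 24-bit RGB scan
total = per_image * 5000                    # a collection of 5,000 photographs

print(per_image)                   # 25920000
print(f"{total / 10**9:.1f} GB")   # 129.6 GB
```

Roughly 26 MB per image doesn't sound alarming, but multiplied across a whole collection it is easy to see how a departmental quota gets exceeded.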

Scenario 3
The organisation succeeds in securing enough server space for the high-resolution scans. After a few months running the project, it turns out the users are not satisfied with viewing low-resolution JPEGs and demand online access to full-resolution, zoomable TIFF images. The library agrees to this, and asks the IT manager to move their TIFF scans onto the web server for this purpose.

Consequence: through constant web access and web traffic, the original TIFF files are now exposed to a strong possibility of corruption. Since they’re the only copies, the organisation is now putting an important digital asset at risk. Further, the action of serving such large files over the web – particularly for this dynamic use, involving a zoom viewer – is putting a severe strain on the organisation’s bandwidth, and costing more money.

The simple solution to all of the above imaginary scenarios could be SOAP. The ideal would be for the organisation to handle the original photographs precisely once as part of this digitisation project, and not have to re-scan them because they got it wrong first time. The scanning operation should produce a single high-quality digital image, scanned at a high resolution and encoded in a dependable, robust format. We would refer to this as the “original”.

The project could then derive further digital objects from the “original”, such as access copies stored in a lower-resolution format. However, this is not part of the scanning operation; it’s managed as part of an image manipulation stage of the project, and is totally digital. The photographs, now completely ‘SOAPed’, are already safely back in their archive box.
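As a small sketch of how that derivation stage might be planned, the Python below works out the pixel dimensions of each derivative from the dimensions of the “original”, keeping the aspect ratio and never scaling up. The derivative names and target sizes are hypothetical, and the code only plans the sizes; the actual resizing would be done by whatever imaging software the project uses.

```python
def fit_within(width_px, height_px, max_side):
    """Scale dimensions so the longer side is at most max_side pixels."""
    longest = max(width_px, height_px)
    if longest <= max_side:
        return width_px, height_px  # never upscale a derivative
    new_width = max(1, round(width_px * max_side / longest))
    new_height = max(1, round(height_px * max_side / longest))
    return new_width, new_height


# Hypothetical derivative profiles: name -> longest side in pixels.
DERIVATIVES = {"access": 1200, "thumbnail": 150}

original = (3600, 2400)  # assumed dimensions of the scanned "original"
plan = {name: fit_within(*original, side) for name, side in DERIVATIVES.items()}
print(plan)  # {'access': (1200, 800), 'thumbnail': (150, 100)}
```

Because every derivative is calculated from the original, the profiles can be changed later and the copies regenerated without the photographs ever leaving their box again.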

The digital “originals” should now go into a safe part of the digital store. They would never be used as part of the online service, and users would not get hold of them. To meet the needs of scenario 3, the project now has to plan a routine that derives further copies from the originals; but these should be encoded in a way that makes them suitable for web access, most likely using a file format with a scalable bitstream that allows the zoom tool to work.

All of the above SOAP operations depend on the project manager having a good dialogue with the network server manager and the web manager too; a trait which such projects share with long-term digital preservation. As can be seen, a little bit of planning will save the project money and get the desired results without having to perform a scan twice over.

Things to consider before undertaking a digitisation project

Counter-intuitive as it may seem, this blog post will try and advance the idea that embarking on a project to digitise your paper collections isn’t always a great idea. This isn’t to say you should abandon the idea completely, but we would encourage you to think it through. You could read this post as a sort of cautionary tale.

The Harvard report Selecting Research Collections for Digitization proposes a number of very sound reasons for why an HFE Institution should pause before it commits resource to any large and complex digitisation project. They provide the reader with a series of questions that will help a good project planner steer a way through the decision process.

Among the reasons identified by these experts, I will single out two of my favourite themes:

Is anyone even interested?

Look at the material you’re intending to digitise. Does it have any value? Do you think readers, users, researchers and customers are going to be interested in it? Even if they are interested, why does it improve the situation for them to access it in digital form? Will usage of the material increase? If you increase access to thousands more people around the world who look at the material through your online catalogue, is that a genuine improvement? Why?

The answers to these questions may seem to be obvious to you, but this line of thinking also can expose some of our assumptions and pre-conceived ideas about our relationship with our audience, and the real value of serving content digitally.

We might assume a collection is going to be popular when it isn’t. We might assume that simply scanning a book and putting images of the pages online is all we need to do. Have we even asked the readers what they would like?

Can you go on supporting it?

This is about the very real problem of ongoing costs. We may assume that once all the scans are produced, the project budget can be closed. In fact, it continues to cost you money to store, support, manage and steward your digitised collections; and that’s leaving aside the cost of long-term preservation, should you realise there’s permanent value in the digital material you have created. In short, it may cost more than you think.

My former colleague Patricia Sleeman did a survey of a number of HFE Institutions in 2009 who had received JISC funding to carry out digitisation projects over the previous decade. She found:

“Four principal themes surfaced through analysis of the preservation plans of the digitisation projects that relate the maturity of institution to the likely success of their digitisation efforts. These are the need for preservation policies; collection management procedures; robust preservation infrastructures; and sustainability. In short, institutions or consortia which have clarity in these four areas considerably reduce the risks associated with long term access to digitized collections.”

Both of these reports may have been aimed primarily at HFE audiences in a research context, but I think the lessons apply to any organisations, including those in the commercial sector who intend to digitise content.

You’re considering spending a lot of money on digitising this collection, and potentially committing the resources of people, technology, and time. If you proceed with the project on overly-optimistic assumptions, it can lead to difficulties in the future.

However, don’t let this discourage you…

When you’ve decided to say “yes”

The benefits of doing digitisation have probably occurred to you already (saves wear and tear on originals, disseminates more content to a wider audience, benefits the organisation, may help with income generation…). I also like to encourage project managers to rethink, if possible, what the collection’s potential is for engaging with its intended audience. Are we happy to continue the traditional model of the searcher visiting the searchroom and looking at a box of photographs with captions, only doing it in a “digital” manner? Wouldn’t we like to use web tools like page-turners and zoom devices to enhance and improve on the experience in some way?

The great thing is that if you’ve done scanning according to best practices, you can repurpose your resources (as Access Copies) in a myriad of ways, making the most of access technologies. You’re now opening the doors for a potential dialogue with your user community, responding to changes in user needs and repurposing the way you serve your content. All your hard work will have paid off.

Priorities for business scanning

A business may decide to scan all their current paperwork, but this is not quite the same as a managed digitisation project.

Quite often a project like this is undertaken for a number of reasons: to save money, to improve efficiency, and to save space occupied by paper. The dream of the “paperless office” has been haunting us for about 30 years now. It still hasn’t come true, at least not in the way they promised us. I can personally recall a time when scanning bureaux appeared in the UK almost overnight, offering to convert the contents of 25 file cabinets into digital scans, and put them all onto a single CD ROM.

The prospect of doing this often appealed to senior executives, especially as the next logical step in their minds would be to get all that paper destroyed (a suggestion that usually causes an information manager to shudder).

How it differs from a traditional digitisation project

Which brings us to the next aspect that interests me. How long are we intending to keep these scans? A digitisation project for a library or archive collection will most likely result in digital content which we wish to preserve and keep permanently, because it’s both a valuable digital asset and a digital surrogate of an important part of our collections. However, when we take on “scanning for business”, as I call it, it’s possible the scans might have a relatively short shelf-life.

This is where it starts to shade into a records management concern. In fact my ideal would be to see a scanning project owned by the records manager, with one eye on user satisfaction, another on protecting business and legal needs, and a third eye on the possible long-term retention needs.

Taking all this into account, ideally we’d try and frame this project with a different emphasis to the concerns we have when doing digitisation for preservation purposes. Our list of priorities when scanning for business might look a bit like this:

Metadata

People need to find stuff again, and any automated retrieval system will only work if there’s sufficient metadata for the objects stored in it. We’d like to think about using a pre-determined metadata schema, depending on the nature of the content; tags, folders, and naming rules that will help users retrieve content. My point here is that metadata decisions will tend to be driven by immediate user needs, rather than archival or library cataloguing standards.

Image quality

For a long-term preservation project, our first thought would be of high-resolution image files encoded in robust, open formats. For business scanning, it’s highly likely we might be able to compromise on the quality. If we can get away with lower-resolution images in compressed files, it’s worth considering. It may depend on whether the staff want OCR as well as images, which is yet another consideration.

Retention and disposal

Our plan for scanning must align with records management plans. The content is still maintained for as long as there’s a business need, just the same as when it was in paper form. Likewise, we’d hope staff co-operate with our recommended best practices for file naming and description, to assist with those retention decisions.

Authenticity

We’re all concerned with creating “authentic” digital objects, but the business need in this scenario might be slightly different to how an archivist or a researcher regards an authentic digital object. In the archives scenario, the archivist wants to be sure the preserved object is a genuine representation of the original, and so do their users. In the business scenario, we not only need to be assured of that, but we also want hard evidence that is the case, for when the auditors start asking questions. We’re thus facing at least two tough tasks – ensuring the scans themselves are authentic when they’re created, and then making sure we maintain that authenticity through daily use of the scans. We’d certainly want some form of evidence chain and audit trail for that.
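One common building block for that kind of evidence is a checksum recorded when the scan is created and re-checked whenever the file is used or audited. The sketch below is a minimal illustration using only Python's standard library; the file name, the event labels and the shape of the log entries are all my own assumptions, and a real system would also need to protect the log itself from tampering.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_entry(path, event, expected=None):
    """Build one audit-trail entry (hypothetical layout) for a file."""
    actual = sha256_of(path)
    return {
        "file": os.path.basename(path),
        "event": event,
        "sha256": actual,
        "matches_expected": None if expected is None else actual == expected,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


# Demonstration with a stand-in file rather than a real scan.
with tempfile.TemporaryDirectory() as folder:
    scan = os.path.join(folder, "invoice_0001.tif")
    with open(scan, "wb") as handle:
        handle.write(b"stand-in bytes for a scanned page")

    created = audit_entry(scan, "created")
    checked = audit_entry(scan, "verified", expected=created["sha256"])

    with open(scan, "ab") as handle:
        handle.write(b"!")  # simulate an unnoticed change to the file
    changed = audit_entry(scan, "verified", expected=created["sha256"])

print(json.dumps([created, checked, changed], indent=2))
```

The second entry confirms the file is unchanged; the third shows what the auditors would see if a single byte had been altered since the scan was made.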

From here, we’ve got the bare bones of a successful business scanning project. We might soon be in a better position to safely destroy paper originals, if indeed that was one of the drivers or project goals. That destruction needs to be carried out with due care and attention. You’d certainly want all the digital content signed off as regards authenticity to prove the admissibility of digital objects as legal documents.

If however you succeed in secure shredding of a large number of boxes of paper, you’ve now freed up storage space and shelf space. That is something that has a cash value. If you keep metrics of progress in this area, you’re ready to start proving the value of your project to the organisation.

Projects like this aren’t necessarily easier to carry out than a library/archive focussed digitisation project, and they still require much planning and engagement with stakeholders. As I’ve tried to show above, the priorities have a slightly different emphasis. However, the results can be something of genuine benefit to your organisation, and will prove the value of the Information Manager/Records Manager/Archivist roles and services.

Five benefits of a digitisation programme

We see digitisation as a form of project management, and any managed project needs to have at least three core things – costs, risks, and benefits. It’s important to think about the benefits that a digitisation programme will bring, and not just to you as a collection manager, but to your users, and to your organisation. Sometimes these benefits can be overlooked, or not considered and assessed in detail. In this post we’ll pick out some of the possible benefits digitisation can bring.

Saves originals

Archivists and librarians will recognise the scenario – there’s a precious irreplaceable resource, or one that is fragile (the paper may be crumbling), or it’s the only available copy in the country. What’s more it’s in constant demand, so subjected to frequent handling every time it’s retrieved from the stacks by the staff, then further handling in the searchroom. These precious documents and books don’t like being out in the light too often. Digitisation eliminates all the above risks and provides what, in the old analogue world, would have been called a “surrogate” copy.

Main beneficiaries: archivists, librarians

Meets user needs

This may seem obvious, but it’s surprising how some digitisation projects still start and end with the collection manager’s decision, and don’t take the audience of users into account. There ought to be a formal process of assessing user needs at the start of a project, and the application of metrics to determine whether user needs have actually been met. This doesn’t always happen; digitisation decisions can be driven instead by internal staff meetings, advisory boards, or the recommendations of external consultants.

It might be more beneficial to consider user-centric methods and approaches like focus groups, customer surveys, online questionnaires, and statistics on searchroom use. A successful digitisation project aimed directly at satisfying a real user need can reap visible dividends for the organisation, in terms of visits, web page hits, raised profile, user satisfaction, and user engagement.

Main beneficiaries: users, the institution

Improves or enhances access

This is surely one of the main benefits of digitising any resource. If planned and executed correctly it can result in a string of related benefits for you and your organisation: increased access through the web, more users reached, and growth not just in the numbers but in the diversity of your audience. But it’s not enough to just throw an existing image collection on the web in a gallery browser and let the power of the internet do the rest.

Collection managers should take the opportunity to rethink the potential of the resources, listen to user needs, and use technology to provide more imaginative ways to recast and enhance access to the content. There are possibilities for discovery metadata as well as cataloguing metadata, for navigational links that allow many entry points to a collection instead of a traditional hierarchical catalogue, and plug-in tools that can deliver popular and attractive ways to serve the content to users.

One of the most prominent of these is the page-turner and zoom tool device, so common with online books. These things are not merely gimmicks to be used for their own sake, but can offer your users more direct engagement with your collections. And we haven’t even mentioned crowd-sourcing yet…

Main beneficiaries: users, researchers

Saves space

This scenario is a bit of an outlier, and it’s more of a records management/organisational change story (although other information management professionals may consider it too). The common motivator here is that the office is running out of space and that it would be convenient to scan all the current papers into digital form, and start “working digitally”. Managers who have this bright idea can immediately see a cost saving in terms of storage space, with visions of now-empty filing cabinets being removed from costly office space.

True, space saving can be a massive benefit – but people still have to find the materials. A project like this has to be managed very carefully and with a lot of preparation, especially giving due attention to metadata, which doesn’t automatically appear from nowhere when you take folders out of an organised filing system. And scanning is not cost-neutral either. Even so, if you can do this right, you’ll be contributing a genuine improvement to current working practice, and you will save money and space.

Main beneficiaries: staff, organisation, managers

A step towards digital preservation

The gain here is that the digitisation process can seriously lengthen the life of your valuable resources. Through digitisation, you could begin the process of long-term digital preservation. The scenario would be that you continue to keep the original analogue materials, but also keep the digitised version you have created; after all, it has cost you a lot of money to create it (staff time, server space), and its ongoing value to the organisation is already being demonstrated.

Treat the digitised resource with as much care and respect as you would your archival originals, and you’re on the road to digital preservation. As part of the project planning you would want to factor in the long-term preservation goal, before you even lift the lid of the scanner.

Main beneficiaries: archivist, institution

These are just five of the many benefits that a well-managed digitisation project can bring. Other topics would include income generation, pro-active user engagement, and attracting new customers to your offering. Understanding benefits (along with the costs and risks) is a positive way of understanding the digitisation task and delivering the project successfully.

Digitisation Course at Salford

Recently, I delivered a one-day training course on digitisation to Digital Humanities postgraduates in Salford. Elinor Taylor of Salford University won an AHRC grant for a Research Skills Enrichment project, called Issues in the Digital Humanities: A Key Skills Package for Postgraduate Researchers, and one of the strands was about improving digitisation skills; more specifically how best to manage a digitisation project.

Elinor was unable to find anyone who could deliver the course they wanted, and commissioned ULCC to create a bespoke course. Elinor at first thought a workshop / hands-on event might be best, where a digitisation workflow could be aligned with a real-world case processing papers from the Working Class Movement Library which they were scanning. In the end we agreed that an overview of management principles would be better. I was asked not to dwell on scanners and cameras, since the audience for the course would mostly be outsourcing their origination work to commercial providers. Audio-visual conversion was also out of scope.

My course was structured to follow a start-to-finish narrative. Inevitably this meant spending one-third of the time discussing the planning and preparation. I’m a great believer in asking the question “why” about 15 times before beginning any project, and the same applies to digitisation. Who wants this stuff? Why digitise it? Will it improve their lives if you do? More importantly, can we make the experience for a user even better by digitising it? Then if you get a sponsor for your project, there’s the management – the multiple considerations and the groundwork that has to be done before a single image is created.

Let’s skip ahead to image creation. When it comes to producing digitised content, my archivist training always tells me it’s best practice to create “archival originals” – or “master copies” – to a very high standard of resolution, quality, and format compatibility. I favour the process used by many professional digitisers of creating RAW images in a good camera and deriving TIFF files from those RAW images. From that point it’s possible to derive accessible copies – quite often low-res, small size objects in JPEG – for your user base. It was also good to address my favourite subject, metadata, and I attempted to stress why it’s such an essential part of digitisation. We need the descriptive catalogue metadata, but we also need the technical metadata for validating, handling and preserving image objects, especially if we want to put them into an image management system.

As to image management systems, I sensed great interest in the room as I described all the useful features that you might find in a dedicated system such as Canto or Extensis Portfolio. At one level it almost seems these systems offer everything a digitisation project could want in terms of searching and browsing, integrated editing suites, and management tools. Provided, of course, that we have the metadata to begin with. If I do this course again I must remember to stress the essential role of metadata. A management system can’t manage very much without it.

Another subject which generated a lot of interest and discussion was Copyright. Unsurprisingly, this continues to be the concern that nearly breaks the deal for a digitisation project. Do you have the right to digitise, store, preserve, disseminate and share this material? Do you attempt to take a managed risk, or go to the effort of resolving the question of ownership? “How do Google Books get away with what they are doing?” was one question from the floor. The copyright dilemma is that the law as it currently stands does all it can to prevent copying of copyrighted material. But for any digitisation project (for that matter, any project involving IT or the internet) archivists have to make copies and allow users to make copies too. Is copyright law lagging behind the reality of the digital world? Discuss.

Some positive feedback from students:

“Great technical details and non-biased advice from an expert”.
“A very realistic view of the amount of work involved in digitisation projects, and the importance of planning”.
“The presenter was excellent. It was accessible and yet in-depth”.
“Very straightforward methodology of doing a digitisation project – jargon explained.”

Update 29th August: