Future-Proofing: Emails in Xena

From the JISC Future Proofing project blog

For this post, I just want to look at how emails are rendered as Xena transformations. I should stress that Xena doesn’t expect you to normalise individual .msg files, as we have done, but instead convert entire Outlook PST files using the readpst plugin. Nevertheless we still got some useable results with the 23 message files in our test corpus, by running the standard normalisation action.

In our test corpus, the emails generally worked well. Almost all of them had an image file attached to the message – a way of the writer identifying their parent organisation. Two test emails had attachments (one a PDF, one a spreadsheet). Four of them contained a copy of an e-Newsletter, content which for some reason we can’t open in the Xena viewer. I’ll want to revisit that, and also try out some other emails with different attachment types.

When we open a normalised email in the Xena Viewer, we can see the following as shown in the screenshot below:

  • The text of the email message in the large Package NAA window.
  • Above this, a smaller window with metadata from the message header. (In Xena, this is unique to email as far as we can see, and it’s very useful. Why can’t it do this for properties of other documents?)
  • Below, a window displaying the technical metadata about the rendition.
  • Below the large window on the left, a small window which is like a mini-viewer for all attachments.
  • To the right of this window is technical metadata about each attachment.
  • Right at the bottom, a window showing the technical metadata about the entire package. Even from this view, it’s clear that Xena is managing an email as separate objects, each with their own small wrapper, within a larger wrapper.

The immediate problem we found here was truncation of the message in the large window. This is being caused by a character rendering issue, and the content isn’t actually lost, as we’ll see when we look at another view.

In the Raw XML view (above image), we can see the XML namespaces being used for preservation of metadata from the email header (http://preservation.naa.gov.au/email/1.0) and a plaintext namespace (http://preservation.naa.gov.au/plaintext/1.0) for the email text.

If we copy the email text from here into a Notepad file, we can see the problem causing the truncation. Something in the system doesn’t like inverted commas. The tag plaintext:bad_char xml:space=”preserve” tells us something is wrong.

In the XML Tree View (image below) we can also see the email being managed as a Header and Parts.

Exporting from Xena

I tried clicking the Export button for an email object. This does the following:

  • Puts the message into XML. It can now open in a browser and appears reasonably well formatted.
  • Creates an XSL stylesheet – probably this is what formats the message correctly
  • Detaches the attachment and stores it in an OO equivalent, or its native format (might be worth investigating this for more examples and seeing what comes out.)

So for a single email with a small image footer and a PDF attachment, I get the following four objects:

  1. Message rendered in XML
  2. XSL stylesheet
  3. One image file
  4. One PDF file

When opened in a browser, the XML message file also contains hyperlinks to its own attachments, which has some potential use for records management purposes (provided these links can be managed and maintained properly).

So far so good. Unfortunately this XML rendition still has the bad_char problem! It would be possible to open this XML file in UltraEdit and do a find and replace action, but editing a file like this undermines the authenticity of the record, and completely goes against the grain for records managers and archivists.

The second option we have for .msg files is to normalise them using Xena’s Binary Normalisation method instead. This creates something that can’t be read at all in the Xena Viewer, but when exported, this binary object converts back into an exact original copy of the email in .msg format, with no truncation or other rendering problems. Attachments are intact too. It creates the same technical metadata, as expected.

We’ve also tried the PST method, which works using the readpst plugin that comes with Xena. This is pretty nifty and speedy too. The normalisation process produces a Xena digital object for every single email in the PST, and a PST directory also. When exported, we get the XML rendition and stylesheet also. So far, a few PROs and CONs with the PST method:

  • PRO1 – no truncation of the message.
  • PRO2 – lots of useable metadata.
  • CON1 – file names of the XML renditions are not meaningful. Each email becomes a numbered item in an mbox folder.
  • CON2 – attachments simply don’t work in this process.
  • CON3 – PST files are not really managed records. In fact Kit Good’s initial reaction to this was “If I could ban .pst files I would as I think they are a bit of a nightmare for records management (and data management generally). I could see the value of this in the case of archiving an email account of a senior employee (such as the VC) or eminent academic member of staff but I think it would be impractical for staff, especially as .pst files are often used as a ‘dump’ when inboxes get full.”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.