I’ve been capturing websites for the JISC since 2004. In recent years many of these project sites have been blogs, and I’ve noticed some strange behaviours in my captures, so I’m interested in any method that can improve on the current methodology. I anticipate that the ArchivePress approach is going to work very well from a preservation standpoint, and I think this for three reasons:
- It’s going to result in a much “cleaner” capture of blog content than the remote-harvesting method. Up to now I’ve been working with the Heritrix harvesting engine, a very powerful robot for copying the content and folder structure of a website. Some blogs can defeat Heritrix in short order, particularly if they use tricky scripts or are heavily database-driven. WordPress blogs, of course, present their content in a neat folder-like URL structure, and they have proven more amenable to a Heritrix capture. Even so, with a WordPress blog I can end up capturing the same content three times over, especially when Heritrix harvests the “Tags” or “Category” folders: it is effectively requesting the same posts from the server under different URLs. With well-established blogs that are full of content, this can lead to bloated gathers. We can mitigate that problem with judicious filters applied to the harvesting engine. ArchivePress, however, as I understand it, will capture the content direct from the RSS feed (see the first sketch after this list). This may be as good as getting the data direct from the web server itself, in a nice “clean” export.
- The data from the feed is structured. It arrives in an XML wrapper, which identifies all the content within a schema of structured data fields. Some of these fields, such as Title and Author, map readily onto Dublin Core elements (WordPress feeds supply authorship through the dc:creator extension, for instance), which is good news for us archival metadata lovers. For preservation purposes, structured data is more desirable than the less-structured output of a remotely gathered website. True, with the latter we’re copying the folder structure, and HTML pages could be deemed “structured” because of their tags, but those tags are almost exclusively concerned with formatting and rendering the content so that it works in a web browser (colours, fonts, headings and so on).
- The structured XML data is going to be much more flexible and adaptable. Instead of bundling the elaborate website structure and its contents inside a WARC file for preservation purposes, we could pour the data direct into our own local database, configured to match the metadata fields from the feed (see the second sketch after this list). Over time, perhaps that database could itself be preserved as CSV tables (a method ULCC pioneered with the NDAD service), a format which is robust, reliable and easy to migrate. At any time, a copy of the data could be poured out of the database again and, if desired, rendered as formatted pages for a web browser, even replicating the original look and feel. Yet throughout such processes it retains its integrity as a collection of clean, structured data.
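To make the feed-capture idea concrete, here is a minimal sketch, written in Python rather than the WordPress plugin environment ArchivePress actually lives in, of reading posts straight from a blog’s RSS feed as structured, namespaced XML. The feed URL is a made-up example; the Dublin Core and content namespaces are the ones WordPress feeds normally declare.

```python
# Illustrative sketch only, not ArchivePress code: read posts from a
# WordPress-style RSS 2.0 feed as structured XML rather than scraped HTML.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example-project-blog.example.ac.uk/feed/"  # hypothetical feed

# Namespaces commonly declared in WordPress feeds: Dublin Core carries
# authorship, content:encoded carries the full post body.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

posts = []
for item in tree.getroot().iterfind("./channel/item"):
    posts.append({
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "published": item.findtext("pubDate"),
        "creator": item.findtext("dc:creator", namespaces=NS),       # Dublin Core author
        "content": item.findtext("content:encoded", namespaces=NS),  # full post body
    })

for post in posts:
    print(post["title"], "by", post["creator"])
```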
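And a second sketch, equally illustrative and not an ArchivePress schema, of pouring those parsed items into a local database whose columns mirror the feed metadata, then pouring them back out again as a plain CSV table for long-term keeping. The table layout and file names are assumptions of mine.

```python
# Illustrative sketch: store parsed feed items (the dicts built above) in a
# local SQLite database, then export the table as CSV for preservation.
import csv
import sqlite3


def store_posts(posts, db_path="blog_archive.db"):
    """Insert parsed feed items into a table whose columns mirror the feed fields."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               link TEXT PRIMARY KEY,
               title TEXT,
               creator TEXT,
               published TEXT,
               content TEXT)"""
    )
    con.executemany(
        "INSERT OR REPLACE INTO posts VALUES (:link, :title, :creator, :published, :content)",
        posts,
    )
    con.commit()
    con.close()


def export_csv(db_path="blog_archive.db", csv_path="posts.csv"):
    """Pour the data back out of the database as a plain CSV table."""
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT link, title, creator, published, content FROM posts")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["link", "title", "creator", "published", "content"])
        writer.writerows(rows)
    con.close()
```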
We can see how this will lead to improved Submission Information Packages, Archival Information Packages and Dissemination Information Packages within the OAIS framework, meaning that ArchivePress is potentially a very good approach for digital preservation purposes.