Archive for July, 2009

I met with Alexandra Eveleigh and Sarah Shooter from the West Yorkshire Archive Service (WYAS) this morning to talk about web archiving and, in particular, ArchivePress. Though ArchivePress was initially conceived as a tool to support archiving of academic blog content, we’ve long thought that it could be used more widely. This meeting confirmed that and gave us the chance to explore some of the archival implications of using ArchivePress to collect and archive blog content.

We discussed a number of scenarios where ArchivePress could be used – not just by archival institutions (as per scenario 2 in my post below) but also by local groups who want to develop an archival collection along a particular theme and include blogged content in that collection. This is an interesting example and one which I hadn’t really thought of before – ArchivePress as a mechanism to support ‘community archiving’, of a sort. It was encouraging to get feedback from an archival perspective about our underlying premise too – that blog archiving can have a different set of requirements to other types of websites, and that blogs are one content type where the core content (i.e. the blog posts) is easily and discernibly more important than the presentation. Sarah was quick to point out that she rarely consumes blog content via websites, receiving it straight to her iPhone instead. She rarely sees what the blog website actually looks like, let alone uses any of the content from the sidebar widgets.

Does extracting just the posts have any impact on the integrity, reliability, and authenticity of the resource? Traditional archival practices would mostly have it that an object has to be preserved in its entirety, including the preservation of context. It could be argued that by extracting and preserving only blog posts rather than the whole website, a lot of important context is being lost. But then again, archival appraisal – as Alex pointed out – tends to focus on the content. Archivists have been transferring that content around and altering the objects for decades, for example by microfilming colour resources in black and white, and that has still been acceptable. So, it would seem that as long as the ArchivePress process and the original context of blog posts are sufficiently documented, our proposal should still result in resources of sufficiently high quality, integrity, and reliability to be afforded archival status.

We’ll be posting more on documenting context in due course.

ArchivePress dissemination got off to a flying start thanks to the DPC/JISC/UKWAC, who invited us to make a short presentation about it at Tuesday’s workshop, Missing Links: The Enduring Web, at the British Library. For those of you content with the silent version, the slides from my presentation are attached below, courtesy of Slideshare.

Lots of other interesting projects were presented – the slides of these are all available at the DPC website. Reports sighted so far include Marieke Guy’s report for JISC-PoWR, and a post on Jonathan Clark’s blog. I’ll add any further sightings as comments. Some Tweets were also spotted with the tag WAC09.

Among the updates on web preservation initiatives that we are familiar with, notably at TNA and BL, it was particularly interesting to hear Hanno Lecher’s presentation on his citation repository for DACHS at Leiden. Keeping safe copies of the web resources cited by researchers or students seems hugely important to me: I’d like to see institutions getting more involved in that, and ArchivePress may have a part to play.

Andy McGregor drew my attention to Michael Nielsen’s recent blog post (article?), Is scientific publishing about to be disrupted? Michael convincingly analyses the disruption of the news publishing industry by online news and blogging, and moves on in a similar way to consider scientific publishing. Michael reminds us that “more and more blogs contain high quality research content”.

If you’re reading this, I may be preaching to the converted, but in the interests of invoking authority and experience (like Chaucer’s Wife of Bath) we can add this to a growing number of assertions to this effect. As previously mentioned, Peter Murray-Rust’s views on the importance of blogging (and therefore of blog preservation) are worth repeating:

Blogs are evolving and being used for many valuable activities (here we highlight scholarship). Some bloggers spend hours or more on a post. Bill Hooker has an incredible set of statistics about the cost of Open Access and Toll Access publications, page charges, etc. Normally that would get published in a journal no-one reads (I have even published in such it was a huge effort and it’s got one citation. Not that I care about citations). So I tend to work out my half-baked ideas in public. Some people do their early science in the Open. Some are activists. Some review the current landscape, etc.

And in a similar vein, Heather Morrison, in her First Monday article Rethinking collections – Libraries and librarians in an open age describes her experience:

Many of my most important contributions to the debates surrounding open access, for example, are posted to the Imaginary Journal of Poetic Economics, or to a listserv. These contributions may or may not be included in peer–reviewed literature at a later date.

If libraries focus solely on collecting peer–reviewed or formally published literature and not blogs and listservs, some of my best writings, and some of the ideas contained there and not expressed elsewhere, are likely to be lost.

I expect I’ll find many more opinions about this over the next few months, so this post will probably have a sequel. Back to Michael Nielsen, in the meantime, who also touches on the issue of collecting and preserving this valuable blog content:

It would be easy to build upon the open source WordPress platform [adding] important features [...] like reliable signing of posts, timestamping, human-readable URLs, and support for multiple post versions, with the ability to see (and cite) a full revision history. [...] Perhaps most importantly, blog posts could be made fully citable.

Encouraging words for this project. And WordPress-based plugin/theme solutions to many of Michael’s suggestions are already available in at least embryonic form – and they are GPL too. I’m looking forward to pulling some of them together into ArchivePress.

June was ArchivePress Month 1, and already it’s hard to keep up, particularly with the online buzz. We’ve attracted a modicum of interest in the Twitosphere:

We’ve also had some highly useful discussions about the project on the JISC-PoWR blog and at Peter Murray-Rust’s blog. Among the things I’ve learned from them:

  • We have to continue to make our scope and use cases clear, particularly with regard to distinguishing our approach from crawling/spidering/harvesting. Creating local static copies of HTML renderings is the daddy of web archiving approaches, but our thesis is: TIMTOWTDI.
  • We’re not alone in thinking that blogs merit being treated differently from ‘traditional’ websites, and that this (setting a blog to catch a blog) might be a worthwhile idea/approach, but there are bridges to cross, notably the comments-harvesting.
  • Throughout academia – teaching, learning, research and administration – blogs are going from strength to strength. It would be a crime not to ensure they are preserved for future research.

July is the month when I get on with our first demonstrator, AP1, record the process, and review the results. And IWMW2009 and Enduring Web at BL. And the JISCRI Projects startup meeting I’m sitting in right now.