Archive for June, 2009
You might already be familiar with blog archiving and the collections of blogs that web archiving projects are putting together. There’s the collection of blogs from the Wellcome Library, for example, or from the British Library, or the Library of Congress BLawg Archive. The common and familiar scenario is that an organisation runs a web crawler such as HTTrack or Heritrix to capture copies of content – in this case, blogs – and provides subsequent access to each archived blog as an integral whole. This is perfectly acceptable if presenting the site as an integral whole is the requirement. ArchivePress, on the other hand, is based on the premise that organisations may not have this requirement: they can have different reasons for wishing to capture copies of blog content, and different intentions for managing and using that content once they have it. For instance:
Scenario 1: A university has given its academics free rein to blog on whatever software platforms they choose. It later realises that this output is of academic and record-keeping value, but that it has no record of it, and that any attempt to force staff to switch to an internally hosted service would probably be badly received. ArchivePress is installed and configured to collect copies of all blog posts from a list of pre-selected blogs; the contents are consolidated into a single database which can then be re-presented to the user via a single interface branded with the university’s logo.
Scenario 2: A local archiving institution wishes to collect blog content around a particular local theme (for example, about a local author or event) and aggregate the contents into a single, easily searchable resource for access via its website. It does not have the technical resources to manage a harvest-based web archiving approach, nor the finances to invest in a commercial web archiving service. ArchivePress can be implemented and managed with very little technical knowledge, and enables the institution to collect copies of posts and comments from a pre-compiled list of blogs, store them in an easy-to-manage database, and re-present the contents as a single resource for users.
For these organisations, it is the raw, aggregated content that is the primary target, not the complete website used to host each blog. Their core requirement is the consolidation of content from different origins into a single resource for re-use and re-purposing. ArchivePress enables them to focus on this content, along with the metadata necessary to identify each resource, and can meet their needs better than a ‘traditional’ harvesting approach.
We appreciate that this is but the first step in preservation: the simple act of collecting the content into a database does not mean it has been preserved! But we believe that the tools we will use and the infrastructure we will provide will be conducive to preservation. Just how we’ll do that will be covered in more detail in a subsequent post.
We also recognise that there are all sorts of legal issues that would need to be addressed before an institution implements ArchivePress. Whilst we won’t be providing legal advice per se, we will be exploring these issues during the course of the project. First, however, we will be looking into the subject of user requirements in more detail. This is vital to ensure that we provide the functionality required by institutional and organisational users. More posts about this subject will appear as our work progresses.
You’d think it obvious that my blog should be preserved, though I’m not so sure about yours! According to the poster summarising the fascinating 2007 survey by Carolyn Hank et al.: “The majority of bloggers agreed (36%) or strongly agreed (34.9%) that their own blogs should be preserved.” Five per cent don’t want their blogs preserved at all; nearly a quarter aren’t fussed either way.
Here’s one of the data tables (which I had to retype as HTML – Peter Murray Rust is right about PDFs and data):
Table 4. Preservation perceptions – general
| | Strongly agree or agree | Neither agree or (sic) disagree | Strongly disagree or disagree |
| --- | --- | --- | --- |
| **Should preserve** | | | |
| Personal blog | 70.9% | 23.8% | 5.3% |
| Every blog | 35.8% | 27.9% | 36.3% |
| Every comment | 31.4% | 31.9% | 36.7% |
| All online content | 28.2% | 22.3% | 49.5% |
| **Should not preserve** | | | |
| Some blogs | 44.7% | 27.7% | 27.7% |
| Some comments | 48.4% | 31.3% | 20.2% |
| Some online content | 51.3% | 24.9% | 23.8% |
The overall pattern seems a good vindication of our own project approach, which will progressively move from capturing blog content (posts) to addressing comments and embedded content, reflecting the scale of the bloggers’ own priorities.
It also seems a useful juncture in our project to throw open the question: which blogs should we preserve?
With over 5 million active blogs noted by Technorati, it seems daft to even start to enumerate them, but in our field (libraries, archives, information science) several stand out, and it’s the very nature and importance of these that bolster the case for keeping them. I have in mind in particular Peter Suber’s Open Access News blog, but also blogs such as those of Peter Murray Rust, Brian Kelly, Lorcan Dempsey, Dorothea Salo and Jill Walker Rettberg – all ripe with contemporary accounts and robust views on matters of scholarly communication. But in every case we have cause to wonder: will that information survive? Will that link still work tomorrow?
What blogs (or types of blogs) do you think should be preserved, and why?
I’ve been capturing websites for the JISC since 2004. In recent years many of these project sites have been blogs, and I’ve noted strange behaviours with some of my captures. I’m interested in any method that can improve on the current methodology, and I anticipate that the ArchivePress approach is going to work very well from a preservation standpoint. I think this for three reasons:
- It’s going to result in a much “cleaner” capture of blog content than the remote harvesting method. Up to now I’ve been working with the Heritrix harvesting engine, which is a very powerful robot for copying the content and the folder structure of a web site. Some blogs can defeat Heritrix in short order, particularly if they use tricky scripts or are highly database-driven. WordPress blogs store their content in neat folders, of course, and they have proven more amenable to a Heritrix capture. However, even WordPress blogs can result in my capturing the same content three times over, especially if the crawler is harvesting the “Tags” or “Category” folders. What I assume is happening is that Heritrix is effectively requesting the same content from the server via several different URL paths. With well-established blogs that are full of content, this can lead to bloated gathers. We can solve that problem with judicious filters applied to the harvesting engine. However, as I understand it, ArchivePress is going to capture the content direct from the RSS feed (see the sketch after this list). This may be as good as getting the data direct from the web server itself, in a nice “clean” export.
- The data from the feed is structured. It’s in an XML wrapper, which identifies all the content within a schema of structured data fields. Some of these fields, such as Title and Author, are Dublin Core compliant, which is good news for us archival metadata lovers. For various reasons, structured data is more desirable than the less-structured output of a remotely gathered website. True, with the latter we’re copying the folder structure, and HTML pages could be deemed “structured” thanks to their tags; but those tags are almost exclusively to do with formatting and rendering the content so that it works in a web browser (colours, fonts, headings, etc.).
- The structured XML data is going to be much more flexible and adaptable. Instead of bundling the elaborate website structure and its contents inside a WARC file for preservation purposes, we could pour the data direct into our own local database, configured to match the metadata from the feed. Over time, perhaps our database could be preserved as CSV tables (a method ULCC pioneered with the NDAD service), a format which is robust, reliable and migratable. At any time, a copy of the data could be poured out of the database again and (if desired) rendered as formatted data for a web browser, even replicating the original look and feel if necessary. Yet throughout such processes it retains its integrity as a collection of clean, structured data. (A sketch of this round trip appears below.)
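To make the first two points concrete, here is a minimal sketch of feed-based capture in Python, using the third-party feedparser library. The feed URL is a placeholder, and the Dublin Core mapping is our own illustration – nothing here is prescribed by ArchivePress, which works within WordPress itself.

```python
# A minimal sketch of feed-based capture, assuming Python with the
# third-party 'feedparser' library installed (pip install feedparser).
import feedparser

# Placeholder URL - substitute the newsfeed of a blog you wish to capture.
feed = feedparser.parse("https://example.org/blog/feed/")

for entry in feed.entries:
    # Each post arrives as structured fields rather than rendered HTML;
    # several of them map readily onto Dublin Core elements.
    record = {
        "dc:title": entry.get("title"),
        "dc:creator": entry.get("author"),
        "dc:date": entry.get("published"),
        "dc:identifier": entry.get("link"),
        "content": entry.get("summary"),
    }
    print(record)
```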
We can see how this will lead to improved Submission Information Packages, Archival Information Packages and Dissemination Information Packages within the OAIS framework, meaning that ArchivePress is potentially a very good approach for digital preservation purposes.
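As a hypothetical illustration of that “pour in, pour out” idea, the sketch below extends the example above: parsed entries are stored in a local SQLite database and then exported as a plain CSV table. The schema, table and file names are invented for the example and are not the project’s actual design.

```python
# A hypothetical sketch: store parsed feed entries in SQLite,
# then export the table as CSV - a plain, robust, migratable format.
import csv
import sqlite3

import feedparser

conn = sqlite3.connect("archivepress.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS posts (
           identifier TEXT PRIMARY KEY,  -- the post's permalink
           title      TEXT,
           creator    TEXT,
           date       TEXT,
           content    TEXT
       )"""
)

# Placeholder URL, as in the previous sketch.
feed = feedparser.parse("https://example.org/blog/feed/")
for e in feed.entries:
    conn.execute(
        "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?)",
        (e.get("link"), e.get("title"), e.get("author"),
         e.get("published"), e.get("summary")),
    )
conn.commit()

# Pour the data out again as a CSV table.
with open("posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["identifier", "title", "creator", "date", "content"])
    writer.writerows(conn.execute("SELECT * FROM posts"))
conn.close()
```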
There will be a presentation about the ArchivePress project, its background and aims, as part of the forthcoming JISC, DPC and UK Web Archiving Consortium Workshop: Missing Links: the Enduring Web, July 21st at the British Library. For full information about the event see the Digital Preservation Coalition website.
We held the first ArchivePress team meeting at ULCC on Monday, to review the project plan and objectives.
The plan described in the Project Proposal still seems essentially reasonable and achievable. The project will have three main iterations, each dealing with a different corpus of blogs and with different technical and functional issues.
In Phase One (AP-1), we will simply use FeedWordPress to gather the content from the three blogs of the Digital Curation Centre. This will allow us to examine the results and flag issues for the next phase. Initial guidance on installing and configuring the software will be prepared.
Phases AP-2 and AP-3 will address, respectively, the issues of harvesting comments associated with blog posts, and gathering embedded objects (images, etc.). Both Lincoln University and UKOLN have provisionally agreed that we can harvest their various blog outputs as part of this process.
The starting point of the AP approach is the hypothesis that collecting the content of the newsfeeds from blogs may be sufficient for many likely requirements of blog archiving. This doesn’t necessarily mean that it is a fool-proof or instant solution; the intention of the project is to determine, through practical investigation, how effective this approach is and what its strengths and limitations are.
We have a number of dissemination opportunities available. I am already confirmed as a speaker at The Enduring Web (BL, Tuesday 21st July), and at UKOLN’s IWMW 2009 (University of Essex, Tuesday 28th July). Other imminent opportunities include IWAW 2009 (at ECDL in Corfu, 30th Sept – 1st Oct), iPRES 2009 (San Francisco, Oct 5th – 6th) and IIPC at iPRES 2009 (Oct 7th). A proposal has already been submitted to iPRES. In addition, ULCC hopes to launch its AIDA digital preservation toolkit shortly, with a programme of DP events, and there may be scope to represent AP there.
This blog will be the central source of information about the project, and a place to publish our findings, and discuss what we are doing, how, and why. We hope to follow the successful model of the JISC-PoWR blog and encourage discussion from colleagues in the field, and maybe even some guest posts from eminent digital preservationists.
Maureen is going to focus her attention on user requirements and expectations, including legal, ethical and ownership issues, and the possible use cases, such as academic institutions, thematic collections, or local history projects: she will discuss her thoughts in another post. Ed will assess the relative merits of the AP approach and the web crawler approach of other web archiving endeavours, from the perspectives of both records-management and usability.
I will manage the WordPress configuration and customisation in the early phases, and expect to call on Rory’s help and advice for any advanced PHP development requirements. An environment is being set up on Google Code to support development work in due course.
My next tasks are to prepare the formal plan for our JISC Programme Manager, James Farnhill; and start configuring a WordPress installation for AP-1 – more on that in due course.