You might already be familiar with blog archiving and with the collections of blogs that web archiving projects are putting together: the collection of blogs from the Wellcome Library, for example, or from the British Library, or the Library of Congress BLawg Archive. The common and familiar scenario is that an organisation runs a web crawler such as HTTrack or Heritrix to capture copies of content – in this case, blogs – and then provides access to each website (blog) as an integral whole. This is perfectly acceptable if the requirement is to present the site as an integral whole. ArchivePress, on the other hand, is based on the premise that organisations may not have this requirement: they can have different reasons for wishing to capture copies of blog content, and different intentions for managing and using that content once they have it. For instance:
Scenario 1: A university has given its academics free rein to blog on whatever software platforms they choose. It later realises that this output is of academic and record-keeping value, but that it has no record of it, and that any attempt to force staff to switch to an internally hosted service would probably be badly received. ArchivePress is installed and configured to collect copies of all blog posts from a list of pre-selected blogs; the contents are consolidated into a single database, which can then be re-presented to users via a single interface branded with the university's logo.
Scenario 2: A local archiving institution wishes to collect blog content around a particular local theme (for example, about a local author or event) and aggregate the contents into a single, easily searchable resource for access via its website. It has neither the technical resources to manage a harvest-based web archiving approach, nor the finances to invest in a commercial web archiving service. ArchivePress can be implemented and managed with very little technical knowledge, and enables the institution to collect copies of posts and comments from a pre-compiled list of blogs, store them in an easy-to-manage database, and re-present the contents as a single resource for users.
For these organisations, the raw, aggregated content is the primary target, not the complete website used to host each blog. Their core requirement is the consolidation of content from different origins into a single resource for re-use and re-purposing. ArchivePress enables them to focus on this content, along with the metadata necessary to identify each resource, and can meet their needs better than a 'traditional' harvesting approach.
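To make the idea of consolidation concrete, here is a minimal sketch (not ArchivePress's actual code) of how posts from several externally hosted blogs might be gathered into one database. It assumes the blogs expose standard RSS 2.0 feeds; the blog names, post titles and URLs below are invented examples, and in a real deployment the feeds would be fetched over HTTP rather than embedded as strings:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Two tiny RSS 2.0 feeds standing in for externally hosted blogs.
# All names and URLs here are hypothetical.
FEEDS = {
    "Dr Smith's research blog": """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Dr Smith's research blog</title>
  <item><title>Fieldwork notes</title>
        <link>http://example.org/smith/1</link>
        <pubDate>Mon, 01 Jun 2009 10:00:00 GMT</pubDate></item>
  <item><title>Conference report</title>
        <link>http://example.org/smith/2</link>
        <pubDate>Tue, 09 Jun 2009 09:30:00 GMT</pubDate></item>
</channel></rss>""",
    "Local history blog": """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Local history blog</title>
  <item><title>The old mill</title>
        <link>http://example.org/history/1</link>
        <pubDate>Wed, 03 Jun 2009 14:00:00 GMT</pubDate></item>
</channel></rss>""",
}

def extract_posts(feed_xml, blog_name):
    """Yield (blog, title, link, date) tuples from one RSS feed."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        yield (blog_name,
               item.findtext("title"),
               item.findtext("link"),
               item.findtext("pubDate"))

# Consolidate every blog's posts into a single database table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (blog TEXT, title TEXT, link TEXT, date TEXT)")
for name, xml_text in FEEDS.items():
    db.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)",
                   extract_posts(xml_text, name))

# The aggregated store can now be queried, searched and
# re-presented as one resource, regardless of where each
# post originally lived.
for row in db.execute("SELECT blog, title FROM posts ORDER BY blog"):
    print(row)
```

The point of the sketch is simply that the unit of collection is the post (plus its identifying metadata), not the page: once posts from many origins sit in one table, a single branded interface or search facility can be layered over them.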
We appreciate that this is only the first step in preservation: the simple act of collecting the content into a database does not mean it has been preserved! But we believe that the tools we will use and the infrastructure we will provide will be conducive to preservation. Just how we'll do that will be covered in more detail in a subsequent post.
We also recognise that there are all sorts of legal issues that would need to be addressed before an institution implements ArchivePress. Whilst we won't be providing legal advice per se, we will be exploring these issues during the course of the project. First, however, we will be looking into the subject of user requirements in more detail. This is vital to ensure that we provide the functionality required by institutional and organisational users. More posts on this subject will appear as our work progresses.