With lots of other commitments pressing, it’s been another quiet time on the ArchivePress project, but now we are ready to enter a final phase of research and reporting into the effects of our activities.
While we’ve been away we have been running a demo instance of AP on a public web hosting service with encouraging results, which I’ll report on soon. It was always intended that the plugins should run successfully on a standard public installation of WordPress. Unfortunately the demo has had to be removed at short notice because of the one significant and unavoidable downside of unattended web harvesting: the ever-growing volume of content, and of scheduled tasks, as each blog post harvested by the repository also adds another comments feed that needs checking. Budget shared webspace hosts tend to get a bit twitchy if your PHP hogs the processor.
Fortunately Emanuele included with the plugin the option to use the server’s native crontab feature, rather than the PHP pseudo-cron which is built into WordPress. We now have a final demonstrator server ready where we can better control and monitor these activities. We will be reporting back on that in September and October.
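To give a feel for why the scheduled checks pile up, here is a rough toy model in Python. It is only a sketch: the function name, the number of blog feeds and the posting rate are all made-up assumptions, not figures from our demo instance, and it ignores the cost of each individual check.

```python
def feeds_to_check(blog_feeds, posts_per_feed_per_cycle, cycles):
    """Return how many feeds must be polled at each harvest cycle.

    Hypothetical model: every harvested post adds one comments feed,
    and no feed is ever retired.
    """
    counts = []
    comment_feeds = 0
    for _ in range(cycles):
        # Poll every blog feed plus every comments feed collected so far.
        counts.append(blog_feeds + comment_feeds)
        # Each newly harvested post contributes one more comments feed.
        comment_feeds += blog_feeds * posts_per_feed_per_cycle
    return counts


# Assumed figures: 5 blogs, each publishing 2 posts between harvests.
print(feeds_to_check(blog_feeds=5, posts_per_feed_per_cycle=2, cycles=6))
# → [5, 15, 25, 35, 45, 55]
```

Even with these modest numbers the list of feeds to poll only ever goes up, which is the kind of load a shared host notices sooner or later.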
We’ve had several enquiries about using ArchivePress. In most cases, however, these have related to retrospective blog archiving requirements, which are best dealt with by standard web harvesting approaches. ArchivePress has some capabilities to play “catch up”, but its main value is as a tool for dynamically harvesting active blog content. If we have a valuable active blog with legacy content, then using the ArchivePress import feature is worthwhile, but for a purely retrospective capture, I’m not sure it is worth it, and I’d still recommend capturing the content with httrack or wget.
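For anyone wanting to try the retrospective route, a one-off wget capture might look something like the sketch below. The helper name, the example URL and the output directory are hypothetical, and the particular choice of flags is my assumption about a sensible mirror, not a tested recipe for any specific blog.

```python
import shlex


def build_wget_mirror_command(blog_url, output_dir="blog-mirror", wait_seconds=1):
    """Build (but do not run) a wget command for a one-off blog capture."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS and scripts
        "--convert-links",     # rewrite links so the copy browses offline
        "--adjust-extension",  # save pages with an .html extension
        "--no-parent",         # stay below the starting URL
        f"--wait={wait_seconds}",            # pause between requests
        f"--directory-prefix={output_dir}",  # where the copy is written
        blog_url,
    ]


# Hypothetical blog address, used for illustration only.
command = build_wget_mirror_command("https://blog.example.org/")
print(shlex.join(command))
```

The function only assembles the command, so it can be inspected before anything is downloaded; passing the list to `subprocess.run(command, check=True)` would start the capture.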
As well as monitoring the final pilot system, we hope to look at some other issues that the project has raised about blog archiving, and how we can address them within ArchivePress. These include:
- Use cases: what have we learned about blog archiving use cases and attitudes to blog preservation during the last year or so? Have expectations of blog archiving changed since the project’s inception?
- Embedding semantic metadata: the ArchivePress templates offer many possibilities for enriching and normalising blog metadata
- Persistent identifiers: what is the value and what are the possibilities of implementing persistent identifiers for posts and comments in an AP archive?
- SPARQL endpoints: can we usefully add a SPARQL interface to an AP archive?
- Cloud-hosting: as other repository applications move into the cloud, is this a direction that AP could support?
All this and more coming soon! (Possibly a new theme too, I never liked this one!)
