Innovations in Reference Management

January 19th, 2010 Richard M. Davis Posted in Events, JISC, JiSC-PoWR | No Comments »

Beacon cited through fog

Who would have thought that reference management could be so interesting? We spent a very informative and enjoyable Thursday in snowy Milton Keynes, at the Innovations in Reference Management (#IRM10) event (part of the OU/JISC TELSTAR project). It was all thoroughly blogged by Owen Stephens, and tweeted by many.

Owen Stephens and Jason Platts of the OU described the outputs of the TELSTAR project, which integrates the OU’s Moodle VLE with RefWorks. This means that students using the VLE can move seamlessly between their reading lists and RefWorks, locating resources, maintaining consistency of style and generating bibliographies easily.

Paul Stainthorp of Lincoln University described some exciting, bleeding-edge uses of Yahoo Pipes to mash up data from RefWorks, OPAC, and Amazon. Arguably even more bleeding-edge was the presentation by Euan Adie from Nature Publishing, who showed us Help Me Igor, a reference manager plugin for Google Wave. Speakers from CiteULike and Mendeley also gave us fascinating insights into their respective social-tinged bibliographic management offerings.

Perhaps unsurprisingly, Kevin and I brought to the table the theme of web preservation. With reference to our work with JISC-PoWR, UKWAC and ArchivePress, we reminded anyone who hadn’t heard our spiel already that there are many important, valuable and eminently citable web resources, notably blogs by academic researchers, that are at risk of disappearing – making references to them virtually useless.

Authors may not be responsible for ensuring their readers can access the resources they reference, but we think they should at least give them a fighting chance of doing so! We therefore proposed that students and researchers should be encouraged to locate and cite copies of web resources in stable web archives (such as the UK Web Archive) rather than “in the wild”.

We also discussed the idea that persistent collections of web resources could be created at the institutional level, whether as an open archive of blog posts by a university’s researchers, or as a closed repository where researchers can store copies of the web resources they cite.

One of the strong themes that emerged in discussion was the need for information literacy/digital skills training at all levels, both to address current tools and trends in reference management and to re-assert the purpose, value and nature of citation in online digital environments.

Another interesting suggestion was that reference management tools are becoming a natural part of the environment, just as email has: is provision of specialised applications by universities an “aberration”?

I’m inclined to think not. After all, it was clear from the workshop that there’s still a need to support ongoing study and research effectively, and scope to develop and validate new approaches. Microsoft Word may now include reference management features, but that doesn’t obviate the need to educate people in how to use them effectively, and why.

We’re very grateful to Owen for including us in his programme: this is a fascinating area, where e-learning, libraries, preservation and publishing collide, and I’m sure we haven’t heard the last of it.

A repository for pi(es)

January 7th, 2010 Kevin Ashley Posted in General, Technical | 4 Comments »

As you may have read recently, Fabrice Bellard has announced the computation of π to almost 2.7 trillion decimal places using a faster algorithm that allows desktop technology to be used, rather than the supercomputers that are usually used to break this particular record. Bellard is an extremely talented programmer who has made a useful contribution to one area of digital preservation with his emulation and virtualisation system QEMU. But it’s a comment by Les Carr that set me thinking about costs, research data and repositories.

“Would you want to put that in your repository?” asked Les. And this is a particularly extreme example where we can do some calculations to give us a fairly good answer. Scientific data centres and the researchers that use them have been considering this question for many years, and one way of looking at it is to see if the cost of recomputation exceeds the cost of storage over a particular time period. We’re assuming here that the initial question – is this worth keeping at all – has been answered at least vaguely positively.

Let’s look first at the cost of recomputation. Fabrice says the equipment used for this task cost no more than €2000. If we assume that it has a life of 3 years, that gives us a cost per day of €1.83. I’m avoiding the usual accounting practice of allowing for inflation, or lost interest on capital, in calculating the true depreciation value of the asset – there are a number of different schemes and they all give similar results. I’ve just divided the capital cost by the number of days of use we’ll get. But computers use electricity, and that costs money as well. Let’s assume this is a power-hungry beast that draws 400W and that power costs us 13.5¢ per kWh (which is what my domestic tariff is if we assume a euro/sterling rate of €1.10 = £1 and 5% VAT). At 9.6 kWh a day, that adds €1.30/day to the cost of running the system, for a total cost of €3.13/day.

Fabrice’s announcement says that it took 131 days of system time to calculate and verify his results, which gives a computational cost of €410.03 – which I’ll round to €410 since I’ve only been using 3 significant figures so far in the computations, and because there’s a lot of hand-waving involved in lots of these figures. So, we know how much it would take to recompute this result given the software, machine and instructions. (And the computational cost is likely to decline over time in the short term.)
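For anyone who wants to rerun this arithmetic with their own figures, here is a minimal Python sketch. The constants are the ones quoted above; the function and variable names are illustrative only. Because it does not round the daily figures first, it comes out at roughly €409 rather than €410, which is well inside the hand-waving margin.

```python
# Back-of-the-envelope recomputation cost, using the figures quoted in the post.
HARDWARE_COST_EUR = 2000.0     # upper bound on the equipment cost
HARDWARE_LIFE_DAYS = 3 * 365   # assumed three-year life, no inflation or interest
POWER_DRAW_KW = 0.4            # assumed 400 W draw
PRICE_PER_KWH_EUR = 0.135      # 13.5 cents per kWh
RUN_DAYS = 131                 # system time to calculate and verify


def daily_cost_eur() -> float:
    """Straight-line depreciation plus electricity for one day of running."""
    depreciation = HARDWARE_COST_EUR / HARDWARE_LIFE_DAYS
    electricity = POWER_DRAW_KW * 24 * PRICE_PER_KWH_EUR
    return depreciation + electricity


def recomputation_cost_eur() -> float:
    """Cost of running the machine for the whole computation."""
    return daily_cost_eur() * RUN_DAYS


if __name__ == "__main__":
    print(f"Cost per day: EUR {daily_cost_eur():.2f}")
    print(f"Recomputation cost: EUR {recomputation_cost_eur():.0f}")
```

Changing the assumed hardware life or power draw shows quickly how sensitive the total is to each guess.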

The answer needs a Terabyte of storage. What will it cost to keep that in a repository? That’s a slightly more difficult question to answer, but we can give a number of figures that provide upper and lower bounds. SDSC quote $390/Tbyte/year for archival tape storage (dual copies), excluding setup costs and assuming no retrieval. Moore et al quote $500/year as a raw figure, obtained by dividing total system costs by usable storage within it. At current rates of $1 = €0.67, that gives us a cost of €261/year or €335/year. SDSC are likely to be at the cheap end of the scale. ULCC’s costs, given our lower total volumes, would be closer to €1500/year for a similar service (dual archival tape copies on separate sites) although that does include retrieval costs. Amazon’s AWS would be about €100/year for a single copy. You would want two copies, so it’s twice that, and the cost of transferring the data in would be about 25% more than the storage cost. Since I haven’t factored in ingest costs for any of the other models, I’ll ignore it for AWS as well. (And yes, AWS isn’t a repository, and there’s no metadata, and… This is a back-of-the-envelope calculation. It’s a small envelope.)

Which means, at a very rough level and ignoring many pertinent factors, that after about two years of storage in the repository, we would have been better off recalculating the data rather than storing it. There are a lot of assumptions hidden there, however. For one, we’re assuming that this data will rarely, if ever, be required. If many people want it, the recalculation cost rapidly becomes prohibitive (and so does the 131 days they have to wait for their request to be satisfied!)
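As a rough check on that two-year figure, the break-even point for each storage option can be sketched in a few lines of Python. The annual costs and the exchange rate are the ones quoted above, the €410 recomputation cost is taken as given, and the names are illustrative only. Like the rest of this calculation, it ignores ingest, retrieval and any fall in prices over time.

```python
# Years of storage after which recomputing would have been cheaper.
RECOMPUTE_COST_EUR = 410.0   # one-off recomputation cost from the post
USD_TO_EUR = 0.67            # exchange rate quoted in the post

# Annual cost of keeping one terabyte, in euros, per option discussed above.
ANNUAL_STORAGE_EUR = {
    "SDSC archival tape (dual copies)": 390 * USD_TO_EUR,
    "Moore et al raw figure": 500 * USD_TO_EUR,
    "ULCC dual-site tape": 1500.0,
    "AWS (two copies)": 2 * 100.0,
}


def break_even_years(annual_cost_eur: float) -> float:
    """Years after which cumulative storage cost exceeds one recomputation."""
    return RECOMPUTE_COST_EUR / annual_cost_eur


if __name__ == "__main__":
    for name, annual in ANNUAL_STORAGE_EUR.items():
        print(f"{name}: {break_even_years(annual):.1f} years")
```

On these numbers only the cheapest option, two copies on AWS, stretches the break-even point to about two years; the others reach it sooner.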

One of the other problems is more subtle. I said that, in the short term, recalculation costs would be likely to fall as computational power becomes cheaper. The energy costs involved will rise, of course, but there’s still a significant downward trend. But after a sufficient period of time, it becomes non-trivial to reconstruct the software and the environment it needs in order to allow the computation to happen. Imagine trying to recalculate something now where the original software is a PL/I program designed to run under OS/360. It’s not impossible by any means, but the cost involved and the expertise required are non-trivial. At least with our example we won’t have any doubts about whether the right answer has been produced – the computation of π produces an exact, if never-ending, answer. Most scientific software doesn’t do this, and the exact answers produced can depend on the compiler, the floating-point hardware, mathematical libraries and the operating system. Over time, it becomes harder and harder to recreate these faithfully, and we often don’t have any means of checking whether or not we have succeeded. (Keeping the original outputs would help in this, of course, but that’s exactly what we’re trying to avoid.) That’s part of the problem that Brian Matthews and his colleagues examine in the SigSoft project and there’s still a great deal of work to be done there.

So have we answered Les’s question? My feeling is that in this case we have – there’s a fair amount of evidence that suggests that keeping this particular data set isn’t cost-effective. But in general, the question is far harder to answer. Yet we must strive harder for more general answers, as the cost of not doing so is not trivial. Even if money did grow on trees, it still wouldn’t be free and at present we need to be very careful how we use it.

Our new EPrints repository (is not just for Christmas)

December 21st, 2009 Richard M. Davis Posted in Repositories Service | No Comments »

As regular readers will know, we have been working with repositories for quite a few years now. In 2005 we began working with the School of Advanced Study on their requirements for an Institutional Repository, and since then we have installed, configured and maintained several repositories, including some highly customised, specialist systems.

In most cases we have used EPrints. This is partly because we are familiar with the stuff it is built with (Perl, MySQL and XML have been at the heart of the NDAD dataset repository we have operated for The National Archives since 1997). But also because we like the ever-expanding set of features and options EPrints provides. I’ve watched its capabilities grow, thanks to the seemingly limitless energy and initiative of the EPrints team at Southampton. (For an interesting, user’s-eye perspective on the relative merits of DSpace and EPrints, I recommend reading some of the posts tagged DSpace in Dorothea Salo’s Caveat Lector blog).

It’s three years almost to the day since Rory and I attended the pre-launch briefing on EPrints 3 and came away convinced that, with its AJAX UI and evolving plugin architecture, EPrints 3 was likely to play a big part in our future plans.

And hardly a day’s gone by since, when we haven’t had some EPrints-related work on our plate. In 2007 we began developing Linnean Online for the Linnean Society, and PRIMO for the Institute of Musical Research. Out of this, and the snowballing Web 2.0 zeitgeist, we also honed the idea that became SNEEP (Social Networking Extensions for EPrints), one of the first JISC Rapid Innovation projects. Most recently, we’ve scaled new heights of EPrints customisation with the SOAS Fürer-Haimendorf collection, with its user-defined albums and searching enhancements, all wrapped up in 9Web’s impressive graphic design.

We’ve tweaked config files and hacked templates and for the most part enjoyed doing stuff with EPrints. (All credit is due to Rory and Ben, by the way. My role is chiefly to say “We could make it do that couldn’t we?” And, lo and behold, usually “we” can.)

Over the years I’ve also talked to many repository managers, and potential repository managers, about their requirements and expectations. I’ve spoken and networked at DSpace User Groups, Open Repositories conferences and many excellent events organised by the JISC, particularly the Repositories Support Project – and I’ve met a lot of smart and insightful people in the repo biz. Some of it must have rubbed off – I think my own understanding of what’s needed, and what’s feasible, has grown considerably.

But what we’ve never done is run our own repository, and experienced these things day-to-day for ourselves. As Atticus Finch said in To Kill A Mockingbird,

You never really understand a person until you consider things from his point of view . . . until you climb into his skin and walk around in it.

That’s why, in the gaps between everything else going on round here, Annemarie has been putting together the ULCC Publications Archive, which I hope will become a canonical home for our published outputs. It’s not big and it’s not clever, it’s certainly not perfect, but it is something we can use to improve our understanding of what it means to run a repository. We will also no doubt use it to explore some of the tools and techniques emanating from the EPrints developer community.

And now I can really start to empathise with the repository managers I know: their agony – clarifying copyright and licenses, ambiguous form fields, disappearing diacritics – and their ecstasy – a well-formed subject tree or citation, a successful search. I’ve also an insight into the needs of authors/submitters, since several articles are mine – and I naturally want to get the citations looking just right, so that I can embed some of the nice feeds EPrints provides into my blogs, e-portfolios and who knows what other mashups. Self-interest is a great motivator, as many Open Access advocates have observed: before long I’m sure I’ll be wanting download statistics, author profiles, and most of the other things I described in 1001 Things To Do With A Live Repository.

For me it’s an invaluable experience – no less so than when, a couple of years ago, I became an actual user of a VLE, through my MSc course at Edinburgh. There’s a world of difference between being a developer or implementer of this kind of online system – thinking your job’s done when it seems to be up-and-running – and being the poor end-user who doesn’t care about PHP, JSP, Maven, Apache, etc, but just wants to get something done.

Among the things you’ll find in pubs.ulcc.ac.uk are: papers and articles from events we have contributed to over the years, such as iPRES, Open Repositories, and DLM-Forum; published reports, like last year’s JISC-PoWR web preservation report; presentations and posters from other events, mostly in the field of e-learning or digital archives; and even the swish product sheets produced by our ace marketing department, Tim and Frank!

As well as our most recent UK activities, we’ve also unearthed some other curios, such as Patricia’s article for the Catalan Archivists’ Forum, in Catalan, and a piece by Kevin in La Vanguardia, in Spanish. Also of interest is a brief account of ULCC’s first 30 years, in the form of a brochure for a small exhibition that was held at Senate House Library in 1999.

No doubt as we delve through our own digital archives we’ll find more goodies. Having a repository is an excellent opportunity to locate and appraise these things, and share those that seem interesting and informative enough. No less than this blog, and our E-learning colleagues’ El Blog, it should be an attractive and effective shop-window – just like any good Institutional Repository.

File formats…or data streams?

December 3rd, 2009 Ed Pinsent Posted in DPC, Events, Reports, Technical | 4 Comments »

On 1st December Malcolm Todd of The National Archives gave a good account of the work he’s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.

My naïve thinking in the area has been characterised by the assumption that the process is rather static or linear, and that the problem we’re facing is broadly the same every time: migrate data from a format that’s about to become obsolete or unsupported to another format that’s stable, supported, and open. MS Word document to PDF or PDF/A…now that, I can understand!

In fact, I learned at least two ways of thinking about formats that hadn’t occurred to me before. One simple one is costs; some formats can cost more to preserve than others. This can be calculated in terms of storage costs, multiplied over time, and the costs associated with migrations to new versions of that format.

DPC AGM – and thoughts on preserving research data

November 30th, 2009 Kevin Ashley Posted in DPC, Reports | No Comments »

Last Monday (2009-11-23) saw DPC members travel to Edinburgh for a board meeting and for the annual general meeting of the company. We elected a new chair – Richard Ovenden – and offered our thanks to Bruno Longmore for the effective leadership he has offered as acting chair following the departure of Ronald Milne for New Zealand earlier this year.

We had a brief preview of the new DPC website, which promises to be a much more effective mechanism for the membership to engage with each other and the wider world, and confirmed recommendations emerging from a planning day earlier in November which should keep the DPC busy (and financially secure) for a few years to come.

Finally, we had an entertaining and thought-provoking talk from Professor Michael Anderson. Professor Anderson touched on many issues relating to digital preservation from his research career, past and present. He mourned the loss of Scottish census microdata from 1951 and 1961, painstakingly copied to magnetic tape from round-holed punch cards for 1951 and standard cards for 1961, which had to be destroyed when ONS realised the potential for inadvertent disclosure of personal information.
