Friday 16 May 2008

Costing long-term data preservation

One of the great problems with long-term data archiving is sustainability, i.e. the cost of maintaining it into the future: the technology is new, and no one really knows what that cost will be.

Jisc have recently published a report that attempts to quantify the costs, and certainly the figures look fair: it suggests that around 2.5 to 4 FTE staff are needed to keep a service going, and that the hardware and storage costs are not excessive.

The staffing level is probably about right. It's around the same number of staff that used to run the old UK Mirror Service, which was also a big repository and a database to tell you where everything was (sound familiar?).

The hardware costs are probably wrong however, as they've forgotten about resilience.

Most institutional repositories are exactly that - a single instance at a single institution. While there are undoubtedly steps taken to maintain the integrity of the data, there's no real attempt at resilience. And resilience is important. Even if your servers are clustered, if they're in the same building and the building burns down, you are basically stuffed. And remember that with a large multi-terabyte repository, conventional backup to tape isn't that attractive: most of the data is unchanging and only grows in quantity, which makes the backups increasingly slow.
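To put a rough number on "increasingly slow", here is a minimal back-of-envelope sketch. The function name, the repository sizes and the 100 MB/s sustained tape throughput are all assumptions of mine for illustration, not figures from the Jisc report or from any real service.

```python
def full_backup_hours(repo_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to stream a whole repository to tape once.

    Assumes a single drive running at a constant sustained rate and
    decimal units (1 TB = 1,000,000 MB). Both are simplifications.
    """
    if repo_tb < 0 or throughput_mb_s <= 0:
        raise ValueError("size must be >= 0 and throughput must be > 0")
    seconds = repo_tb * 1_000_000 / throughput_mb_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical repository sizes as the archive grows over the years.
    for size_tb in (5, 10, 20, 40):
        hours = full_backup_hours(size_tb, throughput_mb_s=100)
        print(f"{size_tb:>3} TB -> {hours:6.1f} hours per full backup")
```

The point is only that the time for a full backup scales linearly with the size of the repository, even though almost none of the data has changed since the last run.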

Boxes such as Sun's Honeycomb get around this by storing multiple copies of the data across an array of disks - think BitTorrent in a box - but because everything still sits in one box in one place, that immediately throws away resilience.

At the old UK Mirror Service we had one set of servers at UKC and another at Lancaster, and replicated between them. We never backed up the data, reckoning it was unlikely that both sets of servers would be lost at the same time, and that if they were, we could pull all the data again. We later added a third node at Reading, which increased the resilience even more.
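The reasoning behind "unlikely that both would be lost at the same time" can be sketched with some simple arithmetic. This assumes that site failures are independent (which a shared network or shared software bug would break) and uses a made-up 1% chance of losing any one site in a given period; neither the function name nor the number comes from the actual service.

```python
def prob_all_sites_lost(p_site_loss: float, sites: int) -> float:
    """Chance that every replica site is lost in the same period.

    Assumes each site fails independently with the same probability,
    so the combined probability is simply p ** n.
    """
    if not 0.0 <= p_site_loss <= 1.0:
        raise ValueError("p_site_loss must be between 0 and 1")
    if sites < 1:
        raise ValueError("need at least one site")
    return p_site_loss ** sites


if __name__ == "__main__":
    # Hypothetical 1% chance of losing a given site in the period.
    for n in (1, 2, 3):
        print(f"{n} site(s): {prob_all_sites_lost(0.01, n):.6%}")
```

Under those assumptions each extra site cuts the chance of total loss by another factor of a hundred, which is why adding a third node helps so much for so little.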

But we always had funding rows about doing this: duplication of hardware (even if a lot of it was pretty cheap commodity hardware) and duplication of effort (the need to keep people on each site to deal with local hardware failures). Yet we always had 99% uptime.

Doing it that way provided good disaster recovery and, simultaneously, good service. If institutions were to peer, i.e. I put a box in your machine room and you put a box in mine, the extra staffing costs are probably pretty minimal, hardware is cheap, and you're not paying for backup. It's probably not cost neutral, but it's probably close...
