Thursday 3 January 2008

Whoops, Microsoft have done it again ...

Way back in June, I blogged about the problems that Microsoft's DocX format was causing for scientific journals and document exchange. [This post resulted in an email exchange with a journalist at the WSJ and might have ended up as a source for a story, but the story didn't eventuate]

Well, Microsoft have done it again. According to Wired, the latest update for Office 2003 prevents access to a lot of pre-Office 97 documents.

Now a lot of people won't give a stuff, but if you're into digital archiving you may do.

There's an unresolved argument in the digital archiving world between preserving legacy files exactly as generated, and converting them to a standard format and verifying that the content remains the same. The core of the dispute is that parsers and conversion utilities can have bugs, especially as regards layouts, fonts, etc., and a convert-only policy can result in implicit meaning being lost, e.g. in poetry. Pro-conversionists say that on the whole only the content is important and you can handle special cases by hand.

The pragmatic solution is to do both. That maintains access and keeps the original, just in case a parsing problem is uncovered later down the track.
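As a rough illustration of what "doing both" might look like in practice, here is a minimal Python sketch. It assumes the conversion itself has already been done by some other tool; the function names (`sha256_of`, `archive_both`), the `original_`/`converted_` filename prefixes and the `manifest.json` layout are all made up for this example rather than taken from any archiving standard. It stores the untouched original alongside the converted copy and records a checksum for each, so that if a parsing problem turns up later you can prove the original is still bit-for-bit what was deposited.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_both(original: Path, converted: Path, archive_dir: Path) -> dict:
    """Keep the original and its converted copy together, tied by a manifest.

    Hypothetical layout: files are copied with 'original_' / 'converted_'
    prefixes and a manifest.json records a checksum for each.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    kept_original = archive_dir / ("original_" + original.name)
    kept_converted = archive_dir / ("converted_" + converted.name)

    # copy2 also preserves timestamps where the platform allows it
    shutil.copy2(original, kept_original)
    shutil.copy2(converted, kept_converted)

    manifest = {
        "original": {
            "file": kept_original.name,
            "sha256": sha256_of(kept_original),
        },
        "converted": {
            "file": kept_converted.name,
            "sha256": sha256_of(kept_converted),
        },
    }
    (archive_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Note that the checksums only show that neither file has changed since it was archived. They say nothing about whether the conversion preserved the layout or meaning, which is the hard part.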

Now, while it's true that you can use Open Office and related products to read these legacy files, Microsoft's move lends weight to the pro-conversion camp, purely because almost everyone uses Office. Because people using Office can no longer read these formats, there's less pressure on the Open Office crew to maintain these filters. And so, little by little, a format dies, and if we want to maintain access to content we have to convert, and perhaps we lose a little bit of implicit meaning.

And while we could convert these documents to PDF/A or ODF to avoid the risk of this happening again, verifying every document to ensure the conversion was accurate is such a big job that it's not going to happen. So a little bit of the human experience dies.

And while you can argue that it's a plausible guess that 99.5% of these documents are of no significance whatsoever, consider the following story:

From the time of Augustus to the reforms and changes of the later Roman Empire - a period of around three hundred years - Roman soldiers were paid quarterly. And when they were paid they were given an individual statement of account covering deductions, payments for food, lost equipment and so on. During most of this time the army was around 40,000 men, so we can say that there must have been 4 × 40,000 × 300 of these statements produced, i.e. something around forty-eight million. So many that any single statement was not of any significance whatsoever. So they got thrown away, used to wrap equipment, used in toilets and so on, with the result that there are only around three partial copies left, all of which are of tremendous significance.

And that's the problem with archiving. It's impossible to tell in advance what's historically significant, yet saving everything is not an option, which means we have to try and get it right with what we do save. Microsoft's change just makes it a little more difficult.

1 comment:

dgm said...

There's a good post on this topic over at the file formats blog