Tuesday 30 September 2008

Putting some medieval digitisation strands together

Over the past few weeks I've posted various links and updates about digitising medieval manuscripts, character recognition, and then using the material to build up a corpus for textual analysis.

Now with the Stavanger Medieval English Grammar project we see how such a solution would work. Crucially we need to go back and digitise the sources - later editors 'smoothed' the text in places and regularised transcriptions, meaning that sources like Project Gutenberg simply don't work. The sources are not actually transcriptions of a single document - medieval books are more like open source projects with the same basic text but some bits added in or taken out. Think ubuntu, kubuntu, xubuntu - all basically the same but different utilities and window managers. So we have to identify common passages for analysis. Not intellectually difficult, but it does take longer.

The other source we have is legal texts, such as the records of the Scottish Parliament where transcriptions are likely to be more accurate - if rather less interesting to read. Of course accuracy is not necessarily a help here as it's the mistakes that are interesting, not the fidelity of the copy, but as they contain a lot of stock bits of boilerplate we can probably see the evolution of grammatical changes.

The other, unanswered question is how good auto recognition of medieval handwriting is. Clerks, who produced manuscripts as an act of devotion, tended to have nice text. Commonplace books and legal records less so, sometimes quite a lot less so ...

Saturday 20 September 2008

Adam of Usk and the Espresso book machine ...

Yesterday I twitted a link about an Espresso print on demand machine being installed at the University of Michigan. By chance, yesterday's Australian also had an article about how A&R, Australia's largest bookseller, was deploying them at a number of their stores. So far so interesting.

As I've said elsewhere many times, print on demand is the ideal solution for rare and obscure books and out of print titles. Basically all you need is a computer, a digitised version of the book, and a printer with an auto binder. And the technology to do this is cheap, when a basic laser printer costs a couple of hundred dollars, and a rather more meaty one under a thousand.

And there's a lot of material out there. Project Gutenberg has been happily digitising old out of copyright texts, and now that many of the texts have markup they can be processed easily and reformatted for republication.

And we see that publishers have begun to use this to exploit their backlists as in FaberFind. And certainly when I helped put together a similar project for a scholarly publisher, that seemed to be the way to go. No warehousing, no startup costs for printing, just churn them out when required, and only digitise and work on the original text when requested. That way while the first copy was expensive in terms of time and effort, any subsequent copy was free other than the cost of paper and toner.

Not e-texts?

Well, once you've a digitised marked up text it's relatively easy to convert it into any of the formats commonly used by book readers. Texts are hard to read and annotate on the screen, and I would assume so on a Kindle or Sony Book reader - I'm hypothesising here, I've never seen either of these devices - they're not available in Australia but clearly they are supposed to be the ipods of the book world. Anyway, while they may work for fiction or any other book read from beginning to end, I suspect that they've not quite got the utility of a book. And you probably can't read in the bath :-). An e-text reader that allows you to export the text to an sd-card and then take it to a print and bind machine for backup or reference purposes might hit the sweet spot for scholarly work. That way you could have a paper reference copy and a portable version to carry around.

And Adam of Usk ?

Adam of Usk was a late fourteenth/early fifteenth century cleric, lawyer, historian and chronicler. If he'd been alive today he'd have been a blogger. He wrote a long, rambling, gossipy chronicle - part diary, part history - that covers a whole range of key events, from the visit of the Emperor Manuel II of Byzantium to Henry IV of England to drum up support for Byzantium's war against the Turks, and Adam's time serving on the legal commission to come up with justifications for the forced deposition of Richard II, to the events of the Welsh Wars of Owain Glyndwr and the Peasants' Revolt.

A book that you'd think there'd be a Penguin classic edition of. Nope, you're wrong. There's an 1876 translation (Adam wrote in Latin) and a newer 1997 translation published at the cost of a couple of hundred bucks a copy - purely because this sort of book is probably only really of interest to scholars and the costs of short run conventional publishing are horrendous and self defeating.

Why there's no readily available edition is just one of these mysteries. It's gossipy and rambling, but then the Alexiad is not exactly a model of conciseness and tight structure. But basically there's no readily available edition to dip in and out of. In short it's the ideal example of a text tailor made for print on demand publishing. The thirty bucks that a print on demand copy would cost is a damn sight cheaper than even a tatty second hand copy of the 1876 edition (cheapest I found was GBP45 - say a hundred bucks).

Friday 12 September 2008

Peek email reader ...

Yesterday I twitted a link to a report about the Peek email reader.

My immediate reaction was that I want one. My second one was 'what a simple device!'.

Working backwards, its fundamental constraint is its reliance on GPRS, which is slow and typically priced in terms of the amount of data transferred. So to be economic (or to make a profit) you don't want to stream too much data. However the speed's not unlike a dialup modem, and while complex big documents were slow to transfer over dialup, simple character mode email programs such as pine worked just fine. Pine or indeed mutt are good examples - a relatively simple interface and, coupled with a basic editor such as pico or nano, very useful.

And you don't need complex formatting to send emails. It's about text after all. 

But increasingly email is sent in rich formats these days, usually html based but not always. This adds to the payload, the amount of data transferred, but not the content - what's said. Emails are now usually multi part, including the mime encoded attachments, the sexy version of the email and hopefully a plain text version.

So you interpose a server that polls the user's mail servers using a fetchmail like process, decomposes each message into its component parts, and throws away all the non plain text parts and/or strips out any extraneous formatting. (Yes, I've seen examples of messages that would die if this was done to them, but not that often.)

Suddenly you've got a lightweight message to forward on.
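As a rough illustration of that stripping step, here is a minimal sketch using Python's standard email package. It is only my guess at the general shape of such a filter, not how Peek actually does it; the function names plain_text_only and lightweight_copy are made up for the example, and fetching the mail in the first place is left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def plain_text_only(raw: bytes) -> str:
    """Return just the plain text body of a raw message, or '' if none."""
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body() looks through the multipart structure for a text/plain
    # part and ignores attachments and the html alternative.
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        # html-only mail: a real service would have to strip the tags here
        return ""
    return body.get_content().strip()


def lightweight_copy(raw: bytes) -> bytes:
    """Build a new single-part message carrying only headers and plain text."""
    original = message_from_bytes(raw, policy=policy.default)
    slim = EmailMessage()
    for header in ("From", "To", "Subject", "Date"):
        if original[header] is not None:
            slim[header] = str(original[header])
    slim.set_content(plain_text_only(raw))
    return slim.as_bytes()
```

The html-only case is the one that "would die" above: with no plain text part there is nothing left to forward unless the server also converts the html.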

As I say, very simple - breathtakingly so. You could also imagine doing a similar service based on qemu and dsl on a usb key and booting into pine.

In fact, many years ago I built something very similar. 

We had a pile of old computers with limited memory, network cards and no hard disks. And we had boot roms which did pxe style requests to allow you to transfer down and execute a 1.44MB floppy image. What I did was put an operating system on it, a tcp stack and a locked down version of kermit (actually ms-kermit and freedos) in terminal emulation mode that logged into a sun server and forced the user into pine. Logging out of the system forced the pc to reboot (basically we waited till we saw the word logout go past) to ensure a clean session for the user - basically a quick and dirty university email reading terminal - login, read mail, exit and walk away.

Peek is an enhancement of this concept, and a damned interesting one ...

what transcription mistakes in manuscripts might tell us

Words change over time. That's how language changes and diverges. Sometimes the change is rapid and sometimes it's slow. That's why we can more or less follow Shakespeare and not Chaucer, and it's doubtful if Shakespeare would have had any less difficulty understanding Chaucer than we have.

Equally, language changes over geography as well as time - the English spoken in Kingston, Jamaica is very different from that spoken in Kingston in the ACT or Kingston-on-Thames in London, although the latter two are not very different from each other, for a whole lot of reasons such as consistent recent bi-directional migration, greater degree of education etc etc.

By looking at language changes over time we can see how language changes and show that it evolves, with less common words changing their forms quicker than more common words, as people are more likely to make mistakes with the rare ones than the common ones.

Anecdotally, you can observe this in Australia, where the English spoken, while almost the same as that in the south of England, is a simpler version, probably because of the need to absorb migrants from non-English speaking backgrounds, whose command of the language may be a little shaky.
And then I got to wondering. Projects such as the Canterbury Tales project transcribe old manuscripts and collate the differences in an attempt to build a consensus about what Chaucer originally wrote. But these manuscripts also tell us something about how people spoke, because the transcription 'mistakes' the scribes made were often unconscious corrections to usage.

They are in fact a frozen record of language change. Of course it's more complicated than that: we need to know the provenance of manuscripts to work out which are temporal corrections, reflecting changes in language over time, and which are dialectal corrections, reflecting geographic distance. And we need a big corpus.

So how do we get a big corpus of text? Typically these texts have been transcribed by hand, but advances in character recognition algorithms and the impact of cheap technology, including cheap digitization technology, should give us a large corpus to subject to genetic analysis.
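To make the idea of collating differences concrete, here is a small sketch using Python's difflib. It is a toy, not what the Canterbury Tales project actually uses, and the two "witnesses" are spelling variants I've typed in for illustration rather than readings from real manuscripts. The function name collate is likewise made up.

```python
from difflib import SequenceMatcher


def collate(witness_a: str, witness_b: str) -> list[tuple[str, str]]:
    """Return (reading_a, reading_b) pairs wherever two transcriptions differ."""
    a, b = witness_a.split(), witness_b.split()
    variants = []
    # Compare word by word; every opcode that isn't 'equal' is a variant.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants


# Illustrative spellings only, not real manuscript readings
copy_one = "whan that aprille with his shoures soote"
copy_two = "whan that aueryll with his shoures sote"
print(collate(copy_one, copy_two))
```

With provenance attached to each witness, the same variant pairs could then be counted by date and by region, which is where the temporal versus dialectal question starts to become answerable.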

This could be very interesting (in a geeky sort of way) ...

Java Toasters


In 2001 Robin Southgate came up with the idea of the java toaster that burns the weather forecast into your morning slice.

Time and technology move on but the idea stays the same - Electrolux now have come up with a USB version that's essentially a thermal printer for toast ...

Thursday 11 September 2008

interesting twitter behaviour ...


I've noticed something clever in twitter (or twittermail - haven't worked out which).

If I create a mail message with a tinyurl url it gets passed via twitter unaltered to my twitter page. If I do the same thing but use our in house short url service, which is based on nano url, the resulting twitter display is a tinyurl link.

Odd. Actually if you look at the flow, what has happened is that the non-tiny url has had a tinyurl created for it - implying that rather than truncating the message there's intelligence to process the message and automatically generate tinyurls for all non-tinyurl urls.
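For what it's worth, the behaviour I'm seeing is consistent with something like the following sketch. This is purely my guess at the logic, not twitter's or twittermail's actual code. The shorten argument is a stand-in for whatever call really creates the tinyurl, and the in-house hostname in the example is invented.

```python
import re
from typing import Callable
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")


def rewrite_links(message: str, shorten: Callable[[str], str]) -> str:
    """Leave tinyurl links alone and pass every other url through shorten()."""

    def replace(match: re.Match) -> str:
        url = match.group(0)
        host = urlparse(url).hostname or ""
        if host == "tinyurl.com" or host.endswith(".tinyurl.com"):
            return url  # already a tinyurl, pass through unaltered
        return shorten(url)

    return URL_PATTERN.sub(replace, message)


# Hypothetical shortener, used here only to show the flow
def fake_shorten(url: str) -> str:
    return "http://tinyurl.com/fake"


print(rewrite_links("see http://short.example.edu/x1", fake_shorten))
```

Note that this treats the in-house short url like any other long one, which matches what I saw: being short already doesn't exempt it.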

Tuesday 9 September 2008

Twitter ...

My problem with playing with social networking is my shy and retiring nature. No, really. I genuinely don't think that the world is gagging to know that I've just spent $800 on the brakes and steering on my car, or indeed what I'm doing on a day to day basis.

And then I had a thought. One thing I do do is skim blogs and online newsfeeds for things that interest me. And I've often thought about doing a daily post on today's interesting things. Instead I'll use twitter and either tinyurl or our in-house short form url service to post links to things I find interesting ...

ambient intimacy


An interesting and thoughtful post from the IHT on ambient intimacy - or how social networking influences society and allows people to feel connected, despite being physically disconnected.

The role of web 2.0 technologies is a fairly interesting topic - blogging has replaced samizdat in repressive societies, but few people have really commented on what function these technologies will have in non-repressive societies.

And of course one thing is the social network, the range of contacts, and being able to track things and how they're going - for example I can track what's going on with some projects I'm no longer involved with but am still interested in. There's also the general connectedness with old friends and colleagues, so one knows when people change jobs and all these other minutiae that help maintain contacts.

Wednesday 3 September 2008

SMC Skype WiFi phone

At home we have a problem. We're on the side of a narrow canyon without decent line of sight to a cell phone tower. This means that phones ring but the signal quality is too bad to talk unless one goes and stands at the top of the block in the back yard.

At the same time I've also become a convert to Skype for overseas calls - not because it's a lot cheaper than Telstra - three cents a minute for Skype out versus five cents for Telstra - but because the call quality is better when calling overseas. And of course I use Skype to call home when I'm away.

But there's a problem with Skype - it's cumbersome to use. It means donning a headset and being tethered to one computer while making the call, and arranging overseas Skype to Skype calls across timezones requires careful prior co-ordination by email. This means that using Skype is not spontaneous - it's ok for the occasional conference call and regular overseas call, but you lose the versatility of a phone call. And while you can get cordless handsets for your computer, it means leaving the computer powered up and connected to Skype.

So I cracked and bought a Wi-Fi phone from CE Compass. I ended up with the SMC version. The same phone is available from Belkin and from Edge, but ignore the branding - they're all the same phone.

So what are they like?

Slightly clunky, a bit like a 1999 cell phone with rubbery keys and a slightly crude user interface. That said, it was fairly straightforward to set up: basically your skype account details and your wireless access point security details. A nice point is that it goes looking for open access points and can be made to search at any time - useful if you want to work from a coffee shop with free open internet access, a train station or an airport lounge. You can also add multiple home networks for it to try, should you need one at home and one at work, say. You can get that nineties feeling back again and use it like the late and unlamented Rabbit phone, which only worked at local access points, leading to groups of rabbit users clustered round locations with an access point sign.

Other than that it just works. Rings when people call you and sits as a device on the network.

Boot up and initial connection is a little slow, but call quality is reasonable and saves having to be tethered to a computer. Basically it works all round the house and out into the yard - wherever there's a signal including the garage. Battery life could be better as well but as it's an alternative, not a substitute that's no big deal.

Tuesday 2 September 2008

mobile printing redux ...

Back in August 2007 I blogged about how to design a mobile printing solution [1] [2]. Like many IT projects it went nowhere and then suddenly resurfaced, in a slightly different form.
What we still need to do is provide a means for people using their own machines to upload and print but making it as simple as possible for the user. Making it simple means it can't be seamless as we need to make as few assumptions about the user's machine and browser as possible. So here's the cartoon:
  1. User logs in to system
  2. System presents the user with a web page listing the user's files and an option to do an http upload of a file (analogy is the geocities website manager)
  3. Beside each non-pdf file we have two options - convert to pdf and print. All printing is done by converting to pdf and then running pdftops, with conversion being done with either OpenOffice command line mode or abiword command line mode as appropriate and then print/export to produce the pdf. The analogy is with Zoho's or Google Docs' print and import options
  4. Pdf files have an option to view or print, which means that users can check the layout before printing
  5. Printing is done by passing the print job through pdftops and then queueing it to a holding queue with lprng.
We then have a second web based print release application that can be accessed either separately or as a flow on from the web based printing solution. This application basically allows the user to requeue the print job from the holding queue to an active queue or delete the job. An added refinement would be to add an estimate for the number of pages and hence an estimate of the cost to print.
It's not elegant but it does allow users a way to print from any device with some local store and a browser.
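The conversion and queueing steps in the cartoon can be sketched roughly as below. This only builds the command lines rather than running them, and it rests on some assumptions of mine: that the installed soffice supports --headless --convert-to (true of current LibreOffice/OpenOffice builds, not necessarily the version we'd actually deploy), that abiword handles the listed file types, and that the holding queue is called "hold", which is a made-up name.

```python
from pathlib import Path

HOLD_QUEUE = "hold"  # hypothetical name for the lprng holding queue
ABIWORD_TYPES = {".abw", ".rtf", ".txt"}  # assumed split between converters


def conversion_command(source: Path) -> list[str]:
    """Command to turn an uploaded document into a pdf beside the original."""
    if source.suffix.lower() in ABIWORD_TYPES:
        # abiword writes <name>.pdf next to the source file
        return ["abiword", "--to=pdf", str(source)]
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(source.parent), str(source)]


def print_job_commands(source: Path) -> list[list[str]]:
    """All commands needed to get one uploaded file onto the holding queue."""
    commands = []
    pdf = source.with_suffix(".pdf")
    if source.suffix.lower() != ".pdf":
        commands.append(conversion_command(source))
    ps = source.with_suffix(".ps")
    commands.append(["pdftops", str(pdf), str(ps)])
    commands.append(["lpr", "-P", HOLD_QUEUE, str(ps)])
    return commands


for command in print_job_commands(Path("/tmp/essay.odt")):
    print(" ".join(command))
```

Each list could be handed to subprocess.run(command, check=True) in turn. The print release application would then be a separate step that moves the job from the holding queue to an active one, which isn't shown here.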