Friday 29 April 2011

Cats can't count

If you have a cat you'll know this already - cats can't count. If you feed them on dried cat food, you rapidly come to realise that there are basically three states:

  1. Many - my bowl is full and I cannot see the bottom
  2. Some - the bottom of my bowl is mostly covered, with only a little plastic showing
  3. Few or none - I can see most of the bottom of my bowl, and I don't care if there's some round the rim - MeeOW!
State #3 can be turned into #2 by picking up the bowl to redistribute the contents and perhaps judiciously adding a few to give the illusion of the bowl having been refilled.

This of course is good enough for a cat - it knows if it has enough in its bowl and can go and do cat things, or if it has to find a human and annoy the hell out of them - scratching the fridge and the study chairs, walking on keyboards while you're working, and the like - to get them to add content to the food bowl.

Fine, enough cat stories, why are you writing about this?

Simple answer - user interface design and changing user expectations of the interface.

Looking at less technical users using applications such as Skype, I've noticed a similar phenomenon to the cat and its bowl. They see the nice pane in the middle of the screen and can happily use it to call people, use the dialpad and so on. What they can't do is navigate the pull-down menus to add a new contact - they need someone to help them.

Interestingly they seem to focus on the pane and (on a Mac at least) don't perceive the menu bar along the top of the screen - they focus on the application pane alone. Not a problem for full-screen applications, but a problem for everything else.

Now this is purely anecdotal, and I haven't seen it in Windows or Linux (Gnome or KDE) users, but I wonder if we're seeing an iPhone/iPad one app/one screen effect here on user expectations ...

Nectar Cloud workshop presentations now online

Back at the end of March, I went to the Nectar cloud workshop in Melbourne.

The presentations from the event are now online at http://www.nectar.unimelb.edu.au/about_nectar/background_documents under 'Research Cloud Technical Workshop March 2011'.

Wednesday 27 April 2011

Leonardo and Vegetius

Leonardo da Vinci was a great artist and clearly a crazy man - dreaming up lots of weird ideas, and even trying some of them out, including persuading one of his assistants to try out his design for a glider one afternoon in Fiesole.

Leonardo was of course well read, and as a designer of military machines for Ludovico Sforza, must have read Vegetius's Epitome. As it was one of the first books printed, he might even have owned his own copy.

And this leads me to wonder, were his designs for armoured personnel carriers doodles inspired by some of Vegetius's more fantastical suggestions for military machines?

Doing less with more ...

No, the title of this post isn't a typo, it's quite deliberate. I could have called it 'Whither University Information Services part 96' but didn't. Catchy is better.

However it is about the future shape of computing services in universities.

Just about everything can be outsourced: Google Apps, Office 365, wikis, blog services, notebook services and so on. Classroom environments can be replaced with virtual environments, and you can pay people to build and maintain them for you. Likewise, if you still want to provide PCs, you can get companies to look after them for you and maintain the disk images.

And, while a lot of people are a little uncomfortable with this, a lot of the MIS functions - paying people, processing leave applications and so on - can be outsourced.

And on the whole these companies are better at the individual activities than you are because each activity is what they focus on.

Which kind of leaves your average university computing service looking distinctly surplus. While you do need some in-house expertise, in the end much of what's left is service portfolio management. And the thing that changed this is decent high-speed internet links, meaning that for a large number of operations the people providing them no longer need to be on site.

Ignoring the recent Amazon outage, basically if you need a service, storage or compute power, you can rent it. I haven't done a cost analysis so I don't know how the costs compare, but I have a gut feeling that some outsourced services provide a better and more cost-effective service than can be provided in house.

So, all we are left with is a few local experts, some people to manage the service portfolios, and some people to look after those few services that can't be outsourced. In other words the computing service does less and less, becomes more and more narrow in its expertise and capability, and more and more is outsourced.

This could be taken as a dystopian view, but in fact it's an opportunity.

Years ago, before PCs were either common or powerful, most universities had some reasonably beefy machines to provide business services, general purpose computing and some specialist facilities such as text processing. And because of the roots of computing in the numerical and engineering sciences, almost all the general purpose stuff tended to be very numerical.

At the same time the programmers and developers who looked after the computers knew the academic researchers and often helped them with coding, complex statistical analyses and the like - they essentially worked part of the time as research facilitators, helping researchers do what they wanted to do more efficiently.

One example I remember from then was a guy who had recorded the orientations of every neolithic long barrow tomb in northern Scotland and who had a simple question:

Did the long axes of these tombs point to where the sun would be on either solstice in 4000BC?

Basically a very simple bit of statistical analysis - work out where the sun would be, work out the standard deviation of the orientations (plus or minus a bit of fiddling to deal with date variation), and see if there was a statistically significant effect.
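Today a quick and dirty version of that calculation is only a few lines of Python. The sketch below is mine, not the original analysis: it assumes you already have the orientations in degrees and a target solstice azimuth for 4000BC (working that azimuth out properly is an astronomy problem in itself), and it uses a plain one-sample t-test on the angular offsets where a careful analysis would use circular statistics.

```python
# Sketch: do the tomb orientations cluster around a target solstice azimuth?
# The orientations and the target azimuth below are made-up illustrative values;
# a real analysis would compute the 4000 BC solstice azimuth properly and use
# circular statistics rather than a plain t-test.
import numpy as np
from scipy import stats

orientations = np.array([128.0, 135.5, 131.2, 140.1, 126.8, 133.9])  # degrees, illustrative
target_azimuth = 132.0  # hypothetical solstice sunrise azimuth in degrees

# signed angular offset from the target, folded into the range -180..180
offsets = (orientations - target_azimuth + 180.0) % 360.0 - 180.0

t_stat, p_value = stats.ttest_1samp(offsets, 0.0)
print(f"mean offset {offsets.mean():.1f} deg, "
      f"sd {offsets.std(ddof=1):.1f} deg, p = {p_value:.3f}")
```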

The archaeologist concerned was ecstatic - a problem could be answered in a day rather than requiring weeks of tedious manual analysis, and all because he'd had the wit to come and talk to the computing people.

Today it could be done with Excel, and probably more quickly than it was then.

And when I listen to some PhD students describing their research analysis methodology for handling and analysing large datasets, I suddenly realise not much has changed. While everyone can use simple spreadsheets, text processing, email and the like, most people lack deeper computing skills. They need someone to suggest a solution, help them with a little bit of code to upload their results, suggest more efficient ways of manipulating data and so on.

In other words they need technical facilitators - just as the tomb man did - people who know what's possible, who're used to working out how to translate research problems into computing problems, and who are interested and engaged. The only problem is that these people were, on the whole, exactly those who were either let go or moved on to other things ...

Getting bookmarks out of Delicious and Diigo and into Evernote

I have pretty much settled on Evernote as my electronic notebook tool, allowing me to build up a collection of useful notes and web clippings that's both searchable and accessible.

However in my search for solutions to make me more efficient I've accumulated stuff in Mendeley and Zotero, not to mention bookmarks in Diigo and Delicious.

Mendeley and Zotero I can dismiss - I never really took to them, and in the end they didn't work for me. Delicious and Diigo are something else, with about three hundred bookmarks between them - some relevant, some probably crap, dead links or whatever, and some dating back to the days of Furl.

So how to get them into Evernote?

The first step is easy - consolidating them into one place - as both Diigo and Delicious provide tools to export your bookmarks in a form suitable for importing elsewhere.

Googling for 'export delicious to evernote' produced more results than 'export diigo to evernote', probably as a result of last year's stoush over Yahoo's rumoured closure of Delicious, so I decided to consolidate on Delicious, reckoning that if there were more Google results, someone else must have cracked the problem.

Well, while the consolidation went well, it became clear that most people hadn't solved the 'import to Evernote' problem, suggesting instead that you export your bookmarks as an HTML file and import that into Evernote. That creates a searchable document, but it isn't really what I wanted, which was to import each bookmark into a separate note with tags, work through them to kill the dead and unwanted bookmarks, and capture the text of the bookmarks I want to keep. (Yes, possibly I don't get out enough, but that's me.)

Evernote uses an XML-based data format for import and export, so it should be possible to parse the bookmarks export file and then rewrite it as an Evernote XML file, with one note per bookmark.
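In outline, such a conversion is only a screenful of code. The sketch below is mine, not the recipe mentioned next: it parses the Netscape-style bookmarks HTML that Delicious exports and emits one ENEX note per bookmark. The ENEX skeleton (DTD references, tag and source-url elements) is my reading of what Evernote's own export files contain, so treat it as an assumption and check it against a file exported from your copy of Evernote before relying on it.

```python
# Sketch: convert a Delicious bookmarks export (Netscape bookmarks HTML)
# into an Evernote .enex file, one note per bookmark.
# The ENEX structure below is inferred from Evernote's own export files
# and may need adjusting - verify against a real export first.
from bs4 import BeautifulSoup
from xml.sax.saxutils import escape

def bookmarks_to_enex(html_path, enex_path):
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    notes = []
    for a in soup.find_all("a"):
        url = a.get("href")
        if not url:
            continue
        title = escape(a.get_text(strip=True) or url)
        url = escape(url)
        # Delicious exports tags in a comma-separated TAGS attribute
        tags = [t.strip() for t in a.get("tags", "").split(",") if t.strip()]
        tag_xml = "".join(f"<tag>{escape(t)}</tag>" for t in tags)
        content = (
            '<![CDATA[<?xml version="1.0" encoding="UTF-8"?>'
            '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">'
            f'<en-note><a href="{url}">{title}</a></en-note>]]>'
        )
        notes.append(
            f"<note><title>{title}</title><content>{content}</content>{tag_xml}"
            f"<note-attributes><source-url>{url}</source-url></note-attributes></note>"
        )

    with open(enex_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<!DOCTYPE en-export SYSTEM '
                  '"http://xml.evernote.com/pub/evernote-export3.dtd">\n')
        out.write("<en-export>" + "".join(notes) + "</en-export>")

bookmarks_to_enex("delicious_export.html", "bookmarks.enex")
```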

I ended up not having to do this - I found this truly awesome recipe that uses deliciousxml.com to download your bookmarks in an XML-based format and then runs some code on jsdo.it to transform them into an Evernote .enex archive, which imports as a new notebook, making it easy to work through the imported links.

And the recipe just worked. My thanks to Dr. Palaniraja for making this available.

I now need to do the anal retentive thing of working through the notes, but given I'm anal enough to find LibraryThing fun I don't think that's going to be a problem for me ... although one problem is the sheer number of dead links uncovered when tracing back posts from much before 2009.

Update

Today's Guardian reports that Yahoo has finally sold Delicious to the original founders of YouTube - if you've been thinking about moving your data across, this might be the moment ...

Thursday 14 April 2011

ISBNs as persistent identifiers

Ever since I discovered LibraryThing, evenings chez Moncur have involved an hour or so squatting on the floor with our Asus netbook adding our books into our LibraryThing collection. And it's a strangely therapeutic activity.

Basically, the routine is something like this:
  • make pot of tea
  • fire up netbook and connect to LibraryThing
  • decide which shelf to enter
  • pull books off shelf
  • find and enter each book's ISBN in turn, while drinking tea
  • return books to shelf
  • logout and help cook dinner
and the reason why it's so mindlessly enjoyable is the power of the ISBN. ISBNs uniquely identify each book (or, more precisely, each edition) published after roughly 1970. And because of this they're in all the major public catalogues, meaning that adding a book simply involves getting LibraryThing to look up the ISBN against a reference source (in my case usually Amazon's UK catalogue), and there's the data.
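The same trick is easy to reproduce yourself against any catalogue with an API. Below is a minimal sketch using the Open Library books API as a stand-in reference source (LibraryThing itself queries Amazon and the library catalogues); the ISBN-13 check digit calculation is standard, and the example ISBN is just that, an example.

```python
# Sketch: validate an ISBN-13 check digit, then look the ISBN up against
# Open Library (used here as a stand-in for Amazon / library catalogues).
import json
import urllib.request

def isbn13_is_valid(isbn):
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    # ISBN-13 checksum: alternating weights of 1 and 3, total divisible by 10
    return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

def lookup_title(isbn):
    clean = isbn.replace("-", "")
    url = (f"https://openlibrary.org/api/books?bibkeys=ISBN:{clean}"
           "&format=json&jscmd=data")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    record = data.get(f"ISBN:{clean}")
    return record["title"] if record else None

isbn = "978-0-14-143951-8"   # example ISBN-13 with a valid check digit
if isbn13_is_valid(isbn):
    print(lookup_title(isbn) or "not found in this catalogue")
```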

I've only, so far, found one mistake in the Amazon database, and have only had to use an alternative catalogue source, the National Library of Australia, on three or four occasions. I've been quietly amazed at what's in there - things that you would think difficult, such as a book published in Thailand (with a Thai ISBN) about Laos, are just there. I've only had to do serious detective work on one book - an English translation of a German art book on Egon Schiele.

So ISBNs are the poster child of the persistent identifier world, but when we look at institutional repositories, the use of persistent identifiers is spotty to say the least, and governed by the choice of repository software. Archives of research data are pretty spotty as well, and some registries - the catalogues of the dataset world - are not that great either.

ISBNs have been a success because they filled a need for the book trade and libraries: to ensure that books are correctly described, so that when someone orders a book they get what they ordered, rather than something with a similar sounding title or author ...

In the data archiving world, ideally one would like to link datasets to publications and also to researchers, but we lack unambiguous primary keys, which is what decent persistent identifiers would give us. We also currently seem to lack a clear driver to introduce such a scheme to enable unambiguous dataset citation ...

Wednesday 13 April 2011

Sustaining content

The threatened closure of AIATSIS's digitisation programme has made explicit the long term problem of all digitisation/digital content programs - sustainability.

Sustainability can be more accurately described as the problem of how we keep the curation process going after the initial funding has expired - the curation process meaning that we check to ensure that the data is accessible and has not become corrupted in some way.

And this is a process that takes money.

The usual output of a digitisation program is a website that gives access to the content, a database server that allows you to query the content - or more accurately the content's metadata - plus the content itself, which needs to be backed up, sanity checked and the rest. I don't have figures for the cost of replicated filestore, so I'm going to assume that it's 2.5 times the cost of straight filestore.

Ignoring the costs of software licences, you need two reasonable virtual machines - one for the front end, one for the database backend. Each server would cost a little less than $1000 in hardware resource to provide over a five year term, but running costs are probably around $1500 per annum for power and cooling, plus $2500 per annum for operating system maintenance and patching - so the cost of the hardware is minimal compared to the cost of running the servers for five years, which comes to roughly $20000.

Storage is quite cheap to provide as well - say around $4000 for a terabyte of slow SAN-based SATA storage over a five year term - using NAS would be a bit cheaper.

So let's say $10000 for storage if you add in replication/sanity checking.

So we could say that a project that needs to be maintained costs around $30000 for five years after the end of funding, or $6000 per annum.
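For what it's worth, the back-of-the-envelope arithmetic runs as below. Reading the power/cooling and maintenance figures as covering the pair of virtual machines is my interpretation, and the figures themselves are the rough estimates quoted above, so treat the output as indicative only.

```python
# Back-of-the-envelope sustainability costing, using the figures in the post.
# Treating the running costs as covering both VMs is an assumption on my part.
YEARS = 5

vm_hardware = 2 * 1000              # two modest virtual machines, ~$1000 each over the term
vm_running = (1500 + 2500) * YEARS  # power/cooling plus OS maintenance, per annum
servers = vm_hardware + vm_running  # ~$22,000 - the post rounds this to $20,000

storage = 4000 * 2.5                # 1 TB of SAN storage, x2.5 for replication/sanity checking

total = servers + storage           # ~$32,000, i.e. roughly the $30,000 / $6,000 pa quoted
print(f"five-year total ~ ${total:,.0f}, or ${total / YEARS:,.0f} per annum")
```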

Not a great sum.

However there is a problem. Few digitisation or data hosting projects include a coherent, costed sustainability plan, the default seeming to be 'oh, my institution will look after it' - which they might well do for a few years if it was only $6000 a year, but if it was 10 projects, each at $6k a year, that's getting on for the cost of a grade 4 library person, a trainee network support guy, or whatever.

Projects need better exit and sustainability plans, ones with real costs. And while some might be sustainable by selling subscriptions not all will be - in fact most of them won't be.

The simplest solution is probably to require projects to pay into a ringfenced fund, something akin to an annuity, that will provide for the sustainability of the data for a fixed term. Given that the half life of scientific publications is around five years (ie 50% are no longer being referenced five years after publication), we can probably assume the same is true of data, and can say that a dataset or digitisation project is only worth sustaining at the level of access it attracts.
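If you treat access as decaying exponentially with a five year half life - a modelling assumption, not data - the arithmetic for sizing such an annuity-style fund is trivial:

```python
# Toy model: exponential decay of access with a five-year half life, used to
# gauge how long a fixed sustainability fund would need to run. The half life
# and the 10% cut-off are assumptions for illustration only.
HALF_LIFE_YEARS = 5.0
ANNUAL_COST = 6000          # per-project figure from above

def access_fraction(years):
    """Fraction of the original level of access expected after `years`."""
    return 0.5 ** (years / HALF_LIFE_YEARS)

# fund the project until expected access drops below 10% of its initial level
years = 0
while access_fraction(years) >= 0.10:
    years += 1
print(f"fund for ~{years} years, i.e. roughly ${years * ANNUAL_COST:,} per project")
```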

Basing the decision on access levels is an approach we used to take at the UK Mirror Service when deciding which data sets to cease hosting - and while we still got special pleading, it had the advantage of transparency.

The costs cited are internal estimates from my day job of what it costs to provide these things. They're not necessarily accurate, but the server costs are not too far from those charged by a large hosting company here in Australia. The storage costs are slightly higher than Amazon's for hosting on redundant storage in their Singapore facility, but remember that Amazon will also charge for data transfer (essentially website accesses and database lookups). These are all back of the envelope prices, and you may find your technology and hosting costs differ significantly.

The point remains, though, that even using serious hardware (virtual machines, high quality blade servers, high performance SAN hardware), the cost of maintaining the resource 'as is' is comparatively low on an annualised basis. In aggregate, however, such projects can amount to a reasonable degree of expense - meaning that the host institution probably should care about sustainability even if the individual projects do not ...

Sunday 10 April 2011

Librarything

I’ve discovered a new source of distraction – LibraryThing – which is a website that lets you catalogue what’s on your shelves, as well as all the usual social networky stuff de nos jours.

Now this sounds incredibly geeky, but I have a serious purpose here. We have a lot of books, some of which would be quite hard to replace. We used to have a lot more, but when we moved from York to Canberra we gave about three quarters of them away, including my science fiction collection, as there just wasn’t room in the shipping container for them all.

When we gave the books away we intended to be rational: give away the mass market paperbacks and the stuff that didn’t do it for us any more, and keep the literature and the good factual stuff that might come in useful.

We weren’t that rational in the end. Somehow all of George Orwell got given away, along with Lindsey Davis; some things that shouldn’t have come along did, and some things just got lost, like my copy of Ane Satyre of the Thrie Estaitis.

So, as well as buying new books, we’ve replaced some of the missing ones, though for some reason Lindsey Davis seems to have taken precedence over George Orwell.

The upshot is, we don’t know what we’ve got, and LibraryThing seems the perfect solution for cataloguing our collection. Also at the back of our minds is the problem we had when we had a break-in in York and some miscreant stole all our CDs (as well as the stereo, a Mac, and the rest).

We of course didn’t have a list of what we had, but we did have a couple of hundred CDs. The insurers were happy to agree we’d had that many - we could point to the empty CD racks and show them some photos we’d taken that showed the racks fairly full.

What they didn’t of course allow for was the fact that we had a pile of classical and medieval music and a pile of world music exotica, some bought on our travels; they simply offered us the average price of a CD bought from a discount music store.

We argued - at one point I sarcastically suggested they fly us to Johannesburg to replace the world music CDs, as it would probably work out cheaper than buying the individual imports - and eventually they upped the amount a bit, and we compromised.

And even today, we have less than half the number we used to have.

Now, while no miscreant is going to steal a paperback of Aristophanes’ collected plays, we’re always aware that there’s a nature reserve on top of the hill, and in the event of a serious bushfire it would be ‘grab the cat, the passports, a laptop, and the CDs with the scanned essentials, and get the hell out’.

And afterwards, we’d need to get a valuation on the books we’d inevitably lose - so LibraryThing seems the ideal solution to show what we had. And of course, since the data lives elsewhere, we don’t need to make sure that one of the essential CDs is an up-to-date book list.

But how to get the data in?

Well, LibraryThing is part owned by AbeBooks, who are themselves part owned by Amazon. Most of our books come from Abe, Amazon in either the States or the UK, or Bookdepository. And we’ve been buying from Amazon for as long as they’ve been going, so getting that data in seemed like a start.

Now, because of the ownership chain, we thought that at least AbeBooks and the Amazons would have purchase history import/export filters for LibraryThing. Not a bit of it.

They didn’t even have an option to download your purchase history to a csv file. Neither did Bookdepository.

I was quite prepared to write a little bit of Perl if necessary to transform the data (actually, being a sad anorak, I was looking forward to it), but no, not a bit of it. Nothing to do but spend an evening with one tab open on your purchase history and another open on LibraryThing, doing the data entry by hand.

And that’s probably given us 50% of the books. For the rest, we’re probably up for a few evenings of manual data entry.

While LibraryThing will sell you a cheap barcode scanner for $15, it would mean waiting a couple of weeks minimum for one to come from the States, and with no guarantees, as none of our computers use any of the operating systems listed on the LibraryThing CueCat webpage.

And a barcode scanner wouldn’t solve all of our problems anyway, given Borders’ and Dymocks’ habit of sticking their own SKU labels over the publisher’s barcode, which of course renders them unscannable, leaving no alternative other than manual entry.

And the obvious machine to use for data entry is the OokyGoo, our Linux travelling netbook, given its lightness and portability, but CueCats with Linux seem a bridge too far - Googling suggests that getting a CueCat working under Linux is a finger-in-the-ear job at best. So I think it’s going to be autumn evenings sat on the floor with the Asus netbook, typing in ISBNs by hand …

Thursday 7 April 2011

Capturing what scholars use ...

Building on my recent post on what collaboration tools researchers actually use, I thought I'd set about documenting this. To do that we need to both collect and publish information. So, to this end:
I'll post a summary of the results in due course ....

Wednesday 6 April 2011

What do scholars actually do?

One of the great bugbears of supporting digitally enabled research is understanding what scholars actually do. Not what research they carry out, but actually what they do, and what they use in the course of their daily work.

For example, it's generally believed that collaboration is a good thing and enables productivity. This of course raises the question: what collaborative technologies do scholars actually use?

According to a recent report on the use of digital technologies in the humanities, the answer is fairly prosaic, but none the less interesting:

  • Google Docs for collaborative editing and to share documents
  • Dropbox for sharing files
  • Yousendit to send large files to colleagues elsewhere
  • Skype for discussions
  • Blogs/RSS feeds/Twitter to keep ahead of the field
Besides these, of course, they used a range of specialist tools such as TAPoR for text analysis. However we shouldn't get distracted by the specialist tools - as technologists we have a tendency to be distracted by shiny things with blue lights, preferably ones that flash - we should rather focus on the prosaic. If we are to build a collaboration environment, aka a VRE or virtual research environment, then:
  1. It needs to provide easy access/integration to Google Docs and other tools - Colwiz and Ojax++ offer possible routes to building a VRE with third-party tool integration
  2. They need a decent blogging platform - which basically means a local install of WordPress. But they also need a way to do joint blogs and comments with people elsewhere, which takes us to the interesting lands of authorization and authentication. We could say Shibboleth, but we need to encompass collaborators overseas and at institutions such as museums and art galleries who may not have access to Shibboleth-enabled authentication. As we want to maximise ease of use, this probably means supporting multiple authentication methods such as Shibboleth, OpenID and the Google API
  3. Researchers want to share big files. They may be high resolution photographs of cuneiform tablets, survey datasets, astronomical images and the like, but the key thing is that they are big. Given that no one understands FTP anymore, services such as Cloudstor are of great potential value - but remember, if we are talking about collaboration we are talking about AuthN and AuthZ, which means paying attention to collaborators outside the traditional academic frog pond
  4. Skype - much the same can be said about Skype versus EVO as about Yousendit versus Cloudstor
  5. People also need an RSS feed aggregator - there are just too many potentially valuable feeds out there (a minimal aggregator sketch follows this list). It's also interesting the way that people increasingly treat Twitter as a curated RSS feed - ie as a feed of interesting miscellanea recommended by colleagues they trust and respect
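As a trivial illustration of the aggregator point, pulling a handful of feeds together takes very little code. A sketch using the feedparser library; the feed URLs are placeholders, and a real aggregator would parse dates properly and deduplicate entries.

```python
# Minimal feed aggregator sketch: fetch a few RSS/Atom feeds and print the
# most recent items. The feed URLs below are placeholders.
import feedparser

FEEDS = [
    "https://example.org/research-group/feed.xml",    # hypothetical
    "https://example.net/digital-humanities/rss",     # hypothetical
]

items = []
for url in FEEDS:
    parsed = feedparser.parse(url)
    source = parsed.feed.get("title", url)
    for entry in parsed.entries:
        items.append((entry.get("published", ""), source,
                      entry.get("title", "(untitled)"), entry.get("link", "")))

# crude reverse-chronological sort on the published string; real code would
# sort on entry.published_parsed instead
for published, source, title, link in sorted(items, reverse=True)[:20]:
    print(f"{source}: {title}\n    {link}")
```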
However there's another very important point here - researchers have already gone out and found these services for themselves. They haven't come and demanded them from IT; they've gone out, found them, and their use has spread virally. As IT professionals we need to be realistic about this and seek to integrate these services rather than build parallel services just to prove we can (although it's perfectly valid to build such services as a competence building exercise, just as building a private academic compute cloud could well be a valuable competence building exercise).

It's also very interesting what they didn't mention:
  • Sharepoint
  • Wikis
  • Microsoft Live docs and Skydrive
  • Zoho
suggesting that these products haven't gained a lot of traction in academia.

Wikis are an interesting case - quite a few research projects make use of wikidot to maintain a project wiki, yet researchers don't mention using wikis. I don't have an answer to this - it could be that the ease of creating content for a project blog is more appealing than a wiki, though I'd have thought the implicit non-linearity of a wiki might also be attractive for some multi-threaded projects.

The other interesting case is Windows Live. Most academics seem to use Microsoft Word. All the previous competitors have fallen by the wayside, and Open/Libre Office is regrettably not quite there. Yet despite their need to share as part of collaboration, there seems to be little use of the Microsoft alternatives to the Google ecology, even though they are also free and integrated with the most recent versions of Microsoft Office.

So, the non-adoption of the Microsoft products tells us that offering an alternative isn't a solution, even if it's as good as or better than what people already use - if people want to use Google Docs or WordPress, that's what they want to use, not some odd geeky equivalent.

Strangely enough, I've been here before. A long time ago I built a managed desktop service for a university that didn't use Microsoft tools, and instead used a range of open source or low licence cost Microsoft-compatible tools. The justification was of course cash saving, by not having to pay the Microsoft tax. Users (mostly) loved the predictability and stability of the managed environment, but the number one request we got for optional, individually licensed and paid-for software was Microsoft Office, and the reasons invariably given were, first, to collaborate more effectively with colleagues at other institutions and, second, to reduce the risk of data interconversion errors.

There are a lot of reasons for wanting to use certain tools in preference to others, starting with usability and compatibility with colleagues elsewhere, but the important takeaway is that people want to use what they're already using. These tools are out there and universally accessible - which means that there is no way their use can be curtailed by mandating an alternative.

The consequence of this is that you need to provide a mechanism to integrate them, meaning your research and collaboration VRE starts looking a lot like a portal or a dashboard, with gadgets linking to standard tools ...

Friday 1 April 2011

Nectar cloud workshop


[Photo: Nectar cloud workshop, originally uploaded by moncur_d]

Just back from the Nectar cloud computing workshop in Melbourne - an event designed to help choose the basic parameters of an academic cloud computing service in Australia.

Now I must admit that before I went to this event I had only a hazy view of cloud computing - to me, services like Dropbox, SkyDrive and the various components of the Google ecology represented cloud computing. That of course is really software as a service (SaaS), albeit with the capacity to respond to demand and be replicated, so that if one box, or a building full of boxes, goes down the rest take up the slack.

On the other hand, platform-based services such as Amazon's EC2 are slightly different - basically renting computing potential, essentially on-demand hosting of virtual machines. So, for example, if you have a complex application - such as the woodchipper text analysis tool - that you want to run against some data, you build and load up some VMs - as many as you want, or as many as your credit card will allow - to do your analysis run, in much the same way as my playing with ngram meant firing up an instance of the ngram software each time.
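In practical terms, 'renting computing potential' boils down to a couple of API calls. The sketch below uses Amazon's boto3 SDK (which post-dates this post, but the idea is the same); the machine image, instance type and region are placeholders for whatever your analysis actually needs.

```python
# Sketch: fire up a handful of worker VMs on EC2 for an analysis run, then
# terminate them when the job is done. AMI id, instance type, count and
# region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")   # Singapore, the nearest region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image with the analysis software baked in
    InstanceType="t3.medium",          # placeholder size
    MinCount=1,
    MaxCount=4,                        # as many as you want, or your credit card will allow
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]
print("started", instance_ids)

# ... run the analysis and collect the results, then shut everything down
ec2.terminate_instances(InstanceIds=instance_ids)
```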

Like much in computing, nothing is new - only improved. VM/CMS, with its shared architecture, gave you much the same effect on an old IBM or Amdahl mainframe, where you effectively had a machine of your own to run your jobs in.

The key differentiators between then and now are (a) the ability to do this with commodity hardware and (b) elasticity - allowing as much capacity, or as many machines, as you want on demand. Costing and accounting is of course a complete nightmare.

There is of course an argument as to whether we should bother building a solution in Australia when Amazon and Azure do it better and have more hardware. I've made the argument elsewhere that the future of university computation services lies in mediating access to outsourced services.

However I can see a number of good reasons for building a demonstrator in Australia:

  • Offshore data shipping costs - moving a large dataset for analysis outside of Australia is a slow and expensive business - the links, while reasonable, are still expensive and capacity bound when it comes to big data
  • There is currently no significant cloud computing provider in Australia - no Amazon or Azure node, the nearest being Singapore, meaning that data shipping costs are a given
  • Data security - there is some data - the classic example being health related data - that for legal reasons cannot be shipped out of Australia - and of course there's always the Patriot Act problem when shipping data to the US
  • Competence building. Even if the Australian academic cloud service ends up as a front end to commercial providers, we need to know about the services in order to give good support to the client base, and only by playing with this stuff can we build understanding
  • Australia is only 21 million people with 39 universities, of which a quarter are research intensive. Individually, each research intensive university probably can't build a service elastic enough for real use, but collectively we just about can
  • Having an academic service with a low entry cost lowers the bar for those individual researchers in less research intensive universities who have an occasional need for serious computation - rather than building a Beowulf cluster in their office, they can make use of cloud based services for the occasional big analysis and just get on with their research
So, cloud is an enabling technology - in other words, one that means researchers can work more effectively without the technology getting in the way. Making it work also means building a culture of engagement with researchers, which is more than classic business analysis.