Tuesday 6 October 2015

Fixing Ernesta

Fixing Ernesta Drinker's book turned out to be easier than expected.

First of all I used gedit to remove the front matter from the text file, and then used cat -s to squeeze the runs of blank lines introduced by the digitisation process down to single blank lines, which gave me a halfway clean file.

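I then used sed to replace the header and footer strings

sed -z 's/header_string\n//g' infile.txt > outfile.txt

with null strings, which gave me a reasonably clean text. (sed normally works a line at a time and never sees the trailing newline, so matching \n needs GNU sed's -z option, which treats the whole file as a single record.) The only problem was that the file had hard-coded end-of-line markers, and paragraphs were mostly separated by double end-of-line markers. Here perl was my friend: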

perl -pi -0 -w -e 's/\n\n/ qq9 /g' infile.txt

to replace the paragraph breaks with qq9 - a string that did not occur in the document. Then I used

perl -p -e 's/\n//g' infile.txt > outfile.txt

to take out the end-of-line markers, and

perl -p -e 's/qq9/\n /g' infile.txt > outfile.txt

to put back the paragraph breaks. (And yes, I used Stack Overflow.) I could have wrapped all of this up in a script, but working out the best order of operations was a bit iterative, so I ran the individual operations in a terminal window.

At this point I opened the text with LibreOffice to check the format and remove a couple of headers garbled in the OCR process. If I were being pedantic I could then have spell-checked the document, but what I had was good enough to read and take notes from, so I simply used CloudConvert to make an epub file from the saved file.

Not perfect, but good enough.
