Colophon

For future reference, here's the lowdown on the technical geekery used to digitise this book, making use of some lessons in efficiency learned from the previous experience of transcribing The Nightclimbers of Cambridge.

This digitised version of the book was produced using the Linux XSane scanner package, a Canon MP510 combo printer/scanner and a loaned 1970 copy of Cambridge Nightclimbing.

I scanned the pages mostly in pairs (which XSane made very efficient). The filenames were chosen to contain a "-l" if a left-hand page was in the image, and a "-r" if a right-hand page was present; hence most of the images contained a "-l-r" character string. These flags were used to automate cropping the individual pages out of the raw scans.
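
To make the naming scheme concrete, a batch of raw scans might have looked like this (all of these filenames except 0013.png, which the script below treats specially, are invented for illustration):

0013.png        a lone page, converted without cropping
0042-l-r.png    a spread containing both a left and a right page
0057-r.png      a scan containing only a right-hand page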

Next, a small shell script was written to chop the images into single pages, with the chapter headers and page numbers removed so as to help the OCR process. The same script also did some greyscale thresholding to convert darker areas to black and lighter areas to white, with a small range of grey shades retained; this improved the OCR response. The clever bits here were done by the excellent ImageMagick command line tools. Finally, the script converted the cropped page images to plain text with the Tesseract OCR program and concatenated them into one big plain text file with the page breaks indicated. Here is the script in its entirety:

#! /usr/bin/env bash

# ImageMagick options: greyscale thresholding (pixels darker than 50% go to
# black, lighter than 65% to white, keeping a narrow band of greys), plus
# crop geometries for the left and right pages of a scanned spread
CONVCOLOROPTS="-type Grayscale -black-threshold 50% -white-threshold 65%"
CONVCROPOPTSL="-crop 660x1000+100+95"
CONVCROPOPTSR="-crop 660x1000+850+95"

# Clear out the output of any previous run
rm -f *.tiff *.raw *.txt *.map

# Page 0013 is a lone page, so convert it without cropping
convert ../0013.png $CONVCOLOROPTS 0013.tiff

# Crop the left-hand page from each scan containing one, dropping "-r"
# from the output filename; then do the same for the right-hand pages
# (the crop/colour option variables are deliberately left unquoted so
# that they split into separate arguments)
for i in ../*-l*.png; do
    newname=$( basename "${i%.png}" | sed 's/-r//' ).tiff
    echo "$newname"
    convert "$i" $CONVCROPOPTSL $CONVCOLOROPTS "$newname"
done
for i in ../*-r.png; do
    newname=$( basename "${i%.png}" | sed 's/-l//' ).tiff
    echo "$newname"
    convert "$i" $CONVCROPOPTSR $CONVCOLOROPTS "$newname"
done

# OCR each cropped page image into a .txt file with the same base name
for i in *.tiff; do
    echo "$i"
    tesseract "$i" "${i%.tiff}"
done

# Concatenate the per-page OCR text into one big file, with a marker
# line after each page to indicate the page breaks (glob expansion is
# already sorted, so no need to pipe ls through sort)
OUT=camnightclimbing.txt
rm -f $OUT
for i in *.txt; do
    cat "$i" >> $OUT
    echo -e "\n----------------------\n" >> $OUT
done
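
One usage note: the script reads the raw scans from its parent directory (hence the ../*.png globs) and writes its output into the current directory, so it wants to be run from a working subdirectory alongside the scans. Something like the following, where the script name is my invention:

mkdir ocr && cd ocr
bash ../digitise.sh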

The final stage was manual: using GNU Emacs and the aspell spell-checker, with occasional reference to the scanned pages, I corrected the spelling and pieced the chapters together, removing the page break indicators. Once I had a coherent plain text version of the book, I added DocBook XML tags to structure the chapters, sections and so on. The photographs were manually cropped and added to the DocBook source as the final stage. HTML and PDF forms of the book were produced using Norman Walsh's DocBook XSL stylesheets and the xsltproc program.
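
For the record, the DocBook and build stage looked roughly like the sketch below. This is a reconstruction rather than a transcript: the file names, the placeholder content and the stylesheet paths (typical of a Debian-style docbook-xsl install) are my assumptions. Note too that the DocBook XSL stylesheets produce XSL-FO rather than PDF directly, so an FO processor (Apache FOP, say) would normally render the final PDF.

# A minimal DocBook 4 skeleton (placeholder content, for illustration only)
cat > camnightclimbing.xml <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<book>
  <title>Cambridge Nightclimbing</title>
  <chapter>
    <title>A chapter title</title>
    <para>Chapter text, pasted in from the corrected OCR output.</para>
  </chapter>
</book>
EOF

# HTML straight from the DocBook XSL stylesheets
xsltproc -o camnightclimbing.html \
    /usr/share/xml/docbook/stylesheet/nwalsh/html/docbook.xsl \
    camnightclimbing.xml

# PDF via XSL-FO and an FO processor
xsltproc -o camnightclimbing.fo \
    /usr/share/xml/docbook/stylesheet/nwalsh/fo/docbook.xsl \
    camnightclimbing.xml
fop camnightclimbing.fo camnightclimbing.pdf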

And that's all. I'm very pleased to say that the whole process involved only Open Source tools, produced by people much cleverer than myself.

Andy Buckley, 8 March 2008