Terese

What is it?

Terese is a tool which can be used to facilitate proofing the outcome of OCR programs, such as Tesseract. The basic idea is to try to map the OCR text to the original image. Differences, i.e. errors in the OCR text, are then easily identifiable. Note that it will not do the proofreading for you; rather, it is a tool which will help you verify that your proofread text corresponds to the original.

To use Terese for your proofing work, you will first have to generate a font for your particular text; that is, train Terese, by letting it calculate an average picture for each of the characters in the text you want to proofread. This process is potentially somewhat time consuming, and for that reason, Terese is specifically suited for longer works (books) set in a single font. If you only want to proof a page or two, it is probably not worth the effort.

To easier let you test the functionality of Terese, I ship it with an example of a generated font, and some pages that you can practice proofreading. See the section Using Terese for more information.

Installation

Windows

(Note that Terese is developed under Linux. For the convenience of Windows users I also offer a Windows version, which, however, has not been as extensively tested as the Linux version. If you have the possibility, I recommend that you run Terese under Linux instead: there are many excellent Linux distribution which you can run directly from a CD or a USB flash drive, without having to install anything on your computer.)

Download the latest Win32 release from the project's files page, unzip it, and move terese.exe to the folder where your scans and their corresponding OCR text are located. Note that Terese will write its configuration file, terese.ini, in the directory from which it was started.

To run it, you will have to install gtkmm for Windows. The current release is built with version 2.16. (It might work even if you have another version installed; I have not tested it.) When it comes to select the type of install, select "Runtime only" to save some hard drive space. Once installed, you might have to restart your computer.

To run the accompanying scripts, you need Python 2.6.

Terese is initially set to use the DPCustomMono2 font (included in the zip file). You should either install it (from the Control Panel, Fonts), or edit terese.ini where you can set the font to whatever you want.

Linux

I don't bother with a binary release for Linux; simply download the latest source release. On Ubuntu at least, the build procedure is as follows:

sudo apt-get install build-essential libgtkmm-2.4-dev
tar xf terese*.tar.bz2
cd terese*
./configure
make
sudo make install

Note that Terese will write its configuration file, terese.ini, to the directory from where it is started. This makes it easy to use Terese with different projects at the same time, since each project would have its own configuration file. You should use the terminal, navigate to the directory where your scans reside, and start Terese from there.

Terese is initially set to use the DPCustomMono2 font (also found in the src directory). You should either install it (for example to ~/.fonts/), or edit terese.ini where you can set the font to whatever you want.

Preparations

(If you only want to test the functionality of Terese, you can skip to the section Using Terese, and test it with the provided example files.)

Scan

I will suppose that you have the work you want to proofread in the form of scanned image files, one image per column of text, at a resolution of, say, 300 to 600 DPI, monochrome tiff. (I haven't been able to get Tesseract to understand grayscale images.) Move the compiled terese to the directory where these tiff files reside.

Installing Tesseract

The actual proofing can be done on text generated by any OCR program, but to train Terese, you have to use Tesseract. In this tutorial, I will use Tesseract 2.04.

Train Terese

Try to find a small number of pages where all the characters in the font are represented; you will probably have to choose several pages, to match all the characters used in the book. (This is not critical; you can always go back and add another page, if it has a new character.) If you have gone through the text already, and done some preliminary proofreading so that the text files are reasonably correct, you can use the script pagepickout.py to find a good set of files.

Generate box-files for the pages you have chosen, like this: tesseract scan.tif scan batch makebox && mv scan.txt scan.box. These box-files need to be carefully corrected; for this, you can use tesseractTrainer.py (see illustrating screenshot).

A drawback with the box file format is that it doesn't save information about spaces. I have solved this with a separate plain text file, which has the same text as represented in the box file, but with spaces in the correct places. If a box file is named scan.box, Terese expects this corresponding text file to have the name scan.box.txt. To generate such text files from your proofread box files, copy the script box2txt.py to the folder where your box files reside, and run it: python box2txt.py 10 *.box. (Note that this will overwrite all the *.box.txt files in this folder!) Adjust the number 10 so that the word spaces in these text files correspond to the spaces in the original. You can (should) edit them manually, and add and remove spaces where needed.

Then start Terese (from the terminal, because it will output information to stdout), click Create Font in the Proofing menu, and select all your corrected box files. The creation of the font will likely take a while. The font consists of three files: .png, .box, and .mtr. You should inspect the boxes with tesseractTrainer.py. If they contain unnecessary gray residues from surrounding characters, you should crop them:

The right hand side can be modified at will.
If you modify the left hand side, you should edit the second entry in the corresponding row in the mtr file. For example, you may note that, say, the box of the character "a" is covering a large portion of residues from preceding characters, and decide to crop the left side of the box with 5 pixels. Then a row such as "a -7 23" in your mtr file should be changed to "a -2 23".
The top and bottom of the boxes are special. They should all be modified by the same amount. But instead of changing it in the box file, you can simply change the first two rows in the mtr file. The first row is the amount to crop the top of the boxes, the second row is the amount to crop the bottom of the boxes. (Both these numbers should generally be positive.)

When using Terese (see below), you can export to a box file (and corresponding .box.txt file). These exported files can then be used to re-generate the font, to get a potentially better result. This is also a convenient way to add new characters to a font. Suppose that your font does not include "ï": in that case, Terese will stall when it comes across it in a text. Temporarilly change it to a similar character, such as "i", and export to a box file. Then edit the box file and box.txt file, and change that "i" back to "ï". You should also adjust the boundaries of its box. Finally generate the font again, incorporating this new box file.

Train Tesseract (optional)

You are now ready to start proofreading! The text can come from any OCR program, or you can even type it by hand, it doesn't matter. But if you are reading this, you will in all likelihood want to use Tesseract, and since you have already created corrected box files, you better use them to train Tesseract as well, to improve your results! See the script train.sh in the examples folder; for the full story, refer to Tesseract's web page.

About markup

Tesseract itself does not have the ability to mark up the text it outputs. Italics, bold text, small caps, etc., all are output as plain text. It is however possible to train it to regard for example italic characters as separate glyphs in the same font. One method to achieve this is to code a character in different styles as a ligature of a prefix and the character. For example, your text may make use of three different styles: regular upright, italic, and bold. Terese and its accompanying scripts are configured to use the prefixes "$" and "@" for italics and bold, respectively. That is, when you make your box file, you should encode an italic "a" as "$a", a bold "b" as "@b", and so on. (If you are using tesseractTrainer.py, this is done automatically if you use the checkboxes.)

If you then train Tesseract with these box files, it will learn to interpret, say,

This is italics and this is bold.

$T$h$i$s $i$s $i$t$a$l$i$c$s and @t@h@i@s @i@s @b@o@l@d@.

Now, that is not very readable, of course, but you can convert it to HTML-tags, using the script txt2rtml.py. Running python txt2rtml.py scan.txt will output scan.rtml which will look like this:

This is italics and this is bold.

HTML you say? Why then is the file ending rtml? Well, when I developed Terese I used it to proofread works for Project Runeberg, and this is an effect of that. Here, RTML stands for Runeberg Text Markup Language, which is an extended subset of HTML. (Note: Not to be confused with RTML, Yahoo!'s programming language.) One important difference between RTML files and regular HTML is that a new paragraph is represented by two continous line breaks, rather than a  tag.

When you open, say, scan001.tif Terese will also try to open scan001.rtml. If it does not exist, it will instead try to open scan001.txt, and if that file does not exist, it will be created.

Doing the OCR

Whatever program you use, be it Tesseract or another OCR program, the output should be one plain text file per tiff file. If you made use of Tesseract and the trick noted above to get the markup of the text, you probably want to use txt2rtml.py to convert them to rtml files.

Summary

To recapitulate:

Scan your work, one tiff per page.
Use Tesseract to generate box files from a few representative tiff files.
Proofread the box files with tesseractTrainer.py.
Generate .box.txt files from the proofread box files with box2txt.py.
Add/remove spaces in these .box.txt files.
Use Terese to generate an average font from the box files.
Check and possibly edit the font's box, mtr, and png file.
(Optional: Train Tesseract with your box files.)
Run OCR on the rest of the tiff files, generating one txt file from each tiff file.
If necessary, convert the txt files to rtml files with txt2rtml.py.
Use Terese to proofread the rtml or txt files (see below).

If all this seems daunting to you, with the proper compensation I may do step 2 to 10 for you. Contact me at johan.winge@telia.com for details.

Using Terese

So, you have copied your terese executable to the directory of your tiffs, and trained it, i.e. generated a font. (If not, you can open the font alice-font.mtr in the example folder.) Now go ahead and open the first tiff file. The corresponding rtml or txt file will be opened as well, provided that it exists:

Terese does not show the raw RTML code, but will parse the (R/H)TML tags instead. The text with blue background, for example, is really enclosed with  tags.

Now open the Proofing menu, and select Compare (or press F5) to tell Terese to perform its main task, i.e. to try to map the text to the image. It is a bit stupid, and might have troubles locating the start of the text:

As you see, the original text in the image is now shown in dark green, while the added text is dark blue. When the original ink of the scanned page and the letters that Terese adds overlap, it will be shown as white. Your objective is to wipe out all the text in this way. All the letters that have been placed out are listed in the lower left corner of the window. As you also see in the image, it failed to place the first letters correctly (and then gave up). Whenever such discrepancies are due to faults in Terese's mapping algorithm, rather than faults in your OCR-text, you can correct the placement of the mapped letters manually.

If you click on the first misaligned character (the left parenthesis in this case) it will be selected in the list. To move the selected character, hold down CTRL and double-click on its correct position:

Then press F5 again, to let Terese make another try. Here is an authentic example of Terese in action:

At least five errors in the OCR text can easily be located here: a hyphen for a period after "alqm", a lacking space after "tristis", another lacking space after "bitter,", lacking comma after "händelse", and a peculiar error where an "e" has been interpreted as "is". After this, Terese loses sync of the text, and soon gives up. (There is yet another error, which is really the fault of the original typesetter, and which is less easy to discern: to the left of the mouse pointer, an "r" that should have been set in Fraktur style is actually set in Antiqua. The sensible thing is of course to simply disregard such things.)

Settings

In the lower right corner, there are a couple of parameters. All units are in pixels:

Min space: When a space is encountered, this is the minimum distance that Terese will skip ahead. This should be set as large as possible. (If set to a too small number, erronous spaces in the rtml text might not be properly identified.)
Max space: If too large, Terese may look ahead too far, and skip an intermediate character. Should be set as small as possible (but larger than min space!)
Indentation: How much should the first line of a paragraph be indented? (A new paragraph is represented by two consecutive line breaks in the rtml file.)
Line space: Distance between two lines.
Right marg(in): Used when hyphenating. See below.

Take your time tweaking these parameters to work optimally with your particular text. They are all saved in Terese's configuration file, so you don't have to re-set them every time you start Terese.

Hyphenation

Terese provides optional hyphenation. If it is not activated (default), each line in the text file should correspond to a line in the scan, and if a word is broken over two lines, you will have to keep it that way in the text file for Terese to be able to map the text correctly. However, you generally want to remove such hyphens in your text, and to get Terese to handle this, you must activate hyphenation (in the Proofing menu). You should move the first part of each hyphenated word to the next line (and not the other way around.) For example, suppose that the text in your scan says:

This is a line with a hyphen-
ated word in it.

Then your text file should read

This is a line with a hyphenated word in it.

For hyphenation to work properly, you must set the "Right marg" parameter. Each line break in the text file that occurs before Terese has reached the right margin, and which does not mark the end of a paragraph, will now be disregarded. After that, Terese will instead look for a hyphen in the image which is not represented in the rtml text, and then, but no sooner, will it move down to continue on the next line. To change this parameter, the easiest way is to hold down Shift, and click in the image, a bit to the left of the hyphen. (In the example, a good place may be between the "e" and the "n".)

Note: This feature more or less relies on a justified margin. If your text is jagged, it may be problematic.

A final word

It is always your responsibility to do the proofreading and judge the output of Terese. It will readily expose some kinds of OCR errors, but other errors may be difficult to see: for example, depending on the font and the quality of the scans, an error such as a "b" for a "h" may be impossible to find if you rely only on Terese. It can never replace manual proofreading. The best procedure is probably to read through the text once, continually comparing it to the original, and correct all errors you can find. Then let Terese do its comparision, to find the few remaining errors that you may have overlooked.

Good luck!