Preparation

This procedure processes the raw document from ABBYY (or another OCR source), fixes the first errors, and prepares the document for subsequent processing. A number of issues are handled, some of which can be controlled via the settings.

The procedure is fine-tuned for a document exported from ABBYY in a particular way, although other exports will work as well. The screenshot below shows the suggested settings in ABBYY. The headers and footers can also be retained. What is important is that the results are saved as Editable Copy.

_images/abbyy.png

One of the tasks of this procedure is to restore the paragraphs as they appear in the original document. The first new paragraph of each new page (of the original scan) is colored red. This makes it easy to see whether errors remain in that part and whether chapter titles are missing.

Parts of this procedure:

  • Repair incorrect lists
    If a line in the scan starts with a hyphen or dash, Word may interpret it as a list. The same applies to numbers at the beginning of a line.
  • Removal of frames and shapes
  • Detection of scene breaks (blank lines)
  • Correction of wrong paragraph breaks and a first pass at repairing wrong quotation marks. If the source document uses indentation, this is taken into account.
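
To illustrate the idea behind paragraph restoration and scene-break detection, here is a minimal Python sketch. It is not the code used by this procedure, which works on the Word document itself and also looks at indentation. The sketch assumes plain-text lines, assumes that a line fragment without sentence-final punctuation continues on the next line, and treats a blank line as a scene break. The names `SENTENCE_END` and `restore_paragraphs` are hypothetical.

```python
import re

# Assumption: a fragment that ends with one of these characters (optionally
# followed by a closing quote or bracket) closes a paragraph.
SENTENCE_END = re.compile(r"""[.!?…]["'”’)]?$""")


def restore_paragraphs(lines):
    """Merge OCR line fragments into paragraphs (hypothetical helper).

    A blank line is kept as an empty string so that scene breaks survive.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        text = raw.strip()
        if not text:
            # Blank line: close any open paragraph and record a scene break.
            if current:
                paragraphs.append(current)
                current = ""
            paragraphs.append("")
            continue
        # Glue the fragment onto the paragraph that is being built.
        current = f"{current} {text}" if current else text
        if SENTENCE_END.search(text):
            paragraphs.append(current)
            current = ""
    if current:
        # Keep a trailing fragment that never reached sentence-final punctuation.
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    sample = [
        "He walked to the",
        "door and opened it.",
        "",
        "Later that day he",
        "returned.",
    ]
    for paragraph in restore_paragraphs(sample):
        print(repr(paragraph))
```

A heuristic this simple splits a paragraph at every line that happens to end with a full stop, which is one reason the result still needs to be checked by hand.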

While the preparation procedure runs, a number of questions may pop up. If the file contains text boxes, the procedure recommends removing them, because text boxes are not supported.

Another question that may be asked is a consequence of OCR. If by chance the first character on a line is a hyphen or a number, Word automatically interprets the line as a list item, even if it is not one. The procedure checks for this and asks, for every occurrence, whether it is a list item or just a hyphen or number in a sentence.
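
The kind of line that triggers this question can be shown with a small Python sketch. This is an illustration under assumptions, not the check performed by the procedure: it works on plain-text lines rather than on Word list formatting, and the names `LIST_MARKER` and `find_list_candidates` are hypothetical.

```python
import re

# Assumption: a "list-like" line starts with a hyphen, an en dash or an em dash,
# or with a number (optionally followed by "." or ")"), and then whitespace.
LIST_MARKER = re.compile(r"^\s*(?:[-\u2013\u2014]|\d+[.)]?)\s+")


def find_list_candidates(lines):
    """Return (line_number, text) pairs for lines that look like list items."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if LIST_MARKER.match(line)
    ]


if __name__ == "__main__":
    sample = [
        "- Not at all, he said.",
        "He was born in",
        "1984 was a good year.",
        "Nothing here.",
    ]
    for number, line in find_list_candidates(sample):
        # Each hit would need a decision: real list item, or ordinary text?
        print(number, line)
```

Both hits in the sample are ordinary sentences, not list items, which is why each occurrence has to be confirmed individually.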