Document some more config options for tesseract

Clarify also the name(s) of the generated OCR result file(s):
Tesseract does not create a file named outbase.txt by default.

Fix also a sentence in the language section.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Stefan Weil 2018-10-05 15:45:45 +02:00
parent e03ee932d2
commit 383dcf70b5


@@ -34,7 +34,9 @@ IN/OUT ARGUMENTS
'outputbase'::
The basename of the output file (to which the appropriate extension
-will be appended). By default the output will be named 'outbase.txt'.
+will be appended). By default the output will be a text file
+with `.txt` added to the basename unless there are one or more
+'configfile' options which explicitly specify the desired output.
'stdout'::
Instruction to send output data to standard output
@@ -88,8 +90,19 @@ OPTIONS
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include: +
-* hocr - Output in hOCR format instead of as a text file.
-* pdf - Output in pdf instead of a text file.
+* `hocr` - Output in hOCR format (file extension `.hocr`).
+* `pdf` - Output PDF (file extension `.pdf`).
+* `tsv` - Output TSV (file extension `.tsv`).
+* `txt` - Output plain text (file extension `.txt`).
+* `get.images` - Write images.
+* `logfile` - Write debug file `tesseract.log`.
+* `lstm.train` - Used for LSTM training.
+* `makebox` - Output box file.
+* `quiet` - Write debug file to /dev/null.
+It is possible to select several config files, for example
+`tesseract image.png demo hocr pdf txt` will create three output files
+`demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
*Nota Bene:* The options `-l lang` and `--psm N` must occur
before any 'configfile'.
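
As a rough illustration of the option ordering and output naming described in this hunk, the following Python sketch assembles an argument list and predicts the result file names. The helper names `build_command` and `expected_outputs` are hypothetical and not part of tesseract. The extension table covers only the four config files whose extensions are listed above, and the fallback to `.txt` assumes that config files such as `quiet` do not select an output format. The sketch does not run tesseract.

```python
# Extensions taken from the config file list above. Other config files
# (e.g. `quiet`, `logfile`) are assumed not to select an output format.
OUTPUT_EXTENSIONS = {
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "txt": ".txt",
}


def build_command(image, outputbase, configfiles=(), lang=None, psm=None):
    """Build an argument list with `-l` and `--psm` before any config file."""
    cmd = ["tesseract", image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config files always go last, as the Nota Bene requires.
    cmd += list(configfiles)
    return cmd


def expected_outputs(outputbase, configfiles=()):
    """Predict output file names for the given config files (hypothetical helper)."""
    names = [
        outputbase + OUTPUT_EXTENSIONS[name]
        for name in configfiles
        if name in OUTPUT_EXTENSIONS
    ]
    # Assumption: with no output-selecting config file, plain text is written.
    return names or [outputbase + ".txt"]
```

For the command line quoted above, `expected_outputs("demo", ["hocr", "pdf", "txt"])` yields `demo.hocr`, `demo.pdf` and `demo.txt`, and `build_command("image.png", "demo", ["pdf"], lang="eng", psm=6)` places `-l eng --psm 6` ahead of `pdf`.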
@@ -122,7 +135,7 @@ LANGUAGES
The currently available traineddata files for tesseract 4.0
for the following languages are in
-(in https://github.com/tesseract-ocr/tessdata_fast):
+https://github.com/tesseract-ocr/tessdata_fast:
*afr* (Afrikaans),
*amh* (Amharic),