Merge pull request #2220 from cjmayo/man_config

Man page description of configs and parameters
zdenop 2019-02-16 13:53:47 +01:00 committed by GitHub
commit 48be357688

@@ -36,7 +36,7 @@ IN/OUT ARGUMENTS
 The basename of the output file (to which the appropriate extension
 will be appended). By default the output will be a text file
 with `.txt` added to the basename unless there are one or more
-'configfile' options which explicitly specify the desired output.
+parameters set which explicitly specify the desired output.
 'stdout'::
 Instruction to send output data to standard output.
@@ -54,7 +54,7 @@ OPTIONS
 Specify the location of user patterns file.
 '-c configvar=value'::
-Set value for control parameter. Multiple -c arguments are allowed.
+Set value for parameter 'configvar'. Multiple -c arguments are allowed.
 '-l lang'::
 The language to use. If none is specified, English is assumed.
@@ -86,20 +86,21 @@ OPTIONS
 3 = Default, based on what is available.
 'configfile'::
-The name of a config to use. A config is a plaintext file which
-contains a list of variables and their values, one per line, with a
-space separating variable from value. Interesting config files
-include: +
-* `alto` - Output in ALTO format (file extension `.xml`).
-* `hocr` - Output in hOCR format (file extension `.hocr`).
-* `pdf` - Output PDF (file extension `.pdf`).
-* `tsv` - Output TSV (file extension `.tsv`).
-* `txt` - Output plain text (file extension `.txt`).
-* `get.images` - Write images.
-* `logfile` - Write debug file `tesseract.log`.
-* `lstm.train` - Used for LSTM training.
-* `makebox` - Output box file.
-* `quiet` - Write debug file to /dev/null.
+The name of a config to use. A config is a plain text file which
+contains a list of parameters and their values, one per line,
+with a space separating parameter from value. +
+Interesting config files include:
+
+* `alto` - Output in ALTO format ('outputbase'`.xml`).
+* `hocr` - Output in hOCR format ('outputbase'`.hocr`).
+* `pdf` - Output PDF ('outputbase'`.pdf`).
+* `tsv` - Output TSV ('outputbase'`.tsv`).
+* `txt` - Output plain text ('outputbase'`.txt`).
+* `get.images` - Write processed input images to file (`tessinput.tif`).
+* `logfile` - Redirect debug messages to file (`tesseract.log`).
+* `lstm.train` - Output files used by LSTM training ('outputbase'`.lstmf`).
+* `makebox` - Write box file ('outputbase'`.box`).
+* `quiet` - Redirect debug messages to /dev/null.
 It is possible to select several config files, for example
 `tesseract image.png demo hocr pdf txt` will create three output files
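
Review note on the hunk above: a minimal Python sketch of how the output file names follow from 'outputbase' and the selected configs. The suffix table is copied from the new man page text. The helper name `expected_outputs` is hypothetical and not part of Tesseract, and the fallback to `.txt` when no output config is given is an assumption based on the IN/OUT ARGUMENTS wording.

```python
# Suffix each output config appends to 'outputbase', per the revised list.
CONFIG_SUFFIX = {
    "alto": ".xml",
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "txt": ".txt",
    "lstm.train": ".lstmf",
    "makebox": ".box",
}


def expected_outputs(outputbase, configs):
    """Hypothetical helper: file names the given configs should produce."""
    names = [outputbase + CONFIG_SUFFIX[c] for c in configs if c in CONFIG_SUFFIX]
    # Assumption: with no output-selecting config, plain text is written.
    return names or [outputbase + ".txt"]


# Mirrors `tesseract image.png demo hocr pdf txt` from the man page.
print(expected_outputs("demo", ["hocr", "pdf", "txt"]))
```

Configs such as `quiet` or `logfile` only redirect debug messages, so the sketch ignores them when deriving names.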
@@ -334,14 +335,14 @@ Tesseract 4 LSTM OCR engine.
 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
-Tesseract config files consist of lines with variable-value pairs (space
-separated). The variables are documented as flags in the source code like
+Tesseract config files consist of lines with parameter-value pairs (space
+separated). The parameters are documented as flags in the source code like
 the following one in tesseractclass.h:
 STRING_VAR_H(tessedit_char_blacklist, "",
 "Blacklist of chars not to recognize");
-These variables may enable or disable various features of the engine, and
+These parameters may enable or disable various features of the engine, and
 may cause it to load (or not load) various data. For instance, let's suppose
 you want to OCR in English, but suppress the normal dictionary and load an
 alternative word list and an alternative list of patterns -- these two files
@@ -371,8 +372,8 @@ load_freq_dawg F
 user_words_suffix user-words
 user_patterns_suffix user-patterns
-Now, if you pass the word 'bazaar' as a trailing command line parameter
-to Tesseract, Tesseract will not bother loading the system dictionary nor
+Now, if you pass the word 'bazaar' as a 'configfile' to Tesseract,
+Tesseract will not bother loading the system dictionary nor
 the dictionary of frequent words and will load and use the eng.user-words
 and eng.user-patterns files you provided. The former is a simple word list,
 one per line. The format of the latter is documented in dict/trie.h
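
Review note: the "parameter, space, value, one per line" format described in this change can be illustrated with a short Python sketch. This is not Tesseract's own parser. `parse_config` is a hypothetical helper, and the sample text contains only the 'bazaar' lines visible in this diff.

```python
def parse_config(text):
    """Hypothetical helper: parse 'parameter value' lines into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The first space separates the parameter name from its value.
        name, _, value = line.partition(" ")
        params[name] = value.strip()
    return params


# Only the config lines that appear in the diff context above.
bazaar = """\
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
"""

params = parse_config(bazaar)
# With language 'eng', the suffix yields the word list name from the man page.
print("eng." + params["user_words_suffix"])
```

The same name/value pairs could instead be given on the command line with repeated `-c configvar=value` arguments, as the OPTIONS hunk describes.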