Mirror of https://github.com/tesseract-ocr/tesseract.git (synced 2024-11-23 18:49:08 +08:00)
Merge pull request #2220 from cjmayo/man_config
Man page description of configs and parameters
commit 48be357688
@@ -36,7 +36,7 @@ IN/OUT ARGUMENTS
 The basename of the output file (to which the appropriate extension
 will be appended). By default the output will be a text file
 with `.txt` added to the basename unless there are one or more
-'configfile' options which explicitly specify the desired output.
+parameters set which explicitly specify the desired output.
 
 'stdout'::
 Instruction to send output data to standard output.
@@ -54,7 +54,7 @@ OPTIONS
 Specify the location of user patterns file.
 
 '-c configvar=value'::
-Set value for control parameter. Multiple -c arguments are allowed.
+Set value for parameter 'configvar'. Multiple -c arguments are allowed.
 
 '-l lang'::
 The language to use. If none is specified, English is assumed.
@@ -86,20 +86,21 @@ OPTIONS
 3 = Default, based on what is available.
 
 'configfile'::
-The name of a config to use. A config is a plaintext file which
-contains a list of variables and their values, one per line, with a
-space separating variable from value. Interesting config files
-include: +
-* `alto` - Output in ALTO format (file extension `.xml`).
-* `hocr` - Output in hOCR format (file extension `.hocr`).
-* `pdf` - Output PDF (file extension `.pdf`).
-* `tsv` - Output TSV (file extension `.tsv`).
-* `txt` - Output plain text (file extension `.txt`).
-* `get.images` - Write images.
-* `logfile` - Write debug file `tesseract.log`.
-* `lstm.train` - Used for LSTM training.
-* `makebox` - Output box file.
-* `quiet` - Write debug file to /dev/null.
+The name of a config to use. A config is a plain text file which
+contains a list of parameters and their values, one per line,
+with a space separating parameter from value. +
+Interesting config files include:
+
+* `alto` - Output in ALTO format ('outputbase'`.xml`).
+* `hocr` - Output in hOCR format ('outputbase'`.hocr`).
+* `pdf` - Output PDF ('outputbase'`.pdf`).
+* `tsv` - Output TSV ('outputbase'`.tsv`).
+* `txt` - Output plain text ('outputbase'`.txt`).
+* `get.images` - Write processed input images to file (`tessinput.tif`).
+* `logfile` - Redirect debug messages to file (`tesseract.log`).
+* `lstm.train` - Output files used by LSTM training ('outputbase'`.lstmf`).
+* `makebox` - Write box file ('outputbase'`.box`).
+* `quiet` - Redirect debug messages to /dev/null.
 
 It is possible to select several config files, for example
 `tesseract image.png demo hocr pdf txt` will create three output files
@@ -334,14 +335,14 @@ Tesseract 4 LSTM OCR engine.
 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
 
-Tesseract config files consist of lines with variable-value pairs (space
-separated). The variables are documented as flags in the source code like
+Tesseract config files consist of lines with parameter-value pairs (space
+separated). The parameters are documented as flags in the source code like
 the following one in tesseractclass.h:
 
 STRING_VAR_H(tessedit_char_blacklist, "",
 "Blacklist of chars not to recognize");
 
-These variables may enable or disable various features of the engine, and
+These parameters may enable or disable various features of the engine, and
 may cause it to load (or not load) various data. For instance, let's suppose
 you want to OCR in English, but suppress the normal dictionary and load an
 alternative word list and an alternative list of patterns -- these two files
@@ -371,8 +372,8 @@ load_freq_dawg F
 user_words_suffix user-words
 user_patterns_suffix user-patterns
 
-Now, if you pass the word 'bazaar' as a trailing command line parameter
-to Tesseract, Tesseract will not bother loading the system dictionary nor
+Now, if you pass the word 'bazaar' as a 'configfile' to Tesseract,
+Tesseract will not bother loading the system dictionary nor
 the dictionary of frequent words and will load and use the eng.user-words
 and eng.user-patterns files you provided. The former is a simple word list,
 one per line. The format of the latter is documented in dict/trie.h
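
Reviewer note: the config format described in the patch (one parameter per line, a space separating parameter from value) can be illustrated with a small Python sketch. This is not Tesseract's own parser. The helper names `parse_config` and `to_c_args` are hypothetical, and the sample text contains only the three `bazaar` lines visible in the last hunk, so it may not be the complete file. Whether every parameter behaves the same when passed with `-c` as when read from a config file is an assumption the patch does not address.

```python
def parse_config(text):
    """Parse 'parameter value' lines into a dict (hypothetical helper)."""
    params = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params


def to_c_args(params):
    """Render parameters as repeated '-c configvar=value' arguments."""
    args = []
    for name, value in params.items():
        args.extend(["-c", f"{name}={value}"])
    return args


# Only the lines of the 'bazaar' example that appear in the diff context.
BAZAAR = """\
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
"""

config = parse_config(BAZAAR)
print(config)
print(to_c_args(config))
```

The second helper reflects the man page statement that multiple -c arguments are allowed; it only builds an argument list and does not invoke Tesseract.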