Mirror of https://github.com/tesseract-ocr/tesseract.git (synced 2024-11-24 02:59:07 +08:00)
Merge pull request #2220 from cjmayo/man_config
Man page description of configs and parameters
commit 48be357688
@@ -36,7 +36,7 @@ IN/OUT ARGUMENTS
 The basename of the output file (to which the appropriate extension
 will be appended). By default the output will be a text file
 with `.txt` added to the basename unless there are one or more
-'configfile' options which explicitly specify the desired output.
+parameters set which explicitly specify the desired output.
 
 'stdout'::
 Instruction to send output data to standard output.
@@ -54,7 +54,7 @@ OPTIONS
 Specify the location of user patterns file.
 
 '-c configvar=value'::
-Set value for control parameter. Multiple -c arguments are allowed.
+Set value for parameter 'configvar'. Multiple -c arguments are allowed.
 
 '-l lang'::
 The language to use. If none is specified, English is assumed.
@@ -86,20 +86,21 @@ OPTIONS
 3 = Default, based on what is available.
 
 'configfile'::
-The name of a config to use. A config is a plaintext file which
-contains a list of variables and their values, one per line, with a
-space separating variable from value. Interesting config files
-include: +
-* `alto` - Output in ALTO format (file extension `.xml`).
-* `hocr` - Output in hOCR format (file extension `.hocr`).
-* `pdf` - Output PDF (file extension `.pdf`).
-* `tsv` - Output TSV (file extension `.tsv`).
-* `txt` - Output plain text (file extension `.txt`).
-* `get.images` - Write images.
-* `logfile` - Write debug file `tesseract.log`.
-* `lstm.train` - Used for LSTM training.
-* `makebox` - Output box file.
-* `quiet` - Write debug file to /dev/null.
+The name of a config to use. A config is a plain text file which
+contains a list of parameters and their values, one per line,
+with a space separating parameter from value. +
+Interesting config files include:
+
+* `alto` - Output in ALTO format ('outputbase'`.xml`).
+* `hocr` - Output in hOCR format ('outputbase'`.hocr`).
+* `pdf` - Output PDF ('outputbase'`.pdf`).
+* `tsv` - Output TSV ('outputbase'`.tsv`).
+* `txt` - Output plain text ('outputbase'`.txt`).
+* `get.images` - Write processed input images to file (`tessinput.tif`).
+* `logfile` - Redirect debug messages to file (`tesseract.log`).
+* `lstm.train` - Output files used by LSTM training ('outputbase'`.lstmf`).
+* `makebox` - Write box file ('outputbase'`.box`).
+* `quiet` - Redirect debug messages to /dev/null.
 
 It is possible to select several config files, for example
 `tesseract image.png demo hocr pdf txt` will create three output files
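As a reading aid for the revised list in this hunk, the mapping from config names to 'outputbase' file names can be sketched in Python. The helper `expected_outputs` and the table `CONFIG_EXTENSIONS` are hypothetical names introduced here, not part of Tesseract. The extensions are copied only from the new man page text, and falling back to `.txt` when no listed config names an output is an assumption based on the IN/OUT ARGUMENTS wording.

```python
# Extensions copied from the revised man page list. Configs without a
# documented 'outputbase' extension (get.images, logfile, quiet) are omitted.
CONFIG_EXTENSIONS = {
    "alto": ".xml",
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "txt": ".txt",
    "lstm.train": ".lstmf",
    "makebox": ".box",
}


def expected_outputs(outputbase, configs):
    """Return the output file names implied by the given config names.

    Hypothetical helper for illustration only; it does not run Tesseract.
    """
    outputs = [outputbase + CONFIG_EXTENSIONS[name]
               for name in configs if name in CONFIG_EXTENSIONS]
    # Assumption: with no output-selecting config, the default is a
    # text file with `.txt` added to the basename.
    return outputs or [outputbase + ".txt"]


# Mirrors the man page example `tesseract image.png demo hocr pdf txt`.
print(expected_outputs("demo", ["hocr", "pdf", "txt"]))
# ['demo.hocr', 'demo.pdf', 'demo.txt']
```

The sketch only predicts file names from the documented extensions; the files actually written depend on the installed configs and Tesseract version.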
@@ -334,14 +335,14 @@ Tesseract 4 LSTM OCR engine.
 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
 
-Tesseract config files consist of lines with variable-value pairs (space
-separated). The variables are documented as flags in the source code like
+Tesseract config files consist of lines with parameter-value pairs (space
+separated). The parameters are documented as flags in the source code like
 the following one in tesseractclass.h:
 
 STRING_VAR_H(tessedit_char_blacklist, "",
 "Blacklist of chars not to recognize");
 
-These variables may enable or disable various features of the engine, and
+These parameters may enable or disable various features of the engine, and
 may cause it to load (or not load) various data. For instance, let's suppose
 you want to OCR in English, but suppress the normal dictionary and load an
 alternative word list and an alternative list of patterns -- these two files
@@ -371,8 +372,8 @@ load_freq_dawg F
 user_words_suffix user-words
 user_patterns_suffix user-patterns
 
-Now, if you pass the word 'bazaar' as a trailing command line parameter
-to Tesseract, Tesseract will not bother loading the system dictionary nor
+Now, if you pass the word 'bazaar' as a 'configfile' to Tesseract,
+Tesseract will not bother loading the system dictionary nor
 the dictionary of frequent words and will load and use the eng.user-words
 and eng.user-patterns files you provided. The former is a simple word list,
 one per line. The format of the latter is documented in dict/trie.h
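The config format this change documents (one parameter per line, with a space separating parameter from value) can be illustrated with a minimal Python sketch. The function `parse_config` is a hypothetical helper, not Tesseract's own parser, and the sample text contains only the three config lines visible in this diff, not the complete 'bazaar' file.

```python
def parse_config(text):
    """Parse parameter-value lines into a dict of strings.

    Hypothetical reader for illustration; Tesseract's real parser may
    treat edge cases differently.
    """
    params = {}
    for line in text.splitlines():
        # Split into at most two fields: the parameter name and its value.
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params


# Only the lines of the example config that are visible in this diff.
BAZAAR_EXCERPT = """\
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
"""

config = parse_config(BAZAAR_EXCERPT)
print(config["user_words_suffix"])  # user-words
```

Values are kept as strings here because the man page text describes only the line layout, not how each parameter type is interpreted.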