Merge pull request #2331 from stweil/doc

Improve man page for tesseract and add Makefile rule for PDF
zdenop 2019-03-16 10:26:16 +01:00 committed by GitHub
commit 0b72f4b722
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 190 additions and 156 deletions

@@ -33,8 +33,9 @@ EXTRA_DIST = $(man_MANS) Doxyfile
.PHONY: html .PHONY: html
html: ${man_MANS:%=%.html} html: ${man_MANS:%=%.html}
pdf: ${man_MANS:%=%.pdf}
SUFFIXES = .asc .html SUFFIXES = .asc .html .pdf
.asc: .asc:
-asciidoc -b docbook -d manpage -o - $< | \ -asciidoc -b docbook -d manpage -o - $< | \
@@ -43,6 +44,10 @@ SUFFIXES = .asc .html
.asc.html: .asc.html:
asciidoc -b html5 -o $@ $< asciidoc -b html5 -o $@ $<
.asc.pdf:
asciidoc -b docbook -d manpage -o $*.dbk $<
docbook2pdf $*.dbk
MAINTAINERCLEANFILES = $(man_MANS) Doxyfile MAINTAINERCLEANFILES = $(man_MANS) Doxyfile
endif endif
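
The new `.asc.pdf` suffix rule chains two commands: `asciidoc` writes a DocBook file next to the source, then `docbook2pdf` converts that file. As a rough sketch of what the rule expands to for a single man page, the Python below builds the same two command lines. The helper name `pdf_commands` is hypothetical and not part of this change, and the sketch assumes, as the rule itself does, that `asciidoc` and `docbook2pdf` are installed.

```python
from pathlib import Path


def pdf_commands(asc_file):
    """Return the two commands the .asc.pdf rule would run for asc_file.

    Hypothetical helper for illustration only: it builds argument lists
    and does not execute anything.
    """
    source = Path(asc_file)
    # $* in the Makefile rule is the target stem, i.e. the name without ".asc".
    dbk = source.with_suffix(".dbk")
    return [
        ["asciidoc", "-b", "docbook", "-d", "manpage", "-o", str(dbk), str(source)],
        ["docbook2pdf", str(dbk)],
    ]


if __name__ == "__main__":
    for argv in pdf_commands("tesseract.1.asc"):
        print(" ".join(argv))
```

Each returned list could be passed to `subprocess.run(argv, check=True)` to reproduce the rule outside of `make`.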

@@ -8,7 +8,7 @@ tesseract - command-line OCR engine
SYNOPSIS SYNOPSIS
-------- --------
*tesseract* 'imagename'|'listname'|'stdin' 'outputbase'|'stdout' [options...] [configfile...] *tesseract* 'FILE' 'OUTPUTBASE' ['OPTIONS']... ['CONFIGFILE']...
DESCRIPTION DESCRIPTION
----------- -----------
@@ -20,50 +20,46 @@ at Google since then.
IN/OUT ARGUMENTS IN/OUT ARGUMENTS
---------------- ----------------
'imagename':: 'FILE'::
The name of the input image. Most image file formats (anything The name of the input file.
readable by Leptonica) are supported. This can either be an image file or a text file. +
Most image file formats (anything readable by Leptonica) are supported. +
A text file lists the names of all input images (one image name per line).
The results will be combined in a single file for each output file format
(txt, pdf, hocr, xml). +
If 'FILE' is `stdin` or `-` then the standard input is used.
'listname':: 'OUTPUTBASE'::
The name of a text file which lists the names of all input images
(one image name per line). The results will be combined in a
single file for each output file format (txt, pdf, hocr).
'stdin'::
Instruction to read data from standard input.
'outputbase'::
The basename of the output file (to which the appropriate extension The basename of the output file (to which the appropriate extension
will be appended). By default the output will be a text file will be appended). By default the output will be a text file
with `.txt` added to the basename unless there are one or more with `.txt` added to the basename unless there are one or more
parameters set which explicitly specify the desired output. parameters set which explicitly specify the desired output. +
If 'OUTPUTBASE' is `stdout` or `-` then the standard output is used.
'stdout'::
Instruction to send output data to standard output.
[[TESSDATADIR]]
OPTIONS OPTIONS
------- -------
'--tessdata-dir /path':: *-c* 'CONFIGVAR=VALUE'::
Specify the location of tessdata path. Set value for parameter 'CONFIGVAR' to VALUE. Multiple *-c* arguments are allowed.
'--user-words /path/to/file':: *--dpi* 'N'::
Specify the location of user words file. Specify the resolution 'N' in DPI for the input image(s).
A typical value for 'N' is `300`. Without this option,
the resolution is read from the metadata included in the image.
If an image does not include that information, Tesseract tries to guess it.
'--user-patterns /path/to/file':: *-l* 'LANG'::
Specify the location of user patterns file. *-l* 'SCRIPT'::
The language or script to use.
'-c configvar=value':: If none is specified, `eng` (English) is assumed.
Set value for parameter 'configvar'. Multiple -c arguments are allowed.
'-l lang'::
The language to use. If none is specified, English is assumed.
Multiple languages may be specified, separated by plus characters. Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) Tesseract uses 3-character ISO 639-2 language codes
(see <<LANGUAGES,*LANGUAGES AND SCRIPTS*>>).
'--psm N':: *--psm* 'N'::
Set Tesseract to only run a subset of layout analysis and assume Set Tesseract to only run a subset of layout analysis and assume
a certain form of image. The options for *N* are: a certain form of image. The options for 'N' are:
0 = Orientation and script detection (OSD) only. 0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD. 1 = Automatic page segmentation with OSD.
@@ -76,72 +72,87 @@ OPTIONS
8 = Treat the image as a single word. 8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle. 9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character. 10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
'--oem N':: *--oem* 'N'::
Specify OCR Engine mode. The options for *N* are: Specify OCR Engine mode. The options for 'N' are:
0 = Original Tesseract only. 0 = Original Tesseract only.
1 = Neural nets LSTM only. 1 = Neural nets LSTM only.
2 = Tesseract + LSTM. 2 = Tesseract + LSTM.
3 = Default, based on what is available. 3 = Default, based on what is available.
'configfile':: *--tessdata-dir* 'PATH'::
The name of a config to use. The name can be a file in tessdata/configs Specify the location of tessdata path.
or tessdata/tessconfigs, or an absolute or relative file path.
*--user-patterns* 'FILE'::
Specify the location of user patterns file.
*--user-words* 'FILE'::
Specify the location of user words file.
[[CONFIGFILE]]
'CONFIGFILE'::
The name of a config to use. The name can be a file in `tessdata/configs`
or `tessdata/tessconfigs`, or an absolute or relative file path.
A config is a plain text file which contains a list of parameters and A config is a plain text file which contains a list of parameters and
their values, one per line, with a space separating parameter from value. + their values, one per line, with a space separating parameter from value. +
Interesting config files include: Interesting config files include:
* `alto` - Output in ALTO format ('outputbase'`.xml`). * *alto* -- Output in ALTO format ('OUTPUTBASE'`.xml`).
* `hocr` - Output in hOCR format ('outputbase'`.hocr`). * *hocr* -- Output in hOCR format ('OUTPUTBASE'`.hocr`).
* `pdf` - Output PDF ('outputbase'`.pdf`). * *pdf* -- Output PDF ('OUTPUTBASE'`.pdf`).
* `tsv` - Output TSV ('outputbase'`.tsv`). * *tsv* -- Output TSV ('OUTPUTBASE'`.tsv`).
* `txt` - Output plain text ('outputbase'`.txt`). * *txt* -- Output plain text ('OUTPUTBASE'`.txt`).
* `get.images` - Write processed input images to file (`tessinput.tif`). * *get.images* -- Write processed input images to file (`tessinput.tif`).
* `logfile` - Redirect debug messages to file (`tesseract.log`). * *logfile* -- Redirect debug messages to file (`tesseract.log`).
* `lstm.train` - Output files used by LSTM training ('outputbase'`.lstmf`). * *lstm.train* -- Output files used by LSTM training ('OUTPUTBASE'`.lstmf`).
* `makebox` - Write box file ('outputbase'`.box`). * *makebox* -- Write box file ('OUTPUTBASE'`.box`).
* `quiet` - Redirect debug messages to /dev/null. * *quiet* -- Redirect debug messages to '/dev/null'.
It is possible to select several config files, for example It is possible to select several config files, for example
`tesseract image.png demo hocr pdf txt` will create three output files `tesseract image.png demo alto hocr pdf txt` will create four output files
`demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results. `demo.xml`, `demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
*Nota Bene:* The options `-l lang` and `--psm N` must occur *Nota bene:* The options *-l* 'LANG', *-l* 'SCRIPT' and *--psm* 'N'
before any 'configfile'. must occur before any 'CONFIGFILE'.
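
The ordering rule in the note above (options such as *-l* and *--psm* first, config file names last) is easy to get wrong when a command line is assembled by a script. A minimal sketch in Python, assuming a hypothetical helper `build_tesseract_argv` that is not part of Tesseract, could look like this:

```python
def build_tesseract_argv(file, outputbase, lang=None, psm=None, configs=()):
    """Build a tesseract argument list with options ahead of config files.

    Hypothetical helper for illustration; it only assembles the list and
    does not check that the language or config files exist.
    """
    argv = ["tesseract", file, outputbase]
    if lang is not None:
        argv += ["-l", lang]
    if psm is not None:
        argv += ["--psm", str(psm)]
    # Config file names always go last, after -l and --psm.
    argv += list(configs)
    return argv


if __name__ == "__main__":
    print(" ".join(build_tesseract_argv(
        "image.png", "demo", lang="eng+deu", psm=11,
        configs=["hocr", "pdf", "txt"])))
```

Keeping the config names in one trailing sequence means a caller cannot accidentally place an option after them.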
SINGLE OPTIONS SINGLE OPTIONS
-------------- --------------
'-h, --help':: *-h, --help*::
Show help message. Show help message.
'--help-extra':: *--help-extra*::
Show extra help for advanced users. Show extra help for advanced users.
'--help-psm':: *--help-psm*::
Show page segmentation modes. Show page segmentation modes.
'--help-oem':: *--help-oem*::
Show OCR Engine modes. Show OCR Engine modes.
'-v, --version':: *-v, --version*::
Returns the current version of the tesseract(1) executable. Returns the current version of the tesseract(1) executable.
'--list-langs':: *--list-langs*::
List available languages for tesseract engine. Can be used with `--tessdata-dir`. List available languages for tesseract engine.
Can be used with *--tessdata-dir* 'PATH'.
'--print-parameters':: *--print-parameters*::
Print tesseract parameters. Print tesseract parameters.
[[LANGUAGES]]
LANGUAGES AND SCRIPTS LANGUAGES AND SCRIPTS
--------------------- ---------------------
To recognize some text with Tesseract, it is normally necessary to specify To recognize some text with Tesseract, it is normally necessary to specify
the language(s) or script of the text (unless it is English text which is the language(s) or script(s) of the text (unless it is English text which is
supported by default) using `-l lang`. supported by default) using *-l* 'LANG' or *-l* 'SCRIPT'.
Selecting a language automatically also selects the language specific Selecting a language automatically also selects the language specific
character set and dictionary (word list). character set and dictionary (word list).
@@ -153,6 +164,9 @@ In most cases, a script also supports English.
So it is possible to recognize a language that has not been specifically So it is possible to recognize a language that has not been specifically
trained for by using traineddata for the script it is written in. trained for by using traineddata for the script it is written in.
More than one language or script may be specified by using `+`.
Example: `tesseract myimage.png myimage -l eng+deu+fra`.
https://github.com/tesseract-ocr/tessdata_fast provides fast language and https://github.com/tesseract-ocr/tessdata_fast provides fast language and
script models which are also part of Linux distributions. script models which are also part of Linux distributions.
@@ -174,16 +188,16 @@ following languages:
*cat* (Catalan; Valencian), *cat* (Catalan; Valencian),
*ceb* (Cebuano), *ceb* (Cebuano),
*ces* (Czech), *ces* (Czech),
*chi_sim* (Chinese - Simplified), *chi_sim* (Chinese simplified),
*chi_tra* (Chinese - Traditional), *chi_tra* (Chinese traditional),
*chr* (Cherokee), *chr* (Cherokee),
*cym* (Welsh), *cym* (Welsh),
*dan* (Danish), *dan* (Danish),
*deu* (German), *deu* (German),
*dzo* (Dzongkha), *dzo* (Dzongkha),
*ell* (Greek, Modern (1453-)), *ell* (Greek, Modern, 1453-),
*eng* (English), *eng* (English),
*enm* (English, Middle (1100-1500)), *enm* (English, Middle, 1100-1500),
*epo* (Esperanto), *epo* (Esperanto),
*equ* (Math / equation detection module), *equ* (Math / equation detection module),
*est* (Estonian), *est* (Estonian),
@@ -192,10 +206,10 @@ following languages:
*fin* (Finnish), *fin* (Finnish),
*fra* (French), *fra* (French),
*frk* (Frankish), *frk* (Frankish),
*frm* (French, Middle (ca.1400-1600)), *frm* (French, Middle, ca.1400-1600),
*gle* (Irish), *gle* (Irish),
*glg* (Galician), *glg* (Galician),
*grc* (Greek, Ancient (to 1453)), *grc* (Greek, Ancient, to 1453),
*guj* (Gujarati), *guj* (Gujarati),
*hat* (Haitian; Haitian Creole), *hat* (Haitian; Haitian Creole),
*heb* (Hebrew), *heb* (Hebrew),
@@ -215,9 +229,9 @@ following languages:
*kaz* (Kazakh), *kaz* (Kazakh),
*khm* (Central Khmer), *khm* (Central Khmer),
*kir* (Kirghiz; Kyrgyz), *kir* (Kirghiz; Kyrgyz),
*kmr* (Kurdish Kurmanji),
*kor* (Korean), *kor* (Korean),
*kor_vert* (Korean (vertical)), *kor_vert* (Korean vertical),
*kmr* (Kurdish (Kurmanji)),
*kur* (Kurdish), *kur* (Kurdish),
*lao* (Lao), *lao* (Lao),
*lat* (Latin), *lat* (Latin),
@@ -235,7 +249,7 @@ following languages:
*nep* (Nepali), *nep* (Nepali),
*nld* (Dutch; Flemish), *nld* (Dutch; Flemish),
*nor* (Norwegian), *nor* (Norwegian),
*oci* (Occitan (post 1500)), *oci* (Occitan post 1500),
*ori* (Oriya), *ori* (Oriya),
*osd* (Orientation and script detection module), *osd* (Orientation and script detection module),
*pan* (Panjabi; Punjabi), *pan* (Panjabi; Punjabi),
@@ -277,51 +291,51 @@ following languages:
*yid* (Yiddish), *yid* (Yiddish),
*yor* (Yoruba) *yor* (Yoruba)
To use a non-standard language pack named *foo.traineddata*, set the To use a non-standard language pack named `foo.traineddata`, set the
*TESSDATA_PREFIX* environment variable so the file can be found at `TESSDATA_PREFIX` environment variable so the file can be found at
*TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the `TESSDATA_PREFIX/tessdata/foo.traineddata` and give Tesseract the
argument `-l foo`. argument *-l* `foo`.
For Tesseract 4, `tessdata_fast` includes traineddata files for the For Tesseract 4, `tessdata_fast` includes traineddata files for the
following scripts: following scripts:
Arabic, *Arabic*,
Armenian, *Armenian*,
Bengali, *Bengali*,
Canadian Aboriginal, *Canadian_Aboriginal*,
Cherokee, *Cherokee*,
Cyrillic, *Cyrillic*,
Devanagari, *Devanagari*,
Ethiopic, *Ethiopic*,
Fraktur, *Fraktur*,
Georgian, *Georgian*,
Greek, *Greek*,
Gujarati, *Gujarati*,
Gurmukhi, *Gurmukhi*,
Han - Simplified, *HanS* (Han simplified),
Han - Simplified (vertical), *HanS_vert* (Han simplified, vertical),
Han - Traditional, *HanT* (Han traditional),
Han - Traditional (vertical), *HanT_vert* (Han traditional, vertical),
Hangul, *Hangul*,
Hangul (vertical), *Hangul_vert* (Hangul vertical),
Hebrew, *Hebrew*,
Japanese, *Japanese*,
Japanese (vertical), *Japanese_vert* (Japanese vertical),
Kannada, *Kannada*,
Khmer, *Khmer*,
Lao, *Lao*,
Latin, *Latin*,
Malayalam, *Malayalam*,
Myanmar, *Myanmar*,
Oriya (Odia), *Oriya* (Odia),
Sinhala, *Sinhala*,
Syriac, *Syriac*,
Tamil, *Tamil*,
Telugu, *Telugu*,
Thaana, *Thaana*,
Thai, *Thai*,
Tibetan, *Tibetan*,
Vietnamese. *Vietnamese*.
The same languages and scripts are available from The same languages and scripts are available from
https://github.com/tesseract-ocr/tessdata_best. https://github.com/tesseract-ocr/tessdata_best.
@@ -343,8 +357,8 @@ Tesseract config files consist of lines with parameter-value pairs (space
separated). The parameters are documented as flags in the source code like separated). The parameters are documented as flags in the source code like
the following one in tesseractclass.h: the following one in tesseractclass.h:
STRING_VAR_H(tessedit_char_blacklist, "", `STRING_VAR_H(tessedit_char_blacklist, "",
"Blacklist of chars not to recognize"); "Blacklist of chars not to recognize");`
These parameters may enable or disable various features of the engine, and These parameters may enable or disable various features of the engine, and
may cause it to load (or not load) various data. For instance, let's suppose may cause it to load (or not load) various data. For instance, let's suppose
@@ -352,10 +366,10 @@ you want to OCR in English, but suppress the normal dictionary and load an
alternative word list and an alternative list of patterns -- these two files alternative word list and an alternative list of patterns -- these two files
are the most commonly used extra data files. are the most commonly used extra data files.
If your language pack is in /path/to/eng.traineddata and the hocr config If your language pack is in '/path/to/eng.traineddata' and the hocr config
is in /path/to/configs/hocr then create three new files: is in '/path/to/configs/hocr' then create three new files:
/path/to/eng.user-words: '/path/to/eng.user-words':
[verse] [verse]
the the
quick quick
@@ -363,25 +377,39 @@ brown
fox fox
jumped jumped
'/path/to/eng.user-patterns':
/path/to/eng.user-patterns:
[verse] [verse]
1-\d\d\d-GOOG-411 1-\d\d\d-GOOG-411
www.\n\\\*.com www.\n\\\*.com
/path/to/configs/bazaar: '/path/to/configs/bazaar':
[verse] [verse]
load_system_dawg F load_system_dawg F
load_freq_dawg F load_freq_dawg F
user_words_suffix user-words user_words_suffix user-words
user_patterns_suffix user-patterns user_patterns_suffix user-patterns
Now, if you pass the word 'bazaar' as a 'configfile' to Tesseract, Now, if you pass the word 'bazaar' as a <<CONFIGFILE,'CONFIGFILE'>> to
Tesseract will not bother loading the system dictionary nor Tesseract, Tesseract will not bother loading the system dictionary nor
the dictionary of frequent words and will load and use the eng.user-words the dictionary of frequent words and will load and use the 'eng.user-words'
and eng.user-patterns files you provided. The former is a simple word list, and 'eng.user-patterns' files you provided. The former is a simple word list,
one per line. The format of the latter is documented in dict/trie.h one per line. The format of the latter is documented in 'dict/trie.h'
on read_pattern_list(). on 'read_pattern_list()'.
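
The config format described above (one parameter per line, a space between the parameter and its value) can be checked with a few lines of Python before handing a file to Tesseract. This is only a sketch: `parse_config` is a hypothetical helper, and it does not attempt to mirror whatever else Tesseract's own config reader may accept.

```python
def parse_config(text):
    """Parse 'parameter value' lines into a dict (illustrative only)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first space separates the parameter name from its value.
        name, _, value = line.partition(" ")
        params[name] = value.strip()
    return params


# Contents of the example config '/path/to/configs/bazaar' from the man page.
BAZAAR = """\
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
"""

if __name__ == "__main__":
    for name, value in parse_config(BAZAAR).items():
        print(f"{name} = {value}")
```

A check like this catches a missing value or a stray tab before the config is passed as a 'CONFIGFILE' argument.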
ENVIRONMENT VARIABLES
---------------------
*`TESSDATA_PREFIX`*::
If the `TESSDATA_PREFIX` is set to a path, then that path is used to
find the `tessdata` directory with language and script recognition
models and config files.
Using <<TESSDATADIR,*--tessdata-dir* 'PATH'>> is the recommended alternative.
*`OMP_THREAD_LIMIT`*::
If the `tesseract` executable was built with multithreading support,
it will normally use four CPU cores for the OCR process. While this
can be faster for a single image, it gives bad performance if the host
computer provides fewer than four CPU cores or if OCR is run on many images.
Only a single CPU core is used with `OMP_THREAD_LIMIT=1`.
HISTORY HISTORY
@@ -391,7 +419,7 @@ Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some $$C++$$izing in 1998. A changes made in 1996 to port to Windows, and some $$C++$$izing in 1998. A
lot of the code was written in C, and then some more was written in $$C++$$. lot of the code was written in C, and then some more was written in $$C++$$.
The $$C++$$ code makes heavy use of a list system using macros. This predates The $$C++$$ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has STL, was portable before STL, and is more efficient than STL lists, but has
the big negative that if you do get a segmentation violation, it is hard to the big negative that if you do get a segmentation violation, it is hard to
debug. debug.
@@ -399,7 +427,8 @@ Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract. to train Tesseract.
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>. With Tesseract 2.00, See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>.
Since Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests. scripts are now included to allow anyone to reproduce some of these tests.
See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
details. details.