Merge pull request #2331 from stweil/doc

Improve man page for tesseract and add Makefile rule for PDF
zdenop 2019-03-16 10:26:16 +01:00 committed by GitHub
commit 0b72f4b722
2 changed files with 190 additions and 156 deletions

@ -33,8 +33,9 @@ EXTRA_DIST = $(man_MANS) Doxyfile
.PHONY: html
html: ${man_MANS:%=%.html}
pdf: ${man_MANS:%=%.pdf}
SUFFIXES = .asc .html
SUFFIXES = .asc .html .pdf
.asc:
-asciidoc -b docbook -d manpage -o - $< | \
@ -43,6 +44,10 @@ SUFFIXES = .asc .html
.asc.html:
asciidoc -b html5 -o $@ $<
.asc.pdf:
asciidoc -b docbook -d manpage -o $*.dbk $<
docbook2pdf $*.dbk
MAINTAINERCLEANFILES = $(man_MANS) Doxyfile
endif
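
The `.asc.pdf` suffix rule above runs two commands in sequence: `asciidoc` writes a DocBook file named after the target stem (`$*.dbk`), and `docbook2pdf` then converts that file. As a rough illustration only, the Python sketch below rebuilds those two command lines for a given source file without running them. The function name `pdf_build_commands` and the file name `tesseract.1.asc` are assumptions made for this example and are not part of the commit.

```python
from __future__ import annotations

from pathlib import Path


def pdf_build_commands(asc_file: str) -> list[list[str]]:
    """Return the two commands of the .asc.pdf suffix rule, in execution order.

    Hypothetical helper for illustration; it only builds argument lists.
    """
    source = Path(asc_file)
    # In the Makefile rule, $* is the stem: the source name without ".asc".
    docbook = source.with_suffix(".dbk")
    return [
        # Mirrors: asciidoc -b docbook -d manpage -o $*.dbk $<
        ["asciidoc", "-b", "docbook", "-d", "manpage", "-o", str(docbook), str(source)],
        # Mirrors: docbook2pdf $*.dbk
        ["docbook2pdf", str(docbook)],
    ]


if __name__ == "__main__":
    # "tesseract.1.asc" is an assumed file name, used only as sample input.
    for command in pdf_build_commands("tesseract.1.asc"):
        print(" ".join(command))
```

Printing the commands instead of executing them keeps the sketch usable on machines where `asciidoc` and `docbook2pdf` are not installed.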

@ -8,7 +8,7 @@ tesseract - command-line OCR engine
SYNOPSIS
--------
*tesseract* 'imagename'|'listname'|'stdin' 'outputbase'|'stdout' [options...] [configfile...]
*tesseract* 'FILE' 'OUTPUTBASE' ['OPTIONS']... ['CONFIGFILE']...
DESCRIPTION
-----------
@ -20,128 +20,139 @@ at Google since then.
IN/OUT ARGUMENTS
----------------
'imagename'::
The name of the input image. Most image file formats (anything
readable by Leptonica) are supported.
'FILE'::
The name of the input file.
This can either be an image file or a text file. +
Most image file formats (anything readable by Leptonica) are supported. +
A text file lists the names of all input images (one image name per line).
The results will be combined in a single file for each output file format
(txt, pdf, hocr, xml). +
If 'FILE' is `stdin` or `-` then the standard input is used.
'listname'::
The name of a text file which lists the names of all input images
(one image name per line). The results will be combined in a
single file for each output file format (txt, pdf, hocr).
'stdin'::
Instruction to read data from standard input.
'outputbase'::
The basename of the output file (to which the appropriate extension
will be appended). By default the output will be a text file
with `.txt` added to the basename unless there are one or more
parameters set which explicitly specify the desired output.
'stdout'::
Instruction to send output data to standard output.
'OUTPUTBASE'::
The basename of the output file (to which the appropriate extension
will be appended). By default the output will be a text file
with `.txt` added to the basename unless there are one or more
parameters set which explicitly specify the desired output. +
If 'OUTPUTBASE' is `stdout` or `-` then the standard output is used.
[[TESSDATADIR]]
OPTIONS
-------
'--tessdata-dir /path'::
Specify the location of tessdata path.
*-c* 'CONFIGVAR=VALUE'::
Set value for parameter 'CONFIGVAR' to VALUE. Multiple *-c* arguments are allowed.
'--user-words /path/to/file'::
Specify the location of user words file.
*--dpi* 'N'::
Specify the resolution 'N' in DPI for the input image(s).
A typical value for 'N' is `300`. Without this option,
the resolution is read from the metadata included in the image.
If an image does not include that information, Tesseract tries to guess it.
'--user-patterns /path/to/file'::
Specify the location of user patterns file.
*-l* 'LANG'::
*-l* 'SCRIPT'::
The language or script to use.
If none is specified, `eng` (English) is assumed.
Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes
(see <<LANGUAGES,*LANGUAGES AND SCRIPTS*>>).
'-c configvar=value'::
Set value for parameter 'configvar'. Multiple -c arguments are allowed.
*--psm* 'N'::
Set Tesseract to only run a subset of layout analysis and assume
a certain form of image. The options for 'N' are:
'-l lang'::
The language to use. If none is specified, English is assumed.
Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
'--psm N'::
Set Tesseract to only run a subset of layout analysis and assume
a certain form of image. The options for *N* are:
*--oem* 'N'::
Specify OCR Engine mode. The options for 'N' are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.
'--oem N'::
Specify OCR Engine mode. The options for *N* are:
*--tessdata-dir* 'PATH'::
Specify the location of tessdata path.
0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.
*--user-patterns* 'FILE'::
Specify the location of user patterns file.
'configfile'::
The name of a config to use. The name can be a file in tessdata/configs
or tessdata/tessconfigs, or an absolute or relative file path.
A config is a plain text file which contains a list of parameters and
their values, one per line, with a space separating parameter from value. +
Interesting config files include:
*--user-words* 'FILE'::
Specify the location of user words file.
* `alto` - Output in ALTO format ('outputbase'`.xml`).
* `hocr` - Output in hOCR format ('outputbase'`.hocr`).
* `pdf` - Output PDF ('outputbase'`.pdf`).
* `tsv` - Output TSV ('outputbase'`.tsv`).
* `txt` - Output plain text ('outputbase'`.txt`).
* `get.images` - Write processed input images to file (`tessinput.tif`).
* `logfile` - Redirect debug messages to file (`tesseract.log`).
* `lstm.train` - Output files used by LSTM training ('outputbase'`.lstmf`).
* `makebox` - Write box file ('outputbase'`.box`).
* `quiet` - Redirect debug messages to /dev/null.
[[CONFIGFILE]]
'CONFIGFILE'::
The name of a config to use. The name can be a file in `tessdata/configs`
or `tessdata/tessconfigs`, or an absolute or relative file path.
A config is a plain text file which contains a list of parameters and
their values, one per line, with a space separating parameter from value. +
Interesting config files include:
* *alto* -- Output in ALTO format ('OUTPUTBASE'`.xml`).
* *hocr* -- Output in hOCR format ('OUTPUTBASE'`.hocr`).
* *pdf* -- Output PDF ('OUTPUTBASE'`.pdf`).
* *tsv* -- Output TSV ('OUTPUTBASE'`.tsv`).
* *txt* -- Output plain text ('OUTPUTBASE'`.txt`).
* *get.images* -- Write processed input images to file (`tessinput.tif`).
* *logfile* -- Redirect debug messages to file (`tesseract.log`).
* *lstm.train* -- Output files used by LSTM training ('OUTPUTBASE'`.lstmf`).
* *makebox* -- Write box file ('OUTPUTBASE'`.box`).
* *quiet* -- Redirect debug messages to '/dev/null'.
It is possible to select several config files, for example
`tesseract image.png demo hocr pdf txt` will create three output files
`demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
`tesseract image.png demo alto hocr pdf txt` will create four output files
`demo.alto`, `demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
*Nota Bene:* The options `-l lang` and `--psm N` must occur
before any 'configfile'.
*Nota bene:* The options *-l* 'LANG', *-l* 'SCRIPT' and *--psm* 'N'
must occur before any 'CONFIGFILE'.
SINGLE OPTIONS
--------------
'-h, --help'::
Show help message.
*-h, --help*::
Show help message.
'--help-extra'::
Show extra help for advanced users.
*--help-extra*::
Show extra help for advanced users.
'--help-psm'::
Show page segmentation modes.
*--help-psm*::
Show page segmentation modes.
'--help-oem'::
Show OCR Engine modes.
*--help-oem*::
Show OCR Engine modes.
'-v, --version'::
Returns the current version of the tesseract(1) executable.
*-v, --version*::
Returns the current version of the tesseract(1) executable.
'--list-langs'::
List available languages for tesseract engine. Can be used with `--tessdata-dir`.
'--print-parameters'::
Print tesseract parameters.
*--list-langs*::
List available languages for tesseract engine.
Can be used with *--tessdata-dir* 'PATH'.
*--print-parameters*::
Print tesseract parameters.
[[LANGUAGES]]
LANGUAGES AND SCRIPTS
---------------------
To recognize some text with Tesseract, it is normally necessary to specify
the language(s) or script of the text (unless it is English text which is
supported by default) using `-l lang`.
the language(s) or script(s) of the text (unless it is English text which is
supported by default) using *-l* 'LANG' or *-l* 'SCRIPT'.
Selecting a language automatically also selects the language specific
character set and dictionary (word list).
@ -153,6 +164,9 @@ In most cases, a script also supports English.
So it is possible to recognize a language that has not been specifically
trained for by using traineddata for the script it is written in.
More than one language or script may be specified by using `+`.
Example: `tesseract myimage.png myimage -l eng+deu+fra`.
https://github.com/tesseract-ocr/tessdata_fast provides fast language and
script models which are also part of Linux distributions.
@ -174,16 +188,16 @@ following languages:
*cat* (Catalan; Valencian),
*ceb* (Cebuano),
*ces* (Czech),
*chi_sim* (Chinese - Simplified),
*chi_tra* (Chinese - Traditional),
*chi_sim* (Chinese simplified),
*chi_tra* (Chinese traditional),
*chr* (Cherokee),
*cym* (Welsh),
*dan* (Danish),
*deu* (German),
*dzo* (Dzongkha),
*ell* (Greek, Modern (1453-)),
*ell* (Greek, Modern, 1453-),
*eng* (English),
*enm* (English, Middle (1100-1500)),
*enm* (English, Middle, 1100-1500),
*epo* (Esperanto),
*equ* (Math / equation detection module),
*est* (Estonian),
@ -192,10 +206,10 @@ following languages:
*fin* (Finnish),
*fra* (French),
*frk* (Frankish),
*frm* (French, Middle (ca.1400-1600)),
*frm* (French, Middle, ca.1400-1600),
*gle* (Irish),
*glg* (Galician),
*grc* (Greek, Ancient (to 1453)),
*grc* (Greek, Ancient, to 1453),
*guj* (Gujarati),
*hat* (Haitian; Haitian Creole),
*heb* (Hebrew),
@ -215,9 +229,9 @@ following languages:
*kaz* (Kazakh),
*khm* (Central Khmer),
*kir* (Kirghiz; Kyrgyz),
*kmr* (Kurdish Kurmanji),
*kor* (Korean),
*kor_vert* (Korean (vertical)),
*kmr* (Kurdish (Kurmanji)),
*kor_vert* (Korean vertical),
*kur* (Kurdish),
*lao* (Lao),
*lat* (Latin),
@ -235,7 +249,7 @@ following languages:
*nep* (Nepali),
*nld* (Dutch; Flemish),
*nor* (Norwegian),
*oci* (Occitan (post 1500)),
*oci* (Occitan post 1500),
*ori* (Oriya),
*osd* (Orientation and script detection module),
*pan* (Panjabi; Punjabi),
@ -277,51 +291,51 @@ following languages:
*yid* (Yiddish),
*yor* (Yoruba)
To use a non-standard language pack named *foo.traineddata*, set the
*TESSDATA_PREFIX* environment variable so the file can be found at
*TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
argument `-l foo`.
To use a non-standard language pack named `foo.traineddata`, set the
`TESSDATA_PREFIX` environment variable so the file can be found at
`TESSDATA_PREFIX/tessdata/foo.traineddata` and give Tesseract the
argument *-l* `foo`.
For Tesseract 4, `tessdata_fast` includes traineddata files for the
following scripts:
Arabic,
Armenian,
Bengali,
Canadian Aboriginal,
Cherokee,
Cyrillic,
Devanagari,
Ethiopic,
Fraktur,
Georgian,
Greek,
Gujarati,
Gurmukhi,
Han - Simplified,
Han - Simplified (vertical),
Han - Traditional,
Han - Traditional (vertical),
Hangul,
Hangul (vertical),
Hebrew,
Japanese,
Japanese (vertical),
Kannada,
Khmer,
Lao,
Latin,
Malayalam,
Myanmar,
Oriya (Odia),
Sinhala,
Syriac,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Vietnamese.
*Arabic*,
*Armenian*,
*Bengali*,
*Canadian_Aboriginal*,
*Cherokee*,
*Cyrillic*,
*Devanagari*,
*Ethiopic*,
*Fraktur*,
*Georgian*,
*Greek*,
*Gujarati*,
*Gurmukhi*,
*HanS* (Han simplified),
*HanS_vert* (Han simplified, vertical),
*HanT* (Han traditional),
*HanT_vert* (Han traditional, vertical),
*Hangul*,
*Hangul_vert* (Hangul vertical),
*Hebrew*,
*Japanese*,
*Japanese_vert* (Japanese vertical),
*Kannada*,
*Khmer*,
*Lao*,
*Latin*,
*Malayalam*,
*Myanmar*,
*Oriya* (Odia),
*Sinhala*,
*Syriac*,
*Tamil*,
*Telugu*,
*Thaana*,
*Thai*,
*Tibetan*,
*Vietnamese*.
The same languages and scripts are available from
https://github.com/tesseract-ocr/tessdata_best.
@ -343,8 +357,8 @@ Tesseract config files consist of lines with parameter-value pairs (space
separated). The parameters are documented as flags in the source code like
the following one in tesseractclass.h:
STRING_VAR_H(tessedit_char_blacklist, "",
"Blacklist of chars not to recognize");
`STRING_VAR_H(tessedit_char_blacklist, "",
"Blacklist of chars not to recognize");`
These parameters may enable or disable various features of the engine, and
may cause it to load (or not load) various data. For instance, let's suppose
@ -352,10 +366,10 @@ you want to OCR in English, but suppress the normal dictionary and load an
alternative word list and an alternative list of patterns -- these two files
are the most commonly used extra data files.
If your language pack is in /path/to/eng.traineddata and the hocr config
is in /path/to/configs/hocr then create three new files:
If your language pack is in '/path/to/eng.traineddata' and the hocr config
is in '/path/to/configs/hocr' then create three new files:
/path/to/eng.user-words:
'/path/to/eng.user-words':
[verse]
the
quick
@ -363,25 +377,39 @@ brown
fox
jumped
/path/to/eng.user-patterns:
'/path/to/eng.user-patterns':
[verse]
1-\d\d\d-GOOG-411
www.\n\\\*.com
/path/to/configs/bazaar:
'/path/to/configs/bazaar':
[verse]
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
Now, if you pass the word 'bazaar' as a 'configfile' to Tesseract,
Tesseract will not bother loading the system dictionary nor
the dictionary of frequent words and will load and use the eng.user-words
and eng.user-patterns files you provided. The former is a simple word list,
one per line. The format of the latter is documented in dict/trie.h
on read_pattern_list().
Now, if you pass the word 'bazaar' as a <<CONFIGFILE,'CONFIGFILE'>> to
Tesseract, Tesseract will not bother loading the system dictionary nor
the dictionary of frequent words and will load and use the 'eng.user-words'
and 'eng.user-patterns' files you provided. The former is a simple word list,
one per line. The format of the latter is documented in 'dict/trie.h'
on 'read_pattern_list()'.
ENVIRONMENT VARIABLES
---------------------
*`TESSDATA_PREFIX`*::
If the `TESSDATA_PREFIX` is set to a path, then that path is used to
find the `tessdata` directory with language and script recognition
models and config files.
Using <<TESSDATADIR,*--tessdata-dir* 'PATH'>> is the recommended alternative.
*`OMP_THREAD_LIMIT`*::
If the `tesseract` executable was built with multithreading support,
it will normally use four CPU cores for the OCR process. While this
can be faster for a single image, it gives bad performance if the host
computer provides less than four CPU cores or if OCR is made for many images.
Only a single CPU core is used with `OMP_THREAD_LIMIT=1`.
HISTORY
@ -391,7 +419,7 @@ Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some $$C++$$izing in 1998. A
lot of the code was written in C, and then some more was written in $$C++$$.
The $$C++$$ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has
STL, was portable before STL, and is more efficient than STL lists, but has
the big negative that if you do get a segmentation violation, it is hard to
debug.
@ -399,7 +427,8 @@ Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>. With Tesseract 2.00,
See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>.
Since Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests.
See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
details.
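
The man page text in this diff states two rules that are easy to get wrong when calling Tesseract from a script: options such as *-l* and *--psm* must come before any 'CONFIGFILE', and `OMP_THREAD_LIMIT=1` restricts the process to a single CPU core. The Python sketch below is one possible way to encode both, assuming a `tesseract` executable on the search path. The helper names `build_tesseract_command` and `single_core_environment` are hypothetical and do not belong to Tesseract.

```python
from __future__ import annotations

import os
from typing import Mapping, Sequence


def build_tesseract_command(
    input_file: str,
    output_base: str,
    lang: str | None = None,
    psm: int | None = None,
    config_files: Sequence[str] = (),
) -> list[str]:
    """Assemble: tesseract FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]...

    Hypothetical helper; it only builds the argument list.
    """
    command = ["tesseract", input_file, output_base]
    if lang is not None:
        # Several languages or scripts are joined with "+", e.g. "eng+deu".
        command += ["-l", lang]
    if psm is not None:
        # The man page lists page segmentation modes 0 through 13.
        if not 0 <= psm <= 13:
            raise ValueError(f"unsupported page segmentation mode: {psm}")
        command += ["--psm", str(psm)]
    # Config files go last, so every option precedes them.
    command += list(config_files)
    return command


def single_core_environment(base: Mapping[str, str] | None = None) -> dict[str, str]:
    """Return a copy of the environment with OMP_THREAD_LIMIT set to 1."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


if __name__ == "__main__":
    cmd = build_tesseract_command(
        "image.png", "demo", lang="eng+deu", psm=6, config_files=["hocr", "pdf"]
    )
    print(" ".join(cmd))
```

The resulting list and environment can be passed to `subprocess.run(cmd, env=single_core_environment(), check=True)`. Setting the thread limit this way is mainly worthwhile when many images are processed in parallel, which is the case the man page warns about.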