mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2024-12-01 07:59:05 +08:00
b231aee212
Together with: - fix "C\++" - align executable --print-parameters message
337 lines
9.2 KiB
Plaintext
337 lines
9.2 KiB
Plaintext
TESSERACT(1)
|
|
============
|
|
:doctype: manpage
|
|
|
|
NAME
|
|
----
|
|
tesseract - command-line OCR engine
|
|
|
|
SYNOPSIS
|
|
--------
|
|
*tesseract* 'imagename'|'stdin' 'outputbase'|'stdout' [options...] [configfile...]
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
tesseract(1) is a commercial quality OCR engine originally developed at HP
|
|
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
|
|
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
|
|
at Google since then.
|
|
|
|
|
|
IN/OUT ARGUMENTS
|
|
----------------
|
|
'imagename'::
|
|
The name of the input image. Most image file formats (anything
|
|
readable by Leptonica) are supported.
|
|
|
|
'stdin'::
|
|
Instruction to read data from standard input
|
|
|
|
'outputbase'::
|
|
The basename of the output file (to which the appropriate extension
|
|
will be appended). By default the output will be named 'outbase.txt'.
|
|
|
|
'stdout'::
|
|
Instruction to sent output data to standard output
|
|
|
|
|
|
OPTIONS
|
|
-------
|
|
'--tessdata-dir /path'::
|
|
Specify the location of tessdata path
|
|
|
|
'--user-words /path/to/file'::
|
|
Specify the location of user words file
|
|
|
|
'--user-patterns /path/to/file specify'::
|
|
The location of user patterns file
|
|
|
|
'-c configvar=value'::
|
|
Set value for control parameter. Multiple -c arguments are allowed.
|
|
|
|
'-l lang'::
|
|
The language to use. If none is specified, English is assumed.
|
|
Multiple languages may be specified, separated by plus characters.
|
|
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
|
|
|
|
'--psm N'::
|
|
Set Tesseract to only run a subset of layout analysis and assume
|
|
a certain form of image. The options for *N* are:
|
|
|
|
0 = Orientation and script detection (OSD) only.
|
|
1 = Automatic page segmentation with OSD.
|
|
2 = Automatic page segmentation, but no OSD, or OCR.
|
|
3 = Fully automatic page segmentation, but no OSD. (Default)
|
|
4 = Assume a single column of text of variable sizes.
|
|
5 = Assume a single uniform block of vertically aligned text.
|
|
6 = Assume a single uniform block of text.
|
|
7 = Treat the image as a single text line.
|
|
8 = Treat the image as a single word.
|
|
9 = Treat the image as a single word in a circle.
|
|
10 = Treat the image as a single character.
|
|
|
|
'--oem N'::
|
|
Specify OCR Engine mode. The options for *N* are:
|
|
|
|
0 = Original Tesseract only.
|
|
1 = Neural nets LSTM only.
|
|
2 = Tesseract + LSTM.
|
|
3 = Default, based on what is available.
|
|
|
|
'configfile'::
|
|
The name of a config to use. A config is a plaintext file which
|
|
contains a list of variables and their values, one per line, with a
|
|
space separating variable from value. Interesting config files
|
|
include: +
|
|
* hocr - Output in hOCR format instead of as a text file.
|
|
* pdf - Output in pdf instead of a text file.
|
|
|
|
*Nota Bene:* The options '-l lang' and '--psm N' must occur
|
|
before any 'configfile'.
|
|
|
|
|
|
SINGLE OPTIONS
|
|
--------------
|
|
'-h, --help'::
|
|
Show help message.
|
|
|
|
'--help-psm'::
|
|
Show page segmentation modes.
|
|
|
|
'--help-oem'::
|
|
Show OCR Engine modes.
|
|
|
|
'-v, --version'::
|
|
Returns the current version of the tesseract(1) executable.
|
|
|
|
'--list-langs'::
|
|
List available languages for tesseract engine. Can be used with --tessdata-dir.
|
|
|
|
'--print-parameters'::
|
|
Print tesseract parameters.
|
|
|
|
|
|
|
|
LANGUAGES
|
|
---------
|
|
|
|
There are currently language packs available for the following languages
|
|
(in https://github.com/tesseract-ocr/tessdata):
|
|
|
|
*afr* (Afrikaans)
|
|
*amh* (Amharic)
|
|
*ara* (Arabic)
|
|
*asm* (Assamese)
|
|
*aze* (Azerbaijani)
|
|
*aze_cyrl* (Azerbaijani - Cyrilic)
|
|
*bel* (Belarusian)
|
|
*ben* (Bengali)
|
|
*bod* (Tibetan)
|
|
*bos* (Bosnian)
|
|
*bul* (Bulgarian)
|
|
*cat* (Catalan; Valencian)
|
|
*ceb* (Cebuano)
|
|
*ces* (Czech)
|
|
*chi_sim* (Chinese - Simplified)
|
|
*chi_tra* (Chinese - Traditional)
|
|
*chr* (Cherokee)
|
|
*cym* (Welsh)
|
|
*dan* (Danish)
|
|
*dan_frak* (Danish - Fraktur)
|
|
*deu* (German)
|
|
*deu_frak* (German - Fraktur)
|
|
*dzo* (Dzongkha)
|
|
*ell* (Greek, Modern (1453-))
|
|
*eng* (English)
|
|
*enm* (English, Middle (1100-1500))
|
|
*epo* (Esperanto)
|
|
*equ* (Math / equation detection module)
|
|
*est* (Estonian)
|
|
*eus* (Basque)
|
|
*fas* (Persian)
|
|
*fin* (Finnish)
|
|
*fra* (French)
|
|
*frk* (Frankish)
|
|
*frm* (French, Middle (ca.1400-1600))
|
|
*gle* (Irish)
|
|
*glg* (Galician)
|
|
*grc* (Greek, Ancient (to 1453))
|
|
*guj* (Gujarati)
|
|
*hat* (Haitian; Haitian Creole)
|
|
*heb* (Hebrew)
|
|
*hin* (Hindi)
|
|
*hrv* (Croatian)
|
|
*hun* (Hungarian)
|
|
*iku* (Inuktitut)
|
|
*ind* (Indonesian)
|
|
*isl* (Icelandic)
|
|
*ita* (Italian)
|
|
*ita_old* (Italian - Old)
|
|
*jav* (Javanese)
|
|
*jpn* (Japanese)
|
|
*kan* (Kannada)
|
|
*kat* (Georgian)
|
|
*kat_old* (Georgian - Old)
|
|
*kaz* (Kazakh)
|
|
*khm* (Central Khmer)
|
|
*kir* (Kirghiz; Kyrgyz)
|
|
*kor* (Korean)
|
|
*kur* (Kurdish)
|
|
*lao* (Lao)
|
|
*lat* (Latin)
|
|
*lav* (Latvian)
|
|
*lit* (Lithuanian)
|
|
*mal* (Malayalam)
|
|
*mar* (Marathi)
|
|
*mkd* (Macedonian)
|
|
*mlt* (Maltese)
|
|
*msa* (Malay)
|
|
*mya* (Burmese)
|
|
*nep* (Nepali)
|
|
*nld* (Dutch; Flemish)
|
|
*nor* (Norwegian)
|
|
*ori* (Oriya)
|
|
*osd* (Orientation and script detection module)
|
|
*pan* (Panjabi; Punjabi)
|
|
*pol* (Polish)
|
|
*por* (Portuguese)
|
|
*pus* (Pushto; Pashto)
|
|
*ron* (Romanian; Moldavian; Moldovan)
|
|
*rus* (Russian)
|
|
*san* (Sanskrit)
|
|
*sin* (Sinhala; Sinhalese)
|
|
*slk* (Slovak)
|
|
*slk_frak* (Slovak - Fraktur)
|
|
*slv* (Slovenian)
|
|
*spa* (Spanish; Castilian)
|
|
*spa_old* (Spanish; Castilian - Old)
|
|
*sqi* (Albanian)
|
|
*srp* (Serbian)
|
|
*srp_latn* (Serbian - Latin)
|
|
*swa* (Swahili)
|
|
*swe* (Swedish)
|
|
*syr* (Syriac)
|
|
*tam* (Tamil)
|
|
*tel* (Telugu)
|
|
*tgk* (Tajik)
|
|
*tgl* (Tagalog)
|
|
*tha* (Thai)
|
|
*tir* (Tigrinya)
|
|
*tur* (Turkish)
|
|
*uig* (Uighur; Uyghur)
|
|
*ukr* (Ukrainian)
|
|
*urd* (Urdu)
|
|
*uzb* (Uzbek)
|
|
*uzb_cyrl* (Uzbek - Cyrilic)
|
|
*vie* (Vietnamese)
|
|
*yid* (Yiddish)
|
|
|
|
To use a non-standard language pack named *foo.traineddata*, set the
|
|
*TESSDATA_PREFIX* environment variable so the file can be found at
|
|
*TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
|
|
argument '-l foo'.
|
|
|
|
CONFIG FILES AND AUGMENTING WITH USER DATA
|
|
------------------------------------------
|
|
|
|
Tesseract config files consist of lines with variable-value pairs (space
|
|
separated). The variables are documented as flags in the source code like
|
|
the following one in tesseractclass.h:
|
|
|
|
STRING_VAR_H(tessedit_char_blacklist, "",
|
|
"Blacklist of chars not to recognize");
|
|
|
|
These variables may enable or disable various features of the engine, and
|
|
may cause it to load (or not load) various data. For instance, let's suppose
|
|
you want to OCR in English, but suppress the normal dictionary and load an
|
|
alternative word list and an alternative list of patterns -- these two files
|
|
are the most commonly used extra data files.
|
|
|
|
If your language pack is in /path/to/eng.traineddata and the hocr config
|
|
is in /path/to/configs/hocr then create three new files:
|
|
|
|
/path/to/eng.user-words:
|
|
[verse]
|
|
the
|
|
quick
|
|
brown
|
|
fox
|
|
jumped
|
|
|
|
|
|
/path/to/eng.user-patterns:
|
|
[verse]
|
|
1-\d\d\d-GOOG-411
|
|
www.\n\\\*.com
|
|
|
|
/path/to/configs/bazaar:
|
|
[verse]
|
|
load_system_dawg F
|
|
load_freq_dawg F
|
|
user_words_suffix user-words
|
|
user_patterns_suffix user-patterns
|
|
|
|
Now, if you pass the word 'bazaar' as a trailing command line parameter
|
|
to Tesseract, Tesseract will not bother loading the system dictionary nor
|
|
the dictionary of frequent words and will load and use the eng.user-words
|
|
and eng.user-patterns files you provided. The former is a simple word list,
|
|
one per line. The format of the latter is documented in dict/trie.h
|
|
on read_pattern_list().
|
|
|
|
|
|
HISTORY
|
|
-------
|
|
The engine was developed at Hewlett Packard Laboratories Bristol and at
|
|
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
|
|
changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A
|
|
lot of the code was written in C, and then some more was written in C\+\+.
|
|
The C++ code makes heavy use of a list system using macros. This predates
|
|
stl, was portable before stl, and is more efficient than stl lists, but has
|
|
the big negative that if you do get a segmentation violation, it is hard to
|
|
debug.
|
|
|
|
Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
|
|
to train Tesseract.
|
|
|
|
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
|
|
See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>. With Tesseract 2.00,
|
|
scripts are now included to allow anyone to reproduce some of these tests.
|
|
See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
|
|
details.
|
|
|
|
Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
|
|
and Korean. It also introduces a new, single-file based system of managing
|
|
language data.
|
|
|
|
Tesseract 3.02 adds BiDirectional text support, the ability to recognize
|
|
multiple languages in a single image, and improved layout analysis.
|
|
|
|
For further details, see the file ReleaseNotes included with the distribution.
|
|
|
|
RESOURCES
|
|
---------
|
|
Main web site: <https://github.com/tesseract-ocr> +
|
|
Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
|
|
|
|
SEE ALSO
|
|
--------
|
|
ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
|
|
shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
|
|
unicharset_extractor(1), wordlist2dawg(1)
|
|
|
|
AUTHOR
|
|
------
|
|
Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
|
|
The development team has included:
|
|
|
|
Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
|
|
Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
|
|
Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
|
|
Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
|
|
Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
|
|
Lloyd, Shobhit Saxena, and Thomas Kielbus.
|
|
|
|
COPYING
|
|
-------
|
|
Licensed under the Apache License, Version 2.0
|