2010-09-30 00:46:04 +08:00
|
|
|
TESSERACT(1)
|
|
|
|
============
|
|
|
|
|
|
|
|
NAME
|
|
|
|
----
|
|
|
|
tesseract - command-line OCR engine
|
|
|
|
|
|
|
|
SYNOPSIS
|
|
|
|
--------
|
|
|
|
*tesseract* 'imagename' 'textbase' ['configfile'] ['-l lang']
|
|
|
|
|
|
|
|
DESCRIPTION
|
|
|
|
-----------
|
|
|
|
tesseract(1) is a commercial quality OCR engine originally developed at HP
|
|
|
|
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
|
|
|
|
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
|
|
|
|
by Google since then.
|
|
|
|
|
|
|
|
|
|
|
|
OPTIONS
|
|
|
|
-------
|
|
|
|
'imagename'
|
|
|
|
The name of the input image
|
|
|
|
|
|
|
|
'textbase'
|
|
|
|
The basename of the output file (to which the appropriate extension
|
|
|
|
will be appended)
|
|
|
|
|
|
|
|
'configfile'
|
|
|
|
The config to use. A config is a plaintext file which contains a list
|
|
|
|
of variables and their values, one per line, with a space separating
|
|
|
|
variable from value.
|
|
|
|
|
|
|
|
'-l lang'
|
|
|
|
The language to use. If none is specified, English is assumed.
|
|
|
|
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
|
|
|
|
|
|
|
|
'-v'
|
|
|
|
Returns the current version of the tesseract(1) executable.
|
|
|
|
|
|
|
|
LANGUAGES
|
|
|
|
---------
|
|
|
|
|
|
|
|
There are currently language packs available for the following languages:
|
|
|
|
|
2010-09-30 01:59:10 +08:00
|
|
|
[horizontal]
|
|
|
|
bul:: Bulgarian
|
|
|
|
cat:: Catalan
|
|
|
|
ces:: Czech
|
|
|
|
chi_sim:: Simplified Chinese
|
|
|
|
chi_tra:: Traditional Chinese
|
|
|
|
dan:: Danish
|
|
|
|
dan-frak:: Danish (Fraktur)
|
|
|
|
deu:: German
|
|
|
|
ell:: Greek
|
|
|
|
eng:: English
|
|
|
|
fin:: Finnish
|
|
|
|
fra:: French
|
|
|
|
hun:: Hungarian
|
|
|
|
ind:: Indonesian
|
|
|
|
ita:: Italian
|
|
|
|
jpn:: Japanese
|
|
|
|
kor:: Korean
|
|
|
|
lav:: Latvian
|
|
|
|
lit:: Lithuanian
|
|
|
|
nld:: Dutch
|
|
|
|
nor:: Norwegian
|
|
|
|
pol:: Polish
|
|
|
|
por:: Portuguese
|
|
|
|
ron:: Romanian
|
|
|
|
rus:: Russian
|
|
|
|
slk:: Slovakian
|
|
|
|
slv:: Slovenian
|
|
|
|
spa:: Spanish
|
|
|
|
srp:: Serbian
|
|
|
|
swe:: Swedish
|
|
|
|
tgl:: Tagalog
|
|
|
|
tha:: Thai
|
|
|
|
tur:: Turkish
|
|
|
|
ukr:: Ukrainian
|
|
|
|
vie:: Vietnamese
|
2010-09-30 00:46:04 +08:00
|
|
|
|
|
|
|
HISTORY
|
|
|
|
-------
|
|
|
|
The engine was developed at Hewlett Packard Laboratories Bristol and at
|
|
|
|
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
|
|
|
|
changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A
|
|
|
|
lot of the code was written in C, and then some more was written in C\+\+.
|
|
|
|
Since then all the code has been converted to at least compile with a
|
2010-09-30 01:56:39 +08:00
|
|
|
C\++ compiler. Currently it builds under Linux with gcc4.0, gcc4.1 and
|
|
|
|
under Windows with VC\+\+6 and VC\+\+Express. The C\++ code makes heavy use of
|
2010-09-30 00:46:04 +08:00
|
|
|
a list system using macros. This predates stl, was portable before stl, and
|
|
|
|
is more efficient than stl lists, but has the big negative that if you do get
|
|
|
|
a segmentation violation, it is hard to debug. Another "feature" of the
|
2010-09-30 01:56:39 +08:00
|
|
|
C/C\++ split is that the C\++ data structures get converted to C data
|
2010-09-30 00:46:04 +08:00
|
|
|
structures to call the low-level C code. This is ugly, and the C++izing of
|
|
|
|
the C code is a step towards eliminating the conversion, but it has not
|
|
|
|
happened yet.
|
|
|
|
|
|
|
|
The most important changes in version 2.00 were that Tesseract can now
|
|
|
|
recognize 6 languages, is fully UTF8 capable, and is fully trainable. See
|
|
|
|
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract> for more
|
|
|
|
information on training.
|
|
|
|
|
|
|
|
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
|
|
|
|
See <http://www.isri.unlv.edu/downloads/AT-1995.pdf>. With Tesseract 2.00,
|
|
|
|
scripts are now included to allow anyone to reproduce some of these tests.
|
|
|
|
See <http://code.google.com/p/tesseract-ocr/wiki/TestingTesseract> for more
|
|
|
|
details.
|
|
|
|
|
|
|
|
Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
|
|
|
|
and Korean. It also introduces a new, single-file based system of managing
|
|
|
|
language data. For further details, see the file ReleaseNotes included with
|
|
|
|
the distribution.
|
|
|
|
|
|
|
|
SEE ALSO
|
|
|
|
--------
|
|
|
|
tesseract(1)
|
|
|
|
|
|
|
|
COPYING
|
|
|
|
-------
|
|
|
|
Licensed under the Apache License, Version 2.0
|