2010-09-30 00:46:04 +08:00
|
|
|
TESSERACT(1)
|
|
|
|
============
|
|
|
|
|
|
|
|
NAME
|
|
|
|
----
|
|
|
|
tesseract - command-line OCR engine
|
|
|
|
|
|
|
|
SYNOPSIS
|
|
|
|
--------
|
|
|
|
*tesseract* 'imagename' 'textbase' ['configfile'] ['-l lang']
|
|
|
|
|
|
|
|
DESCRIPTION
|
|
|
|
-----------
|
|
|
|
tesseract(1) is a commercial quality OCR engine originally developed at HP
|
|
|
|
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
|
|
|
|
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
|
|
|
|
by Google since then.
|
|
|
|
|
|
|
|
|
|
|
|
OPTIONS
|
|
|
|
-------
|
|
|
|
'imagename'
|
|
|
|
The name of the input image
|
|
|
|
|
|
|
|
'textbase'
|
|
|
|
The basename of the output file (to which the appropriate extension
|
|
|
|
will be appended)
|
|
|
|
|
|
|
|
'configfile'
|
|
|
|
The config to use. A config is a plaintext file which contains a list
|
|
|
|
of variables and their values, one per line, with a space separating
|
|
|
|
variable from value.
|
|
|
|
|
|
|
|
'-l lang'
|
|
|
|
The language to use. If none is specified, English is assumed.
|
|
|
|
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
|
|
|
|
|
|
|
|
'-v'
|
|
|
|
Returns the current version of the tesseract(1) executable.
|
|
|
|
|
|
|
|
LANGUAGES
|
|
|
|
---------
|
|
|
|
|
|
|
|
There are currently language packs available for the following languages:
|
|
|
|
|
|
|
|
[width="40%",options="header"]
|
2010-09-30 01:56:39 +08:00
|
|
|
|=============================
|
|
|
|
|Code |Name
|
|
|
|
|bul |Bulgarian
|
|
|
|
|cat |Catalan
|
|
|
|
|ces |Czech
|
|
|
|
|chi_sim |Simplified Chinese
|
|
|
|
|chi_tra |Traditional Chinese
|
|
|
|
|dan |Danish
|
|
|
|
|dan-frak |Danish (Fraktur)
|
|
|
|
|deu |German
|
|
|
|
|ell |Greek
|
|
|
|
|eng |English
|
|
|
|
|fin |Finnish
|
|
|
|
|fra |French
|
|
|
|
|hun |Hungarian
|
|
|
|
|ind |Indonesian
|
|
|
|
|ita |Italian
|
|
|
|
|jpn |Japanese
|
|
|
|
|kor |Korean
|
|
|
|
|lav |Latvian
|
|
|
|
|lit |Lithuanian
|
|
|
|
|nld |Dutch
|
|
|
|
|nor |Norwegian
|
|
|
|
|pol |Polish
|
|
|
|
|por |Portuguese
|
|
|
|
|ron |Romanian
|
|
|
|
|rus |Russian
|
|
|
|
|slk |Slovakian
|
|
|
|
|slv |Slovenian
|
|
|
|
|spa |Spanish
|
|
|
|
|srp |Serbian
|
|
|
|
|swe |Swedish
|
|
|
|
|tgl |Tagalog
|
|
|
|
|tha |Thai
|
|
|
|
|tur |Turkish
|
|
|
|
|ukr |Ukrainian
|
|
|
|
|vie |Vietnamese
|
|
|
|
|=============================
|
2010-09-30 00:46:04 +08:00
|
|
|
|
|
|
|
HISTORY
|
|
|
|
-------
|
|
|
|
The engine was developed at Hewlett Packard Laboratories Bristol and at
|
|
|
|
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
|
|
|
|
changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A
|
|
|
|
lot of the code was written in C, and then some more was written in C\+\+.
|
|
|
|
Since then all the code has been converted to at least compile with a
|
2010-09-30 01:56:39 +08:00
|
|
|
C\++ compiler. Currently it builds under Linux with gcc4.0, gcc4.1 and
|
|
|
|
under Windows with VC\+\+6 and VC\+\+Express. The C\++ code makes heavy use of
|
2010-09-30 00:46:04 +08:00
|
|
|
a list system using macros. This predates stl, was portable before stl, and
|
|
|
|
is more efficient than stl lists, but has the big negative that if you do get
|
|
|
|
a segmentation violation, it is hard to debug. Another "feature" of the
|
2010-09-30 01:56:39 +08:00
|
|
|
C/C\++ split is that the C\++ data structures get converted to C data
|
2010-09-30 00:46:04 +08:00
|
|
|
structures to call the low-level C code. This is ugly, and the C++izing of
|
|
|
|
the C code is a step towards eliminating the conversion, but it has not
|
|
|
|
happened yet.
|
|
|
|
|
|
|
|
The most important changes in version 2.00 were that Tesseract can now
|
|
|
|
recognize 6 languages, is fully UTF8 capable, and is fully trainable. See
|
|
|
|
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract> for more
|
|
|
|
information on training.
|
|
|
|
|
|
|
|
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
|
|
|
|
See <http://www.isri.unlv.edu/downloads/AT-1995.pdf>. With Tesseract 2.00,
|
|
|
|
scripts are now included to allow anyone to reproduce some of these tests.
|
|
|
|
See <http://code.google.com/p/tesseract-ocr/wiki/TestingTesseract> for more
|
|
|
|
details.
|
|
|
|
|
|
|
|
Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
|
|
|
|
and Korean. It also introduces a new, single-file based system of managing
|
|
|
|
language data. For further details, see the file ReleaseNotes included with
|
|
|
|
the distribution.
|
|
|
|
|
|
|
|
SEE ALSO
|
|
|
|
--------
|
|
|
|
tesseract(1)
|
|
|
|
|
|
|
|
COPYING
|
|
|
|
-------
|
|
|
|
Licensed under the Apache License, Version 2.0
|