Introduction
============
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.

Other Dependencies and Licenses:
================================
The Aspirin/MIGRAINES system is no longer required.

Tesseract can also make use of the libtiff library. (www.libtiff.org)
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.

History:
========
The engine was developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 2.95 and under Windows
with VC++ 6. The C++ code makes heavy use of a list system using macros.
This predates the STL, was portable before the STL, and is more efficient
than STL lists, but has the big negative that if you do get a segmentation
violation, it is hard to debug. Another "feature" of the C/C++ split is that
the C++ data structures get converted to C data structures to call the
low-level C code. This is ugly, and the C++izing of the C code is a step
towards eliminating the conversion, but it has not happened yet.

Directory Structure (ordered by dependency):
============================================
ccmain    Top-level code. The main program resides in tesseractmain.cpp.
display   An "editor" to view and operate on the internal structures.
          (Requires a working viewer - batteries not included.)
wordrec   The word-level recognizer.
textord   The module that organizes (orders) text into lines and words.
classify  The low-level character classifiers.
ccstruct  Classes to hold information about a page as it is being processed.
viewer    The client side of a client-server viewing system.
          Unfortunately, at this time, the server side is not available.
image     Image class and processing functions.
dict      Language model code.
cutil     Code for file I/O, lists, heaps etc, from the old C code.
ccutil    Somewhat newer code for lists, memory allocation etc from the
          old C++ code.

About the Engine
================
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT
FORMATTING, and NO UI. It can only process an image of a single column
and create text from it. It can detect fixed pitch vs proportional text.
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 2.0, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 6
languages "out of the box." Code and documentation are provided for the brave
to train in other languages. See code.google.com/p/tesseract-ocr for more
information on training.

Using the Engine
================
Windows:
The executable must reside in the same directory as the tessdata directory.
The command line is:
    tesseract <image.tif> <output> [-l langid]
A Windows executable (tesseract.exe) is included in the distribution, but
it may not work for you unless you also have the correct MFC and CRT DLLs.
There is also a tessdll.dll, which you can use to run Tesseract from your
own program, but you may be better off building it yourself.

Non-Windows:
You have to tell Tesseract through a standard unix mechanism where to find
its data directory. You must either run:
    ./configure
    make
    make install
to move the data files to the standard place, or put:
    export TESSDATA_PREFIX="directory in which your tessdata resides/"
(or the setenv equivalent for your shell) in your .profile or similar
startup file to set the environment variable. Note that the directory
must end in a /.
HAVING tesseract and tessdata IN THE SAME DIRECTORY DOES NOT WORK ANY MORE.
The command line is:
    tesseract <image.tif> <output> [-l langid]

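The trailing-slash requirement on TESSDATA_PREFIX is easy to miss. The
following is a minimal sketch, assuming a POSIX shell. The helper name
tess_prefix and the path /usr/local/share are illustrative assumptions and
are not part of the Tesseract distribution.

```shell
# tess_prefix DIR: print DIR, appending a trailing slash if it lacks one.
# Hypothetical helper, not shipped with Tesseract.
tess_prefix() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# /usr/local/share is an assumed location of the directory that
# contains tessdata; substitute your own.
TESSDATA_PREFIX=$(tess_prefix "/usr/local/share")
export TESSDATA_PREFIX
```

Placing these lines in your .profile would set the variable for every login
shell; csh-style shells would need the setenv form instead.
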
All Systems:
The image file requires a .tif extension for its type to be recognized
correctly. If a file exists with the .tif extension replaced by .uzn, then it
will be interpreted as a UNLV-style zone file. (See www.isri.unlv.edu for
details of the zone files.)
langid is a language code as defined in ISO 639-3 (for example, eng), and
you must download the corresponding data files into your tessdata directory.
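
The file-naming rules above can be checked before running the engine. This
is a rough sketch, assuming a POSIX shell. Both function names are
hypothetical, and the check in has_lang_data assumes that language data
files are named with the langid followed by a dot, which should be verified
against the data files you actually downloaded.

```shell
# zone_file IMAGE: print the .uzn name that corresponds to IMAGE.
# Fails (non-zero status) if IMAGE does not end in .tif.
zone_file() {
    case "$1" in
        *.tif) printf '%s\n' "${1%.tif}.uzn" ;;
        *)     return 1 ;;
    esac
}

# has_lang_data TESSDATA LANGID: succeed if TESSDATA holds at least one
# file whose name starts with "LANGID." (assumed naming convention).
has_lang_data() {
    for f in "$1/$2".*; do
        [ -e "$f" ] && return 0
    done
    return 1
}
```

For example, zone_file page.tif prints page.uzn, so a script could test
whether that file exists to learn if zoning will be applied, and call
has_lang_data with the tessdata path and a langid before passing -l.
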