tesseract

mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-11-25 19:49:04 +08:00

Tesseract Open Source OCR Engine (main repository)

zdenop@gmail.com 68baf257be correction of ambigs.train; win32: update of leptonica library to 1.66, update of tessdll.dll to recent build git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@462 d0cd1f9f-072b-0410-8dd7-cf729c803f20		2010-09-27 09:00:37 +00:00
api	crap	2010-07-27 15:17:52 +00:00
ccmain	small tweaks to doxygen	2010-08-12 18:55:59 +00:00
ccstruct	more doxygen	2010-08-10 19:20:11 +00:00
ccutil	partial merge of doxygen branch (stuff without conflicts, basically)	2010-07-27 13:23:23 +00:00
classify	doxygen	2010-08-19 18:58:48 +00:00
config	config.rpath	2010-09-22 08:56:27 +00:00
contrib	move bash completion script to a contrib directory instead of littering up the top level	2010-05-29 16:33:10 +00:00
cutil	partial merge of doxygen branch (stuff without conflicts, basically)	2010-07-27 13:23:23 +00:00
dict	fix for issue 341, thanks to max.markin	2010-08-19 19:17:06 +00:00
dlltest	silence more useless warnings	2010-07-21 15:11:19 +00:00
doc	partial merge of doxygen branch (stuff without conflicts, basically)	2010-07-27 13:23:23 +00:00
image	more doxygen	2010-08-10 19:20:11 +00:00
include	correction of ambigs.train;	2010-09-27 09:00:37 +00:00
java	Fix issue 333 (patch from Zdenko)	2010-07-21 10:29:42 +00:00
lib	correction of ambigs.train;	2010-09-27 09:00:37 +00:00
m4	i18n/l10n autoconf macros	2010-07-18 23:35:39 +00:00
po	attempting to test this; not working so far	2010-07-19 02:13:27 +00:00
tessdata	correction of ambigs.train;	2010-09-27 09:00:37 +00:00
testing	start of i18n	2010-07-19 01:59:13 +00:00
textord	doxygen	2010-08-19 18:57:57 +00:00
training	improved script for creating language packages, improved tesseract.spec	2010-09-26 20:11:50 +00:00
viewer	partial merge of doxygen branch (stuff without conflicts, basically)	2010-07-27 13:23:23 +00:00
wordrec	partial merge of doxygen branch (stuff without conflicts, basically)	2010-07-27 13:23:23 +00:00
.cvsignore	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
ABOUT-NLS	partially address issue 353	2010-09-15 23:39:21 +00:00
acinclude.m4	Fixed name collision with jpeg library	2008-04-22 00:44:56 +00:00
aclocal.m4	start of i18n	2010-07-19 01:59:13 +00:00
AUTHORS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
ChangeLog	start of i18n	2010-07-19 01:59:13 +00:00
configure	start of i18n	2010-07-19 01:59:13 +00:00
configure.ac	fix issue 332	2010-07-20 10:31:49 +00:00
COPYING	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
eurotext.tif	Automake changes for version 2.00.	2007-07-18 01:04:56 +00:00
glut32.dll	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
INSTALL	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
INSTALL.SVN	added note about aclocal warnings to INSTALL.SVN	2007-03-31 15:26:51 +00:00
jpeg62.dll	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
leptonlib.dll	correction of ambigs.train;	2010-09-27 09:00:37 +00:00
libimage.dll	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
libpng12.dll	add libpng12.dll; thanks to dtorne	2010-05-28 01:10:19 +00:00
libpng13.dll	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
librle3.dll	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
libtiff3.dll	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
Makefile.am	start of i18n	2010-07-19 01:59:13 +00:00
Makefile.in	start of i18n	2010-07-19 01:59:13 +00:00
makemoredists	improved script for creating language packages, improved tesseract.spec	2010-09-26 20:11:50 +00:00
NEWS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
phototest.tif	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
README	ReadMe/ReleaseNotes changes for final 2.04 release	2009-06-30 22:18:24 +00:00
ReleaseNotes	ReadMe/ReleaseNotes changes for final 2.04 release	2009-06-30 22:18:24 +00:00
runautoconf	use libtool	2010-05-26 14:20:20 +00:00
StdAfx.cpp	More new files for v2.00	2007-07-18 01:30:21 +00:00
StdAfx.h	More new files for v2.00	2007-07-18 01:30:21 +00:00
tessdll.cpp	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
tessdll.dll	correction of ambigs.train;	2010-09-27 09:00:37 +00:00
tessdll.h	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00
tessdll.vcproj	patch for issue 304 from max.markin	2010-07-19 02:32:21 +00:00
tesseract.sln	also this	2010-05-28 00:22:23 +00:00
tesseract.spec	improved script for creating language packages, improved tesseract.spec	2010-09-26 20:11:50 +00:00
tesseract.vcproj	/NODEFAULTLIB:library	2010-07-21 15:25:59 +00:00
zlib1.dll	Misc root changes for 3.00	2009-07-11 03:05:57 +00:00

README

Note that this is a text-only and possibly out-of-date version of the 
wiki ReadMe, which is located at:
 http://code.google.com/p/tesseract-ocr/wiki/ReadMe

Introduction
============
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.


Other Dependencies and Licenses:
================================
The Aspirin/MIGRAINES system is no longer required.

Tesseract can also make use of the libtiff library. (www.libtiff.org) See
http://code.google.com/p/tesseract-ocr/wiki/FAQ for details.
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.

Installing and Running Tesseract
================================
All Users Do NOT Ignore!
The tarballs are split into pieces.

tesseract-2.04.tar.gz contains all the source code.

tesseract-2.00.<lang>.tar.gz contains the language data files for <lang>. You need at least one of these or tesseract will not work.

Note that tesseract-2.04.tar.gz unpacks to the tesseract-2.04 directory. tesseract-2.00.<lang>.tar.gz unpacks to the tessdata directory, which belongs inside your tesseract-2.04 directory. It is therefore best to download the language packs into your tesseract-2.04 directory, so that you can use "unpack here" or an equivalent. You can unpack as many of the language packs as you care to, as they all contain different files. Note that if you are using make install, you should unpack your language data into your source tree before you run make install. If you unpack them as root to the destination directory of make install, the user IDs and access permissions might be wrong.
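
The unpacking order described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name unpack_tesseract is hypothetical, and the tarball names are the ones listed in this README.

```shell
# Sketch: unpack the source tarball, then unpack each language pack
# inside the resulting source directory so that tessdata/ ends up there.
# unpack_tesseract is a hypothetical helper, not part of Tesseract.
unpack_tesseract() {
    src_tarball=$1
    shift
    # tesseract-2.04.tar.gz unpacks to ./tesseract-2.04
    tar -xzf "$src_tarball" || return 1
    src_dir=$(basename "$src_tarball" .tar.gz)
    # Each language pack unpacks to tessdata/, which belongs in src_dir.
    for lang_tarball in "$@"; do
        tar -xzf "$lang_tarball" -C "$src_dir" || return 1
    done
}
```

For example, unpack_tesseract tesseract-2.04.tar.gz tesseract-2.00.eng.tar.gz, run as an ordinary user before make install, should leave the English data in tesseract-2.04/tessdata.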

boxtiff-2.01.<lang>.tar.gz contains data that was used in training for those that want to do their own training. Most users should NOT download these files.

Instructions for using the training tools are documented separately at TrainingTesseract and for testing at TestingTesseract. 

Without Additional Libraries, Image Format Support Is Limited!

Without additional libraries, Tesseract can only read uncompressed TIFF (and some versions of BMP). Up to version 2.04, you can add libtiff-dev. See the FAQ question on compressed TIFF for installation instructions. Version 3.00 will support additional formats via Leptonica, but requires more libraries to be added.
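
If libtiff support is not built in, one possible workaround is to rewrite compressed TIFFs without compression before passing them to Tesseract. The sketch below assumes that the tiffcp utility from the libtiff tools is installed separately; the function name to_uncompressed_tiff is hypothetical, and nothing here is part of Tesseract itself.

```shell
# Sketch: copy a TIFF with compression turned off.
# Assumes tiffcp (from the libtiff tools) is on PATH.
# to_uncompressed_tiff is a hypothetical helper, not part of Tesseract.
to_uncompressed_tiff() {
    src=$1
    dst=$2
    if ! command -v tiffcp >/dev/null 2>&1; then
        echo "tiffcp not found; install the libtiff tools" >&2
        return 1
    fi
    # -c none asks tiffcp to write the copy uncompressed.
    tiffcp -c none "$src" "$dst"
}
```

For example, to_uncompressed_tiff scan.tif scan-plain.tif writes an uncompressed copy and leaves the original untouched.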

Windows:

There is no Windows installer! (Still looking for volunteers to create one.) There are Windows executables: tesseract-2.04.exe.tar.gz (It is not for the 'exe' language.) They are built with VC++ Express 2008 and come with absolutely no warranty. If they work for you then great; otherwise get Visual C++ Express 2008 with Service Pack 1 and build from the source. You can also try tesseract-2.01.exe.tar.gz, which is built with VC++6 and may work better if your Windows is old, but note that this is an older version of Tesseract.

If you are building from the sources, there are still (up to v2.04) .dsw and .dsp files for VC++6, but the recommended build platform is now VC++ Express 2008. There are also .sln and .vcproj files for VC++ Express 2008, but these files are not backward compatible with any previous version - not even VC++ Express 2005. Note that the executables produced with the newer compiler are smaller, faster, and, believe it or not, more accurate. (See TestingTesseract.)

New with 2.04: the executables are built with static linking, so they stand more chance of working out of the box on more Windows systems.

The executable must reside in the same directory as the tessdata directory. (The Visual Studio projects build the release executable directly to the correct place!)

The command line is:

tesseract <image.tif> <output> [-l <langid>]

For interfacing to other applications, there is a DLL included with the executables, but you may be better off building it yourself. The DLL is NOT built for static C-Runtime, so you will probably need VC++ Express 2008 to run it.

The DLL has been updated to allow input of non-binary images. (Thanks to Glen of Jetsoft.)

Non-Windows (or Cygwin):

You have to tell Tesseract through a standard unix mechanism where to find its data directory. You must either:

./configure
make
make install

to move the data files to the standard place, or:

export TESSDATA_PREFIX="directory in which your tessdata resides/"
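
When using the environment variable, a quick sanity check can catch a wrong path before Tesseract is run. The following is a minimal sketch: check_tessdata_prefix is a hypothetical helper, and the trailing-slash test is an assumption based only on the form of the export line above.

```shell
# Sketch: sanity-check TESSDATA_PREFIX before running tesseract.
# check_tessdata_prefix is a hypothetical helper, not part of Tesseract.
check_tessdata_prefix() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    # Assumption: the value names the directory that contains tessdata
    # and ends with a slash, as in the export line above.
    case $TESSDATA_PREFIX in
        */) ;;
        *) echo "TESSDATA_PREFIX should end with /" >&2; return 1 ;;
    esac
    if [ ! -d "${TESSDATA_PREFIX}tessdata" ]; then
        echo "no tessdata directory under $TESSDATA_PREFIX" >&2
        return 1
    fi
}
```

The function returns 0 only if the variable is set, ends with a slash, and points at a directory that contains tessdata.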

In either case the command line is:

tesseract <image.tif> <output> [-l <langid>]
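
As an illustration of the command line above, the following sketch runs Tesseract over every .tif file in a directory. The helper name ocr_dir is hypothetical, and the default langid of eng is an assumption; Tesseract itself appends .txt to the output name.

```shell
# Sketch: run tesseract over every .tif file in a directory.
# ocr_dir is a hypothetical helper; "eng" is an assumed default langid.
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    failed=0
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # no .tif files matched
        out=${img%.tif}                # output base; tesseract adds .txt
        tesseract "$img" "$out" -l "$lang" || failed=1
    done
    return "$failed"
}
```

For example, ocr_dir scans deu processes scans/*.tif with the German language data and returns non-zero if any page failed.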

New: there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for the help.) It might work with your OS if you know how to do that.

If you are linking to the libraries, as Ocropus does, there is now a single master library called libtesseract_full.a.

Libtiff support should now be properly working via configure, but note that you need libtiff-dev, as that contains the header files required to compile the code that uses it. 

History:
========
The engine was developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc2.95 and under Windows
with VC++6. The C++ code makes heavy use of a list system using macros.
This predates STL, was portable before STL, and is more efficient than STL
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug. Another "feature" of the C/C++ split is that the C++
data structures get converted to C data structures to call the low-level C
code. This is ugly, and the C++izing of the C code is a step towards
eliminating the conversion, but it has not happened yet.

The most recent change is that Tesseract can now recognize 6 languages, is fully UTF-8 capable, and is fully trainable. See TrainingTesseract for more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. See http://www.isri.unlv.edu/downloads/AT-1995.pdf. With Tesseract 2.00, scripts are now included to allow anyone to reproduce some of these tests. See TestingTesseract for more details. 


Directory Structure (ordered by dependency):
============================================
ccmain     Top-level code. The main program resides in tesseractmain.cpp.
display    An "editor" to view and operate on the internal structures.
           (Requires a working viewer - batteries not included.)
wordrec    The word-level recognizer.
textord    The module that organizes (orders) text into lines and words.
classify   The low-level character classifiers.
ccstruct   Classes to hold information about a page as it is being processed.
viewer     The client side of a client server viewing system.
           Unfortunately, at this time, the server side is not available.
image      Image class and processing functions.
dict       Language model code.
cutil      Code for file I/O, lists, heaps etc, from the old C code.
ccutil     Somewhat newer code for lists, memory allocation etc from the
           old C++ code.


About the Engine
================
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT
FORMATTING, and NO UI. It can only process an image of a single column
and create text from it. It can detect fixed pitch vs proportional text.
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 2.0, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 6
languages "out of the box." Code and documentation are provided for the brave
to train in other languages. See code.google.com/p/tesseract-ocr for more
information on training.