tesseract

mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-11-27 20:59:36 +08:00

Tesseract Open Source OCR Engine (main repository)

theraysmith feed23dd0e Fixed makemoredists for new exe names and adding Basque language set git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@271 d0cd1f9f-072b-0410-8dd7-cf729c803f20		2009-06-30 01:49:43 +00:00
ccmain	Improved box accuracy on failed blobs	2009-06-30 01:48:21 +00:00
ccstruct	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
ccutil	Fixed compilation with GRAPHICS_DISABLED	2009-06-03 17:24:08 +00:00
classify	Fixed compilation with GRAPHICS_DISABLED	2009-06-03 17:24:08 +00:00
config	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
cutil	Fixed compilation with GRAPHICS_DISABLED	2009-06-03 17:24:08 +00:00
dict	Fixed compilation with GRAPHICS_DISABLED	2009-06-03 17:24:08 +00:00
dlltest	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
doc	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
image	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
java	Automake changes for potential RC of 2.04	2009-06-03 02:52:02 +00:00
pageseg	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
tessdata	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
testing	Added testing results for 2.03 and 2.04	2009-06-30 01:46:29 +00:00
textord	Fixed compilation with GRAPHICS_DISABLED	2009-06-03 17:24:08 +00:00
training	Fixed compilation with GRAPHICS_DISABLED	2009-06-03 17:24:08 +00:00
viewer	Merged viewer with current code	2009-06-03 19:08:12 +00:00
wordrec	Fixed compilation with GRAPHICS_DISABLED	2009-06-03 17:24:08 +00:00
.cvsignore	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
acinclude.m4	Fixed name collision with jpeg library	2008-04-22 00:44:56 +00:00
AUTHORS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
ChangeLog	Automake changes for potential RC of 2.04	2009-06-03 02:39:05 +00:00
configure	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
configure.ac	Automake changes for potential RC of 2.04	2009-06-03 02:50:54 +00:00
COPYING	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
eurotext.tif	Automake changes for version 2.00.	2007-07-18 01:04:56 +00:00
INSTALL	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
INSTALL.SVN	added note about aclocal warnings to INSTALL.SVN	2007-03-31 15:26:51 +00:00
Makefile.am	Converted 8 spaces to tabs in two Makefile.am-s.	2008-04-22 14:49:14 +00:00
Makefile.in	Automake changes for potential RC of 2.04	2009-06-03 02:46:32 +00:00
makemoredists	Fixed makemoredists for new exe names and adding Basque language set	2009-06-30 01:49:43 +00:00
NEWS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
phototest.tif	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
README	Fixed name collision with jpeg library	2008-04-22 00:44:56 +00:00
ReleaseNotes	Merged viewer with current code	2009-06-03 19:08:12 +00:00
runautoconf	changed runautoconf instructions	2008-08-18 20:18:21 +00:00
StdAfx.cpp	More new files for v2.00	2007-07-18 01:30:21 +00:00
StdAfx.h	More new files for v2.00	2007-07-18 01:30:21 +00:00
tessdll_2008.vcproj	includes the solution and vcproj 2008 files!	2008-08-21 08:56:44 +00:00
tessdll.cpp	Fixed name collision with jpeg library	2008-04-22 00:44:56 +00:00
tessdll.dsp	More changes for VC++6	2009-06-02 22:12:36 +00:00
tessdll.h	Fixed name collision with jpeg library	2008-04-22 00:44:56 +00:00
tessdll.vcproj	Updated vcproj for VC++2008	2009-06-02 19:59:41 +00:00
tesseract_2008.sln	includes the solution and vcproj 2008 files!	2008-08-21 08:56:44 +00:00
tesseract_2008.vcproj	update the vcproj file so that I can copy all the relevant files to output dir	2008-09-26 06:55:25 +00:00
tesseract.dsp	More changes for VC++6	2009-06-02 22:12:36 +00:00
tesseract.dsw	Devstudio changes for v2.00.	2007-07-18 00:59:35 +00:00
tesseract.sln	Updated vcproj for VC++2008	2009-06-02 19:59:41 +00:00
tesseract.spec	Automake changes for version 2.00.	2007-07-18 01:04:56 +00:00
tesseract.vcproj	Updated vcproj for VC++2008	2009-06-02 19:59:41 +00:00

README

Introduction
============
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.


Other Dependencies and Licenses:
================================
The Aspirin/MIGRAINES system is no longer required.

Tesseract can also make use of the libtiff library. (www.libtiff.org)
Without libtiff, Tesseract can only read uncompressed and G3 compressed
TIFF files.
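
To see in advance whether a given TIFF falls into the "uncompressed or G3"
group, one option is to read its Compression tag (tag 259 in the TIFF 6.0
specification). The sketch below is not part of Tesseract. The function name
tiff_compression is hypothetical, and which tag values the README's "G3"
covers is an assumption: TIFF 6.0 lists both 2 and 3 as CCITT Group 3
variants. It only inspects the first image directory of the file.

```python
import struct

# Names for a few Compression (tag 259) values from the TIFF 6.0 specification.
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT Group 3 1-D modified Huffman",
    3: "CCITT T.4 (Group 3 fax)",
    4: "CCITT T.6 (Group 4 fax)",
    5: "LZW",
    32773: "PackBits",
}


def tiff_compression(data):
    """Return the Compression tag value from the first IFD of TIFF bytes."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, _field_type, _count = struct.unpack_from(endian + "HHI", data, entry)
        if tag == 259:
            # A single SHORT value is stored left-justified in the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # TIFF 6.0 default when the tag is absent: no compression


# Example use (the file name is a placeholder):
#   with open("page.tif", "rb") as f:
#       value = tiff_compression(f.read())
#   print(value, COMPRESSION_NAMES.get(value, "other"))
```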


History:
========
The engine was developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc2.95 and under Windows
with VC++6. The C++ code makes heavy use of a list system using macros.
This predates the STL, was portable before the STL, and is more efficient than STL
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug. Another "feature" of the C/C++ split is that the C++
data structures get converted to C data structures to call the low-level C
code. This is ugly, and the C++izing of the C code is a step towards
eliminating the conversion, but it has not happened yet.


Directory Structure (ordered by dependency):
============================================
ccmain     Top-level code. The main program resides in tesseractmain.cpp.
display    An "editor" to view and operate on the internal structures.
           (Requires a working viewer - batteries not included.)
wordrec    The word-level recognizer.
textord    The module that organizes (orders) text into lines and words.
classify   The low-level character classifiers.
ccstruct   Classes to hold information about a page as it is being processed.
viewer     The client side of a client server viewing system.
           Unfortunately, at this time, the server side is not available.
image      Image class and processing functions.
dict       Language model code.
cutil      Code for file I/O, lists, heaps etc, from the old C code.
ccutil     Somewhat newer code for lists, memory allocation etc from the
           old C++ code.


About the Engine
================
This code is a raw OCR engine. It has NO PAGE LAYOUT ANALYSIS, NO OUTPUT
FORMATTING, and NO UI. It can only process an image of a single column
and create text from it. It can detect fixed pitch vs proportional text.
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 2.0, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 6
languages "out of the box." Code and documentation are provided for the brave
to train in other languages. See code.google.com/p/tesseract-ocr for more
information on training.


Using the Engine
================
Windows:
The executable must reside in the same directory as the tessdata directory.
The command line is:
tesseract <image.tif> <output> [-l langid]
A Windows executable (tesseract.exe) is included in the distribution, but
may not work for you unless you also have the correct MFC and CRT DLLs.
There is also a tessdll.dll, which you can use to run tesseract from your
own program, but you may be better off building it yourself.

Non-Windows:
You have to tell Tesseract through a standard Unix mechanism where to find
its data directory. You must either run:
./configure
make
make install
to move the data files to the standard place, or put:
export TESSDATA_PREFIX="directory in which your tessdata resides/"
in your .profile (or the equivalent, such as setenv for csh-style shells) to
set the environment variable. Note that the directory must end in a /.
HAVING tesseract and tessdata IN THE SAME DIRECTORY DOES NOT WORK ANY MORE.
The command line is:
tesseract <image.tif> <output> [-l langid]
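
When Tesseract is driven from a script, the trailing-slash rule for
TESSDATA_PREFIX is easy to get wrong. The sketch below shows one way to build
the environment and the command line described above. The helper names
tesseract_env and tesseract_command are hypothetical and are not part of
Tesseract. It assumes a tesseract executable is on the PATH.

```python
import os


def tesseract_env(tessdata_parent, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set.

    tessdata_parent is the directory that contains tessdata. A trailing "/"
    is added if missing, because the README requires one.
    """
    env = dict(os.environ if base_env is None else base_env)
    if not tessdata_parent.endswith("/"):
        tessdata_parent += "/"
    env["TESSDATA_PREFIX"] = tessdata_parent
    return env


def tesseract_command(image, output, langid=None):
    """Build the argument list: tesseract <image.tif> <output> [-l langid]."""
    cmd = ["tesseract", image, output]
    if langid:
        cmd += ["-l", langid]
    return cmd


# Example use (paths are placeholders):
#   import subprocess
#   subprocess.run(tesseract_command("page.tif", "page", "eng"),
#                  env=tesseract_env("/usr/local/share"), check=True)
```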

All Systems:
The image file requires a .tif extension for its type to be recognized
correctly. If a file exists with the .tif extension replaced by .uzn, then it
will be interpreted as a UNLV-style zone file. (See www.isri.unlv.edu for
details of the zone files.)
langid may be one of the codes defined in ISO 639-3, and you must download
the corresponding data files into your tessdata directory.
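
The naming rules above can be checked before the engine is started. The
sketch below is not part of Tesseract, and the helper names zone_file_for and
looks_like_iso639_3 are hypothetical. It assumes the extension must be exactly
lowercase ".tif", which the README does not state. The language check tests
only the shape of an ISO 639-3 code (three lowercase letters), not whether
data files for that language are installed.

```python
import os


def zone_file_for(image_path):
    """Return the path where a UNLV-style zone file would be looked for.

    Raises ValueError if the image does not have a .tif extension.
    """
    root, ext = os.path.splitext(image_path)
    if ext != ".tif":
        raise ValueError("image file needs a .tif extension: " + image_path)
    return root + ".uzn"


def looks_like_iso639_3(langid):
    """True if langid has the form of an ISO 639-3 code: three lowercase letters."""
    return (
        len(langid) == 3
        and langid.isascii()
        and langid.isalpha()
        and langid.islower()
    )


# Example use (the file name is a placeholder):
#   zone = zone_file_for("page.tif")
#   if os.path.exists(zone):
#       print("zone file present:", zone)
```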