tesseract

mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-11-27 20:59:36 +08:00

Tesseract Open Source OCR Engine (main repository)

hacktoberfest lstm machine-learning ocr ocr-engine tesseract tesseract-ocr

theraysmith@gmail.com 6e273b71bd Cube trained data for fra, ita, rus, spa git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@656 d0cd1f9f-072b-0410-8dd7-cf729c803f20		2012-02-02 03:08:26 +00:00
api	Moved ResultIterator/PageIterator to ccmain	2012-02-02 02:47:59 +00:00
ccmain	Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic, Added paragraph detection in layout analysis/post OCR, Fixed inconsistent xheight during training and over-chopping, Added simultaneous multi-language capability, Refactored top-level word recognition module, Fixed problems with internally scaled images	2012-02-02 02:59:49 +00:00
ccstruct	Added simultaneous multi-language capability, Refactored top-level word recognition module, Blamer module added for error analysis, Tidied up constraints on control parameters, Added UNICHARSET to WERD_CHOICE to make mult-language handling easier, Added word bigram correction	2012-02-02 03:06:39 +00:00
ccutil	Removed dead memory mangagement code	2012-02-02 02:51:56 +00:00
classify	Added simultaneous multi-language capability, Added support for ShapeTable in classifier and training, Refactored class pruner, Added new uniform classifier API, Added new training error counter	2012-02-02 02:57:42 +00:00
contrib	move bash completion script to a contrib directory instead of littering up the top level	2010-05-29 16:33:10 +00:00
cube	Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic, Refactored top-level word recognition module, Added simultaneous multi-language capability.	2012-02-02 03:03:56 +00:00
cutil	fixed "one lib" build on linux; runautoconf renamed to autogen.sh;	2011-10-16 19:39:54 +00:00
debian	Debian packages of Leptonica to allow use of 1.67	2010-11-30 01:35:37 +00:00
dict	Fixed endian bug in dawg reader, Added word bigram correction,	2012-02-02 02:56:18 +00:00
doc	man pages included to install script, improved windows installer script (issue 425), output format for "tesseract -v" changed to "3.00 version", README cleanup.	2011-08-08 20:33:18 +00:00
image	fixed "one lib" build on linux; runautoconf renamed to autogen.sh;	2011-10-16 19:39:54 +00:00
java	more Makefile.in	2011-08-18 18:40:33 +00:00
neural_networks/runtime	make single/multiple libraries optional -- this needs testing!!!	2011-08-29 21:28:28 +00:00
po	3.01 code from http://github.com/jimregan/tesseract-ocr with addaptions related to Linux and Windows (VC2008) compile process	2010-11-23 18:34:14 +00:00
tessdata	Cube trained data for fra, ita, rus, spa	2012-02-02 03:08:26 +00:00
testing	Deleted Makefile.in from svn	2011-08-18 16:32:44 +00:00
textord	Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding	2012-02-02 02:53:04 +00:00
training	Fixed training leaks and randomness	2012-02-02 03:02:16 +00:00
viewer	svpaint.cpp moved from include to source	2011-10-16 20:23:49 +00:00
vs2008	removed BOM form strngs.h, updated NSIS script and COPYING	2011-10-22 18:27:31 +00:00
vs2010	VC2010: add support for dynamic linking	2011-10-15 22:17:19 +00:00
wordrec	Refactored top-level word recognition module, Blamer module added for error analysis, Added word bigram correction	2012-02-02 03:01:38 +00:00
ABOUT-NLS	partially address issue 353	2010-09-15 23:39:21 +00:00
aclocal.m4	Last minute fixes for making the tarball	2011-10-22 05:28:44 +00:00
AUTHORS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
autogen.sh	fixed "one lib" build on linux; runautoconf renamed to autogen.sh;	2011-10-16 19:39:54 +00:00
ChangeLog	Misc Makefile etc for 3.01	2010-11-30 01:30:09 +00:00
configure.ac	fixed "one lib" build on linux; runautoconf renamed to autogen.sh;	2011-10-16 19:39:54 +00:00
COPYING	removed BOM form strngs.h, updated NSIS script and COPYING	2011-10-22 18:27:31 +00:00
eurotext.tif	Automake changes for version 2.00.	2007-07-18 01:04:56 +00:00
INSTALL	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
INSTALL.SVN	fixed "one lib" build on linux; runautoconf renamed to autogen.sh;	2011-10-16 19:39:54 +00:00
Makefile.am	Last minute fixes for making the tarball	2011-10-22 05:28:44 +00:00
makemoredists	fixed doxygen path and included doxygen to 'makemoredists' script	2011-06-25 21:58:59 +00:00
NEWS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
phototest.tif	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
README	fixed "one lib" build on linux; runautoconf renamed to autogen.sh;	2011-10-16 19:39:54 +00:00
ReleaseNotes	Last minute fixes for making the tarball	2011-10-22 05:28:44 +00:00
tesseract.spec	improved script for creating language packages, improved tesseract.spec	2010-09-26 20:11:50 +00:00

README

Note that this is a text-only and possibly out-of-date version of the 
wiki ReadMe, which is located at:
 http://code.google.com/p/tesseract-ocr/wiki/ReadMe

Introduction
============

This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.


Dependencies and Licenses:
==========================

Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
without Leptonica.
Libtiff is no longer required as a direct dependency.


Installing and Running Tesseract
================================

All Users Do NOT Ignore!
------------------------

The tarballs are split into pieces.

tesseract-x.xx.tar.gz contains all the source code.

tesseract-x.xx.<lang>.tar.gz contains the language data files for <lang>.
You need at least one of these or Tesseract will not work.

Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory.
tesseract-x.xx.<lang>.tar.gz unpacks to the tessdata directory which 
belongs inside your tesseract-ocr directory. It is therefore best to 
download them into your tesseract-x.xx directory, so you can use "unpack
here" or an equivalent. You can unpack as many of the language packs as you
care to, as they all contain different files. Note that if you are using
make install you should unpack your language data to your source tree 
before you run make install. If you unpack them as root to the 
destination directory of make install, then the user ids and access
permissions might be messed up.
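
As a rough sketch of the unpacking order described above, the following
shell function extracts every language tarball found in the current
directory into the source tree before make install is run. The function
name unpack_lang_packs is hypothetical, and the sketch assumes each
language tarball contains a top-level tessdata directory, as described
above; check the layout of the tarballs you actually downloaded.

```shell
# Hypothetical helper (not part of Tesseract): unpack every language
# tarball in the current directory into the source tree given as $1.
# Assumes each tarball unpacks to a top-level "tessdata" directory.
unpack_lang_packs() {
    src_tree="${1:-tesseract-ocr}"
    mkdir -p "$src_tree" || return 1
    # The pattern needs two dots before ".tar.gz", so it matches
    # tesseract-x.xx.<lang>.tar.gz but not the source tarball
    # tesseract-x.xx.tar.gz.
    for pack in tesseract-*.*.*.tar.gz; do
        [ -e "$pack" ] || continue   # glob matched nothing
        tar -xzf "$pack" -C "$src_tree" || return 1
    done
}
```

Run it as a normal user from the directory holding the tarballs, for
example "unpack_lang_packs tesseract-ocr", and only then run make install.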

boxtiff-2.xx.<lang>.tar.gz contains data that was used in training for 
those that want to do their own training. Most users should NOT download
these files.

Instructions for using the training tools are documented separately on the
Tesseract wiki: http://code.google.com/p/tesseract-ocr/w/list


Windows:
--------

Please use the installer (for 3.00 and above). Tesseract is a library with
a command line interface. If you need a GUI, please check the AddOns wiki page:
http://code.google.com/p/tesseract-ocr/wiki/AddOns#GUI

If you are building from the sources, the recommended build platform is 
VC++ Express 2008 (optionally 2010).

The executables are built with static linking, so they stand more chance
of working out of the box on more Windows systems.

The executable must reside in the same directory as the tessdata
directory, or you need to set the environment variable TESSDATA_PREFIX.
The installer will set it up for you.

The command line is:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]

If you need an interface to other applications, please check the wrapper
section on the AddOns wiki page:
http://code.google.com/p/tesseract-ocr/wiki/AddOns#Tesseract_3.0x


Non-Windows (or Cygwin):
------------------------

You have to tell Tesseract through a standard unix mechanism where to 
find its data directory. You must either:

./autogen.sh
./configure
make
make install

to move the data files to the standard place, or:

export TESSDATA_PREFIX="directory in which your tessdata resides/"

In either case the command line is:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
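
For the TESSDATA_PREFIX route, a small pre-flight check can catch a wrong
data directory before Tesseract is run. The sketch below is only an
illustration: list_langs and ocr_page are hypothetical helper names, the
<lang>.traineddata file naming and the "eng" default are assumptions about
the installed language packs, and the check does not apply if the data was
placed by make install without TESSDATA_PREFIX being set.

```shell
# Hypothetical helper: print the language codes found under
# $TESSDATA_PREFIX/tessdata, one per line. Assumes language data files
# are named <lang>.traineddata.
list_langs() {
    prefix="${TESSDATA_PREFIX:-}"
    data_dir="${prefix%/}/tessdata"   # tolerate a trailing slash
    if [ ! -d "$data_dir" ]; then
        echo "no tessdata directory under '$prefix'" >&2
        return 1
    fi
    for f in "$data_dir"/*.traineddata; do
        [ -e "$f" ] || continue       # glob matched nothing
        basename "$f" .traineddata
    done
}

# Hypothetical wrapper around the command line shown above:
#   ocr_page imagename outputbase [lang]
ocr_page() {
    image="$1"
    outbase="$2"
    lang="${3:-eng}"                  # assumed default language
    if ! list_langs | grep -qx "$lang"; then
        echo "language '$lang' not found" >&2
        return 1
    fi
    tesseract "$image" "$outbase" -l "$lang"
}
```

For example, "ocr_page phototest.tif out eng" would run Tesseract on the
sample image only if eng data is present in the data directory.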

New: there is a tesseract.spec for making RPMs. (Thanks to Andrew Ziem for
the help.) It might work with your OS if you know how to do that.

If you are linking to the libraries, as Ocropus does, please link to
libtesseract_api.



History:
========
The engine was developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc4.4.3 and under Windows
with VC++2008. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficient than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants,
is fully UTF-8 capable, and is fully trainable. See TrainingTesseract for
more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. 
Results were available on http://www.isri.unlv.edu/downloads/AT-1995.pdf.
With Tesseract 2.00, scripts were included to allow anyone to reproduce 
some of these tests. See TestingTesseract for more details. 


About the Engine
================
This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
OUTPUT FORMATTING (txt, hocr/html), and NO UI. 
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 3.01, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 39
languages "out of the box." Code and documentation are provided for the brave
to train in other languages. See code.google.com/p/tesseract-ocr for more
information on training.