Tesseract Open Source OCR Engine (main repository)
Jim O'Regan 524a61452d Doxygen
Squashed commit from https://github.com/tesseract-ocr/tesseract/tree/more-doxygen
closes #14

Commits:
6317305  doxygen
9f42f69  doxygen
0fc4d52  doxygen
37b4b55  fix typo
bded8f1  some more doxy
020eb00  slight tweak
524666d  doxygenify
2a36a3e  doxygenify
229d218  doxygenify
7fd28ae  doxygenify
a8c64bc  doxygenify
f5d21b6  fix
5d8ede8  doxygenify
a58a4e0  language_model.cpp
fa85709  lm_pain_points.cpp lm_state.cpp
6418da3  merge
06190ba  Merge branch 'old_doxygen_merge' into more-doxygen
84acf08  Merge branch 'master' into more-doxygen
50fe1ff  pagewalk.cpp cube_reco_context.cpp
2982583  change to relative
192a24a  applybox.cpp, take one
8eeb053  delete docs for obsolete params
52e4c77  modernise classify/ocrfeatures.cpp
2a1cba6  modernise cutil/emalloc.cpp
773e006  silence doxygen warning
aeb1731  silence doxygen warning
f18387f  silence doxygen; new params are unused?
15ad6bd  doxygenify cutil/efio.cpp
c8b5dad  doxygenify cutil/danerror.cpp
784450f  the globals and exceptions parts are obsolete; remove
8bca324  doxygen classify/normfeat.cpp
9bcbe16  doxygen classify/normmatch.cpp
aa9a971  doxygen ccmain/cube_control.cpp
c083ff2  doxygen ccmain/cube_reco_context.cpp
f842850  params changed
5c94f12  doxygen ccmain/cubeclassifier.cpp
15ba750  case sensitive
f5c71d4  case sensitive
f85655b  doxygen classify/intproto.cpp
4bbc7aa  partial doxygen classify/mfx.cpp
dbb6041  partial doxygen classify/intproto.cpp
2aa72db  finish doxygen classify/intproto.cpp
0b8de99  doxygen training/mftraining.cpp
0b5b35c  partial doxygen ccstruct/coutln.cpp
b81c766  partial doxygen ccstruct/coutln.cpp
40fc415  finished? doxygen ccstruct/coutln.cpp
6e4165c  doxygen classify/clusttool.cpp
0267dec  doxygen classify/cutoffs.cpp
7f0c70c  doxygen classify/fpoint.cpp
512f3bd  ignore ~ files
5668a52  doxygen classify/intmatcher.cpp
84788d4  doxygen classify/kdtree.cpp
29f36ca  doxygen classify/mfoutline.cpp
40b94b1  silence doxygen warnings
6c511b9  doxygen classify/mfx.cpp
f9b4080  doxygen classify/outfeat.cpp
aa1df05  doxygen classify/picofeat.cpp
cc5f466  doxygen training/cntraining.cpp
cce044f  doxygen training/commontraining.cpp
167e216  missing param
9498383  renamed params
37eeac2  renamed param
d87b5dd  case
c8ee174  renamed params
b858db8  typo
4c2a838  h2 context?
81a2c0c  fix some param names; add some missing params, no docs
bcf8a4c  add some missing params, no docs
af77f86  add some missing params, no docs; fix some param names
01df24e  fix some params
6161056  fix some params
68508b6  fix some params
285aeb6  doxygen complains here no matter what
529bcfa  rm some missing params, typos
cd21226  rm some missing params, add some new ones
48a4bc2  fix params
c844628  missing param
312ce37  missing param; rename one
ec2fdec  missing param
05e15e0  missing params
d515858  change "<" to &lt; to make doxygen happy
b476a28  wrong place
2015-07-20 18:48:00 +01:00
android Add ability to build under android (without cube or scrollview). 2015-05-12 15:41:15 -07:00
api Merge pull request #52 from unbe/null-pointer-access-in-hocr 2015-07-20 07:40:59 +02:00
ccmain Doxygen 2015-07-20 18:48:00 +01:00
ccstruct Doxygen 2015-07-20 18:48:00 +01:00
ccutil Doxygen 2015-07-20 18:48:00 +01:00
classify Doxygen 2015-07-20 18:48:00 +01:00
config temporary add config/* for release 2015-07-11 09:52:45 +02:00
contrib fix svn:executable atribute, trailing spaces, version include 2013-11-03 17:24:00 +00:00
cube Doxygen 2015-07-20 18:48:00 +01:00
cutil Doxygen 2015-07-20 18:48:00 +01:00
dict Doxygen 2015-07-20 18:48:00 +01:00
doc Doxygen 2015-07-20 18:48:00 +01:00
java temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
neural_networks/runtime temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
opencl temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
tessdata disable text creation for unlv, makebox, box.train, and box.train.stderr (see #49) 2015-07-20 10:07:55 +01:00
testing temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
textord temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
training Doxygen 2015-07-20 18:48:00 +01:00
viewer temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
vs2010 change version 2015-07-11 09:43:50 +02:00
wordrec Doxygen 2015-07-20 18:48:00 +01:00
.gitignore Doxygen 2015-07-20 18:48:00 +01:00
AUTHORS Integrated patch to AUTHORS fixing issue 814 and adding more authors from the code 2013-01-03 18:02:49 +00:00
autogen.sh autogen.sh: fix a bashism 2015-07-13 17:26:40 +02:00
ChangeLog change order of entries V1.0 ... V2.04 2015-06-11 01:34:45 -04:00
configure temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
configure.ac change version 2015-07-11 09:43:50 +02:00
COPYING removed BOM form strngs.h, updated NSIS script and COPYING 2011-10-22 18:27:31 +00:00
INSTALL install data files; small fix of INSTALL, README; removed ABOUT-NLS (NLS not used at the moment) 2012-02-05 16:25:40 +00:00
INSTALL.GIT change links from code.google.com to github.com 2015-07-11 09:43:31 +02:00
Makefile.am fix filemode; 2014-08-14 23:37:17 +02:00
Makefile.in temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
NEWS top-skimming import from sf.net 2007-03-07 20:03:40 +00:00
README fix link in README 2015-07-20 07:48:27 +02:00
ReleaseNotes Remaining misc changes for 3.02 2012-02-02 03:14:43 +00:00
tesseract.pc.in change links from code.google.com to github.com 2015-07-11 09:43:31 +02:00

Note that this is a text-only and possibly out-of-date version of the 
wiki ReadMe, which is located at:

  https://github.com/tesseract-ocr/tesseract/blob/master/README

Introduction
============

This package contains the Tesseract Open Source OCR Engine.
The engine was originally developed at Hewlett-Packard Laboratories Bristol
and at Hewlett-Packard Co, Greeley Colorado. All the code
in this distribution is now licensed under the Apache License:

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.


Dependencies and Licenses
=========================

[Leptonica](http://www.leptonica.com) is required. Tesseract no longer 
compiles without Leptonica.

Libtiff is no longer required as a direct dependency.


Installing and Running Tesseract
--------------------------------

All Users Do NOT Ignore!

The tarballs are split into pieces.

tesseract-x.xx.tar.gz contains all the source code.

tesseract-x.xx.`<lang>`.tar.gz contains the language data files for `<lang>`.
You need at least one of these or Tesseract will not work.

Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory.
tesseract-x.xx.`<lang>`.tar.gz unpacks to the tessdata directory, which
belongs inside your tesseract-ocr directory. It is therefore best to
download them into your tesseract-x.xx directory, so that you can unpack
them in place ("unpack here" or equivalent). You can unpack as many of the
language packs as you care to, as they all contain different files. Note
that if you are using make install, you should unpack your language data
into your source tree before you run make install. If you unpack them as
root into the destination directory of make install, the user IDs and
access permissions might end up wrong.

boxtiff-2.xx.`<lang>`.tar.gz contains data that was used in training for 
those that want to do their own training. Most users should NOT download
these files.

Instructions for using the training tools are documented separately on the
[Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract).


Windows
-------

Please use the installer (for 3.00 and above). Tesseract is a library with a 
command line interface. If you need a GUI, please check the [3rdParty wiki page](https://github.com/tesseract-ocr/tesseract/wiki/3rdParty#gui).

If you are building from the sources, the recommended build platform is 
VC++ Express 2008 (optionally 2010).

The executables are built with static linking, so they stand more chance
of working out of the box on more Windows systems.

The executable must reside in the same directory as the tessdata
directory, or you need to set the TESSDATA_PREFIX environment variable.
The installer sets it up for you.

The command line is:

    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]

If you need an interface to other applications, please check the wrapper
section on the [AddOns wiki page](https://github.com/tesseract-ocr/tesseract/wiki/AddOns#for-tesseract-ocr-30x).


Non-Windows (or Cygwin)
-----------------------

You have to tell Tesseract through a standard Unix mechanism where to
find its data directory. You must either run:

    ./autogen.sh
    ./configure
    make
    make install
    sudo ldconfig

to move the data files to the standard place, or:

    export TESSDATA_PREFIX="directory in which your tessdata resides/"
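For the second option, a minimal sketch of a shell helper follows. The function name `set_tessdata_prefix` is hypothetical and not part of Tesseract. It only checks that the directory you pass really contains a tessdata directory before exporting the variable with a trailing slash, as in the line above.

```shell
# Hypothetical helper (not shipped with Tesseract): export TESSDATA_PREFIX
# for the directory that contains the tessdata directory.
set_tessdata_prefix() {
    parent=${1%/}                       # drop one trailing slash, if present
    if [ ! -d "$parent/tessdata" ]; then
        echo "no tessdata directory under $parent" >&2
        return 1
    fi
    TESSDATA_PREFIX="$parent/"          # trailing slash, as in the export line above
    export TESSDATA_PREFIX
}
```

Calling it as `set_tessdata_prefix "$HOME/tesseract-ocr"` (an example path) would then make the setting visible to any tesseract process started from the same shell.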

In either case the command line is:

    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
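As an illustration of how the optional arguments fit together, here is a small sketch of a wrapper function. The name `ocr_page` and the `TESSERACT_BIN` override variable are assumptions made for this example; Tesseract itself does not read `TESSERACT_BIN`. The wrapper passes `-l` and `-psm` only when a value is supplied.

```shell
# Hypothetical wrapper around the command line shown above.
# TESSERACT_BIN is an assumed override for this sketch; it defaults to "tesseract".
# The body runs in a subshell so its variables do not leak into the caller.
ocr_page() (
    image=$1
    outputbase=$2
    lang=${3:-}
    psm=${4:-}

    set -- "$image" "$outputbase"
    if [ -n "$lang" ]; then
        set -- "$@" -l "$lang"          # optional language
    fi
    if [ -n "$psm" ]; then
        set -- "$@" -psm "$psm"         # optional page segmentation mode
    fi
    "${TESSERACT_BIN:-tesseract}" "$@"
)
```

For example, `ocr_page page.tif out eng 3` would run `tesseract page.tif out -l eng -psm 3`, where the file names, language, and mode are placeholders.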

There is now a tesseract.spec for making RPMs. (Thanks to Andrew Ziem for
the help.) It might work with your OS if you know how to do that.

If you are linking to the libraries, as Ocropus does, please link to
libtesseract_api.


If you get `leptonica not found` and you've installed it with e.g. homebrew, you
can run `CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure`
instead of `./configure` above.
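If Leptonica lives under a different prefix, the same idea can be sketched as a small function. The name `configure_with_prefix` and the `CONFIGURE` override variable are hypothetical and exist only for this example; the default prefix `/usr/local` is taken from the command above.

```shell
# Hypothetical helper: run ./configure with include and library search paths
# for a given install prefix (defaults to /usr/local, as in the command above).
# CONFIGURE is an assumed override for the script to run.
configure_with_prefix() (
    prefix=${1:-/usr/local}
    CPPFLAGS="-I$prefix/include" LDFLAGS="-L$prefix/lib" "${CONFIGURE:-./configure}"
)
```

Running `configure_with_prefix` with no argument is equivalent to the command above; passing another prefix changes both search paths together.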


History
=======
The engine was developed at Hewlett-Packard Laboratories Bristol and
at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
with VC++ 2008. The C++ code makes heavy use of a list system using macros.
This predates the STL, was portable before the STL, and is more efficient
than STL lists, but has the big negative that if you do get a segmentation
violation, it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, 
is fully UTF-8 capable, and is fully trainable. See TrainingTesseract for
more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. 
The results are available at https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf.
With Tesseract 2.00, scripts were included to allow anyone to reproduce 
some of these tests. See TestingTesseract for more details. 


About the Engine
================
This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
OUTPUT FORMATTING (txt, hocr/html), and NO UI. 
Having said that, in 1995, this engine was in the top 3 in terms of character
accuracy, and it compiles and runs on both Linux and Windows.
As of 3.01, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 39
languages "out of the box." Code and documentation are provided for the brave
to train in other languages.
See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) 
for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen.