mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-11-24 02:59:07 +08:00

Tesseract Open Source OCR Engine (main repository)

hacktoberfest lstm machine-learning ocr ocr-engine tesseract tesseract-ocr

Dennis Schridde 6072814fea Compatibility with Leptonica 1.73 http://www.leptonica.org/source/version-notes.html: Naming changes (to avoid collisions): #defines MALLOC --> LEPT_MALLOC, CALLOC --> LEPT_CALLOC, etc. ByteBuffer --> L_ByteBuffer Introduction of the TESSERACT_LIBLEPT_PREREQ macro allows backward compatibility with Leptonica <1.73.		2016-01-31 12:21:20 +01:00
android	Fixes #74 NO_CUBE_BUILD with reverting to ANDROID_BUILD in baseapi	2015-08-09 18:09:30 +02:00
api	Fix #184 . Training should work now	2016-01-17 14:27:35 +02:00
ccmain	Add info for progress monitor, make it visible in doxygen doc; remove commented code	2016-01-05 17:21:53 +01:00
ccstruct	Remove checks for this == NULL	2015-11-07 13:09:53 +01:00
ccutil	Merge pull request #180 from stweil/master	2016-01-05 17:22:57 +01:00
classify	Fix free of buffer which was not allocated	2015-11-27 07:02:22 +01:00
cmake	Improve leptonica search.	2016-01-26 14:52:18 +03:00
contrib	replace vs2008 directory with vs2010 directory (fixes cygwin build)	2015-07-20 20:35:52 +02:00
cube	Fix more typos in comments (found by codespell)	2015-11-04 21:58:42 +01:00
cutil	Remove register attribute for local variables	2015-11-06 06:45:19 +01:00
dict	Remove unneeded const qualifiers	2015-11-05 06:36:42 +01:00
doc	Doxyfile: Fix typo in comment (found by codespell)	2015-09-14 22:17:48 +02:00
java	Java: Fix typos in comments and strings	2015-09-14 22:18:44 +02:00
neural_networks/runtime	Revert "temporary add config/*, configure and Makefile.in for release"	2015-07-31 21:44:43 +02:00
opencl	Compatibility with Leptonica 1.73	2016-01-31 12:21:20 +01:00
tessdata	Update Makefile.am	2015-12-18 16:12:32 +02:00
testing	testing: Fix typo in comment (found by codespell)	2015-11-04 21:58:42 +01:00
textord	Fix compiler warnings (remove unused constants)	2015-12-21 10:01:47 +01:00
training	Add Junicode to neo-Latin fonts	2016-01-13 10:15:57 -05:00
viewer	Fix compiler warnings (remove unused constants)	2015-12-21 10:01:47 +01:00
vs2010	remove empty header file secname.h	2015-07-31 17:32:54 +02:00
wordrec	Remove register attribute for local variables	2015-11-06 06:45:19 +01:00
.gitignore	Merge branch 'master' of github.com:tesseract-ocr/tesseract	2015-10-02 12:02:04 +03:00
.travis.yml	Update .travis.yml	2016-01-26 14:15:17 +03:00
appveyor.yml	Update appveyor.yml	2016-01-26 14:15:37 +03:00
AUTHORS	Integrated patch to AUTHORS fixing issue 814 and adding more authors from the code	2013-01-03 18:02:49 +00:00
autogen.sh	autogen.sh: fix a bashism	2015-07-13 17:26:40 +02:00
ChangeLog	change order of entries V1.0 ... V2.04	2015-06-11 01:34:45 -04:00
CMakeLists.txt	Improve leptonica search.	2016-01-26 14:52:18 +03:00
configure.ac	autotools: fail if g++ or clang++ compiler is not found; Fixes #130	2015-11-04 22:39:24 +01:00
COPYING	Fix grammar in license file	2015-12-07 14:34:24 +01:00
INSTALL	install data files; small fix of INSTALL, README; removed ABOUT-NLS (NLS not used at the moment)	2012-02-05 16:25:40 +00:00
INSTALL.GIT.md	Fix typo in documentation and add missing blank	2015-12-07 14:37:25 +01:00
Makefile.am	rename README to README.md - fixes #45	2015-08-20 13:58:36 +02:00
NEWS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
README.md	cmake - add initial cmake instruction to INSTALL.GIT ; rename cmake output tesseractmain to tesseract; updage badges links	2015-10-10 17:26:32 +02:00
ReleaseNotes	Remaining misc changes for 3.02	2012-02-02 03:14:43 +00:00
tesseract.pc.in	change links from code.google.com to github.com	2015-07-11 09:43:31 +02:00

README.md

Note that this is possibly an out-of-date version of the wiki ReadMe, which is located at:

https://github.com/tesseract-ocr/tesseract/blob/master/README.md

Introduction

This package contains the Tesseract Open Source OCR Engine. Originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado, all the code in this distribution is now licensed under the Apache License:

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Dependencies and Licenses

Leptonica is required. Tesseract no longer compiles without Leptonica.

Libtiff is no longer required as a direct dependency.

Installing and Running Tesseract

All Users Do NOT Ignore!

The tarballs are split into pieces.

tesseract-x.xx.tar.gz contains all the source code.

tesseract-x.xx.<lang>.tar.gz contains the language data files for <lang>. You need at least one of these or Tesseract will not work.

Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory. tesseract-x.xx.<lang>.tar.gz unpacks to the tessdata directory, which belongs inside your tesseract-ocr directory. It is therefore best to download them into your tesseract-x.xx directory, so you can use "unpack here" or an equivalent. You can unpack as many of the language packs as you care to, as they all contain different files. Note that if you are using make install, you should unpack your language data into your source tree before you run make install. If you unpack them as root into the destination directory of make install, the user IDs and access permissions might be messed up.
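Since Tesseract will not work without at least one language pack, it can help to check the tessdata directory after unpacking. The following is a minimal sketch, not part of the distribution. It assumes that language packs unpack to files named <lang>.traineddata (the 3.x naming) and that the source tree sits in tesseract-ocr/ under the current directory; the count_langs helper is a hypothetical name.

```shell
#!/bin/sh
# Count language data files in a tessdata directory.
# Assumption: each language pack provides a file named <lang>.traineddata.
count_langs() {
  dir="$1"
  n=0
  for f in "$dir"/*.traineddata; do
    # When the glob matches nothing, the literal pattern comes back; skip it.
    [ -e "$f" ] || continue
    n=$((n + 1))
  done
  echo "$n"
}

# Hypothetical location; adjust to wherever you unpacked the tarballs.
tessdata_dir="tesseract-ocr/tessdata"

if [ "$(count_langs "$tessdata_dir")" -eq 0 ]; then
  echo "no language data found in $tessdata_dir" >&2
else
  echo "language data files found: $(count_langs "$tessdata_dir")"
fi
```

Running this before make install gives an early warning if the language data ended up outside the source tree.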

boxtiff-2.xx.<lang>.tar.gz contains data that was used in training for those that want to do their own training. Most users should NOT download these files.

Instructions for using the training tools are documented separately on the Tesseract Training wiki.

Windows

Please use the installer (for 3.00 and above). Tesseract is a library with a command line interface. If you need a GUI, please check the 3rdParty wiki page.

If you are building from the sources, the recommended build platform is VC++ Express 2010.

The executables are built with static linking, so they stand a better chance of working out of the box on more Windows systems.

The executable must reside in the same directory as the tessdata directory, or you need to set the environment variable TESSDATA_PREFIX. The installer will set it up for you.

The command line is:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]

If you need an interface to other applications, please check the wrapper section on the AddOns wiki page.

Non-Windows (or Cygwin)

You have to tell Tesseract through a standard Unix mechanism where to find its data directory. You must either run:

./autogen.sh
./configure
make
sudo make install
sudo ldconfig

to move the data files to the standard place, or:

export TESSDATA_PREFIX="directory in which your tessdata resides/"

In either case the command line is:

tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
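As an illustration of the synopsis above, here is a hedged sketch of one concrete invocation. The file names page.tif and page are hypothetical placeholders, eng assumes the English language data is installed, and page segmentation mode 3 (fully automatic, no orientation and script detection) is assumed to be the mode wanted.

```shell
#!/bin/sh
# Hypothetical inputs: replace with your own image and output base name.
image="page.tif"
outputbase="page"
lang="eng"   # language code; needs the matching language data installed
psm=3        # page segmentation mode; 3 is fully automatic without OSD

# Assemble the command following the synopsis above.
set -- tesseract "$image" "$outputbase" -l "$lang" -psm "$psm"
echo "$*"

# Run only when the binary and the input image are actually present.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  "$@"
fi
```

With these placeholder names, the recognized text should end up in page.txt, since Tesseract appends the extension to the output base.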

There is now a tesseract.spec for making RPMs. (Thanks to Andrew Ziem for the help.) It might work with your OS if you know how to do that.

If you are linking to the libraries, as Ocropus does, please link to libtesseract_api.

If you get a "leptonica not found" error and you have installed Leptonica with, e.g., Homebrew, you can run CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure instead of ./configure above.

History

The engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows with VC++2010. The C++ code makes heavy use of a list system using macros. This predates the STL, was portable before the STL, and is more efficient than STL lists, but has the big drawback that if you do get a segmentation violation, it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages, including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, is fully UTF-8 capable, and is fully trainable. See TrainingTesseract for more information on training.

Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. With Tesseract 2.00, scripts were included to allow anyone to reproduce some of these tests. See TestingTesseract for more details.

About the Engine

This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple OUTPUT FORMATTING (txt, hocr/html), and NO UI. Having said that, in 1995, this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. As of 3.01, Tesseract is fully Unicode (UTF-8) enabled, and can recognize 39 languages "out of the box." Code and documentation are provided for the brave to train in other languages. See the Tesseract Training wiki for more information on training. Additional code and extracted documentation were generated by Doxygen.