tesseract

mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-11-24 11:09:06 +08:00

Tesseract Open Source OCR Engine (main repository)

hacktoberfest lstm machine-learning ocr ocr-engine tesseract tesseract-ocr

Ray Smith 5129c43579 Stopped error messages from pixaGetCount		2016-12-14 10:46:27 -08:00
android	Result of clang tidy on recent merge	2016-11-07 10:46:33 -08:00
api	api: Add missing dependency on libtiff	2016-12-13 13:29:34 +01:00
arch	Fix 32 bit builds (missing _mm256_extract_epi64)	2016-11-24 07:32:49 +01:00
ccmain	Made LSTM the default engine, pushed cube out	2016-12-13 14:37:40 -08:00
ccstruct	Fix two typos in comments	2016-12-11 09:20:17 +01:00
ccutil	Made LSTM the default engine, pushed cube out	2016-12-13 14:37:40 -08:00
classify	Formatting changes from clang_tidy on latest pull	2016-11-30 15:44:25 -08:00
cmake	Fix windows build.	2016-11-24 17:32:23 +03:00
contrib	helper script to generate dawg input files from text	2016-10-17 19:04:29 +01:00
cube	Remove extra semicolons after member function definitions	2016-12-04 14:54:52 +01:00
cutil	Remove unused code.	2016-12-09 15:28:43 +02:00
dict	Merge pull request #353 from pnordhus/remove_dawgpositionvector_dtor	2016-12-08 13:04:58 +01:00
doc	doc: Fix line endings	2016-12-04 20:41:37 +01:00
java	java: Improve build rules	2016-12-11 22:04:17 +01:00
lstm	Made LSTM the default engine, pushed cube out	2016-12-13 14:37:40 -08:00
neural_networks/runtime	Fix two typos in comments	2016-12-11 09:20:17 +01:00
opencl	More clang-tidy from previous commits	2016-12-06 13:45:49 -08:00
tessdata	Added missing lstm.train	2016-12-06 08:48:23 -08:00
testing	Change tesseract parameter -psm to --psm	2016-11-30 22:23:46 +01:00
textord	Stopped error messages from pixaGetCount	2016-12-14 10:46:27 -08:00
training	Use pkg-config for icu compiler and linker flags	2016-12-13 13:29:34 +01:00
viewer	Fixes to training process to allow incremental training from a recognition model	2016-11-30 15:51:17 -08:00
vs2010	Remove 'listio.cpp' and 'listio.h' from vs2010 vcxproj	2016-12-09 16:19:02 +02:00
wordrec	Simplify delete operations	2016-11-24 17:59:13 +01:00
.gitignore	add option "make training-uninstall"	2016-11-22 08:42:55 +01:00
.travis.yml	Turn off macos travis build as it fails during bootstrap.	2016-10-11 17:21:52 +03:00
appveyor.yml	Update appveyor.yml	2016-10-16 21:33:54 +03:00
AUTHORS	AUTHORS: Add more contributors	2016-11-27 00:04:05 +02:00
autogen.sh	Added missing license headers	2016-11-18 15:53:11 -08:00
ChangeLog	Limited max height to 48 even in variable height input, enabled neural nets via ocr engine mode	2016-11-08 14:01:04 -08:00
CMakeLists.txt	Fix broken cmake builds	2016-11-23 16:37:26 +01:00
configure.ac	Use pkg-config for Leptonica compiler flags	2016-12-13 15:52:29 +01:00
CONTRIBUTING.md	CONTRIBUTING.md: Fix a typo	2016-05-29 13:27:33 +03:00
COPYING	Result of clang tidy on recent merge	2016-11-07 10:46:33 -08:00
cppan.yml	Update cppan.yml	2016-12-04 20:19:46 +03:00
docker-compose.yml	Dockerifying using travis build script	2016-03-18 00:32:35 -04:00
Dockerfile	Dockerifying using travis build script	2016-03-18 00:32:35 -04:00
INSTALL	install data files; small fix of INSTALL, README; removed ABOUT-NLS (NLS not used at the moment)	2012-02-05 16:25:40 +00:00
INSTALL.GIT.md	Fix typo in documentation	2016-11-22 08:25:43 +01:00
LICENSE	Added missing license headers	2016-11-18 15:53:11 -08:00
Makefile.am	add option "make training-uninstall"	2016-11-22 08:42:55 +01:00
NEWS	top-skimming import from sf.net	2007-03-07 20:03:40 +00:00
README.md	Change tesseract parameter -psm to --psm	2016-11-30 22:23:46 +01:00
tesseract.pc.in	improve tesseract.pc.in - fixes #241	2016-03-04 22:25:40 +01:00

README.md

For the latest online version of the README.md see:

https://github.com/tesseract-ocr/tesseract/blob/master/README.md

# About

This package contains an OCR engine, libtesseract, and a command line program, tesseract.

The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box". It can be trained to recognize other languages. See Tesseract Training for more information.

Tesseract supports various output formats: plain text, hOCR (HTML), and PDF.

This project does not include a GUI application. If you need one, please see the 3rdParty wiki page.

In many cases, you will need to improve the quality of the image you give Tesseract in order to get better OCR results.

The latest stable version is 3.04.01, released in February 2016.

# Brief history

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado, between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some C++izing in 1998.

In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

Release Notes

# For developers

Developers can use the libtesseract C or C++ API to build their own applications. If you need bindings to libtesseract for other programming languages, please see the wrapper section on the AddOns wiki page.

Documentation of Tesseract, generated from the source code by Doxygen, can be found on tesseract-ocr.github.io.

# License

The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

NOTE: This software depends on other packages that may be licensed under different open source licenses.

# Installing Tesseract

You can either install Tesseract via a pre-built binary package or build it from source.

# Running Tesseract

Basic command line usage:

tesseract imagename outputbase [-l lang] [--psm pagesegmode] [configfiles...]

For more information about the various command line options, run tesseract --help or man tesseract.
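When Tesseract is driven from a script, the usage synopsis above maps onto an argument list. The following is a minimal sketch in Python, not part of Tesseract itself: the helper names `build_tesseract_command` and `run_tesseract` are hypothetical, and it assumes the `tesseract` executable is on `PATH`.

```python
import shutil
import subprocess


def build_tesseract_command(imagename, outputbase, lang=None, psm=None, configfiles=()):
    """Assemble an argument list in the order given by the usage synopsis.

    Hypothetical helper for illustration; only the arguments shown in the
    synopsis (-l, --psm, config files) are handled.
    """
    cmd = ["tesseract", imagename, outputbase]
    if lang is not None:
        cmd += ["-l", lang]          # optional language
    if psm is not None:
        cmd += ["--psm", str(psm)]   # optional page segmentation mode
    cmd += list(configfiles)         # optional trailing config file names
    return cmd


def run_tesseract(*args, **kwargs):
    """Run the assembled command; assumes tesseract is installed and on PATH."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error.
    return subprocess.run(build_tesseract_command(*args, **kwargs), check=True)
```

For example, `build_tesseract_command("page.png", "page", lang="eng", psm=3)` yields the arguments for `tesseract page.png page -l eng --psm 3`; the file names and the language code here are placeholders. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.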

# Support

Mailing lists:

tesseract-ocr - For tesseract users.
tesseract-dev - For tesseract developers.

Please read the FAQ before asking a question on the mailing lists or reporting an issue.