From 7a90446a0b25a0ec695bef9c2ec56d5287b420bc Mon Sep 17 00:00:00 2001 From: Amit Dovev Date: Sat, 6 Feb 2016 15:04:27 +0200 Subject: [PATCH 1/3] Update README.md --- README.md | 167 ++++++++++++------------------------------------------ 1 file changed, 35 insertions(+), 132 deletions(-) diff --git a/README.md b/README.md index 7311d7f4..0b247613 100644 --- a/README.md +++ b/README.md @@ -1,20 +1,34 @@ [![Build Status](https://travis-ci.org/tesseract-ocr/tesseract.svg?branch=master)](https://travis-ci.org/tesseract-ocr/tesseract) [![Build status](https://ci.appveyor.com/api/projects/status/miah0ikfsf0j3819?svg=true)](https://ci.appveyor.com/project/zdenop/tesseract/) -Note that this is possibly out-of-date version of the wiki ReadMe, -which is located at: - https://github.com/tesseract-ocr/tesseract/blob/master/README.md +#About -Introduction -============ +This package contains an OCR engine - `libtesseract` and a command line program - `tesseract`. -This package contains the Tesseract Open Source OCR Engine. -Originally developed at Hewlett-Packard Laboratories Bristol and -at Hewlett-Packard Co, Greeley Colorado, all the code -in this distribution is now licensed under the Apache License: +The lead developer is Ray Smith. The maintainer is Zdenko Podobny. +For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS) and github's log of [contributors](https://github.com/tesseract-ocr/tesseract/graphs/contributors). - Licensed under the Apache License, Version 2.0 (the "License"); +Tesseract has unicode (UTF-8) support, and can recognize more than 100 +languages "out of the box". It can be trained to recognize other languages. See [Tesseract Training](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) for more information. + +Tesseract supports various output formats: plain-text, hocr(html), pdf. + +The latest stable version is 3.04, released in July 2015. + +#Brief history + +Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and +at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some +more changes made in 1996 to port to Windows, and some C++izing in 1998. + +In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google. + +[Release Notes](https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes) + +#License + + The code in this repository is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at @@ -26,135 +40,24 @@ in this distribution is now licensed under the Apache License: See the License for the specific language governing permissions and limitations under the License. +**NOTE**: This software depends on other packages that may be licensed under different open source licenses. -Dependencies and Licenses -========================= +#Installing Tesseract -[Leptonica](http://www.leptonica.com) is required. Tesseract no longer -compiles without Leptonica. +You can either [Install Tesseract via pre-built binary package](https://github.com/tesseract-ocr/tesseract/wiki) or [build it from source](https://github.com/tesseract-ocr/tesseract/wiki/Compiling). -Libtiff is no longer required as a direct dependency. +#Running Tesseract - -Installing and Running Tesseract --------------------------------- - -All Users Do NOT Ignore! - -The tarballs are split into pieces. - -tesseract-x.xx.tar.gz contains all the source code. - -tesseract-x.xx.``.tar.gz contains the language data files for ``. -You need at least one of these or Tesseract will not work. - -Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory. -tesseract-x.xx.``.tar.gz unpacks to the tessdata directory which -belongs inside your tesseract-ocr directory. It is therefore best to -download them into your tesseract-x.xx directory, so you can use unpack -here or equivalent. You can unpack as many of the language packs as you -care to, as they all contain different files. Note that if you are using -make install you should unpack your language data to your source tree -before you run make install. If you unpack them as root to the -destination directory of make install, then the user ids and access -permissions might be messed up. - -boxtiff-2.xx.``.tar.gz contains data that was used in training for -those that want to do their own training. Most users should NOT download -these files. - -Instructions for using the training tools are documented separately at -[Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) - - -Windows -------- - -Please use the installer (for 3.00 and above). Tesseract is a library with a -command line interface. If you need a GUI, please check the [3rdParty wiki page](https://github.com/tesseract-ocr/tesseract/wiki/3rdParty#gui). - -If you are building from the sources, the recommended build platform is -VC++ Express 2010. - -The executables are built with static linking, so they stand more chance -of working out of the box on more Windows systems. - -The executable must reside in the same directory as the tessdata -directory or you need to set up environment variable TESSDATA_PREFIX. -Installer will set it up for you. - -The command line is: +Basic command line usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...] -If you need interface to other applications, please check wrapper section -on [AddOns wiki page](https://github.com/tesseract-ocr/tesseract/wiki/AddOns#for-tesseract-ocr-30x). +To see the full usage use `tesseract --help` +#Support -Non-Windows (or Cygwin) ------------------------ +Mailing-lists: +* [tesseract-ocr](https://groups.google.com/d/forum/tesseract-ocr) - For tesseract users. +* [tesseract-dev](https://groups.google.com/d/forum/tesseract-dev) - For tesseract developers. -You have to tell Tesseract through a standard unix mechanism where to -find its data directory. You must either: - - ./autogen.sh - ./configure - make - sudo make install - sudo ldconfig - -to move the data files to the standard place, or: - - export TESSDATA_PREFIX="directory in which your tessdata resides/" - -In either case the command line is: - - tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...] - -New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for -the help.) It might work with your OS if you know how to do that. - -If you are linking to the libraries, as Ocropus does, please link to -libtesseract_api. - - -If you get `leptonica not found` and you've installed it with e.g. homebrew, you -can run `CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure` -instead of `./configure` above. - - -History -======= -The engine was developed at Hewlett-Packard Laboratories Bristol and -at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some -more changes made in 1996 to port to Windows, and some C++izing in 1998. -A lot of the code was written in C, and then some more was written in C++. -Since then all the code has been converted to at least compile with a C++ -compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows -with VC++2010. The C++ code makes heavy use of a list system using macros. -This predates stl, was portable before stl, and is more efficient than stl -lists, but has the big negative that if you do get a segmentation violation, -it is hard to debug. - -The most recent change is that Tesseract can now recognize 39 languages, -including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, -is fully UTF8 capable, and is fully trainable. See TrainingTesseract for -more information on training. - -Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. -Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. -With Tesseract 2.00, scripts were included to allow anyone to reproduce -some of these tests. See TestingTesseract for more details. - - -About the Engine -================ -This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple -OUTPUT FORMATTING (txt, hocr/html), and NO UI. -Having said that, in 1995, this engine was in the top 3 in terms of character -accuracy, and it compiles and runs on both Linux and Windows. -As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39 -languages "out of the box." Code and documentation is provided for the brave -to train in other languages. -See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) -for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen. +Please read the [FAQ](https://github.com/tesseract-ocr/tesseract/wiki/FAQ) before asking any question in the mailing-list or reporting an issue. From cc88f3509b7b6263994bb0180af3542e5d99124b Mon Sep 17 00:00:00 2001 From: Amit Dovev Date: Tue, 9 Feb 2016 16:42:12 +0200 Subject: [PATCH 2/3] Update README.md --- README.md | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 0b247613..b80ac28c 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,10 @@ languages "out of the box". It can be trained to recognize other languages. See Tesseract supports various output formats: plain-text, hocr(html), pdf. +This project does not include a GUI application. If you need one, please see the [3rdParty](https://github.com/tesseract-ocr/tesseract/wiki/3rdParty) wiki page. + +You should note that in many cases, in order to get better OCR results, you'll need to [improve the quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) of the image you are giving Tesseract. + The latest stable version is 3.04, released in July 2015. #Brief history @@ -26,6 +30,12 @@ In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google. [Release Notes](https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes) +#For developers + +Developers can use `libtesseract` [C](https://github.com/tesseract-ocr/tesseract/blob/master/api/capi.h) or [C++](https://github.com/tesseract-ocr/tesseract/blob/master/api/baseapi.h) API to build their own application. If you need bindings to `libtesseract` for other programming languages, please see the [wrapper](https://github.com/tesseract-ocr/tesseract/wiki/AddOns#tesseract-wrappers) section on AddOns wiki page. + +Documentation of Tesseract generated from source code by doxygen can be found on [tesseract-ocr.github.io](http://tesseract-ocr.github.io/). + #License The code in this repository is licensed under the Apache License, Version 2.0 (the "License"); @@ -52,7 +62,7 @@ Basic command line usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...] -To see the full usage use `tesseract --help` +For more information about the various command line options use `tesseract --help` or `man tesseract`. #Support From a67278f61a45743ec188970a7767f1589b63ab14 Mon Sep 17 00:00:00 2001 From: Amit Dovev Date: Tue, 9 Feb 2016 23:30:46 +0200 Subject: [PATCH 3/3] Update README.md --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index b80ac28c..54974043 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,9 @@ [![Build Status](https://travis-ci.org/tesseract-ocr/tesseract.svg?branch=master)](https://travis-ci.org/tesseract-ocr/tesseract) [![Build status](https://ci.appveyor.com/api/projects/status/miah0ikfsf0j3819?svg=true)](https://ci.appveyor.com/project/zdenop/tesseract/) +For the latest online version of the README.md see: + + https://github.com/tesseract-ocr/tesseract/blob/master/README.md #About