Merge pull request #211 from amitdo/amitdo-readme-update1

Update README.md
2024-11-25 03:29:05 +08:00 · 2016-02-12 09:46:37 +01:00 · 2016-02-12 09:46:37 +01:00 · 640a98f24b
commit 640a98f24b
parent 4393d040bd a67278f61a
1 changed file with 48 additions and 132 deletions
--- a/README.md
+++ b/README.md
@@ -1,20 +1,47 @@
 [![Build Status](https://travis-ci.org/tesseract-ocr/tesseract.svg?branch=master)](https://travis-ci.org/tesseract-ocr/tesseract)
 [![Build status](https://ci.appveyor.com/api/projects/status/miah0ikfsf0j3819?svg=true)](https://ci.appveyor.com/project/zdenop/tesseract/)

-Note that this is possibly out-of-date version of the wiki ReadMe,
-which is located at:
-
+For the latest online version of the README.md see:
+    
  https://github.com/tesseract-ocr/tesseract/blob/master/README.md

-Introduction
-============
+#About

-This package contains the Tesseract Open Source OCR Engine.
-Originally developed at Hewlett-Packard Laboratories Bristol and
-at Hewlett-Packard Co, Greeley Colorado, all the code
-in this distribution is now licensed under the Apache License:
+This package contains an OCR engine - `libtesseract` and a command line program - `tesseract`.

-    Licensed under the Apache License, Version 2.0 (the "License");
+The lead developer is Ray Smith. The maintainer is Zdenko Podobny. 
+For a list of contributors, see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS) and GitHub's log of [contributors](https://github.com/tesseract-ocr/tesseract/graphs/contributors).
+
+Tesseract has Unicode (UTF-8) support, and can recognize more than 100
+languages "out of the box". It can be trained to recognize other languages. See [Tesseract Training](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) for more information. 
+
+Tesseract supports various output formats: plain text, hOCR (HTML), and PDF.
+
+This project does not include a GUI application. If you need one, please see the [3rdParty](https://github.com/tesseract-ocr/tesseract/wiki/3rdParty) wiki page.
+
+You should note that in many cases, in order to get better OCR results, you'll need to [improve the quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) of the image you are giving Tesseract.
+
+The latest stable version is 3.04, released in July 2015.
+
+#Brief history
+
+Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and
+at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
+more changes made in 1996 to port to Windows, and some C++izing in 1998.
+
+In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.
+
+[Release Notes](https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes)
+
+#For developers
+
+Developers can use the `libtesseract` [C](https://github.com/tesseract-ocr/tesseract/blob/master/api/capi.h) or [C++](https://github.com/tesseract-ocr/tesseract/blob/master/api/baseapi.h) API to build their own applications. If you need bindings to `libtesseract` for other programming languages, please see the [wrapper](https://github.com/tesseract-ocr/tesseract/wiki/AddOns#tesseract-wrappers) section on the AddOns wiki page.
+
+Documentation of Tesseract, generated from the source code by Doxygen, can be found on [tesseract-ocr.github.io](http://tesseract-ocr.github.io/).
+
+#License
+
+    The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

@@ -26,135 +53,24 @@ in this distribution is now licensed under the Apache License:
    See the License for the specific language governing permissions and
    limitations under the License.

+**NOTE**: This software depends on other packages that may be licensed under different open source licenses.

-Dependencies and Licenses
-=========================
+#Installing Tesseract

-[Leptonica](http://www.leptonica.com) is required. Tesseract no longer 
-compiles without Leptonica.
+You can either [install Tesseract via a pre-built binary package](https://github.com/tesseract-ocr/tesseract/wiki) or [build it from source](https://github.com/tesseract-ocr/tesseract/wiki/Compiling).

-Libtiff is no longer required as a direct dependency.
+#Running Tesseract

-
-Installing and Running Tesseract
--------------------------------
-
-All Users Do NOT Ignore!
-
-The tarballs are split into pieces.
-
-tesseract-x.xx.tar.gz contains all the source code.
-
-tesseract-x.xx.`<lang>`.tar.gz contains the language data files for `<lang>`.
-You need at least one of these or Tesseract will not work.
-
-Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory.
-tesseract-x.xx.`<lang>`.tar.gz unpacks to the tessdata directory which 
-belongs inside your tesseract-ocr directory. It is therefore best to 
-download them into your tesseract-x.xx directory, so you can use unpack 
-here or equivalent. You can unpack as many of the language packs as you 
-care to, as they all contain different files. Note that if you are using
-make install you should unpack your language data to your source tree 
-before you run make install. If you unpack them as root to the 
-destination directory of make install, then the user ids and access
-permissions might be messed up.
-
-boxtiff-2.xx.`<lang>`.tar.gz contains data that was used in training for 
-those that want to do their own training. Most users should NOT download
-these files.
-
-Instructions for using the training tools are documented separately at 
-[Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract)
-
-
-Windows
-------
-
-Please use the installer (for 3.00 and above). Tesseract is a library with a 
-command line interface. If you need a GUI, please check the [3rdParty wiki page](https://github.com/tesseract-ocr/tesseract/wiki/3rdParty#gui).
-
-If you are building from the sources, the recommended build platform is 
-VC++ Express 2010.
-
-The executables are built with static linking, so they stand more chance
-of working out of the box on more Windows systems.
-
-The executable must reside in the same directory as the tessdata 
-directory or you need to set up environment variable TESSDATA_PREFIX.
-Installer will set it up for you.
-
-The command line is:
+Basic command line usage:

    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]

-If you need interface to other applications, please check wrapper section
-on [AddOns wiki page](https://github.com/tesseract-ocr/tesseract/wiki/AddOns#for-tesseract-ocr-30x).
+For more information about the various command line options, use `tesseract --help` or `man tesseract`.

+#Support

-Non-Windows (or Cygwin)
-----------------------
+Mailing-lists:
+* [tesseract-ocr](https://groups.google.com/d/forum/tesseract-ocr) - For tesseract users. 
+* [tesseract-dev](https://groups.google.com/d/forum/tesseract-dev) - For tesseract developers. 

-You have to tell Tesseract through a standard unix mechanism where to 
-find its data directory. You must either:
-
-    ./autogen.sh
-    ./configure
-    make
-    sudo make install
-    sudo ldconfig
-
-to move the data files to the standard place, or:
-
-    export TESSDATA_PREFIX="directory in which your tessdata resides/"
-
-In either case the command line is:
-
-    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
-
-New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for
-the help.) It might work with your OS if you know how to do that.
-
-If you are linking to the libraries, as Ocropus does, please link to
-libtesseract_api.
-
-
-If you get `leptonica not found` and you've installed it with e.g. homebrew, you
-can run `CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure`
-instead of `./configure` above.
-
-
-History
-=======
-The engine was developed at Hewlett-Packard Laboratories Bristol and
-at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
-more changes made in 1996 to port to Windows, and some C++izing in 1998.
-A lot of the code was written in C, and then some more was written in C++.
-Since then all the code has been converted to at least compile with a C++
-compiler. Currently it builds under Linux with gcc 4.4.3 and under Windows
-with VC++2010. The C++ code makes heavy use of a list system using macros.
-This predates stl, was portable before stl, and is more efficient than stl
-lists, but has the big negative that if you do get a segmentation violation,
-it is hard to debug.
-
-The most recent change is that Tesseract can now recognize 39 languages,
-including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants, 
-is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
-more information on training.
-
-Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. 
-Results were available on https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf.
-With Tesseract 2.00, scripts were included to allow anyone to reproduce 
-some of these tests. See TestingTesseract for more details. 
-
-
-About the Engine
-================
-This code is a raw OCR engine. It has limited PAGE LAYOUT ANALYSIS, simple
-OUTPUT FORMATTING (txt, hocr/html), and NO UI. 
-Having said that, in 1995, this engine was in the top 3 in terms of character
-accuracy, and it compiles and runs on both Linux and Windows.
-As of 3.01, Tesseract is fully unicode (UTF-8) enabled, and can recognize 39
-languages "out of the box." Code and documentation is provided for the brave
-to train in other languages. 
-See [Tesseract Training wiki](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) 
-for more information on training. Additional [code and extracted documentation](http://tesseract-ocr.github.io/) was generated by Doxygen.
+Please read the [FAQ](https://github.com/tesseract-ocr/tesseract/wiki/FAQ) before asking a question on the mailing lists or reporting an issue.
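
The command line synopsis added in this diff (`tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]`) can be illustrated with a small Python sketch that assembles the argument list in that order. This is only an illustration and not part of the commit: the helper name `build_tesseract_command` and the file names are hypothetical, and the single-dash `-psm` spelling is taken from the README text as it appears in this diff.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    imagename: str,
    outputbase: str,
    lang: Optional[str] = None,
    psm: Optional[int] = None,
    configfiles: Sequence[str] = (),
) -> List[str]:
    """Build argv following the README synopsis:

    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
    """
    # Required positional arguments come first.
    cmd = ["tesseract", imagename, outputbase]
    # Optional language selection.
    if lang is not None:
        cmd += ["-l", lang]
    # Optional page segmentation mode (flag spelling as given in the README).
    if psm is not None:
        cmd += ["-psm", str(psm)]
    # Any config file names go last.
    cmd += list(configfiles)
    return cmd


if __name__ == "__main__":
    # "page.png" and "page" are hypothetical input and output names.
    print(build_tesseract_command("page.png", "page", lang="eng"))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would invoke the program, assuming `tesseract` is installed and on `PATH`. Building the list separately keeps the argument order easy to inspect before anything is executed.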