Minor edits to Readme

Robert Theis 2015-05-21 19:23:42 -07:00
parent f8ebff262e
commit a36a5f96d0

@@ -1,32 +1,35 @@
Note that this is a text-only and possibly out-of-date version of the Note that this is a text-only and possibly out-of-date version of the
wiki ReadMe, which is located at: wiki ReadMe, which is located at:
https://github.com/tesseract-ocr/tesseract/blob/master/README https://github.com/tesseract-ocr/tesseract/blob/master/README.md
Introduction Introduction
============ ============
This package contains the Tesseract Open Source OCR Engine. This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and Originally developed at Hewlett-Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, all the code at Hewlett-Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License: in this distribution is now licensed under the Apache License:
* Licensed under the Apache License, Version 2.0 (the "License"); Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License. you may not use this file except in compliance with the License.
* You may obtain a copy of the License at You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software http://www.apache.org/licenses/LICENSE-2.0
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. Unless required by applicable law or agreed to in writing, software
* See the License for the specific language governing permissions and distributed under the License is distributed on an "AS IS" BASIS,
* limitations under the License. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Dependencies and Licenses Dependencies and Licenses
========================= =========================
Leptonica is required. (www.leptonica.com). Tesseract no longer compiles [Leptonica](http://www.leptonica.com) is required. Tesseract no longer
without Leptonica. compiles without Leptonica.
Libtiff is no longer required as a direct dependency. Libtiff is no longer required as a direct dependency.
@@ -34,15 +37,16 @@ Installing and Running Tesseract
-------------------------------- --------------------------------
All Users Do NOT Ignore! All Users Do NOT Ignore!
The tarballs are split into pieces. The tarballs are split into pieces.
tesseract-x.xx.tar.gz contains all the source code. tesseract-x.xx.tar.gz contains all the source code.
tesseract-x.xx.<lang>.tar.gz contains the language data files for <lang>. tesseract-x.xx.`<lang>`.tar.gz contains the language data files for `<lang>`.
You need at least one of these or Tesseract will not work. You need at least one of these or Tesseract will not work.
Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory. Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory.
tesseract-x.xx.<lang>.tar.gz unpacks to the tessdata directory which tesseract-x.xx.`<lang>`.tar.gz unpacks to the tessdata directory which
belongs inside your tesseract-ocr directory. It is therefore best to belongs inside your tesseract-ocr directory. It is therefore best to
download them into your tesseract-x.xx directory, so you can use unpack download them into your tesseract-x.xx directory, so you can use unpack
here or equivalent. You can unpack as many of the language packs as you here or equivalent. You can unpack as many of the language packs as you
@@ -52,7 +56,7 @@ before you run make install. If you unpack them as root to the
destination directory of make install, then the user ids and access destination directory of make install, then the user ids and access
permissions might be messed up. permissions might be messed up.
boxtiff-2.xx.<lang>.tar.gz contains data that was used in training for boxtiff-2.xx.`<lang>`.tar.gz contains data that was used in training for
those that want to do their own training. Most users should NOT download those that want to do their own training. Most users should NOT download
these files. these files.
@@ -63,8 +67,8 @@ Tesseract wiki https://github.com/tesseract-ocr/tesseract/wiki
Windows Windows
------- -------
Please use installer (for 3.00 and above). Tesseract is library with Please use the installer (for 3.00 and above). Tesseract is a library with a
command line interface. If you need GUI, please check AddOns wiki page command line interface. If you need a GUI, please check the AddOns wiki page.
TODO-UPDATE-WIKI-LINKS TODO-UPDATE-WIKI-LINKS
@@ -74,7 +78,7 @@ If you are building from the sources, the recommended build platform is
VC++ Express 2008 (optionally 2010). VC++ Express 2008 (optionally 2010).
The executables are built with static linking, so they stand more chance The executables are built with static linking, so they stand more chance
of working out of the box on more windows systems. of working out of the box on more Windows systems.
The executable must reside in the same directory as the tessdata The executable must reside in the same directory as the tessdata
directory or you need to set up environment variable TESSDATA_PREFIX. directory or you need to set up environment variable TESSDATA_PREFIX.
@@ -82,7 +86,7 @@ Installer will set it up for you.
The command line is: The command line is:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...] tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
If you need interface to other applications, please check wrapper section If you need interface to other applications, please check wrapper section
on AddOns wiki page: on AddOns wiki page:
@@ -98,19 +102,19 @@ Non-Windows (or Cygwin)
You have to tell Tesseract through a standard unix mechanism where to You have to tell Tesseract through a standard unix mechanism where to
find its data directory. You must either: find its data directory. You must either:
./autogen.sh ./autogen.sh
./configure ./configure
make make
make install make install
sudo ldconfig sudo ldconfig
to move the data files to the standard place, or: to move the data files to the standard place, or:
export TESSDATA_PREFIX="directory in which your tessdata resides/" export TESSDATA_PREFIX="directory in which your tessdata resides/"
In either case the command line is: In either case the command line is:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...] tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for
the help.) It might work with your OS if you know how to do that. the help.) It might work with your OS if you know how to do that.
@@ -126,8 +130,8 @@ instead of `./configure` above.
History History
======= =======
The engine was developed at Hewlett Packard Laboratories Bristol and The engine was developed at Hewlett-Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998. more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++. A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++ Since then all the code has been converted to at least compile with a C++
@@ -138,7 +142,7 @@ lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug. it is hard to debug.
The most recent change is that Tesseract can now recognize 39 languages, The most recent change is that Tesseract can now recognize 39 languages,
including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants,
is fully UTF8 capable, and is fully trainable. See TrainingTesseract for is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
more information on training. more information on training.