Merge and enhance documentation on language and script models

Also add links to the user forum and to the Wiki, and update the
history text.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Stefan Weil 2018-10-05 16:55:09 +02:00
parent 551abb2114
commit 3315931859

@@ -130,12 +130,28 @@ SINGLE OPTIONS
LANGUAGES
---------
LANGUAGES AND SCRIPTS
---------------------
The currently available traineddata files for tesseract 4.0
for the following languages are in
https://github.com/tesseract-ocr/tessdata_fast:
To recognize some text with Tesseract, it is normally necessary to specify
the language(s) or script of the text (unless it is English text which is
supported by default) using `-l lang`.
Selecting a language automatically also selects the language specific
character set and dictionary (word list).
Selecting a script typically selects all characters of that script
which can be from different languages. The dictionary which is included
also contains a mix from different languages.
In most cases, a script also supports English.
So it is possible to recognize a language that has not been specifically
trained for by using traineddata for the script it is written in.
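
As a rough illustration of the `-l` selection described above, the following sketch only builds an argument list for the `tesseract` command line and does not run it. The helper name `build_tesseract_command`, the file names, and the model name `script/Latin` are assumptions made for this example; check the names of the traineddata files actually installed on your system.

```python
from typing import List, Sequence


def build_tesseract_command(image: str, output_base: str,
                            models: Sequence[str] = ()) -> List[str]:
    """Return an argument list for the tesseract command line.

    `models` holds language codes (e.g. "deu") or script model names.
    If it is empty, no `-l` option is added and Tesseract uses English.
    """
    command = ["tesseract", image, output_base]
    if models:
        # Several models can be combined with "+", e.g. "-l eng+deu".
        command += ["-l", "+".join(models)]
    return command


# English by default, one language, and a script model (name assumed).
print(build_tesseract_command("page.png", "page"))
print(build_tesseract_command("page.png", "page", ["deu"]))
print(build_tesseract_command("page.png", "page", ["script/Latin"]))
```

The returned list could be passed to `subprocess.run` on a system where Tesseract and the chosen traineddata files are installed.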
https://github.com/tesseract-ocr/tessdata_fast provides fast language and
script models which are also part of Linux distributions.
For Tesseract 4, `tessdata_fast` includes traineddata files for the
following languages:
*afr* (Afrikaans),
*amh* (Amharic),
@@ -260,15 +276,8 @@ To use a non-standard language pack named *foo.traineddata*, set the
*TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
argument `-l foo`.
SCRIPTS
-------
The traineddata files for the following scripts for tesseract 4.0
are also in https://github.com/tesseract-ocr/tessdata_fast.
In most cases, each of these contains all the languages that use that script PLUS English.
So it is possible to recognize a language that has not been specifically trained for
by using traineddata for the script it is written in.
For Tesseract 4, `tessdata_fast` includes traineddata files for the
following scripts:
Arabic,
Armenian,
@@ -308,6 +317,18 @@ Thai,
Tibetan,
Vietnamese.
The same languages and scripts are available from
https://github.com/tesseract-ocr/tessdata_best.
`tessdata_best` provides slow language and script models.
These models are needed for training. They also can give better OCR results,
but the recognition takes much more time.
Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
There is a third repository, https://github.com/tesseract-ocr/tessdata,
with models which support both the Tesseract 3 legacy OCR engine and the
Tesseract 4 LSTM OCR engine.
CONFIG FILES AND AUGMENTING WITH USER DATA
------------------------------------------
@@ -377,26 +398,29 @@ scripts are now included to allow anyone to reproduce some of these tests.
See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
details.
Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing
Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
and Korean. It also introduced a new, single-file based system of managing
language data.
Tesseract 3.02 adds BiDirectional text support, the ability to recognize
Tesseract 3.02 added BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis.
Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
on line recognition, but also still supports the legacy Tesseract OCR engine of
Tesseract 3 which works by recognizing character patterns. Compatibility with
Tesseract 3 is enabled by `--oem 0`. It also needs traineddata files which
support the legacy engine, for example those from the tessdata repository.
Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
support the legacy engine, for example those from the tessdata repository
(https://github.com/tesseract-ocr/tessdata).
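
The dependency between `--oem 0` and the model repository can be sketched as a small check. This is only an illustration under stated assumptions: the table mirrors what this text says about the three repositories, and the helper name `legacy_engine_args` is hypothetical, not part of Tesseract.

```python
from typing import List

# OCR engines supported by each model repository, as described in this text.
SUPPORTED_ENGINES = {
    "tessdata_fast": {"lstm"},
    "tessdata_best": {"lstm"},
    "tessdata": {"legacy", "lstm"},
}


def legacy_engine_args(repository: str) -> List[str]:
    """Return the options that select the legacy engine (`--oem 0`).

    Raises ValueError if models from `repository` do not support it.
    """
    engines = SUPPORTED_ENGINES.get(repository, set())
    if "legacy" not in engines:
        raise ValueError(
            f"models from {repository!r} do not support the legacy engine")
    return ["--oem", "0"]


print(legacy_engine_args("tessdata"))
```

Calling `legacy_engine_args("tessdata_fast")` raises an error in this sketch, which reflects that the fast and best models only support the LSTM engine.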
For further details, see the file ReleaseNotes in the Tesseract wiki
For further details, see the release notes in the Tesseract wiki
(<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>).
RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
User forum: <http://groups.google.com/group/tesseract-ocr> +
Wiki: <https://github.com/tesseract-ocr/tesseract/wiki> +
Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
SEE ALSO