Merge pull request #1950 from stweil/manpage

Merge and enhance documentation on language and script models
zdenop 2018-10-05 18:09:31 +02:00 committed by GitHub
commit 2cb609d202
GPG Key ID: 4AEE18F83AFDEB23


@@ -130,12 +130,28 @@ SINGLE OPTIONS
-LANGUAGES
----------
+LANGUAGES AND SCRIPTS
+---------------------
-The currently available traineddata files for tesseract 4.0
-for the following languages are in
-https://github.com/tesseract-ocr/tessdata_fast:
+To recognize some text with Tesseract, it is normally necessary to specify
+the language(s) or script of the text (unless it is English text which is
+supported by default) using `-l lang`.
+Selecting a language automatically also selects the language specific
+character set and dictionary (word list).
+Selecting a script typically selects all characters of that script
+which can be from different languages. The dictionary which is included
+also contains a mix from different languages.
+In most cases, a script also supports English.
+So it is possible to recognize a language that has not been specifically
+trained for by using traineddata for the script it is written in.
+https://github.com/tesseract-ocr/tessdata_fast provides fast language and
+script models which are also part of Linux distributions.
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following languages:
 *afr* (Afrikaans),
 *amh* (Amharic),
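Illustration for the `-l lang` text in this hunk (not part of the commit): a minimal Python sketch of how a caller might assemble the command line for a language or script model. The helper name `build_tesseract_args`, the file names `page.png` and `out`, and the model name `script/Arabic` are assumptions for illustration. In particular, `script/Arabic` assumes the script models are installed in a `script/` subdirectory of the tessdata directory, which may differ on a given system.

```python
import shlex


def build_tesseract_args(image, output_base, models=None):
    """Return an argv list for the tesseract command line tool.

    models is an optional list of language or script model names
    (hypothetical examples: ["deu"] or ["script/Arabic"]).
    When it is empty or None, no -l option is added and Tesseract
    falls back to its default (English).
    """
    args = ["tesseract", image, output_base]
    if models:
        # Several models can be combined in one -l option with '+'.
        args += ["-l", "+".join(models)]
    return args


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # on machines without Tesseract installed.
    print(shlex.join(build_tesseract_args("page.png", "out", ["afr"])))
    print(shlex.join(build_tesseract_args("page.png", "out", ["script/Arabic"])))
```

To actually run OCR, the returned list could be passed to `subprocess.run`, provided the named traineddata files are installed.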
@@ -260,15 +276,8 @@ To use a non-standard language pack named *foo.traineddata*, set the
 *TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
 argument `-l foo`.
-SCRIPTS
--------
-The traineddata files for the following scripts for tesseract 4.0
-are also in https://github.com/tesseract-ocr/tessdata_fast.
-In most cases, each of these contains all the languages that use that script PLUS English.
-So it is possible to recognize a language that has not been specifically trained for
-by using traineddata for the script it is written in.
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following scripts:
 Arabic,
 Armenian,
@@ -308,6 +317,18 @@ Thai,
 Tibetan,
 Vietnamese.
+The same languages and scripts are available from
+https://github.com/tesseract-ocr/tessdata_best.
+`tessdata_best` provides slow language and script models.
+These models are needed for training. They also can give better OCR results,
+but the recognition takes much more time.
+Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
+There is a third repository, https://github.com/tesseract-ocr/tessdata,
+with models which support both the Tesseract 3 legacy OCR engine and the
+Tesseract 4 LSTM OCR engine.
 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
@@ -377,26 +398,29 @@ scripts are now included to allow anyone to reproduce some of these tests.
 See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
 details.
-Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
-and Korean. It also introduces a new, single-file based system of managing
+Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
+and Korean. It also introduced a new, single-file based system of managing
 language data.
-Tesseract 3.02 adds BiDirectional text support, the ability to recognize
+Tesseract 3.02 added BiDirectional text support, the ability to recognize
 multiple languages in a single image, and improved layout analysis.
 Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
 on line recognition, but also still supports the legacy Tesseract OCR engine of
 Tesseract 3 which works by recognizing character patterns. Compatibility with
-Tesseract 3 is enabled by `--oem 0`. It also needs traineddata files which
-support the legacy engine, for example those from the tessdata repository.
+Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
+support the legacy engine, for example those from the tessdata repository
+(https://github.com/tesseract-ocr/tessdata).
-For further details, see the file ReleaseNotes in the Tesseract wiki
+For further details, see the release notes in the Tesseract wiki
 (<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>).
 RESOURCES
 ---------
 Main web site: <https://github.com/tesseract-ocr> +
+User forum: <http://groups.google.com/group/tesseract-ocr> +
+Wiki: <https://github.com/tesseract-ocr/tesseract/wiki> +
 Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
 SEE ALSO
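Illustration for the engine notes in this diff (not part of the commit): the documented text says that `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine, that `tessdata` supports both engines, and that `--oem 0` selects the legacy engine. A minimal Python sketch of that compatibility rule follows. The helper `oem_args` and the two lookup tables are hypothetical names, and the use of `--oem 1` for the LSTM engine is taken from the Tesseract 4 command line help rather than from the text of this diff.

```python
# Engines supported by each model repository, as described in the
# man page text changed by this commit.
SUPPORTED_ENGINES = {
    "tessdata_fast": {"lstm"},
    "tessdata_best": {"lstm"},
    "tessdata": {"legacy", "lstm"},
}

# OCR engine mode values for the --oem option.
# 0 is named in the man page text; 1 (LSTM only) is an addition here.
OEM_VALUE = {"legacy": "0", "lstm": "1"}


def oem_args(repository, engine):
    """Return the --oem arguments for an engine, or raise ValueError
    if models from the given repository do not support that engine."""
    if repository not in SUPPORTED_ENGINES:
        raise ValueError(f"unknown model repository: {repository}")
    if engine not in SUPPORTED_ENGINES[repository]:
        raise ValueError(
            f"{repository} models do not support the {engine} engine"
        )
    return ["--oem", OEM_VALUE[engine]]


if __name__ == "__main__":
    print(oem_args("tessdata", "legacy"))
    try:
        oem_args("tessdata_fast", "legacy")
    except ValueError as err:
        print(f"rejected: {err}")
```

Checking before invoking Tesseract gives a clearer error than letting the engine fail to load a model that lacks legacy data.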