Mirror of https://github.com/tesseract-ocr/tesseract.git (synced 2025-01-18 06:30:14 +08:00)
Merge pull request #1950 from stweil/manpage
Merge and enhance documentation on language and script models
commit 2cb609d202
@@ -130,12 +130,28 @@ SINGLE OPTIONS
 
 
 
-LANGUAGES
----------
+LANGUAGES AND SCRIPTS
+---------------------
 
-The currently available traineddata files for tesseract 4.0
-for the following languages are in
-https://github.com/tesseract-ocr/tessdata_fast:
+To recognize some text with Tesseract, it is normally necessary to specify
+the language(s) or script of the text (unless it is English text which is
+supported by default) using `-l lang`.
+
+Selecting a language automatically also selects the language specific
+character set and dictionary (word list).
+
+Selecting a script typically selects all characters of that script
+which can be from different languages. The dictionary which is included
+also contains a mix from different languages.
+In most cases, a script also supports English.
+So it is possible to recognize a language that has not been specifically
+trained for by using traineddata for the script it is written in.
+
+https://github.com/tesseract-ocr/tessdata_fast provides fast language and
+script models which are also part of Linux distributions.
+
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following languages:
 
 *afr* (Afrikaans),
 *amh* (Amharic),
@@ -260,15 +276,8 @@ To use a non-standard language pack named *foo.traineddata*, set the
 *TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
 argument `-l foo`.
 
-SCRIPTS
--------
-
-The traineddata files for the following scripts for tesseract 4.0
-are also in https://github.com/tesseract-ocr/tessdata_fast.
-
-In most cases, each of these contains all the languages that use that script PLUS English.
-So it is possible to recognize a language that has not been specifically trained for
-by using traineddata for the script it is written in.
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following scripts:
 
 Arabic,
 Armenian,
@@ -308,6 +317,18 @@ Thai,
 Tibetan,
 Vietnamese.
 
+The same languages and scripts are available from
+https://github.com/tesseract-ocr/tessdata_best.
+`tessdata_best` provides slow language and script models.
+These models are needed for training. They also can give better OCR results,
+but the recognition takes much more time.
+
+Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
+
+There is a third repository, https://github.com/tesseract-ocr/tessdata,
+with models which support both the Tesseract 3 legacy OCR engine and the
+Tesseract 4 LSTM OCR engine.
+
 
 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
@@ -377,26 +398,29 @@ scripts are now included to allow anyone to reproduce some of these tests.
 See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
 details.
 
-Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
-and Korean. It also introduces a new, single-file based system of managing
+Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
+and Korean. It also introduced a new, single-file based system of managing
 language data.
 
-Tesseract 3.02 adds BiDirectional text support, the ability to recognize
+Tesseract 3.02 added BiDirectional text support, the ability to recognize
 multiple languages in a single image, and improved layout analysis.
 
 Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
 on line recognition, but also still supports the legacy Tesseract OCR engine of
 Tesseract 3 which works by recognizing character patterns. Compatibility with
-Tesseract 3 is enabled by `--oem 0`. It also needs traineddata files which
-support the legacy engine, for example those from the tessdata repository.
+Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
+support the legacy engine, for example those from the tessdata repository
+(https://github.com/tesseract-ocr/tessdata).
 
-For further details, see the file ReleaseNotes in the Tesseract wiki
+For further details, see the release notes in the Tesseract wiki
 (<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>).
 
 
 RESOURCES
 ---------
 Main web site: <https://github.com/tesseract-ocr> +
 User forum: <http://groups.google.com/group/tesseract-ocr> +
 Wiki: <https://github.com/tesseract-ocr/tesseract/wiki> +
 Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
 
 SEE ALSO
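
Note on the options documented in this change: the following is a minimal sketch, not part of the commit, of how the `-l` and `--oem` options described above combine on a command line. The helper name `build_tesseract_command`, the file names, and the directory path are hypothetical. The `script/Latin` model name assumes the script models are installed in a `script/` subdirectory as laid out in `tessdata_fast`; your installation may differ. The function only builds the argument list and does not run Tesseract.

```python
from typing import List, Optional, Sequence


def build_tesseract_command(
    image: str,
    output_base: str,
    models: Sequence[str] = ("eng",),
    oem: Optional[int] = None,
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build (but do not run) an argument list for the tesseract CLI.

    Hypothetical helper for illustration only.
    """
    if not models:
        raise ValueError("at least one language or script model is required")
    cmd = ["tesseract", image, output_base]
    if tessdata_dir is not None:
        # Point Tesseract at a specific directory of traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    # Several models are joined with '+', for example "eng+deu".
    cmd += ["-l", "+".join(models)]
    if oem is not None:
        # --oem 0 selects the legacy engine; it needs traineddata files
        # that support it, such as those from the tessdata repository.
        cmd += ["--oem", str(oem)]
    return cmd


if __name__ == "__main__":
    # A single language model.
    print(build_tesseract_command("page.png", "out", ["deu"]))
    # A script model (assumed to be installed under script/).
    print(build_tesseract_command("page.png", "out", ["script/Latin"]))
    # Legacy engine with an assumed local checkout of the tessdata repository.
    print(build_tesseract_command("page.png", "out", ["eng"], oem=0,
                                  tessdata_dir="/opt/tessdata"))
```

The resulting list can be passed to `subprocess.run` once Tesseract and the matching traineddata files are installed. Keeping the arguments as a list avoids shell quoting issues with file names.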