Merge and enhance documentation on language and script models

Also add links to the user forum and to the Wiki and update the history text.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2025-01-18 06:30:14 +08:00 · 2018-10-05 16:55:09 +02:00
commit 3315931859
parent 551abb2114
1 changed file with 44 additions and 20 deletions
--- a/doc/tesseract.1.asc
+++ b/doc/tesseract.1.asc
@@ -130,12 +130,28 @@ SINGLE OPTIONS



-LANGUAGES
---------
+LANGUAGES AND SCRIPTS
+---------------------

-The currently available traineddata files for tesseract 4.0
-for the following languages are in
-https://github.com/tesseract-ocr/tessdata_fast:
+To recognize some text with Tesseract, it is normally necessary to specify
+the language(s) or script of the text (unless it is English text which is
+supported by default) using `-l lang`.
+
+Selecting a language automatically also selects the language-specific
+character set and dictionary (word list).
+
+Selecting a script typically selects all characters of that script
+which can be from different languages. The dictionary which is included
+also contains a mix from different languages.
+In most cases, a script also supports English.
+So it is possible to recognize a language that has not been specifically
+trained for by using traineddata for the script it is written in.
+
+https://github.com/tesseract-ocr/tessdata_fast provides fast language and
+script models which are also part of Linux distributions.
+
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following languages:

 *afr* (Afrikaans),
 *amh* (Amharic),
@@ -260,15 +276,8 @@ To use a non-standard language pack named *foo.traineddata*, set the
 *TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
 argument `-l foo`.

-SCRIPTS
-------
-
-The traineddata files for the following scripts for tesseract 4.0
-are also in https://github.com/tesseract-ocr/tessdata_fast.
-
-In most cases, each of these contains all the languages that use that script PLUS English.
-So it is possible to recognize a language that has not been specifically trained for
-by using traineddata for the script it is written in.
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following scripts:

 Arabic,
 Armenian,
@@ -308,6 +317,18 @@ Thai,
 Tibetan,
 Vietnamese.

+The same languages and scripts are available from
+https://github.com/tesseract-ocr/tessdata_best.
+`tessdata_best` provides slow language and script models.
+These models are needed for training. They can also give better OCR results,
+but the recognition takes much more time.
+
+Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
+
+There is a third repository, https://github.com/tesseract-ocr/tessdata,
+with models which support both the Tesseract 3 legacy OCR engine and the
+Tesseract 4 LSTM OCR engine.
+

 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
@@ -377,26 +398,29 @@ scripts are now included to allow anyone to reproduce some of these tests.
 See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
 details.

-Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
-and Korean. It also introduces a new, single-file based system of managing
+Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
+and Korean. It also introduced a new, single-file based system of managing
 language data.

-Tesseract 3.02 adds BiDirectional text support, the ability to recognize
+Tesseract 3.02 added BiDirectional text support, the ability to recognize
 multiple languages in a single image, and improved layout analysis.

 Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
 on line recognition, but also still supports the legacy Tesseract OCR engine of
 Tesseract 3 which works by recognizing character patterns. Compatibility with
-Tesseract 3 is enabled by `--oem 0`. It also needs traineddata files which
-support the legacy engine, for example those from the tessdata repository.
+Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
+support the legacy engine, for example those from the tessdata repository
+(https://github.com/tesseract-ocr/tessdata).

-For further details, see the file ReleaseNotes in the Tesseract wiki
+For further details, see the release notes in the Tesseract wiki
 (<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>).


 RESOURCES
 ---------
 Main web site: <https://github.com/tesseract-ocr> +
+User forum: <http://groups.google.com/group/tesseract-ocr> +
+Wiki: <https://github.com/tesseract-ocr/tesseract/wiki> +
 Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>

 SEE ALSO
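
The language and engine selection described in the changed text can be illustrated with a small sketch. It only assembles a command line from the options named in the man page (`-l lang`, `--oem 0`) plus `--tessdata-dir`. The helper name `build_tesseract_command`, the file names, and the `script/Arabic` model path are assumptions for illustration; whether a script model is addressed as `script/Arabic` or just `Arabic` depends on how the traineddata files are installed.

```python
import shlex


def build_tesseract_command(image, output_base, languages=None,
                            legacy_engine=False, tessdata_dir=None):
    """Return an argv list for the tesseract CLI (hypothetical helper).

    languages     -- list of language or script model names; several
                     names are joined with '+' for the -l option.
    legacy_engine -- if True, request the Tesseract 3 legacy engine with
                     --oem 0 (this needs traineddata that supports it,
                     for example from the tessdata repository).
    tessdata_dir  -- optional directory holding the traineddata files.
    """
    cmd = ["tesseract", image, output_base]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    if languages:
        # Without -l, Tesseract falls back to English.
        cmd += ["-l", "+".join(languages)]
    if legacy_engine:
        cmd += ["--oem", "0"]
    return cmd


if __name__ == "__main__":
    # Two language models from the list in the man page, combined.
    print(shlex.join(build_tesseract_command("page.png", "page", ["afr", "amh"])))
    # A script model (the path layout is an assumption, see above).
    print(shlex.join(build_tesseract_command("page.png", "page", ["script/Arabic"])))
    # Legacy engine with models from the tessdata repository.
    print(shlex.join(build_tesseract_command(
        "page.png", "page", ["eng"], legacy_engine=True,
        tessdata_dir="/path/to/tessdata")))
```

The returned list can be passed to `subprocess.run` on a system where Tesseract is installed. Building a list rather than a single string avoids shell quoting problems with file names.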