Merge pull request #1950 from stweil/manpage

Merge and enhance documentation on language and script models
2025-01-18 06:30:14 +08:00 · 2018-10-05 18:09:31 +02:00 · 2018-10-05 18:09:31 +02:00 · 2cb609d202
commit 2cb609d202
parent 551abb2114 3315931859
1 changed file with 44 additions and 20 deletions
--- a/doc/tesseract.1.asc
+++ b/doc/tesseract.1.asc
@@ -130,12 +130,28 @@ SINGLE OPTIONS
-LANGUAGES
---------
-The currently available traineddata files for tesseract 4.0
-for the following languages are in
-https://github.com/tesseract-ocr/tessdata_fast:
+LANGUAGES AND SCRIPTS
+---------------------
+To recognize some text with Tesseract, it is normally necessary to specify
+the language(s) or script of the text (unless it is English text which is
+supported by default) using `-l lang`.
+Selecting a language automatically also selects the language specific
+character set and dictionary (word list).
+Selecting a script typically selects all characters of that script
+which can be from different languages. The dictionary which is included
+also contains a mix from different languages.
+In most cases, a script also supports English.
+So it is possible to recognize a language that has not been specifically
+trained for by using traineddata for the script it is written in.
+https://github.com/tesseract-ocr/tessdata_fast provides fast language and
+script models which are also part of Linux distributions.
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following languages:
 *afr* (Afrikaans),
 *amh* (Amharic),
@@ -260,15 +276,8 @@ To use a non-standard language pack named *foo.traineddata*, set the
 *TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
 argument `-l foo`.
-SCRIPTS
-------
-The traineddata files for the following scripts for tesseract 4.0
-are also in https://github.com/tesseract-ocr/tessdata_fast.
-In most cases, each of these contains all the languages that use that script PLUS English.
-So it is possible to recognize a language that has not been specifically trained for
-by using traineddata for the script it is written in.
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following scripts:
 Arabic,
 Armenian,
@@ -308,6 +317,18 @@ Thai,
 Tibetan,
 Vietnamese.
+The same languages and scripts are available from
+https://github.com/tesseract-ocr/tessdata_best.
+`tessdata_best` provides slow language and script models.
+These models are needed for training. They also can give better OCR results,
+but the recognition takes much more time.
+Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
+There is a third repository, https://github.com/tesseract-ocr/tessdata,
+with models which support both the Tesseract 3 legacy OCR engine and the
+Tesseract 4 LSTM OCR engine.
 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
@@ -377,26 +398,29 @@ scripts are now included to allow anyone to reproduce some of these tests.
 See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
 details.
-Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
-and Korean. It also introduces a new, single-file based system of managing
+Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
+and Korean. It also introduced a new, single-file based system of managing
 language data.
-Tesseract 3.02 adds BiDirectional text support, the ability to recognize
+Tesseract 3.02 added BiDirectional text support, the ability to recognize
 multiple languages in a single image, and improved layout analysis.
 Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
 on line recognition, but also still supports the legacy Tesseract OCR engine of
 Tesseract 3 which works by recognizing character patterns. Compatibility with
-Tesseract 3 is enabled by `--oem 0`. It also needs traineddata files which
-support the legacy engine, for example those from the tessdata repository.
+Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
+support the legacy engine, for example those from the tessdata repository
 (https://github.com/tesseract-ocr/tessdata).
-For further details, see the file ReleaseNotes in the Tesseract wiki
+For further details, see the release notes in the Tesseract wiki
 (<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>).
 RESOURCES
 ---------
 Main web site: <https://github.com/tesseract-ocr> +
 User forum: <http://groups.google.com/group/tesseract-ocr> +
 Wiki: <https://github.com/tesseract-ocr/tesseract/wiki> +
 Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
 SEE ALSO