Merge pull request #2331 from stweil/doc

Improve man page for tesseract and add Makefile rule for PDF
2024-12-14 16:49:30 +08:00 · 2019-03-16 10:26:16 +01:00
commit 0b72f4b722
parent 29389f7145 5f76a8495b
2 changed files with 190 additions and 156 deletions
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -33,8 +33,9 @@ EXTRA_DIST = $(man_MANS) Doxyfile
 .PHONY: html
 html: ${man_MANS:%=%.html}
+pdf: ${man_MANS:%=%.pdf}
-SUFFIXES = .asc .html
+SUFFIXES = .asc .html .pdf
 .asc:
 	-asciidoc -b docbook -d manpage -o - $< | \
@@ -43,6 +44,10 @@ SUFFIXES = .asc .html
 .asc.html:
 	asciidoc -b html5 -o $@ $<
+.asc.pdf:
+	asciidoc -b docbook -d manpage -o $*.dbk $<
+	docbook2pdf $*.dbk
 MAINTAINERCLEANFILES = $(man_MANS) Doxyfile
 endif
--- a/doc/tesseract.1.asc
+++ b/doc/tesseract.1.asc
@@ -8,7 +8,7 @@ tesseract - command-line OCR engine
 SYNOPSIS
 --------
-*tesseract* 'imagename'|'listname'|'stdin' 'outputbase'|'stdout' [options...] [configfile...]
+*tesseract* 'FILE' 'OUTPUTBASE' ['OPTIONS']... ['CONFIGFILE']...
 DESCRIPTION
 -----------
@@ -20,128 +20,139 @@ at Google since then.
 IN/OUT ARGUMENTS
 ----------------
-'imagename'::
+'FILE'::
-	The name of the input image.  Most image file formats (anything
+  The name of the input file.
-	readable by Leptonica) are supported.
+  This can either be an image file or a text file. +
+  Most image file formats (anything readable by Leptonica) are supported. +
+  A text file lists the names of all input images (one image name per line).
+  The results will be combined in a single file for each output file format
+  (txt, pdf, hocr, xml). +
+  If 'FILE' is `stdin` or `-` then the standard input is used.
-'listname'::
+'OUTPUTBASE'::
-	The name of a text file which lists the names of all input images
+  The basename of the output file (to which the appropriate extension
-	(one image name per line). The results will be combined in a
+  will be appended).  By default the output will be a text file
-	single file for each output file format (txt, pdf, hocr).
+  with `.txt` added to the basename unless there are one or more
-
+  parameters set which explicitly specify the desired output. +
-'stdin'::
+  If 'OUTPUTBASE' is `stdout` or `-` then the standard output is used.
-	Instruction to read data from standard input.
-'outputbase'::
-	The basename of the output file (to which the appropriate extension
-	will be appended).  By default the output will be a text file
-	with `.txt` added to the basename unless there are one or more
-	parameters set which explicitly specify the desired output.
-'stdout'::
-	Instruction to send output data to standard output.
 [[TESSDATADIR]]
 OPTIONS
 -------
-'--tessdata-dir /path'::
+*-c* 'CONFIGVAR=VALUE'::
-	Specify the location of tessdata path.
+  Set value for parameter 'CONFIGVAR' to VALUE. Multiple *-c* arguments are allowed.
-'--user-words /path/to/file'::
+*--dpi* 'N'::
-	Specify the location of user words file.
+  Specify the resolution 'N' in DPI for the input image(s).
+  A typical value for 'N' is `300`. Without this option,
+  the resolution is read from the metadata included in the image.
+  If an image does not include that information, Tesseract tries to guess it.
-'--user-patterns /path/to/file'::
+*-l* 'LANG'::
-	Specify the location of user patterns file.
+*-l* 'SCRIPT'::
+  The language or script to use.
+  If none is specified, `eng` (English) is assumed.
+  Multiple languages may be specified, separated by plus characters.
+  Tesseract uses 3-character ISO 639-2 language codes
+  (see <<LANGUAGES,*LANGUAGES AND SCRIPTS*>>).
-'-c configvar=value'::
+*--psm* 'N'::
-	Set value for parameter 'configvar'. Multiple -c arguments are allowed.
+  Set Tesseract to only run a subset of layout analysis and assume
+  a certain form of image. The options for 'N' are:
-'-l lang'::
+  0 = Orientation and script detection (OSD) only.
-	The language to use. If none is specified, English is assumed.
+  1 = Automatic page segmentation with OSD.
-	Multiple languages may be specified, separated by plus characters.
+  2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
-	Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
+  3 = Fully automatic page segmentation, but no OSD. (Default)
+  4 = Assume a single column of text of variable sizes.
+  5 = Assume a single uniform block of vertically aligned text.
+  6 = Assume a single uniform block of text.
+  7 = Treat the image as a single text line.
+  8 = Treat the image as a single word.
+  9 = Treat the image as a single word in a circle.
+  10 = Treat the image as a single character.
+  11 = Sparse text. Find as much text as possible in no particular order.
+  12 = Sparse text with OSD.
+  13 = Raw line. Treat the image as a single text line,
+       bypassing hacks that are Tesseract-specific.
-'--psm N'::
+*--oem* 'N'::
-	Set Tesseract to only run a subset of layout analysis and assume
+  Specify OCR Engine mode. The options for 'N' are:
-	a certain form of image. The options for *N* are:
-	0 = Orientation and script detection (OSD) only.
+  0 = Original Tesseract only.
-	1 = Automatic page segmentation with OSD.
+  1 = Neural nets LSTM only.
-	2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
+  2 = Tesseract + LSTM.
-	3 = Fully automatic page segmentation, but no OSD. (Default)
+  3 = Default, based on what is available.
-	4 = Assume a single column of text of variable sizes.
-	5 = Assume a single uniform block of vertically aligned text.
-	6 = Assume a single uniform block of text.
-	7 = Treat the image as a single text line.
-	8 = Treat the image as a single word.
-	9 = Treat the image as a single word in a circle.
-	10 = Treat the image as a single character.
-'--oem N'::
+*--tessdata-dir* 'PATH'::
-	Specify OCR Engine mode. The options for *N* are:
+  Specify the location of tessdata path.
-	0 = Original Tesseract only.
+*--user-patterns* 'FILE'::
-	1 = Neural nets LSTM only.
+  Specify the location of user patterns file.
-	2 = Tesseract + LSTM.
-	3 = Default, based on what is available.
-'configfile'::
+*--user-words* 'FILE'::
-	The name of a config to use. The name can be a file in tessdata/configs
+  Specify the location of user words file.
-	or tessdata/tessconfigs, or an absolute or relative file path.
-	A config is a plain text file which contains a list of parameters and
-	their values, one per line, with a space separating parameter from value. +
-	Interesting config files include:
-	* `alto` - Output in ALTO format ('outputbase'`.xml`).
+[[CONFIGFILE]]
-	* `hocr` - Output in hOCR format ('outputbase'`.hocr`).
+'CONFIGFILE'::
-	* `pdf` - Output PDF ('outputbase'`.pdf`).
+  The name of a config to use. The name can be a file in `tessdata/configs`
-	* `tsv` - Output TSV ('outputbase'`.tsv`).
+  or `tessdata/tessconfigs`, or an absolute or relative file path.
-	* `txt` - Output plain text ('outputbase'`.txt`).
+  A config is a plain text file which contains a list of parameters and
-	* `get.images` - Write processed input images to file (`tessinput.tif`).
+  their values, one per line, with a space separating parameter from value. +
-	* `logfile` - Redirect debug messages to file (`tesseract.log`).
+  Interesting config files include:
-	* `lstm.train` - Output files used by LSTM training ('outputbase'`.lstmf`).
+
-	* `makebox` - Write box file ('outputbase'`.box`).
+  * *alto* -- Output in ALTO format ('OUTPUTBASE'`.xml`).
-	* `quiet` - Redirect debug messages to /dev/null.
+  * *hocr* -- Output in hOCR format ('OUTPUTBASE'`.hocr`).
+  * *pdf* -- Output PDF ('OUTPUTBASE'`.pdf`).
+  * *tsv* -- Output TSV ('OUTPUTBASE'`.tsv`).
+  * *txt* -- Output plain text ('OUTPUTBASE'`.txt`).
+  * *get.images* -- Write processed input images to file (`tessinput.tif`).
+  * *logfile* -- Redirect debug messages to file (`tesseract.log`).
+  * *lstm.train* -- Output files used by LSTM training ('OUTPUTBASE'`.lstmf`).
+  * *makebox* -- Write box file ('OUTPUTBASE'`.box`).
+  * *quiet* -- Redirect debug messages to '/dev/null'.
 It is possible to select several config files, for example
-`tesseract image.png demo hocr pdf txt` will create three output files
+`tesseract image.png demo alto hocr pdf txt` will create four output files
-`demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
+`demo.xml`, `demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
-*Nota Bene:*   The options `-l lang` and `--psm N` must occur
+*Nota bene:*   The options *-l* 'LANG', *-l* 'SCRIPT' and *--psm* 'N'
-before any 'configfile'.
+must occur before any 'CONFIGFILE'.
 SINGLE OPTIONS
 --------------
-'-h, --help'::
+*-h, --help*::
-	Show help message.
+  Show help message.
-'--help-extra'::
+*--help-extra*::
-	Show extra help for advanced users.
+  Show extra help for advanced users.
-'--help-psm'::
+*--help-psm*::
-	Show page segmentation modes.
+  Show page segmentation modes.
-'--help-oem'::
+*--help-oem*::
-	Show OCR Engine modes.
+  Show OCR Engine modes.
-'-v, --version'::
+*-v, --version*::
-	Returns the current version of the tesseract(1) executable.
+  Returns the current version of the tesseract(1) executable.
-'--list-langs'::
+*--list-langs*::
-	List available languages for tesseract engine. Can be used with `--tessdata-dir`.
+  List available languages for tesseract engine.
-
+  Can be used with *--tessdata-dir* 'PATH'.
-'--print-parameters'::
-	Print tesseract parameters.
+*--print-parameters*::
+  Print tesseract parameters.
 [[LANGUAGES]]
 LANGUAGES AND SCRIPTS
 ---------------------
 To recognize some text with Tesseract, it is normally necessary to specify
-the language(s) or script of the text (unless it is English text which is
+the language(s) or script(s) of the text (unless it is English text which is
-supported by default) using `-l lang`.
+supported by default) using *-l* 'LANG' or *-l* 'SCRIPT'.
 Selecting a language automatically also selects the language specific
 character set and dictionary (word list).
@@ -153,6 +164,9 @@ In most cases, a script also supports English.
 So it is possible to recognize a language that has not been specifically
 trained for by using traineddata for the script it is written in.
+More than one language or script may be specified by using `+`.
+Example: `tesseract myimage.png myimage -l eng+deu+fra`.
 https://github.com/tesseract-ocr/tessdata_fast provides fast language and
 script models which are also part of Linux distributions.
@@ -174,16 +188,16 @@ following languages:
 *cat* (Catalan; Valencian),
 *ceb* (Cebuano),
 *ces* (Czech),
-*chi_sim* (Chinese - Simplified),
+*chi_sim* (Chinese simplified),
-*chi_tra* (Chinese - Traditional),
+*chi_tra* (Chinese traditional),
 *chr* (Cherokee),
 *cym* (Welsh),
 *dan* (Danish),
 *deu* (German),
 *dzo* (Dzongkha),
-*ell* (Greek, Modern (1453-)),
+*ell* (Greek, Modern, 1453-),
 *eng* (English),
-*enm* (English, Middle (1100-1500)),
+*enm* (English, Middle, 1100-1500),
 *epo* (Esperanto),
 *equ* (Math / equation detection module),
 *est* (Estonian),
@@ -192,10 +206,10 @@ following languages:
 *fin* (Finnish),
 *fra* (French),
 *frk* (Frankish),
-*frm* (French, Middle (ca.1400-1600)),
+*frm* (French, Middle, ca.1400-1600),
 *gle* (Irish),
 *glg* (Galician),
-*grc* (Greek, Ancient (to 1453)),
+*grc* (Greek, Ancient, to 1453),
 *guj* (Gujarati),
 *hat* (Haitian; Haitian Creole),
 *heb* (Hebrew),
@@ -215,9 +229,9 @@ following languages:
 *kaz* (Kazakh),
 *khm* (Central Khmer),
 *kir* (Kirghiz; Kyrgyz),
+*kmr* (Kurdish Kurmanji),
 *kor* (Korean),
-*kor_vert* (Korean (vertical)),
+*kor_vert* (Korean vertical),
-*kmr* (Kurdish (Kurmanji)),
 *kur* (Kurdish),
 *lao* (Lao),
 *lat* (Latin),
@@ -235,7 +249,7 @@ following languages:
 *nep* (Nepali),
 *nld* (Dutch; Flemish),
 *nor* (Norwegian),
-*oci* (Occitan (post 1500)),
+*oci* (Occitan post 1500),
 *ori* (Oriya),
 *osd* (Orientation and script detection module),
 *pan* (Panjabi; Punjabi),
@@ -277,51 +291,51 @@ following languages:
 *yid* (Yiddish),
 *yor* (Yoruba)
-To use a non-standard language pack named *foo.traineddata*, set the
+To use a non-standard language pack named `foo.traineddata`, set the
-*TESSDATA_PREFIX* environment variable so the file can be found at
+`TESSDATA_PREFIX` environment variable so the file can be found at
-*TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
+`TESSDATA_PREFIX/tessdata/foo.traineddata` and give Tesseract the
-argument `-l foo`.
+argument *-l* `foo`.
 For Tesseract 4, `tessdata_fast` includes traineddata files for the
 following scripts:
-Arabic,
+*Arabic*,
-Armenian,
+*Armenian*,
-Bengali,
+*Bengali*,
-Canadian Aboriginal,
+*Canadian_Aboriginal*,
-Cherokee,
+*Cherokee*,
-Cyrillic,
+*Cyrillic*,
-Devanagari,
+*Devanagari*,
-Ethiopic,
+*Ethiopic*,
-Fraktur,
+*Fraktur*,
-Georgian,
+*Georgian*,
-Greek,
+*Greek*,
-Gujarati,
+*Gujarati*,
-Gurmukhi,
+*Gurmukhi*,
-Han - Simplified,
+*HanS* (Han simplified),
-Han - Simplified (vertical),
+*HanS_vert* (Han simplified, vertical),
-Han - Traditional,
+*HanT* (Han traditional),
-Han - Traditional (vertical),
+*HanT_vert* (Han traditional, vertical),
-Hangul,
+*Hangul*,
-Hangul (vertical),
+*Hangul_vert* (Hangul vertical),
-Hebrew,
+*Hebrew*,
-Japanese,
+*Japanese*,
-Japanese (vertical),
+*Japanese_vert* (Japanese vertical),
-Kannada,
+*Kannada*,
-Khmer,
+*Khmer*,
-Lao,
+*Lao*,
-Latin,
+*Latin*,
-Malayalam,
+*Malayalam*,
-Myanmar,
+*Myanmar*,
-Oriya (Odia),
+*Oriya* (Odia),
-Sinhala,
+*Sinhala*,
-Syriac,
+*Syriac*,
-Tamil,
+*Tamil*,
-Telugu,
+*Telugu*,
-Thaana,
+*Thaana*,
-Thai,
+*Thai*,
-Tibetan,
+*Tibetan*,
-Vietnamese.
+*Vietnamese*.
 The same languages and scripts are available from
 https://github.com/tesseract-ocr/tessdata_best.
@@ -343,8 +357,8 @@ Tesseract config files consist of lines with parameter-value pairs (space
 separated).  The parameters are documented as flags in the source code like
 the following one in tesseractclass.h:
-STRING_VAR_H(tessedit_char_blacklist, "",
+`STRING_VAR_H(tessedit_char_blacklist, "",
-             "Blacklist of chars not to recognize");
+             "Blacklist of chars not to recognize");`
 These parameters may enable or disable various features of the engine, and
 may cause it to load (or not load) various data.  For instance, let's suppose
@@ -352,10 +366,10 @@ you want to OCR in English, but suppress the normal dictionary and load an
 alternative word list and an alternative list of patterns -- these two files
 are the most commonly used extra data files.
-If your language pack is in /path/to/eng.traineddata  and the hocr config
+If your language pack is in '/path/to/eng.traineddata' and the hocr config
-is in /path/to/configs/hocr then create three new files:
+is in '/path/to/configs/hocr' then create three new files:
-/path/to/eng.user-words:
+'/path/to/eng.user-words':
 [verse]
 the
 quick
@@ -363,25 +377,39 @@ brown
 fox
 jumped
-/path/to/eng.user-patterns:
+'/path/to/eng.user-patterns':
 [verse]
 1-\d\d\d-GOOG-411
 www.\n\\\*.com
-/path/to/configs/bazaar:
+'/path/to/configs/bazaar':
 [verse]
 load_system_dawg     F
 load_freq_dawg       F
 user_words_suffix    user-words
 user_patterns_suffix user-patterns
-Now, if you pass the word 'bazaar' as a 'configfile' to Tesseract,
+Now, if you pass the word 'bazaar' as a <<CONFIGFILE,'CONFIGFILE'>> to
-Tesseract will not bother loading the system dictionary nor
+Tesseract, Tesseract will not bother loading the system dictionary nor
-the dictionary of frequent words and will load and use the eng.user-words
+the dictionary of frequent words and will load and use the 'eng.user-words'
-and eng.user-patterns files you provided.  The former is a simple word list,
+and 'eng.user-patterns' files you provided.  The former is a simple word list,
-one per line.  The format of the latter is documented in dict/trie.h
+one per line.  The format of the latter is documented in 'dict/trie.h'
-on read_pattern_list().
+on 'read_pattern_list()'.
+ENVIRONMENT VARIABLES
+---------------------
+*`TESSDATA_PREFIX`*::
+  If the `TESSDATA_PREFIX` is set to a path, then that path is used to
+  find the `tessdata` directory with language and script recognition
+  models and config files.
+  Using <<TESSDATADIR,*--tessdata-dir* 'PATH'>> is the recommended alternative.
+*`OMP_THREAD_LIMIT`*::
+  If the `tesseract` executable was built with multithreading support,
+  it will normally use four CPU cores for the OCR process. While this
+  can be faster for a single image, it gives bad performance if the host
+  computer provides less than four CPU cores or if OCR is made for many images.
+  Only a single CPU core is used with `OMP_THREAD_LIMIT=1`.
 HISTORY
@@ -391,7 +419,7 @@ Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
 changes made in 1996 to port to Windows, and some $$C++$$izing in 1998. A
 lot of the code was written in C, and then some more was written in $$C++$$.
 The $$C++$$ code makes heavy use of a list system using macros. This predates
-stl, was portable before stl, and is more efficient than stl lists, but has
+STL, was portable before STL, and is more efficient than STL lists, but has
 the big negative that if you do get a segmentation violation, it is hard to
 debug.
@@ -399,7 +427,8 @@ Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
 to train Tesseract.
 Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
-See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>. With Tesseract 2.00,
+See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>.
+Since Tesseract 2.00,
 scripts are now included to allow anyone to reproduce some of these tests.
 See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
 details.
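
The revised synopsis (`FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]...`) and the rule that *-l* and *--psm* must precede any config file can be illustrated with a small sketch. The helper names `build_tesseract_argv` and `expected_outputs` are hypothetical and not part of Tesseract; the extension table is copied from the config file list in the man page (note that `alto` maps to `.xml`). The sketch only builds the argument list and does not run `tesseract`; it also ignores `stdout` output and output formats selected with *-c* parameters.

```python
# Output-producing config files and their extensions, as listed in the man page.
OUTPUT_EXTENSIONS = {
    "alto": ".xml",
    "hocr": ".hocr",
    "pdf": ".pdf",
    "tsv": ".tsv",
    "txt": ".txt",
}


def build_tesseract_argv(file, outputbase, lang=None, psm=None, configfiles=()):
    """Build an argv list as FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]...

    Options are always emitted before config files, because the man page
    requires -l and --psm to occur before any CONFIGFILE.
    """
    argv = ["tesseract", file, outputbase]
    if lang is not None:
        argv += ["-l", lang]          # e.g. "eng" or "eng+deu+fra"
    if psm is not None:
        argv += ["--psm", str(psm)]   # page segmentation mode 0..13
    argv += list(configfiles)         # config files come last
    return argv


def expected_outputs(outputbase, configfiles=()):
    """Return the output file names implied by the given config files.

    Simplification (assumption): only the five config files in
    OUTPUT_EXTENSIONS select an output format; with none of them present,
    the default OUTPUTBASE.txt is assumed.
    """
    outputs = [
        outputbase + OUTPUT_EXTENSIONS[name]
        for name in configfiles
        if name in OUTPUT_EXTENSIONS
    ]
    return outputs or [outputbase + ".txt"]


if __name__ == "__main__":
    print(build_tesseract_argv("image.png", "demo", lang="eng+deu", psm=6,
                               configfiles=["hocr", "pdf"]))
    print(expected_outputs("demo", ["alto", "hocr", "pdf", "txt"]))
```

The resulting list could be passed to `subprocess.run` on a system where `tesseract` is installed; setting `OMP_THREAD_LIMIT=1` in the child environment would limit it to a single CPU core, as described in the ENVIRONMENT VARIABLES section.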