Improve formatting for automatic TOC generation for asciidoc

2025-08-06 13:56:47 +08:00 · 2019-02-16 16:25:08 +05:30 · 2019-02-16 16:25:08 +05:30 · c6fa0386eb
commit c6fa0386eb
parent 836356ddea
1 changed file with 116 additions and 153 deletions
--- a/FAQ.asciidoc
+++ b/FAQ.asciidoc
@@ -5,6 +5,7 @@ ifdef::env-github[]
 :important-caption: :heavy_exclamation_mark:
 :caution-caption: :fire:
 :warning-caption: :warning:
+:sectlinks:
 endif::[]

 = Frequently Asked Questions (Tesseract 4)
@@ -39,16 +40,14 @@ forum.

 toc::[]

-[[how-do-i-get-tesseract-4.0.0]]
-How do I get Tesseract 4.0.0?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+= Tesseract 4.0.0
+
+== How do I get Tesseract?

 See https://github.com/tesseract-ocr/tesseract/wiki[Tesseract Wiki Home]
 page for details.

-[[which-language-models-are-available-for-tesseract-4.0.0]]
-Which language models are available for Tesseract 4.0.0?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== Which language models are available for Tesseract?

 See Tesseract man page for the list of
 https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages[languages]
@@ -65,10 +64,7 @@ User contributed language models are linked from
 https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-Contributions[Data
 Files Contributions].

-[[where-are-the-language-models-traineddata-files-for-tesseract-4.0.0-installed]]
-Where are the language models (traineddata files) for Tesseract 4.0.0
-installed?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== Where are the language models (traineddata files) for Tesseract installed?

 The files should be installed in /usr/share/tesseract-ocr/4.00/tessdata
 (on Ubuntu).
@@ -77,9 +73,7 @@ If you get an error message saying eng.traineddata not found, try
 setting `TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata` and all
 will be good.

-[[what-output-formats-can-tesseract-produce]]
-What output formats can Tesseract produce?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== What output formats can Tesseract produce?

 * txt
 * pdf
@@ -111,9 +105,19 @@ merged with an images-only PDF. See
 https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-385669193[issue
 660] for related discussion and utility for merging the PDFs.

-[[how-do-i-run-tesseract-4.0.0-from-the-command-line]]
-How do I run Tesseract 4.0.0 from the command line?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== What page separators are used in txt output by Tesseract 4.0.0?
+
+Each page will be terminated by the FF character by default for text
+output.
+
+Setting `page_separator` to the LF character would restore the old
+behaviour of adding an empty line at the end of each page.
+
+Setting `page_separator` to an empty string would omit page separators.
+
+= Running Tesseract
+
+== How do I run Tesseract 4.0.0 from the command line?

 See
 https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage[Tesseract
@@ -123,9 +127,7 @@ from the command line.
 `tesseract --help` will provide the most recent help information for the
 installed version.

-[[how-to-process-multiple-images-in-a-single-run]]
-How to process multiple images in a single run?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How to process multiple images in a single run?

 Prepare a text file that has the path to each image:

@@ -139,85 +141,18 @@ Save it, and then give its name as input file to Tesseract.

 `tesseract savedlist output`

-[[what-page-separators-are-used-in-txt-output-by-tesseract-4.0.0]]
-What page separators are used in txt output by Tesseract 4.0.0?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How to OCR streaming images to pdf using Tesseract?

-Each page will be terminated by the FF character by default for text
-output.
+Let's say you have an amazing but slow multipage scanning device. It
+would be nice to OCR during scanning. In this example, the scanning
+program is sending image filenames to Tesseract as they are produced.
+Tesseract streams a searchable PDF to stdout.

-Setting `page_separator` to the LF character would restore the old
-behaviour of adding an empty line at the end of each page.
+....
+scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
+....

-Setting `page_separator` to an empty string would omit page separators.
-
-[[how-do-i-use-tesseract-4.0.0-using-the-api]]
-How do I use Tesseract 4.0.0 using the API?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-See https://github.com/tesseract-ocr/tesseract/wiki/APIExample[Tesseract
-Wiki API examples] page for sample programs for using the API.
-
-[[how-do-i-improve-ocr-results]]
-How do I improve OCR results?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You should note that in many cases, in order to get better OCR results,
-you'll need to
-https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality[improve
-the quality] of the input image you are giving Tesseract.
-
-[[can-i-increase-speed-of-ocr]]
-Can I increase speed of OCR?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If you are running Tesseract 4, you can use the "fast" integer models.
-
-Tesseract 4 also uses up to four CPU threads while processing a page, so
-it will be faster than Tesseract 3 for a single page.
-
-If your computer has only two CPU cores, then running four threads will
-slow down things significantly and it would be better to use a single
-thread or maybe a maximum of two threads! Using a single thread
-eliminates the computation overhead of multithreading and is also the
-best solution for processing lots of images by running one Tesseract
-process per CPU core.
-
-Set the maximum number of threads using the environment variable
-`OMP_THREAD_LIMIT`.
-
-To disable multithreading, use `OMP_THREAD_LIMIT=1`.
-
-[[how-do-i-train-tesseract-4.0.0-lstm-engine]]
-How do I train Tesseract 4.0.0 LSTM Engine?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Tesseract can be trained to recognize other languages or finetune
-existing language models. See
-https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Tesseract
-Wiki Training Tesseract 4.00] page for information on training the LSTM
-engine.
-
-Please note that currently LSTM training is only supported using
-synthetic images created using a UTF-8 training text and unicode fonts
-to render the text.
-
-[[there-are-inconsistent-results-from-tesseract-when-the-same-tessbaseapi-object-is-used-for-decoding-multiple-images]]
-There are inconsistent results from tesseract when the same TessBaseAPI
-object is used for decoding multiple images
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Try to turn off the adaptive classifier by setting the config variable
-`classify_enable_learning` to `0`, or to clear the adaptive data with
-the method `ClearAdaptiveClassifier()`.
-
-See also the discussion on the
-https://groups.google.com/d/topic/tesseract-ocr/ByGJhocI9qQ[tesseract
-forum]
-
-[[how-can-i-make-the-error-messages-go-to-tesseract.log-instead-of-stderr]]
-How can I make the error messages go to tesseract.log instead of stderr?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How can I make the error messages go to tesseract.log instead of stderr?

 To restore the old behaviour of writing to tesseract.log instead of
 writing to the console window, you need a text file that contains this:
@@ -227,9 +162,7 @@ debug_file tesseract.log
 Call the file 'logfile' and put it in tessdata/configs/. Then add logfile
 to the end of your command line.

-[[how-can-i-suppress-tesseract-info-line]]
-How can I suppress tesseract info line?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How can I suppress tesseract info line?

 See
 https://web.archive.org/web/*/http://code.google.com/p/tesseract-ocr/issues/detail?id=579[issue
@@ -249,67 +182,48 @@ tesseract phototest.tif phototest quiet
 *Warning:* Both options will cause you to not see the error message if
 there is one.

-[[how-do-i-produce-searchable-pdf-output]]
-How do I produce searchable PDF output?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How do I use Tesseract 4.0.0 using the API?

-Searchable PDF output is a standard feature as of Tesseract version
-3.03. Use the `pdf` config file like this:
+See https://github.com/tesseract-ocr/tesseract/wiki/APIExample[Tesseract
+Wiki API examples] page for sample programs for using the API.

-....
-tesseract phototest.tif phototest pdf
-....
+== There are inconsistent results from tesseract when the same TessBaseAPI object is used for decoding multiple images.

-[[the-searchable-pdf-seems-to-contain-only-spaces-or-spaces-between-the-letters-of-words]]
-The searchable PDF seems to contain only spaces or spaces between the
-letters of words
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Try to turn off the adaptive classifier by setting the config variable
+`classify_enable_learning` to `0`, or to clear the adaptive data with
+the method `ClearAdaptiveClassifier()`.

-There may be nothing wrong with the PDF itself, but its hidden,
-searchable text layer may be not understood by your PDF reader. For
-example, Preview.app in Mac OS X is well known for having problems like
-this, and might "see" only spaces and no text. Try using Adobe Acrobat
-Reader instead.
+See also the discussion on the
+https://groups.google.com/d/topic/tesseract-ocr/ByGJhocI9qQ[tesseract
+forum]

-[[how-to-do-streaming-of-images-to-pdf-using-tesseract]]
-How to do streaming of images to pdf using Tesseract?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How do I improve OCR results?

-Let's say you have an amazing but slow multipage scanning device. It
-would be nice to OCR during scanning. In this example, the scanning
-program is sending image filenames to Tesseract as they are produced.
-Tesseract streams a searchable PDF to stdout.
+You should note that in many cases, in order to get better OCR results,
+you'll need to
+https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality[improve
+the quality] of the input image you are giving Tesseract.

-....
-scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
-....
+== Can I increase speed of OCR?

-[[can-i-use-tesseract-for-handwriting-recognition]]
-Can I use Tesseract for handwriting recognition?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If you are running Tesseract 4, you can use the "fast" integer models.

-You can, but it won't work very well, as Tesseract is designed for
-printed text. Take a look at the http://lipitk.sourceforge.net/[Lipi
-Toolkit] project instead.
+Tesseract 4 also uses up to four CPU threads while processing a page, so
+it will be faster than Tesseract 3 for a single page.

-[[can-i-use-tesseract-for-barcode-recognition]]
-Can I use tesseract for barcode recognition?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If your computer has only two CPU cores, then running four threads will
+slow down things significantly and it would be better to use a single
+thread or maybe a maximum of two threads! Using a single thread
+eliminates the computation overhead of multithreading and is also the
+best solution for processing lots of images by running one Tesseract
+process per CPU core.

-No. Tesseract is for text recognition.
+Set the maximum number of threads using the environment variable
+`OMP_THREAD_LIMIT`.

-[[where-is-the-documentation]]
-Where is the documentation?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To disable multithreading, use `OMP_THREAD_LIMIT=1`.

-You're looking at it. If things aren't clear, search on the
-http://groups.google.com/group/tesseract-ocr/[Tesseract Google Group] or
-ask us there. If you want to help us write more, please do, and post it
-to the group!
-
-[[how-can-i-try-the-next-version]]
-How can I try the next version?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How can I try the next version?

 Periodically stable versions go to the downloads page. Between releases,
 and in particular, just before a new release, the latest code is
@@ -318,11 +232,9 @@ https://github.com/tesseract-ocr/tesseract.git where you can check it
 out either by command line, or by following the link to the howto on
 using various client programs and plugins.

-[[how-do-i-compare-different-versions-of-tesseract]]
-How do I compare different versions of Tesseract
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+== How do I compare different versions of Tesseract?

-If you want to have several version of tesseract (e.g. you want to
+If you want to have several versions of tesseract (e.g. you want to
 compare OCR results), I would suggest compiling them from source
 (e.g. in /usr/src) without installing them. If you want to test a particular
 version, you can run it this way:
@@ -336,9 +248,60 @@ version you can run it this way:
 will take care that correct shared library is used (without
 installation...).

-[[my-question-isnt-in-here]]
-My question isn't in here!
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+= Training
+
+== How do I train Tesseract 4.0.0 LSTM Engine?
+
+Tesseract can be trained to recognize other languages or finetune
+existing language models. See
+https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Tesseract
+Wiki Training Tesseract 4.00] page for information on training the LSTM
+engine.
+
+Please note that currently LSTM training is only supported using
+synthetic images created using a UTF-8 training text and unicode fonts
+to render the text.
+
+= pdf
+
+== How do I produce searchable PDF output?
+
+Searchable PDF output is a standard feature as of Tesseract version
+3.03. Use the `pdf` config file like this:
+
+....
+tesseract phototest.tif phototest pdf
+....
+
+== The searchable PDF seems to contain only spaces or spaces between the letters of words.
+
+There may be nothing wrong with the PDF itself, but its hidden,
+searchable text layer may not be understood by your PDF reader. For
+example, Preview.app in Mac OS X is well known for having problems like
+this, and might "see" only spaces and no text. Try using Adobe Acrobat
+Reader instead.
+
+= Miscellaneous
+
+== Can I use Tesseract for handwriting recognition?
+
+You can, but it won't work very well, as Tesseract is designed for
+printed text. Look for projects focussed on handwriting recognition.
+
+== Can I use tesseract for barcode recognition?
+
+No. Tesseract is for text recognition.
+
+== Where is the documentation?
+
+You're looking at it. If things aren't clear, search on the
+http://groups.google.com/group/tesseract-ocr/[Tesseract Google Group] or
+ask us there. If you want to help us write more, please do, and post it
+to the group!
+
+
+== My question isn't in here!

 Try searching the forum: http://groups.google.com/group/tesseract-ocr as
 well as open and closed issues on GitHub: