Improve formatting for automatic TOC generation for asciidoc

Shreeshrii 2019-02-16 16:25:08 +05:30
parent 836356ddea
commit c6fa0386eb

@ -5,6 +5,7 @@ ifdef::env-github[]
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
:sectlinks:
endif::[]
= Frequently Asked Questions (Tesseract 4)
@ -39,16 +40,14 @@ forum.
toc::[]
[[how-do-i-get-tesseract-4.0.0]]
How do I get Tesseract 4.0.0?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
= Tesseract 4.0.0
== How do I get Tesseract?
See the https://github.com/tesseract-ocr/tesseract/wiki[Tesseract Wiki Home]
page for details.
[[which-language-models-are-available-for-tesseract-4.0.0]]
Which language models are available for Tesseract 4.0.0?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== Which language models are available for Tesseract?
See the Tesseract man page for the list of
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages[languages]
@ -65,10 +64,7 @@ User contributed language models are linked from
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-Contributions[Data
Files Contributions].
[[where-are-the-language-models-traineddata-files-for-tesseract-4.0.0-installed]]
Where are the language models (traineddata files) for Tesseract 4.0.0
installed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== Where are the language models (traineddata files) for Tesseract installed?
The files should be installed in /usr/share/tesseract-ocr/4.00/tessdata
(on Ubuntu).
@ -77,9 +73,7 @@ If you get an error message saying eng.traineddata not found, try
setting `TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata` and all
will be good.
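As a rough sketch of this workaround, the variable can be exported for the current shell session and the result checked with `tesseract --list-langs`. The path is the Ubuntu location quoted above; the guard around the `tesseract` call is an addition so that the snippet does not fail on a machine without Tesseract.

```shell
# Point Tesseract at the Ubuntu tessdata directory named in this FAQ.
# Adjust the path if your distribution installs the files elsewhere.
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata

# List the language models Tesseract can now find (only if it is installed).
if command -v tesseract >/dev/null 2>&1; then
  tesseract --list-langs || echo "tesseract could not list languages" >&2
fi
```

If `eng` appears in the list, the "eng.traineddata not found" error should no longer occur in this session.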
[[what-output-formats-can-tesseract-produce]]
What output formats can Tesseract produce?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== What output formats can Tesseract produce?
* txt
* pdf
@ -111,9 +105,19 @@ merged with an images-only PDF. See
https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-385669193[issue
660] for related discussion and utility for merging the PDFs.
[[how-do-i-run-tesseract-4.0.0-from-the-command-line]]
How do I run Tesseract 4.0.0 from the command line?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== What page separators are used in txt output by Tesseract 4.0.0?
By default, each page of text output is terminated by the form feed (FF)
character.
Setting `page_separator` to the LF character restores the old
behaviour of adding an empty line at the end of each page.
Setting `page_separator` to an empty string omits page separators.
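One way to see the default separator at work is to count the FF characters in a txt output file. The sketch below is only an illustration: `count_pages` is a hypothetical helper, not part of Tesseract, and the sample file is hand-made rather than real OCR output.

```shell
# Hypothetical helper: count the pages in a Tesseract txt output file by
# counting form feed (FF, 0x0C) characters, one per page by default.
count_pages() {
  tr -cd '\f' < "$1" | wc -c | tr -d ' '
}

# Example with a hand-made two-page file standing in for real OCR output.
sample=$(mktemp)
printf 'page one\n\fpage two\n\f' > "$sample"
count_pages "$sample"   # prints 2
rm -f "$sample"
```

The variable itself can be set on the command line with `-c`, for example `tesseract savedlist output -c page_separator=''` to omit the separators.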
= Running Tesseract
== How do I run Tesseract 4.0.0 from the command line?
See
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage[Tesseract
@ -123,9 +127,7 @@ from the command line.
`tesseract --help` will provide the most recent help information for the
installed version.
[[how-to-process-multiple-images-in-a-single-run]]
How to process multiple images in a single run?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How to process multiple images in a single run?
Prepare a text file that has the path to each image:
@ -139,85 +141,18 @@ Save it, and then give its name as input file to Tesseract.
`tesseract savedlist output`
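The list file can also be generated instead of typed by hand. This is a minimal sketch, assuming the images are `.tif` files in a single directory; `make_filelist` is a hypothetical helper name, and `-maxdepth` is available in GNU and BSD `find` but is not strictly POSIX.

```shell
# Hypothetical helper: write the path of every .tif file in a directory,
# one per line and in sorted order, to a list file for Tesseract.
make_filelist() {
  find "$1" -maxdepth 1 -type f -name '*.tif' | sort > "$2"
}

# Usage (assumes scans/ holds the images):
#   make_filelist scans savedlist
#   tesseract savedlist output
```

Sorting keeps the page order predictable when the file names are numbered.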
[[what-page-separators-are-used-in-txt-output-by-tesseract-4.0.0]]
What page separators are used in txt output by Tesseract 4.0.0?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How to OCR streaming images to pdf using Tesseract?
Each page will be terminated by the FF character by default for text
output.
Let's say you have an amazing but slow multipage scanning device. It
would be nice to OCR during scanning. In this example, the scanning
program is sending image filenames to Tesseract as they are produced.
Tesseract streams a searchable PDF to stdout.
Setting `page_separator` to the LF character would restore the old
behaviour of adding an empty line at the end of each page.
....
scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
....
Setting `page_separator` to an empty string would omit page separators.
[[how-do-i-use-tesseract-4.0.0-using-the-api]]
How do I use Tesseract 4.0.0 using the API?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See https://github.com/tesseract-ocr/tesseract/wiki/APIExample[Tesseract
Wiki API examples] page for sample programs for using the API.
[[how-do-i-improve-ocr-results]]
How do I improve OCR results?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You should note that in many cases, in order to get better OCR results,
you'll need to
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality[improve
the quality] of the input image you are giving Tesseract.
[[can-i-increase-speed-of-ocr]]
Can I increase speed of OCR?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are running Tesseract 4, you can use the "fast" integer models.
Tesseract 4 also uses up to four CPU threads while processing a page, so
it will be faster than Tesseract 3 for a single page.
If your computer has only two CPU cores, then running four threads will
slow down things significantly and it would be better to use a single
thread or maybe a maximum of two threads! Using a single thread
eliminates the computation overhead of multithreading and is also the
best solution for processing lots of images by running one Tesseract
process per CPU core.
Set the maximum number of threads using the environment variable
`OMP_THREAD_LIMIT`.
To disable multithreading, use `OMP_THREAD_LIMIT=1`.
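The "one Tesseract process per CPU core" approach could be sketched as follows. The helper name `ocr_parallel` and the `TESSERACT` override variable are hypothetical, the `.tif` extension is an assumption, and `xargs -P` is a GNU/BSD extension rather than POSIX.

```shell
# Hypothetical helper: OCR every .tif file in a directory with one
# single-threaded Tesseract process per CPU core.
# Assumptions: xargs supports -0 and -P (GNU and BSD versions do), and the
# TESSERACT variable is only an illustrative override for the binary name.
ocr_parallel() {
  find "$1" -type f -name '*.tif' -print0 |
    OMP_THREAD_LIMIT=1 xargs -0 -P "$(getconf _NPROCESSORS_ONLN)" -I{} \
      "${TESSERACT:-tesseract}" {} {}
}

# Usage: ocr_parallel scans
# Each image such as scans/a.tif gets its text written to scans/a.tif.txt.
```

Setting `OMP_THREAD_LIMIT=1` for the whole `xargs` call means that every Tesseract process it starts inherits the single-thread limit.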
[[how-do-i-train-tesseract-4.0.0-lstm-engine]]
How do I train Tesseract 4.0.0 LSTM Engine?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tesseract can be trained to recognize other languages or finetune
existing language models. See
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Tesseract
Wiki Training Tesseract 4.00] page for information on training the LSTM
engine.
Please note that currently LSTM training is only supported using
synthetic images created using a UTF-8 training text and unicode fonts
to render the text.
[[there-are-inconsistent-results-from-tesseract-when-the-same-tessbaseapi-object-is-used-for-decoding-multiple-images]]
There are inconsistent results from tesseract when the same TessBaseAPI
object is used for decoding multiple images
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Try to turn off the adaptive classifier by setting the config variable
`classify_enable_learning` to `0`, or to clear the adaptive data with
the method `ClearAdaptiveClassifier()`.
See also the discussion on the
https://groups.google.com/d/topic/tesseract-ocr/ByGJhocI9qQ[tesseract
forum]
[[how-can-i-make-the-error-messages-go-to-tesseract.log-instead-of-stderr]]
How can I make the error messages go to tesseract.log instead of stderr?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How can I make the error messages go to tesseract.log instead of stderr?
To restore the old behaviour of writing to tesseract.log instead of
writing to the console window, you need a text file that contains this:
@ -227,9 +162,7 @@ debug_file tesseract.log
Call the file `logfile` and put it in tessdata/configs/. Then add `logfile`
to the end of your command line.
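Put together, the steps might look like the sketch below. `write_log_config` is a hypothetical helper, and the tessdata path in the usage comment is the Ubuntu location mentioned elsewhere in this FAQ, so treat both as assumptions.

```shell
# Hypothetical helper: create the 'logfile' config in a tessdata directory.
write_log_config() {
  mkdir -p "$1/configs"
  printf 'debug_file tesseract.log\n' > "$1/configs/logfile"
}

# Usage (writing into the system tessdata directory may need root):
#   write_log_config /usr/share/tesseract-ocr/4.00/tessdata
#   tesseract phototest.tif phototest logfile
```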
[[how-can-i-suppress-tesseract-info-line]]
How can I suppress tesseract info line?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How can I suppress tesseract info line?
See
https://web.archive.org/web/*/http://code.google.com/p/tesseract-ocr/issues/detail?id=579[issue
@ -249,67 +182,48 @@ tesseract phototest.tif phototest quiet
*Warning:* Both options will cause you to not see the error message if
there is one.
[[how-do-i-produce-searchable-pdf-output]]
How do I produce searchable PDF output?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How do I use Tesseract 4.0.0 using the API?
Searchable PDF output is a standard feature as of Tesseract version
3.03. Use the `pdf` config file like this:
See https://github.com/tesseract-ocr/tesseract/wiki/APIExample[Tesseract
Wiki API examples] page for sample programs for using the API.
....
tesseract phototest.tif phototest pdf
....
== There are inconsistent results from tesseract when the same TessBaseAPI object is used for decoding multiple images.
[[the-searchable-pdf-seems-to-contain-only-spaces-or-spaces-between-the-letters-of-words]]
The searchable PDF seems to contain only spaces or spaces between the
letters of words
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Try to turn off the adaptive classifier by setting the config variable
`classify_enable_learning` to `0`, or to clear the adaptive data with
the method `ClearAdaptiveClassifier()`.
There may be nothing wrong with the PDF itself, but its hidden,
searchable text layer may not be understood by your PDF reader. For
example, Preview.app in Mac OS X is well known for having problems like
this, and might "see" only spaces and no text. Try using Adobe Acrobat
Reader instead.
See also the discussion on the
https://groups.google.com/d/topic/tesseract-ocr/ByGJhocI9qQ[tesseract
forum]
[[how-to-do-streaming-of-images-to-pdf-using-tesseract]]
How to do streaming of images to pdf using Tesseract?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How do I improve OCR results?
Let's say you have an amazing but slow multipage scanning device. It
would be nice to OCR during scanning. In this example, the scanning
program is sending image filenames to Tesseract as they are produced.
Tesseract streams a searchable PDF to stdout.
In many cases, to get better OCR results you'll need to
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality[improve
the quality] of the input image you give Tesseract.
....
scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
....
== Can I increase speed of OCR?
[[can-i-use-tesseract-for-handwriting-recognition]]
Can I use Tesseract for handwriting recognition?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are running Tesseract 4, you can use the "fast" integer models.
You can, but it won't work very well, as Tesseract is designed for
printed text. Take a look at the http://lipitk.sourceforge.net/[Lipi
Toolkit] project instead.
Tesseract 4 also uses up to four CPU threads while processing a page, so
it will be faster than Tesseract 3 for a single page.
[[can-i-use-tesseract-for-barcode-recognition]]
Can I use tesseract for barcode recognition?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If your computer has only two CPU cores, running four threads will
slow things down significantly; use a single thread or at most two
threads. Using a single thread eliminates the overhead of
multithreading and is also the best choice for processing many images
by running one Tesseract process per CPU core.
No. Tesseract is for text recognition.
Set the maximum number of threads using the environment variable
`OMP_THREAD_LIMIT`.
[[where-is-the-documentation]]
Where is the documentation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~
To disable multithreading, use `OMP_THREAD_LIMIT=1`.
You're looking at it. If things aren't clear, search on the
http://groups.google.com/group/tesseract-ocr/[Tesseract Google Group] or
ask us there. If you want to help us write more, please do, and post it
to the group!
[[how-can-i-try-the-next-version]]
How can I try the next version?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How can I try the next version?
Periodically stable versions go to the downloads page. Between releases,
and in particular, just before a new release, the latest code is
@ -318,11 +232,9 @@ https://github.com/tesseract-ocr/tesseract.git where you can check it
out either by command line, or by following the link to the howto on
using various client programs and plugins.
[[how-do-i-compare-different-versions-of-tesseract]]
How do I compare different versions of Tesseract
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How do I compare different versions of Tesseract?
If you want to have several version of tesseract (e.g. you want to
If you want to have several versions of tesseract (e.g. you want to
compare OCR results), I suggest compiling them from source
(e.g. in /usr/src) without installing them. To test a particular
version, run it this way:
@ -336,9 +248,60 @@ version you can run it this way:
will take care that the correct shared library is used (without
installation...).
[[my-question-isnt-in-here]]
My question isn't in here!
~~~~~~~~~~~~~~~~~~~~~~~~~~
= Training
== How do I train Tesseract 4.0.0 LSTM Engine?
Tesseract can be trained to recognize other languages or to fine-tune
existing language models. See
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Tesseract
Wiki Training Tesseract 4.00] page for information on training the LSTM
engine.
Note that LSTM training is currently supported only with
synthetic images rendered from a UTF-8 training text using Unicode
fonts.
= pdf
== How do I produce searchable PDF output?
Searchable PDF output is a standard feature as of Tesseract version
3.03. Use the `pdf` config file like this:
....
tesseract phototest.tif phototest pdf
....
== The searchable PDF seems to contain only spaces or spaces between the letters of words.
There may be nothing wrong with the PDF itself, but its hidden,
searchable text layer may not be understood by your PDF reader. For
example, Preview.app in Mac OS X is well known for having problems like
this, and might "see" only spaces and no text. Try using Adobe Acrobat
Reader instead.
= Miscellaneous
== Can I use Tesseract for handwriting recognition?
You can, but it won't work very well, as Tesseract is designed for
printed text. Look for projects focussed on handwriting recognition.
== Can I use tesseract for barcode recognition?
No. Tesseract is for text recognition.
== Where is the documentation?
You're looking at it. If things aren't clear, search on the
http://groups.google.com/group/tesseract-ocr/[Tesseract Google Group] or
ask us there. If you want to help us write more, please do, and post it
to the group!
== My question isn't in here!
Try searching the forum: http://groups.google.com/group/tesseract-ocr as
well as open and closed issues on GitHub: