mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2025-08-06 13:56:47 +08:00
Improve formatting for automatic TOC generation for asciidoc
parent 836356ddea
commit c6fa0386eb
269
FAQ.asciidoc
@@ -5,6 +5,7 @@ ifdef::env-github[]
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
:sectlinks:
endif::[]

= Frequently Asked Questions (Tesseract 4)
@@ -39,16 +40,14 @@ forum.

toc::[]

[[how-do-i-get-tesseract-4.0.0]]
How do I get Tesseract 4.0.0?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
= Tesseract 4.0.0

== How do I get Tesseract?

See https://github.com/tesseract-ocr/tesseract/wiki[Tesseract Wiki Home]
page for details.

[[which-language-models-are-available-for-tesseract-4.0.0]]
Which language models are available for Tesseract 4.0.0?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== Which language models are available for Tesseract?

See Tesseract man page for the list of
https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages[languages]
@@ -65,10 +64,7 @@ User contributed language models are linked from
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-Contributions[Data
Files Contributions].

[[where-are-the-language-models-traineddata-files-for-tesseract-4.0.0-installed]]
Where are the language models (traineddata files) for Tesseract 4.0.0
installed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== Where are the language models (traineddata files) for Tesseract installed?

The files should be installed in /usr/share/tesseract-ocr/4.00/tessdata
(on Ubuntu).
@@ -77,9 +73,7 @@ If you get an error message saying eng.traineddata not found, try
setting `TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata` and all
will be good.
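
As a rough sketch of that workaround in a POSIX shell: the path is the Ubuntu location quoted above and may differ on other systems, and the existence check is only an illustrative addition, not something Tesseract requires.

```shell
# Path quoted in the FAQ for Ubuntu; adjust it for your installation.
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata

# Illustrative check: report whether the English model is present there.
if [ -f "$TESSDATA_PREFIX/eng.traineddata" ]; then
  echo "found $TESSDATA_PREFIX/eng.traineddata"
else
  echo "eng.traineddata is not in $TESSDATA_PREFIX"
fi
```

Exporting the variable in a shell profile would make the setting persistent; whether that is desirable depends on how many Tesseract installations share the machine.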

[[what-output-formats-can-tesseract-produce]]
What output formats can Tesseract produce?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== What output formats can Tesseract produce?

* txt
* pdf
@@ -111,9 +105,19 @@ merged with an images-only PDF. See
https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-385669193[issue
660] for related discussion and utility for merging the PDFs.

[[how-do-i-run-tesseract-4.0.0-from-the-command-line]]
How do I run Tesseract 4.0.0 from the command line?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== What page separators are used in txt output by Tesseract 4.0.0?

Each page will be terminated by the FF character by default for text
output.

Setting `page_separator` to the LF character would restore the old
behaviour of adding an empty line at the end of each page.

Setting `page_separator` to an empty string would omit page separators.
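
A small sketch of what the default means for anyone post-processing the text output. The sample file below is a hand-made stand-in for Tesseract output (the file name and page text are assumptions), with each page terminated by FF; counting FF characters then gives the page count.

```shell
# Hand-made stand-in for Tesseract txt output: two pages, each ending in FF (\f).
printf 'page one\n\fpage two\n\f' > sample_output.txt

# Count the pages by counting FF characters.
pages=$(tr -cd '\f' < sample_output.txt | wc -c | tr -d ' ')
echo "pages: $pages"
```

On the command line, config variables such as `page_separator` are passed with `-c`, for example `tesseract input.tif output -c page_separator=""` (file names hypothetical).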

= Running Tesseract

== How do I run Tesseract 4.0.0 from the command line?

See
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage[Tesseract
@@ -123,9 +127,7 @@ from the command line.
`tesseract --help` will provide the most recent help information for the
installed version.

[[how-to-process-multiple-images-in-a-single-run]]
How to process multiple images in a single run?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How to process multiple images in a single run?

Prepare a text file that has the path to each image:

@@ -139,85 +141,18 @@ Save it, and then give its name as input file to Tesseract.

`tesseract savedlist output`
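
A minimal sketch of the whole sequence, assuming three hypothetical image files; the `tesseract` call only runs if the binary and the first image are actually present.

```shell
# Hypothetical image names, one path per line; replace with your own files.
printf '%s\n' page1.tif page2.tif page3.tif > savedlist

# Run OCR over the list only when tesseract and the images are available.
if command -v tesseract >/dev/null 2>&1 && [ -f page1.tif ]; then
  tesseract savedlist output
fi
```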

[[what-page-separators-are-used-in-txt-output-by-tesseract-4.0.0]]
What page separators are used in txt output by Tesseract 4.0.0?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How to OCR streaming images to pdf using Tesseract?

Each page will be terminated by the FF character by default for text
output.
Let's say you have an amazing but slow multipage scanning device. It
would be nice to OCR during scanning. In this example, the scanning
program is sending image filenames to Tesseract as they are produced.
Tesseract streams a searchable PDF to stdout.

Setting `page_separator` to the LF character would restore the old
behaviour of adding an empty line at the end of each page.
....
scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
....

Setting `page_separator` to an empty string would omit page separators.

[[how-do-i-use-tesseract-4.0.0-using-the-api]]
How do I use Tesseract 4.0.0 using the API?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See https://github.com/tesseract-ocr/tesseract/wiki/APIExample[Tesseract
Wiki API examples] page for sample programs for using the API.

[[how-do-i-improve-ocr-results]]
How do I improve OCR results?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You should note that in many cases, in order to get better OCR results,
you'll need to
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality[improve
the quality] of the input image you are giving Tesseract.

[[can-i-increase-speed-of-ocr]]
Can I increase speed of OCR?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are running Tesseract 4, you can use the "fast" integer models.

Tesseract 4 also uses up to four CPU threads while processing a page, so
it will be faster than Tesseract 3 for a single page.

If your computer has only two CPU cores, then running four threads will
slow down things significantly and it would be better to use a single
thread or maybe a maximum of two threads! Using a single thread
eliminates the computation overhead of multithreading and is also the
best solution for processing lots of images by running one Tesseract
process per CPU core.

Set the maximum number of threads using the environment variable
`OMP_THREAD_LIMIT`.

To disable multithreading, use `OMP_THREAD_LIMIT=1`.

[[how-do-i-train-tesseract-4.0.0-lstm-engine]]
How do I train Tesseract 4.0.0 LSTM Engine?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tesseract can be trained to recognize other languages or finetune
existing language models. See
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Tesseract
Wiki Training Tesseract 4.00] page for information on training the LSTM
engine.

Please note that currently LSTM training is only supported using
synthetic images created using a UTF-8 training text and unicode fonts
to render the text.

[[there-are-inconsistent-results-from-tesseract-when-the-same-tessbaseapi-object-is-used-for-decoding-multiple-images]]
There are inconsistent results from tesseract when the same TessBaseAPI
object is used for decoding multiple images
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Try to turn off the adaptive classifier by setting the config variable
`classify_enable_learning` to `0`, or to clear the adaptive data with
the method `ClearAdaptiveClassifier()`.

See also the discussion on the
https://groups.google.com/d/topic/tesseract-ocr/ByGJhocI9qQ[tesseract
forum]

[[how-can-i-make-the-error-messages-go-to-tesseract.log-instead-of-stderr]]
How can I make the error messages go to tesseract.log instead of stderr?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How can I make the error messages go to tesseract.log instead of stderr?

To restore the old behaviour of writing to tesseract.log instead of
writing to the console window, you need a text file that contains this:
@@ -227,9 +162,7 @@ debug_file tesseract.log
call the file 'logfile' and put it in tessdata/configs/ Then add logfile
to the end of your command line.
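
Sketched in shell, assuming a local `./tessdata` directory (a placeholder; use the tessdata directory of your installation) and the `debug_file tesseract.log` line from the FAQ:

```shell
# Placeholder location; point this at your real tessdata directory.
tessdata_dir=./tessdata
mkdir -p "$tessdata_dir/configs"

# The config file holds the single line given in the FAQ.
printf 'debug_file tesseract.log\n' > "$tessdata_dir/configs/logfile"

# Then name the config at the end of the command line, e.g.:
#   tesseract phototest.tif phototest logfile
```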

[[how-can-i-suppress-tesseract-info-line]]
How can I suppress tesseract info line?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How can I suppress tesseract info line?

See
https://web.archive.org/web/*/http://code.google.com/p/tesseract-ocr/issues/detail?id=579[issue
@@ -249,67 +182,48 @@ tesseract phototest.tif phototest quiet
*Warning:* Both options will cause you to not see the error message if
there is one.

[[how-do-i-produce-searchable-pdf-output]]
How do I produce searchable PDF output?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How do I use Tesseract 4.0.0 using the API?

Searchable PDF output is a standard feature as of Tesseract version
3.03. Use the `pdf` config file like this:
See https://github.com/tesseract-ocr/tesseract/wiki/APIExample[Tesseract
Wiki API examples] page for sample programs for using the API.

....
tesseract phototest.tif phototest pdf
....
== There are inconsistent results from tesseract when the same TessBaseAPI object is used for decoding multiple images.

[[the-searchable-pdf-seems-to-contain-only-spaces-or-spaces-between-the-letters-of-words]]
The searchable PDF seems to contain only spaces or spaces between the
letters of words
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Try to turn off the adaptive classifier by setting the config variable
`classify_enable_learning` to `0`, or to clear the adaptive data with
the method `ClearAdaptiveClassifier()`.

There may be nothing wrong with the PDF itself, but its hidden,
searchable text layer may be not understood by your PDF reader. For
example, Preview.app in Mac OS X is well known for having problems like
this, and might "see" only spaces and no text. Try using Adobe Acrobat
Reader instead.
See also the discussion on the
https://groups.google.com/d/topic/tesseract-ocr/ByGJhocI9qQ[tesseract
forum]

[[how-to-do-streaming-of-images-to-pdf-using-tesseract]]
How to do streaming of images to pdf using Tesseract?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How do I improve OCR results?

Let's say you have an amazing but slow multipage scanning device. It
would be nice to OCR during scanning. In this example, the scanning
program is sending image filenames to Tesseract as they are produced.
Tesseract streams a searchable PDF to stdout.
You should note that in many cases, in order to get better OCR results,
you'll need to
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality[improve
the quality] of the input image you are giving Tesseract.

....
scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
....
== Can I increase speed of OCR?

[[can-i-use-tesseract-for-handwriting-recognition]]
Can I use Tesseract for handwriting recognition?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are running Tesseract 4, you can use the "fast" integer models.

You can, but it won't work very well, as Tesseract is designed for
printed text. Take a look at the http://lipitk.sourceforge.net/[Lipi
Toolkit] project instead.
Tesseract 4 also uses up to four CPU threads while processing a page, so
it will be faster than Tesseract 3 for a single page.

[[can-i-use-tesseract-for-barcode-recognition]]
Can I use tesseract for barcode recognition?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If your computer has only two CPU cores, then running four threads will
slow down things significantly and it would be better to use a single
thread or maybe a maximum of two threads! Using a single thread
eliminates the computation overhead of multithreading and is also the
best solution for processing lots of images by running one Tesseract
process per CPU core.

No. Tesseract is for text recognition.
Set the maximum number of threads using the environment variable
`OMP_THREAD_LIMIT`.

[[where-is-the-documentation]]
Where is the documentation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~
To disable multithreading, use `OMP_THREAD_LIMIT=1`.
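
For instance, a minimal sketch in a POSIX shell. `phototest.tif` is the sample name used elsewhere in this FAQ, and the OCR call is skipped when `tesseract` or the image is missing.

```shell
# Limit Tesseract to a single thread for everything started from this shell.
export OMP_THREAD_LIMIT=1

# Run OCR only if the binary and the sample image are actually present.
if command -v tesseract >/dev/null 2>&1 && [ -f phototest.tif ]; then
  tesseract phototest.tif phototest
fi
```

Prefixing a single command (`OMP_THREAD_LIMIT=1 tesseract ...`) limits the setting to that one run instead of the whole session.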

You're looking at it. If things aren't clear, search on the
http://groups.google.com/group/tesseract-ocr/[Tesseract Google Group] or
ask us there. If you want to help us write more, please do, and post it
to the group!

[[how-can-i-try-the-next-version]]
How can I try the next version?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How can I try the next version?

Periodically stable versions go to the downloads page. Between releases,
and in particular, just before a new release, the latest code is
@@ -318,11 +232,9 @@ https://github.com/tesseract-ocr/tesseract.git where you can check it
out either by command line, or by following the link to the howto on
using various client programs and plugins.

[[how-do-i-compare-different-versions-of-tesseract]]
How do I compare different versions of Tesseract
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
== How do I compare different versions of Tesseract

If you want to have several version of tesseract (e.g. you want to
If you want to have several versions of tesseract (e.g. you want to
compare OCR result) I would suggest you to compile them from source
(e.g. in /usr/src) and not install them. If you want to test particular
version you can run it this way:
@@ -336,9 +248,60 @@ version you can run it this way:
will take care that correct shared library is used (without
installation...).

[[my-question-isnt-in-here]]
My question isn't in here!
~~~~~~~~~~~~~~~~~~~~~~~~~~

= Training

== How do I train Tesseract 4.0.0 LSTM Engine?

Tesseract can be trained to recognize other languages or finetune
existing language models. See
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00[Tesseract
Wiki Training Tesseract 4.00] page for information on training the LSTM
engine.

Please note that currently LSTM training is only supported using
synthetic images created using a UTF-8 training text and unicode fonts
to render the text.

= pdf

== How do I produce searchable PDF output?

Searchable PDF output is a standard feature as of Tesseract version
3.03. Use the `pdf` config file like this:

....
tesseract phototest.tif phototest pdf
....

== The searchable PDF seems to contain only spaces or spaces between the letters of words.

There may be nothing wrong with the PDF itself, but its hidden,
searchable text layer may be not understood by your PDF reader. For
example, Preview.app in Mac OS X is well known for having problems like
this, and might "see" only spaces and no text. Try using Adobe Acrobat
Reader instead.

= Miscellaneous

== Can I use Tesseract for handwriting recognition?

You can, but it won't work very well, as Tesseract is designed for
printed text. Look for projects focussed on handwriting recognition.

== Can I use tesseract for barcode recognition?

No. Tesseract is for text recognition.

== Where is the documentation?

You're looking at it. If things aren't clear, search on the
http://groups.google.com/group/tesseract-ocr/[Tesseract Google Group] or
ask us there. If you want to help us write more, please do, and post it
to the group!


== My question isn't in here!

Try searching the forum: http://groups.google.com/group/tesseract-ocr as
well as open and closed issues on GitHub: