Updated FAQ (markdown)

Shreeshrii 2018-05-01 22:45:59 +05:30
parent 09817e2f87
commit 24e1b8de97

44
FAQ.md

@ -15,19 +15,19 @@ Read the [CONTRIBUTING](https://github.com/tesseract-ocr/tesseract/blob/master/C
***
(Please note, this page is currently being updated for 4.0.0).
## Q&A
# Q&A
### How do I get Tesseract 4.0.0?
## How do I get Tesseract 4.0.0?
See [Tesseract Wiki Home](https://github.com/tesseract-ocr/tesseract/wiki) page for details.
### Which language models are available for Tesseract 4.0.0?
## Which language models are available for Tesseract 4.0.0?
See the Tesseract man page for the list of [languages](https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages) and [scripts](https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#scripts) supported by Tesseract 4.0.0.
See the [Tesseract Wiki Data Files](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017) page for information regarding the language models available for Tesseract 4.0.0.
### What output formats can Tesseract produce?
## What output formats can Tesseract produce?
* txt
* pdf
@ -45,13 +45,13 @@ With the configfile 'tsv' tesseract will produce [tab-separated values](https://
`tesseract -c textonly_pdf=1` will produce a text-only PDF which can be merged with an images-only PDF. See [issue 660](https://github.com/tesseract-ocr/tesseract/issues/660#issuecomment-385669193) for related discussion and utility for merging the PDFs.
### How do I run Tesseract 4.0.0 from the command line?
## How do I run Tesseract 4.0.0 from the command line?
See [Tesseract Wiki Command Line Usage](https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage) page for information on how to run Tesseract from the command line.
`tesseract --help` will provide the most recent help information for the installed version.
### How to process multiple images in a single run?
## How to process multiple images in a single run?
Prepare a text file that has the path to each image:
@ -65,7 +65,7 @@ Save it, and then give its name as input file to Tesseract.
`tesseract savedlist output`
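A minimal Python sketch of the list-file approach described above. It only writes one image path per line and assembles the same `tesseract savedlist output` command; the helper names `write_image_list` and `build_command` are hypothetical and not part of Tesseract.

```python
from pathlib import Path


def write_image_list(image_paths, list_path):
    """Write one image path per line (hypothetical helper, not a Tesseract API)."""
    lines = [str(p) for p in image_paths]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def build_command(list_path, output_base):
    """Mirror the FAQ's `tesseract savedlist output` invocation as an argument list."""
    return ["tesseract", str(list_path), str(output_base)]


if __name__ == "__main__":
    # Example paths are placeholders; replace them with real image files.
    write_image_list(["page1.png", "page2.png"], "savedlist")
    print(" ".join(build_command("savedlist", "output")))
```

The resulting argument list can be passed to `subprocess.run` on a machine where Tesseract is installed.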
### What page separators are used in txt output by Tesseract 4.0.0?
## What page separators are used in txt output by Tesseract 4.0.0?
Each page will be terminated by the FF character by default for text output.
@ -73,15 +73,15 @@ Setting `page_separator` to the LF character would restore the old behaviour of
Setting `page_separator` to an empty string would omit page separators.
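As an illustration of what the default separator means for downstream code, here is a small Python sketch that splits multi-page text output on the FF character. The function name `split_pages` is hypothetical, and the sketch assumes each page is terminated (not merely separated) by the separator, as stated above.

```python
def split_pages(text, separator="\f"):
    """Split multi-page text output into pages.

    Assumes every page ends with the separator, so the empty
    trailing piece after the last separator is dropped.
    """
    if separator == "":
        # With an empty page_separator there is nothing to split on.
        return [text]
    pages = text.split(separator)
    if pages and pages[-1] == "":
        pages.pop()
    return pages


if __name__ == "__main__":
    sample = "page one\n\fpage two\n\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, repr(page))
```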
### How do I use Tesseract 4.0.0 using the API?
## How do I use Tesseract 4.0.0 using the API?
See [Tesseract Wiki API examples](https://github.com/tesseract-ocr/tesseract/wiki/APIExample) page for sample programs for using the API.
### How do I improve OCR results?
## How do I improve OCR results?
In many cases, getting better OCR results requires [improving the quality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) of the input image you give Tesseract.
### Can I increase speed of OCR?
## Can I increase speed of OCR?
If you are running Tesseract 4, you can use the "fast" integer models.
@ -93,13 +93,13 @@ Set the maximum number of threads using the environment variable `OMP_THREAD_LIM
To disable multithreading, use `OMP_THREAD_LIMIT=1`.
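When Tesseract is launched from a script, the variable can be set for the child process only. The following Python sketch builds such an environment; `env_with_thread_limit` is a hypothetical helper, and the commented-out command line is a placeholder.

```python
import os


def env_with_thread_limit(limit, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    The caller's own environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


if __name__ == "__main__":
    env = env_with_thread_limit(1)
    print(env["OMP_THREAD_LIMIT"])
    # To use it, pass the mapping to the child process, for example:
    # subprocess.run(["tesseract", "image.png", "out"], env=env)
```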
### How do I train Tesseract 4.0.0 LSTM Engine?
## How do I train Tesseract 4.0.0 LSTM Engine?
Tesseract can be trained to recognize other languages or to fine-tune existing language models. See the [Tesseract Wiki Training Tesseract 4.00](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) page for information on training the LSTM engine.
Please note that LSTM training is currently supported only with synthetic images, which are rendered from a UTF-8 training text using Unicode fonts.
### There are inconsistent results from tesseract when the same TessBaseAPI object is used for decoding multiple images
## There are inconsistent results from tesseract when the same TessBaseAPI object is used for decoding multiple images
Try to clear the adaptive data with `ClearAdaptiveClassifier()` or turn off the adaptive classifier with config variables:
```
@ -109,7 +109,7 @@ classify_enable_adaptive_matcher 0
See also the discussion on the [tesseract forum](https://groups.google.com/d/topic/tesseract-ocr/ByGJhocI9qQ).
### How can I make the error messages go to tesseract.log instead of stderr?
## How can I make the error messages go to tesseract.log instead of stderr?
To restore the old behaviour of writing to tesseract.log instead of writing to the console window, you need a text file that contains this:
@ -118,7 +118,7 @@ debug\_file tesseract.log
Call the file `logfile` and put it in `tessdata/configs/`.
Then add logfile to the end of your command line.
### How can I suppress tesseract info line?
## How can I suppress tesseract info line?
See [issue 579](https://web.archive.org/web/*/http://code.google.com/p/tesseract-ocr/issues/detail?id=579). On Linux you can redirect stderr and stdout output to /dev/null. E.g.:
```
@ -131,7 +131,7 @@ tesseract phototest.tif phototest quiet
**Warning:** Both options will cause you to not see the error message if there is one.
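If Tesseract is started from a script, one way to soften this trade-off is to discard the output but still inspect the exit status. The Python sketch below assumes the program reports failure through a nonzero exit status; `run_quietly` is a hypothetical helper, and the demo uses the Python interpreter itself as a stand-in command.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run a command with stdout and stderr discarded and return its exit status."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # do not raise; let the caller inspect the status
    )
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command: writes to stderr and exits with status 3.
    status = run_quietly(
        [sys.executable, "-c", "import sys; sys.stderr.write('oops'); sys.exit(3)"]
    )
    if status != 0:
        print("command failed with status", status)
```

A nonzero status signals that the run should be repeated without redirection to see the actual message.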
### How do I produce searchable PDF output?
## How do I produce searchable PDF output?
Searchable PDF output is a standard feature as of Tesseract version 3.03. Use the `pdf` config file like this:
@ -139,7 +139,7 @@ Searchable PDF output is a standard feature as of Tesseract version 3.03. Use th
tesseract phototest.tif phototest pdf
```
### The searchable PDF seems to contain only spaces or spaces between the letters of words
## The searchable PDF seems to contain only spaces or spaces between the letters of words
There may be nothing wrong with the PDF itself, but its hidden, searchable text layer may not be understood by your PDF reader. For example, Preview.app in Mac OS X is well known for having problems like this, and might "see" only spaces and no text. Try using Adobe Acrobat Reader instead.
@ -155,23 +155,23 @@ searchable PDF to stdout.
scanimage --batch --batch-print | tesseract -c stream_filelist=true - - pdf > output.pdf
```
### Can I use Tesseract for handwriting recognition?
## Can I use Tesseract for handwriting recognition?
You can, but it won't work very well, as Tesseract is designed for printed text. Take a look at the [Lipi Toolkit](http://lipitk.sourceforge.net/) project instead.
### Can I use tesseract for barcode recognition?
## Can I use tesseract for barcode recognition?
No. Tesseract is for text recognition.
### Where is the documentation?
## Where is the documentation?
You're looking at it. If things aren't clear, search on the [Tesseract Google Group](http://groups.google.com/group/tesseract-ocr/) or ask us there. If you want to help us write more, please do, and post it to the group!
### How can I try the next version?
## How can I try the next version?
Stable versions are periodically published on the downloads page. Between releases, and in particular just before a new release, the latest code is available from git at https://github.com/tesseract-ocr/tesseract.git, which you can check out either from the command line or by following the link to the howto on using various client programs and plugins.
### How do I compare different versions of Tesseract
## How do I compare different versions of Tesseract
If you want to have several versions of tesseract (e.g. to compare OCR results), I would suggest compiling them from source (e.g. in /usr/src) without installing them. If you want to test a particular version, you can run it this way:
@ -182,6 +182,6 @@ If you want to have several version of tesseract (e.g. you want to compare OCR r
/usr/src/tesseract-3.03/api/tesseract is a shell wrapper script that takes care that the correct shared library is used (without installation).
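Once two versions have each produced a text file for the same image, the results can be compared with any diff tool. As a rough sketch using only Python's standard `difflib` module (the function name `compare_outputs` and the labels are hypothetical, and the similarity ratio is a crude indicator rather than an OCR accuracy metric):

```python
import difflib


def compare_outputs(text_a, text_b, name_a="a", name_b="b"):
    """Return a similarity ratio and a unified diff for two OCR outputs."""
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    diff = list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )
    return ratio, diff


if __name__ == "__main__":
    # Placeholder strings; in practice, read the two output .txt files.
    ratio, diff = compare_outputs("The quick\n", "The qu1ck\n", "old", "new")
    print(f"similarity: {ratio:.2f}")
    print("\n".join(diff))
```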
### My question isn't in here!
## My question isn't in here!
Try searching the forum: http://groups.google.com/group/tesseract-ocr as well as open and closed issues on GitHub: https://github.com/tesseract-ocr/tesseract/issues, as your question may have come up before even if it is not listed here.