test move msg

Shree 2020-02-03 05:08:04 +00:00
parent 2133aa6d25
commit 1616ba962e

@ -1,264 +1,11 @@
## [Tesseract 'main' page](https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc)
We've Moved!
------------
See the [main](https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc) page for command line syntax and other details.
**These wiki pages are no longer maintained.**
--------------------------------------------
All pages were moved to [tesseract-ocr/tessdoc](https://github.com/tesseract-ocr/tessdoc).
## Basic Command Line Usage
The latest documentation is available at https://tesseract-ocr.github.io/.
See [Running Tesseract](https://github.com/tesseract-ocr/tesseract/wiki#running-tesseract) for basic command line usage.
**Please find this page in its new home: https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage**
## FAQ
See [FAQ](https://github.com/tesseract-ocr/tesseract/wiki/FAQ#running-tesseract) for more examples and tips.
--------------------------------------------
## Available OCR Engines in Tesseract 4
Use `--oem 1` for LSTM, `--oem 0` for Legacy Tesseract. Please note that Legacy Tesseract models are only included in traineddata files from [tessdata](https://github.com/tesseract-ocr/tessdata) repo.
`tesseract input.tiff output --oem 1 -l eng`
---------------------------------------------
## Simplest Invocation to OCR an image
tesseract imagename outputbase
This uses **English** as the default language and 3 as the Page Segmentation Mode. The default output format is **text**.
osd.traineddata, for Orientation and Segmentation and eng.traineddata and other language data files for English should be in the "tessdata" directory. TESSDATA_PREFIX environment variable should be set to the parent directory of "tessdata" directory.
The following command would give the same result as above, if eng.traineddata and osd.traineddata files are in /usr/share/tessdata directory.
tesseract --tessdata-dir /usr/share imagename outputbase -l eng --psm 3
____________________________________
Following examples use this image which has text in multiple languages.
![eurotext.png](http://dev.blog.fairway.ne.jp/wp-content/uploads/2014/04/eurotext.png)
## Using One Language
Add '-l LANG' to the command where LANG is three character language code from the list of supported languages. If this is not given then English language is assumed by default.
tesseract testing/eurotext.png testing/eurotext-eng -l eng
Output
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,,schnelle” braune Fuchs springt
fiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra i] cane pigro. El zorro
marrén répido salta sobre el perro
perezoso. A raposa marrom répida
salta sobre 0 C50 preguieoso.
## Using Multiple Languages
Add '-l LANG[+LANG]' to the command line to use multiple languages together for recognition
tesseract testing/eurotext.png testing/eurotext-engdeu -l eng+deu
Output
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der „schnelle” braune Fuchs springt
über den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrön räpido salta sobre el perro
perezoso. A raposa marrom räpida
salta sobre o cäo preguieoso.
## Order of multiple languages
The output can be different based on the order of languages, so -l eng+hin can give different result than -l hin+eng.
Following examples use a greyscale version of this image which has text in multiple languages - Hindi and English.
![bilingual.jpg](https://i.ytimg.com/vi/Z0qDeKu7TWA/hqdefault.jpg)
Using English as primary language and then Hindi
tesseract testing/bilingual.jpg testing/bilingual-enghin -l eng+hin
Output
हिदीसेअंठौजी
HINDI To
ENGLISH
Using Hindi as primary language and then English
tesseract testing/bilingual.jpg testing/bilingual-hineng -l hin+eng
Output
हिंदी से अंग्रेजी
H I N D I T o
E N G L I S H
## Searchable pdf output
tesseract testing/eurotext.png testing/eurotext-eng -l eng pdf
This creates a pdf with the image and a separate searchable text layer with the recognized text.
tesseract c:\temp\test_ara.jpg -l ara -psm 3 c:\temp\test_ara pdf
Files are attached (source JPG and output PDF)
![test_ara.jpg](https://cloud.githubusercontent.com/assets/17473681/13320324/bc160e22-dbd0-11e5-8090-6f3728fcc06d.jpg)
![test_ara.pdf](https://github.com/tesseract-ocr/tesseract/files/146534/test_ara.pdf)
## HOCR output
Use 'hocr' config file by adding hocr at the end of the command to get the HOCR output.
tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
Partial Output
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract 3.05.00dev' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "./testing/eurotext.png"; bbox 0 0 1024 800; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 98 66 918 661">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 98 66 918 661">
<span class='ocr_line' id='line_1_1' title="bbox 105 66 823 113; baseline 0.015 -18; x_size 39; x_descenders 7; x_ascenders 9"><span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 90'>The</span> <span class='ocrx_word' id='word_1_2' title='bbox 205 67 347 106; x_wconf 87'><strong>(quick)</strong></span> <span class='ocrx_word' id='word_1_3' title='bbox 376 69 528 109; x_wconf 89'>[brown]</span> <span class='ocrx_word' id='word_1_4' title='bbox 559 71 663 110; x_wconf 89'>{fox}</span> <span class='ocrx_word' id='word_1_5' title='bbox 687 73 823 113; x_wconf 89'>jumps!</span>
</span>
</p>
</div>
</div>
</body>
</html>
## TSV output (Currently available in 3.05-dev in master branch on github)
Use 'tsv' config file by adding tsv at the end of the command to get the TSV output.
tesseract testing/eurotext.png testing/eurotext-eng -l eng tsv
Partial Output
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 1024 800 -1
2 1 1 0 0 0 98 66 821 596 -1
3 1 1 1 0 0 98 66 821 596 -1
4 1 1 1 1 0 105 66 719 48 -1
5 1 1 1 1 1 105 66 74 32 90 The
5 1 1 1 1 2 205 67 143 40 87 (quick)
5 1 1 1 1 3 376 69 153 41 89 [brown]
5 1 1 1 1 4 559 71 105 40 89 {fox}
5 1 1 1 1 5 687 73 137 41 89 jumps!
4 1 1 1 2 0 104 115 784 51 -1
5 1 1 1 2 1 104 115 96 33 91 Over
5 1 1 1 2 2 224 117 60 32 89 the
5 1 1 1 2 3 310 117 224 39 88 $43,456.78
5 1 1 1 2 4 561 121 136 42 92 <lazy>
5 1 1 1 2 5 722 123 70 32 92 #90
5 1 1 1 2 6 818 125 70 41 89 dog
## Using different Page Segmentation Modes
The following examples are using this image with text in Devanagari script and Sanskrit language.
![san002.png](https://cloud.githubusercontent.com/assets/82178/13678011/81953684-e6ba-11e5-91e8-5c40518e94a6.png)
tesseract testing/san002.png testing/san002-psm6 -l san -psm 6
Output
विर्व्य 16
ज्यालत्रुखीसह्स्रनामक्तोव्रम्- नामाकळिट्. 191
दुर्गासहस्रनामस्तीत्रम्- १ नामांक्ळिन्नू ॰213
द्रुर्गासहस्रनत्मस्तीन्रम्- २ नामावळिऽ 238
द्दुगसिद्द्स्रनत्मक्तोत्रम्दकाराद्दि(३) नामाव'ळिऽ 263
ट्टुगसिहस्रनामक्तोत्रम्- ४ नामावळिइं 300
पार्वतीं ह्यो) सहस्रनामातोत्रम्- नामावळिऽ’ 329
द्दुर्गानवाक्षरीन्निशतींनत्माव'क्ति 355
द्बुर्गाष्टोत्तरङ्प्तनत्मरतोव्रम्- नामावक्ति 360
र्व्यत्मामस्वोत्रम्- नामाक्ळिऽ 363
अन्नपूण्स्सिहस्रनत्मस्तीत्रम्- नामावक्ति 365
अन्नघूर्गाष्टोत्तस्यातनामस्तीन्रम्- नामावक्ति 394
क्रुलकुर्व्यसहस्रनत्मक्तोत्रम्- कवचम्… नामावळिथ् 397-
कुमारींसहृस्रनामक्तोन्नम्- नामावळिय् 432
गङ्ग’म्यासद्वृस्रनप्मक्तोव्रम्- नाम।वक्ति` 457
गङ्ग’म्याष्टोत्तराप्तनामप्तोत्रम्- नामावळिऽ 488
गङ्गादातनप्तास्तोत्रम्- नामावक्ति 491
यमुनासहस्रनामरतोव्रम्- नम्पावळिय् 493
'शिवगङ्गासद्दृस्रनत्माव'ळि 517
गम्पत्रीसह्स्रनत्मक्तोत्रम्- नाम।व'ळिऽ (१) 531
tesseract testing/san002.png testing/san002-psm3 -l san -psm 3
Output
ज्यंग्लत्रुखीसह्स्रनामलोत्रम्- नामावळिट्.
दुर्गासहस्रनामस्तीत्रम्- १ नामाक्ळि
दुर्गासहस्रनत्मस्तीत्र्दुं'म्- २ नामावळिऽ
द्बुगसिद्द्स्रनत्मरत्तोत्रम्दकारादि (३) नामावळि
पार्वतीं ह्यो) सहम्रनम्परतोत्रम्- नामावळिऽ’
फुलकुर्व्यसहस्रनत्मक्तोत्रम्-क्ताचम्-नत्माचळिऽ
गम्यत्रीसह्स्रनत्मक्तोत्रम्-नग्मग्वळिऽ(१)
191
,213
238
300
329
355
360
363.
365
394
397-
432
457
488
491
493
517
531