add new lang info

Zdenko Podobný 2015-06-28 22:26:39 +02:00
parent e8b6d6f71b
commit dcc457cc05
4 changed files with 334 additions and 157 deletions

@ -2,12 +2,12 @@
.\" Title: tesseract .\" Title: tesseract
.\" Author: [see the "AUTHOR" section] .\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/> .\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015 .\" Date: 06/28/2015
.\" Manual: \ \& .\" Manual: \ \&
.\" Source: \ \& .\" Source: \ \&
.\" Language: English .\" Language: English
.\" .\"
.TH "TESSERACT" "1" "06/12/2015" "\ \&" "\ \&" .TH "TESSERACT" "1" "06/28/2015" "\ \&" "\ \&"
.\" ----------------------------------------------------------------- .\" -----------------------------------------------------------------
.\" * Define some portability stuff .\" * Define some portability stuff
.\" ----------------------------------------------------------------- .\" -----------------------------------------------------------------
@ -158,9 +158,9 @@ print tesseract parameters to the stdout\&.
.RE .RE
.SH "LANGUAGES" .SH "LANGUAGES"
.sp .sp
There are currently language packs available for the following languages: There are currently language packs available for the following languages (in \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tessdata\fR\m[]):
.sp .sp
\fBara\fR (Arabic), \fBaze\fR (Azerbauijani), \fBbul\fR (Bulgarian), \fBcat\fR (Catalan), \fBces\fR (Czech), \fBchi_sim\fR (Simplified Chinese), \fBchi_tra\fR (Traditional Chinese), \fBchr\fR (Cherokee), \fBdan\fR (Danish), \fBdan\-frak\fR (Danish (Fraktur)), \fBdeu\fR (German), \fBell\fR (Greek), \fBeng\fR (English), \fBenm\fR (Old English), \fBepo\fR (Esperanto), \fBest\fR (Estonian), \fBfin\fR (Finnish), \fBfra\fR (French), \fBfrm\fR (Old French), \fBglg\fR (Galician), \fBheb\fR (Hebrew), \fBhin\fR (Hindi), \fBhrv\fR (Croation), \fBhun\fR (Hungarian), \fBind\fR (Indonesian), \fBita\fR (Italian), \fBjpn\fR (Japanese), \fBkor\fR (Korean), \fBlav\fR (Latvian), \fBlit\fR (Lithuanian), \fBnld\fR (Dutch), \fBnor\fR (Norwegian), \fBpol\fR (Polish), \fBpor\fR (Portuguese), \fBron\fR (Romanian), \fBrus\fR (Russian), \fBslk\fR (Slovakian), \fBslv\fR (Slovenian), \fBsqi\fR (Albanian), \fBspa\fR (Spanish), \fBsrp\fR (Serbian), \fBswe\fR (Swedish), \fBtam\fR (Tamil), \fBtel\fR (Telugu), \fBtgl\fR (Tagalog), \fBtha\fR (Thai), \fBtur\fR (Turkish), \fBukr\fR (Ukrainian), \fBvie\fR (Vietnamese) \fBafr\fR (Afrikaans) \fBamh\fR (Amharic) \fBara\fR (Arabic) \fBasm\fR (Assamese) \fBaze\fR (Azerbaijani) \fBaze_cyrl\fR (Azerbaijani \- Cyrilic) \fBbel\fR (Belarusian) \fBben\fR (Bengali) \fBbod\fR (Tibetan) \fBbos\fR (Bosnian) \fBbul\fR (Bulgarian) \fBcat\fR (Catalan; Valencian) \fBceb\fR (Cebuano) \fBces\fR (Czech) \fBchi_sim\fR (Chinese \- Simplified) \fBchi_tra\fR (Chinese \- Traditional) \fBchr\fR (Cherokee) \fBcym\fR (Welsh) \fBdan\fR (Danish) \fBdan_frak\fR (Danish \- Fraktur) \fBdeu\fR (German) \fBdeu_frak\fR (German \- Fraktur) \fBdzo\fR (Dzongkha) \fBell\fR (Greek, Modern (1453\-)) \fBeng\fR (English) \fBenm\fR (English, Middle (1100\-1500)) \fBepo\fR (Esperanto) \fBequ\fR (Math / equation detection module) \fBest\fR (Estonian) \fBeus\fR (Basque) \fBfas\fR (Persian) \fBfin\fR (Finnish) \fBfra\fR (French) \fBfrk\fR (Frankish) \fBfrm\fR (French, Middle (ca\&.1400\-1600)) 
\fBgle\fR (Irish) \fBglg\fR (Galician) \fBgrc\fR (Greek, Ancient (to 1453)) \fBguj\fR (Gujarati) \fBhat\fR (Haitian; Haitian Creole) \fBheb\fR (Hebrew) \fBhin\fR (Hindi) \fBhrv\fR (Croatian) \fBhun\fR (Hungarian) \fBiku\fR (Inuktitut) \fBind\fR (Indonesian) \fBisl\fR (Icelandic) \fBita\fR (Italian) \fBita_old\fR (Italian \- Old) \fBjav\fR (Javanese) \fBjpn\fR (Japanese) \fBkan\fR (Kannada) \fBkat\fR (Georgian) \fBkat_old\fR (Georgian \- Old) \fBkaz\fR (Kazakh) \fBkhm\fR (Central Khmer) \fBkir\fR (Kirghiz; Kyrgyz) \fBkor\fR (Korean) \fBkur\fR (Kurdish) \fBlao\fR (Lao) \fBlat\fR (Latin) \fBlav\fR (Latvian) \fBlit\fR (Lithuanian) \fBmal\fR (Malayalam) \fBmar\fR (Marathi) \fBmkd\fR (Macedonian) \fBmlt\fR (Maltese) \fBmsa\fR (Malay) \fBmya\fR (Burmese) \fBnep\fR (Nepali) \fBnld\fR (Dutch; Flemish) \fBnor\fR (Norwegian) \fBori\fR (Oriya) \fBosd\fR (Orientation and script detection module) \fBpan\fR (Panjabi; Punjabi) \fBpol\fR (Polish) \fBpor\fR (Portuguese) \fBpus\fR (Pushto; Pashto) \fBron\fR (Romanian; Moldavian; Moldovan) \fBrus\fR (Russian) \fBsan\fR (Sanskrit) \fBsin\fR (Sinhala; Sinhalese) \fBslk\fR (Slovak) \fBslk_frak\fR (Slovak \- Fraktur) \fBslv\fR (Slovenian) \fBspa\fR (Spanish; Castilian) \fBspa_old\fR (Spanish; Castilian \- Old) \fBsqi\fR (Albanian) \fBsrp\fR (Serbian) \fBsrp_latn\fR (Serbian \- Latin) \fBswa\fR (Swahili) \fBswe\fR (Swedish) \fBsyr\fR (Syriac) \fBtam\fR (Tamil) \fBtel\fR (Telugu) \fBtgk\fR (Tajik) \fBtgl\fR (Tagalog) \fBtha\fR (Thai) \fBtir\fR (Tigrinya) \fBtur\fR (Turkish) \fBuig\fR (Uighur; Uyghur) \fBukr\fR (Ukrainian) \fBurd\fR (Urdu) \fBuzb\fR (Uzbek) \fBuzb_cyrl\fR (Uzbek \- Cyrilic) \fBvie\fR (Vietnamese) \fByid\fR (Yiddish)
.sp .sp
To use a non\-standard language pack named \fBfoo\&.traineddata\fR, set the \fBTESSDATA_PREFIX\fR environment variable so the file can be found at \fBTESSDATA_PREFIX\fR/tessdata/\fBfoo\fR\&.traineddata and give Tesseract the argument \fI\-l foo\fR\&. To use a non\-standard language pack named \fBfoo\&.traineddata\fR, set the \fBTESSDATA_PREFIX\fR environment variable so the file can be found at \fBTESSDATA_PREFIX\fR/tessdata/\fBfoo\fR\&.traineddata and give Tesseract the argument \fI\-l foo\fR\&.
.SH "CONFIG FILES AND AUGMENTING WITH USER DATA" .SH "CONFIG FILES AND AUGMENTING WITH USER DATA"
@ -224,7 +224,7 @@ The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett
.sp .sp
Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&. Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&.
.sp .sp
Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&. Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/docs/blob/master/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
.sp .sp
Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&. Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&.
.sp .sp
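
The LANGUAGES text in this file says a non-standard pack foo.traineddata must be reachable at TESSDATA_PREFIX/tessdata/foo.traineddata and selected with -l foo. The following is a minimal sketch of that lookup rule, not part of this commit. The helper names (traineddata_path, build_invocation), the /opt/ocr prefix and the scan.png and scan file names are hypothetical. The argument order "tesseract imagename outputbase -l lang" is assumed from the usual command-line usage. Nothing is executed, so the sketch runs without Tesseract installed.

```python
import os
from pathlib import PurePosixPath


def traineddata_path(prefix: str, lang: str) -> PurePosixPath:
    """Location the documentation describes for a language pack:
    TESSDATA_PREFIX/tessdata/<lang>.traineddata (POSIX-style path)."""
    return PurePosixPath(prefix) / "tessdata" / f"{lang}.traineddata"


def build_invocation(prefix: str, lang: str, image: str, output_base: str):
    """Return (argv, env) for a Tesseract run that uses the given pack.

    The environment is a copy of the current one with TESSDATA_PREFIX set;
    the caller's own environment is left untouched.
    """
    env = dict(os.environ, TESSDATA_PREFIX=prefix)
    argv = ["tesseract", image, output_base, "-l", lang]
    return argv, env


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    argv, env = build_invocation("/opt/ocr", "foo", "scan.png", "scan")
    print(traineddata_path("/opt/ocr", "foo"))
    print(" ".join(argv))
```

With Tesseract installed, argv and env could be handed to subprocess.run(argv, env=env, check=True). Checking that the file returned by traineddata_path exists before the run gives a clearer error than a failed language load.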

@ -98,57 +98,116 @@ SINGLE OPTIONS
LANGUAGES LANGUAGES
--------- ---------
There are currently language packs available for the following languages: There are currently language packs available for the following languages
(in https://github.com/tesseract-ocr/tessdata):
*ara* (Arabic), *afr* (Afrikaans)
*aze* (Azerbauijani), *amh* (Amharic)
*bul* (Bulgarian), *ara* (Arabic)
*cat* (Catalan), *asm* (Assamese)
*ces* (Czech), *aze* (Azerbaijani)
*chi_sim* (Simplified Chinese), *aze_cyrl* (Azerbaijani - Cyrilic)
*chi_tra* (Traditional Chinese), *bel* (Belarusian)
*chr* (Cherokee), *ben* (Bengali)
*dan* (Danish), *bod* (Tibetan)
*dan-frak* (Danish (Fraktur)), *bos* (Bosnian)
*deu* (German), *bul* (Bulgarian)
*ell* (Greek), *cat* (Catalan; Valencian)
*eng* (English), *ceb* (Cebuano)
*enm* (Old English), *ces* (Czech)
*epo* (Esperanto), *chi_sim* (Chinese - Simplified)
*est* (Estonian), *chi_tra* (Chinese - Traditional)
*fin* (Finnish), *chr* (Cherokee)
*fra* (French), *cym* (Welsh)
*frm* (Old French), *dan* (Danish)
*glg* (Galician), *dan_frak* (Danish - Fraktur)
*heb* (Hebrew), *deu* (German)
*hin* (Hindi), *deu_frak* (German - Fraktur)
*hrv* (Croation), *dzo* (Dzongkha)
*hun* (Hungarian), *ell* (Greek, Modern (1453-))
*ind* (Indonesian), *eng* (English)
*ita* (Italian), *enm* (English, Middle (1100-1500))
*jpn* (Japanese), *epo* (Esperanto)
*kor* (Korean), *equ* (Math / equation detection module)
*lav* (Latvian), *est* (Estonian)
*lit* (Lithuanian), *eus* (Basque)
*nld* (Dutch), *fas* (Persian)
*nor* (Norwegian), *fin* (Finnish)
*pol* (Polish), *fra* (French)
*por* (Portuguese), *frk* (Frankish)
*ron* (Romanian), *frm* (French, Middle (ca.1400-1600))
*rus* (Russian), *gle* (Irish)
*slk* (Slovakian), *glg* (Galician)
*slv* (Slovenian), *grc* (Greek, Ancient (to 1453))
*sqi* (Albanian), *guj* (Gujarati)
*spa* (Spanish), *hat* (Haitian; Haitian Creole)
*srp* (Serbian), *heb* (Hebrew)
*swe* (Swedish), *hin* (Hindi)
*tam* (Tamil), *hrv* (Croatian)
*tel* (Telugu), *hun* (Hungarian)
*tgl* (Tagalog), *iku* (Inuktitut)
*tha* (Thai), *ind* (Indonesian)
*tur* (Turkish), *isl* (Icelandic)
*ukr* (Ukrainian), *ita* (Italian)
*ita_old* (Italian - Old)
*jav* (Javanese)
*jpn* (Japanese)
*kan* (Kannada)
*kat* (Georgian)
*kat_old* (Georgian - Old)
*kaz* (Kazakh)
*khm* (Central Khmer)
*kir* (Kirghiz; Kyrgyz)
*kor* (Korean)
*kur* (Kurdish)
*lao* (Lao)
*lat* (Latin)
*lav* (Latvian)
*lit* (Lithuanian)
*mal* (Malayalam)
*mar* (Marathi)
*mkd* (Macedonian)
*mlt* (Maltese)
*msa* (Malay)
*mya* (Burmese)
*nep* (Nepali)
*nld* (Dutch; Flemish)
*nor* (Norwegian)
*ori* (Oriya)
*osd* (Orientation and script detection module)
*pan* (Panjabi; Punjabi)
*pol* (Polish)
*por* (Portuguese)
*pus* (Pushto; Pashto)
*ron* (Romanian; Moldavian; Moldovan)
*rus* (Russian)
*san* (Sanskrit)
*sin* (Sinhala; Sinhalese)
*slk* (Slovak)
*slk_frak* (Slovak - Fraktur)
*slv* (Slovenian)
*spa* (Spanish; Castilian)
*spa_old* (Spanish; Castilian - Old)
*sqi* (Albanian)
*srp* (Serbian)
*srp_latn* (Serbian - Latin)
*swa* (Swahili)
*swe* (Swedish)
*syr* (Syriac)
*tam* (Tamil)
*tel* (Telugu)
*tgk* (Tajik)
*tgl* (Tagalog)
*tha* (Thai)
*tir* (Tigrinya)
*tur* (Turkish)
*uig* (Uighur; Uyghur)
*ukr* (Ukrainian)
*urd* (Urdu)
*uzb* (Uzbek)
*uzb_cyrl* (Uzbek - Cyrilic)
*vie* (Vietnamese) *vie* (Vietnamese)
*yid* (Yiddish)
To use a non-standard language pack named *foo.traineddata*, set the To use a non-standard language pack named *foo.traineddata*, set the
*TESSDATA_PREFIX* environment variable so the file can be found at *TESSDATA_PREFIX* environment variable so the file can be found at

@ -931,56 +931,115 @@ before any <em>configfile</em>.</p></div>
<div class="sect1"> <div class="sect1">
<h2 id="_languages">LANGUAGES</h2> <h2 id="_languages">LANGUAGES</h2>
<div class="sectionbody"> <div class="sectionbody">
<div class="paragraph"><p>There are currently language packs available for the following languages:</p></div> <div class="paragraph"><p>There are currently language packs available for the following languages
<div class="paragraph"><p><strong>ara</strong> (Arabic), (in <a href="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</a>):</p></div>
<strong>aze</strong> (Azerbauijani), <div class="paragraph"><p><strong>afr</strong> (Afrikaans)
<strong>bul</strong> (Bulgarian), <strong>amh</strong> (Amharic)
<strong>cat</strong> (Catalan), <strong>ara</strong> (Arabic)
<strong>ces</strong> (Czech), <strong>asm</strong> (Assamese)
<strong>chi_sim</strong> (Simplified Chinese), <strong>aze</strong> (Azerbaijani)
<strong>chi_tra</strong> (Traditional Chinese), <strong>aze_cyrl</strong> (Azerbaijani - Cyrilic)
<strong>chr</strong> (Cherokee), <strong>bel</strong> (Belarusian)
<strong>dan</strong> (Danish), <strong>ben</strong> (Bengali)
<strong>dan-frak</strong> (Danish (Fraktur)), <strong>bod</strong> (Tibetan)
<strong>deu</strong> (German), <strong>bos</strong> (Bosnian)
<strong>ell</strong> (Greek), <strong>bul</strong> (Bulgarian)
<strong>eng</strong> (English), <strong>cat</strong> (Catalan; Valencian)
<strong>enm</strong> (Old English), <strong>ceb</strong> (Cebuano)
<strong>epo</strong> (Esperanto), <strong>ces</strong> (Czech)
<strong>est</strong> (Estonian), <strong>chi_sim</strong> (Chinese - Simplified)
<strong>fin</strong> (Finnish), <strong>chi_tra</strong> (Chinese - Traditional)
<strong>fra</strong> (French), <strong>chr</strong> (Cherokee)
<strong>frm</strong> (Old French), <strong>cym</strong> (Welsh)
<strong>glg</strong> (Galician), <strong>dan</strong> (Danish)
<strong>heb</strong> (Hebrew), <strong>dan_frak</strong> (Danish - Fraktur)
<strong>hin</strong> (Hindi), <strong>deu</strong> (German)
<strong>hrv</strong> (Croation), <strong>deu_frak</strong> (German - Fraktur)
<strong>hun</strong> (Hungarian), <strong>dzo</strong> (Dzongkha)
<strong>ind</strong> (Indonesian), <strong>ell</strong> (Greek, Modern (1453-))
<strong>ita</strong> (Italian), <strong>eng</strong> (English)
<strong>jpn</strong> (Japanese), <strong>enm</strong> (English, Middle (1100-1500))
<strong>kor</strong> (Korean), <strong>epo</strong> (Esperanto)
<strong>lav</strong> (Latvian), <strong>equ</strong> (Math / equation detection module)
<strong>lit</strong> (Lithuanian), <strong>est</strong> (Estonian)
<strong>nld</strong> (Dutch), <strong>eus</strong> (Basque)
<strong>nor</strong> (Norwegian), <strong>fas</strong> (Persian)
<strong>pol</strong> (Polish), <strong>fin</strong> (Finnish)
<strong>por</strong> (Portuguese), <strong>fra</strong> (French)
<strong>ron</strong> (Romanian), <strong>frk</strong> (Frankish)
<strong>rus</strong> (Russian), <strong>frm</strong> (French, Middle (ca.1400-1600))
<strong>slk</strong> (Slovakian), <strong>gle</strong> (Irish)
<strong>slv</strong> (Slovenian), <strong>glg</strong> (Galician)
<strong>sqi</strong> (Albanian), <strong>grc</strong> (Greek, Ancient (to 1453))
<strong>spa</strong> (Spanish), <strong>guj</strong> (Gujarati)
<strong>srp</strong> (Serbian), <strong>hat</strong> (Haitian; Haitian Creole)
<strong>swe</strong> (Swedish), <strong>heb</strong> (Hebrew)
<strong>tam</strong> (Tamil), <strong>hin</strong> (Hindi)
<strong>tel</strong> (Telugu), <strong>hrv</strong> (Croatian)
<strong>tgl</strong> (Tagalog), <strong>hun</strong> (Hungarian)
<strong>tha</strong> (Thai), <strong>iku</strong> (Inuktitut)
<strong>tur</strong> (Turkish), <strong>ind</strong> (Indonesian)
<strong>ukr</strong> (Ukrainian), <strong>isl</strong> (Icelandic)
<strong>vie</strong> (Vietnamese)</p></div> <strong>ita</strong> (Italian)
<strong>ita_old</strong> (Italian - Old)
<strong>jav</strong> (Javanese)
<strong>jpn</strong> (Japanese)
<strong>kan</strong> (Kannada)
<strong>kat</strong> (Georgian)
<strong>kat_old</strong> (Georgian - Old)
<strong>kaz</strong> (Kazakh)
<strong>khm</strong> (Central Khmer)
<strong>kir</strong> (Kirghiz; Kyrgyz)
<strong>kor</strong> (Korean)
<strong>kur</strong> (Kurdish)
<strong>lao</strong> (Lao)
<strong>lat</strong> (Latin)
<strong>lav</strong> (Latvian)
<strong>lit</strong> (Lithuanian)
<strong>mal</strong> (Malayalam)
<strong>mar</strong> (Marathi)
<strong>mkd</strong> (Macedonian)
<strong>mlt</strong> (Maltese)
<strong>msa</strong> (Malay)
<strong>mya</strong> (Burmese)
<strong>nep</strong> (Nepali)
<strong>nld</strong> (Dutch; Flemish)
<strong>nor</strong> (Norwegian)
<strong>ori</strong> (Oriya)
<strong>osd</strong> (Orientation and script detection module)
<strong>pan</strong> (Panjabi; Punjabi)
<strong>pol</strong> (Polish)
<strong>por</strong> (Portuguese)
<strong>pus</strong> (Pushto; Pashto)
<strong>ron</strong> (Romanian; Moldavian; Moldovan)
<strong>rus</strong> (Russian)
<strong>san</strong> (Sanskrit)
<strong>sin</strong> (Sinhala; Sinhalese)
<strong>slk</strong> (Slovak)
<strong>slk_frak</strong> (Slovak - Fraktur)
<strong>slv</strong> (Slovenian)
<strong>spa</strong> (Spanish; Castilian)
<strong>spa_old</strong> (Spanish; Castilian - Old)
<strong>sqi</strong> (Albanian)
<strong>srp</strong> (Serbian)
<strong>srp_latn</strong> (Serbian - Latin)
<strong>swa</strong> (Swahili)
<strong>swe</strong> (Swedish)
<strong>syr</strong> (Syriac)
<strong>tam</strong> (Tamil)
<strong>tel</strong> (Telugu)
<strong>tgk</strong> (Tajik)
<strong>tgl</strong> (Tagalog)
<strong>tha</strong> (Thai)
<strong>tir</strong> (Tigrinya)
<strong>tur</strong> (Turkish)
<strong>uig</strong> (Uighur; Uyghur)
<strong>ukr</strong> (Ukrainian)
<strong>urd</strong> (Urdu)
<strong>uzb</strong> (Uzbek)
<strong>uzb_cyrl</strong> (Uzbek - Cyrilic)
<strong>vie</strong> (Vietnamese)
<strong>yid</strong> (Yiddish)</p></div>
<div class="paragraph"><p>To use a non-standard language pack named <strong>foo.traineddata</strong>, set the <div class="paragraph"><p>To use a non-standard language pack named <strong>foo.traineddata</strong>, set the
<strong>TESSDATA_PREFIX</strong> environment variable so the file can be found at <strong>TESSDATA_PREFIX</strong> environment variable so the file can be found at
<strong>TESSDATA_PREFIX</strong>/tessdata/<strong>foo</strong>.traineddata and give Tesseract the <strong>TESSDATA_PREFIX</strong>/tessdata/<strong>foo</strong>.traineddata and give Tesseract the
@ -1047,7 +1106,7 @@ debug.</p></div>
<div class="paragraph"><p>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability <div class="paragraph"><p>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</p></div> to train Tesseract.</p></div>
<div class="paragraph"><p>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy. <div class="paragraph"><p>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <a href="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</a>. With Tesseract 2.00, See <a href="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</a>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests. scripts are now included to allow anyone to reproduce some of these tests.
See <a href="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</a> for more See <a href="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</a> for more
details.</p></div> details.</p></div>
@ -1097,7 +1156,7 @@ Lloyd, Shobhit Saxena, and Thomas Kielbus.</p></div>
<div id="footnotes"><hr /></div> <div id="footnotes"><hr /></div>
<div id="footer"> <div id="footer">
<div id="footer-text"> <div id="footer-text">
Last updated 2015-06-12 23:49:44 CEST Last updated 2015-06-28 22:23:47 CEST
</div> </div>
</div> </div>
</body> </body>

@ -216,56 +216,115 @@ before any <emphasis>configfile</emphasis>.</simpara>
</refsect1> </refsect1>
<refsect1 id="_languages"> <refsect1 id="_languages">
<title>LANGUAGES</title> <title>LANGUAGES</title>
<simpara>There are currently language packs available for the following languages:</simpara> <simpara>There are currently language packs available for the following languages
<simpara><emphasis role="strong">ara</emphasis> (Arabic), (in <ulink url="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</ulink>):</simpara>
<emphasis role="strong">aze</emphasis> (Azerbauijani), <simpara><emphasis role="strong">afr</emphasis> (Afrikaans)
<emphasis role="strong">bul</emphasis> (Bulgarian), <emphasis role="strong">amh</emphasis> (Amharic)
<emphasis role="strong">cat</emphasis> (Catalan), <emphasis role="strong">ara</emphasis> (Arabic)
<emphasis role="strong">ces</emphasis> (Czech), <emphasis role="strong">asm</emphasis> (Assamese)
<emphasis role="strong">chi_sim</emphasis> (Simplified Chinese), <emphasis role="strong">aze</emphasis> (Azerbaijani)
<emphasis role="strong">chi_tra</emphasis> (Traditional Chinese), <emphasis role="strong">aze_cyrl</emphasis> (Azerbaijani - Cyrilic)
<emphasis role="strong">chr</emphasis> (Cherokee), <emphasis role="strong">bel</emphasis> (Belarusian)
<emphasis role="strong">dan</emphasis> (Danish), <emphasis role="strong">ben</emphasis> (Bengali)
<emphasis role="strong">dan-frak</emphasis> (Danish (Fraktur)), <emphasis role="strong">bod</emphasis> (Tibetan)
<emphasis role="strong">deu</emphasis> (German), <emphasis role="strong">bos</emphasis> (Bosnian)
<emphasis role="strong">ell</emphasis> (Greek), <emphasis role="strong">bul</emphasis> (Bulgarian)
<emphasis role="strong">eng</emphasis> (English), <emphasis role="strong">cat</emphasis> (Catalan; Valencian)
<emphasis role="strong">enm</emphasis> (Old English), <emphasis role="strong">ceb</emphasis> (Cebuano)
<emphasis role="strong">epo</emphasis> (Esperanto), <emphasis role="strong">ces</emphasis> (Czech)
<emphasis role="strong">est</emphasis> (Estonian), <emphasis role="strong">chi_sim</emphasis> (Chinese - Simplified)
<emphasis role="strong">fin</emphasis> (Finnish), <emphasis role="strong">chi_tra</emphasis> (Chinese - Traditional)
<emphasis role="strong">fra</emphasis> (French), <emphasis role="strong">chr</emphasis> (Cherokee)
<emphasis role="strong">frm</emphasis> (Old French), <emphasis role="strong">cym</emphasis> (Welsh)
<emphasis role="strong">glg</emphasis> (Galician), <emphasis role="strong">dan</emphasis> (Danish)
<emphasis role="strong">heb</emphasis> (Hebrew), <emphasis role="strong">dan_frak</emphasis> (Danish - Fraktur)
<emphasis role="strong">hin</emphasis> (Hindi), <emphasis role="strong">deu</emphasis> (German)
<emphasis role="strong">hrv</emphasis> (Croation), <emphasis role="strong">deu_frak</emphasis> (German - Fraktur)
<emphasis role="strong">hun</emphasis> (Hungarian), <emphasis role="strong">dzo</emphasis> (Dzongkha)
<emphasis role="strong">ind</emphasis> (Indonesian), <emphasis role="strong">ell</emphasis> (Greek, Modern (1453-))
<emphasis role="strong">ita</emphasis> (Italian), <emphasis role="strong">eng</emphasis> (English)
<emphasis role="strong">jpn</emphasis> (Japanese), <emphasis role="strong">enm</emphasis> (English, Middle (1100-1500))
<emphasis role="strong">kor</emphasis> (Korean), <emphasis role="strong">epo</emphasis> (Esperanto)
<emphasis role="strong">lav</emphasis> (Latvian), <emphasis role="strong">equ</emphasis> (Math / equation detection module)
<emphasis role="strong">lit</emphasis> (Lithuanian), <emphasis role="strong">est</emphasis> (Estonian)
<emphasis role="strong">nld</emphasis> (Dutch), <emphasis role="strong">eus</emphasis> (Basque)
<emphasis role="strong">nor</emphasis> (Norwegian), <emphasis role="strong">fas</emphasis> (Persian)
<emphasis role="strong">pol</emphasis> (Polish), <emphasis role="strong">fin</emphasis> (Finnish)
<emphasis role="strong">por</emphasis> (Portuguese), <emphasis role="strong">fra</emphasis> (French)
<emphasis role="strong">ron</emphasis> (Romanian), <emphasis role="strong">frk</emphasis> (Frankish)
<emphasis role="strong">rus</emphasis> (Russian), <emphasis role="strong">frm</emphasis> (French, Middle (ca.1400-1600))
<emphasis role="strong">slk</emphasis> (Slovakian), <emphasis role="strong">gle</emphasis> (Irish)
<emphasis role="strong">slv</emphasis> (Slovenian), <emphasis role="strong">glg</emphasis> (Galician)
<emphasis role="strong">sqi</emphasis> (Albanian), <emphasis role="strong">grc</emphasis> (Greek, Ancient (to 1453))
<emphasis role="strong">spa</emphasis> (Spanish), <emphasis role="strong">guj</emphasis> (Gujarati)
<emphasis role="strong">srp</emphasis> (Serbian), <emphasis role="strong">hat</emphasis> (Haitian; Haitian Creole)
<emphasis role="strong">swe</emphasis> (Swedish), <emphasis role="strong">heb</emphasis> (Hebrew)
<emphasis role="strong">tam</emphasis> (Tamil), <emphasis role="strong">hin</emphasis> (Hindi)
<emphasis role="strong">tel</emphasis> (Telugu), <emphasis role="strong">hrv</emphasis> (Croatian)
<emphasis role="strong">tgl</emphasis> (Tagalog), <emphasis role="strong">hun</emphasis> (Hungarian)
<emphasis role="strong">tha</emphasis> (Thai), <emphasis role="strong">iku</emphasis> (Inuktitut)
<emphasis role="strong">tur</emphasis> (Turkish), <emphasis role="strong">ind</emphasis> (Indonesian)
<emphasis role="strong">ukr</emphasis> (Ukrainian), <emphasis role="strong">isl</emphasis> (Icelandic)
<emphasis role="strong">vie</emphasis> (Vietnamese)</simpara> <emphasis role="strong">ita</emphasis> (Italian)
<emphasis role="strong">ita_old</emphasis> (Italian - Old)
<emphasis role="strong">jav</emphasis> (Javanese)
<emphasis role="strong">jpn</emphasis> (Japanese)
<emphasis role="strong">kan</emphasis> (Kannada)
<emphasis role="strong">kat</emphasis> (Georgian)
<emphasis role="strong">kat_old</emphasis> (Georgian - Old)
<emphasis role="strong">kaz</emphasis> (Kazakh)
<emphasis role="strong">khm</emphasis> (Central Khmer)
<emphasis role="strong">kir</emphasis> (Kirghiz; Kyrgyz)
<emphasis role="strong">kor</emphasis> (Korean)
<emphasis role="strong">kur</emphasis> (Kurdish)
<emphasis role="strong">lao</emphasis> (Lao)
<emphasis role="strong">lat</emphasis> (Latin)
<emphasis role="strong">lav</emphasis> (Latvian)
<emphasis role="strong">lit</emphasis> (Lithuanian)
<emphasis role="strong">mal</emphasis> (Malayalam)
<emphasis role="strong">mar</emphasis> (Marathi)
<emphasis role="strong">mkd</emphasis> (Macedonian)
<emphasis role="strong">mlt</emphasis> (Maltese)
<emphasis role="strong">msa</emphasis> (Malay)
<emphasis role="strong">mya</emphasis> (Burmese)
<emphasis role="strong">nep</emphasis> (Nepali)
<emphasis role="strong">nld</emphasis> (Dutch; Flemish)
<emphasis role="strong">nor</emphasis> (Norwegian)
<emphasis role="strong">ori</emphasis> (Oriya)
<emphasis role="strong">osd</emphasis> (Orientation and script detection module)
<emphasis role="strong">pan</emphasis> (Panjabi; Punjabi)
<emphasis role="strong">pol</emphasis> (Polish)
<emphasis role="strong">por</emphasis> (Portuguese)
<emphasis role="strong">pus</emphasis> (Pushto; Pashto)
<emphasis role="strong">ron</emphasis> (Romanian; Moldavian; Moldovan)
<emphasis role="strong">rus</emphasis> (Russian)
<emphasis role="strong">san</emphasis> (Sanskrit)
<emphasis role="strong">sin</emphasis> (Sinhala; Sinhalese)
<emphasis role="strong">slk</emphasis> (Slovak)
<emphasis role="strong">slk_frak</emphasis> (Slovak - Fraktur)
<emphasis role="strong">slv</emphasis> (Slovenian)
<emphasis role="strong">spa</emphasis> (Spanish; Castilian)
<emphasis role="strong">spa_old</emphasis> (Spanish; Castilian - Old)
<emphasis role="strong">sqi</emphasis> (Albanian)
<emphasis role="strong">srp</emphasis> (Serbian)
<emphasis role="strong">srp_latn</emphasis> (Serbian - Latin)
<emphasis role="strong">swa</emphasis> (Swahili)
<emphasis role="strong">swe</emphasis> (Swedish)
<emphasis role="strong">syr</emphasis> (Syriac)
<emphasis role="strong">tam</emphasis> (Tamil)
<emphasis role="strong">tel</emphasis> (Telugu)
<emphasis role="strong">tgk</emphasis> (Tajik)
<emphasis role="strong">tgl</emphasis> (Tagalog)
<emphasis role="strong">tha</emphasis> (Thai)
<emphasis role="strong">tir</emphasis> (Tigrinya)
<emphasis role="strong">tur</emphasis> (Turkish)
<emphasis role="strong">uig</emphasis> (Uighur; Uyghur)
<emphasis role="strong">ukr</emphasis> (Ukrainian)
<emphasis role="strong">urd</emphasis> (Urdu)
<emphasis role="strong">uzb</emphasis> (Uzbek)
<emphasis role="strong">uzb_cyrl</emphasis> (Uzbek - Cyrilic)
<emphasis role="strong">vie</emphasis> (Vietnamese)
<emphasis role="strong">yid</emphasis> (Yiddish)</simpara>
<simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the <simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the
<emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at <emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at
<emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the <emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the
@ -325,7 +384,7 @@ debug.</simpara>
<simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability <simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</simpara> to train Tesseract.</simpara>
<simpara>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy. <simpara>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <ulink url="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</ulink>. With Tesseract 2.00, See <ulink url="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</ulink>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests. scripts are now included to allow anyone to reproduce some of these tests.
See <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</ulink> for more See <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</ulink> for more
details.</simpara> details.</simpara>