add new lang info

Zdenko Podobný 2015-06-28 22:26:39 +02:00
parent e8b6d6f71b
commit dcc457cc05
4 changed files with 334 additions and 157 deletions


@ -2,12 +2,12 @@
.\" Title: tesseract
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Date: 06/28/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "TESSERACT" "1" "06/12/2015" "\ \&" "\ \&"
.TH "TESSERACT" "1" "06/28/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@ -158,9 +158,9 @@ print tesseract parameters to the stdout\&.
.RE
.SH "LANGUAGES"
.sp
There are currently language packs available for the following languages:
There are currently language packs available for the following languages (in \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tessdata\fR\m[]):
.sp
\fBara\fR (Arabic), \fBaze\fR (Azerbauijani), \fBbul\fR (Bulgarian), \fBcat\fR (Catalan), \fBces\fR (Czech), \fBchi_sim\fR (Simplified Chinese), \fBchi_tra\fR (Traditional Chinese), \fBchr\fR (Cherokee), \fBdan\fR (Danish), \fBdan\-frak\fR (Danish (Fraktur)), \fBdeu\fR (German), \fBell\fR (Greek), \fBeng\fR (English), \fBenm\fR (Old English), \fBepo\fR (Esperanto), \fBest\fR (Estonian), \fBfin\fR (Finnish), \fBfra\fR (French), \fBfrm\fR (Old French), \fBglg\fR (Galician), \fBheb\fR (Hebrew), \fBhin\fR (Hindi), \fBhrv\fR (Croation), \fBhun\fR (Hungarian), \fBind\fR (Indonesian), \fBita\fR (Italian), \fBjpn\fR (Japanese), \fBkor\fR (Korean), \fBlav\fR (Latvian), \fBlit\fR (Lithuanian), \fBnld\fR (Dutch), \fBnor\fR (Norwegian), \fBpol\fR (Polish), \fBpor\fR (Portuguese), \fBron\fR (Romanian), \fBrus\fR (Russian), \fBslk\fR (Slovakian), \fBslv\fR (Slovenian), \fBsqi\fR (Albanian), \fBspa\fR (Spanish), \fBsrp\fR (Serbian), \fBswe\fR (Swedish), \fBtam\fR (Tamil), \fBtel\fR (Telugu), \fBtgl\fR (Tagalog), \fBtha\fR (Thai), \fBtur\fR (Turkish), \fBukr\fR (Ukrainian), \fBvie\fR (Vietnamese)
\fBafr\fR (Afrikaans) \fBamh\fR (Amharic) \fBara\fR (Arabic) \fBasm\fR (Assamese) \fBaze\fR (Azerbaijani) \fBaze_cyrl\fR (Azerbaijani \- Cyrillic) \fBbel\fR (Belarusian) \fBben\fR (Bengali) \fBbod\fR (Tibetan) \fBbos\fR (Bosnian) \fBbul\fR (Bulgarian) \fBcat\fR (Catalan; Valencian) \fBceb\fR (Cebuano) \fBces\fR (Czech) \fBchi_sim\fR (Chinese \- Simplified) \fBchi_tra\fR (Chinese \- Traditional) \fBchr\fR (Cherokee) \fBcym\fR (Welsh) \fBdan\fR (Danish) \fBdan_frak\fR (Danish \- Fraktur) \fBdeu\fR (German) \fBdeu_frak\fR (German \- Fraktur) \fBdzo\fR (Dzongkha) \fBell\fR (Greek, Modern (1453\-)) \fBeng\fR (English) \fBenm\fR (English, Middle (1100\-1500)) \fBepo\fR (Esperanto) \fBequ\fR (Math / equation detection module) \fBest\fR (Estonian) \fBeus\fR (Basque) \fBfas\fR (Persian) \fBfin\fR (Finnish) \fBfra\fR (French) \fBfrk\fR (Frankish) \fBfrm\fR (French, Middle (ca\&.1400\-1600)) \fBgle\fR (Irish) \fBglg\fR (Galician) \fBgrc\fR (Greek, Ancient (to 1453)) \fBguj\fR (Gujarati) \fBhat\fR (Haitian; Haitian Creole) \fBheb\fR (Hebrew) \fBhin\fR (Hindi) \fBhrv\fR (Croatian) \fBhun\fR (Hungarian) \fBiku\fR (Inuktitut) \fBind\fR (Indonesian) \fBisl\fR (Icelandic) \fBita\fR (Italian) \fBita_old\fR (Italian \- Old) \fBjav\fR (Javanese) \fBjpn\fR (Japanese) \fBkan\fR (Kannada) \fBkat\fR (Georgian) \fBkat_old\fR (Georgian \- Old) \fBkaz\fR (Kazakh) \fBkhm\fR (Central Khmer) \fBkir\fR (Kirghiz; Kyrgyz) \fBkor\fR (Korean) \fBkur\fR (Kurdish) \fBlao\fR (Lao) \fBlat\fR (Latin) \fBlav\fR (Latvian) \fBlit\fR (Lithuanian) \fBmal\fR (Malayalam) \fBmar\fR (Marathi) \fBmkd\fR (Macedonian) \fBmlt\fR (Maltese) \fBmsa\fR (Malay) \fBmya\fR (Burmese) \fBnep\fR (Nepali) \fBnld\fR (Dutch; Flemish) \fBnor\fR (Norwegian) \fBori\fR (Oriya) \fBosd\fR (Orientation and script detection module) \fBpan\fR (Panjabi; Punjabi) \fBpol\fR (Polish) \fBpor\fR (Portuguese) \fBpus\fR (Pushto; Pashto) \fBron\fR (Romanian; Moldavian; Moldovan) \fBrus\fR (Russian) \fBsan\fR (Sanskrit) \fBsin\fR (Sinhala; Sinhalese) \fBslk\fR (Slovak) \fBslk_frak\fR (Slovak \- Fraktur) \fBslv\fR (Slovenian) \fBspa\fR (Spanish; Castilian) \fBspa_old\fR (Spanish; Castilian \- Old) \fBsqi\fR (Albanian) \fBsrp\fR (Serbian) \fBsrp_latn\fR (Serbian \- Latin) \fBswa\fR (Swahili) \fBswe\fR (Swedish) \fBsyr\fR (Syriac) \fBtam\fR (Tamil) \fBtel\fR (Telugu) \fBtgk\fR (Tajik) \fBtgl\fR (Tagalog) \fBtha\fR (Thai) \fBtir\fR (Tigrinya) \fBtur\fR (Turkish) \fBuig\fR (Uighur; Uyghur) \fBukr\fR (Ukrainian) \fBurd\fR (Urdu) \fBuzb\fR (Uzbek) \fBuzb_cyrl\fR (Uzbek \- Cyrillic) \fBvie\fR (Vietnamese) \fByid\fR (Yiddish)
.sp
To use a non\-standard language pack named \fBfoo\&.traineddata\fR, set the \fBTESSDATA_PREFIX\fR environment variable so the file can be found at \fBTESSDATA_PREFIX\fR/tessdata/\fBfoo\fR\&.traineddata and give Tesseract the argument \fI\-l foo\fR\&.
.SH "CONFIG FILES AND AUGMENTING WITH USER DATA"
@ -224,7 +224,7 @@ The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett
.sp
Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&.
.sp
Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/docs/blob/master/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
.sp
Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&.
.sp


@ -98,57 +98,116 @@ SINGLE OPTIONS
LANGUAGES
---------
There are currently language packs available for the following languages:
There are currently language packs available for the following languages
(in https://github.com/tesseract-ocr/tessdata):
*ara* (Arabic),
*aze* (Azerbauijani),
*bul* (Bulgarian),
*cat* (Catalan),
*ces* (Czech),
*chi_sim* (Simplified Chinese),
*chi_tra* (Traditional Chinese),
*chr* (Cherokee),
*dan* (Danish),
*dan-frak* (Danish (Fraktur)),
*deu* (German),
*ell* (Greek),
*eng* (English),
*enm* (Old English),
*epo* (Esperanto),
*est* (Estonian),
*fin* (Finnish),
*fra* (French),
*frm* (Old French),
*glg* (Galician),
*heb* (Hebrew),
*hin* (Hindi),
*hrv* (Croation),
*hun* (Hungarian),
*ind* (Indonesian),
*ita* (Italian),
*jpn* (Japanese),
*kor* (Korean),
*lav* (Latvian),
*lit* (Lithuanian),
*nld* (Dutch),
*nor* (Norwegian),
*pol* (Polish),
*por* (Portuguese),
*ron* (Romanian),
*rus* (Russian),
*slk* (Slovakian),
*slv* (Slovenian),
*sqi* (Albanian),
*spa* (Spanish),
*srp* (Serbian),
*swe* (Swedish),
*tam* (Tamil),
*tel* (Telugu),
*tgl* (Tagalog),
*tha* (Thai),
*tur* (Turkish),
*ukr* (Ukrainian),
*afr* (Afrikaans)
*amh* (Amharic)
*ara* (Arabic)
*asm* (Assamese)
*aze* (Azerbaijani)
*aze_cyrl* (Azerbaijani - Cyrillic)
*bel* (Belarusian)
*ben* (Bengali)
*bod* (Tibetan)
*bos* (Bosnian)
*bul* (Bulgarian)
*cat* (Catalan; Valencian)
*ceb* (Cebuano)
*ces* (Czech)
*chi_sim* (Chinese - Simplified)
*chi_tra* (Chinese - Traditional)
*chr* (Cherokee)
*cym* (Welsh)
*dan* (Danish)
*dan_frak* (Danish - Fraktur)
*deu* (German)
*deu_frak* (German - Fraktur)
*dzo* (Dzongkha)
*ell* (Greek, Modern (1453-))
*eng* (English)
*enm* (English, Middle (1100-1500))
*epo* (Esperanto)
*equ* (Math / equation detection module)
*est* (Estonian)
*eus* (Basque)
*fas* (Persian)
*fin* (Finnish)
*fra* (French)
*frk* (Frankish)
*frm* (French, Middle (ca.1400-1600))
*gle* (Irish)
*glg* (Galician)
*grc* (Greek, Ancient (to 1453))
*guj* (Gujarati)
*hat* (Haitian; Haitian Creole)
*heb* (Hebrew)
*hin* (Hindi)
*hrv* (Croatian)
*hun* (Hungarian)
*iku* (Inuktitut)
*ind* (Indonesian)
*isl* (Icelandic)
*ita* (Italian)
*ita_old* (Italian - Old)
*jav* (Javanese)
*jpn* (Japanese)
*kan* (Kannada)
*kat* (Georgian)
*kat_old* (Georgian - Old)
*kaz* (Kazakh)
*khm* (Central Khmer)
*kir* (Kirghiz; Kyrgyz)
*kor* (Korean)
*kur* (Kurdish)
*lao* (Lao)
*lat* (Latin)
*lav* (Latvian)
*lit* (Lithuanian)
*mal* (Malayalam)
*mar* (Marathi)
*mkd* (Macedonian)
*mlt* (Maltese)
*msa* (Malay)
*mya* (Burmese)
*nep* (Nepali)
*nld* (Dutch; Flemish)
*nor* (Norwegian)
*ori* (Oriya)
*osd* (Orientation and script detection module)
*pan* (Panjabi; Punjabi)
*pol* (Polish)
*por* (Portuguese)
*pus* (Pushto; Pashto)
*ron* (Romanian; Moldavian; Moldovan)
*rus* (Russian)
*san* (Sanskrit)
*sin* (Sinhala; Sinhalese)
*slk* (Slovak)
*slk_frak* (Slovak - Fraktur)
*slv* (Slovenian)
*spa* (Spanish; Castilian)
*spa_old* (Spanish; Castilian - Old)
*sqi* (Albanian)
*srp* (Serbian)
*srp_latn* (Serbian - Latin)
*swa* (Swahili)
*swe* (Swedish)
*syr* (Syriac)
*tam* (Tamil)
*tel* (Telugu)
*tgk* (Tajik)
*tgl* (Tagalog)
*tha* (Thai)
*tir* (Tigrinya)
*tur* (Turkish)
*uig* (Uighur; Uyghur)
*ukr* (Ukrainian)
*urd* (Urdu)
*uzb* (Uzbek)
*uzb_cyrl* (Uzbek - Cyrillic)
*vie* (Vietnamese)
*yid* (Yiddish)
To use a non-standard language pack named *foo.traineddata*, set the
*TESSDATA_PREFIX* environment variable so the file can be found at


@ -931,56 +931,115 @@ before any <em>configfile</em>.</p></div>
<div class="sect1">
<h2 id="_languages">LANGUAGES</h2>
<div class="sectionbody">
<div class="paragraph"><p>There are currently language packs available for the following languages:</p></div>
<div class="paragraph"><p><strong>ara</strong> (Arabic),
<strong>aze</strong> (Azerbauijani),
<strong>bul</strong> (Bulgarian),
<strong>cat</strong> (Catalan),
<strong>ces</strong> (Czech),
<strong>chi_sim</strong> (Simplified Chinese),
<strong>chi_tra</strong> (Traditional Chinese),
<strong>chr</strong> (Cherokee),
<strong>dan</strong> (Danish),
<strong>dan-frak</strong> (Danish (Fraktur)),
<strong>deu</strong> (German),
<strong>ell</strong> (Greek),
<strong>eng</strong> (English),
<strong>enm</strong> (Old English),
<strong>epo</strong> (Esperanto),
<strong>est</strong> (Estonian),
<strong>fin</strong> (Finnish),
<strong>fra</strong> (French),
<strong>frm</strong> (Old French),
<strong>glg</strong> (Galician),
<strong>heb</strong> (Hebrew),
<strong>hin</strong> (Hindi),
<strong>hrv</strong> (Croation),
<strong>hun</strong> (Hungarian),
<strong>ind</strong> (Indonesian),
<strong>ita</strong> (Italian),
<strong>jpn</strong> (Japanese),
<strong>kor</strong> (Korean),
<strong>lav</strong> (Latvian),
<strong>lit</strong> (Lithuanian),
<strong>nld</strong> (Dutch),
<strong>nor</strong> (Norwegian),
<strong>pol</strong> (Polish),
<strong>por</strong> (Portuguese),
<strong>ron</strong> (Romanian),
<strong>rus</strong> (Russian),
<strong>slk</strong> (Slovakian),
<strong>slv</strong> (Slovenian),
<strong>sqi</strong> (Albanian),
<strong>spa</strong> (Spanish),
<strong>srp</strong> (Serbian),
<strong>swe</strong> (Swedish),
<strong>tam</strong> (Tamil),
<strong>tel</strong> (Telugu),
<strong>tgl</strong> (Tagalog),
<strong>tha</strong> (Thai),
<strong>tur</strong> (Turkish),
<strong>ukr</strong> (Ukrainian),
<strong>vie</strong> (Vietnamese)</p></div>
<div class="paragraph"><p>There are currently language packs available for the following languages
(in <a href="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</a>):</p></div>
<div class="paragraph"><p><strong>afr</strong> (Afrikaans)
<strong>amh</strong> (Amharic)
<strong>ara</strong> (Arabic)
<strong>asm</strong> (Assamese)
<strong>aze</strong> (Azerbaijani)
<strong>aze_cyrl</strong> (Azerbaijani - Cyrillic)
<strong>bel</strong> (Belarusian)
<strong>ben</strong> (Bengali)
<strong>bod</strong> (Tibetan)
<strong>bos</strong> (Bosnian)
<strong>bul</strong> (Bulgarian)
<strong>cat</strong> (Catalan; Valencian)
<strong>ceb</strong> (Cebuano)
<strong>ces</strong> (Czech)
<strong>chi_sim</strong> (Chinese - Simplified)
<strong>chi_tra</strong> (Chinese - Traditional)
<strong>chr</strong> (Cherokee)
<strong>cym</strong> (Welsh)
<strong>dan</strong> (Danish)
<strong>dan_frak</strong> (Danish - Fraktur)
<strong>deu</strong> (German)
<strong>deu_frak</strong> (German - Fraktur)
<strong>dzo</strong> (Dzongkha)
<strong>ell</strong> (Greek, Modern (1453-))
<strong>eng</strong> (English)
<strong>enm</strong> (English, Middle (1100-1500))
<strong>epo</strong> (Esperanto)
<strong>equ</strong> (Math / equation detection module)
<strong>est</strong> (Estonian)
<strong>eus</strong> (Basque)
<strong>fas</strong> (Persian)
<strong>fin</strong> (Finnish)
<strong>fra</strong> (French)
<strong>frk</strong> (Frankish)
<strong>frm</strong> (French, Middle (ca.1400-1600))
<strong>gle</strong> (Irish)
<strong>glg</strong> (Galician)
<strong>grc</strong> (Greek, Ancient (to 1453))
<strong>guj</strong> (Gujarati)
<strong>hat</strong> (Haitian; Haitian Creole)
<strong>heb</strong> (Hebrew)
<strong>hin</strong> (Hindi)
<strong>hrv</strong> (Croatian)
<strong>hun</strong> (Hungarian)
<strong>iku</strong> (Inuktitut)
<strong>ind</strong> (Indonesian)
<strong>isl</strong> (Icelandic)
<strong>ita</strong> (Italian)
<strong>ita_old</strong> (Italian - Old)
<strong>jav</strong> (Javanese)
<strong>jpn</strong> (Japanese)
<strong>kan</strong> (Kannada)
<strong>kat</strong> (Georgian)
<strong>kat_old</strong> (Georgian - Old)
<strong>kaz</strong> (Kazakh)
<strong>khm</strong> (Central Khmer)
<strong>kir</strong> (Kirghiz; Kyrgyz)
<strong>kor</strong> (Korean)
<strong>kur</strong> (Kurdish)
<strong>lao</strong> (Lao)
<strong>lat</strong> (Latin)
<strong>lav</strong> (Latvian)
<strong>lit</strong> (Lithuanian)
<strong>mal</strong> (Malayalam)
<strong>mar</strong> (Marathi)
<strong>mkd</strong> (Macedonian)
<strong>mlt</strong> (Maltese)
<strong>msa</strong> (Malay)
<strong>mya</strong> (Burmese)
<strong>nep</strong> (Nepali)
<strong>nld</strong> (Dutch; Flemish)
<strong>nor</strong> (Norwegian)
<strong>ori</strong> (Oriya)
<strong>osd</strong> (Orientation and script detection module)
<strong>pan</strong> (Panjabi; Punjabi)
<strong>pol</strong> (Polish)
<strong>por</strong> (Portuguese)
<strong>pus</strong> (Pushto; Pashto)
<strong>ron</strong> (Romanian; Moldavian; Moldovan)
<strong>rus</strong> (Russian)
<strong>san</strong> (Sanskrit)
<strong>sin</strong> (Sinhala; Sinhalese)
<strong>slk</strong> (Slovak)
<strong>slk_frak</strong> (Slovak - Fraktur)
<strong>slv</strong> (Slovenian)
<strong>spa</strong> (Spanish; Castilian)
<strong>spa_old</strong> (Spanish; Castilian - Old)
<strong>sqi</strong> (Albanian)
<strong>srp</strong> (Serbian)
<strong>srp_latn</strong> (Serbian - Latin)
<strong>swa</strong> (Swahili)
<strong>swe</strong> (Swedish)
<strong>syr</strong> (Syriac)
<strong>tam</strong> (Tamil)
<strong>tel</strong> (Telugu)
<strong>tgk</strong> (Tajik)
<strong>tgl</strong> (Tagalog)
<strong>tha</strong> (Thai)
<strong>tir</strong> (Tigrinya)
<strong>tur</strong> (Turkish)
<strong>uig</strong> (Uighur; Uyghur)
<strong>ukr</strong> (Ukrainian)
<strong>urd</strong> (Urdu)
<strong>uzb</strong> (Uzbek)
<strong>uzb_cyrl</strong> (Uzbek - Cyrillic)
<strong>vie</strong> (Vietnamese)
<strong>yid</strong> (Yiddish)</p></div>
<div class="paragraph"><p>To use a non-standard language pack named <strong>foo.traineddata</strong>, set the
<strong>TESSDATA_PREFIX</strong> environment variable so the file can be found at
<strong>TESSDATA_PREFIX</strong>/tessdata/<strong>foo</strong>.traineddata and give Tesseract the
@ -1047,7 +1106,7 @@ debug.</p></div>
<div class="paragraph"><p>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</p></div>
<div class="paragraph"><p>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <a href="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</a>. With Tesseract 2.00,
See <a href="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</a>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests.
See <a href="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</a> for more
details.</p></div>
@ -1097,7 +1156,7 @@ Lloyd, Shobhit Saxena, and Thomas Kielbus.</p></div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:49:44 CEST
Last updated 2015-06-28 22:23:47 CEST
</div>
</div>
</body>


@ -216,56 +216,115 @@ before any <emphasis>configfile</emphasis>.</simpara>
</refsect1>
<refsect1 id="_languages">
<title>LANGUAGES</title>
<simpara>There are currently language packs available for the following languages:</simpara>
<simpara><emphasis role="strong">ara</emphasis> (Arabic),
<emphasis role="strong">aze</emphasis> (Azerbauijani),
<emphasis role="strong">bul</emphasis> (Bulgarian),
<emphasis role="strong">cat</emphasis> (Catalan),
<emphasis role="strong">ces</emphasis> (Czech),
<emphasis role="strong">chi_sim</emphasis> (Simplified Chinese),
<emphasis role="strong">chi_tra</emphasis> (Traditional Chinese),
<emphasis role="strong">chr</emphasis> (Cherokee),
<emphasis role="strong">dan</emphasis> (Danish),
<emphasis role="strong">dan-frak</emphasis> (Danish (Fraktur)),
<emphasis role="strong">deu</emphasis> (German),
<emphasis role="strong">ell</emphasis> (Greek),
<emphasis role="strong">eng</emphasis> (English),
<emphasis role="strong">enm</emphasis> (Old English),
<emphasis role="strong">epo</emphasis> (Esperanto),
<emphasis role="strong">est</emphasis> (Estonian),
<emphasis role="strong">fin</emphasis> (Finnish),
<emphasis role="strong">fra</emphasis> (French),
<emphasis role="strong">frm</emphasis> (Old French),
<emphasis role="strong">glg</emphasis> (Galician),
<emphasis role="strong">heb</emphasis> (Hebrew),
<emphasis role="strong">hin</emphasis> (Hindi),
<emphasis role="strong">hrv</emphasis> (Croation),
<emphasis role="strong">hun</emphasis> (Hungarian),
<emphasis role="strong">ind</emphasis> (Indonesian),
<emphasis role="strong">ita</emphasis> (Italian),
<emphasis role="strong">jpn</emphasis> (Japanese),
<emphasis role="strong">kor</emphasis> (Korean),
<emphasis role="strong">lav</emphasis> (Latvian),
<emphasis role="strong">lit</emphasis> (Lithuanian),
<emphasis role="strong">nld</emphasis> (Dutch),
<emphasis role="strong">nor</emphasis> (Norwegian),
<emphasis role="strong">pol</emphasis> (Polish),
<emphasis role="strong">por</emphasis> (Portuguese),
<emphasis role="strong">ron</emphasis> (Romanian),
<emphasis role="strong">rus</emphasis> (Russian),
<emphasis role="strong">slk</emphasis> (Slovakian),
<emphasis role="strong">slv</emphasis> (Slovenian),
<emphasis role="strong">sqi</emphasis> (Albanian),
<emphasis role="strong">spa</emphasis> (Spanish),
<emphasis role="strong">srp</emphasis> (Serbian),
<emphasis role="strong">swe</emphasis> (Swedish),
<emphasis role="strong">tam</emphasis> (Tamil),
<emphasis role="strong">tel</emphasis> (Telugu),
<emphasis role="strong">tgl</emphasis> (Tagalog),
<emphasis role="strong">tha</emphasis> (Thai),
<emphasis role="strong">tur</emphasis> (Turkish),
<emphasis role="strong">ukr</emphasis> (Ukrainian),
<emphasis role="strong">vie</emphasis> (Vietnamese)</simpara>
<simpara>There are currently language packs available for the following languages
(in <ulink url="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</ulink>):</simpara>
<simpara><emphasis role="strong">afr</emphasis> (Afrikaans)
<emphasis role="strong">amh</emphasis> (Amharic)
<emphasis role="strong">ara</emphasis> (Arabic)
<emphasis role="strong">asm</emphasis> (Assamese)
<emphasis role="strong">aze</emphasis> (Azerbaijani)
<emphasis role="strong">aze_cyrl</emphasis> (Azerbaijani - Cyrillic)
<emphasis role="strong">bel</emphasis> (Belarusian)
<emphasis role="strong">ben</emphasis> (Bengali)
<emphasis role="strong">bod</emphasis> (Tibetan)
<emphasis role="strong">bos</emphasis> (Bosnian)
<emphasis role="strong">bul</emphasis> (Bulgarian)
<emphasis role="strong">cat</emphasis> (Catalan; Valencian)
<emphasis role="strong">ceb</emphasis> (Cebuano)
<emphasis role="strong">ces</emphasis> (Czech)
<emphasis role="strong">chi_sim</emphasis> (Chinese - Simplified)
<emphasis role="strong">chi_tra</emphasis> (Chinese - Traditional)
<emphasis role="strong">chr</emphasis> (Cherokee)
<emphasis role="strong">cym</emphasis> (Welsh)
<emphasis role="strong">dan</emphasis> (Danish)
<emphasis role="strong">dan_frak</emphasis> (Danish - Fraktur)
<emphasis role="strong">deu</emphasis> (German)
<emphasis role="strong">deu_frak</emphasis> (German - Fraktur)
<emphasis role="strong">dzo</emphasis> (Dzongkha)
<emphasis role="strong">ell</emphasis> (Greek, Modern (1453-))
<emphasis role="strong">eng</emphasis> (English)
<emphasis role="strong">enm</emphasis> (English, Middle (1100-1500))
<emphasis role="strong">epo</emphasis> (Esperanto)
<emphasis role="strong">equ</emphasis> (Math / equation detection module)
<emphasis role="strong">est</emphasis> (Estonian)
<emphasis role="strong">eus</emphasis> (Basque)
<emphasis role="strong">fas</emphasis> (Persian)
<emphasis role="strong">fin</emphasis> (Finnish)
<emphasis role="strong">fra</emphasis> (French)
<emphasis role="strong">frk</emphasis> (Frankish)
<emphasis role="strong">frm</emphasis> (French, Middle (ca.1400-1600))
<emphasis role="strong">gle</emphasis> (Irish)
<emphasis role="strong">glg</emphasis> (Galician)
<emphasis role="strong">grc</emphasis> (Greek, Ancient (to 1453))
<emphasis role="strong">guj</emphasis> (Gujarati)
<emphasis role="strong">hat</emphasis> (Haitian; Haitian Creole)
<emphasis role="strong">heb</emphasis> (Hebrew)
<emphasis role="strong">hin</emphasis> (Hindi)
<emphasis role="strong">hrv</emphasis> (Croatian)
<emphasis role="strong">hun</emphasis> (Hungarian)
<emphasis role="strong">iku</emphasis> (Inuktitut)
<emphasis role="strong">ind</emphasis> (Indonesian)
<emphasis role="strong">isl</emphasis> (Icelandic)
<emphasis role="strong">ita</emphasis> (Italian)
<emphasis role="strong">ita_old</emphasis> (Italian - Old)
<emphasis role="strong">jav</emphasis> (Javanese)
<emphasis role="strong">jpn</emphasis> (Japanese)
<emphasis role="strong">kan</emphasis> (Kannada)
<emphasis role="strong">kat</emphasis> (Georgian)
<emphasis role="strong">kat_old</emphasis> (Georgian - Old)
<emphasis role="strong">kaz</emphasis> (Kazakh)
<emphasis role="strong">khm</emphasis> (Central Khmer)
<emphasis role="strong">kir</emphasis> (Kirghiz; Kyrgyz)
<emphasis role="strong">kor</emphasis> (Korean)
<emphasis role="strong">kur</emphasis> (Kurdish)
<emphasis role="strong">lao</emphasis> (Lao)
<emphasis role="strong">lat</emphasis> (Latin)
<emphasis role="strong">lav</emphasis> (Latvian)
<emphasis role="strong">lit</emphasis> (Lithuanian)
<emphasis role="strong">mal</emphasis> (Malayalam)
<emphasis role="strong">mar</emphasis> (Marathi)
<emphasis role="strong">mkd</emphasis> (Macedonian)
<emphasis role="strong">mlt</emphasis> (Maltese)
<emphasis role="strong">msa</emphasis> (Malay)
<emphasis role="strong">mya</emphasis> (Burmese)
<emphasis role="strong">nep</emphasis> (Nepali)
<emphasis role="strong">nld</emphasis> (Dutch; Flemish)
<emphasis role="strong">nor</emphasis> (Norwegian)
<emphasis role="strong">ori</emphasis> (Oriya)
<emphasis role="strong">osd</emphasis> (Orientation and script detection module)
<emphasis role="strong">pan</emphasis> (Panjabi; Punjabi)
<emphasis role="strong">pol</emphasis> (Polish)
<emphasis role="strong">por</emphasis> (Portuguese)
<emphasis role="strong">pus</emphasis> (Pushto; Pashto)
<emphasis role="strong">ron</emphasis> (Romanian; Moldavian; Moldovan)
<emphasis role="strong">rus</emphasis> (Russian)
<emphasis role="strong">san</emphasis> (Sanskrit)
<emphasis role="strong">sin</emphasis> (Sinhala; Sinhalese)
<emphasis role="strong">slk</emphasis> (Slovak)
<emphasis role="strong">slk_frak</emphasis> (Slovak - Fraktur)
<emphasis role="strong">slv</emphasis> (Slovenian)
<emphasis role="strong">spa</emphasis> (Spanish; Castilian)
<emphasis role="strong">spa_old</emphasis> (Spanish; Castilian - Old)
<emphasis role="strong">sqi</emphasis> (Albanian)
<emphasis role="strong">srp</emphasis> (Serbian)
<emphasis role="strong">srp_latn</emphasis> (Serbian - Latin)
<emphasis role="strong">swa</emphasis> (Swahili)
<emphasis role="strong">swe</emphasis> (Swedish)
<emphasis role="strong">syr</emphasis> (Syriac)
<emphasis role="strong">tam</emphasis> (Tamil)
<emphasis role="strong">tel</emphasis> (Telugu)
<emphasis role="strong">tgk</emphasis> (Tajik)
<emphasis role="strong">tgl</emphasis> (Tagalog)
<emphasis role="strong">tha</emphasis> (Thai)
<emphasis role="strong">tir</emphasis> (Tigrinya)
<emphasis role="strong">tur</emphasis> (Turkish)
<emphasis role="strong">uig</emphasis> (Uighur; Uyghur)
<emphasis role="strong">ukr</emphasis> (Ukrainian)
<emphasis role="strong">urd</emphasis> (Urdu)
<emphasis role="strong">uzb</emphasis> (Uzbek)
<emphasis role="strong">uzb_cyrl</emphasis> (Uzbek - Cyrillic)
<emphasis role="strong">vie</emphasis> (Vietnamese)
<emphasis role="strong">yid</emphasis> (Yiddish)</simpara>
<simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the
<emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at
<emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the
@ -325,7 +384,7 @@ debug.</simpara>
<simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</simpara>
<simpara>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <ulink url="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</ulink>. With Tesseract 2.00,
See <ulink url="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</ulink>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests.
See <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</ulink> for more
details.</simpara>