mirror of https://github.com/tesseract-ocr/tesseract.git
add new lang info
parent e8b6d6f71b
commit dcc457cc05
@@ -2,12 +2,12 @@
 .\" Title: tesseract
 .\" Author: [see the "AUTHOR" section]
 .\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
-.\" Date: 06/12/2015
+.\" Date: 06/28/2015
 .\" Manual: \ \&
 .\" Source: \ \&
 .\" Language: English
 .\"
-.TH "TESSERACT" "1" "06/12/2015" "\ \&" "\ \&"
+.TH "TESSERACT" "1" "06/28/2015" "\ \&" "\ \&"
 .\" -----------------------------------------------------------------
 .\" * Define some portability stuff
 .\" -----------------------------------------------------------------
@@ -158,9 +158,9 @@ print tesseract parameters to the stdout\&.
 .RE
 .SH "LANGUAGES"
 .sp
-There are currently language packs available for the following languages:
+There are currently language packs available for the following languages (in \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tessdata\fR\m[]):
 .sp
-\fBara\fR (Arabic), \fBaze\fR (Azerbauijani), \fBbul\fR (Bulgarian), \fBcat\fR (Catalan), \fBces\fR (Czech), \fBchi_sim\fR (Simplified Chinese), \fBchi_tra\fR (Traditional Chinese), \fBchr\fR (Cherokee), \fBdan\fR (Danish), \fBdan\-frak\fR (Danish (Fraktur)), \fBdeu\fR (German), \fBell\fR (Greek), \fBeng\fR (English), \fBenm\fR (Old English), \fBepo\fR (Esperanto), \fBest\fR (Estonian), \fBfin\fR (Finnish), \fBfra\fR (French), \fBfrm\fR (Old French), \fBglg\fR (Galician), \fBheb\fR (Hebrew), \fBhin\fR (Hindi), \fBhrv\fR (Croation), \fBhun\fR (Hungarian), \fBind\fR (Indonesian), \fBita\fR (Italian), \fBjpn\fR (Japanese), \fBkor\fR (Korean), \fBlav\fR (Latvian), \fBlit\fR (Lithuanian), \fBnld\fR (Dutch), \fBnor\fR (Norwegian), \fBpol\fR (Polish), \fBpor\fR (Portuguese), \fBron\fR (Romanian), \fBrus\fR (Russian), \fBslk\fR (Slovakian), \fBslv\fR (Slovenian), \fBsqi\fR (Albanian), \fBspa\fR (Spanish), \fBsrp\fR (Serbian), \fBswe\fR (Swedish), \fBtam\fR (Tamil), \fBtel\fR (Telugu), \fBtgl\fR (Tagalog), \fBtha\fR (Thai), \fBtur\fR (Turkish), \fBukr\fR (Ukrainian), \fBvie\fR (Vietnamese)
+\fBafr\fR (Afrikaans) \fBamh\fR (Amharic) \fBara\fR (Arabic) \fBasm\fR (Assamese) \fBaze\fR (Azerbaijani) \fBaze_cyrl\fR (Azerbaijani \- Cyrilic) \fBbel\fR (Belarusian) \fBben\fR (Bengali) \fBbod\fR (Tibetan) \fBbos\fR (Bosnian) \fBbul\fR (Bulgarian) \fBcat\fR (Catalan; Valencian) \fBceb\fR (Cebuano) \fBces\fR (Czech) \fBchi_sim\fR (Chinese \- Simplified) \fBchi_tra\fR (Chinese \- Traditional) \fBchr\fR (Cherokee) \fBcym\fR (Welsh) \fBdan\fR (Danish) \fBdan_frak\fR (Danish \- Fraktur) \fBdeu\fR (German) \fBdeu_frak\fR (German \- Fraktur) \fBdzo\fR (Dzongkha) \fBell\fR (Greek, Modern (1453\-)) \fBeng\fR (English) \fBenm\fR (English, Middle (1100\-1500)) \fBepo\fR (Esperanto) \fBequ\fR (Math / equation detection module) \fBest\fR (Estonian) \fBeus\fR (Basque) \fBfas\fR (Persian) \fBfin\fR (Finnish) \fBfra\fR (French) \fBfrk\fR (Frankish) \fBfrm\fR (French, Middle (ca\&.1400\-1600)) \fBgle\fR (Irish) \fBglg\fR (Galician) \fBgrc\fR (Greek, Ancient (to 1453)) \fBguj\fR (Gujarati) \fBhat\fR (Haitian; Haitian Creole) \fBheb\fR (Hebrew) \fBhin\fR (Hindi) \fBhrv\fR (Croatian) \fBhun\fR (Hungarian) \fBiku\fR (Inuktitut) \fBind\fR (Indonesian) \fBisl\fR (Icelandic) \fBita\fR (Italian) \fBita_old\fR (Italian \- Old) \fBjav\fR (Javanese) \fBjpn\fR (Japanese) \fBkan\fR (Kannada) \fBkat\fR (Georgian) \fBkat_old\fR (Georgian \- Old) \fBkaz\fR (Kazakh) \fBkhm\fR (Central Khmer) \fBkir\fR (Kirghiz; Kyrgyz) \fBkor\fR (Korean) \fBkur\fR (Kurdish) \fBlao\fR (Lao) \fBlat\fR (Latin) \fBlav\fR (Latvian) \fBlit\fR (Lithuanian) \fBmal\fR (Malayalam) \fBmar\fR (Marathi) \fBmkd\fR (Macedonian) \fBmlt\fR (Maltese) \fBmsa\fR (Malay) \fBmya\fR (Burmese) \fBnep\fR (Nepali) \fBnld\fR (Dutch; Flemish) \fBnor\fR (Norwegian) \fBori\fR (Oriya) \fBosd\fR (Orientation and script detection module) \fBpan\fR (Panjabi; Punjabi) \fBpol\fR (Polish) \fBpor\fR (Portuguese) \fBpus\fR (Pushto; Pashto) \fBron\fR (Romanian; Moldavian; Moldovan) \fBrus\fR (Russian) \fBsan\fR (Sanskrit) \fBsin\fR (Sinhala; Sinhalese) \fBslk\fR (Slovak) \fBslk_frak\fR (Slovak \- Fraktur) \fBslv\fR (Slovenian) \fBspa\fR (Spanish; Castilian) \fBspa_old\fR (Spanish; Castilian \- Old) \fBsqi\fR (Albanian) \fBsrp\fR (Serbian) \fBsrp_latn\fR (Serbian \- Latin) \fBswa\fR (Swahili) \fBswe\fR (Swedish) \fBsyr\fR (Syriac) \fBtam\fR (Tamil) \fBtel\fR (Telugu) \fBtgk\fR (Tajik) \fBtgl\fR (Tagalog) \fBtha\fR (Thai) \fBtir\fR (Tigrinya) \fBtur\fR (Turkish) \fBuig\fR (Uighur; Uyghur) \fBukr\fR (Ukrainian) \fBurd\fR (Urdu) \fBuzb\fR (Uzbek) \fBuzb_cyrl\fR (Uzbek \- Cyrilic) \fBvie\fR (Vietnamese) \fByid\fR (Yiddish)
 .sp
 To use a non\-standard language pack named \fBfoo\&.traineddata\fR, set the \fBTESSDATA_PREFIX\fR environment variable so the file can be found at \fBTESSDATA_PREFIX\fR/tessdata/\fBfoo\fR\&.traineddata and give Tesseract the argument \fI\-l foo\fR\&.
 .SH "CONFIG FILES AND AUGMENTING WITH USER DATA"
@@ -224,7 +224,7 @@ The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett
 .sp
 Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&.
 .sp
-Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
+Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/docs/blob/master/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
 .sp
 Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&.
 .sp
@@ -98,57 +98,116 @@ SINGLE OPTIONS
 LANGUAGES
 ---------

-There are currently language packs available for the following languages:
+There are currently language packs available for the following languages
+(in https://github.com/tesseract-ocr/tessdata):

-*ara* (Arabic),
-*aze* (Azerbauijani),
-*bul* (Bulgarian),
-*cat* (Catalan),
-*ces* (Czech),
-*chi_sim* (Simplified Chinese),
-*chi_tra* (Traditional Chinese),
-*chr* (Cherokee),
-*dan* (Danish),
-*dan-frak* (Danish (Fraktur)),
-*deu* (German),
-*ell* (Greek),
-*eng* (English),
-*enm* (Old English),
-*epo* (Esperanto),
-*est* (Estonian),
-*fin* (Finnish),
-*fra* (French),
-*frm* (Old French),
-*glg* (Galician),
-*heb* (Hebrew),
-*hin* (Hindi),
-*hrv* (Croation),
-*hun* (Hungarian),
-*ind* (Indonesian),
-*ita* (Italian),
-*jpn* (Japanese),
-*kor* (Korean),
-*lav* (Latvian),
-*lit* (Lithuanian),
-*nld* (Dutch),
-*nor* (Norwegian),
-*pol* (Polish),
-*por* (Portuguese),
-*ron* (Romanian),
-*rus* (Russian),
-*slk* (Slovakian),
-*slv* (Slovenian),
-*sqi* (Albanian),
-*spa* (Spanish),
-*srp* (Serbian),
-*swe* (Swedish),
-*tam* (Tamil),
-*tel* (Telugu),
-*tgl* (Tagalog),
-*tha* (Thai),
-*tur* (Turkish),
-*ukr* (Ukrainian),
+*afr* (Afrikaans)
+*amh* (Amharic)
+*ara* (Arabic)
+*asm* (Assamese)
+*aze* (Azerbaijani)
+*aze_cyrl* (Azerbaijani - Cyrilic)
+*bel* (Belarusian)
+*ben* (Bengali)
+*bod* (Tibetan)
+*bos* (Bosnian)
+*bul* (Bulgarian)
+*cat* (Catalan; Valencian)
+*ceb* (Cebuano)
+*ces* (Czech)
+*chi_sim* (Chinese - Simplified)
+*chi_tra* (Chinese - Traditional)
+*chr* (Cherokee)
+*cym* (Welsh)
+*dan* (Danish)
+*dan_frak* (Danish - Fraktur)
+*deu* (German)
+*deu_frak* (German - Fraktur)
+*dzo* (Dzongkha)
+*ell* (Greek, Modern (1453-))
+*eng* (English)
+*enm* (English, Middle (1100-1500))
+*epo* (Esperanto)
+*equ* (Math / equation detection module)
+*est* (Estonian)
+*eus* (Basque)
+*fas* (Persian)
+*fin* (Finnish)
+*fra* (French)
+*frk* (Frankish)
+*frm* (French, Middle (ca.1400-1600))
+*gle* (Irish)
+*glg* (Galician)
+*grc* (Greek, Ancient (to 1453))
+*guj* (Gujarati)
+*hat* (Haitian; Haitian Creole)
+*heb* (Hebrew)
+*hin* (Hindi)
+*hrv* (Croatian)
+*hun* (Hungarian)
+*iku* (Inuktitut)
+*ind* (Indonesian)
+*isl* (Icelandic)
+*ita* (Italian)
+*ita_old* (Italian - Old)
+*jav* (Javanese)
+*jpn* (Japanese)
+*kan* (Kannada)
+*kat* (Georgian)
+*kat_old* (Georgian - Old)
+*kaz* (Kazakh)
+*khm* (Central Khmer)
+*kir* (Kirghiz; Kyrgyz)
+*kor* (Korean)
+*kur* (Kurdish)
+*lao* (Lao)
+*lat* (Latin)
+*lav* (Latvian)
+*lit* (Lithuanian)
+*mal* (Malayalam)
+*mar* (Marathi)
+*mkd* (Macedonian)
+*mlt* (Maltese)
+*msa* (Malay)
+*mya* (Burmese)
+*nep* (Nepali)
+*nld* (Dutch; Flemish)
+*nor* (Norwegian)
+*ori* (Oriya)
+*osd* (Orientation and script detection module)
+*pan* (Panjabi; Punjabi)
+*pol* (Polish)
+*por* (Portuguese)
+*pus* (Pushto; Pashto)
+*ron* (Romanian; Moldavian; Moldovan)
+*rus* (Russian)
+*san* (Sanskrit)
+*sin* (Sinhala; Sinhalese)
+*slk* (Slovak)
+*slk_frak* (Slovak - Fraktur)
+*slv* (Slovenian)
+*spa* (Spanish; Castilian)
+*spa_old* (Spanish; Castilian - Old)
+*sqi* (Albanian)
+*srp* (Serbian)
+*srp_latn* (Serbian - Latin)
+*swa* (Swahili)
+*swe* (Swedish)
+*syr* (Syriac)
+*tam* (Tamil)
+*tel* (Telugu)
+*tgk* (Tajik)
+*tgl* (Tagalog)
+*tha* (Thai)
+*tir* (Tigrinya)
+*tur* (Turkish)
+*uig* (Uighur; Uyghur)
+*ukr* (Ukrainian)
+*urd* (Urdu)
+*uzb* (Uzbek)
+*uzb_cyrl* (Uzbek - Cyrilic)
 *vie* (Vietnamese)
+*yid* (Yiddish)

 To use a non-standard language pack named *foo.traineddata*, set the
 *TESSDATA_PREFIX* environment variable so the file can be found at
@@ -931,56 +931,115 @@ before any <em>configfile</em>.</p></div>
 <div class="sect1">
 <h2 id="_languages">LANGUAGES</h2>
 <div class="sectionbody">
-<div class="paragraph"><p>There are currently language packs available for the following languages:</p></div>
-<div class="paragraph"><p><strong>ara</strong> (Arabic),
-<strong>aze</strong> (Azerbauijani),
-<strong>bul</strong> (Bulgarian),
-<strong>cat</strong> (Catalan),
-<strong>ces</strong> (Czech),
-<strong>chi_sim</strong> (Simplified Chinese),
-<strong>chi_tra</strong> (Traditional Chinese),
-<strong>chr</strong> (Cherokee),
-<strong>dan</strong> (Danish),
-<strong>dan-frak</strong> (Danish (Fraktur)),
-<strong>deu</strong> (German),
-<strong>ell</strong> (Greek),
-<strong>eng</strong> (English),
-<strong>enm</strong> (Old English),
-<strong>epo</strong> (Esperanto),
-<strong>est</strong> (Estonian),
-<strong>fin</strong> (Finnish),
-<strong>fra</strong> (French),
-<strong>frm</strong> (Old French),
-<strong>glg</strong> (Galician),
-<strong>heb</strong> (Hebrew),
-<strong>hin</strong> (Hindi),
-<strong>hrv</strong> (Croation),
-<strong>hun</strong> (Hungarian),
-<strong>ind</strong> (Indonesian),
-<strong>ita</strong> (Italian),
-<strong>jpn</strong> (Japanese),
-<strong>kor</strong> (Korean),
-<strong>lav</strong> (Latvian),
-<strong>lit</strong> (Lithuanian),
-<strong>nld</strong> (Dutch),
-<strong>nor</strong> (Norwegian),
-<strong>pol</strong> (Polish),
-<strong>por</strong> (Portuguese),
-<strong>ron</strong> (Romanian),
-<strong>rus</strong> (Russian),
-<strong>slk</strong> (Slovakian),
-<strong>slv</strong> (Slovenian),
-<strong>sqi</strong> (Albanian),
-<strong>spa</strong> (Spanish),
-<strong>srp</strong> (Serbian),
-<strong>swe</strong> (Swedish),
-<strong>tam</strong> (Tamil),
-<strong>tel</strong> (Telugu),
-<strong>tgl</strong> (Tagalog),
-<strong>tha</strong> (Thai),
-<strong>tur</strong> (Turkish),
-<strong>ukr</strong> (Ukrainian),
-<strong>vie</strong> (Vietnamese)</p></div>
+<div class="paragraph"><p>There are currently language packs available for the following languages
+(in <a href="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</a>):</p></div>
+<div class="paragraph"><p><strong>afr</strong> (Afrikaans)
+<strong>amh</strong> (Amharic)
+<strong>ara</strong> (Arabic)
+<strong>asm</strong> (Assamese)
+<strong>aze</strong> (Azerbaijani)
+<strong>aze_cyrl</strong> (Azerbaijani - Cyrilic)
+<strong>bel</strong> (Belarusian)
+<strong>ben</strong> (Bengali)
+<strong>bod</strong> (Tibetan)
+<strong>bos</strong> (Bosnian)
+<strong>bul</strong> (Bulgarian)
+<strong>cat</strong> (Catalan; Valencian)
+<strong>ceb</strong> (Cebuano)
+<strong>ces</strong> (Czech)
+<strong>chi_sim</strong> (Chinese - Simplified)
+<strong>chi_tra</strong> (Chinese - Traditional)
+<strong>chr</strong> (Cherokee)
+<strong>cym</strong> (Welsh)
+<strong>dan</strong> (Danish)
+<strong>dan_frak</strong> (Danish - Fraktur)
+<strong>deu</strong> (German)
+<strong>deu_frak</strong> (German - Fraktur)
+<strong>dzo</strong> (Dzongkha)
+<strong>ell</strong> (Greek, Modern (1453-))
+<strong>eng</strong> (English)
+<strong>enm</strong> (English, Middle (1100-1500))
+<strong>epo</strong> (Esperanto)
+<strong>equ</strong> (Math / equation detection module)
+<strong>est</strong> (Estonian)
+<strong>eus</strong> (Basque)
+<strong>fas</strong> (Persian)
+<strong>fin</strong> (Finnish)
+<strong>fra</strong> (French)
+<strong>frk</strong> (Frankish)
+<strong>frm</strong> (French, Middle (ca.1400-1600))
+<strong>gle</strong> (Irish)
+<strong>glg</strong> (Galician)
+<strong>grc</strong> (Greek, Ancient (to 1453))
+<strong>guj</strong> (Gujarati)
+<strong>hat</strong> (Haitian; Haitian Creole)
+<strong>heb</strong> (Hebrew)
+<strong>hin</strong> (Hindi)
+<strong>hrv</strong> (Croatian)
+<strong>hun</strong> (Hungarian)
+<strong>iku</strong> (Inuktitut)
+<strong>ind</strong> (Indonesian)
+<strong>isl</strong> (Icelandic)
+<strong>ita</strong> (Italian)
+<strong>ita_old</strong> (Italian - Old)
+<strong>jav</strong> (Javanese)
+<strong>jpn</strong> (Japanese)
+<strong>kan</strong> (Kannada)
+<strong>kat</strong> (Georgian)
+<strong>kat_old</strong> (Georgian - Old)
+<strong>kaz</strong> (Kazakh)
+<strong>khm</strong> (Central Khmer)
+<strong>kir</strong> (Kirghiz; Kyrgyz)
+<strong>kor</strong> (Korean)
+<strong>kur</strong> (Kurdish)
+<strong>lao</strong> (Lao)
+<strong>lat</strong> (Latin)
+<strong>lav</strong> (Latvian)
+<strong>lit</strong> (Lithuanian)
+<strong>mal</strong> (Malayalam)
+<strong>mar</strong> (Marathi)
+<strong>mkd</strong> (Macedonian)
+<strong>mlt</strong> (Maltese)
+<strong>msa</strong> (Malay)
+<strong>mya</strong> (Burmese)
+<strong>nep</strong> (Nepali)
+<strong>nld</strong> (Dutch; Flemish)
+<strong>nor</strong> (Norwegian)
+<strong>ori</strong> (Oriya)
+<strong>osd</strong> (Orientation and script detection module)
+<strong>pan</strong> (Panjabi; Punjabi)
+<strong>pol</strong> (Polish)
+<strong>por</strong> (Portuguese)
+<strong>pus</strong> (Pushto; Pashto)
+<strong>ron</strong> (Romanian; Moldavian; Moldovan)
+<strong>rus</strong> (Russian)
+<strong>san</strong> (Sanskrit)
+<strong>sin</strong> (Sinhala; Sinhalese)
+<strong>slk</strong> (Slovak)
+<strong>slk_frak</strong> (Slovak - Fraktur)
+<strong>slv</strong> (Slovenian)
+<strong>spa</strong> (Spanish; Castilian)
+<strong>spa_old</strong> (Spanish; Castilian - Old)
+<strong>sqi</strong> (Albanian)
+<strong>srp</strong> (Serbian)
+<strong>srp_latn</strong> (Serbian - Latin)
+<strong>swa</strong> (Swahili)
+<strong>swe</strong> (Swedish)
+<strong>syr</strong> (Syriac)
+<strong>tam</strong> (Tamil)
+<strong>tel</strong> (Telugu)
+<strong>tgk</strong> (Tajik)
+<strong>tgl</strong> (Tagalog)
+<strong>tha</strong> (Thai)
+<strong>tir</strong> (Tigrinya)
+<strong>tur</strong> (Turkish)
+<strong>uig</strong> (Uighur; Uyghur)
+<strong>ukr</strong> (Ukrainian)
+<strong>urd</strong> (Urdu)
+<strong>uzb</strong> (Uzbek)
+<strong>uzb_cyrl</strong> (Uzbek - Cyrilic)
+<strong>vie</strong> (Vietnamese)
+<strong>yid</strong> (Yiddish)</p></div>
 <div class="paragraph"><p>To use a non-standard language pack named <strong>foo.traineddata</strong>, set the
 <strong>TESSDATA_PREFIX</strong> environment variable so the file can be found at
 <strong>TESSDATA_PREFIX</strong>/tessdata/<strong>foo</strong>.traineddata and give Tesseract the
@@ -1047,7 +1106,7 @@ debug.</p></div>
 <div class="paragraph"><p>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
 to train Tesseract.</p></div>
 <div class="paragraph"><p>Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy.
-See <a href="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</a>. With Tesseract 2.00,
+See <a href="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</a>. With Tesseract 2.00,
 scripts are now included to allow anyone to reproduce some of these tests.
 See <a href="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</a> for more
 details.</p></div>
@@ -1097,7 +1156,7 @@ Lloyd, Shobhit Saxena, and Thomas Kielbus.</p></div>
 <div id="footnotes"><hr /></div>
 <div id="footer">
 <div id="footer-text">
-Last updated 2015-06-12 23:49:44 CEST
+Last updated 2015-06-28 22:23:47 CEST
 </div>
 </div>
 </body>
@@ -216,56 +216,115 @@ before any <emphasis>configfile</emphasis>.</simpara>
 </refsect1>
 <refsect1 id="_languages">
 <title>LANGUAGES</title>
-<simpara>There are currently language packs available for the following languages:</simpara>
-<simpara><emphasis role="strong">ara</emphasis> (Arabic),
-<emphasis role="strong">aze</emphasis> (Azerbauijani),
-<emphasis role="strong">bul</emphasis> (Bulgarian),
-<emphasis role="strong">cat</emphasis> (Catalan),
-<emphasis role="strong">ces</emphasis> (Czech),
-<emphasis role="strong">chi_sim</emphasis> (Simplified Chinese),
-<emphasis role="strong">chi_tra</emphasis> (Traditional Chinese),
-<emphasis role="strong">chr</emphasis> (Cherokee),
-<emphasis role="strong">dan</emphasis> (Danish),
-<emphasis role="strong">dan-frak</emphasis> (Danish (Fraktur)),
-<emphasis role="strong">deu</emphasis> (German),
-<emphasis role="strong">ell</emphasis> (Greek),
-<emphasis role="strong">eng</emphasis> (English),
-<emphasis role="strong">enm</emphasis> (Old English),
-<emphasis role="strong">epo</emphasis> (Esperanto),
-<emphasis role="strong">est</emphasis> (Estonian),
-<emphasis role="strong">fin</emphasis> (Finnish),
-<emphasis role="strong">fra</emphasis> (French),
-<emphasis role="strong">frm</emphasis> (Old French),
-<emphasis role="strong">glg</emphasis> (Galician),
-<emphasis role="strong">heb</emphasis> (Hebrew),
-<emphasis role="strong">hin</emphasis> (Hindi),
-<emphasis role="strong">hrv</emphasis> (Croation),
-<emphasis role="strong">hun</emphasis> (Hungarian),
-<emphasis role="strong">ind</emphasis> (Indonesian),
-<emphasis role="strong">ita</emphasis> (Italian),
-<emphasis role="strong">jpn</emphasis> (Japanese),
-<emphasis role="strong">kor</emphasis> (Korean),
-<emphasis role="strong">lav</emphasis> (Latvian),
-<emphasis role="strong">lit</emphasis> (Lithuanian),
-<emphasis role="strong">nld</emphasis> (Dutch),
-<emphasis role="strong">nor</emphasis> (Norwegian),
-<emphasis role="strong">pol</emphasis> (Polish),
-<emphasis role="strong">por</emphasis> (Portuguese),
-<emphasis role="strong">ron</emphasis> (Romanian),
-<emphasis role="strong">rus</emphasis> (Russian),
-<emphasis role="strong">slk</emphasis> (Slovakian),
-<emphasis role="strong">slv</emphasis> (Slovenian),
-<emphasis role="strong">sqi</emphasis> (Albanian),
-<emphasis role="strong">spa</emphasis> (Spanish),
-<emphasis role="strong">srp</emphasis> (Serbian),
-<emphasis role="strong">swe</emphasis> (Swedish),
-<emphasis role="strong">tam</emphasis> (Tamil),
-<emphasis role="strong">tel</emphasis> (Telugu),
-<emphasis role="strong">tgl</emphasis> (Tagalog),
-<emphasis role="strong">tha</emphasis> (Thai),
-<emphasis role="strong">tur</emphasis> (Turkish),
-<emphasis role="strong">ukr</emphasis> (Ukrainian),
-<emphasis role="strong">vie</emphasis> (Vietnamese)</simpara>
+<simpara>There are currently language packs available for the following languages
+(in <ulink url="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</ulink>):</simpara>
+<simpara><emphasis role="strong">afr</emphasis> (Afrikaans)
+<emphasis role="strong">amh</emphasis> (Amharic)
+<emphasis role="strong">ara</emphasis> (Arabic)
+<emphasis role="strong">asm</emphasis> (Assamese)
+<emphasis role="strong">aze</emphasis> (Azerbaijani)
+<emphasis role="strong">aze_cyrl</emphasis> (Azerbaijani - Cyrilic)
+<emphasis role="strong">bel</emphasis> (Belarusian)
+<emphasis role="strong">ben</emphasis> (Bengali)
+<emphasis role="strong">bod</emphasis> (Tibetan)
+<emphasis role="strong">bos</emphasis> (Bosnian)
+<emphasis role="strong">bul</emphasis> (Bulgarian)
+<emphasis role="strong">cat</emphasis> (Catalan; Valencian)
+<emphasis role="strong">ceb</emphasis> (Cebuano)
+<emphasis role="strong">ces</emphasis> (Czech)
+<emphasis role="strong">chi_sim</emphasis> (Chinese - Simplified)
+<emphasis role="strong">chi_tra</emphasis> (Chinese - Traditional)
+<emphasis role="strong">chr</emphasis> (Cherokee)
+<emphasis role="strong">cym</emphasis> (Welsh)
+<emphasis role="strong">dan</emphasis> (Danish)
+<emphasis role="strong">dan_frak</emphasis> (Danish - Fraktur)
+<emphasis role="strong">deu</emphasis> (German)
+<emphasis role="strong">deu_frak</emphasis> (German - Fraktur)
+<emphasis role="strong">dzo</emphasis> (Dzongkha)
+<emphasis role="strong">ell</emphasis> (Greek, Modern (1453-))
+<emphasis role="strong">eng</emphasis> (English)
+<emphasis role="strong">enm</emphasis> (English, Middle (1100-1500))
+<emphasis role="strong">epo</emphasis> (Esperanto)
+<emphasis role="strong">equ</emphasis> (Math / equation detection module)
+<emphasis role="strong">est</emphasis> (Estonian)
+<emphasis role="strong">eus</emphasis> (Basque)
+<emphasis role="strong">fas</emphasis> (Persian)
+<emphasis role="strong">fin</emphasis> (Finnish)
+<emphasis role="strong">fra</emphasis> (French)
+<emphasis role="strong">frk</emphasis> (Frankish)
+<emphasis role="strong">frm</emphasis> (French, Middle (ca.1400-1600))
+<emphasis role="strong">gle</emphasis> (Irish)
+<emphasis role="strong">glg</emphasis> (Galician)
+<emphasis role="strong">grc</emphasis> (Greek, Ancient (to 1453))
+<emphasis role="strong">guj</emphasis> (Gujarati)
+<emphasis role="strong">hat</emphasis> (Haitian; Haitian Creole)
+<emphasis role="strong">heb</emphasis> (Hebrew)
+<emphasis role="strong">hin</emphasis> (Hindi)
+<emphasis role="strong">hrv</emphasis> (Croatian)
+<emphasis role="strong">hun</emphasis> (Hungarian)
+<emphasis role="strong">iku</emphasis> (Inuktitut)
+<emphasis role="strong">ind</emphasis> (Indonesian)
+<emphasis role="strong">isl</emphasis> (Icelandic)
+<emphasis role="strong">ita</emphasis> (Italian)
+<emphasis role="strong">ita_old</emphasis> (Italian - Old)
+<emphasis role="strong">jav</emphasis> (Javanese)
+<emphasis role="strong">jpn</emphasis> (Japanese)
+<emphasis role="strong">kan</emphasis> (Kannada)
+<emphasis role="strong">kat</emphasis> (Georgian)
+<emphasis role="strong">kat_old</emphasis> (Georgian - Old)
+<emphasis role="strong">kaz</emphasis> (Kazakh)
+<emphasis role="strong">khm</emphasis> (Central Khmer)
+<emphasis role="strong">kir</emphasis> (Kirghiz; Kyrgyz)
+<emphasis role="strong">kor</emphasis> (Korean)
+<emphasis role="strong">kur</emphasis> (Kurdish)
+<emphasis role="strong">lao</emphasis> (Lao)
+<emphasis role="strong">lat</emphasis> (Latin)
+<emphasis role="strong">lav</emphasis> (Latvian)
+<emphasis role="strong">lit</emphasis> (Lithuanian)
+<emphasis role="strong">mal</emphasis> (Malayalam)
+<emphasis role="strong">mar</emphasis> (Marathi)
+<emphasis role="strong">mkd</emphasis> (Macedonian)
+<emphasis role="strong">mlt</emphasis> (Maltese)
+<emphasis role="strong">msa</emphasis> (Malay)
+<emphasis role="strong">mya</emphasis> (Burmese)
+<emphasis role="strong">nep</emphasis> (Nepali)
+<emphasis role="strong">nld</emphasis> (Dutch; Flemish)
+<emphasis role="strong">nor</emphasis> (Norwegian)
+<emphasis role="strong">ori</emphasis> (Oriya)
+<emphasis role="strong">osd</emphasis> (Orientation and script detection module)
+<emphasis role="strong">pan</emphasis> (Panjabi; Punjabi)
+<emphasis role="strong">pol</emphasis> (Polish)
+<emphasis role="strong">por</emphasis> (Portuguese)
+<emphasis role="strong">pus</emphasis> (Pushto; Pashto)
+<emphasis role="strong">ron</emphasis> (Romanian; Moldavian; Moldovan)
+<emphasis role="strong">rus</emphasis> (Russian)
+<emphasis role="strong">san</emphasis> (Sanskrit)
+<emphasis role="strong">sin</emphasis> (Sinhala; Sinhalese)
+<emphasis role="strong">slk</emphasis> (Slovak)
+<emphasis role="strong">slk_frak</emphasis> (Slovak - Fraktur)
+<emphasis role="strong">slv</emphasis> (Slovenian)
+<emphasis role="strong">spa</emphasis> (Spanish; Castilian)
+<emphasis role="strong">spa_old</emphasis> (Spanish; Castilian - Old)
+<emphasis role="strong">sqi</emphasis> (Albanian)
+<emphasis role="strong">srp</emphasis> (Serbian)
+<emphasis role="strong">srp_latn</emphasis> (Serbian - Latin)
+<emphasis role="strong">swa</emphasis> (Swahili)
+<emphasis role="strong">swe</emphasis> (Swedish)
+<emphasis role="strong">syr</emphasis> (Syriac)
+<emphasis role="strong">tam</emphasis> (Tamil)
+<emphasis role="strong">tel</emphasis> (Telugu)
+<emphasis role="strong">tgk</emphasis> (Tajik)
+<emphasis role="strong">tgl</emphasis> (Tagalog)
+<emphasis role="strong">tha</emphasis> (Thai)
+<emphasis role="strong">tir</emphasis> (Tigrinya)
+<emphasis role="strong">tur</emphasis> (Turkish)
+<emphasis role="strong">uig</emphasis> (Uighur; Uyghur)
+<emphasis role="strong">ukr</emphasis> (Ukrainian)
+<emphasis role="strong">urd</emphasis> (Urdu)
+<emphasis role="strong">uzb</emphasis> (Uzbek)
+<emphasis role="strong">uzb_cyrl</emphasis> (Uzbek - Cyrilic)
+<emphasis role="strong">vie</emphasis> (Vietnamese)
+<emphasis role="strong">yid</emphasis> (Yiddish)</simpara>
 <simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the
 <emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at
 <emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the
@@ -325,7 +384,7 @@ debug.</simpara>
 <simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
 to train Tesseract.</simpara>
 <simpara>Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy.
-See <ulink url="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</ulink>. With Tesseract 2.00,
+See <ulink url="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</ulink>. With Tesseract 2.00,
 scripts are now included to allow anyone to reproduce some of these tests.
 See <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</ulink> for more
 details.</simpara>
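
Reviewer note: the LANGUAGES text touched by this commit says a non-standard pack is found at TESSDATA_PREFIX/tessdata/foo.traineddata and selected with -l foo. A minimal shell sketch of that lookup follows, as the man page describes it for this version. The directory /opt/ocr and the file names input.png and output are hypothetical, and the guard around the tesseract call is an addition so the snippet is harmless on machines without the binary or the pack.

```shell
#!/bin/sh
# Hypothetical install location; the pack would live at
# /opt/ocr/tessdata/foo.traineddata.
TESSDATA_PREFIX=/opt/ocr
export TESSDATA_PREFIX

# Path the man page says Tesseract resolves for "-l foo".
pack="$TESSDATA_PREFIX/tessdata/foo.traineddata"
echo "expecting language pack at: $pack"

# Run OCR only if the binary and the pack are both present.
if command -v tesseract >/dev/null 2>&1 && [ -f "$pack" ]; then
    # input.png and output are placeholder names for this sketch.
    tesseract input.png output -l foo
else
    echo "skipping OCR: tesseract or $pack not found"
fi
```

Printing the resolved path first makes a misplaced pack easy to spot before Tesseract reports a load failure.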