tesseract/doc/unicharset.5.asc

UNICHARSET(5)
=============

NAME
----
unicharset - character properties for use by Tesseract

DESCRIPTION
-----------
Tesseract needs to have access to the character properties isalpha, 
isdigit, isupper, islower, ispunctuation. This data must be encoded 
in the unicharset data file. Each line of this file corresponds to 
one character. The character in UTF-8 is followed by a hexadecimal 
number representing a binary mask that encodes the properties. 
Each bit corresponds to a property. If the bit is set to 1, it 
means that the property is true. The bit ordering is (from least 
significant bit to most significant bit): isalpha, islower, isupper, 
isdigit, ispunctuation.

Each line in the unicharset file has four space-separated fields:
......................................
[character] [properties] [script] [id]
......................................

EXAMPLE
-------
..............
; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93
..............

";" is a punctuation character. Its properties are thus represented by the binary number 
10000 (10 in hexadecimal).

"b" is an alphabetic character and a lower case character. Its properties are thus 
represented by the binary number 00011 (3 in hexadecimal).

"W" is an alphabetic character and an upper case character. Its properties are thus 
represented by the binary number 00101 (5 in hexadecimal).

"7" is just a digit. Its properties are thus represented by the binary number 01000 
(8 in hexadecimal).

"=" is not punctuation nor a digit nor an alphabetic character. Its properties are 
thus represented by the binary number 00000 (0 in hexadecimal).

Japanese or Chinese alphabetic character properties are represented by the binary number 
00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.

The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han, 
null) and id code of the character.

HISTORY
-------
The unicharset format first appeared with Tesseract 2.00, which was the first version
to support languages other than English. The unicharset file contained only the first 
two fields, and the "ispunctuation" property was absent (punctuation was regarded as
"0", as "=" is in the above example.

SEE ALSO
--------
tesseract(1), unicharset_extractor(1)
more docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@482 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:06:29 +08:00			`UNICHARSET(5)`
			`=============`

			`NAME`
			`----`
			`unicharset - character properties for use by Tesseract`

			`DESCRIPTION`
			`-----------`
			`Tesseract needs to have access to the character properties isalpha,`
			`isdigit, isupper, islower, ispunctuation. This data must be encoded`
			`in the unicharset data file. Each line of this file corresponds to`
			`one character. The character in UTF-8 is followed by a hexadecimal`
			`number representing a binary mask that encodes the properties.`
			`Each bit corresponds to a property. If the bit is set to 1, it`
			`means that the property is true. The bit ordering is (from least`
			`significant bit to most significant bit): isalpha, islower, isupper,`
			`isdigit, ispunctuation.`

			`Each line in the unicharset file has four space-separated fields:`
			`......................................`
			`[character] [properties] [script] [id]`
			`......................................`

			`EXAMPLE`
			`-------`
			`..............`
			`; 10 Common 46`
			`b 3 Latin 59`
			`W 5 Latin 40`
			`7 8 Common 66`
			`= 0 Common 93`
			`..............`

			`";" is a punctuation character. Its properties are thus represented by the binary number`
			`10000 (10 in hexadecimal).`

			`"b" is an alphabetic character and a lower case character. Its properties are thus`
			`represented by the binary number 00011 (3 in hexadecimal).`

			`"W" is an alphabetic character and an upper case character. Its properties are thus`
			`represented by the binary number 00101 (5 in hexadecimal).`

			`"7" is just a digit. Its properties are thus represented by the binary number 01000`
			`(8 in hexadecimal).`

			`"=" is not punctuation nor a digit nor an alphabetic character. Its properties are`
			`thus represented by the binary number 00000 (0 in hexadecimal).`

			`Japanese or Chinese alphabetic character properties are represented by the binary number`
			`00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.`

			`The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han,`
			`null) and id code of the character.`

			`HISTORY`
			`-------`
			`The unicharset format first appeared with Tesseract 2.00, which was the first version`
			`to support languages other than English. The unicharset file contained only the first`
			`two fields, and the "ispunctuation" property was absent (punctuation was regarded as`
			`"0", as "=" is in the above example.`

			`SEE ALSO`
			`--------`
			`tesseract(1), unicharset_extractor(1)`