mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2024-11-24 02:59:07 +08:00
0759ee7e17
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@482 d0cd1f9f-072b-0410-8dd7-cf729c803f20
67 lines
2.3 KiB
Plaintext
67 lines
2.3 KiB
Plaintext
UNICHARSET(5)
|
|
=============
|
|
|
|
NAME
|
|
----
|
|
unicharset - character properties for use by Tesseract
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
Tesseract needs to have access to the character properties isalpha,
|
|
isdigit, isupper, islower, ispunctuation. This data must be encoded
|
|
in the unicharset data file. Each line of this file corresponds to
|
|
one character. The character in UTF-8 is followed by a hexadecimal
|
|
number representing a binary mask that encodes the properties.
|
|
Each bit corresponds to a property. If the bit is set to 1, it
|
|
means that the property is true. The bit ordering is (from least
|
|
significant bit to most significant bit): isalpha, islower, isupper,
|
|
isdigit, ispunctuation.
|
|
|
|
Each line in the unicharset file has four space-separated fields:
|
|
......................................
|
|
[character] [properties] [script] [id]
|
|
......................................
|
|
|
|
EXAMPLE
|
|
-------
|
|
..............
|
|
; 10 Common 46
|
|
b 3 Latin 59
|
|
W 5 Latin 40
|
|
7 8 Common 66
|
|
= 0 Common 93
|
|
..............
|
|
|
|
";" is a punctuation character. Its properties are thus represented by the binary number
|
|
10000 (10 in hexadecimal).
|
|
|
|
"b" is an alphabetic character and a lower case character. Its properties are thus
|
|
represented by the binary number 00011 (3 in hexadecimal).
|
|
|
|
"W" is an alphabetic character and an upper case character. Its properties are thus
|
|
represented by the binary number 00101 (5 in hexadecimal).
|
|
|
|
"7" is just a digit. Its properties are thus represented by the binary number 01000
|
|
(8 in hexadecimal).
|
|
|
|
"=" is not punctuation nor a digit nor an alphabetic character. Its properties are
|
|
thus represented by the binary number 00000 (0 in hexadecimal).
|
|
|
|
Japanese or Chinese alphabetic character properties are represented by the binary number
|
|
00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.
|
|
|
|
The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han,
|
|
null) and id code of the character.
|
|
|
|
HISTORY
|
|
-------
|
|
The unicharset format first appeared with Tesseract 2.00, which was the first version
|
|
to support languages other than English. The unicharset file contained only the first
|
|
two fields, and the "ispunctuation" property was absent (punctuation was regarded as
|
|
"0", as "=" is in the above example.
|
|
|
|
SEE ALSO
|
|
--------
|
|
tesseract(1), unicharset_extractor(1)
|
|
|