mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2025-01-08 10:57:48 +08:00
67 lines
2.3 KiB
Plaintext
67 lines
2.3 KiB
Plaintext
|
UNICHARSET(5)
|
||
|
=============
|
||
|
|
||
|
NAME
|
||
|
----
|
||
|
unicharset - character properties for use by Tesseract
|
||
|
|
||
|
DESCRIPTION
|
||
|
-----------
|
||
|
Tesseract needs to have access to the character properties isalpha,
|
||
|
isdigit, isupper, islower, ispunctuation. This data must be encoded
|
||
|
in the unicharset data file. Each line of this file corresponds to
|
||
|
one character. The character in UTF-8 is followed by a hexadecimal
|
||
|
number representing a binary mask that encodes the properties.
|
||
|
Each bit corresponds to a property. If the bit is set to 1, it
|
||
|
means that the property is true. The bit ordering is (from least
|
||
|
significant bit to most significant bit): isalpha, islower, isupper,
|
||
|
isdigit, ispunctuation.
|
||
|
|
||
|
Each line in the unicharset file has four space-separated fields:
|
||
|
......................................
|
||
|
[character] [properties] [script] [id]
|
||
|
......................................
|
||
|
|
||
|
EXAMPLE
|
||
|
-------
|
||
|
..............
|
||
|
; 10 Common 46
|
||
|
b 3 Latin 59
|
||
|
W 5 Latin 40
|
||
|
7 8 Common 66
|
||
|
= 0 Common 93
|
||
|
..............
|
||
|
|
||
|
";" is a punctuation character. Its properties are thus represented by the binary number
|
||
|
10000 (10 in hexadecimal).
|
||
|
|
||
|
"b" is an alphabetic character and a lower case character. Its properties are thus
|
||
|
represented by the binary number 00011 (3 in hexadecimal).
|
||
|
|
||
|
"W" is an alphabetic character and an upper case character. Its properties are thus
|
||
|
represented by the binary number 00101 (5 in hexadecimal).
|
||
|
|
||
|
"7" is just a digit. Its properties are thus represented by the binary number 01000
|
||
|
(8 in hexadecimal).
|
||
|
|
||
|
"=" is not punctuation nor a digit nor an alphabetic character. Its properties are
|
||
|
thus represented by the binary number 00000 (0 in hexadecimal).
|
||
|
|
||
|
Japanese or Chinese alphabetic character properties are represented by the binary number
|
||
|
00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.
|
||
|
|
||
|
The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han,
|
||
|
null) and id code of the character.
|
||
|
|
||
|
HISTORY
|
||
|
-------
|
||
|
The unicharset format first appeared with Tesseract 2.00, which was the first version
|
||
|
to support languages other than English. The unicharset file contained only the first
|
||
|
two fields, and the "ispunctuation" property was absent (punctuation was regarded as
|
||
|
"0", as "=" is in the above example.
|
||
|
|
||
|
SEE ALSO
|
||
|
--------
|
||
|
tesseract(1), unicharset_extractor(1)
|
||
|
|