A unicharset file (e\&.g\&. \fIeng\&.unicharset\fR) is distributed as part of a Tesseract language pack (e\&.g\&. \fIeng\&.traineddata\fR)\&. For information on extracting the unicharset file, see combine_tessdata(1)\&.
.sp
The first line of a unicharset file contains the number of unichars in the file\&. Each subsequent line provides information for a single unichar\&. The first such line contains a placeholder reserved for the space character\&. Each unichar is referred to within Tesseract by its Unichar ID, which is the zero\-based position of its line among the unichar lines, that is, not counting the initial count line\&. Therefore, space gets Unichar ID 0\&.
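.sp
As a minimal sketch of this numbering (not Tesseract\(cqs own reader), the following Python assigns Unichar IDs by position\&. The function name and the stand\-in lines are hypothetical; the sketch assumes only what is stated here: a count line followed by one line per unichar\&.
.sp
.nf
```python
def unichar_ids(lines):
    """Map Unichar IDs to raw unichar lines.

    lines: the lines of a unicharset file, count line first.
    """
    count = int(lines[0])             # first line: number of unichars
    entries = lines[1:1 + count]      # one line per unichar follows
    if len(entries) != count:
        raise ValueError("file has fewer unichar lines than the count says")
    # The Unichar ID is the zero-based position among the unichar lines.
    return dict(enumerate(entries))

# Hypothetical stand-in lines; real entries carry the fields described in this page.
sample = ["3", "<space placeholder>", "<entry for a>", "<entry for b>"]
ids = unichar_ids(sample)
print(ids[0])
```
.fi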
.sp
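Each unichar line in the unicharset file (v2+) may have four space\-separated fields\&. Newer unicharsets may give more information for each unichar, for a total of eight fields\&. The fields are described below in the order in which the eight\-field format lists them\&.
.PP
\fIcharacter\fR
.RS 4
The UTF\-8 encoded string to be produced for this unichar\&.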
.RE
.PP
\fIproperties\fR
.RS 4
An integer mask of character properties, one per bit\&. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation\&.
.RE
.PP
\fIglyph_metrics\fR
.RS 4
Ten comma\-separated integers giving the ranges within which this glyph is expected to be found, in a baseline\-normalized coordinate system where the x\-height is normalized to 128\&. The values form five min/max pairs, in this order:
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bottom, max_bottom: the ranges where the bottom of the character can be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_top, max_top: the ranges where the top of the character may be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_width, max_width: horizontal width of the character\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bearing, max_bearing: how far from the usual start position the leftmost part of the character begins\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_advance, max_advance: how far to advance from the left edge of the printer\(cqs cell to begin the next character\&.
.RE
.RE
.PP
\fIscript\fR
.RS 4
Name of the script (Latin, Common, Greek, Cyrillic, Han, null)\&.
.RE
.PP
\fIother_case\fR
.RS 4
The Unichar ID of the other\-case version of this character (upper or lower)\&.
.RE
.PP
\fIdirection\fR
.RS 4
The Unicode BiDi direction of this character, as defined by ICU\(cqs enum UCharDirection\&. (0 = Left to Right, 1 = Right to Left, 2 = European Number\&...)
.RE
.PP
\fImirror\fR
.RS 4
The Unichar ID of the BiDirectional mirror of this character\&. For example, the mirror of an open parenthesis is a close parenthesis, but Latin Capital C has no mirror, so its mirror remains Latin Capital C\&.
.RE
.PP
\fInormed_form\fR
.RS 4
The UTF\-8 representation of a "normalized form" of this unichar for the purpose of blaming a module for errors given ground truth text\&. For instance, a left or right single quote may normalize to an ASCII quote\&.
.RE
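.sp
To illustrate how the ten glyph_metrics values pair up, here is a small Python sketch\&. The function name, the short pair names, and the dictionary layout are choices made for this illustration and are not part of Tesseract; the sample string is the glyph_metrics field of an entry for the letter N\&.
.sp
.nf
```python
# Short names for the five (min, max) pairs, in the order the field lists them.
METRIC_NAMES = ("bottom", "top", "width", "bearing", "advance")

def parse_glyph_metrics(field):
    """Split a glyph_metrics field into a {name: (min, max)} dictionary."""
    values = [int(v) for v in field.split(",")]
    if len(values) != 10:
        raise ValueError("expected ten comma-separated integers")
    # Consecutive values form (min, max) pairs.
    return {
        name: (values[2 * i], values[2 * i + 1])
        for i, name in enumerate(METRIC_NAMES)
    }

metrics = parse_glyph_metrics("59,68,216,255,87,236,0,27,104,227")
print(metrics["top"])
```
.fi
.sp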
";" is a punctuation character\&. Its properties are thus represented by the binary number 10000 (10 in hexadecimal)\&.
.sp
"b" is an alphabetic character and a lower case character\&. Its properties are thus represented by the binary number 00011 (3 in hexadecimal)\&.
.sp
"W" is an alphabetic character and an upper case character\&. Its properties are thus represented by the binary number 00101 (5 in hexadecimal)\&.
.sp
"7" is just a digit\&. Its properties are thus represented by the binary number 01000 (8 in hexadecimal)\&.
.sp
"=" is neither punctuation, a digit, nor an alphabetic character\&. Its properties are thus represented by the binary number 00000 (0 in hexadecimal)\&.
.sp
Japanese or Chinese alphabetic character properties are represented by the binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case\&.
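.sp
The property masks above can be checked with a short Python sketch\&. It assumes the properties field is written in hexadecimal, as the values quoted above suggest; the names PROPERTY_BITS and decode_properties are hypothetical\&.
.sp
.nf
```python
# Property names in bit order, least significant bit first.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")

def decode_properties(field):
    """Return the names of the properties set in a properties field.

    Assumption: the field is a hexadecimal string such as "10" or "3".
    """
    mask = int(field, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]

for field in ("10", "3", "5", "8", "0", "1"):
    print(field, decode_properties(field))
```
.fi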
.sp
.if n \{\
.RS 4
.\}
.nf
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
\&. \&. \&.
.fi
.if n \{\
.RE
.\}
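.sp
A line in the eight\-field layout shown above could be split into named fields roughly as follows\&. This is a sketch under stated assumptions rather than Tesseract\(cqs parser: the record type Unichar and the label used for the first field are chosen here for illustration, properties is assumed to be hexadecimal, the three ID\-like fields are assumed to be decimal, and only the three direction codes named on this page are listed\&.
.sp
.nf
```python
from collections import namedtuple

# Hypothetical record type; field names follow the field list on this page.
Unichar = namedtuple(
    "Unichar",
    "character properties glyph_metrics script other_case direction mirror normed_form",
)

# Only the direction codes named in the field description.
DIRECTIONS = {0: "Left to Right", 1: "Right to Left", 2: "European Number"}

def parse_line(line):
    """Parse one eight-field unichar line into a Unichar record."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected eight space-separated fields")
    char, props, metrics, script, other_case, direction, mirror, normed = fields
    return Unichar(
        character=char,
        properties=int(props, 16),    # assumption: hexadecimal mask
        glyph_metrics=tuple(int(v) for v in metrics.split(",")),
        script=script,
        other_case=int(other_case),   # assumption: decimal Unichar ID
        direction=int(direction),
        mirror=int(mirror),           # assumption: decimal Unichar ID
        normed_form=normed,
    )

entry = parse_line("1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1")
print(entry.script, DIRECTIONS.get(entry.direction, "other"))
```
.fi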
.SH "CAVEATS"
.sp
Although the unicharset reader can still read unicharsets in older formats and assigns default values to missing fields, recognition accuracy will be degraded\&.
.sp
Further, most other data files are indexed by the unicharset file, so changing it without regenerating the others is likely to have dire consequences\&.
.SH "HISTORY"
.sp
The unicharset format first appeared with Tesseract 2\&.00, which was the first version to support languages other than English\&. The unicharset file contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example)\&.