'\" t .\" Title: unicharset .\" Author: [see the "AUTHOR" section] .\" Generator: DocBook XSL Stylesheets v1.75.2 .\" Date: 02/09/2012 .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" .TH "UNICHARSET" "5" "02/09/2012" "\ \&" "\ \&" .\" ----------------------------------------------------------------- .\" * Define some portability stuff .\" ----------------------------------------------------------------- .\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .\" http://bugs.debian.org/507673 .\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html .\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" ----------------------------------------------------------------- .\" * set default formatting .\" ----------------------------------------------------------------- .\" disable hyphenation .nh .\" disable justification (adjust text to left margin only) .ad l .\" ----------------------------------------------------------------- .\" * MAIN CONTENT STARTS HERE * .\" ----------------------------------------------------------------- .SH "NAME" unicharset \- character properties file used by tesseract(1) .SH "DESCRIPTION" .sp Tesseract\(cqs unicharset file contains information on each symbol (unichar) the Tesseract OCR engine is trained to recognize\&. .sp A unicharset file (i\&.e\&. \fIeng\&.unicharset\fR) is distributed as part of a Tesseract language pack (i\&.e\&. \fIeng\&.traineddata\fR)\&. For information on extracting the unicharset file, see combine_tessdata(1)\&. .sp The first line of a unicharset file contains the number of unichars in the file\&. After this line, each subsequent line provides information for a single unichar\&. The first such line contains a placeholder reserved for the space character\&. Each unichar is referred to within Tesseract by its Unichar ID, which is the line number (minus 1) within the unicharset file\&. Therefore, space gets unichar 0\&. .sp Each unichar line in the unicharset file (v2+) may have four space\-separated fields: .sp .if n \{\ .RS 4 .\} .nf \*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqscript\*(Aq \*(Aqid\*(Aq .fi .if n \{\ .RE .\} .sp Starting with Tesseract v3\&.02, more information may be given for each unichar: .sp .if n \{\ .RS 4 .\} .nf \*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqglyph_metrics\*(Aq \*(Aqscript\*(Aq \*(Aqother_case\*(Aq \*(Aqdirection\*(Aq \*(Aqmirror\*(Aq \*(Aqnormed_form\*(Aq .fi .if n \{\ .RE .\} .sp Entries: .PP \fIcharacter\fR .RS 4 The UTF\-8 encoded string to be produced for this unichar\&. .RE .PP \fIproperties\fR .RS 4 An integer mask of character properties, one per bit\&. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation\&. .RE .PP \fIglyph_metrics\fR .RS 4 Ten comma\-separated integers representing various standards for where this glyph is to be found within a baseline\-normalized coordinate system where 128 is normalized to x\-height\&. .sp .RS 4 .ie n \{\ \h'-04'\(bu\h'+03'\c .\} .el \{\ .sp -1 .IP \(bu 2.3 .\} min_bottom, max_bottom: the ranges where the bottom of the character can be found\&. .RE .sp .RS 4 .ie n \{\ \h'-04'\(bu\h'+03'\c .\} .el \{\ .sp -1 .IP \(bu 2.3 .\} min_top, max_top: the ranges where the top of the character may be found\&. .RE .sp .RS 4 .ie n \{\ \h'-04'\(bu\h'+03'\c .\} .el \{\ .sp -1 .IP \(bu 2.3 .\} min_width, max_width: horizontal width of the character\&. .RE .sp .RS 4 .ie n \{\ \h'-04'\(bu\h'+03'\c .\} .el \{\ .sp -1 .IP \(bu 2.3 .\} min_bearing, max_bearing: how far from the usual start position does the leftmost part of the character begin\&. .RE .sp .RS 4 .ie n \{\ \h'-04'\(bu\h'+03'\c .\} .el \{\ .sp -1 .IP \(bu 2.3 .\} min_advance, max_advance: how far from the printer\(cqs cell left do we advance to begin the next character\&. .RE .RE .PP \fIscript\fR .RS 4 Name of the script (Latin, Common, Greek, Cyrillic, Han, null)\&. .RE .PP \fIother_case\fR .RS 4 The Unichar ID of the other case version of this character (upper or lower)\&. .RE .PP \fIdirection\fR .RS 4 The Unicode BiDi direction of this character, as defined by ICU\(cqs enum UCharDirection\&. (0 = Left to Right, 1 = Right to Left, 2 = European Number\&...) .RE .PP \fImirror\fR .RS 4 The Unichar ID of the BiDirectional mirror of this character\&. For example the mirror of open paren is close paren, but Latin Capital C has no mirror, so it remains a Latin Capital C\&. .RE .PP \fInormed_form\fR .RS 4 The UTF\-8 representation of a "normalized form" of this unichar for the purpose of blaming a module for errors given ground truth text\&. For instance, a left or right single quote may normalize to an ASCII quote\&. .RE .SH "EXAMPLE (V2)" .sp .if n \{\ .RS 4 .\} .nf ; 10 Common 46 b 3 Latin 59 W 5 Latin 40 7 8 Common 66 = 0 Common 93 .fi .if n \{\ .RE .\} .sp ";" is a punctuation character\&. Its properties are thus represented by the binary number 10000 (10 in hexadecimal)\&. .sp "b" is an alphabetic character and a lower case character\&. Its properties are thus represented by the binary number 00011 (3 in hexadecimal)\&. .sp "W" is an alphabetic character and an upper case character\&. Its properties are thus represented by the binary number 00101 (5 in hexadecimal)\&. .sp "7" is just a digit\&. Its properties are thus represented by the binary number 01000 (8 in hexadecimal)\&. .sp "=" is not punctuation nor a digit nor an alphabetic character\&. Its properties are thus represented by the binary number 00000 (0 in hexadecimal)\&. .sp Japanese or Chinese alphabetic character properties are represented by the binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case\&. .SH "EXAMPLE (V3.02)" .sp .if n \{\ .RS 4 .\} .nf 110 NULL 0 NULL 0 N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9 a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a \&. \&. \&. .fi .if n \{\ .RE .\} .SH "CAVEATS" .sp Although the unicharset reader maintains the ability to read unicharsets of older formats and will assign default values to missing fields, the accuracy will be degraded\&. .sp Further, most other data files are indexed by the unicharset file, so changing it without re\-generating the others is likely to have dire consequences\&. .SH "HISTORY" .sp The unicharset format first appeared with Tesseract 2\&.00, which was the first version to support languages other than English\&. The unicharset file contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example\&. .SH "SEE ALSO" .sp tesseract(1), combine_tessdata(1), unicharset_extractor(1) .sp \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[] .SH "AUTHOR" .sp The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.