tesseract/doc/unicharset.5

'\" t
.\"     Title: unicharset
.\"    Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\"      Date: 06/12/2015
.\"    Manual: \ \&
.\"    Source: \ \&
.\"  Language: English
.\"
.TH "UNICHARSET" "5" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharset \- character properties file used by tesseract(1)
.SH "DESCRIPTION"
.sp
Tesseract\(cqs unicharset file contains information on each symbol (unichar) the Tesseract OCR engine is trained to recognize\&.
.sp
A unicharset file (i\&.e\&. \fIeng\&.unicharset\fR) is distributed as part of a Tesseract language pack (i\&.e\&. \fIeng\&.traineddata\fR)\&. For information on extracting the unicharset file, see combine_tessdata(1)\&.
.sp
The first line of a unicharset file contains the number of unichars in the file\&. After this line, each subsequent line provides information for a single unichar\&. The first such line contains a placeholder reserved for the space character\&. Each unichar is referred to within Tesseract by its Unichar ID, which is the line number (minus 1) within the unicharset file\&. Therefore, space gets unichar 0\&.
.sp
Each unichar line in the unicharset file (v2+) may have four space\-separated fields:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqscript\*(Aq \*(Aqid\*(Aq
.fi
.if n \{\
.RE
.\}
.sp
Starting with Tesseract v3\&.02, more information may be given for each unichar:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqglyph_metrics\*(Aq \*(Aqscript\*(Aq \*(Aqother_case\*(Aq \*(Aqdirection\*(Aq \*(Aqmirror\*(Aq \*(Aqnormed_form\*(Aq
.fi
.if n \{\
.RE
.\}
.sp
Entries:
.PP
\fIcharacter\fR
.RS 4
The UTF\-8 encoded string to be produced for this unichar\&.
.RE
.PP
\fIproperties\fR
.RS 4
An integer mask of character properties, one per bit\&. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation\&.
.RE
.PP
\fIglyph_metrics\fR
.RS 4
Ten comma\-separated integers representing various standards for where this glyph is to be found within a baseline\-normalized coordinate system where 128 is normalized to x\-height\&.
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bottom, max_bottom: the ranges where the bottom of the character can be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_top, max_top: the ranges where the top of the character may be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_width, max_width: horizontal width of the character\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bearing, max_bearing: how far from the usual start position does the leftmost part of the character begin\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_advance, max_advance: how far from the printer\(cqs cell left do we advance to begin the next character\&.
.RE
.RE
.PP
\fIscript\fR
.RS 4
Name of the script (Latin, Common, Greek, Cyrillic, Han, null)\&.
.RE
.PP
\fIother_case\fR
.RS 4
The Unichar ID of the other case version of this character (upper or lower)\&.
.RE
.PP
\fIdirection\fR
.RS 4
The Unicode BiDi direction of this character, as defined by ICU\(cqs enum UCharDirection\&. (0 = Left to Right, 1 = Right to Left, 2 = European Number\&...)
.RE
.PP
\fImirror\fR
.RS 4
The Unichar ID of the BiDirectional mirror of this character\&. For example the mirror of open paren is close paren, but Latin Capital C has no mirror, so it remains a Latin Capital C\&.
.RE
.PP
\fInormed_form\fR
.RS 4
The UTF\-8 representation of a "normalized form" of this unichar for the purpose of blaming a module for errors given ground truth text\&. For instance, a left or right single quote may normalize to an ASCII quote\&.
.RE
.SH "EXAMPLE (V2)"
.sp
.if n \{\
.RS 4
.\}
.nf
; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93
.fi
.if n \{\
.RE
.\}
.sp
";" is a punctuation character\&. Its properties are thus represented by the binary number 10000 (10 in hexadecimal)\&.
.sp
"b" is an alphabetic character and a lower case character\&. Its properties are thus represented by the binary number 00011 (3 in hexadecimal)\&.
.sp
"W" is an alphabetic character and an upper case character\&. Its properties are thus represented by the binary number 00101 (5 in hexadecimal)\&.
.sp
"7" is just a digit\&. Its properties are thus represented by the binary number 01000 (8 in hexadecimal)\&.
.sp
"=" is not punctuation nor a digit nor an alphabetic character\&. Its properties are thus represented by the binary number 00000 (0 in hexadecimal)\&.
.sp
Japanese or Chinese alphabetic character properties are represented by the binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case\&.
.SH "EXAMPLE (V3.02)"
.sp
.if n \{\
.RS 4
.\}
.nf
110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
\&. \&. \&.
.fi
.if n \{\
.RE
.\}
.SH "CAVEATS"
.sp
Although the unicharset reader maintains the ability to read unicharsets of older formats and will assign default values to missing fields, the accuracy will be degraded\&.
.sp
Further, most other data files are indexed by the unicharset file, so changing it without re\-generating the others is likely to have dire consequences\&.
.SH "HISTORY"
.sp
The unicharset format first appeared with Tesseract 2\&.00, which was the first version to support languages other than English\&. The unicharset file contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example\&.
.SH "SEE ALSO"
.sp
tesseract(1), combine_tessdata(1), unicharset_extractor(1)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`'\" t`
			`.\" Title: unicharset`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.\" Author: [see the "AUTHOR" section]`
fix links in doc; autotools requires README 2015-06-13 06:08:05 +08:00			`.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>`
			`.\" Date: 06/12/2015`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.\" Manual: \ \&`
			`.\" Source: \ \&`
			`.\" Language: English`
			`.\"`
fix links in doc; autotools requires README 2015-06-13 06:08:05 +08:00			`.TH "UNICHARSET" "5" "06/12/2015" "\ \&" "\ \&"`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.\" -----------------------------------------------------------------`
			`.\" * Define some portability stuff`
			`.\" -----------------------------------------------------------------`
			`.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
			`.\" http://bugs.debian.org/507673`
			`.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html`
			`.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
			`.ie \n(.g .ds Aq \(aq`
			`.el .ds Aq '`
			`.\" -----------------------------------------------------------------`
			`.\" * set default formatting`
			`.\" -----------------------------------------------------------------`
			`.\" disable hyphenation`
			`.nh`
			`.\" disable justification (adjust text to left margin only)`
			`.ad l`
			`.\" -----------------------------------------------------------------`
			`.\" * MAIN CONTENT STARTS HERE *`
			`.\" -----------------------------------------------------------------`
			`.SH "NAME"`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`unicharset \- character properties file used by tesseract(1)`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.SH "DESCRIPTION"`
			`.sp`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`Tesseract\(cqs unicharset file contains information on each symbol (unichar) the Tesseract OCR engine is trained to recognize\&.`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.sp`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`A unicharset file (i\&.e\&. \fIeng\&.unicharset\fR) is distributed as part of a Tesseract language pack (i\&.e\&. \fIeng\&.traineddata\fR)\&. For information on extracting the unicharset file, see combine_tessdata(1)\&.`
			`.sp`
			`The first line of a unicharset file contains the number of unichars in the file\&. After this line, each subsequent line provides information for a single unichar\&. The first such line contains a placeholder reserved for the space character\&. Each unichar is referred to within Tesseract by its Unichar ID, which is the line number (minus 1) within the unicharset file\&. Therefore, space gets unichar 0\&.`
			`.sp`
			`Each unichar line in the unicharset file (v2+) may have four space\-separated fields:`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`\(Aqcharacter\(Aq \(Aqproperties\(Aq \(Aqscript\(Aq \(Aqid\(Aq`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.sp`
			`Starting with Tesseract v3\&.02, more information may be given for each unichar:`
			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
			`\(Aqcharacter\(Aq \(Aqproperties\(Aq \(Aqglyph_metrics\(Aq \(Aqscript\(Aq \(Aqother_case\(Aq \(Aqdirection\(Aq \(Aqmirror\(Aq \(Aqnormed_form\(Aq`
			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
			`.sp`
			`Entries:`
			`.PP`
			`\fIcharacter\fR`
			`.RS 4`
			`The UTF\-8 encoded string to be produced for this unichar\&.`
			`.RE`
			`.PP`
			`\fIproperties\fR`
			`.RS 4`
			`An integer mask of character properties, one per bit\&. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation\&.`
			`.RE`
			`.PP`
			`\fIglyph_metrics\fR`
			`.RS 4`
			`Ten comma\-separated integers representing various standards for where this glyph is to be found within a baseline\-normalized coordinate system where 128 is normalized to x\-height\&.`
			`.sp`
			`.RS 4`
			`.ie n \{\`
			`\h'-04'\(bu\h'+03'\c`
			`.\}`
			`.el \{\`
			`.sp -1`
			`.IP \(bu 2.3`
			`.\}`
			`min_bottom, max_bottom: the ranges where the bottom of the character can be found\&.`
			`.RE`
			`.sp`
			`.RS 4`
			`.ie n \{\`
			`\h'-04'\(bu\h'+03'\c`
			`.\}`
			`.el \{\`
			`.sp -1`
			`.IP \(bu 2.3`
			`.\}`
			`min_top, max_top: the ranges where the top of the character may be found\&.`
			`.RE`
			`.sp`
			`.RS 4`
			`.ie n \{\`
			`\h'-04'\(bu\h'+03'\c`
			`.\}`
			`.el \{\`
			`.sp -1`
			`.IP \(bu 2.3`
			`.\}`
			`min_width, max_width: horizontal width of the character\&.`
			`.RE`
			`.sp`
			`.RS 4`
			`.ie n \{\`
			`\h'-04'\(bu\h'+03'\c`
			`.\}`
			`.el \{\`
			`.sp -1`
			`.IP \(bu 2.3`
			`.\}`
			`min_bearing, max_bearing: how far from the usual start position does the leftmost part of the character begin\&.`
			`.RE`
			`.sp`
			`.RS 4`
			`.ie n \{\`
			`\h'-04'\(bu\h'+03'\c`
			`.\}`
			`.el \{\`
			`.sp -1`
			`.IP \(bu 2.3`
			`.\}`
			`min_advance, max_advance: how far from the printer\(cqs cell left do we advance to begin the next character\&.`
			`.RE`
			`.RE`
			`.PP`
			`\fIscript\fR`
			`.RS 4`
			`Name of the script (Latin, Common, Greek, Cyrillic, Han, null)\&.`
			`.RE`
			`.PP`
			`\fIother_case\fR`
			`.RS 4`
			`The Unichar ID of the other case version of this character (upper or lower)\&.`
			`.RE`
			`.PP`
			`\fIdirection\fR`
			`.RS 4`
			`The Unicode BiDi direction of this character, as defined by ICU\(cqs enum UCharDirection\&. (0 = Left to Right, 1 = Right to Left, 2 = European Number\&...)`
			`.RE`
			`.PP`
			`\fImirror\fR`
			`.RS 4`
			`The Unichar ID of the BiDirectional mirror of this character\&. For example the mirror of open paren is close paren, but Latin Capital C has no mirror, so it remains a Latin Capital C\&.`
			`.RE`
			`.PP`
			`\fInormed_form\fR`
			`.RS 4`
			`The UTF\-8 representation of a "normalized form" of this unichar for the purpose of blaming a module for errors given ground truth text\&. For instance, a left or right single quote may normalize to an ASCII quote\&.`
			`.RE`
			`.SH "EXAMPLE (V2)"`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
			`; 10 Common 46`
			`b 3 Latin 59`
			`W 5 Latin 40`
			`7 8 Common 66`
			`= 0 Common 93`
			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
			`.sp`
			`";" is a punctuation character\&. Its properties are thus represented by the binary number 10000 (10 in hexadecimal)\&.`
			`.sp`
			`"b" is an alphabetic character and a lower case character\&. Its properties are thus represented by the binary number 00011 (3 in hexadecimal)\&.`
			`.sp`
			`"W" is an alphabetic character and an upper case character\&. Its properties are thus represented by the binary number 00101 (5 in hexadecimal)\&.`
			`.sp`
			`"7" is just a digit\&. Its properties are thus represented by the binary number 01000 (8 in hexadecimal)\&.`
			`.sp`
			`"=" is not punctuation nor a digit nor an alphabetic character\&. Its properties are thus represented by the binary number 00000 (0 in hexadecimal)\&.`
			`.sp`
			`Japanese or Chinese alphabetic character properties are represented by the binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case\&.`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.SH "EXAMPLE (V3.02)"`
			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
			`110`
			`NULL 0 NULL 0`
			`N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N`
			`Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y`
			`1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1`
			`9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9`
			`a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a`
			`\&. \&. \&.`
			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
			`.SH "CAVEATS"`
			`.sp`
			`Although the unicharset reader maintains the ability to read unicharsets of older formats and will assign default values to missing fields, the accuracy will be degraded\&.`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.sp`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`Further, most other data files are indexed by the unicharset file, so changing it without re\-generating the others is likely to have dire consequences\&.`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.SH "HISTORY"`
			`.sp`
			`The unicharset format first appeared with Tesseract 2\&.00, which was the first version to support languages other than English\&. The unicharset file contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example\&.`
			`.SH "SEE ALSO"`
			`.sp`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`tesseract(1), combine_tessdata(1), unicharset_extractor(1)`
			`.sp`
fix links in doc; autotools requires README 2015-06-13 06:08:05 +08:00			`\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.SH "AUTHOR"`
			`.sp`
			`The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.`