'\" t
.\"     Title: unicharambigs
.\"    Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1
.\"      Date: 06/12/2015
.\"    Manual: \ \&
.\"    Source: \ \&
.\"  Language: English
.\"
.TH "UNICHARAMBIGS" "5" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharambigs \- Tesseract unicharset ambiguities
.SH "DESCRIPTION"
.sp
The unicharambigs file (a component of traineddata, see combine_tessdata(1)) is used by Tesseract to represent possible ambiguities between characters, or groups of characters\&.
.sp
Each line of the file is laid out as follows:
.sp
.if n \{\
.RS 4
.\}
.nf
[num] [char(s)] [num] [char(s)] [num]
.fi
.if n \{\
.RE
.\}
.sp
.TS
tab(:);
lt lt
lt lt
lt lt
lt lt
lt lt.
T{
.sp
Field one
T}:T{
.sp
the number of characters contained in field two
T}
T{
.sp
Field two
T}:T{
.sp
the character sequence to be replaced
T}
T{
.sp
Field three
T}:T{
.sp
the number of characters contained in field four
T}
T{
.sp
Field four
T}:T{
.sp
the character sequence used to replace field two
T}
T{
.sp
Field five
T}:T{
.sp
contains either 1 or 0\&.
1 denotes a mandatory replacement, 0 denotes an optional replacement\&.
T}
.TE
.sp 1
.sp
Characters appearing in fields two and four should appear in the unicharset file\&. The numbers in fields one and three count unichars, not bytes\&.
.SH "EXAMPLE"
.sp
.if n \{\
.RS 4
.\}
.nf
2    \*(Aq \*(Aq    1    "      1
1    m      2    r n    0
3    i i i  1    m      0
.fi
.if n \{\
.RE
.\}
.sp
In this example, every instance of the \fI2\fR\-character sequence \fI\*(Aq\*(Aq\fR will \fBalways\fR be replaced by the \fI1\fR\-character sequence \fI"\fR; the \fI1\fR\-character sequence \fIm\fR \fBmay\fR be replaced by the \fI2\fR\-character sequence \fIrn\fR; and the \fI3\fR\-character sequence \fIiii\fR \fBmay\fR be replaced by the \fI1\fR\-character sequence \fIm\fR\&.
.SH "HISTORY"
.sp
The unicharambigs file first appeared in Tesseract 3\&.00\&. Prior to that, a similar format called DangAmbigs (\fIdangerous ambiguities\fR) was used\&. That format was almost identical, except that only mandatory replacements could be specified and field five was absent\&.
.SH "BUGS"
.sp
This is a documentation "bug": it\(cqs not currently clear what should be done in the case of ligatures (such as \fIfi\fR) which may also appear as regular letters in the unicharset\&.
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.