mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2024-12-03 00:49:01 +08:00
121 lines
3.4 KiB
Groff
121 lines
3.4 KiB
Groff
'\" t
|
|
.\" Title: unicharambigs
|
|
.\" Author: [see the "AUTHOR" section]
|
|
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
|
|
.\" Date: 06/12/2015
|
|
.\" Manual: \ \&
|
|
.\" Source: \ \&
|
|
.\" Language: English
|
|
.\"
|
|
.TH "UNICHARAMBIGS" "5" "06/12/2015" "\ \&" "\ \&"
|
|
.\" -----------------------------------------------------------------
|
|
.\" * Define some portability stuff
|
|
.\" -----------------------------------------------------------------
|
|
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
.\" http://bugs.debian.org/507673
|
|
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
|
|
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
.ie \n(.g .ds Aq \(aq
|
|
.el .ds Aq '
|
|
.\" -----------------------------------------------------------------
|
|
.\" * set default formatting
|
|
.\" -----------------------------------------------------------------
|
|
.\" disable hyphenation
|
|
.nh
|
|
.\" disable justification (adjust text to left margin only)
|
|
.ad l
|
|
.\" -----------------------------------------------------------------
|
|
.\" * MAIN CONTENT STARTS HERE *
|
|
.\" -----------------------------------------------------------------
|
|
.SH "NAME"
|
|
unicharambigs \- Tesseract unicharset ambiguities
|
|
.SH "DESCRIPTION"
|
|
.sp
|
|
The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) is used by Tesseract to represent possible ambiguities between characters, or groups of characters\&.
|
|
.sp
|
|
The file contains a number of lines, laid out as follow:
|
|
.sp
|
|
.if n \{\
|
|
.RS 4
|
|
.\}
|
|
.nf
|
|
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
|
|
.fi
|
|
.if n \{\
|
|
.RE
|
|
.\}
|
|
.sp
|
|
.TS
|
|
tab(:);
|
|
lt lt
|
|
lt lt
|
|
lt lt
|
|
lt lt
|
|
lt lt.
|
|
T{
|
|
.sp
|
|
Field one
|
|
T}:T{
|
|
.sp
|
|
the number of characters contained in field two
|
|
T}
|
|
T{
|
|
.sp
|
|
Field two
|
|
T}:T{
|
|
.sp
|
|
the character sequence to be replaced
|
|
T}
|
|
T{
|
|
.sp
|
|
Field three
|
|
T}:T{
|
|
.sp
|
|
the number of characters contained in field four
|
|
T}
|
|
T{
|
|
.sp
|
|
Field four
|
|
T}:T{
|
|
.sp
|
|
the character sequence used to replace field two
|
|
T}
|
|
T{
|
|
.sp
|
|
Field five
|
|
T}:T{
|
|
.sp
|
|
contains either 1 or 0\&. 1 denotes a mandatory replacement, 0 denotes an optional replacement\&.
|
|
T}
|
|
.TE
|
|
.sp 1
|
|
.sp
|
|
Characters appearing in fields two and four should appear in unicharset\&. The numbers in fields one and three refer to the number of unichars (not bytes)\&.
|
|
.SH "EXAMPLE"
|
|
.sp
|
|
.if n \{\
|
|
.RS 4
|
|
.\}
|
|
.nf
|
|
2 \*(Aq \*(Aq 1 " 1
|
|
1 m 2 r n 0
|
|
3 i i i 1 m 0
|
|
.fi
|
|
.if n \{\
|
|
.RE
|
|
.\}
|
|
.sp
|
|
In this example, all instances of the \fI2\fR character sequence \fI\*(Aq\fR\*(Aq will \fBalways\fR be replaced by the \fI1\fR character sequence \fI"\fR; a \fI1\fR character sequence \fIm\fR \fBmay\fR be replaced by the \fI2\fR character sequence \fIrn\fR, and the \fI3\fR character sequence \fBmay\fR be replaced by the \fI1\fR character sequence \fIm\fR\&.
|
|
.SH "HISTORY"
|
|
.sp
|
|
The unicharambigs file first appeared in Tesseract 3\&.00; prior to that, a similar format, called DangAmbigs (\fIdangerous ambiguities\fR) was used: the format was almost identical, except only mandatory replacements could be specified, and field 5 was absent\&.
|
|
.SH "BUGS"
|
|
.sp
|
|
This is a documentation "bug": it\(cqs not currently clear what should be done in the case of ligatures (such as \fIfi\fR) which may also appear as regular letters in the unicharset\&.
|
|
.SH "SEE ALSO"
|
|
.sp
|
|
tesseract(1), unicharset(5)
|
|
.SH "AUTHOR"
|
|
.sp
|
|
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.
|