UNICHARAMBIGS(5)
================

NAME
----
unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION
-----------
The unicharambigs file is used by Tesseract to represent possible
ambiguities between characters, or groups of characters.

The file contains a number of lines, laid out as follows:

...........................
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
...........................

[horizontal]
Field one:: the number of characters contained in field two
Field two:: the character sequence to be replaced
Field three:: the number of characters contained in field four
Field four:: the character sequence used to replace field two
Field five:: contains either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement.

Characters appearing in fields two and four should also appear in the
unicharset file. The numbers in fields one and three count
unichars, not bytes.

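The line layout described above can be sketched as a small parser. This is only an illustration, not Tesseract's own loader: the names `Ambiguity` and `parse_ambig_line` are hypothetical, and the sketch assumes that fields are separated by single tabs and that the unichars inside fields two and four are separated by single spaces.

```python
from typing import NamedTuple, Tuple


class Ambiguity(NamedTuple):
    source: Tuple[str, ...]  # field two: unichars to be replaced
    target: Tuple[str, ...]  # field four: replacement unichars
    mandatory: bool          # field five: 1 = mandatory, 0 = optional


def parse_ambig_line(line: str) -> Ambiguity:
    """Parse one line of the five-field layout (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    n_source, source_text, n_target, target_text, flag = fields

    # Assumption: unichars within a field are separated by single spaces.
    source = tuple(source_text.split(" "))
    target = tuple(target_text.split(" "))

    # Fields one and three count unichars, not bytes.
    if len(source) != int(n_source):
        raise ValueError("field one does not match the length of field two")
    if len(target) != int(n_target):
        raise ValueError("field three does not match the length of field four")
    if flag not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")

    return Ambiguity(source, target, flag == "1")
```

For instance, `parse_ambig_line("1\tm\t2\tr n\t0")` yields an optional rule whose source is `("m",)` and whose target is `("r", "n")`.
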
EXAMPLE
-------

...............................
2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
...............................

In this example, all instances of the '2' character sequence '''' will
*always* be replaced by the '1' character sequence '"'; a '1' character
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
the '3' character sequence 'iii' *may* be replaced by the '1' character
sequence 'm'.

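The difference between mandatory and optional replacements can be illustrated with a toy rewrite over a list of unichars. This is a sketch only and does not reflect how Tesseract applies ambiguities internally: `apply_mandatory` and `EXAMPLE_RULES` are hypothetical names, rules are assumed to be `(source, target, mandatory)` tuples, and matching is a simple left-to-right scan. Optional rules are deliberately left alone, since they only describe alternatives that *may* be considered.

```python
# The three example rules as (source, target, mandatory) tuples.
EXAMPLE_RULES = [
    (("'", "'"), ('"',), True),       # '' is always replaced by "
    (("m",), ("r", "n"), False),      # m may be replaced by rn
    (("i", "i", "i"), ("m",), False), # iii may be replaced by m
]


def apply_mandatory(unichars, rules):
    """Rewrite a unichar list using only the mandatory rules (toy sketch)."""
    out = []
    i = 0
    while i < len(unichars):
        for source, target, mandatory in rules:
            n = len(source)
            # Skip optional rules and guard against empty sources.
            if mandatory and n and tuple(unichars[i:i + n]) == tuple(source):
                out.extend(target)
                i += n
                break
        else:
            # No mandatory rule matched at this position: keep the unichar.
            out.append(unichars[i])
            i += 1
    return out
```

With these rules, two consecutive apostrophes collapse into one double quote, while 'rn', 'm' and 'iii' pass through unchanged because their rules are optional.
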
HISTORY
-------
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs ('dangerous ambiguities'), was used. That
format was almost identical, except that only mandatory replacements could be
specified, and field 5 was absent.

BUGS
----
This is a documentation "bug": it's not currently clear what should be done
in the case of ligatures (such as 'fi') which may also appear as regular
letters in the unicharset.

SEE ALSO
--------
tesseract(1), unicharset(5)