mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2025-01-23 10:56:19 +08:00
64 lines
1.9 KiB
Plaintext
64 lines
1.9 KiB
Plaintext
|
UNICHARAMBIGS(5)
|
||
|
================
|
||
|
|
||
|
NAME
|
||
|
----
|
||
|
unicharambigs - Tesseract unicharset ambiguities
|
||
|
|
||
|
DESCRIPTION
|
||
|
-----------
|
||
|
The unicharset file is used by Tesseract to represent possible
|
||
|
ambiguities between characters, or groups of characters.
|
||
|
|
||
|
The file contains a number of lines, laid out as follow:
|
||
|
|
||
|
...........................
|
||
|
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
|
||
|
...........................
|
||
|
|
||
|
[horizontal]
|
||
|
Field one:: the number of characters contained in field two
|
||
|
Field two:: the character sequence to be replaced
|
||
|
Field three:: the number of characters contained in field four
|
||
|
Field four:: the character sequence used to replace field two
|
||
|
Field five:: contains either 1 or 0. 1 denotes a mandatory
|
||
|
replacement, 0 denotes an optional replacement.
|
||
|
|
||
|
Characters appearing in fields two and four should appear in
|
||
|
unicharset. The numbers in fields one and three refer to the
|
||
|
number of unichars (not bytes).
|
||
|
|
||
|
EXAMPLE
|
||
|
-------
|
||
|
|
||
|
...............................
|
||
|
2 ' ' 1 " 1
|
||
|
1 m 2 r n 0
|
||
|
3 i i i 1 m 0
|
||
|
...............................
|
||
|
|
||
|
In this example, all instances of the '2' character sequence '''' will
|
||
|
*always* be replaced by the '1' character sequence '"'; a '1' character
|
||
|
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
|
||
|
the '3' character sequence *may* be replaced by the '1' character
|
||
|
sequence 'm'.
|
||
|
|
||
|
HISTORY
|
||
|
-------
|
||
|
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
|
||
|
similar format, called DangAmbigs ('dangerous ambiguities') was used: the
|
||
|
format was almost identical, except only mandatory replacements could be
|
||
|
specified, and field 5 was absent.
|
||
|
|
||
|
BUGS
|
||
|
----
|
||
|
This is a documentation "bug": it's not currently clear what should be done
|
||
|
in the case of ligatures (such as 'fi') which may also appear as regular
|
||
|
letters in the unicharset.
|
||
|
|
||
|
SEE ALSO
|
||
|
--------
|
||
|
tesseract(1), unicharset(5)
|
||
|
|
||
|
|