tesseract/doc/unicharambigs.5.asc

UNICHARAMBIGS(5)
================

NAME
----
unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION
-----------
The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.

The file contains a number of lines, laid out as follow:

...........................
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
...........................

[horizontal]
Field one:: the number of characters contained in field two
Field two:: the character sequence to be replaced
Field three:: the number of characters contained in field four
Field four:: the character sequence used to replace field two
Field five:: contains either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement.

Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the
number of unichars (not bytes).

EXAMPLE
-------

...............................
2       ' '     1       "     1
1       m       2       r n   0
3       i i i   1       m     0
...............................

In this example, all instances of the '2' character sequence '''' will
*always* be replaced by the '1' character sequence '"'; a '1' character
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
the '3' character sequence *may* be replaced by the '1' character
sequence 'm'.

HISTORY
-------
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs ('dangerous ambiguities') was used: the
format was almost identical, except only mandatory replacements could be
specified, and field 5 was absent.

BUGS
----
This is a documentation "bug": it's not currently clear what should be done
in the case of ligatures (such as 'fi') which may also appear as regular
letters in the unicharset.

SEE ALSO
--------
tesseract(1), unicharset(5)

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).