mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2024-11-24 11:09:06 +08:00
58e06c8c45
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20
68 lines
2.1 KiB
Plaintext
68 lines
2.1 KiB
Plaintext
UNICHARAMBIGS(5)
|
|
================
|
|
|
|
NAME
|
|
----
|
|
unicharambigs - Tesseract unicharset ambiguities
|
|
|
|
DESCRIPTION
|
|
-----------
|
|
The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
|
|
is used by Tesseract to represent possible ambiguities between characters,
|
|
or groups of characters.
|
|
|
|
The file contains a number of lines, laid out as follow:
|
|
|
|
...........................
|
|
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
|
|
...........................
|
|
|
|
[horizontal]
|
|
Field one:: the number of characters contained in field two
|
|
Field two:: the character sequence to be replaced
|
|
Field three:: the number of characters contained in field four
|
|
Field four:: the character sequence used to replace field two
|
|
Field five:: contains either 1 or 0. 1 denotes a mandatory
|
|
replacement, 0 denotes an optional replacement.
|
|
|
|
Characters appearing in fields two and four should appear in
|
|
unicharset. The numbers in fields one and three refer to the
|
|
number of unichars (not bytes).
|
|
|
|
EXAMPLE
|
|
-------
|
|
|
|
...............................
|
|
2 ' ' 1 " 1
|
|
1 m 2 r n 0
|
|
3 i i i 1 m 0
|
|
...............................
|
|
|
|
In this example, all instances of the '2' character sequence '''' will
|
|
*always* be replaced by the '1' character sequence '"'; a '1' character
|
|
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
|
|
the '3' character sequence *may* be replaced by the '1' character
|
|
sequence 'm'.
|
|
|
|
HISTORY
|
|
-------
|
|
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
|
|
similar format, called DangAmbigs ('dangerous ambiguities') was used: the
|
|
format was almost identical, except only mandatory replacements could be
|
|
specified, and field 5 was absent.
|
|
|
|
BUGS
|
|
----
|
|
This is a documentation "bug": it's not currently clear what should be done
|
|
in the case of ligatures (such as 'fi') which may also appear as regular
|
|
letters in the unicharset.
|
|
|
|
SEE ALSO
|
|
--------
|
|
tesseract(1), unicharset(5)
|
|
|
|
AUTHOR
|
|
------
|
|
The Tesseract OCR engine was written by Ray Smith and his research groups
|
|
at Hewlett Packard (1985-1995) and Google (2006-present).
|