UNICHARAMBIGS(5)
================

NAME
----
unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION
-----------
The unicharambigs file (a component of traineddata, see combine_tessdata(1))
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.

The file contains a number of lines, laid out as follows:

...........................
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
...........................

[horizontal]
Field one:: the number of characters contained in field two
Field two:: the character sequence to be replaced
Field three:: the number of characters contained in field four
Field four:: the character sequence used to replace field two
Field five:: contains either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement.

Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the
number of unichars (not bytes).
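
The field layout above can be checked mechanically. The following is a
minimal Python sketch, not part of Tesseract: the names `Ambig` and
`parse_v1_line` are hypothetical, and it assumes that the unichars inside
fields two and four are separated by single spaces, as in the example in
this page.

```python
from typing import List, NamedTuple


class Ambig(NamedTuple):
    """One ambiguity entry (hypothetical helper type, not a Tesseract API)."""
    wrong: List[str]        # field two, split into unichars
    replacement: List[str]  # field four, split into unichars
    mandatory: bool         # field five: True for 1, False for 0


def parse_v1_line(line: str) -> Ambig:
    """Parse one TAB-separated v1 line and validate the unichar counts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-separated fields: {line!r}")
    n_wrong, wrong, n_repl, repl, kind = fields
    # Assumption: unichars within a field are separated by single spaces.
    wrong_parts = wrong.split(" ")
    repl_parts = repl.split(" ")
    # Fields one and three count unichars, not bytes.
    if len(wrong_parts) != int(n_wrong):
        raise ValueError(f"field one does not match field two: {line!r}")
    if len(repl_parts) != int(n_repl):
        raise ValueError(f"field three does not match field four: {line!r}")
    if kind not in ("0", "1"):
        raise ValueError(f"field five must be 0 or 1: {line!r}")
    return Ambig(wrong_parts, repl_parts, kind == "1")
```

A caller would skip the version identifier on the first line of the file
and pass each remaining line to `parse_v1_line`.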

EXAMPLE
-------

...............................
v1
2       ' '     1       "     1
1       m       2       r n   0
3       i i i   1       m     0
...............................

The first line is a version identifier.
In this example, all instances of the '2' character sequence '''' will
*always* be replaced by the '1' character sequence '"'; a '1' character
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
the '3' character sequence 'iii' *may* be replaced by the '1' character
sequence 'm'.

Versions 3.03 and later support a new, simpler format for the unicharambigs
file:

...............................
v2
'' " 1
m rn 0
iii m 0
...............................

In this format, the "error" and "correction" are simple UTF-8 strings
separated by a space, and, after another space, the same type specifier
as v1 (0 for optional and 1 for mandatory substitution). The downside
of this simpler format is that Tesseract has to encode the UTF-8 strings
into the components of the unicharset. In complex scripts, this encoding
may be ambiguous. In this case, the encoding is chosen so as to use the
fewest UTF-8 characters for each component, i.e. the shortest unicharset
components will make up the encoding.
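
As an illustration of the v2 layout and of the "shortest components" rule,
here is a small Python sketch. It is one reading of the paragraph above,
not Tesseract's actual implementation: `parse_v2_line` and `encode` are
hypothetical names, the unicharset is modelled as a plain set of strings,
and lengths are measured in Unicode code points.

```python
from typing import List, Optional, Set, Tuple


def parse_v2_line(line: str) -> Tuple[str, str, bool]:
    """Split a v2 line into (error, correction, mandatory)."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 3 or parts[2] not in ("0", "1"):
        raise ValueError(f"malformed v2 line: {line!r}")
    return parts[0], parts[1], parts[2] == "1"


def encode(text: str, unicharset: Set[str]) -> Optional[List[str]]:
    """Segment text into unicharset components, trying the shortest first.

    Returns None if no complete segmentation exists. Backtracks to a
    longer component only when the shorter one leads to a dead end.
    """
    if not text:
        return []
    for length in range(1, len(text) + 1):
        head = text[:length]
        if head in unicharset:
            rest = encode(text[length:], unicharset)
            if rest is not None:
                return [head] + rest
    return None
```

With a unicharset containing `r`, `n` and `rn`, this sketch encodes `rn`
as the two components `r` and `n`, because the shorter components are
tried first.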

HISTORY
-------
The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs ('dangerous ambiguities'), was used: the
format was almost identical, except that only mandatory replacements could
be specified, and field 5 was absent.

BUGS
----
This is a documentation "bug": it's not currently clear what should be done
in the case of ligatures (such as 'fi') which may also appear as regular
letters in the unicharset.

SEE ALSO
--------
tesseract(1), unicharset(5)
https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-2018).