<simpara>Characters appearing in fields two and four should also appear in
the unicharset. The numbers in fields one and three count unichars,
not bytes.</simpara>
</refsect1>
<refsect1 id="_example">
<title>EXAMPLE</title>
<literallayout class="monospaced">2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0</literallayout>
<simpara>In this example, all instances of the <emphasis>2</emphasis>-character sequence <emphasis>''</emphasis> will
<emphasis role="strong">always</emphasis> be replaced by the <emphasis>1</emphasis>-character sequence <emphasis>"</emphasis>; a <emphasis>1</emphasis>-character
sequence <emphasis>m</emphasis> <emphasis role="strong">may</emphasis> be replaced by the <emphasis>2</emphasis>-character sequence <emphasis>rn</emphasis>; and
the <emphasis>3</emphasis>-character sequence <emphasis>iii</emphasis> <emphasis role="strong">may</emphasis> be replaced by the <emphasis>1</emphasis>-character
sequence <emphasis>m</emphasis>.</simpara>
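<simpara>As an illustrative sketch only (this is not Tesseract's own parser), the
following Python reads the three example lines and reports how each would be
interpreted. It assumes that fields are separated by whitespace exactly as
shown above, and the names Ambig and parse_ambig_line are hypothetical. It
does not handle anything other than rule lines of this shape.</simpara>
<literallayout class="monospaced">
```python
from typing import NamedTuple


class Ambig(NamedTuple):
    source: str      # unichars to look for, joined (fields 1 and 2)
    target: str      # unichars to substitute, joined (fields 3 and 4)
    mandatory: bool  # field 5: 1 means always replace, 0 means may replace


def parse_ambig_line(line):
    """Parse one whitespace-separated rule line (hypothetical helper)."""
    tokens = line.split()
    try:
        n_src = int(tokens[0])          # field 1: count of source unichars
        n_dst = int(tokens[1 + n_src])  # field 3: count of target unichars
    except (IndexError, ValueError):
        raise ValueError("malformed rule: " + repr(line)) from None
    # Expect exactly: count, sources, count, targets, type flag.
    if len(tokens) != 3 + n_src + n_dst or tokens[-1] not in ("0", "1"):
        raise ValueError("malformed rule: " + repr(line))
    src = tokens[1:1 + n_src]
    dst = tokens[2 + n_src:2 + n_src + n_dst]
    return Ambig("".join(src), "".join(dst), tokens[-1] == "1")


EXAMPLE = """2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0"""

rules = [parse_ambig_line(text) for text in EXAMPLE.splitlines()]
for rule in rules:
    kind = "always" if rule.mandatory else "may"
    print(repr(rule.source), kind, "become", repr(rule.target))
```
</literallayout>
<simpara>Counting whitespace-separated tokens mirrors the rule that fields one and
three count unichars rather than bytes, although a unichar that the
unicharset defines as more than one code point would still occupy a single
token.</simpara>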
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>The unicharambigs file first appeared in Tesseract 3.00. Prior to that, a
similar format called DangAmbigs (<emphasis>dangerous ambiguities</emphasis>) was used. That
format was almost identical, except that only mandatory replacements could be
specified and field 5 was absent.</simpara>
</refsect1>
<refsect1 id="_bugs">
<title>BUGS</title>
<simpara>This is a documentation "bug": it’s not currently clear what should be done
in the case of ligatures (such as <emphasis>fi</emphasis>) which may also appear as regular
letters in the unicharset.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>