tesseract/doc/unicharambigs.5.html

700 lines
16 KiB
HTML
Raw Normal View History

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.5.2" />
<title>UNICHARAMBIGS(5)</title>
<style type="text/css">
/* Debug borders */
p, li, dt, dd, div, pre, h1, h2, h3, h4, h5, h6 {
/*
border: 1px solid red;
*/
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
tt {
color: navy;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
font-family: sans-serif;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
div.sectionbody {
font-family: serif;
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
pre {
padding: 0;
margin: 0;
}
span#author {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
font-size: 1.1em;
}
span#email {
}
span#revnumber, span#revdate, span#revremark {
font-family: sans-serif;
}
div#footer {
font-family: sans-serif;
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
div#footer-text {
float: left;
padding-bottom: 0.5em;
}
div#footer-badges {
float: right;
padding-bottom: 0.5em;
}
div#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.tableblock, div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #dddddd;
color: #777777;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > div.content {
white-space: pre;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-family: sans-serif;
font-weight: bold;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
@media print {
div#footer-badges { display: none; }
}
div#toc {
margin-bottom: 2.5em;
}
div#toctitle {
color: #527bbd;
font-family: sans-serif;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
/* Overrides for manpage documents */
h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
h2 {
border-style: none;
}
div.sectionbody {
margin-left: 5%;
}
@media print {
div#toc { display: none; }
}
/* Workarounds for IE6's broken and incomplete CSS2. */
div.sidebar-content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.sidebar-title, div.image-title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
margin-top: 0.0em;
margin-bottom: 0.5em;
}
div.listingblock div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock-attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock-content {
white-space: pre;
}
div.verseblock-attribution {
padding-top: 0.75em;
text-align: left;
}
div.exampleblock-content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
/* IE6 sets dynamically generated links as visited. */
div#toc a:visited { color: blue; }
</style>
<script type="text/javascript">
/*<![CDATA[*/
window.onload = function(){asciidoc.footnotes();}
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([2-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
var cont = document.getElementById("content");
var noteholder = document.getElementById("footnotes");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
}
}
/*]]>*/
</script>
</head>
<body>
<div id="header">
<h1>
UNICHARAMBIGS(5) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>unicharambigs -
Tesseract unicharset ambiguities
</p>
</div>
</div>
<div id="content">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.</p></div>
<div class="paragraph"><p>The file contains a number of lines, laid out as follow:</p></div>
<div class="literalblock">
<div class="content">
<pre><tt>[num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num]</tt></pre>
</div></div>
<div class="hdlist"><table>
<tr>
<td class="hdlist1">
Field one
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the number of characters contained in field two
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field two
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the character sequence to be replaced
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field three
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the number of characters contained in field four
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field four
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the character sequence used to replace field two
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field five
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
contains either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement.
</p>
</td>
</tr>
</table></div>
<div class="paragraph"><p>Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the
number of unichars (not bytes).</p></div>
</div>
<h2 id="_example">EXAMPLE</h2>
<div class="sectionbody">
<div class="literalblock">
<div class="content">
<pre><tt>2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0</tt></pre>
</div></div>
<div class="paragraph"><p>In this example, all instances of the <em>2</em> character sequence <em>'</em>' will
<strong>always</strong> be replaced by the <em>1</em> character sequence <em>"</em>; a <em>1</em> character
sequence <em>m</em> <strong>may</strong> be replaced by the <em>2</em> character sequence <em>rn</em>, and
the <em>3</em> character sequence <strong>may</strong> be replaced by the <em>1</em> character
sequence <em>m</em>.</p></div>
</div>
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
<div class="paragraph"><p>The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs (<em>dangerous ambiguities</em>) was used: the
format was almost identical, except only mandatory replacements could be
specified, and field 5 was absent.</p></div>
</div>
<h2 id="_bugs">BUGS</h2>
<div class="sectionbody">
<div class="paragraph"><p>This is a documentation "bug": it&#8217;s not currently clear what should be done
in the case of ligatures (such as <em>fi</em>) which may also appear as regular
letters in the unicharset.</p></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), unicharset(5)</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2012-02-08 10:59:49 PDT
</div>
</div>
</body>
</html>