Update man pages for Tesseract 3.02.

git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20
david.eger@gmail.com 2012-02-09 22:55:47 +00:00
parent 78a8356a76
commit 58e06c8c45
45 changed files with 4702 additions and 1520 deletions

46
doc/ambiguous_words.1 Normal file

@ -0,0 +1,46 @@
'\" t
.\" Title: ambiguous_words
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "AMBIGUOUS_WORDS" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
ambiguous_words \- generate sets of words Tesseract is likely to find ambiguous
.SH "SYNOPSIS"
.sp
\fBambiguous_words\fR [\-l lang] \fITESSDATADIR\fR \fIWORDLIST\fR \fIAMBIGUOUSFILE\fR
.SH "DESCRIPTION"
.sp
ambiguous_words(1) runs Tesseract in a special mode, and for each word in the word list, produces a set of words which Tesseract thinks might be ambiguous with it\&. \fITESSDATADIR\fR must be set to the absolute path of a directory containing \fItessdata/lang\&.traineddata\fR\&.
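.sp
For example, the following might be run against an English word list (the word list and output file names are illustrative, and /usr/local/share/ is assumed to contain \fItessdata/eng\&.traineddata\fR):
.sp
.if n \{\
.RS 4
.\}
.nf
# illustrative paths and file names
ambiguous_words \-l eng /usr/local/share/ words\&.txt ambig\&.txt
.fi
.if n \{\
.RE
.\}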
.SH "SEE ALSO"
.sp
tesseract(1)
.SH "COPYING"
.sp
Copyright (C) 2012 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

32
doc/ambiguous_words.1.asc Normal file

@ -0,0 +1,32 @@
AMBIGUOUS_WORDS(1)
==================
:doctype: manpage
NAME
----
ambiguous_words - generate sets of words Tesseract is likely to find ambiguous
SYNOPSIS
--------
*ambiguous_words* [-l lang] 'TESSDATADIR' 'WORDLIST' 'AMBIGUOUSFILE'
DESCRIPTION
-----------
ambiguous_words(1) runs Tesseract in a special mode, and for each word
in the word list, produces a set of words which Tesseract thinks might be
ambiguous with it. 'TESSDATADIR' must be set to the absolute path of
a directory containing 'tessdata/lang.traineddata'.
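For example, the following might be run against an English word list
(the word list and output file names are illustrative, and
/usr/local/share/ is assumed to contain tessdata/eng.traineddata):
    # illustrative paths and file names
    ambiguous_words -l eng /usr/local/share/ words.txt ambig.txt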
SEE ALSO
--------
tesseract(1)
COPYING
-------
Copyright \(C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

616
doc/ambiguous_words.1.html Normal file

@ -0,0 +1,616 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.5.2" />
<title>AMBIGUOUS_WORDS(1)</title>
<style type="text/css">
/* Debug borders */
p, li, dt, dd, div, pre, h1, h2, h3, h4, h5, h6 {
/*
border: 1px solid red;
*/
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
tt {
color: navy;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
font-family: sans-serif;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
div.sectionbody {
font-family: serif;
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
pre {
padding: 0;
margin: 0;
}
span#author {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
font-size: 1.1em;
}
span#email {
}
span#revnumber, span#revdate, span#revremark {
font-family: sans-serif;
}
div#footer {
font-family: sans-serif;
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
div#footer-text {
float: left;
padding-bottom: 0.5em;
}
div#footer-badges {
float: right;
padding-bottom: 0.5em;
}
div#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.tableblock, div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #dddddd;
color: #777777;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > div.content {
white-space: pre;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-family: sans-serif;
font-weight: bold;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overridden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
@media print {
div#footer-badges { display: none; }
}
div#toc {
margin-bottom: 2.5em;
}
div#toctitle {
color: #527bbd;
font-family: sans-serif;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
/* Overrides for manpage documents */
h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
h2 {
border-style: none;
}
div.sectionbody {
margin-left: 5%;
}
@media print {
div#toc { display: none; }
}
/* Workarounds for IE6's broken and incomplete CSS2. */
div.sidebar-content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.sidebar-title, div.image-title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
margin-top: 0.0em;
margin-bottom: 0.5em;
}
div.listingblock div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock-attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock-content {
white-space: pre;
}
div.verseblock-attribution {
padding-top: 0.75em;
text-align: left;
}
div.exampleblock-content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
/* IE6 sets dynamically generated links as visited. */
div#toc a:visited { color: blue; }
</style>
<script type="text/javascript">
/*<![CDATA[*/
window.onload = function(){asciidoc.footnotes();}
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([2-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
var cont = document.getElementById("content");
var noteholder = document.getElementById("footnotes");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE returns the full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
}
}
/*]]>*/
</script>
</head>
<body>
<div id="header">
<h1>
AMBIGUOUS_WORDS(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>ambiguous_words -
generate sets of words Tesseract is likely to find ambiguous
</p>
</div>
</div>
<div id="content">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>ambiguous_words</strong> [-l lang] <em>TESSDATADIR</em> <em>WORDLIST</em> <em>AMBIGUOUSFILE</em></p></div>
</div>
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>ambiguous_words(1) runs Tesseract in a special mode, and for each word
in the word list, produces a set of words which Tesseract thinks might be
ambiguous with it. <em>TESSDATADIR</em> must be set to the absolute path of
a directory containing <em>tessdata/lang.traineddata</em>.</p></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1)</p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2012-02-07 13:38:29 PDT
</div>
</div>
</body>
</html>

40
doc/ambiguous_words.1.xml Normal file

@ -0,0 +1,40 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refmeta>
<refentrytitle>ambiguous_words</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&nbsp;</refmiscinfo>
<refmiscinfo class="manual">&nbsp;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>ambiguous_words</refname>
<refpurpose>generate sets of words Tesseract is likely to find ambiguous</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">ambiguous_words</emphasis> [-l lang] <emphasis>TESSDATADIR</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>AMBIGUOUSFILE</emphasis></simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>ambiguous_words(1) runs Tesseract in a special mode, and for each word
in the word list, produces a set of words which Tesseract thinks might be
ambiguous with it. <emphasis>TESSDATADIR</emphasis> must be set to the absolute path of
a directory containing <emphasis>tessdata/lang.traineddata</emphasis>.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1)</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,13 +1,13 @@
'\" t
.\" Title: cntraining
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "CNTRAINING" "1" "09/30/2010" "\ \&" "\ \&"
.TH "CNTRAINING" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@ -31,15 +31,24 @@
cntraining \- character normalization training for Tesseract
.SH "SYNOPSIS"
.sp
\fBcntraining\fR \fIFILE\fR\&...
\fBcntraining\fR [\-D \fIdir\fR] \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
cntraining takes a list of \&.tr files, from which it generates the normproto data file (the character normalization sensitivity prototypes)\&.
cntraining takes a list of \&.tr files, from which it generates the \fBnormproto\fR data file (the character normalization sensitivity prototypes)\&.
.SH "OPTIONS"
.PP
\-D \fIdir\fR
.RS 4
Directory to write output files to\&.
.RE
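.sp
For example, the following (with illustrative \&.tr file names) asks cntraining to write its output, including \fBnormproto\fR, into \fI\&./out/\fR:
.sp
.if n \{\
.RS 4
.\}
.nf
# illustrative file names
cntraining \-D \&./out/ eng\&.arial\&.exp0\&.tr eng\&.times\&.exp0\&.tr
.fi
.if n \{\
.RE
.\}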
.SH "SEE ALSO"
.sp
tesseract(1), mftraining(1)
tesseract(1), shapeclustering(1), mftraining(1)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "COPYING"
.sp
Copyright (c) Hewlett\-Packard Company, 1988\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -7,16 +7,22 @@ cntraining - character normalization training for Tesseract
SYNOPSIS
--------
*cntraining* 'FILE'...
*cntraining* [-D 'dir'] 'FILE'...
DESCRIPTION
-----------
cntraining takes a list of .tr files, from which it generates the
normproto data file (the character normalization sensitivity prototypes).
*normproto* data file (the character normalization sensitivity
prototypes).
OPTIONS
--------
-D 'dir'::
Directory to write output files to.
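For example, the following (with illustrative .tr file names) asks
cntraining to write its output, including *normproto*, into ./out/:
    # illustrative file names
    cntraining -D ./out/ eng.arial.exp0.tr eng.times.exp0.tr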
SEE ALSO
--------
tesseract(1), mftraining(1)
tesseract(1), shapeclustering(1), mftraining(1)
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
@ -24,3 +30,8 @@ COPYING
-------
Copyright (c) Hewlett-Packard Company, 1988.
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

View File

@ -582,16 +582,30 @@ CNTRAINING(1) Manual Page
<div id="content">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>cntraining</strong> <em>FILE</em>&#8230;</p></div>
<div class="paragraph"><p><strong>cntraining</strong> [-D <em>dir</em>] <em>FILE</em>&#8230;</p></div>
</div>
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>cntraining takes a list of .tr files, from which it generates the
normproto data file (the character normalization sensitivity prototypes).</p></div>
<strong>normproto</strong> data file (the character normalization sensitivity
prototypes).</p></div>
</div>
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="dlist"><dl>
<dt class="hdlist1">
-D <em>dir</em>
</dt>
<dd>
<p>
Directory to write output files to.
</p>
</dd>
</dl></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), mftraining(1)</p></div>
<div class="paragraph"><p>tesseract(1), shapeclustering(1), mftraining(1)</p></div>
<div class="paragraph"><p><a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_copying">COPYING</h2>
@ -599,11 +613,16 @@ normproto data file (the character normalization sensitivity prototypes).</p></d
<div class="paragraph"><p>Copyright (c) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-30 02:57:07 IST
Last updated 2012-02-09 11:37:31 PDT
</div>
</div>
</body>

View File

@ -14,16 +14,32 @@
<refpurpose>character normalization training for Tesseract</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">cntraining</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
<simpara><emphasis role="strong">cntraining</emphasis> [-D <emphasis>dir</emphasis>] <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>cntraining takes a list of .tr files, from which it generates the
normproto data file (the character normalization sensitivity prototypes).</simpara>
<emphasis role="strong">normproto</emphasis> data file (the character normalization sensitivity
prototypes).</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>
-D <emphasis>dir</emphasis>
</term>
<listitem>
<simpara>
Directory to write output files to.
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), mftraining(1)</simpara>
<simpara>tesseract(1), shapeclustering(1), mftraining(1)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
@ -31,4 +47,9 @@ normproto data file (the character normalization sensitivity prototypes).</simpa
<simpara>Copyright (c) Hewlett-Packard Company, 1988.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,13 +1,13 @@
'\" t
.\" Title: combine_tessdata
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "COMBINE_TESSDATA" "1" "09/30/2010" "\ \&" "\ \&"
.TH "COMBINE_TESSDATA" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@ -56,8 +56,8 @@ Specify option \-e if you would like to extract individual components from a com
.RS 4
.\}
.nf
combine_tessdata \-e tessdata/eng\&.traineddata
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
combine_tessdata \-e tessdata/eng\&.traineddata \e
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
.fi
.if n \{\
.RE
@ -71,8 +71,8 @@ Specify option \-o to overwrite individual components of the given [lang]\&.trai
.RS 4
.\}
.nf
combine_tessdata \-o tessdata/eng\&.traineddata
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharambigs
combine_tessdata \-o tessdata/eng\&.traineddata \e
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharambigs
.fi
.if n \{\
.RE
@ -105,12 +105,111 @@ This will create /home/$USER/temp/eng\&.* files with individual tessdata compone
.SH "CAVEATS"
.sp
\fIPrefix\fR refers to the full file prefix, including period (\&.)
.SH "COMPONENTS"
.sp
The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below\&. For more information on many of these files, see \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.PP
lang\&.config
.RS 4
(Optional) Language\-specific overrides to default config variables\&.
.RE
.PP
lang\&.unicharset
.RS 4
(Required) The list of symbols that Tesseract recognizes, with properties\&. See unicharset(5)\&.
.RE
.PP
lang\&.unicharambigs
.RS 4
(Optional) This file contains information on pairs of recognized symbols which are often confused\&. For example,
\fIrn\fR
and
\fIm\fR\&.
.RE
.PP
lang\&.inttemp
.RS 4
(Required) Character shape templates for each unichar\&. Produced by mftraining(1)\&.
.RE
.PP
lang\&.pffmtable
.RS 4
(Required) The number of features expected for each unichar\&. Produced by mftraining(1) from
\fB\&.tr\fR
files\&.
.RE
.PP
lang\&.normproto
.RS 4
(Required) Character normalization prototypes generated by cntraining(1) from
\fB\&.tr\fR
files\&.
.RE
.PP
lang\&.punc\-dawg
.RS 4
(Optional) A dawg made from punctuation patterns found around words\&. The "word" part is replaced by a single space\&.
.RE
.PP
lang\&.word\-dawg
.RS 4
(Optional) A dawg made from dictionary words from the language\&.
.RE
.PP
lang\&.number\-dawg
.RS 4
(Optional) A dawg made from tokens which originally contained digits\&. Each digit is replaced by a space character\&.
.RE
.PP
lang\&.freq\-dawg
.RS 4
(Optional) A dawg made from the most frequent words which would have gone into word\-dawg\&.
.RE
.PP
lang\&.fixed\-length\-dawgs
.RS 4
(Optional) Several dawgs of different fixed lengths \(em useful for languages like Chinese\&.
.RE
.PP
lang\&.cube\-unicharset
.RS 4
(Optional) A unicharset for cube, if cube was trained on a different set of symbols\&.
.RE
.PP
lang\&.cube\-word\-dawg
.RS 4
(Optional) A word dawg for cube\(cqs alternate unicharset\&. Not needed if Cube was trained with Tesseract\(cqs unicharset\&.
.RE
.PP
lang\&.shapetable
.RS 4
(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar\-id and font\&.
.RE
.PP
lang\&.bigram\-dawg
.RS 4
(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a
\fI?\fR\&.
.RE
.PP
lang\&.unambig\-dawg
.RS 4
(Optional) TODO: Describe\&.
.RE
.PP
lang\&.params\-training\-model
.RS 4
(Optional) TODO: Describe\&.
.RE
.SH "HISTORY"
.sp
combine_tessdata(1) first appeared in version 3\&.00 of Tesseract\&.
.SH "SEE ALSO"
.sp
tesseract(1)
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5)
.SH "COPYING"
.sp
Copyright (C) 2009, Google Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -26,8 +26,8 @@ Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract language config
file and the unicharset from tessdata/eng.traineddata run:
combine_tessdata -e tessdata/eng.traineddata
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
combine_tessdata -e tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
The desired config file and unicharset will be written to
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset
@ -36,8 +36,8 @@ Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite language config
and unichar ambiguities files in tessdata/eng.traineddata use:
combine_tessdata -o tessdata/eng.traineddata
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs
combine_tessdata -o tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs
As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.
@ -66,20 +66,100 @@ OPTIONS
*-u* '.traineddata' 'PATHPREFIX'
Unpacks the .traineddata using the provided prefix.
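For example, something like the following (note that the trailing period
is part of the prefix) unpacks the components of eng.traineddata into
/home/$USER/temp/eng.* files:
    combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.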
CAVEATS
-------
'Prefix' refers to the full file prefix, including period (.)
COMPONENTS
----------
The components in a Tesseract lang.traineddata file as of
Tesseract 3.02 are briefly described below. For more information on
many of these files, see
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
lang.config::
(Optional) Language-specific overrides to default config variables.
lang.unicharset::
(Required) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5).
lang.unicharambigs::
(Optional) This file contains information on pairs of recognized symbols
which are often confused. For example, 'rn' and 'm'.
lang.inttemp::
(Required) Character shape templates for each unichar. Produced by
mftraining(1).
lang.pffmtable::
(Required) The number of features expected for each unichar.
Produced by mftraining(1) from *.tr* files.
lang.normproto::
(Required) Character normalization prototypes generated by cntraining(1)
from *.tr* files.
lang.punc-dawg::
(Optional) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.
lang.word-dawg::
(Optional) A dawg made from dictionary words from the language.
lang.number-dawg::
(Optional) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.
lang.freq-dawg::
(Optional) A dawg made from the most frequent words which would have
gone into word-dawg.
lang.fixed-length-dawgs::
(Optional) Several dawgs of different fixed lengths -- useful for
languages like Chinese.
lang.cube-unicharset::
(Optional) A unicharset for cube, if cube was trained on a different set
of symbols.
lang.cube-word-dawg::
(Optional) A word dawg for cube's alternate unicharset. Not needed if Cube
was trained with Tesseract's unicharset.
lang.shapetable::
(Optional) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id
and font.
lang.bigram-dawg::
(Optional) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a '?'.
lang.unambig-dawg::
(Optional) TODO: Describe.
lang.params-training-model::
(Optional) TODO: Describe.
HISTORY
-------
combine_tessdata(1) first appeared in version 3.00 of Tesseract.
SEE ALSO
--------
tesseract(1)
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)
COPYING
-------
Copyright \(C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

View File

@ -601,8 +601,8 @@ from a combined traineddata file. For example, to extract language config
file and the unicharset from tessdata/eng.traineddata run:</p></div>
<div class="literalblock">
<div class="content">
<pre><tt>combine_tessdata -e tessdata/eng.traineddata
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</tt></pre>
<pre><tt>combine_tessdata -e tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</tt></pre>
</div></div>
<div class="paragraph"><p>The desired config file and unicharset will be written to
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</p></div>
@ -611,8 +611,8 @@ file and the unicharset from tessdata/eng.traineddata run:</p></div>
and unichar ambiguities files in tessdata/eng.traineddata use:</p></div>
<div class="literalblock">
<div class="content">
<pre><tt>combine_tessdata -o tessdata/eng.traineddata
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</tt></pre>
<pre><tt>combine_tessdata -o tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</tt></pre>
</div></div>
<div class="paragraph"><p>As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.</p></div>
@ -642,24 +642,190 @@ components from tessdata/eng.traineddata.</p></div>
<div class="sectionbody">
<div class="paragraph"><p><em>Prefix</em> refers to the full file prefix, including period (.)</p></div>
</div>
<h2 id="_components">COMPONENTS</h2>
<div class="sectionbody">
<div class="paragraph"><p>The components in a Tesseract lang.traineddata file as of
Tesseract 3.02 are briefly described below. For more information on
many of these files, see
<a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
<div class="dlist"><dl>
<dt class="hdlist1">
lang.config
</dt>
<dd>
<p>
(Optional) Language-specific overrides to default config variables.
</p>
</dd>
<dt class="hdlist1">
lang.unicharset
</dt>
<dd>
<p>
(Required) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5).
</p>
</dd>
<dt class="hdlist1">
lang.unicharambigs
</dt>
<dd>
<p>
(Optional) This file contains information on pairs of recognized symbols
which are often confused. For example, <em>rn</em> and <em>m</em>.
</p>
</dd>
<dt class="hdlist1">
lang.inttemp
</dt>
<dd>
<p>
(Required) Character shape templates for each unichar. Produced by
mftraining(1).
</p>
</dd>
<dt class="hdlist1">
lang.pffmtable
</dt>
<dd>
<p>
(Required) The number of features expected for each unichar.
Produced by mftraining(1) from <strong>.tr</strong> files.
</p>
</dd>
<dt class="hdlist1">
lang.normproto
</dt>
<dd>
<p>
(Required) Character normalization prototypes generated by cntraining(1)
from <strong>.tr</strong> files.
</p>
</dd>
<dt class="hdlist1">
lang.punc-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.
</p>
</dd>
<dt class="hdlist1">
lang.word-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from dictionary words from the language.
</p>
</dd>
<dt class="hdlist1">
lang.number-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.
</p>
</dd>
<dt class="hdlist1">
lang.freq-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from the most frequent words which would have
gone into word-dawg.
</p>
</dd>
<dt class="hdlist1">
lang.fixed-length-dawgs
</dt>
<dd>
<p>
(Optional) Several dawgs of different fixed lengths&#8201;&#8212;&#8201;useful for
languages like Chinese.
</p>
</dd>
<dt class="hdlist1">
lang.cube-unicharset
</dt>
<dd>
<p>
(Optional) A unicharset for cube, if cube was trained on a different set
of symbols.
</p>
</dd>
<dt class="hdlist1">
lang.cube-word-dawg
</dt>
<dd>
<p>
(Optional) A word dawg for cube&#8217;s alternate unicharset. Not needed if Cube
was trained with Tesseract&#8217;s unicharset.
</p>
</dd>
<dt class="hdlist1">
lang.shapetable
</dt>
<dd>
<p>
(Optional) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id
and font.
</p>
</dd>
<dt class="hdlist1">
lang.bigram-dawg
</dt>
<dd>
<p>
(Optional) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a <em>?</em>.
</p>
</dd>
<dt class="hdlist1">
lang.unambig-dawg
</dt>
<dd>
<p>
(Optional) TODO: Describe.
</p>
</dd>
<dt class="hdlist1">
lang.params-training-model
</dt>
<dd>
<p>
(Optional) TODO: Describe.
</p>
</dd>
</dl></div>
</div>
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
<div class="paragraph"><p>combine_tessdata(1) first appeared in version 3.00 of Tesseract</p></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1)</p></div>
<div class="paragraph"><p>tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)</p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-29 17:42:57 IST
Last updated 2012-02-08 10:52:17 PDT
</div>
</div>
</body>

View File

@ -28,15 +28,15 @@ classifier templates, ambiguities, language configs) located at, say,
<simpara>Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract language config
file and the unicharset from tessdata/eng.traineddata run:</simpara>
<literallayout class="monospaced">combine_tessdata -e tessdata/eng.traineddata
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</literallayout>
<literallayout class="monospaced">combine_tessdata -e tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</literallayout>
<simpara>The desired config file and unicharset will be written to
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</simpara>
<simpara>Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite language config
and unichar ambiguities files in tessdata/eng.traineddata use:</simpara>
<literallayout class="monospaced">combine_tessdata -o tessdata/eng.traineddata
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</literallayout>
<literallayout class="monospaced">combine_tessdata -o tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</literallayout>
<simpara>As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.</simpara>
<simpara>Note: the file names of the files to extract to and to overwrite from should
@ -62,17 +62,217 @@ components from tessdata/eng.traineddata.</simpara>
<title>CAVEATS</title>
<simpara><emphasis>Prefix</emphasis> refers to the full file prefix, including period (.)</simpara>
</refsect1>
<refsect1 id="_components">
<title>COMPONENTS</title>
<simpara>The components in a Tesseract lang.traineddata file as of
Tesseract 3.02 are briefly described below. For more information on
many of these files, see
<ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
<variablelist>
<varlistentry>
<term>
lang.config
</term>
<listitem>
<simpara>
(Optional) Language-specific overrides to default config variables.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.unicharset
</term>
<listitem>
<simpara>
(Required) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.unicharambigs
</term>
<listitem>
<simpara>
(Optional) This file contains information on pairs of recognized symbols
which are often confused. For example, <emphasis>rn</emphasis> and <emphasis>m</emphasis>.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.inttemp
</term>
<listitem>
<simpara>
(Required) Character shape templates for each unichar. Produced by
mftraining(1).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.pffmtable
</term>
<listitem>
<simpara>
(Required) The number of features expected for each unichar.
Produced by mftraining(1) from <emphasis role="strong">.tr</emphasis> files.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.normproto
</term>
<listitem>
<simpara>
(Required) Character normalization prototypes generated by cntraining(1)
from <emphasis role="strong">.tr</emphasis> files.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.punc-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.word-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from dictionary words from the language.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.number-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.freq-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from the most frequent words which would have
gone into word-dawg.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.fixed-length-dawgs
</term>
<listitem>
<simpara>
(Optional) Several dawgs of different fixed lengths&#8201;&#8212;&#8201;useful for
languages like Chinese.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.cube-unicharset
</term>
<listitem>
<simpara>
(Optional) A unicharset for cube, if cube was trained on a different set
of symbols.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.cube-word-dawg
</term>
<listitem>
<simpara>
(Optional) A word dawg for cube&#8217;s alternate unicharset. Not needed if Cube
was trained with Tesseract&#8217;s unicharset.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.shapetable
</term>
<listitem>
<simpara>
(Optional) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id
and font.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.bigram-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a <emphasis>?</emphasis>.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.unambig-dawg
</term>
<listitem>
<simpara>
(Optional) TODO: Describe.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.params-training-model
</term>
<listitem>
<simpara>
(Optional) TODO: Describe.
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>combine_tessdata(1) first appeared in version 3.00 of Tesseract.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1)</simpara>
<simpara>tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

55
doc/dawg2wordlist.1 Normal file

@ -0,0 +1,55 @@
'\" t
.\" Title: dawg2wordlist
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "DAWG2WORDLIST" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
dawg2wordlist \- convert a Tesseract DAWG to a wordlist
.SH "SYNOPSIS"
.sp
\fBdawg2wordlist\fR \fIUNICHARSET\fR \fIDAWG\fR \fIWORDLIST\fR
.SH "DESCRIPTION"
.sp
dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a list of words using a unicharset as key\&.
.SH "OPTIONS"
.sp
\fIUNICHARSET\fR The unicharset of the language\&. This is the unicharset generated by mftraining(1)\&.
.sp
\fIDAWG\fR The input DAWG, created by wordlist2dawg(1)\&.
.sp
\fIWORDLIST\fR Plain text (output) file in UTF\-8, one word per line\&.
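.sp
For example (all file names are illustrative; the DAWG could be one created by wordlist2dawg(1) or one extracted from a traineddata file with combine_tessdata(1)):
.sp
.if n \{\
.RS 4
.\}
.nf
# illustrative file names
dawg2wordlist eng\&.unicharset eng\&.word\-dawg eng\&.wordlist\&.txt
.fi
.if n \{\
.RE
.\}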
.SH "SEE ALSO"
.sp
tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "COPYING"
.sp
Copyright (C) 2012 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

45
doc/dawg2wordlist.1.asc Normal file

@ -0,0 +1,45 @@
DAWG2WORDLIST(1)
================
:doctype: manpage
NAME
----
dawg2wordlist - convert a Tesseract DAWG to a wordlist
SYNOPSIS
--------
*dawg2wordlist* 'UNICHARSET' 'DAWG' 'WORDLIST'
DESCRIPTION
-----------
dawg2wordlist(1) converts a Tesseract Directed Acyclic Word
Graph (DAWG) to a list of words using a unicharset as key.
OPTIONS
-------
'UNICHARSET'
The unicharset of the language. This is the unicharset
generated by mftraining(1).
'DAWG'
The input DAWG, created by wordlist2dawg(1).
'WORDLIST'
Plain text (output) file in UTF-8, one word per line.
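For example (all file names are illustrative; the DAWG could be one
created by wordlist2dawg(1) or one extracted from a traineddata file
with combine_tessdata(1)):
    # illustrative file names
    dawg2wordlist eng.unicharset eng.word-dawg eng.wordlist.txt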
SEE ALSO
--------
tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5),
combine_tessdata(1)
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
COPYING
-------
Copyright \(C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

626
doc/dawg2wordlist.1.html Normal file
View File

@ -0,0 +1,626 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.5.2" />
<title>DAWG2WORDLIST(1)</title>
<style type="text/css">
/* Debug borders */
p, li, dt, dd, div, pre, h1, h2, h3, h4, h5, h6 {
/*
border: 1px solid red;
*/
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
tt {
color: navy;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
font-family: sans-serif;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
div.sectionbody {
font-family: serif;
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
pre {
padding: 0;
margin: 0;
}
span#author {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
font-size: 1.1em;
}
span#email {
}
span#revnumber, span#revdate, span#revremark {
font-family: sans-serif;
}
div#footer {
font-family: sans-serif;
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
div#footer-text {
float: left;
padding-bottom: 0.5em;
}
div#footer-badges {
float: right;
padding-bottom: 0.5em;
}
div#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.tableblock, div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #dddddd;
color: #777777;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > div.content {
white-space: pre;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-family: sans-serif;
font-weight: bold;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overridden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
@media print {
div#footer-badges { display: none; }
}
div#toc {
margin-bottom: 2.5em;
}
div#toctitle {
color: #527bbd;
font-family: sans-serif;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
/* Overrides for manpage documents */
h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
h2 {
border-style: none;
}
div.sectionbody {
margin-left: 5%;
}
@media print {
div#toc { display: none; }
}
/* Workarounds for IE6's broken and incomplete CSS2. */
div.sidebar-content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.sidebar-title, div.image-title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
margin-top: 0.0em;
margin-bottom: 0.5em;
}
div.listingblock div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock-attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock-content {
white-space: pre;
}
div.verseblock-attribution {
padding-top: 0.75em;
text-align: left;
}
div.exampleblock-content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
/* IE6 sets dynamically generated links as visited. */
div#toc a:visited { color: blue; }
</style>
<script type="text/javascript">
/*<![CDATA[*/
window.onload = function(){asciidoc.footnotes();}
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([2-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
var cont = document.getElementById("content");
var noteholder = document.getElementById("footnotes");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
}
}
/*]]>*/
</script>
</head>
<body>
<div id="header">
<h1>
DAWG2WORDLIST(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>dawg2wordlist -
convert a Tesseract DAWG to a wordlist
</p>
</div>
</div>
<div id="content">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>dawg2wordlist</strong> <em>UNICHARSET</em> <em>DAWG</em> <em>WORDLIST</em></p></div>
</div>
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>dawg2wordlist(1) converts a Tesseract Directed Acyclic Word
Graph (DAWG) to a list of words using a unicharset as key.</p></div>
</div>
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="paragraph"><p><em>UNICHARSET</em>
The unicharset of the language. This is the unicharset
generated by mftraining(1).</p></div>
<div class="paragraph"><p><em>DAWG</em>
The input DAWG, created by wordlist2dawg(1)</p></div>
<div class="paragraph"><p><em>WORDLIST</em>
Plain text (output) file in UTF-8, one word per line</p></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5),
combine_tessdata(1)</p></div>
<div class="paragraph"><p><a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2012-02-07 13:38:01 PDT
</div>
</div>
</body>
</html>

50
doc/dawg2wordlist.1.xml Normal file
View File

@ -0,0 +1,50 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refmeta>
<refentrytitle>dawg2wordlist</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&nbsp;</refmiscinfo>
<refmiscinfo class="manual">&nbsp;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>dawg2wordlist</refname>
<refpurpose>convert a Tesseract DAWG to a wordlist</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">dawg2wordlist</emphasis> <emphasis>UNICHARSET</emphasis> <emphasis>DAWG</emphasis> <emphasis>WORDLIST</emphasis></simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>dawg2wordlist(1) converts a Tesseract Directed Acyclic Word
Graph (DAWG) to a list of words using a unicharset as key.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<simpara><emphasis>UNICHARSET</emphasis>
The unicharset of the language. This is the unicharset
generated by mftraining(1).</simpara>
<simpara><emphasis>DAWG</emphasis>
The input DAWG, created by wordlist2dawg(1)</simpara>
<simpara><emphasis>WORDLIST</emphasis>
Plain text (output) file in UTF-8, one word per line</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5),
combine_tessdata(1)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>
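A usage sketch consistent with the synopsis above; every file name is illustrative only (for instance, a unicharset and word DAWG previously unpacked from eng.traineddata, e.g. with combine_tessdata):

    # Hypothetical example: recover the word list behind an English DAWG.
    dawg2wordlist eng.unicharset eng.word-dawg eng.wordlist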

34
doc/generate_manpages.sh Executable file
View File

@ -0,0 +1,34 @@
#!/bin/bash
#
# File: generate_manpages.sh
# Description: Converts .asc files into man pages, etc. for Tesseract.
# Author: eger@google.com (David Eger)
# Created: 9 Feb 2012
#
# (C) Copyright 2012 Google Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
man_xslt=/usr/share/xml/docbook/stylesheet/docbook-xsl/manpages/docbook.xsl
asciidoc=$(which asciidoc)
xsltproc=$(which xsltproc)
if [[ -z "${asciidoc}" ]] || [[ -z "${xsltproc}" ]]; then
echo "Please make sure asciidoc and xsltproc are installed."
exit 1
else
for src in *.asc; do
pagename=${src/.asc/}
(${asciidoc} -d manpage ${src} &&
${asciidoc} -d manpage -b docbook ${src} &&
${xsltproc} ${man_xslt} ${pagename}.xml) ||
echo "Error generating ${pagename}"
done
fi
exit 0
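# For reference, the loop above is roughly equivalent to running the
# following by hand for a single page (tesseract.1.asc is used here purely
# as an illustrative page name):
#
#   asciidoc -d manpage tesseract.1.asc               # -> tesseract.1.html
#   asciidoc -d manpage -b docbook tesseract.1.asc    # -> tesseract.1.xml
#   xsltproc "${man_xslt}" tesseract.1.xml            # -> tesseract.1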

View File

@ -1,13 +1,13 @@
'\" t
.\" Title: mftraining
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MFTRAINING" "1" "09/30/2010" "\ \&" "\ \&"
.TH "MFTRAINING" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@ -34,15 +34,61 @@ mftraining \- feature training for Tesseract
mftraining \-U \fIunicharset\fR \-O \fIlang\&.unicharset\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
mftraining takes a list of \&.tr files, from which it generates the files inttemp (the shape prototypes) and pffmtable (the number of expected features for each character)\&. (A third file called Microfeat is also written by this program, but it is not used\&.)
mftraining takes a list of \&.tr files, from which it generates the files \fBinttemp\fR (the shape prototypes), \fBshapetable\fR, and \fBpffmtable\fR (the number of expected features for each character)\&. (A fourth file called Microfeat is also written by this program, but it is not used\&.)
.SH "OPTIONS"
.PP
\-U \fIFILE\fR
.RS 4
(Input) The unicharset generated by unicharset_extractor(1)
.RE
.PP
\-F \fIfont_properties_file\fR
.RS 4
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
.sp
\fI\-U\fR FILE The unicharset generated by unicharset_extractor
.if n \{\
.RS 4
.\}
.nf
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-X \fIxheights_file\fR
.RS 4
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi\&. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
.sp
\fI\-O\fR FILE The output unicharset that will be given to combine_tessdata\&.
.if n \{\
.RS 4
.\}
.nf
*font_name* *xheight*
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-D \fIdir\fR
.RS 4
Directory to write output files to\&.
.RE
.PP
\-O \fIFILE\fR
.RS 4
(Output) The output unicharset that will be given to combine_tessdata(1)
.RE
.SH "SEE ALSO"
.sp
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1)
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "COPYING"
.sp
Copyright (c) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0
Copyright (C) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,5 +1,6 @@
MFTRAINING(1)
=============
:doctype: manpage
NAME
----
@ -12,23 +13,44 @@ mftraining -U 'unicharset' -O 'lang.unicharset' 'FILE'...
DESCRIPTION
-----------
mftraining takes a list of .tr files, from which it generates the
files inttemp (the shape prototypes) and pffmtable (the number of
expected features for each character). (A third file called Microfeat
is also written by this program, but it is not used.)
files *inttemp* (the shape prototypes), *shapetable*, and *pffmtable*
(the number of expected features for each character). (A fourth file
called Microfeat is also written by this program, but it is not used.)
OPTIONS
-------
'-U' FILE
The unicharset generated by unicharset_extractor
-U 'FILE'::
(Input) The unicharset generated by unicharset_extractor(1)
'-O' FILE
The output unicharset that will be given to combine_tessdata.
-F 'font_properties_file'::
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
-X 'xheights_file'::
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
*font_name* *xheight*
-D 'dir'::
Directory to write output files to.
-O 'FILE'::
(Output) The output unicharset that will be given to combine_tessdata(1)
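As a hedged illustration of how these options fit together (every file name, font name, and numeric value below is invented for the example, not taken from any shipped training data):

    # Create minimal sample input files in the formats described above.
    cat > font_properties <<'EOF'
    timesitalic 1 0 0 1 0
    EOF
    cat > xheights <<'EOF'
    timesitalic 31
    EOF
    # Run mftraining over previously generated .tr files.
    mftraining -F font_properties -X xheights -D /tmp/training_out \
        -U unicharset -O eng.unicharset \
        eng.timesitalic.exp0.tr eng.timesitalic.exp1.tr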
SEE ALSO
--------
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1)
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
shapeclustering(1), unicharset(5)
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
COPYING
-------
Copyright (c) Hewlett-Packard Company, 1988
Copyright \(C) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

View File

@ -587,31 +587,84 @@ MFTRAINING(1) Manual Page
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>mftraining takes a list of .tr files, from which it generates the
files inttemp (the shape prototypes) and pffmtable (the number of
expected features for each character). (A third file called Microfeat
is also written by this program, but it is not used.)</p></div>
files <strong>inttemp</strong> (the shape prototypes), <strong>shapetable</strong>, and <strong>pffmtable</strong>
(the number of expected features for each character). (A fourth file
called Microfeat is also written by this program, but it is not used.)</p></div>
</div>
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="paragraph"><p><em>-U</em> FILE
The unicharset generated by unicharset_extractor</p></div>
<div class="paragraph"><p><em>-O</em> FILE
The output unicharset that will be given to combine_tessdata.</p></div>
<div class="dlist"><dl>
<dt class="hdlist1">
-U <em>FILE</em>
</dt>
<dd>
<p>
(Input) The unicharset generated by unicharset_extractor(1)
</p>
</dd>
<dt class="hdlist1">
-F <em>font_properties_file</em>
</dt>
<dd>
<p>
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
</p>
<div class="literalblock">
<div class="content">
<pre><tt>*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*</tt></pre>
</div></div>
</dd>
<dt class="hdlist1">
-X <em>xheights_file</em>
</dt>
<dd>
<p>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</p>
<div class="literalblock">
<div class="content">
<pre><tt>*font_name* *xheight*</tt></pre>
</div></div>
</dd>
<dt class="hdlist1">
-D <em>dir</em>
</dt>
<dd>
<p>
Directory to write output files to.
</p>
</dd>
<dt class="hdlist1">
-O <em>FILE</em>
</dt>
<dd>
<p>
(Output) The output unicharset that will be given to combine_tessdata(1)
</p>
</dd>
</dl></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1)</p></div>
<div class="paragraph"><p>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
shapeclustering(1), unicharset(5)</p></div>
<div class="paragraph"><p><a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (c) Hewlett-Packard Company, 1988
<div class="paragraph"><p>Copyright (C) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-30 03:04:15 IST
Last updated 2012-02-09 14:23:49 PDT
</div>
</div>
</body>

View File

@ -19,24 +19,81 @@
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>mftraining takes a list of .tr files, from which it generates the
files inttemp (the shape prototypes) and pffmtable (the number of
expected features for each character). (A third file called Microfeat
is also written by this program, but it is not used.)</simpara>
files <emphasis role="strong">inttemp</emphasis> (the shape prototypes), <emphasis role="strong">shapetable</emphasis>, and <emphasis role="strong">pffmtable</emphasis>
(the number of expected features for each character). (A fourth file
called Microfeat is also written by this program, but it is not used.)</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<simpara><emphasis>-U</emphasis> FILE
The unicharset generated by unicharset_extractor</simpara>
<simpara><emphasis>-O</emphasis> FILE
The output unicharset that will be given to combine_tessdata.</simpara>
<variablelist>
<varlistentry>
<term>
-U <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
(Input) The unicharset generated by unicharset_extractor(1)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-F <emphasis>font_properties_file</emphasis>
</term>
<listitem>
<simpara>
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
</simpara>
<literallayout class="monospaced">*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-X <emphasis>xheights_file</emphasis>
</term>
<listitem>
<simpara>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</simpara>
<literallayout class="monospaced">*font_name* *xheight*</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-D <emphasis>dir</emphasis>
</term>
<listitem>
<simpara>
Directory to write output files to.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-O <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
(Output) The output unicharset that will be given to combine_tessdata(1)
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1)</simpara>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
shapeclustering(1), unicharset(5)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (c) Hewlett-Packard Company, 1988
<simpara>Copyright (C) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

94
doc/shapeclustering.1 Normal file
View File

@ -0,0 +1,94 @@
'\" t
.\" Title: shapeclustering
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "SHAPECLUSTERING" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
shapeclustering \- shape clustering training for Tesseract
.SH "SYNOPSIS"
.sp
shapeclustering \-D \fIoutput_dir\fR \-U \fIunicharset\fR \-O \fImfunicharset\fR \-F \fIfont_props\fR \-X \fIxheights\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
shapeclustering(1) takes extracted feature \&.tr files (generated by tesseract(1) run in a special mode from box files) and produces a file \fBshapetable\fR and an enhanced unicharset\&. This program is still experimental, and is not required (yet) for training Tesseract\&.
.SH "OPTIONS"
.PP
\-U \fIFILE\fR
.RS 4
The unicharset generated by unicharset_extractor(1)\&.
.RE
.PP
\-D \fIdir\fR
.RS 4
Directory to write output files to\&.
.RE
.PP
\-F \fIfont_properties_file\fR
.RS 4
(Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqfont_name\*(Aq \*(Aqitalic\*(Aq \*(Aqbold\*(Aq \*(Aqfixed_pitch\*(Aq \*(Aqserif\*(Aq \*(Aqfraktur\*(Aq
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-X \fIxheights_file\fR
.RS 4
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi\&. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqfont_name\*(Aq \*(Aqxheight\*(Aq
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-O \fIFILE\fR
.RS 4
The output unicharset that will be given to combine_tessdata(1)\&.
.RE
.SH "SEE ALSO"
.sp
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "COPYING"
.sp
Copyright (C) Google, 2011 Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

59
doc/shapeclustering.1.asc Normal file
View File

@ -0,0 +1,59 @@
SHAPECLUSTERING(1)
==================
:doctype: manpage
NAME
----
shapeclustering - shape clustering training for Tesseract
SYNOPSIS
--------
shapeclustering -D 'output_dir'
-U 'unicharset' -O 'mfunicharset'
-F 'font_props' -X 'xheights'
'FILE'...
DESCRIPTION
-----------
shapeclustering(1) takes extracted feature .tr files (generated by
tesseract(1) run in a special mode from box files) and produces a
file *shapetable* and an enhanced unicharset. This program is still
experimental, and is not required (yet) for training Tesseract.
OPTIONS
-------
-U 'FILE'::
The unicharset generated by unicharset_extractor(1).
-D 'dir'::
Directory to write output files to.
-F 'font_properties_file'::
(Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1:
'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'
-X 'xheights_file'::
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
'font_name' 'xheight'
-O 'FILE'::
The output unicharset that will be given to combine_tessdata(1).
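A sketch of a complete invocation using the options above (the output directory, font_properties, xheights, and .tr file names are placeholders; as noted, the program is still experimental):

    shapeclustering -D /tmp/shape_out -U unicharset -O mfunicharset \
        -F font_properties -X xheights \
        eng.timesitalic.exp0.tr eng.timesitalic.exp1.tr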
SEE ALSO
--------
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
unicharset(5)
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
COPYING
-------
Copyright \(C) Google, 2011
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

674
doc/shapeclustering.1.html Normal file
View File

@ -0,0 +1,674 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.5.2" />
<title>SHAPECLUSTERING(1)</title>
<style type="text/css">
/* Debug borders */
p, li, dt, dd, div, pre, h1, h2, h3, h4, h5, h6 {
/*
border: 1px solid red;
*/
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
tt {
color: navy;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
font-family: sans-serif;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
div.sectionbody {
font-family: serif;
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
pre {
padding: 0;
margin: 0;
}
span#author {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
font-size: 1.1em;
}
span#email {
}
span#revnumber, span#revdate, span#revremark {
font-family: sans-serif;
}
div#footer {
font-family: sans-serif;
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
div#footer-text {
float: left;
padding-bottom: 0.5em;
}
div#footer-badges {
float: right;
padding-bottom: 0.5em;
}
div#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.tableblock, div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #dddddd;
color: #777777;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > div.content {
white-space: pre;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-family: sans-serif;
font-weight: bold;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
@media print {
div#footer-badges { display: none; }
}
div#toc {
margin-bottom: 2.5em;
}
div#toctitle {
color: #527bbd;
font-family: sans-serif;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
/* Overrides for manpage documents */
h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
h2 {
border-style: none;
}
div.sectionbody {
margin-left: 5%;
}
@media print {
div#toc { display: none; }
}
/* Workarounds for IE6's broken and incomplete CSS2. */
div.sidebar-content {
background: #ffffee;
border: 1px solid silver;
padding: 0.5em;
}
div.sidebar-title, div.image-title {
color: #527bbd;
font-family: sans-serif;
font-weight: bold;
margin-top: 0.0em;
margin-bottom: 0.5em;
}
div.listingblock div.content {
border: 1px solid silver;
background: #f4f4f4;
padding: 0.5em;
}
div.quoteblock-attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock-content {
white-space: pre;
}
div.verseblock-attribution {
padding-top: 0.75em;
text-align: left;
}
div.exampleblock-content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
/* IE6 sets dynamically generated links as visited. */
div#toc a:visited { color: blue; }
</style>
<script type="text/javascript">
/*<![CDATA[*/
window.onload = function(){asciidoc.footnotes();}
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([2-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
var cont = document.getElementById("content");
var noteholder = document.getElementById("footnotes");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
}
}
/*]]>*/
</script>
</head>
<body>
<div id="header">
<h1>
SHAPECLUSTERING(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>shapeclustering -
shape clustering training for Tesseract
</p>
</div>
</div>
<div id="content">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p>shapeclustering -D <em>output_dir</em>
-U <em>unicharset</em> -O <em>mfunicharset</em>
-F <em>font_props</em> -X <em>xheights</em>
<em>FILE</em>&#8230;</p></div>
</div>
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>shapeclustering(1) takes extracted feature .tr files (generated by
tesseract(1) run in a special mode from box files) and produces a
file <strong>shapetable</strong> and an enhanced unicharset. This program is still
experimental, and is not required (yet) for training Tesseract.</p></div>
</div>
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="dlist"><dl>
<dt class="hdlist1">
-U <em>FILE</em>
</dt>
<dd>
<p>
The unicharset generated by unicharset_extractor(1).
</p>
</dd>
<dt class="hdlist1">
-D <em>dir</em>
</dt>
<dd>
<p>
Directory to write output files to.
</p>
</dd>
<dt class="hdlist1">
-F <em>font_properties_file</em>
</dt>
<dd>
<p>
(Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1:
</p>
<div class="literalblock">
<div class="content">
<pre><tt>'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'</tt></pre>
</div></div>
</dd>
<dt class="hdlist1">
-X <em>xheights_file</em>
</dt>
<dd>
<p>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</p>
<div class="literalblock">
<div class="content">
<pre><tt>'font_name' 'xheight'</tt></pre>
</div></div>
</dd>
<dt class="hdlist1">
-O <em>FILE</em>
</dt>
<dd>
<p>
The output unicharset that will be given to combine_tessdata(1).
</p>
</dd>
</dl></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
unicharset(5)</p></div>
<div class="paragraph"><p><a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) Google, 2011
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2012-02-09 14:19:44 PDT
</div>
</div>
</body>
</html>

102
doc/shapeclustering.1.xml Normal file
View File

@ -0,0 +1,102 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refmeta>
<refentrytitle>shapeclustering</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&nbsp;</refmiscinfo>
<refmiscinfo class="manual">&nbsp;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>shapeclustering</refname>
<refpurpose>shape clustering training for Tesseract</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara>shapeclustering -D <emphasis>output_dir</emphasis>
-U <emphasis>unicharset</emphasis> -O <emphasis>mfunicharset</emphasis>
-F <emphasis>font_props</emphasis> -X <emphasis>xheights</emphasis>
<emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>shapeclustering(1) takes extracted feature .tr files (generated by
tesseract(1) run in a special mode from box files) and produces a
file <emphasis role="strong">shapetable</emphasis> and an enhanced unicharset. This program is still
experimental, and is not required (yet) for training Tesseract.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>
-U <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
The unicharset generated by unicharset_extractor(1).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-D <emphasis>dir</emphasis>
</term>
<listitem>
<simpara>
Directory to write output files to.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-F <emphasis>font_properties_file</emphasis>
</term>
<listitem>
<simpara>
(Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1:
</simpara>
<literallayout class="monospaced">'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-X <emphasis>xheights_file</emphasis>
</term>
<listitem>
<simpara>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</simpara>
<literallayout class="monospaced">'font_name' 'xheight'</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-O <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
The output unicharset that will be given to combine_tessdata(1).
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
unicharset(5)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) Google, 2011
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,13 +1,13 @@
'\" t
.\" Title: tesseract
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "TESSERACT" "1" "09/30/2010" "\ \&" "\ \&"
.TH "TESSERACT" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@ -31,320 +31,109 @@
tesseract \- command\-line OCR engine
.SH "SYNOPSIS"
.sp
\fBtesseract\fR \fIimagename\fR \fItextbase\fR [\fIconfigfile\fR] [\fI\-l lang\fR]
\fBtesseract\fR \fIimagename\fR \fIoutbase\fR [\fI\-l lang\fR] [\fI\-psm N\fR] [\fIconfigfile\fR \&...]
.SH "DESCRIPTION"
.sp
tesseract(1) is a commercial quality OCR engine originally developed at HP between 1985 and 1995\&. In 1995, this engine was among the top 3 evaluated by UNLV\&. It was open\-sourced by HP and UNLV in 2005, and has been developed by Google since then\&.
tesseract(1) is a commercial quality OCR engine originally developed at HP between 1985 and 1995\&. In 1995, this engine was among the top 3 evaluated by UNLV\&. It was open\-sourced by HP and UNLV in 2005, and has been developed at Google since then\&.
.SH "OPTIONS"
.PP
\fIimagename\fR
.RS 4
The name of the input image\&. Most image file formats (anything readable by Leptonica) are supported\&.
.RE
.PP
\fIoutbase\fR
.RS 4
The basename of the output file (to which the appropriate extension will be appended)\&. By default the output will be named
\fIoutbase\&.txt\fR\&.
.RE
.PP
\fI\-l lang\fR
.RS 4
The language to use\&. If none is specified, English is assumed\&. Multiple languages may be specified, separated by plus characters\&. Tesseract uses 3\-character ISO 639\-2 language codes\&. (See LANGUAGES)
.RE
.PP
\fI\-psm N\fR
.RS 4
Set Tesseract to only run a subset of layout analysis and assume a certain form of image\&. The options for
\fBN\fR
are:
.sp
\fIimagename\fR The name of the input image
.if n \{\
.RS 4
.\}
.nf
0 = Orientation and script detection (OSD) only\&.
1 = Automatic page segmentation with OSD\&.
2 = Automatic page segmentation, but no OSD or OCR\&.
3 = Fully automatic page segmentation, but no OSD\&. (Default)
4 = Assume a single column of text of variable sizes\&.
5 = Assume a single uniform block of vertically aligned text\&.
6 = Assume a single uniform block of text\&.
7 = Treat the image as a single text line\&.
8 = Treat the image as a single word\&.
9 = Treat the image as a single word in a circle\&.
10 = Treat the image as a single character\&.
.fi
.if n \{\
.RE
.\}
.RE
.PP
\fI\-v\fR
.RS 4
Returns the current version of the tesseract(1) executable\&.
.RE
.PP
\fIconfigfile\fR
.RS 4
The name of a config to use\&. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value\&. Interesting config files include:
.sp
\fItextbase\fR The basename of the output file (to which the appropriate extension will be appended)
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
hocr \- Output in hOCR format instead of as a text file\&.
.RE
.RE
.sp
\fIconfigfile\fR The config to use\&. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value\&.
.sp
\fI\-l lang\fR The language to use\&. If none is specified, English is assumed\&. Tesseract uses 3\-character ISO 639\-2 language codes\&. (See LANGUAGES)
.sp
\fI\-v\fR Returns the current version of the tesseract(1) executable\&.
\fBNota Bene:\fR The options \fI\-l lang\fR and \fI\-psm N\fR must occur before any \fIconfigfile\fR\&.
.SH "LANGUAGES"
.sp
There are currently language packs available for the following languages:
.TS
tab(:);
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt
lt lt.
T{
.sp
bul
T}:T{
\fBara\fR (Arabic), \fBaze\fR (Azerbaijani), \fBbul\fR (Bulgarian), \fBcat\fR (Catalan), \fBces\fR (Czech), \fBchi_sim\fR (Simplified Chinese), \fBchi_tra\fR (Traditional Chinese), \fBchr\fR (Cherokee), \fBdan\fR (Danish), \fBdan\-frak\fR (Danish (Fraktur)), \fBdeu\fR (German), \fBell\fR (Greek), \fBeng\fR (English), \fBenm\fR (Old English), \fBepo\fR (Esperanto), \fBest\fR (Estonian), \fBfin\fR (Finnish), \fBfra\fR (French), \fBfrm\fR (Old French), \fBglg\fR (Galician), \fBheb\fR (Hebrew), \fBhin\fR (Hindi), \fBhrv\fR (Croatian), \fBhun\fR (Hungarian), \fBind\fR (Indonesian), \fBita\fR (Italian), \fBjpn\fR (Japanese), \fBkor\fR (Korean), \fBlav\fR (Latvian), \fBlit\fR (Lithuanian), \fBnld\fR (Dutch), \fBnor\fR (Norwegian), \fBpol\fR (Polish), \fBpor\fR (Portuguese), \fBron\fR (Romanian), \fBrus\fR (Russian), \fBslk\fR (Slovakian), \fBslv\fR (Slovenian), \fBsqi\fR (Albanian), \fBspa\fR (Spanish), \fBsrp\fR (Serbian), \fBswe\fR (Swedish), \fBtam\fR (Tamil), \fBtel\fR (Telugu), \fBtgl\fR (Tagalog), \fBtha\fR (Thai), \fBtur\fR (Turkish), \fBukr\fR (Ukrainian), \fBvie\fR (Vietnamese)
.sp
Bulgarian
T}
T{
.sp
cat
T}:T{
.sp
Catalan
T}
T{
.sp
ces
T}:T{
.sp
Czech
T}
T{
.sp
chi_sim
T}:T{
.sp
Simplified Chinese
T}
T{
.sp
chi_tra
T}:T{
.sp
Traditional Chinese
T}
T{
.sp
dan
T}:T{
.sp
Danish
T}
T{
.sp
dan\-frak
T}:T{
.sp
Danish (Fraktur)
T}
T{
.sp
deu
T}:T{
.sp
German
T}
T{
.sp
ell
T}:T{
.sp
Greek
T}
T{
.sp
eng
T}:T{
.sp
English
T}
T{
.sp
fin
T}:T{
.sp
Finnish
T}
T{
.sp
fra
T}:T{
.sp
French
T}
T{
.sp
hun
T}:T{
.sp
Hungarian
T}
T{
.sp
ind
T}:T{
.sp
Indonesian
T}
T{
.sp
ita
T}:T{
.sp
Italian
T}
T{
.sp
jpn
T}:T{
.sp
Japanese
T}
T{
.sp
kor
T}:T{
.sp
Korean
T}
T{
.sp
lav
T}:T{
.sp
Latvian
T}
T{
.sp
lit
T}:T{
.sp
Lithuanian
T}
T{
.sp
nld
T}:T{
.sp
Dutch
T}
T{
.sp
nor
T}:T{
.sp
Norwegian
T}
T{
.sp
pol
T}:T{
.sp
Polish
T}
T{
.sp
por
T}:T{
.sp
Portuguese
T}
T{
.sp
ron
T}:T{
.sp
Romanian
T}
T{
.sp
rus
T}:T{
.sp
Russian
T}
T{
.sp
slk
T}:T{
.sp
Slovakian
T}
T{
.sp
slv
T}:T{
.sp
Slovenian
T}
T{
.sp
spa
T}:T{
.sp
Spanish
T}
T{
.sp
srp
T}:T{
.sp
Serbian
T}
T{
.sp
swe
T}:T{
.sp
Swedish
T}
T{
.sp
tgl
T}:T{
.sp
Tagalog
T}
T{
.sp
tha
T}:T{
.sp
Thai
T}
T{
.sp
tur
T}:T{
.sp
Turkish
T}
T{
.sp
ukr
T}:T{
.sp
Ukrainian
T}
T{
.sp
vie
T}:T{
.sp
Vietnamese
T}
.TE
.sp 1
To use a non\-standard language pack named \fBfoo\&.traineddata\fR, set the \fBTESSDATA_PREFIX\fR environment variable so the file can be found at \fBTESSDATA_PREFIX\fR/tessdata/\fBfoo\fR\&.traineddata and give Tesseract the argument \fI\-l foo\fR\&.
.SH "HISTORY"
.sp
The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998\&. A lot of the code was written in C, and then some more was written in C++\&. Since then all the code has been converted to at least compile with a C++ compiler\&. Currently it builds under Linux with gcc4\&.0, gcc4\&.1 and under Windows with VC++6 and VC++Express\&. The C++ code makes heavy use of a list system using macros\&. This predates stl, was portable before stl, and is more efficient than stl lists, but has the big negative that if you do get a segmentation violation, it is hard to debug\&. Another "feature" of the C/C++ split is that the C++ data structures get converted to C data structures to call the low\-level C code\&. This is ugly, and the C++izing of the C code is a step towards eliminating the conversion, but it has not happened yet\&.
The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998\&. A lot of the code was written in C, and then some more was written in C++\&. The C\e++ code makes heavy use of a list system using macros\&. This predates stl, was portable before stl, and is more efficient than stl lists, but has the big negative that if you do get a segmentation violation, it is hard to debug\&.
.sp
The most important changes in version 2\&.00 were that Tesseract can now recognize 6 languages, is fully UTF8 capable, and is fully trainable\&. See \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract\fR\m[] for more information on training\&.
Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&.
.sp
Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttp://www\&.isri\&.unlv\&.edu/downloads/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TestingTesseract\fR\m[] for more details\&.
.sp
Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&. For further details, see the file ReleaseNotes included with the distribution\&.
Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&.
.sp
Tesseract 3\&.02 adds BiDirectional text support, the ability to recognize multiple languages in a single image, and improved layout analysis\&.
.sp
For further details, see the file ReleaseNotes included with the distribution\&.
.SH "RESOURCES"
.sp
Main web site: \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/\fR\m[] Information on training: \m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "SEE ALSO"
.sp
tesseract(1)
ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), shapeclustering(1), mftraining(1), unicharambigs(5), unicharset(5), unicharset_extractor(1), wordlist2dawg(1)
.SH "AUTHOR"
.sp
Tesseract development was led at Hewlett\-Packard and Google by Ray Smith\&. The development team has included:
.sp
Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar\-Shyang Lee, David Eger, Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus\&.
.SH "COPYING"
.sp
Licensed under the Apache License, Version 2\&.0

View File

@ -1,5 +1,6 @@
TESSERACT(1)
============
:doctype: manpage
NAME
----
@ -7,78 +8,118 @@ tesseract - command-line OCR engine
SYNOPSIS
--------
*tesseract* 'imagename' 'textbase' ['configfile'] ['-l lang']
*tesseract* 'imagename' 'outbase' ['-l lang'] ['-psm N'] ['configfile' ...]
DESCRIPTION
-----------
tesseract(1) is a commercial quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
by Google since then.
at Google since then.
OPTIONS
-------
'imagename'
The name of the input image
'imagename'::
The name of the input image. Most image file formats (anything
readable by Leptonica) are supported.
'textbase'
'outbase'::
The basename of the output file (to which the appropriate extension
will be appended)
will be appended). By default the output will be named 'outbase.txt'.
'configfile'
The config to use. A config is a plaintext file which contains a list
of variables and their values, one per line, with a space separating
variable from value.
'-l lang'
'-l lang'::
The language to use. If none is specified, English is assumed.
Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
'-v'
'-psm N'::
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for *N* are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
'-v'::
Returns the current version of the tesseract(1) executable.
'configfile'::
The name of a config to use. A config is a plaintext file which
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include: +
* hocr - Output in hOCR format instead of as a text file.
*Nota Bene:* The options '-l lang' and '-psm N' must occur
before any 'configfile'.
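Two invocation sketches consistent with the options above (the image and output names are illustrative):

    # Plain text OCR of an English page; writes page.txt.
    tesseract page.tif page
    # German plus English, single-column layout analysis, hOCR output via
    # the 'hocr' config; note that -l and -psm come before the config name.
    tesseract scan.png scan -l deu+eng -psm 4 hocr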
LANGUAGES
---------
There are currently language packs available for the following languages:
[horizontal]
bul:: Bulgarian
cat:: Catalan
ces:: Czech
chi_sim:: Simplified Chinese
chi_tra:: Traditional Chinese
dan:: Danish
dan-frak:: Danish (Fraktur)
deu:: German
ell:: Greek
eng:: English
fin:: Finnish
fra:: French
hun:: Hungarian
ind:: Indonesian
ita:: Italian
jpn:: Japanese
kor:: Korean
lav:: Latvian
lit:: Lithuanian
nld:: Dutch
nor:: Norwegian
pol:: Polish
por:: Portuguese
ron:: Romanian
rus:: Russian
slk:: Slovakian
slv:: Slovenian
spa:: Spanish
srp:: Serbian
swe:: Swedish
tgl:: Tagalog
tha:: Thai
tur:: Turkish
ukr:: Ukrainian
vie:: Vietnamese
*ara* (Arabic),
*aze* (Azerbaijani),
*bul* (Bulgarian),
*cat* (Catalan),
*ces* (Czech),
*chi_sim* (Simplified Chinese),
*chi_tra* (Traditional Chinese),
*chr* (Cherokee),
*dan* (Danish),
*dan-frak* (Danish (Fraktur)),
*deu* (German),
*ell* (Greek),
*eng* (English),
*enm* (Old English),
*epo* (Esperanto),
*est* (Estonian),
*fin* (Finnish),
*fra* (French),
*frm* (Old French),
*glg* (Galician),
*heb* (Hebrew),
*hin* (Hindi),
*hrv* (Croatian),
*hun* (Hungarian),
*ind* (Indonesian),
*ita* (Italian),
*jpn* (Japanese),
*kor* (Korean),
*lav* (Latvian),
*lit* (Lithuanian),
*nld* (Dutch),
*nor* (Norwegian),
*pol* (Polish),
*por* (Portuguese),
*ron* (Romanian),
*rus* (Russian),
*slk* (Slovakian),
*slv* (Slovenian),
*sqi* (Albanian),
*spa* (Spanish),
*srp* (Serbian),
*swe* (Swedish),
*tam* (Tamil),
*tel* (Telugu),
*tgl* (Tagalog),
*tha* (Thai),
*tur* (Turkish),
*ukr* (Ukrainian),
*vie* (Vietnamese)
To use a non-standard language pack named *foo.traineddata*, set the
*TESSDATA_PREFIX* environment variable so the file can be found at
*TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
argument '-l foo'.
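For example, a hypothetical custom pack named foo.traineddata could be picked up like this (the directory layout is an assumption):

    # foo.traineddata is stored at /home/user/custom-tess/tessdata/foo.traineddata
    export TESSDATA_PREFIX=/home/user/custom-tess/
    tesseract input.png output -l foo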
HISTORY
-------
@ -86,21 +127,13 @@ The engine was developed at Hewlett Packard Laboratories Bristol and at
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A
lot of the code was written in C, and then some more was written in C\+\+.
Since then all the code has been converted to at least compile with a
C\++ compiler. Currently it builds under Linux with gcc4.0, gcc4.1 and
under Windows with VC\+\+6 and VC\+\+Express. The C\++ code makes heavy use of
a list system using macros. This predates stl, was portable before stl, and
is more efficient than stl lists, but has the big negative that if you do get
a segmentation violation, it is hard to debug. Another "feature" of the
C/C\++ split is that the C\++ data structures get converted to C data
structures to call the low-level C code. This is ugly, and the C++izing of
the C code is a step towards eliminating the conversion, but it has not
happened yet.
The C\++ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has
the big negative that if you do get a segmentation violation, it is hard to
debug.
The most important changes in version 2.00 were that Tesseract can now
recognize 6 languages, is fully UTF8 capable, and is fully trainable. See
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract> for more
information on training.
Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
See <http://www.isri.unlv.edu/downloads/AT-1995.pdf>. With Tesseract 2.00,
@ -110,12 +143,35 @@ details.
Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing
language data. For further details, see the file ReleaseNotes included with
the distribution.
language data.
Tesseract 3.02 adds BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis.
For further details, see the file ReleaseNotes included with the distribution.
RESOURCES
---------
Main web site: <http://code.google.com/p/tesseract-ocr/> +
Information on training: <http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
SEE ALSO
--------
tesseract(1)
ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
unicharset_extractor(1), wordlist2dawg(1)
AUTHOR
------
Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
The development team has included:
Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
Lloyd, Shobhit Saxena, and Thomas Kielbus.
COPYING
-------
View File
@ -582,422 +582,154 @@ TESSERACT(1) Manual Page
<div id="content">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>tesseract</strong> <em>imagename</em> <em>textbase</em> [<em>configfile</em>] [<em>-l lang</em>]</p></div>
<div class="paragraph"><p><strong>tesseract</strong> <em>imagename</em> <em>outbase</em> [<em>-l lang</em>] [<em>-psm N</em>] [<em>configfile</em> &#8230;]</p></div>
</div>
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1) is a commercial quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
by Google since then.</p></div>
at Google since then.</p></div>
</div>
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="paragraph"><p><em>imagename</em>
The name of the input image</p></div>
<div class="paragraph"><p><em>textbase</em>
<div class="dlist"><dl>
<dt class="hdlist1">
<em>imagename</em>
</dt>
<dd>
<p>
The name of the input image. Most image file formats (anything
readable by Leptonica) are supported.
</p>
</dd>
<dt class="hdlist1">
<em>outbase</em>
</dt>
<dd>
<p>
The basename of the output file (to which the appropriate extension
will be appended)</p></div>
<div class="paragraph"><p><em>configfile</em>
The config to use. A config is a plaintext file which contains a list
of variables and their values, one per line, with a space separating
variable from value.</p></div>
<div class="paragraph"><p><em>-l lang</em>
will be appended). By default the output will be named <em>outbase.txt</em>.
</p>
</dd>
<dt class="hdlist1">
<em>-l lang</em>
</dt>
<dd>
<p>
The language to use. If none is specified, English is assumed.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)</p></div>
<div class="paragraph"><p><em>-v</em>
Returns the current version of the tesseract(1) executable.</p></div>
Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
</p>
</dd>
<dt class="hdlist1">
<em>-psm N</em>
</dt>
<dd>
<p>
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for <strong>N</strong> are:
</p>
<div class="literalblock">
<div class="content">
<pre><tt>0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.</tt></pre>
</div></div>
</dd>
<dt class="hdlist1">
<em>-v</em>
</dt>
<dd>
<p>
Returns the current version of the tesseract(1) executable.
</p>
</dd>
<dt class="hdlist1">
<em>configfile</em>
</dt>
<dd>
<p>
The name of a config to use. A config is a plaintext file which
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include:<br />
</p>
<div class="ulist"><ul>
<li>
<p>
hocr - Output in hOCR format instead of as a text file.
</p>
</li>
</ul></div>
</dd>
</dl></div>
<div class="paragraph"><p><strong>Nota Bene:</strong> The options <em>-l lang</em> and <em>-psm N</em> must occur
before any <em>configfile</em>.</p></div>
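<div class="paragraph"><p>For example, a hypothetical invocation (the file names are illustrative
assumptions) that recognizes German plus English text and writes hOCR output,
respecting the option ordering above, could be:</p></div>
<div class="literalblock">
<div class="content">
<pre><tt>tesseract paper.png paper -l deu+eng -psm 3 hocr</tt></pre>
</div></div>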
</div>
<h2 id="_languages">LANGUAGES</h2>
<div class="sectionbody">
<div class="paragraph"><p>There are currently language packs available for the following languages:</p></div>
<div class="hdlist"><table>
<tr>
<td class="hdlist1">
bul
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Bulgarian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
cat
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Catalan
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
ces
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Czech
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
chi_sim
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Simplified Chinese
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
chi_tra
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Traditional Chinese
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
dan
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Danish
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
dan-frak
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Danish (Fraktur)
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
deu
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
German
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
ell
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Greek
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
eng
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
English
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
fin
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Finnish
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
fra
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
French
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
hun
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Hungarian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
ind
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Indonesian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
ita
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Italian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
jpn
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Japanese
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
kor
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Korean
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
lav
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Latvian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
lit
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Lithuanian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
nld
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Dutch
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
nor
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Norwegian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
pol
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Polish
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
por
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Portuguese
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
ron
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Romanian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
rus
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Russian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
slk
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Slovakian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
slv
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Slovenian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
spa
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Spanish
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
srp
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Serbian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
swe
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Swedish
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
tgl
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Tagalog
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
tha
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Thai
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
tur
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Turkish
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
ukr
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Ukrainian
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
vie
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
Vietnamese
</p>
</td>
</tr>
</table></div>
<div class="paragraph"><p><strong>ara</strong> (Arabic),
<strong>aze</strong> (Azerbaijani),
<strong>bul</strong> (Bulgarian),
<strong>cat</strong> (Catalan),
<strong>ces</strong> (Czech),
<strong>chi_sim</strong> (Simplified Chinese),
<strong>chi_tra</strong> (Traditional Chinese),
<strong>chr</strong> (Cherokee),
<strong>dan</strong> (Danish),
<strong>dan-frak</strong> (Danish (Fraktur)),
<strong>deu</strong> (German),
<strong>ell</strong> (Greek),
<strong>eng</strong> (English),
<strong>enm</strong> (Middle English),
<strong>epo</strong> (Esperanto),
<strong>est</strong> (Estonian),
<strong>fin</strong> (Finnish),
<strong>fra</strong> (French),
<strong>frm</strong> (Middle French),
<strong>glg</strong> (Galician),
<strong>heb</strong> (Hebrew),
<strong>hin</strong> (Hindi),
<strong>hrv</strong> (Croatian),
<strong>hun</strong> (Hungarian),
<strong>ind</strong> (Indonesian),
<strong>ita</strong> (Italian),
<strong>jpn</strong> (Japanese),
<strong>kor</strong> (Korean),
<strong>lav</strong> (Latvian),
<strong>lit</strong> (Lithuanian),
<strong>nld</strong> (Dutch),
<strong>nor</strong> (Norwegian),
<strong>pol</strong> (Polish),
<strong>por</strong> (Portuguese),
<strong>ron</strong> (Romanian),
<strong>rus</strong> (Russian),
<strong>slk</strong> (Slovakian),
<strong>slv</strong> (Slovenian),
<strong>sqi</strong> (Albanian),
<strong>spa</strong> (Spanish),
<strong>srp</strong> (Serbian),
<strong>swe</strong> (Swedish),
<strong>tam</strong> (Tamil),
<strong>tel</strong> (Telugu),
<strong>tgl</strong> (Tagalog),
<strong>tha</strong> (Thai),
<strong>tur</strong> (Turkish),
<strong>ukr</strong> (Ukrainian),
<strong>vie</strong> (Vietnamese)</p></div>
<div class="paragraph"><p>To use a non-standard language pack named <strong>foo.traineddata</strong>, set the
<strong>TESSDATA_PREFIX</strong> environment variable so the file can be found at
<strong>TESSDATA_PREFIX</strong>/tessdata/<strong>foo</strong>.traineddata and give Tesseract the
argument <em>-l foo</em>.</p></div>
</div>
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
@ -1005,20 +737,12 @@ Vietnamese
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some C++izing in 1998. A
lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a
C++ compiler. Currently it builds under Linux with gcc4.0, gcc4.1 and
under Windows with VC++6 and VC++Express. The C++ code makes heavy use of
a list system using macros. This predates stl, was portable before stl, and
is more efficient than stl lists, but has the big negative that if you do get
a segmentation violation, it is hard to debug. Another "feature" of the
C/C++ split is that the C++ data structures get converted to C data
structures to call the low-level C code. This is ugly, and the C++izing of
the C code is a step towards eliminating the conversion, but it has not
happened yet.</p></div>
<div class="paragraph"><p>The most important changes in version 2.00 were that Tesseract can now
recognize 6 languages, is fully UTF8 capable, and is fully trainable. See
<a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract</a> for more
information on training.</p></div>
The C++ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has
the big negative that if you do get a segmentation violation, it is hard to
debug.</p></div>
<div class="paragraph"><p>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</p></div>
<div class="paragraph"><p>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <a href="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</a>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests.
@ -1026,12 +750,32 @@ See <a href="http://code.google.com/p/tesseract-ocr/wiki/TestingTesseract">http:
details.</p></div>
<div class="paragraph"><p>Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing
language data. For further details, see the file ReleaseNotes included with
the distribution.</p></div>
language data.</p></div>
<div class="paragraph"><p>Tesseract 3.02 adds BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis.</p></div>
<div class="paragraph"><p>For further details, see the file ReleaseNotes included with the distribution.</p></div>
</div>
<h2 id="_resources">RESOURCES</h2>
<div class="sectionbody">
<div class="paragraph"><p>Main web site: <a href="http://code.google.com/p/tesseract-ocr/">http://code.google.com/p/tesseract-ocr/</a><br />
Information on training: <a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1)</p></div>
<div class="paragraph"><p>ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
unicharset_extractor(1), wordlist2dawg(1)</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
The development team has included:</p></div>
<div class="paragraph"><p>Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
Lloyd, Shobhit Saxena, and Thomas Kielbus.</p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
@ -1041,7 +785,7 @@ the distribution.</p></div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-29 19:55:57 IST
Last updated 2012-02-09 14:18:49 PDT
</div>
</div>
</body>
View File
@ -14,457 +14,163 @@
<refpurpose>command-line OCR engine</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">tesseract</emphasis> <emphasis>imagename</emphasis> <emphasis>textbase</emphasis> [<emphasis>configfile</emphasis>] [<emphasis>-l lang</emphasis>]</simpara>
<simpara><emphasis role="strong">tesseract</emphasis> <emphasis>imagename</emphasis> <emphasis>outbase</emphasis> [<emphasis>-l lang</emphasis>] [<emphasis>-psm N</emphasis>] [<emphasis>configfile</emphasis> &#8230;]</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>tesseract(1) is a commercial quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
by Google since then.</simpara>
at Google since then.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<simpara><emphasis>imagename</emphasis>
The name of the input image</simpara>
<simpara><emphasis>textbase</emphasis>
<variablelist>
<varlistentry>
<term>
<emphasis>imagename</emphasis>
</term>
<listitem>
<simpara>
The name of the input image. Most image file formats (anything
readable by Leptonica) are supported.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>outbase</emphasis>
</term>
<listitem>
<simpara>
The basename of the output file (to which the appropriate extension
will be appended)</simpara>
<simpara><emphasis>configfile</emphasis>
The config to use. A config is a plaintext file which contains a list
of variables and their values, one per line, with a space separating
variable from value.</simpara>
<simpara><emphasis>-l lang</emphasis>
will be appended). By default the output will be named <emphasis>outbase.txt</emphasis>.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>-l lang</emphasis>
</term>
<listitem>
<simpara>
The language to use. If none is specified, English is assumed.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)</simpara>
<simpara><emphasis>-v</emphasis>
Returns the current version of the tesseract(1) executable.</simpara>
Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>-psm N</emphasis>
</term>
<listitem>
<simpara>
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for <emphasis role="strong">N</emphasis> are:
</simpara>
<literallayout class="monospaced">0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>-v</emphasis>
</term>
<listitem>
<simpara>
Returns the current version of the tesseract(1) executable.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>configfile</emphasis>
</term>
<listitem>
<simpara>
The name of a config to use. A config is a plaintext file which
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include:<?asciidoc-br?>
</simpara>
<itemizedlist>
<listitem>
<simpara>
hocr - Output in hOCR format instead of as a text file.
</simpara>
</listitem>
</itemizedlist>
</listitem>
</varlistentry>
</variablelist>
<simpara><emphasis role="strong">Nota Bene:</emphasis> The options <emphasis>-l lang</emphasis> and <emphasis>-psm N</emphasis> must occur
before any <emphasis>configfile</emphasis>.</simpara>
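<simpara>As an illustrative sketch only (the image name is an assumption), a single
line of text could be recognized by combining these options as:</simpara>
<literallayout class="monospaced">tesseract line.png line -l eng -psm 7</literallayout>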
</refsect1>
<refsect1 id="_languages">
<title>LANGUAGES</title>
<simpara>There are currently language packs available for the following languages:</simpara>
<informaltable tabstyle="horizontal" frame="none" colsep="0" rowsep="0"><tgroup cols="2"><colspec colwidth="15*"/><colspec colwidth="85*"/><tbody valign="top">
<row>
<entry>
<simpara>
bul
</simpara>
</entry>
<entry>
<simpara>
Bulgarian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
cat
</simpara>
</entry>
<entry>
<simpara>
Catalan
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
ces
</simpara>
</entry>
<entry>
<simpara>
Czech
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
chi_sim
</simpara>
</entry>
<entry>
<simpara>
Simplified Chinese
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
chi_tra
</simpara>
</entry>
<entry>
<simpara>
Traditional Chinese
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
dan
</simpara>
</entry>
<entry>
<simpara>
Danish
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
dan-frak
</simpara>
</entry>
<entry>
<simpara>
Danish (Fraktur)
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
deu
</simpara>
</entry>
<entry>
<simpara>
German
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
ell
</simpara>
</entry>
<entry>
<simpara>
Greek
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
eng
</simpara>
</entry>
<entry>
<simpara>
English
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
fin
</simpara>
</entry>
<entry>
<simpara>
Finnish
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
fra
</simpara>
</entry>
<entry>
<simpara>
French
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
hun
</simpara>
</entry>
<entry>
<simpara>
Hungarian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
ind
</simpara>
</entry>
<entry>
<simpara>
Indonesian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
ita
</simpara>
</entry>
<entry>
<simpara>
Italian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
jpn
</simpara>
</entry>
<entry>
<simpara>
Japanese
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
kor
</simpara>
</entry>
<entry>
<simpara>
Korean
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
lav
</simpara>
</entry>
<entry>
<simpara>
Latvian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
lit
</simpara>
</entry>
<entry>
<simpara>
Lithuanian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
nld
</simpara>
</entry>
<entry>
<simpara>
Dutch
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
nor
</simpara>
</entry>
<entry>
<simpara>
Norwegian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
pol
</simpara>
</entry>
<entry>
<simpara>
Polish
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
por
</simpara>
</entry>
<entry>
<simpara>
Portuguese
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
ron
</simpara>
</entry>
<entry>
<simpara>
Romanian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
rus
</simpara>
</entry>
<entry>
<simpara>
Russian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
slk
</simpara>
</entry>
<entry>
<simpara>
Slovakian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
slv
</simpara>
</entry>
<entry>
<simpara>
Slovenian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
spa
</simpara>
</entry>
<entry>
<simpara>
Spanish
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
srp
</simpara>
</entry>
<entry>
<simpara>
Serbian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
swe
</simpara>
</entry>
<entry>
<simpara>
Swedish
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
tgl
</simpara>
</entry>
<entry>
<simpara>
Tagalog
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
tha
</simpara>
</entry>
<entry>
<simpara>
Thai
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
tur
</simpara>
</entry>
<entry>
<simpara>
Turkish
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
ukr
</simpara>
</entry>
<entry>
<simpara>
Ukrainian
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
vie
</simpara>
</entry>
<entry>
<simpara>
Vietnamese
</simpara>
</entry>
</row>
</tbody></tgroup></informaltable>
<simpara><emphasis role="strong">ara</emphasis> (Arabic),
<emphasis role="strong">aze</emphasis> (Azerbauijani),
<emphasis role="strong">bul</emphasis> (Bulgarian),
<emphasis role="strong">cat</emphasis> (Catalan),
<emphasis role="strong">ces</emphasis> (Czech),
<emphasis role="strong">chi_sim</emphasis> (Simplified Chinese),
<emphasis role="strong">chi_tra</emphasis> (Traditional Chinese),
<emphasis role="strong">chr</emphasis> (Cherokee),
<emphasis role="strong">dan</emphasis> (Danish),
<emphasis role="strong">dan-frak</emphasis> (Danish (Fraktur)),
<emphasis role="strong">deu</emphasis> (German),
<emphasis role="strong">ell</emphasis> (Greek),
<emphasis role="strong">eng</emphasis> (English),
<emphasis role="strong">enm</emphasis> (Old English),
<emphasis role="strong">epo</emphasis> (Esperanto),
<emphasis role="strong">est</emphasis> (Estonian),
<emphasis role="strong">fin</emphasis> (Finnish),
<emphasis role="strong">fra</emphasis> (French),
<emphasis role="strong">frm</emphasis> (Old French),
<emphasis role="strong">glg</emphasis> (Galician),
<emphasis role="strong">heb</emphasis> (Hebrew),
<emphasis role="strong">hin</emphasis> (Hindi),
<emphasis role="strong">hrv</emphasis> (Croation),
<emphasis role="strong">hun</emphasis> (Hungarian),
<emphasis role="strong">ind</emphasis> (Indonesian),
<emphasis role="strong">ita</emphasis> (Italian),
<emphasis role="strong">jpn</emphasis> (Japanese),
<emphasis role="strong">kor</emphasis> (Korean),
<emphasis role="strong">lav</emphasis> (Latvian),
<emphasis role="strong">lit</emphasis> (Lithuanian),
<emphasis role="strong">nld</emphasis> (Dutch),
<emphasis role="strong">nor</emphasis> (Norwegian),
<emphasis role="strong">pol</emphasis> (Polish),
<emphasis role="strong">por</emphasis> (Portuguese),
<emphasis role="strong">ron</emphasis> (Romanian),
<emphasis role="strong">rus</emphasis> (Russian),
<emphasis role="strong">slk</emphasis> (Slovakian),
<emphasis role="strong">slv</emphasis> (Slovenian),
<emphasis role="strong">sqi</emphasis> (Albanian),
<emphasis role="strong">spa</emphasis> (Spanish),
<emphasis role="strong">srp</emphasis> (Serbian),
<emphasis role="strong">swe</emphasis> (Swedish),
<emphasis role="strong">tam</emphasis> (Tamil),
<emphasis role="strong">tel</emphasis> (Telugu),
<emphasis role="strong">tgl</emphasis> (Tagalog),
<emphasis role="strong">tha</emphasis> (Thai),
<emphasis role="strong">tur</emphasis> (Turkish),
<emphasis role="strong">ukr</emphasis> (Ukrainian),
<emphasis role="strong">vie</emphasis> (Vietnamese)</simpara>
<simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the
<emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at
<emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the
argument <emphasis>-l foo</emphasis>.</simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
@ -472,20 +178,12 @@ Vietnamese
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some C++izing in 1998. A
lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a
C++ compiler. Currently it builds under Linux with gcc4.0, gcc4.1 and
under Windows with VC++6 and VC++Express. The C++ code makes heavy use of
a list system using macros. This predates stl, was portable before stl, and
is more efficient than stl lists, but has the big negative that if you do get
a segmentation violation, it is hard to debug. Another "feature" of the
C/C++ split is that the C++ data structures get converted to C data
structures to call the low-level C code. This is ugly, and the C++izing of
the C code is a step towards eliminating the conversion, but it has not
happened yet.</simpara>
<simpara>The most important changes in version 2.00 were that Tesseract can now
recognize 6 languages, is fully UTF8 capable, and is fully trainable. See
<ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract</ulink> for more
information on training.</simpara>
The C++ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has
the big negative that if you do get a segmentation violation, it is hard to
debug.</simpara>
<simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</simpara>
<simpara>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <ulink url="http://www.isri.unlv.edu/downloads/AT-1995.pdf">http://www.isri.unlv.edu/downloads/AT-1995.pdf</ulink>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests.
@ -493,12 +191,32 @@ See <ulink url="http://code.google.com/p/tesseract-ocr/wiki/TestingTesseract">ht
details.</simpara>
<simpara>Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing
language data. For further details, see the file ReleaseNotes included with
the distribution.</simpara>
language data.</simpara>
<simpara>Tesseract 3.02 adds BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis.</simpara>
<simpara>For further details, see the file ReleaseNotes included with the distribution.</simpara>
</refsect1>
<refsect1 id="_resources">
<title>RESOURCES</title>
<simpara>Main web site: <ulink url="http://code.google.com/p/tesseract-ocr/">http://code.google.com/p/tesseract-ocr/</ulink><?asciidoc-br?>
Information on training: <ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1)</simpara>
<simpara>ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
unicharset_extractor(1), wordlist2dawg(1)</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
The development team has included:</simpara>
<simpara>Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
Lloyd, Shobhit Saxena, and Thomas Kielbus.</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
View File
@ -1,13 +1,13 @@
'\" t
.\" Title: unicharambigs
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "UNICHARAMBIGS" "5" "09/30/2010" "\ \&" "\ \&"
.TH "UNICHARAMBIGS" "5" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@ -31,7 +31,7 @@
unicharambigs \- Tesseract unicharset ambiguities
.SH "DESCRIPTION"
.sp
The unicharset file is used by Tesseract to represent possible ambiguities between characters, or groups of characters\&.
The unicharambigs file (a component of traineddata, see combine_tessdata(1)) is used by Tesseract to represent possible ambiguities between characters, or groups of characters\&.
.sp
The file contains a number of lines, laid out as follows:
.sp
@ -115,3 +115,6 @@ This is a documentation "bug": it\(cqs not currently clear what should be done i
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.
View File
@ -7,8 +7,9 @@ unicharambigs - Tesseract unicharset ambiguities
DESCRIPTION
-----------
The unicharset file is used by Tesseract to represent possible
ambiguities between characters, or groups of characters.
The unicharambigs file (a component of traineddata, see combine_tessdata(1))
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.
The file contains a number of lines, laid out as follows:
@ -60,4 +61,7 @@ SEE ALSO
--------
tesseract(1), unicharset(5)
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).
View File
@ -582,8 +582,9 @@ UNICHARAMBIGS(5) Manual Page
<div id="content">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>The unicharset file is used by Tesseract to represent possible
ambiguities between characters, or groups of characters.</p></div>
<div class="paragraph"><p>The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.</p></div>
<div class="paragraph"><p>The file contains a number of lines, laid out as follow:</p></div>
<div class="literalblock">
<div class="content">
@ -682,11 +683,16 @@ letters in the unicharset.</p></div>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), unicharset(5)</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-29 23:17:24 IST
Last updated 2012-02-08 10:59:49 PDT
</div>
</div>
</body>
View File
@ -15,8 +15,9 @@
</refnamediv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>The unicharset file is used by Tesseract to represent possible
ambiguities between characters, or groups of characters.</simpara>
<simpara>The unicharambigs file (a component of traineddata, see combine_tessdata(1))
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.</simpara>
<simpara>The file contains a number of lines, laid out as follows:</simpara>
<literallayout class="monospaced">[num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num]</literallayout>
<informaltable tabstyle="horizontal" frame="none" colsep="0" rowsep="0"><tgroup cols="2"><colspec colwidth="15*"/><colspec colwidth="85*"/><tbody valign="top">
@ -114,4 +115,9 @@ letters in the unicharset.</simpara>
<title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>
View File
@ -1,13 +1,13 @@
'\" t
.\" Title: unicharset
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "UNICHARSET" "5" "09/30/2010" "\ \&" "\ \&"
.TH "UNICHARSET" "5" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@ -28,23 +28,136 @@
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharset \- character properties for use by Tesseract
unicharset \- character properties file used by tesseract(1)
.SH "DESCRIPTION"
.sp
Tesseract needs to have access to the character properties isalpha, isdigit, isupper, islower, ispunctuation\&. This data must be encoded in the unicharset data file\&. Each line of this file corresponds to one character\&. The character in UTF\-8 is followed by a hexadecimal number representing a binary mask that encodes the properties\&. Each bit corresponds to a property\&. If the bit is set to 1, it means that the property is true\&. The bit ordering is (from least significant bit to most significant bit): isalpha, islower, isupper, isdigit, ispunctuation\&.
Tesseract\(cqs unicharset file contains information on each symbol (unichar) the Tesseract OCR engine is trained to recognize\&.
.sp
Each line in the unicharset file has four space\-separated fields:
A unicharset file (e\&.g\&. \fIeng\&.unicharset\fR) is distributed as part of a Tesseract language pack (e\&.g\&. \fIeng\&.traineddata\fR)\&. For information on extracting the unicharset file, see combine_tessdata(1)\&.
.sp
The first line of a unicharset file contains the number of unichars in the file\&. After this line, each subsequent line provides information for a single unichar\&. The first such line contains a placeholder reserved for the space character\&. Each unichar is referred to within Tesseract by its Unichar ID, which is the line number (minus 1) within the unicharset file\&. Therefore, space gets unichar 0\&.
.sp
Each unichar line in the unicharset file (v2+) may have four space\-separated fields:
.sp
.if n \{\
.RS 4
.\}
.nf
[character] [properties] [script] [id]
\*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqscript\*(Aq \*(Aqid\*(Aq
.fi
.if n \{\
.RE
.\}
.SH "EXAMPLE"
.sp
Starting with Tesseract v3\&.02, more information may be given for each unichar:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqglyph_metrics\*(Aq \*(Aqscript\*(Aq \*(Aqother_case\*(Aq \*(Aqdirection\*(Aq \*(Aqmirror\*(Aq \*(Aqnormed_form\*(Aq
.fi
.if n \{\
.RE
.\}
.sp
Entries:
.PP
\fIcharacter\fR
.RS 4
The UTF\-8 encoded string to be produced for this unichar\&.
.RE
.PP
\fIproperties\fR
.RS 4
An integer mask of character properties, one per bit\&. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation\&.
.RE
.PP
\fIglyph_metrics\fR
.RS 4
Ten comma\-separated integers representing various standards for where this glyph is to be found within a baseline\-normalized coordinate system where 128 is normalized to x\-height\&.
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bottom, max_bottom: the ranges where the bottom of the character can be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_top, max_top: the ranges where the top of the character may be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_width, max_width: horizontal width of the character\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bearing, max_bearing: how far from the usual start position does the leftmost part of the character begin\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_advance, max_advance: how far from the printer\(cqs cell left do we advance to begin the next character\&.
.RE
.RE
.PP
\fIscript\fR
.RS 4
Name of the script (Latin, Common, Greek, Cyrillic, Han, null)\&.
.RE
.PP
\fIother_case\fR
.RS 4
The Unichar ID of the other case version of this character (upper or lower)\&.
.RE
.PP
\fIdirection\fR
.RS 4
The Unicode BiDi direction of this character, as defined by ICU\(cqs enum UCharDirection\&. (0 = Left to Right, 1 = Right to Left, 2 = European Number\&...)
.RE
.PP
\fImirror\fR
.RS 4
The Unichar ID of the BiDirectional mirror of this character\&. For example the mirror of open paren is close paren, but Latin Capital C has no mirror, so it remains a Latin Capital C\&.
.RE
.PP
\fInormed_form\fR
.RS 4
The UTF\-8 representation of a "normalized form" of this unichar for the purpose of blaming a module for errors given ground truth text\&. For instance, a left or right single quote may normalize to an ASCII quote\&.
.RE
.SH "EXAMPLE (V2)"
.sp
.if n \{\
.RS 4
@ -71,11 +184,37 @@ W 5 Latin 40
"=" is not punctuation nor a digit nor an alphabetic character\&. Its properties are thus represented by the binary number 00000 (0 in hexadecimal)\&.
.sp
Japanese or Chinese alphabetic character properties are represented by the binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case\&.
.SH "EXAMPLE (V3.02)"
.sp
The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han, null) and id code of the character\&.
.if n \{\
.RS 4
.\}
.nf
110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
\&. \&. \&.
.fi
.if n \{\
.RE
.\}
.SH "CAVEATS"
.sp
Although the unicharset reader maintains the ability to read unicharsets of older formats and will assign default values to missing fields, the accuracy will be degraded\&.
.sp
Further, most other data files are indexed by the unicharset file, so changing it without re\-generating the others is likely to have dire consequences\&.
.SH "HISTORY"
.sp
The unicharset format first appeared with Tesseract 2\&.00, which was the first version to support languages other than English\&. The unicharset file contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example)\&.
.SH "SEE ALSO"
.sp
tesseract(1), unicharset_extractor(1)
tesseract(1), combine_tessdata(1), unicharset_extractor(1)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.
View File
@ -1,29 +1,68 @@
UNICHARSET(5)
=============
:doctype: manpage
NAME
----
unicharset - character properties for use by Tesseract
unicharset - character properties file used by tesseract(1)
DESCRIPTION
-----------
Tesseract needs to have access to the character properties isalpha,
isdigit, isupper, islower, ispunctuation. This data must be encoded
in the unicharset data file. Each line of this file corresponds to
one character. The character in UTF-8 is followed by a hexadecimal
number representing a binary mask that encodes the properties.
Each bit corresponds to a property. If the bit is set to 1, it
means that the property is true. The bit ordering is (from least
significant bit to most significant bit): isalpha, islower, isupper,
isdigit, ispunctuation.
Tesseract's unicharset file contains information on each symbol
(unichar) the Tesseract OCR engine is trained to recognize.
Each line in the unicharset file has four space-separated fields:
......................................
[character] [properties] [script] [id]
......................................
A unicharset file (e.g. 'eng.unicharset') is distributed as part of a
Tesseract language pack (e.g. 'eng.traineddata'). For information on
extracting the unicharset file, see combine_tessdata(1).
EXAMPLE
-------
The first line of a unicharset file contains the number of unichars in
the file. After this line, each subsequent line provides information for
a single unichar. The first such line contains a placeholder reserved for
the space character. Each unichar is referred to within Tesseract by its
Unichar ID, which is the line number (minus 1) within the unicharset file.
Therefore, space gets unichar 0.
Each unichar line in the unicharset file (v2+) may have four space-separated fields:
'character' 'properties' 'script' 'id'
Starting with Tesseract v3.02, more information may be given for each unichar:
'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
Entries:
'character':: The UTF-8 encoded string to be produced for this unichar.
'properties':: An integer mask of character properties, one per bit.
From least to most significant bit, these are: isalpha, islower, isupper,
isdigit, ispunctuation.
'glyph_metrics':: Ten comma-separated integers representing various standards
for where this glyph is to be found within a baseline-normalized coordinate
system where 128 is normalized to x-height.
* min_bottom, max_bottom: the ranges where the bottom of the character can
be found.
* min_top, max_top: the ranges where the top of the character may be found.
* min_width, max_width: horizontal width of the character.
* min_bearing, max_bearing: how far from the usual start position does the
leftmost part of the character begin.
* min_advance, max_advance: how far from the printer's cell left do we
advance to begin the next character.
'script':: Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
'other_case':: The Unichar ID of the other case version of this character
(upper or lower).
'direction':: The Unicode BiDi direction of this character, as defined by
ICU's enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
2 = European Number...)
'mirror':: The Unichar ID of the BiDirectional mirror of this character.
For example the mirror of open paren is close paren, but Latin Capital C
has no mirror, so it remains a Latin Capital C.
'normed_form':: The UTF-8 representation of a "normalized form" of this unichar
for the purpose of blaming a module for errors given ground truth text.
For instance, a left or right single quote may normalize to an ASCII quote.
EXAMPLE (v2)
------------
..............
; 10 Common 46
b 3 Latin 59
@ -32,35 +71,63 @@ W 5 Latin 40
= 0 Common 93
..............
";" is a punctuation character. Its properties are thus represented by the binary number
10000 (10 in hexadecimal).
";" is a punctuation character. Its properties are thus represented by the
binary number 10000 (10 in hexadecimal).
"b" is an alphabetic character and a lower case character. Its properties are thus
represented by the binary number 00011 (3 in hexadecimal).
"b" is an alphabetic character and a lower case character. Its properties are
thus represented by the binary number 00011 (3 in hexadecimal).
"W" is an alphabetic character and an upper case character. Its properties are thus
represented by the binary number 00101 (5 in hexadecimal).
"W" is an alphabetic character and an upper case character. Its properties are
thus represented by the binary number 00101 (5 in hexadecimal).
"7" is just a digit. Its properties are thus represented by the binary number 01000
(8 in hexadecimal).
"7" is just a digit. Its properties are thus represented by the binary number
01000 (8 in hexadecimal).
"=" is not punctuation nor a digit nor an alphabetic character. Its properties are
thus represented by the binary number 00000 (0 in hexadecimal).
"=" is not punctuation nor a digit nor an alphabetic character. Its properties
are thus represented by the binary number 00000 (0 in hexadecimal).
Japanese or Chinese alphabetic character properties are represented by the binary number
00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.
Japanese or Chinese alphabetic character properties are represented by the
binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
upper nor lower case.
The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han,
null) and id code of the character.
EXAMPLE (v3.02)
---------------
..................................................................
110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
. . .
..................................................................
CAVEATS
-------
Although the unicharset reader maintains the ability to read unicharsets
of older formats and will assign default values to missing fields,
the accuracy will be degraded.
Further, most other data files are indexed by the unicharset file,
so changing it without re-generating the others is likely to have dire
consequences.
HISTORY
-------
The unicharset format first appeared with Tesseract 2.00, which was the first version
to support languages other than English. The unicharset file contained only the first
two fields, and the "ispunctuation" property was absent (punctuation was regarded as
"0", as "=" is in the above example.
The unicharset format first appeared with Tesseract 2.00, which was the
first version to support languages other than English. The unicharset file
contained only the first two fields, and the "ispunctuation" property was
absent (punctuation was regarded as "0", as "=" is in the above example.
SEE ALSO
--------
tesseract(1), unicharset_extractor(1)
tesseract(1), combine_tessdata(1), unicharset_extractor(1)
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).
View File
@ -575,29 +575,144 @@ UNICHARSET(5) Manual Page
<h2>NAME</h2>
<div class="sectionbody">
<p>unicharset -
character properties for use by Tesseract
character properties file used by tesseract(1)
</p>
</div>
</div>
<div id="content">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>Tesseract needs to have access to the character properties isalpha,
isdigit, isupper, islower, ispunctuation. This data must be encoded
in the unicharset data file. Each line of this file corresponds to
one character. The character in UTF-8 is followed by a hexadecimal
number representing a binary mask that encodes the properties.
Each bit corresponds to a property. If the bit is set to 1, it
means that the property is true. The bit ordering is (from least
significant bit to most significant bit): isalpha, islower, isupper,
isdigit, ispunctuation.</p></div>
<div class="paragraph"><p>Each line in the unicharset file has four space-separated fields:</p></div>
<div class="paragraph"><p>Tesseract&#8217;s unicharset file contains information on each symbol
(unichar) the Tesseract OCR engine is trained to recognize.</p></div>
<div class="paragraph"><p>A unicharset file (i.e. <em>eng.unicharset</em>) is distributed as part of a
Tesseract language pack (i.e. <em>eng.traineddata</em>). For information on
extracting the unicharset file, see combine_tessdata(1).</p></div>
<div class="paragraph"><p>The first line of a unicharset file contains the number of unichars in
the file. After this line, each subsequent line provides information for
a single unichar. The first such line contains a placeholder reserved for
the space character. Each unichar is referred to within Tesseract by its
Unichar ID, which is the line number (minus 1) within the unicharset file.
Therefore, space gets unichar 0.</p></div>
<div class="paragraph"><p>Each unichar line in the unicharset file (v2+) may have four space-separated fields:</p></div>
<div class="literalblock">
<div class="content">
<pre><tt>[character] [properties] [script] [id]</tt></pre>
<pre><tt>'character' 'properties' 'script' 'id'</tt></pre>
</div></div>
<div class="paragraph"><p>Starting with Tesseract v3.02, more information may be given for each unichar:</p></div>
<div class="literalblock">
<div class="content">
<pre><tt>'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'</tt></pre>
</div></div>
<div class="paragraph"><p>Entries:</p></div>
<div class="dlist"><dl>
<dt class="hdlist1">
<em>character</em>
</dt>
<dd>
<p>
The UTF-8 encoded string to be produced for this unichar.
</p>
</dd>
<dt class="hdlist1">
<em>properties</em>
</dt>
<dd>
<p>
An integer mask of character properties, one per bit.
From least to most significant bit, these are: isalpha, islower, isupper,
isdigit, ispunctuation.
</p>
</dd>
<dt class="hdlist1">
<em>glyph_metrics</em>
</dt>
<dd>
<p>
Ten comma-separated integers representing various standards
for where this glyph is to be found within a baseline-normalized coordinate
system where 128 is normalized to x-height.
</p>
<div class="ulist"><ul>
<li>
<p>
min_bottom, max_bottom: the ranges where the bottom of the character can
be found.
</p>
</li>
<li>
<p>
min_top, max_top: the ranges where the top of the character may be found.
</p>
</li>
<li>
<p>
min_width, max_width: horizontal width of the character.
</p>
</li>
<li>
<p>
min_bearing, max_bearing: how far from the usual start position does the
leftmost part of the character begin.
</p>
</li>
<li>
<p>
min_advance, max_advance: how far from the printer&#8217;s cell left do we
advance to begin the next character.
</p>
</li>
</ul></div>
</dd>
<dt class="hdlist1">
<em>script</em>
</dt>
<dd>
<p>
Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
</p>
</dd>
<dt class="hdlist1">
<em>other_case</em>
</dt>
<dd>
<p>
The Unichar ID of the other case version of this character
(upper or lower).
</p>
</dd>
<dt class="hdlist1">
<em>direction</em>
</dt>
<dd>
<p>
The Unicode BiDi direction of this character, as defined by
ICU&#8217;s enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
2 = European Number&#8230;)
</p>
</dd>
<dt class="hdlist1">
<em>mirror</em>
</dt>
<dd>
<p>
The Unichar ID of the BiDirectional mirror of this character.
For example the mirror of open paren is close paren, but Latin Capital C
has no mirror, so it remains a Latin Capital C.
</p>
</dd>
<dt class="hdlist1">
<em>normed_form</em>
</dt>
<dd>
<p>
The UTF-8 representation of a "normalized form" of this unichar
for the purpose of blaming a module for errors given ground truth text.
For instance, a left or right single quote may normalize to an ASCII quote.
</p>
</dd>
</dl></div>
</div>
<h2 id="_example">EXAMPLE</h2>
<h2 id="_example_v2">EXAMPLE (v2)</h2>
<div class="sectionbody">
<div class="literalblock">
<div class="content">
@ -607,37 +722,65 @@ W 5 Latin 40
7 8 Common 66
= 0 Common 93</tt></pre>
</div></div>
<div class="paragraph"><p>";" is a punctuation character. Its properties are thus represented by the binary number
10000 (10 in hexadecimal).</p></div>
<div class="paragraph"><p>"b" is an alphabetic character and a lower case character. Its properties are thus
represented by the binary number 00011 (3 in hexadecimal).</p></div>
<div class="paragraph"><p>"W" is an alphabetic character and an upper case character. Its properties are thus
represented by the binary number 00101 (5 in hexadecimal).</p></div>
<div class="paragraph"><p>"7" is just a digit. Its properties are thus represented by the binary number 01000
(8 in hexadecimal).</p></div>
<div class="paragraph"><p>"=" is not punctuation nor a digit nor an alphabetic character. Its properties are
thus represented by the binary number 00000 (0 in hexadecimal).</p></div>
<div class="paragraph"><p>Japanese or Chinese alphabetic character properties are represented by the binary number
00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.</p></div>
<div class="paragraph"><p>The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han,
null) and id code of the character.</p></div>
<div class="paragraph"><p>";" is a punctuation character. Its properties are thus represented by the
binary number 10000 (10 in hexadecimal).</p></div>
<div class="paragraph"><p>"b" is an alphabetic character and a lower case character. Its properties are
thus represented by the binary number 00011 (3 in hexadecimal).</p></div>
<div class="paragraph"><p>"W" is an alphabetic character and an upper case character. Its properties are
thus represented by the binary number 00101 (5 in hexadecimal).</p></div>
<div class="paragraph"><p>"7" is just a digit. Its properties are thus represented by the binary number
01000 (8 in hexadecimal).</p></div>
<div class="paragraph"><p>"=" is not punctuation nor a digit nor an alphabetic character. Its properties
are thus represented by the binary number 00000 (0 in hexadecimal).</p></div>
<div class="paragraph"><p>Japanese or Chinese alphabetic character properties are represented by the
binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
upper nor lower case.</p></div>
</div>
<h2 id="_example_v3_02">EXAMPLE (v3.02)</h2>
<div class="sectionbody">
<div class="literalblock">
<div class="content">
<pre><tt>110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
. . .</tt></pre>
</div></div>
</div>
<h2 id="_caveats">CAVEATS</h2>
<div class="sectionbody">
<div class="paragraph"><p>Although the unicharset reader maintains the ability to read unicharsets
of older formats and will assign default values to missing fields,
the accuracy will be degraded.</p></div>
<div class="paragraph"><p>Further, most other data files are indexed by the unicharset file,
so changing it without re-generating the others is likely to have dire
consequences.</p></div>
</div>
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
<div class="paragraph"><p>The unicharset format first appeared with Tesseract 2.00, which was the first version
to support languages other than English. The unicharset file contained only the first
two fields, and the "ispunctuation" property was absent (punctuation was regarded as
"0", as "=" is in the above example.</p></div>
<div class="paragraph"><p>The unicharset format first appeared with Tesseract 2.00, which was the
first version to support languages other than English. The unicharset file
contained only the first two fields, and the "ispunctuation" property was
absent (punctuation was regarded as "0", as "=" is in the above example).</p></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), unicharset_extractor(1)</p></div>
<div class="paragraph"><p>tesseract(1), combine_tessdata(1), unicharset_extractor(1)</p></div>
<div class="paragraph"><p><a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-29 23:46:32 IST
Last updated 2012-02-08 11:01:57 PDT
</div>
</div>
</body>

View File

@@ -11,53 +11,206 @@
</refmeta>
<refnamediv>
<refname>unicharset</refname>
<refpurpose>character properties for use by Tesseract</refpurpose>
<refpurpose>character properties file used by tesseract(1)</refpurpose>
</refnamediv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>Tesseract needs to have access to the character properties isalpha,
isdigit, isupper, islower, ispunctuation. This data must be encoded
in the unicharset data file. Each line of this file corresponds to
one character. The character in UTF-8 is followed by a hexadecimal
number representing a binary mask that encodes the properties.
Each bit corresponds to a property. If the bit is set to 1, it
means that the property is true. The bit ordering is (from least
significant bit to most significant bit): isalpha, islower, isupper,
isdigit, ispunctuation.</simpara>
<simpara>Each line in the unicharset file has four space-separated fields:</simpara>
<literallayout class="monospaced">[character] [properties] [script] [id]</literallayout>
<simpara>Tesseract&#8217;s unicharset file contains information on each symbol
(unichar) the Tesseract OCR engine is trained to recognize.</simpara>
<simpara>A unicharset file (i.e. <emphasis>eng.unicharset</emphasis>) is distributed as part of a
Tesseract language pack (i.e. <emphasis>eng.traineddata</emphasis>). For information on
extracting the unicharset file, see combine_tessdata(1).</simpara>
<simpara>The first line of a unicharset file contains the number of unichars in
the file. After this line, each subsequent line provides information for
a single unichar. The first such line contains a placeholder reserved for
the space character. Each unichar is referred to within Tesseract by its
Unichar ID, which is the line number (minus 1) within the unicharset file.
Therefore, space gets unichar 0.</simpara>
<simpara>Each unichar line in the unicharset file (v2+) may have four space-separated fields:</simpara>
<literallayout class="monospaced">'character' 'properties' 'script' 'id'</literallayout>
<simpara>Starting with Tesseract v3.02, more information may be given for each unichar:</simpara>
<literallayout class="monospaced">'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'</literallayout>
<simpara>Entries:</simpara>
<variablelist>
<varlistentry>
<term>
<emphasis>character</emphasis>
</term>
<listitem>
<simpara>
The UTF-8 encoded string to be produced for this unichar.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>properties</emphasis>
</term>
<listitem>
<simpara>
An integer mask of character properties, one per bit.
From least to most significant bit, these are: isalpha, islower, isupper,
isdigit, ispunctuation.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>glyph_metrics</emphasis>
</term>
<listitem>
<simpara>
Ten comma-separated integers representing various standards
for where this glyph is to be found within a baseline-normalized coordinate
system where 128 is normalized to x-height.
</simpara>
<itemizedlist>
<listitem>
<simpara>
min_bottom, max_bottom: the ranges where the bottom of the character can
be found.
</simpara>
</listitem>
<listitem>
<simpara>
min_top, max_top: the ranges where the top of the character may be found.
</simpara>
</listitem>
<listitem>
<simpara>
min_width, max_width: horizontal width of the character.
</simpara>
</listitem>
<listitem>
<simpara>
min_bearing, max_bearing: how far from the usual start position does the
leftmost part of the character begin.
</simpara>
</listitem>
<listitem>
<simpara>
min_advance, max_advance: how far from the printer&#8217;s cell left do we
advance to begin the next character.
</simpara>
</listitem>
</itemizedlist>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>script</emphasis>
</term>
<listitem>
<simpara>
Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>other_case</emphasis>
</term>
<listitem>
<simpara>
The Unichar ID of the other case version of this character
(upper or lower).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>direction</emphasis>
</term>
<listitem>
<simpara>
The Unicode BiDi direction of this character, as defined by
ICU&#8217;s enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
2 = European Number&#8230;)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>mirror</emphasis>
</term>
<listitem>
<simpara>
The Unichar ID of the BiDirectional mirror of this character.
For example the mirror of open paren is close paren, but Latin Capital C
has no mirror, so it remains a Latin Capital C.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>normed_form</emphasis>
</term>
<listitem>
<simpara>
The UTF-8 representation of a "normalized form" of this unichar
for the purpose of blaming a module for errors given ground truth text.
For instance, a left or right single quote may normalize to an ASCII quote.
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_example">
<title>EXAMPLE</title>
<refsect1 id="_example_v2">
<title>EXAMPLE (v2)</title>
<literallayout class="monospaced">; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93</literallayout>
<simpara>";" is a punctuation character. Its properties are thus represented by the binary number
10000 (10 in hexadecimal).</simpara>
<simpara>"b" is an alphabetic character and a lower case character. Its properties are thus
represented by the binary number 00011 (3 in hexadecimal).</simpara>
<simpara>"W" is an alphabetic character and an upper case character. Its properties are thus
represented by the binary number 00101 (5 in hexadecimal).</simpara>
<simpara>"7" is just a digit. Its properties are thus represented by the binary number 01000
(8 in hexadecimal).</simpara>
<simpara>"=" is not punctuation nor a digit nor an alphabetic character. Its properties are
thus represented by the binary number 00000 (0 in hexadecimal).</simpara>
<simpara>Japanese or Chinese alphabetic character properties are represented by the binary number
00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case.</simpara>
<simpara>The last two columns represent the type of script (Latin, Common, Greek, Cyrillic, Han,
null) and id code of the character.</simpara>
<simpara>";" is a punctuation character. Its properties are thus represented by the
binary number 10000 (10 in hexadecimal).</simpara>
<simpara>"b" is an alphabetic character and a lower case character. Its properties are
thus represented by the binary number 00011 (3 in hexadecimal).</simpara>
<simpara>"W" is an alphabetic character and an upper case character. Its properties are
thus represented by the binary number 00101 (5 in hexadecimal).</simpara>
<simpara>"7" is just a digit. Its properties are thus represented by the binary number
01000 (8 in hexadecimal).</simpara>
<simpara>"=" is not punctuation nor a digit nor an alphabetic character. Its properties
are thus represented by the binary number 00000 (0 in hexadecimal).</simpara>
<simpara>Japanese or Chinese alphabetic character properties are represented by the
binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
upper nor lower case.</simpara>
</refsect1>
<refsect1 id="_example_v3_02">
<title>EXAMPLE (v3.02)</title>
<literallayout class="monospaced">110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
. . .</literallayout>
</refsect1>
<refsect1 id="_caveats">
<title>CAVEATS</title>
<simpara>Although the unicharset reader maintains the ability to read unicharsets
of older formats and will assign default values to missing fields,
the accuracy will be degraded.</simpara>
<simpara>Further, most other data files are indexed by the unicharset file,
so changing it without re-generating the others is likely to have dire
consequences.</simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>The unicharset format first appeared with Tesseract 2.00, which was the first version
to support languages other than English. The unicharset file contained only the first
two fields, and the "ispunctuation" property was absent (punctuation was regarded as
"0", as "=" is in the above example.</simpara>
<simpara>The unicharset format first appeared with Tesseract 2.00, which was the
first version to support languages other than English. The unicharset file
contained only the first two fields, and the "ispunctuation" property was
absent (punctuation was regarded as "0", as "=" is in the above example).</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), unicharset_extractor(1)</simpara>
<simpara>tesseract(1), combine_tessdata(1), unicharset_extractor(1)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@@ -1,13 +1,13 @@
'\" t
.\" Title: unicharset_extractor
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "UNICHARSET_EXTRACTOR" "1" "09/30/2010" "\ \&" "\ \&"
.TH "UNICHARSET_EXTRACTOR" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@@ -31,7 +31,7 @@
unicharset_extractor \- extract unicharset from Tesseract boxfiles
.SH "SYNOPSIS"
.sp
\fBunicharset_extractor\fR \fIFILE\fR\&...
\fBunicharset_extractor\fR \fI[\-D dir]\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
Tesseract needs to know the set of possible characters it can output\&. To generate the unicharset data file, use the unicharset_extractor program on the same training pages bounding box files as used for clustering:
@@ -46,7 +46,9 @@ unicharset_extractor fontfile_1\&.box fontfile_2\&.box \&.\&.\&.
.RE
.\}
.sp
Tesseract needs to have access to character properties isalpha, isdigit, isupper, islower, ispunctuation\&. This data must be encoded in the unicharset data file\&. Each line of this file corresponds to one character\&. The character in UTF\-8 is followed by a hexadecimal number representing a binary mask that encodes the properties\&. Each bit corresponds to a property\&. If the bit is set to 1, it means that the property is true\&. The bit ordering is (from least significant bit to most significant bit): isalpha, islower, isupper, isdigit, ispunctuation\&. (See unicharset(5))
The unicharset will be put into the file \fIdir/unicharset\fR, or simply \fI\&./unicharset\fR if no output directory is provided\&.
.sp
Tesseract also needs to have access to character properties isalpha, isdigit, isupper, islower, ispunctuation\&. All of this auxiliary data and more is encoded in this file\&. (See unicharset(5))
.sp
If your system supports the wctype functions, these values will be set automatically by unicharset_extractor and there is no need to edit the unicharset file\&. On some older systems (eg Windows 95), the unicharset file must be edited by hand to add these property description codes\&.
.sp
@@ -54,9 +56,14 @@ If your system supports the wctype functions, these values will be set automatic
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "HISTORY"
.sp
unicharset_extractor first appeared in Tesseract 2\&.00\&.
.SH "COPYING"
.sp
Copyright \(co 2006, Google Inc\&. Licensed under the Apache License, Version 2\&.0
Copyright (C) 2006, Google Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@@ -7,7 +7,7 @@ unicharset_extractor - extract unicharset from Tesseract boxfiles
SYNOPSIS
--------
*unicharset_extractor* 'FILE'...
*unicharset_extractor* '[-D dir]' 'FILE'...
DESCRIPTION
-----------
@@ -18,16 +18,12 @@ clustering:
unicharset_extractor fontfile_1.box fontfile_2.box ...
Tesseract needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. This data must be encoded
in the unicharset data file. Each line of this file corresponds to
one character. The character in UTF-8 is followed by a hexadecimal
number representing a binary mask that encodes the properties. Each
bit corresponds to a property. If the bit is set to 1, it means that
the property is true. The bit ordering is (from least significant bit
to most significant bit): isalpha, islower, isupper, isdigit,
ispunctuation.
(See unicharset(5))
The unicharset will be put into the file 'dir/unicharset', or simply
'./unicharset' if no output directory is provided.
Tesseract also needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. All of this auxiliary data
and more is encoded in this file. (See unicharset(5))
If your system supports the wctype functions, these values will be set
automatically by unicharset_extractor and there is no need to edit the
@@ -44,11 +40,18 @@ SEE ALSO
--------
tesseract(1), unicharset(5)
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
HISTORY
-------
unicharset_extractor first appeared in Tesseract 2.00.
COPYING
-------
Copyright (C) 2006, Google Inc.
Copyright \(C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

View File

@@ -582,7 +582,7 @@ UNICHARSET_EXTRACTOR(1) Manual Page
<div id="content">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>unicharset_extractor</strong> <em>FILE</em>&#8230;</p></div>
<div class="paragraph"><p><strong>unicharset_extractor</strong> <em>[-D dir]</em> <em>FILE</em>&#8230;</p></div>
</div>
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
@@ -594,16 +594,11 @@ clustering:</p></div>
<div class="content">
<pre><tt>unicharset_extractor fontfile_1.box fontfile_2.box ...</tt></pre>
</div></div>
<div class="paragraph"><p>Tesseract needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. This data must be encoded
in the unicharset data file. Each line of this file corresponds to
one character. The character in UTF-8 is followed by a hexadecimal
number representing a binary mask that encodes the properties. Each
bit corresponds to a property. If the bit is set to 1, it means that
the property is true. The bit ordering is (from least significant bit
to most significant bit): isalpha, islower, isupper, isdigit,
ispunctuation.
(See unicharset(5))</p></div>
<div class="paragraph"><p>The unicharset will be put into the file <em>dir/unicharset</em>, or simply
<em>./unicharset</em> if no output directory is provided.</p></div>
<div class="paragraph"><p>Tesseract also needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. All of this auxiliary data
and more is encoded in this file. (See unicharset(5))</p></div>
<div class="paragraph"><p>If your system supports the wctype functions, these values will be set
automatically by unicharset_extractor and there is no need to edit the
unicharset file. On some older systems (eg Windows 95), the unicharset
@@ -617,6 +612,7 @@ cntraining, and giving the unicharset to mftraining.</p></div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), unicharset(5)</p></div>
<div class="paragraph"><p><a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
@@ -624,14 +620,19 @@ cntraining, and giving the unicharset to mftraining.</p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright &#169; 2006, Google Inc.
<div class="paragraph"><p>Copyright (C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-29 23:30:22 IST
Last updated 2012-02-09 09:19:05 PDT
</div>
</div>
</body>

View File

@@ -14,7 +14,7 @@
<refpurpose>extract unicharset from Tesseract boxfiles</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">unicharset_extractor</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
<simpara><emphasis role="strong">unicharset_extractor</emphasis> <emphasis>[-D dir]</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
@@ -23,16 +23,11 @@ To generate the unicharset data file, use the unicharset_extractor
program on the same training pages bounding box files as used for
clustering:</simpara>
<literallayout class="monospaced">unicharset_extractor fontfile_1.box fontfile_2.box ...</literallayout>
<simpara>Tesseract needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. This data must be encoded
in the unicharset data file. Each line of this file corresponds to
one character. The character in UTF-8 is followed by a hexadecimal
number representing a binary mask that encodes the properties. Each
bit corresponds to a property. If the bit is set to 1, it means that
the property is true. The bit ordering is (from least significant bit
to most significant bit): isalpha, islower, isupper, isdigit,
ispunctuation.
(See unicharset(5))</simpara>
<simpara>The unicharset will be put into the file <emphasis>dir/unicharset</emphasis>, or simply
<emphasis>./unicharset</emphasis> if no output directory is provided.</simpara>
<simpara>Tesseract also needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. All of this auxiliary data
and more is encoded in this file. (See unicharset(5))</simpara>
<simpara>If your system supports the wctype functions, these values will be set
automatically by unicharset_extractor and there is no need to edit the
unicharset file. On some older systems (eg Windows 95), the unicharset
@@ -46,6 +41,7 @@ cntraining, and giving the unicharset to mftraining.</simpara>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
@@ -53,7 +49,12 @@ cntraining, and giving the unicharset to mftraining.</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright &#169; 2006, Google Inc.
<simpara>Copyright (C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@@ -1,13 +1,13 @@
'\" t
.\" Title: wordlist2dawg
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Date: 02/09/2012
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "WORDLIST2DAWG" "1" "09/30/2010" "\ \&" "\ \&"
.TH "WORDLIST2DAWG" "1" "02/09/2012" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
@@ -32,23 +32,41 @@ wordlist2dawg \- convert a wordlist to a DAWG for Tesseract
.SH "SYNOPSIS"
.sp
\fBwordlist2dawg\fR \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-t \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-r 1 \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-r 2 \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-l <short> <long> \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.SH "DESCRIPTION"
.sp
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract\&.
.sp
The wordlists are split into two: one with high frequency words, and one with the rest\&.
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract\&. A DAWG is a compressed, space and time efficient representation of a word list\&.
.SH "OPTIONS"
.sp
\fIWORDLIST\fR A plain text file in UTF\-8, one word per line
\-t Verify that a given dawg file is equivalent to a given wordlist\&.
.sp
\fIDAWG\fR The output DAWG to write
\-r 1 Reverse a word if it contains an RTL character\&.
.sp
\fIlang\&.unicharset\fR The unicharset of the language\&. This is the unicharset generated by mftraining(1)
\-r 2 Reverse all words\&.
.sp
\-l <short> <long> Produce a file with several dawgs in it, one each for words of length <short>, <short+1>,\&... <long>
.SH "ARGUMENTS"
.sp
\fIWORDLIST\fR A plain text file in UTF\-8, one word per line\&.
.sp
\fIDAWG\fR The output DAWG to write\&.
.sp
\fIlang\&.unicharset\fR The unicharset of the language\&. This is the unicharset generated by mftraining(1)\&.
.SH "SEE ALSO"
.sp
tesseract(1), mftraining(1)
tesseract(1), combine_tessdata(1), dawg2wordlist(1)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "COPYING"
.sp
Copyright (c) 2006 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
Copyright (C) 2006 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@@ -1,5 +1,6 @@
WORDLIST2DAWG(1)
================
:doctype: manpage
NAME
----
@@ -9,33 +10,60 @@ SYNOPSIS
--------
*wordlist2dawg* 'WORDLIST' 'DAWG' 'lang.unicharset'
*wordlist2dawg* -t 'WORDLIST' 'DAWG' 'lang.unicharset'
*wordlist2dawg* -r 1 'WORDLIST' 'DAWG' 'lang.unicharset'
*wordlist2dawg* -r 2 'WORDLIST' 'DAWG' 'lang.unicharset'
*wordlist2dawg* -l <short> <long> 'WORDLIST' 'DAWG' 'lang.unicharset'
DESCRIPTION
-----------
wordlist2dawg(1) converts a wordlist to a Directed Acyclic
Word Graph (DAWG) for use with Tesseract.
The wordlists are split into two: one with high frequency
words, and one with the rest.
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract. A DAWG is a compressed, space and time
efficient representation of a word list.
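The space saving comes from sharing structure between words. As a rough illustration only (a plain prefix trie in Python, not wordlist2dawg's actual output format; a real DAWG also merges identical suffix sub-graphs):

    # Illustration only: a dictionary-based trie shares common prefixes,
    # which is part of how a DAWG compresses a word list.
    def build_trie(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True          # end-of-word marker
        return root

    def contains(trie, word):
        node = trie
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    trie = build_trie(["tap", "taps", "top", "tops"])
    print(contains(trie, "taps"), contains(trie, "tip"))   # True False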
OPTIONS
-------
-t
Verify that a given dawg file is equivalent to a given wordlist.
-r 1
Reverse a word if it contains an RTL character.
-r 2
Reverse all words.
-l <short> <long>
Produce a file with several dawgs in it, one each for words
of length <short>, <short+1>,... <long>
ARGUMENTS
---------
'WORDLIST'
A plain text file in UTF-8, one word per line
A plain text file in UTF-8, one word per line.
'DAWG'
The output DAWG to write
The output DAWG to write.
'lang.unicharset'
The unicharset of the language. This is the unicharset
generated by mftraining(1)
generated by mftraining(1).
SEE ALSO
--------
tesseract(1), mftraining(1)
tesseract(1), combine_tessdata(1), dawg2wordlist(1)
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>
COPYING
-------
Copyright (c) 2006 Google, Inc.
Copyright \(C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0
AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).

View File

@@ -583,39 +583,59 @@ WORDLIST2DAWG(1) Manual Page
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>wordlist2dawg</strong> <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -t <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -r 1 <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -r 2 <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -l &lt;short&gt; &lt;long&gt; <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
</div>
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>wordlist2dawg(1) converts a wordlist to a Directed Acyclic
Word Graph (DAWG) for use with Tesseract.</p></div>
<div class="paragraph"><p>The wordlists are split into two: one with high frequency
words, and one with the rest.</p></div>
<div class="paragraph"><p>wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract. A DAWG is a compressed, space and time
efficient representation of a word list.</p></div>
</div>
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="paragraph"><p>-t
Verify that a given dawg file is equivalent to a given wordlist.</p></div>
<div class="paragraph"><p>-r 1
Reverse a word if it contains an RTL character.</p></div>
<div class="paragraph"><p>-r 2
Reverse all words.</p></div>
<div class="paragraph"><p>-l &lt;short&gt; &lt;long&gt;
Produce a file with several dawgs in it, one each for words
of length &lt;short&gt;, &lt;short+1&gt;,&#8230; &lt;long&gt;</p></div>
</div>
<h2 id="_arguments">ARGUMENTS</h2>
<div class="sectionbody">
<div class="paragraph"><p><em>WORDLIST</em>
A plain text file in UTF-8, one word per line</p></div>
A plain text file in UTF-8, one word per line.</p></div>
<div class="paragraph"><p><em>DAWG</em>
The output DAWG to write</p></div>
The output DAWG to write.</p></div>
<div class="paragraph"><p><em>lang.unicharset</em>
The unicharset of the language. This is the unicharset
generated by mftraining(1)</p></div>
generated by mftraining(1).</p></div>
</div>
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), mftraining(1)</p></div>
<div class="paragraph"><p>tesseract(1), combine_tessdata(1), dawg2wordlist(1)</p></div>
<div class="paragraph"><p><a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</a></p></div>
</div>
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (c) 2006 Google, Inc.
<div class="paragraph"><p>Copyright (C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2010-09-30 03:16:57 IST
Last updated 2012-02-07 13:43:35 PDT
</div>
</div>
</body>

View File

@@ -15,32 +15,52 @@
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">wordlist2dawg</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -t <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 1 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 2 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -l &lt;short&gt; &lt;long&gt; <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>wordlist2dawg(1) converts a wordlist to a Directed Acyclic
Word Graph (DAWG) for use with Tesseract.</simpara>
<simpara>The wordlists are split into two: one with high frequency
words, and one with the rest.</simpara>
<simpara>wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract. A DAWG is a compressed, space and time
efficient representation of a word list.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<simpara>-t
Verify that a given dawg file is equivalent to a given wordlist.</simpara>
<simpara>-r 1
Reverse a word if it contains an RTL character.</simpara>
<simpara>-r 2
Reverse all words.</simpara>
<simpara>-l &lt;short&gt; &lt;long&gt;
Produce a file with several dawgs in it, one each for words
of length &lt;short&gt;, &lt;short+1&gt;,&#8230; &lt;long&gt;</simpara>
</refsect1>
<refsect1 id="_arguments">
<title>ARGUMENTS</title>
<simpara><emphasis>WORDLIST</emphasis>
A plain text file in UTF-8, one word per line</simpara>
A plain text file in UTF-8, one word per line.</simpara>
<simpara><emphasis>DAWG</emphasis>
The output DAWG to write</simpara>
The output DAWG to write.</simpara>
<simpara><emphasis>lang.unicharset</emphasis>
The unicharset of the language. This is the unicharset
generated by mftraining(1)</simpara>
generated by mftraining(1).</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), mftraining(1)</simpara>
<simpara>tesseract(1), combine_tessdata(1), dawg2wordlist(1)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (c) 2006 Google, Inc.
<simpara>Copyright (C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>