doc: Remove generated files and add rules to build manpages

Those files can be built by doc/generate_manpages.sh.

The manpages are needed for the installation,
so add Makefile rules for them.

Git must ignore the generated manpages.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
This commit is contained in:
Stefan Weil 2017-04-30 11:27:17 +02:00
parent 7a116ce8bb
commit 2794410c9b
35 changed files with 17 additions and 12646 deletions

8
.gitignore vendored
View File

@ -20,8 +20,6 @@ vs2010/LIB_Release/*
vs2010/LIB_OpenCL_Release/*
vs2010/LIB_OpenCL_Debug/*
# Linux
# ignore local configuration
config.*
@ -36,7 +34,11 @@ libtool
stamp-h1
tesseract.pc
config_auto.h
doc/html/*
/doc/html/*
/doc/*.1
/doc/*.5
/doc/*.html
/doc/*.xml
api/tesseract
training/ambiguous_words
training/classifier_tester

View File

@ -1,4 +1,16 @@
asciidoc=asciidoc -d manpage
man_MANS = cntraining.1 combine_tessdata.1 mftraining.1 tesseract.1 \
unicharset_extractor.1 wordlist2dawg.1 unicharambigs.5 \
unicharset.5 ambiguous_words.1 shapeclustering.1 dawg2wordlist.1
if MAINTAINER_MODE
EXTRA_DIST = $(man_MANS) Doxyfile
%: %.asc
$(asciidoc) -o $@ $<
endif # MAINTAINER_MODE
MAINTAINERCLEANFILES = $(man_MANS) Doxyfile

View File

@ -1,46 +0,0 @@
'\" t
.\" Title: ambiguous_words
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "AMBIGUOUS_WORDS" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
ambiguous_words \- generate sets of words Tesseract is likely to find ambiguous
.SH "SYNOPSIS"
.sp
\fBambiguous_words\fR [\-l lang] \fITESSDATADIR\fR \fIWORDLIST\fR \fIAMBIGUOUSFILE\fR
.SH "DESCRIPTION"
.sp
ambiguous_words(1) runs Tesseract in a special mode, and for each word in word list, produces a set of words which Tesseract thinks might be ambiguous with it\&. \fITESSDATADIR\fR must be set to the absolute path of a directory containing \fItessdata/lang\&.traineddata\fR\&.
.SH "SEE ALSO"
.sp
tesseract(1)
.SH "COPYING"
.sp
Copyright (C) 2012 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,790 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>AMBIGUOUS_WORDS(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
AMBIGUOUS_WORDS(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>ambiguous_words -
generate sets of words Tesseract is likely to find ambiguous
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>ambiguous_words</strong> [-l lang] <em>TESSDATADIR</em> <em>WORDLIST</em> <em>AMBIGUOUSFILE</em></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>ambiguous_words(1) runs Tesseract in a special mode, and for each word
in word list, produces a set of words which Tesseract thinks might be
ambiguous with it. <em>TESSDATADIR</em> must be set to the absolute path of
a directory containing <em>tessdata/lang.traineddata</em>.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1)</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-05-13 19:59:45 CEST
</div>
</div>
</body>
</html>

View File

@ -1,43 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>AMBIGUOUS_WORDS(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>ambiguous_words</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>ambiguous_words</refname>
<refpurpose>generate sets of words Tesseract is likely to find ambiguous</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">ambiguous_words</emphasis> [-l lang] <emphasis>TESSDATADIR</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>AMBIGUOUSFILE</emphasis></simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>ambiguous_words(1) runs Tesseract in a special mode, and for each word
in word list, produces a set of words which Tesseract thinks might be
ambiguous with it. <emphasis>TESSDATADIR</emphasis> must be set to the absolute path of
a directory containing <emphasis>tessdata/lang.traineddata</emphasis>.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1)</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,54 +0,0 @@
'\" t
.\" Title: cntraining
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "CNTRAINING" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
cntraining \- character normalization training for Tesseract
.SH "SYNOPSIS"
.sp
\fBcntraining\fR [\-D \fIdir\fR] \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
cntraining takes a list of \&.tr files, from which it generates the \fBnormproto\fR data file (the character normalization sensitivity prototypes)\&.
.SH "OPTIONS"
.PP
\-D \fIdir\fR
.RS 4
Directory to write output files to\&.
.RE
.SH "SEE ALSO"
.sp
tesseract(1), shapeclustering(1), mftraining(1)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "COPYING"
.sp
Copyright (c) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,805 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>CNTRAINING(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
CNTRAINING(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>cntraining -
character normalization training for Tesseract
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>cntraining</strong> [-D <em>dir</em>] <em>FILE</em>&#8230;</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>cntraining takes a list of .tr files, from which it generates the
<strong>normproto</strong> data file (the character normalization sensitivity
prototypes).</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="dlist"><dl>
<dt class="hdlist1">
-D <em>dir</em>
</dt>
<dd>
<p>
Directory to write output files to.
</p>
</dd>
</dl></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), shapeclustering(1), mftraining(1)</p></div>
<div class="paragraph"><p><a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (c) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:50:30 CEST
</div>
</div>
</body>
</html>

View File

@ -1,58 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>CNTRAINING(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>cntraining</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>cntraining</refname>
<refpurpose>character normalization training for Tesseract</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">cntraining</emphasis> [-D <emphasis>dir</emphasis>] <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>cntraining takes a list of .tr files, from which it generates the
<emphasis role="strong">normproto</emphasis> data file (the character normalization sensitivity
prototypes).</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>
-D <emphasis>dir</emphasis>
</term>
<listitem>
<simpara>
Directory to write output files to.
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), shapeclustering(1), mftraining(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (c) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,205 +0,0 @@
'\" t
.\" Title: combine_tessdata
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "COMBINE_TESSDATA" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
combine_tessdata \- combine/extract/overwrite Tesseract data
.SH "SYNOPSIS"
.sp
\fBcombine_tessdata\fR [\fIOPTION\fR] \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
combine_tessdata(1) is the main program to combine/extract/overwrite tessdata components in [lang]\&.traineddata files\&.
.sp
To combine all the individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs) located at, say, /home/$USER/temp/eng\&.* run:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata /home/$USER/temp/eng\&.
.fi
.if n \{\
.RE
.\}
.sp
The result will be a combined tessdata file /home/$USER/temp/eng\&.traineddata
.sp
Specify option \-e if you would like to extract individual components from a combined traineddata file\&. For example, to extract language config file and the unicharset from tessdata/eng\&.traineddata run:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata \-e tessdata/eng\&.traineddata \e
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
.fi
.if n \{\
.RE
.\}
.sp
The desired config file and unicharset will be written to /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
.sp
Specify option \-o to overwrite individual components of the given [lang]\&.traineddata file\&. For example, to overwrite language config and unichar ambiguities files in tessdata/eng\&.traineddata use:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata \-o tessdata/eng\&.traineddata \e
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharambigs
.fi
.if n \{\
.RE
.\}
.sp
As a result, tessdata/eng\&.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc\&.
.sp
Note: the file names of the files to extract to and to overwrite from should have the appropriate file suffixes (extensions) indicating their tessdata component type (\&.unicharset for the unicharset, \&.unicharambigs for unichar ambigs, etc)\&. See k*FileSuffix variable in ccutil/tessdatamanager\&.h\&.
.sp
Specify option \-u to unpack all the components to the specified path:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata \-u tessdata/eng\&.traineddata /home/$USER/temp/eng\&.
.fi
.if n \{\
.RE
.\}
.sp
This will create /home/$USER/temp/eng\&.* files with individual tessdata components from tessdata/eng\&.traineddata\&.
.SH "OPTIONS"
.sp
\fB\-e\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Extracts the specified components from the \&.traineddata file
.sp
\fB\-o\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Overwrites the specified components of the \&.traineddata file with those provided on the comand line\&.
.sp
\fB\-u\fR \fI\&.traineddata\fR \fIPATHPREFIX\fR Unpacks the \&.traineddata using the provided prefix\&.
.SH "CAVEATS"
.sp
\fIPrefix\fR refers to the full file prefix, including period (\&.)
.SH "COMPONENTS"
.sp
The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.PP
lang\&.config
.RS 4
(Optional) Language\-specific overrides to default config variables\&.
.RE
.PP
lang\&.unicharset
.RS 4
(Required) The list of symbols that Tesseract recognizes, with properties\&. See unicharset(5)\&.
.RE
.PP
lang\&.unicharambigs
.RS 4
(Optional) This file contains information on pairs of recognized symbols which are often confused\&. For example,
\fIrn\fR
and
\fIm\fR\&.
.RE
.PP
lang\&.inttemp
.RS 4
(Required) Character shape templates for each unichar\&. Produced by mftraining(1)\&.
.RE
.PP
lang\&.pffmtable
.RS 4
(Required) The number of features expected for each unichar\&. Produced by mftraining(1) from
\fB\&.tr\fR
files\&.
.RE
.PP
lang\&.normproto
.RS 4
(Required) Character normalization prototypes generated by cntraining(1) from
\fB\&.tr\fR
files\&.
.RE
.PP
lang\&.punc\-dawg
.RS 4
(Optional) A dawg made from punctuation patterns found around words\&. The "word" part is replaced by a single space\&.
.RE
.PP
lang\&.word\-dawg
.RS 4
(Optional) A dawg made from dictionary words from the language\&.
.RE
.PP
lang\&.number\-dawg
.RS 4
(Optional) A dawg made from tokens which originally contained digits\&. Each digit is replaced by a space character\&.
.RE
.PP
lang\&.freq\-dawg
.RS 4
(Optional) A dawg made from the most frequent words which would have gone into word\-dawg\&.
.RE
.PP
lang\&.fixed\-length\-dawgs
.RS 4
(Optional) Several dawgs of different fixed lengths \(em useful for languages like Chinese\&.
.RE
.PP
lang\&.shapetable
.RS 4
(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar\-id and font\&.
.RE
.PP
lang\&.bigram\-dawg
.RS 4
(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a
\fI?\fR\&.
.RE
.PP
lang\&.unambig\-dawg
.RS 4
(Optional) TODO: Describe\&.
.RE
.PP
lang\&.params\-training\-model
.RS 4
(Optional) TODO: Describe\&.
.RE
.SH "HISTORY"
.sp
combine_tessdata(1) first appeared in version 3\&.00 of Tesseract
.SH "SEE ALSO"
.sp
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5)
.SH "COPYING"
.sp
Copyright (C) 2009, Google Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,996 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>COMBINE_TESSDATA(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
COMBINE_TESSDATA(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>combine_tessdata -
combine/extract/overwrite Tesseract data
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>combine_tessdata</strong> [<em>OPTION</em>] <em>FILE</em>&#8230;</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>combine_tessdata(1) is the main program to combine/extract/overwrite
tessdata components in [lang].traineddata files.</p></div>
<div class="paragraph"><p>To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>combine_tessdata /home/$USER/temp/eng.</code></pre>
</div></div>
<div class="paragraph"><p>The result will be a combined tessdata file /home/$USER/temp/eng.traineddata</p></div>
<div class="paragraph"><p>Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract language config
file and the unicharset from tessdata/eng.traineddata run:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>combine_tessdata -e tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</code></pre>
</div></div>
<div class="paragraph"><p>The desired config file and unicharset will be written to
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</p></div>
<div class="paragraph"><p>Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite language config
and unichar ambiguities files in tessdata/eng.traineddata use:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>combine_tessdata -o tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</code></pre>
</div></div>
<div class="paragraph"><p>As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.</p></div>
<div class="paragraph"><p>Note: the file names of the files to extract to and to overwrite from should
have the appropriate file suffixes (extensions) indicating their tessdata
component type (.unicharset for the unicharset, .unicharambigs for unichar
ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.</p></div>
<div class="paragraph"><p>Specify option -u to unpack all the components to the specified path:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.</code></pre>
</div></div>
<div class="paragraph"><p>This will create /home/$USER/temp/eng.* files with individual tessdata
components from tessdata/eng.traineddata.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>-e</strong> <em>.traineddata</em> <em>FILE</em>&#8230;:
Extracts the specified components from the .traineddata file</p></div>
<div class="paragraph"><p><strong>-o</strong> <em>.traineddata</em> <em>FILE</em>&#8230;:
Overwrites the specified components of the .traineddata file
with those provided on the comand line.</p></div>
<div class="paragraph"><p><strong>-u</strong> <em>.traineddata</em> <em>PATHPREFIX</em>
Unpacks the .traineddata using the provided prefix.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_caveats">CAVEATS</h2>
<div class="sectionbody">
<div class="paragraph"><p><em>Prefix</em> refers to the full file prefix, including period (.)</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_components">COMPONENTS</h2>
<div class="sectionbody">
<div class="paragraph"><p>The components in a Tesseract lang.traineddata file as of
Tesseract 3.02 are briefly described below; For more information on
many of these files, see
<a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
<div class="dlist"><dl>
<dt class="hdlist1">
lang.config
</dt>
<dd>
<p>
(Optional) Language-specific overrides to default config variables.
</p>
</dd>
<dt class="hdlist1">
lang.unicharset
</dt>
<dd>
<p>
(Required) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5).
</p>
</dd>
<dt class="hdlist1">
lang.unicharambigs
</dt>
<dd>
<p>
(Optional) This file contains information on pairs of recognized symbols
which are often confused. For example, <em>rn</em> and <em>m</em>.
</p>
</dd>
<dt class="hdlist1">
lang.inttemp
</dt>
<dd>
<p>
(Required) Character shape templates for each unichar. Produced by
mftraining(1).
</p>
</dd>
<dt class="hdlist1">
lang.pffmtable
</dt>
<dd>
<p>
(Required) The number of features expected for each unichar.
Produced by mftraining(1) from <strong>.tr</strong> files.
</p>
</dd>
<dt class="hdlist1">
lang.normproto
</dt>
<dd>
<p>
(Required) Character normalization prototypes generated by cntraining(1)
from <strong>.tr</strong> files.
</p>
</dd>
<dt class="hdlist1">
lang.punc-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.
</p>
</dd>
<dt class="hdlist1">
lang.word-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from dictionary words from the language.
</p>
</dd>
<dt class="hdlist1">
lang.number-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.
</p>
</dd>
<dt class="hdlist1">
lang.freq-dawg
</dt>
<dd>
<p>
(Optional) A dawg made from the most frequent words which would have
gone into word-dawg.
</p>
</dd>
<dt class="hdlist1">
lang.fixed-length-dawgs
</dt>
<dd>
<p>
(Optional) Several dawgs of different fixed lengths&#8201;&#8212;&#8201;useful for
languages like Chinese.
</p>
</dd>
<dt class="hdlist1">
lang.shapetable
</dt>
<dd>
<p>
(Optional) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id
and font.
</p>
</dd>
<dt class="hdlist1">
lang.bigram-dawg
</dt>
<dd>
<p>
(Optional) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a <em>?</em>.
</p>
</dd>
<dt class="hdlist1">
lang.unambig-dawg
</dt>
<dd>
<p>
(Optional) TODO: Describe.
</p>
</dd>
<dt class="hdlist1">
lang.params-training-model
</dt>
<dd>
<p>
(Optional) TODO: Describe.
</p>
</dd>
</dl></div>
</div>
</div>
<div class="sect1">
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
<div class="paragraph"><p>combine_tessdata(1) first appeared in version 3.00 of Tesseract</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:52:02 CEST
</div>
</div>
</body>
</html>

View File

@ -1,259 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>COMBINE_TESSDATA(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>combine_tessdata</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>combine_tessdata</refname>
<refpurpose>combine/extract/overwrite Tesseract data</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">combine_tessdata</emphasis> [<emphasis>OPTION</emphasis>] <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>combine_tessdata(1) is the main program to combine/extract/overwrite
tessdata components in [lang].traineddata files.</simpara>
<simpara>To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:</simpara>
<literallayout class="monospaced">combine_tessdata /home/$USER/temp/eng.</literallayout>
<simpara>The result will be a combined tessdata file /home/$USER/temp/eng.traineddata</simpara>
<simpara>Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract language config
file and the unicharset from tessdata/eng.traineddata run:</simpara>
<literallayout class="monospaced">combine_tessdata -e tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</literallayout>
<simpara>The desired config file and unicharset will be written to
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</simpara>
<simpara>Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite language config
and unichar ambiguities files in tessdata/eng.traineddata use:</simpara>
<literallayout class="monospaced">combine_tessdata -o tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</literallayout>
<simpara>As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.</simpara>
<simpara>Note: the file names of the files to extract to and to overwrite from should
have the appropriate file suffixes (extensions) indicating their tessdata
component type (.unicharset for the unicharset, .unicharambigs for unichar
ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.</simpara>
<simpara>Specify option -u to unpack all the components to the specified path:</simpara>
<literallayout class="monospaced">combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.</literallayout>
<simpara>This will create /home/$USER/temp/eng.* files with individual tessdata
components from tessdata/eng.traineddata.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<simpara><emphasis role="strong">-e</emphasis> <emphasis>.traineddata</emphasis> <emphasis>FILE</emphasis>&#8230;:
Extracts the specified components from the .traineddata file</simpara>
<simpara><emphasis role="strong">-o</emphasis> <emphasis>.traineddata</emphasis> <emphasis>FILE</emphasis>&#8230;:
Overwrites the specified components of the .traineddata file
with those provided on the comand line.</simpara>
<simpara><emphasis role="strong">-u</emphasis> <emphasis>.traineddata</emphasis> <emphasis>PATHPREFIX</emphasis>
Unpacks the .traineddata using the provided prefix.</simpara>
</refsect1>
<refsect1 id="_caveats">
<title>CAVEATS</title>
<simpara><emphasis>Prefix</emphasis> refers to the full file prefix, including period (.)</simpara>
</refsect1>
<refsect1 id="_components">
<title>COMPONENTS</title>
<simpara>The components in a Tesseract lang.traineddata file as of
Tesseract 3.02 are briefly described below; For more information on
many of these files, see
<ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
<variablelist>
<varlistentry>
<term>
lang.config
</term>
<listitem>
<simpara>
(Optional) Language-specific overrides to default config variables.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.unicharset
</term>
<listitem>
<simpara>
(Required) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.unicharambigs
</term>
<listitem>
<simpara>
(Optional) This file contains information on pairs of recognized symbols
which are often confused. For example, <emphasis>rn</emphasis> and <emphasis>m</emphasis>.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.inttemp
</term>
<listitem>
<simpara>
(Required) Character shape templates for each unichar. Produced by
mftraining(1).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.pffmtable
</term>
<listitem>
<simpara>
(Required) The number of features expected for each unichar.
Produced by mftraining(1) from <emphasis role="strong">.tr</emphasis> files.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.normproto
</term>
<listitem>
<simpara>
(Required) Character normalization prototypes generated by cntraining(1)
from <emphasis role="strong">.tr</emphasis> files.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.punc-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.word-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from dictionary words from the language.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.number-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.freq-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg made from the most frequent words which would have
gone into word-dawg.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.fixed-length-dawgs
</term>
<listitem>
<simpara>
(Optional) Several dawgs of different fixed lengths&#8201;&#8212;&#8201;useful for
languages like Chinese.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.shapetable
</term>
<listitem>
<simpara>
(Optional) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id
and font.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.bigram-dawg
</term>
<listitem>
<simpara>
(Optional) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a <emphasis>?</emphasis>.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.unambig-dawg
</term>
<listitem>
<simpara>
(Optional) TODO: Describe.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
lang.params-training-model
</term>
<listitem>
<simpara>
(Optional) TODO: Describe.
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>combine_tessdata(1) first appeared in version 3.00 of Tesseract</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,55 +0,0 @@
'\" t
.\" Title: dawg2wordlist
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "DAWG2WORDLIST" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
dawg2wordlist \- convert a Tesseract DAWG to a wordlist
.SH "SYNOPSIS"
.sp
\fBdawg2wordlist\fR \fIUNICHARSET\fR \fIDAWG\fR \fIWORDLIST\fR
.SH "DESCRIPTION"
.sp
dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a list of words using a unicharset as key\&.
.SH "OPTIONS"
.sp
\fIUNICHARSET\fR The unicharset of the language\&. This is the unicharset generated by mftraining(1)\&.
.sp
\fIDAWG\fR The input DAWG, created by wordlist2dawg(1)
.sp
\fIWORDLIST\fR Plain text (output) file in UTF\-8, one word per line
.SH "SEE ALSO"
.sp
tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "COPYING"
.sp
Copyright (C) 2012 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,802 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>DAWG2WORDLIST(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
DAWG2WORDLIST(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>dawg2wordlist -
convert a Tesseract DAWG to a wordlist
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>dawg2wordlist</strong> <em>UNICHARSET</em> <em>DAWG</em> <em>WORDLIST</em></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>dawg2wordlist(1) converts a Tesseract Directed Acyclic Word
Graph (DAWG) to a list of words using a unicharset as key.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="paragraph"><p><em>UNICHARSET</em>
The unicharset of the language. This is the unicharset
generated by mftraining(1).</p></div>
<div class="paragraph"><p><em>DAWG</em>
The input DAWG, created by wordlist2dawg(1)</p></div>
<div class="paragraph"><p><em>WORDLIST</em>
Plain text (output) file in UTF-8, one word per line</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5),
combine_tessdata(1)</p></div>
<div class="paragraph"><p><a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:52:09 CEST
</div>
</div>
</body>
</html>

View File

@ -1,53 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>DAWG2WORDLIST(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>dawg2wordlist</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>dawg2wordlist</refname>
<refpurpose>convert a Tesseract DAWG to a wordlist</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">dawg2wordlist</emphasis> <emphasis>UNICHARSET</emphasis> <emphasis>DAWG</emphasis> <emphasis>WORDLIST</emphasis></simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>dawg2wordlist(1) converts a Tesseract Directed Acyclic Word
Graph (DAWG) to a list of words using a unicharset as key.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<simpara><emphasis>UNICHARSET</emphasis>
The unicharset of the language. This is the unicharset
generated by mftraining(1).</simpara>
<simpara><emphasis>DAWG</emphasis>
The input DAWG, created by wordlist2dawg(1)</simpara>
<simpara><emphasis>WORDLIST</emphasis>
Plain text (output) file in UTF-8, one word per line</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5),
combine_tessdata(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,94 +0,0 @@
'\" t
.\" Title: mftraining
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MFTRAINING" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
mftraining \- feature training for Tesseract
.SH "SYNOPSIS"
.sp
mftraining \-U \fIunicharset\fR \-O \fIlang\&.unicharset\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
mftraining takes a list of \&.tr files, from which it generates the files \fBinttemp\fR (the shape prototypes), \fBshapetable\fR, and \fBpffmtable\fR (the number of expected features for each character)\&. (A fourth file called Microfeat is also written by this program, but it is not used\&.)
.SH "OPTIONS"
.PP
\-U \fIFILE\fR
.RS 4
(Input) The unicharset generated by unicharset_extractor(1)
.RE
.PP
\-F \fIfont_properties_file\fR
.RS 4
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
.sp
.if n \{\
.RS 4
.\}
.nf
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-X \fIxheights_file\fR
.RS 4
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi\&. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
.sp
.if n \{\
.RS 4
.\}
.nf
*font_name* *xheight*
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-D \fIdir\fR
.RS 4
Directory to write output files to\&.
.RE
.PP
\-O \fIFILE\fR
.RS 4
(Output) The output unicharset that will be given to combine_tessdata(1)
.RE
.SH "SEE ALSO"
.sp
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), shapeclustering(1), unicharset(5)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "COPYING"
.sp
Copyright (C) Hewlett\-Packard Company, 1988 Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,847 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>MFTRAINING(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
MFTRAINING(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>mftraining -
feature training for Tesseract
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p>mftraining -U <em>unicharset</em> -O <em>lang.unicharset</em> <em>FILE</em>&#8230;</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>mftraining takes a list of .tr files, from which it generates the
files <strong>inttemp</strong> (the shape prototypes), <strong>shapetable</strong>, and <strong>pffmtable</strong>
(the number of expected features for each character). (A fourth file
called Microfeat is also written by this program, but it is not used.)</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="dlist"><dl>
<dt class="hdlist1">
-U <em>FILE</em>
</dt>
<dd>
<p>
(Input) The unicharset generated by unicharset_extractor(1)
</p>
</dd>
<dt class="hdlist1">
-F <em>font_properties_file</em>
</dt>
<dd>
<p>
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
</p>
<div class="literalblock">
<div class="content">
<pre><code>*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*</code></pre>
</div></div>
</dd>
<dt class="hdlist1">
-X <em>xheights_file</em>
</dt>
<dd>
<p>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</p>
<div class="literalblock">
<div class="content">
<pre><code>*font_name* *xheight*</code></pre>
</div></div>
</dd>
<dt class="hdlist1">
-D <em>dir</em>
</dt>
<dd>
<p>
Directory to write output files to.
</p>
</dd>
<dt class="hdlist1">
-O <em>FILE</em>
</dt>
<dd>
<p>
(Output) The output unicharset that will be given to combine_tessdata(1)
</p>
</dd>
</dl></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
shapeclustering(1), unicharset(5)</p></div>
<div class="paragraph"><p><a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:52:19 CEST
</div>
</div>
</body>
</html>

View File

@ -1,102 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>MFTRAINING(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>mftraining</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>mftraining</refname>
<refpurpose>feature training for Tesseract</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara>mftraining -U <emphasis>unicharset</emphasis> -O <emphasis>lang.unicharset</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>mftraining takes a list of .tr files, from which it generates the
files <emphasis role="strong">inttemp</emphasis> (the shape prototypes), <emphasis role="strong">shapetable</emphasis>, and <emphasis role="strong">pffmtable</emphasis>
(the number of expected features for each character). (A fourth file
called Microfeat is also written by this program, but it is not used.)</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>
-U <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
(Input) The unicharset generated by unicharset_extractor(1)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-F <emphasis>font_properties_file</emphasis>
</term>
<listitem>
<simpara>
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
</simpara>
<literallayout class="monospaced">*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-X <emphasis>xheights_file</emphasis>
</term>
<listitem>
<simpara>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</simpara>
<literallayout class="monospaced">*font_name* *xheight*</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-D <emphasis>dir</emphasis>
</term>
<listitem>
<simpara>
Directory to write output files to.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-O <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
(Output) The output unicharset that will be given to combine_tessdata(1)
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
shapeclustering(1), unicharset(5)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,94 +0,0 @@
'\" t
.\" Title: shapeclustering
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "SHAPECLUSTERING" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
shapeclustering \- shape clustering training for Tesseract
.SH "SYNOPSIS"
.sp
shapeclustering \-D \fIoutput_dir\fR \-U \fIunicharset\fR \-O \fImfunicharset\fR \-F \fIfont_props\fR \-X \fIxheights\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
shapeclustering(1) takes extracted feature \&.tr files (generated by tesseract(1) run in a special mode from box files) and produces a file \fBshapetable\fR and an enhanced unicharset\&. This program is still experimental, and is not required (yet) for training Tesseract\&.
.SH "OPTIONS"
.PP
\-U \fIFILE\fR
.RS 4
The unicharset generated by unicharset_extractor(1)\&.
.RE
.PP
\-D \fIdir\fR
.RS 4
Directory to write output files to\&.
.RE
.PP
\-F \fIfont_properties_file\fR
.RS 4
(Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqfont_name\*(Aq \*(Aqitalic\*(Aq \*(Aqbold\*(Aq \*(Aqfixed_pitch\*(Aq \*(Aqserif\*(Aq \*(Aqfraktur\*(Aq
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-X \fIxheights_file\fR
.RS 4
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi\&. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqfont_name\*(Aq \*(Aqxheight\*(Aq
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-O \fIFILE\fR
.RS 4
The output unicharset that will be given to combine_tessdata(1)\&.
.RE
.SH "SEE ALSO"
.sp
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "COPYING"
.sp
Copyright (C) Google, 2011 Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,850 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>SHAPECLUSTERING(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
SHAPECLUSTERING(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>shapeclustering -
shape clustering training for Tesseract
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p>shapeclustering -D <em>output_dir</em>
-U <em>unicharset</em> -O <em>mfunicharset</em>
-F <em>font_props</em> -X <em>xheights</em>
<em>FILE</em>&#8230;</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>shapeclustering(1) takes extracted feature .tr files (generated by
tesseract(1) run in a special mode from box files) and produces a
file <strong>shapetable</strong> and an enhanced unicharset. This program is still
experimental, and is not required (yet) for training Tesseract.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="dlist"><dl>
<dt class="hdlist1">
-U <em>FILE</em>
</dt>
<dd>
<p>
The unicharset generated by unicharset_extractor(1).
</p>
</dd>
<dt class="hdlist1">
-D <em>dir</em>
</dt>
<dd>
<p>
Directory to write output files to.
</p>
</dd>
<dt class="hdlist1">
-F <em>font_properties_file</em>
</dt>
<dd>
<p>
(Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1:
</p>
<div class="literalblock">
<div class="content">
<pre><code>'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'</code></pre>
</div></div>
</dd>
<dt class="hdlist1">
-X <em>xheights_file</em>
</dt>
<dd>
<p>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</p>
<div class="literalblock">
<div class="content">
<pre><code>'font_name' 'xheight'</code></pre>
</div></div>
</dd>
<dt class="hdlist1">
-O <em>FILE</em>
</dt>
<dd>
<p>
The output unicharset that will be given to combine_tessdata(1).
</p>
</dd>
</dl></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
unicharset(5)</p></div>
<div class="paragraph"><p><a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) Google, 2011
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:52:24 CEST
</div>
</div>
</body>
</html>

View File

@ -1,105 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>SHAPECLUSTERING(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>shapeclustering</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>shapeclustering</refname>
<refpurpose>shape clustering training for Tesseract</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara>shapeclustering -D <emphasis>output_dir</emphasis>
-U <emphasis>unicharset</emphasis> -O <emphasis>mfunicharset</emphasis>
-F <emphasis>font_props</emphasis> -X <emphasis>xheights</emphasis>
<emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>shapeclustering(1) takes extracted feature .tr files (generated by
tesseract(1) run in a special mode from box files) and produces a
file <emphasis role="strong">shapetable</emphasis> and an enhanced unicharset. This program is still
experimental, and is not required (yet) for training Tesseract.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>
-U <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
The unicharset generated by unicharset_extractor(1).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-D <emphasis>dir</emphasis>
</term>
<listitem>
<simpara>
Directory to write output files to.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-F <emphasis>font_properties_file</emphasis>
</term>
<listitem>
<simpara>
(Input) font properties file, where each line is of the following form, where each field other than the font name is 0 or 1:
</simpara>
<literallayout class="monospaced">'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-X <emphasis>xheights_file</emphasis>
</term>
<listitem>
<simpara>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</simpara>
<literallayout class="monospaced">'font_name' 'xheight'</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-O <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
The output unicharset that will be given to combine_tessdata(1).
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
unicharset(5)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) Google, 2011
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,282 +0,0 @@
'\" t
.\" Title: tesseract
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 03/23/2017
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "TESSERACT" "1" "03/23/2017" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
tesseract \- command\-line OCR engine
.SH "SYNOPSIS"
.sp
\fBtesseract\fR \fIimagename\fR|\fIstdin\fR \fIoutputbase\fR|\fIstdout\fR [options\&...] [configfile\&...]
.SH "DESCRIPTION"
.sp
tesseract(1) is a commercial quality OCR engine originally developed at HP between 1985 and 1995\&. In 1995, this engine was among the top 3 evaluated by UNLV\&. It was open\-sourced by HP and UNLV in 2005, and has been developed at Google since then\&.
.SH "IN/OUT ARGUMENTS"
.PP
\fIimagename\fR
.RS 4
The name of the input image\&. Most image file formats (anything readable by Leptonica) are supported\&.
.RE
.PP
\fIstdin\fR
.RS 4
Instruction to read data from standard input
.RE
.PP
\fIoutputbase\fR
.RS 4
The basename of the output file (to which the appropriate extension will be appended)\&. By default the output will be named
\fIoutbase\&.txt\fR\&.
.RE
.PP
\fIstdout\fR
.RS 4
Instruction to sent output data to standard output
.RE
.SH "OPTIONS"
.PP
\fI\-\-tessdata\-dir /path\fR
.RS 4
Specify the location of tessdata path
.RE
.PP
\fI\-\-user\-words /path/to/file\fR
.RS 4
Specify the location of user words file
.RE
.PP
\fI\-\-user\-patterns /path/to/file specify\fR
.RS 4
The location of user patterns file
.RE
.PP
\fI\-c configvar=value\fR
.RS 4
Set value for control parameter\&. Multiple \-c arguments are allowed\&.
.RE
.PP
\fI\-l lang\fR
.RS 4
The language to use\&. If none is specified, English is assumed\&. Multiple languages may be specified, separated by plus characters\&. Tesseract uses 3\-character ISO 639\-2 language codes\&. (See LANGUAGES)
.RE
.PP
\fI\-\-psm N\fR
.RS 4
Set Tesseract to only run a subset of layout analysis and assume a certain form of image\&. The options for
\fBN\fR
are:
.sp
.if n \{\
.RS 4
.\}
.nf
0 = Orientation and script detection (OSD) only\&.
1 = Automatic page segmentation with OSD\&.
2 = Automatic page segmentation, but no OSD, or OCR\&.
3 = Fully automatic page segmentation, but no OSD\&. (Default)
4 = Assume a single column of text of variable sizes\&.
5 = Assume a single uniform block of vertically aligned text\&.
6 = Assume a single uniform block of text\&.
7 = Treat the image as a single text line\&.
8 = Treat the image as a single word\&.
9 = Treat the image as a single word in a circle\&.
10 = Treat the image as a single character\&.
.fi
.if n \{\
.RE
.\}
.RE
.PP
\fI\-\-oem N\fR
.RS 4
Specify OCR Engine mode\&. The options for
\fBN\fR
are:
.sp
.if n \{\
.RS 4
.\}
.nf
0 = Original Tesseract only\&.
1 = Neural nets LSTM only\&.
2 = Tesseract + LSTM\&.
3 = Default, based on what is available\&.
.fi
.if n \{\
.RE
.\}
.RE
.PP
\fIconfigfile\fR
.RS 4
The name of a config to use\&. A config is a plaintext file which contains a list of variables and their values, one per line, with a space separating variable from value\&. Interesting config files include:
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
hocr \- Output in hOCR format instead of as a text file\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
pdf \- Output in pdf instead of a text file\&.
.RE
.RE
.sp
\fBNota Bene:\fR The options \fI\-l lang\fR and \fI\-\-psm N\fR must occur before any \fIconfigfile\fR\&.
.SH "SINGLE OPTIONS"
.PP
\fI\-h, \-\-help\fR
.RS 4
Show help message\&.
.RE
.PP
\fI\-\-help\-psm\fR
.RS 4
Show page segmentation modes\&.
.RE
.PP
\fI\-\-help\-oem\fR
.RS 4
Show OCR Engine modes\&.
.RE
.PP
\fI\-v, \-\-version\fR
.RS 4
Returns the current version of the tesseract(1) executable\&.
.RE
.PP
\fI\-\-list\-langs\fR
.RS 4
List available languages for tesseract engine\&. Can be used with \-\-tessdata\-dir\&.
.RE
.PP
\fI\-\-print\-parameters\fR
.RS 4
Print tesseract parameters\&.
.RE
.SH "LANGUAGES"
.sp
There are currently language packs available for the following languages (in \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tessdata\fR\m[]):
.sp
\fBafr\fR (Afrikaans) \fBamh\fR (Amharic) \fBara\fR (Arabic) \fBasm\fR (Assamese) \fBaze\fR (Azerbaijani) \fBaze_cyrl\fR (Azerbaijani \- Cyrilic) \fBbel\fR (Belarusian) \fBben\fR (Bengali) \fBbod\fR (Tibetan) \fBbos\fR (Bosnian) \fBbul\fR (Bulgarian) \fBcat\fR (Catalan; Valencian) \fBceb\fR (Cebuano) \fBces\fR (Czech) \fBchi_sim\fR (Chinese \- Simplified) \fBchi_tra\fR (Chinese \- Traditional) \fBchr\fR (Cherokee) \fBcym\fR (Welsh) \fBdan\fR (Danish) \fBdan_frak\fR (Danish \- Fraktur) \fBdeu\fR (German) \fBdeu_frak\fR (German \- Fraktur) \fBdzo\fR (Dzongkha) \fBell\fR (Greek, Modern (1453\-)) \fBeng\fR (English) \fBenm\fR (English, Middle (1100\-1500)) \fBepo\fR (Esperanto) \fBequ\fR (Math / equation detection module) \fBest\fR (Estonian) \fBeus\fR (Basque) \fBfas\fR (Persian) \fBfin\fR (Finnish) \fBfra\fR (French) \fBfrk\fR (Frankish) \fBfrm\fR (French, Middle (ca\&.1400\-1600)) \fBgle\fR (Irish) \fBglg\fR (Galician) \fBgrc\fR (Greek, Ancient (to 1453)) \fBguj\fR (Gujarati) \fBhat\fR (Haitian; Haitian Creole) \fBheb\fR (Hebrew) \fBhin\fR (Hindi) \fBhrv\fR (Croatian) \fBhun\fR (Hungarian) \fBiku\fR (Inuktitut) \fBind\fR (Indonesian) \fBisl\fR (Icelandic) \fBita\fR (Italian) \fBita_old\fR (Italian \- Old) \fBjav\fR (Javanese) \fBjpn\fR (Japanese) \fBkan\fR (Kannada) \fBkat\fR (Georgian) \fBkat_old\fR (Georgian \- Old) \fBkaz\fR (Kazakh) \fBkhm\fR (Central Khmer) \fBkir\fR (Kirghiz; Kyrgyz) \fBkor\fR (Korean) \fBkur\fR (Kurdish) \fBlao\fR (Lao) \fBlat\fR (Latin) \fBlav\fR (Latvian) \fBlit\fR (Lithuanian) \fBmal\fR (Malayalam) \fBmar\fR (Marathi) \fBmkd\fR (Macedonian) \fBmlt\fR (Maltese) \fBmsa\fR (Malay) \fBmya\fR (Burmese) \fBnep\fR (Nepali) \fBnld\fR (Dutch; Flemish) \fBnor\fR (Norwegian) \fBori\fR (Oriya) \fBosd\fR (Orientation and script detection module) \fBpan\fR (Panjabi; Punjabi) \fBpol\fR (Polish) \fBpor\fR (Portuguese) \fBpus\fR (Pushto; Pashto) \fBron\fR (Romanian; Moldavian; Moldovan) \fBrus\fR (Russian) \fBsan\fR (Sanskrit) \fBsin\fR (Sinhala; Sinhalese) \fBslk\fR (Slovak) \fBslk_frak\fR (Slovak \- Fraktur) \fBslv\fR (Slovenian) \fBspa\fR (Spanish; Castilian) \fBspa_old\fR (Spanish; Castilian \- Old) \fBsqi\fR (Albanian) \fBsrp\fR (Serbian) \fBsrp_latn\fR (Serbian \- Latin) \fBswa\fR (Swahili) \fBswe\fR (Swedish) \fBsyr\fR (Syriac) \fBtam\fR (Tamil) \fBtel\fR (Telugu) \fBtgk\fR (Tajik) \fBtgl\fR (Tagalog) \fBtha\fR (Thai) \fBtir\fR (Tigrinya) \fBtur\fR (Turkish) \fBuig\fR (Uighur; Uyghur) \fBukr\fR (Ukrainian) \fBurd\fR (Urdu) \fBuzb\fR (Uzbek) \fBuzb_cyrl\fR (Uzbek \- Cyrilic) \fBvie\fR (Vietnamese) \fByid\fR (Yiddish)
.sp
To use a non\-standard language pack named \fBfoo\&.traineddata\fR, set the \fBTESSDATA_PREFIX\fR environment variable so the file can be found at \fBTESSDATA_PREFIX\fR/tessdata/\fBfoo\fR\&.traineddata and give Tesseract the argument \fI\-l foo\fR\&.
.SH "CONFIG FILES AND AUGMENTING WITH USER DATA"
.sp
Tesseract config files consist of lines with variable\-value pairs (space separated)\&. The variables are documented as flags in the source code like the following one in tesseractclass\&.h:
.sp
STRING_VAR_H(tessedit_char_blacklist, "", "Blacklist of chars not to recognize");
.sp
These variables may enable or disable various features of the engine, and may cause it to load (or not load) various data\&. For instance, let\(cqs suppose you want to OCR in English, but suppress the normal dictionary and load an alternative word list and an alternative list of patterns \(em these two files are the most commonly used extra data files\&.
.sp
If your language pack is in /path/to/eng\&.traineddata and the hocr config is in /path/to/configs/hocr then create three new files:
.sp
/path/to/eng\&.user\-words:
.sp
.if n \{\
.RS 4
.\}
.nf
the
quick
brown
fox
jumped
.fi
.if n \{\
.RE
.\}
.sp
/path/to/eng\&.user\-patterns:
.sp
.if n \{\
.RS 4
.\}
.nf
1\-\ed\ed\ed\-GOOG\-411
www\&.\en\e\e\e*\&.com
.fi
.if n \{\
.RE
.\}
.sp
/path/to/configs/bazaar:
.sp
.if n \{\
.RS 4
.\}
.nf
load_system_dawg F
load_freq_dawg F
user_words_suffix user\-words
user_patterns_suffix user\-patterns
.fi
.if n \{\
.RE
.\}
.sp
Now, if you pass the word \fIbazaar\fR as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng\&.user\-words and eng\&.user\-patterns files you provided\&. The former is a simple word list, one per line\&. The format of the latter is documented in dict/trie\&.h on read_pattern_list()\&.
.SH "HISTORY"
.sp
The engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998\&. A lot of the code was written in C, and then some more was written in C++\&. The C++ code makes heavy use of a list system using macros\&. This predates stl, was portable before stl, and is more efficient than stl lists, but has the big negative that if you do get a segmentation violation, it is hard to debug\&.
.sp
Version 2\&.00 brought Unicode (UTF\-8) support, six languages, and the ability to train Tesseract\&.
.sp
Tesseract was included in UNLV\(cqs Fourth Annual Test of OCR Accuracy\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/docs/blob/master/AT\-1995\&.pdf\fR\m[]\&. With Tesseract 2\&.00, scripts are now included to allow anyone to reproduce some of these tests\&. See \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TestingTesseract\fR\m[] for more details\&.
.sp
Tesseract 3\&.00 adds a number of new languages, including Chinese, Japanese, and Korean\&. It also introduces a new, single\-file based system of managing language data\&.
.sp
Tesseract 3\&.02 adds BiDirectional text support, the ability to recognize multiple languages in a single image, and improved layout analysis\&.
.sp
For further details, see the file ReleaseNotes included with the distribution\&.
.SH "RESOURCES"
.sp
Main web site: \m[blue]\fBhttps://github\&.com/tesseract\-ocr\fR\m[] Information on training: \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "SEE ALSO"
.sp
ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), unicharset_extractor(1), wordlist2dawg(1)
.SH "AUTHOR"
.sp
Tesseract development was led at Hewlett\-Packard and Google by Ray Smith\&. The development team has included:
.sp
Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar\-Shyang Lee, David Eger, Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus\&.
.SH "COPYING"
.sp
Licensed under the Apache License, Version 2\&.0

File diff suppressed because it is too large Load Diff

View File

@ -1,468 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>TESSERACT(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>tesseract</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>tesseract</refname>
<refpurpose>command-line OCR engine</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">tesseract</emphasis> <emphasis>imagename</emphasis>|<emphasis>stdin</emphasis> <emphasis>outputbase</emphasis>|<emphasis>stdout</emphasis> [options&#8230;] [configfile&#8230;]</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>tesseract(1) is a commercial quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
at Google since then.</simpara>
</refsect1>
<refsect1 id="_in_out_arguments">
<title>IN/OUT ARGUMENTS</title>
<variablelist>
<varlistentry>
<term>
<emphasis>imagename</emphasis>
</term>
<listitem>
<simpara>
The name of the input image. Most image file formats (anything
readable by Leptonica) are supported.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>stdin</emphasis>
</term>
<listitem>
<simpara>
Instruction to read data from standard input
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>outputbase</emphasis>
</term>
<listitem>
<simpara>
The basename of the output file (to which the appropriate extension
will be appended). By default the output will be named <emphasis>outbase.txt</emphasis>.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>stdout</emphasis>
</term>
<listitem>
<simpara>
Instruction to sent output data to standard output
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>
<emphasis>--tessdata-dir /path</emphasis>
</term>
<listitem>
<simpara>
Specify the location of tessdata path
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--user-words /path/to/file</emphasis>
</term>
<listitem>
<simpara>
Specify the location of user words file
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--user-patterns /path/to/file specify</emphasis>
</term>
<listitem>
<simpara>
The location of user patterns file
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>-c configvar=value</emphasis>
</term>
<listitem>
<simpara>
Set value for control parameter. Multiple -c arguments are allowed.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>-l lang</emphasis>
</term>
<listitem>
<simpara>
The language to use. If none is specified, English is assumed.
Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--psm N</emphasis>
</term>
<listitem>
<simpara>
Set Tesseract to only run a subset of layout analysis and assume
a certain form of image. The options for <emphasis role="strong">N</emphasis> are:
</simpara>
<literallayout class="monospaced">0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--oem N</emphasis>
</term>
<listitem>
<simpara>
Specify OCR Engine mode. The options for <emphasis role="strong">N</emphasis> are:
</simpara>
<literallayout class="monospaced">0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.</literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>configfile</emphasis>
</term>
<listitem>
<simpara>
The name of a config to use. A config is a plaintext file which
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include:<?asciidoc-br?>
</simpara>
<itemizedlist>
<listitem>
<simpara>
hocr - Output in hOCR format instead of as a text file.
</simpara>
</listitem>
<listitem>
<simpara>
pdf - Output in pdf instead of a text file.
</simpara>
</listitem>
</itemizedlist>
</listitem>
</varlistentry>
</variablelist>
<simpara><emphasis role="strong">Nota Bene:</emphasis> The options <emphasis>-l lang</emphasis> and <emphasis>--psm N</emphasis> must occur
before any <emphasis>configfile</emphasis>.</simpara>
</refsect1>
<refsect1 id="_single_options">
<title>SINGLE OPTIONS</title>
<variablelist>
<varlistentry>
<term>
<emphasis>-h, --help</emphasis>
</term>
<listitem>
<simpara>
Show help message.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--help-psm</emphasis>
</term>
<listitem>
<simpara>
Show page segmentation modes.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--help-oem</emphasis>
</term>
<listitem>
<simpara>
Show OCR Engine modes.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>-v, --version</emphasis>
</term>
<listitem>
<simpara>
Returns the current version of the tesseract(1) executable.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--list-langs</emphasis>
</term>
<listitem>
<simpara>
List available languages for tesseract engine. Can be used with --tessdata-dir.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>--print-parameters</emphasis>
</term>
<listitem>
<simpara>
Print tesseract parameters.
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_languages">
<title>LANGUAGES</title>
<simpara>There are currently language packs available for the following languages
(in <ulink url="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</ulink>):</simpara>
<simpara><emphasis role="strong">afr</emphasis> (Afrikaans)
<emphasis role="strong">amh</emphasis> (Amharic)
<emphasis role="strong">ara</emphasis> (Arabic)
<emphasis role="strong">asm</emphasis> (Assamese)
<emphasis role="strong">aze</emphasis> (Azerbaijani)
<emphasis role="strong">aze_cyrl</emphasis> (Azerbaijani - Cyrilic)
<emphasis role="strong">bel</emphasis> (Belarusian)
<emphasis role="strong">ben</emphasis> (Bengali)
<emphasis role="strong">bod</emphasis> (Tibetan)
<emphasis role="strong">bos</emphasis> (Bosnian)
<emphasis role="strong">bul</emphasis> (Bulgarian)
<emphasis role="strong">cat</emphasis> (Catalan; Valencian)
<emphasis role="strong">ceb</emphasis> (Cebuano)
<emphasis role="strong">ces</emphasis> (Czech)
<emphasis role="strong">chi_sim</emphasis> (Chinese - Simplified)
<emphasis role="strong">chi_tra</emphasis> (Chinese - Traditional)
<emphasis role="strong">chr</emphasis> (Cherokee)
<emphasis role="strong">cym</emphasis> (Welsh)
<emphasis role="strong">dan</emphasis> (Danish)
<emphasis role="strong">dan_frak</emphasis> (Danish - Fraktur)
<emphasis role="strong">deu</emphasis> (German)
<emphasis role="strong">deu_frak</emphasis> (German - Fraktur)
<emphasis role="strong">dzo</emphasis> (Dzongkha)
<emphasis role="strong">ell</emphasis> (Greek, Modern (1453-))
<emphasis role="strong">eng</emphasis> (English)
<emphasis role="strong">enm</emphasis> (English, Middle (1100-1500))
<emphasis role="strong">epo</emphasis> (Esperanto)
<emphasis role="strong">equ</emphasis> (Math / equation detection module)
<emphasis role="strong">est</emphasis> (Estonian)
<emphasis role="strong">eus</emphasis> (Basque)
<emphasis role="strong">fas</emphasis> (Persian)
<emphasis role="strong">fin</emphasis> (Finnish)
<emphasis role="strong">fra</emphasis> (French)
<emphasis role="strong">frk</emphasis> (Frankish)
<emphasis role="strong">frm</emphasis> (French, Middle (ca.1400-1600))
<emphasis role="strong">gle</emphasis> (Irish)
<emphasis role="strong">glg</emphasis> (Galician)
<emphasis role="strong">grc</emphasis> (Greek, Ancient (to 1453))
<emphasis role="strong">guj</emphasis> (Gujarati)
<emphasis role="strong">hat</emphasis> (Haitian; Haitian Creole)
<emphasis role="strong">heb</emphasis> (Hebrew)
<emphasis role="strong">hin</emphasis> (Hindi)
<emphasis role="strong">hrv</emphasis> (Croatian)
<emphasis role="strong">hun</emphasis> (Hungarian)
<emphasis role="strong">iku</emphasis> (Inuktitut)
<emphasis role="strong">ind</emphasis> (Indonesian)
<emphasis role="strong">isl</emphasis> (Icelandic)
<emphasis role="strong">ita</emphasis> (Italian)
<emphasis role="strong">ita_old</emphasis> (Italian - Old)
<emphasis role="strong">jav</emphasis> (Javanese)
<emphasis role="strong">jpn</emphasis> (Japanese)
<emphasis role="strong">kan</emphasis> (Kannada)
<emphasis role="strong">kat</emphasis> (Georgian)
<emphasis role="strong">kat_old</emphasis> (Georgian - Old)
<emphasis role="strong">kaz</emphasis> (Kazakh)
<emphasis role="strong">khm</emphasis> (Central Khmer)
<emphasis role="strong">kir</emphasis> (Kirghiz; Kyrgyz)
<emphasis role="strong">kor</emphasis> (Korean)
<emphasis role="strong">kur</emphasis> (Kurdish)
<emphasis role="strong">lao</emphasis> (Lao)
<emphasis role="strong">lat</emphasis> (Latin)
<emphasis role="strong">lav</emphasis> (Latvian)
<emphasis role="strong">lit</emphasis> (Lithuanian)
<emphasis role="strong">mal</emphasis> (Malayalam)
<emphasis role="strong">mar</emphasis> (Marathi)
<emphasis role="strong">mkd</emphasis> (Macedonian)
<emphasis role="strong">mlt</emphasis> (Maltese)
<emphasis role="strong">msa</emphasis> (Malay)
<emphasis role="strong">mya</emphasis> (Burmese)
<emphasis role="strong">nep</emphasis> (Nepali)
<emphasis role="strong">nld</emphasis> (Dutch; Flemish)
<emphasis role="strong">nor</emphasis> (Norwegian)
<emphasis role="strong">ori</emphasis> (Oriya)
<emphasis role="strong">osd</emphasis> (Orientation and script detection module)
<emphasis role="strong">pan</emphasis> (Panjabi; Punjabi)
<emphasis role="strong">pol</emphasis> (Polish)
<emphasis role="strong">por</emphasis> (Portuguese)
<emphasis role="strong">pus</emphasis> (Pushto; Pashto)
<emphasis role="strong">ron</emphasis> (Romanian; Moldavian; Moldovan)
<emphasis role="strong">rus</emphasis> (Russian)
<emphasis role="strong">san</emphasis> (Sanskrit)
<emphasis role="strong">sin</emphasis> (Sinhala; Sinhalese)
<emphasis role="strong">slk</emphasis> (Slovak)
<emphasis role="strong">slk_frak</emphasis> (Slovak - Fraktur)
<emphasis role="strong">slv</emphasis> (Slovenian)
<emphasis role="strong">spa</emphasis> (Spanish; Castilian)
<emphasis role="strong">spa_old</emphasis> (Spanish; Castilian - Old)
<emphasis role="strong">sqi</emphasis> (Albanian)
<emphasis role="strong">srp</emphasis> (Serbian)
<emphasis role="strong">srp_latn</emphasis> (Serbian - Latin)
<emphasis role="strong">swa</emphasis> (Swahili)
<emphasis role="strong">swe</emphasis> (Swedish)
<emphasis role="strong">syr</emphasis> (Syriac)
<emphasis role="strong">tam</emphasis> (Tamil)
<emphasis role="strong">tel</emphasis> (Telugu)
<emphasis role="strong">tgk</emphasis> (Tajik)
<emphasis role="strong">tgl</emphasis> (Tagalog)
<emphasis role="strong">tha</emphasis> (Thai)
<emphasis role="strong">tir</emphasis> (Tigrinya)
<emphasis role="strong">tur</emphasis> (Turkish)
<emphasis role="strong">uig</emphasis> (Uighur; Uyghur)
<emphasis role="strong">ukr</emphasis> (Ukrainian)
<emphasis role="strong">urd</emphasis> (Urdu)
<emphasis role="strong">uzb</emphasis> (Uzbek)
<emphasis role="strong">uzb_cyrl</emphasis> (Uzbek - Cyrilic)
<emphasis role="strong">vie</emphasis> (Vietnamese)
<emphasis role="strong">yid</emphasis> (Yiddish)</simpara>
<simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the
<emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at
<emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the
argument <emphasis>-l foo</emphasis>.</simpara>
</refsect1>
<refsect1 id="_config_files_and_augmenting_with_user_data">
<title>CONFIG FILES AND AUGMENTING WITH USER DATA</title>
<simpara>Tesseract config files consist of lines with variable-value pairs (space
separated). The variables are documented as flags in the source code like
the following one in tesseractclass.h:</simpara>
<simpara>STRING_VAR_H(tessedit_char_blacklist, "",
"Blacklist of chars not to recognize");</simpara>
<simpara>These variables may enable or disable various features of the engine, and
may cause it to load (or not load) various data. For instance, let&#8217;s suppose
you want to OCR in English, but suppress the normal dictionary and load an
alternative word list and an alternative list of patterns&#8201;&#8212;&#8201;these two files
are the most commonly used extra data files.</simpara>
<simpara>If your language pack is in /path/to/eng.traineddata and the hocr config
is in /path/to/configs/hocr then create three new files:</simpara>
<simpara>/path/to/eng.user-words:</simpara>
<blockquote>
<literallayout>the
quick
brown
fox
jumped</literallayout>
</blockquote>
<simpara>/path/to/eng.user-patterns:</simpara>
<blockquote>
<literallayout>1-\d\d\d-GOOG-411
www.\n\\\*.com</literallayout>
</blockquote>
<simpara>/path/to/configs/bazaar:</simpara>
<blockquote>
<literallayout>load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns</literallayout>
</blockquote>
<simpara>Now, if you pass the word <emphasis>bazaar</emphasis> as a trailing command line parameter
to Tesseract, Tesseract will not bother loading the system dictionary nor
the dictionary of frequent words and will load and use the eng.user-words
and eng.user-patterns files you provided. The former is a simple word list,
one per line. The format of the latter is documented in dict/trie.h
on read_pattern_list().</simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>The engine was developed at Hewlett Packard Laboratories Bristol and at
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some C++izing in 1998. A
lot of the code was written in C, and then some more was written in C++.
The C++ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has
the big negative that if you do get a segmentation violation, it is hard to
debug.</simpara>
<simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</simpara>
<simpara>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <ulink url="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</ulink>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests.
See <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</ulink> for more
details.</simpara>
<simpara>Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing
language data.</simpara>
<simpara>Tesseract 3.02 adds BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis.</simpara>
<simpara>For further details, see the file ReleaseNotes included with the distribution.</simpara>
</refsect1>
<refsect1 id="_resources">
<title>RESOURCES</title>
<simpara>Main web site: <ulink url="https://github.com/tesseract-ocr">https://github.com/tesseract-ocr</ulink><?asciidoc-br?>
Information on training: <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
unicharset_extractor(1), wordlist2dawg(1)</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
The development team has included:</simpara>
<simpara>Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
Lloyd, Shobhit Saxena, and Thomas Kielbus.</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
</refentry>

View File

@ -1,120 +0,0 @@
'\" t
.\" Title: unicharambigs
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "UNICHARAMBIGS" "5" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharambigs \- Tesseract unicharset ambiguities
.SH "DESCRIPTION"
.sp
The unicharambigs file (a component of traineddata, see combine_tessdata(1) ) is used by Tesseract to represent possible ambiguities between characters, or groups of characters\&.
.sp
The file contains a number of lines, laid out as follow:
.sp
.if n \{\
.RS 4
.\}
.nf
[num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]
.fi
.if n \{\
.RE
.\}
.sp
.TS
tab(:);
lt lt
lt lt
lt lt
lt lt
lt lt.
T{
.sp
Field one
T}:T{
.sp
the number of characters contained in field two
T}
T{
.sp
Field two
T}:T{
.sp
the character sequence to be replaced
T}
T{
.sp
Field three
T}:T{
.sp
the number of characters contained in field four
T}
T{
.sp
Field four
T}:T{
.sp
the character sequence used to replace field two
T}
T{
.sp
Field five
T}:T{
.sp
contains either 1 or 0\&. 1 denotes a mandatory replacement, 0 denotes an optional replacement\&.
T}
.TE
.sp 1
.sp
Characters appearing in fields two and four should appear in unicharset\&. The numbers in fields one and three refer to the number of unichars (not bytes)\&.
.SH "EXAMPLE"
.sp
.if n \{\
.RS 4
.\}
.nf
2 \*(Aq \*(Aq 1 " 1
1 m 2 r n 0
3 i i i 1 m 0
.fi
.if n \{\
.RE
.\}
.sp
In this example, all instances of the \fI2\fR character sequence \fI\*(Aq\fR\*(Aq will \fBalways\fR be replaced by the \fI1\fR character sequence \fI"\fR; a \fI1\fR character sequence \fIm\fR \fBmay\fR be replaced by the \fI2\fR character sequence \fIrn\fR, and the \fI3\fR character sequence \fBmay\fR be replaced by the \fI1\fR character sequence \fIm\fR\&.
.SH "HISTORY"
.sp
The unicharambigs file first appeared in Tesseract 3\&.00; prior to that, a similar format, called DangAmbigs (\fIdangerous ambiguities\fR) was used: the format was almost identical, except only mandatory replacements could be specified, and field 5 was absent\&.
.SH "BUGS"
.sp
This is a documentation "bug": it\(cqs not currently clear what should be done in the case of ligatures (such as \fIfi\fR) which may also appear as regular letters in the unicharset\&.
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,875 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>UNICHARAMBIGS(5)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
UNICHARAMBIGS(5) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>unicharambigs -
Tesseract unicharset ambiguities
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.</p></div>
<div class="paragraph"><p>The file contains a number of lines, laid out as follow:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>[num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num]</code></pre>
</div></div>
<div class="hdlist"><table>
<tr>
<td class="hdlist1">
Field one
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the number of characters contained in field two
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field two
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the character sequence to be replaced
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field three
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the number of characters contained in field four
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field four
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
the character sequence used to replace field two
</p>
</td>
</tr>
<tr>
<td class="hdlist1">
Field five
<br />
</td>
<td class="hdlist2">
<p style="margin-top: 0;">
contains either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement.
</p>
</td>
</tr>
</table></div>
<div class="paragraph"><p>Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the
number of unichars (not bytes).</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_example">EXAMPLE</h2>
<div class="sectionbody">
<div class="literalblock">
<div class="content">
<pre><code>2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0</code></pre>
</div></div>
<div class="paragraph"><p>In this example, all instances of the <em>2</em> character sequence <em>'</em>' will
<strong>always</strong> be replaced by the <em>1</em> character sequence <em>"</em>; a <em>1</em> character
sequence <em>m</em> <strong>may</strong> be replaced by the <em>2</em> character sequence <em>rn</em>, and
the <em>3</em> character sequence <strong>may</strong> be replaced by the <em>1</em> character
sequence <em>m</em>.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
<div class="paragraph"><p>The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs (<em>dangerous ambiguities</em>) was used: the
format was almost identical, except only mandatory replacements could be
specified, and field 5 was absent.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_bugs">BUGS</h2>
<div class="sectionbody">
<div class="paragraph"><p>This is a documentation "bug": it&#8217;s not currently clear what should be done
in the case of ligatures (such as <em>fi</em>) which may also appear as regular
letters in the unicharset.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), unicharset(5)</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-05-13 19:59:45 CEST
</div>
</div>
</body>
</html>

View File

@ -1,126 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>UNICHARAMBIGS(5)</title>
</refentryinfo>
<refmeta>
<refentrytitle>unicharambigs</refentrytitle>
<manvolnum>5</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>unicharambigs</refname>
<refpurpose>Tesseract unicharset ambiguities</refpurpose>
</refnamediv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>The unicharambigs file (a component of traineddata, see combine_tessdata(1) )
is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.</simpara>
<simpara>The file contains a number of lines, laid out as follow:</simpara>
<literallayout class="monospaced">[num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num]</literallayout>
<informaltable tabstyle="horizontal" frame="none" colsep="0" rowsep="0"><tgroup cols="2"><colspec colwidth="15*"/><colspec colwidth="85*"/><tbody valign="top">
<row>
<entry>
<simpara>
Field one
</simpara>
</entry>
<entry>
<simpara>
the number of characters contained in field two
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
Field two
</simpara>
</entry>
<entry>
<simpara>
the character sequence to be replaced
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
Field three
</simpara>
</entry>
<entry>
<simpara>
the number of characters contained in field four
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
Field four
</simpara>
</entry>
<entry>
<simpara>
the character sequence used to replace field two
</simpara>
</entry>
</row>
<row>
<entry>
<simpara>
Field five
</simpara>
</entry>
<entry>
<simpara>
contains either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement.
</simpara>
</entry>
</row>
</tbody></tgroup></informaltable>
<simpara>Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the
number of unichars (not bytes).</simpara>
</refsect1>
<refsect1 id="_example">
<title>EXAMPLE</title>
<literallayout class="monospaced">2 ' ' 1 " 1
1 m 2 r n 0
3 i i i 1 m 0</literallayout>
<simpara>In this example, all instances of the <emphasis>2</emphasis> character sequence <emphasis>'</emphasis>' will
<emphasis role="strong">always</emphasis> be replaced by the <emphasis>1</emphasis> character sequence <emphasis>"</emphasis>; a <emphasis>1</emphasis> character
sequence <emphasis>m</emphasis> <emphasis role="strong">may</emphasis> be replaced by the <emphasis>2</emphasis> character sequence <emphasis>rn</emphasis>, and
the <emphasis>3</emphasis> character sequence <emphasis role="strong">may</emphasis> be replaced by the <emphasis>1</emphasis> character
sequence <emphasis>m</emphasis>.</simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs (<emphasis>dangerous ambiguities</emphasis>) was used: the
format was almost identical, except only mandatory replacements could be
specified, and field 5 was absent.</simpara>
</refsect1>
<refsect1 id="_bugs">
<title>BUGS</title>
<simpara>This is a documentation "bug": it&#8217;s not currently clear what should be done
in the case of ligatures (such as <emphasis>fi</emphasis>) which may also appear as regular
letters in the unicharset.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,220 +0,0 @@
'\" t
.\" Title: unicharset
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "UNICHARSET" "5" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharset \- character properties file used by tesseract(1)
.SH "DESCRIPTION"
.sp
Tesseract\(cqs unicharset file contains information on each symbol (unichar) the Tesseract OCR engine is trained to recognize\&.
.sp
A unicharset file (i\&.e\&. \fIeng\&.unicharset\fR) is distributed as part of a Tesseract language pack (i\&.e\&. \fIeng\&.traineddata\fR)\&. For information on extracting the unicharset file, see combine_tessdata(1)\&.
.sp
The first line of a unicharset file contains the number of unichars in the file\&. After this line, each subsequent line provides information for a single unichar\&. The first such line contains a placeholder reserved for the space character\&. Each unichar is referred to within Tesseract by its Unichar ID, which is the line number (minus 1) within the unicharset file\&. Therefore, space gets unichar 0\&.
.sp
Each unichar line in the unicharset file (v2+) may have four space\-separated fields:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqscript\*(Aq \*(Aqid\*(Aq
.fi
.if n \{\
.RE
.\}
.sp
Starting with Tesseract v3\&.02, more information may be given for each unichar:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqcharacter\*(Aq \*(Aqproperties\*(Aq \*(Aqglyph_metrics\*(Aq \*(Aqscript\*(Aq \*(Aqother_case\*(Aq \*(Aqdirection\*(Aq \*(Aqmirror\*(Aq \*(Aqnormed_form\*(Aq
.fi
.if n \{\
.RE
.\}
.sp
Entries:
.PP
\fIcharacter\fR
.RS 4
The UTF\-8 encoded string to be produced for this unichar\&.
.RE
.PP
\fIproperties\fR
.RS 4
An integer mask of character properties, one per bit\&. From least to most significant bit, these are: isalpha, islower, isupper, isdigit, ispunctuation\&.
.RE
.PP
\fIglyph_metrics\fR
.RS 4
Ten comma\-separated integers representing various standards for where this glyph is to be found within a baseline\-normalized coordinate system where 128 is normalized to x\-height\&.
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bottom, max_bottom: the ranges where the bottom of the character can be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_top, max_top: the ranges where the top of the character may be found\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_width, max_width: horizontal width of the character\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_bearing, max_bearing: how far from the usual start position does the leftmost part of the character begin\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
min_advance, max_advance: how far from the printer\(cqs cell left do we advance to begin the next character\&.
.RE
.RE
.PP
\fIscript\fR
.RS 4
Name of the script (Latin, Common, Greek, Cyrillic, Han, null)\&.
.RE
.PP
\fIother_case\fR
.RS 4
The Unichar ID of the other case version of this character (upper or lower)\&.
.RE
.PP
\fIdirection\fR
.RS 4
The Unicode BiDi direction of this character, as defined by ICU\(cqs enum UCharDirection\&. (0 = Left to Right, 1 = Right to Left, 2 = European Number\&...)
.RE
.PP
\fImirror\fR
.RS 4
The Unichar ID of the BiDirectional mirror of this character\&. For example the mirror of open paren is close paren, but Latin Capital C has no mirror, so it remains a Latin Capital C\&.
.RE
.PP
\fInormed_form\fR
.RS 4
The UTF\-8 representation of a "normalized form" of this unichar for the purpose of blaming a module for errors given ground truth text\&. For instance, a left or right single quote may normalize to an ASCII quote\&.
.RE
.SH "EXAMPLE (V2)"
.sp
.if n \{\
.RS 4
.\}
.nf
; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93
.fi
.if n \{\
.RE
.\}
.sp
";" is a punctuation character\&. Its properties are thus represented by the binary number 10000 (10 in hexadecimal)\&.
.sp
"b" is an alphabetic character and a lower case character\&. Its properties are thus represented by the binary number 00011 (3 in hexadecimal)\&.
.sp
"W" is an alphabetic character and an upper case character\&. Its properties are thus represented by the binary number 00101 (5 in hexadecimal)\&.
.sp
"7" is just a digit\&. Its properties are thus represented by the binary number 01000 (8 in hexadecimal)\&.
.sp
"=" is not punctuation nor a digit nor an alphabetic character\&. Its properties are thus represented by the binary number 00000 (0 in hexadecimal)\&.
.sp
Japanese or Chinese alphabetic character properties are represented by the binary number 00001 (1 in hexadecimal): they are alphabetic, but neither upper nor lower case\&.
.SH "EXAMPLE (V3.02)"
.sp
.if n \{\
.RS 4
.\}
.nf
110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
\&. \&. \&.
.fi
.if n \{\
.RE
.\}
.SH "CAVEATS"
.sp
Although the unicharset reader maintains the ability to read unicharsets of older formats and will assign default values to missing fields, the accuracy will be degraded\&.
.sp
Further, most other data files are indexed by the unicharset file, so changing it without re\-generating the others is likely to have dire consequences\&.
.SH "HISTORY"
.sp
The unicharset format first appeared with Tesseract 2\&.00, which was the first version to support languages other than English\&. The unicharset file contained only the first two fields, and the "ispunctuation" property was absent (punctuation was regarded as "0", as "=" is in the above example\&.
.SH "SEE ALSO"
.sp
tesseract(1), combine_tessdata(1), unicharset_extractor(1)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,965 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>UNICHARSET(5)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
UNICHARSET(5) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>unicharset -
character properties file used by tesseract(1)
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>Tesseract&#8217;s unicharset file contains information on each symbol
(unichar) the Tesseract OCR engine is trained to recognize.</p></div>
<div class="paragraph"><p>A unicharset file (i.e. <em>eng.unicharset</em>) is distributed as part of a
Tesseract language pack (i.e. <em>eng.traineddata</em>). For information on
extracting the unicharset file, see combine_tessdata(1).</p></div>
<div class="paragraph"><p>The first line of a unicharset file contains the number of unichars in
the file. After this line, each subsequent line provides information for
a single unichar. The first such line contains a placeholder reserved for
the space character. Each unichar is referred to within Tesseract by its
Unichar ID, which is the line number (minus 1) within the unicharset file.
Therefore, space gets unichar 0.</p></div>
<div class="paragraph"><p>Each unichar line in the unicharset file (v2+) may have four space-separated fields:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>'character' 'properties' 'script' 'id'</code></pre>
</div></div>
<div class="paragraph"><p>Starting with Tesseract v3.02, more information may be given for each unichar:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'</code></pre>
</div></div>
<div class="paragraph"><p>Entries:</p></div>
<div class="dlist"><dl>
<dt class="hdlist1">
<em>character</em>
</dt>
<dd>
<p>
The UTF-8 encoded string to be produced for this unichar.
</p>
</dd>
<dt class="hdlist1">
<em>properties</em>
</dt>
<dd>
<p>
An integer mask of character properties, one per bit.
From least to most significant bit, these are: isalpha, islower, isupper,
isdigit, ispunctuation.
</p>
</dd>
<dt class="hdlist1">
<em>glyph_metrics</em>
</dt>
<dd>
<p>
Ten comma-separated integers representing various standards
for where this glyph is to be found within a baseline-normalized coordinate
system where 128 is normalized to x-height.
</p>
<div class="ulist"><ul>
<li>
<p>
min_bottom, max_bottom: the ranges where the bottom of the character can
be found.
</p>
</li>
<li>
<p>
min_top, max_top: the ranges where the top of the character may be found.
</p>
</li>
<li>
<p>
min_width, max_width: horizontal width of the character.
</p>
</li>
<li>
<p>
min_bearing, max_bearing: how far from the usual start position does the
leftmost part of the character begin.
</p>
</li>
<li>
<p>
min_advance, max_advance: how far from the printer&#8217;s cell left do we
advance to begin the next character.
</p>
</li>
</ul></div>
</dd>
<dt class="hdlist1">
<em>script</em>
</dt>
<dd>
<p>
Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
</p>
</dd>
<dt class="hdlist1">
<em>other_case</em>
</dt>
<dd>
<p>
The Unichar ID of the other case version of this character
(upper or lower).
</p>
</dd>
<dt class="hdlist1">
<em>direction</em>
</dt>
<dd>
<p>
The Unicode BiDi direction of this character, as defined by
ICU&#8217;s enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
2 = European Number&#8230;)
</p>
</dd>
<dt class="hdlist1">
<em>mirror</em>
</dt>
<dd>
<p>
The Unichar ID of the BiDirectional mirror of this character.
For example the mirror of open paren is close paren, but Latin Capital C
has no mirror, so it remains a Latin Capital C.
</p>
</dd>
<dt class="hdlist1">
<em>normed_form</em>
</dt>
<dd>
<p>
The UTF-8 representation of a "normalized form" of this unichar
for the purpose of blaming a module for errors given ground truth text.
For instance, a left or right single quote may normalize to an ASCII quote.
</p>
</dd>
</dl></div>
</div>
</div>
<div class="sect1">
<h2 id="_example_v2">EXAMPLE (v2)</h2>
<div class="sectionbody">
<div class="literalblock">
<div class="content">
<pre><code>; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93</code></pre>
</div></div>
<div class="paragraph"><p>";" is a punctuation character. Its properties are thus represented by the
binary number 10000 (10 in hexadecimal).</p></div>
<div class="paragraph"><p>"b" is an alphabetic character and a lower case character. Its properties are
thus represented by the binary number 00011 (3 in hexadecimal).</p></div>
<div class="paragraph"><p>"W" is an alphabetic character and an upper case character. Its properties are
thus represented by the binary number 00101 (5 in hexadecimal).</p></div>
<div class="paragraph"><p>"7" is just a digit. Its properties are thus represented by the binary number
01000 (8 in hexadecimal).</p></div>
<div class="paragraph"><p>"=" is not punctuation nor a digit nor an alphabetic character. Its properties
are thus represented by the binary number 00000 (0 in hexadecimal).</p></div>
<div class="paragraph"><p>Japanese or Chinese alphabetic character properties are represented by the
binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
upper nor lower case.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_example_v3_02">EXAMPLE (v3.02)</h2>
<div class="sectionbody">
<div class="literalblock">
<div class="content">
<pre><code>110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
. . .</code></pre>
</div></div>
</div>
</div>
<div class="sect1">
<h2 id="_caveats">CAVEATS</h2>
<div class="sectionbody">
<div class="paragraph"><p>Although the unicharset reader maintains the ability to read unicharsets
of older formats and will assign default values to missing fields,
the accuracy will be degraded.</p></div>
<div class="paragraph"><p>Further, most other data files are indexed by the unicharset file,
so changing it without re-generating the others is likely to have dire
consequences.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
<div class="paragraph"><p>The unicharset format first appeared with Tesseract 2.00, which was the
first version to support languages other than English. The unicharset file
contained only the first two fields, and the "ispunctuation" property was
absent (punctuation was regarded as "0", as "=" is in the above example.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), combine_tessdata(1), unicharset_extractor(1)</p></div>
<div class="paragraph"><p><a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:52:34 CEST
</div>
</div>
</body>
</html>

View File

@ -1,219 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>UNICHARSET(5)</title>
</refentryinfo>
<refmeta>
<refentrytitle>unicharset</refentrytitle>
<manvolnum>5</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>unicharset</refname>
<refpurpose>character properties file used by tesseract(1)</refpurpose>
</refnamediv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>Tesseract&#8217;s unicharset file contains information on each symbol
(unichar) the Tesseract OCR engine is trained to recognize.</simpara>
<simpara>A unicharset file (i.e. <emphasis>eng.unicharset</emphasis>) is distributed as part of a
Tesseract language pack (i.e. <emphasis>eng.traineddata</emphasis>). For information on
extracting the unicharset file, see combine_tessdata(1).</simpara>
<simpara>The first line of a unicharset file contains the number of unichars in
the file. After this line, each subsequent line provides information for
a single unichar. The first such line contains a placeholder reserved for
the space character. Each unichar is referred to within Tesseract by its
Unichar ID, which is the line number (minus 1) within the unicharset file.
Therefore, space gets unichar 0.</simpara>
<simpara>Each unichar line in the unicharset file (v2+) may have four space-separated fields:</simpara>
<literallayout class="monospaced">'character' 'properties' 'script' 'id'</literallayout>
<simpara>Starting with Tesseract v3.02, more information may be given for each unichar:</simpara>
<literallayout class="monospaced">'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'</literallayout>
<simpara>Entries:</simpara>
<variablelist>
<varlistentry>
<term>
<emphasis>character</emphasis>
</term>
<listitem>
<simpara>
The UTF-8 encoded string to be produced for this unichar.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>properties</emphasis>
</term>
<listitem>
<simpara>
An integer mask of character properties, one per bit.
From least to most significant bit, these are: isalpha, islower, isupper,
isdigit, ispunctuation.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>glyph_metrics</emphasis>
</term>
<listitem>
<simpara>
Ten comma-separated integers representing various standards
for where this glyph is to be found within a baseline-normalized coordinate
system where 128 is normalized to x-height.
</simpara>
<itemizedlist>
<listitem>
<simpara>
min_bottom, max_bottom: the ranges where the bottom of the character can
be found.
</simpara>
</listitem>
<listitem>
<simpara>
min_top, max_top: the ranges where the top of the character may be found.
</simpara>
</listitem>
<listitem>
<simpara>
min_width, max_width: horizontal width of the character.
</simpara>
</listitem>
<listitem>
<simpara>
min_bearing, max_bearing: how far from the usual start position does the
leftmost part of the character begin.
</simpara>
</listitem>
<listitem>
<simpara>
min_advance, max_advance: how far from the printer&#8217;s cell left do we
advance to begin the next character.
</simpara>
</listitem>
</itemizedlist>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>script</emphasis>
</term>
<listitem>
<simpara>
Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>other_case</emphasis>
</term>
<listitem>
<simpara>
The Unichar ID of the other case version of this character
(upper or lower).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>direction</emphasis>
</term>
<listitem>
<simpara>
The Unicode BiDi direction of this character, as defined by
ICU&#8217;s enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
2 = European Number&#8230;)
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>mirror</emphasis>
</term>
<listitem>
<simpara>
The Unichar ID of the BiDirectional mirror of this character.
For example the mirror of open paren is close paren, but Latin Capital C
has no mirror, so it remains a Latin Capital C.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
<emphasis>normed_form</emphasis>
</term>
<listitem>
<simpara>
The UTF-8 representation of a "normalized form" of this unichar
for the purpose of blaming a module for errors given ground truth text.
For instance, a left or right single quote may normalize to an ASCII quote.
</simpara>
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1 id="_example_v2">
<title>EXAMPLE (v2)</title>
<literallayout class="monospaced">; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93</literallayout>
<simpara>";" is a punctuation character. Its properties are thus represented by the
binary number 10000 (10 in hexadecimal).</simpara>
<simpara>"b" is an alphabetic character and a lower case character. Its properties are
thus represented by the binary number 00011 (3 in hexadecimal).</simpara>
<simpara>"W" is an alphabetic character and an upper case character. Its properties are
thus represented by the binary number 00101 (5 in hexadecimal).</simpara>
<simpara>"7" is just a digit. Its properties are thus represented by the binary number
01000 (8 in hexadecimal).</simpara>
<simpara>"=" is not punctuation nor a digit nor an alphabetic character. Its properties
are thus represented by the binary number 00000 (0 in hexadecimal).</simpara>
<simpara>Japanese or Chinese alphabetic character properties are represented by the
binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
upper nor lower case.</simpara>
</refsect1>
<refsect1 id="_example_v3_02">
<title>EXAMPLE (v3.02)</title>
<literallayout class="monospaced">110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
. . .</literallayout>
</refsect1>
<refsect1 id="_caveats">
<title>CAVEATS</title>
<simpara>Although the unicharset reader maintains the ability to read unicharsets
of older formats and will assign default values to missing fields,
the accuracy will be degraded.</simpara>
<simpara>Further, most other data files are indexed by the unicharset file,
so changing it without re-generating the others is likely to have dire
consequences.</simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>The unicharset format first appeared with Tesseract 2.00, which was the
first version to support languages other than English. The unicharset file
contained only the first two fields, and the "ispunctuation" property was
absent (punctuation was regarded as "0", as "=" is in the above example.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), combine_tessdata(1), unicharset_extractor(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,69 +0,0 @@
'\" t
.\" Title: unicharset_extractor
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "UNICHARSET_EXTRACTOR" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharset_extractor \- extract unicharset from Tesseract boxfiles
.SH "SYNOPSIS"
.sp
\fBunicharset_extractor\fR \fI[\-D dir]\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
Tesseract needs to know the set of possible characters it can output\&. To generate the unicharset data file, use the unicharset_extractor program on the same training pages bounding box files as used for clustering:
.sp
.if n \{\
.RS 4
.\}
.nf
unicharset_extractor fontfile_1\&.box fontfile_2\&.box \&.\&.\&.
.fi
.if n \{\
.RE
.\}
.sp
The unicharset will be put into the file \fIdir/unicharset\fR, or simply \fI\&./unicharset\fR if no output directory is provided\&.
.sp
Tesseract also needs to have access to character properties isalpha, isdigit, isupper, islower, ispunctuation\&. all of this auxilury data and more is encoded in this file\&. (See unicharset(5))
.sp
If your system supports the wctype functions, these values will be set automatically by unicharset_extractor and there is no need to edit the unicharset file\&. On some older systems (eg Windows 95), the unicharset file must be edited by hand to add these property description codes\&.
.sp
\fBNOTE\fR The unicharset file must be regenerated whenever inttemp, normproto and pffmtable are generated (i\&.e\&. they must all be recreated when the box file is changed) as they have to be in sync\&. This is made easier than in previous versions by running unicharset_extractor before mftraining and cntraining, and giving the unicharset to mftraining\&.
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "HISTORY"
.sp
unicharset_extractor first appeared in Tesseract 2\&.00\&.
.SH "COPYING"
.sp
Copyright (C) 2006, Google Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,815 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>UNICHARSET_EXTRACTOR(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
UNICHARSET_EXTRACTOR(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>unicharset_extractor -
extract unicharset from Tesseract boxfiles
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>unicharset_extractor</strong> <em>[-D dir]</em> <em>FILE</em>&#8230;</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, use the unicharset_extractor
program on the same training pages bounding box files as used for
clustering:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>unicharset_extractor fontfile_1.box fontfile_2.box ...</code></pre>
</div></div>
<div class="paragraph"><p>The unicharset will be put into the file <em>dir/unicharset</em>, or simply
<em>./unicharset</em> if no output directory is provided.</p></div>
<div class="paragraph"><p>Tesseract also needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. all of this auxilury data
and more is encoded in this file. (See unicharset(5))</p></div>
<div class="paragraph"><p>If your system supports the wctype functions, these values will be set
automatically by unicharset_extractor and there is no need to edit the
unicharset file. On some older systems (eg Windows 95), the unicharset
file must be edited by hand to add these property description codes.</p></div>
<div class="paragraph"><p><strong>NOTE</strong> The unicharset file must be regenerated whenever inttemp, normproto
and pffmtable are generated (i.e. they must all be recreated when the box
file is changed) as they have to be in sync. This is made easier than in
previous versions by running unicharset_extractor before mftraining and
cntraining, and giving the unicharset to mftraining.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), unicharset(5)</p></div>
<div class="paragraph"><p><a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_history">HISTORY</h2>
<div class="sectionbody">
<div class="paragraph"><p>unicharset_extractor first appeared in Tesseract 2.00.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:52:38 CEST
</div>
</div>
</body>
</html>

View File

@ -1,63 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>UNICHARSET_EXTRACTOR(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>unicharset_extractor</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>unicharset_extractor</refname>
<refpurpose>extract unicharset from Tesseract boxfiles</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">unicharset_extractor</emphasis> <emphasis>[-D dir]</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, use the unicharset_extractor
program on the same training pages bounding box files as used for
clustering:</simpara>
<literallayout class="monospaced">unicharset_extractor fontfile_1.box fontfile_2.box ...</literallayout>
<simpara>The unicharset will be put into the file <emphasis>dir/unicharset</emphasis>, or simply
<emphasis>./unicharset</emphasis> if no output directory is provided.</simpara>
<simpara>Tesseract also needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. all of this auxilury data
and more is encoded in this file. (See unicharset(5))</simpara>
<simpara>If your system supports the wctype functions, these values will be set
automatically by unicharset_extractor and there is no need to edit the
unicharset file. On some older systems (eg Windows 95), the unicharset
file must be edited by hand to add these property description codes.</simpara>
<simpara><emphasis role="strong">NOTE</emphasis> The unicharset file must be regenerated whenever inttemp, normproto
and pffmtable are generated (i.e. they must all be recreated when the box
file is changed) as they have to be in sync. This is made easier than in
previous versions by running unicharset_extractor before mftraining and
cntraining, and giving the unicharset to mftraining.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>unicharset_extractor first appeared in Tesseract 2.00.</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>

View File

@ -1,72 +0,0 @@
'\" t
.\" Title: wordlist2dawg
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "WORDLIST2DAWG" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
wordlist2dawg \- convert a wordlist to a DAWG for Tesseract
.SH "SYNOPSIS"
.sp
\fBwordlist2dawg\fR \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-t \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-r 1 \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-r 2 \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-l <short> <long> \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.SH "DESCRIPTION"
.sp
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract\&. A DAWG is a compressed, space and time efficient representation of a word list\&.
.SH "OPTIONS"
.sp
\-t Verify that a given dawg file is equivalent to a given wordlist\&.
.sp
\-r 1 Reverse a word if it contains an RTL character\&.
.sp
\-r 2 Reverse all words\&.
.sp
\-l <short> <long> Produce a file with several dawgs in it, one each for words of length <short>, <short+1>,\&... <long>
.SH "ARGUMENTS"
.sp
\fIWORDLIST\fR A plain text file in UTF\-8, one word per line\&.
.sp
\fIDAWG\fR The output DAWG to write\&.
.sp
\fIlang\&.unicharset\fR The unicharset of the language\&. This is the unicharset generated by mftraining(1)\&.
.SH "SEE ALSO"
.sp
tesseract(1), combine_tessdata(1), dawg2wordlist(1)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "COPYING"
.sp
Copyright (C) 2006 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.

View File

@ -1,820 +0,0 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.9" />
<title>WORDLIST2DAWG(1)</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */
/* Default font. */
body {
font-family: Georgia,serif;
}
/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
font-family: Arial,Helvetica,sans-serif;
}
body {
margin: 1em 5% 1em 5%;
}
a {
color: blue;
text-decoration: underline;
}
a:visited {
color: fuchsia;
}
em {
font-style: italic;
color: navy;
}
strong {
font-weight: bold;
color: #083194;
}
h1, h2, h3, h4, h5, h6 {
color: #527bbd;
margin-top: 1.2em;
margin-bottom: 0.5em;
line-height: 1.3;
}
h1, h2, h3 {
border-bottom: 2px solid silver;
}
h2 {
padding-top: 0.5em;
}
h3 {
float: left;
}
h3 + * {
clear: left;
}
h5 {
font-size: 1.0em;
}
div.sectionbody {
margin-left: 0;
}
hr {
border: 1px solid silver;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul, ol, li > p {
margin-top: 0;
}
ul > li { color: #aaa; }
ul > li > * { color: black; }
.monospaced, code, pre {
font-family: "Courier New", Courier, monospace;
font-size: inherit;
color: navy;
padding: 0;
margin: 0;
}
pre {
white-space: pre-wrap;
}
#author {
color: #527bbd;
font-weight: bold;
font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}
#footer {
font-size: small;
border-top: 2px solid silver;
padding-top: 0.5em;
margin-top: 4.0em;
}
#footer-text {
float: left;
padding-bottom: 0.5em;
}
#footer-badges {
float: right;
padding-bottom: 0.5em;
}
#preamble {
margin-top: 1.5em;
margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.admonitionblock {
margin-top: 2.0em;
margin-bottom: 2.0em;
margin-right: 10%;
color: #606060;
}
div.content { /* Block element content. */
padding: 0;
}
/* Block element titles. */
div.title, caption.title {
color: #527bbd;
font-weight: bold;
text-align: left;
margin-top: 1.0em;
margin-bottom: 0.5em;
}
div.title + * {
margin-top: 0;
}
td div.title:first-child {
margin-top: 0.0em;
}
div.content div.title:first-child {
margin-top: 0.0em;
}
div.content + div.title {
margin-top: 0.0em;
}
div.sidebarblock > div.content {
background: #ffffee;
border: 1px solid #dddddd;
border-left: 4px solid #f0f0f0;
padding: 0.5em;
}
div.listingblock > div.content {
border: 1px solid #dddddd;
border-left: 5px solid #f0f0f0;
background: #f8f8f8;
padding: 0.5em;
}
div.quoteblock, div.verseblock {
padding-left: 1.0em;
margin-left: 1.0em;
margin-right: 10%;
border-left: 5px solid #f0f0f0;
color: #888;
}
div.quoteblock > div.attribution {
padding-top: 0.5em;
text-align: right;
}
div.verseblock > pre.content {
font-family: inherit;
font-size: inherit;
}
div.verseblock > div.attribution {
padding-top: 0.75em;
text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
text-align: left;
}
div.admonitionblock .icon {
vertical-align: top;
font-size: 1.1em;
font-weight: bold;
text-decoration: underline;
color: #527bbd;
padding-right: 0.5em;
}
div.admonitionblock td.content {
padding-left: 0.5em;
border-left: 3px solid #dddddd;
}
div.exampleblock > div.content {
border-left: 3px solid #dddddd;
padding-left: 0.5em;
}
div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; vertical-align: text-bottom; }
a.image:visited { color: white; }
dl {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
dt {
margin-top: 0.5em;
margin-bottom: 0;
font-style: normal;
color: navy;
}
dd > *:first-child {
margin-top: 0.1em;
}
ul, ol {
list-style-position: outside;
}
ol.arabic {
list-style-type: decimal;
}
ol.loweralpha {
list-style-type: lower-alpha;
}
ol.upperalpha {
list-style-type: upper-alpha;
}
ol.lowerroman {
list-style-type: lower-roman;
}
ol.upperroman {
list-style-type: upper-roman;
}
div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
margin-top: 0.1em;
margin-bottom: 0.1em;
}
tfoot {
font-weight: bold;
}
td > div.verse {
white-space: pre;
}
div.hdlist {
margin-top: 0.8em;
margin-bottom: 0.8em;
}
div.hdlist tr {
padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
font-weight: bold;
}
td.hdlist1 {
vertical-align: top;
font-style: normal;
padding-right: 0.8em;
color: navy;
}
td.hdlist2 {
vertical-align: top;
}
div.hdlist.compact tr {
margin: 0;
padding-bottom: 0;
}
.comment {
background: yellow;
}
.footnote, .footnoteref {
font-size: 0.8em;
}
span.footnote, span.footnoteref {
vertical-align: super;
}
#footnotes {
margin: 20px 0 20px 0;
padding: 7px 0 0 0;
}
#footnotes div.footnote {
margin: 0 0 5px 0;
}
#footnotes hr {
border: none;
border-top: 1px solid silver;
height: 1px;
text-align: left;
margin-left: 0;
width: 20%;
min-width: 100px;
}
div.colist td {
padding-right: 0.5em;
padding-bottom: 0.3em;
vertical-align: top;
}
div.colist td img {
margin-top: 0.3em;
}
@media print {
#footer-badges { display: none; }
}
#toc {
margin-bottom: 2.5em;
}
#toctitle {
color: #527bbd;
font-size: 1.1em;
font-weight: bold;
margin-top: 1.0em;
margin-bottom: 0.1em;
}
div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
margin-top: 0;
margin-bottom: 0;
}
div.toclevel2 {
margin-left: 2em;
font-size: 0.9em;
}
div.toclevel3 {
margin-left: 4em;
font-size: 0.9em;
}
div.toclevel4 {
margin-left: 6em;
font-size: 0.9em;
}
span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }
span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }
span.big { font-size: 2em; }
span.small { font-size: 0.6em; }
span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }
div.unbreakable { page-break-inside: avoid; }
/*
* xhtml11 specific
*
* */
div.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
div.tableblock > table {
border: 3px solid #527bbd;
}
thead, p.table.header {
font-weight: bold;
color: #527bbd;
}
p.table {
margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
border-style: none;
}
div.tableblock > table[frame="hsides"] {
border-left-style: none;
border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
border-top-style: none;
border-bottom-style: none;
}
/*
* html5 specific
*
* */
table.tableblock {
margin-top: 1.0em;
margin-bottom: 1.5em;
}
thead, p.tableblock.header {
font-weight: bold;
color: #527bbd;
}
p.tableblock {
margin-top: 0;
}
table.tableblock {
border-width: 3px;
border-spacing: 0px;
border-style: solid;
border-color: #527bbd;
border-collapse: collapse;
}
th.tableblock, td.tableblock {
border-width: 1px;
padding: 4px;
border-style: solid;
border-color: #527bbd;
}
table.tableblock.frame-topbot {
border-left-style: hidden;
border-right-style: hidden;
}
table.tableblock.frame-sides {
border-top-style: hidden;
border-bottom-style: hidden;
}
table.tableblock.frame-none {
border-style: hidden;
}
th.tableblock.halign-left, td.tableblock.halign-left {
text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
text-align: right;
}
th.tableblock.valign-top, td.tableblock.valign-top {
vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
vertical-align: bottom;
}
/*
* manpage specific
*
* */
body.manpage h1 {
padding-top: 0.5em;
padding-bottom: 0.5em;
border-top: 2px solid silver;
border-bottom: 2px solid silver;
}
body.manpage h2 {
border-style: none;
}
body.manpage div.sectionbody {
margin-left: 3em;
}
@media print {
body.manpage div#toc { display: none; }
}
</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = { // Namespace.
/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////
/* Author: Mihai Bazon, September 2002
* http://students.infoiasi.ro/~mishoo
*
* Table Of Content generator
* Version: 0.4
*
* Feel free to use this script under the terms of the GNU General Public
* License, as long as you do not remove or alter this notice.
*/
/* modified by Troy D. Hanson, September 2006. License: GPL */
/* modified by Stuart Rackham, 2006, 2009. License: GPL */
// toclevels = 1..4.
toc: function (toclevels) {
function getText(el) {
var text = "";
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
text += i.data;
else if (i.firstChild != null)
text += getText(i);
}
return text;
}
function TocEntry(el, text, toclevel) {
this.element = el;
this.text = text;
this.toclevel = toclevel;
}
function tocEntries(el, toclevels) {
var result = new Array;
var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
// Function that scans the DOM tree for header elements (the DOM2
// nodeIterator API would be a better technique but not supported by all
// browsers).
var iterate = function (el) {
for (var i = el.firstChild; i != null; i = i.nextSibling) {
if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
var mo = re.exec(i.tagName);
if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
}
iterate(i);
}
}
}
iterate(el);
return result;
}
var toc = document.getElementById("toc");
if (!toc) {
return;
}
// Delete existing TOC entries in case we're reloading the TOC.
var tocEntriesToRemove = [];
var i;
for (i = 0; i < toc.childNodes.length; i++) {
var entry = toc.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div'
&& entry.getAttribute("class")
&& entry.getAttribute("class").match(/^toclevel/))
tocEntriesToRemove.push(entry);
}
for (i = 0; i < tocEntriesToRemove.length; i++) {
toc.removeChild(tocEntriesToRemove[i]);
}
// Rebuild TOC entries.
var entries = tocEntries(document.getElementById("content"), toclevels);
for (var i = 0; i < entries.length; ++i) {
var entry = entries[i];
if (entry.element.id == "")
entry.element.id = "_toc_" + i;
var a = document.createElement("a");
a.href = "#" + entry.element.id;
a.appendChild(document.createTextNode(entry.text));
var div = document.createElement("div");
div.appendChild(a);
div.className = "toclevel" + entry.toclevel;
toc.appendChild(div);
}
if (entries.length == 0)
toc.parentNode.removeChild(toc);
},
/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////
/* Based on footnote generation code from:
* http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
*/
footnotes: function () {
// Delete existing footnote entries in case we're reloading the footnodes.
var i;
var noteholder = document.getElementById("footnotes");
if (!noteholder) {
return;
}
var entriesToRemove = [];
for (i = 0; i < noteholder.childNodes.length; i++) {
var entry = noteholder.childNodes[i];
if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
entriesToRemove.push(entry);
}
for (i = 0; i < entriesToRemove.length; i++) {
noteholder.removeChild(entriesToRemove[i]);
}
// Rebuild footnote entries.
var cont = document.getElementById("content");
var spans = cont.getElementsByTagName("span");
var refs = {};
var n = 0;
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnote") {
n++;
var note = spans[i].getAttribute("data-note");
if (!note) {
// Use [\s\S] in place of . so multi-line matches work.
// Because JavaScript has no s (dotall) regex flag.
note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
spans[i].innerHTML =
"[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
spans[i].setAttribute("data-note", note);
}
noteholder.innerHTML +=
"<div class='footnote' id='_footnote_" + n + "'>" +
"<a href='#_footnoteref_" + n + "' title='Return to text'>" +
n + "</a>. " + note + "</div>";
var id =spans[i].getAttribute("id");
if (id != null) refs["#"+id] = n;
}
}
if (n == 0)
noteholder.parentNode.removeChild(noteholder);
else {
// Process footnoterefs.
for (i=0; i<spans.length; i++) {
if (spans[i].className == "footnoteref") {
var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
href = href.match(/#.*/)[0]; // Because IE return full URL.
n = refs[href];
spans[i].innerHTML =
"[<a href='#_footnote_" + n +
"' title='View footnote' class='footnote'>" + n + "</a>]";
}
}
}
},
install: function(toclevels) {
var timerId;
function reinstall() {
asciidoc.footnotes();
if (toclevels) {
asciidoc.toc(toclevels);
}
}
function reinstallAndRemoveTimer() {
clearInterval(timerId);
reinstall();
}
timerId = setInterval(reinstall, 500);
if (document.addEventListener)
document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
else
window.onload = reinstallAndRemoveTimer;
}
}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="manpage">
<div id="header">
<h1>
WORDLIST2DAWG(1) Manual Page
</h1>
<h2>NAME</h2>
<div class="sectionbody">
<p>wordlist2dawg -
convert a wordlist to a DAWG for Tesseract
</p>
</div>
</div>
<div id="content">
<div class="sect1">
<h2 id="_synopsis">SYNOPSIS</h2>
<div class="sectionbody">
<div class="paragraph"><p><strong>wordlist2dawg</strong> <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -t <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -r 1 <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -r 2 <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
<div class="paragraph"><p><strong>wordlist2dawg</strong> -l &lt;short&gt; &lt;long&gt; <em>WORDLIST</em> <em>DAWG</em> <em>lang.unicharset</em></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_description">DESCRIPTION</h2>
<div class="sectionbody">
<div class="paragraph"><p>wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract. A DAWG is a compressed, space and time
efficient representation of a word list.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_options">OPTIONS</h2>
<div class="sectionbody">
<div class="paragraph"><p>-t
Verify that a given dawg file is equivalent to a given wordlist.</p></div>
<div class="paragraph"><p>-r 1
Reverse a word if it contains an RTL character.</p></div>
<div class="paragraph"><p>-r 2
Reverse all words.</p></div>
<div class="paragraph"><p>-l &lt;short&gt; &lt;long&gt;
Produce a file with several dawgs in it, one each for words
of length &lt;short&gt;, &lt;short+1&gt;,&#8230; &lt;long&gt;</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_arguments">ARGUMENTS</h2>
<div class="sectionbody">
<div class="paragraph"><p><em>WORDLIST</em>
A plain text file in UTF-8, one word per line.</p></div>
<div class="paragraph"><p><em>DAWG</em>
The output DAWG to write.</p></div>
<div class="paragraph"><p><em>lang.unicharset</em>
The unicharset of the language. This is the unicharset
generated by mftraining(1).</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_see_also">SEE ALSO</h2>
<div class="sectionbody">
<div class="paragraph"><p>tesseract(1), combine_tessdata(1), dawg2wordlist(1)</p></div>
<div class="paragraph"><p><a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a></p></div>
</div>
</div>
<div class="sect1">
<h2 id="_copying">COPYING</h2>
<div class="sectionbody">
<div class="paragraph"><p>Copyright (C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0</p></div>
</div>
</div>
<div class="sect1">
<h2 id="_author">AUTHOR</h2>
<div class="sectionbody">
<div class="paragraph"><p>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</p></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2015-06-12 23:52:50 CEST
</div>
</div>
</body>
</html>

View File

@ -1,69 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
<title>WORDLIST2DAWG(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>wordlist2dawg</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
<refname>wordlist2dawg</refname>
<refpurpose>convert a wordlist to a DAWG for Tesseract</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">wordlist2dawg</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -t <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 1 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 2 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -l &lt;short&gt; &lt;long&gt; <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract. A DAWG is a compressed, space and time
efficient representation of a word list.</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<simpara>-t
Verify that a given dawg file is equivalent to a given wordlist.</simpara>
<simpara>-r 1
Reverse a word if it contains an RTL character.</simpara>
<simpara>-r 2
Reverse all words.</simpara>
<simpara>-l &lt;short&gt; &lt;long&gt;
Produce a file with several dawgs in it, one each for words
of length &lt;short&gt;, &lt;short+1&gt;,&#8230; &lt;long&gt;</simpara>
</refsect1>
<refsect1 id="_arguments">
<title>ARGUMENTS</title>
<simpara><emphasis>WORDLIST</emphasis>
A plain text file in UTF-8, one word per line.</simpara>
<simpara><emphasis>DAWG</emphasis>
The output DAWG to write.</simpara>
<simpara><emphasis>lang.unicharset</emphasis>
The unicharset of the language. This is the unicharset
generated by mftraining(1).</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), combine_tessdata(1), dawg2wordlist(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>