doc: Fix line endings

Remove spaces at line endings and replace CRLF by LF.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
Stefan Weil 2016-12-04 20:41:37 +01:00
parent 798d79aaa5
commit 61d0e8f0ff
28 changed files with 11318 additions and 11318 deletions


@@ -1,43 +1,43 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>AMBIGUOUS_WORDS(1)</title> <title>AMBIGUOUS_WORDS(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>ambiguous_words</refentrytitle> <refentrytitle>ambiguous_words</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>ambiguous_words</refname> <refname>ambiguous_words</refname>
<refpurpose>generate sets of words Tesseract is likely to find ambiguous</refpurpose> <refpurpose>generate sets of words Tesseract is likely to find ambiguous</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">ambiguous_words</emphasis> [-l lang] <emphasis>TESSDATADIR</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>AMBIGUOUSFILE</emphasis></simpara> <simpara><emphasis role="strong">ambiguous_words</emphasis> [-l lang] <emphasis>TESSDATADIR</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>AMBIGUOUSFILE</emphasis></simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>ambiguous_words(1) runs Tesseract in a special mode, and for each word <simpara>ambiguous_words(1) runs Tesseract in a special mode, and for each word
in word list, produces a set of words which Tesseract thinks might be in word list, produces a set of words which Tesseract thinks might be
ambiguous with it. <emphasis>TESSDATADIR</emphasis> must be set to the absolute path of ambiguous with it. <emphasis>TESSDATADIR</emphasis> must be set to the absolute path of
a directory containing <emphasis>tessdata/lang.traineddata</emphasis>.</simpara> a directory containing <emphasis>tessdata/lang.traineddata</emphasis>.</simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1)</simpara> <simpara>tesseract(1)</simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (C) 2012 Google, Inc. <simpara>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>


@@ -1,58 +1,58 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>CNTRAINING(1)</title> <title>CNTRAINING(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>cntraining</refentrytitle> <refentrytitle>cntraining</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>cntraining</refname> <refname>cntraining</refname>
<refpurpose>character normalization training for Tesseract</refpurpose> <refpurpose>character normalization training for Tesseract</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">cntraining</emphasis> [-D <emphasis>dir</emphasis>] <emphasis>FILE</emphasis>&#8230;</simpara> <simpara><emphasis role="strong">cntraining</emphasis> [-D <emphasis>dir</emphasis>] <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>cntraining takes a list of .tr files, from which it generates the <simpara>cntraining takes a list of .tr files, from which it generates the
<emphasis role="strong">normproto</emphasis> data file (the character normalization sensitivity <emphasis role="strong">normproto</emphasis> data file (the character normalization sensitivity
prototypes).</simpara> prototypes).</simpara>
</refsect1> </refsect1>
<refsect1 id="_options"> <refsect1 id="_options">
<title>OPTIONS</title> <title>OPTIONS</title>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
-D <emphasis>dir</emphasis> -D <emphasis>dir</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Directory to write output files to. Directory to write output files to.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), shapeclustering(1), mftraining(1)</simpara> <simpara>tesseract(1), shapeclustering(1), mftraining(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (c) Hewlett-Packard Company, 1988 <simpara>Copyright (c) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>


@@ -11,7 +11,7 @@ SYNOPSIS
DESCRIPTION DESCRIPTION
----------- -----------
combine_tessdata(1) is the main program to combine/extract/overwrite combine_tessdata(1) is the main program to combine/extract/overwrite
tessdata components in [lang].traineddata files. tessdata components in [lang].traineddata files.
To combine all the individual tessdata components (unicharset, DAWGs, To combine all the individual tessdata components (unicharset, DAWGs,


@@ -1,281 +1,281 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>COMBINE_TESSDATA(1)</title> <title>COMBINE_TESSDATA(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>combine_tessdata</refentrytitle> <refentrytitle>combine_tessdata</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>combine_tessdata</refname> <refname>combine_tessdata</refname>
<refpurpose>combine/extract/overwrite Tesseract data</refpurpose> <refpurpose>combine/extract/overwrite Tesseract data</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">combine_tessdata</emphasis> [<emphasis>OPTION</emphasis>] <emphasis>FILE</emphasis>&#8230;</simpara> <simpara><emphasis role="strong">combine_tessdata</emphasis> [<emphasis>OPTION</emphasis>] <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>combine_tessdata(1) is the main program to combine/extract/overwrite <simpara>combine_tessdata(1) is the main program to combine/extract/overwrite
tessdata components in [lang].traineddata files.</simpara> tessdata components in [lang].traineddata files.</simpara>
<simpara>To combine all the individual tessdata components (unicharset, DAWGs, <simpara>To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say, classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:</simpara> /home/$USER/temp/eng.* run:</simpara>
<literallayout class="monospaced">combine_tessdata /home/$USER/temp/eng.</literallayout> <literallayout class="monospaced">combine_tessdata /home/$USER/temp/eng.</literallayout>
<simpara>The result will be a combined tessdata file /home/$USER/temp/eng.traineddata</simpara> <simpara>The result will be a combined tessdata file /home/$USER/temp/eng.traineddata</simpara>
<simpara>Specify option -e if you would like to extract individual components <simpara>Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract language config from a combined traineddata file. For example, to extract language config
file and the unicharset from tessdata/eng.traineddata run:</simpara> file and the unicharset from tessdata/eng.traineddata run:</simpara>
<literallayout class="monospaced">combine_tessdata -e tessdata/eng.traineddata \ <literallayout class="monospaced">combine_tessdata -e tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</literallayout> /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</literallayout>
<simpara>The desired config file and unicharset will be written to <simpara>The desired config file and unicharset will be written to
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</simpara> /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset</simpara>
<simpara>Specify option -o to overwrite individual components of the given <simpara>Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite language config [lang].traineddata file. For example, to overwrite language config
and unichar ambiguities files in tessdata/eng.traineddata use:</simpara> and unichar ambiguities files in tessdata/eng.traineddata use:</simpara>
<literallayout class="monospaced">combine_tessdata -o tessdata/eng.traineddata \ <literallayout class="monospaced">combine_tessdata -o tessdata/eng.traineddata \
/home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</literallayout> /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs</literallayout>
<simpara>As a result, tessdata/eng.traineddata will contain the new language config <simpara>As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.</simpara> and unichar ambigs, plus all the original DAWGs, classifier templates, etc.</simpara>
<simpara>Note: the file names of the files to extract to and to overwrite from should <simpara>Note: the file names of the files to extract to and to overwrite from should
have the appropriate file suffixes (extensions) indicating their tessdata have the appropriate file suffixes (extensions) indicating their tessdata
component type (.unicharset for the unicharset, .unicharambigs for unichar component type (.unicharset for the unicharset, .unicharambigs for unichar
ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.</simpara> ambigs, etc). See k*FileSuffix variable in ccutil/tessdatamanager.h.</simpara>
<simpara>Specify option -u to unpack all the components to the specified path:</simpara> <simpara>Specify option -u to unpack all the components to the specified path:</simpara>
<literallayout class="monospaced">combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.</literallayout> <literallayout class="monospaced">combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.</literallayout>
<simpara>This will create /home/$USER/temp/eng.* files with individual tessdata <simpara>This will create /home/$USER/temp/eng.* files with individual tessdata
components from tessdata/eng.traineddata.</simpara> components from tessdata/eng.traineddata.</simpara>
</refsect1> </refsect1>
<refsect1 id="_options"> <refsect1 id="_options">
<title>OPTIONS</title> <title>OPTIONS</title>
<simpara><emphasis role="strong">-e</emphasis> <emphasis>.traineddata</emphasis> <emphasis>FILE</emphasis>&#8230;: <simpara><emphasis role="strong">-e</emphasis> <emphasis>.traineddata</emphasis> <emphasis>FILE</emphasis>&#8230;:
Extracts the specified components from the .traineddata file</simpara> Extracts the specified components from the .traineddata file</simpara>
<simpara><emphasis role="strong">-o</emphasis> <emphasis>.traineddata</emphasis> <emphasis>FILE</emphasis>&#8230;: <simpara><emphasis role="strong">-o</emphasis> <emphasis>.traineddata</emphasis> <emphasis>FILE</emphasis>&#8230;:
Overwrites the specified components of the .traineddata file Overwrites the specified components of the .traineddata file
with those provided on the command line.</simpara> with those provided on the command line.</simpara>
<simpara><emphasis role="strong">-u</emphasis> <emphasis>.traineddata</emphasis> <emphasis>PATHPREFIX</emphasis> <simpara><emphasis role="strong">-u</emphasis> <emphasis>.traineddata</emphasis> <emphasis>PATHPREFIX</emphasis>
Unpacks the .traineddata using the provided prefix.</simpara> Unpacks the .traineddata using the provided prefix.</simpara>
</refsect1> </refsect1>
<refsect1 id="_caveats"> <refsect1 id="_caveats">
<title>CAVEATS</title> <title>CAVEATS</title>
<simpara><emphasis>Prefix</emphasis> refers to the full file prefix, including period (.)</simpara> <simpara><emphasis>Prefix</emphasis> refers to the full file prefix, including period (.)</simpara>
</refsect1> </refsect1>
<refsect1 id="_components"> <refsect1 id="_components">
<title>COMPONENTS</title> <title>COMPONENTS</title>
<simpara>The components in a Tesseract lang.traineddata file as of <simpara>The components in a Tesseract lang.traineddata file as of
Tesseract 3.02 are briefly described below; for more information on Tesseract 3.02 are briefly described below; for more information on
many of these files, see many of these files, see
<ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
lang.config lang.config
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) Language-specific overrides to default config variables. (Optional) Language-specific overrides to default config variables.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.unicharset lang.unicharset
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Required) The list of symbols that Tesseract recognizes, with properties. (Required) The list of symbols that Tesseract recognizes, with properties.
See unicharset(5). See unicharset(5).
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.unicharambigs lang.unicharambigs
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) This file contains information on pairs of recognized symbols (Optional) This file contains information on pairs of recognized symbols
which are often confused. For example, <emphasis>rn</emphasis> and <emphasis>m</emphasis>. which are often confused. For example, <emphasis>rn</emphasis> and <emphasis>m</emphasis>.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.inttemp lang.inttemp
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Required) Character shape templates for each unichar. Produced by (Required) Character shape templates for each unichar. Produced by
mftraining(1). mftraining(1).
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.pffmtable lang.pffmtable
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Required) The number of features expected for each unichar. (Required) The number of features expected for each unichar.
Produced by mftraining(1) from <emphasis role="strong">.tr</emphasis> files. Produced by mftraining(1) from <emphasis role="strong">.tr</emphasis> files.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.normproto lang.normproto
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Required) Character normalization prototypes generated by cntraining(1) (Required) Character normalization prototypes generated by cntraining(1)
from <emphasis role="strong">.tr</emphasis> files. from <emphasis role="strong">.tr</emphasis> files.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.punc-dawg lang.punc-dawg
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) A dawg made from punctuation patterns found around words. (Optional) A dawg made from punctuation patterns found around words.
The "word" part is replaced by a single space. The "word" part is replaced by a single space.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.word-dawg lang.word-dawg
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) A dawg made from dictionary words from the language. (Optional) A dawg made from dictionary words from the language.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.number-dawg lang.number-dawg
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) A dawg made from tokens which originally contained digits. (Optional) A dawg made from tokens which originally contained digits.
Each digit is replaced by a space character. Each digit is replaced by a space character.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.freq-dawg lang.freq-dawg
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) A dawg made from the most frequent words which would have (Optional) A dawg made from the most frequent words which would have
gone into word-dawg. gone into word-dawg.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.fixed-length-dawgs lang.fixed-length-dawgs
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) Several dawgs of different fixed lengths&#8201;&#8212;&#8201;useful for (Optional) Several dawgs of different fixed lengths&#8201;&#8212;&#8201;useful for
languages like Chinese. languages like Chinese.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.cube-unicharset lang.cube-unicharset
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) A unicharset for cube, if cube was trained on a different set (Optional) A unicharset for cube, if cube was trained on a different set
of symbols. of symbols.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.cube-word-dawg lang.cube-word-dawg
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) A word dawg for cube&#8217;s alternate unicharset. Not needed if Cube (Optional) A word dawg for cube&#8217;s alternate unicharset. Not needed if Cube
was trained with Tesseract&#8217;s unicharset. was trained with Tesseract&#8217;s unicharset.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.shapetable lang.shapetable
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) When present, a shapetable is an extra layer between the character (Optional) When present, a shapetable is an extra layer between the character
classifier and the word recognizer that allows the character classifier to classifier and the word recognizer that allows the character classifier to
return a collection of unichar ids and fonts instead of a single unichar-id return a collection of unichar ids and fonts instead of a single unichar-id
and font. and font.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.bigram-dawg lang.bigram-dawg
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) A dawg of word bigrams where the words are separated by a space (Optional) A dawg of word bigrams where the words are separated by a space
and each digit is replaced by a <emphasis>?</emphasis>. and each digit is replaced by a <emphasis>?</emphasis>.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.unambig-dawg lang.unambig-dawg
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) TODO: Describe. (Optional) TODO: Describe.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
lang.params-training-model lang.params-training-model
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Optional) TODO: Describe. (Optional) TODO: Describe.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1 id="_history"> <refsect1 id="_history">
<title>HISTORY</title> <title>HISTORY</title>
<simpara>combine_tessdata(1) first appeared in version 3.00 of Tesseract</simpara> <simpara>combine_tessdata(1) first appeared in version 3.00 of Tesseract</simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), <simpara>tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)</simpara> unicharambigs(5)</simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (C) 2009, Google Inc. <simpara>Copyright (C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>
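
The combine, -e, -o and -u invocations described in the combine_tessdata page above can be sketched as a small argument builder. This is only an illustration under stated assumptions: the helper names (`component_paths`, `build_command`) and the mode strings are hypothetical and not part of Tesseract, and the only flags used are the ones the page itself documents. It also enforces the CAVEATS rule that the prefix includes the trailing period.

```python
from typing import List, Sequence

# Flags copied from the OPTIONS section of the man page; the mode names
# on the left are hypothetical labels chosen for this sketch.
_FLAG_FOR_MODE = {"extract": "-e", "overwrite": "-o"}


def component_paths(prefix: str, components: Sequence[str]) -> List[str]:
    """Join a path prefix such as '/tmp/eng.' with component suffixes.

    The prefix must already end with a period, as the CAVEATS section says.
    """
    if not prefix.endswith("."):
        raise ValueError("prefix must include the trailing period, e.g. 'eng.'")
    return [prefix + name for name in components]


def build_command(mode: str, prefix: str, traineddata: str = "",
                  components: Sequence[str] = ()) -> List[str]:
    """Build an argv list for combine_tessdata (illustrative only)."""
    if not prefix.endswith("."):
        raise ValueError("prefix must include the trailing period, e.g. 'eng.'")
    if mode == "combine":
        # combine_tessdata PATHPREFIX
        return ["combine_tessdata", prefix]
    if mode == "unpack":
        # combine_tessdata -u .traineddata PATHPREFIX
        return ["combine_tessdata", "-u", traineddata, prefix]
    if mode in _FLAG_FOR_MODE:
        # combine_tessdata -e|-o .traineddata FILE...
        return (["combine_tessdata", _FLAG_FOR_MODE[mode], traineddata]
                + component_paths(prefix, components))
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    print(build_command("extract", "/tmp/eng.", "tessdata/eng.traineddata",
                        ["config", "unicharset"]))
```

The resulting list could be passed to `subprocess.run` on a machine where combine_tessdata is installed. Building the component file names from the prefix keeps the suffixes (.config, .unicharset, and so on) consistent with what the tool expects.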

File diff suppressed because it is too large Load Diff

View File

@@ -1,53 +1,53 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>DAWG2WORDLIST(1)</title> <title>DAWG2WORDLIST(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>dawg2wordlist</refentrytitle> <refentrytitle>dawg2wordlist</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>dawg2wordlist</refname> <refname>dawg2wordlist</refname>
<refpurpose>convert a Tesseract DAWG to a wordlist</refpurpose> <refpurpose>convert a Tesseract DAWG to a wordlist</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">dawg2wordlist</emphasis> <emphasis>UNICHARSET</emphasis> <emphasis>DAWG</emphasis> <emphasis>WORDLIST</emphasis></simpara> <simpara><emphasis role="strong">dawg2wordlist</emphasis> <emphasis>UNICHARSET</emphasis> <emphasis>DAWG</emphasis> <emphasis>WORDLIST</emphasis></simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>dawg2wordlist(1) converts a Tesseract Directed Acyclic Word <simpara>dawg2wordlist(1) converts a Tesseract Directed Acyclic Word
Graph (DAWG) to a list of words using a unicharset as key.</simpara> Graph (DAWG) to a list of words using a unicharset as key.</simpara>
</refsect1> </refsect1>
<refsect1 id="_options"> <refsect1 id="_options">
<title>OPTIONS</title> <title>OPTIONS</title>
<simpara><emphasis>UNICHARSET</emphasis> <simpara><emphasis>UNICHARSET</emphasis>
The unicharset of the language. This is the unicharset The unicharset of the language. This is the unicharset
generated by mftraining(1).</simpara> generated by mftraining(1).</simpara>
<simpara><emphasis>DAWG</emphasis> <simpara><emphasis>DAWG</emphasis>
The input DAWG, created by wordlist2dawg(1)</simpara> The input DAWG, created by wordlist2dawg(1)</simpara>
<simpara><emphasis>WORDLIST</emphasis> <simpara><emphasis>WORDLIST</emphasis>
Plain text (output) file in UTF-8, one word per line</simpara> Plain text (output) file in UTF-8, one word per line</simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), <simpara>tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5),
combine_tessdata(1)</simpara> combine_tessdata(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (C) 2012 Google, Inc. <simpara>Copyright (C) 2012 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>


@@ -24,12 +24,12 @@ OPTIONS
-F 'font_properties_file':: -F 'font_properties_file'::
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur* *font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*
-X 'xheights_file':: -X 'xheights_file'::
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
*font_name* *xheight* *font_name* *xheight*
-D 'dir':: -D 'dir'::
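
The font properties line format described under -F above, and the "133" mentioned under -X, can be checked with a short sketch. It is not taken from the Tesseract sources: the names `FontProperties`, `parse_font_properties_line` and `em_height_pixels` are hypothetical, the sample font name is made up, and the parser assumes that font names contain no whitespace. The pixel calculation relies only on the standard 72 points per inch.

```python
from typing import NamedTuple


class FontProperties(NamedTuple):
    """One line of a font properties file (hypothetical container type)."""
    font_name: str
    italic: bool
    bold: bool
    fixed_pitch: bool
    serif: bool
    fraktur: bool


def parse_font_properties_line(line: str) -> FontProperties:
    """Parse 'font_name italic bold fixed_pitch serif fraktur'.

    Assumes whitespace-separated fields and a font name without spaces.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    name, *flags = fields
    if any(flag not in ("0", "1") for flag in flags):
        raise ValueError("every field after the font name must be 0 or 1")
    return FontProperties(name, *(flag == "1" for flag in flags))


def em_height_pixels(point_size: float = 32, dpi: float = 300) -> float:
    """Convert a point size to pixels, using 72 points per inch."""
    return point_size * dpi / 72


if __name__ == "__main__":
    print(parse_font_properties_line("timesitalic 1 0 0 1 0"))
    print(round(em_height_pixels(32, 300)))
```

With the defaults, 32 pt at 300 dpi comes to roughly 133 pixels, which matches the total height quoted in the -X description; the x height itself is then the part of that total left after ascenders and descenders.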


@@ -1,102 +1,102 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>MFTRAINING(1)</title> <title>MFTRAINING(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>mftraining</refentrytitle> <refentrytitle>mftraining</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>mftraining</refname> <refname>mftraining</refname>
<refpurpose>feature training for Tesseract</refpurpose> <refpurpose>feature training for Tesseract</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara>mftraining -U <emphasis>unicharset</emphasis> -O <emphasis>lang.unicharset</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara> <simpara>mftraining -U <emphasis>unicharset</emphasis> -O <emphasis>lang.unicharset</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>mftraining takes a list of .tr files, from which it generates the <simpara>mftraining takes a list of .tr files, from which it generates the
files <emphasis role="strong">inttemp</emphasis> (the shape prototypes), <emphasis role="strong">shapetable</emphasis>, and <emphasis role="strong">pffmtable</emphasis> files <emphasis role="strong">inttemp</emphasis> (the shape prototypes), <emphasis role="strong">shapetable</emphasis>, and <emphasis role="strong">pffmtable</emphasis>
(the number of expected features for each character). (A fourth file (the number of expected features for each character). (A fourth file
called Microfeat is also written by this program, but it is not used.)</simpara> called Microfeat is also written by this program, but it is not used.)</simpara>
</refsect1> </refsect1>
<refsect1 id="_options"> <refsect1 id="_options">
<title>OPTIONS</title> <title>OPTIONS</title>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
-U <emphasis>FILE</emphasis> -U <emphasis>FILE</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Input) The unicharset generated by unicharset_extractor(1) (Input) The unicharset generated by unicharset_extractor(1)
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-F <emphasis>font_properties_file</emphasis> -F <emphasis>font_properties_file</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
</simpara> </simpara>
<literallayout class="monospaced">*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*</literallayout> <literallayout class="monospaced">*font_name* *italic* *bold* *fixed_pitch* *serif* *fraktur*</literallayout>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-X <emphasis>xheights_file</emphasis> -X <emphasis>xheights_file</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</simpara> </simpara>
<literallayout class="monospaced">*font_name* *xheight*</literallayout> <literallayout class="monospaced">*font_name* *xheight*</literallayout>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-D <emphasis>dir</emphasis> -D <emphasis>dir</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Directory to write output files to. Directory to write output files to.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-O <emphasis>FILE</emphasis> -O <emphasis>FILE</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Output) The output unicharset that will be given to combine_tessdata(1) (Output) The output unicharset that will be given to combine_tessdata(1)
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), <simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
shapeclustering(1), unicharset(5)</simpara> shapeclustering(1), unicharset(5)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (C) Hewlett-Packard Company, 1988 <simpara>Copyright (C) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>

@ -35,7 +35,7 @@ OPTIONS
-X 'xheights_file':: -X 'xheights_file'::
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
'font_name' 'xheight' 'font_name' 'xheight'
-O 'FILE':: -O 'FILE'::

@ -1,105 +1,105 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>SHAPECLUSTERING(1)</title> <title>SHAPECLUSTERING(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>shapeclustering</refentrytitle> <refentrytitle>shapeclustering</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>shapeclustering</refname> <refname>shapeclustering</refname>
<refpurpose>shape clustering training for Tesseract</refpurpose> <refpurpose>shape clustering training for Tesseract</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara>shapeclustering -D <emphasis>output_dir</emphasis> <simpara>shapeclustering -D <emphasis>output_dir</emphasis>
-U <emphasis>unicharset</emphasis> -O <emphasis>mfunicharset</emphasis> -U <emphasis>unicharset</emphasis> -O <emphasis>mfunicharset</emphasis>
-F <emphasis>font_props</emphasis> -X <emphasis>xheights</emphasis> -F <emphasis>font_props</emphasis> -X <emphasis>xheights</emphasis>
<emphasis>FILE</emphasis>&#8230;</simpara> <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>shapeclustering(1) takes extracted feature .tr files (generated by <simpara>shapeclustering(1) takes extracted feature .tr files (generated by
tesseract(1) run in a special mode from box files) and produces a tesseract(1) run in a special mode from box files) and produces a
file <emphasis role="strong">shapetable</emphasis> and an enhanced unicharset. This program is still file <emphasis role="strong">shapetable</emphasis> and an enhanced unicharset. This program is still
experimental, and is not required (yet) for training Tesseract.</simpara> experimental, and is not required (yet) for training Tesseract.</simpara>
</refsect1> </refsect1>
<refsect1 id="_options"> <refsect1 id="_options">
<title>OPTIONS</title> <title>OPTIONS</title>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
-U <emphasis>FILE</emphasis> -U <emphasis>FILE</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The unicharset generated by unicharset_extractor(1). The unicharset generated by unicharset_extractor(1).
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-D <emphasis>dir</emphasis> -D <emphasis>dir</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Directory to write output files to. Directory to write output files to.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-F <emphasis>font_properties_file</emphasis> -F <emphasis>font_properties_file</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1: (Input) font properties file, each line is of the following form, where each field other than the font name is 0 or 1:
</simpara> </simpara>
<literallayout class="monospaced">'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'</literallayout> <literallayout class="monospaced">'font_name' 'italic' 'bold' 'fixed_pitch' 'serif' 'fraktur'</literallayout>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-X <emphasis>xheights_file</emphasis> -X <emphasis>xheights_file</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
</simpara> </simpara>
<literallayout class="monospaced">'font_name' 'xheight'</literallayout> <literallayout class="monospaced">'font_name' 'xheight'</literallayout>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
-O <emphasis>FILE</emphasis> -O <emphasis>FILE</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The output unicharset that will be given to combine_tessdata(1). The output unicharset that will be given to combine_tessdata(1).
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), <simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
unicharset(5)</simpara> unicharset(5)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (C) Google, 2011 <simpara>Copyright (C) Google, 2011
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>

@ -67,7 +67,7 @@ OPTIONS
6 = Assume a single uniform block of text. 6 = Assume a single uniform block of text.
7 = Treat the image as a single text line. 7 = Treat the image as a single text line.
8 = Treat the image as a single word. 8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle. 9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character. 10 = Treat the image as a single character.
'configfile':: 'configfile'::
@ -264,10 +264,10 @@ on read_pattern_list().
HISTORY HISTORY
------- -------
The engine was developed at Hewlett Packard Laboratories Bristol and at The engine was developed at Hewlett Packard Laboratories Bristol and at
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A changes made in 1996 to port to Windows, and some C\+\+izing in 1998. A
lot of the code was written in C, and then some more was written in C\+\+. lot of the code was written in C, and then some more was written in C\+\+.
The C\+\+ code makes heavy use of a list system using macros. This predates The C\+\+ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has stl, was portable before stl, and is more efficient than stl lists, but has
the big negative that if you do get a segmentation violation, it is hard to the big negative that if you do get a segmentation violation, it is hard to
@ -276,18 +276,18 @@ debug.
Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract. to train Tesseract.
Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy. Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>. With Tesseract 2.00, See <https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests. scripts are now included to allow anyone to reproduce some of these tests.
See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
details. details.
Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing and Korean. It also introduces a new, single-file based system of managing
language data. language data.
Tesseract 3.02 adds BiDirectional text support, the ability to recognize Tesseract 3.02 adds BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis. multiple languages in a single image, and improved layout analysis.
For further details, see the file ReleaseNotes included with the distribution. For further details, see the file ReleaseNotes included with the distribution.

@ -1,424 +1,424 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>TESSERACT(1)</title> <title>TESSERACT(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>tesseract</refentrytitle> <refentrytitle>tesseract</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>tesseract</refname> <refname>tesseract</refname>
<refpurpose>command-line OCR engine</refpurpose> <refpurpose>command-line OCR engine</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">tesseract</emphasis> <emphasis>imagename</emphasis>|<emphasis>stdin</emphasis> <emphasis>outputbase</emphasis>|<emphasis>stdout</emphasis> [options&#8230;] [configfile&#8230;]</simpara> <simpara><emphasis role="strong">tesseract</emphasis> <emphasis>imagename</emphasis>|<emphasis>stdin</emphasis> <emphasis>outputbase</emphasis>|<emphasis>stdout</emphasis> [options&#8230;] [configfile&#8230;]</simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>tesseract(1) is a commercial quality OCR engine originally developed at HP <simpara>tesseract(1) is a commercial quality OCR engine originally developed at HP
between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by
UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed
at Google since then.</simpara> at Google since then.</simpara>
</refsect1> </refsect1>
<refsect1 id="_in_out_arguments"> <refsect1 id="_in_out_arguments">
<title>IN/OUT ARGUMENTS</title> <title>IN/OUT ARGUMENTS</title>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>imagename</emphasis> <emphasis>imagename</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The name of the input image. Most image file formats (anything The name of the input image. Most image file formats (anything
readable by Leptonica) are supported. readable by Leptonica) are supported.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>stdin</emphasis> <emphasis>stdin</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Instruction to read data from standard input Instruction to read data from standard input
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>outputbase</emphasis> <emphasis>outputbase</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The basename of the output file (to which the appropriate extension The basename of the output file (to which the appropriate extension
will be appended). By default the output will be named <emphasis>outputbase.txt</emphasis>. will be appended). By default the output will be named <emphasis>outputbase.txt</emphasis>.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>stdout</emphasis> <emphasis>stdout</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Instruction to send output data to standard output Instruction to send output data to standard output
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1 id="_options"> <refsect1 id="_options">
<title>OPTIONS</title> <title>OPTIONS</title>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>--tessdata-dir /path</emphasis> <emphasis>--tessdata-dir /path</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Specify the location of tessdata path Specify the location of tessdata path
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>--user-words /path/to/file</emphasis> <emphasis>--user-words /path/to/file</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Specify the location of user words file Specify the location of user words file
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>--user-patterns /path/to/file</emphasis> <emphasis>--user-patterns /path/to/file</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Specify the location of user patterns file Specify the location of user patterns file
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>-c configvar=value</emphasis> <emphasis>-c configvar=value</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Set value for control parameter. Multiple -c arguments are allowed. Set value for control parameter. Multiple -c arguments are allowed.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>-l lang</emphasis> <emphasis>-l lang</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The language to use. If none is specified, English is assumed. The language to use. If none is specified, English is assumed.
Multiple languages may be specified, separated by plus characters. Multiple languages may be specified, separated by plus characters.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>--psm N</emphasis> <emphasis>--psm N</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Set Tesseract to only run a subset of layout analysis and assume Set Tesseract to only run a subset of layout analysis and assume
a certain form of image. The options for <emphasis role="strong">N</emphasis> are: a certain form of image. The options for <emphasis role="strong">N</emphasis> are:
</simpara> </simpara>
<literallayout class="monospaced">0 = Orientation and script detection (OSD) only. <literallayout class="monospaced">0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD. 1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. 2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default) 3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes. 4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text. 5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text. 6 = Assume a single uniform block of text.
7 = Treat the image as a single text line. 7 = Treat the image as a single text line.
8 = Treat the image as a single word. 8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle. 9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.</literallayout> 10 = Treat the image as a single character.</literallayout>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>configfile</emphasis> <emphasis>configfile</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The name of a config to use. A config is a plaintext file which The name of a config to use. A config is a plaintext file which
contains a list of variables and their values, one per line, with a contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files space separating variable from value. Interesting config files
include:<?asciidoc-br?> include:<?asciidoc-br?>
</simpara> </simpara>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<simpara> <simpara>
hocr - Output in hOCR format instead of as a text file. hocr - Output in hOCR format instead of as a text file.
</simpara> </simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara> <simpara>
pdf - Output in pdf instead of a text file. pdf - Output in pdf instead of a text file.
</simpara> </simpara>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
<simpara><emphasis role="strong">Nota Bene:</emphasis> The options <emphasis>-l lang</emphasis> and <emphasis>--psm N</emphasis> must occur <simpara><emphasis role="strong">Nota Bene:</emphasis> The options <emphasis>-l lang</emphasis> and <emphasis>--psm N</emphasis> must occur
before any <emphasis>configfile</emphasis>.</simpara> before any <emphasis>configfile</emphasis>.</simpara>
</refsect1> </refsect1>
<refsect1 id="_single_options"> <refsect1 id="_single_options">
<title>SINGLE OPTIONS</title> <title>SINGLE OPTIONS</title>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>-v</emphasis> <emphasis>-v</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Returns the current version of the tesseract(1) executable. Returns the current version of the tesseract(1) executable.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>--list-langs</emphasis> <emphasis>--list-langs</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
List available languages for the tesseract engine. Can be used with --tessdata-dir. List available languages for the tesseract engine. Can be used with --tessdata-dir.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>--print-parameters</emphasis> <emphasis>--print-parameters</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Print tesseract parameters to stdout. Print tesseract parameters to stdout.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1 id="_languages"> <refsect1 id="_languages">
<title>LANGUAGES</title> <title>LANGUAGES</title>
<simpara>There are currently language packs available for the following languages <simpara>There are currently language packs available for the following languages
(in <ulink url="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</ulink>):</simpara> (in <ulink url="https://github.com/tesseract-ocr/tessdata">https://github.com/tesseract-ocr/tessdata</ulink>):</simpara>
<simpara><emphasis role="strong">afr</emphasis> (Afrikaans) <simpara><emphasis role="strong">afr</emphasis> (Afrikaans)
<emphasis role="strong">amh</emphasis> (Amharic) <emphasis role="strong">amh</emphasis> (Amharic)
<emphasis role="strong">ara</emphasis> (Arabic) <emphasis role="strong">ara</emphasis> (Arabic)
<emphasis role="strong">asm</emphasis> (Assamese) <emphasis role="strong">asm</emphasis> (Assamese)
<emphasis role="strong">aze</emphasis> (Azerbaijani) <emphasis role="strong">aze</emphasis> (Azerbaijani)
<emphasis role="strong">aze_cyrl</emphasis> (Azerbaijani - Cyrillic) <emphasis role="strong">aze_cyrl</emphasis> (Azerbaijani - Cyrillic)
<emphasis role="strong">bel</emphasis> (Belarusian) <emphasis role="strong">bel</emphasis> (Belarusian)
<emphasis role="strong">ben</emphasis> (Bengali) <emphasis role="strong">ben</emphasis> (Bengali)
<emphasis role="strong">bod</emphasis> (Tibetan) <emphasis role="strong">bod</emphasis> (Tibetan)
<emphasis role="strong">bos</emphasis> (Bosnian) <emphasis role="strong">bos</emphasis> (Bosnian)
<emphasis role="strong">bul</emphasis> (Bulgarian) <emphasis role="strong">bul</emphasis> (Bulgarian)
<emphasis role="strong">cat</emphasis> (Catalan; Valencian) <emphasis role="strong">cat</emphasis> (Catalan; Valencian)
<emphasis role="strong">ceb</emphasis> (Cebuano) <emphasis role="strong">ceb</emphasis> (Cebuano)
<emphasis role="strong">ces</emphasis> (Czech) <emphasis role="strong">ces</emphasis> (Czech)
<emphasis role="strong">chi_sim</emphasis> (Chinese - Simplified) <emphasis role="strong">chi_sim</emphasis> (Chinese - Simplified)
<emphasis role="strong">chi_tra</emphasis> (Chinese - Traditional) <emphasis role="strong">chi_tra</emphasis> (Chinese - Traditional)
<emphasis role="strong">chr</emphasis> (Cherokee) <emphasis role="strong">chr</emphasis> (Cherokee)
<emphasis role="strong">cym</emphasis> (Welsh) <emphasis role="strong">cym</emphasis> (Welsh)
<emphasis role="strong">dan</emphasis> (Danish) <emphasis role="strong">dan</emphasis> (Danish)
<emphasis role="strong">dan_frak</emphasis> (Danish - Fraktur) <emphasis role="strong">dan_frak</emphasis> (Danish - Fraktur)
<emphasis role="strong">deu</emphasis> (German) <emphasis role="strong">deu</emphasis> (German)
<emphasis role="strong">deu_frak</emphasis> (German - Fraktur) <emphasis role="strong">deu_frak</emphasis> (German - Fraktur)
<emphasis role="strong">dzo</emphasis> (Dzongkha) <emphasis role="strong">dzo</emphasis> (Dzongkha)
<emphasis role="strong">ell</emphasis> (Greek, Modern (1453-)) <emphasis role="strong">ell</emphasis> (Greek, Modern (1453-))
<emphasis role="strong">eng</emphasis> (English) <emphasis role="strong">eng</emphasis> (English)
<emphasis role="strong">enm</emphasis> (English, Middle (1100-1500)) <emphasis role="strong">enm</emphasis> (English, Middle (1100-1500))
<emphasis role="strong">epo</emphasis> (Esperanto) <emphasis role="strong">epo</emphasis> (Esperanto)
<emphasis role="strong">equ</emphasis> (Math / equation detection module) <emphasis role="strong">equ</emphasis> (Math / equation detection module)
<emphasis role="strong">est</emphasis> (Estonian) <emphasis role="strong">est</emphasis> (Estonian)
<emphasis role="strong">eus</emphasis> (Basque) <emphasis role="strong">eus</emphasis> (Basque)
<emphasis role="strong">fas</emphasis> (Persian) <emphasis role="strong">fas</emphasis> (Persian)
<emphasis role="strong">fin</emphasis> (Finnish) <emphasis role="strong">fin</emphasis> (Finnish)
<emphasis role="strong">fra</emphasis> (French) <emphasis role="strong">fra</emphasis> (French)
<emphasis role="strong">frk</emphasis> (Frankish) <emphasis role="strong">frk</emphasis> (Frankish)
<emphasis role="strong">frm</emphasis> (French, Middle (ca.1400-1600)) <emphasis role="strong">frm</emphasis> (French, Middle (ca.1400-1600))
<emphasis role="strong">gle</emphasis> (Irish) <emphasis role="strong">gle</emphasis> (Irish)
<emphasis role="strong">glg</emphasis> (Galician) <emphasis role="strong">glg</emphasis> (Galician)
<emphasis role="strong">grc</emphasis> (Greek, Ancient (to 1453)) <emphasis role="strong">grc</emphasis> (Greek, Ancient (to 1453))
<emphasis role="strong">guj</emphasis> (Gujarati) <emphasis role="strong">guj</emphasis> (Gujarati)
<emphasis role="strong">hat</emphasis> (Haitian; Haitian Creole) <emphasis role="strong">hat</emphasis> (Haitian; Haitian Creole)
<emphasis role="strong">heb</emphasis> (Hebrew) <emphasis role="strong">heb</emphasis> (Hebrew)
<emphasis role="strong">hin</emphasis> (Hindi) <emphasis role="strong">hin</emphasis> (Hindi)
<emphasis role="strong">hrv</emphasis> (Croatian) <emphasis role="strong">hrv</emphasis> (Croatian)
<emphasis role="strong">hun</emphasis> (Hungarian) <emphasis role="strong">hun</emphasis> (Hungarian)
<emphasis role="strong">iku</emphasis> (Inuktitut) <emphasis role="strong">iku</emphasis> (Inuktitut)
<emphasis role="strong">ind</emphasis> (Indonesian) <emphasis role="strong">ind</emphasis> (Indonesian)
<emphasis role="strong">isl</emphasis> (Icelandic) <emphasis role="strong">isl</emphasis> (Icelandic)
<emphasis role="strong">ita</emphasis> (Italian) <emphasis role="strong">ita</emphasis> (Italian)
<emphasis role="strong">ita_old</emphasis> (Italian - Old) <emphasis role="strong">ita_old</emphasis> (Italian - Old)
<emphasis role="strong">jav</emphasis> (Javanese) <emphasis role="strong">jav</emphasis> (Javanese)
<emphasis role="strong">jpn</emphasis> (Japanese) <emphasis role="strong">jpn</emphasis> (Japanese)
<emphasis role="strong">kan</emphasis> (Kannada) <emphasis role="strong">kan</emphasis> (Kannada)
<emphasis role="strong">kat</emphasis> (Georgian) <emphasis role="strong">kat</emphasis> (Georgian)
<emphasis role="strong">kat_old</emphasis> (Georgian - Old) <emphasis role="strong">kat_old</emphasis> (Georgian - Old)
<emphasis role="strong">kaz</emphasis> (Kazakh) <emphasis role="strong">kaz</emphasis> (Kazakh)
<emphasis role="strong">khm</emphasis> (Central Khmer) <emphasis role="strong">khm</emphasis> (Central Khmer)
<emphasis role="strong">kir</emphasis> (Kirghiz; Kyrgyz) <emphasis role="strong">kir</emphasis> (Kirghiz; Kyrgyz)
<emphasis role="strong">kor</emphasis> (Korean) <emphasis role="strong">kor</emphasis> (Korean)
<emphasis role="strong">kur</emphasis> (Kurdish) <emphasis role="strong">kur</emphasis> (Kurdish)
<emphasis role="strong">lao</emphasis> (Lao) <emphasis role="strong">lao</emphasis> (Lao)
<emphasis role="strong">lat</emphasis> (Latin) <emphasis role="strong">lat</emphasis> (Latin)
<emphasis role="strong">lav</emphasis> (Latvian) <emphasis role="strong">lav</emphasis> (Latvian)
<emphasis role="strong">lit</emphasis> (Lithuanian) <emphasis role="strong">lit</emphasis> (Lithuanian)
<emphasis role="strong">mal</emphasis> (Malayalam) <emphasis role="strong">mal</emphasis> (Malayalam)
<emphasis role="strong">mar</emphasis> (Marathi) <emphasis role="strong">mar</emphasis> (Marathi)
<emphasis role="strong">mkd</emphasis> (Macedonian) <emphasis role="strong">mkd</emphasis> (Macedonian)
<emphasis role="strong">mlt</emphasis> (Maltese) <emphasis role="strong">mlt</emphasis> (Maltese)
<emphasis role="strong">msa</emphasis> (Malay) <emphasis role="strong">msa</emphasis> (Malay)
<emphasis role="strong">mya</emphasis> (Burmese) <emphasis role="strong">mya</emphasis> (Burmese)
<emphasis role="strong">nep</emphasis> (Nepali) <emphasis role="strong">nep</emphasis> (Nepali)
<emphasis role="strong">nld</emphasis> (Dutch; Flemish) <emphasis role="strong">nld</emphasis> (Dutch; Flemish)
<emphasis role="strong">nor</emphasis> (Norwegian) <emphasis role="strong">nor</emphasis> (Norwegian)
<emphasis role="strong">ori</emphasis> (Oriya) <emphasis role="strong">ori</emphasis> (Oriya)
<emphasis role="strong">osd</emphasis> (Orientation and script detection module) <emphasis role="strong">osd</emphasis> (Orientation and script detection module)
<emphasis role="strong">pan</emphasis> (Panjabi; Punjabi) <emphasis role="strong">pan</emphasis> (Panjabi; Punjabi)
<emphasis role="strong">pol</emphasis> (Polish) <emphasis role="strong">pol</emphasis> (Polish)
<emphasis role="strong">por</emphasis> (Portuguese) <emphasis role="strong">por</emphasis> (Portuguese)
<emphasis role="strong">pus</emphasis> (Pushto; Pashto) <emphasis role="strong">pus</emphasis> (Pushto; Pashto)
<emphasis role="strong">ron</emphasis> (Romanian; Moldavian; Moldovan) <emphasis role="strong">ron</emphasis> (Romanian; Moldavian; Moldovan)
<emphasis role="strong">rus</emphasis> (Russian) <emphasis role="strong">rus</emphasis> (Russian)
<emphasis role="strong">san</emphasis> (Sanskrit) <emphasis role="strong">san</emphasis> (Sanskrit)
<emphasis role="strong">sin</emphasis> (Sinhala; Sinhalese) <emphasis role="strong">sin</emphasis> (Sinhala; Sinhalese)
<emphasis role="strong">slk</emphasis> (Slovak) <emphasis role="strong">slk</emphasis> (Slovak)
<emphasis role="strong">slk_frak</emphasis> (Slovak - Fraktur) <emphasis role="strong">slk_frak</emphasis> (Slovak - Fraktur)
<emphasis role="strong">slv</emphasis> (Slovenian) <emphasis role="strong">slv</emphasis> (Slovenian)
<emphasis role="strong">spa</emphasis> (Spanish; Castilian) <emphasis role="strong">spa</emphasis> (Spanish; Castilian)
<emphasis role="strong">spa_old</emphasis> (Spanish; Castilian - Old) <emphasis role="strong">spa_old</emphasis> (Spanish; Castilian - Old)
<emphasis role="strong">sqi</emphasis> (Albanian) <emphasis role="strong">sqi</emphasis> (Albanian)
<emphasis role="strong">srp</emphasis> (Serbian) <emphasis role="strong">srp</emphasis> (Serbian)
<emphasis role="strong">srp_latn</emphasis> (Serbian - Latin) <emphasis role="strong">srp_latn</emphasis> (Serbian - Latin)
<emphasis role="strong">swa</emphasis> (Swahili) <emphasis role="strong">swa</emphasis> (Swahili)
<emphasis role="strong">swe</emphasis> (Swedish) <emphasis role="strong">swe</emphasis> (Swedish)
<emphasis role="strong">syr</emphasis> (Syriac) <emphasis role="strong">syr</emphasis> (Syriac)
<emphasis role="strong">tam</emphasis> (Tamil) <emphasis role="strong">tam</emphasis> (Tamil)
<emphasis role="strong">tel</emphasis> (Telugu) <emphasis role="strong">tel</emphasis> (Telugu)
<emphasis role="strong">tgk</emphasis> (Tajik) <emphasis role="strong">tgk</emphasis> (Tajik)
<emphasis role="strong">tgl</emphasis> (Tagalog) <emphasis role="strong">tgl</emphasis> (Tagalog)
<emphasis role="strong">tha</emphasis> (Thai) <emphasis role="strong">tha</emphasis> (Thai)
<emphasis role="strong">tir</emphasis> (Tigrinya) <emphasis role="strong">tir</emphasis> (Tigrinya)
<emphasis role="strong">tur</emphasis> (Turkish) <emphasis role="strong">tur</emphasis> (Turkish)
<emphasis role="strong">uig</emphasis> (Uighur; Uyghur) <emphasis role="strong">uig</emphasis> (Uighur; Uyghur)
<emphasis role="strong">ukr</emphasis> (Ukrainian) <emphasis role="strong">ukr</emphasis> (Ukrainian)
<emphasis role="strong">urd</emphasis> (Urdu) <emphasis role="strong">urd</emphasis> (Urdu)
<emphasis role="strong">uzb</emphasis> (Uzbek) <emphasis role="strong">uzb</emphasis> (Uzbek)
<emphasis role="strong">uzb_cyrl</emphasis> (Uzbek - Cyrillic) <emphasis role="strong">uzb_cyrl</emphasis> (Uzbek - Cyrillic)
<emphasis role="strong">vie</emphasis> (Vietnamese) <emphasis role="strong">vie</emphasis> (Vietnamese)
<emphasis role="strong">yid</emphasis> (Yiddish)</simpara> <emphasis role="strong">yid</emphasis> (Yiddish)</simpara>
<simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the <simpara>To use a non-standard language pack named <emphasis role="strong">foo.traineddata</emphasis>, set the
<emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at <emphasis role="strong">TESSDATA_PREFIX</emphasis> environment variable so the file can be found at
<emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the <emphasis role="strong">TESSDATA_PREFIX</emphasis>/tessdata/<emphasis role="strong">foo</emphasis>.traineddata and give Tesseract the
argument <emphasis>-l foo</emphasis>.</simpara> argument <emphasis>-l foo</emphasis>.</simpara>
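The lookup described above can be sketched roughly as follows. This is an illustration only, not Tesseract's own code; the helper name traineddata_path is hypothetical.

```python
import os


def traineddata_path(lang, prefix=None):
    """Return the path where LANG.traineddata is expected to live.

    Hypothetical helper that mirrors the layout described in the text:
    TESSDATA_PREFIX/tessdata/LANG.traineddata.
    """
    if prefix is None:
        # Fall back to the environment variable named in the text.
        prefix = os.environ.get("TESSDATA_PREFIX", "")
    return os.path.join(prefix, "tessdata", lang + ".traineddata")
```

Calling traineddata_path("foo") with TESSDATA_PREFIX set shows where foo.traineddata would need to be placed before passing -l foo.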
</refsect1> </refsect1>
<refsect1 id="_config_files_and_augmenting_with_user_data"> <refsect1 id="_config_files_and_augmenting_with_user_data">
<title>CONFIG FILES AND AUGMENTING WITH USER DATA</title> <title>CONFIG FILES AND AUGMENTING WITH USER DATA</title>
<simpara>Tesseract config files consist of lines with variable-value pairs (space <simpara>Tesseract config files consist of lines with variable-value pairs (space
separated). The variables are documented as flags in the source code like separated). The variables are documented as flags in the source code like
the following one in tesseractclass.h:</simpara> the following one in tesseractclass.h:</simpara>
<simpara>STRING_VAR_H(tessedit_char_blacklist, "", <simpara>STRING_VAR_H(tessedit_char_blacklist, "",
"Blacklist of chars not to recognize");</simpara> "Blacklist of chars not to recognize");</simpara>
<simpara>These variables may enable or disable various features of the engine, and <simpara>These variables may enable or disable various features of the engine, and
may cause it to load (or not load) various data. For instance, let&#8217;s suppose may cause it to load (or not load) various data. For instance, let&#8217;s suppose
you want to OCR in English, but suppress the normal dictionary and load an you want to OCR in English, but suppress the normal dictionary and load an
alternative word list and an alternative list of patterns&#8201;&#8212;&#8201;these two files alternative word list and an alternative list of patterns&#8201;&#8212;&#8201;these two files
are the most commonly used extra data files.</simpara> are the most commonly used extra data files.</simpara>
<simpara>If your language pack is in /path/to/eng.traineddata and the hocr config <simpara>If your language pack is in /path/to/eng.traineddata and the hocr config
is in /path/to/configs/hocr then create three new files:</simpara> is in /path/to/configs/hocr then create three new files:</simpara>
<simpara>/path/to/eng.user-words:</simpara> <simpara>/path/to/eng.user-words:</simpara>
<blockquote> <blockquote>
<literallayout>the <literallayout>the
quick quick
brown brown
fox fox
jumped</literallayout> jumped</literallayout>
</blockquote> </blockquote>
<simpara>/path/to/eng.user-patterns:</simpara> <simpara>/path/to/eng.user-patterns:</simpara>
<blockquote> <blockquote>
<literallayout>1-\d\d\d-GOOG-411 <literallayout>1-\d\d\d-GOOG-411
www.\n\\\*.com</literallayout> www.\n\\\*.com</literallayout>
</blockquote> </blockquote>
<simpara>/path/to/configs/bazaar:</simpara> <simpara>/path/to/configs/bazaar:</simpara>
<blockquote> <blockquote>
<literallayout>load_system_dawg F <literallayout>load_system_dawg F
load_freq_dawg F load_freq_dawg F
user_words_suffix user-words user_words_suffix user-words
user_patterns_suffix user-patterns</literallayout> user_patterns_suffix user-patterns</literallayout>
</blockquote> </blockquote>
<simpara>Now, if you pass the word <emphasis>bazaar</emphasis> as a trailing command line parameter <simpara>Now, if you pass the word <emphasis>bazaar</emphasis> as a trailing command line parameter
to Tesseract, Tesseract will not bother loading the system dictionary nor to Tesseract, Tesseract will not bother loading the system dictionary nor
the dictionary of frequent words and will load and use the eng.user-words the dictionary of frequent words and will load and use the eng.user-words
and eng.user-patterns files you provided. The former is a simple word list, and eng.user-patterns files you provided. The former is a simple word list,
one per line. The format of the latter is documented in dict/trie.h one per line. The format of the latter is documented in dict/trie.h
on read_pattern_list().</simpara> on read_pattern_list().</simpara>
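To make the config file layout concrete, here is a minimal sketch of reading such a file into variable-value pairs. It assumes one pair per line with whitespace between the variable name and its value, as in the bazaar example above; parse_config is a hypothetical helper and not part of Tesseract.

```python
def parse_config(text):
    """Parse Tesseract-style config text into a dict of variable -> value.

    Assumes one pair per line, with the variable name separated from its
    value by whitespace; blank lines are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only.
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        settings[name] = value
    return settings


# Contents of the example config file /path/to/configs/bazaar.
bazaar = """load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns"""

config = parse_config(bazaar)
```

Here config maps load_system_dawg and load_freq_dawg to F, and the two suffix variables to the file name suffixes used for the user word list and pattern list.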
</refsect1> </refsect1>
<refsect1 id="_history"> <refsect1 id="_history">
<title>HISTORY</title> <title>HISTORY</title>
<simpara>The engine was developed at Hewlett Packard Laboratories Bristol and at <simpara>The engine was developed at Hewlett Packard Laboratories Bristol and at
Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more
changes made in 1996 to port to Windows, and some C++izing in 1998. A changes made in 1996 to port to Windows, and some C++izing in 1998. A
lot of the code was written in C, and then some more was written in C++. lot of the code was written in C, and then some more was written in C++.
The C++ code makes heavy use of a list system using macros. This predates The C++ code makes heavy use of a list system using macros. This predates
stl, was portable before stl, and is more efficient than stl lists, but has stl, was portable before stl, and is more efficient than stl lists, but has
the big negative that if you do get a segmentation violation, it is hard to the big negative that if you do get a segmentation violation, it is hard to
debug.</simpara> debug.</simpara>
<simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability <simpara>Version 2.00 brought Unicode (UTF-8) support, six languages, and the ability
to train Tesseract.</simpara> to train Tesseract.</simpara>
<simpara>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy. <simpara>Tesseract was included in UNLV&#8217;s Fourth Annual Test of OCR Accuracy.
See <ulink url="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</ulink>. With Tesseract 2.00, See <ulink url="https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf">https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf</ulink>. With Tesseract 2.00,
scripts are now included to allow anyone to reproduce some of these tests. scripts are now included to allow anyone to reproduce some of these tests.
See <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</ulink> for more See <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract</ulink> for more
details.</simpara> details.</simpara>
<simpara>Tesseract 3.00 adds a number of new languages, including Chinese, Japanese, <simpara>Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing and Korean. It also introduces a new, single-file based system of managing
language data.</simpara> language data.</simpara>
<simpara>Tesseract 3.02 adds BiDirectional text support, the ability to recognize <simpara>Tesseract 3.02 adds BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis.</simpara> multiple languages in a single image, and improved layout analysis.</simpara>
<simpara>For further details, see the file ReleaseNotes included with the distribution.</simpara> <simpara>For further details, see the file ReleaseNotes included with the distribution.</simpara>
</refsect1> </refsect1>
<refsect1 id="_resources"> <refsect1 id="_resources">
<title>RESOURCES</title> <title>RESOURCES</title>
<simpara>Main web site: <ulink url="https://github.com/tesseract-ocr">https://github.com/tesseract-ocr</ulink><?asciidoc-br?> <simpara>Main web site: <ulink url="https://github.com/tesseract-ocr">https://github.com/tesseract-ocr</ulink><?asciidoc-br?>
Information on training: <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> Information on training: <ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1), <simpara>ambiguous_words(1), cntraining(1), combine_tessdata(1), dawg2wordlist(1),
shape_training(1), mftraining(1), unicharambigs(5), unicharset(5), shape_training(1), mftraining(1), unicharambigs(5), unicharset(5),
unicharset_extractor(1), wordlist2dawg(1)</simpara> unicharset_extractor(1), wordlist2dawg(1)</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>Tesseract development was led at Hewlett-Packard and Google by Ray Smith. <simpara>Tesseract development was led at Hewlett-Packard and Google by Ray Smith.
The development team has included:</simpara> The development team has included:</simpara>
<simpara>Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger, <simpara>Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David Eger,
Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke, Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern Wanke,
Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle, Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil Cheatle,
Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
Lloyd, Shobhit Saxena, and Thomas Kielbus.</simpara> Lloyd, Shobhit Saxena, and Thomas Kielbus.</simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Licensed under the Apache License, Version 2.0</simpara> <simpara>Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
</refentry> </refentry>

@ -38,7 +38,7 @@ EXAMPLE
3 i i i 1 m 0 3 i i i 1 m 0
............................... ...............................
In this example, all instances of the '2' character sequence '''' will In this example, all instances of the '2' character sequence '''' will
*always* be replaced by the '1' character sequence '"'; a '1' character *always* be replaced by the '1' character sequence '"'; a '1' character
sequence 'm' *may* be replaced by the '2' character sequence 'rn', and sequence 'm' *may* be replaced by the '2' character sequence 'rn', and
the '3' character sequence *may* be replaced by the '1' character the '3' character sequence *may* be replaced by the '1' character

File diff suppressed because it is too large

@ -1,126 +1,126 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>UNICHARAMBIGS(5)</title> <title>UNICHARAMBIGS(5)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>unicharambigs</refentrytitle> <refentrytitle>unicharambigs</refentrytitle>
<manvolnum>5</manvolnum> <manvolnum>5</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>unicharambigs</refname> <refname>unicharambigs</refname>
<refpurpose>Tesseract unicharset ambiguities</refpurpose> <refpurpose>Tesseract unicharset ambiguities</refpurpose>
</refnamediv> </refnamediv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>The unicharambigs file (a component of traineddata, see combine_tessdata(1)) <simpara>The unicharambigs file (a component of traineddata, see combine_tessdata(1))
is used by Tesseract to represent possible ambiguities between characters, is used by Tesseract to represent possible ambiguities between characters,
or groups of characters.</simpara> or groups of characters.</simpara>
<simpara>The file contains a number of lines, laid out as follows:</simpara> <simpara>The file contains a number of lines, laid out as follows:</simpara>
<literallayout class="monospaced">[num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num]</literallayout> <literallayout class="monospaced">[num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num] &lt;TAB&gt; [char(s)] &lt;TAB&gt; [num]</literallayout>
<informaltable tabstyle="horizontal" frame="none" colsep="0" rowsep="0"><tgroup cols="2"><colspec colwidth="15*"/><colspec colwidth="85*"/><tbody valign="top"> <informaltable tabstyle="horizontal" frame="none" colsep="0" rowsep="0"><tgroup cols="2"><colspec colwidth="15*"/><colspec colwidth="85*"/><tbody valign="top">
<row> <row>
<entry> <entry>
<simpara> <simpara>
Field one Field one
</simpara> </simpara>
</entry> </entry>
<entry> <entry>
<simpara> <simpara>
the number of characters contained in field two the number of characters contained in field two
</simpara> </simpara>
</entry> </entry>
</row> </row>
<row> <row>
<entry> <entry>
<simpara> <simpara>
Field two Field two
</simpara> </simpara>
</entry> </entry>
<entry> <entry>
<simpara> <simpara>
the character sequence to be replaced the character sequence to be replaced
</simpara> </simpara>
</entry> </entry>
</row> </row>
<row> <row>
<entry> <entry>
<simpara> <simpara>
Field three Field three
</simpara> </simpara>
</entry> </entry>
<entry> <entry>
<simpara> <simpara>
the number of characters contained in field four the number of characters contained in field four
</simpara> </simpara>
</entry> </entry>
</row> </row>
<row> <row>
<entry> <entry>
<simpara> <simpara>
Field four Field four
</simpara> </simpara>
</entry> </entry>
<entry> <entry>
<simpara> <simpara>
the character sequence used to replace field two the character sequence used to replace field two
</simpara> </simpara>
</entry> </entry>
</row> </row>
<row> <row>
<entry> <entry>
<simpara> <simpara>
Field five Field five
</simpara> </simpara>
</entry> </entry>
<entry> <entry>
<simpara> <simpara>
contains either 1 or 0. 1 denotes a mandatory contains either 1 or 0. 1 denotes a mandatory
replacement, 0 denotes an optional replacement. replacement, 0 denotes an optional replacement.
</simpara> </simpara>
</entry> </entry>
</row> </row>
</tbody></tgroup></informaltable> </tbody></tgroup></informaltable>
<simpara>Characters appearing in fields two and four should appear in <simpara>Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the unicharset. The numbers in fields one and three refer to the
number of unichars (not bytes).</simpara> number of unichars (not bytes).</simpara>
</refsect1> </refsect1>
<refsect1 id="_example"> <refsect1 id="_example">
<title>EXAMPLE</title> <title>EXAMPLE</title>
<literallayout class="monospaced">2 ' ' 1 " 1 <literallayout class="monospaced">2 ' ' 1 " 1
1 m 2 r n 0 1 m 2 r n 0
3 i i i 1 m 0</literallayout> 3 i i i 1 m 0</literallayout>
<simpara>In this example, all instances of the <emphasis>2</emphasis> character sequence <emphasis>''</emphasis> will <simpara>In this example, all instances of the <emphasis>2</emphasis> character sequence <emphasis>''</emphasis> will
<emphasis role="strong">always</emphasis> be replaced by the <emphasis>1</emphasis> character sequence <emphasis>"</emphasis>; a <emphasis>1</emphasis> character <emphasis role="strong">always</emphasis> be replaced by the <emphasis>1</emphasis> character sequence <emphasis>"</emphasis>; a <emphasis>1</emphasis> character
sequence <emphasis>m</emphasis> <emphasis role="strong">may</emphasis> be replaced by the <emphasis>2</emphasis> character sequence <emphasis>rn</emphasis>, and sequence <emphasis>m</emphasis> <emphasis role="strong">may</emphasis> be replaced by the <emphasis>2</emphasis> character sequence <emphasis>rn</emphasis>, and
the <emphasis>3</emphasis> character sequence <emphasis>iii</emphasis> <emphasis role="strong">may</emphasis> be replaced by the <emphasis>1</emphasis> character the <emphasis>3</emphasis> character sequence <emphasis>iii</emphasis> <emphasis role="strong">may</emphasis> be replaced by the <emphasis>1</emphasis> character
sequence <emphasis>m</emphasis>.</simpara> sequence <emphasis>m</emphasis>.</simpara>
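The five-field layout can be sketched as a small parser. This is an illustration under two assumptions: fields are TAB-separated, as the layout line in DESCRIPTION states, and the unichars inside fields two and four are separated by single spaces, as the example suggests. The function name parse_ambig_line is hypothetical.

```python
def parse_ambig_line(line):
    """Split one unicharambigs line into its five fields.

    Assumes fields are TAB-separated and that the unichars inside
    fields two and four are separated by single spaces.
    Returns (source unichars, replacement unichars, is_mandatory).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    src_count, src, dst_count, dst, mandatory = fields
    src_chars = src.split(" ")
    dst_chars = dst.split(" ")
    # Fields one and three must match the number of unichars given.
    if len(src_chars) != int(src_count) or len(dst_chars) != int(dst_count):
        raise ValueError("character count does not match sequence length")
    return src_chars, dst_chars, mandatory == "1"


# The three example lines, written with explicit tabs.
examples = ["2\t' '\t1\t\"\t1", "1\tm\t2\tr n\t0", "3\ti i i\t1\tm\t0"]
rules = [parse_ambig_line(line) for line in examples]
```

Each entry in rules pairs the sequence to be replaced with its replacement and a flag that is True for a mandatory replacement (field five equal to 1) and False for an optional one.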
</refsect1> </refsect1>
<refsect1 id="_history"> <refsect1 id="_history">
<title>HISTORY</title> <title>HISTORY</title>
<simpara>The unicharambigs file first appeared in Tesseract 3.00; prior to that, a <simpara>The unicharambigs file first appeared in Tesseract 3.00; prior to that, a
similar format, called DangAmbigs (<emphasis>dangerous ambiguities</emphasis>) was used: the similar format, called DangAmbigs (<emphasis>dangerous ambiguities</emphasis>) was used: the
format was almost identical, except only mandatory replacements could be format was almost identical, except only mandatory replacements could be
specified, and field 5 was absent.</simpara> specified, and field 5 was absent.</simpara>
</refsect1> </refsect1>
<refsect1 id="_bugs"> <refsect1 id="_bugs">
<title>BUGS</title> <title>BUGS</title>
<simpara>This is a documentation "bug": it&#8217;s not currently clear what should be done <simpara>This is a documentation "bug": it&#8217;s not currently clear what should be done
in the case of ligatures (such as <emphasis>fi</emphasis>) which may also appear as regular in the case of ligatures (such as <emphasis>fi</emphasis>) which may also appear as regular
letters in the unicharset.</simpara> letters in the unicharset.</simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara> <simpara>tesseract(1), unicharset(5)</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>

File diff suppressed because it is too large

@ -1,219 +1,219 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>UNICHARSET(5)</title> <title>UNICHARSET(5)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>unicharset</refentrytitle> <refentrytitle>unicharset</refentrytitle>
<manvolnum>5</manvolnum> <manvolnum>5</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>unicharset</refname> <refname>unicharset</refname>
<refpurpose>character properties file used by tesseract(1)</refpurpose> <refpurpose>character properties file used by tesseract(1)</refpurpose>
</refnamediv> </refnamediv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>Tesseract&#8217;s unicharset file contains information on each symbol <simpara>Tesseract&#8217;s unicharset file contains information on each symbol
(unichar) the Tesseract OCR engine is trained to recognize.</simpara> (unichar) the Tesseract OCR engine is trained to recognize.</simpara>
<simpara>A unicharset file (e.g. <emphasis>eng.unicharset</emphasis>) is distributed as part of a <simpara>A unicharset file (e.g. <emphasis>eng.unicharset</emphasis>) is distributed as part of a
Tesseract language pack (e.g. <emphasis>eng.traineddata</emphasis>). For information on Tesseract language pack (e.g. <emphasis>eng.traineddata</emphasis>). For information on
extracting the unicharset file, see combine_tessdata(1).</simpara> extracting the unicharset file, see combine_tessdata(1).</simpara>
<simpara>The first line of a unicharset file contains the number of unichars in <simpara>The first line of a unicharset file contains the number of unichars in
the file. After this line, each subsequent line provides information for the file. After this line, each subsequent line provides information for
a single unichar. The first such line contains a placeholder reserved for a single unichar. The first such line contains a placeholder reserved for
the space character. Each unichar is referred to within Tesseract by its the space character. Each unichar is referred to within Tesseract by its
Unichar ID, which is the line number (minus 1) within the unicharset file. Unichar ID, which is the line number (minus 1) within the unicharset file.
Therefore, space gets unichar 0.</simpara> Therefore, space gets unichar 0.</simpara>
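The numbering rule above can be sketched as follows. This is a rough illustration, assuming the first line holds only the unichar count and that IDs are assigned in file order starting at 0 for the line right after it. The function name unichar_ids and the sample contents are invented for the example and are not taken from a real language pack.

```python
def unichar_ids(text):
    """Map each unichar ID to its line from unicharset file contents.

    Assumes the first line holds the unichar count and that IDs are
    assigned in order, starting at 0 for the line right after it.
    """
    lines = text.splitlines()
    count = int(lines[0].strip())
    entries = lines[1:1 + count]
    return {uid: entry for uid, entry in enumerate(entries)}


# Invented sample: a count line followed by three placeholder entries.
sample = "3\nNULL 0 NULL 0\na 3 Latin 1\nb 3 Latin 2\n"
ids = unichar_ids(sample)
```

With this sample, ID 0 refers to the placeholder line reserved for the space character and the remaining entries follow in file order.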
<simpara>Each unichar line in the unicharset file (v2+) may have four space-separated fields:</simpara> <simpara>Each unichar line in the unicharset file (v2+) may have four space-separated fields:</simpara>
<literallayout class="monospaced">'character' 'properties' 'script' 'id'</literallayout> <literallayout class="monospaced">'character' 'properties' 'script' 'id'</literallayout>
<simpara>Starting with Tesseract v3.02, more information may be given for each unichar:</simpara> <simpara>Starting with Tesseract v3.02, more information may be given for each unichar:</simpara>
<literallayout class="monospaced">'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'</literallayout> <literallayout class="monospaced">'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'</literallayout>
<simpara>Entries:</simpara> <simpara>Entries:</simpara>
<variablelist> <variablelist>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>character</emphasis> <emphasis>character</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The UTF-8 encoded string to be produced for this unichar. The UTF-8 encoded string to be produced for this unichar.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>properties</emphasis> <emphasis>properties</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
An integer mask of character properties, one per bit. An integer mask of character properties, one per bit.
From least to most significant bit, these are: isalpha, islower, isupper, From least to most significant bit, these are: isalpha, islower, isupper,
isdigit, ispunctuation. isdigit, ispunctuation.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>glyph_metrics</emphasis> <emphasis>glyph_metrics</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Ten comma-separated integers representing various standards Ten comma-separated integers representing various standards
for where this glyph is to be found within a baseline-normalized coordinate for where this glyph is to be found within a baseline-normalized coordinate
system where 128 is normalized to x-height. system where 128 is normalized to x-height.
</simpara> </simpara>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<simpara> <simpara>
min_bottom, max_bottom: the ranges where the bottom of the character can min_bottom, max_bottom: the ranges where the bottom of the character can
be found. be found.
</simpara> </simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara> <simpara>
min_top, max_top: the ranges where the top of the character may be found. min_top, max_top: the ranges where the top of the character may be found.
</simpara> </simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara> <simpara>
min_width, max_width: horizontal width of the character. min_width, max_width: horizontal width of the character.
</simpara> </simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara> <simpara>
min_bearing, max_bearing: how far from the usual start position the min_bearing, max_bearing: how far from the usual start position the
leftmost part of the character begins. leftmost part of the character begins.
</simpara> </simpara>
</listitem> </listitem>
<listitem> <listitem>
<simpara> <simpara>
min_advance, max_advance: how far from the left of the printer&#8217;s cell we min_advance, max_advance: how far from the left of the printer&#8217;s cell we
advance to begin the next character. advance to begin the next character.
</simpara> </simpara>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>script</emphasis> <emphasis>script</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
Name of the script (Latin, Common, Greek, Cyrillic, Han, null). Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>other_case</emphasis> <emphasis>other_case</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The Unichar ID of the other case version of this character The Unichar ID of the other case version of this character
(upper or lower). (upper or lower).
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>direction</emphasis> <emphasis>direction</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The Unicode BiDi direction of this character, as defined by The Unicode BiDi direction of this character, as defined by
ICU&#8217;s enum UCharDirection. (0 = Left to Right, 1 = Right to Left, ICU&#8217;s enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
2 = European Number&#8230;) 2 = European Number&#8230;)
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>mirror</emphasis> <emphasis>mirror</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The Unichar ID of the BiDirectional mirror of this character. The Unichar ID of the BiDirectional mirror of this character.
For example, the mirror of an open parenthesis is a close parenthesis, but Latin Capital C For example, the mirror of an open parenthesis is a close parenthesis, but Latin Capital C
has no mirror, so it remains a Latin Capital C. has no mirror, so it remains a Latin Capital C.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
<varlistentry> <varlistentry>
<term> <term>
<emphasis>normed_form</emphasis> <emphasis>normed_form</emphasis>
</term> </term>
<listitem> <listitem>
<simpara> <simpara>
The UTF-8 representation of a "normalized form" of this unichar The UTF-8 representation of a "normalized form" of this unichar
for the purpose of blaming a module for errors given ground truth text. for the purpose of blaming a module for errors given ground truth text.
For instance, a left or right single quote may normalize to an ASCII quote. For instance, a left or right single quote may normalize to an ASCII quote.
</simpara> </simpara>
</listitem> </listitem>
</varlistentry> </varlistentry>
</variablelist> </variablelist>
</refsect1> </refsect1>
<refsect1 id="_example_v2"> <refsect1 id="_example_v2">
<title>EXAMPLE (v2)</title> <title>EXAMPLE (v2)</title>
<literallayout class="monospaced">; 10 Common 46 <literallayout class="monospaced">; 10 Common 46
b 3 Latin 59 b 3 Latin 59
W 5 Latin 40 W 5 Latin 40
7 8 Common 66 7 8 Common 66
= 0 Common 93</literallayout> = 0 Common 93</literallayout>
<simpara>";" is a punctuation character. Its properties are thus represented by the <simpara>";" is a punctuation character. Its properties are thus represented by the
binary number 10000 (10 in hexadecimal).</simpara> binary number 10000 (10 in hexadecimal).</simpara>
<simpara>"b" is an alphabetic character and a lower case character. Its properties are <simpara>"b" is an alphabetic character and a lower case character. Its properties are
thus represented by the binary number 00011 (3 in hexadecimal).</simpara> thus represented by the binary number 00011 (3 in hexadecimal).</simpara>
<simpara>"W" is an alphabetic character and an upper case character. Its properties are <simpara>"W" is an alphabetic character and an upper case character. Its properties are
thus represented by the binary number 00101 (5 in hexadecimal).</simpara> thus represented by the binary number 00101 (5 in hexadecimal).</simpara>
<simpara>"7" is just a digit. Its properties are thus represented by the binary number <simpara>"7" is just a digit. Its properties are thus represented by the binary number
01000 (8 in hexadecimal).</simpara> 01000 (8 in hexadecimal).</simpara>
<simpara>"=" is neither punctuation, a digit, nor an alphabetic character. Its properties <simpara>"=" is neither punctuation, a digit, nor an alphabetic character. Its properties
are thus represented by the binary number 00000 (0 in hexadecimal).</simpara> are thus represented by the binary number 00000 (0 in hexadecimal).</simpara>
<simpara>Japanese or Chinese alphabetic character properties are represented by the <simpara>Japanese or Chinese alphabetic character properties are represented by the
binary number 00001 (1 in hexadecimal): they are alphabetic, but neither binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
upper nor lower case.</simpara> upper nor lower case.</simpara>
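<simpara>The masks in this example can be decoded mechanically. The Python sketch below is illustrative only and is not part of Tesseract: the name decode_properties is hypothetical, and it assumes the mask is written in hexadecimal with the bits ordered as listed under "properties" above.</simpara>

```python
# Property names from least to most significant bit, as listed in the
# "properties" entry of this page.
PROPERTY_NAMES = ["isalpha", "islower", "isupper", "isdigit", "ispunctuation"]


def decode_properties(mask_text):
    """Return the property names set in a hexadecimal mask string.

    Hypothetical helper for illustration; not a Tesseract API.
    """
    mask = int(mask_text, 16)
    # Test each bit in turn by shifting it down to the lowest position.
    return [
        name
        for bit, name in enumerate(PROPERTY_NAMES)
        if (mask >> bit) % 2 == 1
    ]


if __name__ == "__main__":
    # The five entries from the v2 example above.
    for char, mask_text in [(";", "10"), ("b", "3"), ("W", "5"), ("7", "8"), ("=", "0")]:
        print(char, mask_text, decode_properties(mask_text))
```

<simpara>Reading the mask as hexadecimal matters only once the punctuation bit is set: "10" means 16, not ten.</simpara>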
</refsect1> </refsect1>
<refsect1 id="_example_v3_02"> <refsect1 id="_example_v3_02">
<title>EXAMPLE (v3.02)</title> <title>EXAMPLE (v3.02)</title>
<literallayout class="monospaced">110 <literallayout class="monospaced">110
NULL 0 NULL 0 NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
. . .</literallayout> . . .</literallayout>
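<simpara>As a reading aid, the following Python sketch splits one full eight-field line from the example into named fields. It is not Tesseract code: parse_unichar_line and METRIC_NAMES are hypothetical names, the order of the ten glyph metrics is assumed to follow the order in which they are listed on this page, and shorter lines such as the NULL entry are not handled.</simpara>

```python
# Assumed order of the ten glyph metrics (taken from the list on this page).
METRIC_NAMES = [
    "min_bottom", "max_bottom",
    "min_top", "max_top",
    "min_width", "max_width",
    "min_bearing", "max_bearing",
    "min_advance", "max_advance",
]


def parse_unichar_line(line):
    """Parse one eight-field v3.02 unichar line into a dictionary.

    Hypothetical helper for illustration; not a Tesseract API.
    """
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    character, properties, metrics, script, other_case, direction, mirror, normed = fields
    values = [int(value) for value in metrics.split(",")]
    if len(values) != len(METRIC_NAMES):
        raise ValueError("expected 10 glyph metrics, got %d" % len(values))
    return {
        "character": character,
        "properties": int(properties, 16),  # mask is written in hexadecimal
        "glyph_metrics": dict(zip(METRIC_NAMES, values)),
        "script": script,
        "other_case": int(other_case),
        "direction": int(direction),
        "mirror": int(mirror),
        "normed_form": normed,
    }


if __name__ == "__main__":
    entry = parse_unichar_line("a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a")
    print(entry["character"], entry["script"], entry["glyph_metrics"]["min_top"])
```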
</refsect1> </refsect1>
<refsect1 id="_caveats"> <refsect1 id="_caveats">
<title>CAVEATS</title> <title>CAVEATS</title>
<simpara>Although the unicharset reader maintains the ability to read unicharsets <simpara>Although the unicharset reader maintains the ability to read unicharsets
of older formats and will assign default values to missing fields, of older formats and will assign default values to missing fields,
the accuracy will be degraded.</simpara> the accuracy will be degraded.</simpara>
<simpara>Further, most other data files are indexed by the unicharset file, <simpara>Further, most other data files are indexed by the unicharset file,
so changing it without re-generating the others is likely to have dire so changing it without re-generating the others is likely to have dire
consequences.</simpara> consequences.</simpara>
</refsect1> </refsect1>
<refsect1 id="_history"> <refsect1 id="_history">
<title>HISTORY</title> <title>HISTORY</title>
<simpara>The unicharset format first appeared with Tesseract 2.00, which was the <simpara>The unicharset format first appeared with Tesseract 2.00, which was the
first version to support languages other than English. The unicharset file first version to support languages other than English. The unicharset file
contained only the first two fields, and the "ispunctuation" property was contained only the first two fields, and the "ispunctuation" property was
absent (punctuation was regarded as "0", as "=" is in the above example).</simpara> absent (punctuation was regarded as "0", as "=" is in the above example).</simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), combine_tessdata(1), unicharset_extractor(1)</simpara> <simpara>tesseract(1), combine_tessdata(1), unicharset_extractor(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>


@ -11,9 +11,9 @@ SYNOPSIS
DESCRIPTION DESCRIPTION
----------- -----------
Tesseract needs to know the set of possible characters it can output. Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, use the unicharset_extractor To generate the unicharset data file, use the unicharset_extractor
program on the same training pages bounding box files as used for program on the same training pages bounding box files as used for
clustering: clustering:
unicharset_extractor fontfile_1.box fontfile_2.box ... unicharset_extractor fontfile_1.box fontfile_2.box ...
@ -21,19 +21,19 @@ clustering:
The unicharset will be put into the file 'dir/unicharset', or simply The unicharset will be put into the file 'dir/unicharset', or simply
'./unicharset' if no output directory is provided. './unicharset' if no output directory is provided.
Tesseract also needs to have access to character properties isalpha, Tesseract also needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. All of this auxiliary data isdigit, isupper, islower, ispunctuation. All of this auxiliary data
and more is encoded in this file. (See unicharset(5)) and more is encoded in this file. (See unicharset(5))
If your system supports the wctype functions, these values will be set If your system supports the wctype functions, these values will be set
automatically by unicharset_extractor and there is no need to edit the automatically by unicharset_extractor and there is no need to edit the
unicharset file. On some older systems (e.g. Windows 95), the unicharset unicharset file. On some older systems (e.g. Windows 95), the unicharset
file must be edited by hand to add these property description codes. file must be edited by hand to add these property description codes.
*NOTE* The unicharset file must be regenerated whenever inttemp, normproto *NOTE* The unicharset file must be regenerated whenever inttemp, normproto
and pffmtable are generated (i.e. they must all be recreated when the box and pffmtable are generated (i.e. they must all be recreated when the box
file is changed) as they have to be in sync. This is made easier than in file is changed) as they have to be in sync. This is made easier than in
previous versions by running unicharset_extractor before mftraining and previous versions by running unicharset_extractor before mftraining and
cntraining, and giving the unicharset to mftraining. cntraining, and giving the unicharset to mftraining.
SEE ALSO SEE ALSO


@ -1,63 +1,63 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>UNICHARSET_EXTRACTOR(1)</title> <title>UNICHARSET_EXTRACTOR(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>unicharset_extractor</refentrytitle> <refentrytitle>unicharset_extractor</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>unicharset_extractor</refname> <refname>unicharset_extractor</refname>
<refpurpose>extract unicharset from Tesseract boxfiles</refpurpose> <refpurpose>extract unicharset from Tesseract boxfiles</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">unicharset_extractor</emphasis> <emphasis>[-D dir]</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara> <simpara><emphasis role="strong">unicharset_extractor</emphasis> <emphasis>[-D dir]</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>Tesseract needs to know the set of possible characters it can output. <simpara>Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, use the unicharset_extractor To generate the unicharset data file, use the unicharset_extractor
program on the same training pages bounding box files as used for program on the same training pages bounding box files as used for
clustering:</simpara> clustering:</simpara>
<literallayout class="monospaced">unicharset_extractor fontfile_1.box fontfile_2.box ...</literallayout> <literallayout class="monospaced">unicharset_extractor fontfile_1.box fontfile_2.box ...</literallayout>
<simpara>The unicharset will be put into the file <emphasis>dir/unicharset</emphasis>, or simply <simpara>The unicharset will be put into the file <emphasis>dir/unicharset</emphasis>, or simply
<emphasis>./unicharset</emphasis> if no output directory is provided.</simpara> <emphasis>./unicharset</emphasis> if no output directory is provided.</simpara>
<simpara>Tesseract also needs to have access to character properties isalpha, <simpara>Tesseract also needs to have access to character properties isalpha,
isdigit, isupper, islower, ispunctuation. All of this auxiliary data isdigit, isupper, islower, ispunctuation. All of this auxiliary data
and more is encoded in this file. (See unicharset(5))</simpara> and more is encoded in this file. (See unicharset(5))</simpara>
<simpara>If your system supports the wctype functions, these values will be set <simpara>If your system supports the wctype functions, these values will be set
automatically by unicharset_extractor and there is no need to edit the automatically by unicharset_extractor and there is no need to edit the
unicharset file. On some older systems (e.g. Windows 95), the unicharset unicharset file. On some older systems (e.g. Windows 95), the unicharset
file must be edited by hand to add these property description codes.</simpara> file must be edited by hand to add these property description codes.</simpara>
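<simpara>When the codes have to be filled in by hand, a small script can suggest values. The Python sketch below is only an analogue of the wctype-based classification and is not what unicharset_extractor runs: properties_mask is a hypothetical name, the bit order follows unicharset(5), and treating Unicode general categories beginning with "P" as punctuation is an assumption.</simpara>

```python
import unicodedata


def properties_mask(char):
    """Return a suggested hexadecimal property mask for one character.

    Hypothetical helper for illustration; not a Tesseract API.
    """
    mask = 0
    if char.isalpha():
        mask += 1   # isalpha
    if char.islower():
        mask += 2   # islower
    if char.isupper():
        mask += 4   # isupper
    if char.isdigit():
        mask += 8   # isdigit
    # Assumption: punctuation means a Unicode general category starting with "P".
    if unicodedata.category(char).startswith("P"):
        mask += 16  # ispunctuation
    return format(mask, "x")


if __name__ == "__main__":
    for char in [";", "b", "W", "7", "="]:
        print(char, properties_mask(char))
```

<simpara>Any value suggested this way should be checked against the trained language before it is written into the unicharset file.</simpara>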
<simpara><emphasis role="strong">NOTE</emphasis> The unicharset file must be regenerated whenever inttemp, normproto <simpara><emphasis role="strong">NOTE</emphasis> The unicharset file must be regenerated whenever inttemp, normproto
and pffmtable are generated (i.e. they must all be recreated when the box and pffmtable are generated (i.e. they must all be recreated when the box
file is changed) as they have to be in sync. This is made easier than in file is changed) as they have to be in sync. This is made easier than in
previous versions by running unicharset_extractor before mftraining and previous versions by running unicharset_extractor before mftraining and
cntraining, and giving the unicharset to mftraining.</simpara> cntraining, and giving the unicharset to mftraining.</simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara> <simpara>tesseract(1), unicharset(5)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_history"> <refsect1 id="_history">
<title>HISTORY</title> <title>HISTORY</title>
<simpara>unicharset_extractor first appeared in Tesseract 2.00.</simpara> <simpara>unicharset_extractor first appeared in Tesseract 2.00.</simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (C) 2006, Google Inc. <simpara>Copyright (C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>


@ -1,69 +1,69 @@
<?xml version="1.0" encoding="UTF-8"?> <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"> <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?> <?asciidoc-toc?>
<?asciidoc-numbered?> <?asciidoc-numbered?>
<refentry lang="en"> <refentry lang="en">
<refentryinfo> <refentryinfo>
<title>WORDLIST2DAWG(1)</title> <title>WORDLIST2DAWG(1)</title>
</refentryinfo> </refentryinfo>
<refmeta> <refmeta>
<refentrytitle>wordlist2dawg</refentrytitle> <refentrytitle>wordlist2dawg</refentrytitle>
<manvolnum>1</manvolnum> <manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo> <refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo> <refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta> </refmeta>
<refnamediv> <refnamediv>
<refname>wordlist2dawg</refname> <refname>wordlist2dawg</refname>
<refpurpose>convert a wordlist to a DAWG for Tesseract</refpurpose> <refpurpose>convert a wordlist to a DAWG for Tesseract</refpurpose>
</refnamediv> </refnamediv>
<refsynopsisdiv id="_synopsis"> <refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">wordlist2dawg</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara> <simpara><emphasis role="strong">wordlist2dawg</emphasis> <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -t <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara> <simpara><emphasis role="strong">wordlist2dawg</emphasis> -t <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 1 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara> <simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 1 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 2 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara> <simpara><emphasis role="strong">wordlist2dawg</emphasis> -r 2 <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
<simpara><emphasis role="strong">wordlist2dawg</emphasis> -l &lt;short&gt; &lt;long&gt; <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara> <simpara><emphasis role="strong">wordlist2dawg</emphasis> -l &lt;short&gt; &lt;long&gt; <emphasis>WORDLIST</emphasis> <emphasis>DAWG</emphasis> <emphasis>lang.unicharset</emphasis></simpara>
</refsynopsisdiv> </refsynopsisdiv>
<refsect1 id="_description"> <refsect1 id="_description">
<title>DESCRIPTION</title> <title>DESCRIPTION</title>
<simpara>wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph <simpara>wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract. A DAWG is a compressed, space- and (DAWG) for use with Tesseract. A DAWG is a compressed, space- and
time-efficient representation of a word list.</simpara> time-efficient representation of a word list.</simpara>
</refsect1> </refsect1>
<refsect1 id="_options"> <refsect1 id="_options">
<title>OPTIONS</title> <title>OPTIONS</title>
<simpara>-t <simpara>-t
Verify that a given dawg file is equivalent to a given wordlist.</simpara> Verify that a given dawg file is equivalent to a given wordlist.</simpara>
<simpara>-r 1 <simpara>-r 1
Reverse a word if it contains an RTL character.</simpara> Reverse a word if it contains an RTL character.</simpara>
<simpara>-r 2 <simpara>-r 2
Reverse all words.</simpara> Reverse all words.</simpara>
<simpara>-l &lt;short&gt; &lt;long&gt; <simpara>-l &lt;short&gt; &lt;long&gt;
Produce a file with several dawgs in it, one each for words Produce a file with several dawgs in it, one each for words
of length &lt;short&gt;, &lt;short+1&gt;,&#8230; &lt;long&gt;</simpara> of length &lt;short&gt;, &lt;short+1&gt;,&#8230; &lt;long&gt;</simpara>
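<simpara>The effect of the -r and -l options on a word list can be pictured with the Python sketch below. It models only the behaviour described above and is not the wordlist2dawg implementation: all function names are hypothetical, an "RTL character" is assumed to be one whose Unicode bidirectional class is R or AL, and word length is assumed to be counted in code points.</simpara>

```python
import unicodedata


def has_rtl_character(word):
    """True if any character has bidirectional class R or AL (assumption)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in word)


def apply_reverse_policy(word, policy):
    """Model of -r: 1 reverses words containing an RTL character, 2 reverses all.

    Any other policy value leaves the word unchanged (assumption).
    """
    if policy == 2 or (policy == 1 and has_rtl_character(word)):
        return word[::-1]
    return word


def group_by_length(words, short, long):
    """Model of -l: one group of words per length from short to long inclusive."""
    groups = {length: [] for length in range(short, long + 1)}
    for word in words:
        if len(word) in groups:
            groups[len(word)].append(word)
    return groups


if __name__ == "__main__":
    print(apply_reverse_policy("abc", 2))
    print(group_by_length(["a", "to", "cat", "tree"], 2, 3))
```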
</refsect1> </refsect1>
<refsect1 id="_arguments"> <refsect1 id="_arguments">
<title>ARGUMENTS</title> <title>ARGUMENTS</title>
<simpara><emphasis>WORDLIST</emphasis> <simpara><emphasis>WORDLIST</emphasis>
A plain text file in UTF-8, one word per line.</simpara> A plain text file in UTF-8, one word per line.</simpara>
<simpara><emphasis>DAWG</emphasis> <simpara><emphasis>DAWG</emphasis>
The output DAWG to write.</simpara> The output DAWG to write.</simpara>
<simpara><emphasis>lang.unicharset</emphasis> <simpara><emphasis>lang.unicharset</emphasis>
The unicharset of the language. This is the unicharset The unicharset of the language. This is the unicharset
generated by mftraining(1).</simpara> generated by mftraining(1).</simpara>
</refsect1> </refsect1>
<refsect1 id="_see_also"> <refsect1 id="_see_also">
<title>SEE ALSO</title> <title>SEE ALSO</title>
<simpara>tesseract(1), combine_tessdata(1), dawg2wordlist(1)</simpara> <simpara>tesseract(1), combine_tessdata(1), dawg2wordlist(1)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara> <simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1> </refsect1>
<refsect1 id="_copying"> <refsect1 id="_copying">
<title>COPYING</title> <title>COPYING</title>
<simpara>Copyright (C) 2006 Google, Inc. <simpara>Copyright (C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0</simpara> Licensed under the Apache License, Version 2.0</simpara>
</refsect1> </refsect1>
<refsect1 id="_author"> <refsect1 id="_author">
<title>AUTHOR</title> <title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups <simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara> at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1> </refsect1>
</refentry> </refentry>