tesseract/doc/mftraining.1.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refentryinfo>
    <title>MFTRAINING(1)</title>
</refentryinfo>
<refmeta>
<refentrytitle>mftraining</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source">&#160;</refmiscinfo>
<refmiscinfo class="manual">&#160;</refmiscinfo>
</refmeta>
<refnamediv>
    <refname>mftraining</refname>
    <refpurpose>feature training for Tesseract</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara>mftraining -U <emphasis>unicharset</emphasis> -O <emphasis>lang.unicharset</emphasis> <emphasis>FILE</emphasis>&#8230;</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>mftraining takes a list of .tr files, from which it generates the
files <emphasis role="strong">inttemp</emphasis> (the shape prototypes), <emphasis role="strong">shapetable</emphasis>, and <emphasis role="strong">pffmtable</emphasis>
(the number of expected features for each character).  (A fourth file
called Microfeat is also written by this program, but it is not used.)</simpara>
</refsect1>
<refsect1 id="_options">
<title>OPTIONS</title>
<variablelist>
<varlistentry>
<term>
-U <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
        (Input) The unicharset generated by unicharset_extractor(1).
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-F <emphasis>font_properties_file</emphasis>
</term>
<listitem>
<simpara>
        (Input) Font properties file. Each line has the following form, where every field other than the font name is 0 or 1:
</simpara>
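<literallayout class="monospaced"><emphasis>font_name</emphasis> <emphasis>italic</emphasis> <emphasis>bold</emphasis> <emphasis>fixed_pitch</emphasis> <emphasis>serif</emphasis> <emphasis>fraktur</emphasis></literallayout>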
</listitem>
</varlistentry>
<varlistentry>
<term>
-X <emphasis>xheights_file</emphasis>
</term>
<listitem>
<simpara>
        (Input) x-heights file. Each line has the following form, where xheight is the x-height, in pixels, of a character drawn at 32pt at 300 dpi. [ That is, if base x-height + ascenders + descenders = 133 pixels (32pt at 300 dpi is 32 / 72 * 300, or about 133), how much of that is the x-height? ]
</simpara>
<literallayout class="monospaced"><emphasis>font_name</emphasis> <emphasis>xheight</emphasis></literallayout>
</listitem>
</varlistentry>
<varlistentry>
<term>
-D <emphasis>dir</emphasis>
</term>
<listitem>
<simpara>
        Directory to write output files to.
</simpara>
</listitem>
</varlistentry>
<varlistentry>
<term>
-O <emphasis>FILE</emphasis>
</term>
<listitem>
<simpara>
        (Output) The output unicharset that will be given to combine_tessdata(1).
</simpara>
</listitem>
</varlistentry>
</variablelist>
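<simpara>As an illustration of the -F file format described above, a font properties file covering two fonts might look like the sketch below. The font names are hypothetical and are not taken from any shipped training data. The first line describes an italic serif font, the second a bold sans-serif font.</simpara>
<literallayout class="monospaced">timesitalic 1 0 0 1 0
arialbold 0 1 0 0 0</literallayout>
<simpara>Such a file would be passed to mftraining with -F, alongside the -U and -O options shown in the synopsis.</simpara>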
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1),
shapeclustering(1), unicharset(5)</simpara>
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) Hewlett-Packard Company, 1988
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>