<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<?asciidoc-toc?>
<?asciidoc-numbered?>
<refentry lang="en">
<refmeta>
<refentrytitle>unicharset_extractor</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="source"> </refmiscinfo>
<refmiscinfo class="manual"> </refmiscinfo>
</refmeta>
<refnamediv>
<refname>unicharset_extractor</refname>
<refpurpose>extract unicharset from Tesseract box files</refpurpose>
</refnamediv>
<refsynopsisdiv id="_synopsis">
<simpara><emphasis role="strong">unicharset_extractor</emphasis> <emphasis>[-D dir]</emphasis> <emphasis>FILE</emphasis>…</simpara>
</refsynopsisdiv>
<refsect1 id="_description">
<title>DESCRIPTION</title>
<simpara>Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, run unicharset_extractor on the same
box files (the bounding-box files of the training pages) that are used for
clustering:</simpara>
<literallayout class="monospaced">unicharset_extractor fontfile_1.box fontfile_2.box ...</literallayout>
<simpara>The unicharset is written to the file <emphasis>dir/unicharset</emphasis>, or to
<emphasis>./unicharset</emphasis> if no output directory is given with -D.</simpara>
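<simpara>The extraction step can be pictured with a small Python sketch. It is an
illustration only, not the code of unicharset_extractor: it assumes that each
box-file line starts with the symbol, followed by four box coordinates and an
optional page number, and the function names are hypothetical.</simpara>
<programlisting language="python"><![CDATA[
```python
from pathlib import Path


def symbols_from_box_lines(lines):
    """Return the distinct symbols found in box-file lines, in first-seen order."""
    seen = {}
    for line in lines:
        fields = line.split()
        # Assumed layout: symbol, four box coordinates, optional page number.
        if len(fields) not in (5, 6):
            continue  # skip blank or malformed lines
        seen.setdefault(fields[0], None)
    return list(seen)


def symbols_from_box_files(paths):
    """Read several UTF-8 box files and merge their symbols (hypothetical helper)."""
    lines = []
    for path in paths:
        lines.extend(Path(path).read_text(encoding="utf-8").splitlines())
    return symbols_from_box_lines(lines)


sample = ["T 10 20 30 40 0", "h 32 20 45 40 0", "T 50 20 70 40 0", ""]
print(symbols_from_box_lines(sample))  # prints ['T', 'h']
```
]]></programlisting>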
<simpara>Tesseract also needs access to the character properties isalpha,
isdigit, isupper, islower and ispunctuation. All of this auxiliary data,
and more, is encoded in the unicharset file (see unicharset(5)).</simpara>
<simpara>If your system supports the wctype functions, these values are set
automatically by unicharset_extractor and there is no need to edit the
unicharset file. On some older systems (e.g. Windows 95), the unicharset
file must be edited by hand to add these property codes.</simpara>
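<simpara>As a rough aid for checking property codes by hand, the following Python
sketch derives a property mask for a symbol. It is not what
unicharset_extractor does internally: the bit values are an assumption taken
from the ordering described in unicharset(5) and should be checked against
that page, the function name is hypothetical, and Python's Unicode
classification can differ from the C wctype functions (for example, currency
signs are not counted as punctuation here).</simpara>
<programlisting language="python"><![CDATA[
```python
import unicodedata

# Assumed bit values, least significant bit first (see unicharset(5)).
IS_ALPHA, IS_LOWER, IS_UPPER, IS_DIGIT, IS_PUNCT = 1, 2, 4, 8, 16


def property_mask(symbol):
    """Return an integer property mask for the first code point of symbol."""
    if not symbol:
        return 0
    ch = symbol[0]  # multi-character symbols are classified by their first code point
    mask = 0
    if ch.isalpha():
        mask |= IS_ALPHA
    if ch.islower():
        mask |= IS_LOWER
    if ch.isupper():
        mask |= IS_UPPER
    if ch.isdigit():
        mask |= IS_DIGIT
    if unicodedata.category(ch).startswith("P"):
        mask |= IS_PUNCT
    return mask


for symbol in ["a", "A", "7", "."]:
    print(symbol, format(property_mask(symbol), "x"))
```
]]></programlisting>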
<simpara><emphasis role="strong">NOTE</emphasis> The unicharset file must be regenerated whenever inttemp, normproto
and pffmtable are generated (i.e. all of them must be recreated when a box
file changes), because they have to stay in sync. This is easier than in
previous versions: run unicharset_extractor before mftraining and
cntraining, and give the resulting unicharset to mftraining.</simpara>
</refsect1>
<refsect1 id="_see_also">
<title>SEE ALSO</title>
<simpara>tesseract(1), unicharset(5)</simpara>
<simpara><ulink url="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3">http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3</ulink></simpara>
</refsect1>
<refsect1 id="_history">
<title>HISTORY</title>
<simpara>unicharset_extractor first appeared in Tesseract 2.00.</simpara>
</refsect1>
<refsect1 id="_copying">
<title>COPYING</title>
<simpara>Copyright (C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0</simpara>
</refsect1>
<refsect1 id="_author">
<title>AUTHOR</title>
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
</refsect1>
</refentry>