mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2025-01-03 23:27:50 +08:00
220 lines
7.7 KiB
XML
220 lines
7.7 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
|
|
<?asciidoc-toc?>
|
|
<?asciidoc-numbered?>
|
|
<refentry lang="en">
|
|
<refentryinfo>
|
|
<title>UNICHARSET(5)</title>
|
|
</refentryinfo>
|
|
<refmeta>
|
|
<refentrytitle>unicharset</refentrytitle>
|
|
<manvolnum>5</manvolnum>
|
|
<refmiscinfo class="source"> </refmiscinfo>
|
|
<refmiscinfo class="manual"> </refmiscinfo>
|
|
</refmeta>
|
|
<refnamediv>
|
|
<refname>unicharset</refname>
|
|
<refpurpose>character properties file used by tesseract(1)</refpurpose>
|
|
</refnamediv>
|
|
<refsect1 id="_description">
|
|
<title>DESCRIPTION</title>
|
|
<simpara>Tesseract’s unicharset file contains information on each symbol
|
|
(unichar) the Tesseract OCR engine is trained to recognize.</simpara>
|
|
<simpara>A unicharset file (i.e. <emphasis>eng.unicharset</emphasis>) is distributed as part of a
|
|
Tesseract language pack (i.e. <emphasis>eng.traineddata</emphasis>). For information on
|
|
extracting the unicharset file, see combine_tessdata(1).</simpara>
|
|
<simpara>The first line of a unicharset file contains the number of unichars in
|
|
the file. After this line, each subsequent line provides information for
|
|
a single unichar. The first such line contains a placeholder reserved for
|
|
the space character. Each unichar is referred to within Tesseract by its
|
|
Unichar ID, which is the line number (minus 1) within the unicharset file.
|
|
Therefore, space gets unichar 0.</simpara>
|
|
<simpara>Each unichar line in the unicharset file (v2+) may have four space-separated fields:</simpara>
|
|
<literallayout class="monospaced">'character' 'properties' 'script' 'id'</literallayout>
|
|
<simpara>Starting with Tesseract v3.02, more information may be given for each unichar:</simpara>
|
|
<literallayout class="monospaced">'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'</literallayout>
|
|
<simpara>Entries:</simpara>
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>character</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
The UTF-8 encoded string to be produced for this unichar.
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>properties</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
An integer mask of character properties, one per bit.
|
|
From least to most significant bit, these are: isalpha, islower, isupper,
|
|
isdigit, ispunctuation.
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>glyph_metrics</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
Ten comma-separated integers representing various standards
|
|
for where this glyph is to be found within a baseline-normalized coordinate
|
|
system where 128 is normalized to x-height.
|
|
</simpara>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<simpara>
|
|
min_bottom, max_bottom: the ranges where the bottom of the character can
|
|
be found.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
min_top, max_top: the ranges where the top of the character may be found.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
min_width, max_width: horizontal width of the character.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
min_bearing, max_bearing: how far from the usual start position does the
|
|
leftmost part of the character begin.
|
|
</simpara>
|
|
</listitem>
|
|
<listitem>
|
|
<simpara>
|
|
min_advance, max_advance: how far from the printer’s cell left do we
|
|
advance to begin the next character.
|
|
</simpara>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>script</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
Name of the script (Latin, Common, Greek, Cyrillic, Han, null).
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>other_case</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
The Unichar ID of the other case version of this character
|
|
(upper or lower).
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>direction</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
The Unicode BiDi direction of this character, as defined by
|
|
ICU’s enum UCharDirection. (0 = Left to Right, 1 = Right to Left,
|
|
2 = European Number…)
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>mirror</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
The Unichar ID of the BiDirectional mirror of this character.
|
|
For example the mirror of open paren is close paren, but Latin Capital C
|
|
has no mirror, so it remains a Latin Capital C.
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term>
|
|
<emphasis>normed_form</emphasis>
|
|
</term>
|
|
<listitem>
|
|
<simpara>
|
|
The UTF-8 representation of a "normalized form" of this unichar
|
|
for the purpose of blaming a module for errors given ground truth text.
|
|
For instance, a left or right single quote may normalize to an ASCII quote.
|
|
</simpara>
|
|
</listitem>
|
|
</varlistentry>
|
|
</variablelist>
|
|
</refsect1>
|
|
<refsect1 id="_example_v2">
|
|
<title>EXAMPLE (v2)</title>
|
|
<literallayout class="monospaced">; 10 Common 46
|
|
b 3 Latin 59
|
|
W 5 Latin 40
|
|
7 8 Common 66
|
|
= 0 Common 93</literallayout>
|
|
<simpara>";" is a punctuation character. Its properties are thus represented by the
|
|
binary number 10000 (10 in hexadecimal).</simpara>
|
|
<simpara>"b" is an alphabetic character and a lower case character. Its properties are
|
|
thus represented by the binary number 00011 (3 in hexadecimal).</simpara>
|
|
<simpara>"W" is an alphabetic character and an upper case character. Its properties are
|
|
thus represented by the binary number 00101 (5 in hexadecimal).</simpara>
|
|
<simpara>"7" is just a digit. Its properties are thus represented by the binary number
|
|
01000 (8 in hexadecimal).</simpara>
|
|
<simpara>"=" is not punctuation nor a digit nor an alphabetic character. Its properties
|
|
are thus represented by the binary number 00000 (0 in hexadecimal).</simpara>
|
|
<simpara>Japanese or Chinese alphabetic character properties are represented by the
|
|
binary number 00001 (1 in hexadecimal): they are alphabetic, but neither
|
|
upper nor lower case.</simpara>
|
|
</refsect1>
|
|
<refsect1 id="_example_v3_02">
|
|
<title>EXAMPLE (v3.02)</title>
|
|
<literallayout class="monospaced">110
|
|
NULL 0 NULL 0
|
|
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
|
|
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
|
|
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
|
|
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
|
|
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
|
|
. . .</literallayout>
|
|
</refsect1>
|
|
<refsect1 id="_caveats">
|
|
<title>CAVEATS</title>
|
|
<simpara>Although the unicharset reader maintains the ability to read unicharsets
|
|
of older formats and will assign default values to missing fields,
|
|
the accuracy will be degraded.</simpara>
|
|
<simpara>Further, most other data files are indexed by the unicharset file,
|
|
so changing it without re-generating the others is likely to have dire
|
|
consequences.</simpara>
|
|
</refsect1>
|
|
<refsect1 id="_history">
|
|
<title>HISTORY</title>
|
|
<simpara>The unicharset format first appeared with Tesseract 2.00, which was the
|
|
first version to support languages other than English. The unicharset file
|
|
contained only the first two fields, and the "ispunctuation" property was
|
|
absent (punctuation was regarded as "0", as "=" is in the above example.</simpara>
|
|
</refsect1>
|
|
<refsect1 id="_see_also">
|
|
<title>SEE ALSO</title>
|
|
<simpara>tesseract(1), combine_tessdata(1), unicharset_extractor(1)</simpara>
|
|
<simpara><ulink url="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</ulink></simpara>
|
|
</refsect1>
|
|
<refsect1 id="_author">
|
|
<title>AUTHOR</title>
|
|
<simpara>The Tesseract OCR engine was written by Ray Smith and his research groups
|
|
at Hewlett Packard (1985-1995) and Google (2006-present).</simpara>
|
|
</refsect1>
|
|
</refentry>
|