mirror of https://github.com/tesseract-ocr/tesseract.git, synced 2024-11-24 02:59:07 +08:00
Update documentation for unicharset_extractor
parent 7d3e1324a8
commit 00abf57d02
@@ -3,38 +3,28 @@ UNICHARSET_EXTRACTOR(1)

 NAME
 ----
-unicharset_extractor - extract unicharset from Tesseract boxfiles
+unicharset_extractor - Reads box or plain text files to extract the unicharset.

 SYNOPSIS
 --------
-*unicharset_extractor* '[-D dir]' 'FILE'...
+*unicharset_extractor* [--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]

+Where mode means:
+1=combine graphemes (use for Latin and other simple scripts)
+2=split graphemes (use for Indic/Khmer/Myanmar)
+3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
+
 DESCRIPTION
 -----------
 Tesseract needs to know the set of possible characters it can output.
 To generate the unicharset data file, use the unicharset_extractor
-program on the same training pages bounding box files as used for
-clustering:
+program on training pages bounding box files or a plain text file:

 unicharset_extractor fontfile_1.box fontfile_2.box ...

-The unicharset will be put into the file 'dir/unicharset', or simply
-'./unicharset' if no output directory is provided.
+The unicharset will be put into the file './unicharset' if no output filename is provided.

-Tesseract also needs to have access to character properties isalpha,
-isdigit, isupper, islower, ispunctuation. all of this auxilury data
-and more is encoded in this file. (See unicharset(5))
-
-If your system supports the wctype functions, these values will be set
-automatically by unicharset_extractor and there is no need to edit the
-unicharset file. On some older systems (eg Windows 95), the unicharset
-file must be edited by hand to add these property description codes.
-
-*NOTE* The unicharset file must be regenerated whenever inttemp, normproto
-and pffmtable are generated (i.e. they must all be recreated when the box
-file is changed) as they have to be in sync. This is made easier than in
-previous versions by running unicharset_extractor before mftraining and
-cntraining, and giving the unicharset to mftraining.
+*NOTE* Use the appropriate norm_mode based on the language.

 SEE ALSO
 --------
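
The mode values added to the SYNOPSIS in this commit lend themselves to a small wrapper. The following is only a sketch, not part of Tesseract: the helper name norm_mode_for and the script-family keywords are hypothetical, and the option names are taken from the new SYNOPSIS line. The command is printed rather than executed, so the sketch runs without the training tools installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): map a script family to the
# --norm_mode value listed in the new SYNOPSIS text.
norm_mode_for() {
  case "$1" in
    latin)                      echo 1 ;;  # combine graphemes
    indic|khmer|myanmar)        echo 2 ;;  # split graphemes
    arabic|hebrew|thai|tibetan) echo 3 ;;  # pure unicode
    *) echo "unknown script family: $1" >&2; return 1 ;;
  esac
}

# Build the command line for a Khmer training set. The box file names are
# the placeholders used in the man page example.
mode=$(norm_mode_for khmer)
echo "unicharset_extractor --output_unicharset ./unicharset --norm_mode $mode fontfile_1.box fontfile_2.box"
```

Dropping the echo would run the tool itself, assuming unicharset_extractor is on the PATH and the box files exist.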