mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2024-11-24 02:59:07 +08:00
55 lines
1.9 KiB
Plaintext
55 lines
1.9 KiB
Plaintext
|
UNICHARSET_EXTRACTOR(1)
|
||
|
=======================
|
||
|
|
||
|
NAME
|
||
|
----
|
||
|
unicharset_extractor - extract unicharset from Tesseract boxfiles
|
||
|
|
||
|
SYNOPSIS
|
||
|
--------
|
||
|
*unicharset_extractor* 'FILE'...
|
||
|
|
||
|
DESCRIPTION
|
||
|
-----------
|
||
|
Tesseract needs to know the set of possible characters it can output.
|
||
|
To generate the unicharset data file, use the unicharset_extractor
|
||
|
program on the same training pages bounding box files as used for
|
||
|
clustering:
|
||
|
|
||
|
unicharset_extractor fontfile_1.box fontfile_2.box ...
|
||
|
|
||
|
Tesseract needs to have access to character properties isalpha,
|
||
|
isdigit, isupper, islower, ispunctuation. This data must be encoded
|
||
|
in the unicharset data file. Each line of this file corresponds to
|
||
|
one character. The character in UTF-8 is followed by a hexadecimal
|
||
|
number representing a binary mask that encodes the properties. Each
|
||
|
bit corresponds to a property. If the bit is set to 1, it means that
|
||
|
the property is true. The bit ordering is (from least significant bit
|
||
|
to most significant bit): isalpha, islower, isupper, isdigit,
|
||
|
ispunctuation.
|
||
|
(See unicharset(5))
|
||
|
|
||
|
If your system supports the wctype functions, these values will be set
|
||
|
automatically by unicharset_extractor and there is no need to edit the
|
||
|
unicharset file. On some older systems (eg Windows 95), the unicharset
|
||
|
file must be edited by hand to add these property description codes.
|
||
|
|
||
|
*NOTE* The unicharset file must be regenerated whenever inttemp, normproto
|
||
|
and pffmtable are generated (i.e. they must all be recreated when the box
|
||
|
file is changed) as they have to be in sync. This is made easier than in
|
||
|
previous versions by running unicharset_extractor before mftraining and
|
||
|
cntraining, and giving the unicharset to mftraining.
|
||
|
|
||
|
SEE ALSO
|
||
|
--------
|
||
|
tesseract(1), unicharset(5)
|
||
|
|
||
|
HISTORY
|
||
|
-------
|
||
|
unicharset_extractor first appeared in Tesseract 2.00.
|
||
|
|
||
|
COPYING
|
||
|
-------
|
||
|
Copyright (C) 2006, Google Inc.
|
||
|
Licensed under the Apache License, Version 2.0
|