Update documentation for unicharset_extractor

2024-11-24 02:59:07 +08:00 · 2019-05-31 08:20:19 +00:00 · 00abf57d02
commit 00abf57d02
parent 7d3e1324a8
1 changed file with 10 additions and 20 deletions
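
For readers skimming the change, the updated synopsis in the diff below can be exercised with a small wrapper. This is only a sketch: the helper names `norm_mode_for` and `extractor_cmd` and all file names are hypothetical, and the script-to-mode mapping simply restates the three modes listed in the new SYNOPSIS text. The command is printed rather than executed, so the sketch runs without Tesseract installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): map a script family to the
# --norm_mode value listed in the updated synopsis.
norm_mode_for() {
  case "$1" in
    Latin) echo 1 ;;                        # combine graphemes
    Indic|Khmer|Myanmar) echo 2 ;;          # split graphemes
    Arabic|Hebrew|Thai|Tibetan) echo 3 ;;   # pure unicode
    *) echo "unknown script family: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print a command line that uses only the options
# shown in the synopsis. Output and input file names are placeholders.
extractor_cmd() {
  script="$1"; output="$2"; shift 2
  mode=$(norm_mode_for "$script") || return 1
  echo "unicharset_extractor --output_unicharset $output --norm_mode $mode $*"
}

extractor_cmd Khmer khm.unicharset khm.font_1.box khm.font_2.box
```

Dropping the `echo` in `extractor_cmd` would run the tool directly; that is left out here because it depends on a local Tesseract training-tools installation.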
--- a/doc/unicharset_extractor.1.asc
+++ b/doc/unicharset_extractor.1.asc
@@ -3,38 +3,28 @@ UNICHARSET_EXTRACTOR(1)

 NAME
 ----
-unicharset_extractor - extract unicharset from Tesseract boxfiles
+unicharset_extractor - Reads box or plain text files to extract the unicharset.

 SYNOPSIS
 --------
-*unicharset_extractor* '[-D dir]' 'FILE'...
+*unicharset_extractor*  [--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]
+
+Where mode means:
+ 1=combine graphemes (use for Latin and other simple scripts)
+ 2=split graphemes (use for Indic/Khmer/Myanmar)
+ 3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)

 DESCRIPTION
 -----------
 Tesseract needs to know the set of possible characters it can output.
 To generate the unicharset data file, use the unicharset_extractor
-program on the same training pages bounding box files as used for
-clustering:
+program on training pages bounding box files or a plain text file:

    unicharset_extractor fontfile_1.box fontfile_2.box ...

-The unicharset will be put into the file 'dir/unicharset', or simply
-'./unicharset' if no output directory is provided.
+The unicharset will be put into the file './unicharset' if no output filename is provided.

-Tesseract also needs to have access to character properties isalpha,
-isdigit, isupper, islower, ispunctuation. all of this auxilury data
-and more is encoded in this file. (See unicharset(5))
-
-If your system supports the wctype functions, these values will be set
-automatically by unicharset_extractor and there is no need to edit the
-unicharset file. On some older systems (eg Windows 95), the unicharset
-file must be edited by hand to add these property description codes.
-
-*NOTE* The unicharset file must be regenerated whenever inttemp, normproto
-and pffmtable are generated (i.e. they must all be recreated when the box
-file is changed) as they have to be in sync. This is made easier than in
-previous versions by running unicharset_extractor before mftraining and
-cntraining, and giving the unicharset to mftraining.
+*NOTE* Use the appropriate norm_mode based on the language.

 SEE ALSO
 --------