diff --git a/doc/unicharset_extractor.1.asc b/doc/unicharset_extractor.1.asc
index bde21ab3b..2918350c6 100644
--- a/doc/unicharset_extractor.1.asc
+++ b/doc/unicharset_extractor.1.asc
@@ -3,38 +3,28 @@ UNICHARSET_EXTRACTOR(1)
 
 NAME
 ----
-unicharset_extractor - extract unicharset from Tesseract boxfiles
+unicharset_extractor - Reads box or plain text files to extract the unicharset.
 
 SYNOPSIS
 --------
-*unicharset_extractor* '[-D dir]' 'FILE'...
+*unicharset_extractor* [--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]
+
+Where mode means:
+ 1=combine graphemes (use for Latin and other simple scripts)
+ 2=split graphemes (use for Indic/Khmer/Myanmar)
+ 3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
 
 DESCRIPTION
 -----------
 Tesseract needs to know the set of possible characters it can output.
 To generate the unicharset data file, use the unicharset_extractor
-program on the same training pages bounding box files as used for
-clustering:
+program on training pages bounding box files or a plain text file:
 
     unicharset_extractor fontfile_1.box fontfile_2.box ...
 
-The unicharset will be put into the file 'dir/unicharset', or simply
-'./unicharset' if no output directory is provided.
+The unicharset will be put into the file './unicharset' if no output filename is provided.
 
-Tesseract also needs to have access to character properties isalpha,
-isdigit, isupper, islower, ispunctuation. all of this auxilury data
-and more is encoded in this file. (See unicharset(5))
-
-If your system supports the wctype functions, these values will be set
-automatically by unicharset_extractor and there is no need to edit the
-unicharset file. On some older systems (eg Windows 95), the unicharset
-file must be edited by hand to add these property description codes.
-
-*NOTE* The unicharset file must be regenerated whenever inttemp, normproto
-and pffmtable are generated (i.e. they must all be recreated when the box
-file is changed) as they have to be in sync. This is made easier than in
-previous versions by running unicharset_extractor before mftraining and
-cntraining, and giving the unicharset to mftraining.
+*NOTE* Use the appropriate norm_mode based on the language.
 
 SEE ALSO
 --------
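
Reviewer note: the new SYNOPSIS can be exercised from a script. The sketch below is not part of the patch; it only assembles an argument list that follows the documented form `[--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]`. The helper name `build_extractor_command` and the `NORM_MODES` table are hypothetical, and the sketch assumes each flag and its value are passed as separate arguments, as the SYNOPSIS suggests.

```python
from typing import List, Optional, Sequence

# Mode descriptions copied from the SYNOPSIS text added by the patch.
NORM_MODES = {
    1: "combine graphemes (Latin and other simple scripts)",
    2: "split graphemes (Indic/Khmer/Myanmar)",
    3: "pure unicode (Arabic/Hebrew/Thai/Tibetan)",
}


def build_extractor_command(
    inputs: Sequence[str],
    output_unicharset: Optional[str] = None,
    norm_mode: Optional[int] = None,
) -> List[str]:
    """Build an argv list following the new SYNOPSIS (hypothetical helper).

    Omitting output_unicharset leaves the tool's documented default
    ('./unicharset') in effect; omitting norm_mode leaves the tool's own
    default mode in effect.
    """
    if not inputs:
        raise ValueError("at least one box or text file is required")
    if norm_mode is not None and norm_mode not in NORM_MODES:
        raise ValueError(f"norm_mode must be one of {sorted(NORM_MODES)}")

    argv = ["unicharset_extractor"]
    # Assumption: flag and value are separate arguments, as in the SYNOPSIS.
    if output_unicharset is not None:
        argv += ["--output_unicharset", output_unicharset]
    if norm_mode is not None:
        argv += ["--norm_mode", str(norm_mode)]
    argv += list(inputs)
    return argv
```

On a machine with the Tesseract training tools installed, the returned list could be handed to `subprocess.run(argv, check=True)`. For example, `build_extractor_command(["fontfile_1.box"], norm_mode=2)` would select the split-graphemes mode that the patch recommends for Indic, Khmer, and Myanmar text.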