Tesseract needs to know the set of possible characters it can output\&. To generate the unicharset data file, run the unicharset_extractor program on the bounding box files of the same training pages that were used for clustering\&.
The unicharset will be put into the file \fIdir/unicharset\fR, or simply \fI\&./unicharset\fR if no output directory is provided\&.
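.sp
As a rough illustration of what the extraction step gathers, the following Python sketch collects the distinct symbols from box file lines\&. It is not the actual implementation of unicharset_extractor\&. It assumes that each box file line starts with the symbol, followed by its box coordinates and page number; the function name collect_unichars and the sample data are hypothetical\&.
.sp
.nf
```python
def collect_unichars(box_lines):
    """Return the distinct symbols found in box-file lines, in first-seen order."""
    seen = []
    for line in box_lines:
        fields = line.split()
        # Assumed line layout: symbol left bottom right top page
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        symbol = fields[0]
        if symbol not in seen:
            seen.append(symbol)
    return seen


# Hypothetical sample lines, not taken from a real training page.
sample = [
    "T 10 20 30 60 0",
    "e 32 20 45 48 0",
    "s 47 20 58 48 0",
    "s 60 20 71 48 0",
]
print(collect_unichars(sample))
```
.fi
.sp
The real tool writes further per\-character data alongside each symbol; see unicharset(5)\&.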
.sp
Tesseract also needs access to the character properties isalpha, isdigit, isupper, islower and ispunctuation\&. All of this auxiliary data and more is encoded in this file\&. (See unicharset(5)\&.)
If your system supports the wctype functions, these values are set automatically by unicharset_extractor and there is no need to edit the unicharset file\&. On some older systems (e\&.g\&. Windows 95), the unicharset file must be edited by hand to add these property description codes\&.
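.sp
When checking or hand\-editing property codes, it can help to see how such a code might be derived\&. The Python sketch below is an approximation only: it assumes the bit order given in unicharset(5) (isalpha, islower, isupper, isdigit, ispunctuation, least significant bit first) and uses Python string methods and Unicode categories in place of the wctype functions, so its results may differ from those of unicharset_extractor for some characters\&. The name property_mask is hypothetical\&.
.sp
.nf
```python
import unicodedata

# Bit values assumed from unicharset(5), least significant bit first.
ISALPHA, ISLOWER, ISUPPER, ISDIGIT, ISPUNCT = 0x1, 0x2, 0x4, 0x8, 0x10


def property_mask(ch):
    """Approximate the property mask for a single character."""
    mask = 0
    if ch.isalpha():
        mask |= ISALPHA
    if ch.islower():
        mask |= ISLOWER
    if ch.isupper():
        mask |= ISUPPER
    if ch.isdigit():
        mask |= ISDIGIT
    # Unicode general categories starting with P are punctuation.
    if unicodedata.category(ch).startswith("P"):
        mask |= ISPUNCT
    return mask


for ch in "bW7;":
    # Print each character with its mask in hexadecimal.
    print(ch, format(property_mask(ch), "x"))
```
.fi
.sp
Under these assumptions the loop prints 3 for b, 5 for W, 8 for 7 and 10 for the semicolon\&.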
.sp
\fBNOTE\fR The unicharset file must be regenerated whenever inttemp, normproto and pffmtable are generated (i\&.e\&. they must all be recreated when the box file is changed), because they have to stay in sync\&. This is easier than in previous versions: run unicharset_extractor before mftraining and cntraining, and give the resulting unicharset to mftraining\&.