combine_tessdata(1) is the main program to combine/extract/overwrite tessdata components in [lang].traineddata files.
To combine all the individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs) located at, say, /home/$USER/temp/eng.* run:
combine_tessdata /home/$USER/temp/eng.
The result will be a combined tessdata file, /home/$USER/temp/eng.traineddata.
Specify option -e to extract individual components from a combined traineddata file. For example, to extract the language config file and the unicharset from tessdata/eng.traineddata run:
The desired config file and unicharset will be written to /home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset.
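The extraction described above can be sketched as a small shell function. The function name and its default arguments are illustrative assumptions, not part of combine_tessdata; the argument order (combined file first, then the component files to write) follows the tool's usage message.

```shell
# Sketch: extract the language config and the unicharset from a combined file.
# extract_config_and_unicharset is a hypothetical helper name.
extract_config_and_unicharset() {
    traineddata="${1:-tessdata/eng.traineddata}"   # combined input file
    prefix="${2:-/home/$USER/temp/eng.}"           # output path prefix (assumed default)
    # -e: the combined file comes first, then the component files to write.
    combine_tessdata -e "$traineddata" \
        "${prefix}config" "${prefix}unicharset"
}
```

Called with no arguments, the function uses the paths from the example in the text.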
Specify option -o to overwrite individual components of the given [lang].traineddata file. For example, to overwrite the language config and unichar ambiguities files in tessdata/eng.traineddata use:
As a result, tessdata/eng.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc.
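A sketch of such an overwrite, wrapped in a shell function. The function name is hypothetical, and the location of the replacement files under /home/$USER/temp/ is an assumption carried over from the other examples; the text does not say where they live.

```shell
# Sketch: replace the language config and unichar ambigs inside a combined file.
# overwrite_config_and_ambigs is a hypothetical helper name.
overwrite_config_and_ambigs() {
    traineddata="${1:-tessdata/eng.traineddata}"   # combined file, modified in place
    prefix="${2:-/home/$USER/temp/eng.}"           # where the new components live (assumed)
    # -o: the combined file comes first, then the component files to read.
    # Each file's suffix tells the tool which component it replaces.
    combine_tessdata -o "$traineddata" \
        "${prefix}config" "${prefix}unicharambigs"
}
```

Because the combined file is rewritten in place, keeping a backup copy before running this is a sensible precaution.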
Note: the names of the files to extract to and to overwrite from should carry the file suffix (extension) that indicates their tessdata component type (.unicharset for the unicharset, .unicharambigs for unichar ambigs, etc.). See the k*FileSuffix variables in ccutil/tessdatamanager.h.
Specify option -u to unpack all the components to the specified path:
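As a sketch, assuming the same example paths used for the other options (the text does not name them for this one), the call could be wrapped like this. The function name is hypothetical; the second argument is treated as a path prefix to which each component's suffix is appended.

```shell
# Sketch: unpack every component of a combined file.
# unpack_all_components is a hypothetical helper name.
unpack_all_components() {
    traineddata="${1:-tessdata/eng.traineddata}"   # combined input file
    prefix="${2:-/home/$USER/temp/eng.}"           # output path prefix (assumed default)
    # -u: the combined file, then the prefix for the unpacked component files.
    combine_tessdata -u "$traineddata" "$prefix"
}
```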
The components in a Tesseract lang.traineddata file as of Tesseract 3.02 are briefly described below. For more information on many of these files, see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
(Optional) Language-specific overrides to default config variables.
(Required) The list of symbols that Tesseract recognizes, with properties. See unicharset(5).
(Optional) This file contains information on pairs of recognized symbols which are often confused. For example,
(Required) Character shape templates for each unichar. Produced by mftraining(1).
(Required) The number of features expected for each unichar. Produced by mftraining(1) from
(Required) Character normalization prototypes generated by cntraining(1) from
(Optional) A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space.
(Optional) A dawg made from dictionary words from the language.
(Optional) A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
(Optional) A dawg made from the most frequent words which would have gone into word-dawg.
(Optional) Several dawgs of different fixed lengths, useful for languages like Chinese.
(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar-id and font.
(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a