tesseract/doc/combine_tessdata.1

'\" t
.\"     Title: combine_tessdata
.\"    Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\"      Date: 06/12/2015
.\"    Manual: \ \&
.\"    Source: \ \&
.\"  Language: English
.\"
.TH "COMBINE_TESSDATA" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
combine_tessdata \- combine/extract/overwrite Tesseract data
.SH "SYNOPSIS"
.sp
\fBcombine_tessdata\fR [\fIOPTION\fR] \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
combine_tessdata(1) is the main program to combine/extract/overwrite tessdata components in [lang]\&.traineddata files\&.
.sp
To combine all the individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs) located at, say, /home/$USER/temp/eng\&.* run:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata /home/$USER/temp/eng\&.
.fi
.if n \{\
.RE
.\}
.sp
The result will be a combined tessdata file /home/$USER/temp/eng\&.traineddata
.sp
Specify option \-e if you would like to extract individual components from a combined traineddata file\&. For example, to extract language config file and the unicharset from tessdata/eng\&.traineddata run:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata \-e tessdata/eng\&.traineddata \e
  /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
.fi
.if n \{\
.RE
.\}
.sp
The desired config file and unicharset will be written to /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
.sp
Specify option \-o to overwrite individual components of the given [lang]\&.traineddata file\&. For example, to overwrite language config and unichar ambiguities files in tessdata/eng\&.traineddata use:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata \-o tessdata/eng\&.traineddata \e
  /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharambigs
.fi
.if n \{\
.RE
.\}
.sp
As a result, tessdata/eng\&.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc\&.
.sp
Note: the file names of the files to extract to and to overwrite from should have the appropriate file suffixes (extensions) indicating their tessdata component type (\&.unicharset for the unicharset, \&.unicharambigs for unichar ambigs, etc)\&. See k*FileSuffix variable in ccutil/tessdatamanager\&.h\&.
.sp
Specify option \-u to unpack all the components to the specified path:
.sp
.if n \{\
.RS 4
.\}
.nf
combine_tessdata \-u tessdata/eng\&.traineddata /home/$USER/temp/eng\&.
.fi
.if n \{\
.RE
.\}
.sp
This will create /home/$USER/temp/eng\&.* files with individual tessdata components from tessdata/eng\&.traineddata\&.
.SH "OPTIONS"
.sp
\fB\-e\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Extracts the specified components from the \&.traineddata file
.sp
\fB\-o\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Overwrites the specified components of the \&.traineddata file with those provided on the comand line\&.
.sp
\fB\-u\fR \fI\&.traineddata\fR \fIPATHPREFIX\fR Unpacks the \&.traineddata using the provided prefix\&.
.SH "CAVEATS"
.sp
\fIPrefix\fR refers to the full file prefix, including period (\&.)
.SH "COMPONENTS"
.sp
The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.PP
lang\&.config
.RS 4
(Optional) Language\-specific overrides to default config variables\&.
.RE
.PP
lang\&.unicharset
.RS 4
(Required) The list of symbols that Tesseract recognizes, with properties\&. See unicharset(5)\&.
.RE
.PP
lang\&.unicharambigs
.RS 4
(Optional) This file contains information on pairs of recognized symbols which are often confused\&. For example,
\fIrn\fR
and
\fIm\fR\&.
.RE
.PP
lang\&.inttemp
.RS 4
(Required) Character shape templates for each unichar\&. Produced by mftraining(1)\&.
.RE
.PP
lang\&.pffmtable
.RS 4
(Required) The number of features expected for each unichar\&. Produced by mftraining(1) from
\fB\&.tr\fR
files\&.
.RE
.PP
lang\&.normproto
.RS 4
(Required) Character normalization prototypes generated by cntraining(1) from
\fB\&.tr\fR
files\&.
.RE
.PP
lang\&.punc\-dawg
.RS 4
(Optional) A dawg made from punctuation patterns found around words\&. The "word" part is replaced by a single space\&.
.RE
.PP
lang\&.word\-dawg
.RS 4
(Optional) A dawg made from dictionary words from the language\&.
.RE
.PP
lang\&.number\-dawg
.RS 4
(Optional) A dawg made from tokens which originally contained digits\&. Each digit is replaced by a space character\&.
.RE
.PP
lang\&.freq\-dawg
.RS 4
(Optional) A dawg made from the most frequent words which would have gone into word\-dawg\&.
.RE
.PP
lang\&.fixed\-length\-dawgs
.RS 4
(Optional) Several dawgs of different fixed lengths \(em useful for languages like Chinese\&.
.RE
.PP
lang\&.shapetable
.RS 4
(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar\-id and font\&.
.RE
.PP
lang\&.bigram\-dawg
.RS 4
(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a
\fI?\fR\&.
.RE
.PP
lang\&.unambig\-dawg
.RS 4
(Optional) TODO: Describe\&.
.RE
.PP
lang\&.params\-training\-model
.RS 4
(Optional) TODO: Describe\&.
.RE
.SH "HISTORY"
.sp
combine_tessdata(1) first appeared in version 3\&.00 of Tesseract
.SH "SEE ALSO"
.sp
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5)
.SH "COPYING"
.sp
Copyright (C) 2009, Google Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`'\" t`
			`.\" Title: combine_tessdata`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.\" Author: [see the "AUTHOR" section]`
fix links in doc; autotools requires README 2015-06-13 06:08:05 +08:00			`.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>`
			`.\" Date: 06/12/2015`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.\" Manual: \ \&`
			`.\" Source: \ \&`
			`.\" Language: English`
			`.\"`
fix links in doc; autotools requires README 2015-06-13 06:08:05 +08:00			`.TH "COMBINE_TESSDATA" "1" "06/12/2015" "\ \&" "\ \&"`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.\" -----------------------------------------------------------------`
			`.\" * Define some portability stuff`
			`.\" -----------------------------------------------------------------`
			`.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
			`.\" http://bugs.debian.org/507673`
			`.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html`
			`.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
			`.ie \n(.g .ds Aq \(aq`
			`.el .ds Aq '`
			`.\" -----------------------------------------------------------------`
			`.\" * set default formatting`
			`.\" -----------------------------------------------------------------`
			`.\" disable hyphenation`
			`.nh`
			`.\" disable justification (adjust text to left margin only)`
			`.ad l`
			`.\" -----------------------------------------------------------------`
			`.\" * MAIN CONTENT STARTS HERE *`
			`.\" -----------------------------------------------------------------`
			`.SH "NAME"`
			`combine_tessdata \- combine/extract/overwrite Tesseract data`
			`.SH "SYNOPSIS"`
			`.sp`
			`\fBcombine_tessdata\fR [\fIOPTION\fR] \fIFILE\fR\&...`
			`.SH "DESCRIPTION"`
			`.sp`
			`combine_tessdata(1) is the main program to combine/extract/overwrite tessdata components in [lang]\&.traineddata files\&.`
			`.sp`
			`To combine all the individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs) located at, say, /home/$USER/temp/eng\&.* run:`
			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
			`combine_tessdata /home/$USER/temp/eng\&.`
			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
			`.sp`
			`The result will be a combined tessdata file /home/$USER/temp/eng\&.traineddata`
			`.sp`
			`Specify option \-e if you would like to extract individual components from a combined traineddata file\&. For example, to extract language config file and the unicharset from tessdata/eng\&.traineddata run:`
			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`combine_tessdata \-e tessdata/eng\&.traineddata \e`
			`/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
			`.sp`
			`The desired config file and unicharset will be written to /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset`
			`.sp`
			`Specify option \-o to overwrite individual components of the given [lang]\&.traineddata file\&. For example, to overwrite language config and unichar ambiguities files in tessdata/eng\&.traineddata use:`
			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`combine_tessdata \-o tessdata/eng\&.traineddata \e`
			`/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharambigs`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
			`.sp`
			`As a result, tessdata/eng\&.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc\&.`
			`.sp`
			`Note: the file names of the files to extract to and to overwrite from should have the appropriate file suffixes (extensions) indicating their tessdata component type (\&.unicharset for the unicharset, \&.unicharambigs for unichar ambigs, etc)\&. See k*FileSuffix variable in ccutil/tessdatamanager\&.h\&.`
			`.sp`
			`Specify option \-u to unpack all the components to the specified path:`
			`.sp`
			`.if n \{\`
			`.RS 4`
			`.\}`
			`.nf`
			`combine_tessdata \-u tessdata/eng\&.traineddata /home/$USER/temp/eng\&.`
			`.fi`
			`.if n \{\`
			`.RE`
			`.\}`
			`.sp`
			`This will create /home/$USER/temp/eng\&.* files with individual tessdata components from tessdata/eng\&.traineddata\&.`
			`.SH "OPTIONS"`
			`.sp`
			`\fB\-e\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Extracts the specified components from the \&.traineddata file`
			`.sp`
			`\fB\-o\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Overwrites the specified components of the \&.traineddata file with those provided on the comand line\&.`
			`.sp`
make one thing a little more clear git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@487 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 18:13:09 +08:00			`\fB\-u\fR \fI\&.traineddata\fR \fIPATHPREFIX\fR Unpacks the \&.traineddata using the provided prefix\&.`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.SH "CAVEATS"`
			`.sp`
			`\fIPrefix\fR refers to the full file prefix, including period (\&.)`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.SH "COMPONENTS"`
			`.sp`
fix links in doc; autotools requires README 2015-06-13 06:08:05 +08:00			`The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.PP`
			`lang\&.config`
			`.RS 4`
			`(Optional) Language\-specific overrides to default config variables\&.`
			`.RE`
			`.PP`
			`lang\&.unicharset`
			`.RS 4`
			`(Required) The list of symbols that Tesseract recognizes, with properties\&. See unicharset(5)\&.`
			`.RE`
			`.PP`
			`lang\&.unicharambigs`
			`.RS 4`
			`(Optional) This file contains information on pairs of recognized symbols which are often confused\&. For example,`
			`\fIrn\fR`
			`and`
			`\fIm\fR\&.`
			`.RE`
			`.PP`
			`lang\&.inttemp`
			`.RS 4`
			`(Required) Character shape templates for each unichar\&. Produced by mftraining(1)\&.`
			`.RE`
			`.PP`
			`lang\&.pffmtable`
			`.RS 4`
			`(Required) The number of features expected for each unichar\&. Produced by mftraining(1) from`
			`\fB\&.tr\fR`
			`files\&.`
			`.RE`
			`.PP`
			`lang\&.normproto`
			`.RS 4`
			`(Required) Character normalization prototypes generated by cntraining(1) from`
			`\fB\&.tr\fR`
			`files\&.`
			`.RE`
			`.PP`
			`lang\&.punc\-dawg`
			`.RS 4`
			`(Optional) A dawg made from punctuation patterns found around words\&. The "word" part is replaced by a single space\&.`
			`.RE`
			`.PP`
			`lang\&.word\-dawg`
			`.RS 4`
			`(Optional) A dawg made from dictionary words from the language\&.`
			`.RE`
			`.PP`
			`lang\&.number\-dawg`
			`.RS 4`
			`(Optional) A dawg made from tokens which originally contained digits\&. Each digit is replaced by a space character\&.`
			`.RE`
			`.PP`
			`lang\&.freq\-dawg`
			`.RS 4`
			`(Optional) A dawg made from the most frequent words which would have gone into word\-dawg\&.`
			`.RE`
			`.PP`
			`lang\&.fixed\-length\-dawgs`
			`.RS 4`
			`(Optional) Several dawgs of different fixed lengths \(em useful for languages like Chinese\&.`
			`.RE`
			`.PP`
			`lang\&.shapetable`
			`.RS 4`
			`(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar\-id and font\&.`
			`.RE`
			`.PP`
			`lang\&.bigram\-dawg`
			`.RS 4`
			`(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a`
			`\fI?\fR\&.`
			`.RE`
			`.PP`
			`lang\&.unambig\-dawg`
			`.RS 4`
			`(Optional) TODO: Describe\&.`
			`.RE`
			`.PP`
			`lang\&.params\-training\-model`
			`.RS 4`
			`(Optional) TODO: Describe\&.`
			`.RE`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.SH "HISTORY"`
			`.sp`
			`combine_tessdata(1) first appeared in version 3\&.00 of Tesseract`
			`.SH "SEE ALSO"`
			`.sp`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5)`
generated docs git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@484 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2010-09-30 10:21:18 +08:00			`.SH "COPYING"`
			`.sp`
			`Copyright (C) 2009, Google Inc\&. Licensed under the Apache License, Version 2\&.0`
Update man pages for Tesseract 3.02. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@670 d0cd1f9f-072b-0410-8dd7-cf729c803f20 2012-02-10 06:55:47 +08:00			`.SH "AUTHOR"`
			`.sp`
			`The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.`