mirror of
https://github.com/tesseract-ocr/tesseract.git
synced 2024-11-24 02:59:07 +08:00
de0aff106f
Signed-off-by: Stefan Weil <sw@weilnetz.de>
206 lines
6.6 KiB
Groff
206 lines
6.6 KiB
Groff
'\" t
|
|
.\" Title: combine_tessdata
|
|
.\" Author: [see the "AUTHOR" section]
|
|
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
|
|
.\" Date: 06/12/2015
|
|
.\" Manual: \ \&
|
|
.\" Source: \ \&
|
|
.\" Language: English
|
|
.\"
|
|
.TH "COMBINE_TESSDATA" "1" "06/12/2015" "\ \&" "\ \&"
|
|
.\" -----------------------------------------------------------------
|
|
.\" * Define some portability stuff
|
|
.\" -----------------------------------------------------------------
|
|
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
.\" http://bugs.debian.org/507673
|
|
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
|
|
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
.ie \n(.g .ds Aq \(aq
|
|
.el .ds Aq '
|
|
.\" -----------------------------------------------------------------
|
|
.\" * set default formatting
|
|
.\" -----------------------------------------------------------------
|
|
.\" disable hyphenation
|
|
.nh
|
|
.\" disable justification (adjust text to left margin only)
|
|
.ad l
|
|
.\" -----------------------------------------------------------------
|
|
.\" * MAIN CONTENT STARTS HERE *
|
|
.\" -----------------------------------------------------------------
|
|
.SH "NAME"
|
|
combine_tessdata \- combine/extract/overwrite Tesseract data
|
|
.SH "SYNOPSIS"
|
|
.sp
|
|
\fBcombine_tessdata\fR [\fIOPTION\fR] \fIFILE\fR\&...
|
|
.SH "DESCRIPTION"
|
|
.sp
|
|
combine_tessdata(1) is the main program to combine/extract/overwrite tessdata components in [lang]\&.traineddata files\&.
|
|
.sp
|
|
To combine all the individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs) located at, say, /home/$USER/temp/eng\&.* run:
|
|
.sp
|
|
.if n \{\
|
|
.RS 4
|
|
.\}
|
|
.nf
|
|
combine_tessdata /home/$USER/temp/eng\&.
|
|
.fi
|
|
.if n \{\
|
|
.RE
|
|
.\}
|
|
.sp
|
|
The result will be a combined tessdata file /home/$USER/temp/eng\&.traineddata
|
|
.sp
|
|
Specify option \-e if you would like to extract individual components from a combined traineddata file\&. For example, to extract language config file and the unicharset from tessdata/eng\&.traineddata run:
|
|
.sp
|
|
.if n \{\
|
|
.RS 4
|
|
.\}
|
|
.nf
|
|
combine_tessdata \-e tessdata/eng\&.traineddata \e
|
|
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
|
|
.fi
|
|
.if n \{\
|
|
.RE
|
|
.\}
|
|
.sp
|
|
The desired config file and unicharset will be written to /home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharset
|
|
.sp
|
|
Specify option \-o to overwrite individual components of the given [lang]\&.traineddata file\&. For example, to overwrite language config and unichar ambiguities files in tessdata/eng\&.traineddata use:
|
|
.sp
|
|
.if n \{\
|
|
.RS 4
|
|
.\}
|
|
.nf
|
|
combine_tessdata \-o tessdata/eng\&.traineddata \e
|
|
/home/$USER/temp/eng\&.config /home/$USER/temp/eng\&.unicharambigs
|
|
.fi
|
|
.if n \{\
|
|
.RE
|
|
.\}
|
|
.sp
|
|
As a result, tessdata/eng\&.traineddata will contain the new language config and unichar ambigs, plus all the original DAWGs, classifier templates, etc\&.
|
|
.sp
|
|
Note: the file names of the files to extract to and to overwrite from should have the appropriate file suffixes (extensions) indicating their tessdata component type (\&.unicharset for the unicharset, \&.unicharambigs for unichar ambigs, etc)\&. See k*FileSuffix variable in ccutil/tessdatamanager\&.h\&.
|
|
.sp
|
|
Specify option \-u to unpack all the components to the specified path:
|
|
.sp
|
|
.if n \{\
|
|
.RS 4
|
|
.\}
|
|
.nf
|
|
combine_tessdata \-u tessdata/eng\&.traineddata /home/$USER/temp/eng\&.
|
|
.fi
|
|
.if n \{\
|
|
.RE
|
|
.\}
|
|
.sp
|
|
This will create /home/$USER/temp/eng\&.* files with individual tessdata components from tessdata/eng\&.traineddata\&.
|
|
.SH "OPTIONS"
|
|
.sp
|
|
\fB\-e\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Extracts the specified components from the \&.traineddata file
|
|
.sp
|
|
\fB\-o\fR \fI\&.traineddata\fR \fIFILE\fR\&...: Overwrites the specified components of the \&.traineddata file with those provided on the comand line\&.
|
|
.sp
|
|
\fB\-u\fR \fI\&.traineddata\fR \fIPATHPREFIX\fR Unpacks the \&.traineddata using the provided prefix\&.
|
|
.SH "CAVEATS"
|
|
.sp
|
|
\fIPrefix\fR refers to the full file prefix, including period (\&.)
|
|
.SH "COMPONENTS"
|
|
.sp
|
|
The components in a Tesseract lang\&.traineddata file as of Tesseract 3\&.02 are briefly described below; For more information on many of these files, see \m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
|
|
.PP
|
|
lang\&.config
|
|
.RS 4
|
|
(Optional) Language\-specific overrides to default config variables\&.
|
|
.RE
|
|
.PP
|
|
lang\&.unicharset
|
|
.RS 4
|
|
(Required) The list of symbols that Tesseract recognizes, with properties\&. See unicharset(5)\&.
|
|
.RE
|
|
.PP
|
|
lang\&.unicharambigs
|
|
.RS 4
|
|
(Optional) This file contains information on pairs of recognized symbols which are often confused\&. For example,
|
|
\fIrn\fR
|
|
and
|
|
\fIm\fR\&.
|
|
.RE
|
|
.PP
|
|
lang\&.inttemp
|
|
.RS 4
|
|
(Required) Character shape templates for each unichar\&. Produced by mftraining(1)\&.
|
|
.RE
|
|
.PP
|
|
lang\&.pffmtable
|
|
.RS 4
|
|
(Required) The number of features expected for each unichar\&. Produced by mftraining(1) from
|
|
\fB\&.tr\fR
|
|
files\&.
|
|
.RE
|
|
.PP
|
|
lang\&.normproto
|
|
.RS 4
|
|
(Required) Character normalization prototypes generated by cntraining(1) from
|
|
\fB\&.tr\fR
|
|
files\&.
|
|
.RE
|
|
.PP
|
|
lang\&.punc\-dawg
|
|
.RS 4
|
|
(Optional) A dawg made from punctuation patterns found around words\&. The "word" part is replaced by a single space\&.
|
|
.RE
|
|
.PP
|
|
lang\&.word\-dawg
|
|
.RS 4
|
|
(Optional) A dawg made from dictionary words from the language\&.
|
|
.RE
|
|
.PP
|
|
lang\&.number\-dawg
|
|
.RS 4
|
|
(Optional) A dawg made from tokens which originally contained digits\&. Each digit is replaced by a space character\&.
|
|
.RE
|
|
.PP
|
|
lang\&.freq\-dawg
|
|
.RS 4
|
|
(Optional) A dawg made from the most frequent words which would have gone into word\-dawg\&.
|
|
.RE
|
|
.PP
|
|
lang\&.fixed\-length\-dawgs
|
|
.RS 4
|
|
(Optional) Several dawgs of different fixed lengths \(em useful for languages like Chinese\&.
|
|
.RE
|
|
.PP
|
|
lang\&.shapetable
|
|
.RS 4
|
|
(Optional) When present, a shapetable is an extra layer between the character classifier and the word recognizer that allows the character classifier to return a collection of unichar ids and fonts instead of a single unichar\-id and font\&.
|
|
.RE
|
|
.PP
|
|
lang\&.bigram\-dawg
|
|
.RS 4
|
|
(Optional) A dawg of word bigrams where the words are separated by a space and each digit is replaced by a
|
|
\fI?\fR\&.
|
|
.RE
|
|
.PP
|
|
lang\&.unambig\-dawg
|
|
.RS 4
|
|
(Optional) TODO: Describe\&.
|
|
.RE
|
|
.PP
|
|
lang\&.params\-training\-model
|
|
.RS 4
|
|
(Optional) TODO: Describe\&.
|
|
.RE
|
|
.SH "HISTORY"
|
|
.sp
|
|
combine_tessdata(1) first appeared in version 3\&.00 of Tesseract
|
|
.SH "SEE ALSO"
|
|
.sp
|
|
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5), unicharambigs(5)
|
|
.SH "COPYING"
|
|
.sp
|
|
Copyright (C) 2009, Google Inc\&. Licensed under the Apache License, Version 2\&.0
|
|
.SH "AUTHOR"
|
|
.sp
|
|
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.
|