'\" t
.\" Title: unicharset_extractor
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "UNICHARSET_EXTRACTOR" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
unicharset_extractor \- extract unicharset from Tesseract boxfiles
.SH "SYNOPSIS"
.sp
\fBunicharset_extractor\fR \fI[\-D dir]\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
Tesseract needs to know the set of characters it can output\&. To generate the unicharset data file, run unicharset_extractor on the same bounding box files of the training pages that are used for clustering:
.sp
.if n \{\
.RS 4
.\}
.nf
unicharset_extractor fontfile_1\&.box fontfile_2\&.box \&.\&.\&.
.fi
.if n \{\
.RE
.\}
.sp
The unicharset will be put into the file \fIdir/unicharset\fR, or simply \fI\&./unicharset\fR if no output directory is provided\&.
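.sp
As a minimal sketch of the \fB\-D\fR option from the synopsis, assuming a hypothetical directory named out that already exists, the following invocation would be expected to write the result to \fIout/unicharset\fR:
.sp
.if n \{\
.RS 4
.\}
.nf
unicharset_extractor \-D out fontfile_1\&.box fontfile_2\&.box
.fi
.if n \{\
.RE
.\}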
.sp
Tesseract also needs access to the character properties isalpha, isdigit, isupper, islower and ispunctuation\&. All of this auxiliary data, and more, is encoded in the unicharset file (see unicharset(5))\&.
.sp
If your system supports the wctype functions, these values are set automatically by unicharset_extractor and there is no need to edit the unicharset file\&. On some older systems (e\&.g\&. Windows 95), the unicharset file must be edited by hand to add these property description codes\&.
.sp
\fBNOTE\fR: The unicharset file must be regenerated whenever inttemp, normproto and pffmtable are regenerated (i\&.e\&. all of them must be recreated whenever a box file changes), because they have to stay in sync\&. This is easier than in previous versions: run unicharset_extractor before mftraining and cntraining, and pass the resulting unicharset to mftraining\&.
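.sp
A minimal sketch of that ordering, assuming the Tesseract 3\&.x training tools and hypothetical file names (lang\&.fontname\&.exp0\&.box, lang\&.fontname\&.exp0\&.tr, font_properties)\&. The mftraining options shown follow the 3\&.x training guide and are an assumption here; check them against mftraining(1) for your version:
.sp
.if n \{\
.RS 4
.\}
.nf
unicharset_extractor lang\&.fontname\&.exp0\&.box
mftraining \-F font_properties \-U unicharset \-O lang\&.unicharset lang\&.fontname\&.exp0\&.tr
cntraining lang\&.fontname\&.exp0\&.tr
.fi
.if n \{\
.RE
.\}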
.SH "SEE ALSO"
.sp
tesseract(1), unicharset(5)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "HISTORY"
.sp
unicharset_extractor first appeared in Tesseract 2\&.00\&.
.SH "COPYING"
.sp
Copyright (C) 2006, Google Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.