'\" t
.\" Title: wordlist2dawg
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Generator: DocBook XSL Stylesheets v1.75.2 <http://docbook.sf.net/>
.\" Date: 09/30/2010
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "WORDLIST2DAWG" "1" "09/30/2010" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
wordlist2dawg \- convert a wordlist to a DAWG for Tesseract
.SH "SYNOPSIS"
.sp
\fBwordlist2dawg\fR \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.SH "DESCRIPTION"
.sp
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract\&.
.sp
Tesseract uses two wordlists per language: one of high\-frequency words and one of the remaining words\&. Run wordlist2dawg once for each list to produce the corresponding DAWG\&.
.SH "OPTIONS"
.sp
\fIWORDLIST\fR The input wordlist: a plain text file in UTF\-8, one word per line\&.
.sp
\fIDAWG\fR The output DAWG file to write\&.
.sp
\fIlang\&.unicharset\fR The unicharset of the language, as generated by mftraining(1)\&.
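.SH "EXAMPLES"
.sp
A minimal sketch of a typical pair of invocations, one for each of the two wordlists described in DESCRIPTION\&. The input names frequent_words_list and words_list and the output names lang\&.freq\-dawg and lang\&.word\-dawg are illustrative assumptions; lang stands for the language code, as in the SYNOPSIS\&. Substitute the names used by your own training data\&.
.sp
.if n \{\
.RS 4
.\}
.nf
wordlist2dawg frequent_words_list lang\&.freq\-dawg lang\&.unicharset
wordlist2dawg words_list lang\&.word\-dawg lang\&.unicharset
.fi
.if n \{\
.RE
.\}
.sp
Both commands are given the same lang\&.unicharset, so that both DAWGs are built against the same character set\&.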
.SH "SEE ALSO"
.sp
tesseract(1), mftraining(1)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "COPYING"
.sp
Copyright (c) 2006 Google, Inc\&. Licensed under the Apache License, Version 2\&.0\&.