'\" t
.\" Title: wordlist2dawg
.\" Author: [FIXME: author] [see http://docbook.sf.net/el/author]
.\" Generator: DocBook XSL Stylesheets v1.75.2
.\" Date: 09/30/2010
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "WORDLIST2DAWG" "1" "09/30/2010" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
wordlist2dawg \- convert a wordlist to a DAWG for Tesseract
.SH "SYNOPSIS"
.sp
\fBwordlist2dawg\fR \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.SH "DESCRIPTION"
.sp
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract\&.
.sp
The wordlists are split into two: one with high frequency words, and one with the rest\&.
.SH "OPTIONS"
.sp
\fIWORDLIST\fR A plain text file in UTF\-8, one word per line
.sp
\fIDAWG\fR The output DAWG to write
.sp
\fIlang\&.unicharset\fR The unicharset of the language\&. This is the unicharset generated by mftraining(1)
.SH "SEE ALSO"
.sp
tesseract(1), mftraining(1)
.sp
\m[blue]\fBhttp://code\&.google\&.com/p/tesseract\-ocr/wiki/TrainingTesseract3\fR\m[]
.SH "COPYING"
.sp
Copyright (c) 2006 Google, Inc\&.
Licensed under the Apache License, Version 2\&.0\&.
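.SH "EXAMPLES"
.sp
A minimal sketch of a typical invocation, following the argument order given in the synopsis\&. The file names are hypothetical placeholders chosen for illustration, not names the tool requires\&. Because the wordlists are split into two, the sketch assumes the command is run once per list, each time with the same unicharset:
.sp
.if n \{\
.RS 4
.\}
.nf
.\" Hypothetical file names; substitute your own word lists and unicharset.
wordlist2dawg frequent_words_list eng\&.freq\-dawg eng\&.unicharset
wordlist2dawg words_list eng\&.word\-dawg eng\&.unicharset
.fi
.if n \{\
.RE
.\}
.sp
Each input file is expected to be plain UTF\-8 text with one word per line, as described under OPTIONS\&.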