'\" t
.\" Title: wordlist2dawg
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets v1.78.1 <http://docbook.sf.net/>
.\" Date: 06/12/2015
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "WORDLIST2DAWG" "1" "06/12/2015" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
wordlist2dawg \- convert a wordlist to a DAWG for Tesseract
.SH "SYNOPSIS"
.sp
\fBwordlist2dawg\fR \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-t \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-r 1 \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-r 2 \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.sp
\fBwordlist2dawg\fR \-l <short> <long> \fIWORDLIST\fR \fIDAWG\fR \fIlang\&.unicharset\fR
.SH "DESCRIPTION"
.sp
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract\&. A DAWG is a compressed, space\- and time\-efficient representation of a word list\&.
.SH "OPTIONS"
.sp
\-t Verify that a given DAWG file is equivalent to a given wordlist\&.
.sp
\-r 1 Reverse a word if it contains an RTL character\&.
.sp
\-r 2 Reverse all words\&.
.sp
\-l <short> <long> Produce a file with several DAWGs in it, one each for words of length <short>, <short+1>,\&... <long>\&.
.SH "ARGUMENTS"
.sp
\fIWORDLIST\fR A plain text file in UTF\-8, one word per line\&.
.sp
\fIDAWG\fR The output DAWG to write\&.
.sp
\fIlang\&.unicharset\fR The unicharset of the language\&. This is the unicharset generated by mftraining(1)\&.
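.SH "EXAMPLES"
.sp
The invocations below are a minimal sketch of the forms listed in the synopsis\&. The file names eng\&.wordlist, eng\&.word\-dawg, and eng\&.unicharset are assumed placeholders chosen for illustration; substitute the files for your own language\&.
.sp
Build a DAWG from a wordlist:
.sp
.RS 4
.nf
wordlist2dawg eng\&.wordlist eng\&.word\-dawg eng\&.unicharset
.fi
.RE
.sp
Verify that an existing DAWG is equivalent to the wordlist:
.sp
.RS 4
.nf
wordlist2dawg \-t eng\&.wordlist eng\&.word\-dawg eng\&.unicharset
.fi
.RE
.sp
Build a DAWG in which only words containing an RTL character are reversed:
.sp
.RS 4
.nf
wordlist2dawg \-r 1 eng\&.wordlist eng\&.word\-dawg eng\&.unicharset
.fi
.RE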
.SH "SEE ALSO"
.sp
tesseract(1), combine_tessdata(1), dawg2wordlist(1)
.sp
\m[blue]\fBhttps://github\&.com/tesseract\-ocr/tesseract/wiki/TrainingTesseract\fR\m[]
.SH "COPYING"
.sp
Copyright (C) 2006 Google, Inc\&. Licensed under the Apache License, Version 2\&.0
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-present)\&.