This document is intended for developers who want to use Microsoft Visual Studio 2008 with Tesseract-OCR. If you simply want to run tesseract or its various language training applications, see the ReadMe, which explains how to download tesseract’s Windows installer.
Tesseract-OCR consists of:
libtesseract – the static (or dynamic) library that does all the actual work. As of February 2012 it consists of 260+ C++ files along with 290+ header files.
tesseract.exe – the command-line OCR engine. It’s built from a single, small C++ file that just calls functions in libtesseract. Documentation on using tesseract.exe is currently sparse; what exists is in the repository’s doc subdirectory.
Language packs – needed by tesseract.exe in order to recognize particular languages.
Language training applications – used to teach tesseract.exe new languages. Each has its own (very brief) man page in the doc subdirectory. They include:
ambiguous_words.exe – generate sets of words Tesseract is likely to find ambiguous
classifier_tester.exe – test a Tesseract character classifier on data formatted for training
cntraining.exe – character normalization training
combine_tessdata.exe – combine/extract/overwrite Tesseract data
dawg2wordlist.exe – convert a Tesseract DAWG to a wordlist
mftraining.exe – feature training
shapeclustering.exe – shape clustering training
unicharset_extractor.exe – extract unicharset from Tesseract boxfiles
wordlist2dawg.exe – convert a wordlist to a DAWG
Their use is described in the TrainingTesseract3 Wiki page.
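Because documentation for tesseract.exe is thin, a short scripted invocation may help show how the command-line engine and a language pack fit together. The sketch below assumes the usual `tesseract imagename outputbase [-l lang]` calling convention and that the recognized text is written to `outputbase.txt`; running `tesseract` with no arguments should print the usage message for your build. The helper names `build_tesseract_command` and `run_tesseract` are hypothetical and are not part of Tesseract.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang=None, exe="tesseract"):
    """Return the argument list for one tesseract.exe run.

    Assumes the `tesseract imagename outputbase [-l lang]` convention.
    This helper is illustrative only; it is not part of Tesseract.
    """
    command = [exe, image_path, output_base]
    if lang:
        # -l selects the language pack, e.g. "eng"
        command += ["-l", lang]
    return command


def run_tesseract(image_path, output_base, lang=None, exe="tesseract"):
    """Run tesseract if it is on PATH and return the expected text file name."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} was not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang, exe),
        check=True,  # raise CalledProcessError if tesseract reports failure
    )
    # Assumption: the engine appends ".txt" to the output base name.
    return output_base + ".txt"
```

Under these assumptions, `run_tesseract("phototest.tif", "phototest", lang="eng")` would ask the engine to recognize `phototest.tif` with the English language pack and return the name `phototest.txt`.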
This document explains how to:
Set up the proper directory structure required to use the supplied Visual Studio 2008 Solution