Overview

This document is intended for developers who want to use Microsoft Visual Studio 2008 with Tesseract-OCR. If you simply want to run tesseract or its various language training applications, see the ReadMe instead, which explains how to download tesseract’s Windows installer.

Tesseract-OCR consists of:

libtesseract – the static (or dynamic) library that does all the actual work. As of February 2012 it consists of 260+ C++ files along with 290+ header files.
tesseract.exe – the command-line OCR engine. It’s built from a single, small C++ file that just calls functions in libtesseract. Documentation on how to use tesseract.exe is currently sparse, but what exists is in the repository’s doc subdirectory.
Language packs – needed by tesseract.exe in order to recognize particular languages.
Language training applications – used to teach tesseract.exe new languages. Each has its own (very brief) man page in the doc subdirectory. They include:
- ambiguous_words.exe – generate sets of words Tesseract is likely to find ambiguous
- classifier_tester.exe – test a Tesseract character classifier on data formatted for training
- cntraining.exe – character normalization training
- combine_tessdata.exe – combine/extract/overwrite Tesseract data
- dawg2wordlist.exe – convert a Tesseract DAWG to a wordlist
- mftraining.exe – feature training
- shapeclustering.exe – shape clustering training
- unicharset_extractor.exe – extract unicharset from Tesseract boxfiles
- wordlist2dawg.exe – convert a wordlist to a DAWG
Their use is described in the TrainingTesseract3 Wiki page.

This document explains how to:

Set up the directory structure required to use the supplied Visual Studio 2008 Solution
Build libtesseract, tesseract.exe, and the training apps
Write programs that link with libtesseract
