Overview

This document is intended for developers who want to use Microsoft Visual Studio 2008 with Tesseract-OCR. If you simply want to run tesseract or its various language training applications, see the ReadMe, which explains how to download tesseract’s Windows installer.

Tesseract-OCR consists of:

  • libtesseract – the static (or dynamic) library that does all the actual work. As of February 2012 it consists of 260+ C++ files along with 290+ header files.

  • tesseract.exe – the command-line OCR engine. It’s built from a single, small C++ file that just calls functions in libtesseract. Documentation on how to use tesseract.exe is currently sparse; what exists is in the repository’s doc subdirectory.

  • Language packs – needed by tesseract.exe in order to recognize particular languages.

  • Language training applications – used to teach tesseract.exe new languages. Each has its own (very brief) man page in the doc subdirectory. They include:

    • ambiguous_words.exe – generate sets of words Tesseract is likely to find ambiguous

    • classifier_tester – tests a Tesseract character classifier on data as formatted for training

    • cntraining.exe – character normalization training

    • combine_tessdata.exe – combine/extract/overwrite Tesseract data

    • dawg2wordlist.exe – convert a Tesseract DAWG to a wordlist

    • mftraining.exe – feature training

    • shapeclustering.exe – shape clustering training

    • unicharset_extractor.exe – extract unicharset from Tesseract boxfiles

    • wordlist2dawg.exe – convert a wordlist to a DAWG

    Their use is described in the TrainingTesseract3 Wiki page.

This document explains how to:

  • Set up the proper directory structure required to use the supplied Visual Studio 2008 Solution

  • Build libtesseract, tesseract.exe, and the training apps

  • Write programs that link with libtesseract