The Visual Studio 2008 solutions included with Tesseract-OCR rely on relative paths to reference files and directories, including locations outside the tesseract-3.0x tree. It is therefore vitally important to set up the directories for the various components correctly. This section describes how to do this.
First, create an empty directory where you will unpack all the required downloads. The rest of this section assumes you call this directory C:\BuildFolder.
Download the Leptonica 1.68 pre-built binary package (leptonica-1.68-win32-lib-include-dirs.zip) from:
and unpack it to C:\BuildFolder.
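Creating the folder and unpacking an archive into it can also be scripted. The following is a minimal sketch using Python's standard zipfile module; the function name unpack_into and the download location under C:\Downloads are illustrative assumptions, not part of the Leptonica or Tesseract-OCR packages.

```python
import zipfile
from pathlib import Path


def unpack_into(archive, build_folder):
    """Extract a .zip archive into build_folder, creating the folder if needed.

    Returns the sorted top-level names found in the archive.
    """
    build_folder = Path(build_folder)
    build_folder.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(build_folder)
        # First path component of every member (zip members always use "/").
        top_level = {name.split("/", 1)[0] for name in zf.namelist() if name}
    return sorted(top_level)


if __name__ == "__main__":
    # Hypothetical download location; adjust to wherever you saved the file.
    archive = Path(r"C:\Downloads\leptonica-1.68-win32-lib-include-dirs.zip")
    if archive.exists():
        print(unpack_into(archive, r"C:\BuildFolder"))
```

The same helper could be reused for the other archives that are unpacked into C:\BuildFolder later in this section.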
Leptonica, even on Windows as of v1.68, still requires a few Unix utilities (such as rm, diff, and sleep). The easiest way to deal with this is to follow the instructions at Installing Cygwin coreutils.
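Once the utilities are installed, a quick check can confirm that they are reachable. This sketch assumes the utilities are expected to be found through PATH, and it only looks for the three named in the text, which may not be the complete set Leptonica needs.

```python
import shutil

# The utilities named in the text; that list may not be exhaustive.
REQUIRED_TOOLS = ("rm", "diff", "sleep")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed utilities were found on PATH.")
```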
At this point, if all you want to do is link with libtesseract you can download the file that just contains the “public” Tesseract-OCR headers along with the precompiled library binaries for Windows. Unpack it to C:\BuildFolder and you’ll now have:
C:\BuildFolder\
  include\
    leptonica\
    tesseract\
    leptonica_versionnumbers.vsprops
    tesseract_versionnumbers.vsprops
  lib\
    giflib416-static-mtdll-debug.lib
    giflib416-static-mtdll.lib
    libjpeg8c-static-mtdll-debug.lib
    libjpeg8c-static-mtdll.lib
    liblept168-static-mtdll-debug.lib
    liblept168-static-mtdll.lib
    liblept168.dll
    liblept168.lib
    liblept168d.dll
    liblept168d.lib
    libpng143-static-mtdll-debug.lib
    libpng143-static-mtdll.lib
    libtesseract302.dll
    libtesseract302.lib
    libtesseract302d.dll
    libtesseract302d.lib
    libtesseract302-static.lib
    libtesseract302-static-debug.lib
    libtiff394-static-mtdll-debug.lib
    libtiff394-static-mtdll.lib
    zlib125-static-mtdll-debug.lib
    zlib125-static-mtdll.lib
and you can skip the rest of this page and go directly to Programming with libtesseract.
The recommended action, however, is to download the Tesseract-OCR sources and build them yourself. Therefore...
Download the Tesseract-OCR Visual Studio 2008 source files from the downloads page. If, for example, you’d like to build v3.02 you would use the following link:
Unpack the file to C:\BuildFolder
You would now have the following directory structure:
C:\BuildFolder\
  include\
    leptonica\
    leptonica_versionnumbers.vsprops
    tesseract_versionnumbers.vsprops
  lib\
    giflib416-static-mtdll-debug.lib
    giflib416-static-mtdll.lib
    libjpeg8c-static-mtdll-debug.lib
    libjpeg8c-static-mtdll.lib
    liblept168-static-mtdll-debug.lib
    liblept168-static-mtdll.lib
    liblept168.dll
    liblept168.lib
    liblept168d.dll
    liblept168d.lib
    libpng143-static-mtdll-debug.lib
    libpng143-static-mtdll.lib
    libtiff394-static-mtdll-debug.lib
    libtiff394-static-mtdll.lib
    zlib125-static-mtdll-debug.lib
    zlib125-static-mtdll.lib
  tesseract-3.02\
    vs2008\
      ambiguous_words\
      classifier_tester\
      cntraining\
      combine_tessdata\
      dawg2wordlist\
      doc\
      include\
      libtesseract\
        libtesseract.vcproj
      mftraining\
      port\
      shapeclustering\
      sphinx\
      tesseract\
        tesseract.vcproj
      unicharset_extractor\
      wordlist2dawg\
      tesseract.sln
      tesshelper.py
Download the Tesseract-OCR source files for the same version as the VS2008 files you just unpacked. In this case, the proper link would be:
Unpack the file to C:\BuildFolder
This will add a bunch of directories to your already existing C:\BuildFolder\tesseract-3.0x directory. You should now have (for v3.02):
C:\BuildFolder\
  include\
    leptonica\
  lib\
  tesseract-3.02\
    api\
    ccmain\
    ccstruct\
    ccutil\
    classify\
    config\
    contrib\
    cube\
    cutil\
    dict\
    doc\
    image\
    java\
    neural_networks\
    tessdata\
    testing\
    textord\
    training\
    viewer\
    vs2008\
    wordrec\
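Because the solutions depend on this exact layout, it can be worth spot-checking it before opening Visual Studio. The sketch below checks a handful of entries taken from the listings above. The nesting it uses (for example tesseract.sln and tesshelper.py sitting inside vs2008, and leptonica inside include) is an assumption about how the listings are meant to be read, and the helper name missing_entries is hypothetical.

```python
from pathlib import Path

# A few directories from the listings above, relative to the build folder.
# This is a spot check, not a complete inventory.
EXPECTED_DIRS = (
    "include/leptonica",
    "lib",
    "tesseract-3.02/api",
    "tesseract-3.02/ccmain",
    "tesseract-3.02/vs2008",
)
EXPECTED_FILES = (
    "tesseract-3.02/vs2008/tesseract.sln",
    "tesseract-3.02/vs2008/tesshelper.py",
)


def missing_entries(build_folder, dirs=EXPECTED_DIRS, files=EXPECTED_FILES):
    """Return the expected directories and files that are absent."""
    root = Path(build_folder)
    missing = [d for d in dirs if not (root / d).is_dir()]
    missing += [f for f in files if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    for entry in missing_entries(r"C:\BuildFolder"):
        print("missing:", entry)
```

An empty result only means the sampled entries exist; it does not prove that every archive was unpacked completely.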
If you are planning on writing applications that link with Tesseract-OCR, and you don’t want to add all the tesseract-3.0x directories to your project’s list of include directories, then do this additional step:
Copy all the required headers to the “public” include folder.
If you already have a C:\BuildFolder\include\tesseract directory you should delete it in case some of the files have been removed.
Then use the tesshelper.py Python script to copy (possibly updated versions of) the required headers by doing:
cd C:\BuildFolder\tesseract-3.02\vs2008
python tesshelper.py .. copy ..\..\include
See The tesshelper.py Python script for more details.
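The delete-then-copy sequence can be wrapped in a small helper so that the stale directory is never forgotten. This is only a sketch: the function name refresh_public_headers is hypothetical, it passes tesshelper.py exactly the arguments shown in the command above, and it assumes the Python interpreter running the sketch is one that tesshelper.py supports.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def refresh_public_headers(build_folder, version="3.02", run=subprocess.run):
    """Delete the public tesseract include folder, then rerun the copy command.

    `run` defaults to subprocess.run and exists only so the call can be
    replaced in tests.
    """
    build_folder = Path(build_folder)
    vs2008 = build_folder / f"tesseract-{version}" / "vs2008"
    stale = build_folder / "include" / "tesseract"

    # Remove the old copy so headers dropped from the sources do not linger.
    if stale.is_dir():
        shutil.rmtree(stale)

    # Same arguments as the manual command, relative to the vs2008 directory.
    include_dir = str(Path("..", "..", "include"))
    command = [sys.executable, "tesshelper.py", "..", "copy", include_dir]
    run(command, cwd=vs2008, check=True)
    return command


if __name__ == "__main__":
    if Path(r"C:\BuildFolder").is_dir():
        refresh_public_headers(r"C:\BuildFolder")
```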
You are now ready to build Tesseract-OCR using Visual Studio 2008.
If you’d like to try the absolute latest version of Tesseract-OCR, here’s how to download the source files from its SVN repository:
Follow Steps 1 and 2 above (create C:\BuildFolder and unpack the Leptonica package into it).
Check out the Tesseract-OCR sources to a directory on your computer. This directory should not be C:\BuildFolder!
If you are unfamiliar with SVN, the easiest way to do this is to first download and install TortoiseSVN and then:
Right-click the (empty) directory where you want the working copy and choose SVN Checkout... from the pop-up menu.
Enter http://tesseract-ocr.googlecode.com/svn/trunk/ for URL of repository. You can keep all the other settings at their defaults.
Click the OK button to commence downloading the Tesseract-OCR sources to your computer. This might take a while as the language data in the tessdata directory is quite large. As of February 2012, about 335MB needs to be transferred for the initial checkout. The total size of the resulting working copy is about 1.2GB.
Keeping your working copy up to date after this is as simple as right-clicking its directory and choosing SVN Update. Unlike the initial checkout, this will usually finish very quickly.
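If you prefer a script to the TortoiseSVN menus, the same two operations can be expressed with the command-line svn client. The sketch below only builds the command lines. It assumes a command-line svn executable is installed and on PATH, which the instructions above do not require, and the working-copy path is a made-up example.

```python
# Repository URL quoted in the text above.
TRUNK_URL = "http://tesseract-ocr.googlecode.com/svn/trunk/"


def checkout_command(working_copy, url=TRUNK_URL):
    """Command-line counterpart of TortoiseSVN's "SVN Checkout..."."""
    return ["svn", "checkout", url, str(working_copy)]


def update_command(working_copy):
    """Command-line counterpart of TortoiseSVN's "SVN Update"."""
    return ["svn", "update", str(working_copy)]


if __name__ == "__main__":
    # Hypothetical working-copy location; it must not be the build folder.
    working_copy = r"C:\tesseract-svn"
    print(" ".join(checkout_command(working_copy)))
    print(" ".join(update_command(working_copy)))
    # To actually run one of them, pass the list to
    # subprocess.run(..., check=True).
```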
Copy the contents of your working directory, except for the tessdata directory, to C:\BuildFolder\tesseract-3.0x, where x should probably be the latest stable release + alpha, beta, etc.
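A copy that leaves out tessdata is easy to get wrong by hand, so here is one possible way to script it. The helper name copy_sources is hypothetical, the sketch needs Python 3.8 or later for the dirs_exist_ok argument, and it skips only the top-level tessdata directory of the working copy.

```python
import shutil
from pathlib import Path


def copy_sources(working_copy, target, skip=("tessdata",)):
    """Copy working_copy into target, leaving out top-level entries in skip."""
    working_copy = Path(working_copy).resolve()

    def ignore(directory, names):
        # Only filter at the top level of the working copy, so a nested
        # directory that happens to share a skipped name is still copied.
        if Path(directory).resolve() == working_copy:
            return [name for name in names if name in skip]
        return []

    # dirs_exist_ok lets the copy merge into an existing target directory.
    shutil.copytree(working_copy, target, ignore=ignore, dirs_exist_ok=True)


if __name__ == "__main__":
    # Hypothetical working-copy location; the target keeps the 3.0x
    # placeholder from the text, so substitute the real directory name.
    source = Path(r"C:\tesseract-svn")
    if source.is_dir():
        copy_sources(source, r"C:\BuildFolder\tesseract-3.0x")
```

Passing skip=("tessdata", ".svn") would also leave out Subversion's top-level metadata directory; whether that is desirable is not stated in the instructions.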
Optionally, follow Step 5 from above (copying the required headers to the “public” include folder).
You’ll probably want to set an environment variable named TESSDATA_PREFIX to point at your working copy directory (since that now contains the latest tessdata directory).
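For quick experiments, the variable can be set for a single child process instead of system-wide. The sketch below builds such an environment. The helper name env_with_tessdata_prefix and the working-copy path are assumptions for illustration, and the value is passed through exactly as given.

```python
import os


def env_with_tessdata_prefix(working_copy, base_env=None):
    """Return a copy of the environment with TESSDATA_PREFIX set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = str(working_copy)
    return env


if __name__ == "__main__":
    # Hypothetical working-copy location.
    env = env_with_tessdata_prefix(r"C:\tesseract-svn")
    print(env["TESSDATA_PREFIX"])
    # Pass env=env to subprocess.run(...) when launching a program that
    # should see the variable.
```

This affects only processes started with that environment. To make the setting permanent, define the variable in the Windows environment variable settings instead.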
If someone hasn’t already done so, you have to proceed to Updating the VS2008 directory for new releases of Tesseract-OCR. You can skip all the steps that relate to updating the version number. Otherwise, depending on how many changes have been made since the last stable release, you may have little or no work to do.