tesseract/ReleaseNotes
theraysmith@gmail.com e0d735b122 Remaining misc changes for 3.02
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@658 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-02 03:14:43 +00:00

270 lines
13 KiB
Plaintext

= Tesseract release notes Feb 01 2012 - V3.02 =
* Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
* Added paragraph detection in layout analysis/post OCR.
* Added simultaneous multi-language capability.
* Added experimental equation detector.
* Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
* Improved line detection and removal.
* Added fixed pitch chopper for CJK.
* Added word bigram correction.
* Added new uniform classifier API.
* Added new training error counter.
* More detailed changes recorded in ChangeLog.
= Tesseract release notes Oct 21 2011 - V3.01 =
* Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads.
* Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
* `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube.
* Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
* Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, that are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, such as the `ETEXT_STRUCT` in particular are deprecated and will be deleted in the future.
* ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you have to have already boostrapped the language with character boxes. "Cyclic dependency" on traineddata.
* Auto orientation and script detection added to page layout analysis.
* Deleted *lots* of dead code.
* Fixxht module replaced with scalable data-driven module.
* Output font characteristics accuracy improved.
* Removed the double conversion at each classification.
* Upgraded oldest structs to be classes and deprecated PBLOB.
* Removed non-deterministic baseline fit.
* Added fixed length dawgs for Chinese.
* Handling of vertical text improved.
* Handling of leader dots improved.
* Table detection greatly improved.
* Fixed a couple of memory leaks.
* Fixed font labels on output text. (Not perfect, but a lot better than before.)
* Cleanup and more bug fixes
* Special treatments for Hindi.
* Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)
Tesseract release notes Sep 30 2010 - V3.00
* Preparations for thread safety:
* Changed TessBaseAPI methods to be non-static
* Created a class hierarchy for the directories to hold instance data,
and began moving code into the classes.
* Moved thresholding code to a separate class.
* Added major new page layout analysis module.
* Added HOCR output.
* Added Leptonica as main image I/O and handling. Currently optional,
but in future releases linking with Leptonica will be mandatory.
* Ambiguity table rewritten to allow definite replacements in place
of fix_quotes.
* Added TessdataManager to combine data files into a single file.
* Some dead code deleted.
* VC++6 no longer supported. It can't cope with the use of templates.
* Many more languages added.
* Doxygenation of most of the function header comments.
Tesseract release notes June 30 2009 - V2.04
Integrated patches for portability and to remove some of the
"access" macros.
Removed dependence on lua from the viewer making it a *lot*
faster. Also the viewer now compiles and works (on Linux.)
Fixed the following issues:
1, 63, 67, 71, 76, 79, 81, 82, 84, 106, 108, 111, 112, 128, 129, 130, 133, 135,
142, 143, 145, 146, 147, 153, 154, 160, 165, 169, 170, 175, 177, 187, 192,
195, 199, 201, 205, 209.
This is the last version to support VC++6!
This may also be the last version to compile without leptonica!
Windows version now outputs to stderr by default, fixing a lot of the problems with lack of visible meaningful error messages.
Tesseract release notes April 22 2008 - V2.03
2.02 was unrunnable, due to a last-minute "simple" change.
2.03 fixes the problem and also adds an include check for leptonica
to make it more usable.
Tesseract release notes April 21 2008 - V2.02
Improvements to clustering, training and classifier.
Major internationalization improvements for large-character-set
languages, eg Kannada.
Removed some compiler warnings.
Added multipage tiff support for training and running.
Updated graphics output to talk to new java-based viewer.
Added ability to save n-best lists.
Added leptonica support for more file types.
Improved Init/End to make them safe.
Reduced memory use of dictionaries.
Added some new APIs to TessBaseAPI.
Fixed namespace collisions with jpeg library (INT32).
Portability fixes for Windows for new code.
Updates to autoconf system for new code.
Tesseract release notes August 27 2007 - V2.01
Fixed UTF8 input problems with box file reader.
Fixed various infinite loops and crashes in dawg code.
Removed include of config_auto.h from host.h.
Added automatic wctype encoding to unicharset_extractor.
Fixed dawg table too full error.
Removed svn files from tarball.
Added new functions to tessdll.
Increased maximum utf8 string in a classification result to 8.
Tesseract release notes July 17, 2007 - V2.00
First release of the International version.
This version recognizes the following languages:
English - eng
French - fra
Italian - ita
German - deu
Spanish - spa
Dutch - nld
The language codes follow ISO 639-2. The default language is English.
To recognize another language:
tesseract inputimage outputbase -l langcode
To train on a new language, see separate documentation.
More languages will be appearing over time.
List of changes in this release:
Converted internal character handling to UTF8.
Trained with 6 languages.
Added unicharset_extractor, wordlist2dawg.
Added boxfile creation mode.
Added UNLV regression test capability.
Fixed problems with copyright and registered symbols.
Fixed extern "C" declarations problem.
Made some improvements to consistency of accuracy across platforms.
Added vc++ express support.
Instructions for downloading and building version 2.00.
Things have changed quite a bit since the previous versions so please read carefully.
*All users*
The tarballs are split into pieces.
tesseract-2.00.tar.gz contains all the source code.
tesseract-2.00.<lang>.tar.gt contains the data files for <lang>. You need at least one of these or tesseract will not work.
tesseract-2.00.exe.tar.gz is not for the 'exe' language. It is windows executables. They are built with VC++ express and come with absolutely no warranty. If they work for you then great, otherwise get visual C++ express (and the platform sdk) and build from the source.
*Non-windows users*
As with 1.04, this version works with make install.
*New* there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for the help.)
It might work with your OS if you know how to do that sort of thing.
If you are linking to the libraries, as with Ocropus, there is now a single master
library called libtesseract_full.a.
*Windows users*
If you are building from the sources, there are still dsw and dsp files for vc++6 and also
sln and vcproj files for vc++ express.
The dll has been updated to allow input of non-binary images. (Thanks to Glen of Jetsoft.)
Tesseract release notes May 15, 2007 - V1.04.
=== Windows users only ===
Added a dll interface for windows. Thanks to Glen at Jetsoft for contributing
this. To use the dll, include tessdll.h, import tessdll.lib and put tessdll.dll
somewhere where the system can find it. There is also a small dlltest program
to test the dll. Run with:
dlltest phototest.tif phototest.txt
It will output the text from phototest.tif with bounding box information.
**New for Windows** the distribution now includes tesseract.exe and tessdll.dll
which *might* work out of the box! There are no guarantees as you need
VC++6 versions of mfc and crt (at least) for it to work. (Batteries not
included, and certainly no installshield.)
== Important note for anyone building with make: i.e. anyone except devstudio
users ==
This release includes new standardization for the data directory. To enable
Tesseract to find its data files, you must either:
./configure
make
make install
to move the data files to the standard place, or:
export TESSDATA_PREFIX="directory in which your tessdata resides/"
(or equivalent) in your .profile or whatever or setenv to set the environment
variable. Note that the directory must end in a /
HAVING tesseract and tessdata IN THE SAME DIRECTORY DOES NOT WORK ANY MORE.
== All users ==
Fixed a bunch of name collisions - mostly with stl.
Made some preliminary changes for unicode compatibility. Includes a new data
file (unicharset) and renaming of the other data files to eng.* to support
different languages.
There are also several other minor bug fixes and portability improvements
for 64 bit, the latest visual studio compiler etc. Thanks to all who have
contributed these fixes.
NOTE: This is likely to be the last English-only release!
Apologies in advance to non-windows users for bloating the distribution with
windows executables. This will probably get fixed in the next release with
the multi-language capability, since that will also bloat the distribution.
Tesseract release notes Feb 2, 2007 - V1.03.
Added mftraining and cntraining. Using an image with a box file, tesseract
generates .tr output files. cntraining runs on the .tr files to make
normproto that lives in tessdata. mftraining runs on the .tr files to
make inttemp and pffmtable in tessdata. These are the main data files
that tesseract uses to recognize characters. At present, the code to make
dictionary files is not yet available, nor are any sample box files or
rebuilt inttemp or documentation to create any of these. Recognition is
still limited to the ASCII set, but when this problem is fixed, documentation
will follow.
Added a new API with adaptive thresholding for grey and color images.
See ccmain/baseapi.h/cpp for details. The main program has been converted
to use the API as an example. See main() in ccmain/tesseractmain.cpp for
details. The API is designed to make it easy to add subclasses with ability
to output the bounding boxes etc from the internal structures. The adaptive
thresholding improves accuracy (most of the time) on non-binary images.
Many memory leaks have been fixed. There are no known leaks left from using
the API correctly.
The adaptive classifier was not operating correctly. This bug, and several
others have been fixed, including poor chopping, an indefinite (if not quite
infinite) loop in the number parser, and a couple of crash bugs. Thanks to
all that have contributed bugs and bug fixes.
It is now possible to build without any of the graphics support to save code
size using #define GRAPHICS_DISABLED. There is also a new EMBEDDED define
for use on operating systems with limited library support.
64-bit and Mac OSX buildability is now included in the mainline source tree.
Thanks to all that have contributed patches and comments to help with that.
1.03 is also endian-independent, apart from the tiff i/o, so if you use
libtiff, the code should run on all platforms, even if you get/create new
data files of a different endinanness.
Some of the bug fixes improve accuracy, and so do some of the changes to
DangAmbigs and user-words.
Tesseract release notes, Oct 4 2006 - V1.02.
Removed dependency on aspirin. *All* code is now licensed under Apache2.0.
Tesseract release notes, Sep 7 2006 - V1.01.
Fixes for this release:
Added mfcpch.cpp and getopt.cpp for VC++.
Fixed problem with greyscale images and no libtiff.
Stopped debug window from being used for the usage output.
Fixed load of inttemp for big-endian architectures.
Fixed some Mac compilation issues.
This version should read uncompressed 8 bit grey and 24 bit color tiffs
without having to have libtiff. It does a dumb threshold though, so don't
expect good results from poor contrast or images of natural scenes etc.
If you just run tesseract with no command line args you should now get a
sensible usage message on linux, with or without X-windows.
If you can get it to compile on a PPC Mac, it may now run correctly,
although not all the build issues are fixed yet.
Building Tesseract:
Windows:
Unpack the tar.gz archive
Open tesseract.dsw in DevStudio (preferably version 6, higher versions will be more difficult)
Set Win32 - Release as the active configuration.
Build.
Copy tesseract.exe from bin.rel up one directory level.
Run tesseract phototest.tif phototest
This will create phototest.txt.
Linux:
Unpack the tar.gz archive
./configure
make
Copy tesseract from ccmain up one directory level (or create a symbolic link)
Run tesseract phototest.tif phototest
This will create phototest.txt.