tesseract/ChangeLog

2014-02-04 - v3.03
  * Added new training tool text2image to generate box/tif file pairs from
    text and truetype fonts.
  * Added support for PDF output with searchable text.
  * Removed entire IMAGE class and all code in image directory.
  * Tesseract executable: support for output to stdout; limited support for one
    page images from stdin (especially on Windows).
  * Added Renderer to API to allow document-level processing and output
    of document formats, like hOCR, PDF.
  * Major refactor of word-level recognition, beam search, eliminating dead code.
  * Refactored classifier to make it easier to add new ones.
  * Generalized feature extractor to allow feature extraction from greyscale.
  * Improved sub/superscript treatment.
  * Improved baseline fit.
  * Added set_unicharset_properties to training tools.
  * Many bug fixes.
  * More training source data included.

2012-02-01 - v3.02
  * Moved ResultIterator/PageIterator to ccmain.
  * Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
  * Added paragraph detection in layout analysis/post OCR.
  * Fixed inconsistent xheight during training and over-chopping.
  * Added simultaneous multi-language capability.
  * Refactored top-level word recognition module.
  * Added experimental equation detector.
  * Improved handling of resolution from input images.
  * Blamer module added for error analysis.
  * Cleaned up externally used namespace by removing includes from baseapi.h.
  * Removed dead memory management code.
  * Tidied up constraints on control parameters.
  * Added support for ShapeTable in classifier and training.
  * Refactored class pruner.
  * Fixed training leaks and randomness.
  * Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
  * Improved line detection and removal.
  * Added fixed pitch chopper for CJK.
  * Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
  * Fixed problems with internally scaled images.
  * Added page and bbox to string in tr files to identify source of training data better.
  * Fixes to Hindi Shirorekha splitter.
  * Added word bigram correction.
  * Reduced stack memory consumption and eliminated some ugly typedefs.
  * Added new uniform classifier API.
  * Added new training error counter.
  * Fixed endian bug in dawg reader.
  * Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.

2010-11-29 - V3.01
  * Removed old/dead serialise/deserialise methods on *LISTIZED classes.
  * Total rewrite of DENORM to better encapsulate operation and make
    for potential to extract features from images.
  * Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
  * Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
  * `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube.
  * Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
  * Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, which are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, the `ETEXT_STRUCT` in particular, are deprecated and will be deleted in the future.
  * ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you must already have bootstrapped the language with character boxes ("cyclic dependency" on traineddata).
  * Auto orientation and script detection added to page layout analysis.
  * Deleted *lots* of dead code.
  * Fixxht module replaced with scalable data-driven module.
  * Output font characteristics accuracy improved.
  * Removed the double conversion at each classification.
  * Upgraded oldest structs to be classes and deprecated PBLOB.
  * Removed non-deterministic baseline fit.
  * Added fixed length dawgs for Chinese.
  * Handling of vertical text improved.
  * Handling of leader dots improved.
  * Table detection greatly improved.
  * Fixed a couple of memory leaks.
  * Fixed font labels on output text. (Not perfect, but a lot better than before.)
  * Cleanup and more bug fixes.
  * Special treatments for Hindi.
  * Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)

2010-09-21 - V3.00
  * Preparations for thread safety:
     * Changed TessBaseAPI methods to be non-static
     * Created a class hierarchy for the directories to hold instance data,
       and began moving code into the classes.
     * Moved thresholding code to a separate class.
  * Added major new page layout analysis module.
  * Added hOCR output (issues 221, 263: thanks to amkryukov).
  * Added Leptonica as main image I/O and handling. Currently optional,
    but in future releases linking with Leptonica will be mandatory.
  * Ambiguity table rewritten to allow definite replacements in place
    of fix_quotes.
  * Added TessdataManager to combine data files into a single file.
  * Some dead code deleted.
  * VC++6 no longer supported. It can't cope with the use of templates.
  * Many more languages added.
  * Doxygenation of most of the function header comments.
  * Added man pages.
  * Added bash completion script (issue 247: thanks to neskiem)
  * Fix integer overflow in thresholding (issue 366: thanks to Cyanide.Drake)
  * Add Danish Fraktur support (issues 300, 360: thanks to 
    dsl602230@vip.cybercity.dk)
  * Fix file pointer leak (issue 359, thanks to yukihiro.nakadaira)
  * Fix an error using user-words (Issue 345: thanks to max.markin)
  * Fix a memory leak in tablefind.cpp (Issue 342, thanks to zdravco)
  * Fix a segfault due to double fclose (Issue 320, thanks to souther)
  * Fix an automake error (Issue 318, thanks to ichanjz)
  * Fix a Win32 crash on fileFormatIsTiff() (Issues 304, 316, 317, 330, 347,
    349, 352: thanks to nguyenq87, max.markin, zdenop)
  * Fixed a number of errors in newer (stricter) versions of VC++ (Issues 
    301, among others)

2009-06-30 - V2.04
  * Integrated bug fixes and patches and misc changes for portability.
  * Integrated a patch to remove some of the "access" macros.
  * Removed dependence on lua from the viewer, speeding it up
    dramatically.
  * Fixed the viewer so it compiles and runs properly!
  * Specifically fixing issues: 1, 63, 67, 71, 76, 81, 82, 106, 111,
    112, 128, 129, 130, 133, 135, 142, 143, 145, 147, 153, 154, 160,
    165, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209, 108, 169

2008-04-22 - V2.03
  * Fixed crash introduced in 2.02.
  * Fixed lack of tessembedded.cpp in distribution.
  * Added test for leptonica header files and conditional test for lib.

2008-04-21 - V2.02 (again)
  * Fixed namespace collisions with jpeg library (INT32).
  * Portability fixes for Windows for new code.
  * Updates to autoconf system for new code.

2008-01-23 - V2.02
  * Improvements to clustering, training and classifier.
  * Major internationalization improvements for large-character-set
    languages, e.g. Kannada.
  * Removed some compiler warnings.
  * Added multipage tiff support for training and running.
  * Updated graphics output to talk to new java-based viewer.
  * Added ability to save n-best lists.
  * Added leptonica support for more file types.
  * Improved Init/End to make them safe.
  * Reduced memory use of dictionaries.
  * Added some new APIs to TessBaseAPI.

2007-08-27 - V2.01
  * Fixed UTF8 input problems with box file reader.
  * Fixed various infinite loops and crashes in dawg code.
  * Removed include of config_auto.h from host.h.
  * Added automatic wctype encoding to unicharset_extractor.
  * Fixed dawg table too full error.
  * Removed svn files from tarball.
  * Added new functions to tessdll.
  * Increased maximum utf8 string in a classification result to 8.

2007-07-02 - V2.00
  * Converted internal character handling to UTF8.
  * Trained with 6 languages.
  * Added unicharset_extractor, wordlist2dawg.
  * Added boxfile creation mode.
  * Added UNLV regression test capability.
  * Fixed problems with copyright and registered symbols.
  * Fixed extern "C" declarations problem.

2007-05-15 - V1.04
  * Added dll exports for Windows.
  * Fixed name collisions with stl etc.
  * Made some preliminary changes ready for unicodeization.
  * Several bug fixes discovered during unicodeization.

2007-02-02 - V1.03
  * Added mftraining and cntraining.
  * Added baseapi with adaptive thresholding for grey and color.
  * Fixed many memory leaks.
  * Fixed several bugs including lack of use of adaptive classifier.
  * Added ifdefs to eliminate graphics code and add embedded platform support.
  * Incorporated several patches, including 64-bit builds, Mac builds.
  * Minor accuracy improvements.

2006-10-04 - V1.02
  * Removed dependency on Aspirin.
  * Fixed a few missing Apache license headers.
  * Removed $log.

2006-09-07 - V1.01
  * Added mfcpch.cpp and getopt.cpp for VC++.
  * Fixed problem with greyscale images and no libtiff.
  * Stopped debug window from being used for the usage output.
  * Fixed load of inttemp for big-endian architectures.
  * Fixed some Mac compilation issues.

2006-06-16 - V1.0 of open source Tesseract checked-in.