tesseract/ChangeLog
Stefan Weil ee29fca9ce Create new release 5.0.0-rc3
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-11-22 22:01:18 +01:00

298 lines
15 KiB
Plaintext

2021-11-22 - V5.0.0-rc3
* Faster training and recognition by default (float instead of
double calculations)
* More options for binarization
* Improved support for ARM NEON
* Modernized code
* Removed proprietary data types like GenericVector and STRING
from public API
* pdf.ttf no longer needed, now integrated into the code
* Faster flat build with automake
* New options for combine_tessdata to show details of traineddata files
* Improved training messages
* Improved unit tests and fuzzing tests
* Lots of bug fixes
2019-07-07 - V4.1.0
* Added new renders Alto, LSTMBox, WordStrBox.
* Added character boxes in hOCR output.
* Added python training scripts (experimental) as alternative shell scripts.
* Better support AVX / AVX2 / SSE.
* Disable OpenMP support by default (see e.g. #1171, #1081).
* Fix for bounding box problem.
* Implemented support for whitelist/blacklist in LSTM engine.
* Improved cmake configuration.
* Code modernization and improvements.
* A lot of bug fixes...
2018-10-29 - V4.0.0
* Added new neural network system based on LSTMs, with major accuracy gains.
* Improvements to PDF rendering.
* Fixes to trainingdata rendering.
* Added LSTM models+lang models to 101 languages. (tessdata repository)
* Improved multi-page TIFF handling.
* Fixed damage to binary images when processing PDFs.
* Fixes to training process to allow incremental training from a recognition model.
* Made LSTM the default engine, pushed cube out.
* Deleted cube code.
* Changed OEModes --oem 0 for legacy tesseract engine, --oem 1 for LSTM, --oem 2 for both, --oem 3 for default.
* Avoid use of Leptonica debug parameters or functions.
* Fixed multi-language mode.
* Removed support for VS2010.
* Added Support for VS2015 and VS2017 with CPPAN.
* Implemented invisible text only for PDF.
* Added AVX / SSE support for Windows.
* Enabled OpenMP support.
* Parameter unlv_tilde_crunching change to false.
* Miscellaneous Fixes.
* Detailed Changelog can be found at https://tesseract-ocr.github.io/tessdoc/4.0x-Changelog.html and https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html#tesseract-release-notes-oct-29-2018---v400
2017-02-16 - V3.05.00
* Made some fine tuning to the hOCR output.
* Added TSV as another optional output format.
* Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method.
* text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer.
* Training tools - Replaced asserts with tprintf() and exit(1).
* Fixed Cygwin compatibility.
* Improved multipage tiff processing.
* Improved the embedded pdf font (pdf.ttf).
* Enable selection of OCR engine mode from command line.
* Changed tesseract command line parameter '-psm' to '--psm'.
* Write output of tesseract --help, --version and --list-langs to stdout instead of stderr.
* Added new C API for orientation and script detection, removed the old one.
* Increased minimum autoconf version to 2.59.
* Removed dead code.
* Require Leptonica 1.74 or higher.
* Fixed many compiler warning.
* Fixed memory and resource leaks.
* Fixed some issues with the 'Cube' OCR engine.
* Fixed some openCL issues.
* Added option to build Tesseract with CMake build system.
* Implemented CPPAN support for easy Windows building.
2016-02-17 - V3.04.01
* Added OSD renderer for psm 0. Works for single page and multi-page images.
* Improve tesstrain.sh script.
* Simplify build and run of ScrollView.
* Improved PDF output for OS X Preview utility.
* INCOMPATIBLE fix to hOCR line height information - commit 134ebc3.
* Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD).
* Enable OpenMP support.
* Many bug fixes.
2015-07-11 - V3.04.00
* Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting).
* Tesseract now requires leptonica 1.71 or a higher version.
* Removed official support for VS 2008.
* Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid
* Major updates to training system as a result of extensive testing on 100 languages.
* New training data for over 100 languages
* Improved performance with PIC compilation option.
* Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript.
* Improved font identification.
* Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc.
* Fixed problems with shifted baselines so recognition can recover from layout analysis errors.
* Major refactor to improve speed on difficult images, especially when running a heap checker.
* Moved params from global in page layout to tesseractclass.
* Improved single column layout analysis.
* Allow ocr output to multiple formats using tesseract command line executable.
* Fixed issues with mixed eng+ara scripts.
* Improved script consistency in numbers.
* Major refactor of control.cpp to enable line recognition.
* Added tesstrain.sh - a master training script.
* Added ability to text2image training tool to just list available fonts.
* Added ability to text2image to underline words.
* Improved efficiency of image processing for PDF output.
* Added parameter description for each parameter listed with 'print-parameters' command line option.
* Added font info to hOCR output.
* Enabled streaming input and output of multi-page documents.
* Many bug fixes.
2014-02-04 - V3.03(rc1)
* Added new training tool text2image to generate box/tif file pairs from
text and truetype fonts.
* Added support for PDF output with searchable text.
* Removed entire IMAGE class and all code in image directory.
* Tesseract executable: support for output to stdout; limited support for one
page images from stdin (especially on Windows)
* Added Renderer to API to allow document-level processing and output
of document formats, like hOCR, PDF.
* Major refactor of word-level recognition, beam search, eliminating dead code.
* Refactored classifier to make it easier to add new ones.
* Generalized feature extractor to allow feature extraction from greyscale.
* Improved sub/superscript treatment.
* Improved baseline fit.
* Added set_unicharset_properties to training tools.
* Many bug fixes.
* More training source data included.
2012-02-01 - V3.02
* Moved ResultIterator/PageIterator to ccmain.
* Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
* Added paragraph detection in layout analysis/post OCR.
* Fixed inconsistent xheight during training and over-chopping.
* Added simultaneous multi-language capability.
* Refactored top-level word recognition module.
* Added experimental equation detector.
* Improved handling of resolution from input images.
* Blamer module added for error analysis.
* Cleaned up externally used namespace by removing includes from baseapi.h.
* Removed dead memory mangagement code.
* Tidied up constraints on control parameters.
* Added support for ShapeTable in classifier and training.
* Refactored class pruner.
* Fixed training leaks and randomness.
* Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
* Improved line detection and removal.
* Added fixed pitch chopper for CJK.
* Added UNICHARSET to WERD_CHOICE to make mult-language handling easier.
* Fixed problems with internally scaled images.
* Added page and bbox to string in tr files to identify source of training data better.
* Fixes to Hindi Shiroreka splitter.
* Added word bigram correction.
* Reduced stack memory consumption and eliminated some ugly typedefs.
* Added new uniform classifier API.
* Added new training error counter.
* Fixed endian bug in dawg reader.
* Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
2010-11-29 - V3.01
* Removed old/dead serialise/deserialze methods on *LISTIZED classes.
* Total rewrite of DENORM to better encapsulate operation and make
for potential to extract features from images.
* Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads.
* Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
* `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube.
* Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
* Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, that are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, such as the `ETEXT_STRUCT` in particular are deprecated and will be deleted in the future.
* ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you have to have already bootstrapped the language with character boxes. "Cyclic dependency" on traineddata.
* Auto orientation and script detection added to page layout analysis.
* Deleted *lots* of dead code.
* Fixxht module replaced with scalable data-driven module.
* Output font characteristics accuracy improved.
* Removed the double conversion at each classification.
* Upgraded oldest structs to be classes and deprecated PBLOB.
* Removed non-deterministic baseline fit.
* Added fixed length dawgs for Chinese.
* Handling of vertical text improved.
* Handling of leader dots improved.
* Table detection greatly improved.
* Fixed a couple of memory leaks.
* Fixed font labels on output text. (Not perfect, but a lot better than before.)
* Cleanup and more bug fixes
* Special treatments for Hindi.
* Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)
2010-09-21 - V3.00
* Preparations for thread safety:
* Changed TessBaseAPI methods to be non-static
* Created a class hierarchy for the directories to hold instance data,
and began moving code into the classes.
* Moved thresholding code to a separate class.
* Added major new page layout analysis module.
* Added HOCR output (issues 221, 263: thanks to amkryukov).
* Added Leptonica as main image I/O and handling. Currently optional,
but in future releases linking with Leptonica will be mandatory.
* Ambiguity table rewritten to allow definite replacements in place
of fix_quotes.
* Added TessdataManager to combine data files into a single file.
* Some dead code deleted.
* VC++6 no longer supported. It can't cope with the use of templates.
* Many more languages added.
* Doxygenation of most of the function header comments.
* Added man pages.
* Added bash completion script (issue 247: thanks to neskiem)
* Fix integer overview in thresholding (issue 366: thanks to Cyanide.Drake)
* Add Danish Fraktur support (issues 300, 360: thanks to
dsl602230@vip.cybercity.dk)
* Fix file pointer leak (issue 359, thanks to yukihiro.nakadaira)
* Fix an error using user-words (Issue 345: thanks to max.markin)
* Fix a memory leak in tablefind.cpp (Issue 342, thanks to zdravco)
* Fix a segfault due to double fclose (Issue 320, thanks to souther)
* Fix an automake error (Issue 318, thanks to ichanjz)
* Fix a Win32 crash on fileFormatIsTiff() (Issues 304, 316, 317, 330, 347,
349, 352: thanks to nguyenq87, max.markin, zdenop)
* Fixed a number of errors in newer (stricter) versions of VC++ (Issues
301, among others)
2009-06-30 - V2.04
* Integrated bug fixes and patches and misc changes for portability.
* Integrated a patch to remove some of the "access" macros.
* Removed dependence on lua from the viewer, speeding it up
dramatically.
* Fixed the viewer so it compiles and runs properly!
* Specifically fixing issues: 1, 63, 67, 71, 76, 81, 82, 106, 111,
112, 128, 129, 130, 133, 135, 142, 143, 145, 147, 153, 154, 160,
165, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209, 108, 169
2008-04-22 - V2.03
* Fixed crash introduced in 2.02.
* Fixed lack of tessembedded.cpp in distribution.
* Added test for leptonica header files and conditional test for lib.
2008-04-21 - V2.02 (again)
* Fixed namespace collisions with jpeg library (INT32).
* Portability fixes for Windows for new code.
* Updates to autoconf system for new code.
2008-01-23 - V2.02
* Improvements to clustering, training and classifier.
* Major internationalization improvements for large-character-set
* languages, eg Kannada.
* Removed some compiler warnings.
* Added multipage tiff support for training and running.
* Updated graphics output to talk to new java-based viewer.
* Added ability to save n-best lists.
* Added leptonica support for more file types.
* Improved Init/End to make them safe.
* Reduced memory use of dictionaries.
* Added some new APIs to TessBaseAPI.
2007-08-27 - V2.01
* Fixed UTF8 input problems with box file reader.
* Fixed various infinite loops and crashes in dawg code.
* Removed include of config_auto.h from host.h.
* Added automatic wctype encoding to unicharset_extractor.
* Fixed dawg table too full error.
* Removed svn files from tarball.
* Added new functions to tessdll.
* Increased maximum utf8 string in a classification result to 8.
2007-07-02 - V2.00
* Converted internal character handling to UTF8.
* Trained with 6 languages.
* Added unicharset_extractor, wordlist2dawg.
* Added boxfile creation mode.
* Added UNLV regression test capability.
* Fixed problems with copyright and registered symbols.
* Fixed extern "C" declarations problem.
2007-05-15 - V1.04
* Added dll exports for Windows.
* Fixed name collisions with stl etc.
* Made some preliminary changes ready for unicodeization.
* Several bug fixes discovered during unicodeization.
2007-02-02 - V1.03
* Added mftraining and cntraining.
* Added baseapi with adaptive thresholding for grey and color.
* Fixed many memory leaks.
* Fixed several bugs including lack of use of adaptive classifier.
* Added ifdefs to eliminate graphics code and add embedded platform support.
* Incorporated several patches, including 64-bit builds, Mac builds.
* Minor accuracy improvements.
2006-10-04 - V1.02
* Removed dependency on Aspirin.
* Fixed a few missing Apache license headers.
* Removed $log.
2006-09-07 - V1.01.
* Added mfcpch.cpp and getopt.cpp for VC++.
* Fixed problem with greyscale images and no libtiff.
* Stopped debug window from being used for the usage output.
* Fixed load of inttemp for big-endian architectures.
* Fixed some Mac compilation issues.
2006-06-16 - V1.0 of open source Tesseract checked-in.