From d1dd5ed3589f68f616f6b789c2a5d0564e2f683c Mon Sep 17 00:00:00 2001 From: Stefan Weil Date: Sun, 25 May 2025 09:09:07 +0200 Subject: [PATCH] Use links to the git history and online release notes instead of local ChangeLog Signed-off-by: Stefan Weil --- ChangeLog | 467 +----------------------------------------------------- 1 file changed, 4 insertions(+), 463 deletions(-) diff --git a/ChangeLog b/ChangeLog index 9e7ec162c..87f713a22 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,466 +1,7 @@ -2024-11-10 - V5.5.0 -* Set hOCR capabilities ocrp_dir and ocrp_lang unconditionally. -* Calculate row bounding box in single-word mode per (issue #4304). -* Reduce clock syscalls (#4303). -* Several small performance and other code fixes. -* Modernized code. -* Print time for tessedit_timing_debug in milliseconds. -* Print time for ErrorCounter::ComputeErrorRate in milliseconds. -* cmake: Correctly set the soversion based on SemVer properties. -* Do not export PDBs for static libraries (issue #4279). -* Several other small fixes and improvements for builds and CI. -* Modernize code for renderers and remove filename conversion for Windows (#4330). -* Add build rule for Windows installer. -* Support symbolic values for --oem and --psm options. -* Remove Tensorflow support. -* Add RISC-V V support (#4346). -* Remove broken GitHub action msys2-4.1.1. +The ChangeLog for all releases from 1.0 (2006-06-16) up to 5.0.0 (2024-11-10) +is available here: -2024-06-11 - V5.4.1 -* Avoid FP overflow in NormEvidenceOf (fixes issue #4257) (#4259) -* Small build fixes and code improvements (#4262, #4263, #4266, #4267) +https://github.com/tesseract-ocr/tesseract/blob/64eab6c457b2337dd690746a5fde5c222b40d5f8/ChangeLog -2024-06-06 - V5.4.0 -* Small build fixes and code improvements - (#4241, #4243, #4244, #4245, #4246, #4248, #4249, #4250, #4253) +See https://github.com/tesseract-ocr/tesseract/releases for the latest release notes. -2024-05-19 - V5.4.0-rc2 -* Fix setup of datadir on installations with Conda (issue #4230) (#4240) -* Fix FP exception in Wordrec::angle_change (issue #4242) (#4243) - -2024-05-12 - V5.4.0-rc1 -* Build fixes, code refactoring and other smaller changes. -* Fix grey result of indexed PNG in pdfrenderer. -* Rename frk -> deu_latf (ISO 639-3, ISO 15924). -* Remove broken Dockerfile. -* Fixes for several issues reported by Coverity Scan. -* Remove unsupported OpenCL code and related API functions (#4220). -* Facilitate vectorization for generic build (#4223). -* Add PAGE XML renderer / export (#4214). -* Support training without lstmf files. -* Improve CCUtil::main_setup (fixes issue #4230 related to Coda). -* Allow for text angle/gradient to be retrieved (#4070). - -2024-01-18 - V5.3.4 -* Fixes for scrollview -* Fixes for autoconf, clang and sw builds -* Improve OCR for an image URL - * Fail on curl download errors - * New parameter curl_cookiefile - * Set User-Agent: header field in HTTP request for curl downloads -* Output directory list from "combine_tessdata -d" to stdout -* Other small improvements for code and documentation. - -2023-10-05 - V5.3.3 -* Small code fixes and improvements to fix Coverity Scan issues. -* Disable -mfpu=neon for aarch64. -* Fix build without git clone in cloned directory (required for FreeBSD). -* Other build fixes for autotools, cmake and sw. -* Fix regression in layout detection which was introduced in release 5.0.0. -* Fix regression which prevented loading of submodels, introduced in release 5.0.0-rc2. -* Other small improvements for code and documentation. - -2023-07-11 - V5.3.2 -* Updates for snap package building. -* Support for Sgaw and W Pwo Karen languages in the Myanmar validator (#4065). -* Improve format of logging from lstmtraining. -* Use less digits in filenames of checkpoints written by lstmtraining. -* Replace deprecated sprintf. -* Remove unused code in function fix_rep_char. -* Avoid 32 bit overflow in multiplication (fixes 3 CodeQL CI alerts). -* Avoid conversions from std::string to char* to std::string. -* Abort with error message if OSD is requested with LSTM-only model. -* cmake: allow to disable tiff (-DDISABLE_TIFF=ON). -* cmake: provide info about disabled LibArchive and CURL. -* cmake: check if leptonica was build with tiff support. -* Remove old broken GitHub action vcpkg-4.1.1 (fixes issue #4078). -* Create config.yml. -* Fix typos. - -2023-04-01 - V5.3.1 - * Bug fixes for some special scenarios: - * Fix issue #4010. - * textord: Catch empty rows in block iterator (fixes #4039). - * Fix FP division by zero (issue #3995). - * Improve documentation and log messages. - * Build fixes and improvements (mainly for cmake). - -2022-12-22 - V5.3.0 - * Minor updates for documentation and cmake builds. - -2022-12-13 - V5.3.0-rc1 - * Fix the training tools for the legacy OCR engine (fix issue #3925). - * PDF renderer: Ignore non-text blocks (fix issue #3957). - * Remove colormap before thresholding (fix issue #3940). - * Fix a number of performance issues reported by Coverity Scan. - * Training tools: Replace call of exit function by return statement in main function. - * Fix double free in function vigorous_noise_removal (fix issue #3876). - * Create to_win if needed in Textord::make_spline_rows (fix issue #3875). - * Bug fixes for ScrollView viewer: - * Fix memory issues in ScrollView::MessageReceiver. - * Catch potential nullptr in SVNetwork::SVNetwork. - * Move svpaint.cpp from src/viewer to src/. - * Add rule for svpaint executable in Autotools. - * Bug fixes and improvements for build tools: - * Fix AMD64 detection with autobuild on FreeBSD (fix issue #3964). - * Fix tesseract.pc generated from CMake to match Autotools. - * Detect availability of AVX512-VNNI. - * configure.ac: fix build on aarch64_be. - * Drop CI for old versions of macOS and Ubuntu. - -2022-07-06 - V5.2.0 - * Improvements and fixes for continuous integration, - autoconf and cmake builds. - * Set /Os for some 32 bit MS compilers (fixes #3769). - * Improve comments and other documentation. - * Add initial support for Intel AVX512F. - * Fix for very large PDF files on 32 bit hosts (fixes #3805). - * Fix NEON detection on FreeBSD. - * Fix regression with UZN files (fixes #3837). - * Fix calling delete[] for memory allocated by malloc in C API. - * Add an API function to init tesseract with traineddata from memory - (fixes #3691). - * Replace direct access to Leptonica internal data structures by - function calls and support latest releases of Leptonica. - * Replace std::regex by std::string functions (fixes issue #3830). - * Use compiled-in TESSDATA_PREFIX also on Windows (fixes #3767). - * Add new parameter 'invert_threshold', change the default threshold - from 0.5 to 0.7 and mark parameter 'tessedit_do_invert' as deprecated. - -2022-03-01 - V5.1.0 - * Handle image and line regions in output formats ALTO, hOCR and text. - * New parameter curl_timeout for curl_easy_setop. - * Build fixes and improvements. - * Catch nullptr in PageIterator::Orientation to improve robustness. - * Remove unused code. - -2022-01-06 - V5.0.1 - * Add SPDX-License-Identifier to public include files. - * Support redirections when running OCR on a URL. - * Lots of fixes and improvements for cmake builds. - Distributions should use the autoconf build. - * Fix broken msys2 build with gcc 11. - * Fix parameter certainty_scale (was duplicated). - * Fix some compiler warnings and clean code. - * Correctly detect amd64 and i386 on FreeBSD. - * Add libarchive and libcurl in continuous integration actions. - * Update submodule googletest to release v1.11.0. - -2021-11-22 - V5.0.0 - * Faster training and recognition by default (float instead of - double calculations) - * More options for binarization - * Improved support for ARM NEON - * Modernized code - * Removed proprietary data types like GenericVector and STRING - from public API - * pdf.ttf no longer needed, now integrated into the code - * Faster flat build with automake - * New options for combine_tessdata to show details of traineddata files - * Improved training messages - * Improved unit tests and fuzzing tests - * Lots of bug fixes - -2021-11-15 - V4.1.3 - * Fix build regression for autoconf build - -2021-11-14 - V4.1.2 - * Add RowAttributes getter to PageIterator - * Allow line images with larger width for training - * Fix memory leaks - * Improve build process - * Don't output empty ALTO sourceImageInformation (issue #2700) - * Extend URI support for Tesseract with libcurl - * Abort LSTM training with integer model (fixes issue #1573) - * Update documentation - * Make automake builds less noisy by default - * Don't use -march=native in automake builds - -2019-12-26 - V4.1.1 - * Implemented sw build (cppan is depreciated) - * Improved cmake build - * Code cleanup and optimization - * A lot of bug fixes... - -2019-07-07 - V4.1.0 - * Added new renders Alto, LSTMBox, WordStrBox. - * Added character boxes in hOCR output. - * Added python training scripts (experimental) as alternative shell scripts. - * Better support AVX / AVX2 / SSE. - * Disable OpenMP support by default (see e.g. #1171, #1081). - * Fix for bounding box problem. - * Implemented support for whitelist/blacklist in LSTM engine. - * Improved cmake configuration. - * Code modernization and improvements. - * A lot of bug fixes... - -2018-10-29 - V4.0.0 - * Added new neural network system based on LSTMs, with major accuracy gains. - * Improvements to PDF rendering. - * Fixes to trainingdata rendering. - * Added LSTM models+lang models to 101 languages. (tessdata repository) - * Improved multi-page TIFF handling. - * Fixed damage to binary images when processing PDFs. - * Fixes to training process to allow incremental training from a recognition model. - * Made LSTM the default engine, pushed cube out. - * Deleted cube code. - * Changed OEModes --oem 0 for legacy tesseract engine, --oem 1 for LSTM, --oem 2 for both, --oem 3 for default. - * Avoid use of Leptonica debug parameters or functions. - * Fixed multi-language mode. - * Removed support for VS2010. - * Added Support for VS2015 and VS2017 with CPPAN. - * Implemented invisible text only for PDF. - * Added AVX / SSE support for Windows. - * Enabled OpenMP support. - * Parameter unlv_tilde_crunching change to false. - * Miscellaneous Fixes. - * Detailed Changelog can be found at https://tesseract-ocr.github.io/tessdoc/4.0x-Changelog.html and https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html#tesseract-release-notes-oct-29-2018---v400 - -2017-02-16 - V3.05.00 - * Made some fine tuning to the hOCR output. - * Added TSV as another optional output format. - * Fixed ABI break introduced in 3.04.00 with the AnalyseLayout() method. - * text2image tool - Enable all OpenType ligatures available in a font. This feature requires Pango 1.38 or newer. - * Training tools - Replaced asserts with tprintf() and exit(1). - * Fixed Cygwin compatibility. - * Improved multipage tiff processing. - * Improved the embedded pdf font (pdf.ttf). - * Enable selection of OCR engine mode from command line. - * Changed tesseract command line parameter '-psm' to '--psm'. - * Write output of tesseract --help, --version and --list-langs to stdout instead of stderr. - * Added new C API for orientation and script detection, removed the old one. - * Increased minimum autoconf version to 2.59. - * Removed dead code. - * Require Leptonica 1.74 or higher. - * Fixed many compiler warning. - * Fixed memory and resource leaks. - * Fixed some issues with the 'Cube' OCR engine. - * Fixed some openCL issues. - * Added option to build Tesseract with CMake build system. - * Implemented CPPAN support for easy Windows building. - -2016-02-17 - V3.04.01 - * Added OSD renderer for psm 0. Works for single page and multi-page images. - * Improve tesstrain.sh script. - * Simplify build and run of ScrollView. - * Improved PDF output for OS X Preview utility. - * INCOMPATIBLE fix to hOCR line height information - commit 134ebc3. - * Added option to build Tesseract without Cube OCR engine (-DNO_CUBE_BUILD). - * Enable OpenMP support. - * Many bug fixes. - -2015-07-11 - V3.04.00 - * Tesseract development is now done with Git and hosted at github.com (Previously we used Subversion as a VCS and code.google.com for hosting). - * Tesseract now requires leptonica 1.71 or a higher version. - * Removed official support for VS 2008. - * Added support for 39 additional scripts/languages, including: amh, asm, aze_cyrl, bod, bos, ceb, cym, dzo, fas, gle, guj, hat, iku, jav, kat, kat_old, kaz, khm, kir, kur, lao, lat, mar, mya, nep, ori, pan, pus, san, sin, srp_latn, syr, tgk, tir, uig, urd, uzb, uzb_cyrl, yid - * Major updates to training system as a result of extensive testing on 100 languages. - * New training data for over 100 languages - * Improved performance with PIC compilation option. - * Significant change to invisible font system in pdf output to improve correctness and compatibility with external programs, particularly ghostscript. - * Improved font identification. - * Major change to improve layout analysis for heavily diacritic languages: Thai, Vietnamese, Kannada, Telugu etc. - * Fixed problems with shifted baselines so recognition can recover from layout analysis errors. - * Major refactor to improve speed on difficult images, especially when running a heap checker. - * Moved params from global in page layout to tesseractclass. - * Improved single column layout analysis. - * Allow ocr output to multiple formats using tesseract command line executable. - * Fixed issues with mixed eng+ara scripts. - * Improved script consistency in numbers. - * Major refactor of control.cpp to enable line recognition. - * Added tesstrain.sh - a master training script. - * Added ability to text2image training tool to just list available fonts. - * Added ability to text2image to underline words. - * Improved efficiency of image processing for PDF output. - * Added parameter description for each parameter listed with 'print-parameters' command line option. - * Added font info to hOCR output. - * Enabled streaming input and output of multi-page documents. - * Many bug fixes. - -2014-02-04 - V3.03(rc1) - * Added new training tool text2image to generate box/tif file pairs from - text and truetype fonts. - * Added support for PDF output with searchable text. - * Removed entire IMAGE class and all code in image directory. - * Tesseract executable: support for output to stdout; limited support for one - page images from stdin (especially on Windows) - * Added Renderer to API to allow document-level processing and output - of document formats, like hOCR, PDF. - * Major refactor of word-level recognition, beam search, eliminating dead code. - * Refactored classifier to make it easier to add new ones. - * Generalized feature extractor to allow feature extraction from greyscale. - * Improved sub/superscript treatment. - * Improved baseline fit. - * Added set_unicharset_properties to training tools. - * Many bug fixes. - * More training source data included. - -2012-02-01 - V3.02 - * Moved ResultIterator/PageIterator to ccmain. - * Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic. - * Added paragraph detection in layout analysis/post OCR. - * Fixed inconsistent xheight during training and over-chopping. - * Added simultaneous multi-language capability. - * Refactored top-level word recognition module. - * Added experimental equation detector. - * Improved handling of resolution from input images. - * Blamer module added for error analysis. - * Cleaned up externally used namespace by removing includes from baseapi.h. - * Removed dead memory mangagement code. - * Tidied up constraints on control parameters. - * Added support for ShapeTable in classifier and training. - * Refactored class pruner. - * Fixed training leaks and randomness. - * Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding. - * Improved line detection and removal. - * Added fixed pitch chopper for CJK. - * Added UNICHARSET to WERD_CHOICE to make mult-language handling easier. - * Fixed problems with internally scaled images. - * Added page and bbox to string in tr files to identify source of training data better. - * Fixes to Hindi Shiroreka splitter. - * Added word bigram correction. - * Reduced stack memory consumption and eliminated some ugly typedefs. - * Added new uniform classifier API. - * Added new training error counter. - * Fixed endian bug in dawg reader. - * Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so. - -2010-11-29 - V3.01 - * Removed old/dead serialise/deserialize methods on *LISTIZED classes. - * Total rewrite of DENORM to better encapsulate operation and make - for potential to extract features from images. - * Thread-safety! Moved all critical global and static variables to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads.) with the minor exception that some control parameters are still global and affect all threads. - * Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.* - * `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube. - * Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese. - * Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, that are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, such as the `ETEXT_STRUCT` in particular are deprecated and will be deleted in the future. - * ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you have to have already bootstrapped the language with character boxes. "Cyclic dependency" on traineddata. - * Auto orientation and script detection added to page layout analysis. - * Deleted *lots* of dead code. - * Fixxht module replaced with scalable data-driven module. - * Output font characteristics accuracy improved. - * Removed the double conversion at each classification. - * Upgraded oldest structs to be classes and deprecated PBLOB. - * Removed non-deterministic baseline fit. - * Added fixed length dawgs for Chinese. - * Handling of vertical text improved. - * Handling of leader dots improved. - * Table detection greatly improved. - * Fixed a couple of memory leaks. - * Fixed font labels on output text. (Not perfect, but a lot better than before.) - * Cleanup and more bug fixes - * Special treatments for Hindi. - * Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz) - -2010-09-21 - V3.00 - * Preparations for thread safety: - * Changed TessBaseAPI methods to be non-static - * Created a class hierarchy for the directories to hold instance data, - and began moving code into the classes. - * Moved thresholding code to a separate class. - * Added major new page layout analysis module. - * Added HOCR output (issues 221, 263: thanks to amkryukov). - * Added Leptonica as main image I/O and handling. Currently optional, - but in future releases linking with Leptonica will be mandatory. - * Ambiguity table rewritten to allow definite replacements in place - of fix_quotes. - * Added TessdataManager to combine data files into a single file. - * Some dead code deleted. - * VC++6 no longer supported. It can't cope with the use of templates. - * Many more languages added. - * Doxygenation of most of the function header comments. - * Added man pages. - * Added bash completion script (issue 247: thanks to neskiem) - * Fix integer overview in thresholding (issue 366: thanks to Cyanide.Drake) - * Add Danish Fraktur support (issues 300, 360: thanks to - dsl602230@vip.cybercity.dk) - * Fix file pointer leak (issue 359, thanks to yukihiro.nakadaira) - * Fix an error using user-words (Issue 345: thanks to max.markin) - * Fix a memory leak in tablefind.cpp (Issue 342, thanks to zdravco) - * Fix a segfault due to double fclose (Issue 320, thanks to souther) - * Fix an automake error (Issue 318, thanks to ichanjz) - * Fix a Win32 crash on fileFormatIsTiff() (Issues 304, 316, 317, 330, 347, - 349, 352: thanks to nguyenq87, max.markin, zdenop) - * Fixed a number of errors in newer (stricter) versions of VC++ (Issues - 301, among others) - -2009-06-30 - V2.04 - * Integrated bug fixes and patches and misc changes for portability. - * Integrated a patch to remove some of the "access" macros. - * Removed dependence on lua from the viewer, speeding it up - dramatically. - * Fixed the viewer so it compiles and runs properly! - * Specifically fixing issues: 1, 63, 67, 71, 76, 81, 82, 106, 111, - 112, 128, 129, 130, 133, 135, 142, 143, 145, 147, 153, 154, 160, - 165, 170, 175, 177, 187, 192, 195, 199, 201, 205, 209, 108, 169 - -2008-04-22 - V2.03 - * Fixed crash introduced in 2.02. - * Fixed lack of tessembedded.cpp in distribution. - * Added test for leptonica header files and conditional test for lib. - -2008-04-21 - V2.02 (again) - * Fixed namespace collisions with jpeg library (INT32). - * Portability fixes for Windows for new code. - * Updates to autoconf system for new code. - -2008-01-23 - V2.02 - * Improvements to clustering, training and classifier. - * Major internationalization improvements for large-character-set - * languages, eg Kannada. - * Removed some compiler warnings. - * Added multipage tiff support for training and running. - * Updated graphics output to talk to new java-based viewer. - * Added ability to save n-best lists. - * Added leptonica support for more file types. - * Improved Init/End to make them safe. - * Reduced memory use of dictionaries. - * Added some new APIs to TessBaseAPI. - -2007-08-27 - V2.01 - * Fixed UTF8 input problems with box file reader. - * Fixed various infinite loops and crashes in dawg code. - * Removed include of config_auto.h from host.h. - * Added automatic wctype encoding to unicharset_extractor. - * Fixed dawg table too full error. - * Removed svn files from tarball. - * Added new functions to tessdll. - * Increased maximum utf8 string in a classification result to 8. - -2007-07-02 - V2.00 - * Converted internal character handling to UTF8. - * Trained with 6 languages. - * Added unicharset_extractor, wordlist2dawg. - * Added boxfile creation mode. - * Added UNLV regression test capability. - * Fixed problems with copyright and registered symbols. - * Fixed extern "C" declarations problem. - -2007-05-15 - V1.04 - * Added dll exports for Windows. - * Fixed name collisions with stl etc. - * Made some preliminary changes ready for unicodeization. - * Several bug fixes discovered during unicodeization. - -2007-02-02 - V1.03 - * Added mftraining and cntraining. - * Added baseapi with adaptive thresholding for grey and color. - * Fixed many memory leaks. - * Fixed several bugs including lack of use of adaptive classifier. - * Added ifdefs to eliminate graphics code and add embedded platform support. - * Incorporated several patches, including 64-bit builds, Mac builds. - * Minor accuracy improvements. - -2006-10-04 - V1.02 - * Removed dependency on Aspirin. - * Fixed a few missing Apache license headers. - * Removed $log. - -2006-09-07 - V1.01. - * Added mfcpch.cpp and getopt.cpp for VC++. - * Fixed problem with greyscale images and no libtiff. - * Stopped debug window from being used for the usage output. - * Fixed load of inttemp for big-endian architectures. - * Fixed some Mac compilation issues. - -2006-06-16 - V1.0 of open source Tesseract checked-in.