= Tesseract release notes Feb 01 2012 - V3.02 =
* Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
* Added paragraph detection in layout analysis/post OCR.
* Added simultaneous multi-language capability.
* Added experimental equation detector.
* Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
* Improved line detection and removal.
* Added fixed pitch chopper for CJK.
* Added word bigram correction.
* Added new uniform classifier API.
* Added new training error counter.
* More detailed changes recorded in ChangeLog.
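
For the simultaneous multi-language capability listed above, the command-line tool takes several language codes joined with `+` in the `-l` argument. The sketch below only builds and prints such a command. The helper name `join_langs` and the file names `page.tif` and `page` are placeholders chosen for illustration, not part of Tesseract.

```shell
# join_langs is a hypothetical helper: it joins its arguments with '+'.
# The function body runs in a subshell so the IFS change does not leak out.
join_langs() (
  IFS='+'
  printf '%s\n' "$*"
)

langs=$(join_langs eng fra)

# Print the command instead of running it, so the sketch works
# even where Tesseract and its language data are not installed.
echo "tesseract page.tif page -l $langs"
```

Each language named this way needs its own data file installed alongside the others.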

= Tesseract release notes Oct 21 2011 - V3.01 =
* Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
* Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
* `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control Cube.
* Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
* Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, which are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, the `ETEXT_STRUCT` in particular, are deprecated and will be deleted in the future.
* ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you must already have bootstrapped the language with character boxes: a "cyclic dependency" on traineddata.
* Auto orientation and script detection added to page layout analysis.
* Deleted *lots* of dead code.
* Fixxht module replaced with scalable data-driven module.
* Output font characteristics accuracy improved.
* Removed the double conversion at each classification.
* Upgraded oldest structs to be classes and deprecated PBLOB.
* Removed non-deterministic baseline fit.
* Added fixed length dawgs for Chinese.
* Handling of vertical text improved.
* Handling of leader dots improved.
* Table detection greatly improved.
* Fixed a couple of memory leaks.
* Fixed font labels on output text. (Not perfect, but a lot better than before.)
* Cleanup and more bug fixes.
* Special treatments for Hindi.
* Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz).

Tesseract release notes Sep 30 2010 - V3.00
* Preparations for thread safety:
  * Changed TessBaseAPI methods to be non-static
  * Created a class hierarchy for the directories to hold instance data, and began moving code into the classes.
  * Moved thresholding code to a separate class.
* Added major new page layout analysis module.
* Added HOCR output.
* Added Leptonica as main image I/O and handling. Currently optional, but in future releases linking with Leptonica will be mandatory.
* Ambiguity table rewritten to allow definite replacements in place of fix_quotes.
* Added TessdataManager to combine data files into a single file.
* Some dead code deleted.
* VC++6 no longer supported. It can't cope with the use of templates.
* Many more languages added.
* Doxygenation of most of the function header comments.

Tesseract release notes June 30 2009 - V2.04
Integrated patches for portability and to remove some of the "access" macros.
Removed dependence on lua from the viewer making it a *lot* faster. Also the viewer now compiles and works (on Linux.)
Fixed the following issues:
1, 63, 67, 71, 76, 79, 81, 82, 84, 106, 108, 111, 112, 128, 129, 130, 133, 135,
142, 143, 145, 146, 147, 153, 154, 160, 165, 169, 170, 175, 177, 187, 192,
195, 199, 201, 205, 209.
This is the last version to support VC++6!
This may also be the last version to compile without leptonica!
Windows version now outputs to stderr by default, fixing a lot of the problems with lack of visible meaningful error messages.

Tesseract release notes April 22 2008 - V2.03
2.02 was unrunnable, due to a last-minute "simple" change.
2.03 fixes the problem and also adds an include check for leptonica to make it more usable.

Tesseract release notes April 21 2008 - V2.02
Improvements to clustering, training and classifier.
Major internationalization improvements for large-character-set languages, e.g. Kannada.
Removed some compiler warnings.
Added multipage tiff support for training and running.
Updated graphics output to talk to new java-based viewer.
Added ability to save n-best lists.
Added leptonica support for more file types.
Improved Init/End to make them safe.
Reduced memory use of dictionaries.
Added some new APIs to TessBaseAPI.
Fixed namespace collisions with jpeg library (INT32).
Portability fixes for Windows for new code.
Updates to autoconf system for new code.

Tesseract release notes August 27 2007 - V2.01
Fixed UTF8 input problems with box file reader.
Fixed various infinite loops and crashes in dawg code.
Removed include of config_auto.h from host.h.
Added automatic wctype encoding to unicharset_extractor.
Fixed dawg table too full error.
Removed svn files from tarball.
Added new functions to tessdll.
Increased maximum utf8 string in a classification result to 8.

Tesseract release notes July 17, 2007 - V2.00

First release of the International version.
This version recognizes the following languages:
English - eng
French - fra
Italian - ita
German - deu
Spanish - spa
Dutch - nld
The language codes follow ISO 639-2. The default language is English.
To recognize another language:
tesseract inputimage outputbase -l langcode
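
As a concrete instance of the command form above, the sketch below prints the command for a French image. `tess_cmd`, `letter.tif` and `letter` are hypothetical names used for illustration. The accepted codes are the six listed in these notes.

```shell
# tess_cmd is a hypothetical helper: it prints the tesseract command for
# an input image ($1), an output base name ($2) and a language code ($3).
# It rejects any code outside the six languages shipped with 2.00.
tess_cmd() {
  case "$3" in
    eng|fra|ita|deu|spa|nld)
      printf 'tesseract %s %s -l %s\n' "$1" "$2" "$3"
      ;;
    *)
      echo "unsupported language code: $3" >&2
      return 1
      ;;
  esac
}

# Prints the command rather than running it, so no Tesseract install is needed.
tess_cmd letter.tif letter fra
```
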

To train on a new language, see separate documentation.
More languages will be appearing over time.

List of changes in this release:
Converted internal character handling to UTF8.
Trained with 6 languages.
Added unicharset_extractor, wordlist2dawg.
Added boxfile creation mode.
Added UNLV regression test capability.
Fixed problems with copyright and registered symbols.
Fixed extern "C" declarations problem.
Made some improvements to consistency of accuracy across platforms.
Added vc++ express support.

Instructions for downloading and building version 2.00.
Things have changed quite a bit since the previous versions so please read carefully.
*All users*
The tarballs are split into pieces.
tesseract-2.00.tar.gz contains all the source code.
tesseract-2.00.<lang>.tar.gz contains the data files for <lang>. You need at least one of these or tesseract will not work.
tesseract-2.00.exe.tar.gz is not for the 'exe' language. It is windows executables. They are built with VC++ express and come with absolutely no warranty. If they work for you then great, otherwise get visual C++ express (and the platform sdk) and build from the source.
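
The split into a source tarball plus one data tarball per language can be sketched as a small helper that lists the files needed for a given set of language codes. `tarballs_for` is a hypothetical name, and the sketch assumes the data tarballs use the same `.tar.gz` suffix as the source tarball.

```shell
# tarballs_for is a hypothetical helper: it prints the source tarball
# followed by one data tarball per language code given as an argument.
tarballs_for() {
  echo "tesseract-2.00.tar.gz"
  for lang in "$@"; do
    echo "tesseract-2.00.$lang.tar.gz"
  done
}

# List what is needed for English and German.
tarballs_for eng deu

# After downloading, each file could be unpacked with, for example:
#   for f in $(tarballs_for eng deu); do tar xzf "$f"; done
```
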

*Non-windows users*
As with 1.04, this version works with make install.
*New* there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for the help.)
It might work with your OS if you know how to do that sort of thing.
If you are linking to the libraries, as with Ocropus, there is now a single master library called libtesseract_full.a.

*Windows users*
If you are building from the sources, there are still dsw and dsp files for vc++6 and also sln and vcproj files for vc++ express.
The dll has been updated to allow input of non-binary images. (Thanks to Glen of Jetsoft.)

Tesseract release notes May 15, 2007 - V1.04.

=== Windows users only ===
Added a dll interface for windows. Thanks to Glen at Jetsoft for contributing
this. To use the dll, include tessdll.h, import tessdll.lib and put tessdll.dll
somewhere where the system can find it. There is also a small dlltest program
to test the dll. Run with:
dlltest phototest.tif phototest.txt
It will output the text from phototest.tif with bounding box information.
**New for Windows** the distribution now includes tesseract.exe and tessdll.dll
which *might* work out of the box! There are no guarantees as you need
VC++6 versions of mfc and crt (at least) for it to work. (Batteries not
included, and certainly no installshield.)

== Important note for anyone building with make: i.e. anyone except devstudio users ==
This release includes new standardization for the data directory. To enable
Tesseract to find its data files, you must either:
./configure
make
make install
to move the data files to the standard place, or:
export TESSDATA_PREFIX="directory in which your tessdata resides/"
(or equivalent) in your .profile or whatever or setenv to set the environment
variable. Note that the directory must end in a /
HAVING tesseract and tessdata IN THE SAME DIRECTORY DOES NOT WORK ANY MORE.
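
Because the directory in `TESSDATA_PREFIX` must end in a `/`, a small guard in `.profile` can add the slash when it is missing. This is a minimal sketch: `with_trailing_slash` is a hypothetical helper and `/usr/local/share` is only an example location, to be replaced with the directory in which your tessdata resides.

```shell
# with_trailing_slash is a hypothetical helper: it prints its argument
# unchanged if it already ends in '/', and appends '/' otherwise.
with_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Example location only; use the directory that contains tessdata.
TESSDATA_PREFIX=$(with_trailing_slash "/usr/local/share")
export TESSDATA_PREFIX

echo "$TESSDATA_PREFIX"
```
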

== All users ==
Fixed a bunch of name collisions - mostly with stl.
Made some preliminary changes for unicode compatibility. Includes a new data
file (unicharset) and renaming of the other data files to eng.* to support
different languages.
There are also several other minor bug fixes and portability improvements
for 64 bit, the latest visual studio compiler etc. Thanks to all who have
contributed these fixes.

NOTE: This is likely to be the last English-only release!
Apologies in advance to non-windows users for bloating the distribution with
windows executables. This will probably get fixed in the next release with
the multi-language capability, since that will also bloat the distribution.

Tesseract release notes Feb 2, 2007 - V1.03.
Added mftraining and cntraining. Using an image with a box file, tesseract
generates .tr output files. cntraining runs on the .tr files to make
normproto that lives in tessdata. mftraining runs on the .tr files to
make inttemp and pffmtable in tessdata. These are the main data files
that tesseract uses to recognize characters. At present, the code to make
dictionary files is not yet available, nor are any sample box files or
rebuilt inttemp or documentation to create any of these. Recognition is
still limited to the ASCII set, but when this problem is fixed, documentation
will follow.

Added a new API with adaptive thresholding for grey and color images.
See ccmain/baseapi.h/cpp for details. The main program has been converted
to use the API as an example. See main() in ccmain/tesseractmain.cpp for
details. The API is designed to make it easy to add subclasses with ability
to output the bounding boxes etc from the internal structures. The adaptive
thresholding improves accuracy (most of the time) on non-binary images.

Many memory leaks have been fixed. There are no known leaks left from using
the API correctly.

The adaptive classifier was not operating correctly. This bug, and several
others, have been fixed, including poor chopping, an indefinite (if not quite
infinite) loop in the number parser, and a couple of crash bugs. Thanks to
all that have contributed bugs and bug fixes.

It is now possible to build without any of the graphics support to save code
size using #define GRAPHICS_DISABLED. There is also a new EMBEDDED define
for use on operating systems with limited library support.

64-bit and Mac OSX buildability is now included in the mainline source tree.
Thanks to all that have contributed patches and comments to help with that.
1.03 is also endian-independent, apart from the tiff i/o, so if you use
libtiff, the code should run on all platforms, even if you get/create new
data files of a different endianness.

Some of the bug fixes improve accuracy, and so do some of the changes to
DangAmbigs and user-words.

Tesseract release notes, Oct 4 2006 - V1.02.
Removed dependency on aspirin. *All* code is now licensed under Apache2.0.

Tesseract release notes, Sep 7 2006 - V1.01.

Fixes for this release:
Added mfcpch.cpp and getopt.cpp for VC++.
Fixed problem with greyscale images and no libtiff.
Stopped debug window from being used for the usage output.
Fixed load of inttemp for big-endian architectures.
Fixed some Mac compilation issues.

This version should read uncompressed 8 bit grey and 24 bit color tiffs
without having to have libtiff. It does a dumb threshold though, so don't
expect good results from poor contrast or images of natural scenes etc.

If you just run tesseract with no command line args you should now get a
sensible usage message on linux, with or without X-windows.

If you can get it to compile on a PPC Mac, it may now run correctly,
although not all the build issues are fixed yet.

Building Tesseract:
Windows:
Unpack the tar.gz archive
Open tesseract.dsw in DevStudio (preferably version 6, higher versions will be more difficult)
Set Win32 - Release as the active configuration.
Build.
Copy tesseract.exe from bin.rel up one directory level.
Run tesseract phototest.tif phototest
This will create phototest.txt.
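
The Linux steps above can be strung together roughly as follows. This is a sketch under stated assumptions, not a tested build recipe: the archive name `tesseract-1.01.tar.gz` is hypothetical, the archive is assumed to unpack into a directory of the same base name, and the symbolic link is used in place of copying the binary. By default the script only prints each step; set `DRY_RUN=0` to execute them.

```shell
# run prints a step when DRY_RUN is 1 (the default here) and executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# build_and_test is a hypothetical wrapper around the steps in these notes.
build_and_test() {
  archive=$1                    # hypothetical archive name
  srcdir=${archive%.tar.gz}     # assumes the archive unpacks into this directory
  run tar xzf "$archive" &&
  run cd "$srcdir" &&
  run ./configure &&
  run make &&
  run ln -s ccmain/tesseract tesseract &&   # link instead of copying up one level
  run ./tesseract phototest.tif phototest   # should write phototest.txt
}

build_and_test tesseract-1.01.tar.gz
```
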