This allows OCR of images from the internet without downloading them first:
tesseract http://IMAGE_URL OUTPUT ...
It uses libcurl.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- Use C++ type casts
- Remove unneeded type cast
- Simplify code for function pop
- Remove macro push_on (it was only used once)
This fixes lots of compiler warnings caused by old type casts.
- Use C++ enums
- Use strongly typed C++11 enum for DIRECTION and optimize struct MFEDGEPT
- Use float constant for MF_SCALE_FACTOR
- Replace macros by inline functions
- Fix documentation comment
This fixes several warnings from clang.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
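A minimal sketch of the enum and macro changes listed above, for illustration only. The enumerator names, the struct EdgePoint and the function IsVertical are hypothetical; they are not the actual DIRECTION values or the MFEDGEPT layout.

```cpp
#include <cstdint>

// Hypothetical direction enum with a one byte underlying type.
enum class Direction : uint8_t { kNorth, kSouth, kEast, kWest };

// Hypothetical edge point: the small members are grouped at the end
// so that they share one alignment slot instead of each adding padding.
struct EdgePoint {
  float x;
  float y;
  float slope;
  Direction direction;
  bool hidden;
};

// Inline function instead of a function-like macro: arguments are type checked.
inline bool IsVertical(Direction d) {
  return d == Direction::kNorth || d == Direction::kSouth;
}
```

A strongly typed enum does not convert implicitly to int, so accidental arithmetic or comparisons with plain integers no longer compile.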
This fixes a clang warning:
src/ccstruct/polyblk.cpp:412:12: warning: result of comparison of
unsigned enum expression >= 0 is always true
[-Wtautological-unsigned-enum-zero-compare]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Replace the macros which were declared in vecfuncs.h by member functions
and move a function which was only used in chop.cpp to that file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Removing STRING from genericvector.h allows eliminating the proprietary
STRING data type from the public Tesseract API.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- add another constructor for LSTMRecognizer
which takes the language_data_path_prefix configured/selected
at runtime and passes it to the internal CCUtil
- use this in Tesseract::init_tesseract_lang_data when LSTMs
are available
(this was missing from 297d7d86ce)
This fixes compiler warnings caused by
commit 091ce345f6:
src/wordrec/lm_state.h:100:7: warning: field 'cost'
will be initialized after field 'curr_b' [-Wreorder]
src/wordrec/lm_state.h:104:7: warning: field 'top_choice_flags'
will be initialized after field 'dawg_info' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
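The -Wreorder warnings above are issued because C++ initializes members in declaration order, whatever order the constructor's initializer list uses. A minimal sketch follows; the member names are taken from the warning text, while the struct name and the member types are assumptions.

```cpp
// Members are always initialized in declaration order, regardless of
// the order written in the constructor's initializer list.
struct State {
  int curr_b;  // declared first, so initialized first
  int cost;    // declared second, may safely depend on curr_b

  // The initializer list follows the declaration order: no -Wreorder warning.
  explicit State(int b) : curr_b(b), cost(curr_b * 2) {}
};
```

Writing `cost(...)` before `curr_b(...)` in the list would not change the real initialization order, which is why the compiler warns about it.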
This fixes compiler warnings caused by
commit 5b4565b80b:
src/textord/colpartition.cpp:91:24: warning: field 'last_column_'
will be initialized after field 'column_set_' [-Wreorder]
src/textord/colpartition.cpp:93:37: warning: field 'inside_table_column_'
will be initialized after field 'nearest_neighbor_above_' [-Wreorder]
src/textord/colpartition.cpp:95:58: warning: field 'space_to_right_'
will be initialized after field 'owns_blobs_' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings caused by
commit ecf0f2dee5:
src/dict/dawg.h:202:9: warning: field 'type_' will be initialized
after field 'lang_' [-Wreorder]
src/dict/dawg.h:355:9: warning: field 'dawg_index' will be initialized
after field 'dawg_ref' [-Wreorder]
src/dict/dawg.h:356:9: warning: field 'punc_index' will be initialized
after field 'punc_ref' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings caused by
commit 751fcd2b11:
src/classify/classify.cpp:176:7: warning:
field 'EnableLearning' will be initialized after
field 'il1_adaption_test' [-Wreorder]
src/classify/classify.cpp:187:7: warning:
field 'dict_' will be initialized after
field 'static_classifier_' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Only one of bIt, dIt, iIt and sIt is used, so put all four in a union.
This fixes CID 1164628, CID 1164629, CID 1164630 and CID 1164631.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Report from Coverity Scan:
CID 1405560 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member end is not initialized in
this constructor nor in any functions that it calls.
CID 1405561 [...]
Modernize and optimize class WERD_RES. This not only fixes the issues
but also reduces the size and eliminates the functions InitNonPointers
and InitPointers.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reduce size from 368 to 352 bytes for Trie, 72 to 64 bytes for Dawg
and 40 to 24 bytes for DawgPosition by avoiding holes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The class no longer uses bit fields. Re-ordering the member variables
avoids holes and reduces the size of BLOBNBOX from 168 to 152 bytes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
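How re-ordering member variables avoids holes can be sketched as follows. Both structs are hypothetical, and the padding noted in the comments is what common 64 bit ABIs produce; other platforms may differ.

```cpp
// Small members separated by a larger one: padding is usually inserted
// after each bool to satisfy the alignment of the double.
struct WithHoles {
  bool flag_a;
  double value;
  bool flag_b;
};

// Largest alignment first: both flags share the same padded slot.
struct Reordered {
  double value;
  bool flag_a;
  bool flag_b;
};
```

On a typical x86-64 build `sizeof(WithHoles)` is 24 and `sizeof(Reordered)` is 16.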
Fix this runtime error in recodebeam_test and unicharcompress_test:
src/ccutil/unicharcompress.h:84:27: runtime error:
left shift of 267 by 28 places cannot be represented in type 'int'
code has up to kMaxCodeLen (9) values, so the highest possible value for
i is 8, and the shift value can reach 7 * 8 = 56.
That requires an uint64_t data type.
size_t would fit for 64 bit hosts, but be too small for 32 bit hosts.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
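The shift problem described above can be sketched like this. PackCode is a hypothetical helper, not the function from unicharcompress.h, and it assumes that all code values are non-negative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: fold up to 9 code values into one 64 bit value,
// shifting the value at index i by 7 * i bits.
inline uint64_t PackCode(const std::vector<int>& code) {
  uint64_t result = 0;
  for (std::size_t i = 0; i < code.size(); ++i) {
    // Widen before shifting so that the shift is done in unsigned 64 bit
    // arithmetic. Bits shifted beyond bit 63 are discarded, which is
    // well defined for unsigned types.
    result ^= static_cast<uint64_t>(code[i]) << (7 * i);
  }
  return result;
}
```

With a plain `int` on the left side, `267 << 28` does not fit into 32 bits, which is the runtime error shown above.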
Fix this runtime error in osd_test and textlineprojection_test:
src/ccmain/osdetect.cpp:109:14: runtime error: division by zero
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Fix these runtime errors in mastertrainer_test:
src/ccutil/bitvector.cpp:119:18: runtime error:
null pointer passed as argument 2, which is declared to never be null
src/ccutil/bitvector.cpp:124:10: runtime error:
null pointer passed as argument 1, which is declared to never be null
Signed-off-by: Stefan Weil <sw@weilnetz.de>
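One way to avoid passing null pointers to memcpy is to skip the call for empty data. This is a sketch under that assumption; CopyWords is a hypothetical helper, and the actual change in bitvector.cpp may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies n words from src to dst. Passing a null pointer to memcpy is
// undefined behaviour even when the size is zero, so skip the call then.
inline void CopyWords(uint32_t* dst, const uint32_t* src, std::size_t n) {
  if (n > 0) {
    std::memcpy(dst, src, n * sizeof(*dst));
  }
}
```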
This fixes three LGTM warnings:
Multiplication result may overflow 'float' before it is converted to 'double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
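The usual fix for this kind of warning is to convert one operand before multiplying. A minimal sketch with a hypothetical function Area:

```cpp
// Product of two float values as double. Converting one operand first
// makes the multiplication itself happen in double precision.
inline double Area(float width, float height) {
  return static_cast<double>(width) * height;
}
```

Without the cast the product is computed as float and only the result is converted, so a large product can already have overflowed.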
They are moved from src/classify and src/lstm to src/training.
This reduces the size of the Tesseract library.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is only used in unittest/layout_test.cc after moving a test from
baseapi_test.cc to that file, so it can be made local.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The method was only used in unittest where it can be replaced by
UNICHARSET::load_from_file which also simplifies the code.
This allows removing the class InMemoryFilePointer and fixes a TODO.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The MS compiler only accepts string constants up to 65535 characters,
so shorten the string for that compiler to fix the compilation.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This converts special characters like '<' or '>' to the
correct HTML entities.
Also optimize the code a little.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
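Converting special characters to HTML entities can be sketched as follows. EscapeHtml is a hypothetical helper, and the set of escaped characters is an assumption; the function changed by this commit may handle a different set.

```cpp
#include <string>

// Hypothetical helper: replace characters which are special in HTML
// by the corresponding entities.
inline std::string EscapeHtml(const std::string& text) {
  std::string result;
  result.reserve(text.size());
  for (char ch : text) {
    switch (ch) {
      case '<': result += "&lt;"; break;
      case '>': result += "&gt;"; break;
      case '&': result += "&amp;"; break;
      case '"': result += "&quot;"; break;
      case '\'': result += "&#39;"; break;
      default: result += ch;
    }
  }
  return result;
}
```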
The vector was already limited to MAX_NUM_PROTOS (512) entries or 64 bytes
in the old code. Now it uses that size right from the start which avoids
reallocating it later when entries are added.
The old code which reallocated the vector to expand it was buggy because
the realloc function can return a different pointer, but the code still
used the original pointer to reset the new bits.
Function ExpandBitVector is now unused and therefore removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
tesseract::FileReader and tesseract::FileWriter are already declared
in serialis.h which is included by genericvector.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That case now uses Leptonica to deliver the desired image instead of
using an inefficient loop in the Tesseract code.
See commit 54fafc4e2e which used similar
code in the past.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This reverts commit 75d230a7ac.
That commit introduced new problems (memory leak, potential endless loop)
and style issues.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The new code avoids dynamic memory allocation, uses faster function calls
and allows removing more code from tesscallback.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The function pointers and callbacks file_reader_, file_writer_,
checkpoint_reader_ and checkpoint_writer_ are always set to
the same values. Replacing them by direct function calls
simplifies the code and allows removing more code from tesscallback.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It needs neither a temporary TessResultCallback2 nor the function
LMPainPoints::GenerateForBlamer.
This also allows removing more code from tesscallback.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
C++17 removes `std::random_shuffle`, which breaks the compilation of
text2image.cpp with C++17 compilers. `std::shuffle` is valid in C++11
through C++17, so use `std::shuffle` instead.
Due to the use of `std::random_shuffle`, `text2image --render_ngrams`
would not give consistent results for different compilers or platforms.
With the current change, the same random number generator is used for
all platforms and initialized to the same seed, so training output
should be consistent.
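A sketch of the replacement, assuming a std::mt19937 engine and an arbitrary fixed seed; neither the engine nor the seed value is taken from the commit, and ShuffleDeterministic is a hypothetical name.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Shuffle with an explicitly seeded engine instead of std::random_shuffle.
// The default seed is an arbitrary assumption for this sketch.
inline void ShuffleDeterministic(std::vector<int>* items, unsigned seed = 1) {
  std::mt19937 engine(seed);
  std::shuffle(items->begin(), items->end(), engine);
}
```

The C++ standard fixes the number sequence of std::mt19937 for a given seed, but not the exact algorithm of std::shuffle, so identical orderings are only guaranteed for the same standard library implementation.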
This fixes compiler warnings from clang++ like these ones:
src/ccutil/params.cpp:34:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:67:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:68:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:78:9: warning: macro is not used [-Wunused-macros]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That fixes several warnings from clang++ like the following one:
src/training/combine_lang_model.cpp:36:1: warning: no previous extern declaration for non-static variable 'FLAGS_lang_is_rtl' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That fixes several warnings from clang++ like the following one:
src/training/commontraining.cpp:95:1: warning: no previous extern declaration for non-static variable 'FLAGS_D' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes lots of compiler warnings like these ones:
src/api/baseapi.cpp:113:13: warning: no previous extern declaration for non-static variable 'kInputFile' [-Wmissing-variable-declarations]
src/api/baseapi.cpp:117:13: warning: no previous extern declaration for non-static variable 'kOldVarsFile' [-Wmissing-variable-declarations]
src/api/baseapi.cpp:97:10: warning: no previous extern declaration for non-static variable 'stream_filelist' [-Wmissing-variable-declarations]
src/ccmain/equationdetect.cpp:46:10: warning: no previous extern declaration for non-static variable 'equationdetect_save_bi_image' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This also fixes warnings like the following one from clang++:
src/ccmain/pgedit.cpp:114:15: warning: declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This also fixes some warnings from clang++:
src/classify/featdefs.cpp:47:15: warning: declaration requires a global constructor [-Wglobal-constructors]
src/classify/featdefs.cpp:57:15: warning: declaration requires a global constructor [-Wglobal-constructors]
src/classify/featdefs.cpp:66:15: warning: declaration requires a global constructor [-Wglobal-constructors]
src/classify/featdefs.cpp:75:15: warning: declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This looks for one of the header files which are included by Tesseract.
It currently uses a hard coded path which works for Debian / Ubuntu.
Simplify also the rules for linking Tensorflow.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It expects include files in /usr/include/tensorflow.
* Add configure option --with-tensorflow (disabled by default)
* Fix data type tensorflow::int64
* Remove "third_party/" in include statements
* Add dummy implementations for Backward and DebugWeights in TFNetwork
* Add files generated with protoc from tfnetwork.proto
(so the Tensorflow sources are not needed for the build)
* Update Makefiles
Signed-off-by: Stefan Weil <sw@weilnetz.de>
sqrt(0.5) = 1 / sqrt(2) can be replaced by the macro M_SQRT1_2.
This also fixes a compiler warning:
src/lstm/lstmtrainer.cpp:51:14: warning: declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That debugging code uses a lot of memory and is no longer useful.
text data bss dec hex filename
815 0 262144 262959 4032f src/ccutil/globaloc.o
Remove also the function err_exit which was only used in ccmain/reject.cpp.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reduce the maximum message size from 64 KiB to 2 KiB, which should still
be large enough for trace messages.
Create the smaller message on the stack instead of using a global
array to allow reentrancy and to reduce the memory use of Tesseract.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is defined for all platforms when math.h or cmath is included
after defining the macro _USE_MATH_DEFINES.
Define _USE_MATH_DEFINES before any include statement to make sure
that M_PI gets defined. It is not necessary to define it conditionally
only for Windows.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
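The include order described above looks like this in a source file. DegreesToRadians is a hypothetical function which only serves to use M_PI.

```cpp
// Must come before the first include statement so that <cmath> sees it.
#define _USE_MATH_DEFINES
#include <cmath>

inline double DegreesToRadians(double degrees) {
  return degrees * M_PI / 180.0;
}
```

M_PI is not part of the C++ standard, but common C libraries provide it; the macro _USE_MATH_DEFINES only matters for the Microsoft headers and is harmless elsewhere.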
This fixes lots of warnings related to ERRCODE like the following one:
src/ccutil/errcode.h:81:15: warning:
declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The function did not correctly read Chinese unichars into the local
Class variable if the locale was set to de_DE.UTF-8 (or other
incompatible locales). That resulted in a wrong ClassId which was
used to write into the Cutoffs array without checking for valid bounds.
On macOS the result was a runtime error in baseapi_test (see GitHub
issue #1250):
[ RUN ] TesseractTest.InitConfigOnlyTest
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
Replacing sscanf by std::istringstream fixes that.
Add also an assertion to catch future out-of-bounds writes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
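Reading a class name and a cutoff value with std::istringstream could look like the sketch below. ParseCutoffLine is a hypothetical helper, and the line format (a name followed by an integer) is an assumption.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: parse "<name> <cutoff>" independent of the
// locale which was selected for the process.
inline bool ParseCutoffLine(const std::string& line, std::string* name,
                            int* cutoff) {
  std::istringstream stream(line);
  stream.imbue(std::locale::classic());  // always parse with the "C" locale
  stream >> *name >> *cutoff;
  return !stream.fail();
}
```

The return value allows the caller to reject malformed lines before a value derived from them is used as an array index.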
The latest code passed all unittests with locale de_DE.UTF-8
and has fixed the locale issues which were reported on GitHub.
Therefore the assertions can be removed.
Any remaining locale issue will be fixed when it is identified.
To help find such remaining issues, debug code now uses the
user's locale settings instead of the default "C" locale for all
executables which use TessBaseAPI.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That function writes float values which must always use '.' as the
decimal separator, no matter what the current locale setting is.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The unittest failed with LANG=de_DE.UTF-8:
$ unittest/baseapi_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 12 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from TesseractTest
[ RUN ] TesseractTest.ArraySizeTest
[ OK ] TesseractTest.ArraySizeTest (0 ms)
[ RUN ] TesseractTest.BasicTesseractTest
[ OK ] TesseractTest.BasicTesseractTest (1251 ms)
[ RUN ] TesseractTest.IteratesParagraphsEvenIfNotDetected
[ OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
[ RUN ] TesseractTest.HOCRWorksWithoutSetInputName
[ OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
[ RUN ] TesseractTest.HOCRContainsBaseline
[ OK ] TesseractTest.HOCRContainsBaseline (389 ms)
[ RUN ] TesseractTest.RickSnyderNotFuckSnyder
[ OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
[ RUN ] TesseractTest.AdaptToWordStrTest
Trying to adapt "136
" to "1 3 6"
Trying to adapt "256
" to "2 5 6"
Trying to adapt "410
" to "4 1 0"
Trying to adapt "432
" to "4 3 2"
Trying to adapt "540
" to "5 4 0"
Trying to adapt "692
" to "6 9 2"
Trying to adapt "779
" to "7 7 9"
Trying to adapt "793
" to "7 9 3"
Trying to adapt "808
" to "8 0 8"
Trying to adapt "815
" to "8 1 5"
Trying to adapt "12
" to "1 2"
Trying to adapt "12
" to "1 2"
[ OK ] TesseractTest.AdaptToWordStrTest (788 ms)
[ RUN ] TesseractTest.BasicLSTMTest
[ OK ] TesseractTest.BasicLSTMTest (4525 ms)
[ RUN ] TesseractTest.LSTMGeometryTest
[ OK ] TesseractTest.LSTMGeometryTest (615 ms)
[ RUN ] TesseractTest.InitConfigOnlyTest
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.232621 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.231864 in normproto file is not in unichar set.
[...]
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.233915 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.221755 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar ? in normproto file is not in unichar set.
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
[INFO] Lang eng took 327ms in regular init
[INFO] Lang chi_tra took 1422ms in regular init
Abort trap: 6
TesseractTest.InitConfigOnlyTest is fixed by using std::istringstream
instead of sscanf.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The unittest failed with LANG=de_DE.UTF-8:
$ unittest/apiexample_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Void functions should not use @return. It causes compiler warnings
like this one:
src/classify/intproto.cpp:326:5: warning:
'@return' command used in a comment that is attached to a function
returning void [-Wdocumentation]
Some non-void functions also were documented with @return none.
Fix those comments, too.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
PrintNormMatch was unused. Remove it and remove also an unused prototype.
Make the only remaining private function NormEvidenceOf static.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The NonEssential parameter was wrongly derived from linear_token instead
of essential_token and therefore always set to true.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This allows fixing two compiler warnings from clang++:
src/ccutil/universalambigs.cpp:23:19: warning: no previous extern declaration for non-static variable 'kUniversalAmbigsFile' [-Wmissing-variable-declarations]
src/ccutil/universalambigs.cpp:19019:18: warning: no previous extern declaration for non-static variable 'ksizeofUniversalAmbigsFile' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* WriteProtoList is unused. Remove it.
* ReadNFloats, WriteNFloats and WriteProtoStyle are only used locally,
so make them local.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* winsock2.h is case sensitive, lower case is required for cross build.
* ws2tcpip.h is required for addrinfo.
* FreeAddrInfo conflicts with existing freeaddrinfo, so rename it.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* Remove unused macros ultoa, SIGNED.
* Move macros NOMINMAX and WIN32_LEAN_AND_MEAN to host.h
because they are used when including windows.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This partially reverts commit c150b9832d.
Now params.cpp includes host.h which also gets the definition for MAX_PATH.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Using std::stringstream allows conversion of float to string
independent of the current locale setting.
Some snprintf statements are not needed at all because a constant string
can be appended directly.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
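A minimal sketch of a locale independent conversion from float to string. FormatFloat is a hypothetical helper; imbuing the classic locale is an explicit safeguard added for this sketch.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Convert a floating point value to a string which always uses '.'
// as the decimal separator.
inline std::string FormatFloat(double value) {
  std::ostringstream stream;
  stream.imbue(std::locale::classic());
  stream << value;
  return stream.str();
}
```

snprintf uses the decimal separator of the C locale set with setlocale, so the same call can produce "0,5" with de_DE.UTF-8.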
pgeditor_show_point is unused, so remove it completely.
Some more functions are only used locally, so make them static functions.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This needs the latest test submodule.
The test uses LoadFromFile which is not used otherwise, so remove that
function from class ParamsModel.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Using std::stringstream simplifies the code and allows conversion of
double to string independent of the current locale setting.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang-tidy added it in commit ac0b191f6b.
The "e" flag is an extension for glibc which sets the O_CLOEXEC flag,
so the file handle is not leaked to child processes. It is not needed
here.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The code was modernized using clang-tidy with "modernize-use-using".
The modified files were then formatted using clang-tidy with
"google-readability-braces-around-statements", then clang-format
was applied.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It was only defined for Windows builds.
Use also false instead of 0 to set the default value of
two boolean config variables.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Remove unneeded include statements for host.h, add required ones and
update the comments for the remaining include statements.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-loop-convert' -fix
Then the resulting code was cleaned manually.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-use-bool-literals' -fix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-use-auto' -fix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-use-override' -fix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warnings:
src/ccutil/unicharcompress.cpp:172:27: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
src/lstm/recodebeam.cpp:129:29: warning: comparison of integers of different signs: 'std::__cxx1998::vector::size_type' (aka 'unsigned long') and 'int' [-Wsign-compare]
src/lstm/recodebeam.cpp:276:48: warning: comparison of integers of different signs: 'std::__cxx1998::vector::size_type' (aka 'unsigned long') and 'int' [-Wsign-compare]
unittest/imagedata_test.cc:101:21: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
unittest/linlsq_test.cc:33:23: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
unittest/linlsq_test.cc:44:23: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
unittest/nthitem_test.cc:27:23: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
unittest/nthitem_test.cc:68:21: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
unittest/stats_test.cc:26:23: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modified definition avoids warnings caused by redundant semicolons.
Now a semicolon is required when using the macro, so a few code locations
had to be updated.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz which reported this issue:
intmatcher.cpp:1121:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
#0 0x61034b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*, short) tesseract/src/classify/intmatcher.cpp:1121:17
#1 0x60f560 in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:514:11
#2 0x5f3a25 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
#3 0x5f2ccd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
#4 0x5f16ee in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7
This catches the out of bounds data reads in release builds.
Add also assertions for debug builds.
See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13818.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
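The combination of an assertion for debug builds and a range check for release builds can be sketched as follows. SafeLength is a hypothetical helper, the array size 24 is taken from the error message above, and returning 0 for an invalid index is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumLengths = 24;  // size taken from the error message

// Returns the entry at index, or 0 if index is out of bounds.
inline uint8_t SafeLength(const uint8_t (&lengths)[kNumLengths],
                          std::size_t index) {
  assert(index < kNumLengths);           // debug builds: fail loudly
  if (index >= kNumLengths) return 0;    // release builds: avoid the bad read
  return lengths[index];
}
```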
Credit to OSS-Fuzz which reported this issue:
intmatcher.cpp:1163:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
#0 0x610d3b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*) tesseract/src/classify/intmatcher.cpp:1163:17
#1 0x60ff4e in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:563:11
#2 0x5f4355 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
#3 0x5f35fd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
#4 0x5f201e in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7
This catches the out of bounds data reads, but does not fix the primary
reason: ProtoLengths currently gets values which are larger than the
allowed index.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz which reported this issue:
pageres.cpp:1143:7: runtime error: load of value 249, which is not a valid value for type 'bool'
#0 0x6ba560 in WERD_RES::Clear() tesseract/src/ccstruct/pageres.cpp:1143:7
#1 0x6b9fd1 in WERD_RES::operator=(WERD_RES const&) tesseract/src/ccstruct/pageres.cpp:193:3
#2 0x49a9ad in WERD_RES::WERD_RES(WERD_RES const&) tesseract/src/ccstruct/pageres.h:356:11
See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13707.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The old code was a hack to improve the performance.
The new code is clearer and results in the same binary when compiling
with gcc 8.3.0, so it looks like the old hack is no longer needed with
modern compilers.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- pass in ParamsVectors from Tesseract
(carrying values from langdata/config/api)
into LSTMRecognizer::Load and LoadDictionary
- after LSTMRecognizer's Dict is initialised
(with default values), reset the variables
user_{words,patterns}_{suffix,file} from the
corresponding entries in the passed vector
Warning from clang++:
..\src\ccmain\ltrresultiterator.cpp(454,8): warning: expression result unused [-Wunused-value]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
symbol_steps is a vector, so testing for a nullptr was wrong.
clang++ reports:
..\src\ccmain\ltrresultiterator.cpp(440,19): warning: comparison of address of 'this->word_res_->symbol_steps' equal to a null pointer is always false [-Wtautological-pointer-compare]
if (&word_res_->symbol_steps == nullptr || !LSTM_mode_) return nullptr;
~~~~~~~~~~~^~~~~~~~~~~~ ~~~~~~~
Signed-off-by: Stefan Weil <sw@weilnetz.de>
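Because the address of a member is never null, a test of the vector's contents is the meaningful check. This is a sketch with a hypothetical stand-in class; the actual fix in ltrresultiterator.cpp may differ.

```cpp
#include <vector>

// Hypothetical stand-in for the real result class.
struct WordResult {
  std::vector<int> symbol_steps;
};

// Test whether the vector has entries instead of comparing its address.
inline const std::vector<int>* StepsOrNull(const WordResult& word,
                                           bool lstm_mode) {
  if (word.symbol_steps.empty() || !lstm_mode) return nullptr;
  return &word.symbol_steps;
}
```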
svpaint is a standalone application (it includes a main function)
and should not be part of the Tesseract library.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz: it found another case which triggered this assertion:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502
This is the OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That runtime error is normally not visible because it does not abort
the program, but it is detected when the code is compiled with sanitizers.
It can be triggered with this OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Shift operations are undefined for negative numbers, but at least on
Intel they return the same value as a multiplication with 2 ^ shift value.
This fixes runtime errors reported by sanitizers and OSS-Fuzz:
intmatcher.cpp:821:59: runtime error: left shift of negative value -14
intmatcher.cpp:823:75: runtime error: left shift of negative value -512
intmatcher.cpp:820:50: runtime error: left shift of negative value -80
See issue #2297 and
https://oss-fuzz.com/testcase-detail/4845195990925312 for details.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
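Replacing the shift by a multiplication can be sketched like this. ScaleByPowerOfTwo is a hypothetical helper; it assumes 0 <= shift < 31 and a result which fits into int.

```cpp
// Scale a possibly negative value by 2 ^ shift without shifting the
// value itself. Only the non-negative constant 1 is shifted.
inline int ScaleByPowerOfTwo(int value, int shift) {
  return value * (1 << shift);
}
```

Compilers typically generate the same shift instruction for this multiplication, so the change should not cost performance.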
This fixes issue #2299, an issue which was already reported by
static code analyzers and now by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13597.
The Tesseract code assigns an address which is out-of-bounds to a pointer
variable, but increments that variable later. So this is a false positive.
Change the code nevertheless to satisfy OSS-Fuzz.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes an issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13592.
OSS-Fuzz triggered this assertion:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes a security issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13590.
Add also some assertions to catch similar bugs.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- move decision from ComputeTopN to ContinueContext, where
it belongs: block context continuations which emit final
codes translating to disabled unichar_ids.
(The normal logic for fallback from top2 > top2 > rest
will apply.)
- pass UNICHARSET refs appropriately
- ignore matrix outputs in ComputeTopN if they
belong to a disabled unichar_id
- pass UNICHARSET refs to check that
- in SetBlackAndWhitelist, also update the unicharset
of the lstm_recognizer_ instance, if any
This requires libarchive-dev.
Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:
$ unzip -l /usr/local/share/tessdata/zip.traineddata
Archive: /usr/local/share/tessdata/zip.traineddata
Length Date Time Name
--------- ---------- ----- ----
55 2019-03-05 15:27 bagit.txt
0 2019-03-05 15:25 data/
1557 2019-03-05 15:28 manifest-sha256.txt
1082890 2019-03-05 15:25 data/eng.word-dawg
1487588 2019-03-05 15:25 data/eng.lstm
7477 2019-03-05 15:25 data/eng.unicharset
63346 2019-03-05 15:25 data/eng.shapetable
976552 2019-03-05 15:25 data/eng.inttemp
13408 2019-03-05 15:25 data/eng.normproto
4322 2019-03-05 15:25 data/eng.punc-dawg
4738 2019-03-05 15:25 data/eng.lstm-number-dawg
1410 2019-03-05 15:25 data/eng.freq-dawg
844 2019-03-05 15:25 data/eng.pffmtable
6360 2019-03-05 15:25 data/eng.lstm-unicharset
1012 2019-03-05 15:25 data/eng.lstm-recoder
1047 2019-03-05 15:25 data/eng.unicharambigs
4322 2019-03-05 15:25 data/eng.lstm-punc-dawg
16109842 2019-03-05 15:25 data/eng.bigram-dawg
80 2019-03-05 15:25 data/eng.version
6426 2019-03-05 15:25 data/eng.number-dawg
3694794 2019-03-05 15:25 data/eng.lstm-word-dawg
--------- -------
23468070 21 files
`combine_tessdata -d` and `combine_tessdata -u` also work.
The traineddata files in the new format can be generated with
standard tools like zip or tar.
More work is needed for other training tools and big endian support.
Signed-off-by: Stefan Weil <sw@weilnetz.de>