The code was modernized using clang-tidy with "modernize-use-using".
The modified files were then formatted using clang-tidy with
"google-readability-braces-around-statements", then clang-format
was applied.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It was only defined for Windows builds.
Use also false instead of 0 to set the default value of
two boolean config variables.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Remove unneeded include statements for host.h, add required ones and
update the comments for the remaining include statements.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-loop-convert' -fix
Then the resulting code was cleaned manually.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-use-bool-literals' -fix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-use-auto' -fix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modifications were done using this command:
run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-use-override' -fix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warnings:
src/ccutil/unicharcompress.cpp:172:27: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
src/lstm/recodebeam.cpp:129:29: warning: comparison of integers of different signs: 'std::__cxx1998::vector::size_type' (aka 'unsigned long') and 'int' [-Wsign-compare]
src/lstm/recodebeam.cpp:276:48: warning: comparison of integers of different signs: 'std::__cxx1998::vector::size_type' (aka 'unsigned long') and 'int' [-Wsign-compare]
unittest/imagedata_test.cc:101:21: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
unittest/linlsq_test.cc:33:23: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
unittest/linlsq_test.cc:44:23: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
unittest/nthitem_test.cc:27:23: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
unittest/nthitem_test.cc:68:21: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
unittest/stats_test.cc:26:23: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The modified definition avoids warnings caused by redundant semicolons.
Now a semicolon is required when using the macro, so a few code locations
had to be updated.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz which reported this issue:
intmatcher.cpp:1121:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
#0 0x61034b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*, short) tesseract/src/classify/intmatcher.cpp:1121:17
#1 0x60f560 in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:514:11
#2 0x5f3a25 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
#3 0x5f2ccd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
#4 0x5f16ee in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7
This catches the out of bounds data reads in release builds.
Add also assertions for debug builds.
See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13818.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz which reported this issue:
intmatcher.cpp:1163:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
#0 0x610d3b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*) tesseract/src/classify/intmatcher.cpp:1163:17
#1 0x60ff4e in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:563:11
#2 0x5f4355 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
#3 0x5f35fd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
#4 0x5f201e in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7
This catches the out of bounds data reads, but does not fix the primary
reason: ProtoLengths currently gets values which are larger than the
allowed index.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz which reported this issue:
pageres.cpp:1143:7: runtime error: load of value 249, which is not a valid value for type 'bool'
#0 0x6ba560 in WERD_RES::Clear() tesseract/src/ccstruct/pageres.cpp:1143:7
#1 0x6b9fd1 in WERD_RES::operator=(WERD_RES const&) tesseract/src/ccstruct/pageres.cpp:193:3
#2 0x49a9ad in WERD_RES::WERD_RES(WERD_RES const&) tesseract/src/ccstruct/pageres.h:356:11
See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13707.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The old code was a hack to improve the performance.
The new code is clearer and results in the same binary when compiling
with gcc 8.3.0, so it looks like the old hack is no longer needed with
modern compilers.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- pass in ParamsVectors from Tesseract
(carrying values from langdata/config/api)
into LSTMRecognizer::Load and LoadDictionary
- after LSTMRecognizer's Dict is initialised
(with default values), reset the variables
user_{words,patterns}_{suffix,file} from the
corresponding entries in the passed vector
Warning from clang++:
..\src\ccmain\ltrresultiterator.cpp(454,8): warning: expression result unused [-Wunused-value]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
symbol_steps is a vector, so testing for a nullptr was wrong.
clang++ reports:
..\src\ccmain\ltrresultiterator.cpp(440,19): warning: comparison of address of 'this->word_res_->symbol_steps' equal to a null pointer is always false [-Wtautological-pointer-compare]
if (&word_res_->symbol_steps == nullptr || !LSTM_mode_) return nullptr;
~~~~~~~~~~~^~~~~~~~~~~~ ~~~~~~~
Signed-off-by: Stefan Weil <sw@weilnetz.de>
svpaint is a standalone application (it includes a main function)
and should not be part of the Tesseract library.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz: it found another case which triggered this assertion:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502
This is the OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That runtime error is normally not visible because it does not abort
the program, but is detected when the code was compiled with sanitizers.
It can be triggered with this OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Shift operations are undefined for negative numbers, but at least on
Intel they return the same value as a multiplication with 2 ^ shift value.
This fixes runtime errors reported by sanitizers and OSS-Fuzz:
intmatcher.cpp:821:59: runtime error: left shift of negative value -14
intmatcher.cpp:823:75: runtime error: left shift of negative value -512
intmatcher.cpp:820:50: runtime error: left shift of negative value -80
See issue #2297 and
https://oss-fuzz.com/testcase-detail/4845195990925312 for details.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes issue #2299, an issue which was already reported by
static code analyzers and now by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13597.
The Tesseract code assigns an address which is out-of-bounds to a pointer
variable, but increments that variable later. So this is a false positive.
Change the code nevertheless to satisfy OSS-Fuzz.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes an issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13592.
OSS-Fuzz triggered this assertion:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes a security issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13590.
Add also some assertions to catch similar bugs.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- move decision from ComputeTopN to ContinueContext, where
it belongs: block context continuations which emit final
codes translating to disabled unichar_ids.
(The normal logic for fallback from top2 > top2 > rest
will apply.)
- pass UNICHARSET refs appropriately
- ignore matrix outputs in ComputeTopN if they
belong to a disabled unichar_id
- pass UNICHARSET refs to check that
- in SetBlackAndWhitelist, also update the unicharset
of the lstm_recognizer_ instance, if any
This requires libarchive-dev.
Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:
$ unzip -l /usr/local/share/tessdata/zip.traineddata
Archive: /usr/local/share/tessdata/zip.traineddata
Length Date Time Name
--------- ---------- ----- ----
55 2019-03-05 15:27 bagit.txt
0 2019-03-05 15:25 data/
1557 2019-03-05 15:28 manifest-sha256.txt
1082890 2019-03-05 15:25 data/eng.word-dawg
1487588 2019-03-05 15:25 data/eng.lstm
7477 2019-03-05 15:25 data/eng.unicharset
63346 2019-03-05 15:25 data/eng.shapetable
976552 2019-03-05 15:25 data/eng.inttemp
13408 2019-03-05 15:25 data/eng.normproto
4322 2019-03-05 15:25 data/eng.punc-dawg
4738 2019-03-05 15:25 data/eng.lstm-number-dawg
1410 2019-03-05 15:25 data/eng.freq-dawg
844 2019-03-05 15:25 data/eng.pffmtable
6360 2019-03-05 15:25 data/eng.lstm-unicharset
1012 2019-03-05 15:25 data/eng.lstm-recoder
1047 2019-03-05 15:25 data/eng.unicharambigs
4322 2019-03-05 15:25 data/eng.lstm-punc-dawg
16109842 2019-03-05 15:25 data/eng.bigram-dawg
80 2019-03-05 15:25 data/eng.version
6426 2019-03-05 15:25 data/eng.number-dawg
3694794 2019-03-05 15:25 data/eng.lstm-word-dawg
--------- -------
23468070 21 files
`combine_tessdata -d` and `combine_tessdata -u` also work.
The traineddata files in the new format can be generated with
standard tools like zip or tar.
More work is needed for other training tools and big endian support.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from Apple's clang compiler:
[ 34%] Building CXX object CMakeFiles/libtesseract.dir/src/ccutil/errcode.cpp.o
/Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: warning: indirection of non-volatile null pointer will be deleted, not trap [-Wnull-dereference]
*reinterpret_cast<int*>(0) = 0;
^~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: note: consider using __builtin_trap() or qualifying pointer with 'volatile'
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Both functions are called very often, so computing the table values
at program start should be faster than computing them on demand.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warning:
src/lstm/recodebeam.cpp:270:41: warning: ‘current_char’ may be used uninitialized in this function [-Wmaybe-uninitialized]
It's a false positive, but setting the variable to 0 satisfies the compiler.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warnings:
src/viewer/scrollview.cpp:404:31: warning: ‘%s’ directive output may be
truncated writing up to 4095 bytes into a region of size between 4084 and 4093 [-Wformat-truncation=]
src/viewer/scrollview.cpp:572:31: warning: ‘%s’ directive output may be
truncated writing up to 4095 bytes into a region of size between 4084 and 4093 [-Wformat-truncation=]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warnings:
src/ccmain/docqual.cpp:734:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
src/ccmain/docqual.cpp:764:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
src/ccmain/docqual.cpp:782:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
[...]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
tesstrain_utils.sh sets the shell flag -e, so it exits immediately
if a command exits with a non-zero status.
The following command returns a non-zero status as soon as counter is a
multiple of par_factor (par_factor=8, that means as soon as 8 fonts or
images are processed):
let rem=counter%par_factor
The new code fixes this undesired exit.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
pango_coverage_get and pango_coverage_unref should not be called
with coverage == nullptr.
pango_font_get_coverage should not be called with font == nullptr.
Otherwise Pango prints runtime error messages:
(process:12657): Pango-CRITICAL **: pango_coverage_get: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_coverage_unref: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_font_get_coverage: assertion 'font != NULL' failed
(process:12657): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Typically those errors occur if a required font is not installed,
so this can be a quite common error.
Fix also a potential resource leak in PangoFontInfo::CoversUTF8Text.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit d36231e3e4 did not distinguish
between AVX and AVX2, so AVX2 code was enabled for IntSimdMatrix
even when only AVX was supported.
This resulted in an illegal instruction.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Bug message from AddressSanitizer:
==7153==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs free) on 0x602000072cb0
#0 0x7ffff70c6a10 in free (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc1a10)
#1 0x555557188638 in writeProfileToFile ../../../../../src/opencl/openclwrapper.cpp:541
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Bug message from AddressSanitizer:
==6158==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7fffe774b7fc at pc 0x555557086b54 bp 0x7fffffffcee0 sp 0x7fffffffced8
READ of size 1 at 0x7fffe774b7fc thread T0
#0 0x555557086b53 in tesseract::HistogramRect(Pix*, int, int, int, int, int, int*) ../../../../../src/ccstruct/otsuthr.cpp:163
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* Move IntDotProductSSE. That allows inlining of the code.
* Improve IntDotProductSSE by moving some instructions.
* Remove unused num_input_groups_ from IntSimdMatrix.
* Re-order elements in IntSimdMatrix to avoid padding.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Always use OpenCL device selection if OpenCL is enabled.
This fixes a regression which was introduced by commit
5c6a57b727 which removed
the definition for USE_DEVICE_SELECTION.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warning:
src/training/text2image.cpp:694:35: warning:
ISO C++ forbids converting a string constant to ‘char*’
[-Wwrite-strings]
putenv expects a string which can be modified.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit 49d7df6dc3 added error handling,
but since that commit Tesseract used the text fallback if the user
selected output failed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Using std::stringstream simplifies the code and allows conversion of
double to string independant of the current locale setting.
Signed-off-by: Stefan Weil <sw@weilnetz.de>