Commit Graph

3708 Commits

Author SHA1 Message Date
Stefan Weil
44a6d9f4d4 intmatcher: Catch more out of bounds reads
Credit to OSS-Fuzz which reported this issue:

intmatcher.cpp:1121:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
	    #0 0x61034b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*, short) tesseract/src/classify/intmatcher.cpp:1121:17
	    #1 0x60f560 in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:514:11
	    #2 0x5f3a25 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
	    #3 0x5f2ccd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
	    #4 0x5f16ee in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7

This catches the out of bounds data reads in release builds.
Add also assertions for debug builds.

See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13818.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 17:27:43 +01:00
Stefan Weil
5fd7228414 intmatcher: Catch out of bounds reads
Credit to OSS-Fuzz which reported this issue:

    intmatcher.cpp:1163:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
	    #0 0x610d3b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*) tesseract/src/classify/intmatcher.cpp:1163:17
	    #1 0x60ff4e in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:563:11
	    #2 0x5f4355 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
	    #3 0x5f35fd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
	    #4 0x5f201e in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7

This catches the out of bounds data reads, but does not fix the primary
reason: ProtoLengths currently gets values which are larger than the
allowed index.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 15:44:33 +01:00
Stefan Weil
509ee95023 IntegerMatcher: Fix data type of loop counters
ClassTemplate->ProtoLengths[n] is of type uint8_t, so use that for
the related loop counters, too.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 15:35:06 +01:00
Stefan Weil
f4f34a87db WERD_RES: Fix uninitialized member variable
Credit to OSS-Fuzz which reported this issue:

    pageres.cpp:1143:7: runtime error: load of value 249, which is not a valid value for type 'bool'
	    #0 0x6ba560 in WERD_RES::Clear() tesseract/src/ccstruct/pageres.cpp:1143:7
	    #1 0x6b9fd1 in WERD_RES::operator=(WERD_RES const&) tesseract/src/ccstruct/pageres.cpp:193:3
	    #2 0x49a9ad in WERD_RES::WERD_RES(WERD_RES const&) tesseract/src/ccstruct/pageres.h:356:11

See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13707.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 14:59:08 +01:00
Stefan Weil
afc099b9f4 intmatcher: Split data_table
The old code was a hack to improve the performance.

The new code is clearer and results in the same binary when compiling
with gcc 8.3.0, so it looks like the old hack is no longer needed with
modern compilers.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 08:15:40 +01:00
Stefan Weil
2fcb483efc Update test submodule
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-23 08:25:19 +01:00
Stefan Weil
9aadaaba27 Fix automake rules for doc-clean and doc-pack
They used the wrong directory and failed for out of tree builds.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-23 08:25:19 +01:00
Egor Pugin
11e09bd4a1
Update appveyor.yml 2019-03-22 16:33:55 +03:00
Egor Pugin
02f97c3f51
Update appveyor.yml 2019-03-22 15:17:26 +03:00
Stefan Weil
e1e56d9d66 Remove local function declarations from intmatcher.h
This requires moving the local function HeapSort to the beginning.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-22 11:39:39 +01:00
Stefan Weil
2ba194ca8d Remove four unused parameters
This fixes some compiler warnings:

    src/classify/intmatcher.cpp:711:63: warning: unused parameter ‘ConfigMask’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:1007:16: warning: unused parameter ‘ProtoMask’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:1095:61: warning: unused parameter ‘NumFeatures’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:1136:59: warning: unused parameter ‘used_features’ [-Wunused-parameter]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-22 11:30:24 +01:00
Stefan Weil
dd79d56e9f Remove unused parameter BlobLength
This fixes two compiler warnings:

    src/classify/intmatcher.cpp:553:14: warning: unused parameter ‘BlobLength’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:622:14: warning: unused parameter ‘BlobLength’ [-Wunused-parameter]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-22 11:17:19 +01:00
zdenop
ac7ea4322a
Merge pull request #2335 from Shreeshrii/master
Changes to tesstrain.py - max_workers=8, distort_image=false
2019-03-17 15:27:34 +01:00
zdenop
26877ba703 check min. python version; os.uname is not available on windows 2019-03-17 15:25:48 +01:00
zdenop
8891ba9711 add autotools options to cmake build 2019-03-17 14:50:36 +01:00
Shreeshrii
f8e8521606
Update tesstrain_utils.py 2019-03-17 15:32:35 +05:30
Shree
6fa8e1bb15 Set max_workers=8 2019-03-17 09:58:11 +00:00
Shree
e21499e81e Set default value for distort_image 2019-03-17 09:54:16 +00:00
Shree
af7a97e33e Merge branch 'master' of https://github.com/tesseract-ocr/tesseract 2019-03-16 14:30:24 +00:00
zdenop
ea3b806357
Merge pull request #2332 from stweil/doc
Remove old comments in file headers
2019-03-16 11:02:11 +01:00
Stefan Weil
ee2f9bf7bf Remove old comments in file headers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 10:55:00 +01:00
zdenop
1b40cae0f2
Merge pull request #2329 from Shreeshrii/kur_train
training script changes
2019-03-16 10:27:35 +01:00
zdenop
0b72f4b722
Merge pull request #2331 from stweil/doc
Improve man page for tesseract  and add Makefile rule for PDF
2019-03-16 10:26:16 +01:00
Stefan Weil
5f76a8495b Sort options alphabetically in tesseract man page
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 10:19:00 +01:00
Stefan Weil
b55984fb88 Add description for new --dpi option in tesseract man page
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 09:33:41 +01:00
Stefan Weil
26b4457b86 Add description for new --psm values in tesseract man page
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 09:24:40 +01:00
Stefan Weil
a6981ae548 Improve man page for tesseract
Format it like the example
https://github.com/asciidoc/asciidoc/blob/master/doc/asciidoc.1.txt.

Replace tab characters by blanks.

Add also a chapter on environment variables.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 08:54:28 +01:00
Stefan Weil
6b3c81c909 Add rule for PDF documentation
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-15 21:53:34 +01:00
Shree
804d2aaecf Merge branch 'master' of https://github.com/tesseract-ocr/tesseract 2019-03-15 17:41:12 +00:00
Shree
d47b0d588a Use LATIN_FONTS for kmr 2019-03-15 15:47:56 +00:00
Shree
3eee1d217a Add kmr and kur_ara, remove kur from training scripts 2019-03-15 15:37:49 +00:00
Shree
b2ebf0195f Add kmr and kur_ara, remove kur from training scripts 2019-03-15 14:39:39 +00:00
Shree
37befdf6c4 Add option for --distort_image 2019-03-15 13:32:36 +00:00
Egor Pugin
29389f7145
Fix appveyor artifacts. 2019-03-15 15:55:15 +03:00
Stefan Weil
e14797563b Update documentation for supported languages
kur_ara.traineddata was renamed to kmr.traineddata.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-15 11:07:54 +01:00
Stefan Weil
85d7feebf7 Add missing documentation for --help-extra
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-15 09:36:10 +01:00
zdenop
0a36b38169
Merge pull request #2317 from eighttails/master
Added missing linker flags for MinGW.
2019-03-15 08:01:21 +01:00
Stefan Weil
4c2bbebecc Fix compiler warning (-Wunused-value)
Warning from clang++:

    ..\src\ccmain\ltrresultiterator.cpp(454,8):  warning: expression result unused [-Wunused-value]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-13 20:56:03 +01:00
Stefan Weil
ed84ba0a44 Fix wrong comparison
symbol_steps is a vector, so testing for a nullptr was wrong.

clang++ reports:

    ..\src\ccmain\ltrresultiterator.cpp(440,19):  warning: comparison of address of 'this->word_res_->symbol_steps' equal to a null pointer is always false [-Wtautological-pointer-compare]
      if (&word_res_->symbol_steps == nullptr || !LSTM_mode_) return nullptr;
           ~~~~~~~~~~~^~~~~~~~~~~~    ~~~~~~~

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-13 20:38:38 +01:00
Tadahito Yao
bbbd262a8d Added missing linker flags for MinGW. 2019-03-13 22:10:36 +09:00
Stefan Weil
681e6301cd
Merge pull request #2316 from vidiecan/fix_accumulated_timesteps_check
`accumulated_timesteps` is not a pointer but a vector
2019-03-13 13:17:18 +01:00
jm server2
1206362d30 accumulated_timesteps is not a pointer but a vector and in case we use ChoiceIterator without lstm_choice_mode tesseract crashes (or similar) because the check is true and we reference not existing item 2019-03-13 12:55:14 +01:00
Stefan Weil
3baf0d8076 Fix boolean assignments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 15:34:24 +01:00
zdenop
e965de77eb
Merge pull request #2314 from stweil/svpaint
Remove svpaint.cpp from libtesseract
2019-03-12 12:36:22 +01:00
Stefan Weil
8ad0489f0f Remove svpaint.cpp from libtesseract
svpaint is a standalone application (it includes a main function)
and should not be part of the Tesseract library.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 12:22:53 +01:00
zdenop
7546a01020
Merge pull request #2310 from noahmetzger/LSTMChoiceRIL
Lstm choice ril
2019-03-12 10:46:11 +01:00
zdenop
76fbade3ee
Merge pull request #2309 from stweil/fuzz
Fix several runtime errors (found by OSS-Fuzz)
2019-03-12 10:45:21 +01:00
Stefan Weil
35a999f91a Fix assertion caused by wrong unicharset
Credit to OSS-Fuzz: it found another case which triggered this assertion:

    contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502

This is the OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 09:31:21 +01:00
Stefan Weil
56a39bda77 Fix float division by zero
That runtime error is normally not visible because it does not abort
the program, but is detected when the code was compiled with sanitizers.

It can be triggered with this OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 09:28:16 +01:00
Noah Metzger
5b3e2fe812 Integrated accumulated Symbol Choice in the Choice Iterator and made the api lstm_choice_mode independent
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-03-12 09:15:10 +01:00