Shift operations are undefined for negative numbers, but at least on
Intel they return the same value as a multiplication with 2 ^ shift value.
This fixes runtime errors reported by sanitizers and OSS-Fuzz:
intmatcher.cpp:821:59: runtime error: left shift of negative value -14
intmatcher.cpp:823:75: runtime error: left shift of negative value -512
intmatcher.cpp:820:50: runtime error: left shift of negative value -80
See issue #2297 and
https://oss-fuzz.com/testcase-detail/4845195990925312 for details.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes issue #2299, an issue which was already reported by
static code analyzers and now by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13597.
The Tesseract code assigns an address which is out-of-bounds to a pointer
variable, but increments that variable later. So this is a false positive.
Change the code nevertheless to satisfy OSS-Fuzz.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes an issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13592.
OSS-Fuzz triggered this assertion:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes a security issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13590.
Add also some assertions to catch similar bugs.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- move decision from ComputeTopN to ContinueContext, where
it belongs: block context continuations which emit final
codes translating to disabled unichar_ids.
(The normal logic for fallback from top2 > top2 > rest
will apply.)
- pass UNICHARSET refs appropriately
- ignore matrix outputs in ComputeTopN if they
belong to a disabled unichar_id
- pass UNICHARSET refs to check that
- in SetBlackAndWhitelist, also update the unicharset
of the lstm_recognizer_ instance, if any
This requires libarchive-dev.
Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:
$ unzip -l /usr/local/share/tessdata/zip.traineddata
Archive: /usr/local/share/tessdata/zip.traineddata
Length Date Time Name
--------- ---------- ----- ----
55 2019-03-05 15:27 bagit.txt
0 2019-03-05 15:25 data/
1557 2019-03-05 15:28 manifest-sha256.txt
1082890 2019-03-05 15:25 data/eng.word-dawg
1487588 2019-03-05 15:25 data/eng.lstm
7477 2019-03-05 15:25 data/eng.unicharset
63346 2019-03-05 15:25 data/eng.shapetable
976552 2019-03-05 15:25 data/eng.inttemp
13408 2019-03-05 15:25 data/eng.normproto
4322 2019-03-05 15:25 data/eng.punc-dawg
4738 2019-03-05 15:25 data/eng.lstm-number-dawg
1410 2019-03-05 15:25 data/eng.freq-dawg
844 2019-03-05 15:25 data/eng.pffmtable
6360 2019-03-05 15:25 data/eng.lstm-unicharset
1012 2019-03-05 15:25 data/eng.lstm-recoder
1047 2019-03-05 15:25 data/eng.unicharambigs
4322 2019-03-05 15:25 data/eng.lstm-punc-dawg
16109842 2019-03-05 15:25 data/eng.bigram-dawg
80 2019-03-05 15:25 data/eng.version
6426 2019-03-05 15:25 data/eng.number-dawg
3694794 2019-03-05 15:25 data/eng.lstm-word-dawg
--------- -------
23468070 21 files
`combine_tessdata -d` and `combine_tessdata -u` also work.
The traineddata files in the new format can be generated with
standard tools like zip or tar.
More work is needed for other training tools and big endian support.
Signed-off-by: Stefan Weil <sw@weilnetz.de>