The old code was a hack to improve the performance.
The new code is clearer and results in the same binary when compiling
with gcc 8.3.0, so it looks like the old hack is no longer needed with
modern compilers.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- pass in ParamsVectors from Tesseract
(carrying values from langdata/config/api)
into LSTMRecognizer::Load and LoadDictionary
- after LSTMRecognizer's Dict is initialised
(with default values), reset the variables
user_{words,patterns}_{suffix,file} from the
corresponding entries in the passed vector
Warning from clang++:
..\src\ccmain\ltrresultiterator.cpp(454,8): warning: expression result unused [-Wunused-value]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
symbol_steps is a vector, so testing for a nullptr was wrong.
clang++ reports:
..\src\ccmain\ltrresultiterator.cpp(440,19): warning: comparison of address of 'this->word_res_->symbol_steps' equal to a null pointer is always false [-Wtautological-pointer-compare]
if (&word_res_->symbol_steps == nullptr || !LSTM_mode_) return nullptr;
~~~~~~~~~~~^~~~~~~~~~~~ ~~~~~~~
Signed-off-by: Stefan Weil <sw@weilnetz.de>
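A minimal sketch of why the symbol_steps comparison above is always false. WordResult is a hypothetical stand-in for the real word result class, and testing for an empty vector is only one possible replacement; the message does not say which form the actual fix takes.

```cpp
#include <vector>

// Hypothetical stand-in for the real word result class.
struct WordResult {
  std::vector<int> symbol_steps;
};

// The address of a member object is never null, so comparing
// &word.symbol_steps with nullptr cannot detect missing data.
// Asking the vector itself whether it holds elements does.
inline bool HasSymbolSteps(const WordResult& word) {
  return !word.symbol_steps.empty();
}
```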
svpaint is a standalone application (it includes a main function)
and should not be part of the Tesseract library.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz: it found another case which triggered this assertion:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502
This is the OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That runtime error is normally not visible because it does not abort
the program, but it is detected when the code is compiled with sanitizers.
It can be triggered with this OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Shift operations are undefined for negative numbers, but at least on
Intel they return the same value as a multiplication with 2 ^ shift value.
This fixes runtime errors reported by sanitizers and OSS-Fuzz:
intmatcher.cpp:821:59: runtime error: left shift of negative value -14
intmatcher.cpp:823:75: runtime error: left shift of negative value -512
intmatcher.cpp:820:50: runtime error: left shift of negative value -80
See issue #2297 and
https://oss-fuzz.com/testcase-detail/4845195990925312 for details.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
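A sketch of the replacement described for the negative left shifts, assuming the shift count is a small non-negative value. ScaleByPowerOfTwo is a hypothetical helper name, not a function from intmatcher.cpp.

```cpp
#include <cstdint>

// Scales value by 2 ^ shift without shifting a negative operand.
// 1 << shift is well defined for 0 <= shift < 31; the caller is assumed
// to keep the product within the range of int32_t.
inline int32_t ScaleByPowerOfTwo(int32_t value, int shift) {
  return value * (1 << shift);
}
```

For non-negative values this gives the same result as value << shift, so only the behaviour for negative input changes from undefined to defined.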
This fixes issue #2299, an issue which was already reported by
static code analyzers and now by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13597.
The Tesseract code assigns an address which is out-of-bounds to a pointer
variable, but increments that variable later. So this is a false positive.
Change the code nevertheless to satisfy OSS-Fuzz.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes an issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13592.
OSS-Fuzz triggered this assertion:
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Credit to OSS-Fuzz:
This fixes a security issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13590.
Add also some assertions to catch similar bugs.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- move decision from ComputeTopN to ContinueContext, where
it belongs: block context continuations which emit final
codes translating to disabled unichar_ids.
(The normal logic for fallback from top2 > top2 > rest
will apply.)
- pass UNICHARSET refs appropriately
- ignore matrix outputs in ComputeTopN if they
belong to a disabled unichar_id
- pass UNICHARSET refs to check that
- in SetBlackAndWhitelist, also update the unicharset
of the lstm_recognizer_ instance, if any
This requires libarchive-dev.
Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:
$ unzip -l /usr/local/share/tessdata/zip.traineddata
Archive: /usr/local/share/tessdata/zip.traineddata
  Length      Date    Time    Name
---------  ---------- -----   ----
       55  2019-03-05 15:27   bagit.txt
        0  2019-03-05 15:25   data/
     1557  2019-03-05 15:28   manifest-sha256.txt
  1082890  2019-03-05 15:25   data/eng.word-dawg
  1487588  2019-03-05 15:25   data/eng.lstm
     7477  2019-03-05 15:25   data/eng.unicharset
    63346  2019-03-05 15:25   data/eng.shapetable
   976552  2019-03-05 15:25   data/eng.inttemp
    13408  2019-03-05 15:25   data/eng.normproto
     4322  2019-03-05 15:25   data/eng.punc-dawg
     4738  2019-03-05 15:25   data/eng.lstm-number-dawg
     1410  2019-03-05 15:25   data/eng.freq-dawg
      844  2019-03-05 15:25   data/eng.pffmtable
     6360  2019-03-05 15:25   data/eng.lstm-unicharset
     1012  2019-03-05 15:25   data/eng.lstm-recoder
     1047  2019-03-05 15:25   data/eng.unicharambigs
     4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
 16109842  2019-03-05 15:25   data/eng.bigram-dawg
       80  2019-03-05 15:25   data/eng.version
     6426  2019-03-05 15:25   data/eng.number-dawg
  3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
---------                     -------
 23468070                     21 files
`combine_tessdata -d` and `combine_tessdata -u` also work.
The traineddata files in the new format can be generated with
standard tools like zip or tar.
More work is needed for other training tools and big endian support.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from Apple's clang compiler:
[ 34%] Building CXX object CMakeFiles/libtesseract.dir/src/ccutil/errcode.cpp.o
/Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: warning: indirection of non-volatile null pointer will be deleted, not trap [-Wnull-dereference]
*reinterpret_cast<int*>(0) = 0;
^~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: note: consider using __builtin_trap() or qualifying pointer with 'volatile'
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Both functions are called very often, so computing the table values
at program start should be faster than computing them on demand.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
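A sketch of a lookup table that is filled at program start instead of on demand. The message does not name the two functions; tanh, the table size and the resolution are assumptions chosen only for illustration.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kTableSize = 4096;  // hypothetical number of entries
constexpr double kScale = 256.0;          // hypothetical entries per unit of x

inline std::array<double, kTableSize> MakeTanhTable() {
  std::array<double, kTableSize> table{};
  for (std::size_t i = 0; i < kTableSize; ++i) {
    table[i] = std::tanh(static_cast<double>(i) / kScale);
  }
  return table;
}

// Filled once during static initialization, before main() runs,
// so the lookup below needs no "already computed?" test.
static const std::array<double, kTableSize> kTanhTable = MakeTanhTable();

// Table lookup for finite x; values beyond the table saturate at +/-1.
inline double FastTanh(double x) {
  const bool negative = x < 0.0;
  const double ax = negative ? -x : x;
  if (ax >= kTableSize / kScale) return negative ? -1.0 : 1.0;
  const double y = kTanhTable[static_cast<std::size_t>(ax * kScale)];
  return negative ? -y : y;
}
```

The trade-off is a fixed start-up cost in exchange for a branch-free hot path, which pays off when the function is called very often.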
gcc warning:
src/lstm/recodebeam.cpp:270:41: warning: ‘current_char’ may be used uninitialized in this function [-Wmaybe-uninitialized]
It's a false positive, but setting the variable to 0 satisfies the compiler.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warnings:
src/viewer/scrollview.cpp:404:31: warning: ‘%s’ directive output may be
truncated writing up to 4095 bytes into a region of size between 4084 and 4093 [-Wformat-truncation=]
src/viewer/scrollview.cpp:572:31: warning: ‘%s’ directive output may be
truncated writing up to 4095 bytes into a region of size between 4084 and 4093 [-Wformat-truncation=]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warnings:
src/ccmain/docqual.cpp:734:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
src/ccmain/docqual.cpp:764:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
src/ccmain/docqual.cpp:782:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
[...]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
tesstrain_utils.sh sets the shell flag -e, so it exits immediately
if a command exits with a non-zero status.
The following command returns a non-zero status as soon as counter is a
multiple of par_factor (par_factor=8, that means as soon as 8 fonts or
images are processed):
let rem=counter%par_factor
The new code fixes this undesired exit.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
pango_coverage_get and pango_coverage_unref should not be called
with coverage == nullptr.
pango_font_get_coverage should not be called with font == nullptr.
Otherwise Pango prints runtime error messages:
(process:12657): Pango-CRITICAL **: pango_coverage_get: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_coverage_unref: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_font_get_coverage: assertion 'font != NULL' failed
(process:12657): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Typically those errors occur if a required font is not installed,
so this can be a quite common error.
Fix also a potential resource leak in PangoFontInfo::CoversUTF8Text.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit d36231e3e4 did not distinguish
between AVX and AVX2, so AVX2 code was enabled for IntSimdMatrix
even when only AVX was supported.
This resulted in an illegal instruction.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Bug message from AddressSanitizer:
==7153==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs free) on 0x602000072cb0
#0 0x7ffff70c6a10 in free (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc1a10)
#1 0x555557188638 in writeProfileToFile ../../../../../src/opencl/openclwrapper.cpp:541
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Bug message from AddressSanitizer:
==6158==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7fffe774b7fc at pc 0x555557086b54 bp 0x7fffffffcee0 sp 0x7fffffffced8
READ of size 1 at 0x7fffe774b7fc thread T0
#0 0x555557086b53 in tesseract::HistogramRect(Pix*, int, int, int, int, int, int*) ../../../../../src/ccstruct/otsuthr.cpp:163
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* Move IntDotProductSSE. That allows inlining of the code.
* Improve IntDotProductSSE by moving some instructions.
* Remove unused num_input_groups_ from IntSimdMatrix.
* Re-order elements in IntSimdMatrix to avoid padding.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Always use OpenCL device selection if OpenCL is enabled.
This fixes a regression which was introduced by commit
5c6a57b727 which removed
the definition for USE_DEVICE_SELECTION.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warning:
src/training/text2image.cpp:694:35: warning:
ISO C++ forbids converting a string constant to ‘char*’
[-Wwrite-strings]
putenv expects a string which can be modified.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit 49d7df6dc3 added error handling,
but since that commit Tesseract used the text fallback if the
user-selected output failed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Using std::stringstream simplifies the code and allows conversion of
double to string independent of the current locale setting.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
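A sketch of a locale-independent conversion with a string stream. DoubleToString is a hypothetical helper, and imbuing the classic "C" locale explicitly is an assumption about how independence from the global locale can be guaranteed; the actual code may do this differently.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Converts a double to text using the classic "C" locale, so the decimal
// separator is always '.' regardless of the global locale setting.
inline std::string DoubleToString(double value) {
  std::ostringstream stream;
  stream.imbue(std::locale::classic());
  stream << value;
  return stream.str();
}
```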
Using std::stringstream simplifies the code.
The <SP> element is needed between two <String> elements.
Remove also some unneeded spaces in the ALTO output.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from the Intel compiler:
src/textord/cjkpitch.cpp(319): warning #177:
function "<unnamed>::FPRow::good_gaps" was declared but never referenced
src/textord/cjkpitch.cpp(383): warning #177:
function "<unnamed>::FPRow::is_bad" was declared but never referenced
src/textord/cjkpitch.cpp(387): warning #177:
function "<unnamed>::FPRow::is_unknown" was declared but never referenced
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from the Intel compiler:
src/textord/cjkpitch.cpp(79): warning #177:
function "<unnamed>::SimpleStats::maximum" was declared
but never referenced
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Instrumented code throws this runtime error during OCR:
../../src/api/baseapi.cpp:1616:5: runtime error: load of value 128,
which is not a valid value for type 'bool'
../../src/api/baseapi.cpp:1627:5: runtime error: load of value 128,
which is not a valid value for type 'bool'
If there is no font information (typical for Tesseract with a LSTM model),
the font attributes got random values resulting in wrong hOCR output.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Instrumented code throws this runtime error during OCR:
../../src/ccstruct/matrix.h:84:11: runtime error:
null pointer passed as argument 2, which is declared to never be null
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Add also a C++ implementation with more aggressive compiler options
which is optimized for the CPU where the software was built.
It is now possible to select the function used for the dot product
with -c dotproduct=FUNCTION where FUNCTION can be one of those values:
* auto     selection based on detected hardware (default)
* generic  C++ code with default compiler options
* native   C++ code optimized for build host
* avx      optimized code for AVX
* sse      optimized code for SSE
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This reduces the code size for intsimdmatrixavx2 from 2700 to 2668
and slightly improves the performance for fast models with AVX2.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This improves performance for the "best" models because it
avoids function calls.
The compiler also knows the passed values for the parameters
add_bias_fwd and skip_bias_back.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.
I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.
Python 3.6+ is required. Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.
There are minor output and behavioral changes, and advantages. Python's logging
is used. Temporary files are only deleted on success, so they can be inspected
if training fails. Console output is more terse and the log file is more
verbose. And there are progress bars! (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.
Argument checking is also more comprehensive.
The local variable k should be 10 ^ (number of digits after comma),
but will overflow when there are more than 9 digits after the comma
because an int value cannot store 10000000000.
This results in wrong double values read from .tr files for example
(or in a runtime exception if Tesseract was compiled with -ftrapv).
Using uint64_t does not fix the general problem but allows more digits
which should be sufficient for the data read by Tesseract.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
shellcheck warning:
In /tesseract/src/training/tesstrain_utils.sh line 209:
TIMESTAMP=`date +%Y-%m-%d`
^-- SC2006: Use $(..) instead of legacy `..`.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The commit 10f2c45c00 unified the usage of mktemp, but with
incorrect bash syntax and unnecessary definition of LANG_CODE
and TIMESTAMP. This patch fixes the above problems.
Compiler warning on macOS:
tesscallback.h:29:7: warning:
'TessClosure' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
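A sketch of the usual remedy for -Wweak-vtables: give the class at least one virtual method which is defined outside of the class body. Closure and ConstantClosure are hypothetical classes, not the ones from tesscallback.h.

```cpp
// Header part: virtual methods are only declared inside the classes.
class Closure {
 public:
  virtual ~Closure();
  virtual int Run() = 0;
};

class ConstantClosure : public Closure {
 public:
  explicit ConstantClosure(int value) : value_(value) {}
  int Run() override;

 private:
  int value_;
};

// Source file part: these out-of-line definitions anchor each vtable in
// a single translation unit instead of every unit including the header.
Closure::~Closure() = default;

int ConstantClosure::Run() { return value_; }
```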
This fixes a compiler warning:
globaloc.cpp:33:6: warning: no previous extern declaration for
non-static variable 'global_crash_pixes'
[-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes some compiler warnings:
mainblk.cpp:28:9: warning: macro is not used [-Wunused-macros]
mainblk.cpp:29:9: warning: macro is not used [-Wunused-macros]
[...]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
TessBaseAPI::GetAvailableLanguagesAsVector returned the list of languages
without sorting, so the result was random and not user friendly.
Now `tesseract --list-langs` shows the available languages and scripts
in alphabetic order.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
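A sketch of the sorting step, using std::vector and std::string as stand-ins for the container and string types of the real API. SortedLanguages is a hypothetical helper.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the language names in ascending order. Directory listings come
// in no particular order, so sorting makes the output reproducible.
inline std::vector<std::string> SortedLanguages(std::vector<std::string> langs) {
  std::sort(langs.begin(), langs.end());
  return langs;
}
```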
The format string which builds the command only takes one or two
string arguments, so the function allocated too much memory and
passed too many arguments to snprintf.
This also fixes a compiler warning (clang).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes two warnings from LGTM:
Parameter feature_defs hides a global variable with the same name.
Parameter Config hides a global variable with the same name.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
Bit field area of type int should have explicitly unsigned integral,
explicitly signed integral, or enumeration type.
Maybe area should be unsigned, but that would require lots of other
changes, so for now signedness is not changed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes several issues reported by LGTM:
Multiplication result may overflow 'int'
before it is converted to 'size_type'.
Multiplication result may overflow 'float'
before it is converted to 'double'.
Multiplication result may overflow 'int'
before it is converted to 'unsigned long'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
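A sketch of the pattern behind these LGTM reports. PixelCount is a hypothetical example, and both arguments are assumed to be non-negative.

```cpp
#include <cstdint>

// width * height would be computed as int and could overflow before the
// result is widened. Converting the operands first makes the
// multiplication itself 64 bit wide.
inline uint64_t PixelCount(int width, int height) {
  return static_cast<uint64_t>(width) * static_cast<uint64_t>(height);
}
```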
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy assignment operator in class BlamerBundle.
It is good practice to match a copy constructor
with a copy assignment operator.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy constructor in class C_OUTLINE_FRAG.
It is good practice to match a copy assignment operator
with a copy constructor.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy constructor in class ROW.
It is good practice to match a copy assignment operator
with a copy constructor.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class BLOB_CHOICE.
It is good practice to match a copy constructor
with a copy assignment operator.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class ParamsTrainingHypothesis.
It is good practice to match a copy constructor
with a copy assignment operator.
Use also a simpler expression for the size of features.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class LineHypothesis.
It is good practice to match a copy constructor
with a copy assignment operator.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Renamed the global attribute glyph_confidences to lstm_choice_mode and the method GetGlyphConfidences() to GetChoices(). All variables and comments contained in related methods were renamed as well.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
This also fixes two warnings from LGTM:
Multiplication result may overflow 'float'
before it is converted to 'double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This should fix warnings from LGTM:
Multiplication result may overflow 'float'
before it is converted to 'double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This also fixes two warnings from LGTM:
Multiplication result may overflow 'float'
before it is converted to 'double'.
Replace also FALSE / TRUE by false / true for bool return value.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This should fix a warning from LGTM:
Multiplication result may overflow 'int' before it is
converted to 'unsigned long'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Showing them in a window (default) is not acceptable for a console
application like Tesseract which must be able to work in batch mode.
Such error messages can be triggered by TIFF files which include
vendor specific tags.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It only defines the macro partial_split_priority which is only used in
findseam.cpp, so move it to that file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* 'master' of https://github.com/tesseract-ocr/tesseract:
Remove code for _MSC_VER < 1900
keep API compatibility with #1265
Update googletest submodule to release v1.8.1
Update test submodule
Always use isascii() with isspace()
Avoid crash with --psm 0 and LSTM traineddata
SVPaint: Remove empty block
Classify: Don't hide debug parameter
UNICHARMAP: Remove comparison which is always false
svpaint: Change a variable from global to local
pgedit: remove unused declaration of display_bln_lines
Plumbing: Remove comparison which is always false
Release candidate 2
use pdf L_FLATE_ENCODE only for png input; fixes#1961
isspace() must only be used with an unsigned char or EOF argument,
and even then its result can depend on the current locale settings.
While this is not a problem for C/C++ executables which use the default
"C" locale, it becomes a problem when the Tesseract API is called from
languages like Python or Java which don't use the "C" locale.
By calling isascii() before calling isspace() this uncertainty can be
avoided, because any locale will hopefully give identical results for
the basic ASCII character set.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
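A sketch of the guarded whitespace test. IsAsciiSpace is a hypothetical helper, and the explicit range check stands in for isascii(), which is not part of standard C++.

```cpp
#include <cctype>

// Returns true only for whitespace from the basic ASCII character set.
inline bool IsAsciiSpace(char c) {
  // isspace() needs a value representable as unsigned char (or EOF).
  const unsigned char uc = static_cast<unsigned char>(c);
  // Only values 0..127 reach isspace(), so bytes above 127, whose
  // classification may depend on the locale, are never space.
  return uc < 0x80 && std::isspace(uc) != 0;
}
```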
This fixes a warning from LGTM:
Poor global variable name 'rgb'. Prefer longer, descriptive
names for globals (eg. kMyGlobalConstant, not foo).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
This parameter of type ScrollView is 144 bytes
- consider passing a pointer/reference instead.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* 'master' of https://github.com/tesseract-ocr/tesseract: (27 commits)
Rework check for readable input file
fix "mktemp -d --tmpdir" on Mac OS; see #1453
pgedit: Change some variables from global to local ones
improve description of min_characters_to_try variable
WERD_RES: Remove comparisons which are constant
GENERIC_2D_ARRAY: Pass parameters by reference
genericvector: Pass parameters by reference
chop: Use more efficient float calculations for sqrt
rect: Use more efficient float calculations for ceil, floor
intproto: Use more efficient float calculations for floor
genericvector: Rewrite code to satisfy static code analyzer
Fix constructor for class Dict (uninitialized member variables)
Fix use of wrong UNICHARSET
lstmtraining: Remove dead code for purified model name
combine_tessdata: Handle failures when extracting
lstmtraining: Check write permission for output model
implement parameter min_characters_to_try for minimum characters to try to skip page entirely. fixes#1729
Merge and enhance documentation on language and script models
Document some more config options for tesseract
Add Makefile rule to build HTML manpages
...
This fixes compiler warnings and a warning from LGTM:
Poor global variable name 'pe'. Prefer longer, descriptive names [...]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
Comparison is always false because id >= 0.
Comparison is always true because mirrored >= 1.
Comparison is always false because id >= 0.
INVALID_UNICHAR_ID is -1, so the warnings are correct.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
This parameter of type FontClassInfo is 192 bytes
- consider passing a pointer/reference instead.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings like the following one from LGTM:
This parameter of type ParamsTrainingHypothesis is 112 bytes
- consider passing a pointer/reference instead.
Most parameters can also get the const attribute.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
Multiplication result may overflow 'float' before it is converted
to 'double'.
While the sqrt function always calculates with double, here the
overloaded std::sqrt can be used to handle the float arguments
more efficiently.
Replace also an old C++ type cast by a static_cast.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
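A sketch of the float overload mentioned for sqrt. Length is a hypothetical function, not the one changed in chop.

```cpp
#include <cmath>
#include <type_traits>

// std::sqrt has an overload for float, so no conversion to double happens.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) is expected to return float");

// The whole calculation stays in float.
inline float Length(float x, float y) {
  return std::sqrt(x * x + y * y);
}
```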
This fixes warnings from LGTM:
Multiplication result may overflow 'float' before it is converted
to 'double'.
While the floor function always calculates with double, here the
overloaded std::floor can be used to handle the float arguments
more efficiently.
Replace also old C++ type casts by static_cast.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
Multiplication result may overflow 'float' before it is converted
to 'double'.
While the floor function always calculates with double, here the
overloaded std::floor can be used to handle the float arguments
more efficiently.
Replace also old C++ type casts by static_cast.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Warning from LGTM:
Resource data_ is acquired by class GenericVector<FontSpacingInfo *>
but not released in the destructor.
LGTM complains about data_ not being deleted in the destructor.
The destructor calls the clear() method, but the delete there
was conditional which confuses the static code analyzer.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
wildcard_unichar_id_, apostrophe_unichar_id_, question_unichar_id_ and
slash_unichar_id_ were not initialized in the constructor.
slash_unichar_id_ was used later in a conditional.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Report an error and terminate if that fails.
Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main()
and add missing return at end of main().
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is done by creating a temporary file.
Report an error and terminate if that fails.
Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main().
Signed-off-by: Stefan Weil <sw@weilnetz.de>
While orientation and script detection (OSD) normally requires
osd.traineddata to detect both, it must also be possible to do
only orientation detection with eng.traineddata or any other
traineddata.
Enforce osd.traineddata only if there was no `-l` command line option.
Commit 27ce472666 was too restrictive.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* 'master' of https://github.com/tesseract-ocr/tesseract:
Fix CID 1164579 (Explicit null dereferenced)
print help for tesstrain.sh; fixes#1469
Fix CID 1395882 (Uninitialized scalar variable)
Fix comments
Move content of ipoints.h to points.h and remove ipoints.h
remove duplicate help from combine_lang_model
Fix typo.
use tprintf instead of printf to be able disable messages by quiet option (issue #1240)
add "sudo ldconfig" to install instruction. fixes#1212
unittest: Replace NULL by nullptr
unittest: Format code
tesseract app: check if input file exists; fixes#1023
Format code (replace ( xxx ) by (xxx))
Simplify boolean expressions
Win32: use the ISO C and C++ conformant name "_putenv" instead of deprecated "putenv"
The report from Coverity Scan is a false positive.
Nevertheless the code can be rewritten and optimized
a little bit to fix that report.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The implementation for ICOORD only allows division by scale != 0.
Do the same for FCOORD by asserting that scale != 0.0f,
so undefined program behaviour will be caught.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
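A sketch of a division operator with the zero check for float coordinates. Point2f is a hypothetical stand-in for FCOORD, and the standard assert macro stands in for whatever assertion macro the real code uses.

```cpp
#include <cassert>

// Hypothetical stand-in for a float coordinate class.
struct Point2f {
  float x;
  float y;

  // Division by zero is a caller error; the assertion reports it in
  // builds where assertions are enabled.
  Point2f& operator/=(float scale) {
    assert(scale != 0.0f);
    x /= scale;
    y /= scale;
    return *this;
  }
};
```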
Only print "Merging rows..." if textord_debug_blob==true (like all the other debug messages).
Otherwise, there are a lot of "Merging rows..." messages in console output.
The error message "segmentation fault" confuses most users,
so enforce a segmentation fault only in debug code.
Release code simply calls the abort function.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Wrong or old parameters in traineddata files should not terminate
the program, so make that a warning instead of a fatal error.
This fixes issue #1520.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/textord/equationdetectbase.h:32:7: warning:
'EquationDetectBase' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/textord/blobgrid.h:33:7: warning:
'BlobGrid' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/textord/bbgrid.h:53:7: warning:
'GridBase' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/textord/alignedblob.h:81:7: warning:
'AlignedBlob' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/lstm/weightmatrix.h:33:7: warning:
'TransposedArray' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/ccutil/indexmapbidi.h:102:7: warning:
'IndexMapBiDi' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/training/icuerrorcode.h:44:7: warning:
'IcuErrorCode' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/training/validator.h:72:7: warning:
'Validator' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/dict/dawg.h:119:7: warning:
'Dawg' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/cutil/cutil_class.h:27:7: warning:
'CUtil' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/ccutil/indexmapbidi.h:102:7: warning:
'IndexMapBiDi' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/ccutil/ccutil.h:51:7: warning:
'CCUtil' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/ccstruct/matrix.h:575:7: warning:
'MATRIX' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/ccstruct/ccstruct.h:25:7: warning:
'CCStruct' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/viewer/scrollview.h:86:7: warning:
'SVEventHandler' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/ccmain/mutableiterator.h:44:7: warning:
'MutableIterator' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/ccmain/ltrresultiterator.h:48:16: warning:
'LTRResultIterator' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Either it was not needed, or it could be replaced by checking
for not _WIN32.
This fixes a compiler warning from clang:
src/ccutil/platform.h:41:9: warning:
macro name is a reserved identifier [-Wreserved-id-macro]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Compiler warning from clang:
src/api/pdfrenderer.cpp:848:28: warning:
cast from 'const char *' to 'char *' drops const qualifier [-Wcast-qual]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
size_t would require a different format string. Here an unsigned int
is sufficient in both cases, so use that.
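
A small sketch of the format string issue, assuming printf-style formatting. The helpers `FormatCount` and `FormatSize` are hypothetical and only show which conversion matches which type.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helpers, not taken from the Tesseract sources.

// An unsigned int matches the "%u" conversion.
std::string FormatCount(unsigned int count) {
  char buffer[32];
  std::snprintf(buffer, sizeof(buffer), "%u", count);
  return buffer;
}

// A size_t needs the "z" length modifier ("%zu").
std::string FormatSize(std::size_t size) {
  char buffer[32];
  std::snprintf(buffer, sizeof(buffer), "%zu", size);
  return buffer;
}
```

Passing a size_t to "%u" (or an unsigned int to "%zu") is a type mismatch on platforms where the two types differ in size.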
This error was found by lgtm, see
https://lgtm.com/projects/g/tesseract-ocr/tesseract/.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Compiler warnings from clang:
src/textord/makerow.cpp:2579:36: warning:
cast from 'const void *' to 'BLOBNBOX **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2581:36: warning:
cast from 'const void *' to 'BLOBNBOX **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2601:31: warning:
cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2603:31: warning:
cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2623:31: warning:
cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2625:31: warning:
cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]
Warning from lgtm:
Local variable 'blob' hides a parameter of the same name.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Compiler warnings from clang:
src/ccstruct/werd.cpp:128:4: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/werd.cpp:394:18: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/werd.cpp:394:27: warning:
cast from 'const void *' to 'WERD **' drops const qualifier [-Wcast-qual]
src/ccstruct/werd.cpp:395:18: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/werd.cpp:395:27: warning:
cast from 'const void *' to 'WERD **' drops const qualifier [-Wcast-qual]
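
A sketch of how a qsort comparator can avoid both warnings, assuming the array holds pointers to elements. The type `Item` and the functions `CompareByLeft` and `SortByLeft` are hypothetical stand-ins, not the actual code from werd.cpp.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical element type, for illustration only.
struct Item {
  int left;
};

// Comparator for qsort over an array of Item pointers.
static int CompareByLeft(const void* arg1, const void* arg2) {
  // Old form which triggers -Wold-style-cast and -Wcast-qual:
  //   Item* a = *(Item**)arg1;
  // The named cast below keeps the const qualifier on every level.
  const Item* a = *static_cast<const Item* const*>(arg1);
  const Item* b = *static_cast<const Item* const*>(arg2);
  return (a->left > b->left) - (a->left < b->left);
}

// Sorts the pointers in ascending order of Item::left.
void SortByLeft(Item** items, std::size_t count) {
  std::qsort(items, count, sizeof(*items), CompareByLeft);
}
```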
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Compiler warnings from clang:
src/ccstruct/polyblk.cpp:194:16: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:195:16: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:292:45: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:30:9: warning:
macro is not used [-Wunused-macros]
src/ccstruct/polyblk.cpp:348:8: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:358:12: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:362:26: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:383:21: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:383:36: warning:
cast from 'const void *' to 'ICOORDELT **' drops const qualifier [-Wcast-qual]
src/ccstruct/polyblk.cpp:384:21: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:384:36: warning:
cast from 'const void *' to 'ICOORDELT **' drops const qualifier [-Wcast-qual]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Compiler warnings from clang:
src/ccstruct/ocrblock.cpp:74:12: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/ocrblock.cpp:74:21: warning:
cast from 'const void *' to 'ROW **' drops const qualifier [-Wcast-qual]
src/ccstruct/ocrblock.cpp:75:16: warning:
cast from 'const void *' to 'ROW **' drops const qualifier [-Wcast-qual]
src/ccstruct/ocrblock.cpp:75:7: warning:
use of old-style cast [-Wold-style-cast]
Also make the function decreasing_top_order a local function, as it is
only used locally, and remove its global declarations (2 locations).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Compiler warnings from clang:
src/ccstruct/mod128.cpp:57:15: warning:
no previous extern declaration for non-static variable 'dirtab' [-Wmissing-variable-declarations]
src/ccstruct/mod128.cpp:57:24: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/mod128.cpp:57:35: warning:
cast from 'const short *' to 'ICOORD *' drops const qualifier [-Wcast-qual]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Compiler warnings from clang:
src/ccstruct/genblob.cpp:34:20: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/genblob.cpp:34:32: warning:
cast from 'const void *' to 'C_BLOB **' drops const qualifier [-Wcast-qual]
src/ccstruct/genblob.cpp:35:20: warning:
use of old-style cast [-Wold-style-cast]
src/ccstruct/genblob.cpp:35:32: warning:
cast from 'const void *' to 'C_BLOB **' drops const qualifier [-Wcast-qual]
The function c_blob_comparator is only used in fixspace.cpp,
so move it to that file, make it a local function, and remove
genblob.cpp and genblob.h which are no longer needed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is only used in textord/topitch.cpp, so move it into that file.
Also remove the inline attribute as it has no effect here and
update the type casts to fix some compiler warnings from clang.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- add linefeed after last line
- remove blanks at line endings
This fixes some warnings from clang:
src/training/validate_javanese.h:63:51: warning:
no newline at end of file [-Wnewline-eof]
src/training/validate_javanese.cpp:269:26: warning:
no newline at end of file [-Wnewline-eof]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Instead of adding an empty TBOX at the end of the box list,
that corner case is now handled by passing a nullptr (like
it was already done for the first box in the list).
This avoids calling BoxMissMetric with an empty TBOX,
which raises an assertion there (b == 0).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It looks like the check cblob_ptr != nullptr is not needed.
If cblob_ptr were NULL, we would have seen crashes in compute_bounding_box.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Let's hope that word->best_choice is never NULL.
Otherwise both the old and the new code would abort with SIGSEGV.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The parameter glyph_confidences is changed from bool to int.
An execution with value 1 outputs the hOCR file enriched with glyph confidences
for every timestep like before. An execution with value 2 outputs the timesteps
accumulated over the recognized characters.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
Page segmentation mode "OSD only" requires osd.traineddata,
so use it automatically.
Report a warning if the user specified a different language.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
By default, that script creates two new temporary directories with random
names in /tmp.
The new command line flag --workspace_dir PATH uses the given path as
a base directory for all temporary files.
That makes training results more reproducible (no random directory
names in log files).
Signed-off-by: Stefan Weil <stweil@ub-backup.bib.uni-mannheim.de>
With the parameter -c glyph_confidences=true the user can enrich
the hOCR output with additional information. Tesseract then additionally
lists, for every timestep of the LSTM, all glyphs that were considered
together with their confidences.
The format of the hOCR output is slightly changed: There is now a linebreak
after every word for better readability by humans.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
One of the checks was too restrictive, as lstmeval deserializes
char arrays with 14000000 elements, so raise the limit to 30000000.
That check was added in commit 992031e824.
Also add assertions which help to find such problems in debug mode.
Signed-off-by: Stefan Weil <stweil@ub-backup.bib.uni-mannheim.de>
It is needed for running the training tutorial on Linux.
The correct mode was lost when moving the files in
commit 104fe7931c.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The Serialize method is used indirectly by MasterTrainer::Serialize,
but there is no corresponding MasterTrainer::DeSerialize.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
OpenclDevice::getDeviceSelection crashed when outdated information
was read from file and device.score was not set.
Change also the struct definitions from C to C++ and
eliminate some type casts.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit 4d514d5a60 introduced tprintf_internal
with an additional argument "level" which was removed again in commit
7dc5296fe9.
So we can now restore the original state without tprintf_internal.
Also remove the declaration of debug_window_on (it has not existed since
commit 030aae9896) and make the
configuration parameter debug_file local as it is only used by tprintf.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
`int depth = strtol(*str + 1, str, 10);`
`**str` holds the words in the VGSL specification, and `*str` holds a single word, say `Fr64`. `strtol` skips leading whitespace (and no other characters), expects its first argument to start with a number in valid integer format, and sets `str` to point to the first character that is not part of that number. Here `*str + 1` does not point to a number: it points to the `r` in `Fr64`. This is a bad argument, which results in `strtol` returning 0.
`strtol(*str + 2, str, 10)` should be used instead.
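
The effect of the offset can be reproduced with a small sketch. The helper `ParseDepthAt` is hypothetical and only wraps `strtol` for the example word `Fr64`; it does not update the caller's pointer like the original code does.

```cpp
#include <cstdlib>

// Hypothetical helper, for illustration only.
// Parses the integer that starts `offset` characters into `word`.
// Like strtol itself, it returns 0 when no digits are found there.
long ParseDepthAt(const char* word, int offset) {
  char* end = nullptr;  // receives the position after the parsed number
  return std::strtol(word + offset, &end, 10);
}
```

With the word `Fr64`, an offset of 1 starts parsing at `r` and yields 0, while an offset of 2 starts at the digits and yields 64.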
Limit the matrix to UINT16_MAX x UINT16_MAX.
Larger dimensions could also result in an arithmetic overflow
when multiplying the two dimensions.
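
A minimal sketch of such a limit check, assuming 32 bit unsigned dimensions. The helper `CheckedElementCount` is hypothetical and not the actual Tesseract code.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only.
// Stores dim1 * dim2 in *count and returns true if both dimensions are
// within the limit. Returns false and leaves *count untouched otherwise.
bool CheckedElementCount(uint32_t dim1, uint32_t dim2, uint32_t* count) {
  if (dim1 > UINT16_MAX || dim2 > UINT16_MAX) {
    return false;
  }
  // At most 65535 * 65535 = 4294836225, which still fits into 32 bits.
  *count = dim1 * dim2;
  return true;
}
```

Checking each dimension before the multiplication means the product can never wrap around, so no wider integer type is needed for the element count.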
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Wrong file data could give a large value for the number of vector elements
resulting in very large memory allocations.
Limit the allowed data range to UINT16_MAX (65535) elements
which hopefully should be sufficient for all use cases.
Changing the data type of the related member variables from int to
uint32_t allowed removing several type casts.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Add missing include statements, add missing "static" qualifiers or
remove functions which are not used at all.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* Add break in default case to avoid potential problems with
future case statements following the default case.
* Remove empty statement.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warnings:
src/ccstruct/coutln.cpp:231:15: warning:
variable 'destindex' may be uninitialized when used here [-Wconditional-uninitialized]
src/wordrec/language_model.cpp:1170:27: warning:
variable 'expected_gap' may be uninitialized when used here [-Wconditional-uninitialized]
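
A sketch of the kind of code that triggers -Wconditional-uninitialized and one common fix, namely initialising the variable at its declaration. The function `FirstNegativeIndex` is hypothetical and not taken from coutln.cpp or language_model.cpp.

```cpp
// Hypothetical function, for illustration only.
// Returns the index of the first negative value, or -1 if there is none.
int FirstNegativeIndex(const int* values, int count) {
  // Before: `int index;` was only assigned inside the loop, so the return
  // statement could read an indeterminate value when no element matched.
  int index = -1;
  for (int i = 0; i < count; ++i) {
    if (values[i] < 0) {
      index = i;
      break;
    }
  }
  return index;
}
```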
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warnings:
src/api/baseapi.cpp:1642:18: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1642:31: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1642:45: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1652:16: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1652:30: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1662:17: warning:
possible misuse of comma operator here [-Wcomma]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/ccstruct/polyblk.cpp:48:36: warning:
constructor parameter 'box' shadows the field 'box' of 'POLY_BLOCK'
[-Wshadow-field-in-constructor]
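
A sketch of the pattern behind this warning and one possible fix, renaming the constructor parameter. The types `Box` and `Block` are hypothetical; how POLY_BLOCK was actually changed is not stated here.

```cpp
// Hypothetical types, for illustration only.
struct Box {
  int width;
  int height;
};

class Block {
 public:
  // Before: `explicit Block(const Box& box) : box(box) {}`
  // The parameter had the same name as the member and shadowed it.
  explicit Block(const Box& bounding_box) : box(bounding_box) {}

  int Area() const { return box.width * box.height; }

 private:
  Box box;
};
```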
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/lstm/networkio.cpp:56:15: warning:
'this' pointer cannot be null in well-defined C++ code;
comparison may be assumed to always evaluate to true [-Wtautological-undefined-compare]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/lstm/lstmrecognizer.cpp:411:13: warning:
unused function 'NullIsBest' [-Wunused-function]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/lstm/network.cpp:249:7: warning:
'break' will never be executed [-Wunreachable-code-break]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The functions TessBaseAPIInitLangMod, TessBaseAPIClearAdaptiveClassifier
and TessBaseAPIDetectOrientationScript need conditional compilation.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Instead of defining the DISABLED_LEGACY_ENGINE macro in config_auto.h
(which is not included by all source files), define it as a preprocessor
option for those parts of the code which require it.
Signed-off-by: Stefan Weil <sw@weilnetz.de>