This reduces the code size for intsimdmatrixavx2 from 2700 to 2668
and slightly improves the performance for fast models with AVX2.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This improves performance for the "best" models because it
avoids function calls.
The compiler also knows the passed values for the parameters
add_bias_fwd and skip_bias_back.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.
I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.
Python 3.6+ is required. Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.
There are minor output and behavioral changes, and advantages. Python's logging
is used. Temporary files are only deleted on success, so they can be inspected
if training fails. Console output is more terse and the log file is more
verbose. And there are progress bars! (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.
Argument checking is also more comprehensive.
The local variable k should be 10 ^ (number of digits after comma),
but will overflow when there are more than 9 digits after the comma
because an int value cannot store 10000000000.
This results in wrong double values read from .tr files for example
(or in a runtime exception if Tesseract was compiled with -ftrapv).
Using uint64_t does not fix the general problem but allows more digits
which should be sufficient for the data read by Tesseract.
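
The limit described above can be illustrated with a small sketch. It is not
the Tesseract code: PowerOfTen and FitsInInt are hypothetical helpers, and a
32-bit int is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from Tesseract): returns 10^digits.
// uint64_t can hold powers of ten up to 10^19.
uint64_t PowerOfTen(int digits) {
  assert(digits >= 0 && digits <= 19);
  uint64_t k = 1;
  for (int i = 0; i < digits; ++i) {
    k *= 10;
  }
  return k;
}

// True if 10^digits would still fit into a 32-bit signed int.
bool FitsInInt(int digits) {
  return PowerOfTen(digits) <=
         static_cast<uint64_t>(std::numeric_limits<int32_t>::max());
}
```

With these helpers, FitsInInt(9) holds and FitsInInt(10) does not, which
matches the 9 digit limit of the old int variable.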
Signed-off-by: Stefan Weil <sw@weilnetz.de>
shellcheck warning:
In /tesseract/src/training/tesstrain_utils.sh line 209:
TIMESTAMP=`date +%Y-%m-%d`
^-- SC2006: Use $(..) instead of legacy `..`.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The commit 10f2c45c00 unified the usage of mktemp, but with
incorrect bash syntax and an unnecessary definition of LANG_CODE
and TIMESTAMP. This patch fixes the above problems.
Compiler warning on macOS:
tesscallback.h:29:7: warning:
'TessClosure' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a compiler warning:
globaloc.cpp:33:6: warning: no previous extern declaration for
non-static variable 'global_crash_pixes'
[-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes some compiler warnings:
mainblk.cpp:28:9: warning: macro is not used [-Wunused-macros]
mainblk.cpp:29:9: warning: macro is not used [-Wunused-macros]
[...]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
TessBaseAPI::GetAvailableLanguagesAsVector returned the list of languages
without sorting, so the result was random and not user friendly.
Now `tesseract --list-langs` shows the available languages and scripts
in alphabetic order.
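
A minimal sketch of the idea, assuming a std::vector<std::string> as a
stand-in for the container used by the API (SortedLanguages is a hypothetical
name, not a Tesseract function):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the list of language names found on disk.
// Takes the list by value and returns it sorted.
std::vector<std::string> SortedLanguages(std::vector<std::string> langs) {
  // std::sort compares std::string bytewise, so upper case names
  // sort before lower case ones.
  std::sort(langs.begin(), langs.end());
  return langs;
}
```

The order of the input (for example the order in which a directory scan
returns files) no longer matters for the result.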
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The format string which builds the command only takes one or two
string arguments, so the function allocated too much memory and
passed too many arguments to snprintf.
This also fixes a compiler warning (clang).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes two warnings from LGTM:
Parameter feature_defs hides a global variable with the same name.
Parameter Config hides a global variable with the same name.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
Bit field area of type int should have explicitly unsigned integral,
explicitly signed integral, or enumeration type.
Maybe area should be unsigned, but that would require lots of other
changes, so for now signedness is not changed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes several issues reported by LGTM:
Multiplication result may overflow 'int'
before it is converted to 'size_type'.
Multiplication result may overflow 'float'
before it is converted to 'double'.
Multiplication result may overflow 'int'
before it is converted to 'unsigned long'.
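
The usual fix for this kind of warning is to widen one operand before the
multiplication. The following sketch shows the pattern; BufferSize and Area
are hypothetical examples and not the functions changed here.

```cpp
#include <cstdint>

// Hypothetical example: number of elements in a width x height buffer.
// Both arguments are assumed to be non-negative.
std::uint64_t BufferSize(int width, int height) {
  // Widen one operand first so that the product is computed in 64 bits.
  // "width * height" alone would be computed in int and could overflow.
  return static_cast<std::uint64_t>(width) * height;
}

// Hypothetical example: the product of two floats returned as double.
double Area(float w, float h) {
  // Convert before multiplying so that the product is a double product.
  return static_cast<double>(w) * h;
}
```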
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy assignment operator in class BlamerBundle.
It is good practice to match a copy constructor
with a copy assignment operator.
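
One common way to add such an operator without an implementation is to
declare it as deleted. The sketch below assumes that approach; the class
Bundle is hypothetical, and whether this commit used "= delete" or a plain
declaration is not stated here.

```cpp
#include <type_traits>

// Hypothetical class (not BlamerBundle) with a user-defined copy constructor.
class Bundle {
 public:
  Bundle() = default;
  Bundle(const Bundle& other) : value_(other.value_) {}
  // Declared to match the copy constructor. It is deleted because
  // nothing uses it, so no implementation is needed.
  Bundle& operator=(const Bundle& other) = delete;
  int value() const { return value_; }

 private:
  int value_ = 0;
};

static_assert(std::is_copy_constructible<Bundle>::value,
              "copy construction still works");
static_assert(!std::is_copy_assignable<Bundle>::value,
              "copy assignment is rejected at compile time");
```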
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy constructor in class C_OUTLINE_FRAG.
It is good practice to match a copy assignment operator
with a copy constructor.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy constructor in class ROW.
It is good practice to match a copy assignment operator
with a copy constructor.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class BLOB_CHOICE.
It is good practice to match a copy constructor
with a copy assignment operator.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class ParamsTrainingHypothesis.
It is good practice to match a copy constructor
with a copy assignment operator.
Use also a simpler expression for the size of features.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class LineHypothesis.
It is good practice to match a copy constructor
with a copy assignment operator.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Renamed the global attribute glyph_confidences to lstm_choice_mode and the method GetGlyphConfidences() to GetChoices(). All variables and comments contained in related methods were renamed as well.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
This also fixes two warnings from LGTM:
Multiplication result may overflow 'float'
before it is converted to 'double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This should fix warnings from LGTM:
Multiplication result may overflow 'float'
before it is converted to 'double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This also fixes two warnings from LGTM:
Multiplication result may overflow 'float'
before it is converted to 'double'.
Replace also FALSE / TRUE by false / true for bool return value.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This should fix a warning from LGTM:
Multiplication result may overflow 'int' before it is
converted to 'unsigned long'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Showing them in a window (default) is not acceptable for a console
application like Tesseract which must be able to work in batch mode.
Such error messages can be triggered by TIFF files which include
vendor specific tags.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It only defines the macro partial_split_priority which is only used in
findseam.cpp, so move it to that file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* 'master' of https://github.com/tesseract-ocr/tesseract:
Remove code for _MSC_VER < 1900
keep API compatibility with #1265
Update googletest submodule to release v1.8.1
Update test submodule
Always use isascii() with isspace()
Avoid crash with --psm 0 and LSTM traineddata
SVPaint: Remove empty block
Classify: Don't hide debug parameter
UNICHARMAP: Remove comparison which is always false
svpaint: Change a variable from global to local
pgedit: remove unused declaration of display_bln_lines
Plumbing: Remove comparison which is always false
Release candidate 2
use pdf L_FLATE_ENCODE only for png input; fixes#1961
isspace() must only be used with an unsigned char or EOF argument,
and even then its result can depend on the current locale settings.
While this is not a problem for C/C++ executables which use the default
"C" locale, it becomes a problem when the Tesseract API is called from
languages like Python or Java which don't use the "C" locale.
By calling isascii() before calling isspace() this uncertainty can be
avoided, because any locale will hopefully give identical results for
the basic ASCII character set.
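
A minimal sketch of the pattern follows. IsAsciiSpace is a hypothetical
helper, and because isascii() is a POSIX function and not part of ISO C++,
the sketch uses an explicit range check in its place.

```cpp
#include <cctype>

// Hypothetical helper: whitespace test which rejects every non-ASCII byte
// before the locale dependent isspace() is asked.
bool IsAsciiSpace(char c) {
  // isspace() needs a value representable as unsigned char (or EOF).
  const unsigned char uc = static_cast<unsigned char>(c);
  // uc < 0x80 plays the role of isascii() here.
  return uc < 0x80 && std::isspace(uc) != 0;
}
```

A byte like 0xA0 is rejected by the range check before isspace() is called,
so the current locale cannot change the result for it.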
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
Poor global variable name 'rgb'. Prefer longer, descriptive
names for globals (eg. kMyGlobalConstant, not foo).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
This parameter of type ScrollView is 144 bytes
- consider passing a pointer/reference instead.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* 'master' of https://github.com/tesseract-ocr/tesseract: (27 commits)
Rework check for readable input file
fix "mktemp -d --tmpdir" on Mac OS; see #1453
pgedit: Change some variables from global to local ones
improve description of min_characters_to_try variable
WERD_RES: Remove comparisons which are constant
GENERIC_2D_ARRAY: Pass parameters by reference
genericvector: Pass parameters by reference
chop: Use more efficient float calculations for sqrt
rect: Use more efficient float calculations for ceil, floor
intproto: Use more efficient float calculations for floor
genericvector: Rewrite code to satisfy static code analyzer
Fix constructor for class Dict (uninitialized member variables)
Fix use of wrong UNICHARSET
lstmtraining: Remove dead code for purified model name
combine_tessdata: Handle failures when extracting
lstmtraining: Check write permission for output model
implement parameter min_characters_to_try for minimum characters to try to skip page entirely. fixes#1729
Merge and enhance documentation on language and script models
Document some more config options for tesseract
Add Makefile rule to build HTML manpages
...
This fixes compiler warnings and a warning from LGTM:
Poor global variable name 'pe'. Prefer longer, descriptive names [...]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
Comparison is always false because id >= 0.
Comparison is always true because mirrored >= 1.
Comparison is always false because id >= 0.
INVALID_UNICHAR_ID is -1, so the warnings are correct.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
This parameter of type FontClassInfo is 192 bytes
- consider passing a pointer/reference instead.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings like the following one from LGTM:
This parameter of type ParamsTrainingHypothesis is 112 bytes
- consider passing a pointer/reference instead.
Most parameters can also get the const attribute.
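
The change follows the usual pattern shown in this sketch. Hypothesis and
SumByReference are hypothetical names, and the struct is only a stand-in of
similar size (assuming a 4 byte float), not ParamsTrainingHypothesis.

```cpp
// Hypothetical struct of about 112 bytes (28 floats of 4 bytes each).
struct Hypothesis {
  float features[28];
};

// Passing by const reference avoids copying the whole struct for each call
// and documents that the function does not modify its argument.
float SumByReference(const Hypothesis& h) {
  float sum = 0.0f;
  for (float f : h.features) {
    sum += f;
  }
  return sum;
}
```

Callers do not need to change: SumByReference(h) is written the same way as
with a by-value parameter.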
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
Multiplication result may overflow 'float' before it is converted
to 'double'.
While the sqrt function always calculates with double, here the
overloaded std::sqrt can be used to handle the float arguments
more efficiently.
Replace also an old C++ type cast by a static_cast.
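
The difference can be seen in a small sketch. Length is a hypothetical
function and not the code changed here.

```cpp
#include <cmath>
#include <type_traits>

// std::sqrt has an overload for float which also returns float,
// so no conversion to double and back is needed.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) returns float");

// Hypothetical example: length of the vector (dx, dy) in float precision.
float Length(float dx, float dy) {
  return std::sqrt(dx * dx + dy * dy);
}
```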
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
Multiplication result may overflow 'float' before it is converted
to 'double'.
While the floor function always calculates with double, here the
overloaded std::floor can be used to handle the float arguments
more efficiently.
Replace also old C++ type casts by static_cast.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from LGTM:
Multiplication result may overflow 'float' before it is converted
to 'double'.
While the floor function always calculates with double, here the
overloaded std::floor can be used to handle the float arguments
more efficiently.
Replace also old C++ type casts by static_cast.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Warning from LGTM:
Resource data_ is acquired by class GenericVector<FontSpacingInfo *>
but not released in the destructor.
LGTM complains about data_ not being deleted in the destructor.
The destructor calls the clear() method, but the delete there
was conditional which confuses the static code analyzer.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
wildcard_unichar_id_, apostrophe_unichar_id_, question_unichar_id_ and
slash_unichar_id_ were not initialized in the constructor.
slash_unichar_id_ was used later in a conditional.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Report an error and terminate if that fails.
Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main()
and add missing return at end of main().
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is done by creating a temporary file.
Report an error and terminate if that fails.
Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main().
Signed-off-by: Stefan Weil <sw@weilnetz.de>