Using std::stringstream simplifies the code.
The <SP> element is needed between two >String> elements.
Remove also some unneeded spaces in the ALTO output.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes warnings from the Intel compiler:
src/textord/cjkpitch.cpp(319): warning #177:
function "<unnamed>::FPRow::good_gaps" was declared but never referenced
src/textord/cjkpitch.cpp(383): warning #177:
function "<unnamed>::FPRow::is_bad" was declared but never referenced
src/textord/cjkpitch.cpp(387): warning #177:
function "<unnamed>::FPRow::is_unknown" was declared but never referenced
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from the Intel compiler:
src/textord/cjkpitch.cpp(79): warning #177:
function "<unnamed>::SimpleStats::maximum" was declared
but never referenced
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Instrumented code throws this runtime error during OCR:
../../src/api/baseapi.cpp:1616:5: runtime error: load of value 128,
which is not a valid value for type 'bool'
../../src/api/baseapi.cpp:1627:5: runtime error: load of value 128,
which is not a valid value for type 'bool'
If there is no font information (typical for Tesseract with a LSTM model),
the font attributes got random values resulting in wrong hOCR output.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Instrumented code throws this runtime error during OCR:
../../src/ccstruct/matrix.h:84:11: runtime error:
null pointer passed as argument 2, which is declared to never be null
Signed-off-by: Stefan Weil <sw@weilnetz.de>
All also a C++ implementation with more aggressive compiler options
which is optimized for the CPU where the software was built.
It is now possible to select the function used for the dot product
with -c dotproduct=FUNCTION where FUNCTION can be one of those values:
* auto selection based on detected hardware (default)
* generic C++ code with default compiler options
* native C++ code optimized for build host
* avx optimized code for AVX
* sse optimized code for SSE
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This reduces the code size for intsimdmatrixavx2 from 2700 to 2668
and slightly improves the performance for fast models with AVX2.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This improves performace for the "best" models because it
avoids function calls.
The compiler also knows the passed values for the parameters
add_bias_fwd and skip_bias_back.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.
I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.
Python 3.6+ is required. Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.
There are minor output and behavioral changes, and advantages. Python's loggingis used. Temporary files are only deleted on success, so they can be inspected
if training files. Console output is more terse and the log file is more
verbose. And there are progress bars! (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.
Argument checking is also more comprehensive.
The local variable k should be 10 ^ (number of digits after comma),
but will overflow when there are more than 9 digits after the comma
because an int value cannot store 10000000000.
This results in wrong double values read from .tr files for example
(or in a runtime exception if Tesseract was compiled with -ftrapv).
Using uint64_t does not fix the general problem but allows more digits
which should be sufficient for the data read by Tesseract.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
shellcheck warning:
In /tesseract/src/training/tesstrain_utils.sh line 209:
TIMESTAMP=`date +%Y-%m-%d`
^-- SC2006: Use $(..) instead of legacy `..`.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The commit 10f2c45c00 unified the usage of mktemp, but with a
incorrect bash syntax and unnecessary definition of LANG_CODE
and TIMESTAMP. This patch fixes the above problems.
Compiler warning on macOS:
tesscallback.h:29:7: warning:
'TessClosure' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a compiler warning:
globaloc.cpp:33:6: warning: no previous extern declaration for
non-static variable 'global_crash_pixes'
[-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes some compiler warnings:
mainblk.cpp:28:9: warning: macro is not used [-Wunused-macros]
mainblk.cpp:29:9: warning: macro is not used [-Wunused-macros]
[...]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
TessBaseAPI::GetAvailableLanguagesAsVector returned the list of languages
without sorting, so the result was random and not user friendly.
Now `tesseract --list-langs` shows the available languages and scripts
in alphabetic order.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The format string which builds the command only takes one or two
string arguments, so the function allocated too much memory and
passed too many arguments to snprintf.
This also fixes a compiler warning (clang).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes two warnings from LGTM:
Parameter feature_defs hides a global variable with the same name.
Parameter Config hides a global variable with the same name.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
Bit field area of type int should have explicitly unsigned integral,
explicitly signed integral, or enumeration type.
Maybe area should be unsigned, but that would require lots of other
changes, so for now signedness is not changed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes several issues reported by LGTM:
Multiplication result may overflow 'int'
before it is converted to 'size_type'.
Multiplication result may overflow 'float'
before it is converted to 'double'.
Multiplication result may overflow 'int'
before it is converted to 'unsigned long'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy assignment operator in class BlamerBundle.
It is good practice to match a copy constructor
with a copy assignment operator.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy constructor in class C_OUTLINE_FRAG.
It is good practice to match a copy assignment operator
with a copy constructor.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It does not need an implementation as it is currently not used.
This fixes a warning from LGTM:
No matching copy constructor in class ROW.
It is good practice to match a copy assignment operator
with a copy constructor.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class BLOB_CHOICE.
It is good practice to match a copy constructor
with a copy assignment operator.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a warning from LGTM:
No matching copy assignment operator in class ParamsTrainingHypothesis.
It is good practice to match a copy constructor
with a copy assignment operator.
Use also a simpler expression for the size of features.
Signed-off-by: Stefan Weil <sw@weilnetz.de>