tesstrain_utils.sh sets the shell flag -e, so it exits immediately
if a command exits with a non-zero status.
The following command returns a non-zero status as soon as counter is a
multiple of par_factor (par_factor=8, that means as soon as 8 fonts or
images are processed):
let rem=counter%par_factor
The new code fixes this undesired exit.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
pango_coverage_get and pango_coverage_unref should not be called
with coverage == nullptr.
pango_font_get_coverage should not be called with font == nullptr.
Otherwise Pango prints runtime error messages:
(process:12657): Pango-CRITICAL **: pango_coverage_get: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_coverage_unref: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_font_get_coverage: assertion 'font != NULL' failed
(process:12657): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Typically those errors occur if a required font is not installed,
so this can be a quite common error.
Fix also a potential resource leak in PangoFontInfo::CoversUTF8Text.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warning:
src/training/text2image.cpp:694:35: warning:
ISO C++ forbids converting a string constant to ‘char*’
[-Wwrite-strings]
putenv expects a string which can be modified.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.
I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.
Python 3.6+ is required. Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.
There are minor output and behavioral changes, and advantages. Python's loggingis used. Temporary files are only deleted on success, so they can be inspected
if training files. Console output is more terse and the log file is more
verbose. And there are progress bars! (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.
Argument checking is also more comprehensive.
shellcheck warning:
In /tesseract/src/training/tesstrain_utils.sh line 209:
TIMESTAMP=`date +%Y-%m-%d`
^-- SC2006: Use $(..) instead of legacy `..`.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The commit 10f2c45c00 unified the usage of mktemp, but with a
incorrect bash syntax and unnecessary definition of LANG_CODE
and TIMESTAMP. This patch fixes the above problems.
This fixes two warnings from LGTM:
Parameter feature_defs hides a global variable with the same name.
Parameter Config hides a global variable with the same name.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Report an error and terminate if that fails.
Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main()
and add missing return at end of main().
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is done by creating a temporary file.
Report an error and terminate if that fails.
Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main().
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/training/icuerrorcode.h:44:7: warning:
'IcuErrorCode' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings from clang:
src/training/validator.h:72:7: warning:
'Validator' has no out-of-line virtual method definitions;
its vtable will be emitted in every translation unit [-Wweak-vtables]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- add linefeed after last line
- remove blanks at line endings
This fixes some warnings from clang:
src/training/validate_javanese.h:63:51: warning:
no newline at end of file [-Wnewline-eof]
src/training/validate_javanese.cpp:269:26: warning:
no newline at end of file [-Wnewline-eof]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
By default, that script creates two new temporary directories with random
names in /tmp.
The new command line flag --workspace_dir PATH uses the given path as
a base directory for all temporary files.
That allows better reproducable training results (no random directory
names in log files).
Signed-off-by: Stefan Weil <stweil@ub-backup.bib.uni-mannheim.de>
It is needed for running the training tutorial on Linux.
The correct mode was lost when moving the files in
commit 104fe7931c.
Signed-off-by: Stefan Weil <sw@weilnetz.de>