C++17 drops support for `std::random_shuffle`, breaking C++17 compilers
that run to compile text2image.cpp. std::shuffle is valid on C++11
through C++17, so use std::shuffle instead.
Due to the use `std::random_shuffle`, `text2image --render_ngrams`
would not give consistent results for different compilers or platforms.
With the current change, the same random number generator is used for
all platforms and initialized to the same seed, so training output
should be consistent.
This fixes compiler warnings from clang++ like these ones:
src/ccutil/params.cpp:34:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:67:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:68:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:78:9: warning: macro is not used [-Wunused-macros]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That fixes several warnings from clang++ like the following one:
src/training/combine_lang_model.cpp:36:1: warning: no previous extern declaration for non-static variable 'FLAGS_lang_is_rtl' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That fixes several warnings from clang++ like the following one:
src/training/commontraining.cpp:95:1: warning: no previous extern declaration for non-static variable 'FLAGS_D' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes lots of compiler warnings like these ones:
src/api/baseapi.cpp:113:13: warning: no previous extern declaration for non-static variable 'kInputFile' [-Wmissing-variable-declarations]
src/api/baseapi.cpp:117:13: warning: no previous extern declaration for non-static variable 'kOldVarsFile' [-Wmissing-variable-declarations]
src/api/baseapi.cpp:97:10: warning: no previous extern declaration for non-static variable 'stream_filelist' [-Wmissing-variable-declarations]
src/ccmain/equationdetect.cpp:46:10: warning: no previous extern declaration for non-static variable 'equationdetect_save_bi_image' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This looks for one of the header files which are included by Tesseract.
It currently uses a hard coded path which works for Debian / Ubuntu.
Simplify also the rules for linking Tensorflow.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It expects include files in /usr/include/tensorflow.
* Add configure option --with-tensorflow (disabled by default)
* Fix data type tensorflow::int64
* Remove "third_party/" in include statements
* Add dummy implementations for Backward and DebugWeights in TFNetwork
* Add files generated with protoc from tfnetwork.proto
(so the Tensorflow sources are not needed for the build)
* Update Makefiles
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is defined for all platforms when math.h or cmath is included
after defining the macro _USE_MATH_DEFINES.
Define _USE_MATH_DEFINES before any include statement to make sure
that M_PI gets defined. It is not necessary to define it conditionally
only for Windows.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Void functions should not use @return. It causes compiler warnings
like this one:
src/classify/intproto.cpp:326:5: warning:
'@return' command used in a comment that is attached to a function
returning void [-Wdocumentation]
Some non-void functions also were documented with @return none.
Fix those comments, too.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Remove unneeded include statements for host.h, add required ones and
update the comments for the remaining include statements.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This requires libarchive-dev.
Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:
$ unzip -l /usr/local/share/tessdata/zip.traineddata
Archive: /usr/local/share/tessdata/zip.traineddata
Length Date Time Name
--------- ---------- ----- ----
55 2019-03-05 15:27 bagit.txt
0 2019-03-05 15:25 data/
1557 2019-03-05 15:28 manifest-sha256.txt
1082890 2019-03-05 15:25 data/eng.word-dawg
1487588 2019-03-05 15:25 data/eng.lstm
7477 2019-03-05 15:25 data/eng.unicharset
63346 2019-03-05 15:25 data/eng.shapetable
976552 2019-03-05 15:25 data/eng.inttemp
13408 2019-03-05 15:25 data/eng.normproto
4322 2019-03-05 15:25 data/eng.punc-dawg
4738 2019-03-05 15:25 data/eng.lstm-number-dawg
1410 2019-03-05 15:25 data/eng.freq-dawg
844 2019-03-05 15:25 data/eng.pffmtable
6360 2019-03-05 15:25 data/eng.lstm-unicharset
1012 2019-03-05 15:25 data/eng.lstm-recoder
1047 2019-03-05 15:25 data/eng.unicharambigs
4322 2019-03-05 15:25 data/eng.lstm-punc-dawg
16109842 2019-03-05 15:25 data/eng.bigram-dawg
80 2019-03-05 15:25 data/eng.version
6426 2019-03-05 15:25 data/eng.number-dawg
3694794 2019-03-05 15:25 data/eng.lstm-word-dawg
--------- -------
23468070 21 files
`combine_tessdata -d` and `combine_tessdata -u` also work.
The traineddata files in the new format can be generated with
standard tools like zip or tar.
More work is needed for other training tools and big endian support.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
tesstrain_utils.sh sets the shell flag -e, so it exits immediately
if a command exits with a non-zero status.
The following command returns a non-zero status as soon as counter is a
multiple of par_factor (par_factor=8, that means as soon as 8 fonts or
images are processed):
let rem=counter%par_factor
The new code fixes this undesired exit.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
pango_coverage_get and pango_coverage_unref should not be called
with coverage == nullptr.
pango_font_get_coverage should not be called with font == nullptr.
Otherwise Pango prints runtime error messages:
(process:12657): Pango-CRITICAL **: pango_coverage_get: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_coverage_unref: assertion 'coverage != NULL' failed
(process:12657): Pango-CRITICAL **: pango_font_get_coverage: assertion 'font != NULL' failed
(process:12657): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
Typically those errors occur if a required font is not installed,
so this can be a quite common error.
Fix also a potential resource leak in PangoFontInfo::CoversUTF8Text.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
gcc warning:
src/training/text2image.cpp:694:35: warning:
ISO C++ forbids converting a string constant to ‘char*’
[-Wwrite-strings]
putenv expects a string which can be modified.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.
I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.
Python 3.6+ is required. Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.
There are minor output and behavioral changes, and advantages. Python's loggingis used. Temporary files are only deleted on success, so they can be inspected
if training files. Console output is more terse and the log file is more
verbose. And there are progress bars! (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.
Argument checking is also more comprehensive.