Commit 94d0f77f56 tried to fix issue #2741
but created a new problem.
This commit should fix both the old and the new issue.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
If Tesseract cannot find text in the input image, it should not write
an empty lstmf file. This problem was reported in issue #2741.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
option --ptsize which defaults to 12. This option is not exposed through
tesstrain.sh; thus, you cannot use tesstrain.sh to explore training with
different font sizes. I made a small modification to expose the --ptsize
option to tesstrain.sh. It defaults to 12 if not specified.
Fix two occurrences of this LGTM warning:
Multiplication result may overflow 'double'
before it is converted to 'long double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes wrong output of integers with locale de_DE.UTF-8:
- /Width 2.481
- /Height 3.508
+ /Width 2481
+ /Height 3508
Signed-off-by: Stefan Weil <sw@weilnetz.de>
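The effect can be illustrated with a minimal sketch. It assumes that the
numbers are written through a C++ stream; the helper name FormatInteger is
hypothetical and not taken from the Tesseract sources. Imbuing the stream
with the classic "C" locale keeps a global locale such as de_DE.UTF-8 from
inserting a thousands separator.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: format an integer without locale-specific grouping.
std::string FormatInteger(int value) {
  std::ostringstream stream;
  // The classic "C" locale never inserts thousands separators,
  // whatever the global locale happens to be.
  stream.imbue(std::locale::classic());
  stream << value;
  assert(!stream.fail());
  return stream.str();
}
```

With this helper, FormatInteger(2481) gives "2481" regardless of the
locale the process runs in.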
This fixes wrong output of integers with locale de_DE.UTF-8:
- <Page WIDTH="2.481" HEIGHT="3.508" PHYSICAL_IMG_NR="0" ID="page_0">
+ <Page WIDTH="2481" HEIGHT="3508" PHYSICAL_IMG_NR="0" ID="page_0">
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The title can be set for hOCR and PDF output.
Currently it is also used for ALTO, so setting the title can be used
as a workaround for issue #2700.
The constant unknown_title_ is no longer needed and therefore removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The function derives the file name for the .box file from an image name.
For training from existing line images, it is useful to directly support
the image names which are commonly used.
While generated images for Tesseract training typically use the name
pattern NAME.tif, other ground truth sets use NAME.bin.png for binarized
or NAME.nrm.png for grayscale images.
BoxFileName is also now a local function as it is only used locally.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
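The name mapping described above could look roughly like the following
sketch. It is not the code from the commit: the names EndsWith and
BoxFileNameSketch are made up for illustration, std::string is used
instead of Tesseract's own string type, and directory names which contain
a dot are not handled.

```cpp
#include <cassert>
#include <string>

// Returns true if `text` ends with `suffix`.
static bool EndsWith(const std::string& text, const std::string& suffix) {
  return text.size() >= suffix.size() &&
         text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Sketch of deriving the .box file name from an image file name.
// NAME.tif, NAME.bin.png and NAME.nrm.png all map to NAME.box.
static std::string BoxFileNameSketch(const std::string& image_filename) {
  assert(!image_filename.empty());
  std::string base = image_filename;
  // Strip the last extension (.tif, .png, ...), if there is one.
  const std::string::size_type lastdot = base.rfind('.');
  if (lastdot != std::string::npos) {
    base.erase(lastdot);
  }
  // Strip the extra suffix used by some ground truth sets.
  for (const char* extra : {".bin", ".nrm"}) {
    if (EndsWith(base, extra)) {
      base.erase(base.size() - 4);
      break;
    }
  }
  return base + ".box";
}
```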
The configuration file lstm.train causes Tesseract to generate
training data for training of an LSTM line recognizer.
In this mode, no other files with OCR results should be written.
Without this patch, Tesseract writes a small text file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This allows OCR of images from the internet without downloading them first:
tesseract http://IMAGE_URL OUTPUT ...
It uses libcurl.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- Use C++ type casts
- Remove unneeded type cast
- Simplify code for function pop
- Remove macro push_on (it was only used once)
This fixes lots of compiler warnings caused by old type casts.
- Use C++ enums
- Use strongly typed C++11 enum for DIRECTION and optimize struct MFEDGEPT
- Use float constant for MF_SCALE_FACTOR
- Replace macros by inline functions
- Fix documentation comment
This fixes several warnings from clang.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a clang warning:
src/ccstruct/polyblk.cpp:412:12: warning: result of comparison of
unsigned enum expression >= 0 is always true
[-Wtautological-unsigned-enum-zero-compare]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Replace the macros which were declared in vecfuncs.h by member functions
and move a function which was only used in chop.cpp to that file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Removing STRING from genericvector.h allows eliminating the proprietary
STRING data type from the public Tesseract API.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- add another constructor for LSTMRecognizer
which takes the language_data_path_prefix configured/selected
at runtime and passes it to the internal CCUtil
- use this in Tesseract::init_tesseract_lang_data when LSTMs
are available
(this was missing from 297d7d86ce)
This fixes compiler warnings caused by
commit 091ce345f6:
src/wordrec/lm_state.h:100:7: warning: field 'cost'
will be initialized after field 'curr_b' [-Wreorder]
src/wordrec/lm_state.h:104:7: warning: field 'top_choice_flags'
will be initialized after field 'dawg_info' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
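For readers not familiar with -Wreorder, here is a small sketch of the
pattern. The struct name and the member types are hypothetical; only the
field names cost and curr_b are borrowed from the warning above. C++
always initializes members in declaration order, so the warning goes away
when the initializer list uses the same order.

```cpp
#include <cassert>

// Members are always initialized in declaration order,
// regardless of the order used in the constructor's initializer list.
struct StateSketch {
  int curr_b;
  float cost;

  // The initializer list follows the declaration order (curr_b, cost),
  // so -Wreorder has nothing to report.
  StateSketch(int b, float c) : curr_b(b), cost(c) {
    assert(curr_b == b);
  }
};
```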
This fixes compiler warnings caused by
commit 5b4565b80b:
src/textord/colpartition.cpp:91:24: warning: field 'last_column_'
will be initialized after field 'column_set_' [-Wreorder]
src/textord/colpartition.cpp:93:37: warning: field 'inside_table_column_'
will be initialized after field 'nearest_neighbor_above_' [-Wreorder]
src/textord/colpartition.cpp:95:58: warning: field 'space_to_right_'
will be initialized after field 'owns_blobs_' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings caused by
commit ecf0f2dee5:
src/dict/dawg.h:202:9: warning: field 'type_' will be initialized
after field 'lang_' [-Wreorder]
src/dict/dawg.h:355:9: warning: field 'dawg_index' will be initialized
after field 'dawg_ref' [-Wreorder]
src/dict/dawg.h:356:9: warning: field 'punc_index' will be initialized
after field 'punc_ref' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings caused by
commit 751fcd2b11:
src/classify/classify.cpp:176:7: warning:
field 'EnableLearning' will be initialized after
field 'il1_adaption_test' [-Wreorder]
src/classify/classify.cpp:187:7: warning:
field 'dict_' will be initialized after
field 'static_classifier_' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Only one of bIt, dIt, iIt and sIt is used, so put all four in a union.
This fixes CID 1164628, CID 1164629, CID 1164630 and CID 1164631.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Report from Coverity Scan:
CID 1405560 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member end is not initialized in
this constructor nor in any functions that it calls.
CID 1405561 [...]
Modernize and optimize class WERD_RES. This not only fixes the issues
but also reduces the size and eliminates the functions InitNonPointers
and InitPointers.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reduce size from 368 to 352 bytes for Trie, 72 to 64 bytes for Dawg
and 40 to 24 bytes for DawgPosition by avoiding holes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The class no longer uses bit fields. Re-ordering the member variables
avoids holes and reduces the size of BLOBNBOX from 168 to 152 bytes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
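The idea behind avoiding holes can be shown with two made-up structs
(they are not the Tesseract classes). On a typical 64 bit ABI the first
one needs padding after each bool, while the second one keeps the small
members together.

```cpp
#include <cassert>

// Hypothetical layout with holes: each bool is typically followed by
// 7 bytes of padding so that the double stays 8-byte aligned.
struct WithHoles {
  bool flag_a;
  double value;
  bool flag_b;
};

// Same members, largest first: the two bools share one padded slot.
struct Reordered {
  double value;
  bool flag_a;
  bool flag_b;
};

// The exact sizes depend on the ABI, but re-ordering should never
// make the struct larger.
static_assert(sizeof(Reordered) <= sizeof(WithHoles),
              "re-ordering must not increase the size");
```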
Fix this runtime error in recodebeam_test and unicharcompress_test:
src/ccutil/unicharcompress.h:84:27: runtime error:
left shift of 267 by 28 places cannot be represented in type 'int'
code has up to kMaxCodeLen (9) values, so the highest possible value for
i is 8, and the shift value can reach 7 * 8 = 56.
That requires a uint64_t data type.
size_t would fit for 64 bit hosts, but be too small for 32 bit hosts.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
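A minimal sketch of the widened shift, assuming non-negative code values.
The function name HashCodes and the use of std::vector are illustrative
only; kMaxCodeLenSketch stands in for kMaxCodeLen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for kMaxCodeLen from the commit message.
constexpr std::size_t kMaxCodeLenSketch = 9;

// Combines up to nine code values into one hash value.
// Each value is widened to uint64_t before it is shifted, so a shift
// by up to 7 * 8 = 56 bits is well defined.
uint64_t HashCodes(const std::vector<int>& code) {
  assert(code.size() <= kMaxCodeLenSketch);
  uint64_t result = 0;
  for (std::size_t i = 0; i < code.size(); ++i) {
    result ^= static_cast<uint64_t>(code[i]) << (7 * i);
  }
  return result;
}
```

With a plain int on the left side, the value 267 shifted by 28 places
would not fit, which is exactly what the sanitizer reported.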
Fix this runtime error in osd_test and textlineprojection_test:
src/ccmain/osdetect.cpp:109:14: runtime error: division by zero
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Fix these runtime errors in mastertrainer_test:
src/ccutil/bitvector.cpp:119:18: runtime error:
null pointer passed as argument 2, which is declared to never be null
src/ccutil/bitvector.cpp:124:10: runtime error:
null pointer passed as argument 1, which is declared to never be null
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes three LGTM warnings:
Multiplication result may overflow 'float' before it is converted to 'double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
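The usual way to address this kind of warning is to widen one operand
before the multiplication. The following sketch uses a made-up function
name and is not the code which was changed in this commit.

```cpp
#include <cassert>

// Multiplies two float values without intermediate float overflow.
double ProductAsDouble(float a, float b) {
  // Convert one operand first so that the multiplication itself
  // is done in double precision.
  return static_cast<double>(a) * b;
}
```

Without the cast, the product would be calculated as a float and only
the (possibly overflowed) result would be converted to double.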
They are moved from src/classify and src/lstm to src/training.
This reduces the size of the Tesseract library.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is only used in unittest/layout_test.cc after moving a test from
baseapi_test.cc to that file, so it can be made local.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The method was only used in unittest where it can be replaced by
UNICHARSET::load_from_file which also simplifies the code.
This allows removing the class InMemoryFilePointer and fixes a TODO.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The MS compiler only accepts string constants up to 65535 characters,
so shorten the string for that compiler to fix the compilation.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This converts special characters like '<' or '>' to the
correct HTML entities.
Also optimize the code a little bit.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
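As an illustration, an escaping function could look like the sketch
below. The function name is hypothetical, and the set of characters
beyond '<' and '>' is an assumption, as the commit message only names
those two.

```cpp
#include <cassert>
#include <string>

// Replaces characters with a special meaning in HTML by entities.
std::string EscapeHtmlSketch(const std::string& text) {
  std::string escaped;
  escaped.reserve(text.size());
  for (char ch : text) {
    switch (ch) {
      case '<': escaped += "&lt;"; break;
      case '>': escaped += "&gt;"; break;
      case '&': escaped += "&amp;"; break;
      case '"': escaped += "&quot;"; break;
      case '\'': escaped += "&#39;"; break;
      default: escaped += ch;
    }
  }
  // Escaping can only keep or increase the length.
  assert(escaped.size() >= text.size());
  return escaped;
}
```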
The vector was already limited to MAX_NUM_PROTOS (512) entries or 64 bytes
in the old code. Now it uses that size right from the start which avoids
reallocating it later when entries are added.
The old code which reallocated the vector to expand it was buggy because
the realloc function can return a different pointer, but the code still
used the original pointer to reset the new bits.
Function ExpandBitVector is now unused and therefore removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
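The realloc pitfall mentioned above can be shown with a short sketch.
It is not the removed ExpandBitVector function; the name
ExpandWordsSketch and its signature are made up. The important point is
that the new words are reset through the pointer returned by realloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Grows a word array from old_words to new_words entries and zeroes
// the added part. Returns the new pointer, or nullptr if the allocation
// failed (the old array is then still valid and owned by the caller).
uint32_t* ExpandWordsSketch(uint32_t* words, std::size_t old_words,
                            std::size_t new_words) {
  assert(new_words > 0 && new_words >= old_words);
  void* grown = std::realloc(words, new_words * sizeof(uint32_t));
  if (grown == nullptr) {
    return nullptr;
  }
  uint32_t* result = static_cast<uint32_t*>(grown);
  // Use the returned pointer here: the block may have been moved,
  // in which case `words` no longer points to valid memory.
  std::memset(result + old_words, 0,
              (new_words - old_words) * sizeof(uint32_t));
  return result;
}
```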
tesseract::FileReader and tesseract::FileWriter are already declared
in serialis.h which is included by genericvector.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That case now uses Leptonica to deliver the desired image instead of
using an inefficient loop in the Tesseract code.
See commit 54fafc4e2e which used similar
code in the past.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This reverts commit 75d230a7ac.
That commit introduced new problems (memory leak, potential endless loop)
and style issues.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The new code avoids dynamic memory allocation, uses faster function calls
and allows removing more code from tesscallback.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The function pointers and callbacks file_reader_, file_writer_,
checkpointer_reader_ and checkpoint_writer_ are always set to
the same values. Replacing them by direct function calls
simplifies the code and allows removing more code from tesscallback.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It needs neither a temporary TessResultCallback2 nor the function
LMPainPoints::GenerateForBlamer.
This also allows removing more code from tesscallback.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
C++17 drops support for `std::random_shuffle`, breaking C++17 compilers
that try to compile text2image.cpp. std::shuffle is valid on C++11
through C++17, so use std::shuffle instead.
Due to the use of `std::random_shuffle`, `text2image --render_ngrams`
would not give consistent results for different compilers or platforms.
With the current change, the same random number generator is used for
all platforms and initialized to the same seed, so training output
should be consistent.
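A minimal sketch of the replacement, assuming std::mt19937 as the random
number generator and a placeholder seed of 42 (neither is taken from the
commit). The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Shuffles a copy of `items` with an explicitly seeded generator.
// The seed value is a placeholder chosen for this sketch.
std::vector<int> ShuffleWithFixedSeed(std::vector<int> items) {
  const std::size_t size_before = items.size();
  std::mt19937 generator(42);
  std::shuffle(items.begin(), items.end(), generator);
  assert(items.size() == size_before);
  return items;
}
```

Because the generator is created and seeded inside the function, repeated
calls with the same input return the same order.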
This fixes compiler warnings from clang++ like these ones:
src/ccutil/params.cpp:34:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:67:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:68:9: warning: macro is not used [-Wunused-macros]
src/cutil/oldlist.cpp:78:9: warning: macro is not used [-Wunused-macros]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That fixes several warnings from clang++ like the following one:
src/training/combine_lang_model.cpp:36:1: warning: no previous extern declaration for non-static variable 'FLAGS_lang_is_rtl' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That fixes several warnings from clang++ like the following one:
src/training/commontraining.cpp:95:1: warning: no previous extern declaration for non-static variable 'FLAGS_D' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes lots of compiler warnings like these ones:
src/api/baseapi.cpp:113:13: warning: no previous extern declaration for non-static variable 'kInputFile' [-Wmissing-variable-declarations]
src/api/baseapi.cpp:117:13: warning: no previous extern declaration for non-static variable 'kOldVarsFile' [-Wmissing-variable-declarations]
src/api/baseapi.cpp:97:10: warning: no previous extern declaration for non-static variable 'stream_filelist' [-Wmissing-variable-declarations]
src/ccmain/equationdetect.cpp:46:10: warning: no previous extern declaration for non-static variable 'equationdetect_save_bi_image' [-Wmissing-variable-declarations]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This also fixes warnings like the following one from clang++:
src/ccmain/pgedit.cpp:114:15: warning: declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This also fixes some warnings from clang++:
src/classify/featdefs.cpp:47:15: warning: declaration requires a global constructor [-Wglobal-constructors]
src/classify/featdefs.cpp:57:15: warning: declaration requires a global constructor [-Wglobal-constructors]
src/classify/featdefs.cpp:66:15: warning: declaration requires a global constructor [-Wglobal-constructors]
src/classify/featdefs.cpp:75:15: warning: declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This looks for one of the header files which are included by Tesseract.
It currently uses a hard-coded path which works for Debian / Ubuntu.
Also simplify the rules for linking Tensorflow.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It expects include files in /usr/include/tensorflow.
* Add configure option --with-tensorflow (disabled by default)
* Fix data type tensorflow::int64
* Remove "third_party/" in include statements
* Add dummy implementations for Backward and DebugWeights in TFNetwork
* Add files generated with protoc from tfnetwork.proto
(so the Tensorflow sources are not needed for the build)
* Update Makefiles
Signed-off-by: Stefan Weil <sw@weilnetz.de>
sqrt(0.5) = 1 / sqrt(2) can be replaced by the macro M_SQRT1_2.
This also fixes a compiler warning:
src/lstm/lstmtrainer.cpp:51:14: warning: declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That debugging code uses a lot of memory and is no longer useful.
   text    data     bss     dec     hex filename
    815       0  262144  262959   4032f src/ccutil/globaloc.o
Also remove the function err_exit which was only used in ccmain/reject.cpp.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reduce the maximum message size from 64 KiB to 2 KiB which should still
be large enough for trace messages.
Create the smaller message on the stack instead of using a global
array to allow reentrancy and to reduce the memory use of Tesseract.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
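The stack based approach can be sketched as follows. The names
kMaxMessageSize and FormatMessageSketch are hypothetical, and the real
function prints the message instead of returning it.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

constexpr int kMaxMessageSize = 2048;  // 2 KiB, as described above

// Formats a message into a local buffer. Each call has its own buffer,
// so concurrent callers cannot overwrite each other's text.
std::string FormatMessageSketch(const char* format, ...) {
  assert(format != nullptr);
  char message[kMaxMessageSize];
  va_list args;
  va_start(args, format);
  // vsnprintf never writes more than sizeof(message) bytes and
  // always terminates the (possibly truncated) result.
  std::vsnprintf(message, sizeof(message), format, args);
  va_end(args);
  return message;
}
```

Longer messages are truncated to fit into the buffer.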
It is defined for all platforms when math.h or cmath is included
after defining the macro _USE_MATH_DEFINES.
Define _USE_MATH_DEFINES before any include statement to make sure
that M_PI gets defined. It is not necessary to define it conditionally
only for Windows.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
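A short sketch of the include order. The constant and function names are
made up; M_PI and M_SQRT1_2 are not part of standard C++, but are
provided by the common C libraries (for MSVC only with the macro).

```cpp
// Must come before the first include of <cmath> or <math.h>.
#define _USE_MATH_DEFINES
#include <cassert>
#include <cmath>

// 1 / sqrt(2) as a compile-time constant instead of a runtime sqrt call.
constexpr double kSqrtHalf = M_SQRT1_2;

// Converts an angle from degrees to radians.
double DegreesToRadians(double degrees) {
  assert(std::isfinite(degrees));
  return degrees * M_PI / 180.0;
}
```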
This fixes lots of warnings related to ERRCODE like the following one:
src/ccutil/errcode.h:81:15: warning:
declaration requires a global constructor [-Wglobal-constructors]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The function did not correctly read Chinese unichars into the local
Class variable if the locale was set to de_DE.UTF-8 (or other
incompatible locales). That resulted in a wrong ClassId which was
used to write into the Cutoffs array without checking for valid bounds.
On macOS the result was a runtime error in baseapi_test (see GitHub
issue #1250):
[ RUN ] TesseractTest.InitConfigOnlyTest
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
Replacing sscanf by std::istringstream fixes that.
Also add an assertion to catch future out-of-bounds writes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
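The stream based parsing can be sketched like this. The line format
"<class> <cutoff>" and the function name are assumptions made for the
example; the stream is additionally imbued with the classic locale so
that the result does not depend on the global locale.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Parses one "<class> <cutoff>" line.
// Returns false if the line is malformed.
bool ParseCutoffLineSketch(const std::string& line, std::string* label,
                           int* cutoff) {
  assert(label != nullptr && cutoff != nullptr);
  std::istringstream stream(line);
  stream.imbue(std::locale::classic());
  // The label is read as a sequence of bytes up to the next white space,
  // so multi-byte UTF-8 characters are passed through unchanged.
  stream >> *label >> *cutoff;
  return !stream.fail();
}
```

A caller should still check that the resulting class id is in range
before using it as an array index.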
The latest code passed all unittests with locale de_DE.UTF-8
and has fixed the locale issues which were reported on GitHub.
Therefore the assertions can be removed.
Any remaining locale issue will be fixed when it is identified.
To help find such remaining issues, debug code now uses the
user's locale settings instead of the default "C" locale for all
executables which use TessBaseAPI.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
That function writes float values which must always use '.' as the
decimal separator, no matter what the current locale setting is.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The unittest failed with LANG=de_DE.UTF-8:
$ unittest/baseapi_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 12 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 10 tests from TesseractTest
[ RUN ] TesseractTest.ArraySizeTest
[ OK ] TesseractTest.ArraySizeTest (0 ms)
[ RUN ] TesseractTest.BasicTesseractTest
[ OK ] TesseractTest.BasicTesseractTest (1251 ms)
[ RUN ] TesseractTest.IteratesParagraphsEvenIfNotDetected
[ OK ] TesseractTest.IteratesParagraphsEvenIfNotDetected (347 ms)
[ RUN ] TesseractTest.HOCRWorksWithoutSetInputName
[ OK ] TesseractTest.HOCRWorksWithoutSetInputName (403 ms)
[ RUN ] TesseractTest.HOCRContainsBaseline
[ OK ] TesseractTest.HOCRContainsBaseline (389 ms)
[ RUN ] TesseractTest.RickSnyderNotFuckSnyder
[ OK ] TesseractTest.RickSnyderNotFuckSnyder (346 ms)
[ RUN ] TesseractTest.AdaptToWordStrTest
Trying to adapt "136
" to "1 3 6"
Trying to adapt "256
" to "2 5 6"
Trying to adapt "410
" to "4 1 0"
Trying to adapt "432
" to "4 3 2"
Trying to adapt "540
" to "5 4 0"
Trying to adapt "692
" to "6 9 2"
Trying to adapt "779
" to "7 7 9"
Trying to adapt "793
" to "7 9 3"
Trying to adapt "808
" to "8 0 8"
Trying to adapt "815
" to "8 1 5"
Trying to adapt "12
" to "1 2"
Trying to adapt "12
" to "1 2"
[ OK ] TesseractTest.AdaptToWordStrTest (788 ms)
[ RUN ] TesseractTest.BasicLSTMTest
[ OK ] TesseractTest.BasicLSTMTest (4525 ms)
[ RUN ] TesseractTest.LSTMGeometryTest
[ OK ] TesseractTest.LSTMGeometryTest (615 ms)
[ RUN ] TesseractTest.InitConfigOnlyTest
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.232621 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.231864 in normproto file is not in unichar set.
[...]
Error: unichar ? in normproto file is not in unichar set.
Error: unichar 0.233915 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar 0.221755 in normproto file is not in unichar set.
Error: unichar 0.000400 in normproto file is not in unichar set.
Error: unichar ? in normproto file is not in unichar set.
baseapi_test(21845,0x1134c45c0) malloc: *** error for object 0x927f96c28005e0: pointer being freed was not allocated
baseapi_test(21845,0x1134c45c0) malloc: *** set a breakpoint in malloc_error_break to debug
[INFO] Lang eng took 327ms in regular init
[INFO] Lang chi_tra took 1422ms in regular init
Abort trap: 6
TesseractTest.InitConfigOnlyTest is fixed by using std::istringstream
instead of sscanf.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The unittest failed with LANG=de_DE.UTF-8:
$ unittest/apiexample_test
Running main() from ../../../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from EuroText
[ RUN ] EuroText.FastLatinOCR
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../../src/ccutil/unicharset.h, line 874
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Void functions should not use @return. It causes compiler warnings
like this one:
src/classify/intproto.cpp:326:5: warning:
'@return' command used in a comment that is attached to a function
returning void [-Wdocumentation]
Some non-void functions were also documented with @return none.
Fix those comments, too.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
PrintNormMatch was unused. Remove it and remove also an unused prototype.
Make the only remaining private function NormEvidenceOf static.
Signed-off-by: Stefan Weil <sw@weilnetz.de>