Commit Graph

1817 Commits

Author SHA1 Message Date
Stefan Weil
5db92b26aa Replace remaining GenericVector by std::vector for src/textord
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-16 16:59:12 +01:00
Stefan Weil
1f94d79c81 Replace remaining GenericVector by std::vector for src/ccmain
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-16 16:55:38 +01:00
Stefan Brechtken
d856acba56 Change License to Apache V2, add new file to Makefile.am, change file name to .h ending 2021-03-16 14:16:02 +01:00
Stefan Weil
bf42f8313d Replace remaining GenericVector by std::vector for src/dict
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-16 12:25:11 +01:00
Stefan Weil
17eee8648f Replace more GenericVector by std::vector
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-16 12:25:11 +01:00
Stefan Weil
2a3682a35e Replace remaining GenericVector by std::vector in src/lstm
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-16 12:25:11 +01:00
Stefan Brechtken
e10d19b084 updating function documentation and removing unnecessary include 2021-03-15 17:25:10 +01:00
Stefan Brechtken
594a000ecd merging with tesseract master in order to create a pull request 2021-03-15 17:02:19 +01:00
Stefan Weil
e51fcb2d31 Remove last usage of STRING
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
57920174dc Remove unused parts of class STRING
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
576c09bf31 Replace remaining STRING by std::string in unittest
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
0edd69eb10 Replace remaining STRING by std::string in src/training
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
d16fba9bed Replace all but one remaining STRING by std::string in src/ccstruct
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
21cf7cf84e Replace remaining STRING by std::string in src/dict
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
21d9aad594 Replace remaining STRING by std::string in src/viewer and src/wordrec
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
e0ce040832 Replace remaining STRING by std::string in src/classify
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Stefan Weil
db9f963411 Replace remaining STRING by std::string in src/ccmain
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-15 09:11:41 +01:00
Egor Pugin
d7823a71c2 Remove unused file. 2021-03-15 09:47:04 +03:00
Egor Pugin
efd17e205a Replace typedef structs with structs.
typedef enums are left intact.
2021-03-15 09:47:04 +03:00
Egor Pugin
262f65a4d2
snprintf will add '\0' at the end itself. 2021-03-14 23:54:29 +03:00
Egor Pugin
26ceeef6c0 [training] Modernize. 2021-03-14 23:47:42 +03:00
Shree Devi Kumar
efe9ff611f Limit unicharset from training_text only to Indic languages 2021-03-14 17:58:57 +00:00
Shree Devi Kumar
a589ded25f Create unicharset from training text to avoid normalization errors 2021-03-14 16:39:00 +00:00
Egor Pugin
f06b2c7c8d [capi] Restore some of wrongly removed apis.
Removed C++ APIs are not restored.
Additionally remove unused C++ typedefs which were in removed C++ functions.
If you still need them, use C++ API instead.
2021-03-14 17:20:52 +03:00
Egor Pugin
dabdaa1def Misc. 2021-03-14 17:14:41 +03:00
Stefan Weil
7178ebd799 Add missing TESS_API for new function tesseract::split
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-14 08:16:33 +01:00
Stefan Weil
36f9131e04 Move implementation of tesseract::split from header to cpp file
This fixes duplicate symbols for some builds.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 23:39:58 +01:00
Stefan Weil
3b0759940c Replace more STRING by std::string
Remove STRING::add_str_int and STRING::add_str_double which are now unused.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 23:16:35 +01:00
Stefan Weil
c9f0da49ca Replace more STRING by std::string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:52 +01:00
Stefan Weil
91f7675848 Replace more STRING by std::string for src/ccmain
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:52 +01:00
Stefan Weil
d084c7cca8 Replace remaining STRING by std::string for src/api
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:52 +01:00
Stefan Weil
96d1644da1 Replace more STRING by std::string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:52 +01:00
Stefan Weil
a42c6c7dcd Replace more STRING by std::string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:52 +01:00
Stefan Weil
9cf5b9870d Replace more STRING by std::string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:52 +01:00
Stefan Weil
51909d5a2e Replace more STRING by std::string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:52 +01:00
Stefan Weil
d6495d9026 Replace STRING by std::string in src/lstm
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 21:15:51 +01:00
Stefan Weil
1f2ec4dfb1 Fix network specification for NT_SYMCLIP
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-13 13:10:37 +01:00
Stefan Weil
6bf5080d4c Remove unused include statements for strngs.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-12 23:11:08 +01:00
Egor Pugin
a393df5038 Add missing export header. 2021-03-13 00:07:19 +03:00
Egor Pugin
2d10be5209 [clang-format] Format generated protobuf source. 2021-03-13 00:07:03 +03:00
Egor Pugin
618b185d14 Include missing config_auto.h 2021-03-12 23:39:18 +03:00
Egor Pugin
8b0c5405e2 Add missing forward decl. 2021-03-12 22:35:30 +03:00
Egor Pugin
0eb7ba88bf [clang-format] Execute clang format on include and src dirs.
Script:
find include src -type f | sort > all.txt
find include src -type f | grep -v "\.cpp" | grep -v "\.h" | sort > skip.txt
comm -23 all.txt skip.txt | xargs clang-format -i
2021-03-12 22:35:02 +03:00
Stefan Weil
4c6cc5a04d Replace GenericVector by std::vector in class ImageData
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-12 13:10:25 +01:00
Ger Hobbelt
779aa79350
Fix build (#3322)
* fix errors after merge commit: missing changes that are needed too to make this codebase compile.
* Update src/wordrec/wordrec.h

Co-authored-by: Stefan Weil <sw@weilnetz.de>
2021-03-11 21:43:07 +01:00
Egor Pugin
3444618075 Fix linux build. 2021-03-10 15:35:13 +03:00
Egor Pugin
ce058604ba Pass empty strings into Tesseract::init_tesseract(). 2021-03-10 15:21:03 +03:00
Egor Pugin
911dd93f12 Pass init strings as std::string instead of const char * internally. This does not affect public APIs. 2021-03-10 15:17:00 +03:00
Egor Pugin
9792f3c4ff Remove STRING::size() method. 2021-03-10 14:58:37 +03:00
Egor Pugin
6de97309a1 Remove unused STRING::strdup(). 2021-03-10 14:42:50 +03:00
Egor Pugin
f0e30a2af2 Remove unused STRING::unsigned_size(). 2021-03-10 14:41:31 +03:00
Egor Pugin
d36adf3d40 Replace STRING::truncate_at() with resize(). 2021-03-10 14:40:28 +03:00
Egor Pugin
e9a2fc0083 More std::string replacements. 2021-03-10 14:36:59 +03:00
Stefan Weil
0f1296c6f6 Clean implementation for (de-)serialization of a vector
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-08 13:33:48 +01:00
Stefan Weil
6cfe604d58 Fix serialization for vector of RecodedCharID
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-07 23:01:25 +01:00
Stefan Weil
0cde3ede98 Add heuristic to fix swap (partially fixes issue #2586)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-05 14:27:28 +01:00
Stefan Weil
a2769aebb4 Replace GenericVector<TBOX> by std::vector<TBOX>
Fix also endianness handling for (de)serialisation of TBOX.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-05 14:27:28 +01:00
Stefan Weil
c31c1a7d60 Fix two compiler warnings for serialis.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-05 14:27:28 +01:00
Stefan Weil
fe614c6069 Enable fewer FP exceptions for clang compiler when running tesseract
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-03 22:56:07 +01:00
Egor Pugin
c39b1daa6b GenericVector -> std::vector. 2021-03-03 22:22:00 +03:00
Egor Pugin
0a693a9519 Allow to serialize std vectors with classes from TFile. Implementation from GenericVector. 2021-03-03 22:21:40 +03:00
Stefan Weil
ff830775f9 Fix memory leak in DocumentCache
It was introduced in commit 5cac52173e.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-03-01 11:31:48 +01:00
Stefan Weil
339c01894e Avoid fp division by 0 (fix issue #3314)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-28 19:42:01 +01:00
Stefan Weil
cd60728e8a Avoid float division by zero when calculating adaptive learning rate
The following line results in a division by zero when
momentum is -1 and num_samples is even:

     learning_rate /= 1.0f - pow(momentum, num_samples);

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-27 21:08:41 +01:00
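The failure case described in cd60728e8a can be reproduced in isolation: with momentum equal to -1 and an even num_samples, pow(momentum, num_samples) is exactly 1, so the denominator becomes 0. The sketch below is only an illustration of one possible guard. The helper name CorrectedLearningRate and the choice to leave the rate unchanged are assumptions, not the code from the commit.

```cpp
#include <cmath>

// Hypothetical helper (not Tesseract's actual code): applies the correction
//   learning_rate /= 1.0f - pow(momentum, num_samples)
// but skips the division when the denominator is exactly zero.
float CorrectedLearningRate(float learning_rate, float momentum, int num_samples) {
  const float denominator =
      1.0f - std::pow(momentum, static_cast<float>(num_samples));
  if (denominator == 0.0f) {
    // Assumed fallback: keep the uncorrected rate instead of dividing by zero.
    return learning_rate;
  }
  return learning_rate / denominator;
}
```

For example, CorrectedLearningRate(0.001f, -1.0f, 4) hits the guarded branch, while momentum 0.5 with one sample divides by 0.5 as usual.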
Stefan Weil
c12dde2862 Use float instead of double for learning_rate, momentum and adam_beta
Only WeightMatrix::Update used double parameters, all other functions
already used float. So this change avoids unnecessary conversions.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-27 21:08:41 +01:00
Stefan Weil
422452b9f4 Check for float errors when running tesseract and lstmtraining
Some illegal floating point calculations like division by zero,
illegal value or overflow will now abort tesseract with an error
message.

For lstmtraining there is now a new parameter --debug_float to
enable the same kind of checks. It is currently disabled by default
because such errors occur and would abort the training process.
That should be fixed in the future.

If tesseract also shows floating point errors which cannot be
fixed easily, a similar parameter to enable the checks can be
added there, too.

The new code requires the function feenableexcept which is only
available with the GNU libc, so it is only used on Linux.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 21:49:27 +01:00
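Commit 422452b9f4 relies on feenableexcept, a GNU libc extension that turns floating point exception flags into traps. As a portable illustration of the same kind of check, the sketch below swaps in the standard <cfenv> flag functions and only tests the flag after the fact instead of aborting. The function name DivisionRaisesDivByZero is hypothetical, and this is not the code added by the commit.

```cpp
#include <cfenv>

// Hypothetical helper: reports whether numerator / denominator raises the
// FE_DIVBYZERO flag. Uses only standard <cfenv> calls; the glibc-only
// feenableexcept mentioned in the commit would trap instead.
bool DivisionRaisesDivByZero(float numerator, float denominator) {
  std::feclearexcept(FE_ALL_EXCEPT);
  // volatile keeps the compiler from folding the division at compile time.
  volatile float n = numerator;
  volatile float d = denominator;
  volatile float result = n / d;
  (void)result;
  return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Polling flags like this leaves the program running, whereas trapping stops at the first offending operation, which matches the abort behaviour described in the commit message.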
Stefan Weil
51a214a51b Remove unused include statements for imagedata.h and document used ones
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 21:42:28 +01:00
Stefan Weil
1d7a981203 Disable code for unused classes WordFeature and FloatWordFeature
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 21:42:17 +01:00
Stefan Weil
5cac52173e Replace PointerVector by std::vector in class DocumentCache
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 21:42:07 +01:00
Stefan Weil
387acd9881 Initialize weight matrix with 0.0 (fix issue #3229)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 18:49:39 +01:00
Egor Pugin
1ab6b0fbc6
Merge pull request #3311 from stweil/master
Replace calls of exit function
2021-02-26 17:43:53 +03:00
Stefan Weil
58304cbfdd Don't compile OpenCL code when OpenCL is disabled
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 15:40:23 +01:00
Stefan Weil
a6946c3bf9 Replace calls of exit function
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 14:22:36 +01:00
Stefan Weil
373a3527ec Format code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 14:22:09 +01:00
Stefan Weil
ea446b1eae Remove blanks at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-26 14:05:36 +01:00
Stefan Weil
394c56ab15 Replace GenericVector by std::vector in class WERD_CHOICE
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 23:14:25 +01:00
Stefan Weil
fccecb2d23 Replace GenericVector by std::vector in class ResultIterator
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 21:07:57 +01:00
Stefan Weil
2257028052 Replace GenericVector by std::vector in reject.cpp
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 21:06:59 +01:00
Stefan Weil
d62f27dd8f Replace GenericVector by std::vector in stepblob.cpp
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 20:47:06 +01:00
Stefan Weil
3e5b2760ab Replace GenericVector by std::vector for struct BlamerBundle
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 20:34:41 +01:00
Stefan Weil
0b8e937655 Use countof to get number of array elements
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 20:20:48 +01:00
Stefan Weil
7097dfd41c Replace GenericVector by std::vector for parameters
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 20:20:48 +01:00
Stefan Weil
f2d2695ce9 Replace STRING and clean declarations of local variables in eval_word_spacing
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 20:20:48 +01:00
Stefan Weil
5277443833 Replace more STRING
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-23 20:20:48 +01:00
Stefan Weil
ae00f291f6 Remove unused include statements
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-22 22:28:47 +01:00
Stefan Weil
65053890d7 Handle file list without terminating LF (fix issue #3298)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-13 11:44:47 +01:00
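The problem behind 65053890d7 is a list file whose last entry is not followed by a line feed. A minimal sketch of line splitting that treats the final unterminated line like any other is shown below. SplitFileList is a hypothetical name and this is not the function changed by the commit.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits the contents of a list file into file names.
// std::getline returns the last line even if it has no terminating '\n'.
std::vector<std::string> SplitFileList(const std::string &contents) {
  std::vector<std::string> names;
  std::istringstream stream(contents);
  std::string line;
  while (std::getline(stream, line)) {
    if (!line.empty()) {
      names.push_back(line);
    }
  }
  return names;
}
```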
Stefan Weil
bc69e28de3 Update include statements for external header file allheaders.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-13 10:17:20 +01:00
Stefan Weil
e6f15621c2 Remove Python training scripts which were moved to tesstrain
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-02-04 14:45:19 +01:00
Shree Devi Kumar
40f3c8d104 Change LATIN_FONTS to use replacement fonts from TeX Gyre collection 2021-02-04 13:51:03 +01:00
Stefan Weil
4902e68682 cmake: Use pkg_config to find required libraries
This is needed for cmake builds on MacOS (Intel and Amd64) with Homebrew.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-31 17:23:06 +01:00
Stefan Weil
e999f421bc Replace GenericVector<float> by std::vector<float> for class SimpleStats
This also fixes a runtime error:

    src/ccutil/genericvector.h:228:11: runtime error:
      null pointer passed as argument 1, which is declared to never be null

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-26 14:29:07 +01:00
Stefan Weil
4b84a56d8d Replace STRING by std::string for function read_unlv_file
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-23 17:46:12 +01:00
Stefan Weil
139d127ff7 Remove unneeded include statement for genericvector.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-23 17:29:57 +01:00
Stefan Weil
71fb535427 Remove unneeded include statement for strngs.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-23 17:29:57 +01:00
Stefan Weil
44fd1c4986 Wordrec: Modernize code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-23 15:53:55 +01:00
Stefan Weil
5a3d6e5e0d Fix memory leak in mastertrainer_test (fixes issue #3215)
The issue was introduced in commit 6e9456415.

Partially reverting this commit fixes it.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-23 14:54:38 +01:00
Stefan Weil
e3fd938bca lstmtrainer: Modernize code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-22 08:17:19 +01:00
Stefan Weil
0cdaab5ac9 lstmtrainer: Remove unused local variable
This fixes a compiler warning:
    src/training/unicharset/lstmtrainer.cpp:107:15: warning:
      unused variable 'shape' [-Wunused-variable]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-22 08:13:38 +01:00
Stefan Weil
3d47e0a91a Replace GenericVector by std::vector in LoadFileLinesToStrings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-22 08:13:38 +01:00
Stefan Weil
5d44a8216f Show names of failing lstmf files in error messages
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-20 13:36:59 +01:00
Stefan Weil
c7baf8f17d Add more information shown by combine_tessdata -l
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-15 18:49:51 +01:00
Stefan Weil
3195c8f75f Add new option -l for combine_tessdata to list the network string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-15 18:49:51 +01:00
Stefan Weil
970eba79e6 Replace STRING by std::string for LSTMRecognizer::network_str_
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-15 18:49:51 +01:00
Stefan Weil
97cfd95872 Replace STRING by char* in LSTMRecognizer
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-15 18:49:51 +01:00
Stefan Weil
73ffcabfe9 lstmtraining: Interpret negative value for --max_iterations as epochs
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-14 19:51:58 +01:00
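One way to read 73ffcabfe9 is that a negative --max_iterations value is converted into an iteration count by multiplying its magnitude by the number of training samples per epoch. The sketch below shows that reading only. The helper name, the parameter samples_per_epoch and the exact conversion are assumptions, not the lstmtraining source.

```cpp
// Hypothetical helper: resolves a --max_iterations style value.
// Assumption: a negative value -N means "N epochs", i.e. N passes over
// samples_per_epoch training samples; non-negative values are used as given.
long ResolveMaxIterations(long max_iterations, long samples_per_epoch) {
  if (max_iterations < 0) {
    return -max_iterations * samples_per_epoch;
  }
  return max_iterations;
}
```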
Stefan Weil
40bdcd2941 Add TESS_API to instantiation of template functions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-14 18:07:35 +01:00
Stefan Weil
80810218f7 Use explicit int32_t for serialized data type
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-14 18:06:39 +01:00
Stefan Weil
05da41dc60 Replace GenericVector<BlobData> by std::vector<BlobData>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-14 17:23:13 +01:00
Robert Pösel
7dcd9b5095 Remove ANDROID_BUILD macro
Build fails when ANDROID_BUILD is defined, because it removes parts of the LSTM engine, but there are still some unguarded references. But removing LSTM engine is not needed as it works perfectly fine on Android.

This macro doesn't provide any benefit anymore and is not even used in current build config. If needed, ANDROID macro should be used instead (which is already used on few places).
2021-01-14 14:31:34 +01:00
Stefan Weil
08f2ba02f7 Fix memory allocation in TFile::DeSerialize(std::vector<T>& data)
lstmtraining crashed when creating traineddata files:

    Error: attempt to subscript container with out-of-bounds index 0, but
    container only holds 0 elements.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-14 12:11:02 +01:00
Stefan Weil
5e661b9339 Don't use local CP_RESULT_STRUCT variable to initialize elements of std::vector
std::vector passes that local variable by reference, so no individual
instances are used for the new vector elements.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-13 15:57:04 +01:00
Stefan Weil
b0e46085f4 Fix serialization of std::vector (fix issue #3220)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-12 21:23:14 +01:00
Stefan Weil
9b15e65900 Replace resize(0) by clear() for std::vector
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-12 19:24:54 +01:00
Shree Devi Kumar
5104af6a15 Remove --psm 6 for lstm.train in tesstrain.py 2021-01-12 13:26:33 +01:00
Shree Devi Kumar
106b3d1ed0 No --psm 6 for lstm.train 2021-01-12 12:42:53 +01:00
Robert Pösel
ca9c7ba303 Fix NEON also tesseractmain.cpp 2021-01-11 12:17:25 +01:00
Robert Pösel
1954ee3867 Fix use of NEON on ARMv8
Flag neon_available_ is automatically set to true when __aarch64__ is defined,
but the actual check for neon_available_ required having also HAVE_NEON defined.

Now we check the flag also when only __aarch64__ is defined.
2021-01-11 12:17:16 +01:00
Stefan Weil
021237ad2c Add assertion for IntCastRounded
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-10 15:08:31 +01:00
Stefan Weil
209c1df599 Fix some format strings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-08 18:49:21 +01:00
Egor Pugin
8cb1c62259 More std::vector. 2021-01-07 15:13:59 +03:00
Egor Pugin
8d6cad1acc Misc. 2021-01-07 14:33:45 +03:00
Egor Pugin
4f5bd1c562 Move unicodes into files where they are used. 2021-01-07 14:33:02 +03:00
Egor Pugin
8aa5492262 Misc. 2021-01-07 14:14:40 +03:00
Egor Pugin
9cc7bdeaa6 Use std::bitset<16> instead of custom BITS16. 2021-01-07 14:14:27 +03:00
Egor Pugin
9710bc0465 More std::vector. 2021-01-07 13:57:57 +03:00
Stefan Weil
d000df7e00 Remove remaining parts of tessopt (fix autotools build)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-05 23:06:17 +01:00
Egor Pugin
8e947a98b5 Remove emalloc. Replace it with malloc. To be replaced with new later. 2021-01-06 00:30:52 +03:00
Egor Pugin
af4ebaa943 Alloc on stack. 2021-01-05 18:07:40 +03:00
Egor Pugin
d3729cb34e Remove unused members. 2021-01-05 18:07:10 +03:00
Egor Pugin
40aca00559 Remove unused var. 2021-01-05 17:56:39 +03:00
Egor Pugin
14cf6adda2 More std::vector. 2021-01-05 17:53:05 +03:00
Egor Pugin
a44d107e94 Misc. 2021-01-05 17:45:34 +03:00
Egor Pugin
6e94564152 [training] More unique ptrs. 2021-01-05 17:03:26 +03:00
Egor Pugin
4415209fd6 Remove tessopt. This fixes mastertrainer test in shared build. 2021-01-05 17:00:27 +03:00
Egor Pugin
c946a5610c Remove unused header. 2021-01-05 16:45:24 +03:00
Egor Pugin
8950e49a5d Remove unused var. 2021-01-05 16:45:07 +03:00
Egor Pugin
5160426400 Misc. 2021-01-05 16:31:09 +03:00
Egor Pugin
fb98b9b2f5 Use unique_ptr. 2021-01-05 16:00:22 +03:00
Egor Pugin
aa80aa5de1 More std::vector. 2021-01-05 15:54:30 +03:00
Egor Pugin
4f8f8e3d58 More std::vector. Simplify. 2021-01-05 15:49:53 +03:00
Egor Pugin
ca514ad91e [test] Return early on error. 2021-01-05 15:37:43 +03:00
Egor Pugin
4ed601956e More std::vector. 2021-01-05 14:46:11 +03:00
Egor Pugin
0c7139ce09 A better fix to read unichars. Imbue C locale always since on different systems, default locale will give different results. 2021-01-04 20:36:21 +03:00
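The idea in 0c7139ce09 is that a stream should parse numbers the same way on every system, whatever the default locale is. A minimal sketch using the standard library is shown below. ParseFloatClassicLocale is a hypothetical helper, not the unichar reading code from the commit.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: parses a float with the classic "C" locale imbued,
// so the decimal separator is always '.' regardless of the global locale.
bool ParseFloatClassicLocale(const std::string &text, float *value) {
  std::istringstream stream(text);
  stream.imbue(std::locale::classic());
  stream >> *value;
  return !stream.fail();
}
```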
Egor Pugin
0364832ab8 Correctly read cutoff classes. 2021-01-04 20:20:17 +03:00
Egor Pugin
71f578a198 Do not swap endian elements with size == 1. 2021-01-04 20:00:46 +03:00
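The rule in 71f578a198 follows from the fact that a one-byte element has no byte order to reverse. The sketch below illustrates it with a hypothetical template ReverseBytesInPlace, which is an assumption for this example and not the serialization code in the repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reverses the byte order of each element in place.
// Elements of size 1 are skipped, since swapping a single byte is a no-op.
template <typename T>
void ReverseBytesInPlace(T *data, std::size_t count) {
  if (sizeof(T) == 1) {
    return;
  }
  for (std::size_t i = 0; i < count; ++i) {
    auto *bytes = reinterpret_cast<unsigned char *>(&data[i]);
    std::reverse(bytes, bytes + sizeof(T));
  }
}
```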
Egor Pugin
4e59d964dc Use templates for serialize/deserialize. 2021-01-04 20:00:25 +03:00
Egor Pugin
4162e37e8c Use std::vector. 2021-01-04 19:54:51 +03:00
Egor Pugin
3aae46d53d Remove noisy message. 2021-01-04 18:11:16 +03:00
Stefan Weil
40ba25acbb Remove functions which are only used locally from scanedg.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-04 15:49:15 +01:00
Stefan Weil
709acf74fe Remove functions which are only used locally from fpchop.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-03 21:41:56 +01:00
Stefan Weil
bb6dbd2cd8 Fix autotools build with --disable-legacy
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2021-01-03 20:28:30 +01:00
Egor Pugin
fd8907471c Improve C API. Add tests.
1. Add simple C API test in C++ program.
2. Add simple C API test in C program.
3. Fix including capi.h in C++ files.
2021-01-02 03:57:25 +03:00
Egor Pugin
bee90f7835 [capi] Remove unused functions.
Those functions were under ifdef for C++ mode. Since in C++ mode no one uses them, they can be safely removed.
2021-01-02 02:59:31 +03:00
Egor Pugin
52f5e5b8fb Restore building of C API. Simplify.
1. Delete useless ifdefs.
2. Move C++ includes into source file. C code does not care about any C++ headers.
3. Replace TESS_CAPI_INCLUDE_BASEAPI with simple __cplusplus macro.
4. In capi.cpp remove enclosing namespace tesseract, so symbols have their according decls back.

In capi.cpp we
- put capi.h after all C++ headers, so we can remove some typedefs later,
- put using namespace tesseract between them, so C++ symbols are visible to functions in the file without namespace.
2021-01-02 02:53:33 +03:00
Egor Pugin
664a718a63 Rename platform.h to export.h. 2021-01-01 00:18:36 +03:00
Egor Pugin
2c84c4beb2 [cmake] Make pango include dirs public. 2020-12-31 20:47:34 +03:00
Egor Pugin
9eb52625cd Merge branch 'master' of github.com-egorpugin:tesseract-ocr/tesseract 2020-12-31 20:33:48 +03:00
Egor Pugin
32cb90f114 [cmake] Make pango deps public. 2020-12-31 20:33:01 +03:00
Stefan Weil
061f088b77 Replace C headers by C++ headers and remove old unused C code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-31 18:26:33 +01:00
Stefan Weil
c0db7b7e93 Remove unused code from matchdefs.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-31 18:23:38 +01:00
Egor Pugin
0cdb718835 Remove deleted util.h header. 2020-12-31 20:16:20 +03:00
Egor Pugin
9e1e6305b2 [cmake] Fix build. 2020-12-31 19:56:55 +03:00
Egor Pugin
7b8a78045d Merge branch 'master' of github.com-egorpugin:tesseract-ocr/tesseract 2020-12-31 19:32:09 +03:00
Egor Pugin
6306393c91 [cmake] Implement shared builds. 2020-12-31 19:32:03 +03:00
Stefan Weil
43791c6520 Replace GenericVector<SetOfModels> by std::vector<SetOfModels>
This fixes commit cad0eb4d26.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-31 17:01:12 +01:00
Egor Pugin
07a1533a01 Move training lib sources into their own dirs. 2020-12-31 18:27:03 +03:00
Egor Pugin
1a53ca099a [cmake] tessopt is a static library. 2020-12-31 18:26:33 +03:00
Egor Pugin
cad8cb31bb Add missing includes. 2020-12-31 17:58:36 +03:00
Egor Pugin
65e230f1a2 Fix linux build. 2020-12-31 17:46:49 +03:00
Egor Pugin
a4daf19dd3 Merge branch 'master' of github.com-egorpugin:tesseract-ocr/tesseract 2020-12-31 17:37:37 +03:00
Stefan Weil
96fbe776ea Partially revert cad0eb4d26 (fix layout_test)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-31 15:36:28 +01:00
Egor Pugin
a32c8b2d93 Remove GenericVector::compare_callback. This fixes several tests after previous commit. 2020-12-31 17:26:40 +03:00
Egor Pugin
c86325e2f7 Use TESS_API for every public symbol. Public symbol is exported from the library. This also applies to unit test and training symbols. Users will be limited to public api, but set of exported symbols will be wider still.
Remove TESS_LOCAL.
Fix several symbol issues that were made visible by these changes.

All build systems must set -fvisibility-hidden for *nix systems.
2020-12-31 16:32:29 +03:00
Egor Pugin
4d817d09a5 Remove custom string hasher. 2020-12-31 14:26:23 +03:00
Egor Pugin
250fc0023e Misc. 2020-12-31 14:24:52 +03:00
Egor Pugin
3a66282e92 Remove GOOGLE_TESSERACT ifdefs. 2020-12-31 14:23:52 +03:00
Egor Pugin
d0a730e3d0 Misc. 2020-12-31 13:25:10 +03:00
Egor Pugin
c812d9d894 Use template instead of overloads. 2020-12-31 13:20:21 +03:00
Stefan Weil
cad0eb4d26 Replace more GenericVector by std::vector
This fixes two LGTM alerts and might improve the performance:

    This parameter of type GenericVector<STRING> is 80 bytes -
    consider passing a const pointer/reference instead.

    This parameter of type GenericVectorEqEq<const ParagraphMode*> is 80 bytes -
    consider passing a const pointer/reference instead.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-31 09:28:35 +01:00
Stefan Weil
fc4002dda8 Remove helpers.h from public API
Remove also outdated references to apitypes.h which no longer exists.

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-31 09:06:16 +01:00
Egor Pugin
dfbd394a72 Export all simd matrices. 2020-12-31 03:27:18 +03:00
Egor Pugin
2c054b531c Fix linux build. 2020-12-31 03:06:39 +03:00
Egor Pugin
4ddc919ed0 Correctly use DEBUG macro. C++ compilers do not define it. Instead they define NDEBUG in optimized compilations. 2020-12-31 02:50:07 +03:00
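The point of 4ddc919ed0 is that standard C++ toolchains do not define DEBUG for debug builds. The standard convention is the reverse: NDEBUG is defined for optimized builds and is what disables assert. A minimal sketch of a check based on that convention is shown below. IsDebugBuild is a hypothetical name, not a function from the repository.

```cpp
// Hypothetical helper: reports whether this translation unit was compiled
// as a debug build, using the standard NDEBUG convention instead of DEBUG.
constexpr bool IsDebugBuild() {
#ifdef NDEBUG
  return false;  // NDEBUG defined: optimized build, assert() is disabled.
#else
  return true;   // NDEBUG not defined: debug build, assert() is active.
#endif
}
```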
Egor Pugin
3af30419db Move MAX_PATH def out from public header. 2020-12-31 02:35:28 +03:00
Egor Pugin
a0509b2feb Use std::swap instead of manual function. 2020-12-31 02:17:54 +03:00
Egor Pugin
89273c915d Remove empty DLLSYM macro. 2020-12-31 02:10:46 +03:00
Stefan Weil
4366d811d4 Fix TFile::DeSerialize, TFile::Serialize for empty vectors
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 19:15:56 +01:00
Stefan Weil
30eeb7f01a Replace some old-style type casts
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-30 17:56:59 +01:00
Stefan Weil
faf0407dff Remove RecognizeForChopTest from public API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-30 17:55:40 +01:00
Stefan Weil
588ac3fed2 Remove TessTruthCallback from public API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-30 15:38:11 +01:00
Stefan Weil
ebafb19a43 Replace GenericVector<ParamsTrainingHypothesis> by std::vector<ParamsTrainingHypothesis>
This fixes an LGTM alert:

    This parameter of type ParamsTrainingHypothesis is 136 bytes -
    consider passing a const pointer/reference instead.

It might also improve the performance.

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 13:26:44 +01:00
Stefan Weil
688ef20f62 Replace GenericVector<RowInfo> by std::vector<RowInfo>
This fixes an LGTM alert:

    This parameter of type RowInfo is 144 bytes -
    consider passing a const pointer/reference instead.

It might also improve the performance.

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 12:14:43 +01:00
Stefan Weil
536a676250 Replace GenericVector<WordData> by std::vector<WordData>
This fixes an LGTM alert:

    This parameter of type WordData is 112 bytes -
    consider passing a const pointer/reference instead.

It might also improve the performance.

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 12:14:43 +01:00
Stefan Weil
fbc807ce99 Remove unused local function CharCoverageMapToBitmap
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 12:14:43 +01:00
Stefan Weil
83d97ffc80 Remove redundant comparison
This fixes an LGTM alert:

    Comparison is always true because i >= 2.

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 12:14:43 +01:00
Stefan Weil
f3acab507d Fix arguments for tprintf
This fixes two LGTM alerts:

    This argument should be of type 'int' but is of type '_Bit_reference'

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 12:14:43 +01:00
Stefan Weil
53503b34be Fix declaration for C_BLOB
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 11:33:29 +01:00
Stefan Weil
7866677a0c avx2: Remove unused local variables
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 11:33:29 +01:00
Stefan Weil
96e3b52936 Remove unused function CompareSTRING
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 11:33:29 +01:00
Stefan Weil
2cf70d6164 Replace more GenericVector by std::vector
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 10:51:12 +01:00
Stefan Weil
3a34f17037 Order and clean include statements
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 10:50:39 +01:00
Stefan Weil
3603c740e7 Fix ShapeTable::AddUnicharToResults (fix mastertrainer_test)
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 07:10:29 +01:00
Stefan Weil
4c94d09047 Replace more GenericVector by std::vector
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 07:10:29 +01:00
Stefan Weil
deec8ef46f Replace std::list by std::vector
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-30 07:10:29 +01:00
Stefan Weil
4043204c2b Use old genericvector.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-30 07:10:29 +01:00
Egor Pugin
482824c109 Fix trie's word sort comparator. 2020-12-30 02:37:53 +03:00
Egor Pugin
37e760d9c2 [test] Fix unicharset. 21->18 failed tests remaining. 2020-12-30 02:11:58 +03:00
Stefan Weil
f4e380f64a Remove serialis.h from public API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-29 11:28:50 +01:00
Stefan Weil
e2683e17fc Remove unused DocumentData::SaveToBuffer
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-29 10:43:00 +01:00
Egor Pugin
f190c85682
Update src/api/tesseractmain.cpp
Co-authored-by: Stefan Weil <sw@weilnetz.de>
2020-12-29 00:22:28 +03:00
Stefan Weil
c8be22f313 Fix nullptr assignment in TessBaseAPI
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
90af3e7b5c Remove strngs.h from public API
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
03884c370c Replace STRING by std::string in ResultIterator
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
2369aa5604 Use std::vector, std::string in baseapi.h
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
72663a9a81 Use std::vector, std::string in baseapi.h
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
fec9c11c8c Use std::vector, std::string in baseapi.h
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
64e902ddf7 Remove genericvector.h from public API
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
f462389673 renderer for TessPDFRenderer
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
d55e5f4803 Replace more GenericVector by std::vector
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
4a28d33c58 Replace GenericVector by std::vector in strngs.h and more places
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
3ddc88cccb Use std::vector in TessPDFRenderer
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
7c679e777d Use std::vector for allowed_scripts
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
32d53479ae Use std::vector for vars_vec, vars_values
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
085f6b2572 Use std::list for paragraph models
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
4ebba72919 Use std::vector for paragraph models
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
524fc67165 Fix tesseract --list-langs
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Egor Pugin
986b57dd4e Export symbol for unit test. 2020-12-28 04:58:26 +03:00
Egor Pugin
3187f2ef08 Move doubleptr.h to unittests as it is used only there. 2020-12-28 02:32:27 +03:00
Egor Pugin
4175679da6 Revert kdpair, genericheap changes. 2020-12-28 02:31:45 +03:00
Stefan Weil
289a34a40a Add const attribute for pdf_ttf
That moves its data into the text segment and reduces the total size
slightly:

   text	   data	    bss	    dec	    hex	filename
  39788	    693	      0	  40481	   9e21	old/libtesseract_la-pdfrenderer.o
  40360	     88	      0	  40448	   9e00	new/libtesseract_la-pdfrenderer.o

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 17:51:56 +01:00
Stefan Weil
7dca63caf1 More fixes for namespace tesseract
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 17:41:53 +01:00
Stefan Weil
7188b160ae Fix build with --disable-graphics
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 17:36:24 +01:00
Egor Pugin
aecbf79791 Add missing merge_unicharsets training tool to cmake and sw build. 2020-12-26 15:57:22 +03:00
Stefan Weil
317ef988a0 Add missing namespace prefix for GlobalParams() (fix build for some unit tests)
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 13:44:43 +01:00
Stefan Weil
418064f639 Add missing namespace prefix (fix build for merge_unicharsets)
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 13:09:39 +01:00
Egor Pugin
c8b8d266d6 Fix some of vector<bool> cases for msvc. 2020-12-26 04:17:13 +03:00
Egor Pugin
6b22972bc2 Fix linux build. 2020-12-26 04:15:42 +03:00
Egor Pugin
c3e04abe1e Inherit STRING from std::string. 2020-12-26 03:48:35 +03:00
Egor Pugin
4fc467a922 Inherit GenericVector from std::vector. Inherit kdpairs from std::pair. Rewrite some move ctors to modern C++ style. 2020-12-26 03:23:09 +03:00
Egor Pugin
04d3cfcf2f Merge branch 'master' of github.com-egorpugin:tesseract-ocr/tesseract 2020-12-26 00:55:37 +03:00
Egor Pugin
79a86f2582 Move all tesseract symbols into tesseract namespace. Fix include order in many places. 2020-12-26 00:55:30 +03:00
zdenop
ceadc4ddb8 remove inline declaration 2020-12-25 16:28:00 +01:00
Egor Pugin
14d52a79ba Remove .rc files. No need to add them into dll/exe. 2020-12-25 18:06:35 +03:00
zdenop
044921267f embed pdf.ttf to tesseract library #2551 2020-12-25 13:20:36 +01:00
Stefan Weil
cc133aa394 Fix text for fonts_dir parameter
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 21:32:05 +01:00
Stefan Weil
34abba8698 Add terminating linefeed to fonts.conf
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 21:32:05 +01:00
Stefan Weil
17a64eef1e Simplify code for PangoFontInfo::HardInitFontConfig
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 21:32:05 +01:00
Stefan Weil
707ee70966 Use deprecated pango_fc_font_get_glyph for old Pango versions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 12:02:37 +01:00
Stefan Weil
f759142c95 Remove buggy Windows implementation for getting glyph from font
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 09:07:09 +01:00
Stefan Weil
7669d36a37 Use HarfBuzz instead of deprecated pango_fc_font_get_glyph
This fixes the crash on MacOS with M1.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 09:03:05 +01:00
Stefan Weil
8c859a7329 Fix type cast from PangoFont to PangoFcFont
The original code crashes in pango_fc_font_get_glyph on MacOS with M1.

Replacing the type cast with the macro made for that conversion
gives at least an error message before crashing:

    (process:12546): GLib-GObject-WARNING **: 08:38:02.472: invalid cast from 'PangoCairoCoreTextFont' to 'PangoFcFont'
    zsh: segmentation fault  ./pango_font_info_test

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 08:45:11 +01:00
Stefan Weil
3efedabda3 automake: Flat build for src/training
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-19 15:25:21 +01:00
Stefan Weil
6fcf8d23bc Use more compiler and linker flags from pkg-config
This fixes some build issues with Homebrew on MacOS.

Signed-off-by: Stefan Weil <stefan@Sabines-Mac-mini.fritz.box>
2020-12-13 13:24:46 +01:00
Stefan Weil
490bd3ec8f Fix build with enabled TensorFlow
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-04 10:56:23 +01:00
Stefan Weil
ac116d1b28 Fix regression in Network::Serialize (fix issue #3167)
The regression was caused by a wrong string serialization in
commit 4613738a5e.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-03 19:36:58 +01:00
zdenop
279b0b2e37
Merge pull request #3160 from stweil/string2
Replace more occurrences of STRING by std::string or char*
2020-11-27 18:24:17 +01:00
Stefan Weil
65b11a1e12 Pack class SVMenuNode
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:17:27 +01:00
Stefan Weil
a1849bc65c Pack struct CLASS_STRUCT
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:17:27 +01:00
Stefan Weil
0bb46ac2e0 Pack struct BlamerBundle
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:17:27 +01:00
Stefan Weil
bf3774cc91 Use more const char*
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:01:17 +01:00
Stefan Weil
4613738a5e Use const char* for filename and network_spec parameters
This replaces the proprietary STRING data type
(764 instead of 838 lines remaining).

It also removes STRING from osdetect.h and serialis.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:01:17 +01:00
Stefan Weil
fbc4c809d9 Replace STRING by std::string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-31 14:08:39 +01:00
Stefan Weil
92b6c652f3 Use std::vector for scales_
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-29 08:00:11 +01:00
Stefan Weil
c15dd26b84 Don't pass scales_ to IntSimdMatrix::Init
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-28 20:35:53 +01:00
Stefan Weil
fe76142a3d Remove GenericVector::scale() again
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-28 16:24:59 +01:00
Stefan Weil
eaf72ace31 Prefer result from inverted image if the mean confidence is better
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-26 20:37:47 +01:00
Stefan Weil
cfb1fb2540 Try OCR on inverted line only if mean confidence is below 50 %
The old code looked at the minimum confidence, which very often
triggered a second OCR pass without improving the result.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-26 09:32:09 +01:00
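Note: the difference between the two criteria in the commit above can be sketched as follows. This is only an illustration under assumptions: the function names, the confidence range [0, 1] and the container type are hypothetical and are not taken from the Tesseract sources.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical helper (not Tesseract's API): retry OCR on the inverted line
// only if the MEAN confidence of the first pass is below the threshold.
bool ShouldRetryInvertedMean(const std::vector<float>& confidences,
                             float threshold = 0.5f) {
  if (confidences.empty()) return false;
  const float sum =
      std::accumulate(confidences.begin(), confidences.end(), 0.0f);
  return sum / static_cast<float>(confidences.size()) < threshold;
}

// Old-style criterion for comparison: a single weak character is enough
// to trigger the second pass.
bool ShouldRetryInvertedMin(const std::vector<float>& confidences,
                            float threshold = 0.5f) {
  if (confidences.empty()) return false;
  return *std::min_element(confidences.begin(), confidences.end()) < threshold;
}
```

With one weak character in an otherwise good line, the minimum-based check fires while the mean-based check does not, which matches the motivation given in the commit message.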
Robin Watts
436008bd37 Tweak SIMDDetect for ANDROID Neon.
cpufeatures.h should be cpu-features.h, with the latest NDK
at least. The #if 0'd section is not required because armv8
always includes NEON.
2020-10-19 12:04:29 +01:00
Robin Watts
db10c7b577 intsimdmatrixneon.cpp: Do biasing in SIMD. 2020-10-12 04:30:46 -07:00
Robin Watts
d1e49d6dd2 intsimdmatrixavx2: Do biasing in SIMD.
We also move to relying on both scales and output having been
padded to accommodate us writing more results than are actually
needed here. This was allowed for a few commits back.
2020-10-12 04:30:46 -07:00
Robin Watts
872816897a Rejig intsimdmatrix to reduce FP ops.
Avoid 1) floating point division by 127, 2) conversion of
bias to double, 3) FP addition, in favour of 1) integer
multiplication by 127, and 2) integer addition.

(Also costs extra work in the serialisation/deserialisation of
the scale values, and conversion of weights to int formats, but
these are all one offs).
2020-10-12 04:30:46 -07:00
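Note: the arithmetic change described in the commit above can be illustrated with a small sketch. It assumes the per-output result has the form (sum / 127 + bias) * scale; the function names and the exact data types are hypothetical and may differ from the real intsimdmatrix code.

```cpp
#include <cstdint>

// Before (assumed form): FP division by 127, bias converted to double,
// FP addition, then scaling.
double OutputOld(int32_t sum, int8_t bias, double scale) {
  return (static_cast<double>(sum) / 127.0 + static_cast<double>(bias)) * scale;
}

// After (assumed form): integer multiplication by 127 and integer addition.
// The division by 127 is folded into the scale once, for example when the
// scales are deserialised, so it is a one-off cost.
double OutputNew(int32_t sum, int8_t bias, double scale_over_127) {
  const int32_t biased = sum + static_cast<int32_t>(bias) * 127;
  return static_cast<double>(biased) * scale_over_127;
}
```

Both forms are algebraically identical, since (sum / 127 + bias) * scale equals (sum + 127 * bias) * (scale / 127); only the number of floating point operations per output changes.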
Robin Watts
aba1800f69 Round output buffers for intSimdMatrix.
In order to allow intSimdMatrix implementations to 'overwrite'
their outputs, ensure that the output buffers are always padded
to the next block size.

This doesn't make any difference yet, but it enables optimisations
further down the line, especially when the biasing is pulled into
the SIMD.
2020-10-12 11:47:16 +01:00
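Note: padding a buffer to the next block size, as described in the commit above, usually comes down to a round-up helper like the one below. This is a sketch only; the names RoundUp and MakePaddedOutput are hypothetical and the block size is whatever the SIMD implementation processes per step.

```cpp
#include <cstddef>
#include <vector>

// Round n up to the next multiple of block (block must be > 0).
constexpr std::size_t RoundUp(std::size_t n, std::size_t block) {
  return (n + block - 1) / block * block;
}

// Allocate an output buffer whose size is padded to a whole number of
// blocks, so a SIMD kernel may write a full block past the last real output.
std::vector<double> MakePaddedOutput(std::size_t num_out, std::size_t block) {
  return std::vector<double>(RoundUp(num_out, block), 0.0);
}
```

Callers still use only the first num_out entries; the extra entries exist purely so that whole-block writes stay inside the allocation.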
Robin Watts
9dfdac51c6 Tweak scales array for intSimdMatrix case.
Currently, the size of the scales array is not rounded up
in the same way as the weights are. This blocks us pushing
the scale calculations into the SIMD, as when we "overread"
the end of the scale array, we potentially get errors.

Here, we adjust the intSimdMatrix stuff to ensure that the
scales array reserves enough entries to allow such overreads
to work.

This doesn't make any difference for now, but opens the way
for future optimisations.
2020-10-12 11:47:16 +01:00
amitdo
958f23453e Improve disabled legacy engine build 2020-10-12 11:47:16 +01:00
Stefan Weil
ac14ab32c6 Remove dummy functions from globaloc.cpp and related code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-04 12:24:26 +02:00
Stefan Weil
7c4ef88dab Remove unused functions FontUtils::GetAllRenderableCharacters
They used the function pango_coverage_max which does nothing and
which has been deprecated since pango version 1.44.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-03 12:04:40 +02:00
Le Duc Nam
eb8f1674bf Correct "NoImages" in debug pdf file
Issues:
  The debug output for "NoImages" is just the binary image;
  it does not show the result of photo_mask_pix to the developer.

Fix:
  Subtract photo_mask_pix from the binary image; the result
  is the "NoImages" binary pix.
2020-09-06 23:31:30 +07:00
Robert Sachunsky
640c14e080 AutoPageSeg/FindBlocks/GridRemoveUnderlinePartitions: avoid self-deletion
When checking horizontal line partitions for
possible interpretation as underline formatting,
avoid confusing the hline partition itself with
an overlapping neighbour (which would delete it).
2020-08-24 19:13:48 +02:00
Robert Sachunsky
65a077d3e9 FindAndRemoveLines/FindVerticalAlignment: decrease fixed vline min length
When detecting vertical separators, the blob aligner is used to glue
line segments (often segmented due to artificial cracks).
But (unlike LineFinder) it has many parameters that are not
relative to pixel density/resolution.
This change decreases the minimum absolute length in pixels
for vertical separators.
2020-08-24 19:13:36 +02:00
Robert Sachunsky
0228d93684 textord debugging: invert default top/bottom boundaries, improve description 2020-08-24 19:13:25 +02:00
Stefan Weil
d33edbc4b1
Merge pull request #3066 from robinwatts/pushback14
Remove unused char constant that causes a warning.
2020-07-17 15:55:51 +02:00
Robin Watts
578462109b Remove unused char constant that causes a warning.
The kDictWildcard is never actually used, so removing it makes
no difference. It causes warnings in MSVC builds as MSVC doesn't
know how to pack a unicode value into chars.
2020-07-17 14:22:37 +01:00
Robin Watts
150e2e54fe Squash some warnings in MSVC build.
In particular, "defined but not used" (caused by GRAPHICS_DISABLED),
double constants being truncated to floats, and implicit casts.
2020-07-16 10:08:40 +01:00
zdenop
7fa200bfb7
Merge pull request #3064 from robinwatts/pushback12
Fix Memory leak when using TESSERACT_IMAGEDATA_AS_PIX
2020-07-15 19:08:58 +02:00
Robin Watts
7f45b719d1 Fix Memory leak when using TESSERACT_IMAGEDATA_AS_PIX
If building with TESSERACT_IMAGEDATA_AS_PIX, then tesseract
doesn't compress/decompress images, but rather holds the
data as internal Pix structures. Unfortunately, I forgot to
make the ImageData destructor free these, so memory leaked
during use. Fixed here.
2020-07-15 12:35:35 +01:00
zdenop
135c8a49b5
Merge pull request #3061 from stweil/neon
Always use NEON by default for ARMv8
2020-07-11 09:11:54 +02:00
zdenop
875bd48bd5
Merge pull request #3058 from stweil/scrollview
Disable more code and data with GRAPHICS_DISABLED
2020-07-11 09:11:27 +02:00
Stefan Weil
548a832b98 Use strtok_s for MSVC in class SVNetwork
strtok_s can be used with MSVC as a replacement for strtok_r, so less
special handling is needed in the code and class SVNetwork can be
made smaller by removing member has_content.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-10 17:47:05 +02:00
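Note: the substitution mentioned in the commit above works because MSVC's strtok_s takes the same three arguments as POSIX strtok_r. A minimal sketch, assuming a simple mapping macro (the helper SplitLines is hypothetical and not part of class SVNetwork):

```cpp
#include <string.h>
#include <string>
#include <vector>

#ifdef _MSC_VER
// MSVC has no strtok_r, but its strtok_s has the same
// (string, delimiters, context) signature.
#define strtok_r strtok_s
#endif

// Split text at newlines using the reentrant tokenizer.
// The argument is taken by value because strtok_r modifies its input.
std::vector<std::string> SplitLines(std::string text) {
  std::vector<std::string> lines;
  char* save = nullptr;
  for (char* tok = strtok_r(&text[0], "\n", &save); tok != nullptr;
       tok = strtok_r(nullptr, "\n", &save)) {
    lines.emplace_back(tok);
  }
  return lines;
}
```

Like strtok, the reentrant variants skip empty tokens, so consecutive newlines do not produce empty entries.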
Stefan Weil
2db2223b39 Always use NEON by default for ARMv8
Signed-off-by: Stefan Weil <stefan.weil@bib.uni-mannheim.de>
2020-07-10 15:27:09 +02:00
Stefan Weil
cb3880fb15 Disable more code and data with GRAPHICS_DISABLED
Some runtime parameters which are only relevant with graphics enabled
are now removed from builds when graphics is disabled.

TableFinder::DisplayColSegmentGrid is never used, so remove it completely.

Builds with --disable-graphics significantly reduce the code size and avoid
some function calls which might be important for certain applications:

   text	   data	    bss	    dec	    hex	filename
3219230	  41136	  13920	3274286	 31f62e	.libs/libtesseract.so (--disable-graphics, old)
3211347	  40976	  13600	3265923	 31d583	.libs/libtesseract.so (--disable-graphics, new)
3360942	  43656	  15392	3419990	 342f56	.libs/libtesseract.so (default)

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-09 11:23:33 +02:00
Stefan Weil
22e6c2e5a7 Fix division by 0.0 in BaselineRow::PerpDistanceFromBaseline
It was reported by oss-fuzz (issue 23962).

Add log output to find real images which trigger that issue.
Also avoid some conversions from float to double by always using float.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-08 17:59:02 +02:00
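Note: a division by 0.0 in a perpendicular distance calculation typically occurs when both points that define the line coincide. The sketch below shows one way to guard against that while staying in float throughout. The Point type, the function name and the chosen fallback are assumptions for illustration; the commit message does not say how the real function handles the degenerate case.

```cpp
#include <cmath>

struct Point {
  float x;
  float y;
};

// Perpendicular distance from pt to the line through p1 and p2.
float PerpDistance(Point p1, Point p2, Point pt) {
  const float dx = p2.x - p1.x;
  const float dy = p2.y - p1.y;
  const float length = std::sqrt(dx * dx + dy * dy);
  if (length == 0.0f) {
    // Degenerate line (p1 == p2): avoid dividing by zero and fall back
    // to the distance between pt and that single point (an assumption).
    return std::hypot(pt.x - p1.x, pt.y - p1.y);
  }
  const float cross = dx * (pt.y - p1.y) - dy * (pt.x - p1.x);
  return std::fabs(cross) / length;
}
```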
Stefan Weil
8137cf35a6 Use const char* for filename parameters
This replaces the proprietary STRING data type
(801 instead of 838 lines remaining).

It also removes STRING from osdetect.h and serialis.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-07 14:20:09 +02:00
Stefan Weil
51dff483e7 Fix runtime error caused by too large TBOX
Runtime error reported by sanitizer:

    src/ccstruct/rect.h:191:44: runtime error: 50961 is outside the range of representable values of type 'short'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/rect.h:191:44 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-30 20:51:52 +02:00
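Note: the sanitizer message above is the kind reported when a floating point value is converted to a 16-bit integer type that cannot hold it. One common remedy is to clamp before converting, sketched below. The helper name is hypothetical and the actual fix in rect.h may look different.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Clamp a floating point coordinate to the int16_t range before converting,
// so the conversion itself is always well defined.
int16_t ClampToInt16(double value) {
  const double lo = std::numeric_limits<int16_t>::min();
  const double hi = std::numeric_limits<int16_t>::max();
  return static_cast<int16_t>(std::max(lo, std::min(hi, value)));
}
```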
Stefan Weil
2269a500ef Fix runtime error with null pointer argument
Runtime error reported by sanitizer:

    src/ccstruct/coutln.cpp:1018:19: runtime error: null pointer passed as argument 2, which is declared to never be null
    /usr/include/string.h:48:14: note: nonnull attribute specified here
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/coutln.cpp:1018:19 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-29 19:13:39 +02:00
Stefan Weil
411ffa90c6 Fix unsigned integer overflow
Runtime errors reported by sanitizer:

    src/textord/pithsync.cpp:75:31: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:31 in
    src/textord/pithsync.cpp:75:43: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:43 in
    src/textord/pithsync.cpp:125:29: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:29 in
    src/textord/pithsync.cpp:125:41: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:41 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-29 19:13:39 +02:00
Stefan Weil
7c046c121f Fix out of bounds array access
Runtime error with enabled sanitizer:

    src/textord/colpartition.cpp:2243:66: runtime error: index -1 out of bounds for type 'tesseract::ColPartition *[6]'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/colpartition.cpp:2243:66 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-29 16:10:37 +02:00
zdenop
4ef709554b
Update imagedata.cpp
stop PreScale if pixScale failed (fixes #3025)
2020-06-25 20:32:51 +02:00
amitdo
efae270dea Disabled legacy build: Disable more unused code 2020-06-24 22:02:52 +03:00
Stefan Weil
ca0a6c9d37
Merge pull request #3035 from stweil/overflow
Avoid buffer overflow (issue #444)
2020-06-24 18:46:47 +02:00
Stefan Weil
2cb5bc7690 Improve debug message in ColPartition::ComputeLimits
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-23 22:52:45 +02:00
Stefan Weil
cfabdfe0af Avoid buffer overflow (issue #444)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-22 22:19:58 +02:00
Stefan Weil
62b085cb8d ScrollView: Remove C API callcpp.{cpp,h}
Use C++ class ScrollView directly instead of using an intermediate C API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-22 09:14:26 +02:00
Stefan Weil
b2cc00d97f Replace cprintf by tprintf and remove cprintf
cprintf was an indirect way to call tprintf.
This indirection is not needed, so remove it and use tprintf directly.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-21 19:07:09 +02:00
Stefan Weil
ea1f597fc1 Fix insecure call of tprintf
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-21 19:03:03 +02:00
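Note: the commit above does not say which call was affected. A frequent cause of an "insecure" printf-style call is passing non-literal text as the format string; the sketch below shows that pattern with std::snprintf as a stand-in for tprintf. The function name is hypothetical.

```cpp
#include <cstdio>
#include <string>

// Format untrusted text safely: the text is passed as an argument to a
// fixed "%s" format instead of being used as the format string itself.
std::string FormatMessage(const char* untrusted) {
  char buf[64];
  // Unsafe variant (do not use): std::snprintf(buf, sizeof(buf), untrusted);
  // Any '%' sequence inside the text would then be interpreted as a
  // conversion specification.
  std::snprintf(buf, sizeof(buf), "%s", untrusted);
  return buf;
}
```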
Stefan Weil
4a10bb68c7 Fix conversion of images with 16 bpp or 24 bpp to grey
The old code used pixConvertRGBToLuminance which only converts 32 bpp images.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-21 09:09:49 +02:00
Stefan Weil
6f6100ff9f Classify: Run sort only for more than one element
This fixes calls of qsort with a nullptr argument (reported by sanitizers).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-20 21:43:22 +02:00
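Note: the guard described in the commit above can be sketched as follows. An empty container may hand a null pointer to qsort, which sanitizers report, and sorting zero or one element is a no-op anyway. The names are hypothetical and the element type is simplified to int.

```cpp
#include <cstdlib>
#include <vector>

// Three-way comparison for qsort that avoids overflow from subtraction.
static int CompareInts(const void* a, const void* b) {
  const int x = *static_cast<const int*>(a);
  const int y = *static_cast<const int*>(b);
  return (x > y) - (x < y);
}

// Call qsort only when there is something to sort, so it never sees the
// null data pointer of an empty vector.
void SortInts(std::vector<int>& values) {
  if (values.size() > 1) {
    std::qsort(values.data(), values.size(), sizeof(int), CompareInts);
  }
}
```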
Stefan Weil
d4cf77c92b Don't check for limits.h (now unused)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-20 10:39:13 +02:00
Matej Knopp
e900252c1a Fix CMake build with DISABLED_LEGACY_ENGINE 2020-06-17 19:42:49 +02:00
Stefan Weil
d6ca7a5298 ScrollView: Fix typo in comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-17 16:26:41 +02:00
Stefan Weil
380466e0d3 Allow inlining of function TruncateParam
It is only used locally in intproto.cpp, so defining it before the first
use and adding the static attribute allows the compiler to inline it.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:41 +02:00
Stefan Weil
93cfffeb87 Remove unused argument from function TruncateParam
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:41 +02:00
Stefan Weil
f08b16a5a0 Remove assertion which is triggered by tests
oss-fuzz issue 15149 triggers this assertion. See test case here:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=15149

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:26 +02:00
Stefan Weil
18d9983f69 StrokeWidth: Remove unused local variable (fixes compiler warning)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:09 +02:00
Stefan Weil
bc61038dd4 SPLIT: Make function bounding_box inline for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:21:36 +02:00
Stefan Weil
0e7701bc3c SEAM: More inline functions for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:20:14 +02:00
Stefan Weil
e45100ebf7 TBOX: Use inline constructor for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:17:55 +02:00
Stefan Weil
c110958ffa Fix undefined shift with negative value (oss-fuzz issue 14658)
This fixes a bug reported by OSS Fuzz:
https://oss-fuzz.com/issue/5697280134348800

The old code passed a negative value (-1) as argument to step_dir
when destindex was 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 13:25:32 +02:00
Stefan Weil
6ee3698958 Remove old unused code from imagedata.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 16:02:27 +02:00
Stefan Weil
d8500adcf4 Fix crash caused by missing thread synchronization (issues #757, #1168 and #2191)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 15:53:17 +02:00
Robin Watts
6fec69de1a Fix intsimdmatrixneon.cpp stack corruption.
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to be the case, as we can get
called (sometimes) with num_out % 8 == 7.

Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
2020-05-27 13:40:17 +01:00
Stefan Weil
a06d0d8449 Add missing include statements for config_auto.h
They are required to get the macro DISABLED_LEGACY_ENGINE.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-22 16:34:28 +02:00
Stefan Weil
6732eb9eb5 Clean code for NEON support
Include it only for NEON and remove unneeded code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-21 07:03:37 +02:00
Robin Watts
f79e52a7cc NEON SIMD code.
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON code added the time drops to 37
seconds.

I have not tested the configure/Makefile changes as I'm not using
them.
2020-05-20 18:54:42 +01:00
zdenop
b5d639dcc5
Merge pull request #2965 from robinwatts/pushback1
thanks.
2020-05-16 20:35:19 +02:00
zdenop
064b4403de
Merge pull request #2966 from robinwatts/pushback2 2020-05-16 20:06:31 +02:00
Robin Watts
3408c36eab Guard #include "config_auto.h" with HAVE_CONFIG_H.
Every other file already does this.
2020-05-15 19:29:03 +01:00
Robin Watts
43437a540b Fix OEM_DEFAULT in DISABLED_LEGACY_ENGINE builds.
If api->Init is called with OEM_DEFAULT in DISABLED_LEGACY_ENGINE
build modes, the engine mode is never set, resulting in no
words being found.
2020-05-15 14:56:41 +01:00
Julian Gilbey
e7e6999d3b Move comment about swap meaning for DeSerialize to correct function 2020-05-13 07:02:59 +01:00
Robin Watts
27d513462c Avoid using PACKAGE_VERSION in favour of TESSERACT_VERSION_STR.
This means the sources compile perfectly in the absence of
config_auto.h/HAVE_CONFIG_H as they were intended to do.

TESSERACT_VERSION_STR is set to be precisely PACKAGE_VERSION
by autoconf, so there are no actual changes in compiled code.
2020-05-12 21:45:12 +02:00
Stefan Weil
39f7fb4a1a Allow line images with larger width (depending on height)
Training with normalized line images higher than 36 px also results in larger widths.
The limit should therefore depend on the height used for the normalization.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:59:31 +02:00
Stefan Weil
34bdc8b74e Allow line images with larger width
Line images can be larger than the old limit, especially when training
is made with newspaper lines.

    Image too large to learn!! Size = 2641x36
    Image too large to learn!! Size = 2704x36
    Image too large to learn!! Size = 2751x36
    Image too large to learn!! Size = 3738x36

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:50:40 +02:00
Julian Gilbey
ca5735efcb Destroy box before potentially exiting function 2020-05-12 15:25:16 +01:00
Stefan Weil
d3a0768c32
Merge pull request #2975 from robinwatts/pushback5
Tweak architecture specific SIMD files for ease of compilation
2020-05-12 14:55:32 +02:00
Robin Watts
a9b44ee8c2 Tweak architecture specific SIMD files for ease of compilation.
This won't affect anything using the supplied build system. For
other projects that include tesseract within them, however, this
may make their life easier.

For example, I have an integration of Tesseract with Ghostscript,
in which tesseract is built as part of the Ghostscript build,
without using the tesseract build system.

The Ghostscript build system is makefile based, and has to work
on a range of make systems, including unix make, gnu make and
nmake. As such we have to avoid conditionals in the common
makefiles. It therefore becomes hard to build one set of files on
x86 systems, and another on (say) ARM systems.

Accordingly, this commit makes small tweaks to the architecture
specific files, so that they compile on EVERY platform, but
only compile to anything useful on the appropriate platform.

Thus the makefiles can build all the files on all the systems, and
the preprocessor flags mean that the correct functions are actually
built.
2020-05-12 13:09:29 +01:00
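Note: the approach described in the commit above, where an architecture specific file compiles everywhere but only produces useful code on the matching platform, can be sketched like this. The names Sum8Fn and kSum8Neon are hypothetical; the real files guard their own SIMD kernels in a similar spirit.

```cpp
#include <cstdint>

using Sum8Fn = int32_t (*)(const int8_t*);

#if defined(__ARM_NEON)
#include <arm_neon.h>

// NEON build: sum 8 signed bytes with a chain of pairwise widening adds.
static int32_t Sum8Neon(const int8_t* p) {
  const int8x8_t v = vld1_s8(p);
  const int16x4_t s16 = vpaddl_s8(v);
  const int32x2_t s32 = vpaddl_s16(s16);
  const int64x1_t s64 = vpaddl_s32(s32);
  return static_cast<int32_t>(vget_lane_s64(s64, 0));
}
const Sum8Fn kSum8Neon = Sum8Neon;
#else
// Any other platform: the file still compiles, but offers no implementation.
const Sum8Fn kSum8Neon = nullptr;
#endif
```

A makefile can then list this file unconditionally, and the caller checks the pointer (or an equivalent feature flag) at run time.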
Egor Pugin
0eaabc42c7
Update CMakeLists.txt 2020-05-12 11:49:15 +03:00
Egor Pugin
e720a26745
[cmake] Set inactivity timeout during icu download to 300 seconds.
Fixes #2972.
2020-05-09 18:55:45 +03:00
Robin Watts
80d4af6ecf Add a mechanism to avoid creating debug fonts.
If TESSERACT_DISABLE_DEBUG_FONTS is defined, tesseract doesn't
attempt to create any debug fonts. This not only saves memory,
but it (combined with the change to optionally use Pix as
internal storage for the ImageData) allows us to use an
embedded Leptonica library with no format handlers at all.
2020-05-05 00:22:23 +01:00
Robin Watts
6bcb941bcf Avoid tesseract writing Pix out/reading them back.
By default, when we ImageData::SetPix, we write the data out as a
PNG, just to read it back in to get a compressed buffer of data.
We then use this to generate a new Pix.

In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.

Not only that, but compressing/uncompressing is slow, and on minimal
builds of leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.

In such cases, it'd be far nicer just to keep the original Pix as
the internal data.

Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.

So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.



Given we don't do compression, and they were writing to memory,
this was all just more effort than we needed.

Also, if we're using GRAPHICS_DISABLED, we might as well just
pixClone rather than pixCopy as only the scaler uses this.
2020-05-04 21:01:22 +01:00
Amit D
acc4c8bff5
Merge pull request #2952 from jannick0/patch-1
[trie.h] pattern definition: fix documentation
2020-04-27 23:44:48 +03:00
Stefan Weil
1188e0a516 Remove old code which was used for Ocropus
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-04-27 16:33:34 +02:00
jannick0
e044163085
[trie.h] pattern definition: fix documentation
The fix makes the definition of `\n` consistent with the examples given below the definition.  Please note that I did not check this against how it is implemented in the code.
2020-04-19 13:47:42 +02:00
Stefan Weil
4a00b68c63 Fix lambda function for curl code errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 20:46:52 +01:00
Stefan Weil
9f5a3f6ac7 Fix uninitialized local variable in curl code
Compiler warning:

    src/api/baseapi.cpp:1151:27: warning:
      variable 'curlcode' is uninitialized when used here [-Wuninitialized]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 19:25:33 +01:00
zdenop
6e307074d8
Merge pull request #2894 from stweil/curl
Report errors from curl_easy functions
2020-03-18 14:14:07 +01:00
Stefan Weil
ef4f99a994 Run xgetbv instruction only on machines which support it
This fixes a regression for older Intel processors.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-08 17:32:10 +01:00
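Note: the xgetbv instruction is only available when CPUID reports the OSXSAVE feature, so it must not be executed unconditionally. The sketch below shows the usual order of checks for GCC and Clang on x86; the function name is hypothetical, and the real SIMDDetect code may be structured differently and also covers MSVC.

```cpp
#include <cstdint>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Returns true if both the CPU and the operating system support AVX.
bool OsSupportsAvx() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
  const bool osxsave = (ecx & (1u << 27)) != 0;  // CPUID.1:ECX bit 27
  const bool avx = (ecx & (1u << 28)) != 0;      // CPUID.1:ECX bit 28
  if (!osxsave || !avx) return false;
  // Only now is it safe to execute xgetbv; older processors would fault.
  uint32_t xcr0_lo = 0, xcr0_hi = 0;
  __asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
  (void)xcr0_hi;
  // XCR0 bits 1 and 2: the OS saves and restores SSE and AVX register state.
  return (xcr0_lo & 0x6u) == 0x6u;
#else
  return false;  // Other compilers or architectures: report no support.
#endif
}
```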
Stefan Weil
eff4dc0603 Use lambda expressions for reporting curl errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:44:42 +01:00
Stefan Weil
9972c91127 Report errors from curl_easy functions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:26:51 +01:00
Stefan Weil
57ff90687d simd: Check whether the OS supports FMA, AVX, ...
The previous check was only for the MS compiler, but not for gcc and clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 16:34:35 +01:00
Stefan Brechtken
b2ed8038d1 TableFind: clearing the statically allocated memory on api end 2020-02-19 13:18:28 +01:00
Stefan Brechtken
b3649b9fb2 TableFind: Api access, reskew and y inversion of the resulting TBOXes 2020-02-19 12:36:22 +01:00