1174 Commits

Author SHA1 Message Date
Stefan Weil
91e2b253c0 Format modified code with clang-format
Format the files which were changed in
commit 297d7d86ce.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 21:10:29 +01:00
Stefan Weil
06acbaf99c IntegerMatcher: Fix division by zero
Credit to OSS-Fuzz which reported this issue:

    intmatcher.cpp:1231:62: runtime error: division by zero
	    #0 0x6119d5 in IntegerMatcher::ApplyCNCorrection(float, int, int, int) tesseract/src/classify/intmatcher.cpp:1231:62
	    #1 0x5fe9c4 in tesseract::Classify::ComputeCorrectedRating(bool, int, double, double, int, int, int, int, int, unsigned char const*) tesseract/src/classify/adaptmatch.cpp:1213:29
	    #2 0x5fdc22 in tesseract::Classify::ExpandShapesAndApplyCorrections(ADAPT_CLASS_STRUCT**, bool, int, int, int, float, int, int, unsigned char const*, tesseract::UnicharRating*, ADAPT_RESULTS*) tesseract/src/classify/adaptmatch.cpp:1184:13
	    #3 0x5fe421 in tesseract::Classify::MasterMatcher(INT_TEMPLATES_STRUCT*, short, INT_FEATURE_STRUCT const*, unsigned char const*, ADAPT_CLASS_STRUCT**, int, int, TBOX const&, GenericVector<CP_RESULT_STRUCT> const&, ADAPT_RESULTS*) tesseract/src/classify/adaptmatch.cpp:1119:5
	    #4 0x6003eb in tesseract::Classify::CharNormTrainingSample(bool, int, tesseract::TrainingSample const&, GenericVector<tesseract::UnicharRating>*) tesseract/src/classify/adaptmatch.cpp:1374:5

See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13712.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 19:39:31 +01:00
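To illustrate the kind of guard that avoids the division by zero reported for commit 06acbaf99c, here is a minimal sketch. The helper name `corrected_rating`, its parameters, and the fallback value are assumptions made for illustration; this is not the actual `IntegerMatcher::ApplyCNCorrection` code.

```cpp
// Hypothetical helper, not the actual Tesseract implementation.
// Divides a weighted sum by (blob_length + matcher_multiplier),
// returning a fixed fallback when that divisor would be zero.
float corrected_rating(float weighted_sum, int blob_length,
                       int matcher_multiplier) {
  const int divisor = blob_length + matcher_multiplier;
  if (divisor == 0) {
    return 1.0f;  // assumed fallback value for the degenerate case
  }
  return weighted_sum / static_cast<float>(divisor);
}
```

Checking the divisor once, before the division, keeps the normal path unchanged and gives fuzzed inputs a defined result.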
Stefan Weil
58423d2f6c
Merge pull request #2328 from bertsky/lstm-with-user-patterns2
Add user words / patterns again
2019-03-24 19:38:40 +01:00
zdenop
0d36d9a9d7
Merge pull request #2341 from Shreeshrii/fix
Fix
2019-03-24 18:21:09 +01:00
Stefan Weil
da6305b632 Fix compiler warnings caused by ASSERT_HOST
The modified definition avoids warnings caused by redundant semicolons.
Now a semicolon is required when using the macro, so a few code locations
had to be updated.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 17:47:04 +01:00
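The semicolon requirement described for commit da6305b632 matches the common `do { ... } while (false)` macro idiom. The sketch below uses a hypothetical macro `CHECK_HOST` and a hypothetical function `safe_index`; it is not the actual `ASSERT_HOST` definition.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical macro for illustration only. Wrapping the body in
// do { ... } while (false) makes the expansion a single statement that
// must be terminated by a semicolon at the call site.
#define CHECK_HOST(cond)                                    \
  do {                                                      \
    if (!(cond)) {                                          \
      std::fprintf(stderr, "Assert failed: %s\n", #cond);   \
      std::abort();                                         \
    }                                                       \
  } while (false)

// Maps a possibly negative index into the range [0, size).
int safe_index(int index, int size) {
  CHECK_HOST(size > 0);
  if (index < 0)
    CHECK_HOST(index >= -size);  // one statement, so the else still binds here
  else
    CHECK_HOST(index < size);
  return index < 0 ? index + size : index;
}
```

Because the macro no longer ends with its own semicolon, a call such as `CHECK_HOST(x);` does not leave a redundant empty statement behind.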
Stefan Weil
44a6d9f4d4 intmatcher: Catch more out of bounds reads
Credit to OSS-Fuzz which reported this issue:

    intmatcher.cpp:1121:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
	    #0 0x61034b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*, short) tesseract/src/classify/intmatcher.cpp:1121:17
	    #1 0x60f560 in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:514:11
	    #2 0x5f3a25 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
	    #3 0x5f2ccd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
	    #4 0x5f16ee in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7

This catches the out of bounds data reads in release builds.
Also add assertions for debug builds.

See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13818.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 17:27:43 +01:00
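As a rough sketch of catching an out-of-bounds read like the one reported for commit 44a6d9f4d4, the loop limit can be clamped to the array size. The names `kMaxProtoIndex` and `sum_proto_evidence` are assumptions; only the array size 24 comes from the sanitizer message, and the real code in `intmatcher.cpp` may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed table size, taken from the 'uint8_t [24]' type in the report.
constexpr int kMaxProtoIndex = 24;

// Sums the first proto_length entries, but never reads past the array
// even when proto_length is larger than the allowed index.
int sum_proto_evidence(const std::uint8_t (&evidence)[kMaxProtoIndex],
                       std::uint8_t proto_length) {
  const int limit = std::min<int>(proto_length, kMaxProtoIndex);
  int sum = 0;
  for (int i = 0; i < limit; ++i) {
    sum += evidence[i];
  }
  return sum;
}
```

Clamping protects release builds; it does not address why an oversized length was stored in the first place.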
Stefan Weil
5fd7228414 intmatcher: Catch out of bounds reads
Credit to OSS-Fuzz which reported this issue:

    intmatcher.cpp:1163:17: runtime error: index 24 out of bounds for type 'uint8_t [24]'
	    #0 0x610d3b in ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS_STRUCT*, unsigned int*) tesseract/src/classify/intmatcher.cpp:1163:17
	    #1 0x60ff4e in IntegerMatcher::Match(INT_CLASS_STRUCT*, unsigned int*, unsigned int*, short, INT_FEATURE_STRUCT const*, tesseract::UnicharRating*, int, int, bool) tesseract/src/classify/intmatcher.cpp:563:11
	    #2 0x5f4355 in tesseract::Classify::AdaptToChar(TBLOB*, int, int, float, ADAPT_TEMPLATES_STRUCT*) tesseract/src/classify/adaptmatch.cpp:894:9
	    #3 0x5f35fd in tesseract::Classify::LearnPieces(char const*, int, int, float, tesseract::CharSegmentationType, char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:430:5
	    #4 0x5f201e in tesseract::Classify::LearnWord(char const*, WERD_RES*) tesseract/src/classify/adaptmatch.cpp:293:7

This catches the out of bounds data reads, but does not fix the primary
reason: ProtoLengths currently gets values which are larger than the
allowed index.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 15:44:33 +01:00
Stefan Weil
509ee95023 IntegerMatcher: Fix data type of loop counters
ClassTemplate->ProtoLengths[n] is of type uint8_t, so use that for
the related loop counters, too.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 15:35:06 +01:00
Stefan Weil
f4f34a87db WERD_RES: Fix uninitialized member variable
Credit to OSS-Fuzz which reported this issue:

    pageres.cpp:1143:7: runtime error: load of value 249, which is not a valid value for type 'bool'
	    #0 0x6ba560 in WERD_RES::Clear() tesseract/src/ccstruct/pageres.cpp:1143:7
	    #1 0x6b9fd1 in WERD_RES::operator=(WERD_RES const&) tesseract/src/ccstruct/pageres.cpp:193:3
	    #2 0x49a9ad in WERD_RES::WERD_RES(WERD_RES const&) tesseract/src/ccstruct/pageres.h:356:11

See https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13707.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 14:59:08 +01:00
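The "not a valid value for type 'bool'" error in commit f4f34a87db is typical of a member that is read before it was ever written. One common remedy, sketched here with a hypothetical struct `WordFlags` (not the real `WERD_RES` layout), is to give every member a default initializer.

```cpp
// Hypothetical stand-in for a result record; member names are invented.
// Default member initializers guarantee defined values even when a
// constructor forgets to set a field.
struct WordFlags {
  bool done = false;
  bool odd_size = false;
  int reject_count = 0;
};
```

With this in place, copying a freshly constructed object never loads an indeterminate `bool`.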
Stefan Weil
afc099b9f4 intmatcher: Split data_table
The old code was a hack to improve the performance.

The new code is clearer and results in the same binary when compiling
with gcc 8.3.0, so it looks like the old hack is no longer needed with
modern compilers.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 08:15:40 +01:00
Shreeshrii
8749f3553e
LINEDATA=false 2019-03-23 19:16:49 +05:30
Shree
bcb7cf9846 sort arguments, use true/false instead of 1/0 2019-03-23 12:28:53 +00:00
Shree
c2db272134 Modify distort_image for Boolean 2019-03-22 17:02:46 +00:00
Shree
259d5af6b1 Add PSM values to the definition 2019-03-22 15:29:02 +00:00
Shree
8eafec0d17 Fix comments with current values of PSM codes 2019-03-22 14:10:49 +00:00
Stefan Weil
e1e56d9d66 Remove local function declarations from intmatcher.h
This requires moving the local function HeapSort to the beginning.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-22 11:39:39 +01:00
Stefan Weil
2ba194ca8d Remove four unused parameters
This fixes some compiler warnings:

    src/classify/intmatcher.cpp:711:63: warning: unused parameter ‘ConfigMask’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:1007:16: warning: unused parameter ‘ProtoMask’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:1095:61: warning: unused parameter ‘NumFeatures’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:1136:59: warning: unused parameter ‘used_features’ [-Wunused-parameter]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-22 11:30:24 +01:00
Stefan Weil
dd79d56e9f Remove unused parameter BlobLength
This fixes two compiler warnings:

    src/classify/intmatcher.cpp:553:14: warning: unused parameter ‘BlobLength’ [-Wunused-parameter]
    src/classify/intmatcher.cpp:622:14: warning: unused parameter ‘BlobLength’ [-Wunused-parameter]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-22 11:17:19 +01:00
Shree
9b915d5efb add --distort_image 2019-03-22 05:39:38 +00:00
Shree
f7ffde99d5 add --distort_image 2019-03-22 05:34:00 +00:00
zdenop
ac7ea4322a
Merge pull request #2335 from Shreeshrii/master
Changes to tesstrain.py - max_workers=8, distort_image=false
2019-03-17 15:27:34 +01:00
zdenop
26877ba703 check min. python version; os.uname is not available on windows 2019-03-17 15:25:48 +01:00
Shreeshrii
f8e8521606
Update tesstrain_utils.py 2019-03-17 15:32:35 +05:30
Shree
6fa8e1bb15 Set max_workers=8 2019-03-17 09:58:11 +00:00
Shree
e21499e81e Set default value for distort_image 2019-03-17 09:54:16 +00:00
Stefan Weil
ee2f9bf7bf Remove old comments in file headers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 10:55:00 +01:00
Shree
d47b0d588a Use LATIN_FONTS for kmr 2019-03-15 15:47:56 +00:00
Shree
3eee1d217a Add kmr and kur_ara, remove kur from training scripts 2019-03-15 15:37:49 +00:00
Robert Schubert
297d7d86ce trying to add user words/patterns again:
- pass in ParamsVectors from Tesseract
  (carrying values from langdata/config/api)
  into LSTMRecognizer::Load and LoadDictionary
- after LSTMRecognizer's Dict is initialised
  (with default values), reset the variables
  user_{words,patterns}_{suffix,file} from the
  corresponding entries in the passed vector
2019-03-15 16:06:19 +01:00
Shree
b2ebf0195f Add kmr and kur_ara, remove kur from training scripts 2019-03-15 14:39:39 +00:00
Shree
37befdf6c4 Add option for --distort_image 2019-03-15 13:32:36 +00:00
zdenop
0a36b38169
Merge pull request #2317 from eighttails/master
Added missing linker flags for MinGW.
2019-03-15 08:01:21 +01:00
Robert Schubert
14346e56b0 tesstrain: catch+handle SIGINT (to stop waiting on subjobs) 2019-03-15 00:03:16 +01:00
Robert Schubert
6cbad17e30 tesstrain: check all subjobs' retval 2019-03-14 14:38:51 +01:00
Robert Schubert
5316bcbb94 tesstrain: check failure of subjobs 2019-03-14 11:42:01 +01:00
Stefan Weil
4c2bbebecc Fix compiler warning (-Wunused-value)
Warning from clang++:

    ..\src\ccmain\ltrresultiterator.cpp(454,8):  warning: expression result unused [-Wunused-value]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-13 20:56:03 +01:00
Stefan Weil
ed84ba0a44 Fix wrong comparison
symbol_steps is a vector, so testing for a nullptr was wrong.

clang++ reports:

    ..\src\ccmain\ltrresultiterator.cpp(440,19):  warning: comparison of address of 'this->word_res_->symbol_steps' equal to a null pointer is always false [-Wtautological-pointer-compare]
      if (&word_res_->symbol_steps == nullptr || !LSTM_mode_) return nullptr;
           ~~~~~~~~~~~^~~~~~~~~~~~    ~~~~~~~

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-13 20:38:38 +01:00
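The warning quoted for commit ed84ba0a44 arises because the address of a member object can never be null. A minimal sketch of a check that does carry information follows; `WordResult` and `has_symbol_steps` are hypothetical names, and testing for emptiness is an assumption about the intent, not necessarily what the commit does.

```cpp
#include <vector>

// Hypothetical stand-in for the word result type.
struct WordResult {
  std::vector<int> symbol_steps;
};

// &word->symbol_steps is never null for a valid word, so test the
// owning pointer and the vector contents instead.
bool has_symbol_steps(const WordResult* word) {
  return word != nullptr && !word->symbol_steps.empty();
}
```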
Tadahito Yao
bbbd262a8d Added missing linker flags for MinGW. 2019-03-13 22:10:36 +09:00
jm server2
1206362d30 accumulated_timesteps is a vector, not a pointer, so the check is always true; when ChoiceIterator is used without lstm_choice_mode, tesseract crashes (or similar) because a nonexistent item is referenced 2019-03-13 12:55:14 +01:00
Stefan Weil
3baf0d8076 Fix boolean assignments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 15:34:24 +01:00
Stefan Weil
8ad0489f0f Remove svpaint.cpp from libtesseract
svpaint is a standalone application (it includes a main function)
and should not be part of the Tesseract library.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 12:22:53 +01:00
zdenop
7546a01020
Merge pull request #2310 from noahmetzger/LSTMChoiceRIL
Lstm choice ril
2019-03-12 10:46:11 +01:00
Stefan Weil
35a999f91a Fix assertion caused by wrong unicharset
Credit to OSS-Fuzz: it found another case which triggered this assertion:

    contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502

This is the OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 09:31:21 +01:00
Stefan Weil
56a39bda77 Fix float division by zero
That runtime error is normally not visible because it does not abort
the program, but it is detected when the code is compiled with sanitizers.

It can be triggered with this OSS-Fuzz testcase:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13662

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 09:28:16 +01:00
Noah Metzger
5b3e2fe812 Integrated accumulated Symbol Choice in the Choice Iterator and made the api lstm_choice_mode independent
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-03-12 09:15:10 +01:00
Stefan Weil
4c0b98bd12 Replace undefined shift operations by multiplications
Left shift operations are undefined for negative numbers, but at least on
Intel they return the same value as a multiplication by 2 ^ shift value.

This fixes runtime errors reported by sanitizers and OSS-Fuzz:

    intmatcher.cpp:821:59: runtime error: left shift of negative value -14
    intmatcher.cpp:823:75: runtime error: left shift of negative value -512
    intmatcher.cpp:820:50: runtime error: left shift of negative value -80

See issue #2297 and
https://oss-fuzz.com/testcase-detail/4845195990925312 for details.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 06:56:54 +01:00
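The replacement described in commit 4c0b98bd12 can be sketched as follows. The function name `scale_by_power_of_two` is hypothetical, and the sketch assumes the product does not overflow `int32_t` and that `shift` is smaller than 31.

```cpp
#include <cstdint>

// Scales a possibly negative value by 2^shift. Only the non-negative
// constant 1 is shifted, so no negative value is ever left-shifted.
constexpr std::int32_t scale_by_power_of_two(std::int32_t value,
                                             unsigned shift) {
  return value * (std::int32_t{1} << shift);
}
```

For example, `scale_by_power_of_two(-14, 3)` yields -112, which is the result the old shift produced on common hardware, but now with defined behavior.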
Stefan Weil
896698a4f5 Fix runtime error (left shift of negative value)
Runtime error:

    src/training/util.h:37:28: runtime error: left shift of negative value -17

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 06:56:54 +01:00
Stefan Weil
5202208a8c Remove globals.h
It only included other files which are already included where needed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-11 19:01:23 +01:00
Noah Metzger
bc2b919805 Integrated Timesteps per symbol into ChoiceIterator
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-03-11 10:50:56 +01:00
Noah Metzger
754e38d2b4 Added the option to get the timesteps separated by the suggested segmentation
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-03-11 10:50:56 +01:00
zdenop
e817607280 archive_version_details is available from libArchive version 3.2.0 2019-03-10 22:57:48 +01:00
zdenop
5cfe4cc1f0
Merge pull request #2286 from Shreeshrii/lstmbox
Rename function to TessBaseAPIGetTsvText to be consistent with the Create method
2019-03-10 21:41:52 +01:00
zdenop
02a1ffe87a Report libArchive support 2019-03-10 20:08:45 +01:00
Stefan Weil
b3aff7d633 Fix Index-out-of-bounds in IntegerMatcher::UpdateTablesForFeature
This fixes issue #2299, an issue which was already reported by
static code analyzers and now by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13597.

The Tesseract code assigns an address which is out-of-bounds to a pointer
variable, but increments that variable later. So this is a false positive.

Change the code nevertheless to satisfy OSS-Fuzz.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-10 18:26:40 +01:00
Stefan Weil
91d0a71d51 Fix assertion caused by wrong unicharset (issue #2301)
Credit to OSS-Fuzz:
This fixes an issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13592.

OSS-Fuzz triggered this assertion:

    contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 502

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-10 16:42:54 +01:00
Stefan Weil
71d4990c6d Fix Heap-buffer-overflow in GenericVector<int>::size (issue #2298)
Credit to OSS-Fuzz:
This fixes a security issue which was reported by OSS-Fuzz, see details at
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13590.

Also add some assertions to catch similar bugs.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-10 16:12:30 +01:00
Robert Schubert
3912cb1c33 LSTM char_whitelist/blacklist (6ac2ff0): more robust
- unicharset can be null too
2019-03-09 10:40:40 +01:00
Robert Schubert
b45999088c LSTM char_whitelist/blacklist (6ac2ff0): multi-code chars
- move decision from ComputeTopN to ContinueContext, where
  it belongs: block context continuations which emit final
  codes translating to disabled unichar_ids.
  (The normal logic for fallback from top2 > top-n > rest
   will apply.)
- pass UNICHARSET refs appropriately
2019-03-08 12:30:16 +01:00
Robert Schubert
8012d5e653 LSTM char_whitelist/blacklist (6ac2ff0): also sublangs 2019-03-07 18:32:50 +01:00
Robert Schubert
6ac2ff083e trying to add tessedit_char_whitelist etc. again:
- ignore matrix outputs in ComputeTopN if they
  belong to a disabled unichar_id
- pass UNICHARSET refs to check that
- in SetBlackAndWhitelist, also update the unicharset
  of the lstm_recognizer_ instance, if any
2019-03-07 01:37:23 +01:00
zdenop
f80085c0bf
Merge pull request #2289 from Armyke/master
Added an additional optional --tmp_dir parameter to specify the tempo…
2019-03-06 15:03:14 +01:00
Stefan Weil
1c7e00611b Add initial support for traineddata files in standard archive formats
This requires libarchive-dev.

Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:

    $ unzip -l /usr/local/share/tessdata/zip.traineddata
    Archive:  /usr/local/share/tessdata/zip.traineddata
      Length      Date    Time    Name
    ---------  ---------- -----   ----
           55  2019-03-05 15:27   bagit.txt
            0  2019-03-05 15:25   data/
         1557  2019-03-05 15:28   manifest-sha256.txt
      1082890  2019-03-05 15:25   data/eng.word-dawg
      1487588  2019-03-05 15:25   data/eng.lstm
         7477  2019-03-05 15:25   data/eng.unicharset
        63346  2019-03-05 15:25   data/eng.shapetable
       976552  2019-03-05 15:25   data/eng.inttemp
        13408  2019-03-05 15:25   data/eng.normproto
         4322  2019-03-05 15:25   data/eng.punc-dawg
         4738  2019-03-05 15:25   data/eng.lstm-number-dawg
         1410  2019-03-05 15:25   data/eng.freq-dawg
          844  2019-03-05 15:25   data/eng.pffmtable
         6360  2019-03-05 15:25   data/eng.lstm-unicharset
         1012  2019-03-05 15:25   data/eng.lstm-recoder
         1047  2019-03-05 15:25   data/eng.unicharambigs
         4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
     16109842  2019-03-05 15:25   data/eng.bigram-dawg
           80  2019-03-05 15:25   data/eng.version
         6426  2019-03-05 15:25   data/eng.number-dawg
      3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
    ---------                     -------
     23468070                     21 files

`combine_tessdata -d` and `combine_tessdata -u` also work.

The traineddata files in the new format can be generated with
standard tools like zip or tar.

More work is needed for other training tools and big endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-05 17:18:48 +01:00
Armyke
56b04d4ea7 Added the same --tmp_dir flag to tesstrain_utils.sh 2019-03-04 14:05:25 +00:00
Armyke
25fa392887 Added an optional --tmp_dir parameter to specify the directory in which tesstrain.py creates its temporary training files. The main reason is slow R/W on HDDs: anyone who wants to speed up this process can use a directory on an SSD as tmp_dir 2019-03-04 13:26:53 +00:00
Stefan Weil
7fbde96a04 Format new code with clang-format
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 20:26:07 +01:00
Stefan Weil
38fac625cd Format new code with clang-format
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 20:01:48 +01:00
Shree
a0202bac70 Rename function to TessBaseAPIGetTsvText to be consistent with the Create method 2019-03-02 16:29:53 +00:00
zdenop
5de2a21b3f
Merge pull request #2283 from Shreeshrii/lstmbox
Add missing renderers to C-API
2019-03-02 15:15:34 +01:00
Stefan Weil
9c90894ff0 PAGE_RES_IT: Optimize compare operators by using inline code
Avoiding a function call will make both the == and != operators faster.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:57:16 +01:00
Stefan Weil
295996ed05 commandlineflags: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:21:04 +01:00
Stefan Weil
eb14726aac ICOORD: Fix old type casts
This fixes compiler warnings and avoids unnecessary conversions
between float and double.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:04:54 +01:00
Stefan Weil
fb0f1bcf66 BoxChar: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:04:54 +01:00
Stefan Weil
0e1a1fc3cf Validator: Fix compiler warnings (signed/unsigned)
This also fixes a regression in validate_grapheme_test introduced
by commit 32e9d7c8f5.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 13:05:03 +01:00
Shree
c7e8131efc Add TSV option to C-API 2019-03-02 09:50:54 +00:00
Shree
22c099348b rename LSTMBOX to LSTMBox 2019-03-02 09:11:47 +00:00
zdenop
2ba8e0061a
Merge branch 'master' into mya 2019-03-01 18:37:24 +01:00
Shree
c33f03e33e Add lstmbox and wordstrbox to capi.h 2019-03-01 17:16:59 +00:00
Shree
76ec21df3d Add lstmbox and wordstrbox to C-API 2019-03-01 16:40:41 +00:00
zdenop
646b043d2c
use space instead of tab 2019-03-01 14:36:09 +01:00
Shree
5ee1deaea2 correct handling of 0BF0-0BFA Tamil numbers and symbols 2019-03-01 13:21:49 +00:00
zdenop
d7ddc4c5b7
Merge pull request #2270 from Shreeshrii/U_ARABIC_NUMBER
Treat U_ARABIC_NUMBER as LTR
2019-02-28 09:27:54 +01:00
zdenop
12c1225a5f
Merge pull request #2271 from stweil/refactor
Refactor class Network
2019-02-27 07:43:13 +01:00
Michal Čihař
14c4494f42 Allow UTF-8 variant of C locale
It behaves the same in scanf, but it allows proper handling of Unicode
characters.
2019-02-26 21:37:33 +01:00
Stefan Weil
98dd3b6351 Refactor class Network
That class is an abstract class with several pure virtual functions.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-26 16:55:31 +01:00
Shree
25b02bf1f2 Treat U_ARABIC_NUMBER as LTR 2019-02-26 09:51:21 +00:00
Shreeshrii
2f71fe280c
Use alternative way to comment a block of code (using the c preprocessor).
https://github.com/tesseract-ocr/tesseract/pull/2268#pullrequestreview-207605382
Thanks @amitdo
2019-02-26 15:05:51 +05:30
Shree
449f1cd4ba Remove test for Word started with a combiner 2019-02-25 18:47:42 +00:00
zdenop
25c43b1e7c
Merge branch 'master' into distort 2019-02-23 18:23:14 +01:00
Stefan Weil
b3e355a682 Remove whitespace at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-23 17:49:56 +01:00
Shreeshrii
34e4d6b1d7
Revert to 0 (50% of images inverted). 2019-02-23 17:59:00 +05:30
Shreeshrii
287d5341bf
TODO 2019-02-23 17:56:02 +05:30
Shreeshrii
3e3e1ed55d
Remove commented Code 2019-02-23 17:54:00 +05:30
zdenop
c02f5e99fc
Merge pull request #2259 from Shreeshrii/distort
implement PrepareDistortedPix as part of DegradeImage
2019-02-22 21:06:29 +01:00
Shree
2aded47a3c Implement distort_image in text2image - default false 2019-02-22 12:27:27 +00:00
Shree
49ed3a72d4 implement PrepareDistortedPix as part of DegradeImage 2019-02-21 14:48:29 +00:00
zdenop
e250f3422d
Merge pull request #2258 from stweil/doc
Fix doxygen comments
2019-02-21 07:41:22 +01:00
Stefan Weil
2cbe723d03 Fix doxygen comments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-20 21:11:38 +01:00
Stefan Weil
ef4d5b2e69 Optimize calculation of dot product for double vectors with AVX
This improves the performance with best models and should also
make training faster.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-20 17:45:38 +01:00
Stefan Weil
b3bd23edb7 Remove whitespace at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-19 13:53:31 +01:00
Stefan Weil
b95598a0b1
Merge pull request #2070 from pndaza/master
add missed letters ( ၌ ၍ ၎ ၏ )  and symbols ( ၊ ။ ) - 0x104a to 0x104f -
2019-02-19 12:22:53 +01:00
Stefan Weil
38861be639 Use __builtin_trap instead of null pointer dereference to abort
This fixes a warning from Apple's clang compiler:

    [ 34%] Building CXX object CMakeFiles/libtesseract.dir/src/ccutil/errcode.cpp.o
    /Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: warning: indirection of non-volatile null pointer will be deleted, not trap [-Wnull-dereference]
          *reinterpret_cast<int*>(0) = 0;
          ^~~~~~~~~~~~~~~~~~~~~~~~~~
    /Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: note: consider using __builtin_trap() or qualifying pointer with 'volatile'

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-18 10:49:51 +01:00
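The compiler note quoted for commit 38861be639 suggests `__builtin_trap()` as a reliable way to abort. A minimal sketch follows; the function names `trap_now` and `require_positive` are hypothetical, and the `std::abort` fallback for other compilers is an assumption rather than something taken from the commit.

```cpp
#include <cstdlib>

// Aborts the program without relying on a null pointer write, which an
// optimizer is allowed to delete.
[[noreturn]] inline void trap_now() {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_trap();  // GCC/Clang builtin that traps unconditionally
#else
  std::abort();      // assumed portable fallback
#endif
}

// Example caller: returns the value unchanged, or traps if it is invalid.
inline int require_positive(int value) {
  if (value <= 0) {
    trap_now();
  }
  return value;
}
```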
Stefan Weil
ddea230b1b Don't compute function tables at compile time with clang
The current code fails to compile with clang compilers on Linux and macOS.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-17 08:38:42 +01:00
zdenop
15f2a4b2c1
Merge pull request #2231 from Shreeshrii/wordstr
Add renderer to create WordStr box files from images
2019-02-16 13:48:06 +01:00
Stefan Weil
862322c18c Fix check for images which are too small to scale
Images with width == min_width are not too small.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-15 13:53:11 +01:00
Shree
a044f64375 fix Myanmar validation rules as per Unicode charts 2019-02-15 04:40:55 +00:00
Stefan Weil
c0523ee5a2 Fix compiler warning
g++ warning:

    src/lstm/functions.h:152:35: warning:
        unused parameter ‘x’ [-Wunused-parameter]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-14 10:29:39 +01:00
Stefan Weil
3556152412 Compute function tables at compile time
This requires C++14. Older compilers still use the old code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-14 10:29:39 +01:00
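To show what computing a lookup table at compile time looks like under C++14 (commit 3556152412), here is a small sketch. All names are hypothetical, and a simple cube function stands in for tanh or logistic because `std::tanh` is not guaranteed to be usable in constant expressions.

```cpp
#include <cstddef>

constexpr std::size_t kTableSize = 8;  // assumed size, for illustration

// A plain array wrapper, since writing to std::array elements is not
// constexpr in C++14.
struct Table {
  double values[kTableSize];
};

// Placeholder for the real function being tabulated.
constexpr double cube(double x) { return x * x * x; }

// C++14 relaxed constexpr allows the loop and element assignment.
constexpr Table make_table() {
  Table t{};
  for (std::size_t i = 0; i < kTableSize; ++i) {
    t.values[i] = cube(static_cast<double>(i) / kTableSize);
  }
  return t;
}

constexpr Table kCubeTable = make_table();
static_assert(kCubeTable.values[4] == 0.125, "table is built at compile time");
```

The `static_assert` only compiles if the table really is evaluated by the compiler, which makes it a convenient check when porting such code between toolchains.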
Stefan Weil
f491eb6188 Simplify tanh and logistic functions and precompute function tables
Both functions are called very often, so computing the table values
at program start should be faster than computing them on demand.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-12 12:04:08 +01:00
Shree Devi Kumar
f3362a4b5b Add renderer to create WordStr box files from images 2019-02-10 19:59:17 +00:00
zdenop
2ae65b2493
Merge pull request #2216 from Shreeshrii/lstmbox
Lstmbox
2019-02-10 13:53:41 +01:00
Shree Devi Kumar
311053681c put common code in AddBoxToLSTM 2019-02-10 09:16:45 +00:00
zdenop
e51f1885e6
Merge pull request #2229 from stweil/warn
Fix some compiler warnings
2019-02-10 08:20:23 +01:00
Shree Devi Kumar
b51c1bf05a change to const char* as suggested by @stweil 2019-02-10 05:13:18 +00:00
Stefan Weil
0c9f7db536 Fix compiler warning (-Wimplicit-fallthrough)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:53:44 +01:00
Stefan Weil
d91c316ab1 FontInfo: Make sure that deleted member variables can no longer be used
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:20 +01:00
Stefan Weil
877e62db55 Fix compiler warning (-Wmaybe-uninitialized)
gcc warning:

    src/lstm/recodebeam.cpp:270:41: warning: ‘current_char’ may be used uninitialized in this function [-Wmaybe-uninitialized]

It's a false positive, but setting the variable to 0 satisfies the compiler.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:20 +01:00
Stefan Weil
33f6dc2a67 Fix compiler warnings (-Wformat-truncation=)
gcc warnings:

    src/viewer/scrollview.cpp:404:31: warning: ‘%s’ directive output may be
        truncated writing up to 4095 bytes into a region of size between 4084 and 4093 [-Wformat-truncation=]
    src/viewer/scrollview.cpp:572:31: warning: ‘%s’ directive output may be
        truncated writing up to 4095 bytes into a region of size between 4084 and 4093 [-Wformat-truncation=]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:20 +01:00
Stefan Weil
2a355ea103 Fix compiler warnings (-Wimplicit-fallthrough)
gcc warnings:

    src/ccmain/docqual.cpp:734:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
    src/ccmain/docqual.cpp:764:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
    src/ccmain/docqual.cpp:782:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
    [...]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:20 +01:00
Stefan Weil
aa2dcca295 Fix compiler warnings (-Wstringop-truncation)
gcc warnings:

    src/api/tesseractmain.cpp:252:14: warning:
        ‘char* strncpy(char*, const char*, size_t)’ specified bound 255
        equals destination size [-Wstringop-truncation]
    src/ccutil/unicharset.h:66:12: warning:
        ‘char* strncpy(char*, const char*, size_t)’ output may be truncated copying 30 bytes from a string of length 30 [-Wstringop-truncation]
    src/ccutil/unicharset.cpp:806:12: warning:
        ‘char* strncpy(char*, const char*, size_t)’ specified bound 64 equals destination size [-Wstringop-truncation]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:09 +01:00
Stefan Weil
d42413dd17 OpenCL: Remove PERF_COUNT framework
It was rarely used, but added a lot of code and an unconditional
dependency on openclwrapper.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 10:58:15 +01:00
Shree Devi Kumar
0f42fd8c69 change to use bbox coordinates for TEXTLINE for all characters
(cherry picked from commit 049db108b2d6cd3a7f52e480212320613117d50b)
2019-02-05 14:03:29 +00:00
Shree Devi Kumar
9c89cd51cf Add a new renderer to create box files from images for LSTM training
(cherry picked from commit 921da6be2bdbda2ddd64514f9b6bec40a336246a)

fix typo

(cherry picked from commit 7bd1a0c80393fce2f34e2845cb26760bcf3791cd)

Add lstmboxrenderer to CMakeLists

(cherry picked from commit cfef3a889aef830725921b5c0218d5e9c633b03e)

fix formatting

(cherry picked from commit 7ba2b01ede7940ed609a073364948ef8c838cd10)
2019-02-05 14:03:29 +00:00
Shreeshrii
c28a68115e
Merge branch 'master' into boxtiff 2019-02-02 23:42:39 +05:30
Shree Devi Kumar
d9590f8adf allow user specified box/tiff pairs with tesstrain.sh 2019-02-02 11:35:45 +00:00
Shree Devi Kumar
323361b902 allow user specified box/tiff pairs with tesstrain.sh 2019-02-02 11:33:32 +00:00
Shree Devi Kumar
ad223296af use --xsize instead of --x_size
(cherry picked from commit 94b8988b8cca3812137933db00750bd6e2e84e32)
2019-02-02 11:08:34 +00:00
Mikhail Akopov
7be04342cf Fix typo 2019-02-01 09:58:44 +01:00
Stefan Weil
b49806766e Fix AVX2 support for Windows builds with MSC
It was never detected, so the existing code for AVX2
was compiled but never used.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-30 11:40:17 +01:00
Shree Devi Kumar
4d9bc11fd3 add --xsize as parameter for tesstrain 2019-01-27 07:00:25 +00:00
zdenop
12c1abcb6b
Merge pull request #2189 from stweil/fix
Fix memory leak for PNG images
2019-01-24 07:59:55 +01:00
zdenop
059c50be8c
Merge pull request #2184 from stweil/tests
Fix and enable stringrenderer_test
2019-01-24 07:59:07 +01:00
Stefan Weil
9e6e3a0232 Fix memory leak for PNG images
Commit 5fe1390748 used an implementation
which created a new Pix object. That object was never destroyed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 20:05:10 +01:00
Diego de la Hera
1a398a5b5d removed reference to unbound variable 2019-01-23 15:04:16 -03:00
Stefan Weil
ecf73f5bc7 training: Don't terminate after processing 8 fonts or 8 images
tesstrain_utils.sh sets the shell flag -e, so it exits immediately
if a command exits with a non-zero status.

The following command returns a non-zero status as soon as counter is a
multiple of par_factor (par_factor=8, that means as soon as 8 fonts or
images are processed):

    let rem=counter%par_factor

The new code fixes this undesired exit.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 17:26:40 +01:00
Stefan Weil
32e9d7c8f5 training: Fix some compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 13:55:13 +01:00
Stefan Weil
e4b862d588 pango_font_info: Fix runtime error messages from Pango
pango_coverage_get and pango_coverage_unref should not be called
with coverage == nullptr.

pango_font_get_coverage should not be called with font == nullptr.

Otherwise Pango prints runtime error messages:

    (process:12657): Pango-CRITICAL **: pango_coverage_get: assertion 'coverage != NULL' failed
    (process:12657): Pango-CRITICAL **: pango_coverage_unref: assertion 'coverage != NULL' failed
    (process:12657): Pango-CRITICAL **: pango_font_get_coverage: assertion 'font != NULL' failed
    (process:12657): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

Typically those errors occur if a required font is not installed,
so this can be a quite common error.

Also fix a potential resource leak in PangoFontInfo::CoversUTF8Text.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 13:55:13 +01:00
Shree Devi Kumar
77d0b6ce8e fix WORDLIST filename 2019-01-22 15:49:55 +01:00
Stefan Weil
564482db30 Fix selection of IntSimdMatrix method
Commit d36231e3e4 did not distinguish
between AVX and AVX2, so AVX2 code was enabled for IntSimdMatrix
even when only AVX was supported.

This resulted in an illegal instruction.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-20 22:13:04 +01:00
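Note on 564482db30: the fix comes down to treating AVX and AVX2 as separate capabilities. The following is only a sketch of that idea. The names `CpuFeatures`, `MatrixMethod` and `SelectMatrixMethod` are hypothetical and are not taken from the Tesseract sources, and the fallback to SSE for AVX-only CPUs is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical capability flags; real code would fill them from CPUID.
struct CpuFeatures {
  bool sse4_1;
  bool avx;
  bool avx2;
};

enum class MatrixMethod { kGeneric, kSSE, kAVX2 };

// Choose the AVX2 code path only when AVX2 itself is reported.
// AVX alone is not enough: AVX2 instructions raise an illegal
// instruction fault on a CPU that supports only AVX.
inline MatrixMethod SelectMatrixMethod(const CpuFeatures& cpu) {
  assert(!cpu.avx2 || cpu.avx);  // AVX2 implies AVX on real hardware
  if (cpu.avx2) return MatrixMethod::kAVX2;
  if (cpu.sse4_1) return MatrixMethod::kSSE;
  return MatrixMethod::kGeneric;
}
```

With flags `{true, true, false}` (SSE4.1 and AVX, no AVX2) the sketch selects the SSE method instead of the AVX2 one.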
Stefan Weil
66e31bfd8c OpenCL: Fix alloc-dealloc mismatch
Bug message from AddressSanitizer:

    ==7153==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs free) on 0x602000072cb0
        #0 0x7ffff70c6a10 in free (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc1a10)
        #1 0x555557188638 in writeProfileToFile ../../../../../src/opencl/openclwrapper.cpp:541

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-19 08:06:26 +01:00
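Note on 66e31bfd8c: AddressSanitizer reports this error when memory from `operator new[]` is released with `free`. A minimal sketch of a matched pair follows. The function `CopyThroughHeapBuffer` is a made-up example and does not reproduce the code in openclwrapper.cpp.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Copies `text` through a heap buffer and returns the copy.
// The buffer comes from new[] and is owned by std::unique_ptr<char[]>,
// whose default deleter calls delete[], so allocation and release match.
// Passing the same pointer to free() would be the reported mismatch.
inline std::string CopyThroughHeapBuffer(const std::string& text) {
  std::unique_ptr<char[]> buffer(new char[text.size() + 1]);
  std::memcpy(buffer.get(), text.c_str(), text.size() + 1);
  assert(buffer[text.size()] == '\0');
  return std::string(buffer.get());
}
```

Letting a smart pointer or `std::vector` own the buffer removes the need to remember which release function belongs to which allocation.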
Stefan Weil
ad19183b92 OpenCL: Fix heap buffer overflow
Bug message from AddressSanitizer:

    ==6158==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7fffe774b7fc at pc 0x555557086b54 bp 0x7fffffffcee0 sp 0x7fffffffced8
    READ of size 1 at 0x7fffe774b7fc thread T0
        #0 0x555557086b53 in tesseract::HistogramRect(Pix*, int, int, int, int, int, int*) ../../../../../src/ccstruct/otsuthr.cpp:163

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-19 07:58:16 +01:00
Stefan Weil
502bb624c2 More optimisations for IntSimdMatrix
* Move IntDotProductSSE. That allows inlining of the code.
* Improve IntDotProductSSE by moving some instructions.
* Remove unused num_input_groups_ from IntSimdMatrix.
* Re-order elements in IntSimdMatrix to avoid padding.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
95606398f5 Clean code for IntSimdMatrix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
7fc7d28dd0 Compile files for AVX, AVX2 or SSE only when needed
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
a9a1035e55 Move IntSimdMatrixNative from IntSimdMatrix to unittest
It is only used for the unit test.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
d36231e3e4 Set best or user selected IntSimdMatrix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
605b4d66c7 Replace dynamically allocated IntSimdMatrix instances by constants
Two header files are no longer needed and could be removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
26be7c5d2e Use constructor with parameters for IntSimdMatrix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
e237a38405 Add const attributes to IntSimdMatrix multiplier
IntSimdMatrix no longer contains variable members.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
7c70147701 Move shaped weights from IntSimdMatrix to WeightMatrix
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
ea4d0d354b Format comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
Stefan Weil
c79d613b65 Replace ASSERT_HOST by assert
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
zdenop
f75b2c1948
Merge pull request #310 from nickjwhite/hocrcharboxes
Character boxes in hOCR output
2019-01-14 19:19:04 +01:00
Stefan Weil
9adf6e442b Revert 59fb3370bb (-ffast-math)
It breaks intsimdmatrix_test.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 17:56:35 +01:00
Nick White
ebbf907c56 Fix typo in hocr character box output 2019-01-13 16:28:31 +00:00
Nick White
4ce797b6f6 Fix hocr character box info to use new hocr renderer correctly 2019-01-13 13:01:14 +00:00
Nick White
c43e4501e3 Merge remote-tracking branch 'origin/master' into hocrcharboxes 2019-01-13 12:41:42 +00:00
zdenop
238cb219d5
Merge pull request #2152 from stweil/clean
Remove opencl_device_selection.h
2019-01-09 15:02:59 +01:00
Stefan Weil
a0e6586e63 Fix documentation for page segmentation mode 2
It never worked, so add a comment that the implementation is missing.
Add also a to-do comment.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-09 13:51:44 +01:00
Stefan Weil
0fae848b58 OpenCL: Add comments to users of openclwrapper.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-09 12:11:00 +01:00
Stefan Weil
e0fc4f2945 Remove opencl_device_selection.h
Always use OpenCL device selection if OpenCL is enabled.

This fixes a regression which was introduced by commit
5c6a57b727 which removed
the definition for USE_DEVICE_SELECTION.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-09 12:09:56 +01:00
Stefan Weil
595bb7df16 OpenCL: Remove unused code
The OpenCL kernel pixSubtract is never used, so remove it.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-05 16:41:20 +01:00
Nick White
b8de06430d Ensure baseapi.h header is used by commontraining.h regardless of autotools usage 2019-01-04 20:20:00 +00:00
Nick White
cd34ee55ec Add necessary intproto.h header to protos.cpp 2019-01-04 20:19:54 +00:00
Stefan Weil
62b635a74e Remove unused functions from cluster.cpp
Add also missing static attributes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-03 13:16:31 +01:00
Stefan Weil
f76d8a14cd Remove unused code from oldlist
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-03 12:27:10 +01:00
Stefan Weil
7719f80155 Add missing std namespace in tensorflow code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-03 11:15:36 +01:00
Stefan Weil
8a6fa452dc Fix build for architectures without CPUID
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-03 09:32:36 +01:00
Stefan Weil
91af010200 Fix compiler warning
gcc warning:

    src/training/text2image.cpp:694:35: warning:
        ISO C++ forbids converting a string constant to ‘char*’
        [-Wwrite-strings]

putenv expects a string which can be modified.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-01 22:49:04 +01:00
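Note on 91af010200: `putenv` stores the pointer it receives instead of copying the string, which is why a string literal is the wrong argument. The sketch below assumes a POSIX system. `EXAMPLE_BACKEND` and `SetExampleBackend` are invented for illustration and are not the names used in text2image.cpp.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <stdlib.h>  // putenv is POSIX, not ISO C++
#include <string>

// A writable array with static storage duration: it can be modified
// and it outlives the putenv call, as putenv requires.
static char example_env[] = "EXAMPLE_BACKEND=fc";

// Registers the NAME=value string above in the environment.
inline bool SetExampleBackend() {
  assert(std::strchr(example_env, '=') != nullptr);  // must be NAME=value
  return ::putenv(example_env) == 0;
}
```

Where available, `setenv` copies its arguments and avoids the lifetime question altogether.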
Stefan Weil
5dd606c631 Replace NULL by nullptr
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-01 22:45:49 +01:00
Stefan Weil
d9600cd82e Fix and simplify SIMD tests
The tests for SSE and AVX must only be done if the correct compiler
flags were used.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-01 11:19:17 +01:00
zdenop
d3065520fa fix 2 clang warnings 2018-12-30 20:25:24 +01:00
Stefan Weil
cb049133cd Fix compiler warning
clang warning:

    tesseractmain.cpp(512,21): warning: '&&' within '||' [-Wlogical-op-parentheses]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-29 22:17:33 +01:00
zdenop
420fb0ced0 Merge branch 'master' of https://github.com/tesseract-ocr/tesseract 2018-12-29 10:31:33 +01:00
zdenop
8885fe2ccb provide info about compiled openmp version 2018-12-29 10:18:27 +01:00
Stefan Weil
993e56ffde Don't try to create text output if other renderers failed (fix regression)
Commit 49d7df6dc3 added error handling,
but since that commit Tesseract has fallen back to text output if the
user-selected output failed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-27 10:23:28 +01:00
zdenop
cc997b53c7 add the missing implementation for the TessBaseAPIGetAltoText method in the C API 2018-12-26 21:35:47 +01:00
Stefan Weil
db9c7e0312 Use std::stringstream to generate hOCR output
Using std::stringstream simplifies the code and allows conversion of
double to string independent of the current locale setting.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-16 20:14:11 +01:00
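Note on db9c7e0312: a string stream picks up the global locale when it is constructed, so one common way to get locale-independent number formatting is to imbue the classic "C" locale. This is a sketch of that technique, not a copy of the hOCR renderer. `FormatLocaleIndependent` is a hypothetical helper name.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Formats a double with the classic "C" locale, so the decimal
// separator is always '.', whatever the global locale is.
inline std::string FormatLocaleIndependent(double value) {
  std::ostringstream stream;
  stream.imbue(std::locale::classic());
  stream << value;
  assert(!stream.fail());
  return stream.str();
}
```

By contrast, `printf`-style formatting follows the C locale set with `setlocale`, which can produce a decimal comma in generated markup.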
zdenop
72d8df581b
Merge pull request #2121 from stweil/hocr
Move code for hOCR renderer to new file
2018-12-16 16:26:27 +01:00
Stefan Weil
c7e8d30280 Fix value for PHYSICAL_IMG_NR in ALTO output
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-16 15:07:02 +01:00
Stefan Weil
457c53026d Fix indentation of hOCR output
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-15 17:51:59 +01:00
Stefan Weil
5de3fc47bb Format code in new file hocrrenderer.cpp
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-15 15:35:21 +01:00
Stefan Weil
48713f7df2 Move code for hOCR renderer to new file
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-15 15:33:47 +01:00
zdenop
1f5fb15af3 remove setting constant resolution from ImageThresholder::SetImage.
Credible resolution will be set afterward. Fixes #2080.
2018-12-14 19:23:22 +01:00
zdenop
6d06d39bf4
Merge pull request #2118 from stweil/clean
protos: Remove several unused macros, functions and global variables
2018-12-14 09:20:53 +01:00
Stefan Weil
b8c4f1b9fc protos: Remove unused config variable
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-13 21:37:33 +01:00
Stefan Weil
f35eeb3b4a protos: Remove several unused macros, functions and global variables
The unused global variable TrainingData used a lot of runtime memory.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-13 21:32:56 +01:00
Stefan Weil
fbbbdb4565 Use std::stringstream to generate ALTO output and add <SP> element
Using std::stringstream simplifies the code.
The <SP> element is needed between two <String> elements.
Remove also some unneeded spaces in the ALTO output.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-12 22:29:35 +01:00
Stefan Weil
7ebd3153ae Fix several typos (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-10 18:59:58 +01:00
Stefan Weil
81ab302d52 FPRow: Remove three unused methods
This fixes warnings from the Intel compiler:

    src/textord/cjkpitch.cpp(319): warning #177:
      function "<unnamed>::FPRow::good_gaps" was declared but never referenced
    src/textord/cjkpitch.cpp(383): warning #177:
      function "<unnamed>::FPRow::is_bad" was declared but never referenced
    src/textord/cjkpitch.cpp(387): warning #177:
      function "<unnamed>::FPRow::is_unknown" was declared but never referenced

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-08 16:43:52 +01:00
Stefan Weil
404f9cd147 SimpleStats: Remove unused method
This fixes a warning from the Intel compiler:

    src/textord/cjkpitch.cpp(79): warning #177:
      function "<unnamed>::SimpleStats::maximum" was declared
      but never referenced

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-08 16:39:46 +01:00
Stefan Weil
a9121d28f3
Merge pull request #2107 from stweil/march
Add check whether compiler supports -march=native flag
2018-12-08 10:53:09 +01:00
Stefan Weil
2c044df959 Fix wrong x_fsize in hOCR output (regression)
The regression was caused by the latest commit
c9e85ab78f.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-08 10:39:31 +01:00
Stefan Weil
2ccc5810f3 Add check whether compiler supports -march=native flag
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-05 20:13:28 +01:00
Stefan Weil
c9e85ab78f Fix wrong font attributes in hOCR output
Instrumented code throws this runtime error during OCR:

    ../../src/api/baseapi.cpp:1616:5: runtime error: load of value 128,
      which is not a valid value for type 'bool'
    ../../src/api/baseapi.cpp:1627:5: runtime error: load of value 128,
      which is not a valid value for type 'bool'

If there is no font information (typical for Tesseract with a LSTM model),
the font attributes got random values resulting in wrong hOCR output.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-04 10:52:46 +01:00
Stefan Weil
0bdae8f8bf GENERIC_2D_ARRAY: Fix runtime error in assignment operator
Instrumented code throws this runtime error during OCR:

    ../../src/ccstruct/matrix.h:84:11: runtime error:
      null pointer passed as argument 2, which is declared to never be null

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-04 10:48:46 +01:00
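Note on 0bdae8f8bf: the sanitizer message refers to `memcpy`, whose pointer arguments must not be null even when the size is zero. A minimal sketch of a guarded copy follows. `CopyInts` is a hypothetical helper and not the assignment operator from matrix.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `count` ints from src to dst. An empty copy returns before
// memcpy is reached, so null pointers are never passed to it.
inline void CopyInts(int* dst, const int* src, std::size_t count) {
  if (count == 0) return;
  assert(dst != nullptr && src != nullptr);
  std::memcpy(dst, src, count * sizeof(int));
}
```

Copying from an empty array, where the source pointer may legitimately be null, then becomes a no-op.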
Stefan Weil
f0a4d04187 Add config variable for selection of dot product function
Add also a C++ implementation with more aggressive compiler options
which is optimized for the CPU where the software was built.

It is now possible to select the function used for the dot product
with -c dotproduct=FUNCTION where FUNCTION can be one of those values:

* auto      selection based on detected hardware (default)
* generic   C++ code with default compiler options
* native    C++ code optimized for build host
* avx       optimized code for AVX
* sse       optimized code for SSE

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-01 00:19:28 +01:00
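Note on f0a4d04187: selecting a function by name, as `-c dotproduct=FUNCTION` does, can be sketched as a lookup that returns a function pointer. Everything below is illustrative. `DotProductFunction`, `DotProductGeneric` and `SelectDotProduct` are assumed names, and only the generic variant is implemented, with the other listed values falling back to it.

```cpp
#include <cassert>
#include <string>

using DotProductFunction = double (*)(const double* u, const double* v, int n);

// Plain C++ dot product, standing in for the "generic" choice.
inline double DotProductGeneric(const double* u, const double* v, int n) {
  assert(n >= 0);
  double total = 0.0;
  for (int k = 0; k < n; ++k) total += u[k] * v[k];
  return total;
}

// Maps a setting value to a function. In this sketch every documented
// value resolves to the generic code; unknown values yield nullptr.
inline DotProductFunction SelectDotProduct(const std::string& name) {
  if (name == "auto" || name == "generic" || name == "native" ||
      name == "avx" || name == "sse") {
    return DotProductGeneric;
  }
  return nullptr;
}
```

A real implementation would return a different function per value and would check the detected hardware before honoring `avx` or `sse`.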
zdenop
b527b37825
Merge pull request #2097 from stweil/namespace
SIMDDetect: Use tesseract namespace and format code
2018-12-01 00:02:18 +01:00
Stefan Weil
1910b1a72b SIMDDetect: Use tesseract namespace and format code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:36:39 +01:00
Stefan Weil
66d3275d0b IntSimdMatrixSSE: Remove unused include statement and simplify code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
048eb34934 Add missing static attribute to local inline functions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
b73370aac9 Remove unneeded test for nullptr
IntSimdMatrix::GetFastestMultiplier never returns a nullptr.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
e2419b1968 Fix potential crash in tprintf
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
6b6d9de497 Fix potential crash in STRING class
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
59fb3370bb Use -ffast-math for calculation of dot product
This reduces the code size for intsimdmatrixavx2 from 2700 to 2668
and slightly improves the performance for fast models with AVX2.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 22:52:04 +01:00
Stefan Weil
fda3ba9009 IntSimdMatrixSSE: Fix comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 22:13:32 +01:00
zdenop
07b140364f
Merge pull request #2093 from stweil/python
Updates for Python scripts
2018-11-30 08:10:20 +01:00
zdenop
53600c677e
Merge pull request #2092 from stweil/format
Format new ALTO code with clang-format
2018-11-30 08:08:52 +01:00
zdenop
f6493dd5e8
Merge pull request #2090 from stweil/inline
Optimize performance by using inline functions
2018-11-30 08:07:45 +01:00
Stefan Weil
c59c45fb3e Fix Amharic font list
This was reported for the Python code by LGTM.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 08:00:22 +01:00
Stefan Weil
b148644c1b Make Python script executable
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 07:08:45 +01:00
Stefan Weil
ed48b2a8f5 Format new ALTO code with clang-format
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 06:37:25 +01:00
Jake Sebright
d7cee03a94 Add support for ALTO output 2018-11-30 06:09:36 +01:00
Stefan Weil
3c047f0ac8 Optimize performance by using inline function DotProduct
This improves performance for the "best" models because it
avoids function calls.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-29 21:43:41 +01:00
Stefan Weil
e161501df6 Optimize performance by using inline MatrixDotVectorInternal
This improves performance for the "best" models because it
avoids function calls.

The compiler also knows the passed values for the parameters
add_bias_fwd and skip_bias_back.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-29 21:37:32 +01:00
Egor Pugin
685b136d89
Fix incorrect condition. 2018-11-29 19:02:54 +03:00
Egor Pugin
267b79982d
Merge pull request #2076 from jbarlow83/pythonize-training
RFC: Pythonize tesstrain.sh and friends
2018-11-25 13:31:48 +03:00
James R. Barlow
8aa25239ae Fix some of Codacy's complaints 2018-11-24 16:59:01 -08:00
James R. Barlow
9122e6249e Autoreformat code
This increases the deviation from the bash scripts so is done separately.
2018-11-24 00:50:29 -08:00
James R. Barlow
d9ae7ecc49 Pythonize tesstrain.sh -> tesstrain.py
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.

I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.

Python 3.6+ is required.  Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.

There are minor output and behavioral changes, and advantages.  Python's logging
is used.  Temporary files are only deleted on success, so they can be inspected
if training fails.  Console output is more terse and the log file is more
verbose.  And there are progress bars!  (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.

Argument checking is also more comprehensive.
2018-11-24 00:45:35 -08:00
pndaza
fc8a3d5bbc combine condition with next 2018-11-24 09:21:05 +06:30
pndaza
5c85d8e03d add missed letters and symbols - 0x104a to 0x104f - 2018-11-24 09:14:31 +06:30
Stefan Weil
9b783822a0 Remove unused include statements for tprintf.h
Format also a call of tprintf and add a missing explicit include statement.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-18 17:25:01 +01:00
Stefan Weil
a93426c9ff Fix wrong results from function streamtofloat
The local variable k should be 10 ^ (number of digits after the decimal
point), but will overflow when there are more than 9 digits after the
decimal point because an int value cannot store 10000000000.

This results in wrong double values read from .tr files for example
(or in a runtime exception if Tesseract was compiled with -ftrapv).

Using uint64_t does not fix the general problem but allows more digits
which should be sufficient for the data read by Tesseract.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-17 20:02:21 +01:00
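Note on a93426c9ff: the overflow is easy to see in a small model of the conversion. The sketch below is not the code of streamtofloat. `FractionFromDigits` is a hypothetical function that keeps the digits after the decimal point in `v` and the matching power of ten in `k`, both as `uint64_t`. Ignoring digits beyond the 19th is a choice made for this sketch only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Converts the digits after the decimal point into a fraction in [0, 1).
// With a 32-bit int, k would overflow at the tenth digit. uint64_t is
// safe up to 19 digits; any further digits are ignored here.
inline double FractionFromDigits(const std::string& digits) {
  std::uint64_t v = 0;  // digits read so far, as an integer
  std::uint64_t k = 1;  // 10 ^ (number of digits read so far)
  int used = 0;
  for (char c : digits) {
    assert(c >= '0' && c <= '9');
    if (used == 19) break;
    v = v * 10 + static_cast<std::uint64_t>(c - '0');
    k *= 10;
    ++used;
  }
  return static_cast<double>(v) / static_cast<double>(k);
}
```

Ten digits such as "1234567890" need k = 10000000000, which does not fit in a 32-bit int but is well within the range of `uint64_t`.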
Stefan Weil
acca4fb999 Fix some unbound variables and other small issues in training shell scripts
Fix also the logging helper functions to work without log file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-16 11:13:46 +01:00
Stefan Weil
a4b03fbb27 Fix warning from shellcheck
shellcheck warning:

    In /tesseract/src/training/tesstrain_utils.sh line 209:
        TIMESTAMP=`date +%Y-%m-%d`
                  ^-- SC2006: Use $(..) instead of legacy `..`.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-15 17:45:20 +01:00
John Lin
bfe58aa56f Fix unbound variable $FONTS 2018-11-15 17:43:15 +01:00
Stefan Weil
0915cbd535 Simplify shell script using mktemp
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-15 13:36:52 +01:00
John Lin
edb76e281a Simplify MKTEMP_DT logic 2018-11-15 10:38:40 +08:00
John Lin
dbfc89f9af Fix mktemp in tesstrain_utils.sh
The commit 10f2c45c00 unified the usage of mktemp, but with
incorrect bash syntax and an unnecessary definition of LANG_CODE
and TIMESTAMP. This patch fixes the above problems.
2018-11-14 09:04:34 +08:00
Ray Smith
ce88adbf32 fix issue #1192 2018-11-12 12:53:12 +01:00
zdenop
724957167e fix typo in non VS build 2018-11-08 23:10:14 +01:00
zdenop
eb104f9fe4 VS build: fix warning C4996: The POSIX name for this item is deprecated. Instead, use the ISO C and C++ conformant name. 2018-11-08 22:55:04 +01:00
zdenop
cbef2ebe12 implement patches vcpkg tesseract 2018-11-08 21:37:47 +01:00
zdenop
7a7f226228 ocrclass: Remove unused macros
Signed-off-by: Stefan Weil <sw@weilnetz.de>

# Conflicts:
#	src/ccutil/ocrclass.h
2018-11-08 20:23:36 +01:00
Zdenko Podobný
2dd753ee4c replace VS implementation of gettimeofday with std::chrono::steady_clock::now(); fixes #2038 2018-11-08 19:43:46 +01:00
chrismamo1
439dfaaf8b un-fix one of the warnings 2018-10-30 18:10:48 -06:00
chrismamo1
30be5aaaac fix a couple minor compiler warnings 2018-10-30 18:00:32 -06:00
Stefan Weil
6f8bd340d9 Remove chopper.h
It is no longer needed after some reordering of code in chopper.cpp.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:51:44 +01:00
Stefan Weil
286dfb031a Remove unused include statements
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:46:58 +01:00
Stefan Weil
2098bb6daf Remove unused function ComputeOrientation
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:43:56 +01:00
Stefan Weil
cad6ebb5ff LIST: Remove old comments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:43:56 +01:00
zdenop
99054f10c7
Merge pull request #2027 from stweil/warn
Fix compiler warning
2018-10-24 07:31:15 +02:00
Stefan Weil
eefb8348f7 Fix compiler warning
Compiler warning on macOS:

    tesscallback.h:29:7: warning:
      'TessClosure' has no out-of-line virtual method definitions;
      its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-23 17:01:53 +02:00
Noah Metzger
f7f5f41073 Fixed a mac compiler warning in recodebeam.cpp
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-10-23 16:57:39 +02:00
zdenop
e60318f9c0 set PANGOCAIRO_BACKEND=fc to avoid crash; fixes #736 2018-10-23 13:22:38 +02:00
Zdenko Podobný
3d508a65a7 set unlv_tilde_crunching to false; fixes #1449 #948 2018-10-23 09:26:32 +02:00
Stefan Weil
7ebbb7370a ColPartition: Fix CID 1164543 (Division or modulo by float zero)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 22:14:15 +02:00
Stefan Weil
eaabe4a3ce ErrorCounter: Fix CID 1164538 (Division or modulo by float zero)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 22:14:15 +02:00
Stefan Weil
8f615d44f1 osdetect: Fix CID 1164539 (Division or modulo by float zero)
Avoid also a conversion from int16_t to double to float.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 22:14:15 +02:00
Stefan Weil
be0cf03778 tesseractmain: Fix memory leak
Commit 49d7df6dc3 introduced a memory leak
when the output file could not be created.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 18:50:47 +02:00
Stefan Weil
9c0799314e Add parentheses in boolean expression
This fixes a compiler warning:

    scanutils.cpp:444:32: warning:
        '&&' within '||' [-Wlogical-op-parentheses]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
Stefan Weil
0f973e1d62 Add missing 'static' keyword
This fixes a compiler warning:

    globaloc.cpp:33:6: warning: no previous extern declaration for
      non-static variable 'global_crash_pixes'
      [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
Stefan Weil
a71ad455be Remove unused macros
This fixes some compiler warnings:

    mainblk.cpp:28:9: warning: macro is not used [-Wunused-macros]
    mainblk.cpp:29:9: warning: macro is not used [-Wunused-macros]
    [...]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
zdenop
dba7f456d5
Merge pull request #2018 from stweil/sort
Get sorted list of available languages
2018-10-22 16:06:42 +02:00
Matthias Geerdsen
eac2880c24 avoid unbound variable TESSDATA_PREFIX
set TESSDATA_PREFIX as empty, if not defined in environment to avoid an
unbound variable
2018-10-22 14:28:14 +02:00
Stefan Weil
d75ef80f12 Get sorted list of available languages
TessBaseAPI::GetAvailableLanguagesAsVector returned the list of languages
without sorting, so the result was random and not user friendly.

Now `tesseract --list-langs` shows the available languages and scripts
in alphabetic order.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 14:07:03 +02:00
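Note on d75ef80f12: sorting the collected names before they are returned is enough to make the listing deterministic. A minimal sketch with `std::sort` follows. `SortedLanguages` is a hypothetical free function and not the signature of `TessBaseAPI::GetAvailableLanguagesAsVector`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the language names in ascending byte-wise order.
// The argument is taken by value so the caller's vector is untouched.
inline std::vector<std::string> SortedLanguages(std::vector<std::string> langs) {
  std::sort(langs.begin(), langs.end());
  assert(std::is_sorted(langs.begin(), langs.end()));
  return langs;
}
```

Directory iteration order depends on the file system, which is why the unsorted result looked random.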
Matthias Geerdsen
95d9c8c57a set default values for unset variables
setting default values for possibly unset variables avoids unbound
variable errors
2018-10-21 21:30:52 +02:00
Matthias Geerdsen
7b32e64564 add shebang 2018-10-21 21:30:13 +02:00
zdenop
32c1e4f433 FLAGS_webtext_prefix: unbound variable; issue #2005 2018-10-21 14:00:06 +02:00
Stefan Weil
34a89e54db Fix function ScrollViewCommand
The format string which builds the command only takes one or two
string arguments, so the function allocated too much memory and
passed too many arguments to snprintf.

This also fixes a compiler warning (clang).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-21 08:13:16 +02:00
zdenop
4d3b0bc798 use <cstdio> instead of <stdio.h> 2018-10-20 21:46:40 +02:00
zdenop
8103d17c72 use _strdup instead of strdup in MSVC 2018-10-20 21:43:38 +02:00
zdenop
a033261f63 add info about used backend in text2image 2018-10-20 21:41:09 +02:00
Stefan Weil
e232114089 Fix use of undefined macro USE_DEVICE_SELECTION
This fixes compiler warnings.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 13:58:12 +02:00
Zdenko Podobný
486940687c Exit training script if run command failed; fixes #2005 2018-10-20 13:00:39 +02:00
Egor Pugin
5a4288f2fc
Merge pull request #2011 from stweil/fix
Small fix and optimization
2018-10-20 13:48:51 +03:00
Zdenko Podobný
1a523006a6 install training script with autotools. 2018-10-20 12:33:07 +02:00
Stefan Weil
b0ace0e850 ScrollView: Optimize local table_colors
It is constant, and the values are in the range 0...255,
so its size can be reduced.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 12:05:38 +02:00
Stefan Weil
d364750cb3 Remove type cast and fix compiler warning (-Wcast-qual)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 12:04:46 +02:00
Zdenko Podobný
1b2bda65e0 Revert "prefer to use FreeType for pango_cairo_font_map"
This reverts commit 345e5ee1f3.
2018-10-20 11:30:07 +02:00
Zdenko Podobný
276c6845ae Revert "free PangoFontMap; fixes #1999"
This reverts commit d1d73b9888.
2018-10-20 11:28:20 +02:00
Zdenko Podobný
a03f23e05e Merge branch 'master' of https://github.com/tesseract-ocr/tesseract 2018-10-20 11:26:23 +02:00
Marco Atzeri
ebbd4e3efc fixes #426; define NOUNDEFINED for cygwin 2018-10-20 11:25:28 +02:00
Stefan Weil
b40151c200 training: Don't hide global variables
This fixes two warnings from LGTM:

    Parameter feature_defs hides a global variable with the same name.
    Parameter Config hides a global variable with the same name.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 22:37:37 +02:00
Stefan Weil
bb181ec8d3 Rename API function from GetBestLSTMChoices to GetBestLSTMSymbolChoices
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:50:38 +02:00
Stefan Weil
df7d1e1f97 Rename API function for getting LSTM choices
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:50:38 +02:00
Stefan Weil
830b9c715a BLOBNBOX: Declare signed bit field
This fixes a warning from LGTM:

    Bit field area of type int should have explicitly unsigned integral,
    explicitly signed integral, or enumeration type.

Maybe area should be unsigned, but that would require lots of other
changes, so for now signedness is not changed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:30:05 +02:00
Stefan Weil
d9c472b988 cluster: Fix some potential overflows
This fixes several issues reported by LGTM:

    Multiplication result may overflow 'int'
    before it is converted to 'size_type'.

    Multiplication result may overflow 'float'
    before it is converted to 'double'.

    Multiplication result may overflow 'int'
    before it is converted to 'unsigned long'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:23:17 +02:00
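Note on d9c472b988: the LGTM messages describe a product that is computed in a narrow type and only widened afterwards. Widening one operand first moves the whole multiplication to the wider type. The sketch below uses a hypothetical `ElementCount` function and is not taken from cluster.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Multiplies two non-negative int values in 64-bit arithmetic.
// Writing `rows * cols` and converting the result would overflow int
// first whenever the product exceeds INT_MAX.
inline std::uint64_t ElementCount(int rows, int cols) {
  assert(rows >= 0 && cols >= 0);
  return static_cast<std::uint64_t>(rows) * static_cast<std::uint64_t>(cols);
}
```

For 100000 rows and 100000 columns the true product is 10000000000, which does not fit in a 32-bit int.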
Zdenko Podobný
d1d73b9888 free PangoFontMap; fixes #1999 2018-10-19 00:48:20 +02:00
zdenop
bbe7a4cc10
Merge pull request #2002 from stweil/err
Show error message when output file could not be created
2018-10-18 19:27:01 +02:00
Stefan Weil
49d7df6dc3 tesseractmain: Show error message when output file could not be created
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 19:22:49 +02:00
Stefan Weil
b0b8dfbc81 TessResultRenderer: Extend API to access status of renderer
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 19:22:48 +02:00
Stefan Weil
f0c9b753c6 BlamerBundle: Add declaration for copy assignment operator
It does not need an implementation as it is currently not used.

This fixes a warning from LGTM:

    No matching copy assignment operator in class BlamerBundle.
    It is good practice to match a copy constructor
    with a copy assignment operator.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:36:32 +02:00
Stefan Weil
e3658bbc78 C_OUTLINE_FRAG: Add declaration for copy constructor
It does not need an implementation as it is currently not used.

This fixes a warning from LGTM:

    No matching copy constructor in class C_OUTLINE_FRAG.
    It is good practice to match a copy assignment operator
    with a copy constructor.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:31:45 +02:00
Stefan Weil
5585ed8d85 ROW: Add declaration for copy constructor
It does not need an implementation as it is currently not used.

This fixes a warning from LGTM:

    No matching copy constructor in class ROW.
    It is good practice to match a copy assignment operator
    with a copy constructor.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:31:10 +02:00
Stefan Weil
a1f0c66be1 BLOB_CHOICE: Add copy assignment operator
This fixes a warning from LGTM:

    No matching copy assignment operator in class BLOB_CHOICE.
    It is good practice to match a copy constructor
    with a copy assignment operator.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:29:07 +02:00
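Note on a1f0c66be1: the LGTM rule asks that a class with a user-declared copy constructor also declares the matching copy assignment operator. A minimal sketch of such a pair follows. `Choice` is a hypothetical stand-in with a single member and does not mirror the real BLOB_CHOICE.

```cpp
#include <cassert>

// Hypothetical class with a matching copy constructor and
// copy assignment operator.
class Choice {
 public:
  explicit Choice(float rating) : rating_(rating) {}

  // Copy constructor.
  Choice(const Choice& other) : rating_(other.rating_) {}

  // Copy assignment operator; guards against self-assignment.
  Choice& operator=(const Choice& other) {
    if (this != &other) {
      rating_ = other.rating_;
    }
    return *this;
  }

  float rating() const { return rating_; }

 private:
  float rating_;
};
```

When copying is not wanted at all, declaring both members as `= delete` satisfies the same rule.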
Stefan Weil
7100a14636 ParamsTrainingHypothesis: Add copy assignment operator
This fixes a warning from LGTM:

    No matching copy assignment operator in class ParamsTrainingHypothesis.
    It is good practice to match a copy constructor
    with a copy assignment operator.

Use also a simpler expression for the size of features.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:28:12 +02:00
Stefan Weil
0bbd5c5d1c LineHypothesis: Add copy assignment operator
This fixes a warning from LGTM:

    No matching copy assignment operator in class LineHypothesis.
    It is good practice to match a copy constructor
    with a copy assignment operator.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:23:28 +02:00
Noah Metzger
c13371d6e0 Renamed GetGlyphConfidences() to GetChoices() and glyph_confidences to lstm_choice_mode
Renamed the global attribute glyph_confidences to lstm_choice_mode and the method GetGlyphConfidences() to GetChoices(). All variables and comments contained in related methods were renamed as well.

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-10-17 16:43:39 +02:00
zdenop
e93e8f063f
Merge pull request #1994 from stweil/lgtm
Fix several warnings from LGTM
2018-10-16 18:18:43 +02:00
Stefan Weil
4b800ccaa7 Fix sum computation in higher precision
This also fixes two warnings from LGTM:

    Multiplication result may overflow 'float'
    before it is converted to 'double'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 18:01:27 +02:00
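Note on 4b800ccaa7: a sum is only accumulated in higher precision if each product is already formed in `double`. The sketch below converts every `float` before multiplying. `SumOfSquares` is a hypothetical function and is not the code changed by the commit.

```cpp
#include <cassert>
#include <cstddef>

// Sum of squares accumulated in double precision. Each float is
// converted to double before the multiplication, so the product
// cannot overflow the float range.
inline double SumOfSquares(const float* values, std::size_t count) {
  assert(values != nullptr || count == 0);
  double sum = 0.0;
  for (std::size_t k = 0; k < count; ++k) {
    const double value = values[k];
    sum += value * value;
  }
  return sum;
}
```

A value such as 1e30f squares to about 1e60, which is far beyond the largest `float` but unproblematic in `double`.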
Stefan Weil
fd84f7b666 LLSQ: Replace sqrt by std::sqrt
This should fix warnings from LGTM:

    Multiplication result may overflow 'float'
    before it is converted to 'double'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 17:57:26 +02:00
Stefan Weil
7c2af45713 Fix sum computation in higher precision
This also fixes two warnings from LGTM:

    Multiplication result may overflow 'float'
    before it is converted to 'double'.

Replace also FALSE / TRUE by false / true for bool return value.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 17:50:12 +02:00
Stefan Weil
1730b8ccbe classify/cluster: Replace Emalloc by std::vector
This should fix a warning from LGTM:

    Multiplication result may overflow 'int' before it is
    converted to 'unsigned long'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 17:14:51 +02:00
Stefan Weil
5fb461a563 SVNetwork: Handle failed socket call (CID 1164597)
This fixes a warning from Coverity Scan.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:24 +02:00
Stefan Weil
2d2b269e02 OpenclDevice: Catch negative index (CID 1395110)
This fixes a warning from Coverity Scan.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:24 +02:00
Stefan Weil
146d2caa9d Classify: Fix new resource leak (CID 1396163)
This fixes a warning from Coverity Scan.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:23 +02:00
Stefan Weil
edbd07a5f9 lstmtraining: Handle failed remove syscall (CID 1396166)
This fixes a warning from Coverity Scan.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:23 +02:00
Stefan Weil
32e1e4b6b4 TessPDFRenderer: Remove unused member variable jpg_quality_ (CID 1396172)
This fixes a warning from Coverity Scan.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:23 +02:00
Stefan Weil
d89ec15571 Revert "Fix CID 1396172 (Uninitialized members)"
This reverts commit cbd09de7fe.
The variable can be removed as it is not used.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:23 +02:00
Zdenko Podobný
cbd09de7fe Fix CID 1396172 (Uninitialized members) 2018-10-16 12:24:10 +02:00
Stefan Weil
d0d73da65a commontraining: Fix two comments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-15 11:15:49 +02:00
Zdenko Podobný
10f2c45c00 fix "mkdir -dt" for bsd, mac and cygwin 2018-10-14 18:08:50 +02:00
zdenop
524c23de53
Merge pull request #1987 from tfmorris/1986_errno_include
Add missing cerrno includes - fixes #1986
2018-10-13 22:06:00 +02:00
Tom Morris
14af3f720b Add missing cerrno includes - fixes #1986 2018-10-13 16:02:48 -04:00
zdenop
83f80054f6
Merge pull request #1985 from stweil/win32
win32: Show TIFF errors on console
2018-10-13 20:51:26 +02:00
Stefan Weil
6ffb53f815 win32: Show TIFF errors on console
Showing them in a window (default) is not acceptable for a console
application like Tesseract which must be able to work in batch mode.

Such error messages can be triggered by TIFF files which include
vendor specific tags.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-13 20:42:14 +02:00
zdenop
4734317499 fixes #408 - text2image: comma in font name 2018-10-13 15:23:40 +02:00
zdenop
5f4f9372e9 revert debug message committed by mistake 2018-10-13 11:20:25 +02:00
Tom Morris
f6fd9b3a00 Handle null raw_choice - fixes #235, fixes #246 2018-10-13 11:14:26 +02:00
Stefan Weil
de6a759744 unittest: Add paragraphs_test
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-12 16:23:10 +02:00
Stefan Weil
d86d520fd0 Remove tab character in source files
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-12 11:31:10 +02:00
Stefan Weil
d59f14c70a Remove gradechop.h
It only defines the macro partial_split_priority which is only used in
findseam.cpp, so move it to that file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-12 11:31:10 +02:00
Zdenko Podobný
5fac51173b Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract:
  remove insight.io badge
  Use env variable in AppVeyor configuration
  Fix integer overflow in overlap calculation
  hocr: add ocrp_wconf to unconditional ocr-capabilities; fixes #1470
  fix uninitialized variable, remove unused variable
  Remove virtual specifiers
2018-10-10 00:38:24 +02:00
Egor Pugin
d93094b397
Merge pull request #1971 from stweil/fix
Fix integer overflow in overlap calculation
2018-10-09 19:59:09 +03:00
Stefan Weil
7f911ac5e0 Fix integer overflow in overlap calculation
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-09 16:43:31 +02:00
zdenop
ca5d285a28 hocr: add ocrp_wconf to unconditional ocr-capabilities; fixes #1470 2018-10-09 16:34:50 +02:00
zdenop
956525f5a4 fix uninitialized variable, remove unused variable 2018-10-09 15:47:20 +02:00
Zdenko Podobný
67b6b02e2d Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract:
  Remove code for _MSC_VER < 1900
  keep API compatibility with #1265
  Update googletest submodule to release v1.8.1
  Update test submodule
  Always use isascii() with isspace()
  Avoid crash with --psm 0 and LSTM traineddata
  SVPaint: Remove empty block
  Classify: Don't hide debug parameter
  UNICHARMAP: Remove comparison which is always false
  svpaint: Change a variable from global to local
  pgedit: remove unused declaration of display_bln_lines
  Plumbing: Remove comparison which is always false
  Release candidate 2
  use pdf L_FLATE_ENCODE only for png input; fixes #1961
2018-10-09 15:37:40 +02:00
Stefan Weil
128422e75c Remove virtual specifiers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-09 15:23:59 +02:00
Stefan Weil
f94b3fd9fc Remove code for _MSC_VER < 1900
Tesseract does not support Visual C++ older than Visual Studio 2015.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-09 14:05:21 +02:00
zdenop
c375f4fbf7 keep API compatibility with #1265 2018-10-09 11:22:15 +02:00
zdenop
272ebf995f
Merge pull request #1965 from stweil/isspace
Always use isascii() with isspace()
2018-10-08 18:47:39 +02:00
Stefan Weil
dcd0377bf0 Always use isascii() with isspace()
isspace() must only be used with an unsigned char or EOF argument,
and even then its result can depend on the current locale settings.

While this is not a problem for C/C++ executables which use the default
"C" locale, it becomes a problem when the Tesseract API is called from
languages like Python or Java which don't use the "C" locale.

By calling isascii() before calling isspace() this uncertainty can be
avoided, because any locale will hopefully give identical results for
the basic ASCII character set.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 17:25:09 +02:00
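The guard described in this commit can be sketched as follows. The helper name is hypothetical, and the range check `uc < 0x80` is used here as a portable stand-in for the POSIX `isascii()` call, which is not part of standard C++.

```cpp
#include <cctype>

// Hypothetical helper: true only for whitespace in the basic ASCII range.
inline bool IsAsciiSpace(char c) {
  // isspace() requires a value representable as unsigned char (or EOF),
  // so convert first to avoid passing a negative char value.
  const unsigned char uc = static_cast<unsigned char>(c);
  // Rejecting non-ASCII values first keeps the result independent of
  // locale-specific classifications for bytes >= 0x80.
  return uc < 0x80 && std::isspace(uc) != 0;
}
```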
Stefan Weil
32e92def49 Avoid crash with --psm 0 and LSTM traineddata
Orientation and script detect only worked with legacy models
and crashed with LSTM models.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 16:03:54 +02:00
Stefan Weil
1eeca175f7 SVPaint: Remove empty block
This fixes a warning from LGTM:

    Empty block without comment

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 14:25:05 +02:00
Stefan Weil
9c857ab962 Classify: Don't hide debug parameter
Fix a warning from LGTM:

    Local variable 'debug' hides a parameter of the same name.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 14:22:31 +02:00
Stefan Weil
30b75cfc05 UNICHARMAP: Remove comparison which is always false
Warning from LGTM:

    Comparison is always false because index <= 0 and 1 <= length.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 14:15:17 +02:00
Stefan Weil
3ae765ecca svpaint: Change a variable from global to local
This fixes a warning from LGTM:

    Poor global variable name 'rgb'. Prefer longer, descriptive
    names for globals (eg. kMyGlobalConstant, not foo).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 13:53:09 +02:00
Stefan Weil
7b5955920d pgedit: remove unused declaration of display_bln_lines
This fixes a warning from LGTM:

    This parameter of type ScrollView is 144 bytes
    - consider passing a pointer/reference instead.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 13:49:59 +02:00
Stefan Weil
ae93b65b1f Plumbing: Remove comparison which is always false
Warning from LGTM:

    Comparison is always false because index >= 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 13:47:16 +02:00
zdenop
f794571195 use pdf L_FLATE_ENCODE only for png input; fixes #1961 2018-10-07 20:57:19 +02:00
Zdenko Podobný
8598731daf Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract: (27 commits)
  Rework check for readable input file
  fix "mktemp -d --tmpdir" on Mac OS; see #1453
  pgedit: Change some variables from global to local ones
  improve description of min_characters_to_try variable
  WERD_RES: Remove comparisons which are constant
  GENERIC_2D_ARRAY: Pass parameters by reference
  genericvector: Pass parameters by reference
  chop: Use more efficient float calculations for sqrt
  rect: Use more efficient float calculations for ceil, floor
  intproto: Use more efficient float calculations for floor
  genericvector: Rewrite code to satisfy static code analyzer
  Fix constructor for class Dict (uninitialized member variables)
  Fix use of wrong UNICHARSET
  lstmtraining: Remove dead code for purified model name
  combine_tessdata: Handle failures when extracting
  lstmtraining: Check write permission for output model
  implement parameter min_characters_to_try for minimum characters to try to skip page entirely. fixes #1729
  Merge and enhance documentation on language and script models
  Document some more config options for tesseract
  Add Makefile rule to build HTML manpages
  ...
2018-10-07 15:39:02 +02:00
Stefan Weil
67bf9062df Rework check for readable input file
This reverts commit 1a096441d0 and
implements an alternate check which allows input from stdin.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 22:33:02 +02:00
zdenop
140bfa43f0 Merge branch 'master' of https://github.com/tesseract-ocr/tesseract 2018-10-06 20:50:08 +02:00
zdenop
4044ba8260 fix "mktemp -d --tmpdir" on Mac OS; see #1453 2018-10-06 20:47:48 +02:00
zdenop
c4fb194ba2
Merge pull request #1958 from stweil/lgtm
Fix some warnings from static code analyzer LGTM
2018-10-06 20:27:21 +02:00
Stefan Weil
685abc91f3 pgedit: Change some variables from global to local ones
This fixes compiler warnings and a warning from LGTM:

Poor global variable name 'pe'. Prefer longer, descriptive names [...]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 20:14:20 +02:00
zdenop
424dbd5dc7 improve description of min_characters_to_try variable 2018-10-06 20:10:54 +02:00
Stefan Weil
18f7ab751e WERD_RES: Remove comparisons which are constant
This fixes warnings from LGTM:

Comparison is always false because id >= 0.
Comparison is always true because mirrored >= 1.
Comparison is always false because id >= 0.

INVALID_UNICHAR_ID is -1, so the warnings are correct.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 20:06:38 +02:00
Stefan Weil
238c872753 GENERIC_2D_ARRAY: Pass parameters by reference
This fixes warnings from LGTM:

This parameter of type FontClassInfo is 192 bytes
- consider passing a pointer/reference instead.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 19:48:13 +02:00
Stefan Weil
a7982185c9 genericvector: Pass parameters by reference
This fixes warnings like the following one from LGTM:

This parameter of type ParamsTrainingHypothesis is 112 bytes
- consider passing a pointer/reference instead.

Most parameters can also get the const attribute.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 19:47:49 +02:00
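A small sketch of the pattern behind this warning. The struct is a hypothetical stand-in for a large element type, not the real ParamsTrainingHypothesis.

```cpp
#include <numeric>

// Hypothetical stand-in for a large element type
// (112 bytes where double is 8 bytes wide).
struct Hypothesis {
  double features[14];
};

// Passing by value would copy the whole struct on every call.
// A const reference avoids the copy and documents that the callee
// does not modify its argument.
inline double FeatureSum(const Hypothesis& h) {
  return std::accumulate(h.features, h.features + 14, 0.0);
}
```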
Stefan Weil
819c43d377 chop: Use more efficient float calculations for sqrt
This fixes warnings from LGTM:

Multiplication result may overflow 'float' before it is converted
to 'double'.

While the sqrt function always calculates with double, here the
overloaded std::sqrt can be used to handle the float arguments
more efficiently.

Replace also an old C++ type cast by a static_cast.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 18:59:23 +02:00
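The overload selection mentioned in this commit can be checked at compile time. This is a minimal sketch; the `Length` function is hypothetical and only illustrates a float-only computation.

```cpp
#include <cmath>
#include <type_traits>

// std::sqrt is overloaded: a float argument selects the float version,
// a double argument selects the double version.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "float overload returns float");
static_assert(std::is_same<decltype(std::sqrt(1.0)), double>::value,
              "double overload returns double");

// Hypothetical example: the whole computation stays in float.
inline float Length(float dx, float dy) {
  return std::sqrt(dx * dx + dy * dy);
}
```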
Stefan Weil
f264464ec6 rect: Use more efficient float calculations for ceil, floor
This fixes warnings from LGTM:

Multiplication result may overflow 'float' before it is converted
to 'double'.

While the floor function always calculates with double, here the
overloaded std::floor can be used to handle the float arguments
more efficiently.

Replace also old C++ type casts by static_cast.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 18:51:06 +02:00
zdenop
1e4768c1f5
Merge pull request #1957 from stweil/lgtm
Fix some warnings from static code analyzer LGTM
2018-10-06 18:42:12 +02:00
Stefan Weil
b26866bb3b intproto: Use more efficient float calculations for floor
This fixes warnings from LGTM:

Multiplication result may overflow 'float' before it is converted
to 'double'.

While the floor function always calculates with double, here the
overloaded std::floor can be used to handle the float arguments
more efficiently.

Replace also old C++ type casts by static_cast.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 18:29:38 +02:00
Stefan Weil
06a8de0b8b genericvector: Rewrite code to satisfy static code analyzer
Warning from LGTM:

Resource data_ is acquired by class GenericVector<FontSpacingInfo *>
but not released in the destructor.

LGTM complains about data_ not being deleted in the destructor.
The destructor calls the clear() method, but the delete there
was conditional which confuses the static code analyzer.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 18:24:13 +02:00
Stefan Weil
c2a8aa00b8 Fix constructor for class Dict (uninitialized member variables)
wildcard_unichar_id_, apostrophe_unichar_id_, question_unichar_id_ and
slash_unichar_id_ were not initialized in the constructor.

slash_unichar_id_ was used later in a conditional.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 17:52:52 +02:00
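One way to avoid this class of bug is to give every member a default initializer. The class below is a hypothetical sketch that reuses the member names from the commit message; it assumes -1 as the "invalid" value and does not reproduce the real Dict constructor.

```cpp
// Assumed sentinel for "no id assigned yet".
constexpr int kInvalidId = -1;

// Hypothetical sketch, not the real Dict class.
class DictSketch {
 public:
  DictSketch() = default;

  // Safe to call right after construction because the member
  // always holds a defined value.
  bool slash_is_known() const { return slash_unichar_id_ != kInvalidId; }

  bool all_unset() const {
    return wildcard_unichar_id_ == kInvalidId &&
           apostrophe_unichar_id_ == kInvalidId &&
           question_unichar_id_ == kInvalidId &&
           slash_unichar_id_ == kInvalidId;
  }

 private:
  // Default member initializers apply to every constructor.
  int wildcard_unichar_id_ = kInvalidId;
  int apostrophe_unichar_id_ = kInvalidId;
  int question_unichar_id_ = kInvalidId;
  int slash_unichar_id_ = kInvalidId;
};
```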
zdenop
9efedc15b2
Merge pull request #1954 from stweil/unicharset
Fix use of wrong UNICHARSET
2018-10-06 15:04:31 +02:00
Stefan Weil
8dc9e9fd14 Fix use of wrong UNICHARSET
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 13:21:09 +02:00
Stefan Weil
0e71e5a754 lstmtraining: Remove dead code for purified model name
The purified model name `model_output` was unused,
so remove the comment and the unused code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 09:34:17 +02:00
Stefan Weil
f4e982e041 combine_tessdata: Handle failures when extracting
Report an error and terminate if that fails.

Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main()
and add missing return at end of main().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-05 21:39:18 +02:00
Stefan Weil
7434590b9a lstmtraining: Check write permission for output model
This is done by creating a temporary file.
Report an error and terminate if that fails.

Use also EXIT_SUCCESS and EXIT_FAILURE for the return values of main().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-05 20:38:02 +02:00
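The check described above might look roughly like this. It is a sketch under assumptions: the function names and the ".probe" suffix are hypothetical, and the real lstmtraining code may create its temporary file differently.

```cpp
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper: try to create a temporary file next to the
// output model and remove it again.
inline bool CanWriteModel(const std::string& model_path) {
  const std::string probe = model_path + ".probe";  // assumed suffix
  {
    std::ofstream out(probe, std::ios::binary);
    if (!out.is_open()) return false;  // directory missing or not writable
  }  // the stream is closed here
  std::remove(probe.c_str());
  return true;
}

// Hypothetical entry point showing the EXIT_SUCCESS / EXIT_FAILURE convention.
inline int CheckOutput(const std::string& model_path) {
  if (!CanWriteModel(model_path)) {
    std::fprintf(stderr, "Error, cannot write model %s\n", model_path.c_str());
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```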
zdenop
660dbaa9d5 implement parameter min_characters_to_try for minimum characters to try to skip page entirely.
fixes #1729
2018-10-05 19:05:28 +02:00
Stefan Weil
26bfd2b9d3 Allow orientation detection with any traineddata
While orientation and script detection (OSD) normally requires
osd.traineddata to detect both, it must also be possible to do
only orientation detection with eng.traineddata or any other
traineddata.

Enforce osd.traineddata only if there was no `-l` command line option.

Commit 27ce472666 was too restrictive.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 17:07:14 +02:00
Zdenko Podobný
dcc50a867f Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract:
  Fix CID 1164579 (Explicit null dereferenced)
  print help for tesstrain.sh; fixes #1469
  Fix CID 1395882 (Uninitialized scalar variable)
  Fix comments
  Move content of ipoints.h to points.h and remove ipoints.h
  remove duplicate help from combine_lang_model
  Fix typo.
  use tprintf instead of printf to be able to disable messages by quiet option (issue #1240)
  add "sudo ldconfig" to install instruction. fixes #1212
  unittest: Replace NULL by nullptr
  unittest: Format code
  tesseract app: check if input file exists; fixes #1023
  Format code (replace ( xxx ) by (xxx))
  Simplify boolean expressions
  Win32: use the ISO C and C++ conformant name "_putenv" instead of deprecated "putenv"
2018-10-03 19:21:42 +02:00
zdenop
423798722f
Merge pull request #1938 from stweil/coverity
Fix two reports from CoverityScan and clean related code
2018-10-02 12:34:08 +02:00
Stefan Weil
04703ca8df Fix CID 1164579 (Explicit null dereferenced)
The report from Coverity Scan is a false positive.

Nevertheless the code can be rewritten and optimized
a little bit to fix that report.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:48:28 +02:00
Zdenko Podobný
7dbf5a030f print help for tesstrain.sh; fixes #1469 2018-10-02 11:35:10 +02:00
Stefan Weil
9a1f14f2aa Fix CID 1395882 (Uninitialized scalar variable)
The implementation for ICOORD only allows division by scale != 0.

Do the same for FCOORD by asserting that scale != 0.0f,
so undefined program behaviour will be caught.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:34:14 +02:00
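A minimal sketch of the precondition this commit adds. `PointF` is a hypothetical stand-in for FCOORD, and the standard `assert` is used here in place of whatever assertion macro the project uses.

```cpp
#include <cassert>

// Hypothetical stand-in for a float coordinate pair.
struct PointF {
  float x;
  float y;
};

// Division by a scalar; a zero scale is treated as a programming error
// and caught by the assertion instead of silently producing inf or NaN.
inline PointF& operator/=(PointF& p, float scale) {
  assert(scale != 0.0f);
  p.x /= scale;
  p.y /= scale;
  return p;
}
```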
Stefan Weil
ce6ff20939 Fix comments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:26:36 +02:00
Stefan Weil
8c56b8f58c Move content of ipoints.h to points.h and remove ipoints.h
Both include files depended on each other, so it did not make sense
to separate them.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:21:27 +02:00
zdenop
57a6f1d22e remove duplicate help from combine_lang_model 2018-10-01 21:22:51 +02:00
Egor Pugin
6ee7f4eac2
Fix typo. 2018-09-29 17:04:25 +03:00
zdenop
14b83d3090 use tprintf instead of printf to be able to disable messages by quiet option
(issue #1240)
2018-09-29 13:49:08 +02:00
zdenop
d5b6222856
Merge pull request #1935 from stweil/style
Format code and fix some style issues
2018-09-29 09:32:56 +02:00
zdenop
1a096441d0 tesseract app: check if input file exists; fixes #1023 2018-09-29 08:51:00 +02:00
Stefan Weil
0f3206d5fe Format code (replace ( xxx ) by (xxx))
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 08:21:25 +02:00
Stefan Weil
63f87cac90 Simplify boolean expressions
Remove "? true : false" which is not needed for boolean expressions.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 08:21:14 +02:00
Zdenko Podobný
bf6d929e4c fix using c-api / compile with gcc 2018-09-28 23:14:32 +02:00
zdenop
abe40f17c9 Win32: use the ISO C and C++ conformant name "_putenv" instead of deprecated "putenv" 2018-09-28 20:53:57 +02:00
zdenop
a0564fd4ec Allow user to specify dpi for input image 2018-09-28 20:28:52 +02:00
zdenop
345e5ee1f3 prefer to use FreeType for pango_cairo_font_map 2018-09-28 11:07:26 +02:00
zdenop
5fe1390748 remove alpha channel from png: issue #1914 2018-09-27 19:40:15 +02:00
zdenop
971fe50031 fixed #714: use binary mode when generating pdf to stdout on Windows 2018-09-27 18:35:15 +02:00
Zdenko Podobný
5dfce7471c fix #1889: part 2 2018-09-26 09:28:22 +02:00
DevelopAlex
f69af96dbe
Only print "Merging rows..." in debug mode
Only print "Merging rows..." if textord_debug_blob==true (like all the other debug messages).
Otherwise, there are a lot of "Merging rows..." messages in console output.
2018-09-24 11:43:47 +02:00
Zdenko Podobný
01cf7402df add header guard 2018-09-22 18:44:26 +02:00
zdenop
02f9d8d95e
Merge pull request #1923 from stweil/errhandling
Don't trigger a deliberate SIGSEGV for fatal errors in release code
2018-09-20 21:58:45 +02:00
zdenop
63674d3285 Merge branch 'master' of https://github.com/tesseract-ocr/tesseract 2018-09-20 21:58:24 +02:00
Stefan Weil
5338a5a8d5 Don't trigger a deliberate SIGSEGV for fatal errors in release code
The error message "segmentation fault" confuses most users,
so enforce a segmentation fault only in debug code.

Release code simply calls the abort function.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-20 21:50:13 +02:00
zdenop
4ca179d3fa remove condition because fontsize is always > 0 2018-09-20 21:48:44 +02:00
zdenop
cefb62b644
Merge pull request #1920 from stweil/errhandling
Don't call exit when parameter in file is unknown
2018-09-20 10:37:38 +02:00
Stefan Weil
741ea00d70 Don't call exit when parameter in file is unknown
Wrong or old parameters in traineddata files should not terminate
the program, so make that a warning instead of a fatal error.

This fixes issue #1520.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-20 08:37:33 +02:00
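The change in behaviour can be sketched with a small parameter reader. Everything here is hypothetical (function name, file format of "name value" pairs, message text); it only illustrates downgrading an unknown parameter from a fatal error to a warning.

```cpp
#include <cstdio>
#include <map>
#include <sstream>
#include <string>

// Hypothetical sketch: reads whitespace-separated "name value" pairs and
// updates known parameters. Returns the number of unknown names seen.
inline int ApplyParams(std::istream& in,
                       std::map<std::string, std::string>* known) {
  int unknown = 0;
  std::string name, value;
  while (in >> name >> value) {
    auto it = known->find(name);
    if (it == known->end()) {
      // Warn and continue instead of terminating the program.
      std::fprintf(stderr, "Warning: unknown parameter %s\n", name.c_str());
      ++unknown;
      continue;
    }
    it->second = value;
  }
  return unknown;
}
```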
Stefan Weil
d586b97854 Remove duplicate include statements
One of them was reported in issue #1843.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-19 22:33:29 +02:00
Zdenko Podobný
5d22fdfeed replace deprecated C++ headers (reported by clang-tidy) - partially supersedes PR #1605 2018-09-18 18:51:11 +02:00
zdenop
62a5e8cfc3
Merge pull request #1265 from picturae/jpg_quality_option
Added JPEG quality option parameter (-c jpg_quality=n)
2018-09-18 11:37:37 +02:00
Jeff Breidenbach
c98391d3d7 fix #1192 bbox as the entire page 2018-09-18 08:09:11 +02:00
David Thornley
92e291250a Fix missing default parameter value causing compile to fail. 2018-09-14 09:56:06 +02:00
David Thornley
31aeb534d9 Fix merge conflicts
Merge branch 'master' into jpg_quality_option

* master: (577 commits)
  fix issue #1889
  Add badges for download , licence and lgtm
  Replace macro MINGW by __MINGW32__
  EquationDetectBase: Define virtual destructor in .cpp file
  BlobGrid: Define virtual destructor in .cpp file
  GridBase: Define virtual destructor in .cpp file
  AlignedBlob: Define virtual destructor in .cpp file
  TransposedArray: Define virtual destructor in .cpp file
  IndexMapBiDi: Define virtual destructor in .cpp file
  Add missing include file (fixes linker error for Visual Studio)
  NthItemTest: Add definition for virtual destructor
  HeapTest: Add definition for virtual destructor
  IcuErrorCode: Define virtual destructor in .cpp file
  Validator: Define virtual destructor in .cpp file
  Dawg: Define virtual destructor in .cpp file
  CUtil: Define virtual destructor in .cpp file
  IndexMap: Define virtual destructor in .cpp file
  CCUtil: Define virtual destructor in .cpp file
  MATRIX: Define virtual destructor in .cpp file
  CCStruct: Define virtual destructor in .cpp file
  ...
2018-09-13 16:03:24 +02:00
Zdenko Podobný
59e42fcef6 fix issue #1889 2018-09-13 07:26:37 +02:00
Stefan Weil
be1393b1e8 Replace macro MINGW by __MINGW32__
MINGW is no longer used and now removed from configure.ac.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 16:05:27 +02:00
Stefan Weil
4fa2a34577 EquationDetectBase: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/textord/equationdetectbase.h:32:7: warning:
 'EquationDetectBase' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 13:14:29 +02:00
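The pattern used in this commit and in the similar -Wweak-vtables commits in this log can be sketched in a single file. The class names are hypothetical; in a real project the two marked parts would live in a header and a source file.

```cpp
#include <memory>

// --- shape.h (hypothetical header) ---
class Shape {
 public:
  virtual ~Shape();  // declared here, defined out of line
  virtual int Sides() const = 0;
};

// --- shape.cpp (hypothetical source file) ---
// The out-of-line definition gives the class one non-inline virtual
// function, so the vtable is emitted in this translation unit only.
Shape::~Shape() = default;

class Triangle : public Shape {
 public:
  int Sides() const override { return 3; }
};

inline std::unique_ptr<Shape> MakeTriangle() {
  return std::unique_ptr<Shape>(new Triangle);
}
```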
Stefan Weil
f29a949649 BlobGrid: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/textord/blobgrid.h:33:7: warning:
 'BlobGrid' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 13:12:29 +02:00
Stefan Weil
b3206d94b5 GridBase: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/textord/bbgrid.h:53:7: warning:
 'GridBase' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 13:11:35 +02:00
Stefan Weil
677198e399 AlignedBlob: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/textord/alignedblob.h:81:7: warning:
 'AlignedBlob' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 13:10:44 +02:00
Stefan Weil
c9d8e5e8bf TransposedArray: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/lstm/weightmatrix.h:33:7: warning:
 'TransposedArray' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 13:09:49 +02:00
Stefan Weil
94d227bc77 IndexMapBiDi: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/ccutil/indexmapbidi.h:102:7: warning:
 'IndexMapBiDi' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 13:08:29 +02:00
Stefan Weil
319de30814 Add missing include file (fixes linker error for Visual Studio)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 12:22:57 +02:00
Stefan Weil
46d2273e82 IcuErrorCode: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/training/icuerrorcode.h:44:7: warning:
 'IcuErrorCode' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 12:11:23 +02:00
Stefan Weil
68bcd6ba90 Validator: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/training/validator.h:72:7: warning:
 'Validator' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:48:43 +02:00
Stefan Weil
0d211f9ed5 Dawg: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/dict/dawg.h:119:7: warning:
 'Dawg' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:47:29 +02:00
Stefan Weil
ac8afc57bb CUtil: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/cutil/cutil_class.h:27:7: warning:
 'CUtil' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:46:31 +02:00
Stefan Weil
32098b7d4d IndexMap: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/ccutil/indexmapbidi.h:102:7: warning:
 'IndexMapBiDi' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:45:28 +02:00
Stefan Weil
5b8162f0ef CCUtil: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/ccutil/ccutil.h:51:7: warning:
 'CCUtil' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:44:27 +02:00
Stefan Weil
14c23c9f13 MATRIX: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/ccstruct/matrix.h:575:7: warning:
 'MATRIX' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:43:17 +02:00
Stefan Weil
bde8f08003 CCStruct: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/ccstruct/ccstruct.h:25:7: warning:
 'CCStruct' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:42:06 +02:00
Stefan Weil
8317371f24 LTRResultIterator 2018-09-04 07:39:34 +02:00
Stefan Weil
b612c2c53d SVEventHandler 2018-09-04 07:39:20 +02:00
Stefan Weil
1c9bd51de8 SVEventHandler: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/viewer/scrollview.h:86:7: warning:
 'SVEventHandler' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:35:30 +02:00
Stefan Weil
8e55146938 MutableIterator: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/ccmain/mutableiterator.h:44:7: warning:
 'MutableIterator' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:35:30 +02:00
Stefan Weil
d926655cfe LTRResultIterator: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

src/ccmain/ltrresultiterator.h:48:16: warning:
 'LTRResultIterator' has no out-of-line virtual method definitions;
 its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:35:30 +02:00
Stefan Weil
c635cdf5d5 Do not define or use macro __UNIX__
Either it was not needed, or it could be replaced by checking
for not _WIN32.

This fixes a compiler warning from clang:

src/ccutil/platform.h:41:9: warning:
 macro name is a reserved identifier [-Wreserved-id-macro]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:34:11 +02:00
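A sketch of the replacement check, assuming code that previously tested a project-defined `__UNIX__` macro. The function is hypothetical; `_WIN32` is predefined by compilers targeting Windows, so no reserved identifier has to be defined by the project.

```cpp
#include <string>

// Hypothetical helper: everything that is not Windows takes the
// POSIX-style branch.
inline std::string PathSeparator() {
#ifndef _WIN32
  return "/";
#else
  return "\\";
#endif
}
```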
Stefan Weil
9f8ed31a26 api/pdfrenderer.cpp: Fix compiler warning
Compiler warning from clang:

src/api/pdfrenderer.cpp:848:28: warning:
 cast from 'const char *' to 'char *' drops const qualifier [-Wcast-qual]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 12:32:35 +02:00
Stefan Weil
08e25d41b7 textord/cjkpitch: Fix mismatch between format string and argument
size_t would require a different format string. Here an unsigned int
is sufficient in both cases, so use that.

This error was found by lgtm, see
https://lgtm.com/projects/g/tesseract-ocr/tesseract/.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 12:17:47 +02:00
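The two usual ways to fix such a mismatch are sketched below with hypothetical helper names. The first follows the approach named in the commit message (use an unsigned int where the value range allows it); the second keeps size_t and uses the matching `%zu` conversion.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Variant 1: convert to unsigned int and print with %u.
// Assumes the value fits into unsigned int.
inline std::string FormatCountUnsigned(std::size_t count) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%u", static_cast<unsigned>(count));
  return buf;
}

// Variant 2: keep size_t and use the matching %zu conversion (C99 / C++11).
inline std::string FormatCountSizeT(std::size_t count) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%zu", count);
  return buf;
}
```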
Stefan Weil
2cc7839af7 textord/makerow.cpp: Fix compiler warnings
Compiler warnings from clang:

src/textord/makerow.cpp:2579:36: warning:
 cast from 'const void *' to 'BLOBNBOX **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2581:36: warning:
 cast from 'const void *' to 'BLOBNBOX **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2601:31: warning:
 cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2603:31: warning:
 cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2623:31: warning:
 cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]
src/textord/makerow.cpp:2625:31: warning:
 cast from 'const void *' to 'TO_ROW **' drops const qualifier [-Wcast-qual]

Warning from lgtm:

Local variable 'blob' hides a parameter of the same name.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 12:05:56 +02:00
Stefan Weil
e74c88a4d3 ccstruct/werd.cpp: Fix compiler warnings
Compiler warnings from clang:

src/ccstruct/werd.cpp:128:4: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/werd.cpp:394:18: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/werd.cpp:394:27: warning:
 cast from 'const void *' to 'WERD **' drops const qualifier [-Wcast-qual]
src/ccstruct/werd.cpp:395:18: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/werd.cpp:395:27: warning:
 cast from 'const void *' to 'WERD **' drops const qualifier [-Wcast-qual]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 11:42:59 +02:00
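Both warning kinds listed here typically come from qsort comparators. A const-correct comparator can be sketched as follows; `Item` is a hypothetical element type, not one of the Tesseract classes.

```cpp
#include <cstddef>
#include <cstdlib>

struct Item {
  int key;
};

// Comparator for an array of Item pointers. static_cast replaces the
// old-style cast, and the target type keeps every const qualifier, so
// neither -Wold-style-cast nor -Wcast-qual is triggered.
static int CompareItems(const void* a, const void* b) {
  const Item* const* pa = static_cast<const Item* const*>(a);
  const Item* const* pb = static_cast<const Item* const*>(b);
  // Returns -1, 0 or 1 without risking integer overflow from subtraction.
  return ((*pa)->key > (*pb)->key) - ((*pa)->key < (*pb)->key);
}

inline void SortItems(Item** items, std::size_t n) {
  std::qsort(items, n, sizeof(*items), CompareItems);
}
```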
Stefan Weil
4934b2e8eb ccstruct/polyblk.cpp: Fix compiler warnings
Compiler warnings from clang:

src/ccstruct/polyblk.cpp:194:16: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:195:16: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:292:45: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:30:9: warning:
 macro is not used [-Wunused-macros]
src/ccstruct/polyblk.cpp:348:8: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:358:12: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:362:26: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:383:21: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:383:36: warning:
 cast from 'const void *' to 'ICOORDELT **' drops const qualifier [-Wcast-qual]
src/ccstruct/polyblk.cpp:384:21: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/polyblk.cpp:384:36: warning:
 cast from 'const void *' to 'ICOORDELT **' drops const qualifier [-Wcast-qual]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 11:42:59 +02:00
Stefan Weil
4f32b8fd05 ccstruct/ocrblock.cpp: Fix compiler warnings
Compiler warnings from clang:

src/ccstruct/ocrblock.cpp:74:12: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/ocrblock.cpp:74:21: warning:
 cast from 'const void *' to 'ROW **' drops const qualifier [-Wcast-qual]
src/ccstruct/ocrblock.cpp:75:16: warning:
 cast from 'const void *' to 'ROW **' drops const qualifier [-Wcast-qual]
src/ccstruct/ocrblock.cpp:75:7: warning:
 use of old-style cast [-Wold-style-cast]

Make also the function decreasing_top_order a local function as it is
only used locally and remove its global declarations (2 locations).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 11:42:59 +02:00
Stefan Weil
59b637efcf ccstruct/mod128.cpp: Fix compiler warnings
Compiler warnings from clang:

src/ccstruct/mod128.cpp:57:15: warning:
 no previous extern declaration for non-static variable 'dirtab' [-Wmissing-variable-declarations]
src/ccstruct/mod128.cpp:57:24: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/mod128.cpp:57:35: warning:
 cast from 'const short *' to 'ICOORD *' drops const qualifier [-Wcast-qual]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 11:42:59 +02:00
Stefan Weil
2a61f6dfcd Fix compiler warnings in c_blob_comparator and make it a local function
Compiler warnings from clang:

src/ccstruct/genblob.cpp:34:20: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/genblob.cpp:34:32: warning:
 cast from 'const void *' to 'C_BLOB **' drops const qualifier [-Wcast-qual]
src/ccstruct/genblob.cpp:35:20: warning:
 use of old-style cast [-Wold-style-cast]
src/ccstruct/genblob.cpp:35:32: warning:
 cast from 'const void *' to 'C_BLOB **' drops const qualifier [-Wcast-qual]

The function c_blob_comparator is only used in fixspace.cpp,
so move it to that file, make it a local function, and remove
genblob.cpp and genblob.h which are no longer needed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 10:59:11 +02:00
Stefan Weil
69a111a739 Clean use of qsort function sort_floats
It is only used in textord/topitch.cpp, so move it into that file.

Also remove the inline attribute, as it has no effect here, and
update the type casts to fix some compiler warnings from clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-31 23:17:27 +02:00
Shree Devi Kumar
70daecf267 Javanese Validation works now - for the most part 2018-08-27 21:00:35 +00:00
Shree Devi Kumar
3e8e338c06 taking as kCOnsonant in validate_grapheme 2018-08-27 12:09:34 +00:00
Shree Devi Kumar
a6c6b34bac Workaround for Javanese Aksara's Taling, do not label it as a combiner 2018-08-27 12:09:34 +00:00
Noah Metzger
f7663c69f6 Added detailed value description for glyph_confidences parameter
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-08-27 10:52:15 +02:00
Stefan Weil
7a2f8d9010 Move class tesseract::File from training to ccutil
This allows using the class for unittests, too.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-25 18:16:46 +02:00
Stefan Weil
f24426cd1b Convert CRLF line endings to LF
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-23 18:18:15 +02:00
Stefan Weil
63965bd750 Fix new whitespace issues
- add linefeed after last line
- remove blanks at line endings

This fixes some warnings from clang:

src/training/validate_javanese.h:63:51: warning:
 no newline at end of file [-Wnewline-eof]
src/training/validate_javanese.cpp:269:26: warning:
 no newline at end of file [-Wnewline-eof]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-23 18:18:15 +02:00
Stefan Weil
b08966addf Fix assertion caused by access to default TBOX
Instead of adding an empty TBOX at the end of the box list,
that corner case is now handled by passing a nullptr (like
it was already done for the first box in the list).

This avoids the calls of BoxMissMetric with a TBOX
which raises an assertion there (b == 0).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-22 21:40:26 +02:00
Stefan Weil
7910a766fa Fix CID 1164567 (Dereference after null check)
It looks like the check cblob_ptr != nullptr is not needed.
If cblob_ptr were NULL, we would have seen crashes in compute_bounding_box.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-22 13:55:37 +02:00
Stefan Weil
f3c7a17df3 Fix CID 1395108 (Dereference after null check)
Let's hope that word->best_choice is never NULL.
Otherwise both the old and the new code would abort with SIGSEGV.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-22 13:55:37 +02:00
Stefan Weil
6092a8f865 Fix CID 1395109 (Logically dead code)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-22 13:55:37 +02:00
Stefan Weil
ac17663015 Fix CID 1395113 ('Constant' variable guards dead code)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-22 13:55:37 +02:00
Stefan Weil
7e9dfefc5c Fix CID 1395114 ('Constant' variable guards dead code)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-22 13:55:37 +02:00
Stefan Weil
99efc13de8 Fix CID 1395116 ('Constant' variable guards dead code)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-22 13:55:37 +02:00
Egor Pugin
621a8cd29d
Merge pull request #1851 from noahmetzger/winfix
Added the option for character accumulated glyph confidences.
2018-08-20 16:35:14 +03:00
Egor Pugin
1f3acca03a
Merge pull request #1850 from Shreeshrii/new-branch-name
add option --save_box_tiff to save box/tiff pairs with lstmf files
2018-08-20 12:39:52 +03:00
Noah Metzger
663be426f6 Added the option for character accumulated glyph confidences.
The parameter glyph_confidences is changed from bool to int.
An execution with value 1 outputs the hOCR file enriched with glyph confidences
for every timestep like before. An execution with value 2 outputs the timesteps
accumulated over the recognized characters.

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-08-20 10:43:58 +02:00
Shree Devi Kumar
43e3f24bb0 add variable --save_box_tiff to Save box/tiff pairs along with lstmf files. 2018-08-20 08:24:09 +00:00
Egor Pugin
115fe7662c
Merge pull request #1844 from Shreeshrii/new-branch-name
Updates to Javanese Script Validation and Training
2018-08-17 13:24:28 +03:00
zdenop
debe3da36d remove duplicate include 2018-08-16 20:50:28 +02:00
Shree Devi Kumar
b34cf9d424 Javanese script training 2018-08-16 12:15:10 +00:00
Stefan Weil
e1c387c9b3 Fix typo in comments and variable name
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-16 11:38:36 +00:00
Stefan Weil
bf33301114 Fix typo in function name
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-16 11:38:36 +00:00
Stefan Weil
641237495a Fix typo in comments and variable name
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-14 16:20:27 +02:00
Stefan Weil
95ed924d81 Fix typo in function name
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-14 16:20:27 +02:00
Stefan Weil
ce135de37c scrollview: Clean include statements
cstring was included twice (reported by Martin Strunz).
Use C++ header files and sort them alphabetically.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-14 13:12:51 +02:00
Zdenko Podobný
296309c1f1 remove duplicate include. Fixes #1837 2018-08-14 13:06:14 +02:00
Atsuyoshi Suzuki
4cda775d73 Revert Makefile.am to beta.2
tesserocr needs `osdetect.h'.
2018-08-06 23:21:20 +09:00
Shree Devi Kumar
7957288fd5 change validate javanese similar to indic 2018-08-04 09:43:53 +00:00
Shree Devi Kumar
f93f9e8a09 fix typo re Javanese 2018-08-03 14:33:24 +00:00
Shree Devi Kumar
0eb7be1cd1 Initial commit to add Aksara Jawa - Javanese script 2018-08-03 13:59:27 +00:00
Stefan Weil
6a0f8e8c07 ColPartition: Rename median_size_ -> median_height_
This implements a TODO. Also rename some related items.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-03 08:46:38 +02:00
Stefan Weil
8af80b7ba6 Fix ImageThresholder::OtsuThresholdRectToPix for OpenCL
The ThresholdRectToPix OpenCL kernel only supports 4 channels.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 22:49:28 +02:00
zdenop
c044b8c916
Merge pull request #1818 from stweil/psm
Fix potential crash with --psm 0 and use osd.traineddata automatically
2018-08-01 16:56:56 +02:00
zdenop
d22ca6bb06
Merge pull request #1817 from noahmetzger/winfix
Fix issue detected by Coverity Scan
2018-08-01 16:55:56 +02:00
Stefan Weil
27ce472666 Fix potential crash with --psm 0 and use osd.traineddata automatically
Page segmentation mode "OSD only" requires osd.traineddata,
so use it automatically.

Report a warning if the user specified a different language.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 16:52:37 +02:00
Noah Metzger
65997bed16 Fix issue detected by Coverity Scan
CID: 1340285 (Division or modulo by zero)

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-08-01 15:56:19 +02:00
zdenop
b23568f3d1
Merge pull request #1816 from noahmetzger/winfix
Fix issues detected by Coverity Scan
2018-08-01 14:45:00 +02:00
Noah Metzger
d28631a274 Fix issues detected by Coverity Scan
CID: 1164604 (Nesting level does not match indentation)

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-08-01 14:30:13 +02:00
Stefan Weil
6a28cce96b Fix whitespace issues
* Remove whitespace (blanks, tabs, cr) at line endings

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 13:19:52 +02:00
zdenop
3af2773d0e
Merge pull request #1814 from noahmetzger/winfix
Fix issue detected by Coverity Scan
2018-08-01 11:20:13 +02:00
Noah Metzger
2d96c66126 Fix issue detected by Coverity Scan
CID: 1164533 (Logically dead code)

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-08-01 10:30:52 +02:00
Stefan Weil
eb69dd0201 TessPDFRenderer: Improve robustness of API (issue #1804)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 09:11:04 +02:00
Egor Pugin
9ce4d05188
Merge pull request #1812 from noahmetzger/winfix
Fix issue reported by Coverity Scan
2018-07-31 13:52:05 +03:00
Noah Metzger
d4490af06d Fix issue reported by Coverity Scan
CID: 1375395 (Dereference after null check)

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-07-31 10:43:39 +02:00
zdenop
7d99cb4e28
Merge pull request #1811 from noahmetzger/winfix
Fix issue reported by Coverity Scan
2018-07-31 09:53:33 +02:00
Noah Metzger
83a4eb3b44 Fix issue reported by Coverity Scan
CID: 1391264 (Improper use of negative value)

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-07-31 09:43:30 +02:00
Stefan Weil
9cf170cb7a Revert "Change default width for images output by text2image"
This reverts commit fdc243b363 because
it caused a regression reported in issue #1798.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-27 07:29:30 +02:00
Stefan Weil
b19e69086c training: Add new flag --workspace_dir to tesstraining_utils.sh
By default, that script creates two new temporary directories with random
names in /tmp.

The new command line flag --workspace_dir PATH uses the given path as
a base directory for all temporary files.

That makes training results more reproducible (no random directory
names in log files).

Signed-off-by: Stefan Weil <stweil@ub-backup.bib.uni-mannheim.de>
2018-07-26 17:14:19 +02:00
Noah Metzger
91c7504a35 Added a feature to enrich the hOCR output with glyph confidences
By using the parameter -c glyph_confidences=true the user is able to enrich
the hOCR output with additional information. Tesseract then lists additionally
the timesteps with all glyphs that were considered with their confidence
for every timestep of the LSTM.

The format of the hOCR output is slightly changed: There is now a linebreak
after every word for better readability by humans.

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-07-25 18:18:58 +02:00
Stefan Weil
132c540c85 Increase limit for deserialization of large arrays
The last limit was still too small.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-21 11:10:09 +02:00
Stefan Weil
f577e292c2 Increase limit and add assertions for deserialization of large arrays
One of the checks was too restrictive, as lstmeval deserializes
char arrays with 14000000 elements, so raise the limit to 30000000.
That check was added in commit 992031e824.

Also add assertions which help find such problems in debug mode.

Signed-off-by: Stefan Weil <stweil@ub-backup.bib.uni-mannheim.de>
2018-07-20 11:47:49 +02:00
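As an illustration of the kind of check described above, here is a minimal sketch of a length-prefixed array reader that rejects implausibly large counts before allocating. The limit value is taken from the commit message; the names kMaxElements and ReadCharArray and the on-disk layout (a 32-bit count followed by the bytes) are assumptions made for this example, not Tesseract's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Upper bound from the commit message; the constant name is hypothetical.
constexpr std::uint32_t kMaxElements = 30000000;

// Reads a 32-bit element count followed by that many chars.
// Returns false on short reads or when the count exceeds the limit.
static bool ReadCharArray(std::FILE* fp, std::vector<char>* out) {
  std::uint32_t count = 0;
  if (std::fread(&count, sizeof(count), 1, fp) != 1) return false;
  // Reject corrupt or hostile sizes before allocating any memory.
  if (count > kMaxElements) return false;
  out->resize(count);
  return count == 0 || std::fread(out->data(), 1, count, fp) == count;
}
```

A limit that is too low makes valid files fail to load, which is the situation the commit describes; a limit that is too high weakens the protection against corrupt input.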
Stefan Weil
ca25d88538 Add missing execute permission for script files
It is needed for running the training tutorial on Linux.

The correct mode was lost when moving the files in
commit 104fe7931c.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-19 20:25:41 +02:00
Stefan Weil
b7b8dba5db LSTMTrainer: Use new serialization API
Also improve portability by using int32_t instead of int
for a serialized member variable.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 19:28:05 +02:00
Stefan Weil
1dcda1aa8a LSTMRecognizer: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 19:28:05 +02:00
Stefan Weil
45a7ccf2d2 LSTM: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 19:28:05 +02:00
Stefan Weil
f4449ba41a Convolve: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 19:28:05 +02:00
Stefan Weil
dfc3e9691f SquishedDawg: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 19:28:05 +02:00
Stefan Weil
6cf508960a UnicharAndFonts, Shape: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 17:31:37 +02:00
Stefan Weil
07b363fec0 MasterTrainer: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 17:29:10 +02:00
Stefan Weil
88b3d940be TessdataManager: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 17:28:13 +02:00
Stefan Weil
da0217fa75 STRING: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 17:17:22 +02:00
Stefan Weil
5e05f2cb84 IndexMap: Use new serialization API and optimize code
By changing the type of sparse_size_ from int to int32_t,
a local copy can be removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 17:12:44 +02:00
Stefan Weil
edff1d1882 BitVector: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 17:07:03 +02:00
Stefan Weil
bb6c0123cc ICOORD: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 17:02:12 +02:00
Stefan Weil
66bc012d27 UNICHARSET: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 16:22:02 +02:00
Stefan Weil
eb90068b5f RecodedCharID: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 16:22:01 +02:00
Stefan Weil
0ca7cdd2c8 WordFeature, ImageData: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 16:22:01 +02:00
Stefan Weil
7133a6f43c GENERIC_2D_ARRAY: Use new serialization API
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 16:22:01 +02:00
Stefan Weil
ea660f83a3 fontinfo: Use new serialization API and optimize code
Combine several calls of Serialize in write_spacing_info and in write_set.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 16:22:01 +02:00
zdenop
daba37f4d4
Merge pull request #1784 from stweil/serialize
Simplify API for serialization and add first users
2018-07-18 15:54:05 +02:00
Stefan Weil
6ef267c432 Use TFile::Serialize, TFile::DeSerialize
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 11:19:37 +02:00
Stefan Weil
c383b1aaca TFile: Add helper functions for serialization of simple data types
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 11:19:37 +02:00
Stefan Weil
bdd2a7aedc Use tesseract::Serialize, tesseract::DeSerialize
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 11:19:37 +02:00
Stefan Weil
16832f9878 Add helper functions for serialization of simple data types
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 11:19:37 +02:00
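Helper functions of this kind can be sketched as a pair of small templates over fwrite and fread. This is only an illustration under stated assumptions: the namespace sketch, the signatures and the use of std::FILE are hypothetical and are not confirmed to match the helpers added by this commit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <type_traits>

namespace sketch {  // hypothetical namespace, not tesseract::

// Writes count objects of a simple (trivially copyable) type to fp.
template <typename T>
bool Serialize(std::FILE* fp, const T* data, std::size_t count = 1) {
  static_assert(std::is_trivially_copyable<T>::value,
                "only simple data types can be written as raw bytes");
  return std::fwrite(data, sizeof(T), count, fp) == count;
}

// Reads count objects of a simple (trivially copyable) type from fp.
template <typename T>
bool DeSerialize(std::FILE* fp, T* data, std::size_t count = 1) {
  static_assert(std::is_trivially_copyable<T>::value,
                "only simple data types can be read as raw bytes");
  return std::fread(data, sizeof(T), count, fp) == count;
}

}  // namespace sketch
```

With helpers like these a call site no longer spells out the element size, and using a fixed-width type such as int32_t for serialized members keeps the file layout independent of the platform's int width. Byte order is not handled in this sketch.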
Stefan Weil
216c2b31e7 Fix typo and add TODO comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 09:58:39 +02:00
Stefan Weil
2b6a356cba IntFeatureSpace: Remove unused DeSerialize method
The Serialize method is used indirectly by MasterTrainer::Serialize,
but there is no corresponding MasterTrainer::DeSerialize.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 09:56:43 +02:00
Stefan Weil
cfd72ff31e Fix --print-parameters (regression)
Commit 629ded223c had broken that
functionality.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-09 14:42:48 +02:00