115 Commits

Author SHA1 Message Date
Stefan Weil
36f768853a Modernize C++ code using override
The modifications were done using this command:

    run-clang-tidy-8.py -header-filter='.*' -checks='-*,modernize-use-override' -fix

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-26 07:37:52 +01:00
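As an illustration of what `modernize-use-override` changes, here is a minimal sketch. The classes `Shape` and `Circle` and the function `Describe` are hypothetical and not taken from the Tesseract sources.

```cpp
#include <string>

// Hypothetical base class, for illustration only.
class Shape {
 public:
  virtual ~Shape() = default;
  virtual std::string Name() const { return "shape"; }
};

class Circle : public Shape {
 public:
  // Before the fix: virtual std::string Name() const;
  // After the fix: 'override' makes the compiler reject the declaration
  // if it no longer matches a virtual function of the base class.
  std::string Name() const override { return "circle"; }
};

// Calls the virtual function through a base-class reference.
std::string Describe(const Shape& shape) { return shape.Name(); }
```

Calling `Describe(Circle())` dispatches to `Circle::Name`. The benefit of `override` is that a later signature change in `Shape` becomes a compile error instead of a silent behavior change.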
Stefan Weil
631882a346 Fix compiler warnings (signed / unsigned mismatch)
clang warnings:

    src/ccutil/unicharcompress.cpp:172:27: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
    src/lstm/recodebeam.cpp:129:29: warning: comparison of integers of different signs: 'std::__cxx1998::vector::size_type' (aka 'unsigned long') and 'int' [-Wsign-compare]
    src/lstm/recodebeam.cpp:276:48: warning: comparison of integers of different signs: 'std::__cxx1998::vector::size_type' (aka 'unsigned long') and 'int' [-Wsign-compare]
    unittest/imagedata_test.cc:101:21: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
    unittest/linlsq_test.cc:33:23: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
    unittest/linlsq_test.cc:44:23: warning: comparison of integers of different signs: 'int' and 'std::__cxx1998::vector::size_type' (aka 'unsigned long') [-Wsign-compare]
    unittest/nthitem_test.cc:27:23: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
    unittest/nthitem_test.cc:68:21: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]
    unittest/stats_test.cc:26:23: warning: comparison of integers of different signs: 'int' and 'unsigned long' [-Wsign-compare]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-25 08:36:07 +01:00
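The commit does not say how each location was changed. One common way to silence `-Wsign-compare` in a loop over a vector is to give the index the same unsigned type as `size()`. The function `SumAll` below is a hypothetical example, not code from the repository.

```cpp
#include <cstddef>
#include <vector>

// Sums all elements of a vector (hypothetical helper).
int SumAll(const std::vector<int>& values) {
  int sum = 0;
  // Before: for (int i = 0; i < values.size(); ++i)  -> -Wsign-compare
  // After: the index type matches the unsigned type returned by size().
  for (std::size_t i = 0; i < values.size(); ++i) {
    sum += values[i];
  }
  return sum;
}
```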
Stefan Weil
f9860cda41 Optimize functions ResetFrom
The loop can terminate as soon as the parameter name was found.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 21:21:23 +01:00
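The idea of stopping the search at the first match can be sketched as follows. The `Param` struct and the function `ResetFromByName` are assumptions made for this example and do not reflect the actual `ResetFrom` signatures.

```cpp
#include <cstring>
#include <vector>

// Hypothetical parameter record: a name and an integer value.
struct Param {
  const char* name;
  int value;
};

// Copies the value of the first entry whose name matches into *out.
// Returns true if a match was found, false otherwise.
bool ResetFromByName(const std::vector<Param>& params, const char* name,
                     int* out) {
  for (const Param& p : params) {
    if (std::strcmp(p.name, name) == 0) {
      *out = p.value;
      return true;  // Name found: no need to scan the remaining entries.
    }
  }
  return false;
}
```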
Stefan Weil
41da5afe9d UNICHARSET: Fix compiler warning (signed/unsigned mismatch)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 21:18:21 +01:00
Stefan Weil
91e2b253c0 Format modified code with clang-format
Format the files which were changed in
commit 297d7d86ce.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 21:10:29 +01:00
Stefan Weil
58423d2f6c
Merge pull request #2328 from bertsky/lstm-with-user-patterns2
Add user words / patterns again
2019-03-24 19:38:40 +01:00
Stefan Weil
da6305b632 Fix compiler warnings caused by macro ASSERT_HOST
The modified definition avoids warnings caused by redundant semicolons.
Now a semicolon is required when using the macro, so a few code locations
had to be updated.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-24 17:47:04 +01:00
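A widely used idiom for a macro that requires a trailing semicolon is to wrap its body in `do { ... } while (0)`. The sketch below uses a hypothetical macro `CHECK_HOST`. It is not the actual `ASSERT_HOST` definition, which the commit message does not show.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical macro, for illustration only. The do/while(0) wrapper turns
// the expansion into a single statement that must be ended with ';'.
#define CHECK_HOST(cond)                                  \
  do {                                                    \
    if (!(cond)) {                                        \
      std::fprintf(stderr, "check failed: %s\n", #cond);  \
      std::abort();                                       \
    }                                                     \
  } while (0)

// Returns x for non-negative input and 0 for (moderately) negative input.
int ClampToNonNegative(int x) {
  if (x < 0)
    CHECK_HOST(x > -1000);  // The semicolon is required here...
  else
    return x;               // ...and the else branch still parses correctly.
  return 0;
}
```

If the macro body ended with its own semicolon or a bare `{ ... }` block, the `CHECK_HOST(...);` call would expand to an extra empty statement, which is the kind of redundant semicolon that triggers such warnings.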
Stefan Weil
ee2f9bf7bf Remove old comments in file headers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 10:55:00 +01:00
Robert Schubert
297d7d86ce trying to add user words/patterns again:
- pass in ParamsVectors from Tesseract
  (carrying values from langdata/config/api)
  into LSTMRecognizer::Load and LoadDictionary
- after LSTMRecognizer's Dict is initialised
  (with default values), reset the variables
  user_{words,patterns}_{suffix,file} from the
  corresponding entries in the passed vector
2019-03-15 16:06:19 +01:00
Stefan Weil
1c7e00611b Add initial support for traineddata files in standard archive formats
This requires libarchive-dev.

Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:

    $ unzip -l /usr/local/share/tessdata/zip.traineddata
    Archive:  /usr/local/share/tessdata/zip.traineddata
      Length      Date    Time    Name
    ---------  ---------- -----   ----
           55  2019-03-05 15:27   bagit.txt
            0  2019-03-05 15:25   data/
         1557  2019-03-05 15:28   manifest-sha256.txt
      1082890  2019-03-05 15:25   data/eng.word-dawg
      1487588  2019-03-05 15:25   data/eng.lstm
         7477  2019-03-05 15:25   data/eng.unicharset
        63346  2019-03-05 15:25   data/eng.shapetable
       976552  2019-03-05 15:25   data/eng.inttemp
        13408  2019-03-05 15:25   data/eng.normproto
         4322  2019-03-05 15:25   data/eng.punc-dawg
         4738  2019-03-05 15:25   data/eng.lstm-number-dawg
         1410  2019-03-05 15:25   data/eng.freq-dawg
          844  2019-03-05 15:25   data/eng.pffmtable
         6360  2019-03-05 15:25   data/eng.lstm-unicharset
         1012  2019-03-05 15:25   data/eng.lstm-recoder
         1047  2019-03-05 15:25   data/eng.unicharambigs
         4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
     16109842  2019-03-05 15:25   data/eng.bigram-dawg
           80  2019-03-05 15:25   data/eng.version
         6426  2019-03-05 15:25   data/eng.number-dawg
      3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
    ---------                     -------
     23468070                     21 files

`combine_tessdata -d` and `combine_tessdata -u` also work.

The traineddata files in the new format can be generated with
standard tools like zip or tar.

More work is needed for other training tools and big endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-05 17:18:48 +01:00
Stefan Weil
2cbe723d03 Fix doxygen comments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-20 21:11:38 +01:00
Stefan Weil
38861be639 Use __builtin_trap instead of null pointer dereference to abort
This fixes a warning from Apple's clang compiler:

    [ 34%] Building CXX object CMakeFiles/libtesseract.dir/src/ccutil/errcode.cpp.o
    /Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: warning: indirection of non-volatile null pointer will be deleted, not trap [-Wnull-dereference]
          *reinterpret_cast<int*>(0) = 0;
          ^~~~~~~~~~~~~~~~~~~~~~~~~~
    /Users/travis/build/stweil/tesseract/src/ccutil/errcode.cpp:83:7: note: consider using __builtin_trap() or qualifying pointer with 'volatile'

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-18 10:49:51 +01:00
Stefan Weil
2a355ea103 Fix compiler warnings (-Wimplicit-fallthrough)
gcc warnings:

    src/ccmain/docqual.cpp:734:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
    src/ccmain/docqual.cpp:764:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
    src/ccmain/docqual.cpp:782:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
    [...]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:20 +01:00
Stefan Weil
aa2dcca295 Fix compiler warnings (-Wstringop-truncation)
gcc warnings:

    src/api/tesseractmain.cpp:252:14: warning:
        ‘char* strncpy(char*, const char*, size_t)’ specified bound 255
        equals destination size [-Wstringop-truncation]
    src/ccutil/unicharset.h:66:12: warning:
        ‘char* strncpy(char*, const char*, size_t)’ output may be truncated copying 30 bytes from a string of length 30 [-Wstringop-truncation]
    src/ccutil/unicharset.cpp:806:12: warning:
        ‘char* strncpy(char*, const char*, size_t)’ specified bound 64 equals destination size [-Wstringop-truncation]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:09 +01:00
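The commit message does not show how the calls were rewritten. One common way to avoid a `strncpy` bound that equals the destination size is to copy at most one byte less and terminate explicitly. The helper `CopyTruncated` is hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Copies src into dest (capacity dest_size bytes), truncating if needed,
// and always leaves dest NUL-terminated. Hypothetical helper.
void CopyTruncated(char* dest, std::size_t dest_size, const char* src) {
  if (dest_size == 0) return;
  // Leave room for the terminator instead of using the full size as bound.
  std::strncpy(dest, src, dest_size - 1);
  dest[dest_size - 1] = '\0';
}
```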
Stefan Weil
e2419b1968 Fix potential crash in tprintf
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
6b6d9de497 Fix potential crash in STRING class
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
9b783822a0 Remove unused include statements for tprintf.h
Format also a call of tprintf and add a missing explicit include statement.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-18 17:25:01 +01:00
Stefan Weil
a93426c9ff Fix wrong results from function streamtofloat
The local variable k should be 10 ^ (number of digits after comma),
but will overflow when there are more than 9 digits after the comma
because an int value cannot store 10000000000.

This results in wrong double values read from .tr files for example
(or in a runtime exception if Tesseract was compiled with -ftrapv).

Using uint64_t does not fix the general problem but allows more digits
which should be sufficient for the data read by Tesseract.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-17 20:02:21 +01:00
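The overflow described above can be illustrated with a small sketch. `ParseFraction` is a hypothetical helper, not the actual `streamtofloat` code. It only shows why the scale factor `k` needs a wider type than `int`.

```cpp
#include <cstdint>
#include <string>

// Converts the digits after the decimal separator into a fraction,
// e.g. "25" -> 0.25. Hypothetical helper, for illustration only.
double ParseFraction(const std::string& digits) {
  std::uint64_t value = 0;
  // k is 10^(number of digits read). A 32-bit int overflows once k would
  // reach 10000000000 (10 digits); uint64_t has room for 19 digits.
  std::uint64_t k = 1;
  for (char c : digits) {
    if (c < '0' || c > '9') break;
    value = value * 10 + static_cast<std::uint64_t>(c - '0');
    k *= 10;
  }
  return static_cast<double>(value) / static_cast<double>(k);
}
```

As the commit message notes, a wider integer only raises the limit. Input with more than 19 fractional digits would still wrap around in this sketch.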
zdenop
724957167e fix typo in non VS build
2018-11-08 23:10:14 +01:00
zdenop
eb104f9fe4 VS build: fix warning C4996: The POSIX name for this item is deprecated. Instead, use the ISO C and C++ conformant name.
2018-11-08 22:55:04 +01:00
zdenop
7a7f226228 ocrclass: Remove unused macros
Signed-off-by: Stefan Weil <sw@weilnetz.de>

# Conflicts:
#	src/ccutil/ocrclass.h
2018-11-08 20:23:36 +01:00
Zdenko Podobný
2dd753ee4c replace VS implementation of gettimeofday with std::chrono::steady_clock::now(); fixes #2038
2018-11-08 19:43:46 +01:00
chrismamo1
30be5aaaac fix a couple minor compiler warnings
2018-10-30 18:00:32 -06:00
Stefan Weil
eefb8348f7 Fix compiler warning
Compiler warning on macOS:

    tesscallback.h:29:7: warning:
      'TessClosure' has no out-of-line virtual method definitions;
      its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-23 17:01:53 +02:00
Stefan Weil
9c0799314e Add parentheses in boolean expression
This fixes a compiler warning:

    scanutils.cpp:444:32: warning:
        '&&' within '||' [-Wlogical-op-parentheses]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
Stefan Weil
0f973e1d62 Add missing 'static' keyword
This fixes a compiler warning:

    globaloc.cpp:33:6: warning: no previous extern declaration for
      non-static variable 'global_crash_pixes'
      [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
Stefan Weil
a71ad455be Remove unused macros
This fixes some compiler warnings:

    mainblk.cpp:28:9: warning: macro is not used [-Wunused-macros]
    mainblk.cpp:29:9: warning: macro is not used [-Wunused-macros]
    [...]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
Zdenko Podobný
67b6b02e2d Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract:
  Remove code for _MSC_VER < 1900
  keep API compatibility with #1265
  Update googletest submodule to release v1.8.1
  Update test submodule
  Always use isascii() with isspace()
  Avoid crash with --psm 0 and LSTM traineddata
  SVPaint: Remove empty block
  Classify: Don't hide debug parameter
  UNICHARMAP: Remove comparison which is always false
  svpaint: Change a variable from global to local
  pgedit: remove unused declaration of display_bln_lines
  Plumbing: Remove comparison which is always false
  Release candidate 2
  use pdf L_FLATE_ENCODE only for png input; fixes #1961
2018-10-09 15:37:40 +02:00
Stefan Weil
f94b3fd9fc Remove code for _MSC_VER < 1900
Tesseract does not support Visual C++ older than Visual Studio 2015.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-09 14:05:21 +02:00
Stefan Weil
dcd0377bf0 Always use isascii() with isspace()
isspace() must only be used with an unsigned char or EOF argument,
and even then its result can depend on the current locale settings.

While this is not a problem for C/C++ executables which use the default
"C" locale, it becomes a problem when the Tesseract API is called from
languages like Python or Java which don't use the "C" locale.

By calling isascii() before calling isspace() this uncertainty can be
avoided, because any locale will hopefully give identical results for
the basic ASCII character set.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 17:25:09 +02:00
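A minimal sketch of the guard described above follows. The function `IsAsciiSpace` is hypothetical, and the explicit range check stands in for POSIX `isascii()`, which is not part of standard C++.

```cpp
#include <cctype>

// Returns true only for whitespace in the basic ASCII range, so bytes
// above 0x7F never reach the locale-dependent isspace() lookup.
// Hypothetical helper; the range check replaces POSIX isascii().
bool IsAsciiSpace(char c) {
  // Convert to unsigned char first: passing a negative char to isspace()
  // is undefined behavior.
  const unsigned char uc = static_cast<unsigned char>(c);
  return uc < 0x80 && std::isspace(uc) != 0;
}
```

With this guard a byte such as 0xA0, which some single-byte locales classify as a space, is always reported as non-whitespace.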
Stefan Weil
30b75cfc05 UNICHARMAP: Remove comparison which is always false
Warning from LGTM:

    Comparison is always false because index <= 0 and 1 <= length.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-08 14:15:17 +02:00
Zdenko Podobný
8598731daf Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract: (27 commits)
  Rework check for readable input file
  fix "mktemp -d --tmpdir" on Mac OS; see #1453
  pgedit: Change some variables from global to local ones
  improve description of min_characters_to_try variable
  WERD_RES: Remove comparisons which are constant
  GENERIC_2D_ARRAY: Pass parameters by reference
  genericvector: Pass parameters by reference
  chop: Use more efficient float calculations for sqrt
  rect: Use more efficient float calculations for ceil, floor
  intproto: Use more efficient float calculations for floor
  genericvector: Rewrite code to satisfy static code analyzer
  Fix constructor for class Dict (uninitialized member variables)
  Fix use of wrong UNICHARSET
  lstmtraining: Remove dead code for purified model name
  combine_tessdata: Handle failures when extracting
  lstmtraining: Check write permission for output model
  implement parameter min_characters_to_try for minimum characters to try to skip page entirely. fixes #1729
  Merge and enhance documentation on language and script models
  Document some more config options for tesseract
  Add Makefile rule to build HTML manpages
  ...
2018-10-07 15:39:02 +02:00
Stefan Weil
a7982185c9 genericvector: Pass parameters by reference
This fixes warnings like the following one from LGTM:

    This parameter of type ParamsTrainingHypothesis is 112 bytes
    - consider passing a pointer/reference instead.

Most parameters can also get the const attribute.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 19:47:49 +02:00
Stefan Weil
06a8de0b8b genericvector: Rewrite code to satisfy static code analyzer
Warning from LGTM:

    Resource data_ is acquired by class GenericVector<FontSpacingInfo *>
    but not released in the destructor.

LGTM complains about data_ not being deleted in the destructor.
The destructor calls the clear() method, but the delete there
was conditional which confuses the static code analyzer.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 18:24:13 +02:00
Zdenko Podobný
dcc50a867f Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract:
  Fix CID 1164579 (Explicit null dereferenced)
  print help for tesstrain.sh; fixes #1469
  Fix CID 1395882 (Uninitialized scalar variable)
  Fix comments
  Move content of ipoints.h to points.h and remove ipoints.h
  remove duplicate help from combine_lang_model
  Fix typo.
  use tprintf instead of printf to be able disable messages by quiet option (issue #1240)
  add "sudo ldconfig" to install instruction. fixes #1212
  unittest: Replace NULL by nullptr
  unittest: Format code
  tesseract app: check if input file exists; fixes #1023
  Format code (replace ( xxx ) by (xxx))
  Simplify boolean expressions
  Win32: use the ISO C and C++ conformant name "_putenv" instead of deprecated "putenv"
2018-10-03 19:21:42 +02:00
Stefan Weil
04703ca8df Fix CID 1164579 (Explicit null dereferenced)
The report from Coverity Scan is a false positive.

Nevertheless the code can be rewritten and optimized
a little bit to fix that report.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:48:28 +02:00
Stefan Weil
0f3206d5fe Format code (replace ( xxx ) by (xxx))
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 08:21:25 +02:00
Stefan Weil
63f87cac90 Simplify boolean expressions
Remove "? true : false" which is not needed for boolean expressions.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 08:21:14 +02:00
Zdenko Podobný
bf6d929e4c fix using c-api / compile with gcc
2018-09-28 23:14:32 +02:00
Stefan Weil
5338a5a8d5 Don't trigger a deliberate SIGSEGV for fatal errors in release code
The error message "segmentation fault" confuses most users,
so enforce a segmentation fault only in debug code.

Release code simply calls the abort function.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-20 21:50:13 +02:00
Stefan Weil
741ea00d70 Don't call exit when parameter in file is unknown
Wrong or old parameters in traineddata files should not terminate
the program, so make that a warning instead of a fatal error.

This fixes issue #1520.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-20 08:37:33 +02:00
Zdenko Podobný
5d22fdfeed replace deprecated C++ headers (reported by clang-tidy) - partially supersedes PR #1605
2018-09-18 18:51:11 +02:00
Stefan Weil
94d227bc77 IndexMapBiDi: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

    src/ccutil/indexmapbidi.h:102:7: warning:
     'IndexMapBiDi' has no out-of-line virtual method definitions;
     its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 13:08:29 +02:00
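The pattern behind this fix and the two similar commits that follow can be sketched in a single file. The class `Mapper` is hypothetical. In a real project the class definition would live in a header and the destructor definition in exactly one .cpp file.

```cpp
// Hypothetical class, for illustration only (would normally be in a header).
class Mapper {
 public:
  // Declared here but defined out of line below. With an inline definition
  // the class would have no out-of-line virtual function, and the vtable
  // would be emitted in every translation unit that uses the class.
  virtual ~Mapper();
  virtual int Map(int x) const { return x; }
};

// Out-of-line definition (would normally be in the .cpp file). It anchors
// the vtable to the single translation unit that contains it.
Mapper::~Mapper() = default;
```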
Stefan Weil
32098b7d4d IndexMap: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

    src/ccutil/indexmapbidi.h:102:7: warning:
     'IndexMapBiDi' has no out-of-line virtual method definitions;
     its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:45:28 +02:00
Stefan Weil
5b8162f0ef CCUtil: Define virtual destructor in .cpp file
This fixes compiler warnings from clang:

    src/ccutil/ccutil.h:51:7: warning:
     'CCUtil' has no out-of-line virtual method definitions;
     its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:44:27 +02:00
Stefan Weil
c635cdf5d5 Do not define or use macro __UNIX__
Either it was not needed, or it could be replaced by checking
for not _WIN32.

This fixes a compiler warning from clang:

    src/ccutil/platform.h:41:9: warning:
     macro name is a reserved identifier [-Wreserved-id-macro]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 07:34:11 +02:00
Stefan Weil
69a111a739 Clean use of qsort function sort_floats
It is only used in textord/topitch.cpp, so move it into that file.

Also remove the inline attribute as it has no effect here and
update the type casts to fix some compiler warnings from clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-31 23:17:27 +02:00
Stefan Weil
7a2f8d9010 Move class tesseract::File from training to ccutil
This allows using the class for unittests, too.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-25 18:16:46 +02:00
Stefan Weil
6a28cce96b Fix whitespace issues
* Remove whitespace (blanks, tabs, cr) at line endings

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 13:19:52 +02:00
Stefan Weil
132c540c85 Increase limit for deserialization of large arrays
The last limit was still too small.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-21 11:10:09 +02:00