Commit Graph

300 Commits

Author SHA1 Message Date
Noah Metzger
21e25d1829 Fixed a memory corruption, detected by Coverity
CID 1385632 Out-of-bounds write in do-while loop

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-04-18 13:37:38 +02:00
Noah Metzger
d597570737 Fixed compiler warning
Warning C4996: 'access': The POSIX name for this item is deprecated. Instead, use the ISO C and C++ conformant name: _access.

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-04-13 11:31:50 +02:00
Noah Metzger
d88a6b5c19 Replace insecure _splitpath by secure _splitpath_s
Use the predefined macros for the lengths of drive, dir and path.
This avoids potential buffer overruns.
Also show an error message if the path is too long.

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-04-12 16:32:47 +02:00
Noah Metzger
b7b6b28ecf Fixed Tessdata directory for Windows
The old code ignored the drive letter for the tessdata directory path.

Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-04-11 16:57:38 +02:00
Stefan Weil
2cc46fa6d4 BITS16: Use inline code for all constructors (#1434)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-28 10:02:37 +02:00
Stefan Weil
9e74ed3730 Add IntCastRounded for float argument (#1433)
The method is called with a float argument several times, and the
previous implementation which only supported a double argument
resulted in type conversions and compiler warnings.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-27 14:34:06 +02:00
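
Note on 9e74ed3730: the idea of a separate float overload can be sketched as follows. This is a minimal illustration, not the code from the commit; the function bodies and the round-half-away-from-zero behaviour are assumptions. With both overloads present, a float argument is no longer promoted to double, which is the conversion that the commit message describes as producing warnings.

```cpp
#include <cassert>
#include <cmath>

// Sketch only: rounds to the nearest int, halves away from zero.
// The double version handles double arguments as before.
inline int IntCastRounded(double x) {
  assert(std::isfinite(x));  // casting NaN or infinity to int is undefined
  return x >= 0.0 ? static_cast<int>(x + 0.5) : -static_cast<int>(-x + 0.5);
}

// The float overload keeps the whole computation in single precision,
// so callers passing a float need no implicit float-to-double conversion.
inline int IntCastRounded(float x) {
  assert(std::isfinite(x));
  return x >= 0.0f ? static_cast<int>(x + 0.5f) : -static_cast<int>(-x + 0.5f);
}
```

A call such as IntCastRounded(2.5f) then selects the float overload, while IntCastRounded(2.4) still selects the double one.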
Stefan Weil
afcd4cbec1 Remove unused local variable max_num_strokes (#1417)
This fixes a compiler warning. The variable is unused since commit
0e95e2ca87.

Remove also a related code comment.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-25 17:26:28 +02:00
Stefan Weil
b94bbd6e83 Update version handling (#1408)
ccutil/version.h is now no longer needed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-22 21:49:47 +01:00
Stefan Weil
660b366401 Fix issues reported by Coverity Scan (#1409)
* Fix CID 1164532 'Constant' variable guards dead code

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Fix CID 1164594 Argument cannot be negative

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Fix CID 1164597 Argument cannot be negative

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Fix CID 1366447 Argument cannot be negative

Fix also the data type for current_pos, as ftell returns a long value.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Fix CID 1270404 Arguments in wrong order

This does not change the code, but should help Coverity Scan to see
that the argument order is as intended.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-22 20:00:01 +01:00
Stefan Weil
1694be9223 tessdatamanager: Use PACKAGE_VERSION instead of TESSERACT_VERSION_STR (#1407)
This allows further simplifications for the version handling.

Move the implementation for the constructors from .h file to .cpp file
to reduce dependencies.

Remove unneeded include statements from the .h file to reduce more
dependencies.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-22 07:43:35 +01:00
Stefan Weil
023e1b340e Use POSIX data types and macros (#878)
* api: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* ccmain: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* ccstruct: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* classify: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* cutil: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* dict: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* textord: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* training: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* wordrec: Replace Tesseract data types by POSIX data types

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* ccutil: Replace Tesseract data types by POSIX data types

Now all Tesseract data types which are no longer needed can be removed
from ccutil/host.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* ccmain: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* ccstruct: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* classify: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* dict: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* lstm: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* textord: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* wordrec: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* ccutil: Replace Tesseract's MIN_*INT, MAX_*INT* by POSIX *INT*_MIN, *INT*_MAX

Remove the macros which are now unused from ccutil/host.h.
Remove also the obsolete history comments.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Fix build error caused by ambiguous ClipToRange

Error message from Appveyor CI:

    C:\projects\tesseract\ccstruct\coutln.cpp(818): error C2672: 'ClipToRange': no matching overloaded function found [C:\projects\tesseract\build\libtesseract.vcxproj]
    C:\projects\tesseract\ccstruct\coutln.cpp(818): error C2782: 'T ClipToRange(const T &,const T &,const T &)': template parameter 'T' is ambiguous [C:\projects\tesseract\build\libtesseract.vcxproj]
      c:\projects\tesseract\ccutil\helpers.h(122): note: see declaration of 'ClipToRange'
      C:\projects\tesseract\ccstruct\coutln.cpp(818): note: could be 'char'
      C:\projects\tesseract\ccstruct\coutln.cpp(818): note: or       'int'

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* unittest: Replace Tesseract's MAX_INT8 by POSIX INT8_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* arch: Replace Tesseract's MAX_INT8 by POSIX INT8_MAX

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-13 21:36:30 +01:00
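
Note on the ClipToRange build error quoted in 023e1b340e: error C2782 appears when the three arguments of a single-parameter template do not share one type, so T could be 'char' or 'int'. The sketch below uses the signature shown in the error message; the body, the assert and the helper ClampToInt8 are hypothetical, and the actual call at coutln.cpp(818) is not shown in the log. One way to resolve such an ambiguity is to name the template argument explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Signature as quoted in the compiler error; the body is an assumption.
template <typename T>
inline T ClipToRange(const T& x, const T& lower_bound, const T& upper_bound) {
  assert(lower_bound <= upper_bound);
  if (x < lower_bound) return lower_bound;
  if (x > upper_bound) return upper_bound;
  return x;
}

// Hypothetical caller: naming T explicitly makes the compiler convert all
// three arguments to int instead of trying to deduce T from mixed types.
inline int8_t ClampToInt8(int value) {
  return static_cast<int8_t>(ClipToRange<int>(value, INT8_MIN, INT8_MAX));
}
```

Casting the arguments to a common type at the call site would work as well; the explicit template argument keeps the call shorter.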
Stefan Weil
47a326b02d Use POSIX data types for external interfaces (#1358)
Replace the Tesseract specific data types in header files which are
part of Debian package libtesseract-dev by POSIX data types.

Update also matching cpp files.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-13 19:01:40 +01:00
Stefan Weil
c6afad03b2 Fix compiler warning (-Wsign-compare) (#1385)
gcc reports this warning about 250 times:

ccutil/genericvector.h:378:48: warning:
 comparison between signed and unsigned integer expressions [-Wsign-compare]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-13 19:00:49 +01:00
Amit D
4b2bea79a5 Update TESSERACT_VERSION_STR (#1372) 2018-03-11 18:25:35 +01:00
Stefan Weil
960007e58e Fix compiler warning (possible loss of data) (#1370)
Fix 306 warnings from MS C:

tesseract\ccutil\unicharset.h(242): warning C4267:
 'argument': conversion from 'size_t' to 'int', possible loss of data

The change also avoids some type conversions.
2018-03-10 20:51:52 +01:00
Amit D
53f791ba8b Remove obsolete code (#1365)
MSVC 8.0 was released in 2005 and we don't support it.
2018-03-08 21:12:23 +01:00
Stefan Weil
7972b13e3a Remove macro USE_STD_NAMESPACE (#1360)
The related code in training/util.h now uses the GOOGLE_TESSERACT macro
to enable Google specific code to disable heap checking.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-04 14:43:28 +01:00
Stefan Weil
068d43d3d8 Remove old code for string class (no longer needed) (#1354)
* Remove old code for string class (no longer needed)

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Add std namespace to string class

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-03 14:36:28 +01:00
Stefan Weil
9035217acd Remove parameter m_data_sub_dir (#1356)
This further simplifies the finding of the tessdata directory.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-03-03 14:34:24 +01:00
Jeroen Ooms
c6e8916065 fixes for C++11 (#1164) 2018-02-28 15:42:33 +01:00
fifothekid
ad6f3b412a Fixed unqualified class "string" (#1082) 2018-02-28 15:16:23 +01:00
Stefan Weil
20b3ff8796 Fix some minor issues reported by Coverity Scan (#1321)
* Dereference pointer after NULL check (CID 1385638)

Move the statement which dereferences the pointer variable "current"
after the NULL check.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Dereference pointer after NULL check (CID 1385635)

Move the statement which dereferences the pointer variable "current"
after the NULL check.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Dereference pointer after NULL check (CID 1385634)

Move the statement which dereferences the pointer variable "current"
after the NULL check.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

* Fix CID 1164527 'Constant' variable guards dead code

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-02-18 15:22:59 +01:00
Stefan Weil
ebbfc3ae8d Improve robustness of function LoadDataFromFile (#1207)
ftell returns a long value which can be negative when an error occurred.
It returns LONG_MAX for directories.

Both cases were not handled by the old code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-11-10 15:46:38 +01:00
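
Note on ebbfc3ae8d: the two cases named in the commit message (a negative ftell result on error, and LONG_MAX for directories) can be checked before any buffer is sized. The helper below is a hypothetical sketch using std::vector, not Tesseract's LoadDataFromFile.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads a whole file into *data, returns false on failure.
inline bool LoadFile(const char* filename, std::vector<char>* data) {
  assert(filename != nullptr && data != nullptr);
  FILE* fp = std::fopen(filename, "rb");
  if (fp == nullptr) return false;
  bool ok = false;
  if (std::fseek(fp, 0, SEEK_END) == 0) {
    const long size = std::ftell(fp);  // long, not int: may be negative
    // Reject the error value and the LONG_MAX that the commit message
    // reports for directories, before using the size for an allocation.
    if (size >= 0 && size != LONG_MAX && std::fseek(fp, 0, SEEK_SET) == 0) {
      data->resize(static_cast<std::size_t>(size));
      ok = size == 0 ||
           std::fread(data->data(), 1, data->size(), fp) == data->size();
    }
  }
  std::fclose(fp);
  return ok;
}
```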
amitdo
a905548ed6 Autotools build: Remove the option 'USING_MULTIPLELIBS'
Libtool's convenience libraries should never be installed. Fixes #985.
2017-09-11 15:03:53 +03:00
Ray Smith
a912967cc3 Rewrote unicharset_extractor to use the new string normalizer and read plain text as well as box files. 2017-09-08 11:49:57 +01:00
Ray Smith
a2a72d7ca7 Clang tidy changes from sync 2017-09-08 10:13:33 +01:00
Egor Pugin
c67c2e9f41 Add combine_lang_model to cmake and cppan builds. 2017-08-06 14:46:32 +03:00
Hintz
c5a861b229 Define std::max under VS2017 x64 2017-07-26 17:19:40 -04:00
Ray Smith
0e95e2ca87 Rewrote the recoder to use an encoding based on wubi instead of radical-stroke index, changed from normalized to unnormalized unichar representation 2017-07-25 09:40:44 -07:00
Ray Smith
b0ead95d64 Changed the way unicharsets are handled to allow support for the ™ character. Can find the issue where it was requested. 2017-07-24 11:45:57 -07:00
Stefan Weil
9929587f36 Remove extra semicolons
This fixes these compiler warnings:

    ccmain/equationdetect.cpp:1519:2: warning: extra ‘;’ [-Wpedantic]
    ccstruct/blobs.cpp:65:17: warning: extra ‘;’ [-Wpedantic]
    ccstruct/blobs.h:178:18: warning: extra ‘;’ [-Wpedantic]
    ccstruct/ratngs.cpp:36:22: warning: extra ‘;’ [-Wpedantic]
    ccstruct/ratngs.cpp:37:22: warning: extra ‘;’ [-Wpedantic]
    ccutil/ambigs.cpp:46:20: warning: extra ‘;’ [-Wpedantic]
    ccutil/ambigs.h:137:21: warning: extra ‘;’ [-Wpedantic]
    cutil/structures.cpp:36:45: warning: extra ‘;’ [-Wpedantic]
    textord/equationdetectbase.cpp:65:2: warning: extra ‘;’ [-Wpedantic]
    textord/equationdetectbase.h:57:2: warning: extra ‘;’ [-Wpedantic]
    wordrec/lm_state.cpp:25:28: warning: extra ‘;’ [-Wpedantic]
    wordrec/lm_state.h:190:29: warning: extra ‘;’ [-Wpedantic]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-07-15 12:40:34 +02:00
Stefan Weil
fa9e43fdde Fix wrong data type in argument for sscanf
Compiler warning:

    ccutil/unicharcompress.cpp:76:76: warning: format ‘%x’ expects argument of type ‘unsigned int*’, but argument 3 has type ‘int*’ [-Wformat=]
    ccutil/unicharcompress.cpp:80:31: warning: format ‘%x’ expects argument of type ‘unsigned int*’, but argument 3 has type ‘int*’ [-Wformat=]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-07-15 09:30:31 +02:00
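
Note on fa9e43fdde: the -Wformat warning quoted above goes away when the variable passed for '%x' is an unsigned int. A minimal sketch follows; the helper name ParseHex is hypothetical and not taken from unicharcompress.cpp.

```cpp
#include <cassert>
#include <cstdio>

// '%x' expects an 'unsigned int*', so the output parameter is unsigned.
// Returns true if exactly one hexadecimal value was parsed.
inline bool ParseHex(const char* text, unsigned int* value) {
  assert(text != nullptr && value != nullptr);
  return std::sscanf(text, "%x", value) == 1;
}
```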
Ray Smith
dc8745e6fd Move LSTM unicharset and recoder to traineddata with version string part1. Backwards compatible - maybe. 2017-07-14 11:14:23 -07:00
Ray Smith
3ec11bd37a Deleted some dead LSTM code, making everything use the recoder 2017-07-14 10:58:21 -07:00
Ray Smith
aee910a7bf Fixed build broken by previous commits that added use of string in low-level code 2017-07-14 10:33:55 -07:00
Ray Smith
df41eab6aa Added script-specific validation and normalization for virama-using scripts and updated normalization for others 2017-07-14 10:05:05 -07:00
Ray Smith
da03e4e910 Fixes from pull of cleanups: clang tidied, reviewed, fixed new bugs, undeleted needed code. Probably breaks the build, due to some inclusion of changes in utf8/32 conversion 2017-07-14 09:30:14 -07:00
Justin Hotchkiss Palermo
f057938069 fix filenames in comments 2017-07-02 17:35:47 -04:00
Stefan Weil
5f8ecdb2b3 Remove local implementation of strtok_r
MS Visual Studio does not provide that function, but can use strtok_s
which does exactly the same.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-06-05 19:52:25 +02:00
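
Note on 5f8ecdb2b3: one common way to use strtok_s in place of strtok_r under MS Visual Studio is a macro alias, since the Microsoft function takes the same three arguments. The sketch below is an assumption about how such a mapping can look, not the code from the commit, and SplitTokens is a hypothetical helper. The C11 Annex K strtok_s has a different, four-argument signature, so the alias is limited to the Microsoft compiler.

```cpp
#include <cassert>
#include <string.h>  // strtok_r (POSIX) or strtok_s (MSVC)
#include <string>
#include <vector>

#ifdef _MSC_VER
// MSVC: strtok_s(str, delimiters, context) mirrors POSIX strtok_r.
#define strtok_r strtok_s
#endif

// Hypothetical helper: splits a copy of 'text' at any character in 'delims'.
inline std::vector<std::string> SplitTokens(std::string text,
                                            const char* delims) {
  assert(delims != nullptr);
  std::vector<std::string> tokens;
  char* state = nullptr;  // per-call state, so the function is reentrant
  for (char* tok = strtok_r(&text[0], delims, &state); tok != nullptr;
       tok = strtok_r(nullptr, delims, &state)) {
    tokens.emplace_back(tok);
  }
  return tokens;
}
```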
Stefan Weil
fb863c97a9 UNICHARSET: Add missing initialization
The member variable default_sid_ was used without being initialized.

Valgrind report for `tesseract --oem 1 hello.png hello`:

    Conditional jump or move depends on uninitialised value(s)
       at 0x14352E: BITS16::set_bit(unsigned char, unsigned char) (bits16.h:50)
       by 0x143E27: WERD::set_flag(WERD_FLAGS, unsigned char) (werd.h:129)
       by 0x27D053: WERD_RES::SetupWordScript(UNICHARSET const&) (pageres.cpp:381)
       by 0x27CAFD: WERD_RES::SetupForRecognition(UNICHARSET const&, tesseract::Tesseract*, Pix*, int, TBOX const*, bool, bool, bool, ROW*, BLOCK const*) (pageres.cpp:316)
       by 0x145903: tesseract::Tesseract::SetupWordPassN(int, tesseract::WordData*) (control.cpp:182)
       by 0x145780: tesseract::Tesseract::SetupAllWordsPassN(int, TBOX const*, char const*, PAGE_RES*, GenericVector<tesseract::WordData>*) (control.cpp:168)
       by 0x146293: tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) (control.cpp:336)
       by 0x12F356: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:878)
       by 0x13036D: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1184)
       by 0x13014A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1140)
       by 0x12FBCE: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1040)
       by 0x12C3DF: main (tesseractmain.cpp:515)
     Uninitialised value was created by a heap allocation
       at 0x4C2C21F: operator new(unsigned long) (vg_replace_malloc.c:334)
       by 0x12D88B: tesseract::TessBaseAPI::Init(char const*, int, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, bool (*)(STRING const&, GenericVector<char>*)) (baseapi.cpp:320)
       by 0x12D6DA: tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool) (baseapi.cpp:284)
       by 0x12C088: main (tesseractmain.cpp:440)

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-19 20:57:39 +02:00
Stefan Weil
e05f4c677d Remove obsolete comments and unused code from ccutil/host.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-17 11:55:00 +02:00
Stefan Weil
3a6a8d70fc Replace Standard C library header files by C++ header files
Replacing inttypes.h by cinttypes fixes a problem with glibc < 2.18:
In older inttypes.h, the standard C format macros are only defined for
C++ when the macro __STDC_FORMAT_MACROS is set.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-17 11:49:43 +02:00
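
Note on 3a6a8d70fc: with the C++ header cinttypes, format macros such as PRId64 are available without defining __STDC_FORMAT_MACROS first. A small sketch follows; the helper FormatInt64 is hypothetical.

```cpp
#include <cassert>
#include <cinttypes>  // PRId64, no __STDC_FORMAT_MACROS needed
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a 64-bit integer portably; "%ld" or "%lld" alone would be wrong
// on some platforms, which is why the format macro is used.
inline std::string FormatInt64(int64_t value) {
  char buffer[32];
  const int written =
      std::snprintf(buffer, sizeof(buffer), "%" PRId64, value);
  assert(written > 0 && static_cast<std::size_t>(written) < sizeof(buffer));
  return std::string(buffer);
}
```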
Stefan Weil
0ba202f6ed Remove unneeded null pointer check
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-16 22:58:10 +02:00
Stefan Weil
46ca83071e genericvector: Add overloaded LoadDataFromFile
Several code locations call that method with a normal C string,
so overload it to accept that without a conversion to a STRING
object. This saves unneeded new / memcpy / delete operations.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-16 22:57:46 +02:00
Stefan Weil
079d6b9161 Improve robustness of TessdataManager
Tesseract crashes with an unhandled exception (std::bad_alloc) if it gets
a bad tessdata file where the numEntries data field is very large (also
after swapping), for example 0x77777777.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-14 21:33:56 +02:00
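
Note on 079d6b9161: a crash of this kind can be avoided by validating the entry count before it is used to size an allocation. The sketch below is not the TessdataManager code; the limit kMaxNumEntries, the assumed layout (a 32-bit count followed by 64-bit offsets) and both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical upper bound for a plausible number of entries.
constexpr uint32_t kMaxNumEntries = 1024;

inline uint32_t SwapBytes32(uint32_t v) {
  return ((v & 0xFFu) << 24) | ((v & 0xFF00u) << 8) |
         ((v >> 8) & 0xFF00u) | (v >> 24);
}

// Reads the entry count from the start of 'data' and rejects values that
// are implausible in both byte orders, e.g. 0x77777777.
inline bool ReadEntryCount(const unsigned char* data, std::size_t size,
                           uint32_t* num_entries) {
  assert(data != nullptr && num_entries != nullptr);
  if (size < sizeof(uint32_t)) return false;
  uint32_t n;
  std::memcpy(&n, data, sizeof(n));
  if (n > kMaxNumEntries) n = SwapBytes32(n);  // file may use other endianness
  if (n > kMaxNumEntries) return false;        // still too large: bad file
  // The offset table itself must fit into the remaining bytes.
  if ((size - sizeof(uint32_t)) / sizeof(int64_t) < n) return false;
  *num_entries = n;
  return true;
}
```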
Stefan Weil
db8750e94e Remove unused method TessdataManager::LoadFileLater
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-13 13:14:47 +02:00
Stefan Weil
65b839e1aa Remove unused method TessdataManager::OverwriteEntry
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-13 13:14:47 +02:00
zdenop
6bebe71749 Merge pull request #910 from stweil/opt
Fix GenericVector and optimize some code which used GenericVector::init_to_size
2017-05-13 12:53:40 +02:00
Stefan Weil
69296f8d18 Clean method UNICHARSET::add_script
It increased the script_table too early, so the last element was never
used.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-13 11:53:43 +02:00
Stefan Weil
3a67ff930e Optimize code by replacing init_to_size with resize_no_init
There is no need to initialize memory with a fixed value which is
overwritten in the next step.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-12 14:34:55 +02:00
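
Note on 3a67ff930e: the optimization can be illustrated without GenericVector. In the sketch below, a plain array stands in for the vector, and the function MakeSquares and the limit kMaxCount are hypothetical. The array is allocated without a fill pass because every element is written in the following loop, which mirrors replacing init_to_size by resize_no_init.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Largest count for which (count - 1) squared still fits into an int.
constexpr std::size_t kMaxCount = 46341;

inline std::unique_ptr<int[]> MakeSquares(std::size_t count) {
  assert(count <= kMaxCount);
  // 'new int[count]' leaves the elements uninitialized (no fill pass);
  // 'new int[count]()' would zero them first, only to be overwritten below.
  std::unique_ptr<int[]> values(new int[count]);
  for (std::size_t i = 0; i < count; ++i) {
    values[i] = static_cast<int>(i * i);
  }
  return values;
}
```

Skipping the initialization is only safe when every element is assigned before it is read, as in the loop above.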