Ray Smith
a912967cc3
Rewrote unicharset_extractor to use the new string normalizer and read plain text as well as box files.
2017-09-08 11:49:57 +01:00
Ray Smith
a2a72d7ca7
Clang tidy changes from sync
2017-09-08 10:13:33 +01:00
Egor Pugin
c67c2e9f41
Add combine_lang_model to cmake and cppan builds.
2017-08-06 14:46:32 +03:00
Hintz
c5a861b229
Define std::max under VS2017 x64
2017-07-26 17:19:40 -04:00
Ray Smith
0e95e2ca87
Rewrote the recoder to use an encoding based on wubi instead of radical-stroke index, changed from normalized to unnormalized unichar representation
2017-07-25 09:40:44 -07:00
Ray Smith
b0ead95d64
Changed the way unicharsets are handled to allow support for the ™ character. Can find the issue where it was requested.
2017-07-24 11:45:57 -07:00
Stefan Weil
9929587f36
Remove extra semicolons
...
This fixes these compiler warnings:
ccmain/equationdetect.cpp:1519:2: warning: extra ‘;’ [-Wpedantic]
ccstruct/blobs.cpp:65:17: warning: extra ‘;’ [-Wpedantic]
ccstruct/blobs.h:178:18: warning: extra ‘;’ [-Wpedantic]
ccstruct/ratngs.cpp:36:22: warning: extra ‘;’ [-Wpedantic]
ccstruct/ratngs.cpp:37:22: warning: extra ‘;’ [-Wpedantic]
ccutil/ambigs.cpp:46:20: warning: extra ‘;’ [-Wpedantic]
ccutil/ambigs.h:137:21: warning: extra ‘;’ [-Wpedantic]
cutil/structures.cpp:36:45: warning: extra ‘;’ [-Wpedantic]
textord/equationdetectbase.cpp:65:2: warning: extra ‘;’ [-Wpedantic]
textord/equationdetectbase.h:57:2: warning: extra ‘;’ [-Wpedantic]
wordrec/lm_state.cpp:25:28: warning: extra ‘;’ [-Wpedantic]
wordrec/lm_state.h:190:29: warning: extra ‘;’ [-Wpedantic]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-07-15 12:40:34 +02:00
Stefan Weil
fa9e43fdde
Fix wrong data type in argument for sscanf
...
Compiler warning:
ccutil/unicharcompress.cpp:76:76: warning: format ‘%x’ expects argument of type ‘unsigned int*’, but argument 3 has type ‘int*’ [-Wformat=]
ccutil/unicharcompress.cpp:80:31: warning: format ‘%x’ expects argument of type ‘unsigned int*’, but argument 3 has type ‘int*’ [-Wformat=]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-07-15 09:30:31 +02:00
Ray Smith
dc8745e6fd
Move LSTM unicharset and recoder to traineddata with version string part1. Backwards compatible - maybe.
2017-07-14 11:14:23 -07:00
Ray Smith
3ec11bd37a
Deleted some dead LSTM code, making everything use the recoder
2017-07-14 10:58:21 -07:00
Ray Smith
aee910a7bf
Fixed build broken by previous commits that added use of string in low-level code
2017-07-14 10:33:55 -07:00
Ray Smith
df41eab6aa
Added script-specific validation and normalization for virama-using scripts and updated normalization for others
2017-07-14 10:05:05 -07:00
Ray Smith
da03e4e910
Fixes from pull of cleanups: clang tidied, reviewed, fixed new bugs, undeleted needed code. Probably breaks the build, due to some inclusion of changes in utf8/32 conversion
2017-07-14 09:30:14 -07:00
Justin Hotchkiss Palermo
f057938069
fix filenames in comments
2017-07-02 17:35:47 -04:00
Stefan Weil
5f8ecdb2b3
Remove local implementation of strtok_r
...
MS Visual Studio does not provide that function, but can use strtok_s
which does exactly the same.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-06-05 19:52:25 +02:00
Stefan Weil
fb863c97a9
UNICHARSET: Add missing initialization
...
The member variable default_sid_ was used without being initialized.
Valgrind report for `tesseract --oem 1 hello.png hello`:
Conditional jump or move depends on uninitialised value(s)
at 0x14352E: BITS16::set_bit(unsigned char, unsigned char) (bits16.h:50)
by 0x143E27: WERD::set_flag(WERD_FLAGS, unsigned char) (werd.h:129)
by 0x27D053: WERD_RES::SetupWordScript(UNICHARSET const&) (pageres.cpp:381)
by 0x27CAFD: WERD_RES::SetupForRecognition(UNICHARSET const&, tesseract::Tesseract*, Pix*, int, TBOX const*, bool, bool, bool, ROW*, BLOCK const*) (pageres.cpp:316)
by 0x145903: tesseract::Tesseract::SetupWordPassN(int, tesseract::WordData*) (control.cpp:182)
by 0x145780: tesseract::Tesseract::SetupAllWordsPassN(int, TBOX const*, char const*, PAGE_RES*, GenericVector<tesseract::WordData>*) (control.cpp:168)
by 0x146293: tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int) (control.cpp:336)
by 0x12F356: tesseract::TessBaseAPI::Recognize(ETEXT_DESC*) (baseapi.cpp:878)
by 0x13036D: tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1184)
by 0x13014A: tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1140)
by 0x12FBCE: tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) (baseapi.cpp:1040)
by 0x12C3DF: main (tesseractmain.cpp:515)
Uninitialised value was created by a heap allocation
at 0x4C2C21F: operator new(unsigned long) (vg_replace_malloc.c:334)
by 0x12D88B: tesseract::TessBaseAPI::Init(char const*, int, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool, bool (*)(STRING const&, GenericVector<char>*)) (baseapi.cpp:320)
by 0x12D6DA: tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool) (baseapi.cpp:284)
by 0x12C088: main (tesseractmain.cpp:440)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-19 20:57:39 +02:00
Stefan Weil
e05f4c677d
Remove obsolete comments and unused code from ccutil/host.h
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-17 11:55:00 +02:00
Stefan Weil
3a6a8d70fc
Replace Standard C library header files by C++ header files
...
Replacing inttypes.h by cinttypes fixes a problem with glibc < 2.18:
In older inttypes.h, the standard C format macros are only defined for
C++ when the macro __STDC_FORMAT_MACROS is set.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-17 11:49:43 +02:00
Stefan Weil
0ba202f6ed
Remove unneeded null pointer check
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-16 22:58:10 +02:00
Stefan Weil
46ca83071e
genericvector: Add overloaded LoadDataFromFile
...
Several code locations call that method with a normal C string,
so overload it to accept that without a conversion to a STRING
object. This saves unneeded new / memcpy / delete operations.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-16 22:57:46 +02:00
Stefan Weil
079d6b9161
Improve robustness of TessdataManager
...
Tesseract crashes with an unhandled exception (std::bad_alloc) if it gets
a bad tessdata file where the numEntries data field is very large (also
after swapping), for example 0x77777777.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-14 21:33:56 +02:00
Stefan Weil
db8750e94e
Remove unused method TessdataManager::LoadFileLater
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-13 13:14:47 +02:00
Stefan Weil
65b839e1aa
Remove unused method TessdataManager::OverwriteEntry
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-13 13:14:47 +02:00
zdenop
6bebe71749
Merge pull request #910 from stweil/opt
...
Fix GenericVector and optimize some code which used GenericVector::init_to_size
2017-05-13 12:53:40 +02:00
Stefan Weil
69296f8d18
Clean method UNICHARSET::add_script
...
It increased the script_table too early, so the last element was never
used.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-13 11:53:43 +02:00
Stefan Weil
3a67ff930e
Optimize code by replacing init_to_size with resize_no_init
...
There is no need to initialize memory with a fixed value which is
overwritten in the next step.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-12 14:34:55 +02:00
Stefan Weil
bb2348bbbe
genericvector: Fix and optimize function LoadDataFromFile
...
It's not necessary to initialize the vector with 0,
because the initial values are read from file.
Fix also an assertion when trying to read an empty file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-12 14:15:54 +02:00
Stefan Weil
80f51c3758
ccutil: Remove unneeded include statement
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-12 14:11:21 +02:00
Raf Schietekat
c335508e84
Fewer g++ -Wsign-compare warnings
2017-05-11 23:14:52 +02:00
Stefan Weil
7831a35dbb
ccutil: Simplify code (removes type cast)
...
There is no need for an intermediate variable char_buffer.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-11 20:10:17 +02:00
Stefan Weil
9266f01857
Remove macros which are no longer needed
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-11 19:32:51 +02:00
Stefan Weil
f2252fdadc
Introduce standard macros for format specifiers
...
There exist standard macro definitions for the printf format specifiers.
MS Visual Studio does not support that standard (at least not in older
versions), so local definitions are needed there.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-11 19:30:49 +02:00
zdenop
64994a2707
Merge pull request #900 from rfschtkt/cast
...
Reviewed uses of reinterpret_cast
2017-05-11 16:08:12 +02:00
Stefan Weil
3cccae69e5
Fix wrong format string
...
The local variable intval is of type int.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-11 09:06:02 +02:00
Raf Schietekat
3983d2f76a
Reviewed uses of reinterpret_cast
2017-05-11 01:58:40 +02:00
Ray Smith
8e79297dce
Final part of endian improvement. Adds big-endian support to lstm and fixes issue 518
2017-05-03 16:09:44 -07:00
Stefan Weil
46c887b77e
genericvector: Fix minimum size
...
Commit 907de5995f
tried to improve
GenericVector, but missed a case where vectors with less than
kDefaultVectorSize were allocated. This resulted in additional
alloc / free operations.
Commit a28b2a033d
(before memory optimization)
oem 0: total heap usage: 739,238 allocs, 739,237 frees, 161,699,214 bytes allocated
oem 1: total heap usage: 690,182 allocs, 690,175 frees, 144,470,400 bytes allocated
oem 2: total heap usage: 728,213 allocs, 728,206 frees, 182,885,824 bytes allocated
Commit fd3f8f9b2d
without genericvector change
oem 0: total heap usage: 738,980 allocs, 738,979 frees, 161,697,150 bytes allocated
oem 1: total heap usage: 690,182 allocs, 690,175 frees, 144,470,400 bytes allocated
oem 2: total heap usage: 728,213 allocs, 728,206 frees, 182,885,824 bytes allocated
=> Improvements for oem 0, no change for oem 1 and oem 2.
Commit fd3f8f9b2d
oem 0: total heap usage: 772,648 allocs, 772,647 frees, 160,083,901 bytes allocated
oem 1: total heap usage: 748,591 allocs, 748,584 frees, 143,581,672 bytes allocated
oem 2: total heap usage: 764,796 allocs, 764,789 frees, 181,212,197 bytes allocated
=> Less bytes allocated, but more allocs / frees = bad for performance.
Commit fd3f8f9b2d
with this patch
oem 0: total heap usage: 677,537 allocs, 677,536 frees, 160,444,634 bytes allocated
oem 1: total heap usage: 653,812 allocs, 653,805 frees, 143,423,008 bytes allocated
oem 2: total heap usage: 670,029 allocs, 670,022 frees, 181,517,760 bytes allocated
=> Improvements for all three cases.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-03 09:49:23 +02:00
Stefan Weil
048cf9d06a
Remove unused local variables
...
This fixes some compiler warnings.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-05-02 09:43:29 +02:00
zdenop
fd3f8f9b2d
Merge pull request #352 from pnordhus/reduce_mallocs
...
Avoid unnecessary memory allocations
2017-04-30 17:39:31 +02:00
Stefan Weil
f8fba59804
Replace alloc_struct, free_struct
...
Both functions simply call malloc, free.
Remove also unneeded null pointer checks and use calloc where possible.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-04-30 09:25:04 +02:00
Ray Smith
7a116ce8bb
More formatting fixes from clang tidy
2017-04-28 13:38:32 -07:00
Ray Smith
1cc511188d
Added extra Init that takes a memory buffer or a filereader function pointer to enable read of traineddata from memory or foreign file systems. Updated existing readers to use TFile API instead of FILE. This does not yet add big-endian capability to LSTM, but it is very easy from here.
2017-04-27 15:48:23 -07:00
Stefan Weil
8f8651b6ce
Fix typo
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-04-15 17:27:56 +02:00
Stefan Weil
363f13157b
ccutil: Remove unused variable
...
This fixes a compiler warning:
ccutil/scanutils.cpp:284:7: warning:
variable 'sign' set but not used [-Wunused-but-set-variable]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-03-08 07:38:59 +01:00
Mikhail Solomennik
ba4b60374d
Correct reading config files with \r\n
2017-03-01 14:41:17 +03:00
Ray Smith
f566a45b30
clang-tidy changes from sync
2017-01-25 16:20:19 -08:00
Egor Pugin
9b604b1eb9
Fix possible warning when WIN32_LEAN_AND_MEAN is already defined.
2017-01-24 00:22:36 +03:00
amitdo
5d627aacae
Remove code that is no longer needed
...
The code in ccutil/hashfn.h was needed for some old compilers. Now that we support MSVC >= 2010 and compilers that has good support for C++11, we can drop this code.
As a result of this file removal, we now use:
std::unordered_map
std::unordered_set
std::unique_ptr
directly in the codebase with '#include' for the needed headers.
2017-01-16 01:49:17 +02:00
Egor Pugin
442b5b731a
Fix building of training tools in shared configuration.
2016-12-17 16:19:35 +03:00
zdenop
da4c064c2e
Merge pull request #531 from stweil/guards
...
Fix header file guards and replace reserved identifiers
2016-12-15 08:29:32 +01:00