Page segmentation mode "OSD only" requires osd.traineddata,
so use it automatically.
Report a warning if the user specified a different language.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
By default, that script creates two new temporary directories with random
names in /tmp.
The new command line flag --workspace_dir PATH uses the given path as
a base directory for all temporary files.
That makes training results easier to reproduce (no random directory
names in log files).
Signed-off-by: Stefan Weil <stweil@ub-backup.bib.uni-mannheim.de>
By using the parameter -c glyph_confidences=true the user is able to enrich
the hOCR output with additional information. For every timestep of the LSTM,
Tesseract then also lists all glyphs that were considered, together with
their confidence.
The format of the hOCR output is slightly changed: There is now a line break
after every word for better readability by humans.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
One of the checks was too restrictive, as lstmeval deserializes
char arrays with 14000000 elements, so raise the limit to 30000000.
That check was added in commit 992031e824.
Also add assertions which help to find such problems in debug mode.
Signed-off-by: Stefan Weil <stweil@ub-backup.bib.uni-mannheim.de>
It is needed for running the training tutorial on Linux.
The correct mode was lost when moving the files in
commit 104fe7931c.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The Serialize method is used indirectly by MasterTrainer::Serialize,
but there is no corresponding MasterTrainer::DeSerialize.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
OpenclDevice::getDeviceSelection crashed when outdated information
was read from file and device.score was not set.
Also change the struct definitions from C to C++ and
eliminate some type casts.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit 4d514d5a60 introduced tprintf_internal
with an additional argument "level" which was removed again in commit
7dc5296fe9.
So we can now restore the original state without tprintf_internal.
Also remove the declaration of debug_window_on (it has not existed since
commit 030aae9896) and make the
configuration parameter debug_file local as it is only used by tprintf.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
`int depth = strtol(*str + 1, str, 10);`
`**str` holds the words in the VGSL specification, and `*str` holds a single word, let's say `Fr64`. The `strtol` function skips leading white space (and no other characters), expects its first argument to point to a number in valid integer format, and updates `*str` to point to the first character after that number. In reality `*str + 1` points to the `r` in `Fr64`, which is not a digit. This is a bad argument, which results in `strtol` returning 0.
`strtol(*str + 2, str, 10)` should be called instead.
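The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the Tesseract parser: `ParseDepth` is a hypothetical helper, and the word passed to it is only sample input.

```cpp
#include <cstdlib>

// Hypothetical helper (not part of Tesseract): parses the integer that starts
// `offset` characters into `word`. std::strtol returns 0 when no digits are
// found at the start position.
long ParseDepth(const char* word, int offset) {
  char* end = nullptr;
  return std::strtol(word + offset, &end, 10);
}
```

For the sample word `Fr64`, an offset of 1 starts at `r` and yields 0, while an offset of 2 starts at the digits and yields 64.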
Limit the matrix to UINT16_MAX x UINT16_MAX.
Larger dimensions could also result in an arithmetic overflow
when multiplying the two dimensions.
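A minimal sketch of such a limit, assuming two unsigned dimensions. The function names are illustrative and not taken from the Tesseract sources.

```cpp
#include <cstdint>

// Hypothetical check: accepts only dimensions up to UINT16_MAX. The product
// of two accepted dimensions is at most 65535 * 65535 = 4294836225, which
// still fits into an unsigned 32 bit value.
bool DimensionsAreValid(std::uint32_t dim1, std::uint32_t dim2) {
  return dim1 <= UINT16_MAX && dim2 <= UINT16_MAX;
}

// Number of matrix elements, computed in 64 bits to be on the safe side.
std::uint64_t ElementCount(std::uint32_t dim1, std::uint32_t dim2) {
  return static_cast<std::uint64_t>(dim1) * dim2;
}
```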
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Wrong file data could give a large value for the number of vector elements
resulting in very large memory allocations.
Limit the allowed data range to UINT16_MAX (65535) elements
which should be sufficient for all use cases.
Changing the data type of the related member variables from int to
uint32_t allowed removing several type casts.
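The idea can be sketched as follows. This is not Tesseract's DeSerialize code: `ReadVector` and the buffer layout (a 32 bit element count followed by the data bytes) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical reader: rejects implausible element counts before allocating.
bool ReadVector(const unsigned char* data, std::size_t size,
                std::vector<unsigned char>* out) {
  std::uint32_t count = 0;
  if (size < sizeof(count)) return false;          // no room for the count
  std::memcpy(&count, data, sizeof(count));
  if (count > UINT16_MAX) return false;            // wrong file data
  if (size - sizeof(count) < count) return false;  // truncated input
  const unsigned char* first = data + sizeof(count);
  out->assign(first, first + count);
  return true;
}
```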
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Add missing include statements, add missing "static" qualifiers or
remove functions which are not used at all.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
* Add break in default case to avoid potential problems with
future case statements following the default case.
* Remove empty statement.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warnings:
src/ccstruct/coutln.cpp:231:15: warning:
variable 'destindex' may be uninitialized when used here [-Wconditional-uninitialized]
src/wordrec/language_model.cpp:1170:27: warning:
variable 'expected_gap' may be uninitialized when used here [-Wconditional-uninitialized]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warnings:
src/api/baseapi.cpp:1642:18: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1642:31: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1642:45: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1652:16: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1652:30: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1662:17: warning:
possible misuse of comma operator here [-Wcomma]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/ccstruct/polyblk.cpp:48:36: warning:
constructor parameter 'box' shadows the field 'box' of 'POLY_BLOCK'
[-Wshadow-field-in-constructor]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/lstm/networkio.cpp:56:15: warning:
'this' pointer cannot be null in well-defined C++ code;
comparison may be assumed to always evaluate to true [-Wtautological-undefined-compare]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/lstm/lstmrecognizer.cpp:411:13: warning:
unused function 'NullIsBest' [-Wunused-function]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warning:
src/lstm/network.cpp:249:7:
warning: 'break' will never be executed [-Wunreachable-code-break]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The functions TessBaseAPIInitLangMod, TessBaseAPIClearAdaptiveClassifier
and TessBaseAPIDetectOrientationScript need conditional compilation.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Instead of defining the DISABLED_LEGACY_ENGINE macro in config_auto.h
(which is not included by all source files), define it as a preprocessor
option for those parts of the code which require it.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
On most systems float is the IEEE 754 single-precision binary
floating-point format (32 bits). Tesseract does not support other systems.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
On most systems double is the IEEE 754 double-precision binary
floating-point format (64 bits). Tesseract does not support other systems.
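Assumptions like this one about float and double can be checked at compile time. The following is a sketch using only the standard library. Whether Tesseract uses exactly these checks is not stated here.

```cpp
#include <limits>

// The build fails on platforms where float or double is not an IEEE 754
// (IEC 559) type with a 24 bit or 53 bit significand, respectively.
static_assert(std::numeric_limits<float>::is_iec559 &&
                  std::numeric_limits<float>::digits == 24,
              "float must be IEEE 754 single precision");
static_assert(std::numeric_limits<double>::is_iec559 &&
                  std::numeric_limits<double>::digits == 53,
              "double must be IEEE 754 double precision");
```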
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It did not cause a problem as both arguments were 0.
Also update the function prototype of HistogramRectOCL to
accept a void pointer which allows removing a type cast.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The division was made with integers, giving a wrong result.
* Avoid division and use pure integer operations.
* Add missing "static" attribute.
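A generic illustration of this kind of bug, not the code which was changed here. Both functions are hypothetical and assume a positive `total`.

```cpp
// Buggy: part / total is an integer division, so the quotient is 0 for
// every part < total and the comparison with 0.5 is false.
bool AtLeastHalfBuggy(int part, int total) {
  return part / total >= 0.5;
}

// Fixed: multiply instead of dividing. The result is exact as long as
// 2 * part does not overflow int.
bool AtLeastHalf(int part, int total) {
  return 2 * part >= total;
}
```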
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Remove unneeded assignments and a wrong comment in the destructor.
Fix wrong data type for local variable xstarts.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The changes are based on an analysis done with include-what-you-use.
Also replace some standard C header files with the corresponding
standard C++ header files.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Remove unneeded include statements, remove conditional statements and
replace the remaining assert.h includes with the standard C++ variant cassert.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
genericvector.h used a mix of assert and ASSERT_HOST.
By using assert only, it no longer depends on errcode.h
which defines the ASSERT_HOST macro.
Other files which still use ASSERT_HOST now need an explicit
include statement for errcode.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Coverity Scan does not like incrementing a null pointer,
so increment an index value instead of a pointer.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The tesseract/ subdirectory is no longer automatically added to the
include path of the compiler. Therefore old code which used include statements like
#include "capi.h"
must now change that to
#include "tesseract/capi.h"
This avoids name conflicts with header files from other projects.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Check whether the top right point of the block is inside of the
thresholded image t_pix. Otherwise the following code would make
illegal memory accesses.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit 87d33b6c9e added code which uses bool.
Therefore stdbool.h must be included for compilations with a C compiler.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Normal C++ programs like those which are built for tesseract automatically
set the locale "C".
There can be different locale settings if the tesseract library is used
in other software.
A wrong locale can cause wrong results from sscanf which is used at
different places in the tesseract code, so make sure that we have the
right locale settings and fail if that is not the case.
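A minimal sketch of such a check using only the standard library. `NumericLocaleIsC` is a hypothetical name, and the check which Tesseract performs may cover more locale categories.

```cpp
#include <clocale>
#include <cstring>

// Returns true if the numeric locale is "C", which is what sscanf needs
// to parse numbers such as "1.5" with a decimal point.
bool NumericLocaleIsC() {
  // A null second argument only queries the current setting.
  const char* locale = std::setlocale(LC_NUMERIC, nullptr);
  return locale != nullptr && std::strcmp(locale, "C") == 0;
}
```

A caller can then fail early, for example with an assertion, when the function returns false.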
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The assertions introduced by commit 8bea6bcc12
were too strict. The first one failed in osd_test, the second one failed
in `tesseract IMAGE BASE --psm 13 lstm.train`.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Raise an assertion for unexpected arguments and use size_t instead of int
for the size argument which is typically sizeof(some_datatype).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
When Tesseract is called without any argument, the help message is still
printed, but the exit status no longer indicates success (EXIT_OK).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The progress reporting function returns a boolean. The returned
value is never used by Tesseract and its meaning is not
documented, which renders the value meaningless. Still, the lack of
a return should not be permitted.
The C API is missing the ability to monitor the progress of the
recognition. This patch adds C wrappers to the progress monitor
that allow monitoring the progress and canceling the recognition
process early.
The progress_callback field in the ETEXT_DESC monitor type does not
take any 'context' parameter, which may make implementing callback
functions difficult and may require use of global variables.
The new function receives the ETEXT_DESC pointer as an argument.
This makes it possible to share the cancel_this field as context
carrier if required.
The change is backwards-compatible: the old pointer remains as a
member of the class, and the default value for the new pointer is
a function calling the classic progress notifier. This way the code
unaware of the new member will continue to work as before.
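The backwards-compatible design can be sketched with a simplified stand-in for the monitor type. `Monitor`, `progress_with_context` and `ForwardToOldCallback` are hypothetical names, and the callback signatures are reduced to a single progress value. Only `progress_callback` and `cancel_this` are taken from the description above.

```cpp
struct Monitor;
using OldProgress = bool (*)(int progress);
using NewProgress = bool (*)(Monitor* monitor, int progress);

// Default for the new pointer: calls the old callback if one is set.
bool ForwardToOldCallback(Monitor* monitor, int progress);

struct Monitor {
  OldProgress progress_callback = nullptr;  // old style, no context
  NewProgress progress_with_context = ForwardToOldCallback;
  void* cancel_this = nullptr;              // may carry user context
};

bool ForwardToOldCallback(Monitor* monitor, int progress) {
  if (monitor->progress_callback != nullptr) {
    return monitor->progress_callback(progress);
  }
  return true;
}
```

Code which only sets the old pointer keeps working because the default new callback forwards to it, while new code can replace the new pointer and read its context from the monitor.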
Commit 0248c7ff9d replaced math.h by cmath.
Therefore isinf and isnan are no longer declared.
Replace them with their C++11 variants.
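For illustration, a small sketch of the C++11 spelling. `IsFiniteNumber` is a hypothetical helper and not part of Tesseract.

```cpp
#include <cmath>
#include <limits>

// With <cmath>, the classification functions are used via namespace std.
bool IsFiniteNumber(double x) {
  return !std::isnan(x) && !std::isinf(x);
}
```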
Signed-off-by: Stefan Weil <stweil@ub-blade-02.bib.uni-mannheim.de>
The following code caused a crash when Tesseract was compiled with -ftrapv:
1259 int width = right - left;
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff665c231 in __GI_abort () at abort.c:79
#2 0x00007ffff69e34d8 in __subvsi3 () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#3 0x000055555560c1c5 in tesseract::ColPartitionGrid::FindVPartitionPartners (this=0x55555717e3c0, to_the_left=true, part=0x5555571fa380)
at ../../../src/textord/colpartitiongrid.cpp:1259
#4 0x000055555560bda0 in tesseract::ColPartitionGrid::FindPartitionPartners (this=0x55555717e3c0) at ../../../src/textord/colpartitiongrid.cpp:1196
#5 0x00005555555f52b6 in tesseract::ColumnFinder::FindBlocks (this=0x55555717e280, pageseg_mode=tesseract::PSM_AUTO, scaled_color=0x0, scaled_factor=-1,
input_block=0x555555f91390, photo_mask_pix=0x555555f73300, thresholds_pix=0x555555f76290, grey_pix=0x555555f762e0, pixa_debug=0x7ffff7fc88d8, blocks=0x7fffffffd250,
diacritic_blobs=0x7fffffffd330, to_blocks=0x7fffffffd328) at ../../../src/textord/colfind.cpp:431
#6 0x00005555555c240d in tesseract::Tesseract::AutoPageSeg (this=0x7ffff7fa5010, pageseg_mode=tesseract::PSM_AUTO, blocks=0x555555f761d0, to_blocks=0x7fffffffd328,
diacritic_blobs=0x7fffffffd330, osd_tess=0x0, osr=0x7fffffffd6d0) at ../../../src/ccmain/pagesegmain.cpp:229
#7 0x00005555555c1ffd in tesseract::Tesseract::SegmentPage (this=0x7ffff7fa5010, input_file=0x555555f7bd90, blocks=0x555555f761d0, osd_tess=0x0, osr=0x7fffffffd6d0)
at ../../../src/ccmain/pagesegmain.cpp:141
#8 0x0000555555582540 in tesseract::TessBaseAPI::FindLines (this=0x555555a9a580 <main::api>) at ../../../src/api/baseapi.cpp:2291
#9 0x000055555557ce42 in tesseract::TessBaseAPI::Recognize (this=0x555555a9a580 <main::api>, monitor=0x0) at ../../../src/api/baseapi.cpp:802
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a compiler warning:
warning: ‘tesseract::TabFind::v_it_’ will be initialized after [-Wreorder]
warning: ‘ICOORD tesseract::TabFind::image_origin_’ [-Wreorder]
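The warning is about the order of the member initializer list. A sketch with a hypothetical class (not TabFind) shows the rule: members are always initialized in declaration order, so the initializer list should be written in the same order.

```cpp
class Range {
 public:
  // Initializers are listed in declaration order, so -Wreorder stays quiet
  // and end_ may safely use the already initialized start_.
  Range(int start, int length) : start_(start), end_(start_ + length) {}
  int end() const { return end_; }

 private:
  int start_;  // declared first, initialized first
  int end_;    // declared second, initialized second
};
```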
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a compiler warning:
warning: ‘BLOCK::filename’ will be initialized after [-Wreorder]
warning: ‘PDBLK BLOCK::pdblk’ [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit effa574 from 20.01.2017 added the bool textonly to the constructor of TessPDFRenderer. To maintain compatibility with older API users which still pass only two parameters, a default value for the textonly parameter is set.
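The technique can be sketched with a hypothetical class. `Renderer` is not the real TessPDFRenderer, its first two parameters are made up, and the default value `false` is an assumption.

```cpp
#include <string>

class Renderer {
 public:
  // The third parameter has a default value, so callers which pass only
  // two arguments still compile.
  Renderer(const std::string& outputbase, const std::string& datadir,
           bool textonly = false)
      : outputbase_(outputbase), datadir_(datadir), textonly_(textonly) {}
  bool textonly() const { return textonly_; }

 private:
  std::string outputbase_;
  std::string datadir_;
  bool textonly_;
};
```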
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
The else statement is never executed.
Also remove an unused element from the names array
and add the "static" attribute.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It's still possible to set the warning level in the project settings,
but single source files should normally not disable compiler warnings.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Coverity ID: 1386084. The set_font method accessed resolution_ before it was initialized by the set_resolution method.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
Tesseract code does not use strings.h (strngs.h was once called strings.h),
so that dependency can also be removed from cmake and cppan configuration.
Signed-off-by: Stefan Weil <sw@weilnetz.de>