They used the function pango_coverage_max which does nothing and
which has been deprecated since pango version 1.44.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Issues:
The debug information for "NoImages" is just the binary image;
it does not show the developer the result of photo_mask_pix.
Fix:
Subtract photo_mask_pix from the binary image; the result
is the "NoImages" binary pix.
When checking horizontal line partitions for
possible interpretation as underline formatting,
avoid confusing the hline partition itself with
an overlapping neighbour (which would delete it).
When detecting vertical separators, the blob aligner is used to glue
line segments (often segmented due to artificial cracks).
But (unlike LineFinder) it has many parameters that are not
relative to pixel density/resolution.
This change decreases the minimum absolute length in pixels
for vertical separators.
The kDictWildcard is never actually used, so removing it makes
no difference. It causes warnings in MSVC builds as MSVC doesn't
know how to pack a unicode value into chars.
If building with TESSERACT_IMAGEDATA_AS_PIX, then tesseract
doesn't compress/decompress images, but rather holds the
data as internal Pix structures. Unfortunately, I forgot to
make the ImageData destructor free these, so memory leaked
during use. Fixed here.
strtok_s can be used with MSVC as a replacement for strtok_r, so less
special handling is needed in the code and class SVNetwork can be
made smaller by removing member has_content.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
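A minimal sketch of the idea, not the actual SVNetwork code: strtok_s in
MSVC takes the same arguments as POSIX strtok_r, so one alias is enough.
The helper name SplitMessages and the newline delimiter are assumptions
made for illustration only.

```cpp
#include <string.h>

#include <string>
#include <vector>

#ifdef _MSC_VER
// MSVC provides strtok_s with the same parameter list as POSIX strtok_r.
#define strtok_r strtok_s
#endif

// Hypothetical helper: split a buffer into newline-separated messages.
// The buffer is taken by value because strtok_r modifies it in place.
std::vector<std::string> SplitMessages(std::string buffer) {
  std::vector<std::string> messages;
  char* context = nullptr;
  for (char* token = strtok_r(&buffer[0], "\n", &context); token != nullptr;
       token = strtok_r(nullptr, "\n", &context)) {
    messages.emplace_back(token);
  }
  return messages;
}
```

Like strtok, this skips empty tokens between consecutive delimiters.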
Some runtime parameters which are only relevant with graphics enabled
are now removed from builds when graphics is disabled.
TableFinder::DisplayColSegmentGrid is never used, so remove it completely.
Builds with --disable-graphics significantly reduce the code size and avoid
some function calls which might be important for certain applications:
   text    data     bss     dec     hex filename
3219230   41136   13920 3274286  31f62e .libs/libtesseract.so (--disable-graphics, old)
3211347   40976   13600 3265923  31d583 .libs/libtesseract.so (--disable-graphics, new)
3360942   43656   15392 3419990  342f56 .libs/libtesseract.so (default)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It was reported by oss-fuzz (issue 23962).
Add log output to find real images which trigger that issue.
Also avoid some conversions from float to double by always using float.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This replaces the proprietary STRING data type
(801 instead of 838 lines remaining).
It also removes STRING from osdetect.h and serialis.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Runtime error reported by sanitizer:
src/ccstruct/rect.h:191:44: runtime error: 50961 is outside the range of representable values of type 'short'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/rect.h:191:44 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
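The message above looks like the one UBSan prints when a floating-point
value is converted to an integer type that cannot hold it. A minimal sketch
of one way to avoid that, assuming the value may be clamped; the helper
name ClampToInt16 is hypothetical and this is not necessarily how rect.h
was changed.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical helper: clamp a float into the int16_t range before the
// narrowing cast, so the cast itself is always well defined.
int16_t ClampToInt16(float value) {
  constexpr float kMin = std::numeric_limits<int16_t>::min();
  constexpr float kMax = std::numeric_limits<int16_t>::max();
  return static_cast<int16_t>(std::min(kMax, std::max(kMin, value)));
}
```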
Runtime error reported by sanitizer:
src/ccstruct/coutln.cpp:1018:19: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:48:14: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/coutln.cpp:1018:19 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
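Functions from string.h such as memcpy declare their pointer arguments as
non-null, even when the length is zero. A minimal sketch of a guard,
assuming the null pointer only occurs together with an empty buffer; the
wrapper name CopyBytes is hypothetical and not taken from coutln.cpp.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical wrapper: skip the call when there is nothing to copy, so a
// null pointer with a zero length never reaches memcpy.
void* CopyBytes(void* dest, const void* src, std::size_t count) {
  if (count == 0 || dest == nullptr || src == nullptr) {
    return dest;
  }
  return std::memcpy(dest, src, count);
}
```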
Runtime errors reported by sanitizer:
src/textord/pithsync.cpp:75:31: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:31 in
src/textord/pithsync.cpp:75:43: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:43 in
src/textord/pithsync.cpp:125:29: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:29 in
src/textord/pithsync.cpp:125:41: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:41 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Runtime error with enabled sanitizer:
src/textord/colpartition.cpp:2243:66: runtime error: index -1 out of bounds for type 'tesseract::ColPartition *[6]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/colpartition.cpp:2243:66 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
cprintf was an indirect way to call tprintf.
This indirection is not needed, so remove it and use tprintf directly.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is only used locally in intproto.cpp, so defining it before the first
use and adding the static attribute allows the compiler to inline it.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a bug reported by OSS Fuzz:
https://oss-fuzz.com/issue/5697280134348800
The old code passed a negative value (-1) as argument to step_dir
when destindex was 0.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
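For a closed outline, one way to look up the step before index 0 is to wrap
around to the last step. This is only a sketch under that assumption; the
helper name PreviousStepIndex is hypothetical, and whether the actual fix
wraps around or handles destindex 0 differently is not stated here.

```cpp
// Hypothetical helper: index of the step before `index` in a closed outline
// with `length` steps, wrapping from 0 to length - 1.
// Requires length > 0 and 0 <= index < length.
int PreviousStepIndex(int index, int length) {
  return (index + length - 1) % length;
}
```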
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to be the case, as we can get
called (sometimes) with num_out % 8 == 7.
Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
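The shape of such a loop can be sketched in plain C++: whole blocks of 8
first, then a scalar tail for the remaining 1 to 7 outputs. This is not
the NEON or intsimdmatrix code; the function name ScaleOutputs and the
element types are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the inner loop: scale integer sums to doubles.
// Requires scales.size() >= sums.size().
std::vector<double> ScaleOutputs(const std::vector<int32_t>& sums,
                                 const std::vector<double>& scales) {
  constexpr std::size_t kBlock = 8;
  const std::size_t num_out = sums.size();
  std::vector<double> out(num_out);
  std::size_t i = 0;
  // Whole blocks of 8: the part a SIMD kernel would handle.
  for (; i + kBlock <= num_out; i += kBlock) {
    for (std::size_t j = 0; j < kBlock; ++j) {
      out[i + j] = sums[i + j] * scales[i + j];
    }
  }
  // Remainder (for example 7 outputs when num_out % 8 == 7), one at a time.
  for (; i < num_out; ++i) {
    out[i] = sums[i] * scales[i];
  }
  return out;
}
```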
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON code added the time drops to 37
seconds.
I have not tested the configure/Makefile changes as I'm not using
them.
This means the sources compile perfectly in the absence of
config_auto.h/HAVE_CONFIG_H as they were intended to do.
TESSERACT_VERSION_STR is set to be precisely PACKAGE_VERSION
by autoconf, so there are no actual changes in compiled code.
Training with normalized line images higher than 36 px also results in larger widths.
The limit should therefore depend on the height used for the normalization.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Line images can be larger than the old limit, especially when training
is done with newspaper lines.
Image too large to learn!! Size = 2641x36
Image too large to learn!! Size = 2704x36
Image too large to learn!! Size = 2751x36
Image too large to learn!! Size = 3738x36
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This won't affect anything using the supplied build system. For
other projects that include tesseract within them, however, this
may make their life easier.
For example, I have an integration of Tesseract with Ghostscript,
in which tesseract is built as part of the Ghostscript build,
without using the tesseract build system.
The Ghostscript build system is makefile based, and has to work
on a range of make systems, including unix make, gnu make and
nmake. As such we have to avoid conditionals in the common
makefiles. It therefore becomes hard to build one set of files on
x86 systems, and another on (say) ARM systems.
Accordingly, this commit makes small tweaks to the architecture
specific files, so that they compile on EVERY platform; they simply
compile to something useful only on the appropriate platform.
Thus the makefiles can build all the files on all the systems, and
the preprocessor flags mean that the correct functions are actually
built.
If TESSERACT_DISABLE_DEBUG_FONTS is defined, tesseract doesn't
attempt to create any debug fonts. This not only saves memory,
but it (combined with the change to optionally use Pix as
internal storage for the ImageData) allows us to use an
embedded Leptonica library with no format handlers at all.
By default, when we ImageData::SetPix, we write the data out as a
PNG, just to read it back in to get a compressed buffer of data.
We then use this to generate a new Pix.
In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.
Not only that, but compressing/uncompressing is slow, and on minimal
builds of leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.
In such cases, it'd be far nicer just to keep the original Pix as
the internal data.
Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.
So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.
Given we don't do compression, and they were writing to memory,
this was all just more effort than we needed.
Also, if we're using GRAPHICS_DISABLED, we might as well just
pixCopy rather than pixClone as only the scaler uses this.
The fix makes the definition of `\n` consistent with the examples given below the definition. Please note that I did not check this against how it is implemented in the code.
Compiler warning:
src/api/baseapi.cpp:1151:27: warning:
variable 'curlcode' is uninitialized when used here [-Wuninitialized]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
clang warnings:
src/ccstruct/pageres.cpp:903:20: warning:
implicit conversion from 'int' to 'float' changes value from
2147483647 to 2147483648 [-Wimplicit-int-float-conversion]
src/ccstruct/pageres.cpp:904:23:
warning: implicit conversion from 'int' to 'float' changes value from
-2147483647 to -2147483648 [-Wimplicit-int-float-conversion]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
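The value change reported by clang can be reproduced in isolation. A small
sketch, assuming IEEE 754 single precision with the default rounding mode;
the function name is hypothetical.

```cpp
#include <cstdint>
#include <limits>

// float has a 24-bit significand, so INT32_MAX (2^31 - 1) is not
// representable and converts to the nearest float, which is 2^31.
bool Int32MaxRoundsUpInFloat() {
  const float as_float =
      static_cast<float>(std::numeric_limits<int32_t>::max());
  return as_float == 2147483648.0f;
}
```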
powerpc64le-linux-gnu-g++ warning:
src/training/mftraining.cpp:209:5: warning:
‘%04d’ directive output may be truncated writing between 4 and 10 bytes
into a region of size 8 [-Wformat-truncation=]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
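A minimal sketch of how such a truncation warning can be avoided: give the
buffer room for any int and check the return value of snprintf. The helper
name FormatIndex and the buffer size are assumptions, not the code in
mftraining.cpp.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format a number with at least four digits.
// The buffer has room for any 32-bit int (sign, 10 digits, terminator).
std::string FormatIndex(int value) {
  char buffer[16];
  const int written = std::snprintf(buffer, sizeof(buffer), "%04d", value);
  if (written < 0 || written >= static_cast<int>(sizeof(buffer))) {
    return std::string();  // encoding error or truncated output
  }
  return std::string(buffer, static_cast<std::size_t>(written));
}
```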
Those files are C++, and the wrong modeline is not needed at all.
Also remove some empty descriptions and old history in the comments.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- Remove unused type definitions for TessTextRenderer, ... in capi.h
(they were only used in capi.cpp which now no longer needs them)
- Fix typo in comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- Replace AVX_OPT, AVX2_OPT, FMA_OPT, SSE41_OPT
- Replace AVX, AVX2, FMA, SSE4_1
- Write new HAVE_AVX, HAVE_AVX2, HAVE_FMA, HAVE_SSE4_1 into config_auto.h
- Put related conditionals in Makefile.am in one place
This makes the code clearer and fixes a log message in
IntSimdMatrixTest.AVX2.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
`tesseract --version` now also shows the version of libcurl and related
libraries if it was built with libcurl.
The preprocessor macro HAVE_LIBCURL is now defined in config_auto.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Commit 94d0f77f56 tried to fix issue #2741
but created a new problem.
This commit should fix both the old and the new issue.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
If Tesseract cannot find text in the input image, it should not write
an empty lstmf file. This problem was reported in issue #2741.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
option --ptsize which defaults to 12. This option is not exposed through
tesstrain.sh; thus, you cannot use tesstrain.sh to explore training with
different font sizes. I made a small modification to expose the --ptsize
option to tesstrain.sh. It defaults to 12 if not specified.
Fix two occurrences of this LGTM warning:
Multiplication result may overflow 'double'
before it is converted to 'long double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes wrong output of integers with locale de_DE.UTF-8:
- /Width 2.481
- /Height 3.508
+ /Width 2481
+ /Height 3508
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes wrong output of integers with locale de_DE.UTF-8:
- <Page WIDTH="2.481" HEIGHT="3.508" PHYSICAL_IMG_NR="0" ID="page_0">
+ <Page WIDTH="2481" HEIGHT="3508" PHYSICAL_IMG_NR="0" ID="page_0">
Signed-off-by: Stefan Weil <sw@weilnetz.de>
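One common way to get locale-independent numbers from a C++ stream is to
imbue it with the classic "C" locale, which never inserts digit grouping.
This is a sketch of that technique only; whether these commits do exactly
this is not stated here, and the helper name FormatInteger is hypothetical.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: format an integer without locale-specific digit
// grouping, whatever the global locale is.
std::string FormatInteger(int value) {
  std::ostringstream stream;
  stream.imbue(std::locale::classic());
  stream << value;
  return stream.str();
}
```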
The title can be set for hOCR and PDF output.
Currently it is also used for ALTO, so setting the title can be used
as a workaround for issue #2700.
The constant unknown_title_ is no longer needed and therefore removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The function derives the file name for the .box file from an image name.
For training from existing line images, it is useful to directly support
the image names which are commonly used.
While generated images for Tesseract training typically use the name
pattern NAME.tif, other ground truth sets use NAME.bin.png for binarized
or NAME.nrm.png for grayscale images.
BoxFileName is also now a local function as it is only used locally.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
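The name mapping described above can be sketched as follows. This is an
illustration using std::string, not the actual BoxFileName function; the
names EndsWith and BoxFileNameSketch are hypothetical, and the sketch
assumes the last dot belongs to the file name rather than to a directory.

```cpp
#include <cstddef>
#include <string>

// Returns true if `name` ends with `suffix`.
bool EndsWith(const std::string& name, const std::string& suffix) {
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical sketch: derive NAME.box from NAME.tif, NAME.bin.png or
// NAME.nrm.png.
std::string BoxFileNameSketch(const std::string& image_filename) {
  std::string base = image_filename;
  const std::size_t last_dot = base.rfind('.');
  if (last_dot != std::string::npos) {
    base.erase(last_dot);  // drop the extension, e.g. ".tif" or ".png"
  }
  if (EndsWith(base, ".bin") || EndsWith(base, ".nrm")) {
    base.erase(base.size() - 4);  // drop the ".bin" / ".nrm" marker
  }
  return base + ".box";
}
```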
The configuration file lstm.train causes Tesseract to generate
training data for training of an LSTM line recognizer.
In this mode, no other files with OCR results should be written.
Without this patch, Tesseract writes a small text file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This allows OCR of images from the internet without downloading them first:
tesseract http://IMAGE_URL OUTPUT ...
It uses libcurl.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- Use C++ type casts
- Remove unneeded type cast
- Simplify code for function pop
- Remove macro push_on (it was only used once)
This fixes lots of compiler warnings caused by old type casts.
- Use C++ enums
- Use strongly typed C++11 enum for DIRECTION and optimize struct MFEDGEPT
- Use float constant for MF_SCALE_FACTOR
- Replace macros by inline functions
- Fix documentation comment
This fixes several warnings from clang.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a clang warning:
src/ccstruct/polyblk.cpp:412:12: warning: result of comparison of
unsigned enum expression >= 0 is always true
[-Wtautological-unsigned-enum-zero-compare]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Replace the macros which were declared in vecfuncs.h by member functions
and move a function which was only used in chop.cpp to that file.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Removing STRING from genericvector.h allows eliminating the proprietary
STRING data type from the public Tesseract API.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
- add another constructor for LSTMRecognizer
which takes the language_data_path_prefix configured/selected
at runtime and passes it to the internal CCUtil
- use this in Tesseract::init_tesseract_lang_data when LSTMs
are available
(this was missing from 297d7d86ce)
This fixes compiler warnings caused by
commit 091ce345f6:
src/wordrec/lm_state.h:100:7: warning: field 'cost'
will be initialized after field 'curr_b' [-Wreorder]
src/wordrec/lm_state.h:104:7: warning: field 'top_choice_flags'
will be initialized after field 'dawg_info' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
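Background for -Wreorder, as a small example with a made-up struct: members
are always initialized in declaration order, so an initializer list written
in a different order can be misleading when one member depends on another.

```cpp
// Members are initialized in declaration order (begin, then length), no
// matter how the initializer list is written. Listing them in the same
// order matches what actually happens and silences -Wreorder.
struct Span {
  int begin;
  int length;
  Span(int first, int last) : begin(first), length(last - begin) {}
};
```

If length were declared before begin, the expression last - begin would
read an uninitialized member.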
This fixes compiler warnings caused by
commit 5b4565b80b:
src/textord/colpartition.cpp:91:24: warning: field 'last_column_'
will be initialized after field 'column_set_' [-Wreorder]
src/textord/colpartition.cpp:93:37: warning: field 'inside_table_column_'
will be initialized after field 'nearest_neighbor_above_' [-Wreorder]
src/textord/colpartition.cpp:95:58: warning: field 'space_to_right_'
will be initialized after field 'owns_blobs_' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings caused by
commit ecf0f2dee5:
src/dict/dawg.h:202:9: warning: field 'type_' will be initialized
after field 'lang_' [-Wreorder]
src/dict/dawg.h:355:9: warning: field 'dawg_index' will be initialized
after field 'dawg_ref' [-Wreorder]
src/dict/dawg.h:356:9: warning: field 'punc_index' will be initialized
after field 'punc_ref' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes compiler warnings caused by
commit 751fcd2b11:
src/classify/classify.cpp:176:7: warning:
field 'EnableLearning' will be initialized after
field 'il1_adaption_test' [-Wreorder]
src/classify/classify.cpp:187:7: warning:
field 'dict_' will be initialized after
field 'static_classifier_' [-Wreorder]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Only one of bIt, dIt, iIt and sIt is used, so put all four in a union.
This fixes CID 1164628, CID 1164629, CID 1164630 and CID 1164631.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Report from Coverity Scan:
CID 1405560 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member end is not initialized in
this constructor nor in any functions that it calls.
CID 1405561 [...]
Modernize and optimize class WERD_RES. This not only fixes the issues
but also reduces the size and eliminates the functions InitNonPointers
and InitPointers.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reduce size from 368 to 352 bytes for Trie, 72 to 64 bytes for Dawg
and 40 to 24 bytes for DawgPosition by avoiding holes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The class no longer uses bit fields. Re-ordering the member variables
avoids holes and reduces the size of BLOBNBOX from 168 to 152 bytes.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
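The effect of member order on padding can be seen with two made-up structs
(these are not the BLOBNBOX members). On common 64-bit ABIs the first one
has padding after each bool, while the second one only has padding at
the end.

```cpp
#include <cstdint>

// Hypothetical layout with a hole after each small member.
struct WithHoles {
  bool flag_a;
  double value;
  bool flag_b;
  int32_t count;
};

// Same members, ordered from largest to smallest alignment.
struct Reordered {
  double value;
  int32_t count;
  bool flag_a;
  bool flag_b;
};
```

Comparing sizeof(WithHoles) and sizeof(Reordered) shows the saving for a
given compiler and target.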
Fix this runtime error in recodebeam_test and unicharcompress_test:
src/ccutil/unicharcompress.h:84:27: runtime error:
left shift of 267 by 28 places cannot be represented in type 'int'
code has up to kMaxCodeLen (9) values, so the highest possible value for
i is 8, and the shift value can reach 7 * 8 = 56.
That requires a uint64_t data type.
size_t would fit for 64 bit hosts, but be too small for 32 bit hosts.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
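A sketch of the arithmetic described above, assuming a hash that places
value i at bit offset 7 * i. The function name HashCode and the use of
std::vector are assumptions; this is not the code in unicharcompress.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mixes each code value into the result at bit offset 7 * i.
// The value is widened to uint64_t before shifting, so an offset of up to
// 56 bits (9 values) stays within the width of the type.
// Assumes non-negative values and at most 9 of them.
uint64_t HashCode(const std::vector<int>& code) {
  uint64_t result = 0;
  for (std::size_t i = 0; i < code.size(); ++i) {
    result ^= static_cast<uint64_t>(code[i]) << (7 * i);
  }
  return result;
}
```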
Fix this runtime error in osd_test and textlineprojection_test:
src/ccmain/osdetect.cpp:109:14: runtime error: division by zero
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Fix these runtime errors in mastertrainer_test:
src/ccutil/bitvector.cpp:119:18: runtime error:
null pointer passed as argument 2, which is declared to never be null
src/ccutil/bitvector.cpp:124:10: runtime error:
null pointer passed as argument 1, which is declared to never be null
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes three LGTM warnings:
Multiplication result may overflow 'float' before it is converted to 'double'.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
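What this kind of warning is about, as a small sketch with made-up function
names: the multiplication is done in the narrow type unless one operand is
widened first.

```cpp
// Multiplies in float and widens afterwards: a large product has already
// overflowed before the conversion to double.
double NarrowProduct(float a, float b) { return a * b; }

// Widens one operand first, so the multiplication itself is done in double.
double WideProduct(float a, float b) { return static_cast<double>(a) * b; }
```

The same pattern applies to double and long double.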
They are moved from src/classify and src/lstm to src/training.
This reduces the size of the Tesseract library.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is only used in unittest/layout_test.cc after moving a test from
baseapi_test.cc to that file, so it can be made local.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The method was only used in unittest where it can be replaced by
UNICHARSET::load_from_file which also simplifies the code.
This allows removing the class InMemoryFilePointer and fixes a TODO.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The MS compiler only accepts string constants up to 65535 characters,
so shorten the string for that compiler to fix the compilation.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This converts special characters such as '<' or '>' to the
correct HTML entities.
Also optimize the code a little.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
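A minimal sketch of such an escaping step. The function name EscapeHtml and
the exact set of replaced characters are assumptions, not the code changed
by this commit.

```cpp
#include <string>

// Hypothetical helper: replace characters that are special in HTML by the
// corresponding entities.
std::string EscapeHtml(const std::string& text) {
  std::string escaped;
  escaped.reserve(text.size());
  for (const char c : text) {
    switch (c) {
      case '<': escaped += "&lt;"; break;
      case '>': escaped += "&gt;"; break;
      case '&': escaped += "&amp;"; break;
      case '"': escaped += "&quot;"; break;
      case '\'': escaped += "&#39;"; break;
      default: escaped += c; break;
    }
  }
  return escaped;
}
```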
The vector was already limited to MAX_NUM_PROTOS (512) entries or 64 bytes
in the old code. Now it uses that size right from the start which avoids
reallocating it later when entries are added.
The old code which reallocated the vector to expand it was buggy because
the realloc function can return a different pointer, but the code still
used the original pointer to reset the new bits.
Function ExpandBitVector is now unused and therefore removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
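The realloc pitfall described above, as a sketch with a made-up helper
(ExpandWords is not the removed ExpandBitVector): the new words must be
cleared through the pointer that realloc returns, not through the old one.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: grow a word array and clear the added words.
// Requires new_count >= old_count. Returns nullptr on failure, in which
// case the caller still owns the original `words` block.
uint32_t* ExpandWords(uint32_t* words, std::size_t old_count,
                      std::size_t new_count) {
  auto* grown = static_cast<uint32_t*>(
      std::realloc(words, new_count * sizeof(uint32_t)));
  if (grown == nullptr) {
    return nullptr;
  }
  // Use `grown` from here on: realloc may have moved the block.
  std::memset(grown + old_count, 0,
              (new_count - old_count) * sizeof(uint32_t));
  return grown;
}
```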
tesseract::FileReader and tesseract::FileWriter are already declared
in serialis.h which is included by genericvector.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>