That moves its data into the text segment and reduces the total size
slightly:
text data bss dec hex filename
39788 693 0 40481 9e21 old/libtesseract_la-pdfrenderer.o
40360 88 0 40448 9e00 new/libtesseract_la-pdfrenderer.o
Signed-off-by: Stefan Weil <sw@weil.de>
The original code crashes in pango_fc_font_get_glyph on MacOS with M1.
Replacing the type cast with the macro made for that conversion
gives at least an error message before crashing:
(process:12546): GLib-GObject-WARNING **: 08:38:02.472: invalid cast from 'PangoCairoCoreTextFont' to 'PangoFcFont'
zsh: segmentation fault ./pango_font_info_test
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This replaces the proprietary STRING data type
(764 instead of 838 lines remaining).
It also removes STRING from osdetect.h and serialis.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The old code looked for the minimum confidence which triggered
very often a 2nd OCR without improving the result.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
We also move to relying on both scales and output having been
padded to accomodate us writing more results than are actually
needed here. This was allowed for a few commits back.
Avoid 1) floating point division by 127, 2) conversion of
bias to double, 3) FP addition, in favour of 1) integer
multiplication by 127, and 2) integer addition.
(Also costs extra work in the serialisation/deserialisation of
the scale values, and conversion of weights to int formats, but
these are all one offs).
In order to allow intSimdMatrix implementations to 'overwrite'
their outputs, ensure that the output buffers are always padded
to the next block size.
This doesn't make any difference yet, but it enables optimisations
further down the line, especially when the biasing is pulled into
the SIMD.
Currently, the size of the scales array is not rounded up
in the same way as the weights are. This blocks us pushing
the scale calculations into the SIMD, as when we "overread"
the end of the scale array, we potentially get errors.
Here, we adjust the intSimdMatrix stuff to ensure that the
scales array reserves enough entries to allow such overreads
to work.
This doesn't make any difference for now, but opens the way
for future optimisations.
They used the function pango_coverage_max which does nothing and
which has been deprecated since pango version 1.44.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Issues:
Debug information for "NoImages" just be binary image,
it don't show up the result of photo_mask_pix to developer
Fix:
Substract binary image to photo_mask_pix, the result
are "NoImages" binary pix
When checking horizontal line partitions for
possible interpretation as underline formatting,
avoid confusing the hline partition itself with
an overlapping neighbour (which would delete it).
When detecting vertical separators, the blob aligner is used to glue
line segments (often segmented due to artificial cracks).
But (unlike LineFinder) it has many parameters that are not
relative to pixel density/resolution.
This change decreases the minimum absolute length in pixels
for vertical separators.
The kDictWildcard is never actually used, so removing it makes
no difference. It causes warnings in MSVC builds as MSVC doesn't
know how to pack a unicode value into chars.
If building with TESSERACT_IMAGEDATA_AS_PIX, then tesseract
doesn't compress/decompress images, but rather holds the
data as internal Pix structures. Unfortunately, I forgot to
make the ImageData destructor free these, so memory leaked
during use. Fixed here.
strtok_s can be used with MSVC as a replacement for strtok_r, so less
special handling is needed in the code and class SVNetwork can be
made smaller by removing member has_content.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Some runtime parameters which are only relevant with graphics enabled
were now removed from builds when graphics was disabled.
TableFinder::DisplayColSegmentGrid is never used, so remove it completely.
Builds with --disable-graphics significantly reduce the code size and avoid
some function calls which might be important for certain applications:
text data bss dec hex filename
3219230 41136 13920 3274286 31f62e .libs/libtesseract.so (--disable-graphics, old)
3211347 40976 13600 3265923 31d583 .libs/libtesseract.so (--disable-graphics, new)
3360942 43656 15392 3419990 342f56 .libs/libtesseract.so (default)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It was reported by oss-fuzz (issue 23962).
Add log output to find real images which trigger that issue.
Avoid also some conversions from float to double by always using float.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This replaces the proprietary STRING data type
(801 instead of 838 lines remaining).
It also removes STRING from osdetect.h and serialis.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Runtime error reported by sanitizer:
src/ccstruct/rect.h:191:44: runtime error: 50961 is outside the range of representable values of type 'short'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/rect.h:191:44 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Runtime error reported by sanitizer:
src/ccstruct/coutln.cpp:1018:19: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:48:14: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/coutln.cpp:1018:19 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Runtime errors reported by sanitizer:
src/textord/pithsync.cpp:75:31: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:31 in
src/textord/pithsync.cpp:75:43: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:43 in
src/textord/pithsync.cpp:125:29: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:29 in
src/textord/pithsync.cpp:125:41: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:41 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Runtime error with enabled sanitizer:
src/textord/colpartition.cpp:2243:66: runtime error: index -1 out of bounds for type 'tesseract::ColPartition *[6]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/colpartition.cpp:2243:66 in
Signed-off-by: Stefan Weil <sw@weilnetz.de>
cprintf was an indirect way to call tprintf.
This indirection is not needed, so remove it and use tprintf directly.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
It is only used locally in intproto.cpp, so defining it before the first
use and adding the static attribute allows the compiler to inline it.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This fixes a bug reported by OSS Fuzz:
https://oss-fuzz.com/issue/5697280134348800
The old code passed a negative value (-1) as argument to step_dir
when destindex was 0.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to to be the case, as we can get
called (sometimes) with num_out % 8 == 7.
Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON coded added the time drops to 37
seconds.
I have not tested the configure/Makefile changes as I'm not using
them.