This fixes a bug reported by OSS Fuzz:
https://oss-fuzz.com/issue/5697280134348800
The old code passed a negative value (-1) as argument to step_dir
when destindex was 0.
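For illustration only, one way to avoid the negative argument is to wrap the index around the outline. This is a minimal sketch, not the actual patch: the helper name `PreviousIndex` and the assumption that the outline is circular with `length` steps are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper (not part of Tesseract): returns the index of the
// step before `index` in a circular outline of `length` steps. For
// index == 0 it wraps to the last step instead of yielding -1.
inline int PreviousIndex(int index, int length) {
  assert(length > 0 && index >= 0 && index < length);
  return (index + length - 1) % length;
}
```

With such a helper, a caller would pass `PreviousIndex(destindex, length)` rather than `destindex - 1`.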
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to be the case, as we can get
called (sometimes) with num_out % 8 == 7.
Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
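The shape of such a fix can be sketched in scalar code. This is an illustration under stated assumptions, not the NEON kernel itself: the function name `MatrixDotVector`, the row-major weight layout and the per-output `scales` vector are all hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scalar stand-in for the SIMD kernel. Weights are assumed to
// be stored row-major, one row of num_in int8 weights per output.
inline void MatrixDotVector(const std::vector<int8_t>& weights,
                            const std::vector<double>& scales,
                            const std::vector<int8_t>& input, int num_out,
                            std::vector<double>* output) {
  const int num_in = static_cast<int>(input.size());
  assert(static_cast<int>(weights.size()) == num_out * num_in);
  assert(static_cast<int>(scales.size()) == num_out);
  output->assign(num_out, 0.0);

  // Integer dot product of one weight row with the input, then scaled.
  auto dot = [&](int row) {
    int32_t sum = 0;
    for (int i = 0; i < num_in; ++i) {
      sum += weights[row * num_in + i] * input[i];
    }
    return sum * scales[row];
  };

  int out = 0;
  // Whole blocks of 8 outputs: what a SIMD path would handle in one go.
  for (; out + 8 <= num_out; out += 8) {
    for (int k = 0; k < 8; ++k) {
      (*output)[out + k] = dot(out + k);
    }
  }
  // Remainder: up to 7 outputs, which covers the num_out % 8 == 7 case
  // without reading or writing past the end of scales/output.
  for (; out < num_out; ++out) {
    (*output)[out] = dot(out);
  }
}
```

The point of the second loop is that the tail is handled explicitly rather than relying on the output and scale arrays having been padded to a multiple of 8.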
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON code added, the time drops to 37
seconds.
I have not tested the configure/Makefile changes as I'm not using
them.
This means the sources compile perfectly in the absence of
config_auto.h/HAVE_CONFIG_H as they were intended to do.
TESSERACT_VERSION_STR is set to be precisely PACKAGE_VERSION
by autoconf, so there are no actual changes in compiled code.
Training with normalized line images higher than 36 px also results in larger widths.
The limit should therefore depend on the height used for the normalization.
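As a rough illustration of a height-dependent limit, the check could look like the sketch below. The function name `ImageTooWide` and the factor of 128 are placeholders chosen for the example, not the values used by the patch.

```cpp
#include <cassert>

// Hypothetical factor (an assumption for this sketch): maximum accepted
// width per pixel of normalized line height.
constexpr int kMaxWidthPerHeight = 128;

// Returns true if a line image of the given size exceeds the width limit
// derived from its normalized height.
inline bool ImageTooWide(int width, int height) {
  assert(height > 0);
  return width > kMaxWidthPerHeight * height;
}
```

Because the limit scales with the height, training with taller normalized lines automatically permits proportionally wider images.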
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Line images can be larger than the old limit, especially when training
is made with newspaper lines.
Image too large to learn!! Size = 2641x36
Image too large to learn!! Size = 2704x36
Image too large to learn!! Size = 2751x36
Image too large to learn!! Size = 3738x36
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This won't affect anything using the supplied build system. For
other projects that include tesseract within them, however, this
may make their life easier.
For example, I have an integration of Tesseract with Ghostscript,
in which tesseract is built as part of the Ghostscript build,
without using the tesseract build system.
The Ghostscript build system is makefile based, and has to work
on a range of make systems, including Unix make, GNU make and
nmake. As such we have to avoid conditionals in the common
makefiles. It therefore becomes hard to build one set of files on
x86 systems, and another on (say) ARM systems.
Accordingly, this commit makes small tweaks to the architecture
specific files, so that they compile on EVERY platform; they just
only compile to anything useful on the appropriate platform.
Thus the makefiles can build all the files on all the systems, and
the preprocessor flags mean that the correct functions are actually
built.
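The pattern can be sketched as follows. This is a simplified illustration, not the actual Tesseract sources: the function names `SumFour` and `SumFourNeon` are hypothetical, and the use of the compiler-predefined `__ARM_NEON` macro as the guard is an assumption about how such a file would be written.

```cpp
#include <cassert>
#include <cstdint>

// The NEON-specific part compiles to nothing unless the compiler
// predefines __ARM_NEON, so this file builds on every platform.
#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical NEON routine: sums four 32-bit integers.
inline int32_t SumFourNeon(const int32_t* v) {
  int32x4_t x = vld1q_s32(v);
  int32x2_t s = vadd_s32(vget_low_s32(x), vget_high_s32(x));
  s = vpadd_s32(s, s);
  return vget_lane_s32(s, 0);
}
#endif  // __ARM_NEON

// Hypothetical portable entry point: picks the NEON routine when it was
// compiled in and a plain C++ fallback otherwise.
inline int32_t SumFour(const int32_t* v) {
  assert(v != nullptr);
#if defined(__ARM_NEON)
  return SumFourNeon(v);
#else
  return v[0] + v[1] + v[2] + v[3];
#endif
}
```

A makefile can then list this file unconditionally; on x86 the guarded section is simply empty.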
This is a copy of projects/tesseract-ocr/build.sh including its history from
https://github.com/google/oss-fuzz.git.
It allows maintaining the build rules with the Tesseract source code.
The build rules for Leptonica were slightly modified to avoid
unneeded compilations.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
If TESSERACT_DISABLE_DEBUG_FONTS is defined, tesseract doesn't
attempt to create any debug fonts. This not only saves memory,
but it (combined with the change to optionally use Pix as
internal storage for the ImageData) allows us to use an
embedded Leptonica library with no format handlers at all.
By default, when we ImageData::SetPix, we write the data out as a
PNG, just to read it back in to get a compressed buffer of data.
We then use this to generate a new Pix.
In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.
Not only that, but compressing/uncompressing is slow, and on minimal
builds of leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.
In such cases, it'd be far nicer just to keep the original Pix as
the internal data.
Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.
So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.
Given we don't do compression, and they were writing to memory,
this was all just more effort than we needed.
Also, if we're using GRAPHICS_DISABLED, we might as well just
pixCopy rather than pixClone as only the scaler uses this.
The fix makes the definition of `\n` consistent with the examples given below the definition. Please note that I did not check this against how it is implemented in the code.
Compiler warning:
src/api/baseapi.cpp:1151:27: warning:
variable 'curlcode' is uninitialized when used here [-Wuninitialized]
Signed-off-by: Stefan Weil <sw@weilnetz.de>