Commit Graph

4645 Commits

Author SHA1 Message Date
Stefan Weil
0e7701bc3c SEAM: More inline functions for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:20:14 +02:00
Stefan Weil
e45100ebf7 TBOX: Use inline constructor for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:17:55 +02:00
zdenop
58c60e6c98
Merge pull request #3022 from stweil/fix
Fix undefined shift with negative value (oss-fuzz issue 14658)
2020-06-16 13:48:22 +02:00
Stefan Weil
c110958ffa Fix undefined shift with negative value (oss-fuzz issue 14658)
This fixes a bug reported by OSS Fuzz:
https://oss-fuzz.com/issue/5697280134348800

The old code passed a negative value (-1) as argument to step_dir
when destindex was 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 13:25:32 +02:00
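The shift problem described in commit c110958ffa can be illustrated with a minimal sketch. The `PackedSteps` type, its two-bits-per-step layout, and the wrap-around in `PrevStepDir` are assumptions made for illustration only; they are not taken from the Tesseract sources, and the actual fix may differ.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical illustration, not Tesseract's actual code: direction codes
// packed four per byte, two bits each.
struct PackedSteps {
  std::vector<std::uint8_t> bytes;
  int count = 0;

  // Precondition: 0 <= index < count. With index == -1, index % 4 is -1 in
  // C++, so the shift count would be -2, which is undefined behaviour.
  int StepDir(int index) const {
    return (bytes[index / 4] >> ((index % 4) * 2)) & 3;
  }

  // Direction of the step before `index`. Assumes the steps form a closed
  // loop, so index 0 wraps to the last step instead of passing -1 on.
  int PrevStepDir(int index) const {
    const int prev = (index + count - 1) % count;
    return StepDir(prev);
  }
};
```

Here the caller never forms `index - 1` directly, so a destination index of 0 cannot reach the shift as a negative value.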
Stefan Weil
6ee3698958 Remove old unused code from imagedata.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 16:02:27 +02:00
Stefan Weil
d8500adcf4 Fix crash caused by missing thread synchronization (issues #757, #1168 and #2191)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 15:53:17 +02:00
Stefan Weil
62eae84fea
Merge pull request #2991 from robinwatts/pushback9
Fix intsimdmatrixneon.cpp stack corruption.
2020-05-27 16:31:23 +02:00
Robin Watts
6fec69de1a Fix intsimdmatrixneon.cpp stack corruption.
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to be the case, as we can get
called (sometimes) with num_out % 8 == 7.

Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
2020-05-27 13:40:17 +01:00
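The situation described in commit 6fec69de1a, an output count that is not a multiple of the block size, can be sketched in portable C++. This is a hedged illustration only: `MatrixDotVector` here is a hypothetical function with an assumed row-major weight layout, it uses no NEON intrinsics, and the inner block loop merely stands in for the SIMD path.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch, not the code from intsimdmatrixneon.cpp.
// weights is row-major: num_out rows of input.size() entries each.
std::vector<double> MatrixDotVector(const std::vector<std::int8_t>& weights,
                                    const std::vector<double>& scales,
                                    const std::vector<std::int8_t>& input,
                                    int num_out) {
  const int num_in = static_cast<int>(input.size());
  std::vector<double> output(num_out);

  // Scaled integer dot product of one weight row with the input.
  auto dot_row = [&](int row) {
    std::int32_t sum = 0;
    for (int i = 0; i < num_in; ++i) {
      sum += weights[row * num_in + i] * input[i];
    }
    return sum * scales[row];
  };

  int row = 0;
  // Full blocks of 8 outputs (stand-in for the SIMD path).
  for (; row + 8 <= num_out; row += 8) {
    double block[8];
    for (int k = 0; k < 8; ++k) block[k] = dot_row(row + k);
    for (int k = 0; k < 8; ++k) output[row + k] = block[k];
  }
  // Tail: the remaining num_out % 8 outputs are written one at a time, so
  // nothing is stored past the end of output or read past the end of scales.
  for (; row < num_out; ++row) output[row] = dot_row(row);
  return output;
}
```

Writing a whole block of 8 for the last partial group would overrun the output buffer, which is the kind of corruption the commit message refers to.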
Stefan Weil
ff0a7a38f7 Check compiler options depending on host cpu
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-27 06:52:36 +02:00
Stefan Weil
a06d0d8449 Add missing include statements for config_auto.h
They are required to get the macro DISABLED_LEGACY_ENGINE.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-22 16:34:28 +02:00
Stefan Weil
6732eb9eb5 Clean code for NEON support
Include it only for NEON and remove unneeded code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-21 07:03:37 +02:00
Stefan Weil
7b0e5b0722
Merge pull request #2978 from robinwatts/pushback6
Add NEON SIMD code
2020-05-21 06:59:01 +02:00
Robin Watts
f79e52a7cc NEON SIMD code.
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON code added the time drops to 37
seconds.

I have not tested the configure/Makefile changes as I'm not using
them.
2020-05-20 18:54:42 +01:00
zdenop
3a3c41d1ab try to fix cmake gcc build - make simd configuration (HAVE_?) global (as autotools). 2020-05-19 18:02:16 +02:00
zdenop
32b3ab40f1 fix cmake msvc build 2020-05-19 16:16:38 +02:00
zdenop
90e81ac939 suppress VS warnings in release target C4267 (conversion from 'size_t' to 'type', possible loss of data), C4305 ('context' : truncation from 'type1' to 'type2') and C4267 ('var' : conversion from 'size_t' to 'type', possible loss of data) 2020-05-19 16:06:03 +02:00
zdenop
b5d639dcc5
Merge pull request #2965 from robinwatts/pushback1
thanks.
2020-05-16 20:35:19 +02:00
zdenop
064b4403de
Merge pull request #2966 from robinwatts/pushback2 2020-05-16 20:06:31 +02:00
Stefan Weil
5d9b181d67
Merge pull request #2982 from robinwatts/pushback8
Guard #include "config_auto.h" with HAVE_CONFIG_H.
2020-05-16 15:00:40 +02:00
zdenop
acaa90c971 cmake: don't use vector unit compile definition globally 2020-05-16 12:30:20 +02:00
Robin Watts
3408c36eab Guard #include "config_auto.h" with HAVE_CONFIG_H.
Every other file already does this.
2020-05-15 19:29:03 +01:00
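The guard named in commit 3408c36eab follows a common autoconf pattern, sketched below. The `HAVE_CONFIG_H`, `config_auto.h` and `DISABLED_LEGACY_ENGINE` names come from the commit messages in this log; the `LegacyEngineDisabled` helper is hypothetical and exists only to show a macro from the optional header being consumed.

```cpp
// Include the generated header only when the build system announces it.
// Builds that do not define HAVE_CONFIG_H skip the include entirely.
#ifdef HAVE_CONFIG_H
#include "config_auto.h"
#endif

// Hypothetical helper: reports whether the legacy engine was compiled out.
// Works whether or not config_auto.h was included above.
bool LegacyEngineDisabled() {
#ifdef DISABLED_LEGACY_ENGINE
  return true;
#else
  return false;
#endif
}
```

With this shape the same source file compiles both under the supplied build system and in an embedding project that passes its own defines.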
Amit D
b4d3bf616a
Merge pull request #2981 from robinwatts/pushback7
Fix OEM_DEFAULT in DISABLED_LEGACY_ENGINE builds.
2020-05-15 18:09:06 +03:00
Robin Watts
43437a540b Fix OEM_DEFAULT in DISABLED_LEGACY_ENGINE builds.
If api->Init is called with OEM_DEFAULT in DISABLED_LEGACY_ENGINE
build modes, the engine mode is never set, resulting in no
words being found.
2020-05-15 14:56:41 +01:00
Stefan Weil
84721e9049
Merge pull request #2979 from juliangilbey/correct_swap_comment
Trivial code documentation fix: move comment about swap meaning for DeSerialize to correct function
2020-05-13 09:32:55 +02:00
Julian Gilbey
e7e6999d3b Move comment about swap meaning for DeSerialize to correct function 2020-05-13 07:02:59 +01:00
Robin Watts
27d513462c Avoid using PACKAGE_VERSION in favour of TESSERACT_VERSION_STR.
This means the sources compile perfectly in the absence of
config_auto.h/HAVE_CONFIG_H as they were intended to do.

TESSERACT_VERSION_STR is set to be precisely PACKAGE_VERSION
by autoconf, so there are no actual changes in compiled code.
2020-05-12 21:45:12 +02:00
zdenop
f9f8da1b8c
Merge pull request #2977 from stweil/limit 2020-05-12 19:14:09 +02:00
Stefan Weil
39f7fb4a1a Allow line images with larger width (depending on height)
Training with normalized line images higher than 36 px also results in larger widths.
The limit should therefore depend on the height used for the normalization.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:59:31 +02:00
Stefan Weil
34bdc8b74e Allow line images with larger width
Line images can be larger than the old limit, especially when training
is made with newspaper lines.

    Image too large to learn!! Size = 2641x36
    Image too large to learn!! Size = 2704x36
    Image too large to learn!! Size = 2751x36
    Image too large to learn!! Size = 3738x36

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:50:40 +02:00
Egor Pugin
43bbcd4ce2
Merge pull request #2976 from juliangilbey/fix_memory_leak_in_linerec
Destroy box before potentially exiting function (preventing a memory leak)
2020-05-12 17:33:42 +03:00
Julian Gilbey
ca5735efcb Destroy box before potentially exiting function 2020-05-12 15:25:16 +01:00
Stefan Weil
d3a0768c32
Merge pull request #2975 from robinwatts/pushback5
Tweak architecture specific SIMD files for ease of compilation
2020-05-12 14:55:32 +02:00
Robin Watts
a9b44ee8c2 Tweak architecture specific SIMD files for ease of compilation.
This won't affect anything using the supplied build system. For
other projects that include tesseract within them, however, this
may make their life easier.

For example, I have an integration of Tesseract with Ghostscript,
in which tesseract is built as part of the Ghostscript build,
without using the tesseract build system.

The Ghostscript build system is makefile based, and has to work
on a range of make systems, including unix make, gnu make and
nmake. As such we have to avoid conditionals in the common
makefiles. It therefore becomes hard to build one set of files on
x86 systems, and another on (say) ARM systems.

Accordingly, this commit makes small tweaks to the architecture
specific files, so that they compile on EVERY platform; just they
only compile to anything useful on the appropriate platform.

Thus the makefiles can build all the files on all the systems, and
the preprocessor flags mean that the correct functions are actually
built.
2020-05-12 13:09:29 +01:00
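The idea in commit a9b44ee8c2, architecture-specific code that compiles everywhere but only produces something useful on the matching platform, might look roughly like the sketch below. All function names and the `SKETCH_HAVE_NEON` macro are hypothetical; only the `__ARM_NEON` compiler macro and the `arm_neon.h` intrinsics are real, and this is not the layout of the actual Tesseract files.

```cpp
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
// NEON variant: only compiled when the compiler targets NEON.
inline std::int32_t SumFourNeon(const std::int32_t* v) {
  const int32x4_t x = vld1q_s32(v);
  return vgetq_lane_s32(x, 0) + vgetq_lane_s32(x, 1) +
         vgetq_lane_s32(x, 2) + vgetq_lane_s32(x, 3);
}
#endif

// Generic variant: compiled on every platform.
inline std::int32_t SumFourGeneric(const std::int32_t* v) {
  return v[0] + v[1] + v[2] + v[3];
}

// The preprocessor, not the makefile, picks the implementation.
std::int32_t SumFour(const std::int32_t* v) {
#if defined(SKETCH_HAVE_NEON)
  return SumFourNeon(v);
#else
  return SumFourGeneric(v);
#endif
}
```

Because the selection happens inside the source, a makefile without conditionals can list the same files on x86 and ARM.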
Egor Pugin
0eaabc42c7
Update CMakeLists.txt 2020-05-12 11:49:15 +03:00
Egor Pugin
e720a26745
[cmake] Set inactivity timeout during icu download to 300 seconds.
Fixes #2972.
2020-05-09 18:55:45 +03:00
Stefan Weil
fe966cc0b1 Add build script for oss-fuzz fuzzers
This is a copy of projects/tesseract-ocr/build.sh including its history from
https://github.com/google/oss-fuzz.git.

It allows maintaining the build rules with the Tesseract source code.

The build rules for Leptonica were slightly modified to avoid
unneeded compilations.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-08 17:37:37 +02:00
Stefan Weil
016016df77 Build only required Leptonica components
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-08 17:37:37 +02:00
Guido Vranken
6e9a1e97db Fix build (#3177)
* [tesseract-ocr] Fix build

* [tesseract-ocr] Disable AFL, lower resolution
2020-05-08 17:37:37 +02:00
jonathanmetzman
db5655333e Migrate projects using -lFuzzingEngine to $LIB_FUZZING_ENGINE (#2325)
Migrate from -lFuzzingEngine to $LIB_FUZZING_ENGINE where possible and not causing breakage
2020-05-08 17:37:37 +02:00
Guido Vranken
56b94fb783 Add fuzzer that processes 512x512 images (#2279) 2020-05-08 17:37:37 +02:00
Guido Vranken
b2d1a11016 Use Leptonica master branch (#2224) 2020-05-08 17:37:37 +02:00
Guido Vranken
1a7f633ab0 Add Tesseract (#2210)
* Add Tesseract

* Use -lz instead of static library path

* Disable Tesseract shared build

* Minimal repository cloning (--depth 1)

* Improve tessdata directory resolution syntax

* Don't hardcode TESSDATA_PREFIX into binary

* Don't move, but copy $SRC/tessdata to $OUT

Move sometimes results in "inter-device move failed"
2020-05-08 17:37:37 +02:00
Robin Watts
80d4af6ecf Add a mechanism to avoid creating debug fonts.
If TESSERACT_DISABLE_DEBUG_FONTS is defined, tesseract doesn't
attempt to create any debug fonts. This not only saves memory,
but it (combined with the change to optionally use Pix as
internal storage for the ImageData) allows us to use an
embedded Leptonica library with no format handlers at all.
2020-05-05 00:22:23 +01:00
Robin Watts
6bcb941bcf Avoid tesseract writing Pix out/reading them back.
By default, when we ImageData::SetPix, we write the data out as a
PNG, just to read it back in to get a compressed buffer of data.
We then use this to generate a new Pix.

In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.

Not only that, but compressing/uncompressing is slow, and on minimal
builds of leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.

In such cases, it'd be far nicer just to keep the original Pix as
the internal data.

Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.

So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.

Given we don't do compression, and they were writing to memory,
this was all just more effort than we needed.

Also, if we're using GRAPHICS_DISABLED, we might as well just
pixCopy rather than pixClone as only the scaler uses this.
2020-05-04 21:01:22 +01:00
zdenop
79c3ebbbb9
Merge pull request #2962 from stweil/GetPageRes
Add TessBaseAPI::GetPageRes again
2020-05-04 15:15:29 +02:00
Stefan Weil
9173e6e3f7 Add TessBaseAPI::GetPageRes again
It is now added unconditionally, so it is always available for the unittest.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-04 14:03:39 +02:00
Amit D
acc4c8bff5
Merge pull request #2952 from jannick0/patch-1
[trie.h] pattern definition: fix documentation
2020-04-27 23:44:48 +03:00
zdenop
23be532f7d
Merge pull request #2957 from stweil/master 2020-04-27 19:56:32 +02:00
Stefan Weil
1188e0a516 Remove old code which was used for Ocropus
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-04-27 16:33:34 +02:00
jannick0
e044163085
[trie.h] pattern definition: fix documentation
The fix makes the definition of `\n` consistent with the examples given below the definition.  Please note that I did not check this against how it is implemented in the code.
2020-04-19 13:47:42 +02:00