Commit Graph

1252 Commits

Author SHA1 Message Date
Stefan Weil
93cfffeb87 Remove unused argument from function TruncateParam
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:41 +02:00
Stefan Weil
f08b16a5a0 Remove assertion which is triggered by tests
oss-fuzz issue 15149 triggers this assertion. See test case here:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=15149

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:26 +02:00
Stefan Weil
18d9983f69 StrokeWidth: Remove unused local variable (fixes compiler warning)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:09 +02:00
Stefan Weil
bc61038dd4 SPLIT: Make function bounding_box inline for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:21:36 +02:00
Stefan Weil
0e7701bc3c SEAM: More inline functions for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:20:14 +02:00
Stefan Weil
e45100ebf7 TBOX: Use inline constructor for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:17:55 +02:00
Stefan Weil
c110958ffa Fix undefined shift with negative value (oss-fuzz issue 14658)
This fixes a bug reported by OSS Fuzz:
https://oss-fuzz.com/issue/5697280134348800

The old code passed a negative value (-1) as argument to step_dir
when destindex was 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 13:25:32 +02:00
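To illustrate the class of bug described in c110958ffa, here is a minimal sketch, not the actual Tesseract code. It assumes a hypothetical `PackedSteps` type that stores four 2-bit direction codes per byte; the type, its layout and `previous_step_dir` are invented for illustration. In such a lookup an index of -1 would produce a negative shift count, which is undefined behaviour in C++, so the sketch wraps the index before using it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical packed storage (NOT Tesseract's C_OUTLINE): four 2-bit
// direction codes per byte, lowest bits first.
struct PackedSteps {
  std::vector<std::uint8_t> bytes;
  int count = 0;  // number of stored steps

  // Reads step `index`. A negative index would give a negative shift count,
  // which is undefined behaviour, so the precondition is checked.
  int step_dir(int index) const {
    assert(index >= 0 && index < count);
    const int shift = (index % 4) * 2;
    return (bytes[index / 4] >> shift) & 3;
  }

  // Direction of the step before `index`, wrapping to the last step when
  // index is 0 instead of passing -1 to step_dir.
  int previous_step_dir(int index) const {
    assert(count > 0);
    return step_dir((index + count - 1) % count);
  }
};
```

With four stored steps, `previous_step_dir(0)` reads step 3 rather than step -1.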
Stefan Weil
6ee3698958 Remove old unused code from imagedata.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 16:02:27 +02:00
Stefan Weil
d8500adcf4 Fix crash caused by missing thread synchronization (issues #757, #1168 and #2191)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 15:53:17 +02:00
Robin Watts
6fec69de1a Fix intsimdmatrixneon.cpp stack corruption.
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to be the case, as we can get
called (sometimes) with num_out % 8 == 7.

Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
2020-05-27 13:40:17 +01:00
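The situation described in 6fec69de1a (an output count that is not a multiple of the block size) can be sketched in portable C++. This is an illustration only, not the NEON code from the commit: `ScaleOutputs` and its parameters are hypothetical names. The main loop handles whole blocks of 8 and a tail loop handles the leftover outputs, so nothing is written past the last valid element.

```cpp
#include <cstdint>

// Hypothetical scaling step: out[i] = sums[i] * scales[i] for num_out values.
// The main loop handles whole blocks of 8; the tail loop handles the 0..7
// leftover outputs one at a time so nothing is written past out[num_out - 1].
void ScaleOutputs(const std::int32_t* sums, const double* scales,
                  int num_out, double* out) {
  constexpr int kBlock = 8;
  int i = 0;
  for (; i + kBlock <= num_out; i += kBlock) {
    for (int k = 0; k < kBlock; ++k) {
      out[i + k] = sums[i + k] * scales[i + k];
    }
  }
  for (; i < num_out; ++i) {
    out[i] = sums[i] * scales[i];
  }
}
```

Calling it with `num_out == 15` exercises the `num_out % 8 == 7` case mentioned in the commit message.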
Stefan Weil
a06d0d8449 Add missing include statements for config_auto.h
They are required to get the macro DISABLED_LEGACY_ENGINE.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-22 16:34:28 +02:00
Stefan Weil
6732eb9eb5 Clean code for NEON support
Include it only for NEON and remove unneeded code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-21 07:03:37 +02:00
Robin Watts
f79e52a7cc NEON SIMD code.
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON code added the time drops to 37
seconds.

I have not tested the configure/Makefile changes as I'm not using
them.
2020-05-20 18:54:42 +01:00
zdenop
b5d639dcc5
Merge pull request #2965 from robinwatts/pushback1
thanks.
2020-05-16 20:35:19 +02:00
zdenop
064b4403de
Merge pull request #2966 from robinwatts/pushback2 2020-05-16 20:06:31 +02:00
Robin Watts
3408c36eab Guard #include "config_auto.h" with HAVE_CONFIG_H.
Every other file already does this.
2020-05-15 19:29:03 +01:00
Robin Watts
43437a540b Fix OEM_DEFAULT in DISABLED_LEGACY_ENGINE builds.
If api->Init is called with OEM_DEFAULT in DISABLED_LEGACY_ENGINE
build modes, the engine mode is never set, resulting in no
words being found.
2020-05-15 14:56:41 +01:00
Julian Gilbey
e7e6999d3b Move comment about swap meaning for DeSerialize to correct function 2020-05-13 07:02:59 +01:00
Robin Watts
27d513462c Avoid using PACKAGE_VERSION in favour of TESSERACT_VERSION_STR.
This means the sources compile perfectly in the absence of
config_auto.h/HAVE_CONFIG_H as they were intended to do.

TESSERACT_VERSION_STR is set to be precisely PACKAGE_VERSION
by autoconf, so there are no actual changes in compiled code.
2020-05-12 21:45:12 +02:00
Stefan Weil
39f7fb4a1a Allow line images with larger width (depending on height)
Training with normalized line images higher than 36 px also results in larger widths.
The limit should therefore depend on the height used for the normalization.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:59:31 +02:00
Stefan Weil
34bdc8b74e Allow line images with larger width
Line images can be larger than the old limit, especially when training
is done with newspaper lines.

    Image too large to learn!! Size = 2641x36
    Image too large to learn!! Size = 2704x36
    Image too large to learn!! Size = 2751x36
    Image too large to learn!! Size = 3738x36

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:50:40 +02:00
Julian Gilbey
ca5735efcb Destroy box before potentially exiting function 2020-05-12 15:25:16 +01:00
Stefan Weil
d3a0768c32
Merge pull request #2975 from robinwatts/pushback5
Tweak architecture specific SIMD files for ease of compilation
2020-05-12 14:55:32 +02:00
Robin Watts
a9b44ee8c2 Tweak architecture specific SIMD files for ease of compilation.
This won't affect anything using the supplied build system. For
other projects that include tesseract within them, however, this
may make their life easier.

For example, I have an integration of Tesseract with Ghostscript,
in which tesseract is built as part of the Ghostscript build,
without using the tesseract build system.

The Ghostscript build system is makefile based, and has to work
on a range of make systems, including unix make, gnu make and
nmake. As such we have to avoid conditionals in the common
makefiles. It therefore becomes hard to build one set of files on
x86 systems, and another on (say) ARM systems.

Accordingly, this commit makes small tweaks to the architecture
specific files, so that they compile on EVERY platform but only
compile to anything useful on the appropriate platform.

Thus the makefiles can build all the files on all the systems, and
the preprocessor flags mean that the correct functions are actually
built.
2020-05-12 13:09:29 +01:00
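The approach described in a9b44ee8c2 can be sketched as a single source file that builds on any platform. This is a minimal illustration, not the Tesseract sources: `DotScalar`, `DotNeon` and `SelectDot` are hypothetical names, and the sketch assumes the compiler defines `__ARM_NEON` when NEON is available. The architecture-specific function sits behind the preprocessor guard, so the makefile needs no conditionals.

```cpp
#include <cstddef>
#include <cstdint>

using DotFn = std::int32_t (*)(const std::int8_t*, const std::int8_t*,
                               std::size_t);

// Portable version, built on every platform.
std::int32_t DotScalar(const std::int8_t* a, const std::int8_t* b,
                       std::size_t n) {
  std::int32_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

// The NEON version is guarded by a compiler-provided macro, so this file
// compiles everywhere but only produces NEON code on suitable ARM targets.
#if defined(__ARM_NEON)
#include <arm_neon.h>
std::int32_t DotNeon(const std::int8_t* a, const std::int8_t* b,
                     std::size_t n) {
  int32x4_t acc = vdupq_n_s32(0);
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    // Widening multiply of 8 lanes, then pairwise add into the accumulator.
    const int16x8_t prod = vmull_s8(vld1_s8(a + i), vld1_s8(b + i));
    acc = vpadalq_s16(acc, prod);
  }
  std::int32_t sum = vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1) +
                     vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
  for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
  return sum;
}
#endif

// Returns the best implementation that was actually compiled in.
DotFn SelectDot() {
#if defined(__ARM_NEON)
  return DotNeon;
#else
  return DotScalar;
#endif
}
```

Callers use `SelectDot()` and never need to know which branch the preprocessor kept.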
Egor Pugin
0eaabc42c7
Update CMakeLists.txt 2020-05-12 11:49:15 +03:00
Egor Pugin
e720a26745
[cmake] Set inactivity timeout during icu download to 300 seconds.
Fixes #2972.
2020-05-09 18:55:45 +03:00
Robin Watts
80d4af6ecf Add a mechanism to avoid creating debug fonts.
If TESSERACT_DISABLE_DEBUG_FONTS is defined, tesseract doesn't
attempt to create any debug fonts. This not only saves memory,
but it (combined with the change to optionally use Pix as
internal storage for the ImageData) allows us to use an
embedded Leptonica library with no format handlers at all.
2020-05-05 00:22:23 +01:00
Robin Watts
6bcb941bcf Avoid tesseract writing Pix out/reading them back.
By default, when we ImageData::SetPix, we write the data out as a
PNG, just to read it back in to get a compressed buffer of data.
We then use this to generate a new Pix.

In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.

Not only that, but compressing/uncompressing is slow, and on minimal
builds of leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.

In such cases, it'd be far nicer just to keep the original Pix as
the internal data.

Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.

So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.



Given we don't do compression, and they were writing to memory,
this was all just more effort than we needed.

Also, if we're using GRAPHICS_DISABLED, we might as well just
pixCopy rather than pixClone as only the scaler uses this.
2020-05-04 21:01:22 +01:00
Amit D
acc4c8bff5
Merge pull request #2952 from jannick0/patch-1
[trie.h] pattern definition: fix documentation
2020-04-27 23:44:48 +03:00
Stefan Weil
1188e0a516 Remove old code which was used for Ocropus
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-04-27 16:33:34 +02:00
jannick0
e044163085
[trie.h] pattern definition: fix documentation
The fix makes the definition of `\n` consistent with the examples given below the definition.  Please note that I did not check this against how it is implemented in the code.
2020-04-19 13:47:42 +02:00
Stefan Weil
4a00b68c63 Fix lambda function for curl code errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 20:46:52 +01:00
Stefan Weil
9f5a3f6ac7 Fix uninitialized local variable in curl code
Compiler warning:

    src/api/baseapi.cpp:1151:27: warning:
      variable 'curlcode' is uninitialized when used here [-Wuninitialized]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 19:25:33 +01:00
zdenop
6e307074d8
Merge pull request #2894 from stweil/curl
Report errors from curl_easy functions
2020-03-18 14:14:07 +01:00
Stefan Weil
ef4f99a994 Run xgetbv instruction only on machines which support it
This fixes a regression for older Intel processors.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-08 17:32:10 +01:00
Stefan Weil
eff4dc0603 Use lambda expressions for reporting curl errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:44:42 +01:00
Stefan Weil
9972c91127 Report errors from curl_easy functions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:26:51 +01:00
Stefan Weil
57ff90687d simd: Check whether the OS supports FMA, AVX, ...
The previous check was only for the MS compiler, but not for gcc and clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 16:34:35 +01:00
zdenop
7c3ac569f9
Replace references to the old wiki by new URLs (#2877)
Replace references to the old wiki by new URLs
2020-02-03 14:59:18 +01:00
Stefan Weil
16553014e0 Replace references to the old wiki by new URLs
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 11:37:41 +01:00
Stefan Weil
20bcbc4058 Catch std::runtime_error exception when setting the locale in debug code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 07:58:43 +01:00
Robert Sachunsky
cdc8e44a20 ChoiceIterator: skip symbol without choices 2020-01-24 09:19:14 +01:00
jkang-eng
60248f59d4 Fix "tesseract.exe not flushing stdout/stderr" (Issue #2859) (#2865)
* Issue #2859 - Fix "tesseract.exe not flushing stdout/stderr"
2020-01-21 21:51:08 +01:00
Stefan Weil
6f2f310fdf Remove redundant method from class GenericVector
length() is not needed: it can be replaced by size().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-18 11:30:14 +01:00
Stefan Weil
3d1f82d0e2 tesstrain.sh: Fix command line flag --help
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-05 10:10:55 +01:00
Stefan Weil
cfd39dc2c7 pageres: Fix compiler warnings
clang warnings:

    src/ccstruct/pageres.cpp:903:20: warning:
      implicit conversion from 'int' to 'float' changes value from
      2147483647 to 2147483648 [-Wimplicit-int-float-conversion]
    src/ccstruct/pageres.cpp:904:23:
      warning: implicit conversion from 'int' to 'float' changes value from
      -2147483647 to -2147483648 [-Wimplicit-int-float-conversion]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-04 09:46:10 +01:00
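The clang warning quoted in cfd39dc2c7 arises because a 32-bit float has a 24-bit significand and cannot represent 2147483647 exactly. The following minimal sketch shows the effect; `survives_float_round_trip` is a hypothetical helper written for this illustration, not code from pageres.cpp.

```cpp
#include <cstdint>

// Returns true when converting `value` to float keeps its exact value.
bool survives_float_round_trip(std::int32_t value) {
  const float as_float = static_cast<float>(value);
  // Compare in double, which represents every 32-bit int and every float
  // exactly.
  return static_cast<double>(as_float) == static_cast<double>(value);
}
```

For `INT32_MAX` the function returns false, because the nearest float is 2147483648, the value named in the warning.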
Stefan Weil
d2a2292f32 mftraining: Fix compiler warning
powerpc64le-linux-gnu-g++ warning:

    src/training/mftraining.cpp:209:5: warning:
        ‘%04d’ directive output may be truncated writing between 4 and 10 bytes
        into a region of size 8 [-Wformat-truncation=]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-03 10:13:58 +01:00
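The `-Wformat-truncation` warning quoted in d2a2292f32 concerns a `%04d` directive writing into a region of size 8. A minimal sketch of how such truncation can be detected at run time follows; `format_fits` is a hypothetical helper for illustration and not how mftraining.cpp was changed.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats `value` with "%04d" into a fixed 8-byte buffer and reports whether
// the text plus its terminating NUL fit. std::snprintf returns the length the
// full output would have had, so a result >= sizeof(buffer) means truncation.
bool format_fits(int value, std::string* out) {
  char buffer[8];
  const int needed = std::snprintf(buffer, sizeof(buffer), "%04d", value);
  *out = buffer;
  return needed >= 0 && static_cast<std::size_t>(needed) < sizeof(buffer);
}
```

A value such as 42 fits as "0042", while a nine-digit value is cut to its first seven characters and the function returns false.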
zdenop
79f191fe20
Merge pull request #2826 from bertsky/clip-blockpolygon
make BlockPolygon usable
2019-12-19 09:14:25 +01:00
Robert Sachunsky
4b0c9f3373 BlockPolygon: clip to image rectangle 2019-12-18 13:29:43 +01:00
Robert Sachunsky
5751a408c9 BlockPolygon: unrotate from internal to image coordinates 2019-12-18 13:29:43 +01:00
amitdo
502ebe8ca9 Autotools: Pango, Cairo and ICU only required by training tools 2019-12-16 17:23:06 +02:00
Stefan Weil
fc84f84b5b Remove Emacs C modeline in comment line 1
Those files are C++, and the wrong modeline is not needed at all.
Remove also some empty descriptions and old history in the comments.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-05 13:57:50 +01:00
Stefan Weil
420cbac876 Clean public API for renderers
- Remove unused type definitions for TessTextRenderer, ... in capi.h
  (they were only used in capi.cpp which now no longer needs them)

- Fix typo in comment

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-03 12:23:58 +01:00
Stefan Weil
56df8e6e19 Fix some typos in comments (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-02 14:30:13 +01:00
Stefan Weil
a1a139cbd2 Replace AVX_OPT, ..., AVX macros by HAVE_AVX, ... and clean related code
- Replace AVX_OPT, AVX2_OPT, FMA_OPT, SSE41_OPT
- Replace AVX, AVX2, FMA, SSE4_1
- Write new HAVE_AVX, HAVE_AVX2, HAVE_FMA, HAVE_SSE4_1 into config_auto.h
- Put related conditionals in Makefile.am in one place

This makes the code clearer and fixes a log message in
IntSimdMatrixTest.AVX2.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 17:51:37 +01:00
Stefan Weil
074844ce46 Show libcurl version
`tesseract --version` now also shows the version of libcurl and related
libraries if it was built with libcurl.

The preprocessor macro HAVE_LIBCURL is now defined in config_auto.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 16:34:52 +01:00
Stefan Weil
cbd3a21cb2 automake: Flat build for src/viewer and src/wordrec
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
0cd2bdbd2b automake: Flat build for src/textord
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
558462358a automake: Flat build for src/opencl
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
6eeb486b77 automake: Flat build for src/lstm
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
7ebcc77e3b automake: Flat build for src/dict
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
6181acf367 automake: Flat build for src/cutil
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
159160518b automake: Flat build for src/classify
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
9730c7e167 automake: Flat build for src/ccutil
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
b1d449315e automake: Flat build for src/ccstruct
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
9745a9d111 automake: Flat build for src/ccmain
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
a166efaad6 automake: Flat build for src/arch
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
cafb1bbfd7 automake: Flat build for src/api
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Martin Malmsten
9ed3887432 Added ComposedBlock level to Alto output 2019-11-17 21:06:12 +01:00
zdenop
2d6f38eebf fix using bilevel tiff in pdf output 2019-11-10 16:11:52 +01:00
Shreeshrii
99dfa8a680 Add separator and training_iteration to checkpoint name (#2752)
* Add separator and training_iteration to checkpoint name
* specify modelname_N.NN_NN_NN.checkpoint for intermediate checkpoint
2019-11-09 12:22:40 +01:00
Stefan Weil
ac46b286a4 Fix issue #2748
Commit 94d0f77f56 tried to fix issue #2741
but created a new problem.

This commit should fix both the old and the new issue.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-08 17:12:20 +01:00
Stefan Weil
0406f7706d Use BRT_UNKNOWN instead of BRT_NOISE to initialize ColPartition::blob_type_
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-08 07:40:06 +01:00
Stefan Weil
9b46a67efa Use "C" locale for printing parameters
This fixes a test for the Python wrapper `tesserocr` (python setup.py test).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-04 19:21:20 +01:00
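One common way to get locale-independent output, as 9b46a67efa aims for, is to imbue the output stream with the classic "C" locale. The sketch below shows that technique under the assumption that values are written through a C++ stream; `FormatDouble` is a hypothetical helper, and the commit itself may have used a different mechanism.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Formats a number with the classic "C" locale so the decimal separator is
// always '.', regardless of the global C++ locale set by the application.
std::string FormatDouble(double value) {
  std::ostringstream stream;
  stream.imbue(std::locale::classic());
  stream << value;
  return stream.str();
}
```

Without the `imbue` call, a stream created after the global C++ locale was switched to, for example, a German locale could print a decimal comma, which breaks parsers that expect "0.5".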
Egor Pugin
ab836dbb31
Merge pull request #2743 from DavidMaung/master
Exposed the text2image option --ptsize to tesstrain.sh.
2019-11-02 17:09:51 +03:00
Stefan Weil
a306cd7370 Fail if no valid lstmf file was written (fix issue #2741)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-01 21:52:45 +01:00
Stefan Weil
94d0f77f56 Don't create an empty lstmf file
If Tesseract cannot find text in the input image, it should not write
an empty lstmf file. This problem was reported in issue #2741.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-01 21:43:26 +01:00
maungd@battelle.org
3d7afb69ea Exposed the text2image option --ptsize to tesstrain.sh.
Text2image has the option --ptsize which defaults to 12.  This option is
not exposed through tesstrain.sh; thus, you cannot use tesstrain.sh to
explore training with different font sizes.  I made a small modification
to expose the --ptsize option to tesstrain.sh.  It defaults to 12 if not
specified.
2019-11-01 15:10:58 -04:00
Stefan Weil
b5498c70fa Use pre-calculated lookup tables for all C++ compilers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-31 20:26:01 +01:00
Egor Pugin
2bcc9d8093 Remove cppan build. 2019-10-30 21:37:38 +03:00
Stefan Weil
ca87b06d59 Fix build for Intel Compiler (issue #2736)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-30 10:09:44 +01:00
Stefan Weil
20a50e9bcb Fix typo in comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-30 10:06:31 +01:00
Egor Pugin
2a37f5dd62 Update includes to use <>. 2019-10-29 14:50:11 +03:00
Egor Pugin
9e324938ab Update includes to use <>. 2019-10-29 14:31:38 +03:00
Stefan Weil
629b05d978 Update README.md and other documentation for new include file structure
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-29 12:26:41 +01:00
amitdo
2f8884a64e Fix autotools build 2019-10-28 21:23:58 +02:00
amitdo
e1bae15547 Fix #include path of public headers 2019-10-28 19:10:30 +02:00
amitdo
dfede8ac01 Move all public headers to include/tesseract 2019-10-28 18:50:31 +02:00
zdenop
cede5b34e7
Add pageseg_apply_music_mask option to allow disabling the musi… (#2732)
Add pageseg_apply_music_mask option to allow disabling the music mask
2019-10-27 17:02:05 +01:00
zdenop
4a37cde0d9 fix inverting (Bilevel BW png) in pdf; fixes #2059 2019-10-27 14:15:12 +01:00
Nat
52bc15acd9 Add pageseg_apply_music_mask option to allow disabling the music mask 2019-10-24 11:44:05 -05:00
Egor Pugin
c727b556f0 Remove unneeded TESS_API from source file. 2019-10-23 13:26:46 +03:00
Egor Pugin
e2688c39e9 Remove TESS_CALL. 2019-10-23 13:21:59 +03:00
wshwang
4ee95a615a src/ccutil/bits16.h remove warnings (#2726) 2019-10-23 11:46:24 +02:00
wshwang
71e291bae5 Remove warning C4312 2019-10-22 13:06:44 +02:00
zdenop
fc629eae3b training: show error description for open/delete file 2019-10-21 16:31:57 +02:00
Stefan Weil
90bcff3732 Delete copy constructor and assignment operator for TessBaseAPI (fix issue #874)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-21 13:12:36 +02:00
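Deleting the copy constructor and assignment operator, as 90bcff3732 does for TessBaseAPI, turns accidental copies into compile errors. A minimal sketch of the C++11 idiom follows; `OcrEngine` is a hypothetical stand-in class, not the real TessBaseAPI declaration.

```cpp
#include <type_traits>

// Hypothetical stand-in for a class that owns resources and must not be
// copied.
class OcrEngine {
 public:
  OcrEngine() = default;
  OcrEngine(const OcrEngine&) = delete;             // no copy construction
  OcrEngine& operator=(const OcrEngine&) = delete;  // no copy assignment
};

// Both checks are evaluated at compile time.
static_assert(!std::is_copy_constructible<OcrEngine>::value,
              "OcrEngine must not be copy constructible");
static_assert(!std::is_copy_assignable<OcrEngine>::value,
              "OcrEngine must not be copy assignable");
```

With this in place, a statement such as `OcrEngine b = a;` no longer compiles, so two objects can never end up sharing and double-freeing the same internal state.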
Stefan Weil
a209a6b4b5 Copy resolution of source image (fix issue #1702)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-20 20:45:35 +02:00
zdenop
36dc2ccf75 fix memory leak at PangoFontInfo::CanRenderString 2019-10-20 16:43:04 +02:00
zdenop
1ec34378d9 test for synthesized font faces. 2019-10-19 15:05:28 +02:00