Commit Graph

1395 Commits

Author SHA1 Message Date
Stefan Weil
f462389673 renderer for TessPDFRenderer
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
d55e5f4803 Replace more GenericVector by std::vector
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
4a28d33c58 Replace GenericVector by std::vector in strngs.h and more places
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
3ddc88cccb Use std::vector in TessPDFRenderer
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
7c679e777d Use std::vector for allowed_scripts
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
32d53479ae Use std::vector for vars_vec, vars_values
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
085f6b2572 Use std::list for paragraph models
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
4ebba72919 Use std::vector for paragraph models
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Stefan Weil
524fc67165 Fix tesseract --list-langs
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-28 21:03:29 +01:00
Egor Pugin
986b57dd4e Export symbol for unit test. 2020-12-28 04:58:26 +03:00
Egor Pugin
3187f2ef08 Move doubleptr.h to unittests as it is used only there. 2020-12-28 02:32:27 +03:00
Egor Pugin
4175679da6 Revert kdpair, genericheap changes. 2020-12-28 02:31:45 +03:00
Stefan Weil
289a34a40a Add const attribute for pdf_ttf
That moves its data into the text segment and reduces the total size
slightly:

   text	   data	    bss	    dec	    hex	filename
  39788	    693	      0	  40481	   9e21	old/libtesseract_la-pdfrenderer.o
  40360	     88	      0	  40448	   9e00	new/libtesseract_la-pdfrenderer.o

Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 17:51:56 +01:00
Stefan Weil
7dca63caf1 More fixes for namespace tesseract
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 17:41:53 +01:00
Stefan Weil
7188b160ae Fix build with --disable-graphics
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 17:36:24 +01:00
Egor Pugin
aecbf79791 Add missing merge_unicharsets training tool to cmake and sw build. 2020-12-26 15:57:22 +03:00
Stefan Weil
317ef988a0 Add missing namespace prefix for GlobalParams() (fix build for some unit tests)
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 13:44:43 +01:00
Stefan Weil
418064f639 Add missing namespace prefix (fix build for merge_unicharsets)
Signed-off-by: Stefan Weil <sw@weil.de>
2020-12-26 13:09:39 +01:00
Egor Pugin
c8b8d266d6 Fix some of vector<bool> cases for msvc. 2020-12-26 04:17:13 +03:00
Egor Pugin
6b22972bc2 Fix linux build. 2020-12-26 04:15:42 +03:00
Egor Pugin
c3e04abe1e Inherit STRING from std::string. 2020-12-26 03:48:35 +03:00
Egor Pugin
4fc467a922 Inherit GenericVector from std::vector. Inherit kdpairs from std::pair. Rewrite some move ctors to modern C++ style. 2020-12-26 03:23:09 +03:00
Egor Pugin
04d3cfcf2f Merge branch 'master' of github.com-egorpugin:tesseract-ocr/tesseract 2020-12-26 00:55:37 +03:00
Egor Pugin
79a86f2582 Move all tesseract symbols into tesseract namespace. Fix include order in many places. 2020-12-26 00:55:30 +03:00
zdenop
ceadc4ddb8 remove inline declaration 2020-12-25 16:28:00 +01:00
Egor Pugin
14d52a79ba Remove .rc files. No need to add them into dll/exe. 2020-12-25 18:06:35 +03:00
zdenop
044921267f embed pdf.ttf to tesseract library #2551 2020-12-25 13:20:36 +01:00
Stefan Weil
cc133aa394 Fix text for fonts_dir parameter
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 21:32:05 +01:00
Stefan Weil
34abba8698 Add terminating linefeed to fonts.conf
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 21:32:05 +01:00
Stefan Weil
17a64eef1e Simplify code for PangoFontInfo::HardInitFontConfig
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 21:32:05 +01:00
Stefan Weil
707ee70966 Use deprecated pango_fc_font_get_glyph for old Pango versions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 12:02:37 +01:00
Stefan Weil
f759142c95 Remove buggy Windows implementation for getting glyph from font
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 09:07:09 +01:00
Stefan Weil
7669d36a37 Use HarfBuzz instead of deprecated pango_fc_font_get_glyph
This fixes the crash on MacOS with M1.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 09:03:05 +01:00
Stefan Weil
8c859a7329 Fix type cast from PangoFont to PangoFcFont
The original code crashes in pango_fc_font_get_glyph on MacOS with M1.

Replacing the type cast with the macro made for that conversion
gives at least an error message before crashing:

    (process:12546): GLib-GObject-WARNING **: 08:38:02.472: invalid cast from 'PangoCairoCoreTextFont' to 'PangoFcFont'
    zsh: segmentation fault  ./pango_font_info_test

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-22 08:45:11 +01:00
Stefan Weil
3efedabda3 automake: Flat build for src/training
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-19 15:25:21 +01:00
Stefan Weil
6fcf8d23bc Use more compiler and linker flags from pkg-config
This fixes some build issues with Homebrew on MacOS.

Signed-off-by: Stefan Weil <stefan@Sabines-Mac-mini.fritz.box>
2020-12-13 13:24:46 +01:00
Stefan Weil
490bd3ec8f Fix build with enabled TensorFlow
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-04 10:56:23 +01:00
Stefan Weil
ac116d1b28 Fix regression in Network::Serialize (fix issue #3167)
The regression was caused by a wrong string serialization in
commit 4613738a5e.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-12-03 19:36:58 +01:00
zdenop
279b0b2e37
Merge pull request #3160 from stweil/string2
Replace more occurrences of STRING by std::string of char*
2020-11-27 18:24:17 +01:00
Stefan Weil
65b11a1e12 Pack class SVMenuNode
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:17:27 +01:00
Stefan Weil
a1849bc65c Pack struct CLASS_STRUCT
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:17:27 +01:00
Stefan Weil
0bb46ac2e0 Pack struct BlamerBundle
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:17:27 +01:00
Stefan Weil
bf3774cc91 Use more const char*
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:01:17 +01:00
Stefan Weil
4613738a5e Use const char* for filename and network_spec parameters
This replaces the proprietary STRING data type
(764 instead of 838 lines remaining).

It also removes STRING from osdetect.h and serialis.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:01:17 +01:00
Stefan Weil
fbc4c809d9 Replace STRING by std::string
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-31 14:08:39 +01:00
Stefan Weil
92b6c652f3 Use std::vector for scales_
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-29 08:00:11 +01:00
Stefan Weil
c15dd26b84 Don't pass scales_ to IntSimdMatrix::Init
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-28 20:35:53 +01:00
Stefan Weil
fe76142a3d Remove GenericVector::scale() again
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-28 16:24:59 +01:00
Stefan Weil
eaf72ace31 Prefer result from inverted image if the mean confidence is better
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-26 20:37:47 +01:00
Stefan Weil
cfb1fb2540 Try OCR on inverted line only if mean confidence is below 50 %
The old code looked for the minimum confidence which triggered
very often a 2nd OCR without improving the result.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-26 09:32:09 +01:00
Robin Watts
436008bd37 Tweak SIMDDetect for ANDROID Neon.
cpufeatures.h should be cpu-features.h, with the latest NDK
at least. The #if 0'd section is not required because armv8
always includes NEON.
2020-10-19 12:04:29 +01:00
Robin Watts
db10c7b577 intsimdmatrixneon.cpp: Do biasing in SIMD. 2020-10-12 04:30:46 -07:00
Robin Watts
d1e49d6dd2 intsimdmatrixavx2: Do biasing in SIMD.
We also move to relying on both scales and output having been
padded to accomodate us writing more results than are actually
needed here. This was allowed for a few commits back.
2020-10-12 04:30:46 -07:00
Robin Watts
872816897a Rejig intsimdmatrix to reduce FP ops.
Avoid 1) floating point division by 127, 2) conversion of
bias to double, 3) FP addition, in favour of 1) integer
multiplication by 127, and 2) integer addition.

(Also costs extra work in the serialisation/deserialisation of
the scale values, and conversion of weights to int formats, but
these are all one offs).
2020-10-12 04:30:46 -07:00
Robin Watts
aba1800f69 Round output buffers for intSimdMatrix.
In order to allow intSimdMatrix implementations to 'overwrite'
their outputs, ensure that the output buffers are always padded
to the next block size.

This doesn't make any difference yet, but it enables optimisations
further down the line, especially when the biasing is pulled into
the SIMD.
2020-10-12 11:47:16 +01:00
Robin Watts
9dfdac51c6 Tweak scales array for intSimdMatrix case.
Currently, the size of the scales array is not rounded up
in the same way as the weights are. This blocks us pushing
the scale calculations into the SIMD, as when we "overread"
the end of the scale array, we potentially get errors.

Here, we adjust the intSimdMatrix stuff to ensure that the
scales array reserves enough entries to allow such overreads
to work.

This doesn't make any difference for now, but opens the way
for future optimisations.
2020-10-12 11:47:16 +01:00
amitdo
958f23453e Improve disabled legacy engine build 2020-10-12 11:47:16 +01:00
Stefan Weil
ac14ab32c6 Remove dummy functions from globaloc.cpp and related code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-04 12:24:26 +02:00
Stefan Weil
7c4ef88dab Remove unused functions FontUtils::GetAllRenderableCharacters
They used the function pango_coverage_max which does nothing and
which has been deprecated since pango version 1.44.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-03 12:04:40 +02:00
Le Duc Nam
eb8f1674bf Correct "NoImages" in debug pdf file
Issues:
  Debug information for "NoImages" just be binary image,
  it don't show up the result of photo_mask_pix to developer

Fix:
  Substract binary image to photo_mask_pix, the result
  are "NoImages" binary pix
2020-09-06 23:31:30 +07:00
Robert Sachunsky
640c14e080 AutoPageSeg/FindBlocks/GridRemoveUnderlinePartitions: avoid self-deletion
When checking horizontal line partitions for
possible interpretation as underline formatting,
avoid confusing the hline partition itself with
an overlapping neighbour (which would delete it).
2020-08-24 19:13:48 +02:00
Robert Sachunsky
65a077d3e9 FindAndRemoveLines/FindVerticalAlignment: decrease fixed vline min length
When detecting vertical separators, the blob aligner is used to glue
line segments (often segmented due to artificial cracks).
But (unlike LineFinder) it has many parameters that are not
relative to pixel density/resolution.
This change decreases the minimum absolute length in pixels
for vertical separators.
2020-08-24 19:13:36 +02:00
Robert Sachunsky
0228d93684 textord debugging: invert default top/bottom bounaries, improve description 2020-08-24 19:13:25 +02:00
Stefan Weil
d33edbc4b1
Merge pull request #3066 from robinwatts/pushback14
Remove unused char constant that causes a warning.
2020-07-17 15:55:51 +02:00
Robin Watts
578462109b Remove unused char constant that causes a warning.
The kDictWildcard is never actually used, so removing it makes
no difference. It causes warnings in MSVC builds as MSVC doesn't
know how to pack a unicode value into chars.
2020-07-17 14:22:37 +01:00
Robin Watts
150e2e54fe Squash some warnings in MSVC build.
In particular, "defined but not used" (caused by GRAPHICS_DISABLED),
double constants being truncated to floats, and implicit casts.
2020-07-16 10:08:40 +01:00
zdenop
7fa200bfb7
Merge pull request #3064 from robinwatts/pushback12
Fix Memory leak when using TESSERACT_IMAGEDATA_AS_PIX
2020-07-15 19:08:58 +02:00
Robin Watts
7f45b719d1 Fix Memory leak when using TESSERACT_IMAGEDATA_AS_PIX
If building with TESSERACT_IMAGEDATA_AS_PIX, then tesseract
doesn't compress/decompress images, but rather holds the
data as internal Pix structures. Unfortunately, I forgot to
make the ImageData destructor free these, so memory leaked
during use. Fixed here.
2020-07-15 12:35:35 +01:00
zdenop
135c8a49b5
Merge pull request #3061 from stweil/neon
Always use NEON by default for ARMv8
2020-07-11 09:11:54 +02:00
zdenop
875bd48bd5
Merge pull request #3058 from stweil/scrollview
Disable more code and data with GRAPHICS_DISABLED
2020-07-11 09:11:27 +02:00
Stefan Weil
548a832b98 Use strtok_s for MSVC in class SVNetwork
strtok_s can be used with MSVC as a replacement for strtok_r, so less
special handling is needed in the code and class SVNetwork can be
made smaller by removing member has_content.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-10 17:47:05 +02:00
Stefan Weil
2db2223b39 Always use NEON by default for ARMv8
Signed-off-by: Stefan Weil <stefan.weil@bib.uni-mannheim.de>
2020-07-10 15:27:09 +02:00
Stefan Weil
cb3880fb15 Disable more code and data with GRAPHICS_DISABLED
Some runtime parameters which are only relevant with graphics enabled
were now removed from builds when graphics was disabled.

TableFinder::DisplayColSegmentGrid is never used, so remove it completely.

Builds with --disable-graphics significantly reduce the code size and avoid
some function calls which might be important for certain applications:

   text	   data	    bss	    dec	    hex	filename
3219230	  41136	  13920	3274286	 31f62e	.libs/libtesseract.so (--disable-graphics, old)
3211347	  40976	  13600	3265923	 31d583	.libs/libtesseract.so (--disable-graphics, new)
3360942	  43656	  15392	3419990	 342f56	.libs/libtesseract.so (default)

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-09 11:23:33 +02:00
Stefan Weil
22e6c2e5a7 Fix division by 0.0 in BaselineRow::PerpDistanceFromBaseline
It was reported by oss-fuzz (issue 23962).

Add log output to find real images which trigger that issue.
Avoid also some conversions from float to double by always using float.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-08 17:59:02 +02:00
Stefan Weil
8137cf35a6 Use const char* for filename parameters
This replaces the proprietary STRING data type
(801 instead of 838 lines remaining).

It also removes STRING from osdetect.h and serialis.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-07 14:20:09 +02:00
Stefan Weil
51dff483e7 Fix runtime error caused by too large TBOX
Runtime error reported by sanitizer:

    src/ccstruct/rect.h:191:44: runtime error: 50961 is outside the range of representable values of type 'short'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/rect.h:191:44 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-30 20:51:52 +02:00
Stefan Weil
2269a500ef Fix runtime error with null pointer argument
Runtime error reported by sanitizer:

    src/ccstruct/coutln.cpp:1018:19: runtime error: null pointer passed as argument 2, which is declared to never be null
    /usr/include/string.h:48:14: note: nonnull attribute specified here
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/ccstruct/coutln.cpp:1018:19 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-29 19:13:39 +02:00
Stefan Weil
411ffa90c6 Fix unsigned integer overflow
Runtime errors reported by sanitizer:

    src/textord/pithsync.cpp:75:31: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:31 in
    src/textord/pithsync.cpp:75:43: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:75:43 in
    src/textord/pithsync.cpp:125:29: runtime error: unsigned integer overflow: 2147483648 + 2147483648 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:29 in
    src/textord/pithsync.cpp:125:41: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/pithsync.cpp:125:41 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-29 19:13:39 +02:00
Stefan Weil
7c046c121f Fix out of bounds array access
Runtime error with enabled sanitizer:

    src/textord/colpartition.cpp:2243:66: runtime error: index -1 out of bounds for type 'tesseract::ColPartition *[6]'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior src/textord/colpartition.cpp:2243:66 in

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-29 16:10:37 +02:00
zdenop
4ef709554b
Update imagedata.cpp
stop PreScale if pixScale failed (fixes #3025)
2020-06-25 20:32:51 +02:00
amitdo
efae270dea Disabled legacy build: Disable more unused code 2020-06-24 22:02:52 +03:00
Stefan Weil
ca0a6c9d37
Merge pull request #3035 from stweil/overflow
Avoid buffer overflow (issue #444)
2020-06-24 18:46:47 +02:00
Stefan Weil
2cb5bc7690 Improve debug message in ColPartition::ComputeLimits
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-23 22:52:45 +02:00
Stefan Weil
cfabdfe0af Avoid buffer overflow (issue #444)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-22 22:19:58 +02:00
Stefan Weil
62b085cb8d ScrollView: Remove C API callcpp.{cpp,h}
Use C++ class ScrollView directly instead of using an intermediate C API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-22 09:14:26 +02:00
Stefan Weil
b2cc00d97f Replace cprintf by tprintf and remove cprintf
cprintf was an indirect way to call tprintf.
This indirection is not needed, so remove it and use tprintf directly.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-21 19:07:09 +02:00
Stefan Weil
ea1f597fc1 Fix insecure call of tprintf
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-21 19:03:03 +02:00
Stefan Weil
4a10bb68c7 Fix conversion of images with 16 bpp or 24 bpp to grey
The old code used pixConvertRGBToLuminance which only converts 32 bpp images.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-21 09:09:49 +02:00
Stefan Weil
6f6100ff9f Classify: Run sort only for more than one element
This fixes calls of qsort with a nullptr argument (reported by sanitizers).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-20 21:43:22 +02:00
Stefan Weil
d4cf77c92b Don't check for limits.h (now unused)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-20 10:39:13 +02:00
Matej Knopp
e900252c1a Fix CMake build with DISABLED_LEGACY_ENGINE 2020-06-17 19:42:49 +02:00
Stefan Weil
d6ca7a5298 ScrollView: Fix typo in comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-17 16:26:41 +02:00
Stefan Weil
380466e0d3 Allow inlining of function TruncateParam
It is only used locally in intproto.cpp, so defining it before the first
use and adding the static attribute allows the compiler to inline it.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:41 +02:00
Stefan Weil
93cfffeb87 Remove unused argument from function TruncateParam
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:41 +02:00
Stefan Weil
f08b16a5a0 Remove assertion which is triggered by tests
oss-fuzz issue 15149 triggers this assertion. See test case here:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=15149

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:26 +02:00
Stefan Weil
18d9983f69 StrokeWidth: Remove unused local variable (fixes compiler warning)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 20:17:09 +02:00
Stefan Weil
bc61038dd4 SPLIT: Make function bounding_box inline for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:21:36 +02:00
Stefan Weil
0e7701bc3c SEAM: More inline functions for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:20:14 +02:00
Stefan Weil
e45100ebf7 TBOX: Use inline constructor for better performance
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 17:17:55 +02:00
Stefan Weil
c110958ffa Fix undefined shift with negative value (oss-fuzz issue 14658)
This fixes a bug reported by OSS Fuzz:
https://oss-fuzz.com/issue/5697280134348800

The old code passed a negative value (-1) as argument to step_dir
when destindex was 0.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-16 13:25:32 +02:00
Stefan Weil
6ee3698958 Remove old unused code from imagedata.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 16:02:27 +02:00
Stefan Weil
d8500adcf4 Fix crash caused by missing thread synchronization (issues #757, #1168 and #2191)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 15:53:17 +02:00
Robin Watts
6fec69de1a Fix intsimdmatrixneon.cpp stack corruption.
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to to be the case, as we can get
called (sometimes) with num_out % 8 == 7.

Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
2020-05-27 13:40:17 +01:00
Stefan Weil
a06d0d8449 Add missing include statements for config_auto.h
They are required to get the macro DISABLED_LEGACY_ENGINE.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-22 16:34:28 +02:00
Stefan Weil
6732eb9eb5 Clean code for NEON support
Include it only for NEON and remove unneeded code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-21 07:03:37 +02:00
Robin Watts
f79e52a7cc NEON SIMD code.
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON coded added the time drops to 37
seconds.

I have not tested the configure/Makefile changes as I'm not using
them.
2020-05-20 18:54:42 +01:00
zdenop
b5d639dcc5
Merge pull request #2965 from robinwatts/pushback1
thanks.
2020-05-16 20:35:19 +02:00
zdenop
064b4403de
Merge pull request #2966 from robinwatts/pushback2 2020-05-16 20:06:31 +02:00
Robin Watts
3408c36eab Guard #include "config_auto.h" with HAVE_CONFIG_H.
Every other file already does this.
2020-05-15 19:29:03 +01:00
Robin Watts
43437a540b Fix OEM_DEFAULT in DISABLED_LEGACY_ENGINE builds.
If api->Init is called with OEM_DEFAULT in DISABLED_LEGACY_ENGINE
build modes, the engine mode is never set, resulting in no
words being found.
2020-05-15 14:56:41 +01:00
Julian Gilbey
e7e6999d3b Move comment about swap meaning for DeSerialize to correct function 2020-05-13 07:02:59 +01:00
Robin Watts
27d513462c Avoid using PACKAGE_VERSION in favour of TESSERACT_VERSION_STR.
This means the sources compile perfectly in the absence of
config_auto.h/HAVE_CONFIG_H as they were intended to do.

TESSERACT_VERSION_STR is set to be precisely PACKAGE_VERSION
by autoconf, so there are no actual changes in compiled code.
2020-05-12 21:45:12 +02:00
Stefan Weil
39f7fb4a1a Allow line images with larger width (depending on height)
Training with normalized line images higher than 36 px also results in larger widths.
The limit should therefore depend on the height used for the normalization.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:59:31 +02:00
Stefan Weil
34bdc8b74e Allow line images with larger width
Line images can be larger than the old limit, especially when training
is made with newspaper lines.

    Image too large to learn!! Size = 2641x36
    Image too large to learn!! Size = 2704x36
    Image too large to learn!! Size = 2751x36
    Image too large to learn!! Size = 3738x36

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:50:40 +02:00
Julian Gilbey
ca5735efcb Destroy box before potentially exiting function 2020-05-12 15:25:16 +01:00
Stefan Weil
d3a0768c32
Merge pull request #2975 from robinwatts/pushback5
Tweak architecture specific SIMD files for ease of compilation
2020-05-12 14:55:32 +02:00
Robin Watts
a9b44ee8c2 Tweak architecture specific SIMD files for ease of compilation.
This won't affect anything using the supplied build system. For
other projects that include tesseract within them, however, this
may make their life easier.

For example, I have an integration of Tesseract with Ghostscript,
in which tesseract is built as part of the Ghostscript build,
without using the tesseract build system.

The Ghostscript build system is makefile based, and has to work
on a range of make systems, including unix make, gnu make and
nmake. As such we have to avoid conditionals in the common
makefiles. It therefore becomes hard to build one set of files on
x86 systems, and another on (say) ARM systems.

Accordingly, this commit makes small tweaks to the architecture
specific files, so that they compile on EVERY platform; just they
only compile to anything useful on the appropriate platform.

Thus the makefiles can build all the files on all the systems, and
the preprocessor flags mean that the correct functions are actually
built.
2020-05-12 13:09:29 +01:00
Egor Pugin
0eaabc42c7
Update CMakeLists.txt 2020-05-12 11:49:15 +03:00
Egor Pugin
e720a26745
[cmake] Set inactivity timeout during icu download to 300 seconds.
Fixes #2972.
2020-05-09 18:55:45 +03:00
Robin Watts
80d4af6ecf Add a mechanism to avoid creating debug fonts.
If TESSERACT_DISABLE_DEBUG_FONTS is defined, tesseract doesn't
atetmpt to create any debug fonts. This not only saves memory,
but it (combined with the change to optionally use Pix as
internal storage for the ImageData) allows us to use an
embedded Leptonica library with no format handlers at all.
2020-05-05 00:22:23 +01:00
Robin Watts
6bcb941bcf Avoid tesseract writing Pix out/reading them back.
By default, when we ImageData::SetPix, we write the data out as a
PNG, just to read it back in to get a compressed buffer of data.
We then use this to generate a new Pix.

In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.

Not only that, but compressing/uncompressing is slow, and on minimal
builds of leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.

In such cases, it'd be far nicer just to keep the original Pix as
the internal data.

Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.

So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.



Given we don't do compression, and they were writing to memory,
this was all just more effort than we needed.

Also, if we're using GRAPHICS_DISABLED, we might as well just
pixCopy rather than pixClone as only the scaler uses this.
2020-05-04 21:01:22 +01:00
Amit D
acc4c8bff5
Merge pull request #2952 from jannick0/patch-1
[trie.h] pattern definition: fix documentation
2020-04-27 23:44:48 +03:00
Stefan Weil
1188e0a516 Remove old code which was used for Ocropus
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-04-27 16:33:34 +02:00
jannick0
e044163085
[trie.h] pattern definition: fix documentation
The fix makes the definition of `\n` consistent with the examples given below the definition.  Please note that I did not check this against how it is implemented in the code.
2020-04-19 13:47:42 +02:00
Stefan Weil
4a00b68c63 Fix lambda function for curl code errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 20:46:52 +01:00
Stefan Weil
9f5a3f6ac7 Fix uninitialized local variable in curl code
Compiler warning:

    src/api/baseapi.cpp:1151:27: warning:
      variable 'curlcode' is uninitialized when used here [-Wuninitialized]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 19:25:33 +01:00
zdenop
6e307074d8
Merge pull request #2894 from stweil/curl
Report errors from curl_easy functions
2020-03-18 14:14:07 +01:00
Stefan Weil
ef4f99a994 Run xgetbv instruction only on machines which support it
This fixes a regression for older Intel processors.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-08 17:32:10 +01:00
Stefan Weil
eff4dc0603 Use lambda expressions for reporting curl errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:44:42 +01:00
Stefan Weil
9972c91127 Report errors from curl_easy functions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:26:51 +01:00
Stefan Weil
57ff90687d simd: Check whether the OS supports FMA, AVX, ...
The previous check was only for the MS compiler, but not for gcc and clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 16:34:35 +01:00
zdenop
7c3ac569f9
Replace references to the old wiki by new URLs (#2877)
Replace references to the old wiki by new URLs
2020-02-03 14:59:18 +01:00
Stefan Weil
16553014e0 Replace references to the old wiki by new URLs
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 11:37:41 +01:00
Stefan Weil
20bcbc4058 Catch std::runtime_error exception when setting the locale in debug code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 07:58:43 +01:00
Robert Sachunsky
cdc8e44a20 ChoiceIterator: skip symbol without choices 2020-01-24 09:19:14 +01:00
jkang-eng
60248f59d4 Fix "tesseract.exe not flushing stdout/stderr" (Issue #2859) (#2865)
* Issue #2859 - Fix "tesseract.exe not flushing stdout/stderr"
2020-01-21 21:51:08 +01:00
Stefan Weil
6f2f310fdf Remove redundant method from class GenericVector
length() is not needed: it can be replaced by size().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-18 11:30:14 +01:00
Stefan Weil
3d1f82d0e2 tesstrain.sh: Fix command line flag --help
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-05 10:10:55 +01:00
Stefan Weil
cfd39dc2c7 pageres: Fix compiler warnings
clang warnings:

    src/ccstruct/pageres.cpp:903:20: warning:
      implicit conversion from 'int' to 'float' changes value from
      2147483647 to 2147483648 [-Wimplicit-int-float-conversion]
    src/ccstruct/pageres.cpp:904:23:
      warning: implicit conversion from 'int' to 'float' changes value from
      -2147483647 to -2147483648 [-Wimplicit-int-float-conversion]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-04 09:46:10 +01:00
Stefan Weil
d2a2292f32 mftraining: Fix compiler warning
powerpc64le-linux-gnu-g++ warning:

    src/training/mftraining.cpp:209:5: warning:
        ‘%04d’ directive output may be truncated writing between 4 and 10 bytes
        into a region of size 8 [-Wformat-truncation=]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-03 10:13:58 +01:00
zdenop
79f191fe20
Merge pull request #2826 from bertsky/clip-blockpolygon
make BlockPolygon usable
2019-12-19 09:14:25 +01:00
Robert Sachunsky
4b0c9f3373 BlockPolygon: clip to image rectangle 2019-12-18 13:29:43 +01:00
Robert Sachunsky
5751a408c9 BlockPolygon: unrotate from internal to image coordinates 2019-12-18 13:29:43 +01:00
amitdo
502ebe8ca9 Autotools: Pango, Cairo and ICU only required by training tools 2019-12-16 17:23:06 +02:00
Stefan Weil
fc84f84b5b Remove Emacs C modeline in comment line 1
Those files are C++, and the wrong modeline is not needed at all.
Remove also some empty descriptions and old history in the comments.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-05 13:57:50 +01:00
Stefan Weil
420cbac876 Clean public API for renderers
- Remove unused type definitions for TessTextRenderer, ... in capi.h
  (they were only used in capi.cpp which now no longer needs them)

- Fix typo in comment

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-03 12:23:58 +01:00
Stefan Weil
56df8e6e19 Fix some typos in comments (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-02 14:30:13 +01:00
Stefan Weil
a1a139cbd2 Replace AVX_OPT, ..., AVX macros by HAVE_AVX, ... and clean related code
- Replace AVX_OPT, AVX2_OPT, FMA_OPT, SSE41_OPT
- Replace AVX, AVX2, FMA, SSE4_1
- Write new HAVE_AVX, HAVE_AVX2, HAVE_FMA, HAVE_SSE4_1 into config_auto.h
- Put related conditionals in Makefile.am in one place

This makes the code clearer and fixes a log message in
IntSimdMatrixTest.AVX2.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 17:51:37 +01:00
Stefan Weil
074844ce46 Show libcurl version
`tesseract --version` now also shows the version of libcurl and related
libraries if it was build with libcurl.

The preprocessor macro HAVE_LIBCURL is now defined in config_auto.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 16:34:52 +01:00
Stefan Weil
cbd3a21cb2 automake: Flat build for src/viewer and src/wordrec
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00