1395 Commits

Author SHA1 Message Date
Stefan Weil
6ee3698958 Remove old unused code from imagedata.h
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 16:02:27 +02:00
Stefan Weil
d8500adcf4 Fix crash caused by missing thread synchronization (issues #757, #1168 and #2191)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-14 15:53:17 +02:00
Robin Watts
6fec69de1a Fix intsimdmatrixneon.cpp stack corruption.
The intsimdmatrix mechanism ensures that inputs would be
resized so that we'd only ever get "whole blocks" of data.
I'd assumed that that meant the same thing for scales/outputs
too, but this appears not to be the case, as we can get
called (sometimes) with num_out % 8 == 7.

Possibly we could benefit from resizing those matrices so
that special cases in this innermost loop are not actually
required, but unless and until that is done, let's fix the
inner loop.
2020-05-27 13:40:17 +01:00
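The case described in this message (num_out % 8 == 7) is the usual tail problem of a blocked loop. The following is a minimal scalar sketch of the idea, not the actual NEON code: the function name, its parameters and the scaling operation are assumptions made for illustration. Whole groups of 8 are handled in the main loop, and the 0..7 leftover values are handled one by one so that nothing is written past the end of the output buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar stand-in for the SIMD inner loop: scales num_out
// values, 8 at a time, then finishes the 0..7 leftover values one by one.
void ScaleOutputs(const int* in, std::size_t num_out, double scale,
                  double* out) {
  assert(num_out == 0 || (in != nullptr && out != nullptr));
  std::size_t i = 0;
  // Whole groups of 8: this is where a SIMD kernel would do its work.
  for (; i + 8 <= num_out; i += 8) {
    for (std::size_t k = 0; k < 8; ++k) {
      out[i + k] = in[i + k] * scale;
    }
  }
  // Tail: never write a full group when fewer than 8 values remain.
  for (; i < num_out; ++i) {
    out[i] = in[i] * scale;
  }
}
```

With num_out = 15 the main loop runs once and the tail loop handles the remaining 7 values.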
Stefan Weil
a06d0d8449 Add missing include statements for config_auto.h
They are required to get the macro DISABLED_LEGACY_ENGINE.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-22 16:34:28 +02:00
Stefan Weil
6732eb9eb5 Clean code for NEON support
Include it only for NEON and remove unneeded code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-21 07:03:37 +02:00
Robin Watts
f79e52a7cc NEON SIMD code.
In tests on my pi3b+, a release build of my ghostscript integration
takes 2 minutes 27 seconds to render a PDF and OCR it with the
vanilla sources. With this NEON code added the time drops to 37
seconds.

I have not tested the configure/Makefile changes as I'm not using
them.
2020-05-20 18:54:42 +01:00
zdenop
b5d639dcc5
Merge pull request #2965 from robinwatts/pushback1
thanks.
2020-05-16 20:35:19 +02:00
zdenop
064b4403de
Merge pull request #2966 from robinwatts/pushback2 2020-05-16 20:06:31 +02:00
Robin Watts
3408c36eab Guard #include "config_auto.h" with HAVE_CONFIG_H.
Every other file already does this.
2020-05-15 19:29:03 +01:00
Robin Watts
43437a540b Fix OEM_DEFAULT in DISABLED_LEGACY_ENGINE builds.
If api->Init is called with OEM_DEFAULT in DISABLED_LEGACY_ENGINE
build modes, the engine mode is never set, resulting in no
words being found.
2020-05-15 14:56:41 +01:00
Julian Gilbey
e7e6999d3b Move comment about swap meaning for DeSerialize to correct function 2020-05-13 07:02:59 +01:00
Robin Watts
27d513462c Avoid using PACKAGE_VERSION in favour of TESSERACT_VERSION_STR.
This means the sources compile perfectly in the absence of
config_auto.h/HAVE_CONFIG_H as they were intended to do.

TESSERACT_VERSION_STR is set to be precisely PACKAGE_VERSION
by autoconf, so there are no actual changes in compiled code.
2020-05-12 21:45:12 +02:00
Stefan Weil
39f7fb4a1a Allow line images with larger width (depending on height)
Training with normalized line images higher than 36 px also results in larger widths.
The limit should therefore depend on the height used for the normalization.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:59:31 +02:00
Stefan Weil
34bdc8b74e Allow line images with larger width
Line images can be larger than the old limit, especially when training
is made with newspaper lines.

    Image too large to learn!! Size = 2641x36
    Image too large to learn!! Size = 2704x36
    Image too large to learn!! Size = 2751x36
    Image too large to learn!! Size = 3738x36

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-05-12 16:50:40 +02:00
Julian Gilbey
ca5735efcb Destroy box before potentially exiting function 2020-05-12 15:25:16 +01:00
Stefan Weil
d3a0768c32
Merge pull request #2975 from robinwatts/pushback5
Tweak architecture specific SIMD files for ease of compilation
2020-05-12 14:55:32 +02:00
Robin Watts
a9b44ee8c2 Tweak architecture specific SIMD files for ease of compilation.
This won't affect anything using the supplied build system. For
other projects that include tesseract within them, however, this
may make their life easier.

For example, I have an integration of Tesseract with Ghostscript,
in which tesseract is built as part of the Ghostscript build,
without using the tesseract build system.

The Ghostscript build system is makefile based, and has to work
on a range of make systems, including unix make, gnu make and
nmake. As such we have to avoid conditionals in the common
makefiles. It therefore becomes hard to build one set of files on
x86 systems, and another on (say) ARM systems.

Accordingly, this commit makes small tweaks to the architecture
specific files, so that they compile on EVERY platform; just they
only compile to anything useful on the appropriate platform.

Thus the makefiles can build all the files on all the systems, and
the preprocessor flags mean that the correct functions are actually
built.
2020-05-12 13:09:29 +01:00
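One possible shape for such a file is sketched below. It is not taken from the Tesseract sources: the function names are hypothetical, and returning a stub on other platforms is an assumption (the real files may simply compile to nothing there). `__ARM_NEON` is the macro that ARM compilers predefine when NEON is available.

```cpp
#include <cassert>
#include <cstring>

#if defined(__ARM_NEON)
// Only here would the real NEON code (and <arm_neon.h>) be pulled in.
const char* SimdBackendName() { return "neon"; }
#else
// Every other platform still compiles this file, but gets a trivial stub.
const char* SimdBackendName() { return "none"; }
#endif

// Usage: callers can refer to the same symbol on every platform.
void CheckBackend() {
  const char* name = SimdBackendName();
  assert(std::strcmp(name, "neon") == 0 || std::strcmp(name, "none") == 0);
}
```

Because the preprocessor makes the choice, a makefile without conditionals can list this file unconditionally.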
Egor Pugin
0eaabc42c7
Update CMakeLists.txt 2020-05-12 11:49:15 +03:00
Egor Pugin
e720a26745
[cmake] Set inactivity timeout during icu download to 300 seconds.
Fixes #2972.
2020-05-09 18:55:45 +03:00
Robin Watts
80d4af6ecf Add a mechanism to avoid creating debug fonts.
If TESSERACT_DISABLE_DEBUG_FONTS is defined, tesseract doesn't
attempt to create any debug fonts. This not only saves memory,
but it (combined with the change to optionally use Pix as
internal storage for the ImageData) allows us to use an
embedded Leptonica library with no format handlers at all.
2020-05-05 00:22:23 +01:00
Robin Watts
6bcb941bcf Avoid tesseract writing Pix out/reading them back.
By default, when we ImageData::SetPix, we write the data out as a
PNG, just to read it back in to get a compressed buffer of data.
We then use this to generate a new Pix.

In builds of Tesseract on systems where we don't have temp files,
writing files out is problematic.

Not only that, but compressing/uncompressing is slow, and on minimal
builds of leptonica, where we've disabled the format writers to reduce
memory footprint, we get no compression anyway.

In such cases, it'd be far nicer just to keep the original Pix as
the internal data.

Also, when recovering the pixmap from the ImageData, if we know we're
only going to read from the data, we can avoid duplicating it and
just use the original. This is exactly the case when GRAPHICS_DISABLED
is set.

So, introduce a TESSERACT_IMAGEDATA_AS_PIX predefine that we can use
to cause the internal data to be a Pix rather than a compressed
buffer.



Given we don't do compression, and they were writing to memory,
this was all just more effort than we needed.

Also, if we're using GRAPHICS_DISABLED, we might as well just
pixCopy rather than pixClone as only the scaler uses this.
2020-05-04 21:01:22 +01:00
Amit D
acc4c8bff5
Merge pull request #2952 from jannick0/patch-1
[trie.h] pattern definition: fix documentation
2020-04-27 23:44:48 +03:00
Stefan Weil
1188e0a516 Remove old code which was used for Ocropus
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-04-27 16:33:34 +02:00
jannick0
e044163085
[trie.h] pattern definition: fix documentation
The fix makes the definition of `\n` consistent with the examples given below the definition.  Please note that I did not check this against how it is implemented in the code.
2020-04-19 13:47:42 +02:00
Stefan Weil
4a00b68c63 Fix lambda function for curl code errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 20:46:52 +01:00
Stefan Weil
9f5a3f6ac7 Fix uninitialized local variable in curl code
Compiler warning:

    src/api/baseapi.cpp:1151:27: warning:
      variable 'curlcode' is uninitialized when used here [-Wuninitialized]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-18 19:25:33 +01:00
zdenop
6e307074d8
Merge pull request #2894 from stweil/curl
Report errors from curl_easy functions
2020-03-18 14:14:07 +01:00
Stefan Weil
ef4f99a994 Run xgetbv instruction only on machines which support it
This fixes a regression for older Intel processors.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-03-08 17:32:10 +01:00
Stefan Weil
eff4dc0603 Use lambda expressions for reporting curl errors
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:44:42 +01:00
Stefan Weil
9972c91127 Report errors from curl_easy functions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 22:26:51 +01:00
Stefan Weil
57ff90687d simd: Check whether the OS supports FMA, AVX, ...
The previous check was only for the MS compiler, but not for gcc and clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 16:34:35 +01:00
zdenop
7c3ac569f9
Replace references to the old wiki by new URLs (#2877)
Replace references to the old wiki by new URLs
2020-02-03 14:59:18 +01:00
Stefan Weil
16553014e0 Replace references to the old wiki by new URLs
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 11:37:41 +01:00
Stefan Weil
20bcbc4058 Catch std::runtime_error exception when setting the locale in debug code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 07:58:43 +01:00
Robert Sachunsky
cdc8e44a20 ChoiceIterator: skip symbol without choices 2020-01-24 09:19:14 +01:00
jkang-eng
60248f59d4 Fix "tesseract.exe not flushing stdout/stderr" (Issue #2859) (#2865)
* Issue #2859 - Fix "tesseract.exe not flushing stdout/stderr"
2020-01-21 21:51:08 +01:00
Stefan Weil
6f2f310fdf Remove redundant method from class GenericVector
length() is not needed: it can be replaced by size().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-18 11:30:14 +01:00
Stefan Weil
3d1f82d0e2 tesstrain.sh: Fix command line flag --help
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-05 10:10:55 +01:00
Stefan Weil
cfd39dc2c7 pageres: Fix compiler warnings
clang warnings:

    src/ccstruct/pageres.cpp:903:20: warning:
      implicit conversion from 'int' to 'float' changes value from
      2147483647 to 2147483648 [-Wimplicit-int-float-conversion]
    src/ccstruct/pageres.cpp:904:23:
      warning: implicit conversion from 'int' to 'float' changes value from
      -2147483647 to -2147483648 [-Wimplicit-int-float-conversion]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-04 09:46:10 +01:00
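The warning can be reproduced in isolation. The sketch below assumes a 32-bit int and an IEEE 754 float, and the function names are hypothetical. It shows only why the value changes and how an explicit cast makes the rounding visible; it does not claim to be the change made in pageres.cpp.

```cpp
#include <cassert>
#include <limits>

// A float has a 24-bit significand, so 2147483647 cannot be stored exactly;
// the nearest representable value is 2147483648 (2^31).
float IntMaxAsFloat() {
  return static_cast<float>(std::numeric_limits<int>::max());
}

// Usage: the explicit cast documents that the rounding is intended.
void ExampleIntMax() {
  assert(IntMaxAsFloat() == 2147483648.0f);
}
```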
Stefan Weil
d2a2292f32 mftraining: Fix compiler warning
powerpc64le-linux-gnu-g++ warning:

    src/training/mftraining.cpp:209:5: warning:
        ‘%04d’ directive output may be truncated writing between 4 and 10 bytes
        into a region of size 8 [-Wformat-truncation=]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-03 10:13:58 +01:00
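A sketch of what the compiler is warning about, under the assumption that the code formats an integer into a small fixed buffer: `%04d` produces at least 4 characters but up to 10 or 11 for large values, so an 8-byte buffer can truncate. The helper name is hypothetical and this is not necessarily how mftraining.cpp was fixed. It relies on `std::snprintf` returning the length the full output would have needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Formats value with at least 4 digits. Returns true only if the whole
// result (plus the terminating NUL) fit into the buffer.
bool FormatIndex(int value, char* buf, std::size_t size) {
  assert(buf != nullptr && size > 0);
  const int needed = std::snprintf(buf, size, "%04d", value);
  return needed >= 0 && static_cast<std::size_t>(needed) < size;
}
```

Checking the return value, or simply using a larger buffer, removes the truncation risk.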
zdenop
79f191fe20
Merge pull request #2826 from bertsky/clip-blockpolygon
make BlockPolygon usable
2019-12-19 09:14:25 +01:00
Robert Sachunsky
4b0c9f3373 BlockPolygon: clip to image rectangle 2019-12-18 13:29:43 +01:00
Robert Sachunsky
5751a408c9 BlockPolygon: unrotate from internal to image coordinates 2019-12-18 13:29:43 +01:00
amitdo
502ebe8ca9 Autotools: Pango, Cairo and ICU only required by training tools 2019-12-16 17:23:06 +02:00
Stefan Weil
fc84f84b5b Remove Emacs C modeline in comment line 1
Those files are C++, and the wrong modeline is not needed at all.
Remove also some empty descriptions and old history in the comments.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-05 13:57:50 +01:00
Stefan Weil
420cbac876 Clean public API for renderers
- Remove unused type definitions for TessTextRenderer, ... in capi.h
  (they were only used in capi.cpp which now no longer needs them)

- Fix typo in comment

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-03 12:23:58 +01:00
Stefan Weil
56df8e6e19 Fix some typos in comments (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-02 14:30:13 +01:00
Stefan Weil
a1a139cbd2 Replace AVX_OPT, ..., AVX macros by HAVE_AVX, ... and clean related code
- Replace AVX_OPT, AVX2_OPT, FMA_OPT, SSE41_OPT
- Replace AVX, AVX2, FMA, SSE4_1
- Write new HAVE_AVX, HAVE_AVX2, HAVE_FMA, HAVE_SSE4_1 into config_auto.h
- Put related conditionals in Makefile.am in one place

This makes the code clearer and fixes a log message in
IntSimdMatrixTest.AVX2.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 17:51:37 +01:00
Stefan Weil
074844ce46 Show libcurl version
`tesseract --version` now also shows the version of libcurl and related
libraries if it was built with libcurl.

The preprocessor macro HAVE_LIBCURL is now defined in config_auto.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 16:34:52 +01:00
Stefan Weil
cbd3a21cb2 automake: Flat build for src/viewer and src/wordrec
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
0cd2bdbd2b automake: Flat build for src/textord
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
558462358a automake: Flat build for src/opencl
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
6eeb486b77 automake: Flat build for src/lstm
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
7ebcc77e3b automake: Flat build for src/dict
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
6181acf367 automake: Flat build for src/cutil
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
159160518b automake: Flat build for src/classify
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
9730c7e167 automake: Flat build for src/ccutil
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
b1d449315e automake: Flat build for src/ccstruct
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
9745a9d111 automake: Flat build for src/ccmain
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
a166efaad6 automake: Flat build for src/arch
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
cafb1bbfd7 automake: Flat build for src/api
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Martin Malmsten
9ed3887432 Added ComposedBlock level to Alto output 2019-11-17 21:06:12 +01:00
zdenop
2d6f38eebf fix using bilevel tiff in pdf output 2019-11-10 16:11:52 +01:00
Shreeshrii
99dfa8a680 Add separator and training_iteration to checkpoint name (#2752)
* Add separator and training_iteration to checkpoint name
* specify modelname_N.NN_NN_NN.checkpoint for intermediate checkpoint
2019-11-09 12:22:40 +01:00
Stefan Weil
ac46b286a4 Fix issue #2748
Commit 94d0f77f56 tried to fix issue #2741
but created a new problem.

This commit should fix both old and new issue.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-08 17:12:20 +01:00
Stefan Weil
0406f7706d Use BRT_UNKNOWN instead of BRT_NOISE to initialize ColPartition::blob_type_
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-08 07:40:06 +01:00
Stefan Weil
9b46a67efa Use "C" locale for printing parameters
This fixes a test for the Python wrapper `tesserocr` (python setup.py test).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-04 19:21:20 +01:00
Egor Pugin
ab836dbb31
Merge pull request #2743 from DavidMaung/master
Exposed the text2image option --ptsize to tesstrain.sh.
2019-11-02 17:09:51 +03:00
Stefan Weil
a306cd7370 Fail if no valid lstmf file was written (fix issue #2741)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-01 21:52:45 +01:00
Stefan Weil
94d0f77f56 Don't create an empty lstmf file
If Tesseract cannot find text in the input image, it should not write
an empty lstmf file. This problem was reported in issue #2741.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-01 21:43:26 +01:00
maungd@battelle.org
3d7afb69ea Exposed the text2image option --ptsize to tesstrain.sh. Text2image has the
option --ptsize which defaults to 12.  This option is not exposed through
tesstrain.sh; thus, you cannot use tesstrain.sh to explore training with
different font sizes.  I made a small modification to expose the --ptsize
option to tesstrain.sh.  It defaults to 12 if not specified.
2019-11-01 15:10:58 -04:00
Stefan Weil
b5498c70fa Use pre-calculated lookup tables for all C++ compilers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-31 20:26:01 +01:00
Egor Pugin
2bcc9d8093 Remove cppan build. 2019-10-30 21:37:38 +03:00
Stefan Weil
ca87b06d59 Fix build for Intel Compiler (issue #2736)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-30 10:09:44 +01:00
Stefan Weil
20a50e9bcb Fix typo in comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-30 10:06:31 +01:00
Egor Pugin
2a37f5dd62 Update includes to use <>. 2019-10-29 14:50:11 +03:00
Egor Pugin
9e324938ab Update includes to use <>. 2019-10-29 14:31:38 +03:00
Stefan Weil
629b05d978 Update README.md and other documentation for new include file structure
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-29 12:26:41 +01:00
amitdo
2f8884a64e Fix autotools build 2019-10-28 21:23:58 +02:00
amitdo
e1bae15547 Fix #include path of public headers 2019-10-28 19:10:30 +02:00
amitdo
dfede8ac01 Move all public headers to include/tesseract 2019-10-28 18:50:31 +02:00
zdenop
cede5b34e7
Add pageseg_apply_music_mask option to allow disabling the musi… (#2732)
Add pageseg_apply_music_mask option to allow disabling the music mask
2019-10-27 17:02:05 +01:00
zdenop
4a37cde0d9 fix inverting (Bilevel BW png) in pdf; fixes #2059 2019-10-27 14:15:12 +01:00
Nat
52bc15acd9 Add pageseg_apply_music_mask option to allow disabling the music mask 2019-10-24 11:44:05 -05:00
Egor Pugin
c727b556f0 Remove unneeded TESS_API from source file. 2019-10-23 13:26:46 +03:00
Egor Pugin
e2688c39e9 Remove TESS_CALL. 2019-10-23 13:21:59 +03:00
wshwang
4ee95a615a src/ccutil/bits16.h remove warnings (#2726) 2019-10-23 11:46:24 +02:00
wshwang
71e291bae5 Remove warning C4312 2019-10-22 13:06:44 +02:00
zdenop
fc629eae3b Subject: training: show error description for open/delete file 2019-10-21 16:31:57 +02:00
Stefan Weil
90bcff3732 Delete copy constructor and assignment operator for TessBaseAPI (fix issue #874)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-21 13:12:36 +02:00
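For readers unfamiliar with the technique, deleting the copy operations looks roughly like this. `ApiSketch` is a hypothetical stand-in, not the real TessBaseAPI declaration.

```cpp
#include <type_traits>

// Hypothetical class that owns resources and must not be copied.
class ApiSketch {
 public:
  ApiSketch() = default;
  // Copying is rejected at compile time instead of misbehaving at run time.
  ApiSketch(const ApiSketch&) = delete;
  ApiSketch& operator=(const ApiSketch&) = delete;
};

static_assert(!std::is_copy_constructible<ApiSketch>::value,
              "ApiSketch must not be copy constructible");
static_assert(!std::is_copy_assignable<ApiSketch>::value,
              "ApiSketch must not be copy assignable");
```

Code such as `ApiSketch b = a;` then fails to compile.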
Stefan Weil
a209a6b4b5 Copy resolution of source image (fix issue #1702)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-20 20:45:35 +02:00
zdenop
36dc2ccf75 fix memory leak at PangoFontInfo::CanRenderString 2019-10-20 16:43:04 +02:00
zdenop
1ec34378d9 test for synthesized font faces. 2019-10-19 15:05:28 +02:00
zdenop
cbbe45d94b cmake: add minimum required version for pango and icu based on autotools 2019-10-19 15:00:49 +02:00
zdenop
37c7a5dd82 text2image: show pango version 2019-10-19 14:52:06 +02:00
Stefan Weil
73a38b39d5 quadlsq: Fix warnings from LGTM
Fix two occurrences of this LGTM warning:

    Multiplication result may overflow 'double'
      before it is converted to 'long double'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 12:07:54 +02:00
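The pattern behind this warning is a product computed in double and only then widened. A minimal sketch of the usual remedy, with a hypothetical function name, is to widen one operand before multiplying. This only helps on platforms where long double is actually wider than double.

```cpp
#include <cassert>

// Widen one operand first so the multiplication itself is done in
// long double instead of double.
long double WideProduct(double a, double b) {
  return static_cast<long double>(a) * b;
}

// Usage: same result for ordinary values, more headroom for large ones.
void ExampleProduct() {
  assert(WideProduct(3.0, 4.0) == 12.0L);
}
```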
Stefan Weil
22cf0f854d Use "C" locale for PDF output
This fixes wrong output of integers with locale de_DE.UTF-8:

    -  /Width 2.481
    -  /Height 3.508
    +  /Width 2481
    +  /Height 3508

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 11:30:42 +02:00
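The effect can be sketched with a plain string stream. The function name is hypothetical and this is not the renderer code itself. Imbuing the stream with the classic "C" locale keeps a global locale such as de_DE.UTF-8 from inserting a thousands separator.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Formats a PDF dictionary entry independently of the global locale.
std::string FormatWidth(int width) {
  std::ostringstream out;
  out.imbue(std::locale::classic());  // "C" locale: no digit grouping
  out << "/Width " << width;
  return out.str();
}

// Usage: the integer is written without any separator.
void ExampleWidth() {
  assert(FormatWidth(2481) == "/Width 2481");
}
```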
Stefan Weil
914a8e40d6 Use "C" locale for ALTO output
This fixes wrong output of integers with locale de_DE.UTF-8:

    - <Page WIDTH="2.481" HEIGHT="3.508" PHYSICAL_IMG_NR="0" ID="page_0">
    + <Page WIDTH="2481" HEIGHT="3508" PHYSICAL_IMG_NR="0" ID="page_0">

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 11:18:27 +02:00
Stefan Weil
3e8cc203f4 Fix build error (undefined local variable)
The latest commit 96025c7923 was incomplete.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 11:05:31 +02:00
Stefan Weil
96025c7923 Remove unimplemented +/- for parameter files
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-17 17:14:43 +02:00
zdenop
a3cfd66f37 do not exit if not existing parameter is used. fixes #1334 2019-10-15 07:56:22 +02:00
zdenop
0150fc57cc Report when tesseract legacy engine not present. (fix issue #2053) 2019-10-14 22:55:47 +02:00
Stefan Weil
a1e3150bd7 Add new parameter "document_title" to set the title in OCR output files
The title can be set for hOCR and PDF output.

Currently it is also used for ALTO, so setting the title can be used
as a workaround for issue #2700.

The constant unknown_title_ is no longer needed and therefore removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-10 15:42:52 +02:00
Stefan Weil
7a7704bc94 Extend function BoxFileName to handle more common image names
The function derives the file name for the .box file from an image name.

For training from existing line images, it is useful to directly support
the image names which are commonly used.

While generated images for Tesseract training typically use the name
pattern NAME.tif, other ground truth sets use NAME.bin.png for binarized
or NAME.nrm.png for grayscale images.

BoxFileName is also now a local function as it is only used locally.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-05 15:59:56 +02:00
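The name mapping described in this message could look roughly like the sketch below. It is an illustration only: the function and helper names are hypothetical, and the real BoxFileName may handle other cases or differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s ends with suffix.
static bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical sketch: NAME.tif, NAME.bin.png and NAME.nrm.png all map
// to NAME.box.
std::string BoxFileNameSketch(const std::string& image) {
  static const char* const kDoubleSuffixes[] = {".bin.png", ".nrm.png"};
  for (const char* suffix : kDoubleSuffixes) {
    if (EndsWith(image, suffix)) {
      const std::size_t keep = image.size() - std::string(suffix).size();
      return image.substr(0, keep) + ".box";
    }
  }
  // Otherwise strip a single extension, ignoring dots in directory names.
  const std::size_t dot = image.find_last_of('.');
  const std::size_t slash = image.find_last_of('/');
  std::string base = image;
  if (dot != std::string::npos &&
      (slash == std::string::npos || dot > slash)) {
    base = image.substr(0, dot);
  }
  assert(!base.empty());
  return base + ".box";
}
```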
jm
fb150265ef speed optimisation - add the option to disable automatic inverting of line images 2019-10-04 10:09:52 +02:00
Stefan Weil
6b35d6ff6e Fix comment which referred to unused Tesseract parameter
This completes commit aa2ab68e29.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-03 09:23:25 +02:00
Johannes Künsebeck
aa2ab68e29 Removed unused parameters
The following parameters are not used anywhere anymore:

 * use_definite_ambigs_for_classifier
 * max_viterbi_list_size
 * word_to_debug_lengths
 * fragments_debug
 * tessedit_redo_xheight
 * debug_acceptable_wds
 * tessedit_matcher_log
 * tessedit_test_adaption_mode
 * docqual_excuse_outline_errs
 * crunch_pot_garbage
 * suspect_space_level
 * tessedit_consistent_reps
 * wordrec_display_all_words
 * wordrec_no_block
 * wordrec_worst_state
 * fragments_guide_chopper
 * segment_adjust_debug
 * classify_adapt_feature_thresh (classify_adapt_feature_threshold still exists)
 * classify_adapt_proto_thresh (classify_adapt_proto_threshold still exists)
 * classify_min_norm_scale_x
 * classify_max_norm_scale_x
 * classify_min_norm_scale_y
 * classify_max_norm_scale_y
 * il1_adaption_test
 * textord_blob_size_bigile
 * textord_blob_size_smallile
 * editor_debug_config_file
 * textord_tabfind_show_color_fit

The list was generated by a python script and each parameter occurrence checked
manually.
2019-10-03 09:18:29 +02:00
Stefan Weil
1e84a6f225 Don't create OCR result files when training data is created
The configuration file lstm.train causes Tesseract to generate
training data for training of an LSTM line recognizer.

In this mode, no other files with OCR results should be written.
Without this patch, Tesseract writes a small text file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-02 19:29:27 +02:00
Stefan Weil
286d8275c7 Add support for image or image list by URL
This allows OCR of images from the internet without downloading them first:

    tesseract http://IMAGE_URL OUTPUT ...

It uses libcurl.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-01 12:10:45 +02:00
Stefan Weil
47d70d7014 Modernize code for LIST (fix some -Wold-style-cast warnings)
- Use C++ type casts
- Remove unneeded type cast
- Simplify code for function pop
- Remove macro push_on (it was only used once)

This fixes lots of compiler warnings caused by old type casts.
2019-10-01 11:12:00 +02:00
Stefan Weil
672d67859f mfoutline: Modernize code
- Use C++ enums
- Use strongly typed C++11 enum for DIRECTION and optimize struct MFEDGEPT
- Use float constant for MF_SCALE_FACTOR
- Replace macros by inline functions
- Fix documentation comment

This fixes several warnings from clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-30 21:33:15 +02:00
Stefan Weil
7ec5f0ca02 intmatcher: Avoid conversion from double to float and vice versa
This fixes some clang warnings:

    src/classify/intmatcher.cpp:48:49: warning:
      implicit conversion loses floating-point precision:
      'double' to 'const float' [-Wimplicit-float-conversion]
    src/classify/intmatcher.cpp:405:34: warning:
      implicit conversion loses floating-point precision:
      'double' to 'float' [-Wimplicit-float-conversion]
    src/classify/intmatcher.cpp:405:64: warning:
      implicit conversion increases floating-point precision:
      'float' to 'double' [-Wdouble-promotion]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-30 18:05:26 +02:00
Stefan Weil
6d259ebe44 Remove unneeded compare statement (-Wtautological-unsigned-enum-zero-compare)
This fixes a clang warning:

    src/ccstruct/polyblk.cpp:412:12: warning: result of comparison of
      unsigned enum expression >= 0 is always true
      [-Wtautological-unsigned-enum-zero-compare]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-29 22:13:27 +02:00
Stefan Weil
49e351508c Re-add strngs.h to public API
It is still needed.
This partially reverts commit a730b5c4ff.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-28 10:34:48 +02:00
Stefan Weil
8ad86d6494 Add missing linker flags for TensorFlow
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-28 09:42:37 +02:00
zdenop
d6aa866430 ignore #pragma optimize for clang-cl 2019-09-27 21:19:37 +02:00
Stefan Weil
74d5ce82a6 Remove vecfuncs.cpp and vecfunc.h
Replace the macros which were declared in vecfuncs.h by member functions
and move a function which was only used in chop.cpp to that file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-25 21:20:03 +02:00
Stefan Weil
7bddad59d1 Optimize class ChoiceIterator
Re-order a class variable to avoid memory holes and
remove unused class variables.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-25 09:43:57 +02:00
Noah Metzger
ff4c1d204d Fixed minor bug with the Choice iterator when lstm_choice_mode is not active.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-09-24 15:38:28 +02:00
Stefan Weil
994ec697d8 Remove member functions STRING::string and StringParam::string
They were redundant because there exist member functions 'c_str' which do the same.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-23 08:33:08 +02:00
Egor Pugin
1fa7324cf7
Merge pull request #2668 from stweil/api
Remove STRING from the public Tesseract API
2019-09-23 01:02:26 +03:00
amitdo
0598879a00 Disable legacy build: Disable bitvec.h 2019-09-22 20:37:13 +02:00
Stefan Weil
a730b5c4ff Remove STRING from the public Tesseract API
Removing STRING from genericvector.h allows eliminating the proprietary
STRING data type from the public Tesseract API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-22 20:32:28 +02:00
Stefan Weil
8cb677d6a2 Replace STRING arguments for LoadDataFromFile and SaveDataToFile
This is a step to eliminate the proprietary STRING data type
from the public Tesseract API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-22 20:32:28 +02:00
amitdo
1e13d1d4d5 Disable legacy build: Disable more unneeded code 2019-09-22 20:55:24 +03:00
zdenop
39a63c2837
Merge pull request #2663 from bertsky/fix-lstm-user-patterns
fix langdata (user words/patterns) file suffixes for LSTMs:
2019-09-20 15:32:54 +02:00
Stefan Weil
0c7cc5a4dd Fix CID 1405673 part 2 (Uninitialized members)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-19 19:37:05 +02:00
Robert Schubert
5b976bfb55 fix langdata (user words/patterns) file suffixes for LSTMs:
- add another constructor for LSTMRecognizer
  which takes the language_data_path_prefix configured/selected
  at runtime and passes it to the internal CCUtil
- use this in Tesseract::init_tesseract_lang_data when LSTMs
  are available

(this was missing from 297d7d86ce)
2019-09-19 19:30:54 +02:00
amitdo
479a7b1ca0 Disabled legacy build: Disable more unneeded code 2019-09-19 19:00:13 +03:00
Stefan Weil
3b030b4aeb Fix CID 1405673 (Uninitialized members)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-17 22:04:08 +02:00
Stefan Weil
85e8529a2e Fix CID 1164624 (Uninitialized members)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-17 21:59:42 +02:00
Stefan Weil
b2999d8190 Fix comment for Textord::make_prop_words
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-16 15:03:45 +02:00
Stefan Weil
256701e2e0 Re-order initialisation in constructor of class ViterbiStateEntry
This fixes compiler warnings caused by
commit 091ce345f6:

    src/wordrec/lm_state.h:100:7: warning: field 'cost'
      will be initialized after field 'curr_b' [-Wreorder]
    src/wordrec/lm_state.h:104:7: warning: field 'top_choice_flags'
      will be initialized after field 'dawg_info' [-Wreorder]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-16 14:33:32 +02:00
Stefan Weil
081521fb9f Move initial values for class ColPartition from constructor to header file
This fixes compiler warnings caused by
commit 5b4565b80b:

    src/textord/colpartition.cpp:91:24: warning: field 'last_column_'
      will be initialized after field 'column_set_' [-Wreorder]
    src/textord/colpartition.cpp:93:37: warning: field 'inside_table_column_'
      will be initialized after field 'nearest_neighbor_above_' [-Wreorder]
    src/textord/colpartition.cpp:95:58: warning: field 'space_to_right_'
      will be initialized after field 'owns_blobs_' [-Wreorder]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-16 14:33:32 +02:00
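Why moving initial values into the header avoids -Wreorder can be shown with a small sketch. The class, its members and the chosen default values are hypothetical, not the real ColPartition. With default member initializers there is no constructor initializer list left whose order could disagree with the declaration order.

```cpp
#include <cassert>

// Hypothetical class using default member initializers in the header.
class PartitionSketch {
 public:
  PartitionSketch() = default;  // nothing to list, nothing to reorder
  int first_column() const { return first_column_; }
  int last_column() const { return last_column_; }
  bool owns_blobs() const { return owns_blobs_; }

 private:
  // Members are always initialized in declaration order.
  int first_column_ = -1;
  int last_column_ = -1;
  bool owns_blobs_ = true;
};

// Usage: every member has a defined value after default construction.
void ExamplePartition() {
  PartitionSketch p;
  assert(p.first_column() == -1 && p.last_column() == -1 && p.owns_blobs());
}
```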
Stefan Weil
8f66020821 Re-order initialisation in constructors of classes Dawg and DawgPosition
This fixes compiler warnings caused by
commit ecf0f2dee5:

    src/dict/dawg.h:202:9: warning: field 'type_' will be initialized
      after field 'lang_' [-Wreorder]
    src/dict/dawg.h:355:9: warning: field 'dawg_index' will be initialized
      after field 'dawg_ref' [-Wreorder]
    src/dict/dawg.h:356:9: warning: field 'punc_index' will be initialized
      after field 'punc_ref' [-Wreorder]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-16 14:31:32 +02:00
Stefan Weil
b466cead8e Add more initial values for class Classify from constructor to header file
This fixes compiler warnings caused by
commit 751fcd2b11:

    src/classify/classify.cpp:176:7: warning:
      field 'EnableLearning' will be initialized after
      field 'il1_adaption_test' [-Wreorder]
    src/classify/classify.cpp:187:7: warning:
      field 'dict_' will be initialized after
      field 'static_classifier_' [-Wreorder]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-16 14:31:32 +02:00
Stefan Weil
91b3248af3 Fix CID 1164666 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-15 22:01:25 +02:00
Stefan Weil
fc6899d898 Fix CID 1164664 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-15 21:52:51 +02:00
Stefan Weil
930e11996c Fix CID 1375402 (Uninitialized pointer field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-15 21:17:12 +02:00
Stefan Weil
408d6e8b72 simd: Check OSXSAVE bit before calling _xgetbv
Both checks are needed for AVX, AVX2 and FMA checks.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-15 19:35:37 +02:00
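The order of the two checks can be sketched as a pure function over the CPUID and XCR0 bits. This is an illustration with hypothetical names, not the code in the simd detection file. The bit positions are the documented ones: CPUID leaf 1 ECX bit 27 is OSXSAVE and bit 28 is AVX, and XCR0 bits 1 and 2 say that the OS saves XMM and YMM state. The XCR0 reader is passed in as a callable so that it is only invoked when OSXSAVE is set, since xgetbv faults otherwise.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kOsxsaveBit = 1u << 27;  // CPUID.1:ECX bit 27
constexpr std::uint32_t kAvxBit = 1u << 28;      // CPUID.1:ECX bit 28
constexpr std::uint64_t kXmmYmmState = 0x6;      // XCR0 bits 1 and 2

// Returns true if AVX is supported by both the CPU and the OS.
// read_xcr0 stands in for xgetbv and is called only after OSXSAVE is seen.
template <typename ReadXcr0>
bool AvxUsable(std::uint32_t cpuid1_ecx, ReadXcr0 read_xcr0) {
  if ((cpuid1_ecx & kOsxsaveBit) == 0) return false;  // xgetbv unavailable
  if ((cpuid1_ecx & kAvxBit) == 0) return false;      // CPU lacks AVX
  return (read_xcr0() & kXmmYmmState) == kXmmYmmState;
}

// Usage with a fake XCR0 reader (no real xgetbv is executed here).
void ExampleAvx() {
  auto fake_xcr0 = [] { return std::uint64_t{0x7}; };
  assert(!AvxUsable(kAvxBit, fake_xcr0));               // OSXSAVE missing
  assert(AvxUsable(kOsxsaveBit | kAvxBit, fake_xcr0));  // both bits set
}
```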
Stefan Weil
627faa6f9c Remove UnicharAmbigs for builds without legacy code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-15 19:11:30 +02:00
amitdo
2134cd7867 Disabled legacy engine build: Disable code related to ambigs. 2019-09-15 19:11:30 +02:00
Stefan Weil
0c960c3cc5 Fix 1164647 (Uninitialized members)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-15 14:25:48 +02:00
amitdo
994596842e 'Disabled legacy engine' build: don't include unused header 2019-09-15 12:35:36 +03:00
Egor Pugin
6a9584fbc2
Merge pull request #2650 from stweil/cid
Fix several issues reported by Coverity Scan
2019-09-14 21:18:37 +03:00
Stefan Weil
763f4781e8 Fix CID 1164662 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 19:22:56 +02:00
Stefan Weil
6fd58d2897 Fix CID 1164659 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 19:20:14 +02:00
Stefan Weil
c3500e8d95 Fix CID 1164657 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 19:11:02 +02:00
Stefan Weil
1d3ee3b2a7 Fix CID 1164649 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 17:37:00 +02:00
Stefan Weil
bd1083904d Fix CID 1164648 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 17:32:29 +02:00
Stefan Weil
80f367c6f4 Fix CID 1164644 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 17:26:49 +02:00
Stefan Weil
7caded8e6b Fix CID 1164643 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 17:24:26 +02:00
Stefan Weil
3127242bcd Fix CID 1164638 (Uninitialized scalar field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 17:18:15 +02:00
Stefan Weil
06de3075e0 Fix CID 1164636 (Uninitialized pointer field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 17:13:06 +02:00
Stefan Weil
052f9ca0bc Fix CID 1164634, CID 1164635 (Uninitialized pointer field)
Remove the unused dummy member variables.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 17:12:39 +02:00
Stefan Weil
97dda3d535 Fix CID 1386099 (Uninitialized pointer field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
46f21a4182 Fix CID 1164633 (Uninitialized pointer field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
9ea579bf1b Fix CID 1164628 ff (Uninitialized pointer field) and optimize class ParamContent
Only one of bIt, dIt, iIt and sIt is used, so put all four in a union.
This fixes CID 1164628, CID 1164629, CID 1164630 and CID 1164631.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
74b552fc31 Remove unused FeatureEnabled from FEATURE_DEFS_STRUCT
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
9f709404f9 Fix CID 1164622 (Uninitialized pointer field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
5b1f0dbd4b Fix CID 1164620 (Uninitialized pointer field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
951f442303 Fix CID 1386105 (Logically dead code)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
64fc205e78 Fix CID 1402767 (Invalid type in argument to printf format specifier)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
f62a895f74 Remove unused italic, bold in class BLOCK_RES and class WORD_RES
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 11:53:58 +02:00
Stefan Weil
ceb8af889e Fix CID 1340276 (Uninitialized scalar field) for class BLOB_CHOICE
xgap_before_ and xgap_after_ are never used, so remove them.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-13 22:15:47 +02:00
Stefan Weil
5fdd32bea8 Fix CID 1366450 (Uninitialized scalar field) for class RecodeBeamSearch
secondary_beam_size_ is set but never used, so remove it.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-13 22:09:03 +02:00
Stefan Weil
737173a84d Fix CID 1375401 (Uninitialized scalar field) for class Dawg
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-13 22:03:10 +02:00
Stefan Weil
edba74d64f Fix CID 1400760 (Uninitialized scalar field) for class BLOCK
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-13 21:58:05 +02:00
Stefan Weil
8ff321e41a Fix two issues reported by Coverity Scan and modernize class WERD_RES
Report from Coverity Scan:

    CID 1405560 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
    2. uninit_member: Non-static class member end is not initialized in
    this constructor nor in any functions that it calls.

    CID 1405561 [...]

Modernize and optimize class WERD_RES. This not only fixes the issues
but also reduces the size and eliminates the functions InitNonPointers
and InitPointers.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-13 21:51:36 +02:00
Stefan Weil
ecf0f2dee5 Optimize classes Trie, Dawg and DawgPosition
Reduce size from 368 to 352 bytes for Trie, 72 to 64 bytes for Dawg
and 40 to 24 bytes for DawgPosition by avoiding holes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-13 08:15:01 +02:00
Stefan Weil
efd8dea587 Optimize classes CLIST_ITERATOR, ELIST_ITERATOR, ELIST2_ITERATOR
Reduce size from 56 to 48 bytes by avoiding holes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-12 22:03:03 +02:00
Stefan Weil
751fcd2b11 Optimize class Classify
Reduce size from 138016 to 13000 bytes by avoiding holes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-12 21:46:55 +02:00
Stefan Weil
0ad08a99b0 Optimize class TFile
Reduce size from 24 to 16 bytes by avoiding holes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-12 20:17:05 +02:00
Stefan Weil
5b4565b80b Optimize class ColPartition
Reduce size from 248 to 224 bytes by avoiding holes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-12 20:04:27 +02:00
Stefan Weil
5a12273650 Optimize struct LMConsistencyInfo
Reduce size from 104 to 96 bytes by avoiding holes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-12 20:04:27 +02:00
Stefan Weil
091ce345f6 Optimize class ViterbiStateEntry
Reduce size from 232 to 216 bytes by avoiding holes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-12 20:04:27 +02:00
Stefan Weil
913cbe6eae Modernize and optimize BLOBNBOX and remove BLOBNBOX::ConstructionInit
The class no longer uses bit fields. Re-ordering the member variables
avoids holes and reduces the size of BLOBNBOX from 168 to 152 bytes.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-12 09:07:48 +02:00
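Note on "avoiding holes": the sketch below shows the general effect of re-ordering members, using made-up structs rather than the real BLOBNBOX layout. Every member name here is hypothetical. The compiler inserts padding so that each member starts at a multiple of its alignment, so grouping members from largest to smallest usually removes that padding.

```cpp
#include <cstdint>

// Hypothetical members, not the real BLOBNBOX layout.
struct WithHoles {
  bool flag_a;         // 1 byte, followed by padding up to the next double
  double area;         // typically 8 bytes with 8-byte alignment
  bool flag_b;         // 1 byte, followed by padding up to the next int32
  std::int32_t count;  // 4 bytes
  bool flag_c;         // 1 byte, followed by tail padding
};

// Same members, ordered from largest to smallest alignment.
struct Reordered {
  double area;
  std::int32_t count;
  bool flag_a;
  bool flag_b;
  bool flag_c;
};

static_assert(sizeof(Reordered) <= sizeof(WithHoles),
              "re-ordering should not grow the struct");
```

On a typical 64-bit ABI the first struct occupies 32 bytes and the second 16. Exact sizes depend on the platform.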
Stefan Weil
a922745d9a tfnetwork: Fix info text
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-11 19:10:25 +02:00
Stefan Weil
5fa09f184f RecodedCharIDHash: Fix runtime errors detected by UndefinedBehaviorSanitizer
Fix this runtime error in recodebeam_test and unicharcompress_test:

    src/ccutil/unicharcompress.h:84:27: runtime error:
      left shift of 267 by 28 places cannot be represented in type 'int'

code has up to kMaxCodeLen (9) values, so the highest possible value for
i is 8, and the shift value can reach 7 * 8 = 56.

That requires a uint64_t data type.
size_t would fit for 64 bit hosts, but be too small for 32 bit hosts.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-10 15:56:32 +02:00
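Note on the shift width: the sketch below illustrates the problem described in this commit message and is not the code from unicharcompress.h. The function name `HashCodes` and the XOR folding are assumptions. The constant 9 for `kMaxCodeLen` and the 7-bit step per position come from the message itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kMaxCodeLen = 9;  // value quoted in the commit message

// Hypothetical stand-in for the hash: fold each code into the result,
// shifted left by 7 bits per position.
inline std::uint64_t HashCodes(const std::vector<int>& codes) {
  assert(codes.size() <= static_cast<std::size_t>(kMaxCodeLen));
  std::uint64_t result = 0;
  for (std::size_t i = 0; i < codes.size(); ++i) {
    // Widen before shifting: in a 32-bit int, 267 << 28 is not representable.
    result ^= static_cast<std::uint64_t>(codes[i]) << (7 * i);
  }
  return result;
}
```

The cast has to be applied to the operand, not to the result of the shift. Otherwise the shift is still evaluated in `int` and the undefined behaviour remains.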
Stefan Weil
4a2d5a2e8d OSResults: Fix runtime errors detected by UndefinedBehaviorSanitizer
Fix this runtime error in osd_test and textlineprojection_test:

    src/ccmain/osdetect.cpp:109:14: runtime error: division by zero

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-10 15:56:32 +02:00
Stefan Weil
5c6fade555 BitVector: Fix runtime errors detected by UndefinedBehaviorSanitizer
Fix these runtime errors in mastertrainer_test:

    src/ccutil/bitvector.cpp:119:18: runtime error:
      null pointer passed as argument 2, which is declared to never be null
    src/ccutil/bitvector.cpp:124:10: runtime error:
      null pointer passed as argument 1, which is declared to never be null

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-10 15:56:32 +02:00
zdenop
98c7aaa343
Lstm choice ril (#2635)
Lstm choice ril
2019-09-06 19:12:00 +02:00
Stefan Weil
9f32032517 ccutil: Remove old comments
There is no CLIST2 in the current code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-05 17:52:42 +02:00
Stefan Weil
b6933a1082 Use type bool for boolean values in class BLOBNBOX
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-03 19:56:59 +02:00
Noah Metzger
c350077b96 Made the lstm_choice mode compatible with the hocr_char_boxes mode
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-09-02 11:09:54 +02:00
Noah Metzger
e8b9c10d07 Clean up lstm_choice_mode and cut it down to 2 modes instead of 4
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-09-02 11:09:53 +02:00
Stefan Weil
fdf4067296 Fix warnings from LGTM
This fixes three LGTM warnings:

    Multiplication result may overflow 'float' before it is converted to 'double'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-30 22:04:24 +02:00
Stefan Weil
dc90741f1b Fix crash when function lookup tables are accessed with NaN
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-30 13:42:09 +02:00
Stefan Weil
7968f50fe6 capi: Add missing PSM_RAW_LINE to TessPageSegMode
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-25 09:08:09 +02:00
zdenop
0ded672067 fix typo 2019-08-18 18:47:32 +02:00
Stefan Weil
00cff79f7f simd: Check whether the OS supports FMA, AVX, ...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-16 22:51:17 +02:00
Stefan Weil
43b2e9513b lstmtrainer: Fix diagnostic message
Signed character values must be converted to unsigned integers for %x.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-15 14:31:32 +02:00
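Note on %x and signed characters: the sketch below is a generic illustration, not the lstmtrainer code. The helper name `HexByte` is hypothetical. Where `char` is signed, a byte such as 0xe4 has a negative value and is sign-extended when promoted to `int`, so it has to pass through `unsigned char` before it is printed with %x.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: format one byte as two lowercase hex digits.
inline std::string HexByte(char c) {
  char buffer[3];
  // Convert through unsigned char first so a negative char such as '\xe4'
  // is not sign-extended to 0xffffffe4.
  const unsigned value = static_cast<unsigned char>(c);
  const int written = std::snprintf(buffer, sizeof(buffer), "%02x", value);
  assert(written == 2);
  (void)written;  // silence unused-variable warnings when NDEBUG is defined
  return std::string(buffer);
}
```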
Stefan Weil
100d8cd29b lstmtester: Add missing space in log messages
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-14 14:12:47 +02:00
Stefan Weil
a86251c62b classify/Makefile: Fix inconsistent style
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-13 21:35:59 +02:00
Egor Pugin
423a188513 Export some classify vars. 2019-08-13 20:12:21 +03:00
Stefan Weil
46e2a0f106 Remove more code for builds with disabled legacy engine
Now the Tesseract library no longer includes unused code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-13 17:49:10 +02:00
Egor Pugin
73f713519c
Merge pull request #2614 from stweil/training
Move source files which are used for training only to src/training
2019-08-12 19:35:50 +03:00
Stefan Weil
e84cb24def Move source files which are used for training only to src/training
They are moved from src/classify and src/lstm to src/training.

This reduces the size of the Tesseract library.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-12 17:08:08 +02:00
Stefan Weil
ba17bc8204 OpenCL: Add static attribute for kernel_src
It is only used in openclwrapper.cpp.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-12 15:13:45 +02:00
Stefan Weil
970622fbd1 Remove unused functions create_edges_window, draw_raw_edge
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-12 15:04:10 +02:00
Stefan Weil
23e605911f Remove unused function truncate_path and related files
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-12 14:48:56 +02:00
Stefan Weil
bce585286d Remove global array kPolyBlockNames from Tesseract library
It is only used in unittest/layout_test.cc after moving a test from
baseapi_test.cc to that file, so it can be made local.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-12 14:33:55 +02:00
Stefan Weil
beec85e023 Remove UNICHARSET::load_from_inmemory_file and related code
The method was only used in unittest where it can be replaced by
UNICHARSET::load_from_file which also simplifies the code.

This allows removing the class InMemoryFilePointer and fixes a TODO.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-12 13:07:15 +02:00
Stefan Weil
315dd9df3f cmake: Don't link pthread on Windows
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-07 15:24:00 +02:00
Stefan Weil
b8079d8ce1 universalambigs: Add hack to fix builds with Microsoft compiler
The MS compiler only accepts string constants up to 65535 characters,
so shorten the string for that compiler to fix the compilation.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-06 15:46:07 +02:00
Zdenko Podobný
c5a50b93ce Move fileio.cpp and fileio.h to training (this fixes the Android build) 2019-08-04 21:26:39 +02:00
Stefan Weil
6acab45837 universalambigs: Replace octal characters by UTF-8 string
This improves readability and reduces the file size.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-04 19:21:59 +02:00
Stefan Weil
8127b4dd27 Clean ambigs.h
* Remove unused kUnigramAmbigsBufferSize and kAmbigNgramSeparator
* Move some declarations to ambigs.cpp

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-04 19:21:59 +02:00
Stefan Weil
23ef93ac4d cmake: Add missing pthread library
It is needed for C++ threads since commit 85068be405.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-26 07:45:51 +02:00
Stefan Weil
e6ca7f3ec6 hocrrenderer: Add missing escaping of special characters in HTML output
This converts special characters like '<' or '>' to the
correct HTML entities.

Also optimize the code a little bit.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-19 13:53:36 +02:00
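Note on the escaping: the sketch below shows the kind of conversion this commit describes. The function name `EscapeHtml` and the exact set of escaped characters are assumptions for illustration; the hocrrenderer code may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replace characters that are special in HTML with
// their entities and copy everything else unchanged.
inline std::string EscapeHtml(const char* text) {
  assert(text != nullptr);
  std::string out;
  for (const char* p = text; *p != '\0'; ++p) {
    switch (*p) {
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '&':  out += "&amp;";  break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += *p;       break;
    }
  }
  return out;
}
```

Working through the input one character at a time means an `&` produced by an earlier replacement is never escaped a second time.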
Stefan Weil
2679cae5d8 Simplify code by using ClipToRange
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-19 13:37:39 +02:00
Stefan Weil
4b2927ae41 LSTMRecognizer: Add non const get functions
This allows removing several const casts.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-18 11:26:51 +02:00
Stefan Weil
4cb3f34c09 Improve formatting of hOCR output with character boxes
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-18 11:07:18 +02:00
Stefan Weil
9195a904a7 Use auto data type for results of std::ftell
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-18 10:56:17 +02:00
Stefan Weil
4132194c49 Remove unused filesize_ from class InputBuffer
This also simplifies the constructors.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-18 10:48:27 +02:00
Stefan Weil
a2b13b49ff Simplify shell code (fixes warning from Codacy)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-17 21:33:24 +02:00
Stefan Weil
d4e0ab3014 Use long instead of off_t for result from ftell
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-17 21:14:42 +02:00
Stefan Weil
467f8f4140 Fix training script for macOS (issue #2578)
Bash on macOS does not support "|&":

    tesstrain_utils.sh: line 80: syntax error near unexpected token `&'

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-17 17:18:44 +02:00
Stefan Weil
f92181561c Fix some compiler warnings (unused local variables)
gcc warnings:

    src/classify/protos.cpp:85:7: warning: unused variable ‘i’ [-Wunused-variable]
    src/classify/protos.cpp:86:7: warning: unused variable ‘Bit’ [-Wunused-variable]
    src/classify/protos.cpp:89:14: warning: unused variable ‘Config’ [-Wunused-variable]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-17 07:47:28 +02:00
Stefan Weil
a419f2d78b Modernize BIT_VECTOR a little bit
This removes one more user of Emalloc / Efree.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-16 22:09:08 +02:00
zdenop
c8374cc528
Merge pull request #2576 from noahmetzger/LSTMChoiceRIL
Implemented improved character bounding box algorithm
2019-07-16 12:25:17 +02:00
zdenop
f4925077e8
Merge pull request #2574 from stweil/fix
classify: Use fixed size bit vector
2019-07-16 12:22:48 +02:00
zdenop
cb5c78be7d
Merge pull request #2572 from adaptech-cz/wordBoundsOn2ndPass
Give word's bounds to callback also during second pass
2019-07-16 12:19:31 +02:00
Noah Metzger
3a5e508934 Implemented improved bounding box algorithm
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-07-16 11:38:50 +02:00
Stefan Weil
028fff6edd classify: Use fixed size bit vector
The vector was already limited to MAX_NUM_PROTOS (512) entries or 64 bytes
in the old code. Now it uses that size right from the start which avoids
reallocating it later when entries are added.

The old code which reallocated the vector to expand it was buggy because
the realloc function can return a different pointer, but the code still
used the original pointer to reset the new bits.

Function ExpandBitVector is now unused and therefore removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-16 10:18:11 +02:00
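Note on the realloc bug: the sketch below shows the safe pattern for growing a block. It is a generic illustration, not the removed ExpandBitVector; the helper name `GrowWords` is hypothetical. After `realloc` succeeds, only the returned pointer may be used, because the block may have moved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: grow a malloc'ed word array and zero the new tail.
// Returns the possibly moved block, or nullptr on failure, in which case
// the old block is still valid and still owned by the caller.
inline std::uint32_t* GrowWords(std::uint32_t* words, std::size_t old_count,
                                std::size_t new_count) {
  assert(new_count >= old_count);
  void* grown = std::realloc(words, new_count * sizeof(std::uint32_t));
  if (grown == nullptr) return nullptr;
  auto* result = static_cast<std::uint32_t*>(grown);
  // Clear through the pointer realloc returned, never through `words`.
  std::memset(result + old_count, 0,
              (new_count - old_count) * sizeof(std::uint32_t));
  return result;
}
```

The commit takes the simpler route of allocating the full 64 bytes up front, which removes the need for any such helper.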
Robert Pösel
f99fcd7691 Give word's bounds to callback also during second pass 2019-07-16 09:11:06 +02:00
Stefan Weil
5bbb7f59a6 Remove structures.*
It only provided the functions new_cell, free_cell which could be replaced by new, delete.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-16 07:03:52 +02:00
Stefan Weil
3621272051 Remove cutil_class.*
It is no longer needed since commit 4523ce9f7d.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-16 07:03:52 +02:00
Stefan Weil
ea462b2c03 Remove unused functions reverse16, reverse32
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 21:50:46 +02:00
Stefan Weil
c8cb925813 Replace non-portable sleep with std::this_thread::sleep_for
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 16:00:07 +02:00
Stefan Weil
fcfdb7e56f Remove unused include statements
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 14:48:31 +02:00
Stefan Weil
ba0c55adc5 svutil: Remove SVSync::StartThread and SVSync::ExitThread
Both are unused now.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 14:30:51 +02:00
Stefan Weil
85068be405 lstmtester: Replace SVSync::StartThread by std::thread
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 14:30:51 +02:00
Stefan Weil
43a281893f scrollview: Replace SVSync::StartThread by std::thread
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 14:30:51 +02:00
Stefan Weil
a6d723bf10 Replace SVSync::StartThread by std::thread and use std::this_thread::yield
Using yield instead of a sleep makes running imagedata_test much faster.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 14:30:51 +02:00
Stefan Weil
13bb4623b1 Use std::lock_guard to protect a code block
This is simpler than using lock() / unlock() explicitly.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 12:01:28 +02:00
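Note on std::lock_guard: the sketch below shows why a scoped guard is simpler than paired lock() / unlock() calls. The class `UseCounter` is hypothetical and unrelated to the Tesseract sources. The guard locks in its constructor and unlocks in its destructor, so every exit path from the block releases the mutex.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical class for illustration only.
class UseCounter {
 public:
  void Acquire() {
    std::lock_guard<std::mutex> guard(mutex_);
    ++count_;
  }  // guard is destroyed here and unlocks, also on early return or exception

  void Release() {
    std::lock_guard<std::mutex> guard(mutex_);
    assert(count_ > 0);
    --count_;
  }

  int count() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return count_;
  }

 private:
  mutable std::mutex mutex_;  // mutable so that count() can lock it
  int count_ = 0;
};
```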
Stefan Weil
93427391c1 Replace SVAutoLock by std::lock_guard
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 12:01:28 +02:00
Stefan Weil
c0b8ee3b82 Replace CCUtilMutex by std::mutex
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 12:01:28 +02:00
Stefan Weil
36026e3c35 Replace SVMutex by std::mutex
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 12:01:28 +02:00
zdenop
56d4fdce00
Merge pull request #2554 from noahmetzger/LSTMChoiceRIL
Improved lstm_choice_mode
2019-07-15 11:51:52 +02:00
Noah Metzger
2dd5d0d60a Fixed a bug when first decode iteration stays empty and added some comments.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-07-15 10:05:22 +02:00
Stefan Weil
61eab60fe3 arch: Reduce number of include files for dot product functions
dotproductavx.h and dotproductsse.h declared only two functions.
Move those declarations to dotproduct.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-12 23:18:00 +02:00
Stefan Weil
2d5b166876 Add dot product implementation for Intel FMA (double = tessdata_best)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-12 23:18:00 +02:00
Stefan Weil
9259ed8f26 Optimize tprintf implementation
It no longer uses a local buffer, so it needs less memory
and no mutex.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-10 20:59:07 +02:00
Stefan Weil
2aebd10fb7 FPRow: Add missing initialisation for scalar (CID 1402754)
Also modernize the code a little bit.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-10 17:15:55 +02:00
Stefan Weil
bdc7abf518 Fix format strings for size_t arguments (CID 1402762, 1402767)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-10 16:57:19 +02:00
Noah Metzger
11a4cd298b Added parameters for the LSTM CTC Choice mode
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-07-10 16:34:41 +02:00
Noah Metzger
f2d685a90f Added CTC-based Symbolchoices.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-07-10 16:34:41 +02:00
Stefan Weil
ee04347347 Fix format string for 64 bit integer (CID 1402986)
Commit c1264c189e was not the right fix.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-10 16:20:50 +02:00
Stefan Weil
890b810a9e tfnetwork: Add missing return statement (CID 1402992)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-10 08:20:52 +02:00