Commit Graph

1215 Commits

Author SHA1 Message Date
Stefan Weil
57ff90687d simd: Check whether the OS supports FMA, AVX, ...
The previous check was only for the MS compiler, but not for gcc and clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-23 16:34:35 +01:00
zdenop
7c3ac569f9
Replace references to the old wiki by new URLs (#2877)
Replace references to the old wiki by new URLs
2020-02-03 14:59:18 +01:00
Stefan Weil
16553014e0 Replace references to the old wiki by new URLs
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 11:37:41 +01:00
Stefan Weil
20bcbc4058 Catch std::runtime_error exception when setting the locale in debug code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 07:58:43 +01:00
Robert Sachunsky
cdc8e44a20 ChoiceIterator: skip symbol without choices 2020-01-24 09:19:14 +01:00
jkang-eng
60248f59d4 Fix "tesseract.exe not flushing stdout/stderr" (Issue #2859) (#2865)
* Issue #2859 - Fix "tesseract.exe not flushing stdout/stderr"
2020-01-21 21:51:08 +01:00
Stefan Weil
6f2f310fdf Remove redundant method from class GenericVector
length() is not needed: it can be replaced by size().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-18 11:30:14 +01:00
Stefan Weil
3d1f82d0e2 tesstrain.sh: Fix command line flag --help
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-05 10:10:55 +01:00
Stefan Weil
cfd39dc2c7 pageres: Fix compiler warnings
clang warnings:

    src/ccstruct/pageres.cpp:903:20: warning:
      implicit conversion from 'int' to 'float' changes value from
      2147483647 to 2147483648 [-Wimplicit-int-float-conversion]
    src/ccstruct/pageres.cpp:904:23:
      warning: implicit conversion from 'int' to 'float' changes value from
      -2147483647 to -2147483648 [-Wimplicit-int-float-conversion]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-04 09:46:10 +01:00
Stefan Weil
d2a2292f32 mftraining: Fix compiler warning
powerpc64le-linux-gnu-g++ warning:

    src/training/mftraining.cpp:209:5: warning:
        ‘%04d’ directive output may be truncated writing between 4 and 10 bytes
        into a region of size 8 [-Wformat-truncation=]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-03 10:13:58 +01:00
zdenop
79f191fe20
Merge pull request #2826 from bertsky/clip-blockpolygon
make BlockPolygon usable
2019-12-19 09:14:25 +01:00
Robert Sachunsky
4b0c9f3373 BlockPolygon: clip to image rectangle 2019-12-18 13:29:43 +01:00
Robert Sachunsky
5751a408c9 BlockPolygon: unrotate from internal to image coordinates 2019-12-18 13:29:43 +01:00
amitdo
502ebe8ca9 Autotools: Pango, Cairo and ICU only required by training tools 2019-12-16 17:23:06 +02:00
Stefan Weil
fc84f84b5b Remove Emacs C modeline in comment line 1
Those files are C++, and the wrong modeline is not needed at all.
Remove also some empty descriptions and old history in the comments.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-05 13:57:50 +01:00
Stefan Weil
420cbac876 Clean public API for renderers
- Remove unused type definitions for TessTextRenderer, ... in capi.h
  (they were only used in capi.cpp which now no longer needs them)

- Fix typo in comment

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-03 12:23:58 +01:00
Stefan Weil
56df8e6e19 Fix some typos in comments (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-12-02 14:30:13 +01:00
Stefan Weil
a1a139cbd2 Replace AVX_OPT, ..., AVX macros by HAVE_AVX, ... and clean related code
- Replace AVX_OPT, AVX2_OPT, FMA_OPT, SSE41_OPT
- Replace AVX, AVX2, FMA, SSE4_1
- Write new HAVE_AVX, HAVE_AVX2, HAVE_FMA, HAVE_SSE4_1 into config_auto.h
- Put related conditionals in Makefile.am in one place

This makes the code clearer and fixes a log message in
IntSimdMatrixTest.AVX2.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 17:51:37 +01:00
Stefan Weil
074844ce46 Show libcurl version
`tesseract --version` now also shows the version of libcurl and related
libraries if it was build with libcurl.

The preprocessor macro HAVE_LIBCURL is now defined in config_auto.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-28 16:34:52 +01:00
Stefan Weil
cbd3a21cb2 automake: Flat build for src/viewer and src/wordrec
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
0cd2bdbd2b automake: Flat build for src/textord
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
558462358a automake: Flat build for src/opencl
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
6eeb486b77 automake: Flat build for src/lstm
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
7ebcc77e3b automake: Flat build for src/dict
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
6181acf367 automake: Flat build for src/cutil
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
159160518b automake: Flat build for src/classify
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
9730c7e167 automake: Flat build for src/ccutil
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
b1d449315e automake: Flat build for src/ccstruct
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
9745a9d111 automake: Flat build for src/ccmain
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
a166efaad6 automake: Flat build for src/arch
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
cafb1bbfd7 automake: Flat build for src/api
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Martin Malmsten
9ed3887432 Added ComposedBlock level to Alto output 2019-11-17 21:06:12 +01:00
zdenop
2d6f38eebf fix using bilevel tiff in pdf output 2019-11-10 16:11:52 +01:00
Shreeshrii
99dfa8a680 Add separator and training_iteration to checkpoint name (#2752)
* Add separator and training_iteration to checkpoint name
* specify modelname_N.NN_NN_NN.checkpoint for intermediate checkpoint
2019-11-09 12:22:40 +01:00
Stefan Weil
ac46b286a4 Fix issue #2748
Commit 94d0f77f56 tried to fix issue #2741
but created a new problem.

This commit should fix both old and new issue.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-08 17:12:20 +01:00
Stefan Weil
0406f7706d Use BRT_UNKNOWN instead of BRT_NOISE to initialize ColPartition::blob_type_
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-08 07:40:06 +01:00
Stefan Weil
9b46a67efa Use "C" locale for printing parameters
This fixes a test for the Python wrapper `tesserocr` (python setup.py test).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-04 19:21:20 +01:00
Egor Pugin
ab836dbb31
Merge pull request #2743 from DavidMaung/master
Exposed the text2image option --ptsize to tesstrain.sh.
2019-11-02 17:09:51 +03:00
Stefan Weil
a306cd7370 Fail if no valid lstmf file was written (fix issue #2741)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-01 21:52:45 +01:00
Stefan Weil
94d0f77f56 Don't create an empty lstmf file
If Tesseract cannot find text in the input image, it should not write
an empty lstmf file. This problem was reported in issue #2741.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-01 21:43:26 +01:00
maungd@battelle.org
3d7afb69ea Exposed the text2image option --ptsize to tesstrain.sh. Text2image has the
option --ptsize which defaults to 12.  This option is not exposed through
tesstrain.sh; thus, you cannot use tesstrain.sh to explore training with
different font sizes.  I made a small modification to expose the --ptsize
option to tesstrain.sh.  It defaults to 12 if not specified.
2019-11-01 15:10:58 -04:00
Stefan Weil
b5498c70fa Use pre-calculated lookup tables for all C++ compilers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-31 20:26:01 +01:00
Egor Pugin
2bcc9d8093 Remove cppan build. 2019-10-30 21:37:38 +03:00
Stefan Weil
ca87b06d59 Fix build for Intel Compiler (issue #2736)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-30 10:09:44 +01:00
Stefan Weil
20a50e9bcb Fix typo in comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-30 10:06:31 +01:00
Egor Pugin
2a37f5dd62 Update includes to use <>. 2019-10-29 14:50:11 +03:00
Egor Pugin
9e324938ab Update includes to use <>. 2019-10-29 14:31:38 +03:00
Stefan Weil
629b05d978 Update README.md and other documentation for new include file structure
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-29 12:26:41 +01:00
amitdo
2f8884a64e Fix autotools build 2019-10-28 21:23:58 +02:00
amitdo
e1bae15547 Fix #include path of public headers 2019-10-28 19:10:30 +02:00
amitdo
dfede8ac01 Move all public headers to include/tesseract 2019-10-28 18:50:31 +02:00
zdenop
cede5b34e7
Add pageseg_apply_music_mask option to allow disabling the musi… (#2732)
Add pageseg_apply_music_mask option to allow disabling the music mask
2019-10-27 17:02:05 +01:00
zdenop
4a37cde0d9 fix inverting (Bilevel BW png) in pdf; fixes # 2059 2019-10-27 14:15:12 +01:00
Nat
52bc15acd9 Add pageseg_apply_music_mask option to allow disabling the music mask 2019-10-24 11:44:05 -05:00
Egor Pugin
c727b556f0 Remove unneeded TESS_API from source file. 2019-10-23 13:26:46 +03:00
Egor Pugin
e2688c39e9 Remove TESS_CALL. 2019-10-23 13:21:59 +03:00
wshwang
4ee95a615a src/ccutil/bits16.h remove warnings (#2726) 2019-10-23 11:46:24 +02:00
wshwang
71e291bae5 Remove warning C4312 2019-10-22 13:06:44 +02:00
zdenop
fc629eae3b Subject: training: show error description for open/delete file 2019-10-21 16:31:57 +02:00
Stefan Weil
90bcff3732 Delete copy constructor and assignment operator for TessBaseAPI (fix issue #874)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-21 13:12:36 +02:00
Stefan Weil
a209a6b4b5 Copy resolution of source image (fix issue #1702)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-20 20:45:35 +02:00
zdenop
36dc2ccf75 fix memory leak at PangoFontInfo::CanRenderString 2019-10-20 16:43:04 +02:00
zdenop
1ec34378d9 test for synthesized font faces. 2019-10-19 15:05:28 +02:00
zdenop
cbbe45d94b cmake: add minimum required version for pango and icu based on autotools 2019-10-19 15:00:49 +02:00
zdenop
37c7a5dd82 text2image: show pango version 2019-10-19 14:52:06 +02:00
Stefan Weil
73a38b39d5 quadlsq: Fix warnings from LGTM
Fix two occurrences of this LGTM warning:

    Multiplication result may overflow 'double'
      before it is converted to 'long double'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 12:07:54 +02:00
Stefan Weil
22cf0f854d Use "C" locale for PDF output
This fixes wrong output of integers with locale de_DE.UTF-8:

    -  /Width 2.481
    -  /Height 3.508
    +  /Width 2481
    +  /Height 3508

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 11:30:42 +02:00
Stefan Weil
914a8e40d6 Use "C" locale for ALTO output
This fixes wrong output of integers with locale de_DE.UTF-8:

    - <Page WIDTH="2.481" HEIGHT="3.508" PHYSICAL_IMG_NR="0" ID="page_0">
    + <Page WIDTH="2481" HEIGHT="3508" PHYSICAL_IMG_NR="0" ID="page_0">

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 11:18:27 +02:00
Stefan Weil
3e8cc203f4 Fix build error (undefined local variable)
The latest commit 96025c7923 was incomplete.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-18 11:05:31 +02:00
Stefan Weil
96025c7923 Remove unimplemented +/- for parameter files
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-17 17:14:43 +02:00
zdenop
a3cfd66f37 do not exit if not existing parameter is used. fixes #1334 2019-10-15 07:56:22 +02:00
zdenop
0150fc57cc Report when tesseract legacy engine not present. (fix issue #2053) 2019-10-14 22:55:47 +02:00
Stefan Weil
a1e3150bd7 Add new parameter "document_title" to set the title in OCR output files
The title can be set for hOCR and PDF output.

Currently it is also used for ALTO, so setting the title can be used
as a workaround for issue #2700.

The constant unknown_title_ is no longer needed and therefore removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-10 15:42:52 +02:00
Stefan Weil
7a7704bc94 Extend function BoxFileName to handle more common image names
The function derives the file name for the .box file from an image name.

For training from existing line images, it is useful to directly support
the image names which are commonly used.

While generated images for Tesseract training typically use the name
pattern NAME.tif, other ground truth sets use NAME.bin.png for binarized
or NAME.nrm.png for grayscale images.

BoxFileName is also now a local function as it is only used locally.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-05 15:59:56 +02:00
jm
fb150265ef speed optimisation - add the option to disable automatic inverting of line images 2019-10-04 10:09:52 +02:00
Stefan Weil
6b35d6ff6e Fix comment which referred to unused Tesseract parameter
This completes commit aa2ab68e29.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-03 09:23:25 +02:00
Johannes Künsebeck
aa2ab68e29 Removed unused parameters
The following parameters are not used anywhere anymore:

 * use_definite_ambigs_for_classifier
 * max_viterbi_list_size
 * word_to_debug_lengths
 * fragments_debug
 * tessedit_redo_xheight
 * debug_acceptable_wds
 * tessedit_matcher_log
 * tessedit_test_adaption_mode
 * docqual_excuse_outline_errs
 * crunch_pot_garbage
 * suspect_space_level
 * tessedit_consistent_reps
 * wordrec_display_all_words
 * wordrec_no_block
 * wordrec_worst_state
 * fragments_guide_chopper
 * segment_adjust_debug
 * classify_adapt_feature_thresh (classify_adapt_feature_threshold still exists)
 * classify_adapt_proto_thresh (classify_adapt_proto_threshold still exists)
 * classify_min_norm_scale_x
 * classify_max_norm_scale_x
 * classify_min_norm_scale_y
 * classify_max_norm_scale_y
 * il1_adaption_test
 * textord_blob_size_bigile
 * textord_blob_size_smallile
 * editor_debug_config_file
 * textord_tabfind_show_color_fit

The list was generated by a python script and each parameter occurence checked
manually.
2019-10-03 09:18:29 +02:00
Stefan Weil
1e84a6f225 Don't create OCR result files when training data is created
The configuration file lstm.train causes Tesseract to generate
training data for training of an LSTM line recognizer.

In this mode, no other files with OCR results should be written.
Without this patch, Tesseract writes a small text file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-02 19:29:27 +02:00
Stefan Weil
286d8275c7 Add support for image or image list by URL
This allows OCR of images from the internet without downloading them first:

    tesseract http://IMAGE_URL OUTPUT ...

It uses libcurl.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-10-01 12:10:45 +02:00
Stefan Weil
47d70d7014 Modernize code for LIST (fix some -Wold-style-cast warnings)
- Use C++ type casts
- Remove unneeded type cast
- Simplify code for function pop
- Remove macro push_on (it was only used once)

This fixes lots of compiler warnings caused by old type casts.
2019-10-01 11:12:00 +02:00
Stefan Weil
672d67859f mfoutline: Modernize code
- Use C++ enums
- Use strongly typed C++11 enum for DIRECTION and optimize struct MFEDGEPT
- Use float constant for MF_SCALE_FACTOR
- Replace macros by inline functions
- Fix documentation comment

This fixes several warnings from clang.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-30 21:33:15 +02:00
Stefan Weil
7ec5f0ca02 intmatcher: Avoid conversion from double to float and vice versa
This fixes some clang warnings:

    src/classify/intmatcher.cpp:48:49: warning:
      implicit conversion loses floating-point precision:
      'double' to 'const float' [-Wimplicit-float-conversion]
    src/classify/intmatcher.cpp:405:34: warning:
      implicit conversion loses floating-point precision:
      'double' to 'float' [-Wimplicit-float-conversion]
    src/classify/intmatcher.cpp:405:64: warning:
      implicit conversion increases floating-point precision:
      'float' to 'double' [-Wdouble-promotion]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-30 18:05:26 +02:00
Stefan Weil
6d259ebe44 Remove unneeded compare statement (-Wtautological-unsigned-enum-zero-compare)
This fixes a clang warning:

    src/ccstruct/polyblk.cpp:412:12: warning: result of comparison of
      unsigned enum expression >= 0 is always true
      [-Wtautological-unsigned-enum-zero-compare]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-29 22:13:27 +02:00
Stefan Weil
49e351508c Re-add strngs.h to public API
It is still needed.
This partially reverts commit a730b5c4ff.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-28 10:34:48 +02:00
Stefan Weil
8ad86d6494 Add missing linker flags for TensorFlow
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-28 09:42:37 +02:00
zdenop
d6aa866430 ignore #pragma optimize for clang-cl 2019-09-27 21:19:37 +02:00
Stefan Weil
74d5ce82a6 Remove vecfuncs.cpp and vecfunc.h
Replace the macros which were declared in vecfuncs.h by member functions
and move a function which was only used in chop.cpp to that file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-25 21:20:03 +02:00
Stefan Weil
7bddad59d1 Optimize class ChoiceIterator
Re-order a class variable to avoid memory holes and
remove unused class variables.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-25 09:43:57 +02:00
Noah Metzger
ff4c1d204d Fixed minor bug with the Choice iterator when lstm_choice_mode is not active.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2019-09-24 15:38:28 +02:00
Stefan Weil
994ec697d8 Remove member functions STRING::string and StringParam::string
They were redundant because there exist member functions 'c_str' which do the same.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-23 08:33:08 +02:00
Egor Pugin
1fa7324cf7
Merge pull request #2668 from stweil/api
Remove STRING from the public Tesseract API
2019-09-23 01:02:26 +03:00
amitdo
0598879a00 Disable legacy build: Disable bitvec.h 2019-09-22 20:37:13 +02:00
Stefan Weil
a730b5c4ff Remove STRING from the public Tesseract API
Removing STRING from genericvector.h allows eliminating the proprietary
STRING data type from the public Tesseract API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-22 20:32:28 +02:00
Stefan Weil
8cb677d6a2 Replace STRING arguments for LoadDataFromFile and SaveDataToFile
This is a step to eliminate the proprietary STRING data type
from the public Tesseract API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-22 20:32:28 +02:00
amitdo
1e13d1d4d5 Disable legacy build: Disable more unneeded code 2019-09-22 20:55:24 +03:00
zdenop
39a63c2837
Merge pull request #2663 from bertsky/fix-lstm-user-patterns
fix langdata (user words/patterns) file suffixes for LSTMs:
2019-09-20 15:32:54 +02:00
Stefan Weil
0c7cc5a4dd Fix CID 1405673 part 2 (Uninitialized members)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-19 19:37:05 +02:00
Robert Schubert
5b976bfb55 fix langdata (user words/patterns) file suffixes for LSTMs:
- add another constructor for LSTMRecognizer
  which takes the language_data_path_prefix configured/selected
  at runtime and passes it to the internal CCUtil
- use this in Tesseract::init_tesseract_lang_data when LSTMs
  are available

(this was missing from 297d7d86ce)
2019-09-19 19:30:54 +02:00
amitdo
479a7b1ca0 Disabled legacy build: Disable more unneeded code 2019-09-19 19:00:13 +03:00
Stefan Weil
3b030b4aeb Fix CID 1405673 (Uninitialized members)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-17 22:04:08 +02:00