Commit Graph

246 Commits

Author SHA1 Message Date
zdenop
fad96db497
Merge pull request #2494 from Shreeshrii/master
Allow saving of box/tiff pairs during legacy tesseract training
2019-06-14 20:44:49 +02:00
Shree
6fa4587949 Allow saving of box/tiff pairs during base tesseract training 2019-06-14 09:35:39 +00:00
Shree
45cdf741ae Allow saving of box/tiff pairs during base tesseract training 2019-06-14 09:32:41 +00:00
Shree
832c6edb97 Allow saving of box/tiff pairs during base tesseract training 2019-06-14 09:25:54 +00:00
James R. Barlow
a9890afd12 Fix text2image compilation on C++17 compilers
C++17 drops support for `std::random_shuffle`, breaking C++17 compilers
that run to compile text2image.cpp. std::shuffle is valid on C++11
through C++17, so use std::shuffle instead.

Due to the use `std::random_shuffle`, `text2image --render_ngrams`
would not give consistent results for different compilers or platforms.
With the current change, the same random number generator is used for
all platforms and initialized to the same seed, so training output
should be consistent.
2019-06-13 16:07:20 -07:00
Stefan Weil
9a4bd041c8 Fix build for unittests
Commit 29f2cff203 was the wrong fix
for the compiler warnings because it broke the unittest build.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 21:36:34 +02:00
Stefan Weil
ac999b2409 Remove unused macros
This fixes compiler warnings from clang++ like these ones:

    src/ccutil/params.cpp:34:9: warning: macro is not used [-Wunused-macros]
    src/cutil/oldlist.cpp:67:9: warning: macro is not used [-Wunused-macros]
    src/cutil/oldlist.cpp:68:9: warning: macro is not used [-Wunused-macros]
    src/cutil/oldlist.cpp:78:9: warning: macro is not used [-Wunused-macros]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 20:27:21 +02:00
Stefan Weil
29f2cff203 training: Add missing static attributes
That fixes several warnings from clang++ like the following one:

    src/training/combine_lang_model.cpp:36:1: warning: no previous extern declaration for non-static variable 'FLAGS_lang_is_rtl' [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 11:33:52 +02:00
Stefan Weil
a139d553a7 training: Move declarations from cpp files to h file
That fixes several warnings from clang++ like the following one:

    src/training/commontraining.cpp:95:1: warning: no previous extern declaration for non-static variable 'FLAGS_D' [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 08:53:09 +02:00
Stefan Weil
4bec4a69a0 Add missing static attributes
This fixes lots of compiler warnings like these ones:

    src/api/baseapi.cpp:113:13: warning: no previous extern declaration for non-static variable 'kInputFile' [-Wmissing-variable-declarations]
    src/api/baseapi.cpp:117:13: warning: no previous extern declaration for non-static variable 'kOldVarsFile' [-Wmissing-variable-declarations]
    src/api/baseapi.cpp:97:10: warning: no previous extern declaration for non-static variable 'stream_filelist' [-Wmissing-variable-declarations]
    src/ccmain/equationdetect.cpp:46:10: warning: no previous extern declaration for non-static variable 'equationdetect_save_bi_image' [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 08:53:09 +02:00
Stefan Weil
2441e4d8ac Implement check for Tensorflow header file
This looks for one of the header files which are included by Tesseract.
It currently uses a hard coded path which works for Debian / Ubuntu.

Simplify also the rules for linking Tensorflow.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-24 16:52:14 +02:00
Stefan Weil
9cdf041448 Remove "third_party/" in comments and update path names
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-24 14:12:52 +02:00
Stefan Weil
4382ab1a34 Support build with Tensorflow
It expects include files in /usr/include/tensorflow.

* Add configure option --with-tensorflow (disabled by default)
* Fix data type tensorflow::int64
* Remove "third_party/" in include statements
* Add dummy implementations for Backward and DebugWeights in TFNetwork
* Add files generated with protoc from tfnetwork.proto
  (so the Tensorflow sources are not needed for the build)
* Update Makefiles

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-24 14:11:31 +02:00
Bharat123rox
945ccac85a Fix syntax error 2019-05-22 10:10:12 +05:30
Bharat123rox
7f31a0634d Some LGTM fixes and potential bugfixes 2019-05-21 23:24:50 +05:30
Stefan Weil
d2ca81e794 Remove local definition of M_PI
It is defined for all platforms when math.h or cmath is included
after defining the macro _USE_MATH_DEFINES.

Define _USE_MATH_DEFINES before any include statement to make sure
that M_PI gets defined. It is not necessary to define it conditionally
only for Windows.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-20 21:18:52 +02:00
Stefan Weil
6b1e709b19 Fix Doxygen comments for void functions
Void functions should not use @return. It causes compiler warnings
like this one:

    src/classify/intproto.cpp:326:5: warning:
      '@return' command used in a comment that is attached to a function
      returning void [-Wdocumentation]

Some non-void functions also were documented with @return none.
Fix those comments, too.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-14 21:57:17 +02:00
Stefan Weil
4fbc0a257b commandlineflags: Replace strtod by std::stringstream
Using std::stringstream allows conversion of double to string
independent of the current locale setting.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-02 07:33:46 +02:00
Stefan Weil
78a957b989 Remove spaces a line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-04-13 18:54:42 +02:00
Stefan Weil
72c874140e Modernize code by replacing C type casts
This was done using clang-tidy.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-04-07 09:04:51 +02:00
zdenop
ab09b09da6
Merge pull request #2294 from bertsky/lstm-with-char-whitelist
trying to add tessedit_char_whitelist etc. again:
2019-04-06 14:41:30 +02:00
Robert Schubert
25a42ea42f fixed failure report for tesstrain commands:
- with `set -e` in effect, looking at stdout
  to detect failure is too late
2019-04-06 08:13:03 +02:00
Robert Schubert
d5584e793e fixed failure report for tesstrain commands:
- with `set -e` in effect, it does not make sense
  to query `$?` indirectly
2019-04-06 08:13:03 +02:00
Stefan Weil
802f42e821 Remove BOOL8, TRUE, FALSE from host.h
Remove unneeded include statements for host.h, add required ones and
update the comments for the remaining include statements.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 18:27:20 +02:00
Stefan Weil
cbb5e729a1 classify: Use bool and replace TRUE, FALSE
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 17:53:50 +02:00
Stefan Weil
664811a869 Replace BOOL8, TRUE, FALSE by bool, true, false
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 17:28:28 +02:00
zdenop
5f06402755 python: optimize imports, reformat code 2019-03-31 16:53:39 +02:00
zdenop
2e9fd69c9e use 'import pathlib'; fix "TypeError: argument of type 'WindowsPath' is not iterable" 2019-03-31 16:53:33 +02:00
zdenop
a0527b41bd fix LGTM reports for python 2019-03-31 16:53:25 +02:00
Shreeshrii
ea36e94e58 fix Could not parse bool from flag (#2359) 2019-03-29 14:50:21 +01:00
Stefan Weil
f877640bc9
Merge pull request #2319 from bertsky/tesstrain-parallel-wait-retval
tesstrain: check failure of subjobs
2019-03-25 16:10:09 +01:00
Stefan Weil
d8d2f6f48a Fix broken shell scripts for training
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-25 15:32:43 +01:00
Shreeshrii
8749f3553e
LINEDATA=false 2019-03-23 19:16:49 +05:30
Shree
bcb7cf9846 sort arguments, use true/false instead of 1/0 2019-03-23 12:28:53 +00:00
Shree
c2db272134 Modify distort_image for Boolean 2019-03-22 17:02:46 +00:00
Shree
9b915d5efb add --distort_image 2019-03-22 05:39:38 +00:00
Shree
f7ffde99d5 add --distort_image 2019-03-22 05:34:00 +00:00
zdenop
ac7ea4322a
Merge pull request #2335 from Shreeshrii/master
Changes to tesstrain.py - max_workers=8, distort_image=false
2019-03-17 15:27:34 +01:00
zdenop
26877ba703 check min. python version; os.uname is not available on windows 2019-03-17 15:25:48 +01:00
Shreeshrii
f8e8521606
Update tesstrain_utils.py 2019-03-17 15:32:35 +05:30
Shree
6fa8e1bb15 Set max_workers=8 2019-03-17 09:58:11 +00:00
Shree
e21499e81e Set default value for distort_image 2019-03-17 09:54:16 +00:00
Stefan Weil
ee2f9bf7bf Remove old comments in file headers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 10:55:00 +01:00
Shree
d47b0d588a Use LATIN_FONTS for kmr 2019-03-15 15:47:56 +00:00
Shree
3eee1d217a Add kmr and kur_ara, remove kur from training scripts 2019-03-15 15:37:49 +00:00
Shree
b2ebf0195f Add kmr and kur_ara, remove kur from training scripts 2019-03-15 14:39:39 +00:00
Shree
37befdf6c4 Add option for --distort_image 2019-03-15 13:32:36 +00:00
Robert Schubert
14346e56b0 tesstrain: catch+handle SIGINT (to stop waiting on subjobs) 2019-03-15 00:03:16 +01:00
Robert Schubert
6cbad17e30 tesstrain: check all subjobs' retval 2019-03-14 14:38:51 +01:00
Robert Schubert
5316bcbb94 tesstrain: check failure of subjobs 2019-03-14 11:42:01 +01:00
Stefan Weil
896698a4f5 Fix runtime error (left shift of negative value)
Runtime error:

    src/training/util.h:37:28: runtime error: left shift of negative value -17

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 06:56:54 +01:00
Stefan Weil
5202208a8c Remove globals.h
It only included other files which are already included where needed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-11 19:01:23 +01:00
zdenop
f80085c0bf
Merge pull request #2289 from Armyke/master
Added an additional optional --tmp_dir parameter to specify the tempo…
2019-03-06 15:03:14 +01:00
Stefan Weil
1c7e00611b Add initial support for traineddata files in standard archive formats
This requires libarchive-dev.

Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:

    $ unzip -l /usr/local/share/tessdata/zip.traineddata
    Archive:  /usr/local/share/tessdata/zip.traineddata
      Length      Date    Time    Name
    ---------  ---------- -----   ----
           55  2019-03-05 15:27   bagit.txt
            0  2019-03-05 15:25   data/
         1557  2019-03-05 15:28   manifest-sha256.txt
      1082890  2019-03-05 15:25   data/eng.word-dawg
      1487588  2019-03-05 15:25   data/eng.lstm
         7477  2019-03-05 15:25   data/eng.unicharset
        63346  2019-03-05 15:25   data/eng.shapetable
       976552  2019-03-05 15:25   data/eng.inttemp
        13408  2019-03-05 15:25   data/eng.normproto
         4322  2019-03-05 15:25   data/eng.punc-dawg
         4738  2019-03-05 15:25   data/eng.lstm-number-dawg
         1410  2019-03-05 15:25   data/eng.freq-dawg
          844  2019-03-05 15:25   data/eng.pffmtable
         6360  2019-03-05 15:25   data/eng.lstm-unicharset
         1012  2019-03-05 15:25   data/eng.lstm-recoder
         1047  2019-03-05 15:25   data/eng.unicharambigs
         4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
     16109842  2019-03-05 15:25   data/eng.bigram-dawg
           80  2019-03-05 15:25   data/eng.version
         6426  2019-03-05 15:25   data/eng.number-dawg
      3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
    ---------                     -------
     23468070                     21 files

`combine_tessdata -d` and `combine_tessdata -u` also work.

The traineddata files in the new format can be generated with
standard tools like zip or tar.

More work is needed for other training tools and big endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-05 17:18:48 +01:00
Armyke
56b04d4ea7 Added the same --tmp_dir flag to tesstrain_utils.sh 2019-03-04 14:05:25 +00:00
Armyke
25fa392887 Added an additional optional --tmp_dir parameter to specify the temporary directory in which tesstrain.py creates the training temporary files. The main reason is due to the slow R/W on HDD, if anyone wants to speed up this process can use as tmp_dir a directory on an SSDrive 2019-03-04 13:26:53 +00:00
Stefan Weil
295996ed05 commandlineflags: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:21:04 +01:00
Stefan Weil
fb0f1bcf66 BoxChar: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:04:54 +01:00
Stefan Weil
0e1a1fc3cf Validator: Fix compiler warnings (signed/unsigned)
This also fixes a regression in validate_grapheme_test introduced
by commit 32e9d7c8f5.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 13:05:03 +01:00
zdenop
2ba8e0061a
Merge branch 'master' into mya 2019-03-01 18:37:24 +01:00
zdenop
646b043d2c
use space instead of tab 2019-03-01 14:36:09 +01:00
Shree
5ee1deaea2 correct handling of 0BF0-0BFA Tamil numbers and symbols 2019-03-01 13:21:49 +00:00
zdenop
d7ddc4c5b7
Merge pull request #2270 from Shreeshrii/U_ARABIC_NUMBER
Treat U_ARABIC_NUMBER as LTR
2019-02-28 09:27:54 +01:00
Shree
25b02bf1f2 Treat U_ARABIC_NUMBER as LTR 2019-02-26 09:51:21 +00:00
Shreeshrii
2f71fe280c
Use alternative way to comment a block of code (using the c preprocessor).
https://github.com/tesseract-ocr/tesseract/pull/2268#pullrequestreview-207605382
Thanks @amitdo
2019-02-26 15:05:51 +05:30
Shree
449f1cd4ba Remove test for Word started with a combiner 2019-02-25 18:47:42 +00:00
zdenop
25c43b1e7c
Merge branch 'master' into distort 2019-02-23 18:23:14 +01:00
Stefan Weil
b3e355a682 Remove whitespace at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-23 17:49:56 +01:00
Shreeshrii
34e4d6b1d7
Revert to 0 (50% percents of images inverted). 2019-02-23 17:59:00 +05:30
Shreeshrii
287d5341bf
TODO 2019-02-23 17:56:02 +05:30
Shreeshrii
3e3e1ed55d
Remove commented Code 2019-02-23 17:54:00 +05:30
Shree
2aded47a3c Implement distort_image in text2image - default false 2019-02-22 12:27:27 +00:00
Shree
49ed3a72d4 implement PrepareDistortedPix as part of DegradeImage 2019-02-21 14:48:29 +00:00
Stefan Weil
b3bd23edb7 Remove whitespace at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-19 13:53:31 +01:00
Stefan Weil
b95598a0b1
Merge pull request #2070 from pndaza/master
add missed letters ( ၌ ၍ ၎ ၏ )  and symbols ( ၊ ။ ) - 0x104a to 0x104f -
2019-02-19 12:22:53 +01:00
Shree
a044f64375 fix Myanmar validation rules as per Unicode charts 2019-02-15 04:40:55 +00:00
Shreeshrii
c28a68115e
Merge branch 'master' into boxtiff 2019-02-02 23:42:39 +05:30
Shree Devi Kumar
d9590f8adf allow user specified box/tiff pairs with tesstrain.sh 2019-02-02 11:35:45 +00:00
Shree Devi Kumar
323361b902 allow user specified box/tiff pairs with tesstrain.sh 2019-02-02 11:33:32 +00:00
Shree Devi Kumar
ad223296af use --xsize instead of --x_size
(cherry picked from commit 94b8988b8cca3812137933db00750bd6e2e84e32)
2019-02-02 11:08:34 +00:00
Shree Devi Kumar
4d9bc11fd3 add --xsize as parameter for tesstrain 2019-01-27 07:00:25 +00:00
zdenop
059c50be8c
Merge pull request #2184 from stweil/tests
Fix and enable stringrenderer_test
2019-01-24 07:59:07 +01:00
Diego de la Hera
1a398a5b5d removed reference to unbound variable 2019-01-23 15:04:16 -03:00
Stefan Weil
ecf73f5bc7 training: Don't terminate after processing 8 fonts or 8 images
tesstrain_utils.sh sets the shell flag -e, so it exits immediately
if a command exits with a non-zero status.

The following command returns a non-zero status as soon as counter is a
multiple of par_factor (par_factor=8, that means as soon as 8 fonts or
images are processed):

    let rem=counter%par_factor

The new code fixes this undesired exit.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 17:26:40 +01:00
Stefan Weil
32e9d7c8f5 training: Fix some compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 13:55:13 +01:00
Stefan Weil
e4b862d588 pango_font_info: Fix runtime error messages from Pango
pango_coverage_get and pango_coverage_unref should not be called
with coverage == nullptr.

pango_font_get_coverage should not be called with font == nullptr.

Otherwise Pango prints runtime error messages:

    (process:12657): Pango-CRITICAL **: pango_coverage_get: assertion 'coverage != NULL' failed
    (process:12657): Pango-CRITICAL **: pango_coverage_unref: assertion 'coverage != NULL' failed
    (process:12657): Pango-CRITICAL **: pango_font_get_coverage: assertion 'font != NULL' failed
    (process:12657): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

Typically those errors occur if a required font is not installed,
so this can be a quite common error.

Fix also a potential resource leak in PangoFontInfo::CoversUTF8Text.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 13:55:13 +01:00
Shree Devi Kumar
77d0b6ce8e fix WORDLIST filename 2019-01-22 15:49:55 +01:00
Nick White
b8de06430d Ensure baseapi.h header is used by commontraining.h regardless of autotools usage 2019-01-04 20:20:00 +00:00
Stefan Weil
91af010200 Fix compiler warning
gcc warning:

    src/training/text2image.cpp:694:35: warning:
        ISO C++ forbids converting a string constant to ‘char*’
        [-Wwrite-strings]

putenv expects a string which can be modified.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-01 22:49:04 +01:00
Stefan Weil
7ebd3153ae Fix several typos (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-10 18:59:58 +01:00
Stefan Weil
c59c45fb3e Fix Amharic font list
This was reported for the Python code by LGTM.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 08:00:22 +01:00
Stefan Weil
b148644c1b Make Python script executable
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 07:08:45 +01:00
Egor Pugin
267b79982d
Merge pull request #2076 from jbarlow83/pythonize-training
RFC: Pythonize tesstrain.sh and friends
2018-11-25 13:31:48 +03:00
James R. Barlow
8aa25239ae Fix some of Codacy's complaints 2018-11-24 16:59:01 -08:00
James R. Barlow
9122e6249e Autoreformat code
This increases the deviation from the bash scripts so is done separately.
2018-11-24 00:50:29 -08:00
James R. Barlow
d9ae7ecc49 Pythonize tesstrain.sh -> tesstrain.py
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.

I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.

Python 3.6+ is required.  Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.

There are minor output and behavioral changes, and advantages.  Python's loggingis used.  Temporary files are only deleted on success, so they can be inspected
if training files.  Console output is more terse and the log file is more
verbose.  And there are progress bars!  (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.

Argument checking is also more comprehensive.
2018-11-24 00:45:35 -08:00
pndaza
fc8a3d5bbc combine condition with next 2018-11-24 09:21:05 +06:30
pndaza
5c85d8e03d add missed letters and symbols - 0x104a to 0x104f - 2018-11-24 09:14:31 +06:30
Stefan Weil
9b783822a0 Remove unused include statements for tprintf.h
Format also a call of tprintf and add a missing explicit include statement.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-18 17:25:01 +01:00
Stefan Weil
acca4fb999 Fix some unbound variables and other small issues in training shell scripts
Fix also the logging helper functions to work without log file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-16 11:13:46 +01:00