Commit Graph

178 Commits

Author SHA1 Message Date
Stefan Weil
78a957b989 Remove spaces a line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-04-13 18:54:42 +02:00
Stefan Weil
72c874140e Modernize code by replacing C type casts
This was done using clang-tidy.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-04-07 09:04:51 +02:00
zdenop
ab09b09da6
Merge pull request #2294 from bertsky/lstm-with-char-whitelist
trying to add tessedit_char_whitelist etc. again:
2019-04-06 14:41:30 +02:00
Robert Schubert
25a42ea42f fixed failure report for tesstrain commands:
- with `set -e` in effect, looking at stdout
  to detect failure is too late
2019-04-06 08:13:03 +02:00
Robert Schubert
d5584e793e fixed failure report for tesstrain commands:
- with `set -e` in effect, it does not make sense
  to query `$?` indirectly
2019-04-06 08:13:03 +02:00
Stefan Weil
802f42e821 Remove BOOL8, TRUE, FALSE from host.h
Remove unneeded include statements for host.h, add required ones and
update the comments for the remaining include statements.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 18:27:20 +02:00
Stefan Weil
cbb5e729a1 classify: Use bool and replace TRUE, FALSE
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 17:53:50 +02:00
Stefan Weil
664811a869 Replace BOOL8, TRUE, FALSE by bool, true, false
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 17:28:28 +02:00
zdenop
5f06402755 python: optimize imports, reformat code 2019-03-31 16:53:39 +02:00
zdenop
2e9fd69c9e use 'import pathlib'; fix "TypeError: argument of type 'WindowsPath' is not iterable" 2019-03-31 16:53:33 +02:00
zdenop
a0527b41bd fix LGTM reports for python 2019-03-31 16:53:25 +02:00
Shreeshrii
ea36e94e58 fix Could not parse bool from flag (#2359) 2019-03-29 14:50:21 +01:00
Stefan Weil
f877640bc9
Merge pull request #2319 from bertsky/tesstrain-parallel-wait-retval
tesstrain: check failure of subjobs
2019-03-25 16:10:09 +01:00
Stefan Weil
d8d2f6f48a Fix broken shell scripts for training
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-25 15:32:43 +01:00
Shreeshrii
8749f3553e
LINEDATA=false 2019-03-23 19:16:49 +05:30
Shree
bcb7cf9846 sort arguments, use true/false instead of 1/0 2019-03-23 12:28:53 +00:00
Shree
c2db272134 Modify distort_image for Boolean 2019-03-22 17:02:46 +00:00
Shree
9b915d5efb add --distort_image 2019-03-22 05:39:38 +00:00
Shree
f7ffde99d5 add --distort_image 2019-03-22 05:34:00 +00:00
zdenop
ac7ea4322a
Merge pull request #2335 from Shreeshrii/master
Changes to tesstrain.py - max_workers=8, distort_image=false
2019-03-17 15:27:34 +01:00
zdenop
26877ba703 check min. python version; os.uname is not available on windows 2019-03-17 15:25:48 +01:00
Shreeshrii
f8e8521606
Update tesstrain_utils.py 2019-03-17 15:32:35 +05:30
Shree
6fa8e1bb15 Set max_workers=8 2019-03-17 09:58:11 +00:00
Shree
e21499e81e Set default value for distort_image 2019-03-17 09:54:16 +00:00
Stefan Weil
ee2f9bf7bf Remove old comments in file headers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 10:55:00 +01:00
Shree
d47b0d588a Use LATIN_FONTS for kmr 2019-03-15 15:47:56 +00:00
Shree
3eee1d217a Add kmr and kur_ara, remove kur from training scripts 2019-03-15 15:37:49 +00:00
Shree
b2ebf0195f Add kmr and kur_ara, remove kur from training scripts 2019-03-15 14:39:39 +00:00
Shree
37befdf6c4 Add option for --distort_image 2019-03-15 13:32:36 +00:00
Robert Schubert
14346e56b0 tesstrain: catch+handle SIGINT (to stop waiting on subjobs) 2019-03-15 00:03:16 +01:00
Robert Schubert
6cbad17e30 tesstrain: check all subjobs' retval 2019-03-14 14:38:51 +01:00
Robert Schubert
5316bcbb94 tesstrain: check failure of subjobs 2019-03-14 11:42:01 +01:00
Stefan Weil
896698a4f5 Fix runtime error (left shift of negative value)
Runtime error:

    src/training/util.h:37:28: runtime error: left shift of negative value -17

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 06:56:54 +01:00
Stefan Weil
5202208a8c Remove globals.h
It only included other files which are already included where needed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-11 19:01:23 +01:00
zdenop
f80085c0bf
Merge pull request #2289 from Armyke/master
Added an additional optional --tmp_dir parameter to specify the tempo…
2019-03-06 15:03:14 +01:00
Stefan Weil
1c7e00611b Add initial support for traineddata files in standard archive formats
This requires libarchive-dev.

Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:

    $ unzip -l /usr/local/share/tessdata/zip.traineddata
    Archive:  /usr/local/share/tessdata/zip.traineddata
      Length      Date    Time    Name
    ---------  ---------- -----   ----
           55  2019-03-05 15:27   bagit.txt
            0  2019-03-05 15:25   data/
         1557  2019-03-05 15:28   manifest-sha256.txt
      1082890  2019-03-05 15:25   data/eng.word-dawg
      1487588  2019-03-05 15:25   data/eng.lstm
         7477  2019-03-05 15:25   data/eng.unicharset
        63346  2019-03-05 15:25   data/eng.shapetable
       976552  2019-03-05 15:25   data/eng.inttemp
        13408  2019-03-05 15:25   data/eng.normproto
         4322  2019-03-05 15:25   data/eng.punc-dawg
         4738  2019-03-05 15:25   data/eng.lstm-number-dawg
         1410  2019-03-05 15:25   data/eng.freq-dawg
          844  2019-03-05 15:25   data/eng.pffmtable
         6360  2019-03-05 15:25   data/eng.lstm-unicharset
         1012  2019-03-05 15:25   data/eng.lstm-recoder
         1047  2019-03-05 15:25   data/eng.unicharambigs
         4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
     16109842  2019-03-05 15:25   data/eng.bigram-dawg
           80  2019-03-05 15:25   data/eng.version
         6426  2019-03-05 15:25   data/eng.number-dawg
      3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
    ---------                     -------
     23468070                     21 files

`combine_tessdata -d` and `combine_tessdata -u` also work.

The traineddata files in the new format can be generated with
standard tools like zip or tar.

More work is needed for other training tools and big endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-05 17:18:48 +01:00
Armyke
56b04d4ea7 Added the same --tmp_dir flag to tesstrain_utils.sh 2019-03-04 14:05:25 +00:00
Armyke
25fa392887 Added an additional optional --tmp_dir parameter to specify the temporary directory in which tesstrain.py creates the training temporary files. The main reason is due to the slow R/W on HDD, if anyone wants to speed up this process can use as tmp_dir a directory on an SSDrive 2019-03-04 13:26:53 +00:00
Stefan Weil
295996ed05 commandlineflags: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:21:04 +01:00
Stefan Weil
fb0f1bcf66 BoxChar: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:04:54 +01:00
Stefan Weil
0e1a1fc3cf Validator: Fix compiler warnings (signed/unsigned)
This also fixes a regression in validate_grapheme_test introduced
by commit 32e9d7c8f5.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 13:05:03 +01:00
zdenop
2ba8e0061a
Merge branch 'master' into mya 2019-03-01 18:37:24 +01:00
zdenop
646b043d2c
use space instead of tab 2019-03-01 14:36:09 +01:00
Shree
5ee1deaea2 correct handling of 0BF0-0BFA Tamil numbers and symbols 2019-03-01 13:21:49 +00:00
zdenop
d7ddc4c5b7
Merge pull request #2270 from Shreeshrii/U_ARABIC_NUMBER
Treat U_ARABIC_NUMBER as LTR
2019-02-28 09:27:54 +01:00
Shree
25b02bf1f2 Treat U_ARABIC_NUMBER as LTR 2019-02-26 09:51:21 +00:00
Shreeshrii
2f71fe280c
Use alternative way to comment a block of code (using the c preprocessor).
https://github.com/tesseract-ocr/tesseract/pull/2268#pullrequestreview-207605382
Thanks @amitdo
2019-02-26 15:05:51 +05:30
Shree
449f1cd4ba Remove test for Word started with a combiner 2019-02-25 18:47:42 +00:00
zdenop
25c43b1e7c
Merge branch 'master' into distort 2019-02-23 18:23:14 +01:00
Stefan Weil
b3e355a682 Remove whitespace at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-23 17:49:56 +01:00