Commit Graph

356 Commits

Author SHA1 Message Date
Stefan Weil
6fcf8d23bc Use more compiler and linker flags from pkg-config
This fixes some build issues with Homebrew on macOS.

Signed-off-by: Stefan Weil <stefan@Sabines-Mac-mini.fritz.box>
2020-12-13 13:24:46 +01:00
Stefan Weil
bf3774cc91 Use more const char*
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:01:17 +01:00
Stefan Weil
4613738a5e Use const char* for filename and network_spec parameters
This replaces the proprietary STRING data type
(764 instead of 838 lines remaining).

It also removes STRING from osdetect.h and serialis.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-11-26 17:01:17 +01:00
Stefan Weil
7c4ef88dab Remove unused functions FontUtils::GetAllRenderableCharacters
They used the function pango_coverage_max which does nothing and
which has been deprecated since pango version 1.44.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-10-03 12:04:40 +02:00
Stefan Weil
cb3880fb15 Disable more code and data with GRAPHICS_DISABLED
Some runtime parameters which are only relevant with graphics enabled
are now removed from builds when graphics is disabled.

TableFinder::DisplayColSegmentGrid is never used, so remove it completely.

Builds with --disable-graphics significantly reduce the code size and avoid
some function calls which might be important for certain applications:

   text	   data	    bss	    dec	    hex	filename
3219230	  41136	  13920	3274286	 31f62e	.libs/libtesseract.so (--disable-graphics, old)
3211347	  40976	  13600	3265923	 31d583	.libs/libtesseract.so (--disable-graphics, new)
3360942	  43656	  15392	3419990	 342f56	.libs/libtesseract.so (default)

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-09 11:23:33 +02:00
Stefan Weil
8137cf35a6 Use const char* for filename parameters
This replaces the proprietary STRING data type
(801 instead of 838 lines remaining).

It also removes STRING from osdetect.h and serialis.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-07-07 14:20:09 +02:00
Stefan Weil
62b085cb8d ScrollView: Remove C API callcpp.{cpp,h}
Use C++ class ScrollView directly instead of using an intermediate C API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-06-22 09:14:26 +02:00
Matej Knopp
e900252c1a Fix CMake build with DISABLED_LEGACY_ENGINE 2020-06-17 19:42:49 +02:00
Egor Pugin
0eaabc42c7
Update CMakeLists.txt 2020-05-12 11:49:15 +03:00
Egor Pugin
e720a26745
[cmake] Set inactivity timeout during icu download to 300 seconds.
Fixes #2972.
2020-05-09 18:55:45 +03:00
Stefan Weil
16553014e0 Replace references to the old wiki by new URLs
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-02-03 11:37:41 +01:00
Stefan Weil
3d1f82d0e2 tesstrain.sh: Fix command line flag --help
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-05 10:10:55 +01:00
Stefan Weil
d2a2292f32 mftraining: Fix compiler warning
powerpc64le-linux-gnu-g++ warning:

    src/training/mftraining.cpp:209:5: warning:
        ‘%04d’ directive output may be truncated writing between 4 and 10 bytes
        into a region of size 8 [-Wformat-truncation=]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2020-01-03 10:13:58 +01:00
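The truncation warning addressed by the commit above can be illustrated with a minimal C++ sketch. The helper name `FormatIndex` and the buffer size are assumptions for illustration, not the code in src/training/mftraining.cpp; the point is only that the buffer is sized for the widest value `%04d` can produce.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper, not taken from mftraining.cpp.
// "%04d" pads to at least 4 digits, but a 32-bit int can need up to
// 11 characters ("-2147483648") plus the terminating NUL.
std::string FormatIndex(int index) {
  char buffer[12];
  const int written = std::snprintf(buffer, sizeof(buffer), "%04d", index);
  // snprintf returns the length it wanted to write; check that it fit.
  assert(written > 0 && written < static_cast<int>(sizeof(buffer)));
  (void)written;  // avoids an unused-variable warning when NDEBUG is set
  return std::string(buffer);
}
```

Calling `FormatIndex(7)` yields `0007`, and larger values are no longer at risk of being cut off.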
amitdo
502ebe8ca9 Autotools: Pango, Cairo and ICU only required by training tools 2019-12-16 17:23:06 +02:00
Stefan Weil
6181acf367 automake: Flat build for src/cutil
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Stefan Weil
cafb1bbfd7 automake: Flat build for src/api
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-11-26 16:20:46 +01:00
Shreeshrii
99dfa8a680 Add separator and training_iteration to checkpoint name (#2752)
* Add separator and training_iteration to checkpoint name
* specify modelname_N.NN_NN_NN.checkpoint for intermediate checkpoint
2019-11-09 12:22:40 +01:00
maungd@battelle.org
3d7afb69ea Exposed the text2image option --ptsize to tesstrain.sh. Text2image has the
option --ptsize which defaults to 12.  This option is not exposed through
tesstrain.sh; thus, you cannot use tesstrain.sh to explore training with
different font sizes.  I made a small modification to expose the --ptsize
option to tesstrain.sh.  It defaults to 12 if not specified.
2019-11-01 15:10:58 -04:00
Egor Pugin
2bcc9d8093 Remove cppan build. 2019-10-30 21:37:38 +03:00
Egor Pugin
2a37f5dd62 Update includes to use <>. 2019-10-29 14:50:11 +03:00
amitdo
2f8884a64e Fix autotools build 2019-10-28 21:23:58 +02:00
amitdo
e1bae15547 Fix #include path of public headers 2019-10-28 19:10:30 +02:00
zdenop
fc629eae3b Subject: training: show error description for open/delete file 2019-10-21 16:31:57 +02:00
zdenop
36dc2ccf75 fix memory leak at PangoFontInfo::CanRenderString 2019-10-20 16:43:04 +02:00
zdenop
1ec34378d9 test for synthesized font faces. 2019-10-19 15:05:28 +02:00
zdenop
cbbe45d94b cmake: add minimum required version for pango and icu based on autotools 2019-10-19 15:00:49 +02:00
zdenop
37c7a5dd82 text2image: show pango version 2019-10-19 14:52:06 +02:00
Stefan Weil
994ec697d8 Remove member functions STRING::string and StringParam::string
They were redundant because there exist member functions 'c_str' which do the same.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-23 08:33:08 +02:00
Stefan Weil
a730b5c4ff Remove STRING from the public Tesseract API
Removing STRING from genericvector.h allows eliminating the proprietary
STRING data type from the public Tesseract API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-22 20:32:28 +02:00
Stefan Weil
8cb677d6a2 Replace STRING arguments for LoadDataFromFile and SaveDataToFile
This is a step to eliminate the proprietary STRING data type
from the public Tesseract API.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-22 20:32:28 +02:00
Stefan Weil
97dda3d535 Fix CID 1386099 (Uninitialized pointer field)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
951f442303 Fix CID 1386105 (Logically dead code)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
64fc205e78 Fix CID 1402767 (Invalid type in argument to printf format specifier)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-09-14 15:43:50 +02:00
Stefan Weil
43b2e9513b lstmtrainer: Fix diagnostic message
Signed character values must be converted to unsigned integers for %x.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-15 14:31:32 +02:00
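Why signed character values need converting before `%x` can be shown with a short C++ sketch. `HexBytes` is a hypothetical helper written for this illustration, not the diagnostic code in lstmtrainer.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper, not the lstmtrainer code: formats each byte as hex.
std::string HexBytes(const std::string& bytes) {
  std::string result;
  char buffer[4];  // two hex digits, one space, terminating NUL
  for (char c : bytes) {
    // Where char is signed, a byte such as 0xe4 is negative. Passing it to
    // %x directly would promote it to a negative int and typically print
    // ffffffe4, so convert to unsigned char first.
    const unsigned value = static_cast<unsigned char>(c);
    const int written = std::snprintf(buffer, sizeof(buffer), "%02x ", value);
    assert(written == 3);
    (void)written;  // avoids an unused-variable warning when NDEBUG is set
    result += buffer;
  }
  return result;
}
```

With the conversion in place, every byte prints as exactly two hex digits regardless of whether `char` is signed on the platform.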
Stefan Weil
100d8cd29b lstmtester: Add missing space in log messages
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-14 14:12:47 +02:00
Stefan Weil
e84cb24def Move source files which are used for training only to src/training
They are moved from src/classify and src/lstm to src/training.

This reduces the size of the Tesseract library.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-12 17:08:08 +02:00
Stefan Weil
315dd9df3f cmake: Don't link pthread on Windows
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-08-07 15:24:00 +02:00
Zdenko Podobný
c5a50b93ce move fileio.cpp and fileio.h to training (this fixes the Android build) 2019-08-04 21:26:39 +02:00
Egor Pugin
c58efee4ba Use pangocairo-1.43 for the moment. Remove private pango header. 2019-08-01 11:55:18 +03:00
Egor Pugin
f1a567e814
Try to fix #2599 2019-08-01 11:35:15 +03:00
Stefan Weil
23ef93ac4d cmake: Add missing pthread library
It is needed for C++ threads since commit 85068be405.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-26 07:45:51 +02:00
Stefan Weil
a2b13b49ff Simplify shell code (fixes warning from Codacy)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-17 21:33:24 +02:00
Stefan Weil
467f8f4140 Fix training script for macOS (issue #2578)
Bash on macOS does not support "|&":

    tesstrain_utils.sh: line 80: syntax error near unexpected token `&'

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-17 17:18:44 +02:00
Stefan Weil
fcfdb7e56f Remove unused include statements
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 14:48:31 +02:00
Stefan Weil
85068be405 lstmtester: Replace SVSync::StartThread by std::thread
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 14:30:51 +02:00
Stefan Weil
93427391c1 Replace SVAutoLock by std::lock_guard
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 12:01:28 +02:00
Stefan Weil
36026e3c35 Replace SVMutex by std::mutex
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-15 12:01:28 +02:00
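The three commits above replace ScrollView's own thread and mutex wrappers with the C++11 standard library. A minimal sketch of the standard facilities working together follows; the `Counter` class and `CountWithThreads` function are hypothetical and do not come from the Tesseract sources.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter, protected by std::mutex.
class Counter {
 public:
  void Increment() {
    // std::lock_guard unlocks automatically at the end of the scope,
    // which is the role SVAutoLock played for SVMutex.
    std::lock_guard<std::mutex> guard(mutex_);
    ++value_;
  }
  int value() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return value_;
  }

 private:
  mutable std::mutex mutex_;
  int value_ = 0;
};

// Starts num_threads threads with std::thread and waits for all of them.
int CountWithThreads(int num_threads, int increments_per_thread) {
  assert(num_threads >= 0 && increments_per_thread >= 0);
  Counter counter;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&counter, increments_per_thread]() {
      for (int i = 0; i < increments_per_thread; ++i) counter.Increment();
    });
  }
  for (auto& thread : threads) thread.join();
  return counter.value();
}
```

Depending on the platform, code that uses `std::thread` may need to be linked against the system thread library (for example with `-pthread`).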
Stefan Weil
bdc7abf518 Fix format strings for size_t arguments (CID 1402762, 1402767)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-10 16:57:19 +02:00
Egor Pugin
3b6f071ee8 Implement CMake+SW build.
Currently only Windows is supported.
You can try it as follows:

    mkdir build_sw && cd build_sw && cmake .. -DSW_BUILD=1
2019-07-08 18:50:30 +03:00
zhuangzhuang1988
18c67f4989 fix tesstrain.py error 2019-07-08 14:35:17 +08:00
Stefan Weil
1c1eb76c36 Use C++-11 code instead of TessCallback for Dawg::iterate_words
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-04 16:03:30 +02:00
Stefan Weil
eeec9c66d4 training: Use C++-11 code for TestCallback
This allows removing more code from tesscallback.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-07-04 16:03:30 +02:00
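Replacing a hand-written callback class with C++11 code, as the two commits above do, usually means accepting a `std::function` and passing a lambda. The sketch below assumes a simplified word iterator; `IterateWords` and `CountLongWords` are hypothetical stand-ins and do not reproduce the signature of Dawg::iterate_words.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a word iterator that reports each word
// to a caller-supplied callback.
void IterateWords(const std::vector<std::string>& words,
                  const std::function<void(const std::string&)>& callback) {
  assert(callback);  // an empty std::function must not be called
  for (const auto& word : words) callback(word);
}

// Counts the words with at least min_length characters. The lambda
// captures local state directly, so no separate callback class is needed.
int CountLongWords(const std::vector<std::string>& words,
                   std::size_t min_length) {
  int count = 0;
  IterateWords(words, [&count, min_length](const std::string& word) {
    if (word.size() >= min_length) ++count;
  });
  return count;
}
```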
zhuangzhuang1988
99cb088708 close log file handle before move it. 2019-07-01 10:53:12 +08:00
zhuangzhuang1988
a3a361f73d fix logger file encoding error. 2019-06-28 18:29:52 +08:00
Stefan Weil
ea20bf0373 Remove dummy code from LSTMTrainer::InitTensorFlowNetwork
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-06-22 21:01:40 +02:00
Stefan Weil
41f91c96c8 cmake: Build training tools also on Linux and macOS
This enables CI tests for the code in src/training on Linux and macOS.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-06-22 20:27:56 +02:00
Stefan Weil
df98bb7368 Move LSTMTrainer from libtesseract to libtesseract_training
LSTMTrainer is only used for training, so the shared library for
Tesseract can be made smaller.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-06-22 16:23:51 +02:00
Stefan Weil
bd13069fe8 Simplify class LSTMTrainer
The function pointers and callbacks file_reader_, file_writer_,
checkpoint_reader_ and checkpoint_writer_ are always set to
the same values. Replacing them by direct function calls
simplifies the code and allows removing more code from tesscallback.h.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-06-22 09:18:13 +02:00
zdenop
60b4c68d31 tesstrain_utils.sh: remove redundant code 2019-06-20 18:42:29 +02:00
zdenop
60aee9f821 create OUTPUT_DIR if it does not exist; fixes #2497 2019-06-16 15:07:16 +02:00
zdenop
fad96db497
Merge pull request #2494 from Shreeshrii/master
Allow saving of box/tiff pairs during legacy tesseract training
2019-06-14 20:44:49 +02:00
Shree
6fa4587949 Allow saving of box/tiff pairs during base tesseract training 2019-06-14 09:35:39 +00:00
Shree
45cdf741ae Allow saving of box/tiff pairs during base tesseract training 2019-06-14 09:32:41 +00:00
Shree
832c6edb97 Allow saving of box/tiff pairs during base tesseract training 2019-06-14 09:25:54 +00:00
James R. Barlow
a9890afd12 Fix text2image compilation on C++17 compilers
C++17 removes `std::random_shuffle`, which breaks builds of
text2image.cpp with C++17 compilers. `std::shuffle` is valid in C++11
through C++17, so use `std::shuffle` instead.

Due to the use of `std::random_shuffle`, `text2image --render_ngrams`
would not give consistent results across compilers or platforms.
With the current change, the same random number generator is used on
all platforms and initialized to the same seed, so training output
should be consistent.
2019-06-13 16:07:20 -07:00
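The change described in the commit above can be sketched as follows. `ShuffleNgrams`, the choice of `std::mt19937` and the seed parameter are assumptions for illustration; the commit message does not say which engine or seed text2image.cpp uses.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper, not taken from text2image.cpp.
// std::shuffle takes an explicit random number engine, so seeding the
// engine with a fixed value makes repeated runs produce the same order.
void ShuffleNgrams(std::vector<std::string>* ngrams, unsigned seed) {
  assert(ngrams != nullptr);
  std::mt19937 engine(seed);
  std::shuffle(ngrams->begin(), ngrams->end(), engine);
}
```

The number sequence of `std::mt19937` is fixed by the standard, but the standard does not pin down exactly how `std::shuffle` consumes it, so identical output across different standard library implementations is plausible rather than guaranteed.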
Stefan Weil
9a4bd041c8 Fix build for unittests
Commit 29f2cff203 was the wrong fix
for the compiler warnings because it broke the unittest build.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 21:36:34 +02:00
Stefan Weil
ac999b2409 Remove unused macros
This fixes compiler warnings from clang++ like these ones:

    src/ccutil/params.cpp:34:9: warning: macro is not used [-Wunused-macros]
    src/cutil/oldlist.cpp:67:9: warning: macro is not used [-Wunused-macros]
    src/cutil/oldlist.cpp:68:9: warning: macro is not used [-Wunused-macros]
    src/cutil/oldlist.cpp:78:9: warning: macro is not used [-Wunused-macros]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 20:27:21 +02:00
Stefan Weil
29f2cff203 training: Add missing static attributes
That fixes several warnings from clang++ like the following one:

    src/training/combine_lang_model.cpp:36:1: warning: no previous extern declaration for non-static variable 'FLAGS_lang_is_rtl' [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 11:33:52 +02:00
Stefan Weil
a139d553a7 training: Move declarations from cpp files to h file
That fixes several warnings from clang++ like the following one:

    src/training/commontraining.cpp:95:1: warning: no previous extern declaration for non-static variable 'FLAGS_D' [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 08:53:09 +02:00
Stefan Weil
4bec4a69a0 Add missing static attributes
This fixes lots of compiler warnings like these ones:

    src/api/baseapi.cpp:113:13: warning: no previous extern declaration for non-static variable 'kInputFile' [-Wmissing-variable-declarations]
    src/api/baseapi.cpp:117:13: warning: no previous extern declaration for non-static variable 'kOldVarsFile' [-Wmissing-variable-declarations]
    src/api/baseapi.cpp:97:10: warning: no previous extern declaration for non-static variable 'stream_filelist' [-Wmissing-variable-declarations]
    src/ccmain/equationdetect.cpp:46:10: warning: no previous extern declaration for non-static variable 'equationdetect_save_bi_image' [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-26 08:53:09 +02:00
Stefan Weil
2441e4d8ac Implement check for Tensorflow header file
This looks for one of the header files which are included by Tesseract.
It currently uses a hard coded path which works for Debian / Ubuntu.

Simplify also the rules for linking Tensorflow.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-24 16:52:14 +02:00
Stefan Weil
9cdf041448 Remove "third_party/" in comments and update path names
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-24 14:12:52 +02:00
Stefan Weil
4382ab1a34 Support build with Tensorflow
It expects include files in /usr/include/tensorflow.

* Add configure option --with-tensorflow (disabled by default)
* Fix data type tensorflow::int64
* Remove "third_party/" in include statements
* Add dummy implementations for Backward and DebugWeights in TFNetwork
* Add files generated with protoc from tfnetwork.proto
  (so the Tensorflow sources are not needed for the build)
* Update Makefiles

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-24 14:11:31 +02:00
Bharat123rox
945ccac85a Fix syntax error 2019-05-22 10:10:12 +05:30
Bharat123rox
7f31a0634d Some LGTM fixes and potential bugfixes 2019-05-21 23:24:50 +05:30
Stefan Weil
d2ca81e794 Remove local definition of M_PI
It is defined for all platforms when math.h or cmath is included
after defining the macro _USE_MATH_DEFINES.

Define _USE_MATH_DEFINES before any include statement to make sure
that M_PI gets defined. It is not necessary to define it conditionally
only for Windows.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-20 21:18:52 +02:00
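The include order described in the commit above looks like this in a minimal C++ sketch. `DegreesToRadians` is a hypothetical function used only to show `M_PI` in use.

```cpp
// The macro must be defined before the first include, because another
// header might pull in <cmath> or math.h indirectly.
#define _USE_MATH_DEFINES
#include <cassert>
#include <cmath>

// Hypothetical helper that relies on M_PI from the math header.
double DegreesToRadians(double degrees) {
  assert(std::isfinite(degrees));
  return degrees * M_PI / 180.0;
}
```

On platforms whose math header already exposes `M_PI`, the macro definition is harmless, which is why it does not need to be conditional on Windows.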
Stefan Weil
6b1e709b19 Fix Doxygen comments for void functions
Void functions should not use @return. It causes compiler warnings
like this one:

    src/classify/intproto.cpp:326:5: warning:
      '@return' command used in a comment that is attached to a function
      returning void [-Wdocumentation]

Some non-void functions also were documented with @return none.
Fix those comments, too.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-14 21:57:17 +02:00
Stefan Weil
4fbc0a257b commandlineflags: Replace strtod by std::stringstream
Using std::stringstream allows conversion of a string to double
independent of the current locale setting.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-05-02 07:33:46 +02:00
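The result of `strtod` depends on the current C locale, for example on whether the decimal separator is a point or a comma. A stream imbued with the classic locale avoids that. The sketch below is an assumption about how such a conversion can be written; `ParseDouble` is a hypothetical helper and not the code in commandlineflags.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper, not taken from commandlineflags.
// Parses text as a double using the classic "C" locale, whatever the
// global locale is. Returns false if the text is not a complete number.
bool ParseDouble(const std::string& text, double* value) {
  assert(value != nullptr);
  std::istringstream stream(text);
  stream.imbue(std::locale::classic());
  stream >> *value;
  // Require a successful extraction that consumed the whole input.
  return !stream.fail() && stream.eof();
}
```

With this helper, `2.25` parses on every system, while `2,25` is rejected even when the user's locale uses a decimal comma.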
Stefan Weil
78a957b989 Remove spaces at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-04-13 18:54:42 +02:00
Stefan Weil
72c874140e Modernize code by replacing C type casts
This was done using clang-tidy.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-04-07 09:04:51 +02:00
zdenop
ab09b09da6
Merge pull request #2294 from bertsky/lstm-with-char-whitelist
trying to add tessedit_char_whitelist etc. again:
2019-04-06 14:41:30 +02:00
Robert Schubert
25a42ea42f fixed failure report for tesstrain commands:
- with `set -e` in effect, looking at stdout
  to detect failure is too late
2019-04-06 08:13:03 +02:00
Robert Schubert
d5584e793e fixed failure report for tesstrain commands:
- with `set -e` in effect, it does not make sense
  to query `$?` indirectly
2019-04-06 08:13:03 +02:00
Stefan Weil
802f42e821 Remove BOOL8, TRUE, FALSE from host.h
Remove unneeded include statements for host.h, add required ones and
update the comments for the remaining include statements.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 18:27:20 +02:00
Stefan Weil
cbb5e729a1 classify: Use bool and replace TRUE, FALSE
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 17:53:50 +02:00
Stefan Weil
664811a869 Replace BOOL8, TRUE, FALSE by bool, true, false
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 17:28:28 +02:00
zdenop
5f06402755 python: optimize imports, reformat code 2019-03-31 16:53:39 +02:00
zdenop
2e9fd69c9e use 'import pathlib'; fix "TypeError: argument of type 'WindowsPath' is not iterable" 2019-03-31 16:53:33 +02:00
zdenop
a0527b41bd fix LGTM reports for python 2019-03-31 16:53:25 +02:00
Shreeshrii
ea36e94e58 fix Could not parse bool from flag (#2359) 2019-03-29 14:50:21 +01:00
Stefan Weil
f877640bc9
Merge pull request #2319 from bertsky/tesstrain-parallel-wait-retval
tesstrain: check failure of subjobs
2019-03-25 16:10:09 +01:00
Stefan Weil
d8d2f6f48a Fix broken shell scripts for training
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-25 15:32:43 +01:00
Shreeshrii
8749f3553e
LINEDATA=false 2019-03-23 19:16:49 +05:30
Shree
bcb7cf9846 sort arguments, use true/false instead of 1/0 2019-03-23 12:28:53 +00:00
Shree
c2db272134 Modify distort_image for Boolean 2019-03-22 17:02:46 +00:00
Shree
9b915d5efb add --distort_image 2019-03-22 05:39:38 +00:00
Shree
f7ffde99d5 add --distort_image 2019-03-22 05:34:00 +00:00
zdenop
ac7ea4322a
Merge pull request #2335 from Shreeshrii/master
Changes to tesstrain.py - max_workers=8, distort_image=false
2019-03-17 15:27:34 +01:00
zdenop
26877ba703 check min. python version; os.uname is not available on windows 2019-03-17 15:25:48 +01:00
Shreeshrii
f8e8521606
Update tesstrain_utils.py 2019-03-17 15:32:35 +05:30
Shree
6fa8e1bb15 Set max_workers=8 2019-03-17 09:58:11 +00:00
Shree
e21499e81e Set default value for distort_image 2019-03-17 09:54:16 +00:00
Stefan Weil
ee2f9bf7bf Remove old comments in file headers
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-16 10:55:00 +01:00
Shree
d47b0d588a Use LATIN_FONTS for kmr 2019-03-15 15:47:56 +00:00
Shree
3eee1d217a Add kmr and kur_ara, remove kur from training scripts 2019-03-15 15:37:49 +00:00
Shree
b2ebf0195f Add kmr and kur_ara, remove kur from training scripts 2019-03-15 14:39:39 +00:00
Shree
37befdf6c4 Add option for --distort_image 2019-03-15 13:32:36 +00:00
Robert Schubert
14346e56b0 tesstrain: catch+handle SIGINT (to stop waiting on subjobs) 2019-03-15 00:03:16 +01:00
Robert Schubert
6cbad17e30 tesstrain: check all subjobs' retval 2019-03-14 14:38:51 +01:00
Robert Schubert
5316bcbb94 tesstrain: check failure of subjobs 2019-03-14 11:42:01 +01:00
Stefan Weil
896698a4f5 Fix runtime error (left shift of negative value)
Runtime error:

    src/training/util.h:37:28: runtime error: left shift of negative value -17

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-12 06:56:54 +01:00
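Before C++20, left-shifting a negative signed value is undefined behavior, which is what the sanitizer reports in the commit above. One common way to avoid it is to shift an unsigned value instead. `ShiftLeft` below is a hypothetical helper and not the code in src/training/util.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not taken from src/training/util.h.
// Performs the shift on the unsigned representation, where it is well
// defined, and converts back afterwards. Before C++20 the conversion
// back to int32_t is implementation-defined for values that do not fit;
// common compilers wrap around.
inline int32_t ShiftLeft(int32_t value, unsigned bits) {
  assert(bits < 32);
  return static_cast<int32_t>(static_cast<uint32_t>(value) << bits);
}
```

For hash computations, keeping the whole calculation in an unsigned type avoids the conversion back altogether.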
Stefan Weil
5202208a8c Remove globals.h
It only included other files which are already included where needed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-11 19:01:23 +01:00
zdenop
f80085c0bf
Merge pull request #2289 from Armyke/master
Added an additional optional --tmp_dir parameter to specify the tempo…
2019-03-06 15:03:14 +01:00
Stefan Weil
1c7e00611b Add initial support for traineddata files in standard archive formats
This requires libarchive-dev.

Tesseract can now load traineddata files in any of the archive formats
which are supported by libarchive. Example of a zipped BagIt archive:

    $ unzip -l /usr/local/share/tessdata/zip.traineddata
    Archive:  /usr/local/share/tessdata/zip.traineddata
      Length      Date    Time    Name
    ---------  ---------- -----   ----
           55  2019-03-05 15:27   bagit.txt
            0  2019-03-05 15:25   data/
         1557  2019-03-05 15:28   manifest-sha256.txt
      1082890  2019-03-05 15:25   data/eng.word-dawg
      1487588  2019-03-05 15:25   data/eng.lstm
         7477  2019-03-05 15:25   data/eng.unicharset
        63346  2019-03-05 15:25   data/eng.shapetable
       976552  2019-03-05 15:25   data/eng.inttemp
        13408  2019-03-05 15:25   data/eng.normproto
         4322  2019-03-05 15:25   data/eng.punc-dawg
         4738  2019-03-05 15:25   data/eng.lstm-number-dawg
         1410  2019-03-05 15:25   data/eng.freq-dawg
          844  2019-03-05 15:25   data/eng.pffmtable
         6360  2019-03-05 15:25   data/eng.lstm-unicharset
         1012  2019-03-05 15:25   data/eng.lstm-recoder
         1047  2019-03-05 15:25   data/eng.unicharambigs
         4322  2019-03-05 15:25   data/eng.lstm-punc-dawg
     16109842  2019-03-05 15:25   data/eng.bigram-dawg
           80  2019-03-05 15:25   data/eng.version
         6426  2019-03-05 15:25   data/eng.number-dawg
      3694794  2019-03-05 15:25   data/eng.lstm-word-dawg
    ---------                     -------
     23468070                     21 files

`combine_tessdata -d` and `combine_tessdata -u` also work.

The traineddata files in the new format can be generated with
standard tools like zip or tar.

More work is needed for other training tools and big endian support.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-05 17:18:48 +01:00
Armyke
56b04d4ea7 Added the same --tmp_dir flag to tesstrain_utils.sh 2019-03-04 14:05:25 +00:00
Armyke
25fa392887 Added an additional optional --tmp_dir parameter to specify the temporary directory in which tesstrain.py creates the temporary training files. The main reason is slow R/W on HDDs; anyone who wants to speed up this process can use a directory on an SSD as tmp_dir 2019-03-04 13:26:53 +00:00
Stefan Weil
295996ed05 commandlineflags: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:21:04 +01:00
Stefan Weil
fb0f1bcf66 BoxChar: Fix compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 14:04:54 +01:00
Stefan Weil
0e1a1fc3cf Validator: Fix compiler warnings (signed/unsigned)
This also fixes a regression in validate_grapheme_test introduced
by commit 32e9d7c8f5.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-02 13:05:03 +01:00
zdenop
2ba8e0061a
Merge branch 'master' into mya 2019-03-01 18:37:24 +01:00
zdenop
646b043d2c
use space instead of tab 2019-03-01 14:36:09 +01:00
Shree
5ee1deaea2 correct handling of 0BF0-0BFA Tamil numbers and symbols 2019-03-01 13:21:49 +00:00
zdenop
d7ddc4c5b7
Merge pull request #2270 from Shreeshrii/U_ARABIC_NUMBER
Treat U_ARABIC_NUMBER as LTR
2019-02-28 09:27:54 +01:00
Shree
25b02bf1f2 Treat U_ARABIC_NUMBER as LTR 2019-02-26 09:51:21 +00:00
Shreeshrii
2f71fe280c
Use alternative way to comment a block of code (using the c preprocessor).
https://github.com/tesseract-ocr/tesseract/pull/2268#pullrequestreview-207605382
Thanks @amitdo
2019-02-26 15:05:51 +05:30
Shree
449f1cd4ba Remove test for Word started with a combiner 2019-02-25 18:47:42 +00:00
zdenop
25c43b1e7c
Merge branch 'master' into distort 2019-02-23 18:23:14 +01:00
Stefan Weil
b3e355a682 Remove whitespace at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-23 17:49:56 +01:00
Shreeshrii
34e4d6b1d7
Revert to 0 (50% of images inverted). 2019-02-23 17:59:00 +05:30
Shreeshrii
287d5341bf
TODO 2019-02-23 17:56:02 +05:30
Shreeshrii
3e3e1ed55d
Remove commented Code 2019-02-23 17:54:00 +05:30
Shree
2aded47a3c Implement distort_image in text2image - default false 2019-02-22 12:27:27 +00:00
Shree
49ed3a72d4 implement PrepareDistortedPix as part of DegradeImage 2019-02-21 14:48:29 +00:00
Stefan Weil
b3bd23edb7 Remove whitespace at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-19 13:53:31 +01:00
Stefan Weil
b95598a0b1
Merge pull request #2070 from pndaza/master
add missed letters ( ၌ ၍ ၎ ၏ )  and symbols ( ၊ ။ ) - 0x104a to 0x104f -
2019-02-19 12:22:53 +01:00
Shree
a044f64375 fix Myanmar validation rules as per Unicode charts 2019-02-15 04:40:55 +00:00
Shreeshrii
c28a68115e
Merge branch 'master' into boxtiff 2019-02-02 23:42:39 +05:30
Shree Devi Kumar
d9590f8adf allow user specified box/tiff pairs with tesstrain.sh 2019-02-02 11:35:45 +00:00
Shree Devi Kumar
323361b902 allow user specified box/tiff pairs with tesstrain.sh 2019-02-02 11:33:32 +00:00
Shree Devi Kumar
ad223296af use --xsize instead of --x_size
(cherry picked from commit 94b8988b8cca3812137933db00750bd6e2e84e32)
2019-02-02 11:08:34 +00:00
Shree Devi Kumar
4d9bc11fd3 add --xsize as parameter for tesstrain 2019-01-27 07:00:25 +00:00
zdenop
059c50be8c
Merge pull request #2184 from stweil/tests
Fix and enable stringrenderer_test
2019-01-24 07:59:07 +01:00
Diego de la Hera
1a398a5b5d removed reference to unbound variable 2019-01-23 15:04:16 -03:00
Stefan Weil
ecf73f5bc7 training: Don't terminate after processing 8 fonts or 8 images
tesstrain_utils.sh sets the shell flag -e, so it exits immediately
if a command exits with a non-zero status.

The following command returns a non-zero status as soon as counter is a
multiple of par_factor (par_factor=8, that means as soon as 8 fonts or
images are processed):

    let rem=counter%par_factor

The new code fixes this undesired exit.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 17:26:40 +01:00
Stefan Weil
32e9d7c8f5 training: Fix some compiler warnings (signed/unsigned)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 13:55:13 +01:00
Stefan Weil
e4b862d588 pango_font_info: Fix runtime error messages from Pango
pango_coverage_get and pango_coverage_unref should not be called
with coverage == nullptr.

pango_font_get_coverage should not be called with font == nullptr.

Otherwise Pango prints runtime error messages:

    (process:12657): Pango-CRITICAL **: pango_coverage_get: assertion 'coverage != NULL' failed
    (process:12657): Pango-CRITICAL **: pango_coverage_unref: assertion 'coverage != NULL' failed
    (process:12657): Pango-CRITICAL **: pango_font_get_coverage: assertion 'font != NULL' failed
    (process:12657): GLib-GObject-CRITICAL **: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

Typically those errors occur if a required font is not installed,
so this can be a quite common error.

Fix also a potential resource leak in PangoFontInfo::CoversUTF8Text.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 13:55:13 +01:00
Shree Devi Kumar
77d0b6ce8e fix WORDLIST filename 2019-01-22 15:49:55 +01:00
Nick White
b8de06430d Ensure baseapi.h header is used by commontraining.h regardless of autotools usage 2019-01-04 20:20:00 +00:00
Stefan Weil
91af010200 Fix compiler warning
gcc warning:

    src/training/text2image.cpp:694:35: warning:
        ISO C++ forbids converting a string constant to ‘char*’
        [-Wwrite-strings]

putenv expects a string which can be modified.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-01 22:49:04 +01:00
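Because `putenv` takes a `char*` and keeps the string as part of the environment, a string literal is the wrong argument. A minimal sketch of the usual fix follows, assuming a POSIX system; the variable name `EXAMPLE_BACKEND` and both functions are made up for this illustration and are not the code in text2image.cpp.

```cpp
#include <cassert>
#include <stdlib.h>  // putenv and getenv (putenv is POSIX)
#include <string>

// Hypothetical example; EXAMPLE_BACKEND is a made-up variable name.
bool SetExampleVariable() {
  // The array is modifiable and has static storage duration, so it stays
  // valid for as long as the environment refers to it.
  static char assignment[] = "EXAMPLE_BACKEND=fc";
  const bool ok = putenv(assignment) == 0;
  assert(!ok || getenv("EXAMPLE_BACKEND") != nullptr);
  return ok;
}

// Reads the variable back; returns an empty string if it is not set.
std::string GetExampleVariable() {
  const char* value = getenv("EXAMPLE_BACKEND");
  return value != nullptr ? value : "";
}
```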
Stefan Weil
7ebd3153ae Fix several typos (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-10 18:59:58 +01:00