587 Commits

Author SHA1 Message Date
Stefan Weil
f35eeb3b4a protos: Remove several unused macros, functions and global variables
The unused global variable TrainingData used a lot of runtime memory.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-13 21:32:56 +01:00
Stefan Weil
fbbbdb4565 Use std::stringstream to generate ALTO output and add <SP> element
Using std::stringstream simplifies the code.
The <SP> element is needed between two <String> elements.
Also remove some unneeded spaces in the ALTO output.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-12 22:29:35 +01:00
Stefan Weil
7ebd3153ae Fix several typos (most of them found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-10 18:59:58 +01:00
Stefan Weil
81ab302d52 FPRow: Remove three unused methods
This fixes warnings from the Intel compiler:

    src/textord/cjkpitch.cpp(319): warning #177:
      function "<unnamed>::FPRow::good_gaps" was declared but never referenced
    src/textord/cjkpitch.cpp(383): warning #177:
      function "<unnamed>::FPRow::is_bad" was declared but never referenced
    src/textord/cjkpitch.cpp(387): warning #177:
      function "<unnamed>::FPRow::is_unknown" was declared but never referenced

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-08 16:43:52 +01:00
Stefan Weil
404f9cd147 SimpleStats: Remove unused method
This fixes a warning from the Intel compiler:

    src/textord/cjkpitch.cpp(79): warning #177:
      function "<unnamed>::SimpleStats::maximum" was declared
      but never referenced

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-08 16:39:46 +01:00
Stefan Weil
a9121d28f3
Merge pull request #2107 from stweil/march
Add check whether compiler supports -march=native flag
2018-12-08 10:53:09 +01:00
Stefan Weil
2c044df959 Fix wrong x_fsize in hOCR output (regression)
The regression was caused by the latest commit
c9e85ab78f.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-08 10:39:31 +01:00
Stefan Weil
2ccc5810f3 Add check whether compiler supports -march=native flag
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-05 20:13:28 +01:00
Stefan Weil
c9e85ab78f Fix wrong font attributes in hOCR output
Instrumented code throws this runtime error during OCR:

    ../../src/api/baseapi.cpp:1616:5: runtime error: load of value 128,
      which is not a valid value for type 'bool'
    ../../src/api/baseapi.cpp:1627:5: runtime error: load of value 128,
      which is not a valid value for type 'bool'

If there is no font information (typical for Tesseract with an LSTM model),
the font attributes got random values resulting in wrong hOCR output.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-04 10:52:46 +01:00
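
The failure described in c9e85ab78f above (a `bool` that holds the value 128) is typical of reading a variable that was never written. A minimal sketch of the usual remedy follows, assuming default member initializers; `FontAttributes` and `LookupFont` are hypothetical names for illustration and are not taken from the Tesseract sources.

```cpp
// Hypothetical struct, not a Tesseract type: every member has a default
// value, so reading it is well defined even if nothing was looked up.
struct FontAttributes {
  bool bold = false;
  bool italic = false;
  int pointsize = 0;
};

// Hypothetical lookup: fills in the attributes only when font information
// exists (the commit message notes that LSTM models typically have none).
FontAttributes LookupFont(bool has_font_info) {
  FontAttributes attrs;  // members start at their declared defaults
  if (has_font_info) {
    attrs.bold = true;
    attrs.pointsize = 12;
  }
  return attrs;
}
```

Without the initializers, the `has_font_info == false` path would return indeterminate values, which is the kind of load the sanitizer reports.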
Stefan Weil
0bdae8f8bf GENERIC_2D_ARRAY: Fix runtime error in assignment operator
Instrumented code throws this runtime error during OCR:

    ../../src/ccstruct/matrix.h:84:11: runtime error:
      null pointer passed as argument 2, which is declared to never be null

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-04 10:48:46 +01:00
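
The sanitizer message in 0bdae8f8bf above refers to argument 2 of a copy routine such as `memcpy`, that is, the source pointer. One common way to avoid it is to skip the call when there is nothing to copy. The sketch below assumes that approach; `CopyInts` is a hypothetical helper, and the actual change in `matrix.h` may look different.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies `count` ints from src to dest. The memcpy call
// is skipped for empty or null buffers, because the sanitizer rejects a null
// pointer argument even when the size is zero.
void CopyInts(int* dest, const int* src, std::size_t count) {
  if (count > 0 && dest != nullptr && src != nullptr) {
    std::memcpy(dest, src, count * sizeof(int));
  }
}
```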
Stefan Weil
f0a4d04187 Add config variable for selection of dot product function
Add also a C++ implementation with more aggressive compiler options
which is optimized for the CPU where the software was built.

It is now possible to select the function used for the dot product
with -c dotproduct=FUNCTION where FUNCTION can be one of those values:

* auto      selection based on detected hardware (default)
* generic   C++ code with default compiler options
* native    C++ code optimized for build host
* avx       optimized code for AVX
* sse       optimized code for SSE

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-01 00:19:28 +01:00
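
Selecting an implementation by name, as `-c dotproduct=FUNCTION` does in f0a4d04187 above, can be sketched as a lookup that returns a function pointer. Everything in this block is an assumption made for illustration: `DotProductFunction`, `DotProductGeneric` and `SelectDotProduct` are hypothetical names, only the `generic` variant is implemented, and `auto` simply falls back to it because the sketch has no hardware detection.

```cpp
#include <cstddef>
#include <string>

// Hypothetical signature shared by all dot product variants.
using DotProductFunction = double (*)(const double*, const double*,
                                      std::size_t);

// Plain C++ dot product with no CPU-specific code.
double DotProductGeneric(const double* u, const double* v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    total += u[i] * v[i];
  }
  return total;
}

// Maps a configuration value to an implementation. Returns nullptr for names
// this sketch does not know ("native", "avx" and "sse" are not implemented).
DotProductFunction SelectDotProduct(const std::string& name) {
  if (name == "generic" || name == "auto") {
    return DotProductGeneric;
  }
  return nullptr;
}
```

Returning `nullptr` for unknown names leaves it to the caller to decide whether to report an error or keep the previous selection.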
zdenop
b527b37825
Merge pull request #2097 from stweil/namespace
SIMDDetect: Use tesseract namespace and format code
2018-12-01 00:02:18 +01:00
Stefan Weil
1910b1a72b SIMDDetect: Use tesseract namespace and format code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:36:39 +01:00
Stefan Weil
66d3275d0b IntSimdMatrixSSE: Remove unused include statement and simplify code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
048eb34934 Add missing static attribute to local inline functions
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
b73370aac9 Remove unneeded test for nullptr
IntSimdMatrix::GetFastestMultiplier never returns a nullptr.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
e2419b1968 Fix potential crash in tprintf
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
6b6d9de497 Fix potential crash in STRING class
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:14:11 +01:00
Stefan Weil
59fb3370bb Use -ffast-math for calculation of dot product
This reduces the code size for intsimdmatrixavx2 from 2700 to 2668
and slightly improves the performance for fast models with AVX2.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 22:52:04 +01:00
Stefan Weil
fda3ba9009 IntSimdMatrixSSE: Fix comment
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 22:13:32 +01:00
zdenop
07b140364f
Merge pull request #2093 from stweil/python
Updates for Python scripts
2018-11-30 08:10:20 +01:00
zdenop
53600c677e
Merge pull request #2092 from stweil/format
Format new ALTO code with clang-format
2018-11-30 08:08:52 +01:00
zdenop
f6493dd5e8
Merge pull request #2090 from stweil/inline
Optimize performance by using inline functions
2018-11-30 08:07:45 +01:00
Stefan Weil
c59c45fb3e Fix Amharic font list
This was reported for the Python code by LGTM.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 08:00:22 +01:00
Stefan Weil
b148644c1b Make Python script executable
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 07:08:45 +01:00
Stefan Weil
ed48b2a8f5 Format new ALTO code with clang-format
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 06:37:25 +01:00
Jake Sebright
d7cee03a94 Add support for ALTO output 2018-11-30 06:09:36 +01:00
Stefan Weil
3c047f0ac8 Optimize performance by using inline function DotProduct
This improves performance for the "best" models because it
avoids function calls.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-29 21:43:41 +01:00
Stefan Weil
e161501df6 Optimize performance by using inline MatrixDotVectorInternal
This improves performance for the "best" models because it
avoids function calls.

The compiler also knows the passed values for the parameters
add_bias_fwd and skip_bias_back.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-29 21:37:32 +01:00
Egor Pugin
685b136d89
Fix incorrect condition. 2018-11-29 19:02:54 +03:00
Egor Pugin
267b79982d
Merge pull request #2076 from jbarlow83/pythonize-training
RFC: Pythonize tesstrain.sh and friends
2018-11-25 13:31:48 +03:00
James R. Barlow
8aa25239ae Fix some of Codacy's complaints 2018-11-24 16:59:01 -08:00
James R. Barlow
9122e6249e Autoreformat code
This increases the deviation from the bash scripts so is done separately.
2018-11-24 00:50:29 -08:00
James R. Barlow
d9ae7ecc49 Pythonize tesstrain.sh -> tesstrain.py
This is a lightweight, semi-Pythonic conversion of tesstrain.sh that currently
supports only LSTM and not the Tesseract 3 training mode.

I attempted to keep source changes minimal so it would be easy to compare
bash to Python in code review and confirm equivalence.

Python 3.6+ is required.  Ubuntu 18.04 ships Python 3.6 and it is a mandatory
package (the package manager is also written in Python), so it is available
in the baseline Tesseract 4.0 system.

There are minor output and behavioral changes, and advantages.  Python's logging
is used.  Temporary files are only deleted on success, so they can be inspected
if training fails.  Console output is more terse and the log file is more
verbose.  And there are progress bars!  (The python3-tqdm package is required.)
Where tesstrain.sh would sometimes fail without explanation and return an error
code of 1, it is much easier to find the point of failure in this version.
That was also the main motivation for this work.

Argument checking is also more comprehensive.
2018-11-24 00:45:35 -08:00
Stefan Weil
9b783822a0 Remove unused include statements for tprintf.h
Format also a call of tprintf and add a missing explicit include statement.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-18 17:25:01 +01:00
Stefan Weil
a93426c9ff Fix wrong results from function streamtofloat
The local variable k should be 10 ^ (number of digits after comma),
but will overflow when there are more than 9 digits after the comma
because an int value cannot store 10000000000.

This results in wrong double values read from .tr files for example
(or in a runtime exception if Tesseract was compiled with -ftrapv).

Using uint64_t does not fix the general problem but allows more digits
which should be sufficient for the data read by Tesseract.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-17 20:02:21 +01:00
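
The overflow described in a93426c9ff above can be reproduced with a small parser for the digits after the decimal separator. This is a sketch under stated assumptions, not the `streamtofloat` code: `FractionFromDigits` is a hypothetical helper that keeps the divisor `k` in a `uint64_t`, as the commit message suggests.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: turns the digits after the decimal separator into a
// fraction, e.g. "25" -> 0.25. k is 10 ^ (number of digits read).
// A 32-bit int k cannot hold 10000000000 (ten digits); uint64_t can hold
// powers of ten up to 10^19, so the limit moves but does not disappear.
double FractionFromDigits(const std::string& digits) {
  uint64_t value = 0;  // digits read so far, as an integer
  uint64_t k = 1;      // matching power of ten
  for (char c : digits) {
    if (c < '0' || c > '9') {
      break;  // stop at the first non-digit
    }
    value = value * 10 + static_cast<uint64_t>(c - '0');
    k *= 10;
  }
  return static_cast<double>(value) / static_cast<double>(k);
}
```

With ten digits such as `"1234567891"`, `k` reaches 10000000000, which is exactly the value the commit message says an `int` cannot store.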
Stefan Weil
acca4fb999 Fix some unbound variables and other small issues in training shell scripts
Fix also the logging helper functions to work without log file.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-16 11:13:46 +01:00
Stefan Weil
a4b03fbb27 Fix warning from shellcheck
shellcheck warning:

    In /tesseract/src/training/tesstrain_utils.sh line 209:
        TIMESTAMP=`date +%Y-%m-%d`
                  ^-- SC2006: Use $(..) instead of legacy `..`.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-15 17:45:20 +01:00
John Lin
bfe58aa56f Fix unbound variable $FONTS 2018-11-15 17:43:15 +01:00
Stefan Weil
0915cbd535 Simplify shell script using mktemp
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-15 13:36:52 +01:00
John Lin
edb76e281a Simplify MKTEMP_DT logic 2018-11-15 10:38:40 +08:00
John Lin
dbfc89f9af Fix mktemp in tesstrain_utils.sh
The commit 10f2c45c00 unified the usage of mktemp, but with
an incorrect bash syntax and unnecessary definition of LANG_CODE
and TIMESTAMP. This patch fixes the above problems.
2018-11-14 09:04:34 +08:00
Ray Smith
ce88adbf32 fix issue #1192 2018-11-12 12:53:12 +01:00
zdenop
724957167e fix typo in non VS build 2018-11-08 23:10:14 +01:00
zdenop
eb104f9fe4 VS build: fix warning C4996: The POSIX name for this item is deprecated. Instead, use the ISO C and C++ conformant name. 2018-11-08 22:55:04 +01:00
zdenop
cbef2ebe12 implement patches vcpkg tesseract 2018-11-08 21:37:47 +01:00
zdenop
7a7f226228 ocrclass: Remove unused macros
Signed-off-by: Stefan Weil <sw@weilnetz.de>

# Conflicts:
#	src/ccutil/ocrclass.h
2018-11-08 20:23:36 +01:00
Zdenko Podobný
2dd753ee4c replace VS implementation of gettimeofday with std::chrono::steady_clock::now(); fixes #2038 2018-11-08 19:43:46 +01:00
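
Commit 2dd753ee4c above names `std::chrono::steady_clock::now()` as the replacement for a platform-specific `gettimeofday`. A minimal sketch of measuring elapsed time with that clock follows; `ElapsedMs` is a hypothetical helper, not a function from the Tesseract sources.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: whole milliseconds elapsed since `start`.
// steady_clock is monotonic, so the result never goes negative when the
// system time is adjusted.
int64_t ElapsedMs(std::chrono::steady_clock::time_point start) {
  const auto now = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(now - start)
      .count();
}
```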
chrismamo1
439dfaaf8b un-fix one of the warnings 2018-10-30 18:10:48 -06:00
chrismamo1
30be5aaaac fix a couple minor compiler warnings 2018-10-30 18:00:32 -06:00
Stefan Weil
6f8bd340d9 Remove chopper.h
It is no longer needed after some reordering of code in chopper.cpp.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:51:44 +01:00
Stefan Weil
286dfb031a Remove unused include statements
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:46:58 +01:00
Stefan Weil
2098bb6daf Remove unused function ComputeOrientation
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:43:56 +01:00
Stefan Weil
cad6ebb5ff LIST: Remove old comments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-29 19:43:56 +01:00
zdenop
99054f10c7
Merge pull request #2027 from stweil/warn
Fix compiler warning
2018-10-24 07:31:15 +02:00
Stefan Weil
eefb8348f7 Fix compiler warning
Compiler warning on macOS:

    tesscallback.h:29:7: warning:
      'TessClosure' has no out-of-line virtual method definitions;
      its vtable will be emitted in every translation unit [-Wweak-vtables]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-23 17:01:53 +02:00
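
The `-Wweak-vtables` warning quoted in eefb8348f7 above goes away once a class has at least one virtual method defined outside the class body. The sketch below assumes that remedy; `Closure` and `ConstantClosure` are hypothetical stand-ins, and the commit may have solved it differently.

```cpp
// Hypothetical base class standing in for the one named in the warning.
class Closure {
 public:
  virtual ~Closure();  // declared here, defined out of line below
  virtual int Run() = 0;
};

// In a real project this definition lives in exactly one .cpp file, which
// gives the compiler a single place to emit the vtable.
Closure::~Closure() = default;

// Hypothetical subclass; Run() is also defined out of line for the same
// reason.
class ConstantClosure : public Closure {
 public:
  explicit ConstantClosure(int value) : value_(value) {}
  int Run() override;

 private:
  int value_;
};

int ConstantClosure::Run() { return value_; }
```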
Noah Metzger
f7f5f41073 Fixed a mac compiler warning in recodebeam.cpp
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-10-23 16:57:39 +02:00
zdenop
e60318f9c0 set PANGOCAIRO_BACKEND=fc to avoid crash; fixes #736 2018-10-23 13:22:38 +02:00
Zdenko Podobný
3d508a65a7 set unlv_tilde_crunching to false; fixes #1449 #948 2018-10-23 09:26:32 +02:00
Stefan Weil
7ebbb7370a ColPartition: Fix CID 1164543 (Division or modulo by float zero)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 22:14:15 +02:00
Stefan Weil
eaabe4a3ce ErrorCounter: Fix CID 1164538 (Division or modulo by float zero)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 22:14:15 +02:00
Stefan Weil
8f615d44f1 osdetect: Fix CID 1164539 (Division or modulo by float zero)
Avoid also a conversion from int16_t to double to float.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 22:14:15 +02:00
Stefan Weil
be0cf03778 tesseractmain: Fix memory leak
Commit 49d7df6dc3 introduced a memory leak
when the output file could not be created.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 18:50:47 +02:00
Stefan Weil
9c0799314e Add parentheses in boolean expression
This fixes a compiler warning:

    scanutils.cpp:444:32: warning:
        '&&' within '||' [-Wlogical-op-parentheses]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
Stefan Weil
0f973e1d62 Add missing 'static' keyword
This fixes a compiler warning:

    globaloc.cpp:33:6: warning: no previous extern declaration for
      non-static variable 'global_crash_pixes'
      [-Wmissing-variable-declarations]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
Stefan Weil
a71ad455be Remove unused macros
This fixes some compiler warnings:

    mainblk.cpp:28:9: warning: macro is not used [-Wunused-macros]
    mainblk.cpp:29:9: warning: macro is not used [-Wunused-macros]
    [...]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 17:48:17 +02:00
zdenop
dba7f456d5
Merge pull request #2018 from stweil/sort
Get sorted list of available languages
2018-10-22 16:06:42 +02:00
Matthias Geerdsen
eac2880c24 avoid unbound variable TESSDATA_PREFIX
set TESSDATA_PREFIX as empty, if not defined in environment to avoid an
unbound variable
2018-10-22 14:28:14 +02:00
Stefan Weil
d75ef80f12 Get sorted list of available languages
TessBaseAPI::GetAvailableLanguagesAsVector returned the list of languages
without sorting, so the result was random and not user friendly.

Now `tesseract --list-langs` shows the available languages and scripts
in alphabetic order.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 14:07:03 +02:00
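
The change in d75ef80f12 above amounts to sorting a list of names before it is shown. A minimal sketch, assuming `std::sort` on a vector of strings; `SortedLanguages` is a hypothetical helper and not the API function named in the commit message.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: takes language names in arbitrary order (for example
// the order a directory scan returned them) and returns them sorted.
// std::sort compares std::string values byte by byte, so uppercase ASCII
// letters sort before lowercase ones.
std::vector<std::string> SortedLanguages(std::vector<std::string> langs) {
  std::sort(langs.begin(), langs.end());
  return langs;
}
```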
Matthias Geerdsen
95d9c8c57a set default values for unset variables
setting default values for possibly unset variables avoids unbound
variable errors
2018-10-21 21:30:52 +02:00
Matthias Geerdsen
7b32e64564 add shebang 2018-10-21 21:30:13 +02:00
zdenop
32c1e4f433 FLAGS_webtext_prefix: unbound variable; issue #2005 2018-10-21 14:00:06 +02:00
Stefan Weil
34a89e54db Fix function ScrollViewCommand
The format string which builds the command only takes one or two
string arguments, so the function allocated too much memory and
passed too many arguments to snprintf.

This also fixes a compiler warning (clang).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-21 08:13:16 +02:00
zdenop
4d3b0bc798 use <cstdio> instead of <stdio.h> 2018-10-20 21:46:40 +02:00
zdenop
8103d17c72 use _strdup instead of strdup in MSVC 2018-10-20 21:43:38 +02:00
zdenop
a033261f63 add info about used backend in text2image 2018-10-20 21:41:09 +02:00
Stefan Weil
e232114089 Fix use of undefined macro USE_DEVICE_SELECTION
This fixes compiler warnings.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 13:58:12 +02:00
Zdenko Podobný
486940687c Exit training script if run command failed; fixes #2005 2018-10-20 13:00:39 +02:00
Egor Pugin
5a4288f2fc
Merge pull request #2011 from stweil/fix
Small fix and optimization
2018-10-20 13:48:51 +03:00
Zdenko Podobný
1a523006a6 install training script with autotools. 2018-10-20 12:33:07 +02:00
Stefan Weil
b0ace0e850 ScrollView: Optimize local table_colors
It is constant, and the values are in the range 0...255,
so its size can be reduced.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 12:05:38 +02:00
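
The size reduction described in b0ace0e850 above follows from the value range: numbers from 0 to 255 fit in one byte each. A sketch with a constant table of `uint8_t` entries; the name `kTableColors` and the three rows are hypothetical and are not the real color table.

```cpp
#include <cstdint>

// Hypothetical constant table of RGBA rows. Each component is in 0...255,
// so uint8_t is enough; with int the same table would typically need four
// times as much space.
static const uint8_t kTableColors[][4] = {
    {0, 0, 0, 255},        // black
    {255, 255, 255, 255},  // white
    {255, 0, 0, 255},      // red
};

static_assert(sizeof(kTableColors) == 3 * 4, "one byte per component");
```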
Stefan Weil
d364750cb3 Remove type cast and fix compiler warning (-Wcast-qual)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 12:04:46 +02:00
Zdenko Podobný
1b2bda65e0 Revert "prefer to use FreeType for pango_cairo_font_map"
This reverts commit 345e5ee1f3.
2018-10-20 11:30:07 +02:00
Zdenko Podobný
276c6845ae Revert "free PangoFontMap; fixes #1999"
This reverts commit d1d73b9888.
2018-10-20 11:28:20 +02:00
Zdenko Podobný
a03f23e05e Merge branch 'master' of https://github.com/tesseract-ocr/tesseract 2018-10-20 11:26:23 +02:00
Marco Atzeri
ebbd4e3efc fixes #426; define NOUNDEFINED for cygwin 2018-10-20 11:25:28 +02:00
Stefan Weil
b40151c200 training: Don't hide global variables
This fixes two warnings from LGTM:

    Parameter feature_defs hides a global variable with the same name.
    Parameter Config hides a global variable with the same name.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 22:37:37 +02:00
Stefan Weil
bb181ec8d3 Rename API function from GetBestLSTMChoices to GetBestLSTMSymbolChoices
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:50:38 +02:00
Stefan Weil
df7d1e1f97 Rename API function for getting LSTM choices
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:50:38 +02:00
Stefan Weil
830b9c715a BLOBNBOX: Declare signed bit field
This fixes a warning from LGTM:

    Bit field area of type int should have explicitly unsigned integral,
    explicitly signed integral, or enumeration type.

Maybe area should be unsigned, but that would require lots of other
changes, so for now signedness is not changed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:30:05 +02:00
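
The LGTM warning in 830b9c715a above asks for an explicit sign on bit fields. A minimal sketch of such a declaration follows; `BlobFlags` and its field widths are hypothetical and are not the layout of BLOBNBOX.

```cpp
// Hypothetical struct: the signedness of each bit field is spelled out, so
// it does not depend on how a compiler treats a plain `int` bit field.
struct BlobFlags {
  signed int area : 30;   // explicitly signed, can hold negative values
  unsigned int flag : 1;  // explicitly unsigned, holds 0 or 1
};
```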
Stefan Weil
d9c472b988 cluster: Fix some potential overflows
This fixes several issues reported by LGTM:

    Multiplication result may overflow 'int'
    before it is converted to 'size_type'.

    Multiplication result may overflow 'float'
    before it is converted to 'double'.

    Multiplication result may overflow 'int'
    before it is converted to 'unsigned long'.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:23:17 +02:00
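
The warnings listed in d9c472b988 above all describe a product that is computed in a narrow type and widened only afterwards. The usual fix is to widen one operand before multiplying. A sketch assuming non-negative inputs; `ElementCount` is a hypothetical helper, not code from the cluster sources.

```cpp
#include <cstdint>

// Hypothetical helper: number of cells in a rows x cols table.
// Writing `rows * cols` would multiply in int and could overflow before the
// result is converted. Casting one operand first makes the whole
// multiplication happen in 64 bits. Both arguments are assumed to be >= 0.
uint64_t ElementCount(int rows, int cols) {
  return static_cast<uint64_t>(rows) * static_cast<uint64_t>(cols);
}
```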
Zdenko Podobný
d1d73b9888 free PangoFontMap; fixes #1999 2018-10-19 00:48:20 +02:00
zdenop
bbe7a4cc10
Merge pull request #2002 from stweil/err
Show error message when output file could not be created
2018-10-18 19:27:01 +02:00
Stefan Weil
49d7df6dc3 tesseractmain: Show error message when output file could not be created
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 19:22:49 +02:00
Stefan Weil
b0b8dfbc81 TessResultRenderer: Extend API to access status of renderer
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 19:22:48 +02:00
Stefan Weil
f0c9b753c6 BlamerBundle: Add declaration for copy assignment operator
It does not need an implementation as it is currently not used.

This fixes a warning from LGTM:

    No matching copy assignment operator in class BlamerBundle.
    It is good practice to match a copy constructor
    with a copy assignment operator.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:36:32 +02:00
Stefan Weil
e3658bbc78 C_OUTLINE_FRAG: Add declaration for copy constructor
It does not need an implementation as it is currently not used.

This fixes a warning from LGTM:

    No matching copy constructor in class C_OUTLINE_FRAG.
    It is good practice to match a copy assignment operator
    with a copy constructor.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:31:45 +02:00
Stefan Weil
5585ed8d85 ROW: Add declaration for copy constructor
It does not need an implementation as it is currently not used.

This fixes a warning from LGTM:

    No matching copy constructor in class ROW.
    It is good practice to match a copy assignment operator
    with a copy constructor.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:31:10 +02:00
Stefan Weil
a1f0c66be1 BLOB_CHOICE: Add copy assignment operator
This fixes a warning from LGTM:

    No matching copy assignment operator in class BLOB_CHOICE.
    It is good practice to match a copy constructor
    with a copy assignment operator.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:29:07 +02:00
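
The LGTM rule quoted in a1f0c66be1 above is that a class with a user-written copy constructor should also have a matching copy assignment operator. A sketch with a hypothetical `Buffer` class that owns an array; it shows the pairing in general and is not the BLOB_CHOICE code.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical class that owns a heap array, so copying needs custom code.
class Buffer {
 public:
  explicit Buffer(std::size_t n) : size_(n), data_(new int[n]()) {}

  // Copy constructor: allocates its own array and copies the elements.
  Buffer(const Buffer& other)
      : size_(other.size_), data_(new int[other.size_]) {
    std::copy(other.data_, other.data_ + other.size_, data_);
  }

  // Matching copy assignment operator: same deep copy, plus release of the
  // old array and a guard against self-assignment.
  Buffer& operator=(const Buffer& other) {
    if (this != &other) {
      int* fresh = new int[other.size_];
      std::copy(other.data_, other.data_ + other.size_, fresh);
      delete[] data_;
      data_ = fresh;
      size_ = other.size_;
    }
    return *this;
  }

  ~Buffer() { delete[] data_; }

  int& at(std::size_t i) { return data_[i]; }
  std::size_t size() const { return size_; }

 private:
  std::size_t size_;
  int* data_;
};
```

Without the assignment operator, the compiler-generated one would copy only the pointer, and both objects would later delete the same array.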
Stefan Weil
7100a14636 ParamsTrainingHypothesis: Add copy assignment operator
This fixes a warning from LGTM:

    No matching copy assignment operator in class ParamsTrainingHypothesis.
    It is good practice to match a copy constructor
    with a copy assignment operator.

Use also a simpler expression for the size of features.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 15:28:12 +02:00