Commit Graph

3577 Commits

Author SHA1 Message Date
Stefan Weil
b26866bb3b intproto: Use more efficient float calculations for floor
This fixes warnings from LGTM:

Multiplication result may overflow 'float' before it is converted
to 'double'.

While the floor function always calculates with double, here the
overloaded std::floor can be used to handle the float arguments
more efficiently.

Also replace old-style type casts with static_cast.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 18:29:38 +02:00
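The effect described in this commit message can be illustrated with a minimal sketch. The helper name `BucketFor` and its parameters are hypothetical and are not taken from intproto; the only point shown is that `std::floor` has a `float` overload, so a `float` expression is not widened to `double`.

```cpp
#include <cmath>
#include <type_traits>

// std::floor called with a float argument returns float: the float
// overload is selected and no widening to double takes place.
static_assert(std::is_same<decltype(std::floor(1.0f)), float>::value,
              "std::floor(float) should return float");

// Hypothetical helper; the name and parameters are illustrative only.
// The product is computed in float and passed to the float overload,
// and static_cast replaces an old-style (int) cast.
inline int BucketFor(float param, float offset, int num_buckets) {
  return static_cast<int>(std::floor((param + offset) * num_buckets));
}
```

Whether this is actually faster depends on the compiler and target; the sketch only shows which overload is chosen.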
Stefan Weil
06a8de0b8b genericvector: Rewrite code to satisfy static code analyzer
Warning from LGTM:

Resource data_ is acquired by class GenericVector<FontSpacingInfo *>
but not released in the destructor.

LGTM complains about data_ not being deleted in the destructor.
The destructor calls the clear() method, but the delete there
was conditional, which confuses the static code analyzer.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 18:24:13 +02:00
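One way to make such a release visible to a static analyzer is to delete unconditionally. The class below is a hypothetical, much simplified stand-in and not the real GenericVector; it assumes the buffer was allocated with `new[]`, and it relies on the fact that `delete[]` on a null pointer is a no-op.

```cpp
#include <cstddef>

// Hypothetical, simplified stand-in; not the real GenericVector.
template <typename T>
class TinyVector {
 public:
  TinyVector() = default;
  explicit TinyVector(std::size_t n) : data_(new T[n]()), size_(n) {}
  TinyVector(const TinyVector&) = delete;
  TinyVector& operator=(const TinyVector&) = delete;
  ~TinyVector() { clear(); }

  // No "if (data_ != nullptr)" guard: delete[] accepts a null pointer,
  // so the release path is unconditional and easy to follow.
  void clear() {
    delete[] data_;
    data_ = nullptr;
    size_ = 0;
  }

  std::size_t size() const { return size_; }

 private:
  T* data_ = nullptr;
  std::size_t size_ = 0;
};
```

Calling `clear()` twice is safe here because the pointer is reset to `nullptr` after the first release.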
Stefan Weil
c2a8aa00b8 Fix constructor for class Dict (uninitialized member variables)
wildcard_unichar_id_, apostrophe_unichar_id_, question_unichar_id_ and
slash_unichar_id_ were not initialized in the constructor.

slash_unichar_id_ was used later in a conditional.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 17:52:52 +02:00
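The kind of problem fixed here can be sketched with a hypothetical miniature class. `MiniDict`, its member names and the sentinel `kInvalidId` are assumptions made for illustration; the real Dict class and its invalid-id constant may look different. Default member initializers guarantee that every constructor starts from a defined value.

```cpp
// Assumed sentinel for "no id assigned"; hypothetical for this sketch.
constexpr int kInvalidId = -1;

// Hypothetical miniature of a class that caches several character ids.
class MiniDict {
 public:
  MiniDict() = default;

  void set_slash_id(int id) { slash_id_ = id; }

  // Reading slash_id_ in a conditional is only well defined because
  // the member has an initializer below.
  bool has_slash() const { return slash_id_ != kInvalidId; }

 private:
  int wildcard_id_ = kInvalidId;
  int apostrophe_id_ = kInvalidId;
  int question_id_ = kInvalidId;
  int slash_id_ = kInvalidId;
};
```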
zdenop
9efedc15b2
Merge pull request #1954 from stweil/unicharset
Fix use of wrong UNICHARSET
2018-10-06 15:04:31 +02:00
zdenop
76cd80e1d7
Merge pull request #1953 from stweil/fix
lstmtraining: Remove dead code for purified model name
2018-10-06 15:02:39 +02:00
Stefan Weil
8dc9e9fd14 Fix use of wrong UNICHARSET
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 13:21:09 +02:00
Stefan Weil
0e71e5a754 lstmtraining: Remove dead code for purified model name
The purified model name `model_output` was unused,
so remove the comment and the unused code.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 09:34:17 +02:00
Egor Pugin
0e43ae5cf4
Merge pull request #1951 from stweil/checkdir
combine_tessdata, lstmtraining: Check for write failures
2018-10-05 23:38:01 +03:00
Stefan Weil
f4e982e041 combine_tessdata: Handle failures when extracting
Report an error and terminate if that fails.

Also use EXIT_SUCCESS and EXIT_FAILURE for the return values of main()
and add the missing return at the end of main().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-05 21:39:18 +02:00
Stefan Weil
7434590b9a lstmtraining: Check write permission for output model
This is done by creating a temporary file.
Report an error and terminate if that fails.

Also use EXIT_SUCCESS and EXIT_FAILURE for the return values of main().

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-05 20:38:02 +02:00
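The check described in this commit message (create a temporary file, fail early if that is not possible) could look roughly like the sketch below. The function names, the `_probe` suffix and the error text are hypothetical and not taken from lstmtraining; `RunTraining` stands in for the body of main().

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns true if a file can be created at the
// location derived from output_base. The "_probe" suffix is an
// arbitrary choice for this sketch.
bool CanCreateFile(const std::string& output_base) {
  const std::string probe = output_base + "_probe";
  std::FILE* f = std::fopen(probe.c_str(), "wb");
  if (f == nullptr) return false;
  std::fclose(f);
  std::remove(probe.c_str());  // Clean up the temporary file again.
  return true;
}

// Stand-in for the body of main(): report the error and return a
// portable exit status instead of a bare 0 or 1.
int RunTraining(const std::string& output_base) {
  if (!CanCreateFile(output_base)) {
    std::fprintf(stderr, "Error, cannot write to %s\n", output_base.c_str());
    return EXIT_FAILURE;
  }
  // ... the actual work would go here ...
  return EXIT_SUCCESS;
}
```

Checking before the long-running work starts means a permission problem is reported immediately rather than after the work is done.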
zdenop
660dbaa9d5 implement parameter min_characters_to_try for minimum characters to try to skip page entirely.
fixes #1729
2018-10-05 19:05:28 +02:00
zdenop
2cb609d202
Merge pull request #1950 from stweil/manpage
Merge and enhance documentation on language and script models
2018-10-05 18:09:31 +02:00
Stefan Weil
3315931859 Merge and enhance documentation on language and script models
Also add links to the user forum and to the Wiki, and update the
history text.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-05 16:55:21 +02:00
zdenop
551abb2114
Merge pull request #1949 from stweil/manpage
Document some more config options for tesseract
2018-10-05 16:38:06 +02:00
Stefan Weil
383dcf70b5 Document some more config options for tesseract
Also clarify the name(s) of the generated OCR result file(s):
Tesseract does not create a file named outbase.txt by default.

Also fix a sentence in the language section.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-05 16:03:51 +02:00
Egor Pugin
e03ee932d2
Merge pull request #1947 from stweil/doc
Update tesseract man page and add Makefile rule to build HTML manpages
2018-10-05 00:25:07 +03:00
Stefan Weil
b70a456788 Add Makefile rule to build HTML manpages
They can be built optionally by `make html` (only for automake builds).

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 22:36:03 +02:00
Stefan Weil
3e9b0acc5c Update tesseract man page
- move Tesseract 4 release note to other release notes
- format command line options in text
- add link to release notes (wiki)
- add link to contributors (GitHub)

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 22:10:22 +02:00
zdenop
f2c44a0ba8
Merge pull request #1946 from stweil/psm
Don't set page segmentation mode for unlv config
2018-10-04 22:00:40 +02:00
Stefan Weil
c6f759148b Don't set page segmentation mode for unlv config
Setting the page segmentation mode to 6 ("Assume a single uniform block
of text") typically improves the layout detection for such texts, but
should not be done in the config file.

unlvtests/runtestset.sh adds `--psm 6` explicitly, so test results
won't change when using that script.

This is similar to commit ecfee53bac.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 21:01:18 +02:00
Egor Pugin
a86292b111
Merge pull request #1944 from stweil/psm
Allow orientation detection with any traineddata
2018-10-04 18:29:45 +03:00
Stefan Weil
26bfd2b9d3 Allow orientation detection with any traineddata
While orientation and script detection (OSD) normally requires
osd.traineddata to detect both, it must also be possible to do
only orientation detection with eng.traineddata or any other
traineddata.

Enforce osd.traineddata only if there was no `-l` command line option.

Commit 27ce472666 was too restrictive.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 17:07:14 +02:00
zdenop
6b9f1f100b
Merge pull request #1943 from stweil/psm
Don't set page segmentation mode for hocr, pdf and tsv configs
2018-10-04 16:24:52 +02:00
Stefan Weil
ecfee53bac Don't set page segmentation mode for hocr, pdf and tsv configs
Setting the page segmentation mode in those config files gives unexpected
results: the text recognized when no config or only txt is given changes
if txt and any of hocr, pdf or tsv are chosen together.

In a test set of nearly 200 pages from historical books, using
segmentation mode 1 is typically slightly better than the default,
but there are also cases where it is much worse. Therefore the user
should be able to decide which page segmentation mode is best.

Old results for hocr, pdf or tsv now need an explicit `--psm 1` for
reproduction.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 12:05:49 +02:00
zdenop
b15fbf1d0f
Merge pull request #1941 from Shreeshrii/master
Update man page and readme regarding two OCR engines in Tesseract 4
2018-10-04 07:49:08 +02:00
Shree Devi Kumar
d160067308 Update README about both OCR engines in tesseract 4 2018-10-04 04:17:49 +00:00
Shree Devi Kumar
0c39d3446b Update tesseract man page about both OCR engines in tesseract 4 2018-10-04 04:01:26 +00:00
zdenop
1beeeee215 fix version info in VERSION 2018-10-03 23:51:41 +02:00
Zdenko Podobný
dcc50a867f Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
* 'master' of https://github.com/tesseract-ocr/tesseract:
  Fix CID 1164579 (Explicit null dereferenced)
  print help for tesstrain.sh; fixes #1469
  Fix CID 1395882 (Uninitialized scalar variable)
  Fix comments
  Move content of ipoints.h to points.h and remove ipoints.h
  remove duplicate help from combine_lang_model
  Fix typo.
  use tprintf instead of printf to be able to disable messages by quiet option (issue #1240)
  add "sudo ldconfig" to install instruction. fixes #1212
  unittest: Replace NULL by nullptr
  unittest: Format code
  tesseract app: check if input file exists; fixes #1023
  Format code (replace ( xxx ) by (xxx))
  Simplify boolean expressions
  Win32: use the ISO C and C++ conformant name "_putenv" instead of deprecated "putenv"
2018-10-03 19:21:42 +02:00
zdenop
423798722f
Merge pull request #1938 from stweil/coverity
Fix two reports from CoverityScan and clean related code
2018-10-02 12:34:08 +02:00
Stefan Weil
04703ca8df Fix CID 1164579 (Explicit null dereferenced)
The report from Coverity Scan is a false positive.

Nevertheless the code can be rewritten and optimized
a little bit to fix that report.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:48:28 +02:00
Zdenko Podobný
7dbf5a030f print help for tesstrain.sh; fixes #1469 2018-10-02 11:35:10 +02:00
Stefan Weil
9a1f14f2aa Fix CID 1395882 (Uninitialized scalar variable)
The implementation for ICOORD only allows division by scale != 0.

Do the same for FCOORD by asserting that scale != 0.0f,
so undefined program behaviour will be caught.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:34:14 +02:00
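A guarded division of the kind described above might be sketched as follows. `Vec2f` is a hypothetical stand-in for a float coordinate class, and the standard `assert` macro is used here as an assumption; the real FCOORD code may use a different assertion mechanism.

```cpp
#include <cassert>

// Hypothetical stand-in for a float coordinate pair; not the real FCOORD.
struct Vec2f {
  float x;
  float y;
};

// Division by a scalar. The assertion turns a zero divisor into an
// immediate, visible failure instead of silently producing inf or NaN.
inline Vec2f operator/(const Vec2f& v, float scale) {
  assert(scale != 0.0f);
  return Vec2f{v.x / scale, v.y / scale};
}
```

Note that `assert` is compiled out when `NDEBUG` is defined, so this sketch only catches the error in debug builds.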
Stefan Weil
ce6ff20939 Fix comments
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:26:36 +02:00
Stefan Weil
8c56b8f58c Move content of ipoints.h to points.h and remove ipoints.h
Both include files depended on each other, so it did not make sense
to separate them.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-02 11:21:27 +02:00
zdenop
57a6f1d22e remove duplicate help from combine_lang_model 2018-10-01 21:22:51 +02:00
Egor Pugin
6ee7f4eac2
Fix typo. 2018-09-29 17:04:25 +03:00
zdenop
14b83d3090 use tprintf instead of printf to be able to disable messages by quiet option
(issue #1240)
2018-09-29 13:49:08 +02:00
zdenop
d9372662ec add "sudo ldconfig" to install instruction. fixes #1212 2018-09-29 13:33:36 +02:00
zdenop
d5b6222856
Merge pull request #1935 from stweil/style
Format code and fix some style issues
2018-09-29 09:32:56 +02:00
Stefan Weil
4ec9c86226 unittest: Replace NULL by nullptr
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 09:27:12 +02:00
Stefan Weil
9e66fb918f unittest: Format code
It was formatted with clang-format-7 -i unittest/*.{c*,h}.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 09:19:13 +02:00
zdenop
1a096441d0 tesseract app: check if input file exists; fixes #1023 2018-09-29 08:51:00 +02:00
Stefan Weil
0f3206d5fe Format code (replace ( xxx ) by (xxx))
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 08:21:25 +02:00
Stefan Weil
63f87cac90 Simplify boolean expressions
Remove "? true : false" which is not needed for boolean expressions.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 08:21:14 +02:00
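The simplification named in this commit can be shown with two hypothetical functions (the names and the range check are illustrative, not taken from the code base). A comparison already yields a `bool`, so the conditional operator adds nothing.

```cpp
// Verbose form: the condition is already a bool, so "? true : false"
// is redundant.
inline bool IsInRangeVerbose(int v, int lo, int hi) {
  return (lo <= v && v <= hi) ? true : false;
}

// Simplified form with identical behaviour.
inline bool IsInRange(int v, int lo, int hi) {
  return lo <= v && v <= hi;
}
```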
Zdenko Podobný
bf6d929e4c fix using c-api / compile with gcc 2018-09-28 23:14:32 +02:00
zdenop
abe40f17c9 Win32: use the ISO C and C++ conformant name "_putenv" instead of deprecated "putenv" 2018-09-28 20:53:57 +02:00
zdenop
a0564fd4ec Allow user to specify dpi for input image 2018-09-28 20:28:52 +02:00
zdenop
345e5ee1f3 prefer to use FreeType for pango_cairo_font_map 2018-09-28 11:07:26 +02:00
zdenop
5fe1390748 remove alpha channel from png: issue #1914 2018-09-27 19:40:15 +02:00