Shree Devi Kumar
f3362a4b5b
Add renderer to create WordStr box files from images
2019-02-10 19:59:17 +00:00
zdenop
2ae65b2493
Merge pull request #2216 from Shreeshrii/lstmbox
...
Lstmbox
2019-02-10 13:53:41 +01:00
Shree Devi Kumar
311053681c
put common code in AddBoxToLSTM
2019-02-10 09:16:45 +00:00
zdenop
e51f1885e6
Merge pull request #2229 from stweil/warn
...
Fix some compiler warnings
2019-02-10 08:20:23 +01:00
Shree Devi Kumar
b51c1bf05a
change to const char* as suggested by @stweil
2019-02-10 05:13:18 +00:00
Stefan Weil
aa2dcca295
Fix compiler warnings (-Wstringop-truncation)
...
gcc warnings:
src/api/tesseractmain.cpp:252:14: warning:
‘char* strncpy(char*, const char*, size_t)’ specified bound 255
equals destination size [-Wstringop-truncation]
src/ccutil/unicharset.h:66:12: warning:
‘char* strncpy(char*, const char*, size_t)’ output may be truncated copying 30 bytes from a string of length 30 [-Wstringop-truncation]
src/ccutil/unicharset.cpp:806:12: warning:
‘char* strncpy(char*, const char*, size_t)’ specified bound 64 equals destination size [-Wstringop-truncation]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 16:32:09 +01:00
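The -Wstringop-truncation warnings quoted above fire when the strncpy bound equals the destination size, because the result may then lack a terminating NUL. A minimal sketch of one common way to avoid that is shown here; the helper name CopyTruncated is hypothetical and is not taken from the commit, which may have solved the problem differently.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the Tesseract sources): copy at most
// dst_size - 1 bytes from src and always write a terminating NUL.
inline void CopyTruncated(char* dst, std::size_t dst_size, const char* src) {
  if (dst_size == 0) return;  // nothing can be stored, not even the NUL
  std::size_t n = std::strlen(src);
  if (n >= dst_size) n = dst_size - 1;  // truncate, leave room for the NUL
  std::memcpy(dst, src, n);
  dst[n] = '\0';
}
```

With a 4-byte buffer, copying "abcdef" would leave "abc" in the buffer, always terminated.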
Stefan Weil
d42413dd17
OpenCL: Remove PERF_COUNT framework
...
It was rarely used, but added a lot of code and an unconditional
dependency on openclwrapper.h.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-02-09 10:58:15 +01:00
Shree Devi Kumar
0f42fd8c69
change to use bbox coordinates for TEXTLINE for all characters
...
(cherry picked from commit 049db108b2d6cd3a7f52e480212320613117d50b)
2019-02-05 14:03:29 +00:00
Shree Devi Kumar
9c89cd51cf
Add a new renderer to create box files from images for LSTM training
...
(cherry picked from commit 921da6be2bdbda2ddd64514f9b6bec40a336246a)
fix typo
(cherry picked from commit 7bd1a0c80393fce2f34e2845cb26760bcf3791cd)
Add lstmboxrenderer to CMakeLists
(cherry picked from commit cfef3a889aef830725921b5c0218d5e9c633b03e)
fix formatting
(cherry picked from commit 7ba2b01ede7940ed609a073364948ef8c838cd10)
2019-02-05 14:03:29 +00:00
Mikhail Akopov
7be04342cf
Fix typo
2019-02-01 09:58:44 +01:00
Stefan Weil
9e6e3a0232
Fix memory leak for PNG images
...
Commit 5fe1390748 used an implementation
which created a new Pix object. That object was never destroyed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-23 20:05:10 +01:00
Stefan Weil
7fc7d28dd0
Compile files for AVX, AVX2 or SSE only when needed
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-14 21:34:37 +01:00
zdenop
f75b2c1948
Merge pull request #310 from nickjwhite/hocrcharboxes
...
Character boxes in hOCR output
2019-01-14 19:19:04 +01:00
Nick White
ebbf907c56
Fix typo in hocr character box output
2019-01-13 16:28:31 +00:00
Nick White
4ce797b6f6
Fix hocr character box info to use new hocr renderer correctly
2019-01-13 13:01:14 +00:00
Nick White
c43e4501e3
Merge remote-tracking branch 'origin/master' into hocrcharboxes
2019-01-13 12:41:42 +00:00
zdenop
238cb219d5
Merge pull request #2152 from stweil/clean
...
Remove opencl_device_selection.h
2019-01-09 15:02:59 +01:00
Stefan Weil
a0e6586e63
Fix documentation for page segmentation mode 2
...
It never worked, so add a comment that the implementation is missing.
Also add a to-do comment.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-09 13:51:44 +01:00
Stefan Weil
0fae848b58
OpenCL: Add comments to users of openclwrapper.h
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-09 12:11:00 +01:00
Stefan Weil
e0fc4f2945
Remove opencl_device_selection.h
...
Always use OpenCL device selection if OpenCL is enabled.
This fixes a regression which was introduced by commit 5c6a57b727
which removed the definition for USE_DEVICE_SELECTION.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-01-09 12:09:56 +01:00
zdenop
d3065520fa
fix 2 clang warnings
2018-12-30 20:25:24 +01:00
Stefan Weil
cb049133cd
Fix compiler warning
...
clang warning:
tesseractmain.cpp(512,21): warning: '&&' within '||' [-Wlogical-op-parentheses]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-29 22:17:33 +01:00
zdenop
420fb0ced0
Merge branch 'master' of https://github.com/tesseract-ocr/tesseract
2018-12-29 10:31:33 +01:00
zdenop
8885fe2ccb
provide info about compiled openmp version
2018-12-29 10:18:27 +01:00
Stefan Weil
993e56ffde
Don't try to create text output if other renderers failed (fix regression)
...
Commit 49d7df6dc3 added error handling,
but since that commit Tesseract used the text fallback if the
user-selected output failed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-27 10:23:28 +01:00
zdenop
cc997b53c7
add the missing implementation for TessBaseAPIGetAltoText method in C-API
2018-12-26 21:35:47 +01:00
Stefan Weil
db9c7e0312
Use std::stringstream to generate hOCR output
...
Using std::stringstream simplifies the code and allows conversion of
double to string independent of the current locale setting.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-16 20:14:11 +01:00
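The locale point in the commit above can be illustrated with a short sketch. It assumes the stream is explicitly imbued with the classic "C" locale so that the decimal separator is always a dot; the function name FormatDouble is hypothetical and the actual renderer code may be organised differently.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: format a double without depending on the
// global locale (for example a locale that uses ',' as decimal point).
inline std::string FormatDouble(double value) {
  std::stringstream stream;
  stream.imbue(std::locale::classic());  // force the "C" locale for this stream
  stream << value;
  return stream.str();
}
```

Because the locale is attached to the stream itself, the result does not change when the program calls setlocale or std::locale::global elsewhere.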
zdenop
72d8df581b
Merge pull request #2121 from stweil/hocr
...
Move code for hOCR renderer to new file
2018-12-16 16:26:27 +01:00
Stefan Weil
c7e8d30280
Fix value for PHYSICAL_IMG_NR in ALTO output
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-16 15:07:02 +01:00
Stefan Weil
457c53026d
Fix indentation of hOCR output
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-15 17:51:59 +01:00
Stefan Weil
5de3fc47bb
Format code in new file hocrrenderer.cpp
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-15 15:35:21 +01:00
Stefan Weil
48713f7df2
Move code for hOCR renderer to new file
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-15 15:33:47 +01:00
Stefan Weil
fbbbdb4565
Use std::stringstream to generate ALTO output and add <SP> element
...
Using std::stringstream simplifies the code.
The <SP> element is needed between two <String> elements.
Remove also some unneeded spaces in the ALTO output.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-12 22:29:35 +01:00
Stefan Weil
f0a4d04187
Add config variable for selection of dot product function
...
Add also a C++ implementation with more aggressive compiler options
which is optimized for the CPU where the software was built.
It is now possible to select the function used for the dot product
with -c dotproduct=FUNCTION where FUNCTION can be one of those values:
* auto selection based on detected hardware (default)
* generic C++ code with default compiler options
* native C++ code optimized for build host
* avx optimized code for AVX
* sse optimized code for SSE
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-12-01 00:19:28 +01:00
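A rough sketch of how a dot product function could be chosen by name, as the commit above describes for -c dotproduct=FUNCTION. All names here (DotProductFunction, DotProductGeneric, SelectDotProduct) are hypothetical. Only the generic variant is implemented, "auto" simply falls back to it, and the SIMD variants are left out, so this is an illustration of the selection idea rather than the real code.

```cpp
#include <cstring>

// Hypothetical function pointer type for a dot product implementation.
using DotProductFunction = double (*)(const double*, const double*, int);

// Plain C++ dot product with no special compiler options.
inline double DotProductGeneric(const double* u, const double* v, int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) total += u[i] * v[i];
  return total;
}

// Map a configuration value to an implementation.
// Assumption of this sketch: "auto" uses the generic code, and the
// "native", "avx" and "sse" variants are not available (nullptr).
inline DotProductFunction SelectDotProduct(const char* name) {
  if (std::strcmp(name, "auto") == 0 || std::strcmp(name, "generic") == 0) {
    return DotProductGeneric;
  }
  return nullptr;  // variant not provided in this sketch
}
```

A caller would check the returned pointer before use and fall back to the generic function when the requested variant is missing.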
Stefan Weil
1910b1a72b
SIMDDetect: Use tesseract namespace and format code
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 23:36:39 +01:00
Stefan Weil
ed48b2a8f5
Format new ALTO code with clang-format
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 06:37:25 +01:00
Jake Sebright
d7cee03a94
Add support for ALTO output
2018-11-30 06:09:36 +01:00
Egor Pugin
685b136d89
Fix incorrect condition.
2018-11-29 19:02:54 +03:00
Zdenko Podobný
3d508a65a7
set unlv_tilde_crunching to false; fixes #1449 #948
2018-10-23 09:26:32 +02:00
Stefan Weil
be0cf03778
tesseractmain: Fix memory leak
...
Commit 49d7df6dc3 introduced a memory leak
when the output file could not be created.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 18:50:47 +02:00
Stefan Weil
d75ef80f12
Get sorted list of available languages
...
TessBaseAPI::GetAvailableLanguagesAsVector returned the list of languages
without sorting, so the result was random and not user friendly.
Now `tesseract --list-langs` shows the available languages and scripts
in alphabetic order.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-22 14:07:03 +02:00
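The sorting described in the commit above amounts to ordering the language names before they are shown. A minimal sketch, assuming the names are held in a std::vector<std::string> (the real API uses Tesseract's own container types); the function name SortedLanguages is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: return the language names in alphabetic order.
// The input is taken by value so the caller's list is left untouched.
inline std::vector<std::string> SortedLanguages(std::vector<std::string> langs) {
  std::sort(langs.begin(), langs.end());  // byte-wise lexicographic order
  return langs;
}
```

The order produced by a directory scan depends on the file system, which is why an explicit sort gives stable output.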
Stefan Weil
e232114089
Fix use of undefined macro USE_DEVICE_SELECTION
...
This fixes compiler warnings.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 13:58:12 +02:00
Stefan Weil
d364750cb3
Remove type cast and fix compiler warning (-Wcast-qual)
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-20 12:04:46 +02:00
Marco Atzeri
ebbd4e3efc
fixes #426 ; define NOUNDEFINED for cygwin
2018-10-20 11:25:28 +02:00
Stefan Weil
bb181ec8d3
Rename API function from GetBestLSTMChoices to GetBestLSTMSymbolChoices
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:50:38 +02:00
Stefan Weil
df7d1e1f97
Rename API function for getting LSTM choices
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-19 10:50:38 +02:00
Stefan Weil
49d7df6dc3
tesseractmain: Show error message when output file could not be created
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 19:22:49 +02:00
Stefan Weil
b0b8dfbc81
TessResultRenderer: Extend API to access status of renderer
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-18 19:22:48 +02:00
Noah Metzger
c13371d6e0
Renamed GetGlyphConfidences() to GetChoices() and glyph_confidences to lstm_choice_mode
...
Renamed the global attribute glyph_confidences to lstm_choice_mode and the method GetGlyphConfidences() to GetChoices(). All variables and comments contained in related methods were renamed as well.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-10-17 16:43:39 +02:00
Stefan Weil
32e1e4b6b4
TessPDFRenderer: Remove unused member variable jpg_quality_ (CID 1396172)
...
This fixes a warning from Coverity Scan
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:23 +02:00
Stefan Weil
d89ec15571
Revert "Fix CID 1396172 (Uninitialized members)"
...
This reverts commit cbd09de7fe.
The variable can be removed as it is not used.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-16 16:53:23 +02:00
Zdenko Podobný
cbd09de7fe
Fix CID 1396172 (Uninitialized members)
2018-10-16 12:24:10 +02:00
Stefan Weil
6ffb53f815
win32: Show TIFF errors on console
...
Showing them in a window (default) is not acceptable for a console
application like Tesseract which must be able to work in batch mode.
Such error messages can be triggered by TIFF files which include
vendor specific tags.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-13 20:42:14 +02:00
Stefan Weil
d86d520fd0
Remove tab character in source files
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-12 11:31:10 +02:00
zdenop
ca5d285a28
hocr: add ocrp_wconf to unconditional ocr-capabilities; fixes #1470
2018-10-09 16:34:50 +02:00
zdenop
956525f5a4
fix uninitialized variable, remove unused variable
2018-10-09 15:47:20 +02:00
zdenop
c375f4fbf7
keep API compatibility with #1265
2018-10-09 11:22:15 +02:00
zdenop
f794571195
use pdf L_FLATE_ENCODE only for png input; fixes #1961
2018-10-07 20:57:19 +02:00
Stefan Weil
67bf9062df
Rework check for readable input file
...
This reverts commit 1a096441d0 and
implements an alternate check which allows input from stdin.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 22:33:02 +02:00
Stefan Weil
8dc9e9fd14
Fix use of wrong UNICHARSET
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-06 13:21:09 +02:00
Stefan Weil
26bfd2b9d3
Allow orientation detection with any traineddata
...
While orientation and script detection (OSD) normally requires
osd.traineddata to detect both, it must also be possible to do
only orientation detection with eng.traineddata or any other
traineddata.
Enforce osd.traineddata only if there was no `-l` command line option.
Commit 27ce472666 was too restrictive.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 17:07:14 +02:00
Egor Pugin
6ee7f4eac2
Fix typo.
2018-09-29 17:04:25 +03:00
zdenop
d5b6222856
Merge pull request #1935 from stweil/style
...
Format code and fix some style issues
2018-09-29 09:32:56 +02:00
zdenop
1a096441d0
tesseract app: check if input file exists; fixes #1023
2018-09-29 08:51:00 +02:00
Stefan Weil
0f3206d5fe
Format code (replace ( xxx ) by (xxx))
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-29 08:21:25 +02:00
zdenop
a0564fd4ec
Allow user to specify dpi for input image
2018-09-28 20:28:52 +02:00
zdenop
5fe1390748
remove alpha channel from png: issue #1914
2018-09-27 19:40:15 +02:00
zdenop
971fe50031
fixed #714 : use binary mode when generating pdf to stdout on Windows
2018-09-27 18:35:15 +02:00
Zdenko Podobný
5dfce7471c
fix #1889 : part 2
2018-09-26 09:28:22 +02:00
zdenop
4ca179d3fa
remove condition because fontsize is always > 0
2018-09-20 21:48:44 +02:00
Zdenko Podobný
5d22fdfeed
replace deprecated C++ headers (reported by clang-tidy) - partially supersedes PR #1605
2018-09-18 18:51:11 +02:00
David Thornley
92e291250a
Fix missing default parameter value which caused the compile to fail.
2018-09-14 09:56:06 +02:00
David Thornley
31aeb534d9
Fix merge conflicts
...
Merge branch 'master' into jpg_quality_option
* master: (577 commits)
fix issue #1889
Add badges for download , licence and lgtm
Replace macro MINGW by __MINGW32__
EquationDetectBase: Define virtual destructor in .cpp file
BlobGrid: Define virtual destructor in .cpp file
GridBase: Define virtual destructor in .cpp file
AlignedBlob: Define virtual destructor in .cpp file
TransposedArray: Define virtual destructor in .cpp file
IndexMapBiDi: Define virtual destructor in .cpp file
Add missing include file (fixes linker error for Visual Studio)
NthItemTest: Add definition for virtual destructor
HeapTest: Add definition for virtual destructor
IcuErrorCode: Define virtual destructor in .cpp file
Validator: Define virtual destructor in .cpp file
Dawg: Define virtual destructor in .cpp file
CUtil: Define virtual destructor in .cpp file
IndexMap: Define virtual destructor in .cpp file
CCUtil: Define virtual destructor in .cpp file
MATRIX: Define virtual destructor in .cpp file
CCStruct: Define virtual destructor in .cpp file
...
2018-09-13 16:03:24 +02:00
Zdenko Podobný
59e42fcef6
fix issue #1889
2018-09-13 07:26:37 +02:00
Stefan Weil
be1393b1e8
Replace macro MINGW by __MINGW32__
...
MINGW is no longer used and now removed from configure.ac.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-04 16:05:27 +02:00
Stefan Weil
9f8ed31a26
api/pdfrenderer.cpp: Fix compiler warning
...
Compiler warning from clang:
src/api/pdfrenderer.cpp:848:28: warning:
cast from 'const char *' to 'char *' drops const qualifier [-Wcast-qual]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-09-03 12:32:35 +02:00
Noah Metzger
663be426f6
Added the option for character accumulated glyph confidences.
...
The parameter glyph_confidences is changed from bool to int.
An execution with value 1 outputs the hOCR file enriched with glyph confidences
for every timestep like before. An execution with value 2 outputs the timesteps
accumulated over the recognized characters.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-08-20 10:43:58 +02:00
Stefan Weil
27ce472666
Fix potential crash with --psm 0 and use osd.traineddata automatically
...
Page segmentation mode "OSD only" requires osd.traineddata,
so use it automatically.
Report a warning if the user specified a different language.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 16:52:37 +02:00
Stefan Weil
6a28cce96b
Fix whitespace issues
...
* Remove whitespace (blanks, tabs, cr) at line endings
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 13:19:52 +02:00
Stefan Weil
eb69dd0201
TessPDFRenderer: Improve robustness of API (issue #1804 )
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-08-01 09:11:04 +02:00
Noah Metzger
91c7504a35
Added a feature to enrich the hOCR output with glyph confidences
...
By using the parameter -c glyph_confidences=true the user is able to enrich
the hOCR output with additional information. Tesseract then lists additionally
the timesteps with all glyphs that were considered with their confidence
for every timestep of the LSTM.
The format of the hOCR output is slightly changed: There is now a linebreak
after every word for better readability by humans.
Signed-off-by: Noah Metzger <noah.metzger@bib.uni-mannheim.de>
2018-07-25 18:18:58 +02:00
Stefan Weil
bdd2a7aedc
Use tesseract::Serialize, tesseract::DeSerialize
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-18 11:19:37 +02:00
Stefan Weil
cfd72ff31e
Fix --print-parameters (regression)
...
Commit 629ded223c had broken that functionality.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-09 14:42:48 +02:00
Stefan Weil
55f0ca5842
Add missing include statements and clean some include statements
...
The changes are based on an analysis done with include-what-you-use.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-07 16:24:53 +02:00
Egor Pugin
3ea9cff149
Merge pull request #1752 from stweil/api
...
API fixes
2018-07-05 17:28:48 +03:00
Stefan Weil
d2febafdcd
Fix compiler warnings [-Wmissing-prototypes]
...
Add missing include statements, add missing "static" qualifiers or
remove functions which are not used at all.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-05 16:03:02 +02:00
Stefan Weil
ffb501936c
Fix prototype for API function TessBaseGetBlockTextOrientations
...
The declaration did not match the implementation (BOOL / bool).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-05 14:49:48 +02:00
Stefan Weil
790b410fd6
Remove unused API function TessBaseAPIDetectOS
...
It was not declared in capi.h, so external users could not use it anyway.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-05 14:49:48 +02:00
Stefan Weil
a74d467e90
Fix compiler warnings [-Wcomma]
...
clang warnings:
src/api/baseapi.cpp:1642:18: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1642:31: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1642:45: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1652:16: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1652:30: warning:
possible misuse of comma operator here [-Wcomma]
src/api/baseapi.cpp:1662:17: warning:
possible misuse of comma operator here [-Wcomma]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-05 12:07:04 +02:00
Stefan Weil
60fcff5ed9
Fix build with legacy engine disabled (part 2)
...
The functions TessBaseAPIInitLangMod, TessBaseAPIClearAdaptiveClassifier
and TessBaseAPIDetectOrientationScript need conditional compilation.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-04 17:56:42 +02:00
Stefan Weil
081793ff48
Fix build with legacy engine disabled
...
Instead of defining the DISABLED_LEGACY_ENGINE macro in config_auto.h
(which is not included by all source files), define it as a preprocessor
option for those parts of the code which require it.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-04 17:56:42 +02:00
zdenop
20e53b119a
Merge pull request #1742 from stweil/casts
...
Remove unneeded type casts
2018-07-04 15:35:49 +02:00
Stefan Weil
c8b5a29ce9
Remove unneeded type casts
...
This removes unneeded type casts to (char*) and (const char*).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-04 14:23:55 +02:00
Amit D
62c7b796da
Merge branch 'master' into disable-legacy
2018-07-04 11:14:33 +03:00
amitdo
15fb491be4
Add missing #ifdef in tesseractmain.cpp
2018-07-04 09:57:12 +03:00
amitdo
aa9f4b4861
Add an option to compile tesseract without the code of the legacy OCR engine
2018-07-03 18:49:42 +03:00
Stefan Weil
f7b61891bc
Replace macro PI by macro M_PI
...
One definition for pi is sufficient.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-02 21:26:53 +02:00
Stefan Weil
6801085376
pdfrenderer: Fix ClipBaseline and optimize code
...
The division was made with integers, giving a wrong result.
* Avoid division and use pure integer operations.
* Add missing "static" attribute.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-01 08:33:56 +02:00
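The integer division problem mentioned in the commit above can be shown with a generic sketch. It is not the ClipBaseline code itself: the function name ExceedsPoints and its parameters are hypothetical, and it only illustrates replacing a truncating division by a cross-multiplied integer comparison.

```cpp
// Hypothetical helper: is a length of `pixels` at resolution `ppi`
// strictly longer than `points` typographic points (1/72 inch)?
// Mathematically: pixels * 72 / ppi > points, with ppi > 0.
// Writing the test as a multiplication avoids the truncation that
// the integer expression pixels * 72 / ppi would introduce.
inline bool ExceedsPoints(int pixels, int ppi, int points) {
  return pixels * 72 > points * ppi;
}
```

For 9 pixels at 300 ppi the exact length is 2.16 points, so it exceeds 2 points, while the truncating expression 9 * 72 / 300 evaluates to 2 and the comparison against 2 would wrongly fail.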
Stefan Weil
e8e94d372c
Fix CID 1340287 (Unchecked return value)
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-01 07:54:11 +02:00
Stefan Weil
a49b8f1d21
Fix CID 1297960 (Dereference after null check)
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-07-01 07:54:11 +02:00