Tom Morris
6700edd8bc
Cleanup TSV renderer
...
Remove all references to hocr, hocr.tsv, etc. Remove dead code for font
info, input filename, HTML escapes. Improved comments. Fixed
indentation.
2016-03-01 13:41:19 -05:00
Sundar M. Vaidya
858f4b75ce
Avoids HTML escaping.
2016-03-01 12:30:39 -05:00
Sundar M. Vaidya
b1e4a82b0b
Render output in TSV format.
2016-03-01 12:30:39 -05:00
Sundar M. Vaidya
59d593d796
Calls TessHOcrTsvRenderer if tessedit_create_hocrtsv is true.
2016-03-01 12:23:12 -05:00
Sundar M. Vaidya
4d13892f5b
Adds TessHOcrTsvRenderer class for rendering HOCR info in tsv format.
2016-03-01 12:13:42 -05:00
Sundar M. Vaidya
d04e3259af
Adds char* GetHOCRTSVText(int) as placeholder. Copy of char* GetHOCRText(int).
2016-03-01 12:13:42 -05:00
Tom Morris
6c44775d8a
Emit fewer "lang" attributes
...
Add "lang" attribute to paragraph markup and only include
word lang attribute if it's different from the paragraph's value.
2016-02-17 10:23:41 -05:00
Tom Morris
ea401c9046
Only generate dir for HOCR when needed - fixes #208
...
Takes advantage of inheritance and dir="ltr" default to:
- only generate paragraph dirs which are not ltr
- only generate word dirs which don't match enclosing paragraph
Tested against LTR, RTL, and mixed direction files. Files for the
latter two cases are in a separate commit on the ltr-test-files branch.
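The rule described in this commit message can be sketched as follows. This is only an illustration under assumptions: `Dir`, `DirAttr`, `ParagraphDirAttr`, and `WordDirAttr` are hypothetical names, not the identifiers used in the hOCR renderer.

```cpp
#include <string>

// Hypothetical writing-direction type; the real code uses its own enum.
enum class Dir { kLtr, kRtl };

// Returns a dir attribute only when it differs from the inherited value.
std::string DirAttr(Dir dir, Dir inherited) {
  if (dir == inherited) return "";  // inherited value already applies
  return dir == Dir::kRtl ? " dir=\"rtl\"" : " dir=\"ltr\"";
}

// Paragraphs inherit the HTML default, which is ltr.
std::string ParagraphDirAttr(Dir paragraph) {
  return DirAttr(paragraph, Dir::kLtr);
}

// Words inherit from their enclosing paragraph.
std::string WordDirAttr(Dir word, Dir paragraph) {
  return DirAttr(word, paragraph);
}
```

With these helpers an LTR paragraph gets no attribute at all, and an LTR word inside an RTL paragraph gets an explicit `dir="ltr"`.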
2016-02-17 10:23:41 -05:00
Tom Morris
809bbd9bfa
Fix varsize array for Microsoft compiler
2016-02-17 10:20:18 -05:00
Tom Morris
431786276c
INCOMPATIBLE fix to hOCR line height information - fixes #225.
...
This fixes the duplicate line IDs caused by inserting height information
into the middle of the ID and it moves the line height info into
the title attribute like everything else, rather than using non-standard
HTML attributes (which won't validate).
This change may break consumers of the HTML output, but 3.04 has only
been in the wild for 6 months and the current HTML is invalid, so I
believe the benefit outweighs the cost for the fix.
2016-02-15 18:02:46 -05:00
amitdo
6be9d7a5f8
Fix #64. Make box training work
...
This commit is better than 06fc0533c. Hopefully, this is the last fix to the box training issue.
2016-01-29 03:37:34 +02:00
amitdo
06fc0533c8
Fix #184. Training should work now
2016-01-17 14:27:35 +02:00
zdenop
c53add706e
Merge pull request #27 from tesseract-ocr/monitor
...
Monitor
2016-01-05 16:28:42 +01:00
amitdo
a20156fc67
Add missing ')' to make the code compile
2015-12-11 19:42:16 +02:00
amitdo
c2f5e9b849
If there is no explicit renderer(s), default to TessTextRenderer
...
Revert fd429c32, 43834da7, 05de195e.
See #49, #59.
The code in this commit solves the issue in a more elegant way, IMHO.
Now you can use:
* `tesseract eurotext.tif eurotext txt pdf`
* `tesseract eurotext.tif eurotext txt hocr`
* `tesseract eurotext.tif eurotext txt hocr pdf`
NOTE:
With `tesseract eurotext.tif eurotext`
or `tesseract eurotext.tif eurotext txt`
the psm will be set to '3', but...
With `tesseract eurotext.tif eurotext txt pdf`
or `tesseract eurotext.tif eurotext txt hocr`
the psm will be set to '1'.
2015-12-11 19:06:49 +02:00
Stefan Weil
71c9e028f7
tesseractmain: Prettify help message
...
Commit 99110df757 improved the help text
in several aspects, but also introduced new inconsistencies which this
patch tries to fix.
* Align columns (this needed replacing tabs by spaces).
* Start explaining text with uppercase.
* Replace "the stdout" by "stdout".
* Small changes in help text for page segmentation modes.
* Split options in OCR options and single options
(partially revert commit 99110df757).
In addition, whitespace characters at end of lines were removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-11-29 10:26:40 +01:00
zdenop
7cc7c6f9c2
Merge pull request #156 from stweil/master
...
pdfrenderer: Fix uninitialized local variables
2015-11-27 21:53:55 +01:00
amitdo
99110df757
tesseractmain.cpp: Split huge main() to sub functions
...
Add these functions to api/tesseractmain.cpp:
PrintVersionInfo()
PrintUsage()
PrintHelpForPSM()
PrintHelpMessage()
SetVariablesFromCLArgs()
PrintLangsList()
FixPageSegMode()
ParseArgs()
PreloadRenderers()
2015-11-26 11:36:16 +02:00
Stefan Weil
5ce88d7f49
pdfrenderer: Fix uninitialized local variables
...
Coverity bug reports:
CID 1270405: Uninitialized scalar variable
CID 1270408: Uninitialized scalar variable
CID 1270409: Uninitialized scalar variable
CID 1270410: Uninitialized scalar variable
Those variables are set conditionally in the while loop
and must keep their values in following iterations, so
they must be declared outside of the loop.
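A minimal sketch of the pattern this message describes, not the actual pdfrenderer.cpp code: `Word` and `SumLineHeights` are hypothetical and exist only to show a value that is set conditionally and reused by later iterations.

```cpp
#include <vector>

// Hypothetical input: only the first word of a line carries a height.
struct Word {
  bool starts_line;
  int line_height;
};

int SumLineHeights(const std::vector<Word>& words) {
  // Declared and initialized outside the loop so the value survives from
  // one iteration to the next and is never read uninitialized.
  int height = 0;
  int total = 0;
  for (const Word& word : words) {
    if (word.starts_line) height = word.line_height;  // set conditionally
    total += height;  // may use the value set by an earlier iteration
  }
  return total;
}
```

Declaring `height` inside the loop body would either reset it on every pass or leave it uninitialized for words that do not start a line.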
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-11-25 22:24:06 +01:00
Stefan Weil
03f37c0cdc
tesseractmain: Fix unterminated string
...
Coverity bug report: CID 1270421 "Buffer not null terminated".
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-11-24 17:17:17 +01:00
Stefan Weil
997c4a6078
api: Fix printing of a size_t value
...
size_t is not always the same as long, especially not for 64 bit Windows:
api/pdfrenderer.cpp:549:31: warning:
format '%ld' expects argument of type 'long int',
but argument 4 has type 'size_t {aka long long unsigned int}' [-Wformat=]
size_t normally requires a format string "%zu", but this is unsupported
by Visual Studio, so use a type cast.
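As an illustration of the cast approach (a sketch only; `FormatOffset` is a hypothetical helper, and the exact cast used in pdfrenderer.cpp may differ):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a size_t without "%zu" by casting to a type whose conversion
// specifier is supported everywhere. Note that unsigned long is 32 bits
// on 64-bit Windows, so values above 4 GiB would be truncated there.
std::string FormatOffset(std::size_t offset) {
  char buf[32];
  int n = std::snprintf(buf, sizeof(buf), "%lu",
                        static_cast<unsigned long>(offset));
  assert(n > 0 && n < static_cast<int>(sizeof(buf)));
  return std::string(buf, n);
}
```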
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-11-05 06:39:35 +01:00
Stefan Weil
3272b62201
Don't use NULL for integer arguments
...
This fixes compiler warnings:
api/baseapi.cpp:1422:49: warning:
passing NULL to non-pointer argument 6 of
'int MultiByteToWideChar(UINT, DWORD, LPCCH, int, LPWSTR, int)'
[-Wconversion-null]
api/baseapi.cpp:1427:54:
warning: passing NULL to non-pointer argument 6 of
'int WideCharToMultiByte(UINT, DWORD, LPCWCH, int, LPSTR, int, LPCCH, LPBOOL)'
[-Wconversion-null]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-11-05 06:38:01 +01:00
Stefan Weil
edf765b952
Remove unneeded const qualifiers
...
This fixes compiler warnings like this one:
api/baseapi.h:739:32: warning:
type qualifiers ignored on function return type [-Wignored-qualifiers]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-11-05 06:36:42 +01:00
amitdo
6bbcb50dd9
Added osd renderer for psm 0.
...
Works for single page and multi-page.
2015-10-30 20:09:00 +02:00
amitdo
dcfdd5c035
OSD: Print script name instead of meaningless script id
2015-10-28 09:50:28 +02:00
Stefan Weil
11b2a4d9af
api: Fix typos in comments (all found by codespell)
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-09-14 21:54:27 +02:00
James R. Barlow
18ac7ae7ef
Get OpenCL to compile on OS X
...
However, the output of the OpenCL build is garbage....
2015-08-26 02:03:07 -07:00
Zdenko Podobný
bb19f2c16b
Fixes #76 - enable OpenMP support
2015-08-14 21:39:40 +02:00
Robert Theis
aa6a0b12f9
Remove extraneous line feed
2015-08-12 18:02:35 -07:00
Zdenko Podobný
0337d898d4
fix bug in UTF-16BE conversion
2015-08-10 21:22:20 +02:00
Zdenko Podobný
545a0634da
improve NO_CUBE_BUILD
2015-08-09 18:09:52 +02:00
Zdenko Podobný
67ede37b50
Fixes #74 NO_CUBE_BUILD with reverting to ANDROID_BUILD in baseapi
2015-08-09 18:09:30 +02:00
Zdenko Podobný
628de5ba3f
enable pdfrender with NO_CUBE_BUILD
2015-08-07 23:20:22 +02:00
Jeff Breidenbach
9dcf2c6aa8
replace CubeUtils::UTF8ToUTF32 in pdfrenderer
2015-08-07 22:18:33 +02:00
Zdenko Podobný
66a76a9477
Revert "temporary add config/*, configure and Makefile.in for release"
...
This reverts commits ec9581d8f2, 1afe382c4e, 4b2cfabcc1
2015-07-31 21:44:43 +02:00
Zdenko Podobný
41478fd5a1
implement build without cube (-DNO_CUBE_BUILD)
2015-07-24 11:51:44 +02:00
Zdenko Podobný
71e226c44f
increase version number
2015-07-21 22:46:52 +02:00
zdenop
e4f4893fb8
Merge pull request #52 from unbe/null-pointer-access-in-hocr
...
Fix null pointer dereference when writing font name into HOCR.
2015-07-20 07:40:59 +02:00
artem
2b6801eddb
Fix null pointer dereference when writing font name into HOCR.
2015-07-19 22:05:02 +02:00
unbe
67ffea8877
Update capi.cpp
...
Make TessDeleteResultRenderer use delete, not delete[]
2015-07-19 15:15:42 +02:00
Zdenko Podobný
ec9581d8f2
temporary add configure and Makefile.in for release
2015-07-11 09:42:43 +02:00
Ray Smith
a303ab9d00
Misc fixes, mostly clang formatting, but some bug fixes in matrix, werd, and tesstrain_utils. Also updates unicharset to match traineddata files.
2015-07-09 14:28:20 -07:00
Ray Smith
b1d99dfe23
Added a backup adaptive classifier to take over from primary when it fills on a large document
2015-06-12 11:10:53 -07:00
Ray Smith
ab0f4e2c38
Clang fixes to earlier changes and build compatibility with Google environment
2015-06-12 10:53:21 -07:00
orbitcowboy
9328f0e5d4
Fix potential null pointer dereference in ccmain/paragraphs.cpp.
2015-05-19 10:17:44 +02:00
Jim O'Regan
4a6195202c
fix typo
2015-05-18 12:32:36 +01:00
Zdenko Podobný
438edd6c7b
added row attributes to hocr output
2015-05-17 22:13:59 +02:00
Zdenko Podobný
ed6ae9b974
Add monitor to GetHOCRText
2015-05-17 21:55:50 +02:00
Zdenko Podobný
59bcbc79b3
fix GIT_VER info in VS2010
2015-05-15 15:14:49 +02:00
Zdenko Podobný
e98849b482
Print error message when pdf.ttf is not found.
2015-05-15 15:14:00 +02:00
Zdenko Podobný
035b324f0f
reflect the latest commits in VS2010 build
2015-05-14 10:52:54 +02:00
Jim O'Regan
b13691fda0
Merge conflict: going with Ray's version
2015-05-13 08:54:28 +01:00
Ray Smith
03f3c9dc88
Misc fixes missed from previous commits
2015-05-12 18:13:15 -07:00
Ray Smith
6b634170c1
Significant change to invisible font system
...
to improve correctness and compatibility with
external programs, particularly ghostscript.
We will start mapping everything to a single glyph,
rather than allowing characters to run off the end
of the font.
A more detailed design discussion is embedded into
pdfrenderer.cpp comments. The font, source code
that produces the font, and the design comments
were contributed by Ken Sharp from Artifex Software.
2015-05-12 17:33:18 -07:00
Ray Smith
4a3caefd92
Add ability to build under android (without cube or scrollview).
2015-05-12 15:41:15 -07:00
Ray Smith
53fc4456cc
Fixed issue 1252: Refactored LearnBlob and its call hierarchy to make it a member of Classify.
...
Eliminated the flexfx scheme for calling global feature extractor functions
through an array of function pointers.
Deleted dead code I found as a by-product.
This CL does not change BlobToTrainingSample or ExtractFeatures to be full
members of Classify (the eventual goal) as that would make it even bigger,
since there are a lot of callers to these functions.
When ExtractFeatures and BlobToTrainingSample are members of Classify they
will be able to access control parameters in Classify, which will greatly
simplify developing variations to the feature extraction process.
2015-05-12 15:22:34 -07:00
Zdenko Podobný
d508751e58
Fixed issue 1317 - git revision info used as version info for autotools & DEBUG
2015-05-02 12:15:13 +02:00
Zdenko Podobný
4c7c960bfd
fix issue 1417
2015-02-07 22:22:20 +01:00
Zdenko Podobný
09b0c91fc9
fix Issue 1398
2015-02-06 23:44:58 +01:00
Zdenko Podobný
e0441d0c6b
fix typo/ issue 1397
2014-12-31 22:31:50 +01:00
Zdenko Podobný
473141c1de
fix bool in c-api
2014-12-28 17:55:56 +01:00
Zdenko Podobný
4da712d04d
Add paragraph info to C-API(fix issue 1388)
2014-12-07 14:07:14 +01:00
Zdenko Podobný
239f350a72
remove const from C API TessResultIteratorGetChoiceIterator (issue 1342)
2014-10-14 22:46:11 +02:00
Ray Smith
242b14ae7f
Reduced size of multi-renderer implementation from code review
2014-10-09 13:29:46 -07:00
Ray Smith
d9699c4099
Fixed bidi handling in PDF output
2014-10-09 13:29:01 -07:00
Zdenko Podobný
d0cb1071b2
remove parameters tessedit_pdf_jpg_quality, tessedit_pdf_compression (reasons are in i1300 and i1285)
2014-10-07 23:37:34 +02:00
Zdenko Podobný
4904afe65b
fix issue 1300 - patch from #35
2014-10-06 22:43:56 +02:00
Zdenko Podobný
4c01561b0f
fix issue 1300 - patch from #26
2014-10-02 21:19:17 +02:00
Zdenko Podobný
c0640a4bef
fix cygwin build (issue 1289)
2014-09-28 23:19:52 +02:00
Zdenko Podobný
f8613fab22
fix issue 1300 /patches from breidenbach
2014-09-21 16:38:24 +02:00
Zdenko Podobný
9e8629d9ef
allow multiple outputs in tesseract executable (https://groups.google.com/d/msg/tesseract-ocr/Z_WUKmJDVxc/1vc3W0xJZ2oJ)
2014-09-19 23:33:47 +02:00
Ray Smith
648e7ca311
Merge branch 'master' of https://code.google.com/p/tesseract-ocr
...
The usual git merge needed when the local branch is out of date.
2014-09-17 18:10:17 -07:00
Ray Smith
0256529c1f
Fixed issue 1243
2014-09-17 18:09:45 -07:00
Jim O'Regan
c0c719306a
update docs for TessBaseAPI::SetProbabilityInContextFunc based on Ray's email today
2014-09-09 20:37:27 +01:00
Zdenko Podobný
d1aa61c110
fix issue 1285: reimplement option to select pdf compression
2014-09-06 09:32:22 +02:00
Ray Smith
cd2653c167
Cleanup from previous changes
2014-08-12 16:12:46 -07:00
theraysmith@gmail.com
dbf6197471
Major refactor of control.cpp to enable line recognition
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1147 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-11 23:23:06 +00:00
theraysmith@gmail.com
b64ad05096
Improved efficiency of image processing for PDF
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1141 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-11 23:15:25 +00:00
zdenop
bce2cd5f33
allow selecting pdf compression type and jpeg quality (fix issue 1263)
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1134 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-08 21:18:44 +00:00
zdenop
1156098567
Add font info to hocr output - fix issue 1219
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1132 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-03 16:22:12 +00:00
zdenop
5b779456f9
fix compatibility with leptonica 1.71 and 1.70
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1126 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-07-24 19:11:39 +00:00
zdenop
95b7783a95
fix issue 1228: bilevel pdf output - horizontal/vertical lines removed
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1118 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-06-23 21:04:37 +00:00
zdenop
905e6162b9
put info about (API) version; fix typo
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1117 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-06-22 18:31:42 +00:00
zdenop
fad9de4e1b
fix issue 1217: GetThresholdedImage accesses possibly NULL thresholder_
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1113 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-31 21:21:37 +00:00
zdenop
e64f555567
fix Issue 1223: TessPolyBlockType enum is outdated in C-API
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1112 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-31 20:31:48 +00:00
zdenop
36f3f76d64
fix tiff issue on windows
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1111 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-31 07:27:54 +00:00
zdenop@gmail.com
84cdcb32cc
fixed windows build
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1110 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-26 06:48:58 +00:00
zdenop
19c4c2f0e7
fix C-API to reflect recent C++ API changes - thanks to Nick White
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1109 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-25 21:03:11 +00:00
zdenop
ffe52737d5
check if input file exists
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1108 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-25 19:58:00 +00:00
theraysmith@gmail.com
25a8c7b720
Enabled streaming input and output of multi-page documents
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1105 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-21 15:46:21 +00:00
zdenop
979f9cafe5
Add word recognition language to C-API - fix issue 1200
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1102 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-16 18:35:54 +00:00
zdenop
44b0d0e28e
addition to r1100
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1101 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-11 21:24:54 +00:00
zdenop
6051e40212
fix issue 1197
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1100 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-11 21:20:38 +00:00
zdenop
2e520f2fac
fix hocr/pdf output when image is provided from stdin - issue 1196
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1099 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-11 15:59:47 +00:00
zdenop
bdb912c186
escape input_file name in hOCR output - fix issue 1154
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1098 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-09 22:19:30 +00:00
zdenop
30f6ae6742
amendment to r1091
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1095 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-07 20:53:03 +00:00
zdenop
ee73e3b107
fix issue 123: user-words (and user-patterns) file specified by command line
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1093 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-04 21:11:00 +00:00
zdenop
bc09cd9040
fix formatting in C-API and add TessChoiceIteratorDelete
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1092 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-03 20:21:37 +00:00
zdenop
f86e9d83d4
add ChoiceIterator to C-API - fix issue 1149
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1091 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-05-03 09:29:20 +00:00
theraysmith@gmail.com
45e106820f
Fixed issue 1116
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1074 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-04-24 00:50:27 +00:00