Nick White
06b7a7b188
Add option to include character bounding boxes in hocr output
...
Add the 'hocr_char_boxes' configuration option (off by default),
which enables printing the bounding boxes of each character in the
x_bboxes property of an ocrx_word element in hocr output.
2016-04-29 15:37:46 +01:00
Tom Morris
6700edd8bc
Cleanup TSV renderer
...
Remove all references to hocr, hocr.tsv, etc. Remove dead code for font
info, input filename, HTML escapes. Improved comments. Fixed
indentation.
2016-03-01 13:41:19 -05:00
Sundar M. Vaidya
738fe4f757
Adds BoolParam tessedit_create_hocrtsv in class Tesseract.
2016-03-01 12:30:39 -05:00
amitdo
c2f5e9b849
If there is no explicit renderer(s), default to TessTextRenderer
...
Revert fd429c32, 43834da7, 05de195e.
See #49, #59.
The code in this commit solves the issue in a more elegant way, IMHO.
Now you can use:
* `tesseract eurotext.tif eurotext txt pdf`
* `tesseract eurotext.tif eurotext txt hocr`
* `tesseract eurotext.tif eurotext txt hocr pdf`
NOTE:
With `tesseract eurotext.tif eurotext`
or `tesseract eurotext.tif eurotext txt`
the psm will be set to '3', but...
With `tesseract eurotext.tif eurotext txt pdf`
or `tesseract eurotext.tif eurotext txt hocr`
the psm will be set to '1'.
2015-12-11 19:06:49 +02:00
Stefan Weil
318b88daa6
ccmain: Fix typos in comments and strings
...
Most of them were found by codespell.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-09-14 21:59:16 +02:00
Zdenko Podobný
41478fd5a1
implement build without cube (-DNO_CUBE_BUILD)
2015-07-24 11:51:44 +02:00
Ray Smith
78b5e1a77d
Fixed occurrence of small rotated blocks in loosely spaced text
2015-06-12 11:05:00 -07:00
Ray Smith
0e868ef377
Major change to improve layout analysis for heavily diacritic languages:
...
Tha, Vie, Kan, Tel etc.
There is a new overlap detector that detects when diacritics
cause a big increase in textline overlap. In such cases, diacritics from
overlap regions are kept separate from layout analysis completely, allowing
textline formation to happen without them. The diacritics are then assigned
to 0, 1 or 2 close words at the end of layout analysis, using and modifying
an old noise detection data path.
The stored diacritics are used or not during recognition according to the
character classifier's liking for them.
2015-05-12 16:47:02 -07:00
Ray Smith
b6d0184806
Fixed problems with shifted baselines so recognition can recover from layout analysis errors.
2015-05-12 15:53:45 -07:00
Ray Smith
4a3caefd92
Add ability to build under android (without cube or scrollview).
2015-05-12 15:41:15 -07:00
Zdenko Podobný
4c7c960bfd
fix issue 1417
2015-02-07 22:22:20 +01:00
Zdenko Podobný
36883b4faf
preserve interword spaces patch - Issue 1409
2015-01-27 22:58:04 +01:00
Ray Smith
f927728169
Fixed issue 1207
2014-10-09 13:28:03 -07:00
Zdenko Podobný
d0cb1071b2
remove parameters tessedit_pdf_jpg_quality, tessedit_pdf_compression (reasons are in i1300 and i1285)
2014-10-07 23:37:34 +02:00
Ray Smith
55d11ad3c2
Moved params from global in page layout to tesseractclass, improved single column layout analysis
2014-10-07 09:31:00 -07:00
Zdenko Podobný
9e8629d9ef
allow multiple outputs in tesseract executable (https://groups.google.com/d/msg/tesseract-ocr/Z_WUKmJDVxc/1vc3W0xJZ2oJ)
2014-09-19 23:33:47 +02:00
Ray Smith
2f197cd653
Fixed issues 899/1220/1246 (mixed eng+ara)
2014-09-17 18:27:49 -07:00
Zdenko Podobný
ff87944171
fix typo
2014-09-07 18:23:47 +02:00
Zdenko Podobný
d1aa61c110
fix issue 1285: reimplement option to select pdf compression
2014-09-06 09:32:22 +02:00
Ray Smith
09b439b05a
Fixed issue 1241, but disabled due to making accuracy worse
2014-08-13 13:33:10 -07:00
theraysmith@gmail.com
dbf6197471
Major refactor of control.cpp to enable line recognition
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1147 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-11 23:23:06 +00:00
zdenop
6941bffbd2
fix typo
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1135 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-09 17:53:57 +00:00
zdenop
bce2cd5f33
enable selection of pdf compression type and jpeg quality (fix issue 1263)
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1134 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-08 21:18:44 +00:00
zdenop
1156098567
Add font info to hocr output - fix issue 1219
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1132 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-03 16:22:12 +00:00
theraysmith@gmail.com
8364f24f4b
Added ability for box files to store spaces and newlines
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1060 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-04-23 22:52:05 +00:00
zdenop
790a3da22f
remove 'class IMAGE;'
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1045 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-02-03 23:32:23 +00:00
theraysmith@gmail.com
d2ad450502
Added PDF renderer
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@957 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-09 17:47:34 +00:00
theraysmith@gmail.com
7ec4fd7a56
Refactored control functions to enable parallel blob classification
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@904 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-11-08 20:30:56 +00:00
theraysmith@gmail.com
2aafc9df24
Improved sub/superscript treatment
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@872 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-09-20 19:49:47 +00:00
theraysmith@gmail.com
64c739c8af
Added sparse text mode, also fixed issue 653.
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@820 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-01-03 19:06:41 +00:00
theraysmith@gmail.com
f23460bec4
Removed config_auto.h from .h files
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@748 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-09-21 15:26:10 +00:00
zdenop@gmail.com
49c4ce3183
fix for GRAPHICS_DISABLED build
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@686 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-01 22:43:51 +00:00
theraysmith@gmail.com
3a998fe7ac
Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic, Added paragraph detection in layout analysis/post OCR, Fixed inconsistent xheight during training and over-chopping, Added simultaneous multi-language capability, Refactored top-level word recognition module, Fixed problems with internally scaled images
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@651 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-02 02:59:49 +00:00
theraysmith
3e8c0bc228
Various fixes, including memory leak in fixspace, font labels on output, removed some annoying debug output, fixes to initialization of parameters, general cleanup, and added Hindi
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@567 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2011-03-21 21:44:05 +00:00
theraysmith
c8465252e4
Rewrite of DENORM
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@538 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2010-11-30 01:05:48 +00:00
zdenop@gmail.com
4523ce9f7d
3.01 code from http://github.com/jimregan/tesseract-ocr with adaptations related to Linux and Windows (VC2008) compile process
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@526 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2010-11-23 18:34:14 +00:00
theraysmith
96e8b51feb
More changes to ccmain for 3.00
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@287 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2009-07-11 02:07:25 +00:00