Commit Graph

48 Commits

Author SHA1 Message Date
Thijs Leegwater
f061503a14 Added JPEG quality option parameter (-c jpg_quality=n) 2018-01-11 09:11:30 +01:00
Stefan Weil
aa6eb6bd46 Remove Tesseract parameter "include_page_breaks" and use FF by default
Now Tesseract adds a page break (normally form feed) by default.

It is still possible to suppress page breaks by setting an empty
page_separator.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-09-19 07:34:32 +02:00
jm
2a77d5ad69 returns the correct dictionary if only lstm is used 2017-09-14 13:03:22 +02:00
Ray Smith
0382222d85 More clang-tidy fixes from sync 2017-09-08 10:22:32 +01:00
Stefan Weil
b016c48d06 Add missing spaces in help text
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2017-08-23 19:12:41 +02:00
Ray Smith
1cc511188d Added extra Init that takes a memory buffer or a filereader function pointer to enable read of traineddata from memory or foreign file systems. Updated existing readers to use TFile API instead of FILE. This does not yet add big-endian capability to LSTM, but it is very easy from here. 2017-04-27 15:48:23 -07:00
Ray Smith
f566a45b30 clang-tidy changes from sync 2017-01-25 16:20:19 -08:00
Ray Smith
b453f74e01 Fixed issue #633 (multi-language mode) 2017-01-25 15:58:39 -08:00
zdenop
c768b5867d Merge pull request #668 from Wikinaut/chg-textonly-pdf-parameter-description
Improve textonly_pdf parameter description
2017-01-21 16:29:06 +01:00
Wikinaut
c03299e2b4 Improve textonly_pdf parameter description 2017-01-21 16:18:53 +01:00
Wikinaut
98df78ca8a fix typo in parameter description 2017-01-21 10:48:25 +01:00
Zdenko Podobný
effa5741e6 Implement invisible text only for PDF 2017-01-20 21:26:34 +01:00
Wikinaut
39274d8000 typo correction "specific" 2017-01-13 04:17:32 +01:00
Simon Strandgaard
d38cffc332 Fixed typo 2016-12-15 14:58:53 +00:00
Ray Smith
9f5ba9105f Removed dependency on cube from the code 2016-12-14 10:55:15 -08:00
Ray Smith
13e46ae1c4 Made LSTM the default engine, pushed cube out 2016-12-13 14:37:40 -08:00
Ray Smith
5deebe6c27 Fixed multilang for LSTM, pushed cube to one side without actually deleting it 2016-12-05 14:41:43 -08:00
Ray Smith
c1c1e426b3 Added new LSTM-based neural network line recognizer 2016-11-07 15:38:07 -08:00
Ray Smith
2c837dffc3 Result of clang tidy on recent merge 2016-11-07 10:46:33 -08:00
Tom Morris
6700edd8bc Cleanup TSV renderer
Remove all references to hocr, hocr.tsv, etc. Remove dead code for font
info, input filename, HTML escapes. Improved comments. Fixed
indentation.
2016-03-01 13:41:19 -05:00
Sundar M. Vaidya
738fe4f757 Adds BoolParam tessedit_create_hocrtsv in class Tesseract. 2016-03-01 12:30:39 -05:00
amitdo
c2f5e9b849 If there is no explicit renderer(s), default to TessTextRenderer
Revert fd429c32, 43834da7, 05de195e.

See #49, #59.

The code in this commit solves the issue in a more elegant way, IMHO.

Now you can use:
  * `tesseract eurotext.tif eurotext txt pdf`
  * `tesseract eurotext.tif eurotext txt hocr`
  * `tesseract eurotext.tif eurotext txt hocr pdf`

NOTE:
  With `tesseract eurotext.tif eurotext`
  or `tesseract eurotext.tif eurotext txt`
  the psm will be set to '3', but...
  With `tesseract eurotext.tif eurotext txt pdf`
  or `tesseract eurotext.tif eurotext txt hocr`
  the psm will be set to '1'.
2015-12-11 19:06:49 +02:00
Stefan Weil
318b88daa6 ccmain: Fix typos in comments and strings
Most of them were found by codespell.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2015-09-14 21:59:16 +02:00
Zdenko Podobný
41478fd5a1 implement build without cube (-DNO_CUBE_BUILD) 2015-07-24 11:51:44 +02:00
Ray Smith
0e868ef377 Major change to improve layout analysis for heavily diacritic languages:
Tha, Vie, Kan, Tel etc.
There is a new overlap detector that detects when diacritics
cause a big increase in textline overlap. In such cases, diacritics from
overlap regions are kept separate from layout analysis completely, allowing
textline formation to happen without them. The diacritics are then assigned
to 0, 1 or 2 close words at the end of layout analysis, using and modifying
an old noise detection data path.
The stored diacritics are used or not during recognition according to the
character classifier's liking for them.
2015-05-12 16:47:02 -07:00
Ray Smith
4a3caefd92 Add ability to build under android (without cube or scrollview). 2015-05-12 15:41:15 -07:00
Zdenko Podobný
4c7c960bfd fix issue 1417 2015-02-07 22:22:20 +01:00
Zdenko Podobný
36883b4faf preserve interword spaces patch - Issue 1409 2015-01-27 22:58:04 +01:00
Ray Smith
f927728169 Fixed issue 1207 2014-10-09 13:28:03 -07:00
Zdenko Podobný
d0cb1071b2 remove parameters tessedit_pdf_jpg_quality, tessedit_pdf_compression (reasons are in i1300 and i1285) 2014-10-07 23:37:34 +02:00
Ray Smith
55d11ad3c2 Moved params from global in page layout to tesseractclass, improved single column layout analysis 2014-10-07 09:31:00 -07:00
Zdenko Podobný
9e8629d9ef allow multiple output in tesseract executable (https://groups.google.com/d/msg/tesseract-ocr/Z_WUKmJDVxc/1vc3W0xJZ2oJ) 2014-09-19 23:33:47 +02:00
Zdenko Podobný
ff87944171 fix typo 2014-09-07 18:23:47 +02:00
Zdenko Podobný
d1aa61c110 fix issue 1285: reimplement option to select pdf compression 2014-09-06 09:32:22 +02:00
Ray Smith
09b439b05a Fixed issue 1241, but disabled due to making accuracy worse 2014-08-13 13:33:10 -07:00
theraysmith@gmail.com
dbf6197471 Major refactor of control.cpp to enable line recognition
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1147 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-11 23:23:06 +00:00
zdenop
6941bffbd2 fix typo
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1135 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-09 17:53:57 +00:00
zdenop
bce2cd5f33 enable to select pdf compression type and jpeg quality (fix issue 1263)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1134 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-08 21:18:44 +00:00
zdenop
1156098567 Add font info to hocr output - fix issue 1219
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1132 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-08-03 16:22:12 +00:00
theraysmith@gmail.com
d2ad450502 Added PDF renderer
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@957 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-09 17:47:34 +00:00
theraysmith@gmail.com
7ec4fd7a56 Refactored control functions to enable parallel blob classification
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@904 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-11-08 20:30:56 +00:00
theraysmith@gmail.com
2aafc9df24 Improved sub/superscript treatment
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@872 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-09-20 19:49:47 +00:00
theraysmith@gmail.com
3a998fe7ac Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic, Added paragraph detection in layout analysis/post OCR, Fixed inconsistent xheight during training and over-chopping, Added simultaneous multi-language capability, Refactored top-level word recognition module, Fixed problems with internally scaled images
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@651 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-02 02:59:49 +00:00
zdenop@gmail.com
da41b96f7f removed check for libtiff - leptonica is required; cleanup #ifdef/#ifndef HAVE_LIBLEPT
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@624 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2011-08-30 06:34:41 +00:00
theraysmith
3e8c0bc228 Various fixes, including memory leak in fixspace, font labels on output, removed some annoying debug output, fixes to initialization of parameters, general cleanup, and added Hindi
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@567 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2011-03-21 21:44:05 +00:00
theraysmith
c8465252e4 Rewrite of DENORM
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@538 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2010-11-30 01:05:48 +00:00
zdenop@gmail.com
4523ce9f7d 3.01 code from http://github.com/jimregan/tesseract-ocr with adaptations related to Linux and Windows (VC2008) compile process
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@526 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2010-11-23 18:34:14 +00:00
theraysmith
96e8b51feb More changes to ccmain for 3.00
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@287 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2009-07-11 02:07:25 +00:00