Commit Graph

1555 Commits

Author SHA1 Message Date
Zdenko Podobný
438edd6c7b added row attributes to hocr output 2015-05-17 22:13:59 +02:00
Zdenko Podobný
917e994caa extend ETEXT_DESC by progress_callback 2015-05-17 21:56:40 +02:00
Zdenko Podobný
ed6ae9b974 Add monitor to GetHOCRText 2015-05-17 21:55:50 +02:00
Henrik Feldt
a0ea634e15 [infra] README -> README.md, links 2015-05-16 19:19:54 +02:00
Henrik Feldt
03c29f96d8 [infra] updating readme 2015-05-16 19:10:10 +02:00
Zdenko Podobný
59bcbc79b3 fix GIT_VER info in VS2010 2015-05-15 15:14:49 +02:00
Zdenko Podobný
e98849b482 print error message when pdf.ttf is not found. 2015-05-15 15:14:00 +02:00
Jim O'Regan
e7b087ffe6 update Doxyfile 2015-05-14 13:43:07 +01:00
Zdenko Podobný
aec22a47ec fix autotools c++11 issue with disabled training 2015-05-14 14:25:49 +02:00
Zdenko Podobný
1d6de86150 fix VS2010 linking error 2015-05-14 14:24:55 +02:00
Zdenko Podobný
035b324f0f reflect the latest commits in VS2010 build 2015-05-14 10:52:54 +02:00
Ray Smith
941d87057e Fixed training build 2015-05-13 17:46:58 -07:00
Ray Smith
81b67f7ed9 Removed debug logging that doesn't belong 2015-05-13 17:12:23 -07:00
Ray Smith
d91df9856b Fixed crash on debugging classifier with a shapetable present 2015-05-13 17:10:23 -07:00
Ray Smith
4598061324 Fixed infinite loop in training due to poor clipping of the table filler 2015-05-13 17:09:35 -07:00
Ray Smith
5bb0d89291 Improved debug of class pruner 2015-05-13 17:07:11 -07:00
zhivko.tabakov@gmail.com
07be522e43 Issue 1351: OpenCL build - kernel_ThresholdRectToPix() not accounting for padding bits in the output pix?!
https://code.google.com/p/tesseract-ocr/issues/detail?id=1351

What steps will reproduce the problem?
1. Use a tesseract build with OpenCL.
2. Pass a full color image whose width is not a multiple of 32.
3. Recognition is way too slow and does not recognize anything.
I read the article on http://www.sk-spell.sk.cx/tesseract-meets-the-opencl-first-test and decided to give OCL a try. The initial result was as per point 3 above. After some debugging I figured the problem is that the OCL version of threshold rect generation does not account for padding bits in the output pix lines. To prove my discovery I made a quick fix in oclkernels.h replacing the definition of kernel_ThresholdRectToPix

Just a reminder: it is necessary to force OCL kernel recompilation after changing this source (e.g. delete “kernel - <device>.bin” from the exec folder).
The fix is working but I am not sure about it since the original source apparently works for other people (as per the article). If I am right the OS/GPU are irrelevant since the bug is algorithmic, but mine are Windows/AMD. A similar fix is also applicable to kernel_ThresholdRectToPix_OneChan(), but there the input array might have some padding bytes as well, so its indexing will need further adjustments. I can come up with some proof/fix for it too; I have not played with it yet.
Disclaimer: I have no prior experience with image processing and tesseract source or with GPU computing and OpenCL (but please do explain if I am wrong).
2015-05-13 21:23:23 +01:00
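The padding problem reported in Issue 1351 can be illustrated on the CPU. The following is only a minimal sketch, not the OpenCL kernel or the submitted fix: `WordsPerLine` and `PackBinary` are hypothetical helper names, and the layout assumed is the Leptonica convention for 1 bpp images (each line padded to a 32-bit word boundary, leftmost pixel in the most significant bit).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of 32-bit words per line of a 1 bpp image whose lines are padded
// to a word boundary.
inline int WordsPerLine(int width) { return (width + 31) / 32; }

// Hypothetical helper: packs a row-major byte mask (nonzero = foreground)
// into padded 1 bpp lines, leftmost pixel in the most significant bit.
std::vector<uint32_t> PackBinary(const std::vector<uint8_t>& mask,
                                 int width, int height) {
  assert(mask.size() == static_cast<std::size_t>(width) * height);
  const int wpl = WordsPerLine(width);
  std::vector<uint32_t> out(static_cast<std::size_t>(wpl) * height, 0u);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      if (mask[static_cast<std::size_t>(y) * width + x] != 0) {
        // Index the output by y * wpl. Indexing by the unpadded pixel
        // count, (y * width + x) / 32, is only equivalent when width is a
        // multiple of 32.
        out[static_cast<std::size_t>(y) * wpl + x / 32] |=
            0x80000000u >> (x % 32);
      }
    }
  }
  return out;
}
```

For a 40-pixel-wide image, each line occupies two words, so pixel (33, 1) lands in word 3; an index computed from the unpadded pixel count would put it in word 2, which is the kind of mismatch the report describes for widths that are not a multiple of 32.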
Ray Smith
1e3b671298 Fixes to make yesterday's changes compile 2015-05-13 09:58:59 -07:00
Ray Smith
7bc6d3e059 Merge remote-tracking branch 'refs/remotes/origin/master'
Updating from master.
2015-05-13 09:06:44 -07:00
Ray Smith
c34dea6543 Missing from 25d0968 2015-05-13 09:05:08 -07:00
Jim O'Regan
a94943cc1f remove unneeded comment from commit 2015-05-13 14:59:02 +01:00
oriahulrich@microvu.com
d3252f926e Issue 1316: The traineddata file must be closed after it was opened 2015-05-13 14:53:37 +01:00
Jim O'Regan
b13691fda0 Merge conflict: going with Ray's version 2015-05-13 08:54:28 +01:00
Ray Smith
03f3c9dc88 Misc fixes missed from previous commits 2015-05-12 18:13:15 -07:00
Ray Smith
b2a3924585 Major updates to training system as a result of extensive testing on 100 languages - makefile.am 2015-05-12 18:08:39 -07:00
Ray Smith
6be25156f7 Major updates to training system as a result of extensive testing on 100 languages 2015-05-12 18:04:31 -07:00
Ray Smith
21805e63a4 Improved performance with PIC compilation option 2015-05-12 17:56:04 -07:00
Ray Smith
164897210a Improved newlines and spaces in a box file so it works better with RTL languages. 2015-05-12 17:51:03 -07:00
Ray Smith
6b634170c1 Significant change to invisible font system
to improve correctness and compatibility with
external programs, particularly ghostscript.
We will start mapping everything to a single glyph,
rather than allowing characters to run off the end
of the font.

A more detailed design discussion is embedded into
pdfrenderer.cpp comments. The font, source code
that produces the font, and the design comments
were contributed by Ken Sharp from Artifex Software.
2015-05-12 17:33:18 -07:00
Ray Smith
2924d3ae15 Changes missed from diacritic fix edit 2015-05-12 17:28:56 -07:00
Ray Smith
84920b92b3 Font and classifier output structure cleanup.
Font recognition was poor, due to forcing a 1st and 2nd choice at
a character level, when the total score for the correct font is often
correct at the word level, so allowed the propagation of a full set
of fonts and scores to the word recognizer, which can now decide word
level fonts using the scores instead of simple votes.

Change precipitated a cleanup of output data structures for classifier
results, eliminating ScoredClass and INT_RESULT_STRUCT, with a few
extra elements going in UnicharRating, and using that wherever possible.
That added the extra complexity of 1-rating due to a flip between 0 is
good and 0 is bad for the internal classifier scores before they are
converted to rating and certainty.
2015-05-12 17:24:34 -07:00
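The difference between character-level votes and word-level scores described in 84920b92b3 can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: `FontScore`, `FontByVotes` and `FontByScoreSum` are hypothetical names, font ids are assumed to be non-negative, and a higher score is assumed to be better.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical per-character result: one score for each candidate font.
struct FontScore {
  int font_id;   // assumed non-negative
  float score;   // assumed: higher is better
};
using CharFonts = std::vector<FontScore>;

// Returns the key with the largest value, or -1 if the map is empty.
int ArgMax(const std::map<int, float>& totals) {
  int best = -1;
  float best_value = 0.0f;
  for (const auto& entry : totals) {
    if (best < 0 || entry.second > best_value) {
      best = entry.first;
      best_value = entry.second;
    }
  }
  return best;
}

// Character-level decision: each character votes for its top font only.
int FontByVotes(const std::vector<CharFonts>& word) {
  std::map<int, float> votes;
  for (const CharFonts& ch : word) {
    if (ch.empty()) continue;
    std::size_t top = 0;
    for (std::size_t i = 1; i < ch.size(); ++i) {
      if (ch[i].score > ch[top].score) top = i;
    }
    votes[ch[top].font_id] += 1.0f;
  }
  return ArgMax(votes);
}

// Word-level decision: the full set of scores is summed across characters.
int FontByScoreSum(const std::vector<CharFonts>& word) {
  std::map<int, float> totals;
  for (const CharFonts& ch : word) {
    for (const FontScore& fs : ch) totals[fs.font_id] += fs.score;
  }
  return ArgMax(totals);
}
```

If two characters narrowly prefer font 0 (0.51 against 0.49) and a third strongly prefers font 1 (0.90 against 0.10), the vote picks font 0 while the summed scores pick font 1.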
Ray Smith
0e868ef377 Major change to improve layout analysis for heavily diacritic languages:
Tha, Vie, Kan, Tel etc.
There is a new overlap detector that detects when diacritics
cause a big increase in textline overlap. In such cases, diacritics from
overlap regions are kept separate from layout analysis completely, allowing
textline formation to happen without them. The diacritics are then assigned
to 0, 1 or 2 close words at the end of layout analysis, using and modifying
an old noise detection data path.
The stored diacritics are used or not during recognition according to the
character classifier's liking for them.
2015-05-12 16:47:02 -07:00
Ray Smith
b6d0184806 Fixed problems with shifted baselines so recognition can recover from layout analysis errors. 2015-05-12 15:53:45 -07:00
Ray Smith
4a3caefd92 Add ability to build under android (without cube or scrollview). 2015-05-12 15:41:15 -07:00
Ray Smith
2eec979577 Makefile.am for fix to issue 1252 2015-05-12 15:25:00 -07:00
Ray Smith
53fc4456cc Fixed issue 1252: Refactored LearnBlob and its call hierarchy to make it a member of Classify.
Eliminated the flexfx scheme for calling global feature extractor functions
through an array of function pointers.
Deleted dead code I found as a by-product.
This CL does not change BlobToTrainingSample or ExtractFeatures to be full
members of Classify (the eventual goal) as that would make it even bigger,
since there are a lot of callers to these functions.
When ExtractFeatures and BlobToTrainingSample are members of Classify they
will be able to access control parameters in Classify, which will greatly
simplify developing variations to the feature extraction process.
2015-05-12 15:22:34 -07:00
Ray Smith
e735a9017b Makefile.am change for Split/seam refactor 2015-05-12 15:05:56 -07:00
Ray Smith
25d0968d09 Major refactor to improve speed on difficult images, especially when running
a heap checker.
SEAM and SPLIT have been begging for a refactor for a *LONG* time.
This change does most of the work of turning them into proper classes:
  Moved relevant code into SEAM/SPLIT/TBLOB/EDGEPT etc from global helper functions.
  Made the splits full data members of SEAM in an array instead of 3 separate pointers.
    This greatly reduces the amount of new/delete happening in the chopper, which is the main goal.
  Deleted redundant files: olutil.*,  makechop.*
  Brought other code into SEAM in order to keep its data members private with only priority having accessors.
2015-05-12 14:59:14 -07:00
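The idea of holding splits as data members in an array rather than behind separate pointers can be sketched as below. The `Split` and `Seam` types here are hypothetical stand-ins, not the classes from the commit, and the capacity of 3 is an assumption taken from the "3 separate pointers" wording.

```cpp
// Hypothetical split record; the real SPLIT holds edge points.
struct Split {
  int from_point;
  int to_point;
};

// Hypothetical seam that stores its splits inline, so adding a split
// copies into a fixed array instead of calling new/delete.
class Seam {
 public:
  static constexpr int kMaxSplits = 3;  // assumed capacity

  explicit Seam(float priority) : priority_(priority), num_splits_(0) {}

  // Copies the split into the inline array; returns false when full.
  bool AddSplit(const Split& split) {
    if (num_splits_ >= kMaxSplits) return false;
    splits_[num_splits_++] = split;
    return true;
  }

  // Only the priority is exposed through accessors.
  float priority() const { return priority_; }
  void set_priority(float priority) { priority_ = priority; }

 private:
  float priority_;
  int num_splits_;
  Split splits_[kMaxSplits];
};
```

With this layout a seam is a single object with no heap allocation per split, which is the reduction in new/delete traffic the commit message names as the main goal.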
Zdenko Podobný
d508751e58 Fixed issue 1317 - git revision info used as version info for autotools & DEBUG 2015-05-02 12:15:13 +02:00
Zdenko Podobný
d1c749f6ad Fixed issue 1133 - part3 (Nick's replacement of InputBuffer-ReadLine with InputBuffer-Read) 2015-05-01 19:33:56 +02:00
Zdenko Podobný
5e754af9cb Fixed issue 1133 - part2 2015-05-01 19:12:03 +02:00
Zdenko Podobný
53eab2ee92 fix issue 1354 2015-04-15 22:37:58 +02:00
Zdenko Podobný
370f1c65ad fix issue 1436 2015-04-12 16:38:03 +02:00
Zdenko Podobný
4c7c960bfd fix issue 1417 2015-02-07 22:22:20 +01:00
Zdenko Podobný
09b0c91fc9 fix Issue 1398 2015-02-06 23:44:58 +01:00
Zdenko Podobný
15d48361b4 fix VS2010 build; 2015-02-05 17:27:18 +01:00
Zdenko Podobný
9bca55c73b fix space issue in revision 36883b4faf 2015-01-30 22:24:26 +01:00
Zdenko Podobný
36883b4faf preserve interword spaces patch - Issue 1409 2015-01-27 22:58:04 +01:00
Zdenko Podobný
e0441d0c6b fix typo/ issue 1397 2014-12-31 22:31:50 +01:00
Zdenko Podobný
473141c1de fix bool in c-api 2014-12-28 17:55:56 +01:00