Commit Graph

1022 Commits

Author SHA1 Message Date
Ray Smith
d174c4fd33 Fixed occurrence of small rotated blocks in loosely spaced text part 2 2015-06-12 11:12:06 -07:00
Ray Smith
b1d99dfe23 Added a backup adaptive classifier to take over from primary when it fills on a large document 2015-06-12 11:10:53 -07:00
Ray Smith
78b5e1a77d Fixed occurrence of small rotated blocks in loosely spaced text 2015-06-12 11:05:00 -07:00
Ray Smith
d74c625e52 Fixed blob division params to fix CJK training speed. 2015-06-12 10:59:26 -07:00
Ray Smith
4c7ab0caea Fixed font lists, improved wordlist management 2015-06-12 10:56:40 -07:00
Ray Smith
ab0f4e2c38 Clang fixes to earlier changes and build compatibility with Google environment 2015-06-12 10:53:21 -07:00
zdenop
3ba1f83eb1 Merge pull request #36 from jan-ruzicka/patch-2
ChangeLog reformatting for consistent ordering
2015-06-11 09:50:38 +02:00
Jan Ruzicka
953c563efb change order of entries V1.0 ... V2.04
This is to have the newest on top ordering of revisions.
2015-06-11 01:34:45 -04:00
Jan Ruzicka
36740897e0 convert date formats 2015-06-11 01:27:11 -04:00
Jan Ruzicka
42481f2cf4 uniform bullet formatting 2015-06-10 22:52:37 -04:00
zdenop
10ea4f0636 Merge pull request #35 from jan-ruzicka/patch-1
more link updates
2015-06-02 21:29:47 +02:00
Jan Ruzicka
f89c7808cf more link updates
modifying link to training from google code and adding link to documentation by Doxygen.
2015-06-02 14:12:42 -04:00
zdenop
8faea4bf06 Update README.md
fix links to wiki
2015-06-02 09:56:55 +02:00
Zdenko Podobný
fc793355a8 Move pdf documents to docs repository 2015-05-22 22:10:31 +02:00
Zdenko Podobný
b1b02572ab Merge branch 'Issue1474'
* Issue1474:
  Fix potential null pointer dereference in ccmain/paragraphs.cpp.
2015-05-22 21:19:14 +02:00
Zdenko Podobný
d8a55d739d Fix potential null pointer dereference in ccmain/paragraphs.cpp. 2015-05-22 21:17:33 +02:00
zdenop
e4136f28a5 Merge pull request #33 from rmtheis/tweak-readme
Minor edits to Readme
2015-05-22 08:25:44 +02:00
Robert Theis
a36a5f96d0 Minor edits to Readme 2015-05-21 19:36:50 -07:00
zdenop
f8ebff262e Merge pull request #32 from orbitcowboy/master
Fix potential null pointer dereference in ccmain/paragraphs.cpp.
2015-05-20 19:01:13 +02:00
orbitcowboy
9328f0e5d4 Fix potential null pointer dereference in ccmain/paragraphs.cpp. 2015-05-19 10:17:44 +02:00
Jim Regan
05acff6253 Merge pull request #23 from tesseract-ocr/training-sh
/usr/share/fonts is the wrong path on Mac
2015-05-18 14:05:44 +01:00
Jim O'Regan
16ac3b0a20 /usr/share/fonts is the wrong path on Mac 2015-05-18 09:53:14 +01:00
zdenop
e9f59351de Merge pull request #19 from haf/feature/readme-improvement
[infra] updating readme
2015-05-18 08:46:46 +02:00
Henrik Feldt
a0ea634e15 [infra] README -> README.md, links 2015-05-16 19:19:54 +02:00
Henrik Feldt
03c29f96d8 [infra] updating readme 2015-05-16 19:10:10 +02:00
Zdenko Podobný
59bcbc79b3 fix GIT_VER info in VS2010 2015-05-15 15:14:49 +02:00
Zdenko Podobný
e98849b482 Print error message when pdf.ttf is not found. 2015-05-15 15:14:00 +02:00
Jim O'Regan
e7b087ffe6 update Doxyfile 2015-05-14 13:43:07 +01:00
Zdenko Podobný
aec22a47ec fix autotools c++11 issue with disabled training 2015-05-14 14:25:49 +02:00
Zdenko Podobný
1d6de86150 fix VS2010 linking error 2015-05-14 14:24:55 +02:00
Zdenko Podobný
035b324f0f reflect the latest commits in VS2010 build 2015-05-14 10:52:54 +02:00
Ray Smith
941d87057e Fixed training build 2015-05-13 17:46:58 -07:00
Ray Smith
81b67f7ed9 Removed debug logging that doesn't belong 2015-05-13 17:12:23 -07:00
Ray Smith
d91df9856b Fixed crash on debugging classifier with a shapetable present 2015-05-13 17:10:23 -07:00
Ray Smith
4598061324 Fixed infinite loop in training due to poor clipping of the table filler 2015-05-13 17:09:35 -07:00
Ray Smith
5bb0d89291 Improved debug of class pruner 2015-05-13 17:07:11 -07:00
Ray Smith
1e3b671298 Fixes to make yesterday's changes compile 2015-05-13 09:58:59 -07:00
Ray Smith
7bc6d3e059 Merge remote-tracking branch 'refs/remotes/origin/master'
Updating from master.
2015-05-13 09:06:44 -07:00
Ray Smith
c34dea6543 Missing from 25d0968 2015-05-13 09:05:08 -07:00
Jim O'Regan
b13691fda0 Merge conflict: going with Ray's version 2015-05-13 08:54:28 +01:00
Ray Smith
03f3c9dc88 Misc fixes missed from previous commits 2015-05-12 18:13:15 -07:00
Ray Smith
b2a3924585 Major updates to training system as a result of extensive testing on 100 languages - makefile.am 2015-05-12 18:08:39 -07:00
Ray Smith
6be25156f7 Major updates to training system as a result of extensive testing on 100 languages 2015-05-12 18:04:31 -07:00
Ray Smith
21805e63a4 Improved performance with PIC compilation option 2015-05-12 17:56:04 -07:00
Ray Smith
164897210a Improved newlines and spaces in a box file so it works better with RTL languages. 2015-05-12 17:51:03 -07:00
Ray Smith
6b634170c1 Significant change to invisible font system
to improve correctness and compatibility with
external programs, particularly ghostscript.
We will start mapping everything to a single glyph,
rather than allowing characters to run off the end
of the font.

A more detailed design discussion is embedded into
pdfrenderer.cpp comments. The font, source code
that produces the font, and the design comments
were contributed by Ken Sharp from Artifex Software.
2015-05-12 17:33:18 -07:00
Ray Smith
2924d3ae15 Changes missed from diacritic fix edit 2015-05-12 17:28:56 -07:00
Ray Smith
84920b92b3 Font and classifier output structure cleanup.
Font recognition was poor, due to forcing a 1st and 2nd choice at
a character level, when the total score for the correct font is often
correct at the word level, so allowed the propagation of a full set
of fonts and scores to the word recognizer, which can now decide word
level fonts using the scores instead of simple votes.

Change precipitated a cleanup of output data structures for classifier
results, eliminating ScoredClass and INT_RESULT_STRUCT, with a few
extra elements going in UnicharRating, and using that wherever possible.
That added the extra complexity of 1-rating due to a flip between 0 is
good and 0 is bad for the internal classifier scores before they are
converted to rating and certainty.
2015-05-12 17:24:34 -07:00
Ray Smith
0e868ef377 Major change to improve layout analysis for heavily diacritic languages:
Tha, Vie, Kan, Tel etc.
There is a new overlap detector that detects when diacritics
cause a big increase in textline overlap. In such cases, diacritics from
overlap regions are kept separate from layout analysis completely, allowing
textline formation to happen without them. The diacritics are then assigned
to 0, 1 or 2 close words at the end of layout analysis, using and modifying
an old noise detection data path.
The stored diacritics are used or not during recognition according to the
character classifier's liking for them.
2015-05-12 16:47:02 -07:00
Ray Smith
b6d0184806 Fixed problems with shifted baselines so recognition can recover from layout analysis errors. 2015-05-12 15:53:45 -07:00