tesseract

mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-12-03 09:24:24 +08:00

Author	SHA1	Message	Date
Ray Smith	44122698d7	Removed debug messages, forward compatibility of traineddata files, further bug fix.	2015-07-09 14:50:25 -07:00
Ray Smith	2924d3ae15	Changes missed from diacritic fix edit	2015-05-12 17:28:56 -07:00
Ray Smith	84920b92b3	Font and classifier output structure cleanup. Font recognition was poor, due to forcing a 1st and 2nd choice at a character level, when the total score for the correct font is often correct at the word level, so allowed the propagation of a full set of fonts and scores to the word recognizer, which can now decide word level fonts using the scores instead of simple votes. Change precipitated a cleanup of output data structures for classifier results, eliminating ScoredClass and INT_RESULT_STRUCT, with a few extra elements going in UnicharRating, and using that wherever possible. That added the extra complexity of 1-rating due to a flip between 0 is good and 0 is bad for the internal classifier scores before they are converted to rating and certainty.	2015-05-12 17:24:34 -07:00
Ray Smith	0e868ef377	Major change to improve layout analysis for heavily diacritic languages: Tha, Vie, Kan, Tel etc. There is a new overlap detector that detects when diacritics cause a big increase in textline overlap. In such cases, diacritics from overlap regions are kept separate from layout analysis completely, allowing textline formation to happen without them. The diacritics are then assigned to 0, 1 or 2 close words at the end of layout analysis, using and modifying an old noise detection data path. The stored diacritics are used or not during recognition according to the character classifier's liking for them.	2015-05-12 16:47:02 -07:00
Ray Smith	b6d0184806	Fixed problems with shifted baselines so recognition can recover from layout analysis errors.	2015-05-12 15:53:45 -07:00
Ray Smith	25d0968d09	Major refactor to improve speed on difficult images, especially when running a heap checker. SEAM and SPLIT have been begging for a refactor for a LONG time. This change does most of the work of turning them into proper classes: Moved relevant code into SEAM/SPLIT/TBLOB/EDGEPT etc from global helper functions. Made the splits full data members of SEAM in an array instead of 3 separate pointers. This greatly reduces the amount of new/delete happening in the chopper, which is the main goal. Deleted redundant files: olutil., makechop. Brought other code into SEAM in order to keep its data members private with only priority having accessors.	2015-05-12 14:59:14 -07:00
Ray Smith	2f197cd653	Fixed issues 899/1220/1246 (mixed eng+ara)	2014-09-17 18:27:49 -07:00
theraysmith@gmail.com	dbf6197471	Major refactor of control.cpp to enable line recognition git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1147 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2014-08-11 23:23:06 +00:00
theraysmith@gmail.com	7ec4fd7a56	Refactored control functions to enable parallel blob classification git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@904 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2013-11-08 20:30:56 +00:00
theraysmith@gmail.com	4d514d5a60	Major refactor of beam search, elimination of dead code, misc bug fixes, updates to Makefile.am, Changelog etc. git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@878 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2013-09-23 15:26:50 +00:00
zdenop@gmail.com	10c1169d98	remove unused code (Windows related) git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@860 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2013-07-08 18:21:10 +00:00
david.eger@gmail.com	018f192fc2	Abolish populate_unichars(), fixing seg fault reported in Debian: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658634 git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@675 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2012-02-15 01:37:00 +00:00
theraysmith@gmail.com	9206e92b0d	Added simultaneous multi-language capability, Refactored top-level word recognition module, Blamer module added for error analysis, Tidied up constraints on control parameters, Added UNICHARSET to WERD_CHOICE to make multi-language handling easier, Added word bigram correction git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@655 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2012-02-02 03:06:39 +00:00
theraysmith	82b1b201fc	Various fixes, including memory leak in fixspace, font labels on output, removed some annoying debug output, fixes to initialization of parameters, general cleanup, and added Hindi git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@568 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2011-03-21 21:44:45 +00:00
theraysmith	137f4806b6	Added sub/superscript, small/dropcap detection git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@547 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2010-12-09 01:32:20 +00:00
theraysmith	47dc322437	Removed serialise and NEWDELETE macro git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@531 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2010-11-30 00:56:39 +00:00
zdenop@gmail.com	4523ce9f7d	3.01 code from http://github.com/jimregan/tesseract-ocr with adaptations related to Linux and Windows (VC2008) compile process git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@526 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2010-11-23 18:34:14 +00:00
theraysmith	5c964ea6da	More harmless improvements from 3.00 in 2.04 git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@217 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2008-12-30 21:31:01 +00:00
theraysmith	c4f4840fbe	Fixed name collision with jpeg library git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@163 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2008-04-22 00:41:37 +00:00
theraysmith	6ae6c0a042	Made some preliminary changes for improving xheights git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@107 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2007-08-30 18:20:10 +00:00
tmbdev	425d593ebe	top-skimming import from sf.net git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk/trunk@2 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2007-03-07 20:03:40 +00:00

21 Commits