Commit Graph

1197 Commits

Author SHA1 Message Date
david.eger@gmail.com
d9d70919bb Conform to the hocr spec: hocr doesn't have ocr_word, but instead has ocrx_word.
Tested with ExactImage's hocr2pdf. 
$ tesseract phototest.tif phototest hocr
$ hocr2pdf -i phototest.tif -o ./phototest.pdf < ./phototest.hocr 
$ evince phototest.pdf 

See: https://docs.google.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0 



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@726 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-05-25 17:36:25 +00:00
david.eger@gmail.com
eeeb4f513c Provide better paragraph segmentation without having to run fully
automatic layout analysis.



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@725 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-05-10 00:03:34 +00:00
zdenop@gmail.com
e606c311f5 fix Issue 684: show correct line in failure message "Couldn't find a matching blob"
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@723 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-04-22 20:51:00 +00:00
zdenop@gmail.com
d39cb38ab8 Fix Issue 678
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@722 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-04-17 17:32:42 +00:00
david.eger@gmail.com
56403c6dc3 Fix an issue where we sometimes leave a dangling outline->loop pointer
during chopping.



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@721 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-04-17 00:02:52 +00:00
david.eger@gmail.com
71b3200625 Fix a shapetable serialization issue -- sizeof(bool) is not portable.
See http://code.google.com/p/tesseract-ocr/issues/detail?id=669



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@720 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-04-17 00:00:26 +00:00
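
The portability point in the commit above (sizeof(bool) differs between compilers, so writing raw bools produces files that other builds may misread) can be illustrated with a fixed-width encoding. This is only a minimal sketch in Python: the function names and the layout (a little-endian 32-bit count followed by one byte per flag) are assumptions made for illustration, not Tesseract's actual shapetable format.

```python
import struct

def pack_flags(flags):
    """Serialize booleans with an explicit width: a little-endian uint32
    count, then exactly one byte (0 or 1) per flag."""
    return struct.pack("<I", len(flags)) + bytes(1 if f else 0 for f in flags)

def unpack_flags(data):
    """Inverse of pack_flags; rejects truncated input."""
    (count,) = struct.unpack_from("<I", data, 0)
    body = data[4:4 + count]
    if len(body) != count:
        raise ValueError("truncated flag data")
    return [b != 0 for b in body]
```

Because every field has a stated width and byte order, the bytes on disk do not depend on how a particular compiler or platform represents a boolean.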
david.eger@gmail.com
a253ea224a Add some documentation on how to use config files and user dictionaries.
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@719 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-04-09 19:41:06 +00:00
zdenop@gmail.com
aa14e8b212 fix Mingw shared build
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@718 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-04-02 12:14:37 +00:00
zdenop@gmail.com
c2d5616a7e add Doxyfile (doxygen config) to distribution
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@717 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-04-02 10:52:13 +00:00
zdenop@gmail.com
cd8de9157c change comments to doxygen block comments (api)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@716 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-30 21:24:12 +00:00
zdenop@gmail.com
5958f01f5f fix doxygen warnings
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@715 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-30 15:42:06 +00:00
david.eger@gmail.com
4f0ff358a7 Missing close bracket.
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@714 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-29 06:15:33 +00:00
david.eger@gmail.com
4ddb3e5941 Good moming, Good aftemoon.
During our initial chopping for each word, pay attention to whether a
dangerous ambiguity (like rn <-> m) would lead us to a dictionary word.
If so, make sure that blob gets chopped so that we can evaluate said
dictionary word during the segmentation search.

Large accuracy improvement, especially on English printed books (~9%).



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@713 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-28 21:02:54 +00:00
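
The idea in the commit above (check whether a dangerous ambiguity such as rn <-> m would turn the current reading into a dictionary word, and if so force that blob to be chopped) can be sketched roughly as follows. The ambiguity table, the function names, and the representation of a word as a plain string are all hypothetical; Tesseract's own chopper operates on blobs and outlines, not on characters.

```python
# Hypothetical table: a blob read as the key may really be the value.
AMBIGUITIES = {"m": "rn"}

def dangerous_alternatives(word, ambiguities=AMBIGUITIES):
    """Yield (index, alternative_word) for each single-glyph substitution."""
    for i, ch in enumerate(word):
        if ch in ambiguities:
            yield i, word[:i] + ambiguities[ch] + word[i + 1:]

def blobs_to_chop(word, dictionary):
    """Return indices of glyphs whose ambiguous reading gives a dictionary word."""
    return [i for i, alt in dangerous_alternatives(word) if alt in dictionary]

if __name__ == "__main__":
    words = {"morning", "afternoon"}
    print(blobs_to_chop("moming", words))
    print(blobs_to_chop("aftemoon", words))
```

In this toy version only the second "m" of "moming" is flagged, since replacing the first one does not produce a dictionary word.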
zdenop@gmail.com
ee44165d3d improve doxygen config; fix doxygen warnings for baseapi.h
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@712 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-28 20:38:14 +00:00
david.eger@gmail.com
0d5e8b5cb6 Recording segmentation state for a choice at LogNewChoice() time was a
bad idea -- a VIABLE_CHOICE's Blob->NumChunks can be modified as we go
by a call from Dict::LogNewSplit().  Relying on the auxiliary
segmentation_state makes alt choices sometimes reference the wrong
blobs.



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@711 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-28 20:11:57 +00:00
zdenop@gmail.com
3f9032ef0c fix 'make dist' for MinGW+MSYS
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@710 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-24 16:33:11 +00:00
zdenop@gmail.com
3115fbfdcb another fix MinGW+MSYS
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@709 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-24 10:14:47 +00:00
zdenop@gmail.com
d4d4b8aad8 improve autotools system (mingw+msys fix); implementation of --disable-tessdata-prefix
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@708 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-22 20:01:33 +00:00
david.eger@gmail.com
c0cd2cd605 Restore VC++ compatibility for paragraphs.cpp.
Missed a __func__ addition in the last merge.



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@707 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-21 16:41:27 +00:00
david.eger@gmail.com
a91778397b Fix Issue 645, a char signed/unsigned issue in paragraphs.cpp.
When constructing our debug strings, our simple UTF-8 processing should skip all non-ASCII chars.



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@706 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-20 20:19:00 +00:00
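
As a rough illustration of the behavior described in the commit above (skipping every non-ASCII byte when building a debug string), here is a minimal Python sketch. The function name is an assumption. In C++ the underlying hazard is that plain char may be signed, so bytes of 0x80 and above can compare as negative; Python bytes are always 0..255, so this sketch shows only the filtering step, not the signedness bug itself.

```python
def ascii_only(utf8_bytes: bytes) -> str:
    """Drop every byte >= 0x80, i.e. all lead and continuation bytes of
    multi-byte UTF-8 sequences, and keep the plain ASCII characters."""
    return bytes(b for b in utf8_bytes if b < 0x80).decode("ascii")
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so a single threshold test removes such characters whole without needing to decode them.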
zdenop@gmail.com
1563c01565 fixed build in java directory; create documentation package with 'make doc-pack'
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@705 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-15 21:05:12 +00:00
zdenop@gmail.com
1009a6e2f0 fopen() should use binary mode (issue 70)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@704 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-11 12:41:17 +00:00
tomp2010@gmail.com
87e03edb3a Fix dawg2wordlist crash on Windows caused by fopening dawg file in "r" instead of "rb" mode.
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@703 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-10 08:09:11 +00:00
zdenop@gmail.com
2972cc426b + fix VS2008 warning about "non dll-interface class tesseract::LTRResultIterator used as base for dll-interface class tesseract::ResultIterator" by making LTRResultIterator also visible.
+ Changed Project preprocessor definition of WINDLLNAME, because stringizing operator doesn't seem to work when initializing tessedit_module_name in ccutil/ccutil.cpp (which was omitted in previous fixes).
+ Update vs2008/tesshelper.py for new public header files.
patch from Tom Powers (https://groups.google.com/group/tesseract-dev/msg/6da2799cd2cb9844)

git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@702 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-08 21:15:13 +00:00
zdenop@gmail.com
2f1c112640 +Remove visibility from protected members of tesseract::TessBaseAPI class by applying TESS_LOCAL macro;
+Make PageIterator & ResultIterator classes visible by applying TESS_API macro;
+Fix api/Makefile.am & training/Makefile.am to allow Parallel Build Trees;
patch from Tom Powers (https://groups.google.com/group/tesseract-dev/msg/9d00579540e44055)

git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@701 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-07 22:04:46 +00:00
zdenop@gmail.com
1455bf5610 set tessedit_module_name for windows;
implement 'make install LANG="eng ara deu"';
more headers need to be installed: https://groups.google.com/group/tesseract-dev/msg/a4f7424377993b2e


git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@700 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-06 22:41:43 +00:00
david.eger@gmail.com
c2e84c4606 Fix two issues with GetHOCRText():
+ make it not seg-fault if called without calling SetInputName().
+ make it not leak memory (thank you valgrind)

http://code.google.com/p/tesseract-ocr/issues/detail?id=463



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@699 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-06 21:18:16 +00:00
david.eger@gmail.com
75a9a8fae7 Address "RIL_PARA doesn't work" comment in issue 622.
http://code.google.com/p/tesseract-ocr/issues/detail?id=622

The core of the problem is that in PSM_SINGLE_BLOCK mode, Tesseract
doesn't run paragraph detection, so no paragraphs get generated.  Here,
we make sure that even if run in a mode where no paragraphs get
generated, we treat each block as its own paragraph.



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@696 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-06 20:02:57 +00:00
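
The fallback described in the commit above (if no paragraphs were generated, treat each block as its own paragraph) might look like the sketch below. The dictionary layout with "lines" and "paragraphs" keys is a hypothetical stand-in for Tesseract's internal page structures and is used here only to show the control flow.

```python
def paragraphs_for(blocks):
    """Return a flat list of paragraphs, each a list of text lines.

    A block that carries detected paragraphs contributes them as-is;
    a block without any contributes all of its lines as one paragraph."""
    result = []
    for block in blocks:
        detected = block.get("paragraphs")
        if detected:
            result.extend(detected)
        elif block["lines"]:
            result.append(list(block["lines"]))
    return result
```

With this shape, a caller iterating at paragraph level always gets something to iterate over, whether or not paragraph detection ran.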
zdenop@gmail.com
8cc34e85f1 'make install' does not require language data; language data are installed by 'make install-langs'
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@695 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-05 00:11:38 +00:00
zdenop@gmail.com
765832d449 fixes issue 573 where boolean was being compared to float;
tesseract prints full version info when -v arg;
removes extra includes from tesseractmain.h;
removes extra DLLEXPORT & DLLIMPORT from hosts.h;
remove CCUTIL_IMPORTS & CCUTIL_EXPORTS from vs2008 *.vcproj;


git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@694 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-04 22:27:16 +00:00
zdenop@gmail.com
5761bc5736 fix visibility build; + tprintf visible
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@693 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-03 21:13:42 +00:00
zdenop@gmail.com
97e19443a3 install only necessary headers, fix uninstall
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@692 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-03 13:22:51 +00:00
zdenop@gmail.com
3b326532cc fix --enable-multiple-libraries; implement quiet mode (issue 580)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@691 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-03 11:48:59 +00:00
zdenop@gmail.com
30a70142a0 visibility - autotools part (./configure --enable-visibility)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@690 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-02 23:51:33 +00:00
zdenop@gmail.com
a776e0be85 TP: visibility trial - code & windows build changes (without autotools changes)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@689 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-02 17:48:45 +00:00
zdenop@gmail.com
e216adab43 fix configure.ac; unify identifiers (WIN32 vs _WIN32)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@688 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-02 17:31:24 +00:00
zdenop@gmail.com
657722aeca added missing changes for r686
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@687 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-01 23:19:35 +00:00
zdenop@gmail.com
49c4ce3183 fix for GRAPHICS_DISABLED build
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@686 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-01 22:43:51 +00:00
zdenop
06b2156a99 fixed makemoredists; add --enable-embedded to configure
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@685 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-01 12:40:04 +00:00
zdenop@gmail.com
df1cbdd7d3 fix for issue 463 (GetHOCRText segfaults unless SetInputName has been called first); removed declaration of GetLastInitLanguage
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@684 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-27 17:19:20 +00:00
zdenop@gmail.com
bf7ca288ac fixed 635 (strngs.h has unnecessary include of genericvector.h)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@682 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-26 16:39:01 +00:00
zdenop@gmail.com
da121f013c vs2008 and vs2010 replaced with Tom Powers solution
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@681 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-26 15:30:05 +00:00
zdenop@gmail.com
492f9119c2 check return code of API init (issue 593)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@680 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-26 14:48:35 +00:00
zdenop@gmail.com
132909a607 fix for issue 631: gettimeofday() on windows based on leptonica l_getCurrentTime()
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@679 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-21 21:38:45 +00:00
zdenop@gmail.com
95168ef064 fix missing ";" in VS2008 project files + fix VS2010
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@678 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-20 13:12:45 +00:00
zdenop@gmail.com
6ccab83bd6 fixing issue 628 (replacing __MSW32__ with _WIN32) and issue 614 (reverting "class DLLSYM STRING" to "class CCUTIL_API STRING")
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@677 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-19 21:48:45 +00:00
zdenop@gmail.com
61611c1990 removed unnecessary conditional
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@676 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-18 09:18:06 +00:00
david.eger@gmail.com
018f192fc2 Abolish populate_unichars(), fixing seg fault reported in Debian:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658634



git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@675 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-15 01:37:00 +00:00
zdenop@gmail.com
53d133d83a fixed cntraining thanks to Wil Hadden; fixed installation of new manpages
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@674 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-12 16:03:05 +00:00
zdenop@gmail.com
3c4fd30bb5 Fix isinf for VC++
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@673 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-12 14:51:28 +00:00