Commit Graph

254 Commits

Author SHA1 Message Date
Philip Rinn
f00ff67c17 Fix ABI break introduced in 3.04.00, fixes #254 2016-03-08 17:37:00 +01:00
amitdo
6f4dca803f Don't display tesseract's banner when quiet mode is active 2016-03-08 17:36:48 +01:00
Zdenko Podobný
285c3fba6a solve segfault for box.train; fixes #57 2016-03-08 17:36:22 +01:00
Zdenko Podobný
e4711bfcd5 increase version number in 3.04 branch 2016-02-18 09:07:34 +01:00
Tom Morris
4ef68a036c Emit fewer "lang" attributes
Add "lang" attribute to paragraph markup and only include
word lang attribute if it's different from the paragraph's value.
2016-02-18 09:05:54 +01:00
Tom Morris
381b3a56c6 Only generate dir for HOCR when needed - fixes #208
Takes advantage of inheritance and dir="ltr" default to:
 - only generate paragraph dirs which are not ltr
 - only generate word dirs which don't match enclosing paragraph

Tested against LTR, RTL, and mixed direction files. Files for the
latter two cases are in a separate commit on the ltr-test-files branch.
2016-02-18 09:05:46 +01:00
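
The inheritance rule described in this commit can be sketched as a single comparison: emit `dir` only when it differs from the direction already in effect. The helper below is a hypothetical illustration, not Tesseract's actual hOCR renderer code; the name `DirAttribute` and its string-based parameters are assumptions made for the example.

```cpp
#include <string>

// Hypothetical helper, not taken from the Tesseract sources.
// `dir` is the element's own direction ("ltr" or "rtl").
// `inherited` is the direction already in effect from the enclosing
// element; at the top level that is "ltr", the HTML default.
std::string DirAttribute(const std::string& dir, const std::string& inherited) {
  if (dir == inherited) return "";  // the inherited value is already correct
  return " dir=\"" + dir + "\"";    // only emit when it differs
}
```

For a paragraph one would call `DirAttribute(para_dir, "ltr")`, and for a word `DirAttribute(word_dir, para_dir)`, which matches the two bullet points in the message.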
Tom Morris
c3ad0de69b Fix varsize array for Microsoft compiler 2016-02-18 09:05:37 +01:00
Tom Morris
134ebc3df3 INCOMPATIBLE fix to hOCR line height information - fixes #225.
This fixes the duplicate line IDs caused by inserting height information
into the middle of the ID and it moves the line height info into
the title attribute like everything else, rather than using non-standard
HTML attributes (which won't validate).

This change may break consumers of the HTML output, but 3.04 has only
been in the wild for 6 months and the current HTML is invalid, so I
believe the benefit outweighs the cost for the fix.
2016-02-16 22:26:12 +01:00
Zdenko Podobný
8473e5a262 update autotools files 2016-02-13 00:06:11 +01:00
Zdenko Podobný
ebadb00e4d fix version number => 3.04.01 2016-02-12 23:28:40 +01:00
amitdo
337a9b52c4 Fix #64. Make box training work
This commit is better than 06fc0533c. Hopefully, this is the last fix to box training issue.
2016-02-05 11:21:42 +01:00
amitdo
b60bb806bf Fix #184. Training should work now 2016-02-05 11:20:57 +01:00
Zdenko Podobný
8daef71a83 added row attributes to hocr output 2016-02-05 11:20:01 +01:00
amitdo
fe5ee13229 Add missing ')' to make the code compile 2016-02-05 11:18:40 +01:00
amitdo
270214e667 If there is no explicit renderer(s), default to TessTextRenderer
Revert fd429c32, 43834da7, 05de195e.

See #49, #59.

The code in this commit solves the issue in a more elegant way, IMHO.

Now you can use:
  * `tesseract eurotext.tif eurotext txt pdf`
  * `tesseract eurotext.tif eurotext txt hocr`
  * `tesseract eurotext.tif eurotext txt hocr pdf`

NOTE:
  With `tesseract eurotext.tif eurotext`
  or `tesseract eurotext.tif eurotext txt`
  the psm will be set to '3', but...
  With `tesseract eurotext.tif eurotext txt pdf`
  or `tesseract eurotext.tif eurotext txt hocr`
  the psm will be set to '1'.
2016-02-05 11:18:34 +01:00
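
The fallback described above ("no explicit renderer means text output") might look roughly like the sketch below. It models renderers by name only; `SelectRenderers` and the use of plain strings are assumptions for illustration and do not reflect how `tesseractmain.cpp` actually wires up `TessTextRenderer`.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: each renderer is represented by its config name.
// If the caller requested nothing explicitly, fall back to text output.
std::vector<std::string> SelectRenderers(
    const std::vector<std::string>& requested) {
  if (requested.empty()) return {"txt"};  // default renderer
  return requested;                       // honour the explicit list as given
}
```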
Stefan Weil
4a7cf319fc tesseractmain: Prettify help message
Commit 99110df757 improved the help text
in several aspects, but also introduced new inconsistencies which this
patch tries to fix.

* Align columns (this needed replacing tabs by spaces).
* Start explaining text with uppercase.
* Replace "the stdout" by "stdout".
* Small changes in help text for page segmentation modes.
* Split options in OCR options and single options
  (partially revert commit 99110df757).

In addition, whitespace characters at end of lines were removed.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2016-02-05 11:17:08 +01:00
Stefan Weil
613140a1ac pdfrenderer: Fix uninitialized local variables
Coverity bug reports:

CID 1270405: Uninitialized scalar variable
CID 1270408: Uninitialized scalar variable
CID 1270409: Uninitialized scalar variable
CID 1270410: Uninitialized scalar variable

Those variables are set conditionally in the while loop
and must keep their values in following iterations, so
they must be declared outside of the loop.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2016-02-05 11:15:54 +01:00
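
A minimal sketch of the pattern this message describes: a value that is assigned only on some iterations has to live outside the loop and be initialized there. The `Word` type and `BaselinePerWord` function are hypothetical and are not the variables from `pdfrenderer.cpp`; the example also uses a range-based `for` rather than the `while` loop mentioned in the message.

```cpp
#include <vector>

// Hypothetical input type for the example.
struct Word {
  bool starts_line;  // true for the first word of a text line
  int baseline;      // only meaningful when starts_line is true
};

// Returns the baseline in effect for each word.
std::vector<int> BaselinePerWord(const std::vector<Word>& words) {
  std::vector<int> result;
  int current_baseline = 0;  // declared and initialized outside the loop
  for (const Word& w : words) {
    if (w.starts_line) current_baseline = w.baseline;  // set conditionally
    result.push_back(current_baseline);  // otherwise the old value carries over
  }
  return result;
}
```

Declaring `current_baseline` inside the loop body would give every iteration a fresh, uninitialized variable whenever the condition is false, which is the kind of defect a static analyzer reports as an uninitialized scalar.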
amitdo
d36ee9c4d0 tesseractmain.cpp: Split huge main() to sub functions
Add these functions to api/tesseractmain.cpp:
PrintVersionInfo()
PrintUsage()
PrintHelpForPSM()
PrintHelpMessage()
SetVariablesFromCLArgs()
PrintLangsList()
FixPageSegMode()
ParseArgs()
PreloadRenderers()
2016-02-05 11:15:38 +01:00
Stefan Weil
8c4b027292 tesseractmain: Fix unterminated string
Coverity bug report: CID 1270421 "Buffer not null terminated".

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2016-02-05 11:15:06 +01:00
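
The message does not say which call produced the report. A common cause of "Buffer not null terminated" is `strncpy`, which leaves the destination unterminated when the source fills the whole buffer. Assuming that is the situation here, one usual remedy is sketched below; `CopyTerminated` is a hypothetical helper, not code from the commit.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies at most dst_size - 1 bytes from src and
// always writes a terminating '\0', even when src is longer than dst.
void CopyTerminated(char* dst, std::size_t dst_size, const char* src) {
  if (dst_size == 0) return;             // nothing can be stored
  std::strncpy(dst, src, dst_size - 1);  // may leave dst unterminated
  dst[dst_size - 1] = '\0';              // so terminate explicitly
}
```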
Stefan Weil
cd946dc30d api: Fix printing of a size_t value
size_t is not always the same as long, especially not for 64 bit Windows:

api/pdfrenderer.cpp:549:31: warning:
 format '%ld' expects argument of type 'long int',
 but argument 4 has type 'size_t {aka long long unsigned int}' [-Wformat=]

size_t normally requires a format string "%zu", but this is unsupported
by Visual Studio, so use a type cast.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2016-02-05 11:13:34 +01:00
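
A sketch of the cast-based workaround, under the assumption that the value fits the target type. The function name `FormatObjectSize` and the choice of `unsigned long` with `%lu` are illustrative; the commit itself only states that a type cast is used.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a size_t without relying on "%zu".
std::string FormatObjectSize(std::size_t n) {
  char buf[64];
  // Cast to a type whose conversion specifier every compiler supports,
  // so the format string and the argument type agree on all platforms.
  std::snprintf(buf, sizeof(buf), "%lu", static_cast<unsigned long>(n));
  return buf;
}
```

The trade-off is that `unsigned long` is 32 bits wide on 64-bit Windows, so values above 4294967295 would be truncated there; that is acceptable only when the sizes involved are known to stay small.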
Stefan Weil
f7368ecb14 Don't use NULL for integer arguments
This fixes compiler warnings:

api/baseapi.cpp:1422:49: warning:
 passing NULL to non-pointer argument 6 of
 'int MultiByteToWideChar(UINT, DWORD, LPCCH, int, LPWSTR, int)'
 [-Wconversion-null]
api/baseapi.cpp:1427:54:
 warning: passing NULL to non-pointer argument 6 of
 'int WideCharToMultiByte(UINT, DWORD, LPCWCH, int, LPSTR, int, LPCCH, LPBOOL)'
 [-Wconversion-null]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2016-02-05 11:13:15 +01:00
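
To illustrate the distinction without depending on the Windows headers, the sketch below uses a hypothetical stand-in, `ConvertText`, that follows the same size-query convention as `MultiByteToWideChar`: passing a destination size of 0 returns the required buffer size. The point is only that the pointer argument takes `nullptr` while the integer argument takes `0`.

```cpp
#include <cstring>

// Hypothetical stand-in for a size-query API (not a Windows function).
// With dst_size == 0 it returns the buffer size needed, including the
// terminator; otherwise it copies and returns the bytes written, or 0
// if the destination is missing or too small.
int ConvertText(const char* src, char* dst, int dst_size) {
  const int needed = static_cast<int>(std::strlen(src)) + 1;
  if (dst_size == 0) return needed;                   // size query only
  if (dst == nullptr || dst_size < needed) return 0;  // error
  std::memcpy(dst, src, needed);
  return needed;
}

int RequiredSize(const char* src) {
  // Pointer parameter: nullptr. Integer parameter: 0, never NULL.
  return ConvertText(src, nullptr, 0);
}
```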
Stefan Weil
1f4c8d0567 Remove unneeded const qualifiers
This fixes compiler warnings like this one:

api/baseapi.h:739:32: warning:
 type qualifiers ignored on function return type [-Wignored-qualifiers]

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2016-02-05 11:13:00 +01:00
amitdo
0bb5a7d6f0 Added osd renderer for psm 0.
Works for single page and multi-page.
2016-02-05 10:58:29 +01:00
amitdo
79ed9a30c7 OSD: Print script name instead of meaningless script id 2016-02-05 10:57:45 +01:00
Stefan Weil
f72e65b36e api: Fix typos in comments (all found by codespell)
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2016-02-05 10:52:06 +01:00
James R. Barlow
7b85eeafe2 Get OpenCL to compile on OS X
However, the output of the OpenCL build is garbage....
2016-02-05 10:47:15 +01:00
Zdenko Podobný
f7fd63efea Fixes #76 - enable OpenMP support 2016-02-05 10:43:21 +01:00
Robert Theis
76a28640c6 Remove extraneous line feed 2016-02-05 10:42:41 +01:00
Zdenko Podobný
85f8a98c93 fix bug in UTF-16BE conversion 2016-02-05 10:42:29 +01:00
Zdenko Podobný
41918a452a improve NO_CUBE_BUILD 2016-02-05 10:42:19 +01:00
Zdenko Podobný
9e4ceb1522 Fixes #74 NO_CUBE_BUILD with reverting to ANDROID_BUILD in baseapi 2016-02-05 10:42:10 +01:00
Zdenko Podobný
ff6c088084 enable pdfrender with NO_CUBE_BUILD 2016-02-05 10:41:49 +01:00
Jeff Breidenbach
300b5246f3 replace CubeUtils::UTF8ToUTF32 in pdfrenderer 2016-02-05 10:41:39 +01:00
Zdenko Podobný
982789ac35 implement build without cube (-DNO_CUBE_BUILD) 2016-02-05 10:40:26 +01:00
Zdenko Podobný
b677761ba9 increase version number 2015-07-21 22:48:35 +02:00
zdenop
e4f4893fb8 Merge pull request #52 from unbe/null-pointer-access-in-hocr
Fix null pointer dereference when writing font name into HOCR.
2015-07-20 07:40:59 +02:00
artem
2b6801eddb Fix null pointer dereference when writing font name into HOCR. 2015-07-19 22:05:02 +02:00
unbe
67ffea8877 Update capi.cpp
Make TessDeleteResultRenderer use delete, not delete[]
2015-07-19 15:15:42 +02:00
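
The rule behind this fix is that an object created with scalar `new` must be released with scalar `delete`; using `delete[]` on it is undefined behavior. The sketch below assumes the renderer is allocated as a single object. `Renderer`, `CreateRenderer`, and `DeleteRenderer` are hypothetical stand-ins, not the functions in `capi.cpp`.

```cpp
// Hypothetical stand-in for a renderer class; `live` counts instances
// so the effect of the delete can be observed.
struct Renderer {
  static int live;
  Renderer() { ++live; }
  ~Renderer() { --live; }
};
int Renderer::live = 0;

// Allocated with scalar new ...
Renderer* CreateRenderer() { return new Renderer; }

// ... so it must be released with scalar delete, not delete[].
void DeleteRenderer(Renderer* r) { delete r; }
```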
Zdenko Podobný
ec9581d8f2 temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
Ray Smith
a303ab9d00 Misc fixes, mostly clang formatting, but some bug fixes in matrix, werd, and tesstrain_utils. Also updates unicharset to match traineddata files. 2015-07-09 14:28:20 -07:00
Ray Smith
b1d99dfe23 Added a backup adaptive classifier to take over from primary when it fills on a large document 2015-06-12 11:10:53 -07:00
Ray Smith
ab0f4e2c38 Clang fixes to earlier changes and build compatibility with Google environment 2015-06-12 10:53:21 -07:00
orbitcowboy
9328f0e5d4 Fix potential null pointer dereference in ccmain/paragraphs.cpp. 2015-05-19 10:17:44 +02:00
Zdenko Podobný
59bcbc79b3 fix GIT_VER info in VS2010 2015-05-15 15:14:49 +02:00
Zdenko Podobný
e98849b482 Print error message when pdf.ttf is not found. 2015-05-15 15:14:00 +02:00
Zdenko Podobný
035b324f0f reflect the latest commits in VS2010 build 2015-05-14 10:52:54 +02:00
Jim O'Regan
b13691fda0 Merge conflict: going with Ray's version 2015-05-13 08:54:28 +01:00
Ray Smith
03f3c9dc88 Misc fixes missed from previous commits 2015-05-12 18:13:15 -07:00
Ray Smith
6b634170c1 Significant change to invisible font system
to improve correctness and compatibility with
external programs, particularly ghostscript.
We will start mapping everything to a single glyph,
rather than allowing characters to run off the end
of the font.

A more detailed design discussion is embedded into
pdfrenderer.cpp comments. The font, source code
that produces the font, and the design comments
were contributed by Ken Sharp from Artifex Software.
2015-05-12 17:33:18 -07:00
Ray Smith
4a3caefd92 Add ability to build under android (without cube or scrollview). 2015-05-12 15:41:15 -07:00