Most command line programs print the version to stdout.
This seems reasonable for Tesseract, too.
Now a shell statement like "VERSION=$(tesseract --version)" works
without I/O redirection.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
As with 0c492cb: in VC14 the snprintf function is provided by the standard library, so defining it as a macro triggers this error: "snprintf Do not define snprintf as a macro. Macro definition of snprintf conflicts with Standard Library function declaration"
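A minimal sketch of the kind of guard this implies, assuming the macro is only needed for MSVC releases older than VC14 (`_MSC_VER < 1900`). The `_snprintf` fallback and the helper `FormatCount` are illustrative assumptions, not the code touched by this commit.

```cpp
#include <stdio.h>
#include <string>

// Assumption: only pre-VC14 MSVC (_MSC_VER < 1900) lacks snprintf.
// VC14 and later declare it in the standard library, so the macro must
// not be defined there.
#if defined(_MSC_VER) && _MSC_VER < 1900
#define snprintf _snprintf
#endif

// Hypothetical helper that formats a small message through snprintf.
// The buffer is large enough for any int, so truncation cannot occur.
std::string FormatCount(int count) {
  char buffer[32];
  snprintf(buffer, sizeof(buffer), "count=%d", count);
  return std::string(buffer);
}
```

With this guard, the same call site compiles against the library function on VC14 and other compilers, and against the `_snprintf` fallback on older MSVC.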
Takes advantage of inheritance and the dir="ltr" default to:
- only generate paragraph dirs which are not ltr
- only generate word dirs which don't match enclosing paragraph
Tested against LTR, RTL, and mixed direction files. Files for the
latter two cases are in a separate commit on the ltr-test-files branch.
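The two rules above can be sketched as follows. This is an illustration only: the function names and the boolean parameters are hypothetical, and the real renderer may structure this differently.

```cpp
#include <string>

// Paragraph level: the enclosing context defaults to ltr, so only a
// paragraph that is not ltr needs an explicit attribute.
std::string ParagraphDirAttr(bool para_is_rtl) {
  return para_is_rtl ? " dir='rtl'" : "";
}

// Word level: a word inherits the direction of its paragraph, so an
// attribute is emitted only when the two differ.
std::string WordDirAttr(bool word_is_rtl, bool para_is_rtl) {
  if (word_is_rtl == para_is_rtl) return "";
  return word_is_rtl ? " dir='rtl'" : " dir='ltr'";
}
```

For a pure LTR document both functions always return an empty string, so no dir attributes are generated at all.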
This fixes the duplicate line IDs caused by inserting height information
into the middle of the ID and it moves the line height info into
the title attribute like everything else, rather than using non-standard
HTML attributes (which won't validate).
This change may break consumers of the HTML output, but 3.04 has only
been in the wild for 6 months and the current HTML is invalid, so I
believe the benefit outweighs the cost for the fix.
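A rough sketch of the resulting shape, assuming the line height is written as an `x_size` property inside the title attribute. The struct, the function name, and the exact attribute layout are hypothetical and only illustrate keeping the ID free of extra data.

```cpp
#include <sstream>
#include <string>

// Hypothetical bounding box of a text line, in pixels.
struct LineBox {
  int left, top, right, bottom;
};

// Builds the opening tag for a line. The ID holds only the page and line
// numbers, so it stays unique; the height goes into the title attribute
// next to the bounding box (property name x_size is an assumption).
std::string LineSpan(int page, int line, const LineBox& box, float height) {
  std::ostringstream out;
  out << "<span class='ocr_line' id='line_" << page << "_" << line << "'"
      << " title=\"bbox " << box.left << " " << box.top << " "
      << box.right << " " << box.bottom << "; x_size " << height << "\">";
  return out.str();
}
```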
As discussed at length in issue #182, the existing pdf.ttf causes difficulties
for certain PDF viewers, in part because the old file had zero advance width.
With testing, sharp2.ttf seems to be the best available compromise, although
it's not perfect and causes some visual difficulties in Evince. It does
seem to fix Kindle and OS X Preview.
Revert fd429c32, 43834da7, 05de195e.
See #49, #59.
The code in this commit solves the issue in a more elegant way, IMHO.
Now you can use:
* `tesseract eurotext.tif eurotext txt pdf`
* `tesseract eurotext.tif eurotext txt hocr`
* `tesseract eurotext.tif eurotext txt hocr pdf`
NOTE:
With `tesseract eurotext.tif eurotext`
or `tesseract eurotext.tif eurotext txt`
the psm will be set to '3', but...
With `tesseract eurotext.tif eurotext txt pdf`
or `tesseract eurotext.tif eurotext txt hocr`
the psm will be set to '1'.
Commit 99110df757 improved the help text
in several aspects, but also introduced new inconsistencies which this
patch tries to fix.
* Align columns (this required replacing tabs with spaces).
* Start explanatory text with an uppercase letter.
* Replace "the stdout" by "stdout".
* Small changes in help text for page segmentation modes.
* Split options into OCR options and single options
(partially reverts commit 99110df757).
In addition, whitespace characters at end of lines were removed.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The way tesstrain.sh handled font names was really weird, using '+'
signs as a delimiter. Quoting arguments is a much more
straightforward, standard and sensible way to do things.
So whereas previously one would have used this:
--fontlist Times New Roman + Arial Black
Now they should be specified like this:
--fontlist "Times New Roman" "Arial Black"
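The reason quoting is enough: the shell hands each quoted name to the program as a single argument, so no delimiter is needed. The sketch below illustrates that with a hypothetical C++ parser; tesstrain.sh itself is a shell script and does not use this code, and the rule "the list ends at the next argument starting with `--`" is an assumption.

```cpp
#include <string>
#include <vector>

// Collects every argument that follows --fontlist, up to the next option.
// Each element of args is one argument as delivered by the shell, so
// "Times New Roman" arrives as a single string, spaces included.
std::vector<std::string> ParseFontList(const std::vector<std::string>& args) {
  std::vector<std::string> fonts;
  bool in_list = false;
  for (const std::string& arg : args) {
    if (arg == "--fontlist") {
      in_list = true;
      continue;
    }
    if (arg.compare(0, 2, "--") == 0) {
      in_list = false;  // any other option ends the font list
      continue;
    }
    if (in_list) fonts.push_back(arg);
  }
  return fonts;
}
```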
This font list contains a selection of fonts produced by the Greek Font
Society <http://greekfontsociety.gr>, and is the result of testing
with a large corpus of a variety of scanned works.
Coverity bug reports:
CID 1270405: Uninitialized scalar variable
CID 1270408: Uninitialized scalar variable
CID 1270409: Uninitialized scalar variable
CID 1270410: Uninitialized scalar variable
Those variables are set conditionally in the while loop
and must keep their values in following iterations, so
they must be declared outside of the loop.
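A small illustration of the pattern, not the code from this commit: `CarryForward` and its variables are hypothetical. The value is assigned only in some iterations, so it has to live outside the loop to survive until the next one.

```cpp
#include <cstddef>
#include <vector>

// For each sample, returns the most recent non-negative value seen so far
// (0 if none has been seen yet).
std::vector<int> CarryForward(const std::vector<int>& samples) {
  std::vector<int> result;
  int last_valid = 0;  // declared and initialized once, outside the loop
  std::size_t i = 0;
  while (i < samples.size()) {
    // Declaring last_valid here instead would create a fresh,
    // uninitialized variable in every iteration.
    if (samples[i] >= 0) last_valid = samples[i];  // set conditionally
    result.push_back(last_valid);                  // read unconditionally
    ++i;
  }
  return result;
}
```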
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Coverity bug report: CID 1270401 (#1 of 1): Use after free
As the comment (which was also fixed) says, ReadNextBox() already
calls fclose(box_file), so don't call it a 2nd time.
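The ownership rule can be sketched like this. `ReadFirstLineAndClose` is a hypothetical stand-in for a reader that closes the file itself, as described above for ReadNextBox(); it is not the Tesseract function.

```cpp
#include <cstdio>

// Hypothetical reader that takes ownership of the stream: it reads one
// line and then closes the file before returning.
bool ReadFirstLineAndClose(std::FILE* file, char* line, int size) {
  bool ok = std::fgets(line, size, file) != nullptr;
  std::fclose(file);
  return ok;
}

// Caller: opens the file and hands it over to the reader.
bool LoadFirstLine(const char* path, char* line, int size) {
  std::FILE* file = std::fopen(path, "r");
  if (file == nullptr) return false;
  bool ok = ReadFirstLineAndClose(file, line, size);
  // No fclose(file) here: the callee already closed the stream, and a
  // second fclose on the same pointer would be a use after free.
  return ok;
}
```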
Signed-off-by: Stefan Weil <sw@weilnetz.de>