Add PAGE XML export and documentation.
To generate PAGE XML output just add 'page' to the tesseract command.
The output is outputname + '.page.xml' to avoid conflicts with ALTO export.
The output can be customized with the flags:
tessedit_create_page_polygon and tessedit_create_page_wordlevel.
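
A minimal sketch of the invocation described above. The sample image name and
the `page_output_name` helper are hypothetical and only illustrate the naming
rule. The two flags could presumably be passed with tesseract's
`-c name=value` option, but their accepted values are not stated here, so they
are left out.

```shell
# Hypothetical helper: derive the PAGE XML file name from the output base.
page_output_name() {
  printf '%s.page.xml\n' "$1"
}

input="eurotext.tif"   # assumed sample image
outbase="eurotext"

# Only run the export if tesseract and the sample image are available.
if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
  tesseract "$input" "$outbase" page \
    && ls -l "$(page_output_name "$outbase")" \
    || echo "PAGE export not available in this build" >&2
fi
```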
Co-authored-by: Stefan Weil <sw@weilnetz.de>
Setting the page segmentation mode to 6 ("Assume a single uniform block
of text") typically improves the layout detection for such texts, but
should not be done in the config file.
unlvtests/runtestset.sh adds `--psm 6` explicitly, so test results
won't change when using that script.
This is similar to commit ecfee53bac.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Setting the page segmentation mode in those config files gives unexpected
results: the recognized text differs between a run with no config (or only
txt) and a run that combines txt with any of hocr, pdf or tsv.
In a test set of nearly 200 pages from historical books, using
segmentation mode 1 is typically slightly better than the default,
but there are also cases where it is much worse. Therefore the user
should be able to decide which page segmentation mode is best.
Old results for hocr, pdf or tsv now need an explicit `--psm 1` for
reproduction.
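
As a rough illustration of reproducing the old behaviour, the command line
could be assembled like this. `psm_command` is a hypothetical helper that only
prints the command; the file names are placeholders.

```shell
# Hypothetical helper: print a tesseract command with an explicit --psm.
# $1 = image, $2 = output base, $3 = page segmentation mode,
# remaining arguments = config names such as txt, hocr, pdf or tsv.
psm_command() {
  image="$1"; outbase="$2"; psm="$3"
  shift 3
  echo tesseract "$image" "$outbase" --psm "$psm" "$@"
}

# Old hocr results now need the mode spelled out:
psm_command eurotext.tif eurotext 1 txt hocr
```

Dropping the `echo` inside the helper would run the command instead of
printing it.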
Signed-off-by: Stefan Weil <sw@weilnetz.de>
As discussed at length in issue #182, the existing pdf.ttf causes difficulties
for certain PDF viewers, in part because the old file had zero advance width.
With testing, sharp2.ttf seems to be the best available compromise, although
it's not perfect and causes some visual difficulties in Evince. It does
seem to fix Kindle and OS X Preview.
Revert fd429c32, 43834da7, 05de195e.
See #49, #59.
The code in this commit solves the issue in a more elegant way, IMHO.
Now you can use:
* `tesseract eurotext.tif eurotext txt pdf`
* `tesseract eurotext.tif eurotext txt hocr`
* `tesseract eurotext.tif eurotext txt hocr pdf`
NOTE:
With `tesseract eurotext.tif eurotext`
or `tesseract eurotext.tif eurotext txt`
the psm will be set to '3', but...
With `tesseract eurotext.tif eurotext txt pdf`
or `tesseract eurotext.tif eurotext txt hocr`
the psm will be set to '1'.
Okay, everything is more or less under control except for this:
tesseract phototest.tif - pdf > phototest.pdf
This activates both the text renderer and the pdf renderer.
They both get sent to stdout where they mix together and cause chaos.
The same thing happens with this command:
tesseract phototest.tif stdout pdf > phototest.pdf
What's happening is tesseractmain.cpp is setting tessedit_create_pdf without
disabling tessedit_create_txt.
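
One quick sanity check for this kind of mix-up is to test whether the
redirected file still begins with the `%PDF-` header. This is only a sketch:
`starts_with_pdf_header` is a hypothetical helper, the sample files are
fabricated stand-ins rather than real tesseract output, and the check only
catches the case where stray text lands before the header.

```shell
# Hypothetical helper: succeed if the file begins with the PDF magic bytes.
starts_with_pdf_header() {
  [ "$(head -c 5 "$1")" = "%PDF-" ]
}

# Fabricated stand-ins for a clean and a mixed stdout capture.
tmpdir="$(mktemp -d)"
printf '%%PDF-1.5\n' > "$tmpdir/clean.pdf"
printf 'recognized text\n%%PDF-1.5\n' > "$tmpdir/mixed.pdf"

starts_with_pdf_header "$tmpdir/clean.pdf" && echo "clean.pdf: header ok"
starts_with_pdf_header "$tmpdir/mixed.pdf" || echo "mixed.pdf: text precedes the PDF header"
```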
https://groups.google.com/d/msgid/tesseract-dev/32c065ee-aefa-441a-b37b-b6bdc234c8ab%40googlegroups.com
to improve correctness and compatibility with
external programs, particularly ghostscript.
We will start mapping everything to a single glyph,
rather than allowing characters to run off the end
of the font.
A more detailed design discussion is embedded into
pdfrenderer.cpp comments. The font, source code
that produces the font, and the design comments
were contributed by Ken Sharp from Artifex Software.