Setting the page segmentation mode in those config files gives unexpected
results: the text recognized with no config or with only txt differs from
the text recognized when txt is combined with any of hocr, pdf or tsv.
In a test set of nearly 200 pages from historical books, using
segmentation mode 1 is typically slightly better than the default,
but there are also cases where it is much worse. Therefore the user
should be able to decide which page segmentation mode is best.
To reproduce old results for hocr, pdf or tsv, `--psm 1` must now be
passed explicitly.
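A minimal sketch of such an invocation, assuming tesseract is on PATH and
the sample image eurotext.tif is in the current directory. `psm_cmd` is a
hypothetical helper used only for illustration; it prints the command line
instead of running it:

```shell
# psm_cmd (hypothetical helper): print a tesseract command line with an
# explicit page segmentation mode followed by the requested output configs.
psm_cmd() {
    psm=$1
    shift
    echo "tesseract eurotext.tif eurotext --psm $psm $*"
}

# Command that asks for the psm the hocr config used to set implicitly.
psm_cmd 1 txt hocr
# prints: tesseract eurotext.tif eurotext --psm 1 txt hocr

# To actually run it (assumes tesseract and eurotext.tif are available):
#   eval "$(psm_cmd 1 txt hocr)"
```

Leaving out `--psm` gives the default mode for every combination of output
configs.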
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Revert fd429c32, 43834da7, 05de195e.
See #49, #59.
The code in this commit solves the issue in a more elegant way, IMHO.
Now you can use:
* `tesseract eurotext.tif eurotext txt pdf`
* `tesseract eurotext.tif eurotext txt hocr`
* `tesseract eurotext.tif eurotext txt hocr pdf`
NOTE:
With `tesseract eurotext.tif eurotext`
or `tesseract eurotext.tif eurotext txt`
the psm will be set to '3', whereas
with `tesseract eurotext.tif eurotext txt pdf`
or `tesseract eurotext.tif eurotext txt hocr`
the psm will be set to '1'.
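The rule in this note can be summarized as a small shell sketch.
`effective_psm` is a hypothetical helper, not part of tesseract. It assumes
that the presence of hocr or pdf among the configs is what switches the
mode to 1, which the examples suggest but do not state for every
combination:

```shell
# effective_psm (hypothetical helper): print the psm that results from a
# given list of output configs under the rule described in the note.
effective_psm() {
    for cfg in "$@"; do
        case $cfg in
            # Assumption: hocr or pdf anywhere in the list selects psm 1.
            hocr|pdf) echo 1; return ;;
        esac
    done
    # No config, or only txt: the default psm 3 applies.
    echo 3
}

effective_psm              # prints 3
effective_psm txt          # prints 3
effective_psm txt pdf      # prints 1
effective_psm txt hocr pdf # prints 1
```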