107 Commits

Author SHA1 Message Date
Shree Devi Kumar
9c89cd51cf Add a new renderer to create box files from images for LSTM training
(cherry picked from commit 921da6be2bdbda2ddd64514f9b6bec40a336246a)

fix typo

(cherry picked from commit 7bd1a0c80393fce2f34e2845cb26760bcf3791cd)

Add lstmboxrenderer to CMakeLists

(cherry picked from commit cfef3a889aef830725921b5c0218d5e9c633b03e)

fix formatting

(cherry picked from commit 7ba2b01ede7940ed609a073364948ef8c838cd10)
2019-02-05 14:03:29 +00:00
Stefan Weil
e817d93e62 Add configuration file for ALTO to installation
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 06:17:04 +01:00
Jake Sebright
d7cee03a94 Add support for ALTO output 2018-11-30 06:09:36 +01:00
Zdenko Podobný
ba64aaf257 add lstmdebug config to distribution and installation process 2018-10-29 09:38:11 +01:00
Stefan Weil
125fdc3f1b Add debug configuration for LSTM
It was provided by Jeff Breidenbach <jbreiden@google.com>.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-27 08:04:45 +02:00
Zdenko Podobný
3d508a65a7 set unlv_tilde_crunching to false; fixes #1449 #948 2018-10-23 09:26:32 +02:00
Stefan Weil
c6f759148b Don't set page segmentation mode for unlv config
Setting the page segmentation mode to 6 ("Assume a single uniform block
of text") typically improves the layout detection for such texts, but
should not be done in the config file.

unlvtests/runtestset.sh adds `--psm 6` explicitly, so test results
won't change when using that script.

This is similar to commit ecfee53bac.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 21:01:18 +02:00
Stefan Weil
ecfee53bac Don't set page segmentation mode for hocr, pdf and tsv configs
Setting the page segmentation mode in those config files gives unexpected
results: the text recognized when no config or only txt is given changes
if both txt and any of hocr, pdf or tsv is chosen.

In a test set of nearly 200 pages from historical books, using
segmentation mode 1 is typically slightly better than the default,
but there are also cases where it is much worse. Therefore the user
should be able to decide which page segmentation mode is best.

Old results for hocr, pdf or tsv now need an explicit `--psm 1` for
reproduction.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 12:05:49 +02:00
Stefan Weil
dabf3c299f Fix file endings
Text files should end with a LF, but not additional empty lines.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-04-25 19:35:33 +02:00
Stefan Weil
10a8a67ca2 Remove execute permission from config file (#1263)
This fixes the only configuration file which had such permissions.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-01-10 16:43:02 +01:00
Atsuyoshi Suzuki
82d62f89a2 Update Makefile.am (add 'lstm.train') 2017-04-02 17:06:12 +09:00
Zdenko Podobný
a011b15b0d fix #712: Ghostscript mangling Tesseract-produced PDFs 2017-02-15 17:09:37 +01:00
Ray Smith
81ebba0394 More makefile changes to remove cube 2016-12-14 11:17:06 -08:00
Ray Smith
65517794f9 Added missing lstm.train 2016-12-06 08:48:23 -08:00
Ray Smith
3d00d3bd94 Missing pdf font file from previous sync 2016-11-28 08:55:03 -08:00
Ray Smith
2c837dffc3 Result of clang tidy on recent merge 2016-11-07 10:46:33 -08:00
Zdenko Podobný
a6871a8c91 remove install-langs - fix #376 2016-09-01 19:21:30 +02:00
Tom Morris
fc80ceafb9 Fix hocrtsv references in Makefile 2016-03-02 10:46:52 -05:00
Tom Morris
6700edd8bc Cleanup TSV renderer
Remove all references to hocr, hocr.tsv, etc. Remove dead code for font
info, input filename, HTML escapes. Improved comments. Fixed
indentation.
2016-03-01 13:41:19 -05:00
Sundar M. Vaidya
937ceb2d1b Adds hocrtsv to tessdata/configs/Makefile.am 2016-03-01 12:25:15 -05:00
Sundar M. Vaidya
3163b38151 Adds hocrtsv file to configs folder. 2016-03-01 12:23:12 -05:00
Sundar M. Vaidya
59d593d796 Calls TessHOcrTsvRenderer if tessedit_create_hocrtsv is true. 2016-03-01 12:23:12 -05:00
Tom Morris
e3e1fe0e20 Document hocr_font_info in config 2016-02-14 16:49:00 -05:00
James R. Barlow
b30930b95a Replace pdf.ttf with sharp2.ttf, keep name the same
As discussed at length in issue #182, the existing pdf.ttf causes difficulties
for certain PDF viewers, in part because the old file had zero advance width.

With testing, sharp2.ttf seems to be the best available compromise, although
it's not perfect and causes some visual difficulties in Evince.  It does
seem to fix Kindle and OS X Preview.
2016-02-11 15:44:11 -08:00
Amit Dovev
6b08184a2c Update Makefile.am 2015-12-18 16:12:32 +02:00
amitdo
c2f5e9b849 If there is no explicit renderer(s), default to TessTextRenderer
Revert fd429c32, 43834da7, 05de195e.

See #49, #59.

The code in this commit solves the issue in a more elegant way, IMHO.

Now you can use:
  * `tesseract eurotext.tif eurotext txt pdf`
  * `tesseract eurotext.tif eurotext txt hocr`
  * `tesseract eurotext.tif eurotext txt hocr pdf`

NOTE:
  With `tesseract eurotext.tif eurotext`
  or `tesseract eurotext.tif eurotext txt`
  the psm will be set to '3', but...
  With `tesseract eurotext.tif eurotext txt pdf`
  or `tesseract eurotext.tif eurotext txt hocr`
  the psm will be set to '1'.
2015-12-11 19:06:49 +02:00
Zdenko Podobný
66a76a9477 Revert "temporary add config/*, configure and Makefile.in for release"
This reverts commits ec9581d8f2, 1afe382c4e, 4b2cfabcc1
2015-07-31 21:44:43 +02:00
Zdenko Podobný
5dfb0cb898 Fixes #64 - tessedit_create_txt 0 blocks box training 2015-07-25 22:49:55 +02:00
Jim O'Regan
05de195efc disable text creation for unlv, makebox, box.train, and box.train.stderr (see #49) 2015-07-20 10:07:55 +01:00
Jim O'Regan
43834da7a2 disable text creation when creating hOCR (issue #49) 2015-07-18 08:56:21 +01:00
Jeff Breidenbach
fd429c32a0 PDF creation: not disabling tessedit_create_txt
Okay, everything is more or less under control except for this:

  tesseract phototest.tif - pdf > phototest.pdf

This is activating both the text renderer and the pdf renderer.
They both get sent to stdout where they mix together and cause chaos.
Same thing happens with this command.

   tesseract phototest.tif stdout pdf > phototest.pdf

What's happening is tesseractmain.cpp is setting tessedit_create_pdf without
disabling tessedit_create_txt.

https://groups.google.com/d/msgid/tesseract-dev/32c065ee-aefa-441a-b37b-b6bdc234c8ab%40googlegroups.com
2015-07-18 08:39:57 +01:00
Zdenko Podobný
ec9581d8f2 temporary add configure and Makefile.in for release 2015-07-11 09:42:43 +02:00
Ray Smith
1e3b671298 Fixes to make yesterday's changes compile 2015-05-13 09:58:59 -07:00
Ray Smith
6b634170c1 Significant change to invisible font system
to improve correctness and compatibility with
external programs, particularly ghostscript.
We will start mapping everything to a single glyph,
rather than allowing characters to run off the end
of the font.

A more detailed design discussion is embedded into
pdfrenderer.cpp comments. The font, source code
that produces the font, and the design comments
were contributed by Ken Sharp from Artifex Software.
2015-05-12 17:33:18 -07:00
Ray Smith
d9699c4099 Fixed bidi handling in PDF output 2014-10-09 13:29:01 -07:00
Zdenko Podobný
369fabb7fc fix filemode;
update autotools and distribution script to repository changes;
ignore doxygen generated files and language data files;
2014-08-14 23:37:17 +02:00
zdenop
1ea387232b fix compatibility of uninstall: MacOSX rm needs -f instead of --force
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1127 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-07-24 20:39:30 +00:00
zdenop
a66f5b84c8 install pdf.ttf and pdf.ttx as part of tesseract library
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1031 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-29 22:12:32 +00:00
theraysmith@gmail.com
91d2265429 More minor fixes from issues and cleanup
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@974 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-10 01:38:00 +00:00
theraysmith@gmail.com
4c72deea6c Added pdf config file
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@972 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-09 19:18:07 +00:00
theraysmith@gmail.com
bfa401a6f8 Added PDF data files
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@971 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-09 19:14:11 +00:00
zdenop@gmail.com
53a3e0f88a fix issue 755; add example config files from tesseract manpage
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@894 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-10-20 20:20:10 +00:00
zdenop@gmail.com
d5b3c6c47c fix Parallel Build Trees (a.k.a. VPATH Builds) ('make install-langs' and 'make install-jars')
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@888 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-10-03 21:26:35 +00:00
zdenop@gmail.com
32d212d0c6 add new config file - get.image
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@826 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-02-23 11:56:49 +00:00
zdenop@gmail.com
e83503022c update script for 3.02.02 release
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@793 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-10-26 18:49:14 +00:00
zdenop
1131e5dd2f addition to Issue 724
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@731 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-07-04 15:35:26 +00:00
zdenop@gmail.com
d72a318c5c fix Issue 724: DESTDIR not supported with make install-langs
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@730 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-07-03 20:33:28 +00:00
zdenop@gmail.com
1455bf5610 set tessedit_module_name for windows;
implement 'make install LANG="eng ara deu"';
more headers need to be installed: https://groups.google.com/group/tesseract-dev/msg/a4f7424377993b2e


git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@700 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-06 22:41:43 +00:00
zdenop@gmail.com
8cc34e85f1 'make install' does not require language data; language data are installed by 'make install-langs'
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@695 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-05 00:11:38 +00:00
zdenop@gmail.com
3b326532cc fix --enable-multiple-libraries; implement quiet mode (issue 580)
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@691 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-03 11:48:59 +00:00