Jan Kamlah
577e8a8b93
Add PAGE XML renderer / export (#4214)
...
Add PAGE XML export and documentation.
To generate PAGE XML output just add 'page' to the tesseract command.
The output is outputname + '.page.xml' to avoid conflicts with ALTO export.
The output can be customized with the flags:
tessedit_create_page_polygon and tessedit_create_page_wordlevel.
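A minimal sketch of the invocation described above. The image and output base names are hypothetical, the flag name is copied from this message, and the `=0` value format is an assumption; `tesseract --print-parameters` lists the parameters a given build actually accepts.

```shell
image="eurotext.tif"   # hypothetical input image
outbase="eurotext"     # hypothetical output base name

# Run only when tesseract and the image are actually present.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  # Adding 'page' as a config name requests PAGE XML output.
  tesseract "$image" "$outbase" page

  # Same call with one customization flag passed via -c.
  # Flag name taken from the commit message; the value 0 is assumed.
  tesseract "$image" "$outbase" -c tessedit_create_page_wordlevel=0 page
fi

# Per the message, the result is written to outputname + '.page.xml'.
page_file="${outbase}.page.xml"
echo "PAGE XML output expected at: $page_file"
```

The `-c` options are placed before the config name because the command line expects options ahead of config files.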
Co-authored-by: Stefan Weil <sw@weilnetz.de>
2024-04-19 21:12:39 +02:00
Gitoffthelawn
d086c075b3
Fixed 2 errors
2022-10-06 03:53:11 -07:00
Shree
df6b1ce452
remove legacy parameter disable_character_fragments from lstm.train
2019-10-23 13:15:16 +02:00
Johannes Künsebeck
aa2ab68e29
Removed unused parameters
...
The following parameters are not used anywhere anymore:
* use_definite_ambigs_for_classifier
* max_viterbi_list_size
* word_to_debug_lengths
* fragments_debug
* tessedit_redo_xheight
* debug_acceptable_wds
* tessedit_matcher_log
* tessedit_test_adaption_mode
* docqual_excuse_outline_errs
* crunch_pot_garbage
* suspect_space_level
* tessedit_consistent_reps
* wordrec_display_all_words
* wordrec_no_block
* wordrec_worst_state
* fragments_guide_chopper
* segment_adjust_debug
* classify_adapt_feature_thresh (classify_adapt_feature_threshold still exists)
* classify_adapt_proto_thresh (classify_adapt_proto_threshold still exists)
* classify_min_norm_scale_x
* classify_max_norm_scale_x
* classify_min_norm_scale_y
* classify_max_norm_scale_y
* il1_adaption_test
* textord_blob_size_bigile
* textord_blob_size_smallile
* editor_debug_config_file
* textord_tabfind_show_color_fit
The list was generated by a Python script and each parameter occurrence was checked
manually.
2019-10-03 09:18:29 +02:00
Julian Gilbey
5a1978a4fc
fix #2616: allow building of training data
...
This fixes Issue #2616 by preventing an attempt to build the recognition engine when running tesstrain.sh.
2019-08-13 19:05:49 +01:00
Stefan Weil
7db25e15c0
Remove unused config variable tessedit_single_match
...
Also replace TRUE, FALSE with true, false.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2019-03-31 17:38:35 +02:00
Shree
08e96516c6
install lstmbox and wordstrbox config files
2019-03-01 15:26:59 +00:00
Shree Devi Kumar
f3362a4b5b
Add renderer to create WordStr box files from images
2019-02-10 19:59:17 +00:00
zdenop
2ae65b2493
Merge pull request #2216 from Shreeshrii/lstmbox
...
Lstmbox
2019-02-10 13:53:41 +01:00
Chris Mayo
6dc48adfee
Rename get.image config to get.images and install
2019-02-05 19:57:53 +00:00
Shree Devi Kumar
9c89cd51cf
Add a new renderer to create box files from images for LSTM training
...
(cherry picked from commit 921da6be2bdbda2ddd64514f9b6bec40a336246a)
fix typo
(cherry picked from commit 7bd1a0c80393fce2f34e2845cb26760bcf3791cd)
Add lstmboxrenderer to CMakeLists
(cherry picked from commit cfef3a889aef830725921b5c0218d5e9c633b03e)
fix formatting
(cherry picked from commit 7ba2b01ede7940ed609a073364948ef8c838cd10)
2019-02-05 14:03:29 +00:00
Stefan Weil
e817d93e62
Add configuration file for ALTO to installation
...
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-11-30 06:17:04 +01:00
Jake Sebright
d7cee03a94
Add support for ALTO output
2018-11-30 06:09:36 +01:00
Zdenko Podobný
ba64aaf257
add lstmdebug config to distribution and installation process
2018-10-29 09:38:11 +01:00
Stefan Weil
125fdc3f1b
Add debug configuration for LSTM
...
It was provided by Jeff Breidenbach <jbreiden@google.com>.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-27 08:04:45 +02:00
Zdenko Podobný
3d508a65a7
set unlv_tilde_crunching to false; fixes #1449 #948
2018-10-23 09:26:32 +02:00
Stefan Weil
c6f759148b
Don't set page segmentation mode for unlv config
...
Setting the page segmentation mode to 6 ("Assume a single uniform block
of text") typically improves the layout detection for such texts, but
should not be done in the config file.
unlvtests/runtestset.sh adds `--psm 6` explicitly, so test results
won't change when using that script.
This is similar to commit ecfee53bac.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 21:01:18 +02:00
Stefan Weil
ecfee53bac
Don't set page segmentation mode for hocr, pdf and tsv configs
...
Setting the page segmentation mode in those config files gives unexpected
results: the text recognized when no config or only txt is given changes
if both txt and any of hocr, pdf or tsv is chosen.
In a test set of nearly 200 pages from historical books, using
segmentation mode 1 is typically slightly better than the default,
but there are also cases where it is much worse. Therefore the user
should be able to decide which page segmentation mode is best.
Old results for hocr, pdf or tsv now need an explicit `--psm 1` for
reproduction.
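A short sketch of reproducing the old behaviour, assuming a hypothetical input image `page.tif` and output base `page`. Only `--psm 1` and the config names come from the message above.

```shell
image="page.tif"   # hypothetical input image
outbase="page"     # hypothetical output base name
psm=1              # the mode the hocr, pdf and tsv configs used to set

# Run only when tesseract and the image are actually present.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  # Pass the segmentation mode explicitly instead of relying on the config file.
  tesseract "$image" "$outbase" --psm "$psm" txt hocr
fi

# Both renderers write next to each other, named after the output base.
txt_file="${outbase}.txt"
hocr_file="${outbase}.hocr"
echo "Expected outputs: $txt_file $hocr_file"
```

Leaving out `--psm` keeps the default mode, so text and hOCR output are then produced from the same segmentation.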
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-10-04 12:05:49 +02:00
Stefan Weil
dabf3c299f
Fix file endings
...
Text files should end with a LF, but without additional empty lines.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-04-25 19:35:33 +02:00
Stefan Weil
10a8a67ca2
Remove execute permission from config file (#1263)
...
This fixes the only configuration file which had such permissions.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
2018-01-10 16:43:02 +01:00
Atsuyoshi Suzuki
82d62f89a2
Update Makefile.am (add 'lstm.train')
2017-04-02 17:06:12 +09:00
Ray Smith
65517794f9
Added missing lstm.train
2016-12-06 08:48:23 -08:00
Ray Smith
2c837dffc3
Result of clang tidy on recent merge
2016-11-07 10:46:33 -08:00
Tom Morris
fc80ceafb9
Fix hocrtsv references in Makefile
2016-03-02 10:46:52 -05:00
Tom Morris
6700edd8bc
Cleanup TSV renderer
...
Remove all references to hocr, hocr.tsv, etc. Remove dead code for font
info, input filename, HTML escapes. Improved comments. Fixed
indentation.
2016-03-01 13:41:19 -05:00
Sundar M. Vaidya
937ceb2d1b
Adds hocrtsv to tessdata/configs/Makefile.am
2016-03-01 12:25:15 -05:00
Sundar M. Vaidya
3163b38151
Adds hocrtsv file to configs folder.
2016-03-01 12:23:12 -05:00
Sundar M. Vaidya
59d593d796
Calls TessHOcrTsvRenderer if tessedit_create_hocrtsv is true.
2016-03-01 12:23:12 -05:00
Tom Morris
e3e1fe0e20
Document hocr_font_info in config
2016-02-14 16:49:00 -05:00
Amit Dovev
6b08184a2c
Update Makefile.am
2015-12-18 16:12:32 +02:00
amitdo
c2f5e9b849
If there is no explicit renderer(s), default to TessTextRenderer
...
Revert fd429c32, 43834da7, 05de195e.
See #49, #59.
The code in this commit solves the issue in a more elegant way, IMHO.
Now you can use:
* `tesseract eurotext.tif eurotext txt pdf`
* `tesseract eurotext.tif eurotext txt hocr`
* `tesseract eurotext.tif eurotext txt hocr pdf`
NOTE:
With `tesseract eurotext.tif eurotext`
or `tesseract eurotext.tif eurotext txt`
the psm will be set to '3', but...
With `tesseract eurotext.tif eurotext txt pdf`
or `tesseract eurotext.tif eurotext txt hocr`
the psm will be set to '1'.
2015-12-11 19:06:49 +02:00
Zdenko Podobný
66a76a9477
Revert "temporary add config/*, configure and Makefile.in for release"
...
This reverts commits ec9581d8f2, 1afe382c4e, 4b2cfabcc1
2015-07-31 21:44:43 +02:00
Zdenko Podobný
5dfb0cb898
Fixes #64 - tessedit_create_txt 0 blocks box training
2015-07-25 22:49:55 +02:00
Jim O'Regan
05de195efc
disable text creation for unlv, makebox, box.train, and box.train.stderr (see #49)
2015-07-20 10:07:55 +01:00
Jim O'Regan
43834da7a2
disable text creation when creating hOCR (issue #49)
2015-07-18 08:56:21 +01:00
Jeff Breidenbach
fd429c32a0
PDF creation: not disabling tessedit_create_txt
...
Okay, everything is more or less under control except for this:
tesseract phototest.tif - pdf > phototest.pdf
This activates both the text renderer and the pdf renderer.
They both get sent to stdout where they mix together and cause chaos.
Same thing happens with this command.
tesseract phototest.tif stdout pdf > phototest.pdf
What's happening is tesseractmain.cpp is setting tessedit_create_pdf without
disabling tessedit_create_txt.
https://groups.google.com/d/msgid/tesseract-dev/32c065ee-aefa-441a-b37b-b6bdc234c8ab%40googlegroups.com
2015-07-18 08:39:57 +01:00
Zdenko Podobný
ec9581d8f2
temporary add configure and Makefile.in for release
2015-07-11 09:42:43 +02:00
Zdenko Podobný
369fabb7fc
fix filemode;
...
update autotools and distribution script to repository changes;
ignore doxygen generated files and language data files;
2014-08-14 23:37:17 +02:00
theraysmith@gmail.com
91d2265429
More minor fixes from issues and cleanup
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@974 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-10 01:38:00 +00:00
theraysmith@gmail.com
4c72deea6c
Added pdf config file
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@972 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2014-01-09 19:18:07 +00:00
zdenop@gmail.com
53a3e0f88a
fix issue 755; add example config files from tesseract manpage
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@894 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-10-20 20:20:10 +00:00
zdenop@gmail.com
32d212d0c6
add new config file - get.image
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@826 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2013-02-23 11:56:49 +00:00
zdenop@gmail.com
e83503022c
update script for 3.02.02 release
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@793 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-10-26 18:49:14 +00:00
zdenop@gmail.com
3b326532cc
fix --enable-multiple-libraries; implement quiet mode (issue 580)
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@691 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-03-03 11:48:59 +00:00
theraysmith@gmail.com
d581ab7e12
New config for testing bigram correction.
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@661 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-02 18:46:19 +00:00
theraysmith@gmail.com
6e273b71bd
Cube trained data for fra, ita, rus, spa
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@656 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-02 03:08:26 +00:00
joregan@gmail.com
323ee5af7a
more Makefile.in
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@618 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2011-08-18 18:40:33 +00:00
theraysmith@gmail.com
d5d15f32d7
Deleted Makefile.in from svn
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@606 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2011-08-18 16:32:44 +00:00
zdenop@gmail.com
3463abfd34
commented out parameters that caused an error (read_params_file: parameter not found:)
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@589 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2011-06-15 20:20:45 +00:00
theraysmith
311d1f9253
Added Hindi traineddata
...
git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@576 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2011-03-21 21:57:08 +00:00