tesseract

mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-11-24 02:59:07 +08:00

Author	SHA1	Message	Date
Jan Kamlah	577e8a8b93	Add PAGE XML renderer / export (#4214) Add PAGE XML export and documentation. To generate PAGE XML output just add 'page' to the tesseract command. The output is outputname + '.page.xml' to avoid conflicts with ALTO export. The output can be customized with the flags: tessedit_create_page_polygon and tessedit_create_page_wordlevel. Co-authored-by: Stefan Weil <sw@weilnetz.de>	2024-04-19 21:12:39 +02:00
Gitoffthelawn	d086c075b3	Fixed 2 errors	2022-10-06 03:53:11 -07:00
Shree	df6b1ce452	remove legacy parameter disable_character_fragments from lstm.train	2019-10-23 13:15:16 +02:00
Johannes Künsebeck	aa2ab68e29	Removed unused parameters The following parameters are not used anywhere anymore: * use_definite_ambigs_for_classifier * max_viterbi_list_size * word_to_debug_lengths * fragments_debug * tessedit_redo_xheight * debug_acceptable_wds * tessedit_matcher_log * tessedit_test_adaption_mode * docqual_excuse_outline_errs * crunch_pot_garbage * suspect_space_level * tessedit_consistent_reps * wordrec_display_all_words * wordrec_no_block * wordrec_worst_state * fragments_guide_chopper * segment_adjust_debug * classify_adapt_feature_thresh (classify_adapt_feature_threshold still exists) * classify_adapt_proto_thresh (classify_adapt_proto_threshold still exists) * classify_min_norm_scale_x * classify_max_norm_scale_x * classify_min_norm_scale_y * classify_max_norm_scale_y * il1_adaption_test * textord_blob_size_bigile * textord_blob_size_smallile * editor_debug_config_file * textord_tabfind_show_color_fit The list was generated by a python script and each parameter occurrence checked manually.	2019-10-03 09:18:29 +02:00
Julian Gilbey	5a1978a4fc	fix #2616: allow building of training data This fixes Issue #2616 by preventing an attempt to build the recognition engine when running tesstrain.sh.	2019-08-13 19:05:49 +01:00
Zdenko Podobný	68ca3518be	autotools: remove list of traineddata files	2019-05-08 15:36:58 +02:00
Stefan Weil	7db25e15c0	Remove unused config variable tessedit_single_match Replace also TRUE, FALSE by true, false. Signed-off-by: Stefan Weil <sw@weilnetz.de>	2019-03-31 17:38:35 +02:00
Shree	08e96516c6	install lstmbox and wordstrbox config files	2019-03-01 15:26:59 +00:00
Shree Devi Kumar	f3362a4b5b	Add renderer to create WordStr box files from images	2019-02-10 19:59:17 +00:00
zdenop	2ae65b2493	Merge pull request #2216 from Shreeshrii/lstmbox Lstmbox	2019-02-10 13:53:41 +01:00
Chris Mayo	6dc48adfee	Rename get.image config to get.images and install	2019-02-05 19:57:53 +00:00
Shree Devi Kumar	9c89cd51cf	Add a new renderer to create box files from images for LSTM training (cherry picked from commit 921da6be2bdbda2ddd64514f9b6bec40a336246a) fix typo (cherry picked from commit 7bd1a0c80393fce2f34e2845cb26760bcf3791cd) Add lstmboxrenderer to CMakeLists (cherry picked from commit cfef3a889aef830725921b5c0218d5e9c633b03e) fix formatting (cherry picked from commit 7ba2b01ede7940ed609a073364948ef8c838cd10)	2019-02-05 14:03:29 +00:00
Stefan Weil	e817d93e62	Add configuration file for ALTO to installation Signed-off-by: Stefan Weil <sw@weilnetz.de>	2018-11-30 06:17:04 +01:00
Jake Sebright	d7cee03a94	Add support for ALTO output	2018-11-30 06:09:36 +01:00
Zdenko Podobný	ba64aaf257	add lstmdebug config to distribution and installation process	2018-10-29 09:38:11 +01:00
Stefan Weil	125fdc3f1b	Add debug configuration for LSTM It was provided by Jeff Breidenbach <jbreiden@google.com>. Signed-off-by: Stefan Weil <sw@weilnetz.de>	2018-10-27 08:04:45 +02:00
Zdenko Podobný	3d508a65a7	set unlv_tilde_crunching to false; fixes #1449 #948	2018-10-23 09:26:32 +02:00
Stefan Weil	c6f759148b	Don't set page segmentation mode for unlv config Setting the page segmentation mode to 6 ("Assume a single uniform block of text") typically improves the layout detection for such texts, but should not be done in the config file. unlvtests/runtestset.sh adds `--psm 6` explicitly, so test results won't change when using that script. This is similar to commit `ecfee53bac`. Signed-off-by: Stefan Weil <sw@weilnetz.de>	2018-10-04 21:01:18 +02:00
Stefan Weil	ecfee53bac	Don't set page segmentation mode for hocr, pdf and tsv configs Setting the page segmentation mode in those config files gives unexpected results: the text recognized when no config or only txt is given changes if both txt and any of hocr, pdf or tsv is chosen. In a test set of nearly 200 pages from historical books, using segmentation mode 1 is typically slightly better than the default, but there are also cases where it is much worse. Therefore the user should be able to decide which page segmentation mode is best. Old results for hocr, pdf or tsv now need an explicit `--psm 1` for reproduction. Signed-off-by: Stefan Weil <sw@weilnetz.de>	2018-10-04 12:05:49 +02:00
Stefan Weil	dabf3c299f	Fix file endings Text files should end with a LF, but not additional empty lines. Signed-off-by: Stefan Weil <sw@weilnetz.de>	2018-04-25 19:35:33 +02:00
Stefan Weil	10a8a67ca2	Remove execute permission from config file (#1263) This fixes the only configuration file which had such permissions. Signed-off-by: Stefan Weil <sw@weilnetz.de>	2018-01-10 16:43:02 +01:00
Atsuyoshi Suzuki	82d62f89a2	Update Makefile.am (add 'lstm.train')	2017-04-02 17:06:12 +09:00
Zdenko Podobný	a011b15b0d	fix #712: Ghostscript mangling Tesseract-produced PDFs	2017-02-15 17:09:37 +01:00
Ray Smith	81ebba0394	More makefile changes to remove cube	2016-12-14 11:17:06 -08:00
Ray Smith	65517794f9	Added missing lstm.train	2016-12-06 08:48:23 -08:00
Ray Smith	3d00d3bd94	Missing pdf font file from previous sync	2016-11-28 08:55:03 -08:00
Ray Smith	2c837dffc3	Result of clang tidy on recent merge	2016-11-07 10:46:33 -08:00
Zdenko Podobný	a6871a8c91	remove install-langs - fix #376	2016-09-01 19:21:30 +02:00
Tom Morris	fc80ceafb9	Fix hocrtsv references in Makefile	2016-03-02 10:46:52 -05:00
Tom Morris	6700edd8bc	Cleanup TSV renderer Remove all references to hocr, hocr.tsv, etc. Remove dead code for font info, input filename, HTML escapes. Improved comments. Fixed indentation.	2016-03-01 13:41:19 -05:00
Sundar M. Vaidya	937ceb2d1b	Adds hocrtsv to tessdata/configs/Makefile.am	2016-03-01 12:25:15 -05:00
Sundar M. Vaidya	3163b38151	Adds hocrtsv file to configs folder.	2016-03-01 12:23:12 -05:00
Sundar M. Vaidya	59d593d796	Calls TessHOcrTsvRenderer if tessedit_create_hocrtsv is true.	2016-03-01 12:23:12 -05:00
Tom Morris	e3e1fe0e20	Document hocr_font_info in config	2016-02-14 16:49:00 -05:00
James R. Barlow	b30930b95a	Replace pdf.ttf with sharp2.ttf, keep name the same As discussed at length in issue #182, the existing pdf.ttf causes difficulties for certain PDF viewers, in part because the old file had zero advance width. With testing, sharp2.ttf seems to be the best available compromise, although it's not perfect and causes some visual difficulties in Evince. It does seem to fix Kindle and OS X Preview.	2016-02-11 15:44:11 -08:00
Amit Dovev	6b08184a2c	Update Makefile.am	2015-12-18 16:12:32 +02:00
amitdo	c2f5e9b849	If there is no explicit renderer(s), default to TessTextRenderer Revert `fd429c32`, `43834da7`, `05de195e`. See #49, #59. The code in this commit solves the issue in a more elegant way, IMHO. Now you can use: * `tesseract eurotext.tif eurotext txt pdf` * `tesseract eurotext.tif eurotext txt hocr` * `tesseract eurotext.tif eurotext txt hocr pdf` NOTE: With `tesseract eurotext.tif eurotext` or `tesseract eurotext.tif eurotext txt` the psm will be set to '3', but... With `tesseract eurotext.tif eurotext txt pdf` or `tesseract eurotext.tif eurotext txt hocr` the psm will be set to '1'.	2015-12-11 19:06:49 +02:00
Zdenko Podobný	66a76a9477	Revert "temporary add config/*, configure and Makefile.in for release" This reverts commits `ec9581d8f2`, `1afe382c4e`, `4b2cfabcc1`	2015-07-31 21:44:43 +02:00
Zdenko Podobný	5dfb0cb898	Fixes #64 - tessedit_create_txt 0 blocks box training	2015-07-25 22:49:55 +02:00
Jim O'Regan	05de195efc	disable text creation for unlv, makebox, box.train, and box.train.stderr (see #49)	2015-07-20 10:07:55 +01:00
Jim O'Regan	43834da7a2	disable text creation when creating hOCR (issue #49)	2015-07-18 08:56:21 +01:00
Jeff Breidenbach	fd429c32a0	PDF creation: not disabling tessedit_create_txt Okay, everything is more or less under control except for this: tesseract phototest.tif - pdf > phototest.pdf This is activating both the text renderer and the pdf renderer. They both get sent to stdout where they mix together and cause chaos. Same thing happens with this command. tesseract phototest.tif stdout pdf > phototest.pdf What's happening is tesseractmain.cpp is setting tessedit_create_pdf without disabling tessedit_create_txt. https://groups.google.com/d/msgid/tesseract-dev/32c065ee-aefa-441a-b37b-b6bdc234c8ab%40googlegroups.com	2015-07-18 08:39:57 +01:00
Zdenko Podobný	ec9581d8f2	temporary add configure and Makefile.in for release	2015-07-11 09:42:43 +02:00
Ray Smith	1e3b671298	Fixes to make yesterday's changes compile	2015-05-13 09:58:59 -07:00
Ray Smith	6b634170c1	Significant change to invisible font system to improve correctness and compatibility with external programs, particularly ghostscript. We will start mapping everything to a single glyph, rather than allowing characters to run off the end of the font. A more detailed design discussion is embedded into pdfrenderer.cpp comments. The font, source code that produces the font, and the design comments were contributed by Ken Sharp from Artifex Software.	2015-05-12 17:33:18 -07:00
Ray Smith	d9699c4099	Fixed bidi handling in PDF output	2014-10-09 13:29:01 -07:00
Zdenko Podobný	369fabb7fc	fix filemode; update autotools and distribution script to repository changes; ignore doxygen generated files and language data files	2014-08-14 23:37:17 +02:00
zdenop	1ea387232b	fix compatibility of uninstall: MacOSX rm needs -f instead of --force git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1127 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2014-07-24 20:39:30 +00:00
zdenop	a66f5b84c8	install pdf.ttf and pdf.ttx as part of tesseract library git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@1031 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2014-01-29 22:12:32 +00:00
theraysmith@gmail.com	91d2265429	More minor fixes from issues and cleanup git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@974 d0cd1f9f-072b-0410-8dd7-cf729c803f20	2014-01-10 01:38:00 +00:00

118 Commits