From 6a28cce96b4f9b2a115650fbd2ad0b46f9d34ab7 Mon Sep 17 00:00:00 2001
From: Stefan Weil
Date: Sat, 16 Sep 2017 18:47:04 +0200
Subject: [PATCH] Fix whitespace issues

* Remove whitespace (blanks, tabs, cr) at line endings

Signed-off-by: Stefan Weil
---
 .github/ISSUE_TEMPLATE.md                     |  2 +-
 CONTRIBUTING.md                               | 16 ++--
 ChangeLog                                     | 12 +--
 INSTALL.GIT.md                                |  6 +-
 appveyor.yml                                  |  8 +-
 autogen.sh                                    | 12 +--
 contrib/genlangdata.pl                        |  4 +-
 contrib/tesseract.completion                  |  8 +-
 doc/Makefile.am                               |  2 +-
 doc/classifier_tester.1.asc                   | 14 ++--
 doc/combine_lang_model.1.asc                  | 40 +++++-----
 doc/lstmeval.1.asc                            | 18 ++---
 doc/lstmtraining.1.asc                        | 16 ++--
 doc/merge_unicharsets.1.asc                   |  8 +-
 doc/set_unicharset_properties.1.asc           |  8 +-
 doc/tesseract.1.asc                           | 78 +++++++++----------
 doc/text2image.1.asc                          | 30 +++----
 .../com/google/scrollview/events/SVEvent.java | 10 +--
 .../google/scrollview/events/SVEventType.java |  4 +-
 src/api/Makefile.am                           |  4 +-
 src/ccstruct/Makefile.am                      |  2 +-
 src/ccutil/genericvector.h                    |  4 +-
 src/classify/Makefile.am                      |  2 +-
 src/classify/classify.cpp                     |  2 +-
 src/classify/featdefs.cpp                     |  8 +-
 src/classify/float2int.cpp                    |  8 +-
 src/classify/protos.cpp                       | 10 +--
 src/cutil/bitvec.cpp                          |  2 +-
 src/cutil/emalloc.cpp                         |  4 +-
 src/dict/Makefile.am                          |  2 +-
 src/lstm/lstmrecognizer.cpp                   |  2 +-
 src/lstm/lstmrecognizer.h                     |  2 +-
 src/lstm/recodebeam.cpp                       |  6 +-
 src/lstm/recodebeam.h                         |  2 +-
 src/textord/Makefile.am                       |  2 +-
 src/textord/colpartition.cpp                  |  2 +-
 src/textord/tablerecog.cpp                    |  2 +-
 src/training/Makefile.am                      | 10 +--
 src/vs2010/tesseract/libtesseract.rc.in       |  6 +-
 src/wordrec/Makefile.am                       |  2 +-
 unittest/Makefile.am                          |  4 +-
 unittest/apiexample_test.cc                   | 22 +++---
 unittest/loadlang_test.cc                     | 24 +++---
 unittest/osd_test.cc                          | 40 +++++-----
 unlvtests/README.md                           |  8 +-
 45 files changed, 239 insertions(+), 239 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md
index 121cdeaa..ecc0b180 100644
--- a/.github/ISSUE_TEMPLATE.md
+++ b/.github/ISSUE_TEMPLATE.md
@@ -6,7 +6,7 @@ Note that it will be
much easier for us to fix the issue if a test case that reproduces the problem is provided. Ideally this test case should not have any external dependencies. Provide a copy of the image or link to files for the test case. -Please delete this text and fill in the template below. +Please delete this text and fill in the template below. ------------------------ diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 4a1ba02a..d5e0c51d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -9,9 +9,9 @@ If you think you found a bug in Tesseract, please create an issue. Use the [users mailing-list](https://groups.google.com/d/forum/tesseract-ocr) instead of creating an Issue if ... * You have problems using Tesseract and need some help. * You have problems installing the software. -* You are not satisfied with the accuracy of the OCR, and want to ask how you can improve it. Note: You should first read the [ImproveQuality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) wiki page. +* You are not satisfied with the accuracy of the OCR, and want to ask how you can improve it. Note: You should first read the [ImproveQuality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) wiki page. * You are trying to train Tesseract and you have a problem and/or want to ask a question about the training process. Note: You should first read the **official** guides [[1]](https://github.com/tesseract-ocr/tesseract/wiki) or [[2]](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) found in the project wiki. -* You have a general question. +* You have a general question. An issue should only be reported if the platform you are using is one of these: * Linux (but not a version that is more than 4 years old) @@ -22,7 +22,7 @@ For older versions or other operating systems, use the Tesseract forum. When creating an issue, please report your operating system, including its specific version: "Ubuntu 16.04", "Windows 10", "Mac OS X 10.11" etc. 
-Search through open and closed issues to see if similar issue has been reported already (and sometimes also has been solved). +Search through open and closed issues to see if similar issue has been reported already (and sometimes also has been solved). Similarly, before you post your question in the forum, search through past threads to see if similar question has been asked already. @@ -32,10 +32,10 @@ Only report an issue in the latest official release. Optionally, try to check if Make sure you are able to replicate the problem with Tesseract command line program. For external programs that use Tesseract (including wrappers and your own program, if you are developer), report the issue to the developers of that software if it's possible. You can also try to find help in the Tesseract forum. -Each version of Tesseract has its own language data you need to obtain. You **must** obtain and install trained data for English (eng) and osd. Verify that Tesseract knows about these two files (and other trained data you installed) with this command: +Each version of Tesseract has its own language data you need to obtain. You **must** obtain and install trained data for English (eng) and osd. Verify that Tesseract knows about these two files (and other trained data you installed) with this command: `tesseract --list-langs`. -Post example files to demonstrate the problem. +Post example files to demonstrate the problem. BUT don't post files with private info (about yourself or others). When attaching a file to the issue report / forum ... @@ -46,7 +46,7 @@ Do not attach programs or libraries to your issues/posts. For large files or for programs, add a link to a location where they can be downloaded (your site, Git repo, Google Drive, Dropbox etc.) -Attaching a multi-page TIFF image is useful only if you have problem with multi-page functionality, otherwise attach only one or a few single page images. 
+Attaching a multi-page TIFF image is useful only if you have problem with multi-page functionality, otherwise attach only one or a few single page images. Copy the error message from the console instead of sending a screenshot of it. @@ -54,7 +54,7 @@ Use the toolbar above the comment edit area to format your comment. Add three backticks before and after a code sample or output of a command to format it (The `Insert code` button can help you doing it). -If your comment includes a code sample or output of a command that exceeds ~25 lines, post it as attached text file (`filename.txt`). +If your comment includes a code sample or output of a command that exceeds ~25 lines, post it as attached text file (`filename.txt`). Use `Preview` before you send your issue. Read it again before sending. @@ -62,7 +62,7 @@ Note that most of the people that respond to issues and answer questions are eit The [tesseract developers](http://groups.google.com/group/tesseract-dev/) forum should be used to discuss Tesseract development: bug fixes, enhancements, add-ons for Tesseract. -Sometimes you will not get a respond to your issue or question. We apologize in advance! Please don't take it personally. There can be many reasons for this, including: time limits, no one knows the answer (at least not the ones that are available at that time) or just that +Sometimes you will not get a respond to your issue or question. We apologize in advance! Please don't take it personally. There can be many reasons for this, including: time limits, no one knows the answer (at least not the ones that are available at that time) or just that your question has been asked (and has been answered) many times before... ## For Developers: Creating a Pull Request diff --git a/ChangeLog b/ChangeLog index 1e634e3b..af897045 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,7 +1,7 @@ 2017-03-24 - V4.00.00-alpha * Added new neural network system based on LSTMs, with major accuracy gains. * Improvements to PDF rendering. 
- * Fixes to trainingdata rendering. + * Fixes to trainingdata rendering. * Added LSTM models+lang models to 101 languages. (tessdata repository) * Improved multi-page TIFF handling. * Fixed damage to binary images when processing PDFs. @@ -40,7 +40,7 @@ * Fixed some openCL issues. * Added option to build Tesseract with CMake build system. * Implemented CPPAN support for easy Windows building. - + 2016-02-17 - V3.04.01 * Added OSD renderer for psm 0. Works for single page and multi-page images. * Improve tesstrain.sh script. @@ -84,7 +84,7 @@ text and truetype fonts. * Added support for PDF output with searchable text. * Removed entire IMAGE class and all code in image directory. - * Tesseract executable: support for output to stdout; limited support for one + * Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows) * Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF. @@ -169,12 +169,12 @@ * Added TessdataManager to combine data files into a single file. * Some dead code deleted. * VC++6 no longer supported. It can't cope with the use of templates. - * Many more languages added. + * Many more languages added. * Doxygenation of most of the function header comments. * Added man pages. 
* Added bash completion script (issue 247: thanks to neskiem) * Fix integer overview in thresholding (issue 366: thanks to Cyanide.Drake) - * Add Danish Fraktur support (issues 300, 360: thanks to + * Add Danish Fraktur support (issues 300, 360: thanks to dsl602230@vip.cybercity.dk) * Fix file pointer leak (issue 359, thanks to yukihiro.nakadaira) * Fix an error using user-words (Issue 345: thanks to max.markin) @@ -183,7 +183,7 @@ * Fix an automake error (Issue 318, thanks to ichanjz) * Fix a Win32 crash on fileFormatIsTiff() (Issues 304, 316, 317, 330, 347, 349, 352: thanks to nguyenq87, max.markin, zdenop) - * Fixed a number of errors in newer (stricter) versions of VC++ (Issues + * Fixed a number of errors in newer (stricter) versions of VC++ (Issues 301, among others) 2009-06-30 - V2.04 diff --git a/INSTALL.GIT.md b/INSTALL.GIT.md index c53b1381..ef1d1b9a 100644 --- a/INSTALL.GIT.md +++ b/INSTALL.GIT.md @@ -26,14 +26,14 @@ So, the steps for making Tesseract are: $ make training $ sudo make training-install -You need to install at least English language and OSD traineddata files to -`TESSDATA_PREFIX` directory. +You need to install at least English language and OSD traineddata files to +`TESSDATA_PREFIX` directory. You can retrieve single file with tools like [wget](https://www.gnu.org/software/wget/), [curl](https://curl.haxx.se/), [GithubDownloader](https://github.com/intezer/GithubDownloader) or browser. All language data files can be retrieved from git repository (useful only for packagers!). (Repository is huge - more that 1.2 GB. You do NOT need to download traineddata files for -all languages). +all languages). 
$ git clone https://github.com/tesseract-ocr/tessdata.git tesseract-ocr.tessdata diff --git a/appveyor.yml b/appveyor.yml index 521cbf13..234b1132 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -5,13 +5,13 @@ environment: - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017 vs_ver: 15 2017 vs_platform: " Win64" - + configuration: - Release - + cache: - c:/Users/appveyor/.cppan/storage - + # for curl install: - set PATH=C:\Program Files\Git\mingw64\bin;%PATH% @@ -25,7 +25,7 @@ before_build: - ps: 'Add-Content $env:USERPROFILE\.cppan\cppan.yml "`n`nbuild_warning_level: 0`n"' - ps: 'Add-Content $env:USERPROFILE\.cppan\cppan.yml "`n`nbuild_system_verbose: false`n"' - ps: 'Add-Content $env:USERPROFILE\.cppan\cppan.yml "`n`nvar_check_jobs: 1`n"' - + build_script: - mkdir build - mkdir build\bin diff --git a/autogen.sh b/autogen.sh index e90cf4d4..2faf72ed 100755 --- a/autogen.sh +++ b/autogen.sh @@ -46,10 +46,10 @@ if [ "$1" = "clean" ]; then find . -iname "Makefile.in" -type f -exec rm '{}' + fi -# Prevent any errors that might result from failing to properly invoke -# `libtoolize` or `glibtoolize,` whichever is present on your system, -# from occurring by testing for its existence and capturing the absolute path to -# its location for caching purposes prior to using it later on in 'Step 2:' +# Prevent any errors that might result from failing to properly invoke +# `libtoolize` or `glibtoolize,` whichever is present on your system, +# from occurring by testing for its existence and capturing the absolute path to +# its location for caching purposes prior to using it later on in 'Step 2:' if command -v libtoolize >/dev/null 2>&1; then LIBTOOLIZE="$(command -v libtoolize)" elif command -v glibtoolize >/dev/null 2>&1; then @@ -67,13 +67,13 @@ fi bail_out() { echo - echo " Something went wrong, bailing out!" + echo " Something went wrong, bailing out!" echo exit 1 } # --- Step 1: Generate aclocal.m4 from: -# . acinclude.m4 +# . acinclude.m4 # . 
config/*.m4 (these files are referenced in acinclude.m4) mkdir -p config diff --git a/contrib/genlangdata.pl b/contrib/genlangdata.pl index 53e3431e..ae9a5fe0 100644 --- a/contrib/genlangdata.pl +++ b/contrib/genlangdata.pl @@ -8,7 +8,7 @@ use Getopt::Std; =pod -=head1 NAME +=head1 NAME genwordlists.pl - generate word lists for Tesseract @@ -33,7 +33,7 @@ use: pfx=$(echo $i|tr '/' '_'); cat $i | \ perl genwordlists.pl -d OUTDIR -p $pfx; done -This will create a set of output files to match each of the files +This will create a set of output files to match each of the files WikiExtractor created. To combine these files: diff --git a/contrib/tesseract.completion b/contrib/tesseract.completion index 06bcdc15..3e0b1e96 100644 --- a/contrib/tesseract.completion +++ b/contrib/tesseract.completion @@ -1,6 +1,6 @@ #-*- mode: shell-script;-*- # -# bash completion support for tesseract +# bash completion support for tesseract # # Copyright (C) 2009 Neskie A. Manuel # Distributed under the Apache License, Version 2.0. @@ -20,19 +20,19 @@ _tesseract() COMPREPLY=() cur="$2" prev="$3" - + case "$prev" in tesseract) COMPREPLY=($(compgen -f -X "!*.+(tif)" -- "$cur") ) ;; *.tif) - COMPREPLY=($(compgen -W "$(basename $prev .tif)" ) ) + COMPREPLY=($(compgen -W "$(basename $prev .tif)" ) ) ;; -l) _tesseract_languages ;; *) - COMPREPLY=($(compgen -W "-l" ) ) + COMPREPLY=($(compgen -W "-l" ) ) ;; esac } diff --git a/doc/Makefile.am b/doc/Makefile.am index 6844f022..52c6898e 100644 --- a/doc/Makefile.am +++ b/doc/Makefile.am @@ -17,7 +17,7 @@ man_MANS = \ text2image.1 \ unicharambigs.5 \ unicharset_extractor.1 \ - wordlist2dawg.1 + wordlist2dawg.1 if !DISABLED_LEGACY_ENGINE man_MANS += \ diff --git a/doc/classifier_tester.1.asc b/doc/classifier_tester.1.asc index ff81447a..758b367b 100644 --- a/doc/classifier_tester.1.asc +++ b/doc/classifier_tester.1.asc @@ -11,9 +11,9 @@ SYNOPSIS DESCRIPTION ----------- -classifier_tester(1) runs Tesseract in a special mode. 
-It takes a list of .tr files and tests a character classifier -on data as formatted for training, +classifier_tester(1) runs Tesseract in a special mode. +It takes a list of .tr files and tests a character classifier +on data as formatted for training, but it doesn't have to be the same as the training data. IN/OUT ARGUMENTS @@ -25,11 +25,11 @@ OPTIONS ------- -l 'lang':: (Input) three character language code; default value 'eng'. - + -classifier 'x':: (Input) One of "pruner", "full". - - + + -U 'unicharset':: (Input) The unicharset for the language. @@ -42,7 +42,7 @@ OPTIONS (Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ] *font_name* *xheight* - + -output_trainer 'trainer':: (Output, Optional) Filename for output trainer. diff --git a/doc/combine_lang_model.1.asc b/doc/combine_lang_model.1.asc index c233cd58..0c2b16d7 100644 --- a/doc/combine_lang_model.1.asc +++ b/doc/combine_lang_model.1.asc @@ -8,54 +8,54 @@ combine_lang_model - generate starter traineddata SYNOPSIS -------- -*combine_lang_model* --input_unicharset 'filename' --script_dir 'dirname' --output_dir 'rootdir' --lang 'lang' [--lang_is_rtl] [pass_through_recoder] [--words file --puncs file --numbers file] +*combine_lang_model* --input_unicharset 'filename' --script_dir 'dirname' --output_dir 'rootdir' --lang 'lang' [--lang_is_rtl] [pass_through_recoder] [--words file --puncs file --numbers file] DESCRIPTION ----------- combine_lang_model(1) generates a starter traineddata file that can be used to train an LSTM-based neural network model. It takes as input a unicharset and an optional set of wordlists. It eliminates the need to run set_unicharset_properties(1), wordlist2dawg(1), some non-existent binary to generate the recoder (unicode compressor), and finally combine_tessdata(1). 
- + OPTIONS ------- '-l lang':: - The language to use. + The language to use. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES) -'--script_dir PATH':: +'--script_dir PATH':: Directory name for input script unicharsets. It should point to the location of langdata (github repo) directory. (type:string default:) - -'--input_unicharset FILE':: + +'--input_unicharset FILE':: Unicharset to complete and use in encoding. It can be a hand-created file with incomplete fields. Its basic and script properties will be set before it is used. (type:string default:) - + '--lang_is_rtl BOOL':: True if language being processed is written right-to-left (eg Arabic/Hebrew). (type:bool default:false) - + '--pass_through_recoder BOOL':: If true, the recoder is a simple pass-through of the unicharset. Otherwise, potentially a compression of it by encoding Hangul in Jamos, decomposing multi-unicode symbols into sequences of unicodes, and encoding Han using the data in the radical_table_data, which must be the content of the file: langdata/radical-stroke.txt. (type:bool default:false) -'--version_str STRING':: +'--version_str STRING':: An arbitrary version label to add to traineddata file (type:string default:) - -'--words FILE':: + +'--words FILE':: (Optional) File listing words to use for the system dictionary (type:string default:) - -'--numbers FILE':: + +'--numbers FILE':: (Optional) File listing number patterns (type:string default:) - -'--puncs FILE':: + +'--puncs FILE':: (Optional) File listing punctuation patterns. The words/puncs/numbers lists may be all empty. If any are non-empty then puncs must be non-empty. (type:string default:) - -'--output_dir PATH':: + +'--output_dir PATH':: Root directory for output files. Output files will be written to //.* (type:string default:) - + HISTORY ------- -combine_lang_model(1) was first made available for tesseract4.00.00alpha. +combine_lang_model(1) was first made available for tesseract4.00.00alpha. 
RESOURCES --------- Main web site: + Information on training tesseract LSTM: - + SEE ALSO -------- tesseract(1) diff --git a/doc/lstmeval.1.asc b/doc/lstmeval.1.asc index ada1ecf4..5fae045e 100644 --- a/doc/lstmeval.1.asc +++ b/doc/lstmeval.1.asc @@ -4,7 +4,7 @@ LSTMEVAL(1) NAME ---- -lstmeval - Evaluation program for LSTM-based networks. +lstmeval - Evaluation program for LSTM-based networks. SYNOPSIS -------- @@ -12,34 +12,34 @@ SYNOPSIS DESCRIPTION ----------- -lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, '--traineddata' should also be specified. - +lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, '--traineddata' should also be specified. + OPTIONS ------- '--model FILE':: Name of model file (training or recognition) (type:string default:) - + '--traineddata FILE':: If model is a training checkpoint, then traineddata must be the traineddata file that was given to the trainer (type:string default:) - + '--eval_listfile FILE':: File listing sample files in lstmf training format. (type:string default:) - + '--max_image_MB INT':: Max memory to use for images. (type:int default:2000) - + '--verbosity INT':: Amount of diagnosting information to output (0-2). (type:int default:1) HISTORY ------- -lstmeval(1) was first made available for tesseract4.00.00alpha. +lstmeval(1) was first made available for tesseract4.00.00alpha. RESOURCES --------- Main web site: + Information on training tesseract LSTM: - + SEE ALSO -------- tesseract(1) diff --git a/doc/lstmtraining.1.asc b/doc/lstmtraining.1.asc index 81122568..ea47e31e 100644 --- a/doc/lstmtraining.1.asc +++ b/doc/lstmtraining.1.asc @@ -8,19 +8,19 @@ lstmtraining - Training program for LSTM-based networks. 
SYNOPSIS -------- -*lstmtraining* +*lstmtraining* --continue_from 'train_output_dir/continue_from_lang.lstm' - --old_traineddata 'bestdata_dir/continue_from_lang.traineddata' - --traineddata 'train_output_dir/lang/lang.traineddata' - --max_iterations 'NNN' - --debug_interval '0|-1' + --old_traineddata 'bestdata_dir/continue_from_lang.traineddata' + --traineddata 'train_output_dir/lang/lang.traineddata' + --max_iterations 'NNN' + --debug_interval '0|-1' --train_listfile 'train_output_dir/lang.training_files.txt' --model_output 'train_output_dir/newlstmmodel' DESCRIPTION ----------- lstmtraining(1) trains LSTM-based networks using a list of lstmf files and starter traineddata file as the main input. Training from scratch is not recommended to be done by users. Finetuning (example command shown in synopsis above) or replacing a layer options can be used instead. Different options apply to different types of training. Read [Training Wiki page](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) for details. - + OPTIONS ------- @@ -95,13 +95,13 @@ OPTIONS HISTORY ------- -lstmtraining(1) was first made available for tesseract4.00.00alpha. +lstmtraining(1) was first made available for tesseract4.00.00alpha. RESOURCES --------- Main web site: + Information on training tesseract LSTM: - + SEE ALSO -------- tesseract(1) diff --git a/doc/merge_unicharsets.1.asc b/doc/merge_unicharsets.1.asc index 44439ed8..5e4d1112 100644 --- a/doc/merge_unicharsets.1.asc +++ b/doc/merge_unicharsets.1.asc @@ -13,23 +13,23 @@ SYNOPSIS DESCRIPTION ----------- merge_unicharsets(1) is a simple tool to merge two or more unicharsets. -It could be used to create a combined unicharset for a script-level engine, +It could be used to create a combined unicharset for a script-level engine, like the new Latin or Devanagari. IN/OUT ARGUMENTS ---------------- 'unicharset-in-1':: (Input) The name of the first unicharset file to be merged. 
- + 'unicharset-in-n':: (Input) The name of the nth unicharset file to be merged. 'unicharset-out':: (Output) The name of the merged unicharset file. - + HISTORY ------- -merge_unicharsets(1) was first made available for tesseract4.00.00alpha. +merge_unicharsets(1) was first made available for tesseract4.00.00alpha. RESOURCES --------- diff --git a/doc/set_unicharset_properties.1.asc b/doc/set_unicharset_properties.1.asc index 16770a95..e86911a5 100644 --- a/doc/set_unicharset_properties.1.asc +++ b/doc/set_unicharset_properties.1.asc @@ -19,22 +19,22 @@ OPTIONS '--script_dir /path/to/langdata':: (Input) Specify the location of directory for universal script unicharsets and font xheights (type:string default:) - + '--U unicharsetfile':: (Input) Specify the location of the unicharset to load as input. - + '--O unicharsetfile':: (Output) Specify the location of the unicharset to be written with updated properties. HISTORY ------- -set_unicharset_properties(1) was first made available for tesseract version 3.03. +set_unicharset_properties(1) was first made available for tesseract version 3.03. RESOURCES --------- Main web site: + Information on training: - + SEE ALSO -------- tesseract(1) diff --git a/doc/tesseract.1.asc b/doc/tesseract.1.asc index ecc938c7..8604053b 100644 --- a/doc/tesseract.1.asc +++ b/doc/tesseract.1.asc @@ -246,48 +246,48 @@ SCRIPTS ------- The traineddata files for the following scripts for tesseract 4.0 -are also in https://github.com/tesseract-ocr/tessdata_fast. +are also in https://github.com/tesseract-ocr/tessdata_fast. -In most cases, each of these contains all the languages that use that script PLUS English. -So it is possible to recognize a language that has not been specifically trained for +In most cases, each of these contains all the languages that use that script PLUS English. +So it is possible to recognize a language that has not been specifically trained for by using traineddata for the script it is written in. 
-Arabic, -Armenian, -Bengali, -Canadian Aboriginal, -Cherokee, -Cyrillic, -Devanagari, -Ethiopic, -Fraktur, -Georgian, -Greek, -Gujarati, -Gurmukhi, -Han - Simplified, -Han - Simplified (vertical), -Han - Traditional, -Han - Traditional (vertical), -Hangul, -Hangul (vertical), -Hebrew, -Japanese, -Japanese (vertical), -Kannada, -Khmer, -Lao, -Latin, -Malayalam, -Myanmar, -Oriya (Odia), -Sinhala, -Syriac, -Tamil, -Telugu, -Thaana, -Thai, -Tibetan, +Arabic, +Armenian, +Bengali, +Canadian Aboriginal, +Cherokee, +Cyrillic, +Devanagari, +Ethiopic, +Fraktur, +Georgian, +Greek, +Gujarati, +Gurmukhi, +Han - Simplified, +Han - Simplified (vertical), +Han - Traditional, +Han - Traditional (vertical), +Hangul, +Hangul (vertical), +Hebrew, +Japanese, +Japanese (vertical), +Kannada, +Khmer, +Lao, +Latin, +Malayalam, +Myanmar, +Oriya (Odia), +Sinhala, +Syriac, +Tamil, +Telugu, +Thaana, +Thai, +Tibetan, Vietnamese. diff --git a/doc/text2image.1.asc b/doc/text2image.1.asc index d7681d2a..2a689b5f 100644 --- a/doc/text2image.1.asc +++ b/doc/text2image.1.asc @@ -4,16 +4,16 @@ TEXT2IMAGE(1) NAME ---- -text2image - generate OCR training pages. +text2image - generate OCR training pages. SYNOPSIS -------- -*text2image* --text 'FILE' --outputbase 'PATH' --fonts_dir 'PATH' [OPTION] +*text2image* --text 'FILE' --outputbase 'PATH' --fonts_dir 'PATH' [OPTION] DESCRIPTION ----------- text2image(1) generates OCR training pages. Given a text file it outputs an image with a given font and degradation. - + OPTIONS ------- '--text FILE':: @@ -27,22 +27,22 @@ OPTIONS '--fonts_dir PATH':: If empty it use system default. Otherwise it overrides system default font location (type:string default:) - + '--font FONTNAME':: Font description name to use (type:string default:Arial) - + '--writing_mode MODE':: Specify one of the following writing modes. 'horizontal' : Render regular horizontal text. (default) 'vertical' : Render vertical text. Glyph orientation is selected by Pango. 
'vertical-upright' : Render vertical text. Glyph orientation is set to be upright. (type:string default:horizontal) - + '--tlog_level INT':: - Minimum logging level for tlog() output (type:int default:0) + Minimum logging level for tlog() output (type:int default:0) '--max_pages INT':: Maximum number of pages to output (0=unlimited) (type:int default:0) - + '--degrade_image BOOL':: Degrade rendered image with speckle noise, dilation/erosion and rotation (type:bool default:true) @@ -54,7 +54,7 @@ OPTIONS '--ligatures BOOL':: Rebuild and render ligatures (type:bool default:false) - + '--exposure INT':: Exposure level in photocopier (type:int default:0) @@ -93,7 +93,7 @@ OPTIONS '--output_word_boxes BOOL':: Output word bounding boxes instead of character boxes. This is used for Cube training, and implied by --render_ngrams. (type:bool default:false) - + '--unicharset_file FILE':: File with characters in the unicharset. If --render_ngrams is true and --unicharset_file is specified, ngrams with characters that are not in unicharset will be omitted (type:string default:) @@ -114,7 +114,7 @@ Use these flags to output zero-padded, square individual character images '--glyph_num_border_pixels_to_pad INT':: Final_size=glyph_resized_size+2*glyph_num_border_pixels_to_pad (type:int default:0) - + Use these flags to find fonts that can render a given text ---------------------------------------------------------- @@ -126,7 +126,7 @@ Use these flags to find fonts that can render a given text '--min_coverage DOUBLE':: If find_fonts==true, the minimum coverage the font has of the characters in the text file to include it, between 0 and 1. 
(type:double default:1) - + Example Usage: ``` text2image --find_fonts \ @@ -136,7 +136,7 @@ text2image --find_fonts \ --render_per_font \ --outputbase ../langdata/hin/hin \ |& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/ "/' >../langdata/hin/fontslist.txt -``` +``` SINGLE OPTIONS -------------- @@ -146,13 +146,13 @@ SINGLE OPTIONS HISTORY ------- -text2image(1) was first made available for tesseract 3.03. +text2image(1) was first made available for tesseract 3.03. RESOURCES --------- Main web site: + Information on training tesseract LSTM: - + SEE ALSO -------- tesseract(1) diff --git a/java/com/google/scrollview/events/SVEvent.java b/java/com/google/scrollview/events/SVEvent.java index df62ef62..18309c2f 100644 --- a/java/com/google/scrollview/events/SVEvent.java +++ b/java/com/google/scrollview/events/SVEvent.java @@ -1,5 +1,5 @@ // Copyright 2007 Google Inc. All Rights Reserved. -// +// // Licensed under the Apache License, Version 2.0 (the "License"); You may not // use this file except in compliance with the License. You may obtain a copy of // the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by @@ -15,7 +15,7 @@ import com.google.scrollview.ui.SVWindow; /** * The SVEvent is a structure which holds the actual values of a message to be * transmitted. It corresponds to the client structure defined in scrollview.h - * + * * @author wanke@google.com */ public class SVEvent { @@ -30,7 +30,7 @@ public class SVEvent { /** * A "normal" SVEvent. - * + * * @param t The type of the event as specified in SVEventType (e.g. * SVET_CLICK) * @param w The window the event corresponds to @@ -49,12 +49,12 @@ public class SVEvent { xSize = x2; ySize = y2; commandId = 0; - parameter = p; + parameter = p; } /** * An event which issues a command (like clicking on a item in the menubar). 
- * + * * @param eventtype The type of the event as specified in SVEventType * (usually SVET_MENU or SVET_POPUP) * @param svWindow The window the event corresponds to diff --git a/java/com/google/scrollview/events/SVEventType.java b/java/com/google/scrollview/events/SVEventType.java index 6b16f7f3..b15f37e2 100644 --- a/java/com/google/scrollview/events/SVEventType.java +++ b/java/com/google/scrollview/events/SVEventType.java @@ -1,5 +1,5 @@ // Copyright 2007 Google Inc. All Rights Reserved. -// +// // Licensed under the Apache License, Version 2.0 (the "License"); You may not // use this file except in compliance with the License. You may obtain a copy of // the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by @@ -14,7 +14,7 @@ package com.google.scrollview.events; * These are the defined events which can happen in ScrollView and be * transferred to the client. They are same events as on the client side part of * ScrollView (defined in ScrollView.h). - * + * * @author wanke@google.com */ public enum SVEventType { diff --git a/src/api/Makefile.am b/src/api/Makefile.am index 8218fae5..7f90f4bf 100644 --- a/src/api/Makefile.am +++ b/src/api/Makefile.am @@ -24,7 +24,7 @@ AM_CPPFLAGS += -fvisibility=hidden -fvisibility-inlines-hidden endif pkginclude_HEADERS = apitypes.h baseapi.h capi.h renderer.h tess_version.h -lib_LTLIBRARIES = +lib_LTLIBRARIES = noinst_LTLIBRARIES = libtesseract_api.la @@ -56,7 +56,7 @@ libtesseract_la_LIBADD = \ ../cutil/libtesseract_cutil.la \ ../viewer/libtesseract_viewer.la \ ../ccutil/libtesseract_ccutil.la \ - ../opencl/libtesseract_opencl.la + ../opencl/libtesseract_opencl.la libtesseract_la_LDFLAGS += -version-info $(GENERIC_LIBRARY_VERSION) -no-undefined diff --git a/src/ccstruct/Makefile.am b/src/ccstruct/Makefile.am index 70bc7c80..63605df9 100644 --- a/src/ccstruct/Makefile.am +++ b/src/ccstruct/Makefile.am @@ -4,7 +4,7 @@ AM_CPPFLAGS += \ -I$(top_srcdir)/src/viewer \ -I$(top_srcdir)/src/opencl AM_CPPFLAGS += 
$(OPENCL_CPPFLAGS) - + if VISIBILITY AM_CPPFLAGS += -DTESS_EXPORTS \ -fvisibility=hidden -fvisibility-inlines-hidden diff --git a/src/ccutil/genericvector.h b/src/ccutil/genericvector.h index 5466d491..4ec32981 100644 --- a/src/ccutil/genericvector.h +++ b/src/ccutil/genericvector.h @@ -373,8 +373,8 @@ inline bool LoadDataFromFile(const char* filename, GenericVector* data) { fseek(fp, 0, SEEK_SET); // Trying to open a directory on Linux sets size to LONG_MAX. Catch it here. if (size > 0 && size < LONG_MAX) { - // reserve an extra byte in case caller wants to append a '\0' character - data->reserve(size + 1); + // reserve an extra byte in case caller wants to append a '\0' character + data->reserve(size + 1); data->resize_no_init(size); result = static_cast(fread(&(*data)[0], 1, size, fp)) == size; } diff --git a/src/classify/Makefile.am b/src/classify/Makefile.am index 0c1150a7..933ff1d6 100644 --- a/src/classify/Makefile.am +++ b/src/classify/Makefile.am @@ -4,7 +4,7 @@ AM_CPPFLAGS += \ -I$(top_srcdir)/src/ccstruct \ -I$(top_srcdir)/src/dict \ -I$(top_srcdir)/src/viewer - + if DISABLED_LEGACY_ENGINE AM_CPPFLAGS += -DDISABLED_LEGACY_ENGINE endif diff --git a/src/classify/classify.cpp b/src/classify/classify.cpp index 5a9a65b0..546013b0 100644 --- a/src/classify/classify.cpp +++ b/src/classify/classify.cpp @@ -25,7 +25,7 @@ namespace tesseract { Classify::Classify() - : + : INT_MEMBER(classify_debug_level, 0, "Classify debug level", this->params()), diff --git a/src/classify/featdefs.cpp b/src/classify/featdefs.cpp index e060d7d4..dc4e3735 100644 --- a/src/classify/featdefs.cpp +++ b/src/classify/featdefs.cpp @@ -123,7 +123,7 @@ void InitFeatureDefs(FEATURE_DEFS_STRUCT *featuredefs) { * * @param CharDesc character description to be deallocated * - * Globals: + * Globals: * - none */ void FreeCharDescription(CHAR_DESC CharDesc) { @@ -140,7 +140,7 @@ void FreeCharDescription(CHAR_DESC CharDesc) { * Allocate a new character description, initialize its * feature sets to 
be empty, and return it. * - * Globals: + * Globals: * - none * * @return New character description structure. @@ -226,9 +226,9 @@ bool ValidCharDescription(const FEATURE_DEFS_STRUCT &FeatureDefs, ... @endverbatim * - * Globals: + * Globals: * - none - * + * * @param FeatureDefs definitions of feature types/extractors * @param File open text file to read character description from * @return Character description read from File. diff --git a/src/classify/float2int.cpp b/src/classify/float2int.cpp index 21833d27..6d4f6ac9 100644 --- a/src/classify/float2int.cpp +++ b/src/classify/float2int.cpp @@ -36,7 +36,7 @@ namespace tesseract { * For each class in the unicharset, clears the corresponding * entry in char_norm_array. char_norm_array is indexed by unichar_id. * - * Globals: + * Globals: * - none * * @param char_norm_array array to be cleared @@ -47,13 +47,13 @@ void Classify::ClearCharNormArray(uint8_t* char_norm_array) { /*---------------------------------------------------------------------------*/ -/** +/** * For each class in unicharset, computes the match between * norm_feature and the normalization protos for that class. * Converts this number to the range from 0 - 255 and stores it * into char_norm_array. CharNormArray is indexed by unichar_id. * - * Globals: + * Globals: * - PreTrainedTemplates current set of built-in templates * * @param norm_feature character normalization feature @@ -81,7 +81,7 @@ void Classify::ComputeIntCharNormArray(const FEATURE_STRUCT& norm_feature, * in Features into integer format and saves it into * IntFeatures. * - * Globals: + * Globals: * - none * * @param Features floating point pico-features to be converted diff --git a/src/classify/protos.cpp b/src/classify/protos.cpp index cd083dd3..f316b02d 100644 --- a/src/classify/protos.cpp +++ b/src/classify/protos.cpp @@ -54,7 +54,7 @@ STRING_VAR(classify_training_file, "MicroFeatures", "Training file"); * * Add a new config to this class. 
Malloc new space and copy the * old configs if necessary. Return the config id for the new config. - * + * * @param Class The class to add to */ int AddConfigToClass(CLASS_TYPE Class) { @@ -90,7 +90,7 @@ int AddConfigToClass(CLASS_TYPE Class) { * * Add a new proto to this class. Malloc new space and copy the * old protos if necessary. Return the proto id for the new proto. - * + * * @param Class The class to add to */ int AddProtoToClass(CLASS_TYPE Class) { @@ -132,7 +132,7 @@ int AddProtoToClass(CLASS_TYPE Class) { * @name ClassConfigLength * * Return the length of all the protos in this class. - * + * * @param Class The class to add to * @param Config FIXME */ @@ -154,7 +154,7 @@ float ClassConfigLength(CLASS_TYPE Class, BIT_VECTOR Config) { * @name ClassProtoLength * * Return the length of all the protos in this class. - * + * * @param Class The class to use */ float ClassProtoLength(CLASS_TYPE Class) { @@ -172,7 +172,7 @@ float ClassProtoLength(CLASS_TYPE Class) { * @name CopyProto * * Copy the first proto into the second. - * + * * @param Src Source * @param Dest Destination */ diff --git a/src/cutil/bitvec.cpp b/src/cutil/bitvec.cpp index a656bf95..0ecef3f7 100644 --- a/src/cutil/bitvec.cpp +++ b/src/cutil/bitvec.cpp @@ -34,7 +34,7 @@ * This routine uses realloc to increase the size of * the specified bit vector. 
* - * Globals: + * Globals: * - none * * @param Vector bit vector to be expanded diff --git a/src/cutil/emalloc.cpp b/src/cutil/emalloc.cpp index a0414ccb..440649d3 100644 --- a/src/cutil/emalloc.cpp +++ b/src/cutil/emalloc.cpp @@ -42,7 +42,7 @@ void *Erealloc(void *ptr, int size) { return Buffer; } -void Efree(void *ptr) { +void Efree(void *ptr) { ASSERT_HOST(ptr != nullptr); - free(ptr); + free(ptr); } diff --git a/src/dict/Makefile.am b/src/dict/Makefile.am index 9986e070..f3a95446 100644 --- a/src/dict/Makefile.am +++ b/src/dict/Makefile.am @@ -3,7 +3,7 @@ AM_CPPFLAGS += \ -I$(top_srcdir)/src/ccutil \ -I$(top_srcdir)/src/ccstruct \ -I$(top_srcdir)/src/viewer - + if VISIBILITY AM_CPPFLAGS += -DTESS_EXPORTS \ -fvisibility=hidden -fvisibility-inlines-hidden diff --git a/src/lstm/lstmrecognizer.cpp b/src/lstm/lstmrecognizer.cpp index 7766476a..7ef79d24 100644 --- a/src/lstm/lstmrecognizer.cpp +++ b/src/lstm/lstmrecognizer.cpp @@ -186,7 +186,7 @@ void LSTMRecognizer::RecognizeLine(const ImageData& image_data, bool invert, search_->Decode(outputs, kDictRatio, kCertOffset, worst_dict_cert, &GetUnicharset(), glyph_confidences); search_->ExtractBestPathAsWords(line_box, scale_factor, debug, - &GetUnicharset(), words, + &GetUnicharset(), words, glyph_confidences); } diff --git a/src/lstm/lstmrecognizer.h b/src/lstm/lstmrecognizer.h index 0d1afbb4..0755db9a 100644 --- a/src/lstm/lstmrecognizer.h +++ b/src/lstm/lstmrecognizer.h @@ -184,7 +184,7 @@ class LSTMRecognizer { // will be used in a dictionary word. void RecognizeLine(const ImageData& image_data, bool invert, bool debug, double worst_dict_cert, const TBOX& line_box, - PointerVector* words, + PointerVector* words, bool glyph_confidences = false); // Helper computes min and mean best results in the output. 
diff --git a/src/lstm/recodebeam.cpp b/src/lstm/recodebeam.cpp index c0a9ba8b..682484f1 100644 --- a/src/lstm/recodebeam.cpp +++ b/src/lstm/recodebeam.cpp @@ -82,7 +82,7 @@ void RecodeBeamSearch::Decode(const NetworkIO& output, double dict_ratio, const UNICHARSET* charset, bool glyph_confidence) { beam_size_ = 0; int width = output.Width(); - if (glyph_confidence) + if (glyph_confidence) timesteps.clear(); for (int t = 0; t < width; ++t) { ComputeTopN(output.f(t), output.NumFeatures(), kBeamWidths[0]); @@ -128,7 +128,7 @@ void RecodeBeamSearch::SaveMostCertainGlyphs(const float* outputs, pos++; } glyphs.insert(glyphs.begin() + pos, - std::pair(charakter, outputs[i])); + std::pair(charakter, outputs[i])); } } timesteps.push_back(glyphs); @@ -515,7 +515,7 @@ void RecodeBeamSearch::ContinueContext(const RecodeNode* prev, int index, if (previous != nullptr) { prefix.Set(p, previous->code); full_code.Set(p, previous->code); - } + } } if (prev != nullptr && !is_simple_text_) { if (top_n_flags_[prev->code] == top_n_flag) { diff --git a/src/lstm/recodebeam.h b/src/lstm/recodebeam.h index 85636581..c9970daa 100644 --- a/src/lstm/recodebeam.h +++ b/src/lstm/recodebeam.h @@ -208,7 +208,7 @@ class RecodeBeamSearch { // Generates debug output of the content of the beams after a Decode. void DebugBeams(const UNICHARSET& unicharset) const; - + std::vector< std::vector>> timesteps; // Clipping value for certainty inside Tesseract. Reflects the minimum value // of certainty that will be returned by ExtractBestPathAsUnicharIds. 
diff --git a/src/textord/Makefile.am b/src/textord/Makefile.am index ab656d2b..56f7bd30 100644 --- a/src/textord/Makefile.am +++ b/src/textord/Makefile.am @@ -11,7 +11,7 @@ AM_CPPFLAGS += \ -I$(top_srcdir)/src/opencl AM_CPPFLAGS += $(OPENCL_CPPFLAGS) - + if VISIBILITY AM_CPPFLAGS += -DTESS_EXPORTS \ -fvisibility=hidden -fvisibility-inlines-hidden diff --git a/src/textord/colpartition.cpp b/src/textord/colpartition.cpp index b86f4fba..ad2c6d88 100644 --- a/src/textord/colpartition.cpp +++ b/src/textord/colpartition.cpp @@ -1343,7 +1343,7 @@ bool ColPartition::HasGoodBaseline() { width = last_pt.x() - first_pt.x(); } // Maximum median error allowed to be a good text line. - if (height_count == 0) + if (height_count == 0) return false; double max_error = kMaxBaselineError * total_height / height_count; ICOORD start_pt, end_pt; diff --git a/src/textord/tablerecog.cpp b/src/textord/tablerecog.cpp index 00ecd87a..5de16c4c 100644 --- a/src/textord/tablerecog.cpp +++ b/src/textord/tablerecog.cpp @@ -54,7 +54,7 @@ const double kMaxRowSize = 2.5; // Number of filled columns required to form a strong table row. // For small tables, this is an absolute number. 
const double kGoodRowNumberOfColumnsSmall[] = { 2, 2, 2, 2, 2, 3, 3 }; -const int kGoodRowNumberOfColumnsSmallSize = +const int kGoodRowNumberOfColumnsSmallSize = sizeof(kGoodRowNumberOfColumnsSmall) / sizeof(double) - 1; // For large tables, it is a relative number const double kGoodRowNumberOfColumnsLarge = 0.7; diff --git a/src/training/Makefile.am b/src/training/Makefile.am index ecbc2541..fd38ffbe 100644 --- a/src/training/Makefile.am +++ b/src/training/Makefile.am @@ -20,8 +20,8 @@ if DISABLED_LEGACY_ENGINE AM_CPPFLAGS += -DDISABLED_LEGACY_ENGINE endif -# TODO: training programs can not be linked to shared library created -# with -fvisibility +# TODO: training programs can not be linked to shared library created +# with -fvisibility if VISIBILITY AM_LDFLAGS += -all-static endif @@ -57,9 +57,9 @@ endif noinst_LTLIBRARIES = libtesseract_training.la libtesseract_tessopt.la libtesseract_training_la_LIBADD = \ - ../cutil/libtesseract_cutil.la + ../cutil/libtesseract_cutil.la # ../api/libtesseract.la - + libtesseract_training_la_SOURCES = \ boxchar.cpp \ commandlineflags.cpp \ @@ -275,5 +275,5 @@ lstmeval_LDADD += $(LEPTONICA_LIBS) lstmtraining_LDADD += $(LEPTONICA_LIBS) set_unicharset_properties_LDADD += $(LEPTONICA_LIBS) text2image_LDADD += $(LEPTONICA_LIBS) -unicharset_extractor_LDADD += $(LEPTONICA_LIBS) +unicharset_extractor_LDADD += $(LEPTONICA_LIBS) wordlist2dawg_LDADD += $(LEPTONICA_LIBS) diff --git a/src/vs2010/tesseract/libtesseract.rc.in b/src/vs2010/tesseract/libtesseract.rc.in index 98414acc..85809cd9 100644 --- a/src/vs2010/tesseract/libtesseract.rc.in +++ b/src/vs2010/tesseract/libtesseract.rc.in @@ -27,18 +27,18 @@ LANGUAGE LANG_ENGLISH, SUBLANG_ENGLISH_US // TEXTINCLUDE // -1 TEXTINCLUDE +1 TEXTINCLUDE BEGIN "resource.h\0" END -2 TEXTINCLUDE +2 TEXTINCLUDE BEGIN "#include ""afxres.h""\r\n" "\0" END -3 TEXTINCLUDE +3 TEXTINCLUDE BEGIN "\r\n" "\0" diff --git a/src/wordrec/Makefile.am b/src/wordrec/Makefile.am index ef1beaf0..8f8ccbae 100644 --- 
a/src/wordrec/Makefile.am +++ b/src/wordrec/Makefile.am @@ -61,5 +61,5 @@ libtesseract_wordrec_la_SOURCES += \ plotedges.cpp \ render.cpp \ segsearch.cpp \ - wordclass.cpp + wordclass.cpp endif diff --git a/unittest/Makefile.am b/unittest/Makefile.am index 8ece3990..21844c91 100644 --- a/unittest/Makefile.am +++ b/unittest/Makefile.am @@ -27,7 +27,7 @@ AM_CPPFLAGS += -I$(top_srcdir)/src/wordrec # Build googletest: check_LTLIBRARIES = libgtest.la libgtest_main.la libgmock.la libgmock_main.la libgtest_la_SOURCES = ../googletest/googletest/src/gtest-all.cc -libgtest_la_CPPFLAGS = -I$(top_srcdir)/googletest/googletest/include -I$(top_srcdir)/googletest/googletest -pthread +libgtest_la_CPPFLAGS = -I$(top_srcdir)/googletest/googletest/include -I$(top_srcdir)/googletest/googletest -pthread libgtest_main_la_SOURCES = ../googletest/googletest/src/gtest_main.cc ## libgtest_main_la_LIBADD = libgtest.la @@ -57,7 +57,7 @@ check_PROGRAMS = \ matrix_test \ osd_test \ loadlang_test \ - tesseracttests + tesseracttests TESTS = $(check_PROGRAMS) diff --git a/unittest/apiexample_test.cc b/unittest/apiexample_test.cc index 55077112..d3cf5389 100644 --- a/unittest/apiexample_test.cc +++ b/unittest/apiexample_test.cc @@ -63,7 +63,7 @@ class QuickTest : public testing::Test { class MatchGroundTruth : public QuickTest , public ::testing::WithParamInterface { }; - + TEST_P(MatchGroundTruth, FastPhototestOCR) { OCRTester(TESTING_DIR "/phototest.tif", TESTING_DIR "/phototest.txt", @@ -75,33 +75,33 @@ class QuickTest : public testing::Test { TESTING_DIR "/phototest.txt", TESSDATA_DIR "_best", GetParam()); } - + TEST_P(MatchGroundTruth, TessPhototestOCR) { OCRTester(TESTING_DIR "/phototest.tif", TESTING_DIR "/phototest.txt", TESSDATA_DIR , GetParam()); } - - INSTANTIATE_TEST_CASE_P( Eng, MatchGroundTruth, + + INSTANTIATE_TEST_CASE_P( Eng, MatchGroundTruth, ::testing::Values("eng") ); - INSTANTIATE_TEST_CASE_P( Latin, MatchGroundTruth, + INSTANTIATE_TEST_CASE_P( Latin, MatchGroundTruth, 
::testing::Values("script/Latin") ); - INSTANTIATE_TEST_CASE_P( Deva, MatchGroundTruth, + INSTANTIATE_TEST_CASE_P( Deva, MatchGroundTruth, ::testing::Values("script/Devanagari") ); - INSTANTIATE_TEST_CASE_P( Arab, MatchGroundTruth, + INSTANTIATE_TEST_CASE_P( Arab, MatchGroundTruth, ::testing::Values("script/Arabic") ); - + class EuroText : public QuickTest { }; - + TEST_F(EuroText, FastLatinOCR) { OCRTester(TESTING_DIR "/eurotext.tif", TESTING_DIR "/eurotext.txt", TESSDATA_DIR "_fast", "script/Latin"); } - // script/Latin for eurotext.tif does not match groundtruth + // script/Latin for eurotext.tif does not match groundtruth // for tessdata & tessdata_best // so do not test these here. - + } // namespace diff --git a/unittest/loadlang_test.cc b/unittest/loadlang_test.cc index 942d1f23..9ccc2228 100644 --- a/unittest/loadlang_test.cc +++ b/unittest/loadlang_test.cc @@ -37,13 +37,13 @@ class QuickTest : public testing::Test { ASSERT_FALSE(api->Init(tessdatadir, lang)) << "Could not initialize tesseract for $lang."; api->End(); } - + // For all languages - + class LoadLanguage : public QuickTest , public ::testing::WithParamInterface { }; - + TEST_P(LoadLanguage, afr) {LangLoader("afr" , GetParam());} TEST_P(LoadLanguage, amh) {LangLoader("amh" , GetParam());} TEST_P(LoadLanguage, ara) {LangLoader("ara" , GetParam());} @@ -169,18 +169,18 @@ class QuickTest : public testing::Test { TEST_P(LoadLanguage, yid) {LangLoader("yid" , GetParam());} TEST_P(LoadLanguage, yor) {LangLoader("yor" , GetParam());} - INSTANTIATE_TEST_CASE_P( Tessdata_fast, LoadLanguage, + INSTANTIATE_TEST_CASE_P( Tessdata_fast, LoadLanguage, ::testing::Values(TESSDATA_DIR "_fast") ); - INSTANTIATE_TEST_CASE_P( Tessdata_best, LoadLanguage, + INSTANTIATE_TEST_CASE_P( Tessdata_best, LoadLanguage, ::testing::Values(TESSDATA_DIR "_best") ); - INSTANTIATE_TEST_CASE_P( Tessdata, LoadLanguage, + INSTANTIATE_TEST_CASE_P( Tessdata, LoadLanguage, ::testing::Values(TESSDATA_DIR) ); // For all scripts class 
LoadScript : public QuickTest , public ::testing::WithParamInterface { - }; + }; TEST_P(LoadScript, Arabic) {LangLoader("script/Arabic" , GetParam());} TEST_P(LoadScript, Armenian) {LangLoader("script/Armenian" , GetParam());} @@ -219,19 +219,19 @@ class QuickTest : public testing::Test { TEST_P(LoadScript, Thai) {LangLoader("script/Thai" , GetParam());} TEST_P(LoadScript, Tibetan) {LangLoader("script/Tibetan" , GetParam());} TEST_P(LoadScript, Vietnamese) {LangLoader("script/Vietnamese" , GetParam());} - - INSTANTIATE_TEST_CASE_P( Tessdata_fast, LoadScript, + + INSTANTIATE_TEST_CASE_P( Tessdata_fast, LoadScript, ::testing::Values(TESSDATA_DIR "_fast") ); - INSTANTIATE_TEST_CASE_P( Tessdata_best, LoadScript, + INSTANTIATE_TEST_CASE_P( Tessdata_best, LoadScript, ::testing::Values(TESSDATA_DIR "_best") ); - INSTANTIATE_TEST_CASE_P( Tessdata, LoadScript, + INSTANTIATE_TEST_CASE_P( Tessdata, LoadScript, ::testing::Values(TESSDATA_DIR) ); // Use class LoadLang for languages which are NOT there in all three repos class LoadLang : public QuickTest { }; - + TEST_F(LoadLang, kmrFast) {LangLoader("kmr" , TESSDATA_DIR "_fast");} TEST_F(LoadLang, kmrBest) {LangLoader("kmr" , TESSDATA_DIR "_best");} // TEST_F(LoadLang, kmrBestInt) {LangLoader("kmr" , TESSDATA_DIR);} diff --git a/unittest/osd_test.cc b/unittest/osd_test.cc index 2cdb0100..2bd408e3 100644 --- a/unittest/osd_test.cc +++ b/unittest/osd_test.cc @@ -14,7 +14,7 @@ // limitations under the License. 
/////////////////////////////////////////////////////////////////////// -//based on https://gist.github.com/amitdo/7c7a522004dd79b398340c9595b377e1 +//based on https://gist.github.com/amitdo/7c7a522004dd79b398340c9595b377e1 // expects clones of tessdata, tessdata_fast and tessdata_best repos @@ -30,7 +30,7 @@ namespace { class TestClass : public testing::Test { protected: }; - + void OSDTester( int expected_deg, const char* imgname, const char* tessdatadir) { //log.info() << tessdatadir << " for image: " << imgname << std::endl; tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI(); @@ -55,63 +55,63 @@ class TestClass : public testing::Test { class OSDTest : public TestClass , public ::testing::WithParamInterface> {}; - + TEST_P(OSDTest, MatchOrientationDegrees) { OSDTester(std::get<0>(GetParam()), std::get<1>(GetParam()), std::get<2>(GetParam())); } - - INSTANTIATE_TEST_CASE_P( TessdataEngEuroHebrew, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataEngEuroHebrew, OSDTest, ::testing::Combine( ::testing::Values(0), ::testing::Values(TESTING_DIR "/phototest.tif", TESTING_DIR "/eurotext.tif", TESTING_DIR "/hebrew.png"), ::testing::Values(TESSDATA_DIR))); - - INSTANTIATE_TEST_CASE_P( TessdataBestEngEuroHebrew, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataBestEngEuroHebrew, OSDTest, ::testing::Combine( ::testing::Values(0), ::testing::Values(TESTING_DIR "/phototest.tif", TESTING_DIR "/eurotext.tif", TESTING_DIR "/hebrew.png"), ::testing::Values(TESSDATA_DIR "_best"))); - - INSTANTIATE_TEST_CASE_P( TessdataFastEngEuroHebrew, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataFastEngEuroHebrew, OSDTest, ::testing::Combine( ::testing::Values(0), ::testing::Values(TESTING_DIR "/phototest.tif", TESTING_DIR "/eurotext.tif", TESTING_DIR "/hebrew.png"), ::testing::Values(TESSDATA_DIR "_fast"))); - - INSTANTIATE_TEST_CASE_P( TessdataFastRotated90, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataFastRotated90, OSDTest, ::testing::Combine( ::testing::Values(90), 
::testing::Values(TESTING_DIR "/phototest-rotated-R.png"), ::testing::Values(TESSDATA_DIR "_fast"))); - - INSTANTIATE_TEST_CASE_P( TessdataFastRotated180, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataFastRotated180, OSDTest, ::testing::Combine( ::testing::Values(180), ::testing::Values(TESTING_DIR "/phototest-rotated-180.png"), ::testing::Values(TESSDATA_DIR "_fast"))); - - INSTANTIATE_TEST_CASE_P( TessdataFastRotated270, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataFastRotated270, OSDTest, ::testing::Combine( ::testing::Values(270), ::testing::Values(TESTING_DIR "/phototest-rotated-L.png"), ::testing::Values(TESSDATA_DIR "_fast"))); - - INSTANTIATE_TEST_CASE_P( TessdataFastDevaRotated270, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataFastDevaRotated270, OSDTest, ::testing::Combine( ::testing::Values(270), ::testing::Values(TESTING_DIR "/devatest-rotated-270.png"), ::testing::Values(TESSDATA_DIR "_fast"))); - - INSTANTIATE_TEST_CASE_P( TessdataFastDeva, OSDTest, + + INSTANTIATE_TEST_CASE_P( TessdataFastDeva, OSDTest, ::testing::Combine( ::testing::Values(0), ::testing::Values(TESTING_DIR "/devatest.png"), ::testing::Values(TESSDATA_DIR "_fast"))); - + } // namespace diff --git a/unlvtests/README.md b/unlvtests/README.md index 0ea53def..32687f1a 100644 --- a/unlvtests/README.md +++ b/unlvtests/README.md @@ -6,7 +6,7 @@ See http://www.expervision.com/wp-content/uploads/2012/12/1995.The_Fourth_Annual but first you have to get the tools and data used by UNLV: ### Step 1: to download the images go to -https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/ +https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/ and get doe3.3B.tar.gz, bus.3B.tar.gz, mag.3B.tar.gz and news.3B.tar.gz spn.3B.tar.gz is incorrect in this repo, so get it from code.google @@ -20,7 +20,7 @@ curl -L https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/ne curl -L 
https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/isri-ocr-evaluation-tools/spn.3B.tar.gz > spn.3B.tar.gz ``` -### Step 2: extract the files. +### Step 2: extract the files. It doesn't really matter where in your filesystem you put them, but they must go under a common root so you have directories doe3.3B, bus.3B, mag.3B and news.3B. in, for example, @@ -80,7 +80,7 @@ unlvtests/runalltests_spa.sh ~/ISRI-OCRtk 4_fast_spa ../tessdata_fast If you just want to remove all lines which have 100% recognition, you can add a 'awk' command like this: -ocrevalutf8 wordacc ground.txt ocr.txt | awk '$3 != 100 {print $0}' +ocrevalutf8 wordacc ground.txt ocr.txt | awk '$3 != 100 {print $0}' results.txt or if you've already got a results file you want to change, you can do this: @@ -90,5 +90,5 @@ awk '$3 != 100 {print $0}' results.txt newresults.txt If you only want the last sections where things are broken down by word, you can add a sed command, like this: -ocrevalutf8 wordacc ground.txt ocr.txt | sed '/^ Count Missed %Right $/,$ +ocrevalutf8 wordacc ground.txt ocr.txt | sed '/^ Count Missed %Right $/,$ !d' | awk '$3 != 100 {print $0}' results.txt