Mirror of https://github.com/tesseract-ocr/tesseract.git (synced 2024-11-23)
Fix whitespace issues
* Remove whitespace (blanks, tabs, cr) at line endings

Signed-off-by: Stefan Weil <sw@weilnetz.de>
This commit is contained in:
parent
3af2773d0e
commit
6a28cce96b
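
A whitespace-only cleanup like this is usually generated mechanically rather than edited by hand. A minimal sketch of one way to reproduce it (assuming GNU sed inside a git checkout; this exact command is not part of the commit):

```
# Strip trailing blanks, tabs and carriage returns from all tracked files.
# GNU sed; review the resulting diff before committing, since a blanket
# substitution like this would also corrupt any tracked binary files.
git ls-files -z | xargs -0 sed -i 's/[ \t\r]*$//'
```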
.github/ISSUE_TEMPLATE.md (vendored)
@ -6,7 +6,7 @@ Note that it will be much easier for us to fix the issue if a test case that
reproduces the problem is provided. Ideally this test case should not have any
external dependencies. Provide a copy of the image or link to files for the test case.

Please delete this text and fill in the template below.

------------------------
@ -9,9 +9,9 @@ If you think you found a bug in Tesseract, please create an issue.
Use the [users mailing-list](https://groups.google.com/d/forum/tesseract-ocr) instead of creating an Issue if ...
* You have problems using Tesseract and need some help.
* You have problems installing the software.
* You are not satisfied with the accuracy of the OCR, and want to ask how you can improve it. Note: You should first read the [ImproveQuality](https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) wiki page.
* You are trying to train Tesseract and you have a problem and/or want to ask a question about the training process. Note: You should first read the **official** guides [[1]](https://github.com/tesseract-ocr/tesseract/wiki) or [[2]](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract) found in the project wiki.
* You have a general question.

An issue should only be reported if the platform you are using is one of these:
* Linux (but not a version that is more than 4 years old)
@ -22,7 +22,7 @@ For older versions or other operating systems, use the Tesseract forum.

When creating an issue, please report your operating system, including its specific version: "Ubuntu 16.04", "Windows 10", "Mac OS X 10.11" etc.

Search through open and closed issues to see if a similar issue has been reported already (and sometimes also has been solved).

Similarly, before you post your question in the forum, search through past threads to see if a similar question has been asked already.
@ -32,10 +32,10 @@ Only report an issue in the latest official release. Optionally, try to check if

Make sure you are able to replicate the problem with the Tesseract command line program. For external programs that use Tesseract (including wrappers and your own program, if you are a developer), report the issue to the developers of that software if possible. You can also try to find help in the Tesseract forum.

Each version of Tesseract has its own language data you need to obtain. You **must** obtain and install trained data for English (eng) and osd. Verify that Tesseract knows about these two files (and other trained data you installed) with this command:
`tesseract --list-langs`.
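
For reference, `tesseract --list-langs` lists the traineddata files Tesseract can find. Illustrative output for a minimal install (the exact count and entries depend on what is installed):

```
List of available languages (2):
eng
osd
```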

Post example files to demonstrate the problem.
BUT don't post files with private info (about yourself or others).

When attaching a file to the issue report / forum ...
@ -46,7 +46,7 @@ Do not attach programs or libraries to your issues/posts.

For large files or for programs, add a link to a location where they can be downloaded (your site, Git repo, Google Drive, Dropbox etc.)

Attaching a multi-page TIFF image is useful only if you have a problem with multi-page functionality, otherwise attach only one or a few single page images.

Copy the error message from the console instead of sending a screenshot of it.
@ -54,7 +54,7 @@ Use the toolbar above the comment edit area to format your comment.

Add three backticks before and after a code sample or output of a command to format it (the `Insert code` button can help you do it).

If your comment includes a code sample or output of a command that exceeds ~25 lines, post it as an attached text file (`filename.txt`).

Use `Preview` before you send your issue. Read it again before sending.
@ -62,7 +62,7 @@ Note that most of the people that respond to issues and answer questions are eit

The [tesseract developers](http://groups.google.com/group/tesseract-dev/) forum should be used to discuss Tesseract development: bug fixes, enhancements, add-ons for Tesseract.

Sometimes you will not get a response to your issue or question. We apologize in advance! Please don't take it personally. There can be many reasons for this, including: time limits, no one knows the answer (at least not the ones that are available at that time) or just that
your question has been asked (and has been answered) many times before...

## For Developers: Creating a Pull Request
ChangeLog
@ -1,7 +1,7 @@
2017-03-24 - V4.00.00-alpha
* Added new neural network system based on LSTMs, with major accuracy gains.
* Improvements to PDF rendering.
* Fixes to trainingdata rendering.
* Added LSTM models+lang models to 101 languages. (tessdata repository)
* Improved multi-page TIFF handling.
* Fixed damage to binary images when processing PDFs.
@ -40,7 +40,7 @@
* Fixed some openCL issues.
* Added option to build Tesseract with CMake build system.
* Implemented CPPAN support for easy Windows building.

2016-02-17 - V3.04.01
* Added OSD renderer for psm 0. Works for single page and multi-page images.
* Improve tesstrain.sh script.
@ -84,7 +84,7 @@
text and truetype fonts.
* Added support for PDF output with searchable text.
* Removed entire IMAGE class and all code in image directory.
* Tesseract executable: support for output to stdout; limited support for one
page images from stdin (especially on Windows)
* Added Renderer to API to allow document-level processing and output
of document formats, like hOCR, PDF.
@ -169,12 +169,12 @@
* Added TessdataManager to combine data files into a single file.
* Some dead code deleted.
* VC++6 no longer supported. It can't cope with the use of templates.
* Many more languages added.
* Doxygenation of most of the function header comments.
* Added man pages.
* Added bash completion script (issue 247: thanks to neskiem)
* Fix integer overflow in thresholding (issue 366: thanks to Cyanide.Drake)
* Add Danish Fraktur support (issues 300, 360: thanks to
dsl602230@vip.cybercity.dk)
* Fix file pointer leak (issue 359, thanks to yukihiro.nakadaira)
* Fix an error using user-words (Issue 345: thanks to max.markin)
@ -183,7 +183,7 @@
* Fix an automake error (Issue 318, thanks to ichanjz)
* Fix a Win32 crash on fileFormatIsTiff() (Issues 304, 316, 317, 330, 347,
349, 352: thanks to nguyenq87, max.markin, zdenop)
* Fixed a number of errors in newer (stricter) versions of VC++ (Issues
301, among others)

2009-06-30 - V2.04
@ -26,14 +26,14 @@ So, the steps for making Tesseract are:
$ make training
$ sudo make training-install

You need to install at least the English language and OSD traineddata files to
the `TESSDATA_PREFIX` directory.

You can retrieve single files with tools like [wget](https://www.gnu.org/software/wget/), [curl](https://curl.haxx.se/), [GithubDownloader](https://github.com/intezer/GithubDownloader) or a browser.
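
For example, a single traineddata file can be fetched directly (URL pattern assumed from the tessdata repository layout; pick the file for your language and the branch matching your Tesseract version):

```
# Download only the English traineddata instead of the whole repository.
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
```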

All language data files can be retrieved from the git repository (useful only for packagers!).
(The repository is huge - more than 1.2 GB. You do NOT need to download traineddata files for
all languages).

$ git clone https://github.com/tesseract-ocr/tessdata.git tesseract-ocr.tessdata
@ -5,13 +5,13 @@ environment:
  - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2017
    vs_ver: 15 2017
    vs_platform: " Win64"

configuration:
  - Release

cache:
  - c:/Users/appveyor/.cppan/storage

# for curl
install:
  - set PATH=C:\Program Files\Git\mingw64\bin;%PATH%
@ -25,7 +25,7 @@ before_build:
  - ps: 'Add-Content $env:USERPROFILE\.cppan\cppan.yml "`n`nbuild_warning_level: 0`n"'
  - ps: 'Add-Content $env:USERPROFILE\.cppan\cppan.yml "`n`nbuild_system_verbose: false`n"'
  - ps: 'Add-Content $env:USERPROFILE\.cppan\cppan.yml "`n`nvar_check_jobs: 1`n"'

build_script:
  - mkdir build
  - mkdir build\bin
autogen.sh
@ -46,10 +46,10 @@ if [ "$1" = "clean" ]; then
    find . -iname "Makefile.in" -type f -exec rm '{}' +
fi

# Prevent any errors that might result from failing to properly invoke
# `libtoolize` or `glibtoolize`, whichever is present on your system,
# from occurring by testing for its existence and capturing the absolute path to
# its location for caching purposes prior to using it later on in 'Step 2:'
if command -v libtoolize >/dev/null 2>&1; then
    LIBTOOLIZE="$(command -v libtoolize)"
elif command -v glibtoolize >/dev/null 2>&1; then
@ -67,13 +67,13 @@ fi
bail_out()
{
    echo
    echo " Something went wrong, bailing out!"
    echo
    exit 1
}

# --- Step 1: Generate aclocal.m4 from:
#  . acinclude.m4
#  . config/*.m4 (these files are referenced in acinclude.m4)

mkdir -p config
@ -8,7 +8,7 @@ use Getopt::Std;

=pod

=head1 NAME

genwordlists.pl - generate word lists for Tesseract
@ -33,7 +33,7 @@ use:
pfx=$(echo $i|tr '/' '_'); cat $i | \
perl genwordlists.pl -d OUTDIR -p $pfx; done

This will create a set of output files to match each of the files
WikiExtractor created.

To combine these files:
@ -1,6 +1,6 @@
#-*- mode: shell-script;-*-
#
# bash completion support for tesseract
#
# Copyright (C) 2009 Neskie A. Manuel <neskiem@gmail.com>
# Distributed under the Apache License, Version 2.0.
@ -20,19 +20,19 @@ _tesseract()
    COMPREPLY=()
    cur="$2"
    prev="$3"

    case "$prev" in
    tesseract)
        COMPREPLY=($(compgen -f -X "!*.+(tif)" -- "$cur") )
        ;;
    *.tif)
        COMPREPLY=($(compgen -W "$(basename $prev .tif)" ) )
        ;;
    -l)
        _tesseract_languages
        ;;
    *)
        COMPREPLY=($(compgen -W "-l" ) )
        ;;
    esac
}
@ -17,7 +17,7 @@ man_MANS = \
    text2image.1 \
    unicharambigs.5 \
    unicharset_extractor.1 \
    wordlist2dawg.1

if !DISABLED_LEGACY_ENGINE
man_MANS += \
@ -11,9 +11,9 @@ SYNOPSIS

DESCRIPTION
-----------
classifier_tester(1) runs Tesseract in a special mode.
It takes a list of .tr files and tests a character classifier
on data as formatted for training,
but it doesn't have to be the same as the training data.

IN/OUT ARGUMENTS
@ -25,11 +25,11 @@ OPTIONS
-------
-l 'lang'::
(Input) three character language code; default value 'eng'.

-classifier 'x'::
(Input) One of "pruner", "full".

-U 'unicharset'::
(Input) The unicharset for the language.
@ -42,7 +42,7 @@ OPTIONS
(Input) x heights file, each line is of the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt on 300 dpi. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]

*font_name* *xheight*

-output_trainer 'trainer'::
(Output, Optional) Filename for output trainer.
@ -8,54 +8,54 @@ combine_lang_model - generate starter traineddata

SYNOPSIS
--------
*combine_lang_model* --input_unicharset 'filename' --script_dir 'dirname' --output_dir 'rootdir' --lang 'lang' [--lang_is_rtl] [pass_through_recoder] [--words file --puncs file --numbers file]

DESCRIPTION
-----------
combine_lang_model(1) generates a starter traineddata file that can be used to train an LSTM-based neural network model. It takes as input a unicharset and an optional set of wordlists. It eliminates the need to run set_unicharset_properties(1), wordlist2dawg(1), some non-existent binary to generate the recoder (unicode compressor), and finally combine_tessdata(1).

OPTIONS
-------
'-l lang'::
The language to use.
Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)

'--script_dir PATH'::
Directory name for input script unicharsets. It should point to the location of the langdata (github repo) directory. (type:string default:)

'--input_unicharset FILE'::
Unicharset to complete and use in encoding. It can be a hand-created file with incomplete fields. Its basic and script properties will be set before it is used. (type:string default:)

'--lang_is_rtl BOOL'::
True if the language being processed is written right-to-left (eg Arabic/Hebrew). (type:bool default:false)

'--pass_through_recoder BOOL'::
If true, the recoder is a simple pass-through of the unicharset. Otherwise, potentially a compression of it by encoding Hangul in Jamos, decomposing multi-unicode symbols into sequences of unicodes, and encoding Han using the data in the radical_table_data, which must be the content of the file: langdata/radical-stroke.txt. (type:bool default:false)

'--version_str STRING'::
An arbitrary version label to add to the traineddata file (type:string default:)

'--words FILE'::
(Optional) File listing words to use for the system dictionary (type:string default:)

'--numbers FILE'::
(Optional) File listing number patterns (type:string default:)

'--puncs FILE'::
(Optional) File listing punctuation patterns. The words/puncs/numbers lists may be all empty. If any are non-empty then puncs must be non-empty. (type:string default:)

'--output_dir PATH'::
Root directory for output files. Output files will be written to <output_dir>/<lang>/<lang>.* (type:string default:)
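
A representative invocation assembled from the synopsis above (all paths here are hypothetical):

```
combine_lang_model \
  --input_unicharset train_output/eng/eng.unicharset \
  --script_dir langdata \
  --output_dir train_output \
  --lang eng
```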

HISTORY
-------
combine_lang_model(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)
@ -4,7 +4,7 @@ LSTMEVAL(1)

NAME
----
lstmeval - Evaluation program for LSTM-based networks.

SYNOPSIS
--------
@ -12,34 +12,34 @@ SYNOPSIS

DESCRIPTION
-----------
lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as input for evaluation along with a list of lstmf files. If evaluating a training checkpoint, '--traineddata' should also be specified.

OPTIONS
-------
'--model FILE'::
Name of model file (training or recognition) (type:string default:)

'--traineddata FILE'::
If model is a training checkpoint, then traineddata must be the traineddata file that was given to the trainer (type:string default:)

'--eval_listfile FILE'::
File listing sample files in lstmf training format. (type:string default:)

'--max_image_MB INT'::
Max memory to use for images. (type:int default:2000)

'--verbosity INT'::
Amount of diagnostic information to output (0-2). (type:int default:1)
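
For example, evaluating a training checkpoint against a held-out sample list (paths hypothetical; '--traineddata' is given because the model is a checkpoint):

```
lstmeval \
  --model train_output/eng_checkpoint \
  --traineddata train_output/eng/eng.traineddata \
  --eval_listfile eng.eval_files.txt
```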

HISTORY
-------
lstmeval(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)
@ -8,19 +8,19 @@ lstmtraining - Training program for LSTM-based networks.

SYNOPSIS
--------
*lstmtraining*
  --continue_from 'train_output_dir/continue_from_lang.lstm'
  --old_traineddata 'bestdata_dir/continue_from_lang.traineddata'
  --traineddata 'train_output_dir/lang/lang.traineddata'
  --max_iterations 'NNN'
  --debug_interval '0|-1'
  --train_listfile 'train_output_dir/lang.training_files.txt'
  --model_output 'train_output_dir/newlstmmodel'

DESCRIPTION
-----------
lstmtraining(1) trains LSTM-based networks using a list of lstmf files and a starter traineddata file as the main input. Training from scratch is not recommended for users. Fine-tuning (example command shown in the synopsis above) or replacing a layer can be used instead. Different options apply to different types of training. Read the [Training Wiki page](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) for details.

OPTIONS
-------

@ -95,13 +95,13 @@ OPTIONS

HISTORY
-------
lstmtraining(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)
@ -13,23 +13,23 @@ SYNOPSIS
DESCRIPTION
-----------
merge_unicharsets(1) is a simple tool to merge two or more unicharsets.
It could be used to create a combined unicharset for a script-level engine,
like the new Latin or Devanagari.

IN/OUT ARGUMENTS
----------------
'unicharset-in-1'::
(Input) The name of the first unicharset file to be merged.

'unicharset-in-n'::
(Input) The name of the nth unicharset file to be merged.

'unicharset-out'::
(Output) The name of the merged unicharset file.
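
Usage follows directly from the argument list above: input unicharsets first, merged output last. For instance (file names hypothetical):

```
# Merge two language unicharsets into one script-level unicharset.
merge_unicharsets eng.unicharset deu.unicharset Latin.unicharset
```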

HISTORY
-------
merge_unicharsets(1) was first made available for tesseract4.00.00alpha.

RESOURCES
---------
@ -19,22 +19,22 @@ OPTIONS

'--script_dir /path/to/langdata'::
(Input) Specify the location of the directory for universal script unicharsets and font xheights (type:string default:)

'--U unicharsetfile'::
(Input) Specify the location of the unicharset to load as input.

'--O unicharsetfile'::
(Output) Specify the location of the unicharset to be written with updated properties.
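
A typical call reads one unicharset, fills in its properties, and writes the result (paths hypothetical):

```
set_unicharset_properties \
  --U eng.unicharset \
  --O eng.proper.unicharset \
  --script_dir langdata
```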

HISTORY
-------
set_unicharset_properties(1) was first made available for tesseract version 3.03.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>

SEE ALSO
--------
tesseract(1)
@ -246,48 +246,48 @@ SCRIPTS
-------

The traineddata files for the following scripts for tesseract 4.0
are also in https://github.com/tesseract-ocr/tessdata_fast.

In most cases, each of these contains all the languages that use that script PLUS English.
So it is possible to recognize a language that has not been specifically trained for
by using traineddata for the script it is written in.
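
For example, a page in any Latin-script language can be recognized with the combined script model (image name hypothetical; the 'script/Latin' model name matches the test data used elsewhere in this commit):

```
tesseract image.png output -l script/Latin
```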

Arabic,
Armenian,
Bengali,
Canadian Aboriginal,
Cherokee,
Cyrillic,
Devanagari,
Ethiopic,
Fraktur,
Georgian,
Greek,
Gujarati,
Gurmukhi,
Han - Simplified,
Han - Simplified (vertical),
Han - Traditional,
Han - Traditional (vertical),
Hangul,
Hangul (vertical),
Hebrew,
Japanese,
Japanese (vertical),
Kannada,
Khmer,
Lao,
Latin,
Malayalam,
Myanmar,
Oriya (Odia),
Sinhala,
Syriac,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Vietnamese.
@ -4,16 +4,16 @@ TEXT2IMAGE(1)

NAME
----
text2image - generate OCR training pages.

SYNOPSIS
--------
*text2image* --text 'FILE' --outputbase 'PATH' --fonts_dir 'PATH' [OPTION]

DESCRIPTION
-----------
text2image(1) generates OCR training pages. Given a text file it outputs an image with a given font and degradation.

OPTIONS
-------
'--text FILE'::
@ -27,22 +27,22 @@ OPTIONS

'--fonts_dir PATH'::
If empty it uses the system default. Otherwise it overrides the system default font location (type:string default:)

'--font FONTNAME'::
Font description name to use (type:string default:Arial)

'--writing_mode MODE'::
Specify one of the following writing modes.
'horizontal' : Render regular horizontal text. (default)
'vertical' : Render vertical text. Glyph orientation is selected by Pango.
'vertical-upright' : Render vertical text. Glyph orientation is set to be upright. (type:string default:horizontal)

'--tlog_level INT'::
Minimum logging level for tlog() output (type:int default:0)

'--max_pages INT'::
Maximum number of pages to output (0=unlimited) (type:int default:0)

'--degrade_image BOOL'::
Degrade rendered image with speckle noise, dilation/erosion and rotation (type:bool default:true)
@ -54,7 +54,7 @@ OPTIONS

'--ligatures BOOL'::
Rebuild and render ligatures (type:bool default:false)

'--exposure INT'::
Exposure level in photocopier (type:int default:0)
@ -93,7 +93,7 @@ OPTIONS

'--output_word_boxes BOOL'::
Output word bounding boxes instead of character boxes. This is used for Cube training, and implied by --render_ngrams. (type:bool default:false)

'--unicharset_file FILE'::
File with characters in the unicharset. If --render_ngrams is true and --unicharset_file is specified, ngrams with characters that are not in unicharset will be omitted (type:string default:)
@ -114,7 +114,7 @@ Use these flags to output zero-padded, square individual character images

'--glyph_num_border_pixels_to_pad INT'::
Final_size=glyph_resized_size+2*glyph_num_border_pixels_to_pad (type:int default:0)

Use these flags to find fonts that can render a given text
----------------------------------------------------------
@ -126,7 +126,7 @@ Use these flags to find fonts that can render a given text

'--min_coverage DOUBLE'::
If find_fonts==true, the minimum coverage the font has of the characters in the text file to include it, between 0 and 1. (type:double default:1)

Example Usage:
```
text2image --find_fonts \
@ -136,7 +136,7 @@ text2image --find_fonts \
--render_per_font \
--outputbase ../langdata/hin/hin \
|& grep raw | sed -e 's/ :.*/" \\/g' | sed -e 's/^/ "/' >../langdata/hin/fontslist.txt
```

SINGLE OPTIONS
--------------
@ -146,13 +146,13 @@ SINGLE OPTIONS

HISTORY
-------
text2image(1) was first made available for tesseract 3.03.

RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
Information on training tesseract LSTM: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>

SEE ALSO
--------
tesseract(1)
@ -1,5 +1,5 @@
// Copyright 2007 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License"); You may not
// use this file except in compliance with the License. You may obtain a copy of
// the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by
@ -15,7 +15,7 @@ import com.google.scrollview.ui.SVWindow;
/**
 * The SVEvent is a structure which holds the actual values of a message to be
 * transmitted. It corresponds to the client structure defined in scrollview.h
 *
 * @author wanke@google.com
 */
public class SVEvent {
@ -30,7 +30,7 @@ public class SVEvent {

  /**
   * A "normal" SVEvent.
   *
   * @param t The type of the event as specified in SVEventType (e.g.
   *          SVET_CLICK)
   * @param w The window the event corresponds to
@ -49,12 +49,12 @@ public class SVEvent {
    xSize = x2;
    ySize = y2;
    commandId = 0;
    parameter = p;
  }

  /**
   * An event which issues a command (like clicking on an item in the menubar).
   *
   * @param eventtype The type of the event as specified in SVEventType
   *                  (usually SVET_MENU or SVET_POPUP)
   * @param svWindow The window the event corresponds to
@ -1,5 +1,5 @@
// Copyright 2007 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License"); You may not
// use this file except in compliance with the License. You may obtain a copy of
// the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by
@ -14,7 +14,7 @@ package com.google.scrollview.events;
 * These are the defined events which can happen in ScrollView and be
 * transferred to the client. They are the same events as on the client side part of
 * ScrollView (defined in ScrollView.h).
 *
 * @author wanke@google.com
 */
public enum SVEventType {
@ -24,7 +24,7 @@ AM_CPPFLAGS += -fvisibility=hidden -fvisibility-inlines-hidden
endif

pkginclude_HEADERS = apitypes.h baseapi.h capi.h renderer.h tess_version.h
lib_LTLIBRARIES =

noinst_LTLIBRARIES = libtesseract_api.la

@ -56,7 +56,7 @@ libtesseract_la_LIBADD = \
    ../cutil/libtesseract_cutil.la \
    ../viewer/libtesseract_viewer.la \
    ../ccutil/libtesseract_ccutil.la \
    ../opencl/libtesseract_opencl.la

libtesseract_la_LDFLAGS += -version-info $(GENERIC_LIBRARY_VERSION) -no-undefined
@ -4,7 +4,7 @@ AM_CPPFLAGS += \
    -I$(top_srcdir)/src/viewer \
    -I$(top_srcdir)/src/opencl
AM_CPPFLAGS += $(OPENCL_CPPFLAGS)

if VISIBILITY
AM_CPPFLAGS += -DTESS_EXPORTS \
    -fvisibility=hidden -fvisibility-inlines-hidden
@ -373,8 +373,8 @@ inline bool LoadDataFromFile(const char* filename, GenericVector<char>* data) {
    fseek(fp, 0, SEEK_SET);
    // Trying to open a directory on Linux sets size to LONG_MAX. Catch it here.
    if (size > 0 && size < LONG_MAX) {
      // reserve an extra byte in case caller wants to append a '\0' character
      data->reserve(size + 1);
      data->resize_no_init(size);
      result = static_cast<long>(fread(&(*data)[0], 1, size, fp)) == size;
    }
@ -4,7 +4,7 @@ AM_CPPFLAGS += \
    -I$(top_srcdir)/src/ccstruct \
    -I$(top_srcdir)/src/dict \
    -I$(top_srcdir)/src/viewer

if DISABLED_LEGACY_ENGINE
AM_CPPFLAGS += -DDISABLED_LEGACY_ENGINE
endif
@ -25,7 +25,7 @@
namespace tesseract {

Classify::Classify()
    :
      INT_MEMBER(classify_debug_level, 0, "Classify debug level",
                 this->params()),
@ -123,7 +123,7 @@ void InitFeatureDefs(FEATURE_DEFS_STRUCT *featuredefs) {
 *
 * @param CharDesc character description to be deallocated
 *
 * Globals:
 * - none
 */
void FreeCharDescription(CHAR_DESC CharDesc) {

@ -140,7 +140,7 @@ void FreeCharDescription(CHAR_DESC CharDesc) {
 * Allocate a new character description, initialize its
 * feature sets to be empty, and return it.
 *
 * Globals:
 * - none
 *
 * @return New character description structure.

@ -226,9 +226,9 @@ bool ValidCharDescription(const FEATURE_DEFS_STRUCT &FeatureDefs,
 ...
 @endverbatim
 *
 * Globals:
 * - none
 *
 * @param FeatureDefs definitions of feature types/extractors
 * @param File open text file to read character description from
 * @return Character description read from File.
@ -36,7 +36,7 @@ namespace tesseract {
 * For each class in the unicharset, clears the corresponding
 * entry in char_norm_array. char_norm_array is indexed by unichar_id.
 *
 * Globals:
 * - none
 *
 * @param char_norm_array array to be cleared

@ -47,13 +47,13 @@ void Classify::ClearCharNormArray(uint8_t* char_norm_array) {

/*---------------------------------------------------------------------------*/
/**
 * For each class in unicharset, computes the match between
 * norm_feature and the normalization protos for that class.
 * Converts this number to the range from 0 - 255 and stores it
 * into char_norm_array. CharNormArray is indexed by unichar_id.
 *
 * Globals:
 * - PreTrainedTemplates current set of built-in templates
 *
 * @param norm_feature character normalization feature

@ -81,7 +81,7 @@ void Classify::ComputeIntCharNormArray(const FEATURE_STRUCT& norm_feature,
 * in Features into integer format and saves it into
 * IntFeatures.
 *
 * Globals:
 * - none
 *
 * @param Features floating point pico-features to be converted
@ -54,7 +54,7 @@ STRING_VAR(classify_training_file, "MicroFeatures", "Training file");
 *
 * Add a new config to this class. Malloc new space and copy the
 * old configs if necessary. Return the config id for the new config.
 *
 * @param Class The class to add to
 */
int AddConfigToClass(CLASS_TYPE Class) {

@ -90,7 +90,7 @@ int AddConfigToClass(CLASS_TYPE Class) {
 *
 * Add a new proto to this class. Malloc new space and copy the
 * old protos if necessary. Return the proto id for the new proto.
 *
 * @param Class The class to add to
 */
int AddProtoToClass(CLASS_TYPE Class) {

@ -132,7 +132,7 @@ int AddProtoToClass(CLASS_TYPE Class) {
 * @name ClassConfigLength
 *
 * Return the length of all the protos in this class.
 *
 * @param Class The class to add to
 * @param Config FIXME
 */

@ -154,7 +154,7 @@ float ClassConfigLength(CLASS_TYPE Class, BIT_VECTOR Config) {
 * @name ClassProtoLength
 *
 * Return the length of all the protos in this class.
 *
 * @param Class The class to use
 */
float ClassProtoLength(CLASS_TYPE Class) {

@ -172,7 +172,7 @@ float ClassProtoLength(CLASS_TYPE Class) {
 * @name CopyProto
 *
 * Copy the first proto into the second.
 *
 * @param Src Source
 * @param Dest Destination
 */
@ -34,7 +34,7 @@
 * This routine uses realloc to increase the size of
 * the specified bit vector.
 *
 * Globals:
 * - none
 *
 * @param Vector bit vector to be expanded
@ -42,7 +42,7 @@ void *Erealloc(void *ptr, int size) {
  return Buffer;
}

void Efree(void *ptr) {
  ASSERT_HOST(ptr != nullptr);
  free(ptr);
}
@ -3,7 +3,7 @@ AM_CPPFLAGS += \
    -I$(top_srcdir)/src/ccutil \
    -I$(top_srcdir)/src/ccstruct \
    -I$(top_srcdir)/src/viewer

if VISIBILITY
AM_CPPFLAGS += -DTESS_EXPORTS \
    -fvisibility=hidden -fvisibility-inlines-hidden
@ -186,7 +186,7 @@ void LSTMRecognizer::RecognizeLine(const ImageData& image_data, bool invert,
  search_->Decode(outputs, kDictRatio, kCertOffset, worst_dict_cert,
                  &GetUnicharset(), glyph_confidences);
  search_->ExtractBestPathAsWords(line_box, scale_factor, debug,
                                  &GetUnicharset(), words,
                                  glyph_confidences);
}
@ -184,7 +184,7 @@ class LSTMRecognizer {
  // will be used in a dictionary word.
  void RecognizeLine(const ImageData& image_data, bool invert, bool debug,
                     double worst_dict_cert, const TBOX& line_box,
                     PointerVector<WERD_RES>* words,
                     bool glyph_confidences = false);

  // Helper computes min and mean best results in the output.
@ -82,7 +82,7 @@ void RecodeBeamSearch::Decode(const NetworkIO& output, double dict_ratio,
                              const UNICHARSET* charset, bool glyph_confidence) {
  beam_size_ = 0;
  int width = output.Width();
  if (glyph_confidence)
    timesteps.clear();
  for (int t = 0; t < width; ++t) {
    ComputeTopN(output.f(t), output.NumFeatures(), kBeamWidths[0]);

@ -128,7 +128,7 @@ void RecodeBeamSearch::SaveMostCertainGlyphs(const float* outputs,
        pos++;
      }
      glyphs.insert(glyphs.begin() + pos,
                    std::pair<const char*, float>(charakter, outputs[i]));
    }
  }
  timesteps.push_back(glyphs);

@ -515,7 +515,7 @@ void RecodeBeamSearch::ContinueContext(const RecodeNode* prev, int index,
      if (previous != nullptr) {
        prefix.Set(p, previous->code);
        full_code.Set(p, previous->code);
      }
    }
  if (prev != nullptr && !is_simple_text_) {
    if (top_n_flags_[prev->code] == top_n_flag) {
@ -208,7 +208,7 @@ class RecodeBeamSearch {

  // Generates debug output of the content of the beams after a Decode.
  void DebugBeams(const UNICHARSET& unicharset) const;

  std::vector< std::vector<std::pair<const char*, float>>> timesteps;
  // Clipping value for certainty inside Tesseract. Reflects the minimum value
  // of certainty that will be returned by ExtractBestPathAsUnicharIds.
@ -11,7 +11,7 @@ AM_CPPFLAGS += \
    -I$(top_srcdir)/src/opencl

AM_CPPFLAGS += $(OPENCL_CPPFLAGS)

if VISIBILITY
AM_CPPFLAGS += -DTESS_EXPORTS \
    -fvisibility=hidden -fvisibility-inlines-hidden
@ -1343,7 +1343,7 @@ bool ColPartition::HasGoodBaseline() {
    width = last_pt.x() - first_pt.x();
  }
  // Maximum median error allowed to be a good text line.
  if (height_count == 0)
    return false;
  double max_error = kMaxBaselineError * total_height / height_count;
  ICOORD start_pt, end_pt;
@ -54,7 +54,7 @@ const double kMaxRowSize = 2.5;
// Number of filled columns required to form a strong table row.
// For small tables, this is an absolute number.
const double kGoodRowNumberOfColumnsSmall[] = { 2, 2, 2, 2, 2, 3, 3 };
const int kGoodRowNumberOfColumnsSmallSize =
    sizeof(kGoodRowNumberOfColumnsSmall) / sizeof(double) - 1;
// For large tables, it is a relative number
const double kGoodRowNumberOfColumnsLarge = 0.7;
@ -20,8 +20,8 @@ if DISABLED_LEGACY_ENGINE
AM_CPPFLAGS += -DDISABLED_LEGACY_ENGINE
endif

# TODO: training programs can not be linked to shared library created
# with -fvisibility
if VISIBILITY
AM_LDFLAGS += -all-static
endif

@ -57,9 +57,9 @@ endif
noinst_LTLIBRARIES = libtesseract_training.la libtesseract_tessopt.la

libtesseract_training_la_LIBADD = \
    ../cutil/libtesseract_cutil.la
# ../api/libtesseract.la

libtesseract_training_la_SOURCES = \
    boxchar.cpp \
    commandlineflags.cpp \

@ -275,5 +275,5 @@ lstmeval_LDADD += $(LEPTONICA_LIBS)
lstmtraining_LDADD += $(LEPTONICA_LIBS)
set_unicharset_properties_LDADD += $(LEPTONICA_LIBS)
text2image_LDADD += $(LEPTONICA_LIBS)
unicharset_extractor_LDADD += $(LEPTONICA_LIBS)
wordlist2dawg_LDADD += $(LEPTONICA_LIBS)
@ -27,18 +27,18 @@ LANGUAGE LANG_ENGLISH, SUBLANG_ENGLISH_US
// TEXTINCLUDE
//

1 TEXTINCLUDE
BEGIN
    "resource.h\0"
END

2 TEXTINCLUDE
BEGIN
    "#include ""afxres.h""\r\n"
    "\0"
END

3 TEXTINCLUDE
BEGIN
    "\r\n"
    "\0"
@ -61,5 +61,5 @@ libtesseract_wordrec_la_SOURCES += \
    plotedges.cpp \
    render.cpp \
    segsearch.cpp \
    wordclass.cpp
endif
@ -27,7 +27,7 @@ AM_CPPFLAGS += -I$(top_srcdir)/src/wordrec
# Build googletest:
check_LTLIBRARIES = libgtest.la libgtest_main.la libgmock.la libgmock_main.la
libgtest_la_SOURCES = ../googletest/googletest/src/gtest-all.cc
libgtest_la_CPPFLAGS = -I$(top_srcdir)/googletest/googletest/include -I$(top_srcdir)/googletest/googletest -pthread
libgtest_main_la_SOURCES = ../googletest/googletest/src/gtest_main.cc
## libgtest_main_la_LIBADD = libgtest.la

@ -57,7 +57,7 @@ check_PROGRAMS = \
    matrix_test \
    osd_test \
    loadlang_test \
    tesseracttests

TESTS = $(check_PROGRAMS)
@ -63,7 +63,7 @@ class QuickTest : public testing::Test {
class MatchGroundTruth : public QuickTest,
                         public ::testing::WithParamInterface<const char*> {
};

TEST_P(MatchGroundTruth, FastPhototestOCR) {
  OCRTester(TESTING_DIR "/phototest.tif",
            TESTING_DIR "/phototest.txt",

@ -75,33 +75,33 @@ class QuickTest : public testing::Test {
            TESTING_DIR "/phototest.txt",
            TESSDATA_DIR "_best", GetParam());
}

TEST_P(MatchGroundTruth, TessPhototestOCR) {
  OCRTester(TESTING_DIR "/phototest.tif",
            TESTING_DIR "/phototest.txt",
            TESSDATA_DIR, GetParam());
}

INSTANTIATE_TEST_CASE_P(Eng, MatchGroundTruth,
                        ::testing::Values("eng"));
INSTANTIATE_TEST_CASE_P(Latin, MatchGroundTruth,
                        ::testing::Values("script/Latin"));
INSTANTIATE_TEST_CASE_P(Deva, MatchGroundTruth,
                        ::testing::Values("script/Devanagari"));
INSTANTIATE_TEST_CASE_P(Arab, MatchGroundTruth,
                        ::testing::Values("script/Arabic"));

class EuroText : public QuickTest {
};

TEST_F(EuroText, FastLatinOCR) {
  OCRTester(TESTING_DIR "/eurotext.tif",
            TESTING_DIR "/eurotext.txt",
            TESSDATA_DIR "_fast", "script/Latin");
}

// script/Latin for eurotext.tif does not match groundtruth
// for tessdata & tessdata_best
// so do not test these here.

}  // namespace
@ -37,13 +37,13 @@ class QuickTest : public testing::Test {
    ASSERT_FALSE(api->Init(tessdatadir, lang)) << "Could not initialize tesseract for $lang.";
    api->End();
  }

// For all languages

class LoadLanguage : public QuickTest,
                     public ::testing::WithParamInterface<const char*> {
};

TEST_P(LoadLanguage, afr) {LangLoader("afr", GetParam());}
TEST_P(LoadLanguage, amh) {LangLoader("amh", GetParam());}
TEST_P(LoadLanguage, ara) {LangLoader("ara", GetParam());}

@ -169,18 +169,18 @@ class QuickTest : public testing::Test {
TEST_P(LoadLanguage, yid) {LangLoader("yid", GetParam());}
TEST_P(LoadLanguage, yor) {LangLoader("yor", GetParam());}

INSTANTIATE_TEST_CASE_P(Tessdata_fast, LoadLanguage,
                        ::testing::Values(TESSDATA_DIR "_fast"));
INSTANTIATE_TEST_CASE_P(Tessdata_best, LoadLanguage,
                        ::testing::Values(TESSDATA_DIR "_best"));
INSTANTIATE_TEST_CASE_P(Tessdata, LoadLanguage,
                        ::testing::Values(TESSDATA_DIR));

// For all scripts

class LoadScript : public QuickTest,
                   public ::testing::WithParamInterface<const char*> {
};

TEST_P(LoadScript, Arabic) {LangLoader("script/Arabic", GetParam());}
TEST_P(LoadScript, Armenian) {LangLoader("script/Armenian", GetParam());}

@ -219,19 +219,19 @@ class QuickTest : public testing::Test {
TEST_P(LoadScript, Thai) {LangLoader("script/Thai", GetParam());}
TEST_P(LoadScript, Tibetan) {LangLoader("script/Tibetan", GetParam());}
TEST_P(LoadScript, Vietnamese) {LangLoader("script/Vietnamese", GetParam());}

INSTANTIATE_TEST_CASE_P(Tessdata_fast, LoadScript,
                        ::testing::Values(TESSDATA_DIR "_fast"));
INSTANTIATE_TEST_CASE_P(Tessdata_best, LoadScript,
                        ::testing::Values(TESSDATA_DIR "_best"));
INSTANTIATE_TEST_CASE_P(Tessdata, LoadScript,
                        ::testing::Values(TESSDATA_DIR));

// Use class LoadLang for languages which are NOT there in all three repos

class LoadLang : public QuickTest {
};

TEST_F(LoadLang, kmrFast) {LangLoader("kmr", TESSDATA_DIR "_fast");}
TEST_F(LoadLang, kmrBest) {LangLoader("kmr", TESSDATA_DIR "_best");}
// TEST_F(LoadLang, kmrBestInt) {LangLoader("kmr", TESSDATA_DIR);}
@ -14,7 +14,7 @@
// limitations under the License.
///////////////////////////////////////////////////////////////////////

// based on https://gist.github.com/amitdo/7c7a522004dd79b398340c9595b377e1

// expects clones of tessdata, tessdata_fast and tessdata_best repos

@ -30,7 +30,7 @@ namespace {
class TestClass : public testing::Test {
 protected:
};

void OSDTester(int expected_deg, const char* imgname, const char* tessdatadir) {
  // log.info() << tessdatadir << " for image: " << imgname << std::endl;
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

@ -55,63 +55,63 @@ class TestClass : public testing::Test {

class OSDTest : public TestClass,
                public ::testing::WithParamInterface<std::tuple<int, const char*, const char*>> {};

TEST_P(OSDTest, MatchOrientationDegrees) {
  OSDTester(std::get<0>(GetParam()), std::get<1>(GetParam()), std::get<2>(GetParam()));
}

INSTANTIATE_TEST_CASE_P(TessdataEngEuroHebrew, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(0),
                            ::testing::Values(TESTING_DIR "/phototest.tif",
                                              TESTING_DIR "/eurotext.tif",
                                              TESTING_DIR "/hebrew.png"),
                            ::testing::Values(TESSDATA_DIR)));

INSTANTIATE_TEST_CASE_P(TessdataBestEngEuroHebrew, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(0),
                            ::testing::Values(TESTING_DIR "/phototest.tif",
                                              TESTING_DIR "/eurotext.tif",
                                              TESTING_DIR "/hebrew.png"),
                            ::testing::Values(TESSDATA_DIR "_best")));

INSTANTIATE_TEST_CASE_P(TessdataFastEngEuroHebrew, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(0),
                            ::testing::Values(TESTING_DIR "/phototest.tif",
                                              TESTING_DIR "/eurotext.tif",
                                              TESTING_DIR "/hebrew.png"),
                            ::testing::Values(TESSDATA_DIR "_fast")));

INSTANTIATE_TEST_CASE_P(TessdataFastRotated90, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(90),
                            ::testing::Values(TESTING_DIR "/phototest-rotated-R.png"),
                            ::testing::Values(TESSDATA_DIR "_fast")));

INSTANTIATE_TEST_CASE_P(TessdataFastRotated180, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(180),
                            ::testing::Values(TESTING_DIR "/phototest-rotated-180.png"),
                            ::testing::Values(TESSDATA_DIR "_fast")));

INSTANTIATE_TEST_CASE_P(TessdataFastRotated270, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(270),
                            ::testing::Values(TESTING_DIR "/phototest-rotated-L.png"),
                            ::testing::Values(TESSDATA_DIR "_fast")));

INSTANTIATE_TEST_CASE_P(TessdataFastDevaRotated270, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(270),
                            ::testing::Values(TESTING_DIR "/devatest-rotated-270.png"),
                            ::testing::Values(TESSDATA_DIR "_fast")));

INSTANTIATE_TEST_CASE_P(TessdataFastDeva, OSDTest,
                        ::testing::Combine(
                            ::testing::Values(0),
                            ::testing::Values(TESTING_DIR "/devatest.png"),
                            ::testing::Values(TESSDATA_DIR "_fast")));

}  // namespace
|
@ -6,7 +6,7 @@ See http://www.expervision.com/wp-content/uploads/2012/12/1995.The_Fourth_Annual
|
||||
but first you have to get the tools and data used by UNLV:
|
||||
|
||||
### Step 1: to download the images go to
|
||||
https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/
|
||||
https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/
|
||||
and get doe3.3B.tar.gz, bus.3B.tar.gz, mag.3B.tar.gz and news.3B.tar.gz
|
||||
spn.3B.tar.gz is incorrect in this repo, so get it from code.google
|
||||
|
||||
@ -20,7 +20,7 @@ curl -L https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/ne
curl -L https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/isri-ocr-evaluation-tools/spn.3B.tar.gz > spn.3B.tar.gz
```

### Step 2: extract the files.
It doesn't really matter where
in your filesystem you put them, but they must go under a common
root so you have directories doe3.3B, bus.3B, mag.3B and news.3B in, for example,
@ -80,7 +80,7 @@ unlvtests/runalltests_spa.sh ~/ISRI-OCRtk 4_fast_spa ../tessdata_fast
If you just want to remove all lines which have 100% recognition,
you can add an 'awk' command like this:

ocrevalutf8 wordacc ground.txt ocr.txt | awk '$3 != 100 {print $0}' > results.txt

or if you've already got a results file you want to change, you can do this:
@ -90,5 +90,5 @@ awk '$3 != 100 {print $0}' results.txt > newresults.txt
If you only want the last sections where things are broken down by
word, you can add a sed command, like this:

ocrevalutf8 wordacc ground.txt ocr.txt | sed '/^ Count Missed %Right $/,$
!d' | awk '$3 != 100 {print $0}' > results.txt