* 'master' of https://github.com/tesseract-ocr/tesseract: (27 commits)
  Rework check for readable input file
  fix "mktemp -d --tmpdir" on Mac OS; see #1453
  pgedit: Change some variables from global to local ones
  improve description of min_characters_to_try variable
  WERD_RES: Remove comparisons which are constant
  GENERIC_2D_ARRAY: Pass parameters by reference
  genericvector: Pass parameters by reference
  chop: Use more efficient float calculations for sqrt
  rect: Use more efficient float calculations for ceil, floor
  intproto: Use more efficient float calculations for floor
  genericvector: Rewrite code to satisfy static code analyzer
  Fix constructor for class Dict (uninitialized member variables)
  Fix use of wrong UNICHARSET
  lstmtraining: Remove dead code for purified model name
  combine_tessdata: Handle failures when extracting
  lstmtraining: Check write permission for output model
  implement parameter min_characters_to_try for minimum characters to try to skip page entirely. fixes #1729
  Merge and enhance documentation on language and script models
  Document some more config options for tesseract
  Add Makefile rule to build HTML manpages
  ...
Zdenko Podobný 2018-10-07 15:39:02 +02:00
commit 8598731daf
29 changed files with 253 additions and 146 deletions


@ -12,6 +12,12 @@
## About
This package contains an **OCR engine** - `libtesseract` and a **command line program** - `tesseract`.
Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
on line recognition, but also still supports the legacy Tesseract OCR engine of
Tesseract 3 which works by recognizing character patterns. Compatibility with
Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0).
It also needs traineddata files which support the legacy engine, for example
those from the tessdata repository.
The lead developer is Ray Smith. The maintainer is Zdenko Podobny.
For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS)


@ -1 +1 @@
4.0.0-beta.4
4.0.0-rc1


@ -6,7 +6,7 @@ asciidoc=asciidoc -d manpage
man_MANS = \
combine_lang_model.1 \
combine_lang_model.1 \
combine_tessdata.1 \
dawg2wordlist.1 \
lstmeval.1 \
@ -31,9 +31,16 @@ endif
EXTRA_DIST = $(man_MANS) Doxyfile
.PHONY: html
html: $(patsubst %,%.html,$(man_MANS))
%: %.asc
$(asciidoc) -o $@ $<
%.html: %.asc
asciidoc -b html5 -o $@ $<
MAINTAINERCLEANFILES = $(man_MANS) Doxyfile
endif


@ -34,7 +34,9 @@ IN/OUT ARGUMENTS
'outputbase'::
The basename of the output file (to which the appropriate extension
will be appended). By default the output will be named 'outbase.txt'.
will be appended). By default the output will be a text file
with `.txt` added to the basename unless there are one or more
'configfile' options which explicitly specify the desired output.
'stdout'::
Instruction to send output data to standard output
@ -88,10 +90,21 @@ OPTIONS
contains a list of variables and their values, one per line, with a
space separating variable from value. Interesting config files
include: +
* hocr - Output in hOCR format instead of as a text file.
* pdf - Output in pdf instead of a text file.
* `hocr` - Output in hOCR format (file extension `.hocr`).
* `pdf` - Output PDF (file extension `.pdf`).
* `tsv` - Output TSV (file extension `.tsv`).
* `txt` - Output plain text (file extension `.txt`).
* `get.images` - Write images.
* `logfile` - Write debug file `tesseract.log`.
* `lstm.train` - Used for LSTM training.
* `makebox` - Output box file.
* `quiet` - Write debug file to /dev/null.
*Nota Bene:* The options '-l lang' and '--psm N' must occur
It is possible to select several config files, for example
`tesseract image.png demo hocr pdf txt` will create three output files
`demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
*Nota Bene:* The options `-l lang` and `--psm N` must occur
before any 'configfile'.
@ -110,19 +123,35 @@ SINGLE OPTIONS
Returns the current version of the tesseract(1) executable.
'--list-langs'::
List available languages for tesseract engine. Can be used with --tessdata-dir.
List available languages for tesseract engine. Can be used with `--tessdata-dir`.
'--print-parameters'::
Print tesseract parameters.
LANGUAGES
---------
LANGUAGES AND SCRIPTS
---------------------
The currently available traineddata files for tesseract 4.0
for the following languages are in
(in https://github.com/tesseract-ocr/tessdata_fast):
To recognize some text with Tesseract, it is normally necessary to specify
the language(s) or script of the text (unless it is English text which is
supported by default) using `-l lang`.
Selecting a language automatically also selects the language specific
character set and dictionary (word list).
Selecting a script typically selects all characters of that script
which can be from different languages. The dictionary which is included
also contains a mix from different languages.
In most cases, a script also supports English.
So it is possible to recognize a language that has not been specifically
trained for by using traineddata for the script it is written in.
https://github.com/tesseract-ocr/tessdata_fast provides fast language and
script models which are also part of Linux distributions.
For Tesseract 4, `tessdata_fast` includes traineddata files for the
following languages:
*afr* (Afrikaans),
*amh* (Amharic),
@ -245,17 +274,10 @@ for the following languages are in
To use a non-standard language pack named *foo.traineddata*, set the
*TESSDATA_PREFIX* environment variable so the file can be found at
*TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
argument '-l foo'.
argument `-l foo`.
SCRIPTS
-------
The traineddata files for the following scripts for tesseract 4.0
are also in https://github.com/tesseract-ocr/tessdata_fast.
In most cases, each of these contains all the languages that use that script PLUS English.
So it is possible to recognize a language that has not been specifically trained for
by using traineddata for the script it is written in.
For Tesseract 4, `tessdata_fast` includes traineddata files for the
following scripts:
Arabic,
Armenian,
@ -295,6 +317,18 @@ Thai,
Tibetan,
Vietnamese.
The same languages and scripts are available from
https://github.com/tesseract-ocr/tessdata_best.
`tessdata_best` provides slow language and script models.
These models are needed for training. They also can give better OCR results,
but the recognition takes much more time.
Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
There is a third repository, https://github.com/tesseract-ocr/tessdata,
with models which support both the Tesseract 3 legacy OCR engine and the
Tesseract 4 LSTM OCR engine.
CONFIG FILES AND AUGMENTING WITH USER DATA
------------------------------------------
@ -364,18 +398,29 @@ scripts are now included to allow anyone to reproduce some of these tests.
See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
details.
Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
and Korean. It also introduces a new, single-file based system of managing
Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
and Korean. It also introduced a new, single-file based system of managing
language data.
Tesseract 3.02 adds BiDirectional text support, the ability to recognize
Tesseract 3.02 added BiDirectional text support, the ability to recognize
multiple languages in a single image, and improved layout analysis.
For further details, see the file ReleaseNotes included with the distribution.
Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
on line recognition, but also still supports the legacy Tesseract OCR engine of
Tesseract 3 which works by recognizing character patterns. Compatibility with
Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
support the legacy engine, for example those from the tessdata repository
(https://github.com/tesseract-ocr/tessdata).
For further details, see the release notes in the Tesseract wiki
(<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>).
RESOURCES
---------
Main web site: <https://github.com/tesseract-ocr> +
User forum: <http://groups.google.com/group/tesseract-ocr> +
Wiki: <https://github.com/tesseract-ocr/tesseract/wiki> +
Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
SEE ALSO
@ -396,6 +441,9 @@ Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
Lloyd, Shobhit Saxena, and Thomas Kielbus.
For a list of contributors see
<https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS>.
COPYING
-------
Licensed under the Apache License, Version 2.0


@ -1130,6 +1130,15 @@ bool TessBaseAPI::ProcessPagesInternal(const char* filename,
buf.assign((std::istreambuf_iterator<char>(std::cin)),
(std::istreambuf_iterator<char>()));
data = reinterpret_cast<const l_uint8 *>(buf.data());
} else {
// Check whether the input file can be read.
if (FILE* file = fopen(filename, "rb")) {
fclose(file);
} else {
fprintf(stderr, "Error, cannot read input file %s: %s\n",
filename, strerror(errno));
return false;
}
}
// Here is our autodetection
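
The readability check added in this hunk can be read in isolation as the following sketch. `CanReadFile` and its `error` out-parameter are hypothetical names used only for illustration; they are not part of the Tesseract API. The sketch also assumes a platform where `fopen` sets `errno` on failure (POSIX specifies this, ISO C does not require it).

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper (not a Tesseract function): returns true if the file
// can be opened for binary reading, otherwise stores the reason in *error.
bool CanReadFile(const char* filename, std::string* error) {
  errno = 0;
  if (FILE* file = std::fopen(filename, "rb")) {
    std::fclose(file);
    return true;
  }
  // Assumption: fopen set errno on failure, as POSIX specifies.
  const int saved_errno = errno;
  if (error != nullptr) *error = std::strerror(saved_errno);
  return false;
}
```

Opening in `"rb"` mode rather than `"r"` matches the new code in the hunk and avoids text-mode translation on platforms that distinguish the two.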


@ -75,6 +75,7 @@ class Trie;
class Wordrec;
typedef int (Dict::*DictFunc)(void* void_dawg_args,
const UNICHARSET& unicharset,
UNICHAR_ID unichar_id, bool word_end) const;
typedef double (Dict::*ProbabilityInContextFunc)(const char* lang,
const char* context,


@ -367,11 +367,15 @@ static void ParseArgs(const int argc, char** argv, const char** lang,
*arg_i = i;
if (*pagesegmode == tesseract::PSM_OSD_ONLY) {
// That mode requires osd.traineddata, no other language or script files.
// OSD = orientation and script detection.
if (*lang != nullptr && strcmp(*lang, "osd")) {
fprintf(stderr, "Warning, ignoring -l %s for --psm 0\n", *lang);
// If the user explicitly specifies a language (other than osd)
// or a script, only orientation can be detected.
fprintf(stderr, "Warning, detects only orientation with -l %s\n", *lang);
} else {
// That mode requires osd.traineddata to detect orientation and script.
*lang = "osd";
}
*lang = "osd";
}
if (*outputbase == nullptr && noocr == false) {
@ -530,13 +534,6 @@ int main(int argc, char** argv) {
return EXIT_SUCCESS;
}
if (FILE* file = fopen(image, "r")) {
fclose(file);
} else {
fprintf(stderr, "Cannot open input file: %s\n", image);
return EXIT_FAILURE;
}
FixPageSegMode(&api, pagesegmode);
if (dpi) {


@ -36,9 +36,6 @@
#include <algorithm>
#include <memory>
const int kMinCharactersToTry = 50;
const int kMaxCharactersToTry = 5 * kMinCharactersToTry;
const float kSizeRatioToReject = 2.0;
const int kMinAcceptableBlobHeight = 10;
@ -278,6 +275,8 @@ int os_detect_blobs(const GenericVector<int>* allowed_scripts,
BLOBNBOX_CLIST* blob_list, OSResults* osr,
tesseract::Tesseract* tess) {
OSResults osr_;
int minCharactersToTry = tess->min_characters_to_try;
int maxCharactersToTry = 5 * minCharactersToTry;
if (osr == nullptr)
osr = &osr_;
@ -286,13 +285,13 @@ int os_detect_blobs(const GenericVector<int>* allowed_scripts,
ScriptDetector s(allowed_scripts, osr, tess);
BLOBNBOX_C_IT filtered_it(blob_list);
int real_max = std::min(filtered_it.length(), kMaxCharactersToTry);
int real_max = std::min(filtered_it.length(), maxCharactersToTry);
// tprintf("Total blobs found = %d\n", blobs_total);
// tprintf("Number of blobs post-filtering = %d\n", filtered_it.length());
// tprintf("Number of blobs to try = %d\n", real_max);
// If there are too few characters, skip this page entirely.
if (real_max < kMinCharactersToTry / 2) {
if (real_max < minCharactersToTry / 2) {
tprintf("Too few characters. Skipping this page\n");
return 0;
}
@ -307,7 +306,7 @@ int os_detect_blobs(const GenericVector<int>* allowed_scripts,
int num_blobs_evaluated = 0;
for (int i = 0; i < real_max; ++i) {
if (os_detect_blob(blobs[sequence.GetVal()], &o, &s, osr, tess)
&& i > kMinCharactersToTry) {
&& i > minCharactersToTry) {
break;
}
++num_blobs_evaluated;
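
To see how the new `min_characters_to_try` parameter changes the thresholds, the arithmetic from the hunks above can be isolated in a small sketch. `BlobsToTry` is a hypothetical helper, not a function in Tesseract. It models only the upper bound and the page-skip test, not the early exit inside the loop.

```cpp
#include <algorithm>

// Hypothetical helper mirroring the thresholds in os_detect_blobs.
// Returns 0 when the page would be skipped, otherwise the maximum number
// of blobs that would be examined.
int BlobsToTry(int num_filtered_blobs, int min_characters_to_try) {
  const int max_characters_to_try = 5 * min_characters_to_try;
  const int real_max = std::min(num_filtered_blobs, max_characters_to_try);
  // Too few characters: skip this page entirely.
  if (real_max < min_characters_to_try / 2) return 0;
  return real_max;
}
```

With the default of 50, a page with 10 filtered blobs is skipped because 10 is below 25. Lowering the parameter to 20 moves the cut-off to 10, so the same page is processed.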


@ -100,15 +100,15 @@ enum ColorationMode {
*
*/
ScrollView* image_win;
ParamsEditor* pe;
bool stillRunning = false;
static ScrollView* image_win;
static ParamsEditor* pe;
static bool stillRunning = false;
ScrollView* bln_word_window = nullptr; // baseline norm words
static ScrollView* bln_word_window = nullptr; // baseline norm words
CMD_EVENTS mode = CHANGE_DISP_CMD_EVENT; // selected words op
static CMD_EVENTS mode = CHANGE_DISP_CMD_EVENT; // selected words op
bool recog_done = false; // recog_all_words was called
static bool recog_done = false; // recog_all_words was called
// These variables should remain global, since they are only used for the
// debug mode (in which only a single Tesseract thread/instance will exist).


@ -397,6 +397,9 @@ Tesseract::Tesseract()
INT_MEMBER(jpg_quality, 85, "Set JPEG quality level", this->params()),
INT_MEMBER(user_defined_dpi, 0, "Specify DPI for input image",
this->params()),
INT_MEMBER(min_characters_to_try, 50,
"Specify minimum characters to try during OSD",
this->params()),
STRING_MEMBER(unrecognised_char, "|",
"Output char for unidentified blobs", this->params()),
INT_MEMBER(suspect_level, 99, "Suspect marker level", this->params()),


@ -1043,6 +1043,8 @@ class Tesseract : public Wordrec {
"Create PDF with only one invisible text layer");
INT_VAR_H(jpg_quality, 85, "Set JPEG quality level");
INT_VAR_H(user_defined_dpi, 0, "Specify DPI for input image");
INT_VAR_H(min_characters_to_try, 50,
"Specify minimum characters to try during OSD");
STRING_VAR_H(unrecognised_char, "|",
"Output char for unidentified blobs");
INT_VAR_H(suspect_level, 99, "Suspect marker level");


@ -365,7 +365,7 @@ class GENERIC_2D_ARRAY {
}
// Accumulates the element-wise sums of squares of src into *this.
void SumSquares(const GENERIC_2D_ARRAY<T>& src, T decay_factor) {
void SumSquares(const GENERIC_2D_ARRAY<T>& src, const T& decay_factor) {
T update_factor = 1.0 - decay_factor;
int size = num_elements();
for (int i = 0; i < size; ++i) {
@ -377,7 +377,7 @@ class GENERIC_2D_ARRAY {
// Scales each element using the adam algorithm, ie array_[i] by
// sqrt(sqsum[i] + epsilon)).
void AdamUpdate(const GENERIC_2D_ARRAY<T>& sum,
const GENERIC_2D_ARRAY<T>& sqsum, T epsilon) {
const GENERIC_2D_ARRAY<T>& sqsum, const T& epsilon) {
int size = num_elements();
for (int i = 0; i < size; ++i) {
array_[i] += sum.array_[i] / (sqrt(sqsum.array_[i]) + epsilon);
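
As a rough illustration of the `AdamUpdate` loop shown above, here is a minimal free-function sketch over `std::vector<float>`. The function signature and the use of `std::vector` in place of `GENERIC_2D_ARRAY` are assumptions made for the example. The update expression follows the code in the hunk, where epsilon is added after the square root.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical free-function version of the AdamUpdate loop, using
// std::vector instead of GENERIC_2D_ARRAY. epsilon is taken by const
// reference to mirror the signature change in the diff.
void AdamUpdate(std::vector<float>& values, const std::vector<float>& sum,
                const std::vector<float>& sqsum, const float& epsilon) {
  assert(values.size() == sum.size() && values.size() == sqsum.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    values[i] += sum[i] / (std::sqrt(sqsum[i]) + epsilon);
  }
}
```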


@ -363,10 +363,10 @@ class WERD_RES : public ELIST_LINK {
blob_index >= best_choice->length())
return nullptr;
UNICHAR_ID id = best_choice->unichar_id(blob_index);
if (id < 0 || id >= uch_set->size() || id == INVALID_UNICHAR_ID)
if (id < 0 || id >= uch_set->size())
return nullptr;
UNICHAR_ID mirrored = uch_set->get_mirror(id);
if (in_rtl_context && mirrored > 0 && mirrored != INVALID_UNICHAR_ID)
if (in_rtl_context && mirrored > 0)
id = mirrored;
return uch_set->id_to_unichar_ext(id);
}
@ -375,7 +375,7 @@ class WERD_RES : public ELIST_LINK {
if (blob_index < 0 || blob_index >= raw_choice->length())
return nullptr;
UNICHAR_ID id = raw_choice->unichar_id(blob_index);
if (id < 0 || id >= uch_set->size() || id == INVALID_UNICHAR_ID)
if (id < 0 || id >= uch_set->size())
return nullptr;
return uch_set->id_to_unichar(id);
}


@ -21,7 +21,7 @@
#define RECT_H
#include <algorithm> // for std::max, std::min
#include <cmath> // for ceil, floor
#include <cmath> // for std::ceil, std::floor
#include <cstdint> // for INT16_MAX
#include <cstdio> // for FILE
#include "platform.h" // for DLLSYM
@ -162,29 +162,33 @@ class DLLSYM TBOX { // bounding box
void move( // move box
const FCOORD vec) { // by float vector
bot_left.set_x ((int16_t) floor (bot_left.x () + vec.x ()));
bot_left.set_x(static_cast<int16_t>(std::floor(bot_left.x() + vec.x())));
// round left
bot_left.set_y ((int16_t) floor (bot_left.y () + vec.y ()));
bot_left.set_y(static_cast<int16_t>(std::floor(bot_left.y() + vec.y())));
// round down
top_right.set_x ((int16_t) ceil (top_right.x () + vec.x ()));
top_right.set_x(static_cast<int16_t>(std::ceil(top_right.x() + vec.x())));
// round right
top_right.set_y ((int16_t) ceil (top_right.y () + vec.y ()));
top_right.set_y(static_cast<int16_t>(std::ceil(top_right.y() + vec.y())));
// round up
}
void scale( // scale box
const float f) { // by multiplier
bot_left.set_x ((int16_t) floor (bot_left.x () * f)); // round left
bot_left.set_y ((int16_t) floor (bot_left.y () * f)); // round down
top_right.set_x ((int16_t) ceil (top_right.x () * f)); // round right
top_right.set_y ((int16_t) ceil (top_right.y () * f)); // round up
// round left
bot_left.set_x(static_cast<int16_t>(std::floor(bot_left.x() * f)));
// round down
bot_left.set_y(static_cast<int16_t>(std::floor(bot_left.y() * f)));
// round right
top_right.set_x(static_cast<int16_t>(std::ceil(top_right.x() * f)));
// round up
top_right.set_y(static_cast<int16_t>(std::ceil(top_right.y() * f)));
}
void scale( // scale box
const FCOORD vec) { // by float vector
bot_left.set_x ((int16_t) floor (bot_left.x () * vec.x ()));
bot_left.set_y ((int16_t) floor (bot_left.y () * vec.y ()));
top_right.set_x ((int16_t) ceil (top_right.x () * vec.x ()));
top_right.set_y ((int16_t) ceil (top_right.y () * vec.y ()));
bot_left.set_x(static_cast<int16_t>(std::floor(bot_left.x() * vec.x())));
bot_left.set_y(static_cast<int16_t>(std::floor(bot_left.y() * vec.y())));
top_right.set_x(static_cast<int16_t>(std::ceil(top_right.x() * vec.x())));
top_right.set_y(static_cast<int16_t>(std::ceil(top_right.y() * vec.y())));
}
// rotate doesn't enlarge the box - it just rotates the bottom-left
@ -314,8 +318,10 @@ class DLLSYM TBOX { // bounding box
inline TBOX::TBOX( // constructor
const FCOORD pt // floating centre
) {
bot_left = ICOORD ((int16_t) floor (pt.x ()), (int16_t) floor (pt.y ()));
top_right = ICOORD ((int16_t) ceil (pt.x ()), (int16_t) ceil (pt.y ()));
bot_left = ICOORD(static_cast<int16_t>(std::floor(pt.x())),
static_cast<int16_t>(std::floor(pt.y())));
top_right = ICOORD(static_cast<int16_t>(std::ceil(pt.x())),
static_cast<int16_t>(std::ceil(pt.y())));
}
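
The rounding convention used throughout these `TBOX` changes (bottom-left rounded down, top-right rounded up) can be sketched without the Tesseract types. `IntBox` and `BoxAround` are hypothetical stand-ins for `TBOX` and its `FCOORD` constructor. The commit title mentions more efficient float calculations, presumably because `std::floor` and `std::ceil` have `float` overloads and so avoid a promotion to `double`.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for TBOX with integer corners.
struct IntBox {
  int16_t left, bottom, right, top;
};

// Smallest integer box containing the floating point position (x, y):
// the bottom-left corner is rounded down, the top-right corner rounded up.
IntBox BoxAround(float x, float y) {
  return IntBox{static_cast<int16_t>(std::floor(x)),
                static_cast<int16_t>(std::floor(y)),
                static_cast<int16_t>(std::ceil(x)),
                static_cast<int16_t>(std::ceil(y))};
}
```

Note that for negative coordinates `std::floor` rounds away from zero, so `BoxAround(1.2f, -1.5f)` spans from (1, -2) to (2, -1).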


@ -39,7 +39,7 @@ class GenericVector {
GenericVector() {
init(kDefaultVectorSize);
}
GenericVector(int size, T init_val) {
GenericVector(int size, const T& init_val) {
init(size);
init_to_size(size, init_val);
}
@ -60,7 +60,7 @@ class GenericVector {
void double_the_size();
// Resizes to size and sets all values to t.
void init_to_size(int size, T t);
void init_to_size(int size, const T& t);
// Resizes to size without any initialization.
void resize_no_init(int size) {
reserve(size);
@ -101,31 +101,31 @@ class GenericVector {
// Return the index of the T object.
// This method NEEDS a compare_callback to be passed to
// set_compare_callback.
int get_index(T object) const;
int get_index(const T& object) const;
// Return true if T is in the array
bool contains(T object) const;
bool contains(const T& object) const;
// Return true if the index is valid
T contains_index(int index) const;
// Push an element in the end of the array
int push_back(T object);
void operator+=(T t);
void operator+=(const T& t);
// Push an element in the end of the array if the same
// element is not already contained in the array.
int push_back_new(T object);
int push_back_new(const T& object);
// Push an element in the front of the array
// Note: This function is O(n)
int push_front(T object);
int push_front(const T& object);
// Set the value at the given index
void set(T t, int index);
void set(const T& t, int index);
// Insert t at the given index, push other elements to the right.
void insert(T t, int index);
void insert(const T& t, int index);
// Removes an element at the given index and
// shifts the remaining elements to the left.
@ -705,7 +705,7 @@ void GenericVector<T>::double_the_size() {
// Resizes to size and sets all values to t.
template <typename T>
void GenericVector<T>::init_to_size(int size, T t) {
void GenericVector<T>::init_to_size(int size, const T& t) {
reserve(size);
size_used_ = size;
for (int i = 0; i < size; ++i)
@ -740,7 +740,7 @@ T GenericVector<T>::pop_back() {
// Return the object from an index.
template <typename T>
void GenericVector<T>::set(T t, int index) {
void GenericVector<T>::set(const T& t, int index) {
assert(index >= 0 && index < size_used_);
data_[index] = t;
}
@ -749,7 +749,7 @@ void GenericVector<T>::set(T t, int index) {
// space for the new elements and inserts the given element
// at the specified index.
template <typename T>
void GenericVector<T>::insert(T t, int index) {
void GenericVector<T>::insert(const T& t, int index) {
assert(index >= 0 && index <= size_used_);
if (size_reserved_ == size_used_)
double_the_size();
@ -779,7 +779,7 @@ T GenericVector<T>::contains_index(int index) const {
// Return the index of the T object.
template <typename T>
int GenericVector<T>::get_index(T object) const {
int GenericVector<T>::get_index(const T& object) const {
for (int i = 0; i < size_used_; ++i) {
assert(compare_cb_ != nullptr);
if (compare_cb_->Run(object, data_[i]))
@ -790,7 +790,7 @@ int GenericVector<T>::get_index(T object) const {
// Return true if T is in the array
template <typename T>
bool GenericVector<T>::contains(T object) const {
bool GenericVector<T>::contains(const T& object) const {
return get_index(object) != -1;
}
@ -806,7 +806,7 @@ int GenericVector<T>::push_back(T object) {
}
template <typename T>
int GenericVector<T>::push_back_new(T object) {
int GenericVector<T>::push_back_new(const T& object) {
int index = get_index(object);
if (index >= 0)
return index;
@ -815,7 +815,7 @@ int GenericVector<T>::push_back_new(T object) {
// Add an element in the array (front)
template <typename T>
int GenericVector<T>::push_front(T object) {
int GenericVector<T>::push_front(const T& object) {
if (size_used_ == size_reserved_)
double_the_size();
for (int i = size_used_; i > 0; --i)
@ -826,7 +826,7 @@ int GenericVector<T>::push_front(T object) {
}
template <typename T>
void GenericVector<T>::operator+=(T t) {
void GenericVector<T>::operator+=(const T& t) {
push_back(t);
}
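
The effect of switching these `GenericVector` parameters from `T` to `const T&` can be made visible with a type that counts its copies. Everything in this sketch (`g_copies`, `Tracked`, `ContainsByValue`, `ContainsByRef`) is hypothetical and exists only to illustrate the difference between the two signatures.

```cpp
// Counts copy constructions so the two call styles can be compared.
int g_copies = 0;

struct Tracked {
  Tracked() = default;
  Tracked(const Tracked&) { ++g_copies; }
};

// Pass by value: the parameter is copy-constructed from the caller's object.
bool ContainsByValue(Tracked object) {
  (void)object;
  return true;
}

// Pass by const reference: the caller's object is used directly, no copy.
bool ContainsByRef(const Tracked& object) {
  (void)object;
  return true;
}
```

For small built-in types the difference is negligible, but for element types with expensive copy constructors each by-value call pays for one extra copy.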
@ -866,15 +866,14 @@ void GenericVector<T>::set_compare_callback(
// Clear the array, calling the callback function if any.
template <typename T>
void GenericVector<T>::clear() {
if (size_reserved_ > 0) {
if (clear_cb_ != nullptr)
for (int i = 0; i < size_used_; ++i)
clear_cb_->Run(data_[i]);
delete[] data_;
data_ = nullptr;
size_used_ = 0;
size_reserved_ = 0;
if (size_reserved_ > 0 && clear_cb_ != nullptr) {
for (int i = 0; i < size_used_; ++i)
clear_cb_->Run(data_[i]);
}
delete[] data_;
data_ = nullptr;
size_used_ = 0;
size_reserved_ = 0;
delete clear_cb_;
clear_cb_ = nullptr;
delete compare_cb_;


@ -20,7 +20,7 @@
-----------------------------------------------------------------------------*/
#include <algorithm>
#include <cmath>
#include <cmath> // for std::floor
#include <cstdio>
#include <cassert>
@ -117,7 +117,7 @@ FILL_SPEC;
#define CircularIncrement(i,r) (((i) < (r) - 1)?((i)++):((i) = 0))
/** macro for mapping floats to ints without bounds checking */
#define MapParam(P,O,N) (floor (((P) + (O)) * (N)))
#define MapParam(P,O,N) (std::floor(((P) + (O)) * (N)))
/*---------------------------------------------------------------------------
Private Function Prototypes
@ -1205,11 +1205,11 @@ void FillPPCircularBits(uint32_t ParamTable[NUM_PP_BUCKETS][WERDS_PER_PP_VECTOR]
if (Spread > 0.5)
Spread = 0.5;
FirstBucket = (int) floor ((Center - Spread) * NUM_PP_BUCKETS);
FirstBucket = static_cast<int>(std::floor((Center - Spread) * NUM_PP_BUCKETS));
if (FirstBucket < 0)
FirstBucket += NUM_PP_BUCKETS;
LastBucket = (int) floor ((Center + Spread) * NUM_PP_BUCKETS);
LastBucket = static_cast<int>(std::floor((Center + Spread) * NUM_PP_BUCKETS));
if (LastBucket >= NUM_PP_BUCKETS)
LastBucket -= NUM_PP_BUCKETS;
if (debug) tprintf("Circular fill from %d to %d", FirstBucket, LastBucket);
@ -1243,11 +1243,11 @@ void FillPPLinearBits(uint32_t ParamTable[NUM_PP_BUCKETS][WERDS_PER_PP_VECTOR],
int Bit, float Center, float Spread, bool debug) {
int i, FirstBucket, LastBucket;
FirstBucket = (int) floor ((Center - Spread) * NUM_PP_BUCKETS);
FirstBucket = static_cast<int>(std::floor((Center - Spread) * NUM_PP_BUCKETS));
if (FirstBucket < 0)
FirstBucket = 0;
LastBucket = (int) floor ((Center + Spread) * NUM_PP_BUCKETS);
LastBucket = static_cast<int>(std::floor((Center + Spread) * NUM_PP_BUCKETS));
if (LastBucket >= NUM_PP_BUCKETS)
LastBucket = NUM_PP_BUCKETS - 1;
@ -1736,7 +1736,7 @@ int TruncateParam(float Param, int Min, int Max, char *Id) {
Id, Param, Max);
Param = Max;
}
return static_cast<int>(floor(Param));
return static_cast<int>(std::floor(Param));
} /* TruncateParam */
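
The two bucket computations touched above differ only in how out-of-range indices are handled: the linear variant clamps, the circular variant wraps. A minimal sketch follows, in which `kNumBuckets`, `BucketRange`, `LinearBuckets` and `CircularBuckets` are hypothetical names, and the bucket count of 64 is an assumption because the value of `NUM_PP_BUCKETS` is not shown in this diff.

```cpp
#include <cmath>

// Assumed bucket count for illustration only.
const int kNumBuckets = 64;

struct BucketRange {
  int first, last;
};

// Linear parameter: indices outside [0, kNumBuckets - 1] are clamped.
BucketRange LinearBuckets(float center, float spread) {
  int first = static_cast<int>(std::floor((center - spread) * kNumBuckets));
  if (first < 0) first = 0;
  int last = static_cast<int>(std::floor((center + spread) * kNumBuckets));
  if (last >= kNumBuckets) last = kNumBuckets - 1;
  return BucketRange{first, last};
}

// Circular parameter: indices outside the range wrap around.
BucketRange CircularBuckets(float center, float spread) {
  if (spread > 0.5f) spread = 0.5f;
  int first = static_cast<int>(std::floor((center - spread) * kNumBuckets));
  if (first < 0) first += kNumBuckets;
  int last = static_cast<int>(std::floor((center + spread) * kNumBuckets));
  if (last >= kNumBuckets) last -= kNumBuckets;
  return BucketRange{first, last};
}
```

In the circular case `first` can end up larger than `last`, which means the filled range crosses the wrap-around point.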


@ -32,6 +32,11 @@ Dict::Dict(CCUtil *ccutil)
probability_in_context_(&tesseract::Dict::def_probability_in_context),
params_model_classify_(nullptr),
ccutil_(ccutil),
wildcard_unichar_id_(INVALID_UNICHAR_ID),
apostrophe_unichar_id_(INVALID_UNICHAR_ID),
question_unichar_id_(INVALID_UNICHAR_ID),
slash_unichar_id_(INVALID_UNICHAR_ID),
hyphen_unichar_id_(INVALID_UNICHAR_ID),
STRING_MEMBER(user_words_file, "", "A filename of user-provided words.",
getCCUtil()->params()),
STRING_INIT_MEMBER(user_words_suffix, "",
@ -167,7 +172,6 @@ Dict::Dict(CCUtil *ccutil)
go_deeper_fxn_ = nullptr;
hyphen_word_ = nullptr;
last_word_on_line_ = false;
hyphen_unichar_id_ = INVALID_UNICHAR_ID;
document_words_ = nullptr;
dawg_cache_ = nullptr;
dawg_cache_is_ours_ = false;
@ -361,10 +365,13 @@ void Dict::End() {
// according to at least one of the dawgs in the dawgs_ vector.
// See more extensive comments in dict.h where this function is declared.
int Dict::def_letter_is_okay(void* void_dawg_args,
const UNICHARSET& unicharset,
UNICHAR_ID unichar_id,
bool word_end) const {
DawgArgs *dawg_args = static_cast<DawgArgs *>(void_dawg_args);
ASSERT_HOST(unicharset.contains_unichar_id(unichar_id));
if (dawg_debug_level >= 3) {
tprintf("def_letter_is_okay: current unichar=%s word_end=%d"
" num active dawgs=%d\n",
@ -410,7 +417,7 @@ int Dict::def_letter_is_okay(void* void_dawg_args,
for (int s = 0; s < slist.length(); ++s) {
int sdawg_index = slist[s];
const Dawg *sdawg = dawgs_[sdawg_index];
UNICHAR_ID ch = char_for_dawg(unichar_id, sdawg);
UNICHAR_ID ch = char_for_dawg(unicharset, unichar_id, sdawg);
EDGE_REF dawg_edge = sdawg->edge_char_of(0, ch, word_end);
if (dawg_edge != NO_EDGE) {
if (dawg_debug_level >=3) {
@ -477,7 +484,8 @@ int Dict::def_letter_is_okay(void* void_dawg_args,
// Find the edge out of the node for the unichar_id.
NODE_REF node = GetStartingNode(dawg, pos.dawg_ref);
EDGE_REF edge = (node == NO_EDGE) ? NO_EDGE
: dawg->edge_char_of(node, char_for_dawg(unichar_id, dawg), word_end);
: dawg->edge_char_of(node, char_for_dawg(unicharset, unichar_id, dawg),
word_end);
if (dawg_debug_level >= 3) {
tprintf("Active dawg: [%d, " REFFORMAT "] edge=" REFFORMAT "\n",
@ -759,7 +767,8 @@ int Dict::valid_word(const WERD_CHOICE &word, bool numbers_ok) const {
int last_index = word_ptr->length() - 1;
// Call letter_is_okay for each letter in the word.
for (int i = hyphen_base_size(); i <= last_index; ++i) {
if (!((this->*letter_is_okay_)(&dawg_args, word_ptr->unichar_id(i),
if (!((this->*letter_is_okay_)(&dawg_args, *word_ptr->unicharset(),
word_ptr->unichar_id(i),
i == last_index))) break;
// Swap active_dawgs, constraints with the corresponding updated vector.
if (dawg_args.updated_dawgs == &(active_dawgs[1])) {


@ -351,15 +351,17 @@ class Dict {
*/
//
int def_letter_is_okay(void* void_dawg_args,
int def_letter_is_okay(void* void_dawg_args, const UNICHARSET& unicharset,
UNICHAR_ID unichar_id, bool word_end) const;
int (Dict::*letter_is_okay_)(void* void_dawg_args,
const UNICHARSET& unicharset,
UNICHAR_ID unichar_id, bool word_end) const;
/// Calls letter_is_okay_ member function.
int LetterIsOkay(void* void_dawg_args,
int LetterIsOkay(void* void_dawg_args, const UNICHARSET& unicharset,
UNICHAR_ID unichar_id, bool word_end) const {
return (this->*letter_is_okay_)(void_dawg_args, unichar_id, word_end);
return (this->*letter_is_okay_)(void_dawg_args,
unicharset, unichar_id, word_end);
}
@ -428,11 +430,12 @@ class Dict {
// Given a unichar from a string and a given dawg, return the unichar
// we should use to match in that dawg type. (for example, in the number
// dawg, all numbers are transformed to kPatternUnicharId).
inline UNICHAR_ID char_for_dawg(UNICHAR_ID ch, const Dawg *dawg) const {
UNICHAR_ID char_for_dawg(const UNICHARSET& unicharset, UNICHAR_ID ch,
const Dawg *dawg) const {
if (!dawg) return ch;
switch (dawg->type()) {
case DAWG_TYPE_NUMBER:
return getUnicharset().get_isdigit(ch) ? Dawg::kPatternUnicharID : ch;
return unicharset.get_isdigit(ch) ? Dawg::kPatternUnicharID : ch;
default:
return ch;
}


@ -88,7 +88,7 @@ void Dict::go_deeper_dawg_fxn(
++num_unigrams;
word->append_unichar_id(uch_id, 1, 0.0, 0.0);
unigrams_ok = (this->*letter_is_okay_)(
&unigram_dawg_args,
&unigram_dawg_args, *word->unicharset(),
word->unichar_id(word_index+num_unigrams-1),
word_ending && i == encoding.size() - 1);
(*unigram_dawg_args.active_dawgs) = *(unigram_dawg_args.updated_dawgs);
@ -111,7 +111,8 @@ void Dict::go_deeper_dawg_fxn(
// Check which dawgs from the dawgs_ vector contain the word
// up to and including the current unichar.
if (checked_unigrams || (this->*letter_is_okay_)(
more_args, word->unichar_id(word_index), word_ending)) {
more_args, *word->unicharset(), word->unichar_id(word_index),
word_ending)) {
// Add a new word choice
if (word_ending) {
if (dawg_debug_level) {


@ -771,7 +771,8 @@ void RecodeBeamSearch::ContinueDawg(int code, int unichar_id, float cert,
return; // Can't continue if not a dict word.
}
PermuterType permuter = static_cast<PermuterType>(
dict_->def_letter_is_okay(&dawg_args, unichar_id, false));
dict_->def_letter_is_okay(&dawg_args,
dict_->getUnicharset(), unichar_id, false));
if (permuter != NO_PERM) {
PushHeapIfBetter(kBeamWidths[0], code, unichar_id, permuter, false,
word_start, dawg_args.valid_end, false, cert, prev,


@ -72,7 +72,7 @@ int main(int argc, char **argv) {
tesseract::TessdataManager tm;
if (argc > 1 && (!strcmp(argv[1], "-v") || !strcmp(argv[1], "--version"))) {
printf("%s\n", tesseract::TessBaseAPI::Version());
return 0;
return EXIT_SUCCESS;
} else if (argc == 2) {
printf("Combining tessdata files\n");
STRING lang = argv[1];
@ -92,16 +92,22 @@ int main(int argc, char **argv) {
// Initialize TessdataManager with the data in the given traineddata file.
if (!tm.Init(argv[2])) {
tprintf("Failed to read %s\n", argv[2]);
exit(1);
return EXIT_FAILURE;
}
printf("Extracting tessdata components from %s\n", argv[2]);
if (strcmp(argv[1], "-e") == 0) {
for (i = 3; i < argc; ++i) {
errno = 0;
if (tm.ExtractToFile(argv[i])) {
printf("Wrote %s\n", argv[i]);
} else {
} else if (errno == 0) {
printf("Not extracting %s, since this component"
" is not present\n", argv[i]);
return EXIT_FAILURE;
} else {
printf("Error, could not extract %s: %s\n",
argv[i], strerror(errno));
return EXIT_FAILURE;
}
}
} else { // extract all the components
@ -111,8 +117,13 @@ int main(int argc, char **argv) {
if (*last != '.')
filename += '.';
filename += tesseract::kTessdataFileSuffixes[i];
errno = 0;
if (tm.ExtractToFile(filename.string())) {
printf("Wrote %s\n", filename.string());
} else if (errno != 0) {
printf("Error, could not extract %s: %s\n",
filename.string(), strerror(errno));
return EXIT_FAILURE;
}
}
}
@ -124,7 +135,7 @@ int main(int argc, char **argv) {
if (rename(new_traineddata_filename, traineddata_filename.string()) != 0) {
tprintf("Failed to create a temporary file %s\n",
traineddata_filename.string());
exit(1);
return EXIT_FAILURE;
}
// Initialize TessdataManager with the data in the given traineddata file.
@ -135,17 +146,17 @@ int main(int argc, char **argv) {
} else if (argc == 3 && strcmp(argv[1], "-c") == 0) {
if (!tm.Init(argv[2])) {
tprintf("Failed to read %s\n", argv[2]);
exit(1);
return EXIT_FAILURE;
}
tesseract::TFile fp;
if (!tm.GetComponent(tesseract::TESSDATA_LSTM, &fp)) {
tprintf("No LSTM Component found in %s!\n", argv[2]);
exit(1);
return EXIT_FAILURE;
}
tesseract::LSTMRecognizer recognizer;
if (!recognizer.DeSerialize(&tm, &fp)) {
tprintf("Failed to deserialize LSTM in %s!\n", argv[2]);
exit(1);
return EXIT_FAILURE;
}
recognizer.ConvertToInt();
GenericVector<char> lstm_data;
@ -155,7 +166,7 @@ int main(int argc, char **argv) {
lstm_data.size());
if (!tm.SaveFile(argv[2], nullptr)) {
tprintf("Failed to write modified traineddata:%s!\n", argv[2]);
exit(1);
return EXIT_FAILURE;
}
} else if (argc == 3 && strcmp(argv[1], "-d") == 0) {
// Initialize TessdataManager with the data in the given traineddata file.
@ -186,4 +197,5 @@ int main(int argc, char **argv) {
return 1;
}
tm.Directory();
return EXIT_SUCCESS;
}
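
The extraction hunks above reset `errno` before each `ExtractToFile` call and inspect it afterwards to tell a missing component apart from a real I/O failure. A minimal sketch of that pattern follows. `ExtractToFileSketch`, `TryExtract`, and `ExtractResult` are hypothetical names for illustration, not Tesseract API, and the sketch assumes that `fopen` sets `errno` on failure (POSIX behaviour, not guaranteed by ISO C).

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Possible outcomes of one extraction attempt.
enum class ExtractResult { kWrote, kAbsent, kIoError };

// Hypothetical stand-in for an extractor. It returns false without touching
// errno when the component is absent, and false with errno set by fopen when
// the output file cannot be opened.
bool ExtractToFileSketch(bool present, const std::string& path) {
  if (!present) return false;
  FILE* fp = std::fopen(path.c_str(), "wb");
  if (fp == nullptr) return false;
  std::fclose(fp);
  return true;
}

// Clears errno first so that a stale value from an earlier call cannot be
// mistaken for a failure of this attempt.
ExtractResult TryExtract(bool present, const std::string& path) {
  errno = 0;
  if (ExtractToFileSketch(present, path)) return ExtractResult::kWrote;
  const int saved_errno = errno;  // capture before any other library call
  if (saved_errno == 0) return ExtractResult::kAbsent;
  std::fprintf(stderr, "Error, could not extract %s: %s\n", path.c_str(),
               std::strerror(saved_errno));
  return ExtractResult::kIoError;
}
```

The pattern only works if the callee leaves `errno` untouched on the "absent" path, which is an assumption about the callee rather than something the language enforces.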

@ -73,22 +73,27 @@ const int kNumPagesPerBatch = 100;
int main(int argc, char **argv) {
tesseract::CheckSharedLibraryVersion();
ParseArguments(&argc, &argv);
// Purify the model name in case it is based on the network string.
if (FLAGS_model_output.empty()) {
tprintf("Must provide a --model_output!\n");
return 1;
return EXIT_FAILURE;
}
if (FLAGS_traineddata.empty()) {
tprintf("Must provide a --traineddata see training wiki\n");
return 1;
return EXIT_FAILURE;
}
STRING model_output = FLAGS_model_output.c_str();
for (int i = 0; i < model_output.length(); ++i) {
if (model_output[i] == '[' || model_output[i] == ']')
model_output[i] = '-';
if (model_output[i] == '(' || model_output[i] == ')')
model_output[i] = '_';
// Check write permissions.
STRING test_file = FLAGS_model_output.c_str();
test_file += "_wtest";
FILE* f = fopen(test_file.c_str(), "wb");
if (f != nullptr) {
fclose(f);
remove(test_file.c_str());
} else {
tprintf("Error, model output cannot be written: %s\n", strerror(errno));
return EXIT_FAILURE;
}
// Setup the trainer.
STRING checkpoint_file = FLAGS_model_output.c_str();
checkpoint_file += "_checkpoint";
@ -105,7 +110,7 @@ int main(int argc, char **argv) {
if (!trainer.TryLoadingCheckpoint(FLAGS_continue_from.c_str(), nullptr)) {
tprintf("Failed to read continue from: %s\n",
FLAGS_continue_from.c_str());
return 1;
return EXIT_FAILURE;
}
if (FLAGS_debug_network) {
trainer.DebugNetwork();
@ -116,20 +121,20 @@ int main(int argc, char **argv) {
FLAGS_model_output.c_str());
}
}
return 0;
return EXIT_SUCCESS;
}
// Get the list of files to process.
if (FLAGS_train_listfile.empty()) {
tprintf("Must supply a list of training filenames! --train_listfile\n");
return 1;
return EXIT_FAILURE;
}
GenericVector<STRING> filenames;
if (!tesseract::LoadFileLinesToStrings(FLAGS_train_listfile.c_str(),
&filenames)) {
tprintf("Failed to load list of training filenames from %s\n",
FLAGS_train_listfile.c_str());
return 1;
return EXIT_FAILURE;
}
// Checkpoints always take priority if they are available.
@ -145,7 +150,7 @@ int main(int argc, char **argv) {
? FLAGS_continue_from.c_str()
: FLAGS_old_traineddata.c_str())) {
tprintf("Failed to continue from: %s\n", FLAGS_continue_from.c_str());
return 1;
return EXIT_FAILURE;
}
tprintf("Continuing from %s\n", FLAGS_continue_from.c_str());
trainer.InitIterations();
@ -155,7 +160,7 @@ int main(int argc, char **argv) {
tprintf("Appending a new network to an old one!!");
if (FLAGS_continue_from.empty()) {
tprintf("Must set --continue_from for appending!\n");
return 1;
return EXIT_FAILURE;
}
}
// We are initializing from scratch.
@ -165,7 +170,7 @@ int main(int argc, char **argv) {
FLAGS_adam_beta)) {
tprintf("Failed to create network from spec: %s\n",
FLAGS_net_spec.c_str());
return 1;
return EXIT_FAILURE;
}
trainer.set_perfect_delay(FLAGS_perfect_sample_delay);
}
@ -176,7 +181,7 @@ int main(int argc, char **argv) {
: tesseract::CS_ROUND_ROBIN,
FLAGS_randomly_rotate)) {
tprintf("Load of images failed!!\n");
return 1;
return EXIT_FAILURE;
}
tesseract::LSTMTester tester(static_cast<int64_t>(FLAGS_max_image_MB) *
@ -186,7 +191,7 @@ int main(int argc, char **argv) {
if (!tester.LoadAllEvalData(FLAGS_eval_listfile.c_str())) {
tprintf("Failed to load eval data from: %s\n",
FLAGS_eval_listfile.c_str());
return 1;
return EXIT_FAILURE;
}
tester_callback =
NewPermanentTessCallback(&tester, &tesseract::LSTMTester::RunEvalAsync);
@ -208,5 +213,5 @@ int main(int argc, char **argv) {
FLAGS_max_iterations == 0));
delete tester_callback;
tprintf("Finished! Error rate = %g\n", trainer.best_error_rate());
return 0;
return EXIT_SUCCESS;
} /* main */
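
The write-permission check near the top of this `main` creates a throwaway file next to the requested model output and removes it again. It could be factored into a helper roughly as follows. `CanWriteModelOutput` is a hypothetical name, and reading `errno` after a failed `fopen` again assumes POSIX behaviour.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <string>

// Returns true if a probe file named `<model_base>_wtest` can be created.
// On failure, errno is left as set by fopen so the caller can report it.
bool CanWriteModelOutput(const std::string& model_base) {
  const std::string probe = model_base + "_wtest";
  errno = 0;
  FILE* f = std::fopen(probe.c_str(), "wb");
  if (f == nullptr) return false;
  std::fclose(f);
  std::remove(probe.c_str());  // leave no trace of the probe
  return true;
}
```

A probe like this fails early, before any training time is spent, but it is only a best-effort check: the directory can still become unwritable later in the run.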

@ -186,7 +186,11 @@ parse_flags() {
# Location where intermediate files will be created.
TIMESTAMP=`date +%Y-%m-%d`
if [ "$(uname)" == "Darwin" ];then
TMP_DIR=$(mktemp -d -t ${LANG_CODE}-${TIMESTAMP}.XXX )
else
TMP_DIR=$(mktemp -d --tmpdir ${LANG_CODE}-${TIMESTAMP}.XXX )
fi
TRAINING_DIR=${TMP_DIR}
# Location of log file for the whole run.
LOG_FILE=${TRAINING_DIR}/tesstrain.log

@ -89,7 +89,6 @@ int Wordrec::angle_change(EDGEPT *point1, EDGEPT *point2, EDGEPT *point3) {
VECTOR vector2;
int angle;
float length;
/* Compute angle */
vector1.x = point2->pos.x - point1->pos.x;
@ -97,7 +96,7 @@ int Wordrec::angle_change(EDGEPT *point1, EDGEPT *point2, EDGEPT *point3) {
vector2.x = point3->pos.x - point2->pos.x;
vector2.y = point3->pos.y - point2->pos.y;
/* Use cross product */
length = (float)sqrt((float)LENGTH(vector1) * LENGTH(vector2));
float length = std::sqrt(static_cast<float>(LENGTH(vector1)) * LENGTH(vector2));
if ((int) length == 0)
return (0);
angle = static_cast<int>(floor(asin(CROSS (vector1, vector2) /
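
The replacement line in this hunk relies on the `float` overload of `std::sqrt`, so the square root is computed in single precision instead of being promoted to `double` and cast back. A small sketch of the same expression follows. `Vec`, `LengthSq`, and `LengthProduct` are hypothetical stand-ins, and it is an assumption here that `LENGTH` yields an integer squared length.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

struct Vec { int x; int y; };  // hypothetical stand-in for VECTOR

// Squared length of a vector (assumed to mirror what LENGTH computes).
inline int LengthSq(const Vec& v) { return v.x * v.x + v.y * v.y; }

// Product of the two vector lengths, computed entirely in float:
// sqrt(|a|^2 * |b|^2) == |a| * |b|.
inline float LengthProduct(const Vec& a, const Vec& b) {
  return std::sqrt(static_cast<float>(LengthSq(a)) * LengthSq(b));
}

// std::sqrt is overloaded: a float argument selects the float version.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "float overload returns float");
static_assert(std::is_same<decltype(std::sqrt(1.0)), double>::value,
              "double overload returns double");
```

Casting the first operand to `float` before multiplying also keeps the product out of `int` arithmetic, where it could overflow for long vectors.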

@ -853,7 +853,7 @@ LanguageModelDawgInfo *LanguageModel::GenerateDawgInfo(
if (language_model_debug_level > 2)
tprintf("Test Letter OK for unichar %d, normed %d\n",
b.unichar_id(), normed_ids[i]);
dict_->LetterIsOkay(&dawg_args_, normed_ids[i],
dict_->LetterIsOkay(&dawg_args_, dict_->getUnicharset(), normed_ids[i],
word_end && i == normed_ids.size() - 1);
if (dawg_args_.permuter == NO_PERM) {
break;

@ -1,3 +1,2 @@
tessedit_create_hocr 1
tessedit_pageseg_mode 1
hocr_font_info 0

@ -1,2 +1 @@
tessedit_create_pdf 1
tessedit_pageseg_mode 1

@ -1,2 +1 @@
tessedit_create_tsv 1
tessedit_pageseg_mode 1

@ -1,2 +1 @@
tessedit_write_unlv 1
tessedit_pageseg_mode 6