Merge branch 'master' of https://github.com/tesseract-ocr/tesseract

* 'master' of https://github.com/tesseract-ocr/tesseract: (27 commits)
  Rework check for readable input file
  fix "mktemp -d --tmpdir" on Mac OS; see #1453
  pgedit: Change some variables from global to local ones
  improve description of min_characters_to_try variable
  WERD_RES: Remove comparisons which are constant
  GENERIC_2D_ARRAY: Pass parameters by reference
  genericvector: Pass parameters by reference
  chop: Use more efficient float calculations for sqrt
  rect: Use more efficient float calculations for ceil, floor
  intproto: Use more efficient float calculations for floor
  genericvector: Rewrite code to satisfy static code analyzer
  Fix constructor for class Dict (uninitialized member variables)
  Fix use of wrong UNICHARSET
  lstmtraining: Remove dead code for purified model name
  combine_tessdata: Handle failures when extracting
  lstmtraining: Check write permission for output model
  implement parameter min_characters_to_try for minimum characters to try to skip page entirely. fixes #1729
  Merge and enhance documentation on language and script models
  Document some more config options for tesseract
  Add Makefile rule to build HTML manpages
  ...
2025-01-18 06:30:14 +08:00 · 2018-10-07 15:39:02 +02:00 · 2018-10-07 15:39:02 +02:00 · 8598731daf
commit 8598731daf
parent dcc50a867f 5cf5c80ba1
29 changed files with 253 additions and 146 deletions
--- a/README.md
+++ b/README.md
@@ -12,6 +12,12 @@
 ## About

 This package contains an **OCR engine** - `libtesseract` and a **command line program** - `tesseract`.
+Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
+on line recognition, but also still supports the legacy Tesseract OCR engine of 
+Tesseract 3 which works by recognizing character patterns. Compatibility with 
+Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). 
+It also needs traineddata files which support the legacy engine, for example 
+those from the tessdata repository.

 The lead developer is Ray Smith. The maintainer is Zdenko Podobny.
 For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS)
--- a/2
+++ b/2
@@ -1 +1 @@
-4.0.0-beta.4
+4.0.0-rc1
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -6,7 +6,7 @@ asciidoc=asciidoc -d manpage


 man_MANS = \
-   combine_lang_model.1 \
+  combine_lang_model.1 \
  combine_tessdata.1  \
  dawg2wordlist.1 \
  lstmeval.1 \
@@ -31,9 +31,16 @@ endif

 EXTRA_DIST = $(man_MANS) Doxyfile

+.PHONY: html
+
+html: $(patsubst %,%.html,$(man_MANS))
+
 %: %.asc
 	$(asciidoc) -o $@ $<

+%.html: %.asc
+	asciidoc -b html5 -o $@ $<
+
 MAINTAINERCLEANFILES = $(man_MANS) Doxyfile

 endif
--- a/doc/tesseract.1.asc
+++ b/doc/tesseract.1.asc
@@ -34,7 +34,9 @@ IN/OUT ARGUMENTS

 'outputbase'::
 	The basename of the output file (to which the appropriate extension
-	will be appended).  By default the output will be named 'outbase.txt'.
+	will be appended).  By default the output will be a text file
+	with `.txt` added to the basename unless there are one or more
+	'configfile' options which explicitly specify the desired output.

 'stdout'::
 	Instruction to sent output data to standard output
@@ -88,10 +90,21 @@ OPTIONS
 	contains a list of variables and their values, one per line, with a
 	space separating variable from value.  Interesting config files
 	include: +
-  * hocr - Output in hOCR format instead of as a text file.
-  * pdf  - Output in pdf instead of a text file.
+  * `hocr` - Output in hOCR format (file extension `.hocr`).
+  * `pdf` - Output PDF (file extension `.pdf`).
+  * `tsv` - Output TSV (file extension `.tsv`).
+  * `txt` - Output plain text (file extension `.txt`).
+  * `get.images` - Write images.
+  * `logfile` - Write debug file `tesseract.log`.
+  * `lstm.train` - Used for LSTM training.
+  * `makebox` - Output box file.
+  * `quiet` - Write debug file to /dev/null.

-*Nota Bene:*   The options '-l lang' and '--psm N' must occur
+It is possible to select several config files, for example
+`tesseract image.png demo hocr pdf txt` will create three output files
+`demo.hocr`, `demo.pdf` and `demo.txt` with the OCR results.
+
+*Nota Bene:*   The options `-l lang` and `--psm N` must occur
 before any 'configfile'.


@@ -110,19 +123,35 @@ SINGLE OPTIONS
 	Returns the current version of the tesseract(1) executable.

 '--list-langs'::
-	List available languages for tesseract engine. Can be used with --tessdata-dir.
+	List available languages for tesseract engine. Can be used with `--tessdata-dir`.

 '--print-parameters'::
 	Print tesseract parameters.



-LANGUAGES
---------
+LANGUAGES AND SCRIPTS
+---------------------

-The currently available traineddata files for tesseract 4.0
-for the following languages are in
-(in https://github.com/tesseract-ocr/tessdata_fast):
+To recognize some text with Tesseract, it is normally necessary to specify
+the language(s) or script of the text (unless it is English text which is
+supported by default) using `-l lang`.
+
+Selecting a language automatically also selects the language specific
+character set and dictionary (word list).
+
+Selecting a script typically selects all characters of that script
+which can be from different languages. The dictionary which is included
+also contains a mix from different languages.
+In most cases, a script also supports English.
+So it is possible to recognize a language that has not been specifically
+trained for by using traineddata for the script it is written in.
+
+https://github.com/tesseract-ocr/tessdata_fast provides fast language and
+script models which are also part of Linux distributions.
+
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following languages:

 *afr* (Afrikaans),
 *amh* (Amharic),
@@ -245,17 +274,10 @@ for the following languages are in
 To use a non-standard language pack named *foo.traineddata*, set the
 *TESSDATA_PREFIX* environment variable so the file can be found at
 *TESSDATA_PREFIX*/tessdata/*foo*.traineddata and give Tesseract the
-argument '-l foo'.
+argument `-l foo`.

-SCRIPTS
-------
-
-The traineddata files for the following scripts for tesseract 4.0
-are also in https://github.com/tesseract-ocr/tessdata_fast.
-
-In most cases, each of these contains all the languages that use that script PLUS English.
-So it is possible to recognize a language that has not been specifically trained for
-by using traineddata for the script it is written in.
+For Tesseract 4, `tessdata_fast` includes traineddata files for the
+following scripts:

 Arabic,
 Armenian,
@@ -295,6 +317,18 @@ Thai,
 Tibetan,
 Vietnamese.

+The same languages and scripts are available from
+https://github.com/tesseract-ocr/tessdata_best.
+`tessdata_best` provides slow language and script models.
+These models are needed for training. They also can give better OCR results,
+but the recognition takes much more time.
+
+Both `tessdata_fast` and `tessdata_best` only support the LSTM OCR engine.
+
+There is a third repository, https://github.com/tesseract-ocr/tessdata,
+with models which support both the Tesseract 3 legacy OCR engine and the
+Tesseract 4 LSTM OCR engine.
+

 CONFIG FILES AND AUGMENTING WITH USER DATA
 ------------------------------------------
@@ -364,18 +398,29 @@ scripts are now included to allow anyone to reproduce some of these tests.
 See <https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract> for more
 details.

-Tesseract 3.00 adds a number of new languages, including Chinese, Japanese,
-and Korean. It also introduces a new, single-file based system of managing
+Tesseract 3.00 added a number of new languages, including Chinese, Japanese,
+and Korean. It also introduced a new, single-file based system of managing
 language data.

-Tesseract 3.02 adds BiDirectional text support, the ability to recognize
+Tesseract 3.02 added BiDirectional text support, the ability to recognize
 multiple languages in a single image, and improved layout analysis.

-For further details, see the file ReleaseNotes included with the distribution.
+Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
+on line recognition, but also still supports the legacy Tesseract OCR engine of
+Tesseract 3 which works by recognizing character patterns. Compatibility with
+Tesseract 3 is enabled by `--oem 0`. This also needs traineddata files which
+support the legacy engine, for example those from the tessdata repository
+(https://github.com/tesseract-ocr/tessdata).
+
+For further details, see the release notes in the Tesseract wiki
+(<https://github.com/tesseract-ocr/tesseract/wiki/ReleaseNotes>).
+

 RESOURCES
 ---------
 Main web site: <https://github.com/tesseract-ocr> +
+User forum: <http://groups.google.com/group/tesseract-ocr> +
+Wiki: <https://github.com/tesseract-ocr/tesseract/wiki> +
 Information on training: <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>

 SEE ALSO
@@ -396,6 +441,9 @@ Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith Unnikrishnan, Raquel
 Romano, Ray Smith, Rika Antonova, Robert Moss, Samuel Charron, Sheelagh
 Lloyd, Shobhit Saxena, and Thomas Kielbus.

+For a list of contributors see
+<https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS>.
+
 COPYING
 -------
 Licensed under the Apache License, Version 2.0
--- a/src/api/baseapi.cpp
+++ b/src/api/baseapi.cpp
@@ -1130,6 +1130,15 @@ bool TessBaseAPI::ProcessPagesInternal(const char* filename,
    buf.assign((std::istreambuf_iterator<char>(std::cin)),
               (std::istreambuf_iterator<char>()));
    data = reinterpret_cast<const l_uint8 *>(buf.data());
+  } else {
+    // Check whether the input file can be read.
+    if (FILE* file = fopen(filename, "rb")) {
+      fclose(file);
+    } else {
+      fprintf(stderr, "Error, cannot read input file %s: %s\n",
+              filename, strerror(errno));
+      return false;
+    }
  }

  // Here is our autodetection
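The readability check added above can be tried in isolation. The following is a minimal sketch, assuming a free function named `CanReadInputFile` (a hypothetical helper, not part of the Tesseract API); it mirrors the `fopen`/`strerror(errno)` pattern from the hunk.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not part of Tesseract): returns true if `filename`
// can be opened for binary reading. On failure it reports the system error
// text, as the new code in ProcessPagesInternal does.
bool CanReadInputFile(const char* filename) {
  if (FILE* file = std::fopen(filename, "rb")) {
    std::fclose(file);
    return true;
  }
  std::fprintf(stderr, "Error, cannot read input file %s: %s\n",
               filename, std::strerror(errno));
  return false;
}
```

Compared with the check removed from `main()`, this variant opens the file in binary mode and includes the reason for the failure in the message.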
--- a/src/api/baseapi.h
+++ b/src/api/baseapi.h
@@ -75,6 +75,7 @@ class Trie;
 class Wordrec;

 typedef int (Dict::*DictFunc)(void* void_dawg_args,
+                              const UNICHARSET& unicharset,
                              UNICHAR_ID unichar_id, bool word_end) const;
 typedef double (Dict::*ProbabilityInContextFunc)(const char* lang,
                                                 const char* context,
--- a/src/api/tesseractmain.cpp
+++ b/src/api/tesseractmain.cpp
@@ -367,11 +367,15 @@ static void ParseArgs(const int argc, char** argv, const char** lang,
  *arg_i = i;

  if (*pagesegmode == tesseract::PSM_OSD_ONLY) {
-    // That mode requires osd.traineddata, no other language or script files.
+    // OSD = orientation and script detection.
    if (*lang != nullptr && strcmp(*lang, "osd")) {
-      fprintf(stderr, "Warning, ignoring -l %s for --psm 0\n", *lang);
+      // If the user explicitly specifies a language (other than osd)
+      // or a script, only orientation can be detected.
+      fprintf(stderr, "Warning, detects only orientation with -l %s\n", *lang);
+    } else {
+      // That mode requires osd.traineddata to detect orientation and script.
+      *lang = "osd";
    }
-    *lang = "osd";
  }

  if (*outputbase == nullptr && noocr == false) {
@@ -530,13 +534,6 @@ int main(int argc, char** argv) {
    return EXIT_SUCCESS;
  }

-  if (FILE* file = fopen(image, "r")) {
-    fclose(file);
-  } else {
-    fprintf(stderr, "Cannot open input file: %s\n", image);
-    return EXIT_FAILURE;
-  }
-
  FixPageSegMode(&api, pagesegmode);

  if (dpi) {
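The changed language handling for `--psm 0` in `ParseArgs` can be summarized as a small function. This is a sketch only: `SelectOsdLanguage` is a hypothetical name, and the real code writes through `*lang` instead of returning a value.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical stand-in for the OSD-only branch of ParseArgs.
// `lang` is the value given with -l, or nullptr if -l was not used.
const char* SelectOsdLanguage(const char* lang) {
  if (lang != nullptr && std::strcmp(lang, "osd") != 0) {
    // An explicit language or script is kept; only orientation is detected.
    std::fprintf(stderr, "Warning, detects only orientation with -l %s\n",
                 lang);
    return lang;
  }
  // No language, or "osd" itself: use osd.traineddata for
  // orientation and script detection.
  return "osd";
}
```

Before this change the explicit language was always replaced by `osd` after a warning; now it is preserved.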
--- a/src/ccmain/osdetect.cpp
+++ b/src/ccmain/osdetect.cpp
@@ -36,9 +36,6 @@
 #include <algorithm>
 #include <memory>

-const int kMinCharactersToTry = 50;
-const int kMaxCharactersToTry = 5 * kMinCharactersToTry;
-
 const float kSizeRatioToReject = 2.0;
 const int kMinAcceptableBlobHeight = 10;

@@ -278,6 +275,8 @@ int os_detect_blobs(const GenericVector<int>* allowed_scripts,
                    BLOBNBOX_CLIST* blob_list, OSResults* osr,
                    tesseract::Tesseract* tess) {
  OSResults osr_;
+  int minCharactersToTry = tess->min_characters_to_try;
+  int maxCharactersToTry = 5 * minCharactersToTry;
  if (osr == nullptr)
    osr = &osr_;

@@ -286,13 +285,13 @@ int os_detect_blobs(const GenericVector<int>* allowed_scripts,
  ScriptDetector s(allowed_scripts, osr, tess);

  BLOBNBOX_C_IT filtered_it(blob_list);
-  int real_max = std::min(filtered_it.length(), kMaxCharactersToTry);
+  int real_max = std::min(filtered_it.length(), maxCharactersToTry);
  // tprintf("Total blobs found = %d\n", blobs_total);
  // tprintf("Number of blobs post-filtering = %d\n", filtered_it.length());
  // tprintf("Number of blobs to try = %d\n", real_max);

  // If there are too few characters, skip this page entirely.
-  if (real_max < kMinCharactersToTry / 2) {
+  if (real_max < minCharactersToTry / 2) {
    tprintf("Too few characters. Skipping this page\n");
    return 0;
  }
@@ -307,7 +306,7 @@ int os_detect_blobs(const GenericVector<int>* allowed_scripts,
  int num_blobs_evaluated = 0;
  for (int i = 0; i < real_max; ++i) {
    if (os_detect_blob(blobs[sequence.GetVal()], &o, &s, osr, tess)
-        && i > kMinCharactersToTry) {
+        && i > minCharactersToTry) {
      break;
    }
    ++num_blobs_evaluated;
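The thresholds that `os_detect_blobs` now derives from `min_characters_to_try` can be worked through with a small sketch. `OsdLimits` and `ComputeOsdLimits` are hypothetical names introduced for illustration; the arithmetic follows the hunks above.

```cpp
#include <algorithm>

// Hypothetical summary of the limits used during OSD.
struct OsdLimits {
  int max_blobs_to_try;  // upper bound on the number of blobs evaluated
  bool skip_page;        // true if the page has too few blobs to try
};

OsdLimits ComputeOsdLimits(int num_filtered_blobs, int min_characters_to_try) {
  // The maximum is five times the minimum, as in the original constants.
  const int max_characters_to_try = 5 * min_characters_to_try;
  const int real_max = std::min(num_filtered_blobs, max_characters_to_try);
  // A page is skipped when fewer than half the minimum are available.
  return {real_max, real_max < min_characters_to_try / 2};
}
```

With the default of 50, a page with 20 filtered blobs is skipped (20 < 25). Lowering the parameter to 10 makes the same page eligible, which is presumably the use case behind #1729. On the command line this would likely be set with something like `-c min_characters_to_try=10`.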
--- a/src/ccmain/pgedit.cpp
+++ b/src/ccmain/pgedit.cpp
@@ -100,15 +100,15 @@ enum ColorationMode {
 *
 */

-ScrollView* image_win;
-ParamsEditor* pe;
-bool stillRunning = false;
+static ScrollView* image_win;
+static ParamsEditor* pe;
+static bool stillRunning = false;

-ScrollView* bln_word_window = nullptr;       // baseline norm words
+static ScrollView* bln_word_window = nullptr; // baseline norm words

-CMD_EVENTS mode = CHANGE_DISP_CMD_EVENT;  // selected words op
+static CMD_EVENTS mode = CHANGE_DISP_CMD_EVENT; // selected words op

-bool recog_done = false;                  // recog_all_words was called
+static bool recog_done = false; // recog_all_words was called

 // These variables should remain global, since they are only used for the
 // debug mode (in which only a single Tesseract thread/instance will exist).
--- a/src/ccmain/tesseractclass.cpp
+++ b/src/ccmain/tesseractclass.cpp
@@ -397,6 +397,9 @@ Tesseract::Tesseract()
      INT_MEMBER(jpg_quality, 85, "Set JPEG quality level", this->params()),
      INT_MEMBER(user_defined_dpi, 0, "Specify DPI for input image",
                 this->params()),
+      INT_MEMBER(min_characters_to_try, 50,
+                 "Specify minimum characters to try during OSD",
+                 this->params()),
      STRING_MEMBER(unrecognised_char, "|",
                    "Output char for unidentified blobs", this->params()),
      INT_MEMBER(suspect_level, 99, "Suspect marker level", this->params()),
--- a/src/ccmain/tesseractclass.h
+++ b/src/ccmain/tesseractclass.h
@@ -1043,6 +1043,8 @@ class Tesseract : public Wordrec {
             "Create PDF with only one invisible text layer");
  INT_VAR_H(jpg_quality, 85, "Set JPEG quality level");
  INT_VAR_H(user_defined_dpi, 0, "Specify DPI for input image");
+  INT_VAR_H(min_characters_to_try, 50,
+            "Specify minimum characters to try during OSD");
  STRING_VAR_H(unrecognised_char, "|",
               "Output char for unidentified blobs");
  INT_VAR_H(suspect_level, 99, "Suspect marker level");
--- a/src/ccstruct/matrix.h
+++ b/src/ccstruct/matrix.h
@@ -365,7 +365,7 @@ class GENERIC_2D_ARRAY {
  }

  // Accumulates the element-wise sums of squares of src into *this.
-  void SumSquares(const GENERIC_2D_ARRAY<T>& src, T decay_factor) {
+  void SumSquares(const GENERIC_2D_ARRAY<T>& src, const T& decay_factor) {
    T update_factor = 1.0 - decay_factor;
    int size = num_elements();
    for (int i = 0; i < size; ++i) {
@@ -377,7 +377,7 @@ class GENERIC_2D_ARRAY {
  // Scales each element using the adam algorithm, ie array_[i] by
  // sqrt(sqsum[i] + epsilon)).
  void AdamUpdate(const GENERIC_2D_ARRAY<T>& sum,
-                  const GENERIC_2D_ARRAY<T>& sqsum, T epsilon) {
+                  const GENERIC_2D_ARRAY<T>& sqsum, const T& epsilon) {
    int size = num_elements();
    for (int i = 0; i < size; ++i) {
      array_[i] += sum.array_[i] / (sqrt(sqsum.array_[i]) + epsilon);
--- a/src/ccstruct/pageres.h
+++ b/src/ccstruct/pageres.h
@@ -363,10 +363,10 @@ class WERD_RES : public ELIST_LINK {
        blob_index >= best_choice->length())
      return nullptr;
    UNICHAR_ID id = best_choice->unichar_id(blob_index);
-    if (id < 0 || id >= uch_set->size() || id == INVALID_UNICHAR_ID)
+    if (id < 0 || id >= uch_set->size())
      return nullptr;
    UNICHAR_ID mirrored = uch_set->get_mirror(id);
-    if (in_rtl_context && mirrored > 0 && mirrored != INVALID_UNICHAR_ID)
+    if (in_rtl_context && mirrored > 0)
      id = mirrored;
    return uch_set->id_to_unichar_ext(id);
  }
@@ -375,7 +375,7 @@ class WERD_RES : public ELIST_LINK {
    if (blob_index < 0 || blob_index >= raw_choice->length())
      return nullptr;
    UNICHAR_ID id = raw_choice->unichar_id(blob_index);
-    if (id < 0 || id >= uch_set->size() || id == INVALID_UNICHAR_ID)
+    if (id < 0 || id >= uch_set->size())
      return nullptr;
    return uch_set->id_to_unichar(id);
  }
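Why the removed comparisons were constant can be shown with a short sketch. It assumes that `INVALID_UNICHAR_ID` is negative (-1), which is my understanding of Tesseract's definition but is not shown in this diff; `kInvalidUnicharId` and `IsValidId` are hypothetical names.

```cpp
// Assumption: the invalid id is -1, so it is already rejected by `id < 0`.
constexpr int kInvalidUnicharId = -1;

// Hypothetical stand-in for the simplified range check in WERD_RES.
constexpr bool IsValidId(int id, int set_size) {
  return !(id < 0 || id >= set_size);
}

// The invalid id fails the range check without a separate comparison.
static_assert(!IsValidId(kInvalidUnicharId, 100),
              "invalid id is covered by the id < 0 test");
```

The same reasoning applies to `mirrored > 0`: a positive value can never equal a negative sentinel.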
--- a/src/ccstruct/rect.h
+++ b/src/ccstruct/rect.h
@@ -21,7 +21,7 @@
 #define RECT_H

 #include <algorithm>           // for std::max, std::min
-#include <cmath>               // for ceil, floor
+#include <cmath>               // for std::ceil, std::floor
 #include <cstdint>             // for INT16_MAX
 #include <cstdio>              // for FILE
 #include "platform.h"          // for DLLSYM
@@ -162,29 +162,33 @@ class DLLSYM TBOX  {  // bounding box

    void move(                     // move box
              const FCOORD vec) {  // by float vector
-      bot_left.set_x ((int16_t) floor (bot_left.x () + vec.x ()));
+      bot_left.set_x(static_cast<int16_t>(std::floor(bot_left.x() + vec.x())));
      // round left
-      bot_left.set_y ((int16_t) floor (bot_left.y () + vec.y ()));
+      bot_left.set_y(static_cast<int16_t>(std::floor(bot_left.y() + vec.y())));
      // round down
-      top_right.set_x ((int16_t) ceil (top_right.x () + vec.x ()));
+      top_right.set_x(static_cast<int16_t>(std::ceil(top_right.x() + vec.x())));
      // round right
-      top_right.set_y ((int16_t) ceil (top_right.y () + vec.y ()));
+      top_right.set_y(static_cast<int16_t>(std::ceil(top_right.y() + vec.y())));
      // round up
    }

    void scale(                  // scale box
               const float f) {  // by multiplier
-      bot_left.set_x ((int16_t) floor (bot_left.x () * f));  // round left
-      bot_left.set_y ((int16_t) floor (bot_left.y () * f));  // round down
-      top_right.set_x ((int16_t) ceil (top_right.x () * f));  // round right
-      top_right.set_y ((int16_t) ceil (top_right.y () * f));  // round up
+      // round left
+      bot_left.set_x(static_cast<int16_t>(std::floor(bot_left.x() * f)));
+      // round down
+      bot_left.set_y(static_cast<int16_t>(std::floor(bot_left.y() * f)));
+      // round right
+      top_right.set_x(static_cast<int16_t>(std::ceil(top_right.x() * f)));
+      // round up
+      top_right.set_y(static_cast<int16_t>(std::ceil(top_right.y() * f)));
    }
    void scale(                     // scale box
               const FCOORD vec) {  // by float vector
-      bot_left.set_x ((int16_t) floor (bot_left.x () * vec.x ()));
-      bot_left.set_y ((int16_t) floor (bot_left.y () * vec.y ()));
-      top_right.set_x ((int16_t) ceil (top_right.x () * vec.x ()));
-      top_right.set_y ((int16_t) ceil (top_right.y () * vec.y ()));
+      bot_left.set_x(static_cast<int16_t>(std::floor(bot_left.x() * vec.x())));
+      bot_left.set_y(static_cast<int16_t>(std::floor(bot_left.y() * vec.y())));
+      top_right.set_x(static_cast<int16_t>(std::ceil(top_right.x() * vec.x())));
+      top_right.set_y(static_cast<int16_t>(std::ceil(top_right.y() * vec.y())));
    }

    // rotate doesn't enlarge the box - it just rotates the bottom-left
@@ -314,8 +318,10 @@ class DLLSYM TBOX  {  // bounding box
 inline TBOX::TBOX(   // constructor
    const FCOORD pt  // floating centre
    ) {
-  bot_left = ICOORD ((int16_t) floor (pt.x ()), (int16_t) floor (pt.y ()));
-  top_right = ICOORD ((int16_t) ceil (pt.x ()), (int16_t) ceil (pt.y ()));
+  bot_left = ICOORD(static_cast<int16_t>(std::floor(pt.x())),
+                    static_cast<int16_t>(std::floor(pt.y())));
+  top_right = ICOORD(static_cast<int16_t>(std::ceil(pt.x())),
+                     static_cast<int16_t>(std::ceil(pt.y())));
 }
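The point of switching to `std::floor` and `std::ceil` is that they have `float` overloads, whereas an unqualified `floor` may resolve to the C function taking `double`. A minimal sketch, with `IntBox` and `BoxAround` as hypothetical stand-ins for `TBOX` and its `FCOORD` constructor:

```cpp
#include <cmath>
#include <cstdint>
#include <type_traits>

// With a float argument the float overload is chosen, so no promotion
// to double takes place.
static_assert(std::is_same<decltype(std::floor(1.5f)), float>::value,
              "std::floor(float) returns float");

// Hypothetical stand-in for TBOX: an integer box enclosing a float point.
struct IntBox {
  int16_t left, bottom, right, top;
};

IntBox BoxAround(float x, float y) {
  // Bottom-left rounds down and top-right rounds up, so the box never
  // shrinks below the point it encloses.
  return {static_cast<int16_t>(std::floor(x)),
          static_cast<int16_t>(std::floor(y)),
          static_cast<int16_t>(std::ceil(x)),
          static_cast<int16_t>(std::ceil(y))};
}
```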


--- a/src/ccutil/genericvector.h
+++ b/src/ccutil/genericvector.h
@@ -39,7 +39,7 @@ class GenericVector {
  GenericVector() {
    init(kDefaultVectorSize);
  }
-  GenericVector(int size, T init_val) {
+  GenericVector(int size, const T& init_val) {
    init(size);
    init_to_size(size, init_val);
  }
@@ -60,7 +60,7 @@ class GenericVector {
  void double_the_size();

  // Resizes to size and sets all values to t.
-  void init_to_size(int size, T t);
+  void init_to_size(int size, const T& t);
  // Resizes to size without any initialization.
  void resize_no_init(int size) {
    reserve(size);
@@ -101,31 +101,31 @@ class GenericVector {
  // Return the index of the T object.
  // This method NEEDS a compare_callback to be passed to
  // set_compare_callback.
-  int get_index(T object) const;
+  int get_index(const T& object) const;

  // Return true if T is in the array
-  bool contains(T object) const;
+  bool contains(const T& object) const;

  // Return true if the index is valid
  T contains_index(int index) const;

  // Push an element in the end of the array
  int push_back(T object);
-  void operator+=(T t);
+  void operator+=(const T& t);

  // Push an element in the end of the array if the same
  // element is not already contained in the array.
-  int push_back_new(T object);
+  int push_back_new(const T& object);

  // Push an element in the front of the array
  // Note: This function is O(n)
-  int push_front(T object);
+  int push_front(const T& object);

  // Set the value at the given index
-  void set(T t, int index);
+  void set(const T& t, int index);

  // Insert t at the given index, push other elements to the right.
-  void insert(T t, int index);
+  void insert(const T& t, int index);

  // Removes an element at the given index and
  // shifts the remaining elements to the left.
@@ -705,7 +705,7 @@ void GenericVector<T>::double_the_size() {

 // Resizes to size and sets all values to t.
 template <typename T>
-void GenericVector<T>::init_to_size(int size, T t) {
+void GenericVector<T>::init_to_size(int size, const T& t) {
  reserve(size);
  size_used_ = size;
  for (int i = 0; i < size; ++i)
@@ -740,7 +740,7 @@ T GenericVector<T>::pop_back() {

 // Return the object from an index.
 template <typename T>
-void GenericVector<T>::set(T t, int index) {
+void GenericVector<T>::set(const T& t, int index) {
  assert(index >= 0 && index < size_used_);
  data_[index] = t;
 }
@@ -749,7 +749,7 @@ void GenericVector<T>::set(T t, int index) {
 // space for the new elements and inserts the given element
 // at the specified index.
 template <typename T>
-void GenericVector<T>::insert(T t, int index) {
+void GenericVector<T>::insert(const T& t, int index) {
  assert(index >= 0 && index <= size_used_);
  if (size_reserved_ == size_used_)
    double_the_size();
@@ -779,7 +779,7 @@ T GenericVector<T>::contains_index(int index) const {

 // Return the index of the T object.
 template <typename T>
-int GenericVector<T>::get_index(T object) const {
+int GenericVector<T>::get_index(const T& object) const {
  for (int i = 0; i < size_used_; ++i) {
    assert(compare_cb_ != nullptr);
    if (compare_cb_->Run(object, data_[i]))
@@ -790,7 +790,7 @@ int GenericVector<T>::get_index(T object) const {

 // Return true if T is in the array
 template <typename T>
-bool GenericVector<T>::contains(T object) const {
+bool GenericVector<T>::contains(const T& object) const {
  return get_index(object) != -1;
 }

@@ -806,7 +806,7 @@ int GenericVector<T>::push_back(T object) {
 }

 template <typename T>
-int GenericVector<T>::push_back_new(T object) {
+int GenericVector<T>::push_back_new(const T& object) {
  int index = get_index(object);
  if (index >= 0)
    return index;
@@ -815,7 +815,7 @@ int GenericVector<T>::push_back_new(T object) {

 // Add an element in the array (front)
 template <typename T>
-int GenericVector<T>::push_front(T object) {
+int GenericVector<T>::push_front(const T& object) {
  if (size_used_ == size_reserved_)
    double_the_size();
  for (int i = size_used_; i > 0; --i)
@@ -826,7 +826,7 @@ int GenericVector<T>::push_front(T object) {
 }

 template <typename T>
-void GenericVector<T>::operator+=(T t) {
+void GenericVector<T>::operator+=(const T& t) {
  push_back(t);
 }

@@ -866,15 +866,14 @@ void GenericVector<T>::set_compare_callback(
 // Clear the array, calling the callback function if any.
 template <typename T>
 void GenericVector<T>::clear() {
-  if (size_reserved_ > 0) {
-    if (clear_cb_ != nullptr)
-      for (int i = 0; i < size_used_; ++i)
-        clear_cb_->Run(data_[i]);
-    delete[] data_;
-    data_ = nullptr;
-    size_used_ = 0;
-    size_reserved_ = 0;
+  if (size_reserved_ > 0 && clear_cb_ != nullptr) {
+    for (int i = 0; i < size_used_; ++i)
+      clear_cb_->Run(data_[i]);
  }
+  delete[] data_;
+  data_ = nullptr;
+  size_used_ = 0;
+  size_reserved_ = 0;
  delete clear_cb_;
  clear_cb_ = nullptr;
  delete compare_cb_;
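The rewritten `clear()` relies on `delete[]` being a no-op for a null pointer, so the buffer can be released unconditionally. A minimal sketch of that pattern, using a hypothetical `IntBuffer` class in place of `GenericVector` and omitting the callbacks:

```cpp
// Hypothetical minimal buffer illustrating an unconditional clear().
class IntBuffer {
 public:
  explicit IntBuffer(int size)
      : data_(size > 0 ? new int[size] : nullptr),
        size_(size > 0 ? size : 0) {}
  ~IntBuffer() { clear(); }
  IntBuffer(const IntBuffer&) = delete;
  IntBuffer& operator=(const IntBuffer&) = delete;

  // Safe to call repeatedly and on an empty buffer.
  void clear() {
    delete[] data_;  // no effect when data_ is nullptr
    data_ = nullptr;
    size_ = 0;
  }

  int size() const { return size_; }
  bool empty() const { return data_ == nullptr; }

 private:
  int* data_;
  int size_;
};
```

Resetting the members on every path also leaves the object in a known state, which is presumably what the static code analyzer was checking for.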
--- a/src/classify/intproto.cpp
+++ b/src/classify/intproto.cpp
@@ -20,7 +20,7 @@
 -----------------------------------------------------------------------------*/

 #include <algorithm>
-#include <cmath>
+#include <cmath>           // for std::floor
 #include <cstdio>
 #include <cassert>

@@ -117,7 +117,7 @@ FILL_SPEC;
 #define CircularIncrement(i,r)  (((i) < (r) - 1)?((i)++):((i) = 0))

 /** macro for mapping floats to ints without bounds checking */
-#define MapParam(P,O,N)   (floor (((P) + (O)) * (N)))
+#define MapParam(P,O,N)   (std::floor(((P) + (O)) * (N)))

 /*---------------------------------------------------------------------------
            Private Function Prototypes
@@ -1205,11 +1205,11 @@ void FillPPCircularBits(uint32_t ParamTable[NUM_PP_BUCKETS][WERDS_PER_PP_VECTOR]
  if (Spread > 0.5)
    Spread = 0.5;

-  FirstBucket = (int) floor ((Center - Spread) * NUM_PP_BUCKETS);
+  FirstBucket = static_cast<int>(std::floor((Center - Spread) * NUM_PP_BUCKETS));
  if (FirstBucket < 0)
    FirstBucket += NUM_PP_BUCKETS;

-  LastBucket = (int) floor ((Center + Spread) * NUM_PP_BUCKETS);
+  LastBucket = static_cast<int>(std::floor((Center + Spread) * NUM_PP_BUCKETS));
  if (LastBucket >= NUM_PP_BUCKETS)
    LastBucket -= NUM_PP_BUCKETS;
  if (debug) tprintf("Circular fill from %d to %d", FirstBucket, LastBucket);
@@ -1243,11 +1243,11 @@ void FillPPLinearBits(uint32_t ParamTable[NUM_PP_BUCKETS][WERDS_PER_PP_VECTOR],
                      int Bit, float Center, float Spread, bool debug) {
  int i, FirstBucket, LastBucket;

-  FirstBucket = (int) floor ((Center - Spread) * NUM_PP_BUCKETS);
+  FirstBucket = static_cast<int>(std::floor((Center - Spread) * NUM_PP_BUCKETS));
  if (FirstBucket < 0)
    FirstBucket = 0;

-  LastBucket = (int) floor ((Center + Spread) * NUM_PP_BUCKETS);
+  LastBucket = static_cast<int>(std::floor((Center + Spread) * NUM_PP_BUCKETS));
  if (LastBucket >= NUM_PP_BUCKETS)
    LastBucket = NUM_PP_BUCKETS - 1;

@@ -1736,7 +1736,7 @@ int TruncateParam(float Param, int Min, int Max, char *Id) {
              Id, Param, Max);
    Param = Max;
  }
-  return static_cast<int>(floor(Param));
+  return static_cast<int>(std::floor(Param));
 }                                /* TruncateParam */
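The two fill functions touched above compute the first bucket the same way and differ only in how an out-of-range index is handled. A sketch of that difference, assuming a bucket count of 64 (`kNumBuckets` is a hypothetical constant; the real `NUM_PP_BUCKETS` is defined outside this diff):

```cpp
#include <cmath>

// Hypothetical bucket count, used here only for illustration.
constexpr int kNumBuckets = 64;

// Circular variant: a negative index wraps around to the end of the table.
int FirstBucketCircular(float center, float spread) {
  int bucket = static_cast<int>(std::floor((center - spread) * kNumBuckets));
  if (bucket < 0) bucket += kNumBuckets;
  return bucket;
}

// Linear variant: a negative index is clamped to the first bucket.
int FirstBucketLinear(float center, float spread) {
  int bucket = static_cast<int>(std::floor((center - spread) * kNumBuckets));
  if (bucket < 0) bucket = 0;
  return bucket;
}
```

For a center of 0.0 and a spread of 0.25 the raw index is -16, which becomes 48 in the circular case and 0 in the linear case.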


--- a/src/dict/dict.cpp
+++ b/src/dict/dict.cpp
@@ -32,6 +32,11 @@ Dict::Dict(CCUtil *ccutil)
      probability_in_context_(&tesseract::Dict::def_probability_in_context),
      params_model_classify_(nullptr),
      ccutil_(ccutil),
+      wildcard_unichar_id_(INVALID_UNICHAR_ID),
+      apostrophe_unichar_id_(INVALID_UNICHAR_ID),
+      question_unichar_id_(INVALID_UNICHAR_ID),
+      slash_unichar_id_(INVALID_UNICHAR_ID),
+      hyphen_unichar_id_(INVALID_UNICHAR_ID),
      STRING_MEMBER(user_words_file, "", "A filename of user-provided words.",
                    getCCUtil()->params()),
      STRING_INIT_MEMBER(user_words_suffix, "",
@@ -167,7 +172,6 @@ Dict::Dict(CCUtil *ccutil)
  go_deeper_fxn_ = nullptr;
  hyphen_word_ = nullptr;
  last_word_on_line_ = false;
-  hyphen_unichar_id_ = INVALID_UNICHAR_ID;
  document_words_ = nullptr;
  dawg_cache_ = nullptr;
  dawg_cache_is_ours_ = false;
@@ -361,10 +365,13 @@ void Dict::End() {
 // according to at least one of the dawgs in the dawgs_ vector.
 // See more extensive comments in dict.h where this function is declared.
 int Dict::def_letter_is_okay(void* void_dawg_args,
+                             const UNICHARSET& unicharset,
                             UNICHAR_ID unichar_id,
                             bool word_end) const {
  DawgArgs *dawg_args = static_cast<DawgArgs *>(void_dawg_args);

+  ASSERT_HOST(unicharset.contains_unichar_id(unichar_id));
+
  if (dawg_debug_level >= 3) {
    tprintf("def_letter_is_okay: current unichar=%s word_end=%d"
            " num active dawgs=%d\n",
@@ -410,7 +417,7 @@ int Dict::def_letter_is_okay(void* void_dawg_args,
        for (int s = 0; s < slist.length(); ++s) {
          int sdawg_index = slist[s];
          const Dawg *sdawg = dawgs_[sdawg_index];
-          UNICHAR_ID ch = char_for_dawg(unichar_id, sdawg);
+          UNICHAR_ID ch = char_for_dawg(unicharset, unichar_id, sdawg);
          EDGE_REF dawg_edge = sdawg->edge_char_of(0, ch, word_end);
          if (dawg_edge != NO_EDGE) {
            if (dawg_debug_level >=3) {
@@ -477,7 +484,8 @@ int Dict::def_letter_is_okay(void* void_dawg_args,
    // Find the edge out of the node for the unichar_id.
    NODE_REF node = GetStartingNode(dawg, pos.dawg_ref);
    EDGE_REF edge = (node == NO_EDGE) ? NO_EDGE
-        : dawg->edge_char_of(node, char_for_dawg(unichar_id, dawg), word_end);
+        : dawg->edge_char_of(node, char_for_dawg(unicharset, unichar_id, dawg),
+                             word_end);

    if (dawg_debug_level >= 3) {
      tprintf("Active dawg: [%d, " REFFORMAT "] edge=" REFFORMAT "\n",
@@ -759,7 +767,8 @@ int Dict::valid_word(const WERD_CHOICE &word, bool numbers_ok) const {
  int last_index = word_ptr->length() - 1;
  // Call letter_is_okay for each letter in the word.
  for (int i = hyphen_base_size(); i <= last_index; ++i) {
-    if (!((this->*letter_is_okay_)(&dawg_args, word_ptr->unichar_id(i),
+    if (!((this->*letter_is_okay_)(&dawg_args, *word_ptr->unicharset(),
+                                   word_ptr->unichar_id(i),
                                   i == last_index))) break;
    // Swap active_dawgs, constraints with the corresponding updated vector.
    if (dawg_args.updated_dawgs == &(active_dawgs[1])) {
--- a/src/dict/dict.h
+++ b/src/dict/dict.h
@@ -351,15 +351,17 @@ class Dict {
   */

  //
-  int def_letter_is_okay(void* void_dawg_args,
+  int def_letter_is_okay(void* void_dawg_args, const UNICHARSET& unicharset,
                         UNICHAR_ID unichar_id, bool word_end) const;

  int (Dict::*letter_is_okay_)(void* void_dawg_args,
+                               const UNICHARSET& unicharset,
                               UNICHAR_ID unichar_id, bool word_end) const;
  /// Calls letter_is_okay_ member function.
-  int LetterIsOkay(void* void_dawg_args,
+  int LetterIsOkay(void* void_dawg_args, const UNICHARSET& unicharset,
                   UNICHAR_ID unichar_id, bool word_end) const {
-    return (this->*letter_is_okay_)(void_dawg_args, unichar_id, word_end);
+    return (this->*letter_is_okay_)(void_dawg_args,
+                                    unicharset, unichar_id, word_end);
  }


@@ -428,11 +430,12 @@ class Dict {
  // Given a unichar from a string and a given dawg, return the unichar
  // we should use to match in that dawg type.  (for example, in the number
  // dawg, all numbers are transformed to kPatternUnicharId).
-  inline UNICHAR_ID char_for_dawg(UNICHAR_ID ch, const Dawg *dawg) const {
+  UNICHAR_ID char_for_dawg(const UNICHARSET& unicharset, UNICHAR_ID ch,
+                           const Dawg *dawg) const {
    if (!dawg) return ch;
    switch (dawg->type()) {
      case DAWG_TYPE_NUMBER:
-        return getUnicharset().get_isdigit(ch) ? Dawg::kPatternUnicharID : ch;
+        return unicharset.get_isdigit(ch) ? Dawg::kPatternUnicharID : ch;
      default:
        return ch;
    }
--- a/src/dict/permdawg.cpp
+++ b/src/dict/permdawg.cpp
@ -88,7 +88,7 @@ void Dict::go_deeper_dawg_fxn(
      ++num_unigrams;
      word->append_unichar_id(uch_id, 1, 0.0, 0.0);
      unigrams_ok = (this->*letter_is_okay_)(
-          &unigram_dawg_args,
+          &unigram_dawg_args, *word->unicharset(),
          word->unichar_id(word_index+num_unigrams-1),
          word_ending && i == encoding.size() - 1);
      (*unigram_dawg_args.active_dawgs) = *(unigram_dawg_args.updated_dawgs);
@ -111,7 +111,8 @@ void Dict::go_deeper_dawg_fxn(
  // Check which dawgs from the dawgs_ vector contain the word
  // up to and including the current unichar.
  if (checked_unigrams || (this->*letter_is_okay_)(
-      more_args, word->unichar_id(word_index), word_ending)) {
+      more_args, *word->unicharset(), word->unichar_id(word_index),
+      word_ending)) {
    // Add a new word choice
    if (word_ending) {
      if (dawg_debug_level) {
--- a/src/lstm/recodebeam.cpp
+++ b/src/lstm/recodebeam.cpp
@ -771,7 +771,8 @@ void RecodeBeamSearch::ContinueDawg(int code, int unichar_id, float cert,
    return;  // Can't continue if not a dict word.
  }
  PermuterType permuter = static_cast<PermuterType>(
-      dict_->def_letter_is_okay(&dawg_args, unichar_id, false));
+      dict_->def_letter_is_okay(&dawg_args,
+                                dict_->getUnicharset(), unichar_id, false));
  if (permuter != NO_PERM) {
    PushHeapIfBetter(kBeamWidths[0], code, unichar_id, permuter, false,
                     word_start, dawg_args.valid_end, false, cert, prev,
--- a/src/training/combine_tessdata.cpp
+++ b/src/training/combine_tessdata.cpp
@ -72,7 +72,7 @@ int main(int argc, char **argv) {
  tesseract::TessdataManager tm;
  if (argc > 1 && (!strcmp(argv[1], "-v") || !strcmp(argv[1], "--version"))) {
    printf("%s\n", tesseract::TessBaseAPI::Version());
-    return 0;
+    return EXIT_SUCCESS;
  } else if (argc == 2) {
    printf("Combining tessdata files\n");
    STRING lang = argv[1];
@ -92,16 +92,22 @@ int main(int argc, char **argv) {
    // Initialize TessdataManager with the data in the given traineddata file.
    if (!tm.Init(argv[2])) {
      tprintf("Failed to read %s\n", argv[2]);
-      exit(1);
+      return EXIT_FAILURE;
    }
    printf("Extracting tessdata components from %s\n", argv[2]);
    if (strcmp(argv[1], "-e") == 0) {
      for (i = 3; i < argc; ++i) {
+        errno = 0;
        if (tm.ExtractToFile(argv[i])) {
          printf("Wrote %s\n", argv[i]);
-        } else {
+        } else if (errno == 0) {
          printf("Not extracting %s, since this component"
                 " is not present\n", argv[i]);
+          return EXIT_FAILURE;
+        } else {
+          printf("Error, could not extract %s: %s\n",
+                 argv[i], strerror(errno));
+          return EXIT_FAILURE;
        }
      }
    } else {  // extract all the components
@ -111,8 +117,13 @@ int main(int argc, char **argv) {
        if (*last != '.')
          filename += '.';
        filename += tesseract::kTessdataFileSuffixes[i];
+        errno = 0;
        if (tm.ExtractToFile(filename.string())) {
          printf("Wrote %s\n", filename.string());
+        } else if (errno != 0) {
+          printf("Error, could not extract %s: %s\n",
+                 filename.string(), strerror(errno));
+          return EXIT_FAILURE;
        }
      }
    }
@ -124,7 +135,7 @@ int main(int argc, char **argv) {
    if (rename(new_traineddata_filename, traineddata_filename.string()) != 0) {
      tprintf("Failed to create a temporary file %s\n",
              traineddata_filename.string());
-      exit(1);
+      return EXIT_FAILURE;
    }

    // Initialize TessdataManager with the data in the given traineddata file.
@ -135,17 +146,17 @@ int main(int argc, char **argv) {
  } else if (argc == 3 && strcmp(argv[1], "-c") == 0) {
    if (!tm.Init(argv[2])) {
      tprintf("Failed to read %s\n", argv[2]);
-      exit(1);
+      return EXIT_FAILURE;
    }
    tesseract::TFile fp;
    if (!tm.GetComponent(tesseract::TESSDATA_LSTM, &fp)) {
      tprintf("No LSTM Component found in %s!\n", argv[2]);
-      exit(1);
+      return EXIT_FAILURE;
    }
    tesseract::LSTMRecognizer recognizer;
    if (!recognizer.DeSerialize(&tm, &fp)) {
      tprintf("Failed to deserialize LSTM in %s!\n", argv[2]);
-      exit(1);
+      return EXIT_FAILURE;
    }
    recognizer.ConvertToInt();
    GenericVector<char> lstm_data;
@ -155,7 +166,7 @@ int main(int argc, char **argv) {
                      lstm_data.size());
    if (!tm.SaveFile(argv[2], nullptr)) {
      tprintf("Failed to write modified traineddata:%s!\n", argv[2]);
-      exit(1);
+      return EXIT_FAILURE;
    }
  } else if (argc == 3 && strcmp(argv[1], "-d") == 0) {
    // Initialize TessdataManager with the data in the given traineddata file.
@ -186,4 +197,5 @@ int main(int argc, char **argv) {
    return 1;
  }
  tm.Directory();
+  return EXIT_SUCCESS;
 }
--- a/src/training/lstmtraining.cpp
+++ b/src/training/lstmtraining.cpp
@ -73,22 +73,27 @@ const int kNumPagesPerBatch = 100;
 int main(int argc, char **argv) {
  tesseract::CheckSharedLibraryVersion();
  ParseArguments(&argc, &argv);
-  // Purify the model name in case it is based on the network string.
  if (FLAGS_model_output.empty()) {
    tprintf("Must provide a --model_output!\n");
-    return 1;
+    return EXIT_FAILURE;
  }
  if (FLAGS_traineddata.empty()) {
    tprintf("Must provide a --traineddata see training wiki\n");
-    return 1;
+    return EXIT_FAILURE;
  }
-  STRING model_output = FLAGS_model_output.c_str();
-  for (int i = 0; i < model_output.length(); ++i) {
-    if (model_output[i] == '[' || model_output[i] == ']')
-      model_output[i] = '-';
-    if (model_output[i] == '(' || model_output[i] == ')')
-      model_output[i] = '_';
+
+  // Check write permissions.
+  STRING test_file = FLAGS_model_output.c_str();
+  test_file += "_wtest";
+  FILE* f = fopen(test_file.c_str(), "wb");
+  if (f != nullptr) {
+    fclose(f);
+    remove(test_file.c_str());
+  } else {
+    tprintf("Error, model output cannot be written: %s\n", strerror(errno));
+    return EXIT_FAILURE;
  }
+
  // Setup the trainer.
  STRING checkpoint_file = FLAGS_model_output.c_str();
  checkpoint_file += "_checkpoint";
@ -105,7 +110,7 @@ int main(int argc, char **argv) {
    if (!trainer.TryLoadingCheckpoint(FLAGS_continue_from.c_str(), nullptr)) {
      tprintf("Failed to read continue from: %s\n",
              FLAGS_continue_from.c_str());
-      return 1;
+      return EXIT_FAILURE;
    }
    if (FLAGS_debug_network) {
      trainer.DebugNetwork();
@ -116,20 +121,20 @@ int main(int argc, char **argv) {
                FLAGS_model_output.c_str());
      }
    }
-    return 0;
+    return EXIT_SUCCESS;
  }

  // Get the list of files to process.
  if (FLAGS_train_listfile.empty()) {
    tprintf("Must supply a list of training filenames! --train_listfile\n");
-    return 1;
+    return EXIT_FAILURE;
  }
  GenericVector<STRING> filenames;
  if (!tesseract::LoadFileLinesToStrings(FLAGS_train_listfile.c_str(),
                                         &filenames)) {
    tprintf("Failed to load list of training filenames from %s\n",
            FLAGS_train_listfile.c_str());
-    return 1;
+    return EXIT_FAILURE;
  }

  // Checkpoints always take priority if they are available.
@ -145,7 +150,7 @@ int main(int argc, char **argv) {
                                            ? FLAGS_continue_from.c_str()
                                            : FLAGS_old_traineddata.c_str())) {
        tprintf("Failed to continue from: %s\n", FLAGS_continue_from.c_str());
-        return 1;
+        return EXIT_FAILURE;
      }
      tprintf("Continuing from %s\n", FLAGS_continue_from.c_str());
      trainer.InitIterations();
@ -155,7 +160,7 @@ int main(int argc, char **argv) {
        tprintf("Appending a new network to an old one!!");
        if (FLAGS_continue_from.empty()) {
          tprintf("Must set --continue_from for appending!\n");
-          return 1;
+          return EXIT_FAILURE;
        }
      }
      // We are initializing from scratch.
@ -165,7 +170,7 @@ int main(int argc, char **argv) {
                               FLAGS_adam_beta)) {
        tprintf("Failed to create network from spec: %s\n",
                FLAGS_net_spec.c_str());
-        return 1;
+        return EXIT_FAILURE;
      }
      trainer.set_perfect_delay(FLAGS_perfect_sample_delay);
    }
@ -176,7 +181,7 @@ int main(int argc, char **argv) {
                                       : tesseract::CS_ROUND_ROBIN,
                                   FLAGS_randomly_rotate)) {
    tprintf("Load of images failed!!\n");
-    return 1;
+    return EXIT_FAILURE;
  }

  tesseract::LSTMTester tester(static_cast<int64_t>(FLAGS_max_image_MB) *
@ -186,7 +191,7 @@ int main(int argc, char **argv) {
    if (!tester.LoadAllEvalData(FLAGS_eval_listfile.c_str())) {
      tprintf("Failed to load eval data from: %s\n",
              FLAGS_eval_listfile.c_str());
-      return 1;
+      return EXIT_FAILURE;
    }
    tester_callback =
        NewPermanentTessCallback(&tester, &tesseract::LSTMTester::RunEvalAsync);
@ -208,5 +213,5 @@ int main(int argc, char **argv) {
            FLAGS_max_iterations == 0));
  delete tester_callback;
  tprintf("Finished! Error rate = %g\n", trainer.best_error_rate());
-  return 0;
+  return EXIT_SUCCESS;
 } /* main */
--- a/src/training/tesstrain_utils.sh
+++ b/src/training/tesstrain_utils.sh
@ -186,7 +186,11 @@ parse_flags() {

    # Location where intermediate files will be created.
    TIMESTAMP=`date +%Y-%m-%d`
+if [ "$(uname)" == "Darwin" ];then
+    TMP_DIR=$(mktemp -d -t ${LANG_CODE}-${TIMESTAMP}.XXX )
+else
    TMP_DIR=$(mktemp -d --tmpdir ${LANG_CODE}-${TIMESTAMP}.XXX )
+fi
    TRAINING_DIR=${TMP_DIR}
    # Location of log file for the whole run.
    LOG_FILE=${TRAINING_DIR}/tesstrain.log
--- a/src/wordrec/chop.cpp
+++ b/src/wordrec/chop.cpp
@ -89,7 +89,6 @@ int Wordrec::angle_change(EDGEPT *point1, EDGEPT *point2, EDGEPT *point3) {
  VECTOR vector2;

  int angle;
-  float length;

  /* Compute angle */
  vector1.x = point2->pos.x - point1->pos.x;
@ -97,7 +96,7 @@ int Wordrec::angle_change(EDGEPT *point1, EDGEPT *point2, EDGEPT *point3) {
  vector2.x = point3->pos.x - point2->pos.x;
  vector2.y = point3->pos.y - point2->pos.y;
  /* Use cross product */
-  length = (float)sqrt((float)LENGTH(vector1) * LENGTH(vector2));
+  float length = std::sqrt(static_cast<float>(LENGTH(vector1)) * LENGTH(vector2));
  if ((int) length == 0)
    return (0);
  angle = static_cast<int>(floor(asin(CROSS (vector1, vector2) /
--- a/src/wordrec/language_model.cpp
+++ b/src/wordrec/language_model.cpp
@ -853,7 +853,7 @@ LanguageModelDawgInfo *LanguageModel::GenerateDawgInfo(
    if (language_model_debug_level > 2)
      tprintf("Test Letter OK for unichar %d, normed %d\n",
              b.unichar_id(), normed_ids[i]);
-    dict_->LetterIsOkay(&dawg_args_, normed_ids[i],
+    dict_->LetterIsOkay(&dawg_args_, dict_->getUnicharset(), normed_ids[i],
                        word_end && i == normed_ids.size() - 1);
    if (dawg_args_.permuter == NO_PERM) {
      break;
--- a/tessdata/configs/hocr
+++ b/tessdata/configs/hocr
@ -1,3 +1,2 @@
 tessedit_create_hocr 1
-tessedit_pageseg_mode 1
 hocr_font_info 0
--- a/tessdata/configs/pdf
+++ b/tessdata/configs/pdf
@ -1,2 +1 @@
 tessedit_create_pdf 1
-tessedit_pageseg_mode 1
--- a/tessdata/configs/tsv
+++ b/tessdata/configs/tsv
@ -1,2 +1 @@
 tessedit_create_tsv 1
-tessedit_pageseg_mode 1
--- a/tessdata/configs/unlv
+++ b/tessdata/configs/unlv
@ -1,2 +1 @@
 tessedit_write_unlv 1
-tessedit_pageseg_mode 6