Remaining misc changes for 3.02

git-svn-id: https://tesseract-ocr.googlecode.com/svn/trunk@658 d0cd1f9f-072b-0410-8dd7-cf729c803f20
2012-02-02 03:14:43 +00:00
commit e0d735b122
parent 23dfabcab1
41 changed files with 1893 additions and 775 deletions
--- a/63
+++ b/63
@@ -1,29 +1,43 @@
+2012-02-01 - v3.02
+  * Moved ResultIterator/PageIterator to ccmain.
+  * Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
+  * Added paragraph detection in layout analysis/post OCR.
+  * Fixed inconsistent xheight during training and over-chopping.
+  * Added simultaneous multi-language capability.
+  * Refactored top-level word recognition module.
+  * Added experimental equation detector.
+  * Improved handling of resolution from input images.
+  * Blamer module added for error analysis.
+  * Cleaned up externally used namespace by removing includes from baseapi.h.
+  * Removed dead memory management code.
+  * Tidied up constraints on control parameters.
+  * Added support for ShapeTable in classifier and training.
+  * Refactored class pruner.
+  * Fixed training leaks and randomness.
+  * Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
+  * Improved line detection and removal.
+  * Added fixed pitch chopper for CJK.
+  * Added UNICHARSET to WERD_CHOICE to make multi-language handling easier.
+  * Fixed problems with internally scaled images.
+  * Added page and bbox to string in tr files to identify source of training data better.
+  * Fixes to Hindi Shirorekha splitter.
+  * Added word bigram correction.
+  * Reduced stack memory consumption and eliminated some ugly typedefs.
+  * Added new uniform classifier API.
+  * Added new training error counter.
+  * Fixed endian bug in dawg reader.
+  * Many other fixes, including the way in which the chopper finds chops and messes with the outline while it does so.
+
 2010-11-29 - V3.01
  * Removed old/dead serialise/deserialze methods on *LISTIZED classes.
  * Total rewrite of DENORM to better encapsulate operation and make
    for potential to extract features from images.
-  * Thread-safety! Moved all critical globals and statics to
-    members of the appropriate class. Tesseract is now
-    thread-safe (multiple instances can be used in parallel
-    in multiple threads.) with the minor exception that some
-    control parameters are still global and affect all threads.
-  * Added Cube, a new recognizer for Arabic. Cube can also be
-    used in combination with normal Tesseract for other languages
-    with an improvement in accuracy at the cost of (much) lower speed.
-    There is no training module for Cube yet.
-  * OcrEngineMode in Init replaces AccuracyVSpeed to control cube.
-  * Greatly improved segmentation search with consequent accuracy and
-    speed improvements, especially for Chinese.
-  * Added PageIterator and ResultIterator as cleaner ways to get the
-    full results out of Tesseract, that are not currently provided
-    by any of the TessBaseAPI::Get* methods.
-    All other methods, such as the ETEXT_STRUCT in particular are
-    deprecated and will be deleted in the future.
-  * ApplyBoxes totally rewritten to make training easier.
-    It can now cope with touching/overlapping training characters,
-    and a new boxfile format allows word boxes instead of character
-    boxes, BUT to use that you have to have already boostrapped the
-    language with character boxes. "Cyclic dependency" on traineddata.
+  * Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
+  * Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
+  * `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube.
+  * Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
+  * Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, which are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, the `ETEXT_STRUCT` in particular, are deprecated and will be deleted in the future.
+  * ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you have to have already bootstrapped the language with character boxes. "Cyclic dependency" on traineddata.
  * Auto orientation and script detection added to page layout analysis.
  * Deleted *lots* of dead code.
  * Fixxht module replaced with scalable data-driven module.
@@ -35,6 +49,11 @@
  * Handling of vertical text improved.
  * Handling of leader dots improved.
  * Table detection greatly improved.
+  * Fixed a couple of memory leaks.
+  * Fixed font labels on output text. (Not perfect, but a lot better than before.)
+  * Cleanup and more bug fixes
+  * Special treatments for Hindi.
+  * Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)

 2010-09-21 - V3.00
  * Preparations for thread safety:
--- a/45
+++ b/45
@@ -1,26 +1,24 @@
+= Tesseract release notes Feb 01 2012 - V3.02 =
+  * Added Right-to-left/Bidi capability in the output iterators for Hebrew/Arabic.
+  * Added paragraph detection in layout analysis/post OCR.
+  * Added simultaneous multi-language capability.
+  * Added experimental equation detector.
+  * Major improvements to layout analysis for better image detection, diacritic detection, better textline finding, better tabstop finding.
+  * Improved line detection and removal.
+  * Added fixed pitch chopper for CJK.
+  * Added word bigram correction.
+  * Added new uniform classifier API.
+  * Added new training error counter.
+  * More detailed changes recorded in ChangeLog.
+
+
 = Tesseract release notes Oct 21 2011 - V3.01 =
-  * Thread-safety! Moved all critical globals and statics to
-    members of the appropriate class. Tesseract is now
-    thread-safe (multiple instances can be used in parallel
-    in multiple threads.) with the minor exception that some
-    control parameters are still global and affect all threads.
-  * Added Cube, a new recognizer for Arabic. Cube can also be
-    used in combination with normal Tesseract for other languages
-    with an improvement in accuracy at the cost of (much) lower speed.
-    There is no training module for Cube yet.
-  * OcrEngineMode in Init replaces AccuracyVSpeed to control cube.
-  * Greatly improved segmentation search with consequent accuracy and
-    speed improvements, especially for Chinese.
-  * Added PageIterator and ResultIterator as cleaner ways to get the
-    full results out of Tesseract, that are not currently provided
-    by any of the TessBaseAPI::Get* methods.
-    All other methods, such as the ETEXT_STRUCT in particular are
-    deprecated and will be deleted in the future.
-  * ApplyBoxes totally rewritten to make training easier.
-    It can now cope with touching/overlapping training characters,
-    and a new boxfile format allows word boxes instead of character
-    boxes, BUT to use that you have to have already boostrapped the
-    language with character boxes. "Cyclic dependency" on traineddata.
+  * Thread-safety! Moved all critical globals and statics to members of the appropriate class. Tesseract is now thread-safe (multiple instances can be used in parallel in multiple threads), with the minor exception that some control parameters are still global and affect all threads.
+  * Added Cube, a new recognizer for Arabic. Cube can also be used in combination with normal Tesseract for other languages with an improvement in accuracy at the cost of (much) lower speed. *There is no training module for Cube yet.*
+  * `OcrEngineMode` in `Init` replaces `AccuracyVSpeed` to control cube.
+  * Greatly improved segmentation search with consequent accuracy and speed improvements, especially for Chinese.
+  * Added `PageIterator` and `ResultIterator` as cleaner ways to get the full results out of Tesseract, which are not currently provided by any of the `TessBaseAPI::Get*` methods. All other methods, the `ETEXT_STRUCT` in particular, are deprecated and will be deleted in the future.
+  * ApplyBoxes totally rewritten to make training easier. It can now cope with touching/overlapping training characters, and a new boxfile format allows word boxes instead of character boxes, BUT to use that you have to have already bootstrapped the language with character boxes. "Cyclic dependency" on traineddata.
  * Auto orientation and script detection added to page layout analysis.
  * Deleted *lots* of dead code.
  * Fixxht module replaced with scalable data-driven module.
@@ -33,8 +31,7 @@
  * Handling of leader dots improved.
  * Table detection greatly improved.
  * Fixed a couple of memory leaks.
-  * Fixed font labels on output text. (Not perfect, but a lot better than
-    before.)
+  * Fixed font labels on output text. (Not perfect, but a lot better than before.)
  * Cleanup and more bug fixes
  * Special treatments for Hindi.
  * Support for build in VS2010 with Microsoft Windows SDK for Windows 7 (thanks to Michael Lutz)
--- a/ccutil/Makefile.am
+++ b/ccutil/Makefile.am
@@ -4,17 +4,17 @@ AM_CXXFLAGS = -DTESSDATA_PREFIX=@datadir@/
 EXTRA_DIST = mfcpch.cpp scanutils.cpp scanutils.h

 include_HEADERS = \
-    ambigs.h basedir.h bits16.h boxread.h \
+    ambigs.h basedir.h bits16.h bitvector.h \
    tesscallback.h ccutil.h clst.h \
    elst2.h elst.h errcode.h \
    fileerr.h genericvector.h globaloc.h \
-    hashfn.h helpers.h host.h hosthplb.h lsterr.h \
-    memblk.h memry.h memryerr.h mfcpch.h \
+    hashfn.h helpers.h host.h hosthplb.h indexmapbidi.h lsterr.h \
+    memry.h mfcpch.h \
    ndminx.h notdll.h nwmain.h \
    ocrclass.h platform.h qrsequence.h \
    secname.h serialis.h sorthelper.h stderr.h strngs.h \
    tessdatamanager.h tprintf.h \
-    unichar.h unicharmap.h unicharset.h unicity_table.h \
+    unichar.h unicharmap.h unicharset.h unicity_table.h unicodes.h \
    params.h

 if !USING_MULTIPLELIBS
@@ -25,14 +25,14 @@ libtesseract_ccutil_la_LDFLAGS = -version-info $(GENERIC_LIBRARY_VERSION)
 endif

 libtesseract_ccutil_la_SOURCES = \
-    ambigs.cpp basedir.cpp bits16.cpp boxread.cpp \
+    ambigs.cpp basedir.cpp bits16.cpp bitvector.cpp \
    ccutil.cpp clst.cpp \
    elst2.cpp elst.cpp errcode.cpp \
-    globaloc.cpp hashfn.cpp \
-    mainblk.cpp memblk.cpp memry.cpp \
+    globaloc.cpp hashfn.cpp indexmapbidi.cpp \
+    mainblk.cpp memry.cpp \
    serialis.cpp strngs.cpp \
    tessdatamanager.cpp tprintf.cpp \
-    unichar.cpp unicharmap.cpp unicharset.cpp \
+    unichar.cpp unicharmap.cpp unicharset.cpp unicodes.cpp \
    params.cpp


--- a/ccutil/ambigs.cpp
+++ b/ccutil/ambigs.cpp
@@ -18,9 +18,6 @@
 //
 ///////////////////////////////////////////////////////////////////////

-#include <string>
-#include <algorithm>
-
 #include "ambigs.h"
 #include "helpers.h"

@@ -314,7 +311,7 @@ void UnicharAmbigs::InsertIntoTable(
      unichar_id = ambig_spec->correct_ngram_id;
    } else {
      STRING frag_str = CHAR_FRAGMENT::to_string(
-          ReplacementString, i, TestAmbigPartSize);
+          ReplacementString, i, TestAmbigPartSize, false);
      unicharset->unichar_insert(frag_str.string());
      unichar_id = unicharset->unichar_to_id(frag_str.string());
    }
--- a/ccutil/bitvector.cpp
+++ b/ccutil/bitvector.cpp
@@ -0,0 +1,104 @@
+// Copyright 2011 Google Inc. All Rights Reserved.
+// Author: rays@google.com (Ray Smith)
+///////////////////////////////////////////////////////////////////////
+// File:        bitvector.cpp
+// Description: Class replacement for BITVECTOR.
+// Author:      Ray Smith
+// Created:     Mon Jan 10 17:45:01 PST 2011
+//
+// (C) Copyright 2011, Google Inc.
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+///////////////////////////////////////////////////////////////////////
+
+#include "bitvector.h"
+#include <string.h>
+#include "helpers.h"
+
+namespace tesseract {
+
+BitVector::BitVector() : bit_size_(0), array_(NULL) {}
+
+BitVector::BitVector(int length) : bit_size_(length) {
+  array_ = new uinT32[WordLength()];
+  SetAllFalse();
+}
+
+BitVector::BitVector(const BitVector& src) : bit_size_(src.bit_size_) {
+  array_ = new uinT32[WordLength()];
+  memcpy(array_, src.array_, ByteLength());
+}
+
+BitVector& BitVector::operator=(const BitVector& src) {
+  Alloc(src.bit_size_);
+  memcpy(array_, src.array_, ByteLength());
+  return *this;
+}
+
+BitVector::~BitVector() {
+  delete [] array_;
+}
+
+// Initializes the array to length * false.
+void BitVector::Init(int length) {
+  Alloc(length);
+  SetAllFalse();
+}
+
+// Writes to the given file. Returns false in case of error.
+bool BitVector::Serialize(FILE* fp) const {
+  if (fwrite(&bit_size_, sizeof(bit_size_), 1, fp) != 1) return false;
+  int wordlen = WordLength();
+  if (fwrite(array_, sizeof(*array_), wordlen, fp) != wordlen) return false;
+  return true;
+}
+
+// Reads from the given file. Returns false in case of error.
+// If swap is true, assumes a big/little-endian swap is needed.
+bool BitVector::DeSerialize(bool swap, FILE* fp) {
+  uinT32 new_bit_size;
+  if (fread(&new_bit_size, sizeof(new_bit_size), 1, fp) != 1) return false;
+  if (swap) {
+    ReverseN(&new_bit_size, sizeof(new_bit_size));
+  }
+  Alloc(new_bit_size);
+  int wordlen = WordLength();
+  if (fread(array_, sizeof(*array_), wordlen, fp) != wordlen) return false;
+  if (swap) {
+    for (int i = 0; i < wordlen; ++i)
+      ReverseN(&array_[i], sizeof(array_[i]));
+  }
+  return true;
+}
+
+void BitVector::SetAllFalse() {
+  memset(array_, 0, ByteLength());
+}
+void BitVector::SetAllTrue() {
+  memset(array_, ~0, ByteLength());
+}
+
+// Allocates memory for a vector of the given length.
+// Reallocates if the array is a different size, larger or smaller.
+void BitVector::Alloc(int length) {
+  int initial_wordlength = WordLength();
+  bit_size_ = length;
+  int new_wordlength = WordLength();
+  if (new_wordlength != initial_wordlength) {
+    delete [] array_;
+    array_ = new uinT32[new_wordlength];
+  }
+}
+
+
+}  // namespace tesseract.
+
+
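The `swap` path in `DeSerialize` above reverses the bytes of every 32-bit word after reading it from a file written on a machine of the opposite endianness. A minimal standalone sketch of that idea follows. `ReverseBytes32` and `SwapWords` are hypothetical helpers written for illustration; they are only assumed to behave like the `ReverseN` call in the diff, whose definition is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for ReverseN(&word, sizeof(word)):
// reverses the byte order of one 32-bit word in place.
inline void ReverseBytes32(uint32_t* word) {
  unsigned char bytes[sizeof(*word)];
  std::memcpy(bytes, word, sizeof(bytes));
  for (std::size_t i = 0; i < sizeof(bytes) / 2; ++i) {
    unsigned char tmp = bytes[i];
    bytes[i] = bytes[sizeof(bytes) - 1 - i];
    bytes[sizeof(bytes) - 1 - i] = tmp;
  }
  std::memcpy(word, bytes, sizeof(bytes));
}

// Applies the swap to each element of an array, as the loop in
// DeSerialize does for array_ when swap is true.
inline void SwapWords(uint32_t* words, int count) {
  assert(count >= 0);
  for (int i = 0; i < count; ++i) ReverseBytes32(&words[i]);
}
```

Swapping twice restores the original value, which is why the same routine can serve both reading directions.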
--- a/ccutil/bitvector.h
+++ b/ccutil/bitvector.h
@@ -0,0 +1,116 @@
+// Copyright 2011 Google Inc. All Rights Reserved.
+// Author: rays@google.com (Ray Smith)
+///////////////////////////////////////////////////////////////////////
+// File:        bitvector.h
+// Description: Class replacement for BITVECTOR.
+// Author:      Ray Smith
+// Created:     Mon Jan 10 17:44:01 PST 2011
+//
+// (C) Copyright 2011, Google Inc.
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+///////////////////////////////////////////////////////////////////////
+
+
+#ifndef TESSERACT_CCUTIL_BITVECTOR_H__
+#define TESSERACT_CCUTIL_BITVECTOR_H__
+
+#include <assert.h>
+#include <stdio.h>
+#include "host.h"
+
+namespace tesseract {
+
+// Trivial class to encapsulate a fixed-length array of bits, with
+// Serialize/DeSerialize. Replaces the old macros.
+class BitVector {
+ public:
+  BitVector();
+  // Initializes the array to length * false.
+  explicit BitVector(int length);
+  BitVector(const BitVector& src);
+  BitVector& operator=(const BitVector& src);
+  ~BitVector();
+
+  // Initializes the array to length * false.
+  void Init(int length);
+
+  // Returns the number of bits that are accessible in the vector.
+  int size() const {
+    return bit_size_;
+  }
+
+  // Writes to the given file. Returns false in case of error.
+  bool Serialize(FILE* fp) const;
+  // Reads from the given file. Returns false in case of error.
+  // If swap is true, assumes a big/little-endian swap is needed.
+  bool DeSerialize(bool swap, FILE* fp);
+
+  void SetAllFalse();
+  void SetAllTrue();
+
+  // Accessors to set/reset/get bits.
+  // The range of index is [0, size()-1].
+  // There is debug-only bounds checking.
+  void SetBit(int index) {
+    array_[WordIndex(index)] |= BitMask(index);
+  }
+  void ResetBit(int index) {
+    array_[WordIndex(index)] &= ~BitMask(index);
+  }
+  void SetValue(int index, bool value) {
+    if (value)
+      SetBit(index);
+    else
+      ResetBit(index);
+  }
+  bool At(int index) const {
+    return (array_[WordIndex(index)] & BitMask(index)) != 0;
+  }
+  bool operator[](int index) const {
+    return (array_[WordIndex(index)] & BitMask(index)) != 0;
+  }
+
+ private:
+  // Allocates memory for a vector of the given length.
+  void Alloc(int length);
+
+  // Computes the index to array_ for the given index, with debug range
+  // checking.
+  int WordIndex(int index) const {
+    assert(0 <= index && index < bit_size_);
+    return index / kBitFactor;
+  }
+  // Returns a mask to select the appropriate bit for the given index.
+  uinT32 BitMask(int index) const {
+    return 1 << (index & (kBitFactor - 1));
+  }
+  // Returns the number of array elements needed to represent the current
+  // bit_size_.
+  int WordLength() const {
+    return (bit_size_ + kBitFactor - 1) / kBitFactor;
+  }
+  // Returns the number of bytes consumed by the array_.
+  int ByteLength() const {
+    return WordLength() * sizeof(*array_);
+  }
+
+  // Number of bits in this BitVector.
+  uinT32 bit_size_;
+  // Array of words used to pack the bits.
+  uinT32* array_;
+  // Number of bits in an array_ element.
+  static const int kBitFactor = sizeof(uinT32) * 8;
+};
+
+}  // namespace tesseract.
+
+#endif  // TESSERACT_CCUTIL_BITVECTOR_H__
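The header above packs bits into 32-bit words: `WordIndex` picks the word, `BitMask` picks the bit inside it, and `WordLength` rounds the bit count up to whole words. The sketch below reproduces just that arithmetic in a self-contained class so it can be tried in isolation. `TinyBits` is a hypothetical name and is not part of Tesseract; it uses `std::vector` for storage and an unsigned literal in the shift, both of which are choices made for this sketch rather than taken from the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: a fixed-length bit array using the same word/mask
// arithmetic as the BitVector header above.
class TinyBits {
 public:
  explicit TinyBits(int length)
      : bit_size_(length),
        words_((length + kBitFactor - 1) / kBitFactor, 0u) {}

  int size() const { return bit_size_; }
  void SetBit(int index) { words_[WordIndex(index)] |= BitMask(index); }
  void ResetBit(int index) { words_[WordIndex(index)] &= ~BitMask(index); }
  bool At(int index) const {
    return (words_[WordIndex(index)] & BitMask(index)) != 0;
  }
  // Number of 32-bit words backing the bits (rounded up).
  int WordLength() const { return static_cast<int>(words_.size()); }

 private:
  static const int kBitFactor = 32;  // Bits per storage word.

  // Index of the word holding the given bit, with debug range checking.
  int WordIndex(int index) const {
    assert(0 <= index && index < bit_size_);
    return index / kBitFactor;
  }
  // Mask selecting the bit within its word. The literal is unsigned so
  // that shifting into bit 31 stays well defined.
  static uint32_t BitMask(int index) {
    return 1u << (index & (kBitFactor - 1));
  }

  int bit_size_;
  std::vector<uint32_t> words_;
};
```

With a length of 40, for example, two words are allocated, and bits 31 and 32 land in different words.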
--- a/ccutil/boxread.cpp
+++ b/ccutil/boxread.cpp
@@ -1,138 +0,0 @@
-/**********************************************************************
- * File:        boxread.cpp
- * Description: Read data from a box file.
- * Author:      Ray Smith
- * Created:     Fri Aug 24 17:47:23 PDT 2007
- *
- * (C) Copyright 2007, Google Inc.
- ** Licensed under the Apache License, Version 2.0 (the "License");
- ** you may not use this file except in compliance with the License.
- ** You may obtain a copy of the License at
- ** http://www.apache.org/licenses/LICENSE-2.0
- ** Unless required by applicable law or agreed to in writing, software
- ** distributed under the License is distributed on an "AS IS" BASIS,
- ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ** See the License for the specific language governing permissions and
- ** limitations under the License.
- *
- **********************************************************************/
-
-#include "mfcpch.h"
-#include <string.h>
-#include "boxread.h"
-#include "unichar.h"
-#include "tprintf.h"
-
-// Special char code used to identify multi-blob labels.
-static const char* kMultiBlobLabelCode = "WordStr";
-
-// Open the boxfile based on the given image filename.
-FILE* OpenBoxFile(const STRING& fname) {
-  STRING filename = fname;
-  const char *lastdot = strrchr(filename.string(), '.');
-  if (lastdot != NULL)
-    filename[lastdot - filename.string()] = '\0';
-
-  filename += ".box";
-  FILE* box_file = NULL;
-  if (!(box_file = fopen(filename.string(), "rb"))) {
-    CANTOPENFILE.error("read_next_box", TESSEXIT,
-                       "Cant open box file %s",
-                       filename.string());
-  }
-  return box_file;
-}
-
-// Box files are used ONLY DURING TRAINING, but by both processes of
-// creating tr files with tesseract, and unicharset_extractor.
-// read_next_box factors out the code to interpret a line of a box
-// file so that applybox and unicharset_extractor interpret the same way.
-// This function returns the next valid box file utf8 string and coords
-// and returns true, or false on eof (and closes the file).
-// It ignores the uft8 file signature, checks for valid utf-8 and allows
-// space or tab between fields.
-// utf8_str must be at least kBoxReadBufSize in length.
-// If there are page numbers in the file, it reads them all.
-bool read_next_box(int *line_number, FILE* box_file, char* utf8_str,
-                   int* x_min, int* y_min, int* x_max, int* y_max) {
-  return read_next_box(-1, line_number, box_file, utf8_str,
-                       x_min, y_min, x_max, y_max);
-}
-
-// As read_next_box above, but get a specific page number. (0-based)
-// Use -1 to read any page number. Files without page number all
-// read as if they are page 0.
-bool read_next_box(int target_page, int *line_number,
-                   FILE* box_file, char* utf8_str,
-                   int* x_min, int* y_min, int* x_max, int* y_max) {
-  int count = 0;
-  int page = 0;
-  char buff[kBoxReadBufSize];   // boxfile read buffer
-  char uch[kBoxReadBufSize];
-  char *buffptr = buff;
-
-  while (fgets(buff, sizeof(buff) - 1, box_file)) {
-    (*line_number)++;
-
-    buffptr = buff;
-    const unsigned char *ubuf = reinterpret_cast<const unsigned char*>(buffptr);
-    if (ubuf[0] == 0xef && ubuf[1] == 0xbb && ubuf[2] == 0xbf)
-      buffptr += 3;  // Skip unicode file designation.
-    // Check for blank lines in box file
-    while (*buffptr == ' ' || *buffptr == '\t')
-      buffptr++;
-    if (*buffptr != '\0') {
-      // Read the unichar without messing up on Tibetan.
-      // According to issue 253 the utf-8 surrogates 85 and A0 are treated
-      // as whitespace by sscanf, so it is more reliable to just find
-      // ascii space and tab.
-      int uch_len = 0;
-      while (*buffptr != '\0' && *buffptr != ' ' && *buffptr != '\t')
-        uch[uch_len++] = *buffptr++;
-      uch[uch_len] = '\0';
-      if (*buffptr != '\0') ++buffptr;
-      count = sscanf(buffptr, "%d %d %d %d %d",
-                     x_min, y_min, x_max, y_max, &page);
-      if (count != 5) {
-        if (target_page <= 0) {
-          // If target_page is negative or zero, allow lines with no page number
-          page = 0;
-          count = sscanf(buffptr, "%d %d %d %d", x_min, y_min, x_max, y_max);
-        } else {
-          tprintf("Box file format error on line %i; ignored\n", *line_number);
-          continue;
-        }
-      }
-      if (target_page >= 0 && target_page != page)
-        continue;  // Not on the appropriate page.
-      // Test for long space-delimited string label.
-      if (strcmp(uch, kMultiBlobLabelCode) == 0 &&
-          (buffptr = strchr(buffptr, '#')) != NULL) {
-        strcpy(uch, buffptr + 1);
-        chomp_string(uch);
-        uch_len = strlen(uch);
-      }
-      // Validate UTF8 by making unichars with it.
-      int used = 0;
-      while (used < uch_len) {
-        UNICHAR ch(uch + used, uch_len - used);
-        int new_used = ch.utf8_len();
-        if (new_used == 0) {
-          tprintf("Bad UTF-8 str %s starts with 0x%02x at line %d, col %d\n",
-                  uch + used, uch[used], *line_number, used + 1);
-          count = 0;
-          break;
-        }
-        used += new_used;
-      }
-      if (count < 4 || used == 0) {
-        tprintf("Box file format error on line %i; ignored\n", *line_number);
-      } else {
-        strncpy(utf8_str, uch, kBoxReadBufSize);
-        return true;  // Successfully read a box.
-      }
-    }
-  }
-  fclose(box_file);
-  return false;  // EOF
-}
--- a/ccutil/boxread.h
+++ b/ccutil/boxread.h
@@ -1,48 +0,0 @@
-/**********************************************************************
- * File:        boxread.cpp
- * Description: Read data from a box file.
- * Author:		Ray Smith
- * Created:		Fri Aug 24 17:47:23 PDT 2007
- *
- * (C) Copyright 2007, Google Inc.
- ** Licensed under the Apache License, Version 2.0 (the "License");
- ** you may not use this file except in compliance with the License.
- ** You may obtain a copy of the License at
- ** http://www.apache.org/licenses/LICENSE-2.0
- ** Unless required by applicable law or agreed to in writing, software
- ** distributed under the License is distributed on an "AS IS" BASIS,
- ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ** See the License for the specific language governing permissions and
- ** limitations under the License.
- *
- **********************************************************************/
-
-#ifndef TESSERACT_CCUTIL_BOXREAD_H__
-#define TESSERACT_CCUTIL_BOXREAD_H__
-
-#include <stdio.h>
-#include "strngs.h"
-
-// Size of buffer used to read a line from a box file.
-const int kBoxReadBufSize = 1024;
-
-// Open the boxfile based on the given image filename.
-FILE* OpenBoxFile(const STRING& fname);
-
-// read_next_box factors out the code to interpret a line of a box
-// file so that applybox and unicharset_extractor interpret the same way.
-// This function returns the next valid box file utf8 string and coords
-// and returns true, or false on eof (and closes the file).
-// If ignores the uft8 file signature, checks for valid utf-8 and allows
-// space or tab between fields.
-// utf8_str must be at least kBoxReadBufSize in length.
-// If there are page numbers in the file, it reads them all.
-bool read_next_box(int *line_number, FILE* box_file, char* utf8_str,
-                   int* x_min, int* y_min, int* x_max, int* y_max);
-// As read_next_box above, but get a specific page number. (0-based)
-// Use -1 to read any page number. Files without page number all
-// read as if they are page 0.
-bool read_next_box(int page, int *line_number, FILE* box_file, char* utf8_str,
-                   int* x_min, int* y_min, int* x_max, int* y_max);
-
-#endif  // TESSERACT_CCUTIL_BOXREAD_H__
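The reader removed from ccutil above documents the box-file line layout: a label terminated by an ASCII space or tab, then four integer coordinates and an optional page number. As a rough sketch of that layout only, the parser below handles a single line. `BoxLine` and `ParseBoxLine` are hypothetical names; the sketch deliberately leaves out the BOM skipping, the `WordStr` multi-blob labels, the page filtering, and the UTF-8 validation that the removed `read_next_box` performed.

```cpp
#include <cstdio>
#include <string>

// Hypothetical container for one parsed box-file line.
struct BoxLine {
  std::string label;
  int x_min, y_min, x_max, y_max;
  int page;  // 0 when the line carries no page number.
};

// Parses "label x_min y_min x_max y_max [page]".
// Returns false for blank or malformed lines.
inline bool ParseBoxLine(const std::string& line, BoxLine* out) {
  std::string::size_type start = line.find_first_not_of(" \t");
  if (start == std::string::npos) return false;  // Blank line.
  // Split the label on ASCII space or tab only, as the removed code did,
  // rather than letting sscanf decide what counts as whitespace.
  std::string::size_type end = line.find_first_of(" \t", start);
  if (end == std::string::npos) return false;  // Label without coords.
  out->label = line.substr(start, end - start);
  out->page = 0;  // Stays 0 if sscanf matches only four fields.
  int count = std::sscanf(line.c_str() + end, "%d %d %d %d %d",
                          &out->x_min, &out->y_min,
                          &out->x_max, &out->y_max, &out->page);
  return count >= 4;
}
```

A line without a page number is treated as page 0, matching the comment in the removed header.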
--- a/ccutil/ccutil.cpp
+++ b/ccutil/ccutil.cpp
@@ -14,10 +14,10 @@ CCUtil::CCUtil() :
 #endif
  INT_INIT_MEMBER(ambigs_debug_level, 0, "Debug level for unichar ambiguities",
                  &params_),
-  BOOL_INIT_MEMBER(use_definite_ambigs_for_classifier, 0, "Use definite"
-                   " ambiguities when running character classifier", &params_),
-  BOOL_INIT_MEMBER(use_ambigs_for_adaption, 0, "Use ambigs for deciding"
-                   " whether to adapt to a character", &params_) {
+  BOOL_MEMBER(use_definite_ambigs_for_classifier, 0, "Use definite"
+              " ambiguities when running character classifier", &params_),
+  BOOL_MEMBER(use_ambigs_for_adaption, 0, "Use ambigs for deciding"
+              " whether to adapt to a character", &params_) {
 }

 CCUtil::~CCUtil() {
--- a/ccutil/clst.cpp
+++ b/ccutil/clst.cpp
@@ -93,36 +93,6 @@ void CLIST::shallow_clear() {  //destroy all links
  }
 }

-
-/***********************************************************************
- *							CLIST::internal_deep_copy
- *
- *  Used during explict deep copy of a list.  The "copier" function passed
- *  allows each element to be correctly deep copied (assuming that each class
- *  in the inheritance hierarchy does properly deep copies its members).  The
- *  function passing technique is as for "internal_clear".
- **********************************************************************/
-
-void
-                                 //ptr to copier functn
-CLIST::internal_deep_copy (void *(*copier) (void *),
-const CLIST * list) {            //list being copied
-  CLIST_ITERATOR from_it ((CLIST *) list);
-  CLIST_ITERATOR to_it(this);
-
-  #ifndef NDEBUG
-  if (!this)
-    NULL_OBJECT.error ("CLIST::internal_deep_copy", ABORT, NULL);
-  if (!list)
-    BAD_PARAMETER.error ("CLIST::internal_deep_copy", ABORT,
-      "source list is NULL");
-  #endif
-
-  for (from_it.mark_cycle_pt (); !from_it.cycled_list (); from_it.forward ())
-    to_it.add_after_then_move (copier (from_it.data ()));
-}
-
-
 /***********************************************************************
 *							CLIST::assign_to_sublist
 *
--- a/ccutil/clst.h
+++ b/ccutil/clst.h
@@ -105,10 +105,6 @@ class DLLSYM CLIST
      last = from_list->last;
    }

-                                 //ptr to copier functn
-    void internal_deep_copy (void *(*copier) (void *),
-      const CLIST * list);       //list being copied
-
    void assign_to_sublist(                           //to this list
                           CLIST_ITERATOR *start_it,  //from list start
                           CLIST_ITERATOR *end_it);   //from list end
@@ -897,10 +893,6 @@ public:																			\
 void						deep_clear()				/* delete elements */	\
 	{ CLIST::internal_deep_clear( &CLASSNAME##_c1_zapper ); }					\
 																				\
-void						deep_copy(					/* become a deep */		\
-	const CLASSNAME##_CLIST*list)						/* copy of src list*/	\
-	{ CLIST::internal_deep_copy( &CLASSNAME##_c1_copier, list ); }				\
-																				\
 void						operator=(					/* prevent assign */	\
 	const CLASSNAME##_CLIST&)													\
 	{ DONT_ASSIGN_LISTS.error( QUOTE_IT( CLASSNAME##_CLIST ),					\
@@ -979,27 +971,5 @@ void*						link)						/*link to delete*/		\
 {																				\
 delete (CLASSNAME *) link;														\
 }																				\
-																				\
-																				\
-																				\
-/***********************************************************************		\
-*							CLASSNAME##_c1_copier								\
-*																				\
-*  A function which can generate a new, deep copy of a CLASSNAME element.		\
-*  This is passed to the generic deep copy list member function so that when	\
-*  a list is copied the elements on the list are properly copied from the		\
-*  base class, even though we dont use a virtual function.						\
-*																				\
-**********************************************************************/			\
-																				\
-DLLSYM void*				CLASSNAME##_c1_copier(		/*deep copy a link*/	\
-void*						old_element)				/*source link*/			\
-{																				\
-	CLASSNAME*			new_element;										\
-																				\
-new_element = new CLASSNAME;													\
-*new_element = *((CLASSNAME*) old_element);									\
-return (void*) new_element;														\
-}

 #endif
--- a/ccutil/errcode.cpp
+++ b/ccutil/errcode.cpp
@@ -26,7 +26,6 @@
 #include          <signal.h>
 #endif
 #include          "tprintf.h"
-//#include                                      "ipeerr.h"
 #include          "errcode.h"

 const ERRCODE BADERRACTION = "Illegal error action";
@@ -39,14 +38,12 @@ const ERRCODE BADERRACTION = "Illegal error action";
 * Makes use of error messages and numbers in a common place.
 *
 **********************************************************************/
-void
-ERRCODE::error (                 //handle error
-const char *caller,              //name of caller
-inT8 action,                     //action to take
-const char *format, ...          //special message
-) const
-{
-  va_list args;                  //variable args
+void ERRCODE::error(             // handle error
+const char *caller,              // name of caller
+TessErrorLogCode action,         // action to take
+const char *format, ...          // special message
+) const {
+  va_list args;                  // variable args
  char msg[MAX_MSG];
  char *msgptr = msg;

@ -76,16 +73,6 @@ const char *format, ...          //special message
    msgptr += sprintf (msgptr, "\n");

  fprintf(stderr, msg);
-  /*if ((strstr (message, "File") != NULL) ||
-    (strstr (message, "file") != NULL))
-  else if ((strstr (message, "List") != NULL) ||
-    (strstr (message, "list") != NULL))
-  else if ((strstr (message, "Memory") != NULL) ||
-    (strstr (message, "memory") != NULL))
-    global_abort_code = MEMORY_ABORT;
-  else
-    global_abort_code = NO_ABORT_CODE;
-    */

  int* p = NULL;
  switch (action) {
--- a/ccutil/errcode.h
+++ b/ccutil/errcode.h
@ -23,10 +23,12 @@
 #include          "host.h"

 /*Control parameters for error()*/
-#define DBG         -1           /*log without alert */
-#define TESSLOG     0            /*alert user */
-#define TESSEXIT    1            /*exit after erro */
-#define ABORT       2            /*abort after error */
+enum TessErrorLogCode {
+  DBG =     -1,           /*log without alert */
+  TESSLOG =  0,            /*alert user */
+  TESSEXIT = 1,            /*exit after error */
+  ABORT =    2            /*abort after error */
+};

 /* Explicit Error Abort codes */
 #define NO_ABORT_CODE      0
@ -64,18 +66,17 @@
 #define SUBSUBLOC_TESS    1
 #define SUBSUBLOC_NN    2

-class DLLSYM ERRCODE             //error handler class
-{
-  const char *message;           //error message
-  public:
-    void error (                 //error print function
-      const char *caller,        //function location
-      inT8 action,               //action to take
-      const char *format, ...    //fprintf format
-      ) const;
-    ERRCODE(const char *string) {
-      message = string;
-    }                            //initialize with string
+class DLLSYM ERRCODE {           // error handler class
+  const char *message;           // error message
+ public:
+  void error(                  // error print function
+    const char *caller,        // function location
+    TessErrorLogCode action,   // action to take
+    const char *format, ...    // fprintf format
+  ) const;
+  ERRCODE(const char *string) {
+    message = string;
+  }                            // initialize with string
 };

 const ERRCODE ASSERT_FAILED = "Assert failed";
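Reviewer note (not part of the commit): replacing the `#define` constants with `enum TessErrorLogCode` lets the compiler check the `action` argument of `ERRCODE::error`. A minimal sketch of that effect, assuming a hypothetical helper `DescribeAction` that does not exist in Tesseract; the enum values are copied from the hunk above.

```cpp
#include <cassert>
#include <string>

// Values copied from the errcode.h hunk.
enum TessErrorLogCode {
  DBG = -1,      // log without alert
  TESSLOG = 0,   // alert user
  TESSEXIT = 1,  // exit after error
  ABORT = 2      // abort after error
};

// Hypothetical helper for illustration only: returns the comment text
// attached to each action code.
inline std::string DescribeAction(TessErrorLogCode action) {
  switch (action) {
    case DBG:      return "log without alert";
    case TESSLOG:  return "alert user";
    case TESSEXIT: return "exit after error";
    case ABORT:    return "abort after error";
  }
  return "unknown";
}

// DescribeAction(3);  // Rejected at compile time: C++ has no implicit
//                     // conversion from int to an enumeration type.
```

With the old `inT8 action` parameter, any small integer would have been accepted silently.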
--- a/ccutil/genericvector.h
+++ b/ccutil/genericvector.h
@ -71,6 +71,7 @@ class GenericVector {

  // Return the object from an index.
  T &get(int index) const;
+  T &back() const;
  T &operator[](int index) const;

  // Return the index of the T object.
@ -88,6 +89,10 @@ class GenericVector {
  int push_back(T object);
  void operator+=(T t);

+  // Push an element at the end of the array if the same
+  // element is not already contained in the array.
+  int push_back_new(T object);
+
  // Push an element in the front of the array
  // Note: This function is O(n)
  int push_front(T object);
@ -127,7 +132,7 @@ class GenericVector {
  void delete_data_pointers();

  // This method clears the current object, then, does a shallow copy of
-  // its argument, and finally invalidate its argument.
+  // its argument, and finally invalidates its argument.
  // Callbacks are moved to the current object;
  void move(GenericVector<T>* from);

@ -197,7 +202,10 @@ class GenericVector {
  }
  // Searches the array (assuming sorted in ascending order, using sort()) for
  // an element equal to target and returns the index of the best candidate.
-  // The return value is the largest index i such that data_[i] <= target or 0.
+  // The return value is conceptually the largest index i such that
+  // data_[i] <= target or 0 if target < the whole vector.
+  // NOTE that this function uses operator> so really the return value is
+  // the largest index i such that data_[i] > target is false.
  int binary_search(const T& target) const {
    int bottom = 0;
    int top = size_used_;
@ -328,13 +336,13 @@ class PointerVector : public GenericVector<T*> {
  // Copy must be deep, as the pointers will be automatically deleted on
  // destruction.
  PointerVector(const PointerVector& other) {
-    init(other.size());
+    this->init(other.size());
    this->operator+=(other);
  }
  PointerVector<T>& operator+=(const PointerVector& other) {
-    reserve(this->size_used_ + other.size_used_);
+    this->reserve(this->size_used_ + other.size_used_);
    for (int i = 0; i < other.size(); ++i) {
-      push_back(new T(*other.data_[i]));
+      this->push_back(new T(*other.data_[i]));
    }
    return *this;
  }
@ -360,6 +368,28 @@ class PointerVector : public GenericVector<T*> {
    GenericVector<T*>::truncate(size);
  }

+  // Compact the vector by deleting elements for which delete_cb returns
+  // true. delete_cb is a permanent callback and will be deleted.
+  void compact(TessResultCallback1<bool, const T*>* delete_cb) {
+    int new_size = 0;
+    int old_index = 0;
+    // Until the callback returns true, the elements stay the same.
+    while (old_index < GenericVector<T*>::size_used_ &&
+           !delete_cb->Run(GenericVector<T*>::data_[old_index++]))
+      ++new_size;
+    // Now just copy anything else that gets false from delete_cb.
+    for (; old_index < GenericVector<T*>::size_used_; ++old_index) {
+      if (!delete_cb->Run(GenericVector<T*>::data_[old_index])) {
+        GenericVector<T*>::data_[new_size++] =
+            GenericVector<T*>::data_[old_index];
+      } else {
+        delete GenericVector<T*>::data_[old_index];
+      }
+    }
+    GenericVector<T*>::size_used_ = new_size;
+    delete delete_cb;
+  }
+
  // Clear the array, calling the clear callback function if any.
  // All the owned callbacks are also deleted.
  // If you don't want the callbacks to be deleted, before calling clear, set
@ -399,7 +429,7 @@ class PointerVector : public GenericVector<T*> {
        item = new T;
        if (!item->DeSerialize(swap, fp)) return false;
      }
-      push_back(item);
+      this->push_back(item);
    }
    return true;
  }
@ -488,6 +518,12 @@ T &GenericVector<T>::operator[](int index) const {
 return data_[index];
 }

+template <typename T>
+T &GenericVector<T>::back() const {
+  ASSERT_HOST(size_used_ > 0);
+  return data_[size_used_ - 1];
+}
+
 // Return the object from an index.
 template <typename T>
 void GenericVector<T>::set(T t, int index) {
@ -555,6 +591,14 @@ int GenericVector<T>::push_back(T object) {
  return index;
 }

+template <typename T>
+int GenericVector<T>::push_back_new(T object) {
+  int index = get_index(object);
+  if (index >= 0)
+    return index;
+  return push_back(object);
+}
+
 // Add an element in the array (front)
 template <typename T>
 int GenericVector<T>::push_front(T object) {
@ -739,7 +783,7 @@ bool GenericVector<T>::DeSerializeClasses(bool swap, FILE* fp) {
 }

 // This method clear the current object, then, does a shallow copy of
-// its argument, and finally invalindate its argument.
+// its argument, and finally invalidates its argument.
 template <typename T>
 void GenericVector<T>::move(GenericVector<T>* from) {
  this->clear();
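Reviewer note (not part of the commit): the intent of `PointerVector::compact` and `GenericVector::push_back_new` can be sketched with standard containers. This assumes `std::vector` in place of `GenericVector` and a plain callable in place of `TessResultCallback1`; `CompactOwned` and `PushBackNew` are hypothetical names. The sketch uses a single loop, so every rejected element is deleted. In the hunk as written, the post-increment inside the first `while` condition appears to step past the first rejected element without deleting it, which may be worth a second look.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Deletes every owned pointer for which should_delete returns true and
// shifts the survivors down, preserving their order.
template <typename T, typename Pred>
void CompactOwned(std::vector<T*>* items, Pred should_delete) {
  assert(items != nullptr);
  std::size_t new_size = 0;
  for (std::size_t old_index = 0; old_index < items->size(); ++old_index) {
    T* item = (*items)[old_index];
    if (should_delete(static_cast<const T*>(item))) {
      delete item;                  // Rejected: free the owned object.
    } else {
      (*items)[new_size++] = item;  // Kept: move into the next free slot.
    }
  }
  items->resize(new_size);
}

// Appends object only if no equal element is present. Returns the index of
// the existing or newly added element. Like get_index, this is a linear scan.
template <typename T>
int PushBackNew(std::vector<T>* items, const T& object) {
  assert(items != nullptr);
  auto it = std::find(items->begin(), items->end(), object);
  if (it != items->end()) return static_cast<int>(it - items->begin());
  items->push_back(object);
  return static_cast<int>(items->size()) - 1;
}
```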
--- a/ccutil/helpers.h
+++ b/ccutil/helpers.h
@ -31,8 +31,9 @@
 // Remove newline (if any) at the end of the string.
 inline void chomp_string(char *str) {
  int last_index = strlen(str) - 1;
-  if (last_index >= 0 && str[last_index] == '\n') {
-    str[last_index] = '\0';
+  while (last_index >= 0 &&
+         (str[last_index] == '\n' || str[last_index] == '\r')) {
+    str[last_index--] = '\0';
  }
 }

@ -87,6 +88,18 @@ inline void UpdateRange(const T1& x_lo, const T1& x_hi,
    *upper_bound = x_hi;
 }

+// Intersect the range [*lower2, *upper2] with the range [lower1, upper1],
+// putting the result back in [*lower2, *upper2].
+// If non-intersecting ranges are given, we end up with *lower2 > *upper2.
+template<typename T>
+inline void IntersectRange(const T& lower1, const T& upper1,
+                           T* lower2, T* upper2) {
+  if (lower1 > *lower2)
+    *lower2 = lower1;
+  if (upper1 < *upper2)
+    *upper2 = upper1;
+}
+
 // Proper modulo arithmetic operator. Returns a mod b that works for -ve a.
 // For any integer a and positive b, returns r : 0<=r<b and a=n*b + r for
 // some integer n.
@ -99,8 +112,8 @@ inline int Modulo(int a, int b) {
 // counting at 0. With simple rounding 1/3 = 0, 0/3 = 0 -1/3 = 0, -2/3 = 0,
 // -3/3 = 0 and -4/3 = -1.
 // I want 1/3 = 0, 0/3 = 0, -1/3 = 0, -2/3 = -1, -3/3 = -1 and -4/3 = -1.
-// Results with b negative are not defined.
 inline int DivRounded(int a, int b) {
+  if (b < 0) return -DivRounded(a, -b);
  return a >= 0 ? (a + b / 2) / b : (a - b / 2) / b;
 }
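Reviewer note (not part of the commit): a standalone sketch of the three helpers.h changes, useful for checking the edge cases by hand. The function bodies follow the hunks above; the `assert(b != 0)` guard is an assumption added here and is not in the source.

```cpp
#include <cassert>
#include <cstring>

// Strips every trailing '\n' or '\r', so "abc\r\n" becomes "abc".
inline void chomp_string(char* str) {
  int last_index = static_cast<int>(std::strlen(str)) - 1;
  while (last_index >= 0 &&
         (str[last_index] == '\n' || str[last_index] == '\r')) {
    str[last_index--] = '\0';
  }
}

// Intersects [*lower2, *upper2] with [lower1, upper1] in place.
// Disjoint inputs leave *lower2 > *upper2.
template <typename T>
inline void IntersectRange(const T& lower1, const T& upper1,
                           T* lower2, T* upper2) {
  if (lower1 > *lower2) *lower2 = lower1;
  if (upper1 < *upper2) *upper2 = upper1;
}

// Division rounded to nearest; a negative divisor is handled by negating
// the result of the positive-divisor case.
inline int DivRounded(int a, int b) {
  assert(b != 0);  // Guard added for this sketch only.
  if (b < 0) return -DivRounded(a, -b);
  return a >= 0 ? (a + b / 2) / b : (a - b / 2) / b;
}
```

The new `b < 0` branch replaces the removed comment that said results with negative `b` were not defined.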

--- a/ccutil/host.h
+++ b/ccutil/host.h
@ -150,15 +150,6 @@ typedef unsigned char BOOL8;
 #define MIN_FLOAT32 ((float)1.17549435e-38)

 // Defines
-
-#ifndef OKAY
-#define OKAY            0
-#endif
-
-#ifndef HPERR
-#define HPERR           -1
-#endif
-
 #ifndef TRUE
 #define TRUE            1
 #endif
--- a/ccutil/indexmapbidi.cpp
+++ b/ccutil/indexmapbidi.cpp
@ -0,0 +1,250 @@
+///////////////////////////////////////////////////////////////////////
+// File:        indexmapbidi.cpp
+// Description: Bi-directional mapping between a sparse and compact space.
+// Author:      rays@google.com (Ray Smith)
+// Created:     Tue Apr 06 11:33:59 PDT 2010
+//
+// (C) Copyright 2010, Google Inc.
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+///////////////////////////////////////////////////////////////////////
+
+#include "indexmapbidi.h"
+
+namespace tesseract {
+
+// SparseToCompact takes a sparse index to an index in the compact space.
+// Uses a binary search to find the result. For faster speed use
+// IndexMapBiDi, but that takes more memory.
+int IndexMap::SparseToCompact(int sparse_index) const {
+  int result = compact_map_.binary_search(sparse_index);
+  return compact_map_[result] == sparse_index ? result : -1;
+}
+
+// Copy from the input.
+void IndexMap::CopyFrom(const IndexMap& src) {
+  sparse_size_ = src.sparse_size_;
+  compact_map_ = src.compact_map_;
+}
+void IndexMap::CopyFrom(const IndexMapBiDi& src) {
+  sparse_size_ = src.SparseSize();
+  compact_map_ = src.compact_map_;
+}
+
+// Writes to the given file. Returns false in case of error.
+bool IndexMap::Serialize(FILE* fp) const {
+  inT32 sparse_size = sparse_size_;
+  if (fwrite(&sparse_size, sizeof(sparse_size), 1, fp) != 1) return false;
+  if (!compact_map_.Serialize(fp)) return false;
+  return true;
+}
+
+// Reads from the given file. Returns false in case of error.
+// If swap is true, assumes a big/little-endian swap is needed.
+bool IndexMap::DeSerialize(bool swap, FILE* fp) {
+  inT32 sparse_size;
+  if (fread(&sparse_size, sizeof(sparse_size), 1, fp) != 1) return false;
+  if (swap)
+    ReverseN(&sparse_size, sizeof(sparse_size));
+  sparse_size_ = sparse_size;
+  if (!compact_map_.DeSerialize(swap, fp)) return false;
+  return true;
+}
+
+
+// Top-level init function in a single call to initialize a map to select
+// a single contiguous subrange [start, end) of the sparse space to be mapped
+// 1 to 1 to the compact space, with all other elements of the sparse space
+// left unmapped.
+// No need to call Setup after this.
+void IndexMapBiDi::InitAndSetupRange(int sparse_size, int start, int end) {
+  Init(sparse_size, false);
+  for (int i = start; i < end; ++i)
+    SetMap(i, true);
+  Setup();
+}
+
+// Initializes just the sparse_map_ to the given size with either all
+// forward indices mapped (all_mapped = true) or none (all_mapped = false).
+// Call Setup immediately after, or make calls to SetMap first to adjust the
+// mapping and then call Setup before using the map.
+void IndexMapBiDi::Init(int size, bool all_mapped) {
+  sparse_map_.init_to_size(size, -1);
+  if (all_mapped) {
+    for (int i = 0; i < size; ++i)
+      sparse_map_[i] = i;
+  }
+}
+
+// Sets a given index in the sparse_map_ to be mapped or not.
+void IndexMapBiDi::SetMap(int sparse_index, bool mapped) {
+  sparse_map_[sparse_index] = mapped ? 0 : -1;
+}
+
+// Sets up the sparse_map_ and compact_map_ properly after Init and
+// some calls to SetMap. Assumes an ordered 1-1 map from set indices
+// in the sparse space to the compact space.
+void IndexMapBiDi::Setup() {
+  int compact_size = 0;
+  for (int i = 0; i < sparse_map_.size(); ++i) {
+    if (sparse_map_[i] >= 0) {
+      sparse_map_[i] = compact_size++;
+    }
+  }
+  compact_map_.init_to_size(compact_size, -1);
+  for (int i = 0; i < sparse_map_.size(); ++i) {
+    if (sparse_map_[i] >= 0) {
+      compact_map_[sparse_map_[i]] = i;
+    }
+  }
+  sparse_size_ = sparse_map_.size();
+}
+
+// Copy from the input.
+void IndexMapBiDi::CopyFrom(const IndexMapBiDi& src) {
+  sparse_map_ = src.sparse_map_;
+  compact_map_ = src.compact_map_;
+  sparse_size_ = sparse_map_.size();
+}
+
+// Merges the two compact space indices. May be called many times, but
+// the merges must be concluded by a call to CompleteMerges.
+// Returns true if a merge was actually performed.
+bool IndexMapBiDi::Merge(int compact_index1, int compact_index2) {
+  // Find the current master index for index1 and index2.
+  compact_index1 = MasterCompactIndex(compact_index1);
+  compact_index2 = MasterCompactIndex(compact_index2);
+  // Be sure that index1 < index2.
+  if (compact_index1 > compact_index2) {
+    int tmp = compact_index1;
+    compact_index1 = compact_index2;
+    compact_index2 = tmp;
+  } else if (compact_index1 == compact_index2) {
+    return false;
+  }
+  // To save iterating over all sparse_map_ entries, simply make the master
+  // entry for index2 point to index1.
+  // This leaves behind a potential chain of parents that needs to be chased,
+  // as above.
+  sparse_map_[compact_map_[compact_index2]] = compact_index1;
+  if (compact_index1 >= 0)
+    compact_map_[compact_index2] = compact_map_[compact_index1];
+  return true;
+}
+
+// Completes one or more Merge operations by further compacting the
+// compact space. Unused compact space indices are removed, and the used
+// ones above shuffled down to fill the gaps.
+// Example:
+// Input sparse_map_: (x indicates -1)
+// x x 0 x 2 x x 4 x 0 x 2 x
+// Output sparse_map_:
+// x x 0 x 1 x x 2 x 0 x 1 x
+// Output compact_map_:
+// 2 4 7.
+void IndexMapBiDi::CompleteMerges() {
+  // Ensure each sparse_map_ entry contains a master compact_map_ index.
+  int compact_size = 0;
+  for (int i = 0; i < sparse_map_.size(); ++i) {
+    int compact_index = MasterCompactIndex(sparse_map_[i]);
+    sparse_map_[i] = compact_index;
+    if (compact_index >= compact_size)
+      compact_size = compact_index + 1;
+  }
+  // Re-generate the compact_map leaving holes for unused indices.
+  compact_map_.init_to_size(compact_size, -1);
+  for (int i = 0; i < sparse_map_.size(); ++i) {
+    if (sparse_map_[i] >= 0) {
+      if (compact_map_[sparse_map_[i]] == -1)
+        compact_map_[sparse_map_[i]] = i;
+    }
+  }
+  // Compact the compact_map, leaving tmp_compact_map saying where each
+  // index went to in the compacted map.
+  GenericVector<inT32> tmp_compact_map;
+  tmp_compact_map.init_to_size(compact_size, -1);
+  compact_size = 0;
+  for (int i = 0; i < compact_map_.size(); ++i) {
+    if (compact_map_[i] >= 0) {
+      tmp_compact_map[i] = compact_size;
+      compact_map_[compact_size++] = compact_map_[i];
+    }
+  }
+  compact_map_.truncate(compact_size);
+  // Now modify the entries in the sparse map to point to the new locations.
+  for (int i = 0; i < sparse_map_.size(); ++i) {
+    if (sparse_map_[i] >= 0) {
+      sparse_map_[i] = tmp_compact_map[sparse_map_[i]];
+    }
+  }
+}
+
+// Writes to the given file. Returns false in case of error.
+bool IndexMapBiDi::Serialize(FILE* fp) const {
+  if (!IndexMap::Serialize(fp)) return false;
+  // Make a vector containing the rest of the map. If the map is many-to-one
+  // then each additional sparse entry needs to be stored.
+  // Normally we store only the compact map to save space.
+  GenericVector<inT32> remaining_pairs;
+  for (int i = 0; i < sparse_map_.size(); ++i) {
+    if (sparse_map_[i] >= 0 && compact_map_[sparse_map_[i]] != i) {
+      remaining_pairs.push_back(i);
+      remaining_pairs.push_back(sparse_map_[i]);
+    }
+  }
+  if (!remaining_pairs.Serialize(fp)) return false;
+  return true;
+}
+
+// Reads from the given file. Returns false in case of error.
+// If swap is true, assumes a big/little-endian swap is needed.
+bool IndexMapBiDi::DeSerialize(bool swap, FILE* fp) {
+  if (!IndexMap::DeSerialize(swap, fp)) return false;
+  GenericVector<inT32> remaining_pairs;
+  if (!remaining_pairs.DeSerialize(swap, fp)) return false;
+  sparse_map_.init_to_size(sparse_size_, -1);
+  for (int i = 0; i < compact_map_.size(); ++i) {
+    sparse_map_[compact_map_[i]] = i;
+  }
+  for (int i = 0; i < remaining_pairs.size(); ++i) {
+    int sparse_index = remaining_pairs[i++];
+    sparse_map_[sparse_index] = remaining_pairs[i];
+  }
+  return true;
+}
+
+// Bulk calls to SparseToCompact.
+// Maps the given array of sparse indices to an array of compact indices.
+// Assumes the input is sorted. The output indices are sorted and uniqued.
+// Return value is the number of "missed" features, being features that
+// don't map to the compact feature space.
+int IndexMapBiDi::MapFeatures(const GenericVector<int>& sparse,
+                              GenericVector<int>* compact) const {
+  compact->truncate(0);
+  int num_features = sparse.size();
+  int missed_features = 0;
+  int prev_good_feature = -1;
+  for (int f = 0; f < num_features; ++f) {
+    int feature = sparse_map_[sparse[f]];
+    if (feature >= 0) {
+      if (feature != prev_good_feature) {
+        compact->push_back(feature);
+        prev_good_feature = feature;
+      }
+    } else {
+      ++missed_features;
+    }
+  }
+  return missed_features;
+}
+
+}  // namespace tesseract.
+
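Reviewer note (not part of the commit): the `Init` / `SetMap` / `Setup` sequence and `MapFeatures` can be followed more easily on a small standalone model. This sketch assumes `std::vector<int>` in place of `GenericVector<inT32>`; `SparseCompactMap` and `BuildMap` are hypothetical names, and the range asserts are additions for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the two vectors held by IndexMapBiDi.
struct SparseCompactMap {
  std::vector<int> sparse_map;   // sparse index -> compact index, or -1.
  std::vector<int> compact_map;  // compact index -> sparse index.
};

// Mirrors Init(size, false), SetMap(i, true) for each listed i, then Setup():
// mapped sparse indices receive compact indices in increasing sparse order.
SparseCompactMap BuildMap(int sparse_size, const std::vector<int>& mapped) {
  SparseCompactMap map;
  map.sparse_map.assign(sparse_size, -1);
  for (int index : mapped) {
    assert(index >= 0 && index < sparse_size);
    map.sparse_map[index] = 0;  // Placeholder: any value >= 0 means mapped.
  }
  for (int i = 0; i < sparse_size; ++i) {
    if (map.sparse_map[i] >= 0) {
      map.sparse_map[i] = static_cast<int>(map.compact_map.size());
      map.compact_map.push_back(i);
    }
  }
  return map;
}

// Bulk sparse -> compact lookup over sorted input. Consecutive duplicates
// are dropped from the output; the return value counts unmapped inputs.
int MapFeatures(const SparseCompactMap& map, const std::vector<int>& sparse,
                std::vector<int>* compact) {
  compact->clear();
  int missed_features = 0;
  int prev_good_feature = -1;
  for (std::size_t f = 0; f < sparse.size(); ++f) {
    assert(sparse[f] >= 0 &&
           sparse[f] < static_cast<int>(map.sparse_map.size()));
    int feature = map.sparse_map[sparse[f]];
    if (feature < 0) {
      ++missed_features;
    } else if (feature != prev_good_feature) {
      compact->push_back(feature);
      prev_good_feature = feature;
    }
  }
  return missed_features;
}
```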
--- a/ccutil/indexmapbidi.h
+++ b/ccutil/indexmapbidi.h
@ -0,0 +1,180 @@
+///////////////////////////////////////////////////////////////////////
+// File:        indexmapbidi.h
+// Description: Bi-directional mapping between a sparse and compact space.
+// Author:      rays@google.com (Ray Smith)
+// Created:     Tue Apr 06 11:33:59 PDT 2010
+//
+// (C) Copyright 2010, Google Inc.
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+///////////////////////////////////////////////////////////////////////
+
+#ifndef TESSERACT_CCUTIL_INDEXMAPBIDI_H_
+#define TESSERACT_CCUTIL_INDEXMAPBIDI_H_
+
+#include <stdio.h>
+#include "genericvector.h"
+
+namespace tesseract {
+
+class IndexMapBiDi;
+
+// Bidirectional one-to-one mapping between a sparse and a compact discrete
+// space. Many entries in the sparse space are unmapped, but those that are
+// mapped have a 1-1 mapping to (and from) the compact space, where all
+// values are used. This is useful for forming subsets of larger collections,
+// such as subsets of character sets, or subsets of binary feature spaces.
+//
+// This base class provides basic functionality with binary search for the
+// SparseToCompact mapping to save memory.
+// For a faster inverse mapping, or to allow a many-to-one mapping, use
+// IndexMapBiDi below.
+// NOTE: there are currently no methods to setup an IndexMap on its own!
+// It must be initialized by copying from an IndexMapBiDi or by DeSerialize.
+class IndexMap {
+ public:
+  virtual ~IndexMap() {}
+
+  // SparseToCompact takes a sparse index to an index in the compact space.
+  // Uses a binary search to find the result. For faster speed use
+  // IndexMapBiDi, but that takes more memory.
+  virtual int SparseToCompact(int sparse_index) const;
+
+  // CompactToSparse takes a compact index to the corresponding index in the
+  // sparse space.
+  int CompactToSparse(int compact_index) const {
+    return compact_map_[compact_index];
+  }
+  // The size of the sparse space.
+  virtual int SparseSize() const {
+    return sparse_size_;
+  }
+  // The size of the compact space.
+  int CompactSize() const {
+    return compact_map_.size();
+  }
+
+  // Copy from the input.
+  void CopyFrom(const IndexMap& src);
+  void CopyFrom(const IndexMapBiDi& src);
+
+  // Writes to the given file. Returns false in case of error.
+  bool Serialize(FILE* fp) const;
+  // Reads from the given file. Returns false in case of error.
+  // If swap is true, assumes a big/little-endian swap is needed.
+  bool DeSerialize(bool swap, FILE* fp);
+
+ protected:
+  // The sparse space covers integers in the range [0, sparse_size_-1].
+  int sparse_size_;
+  // The compact space covers integers in the range [0, compact_map_.size()-1].
+  // Each element contains the corresponding sparse index.
+  GenericVector<inT32> compact_map_;
+};
+
+// Bidirectional many-to-one mapping between a sparse and a compact discrete
+// space. As with IndexMap, many entries may be unmapped, but unlike IndexMap,
+// of those that are, many may be mapped to the same compact index.
+// If the map is many-to-one, it is not possible to directly obtain all the
+// sparse indices that map to a single compact index.
+// This map is time- rather than space-efficient. It stores the entire sparse
+// space.
+// IndexMapBiDi may be initialized in one of 3 ways:
+// 1. Init(size, true);
+//    Setup();
+//    Sets a complete 1:1 mapping with no unmapped elements.
+// 2. Init(size, false);
+//    for ... SetMap(index, true);
+//    Setup();
+//    Specifies precisely which sparse indices are mapped. The mapping is 1:1.
+// 3. Either of the above, followed by:
+//    for ... Merge(index1, index2);
+//    CompleteMerges();
+//    Allows a many-to-one mapping by merging compact space indices.
+class IndexMapBiDi : public IndexMap {
+ public:
+  virtual ~IndexMapBiDi() {}
+
+  // Top-level init function in a single call to initialize a map to select
+  // a single contiguous subrange [start, end) of the sparse space to be mapped
+  // 1 to 1 to the compact space, with all other elements of the sparse space
+  // left unmapped.
+  // No need to call Setup after this.
+  void InitAndSetupRange(int sparse_size, int start, int end);
+
+  // Initializes just the sparse_map_ to the given size with either all
+  // forward indices mapped (all_mapped = true) or none (all_mapped = false).
+  // Call Setup immediately after, or make calls to SetMap first to adjust the
+  // mapping and then call Setup before using the map.
+  void Init(int size, bool all_mapped);
+  // Sets a given index in the sparse_map_ to be mapped or not.
+  void SetMap(int sparse_index, bool mapped);
+  // Sets up the sparse_map_ and compact_map_ properly after Init and
+  // some calls to SetMap. Assumes an ordered 1-1 map from set indices
+  // in the sparse space to the compact space.
+  void Setup();
+
+  // Merges the two compact space indices. May be called many times, but
+  // the merges must be concluded by a call to CompleteMerges.
+  // Returns true if a merge was actually performed.
+  bool Merge(int compact_index1, int compact_index2);
+  // Returns true if the given compact index has been deleted.
+  bool IsCompactDeleted(int index) const {
+    return MasterCompactIndex(index) < 0;
+  }
+  // Completes one or more Merge operations by further compacting the
+  // compact space.
+  void CompleteMerges();
+
+  // SparseToCompact takes a sparse index to an index in the compact space.
+  virtual int SparseToCompact(int sparse_index) const {
+    return sparse_map_[sparse_index];
+  }
+  // The size of the sparse space.
+  virtual int SparseSize() const {
+    return sparse_map_.size();
+  }
+
+  // Copy from the input.
+  void CopyFrom(const IndexMapBiDi& src);
+
+  // Writes to the given file. Returns false in case of error.
+  bool Serialize(FILE* fp) const;
+  // Reads from the given file. Returns false in case of error.
+  // If swap is true, assumes a big/little-endian swap is needed.
+  bool DeSerialize(bool swap, FILE* fp);
+
+  // Bulk calls to SparseToCompact.
+  // Maps the given array of sparse indices to an array of compact indices.
+  // Assumes the input is sorted. The output indices are sorted and uniqued.
+  // Return value is the number of "missed" features, being features that
+  // don't map to the compact feature space.
+  int MapFeatures(const GenericVector<int>& sparse,
+                  GenericVector<int>* compact) const;
+
+ private:
+  // Returns the master compact index for a given compact index.
+  // During a multiple merge operation, several compact indices may be
+  // combined, so we need to be able to find the master of all.
+  int MasterCompactIndex(int compact_index) const {
+    while (compact_index >= 0 &&
+           sparse_map_[compact_map_[compact_index]] != compact_index)
+      compact_index = sparse_map_[compact_map_[compact_index]];
+    return compact_index;
+  }
+
+  // Direct look-up of the compact index for each element in sparse space.
+  GenericVector<inT32> sparse_map_;
+};
+
+}  // namespace tesseract.
+
+#endif  // TESSERACT_CCUTIL_INDEXMAPBIDI_H_
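Reviewer note (not part of the commit): a standalone model of `Merge`, `MasterCompactIndex` and `CompleteMerges`, written to check the lazy-merge logic against the worked example in the `CompleteMerges` comment. It assumes `std::vector<int>` in place of `GenericVector<inT32>`; the class name `MergeMap`, its constructor and its accessors are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class MergeMap {
 public:
  // Maps the listed sparse indices 1:1 onto compact indices 0, 1, 2, ...
  MergeMap(int sparse_size, const std::vector<int>& mapped)
      : sparse_map_(sparse_size, -1) {
    for (int index : mapped) {
      assert(index >= 0 && index < sparse_size);
      sparse_map_[index] = 0;
    }
    for (int i = 0; i < sparse_size; ++i) {
      if (sparse_map_[i] >= 0) {
        sparse_map_[i] = static_cast<int>(compact_map_.size());
        compact_map_.push_back(i);
      }
    }
  }

  // Chases the chain of parents left behind by Merge.
  int Master(int compact_index) const {
    while (compact_index >= 0 &&
           sparse_map_[compact_map_[compact_index]] != compact_index) {
      compact_index = sparse_map_[compact_map_[compact_index]];
    }
    return compact_index;
  }

  // Lazily merges two compact indices; the smaller master index survives.
  bool Merge(int index1, int index2) {
    index1 = Master(index1);
    index2 = Master(index2);
    if (index1 == index2) return false;
    if (index1 > index2) std::swap(index1, index2);
    sparse_map_[compact_map_[index2]] = index1;
    if (index1 >= 0) compact_map_[index2] = compact_map_[index1];
    return true;
  }

  // Resolves all chains, then renumbers the surviving compact indices.
  void CompleteMerges() {
    int compact_size = 0;
    for (std::size_t i = 0; i < sparse_map_.size(); ++i) {
      int master = Master(sparse_map_[i]);
      sparse_map_[i] = master;
      if (master >= compact_size) compact_size = master + 1;
    }
    // Rebuild compact_map_, leaving -1 holes at unused indices.
    compact_map_.assign(compact_size, -1);
    for (std::size_t i = 0; i < sparse_map_.size(); ++i) {
      int c = sparse_map_[i];
      if (c >= 0 && compact_map_[c] == -1) compact_map_[c] = static_cast<int>(i);
    }
    // Squeeze out the holes, remembering where each index moved to.
    std::vector<int> moved_to(compact_size, -1);
    int new_size = 0;
    for (int c = 0; c < compact_size; ++c) {
      if (compact_map_[c] >= 0) {
        moved_to[c] = new_size;
        compact_map_[new_size++] = compact_map_[c];
      }
    }
    compact_map_.resize(new_size);
    for (std::size_t i = 0; i < sparse_map_.size(); ++i) {
      if (sparse_map_[i] >= 0) sparse_map_[i] = moved_to[sparse_map_[i]];
    }
  }

  const std::vector<int>& sparse_map() const { return sparse_map_; }
  const std::vector<int>& compact_map() const { return compact_map_; }

 private:
  std::vector<int> sparse_map_;   // sparse index -> compact index, or -1.
  std::vector<int> compact_map_;  // compact index -> sparse index.
};
```

Starting from mapped sparse indices 2, 4, 7, 9, 11 and merging compact pairs (0, 3) and (1, 4), this model ends with the same output maps as the example in the source comment.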
--- a/ccutil/ocrclass.h
+++ b/ccutil/ocrclass.h
@ -43,193 +43,11 @@
 #define MAX_OCR_NAME    32       /*name of engine */
 #define MAX_OCR_VERSION   17     /*version code of engine */

-/*Image parameters*/
-#define MIN_IMAGE_SIZE    64     /*smallest image that will be passed */
-#define IMAGE_ROUNDING    32     /*all sizes are multiple of this */
-
-#if defined(__SLOW_TIMES__)
-/*Maximum timeouts of various functions (in secs)*/
-#define STARTUP_TIMEOUT   100    /*start of OCR engine */
-#define SHUTDOWN_TIMEOUT  50     /*end of OCR engine */
-#define SENDIM_TIMEOUT    50     /*send of image */
-#define RELEASE_TIMEOUT   50     /*release of semaphore */
-#define READIM_TIMEOUT    100    /*read of image */
-#define READTEXT_TIMEOUT  50     /*read of text */
-#define PROGRESS_TIMEOUT  30     /*progress every 3 seconds */
-#define BADTIMES_TIMEOUT  7      /*max lack of progress */
-#else
-/*Maximum timeouts of various functions (in secs)*/
-#define STARTUP_TIMEOUT   10     /*start of OCR engine */
-#define SHUTDOWN_TIMEOUT  6      /*end of OCR engine */
-#define SENDIM_TIMEOUT    5      /*send of image */
-#define RELEASE_TIMEOUT   5      /*release of semaphore */
-#define READIM_TIMEOUT    10     /*read of image */
-#define READTEXT_TIMEOUT  5      /*read of text */
-#define PROGRESS_TIMEOUT  3      /*progress every 3 seconds */
-#define BADTIMES_TIMEOUT  7      /*max lack of progress */
-#endif
-
-/*language definitions are identical to RTF*/
-#define LANGE_NONE      0x0400   /*no language */
-#define LANGE_ALBANIAN    0x041c /*Albanian */
-#define LANGE_BRITISH   0x0809   /*International English */
-#define LANGE_BULGARIAN   0x0402 /*Bulgarian */
-#define LANGE_CROATIAN    0x041a /*Croatian(latin alphabet) */
-#define LANGE_CZECH     0x0405   /*Czech */
-#define LANGE_DANISH    0x0406   /*Danish */
-#define LANGE_DUTCH     0x0413   /*Dutch */
-#define LANGE_FINNISH   0x040b   /*Finnish */
-#define LANGE_FRENCH    0x040c   /*French */
-#define LANGE_GERMAN    0x0407   /*German */
-#define LANGE_GREEK     0x0408   /*Greek */
-#define LANGE_HUNGARIAN   0x040e /*Hungarian */
-#define LANGE_ITALIAN   0x0410   /*Italian */
-#define LANGE_JAPANESE    0x0411 /*Japanese */
-#define LANGE_KOREAN    0x0412   /*Korean */
-#define LANGE_NORWEGIAN   0x0414 /*Bokmal */
-#define LANGE_POLISH    0x0415   /*Polish */
-#define LANGE_PORTUGESE   0x0416 /*Brazilian Portugese */
-#define LANGE_ROMANIAN    0x0418 /*Romanian */
-#define LANGE_RUSSIAN   0x0419   /*Russian */
-#define LANGE_SCHINESE    0x0804 /*Simplified Chinese */
-#define LANGE_SLOVAK    0x041b   /*Slovak */
-#define LANGE_SPANISH   0x040a   /*Castilian */
-#define LANGE_SWEDISH   0x041d   /*Swedish */
-#define LANGE_TCHINESE    0x0404 /*Traditional Chinese */
-#define LANGE_TURKISH   0x041f   /*Turkish */
-#define LANGE_USENGLISH   0x0409 /*American */
-
-/*font family definitions are identical to RTF*/
-#define FFAM_NONE     0          /*unknown */
-#define FFAM_ROMAN      1        /*serifed prop */
-#define FFAM_SWISS      2        /*sans-serif prop */
-#define FFAM_MODERN     3        /*fixed pitch */
-
-/*character set definitions are identical to RTF*/
-#define CHSET_ANSI      0        /*Ansi efigs */
-#define CHSET_SHIFT_JIS   128    /*JIS X 0208-1990 */
-#define CHSET_KOREAN    129      /*KS C 5601-1992 */
-#define CHSET_SCHINESE    134    /*GB 2312-80 */
-#define CHSET_BIG5      136      /*Big Five */
-#define CHSET_CYRILLIC    204    /*Cyrillic */
-#define CHSET_EEUROPE   238      /*Eastern Europe */
-
 /*pitch set definitions are identical to RTF*/
 #define PITCH_DEF     0          /*default */
 #define PITCH_FIXED     1        /*fixed pitch */
 #define PITCH_VAR     2          /*variable pitch */

-/*Bitmasks for character enhancements.
-OR these together for enhancement in ocr_append_char*/
-#define EUC_BOLD      1          /*bold character */
-#define EUC_ITALIC      2        /*italic char */
-#define EUC_UNDERLINE   4        /*underlined char */
-#define EUC_SUBSCRIPT   8        /*subscript char */
-#define EUC_SUPERSCRIPT   16     /*superscript char */
-
-/*enum for character rendering direction*/
-enum OCR_CHAR_DIRECTION {
-  OCR_CDIR_RIGHT_LEFT,           /*right to left horizontal */
-  OCR_CDIR_LEFT_RIGHT,           /*left to right horizontal */
-  OCR_CDIR_TOP_BOTTOM,           /*top to bottom vertical */
-  OCR_CDIR_BOTTOM_TOP            /*bottom to top vertical */
-};
-
-/*enum for line rendering direction*/
-enum OCR_LINE_DIRECTION {
-  OCR_LDIR_DOWN_RIGHT,           /*horizontal lines go down */
-  /*vertical lines go right */
-  OCR_LDIR_UP_LEFT               /*horizontal lines go up */
-};
-
-/*enum for newline type*/
-enum OCR_NEWLINE_TYPE {
-  OCR_NL_NONE,                   /*not a newline */
-  OCR_NL_NEWLINE,                /*this is a newline but not new para */
-  OCR_NL_NEWPARA                 /*this is a newline and a new para */
-};
-
-/*error codes that can be returned from the API functions other than OKAY
-and HPERR*/
-#define OCR_API_NO_MEM    (-2)   /*filled output buffer */
-#define OCR_API_BAD_CHAR  (-3)   /*whitespace sent to ocr_append_char */
-#define OCR_API_BAD_STATE (-4)   /*invalid call sequence */
-
-/*error codes used for passing errors back to the HP side*/
-enum OCR_ERR_CODE {
-  OCR_ERR_NONE,                  /*no error */
-  OCR_ERR_CLEAN_EXIT,            /*no error */
-  OCR_ERR_NO_MEM,                /*out of memory */
-  OCR_ERR_FILE_READ,             /*failed to read data file */
-  OCR_ERR_TMP_WRITE,             /*failed to write temp file */
-  OCR_ERR_TMP_READ,              /*failed to read temp file */
-  OCR_ERR_BAD_DLL,               /*missing or invalid dll subcomponent */
-  OCR_ERR_BAD_EXE,               /*missing or invalid exe subcomponent */
-  OCR_ERR_BAD_LOAD,              /*failed to load subcomponent */
-  OCR_ERR_BAD_LANG,              /*unable to recognize requested language */
-  OCR_ERR_BAD_STATE,             /*engine did call out of sequence */
-  OCR_ERR_INTERNAL1,             /*internal error type 1 */
-  OCR_ERR_INTERNAL2,             /*internal error type 1 */
-  OCR_ERR_INTERNAL3,             /*internal error type 1 */
-  OCR_ERR_INTERNAL4,             /*internal error type 1 */
-  OCR_ERR_INTERNAL5,             /*internal error type 1 */
-  OCR_ERR_INTERNAL6,             /*internal error type 1 */
-  OCR_ERR_INTERNAL7,             /*internal error type 1 */
-  OCR_ERR_INTERNAL8,             /*internal error type 1 */
-  OCR_ERR_TIMEOUT                /*timed out in comms */
-};                               /*for calls to ocr_error */
-
-/**********************************************************************
- * EFONT_DESC
- * Description of one font.
- * The information required is basically that used by RTF.
- * The name may be either a valid font on the system or the empty string.
- **********************************************************************/
-
-typedef struct {                  /*font description */
-  uinT16 language;               /*default language */
-  uinT8 font_family;             /*serif/not, fixed/not */
-  uinT8 char_set;                /*character set standard */
-  uinT8 pitch;                   /*fixed or prop */
-  inT8 name[MAX_FONT_NAME + 1];  /*plain ascii name */
-} EFONT_DESC;                    /*font description */
-
-/**********************************************************************
- * EOCR_DESC
- * Description of the OCR engine provided at startup.
- * The name and version may be reported to the user at some point.
- * The fonts array should indicate the fonts that the OCR system
- * can recognize.
- **********************************************************************/
-
-typedef struct {                  /*startup info */
-  inT32 protocol;                /*interface version */
-  uinT32 font_count;             /*number of fonts */
-  uinT16 language;               /*default language */
-  uinT16 name[MAX_OCR_NAME + 1]; /*name of engine */
-                                 /*version of engine */
-  uinT16 version[MAX_OCR_VERSION + 1];
-  EFONT_DESC fonts[1];           /*array of fonts */
-} EOCR_DESC;                     /*startup info */
-
-/**********************************************************************
- * ESTRIP_DESC
- * Description of the image strip as it is passed to the engine.
- * The image is always 1 bit, with 1=black.
- * The width is always a multiple of 32, so padding is always OK.
- * The height of the full image is always a multiple of 32.
- * The top y coordinate is 0, and increases down.
- * The top leftmost pixel is in the most significant bit of the first byte.
- **********************************************************************/
-
-typedef struct {                  /*bitmap strip */
-  inT16 x_size;                  /*width in pixels */
-  inT16 y_size;                  /*of full image */
-  inT16 strip_size;              /*of this strip */
-  inT16 resolution;              /*pixels per inch */
-  uinT8 data[8];                 /*image data */
-} ESTRIP_DESC;                   /*bitmap strip */
-
 /**********************************************************************
 * EANYCODE_CHAR
 * Description of a single character. The character code is defined by
--- a/ccutil/params.cpp
+++ b/ccutil/params.cpp
@ -40,7 +40,8 @@ tesseract::ParamsVectors *GlobalParams() {

 namespace tesseract {

-bool ParamUtils::ReadParamsFile(const char *file, bool init_only,
+bool ParamUtils::ReadParamsFile(const char *file,
+                                SetParamConstraint constraint,
                                ParamsVectors *member_params) {
  char flag;                     // file flag
  inT16 nameoffset;              // offset for real name
@ -63,11 +64,12 @@ bool ParamUtils::ReadParamsFile(const char *file, bool init_only,
    tprintf("read_params_file: Can't open %s\n", file + nameoffset);
    return true;
  }
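-  return ReadParamsFromFp(fp, -1, init_only, member_params);
-  fclose(fp);
+  // Close the file before returning; an fclose() after the return is unreachable.
+  bool anyerr = ReadParamsFromFp(fp, -1, constraint, member_params);
+  fclose(fp);
+  return anyerr;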
 }

-bool ParamUtils::ReadParamsFromFp(FILE *fp, inT64 end_offset, bool init_only,
+bool ParamUtils::ReadParamsFromFp(FILE *fp, inT64 end_offset,
+                                  SetParamConstraint constraint,
                                  ParamsVectors *member_params) {
  char line[MAX_PATH];           // input line
  bool anyerr = false;           // true if any error
@ -89,7 +91,7 @@ bool ParamUtils::ReadParamsFromFp(FILE *fp, inT64 end_offset, bool init_only,
          valptr++;              // find end of blanks
        while (*valptr == ' ' || *valptr == '\t');
      }
-      foundit = SetParam(line, valptr, init_only, member_params);
+      foundit = SetParam(line, valptr, constraint, member_params);

      if (!foundit) {
        anyerr = true;         // had an error
@ -102,24 +104,25 @@ bool ParamUtils::ReadParamsFromFp(FILE *fp, inT64 end_offset, bool init_only,
 }

 bool ParamUtils::SetParam(const char *name, const char* value,
-                          bool init_only, ParamsVectors *member_params) {
+                          SetParamConstraint constraint,
+                          ParamsVectors *member_params) {
  // Look for the parameter among string parameters.
  StringParam *sp = FindParam<StringParam>(name, GlobalParams()->string_params,
                                           member_params->string_params);
-  if (sp != NULL && (!init_only || sp->is_init())) sp->set_value(value);
+  if (sp != NULL && sp->constraint_ok(constraint)) sp->set_value(value);
  if (*value == '\0') return (sp != NULL);

  // Look for the parameter among int parameters.
  int intval;
  IntParam *ip = FindParam<IntParam>(name, GlobalParams()->int_params,
                                     member_params->int_params);
-  if (ip && (!init_only || ip->is_init()) &&
+  if (ip && ip->constraint_ok(constraint) &&
      sscanf(value, INT32FORMAT, &intval) == 1) ip->set_value(intval);

  // Look for the parameter among bool parameters.
  BoolParam *bp = FindParam<BoolParam>(name, GlobalParams()->bool_params,
                                       member_params->bool_params);
-  if (bp != NULL && (!init_only || bp->is_init())) {
+  if (bp != NULL && bp->constraint_ok(constraint)) {
    if (*value == 'T' || *value == 't' ||
        *value == 'Y' || *value == 'y' || *value == '1') {
      bp->set_value(true);
@ -133,7 +136,7 @@ bool ParamUtils::SetParam(const char *name, const char* value,
  double doubleval;
  DoubleParam *dp = FindParam<DoubleParam>(name, GlobalParams()->double_params,
                                           member_params->double_params);
-  if (dp != NULL && (!init_only || dp->is_init())) {
+  if (dp != NULL && dp->constraint_ok(constraint)) {
 #ifdef EMBEDDED
      doubleval = strtofloat(value);
 #else
--- a/ccutil/params.h
+++ b/ccutil/params.h
@ -32,6 +32,14 @@ class BoolParam;
 class StringParam;
 class DoubleParam;

+// Enum for constraints on what kind of params should be set by SetParam().
+enum SetParamConstraint {
+  SET_PARAM_CONSTRAINT_NONE,
+  SET_PARAM_CONSTRAINT_DEBUG_ONLY,
+  SET_PARAM_CONSTRAINT_NON_DEBUG_ONLY,
+  SET_PARAM_CONSTRAINT_NON_INIT_ONLY,
+};
+
 struct ParamsVectors {
  GenericVector<IntParam *> int_params;
  GenericVector<BoolParam *> bool_params;
@ -49,17 +57,18 @@ class ParamUtils {
  // Values may have any whitespace after the name and are the rest of line.
  static bool ReadParamsFile(
      const char *file,   // filename to read
-      bool init_only,     // only set parameters that need to be
-                        // initialized when Init() is called
+      SetParamConstraint constraint,
      ParamsVectors *member_params);

  // Read parameters from the given file pointer (stop at end_offset).
-  static bool ReadParamsFromFp(FILE *fp, inT64 end_offset, bool init_only,
-                                 ParamsVectors *member_params);
+  static bool ReadParamsFromFp(FILE *fp, inT64 end_offset,
+                               SetParamConstraint constraint,
+                               ParamsVectors *member_params);

  // Set a parameter to have the given value.
  static bool SetParam(const char *name, const char* value,
-                         bool init_only, ParamsVectors *member_params);
+                       SetParamConstraint constraint,
+                       ParamsVectors *member_params);

  // Returns the pointer to the parameter with the given name (of the
  // appropriate type) if it was found in the vector obtained from
@ -105,14 +114,27 @@ class Param {
  const char *name_str() const { return name_; }
  const char *info_str() const { return info_; }
  bool is_init() const { return init_; }
+  bool is_debug() const { return debug_; }
+  bool constraint_ok(SetParamConstraint constraint) const {
+    return (constraint == SET_PARAM_CONSTRAINT_NONE ||
+            (constraint == SET_PARAM_CONSTRAINT_DEBUG_ONLY &&
+             this->is_debug()) ||
+            (constraint == SET_PARAM_CONSTRAINT_NON_DEBUG_ONLY &&
+             !this->is_debug()) ||
+            (constraint == SET_PARAM_CONSTRAINT_NON_INIT_ONLY &&
+             !this->is_init()));
+  }

 protected:
  Param(const char *name, const char *comment, bool init) :
-    name_(name), info_(comment), init_(init) {}
+    name_(name), info_(comment), init_(init) {
+    debug_ = (strstr(name, "debug") != NULL) ||
+             (strstr(name, "display") != NULL);
+  }

-    const char *name_;      // name of this parameter
-    const char *info_;      // for menus
-    bool init_;             // needs to be set before init
+  const char *name_;      // name of this parameter
+  const char *info_;      // for menus
+  bool init_;             // needs to be set before init
+  bool debug_;
 };

 class IntParam : public Param {
@ -124,7 +146,7 @@ class IntParam : public Param {
    vec->int_params.push_back(this);
  }
  ~IntParam() { ParamUtils::RemoveParam<IntParam>(this, params_vec_); }
-  operator inT32() { return value_; }
+  operator inT32() const { return value_; }
  void set_value(inT32 value) { value_ = value; }

 private:
@ -142,12 +164,12 @@ class BoolParam : public Param {
    vec->bool_params.push_back(this);
  }
  ~BoolParam() { ParamUtils::RemoveParam<BoolParam>(this, params_vec_); }
-  operator BOOL8() { return value_; }
+  operator BOOL8() const { return value_; }
  void set_value(BOOL8 value) { value_ = value; }

 private:
  BOOL8 value_;
-  // Pointer to the vector that contains this param (not owened by this class).
+  // Pointer to the vector that contains this param (not owned by this class).
  GenericVector<BoolParam *> *params_vec_;
 };

@ -163,6 +185,7 @@ class StringParam : public Param {
  ~StringParam() { ParamUtils::RemoveParam<StringParam>(this, params_vec_); }
  operator STRING &() { return value_; }
  const char *string() const { return value_.string(); }
+  bool empty() const { return value_.length() <= 0; }
  void set_value(const STRING &value) { value_ = value; }

 private:
@ -180,12 +203,12 @@ class DoubleParam : public Param {
    vec->double_params.push_back(this);
  }
  ~DoubleParam() { ParamUtils::RemoveParam<DoubleParam>(this, params_vec_); }
-  operator double() { return value_; }
+  operator double() const { return value_; }
  void set_value(double value) { value_ = value; }

 private:
  double value_;
-  // Pointer to the vector that contains this param (not owened by this class).
+  // Pointer to the vector that contains this param (not owned by this class).
  GenericVector<DoubleParam *> *params_vec_;
 };
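
A minimal sketch of how the new constraint check filters parameters, assuming a simplified stand-in for Param (ParamSketch and the parameter names used with it are hypothetical, not part of Tesseract). One point the diff implies: the removed init_only flag, when true, selected parameters for which is_init() is true, whereas SET_PARAM_CONSTRAINT_NON_INIT_ONLY selects those for which it is false. So init_only == false corresponds to SET_PARAM_CONSTRAINT_NONE, but init_only == true has no direct equivalent in the enum shown here.

```cpp
#include <assert.h>
#include <string.h>

// Mirrors the enum added to params.h in this commit.
enum SetParamConstraint {
  SET_PARAM_CONSTRAINT_NONE,
  SET_PARAM_CONSTRAINT_DEBUG_ONLY,
  SET_PARAM_CONSTRAINT_NON_DEBUG_ONLY,
  SET_PARAM_CONSTRAINT_NON_INIT_ONLY,
};

// Hypothetical stand-in for tesseract::Param; only the two flags matter.
struct ParamSketch {
  bool init;   // needs to be set before init
  bool debug;  // derived from the name, as in the Param constructor

  ParamSketch(const char* name, bool init_flag)
      : init(init_flag),
        debug(strstr(name, "debug") != NULL ||
              strstr(name, "display") != NULL) {}

  // Same truth table as Param::constraint_ok() in the diff.
  bool constraint_ok(SetParamConstraint constraint) const {
    switch (constraint) {
      case SET_PARAM_CONSTRAINT_NONE:           return true;
      case SET_PARAM_CONSTRAINT_DEBUG_ONLY:     return debug;
      case SET_PARAM_CONSTRAINT_NON_DEBUG_ONLY: return !debug;
      case SET_PARAM_CONSTRAINT_NON_INIT_ONLY:  return !init;
    }
    return false;
  }
};
```

Because debug_ is inferred from a substring match on the name, any parameter whose name merely contains "debug" or "display" is treated as a debug parameter.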

--- a/ccutil/strngs.cpp
+++ b/ccutil/strngs.cpp
@ -17,9 +17,11 @@
 *
 **********************************************************************/

-#include          "mfcpch.h"     //precompiled headers
+#include          "mfcpch.h"     // Precompiled headers
+#include          "helpers.h"
 #include          "tprintf.h"
 #include          "strngs.h"
+#include          "genericvector.h"

 #include <assert.h>
 // Size of buffer needed to host the decimal representation of the maximum
@ -122,6 +124,25 @@ STRING::~STRING() {
  DiscardData();
 }

+// Writes to the given file. Returns false in case of error.
+bool STRING::Serialize(FILE* fp) const {
+  inT32 len = length();
+  if (fwrite(&len, sizeof(len), 1, fp) != 1) return false;
+  if (fwrite(GetCStr(), 1, len, fp) != len) return false;
+  return true;
+}
+// Reads from the given file. Returns false in case of error.
+// If swap is true, assumes a big/little-endian swap is needed.
+bool STRING::DeSerialize(bool swap, FILE* fp) {
+  inT32 len;
+  if (fread(&len, sizeof(len), 1, fp) != 1) return false;
+  if (swap)
+    ReverseN(&len, sizeof(len));
+  truncate_at(len);
+  if (fread(GetCStr(), 1, len, fp) != len) return false;
+  return true;
+}
+
 BOOL8 STRING::contains(const char c) const {
  return (c != '\0') && (strchr (GetCStr(), c) != NULL);
 }
@ -197,14 +218,14 @@ void STRING::erase_range(inT32 index, int len) {
  assert(InvariantOk());
 }

+#else
 void STRING::truncate_at(inT32 index) {
-  char* this_cstr = ensure_cstr(index);
+  char* this_cstr = ensure_cstr(index + 1);
  this_cstr[index] = '\0';
-  GetHeader()->used_ = index;
+  GetHeader()->used_ = index + 1;
  assert(InvariantOk());
 }

-#else
 char& STRING::operator[](inT32 index) const {
  // Code is casting away this const and mutating the string,
  // so mark used_ as -1 to flag it unreliable.
@ -213,6 +234,26 @@ char& STRING::operator[](inT32 index) const {
 }
 #endif

+void STRING::split(const char c, GenericVector<STRING> *splited) {
+  int start_index = 0;
+  for (int i = 0; i < length(); i++) {
+    if ((*this)[i] == c) {
+      if (i != start_index) {
+        (*this)[i] = '\0';
+        STRING tmp = GetCStr() + start_index;
+        splited->push_back(tmp);
+        (*this)[i] = c;
+      }
+      start_index = i + 1;
+    }
+  }
+
+  if (length() != start_index) {
+    STRING tmp = GetCStr() + start_index;
+    splited->push_back(tmp);
+  }
+}
+
 BOOL8 STRING::operator==(const STRING& str) const {
  FixHeader();
  str.FixHeader();
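
For readers unfamiliar with STRING, the splitting rule added above can be sketched with standard containers. This is an illustration only, assuming std::string and std::vector in place of STRING and GenericVector; SplitSkippingEmpty is a hypothetical name. Like STRING::split, it drops empty pieces, so consecutive, leading, or trailing separators do not produce empty entries.

```cpp
#include <assert.h>
#include <string>
#include <vector>

// Splits s on separator c and returns the non-empty pieces in order.
std::vector<std::string> SplitSkippingEmpty(const std::string& s, char c) {
  std::vector<std::string> pieces;
  std::string::size_type start = 0;
  for (std::string::size_type i = 0; i < s.size(); ++i) {
    if (s[i] == c) {
      // Emit the run [start, i) unless it is empty.
      if (i != start) pieces.push_back(s.substr(start, i - start));
      start = i + 1;
    }
  }
  // Emit the tail after the last separator, if any.
  if (start != s.size()) pieces.push_back(s.substr(start));
  return pieces;
}
```

Unlike the member function in the diff, this version does not temporarily overwrite the separator with '\0' in the source string, so it can take its input by const reference.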
--- a/ccutil/strngs.h
+++ b/ccutil/strngs.h
@ -23,6 +23,7 @@
 #include          <string.h>
 #include          "memry.h"
 #include          "serialis.h"
+#include          "genericvector.h"

 // STRING_IS_PROTECTED means that  string[index] = X is invalid
 // because you have to go through strings interface to modify it.
@ -42,7 +43,8 @@
 #define CCUTIL_API
 #endif

-class CCUTIL_API STRING
+
+class DLLSYM STRING
 {
  public:
    STRING();
@ -50,8 +52,15 @@ class CCUTIL_API STRING
    STRING(const char *string);
    ~STRING ();

+    // Writes to the given file. Returns false in case of error.
+    bool Serialize(FILE* fp) const;
+    // Reads from the given file. Returns false in case of error.
+    // If swap is true, assumes a big/little-endian swap is needed.
+    bool DeSerialize(bool swap, FILE* fp);
+
    BOOL8 contains(const char c) const;
    inT32 length() const;
+    inT32 size() const { return length(); }
    const char *string() const;

 #if STRING_IS_PROTECTED
@ -59,10 +68,11 @@ class CCUTIL_API STRING
    // len is number of chars in s to insert starting at index in this string
    void insert_range(inT32 index, const char*s, int len);
    void erase_range(inT32 index, int len);
-    void truncate_at(inT32 index);
 #else
    char &operator[] (inT32 index) const;
 #endif
+    void split(const char c, GenericVector<STRING> *splited);
+    void truncate_at(inT32 index);

    BOOL8 operator== (const STRING & string) const;
    BOOL8 operator!= (const STRING & string) const;
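
The Serialize/DeSerialize pair declared above uses a length-prefixed layout: a 32-bit length followed by the raw bytes, with an optional byte swap of the length on read. A rough sketch of that layout, assuming std::string instead of STRING (WriteString, ReadString, and ReverseBytes are hypothetical names standing in for Serialize, DeSerialize, and ReverseN):

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string>

// Reverses the n bytes at p in place (stand-in for ReverseN).
static void ReverseBytes(void* p, size_t n) {
  unsigned char* b = static_cast<unsigned char*>(p);
  for (size_t i = 0; i < n / 2; ++i) {
    unsigned char tmp = b[i];
    b[i] = b[n - 1 - i];
    b[n - 1 - i] = tmp;
  }
}

// Writes a 32-bit length followed by the bytes. Returns false on error.
bool WriteString(const std::string& s, FILE* fp) {
  int32_t len = static_cast<int32_t>(s.size());
  if (fwrite(&len, sizeof(len), 1, fp) != 1) return false;
  return fwrite(s.data(), 1, s.size(), fp) == s.size();
}

// Reads what WriteString wrote. If swap is true, the length is assumed to
// have been written on a machine of the opposite endianness.
bool ReadString(bool swap, FILE* fp, std::string* out) {
  int32_t len;
  if (fread(&len, sizeof(len), 1, fp) != 1) return false;
  if (swap) ReverseBytes(&len, sizeof(len));
  if (len < 0) return false;  // extra guard, not present in the diff
  out->resize(static_cast<size_t>(len));
  if (len == 0) return true;
  return fread(&(*out)[0], 1, static_cast<size_t>(len), fp) ==
         static_cast<size_t>(len);
}
```

Only the length needs swapping; the payload is a byte sequence and is endian-neutral. The negative-length check is an addition for the sketch and is labeled as such in the code.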
--- a/ccutil/tesscallback.h
+++ b/ccutil/tesscallback.h
@ -1017,12 +1017,12 @@ struct Identity {
  typedef T type;
 };

-template <bool del, class R, class T, class P1, class A1, class A2>
-class _ConstTessMemberResultCallback_1_2
-  : public TessResultCallback2<R,A1,A2> {
+template <bool del, class R, class T, class P1, class A1, class A2, class A3>
+class _ConstTessMemberResultCallback_1_3
+  : public TessResultCallback3<R,A1,A2,A3> {
 public:
-  typedef TessResultCallback2<R,A1,A2> base;
-  typedef R (T::*MemberSignature)(P1,A1,A2) const;
+  typedef TessResultCallback3<R,A1,A2,A3> base;
+  typedef R (T::*MemberSignature)(P1,A1,A2,A3) const;

 private:
   T* object_;
@ -1030,16 +1030,16 @@ class _ConstTessMemberResultCallback_1_2
  typename remove_reference<P1>::type p1_;

 public:
-  inline _ConstTessMemberResultCallback_1_2(T* object,
+  inline _ConstTessMemberResultCallback_1_3(T* object,
                                            MemberSignature member, P1 p1)
    : object_(object), member_(member), p1_(p1) { }

-  virtual R Run(A1 a1, A2 a2) {
+  virtual R Run(A1 a1, A2 a2, A3 a3) {
    if (!del) {
-      R result = (object_->*member_)(p1_,a1,a2);
+      R result = (object_->*member_)(p1_,a1,a2,a3);
      return result;
    } else {
-      R result = (object_->*member_)(p1_,a1,a2);
+      R result = (object_->*member_)(p1_,a1,a2,a3);
      //  zero out the pointer to ensure segfault if used again
      member_ = NULL;
      delete this;
@ -1048,12 +1048,12 @@ class _ConstTessMemberResultCallback_1_2
  }
 };

-template <bool del, class T, class P1, class A1, class A2>
-class _ConstTessMemberResultCallback_1_2<del, void, T, P1, A1, A2>
-  : public TessCallback2<A1,A2> {
+template <bool del, class T, class P1, class A1, class A2, class A3>
+class _ConstTessMemberResultCallback_1_3<del, void, T, P1, A1, A2, A3>
+  : public TessCallback3<A1,A2,A3> {
 public:
-  typedef TessCallback2<A1,A2> base;
-  typedef void (T::*MemberSignature)(P1,A1,A2) const;
+  typedef TessCallback3<A1,A2,A3> base;
+  typedef void (T::*MemberSignature)(P1,A1,A2,A3) const;

 private:
   T* object_;
@ -1061,15 +1061,15 @@ class _ConstTessMemberResultCallback_1_2<del, void, T, P1, A1, A2>
  typename remove_reference<P1>::type p1_;

 public:
-  inline _ConstTessMemberResultCallback_1_2(T* object,
+  inline _ConstTessMemberResultCallback_1_3(T* object,
                                            MemberSignature member, P1 p1)
    : object_(object), member_(member), p1_(p1) { }

-  virtual void Run(A1 a1, A2 a2) {
+  virtual void Run(A1 a1, A2 a2, A3 a3) {
    if (!del) {
-      (object_->*member_)(p1_,a1,a2);
+      (object_->*member_)(p1_,a1,a2,a3);
    } else {
-      (object_->*member_)(p1_,a1,a2);
+      (object_->*member_)(p1_,a1,a2,a3);
      //  zero out the pointer to ensure segfault if used again
      member_ = NULL;
      delete this;
@ -1078,26 +1078,26 @@ class _ConstTessMemberResultCallback_1_2<del, void, T, P1, A1, A2>
 };

 #ifndef SWIG
-template <class T1, class T2, class R, class P1, class A1, class A2>
-inline typename _ConstTessMemberResultCallback_1_2<true,R,T1,P1,A1,A2>::base*
-NewTessCallback( T1* obj, R (T2::*member)(P1,A1,A2) , typename Identity<P1>::type p1) {
-  return new _ConstTessMemberResultCallback_1_2<true,R,T1,P1,A1,A2>(obj, member, p1);
+template <class T1, class T2, class R, class P1, class A1, class A2, class A3>
+inline typename _ConstTessMemberResultCallback_1_3<true,R,T1,P1,A1,A2,A3>::base*
+NewTessCallback( T1* obj, R (T2::*member)(P1,A1,A2,A3) , typename Identity<P1>::type p1) {
+  return new _ConstTessMemberResultCallback_1_3<true,R,T1,P1,A1,A2,A3>(obj, member, p1);
 }
 #endif

 #ifndef SWIG
-template <class T1, class T2, class R, class P1, class A1, class A2>
-inline typename _ConstTessMemberResultCallback_1_2<false,R,T1,P1,A1,A2>::base*
-NewPermanentTessCallback( T1* obj, R (T2::*member)(P1,A1,A2) , typename Identity<P1>::type p1) {
-  return new _ConstTessMemberResultCallback_1_2<false,R,T1,P1,A1,A2>(obj, member, p1);
+template <class T1, class T2, class R, class P1, class A1, class A2, class A3>
+inline typename _ConstTessMemberResultCallback_1_3<false,R,T1,P1,A1,A2,A3>::base*
+NewPermanentTessCallback( T1* obj, R (T2::*member)(P1,A1,A2,A3) , typename Identity<P1>::type p1) {
+  return new _ConstTessMemberResultCallback_1_3<false,R,T1,P1,A1,A2,A3>(obj, member, p1);
 }
 #endif

-template <bool del, class R, class T, class P1, class A1, class A2>
-class _TessMemberResultCallback_1_2 : public TessResultCallback2<R,A1,A2> {
+template <bool del, class R, class T, class P1, class A1, class A2, class A3>
+class _TessMemberResultCallback_1_3 : public TessResultCallback3<R,A1,A2,A3> {
 public:
-  typedef TessResultCallback2<R,A1,A2> base;
-  typedef R (T::*MemberSignature)(P1,A1,A2) ;
+  typedef TessResultCallback3<R,A1,A2,A3> base;
+  typedef R (T::*MemberSignature)(P1,A1,A2,A3) ;

 private:
   T* object_;
@ -1105,16 +1105,16 @@ class _TessMemberResultCallback_1_2 : public TessResultCallback2<R,A1,A2> {
  typename remove_reference<P1>::type p1_;

 public:
-  inline _TessMemberResultCallback_1_2(T* object,
+  inline _TessMemberResultCallback_1_3(T* object,
                                        MemberSignature member, P1 p1)
    : object_(object), member_(member), p1_(p1) { }

-  virtual R Run(A1 a1, A2 a2) {
+  virtual R Run(A1 a1, A2 a2, A3 a3) {
    if (!del) {
-      R result = (object_->*member_)(p1_,a1,a2);
+      R result = (object_->*member_)(p1_,a1,a2,a3);
      return result;
    } else {
-      R result = (object_->*member_)(p1_,a1,a2);
+      R result = (object_->*member_)(p1_,a1,a2,a3);
      //  zero out the pointer to ensure segfault if used again
      member_ = NULL;
      delete this;
@ -1123,12 +1123,12 @@ class _TessMemberResultCallback_1_2 : public TessResultCallback2<R,A1,A2> {
  }
 };

-template <bool del, class T, class P1, class A1, class A2>
-class _TessMemberResultCallback_1_2<del, void, T, P1, A1, A2>
-  : public TessCallback2<A1,A2> {
+template <bool del, class T, class P1, class A1, class A2, class A3>
+class _TessMemberResultCallback_1_3<del, void, T, P1, A1, A2, A3>
+  : public TessCallback3<A1,A2,A3> {
 public:
-  typedef TessCallback2<A1,A2> base;
-  typedef void (T::*MemberSignature)(P1,A1,A2) ;
+  typedef TessCallback3<A1,A2,A3> base;
+  typedef void (T::*MemberSignature)(P1,A1,A2,A3) ;

 private:
   T* object_;
@ -1136,15 +1136,15 @@ class _TessMemberResultCallback_1_2<del, void, T, P1, A1, A2>
  typename remove_reference<P1>::type p1_;

 public:
-  inline _TessMemberResultCallback_1_2(T* object,
+  inline _TessMemberResultCallback_1_3(T* object,
                                        MemberSignature member, P1 p1)
    : object_(object), member_(member), p1_(p1) { }

-  virtual void Run(A1 a1, A2 a2) {
+  virtual void Run(A1 a1, A2 a2, A3 a3) {
    if (!del) {
-      (object_->*member_)(p1_,a1,a2);
+      (object_->*member_)(p1_,a1,a2,a3);
    } else {
-      (object_->*member_)(p1_,a1,a2);
+      (object_->*member_)(p1_,a1,a2,a3);
      //  zero out the pointer to ensure segfault if used again
      member_ = NULL;
      delete this;
@ -1153,41 +1153,41 @@ class _TessMemberResultCallback_1_2<del, void, T, P1, A1, A2>
 };

 #ifndef SWIG
-template <class T1, class T2, class R, class P1, class A1, class A2>
-inline typename _TessMemberResultCallback_1_2<true,R,T1,P1,A1,A2>::base*
-NewTessCallback( T1* obj, R (T2::*member)(P1,A1,A2) , typename Identity<P1>::type p1) {
-  return new _TessMemberResultCallback_1_2<true,R,T1,P1,A1,A2>(obj, member, p1);
+template <class T1, class T2, class R, class P1, class A1, class A2, class A3>
+inline typename _TessMemberResultCallback_1_3<true,R,T1,P1,A1,A2,A3>::base*
+NewTessCallback( T1* obj, R (T2::*member)(P1,A1,A2,A3) , typename Identity<P1>::type p1) {
+  return new _TessMemberResultCallback_1_3<true,R,T1,P1,A1,A2,A3>(obj, member, p1);
 }
 #endif

 #ifndef SWIG
-template <class T1, class T2, class R, class P1, class A1, class A2>
-inline typename _TessMemberResultCallback_1_2<false,R,T1,P1,A1,A2>::base*
-NewPermanentTessCallback( T1* obj, R (T2::*member)(P1,A1,A2) , typename Identity<P1>::type p1) {
-  return new _TessMemberResultCallback_1_2<false,R,T1,P1,A1,A2>(obj, member, p1);
+template <class T1, class T2, class R, class P1, class A1, class A2, class A3>
+inline typename _TessMemberResultCallback_1_3<false,R,T1,P1,A1,A2,A3>::base*
+NewPermanentTessCallback( T1* obj, R (T2::*member)(P1,A1,A2,A3) , typename Identity<P1>::type p1) {
+  return new _TessMemberResultCallback_1_3<false,R,T1,P1,A1,A2,A3>(obj, member, p1);
 }
 #endif

-template <bool del, class R, class P1, class A1, class A2>
-class _TessFunctionResultCallback_1_2 : public TessCallback2<A1,A2> {
+template <bool del, class R, class P1, class A1, class A2, class A3>
+class _TessFunctionResultCallback_1_3 : public TessResultCallback3<R,A1,A2,A3> {
 public:
-  typedef TessCallback2<A1,A2> base;
-  typedef R (*FunctionSignature)(P1,A1,A2);
+  typedef TessResultCallback3<R,A1,A2,A3> base;
+  typedef R (*FunctionSignature)(P1,A1,A2,A3);

 private:
  FunctionSignature function_;
  typename remove_reference<P1>::type p1_;

 public:
-  inline _TessFunctionResultCallback_1_2(FunctionSignature function, P1 p1)
+  inline _TessFunctionResultCallback_1_3(FunctionSignature function, P1 p1)
    : function_(function), p1_(p1) { }

-  virtual R Run(A1 a1, A2 a2) {
+  virtual R Run(A1 a1, A2 a2, A3 a3) {
    if (!del) {
-      R result = (*function_)(p1_,a1,a2);
+      R result = (*function_)(p1_,a1,a2,a3);
      return result;
    } else {
-      R result = (*function_)(p1_,a1,a2);
+      R result = (*function_)(p1_,a1,a2,a3);
      //  zero out the pointer to ensure segfault if used again
      function_ = NULL;
      delete this;
@ -1196,26 +1196,26 @@ class _TessFunctionResultCallback_1_2 : public TessCallback2<A1,A2> {
  }
 };

-template <bool del, class P1, class A1, class A2>
-class _TessFunctionResultCallback_1_2<del, void, P1, A1, A2>
-  : public TessCallback2<A1,A2> {
+template <bool del, class P1, class A1, class A2, class A3>
+class _TessFunctionResultCallback_1_3<del, void, P1, A1, A2, A3>
+  : public TessCallback3<A1,A2,A3> {
 public:
-  typedef TessCallback2<A1,A2> base;
-  typedef void (*FunctionSignature)(P1,A1,A2);
+  typedef TessCallback3<A1,A2,A3> base;
+  typedef void (*FunctionSignature)(P1,A1,A2,A3);

 private:
  FunctionSignature function_;
  typename remove_reference<P1>::type p1_;

 public:
-  inline _TessFunctionResultCallback_1_2(FunctionSignature function, P1 p1)
+  inline _TessFunctionResultCallback_1_3(FunctionSignature function, P1 p1)
    : function_(function), p1_(p1) { }

-  virtual void Run(A1 a1, A2 a2) {
+  virtual void Run(A1 a1, A2 a2, A3 a3) {
    if (!del) {
-      (*function_)(p1_,a1,a2);
+      (*function_)(p1_,a1,a2,a3);
    } else {
-      (*function_)(p1_,a1,a2);
+      (*function_)(p1_,a1,a2,a3);
      //  zero out the pointer to ensure segfault if used again
      function_ = NULL;
      delete this;
@ -1223,16 +1223,16 @@ class _TessFunctionResultCallback_1_2<del, void, P1, A1, A2>
  }
 };

-template <class R, class P1, class A1, class A2>
-inline typename _TessFunctionResultCallback_1_2<true,R,P1,A1,A2>::base*
-NewTessCallback(R (*function)(P1,A1,A2), typename Identity<P1>::type p1) {
-  return new _TessFunctionResultCallback_1_2<true,R,P1,A1,A2>(function, p1);
+template <class R, class P1, class A1, class A2, class A3>
+inline typename _TessFunctionResultCallback_1_3<true,R,P1,A1,A2,A3>::base*
+NewTessCallback(R (*function)(P1,A1,A2,A3), typename Identity<P1>::type p1) {
+  return new _TessFunctionResultCallback_1_3<true,R,P1,A1,A2,A3>(function, p1);
 }

-template <class R, class P1, class A1, class A2>
-inline typename _TessFunctionResultCallback_1_2<false,R,P1,A1,A2>::base*
-NewPermanentTessCallback(R (*function)(P1,A1,A2), typename Identity<P1>::type p1) {
-  return new _TessFunctionResultCallback_1_2<false,R,P1,A1,A2>(function, p1);
+template <class R, class P1, class A1, class A2, class A3>
+inline typename _TessFunctionResultCallback_1_3<false,R,P1,A1,A2,A3>::base*
+NewPermanentTessCallback(R (*function)(P1,A1,A2,A3), typename Identity<P1>::type p1) {
+  return new _TessFunctionResultCallback_1_3<false,R,P1,A1,A2,A3>(function, p1);
 }

 #endif /* _TESS_CALLBACK_SPECIALIZATIONS_H */
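
The "_1_3" suffix in the renamed classes means one argument bound at construction (P1) and three supplied at Run() time (A1, A2, A3). A stripped-down sketch of that pattern for a plain function, assuming simplified class names (ResultCallback3Sketch, FunctionCallback13Sketch, and Sum4 are hypothetical and not the Tesseract types):

```cpp
#include <assert.h>

// Minimal abstract interface: a callback taking three run-time arguments.
template <class R, class A1, class A2, class A3>
class ResultCallback3Sketch {
 public:
  virtual ~ResultCallback3Sketch() {}
  virtual R Run(A1 a1, A2 a2, A3 a3) = 0;
};

// Binds p1 at construction; forwards (p1, a1, a2, a3) on Run().
// If del is true the callback deletes itself after its first Run().
template <bool del, class R, class P1, class A1, class A2, class A3>
class FunctionCallback13Sketch : public ResultCallback3Sketch<R, A1, A2, A3> {
 public:
  typedef R (*FunctionSignature)(P1, A1, A2, A3);

  FunctionCallback13Sketch(FunctionSignature function, P1 p1)
      : function_(function), p1_(p1) {}

  virtual R Run(A1 a1, A2 a2, A3 a3) {
    R result = (*function_)(p1_, a1, a2, a3);
    if (del) delete this;  // one-shot callbacks clean up after themselves
    return result;
  }

 private:
  FunctionSignature function_;
  P1 p1_;
};

// Example target function used to exercise the sketch.
inline int Sum4(int p1, int a1, int a2, int a3) { return p1 + a1 + a2 + a3; }
```

In the header above, the del == true variant is what NewTessCallback creates and the del == false variant is what NewPermanentTessCallback creates; the caller owns and must delete only the permanent kind.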
--- a/ccutil/tessdatamanager.cpp
+++ b/ccutil/tessdatamanager.cpp
@ -38,17 +38,19 @@ bool TessdataManager::Init(const char *data_file_name, int debug_level) {
  data_file_ = fopen(data_file_name, "rb");
  if (data_file_ == NULL) {
    tprintf("Error opening data file %s\n", data_file_name);
+    tprintf("Please make sure the TESSDATA_PREFIX environment variable is set "
+            "to the parent directory of your \"tessdata\" directory.\n");
    return false;
  }
  fread(&actual_tessdata_num_entries_, sizeof(inT32), 1, data_file_);
-  bool swap = (actual_tessdata_num_entries_ > kMaxNumTessdataEntries);
-  if (swap) {
+  swap_ = (actual_tessdata_num_entries_ > kMaxNumTessdataEntries);
+  if (swap_) {
    actual_tessdata_num_entries_ = reverse32(actual_tessdata_num_entries_);
  }
  ASSERT_HOST(actual_tessdata_num_entries_ <= TESSDATA_NUM_ENTRIES);
  fread(offset_table_, sizeof(inT64),
        actual_tessdata_num_entries_, data_file_);
-  if (swap) {
+  if (swap_) {
    for (i = 0 ; i < actual_tessdata_num_entries_; ++i) {
      offset_table_[i] = reverse64(offset_table_[i]);
    }
--- a/ccutil/tessdatamanager.h
+++ b/ccutil/tessdatamanager.h
@ -41,6 +41,10 @@ static const char kFreqDawgFileSuffix[] = "freq-dawg";
 static const char kFixedLengthDawgsFileSuffix[] = "fixed-length-dawgs";
 static const char kCubeUnicharsetFileSuffix[] = "cube-unicharset";
 static const char kCubeSystemDawgFileSuffix[] = "cube-word-dawg";
+static const char kShapeTableFileSuffix[] = "shapetable";
+static const char kBigramDawgFileSuffix[] = "bigram-dawg";
+static const char kUnambigDawgFileSuffix[] = "unambig-dawg";
+static const char kParamsTrainingModelFileSuffix[] = "params-training-model";

 namespace tesseract {

@ -58,6 +62,10 @@ enum TessdataType {
  TESSDATA_FIXED_LENGTH_DAWGS,  // 10
  TESSDATA_CUBE_UNICHARSET,     // 11
  TESSDATA_CUBE_SYSTEM_DAWG,    // 12
+  TESSDATA_SHAPE_TABLE,         // 13
+  TESSDATA_BIGRAM_DAWG,         // 14
+  TESSDATA_UNAMBIG_DAWG,        // 15
+  TESSDATA_PARAMS_TRAINING_MODEL,  // 16

  TESSDATA_NUM_ENTRIES
 };
@ -80,6 +88,10 @@ static const char * const kTessdataFileSuffixes[] = {
  kFixedLengthDawgsFileSuffix,  // 10
  kCubeUnicharsetFileSuffix,    // 11
  kCubeSystemDawgFileSuffix,    // 12
+  kShapeTableFileSuffix,        // 13
+  kBigramDawgFileSuffix,        // 14
+  kUnambigDawgFileSuffix,       // 15
+  kParamsTrainingModelFileSuffix,  // 16
 };

 /**
@ -100,6 +112,10 @@ static const bool kTessdataFileIsText[] = {
  false,                        // 10
  true,                         // 11
  false,                        // 12
+  false,                        // 13
+  false,                        // 14
+  false,                        // 15
+  false,                        // 16
 };

 /**
@ -174,6 +190,9 @@ class TessdataManager {
      data_file_ = NULL;
    }
  }
+  bool swap() const {
+    return swap_;
+  }

  /** Writes the number of entries and the given offset table to output_file. */
  static void WriteMetadata(inT64 *offset_table, FILE *output_file);
@ -260,6 +279,8 @@ class TessdataManager {
  inT32 actual_tessdata_num_entries_;
  FILE *data_file_;  ///< pointer to the data file.
  int debug_level_;
+  // True if the bytes need swapping.
+  bool swap_;
 };
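
The swap_ member records the result of the endianness heuristic in TessdataManager::Init(): if the entry count read from the file is implausibly large, the file is assumed to have been written with the opposite byte order. A small sketch of that heuristic, assuming a limit of 1000 for illustration (kMaxEntriesSketch, Reverse32Sketch, and DecodeEntryCount are hypothetical names; the real code uses kMaxNumTessdataEntries and reverse32):

```cpp
#include <assert.h>
#include <stdint.h>

// Assumed plausibility limit for the number of entries (illustrative value).
static const int32_t kMaxEntriesSketch = 1000;

// Reverses the byte order of a 32-bit value.
inline uint32_t Reverse32Sketch(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Returns the entry count in host byte order and reports via *swap whether
// the rest of the file needs byte swapping too.
inline int32_t DecodeEntryCount(int32_t raw, bool* swap) {
  *swap = raw > kMaxEntriesSketch;
  if (!*swap) return raw;
  return static_cast<int32_t>(Reverse32Sketch(static_cast<uint32_t>(raw)));
}
```

Exposing the flag through swap() lets later readers of the same file, such as the DeSerialize(bool swap, FILE* fp) methods added in this commit, apply the same byte-order decision.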


--- a/ccutil/tprintf.cpp
+++ b/ccutil/tprintf.cpp
@ -31,7 +31,7 @@
 #include          "tprintf.h"
 #include          "ccutil.h"

-#define MAX_MSG_LEN     1024
+#define MAX_MSG_LEN     65536

 #define EXTERN
 // Since tprintf is protected by a mutex, these parameters can remain global.
--- a/ccutil/unichar.h
+++ b/ccutil/unichar.h
@ -25,7 +25,7 @@

 // Maximum number of characters that can be stored in a UNICHAR. Must be
 // at least 4. Must not exceed 31 without changing the coding of length.
-#define UNICHAR_LEN 24
+#define UNICHAR_LEN 30

 // A UNICHAR_ID is the unique id of a unichar.
 typedef int UNICHAR_ID;
--- a/ccutil/unicharset.cpp
+++ b/ccutil/unicharset.cpp
@ -1,4 +1,3 @@
-
 ///////////////////////////////////////////////////////////////////////
 // File:        unicharset.cpp
 // Description: Unicode character/ligature set class.
@ -22,30 +21,51 @@
 #include <stdio.h>
 #include <string.h>

+#include "tesscallback.h"
 #include "tprintf.h"
 #include "unichar.h"
 #include "unicharset.h"
 #include "params.h"

+// Special character used in representing character fragments.
+static const char kSeparator = '|';
+// Special character used in representing 'natural' character fragments.
+static const char kNaturalFlag = 'n';
+
 static const int ISALPHA_MASK = 0x1;
 static const int ISLOWER_MASK = 0x2;
 static const int ISUPPER_MASK = 0x4;
 static const int ISDIGIT_MASK = 0x8;
 static const int ISPUNCTUATION_MASK = 0x10;
+
 // Y coordinate threshold for determining cap-height vs x-height.
 // TODO(rays) Bring the global definition down to the ccutil library level,
 // so this constant is relative to some other constants.
 static const int kMeanlineThreshold = 220;
 // Let C be the number of alpha chars for which all tops exceed
-// kMeanlineThreshold, and X the number of alpha chars for which all tops
-// are below kMeanlineThreshold, then if X > C * kMinXHeightFraction or
-// more than half the alpha characters have upper or lower case, then
-// the unicharset "has x-height".
+// kMeanlineThreshold, and X the number of alpha chars for which all
+// tops are below kMeanlineThreshold. The unicharset "has x-height" if
+// either (X > C * kMinXHeightFraction and
+// C > X * kMinCapHeightFraction), or more than half the alpha
+// characters have upper or lower case.
 const double kMinXHeightFraction = 0.25;
+const double kMinCapHeightFraction = 0.05;
+
+/*static */
+const char* UNICHARSET::kCustomLigatures[][2] = {
+  {"ct", "\uE003"},  // c + t -> U+E003
+  {"ſh", "\uE006"},  // long-s + h -> U+E006
+  {"ſi", "\uE007"},  // long-s + i -> U+E007
+  {"ſl", "\uE008"},  // long-s + l -> U+E008
+  {"ſſ", "\uE009"},  // long-s + long-s -> U+E009
+  {NULL, NULL}
+};

 UNICHARSET::UNICHAR_PROPERTIES::UNICHAR_PROPERTIES() {
  Init();
 }
+
+// Initialize all properties to sensible default values.
 void UNICHARSET::UNICHAR_PROPERTIES::Init() {
  isalpha = false;
  islower = false;
@ -54,13 +74,73 @@ void UNICHARSET::UNICHAR_PROPERTIES::Init() {
  ispunctuation = false;
  isngram = false;
  enabled = false;
+  SetRangesOpen();
+  script_id = 0;
+  other_case = 0;
+  mirror = 0;
+  normed = "";
+  direction = UNICHARSET::U_LEFT_TO_RIGHT;
+  fragment = NULL;
+}
+
+// Sets all ranges wide open. Initialization default in case there are
+// no useful values available.
+void UNICHARSET::UNICHAR_PROPERTIES::SetRangesOpen() {
  min_bottom = 0;
  max_bottom = MAX_UINT8;
  min_top = 0;
  max_top = MAX_UINT8;
-  script_id = 0;
-  other_case = 0;
-  fragment = NULL;
+  min_width = 0;
+  max_width = MAX_INT16;
+  min_bearing = 0;
+  max_bearing = MAX_INT16;
+  min_advance = 0;
+  max_advance = MAX_INT16;
+}
+
+// Sets all ranges to empty. Used before expanding with font-based data.
+void UNICHARSET::UNICHAR_PROPERTIES::SetRangesEmpty() {
+  min_bottom = MAX_UINT8;
+  max_bottom = 0;
+  min_top = MAX_UINT8;
+  max_top = 0;
+  min_width = MAX_INT16;
+  max_width = 0;
+  min_bearing = MAX_INT16;
+  max_bearing = 0;
+  min_advance = MAX_INT16;
+  max_advance = 0;
+}
+
+// Returns true if any of the top/bottom/width/bearing/advance ranges is
+// empty.
+bool UNICHARSET::UNICHAR_PROPERTIES::AnyRangeEmpty() const {
+  return min_bottom > max_bottom || min_top > max_top ||
+      min_width > max_width || min_bearing > max_bearing ||
+      min_advance > max_advance;
+}
+
+// Expands the ranges with the ranges from the src properties.
+void UNICHARSET::UNICHAR_PROPERTIES::ExpandRangesFrom(
+    const UNICHAR_PROPERTIES& src) {
+  UpdateRange(src.min_bottom, &min_bottom, &max_bottom);
+  UpdateRange(src.max_bottom, &min_bottom, &max_bottom);
+  UpdateRange(src.min_top, &min_top, &max_top);
+  UpdateRange(src.max_top, &min_top, &max_top);
+  UpdateRange(src.min_width, &min_width, &max_width);
+  UpdateRange(src.max_width, &min_width, &max_width);
+  UpdateRange(src.min_bearing, &min_bearing, &max_bearing);
+  UpdateRange(src.max_bearing, &min_bearing, &max_bearing);
+  UpdateRange(src.min_advance, &min_advance, &max_advance);
+  UpdateRange(src.max_advance, &min_advance, &max_advance);
+}
+
+// Copies the properties from src into this.
+void UNICHARSET::UNICHAR_PROPERTIES::CopyFrom(const UNICHAR_PROPERTIES& src) {
+  // Apart from the fragment, everything else can be done with a default copy.
+  CHAR_FRAGMENT* saved_fragment = fragment;
+  *this = src;  // Memberwise copy.
+  fragment = saved_fragment;
 }

 UNICHARSET::UNICHARSET() :
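Review note on the hunk above: the empty-then-expand idiom behind SetRangesEmpty, AnyRangeEmpty and ExpandRangesFrom can be sketched on a single min/max pair. This is a minimal illustration, not the Tesseract code. `Range`, `EmptyRange`, `IsEmpty`, `Expand` and `ExpandFrom` are hypothetical names, and `Expand` only assumes what UpdateRange appears to do (its definition is not part of this diff).

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for one min/max pair in UNICHAR_PROPERTIES.
struct Range {
  int min;
  int max;
};

// An empty range has min above max, so the first value added sets both ends.
inline Range EmptyRange() { return Range{INT_MAX, INT_MIN}; }

inline bool IsEmpty(const Range& r) { return r.min > r.max; }

// Widens r just enough to include value (assumed behavior of UpdateRange).
inline void Expand(int value, Range* r) {
  if (value < r->min) r->min = value;
  if (value > r->max) r->max = value;
}

// Widens dst to cover src. An empty src is skipped in this sketch, because
// feeding its inverted endpoints to Expand would blow dst wide open.
inline void ExpandFrom(const Range& src, Range* dst) {
  if (IsEmpty(src)) return;
  Expand(src.min, dst);
  Expand(src.max, dst);
}
```

The guard in `ExpandFrom` is a choice made for this sketch. In the diff, AppendOtherUnicharset gets a similar effect by skipping source entries for which AnyRangeEmpty() is true.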
@ -82,7 +162,7 @@ void UNICHARSET::reserve(int unichars_number) {
  if (unichars_number > size_reserved) {
    UNICHAR_SLOT* unichars_new = new UNICHAR_SLOT[unichars_number];
    for (int i = 0; i < size_used; ++i)
-      memcpy(&unichars_new[i], &unichars[i], sizeof(UNICHAR_SLOT));
+      unichars_new[i] = unichars[i];
    for (int j = size_used; j < unichars_number; ++j) {
      unichars_new[j].properties.script_id = add_script(null_script);
    }
@ -119,21 +199,60 @@ int UNICHARSET::step(const char* str) const {
  while (goodlength <= UNICHAR_LEN) {
    if (str[goodlength] == '\0' || ids.minmatch(str + goodlength) > 0)
      return goodlength;  // This length works!
+
    // The next char is illegal so find the next usable length.
    do {
      ++goodlength;
    } while (str[goodlength] != '\0' && goodlength <= UNICHAR_LEN &&
             !ids.contains(str, goodlength));
+    if (goodlength > UNICHAR_LEN || !ids.contains(str, goodlength)) {
+      // This does not constitute a good length!
+      return minlength;
+    }
  }
  // Search to find a subsequent legal char failed so return the minlength.
  return minlength;
 }

+// Return whether the given UTF-8 string is encodable with this UNICHARSET.
+// If not encodable, write the first byte offset which cannot be converted
+// into the second (return) argument.
+bool UNICHARSET::encodable_string(const char *str,
+                                  int *first_bad_position) const {
+  for (int i = 0, len = strlen(str); i < len; ) {
+    int increment = step(str + i);
+    if (increment == 0) {
+      if (first_bad_position) *first_bad_position = i;
+      return false;
+    }
+    i += increment;
+  }
+  return true;
+}
+
 const char* const UNICHARSET::id_to_unichar(UNICHAR_ID id) const {
  if (id == INVALID_UNICHAR_ID) {
    return INVALID_UNICHAR;
  }
-  assert(id < this->size());
+  ASSERT_HOST(id < this->size());
+  return unichars[id].representation;
+}
+
+const char* const UNICHARSET::id_to_unichar_ext(UNICHAR_ID id) const {
+  if (id == INVALID_UNICHAR_ID) {
+    return INVALID_UNICHAR;
+  }
+  ASSERT_HOST(id < this->size());
+  // Resolve from the kCustomLigatures table if this is a private encoding.
+  if (get_isprivate(id)) {
+    const char* ch = id_to_unichar(id);
+    for (int i = 0; kCustomLigatures[i][0] != NULL; ++i) {
+      if (!strcmp(ch, kCustomLigatures[i][1])) {
+        return kCustomLigatures[i][0];
+      }
+    }
+  }
+  // Otherwise return the stored representation.
  return unichars[id].representation;
 }

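Review note on id_to_unichar_ext above: the private-use lookup is a linear scan of a NULL-terminated table. The sketch below shows the same idea in isolation. `kLigatures` and `ResolvePrivate` are hypothetical names, only one table entry is reproduced, and the private-use code point is spelled as explicit UTF-8 bytes so the example does not depend on the compiler's execution character set.

```cpp
#include <cassert>
#include <cstring>

// {plain text, private-use encoding}. U+E003 is EE 80 83 in UTF-8.
static const char* const kLigatures[][2] = {
  {"ct", "\xEE\x80\x83"},
  {nullptr, nullptr}
};

// Returns the plain-text form if stored is a known private-use ligature,
// otherwise returns stored unchanged.
const char* ResolvePrivate(const char* stored) {
  for (int i = 0; kLigatures[i][0] != nullptr; ++i) {
    if (std::strcmp(stored, kLigatures[i][1]) == 0) return kLigatures[i][0];
  }
  return stored;
}
```

Unlike the code in the diff, this sketch does not first test whether the code point falls in 0xE000..0xF8FF; it simply scans the table for every input.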
@ -167,9 +286,7 @@ STRING UNICHARSET::debug_str(UNICHAR_ID id) const {
  if (id == INVALID_UNICHAR_ID) return STRING(id_to_unichar(id));
  const CHAR_FRAGMENT *fragment = this->get_fragment(id);
  if (fragment) {
-    STRING base = debug_str(fragment->get_unichar());
-    return CHAR_FRAGMENT::to_string(base.string(), fragment->get_pos(),
-                                    fragment->get_total());
+    return fragment->to_string();
  }
  const char* str = id_to_unichar(id);
  STRING result = debug_utf8_str(str);
@ -193,6 +310,180 @@ STRING UNICHARSET::debug_str(UNICHAR_ID id) const {
  return result;
 }

+// Returns whether the unichar id represents a unicode value in the private use
+// area. We use this range only internally to represent uncommon ligatures
+// (eg. 'ct') that do not have regular unicode values.
+bool UNICHARSET::get_isprivate(UNICHAR_ID unichar_id) const {
+  UNICHAR uc(id_to_unichar(unichar_id), -1);
+  int uni = uc.first_uni();
+  return (uni >= 0xE000 && uni <= 0xF8FF);
+}
+
+
+// Sets all ranges to empty, so they can be expanded to set the values.
+void UNICHARSET::set_ranges_empty() {
+  for (int id = 0; id < size_used; ++id) {
+    unichars[id].properties.SetRangesEmpty();
+  }
+}
+
+// Sets all the properties for this unicharset given a src unicharset with
+// everything set. The unicharsets don't have to be the same, and graphemes
+// are correctly accounted for.
+void UNICHARSET::SetPropertiesFromOther(const UNICHARSET& src) {
+  for (int ch = 0; ch < size_used; ++ch) {
+    const char* utf8 = id_to_unichar(ch);
+    UNICHAR_PROPERTIES properties;
+    if (src.GetStrProperties(utf8, &properties)) {
+      // Setup the script_id, other_case, and mirror properly.
+      const char* script = src.get_script_from_script_id(properties.script_id);
+      properties.script_id = add_script(script);
+      const char* other_case = src.id_to_unichar(properties.other_case);
+      if (contains_unichar(other_case)) {
+        properties.other_case = unichar_to_id(other_case);
+      } else {
+        properties.other_case = ch;
+      }
+      const char* mirror_str = src.id_to_unichar(properties.mirror);
+      if (contains_unichar(mirror_str)) {
+        properties.mirror = unichar_to_id(mirror_str);
+      } else {
+        properties.mirror = ch;
+      }
+      unichars[ch].properties.CopyFrom(properties);
+    }
+  }
+}
+
+// Expands the tops and bottoms and widths for this unicharset given a
+// src unicharset with ranges in it. The unicharsets don't have to be the
+// same, and graphemes are correctly accounted for.
+void UNICHARSET::ExpandRangesFromOther(const UNICHARSET& src) {
+  for (int ch = 0; ch < size_used; ++ch) {
+    const char* utf8 = id_to_unichar(ch);
+    UNICHAR_PROPERTIES properties;
+    if (src.GetStrProperties(utf8, &properties)) {
+      // Expand just the ranges from properties.
+      unichars[ch].properties.ExpandRangesFrom(properties);
+    }
+  }
+}
+
+// For each id in src, if it does not occur in this, add it, as in
+// SetPropertiesFromOther, otherwise expand the ranges, as in
+// ExpandRangesFromOther.
+void UNICHARSET::AppendOtherUnicharset(const UNICHARSET& src) {
+  for (int ch = 0; ch < src.size_used; ++ch) {
+    const UNICHAR_PROPERTIES& src_props = src.unichars[ch].properties;
+    const char* utf8 = src.id_to_unichar(ch);
+    if (strcmp(utf8, " ") != 0 && src_props.AnyRangeEmpty()) {
+      // Only use fully valid entries.
+      tprintf("Bad properties for char %s: %d,%d %d,%d %d,%d %d,%d %d,%d\n",
+              utf8, src_props.min_bottom, src_props.max_bottom,
+              src_props.min_top, src_props.max_top,
+              src_props.min_width, src_props.max_width,
+              src_props.min_bearing, src_props.max_bearing,
+              src_props.min_advance, src_props.max_advance);
+      continue;
+    }
+    int id = size_used;
+    if (contains_unichar(utf8)) {
+      id = unichar_to_id(utf8);
+    } else {
+      unichar_insert(utf8);
+      unichars[id].properties.SetRangesEmpty();
+    }
+    if (!unichars[id].properties.AnyRangeEmpty()) {
+      // Just expand current ranges.
+      unichars[id].properties.ExpandRangesFrom(src_props);
+    } else {
+      // Copy properties from src_props.
+      unichars[id].properties.CopyFrom(src_props);
+      // Setup the script_id, other_case and mirror properly.
+      const char* script = src.get_script_from_script_id(src_props.script_id);
+      unichars[id].properties.script_id = add_script(script);
+      const char* other_case = src.id_to_unichar(src_props.other_case);
+      if (!contains_unichar(other_case)) {
+        unichar_insert(other_case);
+        unichars[size_used - 1].properties.SetRangesEmpty();
+        // Other_case will have its ranges set later as it is contained in src.
+      }
+      unichars[id].properties.other_case = unichar_to_id(other_case);
+      const char* mirror_str = src.id_to_unichar(src_props.mirror);
+      if (!contains_unichar(mirror_str)) {
+        unichar_insert(mirror_str);
+        unichars[size_used - 1].properties.SetRangesEmpty();
+        // Mirror will have its ranges set later as it is contained in src.
+      }
+      unichars[id].properties.mirror = unichar_to_id(mirror_str);
+    }
+  }
+}
+
+// Gets the properties for a grapheme string, combining properties for
+// multiple characters in a meaningful way where possible.
+// Returns false if no valid match was found in the unicharset.
+// NOTE that script_id, mirror, and other_case refer to this unicharset on
+// return and will need translation if the target unicharset is different.
+bool UNICHARSET::GetStrProperties(const char* utf8_str,
+                                  UNICHAR_PROPERTIES* props) const {
+  props->Init();
+  props->SetRangesEmpty();
+  props->min_advance = 0;
+  props->max_advance = 0;
+  int utf8_step = 0;
+  int total_unicodes = 0;
+  for (int offset = 0; utf8_str[offset] != '\0'; offset += utf8_step) {
+    utf8_step = step(utf8_str + offset);
+    if (utf8_step == 0) return false;
+    int id = unichar_to_id(utf8_str + offset, utf8_step);
+    if (id < 0) return false;
+    const UNICHAR_PROPERTIES& src_props = unichars[id].properties;
+    // Logical OR all the bools.
+    if (src_props.isalpha) props->isalpha = true;
+    if (src_props.islower) props->islower = true;
+    if (src_props.isupper) props->isupper = true;
+    if (src_props.isdigit) props->isdigit = true;
+    if (src_props.ispunctuation) props->ispunctuation = true;
+    if (src_props.isngram) props->isngram = true;
+    if (src_props.enabled) props->enabled = true;
+    // Min/max the tops/bottoms.
+    UpdateRange(src_props.min_bottom, &props->min_bottom, &props->max_bottom);
+    UpdateRange(src_props.max_bottom, &props->min_bottom, &props->max_bottom);
+    UpdateRange(src_props.min_top, &props->min_top, &props->max_top);
+    UpdateRange(src_props.max_top, &props->min_top, &props->max_top);
+    int bearing = props->min_advance + src_props.min_bearing;
+    if (total_unicodes == 0 || bearing < props->min_bearing)
+      props->min_bearing = bearing;
+    bearing = props->max_advance + src_props.max_bearing;
+    if (total_unicodes == 0 || bearing < props->max_bearing)
+      props->max_bearing = bearing;
+    props->min_advance += src_props.min_advance;
+    props->max_advance += src_props.max_advance;
+    // With a single width, just use the widths stored in the unicharset.
+    props->min_width = src_props.min_width;
+    props->max_width = src_props.max_width;
+    // Use the first script id, other_case, mirror, direction.
+    // Note that these will need translation, except direction.
+    if (total_unicodes == 0) {
+      props->script_id = src_props.script_id;
+      props->other_case = src_props.other_case;
+      props->mirror = src_props.mirror;
+      props->direction = src_props.direction;
+    }
+    // The normed string for the compound character is the concatenation of
+    // the normed versions of the individual characters.
+    props->normed += src_props.normed;
+    ++total_unicodes;
+  }
+  if (total_unicodes > 1) {
+    // Estimate the total widths from the advance - bearing.
+    props->min_width = props->min_advance - props->max_bearing;
+    props->max_width = props->max_advance - props->min_bearing;
+  }
+  return total_unicodes > 0;
+}
+
 unsigned int UNICHARSET::get_properties(UNICHAR_ID id) const {
  unsigned int properties = 0;
  if (this->get_isalpha(id))
@ -271,27 +562,96 @@ bool UNICHARSET::save_to_file(FILE *file) const {
  for (UNICHAR_ID id = 0; id < this->size(); ++id) {
    int min_bottom, max_bottom, min_top, max_top;
    get_top_bottom(id, &min_bottom, &max_bottom, &min_top, &max_top);
+    int min_width, max_width;
+    get_width_range(id, &min_width, &max_width);
+    int min_bearing, max_bearing;
+    get_bearing_range(id, &min_bearing, &max_bearing);
+    int min_advance, max_advance;
+    get_advance_range(id, &min_advance, &max_advance);
    unsigned int properties = this->get_properties(id);
-    if (strcmp(this->id_to_unichar(id), " ") == 0)
+    if (strcmp(this->id_to_unichar(id), " ") == 0) {
      fprintf(file, "%s %x %s %d\n", "NULL", properties,
              this->get_script_from_script_id(this->get_script(id)),
              this->get_other_case(id));
-    else
-      fprintf(file, "%s %x %d,%d,%d,%d %s %d\t# %s\n",
+    } else {
+      fprintf(file,
+              "%s %x %d,%d,%d,%d,%d,%d,%d,%d,%d,%d %s %d %d %d %s\t# %s\n",
              this->id_to_unichar(id), properties,
-              min_bottom, max_bottom, min_top, max_top,
+              min_bottom, max_bottom, min_top, max_top, min_width, max_width,
+              min_bearing, max_bearing, min_advance, max_advance,
              this->get_script_from_script_id(this->get_script(id)),
-              this->get_other_case(id), this->debug_str(id).string());
+              this->get_other_case(id), this->get_direction(id),
+              this->get_mirror(id), this->get_normed_unichar(id),
+              this->debug_str(id).string());
+    }
  }
  return true;
 }

+class InMemoryFilePointer {
+ public:
+  InMemoryFilePointer(const char *memory, int mem_size)
+      : memory_(memory), fgets_ptr_(memory), mem_size_(mem_size) { }
+
+  char *fgets(char *orig_dst, int size) {
+    const char *src_end = memory_ + mem_size_;
+    char *dst_end = orig_dst + size - 1;
+    if (size < 1) {
+      return fgets_ptr_ < src_end ? orig_dst : NULL;
+    }
+
+    char *dst = orig_dst;
+    char ch = '^';
+    while (fgets_ptr_ < src_end && dst < dst_end && ch != '\n') {
+      ch = *dst++ = *fgets_ptr_++;
+    }
+    *dst = 0;
+    return (dst == orig_dst) ? NULL : orig_dst;
+  }
+
+ private:
+  const char *memory_;
+  const char *fgets_ptr_;
+  const int mem_size_;
+};
+
+bool UNICHARSET::load_from_inmemory_file(
+    const char *memory, int mem_size, bool skip_fragments) {
+  InMemoryFilePointer mem_fp(memory, mem_size);
+  TessResultCallback2<char *, char *, int> *fgets_cb =
+      NewPermanentTessCallback(&mem_fp, &InMemoryFilePointer::fgets);
+  bool success = load_via_fgets(fgets_cb, skip_fragments);
+  delete fgets_cb;
+  return success;
+}
+
+class LocalFilePointer {
+ public:
+  LocalFilePointer(FILE *stream) : fp_(stream) {}
+  char *fgets(char *dst, int size) {
+    return ::fgets(dst, size, fp_);
+  }
+ private:
+  FILE *fp_;
+};
+
 bool UNICHARSET::load_from_file(FILE *file, bool skip_fragments) {
+  LocalFilePointer lfp(file);
+  TessResultCallback2<char *, char *, int> *fgets_cb =
+      NewPermanentTessCallback(&lfp, &LocalFilePointer::fgets);
+  bool success = load_via_fgets(fgets_cb, skip_fragments);
+  delete fgets_cb;
+  return success;
+}
+
+bool UNICHARSET::load_via_fgets(
+    TessResultCallback2<char *, char *, int> *fgets_cb,
+    bool skip_fragments) {
  int unicharset_size;
  char buffer[256];

  this->clear();
-  if (fgets(buffer, sizeof (buffer), file) == NULL ||
+  if (fgets_cb->Run(buffer, sizeof(buffer)) == NULL ||
      sscanf(buffer, "%d", &unicharset_size) != 1) {
    return false;
  }
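Review note on InMemoryFilePointer above: the fgets replacement copies bytes up to and including a newline, or until the destination is full, and signals end of data by returning NULL. A standalone sketch of that contract follows. `MemLineReader` and `ReadLine` are hypothetical names, and for simplicity this version returns null whenever size is below 1, which differs slightly from the class in the diff.

```cpp
#include <cassert>
#include <cstring>

// Reads newline-terminated lines out of a memory buffer, fgets style.
class MemLineReader {
 public:
  MemLineReader(const char* memory, int mem_size)
      : next_(memory), end_(memory + mem_size) {}

  // Copies at most size - 1 bytes, stopping after a newline, then
  // NUL-terminates dst. Returns nullptr when nothing was copied.
  char* ReadLine(char* dst, int size) {
    if (size < 1) return nullptr;
    char* out = dst;
    char* out_end = dst + size - 1;
    char ch = '\0';
    while (next_ < end_ && out < out_end && ch != '\n') {
      ch = *out++ = *next_++;
    }
    *out = '\0';
    return out == dst ? nullptr : dst;
  }

 private:
  const char* next_;  // Next unread byte.
  const char* end_;   // One past the last byte of the buffer.
};
```

Because the reader has the same shape as ::fgets bound to a stream, one parsing routine can serve both sources, which is what load_via_fgets does through the callback.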
@ -302,21 +662,53 @@ bool UNICHARSET::load_from_file(FILE *file, bool skip_fragments) {
    char script[64];

    strcpy(script, null_script);
-    this->unichars[id].properties.other_case = id;
    int min_bottom = 0;
    int max_bottom = MAX_UINT8;
    int min_top = 0;
    int max_top = MAX_UINT8;
-    if (fgets(buffer, sizeof (buffer), file) == NULL ||
-        (sscanf(buffer, "%s %x %d,%d,%d,%d %63s %d", unichar, &properties,
-                &min_bottom, &max_bottom, &min_top, &max_top,
-                script, &(this->unichars[id].properties.other_case)) != 8 &&
-         sscanf(buffer, "%s %x %63s %d", unichar, &properties,
-                script, &(this->unichars[id].properties.other_case)) != 4 &&
-         sscanf(buffer, "%s %x %63s", unichar, &properties, script) != 3 &&
-         sscanf(buffer, "%s %x", unichar, &properties) != 2)) {
+    int min_width = 0;
+    int max_width = MAX_INT16;
+    int min_bearing = 0;
+    int max_bearing = MAX_INT16;
+    int min_advance = 0;
+    int max_advance = MAX_INT16;
+    // TODO(eger): check that this default is ok
+    // after enabling BiDi iterator for Arabic+Cube.
+    int direction = UNICHARSET::U_LEFT_TO_RIGHT;
+    UNICHAR_ID other_case = id;
+    UNICHAR_ID mirror = id;
+    char normed[64];
+    int v = -1;
+    if (fgets_cb->Run(buffer, sizeof (buffer)) == NULL ||
+        ((v = sscanf(buffer,
+                     "%s %x %d,%d,%d,%d,%d,%d,%d,%d,%d,%d %63s %d %d %d %63s",
+                     unichar, &properties,
+                     &min_bottom, &max_bottom, &min_top, &max_top,
+                     &min_width, &max_width, &min_bearing, &max_bearing,
+                     &min_advance, &max_advance, script, &other_case,
+                     &direction, &mirror, normed)) != 17 &&
+         (v = sscanf(buffer,
+                     "%s %x %d,%d,%d,%d,%d,%d,%d,%d,%d,%d %63s %d %d %d",
+                     unichar, &properties,
+                     &min_bottom, &max_bottom, &min_top, &max_top,
+                     &min_width, &max_width, &min_bearing, &max_bearing,
+                     &min_advance, &max_advance,
+                     script, &other_case, &direction, &mirror)) != 16 &&
+          (v = sscanf(buffer, "%s %x %d,%d,%d,%d %63s %d %d %d",
+                      unichar, &properties,
+                      &min_bottom, &max_bottom, &min_top, &max_top,
+                      script, &other_case, &direction, &mirror)) != 10 &&
+          (v = sscanf(buffer, "%s %x %d,%d,%d,%d %63s %d", unichar, &properties,
+                      &min_bottom, &max_bottom, &min_top, &max_top,
+                      script, &other_case)) != 8 &&
+          (v = sscanf(buffer, "%s %x %63s %d", unichar, &properties,
+                      script, &other_case)) != 4 &&
+          (v = sscanf(buffer, "%s %x %63s",
+                      unichar, &properties, script)) != 3 &&
+          (v = sscanf(buffer, "%s %x", unichar, &properties)) != 2)) {
      return false;
    }
+
    // Skip fragments if needed.
    CHAR_FRAGMENT *frag = NULL;
    if (skip_fragments && (frag = CHAR_FRAGMENT::parse_from_string(unichar))) {
@ -338,6 +730,15 @@ bool UNICHARSET::load_from_file(FILE *file, bool skip_fragments) {
    this->set_script(id, script);
    this->unichars[id].properties.enabled = true;
    this->set_top_bottom(id, min_bottom, max_bottom, min_top, max_top);
+    this->set_width_range(id, min_width, max_width);
+    this->set_bearing_range(id, min_bearing, max_bearing);
+    this->set_advance_range(id, min_advance, max_advance);
+    this->set_direction(id, static_cast<UNICHARSET::Direction>(direction));
+    ASSERT_HOST(other_case < unicharset_size);
+    this->set_other_case(id, (v>3) ? other_case : id);
+    ASSERT_HOST(mirror < unicharset_size);
+    this->set_mirror(id, (v>8) ? mirror : id);
+    this->set_normed(id, (v>16) ? normed : unichar);
  }
  post_load_setup();
  return true;
@ -372,9 +773,11 @@ void UNICHARSET::post_load_setup() {
        ++cap_height_alphas;
    }
  }
+
  script_has_upper_lower_ = net_case_alphas > 0;
  script_has_xheight_ = script_has_upper_lower_ ||
-      x_height_alphas > cap_height_alphas * kMinXHeightFraction;
+      (x_height_alphas > cap_height_alphas * kMinXHeightFraction &&
+       cap_height_alphas > x_height_alphas * kMinCapHeightFraction);

  null_sid_ = get_script_id_from_name(null_script);
  ASSERT_HOST(null_sid_ == 0);
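Review note on the script_has_xheight_ change above: the added cap-height condition means a set with plenty of x-height characters but almost no cap-height characters no longer counts as having x-height. The sketch below restates the predicate as a free function so the effect is easy to check. `HasXHeight` is a hypothetical name, and the two constants are copied from this diff.

```cpp
#include <cassert>

const double kMinXHeightFraction = 0.25;
const double kMinCapHeightFraction = 0.05;

// Mirrors the expression assigned to script_has_xheight_.
bool HasXHeight(bool has_upper_lower, int x_height_alphas,
                int cap_height_alphas) {
  return has_upper_lower ||
      (x_height_alphas > cap_height_alphas * kMinXHeightFraction &&
       cap_height_alphas > x_height_alphas * kMinCapHeightFraction);
}
```

For example, 100 x-height alphas and 0 cap-height alphas passed the old single test (100 > 0) but fail the new one, since 0 is not greater than 100 * 0.05.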
@ -386,11 +789,15 @@ void UNICHARSET::post_load_setup() {
  hiragana_sid_ = get_script_id_from_name("Hiragana");
  katakana_sid_ = get_script_id_from_name("Katakana");

-  // Compute default script.
+  // Compute default script. Use the highest-counting alpha script, that is
+  // not the common script, as that still contains some "alphas".
  int* script_counts = new int[script_table_size_used];
  memset(script_counts, 0, sizeof(*script_counts) * script_table_size_used);
-  for (int id = 0; id < size_used; ++id)
-    ++script_counts[get_script(id)];
+  for (int id = 0; id < size_used; ++id) {
+    if (get_isalpha(id)) {
+      ++script_counts[get_script(id)];
+    }
+  }
  default_sid_ = 0;
  for (int s = 1; s < script_table_size_used; ++s) {
    if (script_counts[s] > script_counts[default_sid_] && s != common_sid_)
@ -399,15 +806,21 @@ void UNICHARSET::post_load_setup() {
  delete [] script_counts;
 }

-// Returns true if any script entry in the unicharset is for a
-// right_to_left language.
-bool UNICHARSET::any_right_to_left() const {
-  for (int id = 0; id < script_table_size_used; ++id) {
-    if (strcmp(script_table[id], "Arabic") == 0 ||
-        strcmp(script_table[id], "Hebrew") == 0)
-      return true;
+// Returns true if right_to_left scripts are significant in the unicharset,
+// but without being so sensitive that "universal" unicharsets containing
+// characters from many scripts, like orientation and script detection,
+// look like they are right_to_left.
+bool UNICHARSET::major_right_to_left() const {
+  int ltr_count = 0;
+  int rtl_count = 0;
+  for (int id = 0; id < size_used; ++id) {
+    int dir = get_direction(id);
+    if (dir == UNICHARSET::U_LEFT_TO_RIGHT) ltr_count++;
+    if (dir == UNICHARSET::U_RIGHT_TO_LEFT ||
+        dir == UNICHARSET::U_RIGHT_TO_LEFT_ARABIC ||
+        dir == UNICHARSET::U_ARABIC_NUMBER) rtl_count++;
  }
-  return false;
+  return rtl_count > ltr_count;
 }

 // Set a whitelist and/or blacklist of characters to recognize.
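Review note on major_right_to_left above: the new test is a majority vote over per-character direction codes rather than a check for the mere presence of an Arabic or Hebrew script entry. A minimal sketch under that reading follows. `MajorRightToLeft` and the `k...` enumerators are hypothetical names; the numeric values match the Direction enum added to unicharset.h in this commit.

```cpp
#include <cassert>
#include <vector>

// Subset of the ICU-style direction codes used by the vote.
enum Direction {
  kLeftToRight = 0,
  kRightToLeft = 1,
  kArabicNumber = 5,
  kOtherNeutral = 10,
  kRightToLeftArabic = 13
};

// True only if right-to-left characters strictly outnumber left-to-right.
bool MajorRightToLeft(const std::vector<Direction>& dirs) {
  int ltr_count = 0;
  int rtl_count = 0;
  for (Direction dir : dirs) {
    if (dir == kLeftToRight) ++ltr_count;
    if (dir == kRightToLeft || dir == kRightToLeftArabic ||
        dir == kArabicNumber) {
      ++rtl_count;
    }
  }
  return rtl_count > ltr_count;
}
```

Neutral characters count for neither side, and a tie resolves to false, so a mixed "universal" character set is not treated as right-to-left.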
@ -471,6 +884,21 @@ int UNICHARSET::add_script(const char* script) {
  return script_table_size_used++;
 }

+// Returns the string that represents a fragment
+// with the given unichar, pos and total.
+STRING CHAR_FRAGMENT::to_string(const char *unichar, int pos, int total,
+                                bool natural) {
+  if (total == 1) return STRING(unichar);
+  STRING result = "";
+  result += kSeparator;
+  result += unichar;
+  char buffer[kMaxLen];
+  snprintf(buffer, kMaxLen, "%c%d%c%d", kSeparator, pos,
+           natural ? kNaturalFlag : kSeparator, total);
+  result += buffer;
+  return result;
+}
+
 CHAR_FRAGMENT *CHAR_FRAGMENT::parse_from_string(const char *string) {
  const char *ptr = string;
  int len = strlen(string);
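Review note on CHAR_FRAGMENT::to_string above: the only change to the text form is the character between pos and total, which is 'n' for a natural fragment and '|' otherwise. The sketch below rebuilds the same strings with std::string so the two forms can be compared directly. `FragmentToString` is a hypothetical name, not part of the diff.

```cpp
#include <cassert>
#include <string>

// Builds "|<unichar>|<pos>|<total>" or, for natural fragments,
// "|<unichar>|<pos>n<total>". A single-chunk character is returned as is.
std::string FragmentToString(const std::string& unichar, int pos, int total,
                             bool natural) {
  if (total == 1) return unichar;
  const char kSeparator = '|';
  const char kNaturalFlag = 'n';
  std::string result(1, kSeparator);
  result += unichar;
  result += kSeparator;
  result += std::to_string(pos);
  result += natural ? kNaturalFlag : kSeparator;
  result += std::to_string(total);
  return result;
}
```

With these inputs, chunk 1 of 2 of "m" gives "|m|1|2" and the natural variant for ":" gives "|:|1n2", matching the examples in the unicharset.h comment.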
@ -491,10 +919,14 @@ CHAR_FRAGMENT *CHAR_FRAGMENT::parse_from_string(const char *string) {
  ptr += step;  // move to the next fragment separator
  int pos = 0;
  int total = 0;
+  bool natural = false;
  char *end_ptr = NULL;
  for (int i = 0; i < 2; i++) {
    if (ptr > string + len || *ptr != kSeparator) {
-      return NULL;  // failed to parse fragment representation
+      if (i == 1 && *ptr == kNaturalFlag)
+        natural = true;
+      else
+        return NULL;  // Failed to parse fragment representation.
    }
    ptr++;  // move to the next character
    i == 0 ? pos = static_cast<int>(strtol(ptr, &end_ptr, 10))
@ -505,7 +937,7 @@ CHAR_FRAGMENT *CHAR_FRAGMENT::parse_from_string(const char *string) {
    return NULL;  // malformed fragment representation
  }
  CHAR_FRAGMENT *fragment = new CHAR_FRAGMENT();
-  fragment->set_all(unichar, pos, total);
+  fragment->set_all(unichar, pos, total, natural);
  return fragment;
 }

--- a/ccutil/unicharset.h
+++ b/ccutil/unicharset.h
@ -26,22 +26,29 @@
 #include "unicharmap.h"
 #include "params.h"

+enum StrongScriptDirection {
+  DIR_NEUTRAL = 0,        // Text contains only neutral characters.
+  DIR_LEFT_TO_RIGHT = 1,  // Text contains no Right-to-Left characters.
+  DIR_RIGHT_TO_LEFT = 2,  // Text contains no Left-to-Right characters.
+  DIR_MIX = 3,            // Text contains a mixture of left-to-right
+                          // and right-to-left characters.
+};
+
 class CHAR_FRAGMENT {
 public:
  // Minimum number of characters used for fragment representation.
  static const int kMinLen = 6;
  // Maximum number of characters used for fragment representation.
  static const int kMaxLen = 3 + UNICHAR_LEN + 2;
-  // Special character used in representing character fragments.
-  static const char kSeparator = '|';
  // Maximum number of fragments per character.
-  static const int kMaxChunks = 3;
+  static const int kMaxChunks = 5;

  // Setters and Getters.
-  inline void set_all(const char *unichar, int pos, int total) {
-    this->set_unichar(unichar);
-    this->set_pos(pos);
-    this->set_total(total);
+  inline void set_all(const char *unichar, int pos, int total, bool natural) {
+    set_unichar(unichar);
+    set_pos(pos);
+    set_total(total);
+    set_natural(natural);
  }
  inline void set_unichar(const char *uch) {
    strncpy(this->unichar, uch, UNICHAR_LEN);
@ -55,19 +62,11 @@ class CHAR_FRAGMENT {

  // Returns the string that represents a fragment
  // with the given unichar, pos and total.
-  static STRING to_string(const char *unichar, int pos, int total) {
-    if (total == 1) return STRING(unichar);
-    STRING result = "";
-    result += kSeparator;
-    result += unichar;
-    char buffer[kMaxLen];
-    snprintf(buffer, kMaxLen, "%c%d%c%d", kSeparator, pos, kSeparator, total);
-    result += buffer;
-    return result;
-  }
+  static STRING to_string(const char *unichar, int pos, int total,
+                          bool natural);
  // Returns the string that represents this fragment.
  STRING to_string() const {
-    return to_string(this->unichar, this->pos, this->total);
+    return to_string(unichar, pos, total, natural);
  }

  // Checks whether a fragment has the same unichar,
@ -97,11 +96,19 @@ class CHAR_FRAGMENT {
  // Returns true if this fragment is an ending fragment.
  inline bool is_ending() const { return this->pos == this->total-1; }

+  // Returns true if the fragment was a separate component to begin with,
+  // ie did not need chopping to be isolated, but may have been separated
+  // out from a multi-outline blob.
+  inline bool is_natural() const { return natural; }
+  void set_natural(bool value) { natural = value; }
+
  // Parses the string to see whether it represents a character fragment
  // (rather than a regular character). If so, allocates memory for a new
  // CHAR_FRAGMENT instance and fills it in with the corresponding fragment
  // information. Fragments are of the form:
-  // |m|1|2, meaning chunk 1 of 2 of character m.
+  // |m|1|2, meaning chunk 1 of 2 of character m, or
+  // |:|1n2, meaning chunk 1 of 2 of character :, and no chopping was needed
+  // to divide the parts, as they were already separate connected components.
  //
  // If parsing succeeded returns the pointer to the allocated CHAR_FRAGMENT
  // instance, otherwise (if the string does not represent a fragment or it
@ -113,6 +120,10 @@ class CHAR_FRAGMENT {

 private:
  char unichar[UNICHAR_LEN + 1];
+  // True if the fragment was a separate component to begin with,
+  // ie did not need chopping to be isolated, but may have been separated
+  // out from a multi-outline blob.
+  bool natural;
  inT16 pos;    // fragment position in the character
  inT16 total;  // total number of fragments in the character
 };
@ -122,6 +133,35 @@ class CHAR_FRAGMENT {
 // by a unique number, from 0 to (size - 1).
 class UNICHARSET {
 public:
+  // Custom list of characters and their ligature forms (UTF8)
+  // These map to unicode values in the private use area (PUA) and are supported
+  // by only a few font families (eg. Wyld, Adobe Caslon Pro).
+  static const char* kCustomLigatures[][2];
+
+  // ICU 2.0 UCharDirection enum (from third_party/icu/include/unicode/uchar.h)
+  enum Direction {
+      U_LEFT_TO_RIGHT               = 0,
+      U_RIGHT_TO_LEFT               = 1,
+      U_EUROPEAN_NUMBER             = 2,
+      U_EUROPEAN_NUMBER_SEPARATOR   = 3,
+      U_EUROPEAN_NUMBER_TERMINATOR  = 4,
+      U_ARABIC_NUMBER               = 5,
+      U_COMMON_NUMBER_SEPARATOR     = 6,
+      U_BLOCK_SEPARATOR             = 7,
+      U_SEGMENT_SEPARATOR           = 8,
+      U_WHITE_SPACE_NEUTRAL         = 9,
+      U_OTHER_NEUTRAL               = 10,
+      U_LEFT_TO_RIGHT_EMBEDDING     = 11,
+      U_LEFT_TO_RIGHT_OVERRIDE      = 12,
+      U_RIGHT_TO_LEFT_ARABIC        = 13,
+      U_RIGHT_TO_LEFT_EMBEDDING     = 14,
+      U_RIGHT_TO_LEFT_OVERRIDE      = 15,
+      U_POP_DIRECTIONAL_FORMAT      = 16,
+      U_DIR_NON_SPACING_MARK        = 17,
+      U_BOUNDARY_NEUTRAL            = 18,
+      U_CHAR_DIRECTION_COUNT
+  };
+
  // Create an empty UNICHARSET
  UNICHARSET();

@ -142,10 +182,21 @@ class UNICHARSET {
  // ensures there is a legal match after it.
  int step(const char* str) const;

+  // Return whether the given UTF-8 string is encodable with this UNICHARSET.
+  // If not encodable, write the first byte offset which cannot be converted
+  // into the second (return) argument.
+  bool encodable_string(const char *str, int *first_bad_position) const;
+
  // Return the unichar representation corresponding to the given UNICHAR_ID
  // within the UNICHARSET.
  const char* const id_to_unichar(UNICHAR_ID id) const;

+  // Return the UTF8 representation corresponding to the given UNICHAR_ID after
+  // resolving any private encodings internal to Tesseract. This method is
+  // preferable to id_to_unichar for outputting text that will be visible to
+  // external applications.
+  const char* const id_to_unichar_ext(UNICHAR_ID id) const;
+
  // Return a STRING that reformats the utf8 str into the str followed
  // by its hex unicodes.
  static STRING debug_utf8_str(const char* str);
@ -163,7 +214,8 @@ class UNICHARSET {
  // Return true if the given unichar id exists within the set.
  // Relies on the fact that unichar ids are contiguous in the unicharset.
  bool contains_unichar_id(UNICHAR_ID unichar_id) const {
-    return unichar_id != INVALID_UNICHAR_ID && unichar_id < size_used;
+    return unichar_id != INVALID_UNICHAR_ID && unichar_id < size_used &&
+        unichar_id >= 0;
  }

  // Return true if the given unichar representation exists within the set.
@ -237,6 +289,16 @@ class UNICHARSET {
  // Returns true if the operation is successful.
  bool save_to_file(FILE *file) const;

+  // Load a unicharset from a unicharset file that has been loaded into
+  // the given memory buffer.
+  // Returns true if the operation is successful.
+  bool load_from_inmemory_file(const char* const memory, int mem_size,
+                               bool skip_fragments);
+  // Returns true if the operation is successful.
+  bool load_from_inmemory_file(const char* const memory, int mem_size) {
+    return load_from_inmemory_file(memory, mem_size, false);
+  }
+
  // Opens the file indicated by filename and loads the UNICHARSET
  // from the given file. The previous data is lost.
  // Returns true if the operation is successful.
@ -247,6 +309,7 @@ class UNICHARSET {
    fclose(file);
    return result;
  }
+  // Returns true if the operation is successful.
  bool load_from_file(const char* const filename) {
    return load_from_file(filename, false);
  }
@ -261,9 +324,11 @@ class UNICHARSET {
  // during set_unicharset_properties.
  void post_load_setup();

-  // Returns true if any script entry in the unicharset is for a
-  // right_to_left language.
-  bool any_right_to_left() const;
+  // Returns true if right_to_left scripts are significant in the unicharset,
+  // but without being so sensitive that "universal" unicharsets containing
+  // characters from many scripts, like orientation and script detection,
+  // look like they are right_to_left.
+  bool major_right_to_left() const;

  // Set a whitelist and/or blacklist of characters to recognize.
  // An empty or NULL whitelist enables everything (minus any blacklist).
@ -315,40 +380,85 @@ class UNICHARSET {
    unichars[unichar_id].properties.other_case = other_case;
  }

+  // Set the direction property of the given unichar to the given value.
+  void set_direction(UNICHAR_ID unichar_id, UNICHARSET::Direction value) {
+    unichars[unichar_id].properties.direction = value;
+  }
+
+  // Set mirror unichar id in the properties for the given unichar id.
+  void set_mirror(UNICHAR_ID unichar_id, UNICHAR_ID mirror) {
+    unichars[unichar_id].properties.mirror = mirror;
+  }
+
+  // Record normalized version of unichar with the given unichar_id.
+  void set_normed(UNICHAR_ID unichar_id, const char* normed) {
+    unichars[unichar_id].properties.normed = normed;
+  }
+
  // Return the isalpha property of the given unichar.
  bool get_isalpha(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return false;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.isalpha;
  }

  // Return the islower property of the given unichar.
  bool get_islower(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return false;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.islower;
  }

  // Return the isupper property of the given unichar.
  bool get_isupper(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return false;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.isupper;
  }

  // Return the isdigit property of the given unichar.
  bool get_isdigit(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return false;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.isdigit;
  }

  // Return the ispunctuation property of the given unichar.
  bool get_ispunctuation(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return false;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.ispunctuation;
  }

  // Return the isngram property of the given unichar.
  bool get_isngram(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return false;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.isngram;
  }

+  // Returns whether the unichar id represents a unicode value in the private
+  // use area.
+  bool get_isprivate(UNICHAR_ID unichar_id) const;
+
  // Returns true if the ids have useful min/max top/bottom values.
  bool top_bottom_useful() const {
    return top_bottom_set_;
  }
+  // Sets all ranges to empty, so they can be expanded to set the values.
+  void set_ranges_empty();
+  // Sets all the properties for this unicharset given a src_unicharset with
+  // everything set. The unicharsets don't have to be the same, and graphemes
+  // are correctly accounted for.
+  void SetPropertiesFromOther(const UNICHARSET& src);
+  // Expands the tops and bottoms and widths for this unicharset given a
+  // src_unicharset with ranges in it. The unicharsets don't have to be the
+  // same, and graphemes are correctly accounted for.
+  void ExpandRangesFromOther(const UNICHARSET& src);
+  // For each id in src, if it does not occur in this, add it, as in
+  // SetPropertiesFromOther, otherwise expand the ranges, as in
+  // ExpandRangesFromOther.
+  void AppendOtherUnicharset(const UNICHARSET& src);
  // Returns the min and max bottom and top of the given unichar in
  // baseline-normalized coordinates, ie, where the baseline is
  // kBlnBaselineOffset and the meanline is kBlnBaselineOffset + kBlnXHeight
@ -356,6 +466,12 @@ class UNICHARSET {
  void get_top_bottom(UNICHAR_ID unichar_id,
                      int* min_bottom, int* max_bottom,
                      int* min_top, int* max_top) const {
+    if (INVALID_UNICHAR_ID == unichar_id) {
+      *min_bottom = *min_top = 0;
+      *max_bottom = *max_top = 256;  // kBlnCellHeight
+      return;
+    }
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    *min_bottom = unichars[unichar_id].properties.min_bottom;
    *max_bottom = unichars[unichar_id].properties.max_bottom;
    *min_top = unichars[unichar_id].properties.min_top;
@ -373,11 +489,76 @@ class UNICHARSET {
    unichars[unichar_id].properties.max_top =
        static_cast<uinT8>(ClipToRange(max_top, 0, MAX_UINT8));
  }
+  // Returns the width range of the given unichar in baseline-normalized
+  // coordinates, ie, where the baseline is kBlnBaselineOffset and the
+  // meanline is kBlnBaselineOffset + kBlnXHeight.
+  // (See normalis.h for the definitions).
+  void get_width_range(UNICHAR_ID unichar_id,
+                       int* min_width, int* max_width) const {
+    if (INVALID_UNICHAR_ID == unichar_id) {
+      *min_width = 0;
+      *max_width = 256;  // kBlnCellHeight
+      return;
+    }
+    ASSERT_HOST(contains_unichar_id(unichar_id));
+    *min_width = unichars[unichar_id].properties.min_width;
+    *max_width = unichars[unichar_id].properties.max_width;
+  }
+  void set_width_range(UNICHAR_ID unichar_id, int min_width, int max_width) {
+    unichars[unichar_id].properties.min_width =
+        static_cast<inT16>(ClipToRange(min_width, 0, MAX_INT16));
+    unichars[unichar_id].properties.max_width =
+        static_cast<inT16>(ClipToRange(max_width, 0, MAX_INT16));
+  }
+  // Returns the range of the x-bearing of the given unichar in
+  // baseline-normalized coordinates, ie, where the baseline is
+  // kBlnBaselineOffset and the meanline is kBlnBaselineOffset + kBlnXHeight.
+  // (See normalis.h for the definitions).
+  void get_bearing_range(UNICHAR_ID unichar_id,
+                         int* min_bearing, int* max_bearing) const {
+    if (INVALID_UNICHAR_ID == unichar_id) {
+      *min_bearing = *max_bearing = 0;
+      return;
+    }
+    ASSERT_HOST(contains_unichar_id(unichar_id));
+    *min_bearing = unichars[unichar_id].properties.min_bearing;
+    *max_bearing = unichars[unichar_id].properties.max_bearing;
+  }
+  void set_bearing_range(UNICHAR_ID unichar_id,
+                         int min_bearing, int max_bearing) {
+    unichars[unichar_id].properties.min_bearing =
+        static_cast<inT16>(ClipToRange(min_bearing, 0, MAX_INT16));
+    unichars[unichar_id].properties.max_bearing =
+        static_cast<inT16>(ClipToRange(max_bearing, 0, MAX_INT16));
+  }
+  // Returns the range of the x-advance of the given unichar in
+  // baseline-normalized coordinates, ie, where the baseline is
+  // kBlnBaselineOffset and the meanline is kBlnBaselineOffset + kBlnXHeight.
+  // (See normalis.h for the definitions).
+  void get_advance_range(UNICHAR_ID unichar_id,
+                         int* min_advance, int* max_advance) const {
+    if (INVALID_UNICHAR_ID == unichar_id) {
+      *min_advance = *max_advance = 0;
+      return;
+    }
+    ASSERT_HOST(contains_unichar_id(unichar_id));
+    *min_advance = unichars[unichar_id].properties.min_advance;
+    *max_advance = unichars[unichar_id].properties.max_advance;
+  }
+  void set_advance_range(UNICHAR_ID unichar_id,
+                         int min_advance, int max_advance) {
+    unichars[unichar_id].properties.min_advance =
+        static_cast<inT16>(ClipToRange(min_advance, 0, MAX_INT16));
+    unichars[unichar_id].properties.max_advance =
+        static_cast<inT16>(ClipToRange(max_advance, 0, MAX_INT16));
+  }

  // Return the script name of the given unichar.
  // The returned pointer will always be the same for the same script, it's
  // managed by unicharset and thus MUST NOT be deleted
  int get_script(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return null_sid_;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.script_id;
  }

@ -396,17 +577,37 @@ class UNICHARSET {

  // Get other_case unichar id in the properties for the given unichar id.
  UNICHAR_ID get_other_case(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return INVALID_UNICHAR_ID;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.other_case;
  }

+  // Returns the direction property of the given unichar.
+  Direction get_direction(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return UNICHARSET::U_OTHER_NEUTRAL;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
+    return unichars[unichar_id].properties.direction;
+  }
+
+  // Get mirror unichar id in the properties for the given unichar id.
+  UNICHAR_ID get_mirror(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return INVALID_UNICHAR_ID;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
+    return unichars[unichar_id].properties.mirror;
+  }
+
  // Returns UNICHAR_ID of the corresponding lower-case unichar.
  UNICHAR_ID to_lower(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return INVALID_UNICHAR_ID;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    if (unichars[unichar_id].properties.islower) return unichar_id;
    return unichars[unichar_id].properties.other_case;
  }

  // Returns UNICHAR_ID of the corresponding upper-case unichar.
  UNICHAR_ID to_upper(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return INVALID_UNICHAR_ID;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    if (unichars[unichar_id].properties.isupper) return unichar_id;
    return unichars[unichar_id].properties.other_case;
  }
@ -414,6 +615,8 @@ class UNICHARSET {
  // Return a pointer to the CHAR_FRAGMENT class if the given
  // unichar id represents a character fragment.
  const CHAR_FRAGMENT *get_fragment(UNICHAR_ID unichar_id) const {
+    if (INVALID_UNICHAR_ID == unichar_id) return NULL;
+    ASSERT_HOST(contains_unichar_id(unichar_id));
    return unichars[unichar_id].properties.fragment;
  }

@ -504,6 +707,11 @@ class UNICHARSET {
    return get_ispunctuation(unichar_to_id(unichar_repr, length));
  }

+  // Returns normalized version of unichar with the given unichar_id.
+  const char *get_normed_unichar(UNICHAR_ID unichar_id) const {
+    return unichars[unichar_id].properties.normed.string();
+  }
+
  // Return the script name of the given unichar representation.
  // Only the first length characters from unichar_repr are used.
  // The returned pointer will always be the same for the same script, it's
@ -574,7 +782,20 @@ class UNICHARSET {

  struct UNICHAR_PROPERTIES {
    UNICHAR_PROPERTIES();
+    // Initializes all properties to sensible default values.
    void Init();
+    // Sets all ranges wide open. Initialization default in case there are
+    // no useful values available.
+    void SetRangesOpen();
+    // Sets all ranges to empty. Used before expanding with font-based data.
+    void SetRangesEmpty();
+    // Returns true if any of the top/bottom/width/bearing/advance ranges is
+    // empty.
+    bool AnyRangeEmpty() const;
+    // Expands the ranges with the ranges from the src properties.
+    void ExpandRangesFrom(const UNICHAR_PROPERTIES& src);
+    // Copies the properties from src into this.
+    void CopyFrom(const UNICHAR_PROPERTIES& src);

    bool  isalpha;
    bool  islower;
@ -591,9 +812,25 @@ class UNICHARSET {
    uinT8 max_bottom;
    uinT8 min_top;
    uinT8 max_top;
+    // Limits on the widths of bounding box, also in baseline-normalized coords.
+    inT16 min_width;
+    inT16 max_width;
+    // Limits on the x-bearing and advance, also in baseline-normalized coords.
+    inT16 min_bearing;
+    inT16 max_bearing;
+    inT16 min_advance;
+    inT16 max_advance;
    int   script_id;
    UNICHAR_ID other_case;  // id of the corresponding upper/lower case unichar
-
+    Direction direction;  // direction of this unichar
+    // Mirror property is useful for reverse DAWG lookup for words in
+    // right-to-left languages, e.g. "(word)" would be
+    // '[open paren]' 'w' 'o' 'r' 'd' '[close paren]' in a UTF8 string.
+    // However, what we want in our DAWG is
+    // '[open paren]', 'd', 'r', 'o', 'w', '[close paren]' not
+    // '[close paren]', 'd', 'r', 'o', 'w', '[open paren]'.
+    UNICHAR_ID mirror;
+    STRING normed;  // normalized version of this unichar
    // Contains meta information about the fragment if a unichar represents
    // a fragment of a character, otherwise should be set to NULL.
    // It is assumed that character fragments are added to the unicharset
@ -606,6 +843,20 @@ class UNICHARSET {
    UNICHAR_PROPERTIES properties;
  };

+  // Gets the properties for a grapheme string, combining properties for
+  // multiple characters in a meaningful way where possible.
+  // Returns false if no valid match was found in the unicharset.
+  // NOTE that script_id, mirror, and other_case refer to this unicharset on
+  // return and will need redirecting if the target unicharset is different.
+  bool GetStrProperties(const char* utf8_str,
+                        UNICHAR_PROPERTIES* props) const;
+
+  // Load ourselves from a "file" where our only interface to the file is
+  // an implementation of fgets().  This is the parsing primitive accessed by
+  // the public routines load_from_file() and load_from_inmemory_file().
+  bool load_via_fgets(TessResultCallback2<char *, char *, int> *fgets_cb,
+                      bool skip_fragments);
+
  UNICHAR_SLOT* unichars;
  UNICHARMAP ids;
  int size_used;
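Reviewer note: most getters in this file now follow one pattern: return a neutral default for `INVALID_UNICHAR_ID`, assert that any other id is in range, then read the property. The setters for the new ranges clip their inputs before narrowing to `inT16`. The sketch below reproduces that pattern in isolation. `TinyCharset`, `kInvalidId`, `Clip` and the value `-1` for the invalid id are assumptions made for illustration, and `assert` stands in for `ASSERT_HOST`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef int UnicharId;            // stand-in for UNICHAR_ID
const UnicharId kInvalidId = -1;  // stand-in for INVALID_UNICHAR_ID (assumed)

// Clamps value to [lo, hi]; plays the role of ClipToRange in this sketch.
static int Clip(int value, int lo, int hi) {
  return value < lo ? lo : (value > hi ? hi : value);
}

class TinyCharset {
 public:
  // Appends a character with wide-open width limits and returns its id.
  UnicharId Add(bool isalpha) {
    Props p;
    p.isalpha = isalpha;
    p.min_width = 0;
    p.max_width = 32767;
    props_.push_back(p);
    return static_cast<UnicharId>(props_.size()) - 1;
  }
  // Mirrors contains_unichar_id: not invalid, non-negative, below the size.
  bool Contains(UnicharId id) const {
    return id != kInvalidId && id >= 0 &&
           static_cast<size_t>(id) < props_.size();
  }
  // Guarded getter: neutral answer for the invalid id, assert otherwise.
  bool IsAlpha(UnicharId id) const {
    if (id == kInvalidId) return false;
    assert(Contains(id));
    return props_[id].isalpha;
  }
  // Guarded getter with a wide-open default range for the invalid id.
  void GetWidthRange(UnicharId id, int* min_width, int* max_width) const {
    if (id == kInvalidId) {
      *min_width = 0;
      *max_width = 256;
      return;
    }
    assert(Contains(id));
    *min_width = props_[id].min_width;
    *max_width = props_[id].max_width;
  }
  // Setter clips to the 16-bit storage range before narrowing.
  void SetWidthRange(UnicharId id, int min_width, int max_width) {
    assert(Contains(id));
    props_[id].min_width = static_cast<short>(Clip(min_width, 0, 32767));
    props_[id].max_width = static_cast<short>(Clip(max_width, 0, 32767));
  }

 private:
  struct Props {
    bool isalpha;
    short min_width;
    short max_width;
  };
  std::vector<Props> props_;
};
```

Checking for the invalid id before asserting lets callers pass a failed lookup straight through without aborting, while a genuinely out-of-range id is still caught.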
--- a/ccutil/unicity_table.h
+++ b/ccutil/unicity_table.h
@ -85,7 +85,7 @@ class UnicityTable {
  /// The Callback given must be permanent since they will be called more than
  /// once. The given callback will be deleted at the end.
  /// Returns false on read/write error.
-  bool write(FILE* f, TessResultCallback2<bool, FILE*, T const &>* cb);
+  bool write(FILE* f, TessResultCallback2<bool, FILE*, T const &>* cb) const;
  /// swap is used to switch the endianness.
  bool read(FILE* f, TessResultCallback3<bool, FILE*, T*, bool>* cb, bool swap);

@ -187,8 +187,8 @@ void UnicityTable<T>::clear() {
 }

 template <typename T>
-bool UnicityTable<T>::write(FILE* f,
-                            TessResultCallback2<bool, FILE*, T const &>* cb) {
+bool UnicityTable<T>::write(
+    FILE* f, TessResultCallback2<bool, FILE*, T const &>* cb) const {
  return table_.write(f, cb);
 }

--- a/ccutil/unicodes.cpp
+++ b/ccutil/unicodes.cpp
@ -0,0 +1,57 @@
+/**********************************************************************
+ * File:        unicodes.cpp
+ * Description: Unicode related machinery
+ * Author:      David Eger
+ * Created:     Wed Jun 15 16:37:50 PST 2011
+ *
+ * (C) Copyright 2011, Google, Inc.
+ ** Licensed under the Apache License, Version 2.0 (the "License");
+ ** you may not use this file except in compliance with the License.
+ ** You may obtain a copy of the License at
+ ** http://www.apache.org/licenses/LICENSE-2.0
+ ** Unless required by applicable law or agreed to in writing, software
+ ** distributed under the License is distributed on an "AS IS" BASIS,
+ ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ** See the License for the specific language governing permissions and
+ ** limitations under the License.
+ *
+ **********************************************************************/
+
+#include "unicodes.h"
+#include "host.h"  // for NULL
+
+namespace tesseract {
+
+const char *kUTF8LineSeparator = "\u2028";  // "\xe2\x80\xa8";
+const char *kUTF8ParagraphSeparator = "\u2029";  // "\xe2\x80\xa9";
+const char *kLRM = "\u200E";  // Left-to-Right Mark
+const char *kRLM = "\u200F";  // Right-to-Left Mark
+const char *kRLE = "\u202B";  // Right-to-Left Embedding
+const char *kPDF = "\u202C";  // Pop Directional Formatting
+
+const char *kHyphenLikeUTF8[] = {
+  "-",       // ASCII hyphen-minus
+  "\u05BE",  // word hyphen in Hebrew
+  "\u2010",  // hyphen
+  "\u2011",  // non-breaking hyphen
+  "\u2012",  // a hyphen the same width as digits
+  "\u2013",  // en dash
+  "\u2014",  // em dash
+  "\u2015",  // horizontal bar
+  "\u2212",  // arithmetic minus sign
+  "\uFE58",  // small em dash
+  "\uFE63",  // small hyphen-minus
+  "\uFF0D",  // fullwidth hyphen-minus
+  NULL,      // end of our list
+};
+
+const char *kApostropheLikeUTF8[] = {
+  "'",       // ASCII apostrophe
+  "`",       // ASCII backtick
+  "\u2018",  // opening single quote
+  "\u2019",  // closing single quote
+  "\u2032",  // mathematical prime mark
+  NULL,      // end of our list.
+};
+
+}  // namespace
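Reviewer note: the `\u` escapes in this file only produce the byte sequences shown in the trailing comments if the compiler's execution character set is UTF-8. As a cross-check, the sketch below derives the UTF-8 bytes for a code point by hand. `EncodeBmpUtf8` is a hypothetical helper, not part of Tesseract; it assumes the input lies in the Basic Multilingual Plane (U+0000 to U+FFFF) and does not reject surrogates.

```cpp
#include <cassert>
#include <string>

// Encodes a code point in U+0000..U+FFFF as UTF-8 (1 to 3 bytes).
// Sketch only: surrogates and values above U+FFFF are not handled.
std::string EncodeBmpUtf8(unsigned int cp) {
  std::string out;
  if (cp < 0x80) {
    // 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For U+2028 this yields the bytes e2 80 a8, matching the comment next to `kUTF8LineSeparator`. Writing the constants as explicit byte escapes would remove the dependency on compiler settings, at some cost in readability.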
--- a/ccutil/unicodes.h
+++ b/ccutil/unicodes.h
@ -0,0 +1,39 @@
+/**********************************************************************
+ * File:        unicodes.h
+ * Description: Unicode related machinery
+ * Author:      David Eger
+ * Created:     Wed Jun 15 16:37:50 PST 2011
+ *
+ * (C) Copyright 2011, Google, Inc.
+ ** Licensed under the Apache License, Version 2.0 (the "License");
+ ** you may not use this file except in compliance with the License.
+ ** You may obtain a copy of the License at
+ ** http://www.apache.org/licenses/LICENSE-2.0
+ ** Unless required by applicable law or agreed to in writing, software
+ ** distributed under the License is distributed on an "AS IS" BASIS,
+ ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ** See the License for the specific language governing permissions and
+ ** limitations under the License.
+ *
+ **********************************************************************/
+
+#ifndef TESSERACT_CCUTIL_UNICODES_H__
+#define TESSERACT_CCUTIL_UNICODES_H__
+
+namespace tesseract {
+
+extern const char *kUTF8LineSeparator;
+extern const char *kUTF8ParagraphSeparator;
+extern const char *kLRM;  // Left-to-Right Mark
+extern const char *kRLM;  // Right-to-Left Mark
+extern const char *kRLE;  // Right-to-Left Embedding
+extern const char *kPDF;  // Pop Directional Formatting
+
+// The following are confusable internal word punctuation symbols
+// which we normalize to the first variant when matching in dawgs.
+extern const char *kHyphenLikeUTF8[];
+extern const char *kApostropheLikeUTF8[];
+
+}  // namespace
+
+#endif  // TESSERACT_CCUTIL_UNICODES_H__
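Reviewer note: the header says the confusable symbols are normalized "to the first variant" when matching in dawgs. One way to read that, sketched below, is a linear scan over the NULL-terminated list that returns element 0 on any match. `NormalizeToFirst` and the shortened `kHyphenLikeSample` list are hypothetical and written with explicit byte escapes; the real call sites are not part of this section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Shortened, hypothetical copy of a hyphen-like list, NULL-terminated.
static const char* const kHyphenLikeSample[] = {
  "-",             // ASCII hyphen-minus
  "\xe2\x80\x90",  // U+2010 hyphen
  "\xe2\x80\x93",  // U+2013 en dash
  "\xe2\x80\x94",  // U+2014 em dash
  NULL,            // end of list
};

// Returns variants[0] if utf8 equals any entry of the NULL-terminated list,
// otherwise returns utf8 unchanged.
const char* NormalizeToFirst(const char* const* variants, const char* utf8) {
  for (int i = 0; variants[i] != NULL; ++i) {
    if (strcmp(variants[i], utf8) == 0) return variants[0];
  }
  return utf8;
}
```

Terminating the arrays with NULL means callers need no separate length constant, which is presumably why both lists in unicodes.cpp end that way.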
--- a/configure.ac
+++ b/configure.ac
@ -7,7 +7,7 @@
 # ----------------------------------------

 AC_PREREQ(2.50)
-AC_INIT([tesseract], [3.01], [theraysmith@gmail.com])
+AC_INIT([tesseract], [3.02], [theraysmith@gmail.com])
 AC_CONFIG_MACRO_DIR([m4])
 AC_REVISION($Id: configure.ac,v 1.4 2007/02/02 22:38:17 theraysmith Exp $)
 AC_CONFIG_AUX_DIR(config)
@ -18,8 +18,8 @@ AC_CANONICAL_HOST
 # Define date of package, etc. Could be useful in auto-generated
 # documentation.
 # TODO(luc) Generate good documentation using doxygen or equivalent
-PACKAGE_YEAR=2011
-PACKAGE_DATE="10/16"
+PACKAGE_YEAR=2012
+PACKAGE_DATE="02/01"

 AC_DEFINE_UNQUOTED(PACKAGE_NAME,["${PACKAGE_NAME}"],[Name of package])
 AC_DEFINE_UNQUOTED(PACKAGE_VERSION,["${PACKAGE_VERSION}"],[Version number])
@ -35,7 +35,7 @@ GENERIC_LIBRARY_NAME=tesseract

 # Release versioning
 GENERIC_MAJOR_VERSION=3
-GENERIC_MINOR_VERSION=1
+GENERIC_MINOR_VERSION=2
 GENERIC_MICRO_VERSION=0

 # API version (often = GENERIC_MAJOR_VERSION.GENERIC_MINOR_VERSION)
--- a/cutil/freelist.cpp
+++ b/cutil/freelist.cpp
@ -11,57 +11,22 @@
 **************************************************************************/
 #include "freelist.h"

-#include <memory.h>
-
-#include "danerror.h"
-#include "memry.h"
-#include "tprintf.h"
+#include <stdlib.h>
+
+
+// With improvements in OS memory allocators, internal memory management is
+// no longer required, so these functions all map to their malloc-family
+// equivalents.


-/**********************************************************************
- * memalloc
- *
- * Memory allocator with protection.
- **********************************************************************/
 int *memalloc(int size) {
-  return ((int *) alloc_mem (size));
+  return static_cast<int*>(malloc(static_cast<size_t>(size)));
 }

-
-/**********************************************************************
- * memrealloc
- *
- * Memory allocator with protection.
- **********************************************************************/
 int *memrealloc(void *ptr, int size, int oldsize) {
-  int shiftsize;
-  int *newbuf;
-
-  shiftsize = size > oldsize ? oldsize : size;
-  newbuf = (int *) alloc_mem (size);
-  memcpy(newbuf, ptr, shiftsize);
-  free_mem(ptr);
-  return newbuf;
+  return static_cast<int*>(realloc(ptr, static_cast<size_t>(size)));
 }

-
-/**********************************************************************
- * memfree
- *
- * Memory allocator with protection.
- **********************************************************************/
 void memfree(void *element) {
-  if (element) {
-    free_mem(element);
-  }
-}
-
-
-/**********************************************************************
- * mem_tidy
- *
- * Do nothing.
- **********************************************************************/
-void mem_tidy(int level) {
-  check_mem ("Old tidy", level);
+  free(element);
 }
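Reviewer note: the removed `memrealloc` copied `min(size, oldsize)` bytes into a fresh block by hand; the replacement relies on `realloc` doing the same. The sketch below puts the two side by side so the equivalence of the preserved contents can be checked. `ManualRealloc` and `LibcRealloc` are hypothetical names; allocation failure is not handled beyond returning NULL.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Old-style approach: allocate, copy the overlapping prefix, free the source.
void* ManualRealloc(void* ptr, size_t size, size_t oldsize) {
  void* newbuf = malloc(size);
  if (newbuf == NULL) return NULL;  // the original block is left untouched
  const size_t keep = size > oldsize ? oldsize : size;
  memcpy(newbuf, ptr, keep);
  free(ptr);
  return newbuf;
}

// New-style approach: realloc preserves contents up to the smaller of the
// old and new sizes, so the old size no longer needs to be passed in.
void* LibcRealloc(void* ptr, size_t size) {
  return realloc(ptr, size);
}
```

Two consequences follow from standard library behaviour: the `oldsize` argument of `memrealloc` is now unused, and the null check dropped from `memfree` is safe to lose because `free(NULL)` is defined to do nothing.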
--- a/cutil/freelist.h
+++ b/cutil/freelist.h
@ -40,6 +40,4 @@ int *memrealloc(void *ptr, int size, int oldsize);

 void memfree(void *element);

-void mem_tidy(int level);
-
 #endif
--- a/image/image.h
+++ b/image/image.h
@ -26,6 +26,9 @@ class CCUtil;
 class Image {
 public:
  Image(CCUtil* ccutil_ptr);
+  const CCUtil* getCCUtil() const {
+    return ccutil_ptr_;
+  }
  CCUtil* getCCUtil() {
    return ccutil_ptr_;
  }
--- a/image/svshowim.cpp
+++ b/image/svshowim.cpp
@ -23,7 +23,6 @@ void sv_show_sub_image(IMAGE*    source,         // Image to show.
                       ScrollView*    window,         // Window to draw in.
                       inT32     xpos,           // Place to show bottom-left.
                       inT32     ypos) {         // Y position.
-
  Pix* pix;
  if (xstart != 0 || ystart != 0 ||
      xext != source->get_xsize() || yext != source->get_ysize()) {
--- a/viewer/scrollview.cpp
+++ b/viewer/scrollview.cpp
@ -52,7 +52,6 @@ const int kMaxIntPairSize = 45;  // Holds %d,%d, for upto 64 bit.

 #include "allheaders.h"

-
 struct SVPolyLineBuffer {
  bool empty;  // Independent indicator to allow SendMsg to call SendPolygon.
  std::vector<int> xcoords;
--- a/viewer/scrollview.h
+++ b/viewer/scrollview.h
@ -31,6 +31,8 @@

 #ifndef TESSERACT_VIEWER_SCROLLVIEW_H__
 #define TESSERACT_VIEWER_SCROLLVIEW_H__
+// TODO(rays) Move ScrollView into the tesseract namespace.
+#ifndef OCR_SCROLLVIEW_H__

 #include <stdio.h>

@ -412,4 +414,5 @@ class ScrollView {
 #endif  // GRAPHICS_DISABLED
 };

+#endif  // OCR_SCROLLVIEW_H__
 #endif  // TESSERACT_VIEWER_SCROLLVIEW_H__
--- a/viewer/svutil.cpp
+++ b/viewer/svutil.cpp
@ -42,6 +42,7 @@ struct addrinfo {
 #ifdef __linux__
 #include <sys/prctl.h>
 #endif
+#include <unistd.h>
 #endif

 #include <cstdlib>