tesseract/training/tesstrain.sh

#!/bin/bash
# (C) Copyright 2014, Google Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script provides an easy way to execute various phases of training
# Tesseract.  For a detailed description of the phases, see
# https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
#
# USAGE:
#
# tesstrain.sh
#    --fontlist FONTS           # A list of fontnames to train on; quote names
#                               # that contain spaces, e.g. "Times New Roman".
#    --fonts_dir FONTS_PATH     # Path to font files.
#    --lang LANG_CODE           # ISO 639 code.
#    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
#    --output_dir OUTPUTDIR     # Location of output traineddata file.
#    --overwrite                # Safe to overwrite files in output_dir.
#    --run_shape_clustering     # Run shape clustering (use for Indic langs).
#    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1").
#
# OPTIONAL flags for input data. If unspecified we will look for them in
# the langdata_dir directory.
#    --training_text TEXTFILE   # Text to render and use for training.
#    --wordlist WORDFILE        # Word list for the language ordered by
#                               # decreasing frequency.
#
# OPTIONAL flag to specify location of existing traineddata files, required
# during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
# the current environment.
#    --tessdata_dir TESSDATADIR     # Path to tesseract/tessdata directory.
#
# NOTE:
# The font names specified in --fontlist need to be recognizable by Pango using
# fontconfig. An easy way to list the canonical names of all fonts available on
# your system is to run text2image with --list_available_fonts and the
# appropriate --fonts_dir path.
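#
# EXAMPLE:
# A sketch of a typical invocation using only the flags documented above. The
# language code, directories and font names are placeholders chosen for
# illustration, not defaults; substitute values that exist on your system.
#
# tesstrain.sh --lang eng \
#    --langdata_dir /path/to/langdata \
#    --fonts_dir /path/to/fonts \
#    --fontlist "Times New Roman" "Arial Black" \
#    --exposures "-1 0 1" \
#    --output_dir /path/to/output \
#    --overwrite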


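# Load the helper functions (parse_flags, tlog, phase_* and friends) from the
# directory this script lives in. Quoted so paths containing spaces work.
source "$(dirname "$0")/tesstrain_utils.sh"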

ARGV=("$@")
parse_flags

mkdir -p "${TRAINING_DIR}"
tlog "\n=== Starting training for language '${LANG_CODE}'"

source "$(dirname "$0")/language-specific.sh"
set_lang_specific_parameters "${LANG_CODE}"

initialize_fontconfig

phase_I_generate_image 8
phase_UP_generate_unicharset
phase_D_generate_dawg
phase_E_extract_features "box.train" 8
phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
    phase_S_cluster_shapes
fi
phase_M_cluster_microfeatures
phase_B_generate_ambiguities
make__traineddata

tlog "\nCompleted training for language '${LANG_CODE}'\n"