From 06d278ea48ac165870bedd1a0719c250f682bec1 Mon Sep 17 00:00:00 2001
From: Shreeshrii
Date: Tue, 27 Dec 2016 17:23:15 +0530
Subject: [PATCH] Added section on Fine-tuning and LSTM model - example using Arabic

---
 4.0-with-LSTM.md | 82 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 81 insertions(+), 1 deletion(-)

diff --git a/4.0-with-LSTM.md b/4.0-with-LSTM.md
index ec4d249..366d7ec 100644
--- a/4.0-with-LSTM.md
+++ b/4.0-with-LSTM.md
@@ -13,7 +13,87 @@
 
 Slides [#7](https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf) have information about LSTM integration in Tesseract 4.0.
 
-## 3.05-dev
+## Fine-tuning an LSTM model - example using Arabic
+
+Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories in the following directory structure.
+
+```
+./langdata
+./tessdata
+./tesseract-ocr
+./tesseract-ocr/tessdata
+./tesseract-ocr/tessdata/configs/
+```
+
+Copy the English and Arabic 4.0.0alpha traineddata files to ./tesseract-ocr/tessdata.
+Check that lstm.train is available under configs.
+Set up the appropriate TESSDATA_PREFIX directory.
+```
+cp ./tessdata/eng.traineddata ./tesseract-ocr/tessdata
+cp ./tessdata/ara.traineddata ./tesseract-ocr/tessdata
+```
+
+Change to the tesseract-ocr directory and then follow the given commands.
+
+```
+cd ./tesseract-ocr
+
+training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara --linedata_only \
+  --training_text ../langdata/ara/arabic1.txt \
+  --langdata_dir ../langdata --tessdata_dir ./tessdata \
+  --fontlist "Times New Roman," \
+  --output_dir ~/tesstutorial/aratest
+```
+This creates the .lstmf files in the output directory. The box/tiff pairs are created in a /tmp/../ara/ directory and are not copied to the output directory.
+
+```
+mkdir -p ~/tesstutorial/aratuned_from_ara
+
+combine_tessdata -e ../tessdata/ara.traineddata \
+  ~/tesstutorial/aratuned_from_ara/ara.lstm
+
+lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
+  --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
+  --train_listfile ~/tesstutorial/aratest/ara.training_files.txt \
+  --target_error_rate 0.01
+```
+
+The above commands extract the existing LSTM model for Arabic from ./tessdata and fine-tune it using the .lstmf files created earlier, which are listed in the train_listfile.
+```
+lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
+  --continue_from ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
+  --stop_training
+```
+The above command creates the new LSTM model from the fine-tuning output.
+```
+combine_tessdata -o ./tessdata/ara.traineddata \
+  ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
+  ~/tesstutorial/aratest/ara.lstm-number-dawg \
+  ~/tesstutorial/aratest/ara.lstm-punc-dawg \
+  ~/tesstutorial/aratest/ara.lstm-word-dawg
+```
+Finally, the new LSTM model and new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
+
+```
+training/lstmeval --model ~/tesstutorial/aratuned_from_ara/ara.lstm \
+  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt
+
+training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
+  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt
+
+training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
+  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt
+```
+
+The above three commands evaluate the LSTM models: first the original Arabic LSTM model, second the checkpoint model from fine-tuning, and third the fine-tuned Arabic model.
+
+```
+time tesseract --tessdata-dir ../tessdata /tmp//ara/ara.Times_New_Roman.exp0.tif out-tuned.txt -l ara
+```
+
+The above runs OCR on the tif file created during training with the original traineddata and fine-tuned traineddata.
+
+## 3.05-dev and 4.0.0-alpha for Windows
 
 An unofficial installer for Tesseract 3.05-dev for Windows is available from [Tesseract at UB Mannheim] (https://github.com/UB-Mannheim/tesseract/wiki). This includes the training tools.
 