Added section on Fine-tuning and LSTM model - example using Arabic

Shreeshrii 2016-12-27 17:23:15 +05:30
parent 79588dca58
commit 06d278ea48

@@ -13,7 +13,87 @@ Slides
[#7](https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf)
have information about LSTM integration in Tesseract 4.0.
## 3.05-dev
## Fine-tuning and LSTM model - example using Arabic
Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories in the following directory structure.
```
./langdata
./tessdata
./tesseract-ocr
./tesseract-ocr/tessdata
./tesseract-ocr/tessdata/configs/
```
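If these repositories are not already present, one way to create this layout is to clone them from GitHub; a sketch is shown below (make sure the checked-out tessdata contains the 4.0.0alpha traineddata files).
```
git clone https://github.com/tesseract-ocr/langdata.git  ./langdata
git clone https://github.com/tesseract-ocr/tessdata.git  ./tessdata
git clone https://github.com/tesseract-ocr/tesseract.git ./tesseract-ocr
```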
Make a copy of the English and Arabic 4.0.0alpha traineddata files in ./tesseract-ocr/tessdata.
Check that lstm.train is available under ./tesseract-ocr/tessdata/configs/.
Set up an appropriate TESSDATA_PREFIX directory.
```
cp ./tessdata/eng.traineddata ./tesseract-ocr/tessdata
cp ./tessdata/ara.traineddata ./tesseract-ocr/tessdata
```
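A quick sanity check and environment setup might look like the following sketch; whether TESSDATA_PREFIX should point at the tessdata directory itself or at its parent depends on the tesseract build, so adjust as needed.
```
# verify lstm.train exists (shipped in the tesseract-ocr repo under tessdata/configs)
ls ./tesseract-ocr/tessdata/configs/lstm.train
# point TESSDATA_PREFIX at the tessdata directory (or its parent, depending on the build)
export TESSDATA_PREFIX=$(pwd)/tesseract-ocr/tessdata
```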
Change to the tesseract-ocr directory and then run the following commands.
```
cd ./tesseract-ocr
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara --linedata_only \
--training_text ../langdata/ara/arabic1.txt \
--langdata_dir ../langdata --tessdata_dir ./tessdata \
--fontlist "Times New Roman," \
--output_dir ~/tesstutorial/aratest
```
This creates the .lstmf files in the output directory. The box/tiff pairs are created in a /tmp/<tmpdir>/ara/ directory and are not copied to the output directory.
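To confirm the training data was generated before moving on, the output directory can be listed; the file names in the comment are what tesstrain.sh is expected to produce here.
```
# expect ara.training_files.txt, the *.lstmf files and the ara.lstm-*-dawg files
ls ~/tesstutorial/aratest/
```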
```
mkdir -p ~/tesstutorial/aratuned_from_ara
combine_tessdata -e ../tessdata/ara.traineddata \
~/tesstutorial/aratuned_from_ara/ara.lstm
lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
--continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
--train_listfile ~/tesstutorial/aratest/ara.training_files.txt \
--target_error_rate 0.01
```
The above commands extract the existing LSTM model for Arabic from ./tessdata and fine-tune it using the .lstmf files created earlier, listed in the train_listfile.
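Fine-tuning runs until the target error rate is reached. For a shorter trial run, lstmtraining also accepts --max_iterations to cap the number of iterations; a sketch using the same paths as above (the iteration count is just an illustrative value):
```
lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
  --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --train_listfile ~/tesstutorial/aratest/ara.training_files.txt \
  --max_iterations 3600
```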
```
lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
--continue_from ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
--stop_training
```
The above command creates the new LSTM model from the fine-tuning output.
```
combine_tessdata -o ./tessdata/ara.traineddata \
~/tesstutorial/aratuned_from_ara/aratuned.lstm \
~/tesstutorial/aratest/ara.lstm-number-dawg \
~/tesstutorial/aratest/ara.lstm-punc-dawg \
~/tesstutorial/aratest/ara.lstm-word-dawg
```
Finally, the new LSTM model and the new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
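Since combine_tessdata -o writes into the existing ara.traineddata, one way to keep the old file around, as mentioned above, is to copy it aside before running the combine step (the .orig suffix is just an illustrative choice):
```
# run this before the combine_tessdata command above
cp ./tessdata/ara.traineddata ./tessdata/ara.traineddata.orig
```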
```
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/ara.lstm \
--eval_listfile ~/tesstutorial/aratest/ara.training_files.txt
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
--eval_listfile ~/tesstutorial/aratest/ara.training_files.txt
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
--eval_listfile ~/tesstutorial/aratest/ara.training_files.txt
```
The above three commands evaluate the LSTM models: first the original Arabic LSTM model, second the checkpoint model from the fine-tuning run, and third the fine-tuned Arabic model.
```
time tesseract --tessdata-dir ../tessdata /tmp/<tmpdir>/ara/ara.Times_New_Roman.exp0.tif out-4alpha -l ara
time tesseract --tessdata-dir ./tessdata /tmp/<tmpdir>/ara/ara.Times_New_Roman.exp0.tif out-tuned -l ara
```
The above commands run OCR on the tif file created during training, first with the original traineddata and then with the fine-tuned traineddata.
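The two results can then be compared directly, for example with diff (tesseract appends .txt to the output base given above):
```
diff out-4alpha.txt out-tuned.txt
```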
## 3.05-dev and 4.0.0-alpha for Windows
An unofficial installer for Tesseract 3.05-dev for Windows is available from [Tesseract at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki). This includes the training tools.