mirror of https://github.com/tesseract-ocr/tesseract.git, synced 2024-11-30 23:49:05 +08:00
Added section on Fine-tuning and LSTM model - example using Arabic
parent 79588dca58
commit 06d278ea48
@ -13,7 +13,87 @@ Slides
[#7](https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf)
have information about LSTM integration in Tesseract 4.0.

## 3.05-dev
## Fine-tuning an LSTM model - example using Arabic
Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories in the following directory structure.

```
./langdata
./tessdata
./tesseract-ocr
./tesseract-ocr/tessdata
./tesseract-ocr/tessdata/configs/
```
Make a copy of the English and Arabic 4.0.0 alpha traineddata files in ./tesseract-ocr/tessdata.
Check that lstm.train is available under configs.
Set up the appropriate TESSDATA_PREFIX directory.
```
cp ./tessdata/eng.traineddata ./tesseract-ocr/tessdata
cp ./tessdata/ara.traineddata ./tesseract-ocr/tessdata
```
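Before going further, it can help to confirm that the files named above are where the later commands expect them. The following is a minimal sketch and not part of the original instructions: the function name `check_layout` is hypothetical, and it assumes it is given the directory that contains `langdata`, `tessdata` and `tesseract-ocr`.

```shell
# Hypothetical helper (not part of Tesseract): report which of the files
# named above are missing under the given top-level directory.
check_layout() {
  root="${1:-.}"
  missing=0
  for f in \
    "$root/tesseract-ocr/tessdata/eng.traineddata" \
    "$root/tesseract-ocr/tessdata/ara.traineddata" \
    "$root/tesseract-ocr/tessdata/configs/lstm.train"
  do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every file was found.
  [ "$missing" -eq 0 ]
}

# Example: run from the directory that holds the three repositories.
# check_layout . && echo "layout looks complete"
```

The function only checks for the presence of files; it does not set TESSDATA_PREFIX, since the right value depends on the local installation.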
Change to the tesseract-ocr directory and then run the following commands.

```
cd ./tesseract-ocr

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara --linedata_only \
  --training_text ../langdata/ara/arabic1.txt \
  --langdata_dir ../langdata --tessdata_dir ./tessdata \
  --fontlist "Times New Roman," \
  --output_dir ~/tesstutorial/aratest
```
This creates the .lstmf files in the output directory. The box/tiff pairs are created in a /tmp/../ara/ directory and are not copied to the output directory.
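Since the fine-tuning step reads the .lstmf files through a list file, a quick check that every listed file exists can catch a failed run early. This is a sketch under assumptions: the helper name `check_lstmf_list` is hypothetical, and the list file is assumed to hold one .lstmf path per line.

```shell
# Hypothetical helper: confirm that every .lstmf file named in the list file
# exists. Assumes the list file holds one path per line.
check_lstmf_list() {
  listfile="$1"
  total=0
  missing=0
  while IFS= read -r path || [ -n "$path" ]; do
    [ -n "$path" ] || continue          # skip blank lines
    total=$((total + 1))
    if [ ! -f "$path" ]; then
      echo "missing: $path"
      missing=$((missing + 1))
    fi
  done < "$listfile"
  echo "$total listed, $missing missing"
  # Succeed only when at least one file is listed and none are missing.
  [ "$total" -gt 0 ] && [ "$missing" -eq 0 ]
}

# Example, using the output directory from the command above:
# check_lstmf_list ~/tesstutorial/aratest/ara.training_files.txt
```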
```
mkdir -p ~/tesstutorial/aratuned_from_ara

combine_tessdata -e ../tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/ara.lstm

lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
  --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --train_listfile ~/tesstutorial/aratest/ara.training_files.txt \
  --target_error_rate 0.01
```
The above commands extract the existing LSTM model for Arabic from ./tessdata and fine-tune it using the .lstmf files created earlier, which are listed in the file passed as --train_listfile.
```
lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
  --continue_from ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
  --stop_training
```
The above command creates the new LSTM model from the fine-tuning output.
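The conversion step depends on the checkpoint file written by the earlier training run. A small guard, sketched here under assumptions, makes that dependency explicit: the wrapper name `finalize_model` and the `LSTMTRAINING` override variable are hypothetical, while the three flags are the ones used in the command above.

```shell
# Hypothetical wrapper: convert a checkpoint into a final model only if the
# checkpoint file is present. LSTMTRAINING defaults to the binary on PATH
# and can be overridden.
finalize_model() {
  checkpoint="$1"
  output="$2"
  if [ ! -f "$checkpoint" ]; then
    echo "checkpoint not found: $checkpoint" >&2
    return 1
  fi
  "${LSTMTRAINING:-lstmtraining}" --model_output "$output" \
    --continue_from "$checkpoint" \
    --stop_training
}

# Example:
# finalize_model ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
#   ~/tesstutorial/aratuned_from_ara/aratuned.lstm
```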
```
combine_tessdata -o ./tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
  ~/tesstutorial/aratest/ara.lstm-number-dawg \
  ~/tesstutorial/aratest/ara.lstm-punc-dawg \
  ~/tesstutorial/aratest/ara.lstm-word-dawg
```
Finally, the above command combines the new LSTM model and the new dawg files with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
```
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt

training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned_checkpoint \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt

training/lstmeval --model ~/tesstutorial/aratuned_from_ara/aratuned.lstm \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt
```
The above three commands evaluate the LSTM models: the first uses the original Arabic LSTM model, the second the checkpoint written during fine-tuning, and the third the fine-tuned Arabic model.
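The three evaluations differ only in the model path, so they can be written as a loop. This is a sketch rather than a prescribed workflow: the helper name `eval_models` and the `LSTMEVAL` override variable are hypothetical, while `--model` and `--eval_listfile` are the flags used above.

```shell
# Hypothetical helper: evaluate several models against the same list file.
# LSTMEVAL defaults to the path used above and can be overridden.
eval_models() {
  listfile="$1"
  shift
  for model in "$@"; do
    echo "== $model"
    "${LSTMEVAL:-training/lstmeval}" --model "$model" \
      --eval_listfile "$listfile"
  done
}

# Example, run from the tesseract-ocr directory:
# d=~/tesstutorial/aratuned_from_ara
# eval_models ~/tesstutorial/aratest/ara.training_files.txt \
#   "$d/ara.lstm" "$d/aratuned_checkpoint" "$d/aratuned.lstm"
```

Because the list file here is the same one used for training, the reported error rates describe how well each model fits the training data rather than how it performs on unseen text.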
```
time tesseract --tessdata-dir ../tessdata /tmp/<tmpdir>/ara/ara.Times_New_Roman.exp0.tif out-4alpha.txt -l ara
time tesseract --tessdata-dir ./tessdata /tmp/<tmpdir>/ara/ara.Times_New_Roman.exp0.tif out-tuned.txt -l ara
```
The above commands run OCR on the tif file created during training, first with the original traineddata and then with the fine-tuned traineddata.
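To see whether fine-tuning changed the recognized text, the two output files can be compared. The helper below is a minimal sketch with a hypothetical name, `compare_outputs`. Note that tesseract treats the output argument as a base name and appends `.txt`, so the commands above are expected to write `out-4alpha.txt.txt` and `out-tuned.txt.txt`.

```shell
# Hypothetical helper: summarise how two OCR output files differ.
compare_outputs() {
  a="$1"
  b="$2"
  if cmp -s "$a" "$b"; then
    echo "identical"
  else
    # Count lines that appear on only one side of the diff.
    changed=$(diff "$a" "$b" | grep -c '^[<>]')
    echo "differ: $changed changed lines"
  fi
}

# Example, using the file names the commands above are expected to produce:
# compare_outputs out-4alpha.txt.txt out-tuned.txt.txt
```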
## 3.05-dev and 4.0.0-alpha for Windows

An unofficial installer for Tesseract 3.05-dev for Windows is available from [Tesseract at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki). This includes the training tools.