Updated 4.0 with LSTM (markdown)

Shreeshrii 2016-12-27 17:26:46 +05:30
parent 8457c79cec
commit d0e5fe92c6

@@ -15,6 +15,8 @@ have information about LSTM integration in Tesseract 4.0.
## Fine-tuning an LSTM model - example using Arabic
Please read [TrainingTesseract 4.00](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) before trying the following.
Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories in the following directory structure.
```
@@ -26,7 +28,9 @@ Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories
```
Copy the English and Arabic 4.0.0 alpha traineddata files into ./tesseract-ocr/tessdata.
Check that lstm.train is available under configs.
Set TESSDATA_PREFIX to the appropriate directory.
```
cp ./tessdata/eng.traineddata ./tesseract-ocr/tessdata
@@ -44,7 +48,7 @@ training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara --linedata_only \
--fontlist "Times New Roman," \
--output_dir ~/tesstutorial/aratest
```
-This creates the .lstmf files in the output directory. The box/tiff pairs are created in a /tmp/../ara/ directory and are not copied to the output directory.
+This creates the .lstmf files in the output directory using the given training_text. The box/tiff pairs are created in a /tmp/<tmpdir>/ara/ directory and are not copied to the output directory.
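To confirm what the run produced, a minimal sketch along these lines could count the .lstmf files in the output directory and look for the temporary box/tiff directory. The helper name `count_lstmf`, the `OUTPUT_DIR` variable and the `find` search under /tmp are illustrative assumptions, not part of tesstrain.sh. The temporary directory name is generated at run time, so the search only lists candidates.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print the number of
# .lstmf files located directly inside the directory given as $1.
count_lstmf() {
    find "$1" -maxdepth 1 -type f -name '*.lstmf' | wc -l | tr -d ' '
}

# Assumed output directory, matching --output_dir in the tesstrain.sh call.
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/tesstutorial/aratest}"

if [ -d "$OUTPUT_DIR" ]; then
    echo "lstmf files in $OUTPUT_DIR: $(count_lstmf "$OUTPUT_DIR")"
fi

# The box/tiff pairs stay under /tmp; list any directories named "ara"
# one level below it. Permission errors on unrelated entries are ignored.
find /tmp -maxdepth 2 -type d -name ara 2>/dev/null || true
```

A count of zero suggests that the tesstrain.sh run did not complete, in which case its log output is the first place to look.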
```
mkdir -p ~/tesstutorial/aratuned_from_ara
@@ -71,7 +75,7 @@ combine_tessdata -o ./tessdata/ara.traineddata \
~/tesstutorial/aratest/ara.lstm-punc-dawg \
~/tesstutorial/aratest/ara.lstm-word-dawg
```
-Finally the new LSTM model and new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
+Finally, the new LSTM model and new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
```
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/ara.lstm \