mirror of https://github.com/tesseract-ocr/tesseract.git, synced 2024-11-30 23:49:05 +08:00
Updated 4.0 with LSTM (markdown)
parent 8457c79cec
commit d0e5fe92c6
@@ -15,6 +15,8 @@ have information about LSTM integration in Tesseract 4.0.

## Fine-tuning a LSTM model - example using Arabic

Please read [TrainingTesseract 4.00](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) before trying the following.

Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories in the following directory structure.

```
@@ -26,7 +28,9 @@ Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories
```

Make a copy of English and Arabic 4.0.0alpha traineddata files in ./tesseract-ocr/tessdata

Check that lstm.train is available under configs.

Setup appropriate TESSDATA_PREFIX directory.
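
The check for lstm.train can be scripted. What follows is a minimal sketch, assuming the `configs` directory sits directly inside the tessdata directory. The function name `has_lstm_config` is hypothetical. The correct value of TESSDATA_PREFIX depends on your build, so the sketch takes the directory as an argument instead of guessing it.

```shell
# Hypothetical helper: report whether lstm.train exists under <tessdata_dir>/configs.
# Assumes the configs directory sits directly inside the tessdata directory.
has_lstm_config() {
    tessdata_dir="$1"
    if [ -f "$tessdata_dir/configs/lstm.train" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# Example call against the layout used in this walkthrough.
has_lstm_config ./tesseract-ocr/tessdata
```

If the helper prints `missing`, the training config is probably not where this sketch expects it, and the later training steps may fail.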
```
cp ./tessdata/eng.traineddata ./tesseract-ocr/tessdata
@@ -44,7 +48,7 @@ training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara --linedata_only \
--fontlist "Times New Roman," \
--output_dir ~/tesstutorial/aratest
```
-This creates the .lstmf files in the output directory. The box/tiff pairs are created in a /tmp/../ara/ directory and are not copied to the output directory.
+This creates the .lstmf files in the output directory using the given training_text. The box/tiff pairs are created in a /tmp/<tmpdir>/ara/ directory and are not copied to the output directory.
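
One quick way to confirm that the step produced training data is to count the .lstmf files in the output directory. This is a sketch only: `count_lstmf` is a hypothetical helper, and it relies on `find -maxdepth`, which GNU and BSD find provide but POSIX does not require.

```shell
# Hypothetical helper: count the .lstmf files directly inside a directory.
count_lstmf() {
    find "$1" -maxdepth 1 -type f -name '*.lstmf' | wc -l | tr -d ' '
}

# Example: the output directory passed to tesstrain.sh in this walkthrough.
out_dir="$HOME/tesstutorial/aratest"
if [ -d "$out_dir" ]; then
    echo "$(count_lstmf "$out_dir") .lstmf file(s) in $out_dir"
fi
```

A count of zero would suggest that rendering failed, for example because the requested font was not found.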
```
mkdir -p ~/tesstutorial/aratuned_from_ara

@@ -71,7 +75,7 @@ combine_tessdata -o ./tessdata/ara.traineddata \
~/tesstutorial/aratest/ara.lstm-punc-dawg \
~/tesstutorial/aratest/ara.lstm-word-dawg
```
-Finally the new LSTM model and new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
+Finally, the new LSTM model and new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
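
Renaming the old file can be done by hand or with a small helper. The sketch below is illustrative: `backup_traineddata` and the `.orig` suffix are assumptions, since the text does not say what the old file is renamed to.

```shell
# Hypothetical helper: rename an existing traineddata file out of the way.
# The ".orig" suffix is an assumption; the walkthrough does not name the backup.
backup_traineddata() {
    src="$1"
    if [ -f "$src" ]; then
        mv "$src" "$src.orig"
        echo "renamed to $src.orig"
    else
        echo "nothing to rename"
    fi
}

backup_traineddata ./tesseract-ocr/tessdata/ara.traineddata
```

Keeping the original under a different name makes it easy to compare the tuned model against the stock one later.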
```
training/lstmeval --model ~/tesstutorial/aratuned_from_ara/ara.lstm \