Updated 4.0 with LSTM (markdown)

2024-11-30 23:49:05 +08:00 · 2016-12-27 17:26:46 +05:30 · d0e5fe92c6
commit d0e5fe92c6
parent 8457c79cec
1 changed file with 6 additions and 2 deletions
--- a/4.0-with-LSTM.md
+++ b/4.0-with-LSTM.md
@@ -15,6 +15,8 @@ have information about LSTM integration in Tesseract 4.0.

 ## Fine-tuning a LSTM model - example using Arabic

+Please read [TrainingTesseract 4.00](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00) before trying the following.
+
 Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories in the following directory structure.

 ```
@@ -26,7 +28,9 @@ Have copies of the 4.0.0 alpha langdata, tessdata and tesseract-ocr repositories
 ```

 Make a copy of English and Arabic 4.0.0alpha traineddata files in ./tesseract-ocr/tessdata
+
 Check that lstm.train is available under configs.
+
 Setup appropriate TESSDATA_PREFIX directory.
 ```
 cp ./tessdata/eng.traineddata ./tesseract-ocr/tessdata
@@ -44,7 +48,7 @@ training/tesstrain.sh --fonts_dir /usr/share/fonts --lang ara  --linedata_only \
  --fontlist "Times New Roman," \
  --output_dir ~/tesstutorial/aratest
 ```
-This creates the .lstmf files in the output directory. The box/tiff pairs are created in a /tmp/../ara/ directory and are not copied to the output directory.
+This creates the .lstmf files in the output directory using the given training_text. The box/tiff pairs are created in a /tmp/<tmpdir>/ara/ directory and are not copied to the output directory.
 ```
 mkdir -p ~/tesstutorial/aratuned_from_ara 

@@ -71,7 +75,7 @@ combine_tessdata -o ./tessdata/ara.traineddata \
  ~/tesstutorial/aratest/ara.lstm-punc-dawg \
  ~/tesstutorial/aratest/ara.lstm-word-dawg 
 ```  
-Finally the new LSTM model and new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.
+Finally, the new LSTM model and new dawg files can be combined with the existing Arabic traineddata in ./tesseract-ocr/tessdata. The old ara.traineddata file in ./tesseract-ocr/tessdata is renamed.

 ```
 training/lstmeval --model ~/tesstutorial/aratuned_from_ara/ara.lstm \