Updated TrainingTesseract 4.00 (markdown)

2024-12-03 00:49:01 +08:00 · 2019-08-26 13:28:23 +05:30 · 2019-08-26 13:28:23 +05:30 · 585ae54e57
commit 585ae54e57
parent c394eacbc8
1 changed file with 15 additions and 1 deletion
--- a/TrainingTesseract-4.00.md
+++ b/TrainingTesseract-4.00.md
@@ -564,6 +564,9 @@ In either case, the required format is still the tiff/box file
 pair, except that the boxes only need to cover a textline instead of individual
 characters.

+If you use `tesstrain.sh`, the required `synthetic` training data (box/tiff pairs and
+lstmf files) is created from the training text and the given list of fonts.
+
 ### Making Box Files

 Multiple formats of box files are accepted by Tesseract 4 for LSTM training, 
@@ -641,6 +644,9 @@ language being used) and optional word list files. It creates the `lstm-recoder`
 from the `input_unicharset` and creates all the dawgs, if wordlists are provided, 
 putting everything together into a `traineddata` file.

+If you use `tesstrain.sh`, the starter traineddata is also created, along with the
+`synthetic` training data (box/tiff pairs and lstmf files),
+from the training text and the given list of fonts.
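
As a rough illustration of the `tesstrain.sh` route described above, the sketch below assembles one possible invocation. The flags follow the `tesstrain.sh` examples in the Tesseract training documentation; every path (fonts, langdata, tessdata, output directory) and the `run_or_print` helper are placeholders assumed for this example, not values taken from this change. By default the script only prints the command; set `RUN=1` to actually execute it.

```shell
#!/bin/sh
# Sketch only: all paths below are assumptions, adjust them to your checkout.
LANG_CODE="eng"
FONTS_DIR="/usr/share/fonts"
LANGDATA_DIR="../langdata"
TESSDATA_DIR="./tessdata"
OUTPUT_DIR="${HOME:-/tmp}/tesstutorial/engtrain"

# Hypothetical helper: print the command unless RUN=1 is set, then execute it.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# --linedata_only asks for line-level box/tiff pairs and lstmf files,
# which is what LSTM training consumes.
run_or_print src/training/tesstrain.sh \
  --fonts_dir "$FONTS_DIR" \
  --lang "$LANG_CODE" \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir "$LANGDATA_DIR" \
  --tessdata_dir "$TESSDATA_DIR" \
  --output_dir "$OUTPUT_DIR"
```

Printing first makes it easy to check the paths before a long rendering run starts.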


 ### Training From Scratch
@@ -754,7 +760,7 @@ tuning...

 ### Fine Tuning for Impact

-Please note that Fine Tuning can be done ONLY by using `float` models as the base to
+Please note that Fine Tuning for Impact can be done ONLY by using `float` models as the base to
 continue from, i.e. use the traineddata files from the [tessdata_best](https://github.com/tesseract-ocr/tessdata_best) repo,
 not from [tessdata](https://github.com/tesseract-ocr/tessdata) or [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast).
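
To make the `float` model requirement concrete, here is a minimal sketch of the usual first step of fine tuning: extracting the LSTM model from a `tessdata_best` traineddata file with `combine_tessdata -e`. It assumes `eng.traineddata` from tessdata_best has already been saved under `tessdata/best/`; that location, the output directory, and the `run_or_print` helper are illustrative assumptions. The script prints the commands by default; set `RUN=1` to execute them.

```shell
#!/bin/sh
# Sketch only: paths are assumptions. BEST_MODEL must come from tessdata_best
# (float model), not from tessdata or tessdata_fast.
BEST_MODEL="tessdata/best/eng.traineddata"
WORK_DIR="${HOME:-/tmp}/tesstutorial/impact_from_full"

# Hypothetical helper: print the command unless RUN=1 is set, then execute it.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

run_or_print mkdir -p "$WORK_DIR"

# Extract the LSTM component to use as the model to continue training from.
run_or_print combine_tessdata -e "$BEST_MODEL" "$WORK_DIR/eng.lstm"
```

The extracted `eng.lstm` is the file that fine tuning then continues from.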

@@ -847,6 +853,10 @@ important however, to avoid over-fitting.

 ### Fine Tuning for ± a few characters

+Please note that Fine Tuning for ± a few characters can be done ONLY by using `float` models as the base to
+continue from, i.e. use the traineddata files from the [tessdata_best](https://github.com/tesseract-ocr/tessdata_best) repo,
+not from [tessdata](https://github.com/tesseract-ocr/tessdata) or [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast).
+
 **New feature** It is possible to add a few new characters to the character set
 and train for them by fine tuning, without a large amount of training data.

@@ -957,6 +967,10 @@ optimizers.

 ### Training Just a Few Layers

+Please note that Training Just a Few Layers can be done ONLY by using `float` models as the base to
+continue from, i.e. use the traineddata files from the [tessdata_best](https://github.com/tesseract-ocr/tessdata_best) repo,
+not from [tessdata](https://github.com/tesseract-ocr/tessdata) or [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast).
+
 Fine tuning is OK if you only want to add a new font style or need a couple of
 new characters, but what if you want to train for Klingon? You are unlikely to
 have much training data and it is unlike anything else, so what do you do? You