Updated 4.0 with LSTM (markdown)

Shreeshrii 2019-09-11 18:41:05 +05:30
parent 585ae54e57
commit 10f69d90ba

@@ -1,10 +1,11 @@
-## 4.0
+## 4.0 +
-Tesseract 4.0 **rc** source code is available in the 'master' branch of the [repository](https://github.com/tesseract-ocr/tesseract). It adds a new OCR engine based on LSTM neural networks. It initially works (well) on x86/Linux. Model data for 101 languages is available in the [tessdata repository](https://github.com/tesseract-ocr/tessdata).
+Tesseract 4.0 **+** source code is available in the 'master' branch of the [repository](https://github.com/tesseract-ocr/tesseract). It adds a new OCR engine based on LSTM neural networks. It initially works (well) on x86/Linux. Model data for 101 languages is available in [tessdata](https://github.com/tesseract-ocr/tessdata), [tessdata_best](https://github.com/tesseract-ocr/tessdata_best), [tessdata_fast](https://github.com/tesseract-ocr/tessdata_fast) repositories.
 ## Documentation
 * [NeuralNetsInTesseract4.00](NeuralNetsInTesseract4.00)
 * [VGSLSpecs](https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs)
+* [VGSLSpecs info from Tensorflow](https://github.com/mldbai/tensorflow-models/blob/master/street/g3doc/vgslspecs.md)
 * [DAS 2016 tutorial slides](https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016)
 Slides
 [#2](https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/2ArchitectureAndDataStructures.pdf),
@@ -18,11 +19,13 @@ have information about LSTM integration in Tesseract 4.0.
 * [TrainingTesseract 4.00](https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00)
-3.0 version of box files can be converted for use with LSTM training by adding a tab character at end of each line and boxes with space after each word. `Mark EOL` and `Mark EOL Bulk` functions under `Edit` in `Box Editor` tab of latest version of [jTessBoxEditor - jTessBoxEditor-2.0-Beta](https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/) can be used to add the EOL tabs automatically. Insert mode can be used on last letter of each word to add a box with space. There is no automated way to do this.
-## 4.0.0-alpha ppa
-Unofficial Ubuntu PPAs for Tesseract 4.00 & Leptonica 1.74:
+* [tess4training - LSTM Training Tutorial for Tesseract 4](https://github.com/Shreeshrii/tess4training)
+* [tesstrain - formerly ocrd-train](https://github.com/tesseract-ocr/tesstrain)
+## 4.x ppa
+Ubuntu PPAs for Tesseract 4.x & Leptonica 1.7x:
 * https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
 Leptonica 1.74.1 package for Debian: