Updated UNLV Testing of Tesseract (markdown)

Shreeshrii 2018-02-26 15:16:04 +05:30
parent fa1165bece
commit f96215461a

@ -173,148 +173,3 @@ Tools to test OCR accuracy.
Of particular relevance here is the 'tessaccsummary' script. Given a directory of images with corresponding ground truth text and a .traineddata file, it OCRs each page, prints the accuracy for that page, and prints an average summary at the end.
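To make a per-page character accuracy figure concrete, here is a minimal sketch in Python. It assumes accuracy is defined as the share of ground truth characters not counted as errors, with errors measured by Levenshtein edit distance. The function names `edit_distance` and `char_accuracy` are hypothetical, and the script's own error counting may differ from this definition.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Assumed definition: 100 * (characters - errors) / characters."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return 100.0 * (len(ground_truth) - errors) / len(ground_truth)
```

Under this definition a page can score below 0% if the OCR output contains many spurious insertions, so real tools may clamp or report errors separately.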
## Hindi
A dataset of 100 scanned Hindi document images and their ground truth files is available from http://ocr.iiit.ac.in/Hindi100.html as part of the DAS 2016 work 'Multilingual OCR for Indic Scripts' by Minesh Mathew, Ajeet Singh, and C. V. Jawahar.
A modified version of the ocr-evaluation-tools (https://github.com/Shreeshrii/ocr-evaluation-tools.git) was used to process these images, giving an average character accuracy of 89.46%.
```
ocr-evaluation-tools$ bash ./tessaccsummary ../imagesjpg/hin ../ hin jpg
100: 88.25%
101: 89.75%
102: 91.21%
103: 89.76%
104: 86.75%
105: 91.16%
106: 79.34%
107: 87.65%
108: 88.76%
109: 91.86%
110: 90.03%
111: 89.59%
112: 89.69%
113: 89.57%
114: 90.64%
115: 90.72%
116: 91.21%
117: 88.18%
118: 88.54%
119: 81.04%
120: 79.68%
121: 82.42%
122: 73.18%
123: 85.80%
124: 82.14%
125: 78.73%
126: 94.29%
127: 90.51%
128: 91.72%
129: 90.66%
130: 93.42%
131: 92.03%
132: 80.75%
133: 86.38%
134: 89.38%
135: 73.36%
136: 86.19%
137: 86.69%
138: 90.56%
139: 88.41%
140: 93.81%
141: 89.50%
142: 93.79%
143: 91.04%
144: 92.99%
145: 94.13%
146: 93.69%
147: 93.17%
148: 91.04%
149: 93.34%
150: 89.79%
151: 86.45%
152: 90.85%
153: 94.79%
154: 89.32%
155: 94.25%
156: 95.87%
157: 94.82%
158: 94.71%
159: 93.28%
160: 91.84%
161: 94.02%
162: 95.26%
163: 92.18%
164: 89.46%
165: 90.92%
166: 92.65%
167: 93.05%
168: 91.90%
169: 95.63%
170: 95.03%
171: 83.53%
172: 83.31%
173: 85.96%
174: 89.36%
175: 87.15%
176: 82.92%
177: 91.13%
178: 88.28%
179: 93.64%
180: 92.60%
181: 86.59%
182: 86.99%
183: 86.17%
184: 85.01%
185: 94.27%
186: 94.74%
187: 95.10%
188: 91.65%
189: 93.81%
190: 88.26%
191: 89.37%
192: 88.91%
193: 91.49%
194: 94.64%
195: 93.06%
196: 92.49%
197: 91.27%
198: 83.87%
199: 85.31%
hin001: 82.12%
average: 89.46%
```
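With 100 pages in the listing, it can help to pick out the weakest pages programmatically. The following Python sketch parses output in the `name: NN.NN%` shape shown above and returns the lowest-scoring pages. The helper names `parse_summary` and `worst_pages` are hypothetical, and the sample holds only a few lines copied from the transcript.

```python
import re

# Matches lines such as "122: 73.18%" or "hin-kokila: 94.92%".
LINE_RE = re.compile(r"^\s*(\S+):\s*([0-9]+(?:\.[0-9]+)?)%\s*$")


def parse_summary(text):
    """Return (per-page scores, reported average) from summary output."""
    pages = {}
    average = None
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip prompts, timing lines, blank lines
        name, value = match.group(1), float(match.group(2))
        if name == "average":
            average = value
        else:
            pages[name] = value
    return pages, average


def worst_pages(pages, n=3):
    """Return the n lowest-scoring (name, score) pairs, lowest first."""
    return sorted(pages.items(), key=lambda item: item[1])[:n]


sample = """\
122: 73.18%
135: 73.36%
156: 95.87%
average: 89.46%
"""
pages, average = parse_summary(sample)
print(worst_pages(pages, 2))
```

The reported average is kept separate from the per-page scores because it may be weighted by page length rather than being a plain mean of the percentages.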
______________________________________________
A different dataset consists of a few 300 dpi greyscale images of one page of Hindi text in multiple fonts (https://github.com/Shreeshrii/imageshin).
### Character Accuracy
```
$ time bash ./tessaccsummary ../imageshin ../ hin png
hin001: 81.83%
hin-kokila: 94.92%
hin-mangal: 83.50%
hin-nirmala: 90.57%
hin-sanskrit: 86.40%
hin-utsaah: 90.21%
average: 87.90%
real 3m21.020s
user 2m51.977s
sys 0m27.498s
```
### Word Accuracy
```
$ time bash ./tessaccsummary -w ../imageshin ../ hin png
hin001: 67.33%
hin-kokila: 87.38%
hin-mangal: 68.93%
hin-nirmala: 75.73%
hin-sanskrit: 73.79%
hin-utsaah: 76.21%
average: 74.89%
real 3m22.813s
user 2m53.653s
sys 0m27.620s
```
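Word accuracy is lower than character accuracy for every image above, which is expected: a single wrong character makes the whole word count as wrong. The Python sketch below illustrates this with an assumed definition, the share of ground truth words that are matched in order in the OCR output, computed with the standard library's `difflib.SequenceMatcher`. The function name `word_accuracy` is hypothetical and the `-w` option's actual algorithm may differ.

```python
from difflib import SequenceMatcher


def word_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Assumed definition: 100 * matched words / ground truth words."""
    gt_words = ground_truth.split()
    ocr_words = ocr_text.split()
    if not gt_words:
        raise ValueError("ground truth must contain at least one word")
    matcher = SequenceMatcher(None, gt_words, ocr_words, autojunk=False)
    # Sum the lengths of all in-order runs of identical words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(gt_words)


# One misread word ("brovvn") out of four words costs a quarter of the score.
print(word_accuracy("the quick brown fox", "the quick brovvn fox"))
```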