Updated UNLV Testing of Tesseract (markdown)

2024-11-27 20:59:36 +08:00 · 2018-02-26 15:16:04 +05:30 · f96215461a
commit f96215461a
parent fa1165bece
1 changed file with 0 additions and 145 deletions
--- a/UNLV-Testing-of-Tesseract.md
+++ b/UNLV-Testing-of-Tesseract.md
@@ -173,148 +173,3 @@ Tools to test OCR accuracy.

 Of particular relevance here is the 'tessaccsummary' script, which when given a directory of images and corresponding ground truth text and a .traineddata file will OCR each page and print the accuracy, and an average summary at the end. 

-## Hindi
-
-A dataset of 100 scanned Hindi document images and their ground truth files is available from http://ocr.iiit.ac.in/Hindi100.html as part of the DAS 2016 work 'Multilingual OCR for Indic Scripts' by Minesh Mathew, Ajeet Singh, and C. V. Jawahar.
-
-A modified version of ocr-evaluation-tools (https://github.com/Shreeshrii/ocr-evaluation-tools.git) was used to process these images, giving an average character accuracy of about 89%.
-
-```
-ocr-evaluation-tools$ bash ./tessaccsummary ../imagesjpg/hin ../ hin  jpg
-100: 88.25%
-101: 89.75%
-102: 91.21%
-103: 89.76%
-104: 86.75%
-105: 91.16%
-106: 79.34%
-107: 87.65%
-108: 88.76%
-109: 91.86%
-110: 90.03%
-111: 89.59%
-112: 89.69%
-113: 89.57%
-114: 90.64%
-115: 90.72%
-116: 91.21%
-117: 88.18%
-118: 88.54%
-119: 81.04%
-120: 79.68%
-121: 82.42%
-122: 73.18%
-123: 85.80%
-124: 82.14%
-125: 78.73%
-126: 94.29%
-127: 90.51%
-128: 91.72%
-129: 90.66%
-130: 93.42%
-131: 92.03%
-132: 80.75%
-133: 86.38%
-134: 89.38%
-135: 73.36%
-136: 86.19%
-137: 86.69%
-138: 90.56%
-139: 88.41%
-140: 93.81%
-141: 89.50%
-142: 93.79%
-143: 91.04%
-144: 92.99%
-145: 94.13%
-146: 93.69%
-147: 93.17%
-148: 91.04%
-149: 93.34%
-150: 89.79%
-151: 86.45%
-152: 90.85%
-153: 94.79%
-154: 89.32%
-155: 94.25%
-156: 95.87%
-157: 94.82%
-158: 94.71%
-159: 93.28%
-160: 91.84%
-161: 94.02%
-162: 95.26%
-163: 92.18%
-164: 89.46%
-165: 90.92%
-166: 92.65%
-167: 93.05%
-168: 91.90%
-169: 95.63%
-170: 95.03%
-171: 83.53%
-172: 83.31%
-173: 85.96%
-174: 89.36%
-175: 87.15%
-176: 82.92%
-177: 91.13%
-178: 88.28%
-179: 93.64%
-180: 92.60%
-181: 86.59%
-182: 86.99%
-183: 86.17%
-184: 85.01%
-185: 94.27%
-186: 94.74%
-187: 95.10%
-188: 91.65%
-189: 93.81%
-190: 88.26%
-191: 89.37%
-192: 88.91%
-193: 91.49%
-194: 94.64%
-195: 93.06%
-196: 92.49%
-197: 91.27%
-198: 83.87%
-199: 85.31%
-hin001: 82.12%
-average: 89.46%
-```
-______________________________________________
-A different dataset, with a few 300 dpi greyscale images of one page of Hindi text in multiple fonts (https://github.com/Shreeshrii/imageshin), was also tested.
-
-### Character Accuracy
-
-```
-$ time bash ./tessaccsummary ../imageshin ../ hin png
-hin001: 81.83%
-hin-kokila: 94.92%
-hin-mangal: 83.50%
-hin-nirmala: 90.57%
-hin-sanskrit: 86.40%
-hin-utsaah: 90.21%
-average: 87.90%
-
-real    3m21.020s
-user    2m51.977s
-sys     0m27.498s
-```
-### Word Accuracy
-```
-$ time bash ./tessaccsummary -w ../imageshin ../ hin png
-hin001: 67.33%
-hin-kokila: 87.38%
-hin-mangal: 68.93%
-hin-nirmala: 75.73%
-hin-sanskrit: 73.79%
-hin-utsaah: 76.21%
-average: 74.89%
-
-real    3m22.813s
-user    2m53.653s
-sys     0m27.620s
-```
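
The per-page figures in the listings above can be cross-checked against the reported `average` rows. The following is a minimal sketch, assuming the summary is the unweighted mean of the per-page percentages; how `tessaccsummary` actually computes it is not shown here. `parse_accuracies` and `mean_accuracy` are hypothetical helper names, and the line pattern is inferred from the listings rather than from the script.

```python
import re

# Assumed line shape, inferred from the listings above: "<page name>: <percent>%".
LINE_RE = re.compile(r"^(?P<name>[^:\s]+):\s*(?P<pct>\d+(?:\.\d+)?)%$")


def parse_accuracies(listing):
    """Return {page name: accuracy in percent}, skipping the 'average' row."""
    pages = {}
    for raw in listing.splitlines():
        match = LINE_RE.match(raw.strip())
        if match and match.group("name") != "average":
            pages[match.group("name")] = float(match.group("pct"))
    return pages


def mean_accuracy(pages):
    """Unweighted mean of the per-page percentages (an assumption, see above)."""
    return sum(pages.values()) / len(pages)


# Character-accuracy listing for the multi-font dataset, copied from above.
listing = """\
hin001: 81.83%
hin-kokila: 94.92%
hin-mangal: 83.50%
hin-nirmala: 90.57%
hin-sanskrit: 86.40%
hin-utsaah: 90.21%
average: 87.90%
"""

pages = parse_accuracies(listing)
print(f"{len(pages)} pages, mean {mean_accuracy(pages):.3f}%")
```

For the six-font character listing, the unweighted mean comes to about 87.9%, in line with the reported average. A character-weighted average (total correct characters over total characters) would need the per-page character counts, which the listings do not include.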