tesseract/README.md at d28631a274b97239a83ff1aaa257246a65c322d7

mirror of https://github.com/tesseract-ocr/tesseract.git synced 2024-12-05 02:47:00 +08:00

Stefan Weil cdc7db7082 unlvtests: Fix typo in documentation (found by codespell)

Signed-off-by: Stefan Weil <sw@weilnetz.de>

2018-07-06 22:02:08 +02:00

3.5 KiB


How to run UNLV tests.

The scripts in this directory make it possible to duplicate the tests published in the Fourth Annual Test of OCR Accuracy (see http://www.expervision.com/wp-content/uploads/2012/12/1995.The_Fourth_Annual_Test_of_OCR_Accuracy.pdf). First, however, you have to get the tools and data used by UNLV:

Step 1: to download the images, go to

https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/ and get doe3.3B.tar.gz, bus.3B.tar.gz, mag.3B.tar.gz and news.3B.tar.gz. The copy of spn.3B.tar.gz in that repository is incorrect, so get that one from the code.google archive instead:

mkdir -p ~/isri-downloads
cd ~/isri-downloads
curl  -L https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/bus.3B.tar.gz > bus.3B.tar.gz
curl  -L https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/doe3.3B.tar.gz > doe3.3B.tar.gz
curl  -L https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/mag.3B.tar.gz > mag.3B.tar.gz
curl  -L https://sourceforge.net/projects/isri-ocr-evaluation-tools-alt/files/news.3B.tar.gz > news.3B.tar.gz
curl  -L https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/isri-ocr-evaluation-tools/spn.3B.tar.gz > spn.3B.tar.gz
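A truncated download, or an HTML error page saved under an archive name, tends to surface only later as a confusing tar failure. The following is a minimal sketch for checking the five archives before extracting them. It is not part of the UNLV scripts; the function name check_archives is made up for illustration, and it assumes the files were saved under the names used above.

```shell
# Hypothetical helper: report which of the five archives are missing or
# cannot be listed by tar. Returns non-zero if any archive fails the check.
check_archives() {
  dir=$1
  status=0
  for name in bus.3B doe3.3B mag.3B news.3B spn.3B; do
    archive="$dir/$name.tar.gz"
    if [ ! -f "$archive" ]; then
      echo "missing: $archive"
      status=1
    elif ! tar tzf "$archive" > /dev/null 2>&1; then
      # tar could not read the file as a gzip-compressed tarball
      echo "unreadable: $archive"
      status=1
    fi
  done
  return $status
}

# Usage: check_archives ~/isri-downloads
```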

Step 2: extract the files.

It doesn't really matter where in your filesystem you put them, but they must go under a common root, for example ~/ISRI-OCRtk, so that you end up with the directories doe3.3B, bus.3B, mag.3B, news.3B and spn.3B side by side.

mkdir -p ~/ISRI-OCRtk
cd ~/ISRI-OCRtk
tar xzvf ~/isri-downloads/bus.3B.tar.gz
tar xzvf ~/isri-downloads/doe3.3B.tar.gz
tar xzvf ~/isri-downloads/mag.3B.tar.gz
tar xzvf ~/isri-downloads/news.3B.tar.gz
tar xzvf ~/isri-downloads/spn.3B.tar.gz
mkdir -p stopwords
cd stopwords
wget -O spa.stopwords.txt https://raw.githubusercontent.com/stopwords-iso/stopwords-es/master/stopwords-es.txt
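Since the test scripts are given a single root directory, it may be worth confirming that every test set actually landed under it. This is only a sketch; missing_testsets is a hypothetical name, and the directory names are taken from the archive names above.

```shell
# Hypothetical helper: print the name of each test-set directory that is
# missing under the given root. Prints nothing when all five are present.
missing_testsets() {
  root=$1
  for set_name in bus.3B doe3.3B mag.3B news.3B spn.3B; do
    [ -d "$root/$set_name" ] || echo "$set_name"
  done
}

# Usage: missing_testsets ~/ISRI-OCRtk
```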

Edit ~/ISRI-OCRtk/stopwords/spa.stopwords.txt: wordacc expects a space-delimited stopwords file, not a line-delimited one, so replace every newline with a space (s/\n/ /g).
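The newline-to-space conversion can also be done without an editor. A minimal sketch using the POSIX paste utility follows; the function name flatten_stopwords and the temporary file name are illustrative only.

```shell
# Join all lines of a stopwords file into one space-separated line.
# paste -s reads the file serially and -d ' ' uses a space as the joiner.
# Usage: flatten_stopwords input.txt output.txt
flatten_stopwords() {
  paste -s -d ' ' "$1" > "$2"
}

# Example (writes to a temporary file first, then replaces the original):
# cd ~/ISRI-OCRtk/stopwords
# flatten_stopwords spa.stopwords.txt spa.flat && mv spa.flat spa.stopwords.txt
```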

Edit ~/ISRI-OCRtk/spn.3B/pages and delete the line containing the following image name, as it crashes tesseract:

7733_005.3B 3
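If you prefer to script this edit, something along these lines should work. It is a sketch that assumes the entry appears in the pages file exactly as shown above, with no extra whitespace; drop_page is a hypothetical helper name.

```shell
# Hypothetical helper: drop one exact line from a pages file,
# keeping the original as <file>.orig.
# Usage: drop_page ~/ISRI-OCRtk/spn.3B/pages '7733_005.3B 3'
drop_page() {
  pages=$1
  entry=$2
  cp "$pages" "$pages.orig"
  # -F: fixed string, -x: match the whole line, -v: keep all other lines.
  # grep exits 1 when it prints nothing, so only a status above 1 is an error.
  grep -v -x -F -- "$entry" "$pages.orig" > "$pages" || [ $? -eq 1 ]
}
```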

Step 3: Download the modified ISRI toolkit, then build and install the tools:

These will be installed in /usr/local/bin.

git clone https://github.com/Shreeshrii/ocr-evaluation-tools.git ~/ocr-evaluation-tools
cd ~/ocr-evaluation-tools
sudo make install
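Before starting a long test run, you may want to confirm that the installed tools are actually found on your PATH. The helper below is a sketch with a made-up name (missing_tools); of the tool names in the usage line, wordacc is mentioned in this README, while accuracy is assumed to be installed by the same toolkit.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
# Prints nothing when every command is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# Usage (tool names other than wordacc are assumptions):
# missing_tools wordacc accuracy
```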

Step 4: cd back to your main tesseract-ocr directory and build tesseract.

Step 5: run unlvtests/runalltests.sh with three arguments: the root ISRI data dir, the test name and the tessdata dir:

unlvtests/runalltests.sh ~/ISRI-OCRtk 4_fast_eng ../tessdata_fast

and go to the gym, have lunch etc. It takes a while to run.

Step 6: There should be a RELEASE.summary file, where RELEASE is the test name given in Step 5 (for example unlvtests/reports/4-beta_fast.summary for a test named 4-beta_fast). It contains the final summarized accuracy report and a comparison with the 1995 results.

Step 7: run the test for Spanish.

unlvtests/runalltests_spa.sh ~/ISRI-OCRtk 4_fast_spa ../tessdata_fast

Notes from Nick White regarding wordacc

If you just want to remove all lines which have 100% recognition, you can add an awk command like this:

ocrevalutf8 wordacc ground.txt ocr.txt | awk '$3 != 100 {print $0}' > results.txt

or if you've already got a results file you want to change, you can do this:

awk '$3 != 100 {print $0}' results.txt > newresults.txt

If you only want the last sections where things are broken down by word, you can add a sed command, like this:

ocrevalutf8 wordacc ground.txt ocr.txt | sed '/^ *Count *Missed *%Right *$/,$!d' | awk '$3 != 100 {print $0}' > results.txt
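To see what the two filters do without running a full evaluation, they can be wrapped in a small function and fed a few lines of text. This is only a sketch: filter_word_report is a made-up name, and the header pattern assumes the per-word section starts with a line containing just the Count, Missed and %Right column titles, which may not match the exact spacing of real wordacc output.

```shell
# Hypothetical wrapper around the sed and awk filters discussed above.
# sed keeps everything from the first "Count Missed %Right" header line to
# the end of input; awk then drops rows whose third column equals 100.
filter_word_report() {
  sed '/^ *Count *Missed *%Right *$/,$!d' | awk '$3 != 100 {print $0}'
}

# Usage: ocrevalutf8 wordacc ground.txt ocr.txt | filter_word_report > results.txt
```

The header line itself survives the awk step because its third field is the text %Right rather than the number 100.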
