Merge pull request #17675 from zihaomu:GSoC_digit_text_detect_and_recog

commit 3547ac4b49
Alexander Alekhin, 2020-08-22 20:21:49 +03:00, committed by GitHub
7 changed files with 256 additions and 5 deletions


@@ -0,0 +1,46 @@
# How to run custom OCR model {#tutorial_dnn_OCR}
@prev_tutorial{tutorial_dnn_custom_layers}
## Introduction
In this tutorial, we first introduce how to obtain a custom OCR model, then how to transform your own OCR model so that it can be run correctly by the opencv_dnn module, and finally we provide some pre-trained models.
## Train your own OCR model
[This repository](https://github.com/zihaomu/deep-text-recognition-benchmark) is a good starting point for training your own OCR model. In this repository, MJSynth+SynthText is set as the training set by default. In addition, you can configure the model structure and the data set you want to use.
## Transform OCR model to ONNX format and Use it in OpenCV DNN
After completing the model training, please use [transform_to_onnx.py](https://github.com/zihaomu/deep-text-recognition-benchmark/blob/master/transform_to_onnx.py) to convert the model into ONNX format.
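For reference, the conversion boils down to exporting the trained PyTorch model with `torch.onnx.export`. The sketch below is only illustrative: it assumes the CRNN model definition from https://github.com/meijieru/crnn.pytorch (also used by the samples in this PR) and a 1x1x32x100 grayscale input, so adjust the model class, weights file and input shape to the model you actually trained.
@code{.py}
import torch
from models.crnn import CRNN  # model definition from the crnn.pytorch repository

# Assumed constructor arguments: 32-pixel input height, 1 channel, 37 classes, 256 hidden units
model = CRNN(32, 1, 37, 256)
model.load_state_dict(torch.load('crnn.pth', map_location='cpu'))
model.eval()

# CRNN-style recognizers expect a 1x1x32x100 grayscale tensor
dummy_input = torch.randn(1, 1, 32, 100)
torch.onnx.export(model, dummy_input, 'crnn.onnx', verbose=True)
@endcode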
#### Run the example with a webcam
The Python version of the example can be found [here](https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.py).
Example:
@code{.bash}
$ text_detection -m=[path_to_text_detect_model] -ocr=[path_to_text_recognition_model]
@endcode
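Once you have an ONNX recognition model, it can also be loaded and run directly with the dnn module. The following Python sketch is only illustrative: the file names, the 100x32 input size and the normalization constants are assumptions borrowed from the CRNN sample and should be adapted to your model.
@code{.py}
import cv2 as cv

# Hypothetical path to a converted recognition model
recognizer = cv.dnn.readNetFromONNX('crnn.onnx')

# 'roi' stands in for a cropped grayscale text region produced by the text detector
roi = cv.imread('word.png', cv.IMREAD_GRAYSCALE)

# CRNN-style models expect a normalized 100x32 grayscale blob
blob = cv.dnn.blobFromImage(roi, scalefactor=1.0 / 127.5, size=(100, 32), mean=127.5)
recognizer.setInput(blob)
scores = recognizer.forward()  # (sequence_length, batch, num_classes), decode with CTC
print(scores.shape)
@endcode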
## Pre-trained ONNX models are provided
Some pre-trained models can be found at https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing.
Their performance on different text recognition datasets is shown in the table below:
| Model name | IIIT5k(%) | SVT(%) | ICDAR03(%) | ICDAR13(%) | ICDAR15(%) | SVTP(%) | CUTE80(%) | average acc (%) | parameter( x10^6 ) |
| -------------------- | --------- | ------ | ---------- | ---------- | ---------- | ------- | --------- | --------------- | ------------------ |
| DenseNet-CTC | 72.267 | 67.39 | 82.81 | 80 | 48.38 | 49.45 | 42.50 | 63.26 | 0.24 |
| DenseNet-BiLSTM-CTC | 73.76 | 72.33 | 86.15 | 83.15 | 50.67 | 57.984 | 49.826 | 67.69 | 3.63 |
| VGG-CTC | 75.96 | 75.42 | 85.92 | 83.54 | 54.89 | 57.52 | 50.17 | 69.06 | 5.57 |
| CRNN_VGG-BiLSTM-CTC | 82.63 | 82.07 | 92.96 | 88.867 | 66.28 | 71.01 | 62.37 | 78.03 | 8.45 |
| ResNet-CTC | 84.00 | 84.08 | 92.39 | 88.96 | 67.74 | 74.73 | 67.60 | 79.93 | 44.28 |
The performance of the text recognition models was tested on OpenCV DNN, and it does not include the text detection model.
#### Model selection suggestion:
The input of the text recognition model is the output of the text detection model, so the performance of text detection greatly affects the performance of text recognition.
DenseNet-CTC has the fewest parameters and the best FPS, which makes it suitable for edge devices that are very sensitive to computational cost. If you have limited computing resources but want better accuracy, VGG-CTC is a good choice.
CRNN_VGG-BiLSTM-CTC is suitable for scenarios that require high recognition accuracy.


@@ -1,6 +1,7 @@
# Custom deep learning layers support {#tutorial_dnn_custom_layers}
@prev_tutorial{tutorial_dnn_javascript}
@next_tutorial{tutorial_dnn_OCR}
## Introduction
Deep learning is a fast growing area. The new approaches to build neural networks


@@ -70,3 +70,13 @@ Deep Neural Networks (dnn module) {#tutorial_table_of_content_dnn}
*Author:* Dmitry Kurtaev
How to define custom layers to import networks.
- @subpage tutorial_dnn_OCR
*Languages:* C++
*Compatibility:* \> OpenCV 4.3
*Author:* Zihao Mu
In this tutorial you will learn how to use the opencv_dnn module with custom OCR models.


@@ -0,0 +1,182 @@
// This example demonstrates digit recognition based on LeNet-5 and connected component analysis.
// It makes it possible for OpenCV beginners to run dnn models in real time using only the CPU.
// It reads frames from the camera in real time, makes predictions, and displays the recognized digits as overlays on top of the original digits.
//
// For a better display effect, please write the digits on white paper so that they fill the camera view.
//
// You can follow the guide below to train LeNet-5 yourself on the MNIST dataset:
// https://github.com/intel/caffe/blob/a3d5b022fe026e9092fc7abc7654b1162ab9940d/examples/mnist/readme.md
//
// You can also download an already trained model directly:
// https://github.com/zihaomu/opencv_digit_text_recognition_demo/tree/master/src
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn.hpp>
#include <iostream>
#include <vector>
using namespace cv;
using namespace cv::dnn;
const char *keys =
"{ help h | | Print help message. }"
"{ input i | | Path to input image or video file. Skip this argument to capture frames from a camera.}"
"{ device | 0 | camera device number. }"
"{ modelBin | | Path to a binary .caffemodel file contains trained network.}"
"{ modelTxt | | Path to a .prototxt file contains the model definition of trained network.}"
"{ width | 640 | Set the width of the camera }"
"{ height | 480 | Set the height of the camera }"
"{ thr | 0.7 | Confidence threshold. }";
// Find best class for the blob (i.e. class with maximal probability)
static void getMaxClass(const Mat &probBlob, int &classId, double &classProb);
void predictor(Net net, const Mat &roi, int &class_id, double &probability);
int main(int argc, char **argv)
{
// Parse command line arguments.
CommandLineParser parser(argc, argv, keys);
if (argc == 1 || parser.has("help"))
{
parser.printMessage();
return 0;
}
int vWidth = parser.get<int>("width");
int vHeight = parser.get<int>("height");
float confThreshold = parser.get<float>("thr");
std::string modelTxt = parser.get<String>("modelTxt");
std::string modelBin = parser.get<String>("modelBin");
Net net;
try
{
net = readNet(modelTxt, modelBin);
}
catch (cv::Exception &ee)
{
std::cerr << "Exception: " << ee.what() << std::endl;
std::cout << "Can't load the network by using the flowing files:" << std::endl;
std::cout << "modelTxt: " << modelTxt << std::endl;
std::cout << "modelBin: " << modelBin << std::endl;
return 1;
}
const std::string resultWinName = "Please write the numbers on white paper so that they fill the camera view.";
const std::string preWinName = "Preprocessing";
namedWindow(preWinName, WINDOW_AUTOSIZE);
namedWindow(resultWinName, WINDOW_AUTOSIZE);
Mat labels, stats, centroids;
Point position;
Rect getRectangle;
bool ifDrawingBox = false;
int classId = 0;
double probability = 0;
Rect basicRect = Rect(0, 0, vWidth, vHeight);
Mat rawImage;
double fps = 0;
// Open a video file or an image file or a camera stream.
VideoCapture cap;
if (parser.has("input"))
cap.open(parser.get<String>("input"));
else
cap.open(parser.get<int>("device"));
TickMeter tm;
while (waitKey(1) < 0)
{
cap >> rawImage;
if (rawImage.empty())
{
waitKey();
break;
}
tm.reset();
tm.start();
Mat image = rawImage.clone();
// Image preprocessing
cvtColor(image, image, COLOR_BGR2GRAY);
GaussianBlur(image, image, Size(3, 3), 2, 2);
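// Binarize and invert so that the digit strokes become white blobs on a black background,
// which is the foreground convention expected by connectedComponentsWithStats below.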
adaptiveThreshold(image, image, 255, ADAPTIVE_THRESH_MEAN_C, THRESH_BINARY, 25, 10);
bitwise_not(image, image);
Mat element = getStructuringElement(MORPH_RECT, Size(3, 3), Point(-1,-1));
dilate(image, image, element, Point(-1,-1), 1);
// Find connected component
int nccomps = cv::connectedComponentsWithStats(image, labels, stats, centroids);
for (int i = 1; i < nccomps; i++)
{
ifDrawingBox = false;
// Extend the bounding box of the connected component to make recognition easier.
// Row 0 of 'stats' describes the background, so component i is stored in row i.
if (stats.at<int>(i, CC_STAT_AREA) > 80 && stats.at<int>(i, CC_STAT_AREA) < 3000)
{
ifDrawingBox = true;
int pad = stats.at<int>(i, CC_STAT_HEIGHT) / 4;
getRectangle = Rect(stats.at<int>(i, CC_STAT_LEFT) - pad, stats.at<int>(i, CC_STAT_TOP) - pad, stats.at<int>(i, CC_STAT_WIDTH) + 2 * pad, stats.at<int>(i, CC_STAT_HEIGHT) + 2 * pad);
getRectangle &= basicRect;
}
if (ifDrawingBox && !getRectangle.empty())
{
Mat roi = image(getRectangle);
predictor(net, roi, classId, probability);
if (probability < confThreshold)
continue;
rectangle(rawImage, getRectangle, Scalar(128, 255, 128), 2);
position = Point(getRectangle.br().x - 7, getRectangle.br().y + 25);
putText(rawImage, std::to_string(classId), position, FONT_HERSHEY_COMPLEX, 1.0, Scalar(128, 128, 255), 2);
}
}
tm.stop();
fps = 1 / tm.getTimeSec();
std::string fpsString = format("Inference FPS: %.2f.", fps);
putText(rawImage, fpsString, Point(5, 20), FONT_HERSHEY_SIMPLEX, 0.6, Scalar(128, 255, 128));
imshow(resultWinName, rawImage);
imshow(preWinName, image);
}
return 0;
}
static void getMaxClass(const Mat &probBlob, int &classId, double &classProb)
{
Mat probMat = probBlob.reshape(1, 1);
Point classNumber;
minMaxLoc(probMat, NULL, &classProb, NULL, &classNumber);
classId = classNumber.x;
}
void predictor(Net net, const Mat &roi, int &classId, double &probability)
{
Mat pred;
// Convert Mat to batch of images
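// LeNet-5 trained on MNIST expects a single-channel 28x28 input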
Mat inputBlob = dnn::blobFromImage(roi, 1.0, Size(28, 28));
// Set the network input
net.setInput(inputBlob);
// Compute output
pred = net.forward();
getMaxClass(pred, classId, probability);
}


@@ -2,12 +2,16 @@
Text detection model: https://github.com/argman/EAST
Download link: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
CRNN Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
How to convert from pb to onnx:
Using classes from here: https://github.com/meijieru/crnn.pytorch/blob/master/models/crnn.py
More converted onnx text recognition models can be downloaded directly here:
Download link: https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing
And these models are taken from here: https://github.com/clovaai/deep-text-recognition-benchmark
import torch
import models.crnn as crnn
from models.crnn import CRNN
model = CRNN(32, 1, 37, 256)
model.load_state_dict(torch.load('crnn.pth'))


@@ -1,11 +1,18 @@
'''
Text detection model: https://github.com/argman/EAST
Download link: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
CRNN Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
How to convert from pb to onnx:
Using classes from here: https://github.com/meijieru/crnn.pytorch/blob/master/models/crnn.py
More converted onnx text recognition models can be downloaded directly here:
Download link: https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing
And these models are taken from here: https://github.com/clovaai/deep-text-recognition-benchmark
import torch
import models.crnn as CRNN
from models.crnn import CRNN
model = CRNN(32, 1, 37, 256)
model.load_state_dict(torch.load('crnn.pth'))
dummy_input = torch.randn(1, 1, 32, 100)
@@ -23,7 +30,8 @@ import argparse
parser = argparse.ArgumentParser(
description="Use this script to run TensorFlow implementation (https://github.com/argman/EAST) of "
"EAST: An Efficient and Accurate Scene Text Detector (https://arxiv.org/abs/1704.03155v2)"
"The OCR model can be obtained from converting the pretrained CRNN model to .onnx format from the github repository https://github.com/meijieru/crnn.pytorch")
"The OCR model can be obtained from converting the pretrained CRNN model to .onnx format from the github repository https://github.com/meijieru/crnn.pytorch"
"Or you can download trained OCR model directly from https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing")
parser.add_argument('--input',
help='Path to input image or video file. Skip this argument to capture frames from a camera.')
parser.add_argument('--model', '-m', required=True,