Merge pull request #17675 from zihaomu:GSoC_digit_text_detect_and_recog

commit 3547ac4b49
Alexander Alekhin, 2020-08-22 20:21:49 +03:00, committed by GitHub
7 changed files with 256 additions and 5 deletions


@@ -0,0 +1,46 @@
# How to run custom OCR model {#tutorial_dnn_OCR}
@prev_tutorial{tutorial_dnn_custom_layers}
## Introduction
In this tutorial, we first introduce how to obtain a custom OCR model, then how to transform your own OCR model so that it can be run correctly by the opencv_dnn module, and finally we provide some pre-trained models.
## Train your own OCR model
[This repository](https://github.com/zihaomu/deep-text-recognition-benchmark) is a good starting point for training your own OCR model. In this repository, MJSynth+SynthText is set as the training set by default. In addition, you can configure the model structure and the data set you want to use.
## Transform OCR model to ONNX format and Use it in OpenCV DNN
After completing the model training, please use [transform_to_onnx.py](https://github.com/zihaomu/deep-text-recognition-benchmark/blob/master/transform_to_onnx.py) to convert the model into ONNX format.
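For reference, the conversion boils down to exporting the trained PyTorch model with `torch.onnx.export`. The sketch below is only illustrative: it assumes the CRNN model definition from https://github.com/meijieru/crnn.pytorch (also used by the samples in this PR) and a 1x1x32x100 grayscale input, so adjust the model class, weights file and input shape to the model you actually trained.
@code{.py}
import torch
from models.crnn import CRNN  # model definition from the crnn.pytorch repository

# Assumed constructor arguments: 32-pixel input height, 1 channel, 37 classes, 256 hidden units
model = CRNN(32, 1, 37, 256)
model.load_state_dict(torch.load('crnn.pth', map_location='cpu'))
model.eval()

# CRNN-style recognizers expect a 1x1x32x100 grayscale tensor
dummy_input = torch.randn(1, 1, 32, 100)
torch.onnx.export(model, dummy_input, 'crnn.onnx', verbose=True)
@endcode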
#### Run the example with a webcam
The Python version of the example can be found [here](https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.py).
Example:
@code{.bash}
$ text_detection -m=[path_to_text_detect_model] -ocr=[path_to_text_recognition_model]
@endcode
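Once you have an ONNX recognition model, it can also be loaded and run directly with the dnn module. The following Python sketch is only illustrative: the file names, the 100x32 input size and the normalization constants are assumptions borrowed from the CRNN sample and should be adapted to your model.
@code{.py}
import cv2 as cv

# Hypothetical path to a converted recognition model
recognizer = cv.dnn.readNetFromONNX('crnn.onnx')

# 'roi' stands in for a cropped grayscale text region produced by the text detector
roi = cv.imread('word.png', cv.IMREAD_GRAYSCALE)

# CRNN-style models expect a normalized 100x32 grayscale blob
blob = cv.dnn.blobFromImage(roi, scalefactor=1.0 / 127.5, size=(100, 32), mean=127.5)
recognizer.setInput(blob)
scores = recognizer.forward()  # (sequence_length, batch, num_classes), decode with CTC
print(scores.shape)
@endcode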
## Pre-trained ONNX models are provided
Some pre-trained models can be found at https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing.
Their performance on different text recognition datasets is shown in the table below:
| Model name | IIIT5k(%) | SVT(%) | ICDAR03(%) | ICDAR13(%) | ICDAR15(%) | SVTP(%) | CUTE80(%) | average acc (%) | parameter( x10^6 ) |
| -------------------- | --------- | ------ | ---------- | ---------- | ---------- | ------- | --------- | --------------- | ------------------ |
| DenseNet-CTC | 72.267 | 67.39 | 82.81 | 80 | 48.38 | 49.45 | 42.50 | 63.26 | 0.24 |
| DenseNet-BiLSTM-CTC | 73.76 | 72.33 | 86.15 | 83.15 | 50.67 | 57.984 | 49.826 | 67.69 | 3.63 |
| VGG-CTC | 75.96 | 75.42 | 85.92 | 83.54 | 54.89 | 57.52 | 50.17 | 69.06 | 5.57 |
| CRNN_VGG-BiLSTM-CTC | 82.63 | 82.07 | 92.96 | 88.867 | 66.28 | 71.01 | 62.37 | 78.03 | 8.45 |
| ResNet-CTC | 84.00 | 84.08 | 92.39 | 88.96 | 67.74 | 74.73 | 67.60 | 79.93 | 44.28 |
The performance of the text recognition models was tested on OpenCV DNN, and it does not include the text detection model.
#### Model selection suggestion:
The input of the text recognition model is the output of the text detection model, so the performance of text detection greatly affects the performance of text recognition.
DenseNet-CTC has the fewest parameters and the best FPS, which makes it suitable for edge devices that are very sensitive to computational cost. If you have limited computing resources but want better accuracy, VGG-CTC is a good choice.
CRNN_VGG-BiLSTM-CTC is suitable for scenarios that require high recognition accuracy.


@@ -1,6 +1,7 @@
# Custom deep learning layers support {#tutorial_dnn_custom_layers}
@prev_tutorial{tutorial_dnn_javascript}
@next_tutorial{tutorial_dnn_OCR}
## Introduction
Deep learning is a fast growing area. The new approaches to build neural networks


@@ -70,3 +70,13 @@ Deep Neural Networks (dnn module) {#tutorial_table_of_content_dnn}
*Author:* Dmitry Kurtaev
How to define custom layers to import networks.
- @subpage tutorial_dnn_OCR
*Languages:* C++
*Compatibility:* \> OpenCV 4.3
*Author:* Zihao Mu
In this tutorial you will learn how to use the opencv_dnn module with custom OCR models.


@@ -0,0 +1,182 @@
// This example demonstrates digit recognition based on LeNet-5 and connected component analysis.
// It makes it possible for OpenCV beginners to run dnn models in real time using only the CPU.
// It reads frames from the camera in real time, makes predictions, and displays the recognized digits as overlays on top of the original digits.
//
// For a better display effect, please write the digits on white paper so that they fill the camera view.
//
// You can follow the guide below to train LeNet-5 yourself on the MNIST dataset:
// https://github.com/intel/caffe/blob/a3d5b022fe026e9092fc7abc7654b1162ab9940d/examples/mnist/readme.md
//
// You can also download an already trained model directly:
// https://github.com/zihaomu/opencv_digit_text_recognition_demo/tree/master/src
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/dnn.hpp>
#include <iostream>
#include <vector>
using namespace cv;
using namespace cv::dnn;
const char *keys =
"{ help h | | Print help message. }"
"{ input i | | Path to input image or video file. Skip this argument to capture frames from a camera.}"
"{ device | 0 | camera device number. }"
"{ modelBin | | Path to a binary .caffemodel file contains trained network.}"
"{ modelTxt | | Path to a .prototxt file contains the model definition of trained network.}"
"{ width | 640 | Set the width of the camera }"
"{ height | 480 | Set the height of the camera }"
"{ thr | 0.7 | Confidence threshold. }";
// Find best class for the blob (i.e. class with maximal probability)
static void getMaxClass(const Mat &probBlob, int &classId, double &classProb);
void predictor(Net net, const Mat &roi, int &class_id, double &probability);
int main(int argc, char **argv)
{
// Parse command line arguments.
CommandLineParser parser(argc, argv, keys);
if (argc == 1 || parser.has("help"))
{
parser.printMessage();
return 0;
}
int vWidth = parser.get<int>("width");
int vHeight = parser.get<int>("height");
float confThreshold = parser.get<float>("thr");
std::string modelTxt = parser.get<String>("modelTxt");
std::string modelBin = parser.get<String>("modelBin");
Net net;
try
{
net = readNet(modelTxt, modelBin);
}
catch (cv::Exception &ee)
{
std::cerr << "Exception: " << ee.what() << std::endl;
std::cout << "Can't load the network by using the flowing files:" << std::endl;
std::cout << "modelTxt: " << modelTxt << std::endl;
std::cout << "modelBin: " << modelBin << std::endl;
return 1;
}
const std::string resultWinName = "Please write the numbers on white paper so that they fill the camera view.";
const std::string preWinName = "Preprocessing";
namedWindow(preWinName, WINDOW_AUTOSIZE);
namedWindow(resultWinName, WINDOW_AUTOSIZE);
Mat labels, stats, centroids;
Point position;
Rect getRectangle;
bool ifDrawingBox = false;
int classId = 0;
double probability = 0;
Rect basicRect = Rect(0, 0, vWidth, vHeight);
Mat rawImage;
double fps = 0;
// Open a video file or an image file or a camera stream.
VideoCapture cap;
if (parser.has("input"))
cap.open(parser.get<String>("input"));
else
cap.open(parser.get<int>("device"));
TickMeter tm;
while (waitKey(1) < 0)
{
cap >> rawImage;
if (rawImage.empty())
{
waitKey();
break;
}
tm.reset();
tm.start();
Mat image = rawImage.clone();
// Image preprocessing
cvtColor(image, image, COLOR_BGR2GRAY);
GaussianBlur(image, image, Size(3, 3), 2, 2);
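// Binarize and invert so that the digit strokes become white blobs on a black background,
// which is the foreground convention expected by connectedComponentsWithStats below.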
adaptiveThreshold(image, image, 255, ADAPTIVE_THRESH_MEAN_C, THRESH_BINARY, 25, 10);
bitwise_not(image, image);
Mat element = getStructuringElement(MORPH_RECT, Size(3, 3), Point(-1,-1));
dilate(image, image, element, Point(-1,-1), 1);
// Find connected component
int nccomps = cv::connectedComponentsWithStats(image, labels, stats, centroids);
for (int i = 1; i < nccomps; i++)
{
ifDrawingBox = false;
// Extend the bounding box of the connected component to make recognition easier.
// Row 0 of 'stats' describes the background, so component i is stored in row i.
if (stats.at<int>(i, CC_STAT_AREA) > 80 && stats.at<int>(i, CC_STAT_AREA) < 3000)
{
ifDrawingBox = true;
int pad = stats.at<int>(i, CC_STAT_HEIGHT) / 4;
getRectangle = Rect(stats.at<int>(i, CC_STAT_LEFT) - pad, stats.at<int>(i, CC_STAT_TOP) - pad, stats.at<int>(i, CC_STAT_WIDTH) + 2 * pad, stats.at<int>(i, CC_STAT_HEIGHT) + 2 * pad);
getRectangle &= basicRect;
}
if (ifDrawingBox && !getRectangle.empty())
{
Mat roi = image(getRectangle);
predictor(net, roi, classId, probability);
if (probability < confThreshold)
continue;
rectangle(rawImage, getRectangle, Scalar(128, 255, 128), 2);
position = Point(getRectangle.br().x - 7, getRectangle.br().y + 25);
putText(rawImage, std::to_string(classId), position, FONT_HERSHEY_COMPLEX, 1.0, Scalar(128, 128, 255), 2);
}
}
tm.stop();
fps = 1 / tm.getTimeSec();
std::string fpsString = format("Inference FPS: %.2f.", fps);
putText(rawImage, fpsString, Point(5, 20), FONT_HERSHEY_SIMPLEX, 0.6, Scalar(128, 255, 128));
imshow(resultWinName, rawImage);
imshow(preWinName, image);
}
return 0;
}
static void getMaxClass(const Mat &probBlob, int &classId, double &classProb)
{
Mat probMat = probBlob.reshape(1, 1);
Point classNumber;
minMaxLoc(probMat, NULL, &classProb, NULL, &classNumber);
classId = classNumber.x;
}
void predictor(Net net, const Mat &roi, int &classId, double &probability)
{
Mat pred;
// Convert Mat to batch of images
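// LeNet-5 trained on MNIST expects a single-channel 28x28 input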
Mat inputBlob = dnn::blobFromImage(roi, 1.0, Size(28, 28));
// Set the network input
net.setInput(inputBlob);
// Compute output
pred = net.forward();
getMaxClass(pred, classId, probability);
}


@@ -2,12 +2,16 @@
Text detection model: https://github.com/argman/EAST
Download link: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
CRNN Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
How to convert from pb to onnx:
Using classes from here: https://github.com/meijieru/crnn.pytorch/blob/master/models/crnn.py
More converted onnx text recognition models can be downloaded directly here:
Download link: https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing
And these models are taken from here: https://github.com/clovaai/deep-text-recognition-benchmark
import torch
import models.crnn as crnn
from models.crnn import CRNN
model = CRNN(32, 1, 37, 256)
model.load_state_dict(torch.load('crnn.pth'))


@@ -1,11 +1,18 @@
'''
Text detection model: https://github.com/argman/EAST
Download link: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
CRNN Text recognition model taken from here: https://github.com/meijieru/crnn.pytorch
How to convert from pb to onnx:
Using classes from here: https://github.com/meijieru/crnn.pytorch/blob/master/models/crnn.py
More converted onnx text recognition models can be downloaded directly here:
Download link: https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing
And these models are taken from here: https://github.com/clovaai/deep-text-recognition-benchmark
import torch
import models.crnn as CRNN
from models.crnn import CRNN
model = CRNN(32, 1, 37, 256)
model.load_state_dict(torch.load('crnn.pth'))
dummy_input = torch.randn(1, 1, 32, 100)
@@ -23,7 +30,8 @@ import argparse
parser = argparse.ArgumentParser(
description="Use this script to run TensorFlow implementation (https://github.com/argman/EAST) of "
"EAST: An Efficient and Accurate Scene Text Detector (https://arxiv.org/abs/1704.03155v2)"
"The OCR model can be obtained from converting the pretrained CRNN model to .onnx format from the github repository https://github.com/meijieru/crnn.pytorch")
"The OCR model can be obtained from converting the pretrained CRNN model to .onnx format from the github repository https://github.com/meijieru/crnn.pytorch"
"Or you can download trained OCR model directly from https://drive.google.com/drive/folders/1cTbQ3nuZG-EKWak6emD_s8_hHXWz7lAr?usp=sharing")
parser.add_argument('--input',
help='Path to input image or video file. Skip this argument to capture frames from a camera.')
parser.add_argument('--model', '-m', required=True,