Make the DNN optimizations adapt to different vector lengths (VLEN) using RVV intrinsics.
* Update fastGEMM for multi VLEN.
* Update fastGEMM1T for multi VLEN.
* Update fastDepthwiseConv for multi VLEN.
* Update fastConv for multi VLEN.
* Replace malloc with cv::AutoBuffer.
Optimization of DNN using native RISC-V vector intrinsics.
* Use RVV to optimize fastGEMM (FP32) in DNN.
* Use RVV to optimize fastGEMM1T in DNN.
* Use RVV to optimize fastConv in DNN.
* Use RVV to optimize fastDepthwiseConv in DNN.
* Vectorize tails using vl.
* Use "vl" instead of scalar code to handle small blocks in fastConv.
* Fix out-of-bounds memory access in "fastGEMM1T".
* Remove setvl.
* Remove useless initialization.
* Use loop unrolling to handle tail part instead of switch.
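
The "vectorize tails using vl" idea above can be sketched in portable C++. This is only a model under stated assumptions, not the RVV code from this change: `kVlMax`, `model_setvl` and `axpy_strip_mined` are hypothetical names, and the real vector length is granted by the hardware rather than by a constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the hardware vector length, in float lanes.
// With real RVV this value is granted by the hardware, not fixed at compile time.
constexpr std::size_t kVlMax = 8;

// Simplified model of a "set vector length" request:
// grant at most kVlMax lanes for the elements that remain.
inline std::size_t model_setvl(std::size_t remaining) {
    return std::min(remaining, kVlMax);
}

// y[i] += a * x[i], strip-mined so that the final short block goes through
// the same loop body as the full blocks; there is no separate scalar tail.
inline void axpy_strip_mined(float a, const std::vector<float>& x,
                             std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ) {
        const std::size_t vl = model_setvl(n - i);
        // The inner loop stands in for one vector operation over vl lanes.
        for (std::size_t lane = 0; lane < vl; ++lane)
            y[i + lane] += a * x[i + lane];
        i += vl;
    }
}
```

With 19 elements and `kVlMax = 8`, the loop runs three times with `vl` equal to 8, 8 and 3, so the tail needs no switch and no scalar fallback.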
* added depth-wise convolution; gives ~20-30% performance improvement in MobileSSD networks
* hopefully, eliminated compile warnings, errors, as well as failure in one test
* fixed a few typos
* decreased buffer size in some cases
* added more optimal im2row branch in the case of 1x1 convolutions
* tuned fastConv to reduce the number of passes over arrays
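
For reference, a depth-wise convolution filters every channel with its own kernel and never mixes channels. The following is a naive sketch of that definition only, assuming a CHW layout, 3x3 kernels, stride 1 and no padding; `depthwise_conv3x3` is a hypothetical name and this is not the optimized code added by this change.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive depth-wise 3x3 convolution (stride 1, no padding, CHW layout).
// Channel c of the input is filtered only with kernel c.
inline std::vector<float> depthwise_conv3x3(const std::vector<float>& in,
                                            const std::vector<float>& kernels,
                                            std::size_t channels,
                                            std::size_t height,
                                            std::size_t width) {
    assert(height >= 3 && width >= 3);
    assert(in.size() == channels * height * width);
    assert(kernels.size() == channels * 9);

    const std::size_t out_h = height - 2, out_w = width - 2;
    std::vector<float> out(channels * out_h * out_w, 0.0f);

    for (std::size_t c = 0; c < channels; ++c)
        for (std::size_t y = 0; y < out_h; ++y)
            for (std::size_t x = 0; x < out_w; ++x) {
                float acc = 0.0f;
                // Accumulate the 3x3 window of channel c against kernel c.
                for (std::size_t ky = 0; ky < 3; ++ky)
                    for (std::size_t kx = 0; kx < 3; ++kx)
                        acc += in[(c * height + y + ky) * width + x + kx] *
                               kernels[c * 9 + ky * 3 + kx];
                out[(c * out_h + y) * out_w + x] = acc;
            }
    return out;
}
```

Because each output channel reads a single input channel, the work per output element is 9 multiply-adds regardless of the channel count, which is what makes a dedicated code path worthwhile compared with a general convolution.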
dnn: Fix output mismatch when forwarding a dnn model that contains [depthwise conv(group=1) + bn + prelu] (#11649)
* this makes sure the [depthwise conv(group=1) + bn + prelu] output does not shift
* add TEST to show the output mismatch in [DWconv+Prelu]
* fix typo
* init cvMat directly instead of loading an image
* build runtime model, without loading external model
* remove whitespace
* change the way the cvMat is created
* add bias_term, add target output
* fix [dwconv + prelu] value mismatch when no optimizations
* fix Test error when changing output channels
* add parametric test
* change num_output to group value
* change conv code and change test back
* Add a 512-bit codepath to the AVX512 fastConv function
This patch adds a 512-bit-wide codepath to the fastConv() function for
AVX512 use.
The basic idea is to process the first N * 16 elements of the vector
with avx512, and then run the rest of the vector using the traditional
AVX2 codepath.
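
The split described above can be sketched in portable C++. This is an illustration under stated assumptions, not the patch itself: `dot_two_tier`, `kWide` and `kNarrow` are hypothetical names, the inner loops stand in for 512-bit and 256-bit vector operations, and the final scalar loop is an addition of this sketch for lengths that are not a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical lane counts: 16 floats fit a 512-bit register, 8 fit a 256-bit one.
constexpr std::size_t kWide = 16;
constexpr std::size_t kNarrow = 8;

// Dot product split into a wide pass over the first N * 16 elements,
// a narrow pass over what is left, and a scalar pass for any leftover.
inline float dot_two_tier(const float* a, const float* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    float sum = 0.0f;
    std::size_t i = 0;

    // Wide pass: whole blocks of 16 elements (stands in for the AVX512 path).
    const std::size_t wide_end = n - n % kWide;
    for (; i < wide_end; i += kWide)
        for (std::size_t k = 0; k < kWide; ++k)
            sum += a[i + k] * b[i + k];

    // Narrow pass: whole blocks of 8 elements (stands in for the AVX2 path).
    const std::size_t narrow_end = n - n % kNarrow;
    for (; i < narrow_end; i += kNarrow)
        for (std::size_t k = 0; k < kNarrow; ++k)
            sum += a[i + k] * b[i + k];

    // Scalar leftover, fewer than 8 elements.
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

For `n = 37` the wide pass covers elements 0 to 31, the narrow pass covers nothing further (only 5 elements remain), and the scalar loop handles elements 32 to 36.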
* dnn: use unaligned AVX512 load (OpenCV aligns data on 32-byte boundary)
* dnn: change "vecsize" condition for AVX512
* dnn: fix indentation
This patch adds AVX512 optimized fastConv as well as the hookups
needed to get these called in the convolution_layer.
AVX512 fastConv is code-identical on a C level to the AVX2 one,
but is measurably faster due to AVX512 having more registers available
to cache results in.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>