This ocl kernel is 46%~171% faster than current laplacian 3x3
ocl kernel in the perf test, with image format "CV_8UC1".
Signed-off-by: Li Peng <peng.li@intel.com>
This ocl kernel is for 3x3 kernel size and CV_8UC1 format
It is 115% ~ 300% faster than current ocl path in perf test
python ./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_GaussianBlurFixture*
Signed-off-by: Li Peng <peng.li@intel.com>
This kernel is for CV_8UC1 format and 3x3 kernel size,
It is about 33% ~ 55% faster than current ocl kernel with below perf test
python ./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_ErodeFixture*
python ./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_DilateFixture*
Also add accuracy test cases for this kernel, the test command is
./bin/opencv_test_imgproc --gtest_filter=OCL_Filter/MorphFilter3x3*
Signed-off-by: Li Peng <peng.li@intel.com>
The optimization is for CV_8UC1 format and 3x3 box filter,
it is 15%~87% faster than current ocl kernel with below perf test
./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_BlurFixture*
Also add test cases for this ocl kernel.
Signed-off-by: Li Peng <peng.li@intel.com>
There is an issue with processing of abs(short) function for
negative argument.
Affected OpenCL devices:
- iGPU: Intel(R) HD Graphics 520 (OpenCL 2.0 )
- CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz (OpenCL 2.0 (Build 10094))
Add OpenCL support to linearPolar & logPolar.
The OpenCL code use float instead of double, so that it does not require
cl_khr_fp64 extension, with slight precision lost.
Add explicit conversion
Add explicit conversion from double to float to eliminate warning during
compilation.
See the below code snippet:
while(l_counter != 0)
{
int mod = l_counter % LOCAL_TOTAL;
int pix_per_thr = l_counter / LOCAL_TOTAL + ((lid < mod) ? 1 : 0);
for (int i = 0; i < pix_per_thr; ++i)
{
int index = atomic_dec(&l_counter) - 1;
....
}
....
barrier(CLK_LOCAL_MEM_FENCE);
}
If we don't put a barrier before the for loop, then there is a possiblity
that some work item enter this loop but the others are not, the the l_counter
will be reduced in the for loop and may be changed to zero, and the other
work items may can't enter the while loop. If this happens, it breaks the
barrier's rule which requires all the work items reach the same barrier.
And it may hang the GPU depends on the implementation of opencl platform.
This issue is raised at:
https://github.com/Itseez/opencv/issues/5175
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
int pix_per_thr = l_counter / LOCAL_TOTAL + ((lid < mod) ? 1 : 0);
The pix_per_thr * LOCAL_TOTAL may be larger than l_counter.
Thus the index of l_stack may be negative which may cause serious
problems. Let's skip the loop when we get negative index and we need
to add back the lcounter to keep its balance and avoid potential
negative counter.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
According to opencl 1.2 spec 6.1.5:
For arguments to a __kernel function declared to be a pointer to a
data type, the OpenCL compiler can assume that the pointee is always
appropriately aligned as required by the data type. The behavior of
an unaligned load or store is undefined, except for the
vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in
section 6.12.7.
Original code read data of type T from address not aligned by multiple
of sizeof(T), so the result is incorrect. With this patch, the cases
./opencv_perf_imgproc
--gtest_filter=OCL_ImgSize_TmplSize_Method_MatType_MatchTemplate.MatchTemplate/*
could work well with beignet 0.9.3.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>