Arm: fix the test failure of OCL_Imgproc/CLAHETest.Accuracy on ODROID-XU4 (#11409)
* fix the test failure of OCL_Imgproc/CLAHETest.Accuracy on ODROID-XU4
* avoid the race condition in the reduce
* imgproc(ocl): simplify CLAHE code
* remove unused class
color.cpp split (#10869)
* initial split is done
* files renamed (these names are excluded during compilation)
* IPP code moved to corresponding files
* splineBuild, splineInterpolate -> color_lab.cpp
* Lab, Luv: slightly refactored
* it compiles (not yet verified to work); Lab OCL code moved to color_lab.cpp
* cvtcolor.cl: Lab/Luv part moved to color_lab.cl
* cvtcolor.cl: color_rgb.cl extracted
* cvtcolor.cl: color_yuv.cl separated
* cvtcolor.cl: color_hsv.cl extracted
* cvtcolor.cl: extracted to color_lab.cl and color_rgb.cl
* helper functions moved to hpp file
* Lab, Luv: moved to color_lab.cpp
* CPU XYZ: to color_lab.cpp
* OCL XYZ: to color_lab.cpp
* warning fixed
* CvtHelper added
* CPU YUV: to color_yuv.cpp, helpers to color.hpp
* CPU HLS/HSV: to color_hsv.cpp
* CPU BGR2BGR: to color_rgb.cpp
* CPU RGB: to color_rgb.cpp
* extra arg removed
* CPU YUV: to color_yuv.cpp
* color code decoded
* OclHelper added, some funcs rewritten
* color_lab.cpp: refactored to use OclHelper
* OCL RGB: to color_rgb.cpp
* OCL HLS/HSV: to color_hsv.cpp
* OCL YUV: to color_yuv.cpp
* OCL YUV planes: to color_yuv.cpp
* OCL: color code reduced
* licence to demosaicing.cpp
* IPP func tables to color_rgb.cpp
* code cleanup
* HAVE_OPENCL ifdefs added
* helpers made more common
* fixed two plane YUV with separate mats
* fixed warning in gcc7.2.0
* precomp header fixed
* color space classification functions fixed
* helpers fixed
* rename: isSRGB -> is_sRGB
* lab_tetra squashed
* initial version is almost written
* unfinished work
* compilation fixed, to be debugged
* Lab test removed
* more fixes
* Luv2RGBinteger: channels order fixed
* Lab structs removed
* good trilinear interpolation added
* several fixes
* removed Luv2RGB interpolations, XYZ tables; 8-cell LUT added
* no_interpolate made 8-cell
* interpolations rewritten to 8-cell, minor fixes
* packed interpolation added for RGB2Luv
* tetra implemented
* removing unnecessary code
* LUT building merged
* changes ported to color.cpp
* minor fixes; try to suppress warnings
* fixed v range of Luv
* fixed incorrect src channel number
* minor fixes
* preliminary version of Luv2RGBinteger is done
* Luv2RGB_b is in progress
* XYZ color constants converted to softfloat
* Luv test: precision fixed
* Luv bit-exactness test added
* warnings fixed
* compilation fixed, error message fixed
* Luv check is limited to [0-2,0-2,0-2] by XYZ
* L->Y generation moved to LUT
* LUTs added for up and vp of Luv2RGB_b
* still works
* fixed-point is done, works at maxerr 2
* vectorized code is done, 2x slower than original
* perf improved by 10%
* extra comments removed
* code moved to color.cpp
* test_lab.cpp updated
* minor refactoring
* test added for Luv2RGB
* OCL Luv2RGB_b: XYZ are limited to [0, 2]; docs updated
* Luv2RGB_b rewritten to universal intrinsics
* test_lab.cpp moved to luv_tetra branch
RGB2Lab_f added, bugs fixed, moved to float
several bugs fixed
LUT fixed, no switch in tetraInterpolate()
temporary code; to be removed and rewritten
before refactoring
extra interpolations removed, some things to do left
added Lab2RGB_b +XYZ version, etc.
basic version is done, to be sped up
tetra refactored
interpolations: LUT for weights, refactor., etc.
address arithm optimized
initial version of vectorized code added (not compiling now)
compilation fixed, now segfaults
a lot of fixes, vectorization temp. disabled
fixed trilinear shift size, max error dropped from 19 to 10
fixed several bugs (255 vs 256, signed vs unsigned, bIdx)
minor changes
packed: address arithmetics fixed
shorter code
experiments with pure integer calculations
Lab2RGB max error decreased to 2; need to clean the code
ready for vectorization; need cleaning
vectorized, to be debugged
precision fixed, max error is 2
Lab->XYZ shortened
minor fixes
Lab2RGB_f version fixed, to be completely rewritten using _b code
RGB2Lab_f vectorized
minors
moved to separate file
refactored Lab2RGB to float and int versions
minor fix
Lab2RGB_f vectorized
minor refactoring
Lab2RGBint refactored: process methods, vectorize by 4 pix
Lab2RGB_f int version is done
cleanup extra code
code copied to color.cpp
fixed blue idx bug
optimizations enabled when testing; mulFracConst introduced
divConst -> mulFracConst
calc min time in perf instead of avg
minors
process() slightly sped up
Lab2RGB_f: disabled int version
reinterpret added, minor fixes in names
some warnings fixed
changes transferred to color.cpp
RGB2Lab_f code (and trilinear interpolation code) moved to rgb2lab_faster
whitespace
shift negative fixed
more warnings fixed
"constant condition" warnings fixed, little speed up
minor changes
test_photo decolor fixed
changes copied to test_lab.cpp
idx bounds checking in LUT init
several fixes
WIP: softfloat almost integrated
test_lab partially rewritten to SoftFloat
color.cpp rewritten to SoftFloat
test_lab.cpp: accuracy code added
several fixes
RGB2Lab_b testing fixed
splineBuild() rewritten to SoftFloat
accuracy control improved
rounding fixed
Luv <=> RGB: rewritten to SoftFloat
OCL cvtColor Lab and Lut rewritten to SoftFloat
minor fixes
refactored to new SoftFloat interface
round() -> cvRound, etc.
fixed OCL tests
softfloat.cpp: internal functions made static, unused ones removed
meaningful constants
extra lines removed
unused function removed
unfinished work
it works, need to fix TODOs
refactoring; more calls rewritten
mulFracConst removed
constants made bit exact; minors
changes moved to color.cpp
fixed 1 bug and 4 warnings
OCL: fixed constants
pow(x, _1_3f) replaced by cubeRoot(x)
fixed compilation on MSVC32
magic constants explained
file with internal accuracy&speed tests moved to lab_tetra branch
Add a new 5x5 Gaussian blur kernel for the CV_8UC1 format.
It is 50%~70% faster than the current ocl kernel in the perf test.
Signed-off-by: Li Peng <peng.li@intel.com>
Add new OpenCL kernels for bicubic interpolation. They are 20% faster
than the current warp image kernel with bicubic interpolation.
Signed-off-by: Li Peng <peng.li@intel.com>
Add new ocl kernels for warpAffine and warpPerspective.
The average performance improvement is about 30%. The new
ocl kernels require the CV_8UC1 format and support nearest
neighbor and bilinear interpolation.
Signed-off-by: Li Peng <peng.li@intel.com>
This ocl kernel is 46%~171% faster than current laplacian 3x3
ocl kernel in the perf test, with image format "CV_8UC1".
Signed-off-by: Li Peng <peng.li@intel.com>
This ocl kernel is for 3x3 kernel size and CV_8UC1 format
It is 115% ~ 300% faster than current ocl path in perf test
python ./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_GaussianBlurFixture*
Signed-off-by: Li Peng <peng.li@intel.com>
This kernel is for CV_8UC1 format and 3x3 kernel size,
It is about 33% ~ 55% faster than current ocl kernel with below perf test
python ./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_ErodeFixture*
python ./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_DilateFixture*
Also add accuracy test cases for this kernel, the test command is
./bin/opencv_test_imgproc --gtest_filter=OCL_Filter/MorphFilter3x3*
Signed-off-by: Li Peng <peng.li@intel.com>
The optimization is for CV_8UC1 format and 3x3 box filter,
it is 15%~87% faster than current ocl kernel with below perf test
./modules/ts/misc/run.py -t imgproc --gtest_filter=OCL_BlurFixture*
Also add test cases for this ocl kernel.
Signed-off-by: Li Peng <peng.li@intel.com>
There is an issue with how the abs(short) function is processed for
negative arguments.
Affected OpenCL devices:
- iGPU: Intel(R) HD Graphics 520 (OpenCL 2.0 )
- CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz (OpenCL 2.0 (Build 10094))
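
The message above does not say how the issue was worked around. As a minimal sketch only, assuming the goal is to avoid the built-in altogether, an absolute value for 16-bit integers can be written by hand. The helper name abs_short_portable is hypothetical and is not taken from the patch; the code is plain C rather than OpenCL C so that it can be compiled on the host.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: absolute value of a 16-bit
   signed integer without calling a built-in abs(). The result is unsigned,
   mirroring OpenCL C, where abs() on a short yields a ushort. */
static uint16_t abs_short_portable(int16_t x)
{
    /* Widen to int first so that negating INT16_MIN cannot overflow. */
    int wide = x;
    return (uint16_t)(wide < 0 ? -wide : wide);
}
```

Returning an unsigned type keeps the result representable for every input, including -32768.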
Add OpenCL support to linearPolar & logPolar.
The OpenCL code uses float instead of double, so it does not require
the cl_khr_fp64 extension, at the cost of a slight loss of precision.
Add explicit conversion
Add an explicit conversion from double to float to eliminate a warning
during compilation.
See the code snippet below:
    while(l_counter != 0)
    {
        int mod = l_counter % LOCAL_TOTAL;
        int pix_per_thr = l_counter / LOCAL_TOTAL + ((lid < mod) ? 1 : 0);
        for (int i = 0; i < pix_per_thr; ++i)
        {
            int index = atomic_dec(&l_counter) - 1;
            ....
        }
        ....
        barrier(CLK_LOCAL_MEM_FENCE);
    }
If we don't put a barrier before the for loop, it is possible that some
work items enter the loop while the others have not yet done so. l_counter
is then decremented inside the for loop and may reach zero, so the other
work items may never enter the while loop. If this happens, it breaks the
barrier rule, which requires all the work items to reach the same barrier.
Depending on the implementation of the OpenCL platform, this may hang the GPU.
This issue is raised at:
https://github.com/Itseez/opencv/issues/5175
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
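
The divergence described above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: it is plain C, not the kernel; it replays a single unlucky schedule in which work item 0 runs to completion before the others look at l_counter; and the function name items_entering_loop, the sync_before_for flag, and the work-group size of 4 are hypothetical.

```c
enum { LOCAL_TOTAL = 4 };  /* hypothetical work-group size for the model */

/* Serial model of one unlucky schedule: work item 0 runs first, then 1, 2, 3.
   Returns how many work items enter the while body (and so reach its barrier).
   sync_before_for != 0 models a barrier between reading l_counter and the
   for loop, so every item decides from the same value. */
static int items_entering_loop(int l_counter, int sync_before_for)
{
    const int snapshot = l_counter;
    int entered = 0;

    for (int lid = 0; lid < LOCAL_TOTAL; ++lid) {
        /* Without the barrier, a late item sees earlier items' decrements. */
        const int seen = sync_before_for ? snapshot : l_counter;
        if (seen == 0)
            continue;               /* this item skips the while loop */
        ++entered;

        const int mod = seen % LOCAL_TOTAL;
        const int pix_per_thr = seen / LOCAL_TOTAL + ((lid < mod) ? 1 : 0);
        for (int i = 0; i < pix_per_thr && l_counter > 0; ++i)
            --l_counter;            /* stands in for atomic_dec(&l_counter) */
    }
    return entered;
}
```

With l_counter equal to 1, the unsynchronized model has one item entering the loop and three skipping it, which is the mismatch that leaves the barrier unsatisfied. The synchronized model has all four items entering.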
int pix_per_thr = l_counter / LOCAL_TOTAL + ((lid < mod) ? 1 : 0);
pix_per_thr * LOCAL_TOTAL may be larger than l_counter, so the index
into l_stack may become negative, which may cause serious problems.
Let's skip the loop when we get a negative index, and add the decrement
back to l_counter to keep it balanced and avoid a potentially negative
counter.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
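
The guard described above can be sketched on the host with C11 atomics. This is an illustration under assumptions, not the kernel code: atomic_fetch_sub stands in for OpenCL's atomic_dec (both return the value held before the decrement), and the helper name guarded_pop is hypothetical.

```c
#include <stdatomic.h>

/* Host-side stand-in for the kernel's local stack counter. */
static atomic_int l_counter;

/* Hypothetical helper: try to take one stack slot.
   Returns the slot index, or -1 when the stack was already empty. */
static int guarded_pop(void)
{
    /* Same arithmetic as atomic_dec(&l_counter) - 1 in the kernel. */
    int index = atomic_fetch_sub(&l_counter, 1) - 1;
    if (index < 0) {
        /* Undo the decrement so the counter does not stay negative. */
        atomic_fetch_add(&l_counter, 1);
        return -1;
    }
    return index;
}
```

A caller that receives -1 skips its work for that iteration. Starting from a counter of 3, five calls yield the indices 2, 1, 0 and then -1 twice, and the counter ends at 0 instead of -2.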