* added performance test for compareHist
* compareHist reworked to use wide universal intrinsics
* Disabled vectorization for CV_COMP_CORREL and CV_COMP_BHATTACHARYYA if f64 is unsupported
* Added performance tests for hal::norm functions
* Added sum of absolute differences intrinsic
* norm implementation updated to use wide universal intrinsics
* improve and fix v_reduce_sad on VSX
- add infrastructure support for Power9/VSX3
- fix missing VSX flags on GCC4.9 and CLANG4(#13210, #13222)
- fix disable VSX optimzation on GCC by using flag ENABLE_VSX
- flag ENABLE_VSX is deprecated now, use CPU_BASELINE, CPU_DISPATCH instead
- add VSX3 to arithmetic dispatchable flags
* Support for Matx read/write by FileStorage
* Only empty filestorage read now produces default Matx. Split Matx IO test into smaller units. Test checks for exception thrown if reading a Mat into a Matx of different size.
* Updated boxFilter implementations to use wide universal intrinsics
* boxFilter implementation moved to separate file
* Replaced ROUNDUP macro with roundUp() function
* integrated the new C++ persistence; removed old persistence; most of OpenCV compiles fine! the tests have not been run yet
* fixed multiple bugs in the new C++ persistence
* fixed raw size of the parsed empty sequences
* [temporarily] excluded obsolete applications traincascade and createsamples from build
* fixed several compiler warnings and multiple test failures
* undo changes in cocoa window rendering (that was fixed in another PR)
* fixed more compile warnings and the remaining test failures (hopefully)
* trying to fix the last little warning
- initialize arithmetic dispatcher
- add new universal intrinsic v_absdiffs
- add new universal intrinsic v_pack_b
- add accumulate version of universal intrinsic v_round
- fix sse/avx2:uint8 multiplication overflow
- reimplement arithmetic, logic and comparison operations into wide universal intrinsics
with full support for all types
- reimplement IPP arithmetic, logic and comparison operations in a sperate file arithm_ipp.hpp
- avoid scalar multiplication if scaling factor eq 1 and use integer multiplication
- move C arithmetic operations to precomp.hpp and delete [arithm_simd|arithm_core].hpp
- add compatibility with new opencv4 divide policy
* js: update build script
- support emscipten 1.38.12 (wasm is ON by default)
- verbose build messages
* js: use builtin Math functions
* js: disable tracing code completelly
* hide use of std::mutex from /clr compilation under Visual Studio
C++11 <mutex> is not available when compiling with /clr under Visual Studio, thus opencv cannot be easily included.
It is fixed by making a CEEMutex wrapper class, around an opaque implementation using std::mutex internally
* fixed compilation outside of Visual Studio
fixed compilation outside of Visual Studio by avoiding some macros
* fixed indentation, prepare getting rid of CEEMutex/CEELockGuard
fixed indentation
After discussion, CEEMutex and CEELockGuard can be totally removed, letting the developer in a /clr context to provide his own implementation
* remove CEEMutex/CEELockGuard
Fixes for instrumentation of IPP and OCL (#12637)
* fixed warning about re-declaring variable when both IPP and instrumentation are enabled
* fixed segfault when no funName provided
* compilation fixed when both OCL and instrumentation are enabled
- avoid updating of input image during equalizeHist() call
- avoid for() with double variable (use 'int' instead)
- more CV_Check*() macros
- use Mat_<T>, Matx
- static for local variables
* Remove isIntel check from deep learning layers
* Remove fp16->fp32 fallbacks where it's not necessary
* Fix Kernel::run to prevent localsize > globalsize
* added basic support for CV_16F (the new datatype etc.). CV_USRTYPE1 is now equal to CV_16F, which may break some [rarely used] functionality. We'll see
* fixed just introduced bug in norm; reverted errorneous changes in Torch importer (need to find a better solution)
* addressed some issues found during the PR review
* restored the patch to fix some perf test failures
* may be an typo fix
* remove identical branch,may be paste error
* add parentheses around macro parameter
* simplify if condition
* check malloc fail
* change the condition of branch removed by commit 3041502861
* rewrote Mat::convertTo() and convertScaleAbs() to wide universal intrinsics; added always-available and SIMD-optimized FP16<=>FP32 conversion
* fixed compile warnings
* fix some more compile errors
* slightly relaxed accuracy threshold for int->float conversion (since we now do it using single-precision arithmetics, not double-precision)
* fixed compile errors on iOS, Android and in the baseline C++ version (intrin_cpp.hpp)
* trying to fix ARM-neon builds
* trying to fix ARM-neon builds
* trying to fix ARM-neon builds
* trying to fix ARM-neon builds
* trying to fix the custom AVX2 builder test failures (false alarms)
* fixed compile error with CPU_BASELINE=AVX2 on x86; raised tolerance thresholds in a couple of tests
* fixed compile error with CPU_BASELINE=AVX2 on x86; raised tolerance thresholds in a couple of tests
* fixed compile error with CPU_BASELINE=AVX2 on x86; raised tolerance thresholds in a couple of tests
* seemingly disabled false alarm warning in surf.cpp; increased tolerance thresholds in the tests for SolvePnP and in DNN/ENet
Intrinsics must be effective, so don't declare FP16 type/operations if there is no native support.
- CV_FP16: supports load/store into/from float32
- CV_SIMD_FP16: declares FP16 types and native FP16 operations
for some big negative values less than -INT_MAX+32767 the sign of the numbers is lost due to overflow that leads to
incorrect saturation to MAX value, instead of zero.
The issue is not reproduced with CV_ENABLED_INTRINSICS=OFF
* 1. changed static const __m128/256 to const __m128/256 to avoid wierd instructions and calls inserted by compiler.
2. added universal intrinsics that wrap MOVNTPS and other such (non-temporary or "no cache" store) instructions. v_store_interleave() and v_store() got respective flags/overloaded variants
3. rewrote split & merge to use the "no cache" store instructions. It resulted in dramatic performance improvement when processing big arrays
* hopefully, fixed some test failures where 4-channel v_store_interleave() is used
* added missing implementation of the new universal intrinsics (v_store_aligned_nocache() etc.)
* fixed silly typo in the new intrinsics in intrin_vsx.hpp
* still trying to fix VSX compiler errors
* still trying to fix VSX compiler errors
* still trying to fix VSX compiler errors
* still trying to fix VSX compiler errors
* fixed/updated v_load_deinterleave and v_store_interleave intrinsics; modified split() and merge() functions to use those intrinsics
* fixed a few compile errors and bug in v_load_deinterleave(ptr, v_uint32x4& a, v_uint32x4& b)
* fixed few more compile errors
* core:OE-27 prepare universal intrinsics to expand (#11022)
* core:OE-27 prepare universal intrinsics to expand (#11022)
* core: Add universal intrinsics for AVX2
* updated implementation of wide univ. intrinsics; converted several OpenCV HAL functions: sqrt, invsqrt, magnitude, phase, exp to the wide universal intrinsics.
* converted log to universal intrinsics; cleaned up the code a bit; added v_lut_deinterleave intrinsics.
* core: Add universal intrinsics for AVX2
* fixed multiple compile errors
* fixed many more compile errors and hopefully some test failures
* fixed some more compile errors
* temporarily disabled IPP to debug exp & log; hopefully fixed Doxygen complains
* fixed some more compile errors
* fixed v_store(short*, v_float16&) signatures
* trying to fix the test failures on Linux
* fixed some issues found by alalek
* restored IPP optimization after the patch with AVX wide intrinsics has been properly tested
* restored IPP optimization after the patch with AVX wide intrinsics has been properly tested
- 'if' logic is moved into templates.
- removed unnecessary cv::Mat objects creation.
- fixed inv() test (invA * A == eye)
- added more Matx tests to cover all defined template specializations
* a part of https://github.com/opencv/opencv/pull/11364 by Tetragramm. Rewritten and extended findNonZero & PSNR to support more types, not just 8u.
* fixed compile & doxygen warnings
* fixed small bug in findNonZero test
fixes handling of empty matrices in some functions (#11634)
* a part of PR #11416 by Yuki Takehara
* moved the empty mat check in Mat::copyTo()
* fixed some test failures
* make sure that the matrix with more than INT_MAX elements is marked as non-continuous, and thus all the pixel-wise functions process it correctly (i.e. row-by-row, not as a single row, where integer overflow may occur when computing the total number of elements)
* Issue 11242 intrinsics v_extract, v_rotate improvement, branch 3.4, without C++11 (remove type restrictions for SSE2, use PALIGNR on SSSE3, compile to no-op when imm is 0 or nlanes).
* fix whitespace
* Fix#11242 (NEON intrinsics v_rotate...) branch 3.4
Separate macro expansion OPENCV_HAL_IMPL_NEON_SHIFT_OP for bitwise shifts for integers, from macro expansion OPENCV_HAL_IMPL_NEON_ROTATE for lane rotations. Bitwise shifts do not apply to floats, but lane-rotations can apply to both.
* fix whitespace
* Fix#11242 compile error (VSX intrinsics v_rotate(a)) branch 3.4 no-c++11
* Fix CV_Asserts with negation of strings
{!"string"} causes some compilers to throw a warning.
The value of the string is not that important -- it's only for printing
the assertion message.
Replace these calls with:
CV_Error(Error::StsError, "string")
to suppress the warning.
* remove unnecessary 'break' after CV_Error()
* use universal intrinsic instead of raw intrinsic
* add 2 channels de-interleave on x86 platform
* add v_int32x4 version of v_muladd
* add accumulate version of v_dotprod based on the commit from seiko2plus on bf1852d
* remove some verify check in performance test
* avoid the out of boundary access and keep the performance
* remove unnecessary defines from vsx_utils
* fix v_load_expand, load lower 64bit
* use vec_ld, vec_st with alignment load/store on all types except 64bit
* map v_extract to v_rotate_right
* update license header
* enable VSX by default on clang since #11167
To avoid compilation of this code:
- buf = 0;
This code can be received after refactoring of 1D cv::Mat to cv::AutoBuffer.
- "cv_mat = 0" calls setTo().
- cv::AutoBuffer calls "allocate(0)" - this is wrong.
* Make <array> #ifdef true for MSVC
I think MSVC had `std::array` for quite a while (possibly going back as far as VS 2012, but it's definitely there in 2015 and 2017. So I think `_MSC_VER` `1900` is a safe bet. Probably `1800` and maybe even `1700` could work as well but I can't test that locally.
* fix test
* Update BufferReader documentation with some example code
* Add warning to BufferPool doc regarding deallocation of StackAllocator
* Added a sample code that satisfies LIFO rule for StackAllocator
OpenCV pthreads-based implementation changes:
- rework worker threads pool, allow to execute job by the main thread too
- rework synchronization scheme (wait for job completion, threads 'pong' answer is not required)
- allow "active wait" (spin) by worker threads and by the main thread
- use _mm_pause() during active wait (support for Hyper-Threading technology)
- use sched_yield() to avoid preemption of still working other workers
- don't use getTickCount()
- optional builtin thread pool profiler (disabled by compilation flag)
UMatData locks are not mapped on real locks (they are mapped to some "pre-initialized" pool).
Concurrent execution of these statements may lead to deadlock:
- a.copyTo(b) from thread 1
- c.copyTo(d) from thread 2
where:
- 'a' and 'd' are mapped to single lock "A".
- 'b' and 'c' are mapped to single lock "B".
Workaround is to process locks with strict order.
The opencv infrastructure mostly has the basics for supporting avx512 math functions,
but it wasn't hooked up (likely due to lack of users)
In order to compile the DNN functions for AVX512, a few things need to be hooked up
and this patch does that
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
- don't store ProgramSource in compiled Programs (resolved problem with "source" buffers lifetime)
- completelly remove Program::read/write methods implementation:
- replaced with method to query RAW OpenCL binary without any "custom" data
- deprecate Program::getPrefix() methods
If there are no OpenCL/UMat methods calls from application.
OpenCL subsystem is initialized:
- haveOpenCL() is called from application
- useOpenCL() is called from application
- access to OpenCL allocator: UMat is created (empty UMat is ignored) or UMat <-> Mat conversions are called
Don't call OpenCL functions if OPENCV_OPENCL_RUNTIME=disabled
(independent from OpenCL linkage type)
* add accuracy test and performance check for matmul
* add performance tests for transform and dotProduct
* add test Core_TransformLargeTest for 8u version of transform
* remove raw SSE2/NEON implementation from matmul.cpp
* use universal intrinsic instead of raw intrinsic
* remove unused templated function
* add v_matmuladd which multiply 3x3 matrix and add 3x1 vector
* add v_rotate_left/right in universal intrinsic
* suppress intrinsic on some function and platform
* add pure SW implementation of new universal intrinsics
* add test for new universal intrinsics
* core: prevent memory access after the end of buffer
* fix perf tests
When elements are 64 bits, the vec_st_interleave()/vec_ld_deinterleave()
doesn't interleave 4 elements correctly.
For vec_st_interleave(), following is saved into mem:
a0 b0 a1 b1 c0 d0 c1 d1
-> we expected:
a0 b0 c0 d0 a1 b1 c1 d1
for vec_ld_deinterleave(), following is loaded into a b c d for memory
string { 1 2 3 4 5 6 7 8 }:
a: 1 3
b: 2 4
c: 5 7
d: 6 8
-> we expected:
a: 1 5
b: 2 6
c: 3 7
d: 4 8
This patch corrects this behavior.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
- changed behavior of vec_ctf, vec_ctu, vec_cts
in gcc and clang to make them compatible with XLC
- implemented most of missing conversion intrinsics in gcc and clang
- implemented conversions intrinsics of odd-numbered elements
- ignored gcc bug warning that caused by -Wunused-but-set-variable in rare cases
- replaced right shift with algebraic right shift for signed vectors
to shift in the sign bit.
- added new universal intrinsics v_matmuladd, v_rotate_left/right
- avoid using floating multiply-add in RNG
* Update OpenCVCompilerOptimizations.cmake
Neon not supported on MSVC ARM breaking build fix
* Update OpenCVCompilerOptimizations.cmake
Whitespace
* Update intrin.hpp
Many problems in MSVC ARM builds (at least on VS2017) being fixed in this PR now.
C:\Users\Gregory\DOCUME~1\MYLIBR~1\OPENCV~3\opencv\sources\modules\core\include\opencv2/core/hal/intrin.hpp(444): error C3861: '_tzcnt_u32': identifier not found
* Update hal_replacement.hpp
Passing variadic expansion in a macro to another macro does not work properly in MSVC and a famous known workaround is hereby applied. Discussion of it: https://stackoverflow.com/questions/5134523/msvc-doesnt-expand-va-args-correctly
Only needed the fix for ARM builds: TEGRA_ macros are used for cv_hal_ functions in the carotene library.
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\core\src\arithm.cpp(2378): warning C4003: not enough actual parameters for macro 'TEGRA_ADD'
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\core\src\arithm.cpp(2378): error C2143: syntax error: missing ')' before ','
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\core\src\arithm.cpp(2378): error C2059: syntax error: ')'
* Update hal_replacement.hpp
All hal_replacement's using carotene\hal\tegra_hal.hpp TEGRA_ functions as macros preprocessed by variadic macros should be changed, identical as was done in core.
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\imgproc\src\color.cpp(9604): warning C4003: not enough actual parameters for macro 'TEGRA_CVTBGRTOBGR'
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\imgproc\src\color.cpp(9604): error C2059: syntax error: '=='
* Update OpenCVCompilerOptimizations.cmake
* Update hal_replacement.hpp
* Update hal_replacement.hpp
The original template based mat ptr for indexing is not implemented,
add the similar implementation as uchar type, but cast to
user-defined type from the uchar pointer.
Adds fitEllipseDirect to imgproc: The Direct least square (Direct) method by Fitzgibbon1999.
New Tests are included for the methods.
fitEllipseAMS Tests
fitEllipseDirect Tests
Comparative examples are added to fitEllipse.cpp in Samples.
add libdnn acceleration to dnn module (#9114)
* import libdnn code
Signed-off-by: Li Peng <peng.li@intel.com>
* add convolution layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add pooling layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add softmax layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add lrn layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add innerproduct layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add HAVE_OPENCL macro
Signed-off-by: Li Peng <peng.li@intel.com>
* fix for convolution ocl
Signed-off-by: Li Peng <peng.li@intel.com>
* enable getUMat() for multi-dimension Mat
Signed-off-by: Li Peng <peng.li@intel.com>
* use getUMat for ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* use CV_OCL_RUN macro
Signed-off-by: Li Peng <peng.li@intel.com>
* set OPENCL target when it is available
and disable fuseLayer for OCL target for the time being
Signed-off-by: Li Peng <peng.li@intel.com>
* fix innerproduct accuracy test
Signed-off-by: Li Peng <peng.li@intel.com>
* remove trailing space
Signed-off-by: Li Peng <peng.li@intel.com>
* Fixed tensorflow demo bug.
Root cause is that tensorflow has different algorithm with libdnn
to calculate convolution output dimension.
libdnn don't calculate output dimension anymore and just use one
passed in by config.
* split gemm ocl file
split it into gemm_buffer.cl and gemm_image.cl
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix compile failure
Signed-off-by: Li Peng <peng.li@intel.com>
* check env flag for auto tuning
Signed-off-by: Li Peng <peng.li@intel.com>
* switch to new ocl kernels for softmax layer
Signed-off-by: Li Peng <peng.li@intel.com>
* update softmax layer
on some platform subgroup extension may not work well,
fallback to non subgroup ocl acceleration.
Signed-off-by: Li Peng <peng.li@intel.com>
* fallback to cpu path for fc layer with multi output
Signed-off-by: Li Peng <peng.li@intel.com>
* update output message
Signed-off-by: Li Peng <peng.li@intel.com>
* update fully connected layer
fallback to gemm API if libdnn return false
Signed-off-by: Li Peng <peng.li@intel.com>
* Add ReLU OCL implementation
* disable layer fusion for now
Signed-off-by: Li Peng <peng.li@intel.com>
* Add OCL implementation for concat layer
Signed-off-by: Wu Zhiwen <zhiwen.wu@intel.com>
* libdnn: update license and copyrights
Also refine libdnn coding style
Signed-off-by: Wu Zhiwen <zhiwen.wu@intel.com>
Signed-off-by: Li Peng <peng.li@intel.com>
* DNN: Don't link OpenCL library explicitly
* DNN: Make default preferableTarget to DNN_TARGET_CPU
User should set it to DNN_TARGET_OPENCL explicitly if want to
use OpenCL acceleration.
Also don't fusion when using DNN_TARGET_OPENCL
* DNN: refine coding style
* Add getOpenCLErrorString
* DNN: Use int32_t/uint32_t instread of alias
* Use namespace ocl4dnn to include libdnn things
* remove extra copyTo in softmax ocl path
Signed-off-by: Li Peng <peng.li@intel.com>
* update ReLU layer ocl path
Signed-off-by: Li Peng <peng.li@intel.com>
* Add prefer target property for layer class
It is used to indicate the target for layer forwarding,
either the default CPU target or OCL target.
Signed-off-by: Li Peng <peng.li@intel.com>
* Add cl_event based timer for cv::ocl
* Rename libdnn to ocl4dnn
Signed-off-by: Li Peng <peng.li@intel.com>
Signed-off-by: wzw <zhiwen.wu@intel.com>
* use UMat for ocl4dnn internal buffer
Remove allocateMemory which use clCreateBuffer directly
Signed-off-by: Li Peng <peng.li@intel.com>
Signed-off-by: wzw <zhiwen.wu@intel.com>
* enable buffer gemm in ocl4dnn innerproduct
Signed-off-by: Li Peng <peng.li@intel.com>
* replace int_tp globally for ocl4dnn kernels.
Signed-off-by: wzw <zhiwen.wu@intel.com>
Signed-off-by: Li Peng <peng.li@intel.com>
* create UMat for layer params
Signed-off-by: Li Peng <peng.li@intel.com>
* update sign ocl kernel
Signed-off-by: Li Peng <peng.li@intel.com>
* update image based gemm of inner product layer
Signed-off-by: Li Peng <peng.li@intel.com>
* remove buffer gemm of inner product layer
call cv::gemm API instead
Signed-off-by: Li Peng <peng.li@intel.com>
* change ocl4dnn forward parameter to UMat
Signed-off-by: Li Peng <peng.li@intel.com>
* Refine auto-tuning mechanism.
- Use OPENCV_OCL4DNN_KERNEL_CONFIG_PATH to set cache directory
for fine-tuned kernel configuration.
e.g. export OPENCV_OCL4DNN_KERNEL_CONFIG_PATH=/home/tmp,
the cache directory will be /home/tmp/spatialkernels/ on Linux.
- Define environment OPENCV_OCL4DNN_ENABLE_AUTO_TUNING to enable
auto-tuning.
- OPENCV_OPENCL_ENABLE_PROFILING is only used to enable profiling
for OpenCL command queue. This fix basic kernel get wrong running
time, i.e. 0ms.
- If creating cache directory failed, disable auto-tuning.
* Detect and create cache dir on windows
Signed-off-by: Li Peng <peng.li@intel.com>
* Refine gemm like convolution kernel.
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix redundant swizzleWeights calling when use cached kernel config.
* Fix "out of resource" bug when auto-tuning too many kernels.
* replace cl_mem with UMat in ocl4dnnConvSpatial class
* OCL4DNN: reduce the tuning kernel candidate.
This patch could reduce 75% of the tuning candidates with less
than 2% performance impact for the final result.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
* replace cl_mem with umat in ocl4dnn convolution
Signed-off-by: Li Peng <peng.li@intel.com>
* remove weight_image_ of ocl4dnn inner product
Actually it is unused in the computation
Signed-off-by: Li Peng <peng.li@intel.com>
* Various fixes for ocl4dnn
1. OCL_PERFORMANCE_CHECK(ocl::Device::getDefault().isIntel())
2. Ptr<OCL4DNNInnerProduct<float> > innerProductOp
3. Code comments cleanup
4. ignore check on OCL cpu device
Signed-off-by: Li Peng <peng.li@intel.com>
* add build option for log softmax
Signed-off-by: Li Peng <peng.li@intel.com>
* remove unused ocl kernels in ocl4dnn
Signed-off-by: Li Peng <peng.li@intel.com>
* replace ocl4dnnSet with opencv setTo
Signed-off-by: Li Peng <peng.li@intel.com>
* replace ALIGN with cv::alignSize
Signed-off-by: Li Peng <peng.li@intel.com>
* check kernel build options
Signed-off-by: Li Peng <peng.li@intel.com>
* Handle program compilation fail properly.
* Use std::numeric_limits<float>::infinity() for large float number
* check ocl4dnn kernel compilation result
Signed-off-by: Li Peng <peng.li@intel.com>
* remove unused ctx_id
Signed-off-by: Li Peng <peng.li@intel.com>
* change clEnqueueNDRangeKernel to kernel.run()
Signed-off-by: Li Peng <peng.li@intel.com>
* change cl_mem to UMat in image based gemm
Signed-off-by: Li Peng <peng.li@intel.com>
* check intel subgroup support for lrn and pooling layer
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix convolution bug if group is greater than 1
Signed-off-by: Li Peng <peng.li@intel.com>
* Set default layer preferableTarget to be DNN_TARGET_CPU
Signed-off-by: Li Peng <peng.li@intel.com>
* Add ocl perf test for convolution
Signed-off-by: Li Peng <peng.li@intel.com>
* Add more ocl accuracy test
Signed-off-by: Li Peng <peng.li@intel.com>
* replace cl_image with ocl::Image2D
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix build failure in elementwise layer
Signed-off-by: Li Peng <peng.li@intel.com>
* use getUMat() to get blob data
Signed-off-by: Li Peng <peng.li@intel.com>
* replace cl_mem handle with ocl::KernelArg
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(build): don't use C++11, OPENCL_LIBRARIES fix
* dnn(ocl4dnn): remove unused OpenCL kernels
* dnn(ocl4dnn): extract OpenCL code into .cl files
* dnn(ocl4dnn): refine auto-tuning
Defaultly disable auto-tuning, set OPENCV_OCL4DNN_ENABLE_AUTO_TUNING
environment variable to enable it.
Use a set of pre-tuned configs as default config if auto-tuning is disabled.
These configs are tuned for Intel GPU with 48/72 EUs, and for googlenet,
AlexNet, ResNet-50
If default config is not suitable, use the first available kernel config
from the candidates. Candidate priority from high to low is gemm like kernel,
IDLF kernel, basick kernel.
* dnn(ocl4dnn): pooling doesn't use OpenCL subgroups
* dnn(ocl4dnn): fix perf test
OpenCV has default 3sec time limit for each performance test.
Warmup OpenCL backend outside of perf measurement loop.
* use ocl::KernelArg as much as possible
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): fix bias bug for gemm like kernel
* dnn(ocl4dnn): wrap cl_mem into UMat
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): Refine signature of kernel config
- Use more readable string as signture of kernel config
- Don't count device name and vendor in signature string
- Default kernel configurations are tuned for Intel GPU with
24/48/72 EUs, and for googlenet, AlexNet, ResNet-50 net model.
* dnn(ocl4dnn): swap width/height in configuration
* dnn(ocl4dnn): enable configs for Intel OpenCL runtime only
* core: make configuration helper functions accessible from non-core modules
* dnn(ocl4dnn): update kernel auto-tuning behavior
Avoid unwanted creation of directories
* dnn(ocl4dnn): simplify kernel to workaround OpenCL compiler crash
* dnn(ocl4dnn): remove redundant code
* dnn(ocl4dnn): Add more clear message for simd size dismatch.
* dnn(ocl4dnn): add const to const argument
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): force compiler use a specific SIMD size for IDLF kernel
* dnn(ocl4dnn): drop unused tuneLocalSize()
* dnn(ocl4dnn): specify OpenCL queue for Timer and convolve() method
* dnn(ocl4dnn): sanitize file names used for cache
* dnn(perf): enable Network tests with OpenCL
* dnn(ocl4dnn/conv): drop computeGlobalSize()
* dnn(ocl4dnn/conv): drop unused fields
* dnn(ocl4dnn/conv): simplify ctor
* dnn(ocl4dnn/conv): refactor kernelConfig localSize=NULL
* dnn(ocl4dnn/conv): drop unsupported double / untested half types
* dnn(ocl4dnn/conv): drop unused variable
* dnn(ocl4dnn/conv): alignSize/divUp
* dnn(ocl4dnn/conv): use enum values
* dnn(ocl4dnn): drop unused innerproduct variable
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): add an generic function to check cl option support
* dnn(ocl4dnn): run softmax subgroup version kernel first
Signed-off-by: Li Peng <peng.li@intel.com>
Added forkfour Latex command to math js support.
Split cv::norm documentation between the cv::norm and its overload, to make things clearer
Corrected some typos and cleaned up grammar.
Result is clearer documentation for the norms.
Work pending...
This adds the possibility to use multi-channel masks for the functions
cv::mean, cv::meanStdDev and the method Mat::setTo. The tests have now a
probability to use multi-channel masks for operations that support them.
This also includes Mat::copyTo, which supported multi-channel masks
before, but there was no test confirming this.
This function is the counterpart of "Context::getProg".
With this function, users have chance to unload a program
from global run-time cached programs, and save resource.
- Optimizations set change. Now IPP integrations will provide code for SSE42, AVX2 and AVX512 (SKX) CPUs only. For HW below SSE42 IPP code is disabled.
- Performance regressions fixes for IPP code paths;
- cv::boxFilter integration improvement;
- cv::filter2D integration improvement;
Add constructors taking initializer_list for some of OpenCV data types (#9034)
* Add a constructor taking initializer_list for Matx
* Add a constructor taking initializer list for Mat and Mat_
* Add one more method to initialize Mat to the corresponding tutorial
* Add a note how to initialize Matx
* CV_CXX_11->CV_CXX11
BufferPoolController has a non virtual protected destructor (which is legitimate)
However, Visual Studio sees this as a bug, if you enable more warnings, like below
```
add_compile_options(/W3) # level 3 warnings
add_compile_options(/we4265) # warning about missing virtual destructors
```
This is a proposition in order to silence this warning.
See https://github.com/ivsgroup/boost_warnings_minimal_demo for a demo of the same problem
with boost/exception.hpp
Remove unnecessary Non-ASCII characters from source code (#9075)
* Remove unnecessary Non-ASCII characters from source code
Remove unnecessary Non-ASCII characters and replace them with ASCII
characters
* Remove dashes in the @param statement
Remove dashes and place single space in the @param statement to keep
coding style
* misc: more fixes for non-ASCII symbols
* misc: fix non-ASCII symbol in CMake file
* another round of dnn optimization:
* increased malloc alignment across OpenCV from 16 to 64 bytes to make it AVX2 and even AVX-512 friendly
* improved SIMD optimization of pooling layer, optimized average pooling
* cleaned up convolution layer implementation
* made activation layer "attacheable" to all other layers, including fully connected and addition layer.
* fixed bug in the fusion algorithm: "LayerData::consumers" should not be cleared, because it desctibes the topology.
* greatly optimized permutation layer, which improved SSD performance
* parallelized element-wise binary/ternary/... ops (sum, prod, max)
* also, added missing copyrights to many of the layer implementation files
* temporarily disabled (again) the check for intermediate blobs consistency; fixed warnings from various builders
Parallelize Canny with custom gradient (#8694)
* New Canny implementation. Restructuring code in parallelCanny class. Align mag buffer and map.
* Fix warnings.
* Missing SIMD check added.
* Replaced local trailingZeros in contours.cpp. Use alignSize in canny.cpp
* Fix warnings in alignSize and allocate just minimum extra columns.
* Fix another warning in map.create.
* Exchange for loop by do loop to avoid double check at the beginning.
Define extra SIMD CANNY_CHECK to avoid unnecessary continue.
* Correct the existing documented T-API functions to match the doxygen format.
* docs: fix comments style
* T-API documentation: minor formatting changes
Updated integrations for:
cv::split
cv::merge
cv::insertChannel
cv::extractChannel
cv::Mat::convertTo - now with scaled conversions support
cv::LUT - disabled due to performance issues
Mat::copyTo
Mat::setTo
cv::flip
cv::copyMakeBorder - currently disabled
cv::polarToCart
cv::pow - ipp pow function was removed due to performance issues
cv::hal::magnitude32f/64f - disabled for <= SSE42, poor performance
cv::countNonZero
cv::minMaxIdx
cv::norm
cv::canny - new integration. Disabled for threaded;
cv::cornerHarris
cv::boxFilter
cv::bilateralFilter
cv::integral
Add support for std::array<T, N> (#8535)
* Add support for std::array<T, N>
* Add std::array<Mat, N> support
* Remove UMat constructor with std::array parameter
Gemm kernels for Intel GPU (#8104)
* Fix an issue with Kernel object reset release when consecutive Kernel::run calls
Kernel::run launch OCL gpu kernels and set a event callback function
to decreate the ref count of UMat or remove UMat when the lauched workloads
are completed. However, for some OCL kernels requires multiple call of
Kernel::run function with some kernel parameter changes (e.g., input
and output buffer offset) to get the final computation result.
In the case, the current implementation requires unnecessary
synchronization and cleanupMat.
This fix requires the user to specify whether there will be more work or not.
If there is no remaining computation, the Kernel::run will reset the
kernel object
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
* GEMM kernel optimization for Intel GEN
The optimized kernels uses cl_intel_subgroups extension for better
performance.
Note: This optimized kernels will be part of ISAAC in a code generation
way under MIT license.
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
* Fix API compatibility error
This patch fixes a OCV API compatibility error. The error was reported
due to the interface changes of Kernel::run. To resolve the issue,
An overloaded function of Kernel::run is added. It take a flag indicating
whether there are more work to be done with the kernel object without
releasing resources related to it.
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
* Renaming intel_gpu_gemm.cpp to intel_gpu_gemm.inl.hpp
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
* Revert "Fix API compatibility error"
This reverts commit 2ef427db91.
Conflicts:
modules/core/src/intel_gpu_gemm.inl.hpp
* Revert "Fix an issue with Kernel object reset release when consecutive Kernel::run calls"
This reverts commit cc7f9f5469.
* Fix the case of uninitialization D
When C is null and beta is non-zero, D is used without initialization.
This resloves the issue
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
* fix potential output error due to 0 * nan
Signed-off-by: Woo, Insoo <insoo.woo@intel.com>
* whitespace fix, eliminate non-ASCII symbols
* fix build warning
- use suffixes like '.avx.cpp'
- added CMake-generated files for '.simd.hpp' optimization approach
- wrap HAL intrinsic headers into separate namespaces for different build flags
- automatic vzeroupper insertion (via CV_INSTRUMENT_REGION macro)
`template<typename _Tp> inline const _Tp* Mat_<_Tp>::operator [](int y) const` does not support 3d matrix since it checks rows.
This operator[] shall check size.p[0] instead.
* Fix the documentation for Mat::diag(int).
Fix issue #8181
* Fix the documentation for Mat::diag(int).
Fix issue #8181.
* Add support for printing out cv::Complex.
* Remove extra spaces.
* cv::Complex is submitted as a new pull request.
Add support for printing out cv::Complex. (#8208)
* Add support for printing out cv::Complex.
* Conform to the format of std::complex.
* Remove extra spaces.
* Remove extra spaces.
Fix typos in the documentation for AutoBuffer. (#8197)
* Allocate 1000 floats to match the documentation
Fix the documentation of `AutoBuffer`. By default, the following code
```.cpp
cv::AutoBuffer<float> m;
````
allocates only 264 floats. But the comment in the demonstration code says it allocates 1000 floats, which is
not correct.
* fix typo in the comment.
Currently, to select a submatrix of a N-dimensional matrix, it requires
two lines of code while only one line of code is required if using a 2D
array.
I added functionality to be able to select an N-dim submatrix using a
vector list instead of a Range pointer. This allows initializer lists to
be used for a one-line selection.
This allows for an N-dimensional array to be setup in one line instead of two when using C++11 initializer lists. cv::Mat(3, {zDim, yDim, xDim}, ...) can be used instead of having to create an int pointer to hold the size array.
Maximum depth limit var was added to the instrumentation structure;
Trace names output console output fix: improper tree formatting could happen;
Output in case of error was added;
Custom regions improvements;
Improved timing and weight calculation for parallel regions; New TC (threads counter) value to indicate how many different threads accessed particular node;
parallel_for, warnings fixes and ReturnAddress code from Alexander Alekhin;
* use hasSIMD128 rather than calling checkHardwareSupport
* add SIMD check in spartialgradient.cpp
* add SIMD check in stereosgbm.cpp
* add SIMD check in canny.cpp
A bug in ICC improperly identified the first parameter as "void*"
rather than the proper "volatile long*". This is scheduled to be
fixed in ICC in a future release.
This patch casts only to a "long*" to preserve backwards compatibility
with the ICC 16 and ICC 17 releases.
[GSOC] New camera model for stitching pipeline
* implement estimateAffine2D
estimates affine transformation using robust RANSAC method.
* uses RANSAC framework in calib3d
* includes accuracy test
* uses SVD decomposition for solving 3 point equation
* implement estimateAffinePartial2D
estimates limited affine transformation
* includes accuracy test
* stitching: add affine matcher
initial version of matcher that estimates affine transformation
* stitching: added affine transform estimator
initial version of estimator that simply chain transformations in homogeneous coordinates
* calib3d: rename estimateAffine3D test
test Calib3d_EstimateAffineTransform rename to Calib3d_EstimateAffine3D. This is more descriptive and prevents confusion with estimateAffine2D tests.
* added perf test for estimateAffine functions
tests both estimateAffine2D and estimateAffinePartial2D
* calib3d: compare error in square in estimateAffine2D
* incorporates fix from #6768
* rerun affine estimation on inliers
* stitching: new API for parallel feature finding
due to ABI breakage new functionality is added to `FeaturesFinder2`, `SurfFeaturesFinder2` and `OrbFeaturesFinder2`
* stitching: add tests for parallel feature find API
* perf test (about linear speed up)
* accuracy test compares results with serial version
* stitching: use dynamic_cast to overcome ABI issues
adding parallel API to FeaturesFinder breaks ABI. This commit uses dynamic_cast and hardcodes thread-safe finders to avoid breaking ABI.
This should be replaced by proper method similar to FeaturesMatcher on next ABI break.
* use estimateAffinePartial2D in AffineBestOf2NearestMatcher
* add constructor to AffineBestOf2NearestMatcher
* allows to choose between full affine transform and partial affine transform. Other params are the as for BestOf2NearestMatcher
* added protected field
* samples: stitching_detailed support affine estimator and matcher
* added new flags to choose matcher and estimator
* stitching: rework affine matcher
represent transformation in homogeneous coordinates
affine matcher: remove duplicite code
rework flow to get rid of duplicite code
affine matcher: do not center points to (0, 0)
it is not needed for affine model. it should not affect estimation in any way.
affine matcher: remove unneeded cv namespacing
* stitching: add stub bundle adjuster
* adds stub bundle adjuster that does nothing
* can be used in place of standard bundle adjusters to omit bundle adjusting step
* samples: stitching detailed, support no budle adjust
* uses new NoBundleAdjuster
* added affine warper
* uses R to get whole affine transformation and propagates rotation and translation to plane warper
* add affine warper factory class
* affine warper: compensate transformation
* samples: stitching_detailed add support for affine warper
* add Stitcher::create method
this method follows similar constructor methods and returns smart pointer. This allows constructing Stitcher according to OpenCV guidelines.
* supports multiple stitcher configurations (PANORAMA and SCANS) for convenient setup
* returns cv::Ptr
* stitcher: dynamicaly determine correct estimator
we need to use affine estimator for affine matcher
* preserves ABI (but add hints for ABI 4)
* uses dynamic_cast hack to inject correct estimator
* sample stitching: add support for multiple modes
shows how to use different configurations of stitcher easily (panorama stitching and scans affine model)
* stitcher: find features in parallel
use new FeatureFinder API to find features in parallel. Parallelized using TBB.
* stitching: disable parallel feature finding for OCL
it does not bring much speedup to run features finder in parallel when OpenCL is enabled, because finder needs to wait for OCL device.
Also, currently ORB is not thread-safe when OCL is enabled.
* stitching: move matcher tests
move matchers tests perf_stich.cpp -> perf_matchers.cpp
* stitching: add affine stiching integration test
test basic affine stitching (SCANS mode of stitcher) with images that have only translation between them
* enable surf for stitching tests
stitching.b12 test was failing with surf
investigated the issue, surf is producing good result. Transformation is only slightly different from ORB, so that resulting pano does not exactly match ORB's result. That caused sanity check to fail.
* added size checks similar to other tests
* sanity check will be applied only for ORB
* stitching: fix wrong estimator choice
if case was exactly wrong, estimators were chosen wrong
added logging for estimated transformation
* enable surf for matchers stitching tests
* enable SURF
* rework sanity checking. Check estimated transform instead of matches. Est. transform should be more stable and comparable between SURF and ORB.
* remove regression checking for VectorFeatures tests. It has a lot if data andtest is the same as previous except it test different vector size for performance, so sanity checking does not add any value here. Added basic sanity asserts instead.
* stitching tests: allow relative error for transform
* allows .01 relative error for estimated homography sanity check in stitching matchers tests
* fix VS warning
stitching tests: increase relative error
increase relative error to make it pass on all platforms (results are still good).
stitching test: allow bigger relative error
transformation can differ in small values (with small absolute difference, but large relative difference). transformation output still looks usable for all platforms. This difference affects only mac and windows, linux passes fine with small difference.
* stitching: add tests for affine matcher
uses s1, s2 images. added also new sanity data.
* stitching tests: use different data for matchers tests
this data should yeild more stable transformation (it has much more matches, especially for surf). Sanity data regenerated.
* stitching test: rework tests for matchers
* separated rotation and translations as they are different by scale.
* use appropriate absolute error for them separately. (relative error does not work for values near zero.)
* stitching: fix affine warper compensation
calculation of rotation and translation extracted for plane warper was wrong
* stitching test: enable surf for opencl integration tests
* enable SURF with correct guard (HAVE_OPENCV_XFEATURES2D)
* add OPENCL guard and correct namespace as usual for opencl tests
* stitching: add ocl accuracy test for affine warper
test consistent results with ocl on and off
* stitching: add affine warper ocl perf test
add affine warper to existing warper perf tests. Added new sanity data.
* stitching: do not overwrite inliers in affine matcher
* estimation is run second time on inliers only, inliers produces in second run will not be therefore correct for all matches
* calib3d: add Levenberg–Marquardt refining to estimateAffine2D* functions
this adds affine Levenberg–Marquardt refining to estimateAffine2D functions similar to what is done in findHomography.
implements Levenberg–Marquardt refinig for both full affine and partial affine transformations.
* stitching: remove reestimation step in affine matcher
reestimation step is not needed. estimateAffine2D* functions are running their own reestimation on inliers using the Levenberg-Marquardt algorithm, which is better than simply rerunning RANSAC on inliers.
* implement partial affine bundle adjuster
bundle adjuster that expect affine transform with 4DOF. Refines parameters for all cameras together.
stitching: fix bug in BundleAdjusterAffinePartial
* use the invers properly
* use static buffer for invers to speed it up
* samples: add affine bundle adjuster option to stitching_detailed
* add support for using affine bundle adjuster with 4DOF
* improve logging of initial intristics
* sttiching: add affine bundle adjuster test
* fix build warnings
* stitching: increase limit on sanity check
prevents spurious test failures on mac. values are still pretty fine.
* stitching: set affine bundle adjuster for SCANS mode
* fix bug with AffineBestOf2NearestMatcher (we want to select affine partial mode)
* select right bundle adjuster
* stitching: increase error bound for matcher tests
* this prevents failure on mac. tranformation is still ok.
* stitching: implement affine bundle adjuster
* implements affine bundle adjuster that is using full affine transform
* existing test case modified to test both affinePartial an full affine bundle adjuster
* add stitching tutorial
* show basic usage of stitching api (Stitcher class)
* stitching: add more integration test for affine stitching
* added new datasets to existing testcase
* removed unused include
* calib3d: move `haveCollinearPoints` to common header
* added comment to make that this also checks too close points
* calib3d: redone checkSubset for estimateAffine* callback
* use common function to check collinearity
* this also ensures that point will not be too close to each other
* calib3d: change estimateAffine* functions API
* more similar to `findHomography`, `findFundamentalMat`, `findEssentialMat` and similar
* follows standard recommended semantic INPUTS, OUTPUTS, FLAGS
* allows to disable refining
* supported LMEDS robust method (tests yet to come) along with RANSAC
* extended docs with some tips
* calib3d: rewrite estimateAffine2D test
* rewrite in googletest style
* parametrize to test both robust methods (RANSAC and LMEDS)
* get rid of boilerplate
* calib3d: rework estimateAffinePartial2D test
* rework in googletest style
* add testing for LMEDS
* calib3d: rework estimateAffine*2D perf test
* test for LMEDS speed
* test with/without Levenberg-Marquart
* remove sanity checking (this is covered by accuracy tests)
* calib3d: improve estimateAffine*2D tests
* test transformations in loop
* improves test by testing more potential transformations
* calib3d: rewrite kernels for estimateAffine*2D functions
* use analytical solution instead of SVD
* this version is faster especially for smaller amount of points
* calib3d: tune up perf of estimateAffine*2D functions
* avoid copying inliers
* avoid converting input points if not necessary
* check only `from` point for collinearity, as `to` does not affect stability of transform
* tutorials: add commands examples to stitching tutorials
* add some examples how to run stitcher sample code
* mention stitching_detailed.cpp
* calib3d: change computeError for estimateAffine*2D
* do error computing in floats instead of doubles
this have required precision + we were storing the result in float anyway. This make code faster and allows auto-vectorization by smart compilers.
* documentation: mention estimateAffine*2D function
* refer to new functions on appropriate places
* prefer estimateAffine*2D over estimateRigidTransform
* stitching: add camera models documentations
* mention camera models in module documentation to give user a better overview and reduce confusion
* use __GNUC_MINOR__ in correct place to check the version of GCC
* check processor support of FP16 at run time
* check compiler support of FP16 and pass correct compiler option
* rely on ENABLE_AVX on gcc since AVX is generated when mf16c is passed
* guard correctly using ifdef in case of various configuration
* use v_float16x4 correctly by including the right header file
- calculate ticksTotal instead of ticksMean
- local / global width is based on ticksTotal value
- added instrumentation for OpenCL program compilation
- added instrumentation for OpenCL kernel execution
* use v_float16x4 (universal intrinsic) instead of raw SSE/NEON implementation
* define v_load_f16/v_store_f16 since v_load can't be distinguished when short pointer passed
* brush up implementation on old compiler (guard correctly)
* add test for v_load_f16 and round trip conversion of v_float16x4
* fix conversion error
* use universal intrinsic for accumulate series using float/double
* accumulate, accumulateSquare, accumulateProduct and accumulateWeighted
* add v_cvt_f64_high in both SSE/NEON
* add test for conversion v_cvt_f64_high in test_intrin.cpp
* improve some existing universal intrinsic by using new instructions in Aarch64
* add workaround for Android build in intrin_neon.hpp
* Added 2-channel ops to match existing 3-channel and 4-channel ops
* v_load_deinterleave() and v_store_interleave()
* Implements float32x4 only on SSE (but all types on NEON and CPP)
* Includes tests
* Will be used to vectorize 2D functions, such as estimateAffine2D()
* raise an error when wrong bit depth passed
* raise an build error when wrong depth is specified for cvtScaleHalf_
* remove unnecessary safe check in cvtScaleHalf_
* use intrinsic instead of direct pointer access
* update the explanation
Major changes:
- modify the Base64 functions to compatible with `cvWriteRawData` and so
on.
- add a Base64 flag for FileStorage and outputs raw data in Base64
automatically.
- complete all testing and documentation.
The three new functions:
```cpp
void cvStartWriteRawData_Base64(::CvFileStorage * fs, const char* name,
int len, const char* dt);
void cvWriteRawData_Base64(::CvFileStorage *
fs, const void* _data, int len);
void
cvEndWriteRawData_Base64(::CvFileStorage * fs);
```
Test is also updated. (And it's remarkable that there is a bug in
`cvWriteReadData`.)
* check compiler support
* check HW support before executing
* add test doing round trip conversion from / to FP32
* treat array correctly if size is not multiple of 4
* add declaration to prevent warning
* make it possible to enable fp16 on 32bit ARM
* let the conversion possible on non-supported HW, too.
* add test using both HW and SW implementation
Changed statements of type "#if __CUDA_ARCH__ >= 200" to
"#if defined __CUDA_ARCH__ && __CUDA_ARCH__ >= 200" in order to
avoid warnings about __CUDA_ARCH__ being undefined.
computes the complement of the Jaccard Index as described in
https://en.wikipedia.org/wiki/Jaccard_index. For rectangles this reduces
to computing the intersection over the union.
- added new functions from core module: split, merge, add, sub, mul, div, ...
- added function replacement mechanism
- added example of HAL replacement library