* resize: HResizeLinear reduce duplicate work
There appears to be a 2x unroll of the HResizeLinear against k,
however the k value is only incremented by 1 during the unroll. This
results in k - 1 duplicate passes when k > 1.
Likewise, the final pass may not respect the work done by the vector
loop. Start it with the offset returned by the vector op if
implemented. Note, no vector ops are implemented today.
The performance is most noticable on a linear downscale. A set of
performance tests are added to characterize this. The performance
improvement is 10-50% depending on the scaling.
* imgproc: vectorize HResizeLinear
Performance is mostly gated by the gather operations
for x inputs.
Likewise, provide a 2x unroll against k, this reduces the
number of alpha gathers by 1/2 for larger k.
While not a 4x improvement, it still performs substantially
better under P9 for a 1.4x improvement. P8 baseline is
1.05-1.10x due to reduced VSX instruction set.
For float types, this results in a more modest
1.2x improvement.
* Update U8 processing for non-bitexact linear resize
* core: hal: vsx: improve v_load_expand_q
With a little help, we can do this quickly without gprs on
all VSX enabled targets.
* resize: Fix cn == 3 step per feedback
Per feedback, ensure we don't overrun. This was caught via the
failure observed in Test_TensorFlow.inception_accuracy.
* calib3d: use normalized input in solvePnPGeneric()
* calib3d: java regression test for solvePnPGeneric
* calib3d: python regression test for solvePnPGeneric
* core: disable invalid constructors in C API by default
- C API objects will lose their default initializers through constructors
* samples: stop using of C API
Tests for argument conversion of Python bindings generator
* Tests for parsing elemental types from Python bindings
- Add positive and negative tests for int, float, double, size_t,
const char*, bool.
- Tests with wrong conversion behavior are skipped.
* Move implicit conversion of bool to integer/floating types to wrong
conversion behavior.
Improving VSX performance of integral function
* Adding support for vector get function on VSX datatypes so the
integral function gains a bit of performance.
* Removing get as a datatype member function and implementing a new HAL
instruction v_extract_n to get the n-th element of a vector register.
* Adding SSE/NEON/AVX intrinsics.
* Implement new HAL instruction v_broadcast_element on VSX/AVX/NEON/SSE.
* core(simd): add tests for v_extract_n/v_broadcast_element
- updated docs
- commented out code to repair compilation
- added WASM and MSA default implementations
* core(simd): fix compilation
- x86: avoid _mm256_extract_epi64/32/16/8 with MSVS 2015
- x86: _mm_extract_epi64 is 64-bit only
* cleanup
* Use FlsAlloc/FlsFree/FlsGetValue/FlsSetValue instead of TlsAlloc/TlsFree/TlsGetValue/TlsSetValue to implment TLS value cleanup when thread has been terminated on Windows Vista and above
* Fix 32-bit build
* Fixed calling convention of cleanup callback
* WINAPI changed to NTAPI
* Use proper guard macro
* Vectorize flipHoriz and flipVert functions.
* Change v_load_mirror_1 to use vec_revb for VSX
* Only use vec_revb in ISA3.0
* Removing vec_revb code since some of the older compilers don't fully support it.
* Use new v_reverse intrinsic and cleanup code.
* Ensure there are no alignment issues with copies
- move TLS & instrumentation code out of core/utility.hpp
- (*) TLSData lost .gather() method (to dispose thread data on thread termination)
- use TLSDataAccumulator for reliable collecting of thread data
- prefer using of .detachData() + .cleanupDetachedData() instead of .gather() method
(*) API is broken: replace TLSData => TLSDataAccumulator if gather required
(objects disposal on threads termination is not available in accumulator mode)
Fixing bug with comparison of v_int64x2 or v_uint64x2
* Casting v_uint64x2 to v_float64x2 and comparing does NOT work in all cases. Rewrite using epi64 instructions - faster too.
* Fix bad merge.
* Fix equal comparsion for non-SSE4.1. Add test cases for v_int64x2 comparisons.
* Try to fix merge conflict.
* Only test v_int64x2 comparisons if CV_SIMD_64F
* Fix compiler warning.
* New v_reverse HAL intrinsic for reversing the ordering of a vector
* Fix conflict.
* Try to resolve conflict again.
* Try one more time.
* Add _MM_SHUFFLE. Remove non-vectorize code in SSE2. Fix copy and paste issue with NEON.
* Change v_uint16x8 SSE2 version to use shuffles
* Cuda + OpenGL on ARM
There might be multiple ways of getting OpenCV compile on Tegra (NVIDIA Jetson) platform, but mainly they modify CUDA(8,9,10...) source code, this one fixes it for all installations.
( https://devtalk.nvidia.com/default/topic/1007290/jetson-tx2/building-opencv-with-opengl-support-/post/5141945/#5141945 et al.).
This way is exactly the same as the one proposed but the code change happens in OpenCV.
* Updated,
The link provided mentions: cuda8 + 9, I have cuda 10 + 10.1 (and can confirm it is still defined this way).
NVIDIA is probably using some other "secret" backend with Jetson.
* core: rework and optimize SIMD implementation of dotProd
- add new universal intrinsics v_dotprod[int32], v_dotprod_expand[u&int8, u&int16, int32], v_cvt_f64(int64)
- add a boolean param for all v_dotprod&_expand intrinsics that change the behavior of addition order between
pairs in some platforms in order to reach the maximum optimization when the sum among all lanes is what only matters
- fix clang build on ppc64le
- support wide universal intrinsics for dotProd_32s
- remove raw SIMD and activate universal intrinsics for dotProd_8
- implement SIMD optimization for dotProd_s16&u16
- extend performance test data types of dotprod
- fix GCC VSX workaround of vec_mule and vec_mulo (in little-endian it must be swapped)
- optimize v_mul_expand(int32) on VSX
* core: remove boolean param from v_dotprod&_expand and implement v_dotprod_fast&v_dotprod_expand_fast
this changes made depend on "terfendail" review
- renamed Cascade Lake AVX512_CEL => AVX512_CLX (align with Intel SDE tool)
- fixed CLX instruction sets (no IFMA/VBMI)
- added flag to bypass CPU baseline check: OPENCV_SKIP_CPU_BASELINE_CHECK
[GSoC 2019] Improve the performance of JavaScript version of OpenCV (OpenCV.js)
* [GSoC 2019]
Improve the performance of JavaScript version of OpenCV (OpenCV.js):
1. Create the base of OpenCV.js performance test:
This perf test is based on benchmark.js(https://benchmarkjs.com). And first add `cvtColor`, `Resize`, `Threshold` into it.
2. Optimize the OpenCV.js performance by WASM threads:
This optimization is based on Web Worker API and SharedArrayBuffer, so it can be only used in browser.
3. Optimize the OpenCV.js performance by WASM SIMD:
Add WASM SIMD backend for OpenCV Universal Intrinsics. It's experimental as WASM SIMD is still in development.
* [GSoC2019]
1. use short license header
2. fix documentation node issue
3. remove the unused `hasSIMD128()` api
* [GSoC2019]
1. fix emscripten define
2. use fallback function for f16
* [GSoC2019]
Fix rebase issue
* Added MSA implementations for mips platforms. Intrinsics for MSA and build scripts for MIPS platforms are added.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Removed some unused code in mips.toolchain.cmake.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Added comments for mips toolchain configuration and disabled compiling warnings for libpng.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Fixed the build error of unsupported opcode 'pause' when mips isa_rev is less than 2.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Removed FP16 related item in MSA option defines in OpenCVCompilerOptimizations.cmake.
2. Use CV_CPU_COMPILE_MSA instead of __mips_msa for MSA feature check in cv_cpu_dispatch.h.
3. Removed hasSIMD128() in intrin_msa.hpp.
4. Define CPU_MSA as 150.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Removed unnecessary CV_SIMD128_64F guarding in intrin_msa.hpp.
2. Removed unnecessary CV_MSA related code block in dotProd_8u().
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Defined CPU_MSA_FLAGS_ON as "-mmsa".
2. Removed CV_SIMD128_64F guardings in intrin_msa.hpp.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Removed unused msa_mlal_u16() and msa_mlal_s16 from msa_macros.h.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
ISA 2.07 (aka POWER8) effectively extended the expanding multiply
operation to word types. The altivec intrinsics prior to gcc 8 did
not get the update.
Workaround this deficiency similar to other fixes.
This was exposed by commit 33fb253a66
which leverages the int -> dword expanding multiply.
This fixes Issue #15506
Detected by clang trunk:
```
opencv/modules/core/src/ocl.cpp:4337:37: warning: object backing the pointer will be destroyed at the end of the full-expression [-Wdangling]
CV_OCL_CHECK_RESULT(retval, cv::format("clCreateBuffer(capacity=%lld) => %p", (long long int)entry.capacity_, (void*)entry.clBuffer_).c_str());
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
opencv/modules/core/src/ocl.cpp:193:42: note: expanded from macro 'CV_OCL_CHECK_RESULT'
if (0) { const char* msg_ = (msg); CV_UNUSED(msg_); /* ensure const char* type (cv::String without c_str()) */ } \
```
because `cv::format` yields a temporary std::string, and thus `msg_` points to a destroyed buffer.
Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
x86 this showed about 1.4x improvement.
For PPC, do a full multiply (32x32->64b), convert to DP
then accumulate. This may be slightly less precise for
some inputs. But is 1.5x faster than the above which
is about 1.5x than the FMA above for ~2.5x speedup.
Implement cvRound using inline asm. No compiler support
exists today to properly optimize this. This results in
about a 4x speedup over the default rounding. Likewise,
simplify the growing number of rounding function overloads.
For P9 enabled targets, utilize the classification
testing instruction to test for Inf/Nan values. Operation
speedup is about 1.2x for FP32, and 1.5x for FP64 operands.
For P8 targets, fallback to the GCC nan inline. It provides
a 1.1/1.4x improvement for FP32/FP64 arguments.
Add a new macro definition OPENCV_USE_FASTMATH_GCC_BUILTINS to enable
usage of GCC inline math functions, if available and requested by the
user.
Likewise, enable it for POWER. This is nearly always a substantial
improvement over using integer manipulation as most operations can
be done in several instructions with no branching. The result is a
1.5-1.8x speedup in the ceil/floor operations.
1. As tested with AT 12.0-1 (GCC 8.3.1) compiler on P9 LE.
Add a basic sanity test to verify the rounding functions
work as expected.
Likewise, extend the rounding performance test to cover the
additional float -> int fast math functions.
Due to the explicitly declared copy constructor Vec<T, n>::Vec(Vec <T,n>&)
GCC 9 warns if there is no assignment operator, as having one typically
requires the other (rule-of-three, constructor/desctructor/assginment).
As the values are just a plain array the default assignment operator does
the right thing. Tell the compiler explicitly to default it.
Signed-off-by: Stefan Brüns <stefan.bruens@rwth-aachen.de>
* core: improve AVX512 infrastructure by adding more CPU features groups
* cmake: use groups for AVX512 optimization flags
* core: remove gap in CPU flags enumeration
* cmake: restore default CPU_DISPATCH
OE-11 Logging revamp (#13909)
* Initial commit for log tag support.
Part of #11003, incomplete. Should pass build.
Moved LogLevel enum to logger.defines.hpp
LogTag struct used to convey both name and log level threshold as
one argument to the new logging macro. See logtag.hpp file, and
CV_LOG_WITH_TAG macro.
Global log level is now associated with a global log tag, when a
logging statement doesn't specify any log tag. See getLogLevel and
getGlobalLogTag functions.
A macro CV_LOGTAG_FALLBACK is allowed to be re-defined by other modules
or compilation units, internally, so that logging statements inside
that unit that specify NULL as tag will fall back to the re-defined tag.
Line-of-code information (file name, line number, function name),
together with tag name, are passed into the new log message sink.
See writeLogMessageEx function.
Fixed old incorrect CV_LOG_VERBOSE usage in ocl4dnn_conv_spatial.cpp.
* Implemented tag-based log filtering
Added LogTagManager. This is an initial version, using standard C++
approach as much as possible, to allow easier code review. Will
optimize later.
A workaround for all static dynamic initialization issues is
implemented. Refer to code comments.
* Added LogTagConfigParser.
Note: new code does not fully handle old log config parsing behavior.
* Fix log tag config vs registering ordering issue.
* Started testing LogTagConfigParser, incomplete.
The intention of this commit is to illustrate the capabilities of
the current design of LogTagConfigParser.
The test contained in this commit is not complete. Also, design changes
may require throwing away this commit and rewriting test code from
scratch.
Does not test whitespace segmentation (multiple tags on the config);
will do in next commit.
* Added CV_LOGTAG_EXPAND_NAME macro
This macro allows to be re-defined locally in other compilation units
to apply a prefix to whatever argument is passed as the "tag" argument
into CV_LOG_WITH_TAG. The default definition in logger.hpp does not
modify the argument. It is recommended to include the address-of
operator (ampersand) when re-defined locally.
* Added a few tests for LogTagManager, some fail.
See test_logtagmanager.cpp
Failed tests are: non-global ("something"), setting level by name-part
(first part or any part) has no effect at all.
* LogTagManagerTests substring non-confusion tests
* Fix major bugs in LogTagManager
The code change is intended to approximate the spec documented in
https://gist.github.com/kinchungwong/ec25bc1eba99142e0be4509b0f67d0c6
Refer to test suite in test_logtagmanager.cpp
Filter test result in "opencv_test_core" ...
with gtest_filter "LogTagManager*"
To see the test code that finds the bugs, refer to original commits
(before rebase; might be gone)
.. f3451208 (2019-03-03T19:45:17Z)
.... LogTagManagerTests substring non-confusion tests
.. 1b848f5f (2019-03-03T01:55:18Z)
.... Added a few tests for LogTagManager, some fail.
* Added LogTagManagerNamePartNonConfusionTest.
See test_logtagmanager.cpp in modules/core/test.
* Added LogTagAuto for auto registration in ctor
* Rewritten LogTagManager to resolve issues.
* Resolves code review issues around 2019-04-10
LogTagConfigParser::parseLogLevel - as part of resolving code review
issues, this function is rewritten to simplify control flow and to
improve conformance with legacy usage (for string values "OFF",
"DISABLED", and "WARNINGS").
- added functionality to collect memory usage of OpenCL sybsystem
- memory usage of fastMalloc() (disabled by default):
* It is not accurate sometimes - external memory profiler is required.
- specify common `CV_TEST_TAG_` macros
- added applyTestTag() function
- write memory usage / enabled tags into Google Tests output file (.xml)
* Add CUDA support for D3D11 interop. #13888
color_detail.hpp: fixed build error : dynamic initialization is not supported for a __constant__ variable.
directx.cpp: Add CUDA support(cl_nv_d3d11_sharing) for D3D11 interop. #13888
Update directx.cpp
Format adjustment.
Update directx.cpp
fix error.
Update directx.cpp
Format adjustment
Update directx.cpp
fix trailing whitespace.
fix format errors
convert indentation to spaces .
Trim trailing whitespace.
Add information about source of cl_d3d11_ext.h
Avoid unrelated changes.
Increase compile-time conditional judgment.
Increase the judgment of whether the OCL device has the required extensions at compile time.
Add compilation option `HAVE_CLNVEXT`.Check CL support in runtime.
Check result of `clGetExtensionFunctionAddressForPlatform` for KHR is invalid.It always can get the address(from OpenCL.dll),So I check NV support(from nvopencl64.dll) before KHR when `HAVE_CLNVEXT` is enabled.
Delete cl_d3d11_ext.h
Modified parameter list
fix "cannot open include file: 'CL/cl_d3d11_ext.h'"
remove not referenced var
fix C2143: syntax error
Improve compile-time judgment.
dlrectx.cpp Modify the detection order.
initializeContextFromD3D11Device:
```
// try with NV(Need to check it first)
// try with KHR
```
fix warnig C4100
Revert "fix warnig C4100"
This reverts commit 76e5becb67780071d0cbde61cc4f5f807ad7c5ac.
fix warning C4100
fix warning C4505
Format alignment
Format adjustment and automatically detect header files.
Automatically detect header files when users are not configured or configuration errors occur.
avoid unrelated changes.
Update .cmake
Update .cmake
* fix build errors
* fix warning:defined but not used
* Revert "fix warning:defined but not used"
This reverts commit 7ab3537cd0.
* fix warning:defined but not used
* fix build error for mac
* fix build error for win
* optimizing branch judgment
* Revert "optimizing branch judgment"
This reverts commit 88b72b870e.
* fix warning C4702: unreachable code
* remove unused code
* Fix problems that may lead to undefined behavior
* Add status check
* fix error C2664,C2665 : cannot convert argument
* Format adjustment
VSCODE will automatically format the indentation to 4 spaces in some situation.
* fix error C2440
* fix error C2440
* add cl_d3d11_ext.h
* Format adjustment
* remove unnecessary checks
- allow cmake to check sanity of vsx aligned ld/st
- force universal intrinsics v_load_aligned/v_store_aligned
to failback to unaligned ld/st if cmake runtime vsx aligned test fail
* Expose more C++ functionality in the Java wrapper of the Mat class
In particular expose methods for handling Mat with more than 2 dimensions
* add constructors taking an array of dimension sizes
* add constructor taking an existing Mat and an array of Ranges
* add override of the create method taking an array of dimension sizes
* add overrides of the ones and zeros methods taking an array of dimension sizes
* add override of the submat method taking an array of ranges
* add overrides of put and get taking arrays of indices
* add wrapper for copySize method
* fix crash in the JNI wrapper of the reshape(int cn, int[] newshape) method
* add test for each method added to Mat.java
* Fix broken test
Lab/XYZ modes have been postponed (color_lab.cpp):
- need to split code for tables initialization and for pixels processing first
- no significant performance improvements for switching between SSE42 / AVX2 code generation
* core, stitching: revise syntax to support Visual C++ 2013
* stitching: revise syntax again to support Visual C++ 2013 and other compilers
* stitching: minor update to clarify changes
Resize reworked using wide universal intrinsics (#13781)
* Added wide universal intrinsics optimized implementation for 3 channel bit-exact linear resize
* Reworked linear resize using new wide LUT intrinsics
* Fix for VSX intrinsics
Due to size limit of shared memory, histogram is built on
the global memory for CV_16UC1 case.
The amount of memory needed for building histogram is:
65536 * 4byte = 256KB
and shared memory limit is 48KB typically.
Added test cases for CV_16UC1 and various clip limits.
Added perf tests for CV_16UC1 on both CPU and CUDA code.
There was also a bug in CV_8UC1 case when redistributing
"residual" clipped pixels. Adding the test case where clip
limit is 5.0 exposes this bug.
* Add Operator override for multi-channel Mat with literal constant.
* simple test
* Operator overloading channel constraint for primitive types
* fix some test for #13586
Fix a bug in cv :: merge when array of 3-channel mat is input (#13544)
* Mat merge function bug fix - Bug fix of merge function of 3-channel vector <Mat> of 3 or 4 matrices
* Add Core_merge test for opencv#13544
* fixups
* Python wrapper for detail
* hide pyrotationwrapper
* copy code in pyopencv_rotationwarper.hpp
* move ImageFeatures MatchInfo and CameraParams in core/misc/
* add python test for detail
* move test_detail in test_stitching
* rename
* added performance test for compareHist
* compareHist reworked to use wide universal intrinsics
* Disabled vectorization for CV_COMP_CORREL and CV_COMP_BHATTACHARYYA if f64 is unsupported
* Added performance tests for hal::norm functions
* Added sum of absolute differences intrinsic
* norm implementation updated to use wide universal intrinsics
* improve and fix v_reduce_sad on VSX
- add infrastructure support for Power9/VSX3
- fix missing VSX flags on GCC4.9 and CLANG4(#13210, #13222)
- fix disable VSX optimzation on GCC by using flag ENABLE_VSX
- flag ENABLE_VSX is deprecated now, use CPU_BASELINE, CPU_DISPATCH instead
- add VSX3 to arithmetic dispatchable flags
* Support for Matx read/write by FileStorage
* Only empty filestorage read now produces default Matx. Split Matx IO test into smaller units. Test checks for exception thrown if reading a Mat into a Matx of different size.
* significantly reduced OpenCV binary size by disabling IPP calls in some OpenCV functions: Sobel, Scharr, medianBlur, GaussianBlur, filter2D, mean, meanStdDev, norm, sum, minMaxIdx, sort.
* re-enable IPP in norm, since it's much faster (without adding too much space overhead)
* Updated boxFilter implementations to use wide universal intrinsics
* boxFilter implementation moved to separate file
* Replaced ROUNDUP macro with roundUp() function
* integrated the new C++ persistence; removed old persistence; most of OpenCV compiles fine! the tests have not been run yet
* fixed multiple bugs in the new C++ persistence
* fixed raw size of the parsed empty sequences
* [temporarily] excluded obsolete applications traincascade and createsamples from build
* fixed several compiler warnings and multiple test failures
* undo changes in cocoa window rendering (that was fixed in another PR)
* fixed more compile warnings and the remaining test failures (hopefully)
* trying to fix the last little warning
- initialize arithmetic dispatcher
- add new universal intrinsic v_absdiffs
- add new universal intrinsic v_pack_b
- add accumulate version of universal intrinsic v_round
- fix sse/avx2:uint8 multiplication overflow
- reimplement arithmetic, logic and comparison operations into wide universal intrinsics
with full support for all types
- reimplement IPP arithmetic, logic and comparison operations in a sperate file arithm_ipp.hpp
- avoid scalar multiplication if scaling factor eq 1 and use integer multiplication
- move C arithmetic operations to precomp.hpp and delete [arithm_simd|arithm_core].hpp
- add compatibility with new opencv4 divide policy
"as opposed to" is a phrase of opposed meaning distinguished from or in contrast with. e.g., "an approach that is theoretical as opposed to practical"
synonyms: in contrast with, as against, as contrasted with, rather than, instead of, as an alternative to
example: "we use only steam, as opposed to chemical products, to clean our house"
Exceptions caught by value incur needless cost in C++, most of them can
be caught by const-reference, especially as nearly none are actually
used. This could allow compiler generate a slightly more efficient code.
* js: update build script
- support emscipten 1.38.12 (wasm is ON by default)
- verbose build messages
* js: use builtin Math functions
* js: disable tracing code completelly
* hide use of std::mutex from /clr compilation under Visual Studio
C++11 <mutex> is not available when compiling with /clr under Visual Studio, thus opencv cannot be easily included.
It is fixed by making a CEEMutex wrapper class, around an opaque implementation using std::mutex internally
* fixed compilation outside of Visual Studio
fixed compilation outside of Visual Studio by avoiding some macros
* fixed indentation, prepare getting rid of CEEMutex/CEELockGuard
fixed indentation
After discussion, CEEMutex and CEELockGuard can be totally removed, letting the developer in a /clr context to provide his own implementation
* remove CEEMutex/CEELockGuard
Fixes for instrumentation of IPP and OCL (#12637)
* fixed warning about re-declaring variable when both IPP and instrumentation are enabled
* fixed segfault when no funName provided
* compilation fixed when both OCL and instrumentation are enabled
- avoid updating of input image during equalizeHist() call
- avoid for() with double variable (use 'int' instead)
- more CV_Check*() macros
- use Mat_<T>, Matx
- static for local variables
* Remove isIntel check from deep learning layers
* Remove fp16->fp32 fallbacks where it's not necessary
* Fix Kernel::run to prevent localsize > globalsize
* added basic support for CV_16F (the new datatype etc.). CV_USRTYPE1 is now equal to CV_16F, which may break some [rarely used] functionality. We'll see
* fixed just introduced bug in norm; reverted errorneous changes in Torch importer (need to find a better solution)
* addressed some issues found during the PR review
* restored the patch to fix some perf test failures
* may be an typo fix
* remove identical branch,may be paste error
* add parentheses around macro parameter
* simplify if condition
* check malloc fail
* change the condition of branch removed by commit 3041502861
* rewrote Mat::convertTo() and convertScaleAbs() to wide universal intrinsics; added always-available and SIMD-optimized FP16<=>FP32 conversion
* fixed compile warnings
* fix some more compile errors
* slightly relaxed accuracy threshold for int->float conversion (since we now do it using single-precision arithmetics, not double-precision)
* fixed compile errors on iOS, Android and in the baseline C++ version (intrin_cpp.hpp)
* trying to fix ARM-neon builds
* trying to fix ARM-neon builds
* trying to fix ARM-neon builds
* trying to fix ARM-neon builds
* trying to fix the custom AVX2 builder test failures (false alarms)
* fixed compile error with CPU_BASELINE=AVX2 on x86; raised tolerance thresholds in a couple of tests
* fixed compile error with CPU_BASELINE=AVX2 on x86; raised tolerance thresholds in a couple of tests
* fixed compile error with CPU_BASELINE=AVX2 on x86; raised tolerance thresholds in a couple of tests
* seemingly disabled false alarm warning in surf.cpp; increased tolerance thresholds in the tests for SolvePnP and in DNN/ENet
fix some errors found by static analyzer. (#12391)
* fix possible divided by zero and by negative values
* only 4 elements are used in these arrays
* fix uninitialized member
* use boolean type for semantic boolean variables
* avoid invalid array index
* to avoid exception and because base64_beg is only used in this block
* use std::atomic<bool> to avoid thread control race condition
* Add HPX backend for OpenCV implementation
Adds hpx backend for cv::parallel_for_() calls respecting the nstripes chunking parameter. C++ code for the backend is added to modules/core/parallel.cpp. Also, the necessary changes to cmake files are introduced.
Backend can operate in 2 versions (selectable by cmake build option WITH_HPX_STARTSTOP): hpx (runtime always on) and hpx_startstop (start and stop the backend for each cv::parallel_for_() call)
* WIP: Conditionally include hpx_main.hpp to tests in core module
Header hpx_main.hpp is included to both core/perf/perf_main.cpp and core/test/test_main.cpp.
The changes to cmake files for linking hpx library to above mentioned test executalbles are proposed but have issues.
* Add coditional iclusion of hpx_main.hpp to cpp cpu modules
* Remove start/stop version of hpx backend
Intrinsics must be effective, so don't declare FP16 type/operations if there is no native support.
- CV_FP16: supports load/store into/from float32
- CV_SIMD_FP16: declares FP16 types and native FP16 operations
- use '3.4.x' cache name for current maintenance series (there are no serious changes between releases)
- message is shown only once during creation of new cache directory
- use OPENCV_CACHE_SHOW_CLEANUP_MESSAGE=0 to hide this warning
for some big negative values less than -INT_MAX+32767 the sign of the numbers is lost due to overflow that leads to
incorrect saturation to MAX value, instead of zero.
The issue is not reproduced with CV_ENABLED_INTRINSICS=OFF
Changed the type of variable *r* from int to size_t.
This change makes sure that a valid result of std::max(r + delta,
(r*3+1)/2) can be passed into the reserve function.
* 1. changed static const __m128/256 to const __m128/256 to avoid wierd instructions and calls inserted by compiler.
2. added universal intrinsics that wrap MOVNTPS and other such (non-temporary or "no cache" store) instructions. v_store_interleave() and v_store() got respective flags/overloaded variants
3. rewrote split & merge to use the "no cache" store instructions. It resulted in dramatic performance improvement when processing big arrays
* hopefully, fixed some test failures where 4-channel v_store_interleave() is used
* added missing implementation of the new universal intrinsics (v_store_aligned_nocache() etc.)
* fixed silly typo in the new intrinsics in intrin_vsx.hpp
* still trying to fix VSX compiler errors
* still trying to fix VSX compiler errors
* still trying to fix VSX compiler errors
* still trying to fix VSX compiler errors
* fixed/updated v_load_deinterleave and v_store_interleave intrinsics; modified split() and merge() functions to use those intrinsics
* fixed a few compile errors and bug in v_load_deinterleave(ptr, v_uint32x4& a, v_uint32x4& b)
* fixed few more compile errors
* core:OE-27 prepare universal intrinsics to expand (#11022)
* core:OE-27 prepare universal intrinsics to expand (#11022)
* core: Add universal intrinsics for AVX2
* updated implementation of wide univ. intrinsics; converted several OpenCV HAL functions: sqrt, invsqrt, magnitude, phase, exp to the wide universal intrinsics.
* converted log to universal intrinsics; cleaned up the code a bit; added v_lut_deinterleave intrinsics.
* core: Add universal intrinsics for AVX2
* fixed multiple compile errors
* fixed many more compile errors and hopefully some test failures
* fixed some more compile errors
* temporarily disabled IPP to debug exp & log; hopefully fixed Doxygen complains
* fixed some more compile errors
* fixed v_store(short*, v_float16&) signatures
* trying to fix the test failures on Linux
* fixed some issues found by alalek
* restored IPP optimization after the patch with AVX wide intrinsics has been properly tested
* restored IPP optimization after the patch with AVX wide intrinsics has been properly tested
- 'if' logic is moved into templates.
- removed unnecessary cv::Mat objects creation.
- fixed inv() test (invA * A == eye)
- added more Matx tests to cover all defined template specializations
* a part of https://github.com/opencv/opencv/pull/11364 by Tetragramm. Rewritten and extended findNonZero & PSNR to support more types, not just 8u.
* fixed compile & doxygen warnings
* fixed small bug in findNonZero test
fixes handling of empty matrices in some functions (#11634)
* a part of PR #11416 by Yuki Takehara
* moved the empty mat check in Mat::copyTo()
* fixed some test failures
* make sure that the matrix with more than INT_MAX elements is marked as non-continuous, and thus all the pixel-wise functions process it correctly (i.e. row-by-row, not as a single row, where integer overflow may occur when computing the total number of elements)
* Issue 11242 intrinsics v_extract, v_rotate improvement, branch 3.4, without C++11 (remove type restrictions for SSE2, use PALIGNR on SSSE3, compile to no-op when imm is 0 or nlanes).
* fix whitespace
* Fix#11242 (NEON intrinsics v_rotate...) branch 3.4
Separate macro expansion OPENCV_HAL_IMPL_NEON_SHIFT_OP for bitwise shifts for integers, from macro expansion OPENCV_HAL_IMPL_NEON_ROTATE for lane rotations. Bitwise shifts do not apply to floats, but lane-rotations can apply to both.
* fix whitespace
* Fix#11242 compile error (VSX intrinsics v_rotate(a)) branch 3.4 no-c++11
* Fix CV_Asserts with negation of strings
{!"string"} causes some compilers to throw a warning.
The value of the string is not that important -- it's only for printing
the assertion message.
Replace these calls with:
CV_Error(Error::StsError, "string")
to suppress the warning.
* remove unnecessary 'break' after CV_Error()
* use universal intrinsic instead of raw intrinsic
* add 2 channels de-interleave on x86 platform
* add v_int32x4 version of v_muladd
* add accumulate version of v_dotprod based on the commit from seiko2plus on bf1852d
* remove some verify check in performance test
* avoid the out of boundary access and keep the performance
* remove unnecessary defines from vsx_utils
* fix v_load_expand, load lower 64bit
* use vec_ld, vec_st with alignment load/store on all types except 64bit
* map v_extract to v_rotate_right
* update license header
* enable VSX by default on clang since #11167
To avoid compilation of this code:
- buf = 0;
This code can be received after refactoring of 1D cv::Mat to cv::AutoBuffer.
- "cv_mat = 0" calls setTo().
- cv::AutoBuffer calls "allocate(0)" - this is wrong.
* loosen some test threshold mainly for integer types
* use relative error for floating points result
* avoid division by zero by following the comment
* fix the indentation
* Make <array> #ifdef true for MSVC
I think MSVC had `std::array` for quite a while (possibly going back as far as VS 2012, but it's definitely there in 2015 and 2017. So I think `_MSC_VER` `1900` is a safe bet. Probably `1800` and maybe even `1700` could work as well but I can't test that locally.
* fix test
* Update BufferReader documentation with some example code
* Add warning to BufferPool doc regarding deallocation of StackAllocator
* Added a sample code that satisfies LIFO rule for StackAllocator
- removed tr1 usage (dropped in C++17)
- moved includes of vector/map/iostream/limits into ts.hpp
- require opencv_test + anonymous namespace (added compile check)
- fixed norm() usage (must be from cvtest::norm for checks) and other conflict functions
- added missing license headers
OpenCV pthreads-based implementation changes:
- rework worker threads pool, allow to execute job by the main thread too
- rework synchronization scheme (wait for job completion, threads 'pong' answer is not required)
- allow "active wait" (spin) by worker threads and by the main thread
- use _mm_pause() during active wait (support for Hyper-Threading technology)
- use sched_yield() to avoid preemption of still working other workers
- don't use getTickCount()
- optional builtin thread pool profiler (disabled by compilation flag)
UMatData locks are not mapped on real locks (they are mapped to some "pre-initialized" pool).
Concurrent execution of these statements may lead to deadlock:
- a.copyTo(b) from thread 1
- c.copyTo(d) from thread 2
where:
- 'a' and 'd' are mapped to single lock "A".
- 'b' and 'c' are mapped to single lock "B".
Workaround is to process locks with strict order.
Adding capability to parse subsections of a byte array in Java bindings (#10489)
* Adding capability to parse subsections of a byte array in Java bindings. (Because Java lacks pointers. Therefore, reading images within a subsection of a byte array is impossible by Java's nature and limitations. Because of this, many IO functions in Java require additional parameters offset and length to define, which section of an array to be read.)
* Corrected according to the review. Previous interfaces were restored, instead internal interfaces were modified to provide subsampling of java byte arrays.
* Adding tests and test related files.
* Adding missing files for the test.
* Simplified the test
* Check was corrected according to discussion. An OutOfRangeException will be thrown instead of returning.
* java: update MatOfByte implementation checks / tests
fix the "initializing global variables with values that are not
compile-time constants" issue in Intel SDK for OpenCL. The root cause
is when initializing global variables with value, the variable need is
compile-time constants.
Thanks Zheng, Yang <yang.zheng@intel.com>,
Chodor, Jaroslaw <jaroslaw.chodor@intel.com> give a help.
Signed-off-by: Liu,Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Jun Zhao <jun.zhao@intel.com>
The opencv infrastructure mostly has the basics for supporting avx512 math functions,
but it wasn't hooked up (likely due to lack of users)
In order to compile the DNN functions for AVX512, a few things need to be hooked up
and this patch does that
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* remove raw SSE2/NEON implementation from convert.cpp
* remove raw implementation from Cvt_SIMD
* remove raw implementation from cvtScale_SIMD
* remove raw implementation from cvtScaleAbs_SIMD
* remove duplicated implementation cvt_<float, short>
* remove duplicated implementation cvtScale_<short, short, float>
* add "from double" version of Cvt_SIMD
* modify the condition of test ConvertScaleAbs
* Update convert.cpp
fixed crash in cvtScaleAbs(8s=>8u)
* fixed compile error on Win32
* fixed several test failures because of accuracy loss in cvtScale(int=>int)
* fixed NEON implementation of v_cvt_f64(int=>double) intrinsic
* another attempt to fix test failures
* keep trying to fix the test failures and just introduced compile warnings
* fixed one remaining test (subtractScalar)
- don't store ProgramSource in compiled Programs (resolved problem with "source" buffers lifetime)
- completelly remove Program::read/write methods implementation:
- replaced with method to query RAW OpenCL binary without any "custom" data
- deprecate Program::getPrefix() methods
* fixed OpenCL functions on Mac, so that the tests pass
* fixed compile warnings; temporarily disabled OCL branch of TV L1 optical flow on mac
* fixed other few warnings on macos
If there are no OpenCL/UMat methods calls from application.
OpenCL subsystem is initialized:
- haveOpenCL() is called from application
- useOpenCL() is called from application
- access to OpenCL allocator: UMat is created (empty UMat is ignored) or UMat <-> Mat conversions are called
Don't call OpenCL functions if OPENCV_OPENCL_RUNTIME=disabled
(independent from OpenCL linkage type)
* add accuracy test and performance check for matmul
* add performance tests for transform and dotProduct
* add test Core_TransformLargeTest for 8u version of transform
* remove raw SSE2/NEON implementation from matmul.cpp
* use universal intrinsic instead of raw intrinsic
* remove unused templated function
* add v_matmuladd which multiply 3x3 matrix and add 3x1 vector
* add v_rotate_left/right in universal intrinsic
* suppress intrinsic on some function and platform
* add pure SW implementation of new universal intrinsics
* add test for new universal intrinsics
* core: prevent memory access after the end of buffer
* fix perf tests
When elements are 64 bits, the vec_st_interleave()/vec_ld_deinterleave()
doesn't interleave 4 elements correctly.
For vec_st_interleave(), following is saved into mem:
a0 b0 a1 b1 c0 d0 c1 d1
-> we expected:
a0 b0 c0 d0 a1 b1 c1 d1
for vec_ld_deinterleave(), following is loaded into a b c d for memory
string { 1 2 3 4 5 6 7 8 }:
a: 1 3
b: 2 4
c: 5 7
d: 6 8
-> we expected:
a: 1 5
b: 2 6
c: 3 7
d: 4 8
This patch corrects this behavior.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
- changed behavior of vec_ctf, vec_ctu, vec_cts
in gcc and clang to make them compatible with XLC
- implemented most of missing conversion intrinsics in gcc and clang
- implemented conversions intrinsics of odd-numbered elements
- ignored gcc bug warning that caused by -Wunused-but-set-variable in rare cases
- replaced right shift with algebraic right shift for signed vectors
to shift in the sign bit.
- added new universal intrinsics v_matmuladd, v_rotate_left/right
- avoid using floating multiply-add in RNG
Exampls of these are gnu/kfreebsd and gnu/hurd, both available as
unofficial Debian ports.
They don't define __linux__ (as they are non-linux…) but still define
__GLIBC__, so check on that.
Signed-off-by: Mattia Rizzolo <mattia@mapreri.org>
* Update OpenCVCompilerOptimizations.cmake
Neon not supported on MSVC ARM breaking build fix
* Update OpenCVCompilerOptimizations.cmake
Whitespace
* Update intrin.hpp
Many problems in MSVC ARM builds (at least on VS2017) being fixed in this PR now.
C:\Users\Gregory\DOCUME~1\MYLIBR~1\OPENCV~3\opencv\sources\modules\core\include\opencv2/core/hal/intrin.hpp(444): error C3861: '_tzcnt_u32': identifier not found
* Update hal_replacement.hpp
Passing variadic expansion in a macro to another macro does not work properly in MSVC and a famous known workaround is hereby applied. Discussion of it: https://stackoverflow.com/questions/5134523/msvc-doesnt-expand-va-args-correctly
Only needed the fix for ARM builds: TEGRA_ macros are used for cv_hal_ functions in the carotene library.
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\core\src\arithm.cpp(2378): warning C4003: not enough actual parameters for macro 'TEGRA_ADD'
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\core\src\arithm.cpp(2378): error C2143: syntax error: missing ')' before ','
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\core\src\arithm.cpp(2378): error C2059: syntax error: ')'
* Update hal_replacement.hpp
All hal_replacement's using carotene\hal\tegra_hal.hpp TEGRA_ functions as macros preprocessed by variadic macros should be changed, identical as was done in core.
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\imgproc\src\color.cpp(9604): warning C4003: not enough actual parameters for macro 'TEGRA_CVTBGRTOBGR'
C:\Users\Gregory\Documents\My Libraries\opencv330\opencv\sources\modules\imgproc\src\color.cpp(9604): error C2059: syntax error: '=='
* Update OpenCVCompilerOptimizations.cmake
* Update hal_replacement.hpp
* Update hal_replacement.hpp
The original template based mat ptr for indexing is not implemented,
add the similar implementation as uchar type, but cast to
user-defined type from the uchar pointer.
The same code was repeated several time for different data types, so
it was extracted as a templated function to improve maintability and
make a code more clear.
Exception may be rasied inside the body of a copying constructor after
refcount has been increased, and beacause in the case of the exception
destrcutor is never called what causes memory leak. This commit adds a
workaround that calls the release() function before the exception is
thrown outside the contructor.
Adds fitEllipseDirect to imgproc: The Direct least square (Direct) method by Fitzgibbon1999.
New Tests are included for the methods.
fitEllipseAMS Tests
fitEllipseDirect Tests
Comparative examples are added to fitEllipse.cpp in Samples.
add libdnn acceleration to dnn module (#9114)
* import libdnn code
Signed-off-by: Li Peng <peng.li@intel.com>
* add convolution layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add pooling layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add softmax layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add lrn layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add innerproduct layer ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* add HAVE_OPENCL macro
Signed-off-by: Li Peng <peng.li@intel.com>
* fix for convolution ocl
Signed-off-by: Li Peng <peng.li@intel.com>
* enable getUMat() for multi-dimension Mat
Signed-off-by: Li Peng <peng.li@intel.com>
* use getUMat for ocl acceleration
Signed-off-by: Li Peng <peng.li@intel.com>
* use CV_OCL_RUN macro
Signed-off-by: Li Peng <peng.li@intel.com>
* set OPENCL target when it is available
and disable fuseLayer for OCL target for the time being
Signed-off-by: Li Peng <peng.li@intel.com>
* fix innerproduct accuracy test
Signed-off-by: Li Peng <peng.li@intel.com>
* remove trailing space
Signed-off-by: Li Peng <peng.li@intel.com>
* Fixed tensorflow demo bug.
Root cause is that tensorflow has different algorithm with libdnn
to calculate convolution output dimension.
libdnn don't calculate output dimension anymore and just use one
passed in by config.
* split gemm ocl file
split it into gemm_buffer.cl and gemm_image.cl
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix compile failure
Signed-off-by: Li Peng <peng.li@intel.com>
* check env flag for auto tuning
Signed-off-by: Li Peng <peng.li@intel.com>
* switch to new ocl kernels for softmax layer
Signed-off-by: Li Peng <peng.li@intel.com>
* update softmax layer
on some platform subgroup extension may not work well,
fallback to non subgroup ocl acceleration.
Signed-off-by: Li Peng <peng.li@intel.com>
* fallback to cpu path for fc layer with multi output
Signed-off-by: Li Peng <peng.li@intel.com>
* update output message
Signed-off-by: Li Peng <peng.li@intel.com>
* update fully connected layer
fallback to gemm API if libdnn return false
Signed-off-by: Li Peng <peng.li@intel.com>
* Add ReLU OCL implementation
* disable layer fusion for now
Signed-off-by: Li Peng <peng.li@intel.com>
* Add OCL implementation for concat layer
Signed-off-by: Wu Zhiwen <zhiwen.wu@intel.com>
* libdnn: update license and copyrights
Also refine libdnn coding style
Signed-off-by: Wu Zhiwen <zhiwen.wu@intel.com>
Signed-off-by: Li Peng <peng.li@intel.com>
* DNN: Don't link OpenCL library explicitly
* DNN: Make default preferableTarget to DNN_TARGET_CPU
User should set it to DNN_TARGET_OPENCL explicitly if want to
use OpenCL acceleration.
Also don't fusion when using DNN_TARGET_OPENCL
* DNN: refine coding style
* Add getOpenCLErrorString
* DNN: Use int32_t/uint32_t instread of alias
* Use namespace ocl4dnn to include libdnn things
* remove extra copyTo in softmax ocl path
Signed-off-by: Li Peng <peng.li@intel.com>
* update ReLU layer ocl path
Signed-off-by: Li Peng <peng.li@intel.com>
* Add prefer target property for layer class
It is used to indicate the target for layer forwarding,
either the default CPU target or OCL target.
Signed-off-by: Li Peng <peng.li@intel.com>
* Add cl_event based timer for cv::ocl
* Rename libdnn to ocl4dnn
Signed-off-by: Li Peng <peng.li@intel.com>
Signed-off-by: wzw <zhiwen.wu@intel.com>
* use UMat for ocl4dnn internal buffer
Remove allocateMemory which use clCreateBuffer directly
Signed-off-by: Li Peng <peng.li@intel.com>
Signed-off-by: wzw <zhiwen.wu@intel.com>
* enable buffer gemm in ocl4dnn innerproduct
Signed-off-by: Li Peng <peng.li@intel.com>
* replace int_tp globally for ocl4dnn kernels.
Signed-off-by: wzw <zhiwen.wu@intel.com>
Signed-off-by: Li Peng <peng.li@intel.com>
* create UMat for layer params
Signed-off-by: Li Peng <peng.li@intel.com>
* update sign ocl kernel
Signed-off-by: Li Peng <peng.li@intel.com>
* update image based gemm of inner product layer
Signed-off-by: Li Peng <peng.li@intel.com>
* remove buffer gemm of inner product layer
call cv::gemm API instead
Signed-off-by: Li Peng <peng.li@intel.com>
* change ocl4dnn forward parameter to UMat
Signed-off-by: Li Peng <peng.li@intel.com>
* Refine auto-tuning mechanism.
- Use OPENCV_OCL4DNN_KERNEL_CONFIG_PATH to set cache directory
for fine-tuned kernel configuration.
e.g. export OPENCV_OCL4DNN_KERNEL_CONFIG_PATH=/home/tmp,
the cache directory will be /home/tmp/spatialkernels/ on Linux.
- Define environment OPENCV_OCL4DNN_ENABLE_AUTO_TUNING to enable
auto-tuning.
- OPENCV_OPENCL_ENABLE_PROFILING is only used to enable profiling
for OpenCL command queue. This fix basic kernel get wrong running
time, i.e. 0ms.
- If creating cache directory failed, disable auto-tuning.
* Detect and create cache dir on windows
Signed-off-by: Li Peng <peng.li@intel.com>
* Refine gemm like convolution kernel.
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix redundant swizzleWeights calling when use cached kernel config.
* Fix "out of resource" bug when auto-tuning too many kernels.
* replace cl_mem with UMat in ocl4dnnConvSpatial class
* OCL4DNN: reduce the tuning kernel candidate.
This patch could reduce 75% of the tuning candidates with less
than 2% performance impact for the final result.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
* replace cl_mem with umat in ocl4dnn convolution
Signed-off-by: Li Peng <peng.li@intel.com>
* remove weight_image_ of ocl4dnn inner product
Actually it is unused in the computation
Signed-off-by: Li Peng <peng.li@intel.com>
* Various fixes for ocl4dnn
1. OCL_PERFORMANCE_CHECK(ocl::Device::getDefault().isIntel())
2. Ptr<OCL4DNNInnerProduct<float> > innerProductOp
3. Code comments cleanup
4. ignore check on OCL cpu device
Signed-off-by: Li Peng <peng.li@intel.com>
* add build option for log softmax
Signed-off-by: Li Peng <peng.li@intel.com>
* remove unused ocl kernels in ocl4dnn
Signed-off-by: Li Peng <peng.li@intel.com>
* replace ocl4dnnSet with opencv setTo
Signed-off-by: Li Peng <peng.li@intel.com>
* replace ALIGN with cv::alignSize
Signed-off-by: Li Peng <peng.li@intel.com>
* check kernel build options
Signed-off-by: Li Peng <peng.li@intel.com>
* Handle program compilation fail properly.
* Use std::numeric_limits<float>::infinity() for large float number
* check ocl4dnn kernel compilation result
Signed-off-by: Li Peng <peng.li@intel.com>
* remove unused ctx_id
Signed-off-by: Li Peng <peng.li@intel.com>
* change clEnqueueNDRangeKernel to kernel.run()
Signed-off-by: Li Peng <peng.li@intel.com>
* change cl_mem to UMat in image based gemm
Signed-off-by: Li Peng <peng.li@intel.com>
* check intel subgroup support for lrn and pooling layer
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix convolution bug if group is greater than 1
Signed-off-by: Li Peng <peng.li@intel.com>
* Set default layer preferableTarget to be DNN_TARGET_CPU
Signed-off-by: Li Peng <peng.li@intel.com>
* Add ocl perf test for convolution
Signed-off-by: Li Peng <peng.li@intel.com>
* Add more ocl accuracy test
Signed-off-by: Li Peng <peng.li@intel.com>
* replace cl_image with ocl::Image2D
Signed-off-by: Li Peng <peng.li@intel.com>
* Fix build failure in elementwise layer
Signed-off-by: Li Peng <peng.li@intel.com>
* use getUMat() to get blob data
Signed-off-by: Li Peng <peng.li@intel.com>
* replace cl_mem handle with ocl::KernelArg
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(build): don't use C++11, OPENCL_LIBRARIES fix
* dnn(ocl4dnn): remove unused OpenCL kernels
* dnn(ocl4dnn): extract OpenCL code into .cl files
* dnn(ocl4dnn): refine auto-tuning
Defaultly disable auto-tuning, set OPENCV_OCL4DNN_ENABLE_AUTO_TUNING
environment variable to enable it.
Use a set of pre-tuned configs as default config if auto-tuning is disabled.
These configs are tuned for Intel GPU with 48/72 EUs, and for googlenet,
AlexNet, ResNet-50
If default config is not suitable, use the first available kernel config
from the candidates. Candidate priority from high to low is gemm like kernel,
IDLF kernel, basick kernel.
* dnn(ocl4dnn): pooling doesn't use OpenCL subgroups
* dnn(ocl4dnn): fix perf test
OpenCV has default 3sec time limit for each performance test.
Warmup OpenCL backend outside of perf measurement loop.
* use ocl::KernelArg as much as possible
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): fix bias bug for gemm like kernel
* dnn(ocl4dnn): wrap cl_mem into UMat
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): Refine signature of kernel config
- Use more readable string as signture of kernel config
- Don't count device name and vendor in signature string
- Default kernel configurations are tuned for Intel GPU with
24/48/72 EUs, and for googlenet, AlexNet, ResNet-50 net model.
* dnn(ocl4dnn): swap width/height in configuration
* dnn(ocl4dnn): enable configs for Intel OpenCL runtime only
* core: make configuration helper functions accessible from non-core modules
* dnn(ocl4dnn): update kernel auto-tuning behavior
Avoid unwanted creation of directories
* dnn(ocl4dnn): simplify kernel to workaround OpenCL compiler crash
* dnn(ocl4dnn): remove redundant code
* dnn(ocl4dnn): Add more clear message for simd size dismatch.
* dnn(ocl4dnn): add const to const argument
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): force compiler use a specific SIMD size for IDLF kernel
* dnn(ocl4dnn): drop unused tuneLocalSize()
* dnn(ocl4dnn): specify OpenCL queue for Timer and convolve() method
* dnn(ocl4dnn): sanitize file names used for cache
* dnn(perf): enable Network tests with OpenCL
* dnn(ocl4dnn/conv): drop computeGlobalSize()
* dnn(ocl4dnn/conv): drop unused fields
* dnn(ocl4dnn/conv): simplify ctor
* dnn(ocl4dnn/conv): refactor kernelConfig localSize=NULL
* dnn(ocl4dnn/conv): drop unsupported double / untested half types
* dnn(ocl4dnn/conv): drop unused variable
* dnn(ocl4dnn/conv): alignSize/divUp
* dnn(ocl4dnn/conv): use enum values
* dnn(ocl4dnn): drop unused innerproduct variable
Signed-off-by: Li Peng <peng.li@intel.com>
* dnn(ocl4dnn): add an generic function to check cl option support
* dnn(ocl4dnn): run softmax subgroup version kernel first
Signed-off-by: Li Peng <peng.li@intel.com>