* Convert moments in tile algorithms to HAL (1.3x faster for VSX).
* Adding NEON code back in for non 64-bit platforms.
* Remove floats from post processing.
Clarify stereoRectify() doc
The function stereoRectify() takes as input a coordinate transform between two cameras. It is ambiguous how it goes. I clarified that it goes from the second camera to the first.
* Use FlsAlloc/FlsFree/FlsGetValue/FlsSetValue instead of TlsAlloc/TlsFree/TlsGetValue/TlsSetValue to implment TLS value cleanup when thread has been terminated on Windows Vista and above
* Fix 32-bit build
* Fixed calling convention of cleanup callback
* WINAPI changed to NTAPI
* Use proper guard macro
* Vectorize flipHoriz and flipVert functions.
* Change v_load_mirror_1 to use vec_revb for VSX
* Only use vec_revb in ISA3.0
* Removing vec_revb code since some of the older compilers don't fully support it.
* Use new v_reverse intrinsic and cleanup code.
* Ensure there are no alignment issues with copies
* Doc bugfix
The documentation page StereoBinaryBM and StereoBinarySGBM says that it returns a disparity that is scaled multiplied by 16. This scaling must be undone before calling reprojectImageTo3D, otherwise the results are wrong. The function reprojectImageTo3D() could do this scaling internally, maybe, but at least the documentation must explain that this has to be done.
* calib3d: update reprojectImageTo3D documentation
* calib3d: add StereoBM/StereoSGBM into notes list
- move TLS & instrumentation code out of core/utility.hpp
- (*) TLSData lost .gather() method (to dispose thread data on thread termination)
- use TLSDataAccumulator for reliable collecting of thread data
- prefer using of .detachData() + .cleanupDetachedData() instead of .gather() method
(*) API is broken: replace TLSData => TLSDataAccumulator if gather required
(objects disposal on threads termination is not available in accumulator mode)
Fixing bug with comparison of v_int64x2 or v_uint64x2
* Casting v_uint64x2 to v_float64x2 and comparing does NOT work in all cases. Rewrite using epi64 instructions - faster too.
* Fix bad merge.
* Fix equal comparsion for non-SSE4.1. Add test cases for v_int64x2 comparisons.
* Try to fix merge conflict.
* Only test v_int64x2 comparisons if CV_SIMD_64F
* Fix compiler warning.
* New v_reverse HAL intrinsic for reversing the ordering of a vector
* Fix conflict.
* Try to resolve conflict again.
* Try one more time.
* Add _MM_SHUFFLE. Remove non-vectorize code in SSE2. Fix copy and paste issue with NEON.
* Change v_uint16x8 SSE2 version to use shuffles
* Adding support for vectorized masking for uchar/ushort.
* Fixing bug where mask was zeroing the dst. Improved the way to calculate
the mask and tweaked for further performance improvements.
* Fixing mask comparison test.
* Restricting to one channel.
* Adding support for 3 channels, switch old approach to start using HAL's
v_select.
* Cuda + OpenGL on ARM
There might be multiple ways of getting OpenCV compile on Tegra (NVIDIA Jetson) platform, but mainly they modify CUDA(8,9,10...) source code, this one fixes it for all installations.
( https://devtalk.nvidia.com/default/topic/1007290/jetson-tx2/building-opencv-with-opengl-support-/post/5141945/#5141945 et al.).
This way is exactly the same as the one proposed but the code change happens in OpenCV.
* Updated,
The link provided mentions: cuda8 + 9, I have cuda 10 + 10.1 (and can confirm it is still defined this way).
NVIDIA is probably using some other "secret" backend with Jetson.
* core: rework and optimize SIMD implementation of dotProd
- add new universal intrinsics v_dotprod[int32], v_dotprod_expand[u&int8, u&int16, int32], v_cvt_f64(int64)
- add a boolean param for all v_dotprod&_expand intrinsics that change the behavior of addition order between
pairs in some platforms in order to reach the maximum optimization when the sum among all lanes is what only matters
- fix clang build on ppc64le
- support wide universal intrinsics for dotProd_32s
- remove raw SIMD and activate universal intrinsics for dotProd_8
- implement SIMD optimization for dotProd_s16&u16
- extend performance test data types of dotprod
- fix GCC VSX workaround of vec_mule and vec_mulo (in little-endian it must be swapped)
- optimize v_mul_expand(int32) on VSX
* core: remove boolean param from v_dotprod&_expand and implement v_dotprod_fast&v_dotprod_expand_fast
this changes made depend on "terfendail" review
- renamed Cascade Lake AVX512_CEL => AVX512_CLX (align with Intel SDE tool)
- fixed CLX instruction sets (no IFMA/VBMI)
- added flag to bypass CPU baseline check: OPENCV_SKIP_CPU_BASELINE_CHECK
[GSoC 2019] Improve the performance of JavaScript version of OpenCV (OpenCV.js)
* [GSoC 2019]
Improve the performance of JavaScript version of OpenCV (OpenCV.js):
1. Create the base of OpenCV.js performance test:
This perf test is based on benchmark.js(https://benchmarkjs.com). And first add `cvtColor`, `Resize`, `Threshold` into it.
2. Optimize the OpenCV.js performance by WASM threads:
This optimization is based on Web Worker API and SharedArrayBuffer, so it can be only used in browser.
3. Optimize the OpenCV.js performance by WASM SIMD:
Add WASM SIMD backend for OpenCV Universal Intrinsics. It's experimental as WASM SIMD is still in development.
* [GSoC2019]
1. use short license header
2. fix documentation node issue
3. remove the unused `hasSIMD128()` api
* [GSoC2019]
1. fix emscripten define
2. use fallback function for f16
* [GSoC2019]
Fix rebase issue
* Added MSA implementations for mips platforms. Intrinsics for MSA and build scripts for MIPS platforms are added.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Removed some unused code in mips.toolchain.cmake.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Added comments for mips toolchain configuration and disabled compiling warnings for libpng.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Fixed the build error of unsupported opcode 'pause' when mips isa_rev is less than 2.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Removed FP16 related item in MSA option defines in OpenCVCompilerOptimizations.cmake.
2. Use CV_CPU_COMPILE_MSA instead of __mips_msa for MSA feature check in cv_cpu_dispatch.h.
3. Removed hasSIMD128() in intrin_msa.hpp.
4. Define CPU_MSA as 150.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Removed unnecessary CV_SIMD128_64F guarding in intrin_msa.hpp.
2. Removed unnecessary CV_MSA related code block in dotProd_8u().
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Defined CPU_MSA_FLAGS_ON as "-mmsa".
2. Removed CV_SIMD128_64F guardings in intrin_msa.hpp.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Removed unused msa_mlal_u16() and msa_mlal_s16 from msa_macros.h.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* issue 5769 fixed: cv::stereoRectify fails if given inliers mask of type vector<uchar>
* issue5769 fix using reshape and add regression test
* regression test with outlier detection, testing vector and mat data
* Size comparision of wrong vector within CV_Assert in regression test corrected
* cleanup test code
ISA 2.07 (aka POWER8) effectively extended the expanding multiply
operation to word types. The altivec intrinsics prior to gcc 8 did
not get the update.
Workaround this deficiency similar to other fixes.
This was exposed by commit 33fb253a66
which leverages the int -> dword expanding multiply.
This fixes Issue #15506
* Adding all possible data type interactions to the perf tests since some
use SIMD acceleration and others do not.
* Disabling full tests by default.
* Giving proper names, removing magic numbers and sanity checks of new
performance tests for the integral function.
* Giving proper names, making array static.
* Convert ImgWarp from SSE SIMD to HAL - 2.8x faster on Power (VSX) and 15% speedup on x86
* Change compile flag from CV_SIMD128 to CV_SIMD128_64F for use of v_float64x2 type
* Changing WarpPerspectiveLine from class functions and dispatching to static functions.
* Re-add dynamic runtime and dispatch execution.
* RRestore SSE4_1 optimizations inside opt_SSE4_1 namespace
* Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX).
* Replace stores with reduce_sum. Rework to handle endianess correctly.
* Fix compiler warnings by casting values explicitly to shorts
* Switch to CV_SIMD128 compiler definition. Unroll loop to 8 elements since we've already loaded the data.
Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
x86 this showed about 1.4x improvement.
For PPC, do a full multiply (32x32->64b), convert to DP
then accumulate. This may be slightly less precise for
some inputs. But is 1.5x faster than the above which
is about 1.5x than the FMA above for ~2.5x speedup.
Convert HOG from SSE SIMD to HAL - 35-45% faster on Power (VSX) (#15199)
* Convert SSE SIMD to HAL. 35-45% improvement for Power (VSX)
* Remove CV_NEON code. Use v_floor instead of 3 lines of code.
* Invert comparison logic to simplify code.
* Change initialization from v_load to constructor type.