Improve performance on Arm64
* Improve performance on Apple silicon
This patch will
- Enable dot product intrinsics for macOS arm64 builds
- Enable for macOS arm64 builds
- Improve HAL primitives
- reduction (sum, min, max, sad)
- signmask
- mul_expand
- check_any / check_all
Results on a M1 Macbook Pro
* Updates to #20011 based on feedback
- Removes Apple Silicon specific workarounds
- Makes #ifdef sections smaller for v_mul_expand cases
- Moves dot product optimization to compiler optimization check
- Adds 4x4 matrix transpose optimization
* Remove dotprod and fix v_transpose
Based on the latest, we've removed dotprod entirely and will revisit in a future PR.
Added explicit cats with v_transpose4x4()
This should resolve all opens with this PR
* Remove commented out lines
Remove two extraneous comments
* Adding functions rbegin() and rend() functions to matrix class.
This is important to be more standard compliant with C++ and an ever increasing number of people using standard algorithms for better code readability- and maintainability.
The functions are copy pated from their counterparts (even though they should probably call the counterparts but this gave me some troube).
They return iterators using std::reverse_iterators
Follow up of an open feature request:
https://github.com/opencv/opencv/issues/4641
* Fix rbegin() and rend() and provide tests for them
* Removing unnecessary whitespaces
* Adding rbegin and rend to Mat_ class with the right parameters so we don't need to repeat the template argument.
An instantiating cv::Mat_<int> for example can call it's rbegin() function and doesn't need rbegin<int>() with this convience addition.
Follows what is done for forward iterators
* static cast the vector size (return size_t) to an int (that is required for opencv mat constructor)
Co-authored-by: Stefan <stefan.gerl@tum.de>
* implements https://github.com/opencv/opencv/issues/19147
* CAUTION: this PR will only functions safely in the
4+ branches that already include PR 19029
* CAUTION: this PR requires thread-safe startup of the alloc.cpp
translation unit as implemented in PR 19029
Ordinary quaternion
* version 1.0
* add assumeUnit;
add UnitTest;
check boundary value;
fix the func using method: func(obj);
fix 4x4;
add rodrigues vector transformation;
fix mat to quat;
* fix blank and tab
* fix blank and tab
modify test;cpp to hpp
* mainly improve comment;
add rvec2Quat;fix toRodrigues;
fix throw to CV_Error
* fix bug of quatd * int;
combine hpp and cpp;
fix << overload error in win system;
modify include in test file;
* move implementation to quaternion.ini.hpp;
change some constructor to createFrom* function;
change Rodrigues vector to rotation vector;
change the matexpr to mat of 3x3 return type;
improve comments;
* try fix log function error in win
* add enums for assumeUnit;
improve docs;
add using std::cos funcs
* remove using std::* from header;
add std::* in affine.hpp,warpers_inl.hpp;
* quat: coding style
* quat: AssumeType => QuatAssumeType
- OpenCL kernel cleanup processing is asynchronous and can be called even after forced clFinish()
- buffers are released later in asynchronous mode
- silence these false positive cases for asynchronous cleanup
* add eigen tensor conversion functions
* add eigen tensor conversion tests
* add support for column major order
* update eigen tensor tests
* fix coding style and add conditional compilation
* fix conditional compilation checks
* remove whitespace
* rearrange functions for easier reading
* reformat function documentation and add tensormap unit test
* cleanup documentation of unit test
* remove condition duplication
* check Eigen major version, not minor version
* restrict to Eigen v3.3.0+
* add documentation note and add type checking to cv2eigen_tensormap()
trying to fix handling file storages with extremely long lines
* trying to fix handling of file storages with extremely long lines: https://github.com/opencv/opencv/issues/11061
* * fixed errorneous pointer access in JSON parser.
* it's now crash-test time! temporarily set the initial parser buffer size to just 40 bytes. let's run all the test and check if the buffer is always correctly resized and handled
* fixed pointer use in JSON parser; added the proper test to catch this case
* fixed the test to make it more challenging. generate test json with
*
**
***
etc. shape
Vectorize minMaxIdx functions
* Updated documentation and intrinsic tests for v_reduce
* Add other files back in from the forced push
* Prevent an constant overflow with v_reduce for int8 type
* Another alternative to fix constant overflow warning.
* Fix another compiler warning.
* Update comments and change comparison form to be consistent with other vectorized loops.
* Change return type of v_reduce_min & max for v_uint8 and v_uint16 to be same as lane type.
* Cast v_reduce functions to int to avoid overflow. Reduce number of parameters in MINMAXIDX_REDUCE macro.
* Restore cast type for v_reduce_min & max to LaneType
* add cv::compare test when Mat type == CV_16F
* add assertion in cv::compare when src.depth() == CV_16F
* cv::compare assertion minor fix
* core: add more checks
Add checks for empty operands in Matrix expressions that don't check properly
* Starting to add checks for empty operands in Matrix expressions that
don't check properly.
* Adding checks and delcarations for checker functions
* Fix signatures and add checks for each class of Matrix Expr operation
* Make it catch the right exception
* Don't expose helper functions to public API
Improving VSX performance of integral function
* Adding support for vector get function on VSX datatypes so the
integral function gains a bit of performance.
* Removing get as a datatype member function and implementing a new HAL
instruction v_extract_n to get the n-th element of a vector register.
* Adding SSE/NEON/AVX intrinsics.
* Implement new HAL instruction v_broadcast_element on VSX/AVX/NEON/SSE.
* core(simd): add tests for v_extract_n/v_broadcast_element
- updated docs
- commented out code to repair compilation
- added WASM and MSA default implementations
* core(simd): fix compilation
- x86: avoid _mm256_extract_epi64/32/16/8 with MSVS 2015
- x86: _mm_extract_epi64 is 64-bit only
* cleanup
- move TLS & instrumentation code out of core/utility.hpp
- (*) TLSData lost .gather() method (to dispose thread data on thread termination)
- use TLSDataAccumulator for reliable collecting of thread data
- prefer using of .detachData() + .cleanupDetachedData() instead of .gather() method
(*) API is broken: replace TLSData => TLSDataAccumulator if gather required
(objects disposal on threads termination is not available in accumulator mode)
Fixing bug with comparison of v_int64x2 or v_uint64x2
* Casting v_uint64x2 to v_float64x2 and comparing does NOT work in all cases. Rewrite using epi64 instructions - faster too.
* Fix bad merge.
* Fix equal comparsion for non-SSE4.1. Add test cases for v_int64x2 comparisons.
* Try to fix merge conflict.
* Only test v_int64x2 comparisons if CV_SIMD_64F
* Fix compiler warning.