Speed up line merging in INTER_AREA #24412
This provides a 10 to 20% speed-up.
Related perf test fix: https://github.com/opencv/opencv/pull/24417
This is split out of https://github.com/opencv/opencv/pull/23525, which will be updated to deal only with column merging.
### Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
Rewrite Universal Intrinsic code by using new API: ImgProc module. #24058
The goal of this series of PRs is to rewrite the SIMD code blocks guarded by the CV_SIMD macro in the `opencv/modules/imgproc` folder using the new Universal Intrinsic API.
For easier review, this PR includes only part of the rewritten code; the rest will follow in the next PR (coming soon). I tested this patch on RVV (QEMU) and AVX devices, and `opencv_test_imgproc` passes.
The patch is partially auto-generated using the [rewriter](https://github.com/hanliutong/rewriter); related PRs: https://github.com/opencv/opencv/pull/23885 and https://github.com/opencv/opencv/pull/23980.
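For context, a minimal sketch of the kind of mechanical change this rewrite involves. The function and buffer names below are made up for illustration and are not from the patch; roughly, fixed-width vector types and overloaded operators give way to size-agnostic types, `VTraits` lane counts, and named operations so the same code also builds on scalable backends such as RVV.

```cpp
#include "opencv2/core/hal/intrin.hpp"
using namespace cv;

// Hypothetical helper, not from the patch: scale a float row in place.
static void scaleRow(float* row, int width, float alpha)
{
#if (CV_SIMD || CV_SIMD_SCALABLE)
    const int vlanes = VTraits<v_float32>::vlanes();  // replaces fixed v_float32x4::nlanes
    const v_float32 v_alpha = vx_setall_f32(alpha);
    int x = 0;
    for (; x + vlanes <= width; x += vlanes)
    {
        v_float32 v = vx_load(row + x);
        v_store(row + x, v_mul(v, v_alpha));          // named ops instead of overloaded operators
    }
    for (; x < width; x++)                            // scalar tail
        row[x] *= alpha;
#else
    for (int x = 0; x < width; x++)
        row[x] *= alpha;
#endif
}
```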
### Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
- [ ] I agree to contribute to the project under Apache 2 License.
- [ ] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
Fix even input dimensions for INTER_NEAREST_EXACT #23634
### Pull Request Readiness Checklist
resolves https://github.com/opencv/opencv/issues/22204
related: https://github.com/opencv/opencv/issues/9096#issuecomment-1551306017
/cc @Yosshi999
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
* Replaced most remaining sprintf with snprintf
* Deprecated encodeFormat and introduced new method that takes the buffer length
* Also increased the buffer size at call sites, in case int is 64-bit
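The pattern behind the first two items is the usual bounded-formatting change; a minimal sketch (the helper below is illustrative only and is not the new encodeFormat overload itself):

```cpp
#include <cstdio>

// Illustrative helper only; the real change touches encodeFormat and its call sites.
static void formatInt(char* buf, std::size_t bufLen, int value)
{
    // Before: sprintf(buf, "%d", value);       // no bound, can overrun buf
    std::snprintf(buf, bufLen, "%d", value);    // bounded: never writes past bufLen bytes
}
```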
Bit-exact Nearest Neighbor Resizing
* bit exact resizeNN
* change the value of method enum
* add bitexact-nn to ResizeExactTest
* test to compare with non-exact version
* add perf for bit-exact resizenn
* use cvFloor-equivalent
* 1/3 scaling is not stable in floating-point calculation
* stricter test
* bugfix: broken data in case of 6 or 12bytes elements
* bugfix: broken data in default pix_size
* stricter threshold
* use raw() for floor
* use double instead of int
* follow code reviews
* fewer cases in perf test
* center pixel convention
* Fix NN resize with dimensions > 4
* add test check for nn resize with channels > 4
* Change types from float to double
* Del unnecessary test file. Move nn test to test_imgwarp. Add 5 channels test only.
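For reference, a scalar sketch of the center-pixel convention the bit-exact path is meant to reproduce. The real implementation does this with integer/fixed-point arithmetic so results are deterministic across platforms; the rounding below is a simplified reference, not the production formula.

```cpp
#include <cmath>
#include <algorithm>

// Map a destination column to a source column using the pixel-center convention.
// Simplified reference only; the bit-exact code avoids this floating-point round trip.
static int nearestSrcIndex(int dx, double scale, int srcSize)
{
    int sx = (int)std::floor((dx + 0.5) * scale);   // destination pixel center into source space
    return std::min(std::max(sx, 0), srcSize - 1);  // clamp to the valid source range
}
```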
Actually, we can do this in constant time. xofs always contains the same or increasing offset values, so we can instead find the most extreme value used and never attempt to load it. Similarly, we can note that for all dx >= 0 and dx < (dwidth - cn), xofs[dx] + cn < xofs[dwidth - cn] implies dx < (dwidth - cn). Thus, we can use this to control our loop termination optimally.
This fixes #16137 with little or no performance impact. I have also added a debug check as a sanity check.
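A sketch of how such a bound can be derived once, before the loop, from the non-decreasing xofs table. The names follow the resize code, but the byte arithmetic is simplified and illustrative, not the exact expression used in the patch.

```cpp
// Because xofs[] is non-decreasing, only the last few destination columns can reference
// source offsets near the end of the row. Compute the vector-loop stop once instead of
// range-checking every iteration.
static int vectorLoopStop(const int* xofs, int dwidth, int cn,
                          int srcRowBytes, int vecLoadBytes)
{
    int stop = dwidth;              // exclusive bound for the vector loop
    while (stop >= cn && xofs[stop - cn] + vecLoadBytes > srcRowBytes)
        stop -= cn;                 // back off one pixel (cn elements) at a time
    return stop;                    // the scalar tail handles dx in [stop, dwidth)
}
```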
* imgproc: Prevent 1B overrun of 8C3 SIMD optimization
The fourth value read via v_load_q is essentially ignored,
but can cause trouble if it happens to cross page boundaries.
The final few iterations may attempt to read the most extreme elements of S, which will read 1B beyond the array in most alignment cases. Dynamically compute the stop. This could be hoisted from the loop, but that would require a more extensive change.
Likewise, clean up the iteration increment statements to make it more obvious that they advance by the channel count (3) elements per pass.
This should resolve #16137
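A scalar model of the hazard and of the in-loop stop condition. Variable names are illustrative; a 4-byte memcpy stands in for the SIMD quad load described above.

```cpp
#include <cstring>
#include <cstdint>

// The SIMD path loads 4 bytes per 3-channel pixel and discards the 4th, so without a
// dynamically computed stop the last pixel of a row can read 1 byte past the source.
static void copyRow8UC3(const uint8_t* S, int srcRowBytes,
                        uint8_t* D, const int* xofs, int dwidthElems)
{
    int dx = 0;
    // "Vector" body: only while a 4-byte read at xofs[dx] stays inside the row.
    for (; dx < dwidthElems && xofs[dx] + 4 <= srcRowBytes; dx += 3)
    {
        uint32_t quad;
        std::memcpy(&quad, S + xofs[dx], 4);   // 4 bytes read, 3 used
        std::memcpy(D + dx, &quad, 3);
    }
    for (; dx < dwidthElems; dx += 3)          // tail: copy exactly 3 bytes per pixel
        std::memcpy(D + dx, S + xofs[dx], 3);
}
```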
* imgproc(resize): extra check
* resize: HResizeLinear reduce duplicate work
There appears to be a 2x unroll of HResizeLinear against k; however, the k value is only incremented by 1 during the unroll. This results in k - 1 duplicate passes when k > 1.
Likewise, the final pass may not respect the work done by the vector
loop. Start it with the offset returned by the vector op if
implemented. Note, no vector ops are implemented today.
The performance is most noticeable on a linear downscale. A set of performance tests is added to characterize this. The performance improvement is 10-50% depending on the scaling.
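A reduced scalar sketch of the corrected loop structure (simplified from the real HResizeLinear: types and the interpolation are pared down, and dx0 stands in for the column count the vector kernel reports as already handled, 0 when no vector kernel is implemented):

```cpp
// Two rows are interpolated per pass, so k advances by 2; the scalar columns resume at
// dx0, where the (optional) vector kernel stopped. Illustrative only.
static void hresizeLinearSketch(const float** src, float** dst, int count,
                                const int* xofs, const float* alpha,
                                int dwidth, int cn, int dx0)
{
    int k = 0;
    for (; k <= count - 2; k += 2)             // advance by 2: each pass handles two rows
    {
        const float *S0 = src[k], *S1 = src[k + 1];
        float *D0 = dst[k], *D1 = dst[k + 1];
        for (int dx = dx0; dx < dwidth; dx++)  // resume where the vector kernel stopped
        {
            int sx = xofs[dx];
            float a0 = alpha[dx * 2], a1 = alpha[dx * 2 + 1];
            D0[dx] = S0[sx] * a0 + S0[sx + cn] * a1;
            D1[dx] = S1[sx] * a0 + S1[sx + cn] * a1;
        }
    }
    for (; k < count; k++)                     // odd leftover row
    {
        const float* S = src[k];
        float* D = dst[k];
        for (int dx = dx0; dx < dwidth; dx++)
        {
            int sx = xofs[dx];
            D[dx] = S[sx] * alpha[dx * 2] + S[sx + cn] * alpha[dx * 2 + 1];
        }
    }
}
```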
* imgproc: vectorize HResizeLinear
Performance is mostly gated by the gather operations
for x inputs.
Likewise, provide a 2x unroll against k; this halves the number of alpha gathers for larger k.
While not a 4x improvement, it still performs substantially better under P9, for a 1.4x improvement. The P8 baseline is 1.05-1.10x due to its reduced VSX instruction set.
For float types, this results in a more modest
1.2x improvement.
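A rough illustration of the kernel shape using the fixed-width universal intrinsics for single-channel float data. This is not the VSX kernel from the patch: the xofs1 table (right-neighbor offsets) is introduced here only for the sketch, and the real code differs in types, channels, and unrolling.

```cpp
#include "opencv2/core/hal/intrin.hpp"
using namespace cv;

// Two rows processed per pass so the alpha gather is done once and reused; the source
// gathers (v_lut) remain the dominant cost, as noted above. Illustrative only.
static void hresizeLinearVecSketch(const float* S0, const float* S1,
                                   float* D0, float* D1,
                                   const int* xofs, const int* xofs1,  // xofs1[i] = xofs[i] + cn
                                   const float* alpha, int dwidth, int cn)
{
    const int vlen = v_float32x4::nlanes;
    int dx = 0;
    for (; dx + vlen <= dwidth; dx += vlen)
    {
        v_float32x4 a0, a1;
        v_load_deinterleave(alpha + dx * 2, a0, a1);   // one alpha gather shared by both rows
        v_float32x4 l0 = v_lut(S0, xofs + dx), r0 = v_lut(S0, xofs1 + dx);
        v_float32x4 l1 = v_lut(S1, xofs + dx), r1 = v_lut(S1, xofs1 + dx);
        v_store(D0 + dx, v_muladd(r0, a1, v_mul(l0, a0)));
        v_store(D1 + dx, v_muladd(r1, a1, v_mul(l1, a0)));
    }
    for (; dx < dwidth; dx++)                          // scalar tail
    {
        int sx = xofs[dx];
        float a0 = alpha[dx * 2], a1 = alpha[dx * 2 + 1];
        D0[dx] = S0[sx] * a0 + S0[sx + cn] * a1;
        D1[dx] = S1[sx] * a0 + S1[sx + cn] * a1;
    }
}
```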
* Update U8 processing for non-bitexact linear resize
* core: hal: vsx: improve v_load_expand_q
With a little help, we can do this quickly without GPRs on all VSX-enabled targets.
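For readers unfamiliar with the intrinsic, a scalar reference of what v_load_expand_q(const uchar*) produces; the VSX change keeps this entirely in vector registers instead of going through GPRs, and this snippet only documents the semantics, not the VSX instruction sequence.

```cpp
#include <cstdint>

// Reference behavior: 4 bytes loaded and zero-extended to four 32-bit lanes.
static void load_expand_q_ref(const uint8_t* ptr, uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = ptr[i];   // widen each byte to a 32-bit lane
}
```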
* resize: Fix cn == 3 step per feedback
Per feedback, ensure we don't overrun. This was caught via the
failure observed in Test_TensorFlow.inception_accuracy.
Resize reworked using wide universal intrinsics (#13781)
* Added wide universal intrinsics optimized implementation for 3 channel bit-exact linear resize
* Reworked linear resize using new wide LUT intrinsics
* Fix for VSX intrinsics
Exceptions caught by value incur needless cost in C++; most of them can be caught by const reference, especially as nearly none are actually used. This could allow the compiler to generate slightly more efficient code (a minimal illustration follows at the end of this list).
* Bit-exact resize reworked to use wide intrinsics
* Reworked bit-exact resize row data loading
* Added bit-exact resize row data loaders for SIMD256 and SIMD512
* Fixed type punned pointer dereferencing warning
* Reworked loading of source data for SIMD256 and SIMD512 bit-exact resize
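Regarding the note above about exceptions caught by value, a minimal illustration of the change; the function names here are made up.

```cpp
#include <stdexcept>
#include <cstdio>

static void runStep()
{
    throw std::runtime_error("resize step failed");
}

static void caller()
{
    try {
        runStep();
    }
    // Before: catch (std::exception e)   -- copies (and slices) the thrown object
    catch (const std::exception& e)       // after: bind to the thrown object directly
    {
        std::printf("%s\n", e.what());
    }
}
```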