* imgproc: Prevent 1B overrun of 8C3 SIMD optimization
The fourth value read via v_load_q is essentially ignored,
but can cause trouble if it happens to cross page boundaries.
The final few iterations may attempt to read the most extreme
elements of S, which will read 1B beyond the array in most
alignment cases. Dynamically compute the stop. This could be
hoisted from the loop, but will require a more extensive change.
Likewise, clean up the iteration increment statements to make
it more obvious they do channel count (3) elements per pass.
This should resolve #16137
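A minimal sketch of the overrun described above, not the actual OpenCV
code. It assumes a packed row of 3-channel 8-bit pixels read sequentially
with a 4-byte load per pixel; the real loop indexes S through an offset
table. The helper name safe_load_stop and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCV function): returns the first pixel
// index at which a load of `load_bytes` bytes, starting at that pixel,
// would read past the end of a packed row of `pixel_count` pixels with
// `cn` bytes each. Pixels below this index are safe for the wide load;
// the remaining pixels need a scalar path.
std::size_t safe_load_stop(std::size_t pixel_count,
                           std::size_t cn = 3,
                           std::size_t load_bytes = 4)
{
    assert(cn > 0 && load_bytes >= cn);
    const std::size_t total = pixel_count * cn;  // bytes in the row
    if (total < load_bytes)
        return 0;                                // no pixel is safe
    // Pixel i touches bytes [i*cn, i*cn + load_bytes), so it is safe
    // only if i*cn + load_bytes <= total.
    return (total - load_bytes) / cn + 1;
}
```

With the defaults, only the last pixel of the row is unsafe: its fourth
byte lies 1B past the end, which is the byte the load then ignores.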
* imgproc(resize): extra check
dnn(eltwise): fix handling of different number of channels
* dnn(test): reproducer for Eltwise layer issue from PR16063
* dnn(eltwise): rework support for inputs with different channels
* dnn(eltwise): get rid of finalize(), variableChannels
* dnn(eltwise): update input sorting by number of channels
- do not swap inputs if the number of channels is the same after truncation
* dnn(test): skip "shortcut" with batch size 2 on MYRIAD targets
G-API: Fix various issues for 4.2 release
* G-API: Fix issues reported by Coverity
- Fixed: passing values by value instead of passing by reference
* G-API: Fix redundant std::move()'s in return statements
Fixes #15903
* G-API: Added a smarter handling of Stop messages in the pipeline
- This should fix the "expected 100, got 99 frames" problem
- Fixes #15882
* G-API: Pass enum instead of GKernelPackage in Streaming test parameters
- Likely fixes #15836
* G-API: Address review issues in new bugfix comments
* G-API-NG/Docs: Added a tutorial page on interactive face detection sample
- Introduced a "--ser" option to run the pipeline serially for
benchmarking purposes
- Reorganized sample code to better fit the documentation;
- Fixed a couple of issues (mainly typos) in the public headers
* G-API-NG/Docs: Reflected meta-less compilation in new G-API tutorial
* G-API-NG/Docs: Addressed review comments on Face Analytics Pipeline example
cuda4dnn(resize): process multiple channels each iteration
* resize bilinear: process multiple chans. per iter.
* remove unused headers
* correct dispatch logic
* resize_nn: process multiple chans. per iter.
* resize: HResizeLinear reduce duplicate work
There appears to be a 2x unroll of the HResizeLinear against k,
however the k value is only incremented by 1 during the unroll. This
results in k - 1 duplicate passes when k > 1.
Likewise, the final pass may not respect the work done by the vector
loop. Start it with the offset returned by the vector op if
implemented. Note, no vector ops are implemented today.
The performance gain is most noticeable on a linear downscale. A set of
performance tests is added to characterize this. The performance
improvement is 10-50% depending on the scaling.
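The duplicate work can be illustrated with a small model of the loop
structure described above. This is a sketch, not the HResizeLinear source:
it only counts how often each of `count` rows is visited by a 2x-unrolled
loop plus a scalar tail. The names row_passes and total_passes are
hypothetical.

```cpp
#include <cassert>
#include <vector>

// Counts how many times each of `count` rows is processed.
// `step` is the increment of the unrolled loop: 1 models the behaviour
// before the fix, 2 models the corrected loop.
std::vector<int> row_passes(int count, int step)
{
    assert(count >= 0 && (step == 1 || step == 2));
    std::vector<int> passes(count, 0);
    int k = 0;
    for (; k <= count - 2; k += step)  // unrolled body handles rows k, k+1
    {
        ++passes[k];
        ++passes[k + 1];
    }
    for (; k < count; ++k)             // tail handles any remaining row
        ++passes[k];
    return passes;
}

// Sums the per-row pass counts.
int total_passes(int count, int step)
{
    int total = 0;
    for (int p : row_passes(count, step))
        total += p;
    return total;
}
```

In this model, a step of 1 gives 2*count - 1 row passes for count > 1,
that is count - 1 duplicates, while a step of 2 visits each row once.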
* imgproc: vectorize HResizeLinear
Performance is mostly gated by the gather operations
for x inputs.
Likewise, provide a 2x unroll against k; this halves the
number of alpha gathers for larger k.
While not a 4x improvement, it still performs substantially
better under P9, for a 1.4x improvement. The P8 improvement over
baseline is 1.05-1.10x due to its reduced VSX instruction set.
For float types, this results in a more modest
1.2x improvement.
* Update U8 processing for non-bitexact linear resize
* core: hal: vsx: improve v_load_expand_q
With a little help, we can do this quickly without GPRs on
all VSX-enabled targets.
* resize: Fix cn == 3 step per feedback
Per feedback, ensure we don't overrun. This was caught via the
failure observed in Test_TensorFlow.inception_accuracy.