mirror of
https://github.com/opencv/opencv.git
synced 2025-06-17 07:10:51 +08:00

Impl hal_rvv LUT | Add more LUT test #26941

Implement through the existing `cv_hal_lut` interfaces.

Add more LUT accuracy and performance tests:

- **Accuracy test**: Multi-channel table tests are added, and the boundary of `randu` used for generating test data is broadened to make the test more robust.
- **Performance test**: Multi-channel input and multi-channel table tests are added.

Perf test done on
- MUSE-PI (vlen=256)
- Compiler: gcc 14.2 (riscv-collab/riscv-gnu-toolchain Nightly: December 16, 2024)

```sh
$ opencv_test_core --gtest_filter="Core_LUT*"
$ opencv_perf_core --gtest_filter="SizePrm_LUT*" --perf_min_samples=300 --perf_force_samples=300
```

```sh
                      Geometric mean (ms)

          Name of Test            scalar    ui     rvv       ui         rvv
                                                             vs         vs
                                                           scalar     scalar
                                                         (x-factor) (x-factor)
LUT::SizePrm::320x240             0.248   0.249   0.052     1.00       4.74
LUT::SizePrm::640x480             0.277   0.275   0.085     1.01       3.28
LUT::SizePrm::1920x1080           0.950   0.947   0.634     1.00       1.50
LUT_multi2::SizePrm::320x240      2.051   2.045   2.049     1.00       1.00
LUT_multi2::SizePrm::640x480      2.128   2.134   2.125     1.00       1.00
LUT_multi2::SizePrm::1920x1080    7.397   7.380   7.390     1.00       1.00
LUT_multi::SizePrm::320x240       0.715   0.747   0.154     0.96       4.64
LUT_multi::SizePrm::640x480       0.741   0.766   0.257     0.97       2.88
LUT_multi::SizePrm::1920x1080     2.766   2.765   1.925     1.00       1.44
```

This optimization is achieved by loading the entire lookup table into vector registers. Due to register size limitations, the optimization is only effective under the following conditions:

- For the U8C1 table type, the optimization works when `vlen >= 256`
- For U16C1, it works when `vlen >= 512`
- For U32C1, it works when `vlen >= 1024`

Since I don’t have real hardware with `vlen > 256`, the corresponding accuracy tests were conducted on QEMU built from the `riscv-collab/riscv-gnu-toolchain`.

This patch does not implement optimizations for multi-channel tables.

Previous attempts:

1. For the U8C1 table type, when `vlen = 128`, it is possible to use four `u8m4` vectors to load the entire table, perform gathering, and merge the results. However, the performance is almost the same as the scalar version.
2. Loading part of the table and repeatedly loading the source data is faster for small sizes. But as the table size grows, the performance quickly degrades compared to the scalar version.
3. Using `vluxei8` as a general solution does not show any performance improvement.

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
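The `vlen` thresholds listed in the description above can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes the whole 256-entry table must fit in a single register group of at most `LMUL = 8` registers, which the PR text does not state explicitly. The helper names `minVlenBits` and `tableFitsInRegisters` are hypothetical and are not part of OpenCV or the patch.

```cpp
#include <cassert>
#include <cstddef>

// Assumption (not stated in the PR): the table is held in one register
// group, and a group spans at most LMUL = 8 vector registers.
constexpr std::size_t kMaxLmul = 8;

// A LUT indexed by 8-bit source values has 256 entries.
constexpr std::size_t kTableEntries = 256;

// Hypothetical helper: smallest VLEN (in bits) at which one LMUL=8 register
// group can hold a whole single-channel table with the given element size.
constexpr std::size_t minVlenBits(std::size_t elemBytes)
{
    // total table bits divided by the number of registers in the group
    return kTableEntries * elemBytes * 8 / kMaxLmul;
}

// Hypothetical helper: does a table with this element size fit at this VLEN?
constexpr bool tableFitsInRegisters(std::size_t vlenBits, std::size_t elemBytes)
{
    return vlenBits >= minVlenBits(elemBytes);
}

// These match the conditions quoted in the description.
static_assert(minVlenBits(1) == 256,  "U8C1 table");
static_assert(minVlenBits(2) == 512,  "U16C1 table");
static_assert(minVlenBits(4) == 1024, "U32C1 table");
```

Under this assumption, `tableFitsInRegisters(256, 1)` holds for the MUSE-PI board used in the measurements, while `tableFitsInRegisters(256, 2)` does not, which is consistent with the U16C1 and U32C1 paths being checked only under QEMU.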
47 lines
872 B
C++
#include "perf_precomp.hpp"

namespace opencv_test { namespace {
using namespace perf;

typedef perf::TestBaseWithParam<Size> SizePrm;

PERF_TEST_P( SizePrm, LUT,
             testing::Values(szQVGA, szVGA, sz1080p)
           )
{
    Size sz = GetParam();

    int maxValue = 255;

    Mat src(sz, CV_8UC1);
    randu(src, 0, maxValue);
    Mat lut(1, 256, CV_8UC1);
    randu(lut, 0, maxValue);
    Mat dst(sz, CV_8UC1);

    TEST_CYCLE() LUT(src, lut, dst);

    SANITY_CHECK(dst, 0.1);
}

PERF_TEST_P( SizePrm, LUT_multi,
             testing::Values(szQVGA, szVGA, sz1080p)
           )
{
    Size sz = GetParam();

    int maxValue = 255;

    Mat src(sz, CV_8UC3);
    randu(src, 0, maxValue);
    Mat lut(1, 256, CV_8UC1);
    randu(lut, 0, maxValue);
    Mat dst(sz, CV_8UC3);

    TEST_CYCLE() LUT(src, lut, dst);

    SANITY_CHECK_NOTHING();
}

}} // namespace