Commit Graph

27 Commits

Author SHA1 Message Date
casualwinds
7b399c4248
Merge pull request #24280 from casualwind:parallel_opt
Optimization for parallelization when large core number #24280

**Problem description:**
When the number of cores is large, OpenCV’s thread library may reduce performance when processing parallel jobs.

**The reason for this problem:**
When the number of cores (the thread pool initialized the threads, whose number is as same as the number of cores) is large, the main thread will spend too much time on waking up unnecessary threads.
When a parallel job needs to be executed, the main thread will wake up all threads in sequence, and then wait for the signal for the  job completion after waking up all threads. When the number of threads is larger than the parallel number of a job slices, there will be a situation where the main thread wakes up the threads in sequence and the awakened threads have completed the job, but the main thread is still waking up the other threads. The threads woken up by the main thread after this have nothing to do, and the broadcasts made by the waking threads take a lot of time, which reduce the performance.

**Solution:**
Reduce the time for the process of main thread waking up the worker threads through the following two methods:

•	The number of threads awakened by the main thread should be adjusted according to the parallel number of a job slices. If the number of threads is greater than the number of the parallel number of job slices, the total number of threads awakened should be reduced.
•	In the process of waking up threads in sequence, if the main thread finds that all parallel job slices have been allocated, it will jump out of the loop in time and wait for the signal for the job completion.

**Performance Test:**
The tests were run in the manner described by https://github.com/opencv/opencv/wiki/HowToUsePerfTests.
At core number =  160, There are big performance gain in some cases.

Take the following cases in the video module as examples:

OpticalFlowPyrLK_self::Path_Idx_Cn_NPoints_WSize_Deriv::("cv/optflow/frames/VGA_%02d.png", 2, 1, (9, 9), 11, true)
Performance improves 191%:0.185405ms ->0.0636496ms
perf::DenseOpticalFlow_VariationalRefinement::(320x240, 10, 10)
Performance improves 112%:23.88938ms -> 11.2562ms  
Among all the modules, the performance improvement is greatest on module video, and there are also certain improvements on other modules.

At core number = 160, the times labeled below are the geometric mean of the average time of all cases for one module. The optimization is available on each module.

overall | time(ms) |   |   |   |   |   |   |  
-- | -- | -- | -- | -- | -- | -- | -- | --
module   name | gapi | dnn | features2d | objdetect | core | imgproc | stitching | video
original | 0.185 | 1.586 | 9.998 | 11.846 | 0.205 | 0.215 | 164.409 | 0.803
optimized | 0.174 | 1.353 | 9.535 | 11.105 | 0.199 | 0.185 | 153.972 | 0.489
Performance   improves | 6% | 17% | 5% | 7% | 3% | 16% | 7% | 64%

Meanwhile, It is found that adjusting the order of test cases will have an impact on some test cases. For example, we used option --gtest-shuffle to run opencv_perf_gapi, the performance of TestPerformance::CmpWithScalarPerfTestFluid/CmpWithScalarPerfTest::(compare_f, CMP_GE, 1920x1080, 32FC1, { gapi.kernel_package })  case had 30% changes compared to the case without shuffle. I would like to ask if you have also encountered such a situation and could you share your experience?

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
2023-09-27 16:21:20 +03:00
Alexander Alekhin
bc8c912c7a Merge remote-tracking branch 'upstream/3.4' into merge-3.4 2022-12-24 13:54:58 +00:00
Vincent Rabaud
7463e9b8bb Even faster CV_PAUSE on SkyLake and above.
No need to loop as RDTSC is 3/4 times faster than _mm_pause.
2022-12-19 14:15:34 +01:00
Alexander Alekhin
420db56ffd Merge remote-tracking branch 'upstream/3.4' into merge-3.4 2022-12-18 02:17:17 +00:00
Vincent Rabaud
b7b08fa0c3 Fix slower CV_PAUSE on SkyLake and above.
This is fixing https://github.com/opencv/opencv/issues/22852
2022-12-15 14:18:57 +01:00
wxsheng
4154bd0667
Add Loongson Advanced SIMD Extension support: -DCPU_BASELINE=LASX
* Add Loongson Advanced SIMD Extension support: -DCPU_BASELINE=LASX
* Add resize.lasx.cpp for Loongson SIMD acceleration
* Add imgwarp.lasx.cpp for Loongson SIMD acceleration
* Add LASX acceleration support for dnn/conv
* Add CV_PAUSE(v) for Loongarch
* Set LASX by default on Loongarch64
* LoongArch: tune test threshold for Core/HAL.mat_decomp/15

Co-authored-by: shengwenxue <shengwenxue@loongson.cn>
2022-09-10 09:39:43 +03:00
Alexander Alekhin
87d4970e8b Merge remote-tracking branch 'upstream/3.4' into merge-3.4 2021-10-04 19:50:01 +00:00
Alexander Alekhin
62414e3073 core(parallel): suppress TSAN warning 2021-10-04 10:46:32 +00:00
Alexander Alekhin
e5d78960c6 Merge remote-tracking branch 'upstream/3.4' into merge-3.4 2021-02-12 21:34:49 +00:00
Vincent Rabaud
847b16fb76 Disable thread sanitization when CV_USE_GLOBAL_WORKERS_COND_VAR is not set.
This fixes #19463
2021-02-09 14:12:39 +01:00
Alexander Smorkalov
7228d2a824 Added initial version of cmake toolchain for RISC-V architecture. 2020-04-27 12:42:38 +03:00
Alexander Alekhin
560f85f8e5 Merge remote-tracking branch 'upstream/3.4' into merge-3.4 2020-01-28 14:26:57 +03:00
Alexander Alekhin
e83438c23d core(build): fix i386 compilation 2020-01-26 00:00:25 +00:00
Alexander Alekhin
92b9888837 Merge remote-tracking branch 'upstream/3.4' into merge-3.4 2019-12-12 13:02:19 +03:00
Alexander Alekhin
816f82682b core(trace/itt): avoid calling __itt_thread_set_name() by default
- don't override current application thread names
- set name for own threads only
2019-12-07 21:41:15 +00:00
Alexander Alekhin
a74fe2ec01 Merge remote-tracking branch 'upstream/3.4' into merge-3.4 2019-09-20 21:11:49 +00:00
mipsopen-fwu
b1ea91d8bd Merge pull request #15422 from mipsopen-fwu:msa-dev
* Added MSA implementations for mips platforms. Intrinsics for MSA and build scripts for MIPS platforms are added.

Signed-off-by: Fei Wu <fwu@wavecomp.com>

* Removed some unused code in mips.toolchain.cmake.

Signed-off-by: Fei Wu <fwu@wavecomp.com>

* Added comments for mips toolchain configuration and disabled compiling warnings for libpng.

Signed-off-by: Fei Wu <fwu@wavecomp.com>

* Fixed the build error of unsupported opcode 'pause' when mips isa_rev is less than 2.

Signed-off-by: Fei Wu <fwu@wavecomp.com>

* 1. Removed FP16 related item in MSA option defines in OpenCVCompilerOptimizations.cmake.
2. Use CV_CPU_COMPILE_MSA instead of __mips_msa for MSA feature check in cv_cpu_dispatch.h.
3. Removed hasSIMD128() in intrin_msa.hpp.
4. Define CPU_MSA as 150.
Signed-off-by: Fei Wu <fwu@wavecomp.com>

* 1. Removed unnecessary CV_SIMD128_64F guarding in intrin_msa.hpp.
2. Removed unnecessary CV_MSA related code block in dotProd_8u().

Signed-off-by: Fei Wu <fwu@wavecomp.com>

* 1. Defined CPU_MSA_FLAGS_ON as "-mmsa".
2. Removed CV_SIMD128_64F guardings in intrin_msa.hpp.

Signed-off-by: Fei Wu <fwu@wavecomp.com>

* Removed unused msa_mlal_u16() and msa_mlal_s16 from msa_macros.h.

Signed-off-by: Fei Wu <fwu@wavecomp.com>
2019-09-20 19:52:48 +03:00
cyy
10fb88d027 Merge pull request #12391 from DEEPIR:master
fix some errors found by static analyzer. (#12391)

* fix possible divided by zero and by negative values

* only 4 elements are used in these arrays

* fix uninitialized member

* use boolean type for semantic boolean variables

* avoid invalid array index

* to avoid exception and because base64_beg is only used in this block

* use std::atomic<bool> to avoid thread control race condition
2018-09-04 16:39:19 +03:00
cyy
09837928d9 Merge pull request #12357 from DEEPIR:master
* fix some static analyzer warnings

* fix some static analyzer warnings

* fix race condition of workthread control
2018-09-02 16:34:43 +03:00
Alexander Alekhin
98c8584b88 next: drop CV_CXX11 conditions
define itself is still here for compatibility
2018-04-10 18:09:54 +03:00
Alexander Alekhin
7dc162cb42 core: fix mm_pause() for non-SSE i386 builds
replaced to safe binary compatible 'rep; nop' asm instruction
2018-04-09 18:37:35 +03:00
Alexander Alekhin
491502a349 core: fix parallel_for data race 2018-02-16 21:13:48 +03:00
luz.paz
5718d09e39 Misc. modules/ typos
Found via `codespell`
2018-02-12 07:09:43 -05:00
Alexander Alekhin
914f57f28d core(parallel_for): fix data race 2018-02-06 18:19:50 +03:00
Alexander Alekhin
b10fedde56 core(parallel_for): cleanup
remove 'dont_wait' (can be replaced with has_wake_signal)
2018-02-06 16:10:41 +03:00
Sayed Adel
4e1d396ce1 core:ppc Add yield support 2018-01-31 04:03:35 +00:00
Alexander Alekhin
c49d5d5252 core: fix pthreads performance
OpenCV pthreads-based implementation changes:
- rework worker threads pool, allow to execute job by the main thread too
- rework synchronization scheme (wait for job completion, threads 'pong' answer is not required)
- allow "active wait" (spin) by worker threads and by the main thread
- use _mm_pause() during active wait (support for Hyper-Threading technology)
- use sched_yield() to avoid preemption of still working other workers
- don't use getTickCount()
- optional builtin thread pool profiler (disabled by compilation flag)
2018-01-26 04:09:11 +00:00