- move TLS & instrumentation code out of core/utility.hpp
- (*) TLSData lost .gather() method (to dispose thread data on thread termination)
- use TLSDataAccumulator to collect thread data reliably
- prefer .detachData() + .cleanupDetachedData() over the .gather() method
(*) API is broken: replace TLSData => TLSDataAccumulator if gather is required
(object disposal on thread termination is not available in accumulator mode)
Fixing bug with comparison of v_int64x2 or v_uint64x2
* Casting v_uint64x2 to v_float64x2 and comparing does NOT work in all cases. Rewrite using epi64 instructions - faster too.
* Fix bad merge.
* Fix equal comparison for non-SSE4.1. Add test cases for v_int64x2 comparisons.
* Try to fix merge conflict.
* Only test v_int64x2 comparisons if CV_SIMD_64F
* Fix compiler warning.
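
A scalar sketch of why comparing 64-bit integer lanes through a double view can give wrong answers. It assumes the "cast" was a bit reinterpretation (not a numeric conversion) and IEEE 754 doubles. All function names are hypothetical and not part of the OpenCV API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the 64 bits of an integer as a double (no numeric conversion).
inline double bitsAsDouble(std::uint64_t bits)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical scalar model of "compare 64-bit lanes as doubles".
inline bool equalViaDouble(std::uint64_t a, std::uint64_t b)
{
    return bitsAsDouble(a) == bitsAsDouble(b);
}

// Plain integer comparison: the result the lanes should produce.
inline bool equalAsInteger(std::uint64_t a, std::uint64_t b)
{
    return a == b;
}

// Two corner cases where the floating-point view disagrees with the integers.
inline void checkCornerCases()
{
    const std::uint64_t negZero  = 0x8000000000000000ULL; // bit pattern of -0.0
    const std::uint64_t quietNaN = 0x7FF8000000000000ULL; // a NaN bit pattern
    // -0.0 == +0.0 in floating point, but the integers differ.
    assert(equalViaDouble(negZero, 0) && !equalAsInteger(negZero, 0));
    // NaN never equals itself, but the integers are identical.
    assert(!equalViaDouble(quietNaN, quietNaN) && equalAsInteger(quietNaN, quietNaN));
}
```

A native 64-bit integer compare avoids both cases, which is consistent with the move to epi64 instructions described above.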
* New v_reverse HAL intrinsic for reversing the ordering of a vector
* Fix conflict.
* Try to resolve conflict again.
* Try one more time.
* Add _MM_SHUFFLE. Remove non-vectorize code in SSE2. Fix copy and paste issue with NEON.
* Change v_uint16x8 SSE2 version to use shuffles
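
As a reference for what "reversing the ordering of a vector" means per lane, here is a minimal scalar model. `reverseLanes` is a hypothetical helper over `std::array`, not the intrinsic itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar reference model of lane reversal: out[i] = in[N - 1 - i].
template <typename T, std::size_t N>
std::array<T, N> reverseLanes(const std::array<T, N>& in)
{
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i)
        out[i] = in[N - 1 - i];
    return out;
}

// Usage: reversing twice restores the original lane order.
inline void checkReverseRoundTrip()
{
    const std::array<int, 4> v = {{1, 2, 3, 4}};
    assert(reverseLanes(reverseLanes(v)) == v);
}
```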
* core: rework and optimize SIMD implementation of dotProd
- add new universal intrinsics v_dotprod[int32], v_dotprod_expand[u&int8, u&int16, int32], v_cvt_f64(int64)
- add a boolean param to all v_dotprod&_expand intrinsics that changes the order in which pairs are added on some
platforms, to allow maximum optimization when only the sum across all lanes matters
- fix clang build on ppc64le
- support wide universal intrinsics for dotProd_32s
- remove raw SIMD and activate universal intrinsics for dotProd_8
- implement SIMD optimization for dotProd_s16&u16
- extend performance test data types of dotprod
- fix GCC VSX workaround of vec_mule and vec_mulo (in little-endian it must be swapped)
- optimize v_mul_expand(int32) on VSX
* core: remove boolean param from v_dotprod&_expand and implement v_dotprod_fast&v_dotprod_expand_fast
these changes follow the review by "terfendail"
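
A scalar sketch of the distinction between an ordered dot product and a "fast" one whose pairing is unspecified. The pairing used in `dotprodAnyOrder` (lane i with lane i+4) is only an illustrative assumption; the actual pairing is platform-dependent. All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Lanes16 = std::array<std::int16_t, 8>;
using Lanes32 = std::array<std::int32_t, 4>;

// Ordered variant: lane i holds a[2i]*b[2i] + a[2i+1]*b[2i+1].
inline Lanes32 dotprodOrdered(const Lanes16& a, const Lanes16& b)
{
    Lanes32 r{};
    for (std::size_t i = 0; i < 4; ++i)
        r[i] = a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1];
    return r;
}

// Assumed "fast" variant: pairs lane i with lane i+4 instead of adjacent lanes.
inline Lanes32 dotprodAnyOrder(const Lanes16& a, const Lanes16& b)
{
    Lanes32 r{};
    for (std::size_t i = 0; i < 4; ++i)
        r[i] = a[i] * b[i] + a[i + 4] * b[i + 4];
    return r;
}

// Horizontal sum across all lanes.
inline std::int32_t sumLanes(const Lanes32& v)
{
    return v[0] + v[1] + v[2] + v[3];
}

// The individual lanes differ, but the total is the same.
inline void checkSameTotal()
{
    const Lanes16 a = {{1, 2, 3, 4, 5, 6, 7, 8}};
    const Lanes16 b = {{1, 2, 3, 4, 5, 6, 7, 8}};
    assert(dotprodOrdered(a, b) != dotprodAnyOrder(a, b));
    assert(sumLanes(dotprodOrdered(a, b)) == sumLanes(dotprodAnyOrder(a, b)));
}
```

Callers that reduce the result to a single sum can use the unordered form; callers that read individual lanes need the ordered one.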
- renamed Cascade Lake AVX512_CEL => AVX512_CLX (align with Intel SDE tool)
- fixed CLX instruction sets (no IFMA/VBMI)
- added flag to bypass CPU baseline check: OPENCV_SKIP_CPU_BASELINE_CHECK
[GSoC 2019] Improve the performance of JavaScript version of OpenCV (OpenCV.js)
* [GSoC 2019]
Improve the performance of JavaScript version of OpenCV (OpenCV.js):
1. Create the base of OpenCV.js performance test:
This perf test is based on benchmark.js (https://benchmarkjs.com). It initially covers `cvtColor`, `Resize`, and `Threshold`.
2. Optimize the OpenCV.js performance by WASM threads:
This optimization is based on the Web Worker API and SharedArrayBuffer, so it can only be used in the browser.
3. Optimize the OpenCV.js performance by WASM SIMD:
Add WASM SIMD backend for OpenCV Universal Intrinsics. It's experimental as WASM SIMD is still in development.
* [GSoC2019]
1. use short license header
2. fix documentation node issue
3. remove the unused `hasSIMD128()` api
* [GSoC2019]
1. fix emscripten define
2. use fallback function for f16
* [GSoC2019]
Fix rebase issue
* Added MSA implementations for mips platforms. Intrinsics for MSA and build scripts for MIPS platforms are added.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Removed some unused code in mips.toolchain.cmake.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Added comments for mips toolchain configuration and disabled compiling warnings for libpng.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Fixed the build error of unsupported opcode 'pause' when mips isa_rev is less than 2.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Removed FP16 related item in MSA option defines in OpenCVCompilerOptimizations.cmake.
2. Use CV_CPU_COMPILE_MSA instead of __mips_msa for MSA feature check in cv_cpu_dispatch.h.
3. Removed hasSIMD128() in intrin_msa.hpp.
4. Define CPU_MSA as 150.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Removed unnecessary CV_SIMD128_64F guarding in intrin_msa.hpp.
2. Removed unnecessary CV_MSA related code block in dotProd_8u().
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* 1. Defined CPU_MSA_FLAGS_ON as "-mmsa".
2. Removed CV_SIMD128_64F guardings in intrin_msa.hpp.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
* Removed unused msa_mlal_u16() and msa_mlal_s16 from msa_macros.h.
Signed-off-by: Fei Wu <fwu@wavecomp.com>
ISA 2.07 (aka POWER8) effectively extended the expanding multiply
operation to word types. The altivec intrinsics prior to gcc 8 did
not get the update.
Work around this deficiency in the same way as other fixes.
This was exposed by commit 33fb253a66
which leverages the int -> dword expanding multiply.
This fixes Issue #15506
Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
x86 this showed about 1.4x improvement.
For PPC, do a full multiply (32x32->64b), convert to DP
then accumulate. This may be slightly less precise for
some inputs, but it is 1.5x faster than the FMA approach
above, which is itself about a 1.5x improvement, for a
~2.5x speedup overall.
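
A scalar sketch of the "4x FMA chains" idea, assuming the routine is a dot product of 32-bit integers accumulated in double precision. Four independent accumulators break the dependency between consecutive fused multiply-adds. `dotProd32s` is a hypothetical name, and this is not the vectorized implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

inline double dotProd32s(const std::int32_t* a, const std::int32_t* b, std::size_t n)
{
    assert(n == 0 || (a && b));
    // Four independent chains: each FMA depends only on its own accumulator.
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        acc0 = std::fma(double(a[i]),     double(b[i]),     acc0);
        acc1 = std::fma(double(a[i + 1]), double(b[i + 1]), acc1);
        acc2 = std::fma(double(a[i + 2]), double(b[i + 2]), acc2);
        acc3 = std::fma(double(a[i + 3]), double(b[i + 3]), acc3);
    }
    // Tail elements that do not fill a group of four.
    for (; i < n; ++i)
        acc0 = std::fma(double(a[i]), double(b[i]), acc0);
    return (acc0 + acc1) + (acc2 + acc3);
}
```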
Implement cvRound using inline asm. No compiler support
exists today to properly optimize this. This results in
about a 4x speedup over the default rounding. Likewise,
simplify the growing number of rounding function overloads.
For P9 enabled targets, utilize the classification
testing instruction to test for Inf/NaN values. Operation
speedup is about 1.2x for FP32, and 1.5x for FP64 operands.
For P8 targets, fall back to the GCC NaN inline. It provides
a 1.1/1.4x improvement for FP32/FP64 arguments.
Add a new macro definition OPENCV_USE_FASTMATH_GCC_BUILTINS to enable
usage of GCC inline math functions, if available and requested by the
user.
Likewise, enable it for POWER. This is nearly always a substantial
improvement over using integer manipulation as most operations can
be done in several instructions with no branching. The result is a
1.5-1.8x speedup in the ceil/floor operations.
1. As tested with AT 12.0-1 (GCC 8.3.1) compiler on P9 LE.
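
A minimal sketch contrasting the two floor strategies mentioned above: integer manipulation versus a GCC inline math builtin. The function names are hypothetical, and the guard shown here is a plain `__GNUC__` check, not the OPENCV_USE_FASTMATH_GCC_BUILTINS logic itself.

```cpp
#include <cassert>
#include <cmath>

// Integer-manipulation floor: truncate toward zero, then subtract 1
// when truncation rounded up (negative non-integer inputs).
inline int floorByTruncation(double v)
{
    assert(v >= -2147483648.0 && v <= 2147483647.0); // must fit in int
    const int i = static_cast<int>(v);
    return i - (i > v);
}

// Builtin-based floor: lets the compiler emit the native rounding instruction.
inline int floorByBuiltin(double v)
{
    assert(v >= -2147483648.0 && v <= 2147483647.0); // must fit in int
#if defined(__GNUC__)
    return static_cast<int>(__builtin_floor(v));
#else
    return static_cast<int>(std::floor(v));
#endif
}
```

Both functions return the same values for in-range inputs; the difference is only in the instructions the compiler can generate.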
Due to the explicitly declared copy constructor Vec<T, n>::Vec(Vec <T,n>&)
GCC 9 warns if there is no assignment operator, as having one typically
requires the other (rule-of-three, constructor/destructor/assignment).
As the values are just a plain array the default assignment operator does
the right thing. Tell the compiler explicitly to default it.
Signed-off-by: Stefan Brüns <stefan.bruens@rwth-aachen.de>
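
A minimal sketch of the pattern described above, using a hypothetical `SmallVec` type in place of the real `Vec<T, n>`: a user-declared copy constructor paired with an explicitly defaulted copy assignment operator.

```cpp
#include <cassert>

template <typename T, int n>
struct SmallVec
{
    T val[n];

    SmallVec() : val() {}

    // User-declared copy constructor: the declaration that triggers the warning
    // when no copy assignment operator is declared alongside it.
    SmallVec(const SmallVec& other)
    {
        for (int i = 0; i < n; ++i)
            val[i] = other.val[i];
    }

    // Member-wise assignment is correct for a plain array; state it explicitly.
    SmallVec& operator=(const SmallVec&) = default;
};

// Usage: assignment copies every element.
inline void checkAssignment()
{
    SmallVec<int, 3> a;
    a.val[0] = 1; a.val[1] = 2; a.val[2] = 3;
    SmallVec<int, 3> b;
    b = a;
    assert(b.val[0] == 1 && b.val[1] == 2 && b.val[2] == 3);
}
```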
* core: improve AVX512 infrastructure by adding more CPU features groups
* cmake: use groups for AVX512 optimization flags
* core: remove gap in CPU flags enumeration
* cmake: restore default CPU_DISPATCH
OE-11 Logging revamp (#13909)
* Initial commit for log tag support.
Part of #11003, incomplete. Should pass build.
Moved LogLevel enum to logger.defines.hpp
LogTag struct used to convey both name and log level threshold as
one argument to the new logging macro. See logtag.hpp file, and
CV_LOG_WITH_TAG macro.
Global log level is now associated with a global log tag, when a
logging statement doesn't specify any log tag. See getLogLevel and
getGlobalLogTag functions.
A macro CV_LOGTAG_FALLBACK is allowed to be re-defined by other modules
or compilation units, internally, so that logging statements inside
that unit that specify NULL as tag will fall back to the re-defined tag.
Line-of-code information (file name, line number, function name),
together with tag name, are passed into the new log message sink.
See writeLogMessageEx function.
Fixed old incorrect CV_LOG_VERBOSE usage in ocl4dnn_conv_spatial.cpp.
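
A standalone sketch of the tag idea described in this commit: a tag carries a name and a level threshold, and a NULL tag falls back to a global tag. All types and functions here are hypothetical stand-ins and do not reproduce the LogTag or CV_LOG_WITH_TAG definitions.

```cpp
#include <cassert>

// Hypothetical severity scale: smaller value means more severe.
enum class Level { Silent = 0, Fatal, Error, Warning, Info, Debug, Verbose };

// A tag conveys both a name and a level threshold as one argument.
struct Tag
{
    const char* name;
    Level level;
};

// Global tag used when a logging statement specifies no tag.
inline Tag& globalTag()
{
    static Tag tag = { "global", Level::Info };
    return tag;
}

// NULL falls back to the global tag.
inline const Tag& resolveTag(const Tag* tag)
{
    assert(tag == nullptr || tag->name != nullptr); // a real tag must be named
    return tag ? *tag : globalTag();
}

// A message passes if it is at least as severe as the tag's threshold.
inline bool shouldLog(const Tag* tag, Level messageLevel)
{
    return static_cast<int>(messageLevel) <= static_cast<int>(resolveTag(tag).level);
}
```

Keeping the threshold on the tag means one comparison decides whether a message is emitted, whether the tag is explicit or the global fallback.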
* Implemented tag-based log filtering
Added LogTagManager. This is an initial version, using standard C++
approach as much as possible, to allow easier code review. Will
optimize later.
A workaround for all static dynamic initialization issues is
implemented. Refer to code comments.
* Added LogTagConfigParser.
Note: new code does not fully handle old log config parsing behavior.
* Fix log tag config vs registering ordering issue.
* Started testing LogTagConfigParser, incomplete.
The intention of this commit is to illustrate the capabilities of
the current design of LogTagConfigParser.
The test contained in this commit is not complete. Also, design changes
may require throwing away this commit and rewriting test code from
scratch.
Does not test whitespace segmentation (multiple tags on the config);
will do in next commit.
* Added CV_LOGTAG_EXPAND_NAME macro
This macro can be re-defined locally in other compilation units
to apply a prefix to whatever argument is passed as the "tag" argument
into CV_LOG_WITH_TAG. The default definition in logger.hpp does not
modify the argument. It is recommended to include the address-of
operator (ampersand) when re-defined locally.
* Added a few tests for LogTagManager, some fail.
See test_logtagmanager.cpp
Failed tests are: non-global ("something"), setting level by name-part
(first part or any part) has no effect at all.
* LogTagManagerTests substring non-confusion tests
* Fix major bugs in LogTagManager
The code change is intended to approximate the spec documented in
https://gist.github.com/kinchungwong/ec25bc1eba99142e0be4509b0f67d0c6
Refer to test suite in test_logtagmanager.cpp
Filter test result in "opencv_test_core" ...
with gtest_filter "LogTagManager*"
To see the test code that finds the bugs, refer to original commits
(before rebase; might be gone)
.. f3451208 (2019-03-03T19:45:17Z)
.... LogTagManagerTests substring non-confusion tests
.. 1b848f5f (2019-03-03T01:55:18Z)
.... Added a few tests for LogTagManager, some fail.
* Added LogTagManagerNamePartNonConfusionTest.
See test_logtagmanager.cpp in modules/core/test.
* Added LogTagAuto for auto registration in ctor
* Rewritten LogTagManager to resolve issues.
* Resolves code review issues around 2019-04-10
LogTagConfigParser::parseLogLevel - as part of resolving code review
issues, this function is rewritten to simplify control flow and to
improve conformance with legacy usage (for string values "OFF",
"DISABLED", and "WARNINGS").
- added functionality to collect memory usage of OpenCL subsystem
- memory usage of fastMalloc() (disabled by default):
* It is not accurate sometimes - external memory profiler is required.
- specify common `CV_TEST_TAG_` macros
- added applyTestTag() function
- write memory usage / enabled tags into Google Tests output file (.xml)
- allow cmake to check sanity of vsx aligned ld/st
- force universal intrinsics v_load_aligned/v_store_aligned
to fall back to unaligned ld/st if the cmake runtime vsx aligned test fails
Lab/XYZ modes have been postponed (color_lab.cpp):
- need to split code for tables initialization and for pixels processing first
- no significant performance improvements for switching between SSE42 / AVX2 code generation
Resize reworked using wide universal intrinsics (#13781)
* Added wide universal intrinsics optimized implementation for 3 channel bit-exact linear resize
* Reworked linear resize using new wide LUT intrinsics
* Fix for VSX intrinsics
Due to size limit of shared memory, histogram is built on
the global memory for CV_16UC1 case.
The amount of memory needed for building histogram is:
65536 * 4byte = 256KB
and shared memory limit is 48KB typically.
Added test cases for CV_16UC1 and various clip limits.
Added perf tests for CV_16UC1 on both CPU and CUDA code.
There was also a bug in CV_8UC1 case when redistributing
"residual" clipped pixels. Adding the test case where clip
limit is 5.0 exposes this bug.
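
The memory arithmetic above can be checked with a few lines of code. The constants come from the text (65536 bins, 4 bytes each, 48KB limit); the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBytesPerBin     = 4;         // 4-byte counter per histogram bin
constexpr std::size_t kSharedMemLimit  = 48 * 1024; // typical shared memory limit
constexpr std::size_t kBins16U         = 65536;     // 2^16 possible 16-bit values
constexpr std::size_t kBins8U          = 256;       // 2^8 possible 8-bit values

// 65536 * 4 bytes = 256KB, well above the 48KB limit.
static_assert(kBins16U * kBytesPerBin == 256 * 1024, "16-bit histogram is 256KB");
static_assert(kBins16U * kBytesPerBin > kSharedMemLimit, "does not fit in shared memory");

// Size of a histogram with the given number of bins, in bytes.
inline std::size_t histogramBytes(std::size_t bins)
{
    assert(bins > 0);
    return bins * kBytesPerBin;
}

// True if the whole histogram fits within the shared memory limit.
inline bool fitsInSharedMemory(std::size_t bins)
{
    return histogramBytes(bins) <= kSharedMemLimit;
}
```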
* Add operator overload for multi-channel Mat with literal constant.
* simple test
* Operator overloading channel constraint for primitive types
* fix some test for #13586
* added performance test for compareHist
* compareHist reworked to use wide universal intrinsics
* Disabled vectorization for CV_COMP_CORREL and CV_COMP_BHATTACHARYYA if f64 is unsupported
* Added performance tests for hal::norm functions
* Added sum of absolute differences intrinsic
* norm implementation updated to use wide universal intrinsics
* improve and fix v_reduce_sad on VSX
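
For reference, a scalar model of a sum-of-absolute-differences reduction over 8-bit data. `reduceSad` is a hypothetical helper that shows the expected result of such a reduction, not the intrinsic's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sum of |a[i] - b[i]| over n elements, accumulated in 32 bits.
inline std::uint32_t reduceSad(const std::uint8_t* a, const std::uint8_t* b, std::size_t n)
{
    assert(n == 0 || (a && b));
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        // Subtract the smaller from the larger so the difference never goes negative.
        const std::uint32_t hi = a[i] > b[i] ? a[i] : b[i];
        const std::uint32_t lo = a[i] > b[i] ? b[i] : a[i];
        sum += hi - lo;
    }
    return sum;
}
```

This is the quantity an L1 norm of the difference of two arrays reduces to.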
- add infrastructure support for Power9/VSX3
- fix missing VSX flags on GCC4.9 and CLANG4 (#13210, #13222)
- fix disabling VSX optimization on GCC via the ENABLE_VSX flag
- flag ENABLE_VSX is deprecated now, use CPU_BASELINE, CPU_DISPATCH instead
- add VSX3 to arithmetic dispatchable flags
* Support for Matx read/write by FileStorage
* Only empty filestorage read now produces default Matx. Split Matx IO test into smaller units. Test checks for exception thrown if reading a Mat into a Matx of different size.
* Updated boxFilter implementations to use wide universal intrinsics
* boxFilter implementation moved to separate file
* Replaced ROUNDUP macro with roundUp() function
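
A sketch of what a round-up function can look like, assuming the usual meaning of rounding a value up to the next multiple. `roundUpTo` is a hypothetical name; neither the original ROUNDUP macro nor the actual roundUp() signature is reproduced here.

```cpp
#include <cassert>

// Round value up to the nearest multiple of `multiple`.
// Assumes a non-negative value and a positive multiple.
template <typename T>
inline T roundUpTo(T value, T multiple)
{
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}
```

Compared with a macro, a function evaluates each argument exactly once and is type-checked.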
* integrated the new C++ persistence; removed old persistence; most of OpenCV compiles fine! the tests have not been run yet
* fixed multiple bugs in the new C++ persistence
* fixed raw size of the parsed empty sequences
* [temporarily] excluded obsolete applications traincascade and createsamples from build
* fixed several compiler warnings and multiple test failures
* undo changes in cocoa window rendering (that was fixed in another PR)
* fixed more compile warnings and the remaining test failures (hopefully)
* trying to fix the last little warning
- initialize arithmetic dispatcher
- add new universal intrinsic v_absdiffs
- add new universal intrinsic v_pack_b
- add accumulate version of universal intrinsic v_round
- fix sse/avx2:uint8 multiplication overflow
- reimplement arithmetic, logic and comparison operations into wide universal intrinsics
with full support for all types
- reimplement IPP arithmetic, logic and comparison operations in a separate file arithm_ipp.hpp
- avoid scalar multiplication if the scaling factor equals 1 and use integer multiplication instead
- move C arithmetic operations to precomp.hpp and delete [arithm_simd|arithm_core].hpp
- add compatibility with new opencv4 divide policy
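
The commit does not spell out the uint8 multiplication overflow it fixes. The generic issue can be illustrated in scalar code: an 8-bit product that keeps only the low byte versus one that widens first and saturates. Both helpers are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Wrapping multiply: only the low 8 bits of the product survive.
inline std::uint8_t mulWrap(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a * b);
}

// Saturating multiply: compute in a wider type, then clamp to 255.
inline std::uint8_t mulSaturate(std::uint8_t a, std::uint8_t b)
{
    const unsigned wide = static_cast<unsigned>(a) * b;
    return static_cast<std::uint8_t>(wide > 255u ? 255u : wide);
}

// Usage: 20 * 20 = 400 does not fit in 8 bits.
inline void checkOverflowBehaviour()
{
    assert(mulWrap(20, 20) == 144);     // 400 mod 256
    assert(mulSaturate(20, 20) == 255); // clamped
}
```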