2x more accurate float => bfloat conversion #26321
There is a magic trick to make float => bfloat conversion more accurate (_original reference needed, is it done this way in PyTorch?_). In simplified form it looks like:
```
uint16_t f2bf(float x) {
    union {
        unsigned u;
        float f;
    } u;
    u.f = x;
    // return (uint16_t)(u.u >> 16); <== the old method before this patch
    return (uint16_t)((u.u + 0x8000) >> 16);
}
```
It works correctly for almost all valid floating-point values (positive, zero or negative) and even for some extreme cases such as `+/-inf` and `nan`. Adding `0x8000` to the integer representation of the 32-bit float before taking the highest 16 bits rounds to the nearest bfloat value instead of truncating, which reduces the rounding error by ~2x.
The slight problem with this improved method is that the numbers very close to or equal to `+/-FLT_MAX` are mistakenly converted to `+/-inf`, respectively.
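To make the two behaviours above concrete, here is a minimal standalone sketch that compares truncation with the `+0x8000` variant and reproduces the `FLT_MAX` overflow. The helper names (`float_bits`, `f2bf_trunc`, `f2bf_round`, `bf2f`, `f2bf_demo`) are hypothetical and are not taken from the patch; `memcpy` is used instead of a union only to keep the sketch well-defined in C++.

```cpp
#include <cassert>
#include <cfloat>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit unsigned integer.
uint32_t float_bits(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}

// Old method: keep the upper 16 bits, i.e. truncate the mantissa.
uint16_t f2bf_trunc(float x) {
    return (uint16_t)(float_bits(x) >> 16);
}

// Simplified new method: add half of the discarded range before shifting.
uint16_t f2bf_round(float x) {
    return (uint16_t)((float_bits(x) + 0x8000u) >> 16);
}

// Expand a bfloat16 bit pattern back to float (lower 16 bits are zero).
float bf2f(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

void f2bf_demo() {
    // 1.005859375f has bits 0x3F80C000: three quarters of the way
    // between the bfloat16 neighbours 1.0 and 1.0078125.
    float x = 1.005859375f;
    assert(bf2f(f2bf_trunc(x)) == 1.0f);        // truncation rounds down
    assert(bf2f(f2bf_round(x)) == 1.0078125f);  // rounding picks the nearer value

    // FLT_MAX has bits 0x7F7FFFFF; adding 0x8000 carries into the exponent.
    assert(f2bf_trunc(FLT_MAX) == 0x7F7F);      // largest finite bfloat16
    assert(f2bf_round(FLT_MAX) == 0x7F80);      // bit pattern of +inf
}
```

Calling `f2bf_demo()` exercises both cases; the second pair of assertions is the overflow that the full patch has to avoid.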
This patch implements an improved algorithm for `float => bfloat` conversion in scalar and vector form. It fixes the above-mentioned problem using some extra bit magic, i.e. `0x8000` is not added to numbers that are very large in absolute value:
```
// the actual implementation is more efficient,
// without conditions or floating-point operations, see the source code
return (uint16_t)((u.u + (fabsf(x) <= big_threshold ? 0x8000 : 0)) >> 16);
```
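A runnable version of the guarded conversion could look like the sketch below. It is not the implementation from the patch: the function name `f2bf_guarded` is hypothetical, the comparison is done on the integer bits rather than with `fabsf`, and the threshold `0x7F7F8000` is an assumption derived from the bit layout (it is the smallest magnitude for which adding `0x8000` reaches `0x7F800000`, the bit pattern of `+inf`).

```cpp
#include <cassert>
#include <cfloat>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not the OpenCV code: round a float to bfloat16,
// but skip the +0x8000 bias when it would carry into the all-ones
// exponent and turn a finite value into infinity.
uint16_t f2bf_guarded(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    uint32_t abs_bits = u & 0x7FFFFFFFu;  // clear the sign bit
    // Assumed threshold: magnitudes below 0x7F7F8000 can absorb the bias
    // without reaching the exponent of inf/nan.
    uint32_t bias = abs_bits < 0x7F7F8000u ? 0x8000u : 0u;
    return (uint16_t)((u + bias) >> 16);
}

void f2bf_guarded_demo() {
    assert(f2bf_guarded(1.005859375f) == 0x3F81);  // still rounds to nearest
    assert(f2bf_guarded(FLT_MAX) == 0x7F7F);       // stays finite
    assert(f2bf_guarded(-FLT_MAX) == 0xFF7F);      // stays finite
}
```

The conditional is kept for readability; the comment in the snippet above says the real code avoids conditions, so treat this only as a reference for the intended results.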
The corresponding test has been added as well, and this is the output from the test:
```
[----------] 1 test from Core_BFloat
[ RUN ] Core_BFloat.convert
maxerr0 = 0.00774842, mean0 = 0.00190643, stddev0 = 0.00186063
maxerr1 = 0.00389057, mean1 = 0.000952614, stddev1 = 0.000931268
[ OK ] Core_BFloat.convert (7 ms)
```
Here `maxerr0, mean0, stddev0` are for the original method and `maxerr1, mean1, stddev1` are for the new method. As you can see, there is a significant improvement in accuracy.
**Note:**
_Actually, on ~32,000,000 random FP32 numbers with uniformly distributed sign, exponent and mantissa the new method is always at least as accurate as the old one._
The test also checks all the corner cases, where we see no degradation either vs the original method.
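For readers who want to reproduce numbers of this kind, the following is a rough sketch of such a measurement, not the `Core_BFloat.convert` test itself. It assumes the reported errors are relative round-trip errors (`|bf(x) - x| / |x|`), which matches the magnitudes in the log above but is not stated explicitly. The names `ErrStats` and `measure_bf16_error` are hypothetical, and the exponent field is restricted so that only finite, normal inputs are generated.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <random>

struct ErrStats {
    double maxerr;
    double mean;
};

// Relative round-trip error float -> bfloat16 -> float over n random inputs.
// round_bias == false: truncation; round_bias == true: add 0x8000 first.
ErrStats measure_bf16_error(bool round_bias, int n, unsigned seed) {
    std::mt19937 rng(seed);
    double maxerr = 0.0, sum = 0.0;
    for (int i = 0; i < n; i++) {
        // Uniformly random sign, exponent and mantissa. The exponent field
        // is limited to [1, 253] so inputs are normal and the bias cannot
        // carry into the inf/nan exponent.
        uint32_t sign = (uint32_t)rng() & 0x80000000u;
        uint32_t expo = 1u + (uint32_t)(rng() % 253u);
        uint32_t mant = (uint32_t)rng() & 0x007FFFFFu;
        uint32_t u = sign | (expo << 23) | mant;
        // Convert to bfloat16 and straight back to float bits.
        uint32_t v = ((u + (round_bias ? 0x8000u : 0u)) >> 16) << 16;
        float x, y;
        std::memcpy(&x, &u, sizeof(x));
        std::memcpy(&y, &v, sizeof(y));
        double err = std::fabs((double)y - (double)x) / std::fabs((double)x);
        maxerr = err > maxerr ? err : maxerr;
        sum += err;
    }
    return ErrStats{maxerr, sum / n};
}
```

With truncation the relative error is bounded by `2^-7`, with the bias by `2^-8`, which is where the roughly 2x gap between the two rows of the log comes from.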
- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
C-API cleanup: apps, imgproc_c and some constants #25075
Merge with https://github.com/opencv/opencv_contrib/pull/3642
* Removed obsolete apps - traincascade and createsamples (please use older OpenCV versions if you need them). These apps relied heavily on C-API
* removed all mentions of imgproc C-API headers (imgproc_c.h, types_c.h) - they were empty, included core C-API headers
* replaced usage of several C constants with C++ ones (error codes, norm modes, RNG modes, PCA modes, ...) - this is the bulk of the PR (split into two parts - all modules and calib+3d - for easier backporting)
* removed imgproc C-API headers (as separate commit, so that other changes could be backported to 4.x)
Most of these changes can be backported to 4.x.
Added clapack
* bring a small subset of Lapack, automatically converted to C, into OpenCV
* added missing lsame_ prototype
* * small fix in make_clapack script
* trying to fix remaining CI problems
* fixed character arrays' initializers
* get rid of F2C_STR_MAX
* * added back single-precision versions for QR, LU and Cholesky decompositions. It adds very little extra overhead.
* added stub version of sgesdd.
* uncommented calls to all the single-precision Lapack functions from opencv/core/src/hal_internal.cpp.
* fixed warning from Visual Studio + cleaned f2c runtime a bit
* * regenerated Lapack w/o forward declarations of intrinsic functions (such as sqrt(), r_cnjg() etc.)
* at once, trailing whitespaces are removed from the generated sources, just in case
* since there are no declarations of intrinsic functions anymore, we could turn some of them into inline functions
* trying to eliminate the crash on ARM
* fixed API and semantics of s_copy
* * CLapack has been tested successfully. It's now time to restore the standard LAPACK detection procedure
* removed some more trailing whitespaces
* * retained only the essential stuff in CLapack
* added checks to lapack calls to gracefully return "not implemented" instead of returning invalid results with "ok" status
* disabled warning when building lapack
* cmake: update LAPACK detection
Co-authored-by: Alexander Alekhin <alexander.a.alekhin@gmail.com>
Add a basic sanity test to verify the rounding functions
work as expected.
Likewise, extend the rounding performance test to cover the
additional float -> int fast math functions.
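As an illustration of what a rounding sanity test might check, here is a standalone sketch built only on `<cmath>`. The helpers `round_nearest`, `round_floor` and `round_ceil` are hypothetical stand-ins, not the OpenCV functions, and which OpenCV rounding functions the commit actually covers is not stated here. The sketch assumes the default floating-point rounding mode (round to nearest, ties to even).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for float -> int rounding helpers.
int round_nearest(float x) { return (int)std::lrint(x); }  // ties go to even
int round_floor(float x) { return (int)std::floor(x); }
int round_ceil(float x) { return (int)std::ceil(x); }

void rounding_sanity_check() {
    // Halfway cases are where implementations most often disagree.
    assert(round_nearest(2.5f) == 2);
    assert(round_nearest(3.5f) == 4);
    assert(round_nearest(-2.5f) == -2);

    // Negative inputs distinguish floor/ceil from plain truncation.
    assert(round_floor(-1.5f) == -2);
    assert(round_ceil(-1.5f) == -1);
    assert(round_floor(1.5f) == 1);
    assert(round_ceil(1.5f) == 2);
}
```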
* rewrote Mat::convertTo() and convertScaleAbs() to wide universal intrinsics; added always-available and SIMD-optimized FP16<=>FP32 conversion
* fixed compile warnings
* fix some more compile errors
* slightly relaxed accuracy threshold for int->float conversion (since we now do it using single-precision arithmetic, not double-precision)
* fixed compile errors on iOS, Android and in the baseline C++ version (intrin_cpp.hpp)
* trying to fix ARM-neon builds
* core:OE-27 prepare universal intrinsics to expand (#11022)
* core: Add universal intrinsics for AVX2
* updated implementation of wide univ. intrinsics; converted several OpenCV HAL functions: sqrt, invsqrt, magnitude, phase, exp to the wide universal intrinsics.
* converted log to universal intrinsics; cleaned up the code a bit; added v_lut_deinterleave intrinsics.
* core: Add universal intrinsics for AVX2
* fixed multiple compile errors
* fixed many more compile errors and hopefully some test failures
* fixed some more compile errors
* temporarily disabled IPP to debug exp & log; hopefully fixed Doxygen complaints
* fixed some more compile errors
* fixed v_store(short*, v_float16&) signatures
* trying to fix the test failures on Linux
* fixed some issues found by alalek
* restored IPP optimization after the patch with AVX wide intrinsics has been properly tested
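The relaxed int->float threshold mentioned in the list above follows from the fact that a 32-bit float has only 24 significant bits, so large integers are rounded as soon as they are converted. The sketch below illustrates the effect with two hypothetical helpers (`scale_shift_float`, `scale_shift_double`); it is not the `convertTo()` code, only a small demonstration of the precision difference.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: v * alpha + beta computed entirely in single precision.
float scale_shift_float(int32_t v, float alpha, float beta) {
    return (float)v * alpha + beta;
}

// Hypothetical helper: the same expression with a double-precision
// intermediate, rounded to float only at the end.
float scale_shift_double(int32_t v, double alpha, double beta) {
    return (float)((double)v * alpha + beta);
}

void int_to_float_demo() {
    // 2^24 + 1 is not representable in float and rounds to 2^24.
    const int32_t v = 16777217;
    assert((float)v == 16777216.0f);

    // Subtracting 2^24 exposes the lost unit in the float path.
    assert(scale_shift_float(v, 1.0f, -16777216.0f) == 0.0f);
    assert(scale_shift_double(v, 1.0, -16777216.0) == 1.0f);
}
```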
- 'if' logic is moved into templates.
- removed unnecessary cv::Mat objects creation.
- fixed inv() test (invA * A == eye)
- added more Matx tests to cover all defined template specializations
- removed tr1 usage (dropped in C++17)
- moved includes of vector/map/iostream/limits into ts.hpp
- require opencv_test + anonymous namespace (added compile check)
- fixed norm() usage (must be from cvtest::norm for checks) and other conflict functions
- added missing license headers
* add accuracy test and performance check for matmul
* add performance tests for transform and dotProduct
* add test Core_TransformLargeTest for 8u version of transform
* remove raw SSE2/NEON implementation from matmul.cpp
* use universal intrinsic instead of raw intrinsic
* remove unused templated function
* add v_matmuladd, which multiplies a 3x3 matrix and adds a 3x1 vector
* add v_rotate_left/right in universal intrinsic
* suppress intrinsics on some functions and platforms
* add pure SW implementation of new universal intrinsics
* add test for new universal intrinsics
* core: prevent memory access after the end of buffer
* fix perf tests
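A scalar reference for the new intrinsics named in the list above might look like the following sketch. The semantics are assumptions, not taken from the commit: `matmuladd_ref` treats `m0..m2` as matrix columns and computes `v[0]*m0 + v[1]*m1 + v[2]*m2 + a`, and the rotate helpers shift lanes by `N` positions and fill vacated lanes with zeros. All names ending in `_ref` and the `Vec4` alias are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Assumed semantics: r = v[0]*m0 + v[1]*m1 + v[2]*m2 + a (v[3] is ignored).
Vec4 matmuladd_ref(const Vec4& v, const Vec4& m0, const Vec4& m1,
                   const Vec4& m2, const Vec4& a) {
    Vec4 r{};
    for (std::size_t i = 0; i < 4; i++)
        r[i] = v[0] * m0[i] + v[1] * m1[i] + v[2] * m2[i] + a[i];
    return r;
}

// Assumed semantics: lane i receives lane i - N; vacated lanes become 0.
template <int N>
Vec4 rotate_left_ref(const Vec4& a) {
    Vec4 r{};
    for (int i = 0; i < 4; i++) {
        int src = i - N;
        if (src >= 0 && src < 4)
            r[static_cast<std::size_t>(i)] = a[static_cast<std::size_t>(src)];
    }
    return r;
}

// Assumed semantics: lane i receives lane i + N; vacated lanes become 0.
template <int N>
Vec4 rotate_right_ref(const Vec4& a) {
    Vec4 r{};
    for (int i = 0; i < 4; i++) {
        int src = i + N;
        if (src >= 0 && src < 4)
            r[static_cast<std::size_t>(i)] = a[static_cast<std::size_t>(src)];
    }
    return r;
}
```

A reference like this is mainly useful as the expected value in a unit test for a SIMD version: run both on the same inputs and compare lane by lane.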