opencv

mirror of https://github.com/opencv/opencv.git synced 2025-07-25 14:47:07 +08:00

History

Vadim Pisarevsky 2f35847960 Merge pull request #26321 from vpisarev:better_bfloat 2x more accurate float => bfloat conversion #26321 There is a magic trick to make float => bfloat conversion more accurate (_original reference needed, is it done this way in PyTorch?_). In simplified form it looks like: ``` uint16_t f2bf(float x) { union { unsigned u; float f; } u; u.f = x; // return (uint16_t)(u.u >> 16); <== the old method before this patch return (uint16_t)((u.u + 0x8000) >> 16); } ``` it works correctly for almost all valid floating-point values, positive, zero or negative, and even for some extreme cases, like `+/-inf`, `nan` etc. The addition of `0x8000` to integer representation of 32-bit float before retrieving the highest 16 bits reduces the rounding error by ~2x. The slight problem with this improved method is that the numbers very close to or equal to `+/-FLT_MAX` are mistakenly converted to `+/-inf`, respectively. This patch implements improved algorithm for `float => bfloat` conversion in scalar and vector form; it fixes the above-mentioned problem using some extra bit magic, i.e. 0x8000 is not added to very big (by absolute value) numbers: ``` // the actual implementation is more efficient, // without conditions or floating-point operations, see the source code return (uint16_t)(u.u + (fabsf(x) <= big_threshold ? 0x8000 : 0)) >> 16); ``` The corresponding test has been added as well and this is output from the test: ``` [----------] 1 test from Core_BFloat [ RUN ] Core_BFloat.convert maxerr0 = 0.00774842, mean0 = 0.00190643, stddev0 = 0.00186063 maxerr1 = 0.00389057, mean1 = 0.000952614, stddev1 = 0.000931268 [ OK ] Core_BFloat.convert (7 ms) ``` Here `maxerr0, mean0, stddev0` are for the original method and `maxerr1, mean1, stddev1` are for the new method. As you can see, there is a significant improvement in accuracy. Note: _Actually, on ~32,000,000 random FP32 numbers with uniformly distributed sign, exponent and mantissa the new method is always at least as accurate as the old one._ The test also checks all the corner cases, where we see no degradation either vs the original method. - [x] I agree to contribute to the project under Apache 2 License. - [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV - [x] The PR is proposed to the proper branch - [ ] There is a reference to the original bug report and related work - [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable Patch to opencv_extra has the same branch name. - [x] The feature is well documented and sample code can be built with the project CMake	2024-10-18 14:46:40 +03:00
..
core	Merge pull request #26321 from vpisarev:better_bfloat	2024-10-18 14:46:40 +03:00
core.hpp	Merge pull request #26056 from vpisarev:new_dnn_engine	2024-10-16 15:28:19 +03:00

Vadim Pisarevsky 2f35847960

Merge pull request #26321 from vpisarev:better_bfloat

2x more accurate float => bfloat conversion #26321

There is a magic trick to make float => bfloat conversion more accurate (_original reference needed, is it done this way in PyTorch?_). In simplified form it looks like:

```
uint16_t f2bf(float x) {
    union {
        unsigned u;
        float f;
    } u;
    u.f = x;
    // return (uint16_t)(u.u >> 16); <== the old method before this patch
    return (uint16_t)((u.u + 0x8000) >> 16);
} 
```

it works correctly for almost all valid floating-point values, positive, zero or negative, and even for some extreme cases, like `+/-inf`, `nan` etc. The addition of `0x8000` to integer representation of 32-bit float before retrieving the highest 16 bits reduces the rounding error by ~2x.

The slight problem with this improved method is that the numbers very close to or equal to `+/-FLT_MAX` are mistakenly converted to `+/-inf`, respectively.

This patch implements improved algorithm for `float => bfloat` conversion in scalar and vector form; it fixes the above-mentioned problem using some extra bit magic, i.e. 0x8000 is not added to very big (by absolute value) numbers:

```
// the actual implementation is more efficient,
// without conditions or floating-point operations, see the source code
return (uint16_t)(u.u + (fabsf(x) <= big_threshold ? 0x8000 : 0)) >> 16);
```

The corresponding test has been added as well and this is output from the test:

```
[----------] 1 test from Core_BFloat
[ RUN      ] Core_BFloat.convert
maxerr0 = 0.00774842, mean0 = 0.00190643, stddev0 = 0.00186063
maxerr1 = 0.00389057, mean1 = 0.000952614, stddev1 = 0.000931268
[       OK ] Core_BFloat.convert (7 ms)
```

Here `maxerr0, mean0, stddev0` are for the original method and `maxerr1, mean1, stddev1` are for the new method. As you can see, there is a significant improvement in accuracy.

**Note:**

_Actually, on ~32,000,000 random FP32 numbers with uniformly distributed sign, exponent and mantissa the new method is always at least as accurate as the old one._

The test also checks all the corner cases, where we see no degradation either vs the original method.

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake

2024-10-18 14:46:40 +03:00

core

Merge pull request #26321 from vpisarev:better_bfloat

2024-10-18 14:46:40 +03:00

core.hpp

Merge pull request #26056 from vpisarev:new_dnn_engine

2024-10-16 15:28:19 +03:00