Skip to content

Improve sqrt with SoftFloat-style lookup table and integer arithmetic#1331

Open
justinzhuguangwen wants to merge 32 commits intoboostorg:developfrom
justinzhuguangwen:develop
Open

Improve sqrt with SoftFloat-style lookup table and integer arithmetic#1331
justinzhuguangwen wants to merge 32 commits intoboostorg:developfrom
justinzhuguangwen:develop

Conversation

@justinzhuguangwen
Copy link

@justinzhuguangwen justinzhuguangwen commented Feb 4, 2026

Description

Summary

Replace Padé + Newton-Raphson sqrt with a SoftFloat-inspired lookup table and integer-only Newton refinement. All arithmetic stays in integers until final conversion, reducing rounding errors and improving throughput.

Changes

New files:

  • doc/decimal_float_and_sqrt.md: Documentation on decimal vs binary float representation, bit layout, and sqrt algorithm
  • include/boost/decimal/detail/cmath/impl/sqrt_lookup.hpp: 90-entry k0/k1 lookup table
  • include/boost/decimal/detail/cmath/impl/approx_recip_sqrt_impl.hpp: approx_recip_sqrt32 / approx_recip_sqrt64, integer-only 1/√x approximation via table lookup + Newton refinement
  • include/boost/decimal/detail/cmath/impl/sqrt32_impl.hpp: Integer sqrt for decimal32 using approx_recip_sqrt32 and Newton correction
  • include/boost/decimal/detail/cmath/impl/sqrt64_impl.hpp: Integer sqrt for decimal64 using approx_recip_sqrt64 and Newton correction; fallback path for targets without __int128 (MSVC, 32-bit)
  • include/boost/decimal/detail/cmath/impl/sqrt128_impl.hpp: u256-based integer sqrt for decimal128, frexp10 for exact significand, round-to-nearest

Modified files:

  • sqrt.hpp: Switched to new impl headers, replacing Padé + Newton-Raphson implementation
  • test/test_cmath.cpp: Updated sqrt-related tests
  • test/test_sqrt.cpp: Updated sqrt-related tests

Performance (sqrt_bench.py, baseline vs current)

Type Baseline Current Speedup
decimal32_t 1.59M ops/s 8.90M ops/s ✓ 5.59x (+458.9%)
decimal64_t 0.78M ops/s 4.17M ops/s ✓ 5.33x (+432.9%)
decimal128_t 0.24M ops/s 0.77M ops/s ✓ 3.15x (+214.8%)
decimal_fast32_t 1.80M ops/s 8.51M ops/s ✓ 4.72x (+372.3%)
decimal_fast64_t 0.81M ops/s 3.74M ops/s ✓ 4.64x (+363.9%)
decimal_fast128_t 0.22M ops/s 0.72M ops/s ✓ 3.21x (+221.2%)

Note: sqrt_bench.py and ./run_srqt_test.sh are not part of this PR. To run them, use commit d54af195e45f1207c7d55fcdb26f5890d9aafbbd.

Tests

  • CI covers sqrt-related tests. For local runs, ./run_srqt_test.sh passes at commit d54af195e45f1207c7d55fcdb26f5890d9aafbbd (decimal32, decimal64, decimal128, github_issue_1107, github_issue_1110).

Documentation

  • doc/decimal_float_and_sqrt.md: New document describing current algorithm and implementation details

Refs #1311

justinzhu@home added 23 commits January 25, 2026 03:05
Convert decimal to double, use std::sqrt, convert back. Added baseline,
benchmarks, and build scripts. Updated .gitignore.
- Add decimal lookup table for 1/sqrt(x) approximation
- Implement SoftFloat-style remainder elimination
- Add safety checks to prevent inf/NaN results
- Maintain C++14 compatibility with out-of-line definitions

Performance: ~4-5x speedup over baseline implementation
Replace Padé approximation with 32-entry lookup table approach,
reducing Newton-Raphson iterations from 3-5 to 1-3 based on precision.

- Add sqrt lookup tables for [0.1, 1.0) range with interpolation
- Reduce iterations: decimal32 (1), decimal64 (2), decimal128 (3)
- Maintain existing normalization and exponent handling

Performance: Fewer divisions per sqrt call with faster convergence.

Known issues: Some precision gaps remain, particularly for decimal32.
Further optimization needed for interpolation and edge cases.
Replace Padé approximation with 1/sqrt(x) lookup table and linear
interpolation, reducing Newton iterations from 3-5 to 1-2.
Replace Padé + Newton-Raphson with decimal-native lookup table
algorithm. Uses 128-entry table with linear interpolation for
initial approximation, then remainder elimination refinement using
multiplication only (no division). Reduces iterations from 3-5 to 1-3.

Experimental: precision tuning may be needed.
- Replace 128-entry table with 90-entry table (step 0.1)
- Use integer remainder for decimal32 (uint64) and decimal64 (uint128)
- Add 1/2/4 Newton iterations for decimal32/64/128
- Add generate_sqrt_tables.py for table generation
- Update decimal_float_and_sqrt.md with algorithm details
- Rewrite approx_recip_sqrt32/64 to match SoftFloat algorithm
- Use working scale and Newton iterations to avoid precision loss
- Switch sqrt32_impl and sqrt64_impl to integer approx_recip_sqrt
- Fix sqrt32 and sqrt64 test failures
- approx_recip_sqrt32/64: rewrite with integer-only Newton iterations
- sqrt32_impl: use approx_recip_sqrt32 for integer-based sqrt
- sqrt64_impl: use approx_recip_sqrt64 with __int128 when available
- sqrt64_impl: fallback path for platforms without __int128 (MSVC, 32-bit)
- sqrt128_impl: u256-based integer sqrt with frexp10 and round-to-nearest

Performance (ops/s):
- decimal32_t:      1.63M -> 9.55M  (5.86x)
- decimal64_t:      0.74M -> 4.39M  (5.97x)
- decimal128_t:     0.31M -> 0.75M  (2.41x)
- decimal_fast32_t: 1.81M -> 8.63M  (4.76x)
- decimal_fast64_t: 0.80M -> 3.74M  (4.69x)
- decimal_fast128_t: 0.22M -> 0.69M (3.13x)

All sqrt tests pass for decimal32/64/128 and sqrt64 fallback path.
- sqrt_tables.hpp -> sqrt_lookup.hpp
- approx_recip_sqrt.hpp -> approx_recip_sqrt_impl.hpp
- remove approx_recip_sqrt_1
- approxRecipSqrt_1k0s/k1s -> recip_sqrt_k0s/k1s
- Add lookup table algorithm for sqrt optimization
- decimal32/64/128: 4.72x, 5.66x, 2.85x speedup respectively
- Remove temporary test files
- Add lookup table algorithm for sqrt optimization
- decimal32/64/128: 4.72x, 5.66x, 2.85x speedup respectively
- Remove temporary test files
@cppalliance-bot
Copy link

cppalliance-bot commented Feb 4, 2026

An automated preview of the documentation is available at https://1331.decimal.prtest3.cppalliance.org/libs/decimal/doc/html/index.html

If more commits are pushed to the pull request, the docs will rebuild at the same URL.

2026-02-05 06:36:02 UTC

@ckormanyos
Copy link
Member

ckormanyos commented Feb 4, 2026

Hi @justinzhuguangwen I just approved your workflow run. In this repo, first-time contributors require workflow run approval.

Looks like you're getting good perf results now. CI will be interesting.

You're getting hit by some compiler warnings. Decimal runs high warning levels with -Werror. I should have warned you a bit more on this up front.

So don't be initially shocked if you get a lot of failing runs in the first try. Usually, these matters can be cleared up in a few trials and everything gets OK.

Our CI is setup to run faster runs on GHA and a whole bunch of slow tests on Drone. Some of the drone tests already failed on the pesky -Werror. But again, these matters can usually be cleared up with a few edits.

In fact, I just got hit by -Werror on my recent frexp PR too.

Cc: @mborland

@justinzhuguangwen
Copy link
Author

Hi @ckormanyos, thanks for approving the workflow and for the heads-up about the strict warning levels.

I've pushed a fix for the -Wsign-conversion error in approx_recip_sqrt64 (in approx_recip_sqrt_impl.hpp). The change is in the latest commit on this PR.

I've started the workflow on my fork to verify the fix. Once it passes, I'll trigger a re-run here if needed.

Thanks again for the guidance.

Copy link
Member

@ckormanyos ckormanyos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might prefer:

std::uint64_t base_sig = static_cast<std::uint64_t>((static_cast<unsigned int>(index) + 10U) * UINT64_C(100000000000000));

I hope this is syntactically correct and lets the compiler do unsigned * u64 -> u64 multiply

@ckormanyos
Copy link
Member

ckormanyos commented Feb 4, 2026

I've started the workflow on my fork to verify the fix. Once it passes, I'll trigger a re-run here if needed.

Nice.

Oh also, when we commit to a branch having a running PR with a running CI/CD, Matt (@mborland) has set it up to halt the running CI/CD and restart it. This is done both on GHA as well as on Drone.

So it is not uncommon for us to commit, re-commmit, and so on until it gets right. You don't need to be, let's say, ultra-conservative about that.

@ckormanyos
Copy link
Member

ckormanyos commented Feb 4, 2026

@justinzhuguangwen thank you for your amazing algorithmic and coding work. It looks like this thing is going to go green soon.

Hi Matt (@mborland) would you like to look over this work regarding matters of style and Boost/Decimal consistency?

Upon going green, I'd like to get this into 1.91.

@justinzhuguangwen
Copy link
Author

Hi @ckormanyos, thanks for the kind words and for the review.

The CI checks triggered by the latest commit have all passed. Happy to address any follow-up if needed.

Note: This PR has many commits from incremental updates. When merging, I'd suggest using squash merge and using the PR description as the squash commit message to keep the history clean.

Thanks again for the guidance.

Copy link
Member

@mborland mborland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! My comments are really around other potential points of optimization, and I think they could apply a few different places in the code. Let me know if you have questions.

@ckormanyos
Copy link
Member

CI has been started for these next improvements.

@ckormanyos
Copy link
Member

Hi Justin @justinzhuguangwen, I think a trivial warning has crept in.

One of the CI logs shows:

2026-02-04T17:16:24.2135712Z ./boost/decimal/detail/cmath/impl/sqrt128_impl.hpp:72:29: error: unused variable 'scale15' [-Werror,-Wunused-variable]
2026-02-04T17:16:24.2136834Z     constexpr std::uint64_t scale15 = 1000000000000000ULL;    // 10^15
2026-02-04T17:16:24.2137441Z                             ^
2026-02-04T17:16:24.2137816Z 1 error generated.

It's annayong, ... we know, but probably worth a lot of quality in the end.

@justinzhuguangwen
Copy link
Author

Hi Justin @justinzhuguangwen, I think a trivial warning has crept in.

One of the CI logs shows:

2026-02-04T17:16:24.2135712Z ./boost/decimal/detail/cmath/impl/sqrt128_impl.hpp:72:29: error: unused variable 'scale15' [-Werror,-Wunused-variable]
2026-02-04T17:16:24.2136834Z     constexpr std::uint64_t scale15 = 1000000000000000ULL;    // 10^15
2026-02-04T17:16:24.2137441Z                             ^
2026-02-04T17:16:24.2137816Z 1 error generated.

It's annayong, ... we know, but probably worth a lot of quality in the end.

Fixed. Running CI in my fork to confirm.

@codecov
Copy link

codecov bot commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 92.70833% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.8%. Comparing base (e72fd34) to head (b1f37c5).

Files with missing lines Patch % Lines
...e/boost/decimal/detail/cmath/impl/sqrt128_impl.hpp 91.4% 6 Missing ⚠️
...de/boost/decimal/detail/cmath/impl/sqrt32_impl.hpp 86.7% 4 Missing ⚠️
...de/boost/decimal/detail/cmath/impl/sqrt64_impl.hpp 82.7% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #1331     +/-   ##
=========================================
- Coverage     98.8%   98.8%   -0.0%     
=========================================
  Files          278     282      +4     
  Lines        18034   18185    +151     
  Branches      1918    1916      -2     
=========================================
+ Hits         17812   17950    +138     
- Misses         222     235     +13     
Files with missing lines Coverage Δ
...cimal/detail/cmath/impl/approx_recip_sqrt_impl.hpp 100.0% <100.0%> (ø)
include/boost/decimal/detail/cmath/sqrt.hpp 100.0% <100.0%> (ø)
test/test_cmath.cpp 100.0% <ø> (ø)
test/test_sqrt.cpp 100.0% <100.0%> (ø)
...de/boost/decimal/detail/cmath/impl/sqrt32_impl.hpp 86.7% <86.7%> (ø)
...de/boost/decimal/detail/cmath/impl/sqrt64_impl.hpp 82.7% <82.7%> (ø)
...e/boost/decimal/detail/cmath/impl/sqrt128_impl.hpp 91.4% <91.4%> (ø)

... and 4 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e72fd34...b1f37c5. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Add decimal_fast32_t, decimal128_t test_sqrt_edge, decimal_fast128_t
- Add perfect squares (sqrt(4), sqrt(9)) to cover rem==0 branch
- Add non-perfect squares (sqrt(2), sqrt(5)) to cover Newton correction
- Add dense sampling [1.01, 9.99] to exercise rem<0 and other edge paths
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants