
Type: Improvement

Status: Open

Priority: Minor

Resolution: Unresolved

Affects Version/s: 1.0

Fix Version/s: None

Component/s: arrays

Labels:None
The algorithm in LinearCombination is an implementation of dot2 from Ogita el al (algorithm 5.3). There is a subtle difference in that the original dot2 algorithm sums the roundoff from the products and the roundoff from the product summation together. The method in LinearCombination sums them separately (using an extra variable) and adds them at the end. This actually improves the accuracy under conditions where the roundoff is of greater sum than the products, as described below.
The dot2 algorithm suffers when the total sum is close to 0 but the intermediate sum is far enough from zero that there is a large difference between the exponents of summed terms and the final result. In this case the sum of the roundoff is more important than the sum of the products which due to massive cancellation is zero. The case is important for Complex numbers which require a computation of log1p(x^2+y^21) when x^2+y^2 is close to 1 such that log(1) would be ~zero yet the true logarithm is representable to very high precision.
This can be protected against by using the dotK algorithm of Ogita et al with K>2. This saves all the roundoff parts from each product and the running sum. These are subject to an error free transform that repeatedly adds adjacent pairs to generate a new split pair with a closer upper and lower part. Over time this will order the parts from low to high and these can be summed low first for an error free dot product.
Using this algorithm with a single pass (K=3 for dot3) removes the cancellation error observed in the mentioned use case. Adding a single pass over the parts changes the algorithm from 25n floating point operations (flops) to 31n flops for the sum of n products.
A second change for the algorithm is to switch to using Dekker's algorithm (Dekker, 1971) to split the number. This extracts two 26bit mantissas from a 53bit mantis (in IEEE 754 the leading bit in front of the of the 52bit mantissa is assumed to be 1). This is done by multiplication by 2^s+1 with s = ceil(53/2) = 27:
big = (2^s+1) * a
a_hi = (big  (big  a))
The extra bit of precision is carried into the sign of the low part of the split number.
This is in contrast to the method in LinearCombination that uses a simple mask on the long representation to obtain the a_hi part in 26bits and the lower part will be 27 bits.
The advantage of Dekker's method is it produces 2 parts with 26 bits in the mantissa that can be multiplied exactly. The disadvantage is the potential for overflow requiring a branch condition to check for extreme values.
It also appropriately handles very small subnormal numbers that would be masked to create a 0 high part with all the nonzero bits left in the low part using the current method. This will have knock on effects on split multiplication which requires the high part to be larger.
A simple change to the current implementation to use Dekker's split improves the precision on a wide range of test data (not shown).
 links to