(gmp.info.gz) Binary GCD

(gmp.info.gz) Greatest Common Divisor Algorithms
 
 Binary GCD
 ----------
 
 At small sizes GMP uses an O(N^2) binary style GCD.  This is described
 in many textbooks, for example Knuth section 4.5.2 algorithm B.  It
 simply consists of successively reducing odd operands a and b using
 
      a,b = abs(a-b),min(a,b)
      strip factors of 2 from a
 
    The Euclidean GCD algorithm, as per Knuth algorithms E and A,
 reduces using a mod b but this has so far been found to be slower
 everywhere.  One reason the binary method does well is that the implied
 quotient at each step is usually small, so often only one or two
 subtractions are needed to get the same effect as a division.
 Quotients 1, 2 and 3 for example occur 67.7% of the time, see Knuth
 section 4.5.3 Theorem E.
 
    When the implied quotient is large, meaning b is much smaller than
 a, then a division is worthwhile.  This is the basis for the initial a
 mod b reductions in `mpn_gcd' and `mpn_gcd_1' (the latter for both Nx1
 and 1x1 cases).  But after that initial reduction, big quotients occur
 too rarely to make it worth checking for them.
 
 
    The final 1x1 GCD in `mpn_gcd_1' is done in the generic C code as
 described above.  For two N-bit operands, the algorithm takes about
 0.68 iterations per bit.  For optimum performance some attention needs
 to be paid to the way the factors of 2 are stripped from a.
 
    Firstly it may be noted that in twos complement the number of low
 zero bits on a-b is the same as b-a, so counting or testing can begin on
 a-b without waiting for abs(a-b) to be determined.
 
    A loop stripping low zero bits tends not to branch predict well,
 since the condition is data dependent.  But on average there's only a
 few low zeros, so an option is to strip one or two bits arithmetically
 then loop for more (as done for AMD K6).  Or use a lookup table to get
 a count for several bits then loop for more (as done for AMD K7).  An
 alternative approach is to keep just one of a or b odd and iterate
 
      a,b = abs(a-b), min(a,b)
      a = a/2 if even
      b = b/2 if even
 
    This requires about 1.25 iterations per bit, but stripping of a
 single bit at each step avoids any branching.  Repeating the bit strip
 reduces to about 0.9 iterations per bit, which may be a worthwhile
 tradeoff.
 
    Generally with the above approaches a speed of perhaps 6 cycles per
 bit can be achieved, which is still not terribly fast with for instance
 a 64-bit GCD taking nearly 400 cycles.  It's this sort of time which
 means it's not usually advantageous to combine a set of divisibility
 tests into a GCD.
Info Catalog
(gmp.info.gz) Greatest Common Divisor Algorithms
(gmp.info.gz) Accelerated GCD
automatically generated byinfo2html