Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
230 views
in Technique[技术] by (71.8m points)

c - Conditional tests in primality by trial division

My question is about the conditional test in trial division. There seems to be some debate on what conditional test to employ. Let's look at the code for this from RosettaCode.

int is_prime(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

Wheel factorization or using a predetermined list of primes will not change the essence of my question.

There are three cases I can think of to do the conditional test:

  1. p<=n/p
  2. p*p<=n
  3. int cut = sqrt(n); for (p = 3; p <= cut; p += 2)

Case 1: works for all n but it has to do an extra division every iteration (edit: actually it does not need an extra division but it's still slower. I'm not sure why. See the assembly output below). I have found it to be twice as slow as case 2 for large values of n which are prime (on my Sandy Bridge system).

Case 2: is signficantly faster than case 1 but it has a problem that it overflows for large n and goes into an infinitive loop. The max value it can handle is

(sqrt(n) + c)^2 = INT_MAX  //solve
n = INT_MAX -2*c*sqrt(INT_MAX) + c^2
//INT_MAX = 2^32 -> n = 2^32 - c*s^17 + c^2; in our case c = 2

For example for uint64_t case 2 will go into an infinite loop for x =-1L-58 (x^64-59) which is a prime.

Case 3: no division or multiplication has to be done every iteration and it does not overflow like case 2. It's slightly faster than case 2 as well. The only question is if sqrt(n) is accurate enough.

Can someone explain to me why case 2 is so much faster than case 1? Case 1 does NOT use an extra division as I though but despite that it's still a lot slower.

Here are the times for the prime 2^56-5;

case 1 9.0s
case 2 4.6s
case 3 4.5s

Here is the code I used to test this http://coliru.stacked-crooked.com/a/69497863a97d8953. I also added the functions to the end of this question.

Here is the assembly output form GCC 4.8 with -O3 for case 1 and case 2. They both only have one division. Case 2 has a multiplication as well so my first guess is that case 2 would be slower but it's about twice as fast on both GCC and MSVC. I don't know why.

Case 1:

.L5:
  testl %edx, %edx
  je  .L8
.L4:
  addl  $2, %ecx
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  cmpl  %ecx, %eax
  jae .L5

Case 2:

.L20:
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  testl %edx, %edx
  je  .L23
.L19:
  addl  $2, %ecx
  movl  %ecx, %eax
  imull %ecx, %eax
  cmpl  %eax, %edi
  jae .L20

Here are the functions I'm using to test the time:

int is_prime(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime2(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p*p <= n; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime3(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    uint32_t cut = sqrt(n);
    for (p = 3; p <= cut; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

Added content after the bounty.

Aean discovered that in case 1 saving the quotient as well as the remainder is just as fast (or slightly faster) than case 2. Let's call this case 4. The following following code is twice as fast as case 1.

int is_prime4(uint64_t n)
{
    uint64_t p, q, r;
    if (!(n & 1) || n < 2 ) return n == 2;

    for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
        if (!r) return 0;
    return 1;
}

I'm not sure why this helps. In any case there is no need to use case 2 anymore. For case 3, most versions of the sqrt function in hardware or software get the perfect squares right so it's safe to use in general. Case 3 is the only case that will work with OpenMP.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

UPD: This is a compiler optimization issue, obviously. While MinGW used only one div instruction in loop body, both GCC on Linux and MSVC failed to reuse the quotient from previous iteration.

I think the best we could do is explicitly define quo and rem and calculate them in the same basic instruction block, to show the compiler we want both quotient and remainder.

int is_prime(uint64_t n)
{
    uint64_t p = 3, quo, rem;
    if (!(n & 1) || n < 2) return n == 2;

    quo = n / p;
    for (; p <= quo; p += 2){
        quo = n / p; rem = n % p;
        if (!(rem)) return 0;
    }
    return 1;
}

I tried your code from http://coliru.stacked-crooked.com/a/69497863a97d8953 on a MinGW-w64 compiler, case 1 is faster than case 2.

enter image description here

So I guess you are compiling targeted to a 32-bit architecture and used uint64_t type. Your assembly shows it doesn't use any 64-bit register.

If I got it right, there is the reason.

On 32-bit architecture, 64-bit numbers is represented in two 32-bit registers, your compiler will do all concatenation works. It's simple to do 64-bit addition, subtraction and multiplication. But modulo and division is done by a small function call which named as ___umoddi3 and ___udivdi3 in GCC, aullrem and aulldiv in MSVC.

So actually you need one ___umoddi3 and one ___udivdi3 for each iteration in case 1, one ___udivdi3 and one concatenation of 64-bit multiplication in case 2. That's why case 1 seems twice slower than case 2 in your test.

What you really get in case 1:

L5:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___udivdi3         // A call for div
    cmpl    %edi, %edx
    ja  L6
    jae L21
L6:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___umoddi3        // A call for modulo.
    orl %eax, %edx
    jne L5

What you really get in case 2:

L26:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, %eax
    movl    %edi, %ecx
    imull   %esi, %ecx
    mull    %esi
    addl    %ecx, %ecx
    addl    %ecx, %edx
    cmpl    %edx, %ebx
    ja  L27
    jae L41
L27:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebp, (%esp)
    movl    %ebx, 4(%esp)
    call    ___umoddi3         // Just one call for modulo
    orl %eax, %edx
    jne L26

MSVC failed to reuse the result of div. The optimization is broken by return. Try these code:

__declspec(noinline) int is_prime_A(unsigned int n)
{
    unsigned int p;
    int ret = -1;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p >= n / p) ret = 1; /* Let's return latter outside the loop. */
        if (!(n % p)) ret = 0;
    } while (ret < 0);
    return ret;
}

__declspec(noinline) int is_prime_B(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p > n / p) return 1; /* The common routine. */
        if (!(n % p)) return 0;
    } while (1);
}

The is_prime_B will be twice slower than is_prime_A on MSVC / ICC for windows.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...