Compare such an approach to a (pre-generated) sieve. Modulo is expensive, so both approaches essentially do two things: generate potential factors, and perform modulo operations. Either program should reasonably generate a new candidate factor in less cycles than modulo takes, so either program is modulo bound.
The given approach filters out a constant proportion of all integers, namely the multiples of 2 and 3, or 75%. One in four (as given) numbers is used as an argument to the modulo operator. I'll call it a skip filter.
On the other hand, a sieve uses only primes as arguments to the modulo operator, and the average difference between successive primes is governed by the prime number theorem to be 1/ln(N). For example, e^20 is just under 500 million, so numbers over 500 million have under a 5% chance of being prime. If all numbers up to 2^32 are considered, 5% is a good rule of thumb.
Therefore, a sieve will spend 5 times less time on div
operations as your skip filter. The next factor to consider is the speed at which the sieve produces primes, i.e. reads them from memory or disk. If fetching one prime is faster than 4 div
s, then the sieve is faster. According to my tables div
throughput on my Core2 is at most one per 12 cycles. These will be hard division problems, so let's conservatively budget 50 cycles per prime. For a 2.5 GHz processor, that's 20 nanoseconds.
In 20 ns, a 50 MB/sec hard drive can read about one byte. The simple solution is to use 4 bytes per prime, so the drive will be slower. But, we can be more clever. If we want to encode all the primes in order, we can just encode their differences. Again, the expected difference is 1/ln(N). Also, they're all even, which saves an extra bit. And they are never zero, which makes extension to a multibyte encoding free. So using one byte per prime, differences up to 512 can be stored in one byte, which gets us up to 303371455241 according to that Wikipedia article.
Therefore, depending on the hard drive, a stored list of primes should be about equal in speed at verifying primality. If it can be stored in RAM (it's 203 MB, so subsequent runs will probably hit the disk cache), then the problem goes away entirely, as the FSB speed typically differs from the processor speed by a factor less than the FSB width in bytes — i.e., the FSB can transfer more than one prime per cycle. Then factor of improvement is the reduction in division operations, i.e. five times. This is borne out by the experimental results below.
Of course, then there is multithreading. Ranges of either primes or skip-filtered candidates can be assigned to different threads, making either approach embarrassingly parallel. There are no optimizations that don't involve increasing the number of parallel divider circuits, unless you somehow eliminate the modulo.
Here is such a program. It's templated so you could add bignums.
/*
* multibyte_sieve.cpp
* Generate a table of primes, and use it to factorize numbers.
*
* Created by David Krauss on 10/12/10.
*
*/
#include <cmath>
#include <bitset>
#include <limits>
#include <memory>
#include <fstream>
#include <sstream>
#include <iostream>
#include <iterator>
#include <stdint.h>
using namespace std;
char const primes_filename[] = "primes";
enum { encoding_base = (1<< numeric_limits< unsigned char >::digits) - 2 };
template< typename It >
unsigned decode_gap( It &stream ) {
unsigned gap = static_cast< unsigned char >( * stream ++ );
if ( gap ) return 2 * gap; // only this path is tested
gap = ( decode_gap( stream )/2-1 ) * encoding_base; // deep recursion
return gap + decode_gap( stream ); // shallow recursion
}
template< typename It >
void encode_gap( It &stream, uint32_t gap ) {
unsigned len = 0, bytes[4];
gap /= 2;
do {
bytes[ len ++ ] = gap % encoding_base;
gap /= encoding_base;
} while ( gap );
while ( -- len ) { // loop not tested
* stream ++ = 0;
* stream ++ = bytes[ len + 1 ];
}
* stream ++ = bytes[ 0 ];
}
template< size_t lim >
void generate_primes() {
auto_ptr< bitset< lim / 2 > > sieve_p( new bitset< lim / 2 > );
bitset< lim / 2 > &sieve = * sieve_p;
ofstream out_f( primes_filename, ios::out | ios::binary );
ostreambuf_iterator< char > out( out_f );
size_t count = 0;
size_t last = sqrtl( lim ) / 2 + 1, prev = 0, x = 1;
for ( ; x != last; ++ x ) {
if ( sieve[ x ] ) continue;
size_t n = x * 2 + 1; // translate index to number
for ( size_t m = x + n; m < lim/2; m += n ) sieve[ m ] = true;
encode_gap( out, ( x - prev ) * 2 );
prev = x;
}
for ( ; x != lim / 2; ++ x ) {
if ( sieve[ x ] ) continue;
encode_gap( out, ( x - prev ) * 2 );
prev = x;
}
cout << prev * 2 + 1 << endl;
}
template< typename I >
void factorize( I n ) {
ifstream in_f( primes_filename, ios::in | ios::binary );
if ( ! in_f ) {
cerr << "Could not open primes file.
"
"Please generate it with 'g' command.
";
return;
}
while ( n % 2 == 0 ) {
n /= 2;
cout << "2 ";
}
unsigned long factor = 1;
for ( istreambuf_iterator< char > in( in_f ), in_end; in != in_end; ) {
factor += decode_gap( in );
while ( n % factor == 0 ) {
n /= factor;
cout << factor << " ";
}
if ( n == 1 ) goto finish;
}
cout << n;
finish:
cout << endl;
}
int main( int argc, char *argv[] ) {
if ( argc != 2 ) goto print_help;
unsigned long n;
if ( argv[1][0] == 'g' ) {
generate_primes< (1ul<< 32) >();
} else if ( ( istringstream( argv[1] ) >> n ).rdstate() == ios::eofbit )
factorize( n );
} else goto print_help;
return 0;
print_help:
cerr << "Usage:
" << argv[0] << " <number> -- factorize number.
"
"" << argv[0] << " g -- generate primes file in current directory.
";
}
Performance on a 2.2 GHz MacBook Pro:
dkrauss$ time ./multibyte_sieve g
4294967291
real 2m8.845s
user 1m15.177s
sys 0m2.446s
dkrauss$ time ./multibyte_sieve 18446743721522234449
4294967231 4294967279
real 0m5.405s
user 0m4.773s
sys 0m0.458s
dkrauss$ time ./mike 18446743721522234449
4294967231 4294967279
real 0m25.147s
user 0m24.170s
sys 0m0.096s