My answer to the question is: Yes, you can definitely use the Task Parallel Library (TPL) to find the primes to one billion faster. The code given in the question is slow because it uses memory inefficiently, does not use multiprocessing, and the final counting pass over the results is also inefficient.
So, other than just multiprocessing, there are a great number of things you can do to speed up the Sieve of Eratosthenes, as follows:
- You sieve all numbers, even and odd, which both uses more memory
  (one billion bytes for your range of one billion) and is slower due
  to the unnecessary processing. Using the fact that two is the only
  even prime and making the array represent only odd candidates would
  halve the memory requirements and reduce the number of composite
  number cull operations by more than a factor of two, so that the
  operation might take something like 20 seconds on your machine for
  primes to a billion (an odds-only sketch follows this list).
- Part of the reason that composite number culling over such a huge
  memory array is so slow is that it greatly exceeds the CPU cache
  sizes, so that many memory accesses go to main memory in a somewhat
  random fashion, meaning that culling a given composite number
  representation can take over a hundred CPU clock cycles, whereas if
  they were all in the L1 cache it would only take one cycle and in
  the L2 cache only about four cycles; not all accesses take the worst
  case times, but this definitely slows the processing. Using a bit
  packed array to represent the prime candidates reduces the use of
  memory by a factor of eight and makes the worst case accesses less
  common. While there is a computational overhead to accessing
  individual bits, you will find there is a net gain, as the saving
  in average memory access time is greater than this cost. The simple
  way to implement this is to use a BitArray rather than an array of
  bool, although writing your own bit accesses using shift and "and"
  operations is more efficient than using the BitArray class. You
  will find a slight saving using BitArray and another factor of two
  doing your own bit operations, for a single threaded performance of
  perhaps about ten or twelve seconds with this change (a bit packed
  sketch follows this list).
- Your output of the count of primes found is not very efficient, as
  it requires an array access and an if condition per candidate prime.
  Once you have the sieve buffer as a bit packed word array, you can
  do this much more efficiently with a counting Look Up Table (LUT),
  which eliminates the if condition and only needs two array accesses
  per bit packed word. Doing this, the time to count becomes a
  negligible part of the work as compared to the time to cull
  composite numbers, for a further saving to get down to perhaps
  eight seconds for the count of the primes to one billion (the bit
  packed sketch below also shows the counting LUT).
- Further reductions in the number of processed prime candidates can
  be obtained by applying wheel factorization, which removes, say, the
  factors of the primes 2, 3, and 5 from the processing and, by
  adjusting the method of bit packing, can also increase the effective
  range of a given size bit buffer by about another factor of two.
  This can reduce the number of composite number culling operations by
  another huge factor of up to over three times, although at the cost
  of further computational complexity (a small wheel pattern sketch
  follows this list).
- In order to further reduce memory use, make memory accesses even
  more efficient, and prepare the way for multiprocessing per page
  segment, one can divide the work into pages that are no larger than
  the L1 or L2 cache sizes. This requires that one keep a base primes
  table of all the primes up to the square root of the maximum prime
  candidate and recompute the starting address parameters of each of
  the base primes used in culling across a given page segment, but
  this is still more efficient than using huge culling arrays. An
  added benefit of implementing this page segmentation is that one
  then does not have to specify the upper sieving limit in advance,
  but rather can just extend the base primes as necessary as further
  upper pages are processed. With all of the optimizations to this
  point, you can likely produce the count of primes up to one billion
  in about 2.5 seconds (a page segmented sketch follows this list).
- Finally, one can put the final touches on multiprocessing the page
  segments using TPL or Threads, which, using a buffer size of about
  the L2 cache size (per core), will produce an additional gain of a
  factor of two on your dual core non Hyper Threaded (HT) older
  processor such as the Intel E7500 Core2Duo, for an execution time
  to find the number of primes to one billion of about 1.25 seconds
  or so (a TPL sketch follows this list).
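To make the odds-only point concrete, here is a minimal sketch (my own illustrative code, not the tuned implementation at the end of this answer; the class and method names are just placeholders). Index i in the array stands for the odd number 2 * i + 3, and the prime 2 is added to the count separately:

using System;

class OddsOnlySoE {
  //count the primes to 'limit' with a sieve that represents only the odd numbers;
  //index i stands for the odd number 2 * i + 3 and the prime 2 is counted separately
  static int CountPrimes(int limit) {
    if (limit < 2) return 0; if (limit < 3) return 1;
    int lmtndx = (limit - 3) >> 1; //index of the largest odd candidate
    var cmpsts = new bool[lmtndx + 1]; //false means still a prime candidate
    for (int i = 0; (2 * i + 3) * (2 * i + 3) <= limit; ++i) {
      if (cmpsts[i]) continue;
      int p = 2 * i + 3; //start culling at p squared; smaller multiples were already culled
      for (long j = ((long)p * p - 3) >> 1; j <= lmtndx; j += p) cmpsts[j] = true;
    }
    var count = 1; //the only even prime, 2
    for (int i = 0; i <= lmtndx; ++i) if (!cmpsts[i]) ++count;
    return count;
  }
  static void Main() {
    Console.WriteLine(CountPrimes(1000000)); //prints 78498
  }
}

For a range of one billion this still needs about 500 Megabytes for the bool array, half of the original one billion bytes; the bit packing of the next point shrinks it much further.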
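The next sketch combines the bit packing and counting LUT points: it packs the odds-only sieve into sixteen bit words using shift and "and" operations and counts the primes with a 64 Kilobyte LUT indexed by whole buffer words (again my own illustrative code, not the wheel based code below):

using System;

class BitPackedSoE {
  //counting Look Up Table: number of zero (prime candidate) bits in each possible 16 bit word
  static readonly byte[] CLUT = new byte[65536];
  static BitPackedSoE() {
    for (int w = 0; w < 65536; ++w) {
      int cnt = 0;
      for (int b = 0; b < 16; ++b) if ((w & (1 << b)) == 0) ++cnt;
      CLUT[w] = (byte)cnt;
    }
  }
  //count the primes to 'limit' with an odds-only, bit packed sieve; bit i of the
  //buffer stands for the odd number 2 * i + 3 and a set bit marks a composite
  static long CountPrimes(long limit) {
    if (limit < 2) return 0; if (limit < 3) return 1;
    long lmtndx = (limit - 3) >> 1;
    var buf = new ushort[(lmtndx >> 4) + 1];
    for (long i = 0; (2 * i + 3) * (2 * i + 3) <= limit; ++i)
      if ((buf[i >> 4] & (1 << (int)(i & 15))) == 0) {
        long p = 2 * i + 3;
        for (long j = (p * p - 3) >> 1; j <= lmtndx; j += p)
          buf[j >> 4] |= (ushort)(1 << (int)(j & 15));
      }
    //mark the pad bits past lmtndx as composite so the LUT does not count them
    for (long j = lmtndx + 1; j < (long)buf.Length << 4; ++j)
      buf[j >> 4] |= (ushort)(1 << (int)(j & 15));
    long count = 1; //the only even prime, 2
    for (long w = 0; w < buf.Length; ++w) count += CLUT[buf[w]];
    return count;
  }
  static void Main() {
    Console.WriteLine(CountPrimes(100000000)); //prints 5761455
  }
}

For one billion the buffer shrinks to about 60 Megabytes, and the counting pass costs only one buffer access and one LUT access per sixteen candidates.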
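Wheel factorization is easiest to see by first generating the wheel's gap pattern; this little sketch (mine, purely for illustration) does that for the small 2,3,5 wheel, whereas the full code at the end uses the 48 element 2,3,5,7 wheel plus preculling by the primes up to 17:

using System;
using System.Collections.Generic;
using System.Linq;

class WheelPattern {
  //build the increment ("gap") pattern for a wheel made from the given small primes;
  //for the 2,3,5 wheel the circumference is 30 and only 8 of every 30 numbers survive,
  //so a sieve only has to represent and cull those 8 residues per rotation
  static int[] BuildGaps(int[] wheelPrimes) {
    int circ = wheelPrimes.Aggregate(1, (acc, p) => acc * p); //wheel circumference
    var residues = new List<int>();
    for (int n = 1; n <= circ + 1; ++n) //wrap one position past the circumference
      if (wheelPrimes.All(p => n % p != 0)) residues.Add(n);
    var gaps = new int[residues.Count - 1];
    for (int i = 1; i < residues.Count; ++i) gaps[i - 1] = residues[i] - residues[i - 1];
    return gaps;
  }
  static void Main() {
    Console.WriteLine(string.Join(",", BuildGaps(new[] { 2, 3, 5 }))); //prints 6,4,2,4,2,4,6,2
    //the surviving candidates just repeat these gaps forever:
    //1, 7, 11, 13, 17, 19, 23, 29, then 31 (=30+1), 37 (=30+7), 41, ...
  }
}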
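Page segmentation can be sketched as follows: sieve the base primes up to the square root of the limit once, then process the range in cache sized pages, recomputing each base prime's starting index within the current page. This is again my own simplified, un-bit-packed sketch; the PGSZ value is just a guess at a cache friendly page size:

using System;
using System.Collections.Generic;

class SegmentedSoE {
  const int PGSZ = 1 << 17; //page size in odd number indices; 128 KB of bool fits in L2 cache
  //count the primes to 'limit' using an odds-only, page segmented sieve (not bit packed, for clarity)
  static long CountPrimes(long limit) {
    if (limit < 2) return 0; if (limit < 3) return 1;
    long lmtndx = (limit - 3) >> 1; //index of the last odd candidate; index i means 2 * i + 3
    //1. base primes to the square root of the limit from a small simple sieve
    long sqrt = (long)Math.Sqrt(limit) + 1;
    int sqrtndx = (int)((sqrt - 3) >> 1);
    var cmpsts = new bool[Math.Max(sqrtndx + 1, 0)]; var basePrimes = new List<long>();
    for (int i = 0; i <= sqrtndx; ++i) if (!cmpsts[i]) {
      long p = 2 * i + 3; basePrimes.Add(p);
      for (long j = (p * p - 3) >> 1; j <= sqrtndx; j += p) cmpsts[(int)j] = true;
    }
    //2. sieve the full range one cache sized page at a time
    long count = 1; //the only even prime, 2
    var seg = new bool[PGSZ];
    for (long low = 0; low <= lmtndx; low += PGSZ) {
      Array.Clear(seg, 0, PGSZ);
      long high = Math.Min(low + PGSZ - 1, lmtndx);
      foreach (var p in basePrimes) {
        long strt = (p * p - 3) >> 1; //first index this base prime needs to cull
        if (strt > high) break; //base primes are ascending, so nothing more to do for this page
        if (strt < low) { //recompute the starting index within this page
          long r = (low - strt) % p; strt = r == 0 ? low : low + p - r;
        }
        for (long j = strt; j <= high; j += p) seg[j - low] = true;
      }
      for (long j = low; j <= high; ++j) if (!seg[j - low]) ++count;
    }
    return count;
  }
  static void Main() { Console.WriteLine(CountPrimes(100000000)); } //prints 5761455
}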
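Finally, the page segments parallelize naturally with the TPL, since each page can be culled and counted independently once the shared base primes table exists. The sketch below uses Parallel.For with thread local subtotals (illustrative only; the real speed comes from combining this with the bit packing and wheel as in the code at the end of this answer):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class ParallelSegmentedSoE {
  const int PGSZ = 1 << 17; //page size in odd number indices; roughly an L2 cache of bool per task
  //count the primes in one page segment [low, high] of odd number indices (index i means 2 * i + 3)
  static long CountSegment(long low, long high, long[] basePrimes) {
    var seg = new bool[high - low + 1];
    foreach (var p in basePrimes) {
      long strt = (p * p - 3) >> 1; if (strt > high) break; //base primes are ascending
      if (strt < low) { var r = (low - strt) % p; strt = r == 0 ? low : low + p - r; }
      for (long j = strt; j <= high; j += p) seg[j - low] = true;
    }
    long cnt = 0; for (long j = 0; j < seg.Length; ++j) if (!seg[j]) ++cnt; return cnt;
  }
  static long CountPrimes(long limit) {
    if (limit < 2) return 0; if (limit < 3) return 1;
    long lmtndx = (limit - 3) >> 1;
    //base primes to the square root of the limit from a small simple odds-only sieve
    int sqrtndx = (int)(((long)Math.Sqrt(limit) + 1 - 3) >> 1);
    var cmpsts = new bool[Math.Max(sqrtndx + 1, 0)]; var bps = new List<long>();
    for (int i = 0; i <= sqrtndx; ++i) if (!cmpsts[i]) {
      long p = 2 * i + 3; bps.Add(p);
      for (long j = (p * p - 3) >> 1; j <= sqrtndx; j += p) cmpsts[(int)j] = true;
    }
    var basePrimes = bps.ToArray();
    //hand each page segment to the Task Parallel Library and sum the partial counts
    long numPages = lmtndx / PGSZ + 1, total = 1; //start the total at 1 for the prime 2
    Parallel.For(0L, numPages, () => 0L,
      (pg, state, subtotal) => {
        long low = pg * PGSZ, high = Math.Min(low + PGSZ - 1, lmtndx);
        return subtotal + CountSegment(low, high, basePrimes);
      },
      subtotal => Interlocked.Add(ref total, subtotal));
    return total;
  }
  static void Main() { Console.WriteLine(CountPrimes(1000000000)); } //prints 50847534
}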
I have implemented a multi-threaded Sieve of Eratosthenes as an answer to another question to show there isn't any advantage to the Sieve of Atkin over the Sieve of Eratosthenes. It uses the Task Parallel Library (TPL) as in Tasks and TaskFactory, so requires at least DotNet Framework 4. I have further tweaked that code using all of the optimizations discussed above as an alternate answer to the same question. I re-post that tweaked code here with added comments and easier-to-read formatting, as follows:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
class UltimatePrimesSoE : IEnumerable<ulong> {
#region const and static readonly field's, private struct's and classes
//one can get single threaded performance by setting NUMPRCSPCS = 1
static readonly uint NUMPRCSPCS = (uint)Environment.ProcessorCount + 1;
//the L1CACHEPOW cannot be less than 14 and is usually the log base two of the L1 or L2 cache size in bytes
const int L1CACHEPOW = 14, L1CACHESZ = (1 << L1CACHEPOW), MXPGSZ = L1CACHESZ / 2; //for buffer ushort[]
const uint CHNKSZ = 17; //this times BWHLWRDS (below) times two should not be bigger than the L2 cache in bytes
//the 2,3,5,7 factorial wheel increment pattern, (sum) 48 elements long, starting at the prime 19 position
static readonly byte[] WHLPTRN = { 2,3,1,3,2,1,2,3,3,1,3,2,1,3,2,3,4,2,1,2,1,2,4,3,
2,3,1,2,3,1,3,3,2,1,2,3,1,3,2,1,2,1,5,1,5,1,2,1 }; const uint FSTCP = 11;
static readonly byte[] WHLPOS; static readonly byte[] WHLNDX; //look up wheel position from index and vice versa
static readonly byte[] WHLRNDUP; //to look up wheel rounded up index position values, allowing for overflow in size
static readonly uint WCRC = WHLPTRN.Aggregate(0u, (acc, n) => acc + n); //small wheel circumference for odd numbers
static readonly uint WHTS = (uint)WHLPTRN.Length; static readonly uint WPC = WHTS >> 4; //wheel candidates per rotation and the 16 bit words they occupy
static readonly byte[] BWHLPRMS = { 2,3,5,7,11,13,17 }; const uint FSTBP = 19; //big wheel primes and the first prime following them
//the big wheel circumference expressed in number of 16 bit words as in a minimum bit buffer size
static readonly uint BWHLWRDS = BWHLPRMS.Aggregate(1u, (acc, p) => acc * p) / 2 / WCRC * WHTS / 16;
//page size and range as developed from the above
static readonly uint PGSZ = MXPGSZ / BWHLWRDS * BWHLWRDS; static readonly uint PGRNG = PGSZ * 16 / WHTS * WCRC;
//buffer size (multiple chunks) as produced from the above
static readonly uint BFSZ = CHNKSZ * PGSZ, BFRNG = CHNKSZ * PGRNG; //buffer size and range covering a whole number of cache sized pages per chunk
static readonly ushort[] MCPY; //a Master Copy page used to hold the lower base primes preculled version of the page
struct Wst { public ushort msk; public byte mlt; public byte xtr; public ushort nxt; }
static readonly byte[] PRLUT; /*Wheel Index Look Up Table */ static readonly Wst[] WSLUT; //Wheel State Look Up Table
static readonly byte[] CLUT; // a Counting Look Up Table for very fast counting of primes
class Bpa { //very efficient auto-resizing thread-safe read-only indexer class to hold the base primes array
byte[] sa = new byte[0]; uint lwi = 0, lpd = 0; object lck = new object();
public uint this[uint i] {
get {
if (i >= this.sa.Length) lock (this.lck) {
var lngth = this.sa.Length; while (i >= lngth) {
var bf = (ushort[])MCPY.Clone(); if (lngth == 0) {
for (uint bi = 0, wi = 0, w = 0, msk = 0x8000, v = 0; w < bf.Length;
bi += WHLPTRN[wi++], wi = (wi >= WHTS) ? 0 : wi) {
if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1;
if ((v & msk) == 0) {
var p = FSTBP + (bi + bi); var k = (p * p - FSTBP) >> 1;
if (k >= PGRNG) break; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
for (uint wrd = kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < bf.Length; ) {
var st = WSLUT[ndx]; bf[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt;
}
}
}
}
else { this.lwi += PGRNG; cullbf(this.lwi, bf); }
var c = count(PGRNG, bf); var na = new byte[lngth + c]; sa.CopyTo(na, 0);
for (uint p = FSTBP + (this.lwi << 1), wi = 0, w = 0, msk = 0x8000, v = 0;
lngth < na.Length; p += (uint)(WHLPTRN[wi++] << 1), wi = (wi >= WHTS) ? 0 : wi) {
if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1; if ((v & msk) == 0) {
var pd = p / WCRC; na[lngth++] = (byte)(((pd - this.lpd) << 6) + wi); this.lpd = pd;
}
} this.sa = na;
}
} return this.sa[i];
}
}
}
static readonly Bpa baseprms = new Bpa(); //the base primes array using the above class
struct PrcsSpc { public Task tsk; public ushort[] buf; } //used for multi-threading buffer array processing
#endregion
#region private static methods
static int count(uint bitlim, ushort[] buf) { //very fast counting method using the CLUT look up table
if (bitlim < BFRNG) {
var addr = (bitlim - 1) / WCRC; var bit = WHLNDX[bitlim - addr * WCRC] - 1; addr *= WPC;
for (var i = 0; i < 3; ++i) buf[addr++] |= (ushort)((unchecked((ulong)-2) << bit) >> (i << 4));
}
var acc = 0; for (uint i = 0, w = 0; i < bitlim; i += WCRC)
acc += CLUT[buf[w++]] + CLUT[buf[w++]] + CLUT[buf[w++]]; return acc;
}
static void cullbf(ulong lwi, ushort[] b) { //fast buffer segment culling method using a Wheel State Look Up Table
ulong nlwi = lwi;
for (var i = 0u; i < b.Length; nlwi += PGRNG, i += PGSZ) MCPY.CopyTo(b, i); //copy preculled lower base primes.
for (uint i = 0, pd = 0; ; ++i) {
pd += (uint)baseprms[i] >> 6;
var wi = baseprms[i] & 0x3Fu; var wp = (uint)WHLPOS[wi]; var p = pd * WCRC + PRLUT[wi];
var k = ((ulong)p * (ulong)p - FSTBP) >> 1;
if (k >= nlwi) break; if (k < lwi) {
k = (lwi - k) % (WCRC * p);
if (k != 0) {
var nwp = wp + (uint)((k + p - 1) / p); k = (WHLRNDUP[nwp] - wp) * p - k;
}
}
else k -= lwi; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
for (uint wrd = (uint)kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < b.Length; ) {
var st = WSLUT[ndx]; b[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nx