My best guess: in the single-threaded case you are CPU bound, and in the multi-threaded case you are memory bound. In other words, the two cores together are reading data from DRAM at the maximum bus bandwidth, so they end up idling while they wait for more data to process.
You can test my theory by doing a true luminance calculation:
    int value = floor( 0.299 * red + 0.587 * green + 0.114 * blue );
That calculation yields gray-scale values in the range 0 to 255, given 8-bit RGB values. It also gives the processors more work to do per pixel. If you change that line of code, the time for the single-threaded case should increase somewhat. And if I'm correct, the multi-threaded case should then show a better performance improvement, as a percentage of the single-threaded time.
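To make the difference in per-pixel work concrete, here is a minimal sketch of the two formulas side by side. The function names `gray_average` and `gray_luminance` are made up for illustration, and the cast to `int` stands in for `floor()`, which gives the same result here because the weighted sum is never negative.

```c
#include <stdint.h>

/* Simple average, as in the (r+g+b)/3 variant: integer math only. */
static int gray_average(uint8_t red, uint8_t green, uint8_t blue)
{
    return (red + green + blue) / 3;
}

/* Weighted luminance: three floating-point multiplies per pixel.
   The cast truncates toward zero, which matches floor() for the
   non-negative sums produced by 8-bit inputs. */
static int gray_luminance(uint8_t red, uint8_t green, uint8_t blue)
{
    return (int)(0.299 * red + 0.587 * green + 0.114 * blue);
}
```

For example, `gray_average(10, 20, 30)` is 20, while `gray_luminance(255, 0, 0)` is 76 because pure red contributes only 29.9% of the weight.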
I decided to run some benchmarks of my own, both on the simulator and on an iPad2. The structure of my code was as follows.
Single Threaded
    start = TimeStamp();
    for ( y = 0; y < 2048; y++ )
        for ( x = 0; x < 1536; x++ )
            computePixel();
    end = TimeStamp();
    NSLog( @"single = %8.3lf msec", (end - start) * 1e3 );
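The listings call `TimeStamp()` without showing it. Since the results are multiplied by `1e3` to get msec, it presumably returns seconds as a floating-point value. A minimal sketch under that assumption, using the POSIX monotonic clock (the original may well use something else, such as `mach_absolute_time` or `CACurrentMediaTime`):

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for TimeStamp(): seconds on a monotonic clock,
   so differences are unaffected by wall-clock adjustments. */
static double TimeStamp(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);   /* CLOCK_MONOTONIC should always be available */
    (void)rc;          /* silence unused warning when NDEBUG is set */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Only differences between two `TimeStamp()` values are meaningful; the absolute value depends on an arbitrary starting point.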
Two Threads using GCD
    // Note: if topStart, topEnd, bottomStart and bottomEnd are local variables,
    // they must be declared __block so the blocks can assign to them.
    // x and y should be local to each block so the two loops don't share counters.
    dispatch_group_t tasks = dispatch_group_create();
    dispatch_queue_t queue = dispatch_get_global_queue( DISPATCH_QUEUE_PRIORITY_HIGH, 0 );

    start = TimeStamp();

    // Top half of the image: rows 0 to 1023
    dispatch_group_async( tasks, queue,
    ^{
        topStart = TimeStamp();
        for ( y = 0; y < 1024; y++ )
            for ( x = 0; x < 1536; x++ )
                computePixel();
        topEnd = TimeStamp();
    });

    // Bottom half of the image: rows 1024 to 2047
    dispatch_group_async( tasks, queue,
    ^{
        bottomStart = TimeStamp();
        for ( y = 1024; y < 2048; y++ )
            for ( x = 0; x < 1536; x++ )
                computePixel();
        bottomEnd = TimeStamp();
    });

    // Time at which both blocks have been submitted and the main thread starts waiting
    wait = TimeStamp();
    dispatch_group_wait( tasks, DISPATCH_TIME_FOREVER );
    end = TimeStamp();

    NSLog( @"wait = %8.3lf msec", (wait - start) * 1e3 );
    NSLog( @"topStart = %8.3lf msec", (topStart - start) * 1e3 );
    NSLog( @"bottomStart = %8.3lf msec", (bottomStart - start) * 1e3 );
    NSLog( @" " );

    NSLog( @"topTime = %8.3lf msec", (topEnd - topStart) * 1e3 );
    NSLog( @"bottomeTime = %8.3lf msec", (bottomEnd - bottomStart) * 1e3 );
    NSLog( @"overallTime = %8.3lf msec", (end - start) * 1e3 );
Here are my results.
Running (r+g+b)/3 on the simulator
2014-04-03 23:16:22.239 GcdTest[1406:c07] single = 21.546 msec
2014-04-03 23:16:22.239 GcdTest[1406:c07]
2014-04-03 23:16:25.388 GcdTest[1406:c07] wait = 0.009 msec
2014-04-03 23:16:25.388 GcdTest[1406:c07] topStart = 0.031 msec
2014-04-03 23:16:25.388 GcdTest[1406:c07] bottomStart = 0.057 msec
2014-04-03 23:16:25.389 GcdTest[1406:c07]
2014-04-03 23:16:25.389 GcdTest[1406:c07] topTime = 10.865 msec
2014-04-03 23:16:25.389 GcdTest[1406:c07] bottomeTime = 10.879 msec
2014-04-03 23:16:25.390 GcdTest[1406:c07] overallTime = 10.961 msec
Running (.299r + .587g + .114b) on the simulator
2014-04-03 23:17:27.984 GcdTest[1422:c07] single = 55.738 msec
2014-04-03 23:17:27.985 GcdTest[1422:c07]
2014-04-03 23:17:29.306 GcdTest[1422:c07] wait = 0.008 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07] topStart = 0.054 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07] bottomStart = 0.060 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07]
2014-04-03 23:17:29.308 GcdTest[1422:c07] topTime = 28.881 msec
2014-04-03 23:17:29.308 GcdTest[1422:c07] bottomeTime = 29.330 msec
2014-04-03 23:17:29.308 GcdTest[1422:c07] overallTime = 29.446 msec
Running (r+g+b)/3 on the iPad2
2014-04-03 23:27:19.601 GcdTest[13032:907] single = 298.799 msec
2014-04-03 23:27:19.602 GcdTest[13032:907]
2014-04-03 23:27:20.536 GcdTest[13032:907] wait = 0.060 msec
2014-04-03 23:27:20.537 GcdTest[13032:907] topStart = 0.246 msec
2014-04-03 23:27:20.539 GcdTest[13032:907] bottomStart = 2.906 msec
2014-04-03 23:27:20.541 GcdTest[13032:907]
2014-04-03 23:27:20.542 GcdTest[13032:907] topTime = 149.596 msec
2014-04-03 23:27:20.544 GcdTest[13032:907] bottomeTime = 149.209 msec
2014-04-03 23:27:20.545 GcdTest[13032:907] overallTime = 152.164 msec
Running (.299r + .587g + .114b) on the iPad2
2014-04-03 23:30:29.618 GcdTest[13045:907] single = 282.767 msec
2014-04-03 23:30:29.620 GcdTest[13045:907]
2014-04-03 23:30:34.008 GcdTest[13045:907] wait = 0.046 msec
2014-04-03 23:30:34.010 GcdTest[13045:907] topStart = 0.270 msec
2014-04-03 23:30:34.011 GcdTest[13045:907] bottomStart = 3.043 msec
2014-04-03 23:30:34.013 GcdTest[13045:907]
2014-04-03 23:30:34.014 GcdTest[13045:907] topTime = 143.078 msec
2014-04-03 23:30:34.015 GcdTest[13045:907] bottomeTime = 143.249 msec
2014-04-03 23:30:34.017 GcdTest[13045:907] overallTime = 146.350 msec
Running ((.299r + .587g + .114b) ^ 2.2) on the iPad2
2014-04-03 23:41:28.959 GcdTest[13078:907] single = 1258.818 msec
2014-04-03 23:41:28.961 GcdTest[13078:907]
2014-04-03 23:41:30.768 GcdTest[13078:907] wait = 0.048 msec
2014-04-03 23:41:30.769 GcdTest[13078:907] topStart = 0.264 msec
2014-04-03 23:41:30.771 GcdTest[13078:907] bottomStart = 3.037 msec
2014-04-03 23:41:30.772 GcdTest[13078:907]
2014-04-03 23:41:30.773 GcdTest[13078:907] topTime = 635.952 msec
2014-04-03 23:41:30.775 GcdTest[13078:907] bottomeTime = 634.749 msec
2014-04-03 23:41:30.776 GcdTest[13078:907] overallTime = 637.829 msec
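To compare the runs as a percentage of the single-threaded time, it helps to reduce each one to a speedup ratio. The sketch below does that with the `single` and `overallTime` values copied from the logs above; the `Run` struct, the `kRuns` table and the `speedup` function are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark run: "single" and "overallTime" from the logs, in msec. */
struct Run {
    const char *label;
    double single_ms;
    double two_thread_ms;
};

static const struct Run kRuns[] = {
    { "(r+g+b)/3, simulator",                 21.546,   10.961 },
    { "(.299r+.587g+.114b), simulator",       55.738,   29.446 },
    { "(r+g+b)/3, iPad2",                    298.799,  152.164 },
    { "(.299r+.587g+.114b), iPad2",          282.767,  146.350 },
    { "(.299r+.587g+.114b)^2.2, iPad2",     1258.818,  637.829 },
};
static const size_t kRunCount = sizeof kRuns / sizeof kRuns[0];

/* Speedup = single-threaded time / two-thread time.
   2.0 would be the ideal for two cores. */
static double speedup(const struct Run *run)
{
    assert(run->two_thread_ms > 0.0);
    return run->single_ms / run->two_thread_ms;
}
```

Calling `speedup(&kRuns[i])` for each entry gives ratios between roughly 1.89 and 1.97. On these numbers the two-thread run is already close to the two-core ideal for every formula, on both the simulator and the iPad2.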