Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
129 views
in Technique[技术] by (71.8m points)

c++ - Why OpenMP under ubuntu 12.04 is slower than serial version

I've read some other questions on this topic. However, they didn't solve my problem anyway.

I wrote the code as following and I got pthread version and omp version both slower than the serial version. I'm very confused.

Compiled under environment:

Ubuntu 12.04 64bit 3.2.0-60-generic
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Vendor ID:             AuthenticAMD
CPU family:            18
Model:                 1
Stepping:              0
CPU MHz:               800.000
BogoMIPS:              3593.36
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

Compile command:

g++ -std=c++11 ./eg001.cpp -fopenmp

#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>
#include <pthread.h>

#define NUM_THREADS 5
const int sizen = 256000000;

struct Data {
    double * pSinTable;
    long tid;
};

void * compute(void * p) {
    Data * pDt = (Data *)p;
    const int start = sizen * pDt->tid/NUM_THREADS;
    const int end = sizen * (pDt->tid + 1)/NUM_THREADS;
    for(int n = start; n < end; ++n) {
        pDt->pSinTable[n] = std::sin(2 * M_PI * n / sizen);
    }
    pthread_exit(nullptr);
}

int main()
{
    double * sinTable = new double[sizen];
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    clock_t start, finish;

    start = clock();
    int rc;
    Data dt[NUM_THREADS];
    for(int i = 0; i < NUM_THREADS; ++i) {
        dt[i].pSinTable = sinTable;
        dt[i].tid = i;
        rc = pthread_create(&threads[i], &attr, compute, &dt[i]);
    }//for
    pthread_attr_destroy(&attr);
    for(int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(threads[i], nullptr);
    }//for
    finish = clock();
    printf("from pthread: %lf
", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
#   pragma omp parallel for
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from omp: %lf
", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from serial: %lf
", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;

    pthread_exit(nullptr);
    return 0;
}

Output:

from pthread: 21.150000
from omp: 20.940000
from serial: 20.800000

I wonder whether it was my code's problem so I used pthread to do the same thing.

However, I'm totally wrong, and I wonder whether it might be Ubuntu's problem on OpenMP/pthread.

I have a friend who has AMD CPU and Ubuntu 12.04 as well, and got the same problem there, so I might have some reason to believe that the problem is not limited to only me.

If anyone has the same problem as me, or has some clue on the problem, thanks in advance.


If the code is not good enough, I ran a benchmark and I pasted the result here:

http://pastebin.com/RquLPREc

The benchmark url: http://www.cs.kent.edu/~farrell/mc08/lectures/progs/openmp/microBenchmarks/src/download.html


New infomation:

I ran the code on windows (without pthread version) with VS2012.

I used 1/10 of sizen because windows does not allow me to allocate that great trunk of memory where the results are:

from omp: 1.004
from serial: 1.420
from FreeNickName: 735 (this one is the suggestion improvement by @FreeNickName)

Does this indicate that it could be a problem of Ubuntu OS ??



Problem is solved by using omp_get_wtime function that is portable among Operating Systems. See the answer by Hristo Iliev.


Some tests about the controversial topic by FreeNickName.

(Sorry I need to test it on Ubuntu cause the windows was one of my friends'.)

--1-- Change from delete to delete [] : (but without memset)(-std=c++11 -fopenmp)

from pthread: 13.491405
from omp: 13.023099
from serial: 20.665132
from FreeNickName: 12.022501

--2-- With memset immediately after new: (-std=c++11 -fopenmp)

from pthread: 13.996505
from omp: 13.192444
from serial: 19.882127
from FreeNickName: 12.541723

--3-- With memset immediately after new: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.886978
from omp: 11.351801
from serial: 17.002865
from FreeNickName: 11.198779

--4-- With memset immediately after new, and put FreeNickName's version before OMP for version: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.831127
from FreeNickName: 11.571595
from omp: 11.932814
from serial: 16.976979

--5-- With memset immediately after new, and put FreeNickName's version before OMP for version, and set NUM_THREADS to 5 instead of 2 (I'm dual core).

from pthread: 9.451775
from FreeNickName: 9.385366
from omp: 11.854656
from serial: 16.960101
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.

Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake since it does not return the wall-clock (real) time but instead the accumulated CPU time for all process threads (and on some Unix flavours even the accumulated CPU time for all child processes). Your parallel code shows better performance on Windows since there clock() returns the real time and not the accumulated CPU time.

The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():

double start = omp_get_wtime();
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf
", finish - start);

For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:

struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf
", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
                          (start.tv_sec + 1.e-9 * start.tv_nsec));

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...