This type of error message frequently refers to the launch configuration parameters (grid/threadblock dimensions in this case, could also be shared memory, etc. in other cases). When you see a message like this it's a good idea just to print out your actual config parameters before launching the kernel, to see if you've made any mistakes.
You said BLOCKDIM
= 512. You have threadNum = BLOCKDIM/8
so threadNum
= 64. Your threadblock configuration is:
dim3 dimBlock(threadNum,threadNum);
So you are asking to launch blocks of 64 x 64 threads, that is 4096 threads per block. That won't work on any generation of CUDA devices. All current CUDA devices are limited to a maximum of 1024 threads per block, which is the product of the 3 block dimensions.
Maximum dimensions are listed in table 14 of the CUDA programming guide, and also available via the deviceQuery
CUDA sample code.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…