Only one warp (the last one) gets padded. Threads are grouped into warps in the order x, y, z. In this way, if you have an odd 2D array size, like 17x17, that is stored contiguously in memory, you can still create 32-thread warps out of a 17x17 thread block that will generate coalesced accesses. In this way, all of the warps will generate fully coalesced accesses except the last one. If individual warps were padded with dead threads along the way, it would be more wasteful in terms of memory accesses in this example.
For this example, at least, it works better from a machine utilization standpoint.
The documentational support for this rests on understanding that thread ID and thread index are not the same.
Thread index for a given thread is identified by the built-in variables threadIdx.x
, threadIdx.y
, and threadIdx.z
. Thread ID is a unique (within the threadblock), scalar number assigned to each thread.
The relationship between thread ID and thread index is given by this statement:
"The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy),the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy). "
But the grouping of threads into warps is done explicitly by thread ID:
"The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0."
So based on the first statement, we see that even for an odd block shape like 17x17, there are no threads defined other than those which are within the dimensionality of the threadblock. Then based on the second statement, the consecutive assembly of warps by thread ID creates warps all of which have defined threads in them (except perhaps the last one.)