I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following:

- When we use cudaMalloc(), do we get a pointer to shared or global memory?
- Does global memory reside on the host or device?
- Is there a size limit to either one?
- Which is faster to access?
Is storing a variable in shared memory the same as passing its address to the kernel? I.e., instead of having

__global__ void kernel() {
    __shared__ int i;
    foo(i);
}

why not equivalently do

__global__ void kernel(int *i_ptr) {
    foo(*i_ptr);
}

int main() {
    int *i_ptr;
    cudaMalloc((void **)&i_ptr, sizeof(int));
    kernel<<<blocks, threads>>>(i_ptr);
}
There have been many questions about specific speed issues in global vs. shared memory, but none offering an overview of when to use each one in practice.
Many thanks
question from: https://stackoverflow.com/questions/14093692/whats-the-difference-between-cuda-shared-and-global-memory