Unified Virtual Addressing (UVA) enables one address space for all CPU and GPU memories since it allows determining physical memory location from pointer value.
Peer-to-peer memcpy with UVA*
When UVA is possible, then cudaMemcpy
can be used for peer-to-peer memcpy
since CUDA can infer which device "owns" which memory. The instructions you typically need to perform a peer-to-peer memcpy
with UVA are the following:
//Check for peer access between participating GPUs:
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);
//Enable peer access between participating GPUs:
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
//UVA memory copy:
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);
Peer-to-peer memcpy without UVA
When UVA is not possible, then peer-to-peer memcpy is done via cudaMemcpyPeer
. Here is an example
// Set device 0 as current
cudaSetDevice(0);
float* p0;
size_t size = 1024 * sizeof(float);
// Allocate memory on device 0
cudaMalloc(&p0, size);
// Set device 1 as current
cudaSetDevice(1);
float* p1;
// Allocate memory on device 1
cudaMalloc(&p1, size);
// Set device 0 as current
cudaSetDevice(0);
// Launch kernel on device 0
MyKernel<<<1000, 128>>>(p0);
// Set device 1 as current
cudaSetDevice(1);
// Copy p0 to p1
cudaMemcpyPeer(p1, 1, p0, 0, size);
// Launch kernel on device 1
MyKernel<<<1000, 128>>>(p1);
As you can see, while in the former case (UVA possible) you don't need to specify which device the different pointers refer to, in the latter case (UVA not possible) you have to explicitly mention which device the pointers refer to.
The instruction
cudaSetDeviceFlags(cudaDeviceMapHost);
is used to enable host mapping to device memory, which is a different thing and regards host<->device memory movements and not peer-to-peer memory movements, which is the topic of your post.
In conclusion, the answer to your questions are:
- NO;
- NO;
- When possible, enable UVA and use
cudaMemcpy
(you don't need to specify the devices); otherwise, use cudaMemcpyPeer
(and you need to specify the devices).