You could do this as part of a host-> device copy. Each copy would take one of the contiguous input arrays on the host and copy it in strided fashion to the device. The storage layout of complex data types in CUDA is compatible with the layout defined for complex types in Fortran and C++, i.e. as a structure with the real part followed by imaginary part.
float * real_vec; // host vector, real part
float * imag_vec; // host vector, imaginary part
float2 * complex_vec_d; // device vector, single-precision complex
float * tmp_d = (float *) complex_vec_d;
cudaStat = cudaMemcpy2D (tmp_d, 2 * sizeof(tmp_d[0]),
real_vec, 1 * sizeof(real_vec[0]),
sizeof(real_vec[0]), n, cudaMemcpyHostToDevice);
cudaStat = cudaMemcpy2D (tmp_d + 1, 2 * sizeof(tmp_d[0]),
imag_vec, 1 * sizeof(imag_vec[0]),
sizeof(imag_vec[0]), n, cudaMemcpyHostToDevice);
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…