GPU memory -> device memory
CPU memory -> host memory
In CUDA C, data transfers are carried out with cudaMemcpy, passing the cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost direction flag. Device memory is allocated with cudaMalloc and freed with cudaFree.
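For comparison, PyCUDA's driver-level API exposes the same explicit allocate/copy/free pattern. A minimal sketch (the array size and variable names are illustrative):

import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda

host_data = np.arange(1024, dtype=np.float32)

device_ptr = cuda.mem_alloc(host_data.nbytes)  # cf. cudaMalloc
cuda.memcpy_htod(device_ptr, host_data)        # cf. cudaMemcpy with cudaMemcpyHostToDevice
# ... kernel launches would go here ...
result = np.empty_like(host_data)
cuda.memcpy_dtoh(result, device_ptr)           # cf. cudaMemcpy with cudaMemcpyDeviceToHost
device_ptr.free()                              # cf. cudaFree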
PyCUDA hides the boilerplate of memory allocation, deallocation, and transfer behind its gpuarray class, which also frees device memory automatically when the object's lifetime ends.
How to transfer data between the host and the GPU?
Hold the data in host memory as a NumPy array, e.g. host_data.
Transfer it to the GPU with device_data = gpuarray.to_gpu(host_data).
After the computation, retrieve the result with device_data.get() (get() is a method of the GPUArray object, not a module-level function); see the sketch below.
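A minimal round-trip sketch, assuming a simple doubling computation on the device (the array contents are illustrative):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

host_data = np.array([1, 2, 3, 4], dtype=np.float32)
device_data = gpuarray.to_gpu(host_data)   # host -> device transfer
device_data = 2 * device_data              # arithmetic runs on the GPU
output = device_data.get()                 # device -> host transfer
print(output)                              # [2. 4. 6. 8.]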
In PyCUDA, GPU code is often compiled at runtime with the NVIDIA nvcc compiler and then called from Python. This can cause an unexpected slowdown, usually the first time a program or GPU operation is run in a given Python session.
Pointwise operations can be implemented with inline CUDA C via an ElementwiseKernel, as in the example below:
gpu_2x_ker = ElementwiseKernel(
    "float *in, float *out",
    "out[i] = 2*in[i];",
    "gpu_2x_ker")
This CUDA C source is compiled by the external nvcc compiler at runtime (on first use) and the resulting kernel is then launched via PyCUDA.
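A runnable sketch around this kernel; the random input data here is illustrative:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# Kernel definition as above
gpu_2x_ker = ElementwiseKernel(
    "float *in, float *out",
    "out[i] = 2*in[i];",
    "gpu_2x_ker")

host_data = np.float32(np.random.random(50))
device_in = gpuarray.to_gpu(host_data)       # host -> device
device_out = gpuarray.empty_like(device_in)  # uninitialized device buffer

gpu_2x_ker(device_in, device_out)   # first call triggers the nvcc compile
print(np.allclose(device_out.get(), 2 * host_data))  # prints True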