Description
A note: I don't know whether this issue should be categorized as a bug. My setup steps might be wrong as well. If it is the latter, please guide me accordingly.
Describe the bug
Concurrent execution of many instances of the same CUDA executable causes some of them to fail randomly.
In the case of:
- onnx_dump, it fails with terminate called without an active exception
  - Sometimes it shows ERROR sending to socket: Bad file descriptor before printing terminate called without an active exception
  - Sometimes only terminate called without an active exception is printed
- cudart, it fails with a simple Segmentation fault (core dumped)
Suppose I write a CUDA program, say toy.cu, as follows:
#include <cuda.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#define BLOCK_SIZE 128
__global__
void do_something(float* d_array)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    d_array[idx] *= 100;
}
int main()
{
    long N = 1<<7;
    float *arr = (float*) malloc(N*sizeof(float));
    long i;
    for (i = 1; i <= N; i++)
        arr[i-1] = i;
    float *d_array;
    cudaError_t ret;
    ret = cudaMalloc(&d_array, N*sizeof(float));
    printf("Return value of cudaMalloc = %d\n", ret);
    if (ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }
    ret = cudaMemcpy(d_array, arr, N*sizeof(float), cudaMemcpyHostToDevice);
    printf("Return value of cudaMemcpy = %d\n", ret);
    if (ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s \n", cudaGetErrorString(ret));
        exit(1);
    }
    int num_blocks = (N+BLOCK_SIZE-1)/BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);
    ret = cudaMemcpy(arr, d_array, N*sizeof(float), cudaMemcpyDeviceToHost);
    printf("Return value of cudaMemcpy = %d\n", ret);
    int j;
    for (i = 0; i < N;)
    {
        for (j = 0; j < 8; j++)
            printf("%.0f\t", arr[i++]);
        printf("\n");
    }
    cudaFree(d_array);
    return 0;
}

And compile it as:
nvcc -o toy toy.cu --cudart shared

Then, in the docker container set up to use the appropriate libguestlib.so, I run the following script.sh:
#!/bin/bash
if [ $# -ne 2 ]; then
    echo "Usage: $0 <executable> <num_instances>"
    exit 1
fi
executable=$1
num_instances=$2
for ((i=1; i<=$num_instances; i++)); do
    $executable &
done

And run the following command:
$ ./script.sh toy 20

Many (but not all) of the instances fail, whether I use cudart or onnx_dump.
To Reproduce
I'll go ahead and describe how I set up AvA.
First, I installed NVIDIA driver 418.226.00 using the NVIDIA-Linux-x86_64-418.226.00.run from the NVIDIA website.
Second, I installed CUDA Toolkit 10.1 using the cuda_10.1.168_418.67_linux.run from the NVIDIA website.
Third, I installed cuDNN 7.6.3.30 using the following files (rough install commands are sketched after the list):
libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb
libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb
libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb
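For reference, these installs roughly amounted to the following commands (a sketch; the exact runfile flags here are approximate):

$ sudo sh NVIDIA-Linux-x86_64-418.226.00.run
$ sudo sh cuda_10.1.168_418.67_linux.run --toolkit --silent
$ sudo dpkg -i libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb \
    libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb \
    libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb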
Next, I forked the AvA repository.
I modified the ava/guestlib/cmd_channel_socket_tcp.cpp to connect to my host using its IP address.
And then did the following:
$ cd ava
$ ./generate -s onnx_dump
$ cd ..
$ mkdir build
$ cd build
$ cmake ../ava
$ ccmake . # and then selected the options for onnx_dump and demo manager
$ make -j72
$ make install
Then I used a CUDA 10.1 docker image (the one provided in this repository under tools/docker, with a small modification to work around the CUDA apt key issue during apt update).
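For reference, one way to get apt-get update working again in that image is to refresh NVIDIA's rotated repository key; the key URL below is my best recollection and may need adjusting:

$ apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
$ apt-get update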
I bind-mounted my build directory into the docker container, copied libguestlib.so from the build directory to /usr/lib/x86_64-linux-gnu and /usr/local/cuda-10.1/targets/x86_64-linux/lib/ inside the container, and modified the library symlinks accordingly; the resulting layout is shown in the listings below.
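Roughly, this amounted to the following commands inside the container (a sketch; /path/to/build stands in for wherever the bind-mounted build directory is, and the symlink targets match the listings that follow):

# copy the generated guestlib into both library directories
$ cp /path/to/build/libguestlib.so /usr/lib/x86_64-linux-gnu/
$ cp /path/to/build/libguestlib.so /usr/local/cuda-10.1/targets/x86_64-linux/lib/
# point the driver/library sonames at libguestlib.so
$ cd /usr/lib/x86_64-linux-gnu
$ ln -sf libguestlib.so libcuda.so.1
$ ln -sf libguestlib.so libcublas.so.10
$ ln -sf libguestlib.so libcublasLt.so.10
$ ln -sf libguestlib.so libcudnn.so.7
# and the CUDA runtime/math libraries under the toolkit tree
$ cd /usr/local/cuda-10.1/targets/x86_64-linux/lib
$ ln -sf libguestlib.so libcudart.so.10.1
$ ln -sf libguestlib.so libcufft.so.10
$ ln -sf libguestlib.so libcurand.so.10
$ ln -sf libguestlib.so libcusolver.so.10
$ ln -sf libguestlib.so libcusparse.so.10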
/usr/lib/x86_64-linux-gnu$ ls -lh libcu*
lrwxrwxrwx 1 root root 17 Feb 25 2019 libcublasLt.so -> libcublasLt.so.10
lrwxrwxrwx 1 root root 14 Sep 10 04:41 libcublasLt.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 12M Sep 10 04:40 libcublasLt.so.10.1.0.105
-rw-r--r-- 1 root root 23M Feb 25 2019 libcublasLt_static.a
lrwxrwxrwx 1 root root 15 Feb 25 2019 libcublas.so -> libcublas.so.10
lrwxrwxrwx 1 root root 14 Sep 10 04:41 libcublas.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 12M Sep 10 04:40 libcublas.so.10.1.0.105
-rw-r--r-- 1 root root 87M Feb 25 2019 libcublas_static.a
lrwxrwxrwx 1 root root 29 Sep 9 16:09 libcudadebugger.so.1 -> libcudadebugger.so.535.104.05
-rwxr-xr-x 1 root root 9.8M Sep 9 15:43 libcudadebugger.so.535.104.05
lrwxrwxrwx 1 root root 12 Sep 9 16:09 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 14 Sep 18 15:49 libcuda.so.1 -> libguestlib.so
-rw-r--r-- 1 root root 16M Feb 25 2019 libcuda.so.418.39
-rwxr-xr-x 1 root root 28M Sep 9 15:43 libcuda.so.535.104.05
lrwxrwxrwx 1 root root 29 Mar 7 2019 libcudnn.so -> /etc/alternatives/libcudnn_so
lrwxrwxrwx 1 root root 14 Sep 10 04:42 libcudnn.so.7 -> libguestlib.so
-rw-r--r-- 1 root root 7.0M Sep 9 16:14 libcudnn.so.7.5.0
lrwxrwxrwx 1 root root 32 Mar 7 2019 libcudnn_static.a -> /etc/alternatives/libcudnn_stlib
-rw-r--r-- 1 root root 351M Feb 15 2019 libcudnn_static_v7.a
lrwxrwxrwx 1 root root 23 Apr 6 2018 libcupsfilters.so.1 -> libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root 211K Apr 6 2018 libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root 34K Dec 12 2018 libcupsimage.so.2
-rw-r--r-- 1 root root 558K Dec 12 2018 libcups.so.2
-rw-r--r-- 1 root root 12M Sep 10 04:40 libcurand.so.10
lrwxrwxrwx 1 root root 19 Jan 29 2019 libcurl-gnutls.so.3 -> libcurl-gnutls.so.4
lrwxrwxrwx 1 root root 23 Jan 29 2019 libcurl-gnutls.so.4 -> libcurl-gnutls.so.4.5.0
-rw-r--r-- 1 root root 499K Jan 29 2019 libcurl-gnutls.so.4.5.0
lrwxrwxrwx 1 root root 16 Jan 29 2019 libcurl.so.4 -> libcurl.so.4.5.0
-rw-r--r-- 1 root root 507K Jan 29 2019 libcurl.so.4.5.0
lrwxrwxrwx 1 root root 12 May 23 2018 libcurses.a -> libncurses.a
lrwxrwxrwx 1 root root 13 May 23 2018 libcurses.so -> libncurses.so

/usr/local/cuda-10.1/targets/x86_64-linux/lib$ ls -lh libcu*
-rw-r--r-- 1 root root 701K Feb 25 2019 libcudadevrt.a
lrwxrwxrwx 1 root root 17 Feb 25 2019 libcudart.so -> libcudart.so.10.1
lrwxrwxrwx 1 root root 14 Sep 18 15:45 libcudart.so.10.1 -> libguestlib.so
-rw-r--r-- 1 root root 493K Feb 25 2019 libcudart.so.10.1.105
-rw-r--r-- 1 root root 868K Feb 25 2019 libcudart_static.a
lrwxrwxrwx 1 root root 14 Feb 25 2019 libcufft.so -> libcufft.so.10
lrwxrwxrwx 1 root root 14 Oct 29 21:39 libcufft.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 112M Feb 25 2019 libcufft.so.10.1.105
-rw-r--r-- 1 root root 132M Feb 25 2019 libcufft_static.a
-rw-r--r-- 1 root root 119M Feb 25 2019 libcufft_static_nocallback.a
lrwxrwxrwx 1 root root 15 Feb 25 2019 libcufftw.so -> libcufftw.so.10
lrwxrwxrwx 1 root root 21 Feb 25 2019 libcufftw.so.10 -> libcufftw.so.10.1.105
-rw-r--r-- 1 root root 489K Feb 25 2019 libcufftw.so.10.1.105
-rw-r--r-- 1 root root 33K Feb 25 2019 libcufftw_static.a
lrwxrwxrwx 1 root root 18 Feb 25 2019 libcuinj64.so -> libcuinj64.so.10.1
lrwxrwxrwx 1 root root 22 Feb 25 2019 libcuinj64.so.10.1 -> libcuinj64.so.10.1.105
-rw-r--r-- 1 root root 7.5M Feb 25 2019 libcuinj64.so.10.1.105
-rw-r--r-- 1 root root 32K Feb 25 2019 libculibos.a
lrwxrwxrwx 1 root root 15 Feb 25 2019 libcurand.so -> libcurand.so.10
lrwxrwxrwx 1 root root 14 Oct 29 21:39 libcurand.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 58M Feb 25 2019 libcurand.so.10.1.105
-rw-r--r-- 1 root root 58M Feb 25 2019 libcurand_static.a
lrwxrwxrwx 1 root root 17 Feb 25 2019 libcusolver.so -> libcusolver.so.10
lrwxrwxrwx 1 root root 14 Oct 29 21:40 libcusolver.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 175M Feb 25 2019 libcusolver.so.10.1.105
-rw-r--r-- 1 root root 88M Feb 25 2019 libcusolver_static.a
lrwxrwxrwx 1 root root 17 Feb 25 2019 libcusparse.so -> libcusparse.so.10
lrwxrwxrwx 1 root root 14 Oct 29 21:40 libcusparse.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 87M Feb 25 2019 libcusparse.so.10.1.105
-rw-r--r-- 1 root root 97M Feb 25 2019 libcusparse_static.a

I then added the guest config in the docker container as:
$ cat /etc/ava/guest.conf
channel = "TCP";
manager_address = "10.192.34.20:3333";
gpu_memory = [1024L];
Then I tried to launch the manager on the host as follows:
build$ ./install/bin/demo_manager --worker_path install/onnx_dump/bin/worker
Manager Service listening on ::3333
On the guest, I then run the toy CUDA program, but it fails as described earlier.
I have described the setup for onnx_dump; the setup for cudart is similar and produces the errors described above.
Expected behavior
I expect all the instances of the toy executable launched concurrently to run successfully.
Environment:
- OS: Ubuntu 18.04.6 LTS x86_64
- Python version: 3.6.9
- GCC version: 7.5.0
- Kernel: 5.4.0-150-generic
- Host: SYS-7049GP-TRT 0123456789
- CPU: Intel Xeon Gold 6140 (72) @ 3.700GHz
- GPU: NVIDIA Tesla P40
- NVIDIA Driver Version: 418.226.00
- CUDA Version: 10.1