kabicm commented Feb 7, 2024

As @simonpintarelli reported, some of the unit tests derived from the RPA simulation were failing with the GPU backend:

OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1 srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217 --test --transpose NN -r 1

Running PDGEMM on the following problem:
=============================
      GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
        SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
      SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
      ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
         PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
         PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
          BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
          LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================

epsilon = 1e-06, v1 = 42.5759, which is != 528.075
epsilon = 1e-06, v1 = 43.1292, which is != 528.41
COSMA TIMES [ms] = 484
SCALAPACK TIMES [ms] = 571
Result is NOT CORRECT!

The bug occurred only when the GPU backend was used. After careful analysis, @simonpintarelli and I found that the problem boils down to the following local multiplications, each executed multiple times:

m = 5428, n = 217, k = 2170, alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000
m = 5427, n = 217, k = 2170, alpha = 1, beta = 0, copy_c_back = T, tile sizes = 5000
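
The two local values of m are consistent with the global m = 43417 being divided as evenly as possible among the 8 ranks. The sketch below only illustrates that arithmetic. The even split is an assumption about the decomposition chosen for this run, and `even_split` is a hypothetical helper, not a COSMA function.

```python
def even_split(total, parts):
    """Split `total` rows into `parts` chunks whose sizes differ by at most one.

    Hypothetical helper for illustration only; not part of COSMA.
    """
    base, extra = divmod(total, parts)
    # The first `extra` chunks receive one additional row.
    return [base + 1 if i < extra else base for i in range(parts)]


# Global m from the failing test, divided among the 8 ranks (assumed even split).
local_m = even_split(43417, 8)
print(local_m)  # [5428, 5427, 5427, 5427, 5427, 5427, 5427, 5427]
```

Under this assumption, one rank holds 5428 rows and the other seven hold 5427, which matches the two local problem sizes listed above.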

The bug occurred in the GPU backend only when a matrix dimension was slightly larger than the GPU tile size (here m = 5428 or 5427 against a tile size of 5000), as described here.
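
To make the "slightly larger than the tile size" condition concrete, the following sketch lists the tile extents along each dimension for the local problem sizes above. It is not the Tiled-MM implementation. `tile_extents` is a hypothetical helper, and it assumes that a dimension is simply cut into consecutive chunks of at most the tile size.

```python
def tile_extents(dim, tile):
    """Return (offset, length) pairs that cover [0, dim) in steps of `tile`.

    Hypothetical helper for illustration only; not part of Tiled-MM.
    """
    return [(start, min(tile, dim - start)) for start in range(0, dim, tile)]


TILE = 5000  # tile size reported for the failing local multiplications

for m in (5428, 5427):
    print("m =", m, tile_extents(m, TILE))
# m = 5428 [(0, 5000), (5000, 428)]
# m = 5427 [(0, 5000), (5000, 427)]

print("n =", 217, tile_extents(217, TILE))    # n = 217 [(0, 217)]
print("k =", 2170, tile_extents(2170, TILE))  # k = 2170 [(0, 2170)]
```

Under this assumption, m is covered by one full tile plus a small remainder tile of 428 or 427 rows, while n and k each fit into a single tile. A reduced reproducer for this class of bug therefore only needs m just above the tile size.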

We fixed this bug in the GPU backend in the latest PR.

After updating the Tiled-MM submodule to the latest version, we verified that the same test now passes:

OMP_NUM_THREADS=1 CRAY_CUDA_MPS=1 srun -u -N 1 -n 8 ./miniapp/pxgemm_miniapp -m 43417 -k 2170 -n 217 --test --transpose NN -r 1

Running PDGEMM on the following problem:
=============================
      GLOBAL MAT. SIZES
=============================
A = 43417 x 2170
B = 2170 x 217
C = 43417 x 217
=============================
        SUBMATRICES
=============================
(ia, ja) = (1, 1)
(ib, jb) = (1, 1)
(ic, jc) = (1, 1)
=============================
      SUBMATRIX SIZES
=============================
m = 43417
n = 217
k = 2170
=============================
      ADDITIONAL OPTIONS
=============================
alpha = 1
beta = 0
trans_a = N
trans_b = N
=============================
         PROC GRID
=============================
grid = 1 x 8
grid order = R
=============================
         PROC SRCS
=============================
P_SRC(A) = (0, 0)
P_SRC(B) = (0, 0)
P_SRC(C) = (0, 0)
=============================
          BLOCK SIZES
=============================
Blocks(A) = (128, 128)
Blocks(B) = (128, 128)
Blocks(C) = (128, 128)
=============================
          LEADING DIMS
=============================
lld_a = 43417
lld_b = 2170
lld_c = 43417
=============================

COSMA TIMES [ms] = 304
SCALAPACK TIMES [ms] = 444
Result is CORRECT!

This was tested on RTX 3090 GPUs.

simonpintarelli commented Feb 23, 2024

cscs-ci run P100

cscs-ci run GH200
