Refactor matrix operation interfaces (eigen-decomposition, matrix-matrix multiplication) #442
base: master
Conversation
agoetz left a comment:
Some things to check, and some improvements to be made.
src/modules/quick_scf_module.f90
Outdated
    quick_qm_struct%E, quick_qm_struct%idegen, &
    quick_qm_struct%vec, quick_qm_struct%co, &
    V2, nbasis)
call fock_diag(quick_qm_struct%o, quick_qm_struct%x, &
I think we should remove code duplication between HIP and CUDA here. We should either
- have a function GPU_FOCK_DIAG which is defined as CUDA_FOCK_DIAG or HIP_FOCK_DIAG (similar to GPU_DGEMM)
or
- use GPU_DGEMM (twice) and have a new function GPU_DIAG which is defined as cudaDIAG, rocDIAG, or magmaDIAG
We should have GPU_DIAG in any case. Option 1) from above would also simplify the code here.
MatDiag.f90 in the subs directory is now called for diagonalization on both CPU and GPU.
This simplifies the code significantly.
src/modules/quick_overlap_module.f90
Outdated
@@ -243,8 +243,7 @@ subroutine fullx
RECORD_TIME(timer_begin%T1eSD)

#if defined(CUDA) || defined(CUDA_MPIV)
We should have a function GPU_DIAG similar to GPU_DGEMM
{
    int ka, kb;
    cudaError_t err1, err2, err3;
    cublasStatus_t stat1, stat2, stat3;
Are we still using stat1, stat2, stat3?
No, we are not using the stats; we are just checking the errors.
The fock_diag function does not exist anymore.
stat1=cublasSetMatrix(dim,dim,sizeof(devPtr_o[0]),o,dim,devPtr_o,dim);
stat2=cublasSetMatrix(dim,dim,sizeof(devPtr_x[0]),x,dim,devPtr_x,dim);
stat3=cublasSetMatrix(dim,dim,sizeof(devPtr_hold[0]),hold,dim,devPtr_hold,dim);
err1 = cudaMemcpy(devPtr_o, o, sizeof(double)*dim*dim, cudaMemcpyHostToDevice);
When replacing cublasSetMatrix with cudaMemcpy, is the same row/column order mapping (Fortran vs. C) guaranteed? I am not sure.
I am not sure how I would check this with our setup. All the matrices passed to this routine are symmetric.
PS: all tests are passing.
I did some tests to check whether switching to cudaMemcpy affects asymmetric matrices.
Since Fortran uses column-major storage, there is no issue. There would be an issue with row-major arrays: if we create an array in C/C++, we cannot use this routine as-is.
@vtripath65, @agoetz -- Thanks for working on this. Overall, this is moving in the right direction. I'd like to make a few more changes (e.g., further simplify CPU and GPU codepaths for diagonalization-related routines, backport CUDA changes to HIP codes, etc.). Give me a few days to make the changes and test them.
@ohearnk -- Please let me know if you need any help from me.
…ense matrix operations (eigen-decomposition, matrix-matrix multiplication). Further code clean-up.
Is this ready? Any idea why the tests were cancelled?
The CUDA_DIAG function was being called to perform both symmetric orthogonalization and diagonalization, handling both the Fock matrix and the overlap matrix. For overlap-matrix diagonalization, an identity matrix was used as the transformation matrix, since the overlap matrix should not be orthogonalized.
This PR fixes the issue by splitting CUDA_DIAG into CUDA_DIAG and Fock_DIAG: CUDA_DIAG is called only for diagonalization of matrices, while Fock_DIAG is called when a matrix needs to be orthogonalized and then diagonalized. Fock_DIAG calls CUDA_DIAG for the diagonalization step.
This PR also prepares us for a shift to canonical orthogonalization for handling near-linear dependencies in the AO basis.