
Conversation


@vtripath65 vtripath65 commented Jan 12, 2026

The CUDA_DIAG function was being called to perform both symmetric orthogonalization and diagonalization, and it handled both the Fock matrix and the overlap matrix. For overlap matrix diagonalization, an identity matrix was passed as the transformation matrix, since the overlap matrix should not be orthogonalized.

This PR fixes the issue by splitting CUDA_DIAG into CUDA_DIAG and Fock_DIAG: CUDA_DIAG is called only to diagonalize a matrix, while Fock_DIAG is called when a matrix needs to be orthogonalized and then diagonalized. Fock_DIAG calls CUDA_DIAG for the diagonalization step.

This PR also prepares us for a shift to canonical orthogonalization for handling near-linear dependencies in the AO basis.
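
As a rough illustration of that split (a sketch only: the subroutine name, the argument list, and the use of LAPACK's dsyev as a stand-in for the GPU diagonalization are assumptions, not the actual QUICK interface), the orthogonalize-then-diagonalize path looks like this:

! Sketch only: X is the transformation matrix (e.g. S**(-1/2)), F the Fock
! matrix, E the eigenvalues, C the MO coefficients. Argument lists differ
! from the real QUICK routines.
subroutine fock_diag_sketch(F, X, E, C, n)
  implicit none
  integer, intent(in) :: n
  double precision, intent(in)  :: F(n,n), X(n,n)
  double precision, intent(out) :: E(n), C(n,n)
  double precision :: tmp(n,n), Fp(n,n), work(3*n-1)
  integer :: info

  ! F' = X**T * F * X  (symmetric orthogonalization of the Fock matrix)
  call dgemm('N', 'N', n, n, n, 1.0d0, F, n, X, n, 0.0d0, tmp, n)
  call dgemm('T', 'N', n, n, n, 1.0d0, X, n, tmp, n, 0.0d0, Fp, n)

  ! Diagonalize F' only; this is the step CUDA_DIAG now handles by itself
  ! (LAPACK's dsyev stands in for it here). Eigenvectors overwrite Fp.
  call dsyev('V', 'U', n, Fp, n, E, work, 3*n-1, info)

  ! Back-transform the eigenvectors to the original basis: C = X * C'
  call dgemm('N', 'N', n, n, n, 1.0d0, X, n, Fp, n, 0.0d0, C, n)
end subroutine fock_diag_sketch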

@vtripath65 vtripath65 requested a review from ohearnk January 12, 2026 22:03

@agoetz agoetz left a comment


Some things to check, and improvements to be made.

quick_qm_struct%E, quick_qm_struct%idegen, &
quick_qm_struct%vec, quick_qm_struct%co, &
V2, nbasis)
call fock_diag(quick_qm_struct%o, quick_qm_struct%x, &
Collaborator

I think we should remove code duplication between HIP and CUDA here. We should either

  1. have a function GPU_FOCK_DIAG which is defined as CUDA_FOCK_DIAG or HIP_FOCK_DIAG (similar to GPU_DGEMM), or
  2. use GPU_DGEMM (twice) and have a new function GPU_DIAG which is defined as cudaDIAG, rocDIAG, or magmaDIAG.

We should have GPU_DIAG in any case. Option 1 above would also simplify the code here.
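
A minimal sketch of such compile-time dispatch, in the spirit of GPU_DGEMM (the HIP and MAGMA guard names are guesses; only CUDA and CUDA_MPIV appear in the code under review):

/* Sketch only: select the diagonalization backend at build time,
 * mirroring a GPU_DGEMM-style macro. HIP/HIP_MPIV/MAGMA guards are
 * assumptions, not the actual QUICK build flags. */
#if defined(CUDA) || defined(CUDA_MPIV)
  #define GPU_DIAG cudaDIAG
#elif defined(HIP) || defined(HIP_MPIV)
  #if defined(MAGMA)
    #define GPU_DIAG magmaDIAG
  #else
    #define GPU_DIAG rocDIAG
  #endif
#endif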

Collaborator Author

MatDiag.f90 in the subs directory is now called for diagonalization by both the CPU and GPU code paths.

Collaborator Author

This simplifies the code significantly.

@@ -243,8 +243,7 @@ subroutine fullx
RECORD_TIME(timer_begin%T1eSD)

#if defined(CUDA) || defined(CUDA_MPIV)
Collaborator

We should have a function GPU_DIAG similar to GPU_DGEMM

{
int ka, kb;
cudaError_t err1, err2, err3;
cublasStatus_t stat1, stat2, stat3;
Collaborator

Are we using stat1, stat2, stat3 any more?

Collaborator Author

No, we are not using the stats. Just checking the errors.

Collaborator Author

The fock_diag function does not exist anymore.

stat1=cublasSetMatrix(dim,dim,sizeof(devPtr_o[0]),o,dim,devPtr_o,dim);
stat2=cublasSetMatrix(dim,dim,sizeof(devPtr_x[0]),x,dim,devPtr_x,dim);
stat3=cublasSetMatrix(dim,dim,sizeof(devPtr_hold[0]),hold,dim,devPtr_hold,dim);
err1 = cudaMemcpy(devPtr_o, o, sizeof(double)*dim*dim, cudaMemcpyHostToDevice);
Collaborator

When replacing cublasSetMatrix with cudaMemcpy, does this guarantee the same row/column order mapping (Fortran vs C)? I am not sure.

Collaborator Author

I am not sure how I would check this with our setup. All the matrices we pass to this routine are symmetric.

PS - All tests are passing.

Collaborator Author

I did some tests to check whether switching to cudaMemcpy would affect asymmetric matrices.

Since Fortran uses column-major storage, there will not be any issue. There would be an issue with row-major arrays, so if we ever create an array in C/C++, we cannot use this routine as is.
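
For a full, contiguous column-major dim-by-dim matrix (leading dimension equal to dim), the two host-to-device copies move exactly the same bytes; they would differ only for a sub-matrix or for a row-major array built in C/C++. Using the variable names from the diff above:

// Byte-for-byte equivalent for a contiguous column-major matrix with lda == dim
cublasSetMatrix(dim, dim, sizeof(double), o, dim, devPtr_o, dim);
cudaMemcpy(devPtr_o, o, sizeof(double) * dim * dim, cudaMemcpyHostToDevice);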


ohearnk commented Jan 20, 2026

@vtripath65, @agoetz -- Thanks for working on this. Overall, this is moving in the right direction. I'd like to make a few more changes (e.g., further simplify CPU and GPU codepaths for diagonalization-related routines, backport CUDA changes to HIP codes, etc.).

Give me a few days to make the changes and test them.


@vtripath65 vtripath65 closed this Jan 21, 2026
@vtripath65 vtripath65 reopened this Jan 21, 2026
@vtripath65
Collaborator Author

@ohearnk -- Please let me know if you need any help from me.

@ohearnk ohearnk changed the title from "CUDA_DIAG function call simplified" to "Refactor matrix operation interfaces (eigen-decomposition, matrix-matrix multiplication)" Jan 26, 2026

agoetz commented Feb 3, 2026

Is this ready? Any idea why tests are cancelled?
