Add checkpoint file functionality (native, HDF5 formats) #396
base: master
…rom checkpoint file.
|
This branch is not ready to merge yet. |
Created a couple of modules: one for HDF5-specific subroutines and one for restart. Modified CMake to handle HDF5.
Right now only the first element can be read. Some read statements are commented out and need to be uncommented soon.
|
The GitHub workflow needs to install HDF5 for this branch to pass tests (I think).
…stall dependency for Github CI tests.
|
@vtripath65: Second time's a charm. The new commits should hopefully fix the CMake build issues. Also, we should probably make HDF5 an optional install dependency for users. This may require a bit of code tweaking with conditional compilation -- i.e., if HDF5 is found at configure time, a preprocessor directive gets set which enables HDF5-specific code to be compiled. Otherwise, code that emits error messages indicating HDF5 is not installed gets compiled.
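As a rough illustration of the conditional compilation idea described above, here is a minimal sketch in C rather than Fortran. The macro name `QUICK_HAVE_HDF5` and the functions `checkpoint_hdf5_enabled` and `checkpoint_create_hdf5` are hypothetical names chosen for this example and are not taken from this branch. The sketch assumes the build system defines the macro only when HDF5 is found at configure time.

```c
#include <assert.h>
#include <stdio.h>

/* QUICK_HAVE_HDF5 is a hypothetical macro name. The build system would
   define it (e.g. via -DQUICK_HAVE_HDF5) only if HDF5 was found at
   configure time. */
#ifdef QUICK_HAVE_HDF5
#include <hdf5.h>

/* Reports whether HDF5 support was compiled in. */
int checkpoint_hdf5_enabled(void) { return 1; }

/* Create (or truncate) an empty HDF5 checkpoint file.
   Returns 0 on success, 1 on failure. */
int checkpoint_create_hdf5(const char *path)
{
    assert(path != NULL);
    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) {
        fprintf(stderr, "error: could not create HDF5 checkpoint %s\n", path);
        return 1;
    }
    return H5Fclose(file) < 0 ? 1 : 0;
}

#else

/* Reports whether HDF5 support was compiled in. */
int checkpoint_hdf5_enabled(void) { return 0; }

/* Stub compiled when HDF5 is absent: print an error and fail. */
int checkpoint_create_hdf5(const char *path)
{
    assert(path != NULL);
    fprintf(stderr, "error: cannot write %s: built without HDF5 support\n", path);
    return 1;
}

#endif
```

Callers use `checkpoint_create_hdf5("restart.h5")` the same way in both builds. Only the compiled body differs, so a build without HDF5 still links and fails with a clear message at run time instead of at compile time.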
…gacy configure/Makefile builds.
…pe tests (NVHPC SDK).
…(read routine). Other code clean-up.
…with certain Fortran compiler types, e.g., MPICH). Other code clean-up.
… (*.h5). Correct log output accordingly. Correct test names. Other clean-up.
…ration). Other clean-up.
…ration). Other clean-up.
…jobs). Fix HDF5 close routine order. Fix density write order for unrestricted SCF. Other clean-up.
…llation. Fix non-HDF5 checkpoint reading and avoid re-reading portions of file. Other clean-up.
|
At this point, I'm convinced that there's something wrong with the HDF5 libraries built for Ubuntu v24.04 (specifically the Fortran bindings), as all failing MPI tests error out in boilerplate HDF5 code that initializes the Fortran interface. @agoetz, @vtripath65 -- if either of you has any ideas here, I'd appreciate it. If not, I'm inclined to change the GitHub workflows to not test HDF5-based checkpoint functionality on Ubuntu v24.04 -- this is easy to do given how the workflow YAML files are written (and is already done for tests with Intel oneAPI and NVHPC SDK due to Fortran module version mismatches with HDF5 module files). Otherwise, feel free to test this further on additional platforms and report any issues. In particular, GPU-based tests still need to be done.
|
There is no dependency of HDF5 on GPU code. It's also strange that MPI
should have any impact on the HDF5 part of the code. We are only using the
restart capability through the master process!
--
Vikrant Tripathy
Postdoctoral Research Scientist
San Diego Supercomputer Center
University of California, San Diego
|
|
I do see that I am using the module in all the processes, including master;
however, I am calling it only from master. That is probably the issue. We
really do not need to use the module from all the processes; master alone
should suffice.
|
|
On second thought, that may not be possible.
|
A few notes:
|
|
I believe the Ubuntu v24.04 failures are related to issues reported with gfortran due to recently added support for half-precision (bfloat16) datatypes. See, e.g., this recent issue. The way to get a definitive answer would be to build HDF5 ourselves on an Ubuntu v24.04 platform and see whether that resolves the failures. Thoughts?
|
Agreed. The Intel server petra has Ubuntu 24.04.2 LTS installed. I don't have the bandwidth to test this at the moment, but both of you have accounts on petra.
…ctly check and log checkpoint tests.
…ywords for checkpoint file functionality. Update checkpoint tests accordingly. Other clean-up.
Calculations can be restarted from checkpoint files containing coordinates and/or density.
Keywords can be read from multiple lines.
Closes #387.