@vtripath65
Collaborator

@vtripath65 vtripath65 commented Feb 20, 2025

Calculations can be restarted from checkpoint files containing coordinates and/or density.

Keywords can be read from multiple lines.

Closes #387.

@agoetz agoetz marked this pull request as draft February 20, 2025 03:53
@ohearnk ohearnk linked an issue Feb 21, 2025 that may be closed by this pull request
@ohearnk ohearnk added the enhancement New feature or request label Mar 7, 2025
@ohearnk ohearnk requested a review from agoetz March 7, 2025 16:02
@vtripath65
Collaborator Author

This branch is not ready to merge yet.

@vtripath65 vtripath65 marked this pull request as ready for review March 18, 2025 03:17
@vtripath65
Collaborator Author

The GitHub workflow needs to install HDF5 for this branch to pass tests (I think).

@ohearnk
Collaborator

ohearnk commented Mar 18, 2025

@vtripath65: I believe that my previous commits should address the missing HDF5 install dependencies for GitHub CI tests. The legacy build system failures still need to be addressed. Please review the test artifacts for CMake builds once the updated code in my previous commits has run.

Second time's a charm. New commits should hopefully fix the CMake build issues.

Also, we should probably make HDF5 an optional install dependency for users. This may require a bit of code tweaking with conditional compilation -- i.e., if HDF5 is found at configure time, then a preprocessor directive gets set which enables HDF5-specific code to be compiled. Otherwise, code that emits error messages indicating HDF5 is not installed gets compiled.
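
For illustration only, here is a minimal C sketch of the conditional-compilation pattern described above. The macro name `HAVE_HDF5`, the function `checkpoint_create`, and the return codes are hypothetical and not taken from this branch; the actual code base may use different names and a different language. The only HDF5 calls used are `H5Fcreate` and `H5Fclose` from the HDF5 C API.

```c
#include <stdio.h>

#ifdef HAVE_HDF5 /* hypothetical macro, assumed to be set at configure time */
#include <hdf5.h>
#endif

/* Hypothetical return codes for this sketch. */
enum { CKPT_OK = 0, CKPT_ERR_IO = 1, CKPT_ERR_NO_HDF5 = 2 };

/* Create an empty HDF5 checkpoint file at `path` (hypothetical helper).
 * Without HAVE_HDF5, only the error-reporting stub is compiled. */
int checkpoint_create(const char *path)
{
#ifdef HAVE_HDF5
    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    if (file < 0) {
        fprintf(stderr, "error: could not create %s\n", path);
        return CKPT_ERR_IO;
    }
    return H5Fclose(file) < 0 ? CKPT_ERR_IO : CKPT_OK;
#else
    fprintf(stderr, "error: cannot write %s: built without HDF5 support\n", path);
    return CKPT_ERR_NO_HDF5;
#endif
}
```

With this layout, callers do not need their own `#ifdef` guards: a build without HDF5 still links, and a request for an HDF5 checkpoint fails at run time with a clear message.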

…with certain Fortran compiler types, e.g., MPICH). Other code clean-up.
… (*.h5). Correct log output accordingly. Correct test names. Other clean-up.
…jobs). Fix HDF5 close routine order. Fix density write order for unrestricted SCF. Other clean-up.
…llation. Fix non-HDF5 checkpoint reading and avoid re-reading portions of file. Other clean-up.
@ohearnk
Collaborator

ohearnk commented Nov 10, 2025

At this point, I'm convinced that there's something wrong with the HDF5 libraries built for Ubuntu v24.04 (specifically for the Fortran bindings), as all failing MPI tests error out on boilerplate HDF5 code to initialize the Fortran interface.

@agoetz, @vtripath65 -- if either of you has any ideas here, I'd appreciate it. If not, I'm inclined to change the GitHub workflows to not test HDF5-based checkpoint functionality on Ubuntu v24.04 -- this is easy to do given how the workflow YAML files are written (and is currently done for tests with Intel oneAPI and NVHPC SDK due to Fortran module version mismatches with HDF5 module files).

Otherwise, feel free to test this further on additional platforms and report any issues. In particular, GPU-based tests still need to be done.

@vtripath65
Collaborator Author

vtripath65 commented Nov 10, 2025 via email

@ohearnk
Collaborator

ohearnk commented Nov 12, 2025

@vtripath65, @agoetz

There is no dependency of HDF5 on GPU code. It's also strange why MPI should have any impact on HDF5 part of the code. We are only using restart capability through the master branch!

A few notes:

  1. GitHub CI tests have been modified such that all tests except those with Intel oneAPI and NVIDIA HPC SDK are currently testing HDF5-based checkpoint functionality. For oneAPI and NVHPC SDK, they are falling back to the custom built-in checkpoint format (currently the default in the HEAD of the master branch). This applies to both serial and MPI versions.
  2. Just because the HDF5 code does not depend on GPU functionality does not mean that it does not need to be tested. Changes in the CPU code can still inadvertently introduce bugs in GPU versions. Examples would include forgetting to copy data in CPU memory to GPU memory.
  3. I opened up a few of the serial test artifacts, and tests are actually failing for HDF5 but being incorrectly marked as passing. As such, the runtest script needs to be fixed so that checkpoint / restart tests correctly report failures (I'll look into this).
  4. The serial test failures follow the same pattern as the MPI failures, namely all failures are localized to Ubuntu v24.04. Annoyingly, (3) partially obfuscated this pattern of failures. Again, I stand by my statement that something appears wrong with the HDF5 libraries being built & installed for Ubuntu v24.04. As a further datapoint, tests on my local system pass on Fedora 41 with GCC v14.3.1, OpenMPI v5.0.5 OR MPICH v4.2.2, HDF5 v1.12.1. Please test on additional systems and report back any failures with version info.
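
Regarding note (3): the thread does not say why failing runs were marked as passing. As a generic safeguard, and purely as an assumption about what such a fix could look like, a test driver can refuse to mark a run as passed unless the command exited normally with status 0. The helper name `command_succeeded` is hypothetical; the sketch uses only POSIX `system()` and the `<sys/wait.h>` status macros.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Return 1 only if `cmd` was launched and exited normally with status 0.
 * Hypothetical helper; not taken from the runtest script. */
int command_succeeded(const char *cmd)
{
    int status = system(cmd);
    if (status == -1)
        return 0; /* the shell could not be started */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A driver built this way would check `command_succeeded(...)` for both the initial run and the restart run before comparing any output files, so a crash during HDF5 initialization cannot be reported as a pass.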

@ohearnk
Collaborator

ohearnk commented Nov 12, 2025

I believe the Ubuntu v24.04 failures are related to issues reported with gfortran due to recently added support for half-precision (bfloat16) datatypes. See, e.g., this recent issue.

The way to get a definitive answer would be to build HDF5 ourselves on an Ubuntu v24.04 platform to see if that resolves the failures. Thoughts?

@agoetz
Collaborator

agoetz commented Nov 12, 2025

Agreed. The Intel server petra has Ubuntu 24.04.2 LTS installed. I don't have the bandwidth to test this at the moment, but both of you have accounts on petra.

@ohearnk ohearnk changed the title from "Restart and key words" to "Add checkpoint file functionality (native, HDF5 formats)" Nov 21, 2025
Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

How to enable restart/resume option. Complete documentation?

4 participants