diff --git a/lang/en/docs/cli/actions/add-software.md b/lang/en/docs/cli/actions/add-software.md
index 0860d23a..5594d67c 100644
--- a/lang/en/docs/cli/actions/add-software.md
+++ b/lang/en/docs/cli/actions/add-software.md
@@ -2,22 +2,23 @@ ## Overview
 
 Users can compile their own software via the
-[Command Line Interface](../overview.md) (CLI). This is helpful, for example,
-after introducing some changes or patches to the source code, or if users need
+[Command Line Interface](../overview.md) (CLI). This is helpful if users need
 to run a specific version of an application that is not installed "globally".
-Most of the globally installed applications are currently distributed as
+The globally installed applications are currently distributed as
 Apptainer[^1] (Singularity[^2]) containers, bundled with all required
 dependencies. This ensures that each application is isolated and avoids
-dependency conflicts. If you plan to run an application that is not installed in
-our cluster, we encourage you to package your code and its dependencies as an
-Apptainer/Singularity container. If you already have a Docker image, it
-can be converted into an Apptainer/Singularity image.
+dependency conflicts.
 
-## Experiment in Sandbox mode
+When planning to run an application that is not installed in
+our cluster, we encourage packaging the code and its dependencies as an
+Apptainer/Singularity container. Existing Docker images
+can be converted into Apptainer/Singularity images.
+
+## Using Sandbox mode
 
 Apptainer's sandbox mode is helpful for testing and fine-tuning the build steps
 interactively. To start it, first initialize a sandbox with `--sandbox` or `-s`
 flag:
+
 ```bash
 apptainer build --sandbox qe_sandbox/ docker://almalinux:9
 ```
@@ -28,7 +29,7 @@ from the AlmaLinux 9 Docker image to a subdirectory named `qe_sandbox`.
 Now, to install packages and save them to the sandbox folder, we can enter
 into the container in shell (interactive) mode with write permission (use
 `--writable` or `-w` flag). We will also need `--fakeroot` or `-f` flag to
-Install software as root inside the container:
+install software as root inside the container:
 ```bash
 apptainer shell --writable --fakeroot qe_sandbox/
 ```
@@ -45,9 +46,9 @@ Once you are happy with the sandbox, have tested the build steps, and installed
 everything you need, `exit` from the Apptainer shell mode.
 
-## Build container
+## Building containers
 
-### Build from a sandbox folder
+### Build from a Sandbox folder
 
 We may either package the sandbox directory into a final image:
 ```bash
@@ -134,6 +135,11 @@ along with its dependencies.
 4. Set runtime environment variables
 5. Build routine, under the `post` section
 
+Now we are ready to build the container with:
+```bash
+apptainer build espresso.sif espresso.def
+```
+
 ### Build Considerations
 
 #### Running resource-intensive builds in batch mode
@@ -163,10 +169,71 @@ apptainer build espresso.sif espresso.def
 
 #### Porting large libraries from the host
 
 Large libraries such as the Intel OneAPI suite and NVIDIA HPC SDK, which are
-several gigabytes in size, can be mapped from our cluster host instead of
+several gigabytes in size, can be mapped from the cluster host instead of
 bundling together with the application. However, this is not applicable if one
 needs a different version of these libraries than the one provided.
+This can be done by using the `--bind` option and passing the appropriate
+library location from the host, e.g., from
+`/cluster-001-share/compute/software/libraries` or
+`/export/compute/software/libraries/`.
+
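+For example, a minimal sketch of such a runtime mapping, using the
+`espresso.sif` image built above (the paths and executable here are
+illustrative), could look like:
+
+```bash
+# Map the host's shared software libraries into the container at the same
+# path (LD_LIBRARY_PATH inside the container must also point there), then
+# run the containerized executable.
+apptainer exec --bind /export/compute/software/libraries \
+    espresso.sif pw.x -in pw.in > pw.out
+```
+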
+See the GPU example below for more details.
+
+#### Building containers with GPU support
+
+To run applications with GPU acceleration, we first need to compile the GPU
+code against the appropriate GPU libraries, which is done during the container
+build phase. Here, we describe how to compile application code using the
+NVIDIA HPC SDK (which includes the CUDA libraries) and package the compiled
+code as a containerized application.
+
+The process works even on systems without GPU devices or drivers, thanks to
+the availability of stub shared objects (e.g., `libcuda.so`) in recent
+versions of the NVHPC SDK and CUDA Toolkit. These stub libraries allow the
+linker to complete compilation without requiring an actual GPU.
+
+The NVIDIA HPC SDK (or CUDA Toolkit) is a large package, typically several
+gigabytes in size. Unless a specific version of CUDA is required, it’s more
+efficient to map the NVHPC installation available on the host cluster.
+Currently, NVHPC 25.3 with CUDA 12.8 is installed in the Mat3ra clusters. This
+CUDA version matches the one supported by the NVIDIA driver on the cluster's
+compute nodes.
+
+We build our GPU containers in two stages:
+
+1. **Base Image and Compilation Stage**: Install NVHPC and all other
+dependencies, and compile the application code.
+2. **Slim Production Image**: Create a final production container by copying
+only the compiled application and smaller dependencies (if any) into a new
+base image, omitting the NVHPC SDK.
+
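+One way to express this is Apptainer's support for multi-stage definition
+files. A minimal sketch is shown below; the base images, paths, and build
+commands are illustrative placeholders rather than a tested recipe:
+
+```
+Bootstrap: docker
+From: almalinux:9
+Stage: build
+
+%post
+    # Stage 1: install the NVHPC SDK (or start from an NVHPC base image) and
+    # any other build dependencies, then compile the application into a
+    # self-contained prefix such as /opt/app.
+
+# Stage 2: slim production image. Only the compiled application is copied in,
+# so the multi-gigabyte NVHPC SDK is left out of the final container.
+Bootstrap: docker
+From: almalinux:9
+Stage: final
+
+%files from build
+    /opt/app /opt/app
+
+%environment
+    export PATH=/opt/app/bin:$PATH
+```
+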
+To run such a container, we must `--bind` the NVHPC paths from the host and
+set the appropriate `PATH` and `LD_LIBRARY_PATH` for Apptainer. Specialized
+software libraries are installed under `/export/compute/software` in the
+Mat3ra clusters. Also, to map the NVIDIA GPU drivers from the compute node,
+we must use the `--nv` flag. To prepend to `PATH` inside the container, we
+can set `APPTAINERENV_PREPEND_PATH` (or `APPTAINERENV_APPEND_PATH`) on the
+host. Other environment variables do not have such dedicated Apptainer
+variables, so we pass them using the `APPTAINERENV_` prefix (e.g.,
+`APPTAINERENV_LD_LIBRARY_PATH`). A typical job script would then look like:
+
+```bash
+export APPTAINERENV_PREPEND_PATH="/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/hcoll/bin:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/bin:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ucx/mt/bin:/export/compute/software/compilers/gcc/11.2.0/bin"
+
+export APPTAINERENV_LD_LIBRARY_PATH="/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/hcoll/lib:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ompi/lib:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/nccl_rdma_sharp_plugin/lib:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/sharp/lib:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ucx/mt/lib:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/hpcx/hpcx-2.22.1/ucx/mt/lib/ucx:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/comm_libs/12.8/nccl/lib:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/compilers/lib:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/cuda/12.8/lib64:/export/compute/software/libraries/nvhpc-25.3-cuda-12.8/Linux_x86_64/25.3/math_libs/12.8/lib64:/export/compute/software/compilers/gcc/11.2.0/lib64:${LD_LIBRARY_PATH}"
+
+apptainer exec --nv --bind /export,/cluster-001-share espresso.sif pw.x -in pw.in > pw.out
+```
+
+For details about these library paths, one may inspect the modulefiles (e.g.,
+`/cluster-001-share/compute/modulefiles/applications/espresso/7.4.1-cuda-12.8`)
+available in our clusters and the [job scripts](
+https://github.com/Exabyte-io/cli-job-examples/blob/main/espresso/gpu/job.gpu.pbs)
+to see how this is implemented. Do not forget to use a GPU-enabled queue,
+such as [GOF](../../infrastructure/clusters/google.md), to submit your GPU jobs.
+
 ## Run jobs using Apptainer
 
@@ -214,14 +281,8 @@ You can build containers on your local machine or use pull pre-built ones from
 sources such as [NVIDIA GPU Cloud](
 https://catalog.ngc.nvidia.com/orgs/hpc/containers/quantum_espresso).
 
-If Apptainer is installed locally, build the container using:
-
-```bash
-apptainer build espresso.sif espresso.def
-```
-
-Once built, you can push the image to a container registry such as the
-[GitHub Container Registry](
+If the container is built locally, you can push the image to a container
+registry such as the [GitHub Container Registry](
 https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry).
 
 ```bash
@@ -236,8 +297,8 @@ apptainer pull oras://ghcr.io///: