
4. Software Environment

4.1. Switch User Environment

In this system, you can switch between compiler and application environments using the Environment Modules system.
For details on the Environment Modules system, refer to the man pages or the Environment Modules documentation.

4.1.1. List available modules

You can check the available modules with "module avail" or the abbreviated "module av".

$ module avail
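
You can also pass a name to narrow the listing, for example to show only the Intel modules:

$ module avail intel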

For the available versions, please refer to the following pages:
System software
Supported applications

4.1.2. Module information

If you want to know information about a module, use the module whatis MODULE command.

$ module whatis intel/2024.0.2
-------------------------------- /apps/t4/rhel9/modules/modulefiles/compiler --------------------------------
      intel/2024.0.2: Intel oneAPI compiler 2024.0 and MKL

4.1.3. Load modules

Modules are loaded with the module load MODULE command.

$ module load intel/2024.0.2

In a job script, you need to load the same modules that were used when compiling/building the program.
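
For example, a minimal job script sketch that loads the same module used at build time (the resource type, walltime, and program name are placeholders; sourcing /etc/profile.d/modules.sh is a common convention that may not be required on this system):

#!/bin/sh
#$ -cwd
#$ -l node_f=1
#$ -l h_rt=0:10:00
# Load the same module that was used to build the program
. /etc/profile.d/modules.sh
module load intel/2024.0.2
./a.out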

4.1.4. List loaded modules

module list displays currently loaded modules.

$ module list
Currently Loaded Modulefiles:
 1) intel/2024.0.2   2) cuda/12.3.2

4.1.5. Unload modules

If you want to unload modules in use, type module unload MODULE.

$ module list
Currently Loaded Modulefiles:
 1) intel/2024.0.2   2) cuda/12.3.2
$ module unload cuda
$ module list
Currently Loaded Modulefiles:
 1) intel/2024.0.2

4.1.6. Refresh module environment

module purge refreshes the module environment: all currently loaded modules are unloaded.

$ module list
Currently Loaded Modulefiles:
 1) intel/2024.0.2   2) cuda/12.3.2
$ module purge
$ module list
No Modulefiles Currently Loaded.

4.2. Intel Compiler

In this system, you can use the Intel compiler (oneAPI), the AMD compiler (AOCC), the NVIDIA compiler (NVIDIA HPC SDK), and the GNU compiler.

The Intel compiler commands are listed below:

Command Language Syntax Former command
ifx Fortran 77/90/95 $ ifx [OPTION] source_file ifort
icx C $ icx [OPTION] source_file icc
icpx C++ $ icpx [OPTION] source_file icpc

When you use the Intel compiler, load intel with the module command. The --help option shows the various compiler options.
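
For example, a compile sketch (sample.f90 and sample.c are placeholder source files):

$ module load intel/2024.0.2
$ ifx -O2 sample.f90 -o sample_f     # Fortran
$ icx -O2 sample.c -o sample_c       # C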

Info

The icc and icpc commands are no longer supported in Intel oneAPI 2024; use icx and icpx instead. ifort is also expected to be discontinued in the near future, so we recommend using ifx instead.

Info

Since Intel oneAPI 2024, the default C++ standard has changed from C++14 to C++17, so syntax errors may occur with code that assumed the older default. For more information, see here.

4.2.1. Compiler options

The compiler options are listed below.

Option Description
-O Same as -O2.
-O0 Disables all optimizations. Used for debugging, etc.
-O1 Affects code size and locality. Disables specific optimizations.
-O2 Default optimization level. Same as -O. Enables optimizations for speed, including global code scheduling, software pipelining, and predication.
-O3 Aggressive optimizations for maximum speed (but does not guarantee higher performance). Includes optimizations such as data prefetching, scalar replacement, and loop transformations.
-xCORE-AVX512 Generates Intel Advanced Vector Extensions 512 (Intel AVX512), Intel AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE instructions for Intel processors. Optimizes for Intel processors with Intel AVX512 instruction set support.
-xCORE-AVX2 Generates Intel Advanced Vector Extensions 2 (Intel AVX2), Intel AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for Intel processors. Optimized for Intel processors supporting the Intel AVX2 instruction set.
-xSSE4.2 Generates Intel SSE4 high-efficiency and high-speed string processing instructions, Intel SSE4 vectorized compiler and media accelerator instructions, and Intel SSSE3, SSE3, SSE2, and SSE instructions for Intel processors. Optimizes for Intel processors supporting the Intel SSE4.2 instruction set.
-xSSSE3 Generates Intel SSSE3, SSE3, SSE2, and SSE instructions for Intel processors, optimized for Intel processors supporting the Intel SSSE3 instruction set.
-qopt-report=n Generates an optimization report. By default, the report is output to a file with an .optrpt extension, where n is the level of detail from 0 (no report) to 5 (most detailed). The default is 2.
-fp-model precise Controls the semantics of floating-point operations. Disables optimizations that affect the precision of floating-point data and rounds intermediate results to the precision defined by the source.
-g Instructs the compiler to generate symbolic debugging information in the object file, which increases the size of the object file.
-traceback This option instructs the compiler to generate supplemental information in the object file so that when a fatal error occurs at run-time, the source file traceback information can be displayed.
When a fatal error occurs, correlation information for the source file, routine name, and line number is displayed along with the hex address of the call stack (program counter trace).
The map file and the hex address of the stack displayed when an error occurs can be used to determine the cause of the error.
This option increases the size of the executable program.
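
For example, the following sketch requests an optimization report at the default detail level; per the table above, the report is written to a file with an .optrpt extension (sample.c is a placeholder):

$ icx -O2 -qopt-report=2 -c sample.c
$ cat sample.optrpt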

4.2.2. Recommended optimization options

The recommended compile optimization options are listed below. The AMD EPYC 9654 in this system supports the Intel AVX512 instruction set, so the -xCORE-AVX512 option can be specified. When -xCORE-AVX512 is specified, the compiler analyzes the source code and generates the optimal AVX512, AVX2, AVX, and SSE instructions. The recommended options are aggressive but generally safe; note that optimization may change the order of calculations and can introduce small differences in the results.

Info

AVX512, adopted in Intel Xeon processors (Skylake and later), is also supported by AMD in the 4th-generation EPYC (Zen4 architecture) used in this system.

Option Description
-O3 Enables the -O2 optimizations plus more aggressive loop transformations such as fusion, block unroll-and-jam, and collapsing of IF statements.
-xCORE-AVX512 Generates Intel Advanced Vector Extensions 512 (Intel AVX512), Intel AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE instructions for Intel processors. Optimizes for Intel processors with Intel AVX512 instruction set support.

If the above options degrade the performance of your program, reduce the optimization level to -O2 or change the vectorization options. If the results do not match, also try the floating-point options.
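
A compile sketch with the recommended options (sample.f90 is a placeholder):

$ ifx -O3 -xCORE-AVX512 sample.f90 -o sample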

4.2.3. Intel 64 architecture memory model

Create an executable binary using one of the following memory models:

Memory model Description
small (-mcmodel=small) Code and data are limited to the first 2GB of address space so that all access to code and data is instruction pointer (IP) relative addressing.
If the -mcmodel option is not specified, this is the default.
medium (-mcmodel=medium) Code is limited to the first 2GB of address space, but data is not. Code can be accessed via IP relative addressing, but data must be accessed via absolute addressing.
large (-mcmodel=large) Neither code nor data is restricted. Both code and data access uses absolute addressing.

IP relative addressing requires only 32 bits, while absolute addressing requires 64 bits. This affects code size and performance. (IP relative addressing is slightly faster to access.)

When the total size of the common blocks, global data, and static data in a program exceeds 2 GB, the following error message is output at link time:

<some lib.a library>(some .o): In Function <function>:
  : relocation truncated to fit: R_X86_64_PC32 <some symbol>
…………………
  : relocation truncated to fit: R_X86_64_PC32 <some symbol>

In this case, compile/link with -mcmodel=medium and -shared-intel.

If you specify the medium or large memory model, you must also specify the -shared-intel compiler option to ensure that the appropriate dynamic version of Intel's runtime libraries is used.
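
For example, a build sketch for the medium memory model (big_array.f90 is a placeholder for a program whose static data exceeds 2 GB):

$ ifx -mcmodel=medium -shared-intel big_array.f90 -o big_array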

4.3. NVIDIA HPC SDK

The commands of the NVIDIA HPC SDK (formerly the PGI compiler) are as follows.

Command Language Syntax Former command
nvfortran Fortran 77/90/95/2003/2008/2018 $ nvfortran [OPTION] source_file pgfortran
nvc C $ nvc [OPTION] source_file pgcc
nvc++ C++ $ nvc++ [OPTION] source_file pgc++

For more details about each command, see its man page (e.g., $ man nvc).

To use it, load nvhpc with the module command. If you specify the --help option, a list of compiler options will be displayed.
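
For example (sample.f90 is a placeholder; -acc and -gpu=cc90 assume OpenACC directives in the source, with cc90 targeting the H100):

$ module load nvhpc
$ nvfortran -O3 sample.f90 -o sample               # CPU-only build
$ nvfortran -acc -gpu=cc90 sample.f90 -o sample    # OpenACC offload to the GPU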

4.4. AOCC

The commands of the AMD Optimizing C/C++ and Fortran Compilers (AOCC) are as follows.

Command Language Syntax
flang (clang) Fortran 95/2003/2008 $ flang [OPTION] source_file
clang C $ clang [OPTION] source_file
clang++ (clang) C++ $ clang++ [OPTION] source_file

To use it, load aocc with the module command. If you specify the --help option, a list of compiler options will be displayed.
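
For example (sample.c is a placeholder; -march=znver4 targets the Zen4 EPYC CPUs in this system, assuming the installed AOCC supports it):

$ module load aocc
$ clang -O3 -march=znver4 sample.c -o sample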

4.5. GPU Environment

This system provides a GPU (NVIDIA H100 SXM5) environment in conjunction with the CPUs.

4.5.1. Execution and debug for interactive job

The login nodes (login, login1, login2) do not have GPUs, so they can only be used to compile and link GPU code. Execution of high-load programs on the login nodes is also restricted.

You can run and debug GPU code interactively on the compute nodes through the batch system. Please refer to Interactive job for more details.

4.5.2. Supported applications for GPU

The GPU-supported applications are listed below (as of 2024.4.1).

  • ABAQUS 2024 --- Refer to ABAQUS Usage guide.
  • ANSYS 2024R1 --- Refer to ANSYS Usage guide.
  • AMBER 22up05 --- Refer to AMBER Usage guide.
  • Mathematica 14 --- Refer to Mathematica Usage guide.
  • MATLAB 2024 --- Refer to MATLAB Usage guide.
  • Linaro Forge (formerly Arm Forge) --- Refer to [Parallel Programming] in Lectures
  • NVIDIA HPC SDK --- Refer to NVIDIA HPC Usage guide.

Other applications will be made available sequentially.

4.5.3. MPI Environment with CUDA

MPI Environment with CUDA is available.

OpenMPI + gcc environment

# load OpenMPI, GCC
$ module load openmpi/5.0.2-gcc
Loading openmpi/5.0.2-gcc
  Loading requirement: cuda/12.3.2

Info

The required related modules are loaded automatically.

OpenMPI + NVIDIA HPC SDK environment

# load OpenMPI, NVIDIA HPC SDK
$ module load openmpi/5.0.2-nvhpc
Loading openmpi/5.0.2-nvhpc

OpenMPI + Intel environment

# load OpenMPI, Intel
$ module load openmpi/5.0.2-intel
Loading openmpi/5.0.2-intel
  Loading requirement: intel/2024.0.2 cuda/12.3.2

Info

The required related modules are loaded automatically.
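
After loading one of the environments above, a typical build-and-run sketch looks like the following (hello.c is a placeholder; in practice, run mpirun inside a batch job):

$ mpicc hello.c -o hello
$ mpirun -np 4 ./hello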

4.5.4. Multi-Instance GPU (MIG)

You can use half a GPU on node_o and gpu_h.
Multi-Instance GPU (MIG) partitions the NVIDIA H100 SXM5 94GB HBM2e on TSUBAME4 into two separate GPU instances.
See details about MIG here.
MIG is not used on the other resource types.
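
To check whether your job was assigned a MIG instance, you can list the GPU devices visible to the job:

$ nvidia-smi -L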

4.5.5. Multi-Process Service (MPS)

Multi-Process Service (MPS) allows multiple CUDA processes to share a single GPU.
The processes run in parallel on the GPU, which improves utilization when a single process cannot saturate the GPU's compute resources.
MPS also enables concurrent execution, or overlapping, of kernel operations and memory copies from different processes to further enhance utilization.
See details about MPS here.

Info

TSUBAME4 provides a T4-specific nvidia-cuda-mps-control command to avoid system trouble.
Be sure to use this command. It becomes available by loading the following module:
module load cuda
Do not change the CUDA_MPS_PIPE_DIRECTORY variable, or it may damage other jobs.

Please be aware that if you do not follow these rules and cause damage to other users, we may take measures such as deleting your job without notice or suspending your TSUBAME account.
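
A minimal usage sketch inside a job (my_app is a placeholder; the control commands follow standard MPS usage, using the T4-provided command loaded above):

$ module load cuda
$ nvidia-cuda-mps-control -d              # start the MPS control daemon
$ ./my_app & ./my_app &                   # CUDA processes share the GPU via MPS
$ wait
$ echo quit | nvidia-cuda-mps-control     # stop the daemon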

4.5.6. GPU COMPUTE MODE

You can change the GPU compute mode only when using the resource type node_f in a batch job.
To change the GPU compute mode, specify node_f in the job script and additionally specify #$ -v GPU_COMPUTE_MODE=<MODE>.
The following three modes are available.

Mode Description
0 DEFAULT mode
Multiple contexts are allowed per device.
1 EXCLUSIVE_PROCESS mode
Only one context is allowed per device, usable from multiple threads at a time.
2 PROHIBITED mode
No contexts are allowed per device (no compute apps).

Info

If GPU_COMPUTE_MODE is not specified when node_f is specified, DEFAULT mode (0) is set.
Also, GPU_COMPUTE_MODE is fixed at DEFAULT mode (0) when anything other than node_f is specified.

Here is a sample job script.

#!/bin/sh
#$ -cwd
#$ -l node_f=1
#$ -l h_rt=1:00:00
#$ -N gpumode
#$ -v GPU_COMPUTE_MODE=1
/usr/bin/nvidia-smi

When using an interactive job, it can be specified as follows.

$ qrsh -g [TSUBAME group] -l node_f=1 -l h_rt=0:10:00 -pty yes -v TERM -v GPU_COMPUTE_MODE=1 /bin/bash

4.6. Use containers

In TSUBAME4.0, Apptainer (Singularity) is available as a container environment for HPC.

An example of how to use Apptainer is shown below.

4.6.1. Build image

An example of the Apptainer image creation process is shown below. This example uses the latest Ubuntu Docker image.

[Options]

  • -nv : Use the GPU
  • -B : Mount a file system
  • -s : Build the image in a sandbox (directory) format

$ mkdir $HOME/apptainer
$ cd $HOME/apptainer
$ apptainer build -s ubuntu/ docker://ubuntu:latest
INFO:    Starting build...
Getting image source signatures
Copying blob 49b384cc7b4a done
Copying config bf3dc08bfe done
Writing manifest to image destination
Storing signatures
2024/05/29 13:07:49  info unpack layer: sha256:49b384cc7b4aa0dfd16ff7817ad0ea04f1d0a8072e62114efcd99119f8ceb9ed
2024/05/29 13:07:50  warn xattr{etc/gshadow} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
2024/05/29 13:07:50  warn xattr{$HOME/apptainer/build-temp-2088960457/rootfs/etc/gshadow} destination filesystem does not support xattrs, further warnings will be suppressed
INFO:    Creating sandbox directory...
INFO:    Build complete: ubuntu/

$ cd ubuntu/
$ mkdir gs apps      # Create directories to mount /gs and /apps
$ cd ../

4.6.2. Shell start

The following is an example of how to start a shell with Apptainer. Here, the image created by Build image is used.

[Options]

  • -nv : Use the GPU
  • -B : Mount a file system
  • -f (--fakeroot) : Exercise root privileges in the container
  • -w (--writable) : Allow writing into the container

Info

The -w (--writable) option must be specified.
If not specified, we have confirmed that the shell will not start properly.

$ cd $HOME/apptainer
$ apptainer shell -B /gs -B /apps -B /home --nv -f -w ubuntu/
INFO:    User not listed in /etc/subuid, trying root-mapped namespace
INFO:    Using fakeroot command combined with root-mapped namespace
WARNING: nv files may not be bound with --writable
WARNING: Skipping mount /etc/localtime [binds]: /etc/localtime doesn't exist in container
WARNING: Skipping mount /bin/nvidia-smi [files]: /usr/bin/nvidia-smi doesn't exist in container
WARNING: Skipping mount /bin/nvidia-debugdump [files]: /usr/bin/nvidia-debugdump doesn't exist in container
WARNING: Skipping mount /bin/nvidia-persistenced [files]: /usr/bin/nvidia-persistenced doesn't exist in container
WARNING: Skipping mount /bin/nvidia-cuda-mps-control [files]: /usr/bin/nvidia-cuda-mps-control doesn't exist in container
WARNING: Skipping mount /bin/nvidia-cuda-mps-server [files]: /usr/bin/nvidia-cuda-mps-server doesn't exist in container
Apptainer> id
uid=0(root) gid=0(root) groups=0(root)
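
Because the shell has (fake) root privileges in a writable sandbox, you can modify the container, for example by installing packages inside the Ubuntu image (a sketch; the package name is illustrative):

Apptainer> apt-get update && apt-get install -y build-essential
Apptainer> exit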

4.6.2.1. Precautions when using fakeroot

When using the fakeroot function (--fakeroot) in Apptainer, the libc versions of the host and the container must match, because Apptainer bind-maps the host's fakeroot into the container.
If the host's libc library is newer than the corresponding library in the container, the fakeroot command may report that a GLIBC version is missing. (Reference: Fakeroot feature)
Example:

/.singularity.d/libs/faked: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /.singularity.d/libs/faked)
/.singularity.d/libs/faked: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /.singularity.d/libs/faked)
fakeroot: error while starting the `faked' daemon.
/.singularity.d/libs/fakeroot: 1: kill: Usage: kill [-s sigspec | -signum |-sigspec] [pid | job]... or
kill -l [exitstatus]

If the above error occurs, recreate the container so that the libc library versions of the host and container match.
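
To compare the glibc versions before rebuilding, you can check them on the host and in the container (a sketch using the sandbox image built above):

$ ldd --version | head -n1                          # host glibc
$ apptainer exec ubuntu/ ldd --version | head -n1   # container glibc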