4. Software Environment

4.1. Change User Environment

In this system, you can switch between compilers and application environments by using the module command.

4.1.1. List the Available Modules

You can check the available modules with "module avail" or "module ava".

$ module avail

Available modules are described in Application software.

4.1.2. Display the named module information

One can display short information for a module by issuing the command "module whatis MODULE".

$ module whatis intel/17.0.4.196
intel/17.0.4.196     : Intel Compiler version 17.0.4.196 (parallel_studio_xe_2017) and MKL

4.1.3. Load the named module

One can load the named module by issuing the command "module load MODULE".

$ module load intel/17.0.4.196

In your job script, please load the same module that you used at compile time.

4.1.4. List all the currently loaded modules

One can list the currently loaded modules by issuing the command "module list".

$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196   2) cuda/8.0.61

4.1.5. Unload the named module

One can unload the named module by issuing the command "module unload MODULE".

$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196   2) cuda/8.0.61
$ module unload cuda
$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196

4.1.6. Remove all modules

One can remove all modules by issuing the command "module purge".

$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196   2) cuda/8.0.61
$ module purge
$ module list
No Modulefiles Currently Loaded.

4.2. Usage in job script

When executing the module command in a job script, you must first initialize the module command inside the job script, as follows.

[sh, bash]

. /etc/profile.d/modules.sh
module load intel/17.0.4.196

[csh, tcsh]

source /etc/profile.d/modules.csh
module load intel/17.0.4.196
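
Combining the above, a minimal sh/bash job script skeleton might look like the following; the resource options and the executable name ./a.out are placeholders for illustration:

```shell
#!/bin/sh
#$ -cwd
#$ -l h_rt=0:10:00
# initialize the module command inside the job script
. /etc/profile.d/modules.sh
# load the same module that was used at compile time
module load intel/17.0.4.196
./a.out
```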

4.3. Intel Compiler

In this system, you can use the Intel compiler, the PGI compiler, and the GNU compiler. The Intel compiler commands are as follows.

Command Language Syntax
ifort Fortran 77/90/95 $ ifort [option] source_file
icc C $ icc [option] source_file
icpc C++ $ icpc [option] source_file

To use them, please load "intel" with the module command.
If you specify the --help option, a list of compiler options is displayed.
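
As a sketch, a typical compile session could look like this; the source file name is a placeholder:

```shell
# load the Intel compiler environment
module load intel
# list the available compiler options
ifort --help
# compile a Fortran source file
ifort sample.f90 -o sample.out
```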

4.3.1. Compiler options

The compiler options are shown below.

Option Description
-O0 Disables all optimizations. Used for debugging, etc.
-O1 Optimizes for code size and locality. Disables specific optimizations.
-O2 Default optimizations. Same as -O. Enables optimizations for speed, including global code scheduling, software pipelining, and predication.
-O3 Aggressive optimizations for maximum speed (but does not guarantee higher performance), including data prefetching, scalar replacement, and loop transformations.
-xCORE-AVX2 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel AVX2 instructions.
-xSSE4.2 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel SSE4.2 instructions.
-xSSSE3 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel SSSE3 instructions.
-qopt-report=n Generates an optimization report and directs it to stderr.
n=0 : disable optimization report output
n=1 : minimum report output
n=2 : medium output (DEFAULT)
n=3 : maximum report output
-fp-model precise Tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations. It disables optimizations that can change the result of floating-point calculations. These semantics ensure the accuracy of floating-point computations, but they may slow performance.
-g Produces symbolic debug information in object file (implies -O0 when another optimization option is not explicitly set)
-traceback Tells the compiler to generate extra information in the object file to provide source file traceback information when a severe error occurs at runtime. Specifying -traceback will increase the size of the executable program, but has no impact on runtime execution speeds.

4.3.2. Recommended optimization options

The recommended optimization options for compiling on this system are shown below.

Option Description
-O3 Aggressive optimizations for maximum speed (but does not guarantee higher performance), including data prefetching, scalar replacement, and loop transformations.
-xCORE-AVX2 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel AVX2 instructions.

If the performance of the program deteriorates with the above options, lower the optimization level to -O2 or change the vectorization option. If the results do not match, also try the floating-point options.
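
For example, a compile with the recommended options, and a fallback if performance or results are problematic, might look like this (file names are placeholders):

```shell
module load intel
# recommended: aggressive optimization plus AVX2 code generation
ifort -O3 -xCORE-AVX2 sample.f90 -o sample.out
# fallback: lower optimization level and value-safe floating point
ifort -O2 -xCORE-AVX2 -fp-model precise sample.f90 -o sample.out
```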

4.3.3. Intel 64 architecture memory model

The following options tell the compiler to use a specific memory model to generate code and store data.

Memory model Description
small (-mcmodel=small) Tells the compiler to restrict code and data to the first 2GB of address space. All accesses of code and data can be done with Instruction Pointer (IP)-relative addressing.
medium (-mcmodel=medium) Tells the compiler to restrict code to the first 2GB; it places no memory restriction on data. Accesses of code can be done with IP-relative addressing, but accesses of data must be done with absolute addressing.
large (-mcmodel=large) Places no memory restriction on code or data. All accesses of code and data must be done with absolute addressing.

When you specify option -mcmodel=medium or -mcmodel=large, it sets option -shared-intel. This ensures that the correct dynamic versions of the Intel run-time libraries are used.
If you specify option -static-intel while -mcmodel=medium or -mcmodel=large is set, an error will be displayed.

<some lib.a library>(some .o): In Function <function>:
  : relocation truncated to fit: R_X86_64_PC32 <some symbol>
...
  : relocation truncated to fit: R_X86_64_PC32 <some symbol>

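
For reference, a sketch of compiling a program whose static data exceeds 2GB with the medium memory model (file names are placeholders):

```shell
module load intel
# -mcmodel=medium allows data larger than 2GB;
# it implies -shared-intel, so the dynamic Intel runtime libraries are used
ifort -mcmodel=medium -shared-intel large_data.f90 -o large_data.out
```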

4.4. PGI compiler

PGI compiler commands are shown below.

Command Language Syntax
pgfortran Fortran 77/90/95 $ pgfortran [option] source_file
pgcc C $ pgcc [option] source_file
pgc++ C++ $ pgc++ [option] source_file

There are two versions of the PGI compiler: an LLVM version and a non-LLVM version.
To use the LLVM version, run:

module load pgi

To use the non-LLVM version, run:

module load pgi-nollvm pgi

For details of each command, please refer to the man pages, e.g. $ man pgcc.
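
For example, a PGI compile session might look like the following; the source file name is a placeholder:

```shell
# LLVM version of the PGI compiler
module load pgi
pgfortran -O2 sample.f90 -o sample.out
```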

4.5. Parallelization

4.5.1. Thread parallel (OpenMP, Automatic parallelization)

The command formats for OpenMP and automatic parallelization are shown below.

Language Command
OpenMP
Fortran 77/90/95 $ ifort -qopenmp [option] source_file
C $ icc -qopenmp [option] source_file
C++ $ icpc -qopenmp [option] source_file
Automatic Parallelization
Fortran 77/90/95 $ ifort -parallel [option] source_file
C $ icc -parallel [option] source_file
C++ $ icpc -parallel [option] source_file

-qopt-report-phase=openmp: Reports loops, regions, sections, and tasks successfully parallelized.

-qopt-report-phase=par: Reports which loops were parallelized.
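
As an example, compiling with OpenMP or automatic parallelization together with a parallelization report might look like this (the source file name is a placeholder):

```shell
module load intel
# OpenMP, reporting successfully parallelized regions
ifort -qopenmp -qopt-report=2 -qopt-report-phase=openmp sample.f90 -o sample.out
# automatic parallelization, reporting which loops were parallelized
ifort -parallel -qopt-report=2 -qopt-report-phase=par sample.f90 -o sample.out
```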

4.5.2. Process parallel (MPI)

The command formats for MPI are shown below. Before use, please load the appropriate MPI environment with the module command.

MPI Library Language Command
Intel MPI Fortran 77/90/95 $ mpiifort [option] source_file
C $ mpiicc [option] source_file
C++ $ mpiicpc [option] source_file
Open MPI Fortran 77/90/95 $ mpifort [option] source_file
C $ mpicc [option] source_file
C++ $ mpicxx [option] source_file
SGI MPT Fortran 77/90/95 $ mpif90 [option] source_file
C $ mpicc [option] source_file
C++ $ mpicxx [option] source_file
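
For example, compiling an MPI program with Intel MPI might look like this; the module name intel-mpi and the file names are assumptions for illustration:

```shell
# load the compiler and the MPI environment
module load intel intel-mpi
mpiifort sample_mpi.f90 -o sample_mpi.out
```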

4.6. GPU Environment

TSUBAME 3.0 provides an environment of Intel CPUs in conjunction with GPUs (NVIDIA Tesla P100).

4.6.1. Interactive execution and debug

As the login nodes (login, login0, login1) do not have GPUs, you cannot run GPU codes there; only compiling and linking are possible. In addition, heavy work on the login nodes is restricted.

You can run GPU codes interactively and debug them on compute nodes through the batch system. Please refer to Interactive job for more details.

4.6.2. Supported applications for GPU

The current GPU-compatible applications are as follows (as of 2017.12.18).

  • ABAQUS 2017 --- Please refer to ABAQUS usage guide (separate volume).
  • NASTRAN 2017.1 --- Please refer to NASTRAN usage guide (separate volume).
  • ANSYS 18 --- Please refer to ANSYS usage guide (separate volume).
  • AMBER 16 --- Please refer to AMBER usage guide (separate volume).
  • Maple 2016 --- Please refer to Maple usage guide (separate volume).
  • Mathematica 11.2 --- Please refer to Mathematica usage guide (separate volume).
  • MATLAB --- Please refer to MATLAB usage guide (separate volume).
  • Forge --- Please refer to Forge usage guide (separate volume).
  • PGI Compiler --- Please refer to PGI usage guide (separate volume).

Support for other applications will be provided sequentially.

4.6.3. MPI Environment with CUDA

MPI environment compatible with CUDA is available.

OpenMPI + gcc Environment

# load CUDA and Open MPI Environment (gcc is default setting)
module load cuda openmpi

OpenMPI + pgi environment

# Load CUDA and PGI Environment(First load the compiler environment)
module load cuda pgi
# Load Open MPI Environment(The OpenMPI environment according to the compiler is set up)
module load openmpi

Info

The specific version openmpi/2.1.2-pgi2019 described previously is no longer necessary at present.

The PGI bundle version has no linkage with the batch system, so you need to specify the host list at run time.
An example job script is shown below.

OpenMPI + Intel Environment

# Load CUDA and Intel Environment(First load the compiler environment)
module load cuda intel
# Load Open MPIEnvinronment (The OpenMPI environment according to the compiler is set up)
module load openmpi
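
To check whether the loaded Open MPI was actually built with CUDA support, ompi_info can be queried; a CUDA-aware build reports the value true:

```shell
module load cuda openmpi
# query the build-time CUDA support flag of Open MPI
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```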

4.6.4. NVIDIA GPUDirect

Currently, NVIDIA GPUDirect (GPUDIRECT FAMILY) comprises four functions: GPUDIRECT SHARED GPU SYSMEM, GPUDIRECT P2P, GPUDIRECT RDMA, and GPUDIRECT ASYNC (as of 2017.12.18).
Of these, TSUBAME 3.0 supports GPUDIRECT SHARED GPU SYSMEM, GPUDIRECT P2P, and GPUDIRECT RDMA.

  • GPUDIRECT SHARED GPU SYSMEM (Version1)
    This function allows the address of CUDA pinned memory or device memory to be specified directly in the send/receive buffer of MPI. When a device memory address is specified, the data is actually transferred via a buffer on the host memory.

  • GPUDIRECT P2P (Version2)
    This function performs direct data transfer (P2P) between GPUs via PCI Express and NVLink. In TSUBAME 3.0, four GPUs are installed per node; each CPU is connected to two GPUs via a PLX switch, and the four GPUs are connected to each other via high-speed NVLink.

  • GPUDIRECT RDMA (Version3)
    This function realizes high-speed data transfer between GPUs on different nodes by transferring data directly (RDMA) between the GPU and the interconnect (Intel Omni-Path in TSUBAME 3.0) without going through host memory.

  • GPUDIRECT ASYNC
    This function performs asynchronous communication between the GPU and the interconnect without going through host memory. Currently, the Intel Omni-Path of TSUBAME 3.0 does not support it.

Reference: http://on-demand.gputechconf.com/gtc/2017/presentation/s7128-davide-rossetti-how-to-enable.pdf

For GPUDirect, please also refer to the reference URL above.

4.6.5. GPUDirect RDMA

Calling cudaSetDevice() before calling MPI_Init() is mandatory to use GPUDirect RDMA on OPA1.9. See https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_PSM2_PG_H76473_v12_0.pdf p.15:

CUDA support is limited to using a single GPU per process.
You set up the CUDA runtime and pre-select a GPU card (through the use of cudaSetDevice() or a similar CUDA API) prior to calling psm2_init() or MPI_Init(), if using MPI.
While systems with a single GPU may not have this requirement, systems with multiple GPU may see non-deterministic results without proper initialization.
Therefore, it is strongly recommended that you initialize the CUDA runtime before the psm2_init() or MPI_Init() call.

So either modify your code as described above, or use the openmpi/3.1.4-opa10.10-t3 module file, which applies the modification inside Open MPI.
It is available via module load cuda openmpi/3.1.4-opa10.10-t3.

The following shows how to execute GPUDirect RDMA with Open MPI. The example below runs on two nodes with one MPI process per node (two MPI processes in total).

$ module load cuda openmpi/3.1.4-opa10.10-t3
$ mpirun -np 2 -npernode 1 -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -x LD_LIBRARY_PATH -x PATH [program]
  • PSM2_CUDA --- Enables CUDA support in PSM2 of Omni-Path
  • PSM2_GPUDIRECT --- Enable NVIDIA GPUDirect RDMA in PSM2

4.6.6. GPU COMPUTE MODE

You can change the GPU compute mode only when using a batch job with resource type f_node.
To change the GPU compute mode, specify f_node in the job script and additionally specify #$ -v GPU_COMPUTE_MODE=<MODE>.
The following three modes are available.

Mode Description
0 DEFAULT mode: multiple contexts are allowed per device.
1 EXCLUSIVE_PROCESS mode: only one context is allowed per device, usable from multiple threads at a time.
2 PROHIBITED mode: no contexts are allowed per device (no compute apps).

Here is a sample job script.

#!/bin/sh
#$ -cwd
#$ -l f_node=1
#$ -l h_rt=1:00:00
#$ -N gpumode
#$ -v GPU_COMPUTE_MODE=1
/usr/bin/nvidia-smi

When using an interactive job, it can be used as follows.

$ qrsh -g [TSUBAME3 group] -l f_node=1 -l h_rt=0:10:00 -pty yes -v TERM -v GPU_COMPUTE_MODE=1 /bin/bash