
4. Software Environment

4.1. Change User Environment

In this system, you can switch the compiler and application environments by using the module command.

4.1.1. List the Available Modules

You can check available modules with "module avail" or "module ava".

$ module avail

Available modules are described in Application software.

4.1.2. Display the named module information

One can display short information about the named module by issuing the command "module whatis MODULE".

$ module whatis intel/17.0.4.196
intel/17.0.4.196     : Intel Compiler version 17.0.4.196 (parallel_studio_xe_2017) and MKL

4.1.3. Load the named module

One can load the named module by issuing the command "module load MODULE".

$ module load intel/17.0.4.196

Please load the same module in the job script that you used at compile time.

4.1.4. List all the currently loaded modules

One can list the modules currently loaded by issuing the command “module list".

$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196   2) cuda/8.0.61

4.1.5. Unload the named module

One can unload the named module by issuing the command “module unload MODULE".

$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196   2) cuda/8.0.61
$ module unload cuda
$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196

4.1.6. Remove all modules

One can remove all modules by issuing the command "module purge".

$ module list
Currently Loaded Modulefiles:
  1) intel/17.0.4.196   2) cuda/8.0.61
$ module purge
$ module list
No Modulefiles Currently Loaded.

4.2. Usage in job script

To use the module command in a job script, it is necessary to initialize the module command as follows.

[sh, bash]

. /etc/profile.d/modules.sh
module load intel/17.0.4.196

[csh, tcsh]

source /etc/profile.d/modules.csh
module load intel/17.0.4.196
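A complete job script might therefore look like the following minimal sketch; the resource type (f_node), run time, and program name ./a.out are illustrative and should be adjusted to your job.

#!/bin/sh
#$ -cwd
#$ -l f_node=1
#$ -l h_rt=0:10:00
# initialize the module command (sh/bash)
. /etc/profile.d/modules.sh
# load the same module that was used at compile time
module load intel/17.0.4.196
./a.out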

4.3. Intel Compiler

In this system, the Intel compiler, the PGI compiler, and the GNU compiler are available. The Intel compiler commands are as follows.

Command Language Syntax
ifort Fortran 77/90/95 $ ifort [option] source_file
icc C $ icc [option] source_file
icpc C++ $ icpc [option] source_file

To use it, please load "intel" with the module command.
If you specify the --help option, a list of compiler options is displayed.

4.3.1. Compiler options

The compiler options are shown below.

Option Description
-O0 Disables all optimizations. Used for debugging, etc.
-O1 Affects code size and locality. Disables specific optimizations.
-O2 Default optimizations. Same as -O. Enables optimizations for speed, including global code scheduling, software pipelining, and predication.
-O3 Aggressive optimizations for maximum speed (but does not guarantee higher performance). Optimizations include data prefetching, scalar replacement, and loop transformations.
-xCORE-AVX2 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel AVX2 instructions.
-xSSE4.2 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel SSE4.2 instructions.
-xSSSE3 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel SSSE3 instructions.
-qopt-report=n Generates an optimization report and directs it to stderr.
n=0 : disable optimization report output
n=1 : minimum report output
n=2 : medium output (DEFAULT)
n=3 : maximum report output
-fp-model precise Tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations. It disables optimizations that can change the result of floating-point calculations. These semantics ensure the accuracy of floating-point computations, but they may slow performance.
-g Produces symbolic debug information in object file (implies -O0 when another optimization option is not explicitly set)
-traceback Tells the compiler to generate extra information in the object file to provide source file traceback information when a severe error occurs at runtime. Specifying -traceback will increase the size of the executable program, but has no impact on runtime execution speeds.

4.3.2. Recommended optimization options

The recommended optimization options for compilation on this system are shown below.

Option Description
-O3 Aggressive optimizations for maximum speed (but does not guarantee higher performance). Optimizations include data prefetching, scalar replacement, and loop transformations.
-xCORE-AVX2 The generated executable will not run on non-Intel processors and it will not run on Intel processors that do not support Intel AVX2 instructions.

If the performance of the program deteriorates with the above options, lower the optimization level to -O2 or change the vectorization option. If the results do not match, also try the floating-point option (-fp-model precise).
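For example, a Fortran source file could be compiled with the recommended options as follows; the file name sample.f90 is illustrative.

$ module load intel
$ ifort -O3 -xCORE-AVX2 sample.f90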

4.3.3. Intel 64 architecture memory model

The -mcmodel option tells the compiler to use a specific memory model to generate code and store data.

Memory model Description
small (-mcmodel=small) Tells the compiler to restrict code and data to the first 2GB of address space. All accesses of code and data can be done with Instruction Pointer (IP)-relative addressing.
medium (-mcmodel=medium) Tells the compiler to restrict code to the first 2GB; it places no memory restriction on data. Accesses of code can be done with IP-relative addressing, but accesses of data must be done with absolute addressing.
large (-mcmodel=large) Places no memory restriction on code or data. All accesses of code and data must be done with absolute addressing.

When you specify option -mcmodel=medium or -mcmodel=large, it sets option -shared-intel. This ensures that the correct dynamic versions of the Intel run-time libraries are used.
If you specify option -static-intel while -mcmodel=medium or -mcmodel=large is set, an error will be displayed.

<some lib.a library>(some .o): In Function <function>:
  : relocation truncated to fit: R_X86_64_PC32 <some symbol>
...
  : relocation truncated to fit: R_X86_64_PC32 <some symbol>

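As a minimal sketch, assuming a Fortran program whose static data exceeds 2GB (the file name large_data.f90 is illustrative), the medium memory model can be selected as follows; -shared-intel is set automatically in this case.

$ ifort -mcmodel=medium -O3 -xCORE-AVX2 large_data.f90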

4.4. PGI compiler

PGI compiler commands are shown below.

Command Language Syntax
pgfortran Fortran 77/90/95 $ pgfortran [option] source_file
pgcc C $ pgcc [option] source_file
pgc++ C++ $ pgc++ [option] source_file

There are two versions of the PGI compiler: an LLVM version and a non-LLVM version.
To use the LLVM version, load the module as follows.

module load pgi

To use the non-LLVM version, load the modules as follows.

module load pgi-nollvm pgi

For details of each command, please refer to the man pages, e.g. $ man pgcc.
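As a simple sketch (the source file names are illustrative), a Fortran or C program can be compiled after loading the module:

$ module load pgi
$ pgfortran -O2 sample.f90
$ pgcc -O2 sample.c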

4.5. Parallelization

4.5.1. Thread parallel (OpenMP, Automatic parallelization)

The command format when using OpenMP or automatic parallelization is shown below.

Language Command
OpenMP
Fortran 77/90/95 $ ifort -qopenmp [option] source_file
C $ icc -qopenmp [option] source_file
C++ $ icpc -qopenmp [option] source_file
Automatic Parallelization
Fortran 77/90/95 $ ifort -parallel [option] source_file
C $ icc -parallel [option] source_file
C++ $ icpc -parallel [option] source_file

-qopt-report-phase=openmp: Reports loops, regions, sections, and tasks successfully parallelized.

-qopt-report-phase=par: Reports which loops were parallelized.
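For example, an OpenMP program might be compiled and run as follows; the source file name and thread count are illustrative.

$ module load intel
$ ifort -qopenmp -qopt-report-phase=openmp sample.f90
$ OMP_NUM_THREADS=4 ./a.out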

4.5.2. Process parallel (MPI)

The command format when using MPI is shown below. Before use, please load the corresponding MPI environment with the module command.

MPI Library Language Command
Intel MPI Fortran 77/90/95 $ mpiifort [option] source_file
C $ mpiicc [option] source_file
C++ $ mpiicpc [option] source_file
Open MPI Fortran 77/90/95 $ mpifort [option] source_file
C $ mpicc [option] source_file
C++ $ mpicxx [option] source_file
SGI MPT Fortran 77/90/95 $ mpif90 [option] source_file
C $ mpicc [option] source_file
C++ $ mpicxx [option] source_file
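For example, with Open MPI and the Intel compiler (the module combination and file name are illustrative; check module avail for the versions actually installed):

$ module load intel openmpi
$ mpifort -O3 -xCORE-AVX2 sample.f90
# run inside a job script or an interactive job
$ mpirun -np 4 ./a.out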

4.6. GPU Environment

TSUBAME 3.0 provides an environment of Intel CPUs in conjunction with GPUs (NVIDIA TESLA P100).

4.6.1. Interactive execution and debug

As the login nodes (login, login0, login1) do not have GPUs, you cannot run GPU codes there; only compiling and linking are possible. In addition, heavy work on the login nodes is restricted.

You can run GPU codes interactively and debug them on compute nodes through the batch system. Please refer to Interactive job for more details.
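For example, an interactive session on a GPU node might be obtained as follows; the group name placeholder, resource type, and time limit are illustrative.

$ qrsh -g [TSUBAME group] -l f_node=1 -l h_rt=0:10:00
$ module load cuda
$ nvidia-smi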

4.6.2. Supported applications for GPU

Current GPU compatible applications are as follows. (As of 2017.12.18)

  • ABAQUS 2017 --- Please refer to ABAQUS usage guide (separate volume).
  • NASTRAN 2017.1 --- Please refer to NASTRAN usage guide (separate volume).
  • ANSYS 18 --- Please refer to ANSYS usage guide (separate volume).
  • AMBER 16 --- Please refer to AMBER usage guide (separate volume).
  • Maple 2016 --- Please refer to Maple usage guide (separate volume).
  • Mathematica 11.2 --- Please refer to Mathematica usage guide (separate volume).
  • MATLAB --- Please refer to MATLAB usage guide (separate volume).
  • Forge --- Please refer to Forge usage guide (separate volume).
  • PGI Compiler --- Please refer to PGI usage guide (separate volume).

Other applications will be made available sequentially.

4.6.3. MPI Environment with CUDA

An MPI environment compatible with CUDA is available.

OpenMPI + gcc Environment

# load CUDA and Open MPI Environment (gcc is default setting)
module load cuda openmpi

OpenMPI + pgi environment

# Load CUDA and PGI Environment (First load the compiler environment)
module load cuda pgi
# Load Open MPI Environment (The OpenMPI environment according to the compiler is set up)
module load openmpi

Info

The specific version openmpi/2.1.2-pgi2019 described previously is no longer necessary at present.

In the PGI-bundled version, there is no integration with the batch system, so you need to specify the host list at run time.
An example job script is shown below.
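The following is a minimal sketch of such a job script, assuming two f_node nodes and one MPI process per node; building the host list from $PE_HOSTFILE with awk and the program name ./a.out are illustrative and may need adjustment for your environment.

#!/bin/sh
#$ -cwd
#$ -l f_node=2
#$ -l h_rt=0:10:00
. /etc/profile.d/modules.sh
module load cuda pgi
# build a plain host list from the hosts assigned by the batch system
awk '{print $1}' $PE_HOSTFILE > hostlist
# launch with the PGI-bundled mpirun, specifying the host list explicitly
mpirun -np 2 -npernode 1 -hostfile ./hostlist ./a.out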

OpenMPI + Intel Environment

# Load CUDA and Intel Environment (First load the compiler environment)
module load cuda intel
# Load Open MPI Environment (The OpenMPI environment according to the compiler is set up)
module load openmpi
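As one possible sketch of building a CUDA + MPI program in this environment (the file names are illustrative, and the CUDA library path handling depends on how the cuda module is set up):

$ module load cuda intel openmpi
$ nvcc -c kernel.cu
$ mpicxx -c main.cpp
# link; add -L$CUDA_HOME/lib64 if the linker cannot find libcudart (CUDA_HOME is an assumption)
$ mpicxx -o a.out main.o kernel.o -lcudart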

4.6.4. NVIDIA GPUDirect

Currently, NVIDIA GPUDirect (GPUDIRECT FAMILY) provides four functions: GPUDIRECT SHARED GPU SYSMEM, GPUDIRECT P2P, GPUDIRECT RDMA, and GPUDIRECT ASYNC. (As of 2017.12.18)
Of these, TSUBAME 3.0 supports GPUDIRECT SHARED GPU SYSMEM, GPUDIRECT P2P, GPUDIRECT RDMA.

  • GPUDIRECT SHARED GPU SYSMEM (Version1)
    This function allows the address of CUDA pinned memory or device memory to be specified directly in the send/receive buffer of MPI. When a device memory address is specified, the data is actually transferred via a buffer on the host memory.

  • GPUDIRECT P2P (Version2)
    It is a function for direct data transfer (P2P) between GPUs via PCI-Express and NVLink. In TSUBAME 3.0, four GPUs are installed per node, and each CPU is connected to two GPUs via a PLX switch. The four GPUs are interconnected by high-speed NVLink.

  • GPUDIRECT RDMA (Version3)
    It is a function to realize high-speed data transfer between GPUs of different nodes by directly transferring data (RDMA) between the GPU and the interconnect (Intel Omni-Path in TSUBAME 3.0) without going through the host memory.

  • GPUDIRECT ASYNC
    This is asynchronous communication between the GPU and the interconnect without going through the host memory. Currently, the Intel Omni-Path of TSUBAME 3.0 does not support it.

Reference: http://on-demand.gputechconf.com/gtc/2017/presentation/s7128-davide-rossetti-how-to-enable.pdf

For GPUDirect, please also refer to the following URL.

4.6.5. GPUDirect RDMA

Calling cudaSetDevice() before calling MPI_Init() is mandatory to use GPUDirect RDMA on OPA1.9. (https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_PSM2_PG_H76473_v12_0.pdf p.15)

CUDA support is limited to using a single GPU per process.
You set up the CUDA runtime and pre-select a GPU card (through the use of cudaSetDevice() or a similar CUDA API) prior to calling psm2_init() or MPI_Init(), if using MPI.
While systems with a single GPU may not have this requirement, systems with multiple GPU may see non-deterministic results without proper initialization.
Therefore, it is strongly recommended that you initialize the CUDA runtime before the psm2_init() or MPI_Init() call.

Therefore, modify your code as described above, or use the openmpi/3.1.4-opa10.10-t3 module file, which applies the modification inside OpenMPI.
It can be loaded with module load cuda openmpi/3.1.4-opa10.10-t3.

The following shows how to execute a program with GPUDirect RDMA using OpenMPI. Below is an execution example with two nodes and two MPI processes (one per node).

$ module load cuda openmpi/3.1.4-opa10.10-t3
$ mpirun -np 2 -npernode 1 -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -x LD_LIBRARY_PATH -x PATH [program]
  • PSM2_CUDA --- Enables CUDA support in PSM2 of Omni-Path
  • PSM2_GPUDIRECT --- Enables NVIDIA GPUDirect RDMA in PSM2
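The same execution can also be wrapped in a batch job script, for example as the following sketch; the resource settings and program name ./a.out are illustrative.

#!/bin/sh
#$ -cwd
#$ -l f_node=2
#$ -l h_rt=0:10:00
. /etc/profile.d/modules.sh
module load cuda openmpi/3.1.4-opa10.10-t3
# one MPI process per node, with CUDA support and GPUDirect RDMA enabled in PSM2
mpirun -np 2 -npernode 1 -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -x LD_LIBRARY_PATH -x PATH ./a.out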

4.6.6. GPU COMPUTE MODE

You can change the GPU compute mode only when using a batch job with resource type f_node.
To change the GPU compute mode, specify f_node in the job script and additionally specify #$ -v GPU_COMPUTE_MODE=<MODE>.
The following three modes are available.

Mode Description
0 DEFAULT mode
Multiple contexts are allowed per device.
1 EXCLUSIVE_PROCESS mode
Only one context is allowed per device, usable from multiple threads at a time.
2 PROHIBITED mode
No contexts are allowed per device (no compute apps).

Here is a sample job script.

#!/bin/sh
#$ -cwd
#$ -l f_node=1
#$ -l h_rt=1:00:00
#$ -N gpumode
#$ -v GPU_COMPUTE_MODE=1
/usr/bin/nvidia-smi

When using an interactive job, it can be specified as follows.

$ qrsh -g [TSUBAME group] -l f_node=1 -l h_rt=0:10:00 -pty yes -v TERM -v GPU_COMPUTE_MODE=1 /bin/bash