5. Job Scheduler

On this system, UNIVA Grid Engine manages the running and scheduling of jobs.

5.1. Kinds of compute nodes

5.1.1. Baremetal environments

5.1.1.1. Available resource types

In this system, a job is executed on a logically divided compute node called a "resource type."
When submitting a job, specify how many units of a resource type to use (e.g. -l f_node=2). A list of available resource types is shown below.

Type  Resource type name  Physical CPU cores  Memory (GB)  GPUs
F     f_node              28                  235          4
H     h_node              14                  120          2
Q     q_node              7                   60           1
C1    s_core              1                   7.5          0
C4    q_core              4                   30           0
G1    s_gpu               2                   15           1

"Physical CPU cores", "Memory (GB)" and "GPUs" are the resources available per unit of each resource type.

  • Resource types cannot be combined.
  • The maximum run time is 24 hours.
  • TSUBAME 3 imposes various limit values, such as:
      • the number of concurrently executable jobs per person
      • the total number of slots that can be executed simultaneously per person
    A list of the current limit values can be confirmed at the following URL:
    https://www.t3.gsic.titech.ac.jp/en/resource-limit
    Please note that they may change at any time according to resource usage.

5.1.2. Container environments

In this system, to absorb software dependencies of applications that are difficult to run on the host OS, we provide system containers using Docker and application containers using Singularity.
This chapter describes how to use system container jobs with Docker. Please refer to the freeware chapter for Singularity.

5.1.2.1. Available resource types

The following resource types can be used for container jobs.
The names used in batch scripts carry an .mpi suffix at the end.

Type  Resource type name  Batch job using containers  Interactive job using containers
      (using nodes)       (multiple containers)       (single container)
F     f_node              t3_d_f_node.mpi             t3_d_f_node
H     h_node              t3_d_h_node.mpi             t3_d_h_node
Q     q_node              t3_d_q_node.mpi             t3_d_q_node
C1    s_core              ( t3_d_s_core )             t3_d_s_core
C4    q_core              t3_d_q_core.mpi             t3_d_q_core
G1    s_gpu               t3_d_s_gpu.mpi              t3_d_s_gpu

The resource type t3_d_s_core allows communication to the Internet but does not support inter-container communication.
Therefore, please specify another container resource type when using MPI or inter-container communication.

The following shows the qsub command options when using nodes and containers.

                   Using nodes                       Using containers
Set image          -                                 -ac d=[Container image]
Set resource type  -l [Resource type Name]=[Number]  -jc [Container resource type] -t 1-[Number]
Set walltime       -l h_rt=[Maximum run time]        -adds l_hard h_rt [Maximum run time]

For container jobs, the -t option specifies the number of containers.
For example, with 4 containers, specify -t 1-4. MPI host files are stored in $SGE_JOB_SPOOL_DIR. Please use the host file of your MPI implementation at execution time.

MPI        Hostfile name
Intel MPI  impi_hostfile
OpenMPI    ompi_hostfile
MPICH      mpich_hostfile

Only the images provided by the system can be used; please refer to the System Software page for the list of available images.

5.2. Job submission

To execute the job in this system, log in to the login node and execute the qsub command.

5.2.1. Job submission flow

In order to submit a job, create and submit a job script. The submission command is qsub.

  1. Create a job script
  2. Submit a job using qsub
  3. Status check using qstat
  4. Cancel a job using qdel
  5. Check job result

The qsub command confirms billing information (TSUBAME 3 points) and accepts jobs.

5.2.2. Creating job script

Here is a job script format:

#!/bin/sh
#$ -cwd
#$ -l [Resource type Name]=[Number]
#$ -l h_rt=[Maximum run time]
#$ -p [Priority]
[Initialize module environment]                                         
[Load the relevant modules needed for the job]                                 
[Your program]

Warning

The shebang (the #!/bin/sh line) must be the first line of the job script.

  • [Initialize module environment]
    Initialize the module environment by executing the following.
. /etc/profile.d/modules.sh
  • [Load the relevant modules needed for the job]
    Load the relevant modules needed for the job with the module command.
    For example, to load the Intel compiler:
module load intel
  • [Your program] Execute your program. For example, if your binary is named "a.out":
./a.out

In a shell script, you can set qsub options on lines that begin with #$.
Alternatively, the options can be passed on the qsub command line.
You must always specify the resource type and the maximum run time.
The options accepted by qsub are listed below.

Option                            Description
-l [Resource type Name]=[Number]  (Required) Specify the resource type and the number of units.
-l h_rt=[Maximum run time]        (Required) Specify the maximum run time in hours, minutes and seconds.
                                  You can specify it as HH:MM:SS, MM:SS or SS.
-N                                Name of the job (script file name if not specified).
-o                                Name of the standard output file.
-e                                Name of the standard error output file.
-m                                Sends email when the job begins, ends or aborts. The conditions for the -m argument are:
                                  a: mail is sent when the job is aborted.
                                  b: mail is sent when the job begins.
                                  e: mail is sent when the job ends.
                                  Combinations such as abe are also possible.
                                  When a large number of jobs with the mail option are submitted, a large amount of mail is sent as well; this puts a heavy load on the mail server, may be detected as an attack, and mail from Tokyo Tech may be blocked. If you need to execute such jobs, please remove the mail option or rework the script so that everything runs as a single job.
-M                                Email address to send email to.
-p                                (Premium option) Specify the job execution priority. If -4 or -3 is specified, a higher charge factor than for -5 is applied. The setting values -5, -4, -3 correspond to priorities 0, 1, 2 of the charging rule.
                                  -5: Standard execution priority (default).
                                  -4: Execution priority higher than -5 and lower than -3.
                                  -3: Highest execution priority.
-t                                Submits an array job, specified as start-end[:step].
-hold_jid                         Defines the job dependency list of the submitted job. The job is executed after the specified dependent jobs have finished.
-ar                               Specify the reserved AR ID when using reserved nodes.
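Putting the required and common options together, a job script header might look like the following sketch; the job name and mail address are placeholders, and the group is supplied later on the qsub command line:

```shell
#!/bin/sh
#$ -cwd                  # run in the current working directory
#$ -l f_node=2           # resource type F, 2 units (required)
#$ -l h_rt=8:00:00       # maximum run time of 8 hours (required)
#$ -N myjob              # job name (placeholder)
#$ -m be                 # send mail at job begin and end
#$ -M user@example.com   # mail address (placeholder)
```

The same options could instead be given directly on the qsub command line when submitting.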

5.2.3. Job script examples

5.2.3.1. Serial job / GPU job

The following is an example of a job script for executing a serial (non-parallelized) job or a GPU job.
For a GPU job, replace -l s_core=1 with -l s_gpu=1 and load the necessary modules, such as the CUDA environment.

#!/bin/sh
## Run in current working directory
#$ -cwd
## Resource type C1: qty 1
#$ -l s_core=1
## maximum run time
#$ -l h_rt=1:00:00
#$ -N serial
## Initialize module command
. /etc/profile.d/modules.sh
# Load CUDA environment
module load cuda
## Load Intel compiler environment
module load intel
./a.out

5.2.3.2. SMP job

An example of a job script created when executing an SMP parallel job is shown below.
Hyper-threading is enabled for compute nodes. Please explicitly specify the number of threads to use.

#!/bin/sh
#$ -cwd
## Resource type F: qty 1
#$ -l f_node=1
#$ -l h_rt=1:00:00
#$ -N openmp
. /etc/profile.d/modules.sh
module load cuda
module load intel
## 28 threads per node
export OMP_NUM_THREADS=28
./a.out

5.2.3.3. MPI job

An example of a job script created when executing an MPI parallel job is shown below.
Please specify the MPI environment according to the MPI library you use, as shown below.
For OpenMPI, to pass library environment variables to all nodes, you need to use -x LD_LIBRARY_PATH.

Intel MPI

#!/bin/sh
#$ -cwd
## Resource type F: qty 4
#$ -l f_node=4
#$ -l h_rt=1:00:00
#$ -N flatmpi
. /etc/profile.d/modules.sh
module load cuda
module load intel
## Load Intel MPI environment
module load intel-mpi
## 8 processes per node, 32 MPI processes in total
mpiexec.hydra -ppn 8 -n 32 ./a.out

OpenMPI

#!/bin/sh
#$ -cwd
## Resource type F: qty 4
#$ -l f_node=4
#$ -l h_rt=1:00:00
#$ -N flatmpi
. /etc/profile.d/modules.sh
module load cuda
module load intel
## Load Open MPI environment
module load openmpi
## 8 processes per node, 32 MPI processes in total
mpirun -npernode 8 -n 32 -x LD_LIBRARY_PATH ./a.out

SGI MPT

#!/bin/sh
#$ -cwd
## Resource type F: qty 4
#$ -l f_node=4
#$ -l h_rt=1:00:00
#$ -N flatmpi
. /etc/profile.d/modules.sh
module load cuda
module load intel
## Load SGI MPT environment
module load mpt
## 8 processes per node, 32 MPI processes in total
mpiexec_mpt -ppn 8 -n 32 ./a.out

The list of nodes assigned to the submitted job can be found in the file pointed to by $PE_HOSTFILE.

$ echo $PE_HOSTFILE
/var/spool/uge/r6i0n4/active_jobs/4564.1/pe_hostfile
$ cat /var/spool/uge/r6i0n4/active_jobs/4564.1/pe_hostfile
r6i0n4 28 all.q@r6i0n4 <NULL>
r6i3n5 28 all.q@r6i3n5 <NULL>

5.2.3.4. Hybrid parallel

An example of a job script created when executing a process/thread parallel (hybrid) job is shown below.
Please specify the MPI environment according to the MPI library you use, as shown below.
For OpenMPI, to pass library environment variables to all nodes, you need to use -x LD_LIBRARY_PATH.

Intel MPI

#!/bin/sh
#$ -cwd
## Resource type F: qty 4
#$ -l f_node=4
#$ -l h_rt=1:00:00
#$ -N hybrid
. /etc/profile.d/modules.sh
module load cuda
module load intel
module load intel-mpi
## 28 threads per node
export OMP_NUM_THREADS=28
## 1 MPI process per node, 4 MPI processes in total
mpiexec.hydra -ppn 1 -n 4 ./a.out

OpenMPI

#!/bin/sh
#$ -cwd
## Resource type F: qty 4
#$ -l f_node=4
#$ -l h_rt=1:00:00
#$ -N hybrid
. /etc/profile.d/modules.sh
module load cuda
module load intel
module load openmpi
## 28 threads per node
export OMP_NUM_THREADS=28
## 1 MPI process per node, 4 MPI processes in total
mpirun -npernode 1 -n 4 -x LD_LIBRARY_PATH ./a.out

5.2.3.5. Container job

Here is a container job script format:

#!/bin/sh
#$ -cwd
#$ -ac d=[Container image]
#$ -jc [Container resource type]
#$ -t 1-[Number]
#$ -adds l_hard h_rt [Maximum run time]
[Initialize module environment]                                         
[Load the relevant modules needed for the job]                                 
[Your program]

Please note that the method of specifying the resource type and walltime is different from the case of normal use.

The following is an example of a job script created when executing a container job.
The usage of the GPU and the usage of MPI parallel jobs are the same as the normal usage.

#!/bin/sh
#$ -cwd
## set container image SLES12SP2
#$ -ac d=sles12sp2-latest
## Resource type Q
#$ -jc t3_d_q_node.mpi
## number of containers: qty 4
#$ -t 1-4
## maximum run time
#$ -adds l_hard h_rt 0:10:00

. /etc/profile.d/modules.sh
module load cuda
module load intel
module load openmpi
mpirun -npernode 6 -n 24 -hostfile $SGE_JOB_SPOOL_DIR/ompi_hostfile -x LD_LIBRARY_PATH ./a.out

5.2.4. Job submission

A job is queued and executed by passing the job script to the qsub command.
You can submit a job using qsub as follows.

qsub -g [TSUBAME3 group] SCRIPTFILE

Option  Description
-g      Specify the TSUBAME3 group name.
        Please add it as a qsub command-line option, not in the script.

5.2.4.1. Trial run

TSUBAME provides the "trial run" feature, in which users can execute jobs without consuming points, for those who are unsure whether TSUBAME is suitable for their research.
To use this feature, submit jobs without specifying a group via the -g option. In this case, the job is limited to 2 nodes, 10 minutes of running time, and priority -5 (lowest).

Warning

The trial run feature is only for testing whether your program works or not. Do not use it for actual execution or measurement for your research.
It does not mean that you can execute jobs freely without charge whenever the job size meets the limitations written above.

TSUBAME3 has the function of Trial run, that is for checking program operation without consuming points.
In the case of a trial run, the following restrictions apply to the amount of resources.

Maximum number of the specified resource type (*1)  2
Maximum usage time                                  10 min.
Number of concurrent runs                           1
Resource type                                       no limitation

(*1): 1 when using a container job.

For a trial run, it is necessary to run the job without specifying a TSUBAME group.
Note that points are consumed when you submit a job with a TSUBAME group specified.

5.2.5. Job status

The qstat command displays the status of jobs.

$ qstat [option]

The options accepted by qstat are as follows.

Option Description
-r Displays job resource information.
-j [job-ID] Display additional information about the job.

Here is the output of the qstat command.

$ qstat
job-ID  prior  name  user  state  submit/start at  queue  jclass  slots  ja-task-ID
Item             Description
job-ID           Job ID number
prior            Priority of the job
name             Name of the job
user             ID of the user who submitted the job
state            State of the job
                 r:  running
                 qw: waiting in the queue
                 h:  on hold
                 d:  deleting
                 t:  in transition, e.g. during job start
                 s:  suspended
                 S:  suspended by the queue
                 T:  has reached the limit of the tail
                 E:  error
submit/start at  Submit or start date and time of the job
queue            Queue name
jclass           Job class name
slots            Number of slots the job is using
ja-task-ID       Array job task ID

5.2.6. Job delete

To delete your job, use the qdel command.

$ qdel [job-ID]

Here is an example of deleting a job with the qdel command.

$ qstat
job-ID  prior    name       user      state  submit/start at      queue         jclass     slots  ja-task-ID
------------------------------------------------------------------------------------------------------------
    307 0.55500  sample.sh  testuser  r      02/12/2015 17:48:10  all.q@r8i6n1  A.default  32

$ qdel 307
testuser has registered the job 307 for deletion

$ qstat
job-ID  prior  name  user  state  submit/start at  queue  jclass  slots  ja-task-ID
----------------------------------------------------------------------------------

5.2.7. Job results

The standard output is stored in the file "SCRIPTFILE.o[job-ID]" in the job execution directory.
The standard error output is "SCRIPTFILE.e[job-ID]".

5.2.8. Array Job

An array job is a mechanism for repeatedly executing the operations contained in a job script, parameterized per task.

Info

Because each task in an array job is scheduled as a separate job, there is a schedule latency proportional to the number of tasks.
If each task is short or the number of tasks is large, it is strongly recommended to reduce the number of tasks by combining multiple tasks into one.
Example: Combine 10000 tasks into 100 tasks, each processing 100 tasks.

Each job executed in the array job is called a task and managed by the task ID.

#  In job script
#$ -t 2-10:2

In the above example (2-10:2), the start number 2, end number 10, and step size 2 (every second index) are specified, giving five tasks: 2, 4, 6, 8, 10.
The task number of each task is set in the environment variable $SGE_TASK_ID.
By using this environment variable in the job script, you will be able to do parameter studies.
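For example, a minimal sketch of such a parameter study; the input file naming scheme is a hypothetical illustration:

```shell
#!/bin/sh
# Each array task picks its own input file based on its task number.
# Inside a real array job the scheduler sets SGE_TASK_ID; default it here
# so the fragment can also be run standalone.
SGE_TASK_ID=${SGE_TASK_ID:-2}
INPUT="input_${SGE_TASK_ID}.dat"   # hypothetical naming scheme
echo "task ${SGE_TASK_ID} processes ${INPUT}"
```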
The standard output is stored in the file "SCRIPTFILE.o[job-ID].[task-ID]" in the job execution directory.
The standard error output is "SCRIPTFILE.e[job-ID].[task-ID]".
If you want to delete a specific task, use the qdel -t option as follows.

$ qdel [job-ID] -t [task-id]
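The task-combining advice in the note above can be sketched as follows; the chunk size and input file naming are illustrative assumptions. With -t 1-100, each of the 100 tasks handles 100 consecutive work items:

```shell
#!/bin/sh
# Combine 10000 work items into 100 array tasks of CHUNK items each.
SGE_TASK_ID=${SGE_TASK_ID:-1}   # set by the scheduler in a real array job
CHUNK=100
START=$(( (SGE_TASK_ID - 1) * CHUNK + 1 ))
END=$(( SGE_TASK_ID * CHUNK ))
i=$START
while [ "$i" -le "$END" ]; do
    : # process work item $i here, e.g. ./a.out "input_${i}.dat"
    i=$(( i + 1 ))
done
echo "task ${SGE_TASK_ID}: items ${START}-${END}"
```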

5.3. Reserve compute nodes

It is possible to execute jobs exceeding 24 hours and/or 72 nodes by reserving compute nodes.

  • Make a reservation from TSUBAME portal
  • Check reservation status, cancel a reservation from TSUBAME portal
  • Submit a job using qsub for reserved node
  • Cancel a job using qdel
  • Check job result
  • Check the reservation status and AR ID from the command line

Please refer to TSUBAME Portal User's Guide "Reserving compute nodes" on reservation from the portal, confirmation of reservation status and cancellation of the reservation.

When reservation time is reached, you will be able to execute jobs with the reservation group account.
The following example shows job submission with an AR ID that is a reservation ID.
(note) Resource types available are f_node, h_node and q_node. q_core, s_core, s_gpu cannot be used.

  • with qsub
$ qsub -g [TSUBAME3 group] -ar [AR ID]  SCRIPTFILE
  • with qrsh
qrsh -g [TSUBAME3 group] -l [resource type]=[number of resources] -l h_rt=[time limit] -ar [AR ID]

After submitting the job, you can check the status of the job with the qstat command, and delete the job with the qdel command.
The format of the job script is the same as that of the non-reserved job.

t3-user-info compute ar can be used to check the reservation status and AR ID from the command line.

xxxxx@login0:~> t3-user-info compute ar
ar_id   uid user_name         gid group_name                state     start_date           end_date        time_hour node_count      point return_point
-------------------------------------------------------------------------------------------------------------------------------------------------------
 1320  2005 A2901247         2015 tga-red000                  r   2018-01-29 12:00:00 2018-01-29 13:00:00          1          1      18000            0
 1321  2005 A2901247         2015 tga-red000                  r   2018-01-29 13:00:00 2018-01-29 14:00:00          1          1      18000            0
 1322  2005 A2901247         2015 tga-red000                  w   2018-01-29 14:00:00 2018-02-02 14:00:00         96          1    1728000      1728000
 1323  2005 A2901247         2015 tga-red000                  r   2018-01-29 14:00:00 2018-02-02 14:00:00         96          1    1728000      1728000
 1324  2005 A2901247         2015 tga-red000                  r   2018-01-29 15:00:00 2018-01-29 16:00:00          1         17     306000            0
 1341  2005 A2901247         2015 tga-red000                  w   2018-02-25 12:00:00 2018-02-25 13:00:00          1         18     162000       162000
 3112  2004 A2901239         2349 tgz-training                r   2018-04-24 12:00:00 2018-04-24 18:00:00          6         20     540000            0
 3113  2004 A2901239         2349 tgz-training                r   2018-04-25 12:00:00 2018-04-25 18:00:00          6         20     540000            0
 3116  2005 A2901247         2015 tga-red000                  r   2018-04-18 17:00:00 2018-04-25 16:00:00        167          1    3006000            0
 3122  2005 A2901247         2014 tga-blue000                 r   2018-04-25 08:00:00 2018-05-02 08:00:00        168          5   15120000            0
 3123  2005 A2901247         2014 tga-blue000                 r   2018-05-02 08:00:00 2018-05-09 08:00:00        168          5    3780000            0
 3301  2005 A2901247         2015 tga-red000                  r   2018-08-30 14:00:00 2018-08-31 18:00:00         28          1     504000            0
 3302  2005 A2901247         2009 tga-green000                r   2018-08-30 14:00:00 2018-08-31 18:00:00         28          1     504000            0
 3304  2005 A2901247         2014 tga-blue000                 r   2018-09-03 10:00:00 2018-09-04 10:00:00         24          1     432000            0
 3470  2005 A2901247         2014 tga-blue000                 w   2018-11-11 22:00:00 2018-11-11 23:00:00          1          1       4500         4500
 4148  2004 A2901239         2007 tga-hpe_group00             w   2019-04-12 17:00:00 2019-04-12 18:00:00          1          1       4500         4500
 4149  2005 A2901247         2015 tga-red000                  w   2019-04-12 17:00:00 2019-04-13 17:00:00         24          1     108000       108000
 4150  2004 A2901239         2007 tga-hpe_group00             w   2019-04-12 17:00:00 2019-04-12 18:00:00          1          1       4500         4500
-------------------------------------------------------------------------------------------------------------------------------------------------------
total :                                                                                                          818         97   28507500      3739500

To check the availability of the current month's reservations from the command line, use t3-user-info compute ars.

5.4. Interactive job

To execute an interactive job, use the qrsh command, and specify the resource type and running time.
After job submission with qrsh, when the job is dispatched, the command prompt will be returned.
The usage of the interactive job is as follows.

$ qrsh -g [TSUBAME3 group] -l [resource type name]=[numbers] -l h_rt=[max running time]
Directory: /home/N/username
(Job start time)
username@rXiXnX:~> [Commands to run]
username@rXiXnX:~> exit

If no group is specified, the job is treated as a trial run.
In a trial run, the number of resource units is limited to 2, the execution time is limited to 10 minutes, and the priority is fixed at -5.
The following example is for resource type F, 1 node, and a maximum run time of 10 minutes.

$ qrsh -g [TSUBAME3 group] -l f_node=1 -l h_rt=0:10:00
Directory: /home/N/username
(Job start time)
username@rXiXnX:~> [Commands to run]
username@rXiXnX:~> exit

To exit the interactive job, type exit at the prompt.

The following shows how to use containers in interactive jobs.
Specifying multiple containers is not permitted in interactive jobs.

$ qrsh -g [TSUBAME3 group] -jc [container resource type] -adds l_hard h_rt [max running time] -ac d=[image name]

The following is an example specifying one container of resource type Q and a maximum run time of 10 minutes.

$  qrsh -g tga-hpe_group00 -jc t3_d_q_node -adds l_hard h_rt 0:10:00 -ac d=sles12sp2-latest
Directory: /home/9/hpe_user009
Mon Jun  4 13:35:47 JST 2018

5.4.1. X forwarding

  1. Enable X forwarding and connect to login node with ssh.
  2. Execute qrsh command with X11 forwarding like the following example.

Info

With the scheduler update implemented in April 2020, you no longer need to specify -pty yes -display "$DISPLAY" -v TERM /bin/bash when executing qrsh.

In the following example, one unit of resource type s_core and 2 hours of execution time are specified.

# Execution of qrsh command
$ qrsh -g [TSUBAME3 group] -l s_core=1 -l h_rt=2:00:00
username@rXiXnX:~> module load [Application module to load]
username@rXiXnX:~> [Command to run X11 application]
username@rXiXnX:~> exit

The following is an example of the interactive job with t3_d_s_core container resource type and X forwarding.

$ qrsh -g [TSUBAME3 group] -jc t3_d_s_core -adds l_hard h_rt 0:10:00 -ac d=sles12sp2-latest

5.4.2. Connection to the network applications

If your application requires web browser manipulation on an interactive job in a container, SSH port forwarding makes it possible to use the web browser on your PC.

(1) Obtain the hostname of the interactive node allocated by qrsh

$ qrsh -g tga-hpe_group00 -jc t3_d_q_node -adds l_hard h_rt 0:10:00 -ac d=sles12sp2-latest
$ hostname
r7i7n7-cnode00
$ [Execute the program that requires a web browser]

After launching the interactive job with qrsh, obtain the hostname of the machine.
r7i7n7-cnode00 is the hostname in the above example.
The console operation is now finished, but please keep the job session alive until the end of your application work.

(2) From a terminal on your local PC (not the login node nor the interactive job), connect to the login node with SSH port forwarding enabled.

ssh -l username -L 8888:r7i7n7-cnode00:<network port of the application to connect from your PC> login.t3.gsic.titech.ac.jp

The network port to connect to differs by application. For details, please refer to the manual of each application, or check the startup message of the application. With the example above, the application can then be reached at localhost:8888 from the web browser on your PC.

Tips

Depending on the console software used to SSH to TSUBAME3, the SSH port forwarding setup procedure may differ. Please refer to the manual of each SSH console, or to the FAQ.

5.4.3. Interactive queue

The interactive queue makes immediate execution of visualization and interactive jobs easier, even when TSUBAME is too crowded to allocate nodes for normal jobs, by sharing the same resources among multiple users.

Info

Only campus users (tgz-edu) can submit jobs to this queue.

The following is how to submit the jobs to the interactive queue.

qrsh -q interactive -l h_rt=<time>

Please note that CPU/GPU overcommit is allowed in the interactive queue.
For the resource limits of the interactive queue, please refer to here.

5.5. SSH login to the compute node

You can log in via ssh directly to the compute nodes allocated to your job with resource type f_node.
You can check the allocated nodes with the following command.

t3-test00@login0:~> qstat -j 1463
==============================================================
job_number:                 1463
jclass:                     NONE
exec_file:                  job_scripts/1463
submission_time:            07/29/2017 14:15:26.580
owner:                      t3-test00
uid:                        1804
group:                      tsubame-users0
gid:                        1800
supplementary group:        tsubame-users0, t3-test-group00
sge_o_home:                 /home/4/t3-test00
sge_o_log_name:             t3-test00
sge_o_path:            /apps/t3/sles12sp2/uge/latest/bin/lx-amd64:/apps/t3/sles12sp2/uge/latest/bin/lx-amd64:/home/4/t3-test00/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/4/t3-test00/koshino
sge_o_host:                 login0
account:                    2 0 0 0 0 0 600 0 0 1804 1800
cwd:                        /home/4/t3-test00
hard resource_list:         h_rt=600,f_node=1,gpu=4
mail_list:                  t3-test00@login0
notify:                     FALSE
job_name:                   flatmpi
priority:                   0
jobshare:                   0
env_list:                   RGST_PARAM_01=0,RGST_PARAM_02=1804,RGST_PARAM_03=1800,RGST_PARAM_04=2,RGST_PARAM_05=0,RGST_PARAM_06=0,RGST_PARAM_07=0,RGST_PARAM_08=0,RGST_PARAM_09=0,RGST_PARAM_10=600,RGST_PARAM_11=0
script_file:                flatmpi.sh
parallel environment:  mpi_f_node range: 56
department:                 defaultdepartment
binding:                    NONE
mbind:                      NONE
submit_cmd:                 qsub flatmpi.sh
start_time            1:    07/29/2017 14:15:26.684
job_state             1:    r
exec_host_list        1:    r8i6n3:28, r8i6n4:28        <--  Available nodes: r8i6n3, r8i6n4
granted_req.          1:    f_node=1, gpu=4
usage                 1:    wallclock=00:00:00, cpu=00:00:00, mem=0.00000 GBs, io=0.00000 GB, iow=0.000 s, ioops=0, vmem=N/A, maxvmem=N/A
binding               1:    r8i6n3=0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1,10:1,11:1,12:1,13, r8i6n4=0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1,10:1,11:1,12:1,13
resource map          1:    f_node=r8i6n3=(0), f_node=r8i6n4=(0), gpu=r8i6n3=(0 1 2 3), gpu=r8i6n4=(0 1 2 3)
scheduling info:            (Collecting of scheduler job information is turned off)

You can log in using ssh directly to the containers allocated to your job.
You can check the available containers with the following command.

hpe_user009@nfs1:~> qstat -j 476
==============================================================
job_number:                 476
jclass:                     t3_d_s_gpu.mpi
exec_file:                  job_scripts/476
submission_time:            06/04/2018 13:41:36.715
owner:                      hpe_user009
uid:                        2779
group:                      tga-hpe_group00
gid:                        2007
supplementary group:        tsubame-users, tgz-edu, tga-hpe_group00, tga-hpe-2017081600
sge_o_home:                 /home/9/hpe_user009
sge_o_log_name:             hpe_user009
sge_o_path:                 /apps/t3/sles12sp2/uge/latest/bin/lx-amd64:/home/9/hpe_user009/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/9/hpe_user009/koshino
sge_o_host:                 nfs1
account:                    0 0 0 1 0 0 600 0 0 2779 2007 1,2,3,4 0
cwd:                        /home/9/hpe_user009/koshino
merge:                      y
hard resource_list:         hostipv4=1,docker=true,s_gpu=1,h_rt=600
soft resource_list:         docker_images=suse/sles12sp2:latest
mail_list:                  hpe_user009@nfs1
notify:                     FALSE
job_name:                   flatmpi
priority:                   -5
hard_queue_list:            docker.q
env_list:                   SGE_ARRAY_MPI=true,RGST_PARAM_01=0,RGST_PARAM_02=2779,RGST_PARAM_03=2007,RGST_PARAM_04=0,RGST_PARAM_05=0,RGST_PARAM_06=0,RGST_PARAM_07=1,RGST_PARAM_0
script_file:                mpi.sh
parallel environment:       mpi_f_node range: 2
pe allocation rule:         2
department:                 defaultdepartment
job-array tasks:            1-4:1
task_concurrency:           all
docker_run_options:         --hostname=${hostipv4(0)},-v /home:/home,-v /scr:/scr,-v /dev/shm:/dev/shm,-v /etc/hosts:/etc/hosts,-v /var/lib/sss/pipes:/var/lib/sss/pipes,-v /apps
binding:                    NONE
mbind:                      NONE
submit_cmd:                 qsub -A tga-hpe_group00 mpi.sh
start_time            1:    06/04/2018 13:41:36.766
start_time            2:    06/04/2018 13:41:36.772
start_time            3:    06/04/2018 13:41:36.775
start_time            4:    06/04/2018 13:41:36.780
job_state             1:    r
job_state             2:    r
job_state             3:    r
job_state             4:    r
exec_host_list        1:    r7i7n7:2
exec_host_list        2:    r7i7n7:2
exec_host_list        3:    r7i7n7:2
exec_host_list        4:    r7i7n7:2
granted_req.          1:    hostipv4=1, s_gpu=1
granted_req.          2:    hostipv4=1, s_gpu=1
granted_req.          3:    hostipv4=1, s_gpu=1
granted_req.          4:    hostipv4=1, s_gpu=1
usage                 1:    wallclock=00:00:00, cpu=00:00:00, mem=0.00000 GBs, io=0.00000 GB, iow=0.000 s, ioops=0, vmem=N/A, maxvmem=N/A
usage                 2:    wallclock=00:00:00, cpu=00:00:00, mem=0.00000 GBs, io=0.00000 GB, iow=0.000 s, ioops=0, vmem=N/A, maxvmem=N/A
usage                 3:    wallclock=00:00:00, cpu=00:00:00, mem=0.00000 GBs, io=0.00000 GB, iow=0.000 s, ioops=0, vmem=N/A, maxvmem=N/A
usage                 4:    wallclock=00:00:00, cpu=00:00:00, mem=0.00000 GBs, io=0.00000 GB, iow=0.000 s, ioops=0, vmem=N/A, maxvmem=N/A
binding               1:    r7i7n7=0,0:0,1
binding               2:    r7i7n7=0,7:0,8
binding               3:    r7i7n7=1,0:1,1
binding               4:    r7i7n7=1,7:1,8
resource map          1:    hostipv4=r7i7n7=(r7i7n7-cnode00), s_gpu=r7i7n7=(0)
resource map          2:    hostipv4=r7i7n7=(r7i7n7-cnode01), s_gpu=r7i7n7=(1)
resource map          3:    hostipv4=r7i7n7=(r7i7n7-cnode02), s_gpu=r7i7n7=(2)
resource map          4:    hostipv4=r7i7n7=(r7i7n7-cnode03), s_gpu=r7i7n7=(3)  
^ Available containers:  r7i7n7-cnode00, r7i7n7-cnode01, r7i7n7-cnode02, r7i7n7-cnode03
scheduling info:            (Collecting of scheduler job information is turned off)

Info

When connecting to a compute node via ssh, the default GID of the processes after ssh is tsubame-users (2000), so the processes of your running job are neither visible nor attachable by debuggers such as gdb, except in trial execution cases.
To make them visible, run one of the following with the group name of the executed job after ssh.

newgrp <group name>
or
sg <group name>

5.6. Storage use on Compute Nodes

5.6.1. Local scratch area

Each node has an SSD as local scratch disk space, available to your job as $TMPDIR and $T3TMPDIR.
The local scratch area is individual to each compute node and is not shared.
To use it, stage files in and out between the job script and the local disk.
The following example copies data on one node; it does not handle multiple nodes.
Since $TMPDIR is deleted after each MPI run finishes, use $T3TMPDIR when running multiple MPI executions in one job.

#!/bin/sh
# copy input files
cp -rp $HOME/datasets $TMPDIR/
# execution
./a.out $TMPDIR/datasets $TMPDIR/results
# copy output files
cp -rp $TMPDIR/results $HOME/results

5.6.2. Shared scratch area

Only for batch jobs using resource type f_node, you can use BeeGFS On Demand (BeeOND), which creates an on-demand shared file system from the SSDs of the reserved compute nodes. To enable BeeOND, specify f_node in the job script and additionally specify "#$ -v USE_BEEOND=1".
You can then use it by referring to /beeond on the compute nodes. Here is a sample job script.

#!/bin/sh
#$ -cwd
#$ -l f_node=4
#$ -l h_rt=1:00:00
#$ -N flatmpi
#$ -v USE_BEEOND=1
. /etc/profile.d/modules.sh
module load cuda
module load intel
module load intel-mpi
mpiexec.hydra -ppn 8 -n 32 ./a.out

When using an interactive job, it can be used as follows. It takes a little time to mount the disk as compared with not using it.

$ qrsh -g [TSUBAME3 group] -l f_node=2 -l h_rt=0:10:00 -pty yes -v TERM -v USE_BEEOND=1 /bin/bash

The BeeOND shared scratch area is created at the timing secured by the job. You need to stage in and out from within the job script to /beeond.

#!/bin/sh
# copy input files
cp -rp $HOME/datasets /beeond/
# execution
./a.out /beeond/datasets /beeond/results
# copy output files
cp -rp /beeond/results $HOME/results