Please read all of the following before using a GridEngine cluster.
You can get an overview of the cluster by typing qhost, and you can also see the status of the GPUs with qhost -F gpu. You can view current jobs in the queue by typing qstat -F gpu.
If you’re going to use GPUs a lot, you might as well alias the long versions of the above commands (which include -F gpu) to the short versions, in your .bashrc:
echo "alias qstat='qstat -F gpu'" >> ~/.bashrc
echo "alias qhost='qhost -F gpu'" >> ~/.bashrc
source ~/.bashrc
The normal way to submit jobs to the cluster is using the qsub command, for example qsub myscript.sh. The many options to the qsub command are described in the manpage, man qsub.
Any command-line argument for qsub can alternatively appear inside the shell script:
For example, either:
qsub -cwd -e /dev/null myscript.sh
Or:
qsub myscript.sh
with myscript.sh including the lines:
#$ -cwd
#$ -e /dev/null
The basic system only knows about GPU utilization based on what people request within GridEngine. So if people run GPU jobs on their own (outside of GridEngine), then the system is not aware of them.
So, to request GPUs with qsub, add the following flag to qsub at the command line:
-l gpu=1
Or within your qsub script:
#$ -l gpu=1
This requests one GPU to be used. GridEngine does not enable your program to use a GPU. Rather, it just keeps track of how many GPUs are being used in the cluster.
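Since GridEngine only counts GPUs and does not assign your job a particular device, your script has to pick one itself. Below is a minimal sketch of one way to do that; the pick_free_gpu helper is hypothetical (not part of GridEngine or CUDA), and it simply selects the device with the least memory currently in use:

```shell
#!/bin/sh
# Hypothetical helper: reads "index, memory.used" CSV lines on stdin and
# prints the index of the GPU with the least memory in use.
pick_free_gpu() {
    sort -t, -k2 -n | head -n 1 | cut -d, -f1
}

# In a real job script you would feed it nvidia-smi output, e.g.:
#   gpu=$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | pick_free_gpu)
#   export CUDA_VISIBLE_DEVICES=$gpu
# Demonstrated here with canned output so the sketch runs anywhere:
gpu=$(printf '0, 11000\n1, 15\n2, 9000\n' | pick_free_gpu)
echo "Would use GPU $gpu"
```

Note that this is only a convention: nothing stops two jobs from picking the same device, so it works best when everyone also requests GPUs through GridEngine as described above.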
Don't forget to export all relevant environment variables for CUDA, like PATH and LD_LIBRARY_PATH.
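For example, a job script might set them like this; /usr/local/cuda is an assumed install prefix, so adjust it to wherever CUDA actually lives on your cluster:

```shell
# Assumed CUDA install prefix; change to match your cluster.
CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
```

Alternatively, if you submit with #$ -V (as in the example script below), the job inherits your whole environment and you may not need to set these explicitly.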
You can see which network filesystems are already mounted on a given machine by typing:
df -h | grep :
To enable passwordless SSH between the servers, first generate a key:
ssh-keygen -b 8192 -t rsa
Use an empty passphrase, and accept the other defaults.
Then copy your public key to each server:
ssh-copy-id yourusername@123.456.789.012
Now you don’t need to type a password to log in to these servers.
Next, choose a location for a shared SGE output directory, for example /hd4 on server 123.456.789.012. You can use df -h to find out disk usage on a given computer. Then create a directory there:
mkdir -p /hd4/myusername/sge/
and symlink it from your home directory:
cd
ln -s /hd4/myusername/sge/ sge
Remember that files on these servers are not backed-up at all.
Each server then needs to mount this shared directory over sshfs. Create a script like the following, mount-sge.sh, modifying it as necessary. It’s better to use IP addresses than hostnames:
user=myusername
hosts='123.456.789.011 123.456.789.013 123.456.789.014'
mount_src="${user}@123.456.789.012:/hd4/${user}/sge/"
mount_tgt="~/sge/"
for host in $hosts; do
ssh ${user}@$host sshfs -o allow_root $mount_src $mount_tgt && \
echo "Mounted $mount_tgt on $host" || \
echo "Mountpoint $mount_tgt on $host is probably already mounted."
done
Don’t forget to set executable permissions: chmod u+x mount-sge.sh. Running this script will mount all the necessary mount points, so that the output of all jobs can get sent to a common directory.
If a server gets rebooted, just rerun this script.
In your qsub script, add the following line to set your working directory to ~/sge:
#$ -wd $HOME/sge/
You can create subdirectories in this path, and modify the above line accordingly.
You can alternatively just set the output & error path to ~/sge
:
#$ -o $HOME/sge/
#$ -e $HOME/sge/
If you don’t want any error output, set it to /dev/null
:
#$ -e /dev/null
Below is an example script. For arguments that you always want (like email notifications), you can put those in ~/.sge_request, omitting the #$.
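For example, a ~/.sge_request along these lines (the email address is a placeholder) applies those defaults to every job you submit:

```
-m ea
-M foo@example.com
```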
#!/bin/bash
## Inherit all environment variables
#$ -V
## Start in current working directory
#$ -cwd
## Stdout to the following dir
#$ -o $HOME/sge/
## Stderr to the following dir
#$ -e $HOME/sge/
## Specify job name
#$ -N test-3hr
## When an email will be sent:
## 'e'=end of job
## 'a'=if job is aborted
#$ -m ea
## Where to email info
#$ -M foo@example.com
## Which resources to use
#$ -l gpu=1
## Resident-memory (-m) and virtual-memory (-v) limits in kilobytes; -v is a bashism
ulimit -m 8000000
ulimit -v 10000000
echo "Hello"
echo "The date is " `date`
echo "hostname is " `hostname`