Please read all of the following before using a GridEngine cluster.
You can get an overview of the cluster by typing qhost, and you can also see the status of the GPUs with qhost -F gpu. You can view current jobs in the queue by typing qstat -F gpu.
If you’re going to use GPUs a lot, you might as well alias the long versions of the above commands (which include -F gpu) to the short versions, in your .bashrc:
echo "alias qstat='qstat -F gpu'" >> ~/.bashrc
echo "alias qhost='qhost -F gpu'" >> ~/.bashrc
source ~/.bashrc
The normal way to submit jobs to the cluster is using the qsub command, for example qsub myscript.sh. The many options to the qsub command are described in the manpage, man qsub.
Any command-line argument for qsub can alternatively appear inside the shell script:
For example, either:
qsub -cwd -e /dev/null myscript.sh
Or:
qsub myscript.sh
with myscript.sh including the lines:
#$ -cwd
#$ -e /dev/null
The basic system only knows about GPU utilization based on what people request within GridEngine. So if people run GPU jobs on their own (outside of GridEngine), then the system is not aware of them.
So, to request GPUs with qsub, add the following flag to qsub at the command line:
-l gpu=1
Or within your qsub script:
#$ -l gpu=1
This requests one GPU to be used. GridEngine does not enable your program to use a GPU. Rather, it just keeps track of how many GPUs are being used in the cluster.
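Since GridEngine only counts GPUs and does not assign your job a particular device, your script has to pick one itself. Below is a minimal sketch of one way to do that; the pick_free_gpu helper is hypothetical (not part of GridEngine or CUDA), and it simply selects the device with the least memory currently in use:

```shell
#!/bin/sh
# Hypothetical helper: reads "index, memory.used" CSV lines on stdin and
# prints the index of the GPU with the least memory in use.
pick_free_gpu() {
    sort -t, -k2 -n | head -n 1 | cut -d, -f1
}

# In a real job script you would feed it nvidia-smi output, e.g.:
#   gpu=$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | pick_free_gpu)
#   export CUDA_VISIBLE_DEVICES=$gpu
# Demonstrated here with canned output so the sketch runs anywhere:
gpu=$(printf '0, 11000\n1, 15\n2, 9000\n' | pick_free_gpu)
echo "Would use GPU $gpu"
```

Note that this is only a convention: nothing stops two jobs from picking the same device, so it works best when everyone also requests GPUs through GridEngine as described above.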
Don't forget to export all relevant environment variables for CUDA, like PATH and LD_LIBRARY_PATH.
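For example, a job script might set them like this; /usr/local/cuda is an assumed install prefix, so adjust it to wherever CUDA actually lives on your cluster:

```shell
# Assumed CUDA install prefix; change to match your cluster.
CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
```

Alternatively, if you submit with #$ -V (as in the example script below), the job inherits your whole environment and you may not need to set these explicitly.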
You can see which network filesystems are already mounted on a given machine by typing:
df -h | grep :
To enable passwordless SSH between the servers, first generate a key:
ssh-keygen -b 8192 -t rsa
Use an empty passphrase, and accept the other defaults.
Then copy your public key to each server:
ssh-copy-id yourusername@123.456.789.012
Now you don’t need to type a password to log in to these servers.
Next, choose a location for a shared SGE output directory, for example /hd4 on server 123.456.789.012. You can use df -h to find out disk usage on a given computer. Then create a directory there:
mkdir -p /hd4/myusername/sge/
and symlink it from your home directory:
cd
ln -s /hd4/myusername/sge/ sge
Remember that files on these servers are not backed-up at all.
Each server then needs to mount this shared directory over sshfs. Create a script like the following, mount-sge.sh, modifying it as necessary. It’s better to use IP addresses than hostnames:
user=myusername
hosts='123.456.789.011 123.456.789.013 123.456.789.014'
mount_src="${user}@123.456.789.012:/hd4/${user}/sge/"
mount_tgt="~/sge/"
for host in $hosts; do
ssh ${user}@$host sshfs -o allow_root $mount_src $mount_tgt && \
echo "Mounted $mount_tgt on $host" || \
echo "Mountpoint $mount_tgt on $host is probably already mounted."
done
Don’t forget to set executable permissions: chmod u+x mount-sge.sh. Running this script will mount all the necessary mount points, so that the output of all jobs can get sent to a common directory.
If a server gets rebooted, just rerun this script.
In your qsub script, add the following line to set your working directory to ~/sge:
#$ -wd $HOME/sge/
You can create subdirectories in this path, and modify the above line accordingly.
You can alternatively just set the output & error path to ~/sge
:
#$ -o $HOME/sge/
#$ -e $HOME/sge/
If you don’t want any error output, set it to /dev/null
:
#$ -e /dev/null
Below is an example script. For arguments that you always want (like email notifications), you can put those in ~/.sge_request, omitting the #$.
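For example, a ~/.sge_request along these lines (the email address is a placeholder) applies those defaults to every job you submit:

```
-m ea
-M foo@example.com
```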
#!/bin/bash
## Inherit all environment variables
#$ -V
## Start in current working directory
#$ -cwd
## Stdout to the following dir
#$ -o $HOME/sge/
## Stderr to the following dir
#$ -e $HOME/sge/
## Specify job name
#$ -N test-3hr
## When an email will be sent:
## 'e'=end of job
## 'a'=if job is aborted
#$ -m ea
## Where to email info
#$ -M foo@example.com
## Which resources to use
#$ -l gpu=1
## Resident-memory (-m) and virtual-memory (-v) limits in kilobytes; -v is a bashism
ulimit -m 8000000
ulimit -v 10000000
echo "Hello"
echo "The date is " `date`
echo "hostname is " `hostname`