DIAS
Overview
The DIAS cluster offers access to a small number of powerful CPU and GPU systems. It is available to researchers in Physics and Astronomy who need more powerful resources than their own desktop or laptop computers, but who do not need low-latency networking for multi-node parallel computation (for which the Hypatia cluster may be more suitable) or the larger clusters provided by ARC or national facilities. It is also available to masters (MSc and MSci) students undertaking research projects, where requested by the student's supervisor.
Account creation and support
To request an account on DIAS, please e-mail phy.dias.support@ucl.ac.uk with the following information:
- your name;
- your UCL computing ID (e.g. ucapxxx);
- your research group affiliation (Astro, AMOPP, BioP, CMMP, HEP).
For masters students, the request must be made by the project supervisor or course leader, and should also include:
- supervisor's name;
- supervisor's UCL computing ID (e.g. ucapxxx);
- supervisor's UCL e-mail address.
Masters students should also contact their supervisor in the first instance with any queries or problems relating to the DIAS cluster. The supervisor can decide if a problem requires system administrator support.
Access
Access to DIAS is by SSH to dias.hpc.phys.ucl.ac.uk, e.g. ssh username@dias.hpc.phys.ucl.ac.uk. DIAS access requires that you be within the UCL network.
If you are not on site at UCL you can still access the cluster by using UCL's remote access services, such as the UCL VPN provided by ISD.
If you encounter any issues or need assistance with these services you will need to contact the ISD Service Desk.
To connect over SSH from a Windows machine we recommend installing OpenSSH, which is the easiest solution as it is also a native application on Unix systems. For running graphical applications remotely you will additionally need an X server application on your Windows machine.
Alternatively you can use an X server such as Exceed, which can be downloaded from the UCL software database: http://swdb.ucl.ac.uk/package/view/id/150, together with an SSH client that supports X11 forwarding (e.g. PuTTY: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).
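From a Linux or Mac terminal (or from Windows OpenSSH with an X server running), X11 forwarding can usually be enabled with the -X option, for example:
ssh -X username@dias.hpc.phys.ucl.ac.uk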
This video tutorial may be useful for Desktop@UCL users: Connecting to a Linux system from Desktop@UCL.
Using the cluster
The DIAS cluster runs a variant of the CentOS 7 Linux operating system. When working on it remotely you will mainly use a terminal interface running the 'bash' shell.
If you are new to Linux or Research computing you can find some helpful online courses and resources provided by UCL ARC in the ARC Course Catalogue.
If you are looking for a more casual approach to learning how to use a Linux system you can find many websites with useful commands, examples and tutorials.
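As a quick illustration, a typical terminal session uses simple bash commands to move around and inspect files, for example (the directory and file names here are hypothetical):
pwd                  # show the current directory
ls -l                # list the files in it
cd myproject         # change into a directory called myproject
less results.txt     # page through a text file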
Please note that data on DIAS is not backed up: you should keep code in external source control and keep copies of important data elsewhere.
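For example, you could copy results from DIAS back to your own machine with rsync (the paths shown are illustrative):
rsync -av username@dias.hpc.phys.ucl.ac.uk:results/ ./results/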
DIAS operates in the style of an HPC cluster. HPC clusters differ somewhat from conventional computing in how they are used: large or processor-intensive jobs must be submitted to a job manager so that appropriate resources are allocated and do not conflict with other jobs.
The node you log in to is the login node. In the case of DIAS this is also the head node which manages the cluster. It is important for other users and the cluster as a whole that high-memory or processor-intensive jobs are never run on this node; if such jobs are found to be running they will be terminated without warning. You are, however, encouraged to run normal editing and development work on the login node, so you can edit files, download data, check your work into repositories, move files around and so on without a problem.
When you come to run intensive jobs, you must request a job from the job manager, which in the case of DIAS is software called Slurm (Simple Linux Utility for Resource Management). Slurm will allocate your job to one of the two CPU nodes or one of the GPU nodes. If DIAS is busy there may be a wait for resources, but Slurm tries to allocate resources fairly so that users get an appropriate share of the time.
Resources are divided into partitions which contain different types of hardware. For DIAS these are the normal CPU nodes in the COMPUTE partition, one GPU node in the GPU partition, and one GPU node in the LIGHTGPU partition.
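You can see these partitions and the current state of their nodes with Slurm's sinfo command, for example:
sinfo
sinfo -p GPU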
Submitting a job to Slurm
To submit a job you must write a bash script that has the commands needed to run your code, and also includes directives to Slurm for what resources are needed. This script is then submitted using the sbatch command. The directives to Slurm are special comment lines beginning with #SBATCH.
In the example below these are interleaved with normal explanatory comments. This can be difficult to understand, so don't hesitate to ask for help if needed.
An example job script might look like
#!/bin/bash
#submit to the normal COMPUTE partition for normal CPUs
#SBATCH -p COMPUTE
#request a different amount of time to the default 12h
#SBATCH --time 24:00:00
#requesting one node
#SBATCH -N1
#requesting 12 CPUs
#SBATCH -n12
#SBATCH --mail-user=youremail@ucl.ac.uk
#SBATCH --mail-type=ALL
cd /home/username/programlocation
srun ./program
The job is submitted with 'sbatch scriptname' and you will be given a job number by Slurm. The srun command is Slurm's job launcher: for MPI code it starts your program as multiple parallel tasks on the allocated processors. There are different technologies for different kinds of multi-processor use, so if your code uses multiple processors you may want to contact technical support to make sure you are doing so correctly. Your supervisor may be able to advise on whether the code uses, for example, OpenMP, MPI, or some other method, and providing this information to us will help us get your code running well.
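For example, assuming the script above was saved as myjob.sh (an illustrative name), you would submit it and then check its progress with:
sbatch myjob.sh
squeue -u $USER
squeue lists your queued and running jobs along with their job numbers.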
GPU requests
A request for a GPU node must also request one or more of its NVIDIA A100 cards. This is done with something like
#!/bin/bash
#SBATCH -p GPU
#SBATCH --gres=gpu:a100:1
Users of the LIGHTGPU partition may need to change the first line of their script and source the GPU variables file, as follows:
#!/bin/bash -l
source /etc/slurm/gpu_variables.sh
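A fuller GPU job script might look something like the following; this is only a sketch, and the program name, task count and time are illustrative:
#!/bin/bash
#submit to the GPU partition and request one A100 card
#SBATCH -p GPU
#SBATCH --gres=gpu:a100:1
#requesting one node and four tasks
#SBATCH -N1
#SBATCH -n4
#SBATCH --time 12:00:00
cd /home/username/programlocation
srun ./gpu_program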
Interactive jobs
A simple interactive job can be run by using srun in a different mode to call a shell. srun will make the necessary resource request, and you do not need to use the sbatch command in this case.
srun --spankx11 --pty bash
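You can pass the same kinds of resource options to srun that you would put in #SBATCH lines. For example, an interactive session on the COMPUTE partition with one task, four CPU cores and a two-hour limit (values are illustrative) could be requested with:
srun -p COMPUTE -n 1 -c 4 --time 02:00:00 --pty bash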
Other resources
As well as how many CPUs or GPUs to request, you may need to request a certain amount of memory, otherwise you will be limited to 2GB per CPU requested. It is important to be accurate about how much memory you need, but if in doubt it is better to overestimate. If you ask for too much, the worst that will happen is that other users may have to wait a bit longer for their code to run; if you ask for too little, your program will fail, you will have to resubmit your job, and other users may have to wait for your code to rerun anyway! These resources are requested with lines like the following (memory per CPU, total memory per node, and run time, respectively):
#SBATCH --mem-per-cpu 4G
#SBATCH --mem 4G
#SBATCH --time 12:00:00
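To judge how much memory a finished job actually used (and so refine future requests), you can query Slurm's accounting records with sacct, if accounting is enabled on DIAS; the job number here is illustrative:
sacct -j 12345 --format=JobID,Elapsed,MaxRSS,State
MaxRSS shows the peak memory used by the job's tasks.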
Python
You may wish to use Miniforge to create Python environments.
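A minimal sketch, assuming Miniforge is installed in your home directory and you want a hypothetical environment called myenv with a couple of packages:
#run 'conda init bash' once after installing Miniforge, then log out and back in
conda create -n myenv python numpy
conda activate myenv
You would then activate this environment inside your job scripts before running your Python code.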
Jupyter
You can run Jupyter notebooks using a version of the script found at /share/apps/anaconda/jupyter-slurm.sh. You should make a copy, look through it, and alter the parameters at the start as you would for any Slurm script, e.g. changing the partition to use a GPU. Also note the activation of Anaconda towards the end; you may want to activate your own environment here instead. The script will start Jupyter on one of the nodes and give you instructions for connecting your local machine to it (open the jupyter-notebook-xxx.log file created). These instructions assume the use of Linux- or Mac-style command line ssh; if you are using PuTTY, for example, you will need to use the port numbers and the node name in PuTTY's port forwarding configuration.
For example, at the top of your log file you will see something like:
To connect:
ssh -N -L 8350:compute-0-1:8350 eme@dias.hpc.phys.ucl.ac.uk
Important: You should use scancel to end the Jupyter job when you are done with the session, or it will continue to run and block resources.
The job number can be found in the output of squeue, and it is also the number in the appropriate jupyter-notebook-xxx.log filename.
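For example, if squeue (or the log filename jupyter-notebook-12345.log) shows that your Jupyter job has number 12345 (an illustrative value), you would end it with:
scancel 12345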