|
|
MALTA Cluster
|
|
|
MALTA Overview
|
|
The IBM BladeCenter based MALTA cluster is a 72-node cluster, 71 of which are compute nodes. The remaining node is both the master node that controls the cluster and the front end where users log in.
Each node holds two Intel Xeon four- or six-core processors,
for a total of 720 cores across the whole machine. Each core has a clock rate of at least 2.0 GHz and 2 GB of
RAM.
User home directories as well as the /opt directory are located in a 6 TB storage unit provided by two Ethernet-attached RAID arrays exported via the Network File System (NFS)
over a Gbit network.
Each user can store files either in their private home directory or in the /opt/groupname
directory, which is shared with all the members of the group the user belongs to.
All the nodes in MALTA have local storage (not accessible from any other node) ranging from 150 GB to 600 GB. Files may only reside in
this local space for the lifetime of a job; when a job exits, any files remaining there are purged.
Permanent files should be moved to the user's home directory.
An environment variable, "$SCRATCH", is defined at the beginning of each job,
pointing to the area of scratch space on each node that is allocated to the job.
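For example, inside a job script any results worth keeping must be copied out of $SCRATCH before the job ends (the program and file names below are only placeholders):
cd $SCRATCH                           # node-local scratch area assigned to the job
/home/user/myprogram.exe > result.dat
cp result.dat $HOME/                  # copy permanent results back to the NFS-mounted home directory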
|
|
|
Login to MALTA
|
|
The login process is extremely easy if you're running Linux/UNIX or Mac OS X on your personal
computer, and Windows doesn't make it much harder.
To connect to the MALTA Computing Centre (MCC) you must use the SSH (secure shell)
protocol, which offers both high speed and excellent security. If you want to use remote windowing via
SSH, you must enable X11 forwarding and have an X server running locally.
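For example, from a Linux/UNIX or Mac OS X terminal (the host name below is only a placeholder; use the address provided by the system administrator):
ssh -X user@malta.example.org    # -X enables X11 forwarding for remote graphics
On Windows, an SSH client such as PuTTY, together with a local X server such as Xming if you need graphics, provides the same functionality.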
|
|
|
Desktop Virtualization
|
|
In Linux, everything can be done from a shell. However, if you don't feel comfortable with the shell,
the cluster can also be used as a VNC server.
VNC stands for Virtual Network Computing; it lets you see the desktop of a remote machine
and control it with your local mouse and keyboard, just as if you were sitting in
front of that computer.
To use the VNC server, a VNC client (such as TightVNC, Vinagre, etc.) must be running locally on your machine.
Once you're logged in to MCC you have to start the server (i.e. export your desktop) by using:
[user@malta ~]$ vncserver
New 'malta:1 (user)' desktop is malta:1
Starting applications specified in /home/user/.vnc/xstartup
Log file is /home/user/.vnc/malta:1.log
Note that the first time you invoke the vncserver command you will be asked for a password,
which can be changed at any time with vncpasswd.
Once the server is up and running, you can start a session on your local machine following
the instructions of your VNC client.
To stop the virtualization, just type:
[user@malta ~]$ vncserver -kill :1
Killing Xvnc process ID xxxx
More info: man vncserver
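Since VNC traffic is not encrypted on its own, it is usually tunnelled through SSH. A minimal sketch from your local machine, assuming the desktop was exported as malta:1 (display :1 corresponds to TCP port 5901; the host name is a placeholder):
ssh -L 5901:localhost:5901 user@malta.example.org   # forward the VNC port over SSH
vncviewer localhost:1                               # then point your VNC client at the tunnel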
|
|
|
Operating System
|
|
Like any other operating system (OS), a cluster operating system must provide a user-friendly interface
between the user, the applications and the cluster software.
The operating system run on MALTA is Red Hat Enterprise Linux 5.2 (RHEL).
RHEL is a commercial Linux distribution and, as such, is very similar to any other Unix or Unix-like operating system.
If you are not familiar with this kind of OS, we encourage you to look for some basic information on the net and/or
visit the links below:
1. Red Hat manuals
2. The Linux Documentation Project
|
|
|
Software
|
|
Compilers

• gcc 4.1.2 (gfortran, g77, gcc, g++): default location
• Intel (ifort, icc) v11.081: /opt/intel/Compiler
• Python-1.4.3-24: default location
|
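As a quick illustration, a serial Fortran source file (the file name is a placeholder) can be built with either suite; the Intel compilers live under /opt/intel/Compiler and may require sourcing their environment script first:
gfortran -O2 mycode.f90 -o mycode.x   # GNU compiler, available in the default path
ifort -O2 mycode.f90 -o mycode.x      # Intel compiler v11.081, once its environment is set up
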
Math Libraries

• BLACS: /opt/blacs
• CBLAS (BLAS library for C): /opt/cblas
• FBLAS (BLAS library for Fortran): /opt/fblas
• FFTW: /opt/fftw/3.2.1
• Intel MKL: /opt/intel/mkl/10.1.1.019
• LAPACK: /opt/lapack/3.1.1
• ScaLAPACK: /opt/scalapack/1.8.0
|
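A typical link line against these installations might look like the sketch below; the library file names are an assumption, so check the actual contents of /opt/lapack/3.1.1 and /opt/fblas first:
# assumed library names: liblapack under /opt/lapack/3.1.1, libblas under /opt/fblas
gfortran mycode.f90 -o mycode.x -L/opt/lapack/3.1.1 -llapack -L/opt/fblas -lblas
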
Parallel Libraries

• MPICH: /opt/mpich/1.2.7p1
• MPICH2: /opt/mpich2/1.0.8 and /opt/mpich2/loadleveler
• OpenMPI: /opt/openmpi/1.3.1
|
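For MPI codes, the compiler wrappers shipped with each implementation are the simplest way to build; a minimal sketch using the MPICH2 installation (the source file names are placeholders):
/opt/mpich2/1.0.8/bin/mpicc -O2 mympi.c -o mympi.x      # C source
/opt/mpich2/1.0.8/bin/mpif90 -O2 mympi.f90 -o mympi.x   # Fortran source
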
Programs

• abinit: /opt/abinit/
• critic2: /opt/critic2
• elk: /opt/elk
• Quantum ESPRESSO: /opt/espresso/
• gamess: /opt/gamess
• gibbs2: /opt/gibbs2
• gnuplot: default location
• gromacs: /opt/gromacs
• gulp: /opt/gulp
• octave: default location
• siesta: /opt/siesta/
• VASP: /opt/vasp/
• Wien2K: /opt/wien2k/
|
|
|
How to submit a job: Loadleveler
|
|
LoadLeveler (LL) is a job management system that allows users to run more jobs in less time by matching
each job's processing needs with the available resources.
When a job is submitted to LL, a number of environment variables are created. Some of the most relevant
are listed below (a minimal example follows the list):
• $SCRATCH = local directory where the job runs.
This directory is automatically removed once the job finishes.
• $LL_WORKDIR = working directory where both the .log file and the output file(s) are placed after job completion.
• $LOG = .log file created during execution.
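A minimal sketch of a job that does nothing except record where LoadLeveler placed it, just to show how these variables can be used:
#!/bin/sh
#@ class = sexpress
#@ job_type = serial
#@ environment = COPY_ALL
#@ queue
# write the values of the LoadLeveler variables into the working directory
echo "scratch directory: $SCRATCH"    >  $LL_WORKDIR/where.txt
echo "working directory: $LL_WORKDIR" >> $LL_WORKDIR/where.txt
echo "log file: $LOG"                 >> $LL_WORKDIR/where.txt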
|
Serial Jobs
|
|
There are four serial classes or queues:
1. sexpress: up to 2h cpu time
2. ssmall: up to 24h cpu time
3. smedium: up to 1 week cpu time
4. slarge: up to 3 weeks cpu time
If none of these queues meets your needs, please contact the system administrator,
who will kindly do his best to deal with your request.
Examples of serial jobs:
#@ job_name = test1
## job name
#@ class = sexpress | ssmall | smedium | slarge ## queue
#@ initialdir = /home/user/test
## working directory = $LL_WORKDIR
#@ input = myprog.input
## stdin = a.out < myprog.input
#@ output = myprog.output
## stdout = a.out > myprog.output
#@ error = myprog.error
## stderr = a.out 2> myprog.error
#@ executable = myprogram
## executable file
#@ arguments = arg1 arg2 arg3
## executable arguments
#@ queue
## submit
This type of job must be used only when the executable file doesn't generate
large temporary or output files.
#@ = keyword
# = comment
If the executable keyword is not used, LL assumes that the script is the executable:
#!/bin/sh
#@ step_name = step_1
#@ initialdir = /home/user/test
#@ job_type = serial
#@ class = ssmall
#@ output = $(job_name).$(Process).out
## if no name is specified, an
#@ error = $(job_name).$(Process).err
## automatic one will be generated
#@ environment = COPY_ALL
## copy the environmental variables
#@ job_cpu_limit = 12:00
## 12h cpu time
#@ wall_clock_limit = 20:00
## 20h total time
#@ queue
# Copy all the necessary files from the initial
# directory into the scratch one
cp $LL_WORKDIR/data.1 $SCRATCH/
# Everything is written in $SCRATCH
cd $SCRATCH
/home/user/myprogram.exe < data.1 > output.1
# Copy the output back into the initial directory
cp output.1 $LL_WORKDIR/
#@ dependency = (step_1 == 0)
## only if the previous step results
## in a normal termination
#@ input = output.1
#@ output = $(job_name).$(job_step).$(Process).out
#@ error = $(job_name).$(job_step).$(Process).err
#@ queue
# Copy all the necessary files from the initial
# directory into the scratch one
cp $LL_WORKDIR/output.1 $SCRATCH/
# Everything is written in $SCRATCH
cd $SCRATCH
/home/user/myprogram.exe < output.1 > output.2
# Copy the output back into the initial directory
cp output.2 $LL_WORKDIR/
This job has two dependent steps, where the second one starts only if the first
has finished properly. If the keyword "dependency" is not used, both steps
are processed at the same time.
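Assuming the script above is saved as serial_job.cmd (the file name is just an example), it is submitted and monitored with the LoadLeveler commands described at the end of this page:
llsubmit serial_job.cmd   # submit the two-step job
llq                       # check its state in the queue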
|
|
Parallel jobs
|
|
There are six parallel queues; the p(12) notation means that each class exists in two versions, e.g. psmall and p12small, for the 8- and 12-processor nodes respectively:
1. p(12)small: up to 24h cpu time, 1 node, 8 (12) processors
2. p(12)medium: up to 1 week cpu time, 1 node, 8 (12) processors
3. p(12)large: up to 3 weeks cpu time, 1 node, 8 (12) processors
These classes are not permanent and can be changed at any time depending on the users' needs.
Examples of parallel jobs:
#!/bin/sh
#
#
#@ job_name = sample_mpich
#@ step_name = step1
#@ job_type = mpich
#@ output = test.$(job_name).$(job_step).$(Process).out
#@ error = test.$(job_name).$(job_step).$(Process).err
#@ class = pmedium
#@ environment = COPY_ALL
#@ node = 1
## number of nodes
#@ tasks_per_node = 1,8
## from 1 to 8 processors
#@ queue
# Copy all the necessary files from the initial
# directory into the scratch one
cp -rp $LL_WORKDIR/data $SCRATCH/
# Everything is written in $SCRATCH
cd $SCRATCH/data
/opt/mpich2/1.0.8/bin/mpirun -np $LOADL_TOTAL_TASKS /home/user/prog.exe
# Only useful files are copied back into the initial directory
rm -f $SCRATCH/data/file1
rm -f $SCRATCH/data/file2
rm -f $SCRATCH/data/file3
cp -rp $SCRATCH/data $LL_WORKDIR/
For the time being, the parallel queues are configured to use up to eight processors
within the same node. We are working hard to overcome this limitation and hope it will be sorted out soon.
In this example we ask for one node and a number of processors that ranges from 1 to 8.
This is a good way to minimize the time a job spends in the queue waiting for a free node. The variable
$LOADL_TOTAL_TASKS holds the total number of tasks that have been allocated to the job. Please
note that this variable is not available when the job type is set to parallel (i.e. job_type = parallel).
#@ class = plarge
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 6
#@ initialdir = /home/user/myprogs
#@ executable = myopenmpcode
#@ input = inputfile1
#@ output = $(job_name).$(job_step).output
#@ error = $(job_name).$(job_step).error
#@ environment = COPY_ALL; OMP_NUM_THREADS=6
#@ queue
Here job_type = parallel, so the number of processors is fixed.
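The executable in this example is assumed to be an OpenMP code (hence OMP_NUM_THREADS=6); a minimal sketch of how such a binary might be built, with the source file name and compiler choice as assumptions:
gfortran -fopenmp myopenmpcode.f90 -o myopenmpcode   # GNU compilers
ifort -openmp myopenmpcode.f90 -o myopenmpcode       # Intel compilers v11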
|
|
|
|
Some Loadleveler Commands
|
|
• llsubmit job.cmd: submits the job to the queue
• llq: shows the queue status
• llq -s job.xyz: provides information on why a job or list of jobs remains in the NotQueued, Idle or Deferred state
• llcancel job.xyz: cancels one or more jobs from the queue
• llclass: displays the defined classes and usage information
• llstatus: provides information on the status of all the nodes in the cluster
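Putting these commands together, a typical session might look like the sketch below; the command file name and the job identifier are placeholders, and LoadLeveler prints the real identifier at submission time:
llsubmit parallel_job.cmd   # submit the job
llq                         # list queued and running jobs
llq -s malta.123.0          # ask why a particular job is still waiting
llcancel malta.123.0        # remove the job from the queue if necessary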
|
|
|
|
|
|