The high-performance computer is operated by the IT Centre at RWTH Aachen University and is available to members of RWTH Aachen University and to scientists from all over Germany. Following the "1-cluster concept", the operating model makes all resources of the cluster available to users through a single interface, so that different expansion stages, innovative architectures and data can be used via the same processes.
The 1-Cluster Concept
Over the course of its development, the IT Centre was and is faced with the challenge of operating a heterogeneous system landscape, integrating innovative architectures, and providing access to different user groups in different ways. The 1-cluster concept grew out of these requirements. It aims to operate all components as one large cluster and offers the following advantages:
- The same interfaces for identity management, dialogue systems, access options, workload management system, operating system, software stack and file systems are available to users across the entire cluster. Users only need to know as much about the central components as is necessary, the focus on a single interface simplifies communication, and the documentation is easier to maintain.
- A dedicated solution for cluster management lets operational processes scale well and be adapted to different scenarios. Linking the various cluster management tools makes it possible, for example, to roll out changes across the entire cluster based on monitoring data, since these tools are built on the same technical basis.
- This has made it possible for years to operate the HPC system without fixed maintenance windows, resulting in very high availability with very few interruptions to operation for users. Interruptions are only necessary in exceptional cases, such as maintenance work on the file systems or major changes such as switching the operating system. Minor maintenance work, such as installing new kernel versions, is handled via the workload management system and therefore imposes no restrictions on users.
- Clustering in this way leads to highly scalable operational processes and allows new and innovative functions to become immediately available on all suitable architectures and expansion stages. It also makes it possible to set up many new systems in a very short time and integrate them into the cluster, for example when adding a new expansion stage.
- Differentiation, for example between processor architectures or server types, is made on the user side only after consideration in the technical-scientific assessment during the application process, and on the operational side to the smallest possible extent.
The structure of the cluster reflects the 1-cluster concept. The dialogue systems and the JupyterHub form the users' interface to the high-performance computer. They can be used to prepare, submit, control and evaluate computing jobs and to run development and analysis applications. Large amounts of data can be transferred into and out of the high-performance computer via dedicated copy nodes with high-bandwidth connections to the university and science networks. The large groups of backend systems (CLAIX-2023, Tier-3, innovative architectures (GPU, integrative hosting)) are made available via the workload management system and are not directly accessible. The file systems can be accessed from the entire cluster and are addressed by users as $HOME, $WORK, $HPCWORK and $BEEOND. The majority of the individual backend groups are interconnected via high-performance, redundant InfiniBand networks.
File Systems
For storing data, users of the high-performance computer are provided with several file systems that differ in their intended usage scenarios. These differences show up in performance across various metrics, in the available space and in the data backup concepts. The following file systems can be used (a short usage sketch follows the list):
- $HOME
$HOME is an NFS-based file system that by default provides users with 150 GB of storage space for their most important data, such as source code and configuration files. Snapshot mechanisms and backups in the RWTH Aachen University backup system guarantee a very high level of data security. This is also reflected in the 100% availability of the file system from 2016 to 2023.
- $WORK
$WORK is also an NFS file system, but it is technically designed for storing larger files, for example the results of compute jobs. With 250 GB, users have more storage space available in this file system; however, the data is not saved in the backup system and should therefore be reproducible. Accidentally deleted files can nevertheless be restored from snapshots.
- $HPCWORK
$HPCWORK comprises two file systems based on the parallel high-performance file system Lustre, which by design offers high write and read rates. With a default quota of 1 TB, the space available here is significantly larger than in the other file systems. Because of the amount of data, however, no central backup is possible here either.
- $BEEOND
$BEEOND is another available file system designed for specialised high-performance use cases and is based on parallel file system technology. It offers flexible and scalable storage options for the special requirements of high-performance applications.
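As a minimal illustration of how these storage locations might be addressed from a program, the following Python sketch reads the environment variables described above and picks a target directory depending on the kind of data. The helper name, the project subdirectory and the fallback logic are illustrative assumptions, not part of the official documentation.

```python
import os
from pathlib import Path

def pick_target_dir(kind: str) -> Path:
    """Illustrative helper: map a data category to one of the cluster file systems.

    Assumes the variables $HOME, $WORK, $HPCWORK (and, where provided, $BEEOND)
    are set as described above.
    """
    if kind == "config":        # small, important data -> backed-up $HOME
        base = os.environ["HOME"]
    elif kind == "results":     # larger result files -> $WORK (snapshots, no backup)
        base = os.environ["WORK"]
    elif kind == "scratch":     # I/O-intensive data -> parallel $HPCWORK (no backup)
        base = os.environ["HPCWORK"]
    elif kind == "job-local":   # job-lifetime scratch -> $BEEOND, falling back to $HPCWORK
        base = os.environ.get("BEEOND", os.environ["HPCWORK"])
    else:
        raise ValueError(f"unknown data category: {kind}")
    target = Path(base) / "my_project"   # "my_project" is a placeholder name
    target.mkdir(parents=True, exist_ok=True)
    return target

if __name__ == "__main__":
    print(pick_target_dir("results"))
```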
Software
The cluster's software stack is managed by the IT Centre and in part developed in-house. This approach has been followed for a long time and offers a number of advantages:
- Independence from manufacturers ensures flexibility and allows quick adaptation to the frequently changing requirements in research and teaching (for example, the integration of innovative architectures).
- Savings in licensing and maintenance fees for software (for example, the operating system).
- Access to all layers of the software stack allows effective and efficient error and performance analysis as well as comprehensive changes in response to the analysis results.
- Consistent pursuit and implementation of an open-source strategy.
The operating system used is Rocky Linux, an open Red Hat-based Linux variant.
The SLURM workload management system, which manages the computing jobs on the backend systems, successfully replaced its predecessor, IBM Platform LSF, in 2019. Experience has shown that it makes sense to use a professional solution for the 1-cluster concept, the various integrated system architectures and the different requirements placed on the batch system (for example fair-share scheduling, backfilling, and support for different MPI implementations).
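To make the role of the workload management system more concrete, the following hedged sketch shows how a compute job might be handed to SLURM from Python. The resource values, the application name and the log file pattern are illustrative placeholders, not site-specific settings or recommendations.

```python
import subprocess
import tempfile

# Hypothetical batch script: all values are placeholders for illustration only.
job_script = """#!/usr/bin/env bash
#SBATCH --job-name=example
#SBATCH --ntasks=4            # number of tasks (e.g. MPI ranks) requested
#SBATCH --time=00:10:00       # wall-clock limit (hh:mm:ss)
#SBATCH --output=example.%j.log

srun ./my_application         # srun starts the tasks under SLURM's control
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# sbatch hands the script to the workload manager, which schedules it on the
# backend systems and prints the assigned job id.
result = subprocess.run(["sbatch", script_path],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```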
Approximately 400 different ISV and open-source software packages are made available to users in various subject-specific categories. Where demand is sufficiently large, the IT Centre takes over provisioning and maintenance centrally and, where needed, operates the required licence servers. Tools for using the cluster (e.g. graphical interfaces), for parallelisation (various MPI implementations), for programming (compilers, libraries) and for application analysis (debuggers, performance analysis and visualisation) are provided to users centrally.
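As an illustration of the centrally provided parallelisation tools, a minimal MPI example in Python is sketched below. It assumes that an MPI implementation and the mpi4py package are available from the software stack; this is an assumption for illustration, not a statement about the installed package list.

```python
# Minimal MPI sketch, assuming an MPI implementation and mpi4py are available
# from the centrally provided software stack (an illustrative assumption).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process within the communicator
size = comm.Get_size()   # total number of processes started by the launcher

# Each rank contributes its rank number; rank 0 collects the sum.
total = comm.reduce(rank, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} ranks, sum of ranks = {total}")
```

Launched with the MPI launcher of the chosen implementation (for example via srun or mpiexec inside a batch job), every process runs the same script and is distinguished only by its rank.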
Integrative Hosting
Individual institutes or projects often need systems of their own. The Integrative Hosting service builds on the 1-cluster concept and exploits the cluster's capacity for scalable expansion: the IT Centre procures, installs, and operates additional HPC resources for university use in support of research and teaching. The offering pursues several objectives. Providing resources centrally for the university creates synergies in the energy and operational infrastructure. It allows users to concentrate on their application domain without having to take on administrative tasks themselves. And for cooperation with external users, integrative hosting provides a platform for collaboration.
The resources, described in service level agreements and service certificates in line with IT service management, are made available within the framework of projects. Those responsible can manage these projects themselves and appoint other users as project members.