Scientific Area
Quantum mechanics, density functional theory
Short description
The JuKKR package is developed at Research Center Juelich and contains a collection of codes for electronic structure calculations based on the Korringa-Kohn-Rostoker Green's function method. One part of JuKKR is the KKRimp code, which allows the quantum mechanical simulation of impurities and defects embedded in a variety of materials, providing unparalleled insight into their electronic and magnetic properties. Moreover, it can describe the scattering of electrons off defect atoms. Recently, the code has been extended to describe defects embedded in superconductors, an important ingredient in the search for materials suited to future stable quantum computing applications.
We conducted a performance assessment of the KKRimp code using the tools Score-P, Scalasca, Cube, and Vampir. As part of the assessment, we calculated the metrics for hybrid parallel applications proposed by the POP2 project [3]. Using these POP metrics, we identified parts of the code with potential for improving the existing OpenMP parallelization.
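For orientation, the top-level POP efficiencies can be sketched as follows. This is a simplified view of the metric hierarchy; the full hybrid metric set additionally splits the parallel efficiency into MPI and OpenMP contributions, which we omit here:

\[
\mathrm{LB} = \frac{\overline{t_{\mathrm{comp}}}}{\max_p t_{\mathrm{comp},p}},
\qquad
\mathrm{CommE} = \frac{\max_p t_{\mathrm{comp},p}}{T},
\qquad
\mathrm{PE} = \mathrm{LB}\cdot\mathrm{CommE} = \frac{\overline{t_{\mathrm{comp}}}}{T},
\]

where \(t_{\mathrm{comp},p}\) is the time process \(p\) spends in useful computation, the bar denotes the average over all processes, and \(T\) is the total runtime. A low load balance (LB) or communication efficiency (CommE) points to where parallel performance is lost.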
One candidate for optimization is the routine rhooutnew, which integrates density matrices over a discrete set of radial points. The set of radial points is already distributed across the MPI processes. However, within each process the OpenMP parallelization is limited to calls to linear algebra routines, e.g. zgemm, in the threaded (parallel) version of the Intel Math Kernel Library (MKL). Thus, a significant part of this routine is still executed serially by the main thread of each MPI process, leading to inefficient utilization of the allocated computing resources.
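The following sketch illustrates this baseline structure. It is not the actual Fortran source of rhooutnew but a schematic C analogue; the function name density_baseline, the array layout, and the dimensions nrad and lmmax are illustrative placeholders. Only the zgemm call inside the loop runs multi-threaded via the parallel MKL, while the loop itself and the remaining per-point work stay on the main thread:

#include <stddef.h>
#include <mkl.h>   /* cblas_zgemm from Intel MKL */

/* Schematic baseline: a serial loop over radial points on each MPI rank;
 * only the zgemm call uses the threaded MKL. All names are illustrative. */
void density_baseline(int nrad, int lmmax,
                      const MKL_Complex16 *gmat,   /* nrad blocks of lmmax x lmmax */
                      const MKL_Complex16 *rll,    /* nrad blocks of lmmax x lmmax */
                      MKL_Complex16 *work,         /* one lmmax x lmmax scratch block */
                      MKL_Complex16 *rho)          /* per-point density, nrad entries */
{
    const MKL_Complex16 one  = {1.0, 0.0};
    const MKL_Complex16 zero = {0.0, 0.0};

    for (int ir = 0; ir < nrad; ++ir) {            /* serial loop over radial points */
        cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    lmmax, lmmax, lmmax,
                    &one,  &gmat[(size_t)ir * lmmax * lmmax], lmmax,
                           &rll [(size_t)ir * lmmax * lmmax], lmmax,
                    &zero, work, lmmax);           /* threaded MKL zgemm */

        /* remaining per-point work (here: trace of the product) is
         * executed by the main thread only */
        for (int i = 0; i < lmmax; ++i) {
            rho[ir].real += work[(size_t)i * lmmax + i].real;
            rho[ir].imag += work[(size_t)i * lmmax + i].imag;
        }
    }
}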
To optimize the performance of the code, we changed the OpenMP parallelization of this routine so that the entire integration loop over the radial points is parallelized with OpenMP, not only parts of it. At the same time, all calls to parallel Intel MKL linear algebra routines such as zgemm are replaced by their sequential counterparts to avoid nested parallelism among the OpenMP threads.
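A sketch of the restructured loop, in the spirit of this change, is shown below; again this is a schematic C analogue with hypothetical names rather than the actual Fortran routine. The loop over radial points is distributed over the OpenMP threads, each thread uses its own scratch block, and the zgemm calls are kept single-threaded, e.g. by linking the sequential MKL or by restricting MKL threading:

#include <stdlib.h>
#include <mkl.h>

/* Schematic optimized variant: the whole loop over radial points is
 * distributed over the OpenMP threads; zgemm runs sequentially inside
 * each iteration. All names are illustrative placeholders. */
void density_optimized(int nrad, int lmmax,
                       const MKL_Complex16 *gmat,
                       const MKL_Complex16 *rll,
                       MKL_Complex16 *rho)
{
    const MKL_Complex16 one  = {1.0, 0.0};
    const MKL_Complex16 zero = {0.0, 0.0};

    mkl_set_num_threads(1);   /* keep MKL calls single-threaded (no nesting) */

    #pragma omp parallel
    {
        /* per-thread scratch block avoids sharing the work array */
        MKL_Complex16 *work = malloc((size_t)lmmax * lmmax * sizeof *work);

        #pragma omp for schedule(static)
        for (int ir = 0; ir < nrad; ++ir) {        /* radial loop now threaded */
            cblas_zgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        lmmax, lmmax, lmmax,
                        &one,  &gmat[(size_t)ir * lmmax * lmmax], lmmax,
                               &rll [(size_t)ir * lmmax * lmmax], lmmax,
                        &zero, work, lmmax);       /* sequential zgemm */

            for (int i = 0; i < lmmax; ++i) {      /* per-point work, now threaded */
                rho[ir].real += work[(size_t)i * lmmax + i].real;
                rho[ir].imag += work[(size_t)i * lmmax + i].imag;
            }
        }

        free(work);
    }
}

Since each radial point ir is handled by exactly one thread, the per-point accumulation into rho needs no synchronization.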
Results
As the baseline for our performance evaluation, we use the original code, in which OpenMP parallelism in the integration in the routine rhooutnew is achieved only through calls to the parallel Intel MKL. The code was compiled with the Intel Fortran Compiler 2021.5.0 and the Intel MPI Suite 2021.3.1 and executed on the CLAIX-2018 cluster. For execution, we chose 16 MPI processes and a varying number of OpenMP threads per process. The following figure compares the runtime of the rhooutnew routine between the base version (ref) and our optimized version (opt).