Supercomputers deploy massive number of cores working together to run highly parallel High Performance Computing (HPC) applications for long durations in the range of minutes, hours, or days. As the tasks of these applications communicate with each other intensively, the communication overhead comprises a significant portion of the applications’ execution time. This article focuses on topology mapping (i.e., the placement of the application tasks into the processing cores) to minimize the communication overhead and to improve supercomputer efficiency.
Supercomputers serve highly parallel applications, which cannot run on a single server due to their large size. Common examples of such applications include weather forecasting, quantum mechanics, and financial modeling. Programmers synchronize thousands of tasks in these applications using protocols such as Message Passing Interface (MPI) and OpenMP.
The sizes of these applications and the supercomputers have been growing exponentially. The number of processing cores in the largest supercomputers has increased from 65,000 to over 3 million in the last decade. In parallel with this trend, the power consumption of these supercomputers experienced a 25-fold increase to 18 MW. To keep this trend sustainable, US Department of Energy has set a goal of 20 MW for an exascale (1018 floating-point operations per second) supercomputer. In order to meet this goal, researchers put great effort on increasing the efficiency of supercomputers.
The main research areas on HPC efficiency can be broadly classified as workload management, fault tolerance, and power management. In “Cooling Awareness in HPC Job Allocation” (Circuit Cellar 296, March 2015), Ayse K. Coskun talked about HPC job allocation with cooling awareness. Here, I will present a deeper look in HPC workload management with an emphasis on topology mapping.
The massive size of supercomputers requires intricate workload management. Figure 1 shows a generic overview of how HPC jobs (i.e., specific instances of applications) are processed upon submission. A user can submit a job to a supercomputer with only basic additional information: required number of processing cores, required memory, and the worst-case execution time. Then, as part of the system software, the scheduler decides on when to run the job based on the resource availability and the job queue. The scheduler can prioritize certain jobs or try to achieve fairness among the users by reordering the job queue. Once the job is scheduled, the allocator assigns the job to a set of machine nodes, where each node contains multiple processing cores and a router. The main purpose of the allocator is to find available nodes that are close to each other to decrease the communication distances and thereby to reduce the communication overhead. The final stage, task mapping, refers to the placement of application tasks onto the processing cores within the nodes given by the allocator.
Simple Linux Utility for Resource Management (SLURM) is one of the most common workload management tools used in supercomputers, including on Tianhe-2, the world’s fastest supercomputer since June 2013 according to the top500 list (www.top500.org). SLURM is an open-source tool, so you can download it and check out how the fastest computer in the world is managed.
HPC applications spend up to 30% of their execution time during inter-task communication. Hence, reducing the communication overhead would have significant impact both on the performance and on the efficiency of the supercomputer. First, we should understand how the tasks of an application communicate with each other.
HPC applications are typically well-balanced, which means that the tasks reach their synchronization points (where they intensively communicate) at the same time. However, even if the application is well-balanced, some tasks can require more time to send and receive data due to the mismatch between the network topology and the application topology. Furthermore, this mismatch will cause a bottleneck for the entire application as all the tasks need to synchronize to be able to proceed with the execution. Topology mapping, which is the combined process of node allocation and task mapping, tries to minimize this mismatch by intelligently placing tasks onto processing cores.
APPLICATION & NETWORK TOPOLOGIES
An application’s topology is characterized by its communication pattern, which is inherent to whatever the application is computing. For example, in a simulation that models deformations in a multi-material body due to shock waves, each task can calculate the deformation within a small cubical volume and iteratively converge to the global solution using boundary conditions between the cubes. Such applications where the tasks form an n-dimensional array are called stencil applications and are common in HPC. The second type of HPC applications is those with irregular topologies where tasks can differ from each other both in terms of communication volume and number of neighbors. Examples for irregular applications include social network analysis and certain biomolecular simulations.
Figure 2 shows the common network topologies used in today’s supercomputers. One of the most prominent topologies is torus due to its high bisection bandwidth, easy management, and low cost. In addition, torus topologies are proven to be very efficient for stencil applications due to their close structural resemblance. However, the messages in the network have to “hop” through a high number of routers to travel between distant nodes in the system, increasing the communication overhead. The recently proposed dragonfly topology addresses this problem by minimizing the network radius, which is the maximum “hop distance” needed by a message to travel between two nodes in the system. Fat-tree is another common topology with high scalability but also with high maintenance cost and low resilience. If the switches that are closer to the tree root fail, the entire network is crippled.
To avoid any overheads associated with task migration and system management, supercomputers typically use static mapping (i.e., tasks stay on the same core during the entire lifetime of a job). Hence, the system applies topology mapping based on the average communication intensity between the tasks. For static mapping, the application topology can be represented by a graph, where the vertices represent tasks and edges represent the average communication between the tasks. The topology mapping problem then becomes finding a subgraph in the network that fits the best to the application, where “fitting the best” is measured with a network metric such as average message hop-distance or maximum congestion. This problem is shown to be NP-hard; hence, researchers developed a bulk of heuristics to minimize the mismatch between network and application topologies.
The most common allocation heuristic is ordering the machine nodes along a curve and selecting the smallest interval along the curve that is large enough to accommodate the next job. By using space-filling curves such as Hilbert curves, one can expect the nodes in the selected interval to be close to each other in the network. Another popular allocation technique, which is classified as clustered allocation, is searching for a set of available nodes that form a cube in a torus network. Although more intuitional, this technique requires more computation and increases the system overhead.
There are several popular heuristics for task mapping as well. Based on an application’s topology, the tasks can be reordered before placing them in order onto a curve to reduce the communication overhead. There are fast algorithms that are used to reduce the maximum hop distance such as the reverse Cuthill-Mckee algorithm. These algorithms can worsen the communication overhead of certain applications by significantly increasing the average hop distance. Another task mapping technique is recursively bisecting both the communication graph and the network graph. This technique increases the system overhead but it can be applied to any application and network topology.
Among all these allocation and task mapping techniques, there is no clear winner for all networks and applications. This led the researchers to develop techniques targeting specific conditions. For example, if there are clearly defined node coordinates (e.g., node positions in a torus topology) and task coordinates (e.g., 3-D regions in a physical simulation), recursive bisection of the network and application geometries is a fast and efficient solution for task mapping. In Boston University’s Power and Energy Aware Computing Laboratory, we investigated the performance of the allocation and mapping techniques I described above and developed a combined allocator and task mapper for irregular applications.
Our topology mapping policy is called Partitioning and Center Mapping (PaCMap). PaCMap first abstracts the core-versus-node difference by partitioning the application graph such that each vertex in the partitioned graph, which is a group of tasks, can fit into a single node. After this step, the problem reduces to mapping the task groups onto the available machine nodes.
PaCMap aims to find highly communicating tasks and map them next to each other in accordance with the network topology. For this purpose, it finds the center task in the application that has the least cumulative communication distance to all other tasks in the application’s communication graph. This center task is mapped onto a center node, which has sufficient availability around it for the entire job. Then, PaCMap expands the allocation from the centers by greedily picking a nearby available node and mapping a task to it based on the communication graph. The algorithm effectively uses the application topology by simultaneously applying allocation and task mapping.
We compared PaCMap with an allocator called best-fit that uses space-filling curves, a clustered allocator called mc1x1, and task mapping via recursive graph bisection (RGrB). For our evaluation, we used the Structural Simulation Toolkit, which is an open-source architectural simulation framework designed by Sandia National Laboratories to assist in the design, evaluation, and optimization of HPC architectures and applications.
Figure 3 shows the cumulative running time of all jobs using different allocators and task mappers with two different workload traces, each consisting of more than 1000 jobs with irregular communication patterns. Note that PaCMap can be used for simultaneous allocation and mapping or as a mere allocator or task mapper by limiting its decision space. The results show that performance of topology mapping policies highly depends on the workload trace. Although intuitively the clustered mc1x1 allocator should be more useful for RGrB than the curve-following best-fit allocator, we observe that for trace A, RGrB and best-fit pair leads to 7% less cumulative execution time than mc1x1. However, for trace B, RGrB and mc1x1 pair leads to 3% shorter execution than with best-fit. The topology mapping performance not only depends on the allocator and the task mapper, but also on in which order the jobs arrive. Complete topology mapping with PaCMap decreases the cumulative execution time by 2% and 3% for trace A and B, respectively, compared to the best case of RGrB for each trace.
The impact topology mapping on supercomputer performance will increase as HPC application sizes continue to grow. Even for the current applications, there is still no clear solution on which allocator and task mapper to use to minimize communication time with reasonable system management overhead. Our research shows that we need solutions targeting specific applications and network topologies to overcome the existing techniques. Furthermore, using the system resources more effectively will reduce the power consumption, helping HPC community to meet goal of 20 MW for an exascale supercomputer.
Another interesting aspect in topology mapping is the dynamicity of the system. The first source of dynamicity in supercomputers is the changes in the network traffic. As the network is used by multiple jobs, some network links can be congested, decreasing the performance of multiple jobs. Another form of dynamicity is due to software anomalies such as abnormally high usage of memory. In current systems, there is a lack of monitoring infrastructure that prevents us to pinpoint the jobs that cause congestion or that display abnormal behavior. The HPC research community is currently putting great effort to solve the problem of real-time monitoring thousands of computer nodes. When the detailed system state becomes visible in real-time, researchers will be able to significantly improve topology mapping as well as fault tolerance in supercomputers.
 T. Hoefler and M. Snir, “Generic Topology Mapping Strategies for Large-Scale Parallel Architectures,” Proceedings of the International Conference on Supercomputing (ICS ‘11), ACM, New York, NY, 2011.
 O. Tuncer, V. J. Leung, and A. K. Coskun, “PaCMap: Topology Mapping of Unstructured Communication Patterns onto Non-contiguous Allocations,” Proceedings of the 29th ACM on International Conference on Supercomputing (ICS ‘15). ACM, New York, NY, 2015.
 G. Hendry and A. Rodrigues, “SST: A Simulator for Exascale Co-Design,” ASCR/ASC Exascale Research Conference, 2012.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JANUARY 2016 #306 – Get a PDF of the issue