
1. Domain Decomposition: Using Massively Parallel Architectures

To make efficient use of present-day massively parallel architectures, we have implemented a domain-decomposition algorithm in the CSU AGCM. While the spherical geodesic grid may appear to be irregular, data on this grid can be organized in a logical fashion. A two-dimensional field that covers the surface of the sphere can be described on the spherical geodesic grid using a data structure that is a set of two-dimensional arrays. Fig. 1 depicts this data structure graphically. Each "block" in the data structure is logically square. In Fig. 1a we chose to decompose the grid into 40 blocks, while in Fig. 1b we chose to use 160 blocks. Two-dimensional fields can then be represented as an array of dimension (ni, nj, nsdm), where ni and nj are the lengths of the sides of a block and nsdm is the total number of blocks. At present we use square blocks of data (ni = nj), which limits nsdm to the values 10, 40, 160, 640, and so on. If the need arises, we can generalize the algorithm to accommodate rectangular blocks of data in which ni does not equal nj.
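To make the layout concrete, the sketch below shows one way such a field could be declared in C. The sizes, the names, and the use of C rather than the model's own source language are illustrative assumptions, not the CSU AGCM's actual code; the point is simply that the entire field is an ordinary (ni, nj, nsdm) array in which the block index varies slowest.

    /* A two-dimensional field covering the sphere, stored as nsdm
     * logically square blocks of size ni x nj.  In Fortran-style
     * (ni, nj, nsdm) ordering the block index varies slowest; the
     * equivalent C declaration lists the indices in reverse order.
     * The sizes used here are hypothetical.                          */
    enum { NI = 16, NJ = 16, NSDM = 40 };

    double field[NSDM][NJ][NI];

    /* value at cell (i, j) of block b */
    double get_cell(double f[NSDM][NJ][NI], int b, int j, int i)
    {
        return f[b][j][i];
    }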


Figure 1: A two-dimensional field decomposed into 40 logically square blocks of data (left) and into 160 logically square blocks of data (right).

We can distribute the data structure, (ni, ni, nsdm), over nsdm processes by assigning one block of data to each process. Alternatively, we can use fewer than nsdm processes and assign multiple blocks of data to each process. Each process then "sees" a data structure of size (ni, ni, psdm), where the value of psdm will, in general, vary from process to process and the sum of psdm across all processes equals nsdm. The generalization from two-dimensional to three-dimensional fields is trivial: each process sees a data structure of size (ni, ni, nk, psdm), where nk is the number of vertical levels.
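One simple way to realize such an assignment is sketched below: a contiguous partition of the block index, with the leftover blocks given to the lowest-ranked processes. This is only one possible scheme, written in C for illustration, and the names (my_blocks, first, psdm as an output argument) are assumptions rather than the model's actual interface.

    /* Assign nsdm blocks to nproc processes.  Each rank receives
     * psdm = nsdm/nproc blocks, plus one of the leftover blocks if
     * rank < nsdm % nproc, so the psdm values sum to nsdm.          */
    void my_blocks(int nsdm, int nproc, int rank, int *first, int *psdm)
    {
        int base  = nsdm / nproc;    /* blocks every rank receives   */
        int extra = nsdm % nproc;    /* leftover blocks              */

        *psdm  = base + (rank < extra ? 1 : 0);
        *first = rank * base + (rank < extra ? rank : extra);
    }

For example, distributing nsdm = 160 blocks over 60 processes this way gives ranks 0 through 39 three blocks each and ranks 40 through 59 two blocks each, for a total of 160.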

This domain-decomposition strategy has many positive attributes. First, the algorithm is equally applicable to atmospheric modeling and ocean modeling. The same decomposition strategy was developed and implemented independently by the Ocean Modeling Group at LANL. Second, this approach limits the number of messages and the total size of the messages required to update data at the boundaries of each block, which makes the algorithm somewhat indifferent to the specific type of interconnect fabric used by the various architectures. Third, by allowing each process to own an arbitrary number of blocks of data, we have built a load-balancing mechanism into the algorithm. Processes that are overloaded can give one or more blocks of data to processes that are underutilized. At present we load balance at compile time; in the future we intend to pursue dynamic load balancing. Furthermore, when modeling the ocean, some blocks of data will not contain any ocean points; these blocks can be culled to increase the efficiency of the algorithm. Fourth, we can adjust the block size to ensure that all data associated with one block fit entirely into cache. This allows efficient access to the data and thus optimizes processor throughput. And finally, this approach allows the seamless use of multitasking threads, such as OpenMP, "under" the Message Passing Interface (MPI) algorithm. Since every process owns psdm blocks of data, and each of these blocks can be manipulated independently of all other blocks, we can multitask over the psdm index of the data, as sketched below. Threads are spawned at the beginning of each time step and are not closed until near the end of the time step, so each thread stays open for a relatively long time, which mitigates the cost of spawning it. This multitasking capability allows us to make better use of architectures that employ both shared-memory and distributed-memory paradigms.
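The sketch below illustrates this threading structure: a single OpenMP parallel region spans one (highly simplified) time step, and the loop over the psdm block index is divided among the threads. The block type and the per-block work are stand-ins invented for the example; only the overall pattern, threads opened once per step and multitasking over blocks, reflects the description above.

    #include <stdio.h>

    enum { NI = 16, NJ = 16 };

    typedef struct { double data[NJ][NI]; } Block;   /* hypothetical block */

    /* stand-in for the per-block work done during one time step */
    static void step_block(Block *b)
    {
        for (int j = 0; j < NJ; j++)
            for (int i = 0; i < NI; i++)
                b->data[j][i] += 1.0;
    }

    /* One time step over the psdm blocks owned by this process.  A single
     * parallel region covers the whole step, so threads are spawned once
     * per step; MPI boundary exchanges (omitted here) would sit between
     * the threaded loops and be issued by a single thread.                */
    void time_step(Block *blocks, int psdm)
    {
        #pragma omp parallel
        {
            #pragma omp for schedule(static)
            for (int b = 0; b < psdm; b++)
                step_block(&blocks[b]);
            /* ... further per-block phases of the time step ... */
        }
    }

    int main(void)
    {
        Block blocks[4] = {0};        /* this process owns psdm = 4 blocks */
        time_step(blocks, 4);
        printf("block 0, cell (0,0) = %g\n", blocks[0].data[0][0]);
        return 0;
    }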

Fig. 2 shows the scaling efficiency of this algorithm as implemented in our atmospheric dynamical core. Results are shown for three model resolutions: 2562, 10242, and 40962 grid cells. When the sphere is discretized using 2562 grid cells, the nominal distance between grid cell centers is approximately 450 km; the 10242 and 40962 grids have grid spacings of approximately 225 km and 112 km, respectively. As Fig. 2 shows, the model at the finest resolution, 40962, scales linearly out to approximately 80 processes and shows a speedup of 110 when we use 160 processes. The "super-linear" speedup of the 40962 model at around 20 processes is related to data caching.


Figure 2: The scaling efficiency of our atmospheric dynamical core. Both panels show the same data. The panel on the left shows the scaling out to 40 processes, while the panel on the right extends the x-axis to 160 processes.
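As a rough check of the grid spacings quoted above, the nominal spacing can be estimated as the square root of the mean cell area, sqrt(4*pi*R^2 / N), for N cells on a sphere of Earth radius R of about 6371 km. This simple estimate is not necessarily the definition of resolution used in the model, but the short C program below reproduces values consistent with the quoted 450, 225, and 112 km.

    #include <math.h>
    #include <stdio.h>

    /* Nominal grid spacing taken as the square root of the mean cell
     * area on a sphere of radius R (an estimate only, not necessarily
     * the model's own definition of resolution).                      */
    int main(void)
    {
        const double PI = 3.14159265358979;
        const double R  = 6371.0;                     /* km */
        const int ncells[] = { 2562, 10242, 40962 };

        for (int n = 0; n < 3; n++) {
            double dx = sqrt(4.0 * PI * R * R / ncells[n]);
            printf("%6d cells: ~%.0f km\n", ncells[n], dx);
        }
        return 0;
    }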

Since our proposed atmosphere, ocean, and sea-ice sub-models will all use the spherical geodesic grid, we expect all of the sub-models to use the same message-passing modules. This will have at least two positive impacts. First, it leverages our efforts to make efficient use of the computational platforms across the different model components. Second, it moves the message passing to a layer beneath the numerical algorithm, allowing many of us to focus on the development of sound numerical methods rather than on message-passing techniques. This algorithm was designed to accommodate present-day massively parallel architectures, and we feel that the design has been successful. At the same time, we have attempted to maintain enough flexibility in the algorithm to allow the efficient porting of the model components to the next generation of platforms.