1. Domain Decomposition: Using Massively Parallel Architectures
To make efficient use of present-day massively parallel
architectures, we have implemented a domain-decomposition algorithm in the CSU
AGCM. While the spherical geodesic grid may appear to be irregular, data
structures on this grid can be represented in a logical fashion. A
two-dimensional field that covers the surface of the sphere can be described on
the spherical geodesic grid using a data structure that is a set of
two-dimensional arrays. Fig. 1 graphically depicts this data structure. Each
"block" in the data structure is logically square. In Fig. 1a we chose to
decompose the grid into 40 blocks, while in Fig. 1b we chose to use 160 blocks.
Two-dimensional fields can then be represented as an array of dimension
(ni, nj, nsdm), where ni and nj are the lengths of the
sides of a block and nsdm is the total number of blocks. At present, we
use square blocks of data (ni = nj), which limits nsdm to the
values 10, 40, 160, 640, and so on. If the need arises, we can generalize our
algorithm to accommodate rectangular blocks of data where ni does not equal
nj.
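As an illustration of this layout, the following C sketch stores one field as nsdm contiguous, logically square blocks. The names (field2d, NI, NSDM), the block size, and the example values are our own illustrative assumptions, not the model's source code.

```c
/* Minimal sketch (not the CSU AGCM source): a two-dimensional spherical field
 * stored as nsdm logically square blocks of size ni x nj, mirroring the
 * dimension order (ni, nj, nsdm).  All names and sizes are illustrative. */
#include <stdlib.h>

#define NI    32          /* points along one side of a block (assumed)  */
#define NJ    NI          /* square blocks: ni = nj                      */
#define NSDM  40          /* total number of blocks: 10, 40, 160, 640... */

typedef struct {
    /* the block index varies slowest, so each NI x NJ block is contiguous
     * in memory and can be sized to fit entirely in cache */
    double data[NSDM][NJ][NI];
} field2d;

/* Fill every block of a field with a constant, touching each block as an
 * independent, contiguous tile of memory. */
static void set_constant(field2d *f, double value)
{
    for (int b = 0; b < NSDM; ++b)
        for (int j = 0; j < NJ; ++j)
            for (int i = 0; i < NI; ++i)
                f->data[b][j][i] = value;
}

int main(void)
{
    field2d *theta = malloc(sizeof *theta);   /* e.g. a temperature field */
    if (!theta) return 1;
    set_constant(theta, 300.0);
    free(theta);
    return 0;
}
```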
This domain-decomposition strategy has many positive attributes. First, the algorithm is equally applicable to atmospheric modeling and ocean modeling; the same decomposition strategy was developed and implemented independently by the Ocean Modeling Group at LANL. Second, this approach limits both the number of messages and the total size of the messages required to update data at the boundaries of each block, which makes the algorithm largely indifferent to the specific type of interconnect fabric used by the various architectures. Third, by allowing each process to own an arbitrary number of blocks of data, we have built a load-balancing mechanism into the algorithm: processes that are overloaded can give one or more blocks of data to processes that are underutilized. At present, we load balance at compile time; in the future we intend to pursue dynamic load balancing. Furthermore, when modeling the ocean, some blocks of data will not contain any ocean points; these blocks can be culled to increase the efficiency of the algorithm. Fourth, we can adjust the block size to ensure that all data associated with one block fit entirely into cache, which allows efficient access to the data and thus optimizes processor throughput.

Finally, this approach allows the seamless use of multitasking threads, such as OpenMP, "under" the Message Passing Interface (MPI) algorithm. Since every process owns psdm blocks of data, and each of these blocks can be manipulated independently of all other blocks, we can multitask over the psdm index of the data, as sketched below. Threads are spawned at the beginning of each time step and are not closed until near the end of the time step, so each thread remains open for a relatively long time, which mitigates the cost of spawning it. This multitasking capability allows us to make better use of architectures that employ both shared-memory and distributed-memory paradigms.
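The C sketch below illustrates this threading-under-message-passing idea under simplifying assumptions: the rank and process count would normally come from MPI_Comm_rank and MPI_Comm_size, blocks are dealt out round-robin as a stand-in for the compile-time load balancing, and advance_block is a placeholder for the per-block work; none of these names are taken from the model itself.

```c
/* Sketch: each process owns psdm of the nsdm blocks and multitasks over its
 * local block index with OpenMP.  Names and block assignment are illustrative. */
#include <stdio.h>

#define NSDM 160                 /* total number of blocks on the sphere */

/* stand-in for the per-block work done each time step */
static void advance_block(int global_block, double dt)
{
    (void)global_block; (void)dt;   /* real model: dynamics/physics here */
}

/* One time step on one process: rank and npes would normally come from
 * MPI_Comm_rank / MPI_Comm_size.  Blocks are dealt out round-robin, a simple
 * compile-time style of load balancing. */
static void step(int rank, int npes, double dt)
{
    int psdm = 0;
    int owned[NSDM];
    for (int b = rank; b < NSDM; b += npes)   /* blocks this process owns */
        owned[psdm++] = b;

    /* threads are spawned once per time step and live for the whole loop,
     * so the spawning cost is amortized over all owned blocks */
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < psdm; ++k)
        advance_block(owned[k], dt);

    /* ...boundary (ghost) exchange between blocks via MPI would follow... */
    printf("rank %d advanced %d blocks\n", rank, psdm);
}

int main(void)
{
    step(0, 4, 300.0);   /* pretend we are rank 0 of 4 processes */
    return 0;
}
```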
Fig. 2 shows the scaling efficiency of this algorithm as implemented in our atmospheric dynamical core. Results are shown for three model resolutions: 2562, 10242, and 40962 grid cells. When the sphere is discretized using 2562 grid cells, the nominal distance between grid cell centers is approximately 450 km; the 10242 and 40962 grids have grid spacings of approximately 225 km and 112 km, respectively. As Fig. 2 shows, the model at the finest resolution, 40962 cells, scales linearly out to approximately 80 processes and shows a speedup of 110 when we use 160 processes. The "super-linear" speedup of the model on the 40962 grid at around 20 processes is related to data caching.
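The nominal spacings quoted above are consistent with dividing the sphere's surface area equally among the grid cells and taking the square root of a cell's share; the short check below is our own back-of-the-envelope calculation, assuming an Earth radius of 6371 km, not the model's definition of resolution.

```c
/* Rough check of the nominal grid spacings: each of the N cells covers an
 * equal share of the sphere's surface, and the spacing is the square root
 * of that area.  Earth radius of 6371 km is an assumption. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = 3.14159265358979323846;
    const double a  = 6371.0;                     /* Earth radius, km */
    const int ncells[] = { 2562, 10242, 40962 };
    for (int k = 0; k < 3; ++k) {
        double dx = sqrt(4.0 * pi * a * a / ncells[k]);
        printf("N = %5d  ->  ~%.0f km\n", ncells[k], dx);  /* ~446, 223, 112 km */
    }
    return 0;
}
```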