Applying Optimization Strategies

The compiler may or may not apply the following optimizations to your loop: Loop Interchange, Unrolling, Cache Blocking, Loop Distribution, and Load Pair. These transformations are discussed in the following sections, including how to transform loops manually and how to control them with directives or internal options.

Loop Interchange

Loop Interchange is a nested loop transformation applied by High-level Optimization (HLO) that swaps the order of execution of two nested loops. Typically, the transformation is performed to provide sequential, unit-stride access to the array elements used inside the loop, which improves cache locality. At -O3 (Linux*) or /O3 (Windows*), the compiler looks for opportunities to apply loop interchange for you.

The following is an example of a loop interchange:

Example

subroutine loop_interchange(a,b,c,NUM)
  implicit none
  integer :: i,j,k,NUM
  real :: a(NUM,NUM), b(NUM,NUM), c(NUM,NUM)

  ! loop before loop interchange
  do i=1,NUM
    do j=1,NUM
      do k=1,NUM
        c(j,i) = c(j,i) + a(j,k) * b(k,i)
      end do
    end do
  end do

  ! loop after interchange
  do i=1,NUM
    do k=1,NUM
      do j=1,NUM
        c(j,i) = c(j,i) + a(j,k) * b(k,i)
      end do
    end do
  end do
end subroutine loop_interchange

Unrolling

Loop unrolling is a loop transformation, generally applied by HLO, that can take better advantage of Instruction-Level Parallelism (ILP) by keeping as many functional units as possible busy doing useful work during a single loop iteration. Loop unrolling places more work inside the loop body in exchange for fewer loop iterations.

Example

subroutine loop_unroll_before(a,b,c,N,M)
  implicit none
  integer :: i,j,N,M
  real :: a(N,M), b(N,M), c(N,M)
  N=1025
  M=5
  do i=1,N
    do j=1,M
      a(j,i) = b(j,i) + c(j,i)
    end do
  end do
end subroutine loop_unroll_before

 

Example

subroutine loop_unroll_after(a,b,c,N,M)
  implicit none
  integer :: i,j,K,N,M
  real :: a(N,M), b(N,M), c(N,M)
  N=1025
  M=5
  K=MOD(N,4)   ! K = N mod 4; number of leftover iterations

  ! main part of loop, unrolled by four
  do i=1,N-K,4
    do j=1,M
      a(j,i)   = b(j,i)   + c(j,i)
      a(j,i+1) = b(j,i+1) + c(j,i+1)
      a(j,i+2) = b(j,i+2) + c(j,i+2)
      a(j,i+3) = b(j,i+3) + c(j,i+3)
    end do
  end do

  ! post-conditioning part of loop handles the remaining K iterations
  do i=N-K+1,N
    do j=1,M
      a(j,i) = b(j,i) + c(j,i)
    end do
  end do
end subroutine loop_unroll_after

Post-conditioning is preferred over pre-conditioning because it preserves the data alignment of the main unrolled loop and avoids the cost of misaligned memory accesses.
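For comparison, a pre-conditioned version of the same loop might look like the following sketch (this subroutine is not part of the original example): the leftover iterations run first, so the unrolled main loop starts at i = K+1 rather than at the first, aligned element.

Example

subroutine loop_unroll_precondition(a,b,c,N,M)
  implicit none
  integer :: i,j,K,N,M
  real :: a(N,M), b(N,M), c(N,M)
  K=MOD(N,4)

  ! pre-conditioning part of loop: leftover iterations handled up front
  do i=1,K
    do j=1,M
      a(j,i) = b(j,i) + c(j,i)
    end do
  end do

  ! main unrolled part of loop now begins at i = K+1, which may be misaligned
  do i=K+1,N,4
    do j=1,M
      a(j,i)   = b(j,i)   + c(j,i)
      a(j,i+1) = b(j,i+1) + c(j,i+1)
      a(j,i+2) = b(j,i+2) + c(j,i+2)
      a(j,i+3) = b(j,i+3) + c(j,i+3)
    end do
  end do
end subroutine loop_unroll_precondition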

Cache Blocking

Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. The application controls the behavior by dividing a large array into smaller blocks of memory so a thread can make repeated accesses to the data while the data is still in cache.

For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed in smaller portions of the total image or video frame. Compilers often use the same technique by grouping related blocks of instructions close together so they execute from the L2 cache.

The effectiveness of the cache blocking technique depends on the data block size, the processor cache size, and the number of times the data is reused. Cache sizes vary from processor to processor; an application can detect the data cache size using the CPUID instruction and dynamically adjust the cache blocking tile size to maximize performance. As a general rule, cache block sizes should target approximately one-half to three-quarters of the physical cache size. For systems with Hyper-Threading Technology (HT Technology) enabled, target one-quarter to one-half of the physical cache size.
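As a rough illustration of this sizing rule, the following sketch estimates a blocking factor for a two-dimensional block. The cache size, element size, and number of arrays used here are assumptions chosen for the sake of the example, not values taken from this section.

Example

program block_size_estimate
  implicit none
  integer, parameter :: cache_bytes   = 256*1024   ! assumed cache size (256 KB)
  integer, parameter :: element_bytes = 4          ! single-precision REAL
  integer, parameter :: num_arrays    = 2          ! arrays touched inside the block
  integer :: blk

  ! Choose blk so that num_arrays blocks of blk*blk elements fill roughly
  ! one-half of the cache, per the guideline above.
  blk = int(sqrt(real(cache_bytes/2) / real(num_arrays*element_bytes)))
  print *, 'suggested blocking factor:', blk   ! prints 128 for these numbers
end program block_size_estimate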

Cache blocking is applied in HLO and is used on large arrays that cannot all fit into cache simultaneously. It is one way of pulling a small region of data into cache and using that data as effectively as possible before it is replaced by new data from memory.

Example

subroutine cache_blocking_before(a,b,N)
  implicit none
  integer :: i,j,k,N
  real :: a(N,N,N), b(N,N,N)
  N=1000
  do i = 1, N
    do j = 1, N
      do k = 1, N
        a(i,j,k) = a(i,j,k) + b(i,j,k)
      end do
    end do
  end do
end subroutine cache_blocking_before

 

subroutine cache_blocking_after(a,b,N)
  implicit none
  integer :: i,j,k,u,v,N
  real :: a(N,N,N), b(N,N,N)
  N=1000
  do v = 1, N, 20
    do u = 1, N, 20
      do k = v, v+19
        do j = u, u+19
          do i = 1, N
            a(i,j,k) = a(i,j,k) + b(i,j,k)
          end do
        end do
      end do
    end do
  end do
end subroutine cache_blocking_after

The cache block size is set to 20. The goal is to read a block of data into cache, do as much computation with it as possible, and only then load a new block from memory. A block of a and a block of b are in cache at the same time; do as much work with this data as you can before incrementing to the next block.

Blocking factors will be different for different architectures, so determine them experimentally; for example, different blocking factors would be required for single-precision versus double-precision data. Typically, the overall impact on performance can be significant.
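For example, the blocked loop above hard-codes the factor 20 and relies on N being an exact multiple of the block size. The following sketch (not part of the original example) makes the blocking factor a named constant that can be tuned per architecture and uses MIN to clip partial blocks at the array boundary:

Example

subroutine cache_blocking_param(a,b,N)
  implicit none
  integer, parameter :: blk = 20   ! blocking factor; tune per architecture
  integer :: i,j,k,u,v,N
  real :: a(N,N,N), b(N,N,N)
  do v = 1, N, blk
    do u = 1, N, blk
      do k = v, min(v+blk-1, N)     ! clip the block at the array boundary
        do j = u, min(u+blk-1, N)
          do i = 1, N
            a(i,j,k) = a(i,j,k) + b(i,j,k)
          end do
        end do
      end do
    end do
  end do
end subroutine cache_blocking_param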

Loop Distribution

Loop distribution is a high-level loop transformation that splits a large loop into two smaller loops. It can be useful in cases where optimizations like software pipelining (SWP) or vectorization cannot take place due to excessive register usage. By splitting a loop into smaller segments, it may be possible for each of the smaller loops, or at least one of them, to be software-pipelined or vectorized. An example is as follows:

Example

subroutine loop_distribution_before(a,b,c,x,y,z,N)
  implicit none
  integer :: i,N
  real :: a(N), b(N), c(N), x(N), y(N), z(N)
  N=1024
  do i = 1, N
    a(i) = a(i) + i
    b(i) = b(i) + i
    c(i) = c(i) + i
    x(i) = x(i) + i
    y(i) = y(i) + i
    z(i) = z(i) + i
  end do
end subroutine loop_distribution_before

 

subroutine loop_distribution_after(a,b,c,x,y,z,N)
  implicit none
  integer :: i,N
  real :: a(N), b(N), c(N), x(N), y(N), z(N)
  N=1024
  do i = 1, N
    a(i) = a(i) + i
    b(i) = b(i) + i
    c(i) = c(i) + i
  end do
  do i = 1, N
    x(i) = x(i) + i
    y(i) = y(i) + i
    z(i) = z(i) + i
  end do
end subroutine loop_distribution_after

The following directive can be used to suggest loop distribution to the compiler:

Example

!DEC$ distribute point

When the directive is placed immediately before a loop, the compiler attempts to distribute that loop based on an internal heuristic. The following example uses the directive outside the loop:

Example

subroutine loop_distribution_pragma1(a,b,c,x,y,z,N)
  implicit none
  integer :: i,N
  real :: a(N), b(N), c(N), x(N), y(N), z(N)
  N=1024
!DEC$ distribute point
  do i = 1, N
    a(i) = a(i) + i
    b(i) = b(i) + i
    c(i) = c(i) + i
    x(i) = x(i) + i
    y(i) = y(i) + i
    z(i) = z(i) + i
  end do
end subroutine loop_distribution_pragma1

When the directive is placed within a loop, the compiler attempts to distribute the loop at that point, ignoring all loop-carried dependencies. The following example uses the directive within the loop to indicate precisely where the split should take place:

Example

subroutine loop_distribution_pragma2(a,b,c,x,y,z,N)
  implicit none
  integer :: i,N
  real :: a(N), b(N), c(N), x(N), y(N), z(N)
  N=1024
  do i = 1, N
    a(i) = a(i) + i
    b(i) = b(i) + i
    c(i) = c(i) + i
!DEC$ distribute point
    x(i) = x(i) + i
    y(i) = y(i) + i
    z(i) = z(i) + i
  end do
end subroutine loop_distribution_pragma2

Load Pair (Itanium® Compiler)

Load pairs (ldfp) are instructions that load two contiguous single-precision or double-precision values from memory in a single operation. Load pairs can significantly improve performance.
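Whether the compiler emits load pairs depends on its analysis of alignment and data layout. As a rough illustration (this subroutine is not from the original text, and ldfp generation is not guaranteed), a loop unrolled by two that reads adjacent, contiguous elements is a natural candidate:

Example

subroutine sum_pairs(b,s,n)
  implicit none
  integer :: i,n
  real(8) :: b(n), s
  s = 0.0d0
  ! assumes n is even; each iteration reads the adjacent pair b(i), b(i+1),
  ! which the compiler may combine into a single load-pair instruction
  do i = 1, n-1, 2
    s = s + b(i) + b(i+1)
  end do
end subroutine sum_pairs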

Manual Loop Transformations

There might be cases where manual loop transformations are acceptable or even preferred. As a general rule, however, you should let the compiler transform loops for you; transform loops manually only as a last resort, when you are trying to gain a performance increase that the compiler does not provide.

Manual loop transformations have many disadvantages: the resulting code is harder to read and maintain, and a transformation tuned for one architecture or compiler version may lose its benefit, or even hurt performance, on another.

The HLO report can give you an idea of which loop transformations the compiler has already applied.

Experimentation is a critical component of manually transforming loops. You might try to apply a loop transformation that the compiler ignored; sometimes it is also beneficial to apply a transformation manually even though the compiler already applies it with -O3 (Linux*) or /O3 (Windows*).