Understanding Run-time Performance

The information in this topic assumes that you are using a performance optimization methodology and have analyzed the application type you are optimizing.

After profiling your application to determine where best to spend your time, attempt to discover what optimizations and what limitations have been imposed by the compiler. Use the compiler reports to determine what to try next.

Depending on what you discover from the reports, you may be able to help the compiler through options, directives, and slight code modifications that take advantage of key architectural features to achieve the best performance.

The compiler reports describe which actions have been taken and which actions cannot be taken, based on the assumptions made by the compiler. Experimenting with options and directives lets you test your understanding of those assumptions and arrive at a new optimization strategy or technique.

Helping the Compiler

You can help the compiler in some important ways:

Use the Math Kernel Library (MKL) or Fortran 90 intrinsics instead of user code.

See Applying Optimization Strategies for other suggestions.
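As a minimal sketch of replacing user code with an intrinsic, the following module computes a dot product twice: once with a hand-written loop and once with the standard DOT_PRODUCT intrinsic. The module and function names (intrinsic_demo, dot_by_hand, dot_by_intrinsic) are hypothetical and chosen only for illustration.

```fortran
module intrinsic_demo
  implicit none
contains

  ! Hand-written dot product: the kind of user code an intrinsic can replace.
  function dot_by_hand(x, y) result(s)
    real, intent(in) :: x(:), y(:)
    real :: s
    integer :: i
    s = 0.0
    do i = 1, size(x)
      s = s + x(i) * y(i)
    end do
  end function dot_by_hand

  ! The same computation expressed with the standard DOT_PRODUCT intrinsic.
  function dot_by_intrinsic(x, y) result(s)
    real, intent(in) :: x(:), y(:)
    real :: s
    s = dot_product(x, y)
  end function dot_by_intrinsic

end module intrinsic_demo
```

Both functions should return the same value up to rounding. The intrinsic form states the intent directly, which leaves the choice of implementation to the compiler.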

Memory Aliasing on Itanium®-based Systems

Memory aliasing is the single largest issue affecting the optimizations in the Intel® compiler for Itanium®-based systems. Memory aliasing occurs when a given memory location can be written through more than one pointer. The compiler is careful not to optimize too aggressively in these cases; if it did, unpredictable behavior (for example, incorrect results or abnormal termination) could result.

Since the compiler usually optimizes on a module-by-module, function-by-function basis, the compiler does not have an overall perspective with respect to variable use for global variables or variables that are passed into a function; therefore, the compiler usually assumes that any pointers passed into a function are likely to be aliased. The compiler makes this assumption even for pointers you know are not aliased. This behavior means that perfectly safe loops do not get pipelined or vectorized, and performance suffers.
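The following sketch illustrates why the compiler has to be conservative. It assumes a hypothetical module alias_demo with a hypothetical subroutine shift_add that takes two pointer dummy arguments. Inside the subroutine the compiler cannot tell whether dst and src overlap, and the result of the loop depends on the answer.

```fortran
module alias_demo
  implicit none
contains

  ! Stores src(i) + 1.0 into dst(i) for i = 1..n.
  ! Both dummies are pointers, so they may refer to overlapping storage.
  subroutine shift_add(dst, src, n)
    real, pointer :: dst(:), src(:)
    integer, intent(in) :: n
    integer :: i
    do i = 1, n
      ! If dst(i) is the same location as src(i+1), this iteration
      ! writes the value that the next iteration reads.
      dst(i) = src(i) + 1.0
    end do
  end subroutine shift_add

end module alias_demo
```

If the caller associates src with x(1:4) and dst with x(2:5) of the same zero-filled target array x, each iteration reads the value written by the previous one, and x(2:5) ends up as 1, 2, 3, 4. If the two pointers refer to separate zero-filled arrays, every element of dst is simply 1. A loop reordered or vectorized under the no-overlap assumption would give the wrong answer in the overlapping case.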

There are several ways to instruct the compiler that pointers are not aliased:

Non-Unit Stride Memory Access

Another issue that can have considerable impact on performance is accessing memory in a non-unit stride fashion. This means that as your inner loop increments consecutively, you access memory at non-adjacent locations. For example, consider the following matrix multiplication code:

Example

! Non-Unit Stride Memory Access
subroutine non_unit_stride_memory_access(a,b,c, NUM)
  implicit none
  integer :: i,j,k,NUM
  real :: a(NUM,NUM), b(NUM,NUM), c(NUM,NUM)
! loop before loop interchange
  do i=1,NUM
    do j=1,NUM
      do k=1,NUM
        c(j,i) = c(j,i) + a(j,k) * b(k,i)
      end do
    end do
  end do
end subroutine non_unit_stride_memory_access

Notice that b(k,i) accesses consecutive memory locations as the innermost k loop increments, and c(j,i) stays at a single location, because Fortran stores arrays in column-major order (the first subscript varies fastest). The a array, however, is not accessed with unit stride. When the loop reads a(j,1) and the k loop then increments to read a(j,2), the access has jumped ahead by NUM memory locations, skipping the remaining elements of column 1 and the first j-1 elements of column 2.

Loop interchange, a form of loop transformation, helps to address this problem. While the compiler is capable of doing loop interchange automatically, it does not always recognize the opportunity.

The memory access pattern for the example code listed above is illustrated in the following figure:

Assume you modify the example code listed above by making the following changes to introduce loop interchange:

Example

subroutine unit_stride_memory_access(a,b,c, NUM)
  implicit none
  integer :: i,j,k,NUM
  real :: a(NUM,NUM), b(NUM,NUM), c(NUM,NUM)
! loop after interchange
  do i=1,NUM
    do k=1,NUM
      do j=1,NUM
        c(j,i) = c(j,i) + a(j,k) * b(k,i)
      end do
    end do
  end do
end subroutine unit_stride_memory_access
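One way to confirm that the interchange changes only the access pattern and not the result is to run both loop orders on the same data and compare. The module below is a rough sketch under stated assumptions: the names interchange_check, multiply_ijk, multiply_ikj, and compare_orders are hypothetical, and timing uses the standard CPU_TIME intrinsic.

```fortran
module interchange_check
  implicit none
contains

  ! Original order (i, j, k): a(j,k) is read with stride n in the inner loop.
  subroutine multiply_ijk(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n,n), b(n,n)
    real, intent(inout) :: c(n,n)
    integer :: i, j, k
    do i = 1, n
      do j = 1, n
        do k = 1, n
          c(j,i) = c(j,i) + a(j,k) * b(k,i)
        end do
      end do
    end do
  end subroutine multiply_ijk

  ! Interchanged order (i, k, j): c(j,i) and a(j,k) are unit stride in j.
  subroutine multiply_ikj(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n,n), b(n,n)
    real, intent(inout) :: c(n,n)
    integer :: i, j, k
    do i = 1, n
      do k = 1, n
        do j = 1, n
          c(j,i) = c(j,i) + a(j,k) * b(k,i)
        end do
      end do
    end do
  end subroutine multiply_ikj

  ! Runs both orders on the same random inputs. Returns the largest
  ! absolute difference between the results and the CPU time of each run.
  subroutine compare_orders(n, max_diff, t_ijk, t_ikj)
    integer, intent(in) :: n
    real, intent(out) :: max_diff, t_ijk, t_ikj
    real, allocatable :: a(:,:), b(:,:), c1(:,:), c2(:,:)
    real :: t0, t1
    allocate(a(n,n), b(n,n), c1(n,n), c2(n,n))
    call random_number(a)
    call random_number(b)
    c1 = 0.0
    c2 = 0.0
    call cpu_time(t0)
    call multiply_ijk(a, b, c1, n)
    call cpu_time(t1)
    t_ijk = t1 - t0
    call cpu_time(t0)
    call multiply_ikj(a, b, c2, n)
    call cpu_time(t1)
    t_ikj = t1 - t0
    max_diff = maxval(abs(c1 - c2))
  end subroutine compare_orders

end module interchange_check
```

Calling compare_orders from a small driver should report a maximum difference at or near zero, since both orders accumulate each c(j,i) over k in the same sequence. The measured times depend on the compiler, the options used, the matrix size, and the hardware, and an optimizing compiler may already perform the interchange on the first version, so the two times can be close.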

After the loop interchange, the memory access pattern might look like the following figure: