Loop independence is important since loops that are independent can be parallelized. Independent loops can be parallelized in a number of ways, from the course-grained parallelism of OpenMP*, to fine-grained Instruction Level Parallelism (ILP) of vectorization and software pipelining.
Loops are considered independent when the computation of iteration Y of a loop can be done independently of the computation of iteration X. In other words, if iteration 1 of a loop can be computed and iteration 2 simultaneously could be computed without using any result from iteration 1, then the loops are independent.
Occasionally, you can determine if a loop is independent by comparing results from the output of the loop with results from the same loop written with a decrementing index counter.
For example, the loop shown in example 1 might be independent if the code in example 2 generates the same result .
Example |
---|
subroutine loop_independent_A (a,b,MAX) implicit none integer :: j, MAX, a(MAX), b(MAX) do j=0, MAX a(j) = b(j) end do end subroutine loop_independent_A subroutine loop_independent_B (a,b,MAX) implicit none integer :: j, MAX, a(MAX), b(MAX) do j=MAX, 0, -1 a(j) = b(j) end do end subroutine loop_independent_B |
When loops are dependent, improving loop performance becomes much more difficult. Loops can be dependent in several, general ways.
The following sections illustrate the different loop dependencies.
Cross-iteration flow dependence is created when variables are read then written in different iterations, as shown in the following example:
Example |
---|
subroutine flow_dependence (A,MAX) implicit none integer :: J,MAX real :: A(MAX) do j=1, MAX A(J)=A(J-1) end do end subroutine flow_dependence |
The above example is equivalent to the following lines for the first few iterations:
Sample iterations |
---|
A[1]=A[0]; A[2]=A[1]; |
Recurrence relations feed information forward from one iteration to the next.
Example |
---|
subroutine time_stepping_loops (a,b,MAX) implicit none integer :: J,MAX real :: a(MAX), b(MAX) do j=1, MAX a(j) = a(j-1) + b(j) end do end subroutine time_stepping_loops |
Most recurrences cannot be made fully parallel. Instead, look for a loop further out or further in to parallelize. You might be able to get more performance gains through unrolling.
Cross-iteration anti-dependence is created when variables are written then read in different iterations, as shown in the following example:
Example |
---|
subroutine anti_dependence (A,MAX) implicit none integer :: J,MAX real :: a(MAX), b(MAX) do J=1, MAX A(J)=A(J+1) end do end subroutine anti_dependence |
The above example is equivalent to the following lines for the first few iterations:
Sample interations |
---|
A[1]=A[2]; A[2]=A[3]; |
Cross-iteration output dependence is where variables are written then rewritten in a different iteration. The following example illustrates this type of dependency:
Example |
---|
subroutine output_dependence (A,B,C,MAX) implicit none integer :: J,MAX real :: a(MAX), b(MAX), c(MAX) do J=1, MAX A(J)=B(J) A(J+1)=C(J) end do end subroutine output_dependence |
The above example is equivalent to the following lines for the first few iterations:
Sample interations |
---|
A[1]=B[1]; A[2]=C[1]; A[2]=B[2]; A[3]=C[2]; |
The Intel® compiler can successfully vectorize or software pipeline (SWP) most loops containing reductions on simple math operators like multiplication (*), addition (+), subtraction (-), and division (/). Reductions collapse array data to scalar data by using associative operations:
Example |
---|
subroutine reduction (sum,c,MAX) implicit none integer :: j,MAX real :: sum, c(MAX) do J=0, MAX sum = sum + c(j) end do end subroutine reduction |
The compiler might occasionally misidentify a reduction and report flow-, anti-, output-dependencies or sometimes loop-carried memory-dependency-edges; in such cases, the compiler will not vectorize or SWP the loop.