Loop Independence

Loop independence is important since loops that are independent can be parallelized. Independent loops can be parallelized in a number of ways, from the course-grained parallelism of OpenMP*, to fine-grained Instruction Level Parallelism (ILP) of vectorization and software pipelining.

Loops are considered independent when the computation of iteration Y of a loop can be done independently of the computation of iteration X. In other words, if iteration 1 of a loop can be computed and iteration 2 simultaneously could be computed without using any result from iteration 1, then the loops are independent.

Occasionally, you can determine if a loop is independent by comparing results from the output of the loop with results from the same loop written with a decrementing index counter.

For example, the loop shown in example 1 might be independent if the code in example 2 generates the same result .

Example

subroutine loop_independent_A (a,b,MAX)

  implicit none

  integer :: j, MAX, a(MAX), b(MAX)

  do j=0, MAX

    a(j) = b(j)

  end do

end subroutine loop_independent_A

subroutine loop_independent_B (a,b,MAX)

  implicit none

  integer :: j, MAX, a(MAX), b(MAX)

  do j=MAX, 0, -1

    a(j) = b(j)

  end do

end subroutine loop_independent_B

When loops are dependent, improving loop performance becomes much more difficult. Loops can be dependent in several, general ways.

The following sections illustrate the different loop dependencies.

Flow Dependency - Read After Write

Cross-iteration flow dependence is created when variables are read then written in different iterations, as shown in the following example:

Example

subroutine flow_dependence (A,MAX)

  implicit none

  integer :: J,MAX

  real :: A(MAX)

  do j=1, MAX

    A(J)=A(J-1)

  end do

end subroutine flow_dependence

The above example is equivalent to the following lines for the first few iterations:

Sample iterations

A[1]=A[0];

A[2]=A[1];

Recurrence relations feed information forward from one iteration to the next.

Example

subroutine time_stepping_loops (a,b,MAX)

  implicit none

  integer :: J,MAX

  real :: a(MAX), b(MAX)

  do j=1, MAX

    a(j) = a(j-1) + b(j)

  end do

end subroutine time_stepping_loops

Most recurrences cannot be made fully parallel. Instead, look for a loop further out or further in to parallelize. You might be able to get more performance gains through unrolling.

Anti Dependency - Write After Read

Cross-iteration anti-dependence is created when variables are written then read in different iterations, as shown in the following example:

Example

subroutine anti_dependence (A,MAX)

  implicit none

  integer :: J,MAX

  real :: a(MAX), b(MAX)

  do J=1, MAX

    A(J)=A(J+1)

  end do

end subroutine anti_dependence

The above example is equivalent to the following lines for the first few iterations:

Sample interations

A[1]=A[2];

A[2]=A[3];

Output Dependency - Write After Write

Cross-iteration output dependence is where variables are written then rewritten in a different iteration. The following example illustrates this type of dependency:

Example

subroutine output_dependence (A,B,C,MAX)

  implicit none

  integer :: J,MAX

  real :: a(MAX), b(MAX), c(MAX)

  do J=1, MAX

    A(J)=B(J)

    A(J+1)=C(J)

  end do

end subroutine output_dependence

The above example is equivalent to the following lines for the first few iterations:

Sample interations

A[1]=B[1];

A[2]=C[1];

A[2]=B[2];

A[3]=C[2];

Reductions

The Intel® compiler can successfully vectorize or software pipeline (SWP) most loops containing reductions on simple math operators like multiplication (*), addition (+), subtraction (-), and division (/). Reductions collapse array data to scalar data by using associative operations:

Example

subroutine reduction (sum,c,MAX)

  implicit none

  integer :: j,MAX

  real :: sum, c(MAX)

  do J=0, MAX

    sum = sum + c(j)

  end do

end subroutine reduction

The compiler might occasionally misidentify a reduction and report flow-, anti-, output-dependencies or sometimes loop-carried memory-dependency-edges; in such cases, the compiler will not vectorize or SWP the loop.