Strip-mining and Cleanup

Strip-mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as a means of improving memory performance. By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure in two ways:

First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine.

The compiler automatically strip-mines your loop and generates a cleanup loop. For example, assume the compiler attempts to strip-mine the following loop:

Example1: Before Vectorization

i = 1

do while (i<=n)

  a(i) = b(i) + c(i) ! Original loop code

  i = i + 1

end do

The compiler might handle the strip mining and loop cleaning by restructuring the loop in the following manner:

Example 2: After Vectorization

!The vectorizer generates the following two loops

i = 1

do while (i < (n - mod(n,4)))

! Vector strip-mined loop.

  a(i:i+3) = b(i:i+3) + c(i:i+3)

  i = i + 4

end do

do while (i <= n)

  a(i) = b(i) + c(i)     !Scalar clean-up loop

  i = i + 1

end do

Loop Blocking

It is possible to treat loop blocking as strip-mining in two or more dimensions. Loop blocking is a useful technique for memory performance optimization. The main purpose of loop blocking is to eliminate as many cache misses as possible. This technique transforms the memory domain into smaller chunks rather than sequentially traversing through the entire memory domain. Each chunk should be small enough to fit all the data for a given computation into the cache, thereby maximizing data reuse.

Consider the following example. The two-dimensional array A is referenced in the j (column) direction and then in the i (row) direction (column-major order); array B is referenced in the opposite manner (row-major order). Assume the memory layout is in column-major order; therefore, the access strides of array A and B for the code  would be 1 and MAX,  respectively. In example 2: BS = block_size; MAX must be evenly divisible by BS.

Consider the following loop example code:

Example: Original loop

REAL A(MAX,MAX), B(MAX,MAX)

DO I =1, MAX

  DO J = 1, MAX

    A(I,J) = A(I,J) + B(J,I)

  ENDDO

ENDDO

The arrays could be blocked into smaller chunks so that the total combined size of the two blocked chunks is smaller than the cache size, which can improve data reuse. One possible way of doing this is demonstrated below:

Example: Transformed Loop after blocking

REAL A(MAX,MAX), B(MAX,MAX)

DO I =1, MAX, BS

  DO J = 1, MAX, BS

   DO II = I, I+MAX, BS-1

    DO J = J, J+MAX, BS-1

     A(II,JJ) = A(II,JJ) + B(JJ,II

    ENDDO

   ENDDO

  ENDDO

ENDDO