Parallelism Overview

This section discusses the three major features of parallel programming supported by the Intel® compiler:

  • OpenMP*

  • Auto-parallelization

  • Auto-vectorization

Each of these features contributes to application performance depending on the number of processors, the target architecture (IA-32 or Itanium® architecture), and the nature of the application. The features can also be combined to improve application performance further.

Parallel programming can be explicit, that is, defined by a programmer using OpenMP directives. Parallel programming can also be implicit, that is, detected automatically by the compiler. Implicit parallelism takes the form of auto-parallelization of outermost loops, auto-vectorization of innermost loops, or both.

Parallelism defined with OpenMP and auto-parallelization directives is based on thread-level parallelism (TLP). Parallelism defined with auto-vectorization techniques is based on instruction-level parallelism (ILP).

The Intel® compiler supports OpenMP and auto-parallelization for IA-32, Intel EM64T, and Itanium architectures on multiprocessor systems, dual-core processor systems, and systems with Hyper-Threading Technology (HT Technology) enabled.

Auto-vectorization is supported on the families of the Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors. To enhance the compilation of code with auto-vectorization, users can also add vectorizer directives to their program. A closely related technique, software pipelining (SWP), is available on Itanium-based systems.

The following table summarizes the different ways in which parallelism can be exploited with the Intel® compiler.

Parallelism                                 Description
------------------------------------------  ------------------------------------------
Explicit                                    Parallelism programmed by the user

OpenMP* (thread-level parallelism);         Supported on:
IA-32 and Itanium® architectures              • IA-32, Intel EM64T, and Itanium-based
                                                multiprocessor systems and dual-core
                                                processors
                                              • Hyper-Threading Technology-enabled
                                                systems

Implicit                                    Parallelism generated by the compiler and
                                            by user-supplied hints

Auto-parallelization (thread-level          Supported on:
parallelism) of outermost loops;              • IA-32, Intel EM64T, and Itanium-based
IA-32 and Itanium architectures                 multiprocessor systems and dual-core
                                                processors
                                              • Hyper-Threading Technology-enabled
                                                systems

Auto-vectorization (instruction-level       Supported on:
parallelism) of innermost loops;              • Pentium®, Pentium with MMX™
IA-32 and Itanium architectures                 Technology, Pentium II, Pentium III,
                                                and Pentium 4 processors

Parallel Program Development

OpenMP

The Intel® compiler supports the OpenMP* Fortran version 2.5 API specification available from the OpenMP* (http://www.openmp.org) web site. The OpenMP directives relieve the user from having to deal with the low-level details of iteration space partitioning, data sharing, and thread scheduling and synchronization.

Auto-Parallelization

The auto-parallelization feature of the Intel® compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines the loops that are good worksharing candidates, performs the dataflow analysis to verify correct parallel execution, and partitions the data for threaded code generation, as is needed in programming with OpenMP directives. Applications built with OpenMP or auto-parallelization gain performance from shared memory on multiprocessor and dual-core systems and on IA-32 processors with Hyper-Threading Technology.
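The distinction between a good worksharing candidate and a loop that dataflow analysis must reject can be illustrated with a small sketch. The program name, array names, and sizes below are hypothetical, and whether a given compiler version actually parallelizes the first loop depends on its own analysis and thresholds.

```fortran
PROGRAM AUTOPAR_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100000   ! hypothetical problem size
  REAL :: A(N), B(N), C(N)
  INTEGER :: I

  ! Set up input data
  DO I = 1, N
    B(I) = REAL(I)
    C(I) = 2.0
  ENDDO

  ! Candidate: each iteration writes only A(I) and reads only B(I) and C(I),
  ! so the iterations are independent and can be divided among threads.
  DO I = 1, N
    A(I) = B(I) * C(I)
  ENDDO

  ! Not a candidate as written: iteration I reads A(I-1), which the previous
  ! iteration wrote (a loop-carried dependence), so the iterations must run
  ! in order.
  DO I = 2, N
    A(I) = A(I-1) + B(I)
  ENDDO

  PRINT *, 'A(1) =', A(1), ' A(2) =', A(2)
END PROGRAM AUTOPAR_SKETCH
```

The program produces the same results whether or not the first loop is threaded, which is what "semantically equivalent" means here.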

Auto-Vectorization

Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8 or up to 16 elements in one operation, depending on the data type. In some cases auto-parallelization and vectorization can be combined for better performance results. For example, in the code below, thread-level parallelism can be exploited in the outermost loop, while instruction-level parallelism can be exploited in the innermost loop.

Example

DO I = 1, 100     ! Execute groups of iterations in different threads (TLP)
  DO J = 1, 32    ! Execute in SIMD style with multimedia extension (ILP)
     A(J,I) = A(J,I) + 1
  ENDDO
ENDDO

Auto-vectorization can help improve the performance of an application that runs on systems based on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors.
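The element counts of 2, 4, 8, or 16 per operation follow from dividing the SIMD register width by the element size. The sketch below works through that arithmetic under the assumption of a 128-bit register (the width of the SSE registers on Pentium III and Pentium 4 processors) and then shows a unit-stride loop of the kind a vectorizer looks for. The program and array names are hypothetical.

```fortran
PROGRAM VEC_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: REG_BITS = 128   ! assumption: 128-bit SIMD register
  INTEGER, PARAMETER :: N = 32
  REAL :: A(N), B(N)
  INTEGER :: J

  ! Elements processed per SIMD operation = register bits / element bits
  PRINT *, '64-bit elements per operation:', REG_BITS / 64
  PRINT *, '32-bit elements per operation:', REG_BITS / 32
  PRINT *, '16-bit elements per operation:', REG_BITS / 16
  PRINT *, ' 8-bit elements per operation:', REG_BITS / 8

  ! Set up input data
  DO J = 1, N
    B(J) = REAL(J)
  ENDDO

  ! Unit-stride loop with independent iterations: neighboring elements
  ! can be loaded, incremented, and stored together in one SIMD operation.
  DO J = 1, N
    A(J) = B(J) + 1.0
  ENDDO

  PRINT *, 'A(1) =', A(1), ' A(N) =', A(N)
END PROGRAM VEC_SKETCH
```

On processors with narrower registers, such as the 64-bit MMX registers, the same division gives proportionally fewer elements per operation.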

The following tables summarize the options that enable auto-vectorization, auto-parallelization, and OpenMP support.

Auto-vectorization: IA-32 only

Windows*               Linux*               Description
---------------------  -------------------  --------------------------------------------
/Qx                    -x                   Generates specialized code to run
                                            exclusively on processors with the
                                            extensions specified by {K|W|N|B|P}.
                                            P is the only valid value on Mac OS*
                                            systems.

/Qax                   -ax                  Generates, in a single binary, code
                                            specialized to the extensions specified by
                                            {K|W|N|B|P} and also generic IA-32 code.
                                            P is the only valid value on Mac OS
                                            systems. The generic code is usually
                                            slower.

/Qvec-report           -vec-report          Controls the diagnostic messages from the
                                            vectorizer; see the subsection that
                                            follows the table.

For details on each option, see the corresponding topic in Compiler Options.

Auto-parallelization: IA-32 and Itanium® architectures

Windows*               Linux*               Description
---------------------  -------------------  --------------------------------------------
/Qparallel             -parallel            Enables the auto-parallelizer to generate
                                            multithreaded code for loops that can be
                                            safely executed in parallel.
                                            Intel® Itanium®-based systems only:
                                            implies -opt-mem-bandwidth1 (Linux) or
                                            /Qopt-mem-bandwidth1 (Windows).

/Qpar-threshold[:n]    -par-threshold{n}    Sets a threshold for the
                                            auto-parallelization of loops based on the
                                            probability of profitable execution of the
                                            loop in parallel, n=0 to 100.

/Qpar-report           -par-report          Controls the auto-parallelizer's
                                            diagnostic levels.

For details on each option, see the corresponding topic in Compiler Options.

OpenMP: IA-32 and Itanium® architectures

Windows*               Linux*               Description
---------------------  -------------------  --------------------------------------------
/Qopenmp               -openmp              Enables the parallelizer to generate
                                            multithreaded code based on the OpenMP
                                            directives.
                                            Intel® Itanium®-based systems only:
                                            implies -opt-mem-bandwidth1 (Linux) or
                                            /Qopt-mem-bandwidth1 (Windows).

/Qopenmp-report        -openmp-report       Controls the OpenMP parallelizer's
                                            diagnostic levels.

/Qopenmp-stubs         -openmp-stubs        Enables compilation of OpenMP programs in
                                            sequential mode. The OpenMP directives are
                                            ignored and a stub OpenMP library is
                                            linked.

For details on each option, see the corresponding topic in Compiler Options.

Note

When -openmp (Linux) or /Qopenmp (Windows) is specified together with -parallel (Linux) or /Qparallel (Windows) on the command line, the -parallel or /Qparallel option is applied only in routines that do not contain OpenMP directives. For routines that contain OpenMP directives, only the -openmp or /Qopenmp option is applied.
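To make the routine-level rule concrete, the following sketch places one routine with an OpenMP directive next to one without. The module and routine names are hypothetical. Under the rule stated in the note, a build with both options would treat the first routine according to its directive and would consider only the second routine for auto-parallelization.

```fortran
MODULE MIXED_SKETCH
  IMPLICIT NONE
CONTAINS

  ! Contains an OpenMP directive: when both options are given,
  ! only the OpenMP option applies to this routine.
  SUBROUTINE SCALE_OMP(X, N)
    INTEGER, INTENT(IN) :: N
    REAL, INTENT(INOUT) :: X(N)
    INTEGER :: I
!$OMP PARALLEL DO
    DO I = 1, N
      X(I) = 2.0 * X(I)
    ENDDO
!$OMP END PARALLEL DO
  END SUBROUTINE SCALE_OMP

  ! Contains no OpenMP directives: this routine remains a candidate
  ! for the auto-parallelizer.
  SUBROUTINE SHIFT_PLAIN(X, N)
    INTEGER, INTENT(IN) :: N
    REAL, INTENT(INOUT) :: X(N)
    INTEGER :: I
    DO I = 1, N
      X(I) = X(I) + 1.0
    ENDDO
  END SUBROUTINE SHIFT_PLAIN

END MODULE MIXED_SKETCH

PROGRAM MIXED_MAIN
  USE MIXED_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000   ! hypothetical problem size
  REAL :: X(N)

  X = 1.0
  CALL SCALE_OMP(X, N)
  CALL SHIFT_PLAIN(X, N)
  PRINT *, 'X(1) =', X(1), ' X(N) =', X(N)
END PROGRAM MIXED_MAIN
```

Compiled without either option, the directive lines are ordinary comments and the program runs sequentially with the same results.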

With the right choice of options, you can:

Additionally, with the relatively small effort of adding OpenMP directives to existing code, you can transform a sequential program into a parallel program. The following example shows OpenMP directives within the code.

Example

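!$OMP PARALLEL PRIVATE(NUM), SHARED(X,A,B,C)
! Defines a parallel region
!$OMP DO
! Divides the iterations of the following DO loop among the threads
! of the region. (The combined !$OMP PARALLEL DO directive specifies
! a parallel region that implicitly contains a single DO directive.)
DO I = 1, 1000
  NUM = FOO(B(I), C(I))
  X(I) = BAR(A(I), NUM)
! Assume FOO and BAR have no other effect
ENDDO
!$OMP END DO
!$OMP END PARALLEL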
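The fragment above omits declarations and the definitions of FOO and BAR. A complete program along the same lines might look like the sketch below. The bodies of FOO and BAR are hypothetical stand-ins with no side effects, the array contents are arbitrary, and the combined PARALLEL DO directive is used with the `!$OMP` sentinel defined by the OpenMP Fortran specification.

```fortran
MODULE WORK_SKETCH
  IMPLICIT NONE
CONTAINS

  ! Hypothetical stand-in for FOO: no side effects
  REAL FUNCTION FOO(P, Q)
    REAL, INTENT(IN) :: P, Q
    FOO = P + Q
  END FUNCTION FOO

  ! Hypothetical stand-in for BAR: no side effects
  REAL FUNCTION BAR(P, Q)
    REAL, INTENT(IN) :: P, Q
    BAR = P * Q
  END FUNCTION BAR

END MODULE WORK_SKETCH

PROGRAM OMP_SKETCH
  USE WORK_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  REAL :: X(N), A(N), B(N), C(N), NUM
  INTEGER :: I

  ! Arbitrary input data for illustration
  A = 1.0
  B = 2.0
  C = 3.0

  ! NUM is a per-iteration temporary, so each thread needs its own copy.
  ! The arrays are shared; each iteration touches only element I.
!$OMP PARALLEL DO PRIVATE(NUM) SHARED(X, A, B, C)
  DO I = 1, N
    NUM = FOO(B(I), C(I))
    X(I) = BAR(A(I), NUM)
  ENDDO
!$OMP END PARALLEL DO

  PRINT *, 'X(1) =', X(1), ' X(N) =', X(N)
END PROGRAM OMP_SKETCH
```

Making NUM private is the essential step: if it were shared, threads could overwrite each other's value between the two statements in the loop body.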

See examples of the auto-parallelization and auto-vectorization directives in the following topics.