This section discusses the three major features of parallel programming supported by the Intel® compiler:
Parallelization with OpenMP*
Auto-parallelization
Auto-vectorization
How much each of these features contributes to application performance depends on the number of processors, the target architecture (IA-32 or Itanium® architecture), and the nature of the application. The features can also be combined.
Parallel programming can be explicit, that is, defined by a programmer using OpenMP directives. Parallel programming can also be implicit, that is, detected automatically by the compiler. Implicit parallelism takes the form of auto-parallelization of outermost loops, auto-vectorization of innermost loops, or both.
Parallelism defined with OpenMP and auto-parallelization directives is based on thread-level parallelism (TLP). Parallelism defined with auto-vectorization techniques is based on instruction-level parallelism (ILP).
The Intel® compiler supports OpenMP and auto-parallelization on IA-32, Intel EM64T, and Itanium architectures for multiprocessor systems, dual-core processor systems, and systems with Hyper-Threading Technology (HT Technology) enabled.
Auto-vectorization is supported on the families of the Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors. To enhance the compilation of the code with auto-vectorization, users can also add vectorizer directives to their program. A closely related technique, software pipelining (SWP), is available on Itanium-based systems.
The following table summarizes the different ways in which parallelism can be exploited with the Intel® Compiler.
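Parallelism | Description |
---|---|
Explicit | Parallelism programmed by the user |
OpenMP* (thread-level parallelism) | Supported on IA-32, Intel EM64T, and Itanium® architectures: multiprocessor systems, dual-core processor systems, and systems with Hyper-Threading Technology enabled |
Implicit | Parallelism generated by the compiler and by user-supplied hints |
Auto-parallelization (thread-level parallelism) | Supported on IA-32, Intel EM64T, and Itanium® architectures: multiprocessor systems, dual-core processor systems, and systems with Hyper-Threading Technology enabled |
Auto-vectorization (instruction-level parallelism) | Supported on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors |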
The Intel® compiler supports the OpenMP* Fortran version 2.5 API specification, available from the OpenMP* web site (http://www.openmp.org). The OpenMP directives relieve the user from having to deal with the low-level details of iteration space partitioning, data sharing, and thread scheduling and synchronization.
The auto-parallelization feature of the Intel® compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines which loops are good worksharing candidates, performs the dataflow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, work that otherwise has to be done by hand when programming with OpenMP directives. Applications built with OpenMP or auto-parallelization can gain performance from shared memory on multiprocessor and dual-core systems and on IA-32 processors with Hyper-Threading Technology.
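To illustrate what the dataflow analysis has to establish, the following minimal sketch contrasts a loop whose iterations are independent with a loop that carries a dependence from one iteration to the next. The program name and array names are hypothetical and chosen only for illustration; whether a given compiler version actually parallelizes the first kind of loop depends on its heuristics and on the options in effect.

```fortran
program autopar_candidates
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n)
  integer :: i

  ! Fill the input array; each iteration writes a different element.
  do i = 1, n
    b(i) = real(i)
  end do

  ! Independent iterations: a(i) depends only on b(i), so the
  ! iterations can run in any order and are a worksharing candidate.
  do i = 1, n
    a(i) = 2.0 * b(i) + 1.0
  end do

  ! Loop-carried dependence: c(i) needs c(i-1) from the previous
  ! iteration, so the iterations cannot simply be split among threads.
  c(1) = b(1)
  do i = 2, n
    c(i) = c(i-1) + b(i)
  end do

  print *, 'a(n) =', a(n)
  print *, 'c(n) =', c(n)
end program autopar_candidates
```

The second loop is a running sum. It is shown only as an example of a dependence that the analysis must detect before a loop can be threaded safely.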
Auto-vectorization detects low-level operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8 or up to 16 elements in one operation, depending on the data type. In some cases auto-parallelization and vectorization can be combined for better performance results. For example, in the code below, thread-level parallelism can be exploited in the outermost loop, while instruction-level parallelism can be exploited in the innermost loop.
Example:

    DO I = 1, 100      ! Execute groups of iterations in different threads (TLP)
      DO J = 1, 32     ! Execute in SIMD style with multimedia extension (ILP)
        A(J,I) = A(J,I) + 1
      ENDDO
    ENDDO
Auto-vectorization can help improve performance of an application that runs on systems based on Pentium®, Pentium with MMX™ technology, Pentium II, Pentium III, and Pentium 4 processors.
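The element counts of 2, 4, 8, and 16 quoted above follow from dividing the width of a SIMD register by the size of one element. The sketch below works through that arithmetic under the assumption of 128-bit registers, which the text does not state explicitly; the program name, the `report` helper, and the chosen kinds are likewise illustrative only.

```fortran
program simd_lanes
  use iso_fortran_env, only: int8, int16, int32, real32, real64
  implicit none
  ! Assumption: 128-bit SIMD registers (not stated in the text above).
  integer, parameter :: register_bits = 128

  call report('64-bit real   ', storage_size(1.0_real64))
  call report('32-bit real   ', storage_size(1.0_real32))
  call report('32-bit integer', storage_size(1_int32))
  call report('16-bit integer', storage_size(1_int16))
  call report(' 8-bit integer', storage_size(1_int8))

contains

  ! Print how many elements of the given size fit in one register.
  subroutine report(label, element_bits)
    character(len=*), intent(in) :: label
    integer, intent(in) :: element_bits
    print '(a,a,i0)', label, ': elements per operation = ', &
          register_bits / element_bits
  end subroutine report

end program simd_lanes
```

The sketch relies on the Fortran 2008 `storage_size` intrinsic and on the kind constants in `iso_fortran_env`, so it needs a compiler that supports that standard.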
The following tables summarize the options that enable auto-vectorization, auto-parallelization, and OpenMP support.
Windows* | Linux* | Description |
---|---|---|
/Qx | -x | Generates specialized code to run exclusively on processors with the extensions specified by K, W, N, B, or P. P is the only valid value on Mac OS* systems. See the description of this option in Compiler Options. |
/Qax | -ax | Generates, in a single binary, code specialized to the extensions specified by K, W, N, B, or P and also generic IA-32 code. P is the only valid value on Mac OS systems. The generic code is usually slower. See the description of this option in Compiler Options. |
/Qvec-report | -vec-report | Controls the diagnostic messages from the vectorizer. See the description of this option in Compiler Options. |
Windows* | Linux* | Description |
---|---|---|
/Qparallel | -parallel | Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. Additional behavior applies on Intel® Itanium®-based systems only; see the description of this option in Compiler Options. |
/Qpar-threshold[:n] | -par-threshold{n} | Sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel, n=0 to 100. See the description of this option in Compiler Options. |
/Qpar-report | -par-report | Controls the auto-parallelizer's diagnostic levels. See the description of this option in Compiler Options. |
Windows* | Linux* | Description |
---|---|---|
/Qopenmp | -openmp | Enables the parallelizer to generate multithreaded code based on the OpenMP directives. Additional behavior applies on Intel® Itanium®-based systems only; see the description of this option in Compiler Options. |
/Qopenmp-report | -openmp-report | Controls the OpenMP parallelizer's diagnostic levels. See the description of this option in Compiler Options. |
/Qopenmp-stubs | -openmp-stubs | Enables compilation of OpenMP programs in sequential mode. The OpenMP directives are ignored and a stub OpenMP library is linked. See the description of this option in Compiler Options. |
When -openmp and -parallel (Linux), or /Qopenmp and /Qparallel (Windows), are both specified on the command line, the auto-parallelization option is applied only in routines that do not contain OpenMP directives. In routines that contain OpenMP directives, only the OpenMP option is applied.
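As a rough illustration of this rule, the sketch below places one routine with an OpenMP directive next to one without. All names (`kernels`, `scale_omp`, `shift_plain`, `mixed_options`) are hypothetical. When the file is built with both options, the text above implies that only the OpenMP option governs the first routine, while the loop in the second routine remains a candidate for the auto-parallelizer.

```fortran
module kernels
  implicit none
contains

  ! Contains an OpenMP directive, so only the OpenMP option
  ! is applied to this routine.
  subroutine scale_omp(x, factor)
    real, intent(inout) :: x(:)
    real, intent(in)    :: factor
    integer :: i
    !$omp parallel do
    do i = 1, size(x)
      x(i) = factor * x(i)
    end do
    !$omp end parallel do
  end subroutine scale_omp

  ! Contains no OpenMP directives, so the auto-parallelization
  ! option may be applied to this loop.
  subroutine shift_plain(x, offset)
    real, intent(inout) :: x(:)
    real, intent(in)    :: offset
    integer :: i
    do i = 1, size(x)
      x(i) = x(i) + offset
    end do
  end subroutine shift_plain

end module kernels

program mixed_options
  use kernels
  implicit none
  real :: v(8)

  v = 1.0
  call scale_omp(v, 3.0)     ! each element becomes 3.0
  call shift_plain(v, 0.5)   ! each element becomes 3.5
  print *, v
end program mixed_options
```

Without either option the directive lines are treated as ordinary comments and the program runs sequentially with the same result.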
With the right choice of options, you can:
Increase the performance of your application with minimum effort
Use compiler features to develop multithreaded programs faster
Additionally, with the relatively small effort of adding OpenMP directives to existing code, you can transform a sequential program into a parallel program. The following example shows OpenMP directives within the code.
Example:

    !$OMP PARALLEL DO PRIVATE(NUM), SHARED(X,A,B,C)
    ! Specifies a parallel region that
    ! implicitly contains a single DO directive
    DO I = 1, 1000
      NUM = FOO(B(I), C(I))
      X(I) = BAR(A(I), NUM)   ! Assume FOO and BAR have no other effect
    ENDDO
    !$OMP END PARALLEL DO
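For readers who want to try the loop above, here is a minimal self-contained sketch. The bodies of `foo` and `bar` are hypothetical stand-ins (the text only assumes that they have no side effects), and the module name, program name, and input values are assumptions made for illustration.

```fortran
module helpers
  implicit none
contains

  ! Hypothetical stand-in for FOO: pure, so it has no side effects.
  pure function foo(b, c) result(r)
    real, intent(in) :: b, c
    real :: r
    r = b + c
  end function foo

  ! Hypothetical stand-in for BAR: pure, so it has no side effects.
  pure function bar(a, num) result(r)
    real, intent(in) :: a, num
    real :: r
    r = a * num
  end function bar

end module helpers

program omp_loop
  use helpers
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), a(n), b(n), c(n), num
  integer :: i

  a = 2.0
  b = 1.0
  c = 3.0

  ! NUM is private so that each thread works on its own copy;
  ! the arrays are shared, and each iteration writes a distinct X(I).
  !$omp parallel do private(num) shared(x, a, b, c)
  do i = 1, n
    num = foo(b(i), c(i))
    x(i) = bar(a(i), num)
  end do
  !$omp end parallel do

  ! With the inputs above every element should be 2.0 * (1.0 + 3.0).
  if (all(x == 8.0)) then
    print *, 'all elements equal 8.0'
  else
    print *, 'unexpected result'
  end if
end program omp_loop
```

Making `num` private is what keeps the iterations independent. If it were shared, threads could overwrite each other's intermediate value between the two statements.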
See examples of the auto-parallelization and auto-vectorization directives in the following topics.