Intel Visual Fortran Compiler 11.0 for Windows
Advanced Optimisation Features
Software compiled with the Intel Visual Fortran Compiler for Windows benefits from the advanced optimisation features described here.
Multi-Threaded Application Support
OpenMP and auto-parallelisation help convert serial applications into parallel applications, allowing you to take full advantage of multi-core processors such as the Intel Core Duo processor and the dual-core Itanium 2 processor, as well as symmetric multiprocessor systems:
- OpenMP is the industry standard for portable multi-threaded application development. It is effective at fine-grain (loop-level) and large-grain (function-level) threading.
- OpenMP directives are an easy and powerful way to convert serial applications into parallel applications, enabling potentially big performance gains from parallel execution on multi-core and symmetric multiprocessor systems.
- Auto-parallelisation improves application performance on multiprocessor systems by threading loops automatically. This option detects loops that can safely be executed in parallel and automatically generates multi-threaded code for them.
- Automatic parallelisation relieves the user of the low-level details of iteration partitioning, data sharing, thread scheduling, and synchronisation. It also provides the performance benefits available from multiprocessor systems and systems that support Hyper-Threading Technology (HT Technology).
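The loop-level threading described above can be sketched with two small loops. This is only an illustration, written in C so that it stays compact. The function names saxpy and array_sum are hypothetical and do not come from any Intel library. In Fortran the same pattern would use the OpenMP !$OMP PARALLEL DO directive around a DO loop.

```c
#include <stddef.h>

/* Scale-and-add over two arrays: y[i] = a * x[i] + y[i].
   Each iteration touches only its own element, so the iterations are
   independent and the loop can be divided among threads. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    /* If OpenMP support is not enabled in the compiler, this pragma is
       ignored and the loop runs serially with the same result. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Sum of an array. The reduction clause gives each thread a private
   partial sum and combines the partial sums when the loop ends, so the
   threads never update the shared variable at the same time. */
double array_sum(size_t n, const double *v)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; ++i) {
        total += v[i];
    }
    return total;
}
```

The first loop is also the kind of loop an auto-paralleliser can thread without any directive, because no iteration reads a value written by another. The second loop needs the reduction clause (or equivalent compiler analysis) because every iteration updates the same variable.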
Interprocedural Optimisation (IPO)
Interprocedural optimisation (IPO) can dramatically improve application performance in programs that contain many frequently used small- or medium-sized functions, especially programs that make calls within loops. This set of techniques, which can be enabled for automatic operation in the Intel compilers, analyses multiple files or whole programs to detect and perform optimisations, rather than working only within individual functions.
The IPO process first requires that source files be compiled with the IPO option, creating object (.obj) files that contain the intermediate language (IL) used by the compiler. At link time, the compiler combines all of the IL information and analyses it for optimisation opportunities. Typical optimisations made as part of the IPO process include procedure inlining and re-ordering, elimination of dead (unreachable) code, and constant propagation, that is, the substitution of known constant values for variables. IPO enables more aggressive optimisation than is available at the intra-procedural level, since the added context of multiple procedures makes those more aggressive optimisations safe.
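As a rough illustration of inlining, constant propagation, and dead-code elimination working together, consider the sketch below. It is written in C for brevity, all three function names are hypothetical, and the last function is a hand-written equivalent of what whole-program analysis might produce, not actual compiler output. For the purpose of the example, assume scale and sum_scaled live in different source files, so the compiler can relate them only when it sees both at link time.

```c
/* Imagine this helper is defined in one source file. */
double scale(double value, double factor, int clamp)
{
    double result = value * factor;
    if (clamp && result > 100.0) {
        result = 100.0;
    }
    return result;
}

/* Imagine this caller is defined in another source file.
   Every call passes factor = 2.0 and clamp = 0. */
double sum_scaled(const double *v, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += scale(v[i], 2.0, 0);   /* a call inside a loop */
    }
    return total;
}

/* Hand-written equivalent of the caller after the optimisations:
   - inlining removes the call overhead from the loop body,
   - constant propagation replaces factor with 2.0 and clamp with 0,
   - the clamp branch can never run, so it is removed as dead code. */
double sum_scaled_after_ipo(const double *v, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += v[i] * 2.0;
    }
    return total;
}
```

Without the cross-file view, the compiler has to assume that scale could be called with any factor and any clamp value, so it cannot safely drop the branch.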
Profile-Guided Optimisation (PGO)
The Profile-Guided Optimisation (PGO) compilation process enables the Intel compiler to take better advantage of the processor microarchitecture, use instruction paging and cache memory more effectively, and make better branch predictions. It improves application performance by reorganising code layout to reduce instruction-cache thrashing, by shrinking code size, and by reducing branch mispredictions.
PGO is a three-stage process: 1) a compile of the application with instrumentation added, 2) a profile-generation phase, in which the instrumented application is executed and monitored, and 3) a recompile in which the data collected during the profiling run guides optimisation. Several profile-guided optimisations that influence code size are described here:
- Basic block and function ordering — Place frequently-executed blocks and functions together to take advantage of instruction-cache locality.
- Aid inlining decisions — Inline frequently-executed functions so the increase in code size is paid in areas of highest performance impact.
- Aid vectorisation decisions — Vectorise high trip count and frequently-executed loops so the increase in code size is mitigated by the increase in performance.
With PGO, the compiler uses information about how an application actually runs to improve its performance. Used together with IPO, which optimises the application's logic without any run-time information, PGO can improve application performance, sometimes dramatically.
Automatic Vectoriser
Vectorisation automatically parallelises code to make the most of the underlying processor's capabilities. This advanced optimisation analyses loops and determines when it is safe and effective to execute several iterations of a loop in parallel by using MMX technology and SSE, SSE2, and SSE3 instructions.
Use vectorisation to optimise your application code and take advantage of these new extensions when running on Intel processors. Features include support for advanced, dynamic data alignment strategies, including loop peeling to generate aligned loads and loop unrolling to match the prefetch of a full cache line.
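Whether it is "safe" to execute several iterations at once depends mainly on whether one iteration needs a result produced by another. The sketch below contrasts the two cases. It is an illustration in C with hypothetical function names, and whether a particular compiler actually vectorises either loop depends on its own analysis and settings.

```c
#include <stddef.h>

/* Independent iterations: c[i] depends only on a[i] and b[i].
   The restrict qualifiers promise that the arrays do not overlap, so
   several elements can be loaded, added, and stored by one SIMD
   instruction. This is a typical vectorisation candidate. */
void add_arrays(size_t n, const float *restrict a,
                const float *restrict b, float *restrict c)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* Loop-carried dependence: each iteration reads a[i - 1], which the
   previous iteration has just written. Executing several iterations
   at once in the straightforward way would read stale values, so this
   running sum is not safe to vectorise as written. */
void running_sum(size_t n, float *a)
{
    for (size_t i = 1; i < n; ++i) {
        a[i] = a[i] + a[i - 1];
    }
}
```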
High Level Optimisation (HLO)
Data prefetching is an effective technique to hide memory access latency, significantly improving performance in many compute-intensive applications. Data prefetching inserts prefetch instructions for selected data references at specific points in the program, so referenced data items are moved as close to the processor as possible (put in cache memory) before the data items are actually used.
Loop unrolling combines two or more loop iterations into one in order to reduce the iteration count. While it often increases code size, loop unrolling frequently reduces the number of instructions that must be executed, because fewer loop-control instructions (counter updates, comparisons, and branches) are needed.
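Loop unrolling can be sketched by hand to show what the transformation does. The example below is an illustration in C with hypothetical function names. A compiler performs the equivalent rewrite internally and chooses the unroll factor itself, so the factor of four here is only an assumption for the sake of the example.

```c
#include <stddef.h>

/* Original loop: one multiply-add per iteration, plus one counter
   update, one comparison, and one branch per element. */
double dot_rolled(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Unrolled by four: the loop-control work is paid once per four
   elements. The additions happen in the same order as in the rolled
   loop, so the result is identical. */
double dot_unrolled4(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += x[i]     * y[i];
        sum += x[i + 1] * y[i + 1];
        sum += x[i + 2] * y[i + 2];
        sum += x[i + 3] * y[i + 3];
    }
    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The unrolled body is about four times as long, which is the code-size cost mentioned above.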
Intel Debugger
The Intel Debugger enables optimised code debugging (i.e., debugging code that has been significantly transformed for optimal execution on a specific hardware architecture). The Intel Compilers produce standards-compliant debug information for optimised code debugging that is available to all debuggers that support Intel Compilers. The Intel Debugger supports multi-core architectures by enabling debugging of multi-threaded applications, providing the following related capabilities:
- An all-stop/all-go execution model (i.e., all threads are stopped when one is stopped, and all threads are resumed when one is resumed).
- List all created threads.
- Switch focus between threads.
- Examine detailed thread state.
- Set breakpoints (including all stop, trace and watch variations) and display a back-trace of the stack for all threads or for a subset of threads.
- The built-in GUI provides a Thread panel (on the Current Source pane) that activates when a thread is created, and that allows an operator to select thread focus and display related details.