Intel Fortran Compiler 11.0 for Mac
Advanced Optimisation
Interprocedural Optimisation
Interprocedural optimisation (IPO) can dramatically improve application performance in programs that contain many frequently used small- or medium-sized functions, especially programs that make calls within loops. This set of techniques, which can be enabled for automatic operation in the Intel Compilers, analyses multiple files or whole programs to detect and perform optimisations, rather than looking only within individual functions.
Typical optimisations made as part of the IPO process include procedure inlining and re-ordering, elimination of dead (unreachable) code, and constant propagation (the substitution of known constant values for variables). IPO enables more aggressive optimisation than is available at the intra-procedural level, since the added context of multiple procedures makes those more aggressive optimisations safe.
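As a rough illustration of inlining combined with constant propagation, the sketch below shows a loop that calls a small helper and a hand-written version of what the optimiser could reduce it to. It is written in C for compactness, the function names are hypothetical, and the second function only approximates what a compiler might generate. In a real program the helper would typically live in a different source file, which is why whole-program analysis is needed to see through the call.

```c
#include <stddef.h>

/* Small helper; imagine it is defined in a separate source file. */
double scale(double x, double factor)
{
    return x * factor;
}

/* As written by the programmer: one call per loop iteration. */
double sum_scaled_calls(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += scale(v[i], 2.0);
    return total;
}

/* Hand-written equivalent after inlining scale() and propagating
   the constant 2.0 into the loop body: no call overhead remains. */
double sum_scaled_inlined(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += v[i] * 2.0;
    return total;
}
```

Both functions compute the same result; the difference is only in how much work each iteration performs.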
Profile-Guided Optimisation
The Profile-Guided Optimisation (PGO) compilation process enables the Intel Fortran Compiler to take better advantage of the processor microarchitecture, use instruction paging and cache memory more effectively, and make better branch predictions. It improves application performance by reorganising code layout to reduce instruction-cache thrashing, by shrinking code size, and by reducing branch mispredictions.
PGO is a three-stage process: 1) compile the application with instrumentation added, 2) run the instrumented application to generate profile data, and 3) recompile, using the data collected during that run to guide optimisation.
Several profile-guided optimisations that influence code size are described below:
- Basic block and function ordering — Place frequently-executed blocks and functions together to take advantage of instruction-cache locality.
- Aid inlining decisions — Inline frequently-executed functions so the increase in code size is paid in areas of highest performance impact.
- Aid vectorisation decisions — Vectorise high trip count and frequently-executed loops so the increase in code size is mitigated by the increase in performance.
Automatic Vectoriser
Vectorisation automatically parallelises code to make full use of the underlying processor's capabilities. This advanced optimisation analyses loops and determines when it is safe and effective to execute several iterations of a loop in parallel using MMX, SSE, SSE2, and SSE3 instructions.
Use vectorisation to optimise your application code and take advantage of these new extensions when running on Intel processors. Features include support for advanced, dynamic data alignment strategies, including loop peeling to generate aligned loads and loop unrolling to match the prefetch of a full cache line.
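The loop-peeling strategy mentioned above can be sketched by hand. The example below is an illustration in plain C rather than actual compiler output: the function name is hypothetical, 16 bytes is assumed as the vector width (the size of an SSE register), and the four scalar additions in the main loop stand in for a single packed SSE addition.

```c
#include <stddef.h>
#include <stdint.h>

/* Add 1.0f to every element, split the way a vectoriser might split it. */
void add_one_peeled(float *a, size_t n)
{
    size_t i = 0;

    /* Peel loop: scalar iterations until a[i] sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) % 16u) != 0) {
        a[i] += 1.0f;
        ++i;
    }

    /* Main loop: four floats (16 bytes) per step, standing in for one
       aligned SSE load, add and store. */
    for (; i + 4 <= n; i += 4) {
        a[i]     += 1.0f;
        a[i + 1] += 1.0f;
        a[i + 2] += 1.0f;
        a[i + 3] += 1.0f;
    }

    /* Remainder loop: leftover elements that do not fill a full vector. */
    for (; i < n; ++i)
        a[i] += 1.0f;
}
```

The peel loop runs at most a few iterations, after which every access in the main loop is aligned.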
High Level Optimisation
Data prefetching is an effective technique to hide memory access latency, significantly improving performance in many compute-intensive applications. Data prefetching inserts prefetch instructions for selected data references at specific points in the program, so referenced data items are moved as close to the processor as possible (put in cache memory) before the data items are actually used.
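To make the idea concrete, the sketch below inserts a prefetch hint by hand, in C. It assumes a compiler that provides the GCC-style `__builtin_prefetch` builtin (the hint is compiled out otherwise), and `PREFETCH_DISTANCE` is a hypothetical tuning value, not a recommended setting. A simple sequential loop like this one is usually handled well by hardware prefetchers, so the code shows only where a hint would be placed, not a case where it is certain to help.

```c
#include <stddef.h>

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

double sum_with_prefetch(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        /* Hint that v[i + PREFETCH_DISTANCE] will be read soon
           (second argument 0 = read, third argument 3 = keep in cache). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&v[i + PREFETCH_DISTANCE], 0, 3);
#endif
        total += v[i];
    }
    return total;
}
```

The prefetch is only a hint: it does not change the result of the loop, only when the data is likely to arrive in cache.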
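Loop unrolling is the combination of two or more loop iterations into one in order to reduce the loop count. While it often causes code size to increase, loop unrolling frequently reduces the number of instructions that must be executed, because the loop-control overhead (counter update, comparison and branch) is paid once per group of iterations rather than once per iteration.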
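A hand-unrolled loop makes the trade-off visible. The following is a minimal sketch in C with hypothetical function names and an unroll factor of four chosen only for illustration; a compiler would pick the factor itself.

```c
#include <stddef.h>

/* Original loop: one compare-and-branch per element. */
long sum_rolled(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Unrolled by four: one compare-and-branch per four elements,
   followed by a short loop for the leftover elements. */
long sum_unrolled4(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        total += (long)v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; ++i)
        total += v[i];
    return total;
}
```

The unrolled version contains more code, but for long arrays it executes roughly a quarter of the loop-control instructions.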
Intel Debugger
The Intel Debugger enables optimised code debugging (i.e., debugging code that has been significantly transformed for optimal execution on a specific hardware architecture). The Intel Compilers produce standards-compliant debug information for optimised code debugging that is available to all debuggers that support Intel Compilers.
The Intel Debugger supports multi-core architectures by enabling debugging of multi-threaded applications, providing the following related capabilities:
- An all-stop/all-go execution model (i.e., all threads are stopped when one is stopped, and all threads are resumed when one is resumed)
- List all created threads
- Switch focus between threads
- Examine detailed thread state
- Set breakpoints (including all stop, trace and watch variations) and display a back-trace of the stack for all threads or for a subset of threads
- Use the Thread panel in the built-in GUI (on the Current Source pane), which activates when a thread is created, to select thread focus and display related details
The recently enhanced GNU Project Debugger (GDB) can also be used for parallel applications.