Performance indicators such as response time, resource utilization, and throughput are important for many software applications. This is particularly true for high-throughput and high-performance computing applications such as those used, for example, in quantitative finance and engineering.
Software performance is a pervasive quality that is difficult to understand. There are three principal components to performance: computation, communication, and locality. In practice, however, the situation is far more complicated: performance is affected by everything from the software itself (algorithm, memory access pattern, etc.) to all underlying layers (compiler, operating system, middleware, database, communication networks, and hardware).
The scale and complexity of modern systems, and the wide range of performance improvement techniques employed by hardware designers, make it difficult to develop software that maps well onto modern computers. This challenge is reflected in the growing gap between hardware and software: actual application performance keeps decreasing relative to the peak computational power of a modern multi-core processor. An application running at 5-10% of peak performance is often considered efficient.
In the past, programmers could rely on innovations in hardware architecture to boost software performance without having to modify a single line of code. This is no longer the case: the hardware community has passed the "ball" to software developers. The old design goal for new, more powerful processors (CPUs) was to drive the clock rate up. Cooling and power constraints have stopped this trend, and many commercial chip manufacturers have adopted a new strategy of increasing computational power by increasing parallelism at all levels. An arsenal of micro-processor techniques is used to achieve this objective: cutting-edge processors feature tens of cores on a single chip (architecture-level parallelism), each core runs multiple threads (core-level parallelism), and multiple instructions are executed each clock cycle (instruction-level parallelism) using pipelining, out-of-order execution, and speculative execution. Single Instruction Multiple Data (SIMD) parallelism operates on multiple pieces of data simultaneously. Modern memory subsystems have very high latency relative to processor speeds, and the gap is widening. Techniques such as software memory prefetching, stream processing, and hardware multi-threading attempt to hide memory latency and increase memory-level parallelism.
The inherent complexity of modern multi-processor multi-core systems can be partly illustrated by the visual performance model suggested by Williams [1]. It provides useful bounds and supports bottleneck analysis for understanding the key components of architecture and performance, depending on the operational intensity (OI) of a given computational kernel. Figure 1 shows the different rooflines (upper performance bounds) limiting achievable performance on a generic machine, determined by peak double precision (DP) floating point performance and peak memory bandwidth for a given OI. Additional rooflines are obtained by considering different memory access patterns (e.g. random or unit stride, with or without prefetching) or special instructions such as SIMD and FMA (fused multiply-add). As a result, optimal performance can only be attained if all CPU features are fully exploited.
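The roofline bound itself is simple to compute: attainable performance is the lesser of peak floating-point throughput and memory bandwidth times operational intensity. A minimal sketch, with illustrative (not measured) machine parameters:

```python
# Sketch of the roofline bound. The machine numbers below are made up
# for illustration; a real analysis would use measured peak values.

PEAK_GFLOPS = 50.0   # assumed peak double-precision performance (GFLOP/s)
PEAK_BW_GBS = 10.0   # assumed peak memory bandwidth (GB/s)

def roofline(oi_flops_per_byte):
    """Upper bound on performance (GFLOP/s) for a kernel with the given OI."""
    return min(PEAK_GFLOPS, PEAK_BW_GBS * oi_flops_per_byte)

# The ridge point is where the two rooflines meet: kernels with lower OI
# are memory-bound, kernels with higher OI are compute-bound.
ridge = PEAK_GFLOPS / PEAK_BW_GBS

for oi in (0.5, ridge, 16.0):
    print(f"OI = {oi:5.1f} flop/byte -> bound = {roofline(oi):5.1f} GFLOP/s")
```

With these numbers, a kernel at OI = 0.5 flop/byte is capped at 5 GFLOP/s by bandwidth, while anything beyond the ridge point of 5 flop/byte is capped by the 50 GFLOP/s compute roof.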
Moreover, the push for parallelism continues to increase. It is anticipated that future many-core architectures will provide a massive level of parallelism with hundreds if not thousands of cores. In addition, new technologies such as general-purpose computing on massively parallel graphics processing units (GPUs) are becoming popular, providing a low-cost, high performance computing (HPC) platform. In the past this was mostly marketing hype or limited to the scientific HPC community. This may change as soon as the business community starts to recognize the potential of GPUs.
With massive levels of parallelism being introduced into mainstream computing, complexity has significantly increased, making it difficult to develop applications that can efficiently leverage the computational power of modern computing platforms. This is where specialized performance tools come into play. They help software developers identify optimization opportunities, fix bottlenecks and implementation deficiencies in the code, and ensure and document that the tested system meets its requirements.
This article aims to provide a comprehensive overview of code performance analysis. First, the workflow of performance assessment is presented. This is followed by an introduction to basic techniques to measure and characterize application performance. Finally, the capabilities of modern performance analysis tools are discussed and a selection of different tools is presented.
2. Performance Assessment
A typical performance assessment workflow iterates on a measure/analyze/tune loop (see Figure 2) until performance meets the agreed upon requirements.
- Measure means using performance tools to gather data on how and where the application spends its time and system resources.
- Analyze refers to examining the measurement data to find the bottlenecks that cause the application to fall short of the expected metrics.
- Tune involves devising an approach for reducing or eliminating these bottlenecks and then implementing that approach.
2.1 Measurement Techniques
Performance measurements are carried out by special-purpose code or hardware. They can be applied at any level of the abstraction hierarchy (hardware level, function level, source line level), depending on what is being measured. At the lowest level, measuring the run time of an application requires only basic hardware support (an on-board clock). In addition, most systems include higher-resolution timing registers to measure tighter intervals in terms of elapsed CPU cycles. Depending on the system, more extensive hardware counters may be available, providing a configurable set of registers - Hardware Performance Monitors (HPM) - that can record counts of hardware events such as memory references, cache misses, branch mispredictions, instructions, and network operations.
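The two timer levels mentioned above can be illustrated with standard library clocks; a minimal sketch, where `workload` is a toy stand-in for the code being measured:

```python
# Timer-based measurement at two levels: wall-clock time at nanosecond
# resolution (perf_counter_ns is typically backed by the highest-resolution
# clock the OS exposes) and CPU time consumed by the process itself.
import time

def workload(n=200_000):
    # toy stand-in for the application code being measured
    return sum(i * i for i in range(n))

t0_wall = time.perf_counter_ns()   # wall-clock start, in nanoseconds
t0_cpu = time.process_time()       # CPU time start, in seconds
workload()
wall_ns = time.perf_counter_ns() - t0_wall
cpu_s = time.process_time() - t0_cpu

print(f"wall time: {wall_ns / 1e6:.2f} ms, CPU time: {cpu_s * 1e3:.2f} ms")
```

A large gap between wall-clock and CPU time would indicate the program spent its time waiting (I/O, scheduling) rather than computing.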
In order to take measurements while a program is running, special instructions must be injected (at binary or source code level) into the control flow of the running system. Execution of the instrumented code generates events, such as method entry/exit, timer events, or object allocation. The main advantage of instrumentation-based profiling over other known techniques is flexibility: virtually any data, from low-level events to high-level data, can be selectively captured. However, instrumenting a program will typically slow it down, as the instrumentation code introduces significant overhead. In addition, the injected instrumentation alters program behaviour (observer effect) and changes its characteristics (e.g. cache access patterns).
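A minimal sketch of source-level instrumentation: a decorator injects entry/exit probes around a function and records timestamped events. Real instrumenting profilers insert equivalent probes automatically, at source or binary level, but the principle — and the per-call overhead — is the same.

```python
# Source-level instrumentation sketch: entry/exit probes record an event
# on every call. Note that each call now pays the cost of two probe
# executions, which is the overhead discussed above.
import functools
import time

events = []  # (event, function name, timestamp in ns)

def instrument(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        events.append(("enter", func.__name__, time.perf_counter_ns()))
        try:
            return func(*args, **kwargs)
        finally:
            events.append(("exit", func.__name__, time.perf_counter_ns()))
    return wrapper

@instrument
def compute(n):
    return sum(range(n))

compute(1000)
print(events)  # every call is guaranteed to be observed
```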
Instead of embedding instructions directly into an application's control flow, sampling interrupts the target program using an interval timer or hardware events (e.g. CPU cycles, cache misses), and increments the sample counter either of the address currently being executed or of each function on the current call stack (call stack sampling). Event triggers can be either synchronous or asynchronous, depending on whether or not they are initiated by direct program action. Sampling correlates sample counts with execution time, on the assumption that the higher a function's sample count, the higher its processor utilization and the greater the likelihood of a performance bottleneck.
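An asynchronous, timer-driven sampler can be sketched in a few lines on a POSIX system (this relies on SIGPROF and so does not run on Windows); the function names are toy examples:

```python
# Call-stack sampling sketch: a profiling interval timer periodically
# interrupts the program, and the handler increments a counter for the
# function executing at that moment. Counts only approximate where time
# is spent; short-lived functions may be missed entirely.
import collections
import signal

samples = collections.Counter()

def on_sample(signum, frame):
    # attribute the sample to the currently executing function
    samples[frame.f_code.co_name] += 1

signal.signal(signal.SIGPROF, on_sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)  # fire every 10 ms of CPU time

def busy():
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total

busy()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling

print(samples.most_common(3))  # 'busy' should dominate the sample counts
```

Note that `busy` itself is completely unmodified — exactly the non-intrusiveness advantage described below.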
Sampling and instrumentation are complementary techniques. Instrumentation allows the programmer to observe events as they happen at runtime, since it is embedded directly into the system's control flow. This allows for more complete coverage. With instrumentation, an instrumented function is guaranteed to be observed each time control passes to it, ensuring that the events recorded are exactly those that occurred at runtime. With sampling, there is no such guarantee.
In practice, sampling can often provide a more accurate picture of the target program's execution than other approaches, as it is less intrusive (low measurement overhead) and has fewer side effects (such as on memory caches or instruction pipelines). In addition, the method allows the target program to be run without modification, including all compiler optimizations, and can easily be attached to a running process.
2.2 Performance Characterization - Profiling & Tracing
Before measurements can be of use in guiding optimization, they must be compiled and aggregated into concise descriptions called performance characterizations. The two most widely used approaches in the field of performance analysis are profiling and tracing. Profiling maps regions of source or object code to the amount of a particular resource they consume. In the most common case, the resource measured is cumulative time, but profiles may also be generated from HPM measurements. A trace is a performance characterization generated from observations of performance events (e.g. those recorded by trace instrumentation) over time. Unlike profiles, which discard time information, a trace can capture the full sequence of events in the execution of a program. Since they give a precise ordering of events, traces can be useful in determining evolutionary behaviour for analyzing the correctness of program execution. This is difficult with profiles, because timing information, and thus insight into causality, is discarded.
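The relationship between the two characterizations can be made concrete: a profile can always be derived from a trace by aggregating, but not vice versa. A minimal sketch, using a synthetic event stream of two nested calls:

```python
# A trace keeps the full ordered event sequence; a profile aggregates
# per-function time and discards ordering. The trace below is synthetic.
import collections

# (timestamp, event, function) -- two nested calls, main() calling solve()
trace = [
    (0, "enter", "main"),
    (1, "enter", "solve"),
    (5, "exit",  "solve"),
    (6, "exit",  "main"),
]

def profile_from_trace(trace):
    """Collapse a trace into an inclusive-time profile per function."""
    inclusive = collections.defaultdict(int)
    stack = []
    for ts, event, fn in trace:
        if event == "enter":
            stack.append((fn, ts))
        else:
            name, start = stack.pop()
            inclusive[name] += ts - start
    return dict(inclusive)

print(profile_from_trace(trace))  # ordering and causality are gone
```

The profile records that `solve` consumed 4 time units inside `main`'s 6, but no longer shows that `solve` ran between timestamps 1 and 5 — which is exactly the information a trace preserves.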
3. Performance Analysis Tools
Most performance analysis tools support either profiling or tracing. Table 1 shows a selection of open-source and commercial performance analysis tools. The types of analysis capabilities offered by performance tools vary significantly, ranging from simple timing analysis to more sophisticated features that can identify subtle performance bottlenecks.
Hardware performance counter analysis can help gain more insight into hardware (architecture) related issues, such as the memory and cache characteristics of the program. Parallel profilers can help identify opportunities for parallelism, analyze the communication traffic of parallel applications, and enable the design of more efficient communication strategies. I/O access patterns can be studied to improve read/write throughput from/to hard disk. Since the majority of performance tools have no facility for applying optimizations to a user's code, the optimization stage is left to the user: at best, the tool indicates where and why a particular bottleneck occurs and leaves it to the user to optimize the code.
Developing software with good performance has always been a difficult task. The advent of massively parallel computing, with its higher hardware and software complexity, has made the task more daunting still. It seems as though the entire IT industry has bet its future on software developers successfully switching to explicit parallel programming, which requires a deep understanding of parallel computing and computer architecture. Theoretical knowledge remains valuable, and many helpful tools are available to analyze the effective performance of code, providing unique insight that is hard to obtain by other means. Sampling-based tools offer a quick and easy way to obtain a general idea of a program's performance characteristics. Tools based on instrumentation can then be used to selectively refine the analysis.
References
1. S. Williams: The Roofline Model, in Performance Tuning of Scientific Applications, edited by David H. Bailey, Robert F. Lucas, Samuel W. Williams, CRC Press, 2010, ISBN 978-1-4398-1569-4.
2. Adam Leko, Hans Sherburne, Hung-Hsun Su, Bryan Golden, Alan George: Practical Experiences with Modern Parallel Performance Analysis Tools: An Evaluation, Whitepaper, http://www.hcs.ufl.edu/upc/toolevaluation.pdf
3. Connie U. Smith, Lloyd G. Williams: Best Practices for Software Performance Engineering, 29th International Computer Measurement Group Conference, Dallas
4. T. G. Mattson, B. A. Sanders, and B. L. Massingill: Patterns for Parallel Programming, Addison-Wesley, ISBN 0-321-22811-1
5. S. Akhter and J. Roberts: Multi-Core Programming: Increasing Performance through Software Multithreading, Intel Press, ISBN 0-9764832-4-6