Extrae is the package devoted to generating Paraver trace files for post-mortem analysis. It uses different interposition mechanisms to inject probes into the target application in order to gather information about the application's performance. The following table summarizes the platforms and programming models supported by Extrae.

Supported platforms
Linux clusters (x86-64)
nVidia GPU
Intel Xeon Phi
K computer
Supported programming models


(* Also available in conjunction with MPI)

To ease its setup, Extrae can be configured through an XML file. The distributed package contains several examples.

1. Interposition mechanisms

Extrae takes advantage of multiple interposition mechanisms to add monitors to the application. Whichever mechanism is used, the goal is the same: to collect performance metrics at known application points so that the performance analyst can correlate performance with the application's execution. Extrae currently uses the following interposition mechanisms:

1.1 Linker preload (LD_PRELOAD)

Most current operating systems allow injecting a shared library into an application before the application is actually loaded. If the preloaded library provides the same symbols as those contained in the application's shared libraries, those symbols can be wrapped in order to inject code into the calls. On Linux systems this technique is commonly applied through the LD_PRELOAD environment variable. Extrae contains substitution symbols for many parallel runtimes, such as OpenMP (Intel, GNU or IBM runtimes), pthread, CUDA-accelerated applications, and MPI applications.
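The wrapping idea can be sketched in a few lines of C. This is only an illustration of the general preload technique, not Extrae's own code: it wraps the C library routine sleep() (chosen only because it is simple), and the names probe_calls, probe_ns, probe_call_count and probe_elapsed_ns are hypothetical. It assumes a glibc-based Linux system, where dlsym() with RTLD_NEXT returns the next definition of a symbol in the library search order.

```c
#define _GNU_SOURCE /* exposes RTLD_NEXT on glibc; must precede all includes */
#include <dlfcn.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical probe state: intercepted calls and nanoseconds spent in them.
 * Not thread-safe; a real tool would keep per-thread buffers. */
static unsigned long probe_calls = 0;
static long long probe_ns = 0;

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Same name and signature as the C library's sleep(). When this object is
 * preloaded, the dynamic linker binds the application's calls to it. */
unsigned int sleep(unsigned int seconds)
{
    static unsigned int (*real_sleep)(unsigned int) = NULL;
    long long start;
    unsigned int left;

    if (real_sleep == NULL) {
        /* Look up the next definition of the symbol, i.e. the original one. */
        real_sleep = (unsigned int (*)(unsigned int))dlsym(RTLD_NEXT, "sleep");
        if (real_sleep == NULL)
            return seconds; /* nothing to forward to: report no time slept */
    }

    start = now_ns();             /* probe on entry */
    left = real_sleep(seconds);   /* forward to the original routine */
    probe_ns += now_ns() - start; /* probe on exit */
    probe_calls++;
    return left;
}

/* Accessors so that the collected data can be inspected or dumped. */
unsigned long probe_call_count(void) { return probe_calls; }
long long probe_elapsed_ns(void) { return probe_ns; }
```

Assuming the file is called probe.c (a hypothetical name), it could be built with `gcc -shared -fPIC -o libprobe.so probe.c -ldl` and activated with `LD_PRELOAD=./libprobe.so ./app`, without relinking or recompiling the application.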

1.2 DynInst

DynInst is an instrumentation library that allows modifying the application by injecting code at specific code locations. Although it originally modified the application code while the application was running, it now also supports rewriting the application binary, so that the code injection is required only once. Extrae uses DynInst to instrument different parallel programming runtimes such as OpenMP (Intel, GNU or IBM runtimes), CUDA-accelerated applications, and MPI applications. DynInst also lets Extrae easily instrument user functions by simply listing them in a file.

1.3 Additional instrumentation mechanisms

Extrae also takes advantage of parallel programming runtimes that offer their own instrumentation (or profiling) mechanisms for performance tools. These include not only the widely known Message Passing Interface (MPI), which provides the MPI profiling (PMPI) layer, but also the CUPTI infrastructure for obtaining information from CUDA devices and the OpenCL profiling capabilities. In addition, some compilers allow instrumenting application routines through special flags given at the compilation and link phases.
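As an illustration of compiler-assisted instrumentation, GCC and Clang accept the -finstrument-functions flag, which makes the compiler call two user-provided hooks, __cyg_profile_func_enter and __cyg_profile_func_exit, at the entry and exit of every compiled routine. The sketch below only shows the shape of such hooks. The counters and the probe_* accessors are hypothetical, and whether Extrae relies on this exact flag for a given compiler is an assumption here.

```c
#include <stddef.h>

/* The hooks must not be instrumented themselves, or they would recurse. */
#define NO_INSTR __attribute__((no_instrument_function))

/* Hypothetical probe state. */
static unsigned long entries = 0; /* routine entries seen so far */
static int depth = 0;             /* current call depth */
static void *last_fn = NULL;      /* address of the last routine entered */

void __cyg_profile_func_enter(void *this_fn, void *call_site) NO_INSTR;
void __cyg_profile_func_exit(void *this_fn, void *call_site) NO_INSTR;
unsigned long probe_entries(void) NO_INSTR;
int probe_depth(void) NO_INSTR;
void *probe_last_function(void) NO_INSTR;

/* Called by compiler-inserted code on entry to an instrumented routine. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    entries++;
    depth++;
    last_fn = this_fn; /* an address, to be mapped to a name afterwards */
}

/* Called by compiler-inserted code on exit from an instrumented routine. */
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    if (depth > 0)
        depth--;
}

unsigned long probe_entries(void) { return entries; }
int probe_depth(void) { return depth; }
void *probe_last_function(void) { return last_fn; }
```

The application sources are then compiled with -finstrument-functions and linked against the object that contains the hooks. A tracing tool would emit a timestamped event in each hook instead of updating counters.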

1.4 Extrae API

Finally, Extrae gives the user the possibility to manually instrument the application and emit its own events if the previous mechanisms do not fulfill the user's needs. The Extrae API is detailed in the Extrae user-guide documentation that accompanies the package.

2. Sampling mechanisms

Extrae not only offers the possibility to instrument the application code, but also offers sampling mechanisms to gather performance data. While adding monitors at specific locations of the application produces insight that can be easily correlated with the source code, the resolution of such data is directly tied to the application control flow. Adding sampling capabilities to Extrae makes it possible to provide performance information about regions of code that have not been instrumented.

Currently, Extrae offers two sampling mechanisms. The first is the well-known signal timer, which fires the sampling handler at a specified time interval. The second uses the processor performance counters to fire the sampling handler after a specified number of events. While the first mechanism can yield samples that are totally uncorrelated with the application code, the second, given appropriate performance counters, can provide insight into the application while still retaining some correlation with the application code and performance.
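The timer-based mechanism can be sketched with the POSIX interval timer interface. The example below is a minimal illustration, not Extrae's implementation: it assumes the profiling timer (ITIMER_PROF, which delivers SIGPROF as the process consumes CPU time), and the function names sampling_start, sampling_stop and sampling_count are hypothetical. The handler only counts samples, whereas a real tool would record a timestamp, counters and the interrupted call stack.

```c
#define _XOPEN_SOURCE 700 /* sigaction(), setitimer(), SA_RESTART */
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Number of samples taken; sig_atomic_t is safe to update in a handler. */
static volatile sig_atomic_t sample_count = 0;

/* Sampling handler: fired each time the interval timer expires. */
static void on_sample(int signo)
{
    (void)signo;
    sample_count++;
}

/* Install the handler and arm a periodic timer. Returns 0 on success. */
int sampling_start(long interval_us)
{
    struct sigaction sa;
    struct itimerval it;

    if (interval_us <= 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample;
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    it.it_interval.tv_sec = interval_us / 1000000;
    it.it_interval.tv_usec = interval_us % 1000000;
    it.it_value = it.it_interval; /* first expiry after one interval */
    return setitimer(ITIMER_PROF, &it, NULL);
}

/* Disarm the timer. Returns 0 on success. */
int sampling_stop(void)
{
    struct itimerval off;
    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}

int sampling_count(void) { return (int)sample_count; }
```

Calling sampling_start(1000) before a compute-bound region and sampling_stop() after it yields roughly one sample per millisecond of CPU time, regardless of whether the region contains any instrumented call.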

2.1 Performance data gathered

The monitors added by Extrae gather different types of information. Depending on its placement, each monitor can be configured to gather specific information. The most common information gathered is:

2.2 Timestamp

When analyzing the behavior of an application, it is important to have a fine-grained timestamping mechanism (up to nanoseconds). Extrae provides a set of clock functions that are specifically implemented for different target machines in order to provide the most accurate timing possible. Some systems have daemons that inhibit the usage of these timers, and others do not have a specific timer implementation. In such cases, Extrae still uses advanced POSIX clocks to provide nanosecond-resolution timestamps at low cost.
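A nanosecond timestamp based on the POSIX clocks can be obtained as in the following sketch. It assumes clock_gettime() with CLOCK_MONOTONIC. Which clock Extrae actually selects is not stated here, and the function name timestamp_ns is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L /* clock_gettime(), CLOCK_MONOTONIC */
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds from a monotonic clock, or 0 on failure.
 * A monotonic clock never jumps backwards, which suits event ordering. */
uint64_t timestamp_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}
```

Taking the difference between two such timestamps around a call gives its duration. The values are relative to an unspecified starting point, so they are meaningful only when compared with other readings of the same clock on the same machine.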

2.3 Performance and other counter metrics

Extrae uses the PAPI and PMAPI interfaces to collect information about microprocessor performance. With the advent of components in the PAPI software, Extrae can not only collect information about how the microprocessor is behaving, but also study other components of the system (disk, network, operating system, among others) and extend the study of the microprocessor itself (power consumption and thermal information). Extrae mainly collects these counter metrics at parallel programming calls and at samples. It can also capture them at the entry and exit points of instrumented user routines.

2.4 References to the source code

Analyzing the performance of an application requires relating that performance to the code responsible for it. This way the analyst can locate performance bottlenecks and suggest improvements to the application code. Extrae provides information about the source code that was being executed (function name, file name and line number) at specific points such as programming model calls or sampling points.
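The triple of function name, file name and line number can be pictured as a small record attached to each event. The sketch below builds such a record with the standard C identifiers __func__, __FILE__ and __LINE__. This only illustrates the shape of the information: this text does not say how Extrae obtains it, and the type source_ref_t, the macro SOURCE_REF_HERE and the function source_ref_format are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record describing one location in the source code. */
typedef struct {
    const char *function; /* name of the enclosing function */
    const char *file;     /* source file name */
    int line;             /* line number within the file */
} source_ref_t;

/* Expands to a record describing the place where the macro is used. */
#define SOURCE_REF_HERE() ((source_ref_t){ __func__, __FILE__, __LINE__ })

/* Write "function (file:line)" into buf. Returns the snprintf() result:
 * the length the full text needs, or a negative value on error. */
int source_ref_format(const source_ref_t *ref, char *buf, size_t len)
{
    return snprintf(buf, len, "%s (%s:%d)",
                    ref->function, ref->file, ref->line);
}
```

A manually placed probe could store `SOURCE_REF_HERE()` next to its timestamp, so that an analysis tool can show the analyst which routine and line each measurement belongs to.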