4. Extrae XML configuration file

Extrae is configured through an XML file whose path is given in the EXTRAE_CONFIG_FILE environment variable. The included examples provide several XML files to serve as a basis for the end user. For instance, the MPI examples provide the following XML configuration files:

  • extrae.xml Exemplifies all the options that can be set in the configuration file. All the sections and options are discussed below. The file is also reproduced in this document, in appendix An example of Extrae XML configuration file.
  • extrae_explained.xml The same as the above with some comments on each section.
  • summarized_trace_basic.xml A small example for gathering summarized information of the MPI and OpenMP parallel paradigms.
  • detailed_trace_basic.xml A small example for gathering MPI and OpenMP information, with some performance counters and calling information at each MPI call.
  • extrae_bursts_1ms.xml An XML configuration example to set up the bursts tracing mode. This XML file will only capture the regions in between MPI calls that last more than the given threshold (1ms in this example).

Please note that most of the nodes in the XML file have an enabled attribute that turns parts of the instrumentation mechanism on and off. For example, <mpi enabled="yes"> means that MPI instrumentation is enabled and that all the contained XML subnodes, if any, are processed, whereas <mpi enabled="no"> means that MPI information is not gathered and the XML subnodes are not processed.
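The effect of the enabled attribute on nested nodes can be illustrated with a short Python sketch. It is not Extrae's parser: the active_nodes helper and the sample configuration are hypothetical, and treating a missing enabled attribute as "yes" is an assumption made only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced configuration used only for this illustration.
CONFIG = """<?xml version='1.0'?>
<trace enabled="yes">
  <mpi enabled="yes">
    <counters enabled="yes" />
  </mpi>
  <openmp enabled="no">
    <locks enabled="yes" />
  </openmp>
</trace>"""

def active_nodes(node, path=""):
    """Yield the paths of the nodes reachable without crossing enabled="no"."""
    # Assumption of this sketch: a node without the attribute counts as enabled.
    if node.get("enabled", "yes") != "yes":
        return  # skip this node and every subnode below it
    here = f"{path}/{node.tag}"
    yield here
    for child in node:
        yield from active_nodes(child, here)

root = ET.fromstring(CONFIG)
print(list(active_nodes(root)))
# ['/trace', '/trace/mpi', '/trace/mpi/counters']
```

The <locks> node under <openmp> is never visited, even though its own attribute is yes, because its parent is disabled.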

Each section points out which environment variables can be used instead if the tracing package lacks XML support. See appendix Environment variables for the entire list.

Sometimes the XML tags are used for time selection (duration, for instance). In such tags, the following postfixes can be used: n or ns for nanoseconds, u or us for microseconds, m or ms for milliseconds, s for seconds, M for minutes, H for hours and D for days.
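The postfixes above can be summarized as a conversion table. The following Python sketch is only an illustration of that table, not Extrae's own parsing code; the to_nanoseconds helper is hypothetical, and rejecting values without a postfix is an assumption of the sketch.

```python
import re

# Multipliers to nanoseconds for the postfixes listed above.
# The postfixes are case sensitive: "m" is milliseconds, "M" is minutes.
NS_PER_UNIT = {
    "n": 1, "ns": 1,
    "u": 1_000, "us": 1_000,
    "m": 1_000_000, "ms": 1_000_000,
    "s": 1_000_000_000,
    "M": 60 * 1_000_000_000,
    "H": 3_600 * 1_000_000_000,
    "D": 86_400 * 1_000_000_000,
}

def to_nanoseconds(text):
    """Convert a value such as "500u" or "5M" into nanoseconds."""
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", text.strip())
    # Assumption of this sketch: a postfix is required and must be known.
    if match is None or match.group(2) not in NS_PER_UNIT:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * NS_PER_UNIT[match.group(2)]

print(to_nanoseconds("500u"))  # 500000
print(to_nanoseconds("50m"))   # 50000000
print(to_nanoseconds("5M"))    # 300000000000
```

The last two calls show why the case of the postfix matters: 50m is 50 milliseconds, while 5M is 5 minutes.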

4.1. XML Section: Trace configuration

The basic trace behavior is determined by the first node of the XML file, <trace>, which contains all of the remaining options. It looks like:

<?xml version='1.0'?>
 
<trace enabled="yes"
  home="@sed_MYPREFIXDIR@"
  initial-mode="detail"
  type="paraver"
>
 
< ... other XML nodes ... >
  
</trace>

The <?xml version='1.0'?> declaration is mandatory for all XML files and must not be modified. The available tunable options are attributes of the <trace> node:

  • enabled Set to yes if you want to generate tracefiles.
  • home Set to where the instrumentation package is installed. Usually it points to the same location as the EXTRAE_HOME environment variable.
  • initial-mode Available options
    • detail Provides detailed information of the tracing.
    • bursts Provides summarized information of the tracing. This mode removes most of the information present in the detailed traces (like OpenMP and MPI calls among others) and only produces information for computation bursts.
  • type Available options
    • paraver The intermediate files are meant to generate Paraver tracefiles.
    • dimemas The intermediate files are meant to generate Dimemas tracefiles.

See also

EXTRAE_ON, EXTRAE_HOME, EXTRAE_INITIAL_MODE and EXTRAE_TRACE_TYPE environment variables in appendix Environment variables.

4.2. XML Section: MPI

The MPI configuration part is nested in the config file (see section XML Section: Trace configuration) and its nodes are the following:

<mpi enabled="yes">
  <counters enabled="yes" />
  <comm-calls enabled="yes" />
</mpi>

Extrae can gather performance counter information at the beginning and end of MPI calls. To activate this behavior, set the enabled attribute of the nested <counters> node to yes. When <comm-calls> is set to no, certain MPI_Comm_* calls (_rank, _size) are excluded from instrumentation to reduce tracing overhead.

See also

EXTRAE_DISABLE_MPI and EXTRAE_MPI_COUNTERS_ON environment variables in appendix Environment variables.

4.3. XML Section: pthread

The pthread configuration part is nested in the config file (see section XML Section: Trace configuration) and its nodes are the following:

<pthread enabled="yes">
  <locks enabled="no" />
  <counters enabled="yes" />
</pthread>

The tracing package can gather information about some pthread routines. In addition, the user can enable gathering information about locks and gathering performance counters in all of these routines. This is achieved by modifying the enabled attribute of the <locks> and <counters> nodes, respectively.

See also

EXTRAE_DISABLE_PTHREAD, EXTRAE_PTHREAD_LOCKS and EXTRAE_PTHREAD_COUNTERS_ON environment variables in appendix Environment variables.

4.4. XML Section: OpenMP

The OpenMP configuration part is nested in the config file (see section XML Section: Trace configuration) and its nodes are the following:

<openmp enabled="yes" ompt="no">
  <locks enabled="no" />
  <task dependencies="yes" />
  <taskloop enabled="yes" dependencies="yes"/>
  <counters enabled="yes" />
</openmp>

The tracing package can gather information about some OpenMP runtimes and outlined routines. In addition, the user can enable gathering information about locks and gathering performance counters in all of these routines. This is achieved by modifying the enabled attribute of the <locks> and <counters> nodes, respectively.

See also

EXTRAE_DISABLE_OMP, EXTRAE_OMP_LOCKS and EXTRAE_OMP_COUNTERS_ON environment variables in appendix Environment variables.

4.5. XML Section: Callers

<callers enabled="yes">
  <mpi enabled="yes">1-3</mpi>
  <sampling enabled="no">1-5</sampling>
  <dynamic-memory enabled="no">1-5</dynamic-memory>
</callers>

Callers are the routine addresses present in the process stack at any given moment during the application run. Callers can be used to link the tracefile with the source code of the application.

The instrumentation library can collect a partial view of those addresses during the instrumentation. The collected addresses are translated by the merging process if the corresponding parameter is given and the application has been compiled and linked with debug information.

There are three points where the instrumentation can gather this information:

  • Entry of MPI calls
  • Sampling points (if sampling is available in the tracing package)
  • Dynamic memory calls (malloc, free, realloc)

The user can choose which addresses to save in the trace by specifying stack levels, starting from 1, which is the level closest to the MPI call or sampling point. Several levels can be given by separating them with commas, and a range of levels can be given with a hyphen (for example, 1-3).
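As an illustration of how such a level specification reads, the following Python sketch expands it into the individual stack levels. The expand_levels helper is hypothetical and is not part of Extrae; accepting commas and hyphens in the same specification is an assumption of the sketch.

```python
def expand_levels(spec):
    """Expand a specification such as "1-3" or "1,3,5-6" into stack levels."""
    levels = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A hyphen denotes an inclusive range of levels.
            first, last = (int(x) for x in part.split("-", 1))
            levels.extend(range(first, last + 1))
        else:
            levels.append(int(part))
    return levels

print(expand_levels("1-3"))      # [1, 2, 3]
print(expand_levels("1,3,5-6"))  # [1, 3, 5, 6]
```

With the value 1-3 used for MPI calls in the example above, the three routines closest to each MPI call would be recorded.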

See also

EXTRAE_MPI_CALLER environment variable in appendix Environment variables.

4.6. XML Section: User functions

<user-functions enabled="no"
  list="/home/bsc41/bsc41273/user-functions.dat"
  exclude-automatic-functions="no">
  <counters enabled="yes" />
</user-functions>

The file given in the list attribute contains the functions to be instrumented by Extrae. There are different alternatives to instrument application functions, and some of them provide additional flexibility. As a result, the format of the list depends on the instrumentation mechanism used:

  • DynInst Supports instrumentation of user functions, outer loops, loops and basic blocks. The given list contains the names of the functions to be instrumented. After each function name, you can optionally select loops or basic blocks inside that function by adding suffixes after the + character. For instance:

    • To instrument the entry and exit points of foo function just provide the function name (foo).
    • To instrument the entry and exit points of foo function plus the entry and exit points of its outer loop, suffix the function name with outerloops (i.e., foo+outerloops).
    • To instrument the entry and exit points of foo function plus the entry and exit points of its N-th loop, suffix the function name with loop_N, for instance foo+loop_3.
    • To instrument the entry and exit points of foo function plus the entry and exit points of its N-th basic block inside the function you have to use the suffix bb_N, for instance foo+bb_5. In this case, it is also possible to specifically ask for the entry or exit point of the basic block by additionally suffixing _s or _e, respectively.

    Additionally, these options can be combined by using commas, as in: foo+outerloops,loop_3,bb_3_e,bb_4_s,bb_5.

    To discover the instrumentable loops and basic blocks of a certain function you can execute the command ${EXTRAE_HOME}/bin/extrae -config extrae.xml -decodeBB, where extrae.xml is an Extrae configuration file whose user functions list attribute names the functions you want to get the information for.

  • GCC and ICC (through -finstrument-functions) GNU and Intel compilers provide a compile and link flag named -finstrument-functions that instruments the routines of a source code file, and Extrae can use this instrumentation. To use this functionality, a file containing the names of the functions to be instrumented has to be provided. Compile the executable using the flag -rdynamic (or link it using -export-dynamic) in order to make the functions visible. For instance, to instrument the functions foo, bar and baz the user would create a file with:

    foo
    bar
    baz
    

    In specific cases (e.g., functions declared inside a Fortran CONTAINS construct) the user may also need to provide the function address as given by the command nm. For instance, to instrument the routine pi_kernel from the pi executable the user would run nm as follows:

    $ nm -a pi | grep pi_kernel
    
    00000000004005ed T pi_kernel
    

    and then add <FUNCTION_NAME> # <HEX_ADDRESS> into the function list:

    pi_kernel # 00000000004005ed
    

The exclude-automatic-functions attribute is used only by the DynInst instrumenter. By setting this attribute to yes the instrumenter will avoid automatically instrumenting the routines that either call OpenMP outlined routines (i.e., routines with OpenMP pragmas) or call CUDA kernels.

Finally, in order to gather performance counters in these functions and also in those instrumented using the extrae_user_function API call, the node counters has to be enabled.

Warning

Note that you need to compile your application binary with debugging information (typically the -g compiler flag) in order to translate the captured addresses into valuable information such as: function name, file name and line number.

See also

EXTRAE_FUNCTIONS environment variable in appendix Environment variables.

4.7. XML Section: Performance counters

The instrumentation library can be compiled with support for collecting performance metrics of different components available on the system. These components include:

  • Processor performance counters. Such access is granted by PAPI [1] or PMAPI [2].
  • Network performance counters. (Only available in systems with Myrinet GM/MX networks).
  • Operating system accounting.

Here is an example of the counters section in the XML configuration file:

<counters enabled="yes">
  <cpu enabled="yes" starting-set-distribution="1">
    <set enabled="yes" domain="all" changeat-time="5s">
      PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_L1_DCM
      <sampling enabled="yes" period="100000000">PAPI_TOT_CYC</sampling>
    </set>
    <set enabled="yes" domain="user" changeat-globalops="5">
      PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_FP_INS
    </set>
  </cpu>
  <network enabled="yes" />
  <resource-usage enabled="yes" />
</counters>

See also

EXTRAE_COUNTERS, EXTRAE_NETWORK_COUNTERS and EXTRAE_RUSAGE environment variables in appendix Environment variables.

4.7.1. Processor performance counters

Processor performance counters are configured in the <cpu> nodes. The user can configure many sets in the <cpu> node using the <set> node, but just one set will be used at any given time in a specific task. The <cpu> node supports the starting-set-distribution attribute with the following accepted values:

  • number (in range 1..N, where N is the number of configured sets) All tasks will start using the set specified by number.
  • block Each task will start using the given sets distributed in blocks (i.e., if two sets are defined and there are four running tasks: tasks 1 and 2 will use set 1, and tasks 3 and 4 will use set 2).
  • cyclic Each task will start using the given sets distributed cyclically (i.e., if two sets are defined and there are four running tasks: tasks 1 and 3 will use set 1, and tasks 2 and 4 will use set 2).
  • thread-cyclic Sets will be distributed cyclically between tasks and threads in a task.
  • random Each task will start using a random set, and also calls either to Extrae_next_hwc_set or Extrae_previous_hwc_set will change to a random set.
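The block and cyclic distributions can be sketched in a few lines of Python. This is an illustration of the two-set, four-task example given above, not the code used by the tracing library; the starting_set helper is hypothetical, and the way blocks are sized when the number of tasks is not a multiple of the number of sets is an assumption of the sketch.

```python
def starting_set(task, num_sets, num_tasks, distribution):
    """Return the 1-based set that a 1-based task starts with."""
    if distribution == "cyclic":
        return (task - 1) % num_sets + 1
    if distribution == "block":
        # Assumption of this sketch: equally sized blocks, rounded up
        # when num_tasks is not a multiple of num_sets.
        block_size = -(-num_tasks // num_sets)  # ceiling division
        return (task - 1) // block_size + 1
    raise ValueError(f"unsupported distribution: {distribution!r}")

tasks = range(1, 5)  # four running tasks, two configured sets
print([starting_set(t, 2, 4, "block") for t in tasks])   # [1, 1, 2, 2]
print([starting_set(t, 2, 4, "cyclic") for t in tasks])  # [1, 2, 1, 2]
```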

Each set contains a list of performance counters to be gathered at different instrumentation points (see sections XML Section: MPI, XML Section: OpenMP and XML Section: User functions). If the tracing library is compiled to support PAPI, performance counters must be given using the canonical name (like PAPI_TOT_CYC and PAPI_L1_DCM), or the PAPI code in hexadecimal format (like 8000003b and 80000000, respectively) [3]. If the tracing library is compiled to support PMAPI, only one group identifier can be given per set [4], and it can be either the group name (like pm_basic and pm_hpmcount1) or the group number (like 6 and 22, respectively).

In the given example (which refers to PAPI support in the tracing library) two sets are defined. The first set will read PAPI_TOT_INS (total instructions), PAPI_TOT_CYC (total cycles) and PAPI_L1_DCM (level 1 data cache misses). The second set is configured to obtain PAPI_TOT_INS (total instructions), PAPI_TOT_CYC (total cycles) and PAPI_FP_INS (floating point instructions).

Additionally, if the underlying performance library supports sampling mechanisms, each set can be configured to gather information (see section XML Section: Callers) each time the specified counter reaches a specific value. The counter that is used for sampling must be present in the set. In the given example, the first set is enabled to gather sampling information every 100M cycles.

Furthermore, performance counters can be configured to report accounting on different basis depending on the domain attribute specified on each set. Available options are:

  • kernel Only counts events that occurred when the application is running in kernel mode.
  • user Only counts events that occurred when the application is running in user-space mode.
  • all Counts events independently of the application running mode.

In the given example, the first set is configured to count all the events that occurred, while the second one only counts those events that occurred when the application is running in user-space mode.

Finally, the instrumentation can change the active set manually or automatically. To change the active set manually, see the Extrae_previous_hwc_set and Extrae_next_hwc_set API calls in section Basic API. To change the active set automatically, two options are allowed: based on time and based on application code. The former mechanism requires adding the attribute changeat-time and specifying the minimum time to hold the set. The latter requires adding the attribute changeat-globalops with a value. The tracing library will automatically change the active set when the application has executed as many MPI global operations as selected in that attribute. In any case, if either attribute is set to zero, then the set will not be changed automatically.

4.7.2. Network performance counters

Network performance counters are only available on systems with Myrinet GM/MX networks and they are fixed depending on the firmware used. Other systems, like BG/* may provide some network performance counters, but they are accessed through the PAPI interface (see section XML Section: Performance counters and PAPI documentation).

If <network> is enabled the network performance counters appear at the end of the application run, giving a summary for the whole run.

4.7.3. Operating system accounting

Operating system accounting is obtained through the getrusage(2) system call when <resource-usage> is enabled. Like the network performance counters, the accounting data appears at the end of the application run, giving a summary for the whole run.

4.8. XML Section: Storage management

The instrumentation package can be instructed on what, where and how to produce the intermediate trace files. These are the available options:

<storage enabled="no">
  <trace-prefix enabled="yes">TRACE</trace-prefix>
  <size enabled="no">5</size>
  <temporal-directory enabled="yes">/scratch</temporal-directory>
  <final-directory enabled="yes">/gpfs/scratch/bsc41/bsc41273</final-directory>
</storage>

Such options refer to:

  • trace-prefix Sets the intermediate trace file prefix. Its default value is TRACE.
  • size Lets the user restrict the maximum size (in megabytes) of each resulting intermediate trace file [5].
  • temporal-directory Where the intermediate trace files will be stored during the execution of the application. By default they are stored in the current directory. If the directory does not exist, the instrumentation will try to make it.
  • final-directory Where the intermediate trace files will be stored once the execution has been finished. By default they are stored in the current directory. If the directory does not exist, the instrumentation will try to make it.

4.9. XML Section: Buffer management

Modify the buffer management entry to tune the tracing buffer behavior.

<buffer enabled="yes">
  <size enabled="yes">150000</size>
  <circular enabled="no" />
</buffer>

By default (even if the enabled attribute is no) the tracing buffer is set to 500k events. If <size> is enabled, the tracing buffer will be set to the number of events indicated by this node. If the circular option is enabled, the buffer will be created as a circular buffer and it will be dumped only once, with the last events generated by the tracing package.

See also

EXTRAE_BUFFER_SIZE environment variable in appendix Environment variables.

4.10. XML Section: Trace control

<trace-control enabled="yes">
  <file enabled="no" frequency="5M">/gpfs/scratch/bsc41/bsc41273/control</file>
  <global-ops enabled="no">10</global-ops>
  <remote-control enabled="yes">
    <mrnet enabled="yes" target="150" analysis="spectral" start-after="30">
      <clustering max_tasks="26" max_points="8000"/>
      <spectral min_seen="1" max_periods="0" num_iters="3" signals="DurBurst,InMPI"/>
    </mrnet>
  </remote-control>
</trace-control>

This section groups together a set of options to limit/reduce the final trace size. There are three mechanisms which are based on file existence, global operations executed and external remote control procedures.

Regarding the file, the application starts with the tracing disabled, and it is turned on when a control file is created. Use the property frequency to choose at which frequency this check must be done. If not supplied, it will be checked every 100 global operations on MPI_COMM_WORLD.

If the global-ops tag is enabled, the instrumentation package begins disabled and starts the tracing when the given number of global operations on MPI_COMM_WORLD has been executed. The user can also specify multiple intervals in the form start-stop separated by commas.

The remote-control tag section configures external mechanisms that control the tracing automatically. Currently, there is only one option, which is built on top of MRNet and is based on clustering and spectral analysis to generate a small yet representative trace.

These are the options in the mrnet tag:

  • target the approximate requested size for the final trace (in Mb).
  • analysis either clustering or spectral.
  • start-after number of seconds before the first analysis starts.

The clustering tag configures the clustering analysis parameters:

  • max_tasks maximum number of tasks to get samples from.
  • max_points maximum number of points to cluster.

The spectral tag section configures the spectral analysis parameters:

  • min_seen minimum times a given type of period has to be seen to trace a sample.
  • max_periods maximum number of representative periods to trace. 0 means unlimited.
  • num_iters number of iterations to trace for every representative period found.
  • signals performance signals used to analyze the application. If not specified, DurBurst is used by default.

4.11. XML Section: Bursts

<burst enabled="no" threshold="500u" mpi-statistics="yes" omp-statistics="yes" omp-summarization="no" />

If the user enables this option, the instrumentation library will just emit information of computation bursts (i.e., it does not trace MPI calls, OpenMP runtime, and so on) when the current mode (set through initial-mode in section XML Section: Trace configuration) is bursts. The library will discard all computation bursts that last less than the selected threshold.

In addition to that, when the tracing library is running in burst mode, it computes some statistics of MPI activity. Such statistics can be dumped in the tracefile by enabling mpi-statistics.

4.12. XML Section: Others

<others enabled="yes">
  <minimum-time enabled="no">10M</minimum-time>
  <finalize-on-signal enabled="yes" 
    SIGUSR1="no" SIGUSR2="no" SIGINT="yes"
    SIGQUIT="yes" SIGTERM="yes" SIGXCPU="yes"
    SIGFPE="yes" SIGSEGV="yes" SIGABRT="yes"
  />
  <flush-sampling-buffer-at-instrumentation-point enabled="yes" />
</others>

This section contains other configuration details that do not fit in the previous sections. At the moment, there are three options to be configured.

  • The minimum-time option tells the instrumentation package the minimum instrumentation time. To enable it, set enabled to yes and set the minimum time within the minimum-time tag.
  • The option labeled as finalize-on-signal instructs the instrumentation package to listen for different types of signals [6] and to dump the trace and finalize the execution whenever they occur. If a signal occurs but it is not configured, then the execution may finish without generating the trace-file. Caveat: Some MPI implementations use SIGUSR1 and/or SIGUSR2, so if you want to capture those signals check first that enabling them does not interfere with the application execution.
  • The flush-sampling-buffer-at-instrumentation-point lets the user decide whether the sampling buffer should be checked for flushing at instrumentation points. If this option is not enabled, then the buffer will only be dumped once at the end of the application execution.

4.13. XML Section: Sampling

<sampling enabled="no" type="default" period="50m" variability="10m"/>

This section configures the time-based sampling capabilities. Every sample contains processor performance counters (if enabled in section Processor performance counters and either PAPI or PMAPI are referred at configure time) and callstack information (if enabled in section XML Section: Callers and proper dependencies are set at configure time).

This section contains three attributes besides enabled. These are:

  • type determines which timer domain is used (see setitimer(2) or setitimer(3p) for further information on time domains). Available options are: real (which is also the default value), virtual and prof (which use the SIGALRM, SIGVTALRM and SIGPROF signals, respectively). The default timing accumulates real time, but only issues samples at the master thread. To let all the threads collect samples, the type must be virtual or prof.
  • period specifies the sampling periodicity. In the example above, samples are gathered every 50ms.
  • variability specifies the variability of the sampling periodicity. This variability is calculated through the random() function and then added to the periodicity. In the given example, the variability is set to 10ms, thus the final sampling period ranges from 45 to 55ms.
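The relation between period and variability can be checked with a small Python sketch. It does not reproduce the library's implementation: the next_sampling_period helper is hypothetical, and drawing the offset uniformly from half the variability on each side of the period is an assumption chosen to match the 45 to 55ms range stated above.

```python
import random

def next_sampling_period(period_ms, variability_ms):
    """Return one sampling period, in milliseconds, with a random offset."""
    # Assumption of this sketch: the offset is uniform in
    # [-variability/2, +variability/2], centered on the nominal period.
    offset = random.uniform(-variability_ms / 2, variability_ms / 2)
    return period_ms + offset

# period="50m" and variability="10m" as in the example above
periods = [next_sampling_period(50, 10) for _ in range(1000)]
print(all(45 <= p <= 55 for p in periods))  # True
```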

4.14. XML Section: CUDA

<cuda enabled="yes" />

This section indicates whether the CUDA calls should be instrumented or not. If enabled is set to yes, CUDA calls will be instrumented, otherwise they will not be instrumented.

4.15. XML Section: OpenACC

<openacc enabled="yes" />

This section indicates whether OpenACC host activity should be instrumented. If enabled is set to yes, OpenACC activity made by the host will be instrumented, otherwise it will not. If the user wants to capture device activity, they must also enable CUDA instrumentation.

4.16. XML Section: OpenCL

<opencl enabled="yes" />

This section indicates whether the OpenCL calls should be instrumented or not. If enabled is set to yes, OpenCL calls will be instrumented, otherwise they will not be instrumented.

4.17. XML Section: Input/Output

<input-output enabled="no" internals="no"/>

This section indicates whether I/O calls (read and write) are meant to be instrumented. If enabled is set to yes, the aforementioned calls will be instrumented, otherwise they will not be instrumented. If internals is set to yes, I/O calls that occur inside other traced calls will also be captured.

4.18. XML Section: Dynamic memory

<dynamic-memory enabled="no">
  <alloc enabled="yes" threshold="32768" />
  <free  enabled="yes" />
</dynamic-memory>

This section indicates whether dynamic memory calls (malloc, free, realloc) are meant to be instrumented. If enabled is set to yes, the aforementioned calls will be instrumented, otherwise they will not be instrumented.

This section allows deciding whether allocation and free-related memory calls shall be instrumented.

Additionally, the configuration can also indicate whether allocation calls should be instrumented if the requested memory size surpasses a given threshold (32768 bytes, in the example).

4.19. XML Section: Memory references through Intel PEBS sampling

<pebs-sampling enabled="yes">
  <loads  enabled="yes" frequency="100" minimum-latency="10" />
  <stores enabled="no" frequency="50">
    <offcore-l3-misses enabled="no" /> <!-- Read together with stores samples. -->
  </stores>
  <load-l3-misses enabled="no" frequency="25" />
</pebs-sampling>

This section tells Extrae to use the PEBS feature from recent Intel processors [7] to sample memory references. These memory references capture the linear address referenced, the component of the memory hierarchy that solved the reference and the number of cycles to solve the reference.

In the example above, PEBS samples load instructions at a frequency of 100 Hz, considering only loads that require at least 10 cycles to be solved. Alternatively, the ‘frequency’ setting can be replaced by ‘period’, in which case PEBS will monitor one load instruction every given number of loads. Please note that the ‘period’ setting is not available for Skylake and newer processors.

4.20. XML Section: Executing CPU identification

<cpu-events enabled="no" frequency="0" emit-always="no" poi="none|openmp" />

By default, the identifier of the core where every thread is executing is emitted at initialization points. This section enables extra measurements, configurable by time (‘frequency’) and points of interest (‘poi’). When ‘frequency’ is set, new measurements are emitted at instrumentation points if enough time has passed since the previous measurement. When ‘poi’ is set to ‘openmp’, new measurements are emitted at the entry and exit points of OpenMP outlined functions, tasks, and work dispatches. This option may introduce a high and variable overhead, so use it with caution.

4.21. XML Section: Merge

<merge enabled="yes" 
  synchronization="default"
  binary="mpi_ping"
  tree-fan-out="16"
  max-memory="512"
  joint-states="yes"
  keep-mpits="yes"
  translate-addresses="yes"
  sort-addresses="yes"
  translate-data-addresses="yes"
  overwrite="yes"
  stop-at-percentage="50"
>
  mpi_ping.prv
</merge>

If this section is enabled and the instrumentation package is configured to support this, the merge process will be automatically invoked after the application run. The merge process will use all the resources devoted to run the application.

In the given example, the leaf of this node will be used as the tracefile name (mpi_ping.prv). The currently available options for the merge process are given as attributes of the <merge> node:

  • synchronization: which can be set to default, node, task, no. This determines how task clocks will be synchronized (default is node).
  • binary: points to the binary that is being executed. It will be used to translate gathered addresses (MPI callers, sampling points and user functions) into source code references.
  • tree-fan-out: only for MPI executions; sets the tree-based topology used to run the merger in a parallel fashion.
  • max-memory: limits the intermediate merging process to run up to the specified limit (in MBytes).
  • joint-states: which can be set to yes, no. Determines if the resulting Paraver tracefile will split or join equal consecutive states (default is yes).
  • keep-mpits: whether to keep the intermediate tracefiles after performing the merge.
  • translate-addresses: whether to identify the calling site of instrumented calls and samples by the specified levels of the callstack (enabled by default); or just by their instruction addresses.
  • sort-addresses: whether to sort all addresses that refer to the source code (enabled by default).
  • translate-data-addresses: whether to identify allocated objects by their full callpath (enabled by default); or just by the tuple <library_base_address, symbol_offset_within_library>.
  • overwrite: set to yes if the new tracefile can overwrite an existing tracefile with the same name. If set to no, then the tracefile will be given a new name using a consecutive id.
  • stop-at-percentage: stops the generation of the tracefile at a given percentage. Accepts integer values from 1 to 99. All other values disable this option.

In Linux systems, the tracing package can take advantage of certain functionalities from the system and can guess the binary name, and from it the tracefile name. In such systems, you can use the following reduced XML section replacing the earlier section.

<merge enabled="yes" 
  synchronization="default"
  tree-fan-out="16"
  max-memory="512"
  joint-states="yes"
  keep-mpits="yes"
  translate-addresses="yes"
  sort-addresses="yes"
  translate-data-addresses="yes"
  overwrite="yes"
  stop-at-percentage="50"
/>

See also

For further references, see chapter Merging process.

4.22. Using environment variables within the XML file

XML tags and attributes can refer to environment variables that are defined in the environment during the application run. To refer to an environment variable within the XML file, enclose the name of the variable between dollar symbols ($), for example: $FOO$.

Note that the user has to provide either a specific value or a reference to an environment variable. Expanding environment variables inside text is not allowed as in a regular shell (i.e., the instrumentation package will not convert the following text: bar$FOO$bar).
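The rule can be illustrated with a short Python sketch. It is not the code used by the instrumentation package: the resolve helper is hypothetical, and returning an empty string for an undefined variable is an assumption of the sketch.

```python
import os
import re

def resolve(value, env=os.environ):
    """Replace a value that is exactly $NAME$ by the content of NAME."""
    match = re.fullmatch(r"\$(\w+)\$", value.strip())
    if match is None:
        # Plain text, including embedded references such as bar$FOO$bar,
        # is returned unchanged.
        return value
    # Assumption of this sketch: an undefined variable yields an empty string.
    return env.get(match.group(1), "")

env = {"FOO": "/scratch/traces"}    # hypothetical environment
print(resolve("$FOO$", env))        # /scratch/traces
print(resolve("bar$FOO$bar", env))  # bar$FOO$bar
print(resolve("TRACE", env))        # TRACE
```

Only the first call is a whole-value reference, so it is the only one that gets replaced.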

Footnotes

[1] More information is available on the PAPI website, http://icl.cs.utk.edu/papi. Extrae requires at least PAPI 3.x.
[2] PMAPI is only available for the AIX operating system, and it has been part of the base operating system since AIX 5.3. Extrae requires at least AIX 5.3.
[3] Some architectures do not allow grouping some performance counters in the same set.
[4] Each group contains several performance counters.
[5] This check is done each time the buffer is flushed, so the resulting size of the intermediate trace file depends also on the number of elements contained in the tracing buffer (see XML Section: Buffer management).
[6] See man signal(2) and man signal(7) for more details.
[7] Check for availability on your system by looking for pebs in /proc/cpuinfo.