16. Instrumented routines
16.1. MPI
These are the instrumented MPI routines in the Extrae package:
- MPI_Init
- MPI_Init_thread [1]
- MPI_Finalize
- MPI_Bsend
- MPI_Ssend
- MPI_Rsend
- MPI_Send
- MPI_Bsend_init
- MPI_Ssend_init
- MPI_Rsend_init
- MPI_Send_init
- MPI_Ibsend
- MPI_Issend
- MPI_Irsend
- MPI_Isend
- MPI_Recv
- MPI_Irecv
- MPI_Recv_init
- MPI_Reduce
- MPI_Ireduce
- MPI_Reduce_scatter
- MPI_Ireduce_scatter
- MPI_Allreduce
- MPI_Iallreduce
- MPI_Barrier
- MPI_Ibarrier
- MPI_Cancel
- MPI_Test
- MPI_Wait
- MPI_Waitall
- MPI_Waitany
- MPI_Waitsome
- MPI_Bcast
- MPI_Ibcast
- MPI_Alltoall
- MPI_Ialltoall
- MPI_Alltoallv
- MPI_Ialltoallv
- MPI_Allgather
- MPI_Iallgather
- MPI_Allgatherv
- MPI_Iallgatherv
- MPI_Gather
- MPI_Igather
- MPI_Gatherv
- MPI_Igatherv
- MPI_Scatter
- MPI_Iscatter
- MPI_Scatterv
- MPI_Iscatterv
- MPI_Comm_rank
- MPI_Comm_size
- MPI_Comm_create
- MPI_Comm_create_group
- MPI_Comm_free
- MPI_Comm_dup
- MPI_Comm_dup_with_info
- MPI_Comm_split
- MPI_Comm_split_type
- MPI_Comm_spawn
- MPI_Comm_spawn_multiple
- MPI_Cart_create
- MPI_Cart_sub
- MPI_Start
- MPI_Startall
- MPI_Request_free
- MPI_Scan
- MPI_Iscan
- MPI_Sendrecv
- MPI_Sendrecv_replace
- MPI_File_open [2]
- MPI_File_close [2]
- MPI_File_read [2]
- MPI_File_read_all [2]
- MPI_File_read_all_begin [2]
- MPI_File_read_all_end [2]
- MPI_File_read_at [2]
- MPI_File_read_at_all [2]
- MPI_File_read_at_all_begin [2]
- MPI_File_read_at_all_end [2]
- MPI_File_read_ordered [2]
- MPI_File_read_ordered_begin [2]
- MPI_File_read_ordered_end [2]
- MPI_File_read_shared [2]
- MPI_File_write [2]
- MPI_File_write_all [2]
- MPI_File_write_all_begin [2]
- MPI_File_write_all_end [2]
- MPI_File_write_at [2]
- MPI_File_write_at_all [2]
- MPI_File_write_at_all_begin [2]
- MPI_File_write_at_all_end [2]
- MPI_File_write_ordered [2]
- MPI_File_write_ordered_begin [2]
- MPI_File_write_ordered_end [2]
- MPI_File_write_shared [2]
- MPI_Compare_and_swap [3]
- MPI_Fetch_and_op [3]
- MPI_Get [3]
- MPI_Put [3]
- MPI_Win_complete [3]
- MPI_Win_create [3]
- MPI_Win_fence [3]
- MPI_Win_flush [3]
- MPI_Win_flush_all [3]
- MPI_Win_flush_local [3]
- MPI_Win_flush_local_all [3]
- MPI_Win_free [3]
- MPI_Win_post [3]
- MPI_Win_start [3]
- MPI_Win_wait [3]
- MPI_Probe
- MPI_Iprobe
- MPI_Testall
- MPI_Testany
- MPI_Testsome
- MPI_Request_get_status
- MPI_Intercomm_create
- MPI_Intercomm_merge
- MPI_Graph_create
- MPI_Dist_graph_create
- MPI_Neighbor_allgather
- MPI_Ineighbor_allgather
- MPI_Neighbor_allgatherv
- MPI_Ineighbor_allgatherv
- MPI_Neighbor_alltoall
- MPI_Ineighbor_alltoall
- MPI_Neighbor_alltoallv
- MPI_Ineighbor_alltoallv
- MPI_Neighbor_alltoallw
- MPI_Ineighbor_alltoallw
16.2. OpenMP
16.2.1. Intel compilers - icc, iCC, ifort
The instrumentation of the Intel OpenMP runtime for versions 8.1 to 10.1 is only available using the Extrae package based on the DynInst library.
These are the instrumented routines of the Intel OpenMP runtime using DynInst:
The instrumentation of the Intel OpenMP runtime for versions 11.0 to 12.0 is available using the Extrae package based on either the LD_PRELOAD or the DynInst mechanism. The instrumented routines include:
- __kmpc_fork_call
- __kmpc_barrier
- __kmpc_dispatch_init_4
- __kmpc_dispatch_init_8
- __kmpc_dispatch_next_4
- __kmpc_dispatch_next_8
- __kmpc_dispatch_fini_4
- __kmpc_dispatch_fini_8
- __kmpc_single
- __kmpc_end_single
- __kmpc_critical [4]
- __kmpc_end_critical [4]
- omp_set_lock [4]
- omp_unset_lock [4]
- __kmpc_omp_task_alloc
- __kmpc_omp_task_begin_if0
- __kmpc_omp_task_complete_if0
- __kmpc_omp_taskwait
16.2.2. IBM compilers - xlc, xlC, xlf
Extrae supports IBM OpenMP runtime 1.6.
These are the instrumented routines of the IBM OpenMP runtime:
16.2.3. GNU compilers - gcc, g++, gfortran
Extrae supports GNU OpenMP runtime 4.2 and 4.9.
These are the instrumented routines of the GNU OpenMP runtime:
- GOMP_parallel_start
- GOMP_parallel_sections_start
- GOMP_parallel_end
- GOMP_sections_start
- GOMP_sections_next
- GOMP_sections_end
- GOMP_sections_end_nowait
- GOMP_loop_end
- GOMP_loop_end_nowait
- GOMP_loop_static_start
- GOMP_loop_dynamic_start
- GOMP_loop_guided_start
- GOMP_loop_runtime_start
- GOMP_loop_ordered_static_start
- GOMP_loop_ordered_dynamic_start
- GOMP_loop_ordered_guided_start
- GOMP_loop_ordered_runtime_start
- GOMP_loop_static_next
- GOMP_loop_dynamic_next
- GOMP_loop_guided_next
- GOMP_loop_runtime_next
- GOMP_parallel_loop_static_start
- GOMP_parallel_loop_dynamic_start
- GOMP_parallel_loop_guided_start
- GOMP_parallel_loop_runtime_start
- GOMP_barrier
- GOMP_critical_start [4]
- GOMP_critical_end [4]
- GOMP_critical_name_start [4]
- GOMP_critical_name_end [4]
- GOMP_atomic_start [4]
- GOMP_atomic_end [4]
- GOMP_task
- GOMP_taskwait
- GOMP_parallel
- GOMP_taskgroup_start
- GOMP_taskgroup_end
16.3. pthread
These are the instrumented routines of the pthread runtime:
- pthread_create
- pthread_detach
- pthread_join
- pthread_exit
- pthread_barrier_wait
- pthread_mutex_lock
- pthread_mutex_trylock
- pthread_mutex_timedlock
- pthread_mutex_unlock
- pthread_rwlock_rdlock
- pthread_rwlock_tryrdlock
- pthread_rwlock_timedrdlock
- pthread_rwlock_wrlock
- pthread_rwlock_trywrlock
- pthread_rwlock_timedwrlock
- pthread_rwlock_unlock
16.4. CUDA
These are the instrumented CUDA routines in the Extrae package:
- cudaLaunch
- cudaConfigureCall
- cudaThreadSynchronize
- cudaThreadExit
- cudaStreamCreate
- cudaStreamCreateWithFlags
- cudaStreamCreateWithPriority
- cudaStreamSynchronize
- cudaStreamDestroy
- cudaMemcpy
- cudaMemcpyAsync
- cudaDeviceReset
- cudaDeviceSynchronize
CUDA accelerators do not have memory for the tracing buffers, so the tracing buffer resides on the host side.
Typically, the CUDA tracing buffer is flushed at cudaThreadSynchronize and cudaMemcpy calls, so the tracing buffer for the device may fill up if no calls to these routines are executed.
These are the instrumented OpenACC routines in the Extrae package:
- OACC_init
- OACC_compute
- OACC_data
- OACC_data_alloc
- OACC_data_update
- OACC_launch
- OACC_update
- OACC_wait
16.5. OpenCL
These are the instrumented OpenCL routines in the Extrae package:
- clBuildProgram
- clCompileProgram
- clCreateBuffer
- clCreateCommandQueue
- clCreateContext
- clCreateContextFromType
- clCreateKernel
- clCreateKernelsInProgram
- clCreateProgramWithBinary
- clCreateProgramWithBuiltInKernels
- clCreateProgramWithSource
- clCreateSubBuffer
- clEnqueueBarrierWithWaitList
- clEnqueueBarrier
- clEnqueueCopyBuffer
- clEnqueueCopyBufferRect
- clEnqueueFillBuffer
- clEnqueueMarkerWithWaitList
- clEnqueueMarker
- clEnqueueMapBuffer
- clEnqueueMigrateMemObjects
- clEnqueueNativeKernel
- clEnqueueNDRangeKernel
- clEnqueueReadBuffer
- clEnqueueReadBufferRect
- clEnqueueTask
- clEnqueueUnmapMemObject
- clEnqueueWriteBuffer
- clEnqueueWriteBufferRect
- clFinish
- clFlush
- clLinkProgram
- clSetKernelArg
- clWaitForEvents
- clRetainCommandQueue
- clReleaseCommandQueue
- clRetainContext
- clReleaseContext
- clRetainDevice
- clReleaseDevice
- clRetainEvent
- clReleaseEvent
- clRetainKernel
- clReleaseKernel
- clRetainMemObject
- clReleaseMemObject
- clRetainProgram
- clReleaseProgram
OpenCL accelerators have small amounts of memory, so the tracing buffer resides on the host side.
Typically, the accelerator tracing buffer is flushed at each clFinish call, so the tracing buffer for the accelerator may fill up if no calls to this routine are executed.
However, if the OpenCL command queue in use is not tagged as Out-of-Order, flushes also happen at clEnqueueReadBuffer, clEnqueueReadBufferRect and clEnqueueMapBuffer when their corresponding blocking parameter is set to true.
[1] The MPI library must support this routine.
[2] The MPI library must support MPI/IO routines.
[3] The MPI library must support one-sided (RMA, remote memory access) routines.
[4] The instrumentation of OpenMP locks can be enabled or disabled.