16. Instrumented routines
16.1. MPI
These are the instrumented MPI routines in the Extrae package (a brief usage sketch follows the list):
- MPI_Init
- MPI_Init_thread [1]
- MPI_Finalize
- MPI_Bsend
- MPI_Ssend
- MPI_Rsend
- MPI_Send
- MPI_Bsend_init
- MPI_Ssend_init
- MPI_Rsend_init
- MPI_Send_init
- MPI_Ibsend
- MPI_Issend
- MPI_Irsend
- MPI_Isend
- MPI_Recv
- MPI_Irecv
- MPI_Recv_init
- MPI_Reduce
- MPI_Ireduce
- MPI_Reduce_scatter
- MPI_Ireduce_scatter
- MPI_Allreduce
- MPI_Iallreduce
- MPI_Barrier
- MPI_Ibarrier
- MPI_Cancel
- MPI_Test
- MPI_Wait
- MPI_Waitall
- MPI_Waitany
- MPI_Waitsome
- MPI_Bcast
- MPI_Ibcast
- MPI_Alltoall
- MPI_Ialltoall
- MPI_Alltoallv
- MPI_Ialltoallv
- MPI_Allgather
- MPI_Iallgather
- MPI_Allgatherv
- MPI_Iallgatherv
- MPI_Gather
- MPI_Igather
- MPI_Gatherv
- MPI_Igatherv
- MPI_Scatter
- MPI_Iscatter
- MPI_Scatterv
- MPI_Iscatterv
- MPI_Comm_rank
- MPI_Comm_size
- MPI_Comm_create
- MPI_Comm_create_group
- MPI_Comm_free
- MPI_Comm_dup
- MPI_Comm_dup_with_info
- MPI_Comm_split
- MPI_Comm_split_type
- MPI_Comm_spawn
- MPI_Comm_spawn_multiple
- MPI_Cart_create
- MPI_Cart_sub
- MPI_Start
- MPI_Startall
- MPI_Request_free
- MPI_Scan
- MPI_Iscan
- MPI_Sendrecv
- MPI_Sendrecv_replace
- MPI_File_open [2]
- MPI_File_close [2]
- MPI_File_read [2]
- MPI_File_read_all [2]
- MPI_File_read_all_begin [2]
- MPI_File_read_all_end [2]
- MPI_File_read_at [2]
- MPI_File_read_at_all [2]
- MPI_File_read_at_all_begin [2]
- MPI_File_read_at_all_end [2]
- MPI_File_read_ordered [2]
- MPI_File_read_ordered_begin [2]
- MPI_File_read_ordered_end [2]
- MPI_File_read_shared [2]
- MPI_File_write [2]
- MPI_File_write_all [2]
- MPI_File_write_all_begin [2]
- MPI_File_write_all_end [2]
- MPI_File_write_at [2]
- MPI_File_write_at_all [2]
- MPI_File_write_at_all_begin [2]
- MPI_File_write_at_all_end [2]
- MPI_File_write_ordered [2]
- MPI_File_write_ordered_begin [2]
- MPI_File_write_ordered_end [2]
- MPI_File_write_shared [2]
- MPI_Compare_and_swap [3]
- MPI_Fetch_and_op [3]
- MPI_Get [3]
- MPI_Put [3]
- MPI_Win_complete [3]
- MPI_Win_create [3]
- MPI_Win_fence [3]
- MPI_Win_flush [3]
- MPI_Win_flush_all [3]
- MPI_Win_flush_local [3]
- MPI_Win_flush_local_all [3]
- MPI_Win_free [3]
- MPI_Win_post [3]
- MPI_Win_start [3]
- MPI_Win_wait [3]
- MPI_Probe
- MPI_Iprobe
- MPI_Testall
- MPI_Testany
- MPI_Testsome
- MPI_Request_get_status
- MPI_Intercomm_create
- MPI_Intercomm_merge
- MPI_Graph_create
- MPI_Dist_graph_create
- MPI_Neighbor_allgather
- MPI_Ineighbor_allgather
- MPI_Neighbor_allgatherv
- MPI_Ineighbor_allgatherv
- MPI_Neighbor_alltoall
- MPI_Ineighbor_alltoall
- MPI_Neighbor_alltoallv
- MPI_Ineighbor_alltoallv
- MPI_Neighbor_alltoallw
- MPI_Ineighbor_alltoallw
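As an illustration, the sketch below (not part of the Extrae sources) uses only routines from the list above, so every MPI call it makes is one that Extrae intercepts. How the instrumentation is attached to the binary (LD_PRELOAD, DynInst or static linking) depends on your installation.

```c
/* mpi_sketch.c -- build with: mpicc mpi_sketch.c -o mpi_sketch
 * Every MPI call below appears in the instrumented-routine list above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value, sum;

    MPI_Init(&argc, &argv);                /* instrumented */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* instrumented */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* instrumented */

    value = rank;
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); /* instrumented */

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Barrier(MPI_COMM_WORLD);           /* instrumented */
    MPI_Finalize();                        /* instrumented */
    return 0;
}
```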
16.2. OpenMP
16.2.1. Intel compilers - icc, iCC, ifort
The instrumentation of the Intel OpenMP runtime for versions 8.1 to 10.1 is only available using the Extrae package based on the DynInst library.
These are the instrumented routines of the Intel OpenMP runtime when using DynInst:
The instrumentation of the Intel OpenMP runtime for versions 11.0 to 12.0 is available using the Extrae package based on the LD_PRELOAD and also the DynInst mechanisms. The instrumented routines include the following (a compiler-lowering sketch follows the list):
- __kmpc_fork_call
- __kmpc_barrier
- __kmpc_dispatch_init_4
- __kmpc_dispatch_init_8
- __kmpc_dispatch_next_4
- __kmpc_dispatch_next_8
- __kmpc_dispatch_fini_4
- __kmpc_dispatch_fini_8
- __kmpc_single
- __kmpc_end_single
- __kmpc_critical [4]
- __kmpc_end_critical [4]
- omp_set_lock [4]
- omp_unset_lock [4]
- __kmpc_omp_task_alloc
- __kmpc_omp_task_begin_if0
- __kmpc_omp_task_complete_if0
- __kmpc_omp_taskwait
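For illustration only, and assuming an Intel compiler in the supported range, the plain OpenMP code below is lowered by the compiler to runtime entry points such as __kmpc_fork_call (the parallel region), __kmpc_dispatch_init_4/__kmpc_dispatch_next_4 (the dynamically scheduled loop) and __kmpc_critical/__kmpc_end_critical; these are the symbols Extrae intercepts. The exact lowering depends on the compiler version.

```c
/* intel_omp_sketch.c -- build with: icc -qopenmp intel_omp_sketch.c */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double acc = 0.0;

    #pragma omp parallel                                    /* -> __kmpc_fork_call */
    {
        #pragma omp for schedule(dynamic) reduction(+:acc)  /* -> __kmpc_dispatch_* */
        for (int i = 0; i < 1000; i++)
            acc += 0.5 * i;

        #pragma omp critical       /* -> __kmpc_critical / __kmpc_end_critical */
        printf("thread %d done\n", omp_get_thread_num());
    }

    printf("acc = %f\n", acc);
    return 0;
}
```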
16.2.2. IBM compilers - xlc, xlC, xlf
Extrae supports IBM OpenMP runtime 1.6.
These are the instrumented routines of the IBM OpenMP runtime:
16.2.3. GNU compilers - gcc, g++, gfortran
Extrae supports GNU OpenMP runtime 4.2 and 4.9.
These are the instrumented routines of the GNU OpenMP runtime (a brief example follows the list):
- GOMP_parallel_start
- GOMP_parallel_sections_start
- GOMP_parallel_end
- GOMP_sections_start
- GOMP_sections_next
- GOMP_sections_end
- GOMP_sections_end_nowait
- GOMP_loop_end
- GOMP_loop_end_nowait
- GOMP_loop_static_start
- GOMP_loop_dynamic_start
- GOMP_loop_guided_start
- GOMP_loop_runtime_start
- GOMP_loop_ordered_static_start
- GOMP_loop_ordered_dynamic_start
- GOMP_loop_ordered_guided_start
- GOMP_loop_ordered_runtime_start
- GOMP_loop_static_next
- GOMP_loop_dynamic_next
- GOMP_loop_guided_next
- GOMP_loop_runtime_next
- GOMP_parallel_loop_static_start
- GOMP_parallel_loop_dynamic_start
- GOMP_parallel_loop_guided_start
- GOMP_parallel_loop_runtime_start
- GOMP_barrier
- GOMP_critical_start [4]
- GOMP_critical_end [4]
- GOMP_critical_name_start [4]
- GOMP_critical_name_end [4]
- GOMP_atomic_start [4]
- GOMP_atomic_end [4]
- GOMP_task
- GOMP_taskwait
- GOMP_parallel
- GOMP_taskgroup_start
- GOMP_taskgroup_end
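For illustration, and assuming GCC with -fopenmp, the code below is lowered to libgomp entry points from the list above: GOMP_parallel (GCC 4.9 and later; older releases emit GOMP_parallel_start/GOMP_parallel_end), GOMP_task, GOMP_taskwait and GOMP_barrier. The exact lowering depends on the GCC version.

```c
/* gnu_omp_sketch.c -- build with: gcc -fopenmp gnu_omp_sketch.c */
#include <omp.h>
#include <stdio.h>

static void work(int i)
{
    printf("task %d on thread %d\n", i, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel                       /* -> GOMP_parallel (or _start/_end) */
    {
        #pragma omp single
        {
            for (int i = 0; i < 4; i++) {
                #pragma omp task firstprivate(i)   /* -> GOMP_task */
                work(i);
            }
            #pragma omp taskwait                   /* -> GOMP_taskwait */
        }
        #pragma omp barrier                        /* -> GOMP_barrier */
    }
    return 0;
}
```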
16.3. pthread
These are the instrumented routines of the pthread runtime (a brief example follows the list):
- pthread_create
- pthread_detach
- pthread_join
- pthread_exit
- pthread_barrier_wait
- pthread_mutex_lock
- pthread_mutex_trylock
- pthread_mutex_timedlock
- pthread_mutex_unlock
- pthread_rwlock_rdlock
- pthread_rwlock_tryrdlock
- pthread_rwlock_timedrdlock
- pthread_rwlock_wrlock
- pthread_rwlock_trywrlock
- pthread_rwlock_timedwrlock
- pthread_rwlock_unlock
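A self-contained sketch, for illustration, that uses only entry points from the list above; with Extrae's pthread instrumentation active these are the calls that end up in the trace.

```c
/* pthread_sketch.c -- build with: gcc -pthread pthread_sketch.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);    /* instrumented */
    counter++;
    pthread_mutex_unlock(&lock);  /* instrumented */
    pthread_exit(NULL);           /* instrumented */
    return NULL;                  /* not reached */
}

int main(void)
{
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);  /* instrumented */
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);                   /* instrumented */

    printf("counter = %d\n", counter);
    return 0;
}
```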
16.4. CUDA
These are the instrumented CUDA routines in the Extrae package:
- cudaLaunch
- cudaConfigureCall
- cudaThreadSynchronize
- cudaThreadExit
- cudaStreamCreate
- cudaStreamCreateWithFlags
- cudaStreamCreateWithPriority
- cudaStreamSynchronize
- cudaStreamDestroy
- cudaMemcpy
- cudaMemcpyAsync
- cudaDeviceReset
- cudaDeviceSynchronize
The CUDA accelerators do not have memory for the tracing buffers, so the tracing buffer resides on the host side.
Typically, the CUDA tracing buffer is flushed at cudaThreadSynchronize, cudaStreamSynchronize and cudaMemcpy calls, so the tracing buffer for the device may fill up if none of these routines is ever called.
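A host-side sketch of those flush points, for illustration (error checking omitted): the blocking cudaMemcpy calls and the explicit synchronization give Extrae an opportunity to drain the device tracing buffer. cudaDeviceSynchronize is used here as the current replacement for cudaThreadSynchronize; both appear in the list above.

```c
/* cuda_flush_sketch.c -- build with nvcc; error checking omitted */
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t n = 1 << 20;
    char *h_buf = (char *)malloc(n);
    char *d_buf = NULL;

    memset(h_buf, 0, n);
    cudaMalloc((void **)&d_buf, n);

    cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);  /* blocking copy: flush point */

    /* ... kernels would be launched here ... */

    cudaDeviceSynchronize();                               /* explicit sync: flush point */

    cudaMemcpy(h_buf, d_buf, n, cudaMemcpyDeviceToHost);   /* blocking copy: flush point */

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```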
These are the instrumented OpenACC routines in the Extrae package (a brief example follows the list):
- OACC_init
- OACC_compute
- OACC_data
- OACC_data_alloc
- OACC_data_update
- OACC_launch
- OACC_update
- OACC_wait
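For illustration, and assuming an OpenACC-capable compiler (e.g. nvc -acc or pgcc -acc), the data and compute constructs below are the kind of activity recorded through the events above: runtime initialization, device data allocation and updates, kernel launches and waits.

```c
/* openacc_sketch.c -- build with an OpenACC compiler, e.g.: nvc -acc openacc_sketch.c */
#include <stdio.h>

int main(void)
{
    const int n = 1000;
    float a[1000], b[1000];

    for (int i = 0; i < n; i++)
        a[i] = (float)i;

    #pragma acc data copyin(a[0:n]) copyout(b[0:n])  /* data region: allocation + updates */
    {
        #pragma acc parallel loop                    /* compute construct: launch + wait */
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * a[i];
    }

    printf("b[10] = %f\n", b[10]);
    return 0;
}
```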
16.5. OpenCL
These are the instrumented OpenCL routines in the Extrae package:
- clBuildProgram
- clCompileProgram
- clCreateBuffer
- clCreateCommandQueue
- clCreateContext
- clCreateContextFromType
- clCreateKernel
- clCreateKernelsInProgram
- clCreateProgramWithBinary
- clCreateProgramWithBuiltInKernels
- clCreateProgramWithSource
- clCreateSubBuffer
- clEnqueueBarrierWithWaitList
- clEnqueueBarrier
- clEnqueueCopyBuffer
- clEnqueueCopyBufferRect
- clEnqueueFillBuffer
- clEnqueueMarkerWithWaitList
- clEnqueueMarker
- clEnqueueMapBuffer
- clEnqueueMigrateMemObjects
- clEnqueueNativeKernel
- clEnqueueNDRangeKernel
- clEnqueueReadBuffer
- clEnqueueReadBufferRect
- clEnqueueTask
- clEnqueueUnmapMemObject
- clEnqueueWriteBuffer
- clEnqueueWriteBufferRect
- clFinish
- clFlush
- clLinkProgram
- clSetKernelArg
- clWaitForEvents
- clRetainCommandQueue
- clReleaseCommandQueue
- clRetainContext
- clReleaseContext
- clRetainDevice
- clReleaseDevice
- clRetainEvent
- clReleaseEvent
- clRetainKernel
- clReleaseKernel
- clRetainMemObject
- clReleaseMemObject
- clRetainProgram
- clReleaseProgram
The OpenCL accelerators have small amounts of memory, so the tracing buffer resides on the host side.
Typically, the accelerator tracing buffer is flushed at each clFinish call, so the tracing buffer for the accelerator may fill up if this routine is never called.
However, if the OpenCL command queue in use is not tagged as Out-of-Order, flushes also happen at clEnqueueReadBuffer, clEnqueueReadBufferRect and clEnqueueMapBuffer when their corresponding blocking parameter is set to true.
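A host-side sketch of that behaviour, for illustration (platform selection and error handling reduced to a minimum): on an in-order queue the blocking read is a flush opportunity, and clFinish is the main one.

```c
/* opencl_flush_sketch.c -- link with -lOpenCL; error checking omitted */
#include <CL/cl.h>
#include <stdlib.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err); /* in-order queue */

    size_t n = 1 << 20;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n, NULL, &err);
    char *host = (char *)malloc(n);

    /* ... kernels would be enqueued here with clEnqueueNDRangeKernel ... */

    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n, host, 0, NULL, NULL); /* blocking read: flush point */
    clFinish(queue);                                                     /* completion: main flush point */

    clReleaseMemObject(buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    free(host);
    return 0;
}
```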
Footnotes
[1] The MPI library must support this routine.
[2] The MPI library must support MPI I/O routines.
[3] The MPI library must support one-sided (RMA, Remote Memory Access) routines.
[4] The instrumentation of OpenMP locks can be enabled or disabled.