It is often useful to collect information about how much of an application run time is spent executing Intel(R) MKL-DNN primitives and which of those take the most time. One of the popular methods to do this is to use profilers like Linux* perf or Intel(R) VTune(tm) Amplifier. Currently, Intel MKL-DNN has very limited support for these tools since it does not annotate code generated at run-time and thus the profiles cannot properly attribute it. However, Intel MKL-DNN implements another feature called verbose mode that allows tracing execution of Intel MKL-DNN primitives and collection of basic statistics like execution time and primitive parameters.
Verbose mode
To enable Intel MKL-DNN verbose mode, set MKLDNN_VERBOSE
environment variable to 1
(to dump only execution time) or 2
(to dump both execution and creation time). For example:
$ export MKLDNN_VERBOSE=1
$ ./benchdnn --conv ic16ih7oc16oh7kh5ph2n"wip"
This will produce the following output (the line break was added to fit into the page width):
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw8c,num:1,2x16x7x7,0.484863
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_goihw out:f32_gOIhw8i8o,num:1,1x16x16x5x5,0.494141
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw8c,num:1,2x16x7x7,0.478027
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_x out:f32_x,num:1,16,0.219971
mkldnn_verbose,exec,convolution,jit:avx2,forward_inference,fsrc:nChw8c fwei:gOIhw8i8o fbia:x \
fdst:nChw8c,alg:convolution_direct,mb2_g1ic16oc16_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2,0.0170898
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw8c out:f32_nchw,num:1,2x16x7x7,0.488037
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw8c out:f32_nchw,num:1,2x16x7x7,0.00512695
0:PASSED __REPRO: ic16ih7oc16oh7kh5ph2nwip
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 failed:0
Each line with verbose information is formatted as a comma-separated list containing:
mkldnn_verbose
stage
, e.g. create
or exec
primitive-kind
, e.g. convolution
, reorder
, sum
, ...
- primitive implementation name
- propagation-kind, e.g.
forward_training
- input/output data info, e.g. data type and data format
- auxiliary information, e.g. algorithm or number of input
- problem description
- for convolution the problem description is dumped in benchdnn friendly format
- for reorder, sum, and concat problem description is simply logical dims
- for other primitives the problem description is similar to convolution one
- execution time in milliseconds
To get more information about verbose report format please refer to the verbose_templ()
function in the src/common/verbose.hpp file.
NOTE The format is subject to change
WARNING Verbose mode has non-negligible performance impact especially if the output rate is high.
Intel(R) VTune(TM) profiling
To collect performance data of JIT-kernels set VTUNEROOT
environment variable to path to VTune before building of Intel MKL-DNN. For example:
$ mkdir -p build && cd build && cmake -DVTUNEROOT=/path/to/vtune .. && make
Dump JIT-kernels
To dump JIT-kernels set MKLDNN_JIT_DUMP environment variable to 1
. For example:
$ export MKLDNN_JIT_DUMP=1
$ ./simple-net-c
This will produce the following output files: mkldnn_dump_jit_avx2_conv_fwd_kernel_f32.0.bin mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.2.bin mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.3.bin mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.4.bin mkldnn_dump_jit_uni_pool_kernel_f32.5.bin mkldnn_dump_jit_uni_relu_kernel_f32.1.bin
To open these files any disassembler can be used. For example:
$ xed -ir mkldnn_dump_jit_avx2_conv_fwd_kernel_f32.0.bin
XDIS 0: PUSH BASE 53 push ebx
XDIS 1: PUSH BASE 55 push ebp
XDIS 2: BINARY BASE 41 inc ecx
XDIS 3: PUSH BASE 54 push esp
XDIS 4: BINARY BASE 41 inc ecx
XDIS 5: PUSH BASE 55 push ebp
XDIS 6: BINARY BASE 41 inc ecx
XDIS 7: PUSH BASE 56 push esi
XDIS 8: BINARY BASE 41 inc ecx
XDIS 9: PUSH BASE 57 push edi
XDIS a: BINARY BASE 48 dec eax
XDIS b: DATAXFER BASE 8B07 mov eax, dword ptr [edi]
XDIS d: BINARY BASE 48 dec eax
XDIS e: DATAXFER BASE 8B7708 mov esi, dword ptr [edi+0x8]
XDIS 11: BINARY BASE 48 dec eax
XDIS 12: DATAXFER BASE 8B5710 mov edx, dword ptr [edi+0x10]
XDIS 15: BINARY BASE 48 dec eax
XDIS 16: DATAXFER BASE 8B5F18 mov ebx, dword ptr [edi+0x18]
XDIS 19: BINARY BASE 48 dec eax
XDIS 1a: DATAXFER BASE 8B4F40 mov ecx, dword ptr [edi+0x40]
XDIS 1d: BINARY BASE 44 inc esp
XDIS 1e: DATAXFER BASE 8B6F70 mov ebp, dword ptr [edi+0x70]
...
Legal information