Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)  0.17
Performance library for Deep Learning
Performance profiling

It is often useful to collect information about how much of an application run time is spent executing Intel(R) MKL-DNN primitives and which of those take the most time. One of the popular methods to do this is to use profilers like Linux* perf or Intel(R) VTune(tm) Amplifier. Currently, Intel MKL-DNN has very limited support for these tools since it does not annotate code generated at run-time and thus the profiles cannot properly attribute it. However, Intel MKL-DNN implements another feature called verbose mode that allows tracing execution of Intel MKL-DNN primitives and collection of basic statistics like execution time and primitive parameters.

Verbose mode

To enable Intel MKL-DNN verbose mode, set MKLDNN_VERBOSE environment variable to 1 (to dump only execution time) or 2 (to dump both execution and creation time). For example:

$ export MKLDNN_VERBOSE=1
$ ./benchdnn --conv ic16ih7oc16oh7kh5ph2n"wip"

This will produce the following output (the line break was added to fit into the page width):

mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw8c,num:1,2x16x7x7,0.484863
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_goihw out:f32_gOIhw8i8o,num:1,1x16x16x5x5,0.494141
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw8c,num:1,2x16x7x7,0.478027
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_x out:f32_x,num:1,16,0.219971
mkldnn_verbose,exec,convolution,jit:avx2,forward_inference,fsrc:nChw8c fwei:gOIhw8i8o fbia:x \
fdst:nChw8c,alg:convolution_direct,mb2_g1ic16oc16_ih7oh7kh5sh1dh0ph2_iw7ow7kw5sw1dw0pw2,0.0170898
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw8c out:f32_nchw,num:1,2x16x7x7,0.488037
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw8c out:f32_nchw,num:1,2x16x7x7,0.00512695
0:PASSED __REPRO: ic16ih7oc16oh7kh5ph2nwip
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 failed:0

Each line with verbose information is formatted as a comma-separated list containing:

To get more information about verbose report format please refer to the verbose_templ() function in the src/common/verbose.hpp file.


NOTE The format is subject to change



WARNING Verbose mode has non-negligible performance impact especially if the output rate is high.


Intel(R) VTune(TM) profiling

To collect performance data of JIT-kernels set VTUNEROOT environment variable to path to VTune before building of Intel MKL-DNN. For example:

$ mkdir -p build && cd build && cmake -DVTUNEROOT=/path/to/vtune .. && make

Dump JIT-kernels

To dump JIT-kernels set MKLDNN_JIT_DUMP environment variable to 1. For example:

$ export MKLDNN_JIT_DUMP=1
$ ./simple-net-c

This will produce the following output files: mkldnn_dump_jit_avx2_conv_fwd_kernel_f32.0.bin mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.2.bin mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.3.bin mkldnn_dump_jit_uni_lrn_fwd_kernel_f32.4.bin mkldnn_dump_jit_uni_pool_kernel_f32.5.bin mkldnn_dump_jit_uni_relu_kernel_f32.1.bin

To open these files any disassembler can be used. For example:

$ xed -ir mkldnn_dump_jit_avx2_conv_fwd_kernel_f32.0.bin
XDIS 0: PUSH BASE 53 push ebx
XDIS 1: PUSH BASE 55 push ebp
XDIS 2: BINARY BASE 41 inc ecx
XDIS 3: PUSH BASE 54 push esp
XDIS 4: BINARY BASE 41 inc ecx
XDIS 5: PUSH BASE 55 push ebp
XDIS 6: BINARY BASE 41 inc ecx
XDIS 7: PUSH BASE 56 push esi
XDIS 8: BINARY BASE 41 inc ecx
XDIS 9: PUSH BASE 57 push edi
XDIS a: BINARY BASE 48 dec eax
XDIS b: DATAXFER BASE 8B07 mov eax, dword ptr [edi]
XDIS d: BINARY BASE 48 dec eax
XDIS e: DATAXFER BASE 8B7708 mov esi, dword ptr [edi+0x8]
XDIS 11: BINARY BASE 48 dec eax
XDIS 12: DATAXFER BASE 8B5710 mov edx, dword ptr [edi+0x10]
XDIS 15: BINARY BASE 48 dec eax
XDIS 16: DATAXFER BASE 8B5F18 mov ebx, dword ptr [edi+0x18]
XDIS 19: BINARY BASE 48 dec eax
XDIS 1a: DATAXFER BASE 8B4F40 mov ecx, dword ptr [edi+0x40]
XDIS 1d: BINARY BASE 44 inc esp
XDIS 1e: DATAXFER BASE 8B6F70 mov ebp, dword ptr [edi+0x70]
...

Legal information