Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)
0.17
Performance library for Deep Learning
Disclaimer: MKLDNN int8 primitives are a work in progress, and not all definitions and configurations have been implemented or documented. Moreover, the example in this documentation relies on int8 primitives that use the MKL binary dependency, so it is limited to MKLDNN built with the MKL binary.
To push higher performance during inference computations, recent work has focused on computing at a lower precision (i.e. shrinking the size of data for activations and weights) to achieve higher throughput. Eight-bit computations (referred to as int8) offer improved performance over higher-precision types because they allow packing more data into a single instruction, at the cost of reduced but acceptable accuracy.
To operate with int8 data types from a higher precision format (e.g. 32-bit floating point), data must first be quantized. The quantization process converts a given input into a lower-precision format. The precision and accuracy factors are determined by the scale and rounding-mode respectively.
The scale is usually obtained from sampling the dataset of previous executions in the original format (e.g. the activations and weights from training in fp32) and is formulated as:
$R_{\{\alpha,w\}} = \max(|T_{\{\alpha,w\}}|)$
where $T_{\{\alpha,w\}}$ is a tensor corresponding to either the weights $w$ or the activations $\alpha$.
The purpose is to establish the range of values used in the computation, where selecting a proper scaling factor prevents overflows or underflows when computing the lower-precision results.
The next step is to calculate the quantization factor for converting the values into the corresponding int8 range. This is also known as the scale or scaling factor applied to the original high-precision values and is calculated as:
$Q_{\alpha} = \frac{255}{R_{\alpha}}$, with $R_{\alpha} = \max(|T_{\alpha}|)$, is the quantization factor for activations with non-negative values.
$Q_{w} = \frac{127}{R_{w}}$, with $R_{w} = \max(|T_{w}|)$, is the quantization factor for weights.
The low-precision values, known as the quantized activation, weights, and bias values, are calculated as:
![$\alpha_{u8} = \lceil Q_{\alpha} \alpha_{f32} \rceil \in [0,255]$](form_29.png)
![$W_{s8} = \lceil Q_{w} W_{f32} \rceil \in [-127,127]$](form_30.png)
![$b_{s32} = \lceil Q_{\alpha} Q_{w} b_{f32} \rceil \in [-2^{31},2^{31}-1]$](form_31.png)
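where the function $\lceil \cdot \rceil$ rounds to the selected rounding mode.
When the destination value (e.g. from a convolution) is stored as a signed 32-bit integer, the result is bound to the same quantization scaling factors:
$X_{s32} = W_{s8} \times \alpha_{u8} + b_{s32} \approx Q_{\alpha} Q_{w} X_{f32}$
where $X_{f32} = W_{f32} \times \alpha_{f32} + b_{f32}$
and the approximation is due to the rounded values.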
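The relation between the s32 result and the fp32 result can be checked with a small plain C++ sketch. It does not use the MKLDNN API; the helper names `dot_s32` and `dot_f32` are hypothetical, and a single output element stands in for a full convolution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one output element X = sum_i(W[i] * a[i]) + b,
// computed on quantized operands with a signed 32-bit accumulator.
int32_t dot_s32(const std::vector<int8_t> &w_s8,
                const std::vector<uint8_t> &a_u8, int32_t b_s32) {
    assert(w_s8.size() == a_u8.size());
    int32_t acc = b_s32;
    for (std::size_t i = 0; i < w_s8.size(); ++i)
        acc += static_cast<int32_t>(w_s8[i]) * static_cast<int32_t>(a_u8[i]);
    return acc;
}

// Reference: the same element computed in 32-bit floating point.
float dot_f32(const std::vector<float> &w, const std::vector<float> &a,
              float b) {
    assert(w.size() == a.size());
    float acc = b;
    for (std::size_t i = 0; i < w.size(); ++i)
        acc += w[i] * a[i];
    return acc;
}
```

With illustrative four-element inputs (weights -5.1, 6.8, -1.2, 9.8; activations 15, 14, 8, 11; bias 2.4) quantized with factors 17 and 12.96, `dot_s32` yields 26351. Dividing by 17 × 12.96 gives about 119.6, against 119.3 from `dot_f32`; the gap comes from rounding the operands.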
Inversely, the dequantized value is calculated as:
$X_{f32} \approx \frac{1}{Q_{\alpha} Q_{w}} X_{s32}$
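To show how the int8 parameters are obtained, suppose we start with a set of arbitrary high-precision input and output values. These values come from sampling a previously executed training run and are in their original 32-bit floating point format:
activations: $T_{\alpha} = [15, 14, ... 11]$ where $\max(|T_{\alpha}|) = 15$
weights: $T_{w} = [-5.1, 6.8, ... -1.2, 9.8]$ where $\max(|T_{w}|) = 9.8$
bias: $T_{b} = [2.4, -5.2 ... -8]$
The scaling factors are:
$Q_{\alpha} = \frac{255}{15} = 17$
$Q_{w} = \frac{127}{9.8} = 12.96$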
Finally, the quantized input values for the 8-bit operation are calculated as:
![$\alpha_{u8} = \lceil Q_{\alpha} \alpha_{f32} \rceil = \lceil 17 \times [15, 14, ... 11 ] \rceil = [255, 238, ... 187] $](form_45.png)
![$W_{s8} = \lceil Q_{w} W_{f32} \rceil = \lceil 12.96 \times [-5.1 , 6.8, ... -1.2, 9.8 ] \rceil = [-66, 88, ... -15, 127] $](form_46.png)
![$b_{s32} = \lceil Q_{\alpha} Q_{w} b_{f32} \rceil = \lceil 17 \times 12.96 \times [ 2.4, -5.2 ... -8 ] \rceil = [528, -1145, ... -1762] $](form_47.png)
These arrays are the new inputs for the int8 net.
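The arithmetic of this example can be reproduced with a short plain C++ sketch. The helper `quantize` is hypothetical and not part of MKLDNN. It truncates toward zero, which is an assumption chosen because it reproduces the values listed above; rounding to nearest would give, for example, -16 rather than -15 for 12.96 × -1.2 = -15.55.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: scale each value by q, truncate toward zero
// (assumption, see lead-in), then clamp the result to [lo, hi].
std::vector<int32_t> quantize(const std::vector<float> &x, double q,
                              int32_t lo, int32_t hi) {
    assert(lo <= hi);
    std::vector<int32_t> out;
    out.reserve(x.size());
    for (float v : x) {
        double t = std::trunc(q * static_cast<double>(v));
        t = std::min(static_cast<double>(hi),
                     std::max(static_cast<double>(lo), t));
        out.push_back(static_cast<int32_t>(t));
    }
    return out;
}
```

For instance, `quantize({-5.1f, 6.8f, -1.2f, 9.8f}, 12.96, -127, 127)` returns `{-66, 88, -15, 127}`, matching the weights row. The activations use `q = 17` with range `[0, 255]`, and the bias uses `q = 17 * 12.96` with the full s32 range.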
MKLDNN supports low-precision computations for inference through the int8 primitives. Int8 primitives are ordinary MKLDNN primitives which have their input and output parameters configured to 8-bit types. Int8 primitives are optimized for high performance; one example is the use of specialized 512-bit wide low-precision instructions available through the Advanced Vector Extensions 512 (AVX512) on Intel Skylake Server systems. Currently, the supported primitives are:
MKLDNN primitive behaviour may be extended with additional functionality involving output data transformation. These additional features are configured via primitive attributes. The primitive attributes definition is an opaque structure for passing extra parameters to a primitive descriptor. These parameters include Scaling Factor, Round Mode and Fused Post-Operations (PostOps). All operation primitives support the attributes structure; however, not all configurations are implemented, and unsupported ones result in failed primitive creation.
The scaling factor, as previously described, is known prior to the inference operation, where the values are calculated from a set of formulas. In MKLDNN, the scaling factor is applied to the output of a primitive. To transform inputs (e.g. source, bias and weights), MKLDNN quantizes and dequantizes int8 data through the Reorder Primitive.
MKLDNN has two formats for defining the output scaling factor. Depending on the configuration set by the scaling mask, the output is either scaled uniformly across all dimensions (mask = 0), or a set of scaling values is applied along specific dimension(s).
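The effect of the mask can be modelled in plain C++. This sketch is not the MKLDNN API, and both function names are hypothetical. It assumes that bit `d` of the mask selects a separate scale for every index along logical dimension `d`, which this page does not state explicitly. For brevity it handles only a 2D result with logical dimensions (n, c).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: how many scale values a mask selects.
// Assumption: bit d set => one scale per index along logical dimension d.
std::size_t scales_count(const std::vector<std::size_t> &dims, int mask) {
    assert(mask >= 0);
    std::size_t count = 1;
    for (std::size_t d = 0; d < dims.size(); ++d)
        if ((mask >> d) & 1)
            count *= dims[d];
    return count;
}

// Apply output scales to an s32 result stored row-major as (n, c).
// mask == 0: scales[0] is used everywhere.
// mask == 2 (bit 1): scales[j] is used for channel j.
std::vector<float> apply_output_scales(const std::vector<int32_t> &acc,
                                       std::size_t n, std::size_t c, int mask,
                                       const std::vector<float> &scales) {
    assert(mask == 0 || mask == 2);
    assert(acc.size() == n * c);
    assert(scales.size() == scales_count({n, c}, mask));
    std::vector<float> dst(acc.size());
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < c; ++j)
            dst[i * c + j] =
                scales[mask ? j : 0] * static_cast<float>(acc[i * c + j]);
    return dst;
}
```

Under this model, a 4D tensor with dimensions (2, 3, 4, 5) needs one scale for mask 0 and three scales for mask 2, one per channel.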
Note: Mask is always applied to the logical dimension; this is independent of the dimension format that the primitive might select. The dimensions in MKLDNN are defined as follows:
The Round Mode in attributes specifies the form of rounding for the resulting output value. Round mode will be applied whenever the output data type is not 32-bit floating point. The two options are:
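As a rough illustration of how the choice of rounding affects an integer output, the plain C++ sketch below contrasts rounding to nearest with rounding down. The enum and function names are hypothetical, not MKLDNN identifiers. Mapping the two modes to `std::nearbyint` (ties to even under the default floating-point environment) and `std::floor` is an assumption about the library's behaviour.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical names for this sketch only.
enum class RoundMode { nearest, down };

// Convert a scaled float result to s32 using the chosen rounding.
int32_t round_to_s32(float x, RoundMode mode) {
    float r = (mode == RoundMode::nearest) ? std::nearbyint(x)
                                           : std::floor(x);
    // The rounded value must be representable as a signed 32-bit integer.
    assert(r >= -2147483648.0f && r < 2147483648.0f);
    return static_cast<int32_t>(r);
}
```

For example, 88.6 becomes 89 with `RoundMode::nearest` and 88 with `RoundMode::down`, while -15.2 becomes -15 and -16 respectively.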
Fused Post-Operations (PostOps) allow chaining operations during the primitive computation. Note that the resulting output value from PostOps is always affected by the scaling factor. The supported operations are:
Accumulation: $dst[\,] \leftarrow scale \cdot dst[\,] + op(\ldots)$ instead of $dst[\,] \leftarrow op(\ldots)$
Element-wise (eltwise) operation: $dst[\,] \leftarrow scale \cdot \mathrm{eltwise\_op}(op(\ldots))$ instead of $dst[\,] \leftarrow op(\ldots)$
The list of supported eltwise operations for int8 is currently limited to ReLU. For instance, PostOps may only configure a convolution with accumulation followed by eltwise (relu).
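The chain mentioned here, accumulation followed by ReLU, can be modelled in plain C++ as follows. This is a sketch of the two formulas applied one after the other, not the MKLDNN PostOps API. The function name and the separate `sum_scale` and `relu_scale` parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// dst:    values already present in the destination buffer.
// op_out: values produced by the primitive's own operation, op(...).
// Returns the new destination contents after both post-operations.
std::vector<float> sum_then_relu(const std::vector<float> &dst,
                                 const std::vector<float> &op_out,
                                 float sum_scale, float relu_scale) {
    assert(dst.size() == op_out.size());
    std::vector<float> out(dst.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        // Accumulation: scale * dst[] + op(...)
        float acc = sum_scale * dst[i] + op_out[i];
        // Eltwise ReLU: scale * max(x, 0)
        out[i] = relu_scale * std::max(acc, 0.0f);
    }
    return out;
}
```

With `dst = {1, 2}`, `op_out = {-4, 3}`, `sum_scale = 2` and `relu_scale = 0.5`, the first element accumulates to -2 and is clipped to 0, and the second accumulates to 7 and is scaled to 3.5.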
The MKLDNN repository contains an example called simple_int8_net.cpp that executes a Convolution with ReLU from the AlexNet topology using int8 computations. This example extends the simple_net.cpp source focusing on the creation and execution of int8 primitives using PostOps and Scaling Factors to obtain similar results.
Create a memory descriptor for each convolution parameter. The convolution data uses 8-bit integer values, so the memory descriptors are configured as:
Note: the destination type is chosen as unsigned because the convolution applies a ReLU operation, so the resulting data is always ≥ 0.
Configuring int8-specific parameters in an int8 primitive is done via the primitive attributes. Create an attributes object for the convolution and configure it accordingly.
The ReLU layer from AlexNet is executed through the PostOps feature. Create a PostOps object and configure it to execute an eltwise relu operation.
The diagram to summarize this example is as follows: