Memory-Driven Mixed Low Precision Quantization For Enabling Deep Network Inference On Microcontrollers
Abstract
This paper presents a novel end-to-end methodology for enabling the deployment of low-error deep networks on microcontrollers. To fit the memory and computational limitations of resource-constrained edge devices, we exploit mixed low-bitwidth compression, featuring 8-, 4- or 2-bit uniform quantization, and we model the inference graph with integer-only operations. Our approach aims at determining the minimum bit precision of every activation and weight tensor given the memory constraints of a device. This is achieved through a rule-based iterative procedure, which cuts the number of bits of the most memory-demanding layers until the memory constraints are met. After a quantization-aware retraining step, the fake-quantized graph is converted into an integer-only inference model by inserting the Integer Channel-Normalization (ICN) layers, which introduce a negligible loss as demonstrated on INT4 MobilenetV1 models. We report the latency-accuracy evaluation of mixed-precision MobilenetV1 family networks on an STM32H7 microcontroller. Our experimental results demonstrate an end-to-end deployment of an integer-only Mobilenet network with Top-1 accuracy of 68% on a device with only 2MB of FLASH memory and 512kB of RAM, improving the Top-1 accuracy by 8% with respect to previously published 8-bit implementations for microcontrollers.
1 Introduction
Enabling machine learning on extreme-edge devices is challenging due to their tight memory and computing power constraints. When envisioning smart sensors operating on batteries, the target power envelope must be below tens of mW to guarantee a battery lifetime of years. This requirement impacts the system architecture design: adding computational units (e.g. floating-point units) or memory banks increases the complexity and the power cost, and hence the energy, of a system.
Nowadays, microcontroller units (MCUs), such as STMicroelectronics STM32 devices, feature an energy consumption compliant with the requirements of smart autonomous sensors and include energy-efficient computational units for running machine learning workloads. However, the typical size of the embedded memory cuts is limited to a few MB (an STM32H7 MCU features 2MB of FLASH memory) and the computation core (commonly a single ARM Cortex-M CPU) runs at up to a few hundred MHz. To boost the performance of this class of MCUs while leveraging the high flexibility of software programmability, ARM recently released a software library, CMSIS-NN [14], which enabled the efficient computation of deep networks on tiny microcontrollers. The optimized routines composing the library realize convolutional operations in fixed-point representations, to exploit instruction-level parallelism. Unfortunately, due to memory constraints, only a small set of relatively complex networks has been ported to the microcontroller domain so far [25]. As for models tailored for hard problems, e.g. image classification among the 1000 classes of the Imagenet dataset, fitting them into MCU memory resources is still an open problem.
To address this problem, a crucial contribution comes from recent work aiming at designing novel network topologies optimized not only in terms of accuracy but also for computational and memory costs [10, 17, 23]. In addition, a variety of compression techniques can be applied to further shrink a trained model. Among these, the quantization of both activation values and parameters to a low-bitwidth format, i.e. 8 bit or less, is extremely effective because, besides reducing the memory footprint, it allows operating with low-precision integer operations, which can be efficiently mapped on the limited instruction set of tiny microcontrollers. Figure 1 highlights the typical development flow to deploy a deep network design into a resource-constrained device. A pretrained network is quantized by means of an initial device-aware fine-tuning process, which can also include a retraining step. The resultant fake-quantized model $g(x)$, emulating quantized values during the forward pass, is turned into an integer-only deployment model $g'(x)$ by means of an additional optimization step. Ideally, $g'(x) = g(x)$. The state-of-the-art methodology for training a quantized integer-only model is currently integrated within the Tensorflow framework, which shows a low accuracy degradation when targeting 8-bit implementations [11]. This compression level is however not sufficient to bring complex models with high accuracy into memory-constrained microcontrollers. As an example, an 8-bit MobilenetV1 [10] with the highest accuracy requires more than 4 MB of embedded memory, which is prohibitive for the majority of microcontroller devices available. To this end, a more aggressive sub-byte quantization methodology is needed, combined with novel techniques for deriving integer-only inference models.
In this work, we present a methodology for quantizing deep networks based on a mixed-precision scheme. The selection of the bit precision of every individual tensor is automated so as to satisfy the memory limitations of a given device. Moreover, we improve the methodology of [11] for integer-only inference networks to support sub-byte per-channel quantization. Our experimental evaluation is conducted over the MobilenetV1 family networks on Imagenet [10]. We argue that this is a representative problem for tiny microcontrollers, not yet solved [12] and much harder than quantizing over-parameterized networks [2].
This paper makes the following contributions: i) We introduce the Integer Channel-Normalization (ICN) activation layer to achieve an efficient conversion of the fake-quantized graph, also exploiting per-channel quantization and optimized quantization-aware training strategies, into an integer-only deployment graph. ii) We present a mixed-precision quantization methodology driven by the memory constraints of a target architecture, which aims at selecting the bit precision of every weight and activation tensor of an integer-only network. iii) We study the latency-accuracy trade-off on iso-memory mixed-precision networks belonging to the MobilenetV1 family when running on an STM32H7 microcontroller device.
Our methodology demonstrates an integer-only deployment of a MobilenetV1 network on an STM32H7 microcontroller with 68% Top-1 accuracy, which is 8% higher than previously reported 8-bit integer-only implementations [11].
2 Related Work
Quantized Neural Networks. Early works on quantization of deep networks targeted 16-bit fixed-point implementations [15], which result in an almost lossless approximation of full-precision trained networks, or extreme binarized networks, which, despite the fascinating low computational and memory requirements, showed major accuracy losses when applied on image classification benchmarks [4, 19]. Several studies demonstrated that 8-bit quantization of weights and activations results in a good trade-off between latency, compression and a near-zero accuracy degradation, even if applied to efficient Imagenet classification networks [11, 18, 12]. Among the employed methodologies, TensorRT [18] approximates the parameter tensors by minimizing the KL divergence between quantized and full-precision values. In contrast, [11] quantizes values within a range defined by the tensor min and max values. Concerning activations, the PACT approach [2] demonstrated the highest efficiency by leveraging backpropagation to learn the quantization ranges. Recently, to fit stringent memory requirements, more aggressive sub-byte precision quantization approaches, i.e. less than 8 bit, are under investigation [3, 12, 6, 13, 16]. The works [12, 6] exploit learning-based approaches for determining the quantization ranges of activations and weights at low-bitwidth precision. State-of-the-art accuracy on the efficient MobilenetV1 model has been reported by [13, 16], by making use of per-channel quantization when moving to 4-bit precision. It is also worth mentioning that non-uniform quantizers have proven to be the best approximators when reducing the bit precision [24, 22, 9]. However, high-precision (floating-point) arithmetic is needed on uncompressed values within the datapath, hence these methods are not suitable for the microcontroller domain.
In this work, we leverage existing techniques and show the insights, concerning both computational and memory aspects, gained when bringing fake-quantized networks to the integer-only arithmetic domain, which is not taken into consideration by this class of works.
Mixed Low Precision Quantization. Mixed-precision techniques make use of multiple bit precisions throughout a quantized network, motivated by the fact that a lossy and aggressive linear cut is not necessary to reach a given compression rate. The method of [7] targeted per-pixel binarization based on a defined tensor mask. Despite achieving an extreme quantization level, a per-pixel quantization cannot be efficiently handled on a microcontroller, due to the control-based nature of the required dataflow. The HAWQ [5] method relies on a second-order Hessian metric to prioritize the tensors whose bit precision is to be reduced, but without choosing the optimal per-tensor quantization level. Along the same direction, HAQ [22] dynamically explores multiple low-bitwidth precisions at training time by means of reinforcement learning. When optimizing for memory constraints, a non-uniform quantization is used. Compared to this, our methodology for bit precision selection applies statically, before the quantization-aware retraining, and it is based on a rule-based iterative procedure. Both [5] and [22] report superior accuracy than ours when compressing networks to a 1MB memory footprint, but they include non-uniform clustering quantization of floating-point parameters and are therefore not fully comparable with our work in terms of microcontroller readiness, as current MCUs are not equipped with the hardware needed for manipulation and computation on these data formats.
Deep networks for resource-constrained devices. To bridge the gap between the complexity of deep networks and the limitations of resource-constrained devices, device-aware optimization strategies have also been presented. The work [1] introduced FINN-R to quantize and deploy a generic model into constrained FPGA architectures. Their quantization approach makes use of integer thresholds [21, 8, 20] for data compression. This method enables a lossless integer representation of a fake-quantized network, but demands a larger memory footprint with respect to our proposed method. In contrast, the integer-only deployment in [11] presented a compact fixed-point 8-bit quantization strategy, which performs the folding of batch-normalization and scaling factors into weights before applying a uniform quantizer. Additionally, per-layer fixed-point parameters are needed for adapting the dynamic range when passing data from a layer to the next one. In contrast with this work, our methodology generalizes the deployment process when a more effective quantization strategy is used, i.e. per-channel mixed-precision quantization.
3 Background on Low-Bitwidth Quantization
The quantization process aims at quantizing both the network parameters and the activation values, i.e. the temporary input and output values of the network layers. While the parameters can be quantized just before the inference (forward) pass [18], the quantization of the activations requires the insertion of fake-quantized activation layers within the network graph. These additional layers are responsible for recording the activation range statistics, optionally via backpropagation [2], and apply quantization during the forward pass depending on the collected statistics. Because of the injected quantization noise, the original full-precision network $f(x)$ is approximated with the corresponding fake-quantized function $g(x)$. A quantization-aware retraining of a fake-quantized model is essential to recover accuracy, especially when low-bitwidth precision is employed [11].
In the remainder of the paper we only focus on uniform quantization because its arithmetic is naturally supported by the instruction set of general-purpose programmable MCUs. Hence, without loss of generality, any tensor $t$, either representing weights or activations or only a subset of them, can be quantized across the range $[a, b]$ with a given number of bits $Q$ [11] as:
$$quant(t) = round\left(\frac{clamp(t, a, b)}{S_t}\right) \cdot S_t, \qquad S_t = \frac{b - a}{2^Q - 1}$$ (1)
Equation (1) derives from the mapping:
$$t = S_t \cdot (T_q - Z_t)$$ (2)
where $T_q$ is the integer-valued quantized tensor and $Z_t$ is a bias parameter required to shift the numeric domain of the quantized tensors into the $[0, 2^Q - 1]$ or $[-2^{Q-1}, 2^{Q-1} - 1]$ ranges, representative of the UINT-Q and INT-Q datatypes. If $Z_t$ is constrained to zero, e.g. when $a = -b$, the quantization range is symmetric.
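The mapping of Equations (1) and (2) can be illustrated with a minimal sketch in plain Python. The function names and the choice of deriving the bias parameter as round(-a/S), so that the lower bound of the range maps to code 0, are assumptions made for this example and are not taken from our implementation.

```python
def quant_params(a, b, q_bits):
    """Return the step S_t of Equation (1) and an integer bias Z_t."""
    scale = (b - a) / (2 ** q_bits - 1)
    # Assumption: Z_t is chosen so that the real value `a` maps to code 0.
    zero_point = round(-a / scale)
    return scale, zero_point


def quantize(values, a, b, q_bits):
    """Map real values to UINT-Q codes T_q in [0, 2^Q - 1]."""
    scale, zero_point = quant_params(a, b, q_bits)
    top = 2 ** q_bits - 1
    codes = []
    for v in values:
        clamped = min(max(v, a), b)
        code = round(clamped / scale) + zero_point
        codes.append(min(max(code, 0), top))
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Equation (2): t = S_t * (T_q - Z_t)."""
    return [scale * (c - zero_point) for c in codes]


# Hypothetical tensor quantized to 4 bit over the range [-1, 2].
t = [-1.5, -1.0, 0.0, 0.33, 2.5]
codes, s_t, z_t = quantize(t, a=-1.0, b=2.0, q_bits=4)
t_hat = dequantize(codes, s_t, z_t)
print(codes)
print([round(v, 3) for v in t_hat])
```

Values outside the range are saturated, while every value inside the range is reproduced with an error of at most half a quantization step.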
In the case of weights, the parameters $a$ and $b$ can be computed as the min and max values of a tensor [11], by means of more sophisticated statistical analysis [18], or via backpropagation [2]. A Per-Layer (PL) quantization exploits single values $a$ and $b$ for the whole full-precision tensor, hence Equation 1 is applied layer-wise. A Per-Channel (PC) procedure is more effective because it independently approximates a given tensor along the outer dimension [13]. This corresponds to computing the $S_t$ and $Z_t$ parameters for every output channel of the tensor.
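The difference between the two granularities can be seen on a toy example. The sketch below uses min/max statistics and a hypothetical weight tensor whose two output channels have very different dynamic ranges; the numbers are illustrative only.

```python
def fake_quant(values, a, b, q_bits):
    """Equation (1): round(clamp(t, a, b) / S_t) * S_t."""
    scale = (b - a) / (2 ** q_bits - 1)
    return [round(min(max(v, a), b) / scale) * scale for v in values]


def max_abs_error(reference, approx):
    return max(abs(r - q) for r, q in zip(reference, approx))


# Hypothetical weight tensor with two output channels of very different range.
weights = [
    [-0.02, 0.01, 0.03],   # output channel 0: small values
    [-1.00, 0.50, 0.80],   # output channel 1: large values
]
Q = 4

# Per-layer (PL): one [a, b] pair for the whole tensor.
flat = [v for channel in weights for v in channel]
a_pl, b_pl = min(flat), max(flat)
pl = [fake_quant(channel, a_pl, b_pl, Q) for channel in weights]

# Per-channel (PC): one [a, b] pair per output channel.
pc = [fake_quant(channel, min(channel), max(channel), Q) for channel in weights]

err_pl = max_abs_error(weights[0], pl[0])
err_pc = max_abs_error(weights[0], pc[0])
print(f"channel 0 max error: PL={err_pl:.4f}, PC={err_pc:.4f}")
```

With a single range for the whole layer, the quantization step is dictated by the large channel and the small channel collapses to zero, whereas the per-channel ranges preserve it.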
To determine the quantization range of the activation values, statistics can be collected at training time during the forward pass, or against a specific calibration dataset. The PACT strategy [2] demonstrated the effectiveness of learning $b$ via backpropagation while setting $a = 0$ to reproduce the non-linearity of the ReLU function. In our implementation, the $round(\cdot)$ of Equation 1 is replaced by $floor(\cdot)$ because of the lighter software implementation (the operand gets simply truncated, i.e. a shift operation), becoming: $quant_{act}(x) = floor\left(\frac{clamp(x, 0, b)}{S_x}\right) \cdot S_x$.
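A minimal sketch of the floor-based activation quantizer follows. The clipping value is a fixed, hypothetical constant here, whereas with PACT it would be learned via backpropagation; the function name is also an assumption of this example.

```python
import math


def pact_quant_act(x, b, q_bits):
    """Floor-based activation quantizer with clipping range [0, b]."""
    scale = b / (2 ** q_bits - 1)
    code = math.floor(min(max(x, 0.0), b) / scale)
    return code, code * scale


# Hypothetical clipping value; in PACT it would be learned by backpropagation.
B_CLIP = 6.0
inputs = [-1.0, 0.5, 2.0, 3.9, 7.0]
codes = [pact_quant_act(x, B_CLIP, q_bits=2)[0] for x in inputs]
values = [pact_quant_act(x, B_CLIP, q_bits=2)[1] for x in inputs]
print(codes, values)

# On integers, flooring a division by a power of two is an arithmetic right shift,
# which is why floor is cheaper than round in an integer-only kernel.
shift_equals_floor = ((-7) >> 1) == math.floor(-7 / 2)
print(shift_equals_floor)
```

Negative inputs are clipped to zero as in a ReLU, and inputs above the clipping value saturate to the highest code.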
4 Integer-Only Inference
Previous work [11] discussed the training and integer-only deployment of a fake-quantized network with 8-bit per-layer quantization. The weight quantization is applied after folding the batch-norm parameters into the convolutional weights. However, when reducing the bit precision below 8 bit using per-layer quantization, the folding process itself can lead to an accuracy drop because it can drastically affect the range of the parameters to quantize. As a reference, Table 2 shows the collapse of the training process for INT4 MobilenetV1 with the folding of the batch-norm parameters enabled.
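The effect of folding on the quantization range can be reproduced with a toy example. The weights and the channel-wise batch-norm factors below are hypothetical values chosen only to make the effect visible; they are not taken from MobilenetV1.

```python
def fake_quant_layer(channels, q_bits):
    """Per-layer min/max quantization following Equation (1)."""
    flat = [w for ch in channels for w in ch]
    a, b = min(flat), max(flat)
    scale = (b - a) / (2 ** q_bits - 1)
    return [[round(w / scale) * scale for w in ch] for ch in channels]


# Hypothetical convolution weights (two output channels) and the channel-wise
# batch-norm factors gamma / sigma that folding multiplies into them.
weights = [[0.50, -0.40, 0.30], [0.45, -0.50, 0.35]]
bn_scale = [0.05, 4.0]

folded = [[w * s for w in ch] for ch, s in zip(weights, bn_scale)]

q_plain = fake_quant_layer(weights, q_bits=4)
q_folded = fake_quant_layer(folded, q_bits=4)

print("channel 0 without folding:", q_plain[0])
print("channel 0 with folding:   ", q_folded[0])
```

After folding, the per-layer range is stretched by the channel with the largest batch-norm factor, and at 4 bit the other channel is entirely rounded to zero. Keeping the batch-norm parameters out of the weights avoids this range distortion.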
With the aim of an integer-only deployment, we extend [11] to a) prevent the folding of batch-normalization parameters into convolutional weights and b) support per-channel low-bitwidth weight quantization. We observe that any sub-graph of a fake-quantized network composed of a convolutional layer, a batch-normalization layer and a fake-quantizer activation module can be modeled by the transfer function:
$$y = quant_{act}\left(\frac{\phi - \mu}{\sigma} \cdot \gamma + \beta\right)$$ (3)
where $\phi$ is the output of a full-precision convolution and $\mu$, $\sigma$, $\gamma$, $\beta$ are channel-wise full-precision parameters of a batch-normalization layer. It is worth noting that this kind of formulation holds for any feature-wise or layer-wise scaling factor applied to the convolution's output tensor.
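When applying a per-layer quantization of both input/output activations and weights, Rule 2 is injected into Equation 3, which becomes:
$$S_o \cdot (Y - Z_y) = quant_{act}\left(\frac{S_i S_w \Phi - \mu}{\sigma} \cdot \gamma + \beta\right)$$ (4)
where $\Phi = \sum (X - Z_x) \cdot (W - Z_w)$ is the integer output of a low-bitwidth convolution and $S_i$, $S_w$, $S_o$ are the scale factors of the input, weight and output tensors. We define the arrays $B_q = round\left(\frac{\beta \sigma / \gamma - \mu}{S_i S_w}\right)$, i.e. the quantized bias, and $M = \frac{S_i S_w \gamma}{S_o \sigma}$. As done by [11], each element of $M$ can be decomposed as $M_0 \cdot 2^{N_0}$, where $M_0$ is a signed fractionary fixed-point number with $0.5 \le |M_0| < 1$. For the sake of notation, we indicate as $M_0$ and $N_0$ the two vectors such that $M = M_0 \cdot 2^{N_0}$. Given this, Equation 4 can be rewritten as:
$$Y = Z_y + clamp\left(\left\lfloor M_0 \cdot 2^{N_0} \cdot (\Phi + B_q) \right\rfloor, 0, 2^Q - 1\right)$$ (5)
Note that every value in Equation 5 is an integer or a fixed-point value, so that a quantized convolutional layer can be computed with integer-only arithmetic. Since the static parameters vary along the channel dimension, we name this activation function (Equation 5) the Integer Channel-Normalization activation, indicated as $ICN$. If weight parameters get quantized per-channel (PC), i.e. every output channel weight bank has its own $S_w$ and $Z_w$ values, Equation (5) still holds after deriving the $M_0$, $N_0$ and $B_q$ vector parameters accordingly.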
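The conversion into an integer-only activation can be sketched for a single output channel as follows. The sketch assumes that the channel multiplier is the product of input and weight scales and of the batch-norm factor gamma/sigma, divided by the output scale, that it is split into a fixed-point mantissa and a power-of-two exponent with math.frexp, and that the mantissa is stored with 31 fractional bits. Function names, the fixed-point format and the numeric values are assumptions of this example.

```python
import math

FRAC_BITS = 31  # assumed fixed-point format of the mantissa M_0


def icn_params(s_i, s_w, s_o, gamma, sigma, beta, mu):
    """Derive the static parameters (M_0, N_0, B_q) of one output channel."""
    m = s_i * s_w * gamma / (s_o * sigma)
    mantissa, exponent = math.frexp(m)   # m = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    m0 = round(mantissa * (1 << FRAC_BITS))
    if abs(m0) == 1 << FRAC_BITS:        # mantissa rounded up to 1.0: renormalize
        m0 //= 2
        exponent += 1
    b_q = round((beta * sigma / gamma - mu) / (s_i * s_w))
    return m0, exponent, b_q


def icn(phi, m0, n0, b_q, z_y, q_bits):
    """Integer-only activation: Z_y + clamp(floor(M_0 * 2^N_0 * (phi + B_q)), 0, 2^Q - 1)."""
    acc = m0 * (phi + b_q)               # needs a 64-bit accumulator on an MCU
    scaled = acc >> (FRAC_BITS - n0)     # arithmetic shift, i.e. floor; assumes FRAC_BITS >= n0
    return z_y + min(max(scaled, 0), 2 ** q_bits - 1)


def reference(phi, s_i, s_w, s_o, gamma, sigma, beta, mu, z_y, q_bits):
    """Floating-point batch-norm followed by the floor-based quantizer."""
    v = ((s_i * s_w * phi - mu) / sigma * gamma + beta) / s_o
    return z_y + min(max(math.floor(v), 0), 2 ** q_bits - 1)


# Hypothetical scales and batch-norm parameters of one channel.
params = dict(s_i=0.02, s_w=0.01, s_o=0.05, gamma=1.5, sigma=2.0, beta=0.3, mu=0.1)
m0, n0, b_q = icn_params(**params)
accumulators = [2000, -2000, 10000, 0]
outputs = [icn(phi, m0, n0, b_q, z_y=0, q_bits=4) for phi in accumulators]
expected = [reference(phi, z_y=0, q_bits=4, **params) for phi in accumulators]
print(outputs, expected)
```

On these inputs the integer path and the floating-point reference agree. In general the two can differ by one code when the scaled accumulator falls extremely close to an integer boundary, because the mantissa and the bias are rounded.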
4.1 Memory Requirement
Table 1 schematizes the memory requirements to compute the transfer function of Equation 5, considering both per-layer (PL) and per-channel (PC) quantization with the ICN layer. The table reports the amount of parameters of a convolution operation with a $k_w \times k_h$ receptive field, $c_I$ input channels and $c_O$ output channels. The weight parameters are stored in memory as UINT-Q, where Q denotes the number of bits, so that the represented numeric domain corresponds to $[0, 2^Q - 1]$. $Z_x$, $Z_y$ and $Z_w$ are in UINT8 format ($Z_w$ as INT16 if PC is applied), $B_q$ and $M_0$ are stored as INT32 and $N_0$ is an INT8 array. For comparison purposes, Table 1 also reports the higher memory requirement of a quantized convolutional layer when using the thresholding method proposed by [21, 8], which increases exponentially with $Q$.
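The read-only footprint of a single layer can be estimated from the datatypes listed above. The sketch below assumes that the input and output biases are scalars, that the weight bias is a scalar for PL and one INT16 entry per output channel for PC, and that the quantized bias (INT32), the fixed-point multiplier (INT32) and its exponent (INT8) have one entry per output channel. These element counts are assumptions of this example, since the entries of Table 1 are not reproduced here.

```python
def layer_ro_bytes(k_w, k_h, c_in, c_out, q_w, per_channel):
    """Read-only bytes of one quantized convolution followed by an ICN activation."""
    weight_bits = k_w * k_h * c_in * c_out * q_w
    weights = (weight_bits + 7) // 8          # packed UINT-Q weights
    z_x = z_y = 1                             # UINT8 scalars
    z_w = 2 * c_out if per_channel else 1     # INT16 per channel (PC) or one UINT8 (PL)
    b_q = 4 * c_out                           # INT32 per output channel
    m_0 = 4 * c_out                           # INT32 per output channel
    n_0 = 1 * c_out                           # INT8 per output channel
    return weights + z_x + z_y + z_w + b_q + m_0 + n_0


# Hypothetical 1x1 convolution with 512 input and 512 output channels at 4 bit.
pl_bytes = layer_ro_bytes(1, 1, 512, 512, q_w=4, per_channel=False)
pc_bytes = layer_ro_bytes(1, 1, 512, 512, q_w=4, per_channel=True)
print(pl_bytes, pc_bytes)
```

Under these assumptions the static parameters add a few percent to the weight footprint of such a layer, and moving from PL to PC costs roughly one additional kilobyte.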
5 Memory-Driven Mixed Low Precision Methodology for MCU Deployment
To run deep networks on microcontrollers, the memory footprint is a stringent constraint. Given common microcontroller architectures [25], we distinguish:

Read-Only (RO) Memory, to store frozen inference parameters, i.e. parameters that will not change during the lifetime of a smart device.

Read-Write (RW) Memory, to store temporary values, i.e. input and output of any quantized convolutional layer that depends on the current sensor data.
At any step of the inference pass, a pair of temporary activation tensors, i.e. the input and output of a layer, and the whole set of fixed parameters must be present in memory. Considering a network of $L$ stacked quantized convolutional layers and a device with $M_{RO}$ and $M_{RW}$ memory budgets (expressed in bytes), the above requirement translates to:
$$\sum_{i=0}^{L-1} \left[ mem(w_i, Q_w^i) + mem(p_i) \right] \le M_{RO}$$ (6)
where $w_i$ indicates the weight tensor of the $i$-th quantized convolutional layer and $mem(t, Q)$ returns the memory footprint of a tensor $t$ with bit precision $Q$. $mem(p_i)$ is the memory footprint of the additional set $p_i$ of the layer's static parameters (see Table 1) with the datatypes detailed in Section 4.1. Concerning activation values:
$$mem(x_i, Q_x^i) + mem(y_i, Q_y^i) \le M_{RW}, \qquad i = 0, \dots, L-1$$ (7)
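The two constraints can be checked for a given bit assignment with a few lines of Python. This is a sketch with hypothetical layer descriptions, in which tensor sizes are element counts and the static parameters are given directly in bytes; the dictionary keys are names chosen for this example.

```python
def mem(num_elements, q_bits):
    """Footprint in bytes of a tensor stored with q_bits per element."""
    return (num_elements * q_bits + 7) // 8


def satisfies_ro(layers, m_ro):
    """Constraint (6): all weights and static parameters fit in read-only memory."""
    total = sum(mem(layer["w"], layer["q_w"]) + layer["static"] for layer in layers)
    return total <= m_ro


def satisfies_rw(layers, m_rw):
    """Constraint (7): input and output of every layer fit together in read-write memory."""
    return all(
        mem(layer["x"], layer["q_x"]) + mem(layer["y"], layer["q_y"]) <= m_rw
        for layer in layers
    )


# Hypothetical two-layer network.
layers = [
    {"w": 864, "static": 291, "x": 150528, "y": 401408, "q_w": 8, "q_x": 8, "q_y": 8},
    {"w": 288, "static": 291, "x": 401408, "y": 401408, "q_w": 8, "q_x": 8, "q_y": 8},
]
M_RO, M_RW = 2 * 1024 * 1024, 512 * 1024

before = (satisfies_ro(layers, M_RO), satisfies_rw(layers, M_RW))

# Cut every activation tensor except the network input to 4 bit.
for layer in layers:
    layer["q_y"] = 4
layers[1]["q_x"] = 4
after = (satisfies_ro(layers, M_RO), satisfies_rw(layers, M_RW))
print(before, after)
```

In this example the weights easily fit the read-only budget, while the activations violate the read-write budget at 8 bit and satisfy it once the intermediate tensors are stored at 4 bit.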
Our methodology aims at determining the bit precision $Q_x^i$, $Q_y^i$ and $Q_w^i$ of any input $x_i$, output $y_i$ and weight $w_i$ tensor of the $i$-th layer, to match the memory constraints (6) and (7). Only the values $Q \in \{2, 4, 8\}$ are admissible solutions; $Q_x^0$ is fixed to 8. Note that $y_i = x_{i+1}$, hence fixing $Q_y^i$ is equivalent to setting $Q_x^{i+1}$. Initially, the bit precision of every tensor is set as $Q = 8$. Algorithm 1 and Algorithm 2 report the pseudocode of the procedures to cut the bit precision of, respectively, activations and weights, under the hypothesis that a solution satisfying (6) and (7) exists. The procedure in Algorithm 1 iterates over the quantized convolutional layers in a forward and backward fashion: the bit precision of the output tensors is cut during the forward pass, while reductions of the input tensors' precision are applied during the backward pass. Any cut consists of reducing the bit precision by a single step, i.e. from 8 to 4 or from 4 to 2 bits, and it is applied if the number of bits of the intended tensor (output during forward or input during backward) is lower than or equal to that of the other activation tensor of the $i$-th layer, while its footprint is higher.
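The forward and backward structure of the activation procedure can be illustrated with a simplified greedy variant. This sketch is not Algorithm 1: it cuts the tensor with the larger footprint in every layer that violates the read-write budget and ignores the additional comparison between the bit precisions of the two tensors. Names and tensor sizes are hypothetical.

```python
def mem(num_elements, q_bits):
    """Footprint in bytes of a tensor stored with q_bits per element."""
    return (num_elements * q_bits + 7) // 8


def cut_activation_bits(sizes, m_rw):
    """sizes[i] is the element count of activation tensor i; layer i maps tensor i to i + 1."""
    bits = [8] * len(sizes)
    n_layers = len(sizes) - 1

    def violates(i):
        return mem(sizes[i], bits[i]) + mem(sizes[i + 1], bits[i + 1]) > m_rw

    while any(violates(i) for i in range(n_layers)):
        changed = False
        for i in range(n_layers):                    # forward pass: cut outputs
            out_larger = mem(sizes[i + 1], bits[i + 1]) > mem(sizes[i], bits[i])
            if violates(i) and out_larger and bits[i + 1] > 2:
                bits[i + 1] //= 2
                changed = True
        for i in reversed(range(1, n_layers)):       # backward pass: cut inputs, tensor 0 stays at 8 bit
            in_larger = mem(sizes[i], bits[i]) >= mem(sizes[i + 1], bits[i + 1])
            if violates(i) and in_larger and bits[i] > 2:
                bits[i] //= 2
                changed = True
        if not changed:
            raise ValueError("no admissible bit assignment found")
    return bits


# Hypothetical activation sizes of the first three layers of a network.
sizes = [150528, 401408, 401408, 802816]
activation_bits = cut_activation_bits(sizes, m_rw=512 * 1024)
print(activation_bits)
```

Each cut halves the precision of one tensor, and because the output of a layer is the input of the next one, a single cut relaxes the constraint of two adjacent layers.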
Algorithm 2 details the iterative procedure for cutting the bits of the weight parameters. At any iteration, a layer score is computed as the ratio between the memory footprint of the $i$-th layer and the total occupation. Among the highest scores within a margin, the layer with the lowest index is selected for the cut. This heuristic rule is intended to favor the cut of central layers over the last layers, which are usually more critical as far as quantization is concerned.
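The scoring rule can be sketched as follows. The value of the margin, its interpretation as an absolute difference from the best score, and the use of the weight footprint alone as layer footprint are assumptions of this example, as the pseudocode of Algorithm 2 is not reproduced here.

```python
def mem(num_elements, q_bits):
    """Footprint in bytes of a tensor stored with q_bits per element."""
    return (num_elements * q_bits + 7) // 8


def cut_weight_bits(w_sizes, static_bytes, m_ro, margin=0.05):
    """Cut weight precision one step at a time until the read-only budget is met."""
    bits = [8] * len(w_sizes)

    def total():
        return sum(mem(n, q) for n, q in zip(w_sizes, bits)) + sum(static_bytes)

    while total() > m_ro:
        occupation = total()
        # Score = layer footprint / total occupation, only for layers that can still be cut.
        scores = {
            i: mem(w_sizes[i], bits[i]) / occupation
            for i in range(len(w_sizes))
            if bits[i] > 2
        }
        if not scores:
            raise ValueError("memory budget cannot be met even at 2 bit")
        best = max(scores.values())
        candidates = [i for i, s in scores.items() if s >= best - margin]
        bits[min(candidates)] //= 2   # lowest layer index among the top scores
    return bits


# Hypothetical four-layer network with a 7000-byte read-only budget.
weight_bits = cut_weight_bits([1000, 4000, 4200, 1000], [0, 0, 0, 0], m_ro=7000)
print(weight_bits)
```

In the first iteration the two central layers have scores within the margin, so the one with the lower index is cut first even though the other is slightly larger.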
6 Experimental Results
We run experiments on the MobilenetV1 family networks [10] on Imagenet using the PyTorch framework. In the following, a MobilenetV1 model is referred to with a label $r$_$\alpha$ (e.g. 224_1.0), where $r$ is the spatial resolution of the input data and $\alpha$ refers to the width channel multiplier. The quantization-aware retraining starts from pretrained weights (downloaded from https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md). Every training session executes on a compute node equipped with 4 NVIDIA Tesla P100 GPUs for 8 hours. ADAM is chosen as optimizer with an initial learning rate of 1e-4, which is decreased in a fixed schedule to 5e-5 and 1e-5 at, respectively, the 5th and 8th epoch. Running statistics and learned parameters of batch-normalization layers are frozen after the first training epoch. Batch size is 128. An asymmetric uniform quantization is applied on weights: the PACT method is used in case of PL quantization while min/max statistics are employed in case of PC quantization. PPQ [16] is applied for refining pretrained weights before the quantization-aware retraining. Folding of batch-normalization parameters into weights, when applied layer-wise, starts from the 2nd training epoch. Activations are quantized with the PACT strategy.
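The fixed learning-rate schedule can be written as a small function. The sketch assumes 1-indexed epochs, that each change takes effect from the beginning of the named epoch, and a hypothetical run of 10 epochs.

```python
def learning_rate(epoch):
    """Learning rate for a 1-indexed training epoch."""
    if epoch >= 8:
        return 1e-5
    if epoch >= 5:
        return 5e-5
    return 1e-4


# Hypothetical 10-epoch run.
schedule = [learning_rate(epoch) for epoch in range(1, 11)]
print(schedule)
```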
Quantization Method  Top-1 Accuracy  Weight Memory Footprint

Full-precision [11]  70.9%  16.27 MB
PL+FB INT8 [11]  70.1%  4.06 MB
PL+FB INT4 (our)  0.1%  2.05 MB
PL+ICN INT4 (our)  61.75%  2.10 MB
PC+ICN INT4 (our)  66.41%  2.12 MB
PC W4A4 [16]  64.3%  -
PC W4A8 [13]  65%  -
PC+Thresholds INT4 (our)  66.46%  2.35 MB
To prove the effectiveness of the ICN layers, we apply our quantization approach to a MobilenetV1 224_1.0 model and we measure the accuracy achieved by a 4-bit integer-only implementation. Table 2 reports the accuracies for the following strategies: PL+FB stands for per-layer quantization and folding of batch-norm parameters into weights, PL+ICN indicates per-layer quantization with ICN layers and PC+ICN refers to per-channel quantization with ICN layers. First, we note that only thanks to the proposed ICN layers can the folding of the batch-norm parameters, which causes the collapse of the training process (PL+FB INT4), be avoided, therefore enabling the convergence of the training algorithm (PL+ICN INT4 and PC+ICN INT4). Secondly, the insertion of the ICN layer introduces an almost negligible accuracy drop of 0.3% on PL+ICN and 0.05% on PC+ICN with respect to the fake-quantized graph. Moreover, by means of PC quantization, the accuracy of our 4-bit model is higher than other reported implementations [13, 16]. In addition, Table 2 also reports the memory footprint of our PC+ICN INT4 model, which is 10% less memory-demanding than the one obtained with the integer-thresholds-based methodology.
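As a quick check, the memory saving and the accuracy gap between the PC+ICN and the PC+Thresholds rows of Table 2 follow directly from the reported figures:

$$\frac{2.35\,\text{MB} - 2.12\,\text{MB}}{2.35\,\text{MB}} \approx 0.098 \approx 10\%, \qquad 66.46\% - 66.41\% = 0.05\%$$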
To evaluate our proposed methodology for the deployment of deep networks on microcontrollers, we apply our mixed-precision technique on all the Mobilenet configurations after setting the memory constraints $M_{RO} = 2$MB and $M_{RW} = 512$kB, corresponding to the memory characteristics of an STM32H7 device. The trained integer-only models are also benchmarked on the STM32H7 MCU running at 400MHz, to assess the implications for inference deployments. To this aim, we leverage an extended version of the ARM CMSIS-NN [14] library, featuring an output-stationary dataflow, and we measure latency in terms of clock cycles. Figure 2 plots the accuracy-latency trade-off measured on two configurations. MixQ-PL indicates per-layer quantization with either the folding of batch-norm parameters or ICN for layers with 4-bit or 2-bit precision. On the contrary, MixQ-PC-ICN indicates integer-only models with per-channel quantization and ICN as activation layers. Every curve represents a group of Mobilenet models with the same input resolution. Increasing the width multiplier causes a longer latency because of the increasing amount of MAC operations. When applying our mixed-precision method under these memory constraints, Mobilenet models with width multipliers of 0.25 and 0.5, with the exception of 224_0.5, feature no cuts of bit precision. Hence, under the configuration MixQ-PL, these points correspond to the 8-bit integer-only models described in [11]. Pareto frontiers are mostly populated by MixQ-PC-ICN configurations. The most accurate model, PC+ICN 224_0.75, scores 68% Top-1 accuracy by featuring 4-bit weights on the last pointwise convolutional layer and on the linear layer, in addition to , as determined by the memory-driven procedure of Section 5. This score is 8% higher than the most accurate INT8 Mobilenet (192_0.5) fitting into the same device.
Note that all the configurations featuring a width multiplier of 1.0 suffer from a dramatic accuracy degradation with respect to the full-precision settings (from 2% to 15%) due to the aggressive quantization required to fit into the memory constraints. On the latency side, the fastest inference model (128_0.25 MixQ-PL), which features a homogeneous 8-bit quantization, runs at 10fps, 20× faster than the most precise configuration (224_0.75 PC+ICN), but only achieves 43% Top-1 accuracy. We can observe that the MixQ-PC-ICN quantization introduces a latency overhead of approx. 20% with respect to the MixQ-PL setting, due to the additional subtractions of biases within the inner loop of the convolution. On the other hand, MixQ-PC-ICN provides up to 4% more accuracy for classification.
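For reference, the reported frame rate can be converted into a rough cycle budget at the 400MHz clock of the STM32H7. The figures below are derived from the rounded values quoted in the text and are therefore only indicative:

$$\frac{400 \times 10^{6}\ \text{cycles/s}}{10\ \text{inferences/s}} = 40 \times 10^{6}\ \text{cycles per inference}$$

$$20 \times 40 \times 10^{6} = 800 \times 10^{6}\ \text{cycles per inference} \;\Rightarrow\; \frac{800 \times 10^{6}}{400 \times 10^{6}\ \text{cycles/s}} = 2\ \text{s per inference, i.e. about } 0.5\ \text{fps}$$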
To further test our proposed mixed-precision method, we set the read-only memory constraint to $M_{RO} = 1$MB and compare with other mixed-precision methodologies in Table 3. Our best models feature up to 7% lower accuracy with respect to [22], but we remark the integer-only nature of our solution. On the other hand, our implementation features a 2% higher accuracy than INT8 models with comparable memory footprint and tailored for integer-only deployments.
Model  Quantization Method  Top-1 Accuracy  Memory Constraints

MobilenetV1_224_0.5  MixQ-PC-ICN  62.9%  1MB + 512kB
MobilenetV1_192_0.5  MixQ-PC-ICN  60.2%  1MB + 256kB
MobilenetV1_224_0.5 [11]  INT8 PL+FB  60.7%  1.34 MB
MobilenetV1_224_0.25 [11]  INT8 PL+FB  48.0%  0.47 MB
MobilenetV1 [22]  MIX not-uniform  57.14% / 67.66%  1.09 / 1.58 MB
MobileNetV2 [22]  MIX not-uniform  66.75% / 70.90%  0.95 / 1.38 MB
SqueezeNext [5]  MIX not-uniform  68.02%  1.09 MB
7 Conclusion
By mixing quantization methodologies, it is possible to execute complex deep neural networks such as MobilenetV1 on memory-constrained MCU edge devices. To pursue this objective, in this work we introduced a mixed-precision quantization technique tailored for memory-constrained microcontroller devices, leveraging the formulation of a quantized activation layer, i.e. the Integer Channel-Normalization activation, to enable sub-byte integer-only deployments. The experimental results show a MobilenetV1 network running on a microcontroller equipped with 2MB of Flash and 512kB of RAM and featuring a Top-1 accuracy of 68%, which is 8% higher than state-of-the-art integer-only 8-bit implementations.
Acknowledgments
We thank the Italian Supercomputing Center CINECA for the access to their HPC facilities needed to run deep-learning experiments.
References
[1] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O’Brien, Y. Umuroglu, M. Leeser, and K. Vissers. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 11(3):16, 2018.
[2] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
[3] Y. Choukroun, E. Kravchik, and P. Kisilev. Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822, 2019.
[4] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[5] Z. Dong, Z. Yao, A. Gholami, M. Mahoney, and K. Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. arXiv preprint arXiv:1905.03696, 2019.
[6] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019.
[7] J. Fromm, S. Patel, and M. Philipose. Heterogeneous bitwidth binarization in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 4006–4015, 2018.
[8] H. Gao, W. Tao, D. Wen, T.-W. Chen, K. Osa, and M. Kato. IFQ-Net: Integrated fixed-point quantization networks for embedded vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 607–615, 2018.
[9] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[11] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018.
[12] S. R. Jain, A. Gural, M. Wu, and C. Dick. Trained uniform quantization for accurate and efficient neural network inference on fixed-point hardware. arXiv preprint arXiv:1903.08066, 2019.
[13] R. Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
[14] L. Lai, N. Suda, and V. Chandra. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601, 2018.
[15] D. Lin, S. Talathi, and S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
[16] Z.-G. Liu and M. Mattina. Learning low-precision neural networks without straight-through estimator (STE). arXiv preprint arXiv:1903.01061, 2019.
[17] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[18] S. Migacz. 8-bit inference with TensorRT. In GPU Technology Conference, volume 2, page 7, 2017.
[19] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[20] M. Rusci, A. Capotondi, F. Conti, and L. Benini. Work-in-progress: Quantized NNs as the definitive solution for inference on low-power ARM MCUs? In 2018 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 1–2. IEEE, 2018.
[21] Y. Umuroglu and M. Jahre. Streamlined deployment for quantized neural networks. arXiv preprint arXiv:1709.04060, 2017.
[22] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han. HAQ: Hardware-aware automated quantization. arXiv preprint arXiv:1811.08886, 2018.
[23] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.
[24] D. Zhang, J. Yang, D. Ye, and G. Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 365–382, 2018.
[25] Y. Zhang, N. Suda, L. Lai, and V. Chandra. Hello edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017.
Appendix A Mixed-precision Quantization
Figure 3 plots the bit precision of every weight and activation tensor of the MixQ-PL and MixQ-PC-ICN MobilenetV1 models of Section 6. Table 4 reports the Top-1 accuracy metrics of the evaluated models.
Model  MixQ-PL Top-1 Accuracy  MixQ-PC-ICN Top-1 Accuracy

224_1.0  59.61%  64.29% 
224_0.75  67.06%  68.02% 
224_0.5  63.12%  63.48% 
224_0.25  50.76%  51.70% 
192_1.0  61.94%  65.88% 
192_0.75  64.67%  67.23% 
192_0.5  59.50%  62.93% 
192_0.25  48.12%  49.75% 
160_1.0  59.49%  64.46% 
160_0.75  64.75%  65.70% 
160_0.5  59.55%  61.25% 
160_0.25  44.77%  47.79% 
128_1.0  49.44%  49.44% 
128_0.75  60.44%  63.53% 
128_0.5  54.20%  58.22% 
128_0.25  43.45%  44.68% 