Published as a conference paper at ICLR 2020

AUTOQ: AUTOMATED KERNEL-WISE NEURAL NETWORK QUANTIZATION

Qian Lou, Feng Guo, Minje Kim, Lantao Liu, and Lei Jiang
{louqian, fengguo, minje, lantao, jiang60}@iu.edu
Indiana University Bloomington

ABSTRACT

Network quantization is one of the most hardware-friendly techniques for deploying convolutional neural networks (CNNs) on low-power mobile devices. Recent network quantization techniques quantize each weight kernel in a convolutional layer independently for higher inference accuracy, since the weight kernels in a layer exhibit different variances and hence different amounts of redundancy. The quantization bitwidth or bit number (QBN) directly decides the inference accuracy, latency, energy and hardware overhead. To effectively reduce this redundancy and accelerate CNN inferences, different weight kernels should be quantized with different QBNs. However, prior works use only one QBN to quantize each convolutional layer or the entire CNN, because the design space of searching a QBN for each weight kernel is too large. Hand-crafted heuristics for the kernel-wise QBN search are so sophisticated that even domain experts obtain only sub-optimal results. It is difficult even for deep reinforcement learning (DRL) agents based on the Deep Deterministic Policy Gradient (DDPG) to find a kernel-wise QBN configuration that achieves reasonable inference accuracy. In this paper, we propose a hierarchical-DRL-based kernel-wise network quantization technique, AutoQ, to automatically search a QBN for each weight kernel and choose another QBN for each activation layer. Compared to the models quantized by state-of-the-art DRL-based schemes, the same models quantized by AutoQ reduce the inference latency by 54.06% and the inference energy consumption by 50.69% on average, while achieving the same inference accuracy.

1 INTRODUCTION

Although convolutional neural networks (CNNs) have been the dominant approach (Sandler et al., 2018) to solving a wide variety of problems such as computer vision and recommendation systems, it is challenging to deploy CNNs to mobile devices having only limited hardware resources and tight power budgets, because of their huge computing overhead, e.g., one inference of MobileNetV2 (Sandler et al., 2018) involves 6.9M weights and 585M floating-point operations. Several approaches such as pruning (He et al., 2018) and low-rank approximation (Denton et al., 2014) have been proposed to reduce the inference computing overhead of CNNs. Network quantization (Wang et al., 2019; Lin et al., 2017) has become one of the most hardware-friendly CNN acceleration techniques: it approximates real-valued weights and activations with QBN-bit fixed-point representations and performs inferences using cheaper fixed-point multiply-accumulate (MAC) operations, where QBN is the quantization bit number. Instead of using one QBN for the whole CNN, layer-wise network quantization (Wang et al., 2019; Elthakeb et al., 2018) assigns a QBN to the weights of each convolutional layer and searches another QBN for the activations of the same layer to decrease the inference computing overhead. But the inference cost of layer-wise quantized CNNs is still prohibitive for low-power mobile devices powered by batteries.
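AutoQ itself plugs in LQ-Nets' learned quantizer (Section 3), but the role a QBN plays is easy to see with a plain symmetric fixed-point quantizer. A minimal illustrative sketch (the scaling rule here is our own simplification, not the paper's method):

```python
import numpy as np

def quantize_uniform(w, qbn):
    """Map real-valued weights onto a symmetric qbn-bit fixed-point grid."""
    if qbn == 0:                              # QBN 0 means the component is pruned
        return np.zeros_like(w)
    qmax = max(2 ** (qbn - 1) - 1, 1)         # e.g. qbn = 4 -> integer codes in [-7, 7]
    wmax = np.abs(w).max()
    scale = wmax / qmax if wmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)   # qbn-bit integer codes
    return q * scale                          # dequantized fixed-point approximation

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3, 3, 3)).astype(np.float32)  # one layer's kernels
print(np.abs(w - quantize_uniform(w, 4)).mean())  # 4-bit approximation error
print(np.abs(w - quantize_uniform(w, 8)).mean())  # 8-bit error (much smaller)
```

A smaller QBN buys cheaper fixed-point MACs and smaller memory traffic at the price of a coarser grid, which is exactly the trade-off the QBN search navigates.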
Recent works (Zeng et al., 2019; Choukroun et al., 2019b; Zhang et al., 2018; Li et al., 2019; Krishnamoorthi, 2018; Sasaki et al., 2019) find that the weight kernels of a convolutional layer (ResNet-18) exhibit different variances, as shown in Figure 1, and hence have different amounts of redundancy. (This work was supported in part by NSF CCF-1908992 and CCF-1909509.)

[Figure 1: The weight distribution of kernels. Four example kernels with means of -0.012, -0.052, -0.022 and 0.070 and standard deviations of 0.127, 0.198, 0.126 and 0.200, respectively; outliers are marked.]

[Figure 2: Inference accuracy and latency (ms) of network-wise, layer-wise and kernel-wise quantization.]

Therefore, they quantize each weight kernel independently for higher accuracy by calculating a QBN-element scaling factor vector for each kernel, rather than globally quantizing all the kernels of a layer as a whole. To reduce the different amounts of redundancy among weight kernels, these kernel-wise network quantization techniques should search a QBN for each kernel of each layer in a CNN. However, the search space of choosing a QBN for each weight kernel is too large, so prior kernel-wise network quantization (Zeng et al., 2019; Choukroun et al., 2019b; Zhang et al., 2018; Li et al., 2019; Krishnamoorthi, 2018; Sasaki et al., 2019) still uses the same QBN for the entire CNN. As Figure 2 shows, compared to the layer-wise quantized model on the same FPGA accelerator (Umuroglu et al., 2019a), the kernel-wise quantized model (assigning a QBN to each weight kernel and choosing a QBN for each activation layer) improves the inference accuracy by 2% (ImageNet) with the same computing overhead (inference latency).

Deciding a QBN for each weight kernel is the most important task of kernel-wise network quantization, since the QBNs have a large impact on the inference accuracy, latency and hardware overhead. Determining a QBN for each weight kernel via hand-crafted heuristics is so sophisticated that even machine learning experts obtain only sub-optimal results. Recent works (Wang et al., 2019; Elthakeb et al., 2018) automatically select a QBN for each layer of a CNN through a deep reinforcement learning (DRL) agent without human intervention. However, it is still difficult for low-power mobile devices such as drones and smart glasses to adopt layer-wise quantized CNN models: these devices are very sensitive to the bit-width of fixed-point MAC operations and to memory accesses during inferences due to their limited battery lifetime and hardware resources. Kernel-wise network quantization, which assigns a QBN to each weight kernel and searches a QBN for each activation layer of a CNN, becomes a must to enable the efficient deployment of deep CNNs on mobile devices by reducing the inference computing overhead. Although it is straightforward to perform kernel-wise quantization via DRL, it takes an ultra-long time for a DRL agent to find a proper QBN for each weight kernel of a CNN. As CNN architectures become deeper, it is infeasible to employ rule-based domain expertise or conventional DRL-based techniques to explore the exponentially growing search space of kernel-wise network quantization. In this paper, we propose a hierarchical-DRL-based agent, AutoQ, to automatically and rapidly search a QBN for each weight kernel and choose a QBN for each activation layer of a CNN for accurate kernel-wise network quantization.
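The kernel-level variance spread of Figure 1, which motivates per-kernel QBNs, is easy to check on any convolutional layer. A small numpy sketch (the tensor below is a random stand-in, not actual ResNet-18 weights):

```python
import numpy as np

# Weights of one conv layer, shaped (c_out, c_in, k_h, k_w); a random
# stand-in for a pre-trained layer such as the one plotted in Figure 1.
rng = np.random.default_rng(1)
W = rng.standard_normal((256, 128, 3, 3)).astype(np.float32)

# Per-kernel statistics: one mean/std per output channel (weight kernel).
flat = W.reshape(W.shape[0], -1)
means, stds = flat.mean(axis=1), flat.std(axis=1)

# Kernels whose weights cluster tightly (small std) carry more redundancy
# and tolerate a smaller QBN than high-variance kernels do.
print("std range across kernels: %.3f .. %.3f" % (stds.min(), stds.max()))
```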
AutoQ comprises a high-level controller (HLC) and a low-level controller (LLC). The HLC chooses a QBN for each activation layer and generates a goal, the average QBN over all weight kernels of a convolutional layer, for each layer. Based on the goal, the LLC produces an action, a QBN, to quantize each weight kernel of the layer. The HLC and LLC learn simultaneously by trial and error, i.e., penalizing inference accuracy loss while rewarding smaller QBNs. We also build a state space, a goal and an action space, an intrinsic reward and an extrinsic reward for AutoQ. Instead of proxy signals such as FLOPs, the number of memory accesses and model size, we design the extrinsic reward to take the inference latency, energy consumption and hardware cost into consideration.

2 BACKGROUND AND RELATED WORK

Quantization. Recent works (Lin et al., 2016; Zhou et al., 2017; Jacob et al., 2018; McKinstry et al., 2018; Zhang et al., 2018) quantize real-valued weights and activations to fixed-point representations, so that the model size is reduced and inferences can use low-cost fixed-point MAC operations. To further reduce the inference computing overhead, prior works (Kim & Smaragdis, 2016; Xu et al., 2018; Guo et al., 2017; Tang et al., 2017; Rastegari et al., 2016; Lin et al., 2017) quantize weights and activations into multi-bit binary codes of {-1, +1}. Rather than real-valued MACs, inferences of these quantized models depend on bit-wise logic operations, i.e., XNORs and popcounts. These traditional quantization techniques either simply assign a single QBN to the whole CNN or require domain experts to determine a QBN for each layer of a CNN.

Table 1: The search space size of network quantization. Each QBN $\in [0, 32]$, where 0 means the component is pruned; $n_{layer}$ is the number of layers and $c_{out_i}$ is the number of weight kernels of the $i$th layer.

| quantization granularity | search space size (weight $\times$ activation) |
|---|---|
| network-wise | $33 \times 33$ |
| layer-wise | $33^{n_{layer}} \times 33^{n_{layer}}$ |
| kernel-wise | $33^{\sum_{i=1}^{n_{layer}} c_{out_i}} \times 33^{n_{layer}}$ |
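To make the scale of Table 1 concrete, the search space sizes can be computed directly. A minimal sketch under an assumed depth and channel count (not a network from the paper):

```python
import math

n_layer = 20                     # assumed number of layers
c_out = [64] * n_layer           # assumed kernels (output channels) per layer

# Each QBN is drawn from [0, 32], i.e. 33 choices ("0" prunes the component).
network_wise = 2 * math.log10(33)                        # 33 * 33
layer_wise = 2 * n_layer * math.log10(33)                # 33^n * 33^n
kernel_wise = (sum(c_out) + n_layer) * math.log10(33)    # 33^(sum c_out) * 33^n

print(f"network-wise: ~10^{network_wise:.0f} configurations")
print(f"layer-wise:   ~10^{layer_wise:.0f}")
print(f"kernel-wise:  ~10^{kernel_wise:.0f}")   # astronomically larger
```

Even for this modest 20-layer example, the kernel-wise space has on the order of $10^{1974}$ configurations, which is why neither hand-crafted rules nor a flat DRL agent can explore it.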
Kernel-wise quantization. As Table 1 shows, almost all prior works (Lin et al., 2016; Kim & Smaragdis, 2016; Rastegari et al., 2016; Lin et al., 2017; Guo et al., 2017; Zhou et al., 2017; Jacob et al., 2018; Tang et al., 2017; Xu et al., 2018; McKinstry et al., 2018; Zhang et al., 2018), categorized as network-wise quantization, focus on searching one QBN $\in [0, 32]$ for all weights and another QBN for all activations in a CNN. In total, there are only $33 \times 33 = 1089$ QBN configurations for network-wise quantization. Layer-wise quantization (Wang et al., 2019) searches a QBN $\in [0, 32]$ for all weights of a convolutional layer and decides another QBN for all activations of the same layer, so the QBN search space size substantially increases to $33^{n_{layer}} \times 33^{n_{layer}}$, where $n_{layer}$ is the number of layers of the CNN. Recent works (Zeng et al., 2019; Choukroun et al., 2019b; Zhang et al., 2018; Li et al., 2019; Krishnamoorthi, 2018; Sasaki et al., 2019) observe that the weight kernels of a convolutional layer have different amounts of redundancy and quantize each weight kernel independently for higher accuracy. To exploit these different amounts of redundancy, kernel-wise network quantization should search a QBN for each kernel of each convolutional layer and assign a QBN to each activation layer in a CNN. However, the search space size of kernel-wise network quantization is $33^{\sum_{i=1}^{n_{layer}} c_{out_i}} \times 33^{n_{layer}}$, where $c_{out_i}$ is the number of weight kernels (output channels) of the $i$th layer. No prior work tries to search such a huge design space.

Table 2: The comparison of DRL-based techniques for quantization and pruning.

| feature | AMC | ReLeQ | HAQ | AutoQ |
|---|---|---|---|---|
| search for activations and weights | | | ✓ | ✓ |
| kernel-wise quantization | | | | ✓ |
| hierarchical DRL | | | | ✓ |
| shaped intrinsic reward | | | | ✓ |

AutoML. Recent works take advantage of DRL (Baker et al., 2016; Zoph et al., 2017), genetic algorithms (Suganuma et al., 2017; Stanley & Miikkulainen, 2002) and Bayesian optimization (Kandasamy et al., 2018; Stewart & Stalzer, 2018) to automatically architect CNNs for higher inference accuracy; their network architectures outperform many human-designed neural networks. Weight channel pruning has been conducted automatically by DRL (He et al., 2018) and by a genetic algorithm (Wang et al., 2018). ReLeQ (Elthakeb et al., 2018) quantizes only the weights of each layer of a CNN by DRL, while HAQ (Wang et al., 2019) performs layer-wise quantization for both weights and activations via a DRL agent. No prior quantization or pruning work relies on hierarchical DRL. Table 2 compares AutoQ against prior DRL-based techniques for quantization and pruning. AutoQ is the first work to automatically quantize each weight kernel and each activation layer of a pre-trained CNN model for mobile devices by hierarchical DRL.

3 AUTOQ

Overview. We do not aim to present a new network quantization technique; instead, we formulate the search for a QBN for each weight kernel and each activation layer as a hierarchical DRL problem. We propose a two-level hierarchical DRL technique, AutoQ, to automatically quantize the weights in a kernel-wise manner and the activations in a layer-wise fashion. We build the state space, the action and goal spaces, the extrinsic and intrinsic reward functions and a hierarchical DRL agent for AutoQ. Although we use the state-of-the-art learned quantization technique, LQ-Nets (Zhang et al., 2018), to quantize weight kernels and activation layers with the QBNs found by AutoQ, future quantization techniques can easily be integrated into AutoQ to improve the inference accuracy of the quantized networks. In the extrinsic reward, besides the inference latency and energy (Wang et al., 2019), AutoQ also considers the FPGA area overhead, which is critical to low-cost mobile devices.

[Figure 3: The working flow of AutoQ (HLC: high-level controller, LLC: low-level controller). The HLC performs a layer-wise search over the layers L_i, the LLC a kernel-wise search over the weight kernels K_j of each layer; the environment passes the quantized-model state to the agent and the network/hardware configurations to a machine-learning-based hardware overhead estimator (power, latency, area), and returns an extrinsic reward eRd[L_i, K_j] for each action a[L_i, K_j].]
Working Flow. For an $n_{layer}$-layer CNN, the weights are defined as $W \in \mathbb{R}^{n_{layer} \times c_{out} \times c_{in} \times w_w \times h_w}$, where $n_{layer}$ is the number of layers, $c_{out}$ the number of kernels (output channels), $c_{in}$ the number of input channels, $w_w$ the kernel width and $h_w$ the kernel height. The activations are defined as $A \in \mathbb{R}^{n_{layer} \times c_{in} \times w_a \times h_a}$, where $w_a$ is the feature map width and $h_a$ the feature map height. The working flow of AutoQ is shown in Figure 3. AutoQ consists of a high-level controller (HLC) and a low-level controller (LLC). The HLC quantizes the network layer by layer, while the LLC searches a QBN for each weight kernel in a layer. At first, AutoQ receives an observation $state[L_i, K_j]$ from the environment, i.e., the quantized network model, where $state[L_i, K_j]$ includes the information of the CNN architecture. The HLC either makes a goal $g_{L_i}$ that is the QBN for the activation layer $L_i$, in which case the flow skips the LLC and jumps directly to the reward evaluation, or generates a goal $g_{L_i}$ that is the average QBN over all weight kernels in the layer $L_i$ for the LLC. The LLC produces an action $a[L_i, K_j]$, a QBN, for the weight kernel $K_j$ of the layer $L_i$; over the entire layer $L_i$, the LLC aims to reach the goal $g_{L_i}$ set by the HLC. The environment sends the network quantization and hardware configuration to the fast and accurate machine-learning-based hardware overhead estimator, which returns the energy consumption, area overhead and inference latency for the current quantization and hardware configuration. With the hardware overhead and inference accuracy, the environment generates an extrinsic reward $eRd[L_i, K_j]$ for AutoQ to evaluate the LLC action. Based on all LLC actions for the layer $L_i$, the HLC provides an intrinsic reward $iRd_{L_i}$ to tell how well the goal was implemented by the LLC.

State Space. A state (observation) $state[L_i, K_j]$ is represented by

$$state[L_i, K_j] = (L_i, K_j, c_{in}, c_{out}, s_{kernel}, s_{stride}, s_{feature}, b_{dw}, b_{w/a}, g_{L_{i-1}}, a[L_i, K_{j-1}]) \tag{1}$$

where $L_i$ is the layer index; $K_j$ the weight kernel index; $c_{in}$ the number of input channels; $c_{out}$ the number of kernels; $s_{kernel}$ the kernel size; $s_{stride}$ the stride; $s_{feature}$ the input feature map size; $b_{dw}$ binarily indicates a depthwise convolution or not; $b_{w/a}$ binarily indicates weight or activation; $g_{L_{i-1}}$ is the goal (average QBN) of the previous layer; and $a[L_i, K_{j-1}]$ is the action (QBN) of the previous kernel in layer $L_i$. Each variable in $state[L_i, K_j]$ is normalized to $[0, 1]$. If the layer is a fully-connected layer, we set $s_{kernel} = 1$, $s_{stride} = 0$ and $b_{dw} = 0$.

Goal and Action Space. The HLC produces the average QBN for all weight kernels of each layer, or the QBN for each activation layer, as a goal, while the LLC generates a QBN for each weight kernel in a layer as an action. The HLC goal $g_{L_i}$ for the layer $L_i$ uses a continuous space, representing any real value between 1 and $goal_{max}$, where $goal_{max}$ is the maximum average QBN for a layer and is set to 8. If $L_i$ is an activation layer, the raw output $g_{L_i} \in [0, 1]$ is rounded up to the discrete value $\lceil 1 + g_{L_i} \cdot (goal_{max} - 1) \rceil$. Although the LLC action is an integer between 0 and $action_{max}$, it also uses a continuous space to capture the relative order, i.e., 2-bit is more aggressive than 3-bit, where $action_{max}$ is the maximum QBN for a kernel and is set to 8. For the kernel $K_j$ of the layer $L_i$, the LLC generates a continuous action $ra[L_i, K_j] \in [0, 1]$ and rounds it up to the discrete value $a[L_i, K_j] = \lceil ra[L_i, K_j] \cdot action_{max} \rceil$.
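The two rounding rules above map the controllers' continuous outputs to discrete QBNs. A direct transcription ($goal_{max} = action_{max} = 8$ as stated; the raw HLC output is assumed to lie in $[0, 1]$):

```python
import math

GOAL_MAX, ACTION_MAX = 8, 8   # maximum average-QBN goal / maximum per-kernel QBN

def activation_qbn(g: float) -> int:
    """HLC raw goal g in [0, 1] -> discrete QBN in [1, 8] for an activation layer."""
    return math.ceil(1 + g * (GOAL_MAX - 1))

def kernel_qbn(ra: float) -> int:
    """LLC continuous action ra in [0, 1] -> discrete QBN in [0, 8] for a kernel.

    Note that ra = 0 yields QBN 0, i.e. the kernel is pruned.
    """
    return math.ceil(ra * ACTION_MAX)

print(activation_qbn(0.0), activation_qbn(1.0))   # 1 8
print(kernel_qbn(0.3))                            # 3
```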
Extrinsic Reward. After an action $a[L_i, K_j]$ is taken, AutoQ arrives at a new state $state[L_i, K_{j+1}]$ and receives an extrinsic reward $eRd$ from the environment. The HLC aims to maximize the accumulated extrinsic reward $eRd = \sum_{i,j} \gamma_{eRd}^{\,i \cdot c_{out_i} + j - 1} \, eRd[L_i, K_j]$, where $\gamma_{eRd} \in [0, 1)$ is a decay factor. The immediate extrinsic reward is

$$eRd[L_i, K_j](NC, HC) = \log\!\left(\frac{accuracy(NC)^{\psi_{acc}}}{lat(NC, HC)^{\psi_l} \cdot en(NC, HC)^{\psi_e} \cdot area(NC, HC)^{\psi_a}}\right) \tag{2}$$

where $NC$ is the network configuration; $HC$ the hardware configuration, e.g., memory bandwidth; $accuracy(NC)$ the inference accuracy; $lat$ the inference latency of the network $NC$ running on the hardware $HC$; $en$ the inference energy of $NC$ running on $HC$; $area$ the FPGA area (hardware cost) used by $NC$ on $HC$; and $\psi_{acc}$, $\psi_l$, $\psi_e$ and $\psi_a$ are user-defined factors deciding the impact of the inference accuracy, latency, energy and FPGA area on the extrinsic reward. Through different values of these user-defined factors, AutoQ implements resource-constrained and accuracy-guaranteed searches. For resource-constrained applications, e.g., low-power drones, AutoQ sets $\psi_{acc} = 1$, $\psi_l = 0$, $\psi_e = 0$ and $\psi_a = 0$ to achieve the best accuracy given a maximum amount of hardware resources (latency, energy and FPGA area). This extrinsic reward offers no incentive for lower QBNs, so AutoQ reduces the QBNs by limiting the action space: it allows arbitrary actions for the first few layers and starts to limit the action when it finds that the hardware resource budget would be insufficient even if the smallest QBN were used for all the following layers. For accuracy-guaranteed applications, e.g., fingerprint locks, AutoQ sets $\psi_{acc} = 2$, $\psi_l < 1$, $\psi_e < 1$ and $\psi_a < 1$ to obtain the shortest latency, the minimal energy and the smallest hardware cost with no accuracy loss.

Intrinsic Reward. Based on the goal $g_{L_i}$ produced by the HLC for the layer $L_i$, the LLC generates $c_{out}$ actions $a[L_i, K_0], \ldots, a[L_i, K_{c_{out}-1}]$ at the states $state[L_i, K_0], \ldots, state[L_i, K_{c_{out}-1}]$. AutoQ then arrives at the state $state[L_i, K_{c_{out}-1}]$, where it receives an intrinsic reward $iRd$ and maximizes the accumulated intrinsic reward $iRd = \sum_j \gamma_{iRd}^{\,j-1} \, iRd[L_i, K_j]$, where $\gamma_{iRd} \in [0, 1)$ is a decay factor and $iRd[L_i, K_j]$ is the intrinsic reward for the weight kernel $K_j$ of the layer $L_i$. The LLC produces actions to help the HLC maximize the extrinsic reward, so it should aim both to complete the goal of the HLC and to maximize the extrinsic reward. But at the beginning of AutoQ training, the extremely low extrinsic reward caused by the random goals of the HLC prevents the LLC from efficiently learning from the environment. We propose a shaped reward as the intrinsic reward for the LLC that takes both the goal completion and the extrinsic reward into consideration, enabling fine-grained low-level behavior learning:

$$iRd_{L_i} = (1 - \zeta) \cdot \left(-\left\|\, g_{L_i} - \frac{1}{c_{out}} \sum_{j=0}^{c_{out}-1} a[L_i, K_j] \,\right\|_2\right) + \zeta \sum_{j=0}^{c_{out}-1} eRd[L_i, K_j] \tag{3}$$

where $\zeta$ is a user-defined factor dynamically enlarged from 0.1 to 0.8 as the number of training epochs increases. When $\zeta$ is small, the HLC has a stronger influence on the LLC; on the contrary, when $\zeta = 1$, the LLC maximizes only the accumulated extrinsic reward.
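Eq. (3) is straightforward to transcribe. A sketch in which the exact ζ ramp (here linear over epochs) is our assumption:

```python
import numpy as np

def intrinsic_reward(goal, actions, ext_rewards, zeta):
    """Shaped intrinsic reward of Eq. (3) for one layer.

    goal        -- HLC goal: target average weight QBN for the layer
    actions     -- per-kernel QBNs chosen by the LLC
    ext_rewards -- extrinsic rewards observed for those kernel actions
    zeta        -- mixing factor, ramped from 0.1 to 0.8 over training
    """
    completion = -abs(goal - np.mean(actions))   # goal-completion term
    return (1 - zeta) * completion + zeta * np.sum(ext_rewards)

def zeta_schedule(epoch, total_epochs):
    """Assumed linear ramp of zeta from 0.1 to 0.8 across training epochs."""
    return 0.1 + 0.7 * min(epoch / max(total_epochs - 1, 1), 1.0)

# Early in training, zeta is small, so matching the HLC goal dominates.
print(intrinsic_reward(3.0, [2, 3, 4, 3], [0.1, 0.2, 0.1, 0.2], zeta_schedule(0, 10)))
```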
Hardware Overhead Estimator. A recent work (Wang et al., 2019) estimates the hardware latency and energy with physical FPGA accelerators. However, a typical synthesis of a CNN model on an FPGA takes > 30 minutes (Gopinath et al., 2019), so invoking an FPGA synthesis for each action would make AutoQ unacceptably slow. We adopt fast and accurate FPGA latency and area (Liu & Carloni, 2013) and power (Zhou et al., 2019) models to predict the inference latency, energy and FPGA area for an arbitrary configuration of network and hardware. These machine-learning-based models are highly accurate and can estimate the hardware overhead needed to compute the extrinsic reward of AutoQ within several milliseconds.

Hierarchical DRL. AutoQ uses HIerarchical Reinforcement learning with Off-policy correction (HIRO) (Nachum et al., 2018) to implement the HLC and the LLC. The LLC is trained by incorporating $g_{L_i}$ into the standard TD3 method (Nachum et al., 2018). The low-level Q-value function $Q^{LLC}_{\theta_{LLC}}$ minimizes the error $\varepsilon_{LLC}(state[L_i, K_j], g_{L_i}, a[L_i, K_j], state[L_i, K_{j+1}])$, which is

$$\left( Q^{LLC}_{\theta_{LLC}}(state[L_i, K_j], g_{L_i}, a[L_i, K_j]) - iRd_{L_i} - \gamma_{iRd} \, Q^{LLC}_{\theta_{LLC}}\big(state[L_i, K_{j+1}], g_{L_i}, \mu^{LLC}_{\phi_{LLC}}(state[L_i, K_{j+1}], g_{L_i})\big) \right)^2 \tag{4}$$

where the policy $\mu^{LLC}_{\phi_{LLC}}$ is trained to maximize $Q^{LLC}_{\theta_{LLC}}$. We further augment $\mu^{LLC}_{\phi_{LLC}}$ with Gaussian noise by collecting the actions as $\mathcal{N}(\mu^{LLC}_{\phi_{LLC}}, \sigma_{a[L_i, K_j]})$, where $\mathcal{N}$ is a Gaussian distribution and $\sigma_{a[L_i, K_j]}$ is the variance. During exploitation, $\sigma_{a[L_i, K_j]}$ is initialized to 0.5 and decayed exponentially after each episode. The HLC converts a series of high-level transition tuples

$$(s[L_i, K_{0:c_{out}-1}],\; g_{L_i},\; a[L_i, K_{0:c_{out}-1}],\; eRd[L_i, K_{0:c_{out}-1}],\; s[L_{i+1}, K_0]) \tag{5}$$

into state-goal-reward transitions

$$\left(s[L_i, K_0],\; g_{L_i},\; \textstyle\sum eRd[L_i, K_{0:c_{out}-1}],\; s[L_{i+1}, K_0]\right) \tag{6}$$

where $a[L_i, K_{0:c_{out}-1}]$ denotes the sequence $a[L_i, K_0], \ldots, a[L_i, K_{c_{out}-1}]$ and $eRd[L_i, K_{0:c_{out}-1}]$ the sequence $eRd[L_i, K_0], \ldots, eRd[L_i, K_{c_{out}-1}]$. AutoQ stores these state-goal-reward transitions in the replay buffer. However, since transitions obtained from past LLCs do not accurately reflect the actions that would occur if the same goal were used with the current LLC, AutoQ has to introduce a correction that translates old transitions into ones that agree with the current LLC. AutoQ re-labels the high-level transition $(s[L_i, K_0], g_{L_i}, \sum eRd[L_i, K_{0:c_{out}-1}], s[L_{i+1}, K_0])$ with a different goal $\tilde{g}_{L_i}$ chosen to maximize the probability $\mu^{LLC}_{\phi_{LLC}}(a[L_i, K_{0:c_{out}-1}] \,|\, s[L_i, K_{0:c_{out}-1}], \tilde{g}_{L_i})$. AutoQ computes 10 candidate goals sampled randomly from a Gaussian distribution centered at $g_{L_i}$ and selects the candidate that maximizes this probability to re-label the experience.

Quantization and Finetuning. During a search, we quantize the model with the learned quantization technique (Zhang et al., 2018) and finetune the quantized model for ten epochs to recover the accuracy, using stochastic gradient descent (SGD) with a fixed learning rate of $10^{-3}$ and a momentum of 0.9. We randomly select 100 categories from ImageNet to accelerate the model finetuning. After the search is done, we quantize the model with the best policy found by AutoQ and finetune it on the full dataset.

Implementation Details. An AutoQ agent, i.e., the HLC or the LLC, consists of an actor network and a critic network. Both share the same architecture, i.e., two hidden layers with 300 units each. The actor network adds a sigmoid function to produce an output in the range of $[0, 1]$. We use a fixed learning rate of $10^{-4}$ for the actor network and $10^{-3}$ for the critic network. AutoQ trains the networks with a batch size of 64 and a replay buffer size of 2000. AutoQ first explores 100 episodes with constant noise, i.e., $\sigma_{a[L_i, K_j]} = 0.5$ for the LLC and $\sigma_{g_{L_i}} = 0.5$ for the HLC, and then exploits 300 episodes with exponentially decayed noise.
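A PyTorch sketch of the actor and critic matching the sizes above (two 300-unit hidden layers, a sigmoid-bounded actor output, and the 11 state features of Eq. (1)); the ReLU activations, the Adam optimizer and the critic's input wiring are our assumptions:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """HLC/LLC actor: maps a state (goal-conditioned for the LLC) to [0, 1]."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 300), nn.ReLU(),   # two hidden layers,
            nn.Linear(300, 300), nn.ReLU(),         # 300 units each
            nn.Linear(300, 1), nn.Sigmoid(),        # continuous output in [0, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q-value of a (state, action) pair; same 2x300 body as the actor."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(state_dim=11), Critic(state_dim=11)  # Eq. (1) has 11 features
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)      # stated learning rates
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
```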
Storage Cost. We need to record a 4-bit QBN, ranging from 0 to 8, for each activation layer and each weight kernel of a convolutional layer. This storage overhead is under 0.1% of the size of the CNN models we study. For instance, the ResNet-18 model found by resource-constrained AutoQ requires 8.3MB to store its quantized weights in Table 3; the storage overhead of AutoQ is only 0.07%.

4 EXPERIMENTAL RESULTS

Experimental Settings. To evaluate AutoQ, we selected several CNN models including ResNet-18, ResNet-50, SqueezeNetV1 (Iandola et al., 2016) and MobileNetV2 (Sandler et al., 2018). The CNN models are trained on ImageNet, which includes 1.26M training images, and tested on 50K test images spanning 1K object categories. We evaluated the inference performance, energy consumption and FPGA area of the CNN models quantized by AutoQ on a Xilinx Zynq-7020 embedded FPGA. On the FPGA, we implemented a temporal CNN accelerator (Umuroglu et al., 2019b) that uses bit-serial multipliers, each of which computes with one-bit digits from multiple weights and their corresponding activations in parallel at one time and then accumulates their partial products.

4.1 OVERALL PERFORMANCE

Resource-constrained Quantization. We make AutoQ perform resource-constrained searches by imposing a latency constraint and setting $\psi_{acc} = 1$, $\psi_l = 0$, $\psi_e = 0$ and $\psi_a = 0$ in the extrinsic reward. With this setting, AutoQ searches for the best inference accuracy given a longest-latency constraint, which is set to the inference latency of the 4-bit network-wise quantized CNN models. We compare the kernel-wise AutoQ-quantized models against the layer-wise Hardware-Aware Automated Quantization (HAQ) (Wang et al., 2019) quantized models and the 4-bit network-wise quantized models in Table 3. We used the LQ-Nets quantization (Zhang et al., 2018) to quantize and finetune the models in all three schemes. The network-wise scheme uses 4 bits to quantize the whole model, while the layer-wise scheme searches a QBN for the weights of each layer and chooses another QBN for the activations of the same layer. AutoQ chooses a QBN for each weight kernel and selects another QBN for each activation layer of a CNN.

Table 3: Network quantization by AutoQ (A-QBN: the average QBN of activations; W-QBN: the average QBN of weights; LAT: inference latency; RC: resource-constrained search; AG: accuracy-guaranteed search).
| model | scheme | RC top-1 err (%) | RC top-5 err (%) | RC A-QBN (bit) | RC W-QBN (bit) | RC LAT (ms) | AG top-1 err (%) | AG top-5 err (%) | AG A-QBN (bit) | AG W-QBN (bit) | AG LAT (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | network-wise | 32.7 | 12.32 | 4 | 4 | 296.8 | 32.7 | 12.32 | 4 | 4 | 296.8 |
| ResNet-18 | layer-wise | 31.8 | 11.92 | 3.32 | 4.63 | 290.9 | 32.5 | 11.90 | 3.37 | 3.65 | 189.6 |
| ResNet-18 | kernel-wise | 30.22 | 11.62 | 4.12 | 3.32 | 286.3 | 32.6 | 11.82 | 3.02 | 2.19 | 125.3 |
| ResNet-18 | original | 30.10 | 11.62 | 16 | 16 | 1163 | 30.10 | 11.62 | 16 | 16 | 1163 |
| ResNet-50 | network-wise | 27.57 | 9.02 | 4 | 4 | 616.3 | 27.57 | 9.02 | 4 | 4 | 616.3 |
| ResNet-50 | layer-wise | 26.79 | 8.32 | 4.23 | 3.51 | 612.3 | 27.49 | 9.15 | 4.02 | 3.12 | 486.4 |
| ResNet-50 | kernel-wise | 25.53 | 7.92 | 3.93 | 4.02 | 610.3 | 27.53 | 9.12 | 3.07 | 2.21 | 327.3 |
| ResNet-50 | original | 25.20 | 7.82 | 16 | 16 | 2357 | 25.20 | 7.82 | 16 | 16 | 2357 |
| SqueezeNetV1 | network-wise | 45.67 | 23.12 | 4 | 4 | 43.1 | 45.67 | 23.12 | 4 | 4 | 43.1 |
| SqueezeNetV1 | layer-wise | 44.89 | 21.14 | 3.56 | 4.27 | 42.1 | 45.63 | 23.04 | 3.95 | 3.28 | 25.5 |
| SqueezeNetV1 | kernel-wise | 43.51 | 20.89 | 4.05 | 3.76 | 41.6 | 45.34 | 23.02 | 3.29 | 2.32 | 12.5 |
| SqueezeNetV1 | original | 43.10 | 20.5 | 16 | 16 | 127.3 | 43.10 | 20.5 | 16 | 16 | 127.3 |
| MobileNetV2 | network-wise | 31.75 | 11.67 | 4 | 4 | 37.4 | 31.35 | 11.67 | 4 | 4 | 37.4 |
| MobileNetV2 | layer-wise | 30.98 | 10.57 | 3.57 | 4.22 | 36.9 | 31.34 | 10.57 | 3.92 | 3.21 | 23.9 |
| MobileNetV2 | kernel-wise | 29.20 | 9.67 | 4.14 | 3.67 | 36.1 | 31.32 | 11.32 | 3.13 | 2.26 | 10.2 |
| MobileNetV2 | original | 28.90 | 9.37 | 16 | 16 | 123.6 | 28.90 | 9.37 | 16 | 16 | 123.6 |

In Table 3, the average QBN of the weights (W-QBN) is calculated as

$$\text{W-QBN} = \frac{\sum_{i=1}^{n_{layer}} \sum_{j=1}^{c_{out_i}} WeightQBN[L_i, K_j]}{\sum_{i=1}^{n_{layer}} c_{out_i}} \tag{7}$$

where $c_{out_i}$ is the number of output channels in the layer $L_i$ and $WeightQBN[L_i, K_j]$ is the QBN of the $K_j$th weight kernel in the layer $L_i$. The average QBN of the activations (A-QBN) is computed as $\frac{\sum_{i=1}^{n_{layer}} ActQBN_{L_i}}{n_{layer}}$, where $ActQBN_{L_i}$ is the QBN for all activations of the layer $L_i$. Compared to the layer-wise quantization, AutoQ improves the top-1 inference accuracy by > 1.25% while spending almost the same inference latency. Compared to the 16-bit full-precision models, the models quantized by AutoQ degrade the inference accuracy by at most 0.41%, but reduce the inference latency by 71.2% on average.

Accuracy-guaranteed Quantization. We run AutoQ for accuracy-guaranteed searches by setting $\psi_{acc} = 2$, $\psi_l = 0.5$, $\psi_e = 0$ and $\psi_a = 0$ in the extrinsic reward. Such an extrinsic reward drives AutoQ to quantize the models for the shortest inference latency without significant accuracy loss. Compared to the layer-wise scheme, AutoQ substantially reduces the inference latency by 42.2% while achieving a similar top-1 inference accuracy (on average only 0.1% lower). Compared to ResNet-18 and ResNet-50, compact models such as SqueezeNetV1 suffer a larger top-1 accuracy degradation, i.e., 0.3%, in an accuracy-guaranteed search of AutoQ.

[Figure 4: The average QBNs in various layers (1-19) of ResNet-18 found by AutoQ.]

[Figure 5: The weight kernel QBNs in a layer (kernels 0-250 of one ResNet-18 layer) found by AutoQ.]

4.2 DETAILED ANALYSIS

Kernel-wise Search. AutoQ can assign a QBN to each kernel of a convolutional layer. The average weight QBN and the average activation QBN of each ResNet-18 layer found by an accuracy-guaranteed AutoQ search are shown in Figure 4. Both the network-wise and the layer-wise quantization techniques use only one QBN to quantize all weight kernels in a convolutional layer and quantize all activations of the layer with another QBN. On the contrary, AutoQ searches a QBN for each weight kernel. Compared to a CNN model quantized by the network-wise or layer-wise technique, the same model quantized by the kernel-wise AutoQ achieves similar inference accuracy with a smaller average QBN in each layer.
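The W-QBN and A-QBN columns of Table 3 and the layer averages of Figure 4 follow Eq. (7) directly. A small sketch with made-up QBN assignments:

```python
def w_qbn(kernel_qbns):
    """Eq. (7): average weight QBN over all kernels of all layers.

    kernel_qbns -- list of per-layer lists, one QBN per weight kernel
    """
    total = sum(sum(layer) for layer in kernel_qbns)
    return total / sum(len(layer) for layer in kernel_qbns)

def a_qbn(act_qbns):
    """Average activation QBN: one QBN per activation layer."""
    return sum(act_qbns) / len(act_qbns)

# Toy 3-layer example (not a configuration from the paper).
kernels = [[4, 3, 2, 3], [2, 2, 3, 5], [3, 3]]
print(round(w_qbn(kernels), 2), round(a_qbn([4, 3, 3]), 2))   # 3.0 3.33
```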
We also show the weight kernel QBNs of the L14 layer of ResNet-18 produced by resource-constrained AutoQ searches in Figure 5. AutoQ automatically identifies which weight kernels have a smaller (larger) variance and thus more (less) redundancy, and assigns them a smaller (larger) QBN. For instance, as Figure 1 shows, compared to the 53rd weight kernel (top-right), the 52nd weight kernel (top-left) of ResNet-18 has a smaller weight distribution variance. Therefore, in Figure 5, AutoQ assigns a smaller QBN to the 52nd weight kernel but gives the 53rd weight kernel a larger QBN.

[Figure 6: The DRL scheme comparison: inference accuracy (0-100%) over 350 training episodes for DDPG, HIRO and AutoQ.]

Hierarchical DRL Agent with Shaped Intrinsic Reward. We evaluated and compared our hierarchical-DRL-based AutoQ against the traditional one-level DDPG-based DRL adopted by a recent layer-wise quantization technique, HAQ (Wang et al., 2019). The reward comparison of the different techniques during kernel-wise quantization of MobileNetV2 is shown in Figure 6. HAQ and AutoQ both support resource-constrained searches, but HAQ cannot support accuracy-guaranteed searches, so their rewards here are just the inference accuracy. Through the goals of the HLC and the actions of the LLC, AutoQ can find a QBN for each weight kernel and reach > 70% accuracy much faster than the DDPG-based DRL, i.e., after only 200 episodes, whereas the DDPG-based DRL is still stuck at 20% inference accuracy after 250 episodes. The hierarchical-DRL-based AutoQ significantly accelerates the search space exploration of kernel-wise network quantization. Although AutoQ builds on the prior hierarchical DRL agent HIRO (Nachum et al., 2018) to search a QBN for each weight kernel, we propose a novel shaped intrinsic reward considering both the completion of the HLC goals and the extrinsic reward to accelerate the search. The intrinsic reward of HIRO takes only the completion of the HLC goals into consideration, so the LLC of HIRO cannot directly learn from the environment. Therefore, compared to AutoQ, HIRO takes an extra 200 episodes to reach only 60% accuracy, as shown in Figure 6.

Extrinsic Reward. Unlike the reward of the DDPG-based layer-wise HAQ (Wang et al., 2019), which considers only the inference accuracy, the extrinsic reward of AutoQ can balance the trade-off between the inference accuracy, latency, energy consumption and FPGA area by enabling the accuracy-guaranteed search. By setting $\psi_{acc} = 2$, $\psi_l = 0.5$, $\psi_e = 0.5$ and $\psi_a = 0.5$, AutoQ takes the inference accuracy, latency, energy and FPGA area into consideration during an accuracy-guaranteed search. For instance, AutoQ can find two kernel-wise QBN configurations with similar inference accuracy, latency and energy for MobileNetV2; these two configurations cannot be differentiated using only the HAQ reward. However, the first configuration consumes 94% of the FPGA area, while the other occupies 85%. AutoQ identifies the second QBN configuration as the better choice via its extrinsic reward.
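The area tie-break above falls straight out of Eq. (2). A small sketch with hypothetical numbers (only the 94% vs. 85% utilization is from the text):

```python
import math

def ext_reward(acc, lat, en, area, p_acc=2, p_l=0.5, p_e=0.5, p_a=0.5):
    """Extrinsic reward of Eq. (2): log(acc^psi_acc / (lat^psi_l * en^psi_e * area^psi_a))."""
    return math.log(acc ** p_acc / (lat ** p_l * en ** p_e * area ** p_a))

# Two hypothetical MobileNetV2 QBN configurations: identical accuracy,
# latency and energy, differing only in FPGA area utilization (94% vs 85%).
r1 = ext_reward(acc=0.71, lat=10.2, en=8.0, area=0.94)
r2 = ext_reward(acc=0.71, lat=10.2, en=8.0, area=0.85)
print(r2 > r1)   # True: the smaller-area configuration earns the larger reward
```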
[Figure 7: The comparison of latency (ms) and energy (mJ) between temporal and spatial CNN accelerators for SqueezeNet and MobileNetV2 under network-, layer-, kernel- and sub-kernel-wise quantization: (a) latency, (b) energy.]

Quantization Granularity. Besides the temporal CNN accelerator (Umuroglu et al., 2019b), the kernel-wise quantized models found by the accuracy-guaranteed AutoQ can also reduce the inference latency on a spatial CNN accelerator, BitFusion (Sharma et al., 2018), which relies on a 2D systolic array of fusion units that spatially sum the shifted partial products of weights and activations. As Figure 7 shows, compared to the layer-wise quantized models, the kernel-wise quantized models reduce the inference latency by 39.04% and the inference energy by 33.34% on average on the spatial CNN accelerator. Therefore, the kernel-wise quantized models greatly reduce the inference latency and energy on both temporal and spatial CNN accelerators. Prior works (Mellempudi et al., 2017; Choukroun et al., 2019a) suggest that it is possible to divide a weight kernel into several sub-kernels and quantize each sub-kernel independently. We also used AutoQ to search a QBN for each weight sub-kernel. As Figure 7 shows, the sub-kernel-wise quantized models cannot improve the inference latency or energy on the spatial CNN accelerator built from systolic computing arrays: each dot-product of a sub-kernel-wise quantized model has to be split into several dot-products that are accumulated together, and a systolic computing array still has to be provisioned for the weight sub-kernel with the largest QBN in a kernel. Therefore, it is difficult for fine-grained quantization schemes that choose a QBN for each weight unit smaller than a kernel to further reduce the inference latency or energy on either temporal or spatial CNN accelerators.

5 CONCLUSION

In this paper, we propose a hierarchical-DRL-based kernel-wise network quantization technique, AutoQ, consisting of an HLC and an LLC. The HLC automatically searches an average weight QBN and an activation QBN for each convolutional layer. Based on the average weight QBN, the LLC generates a QBN for each weight kernel in each layer. We also create a state space, a goal and action space, an intrinsic reward and an extrinsic reward to support AutoQ. In particular, our shaped intrinsic reward enables the LLC to learn efficiently from the environment by considering both the HLC goal completion and the environment's extrinsic reward. Moreover, the extrinsic reward of AutoQ balances the inference accuracy, latency, energy consumption and FPGA area. Compared to the models quantized by state-of-the-art DRL-based schemes, on average, the same models quantized by AutoQ reduce the inference latency by 54.06% and the inference energy consumption by 50.69%, while achieving the same inference accuracy.

REFERENCES

Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. CoRR, abs/1611.02167, 2016.

Yoni Choukroun, Eli Kravchik, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. CoRR, abs/1902.06822, 2019a.

Yoni Choukroun, Eli Kravchik, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822, 2019b.
Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In International Conference on Neural Information Processing Systems, 2014.

Ahmed T. Elthakeb, Prannoy Pilligundla, Amir Yazdanbakhsh, Sean Kinzer, and Hadi Esmaeilzadeh. ReLeQ: A reinforcement learning approach for deep quantization of neural networks. CoRR, abs/1811.01704, 2018.

Sridhar Gopinath, Nikhil Ghanathe, Vivek Seshadri, and Rahul Sharma. Compiling KB-sized machine learning models to tiny IoT devices. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 79-95, 2019.

Yiwen Guo, Anbang Yao, Hao Zhao, and Yurong Chen. Network sketching: Exploiting binary structure in deep CNNs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955-5963, 2017.

Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In European Conference on Computer Vision, 2018.

Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition, June 2018.

Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabás Póczos, and Eric Xing. Neural architecture search with Bayesian optimisation and optimal transport. CoRR, abs/1802.07191, 2018.

Minje Kim and Paris Smaragdis. Bitwise neural networks. In ICML Workshop on Resource-Efficient Machine Learning, 2016.

Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.

Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. Fully quantized network for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2810-2819, 2019.

Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, 2016.

Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345-353. Curran Associates, Inc., 2017.

Hung-Yi Liu and Luca P. Carloni. On learning-based methods for design-space exploration with high-level synthesis. In IEEE/ACM Design Automation Conference, 2013.

Jeffrey L. McKinstry, Steven K. Esser, Rathinakumar Appuswamy, Deepika Bablani, John V. Arthur, Izzet B. Yildiz, and Dharmendra S. Modha. Discovering low-precision networks close to full-precision networks for efficient embedded inference. CoRR, abs/1809.04191, 2018.

Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary neural networks with fine-grained quantization. CoRR, abs/1705.01462, 2017.

Ofir Nachum, Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Annual Conference on Neural Information Processing Systems, 2018.
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

S. Sasaki, A. Maki, D. Miyashita, and J. Deguchi. Post training weight compression with distribution-based filter-wise quantization step. In IEEE Symposium on Low-Power and High-Speed Chips, pp. 1-3, 2019.

H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ACM/IEEE International Symposium on Computer Architecture, pp. 764-775, 2018.

Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99-127, June 2002. ISSN 1063-6560.

Lawrence Stewart and Mark Stalzer. Bayesian optimization for parameter tuning of the XOR neural network. CoRR, abs/1709.07842, 2018.

Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. In ACM Genetic and Evolutionary Computation Conference, pp. 497-504, 2017.

Wei Tang, Gang Hua, and Liang Wang. How to train a compact binary neural network with high accuracy? In AAAI Conference on Artificial Intelligence, pp. 2625-2631, 2017.

Yaman Umuroglu, Davide Conficconi, Lahiru Rasnayake, Thomas B. Preusser, and Magnus Själander. Optimizing bit-serial matrix multiplication for reconfigurable computing. ACM Transactions on Reconfigurable Technology and Systems, 12(3), August 2019a.

Yaman Umuroglu, Davide Conficconi, Lahiru Rasnayake, Thomas B. Preusser, and Magnus Själander. Optimizing bit-serial matrix multiplication for reconfigurable computing. ACM Transactions on Reconfigurable Technology and Systems, 12(3), August 2019b.

Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization with mixed precision. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612-8620, 2019.

Yunhe Wang, Chang Xu, Jiayan Qiu, Chao Xu, and Dacheng Tao. Towards evolutionary compression. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2476-2485, 2018.

Chen Xu, Jianqiang Yao, Zhouchen Lin, Wenwu Ou, Yuanbin Cao, Zhirong Wang, and Hongbin Zha. Alternating multi-bit quantization for recurrent neural networks. In International Conference on Learning Representations, 2018.

Linghua Zeng, Zhangcheng Wang, and Xinmei Tian. KCNN: Kernel-wise quantization to remarkably decrease multiplications in convolutional neural network. In International Joint Conference on Artificial Intelligence, pp. 4234-4242, 2019.

Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In European Conference on Computer Vision, 2018.

Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations, 2017.

Yuan Zhou, Haoxing Ren, Yanqing Zhang, Ben Keller, Brucek Khailany, and Zhiru Zhang. PRIMAL: Power inference using machine learning. In IEEE/ACM Design Automation Conference, 2019.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.