# Rethinking Data-Free Quantization as a Zero-Sum Game

Biao Qian, Yang Wang, Richang Hong, Meng Wang
Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education, School of Computer Science and Information Engineering, Hefei University of Technology, China
yangwang@hfut.edu.cn, {hfutqian,hongrc.hfut,eric.mengwang}@gmail.com
Yang Wang is the corresponding author.

Data-free quantization (DFQ) recovers the performance of a quantized network (Q) without accessing the real data, generating fake samples via a generator (G) that learns from the full-precision network (P) instead. However, such a sample generation process is totally independent of Q, failing to consider the adaptability of the generated samples, i.e., whether they are beneficial or adversarial to the learning process of Q, which results in a non-negligible performance loss. Building on this, several crucial questions impel us to revisit DFQ: how to measure and exploit the sample adaptability to Q under varied bit-width scenarios, and how to generate samples with desirable adaptability to benefit the quantized network? In this paper, we answer the above questions from a game-theory perspective by specializing DFQ as a zero-sum game between two players, a generator and a quantized network, and further propose an Adaptability-aware Sample Generation (AdaSG) method. Technically, AdaSG reformulates DFQ as a dynamic maximization-vs-minimization game process anchored on the sample adaptability. The maximization process aims to generate samples with desirable adaptability, while this adaptability is subsequently reduced by the minimization process after calibrating Q for performance recovery. The Balance Gap is defined to guide the stationarity of the game process to maximally benefit Q. The theoretical analysis and empirical studies verify the superiority of AdaSG over the state-of-the-arts. Our code is available at https://github.com/hfutqian/AdaSG.

Figure 1: Test accuracy (%) vs. epoch under (a) 3-bit and (b) 5-bit precision. Existing work, e.g., GDFQ (Xu et al. 2020, blue) and Qimera (Choi et al. 2021, orange), suffers from a non-negligible accuracy loss, as it fails to consider the sample adaptability. Our AdaSG (green) generates samples with desirable adaptability to maximally recover Q under varied bit widths. The observations are from ResNet-18 on ImageNet.

## Introduction

Deep Neural Networks (DNNs) have encountered great challenges in applications (Krizhevsky, Sutskever, and Hinton 2017; Wang 2021) on resource-constrained devices, owing to the increasing demands for computing and storage resources. Network quantization (Lin, Talathi, and Annapureddy 2016; Jacob et al. 2018), which reduces the model size and energy consumption by mapping the floating-point weights and activations to low-bit ones, is a promising approach to improve the efficiency of DNNs for model compression (Han, Mao, and Dally 2015; Qian et al. 2021). Quantization methods generally dedicate themselves to recovering the performance drop originating
from the quantization errors, which involves fine-tuning or calibration operations with the original training data. However, in many real-world scenarios, such as medical and military fields, the original data may not be accessible due to privacy and security issues. Fortunately, the recently proposed data-free quantization (DFQ), a potential method to quantize models without accessing the original data, aims to synthesize meaningful fake samples instead, which improves the quantized network (Q) by knowledge distillation (Hinton, Vinyals, and Dean 2015; Qian et al. 2022) against the pre-trained full-precision model (P). Among the prior research (Cai et al. 2020; Zhang et al. 2021), the generative approaches (Xu et al. 2020; Choi et al. 2021; Zhu et al. 2021) have recently attracted increasing attention, owing to their superior performance. A generative model is introduced as a generator (G) to capture the distribution of the original data from P for better fake samples, where P is regarded as the discriminator to guide the generation process. For example, Qimera (Choi et al. 2021) generated boundary supporting samples to reduce the gap between the synthetic and real data. Nevertheless, there still remains a non-negligible performance loss when encountering various bit-width settings. The reasons may lie in several aspects: (1) Due to the limited capacity of the generator, the generated samples with incomplete distribution cannot fully recover the original dataset, so a crucial criterion arises: is the sample beneficial or adversarial over the learning process of Q? However, in the existing arts, the generated sample, customized for P, cannot always benefit Q in varied bit-width settings (e.g., 3-bit or 5-bit precision; refer to (2)(3)), where only limited information from P can be utilized to recover Q. (2) In 3-bit precision, Q often suffers from a sharp accuracy drop compared to P due to the large quantization error, leading to its poor learning ability. In such a case, the generated sample by G may bring an unexpectedly large disagreement between the predictions of P and Q, which makes the optimization loss too large to converge, resulting in no improvement; see the orange flat curve in Fig.1(a). (3) In 5-bit precision, Q still possesses comparable recognition ability with P due to a small accuracy drop. In such a case, most of the generated samples by G, for which Q and P give similar predictions (i.e., reach an agreement), may not benefit Q. However, under the constraint of the optimization loss, Q receives no improvement, even impairment; see the blue curve (Epoch 0-50) in Fig.1(b).

Figure 2: Unlike the existing arts (a), our AdaSG (b) aims to generate samples with desirable adaptability from G, i.e., with dependence of the generated sample on Q, to maximally benefit Q with varied bit widths.

Based on the above, the existing approaches fail to consider the sample adaptability, i.e., whether a sample is beneficial or adversarial over the calibration process of Q, to Q with varied bit widths during the sample generation process from G, where Q is independent of the generation process; see Fig.2(a). For example, for (2), the sample with large adaptability may be one with a small disagreement between P and Q; while for (3), it may be one with a large disagreement. The above naturally elicits the following basic questions: (i) how to measure and exploit the sample adaptability to the quantized network under varied bit-width scenarios?
(ii) how to generate the sample with desirable adaptability to benefit the quantized network? To answer the above questions, we consider generating samples with large adaptability to Q by taking Q into account during the generation process; see Fig.2(b). Specifically, G aims to generate the sample with large adaptability, which essentially maximizes the reward for G, by enlarging the disagreement between P and Q, to benefit Q; while Q is calibrated to improve itself by exploiting the sample with large adaptability. Apparently, such sample adaptability will no longer be large to Q after Q is refined; in other words, the process of benefiting Q decreases the sample adaptability, which essentially minimizes the loss for Q and is adversarial to the reward-maximizing goal of G. Based on the above, we rethink the data-free quantization process and formulate it as a zero-sum game (Li et al. 2022; Zhang et al. 2022) between two players, a generator and a quantized network (an adversarial game process where one player's reward is the other's loss while their sum is zero); and we further propose an Adaptability-aware Sample Generation (AdaSG) method. Technically, we specify it via a dynamic maximization-vs-minimization game process anchored on sample adaptability, defined as:

$$\min_{\theta_q \in \Theta_q} \max_{\theta_g \in \Theta_g} R(\theta_g, \theta_q), \qquad (1)$$

where G and Q are parameterized by $\theta_g \in \Theta_g$ and $\theta_q \in \Theta_q$, respectively. The optimization of Eq.(1) consists of fixing $\theta_q$ and updating $\theta_g$ to the optimal $\theta_g^*$ for maximizing $R(\theta_g, \theta_q)$; then alternatively fixing $\theta_g^*$ and updating $\theta_q$ to the optimal $\theta_q^*$ for minimizing $R(\theta_g^*, \theta_q)$. Specifically, the maximization process exploits the sample adaptability to Q upon P to generate two types of samples: disagreement samples (i.e., P can predict correctly but Q cannot) and agreement samples (i.e., P and Q have the same prediction); such sample adaptability is further reduced by the minimization process after calibrating Q for performance recovery. To achieve the stationarity that maximally benefits Q, the Balance Gap (BG) over the adaptability of the generated samples within the game process is defined to set up a stationary objective balancing the maximization versus the minimization process. We remark that AdaSG is essentially an adversarial game governed by the sample adaptability, which is fundamentally orthogonal to the existing arts that improve Q by transferring knowledge from P to Q. One recent study (Liu, Zhang, and Wang 2021) generates adversarial samples via G by maximizing the gap between P and Q, and then minimizes their gap to benefit Q during calibration. However, it fails to consider the sample adaptability to Q, hence suffering from non-ideal generated samples. Besides, it focuses primarily on adversarial sample generation rather than the adversarial game process adopted by AdaSG to generate samples with desirable adaptability. The theoretical analysis and empirical studies validate the superiority of AdaSG over the state-of-the-arts.

## Adaptability-aware Sample Generation

Recent generative data-free quantization (DFQ) methods (Xu et al. 2020; Choi et al. 2021; Zhu et al. 2021) aim to reconstruct the original training dataset with a generator (G) by exploiting the distribution from a pre-trained full-precision network (P); the generated samples are then exploited to recover the performance of the quantized network (Q) via the calibration operation.
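To make this pipeline concrete, below is a minimal PyTorch-style sketch of one training step of such a generic generative DFQ scheme. It is only illustrative: the concrete losses differ across methods, BNS matching and other terms are omitted, and the function signature is hypothetical. Note that Q enters only the calibration step, not the generation step.

```python
import torch
import torch.nn.functional as F

def generic_dfq_step(G, P, Q, opt_g, opt_q, batch_size, num_classes, latent_dim, device="cpu"):
    """One step of a generic generative DFQ pipeline (illustrative sketch only)."""
    z = torch.randn(batch_size, latent_dim, device=device)
    y = torch.randint(num_classes, (batch_size,), device=device)

    # 1) Generator update: make P classify the fake samples as their assigned labels.
    #    Real methods add further terms (e.g., BNS matching); Q plays no role here.
    x = G(z, y)
    loss_g = F.cross_entropy(P(x), y)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # 2) Calibration update: Q mimics P's predictions on the (fixed) generated samples.
    with torch.no_grad():
        x = G(z, y)
        targets = F.softmax(P(x), dim=1)
    loss_q = F.kl_div(F.log_softmax(Q(x), dim=1), targets, reduction="batchmean")
    opt_q.zero_grad()
    loss_q.backward()
    opt_q.step()
```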
However, we observe that Q is independent of the generation process in the existing arts, and whether the generated sample is beneficial or adversarial over the calibration process of Q, namely the sample adaptability, is crucial to the DFQ process. Motivated by the above, we focus primarily on the sample adaptability to Q. Building on this, the DFQ process is formulated as a dynamic maximization-vs-minimization game process governed by the sample adaptability, as illustrated in Fig.4. Naturally, one crucial question is how to measure the sample adaptability to Q, which is discussed in the next subsection.

Figure 3: Given 6000 real images from ImageNet as the input to both P and Q upon ResNet-18, the disagreement between P and Q (shown as histograms of disagreement vs. sample number) varies greatly for Q with (a) 3-bit and (b) 5-bit precision.

### How to Measure the Sample Adaptability to Q?

To measure the sample adaptability to Q, we primarily focus on two crucial issues: 1) the dependence of the generated sample on Q, and 2) the disagreement between the predictions of P and Q, in that for Q with different bit widths (e.g., 3 or 5 bit), the disagreement varies greatly when delivering the same sample (take real data as an example) to both P and Q; see Fig.3. We define the generated sample that depends on Q, namely the disagreement sample, below:

Definition 1 (Disagreement Sample). Given a random noise vector $z \sim \mathcal{N}(0, 1)$ and an arbitrary one-hot label $y$, a generator G generates a sample $x = G(z|y)$. The logit outputs of the pre-trained full-precision model P and the quantized model Q are given as $z_p = P(x)$ and $z_q = Q(x)$, respectively. Suppose that the generated sample $x$ can be correctly predicted by P, i.e., $\arg\max(z_p) = \arg\max(y)$. We say $x$ is a disagreement sample if $\arg\max(z_p) \neq \arg\max(z_q)$.

Thus the probability vector encoding the disagreement between P and Q is formulated as

$$p_{ds} = \mathrm{softmax}(z_p - z_q) \in \mathbb{R}^C, \qquad (2)$$

where $p_{ds}(c) = \frac{\exp(z_p(c) - z_q(c))}{\sum_{j=1}^{C} \exp(z_p(j) - z_q(j))}$ ($c \in \{1, 2, \ldots, C\}$) represents the c-th entry of the vector $p_{ds}$, i.e., the probability that $x$ is labeled as the disagreement sample of the c-th class; C denotes the number of classes. Together with the sample dependence on Q, the disagreement sample can be viewed as one with large adaptability to Q. Another key problem is how to model the disagreement between P and Q, which can be exploited to measure the sample adaptability. Eq.(2) shows that the disagreement reaches its maximum provided $p_{ds}(c)$ approaches 1 while the other entries approach 0, which corresponds to the minimum entropy of $p_{ds}$; the disagreement reaches its minimum if each element in $p_{ds}$ is equal, indicating the maximum entropy of $p_{ds}$. Hence, the disagreement can be computed via the information entropy function $H_{info}(\cdot)$, formulated as

$$H_{info}(p_{ds}) = \sum_{c=1}^{C} p_{ds}(c) \log \frac{1}{p_{ds}(c)}. \qquad (3)$$

Since C varies greatly across datasets, we further normalize $H_{info}(p_{ds})$ as

$$H = 1 - \bar{H}_{info}(p_{ds}) \in [0, 1), \qquad (4)$$

where $\bar{H}_{info}(p_{ds}) = \frac{H_{info}(p_{ds}) - \min(H_{info}(p_{ds}))}{\max(H_{info}(p_{ds})) - \min(H_{info}(p_{ds}))}$. The constant $\max(H_{info}(p_{ds})) = \sum_{c=1}^{C} \frac{1}{C} \log C$ is the maximum value of $H_{info}(p_{ds})$, attained when each element in $p_{ds}$ has the same class probability $\frac{1}{C}$, i.e., when Q perfectly aligns with P ($z_p = z_q$); $\min(\cdot)$ is utilized to obtain the minimum of $H_{info}(p_{ds})$ within a batch. Thus, the sample adaptability is closely related to H: the larger H, the larger the disagreement between P and Q over the sample, yielding samples with larger adaptability.
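The adaptability measure in Eqs.(2)-(4) can be computed directly from the two models' logits. Below is a minimal PyTorch-style sketch, assuming batched logits of shape (B, C); the batch-wise minimum follows the normalization above, and eps is only for numerical stability.

```python
import torch
import torch.nn.functional as F

def sample_adaptability(z_p, z_q, eps=1e-8):
    """Sketch of Eqs.(2)-(4): per-sample adaptability H from the logits of P and Q.

    z_p, z_q: logits of P and Q for a batch of generated samples, shape (B, C).
    Returns p_ds of shape (B, C) and H of shape (B,); larger H means larger disagreement.
    """
    p_ds = F.softmax(z_p - z_q, dim=1)                       # Eq.(2)
    h_info = -(p_ds * torch.log(p_ds + eps)).sum(dim=1)      # Eq.(3): entropy of p_ds
    h_max = torch.log(torch.tensor(float(p_ds.size(1))))     # max entropy = log C (uniform p_ds)
    h_min = h_info.min().detach()                            # batch-wise minimum
    h_bar = (h_info - h_min) / (h_max - h_min + eps)         # normalized entropy in [0, 1]
    H = 1.0 - h_bar                                          # Eq.(4)
    return p_ds, H
```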
As per Eqs.(2)-(4), $p_{ds}$ lies in a C-dimensional vector space, while H is a real value. We hence map H into the C-dimensional space $\mathbb{R}^C$ while characterizing its magnitude, which is achieved via the unit vector $\frac{p_{ds}}{\|p_{ds}\|}$ ($\|\cdot\|$ denotes the $\ell_2$ norm); H is then reformulated as

$$H^C = \frac{p_{ds}}{\|p_{ds}\|} H = \frac{p_{ds}}{\|p_{ds}\|} (1 - \bar{H}_{info}(p_{ds})) \in \mathbb{R}^C, \qquad (5)$$

where the category information (i.e., $\frac{p_{ds}}{\|p_{ds}\|}$) of the generated sample, apart from the disagreement (i.e., H), is exploited to better measure the sample adaptability.

The measurement of sample adaptability inspires us to revisit the DFQ process based on H. The goal of G is to generate samples with large adaptability to Q, i.e., to increase the sample adaptability by maximizing H (the gain of H is positive, serving as the reward for G), so as to benefit Q; during the calibration process, Q is optimized to recover itself with the generated samples. This implies that the sample adaptability will no longer be large to Q, and the sample can no longer benefit Q after Q is refined; in other words, the process of benefiting Q by learning from P decreases the sample adaptability, which is achieved by minimizing H (the gain of H is negative, serving as the loss for Q), adversarial to maximizing H. The loss for Q thus cancels out the reward for G, such that the sum of the reward and the loss tends to zero. This makes the overall change (summation) of the disagreement H close to 0 (see Fig.4 for the intuition), which is in line with the principle of a zero-sum game (Shoham and Leyton-Brown 2008; v. Neumann 1928), as discussed in the next subsection.

### Zero-Sum Game: Adversarial Game for Data-Free Quantization

We formally revisit the DFQ process from a game-theory perspective (Li et al. 2022; Zhang et al. 2022), and formulate DFQ as a zero-sum game between two players, a generator and a quantized network, which is an expansion of Eq.(1), as follows:

$$\min_{\theta_q \in \Theta_q} \max_{\theta_g \in \Theta_g} R(\theta_g, \theta_q) = \min_{\theta_q \in \Theta_q} \max_{\theta_g \in \Theta_g} \mathbb{E}_{z,y}[1 - \bar{H}_{info}(p_{ds})], \qquad (6)$$

where G and Q are parameterized by $\theta_g \in \Theta_g$ and $\theta_q \in \Theta_q$, respectively.

Figure 4: Illustration of the data-free quantization process as a zero-sum game between G (player 1) and Q (player 2). G generates samples with large adaptability by maximizing $R(\cdot, \cdot)$ during the maximization process; such sample adaptability is further reduced by minimizing $R(\cdot, \cdot)$ during the minimization process when calibrating Q.

In particular, Eq.(6) is iteratively optimized via gradient descent during the training process, where each iteration consists of two steps: on one hand, $\theta_q$ is fixed and $R(\theta_g, \theta_q)$ in Eq.(6) is maximized to update $\theta_g$, which is equivalent to generating samples with large adaptability, i.e., increasing the sample adaptability; on the other hand, $\theta_g$ is fixed and $R(\theta_g, \theta_q)$ in Eq.(6) is minimized to update $\theta_q$, which is equivalent to calibrating Q with the generated samples, benefiting from decreasing the sample adaptability. During the zero-sum game, the overall change (summation) of the sample adaptability to Q incurred by the maximum and minimum optimization is close to zero. One critical question is how Eq.(6) converges to an equilibrium (i.e., the sample adaptability no longer changes) for the zero-sum game. Thanks to the classical Nash equilibrium (Cardoso et al.
2019), defined as a state where no player can improve its individual gain during the zero-sum game, the optimization process will reach an equilibrium $(\theta_g^*, \theta_q^*)$ when (1) for the fixed $\theta_q^*$, G fails to find a $\theta_g \in \Theta_g$ by maximizing $R(\theta_g, \theta_q^*)$ that yields $R(\theta_g, \theta_q^*) > R(\theta_g^*, \theta_q^*)$; and (2) for the fixed $\theta_g^*$, Q can no longer improve itself, i.e., it fails to find a $\theta_q \in \Theta_q$ by minimizing $R(\theta_g^*, \theta_q)$ that yields $R(\theta_g^*, \theta_q) < R(\theta_g^*, \theta_q^*)$. Based on that, for all $\theta_g \in \Theta_g$ and $\theta_q \in \Theta_q$, $(\theta_g^*, \theta_q^*)$ satisfies the following inequality:

$$R(\theta_g, \theta_q^*) \leq R(\theta_g^*, \theta_q^*) \leq R(\theta_g^*, \theta_q), \qquad (7)$$

where $\theta_g^*$ and $\theta_q^*$ are the parameters of G and Q under an equilibrium state, respectively. We remark that maximizing $R(\cdot, \cdot)$ in Eq.(6) is equivalent to generating the sample with the largest adaptability, which may incur too large a disagreement (i.e., too small $\bar{H}_{info}(p_{ds})$) between P and Q, while Q (especially Q with low bit width) lacks sufficient ability to learn informative knowledge from P; this, in turn, encourages G to generate the sample with the lowest adaptability when alternatively maximizing $R(\cdot, \cdot)$ in Eq.(6). However, encouraging the sample with the lowest adaptability may incur too large an agreement (i.e., too large $\bar{H}_{info}(p_{ds})$) between P and Q, where the generated sample may not be informative enough to calibrate Q (especially Q with high bit width). The above facts indicate that a sample with either the largest or the lowest adaptability generated by maximizing $R(\cdot, \cdot)$ in Eq.(6) is not necessarily the best. To address these issues, we further refine the maximization objective of Eq.(6) during the zero-sum game via the cooperation between the disagreement and agreement samples, along with bound constraints on the sample adaptability, which are elaborated next.

### Maximization Process: Generating Samples with Desirable Adaptability

According to Definition 1, the category (label) information is crucial to establish the dependence of the generated sample on Q. Hence, to generate the disagreement sample with large adaptability and further maximize $R(\cdot, \cdot)$ in Eq.(6), we exploit the category (label) information to generate the sample. To achieve this, given the label $y$, the generated sample should be classified as a disagreement sample with the same label $y$. Thereby, we present the following cross-entropy loss $H_{CE}(\cdot, \cdot)$ to match $p_{ds}$ and $y$, formulated as

$$\mathcal{L}_{ds} = \mathbb{E}_{z,y}[H_{CE}(p_{ds}, y)]. \qquad (8)$$

We aim to minimize Eq.(8) to encourage G to generate the disagreement sample that P can predict correctly but Q fails; this, however, may incur too large a disagreement (i.e., too small $\bar{H}_{info}(p_{ds})$) between P and Q, thus failing to yield the desirable sample adaptability. To remedy this issue, complementary to the disagreement sample, we further define the agreement sample below:

Definition 2 (Agreement Sample). Based on Definition 1, we say the generated sample $x$ is an agreement sample if $\arg\max(z_p) = \arg\max(z_q)$.

Thus, similar in spirit to $p_{ds}$, the probability vector that describes the agreement between P and Q is formulated as

$$p_{as} = \mathrm{softmax}(z_p + z_q) \in \mathbb{R}^C, \qquad (9)$$

where $p_{as}(c) = \frac{\exp(z_p(c) + z_q(c))}{\sum_{j=1}^{C} \exp(z_p(j) + z_q(j))}$ is the c-th entry of the vector $p_{as}$, i.e., the probability that $x$ is the agreement sample of the c-th class. Following Eq.(8), the loss function for generating the agreement sample is given as

$$\mathcal{L}_{as} = \mathbb{E}_{z,y}[H_{CE}(p_{as}, y)], \qquad (10)$$

which is minimized to encourage G to generate agreement samples that both P and Q can correctly predict, so as to possess a larger $\bar{H}_{info}(p_{ds})$.
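Both losses are plain cross-entropies computed on the combined logits of P and Q. A minimal PyTorch-style sketch follows, assuming class-index labels rather than one-hot vectors (so the cross-entropy with y reduces to a negative log-likelihood):

```python
import torch.nn.functional as F

def generation_losses(z_p, z_q, y):
    """Sketch of Eqs.(8) and (10): disagreement and agreement losses for the generator.

    z_p, z_q: logits of P and Q, shape (B, C); y: target class indices, shape (B,).
    """
    log_p_ds = F.log_softmax(z_p - z_q, dim=1)   # log of p_ds in Eq.(2)
    log_p_as = F.log_softmax(z_p + z_q, dim=1)   # log of p_as in Eq.(9)
    loss_ds = F.nll_loss(log_p_ds, y)            # Eq.(8): cross-entropy between p_ds and y
    loss_as = F.nll_loss(log_p_as, y)            # Eq.(10): cross-entropy between p_as and y
    return loss_ds, loss_as
```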
**Cooperation for desirable sample adaptability.** Building on the above, letting $\mathcal{L}_{ds}$ and $\mathcal{L}_{as}$ cooperate with each other to maximize $R(\cdot, \cdot)$ in Eq.(6), samples with desirable adaptability can be generated. Hence, the generation loss is given as

$$\mathcal{L}_s = \mathcal{L}_{ds} + \mathcal{L}_{as}. \qquad (11)$$

With $\mathcal{L}_s$, when a large disagreement occurs, $\mathcal{L}_{as}$ dominates to generate agreement samples, reducing the gap between P and Q so as to enlarge $\bar{H}_{info}(p_{ds})$ for the disagreement sample; analogously, $\mathcal{L}_{ds}$ dominates to reduce $\bar{H}_{info}(p_{ds})$ for the agreement sample; see Fig.5. Nevertheless, in some cases, there still exist samples with either too small or too large $\bar{H}_{info}(p_{ds})$ that lie beyond the cooperation ability.

Figure 5: Illustration of AdaSG on achieving the stationarity of the zero-sum game via the balance gap (BG). When BG > 0 (BG < 0), the generated sample leads to a large disagreement (agreement) between P and Q, owing to the weak (strong) learning ability of Q, e.g., Q with low-bit (high-bit) precision, resulting in too large (small) sample adaptability. When BG = 0, the DFQ game process is stationary, revealing that the sample adaptability increased by the maximization process can be fully exploited by the minimization process to maximally benefit Q with varied bit widths.

**Bound constraint on sample adaptability.** To address the above problem, we further impose bound constraints on $\bar{H}_{info}(p_{ds})$ via the hinge loss (Lim and Ye 2017) below:

$$\mathcal{L}_b = \mathbb{E}_{z,y}[\max(\lambda_l - \bar{H}_{info}(p_{ds}), 0)] + \mathbb{E}_{z,y}[\max(\bar{H}_{info}(p_{ds}) - \lambda_u, 0)], \qquad (12)$$

where $\lambda_l$ and $\lambda_u$ denote the lower and upper bounds of $\bar{H}_{info}(p_{ds})$, such that $0 \leq \lambda_l < \lambda_u \leq 1$. Specifically, $\lambda_l$ serves to prevent G from generating samples with too large adaptability (i.e., too small $\bar{H}_{info}(p_{ds})$) via maximizing Eq.(6), while $\lambda_u$ aims to avoid samples with too small adaptability (i.e., too large $\bar{H}_{info}(p_{ds})$), which, in turn, guarantees the cooperation between $\mathcal{L}_{ds}$ and $\mathcal{L}_{as}$.

**BNS information.** The above discusses the adaptability of a single sample to Q, while calibrating Q requires the generated samples within a batch, motivating us to exploit the distribution information of the training data. We consider the batch normalization statistics (BNS) information (Xu et al. 2020; Choi et al. 2021) about the training data contained in P, which is learned by

$$\mathcal{L}_{BNS} = \sum_{m=1}^{M} (\|\mu_m^g - \mu_m\|_2^2 + \|\sigma_m^g - \sigma_m\|_2^2), \qquad (13)$$

where $\mu_m^g / \sigma_m^g$ are the mean/variance of the generated samples' distribution at the m-th BN layer of the total M layers, and $\mu_m / \sigma_m$ are the corresponding mean/variance parameters stored in the m-th BN layer of P.

**Overall loss.** To this end, we finalize the loss function for the maximization (sample generation) process as

$$\mathcal{L}_G = \alpha(\mathcal{L}_{ds} + \mathcal{L}_{as}) + \beta \mathcal{L}_b + \gamma \mathcal{L}_{BNS}, \qquad (14)$$

where $\alpha$, $\beta$ and $\gamma$ are balancing hyperparameters. By minimizing $\mathcal{L}_G$, which is equivalent to maximizing $R(\cdot, \cdot)$ in Eq.(6), samples with desirable adaptability can be generated; such samples are exploited to calibrate Q during the minimization process according to Eq.(6).
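Putting Eqs.(12)-(14) together, the generator objective can be sketched as below, reusing `generation_losses` from the earlier sketch. Here `h_bar` is the per-sample normalized entropy of Eq.(4) (i.e., 1 - H), `bns_stats` is assumed to be a list of per-BN-layer statistics gathered, e.g., with forward hooks on P (collection code omitted), and the default hyperparameter values follow those reported in the implementation details.

```python
import torch

def generator_objective(z_p, z_q, y, h_bar, bns_stats,
                        alpha=0.1, beta=1.0, gamma=1.0, lam_l=0.3, lam_u=0.8):
    """Sketch of Eqs.(12)-(14): the overall maximization-process loss L_G.

    h_bar: normalized entropy of p_ds per sample, shape (B,).
    bns_stats: list of (mu_g, var_g, mu_p, var_p) tuples, one per BN layer of P (assumed).
    """
    loss_ds, loss_as = generation_losses(z_p, z_q, y)               # Eqs.(8) and (10)
    # Eq.(12): hinge bounds keeping the normalized entropy within [lam_l, lam_u]
    loss_b = (torch.clamp(lam_l - h_bar, min=0).mean()
              + torch.clamp(h_bar - lam_u, min=0).mean())
    # Eq.(13): match the generated batch statistics with those stored in P's BN layers
    loss_bns = sum(((mu_g - mu_p) ** 2).sum() + ((var_g - var_p) ** 2).sum()
                   for mu_g, var_g, mu_p, var_p in bns_stats)
    return alpha * (loss_ds + loss_as) + beta * loss_b + gamma * loss_bns  # Eq.(14)
```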
### Minimization Process: Calibrating Q by Reducing Sample Adaptability

With the above generated samples, the goal of the minimization process is to calibrate Q for performance recovery by minimizing $R(\cdot, \cdot)$ in Eq.(6), which indicates that the disagreement between P and Q is decreased to reach an agreement. In particular, we further introduce a temperature parameter $\tau$ to soften the output; the loss function for the minimization (calibration) process is thus formulated as

$$\mathcal{L}_Q = \mathbb{E}_{z,y}[1 - \bar{H}_{info}(p_{ds}^{\tau})], \qquad (15)$$

where $p_{ds}^{\tau} = \mathrm{softmax}((z_p - z_q)/\tau)$. By alternately optimizing Eq.(14) and Eq.(15) during the zero-sum game, samples with desirable adaptability can be generated by G to maximally recover Q until a Nash equilibrium is reached.

### Theoretical Analysis: a Balance Gap for Sample Adaptability

One may wonder whether the above maximization-vs-minimization process can remain stationary during the optimization over Eq.(6), which is critical for generating samples with desirable adaptability and for Q's recovery. We confirm this via a theoretical analysis based on the balance gap (BG) over the sample adaptability, which is defined as:

Definition 3 (Balance Gap). Consider the objective $R(\cdot, \cdot)$ for the DFQ game process. Assume that, after one iteration, the parameters $(\theta_g^1, \theta_q^1)$ of the game are updated to $(\theta_g^2, \theta_q^2)$ via gradient descent. Then, the balance gap (BG) is defined as

$$BG = R(\theta_g^2, \theta_q^2) - R(\theta_g^1, \theta_q^1), \qquad (16)$$

which encodes the deviation of the value of $R(\cdot, \cdot)$ before and after each iteration.

When BG > 0 or BG < 0, the generated sample leads to a large disagreement or agreement between P and Q, owing to the weak or strong learning ability of Q, where the sample with too large or too small adaptability fails to benefit Q during the calibration process. We remark that BG is well bounded to avoid a large deviation, which facilitates generating samples with desirable adaptability. This fact is validated via the following proposition:

Proposition 1. Consider the stationarity of the DFQ game process. Then, the balance gap (BG) is bounded as below:

$$|BG| = |R(\theta_g^2, \theta_q^2) - R(\theta_g^1, \theta_q^1)| \leq L \|[\theta_g^2; \theta_q^2] - [\theta_g^1; \theta_q^1]\|, \qquad (17)$$

where L is the Lipschitz constant.

Proof. Considering the Lipschitz continuity of the objective function $R(\cdot, \cdot)$, we divide the proof into two parts.

Part 1. Lipschitz continuity. We say a function $f(x, y)$ is L-Lipschitz continuous with respect to a norm $\|\cdot\|$ if

$$|f(x_2, y_2) - f(x_1, y_1)| \leq L \|[x_2; y_2] - [x_1; y_1]\| \qquad (18)$$

for any $x_1, x_2 \in X$ and any $y_1, y_2 \in Y$. The previous inequality holds if and only if

$$\|[\nabla_x f(x, y); \nabla_y f(x, y)]\| \leq L \qquad (19)$$

for any $x \in X$ and any $y \in Y$, where L is the Lipschitz constant.

Part 2. According to the first-order Taylor expansion and the Cauchy inequality, we have

$$|BG| = |R(\theta_g^2, \theta_q^2) - R(\theta_g^1, \theta_q^1)| = \|[\nabla_{\theta_g} R(\theta_g, \theta_q); \nabla_{\theta_q} R(\theta_g, \theta_q)]^T [\theta_g^2 - \theta_g^1; \theta_q^2 - \theta_q^1]\| \leq \|[\nabla_{\theta_g} R(\theta_g, \theta_q); \nabla_{\theta_q} R(\theta_g, \theta_q)]^T\| \cdot \|[\theta_g^2; \theta_q^2] - [\theta_g^1; \theta_q^1]\|,$$

and thus Eq.(17) holds if and only if $\nabla_{\theta_g} R(\theta_g, \theta_q)$ and $\nabla_{\theta_q} R(\theta_g, \theta_q)$ are upper bounded. First, we note that

$$R(\theta_g, \theta_q) = \mathbb{E}_{z,y}[1 - \bar{H}_{info}(p_{ds})], \qquad (20)$$

where $\bar{H}_{info}(p_{ds})$ is obtained by normalizing $H_{info}(p_{ds}) = \sum_{c=1}^{C} p_{ds}(c) \log \frac{1}{p_{ds}(c)}$ and $p_{ds} = \mathrm{softmax}(z_p - z_q)$; the corresponding gradients are computed by

$$\nabla_{\theta_g} R(\theta_g, \theta_q) = \frac{\partial R(\theta_g, \theta_q)}{\partial x} \frac{\partial x}{\partial \theta_g}, \quad \nabla_{\theta_q} R(\theta_g, \theta_q) = \frac{\partial R(\theta_g, \theta_q)}{\partial z_q} \frac{\partial z_q}{\partial \theta_q}, \qquad (21)$$

where $\frac{\partial x}{\partial \theta_g}$ and $\frac{\partial z_q}{\partial \theta_q}$ are the gradients produced by G and Q, respectively. It is apparent that the gradients of the information entropy function $H_{info}(\cdot)$, the softmax function and the activation functions (e.g., Sigmoid, ReLU) in the deep neural network are all bounded.
Therefore, there exists $L \in \mathbb{R}$ that makes $\|[\nabla_{\theta_g} R(\theta_g, \theta_q); \nabla_{\theta_q} R(\theta_g, \theta_q)]^T\|$ upper bounded. To sum up, the balance gap (BG) is bounded as in Eq.(17).

The proposition provides an upper and a lower bound for BG, which avoids generating samples encoding too large or too small adaptability, in line with the intuition of $\mathcal{L}_b$, and offers a guarantee for the stationarity of the DFQ game. We further characterize the stationarity via the following proposition:

Proposition 2. During each iteration, the DFQ game is stationary, that is, the sample adaptability increased by the maximization process can be fully exploited by the minimization process to maximally benefit Q, when BG = 0.

Proof. As aforementioned, each iteration contains the maximization and the minimization process; then we have

$$BG = R(\theta_g^2, \theta_q^2) - R(\theta_g^1, \theta_q^1) = \underbrace{[R(\theta_g^2, \theta_q^1) - R(\theta_g^1, \theta_q^1)]}_{\Delta_g} - \underbrace{[R(\theta_g^2, \theta_q^1) - R(\theta_g^2, \theta_q^2)]}_{\Delta_q},$$

where $\Delta_g$ and $\Delta_q$ denote the deviation of the value of $R(\cdot, \cdot)$ during the maximization and the minimization process, respectively. Thus, if $\Delta_g > \Delta_q$ (i.e., BG > 0), the generated sample leads to a large disagreement between P and Q, owing to the weak learning ability of Q; if $\Delta_g < \Delta_q$ (i.e., BG < 0), the generated sample leads to a large agreement between P and Q, owing to the strong learning ability of Q. In such cases, the sample with too large or too small adaptability fails to benefit Q during the calibration process. To sum up, if and only if $\Delta_g = \Delta_q$ (i.e., BG = 0), the DFQ game is stationary, that is, the sample adaptability increased by the maximization process can be fully exploited by the minimization process to maximally benefit Q by minimizing $\mathcal{L}_Q$, which is equivalent to minimizing $R(\cdot, \cdot)$ in Eq.(6).

The proposition discloses the effectiveness of the coordination between $\mathcal{L}_{ds}$ and $\mathcal{L}_{as}$ during the maximization process. Specifically, when BG > 0 (< 0), the sample adaptability is too large (too small); therefore, in the next iteration, $\mathcal{L}_{as}$ ($\mathcal{L}_{ds}$) will dominate to generate samples with desirable adaptability by minimizing $\mathcal{L}_G$ (equivalent to maximizing $R(\cdot, \cdot)$ in Eq.(6)), so as to achieve the stationarity of the DFQ game, i.e., BG = 0; see Fig.5.

## Experimental Settings and Implementation Details

We validate AdaSG over three typical image classification datasets: CIFAR-10, CIFAR-100 (Krizhevsky 2009) and ImageNet (ILSVRC2012) (Russakovsky et al. 2015). CIFAR-10 and CIFAR-100 contain 10 and 100 classes of images, respectively; both are split into 50K training images and 10K testing images. ImageNet consists of 1.2M samples for training and 50K samples for validation with 1000 categories. For the data-free setting, only the validation sets are adopted to evaluate the performance of the quantized models (Q). In the experiments, we quantize pre-trained full-precision networks (P), including ResNet-20 for CIFAR, and ResNet-18, ResNet-50 and MobileNetV2 for ImageNet, via the following quantizer to yield Q.

**Quantizer.** Following (Xu et al. 2020; Choi et al. 2021), we quantize both full-precision (float32) weights and activations into n-bit precision via a symmetric linear quantization method based on (Jacob et al. 2018):

$$\theta_q = \mathrm{round}\left((2^n - 1) \frac{\theta - \theta_{min}}{\theta_{max} - \theta_{min}} - 2^{n-1}\right), \qquad (22)$$

where $\theta$ and $\theta_q$ are the full-precision and the quantized value, $\mathrm{round}(\cdot)$ returns the nearest integer to its input, and $\theta_{min}$ and $\theta_{max}$ are the minimum and maximum of $\theta$.
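For illustration, a minimal PyTorch sketch of the quantizer in Eq.(22) is given below, assuming per-tensor min/max statistics; the de-quantization step (used for simulated quantization in the forward pass) simply inverts the same mapping.

```python
import torch

def linear_quantize(theta: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Sketch of Eq.(22): map a full-precision tensor to n-bit integer levels."""
    theta_min, theta_max = theta.min(), theta.max()
    scale = (2 ** n_bits - 1) / (theta_max - theta_min)
    return torch.round(scale * (theta - theta_min) - 2 ** (n_bits - 1))   # Eq.(22)

def linear_dequantize(theta_q: torch.Tensor, theta: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Inverse mapping (assumed simulated-quantization usage, with the same min/max)."""
    theta_min, theta_max = theta.min(), theta.max()
    scale = (2 ** n_bits - 1) / (theta_max - theta_min)
    return (theta_q + 2 ** (n_bits - 1)) / scale + theta_min
```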
Regarding the bit width n, we select n = {3, 4, 5}, which is representative of the low-bit and high-bit cases. In particular, 3-bit quantization leads to a huge accuracy loss, which is a major challenge for the existing DFQ methods, while 5-bit or higher-bit quantization produces a small performance loss and is selected to validate the generalization ability. For the maximization process, we construct the architecture of the generator (G) following ACGAN (Odena, Olah, and Shlens 2017), while P and Q play the role of the discriminator; G is trained with the loss function in Eq.(14) using Adam (Kingma and Ba 2014) as the optimizer with a momentum of 0.9 and a learning rate of 1e-3. For the minimization process, Q is optimized with the loss function in Eq.(15), where SGD with Nesterov momentum (Nesterov 1983) is adopted as the optimizer with a momentum of 0.9 and a weight decay of 1e-4. For CIFAR, the learning rate is initialized to 1e-4 and decayed by 0.1 every 100 epochs, while on ImageNet it is 1e-5 (1e-4 for ResNet-50) and divided by 10 at epoch 350 (at epochs 200 and 300 for ResNet-50). The generator and the quantized model are trained for 400 epochs in total. The batch size is set to 16. For the hyperparameters, α, β and γ in Eq.(14) and λl and λu in Eq.(12) are empirically set to 0.1, 1, 1, 0.3 and 0.8, respectively. All experiments are implemented with PyTorch (Paszke et al. 2019) based on the code of GDFQ (Xu et al. 2020) and run on an NVIDIA GeForce GTX 1080 Ti GPU and an Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz. To validate how AdaSG generates samples with desirable adaptability to maximally benefit Q, we empirically verify why AdaSG works, and provide comparisons with the state-of-the-arts, an ablation study, as well as a visual analysis.

## Why does AdaSG Work?

We experimentally verify the core idea of AdaSG: generating samples with desirable adaptability to recover the performance of Q with varied bit widths. We perform the data-free quantization experiments with ResNet-18 (3-bit and 5-bit precision) serving as both P and Q on ImageNet. Fig.6(a1)(b1) illustrates that the disagreement ($H_{info}(p_{ds})$ in Eq.(3)) between P and Q for AdaSG fluctuates stably within a small range compared to GDFQ (Xu et al. 2020) and Qimera (Choi et al. 2021), confirming that the samples with desirable adaptability generated by AdaSG are fully exploited to benefit Q, and that the bound constraint loss (Eq.(12)) can avoid generating samples with too large or too small adaptability, which would lead to an excessively large disagreement or agreement. Fig.6(a2)(b2) reveals that the balance gap (Definition 3) stays on par with zero during the iteration process, unlike GDFQ and Qimera (where either BG > 0 or BG < 0), which confirms that AdaSG achieves the stationarity of the DFQ game and that the coordination between $\mathcal{L}_{ds}$ (Eq.(8)) and $\mathcal{L}_{as}$ (Eq.(10)) can generate samples with desirable adaptability to maximally benefit Q.

## Comparison with State-of-the-arts

To verify the superiority of AdaSG, we compare it with typical DFQ approaches, including: GDFQ (Xu et al. 2020), ARC (Zhu et al. 2021) and Qimera (Choi et al. 2021), which reconstruct the original data from P; ZAQ (Liu, Zhang, and Wang 2021), which focuses primarily on adversarial sample generation rather than the adversarial game process of AdaSG; IntraQ (Zhong et al. 2022), which optimizes the noise to obtain fake samples without a generator; and AIT (Choi et al.
2022), which focuses on improving the loss function and manipulating the gradients for ARC to generate better samples, denoted as ARC+AIT.

Figure 6: Illustration of AdaSG on generating samples with desirable adaptability to Q under (a) 3-bit and (b) 5-bit precision. (a1)(b1) Disagreement (i.e., $H_{info}(p_{ds})$) between P and Q during the generation and calibration process. (a2)(b2) Balance gap (BG).

Table 1 summarizes the following observations: 1) AdaSG offers a significant and consistent performance gain over the state-of-the-arts, in line with our purpose of generating samples with desirable adaptability to maximally benefit Q. Impressively, AdaSG achieves at most 9.71%, 6.63% and 35.87% accuracy gains on CIFAR-10, CIFAR-100 and ImageNet, respectively. In particular, compared with GDFQ, ARC and Qimera, where Q is independent of the generation process, AdaSG obtains an accuracy improvement by a large margin, e.g., at least a 0.3% gain (ResNet-20 with 5w5a on CIFAR-10), confirming the necessity of focusing on the sample adaptability to Q. ZAQ suffers from a large performance gap compared to AdaSG, since many unexpected samples are generated without considering the sample adaptability, which are harmful to calibrating Q. AdaSG also shows obvious advantages over AIT despite its combination with ARC. 2) AdaSG achieves substantial gains for Q with varied bit widths, confirming the desirable adaptability of our generated samples to varied Q. Note that, for the 3-bit case, most of the existing methods suffer from poor accuracy or even fail to converge, while AdaSG obtains at most 35.87% (ResNet-18 with 3w3a) and at least 4.21% (ResNet-20 with 3w3a on CIFAR-100) performance gains.

| Dataset | Model (FP.) | Bit width | ZAQ (CVPR 2021) | IntraQ (CVPR 2022) | ARC+AIT (CVPR 2022) | GDFQ (ECCV 2020) | ARC (IJCAI 2021) | Qimera (NeurIPS 2021) | AdaSG (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | ResNet-20 (93.89) | 3w3a | - | 77.07 | - | 75.11 | - | 74.43 | **84.14** |
| | | 4w4a | **92.13** | 91.49 | 90.49 | 90.11 | 88.55 | 91.26 | 92.10 |
| | | 5w5a | 93.36 | - | 92.98 | 93.38 | 92.88 | 93.46 | **93.76** |
| CIFAR-100 | ResNet-20 (70.33) | 3w3a | - | 48.25 | - | 47.61 | - | 46.13 | **52.76** |
| | | 4w4a | 60.42 | 64.98 | 61.05 | 63.75 | 62.76 | 65.10 | **66.42** |
| | | 5w5a | 68.70 | - | 68.40 | 67.52 | 68.40 | 69.02 | **69.42** |
| ImageNet | ResNet-18 (71.47) | 3w3a | - | - | - | 20.23 | 23.37 | 1.17 | **37.04** |
| | | 4w4a | 52.64 | 66.47 | 65.73 | 60.60 | 61.32 | 63.84 | **66.50** |
| | | 5w5a | 64.54 | 69.94 | 70.28 | 68.49 | 68.88 | 69.29 | **70.29** |
| ImageNet | MobileNetV2 (73.03) | 3w3a | - | - | - | 1.46 | 14.30 | - | **26.90** |
| | | 4w4a | 0.10 | 65.10 | **66.47** | 59.43 | 60.13 | 61.62 | 65.15 |
| | | 5w5a | 62.35 | 71.28 | **71.96** | 68.11 | 68.40 | 70.45 | 71.61 |
| ImageNet | ResNet-50 (77.73) | 3w3a | - | - | - | 0.31 | 1.63 | - | **16.98** |
| | | 4w4a | 53.02 | - | 68.27 | 54.16 | 64.37 | 66.25 | **68.58** |
| | | 5w5a | 73.38 | - | 76.00 | 71.63 | 74.13 | 75.32 | **76.03** |

Table 1: Accuracy (%) comparison with the state-of-the-arts on CIFAR-10, CIFAR-100 and ImageNet. Some baseline results are reproduced with author-provided code. -: no results are reported. nwna indicates that the weights and activations are quantized to n bits. FP.: the accuracy of the full-precision model. The best result in each row is in boldface.

Table 2: Ablation study on the components of AdaSG ($\mathcal{L}_{ds}$, $\mathcal{L}_{as}$, $\mathcal{L}_b$, $\mathcal{L}_{BNS}$) with ResNet-18 (full-precision accuracy 71.47) on ImageNet; nwna indicates that the weights and activations are quantized to n bits. Each row enables a different subset of the loss terms; the reported accuracy (%) pairs (3w3a / 5w5a) are 14.86 / 69.15, 26.41 / 69.81, 32.75 / 70.13, 15.01 / 70.06, 20.98 / 67.35, and 37.04 / 70.29 for the full AdaSG (best).

## Ablation Study

**Validating adaptability with disagreement and agreement samples.** As aforementioned, the disagreement and agreement samples play a critical role in the sample adaptability to Q, serving as the pivot between the maximization and the minimization process during the zero-sum game.
We conduct the ablation study on $\mathcal{L}_{ds}$ (Eq.(8)) and $\mathcal{L}_{as}$ (Eq.(10)) over ImageNet. Table 2 shows the clear superiority (37.04% and 70.29%) of AdaSG (including both losses) over the other cases. It is worth noting that removing either or both of $\mathcal{L}_{ds}$ and $\mathcal{L}_{as}$ incurs a large performance degradation (at most 22.18% and 1.14%), supporting the intuition behind the cooperation between $\mathcal{L}_{ds}$ and $\mathcal{L}_{as}$. Interestingly, the case without $\mathcal{L}_b$ (Eq.(12)) incurs the smallest accuracy loss (4.29% and 0.16%), confirming the complementary importance of $\mathcal{L}_b$ on top of $\mathcal{L}_{ds}$ and $\mathcal{L}_{as}$.

## Visualization of Generated Samples

To further show the desirable adaptability of the samples generated by AdaSG to Q, we conduct a visual analysis over MobileNetV2 serving as both P and Q on ImageNet via the similarity matrix (each element is obtained by computing the $\ell_1$ norm between the probability distributions $p_{ds}$ of every two samples), along with the visualization of generated samples in Fig.7.

Figure 7: (a) The similarity comparison between the generated samples. (b) Visualization of the generated samples, where each row denotes one of 8 randomly chosen classes from ImageNet.

Fig.7(a) illustrates that the samples generated by AdaSG possess a much larger similarity (the darker, the larger) than those by GDFQ, implying that the samples generated by GDFQ vary greatly, i.e., a lot of samples with undesirable adaptability exist, in contrast to AdaSG. Fig.7(b) shows that the samples generated for 3-bit and 5-bit precision vary greatly, while the samples from different categories also differ greatly from each other, verifying that the samples possess desirable adaptability to varied Q, given that the category (label) information is fully exploited.

## Conclusion

In this paper, we rethink the data-free quantization process as a zero-sum game between two players, a generator and a quantized network, and further develop an Adaptability-aware Sample Generation (AdaSG) method, which features a dynamic maximization-vs-minimization game process anchored on sample adaptability. The maximization process generates samples with desirable adaptability, which is further reduced by the minimization process after recovering Q. The Balance Gap is defined to achieve the stationarity of the zero-sum game. The theoretical analysis and empirical studies validate the advantages of AdaSG over the existing arts.

## Acknowledgments

This work is supported by the National Natural Science Foundation of China under grants U21A20470, 62172136, 72188101 and U1936217, and the Key Research and Technology Development Projects of Anhui Province (no. 202004a5020043).

## References

Cai, Y.; Yao, Z.; Dong, Z.; Gholami, A.; Mahoney, M. W.; and Keutzer, K. 2020. ZeroQ: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13169–13178.
Cardoso, A. R.; Abernethy, J.; Wang, H.; and Xu, H. 2019. Competing against Nash equilibria in adversarially changing zero-sum games. In International Conference on Machine Learning, 921–930. PMLR.
Choi, K.; Hong, D.; Park, N.; Kim, Y.; and Lee, J. 2021. Qimera: Data-free Quantization with Synthetic Boundary Supporting Samples. Advances in Neural Information Processing Systems, 34.
Choi, K.; Lee, H. Y.; Hong, D.; Yu, J.; Park, N.; Kim, Y.; and Lee, J. 2022. It's All In the Teacher: Zero-Shot Quantization Brought Closer to the Teacher. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8311–8321.
Han, S.; Mao, H.; and Dally, W. J. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. In NIPS.
Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; and Kalenichenko, D. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2704–2713.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84–90.
Li, J.; Ren, T.; Yan, D.; Su, H.; and Zhu, J. 2022. Policy learning for robust Markov decision process with a mismatched generative model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 7417–7425.
Lim, J. H.; and Ye, J. C. 2017. Geometric GAN. arXiv preprint arXiv:1705.02894.
Lin, D.; Talathi, S.; and Annapureddy, S. 2016. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, 2849–2858. PMLR.
Liu, Y.; Zhang, W.; and Wang, J. 2021. Zero-shot adversarial quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1512–1521.
Nesterov, Y. E. 1983. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, 543–547.
Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, 2642–2651. PMLR.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
Qian, B.; Wang, Y.; Hong, R.; Wang, M.; and Shao, L. 2021. Diversifying inference path selection: Moving-Mobile-Network for landmark recognition. IEEE Transactions on Image Processing, 30: 4894–4904.
Qian, B.; Wang, Y.; Yin, H.; Hong, R.; and Wang, M. 2022. Switchable Online Knowledge Distillation. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI, 449–466.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211–252.
Shoham, Y.; and Leyton-Brown, K. 2008. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press.
v. Neumann, J. 1928. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1): 295–320.
Wang, Y. 2021. Survey on deep multi-modal data analytics: Collaboration, rivalry, and fusion. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(1s): 1–25.
Xu, S.; Li, H.; Zhuang, B.; Liu, J.; Cao, J.; Liang, C.; and Tan, M. 2020. Generative low-bitwidth data free quantization. In European Conference on Computer Vision, 1–17. Springer.
Zhang, M.; Zhao, P.; Luo, H.; and Zhou, Z.-H. 2022. No-regret learning in time-varying zero-sum games. In International Conference on Machine Learning, 26772–26808. PMLR.
Zhang, X.; Qin, H.; Ding, Y.; Gong, R.; Yan, Q.; Tao, R.; Li, Y.; Yu, F.; and Liu, X. 2021. Diversifying sample generation for accurate data-free quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15658–15667.
Zhong, Y.; Lin, M.; Nan, G.; Liu, J.; Zhang, B.; Tian, Y.; and Ji, R. 2022. IntraQ: Learning Synthetic Images with Intra-Class Heterogeneity for Zero-Shot Network Quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12339–12348.
Zhu, B.; Hofstee, P.; Peltenburg, J.; Lee, J.; and Al-Ars, Z. 2021. AutoReCon: Neural Architecture Search-based Reconstruction for Data-free Compression. In International Joint Conference on Artificial Intelligence.