# On the Generalization of Feature Incremental Learning

Chao Xu¹, Xijia Tang¹, Lijun Zhang² and Chenping Hou¹
¹College of Science, National University of Defense Technology, Changsha, 410073, China
²Nanjing University, Nanjing, China
{xcnudt, TXJnudt, hcpnudt}@hotmail.com, zljzju@gmail.com

Abstract

In many real applications, the data attributes are incremental and the samples are stored with gradually accumulated feature spaces. Although there are several elegant approaches to tackling this problem, the theoretical analysis is still limited. There exist at least two fundamental challenges and questions. 1) How to derive the generalization bounds of these approaches? 2) Under what conditions do these approaches have a strong generalization guarantee? To solve these crucial but rarely studied problems, we provide a comprehensive theoretical analysis in this paper. We begin by summarizing and refining four strategies for addressing feature incremental data. Subsequently, we derive their generalization bounds, providing rigorous and quantitative insights. The theoretical findings highlight the key factors influencing the generalization abilities of the different strategies. In tackling the above two fundamental problems, we also provide valuable guidance for exploring other learning challenges in dynamic environments. Finally, the comprehensive experimental and theoretical results mutually validate each other, underscoring the reliability of our conclusions.

1 Introduction

In recent years, with the vast utilization of Machine Learning (ML) methods in many different applications, Statistical Learning Theory [Weston, 2013; de Mello and Ponti, 2018], which reveals the laws of machine learning, has attracted more and more attention.
Among the various research directions in Statistical Learning Theory, the generalization bound [Xu and Zeevi, 2020] plays an important role, since it directly measures the generalization ability of a model. Due to its importance, traditional generalization theories have achieved many solid theoretical results in supervised learning [Lei et al., 2019; Morvant et al., 2012; Antos et al., 2002; Li et al., 2022], semi-supervised learning [El-Yaniv and Pechyony, 2007; Liu and Chen, 2018; Das et al., 2013; He et al., 2021] and unsupervised learning [Li et al., 2019; Li and Liu, 2021; Downey et al., 2010]. Beyond the above learning paradigms, we may face more complicated scenarios. In many dynamic environment applications, the data are usually accumulated over time and collected from open and dynamic environments. Thus, the data attributes (features) are incremental, and the samples are stored with gradually accumulated feature spaces. For instance, when we deploy sensors in an ecosystem to collect data, the signal returned from each sensor corresponds to a feature (old feature). With advancements in observation techniques and sensor technology, new sensors are continuously integrated, generating additional signals (new features) and progressively expanding the data feature spaces. This underscores the critical importance of developing learning systems that can effectively adapt to dynamic and evolving environments [Dietterich, 2017]. In this scenario, as shown in Figure 1, the data collection procedure is divided into two stages, i.e., the previous and current stages. The corresponding feature spaces include the old feature space $\mathcal{X}_p$, with $X_i^{(1)} \in \mathcal{X}_p$, $i = 1, 2$, and the new feature space $\mathcal{X}_c$, with $[X_2^{(1)}, X_2^{(2)}] \in \mathcal{X}_c$.
Under such a data background, feature incremental learning [Yang et al., 2022; Gu et al., 2022; Sadreddin and Sadaoui, 2021] has attracted wide attention and inspired a lot of excellent works [Ye et al., 2018; Hou and Zhou, 2018; Xu et al., 2016; Hou et al., 2019; Zhang et al., 2020; Hou et al., 2021]. While existing feature-incremental learning approaches have demonstrated satisfactory performance in various applications, a comprehensive generalization analysis remains absent. This gap arises from two main challenges: 1) The fundamental i.i.d. assumption of traditional learning is violated in feature-incremental scenarios, rendering conventional generalization theories inapplicable. 2) The diverse strategies employed by different approaches, such as model reuse [Ye et al., 2018] and data tailoring [Hou and Zhou, 2018], complicate unified analysis. Consequently, although these algorithms are intuitively reasonable, their rigorous theoretical underpinning is still limited. To gain a deep understanding of the workings of this complex machine-learning scenario, we focus on two fundamental questions. 1) How to derive the generalization bounds of these feature incremental learning approaches? 2) Under what conditions do these approaches have a strong generalization guarantee? Aiming at these critical but rarely studied fundamental problems, we begin by summarizing and refining four strategies, i.e., feature tailoring, data adaption, model reuse, and data reconstruction.

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)

[Figure 1: Illustration of incremental features in a dynamic environment. The previous stage observes only the old features; the current stage observes both the old features $X_2^{(1)}$ and the new features $X_2^{(2)}$, while $X_1^{(2)}$ remains unobserved.]

Subsequently, we derive their generalization
bounds, providing rigorous and quantitative insights. As Figure 1 illustrates, in the ecosystem monitoring task the data returned by the new sensors (new-type features) accumulate alongside the previous sensor data (old-type features). The theoretical results reveal that the generalization ability of feature tailoring depends primarily on the predictive power of the old features, while data adaption is influenced by the sample size and the richness of information contained in the current sample. For model reuse, the quality of the pre-trained model plays a critical role, and for data reconstruction, the reconstruction distribution discrepancy is the key determinant. Beyond these theoretical findings, we validate their correctness through numerical comparisons. The main contributions of this paper are summarized as follows.

- We provide a comprehensive generalization analysis of four model design strategies for the feature incremental scenario, deriving their generalization bounds to offer practical guidance for model design.
- We advance rigorous and quantitative comparisons by analyzing the generalization ability of these strategies. Theoretical analysis identifies the key factors influencing the tightness of their generalization bounds.
- Comprehensive experimental results corroborate the theoretical findings, enhancing their reliability and demonstrating the feasibility of applying these theoretical insights to model design.

2.1 Notations

According to the scenario of this article shown in Figure 1, the data collection process is divided into two stages, i.e., the previous and current stages. The corresponding feature spaces are the previous feature space $\mathcal{X}_p \subseteq \mathbb{R}^{d_1}$ and the current feature space $\mathcal{X}_c \subseteq \mathbb{R}^{d_1+d_2}$. Similarly, we denote the label spaces of the two stages by $\mathcal{Y}_p$ and $\mathcal{Y}_c$, respectively. Since the classification task and the label space are invariant, we consider $\mathcal{Y}_p$ and $\mathcal{Y}_c$ to be identically distributed and denote them as $\mathcal{Y} = \{+1, -1\}$.
Specifically, we denote the old features of the previous stage as $X_1^{(1)}$, the unobserved augmented features as $X_1^{(2)}$, and the label as $y_p$. In the current stage, the features shared by the two stages are denoted as $X_2^{(1)}$, the augmented features as $X_2^{(2)}$, and the label as $y_c$. Here, $X_i^{(1)} \in \mathcal{X}_p$, $i = 1, 2$, and $X_2 = [X_2^{(1)}, X_2^{(2)}] \in \mathcal{X}_c$. Without loss of generality, we assume that the data samples in the current stage are global observations and obey the distribution $\mathcal{D}_c$ over $\mathcal{X}_c \times \mathcal{Y}$, while the data points in the previous stage are local observations and obey the distribution $\mathcal{D}_p$ over $\mathcal{X}_p \times \mathcal{Y}$. To simplify the presentation, we denote by $S_p = (X_1^{(1)}, y_p)$ the samples of the previous stage of size $n_1$, by $S_{cp} = (X_2^{(1)}, y_c)$ the current data of size $n_2$ that fall into the old feature space, and by $S_c = (X_2, y_c)$ the samples of the current stage of size $n_2$.

2.2 Formulations

Firstly, we summarize and refine four strategies for addressing the feature incremental problem, i.e., feature tailoring, data adaption, model reuse, and data reconstruction, based on their data application modes. The first two serve as baselines, which transform the feature incremental learning problem into a traditional learning problem. Model reuse inherits pre-trained models from previous stages by imposing consistency constraints on the corresponding local models. Finally, data reconstruction utilizes existing observations to recover the unobserved features of the previous stage. Subsequently, we carry out a model analysis of the four strategies. For illustration, consider the linear classifier. Denote by $\mathcal{F}$ the hypothesis space, where each linear classifier is $f: x \mapsto w^\top x \in \mathbb{R}$. Consider a loss function $\ell: \mathbb{R} \times \mathcal{Y} \mapsto \mathbb{R}_+$ that is non-negative and Lipschitz continuous. Correspondingly, we denote the hypotheses under the four strategies as $f_i$, $i = 1, 2, 3, 4$.
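To make the two-stage notation concrete, the following NumPy sketch builds synthetic counterparts of $S_p$, $S_{cp}$, and $S_c$; the dimensions $d_1, d_2$, the sample sizes $n_1, n_2$, and the linear labeling rule are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 5, 3        # old / augmented feature dimensions (illustrative)
n1, n2 = 200, 100    # previous / current stage sample sizes (illustrative)

# A hypothetical generating process over the full current feature space X_c.
w_true = rng.normal(size=d1 + d2)
X1_full = rng.normal(size=(n1, d1 + d2))  # previous stage: only first d1 dims observed
X2 = rng.normal(size=(n2, d1 + d2))       # current stage: fully observed
y_p = np.sign(X1_full @ w_true)
y_c = np.sign(X2 @ w_true)

S_p = (X1_full[:, :d1], y_p)   # S_p:  previous samples over the old features X_1^(1)
S_cp = (X2[:, :d1], y_c)       # S_cp: current samples restricted to the old feature space
S_c = (X2, y_c)                # S_c:  current samples over old + augmented features
```

Strategy 1 below would train on `S_p` and `S_cp`, while Strategy 2 would use `S_c` alone.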
For the classification task, our goal is to learn a well-generalized classifier for the current data $S_c$, that is, to minimize the generalization error $R_{\mathcal{D}_c}(f)$. Since only empirical data are on hand, we optimize $\hat{R}(f)$ as an approximation.

Lemma 1 (Generalization Error Bound). Let $\mathcal{L}$ be the family of loss functions associated with $\mathcal{F}$, i.e., $\mathcal{L} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}\}$. Suppose the loss function is $L$-Lipschitz. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over a sample of size $m$, the following inequality holds for all $f \in \mathcal{F}$:

$$R_D(f) \le \hat{R}_m(f) + 2\mathfrak{R}_m(\mathcal{L}) + \sqrt{\frac{\log(1/\delta)}{2m}}, \tag{1}$$

where $\mathfrak{R}_m(\mathcal{L})$ is the Rademacher complexity of the loss function class $\mathcal{L}$ associated with $\mathcal{F}$, which can be bounded by using the celebrated Talagrand's lemma [Hahn and BFb, 1976]. Model design and theoretical analysis are carried out around the empirical error and the hypothesis space complexity. Inspired by Lemma 1, we derive the following optimization objective

$$\min_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \alpha H(f). \tag{2}$$

The first term is the empirical error, $H(f)$ is a regularization term used to control the complexity of the model, and $\alpha$ is a trade-off parameter.

Strategy 1 (Feature Tailoring). In the incremental feature scenario, existing classification algorithms cannot be directly applied. One approach is to tailor the features by discarding the incremental features $X_2^{(2)}$ and training the model using only the partial observations. The optimization objective for this strategy is

$$\min_{w} \frac{1}{n_1 + n_2} \sum_{(x_i, y_i) \in S_p \cup S_{cp}} \ell(f(x_i), y_i) + \alpha \|w\|_2^2, \tag{3}$$

where $w$ represents the coefficient vector corresponding to the linear classifier $f_1$.

Strategy 2 (Data Adaption). Due to the inconsistent feature space dimensions between the two stages of data, existing classification algorithms cannot be directly applied. To address this, a data adaption approach is adopted. Specifically, the data from the previous stage, $X_1^{(1)}$, are discarded.
The optimization objective is formulated as follows:

$$\min_{w} \frac{1}{n_2} \sum_{(x_i, y_i) \in S_c} \ell(f(x_i), y_i) + \alpha \|w\|_2^2. \tag{4}$$

Strategy 1 and Strategy 2 both adapt the existing observations to the algorithm; the difference lies in the way the data are tailored. Intuitively, both strategies are simple and convenient to operate, but part of the valuable observations is wasted.

Strategy 3 (Model Reuse). The core concept of model reuse is to pre-train a model $w_0$ during the previous stage and then develop an algorithm that trains the classifier for the current stage by leveraging $w_0$. Specifically, we assume that the model component $w_1$ shared between the current and previous stages remains consistent with the pre-trained model $w_0$. Based on this assumption, the optimization objective for Strategy 3 is formulated as follows:

$$\min_{w} \frac{1}{n_2} \sum_{(x_i, y_i) \in S_c} \ell(f(x_i), y_i) + \alpha \|w_2\|_2^2 + \beta \|w_1 - w_0\|_2^2. \tag{5}$$

Here, $w_1$ and $w_2$ represent the components of $w$ corresponding to $X_2^{(1)}$ and $X_2^{(2)}$, respectively, and $\alpha$ and $\beta$ are two trade-off parameters.

Strategy 4 (Data Reconstruction). The main idea of data reconstruction is to use the existing observations to reconstruct the unobserved features $X_1^{(2)}$ of the previous stage, and then use the reconstructed data as training data to train the model. Specifically, the learning task includes data reconstruction and model training. Model training is the same as in Strategy 1, but the data participating in the training are different, which will not be repeated here. Next, we mainly discuss data reconstruction. For demonstration and theoretical analysis, we simply build a reconstruction function $\Phi: \mathcal{X}_\Omega \mapsto \hat{X}_1^{(2)}$, where $\mathcal{X}_\Omega$ refers to the existing observations and $\hat{X}_1^{(2)}$ refers to the reconstructed features that are not observed in the previous stage.
In this paper, we focus on leveraging the data correlations between the two stages and the feature correlations between the old and new features within the current stage to reconstruct the unobserved features of the previous stage. Based on this, we design the following framework for the reconstruction function $\Phi$:

$$\Phi(X_\Omega) = \arg\min_{\hat{X}_1^{(2)}} \mathcal{L}(D) + \lambda \mathcal{R}(F), \tag{6}$$

where $\lambda$ is a trade-off parameter, $\mathcal{L}(D)$ represents the reconstruction loss of using data correlation to reconstruct the unobserved features, and $\mathcal{R}(F)$ represents the reconstruction loss of using feature correlation.

3 Generalization Ability Analysis

In this part, we analyze, compare, and discuss the generalization ability of the four strategies. First of all, we need to introduce some basic definitions and lemmas.

Definition 1 (Rademacher Complexity [Bartlett and Mendelson, 2001]). Given a function class $\mathcal{F}$, a function $f \in \mathcal{F}$, and a sample $Z = (Z_1, Z_2, \dots, Z_m)$ of size $m$ with $Z_i \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, the empirical Rademacher complexity of $\mathcal{F}$ with respect to the sample $Z$ is defined as

$$\hat{\mathfrak{R}}_Z(\mathcal{F}) = E_\sigma \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(Z_i) \right], \tag{7}$$

where the random variables $\sigma_i$ are called Rademacher variables, which obey the uniform distribution on $\{-1, +1\}$. The Rademacher complexity of $\mathcal{F}$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn from $\mathcal{D}$:

$$\mathfrak{R}_m(\mathcal{F}) = E_{Z \sim \mathcal{D}^m} \left[ \hat{\mathfrak{R}}_Z(\mathcal{F}) \right]. \tag{8}$$

Definition 2 (Y-Discrepancy [Mohri and Medina, 2012]). Let $\mathcal{D}_P$, $\mathcal{D}_Q$ be two distributions over $\mathcal{X}$ and denote by $y_P$, $y_Q$ the labeling functions over $\mathcal{D}_P$ and $\mathcal{D}_Q$, respectively. Given a hypothesis class $\mathcal{F}$ and the corresponding loss function $\ell$, the Y-discrepancy between $(\mathcal{D}_P, y_P)$ and $(\mathcal{D}_Q, y_Q)$ is defined as

$$\mathrm{disc}_Y(\mathcal{D}_P, \mathcal{D}_Q) = \sup_{f \in \mathcal{F}} \left| R_{\mathcal{D}_P}(f, y_P) - R_{\mathcal{D}_Q}(f, y_Q) \right|. \tag{9}$$
As we only have the empirical data on hand, we introduce weights $\alpha$ over the empirical sample $S_P$ drawn from $\mathcal{D}_P$ and define the weighted empirical risk as $\hat{R}_{S_{P\alpha}}(f, y_P) = \sum_{i=1}^{n} \alpha_i \ell(f(x_i), y_{P,i})$. The weighted empirical Y-discrepancy is then denoted by

$$\mathrm{disc}_Y(S_{P\alpha}, S_Q) = \sup_{f \in \mathcal{F}} \left| \hat{R}_{S_{P\alpha}}(f, y_P) - \hat{R}_{S_Q}(f, y_Q) \right|. \tag{10}$$

With the definition of the weighted empirical Y-discrepancy, the generalization error on $(\mathcal{D}_Q, y_Q)$ can be bounded in terms of the risk over $(\mathcal{D}_P, y_P)$ and their Y-discrepancy.

Subsequently, we carry out a generalization analysis of the four strategies. As for the baseline strategies, feature tailoring and data adaption, the generalization bounds are similar to Lemma 1. The difference between the two strategies lies in the form and amount of the participating training data. To compare the generalization bounds of the two strategies, we take a simple example as illustration. As shown in Figure 2, consider a binary classification task with $\mathcal{Y} = \{+1, -1\}$ and let $d = 2$. Using any single local feature (Feature 1 or Feature 2) does not work well for this classification task, since the trained model is underfitting. Correspondingly, the right part of Figure 2 shows the generalization error curves of the models trained with local and global features, respectively. It can be found that the key factors dominating the generalization abilities of feature tailoring and data adaption are the quality of the old features and the number of data samples, respectively. This conclusion is straightforward and intuitive, so we will not elaborate further. We next derive generalization error bounds for Strategies 2 and 3. Due to space limitations, the details of the proofs are listed in the supplementary file.

Theorem 1. Let $\mathcal{F}_2$ be the family of the hypothesis set, and denote the hypothesis returned by data adaption as $f_2$. Suppose the loss function is $L$-Lipschitz. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over a sample of size $n_2$, the following inequality holds for all $f_2 \in \mathcal{F}_2$:
$$R_{\mathcal{D}_c}(f_2) \le \hat{R}_{n_2}(f_2) + 2L \frac{\Lambda M_2}{\sqrt{n_2}} + \sqrt{\frac{\log(1/\delta)}{2 n_2}}, \tag{11}$$

where $\Lambda = \max\{\|x\| \,|\, x \in \mathcal{X}\}$ represents the radius of the feature domain, $\|w_2\| \le M_2$ with $M_2$ the radius of the linear hypothesis space, and $w_2$ the hypothesis coefficient corresponding to the linear classifier $f_2$.

[Figure 2: Illustration of the generalization ability of Strategy 1 versus Strategy 2 (horizontal axis: sample size).]

Theorem 2. Let $\mathcal{F}_3$ be the family of the hypothesis set, and denote the hypothesis returned by model reuse as $f_3$. Suppose the loss function is $L$-Lipschitz. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over a sample of size $n_2$, the following inequality holds for all $f_3 \in \mathcal{F}_3$:

$$R_{\mathcal{D}_c}(f_3) \le \hat{R}_{n_2}(f_3) + 2L \frac{\Lambda}{\sqrt{n_2}} \left( \varepsilon + \sqrt{M_3^2 - (M_0 - \varepsilon)^2} \right) + \sqrt{\frac{\log(1/\delta)}{2 n_2}}. \tag{12}$$

Here, $\Lambda = \max\{\|x\| \,|\, x \in \mathcal{X}\}$ represents the radius of the feature domain, $\|w_3\| \le M_3$ with $M_3$ the radius of the hypothesis space, and $w_3$ the hypothesis coefficient corresponding to the linear classifier $f_3$. $\|w_0\| = M_0$, where $w_0$ is the coefficient vector of the linear classifier pre-trained on $(X_1^{(1)}, y_p)$, and $\|w_3^1 - w_0\| \le \varepsilon$, where $w_3^1$ is the component of $w_3$ corresponding to $w_0$.

As shown in Theorems 1 and 2, the bounds differ from the standard generalization bound in that the Rademacher complexity is concretized using the norm constraints on the coefficients. In this way, we can compare the generalization bounds of data adaption and model reuse. Comparing the bounds of $f_2$ and $f_3$, it can be observed that the tightness of the generalization bound is largely influenced by the second term, which corresponds to the Rademacher complexity of the hypothesis space. Therefore, our analysis primarily focuses on comparing this term. In our setting, the current stage feature space retains the old features from the previous stage. Therefore, the classification coefficient $w_0$ pre-trained on the previous stage data should be inherited. It can be obtained that $w_3^1$ lies in the $\varepsilon$-neighborhood of $w_0$, that is, $\|w_3^1 - w_0\| \le \varepsilon$, with $\varepsilon \le M_0$.
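The complexity terms in Theorems 1 and 2 concretize the Rademacher complexity of norm-bounded linear classes via $\hat{\mathfrak{R}}_Z(\mathcal{F}) \le \Lambda M/\sqrt{m}$. A small Monte Carlo sketch can illustrate this numerically: for $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \le M\}$, the supremum over $w$ in Definition 1 has the closed form $\frac{M}{m}\|\sum_i \sigma_i x_i\|_2$. The data and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, M = 100, 5, 2.0                  # sample size, dimension, radius M of the class
X = rng.normal(size=(m, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x|| <= Lambda
Lam = 1.0

# Empirical Rademacher complexity of the linear class:
# sup_{||w||<=M} (1/m) sum_i sigma_i <w, x_i> = (M/m) ||sum_i sigma_i x_i||.
trials = 2000
vals = [M / m * np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X)
        for _ in range(trials)]
R_hat = float(np.mean(vals))

bound = Lam * M / np.sqrt(m)   # the Lambda * M / sqrt(m) term in Theorem 1
print(R_hat, "<=", bound)
```

Shrinking the radius $M$, which is what the reuse penalty in Strategy 3 effectively does, tightens both the estimate and the bound.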
Besides, compared to $\mathcal{F}_2$, the hypothesis space $\mathcal{F}_3$ has been constrained by the added penalty $\beta \|w_1 - w_0\|_2^2$. Thus, it is natural to assume that $M_3 \le M_2$. With these mild assumptions, we have the following corollary.

Corollary 1. Assume that $\|w_3^1 - w_0\| \le \varepsilon$, $M_3 = \rho M_0$ with $\rho > 1$, and $M_3 \le M_2$. When $\varepsilon \le \frac{\rho + 1 - \sqrt{(\rho+1)^2 - 2}}{2} M_0$, we have

$$\varepsilon + \sqrt{M_3^2 - (M_0 - \varepsilon)^2} \le M_2. \tag{13}$$

Theorem 1 and Theorem 2 give the generalization error bounds of the hypotheses trained by data adaption and model reuse, and Corollary 1 compares the two generalization upper bounds. It can be concluded that the empirical error is a good approximation of the generalization error as the training data tend to infinity. In addition, the generalization error upper bound of $f_3$ is tighter. Intuitively, model reuse reduces the size of the hypothesis space, since $w_3^1$ is restricted to the $\varepsilon$-neighborhood of $w_0$. Therefore, the key factor dominating the generalization ability of model reuse is the quality of the pre-trained model: the higher the quality of the pre-trained model, the smaller the value of $\varepsilon$ and the tighter the generalization bound.

| Strategies | Main parts | Key factors |
|---|---|---|
| Feature Tailoring | $\hat{R}_{n_1+n_2}(f_1)$ | Predictive power of the old features |
| Data Adaption | $2L \frac{\Lambda M_2}{\sqrt{n_2}}$ | Sample size $n_2$ and the contained information richness |
| Model Reuse | $2L \frac{\Lambda}{\sqrt{n_2}} \left( \varepsilon + \sqrt{M_3^2 - (M_0 - \varepsilon)^2} \right)$ | Quality of the pre-trained model |
| Data Reconstruction | $\mathrm{disc}_Y(\hat{S}_{p\alpha}, S_c)$ | Reconstruction distribution discrepancy |

Table 1: The main parts of the generalization bounds of the four strategies.

Theorem 3. Let $\mathcal{F}_4$ be the family of the hypothesis set, and denote the hypothesis returned by data reconstruction as $f_4$.
Suppose the loss function is $L$-Lipschitz. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over a sample $\hat{S}_p$ of size $n_1$ and a sample $S_c$ of size $n_2$, the following inequality holds for all $f_4 \in \mathcal{F}_4$ and any weighted empirical distribution $\mathcal{D}_{p\alpha}$ over the sample $\hat{S}_p$ and current distribution $\mathcal{D}_c$ over the sample $S_c$:

$$R_{\mathcal{D}_c}(f_4, y_c) \le \lambda \hat{R}_{\hat{S}_{p\alpha}}(f_4, y_p) + (1 - \lambda) \hat{R}_{S_c}(f_4, y_c) + \lambda \, \mathrm{disc}_Y(\hat{S}_{p\alpha}, S_c) + 2L \mathfrak{R}_{n_1+n_2}(\mathcal{F}_4) + \sqrt{\frac{\log(1/\delta)}{2(n_1 + n_2)}}. \tag{14}$$

Here, $\hat{\mathcal{D}}_p$ represents the distribution of the reconstructed previous stage data $[X_1^{(1)}, \hat{X}_1^{(2)}]$, and $\hat{S}_p$ represents the samples drawn from $\hat{\mathcal{D}}_p$. $\mathcal{D}_c$ represents the distribution of the current stage data, which is the same as the test data distribution, and $S_c$ represents the samples drawn from $\mathcal{D}_c$. $\lambda$ represents the sample ratio between $\hat{S}_p$ and $S_c$.

According to Theorem 3, the generalization ability of data reconstruction is mainly affected by the distribution discrepancy between the reconstructed data and the current stage data, which is directly related to the performance of the reconstruction function. A small distribution discrepancy $\mathrm{disc}_Y(\hat{S}_{p\alpha}, S_c)$ leads to a hypothesis with a tighter generalization error bound, and this inspires us to consider the distribution discrepancy between the reconstructed data and the observed data when designing the reconstruction function. It should be emphasized that data reconstruction is an independent and arduous task. In this paper, we present a demonstration of designing the reconstruction function based on optimal transport and feature correlation. In summary, the key factor dominating the generalization ability of data reconstruction is the reconstruction distribution discrepancy, which plays a key role in the design of reconstruction functions. Finally, Table 1 summarizes the generalization bounds of the four strategies: feature tailoring, data adaption, model reuse, and data reconstruction.
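The quantity $\mathrm{disc}_Y$ can only be estimated approximately, since it involves a supremum over $\mathcal{F}$. A crude sketch for two-dimensional linear classifiers replaces the supremum with a grid over unit-norm directions; the two samples, their labeling rules, and the hinge surrogate are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def emp_risk(w, X, y):
    """Uniformly weighted empirical hinge risk of the linear classifier w."""
    return float(np.mean(np.maximum(0.0, 1.0 - y * (X @ w))))

# S_P plays the role of the (reconstructed) previous data, S_Q of the current
# data; their labeling functions differ slightly, so disc_Y should be positive.
d, n = 2, 300
X_P, X_Q = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y_P = np.sign(X_P @ np.array([1.0, 0.2]))
y_Q = np.sign(X_Q @ np.array([1.0, 0.5]))

# Approximate sup_{f in F} |R_{S_P}(f, y_P) - R_{S_Q}(f, y_Q)| on a grid of
# unit-norm weight vectors (a stand-in for the true supremum over F).
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
disc = max(abs(emp_risk(np.array([np.cos(a), np.sin(a)]), X_P, y_P)
               - emp_risk(np.array([np.cos(a), np.sin(a)]), X_Q, y_Q))
           for a in angles)
print(disc)
```

A finer grid (or an optimizer over $w$) tightens the approximation of the supremum.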
The main parts of the generalization bounds highlight the key factors influencing generalization ability. Figure 2 compares the generalization performance of feature tailoring and data adaption: the primary factor for feature tailoring is the predictive power of the old features, while for data adaption it is the number of data samples. Corollary 1 contrasts the generalization bounds of data adaption and model reuse, identifying the quality of the pre-trained model as the critical determinant for model reuse. Theorem 3 examines the generalization ability of data reconstruction, concluding that its effectiveness hinges on the reconstruction distribution discrepancy, which is closely tied to the design of the reconstruction function.

4 Mutual Verification Experiments

In this section, soft-margin SVM [Cortes and Vapnik, 1995] and logistic regression (LR) [Berger et al., 1996] are applied as demonstrations, aiming to form a mutual verification between experiments and theory. Due to space limitations, the optimization objectives and detailed implementation information of the four strategies are provided in the supplementary materials.

4.1 Datasets and Setting

We adopt 8 datasets from the UCI Repository¹ and the LIBSVM Library² to carry out the experiments. As in the feature incremental scenario considered in this paper, the data collection process is divided into two stages. Specifically, the global features of the existing data are denoted as $X = [X_1, \dots, X_{d_1+d_2}] \in \mathbb{R}^{d_1+d_2}$, where $X_i$, $i = 1, 2, \dots, d_1+d_2$, is the $i$-th feature. Let $[X_1, \dots, X_{d_1}] \in \mathbb{R}^{d_1}$ be the previous features and $[X_1, \dots, X_{d_1+d_2}] \in \mathbb{R}^{d_1+d_2}$ be the current features. Furthermore, to obtain data that conform to the scenario of this paper, we tailor the existing binary classification data. Without loss of generality, we let $d_1 = d_2$, and the previous stage contains half as many samples as the current stage, i.e., $n_1 = n_2/2$.
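As a runnable illustration of how the four strategies consume the two-stage data, the sketch below trains ridge-regression surrogates (a squared-loss stand-in for the SVM/LR models used in the experiments) on synthetic data; the reconstruction step of Strategy 4 is a deliberately naive least-squares imputation, and all sizes and parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2, n1, n2 = 4, 2, 300, 150
w_true = rng.normal(size=d1 + d2)
Xp_full = rng.normal(size=(n1, d1 + d2))
Xc = rng.normal(size=(n2, d1 + d2))
yp, yc = np.sign(Xp_full @ w_true), np.sign(Xc @ w_true)
Xp = Xp_full[:, :d1]             # only the old features are observed previously

def ridge(X, y, alpha):
    """Closed-form ridge regression used as a linear classifier surrogate."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

alpha, beta = 1.0, 10.0

# Strategy 1 (feature tailoring): drop the new features, train on S_p and S_cp.
w1 = ridge(np.vstack([Xp, Xc[:, :d1]]), np.concatenate([yp, yc]), alpha)

# Strategy 2 (data adaption): drop the previous-stage data, train on S_c.
w2 = ridge(Xc, yc, alpha)

# Strategy 3 (model reuse): penalize deviation of the shared block from the
# pre-trained w0 (beta on the old-feature block, alpha on the new block).
w0 = ridge(Xp, yp, alpha)
A = Xc.T @ Xc + alpha * np.eye(d1 + d2)
A[:d1, :d1] += (beta - alpha) * np.eye(d1)
w3 = np.linalg.solve(A, Xc.T @ yc + beta * np.concatenate([w0, np.zeros(d2)]))

# Strategy 4 (data reconstruction): regress new features on old ones within
# the current stage, impute the unobserved previous block, then train on all.
B = np.linalg.lstsq(Xc[:, :d1], Xc[:, d1:], rcond=None)[0]
Xp_hat = np.hstack([Xp, Xp @ B])
w4 = ridge(np.vstack([Xp_hat, Xc]), np.concatenate([yp, yc]), alpha)

X_te = rng.normal(size=(1000, d1 + d2))
y_te = np.sign(X_te @ w_true)
acc = lambda w: float(np.mean(np.sign(X_te @ w) == y_te))
results = {"tailoring": acc(np.concatenate([w1, np.zeros(d2)])),
           "adaption": acc(w2), "reuse": acc(w3), "reconstruction": acc(w4)}
print(results)
```

The tailored model is evaluated with zeros on the new-feature coordinates, which is equivalent to ignoring those features at test time.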
As for the parameter selection of the algorithms, we conduct K-fold cross-validation on the training set. Specifically, we use grid search to obtain the optimal parameter combination, and the search range of each parameter is $\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\}$. Finally, three common metrics, Accuracy, AUC, and F1-score, are used to evaluate model performance; the higher the values of these metrics, the better the performance of the algorithms. Additionally, we also conducted experiments on several real-world multi-view datasets. Due to space constraints, these experimental results are presented in the supplementary materials.

¹http://archive.ics.uci.edu/ml
²http://www.csie.ntu.edu.tw/~cjlin/libsvm

| Dataset | Strategy | Acc | AUC | F1-Score |
|---|---|---|---|---|
| | Feature Tailoring | 0.6706 ± 0.0046 | 0.6854 ± 0.0047 | 0.7575 ± 0.0038 |
| | Data Adaption | 0.7013 ± 0.0029 | 0.7326 ± 0.0043 | 0.7748 ± 0.0035 |
| | Model Reuse | 0.7226 ± 0.0031 | 0.7431 ± 0.0042 | 0.8236 ± 0.0025 |
| | Data Reconstruction | **0.7422 ± 0.0032** | **0.7561 ± 0.0022** | **0.8431 ± 0.0041** |
| | Feature Tailoring | 0.6083 ± 0.0034 | 0.5325 ± 0.0046 | 0.5657 ± 0.0056 |
| | Data Adaption | 0.6211 ± 0.0056 | 0.5256 ± 0.0032 | 0.6012 ± 0.0035 |
| | Model Reuse | **0.6647 ± 0.0034** | **0.5494 ± 0.0043** | **0.6674 ± 0.0046** |
| | Data Reconstruction | 0.5816 ± 0.0036 | 0.5005 ± 0.0042 | 0.5544 ± 0.0035 |
| | Feature Tailoring | 0.5168 ± 0.0053 | 0.5971 ± 0.0042 | 0.6483 ± 0.0035 |
| | Data Adaption | 0.6734 ± 0.0042 | 0.6544 ± 0.0026 | 0.7637 ± 0.0035 |
| | Model Reuse | **0.7013 ± 0.0038** | **0.6636 ± 0.0032** | **0.8019 ± 0.0045** |
| | Data Reconstruction | 0.5236 ± 0.0041 | 0.6037 ± 0.0032 | 0.6510 ± 0.0065 |
| | Feature Tailoring | 0.7060 ± 0.0036 | 0.4736 ± 0.0032 | 0.4968 ± 0.0025 |
| | Data Adaption | 0.7386 ± 0.0041 | 0.6034 ± 0.0032 | 0.5477 ± 0.0026 |
| | Model Reuse | **0.7642 ± 0.0051** | **0.6833 ± 0.0062** | **0.5733 ± 0.0045** |
| | Data Reconstruction | 0.6833 ± 0.0034 | 0.4621 ± 0.0022 | 0.4708 ± 0.0041 |
| | Feature Tailoring | 0.7437 ± 0.0026 | 0.5393 ± 0.0024 | 0.6976 ± 0.0035 |
| | Data Adaption | 0.8133 ± 0.0043 | 0.7864 ± 0.0032 | 0.8196 ± 0.0036 |
| | Model Reuse | **0.8300 ± 0.0069** | **0.8163 ± 0.0032** | **0.8243 ± 0.0046** |
| | Data Reconstruction | 0.7755 ± 0.0040 | 0.7495 ± 0.0069 | 0.7685 ± 0.0045 |
| | Feature Tailoring | 0.7060 ± 0.0043 | 0.6819 ± 0.0043 | 0.7818 ± 0.0048 |
| | Data Adaption | 0.7471 ± 0.0054 | 0.6842 ± 0.0054 | 0.8548 ± 0.0034 |
| | Model Reuse | **0.7880 ± 0.0039** | **0.7134 ± 0.0039** | **0.8609 ± 0.0032** |
| | Data Reconstruction | 0.7123 ± 0.0047 | 0.6903 ± 0.0047 | 0.8234 ± 0.0045 |
| | Feature Tailoring | 0.7627 ± 0.0034 | 0.7585 ± 0.0042 | 0.6851 ± 0.0041 |
| | Data Adaption | 0.7827 ± 0.0044 | 0.7735 ± 0.0032 | 0.7021 ± 0.0034 |
| | Model Reuse | **0.8027 ± 0.0051** | **0.7923 ± 0.0034** | **0.7321 ± 0.0024** |
| | Data Reconstruction | 0.7717 ± 0.0034 | 0.7635 ± 0.0046 | 0.6831 ± 0.0046 |
| | Feature Tailoring | 0.6954 ± 0.0041 | 0.5771 ± 0.0032 | 0.7126 ± 0.0035 |
| | Data Adaption | 0.7064 ± 0.0038 | 0.6023 ± 0.0031 | 0.7356 ± 0.0039 |
| | Model Reuse | **0.7664 ± 0.0048** | **0.6634 ± 0.0054** | **0.7826 ± 0.0035** |
| | Data Reconstruction | 0.6864 ± 0.0038 | 0.5823 ± 0.0032 | 0.7156 ± 0.0043 |

Table 2: Comparative experiment results (mean ± std) of the SVM model under the four strategies. The best results on each dataset are bolded.

[Figure 3: Comparative experiment results (mean ± std) of the LR model under the four strategies; panel (b): F1-score.]

[Figure 4: Incremental process experiment results (mean ± std) of the two models under the four strategies; horizontal axis: the number of incremental features; curves: LR under Strategies 1 to 4; panel (a): covtype.]

4.2 Comparative Experiment

In this subsection, we conduct experiments to verify the validity of the above theoretical conclusions. The comparative experiment results are shown in Table 2 and Figure 3. According to the experimental results, data adaption generally performs better than feature tailoring, which is consistent with the intuition that the model obtained from feature tailoring could be underfitting.
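The grid search over $\{10^{-3}, \dots, 10^{3}\}$ with K-fold cross-validation described in Section 4.1 can be sketched as follows; the ridge surrogate, the synthetic data, and $K = 5$ are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 5))
y = np.sign(X @ rng.normal(size=5))

def ridge_fit(X, y, alpha):
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def kfold_acc(X, y, alpha, K=5):
    """Mean validation accuracy of a given alpha over K contiguous folds."""
    folds = np.array_split(np.arange(len(y)), K)
    accs = []
    for k in range(K):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[tr], y[tr], alpha)
        accs.append(np.mean(np.sign(X[te] @ w) == y[te]))
    return float(np.mean(accs))

grid = [10.0 ** p for p in range(-3, 4)]    # {1e-3, ..., 1e3}
best_alpha = max(grid, key=lambda a: kfold_acc(X, y, a))
print(best_alpha)
```

For several parameters (e.g., $\alpha$ and $\beta$ in Strategy 3), the same loop runs over the Cartesian product of the grids.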
In addition, the model performance of model reuse has certain advantages over data adaption, which is consistent with the conclusion of Corollary 1, i.e., the generalization error bound of model reuse is tighter than that of data adaption. As for data reconstruction, the performance of the model is unstable and generally worse than model reuse, since the reconstruction function cannot effectively reduce the distribution discrepancy between the reconstructed data and the observations without prior distribution information about the data. In particular, data reconstruction performs well when the underlying data distribution matches the reconstruction function relatively well, as for the dataset ionosphere. That is to say, data reconstruction can achieve good generalization ability only when the distribution discrepancy between the reconstructed data and the observations is sufficiently small, which requires incorporating more prior distribution information into the construction of the reconstruction function.

4.3 Numerical Verification Experiment

In order to verify Corollary 1 numerically, we calculate the value of each variable in Corollary 1 and combine them with its condition for comparative verification. Specifically, we further analyze the $\varepsilon$-neighborhood. Denote by $f^*$ the empirical optimal classifier trained by Strategy 3, and by $w^*$ the corresponding coefficient. Let $\hat{w} = [w_0, w_2^*]$ be the coefficient corresponding to $\hat{f}$. Due to the optimality of the empirical objective at $w^*$, we have

$$\hat{R}_{n_2}(f^*) + \alpha \|w_2^*\|_F^2 + \beta \|w_1^* - w_0\|_F^2 \le \hat{R}_{n_2}(\hat{f}) + \alpha \|w_2^*\|_F^2 + \beta \|w_0 - w_0\|_F^2, \tag{15}$$

and then

$$\|w_1^* - w_0\|_F^2 \le \frac{1}{\beta} \left( \hat{R}_{n_2}(\hat{f}) - \hat{R}_{n_2}(f^*) \right), \tag{16}$$

so we may take $\varepsilon = \sqrt{\frac{1}{\beta}\left(\hat{R}_{n_2}(\hat{f}) - \hat{R}_{n_2}(f^*)\right)}$. To check the condition $\varepsilon \le \frac{\rho + 1 - \sqrt{(\rho+1)^2 - 2}}{2} M_0$, we compare the numerical values of $\varepsilon$ and $M_0$ produced by the experiments.

[Figure 5: The numerical values of $\varepsilon$ and $M_0$ produced by the experiments on 8 datasets (Ion, Ger, Ivs, kr_V, Cle, Lac, Cov, Hea).]
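A minimal numeric sketch of this check: given the two empirical risks and $\beta$, Eq. (16) yields a radius $\varepsilon$, which is then compared against the threshold of Corollary 1. The risk values, $\beta$, $\rho$, and $M_0$ below are hypothetical placeholders, not measured values from the paper.

```python
import numpy as np

# Hypothetical empirical risks of the empirically optimal f* and of f_hat
# (whose shared block is clamped to w0), and the trade-off parameter beta.
R_fstar, R_fhat, beta = 0.180, 0.215, 10.0

# Eq. (16): ||w1* - w0||_F^2 <= (R_hat(f_hat) - R_hat(f*)) / beta,
# so a valid neighborhood radius is:
eps = float(np.sqrt((R_fhat - R_fstar) / beta))

# Threshold of Corollary 1 for a given rho = M3 / M0.
rho, M0 = 1.1, 1.0
threshold = (rho + 1 - np.sqrt((rho + 1) ** 2 - 2)) / 2 * M0
print(eps, threshold, eps <= threshold)
```

A large $\beta$ (a strong reuse constraint) shrinks $\varepsilon$, making the condition easier to satisfy.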
The experimental results are shown in Figure 5. From the experimental results, it can be seen that the condition $\varepsilon \le \frac{\rho + 1 - \sqrt{(\rho+1)^2 - 2}}{2} M_0$ is always satisfied.

4.4 Impact of Incremental Features

To investigate the effect of incremental features on model performance, we conducted experiments by progressively increasing the number of incremental features from 0 to $d_2$. Specifically, we selected $d_2/3$, $2d_2/3$, and $d_2$ as representative points to illustrate the performance trends. The results in Figure 4 reveal the following: (1) The performance of feature tailoring depends only on the observed features from the previous stage. (2) For data adaption and model reuse, performance generally improves with more incremental features, as additional features provide more information and alleviate underfitting. (3) In contrast, data reconstruction shows inconsistent performance, with degradation on some datasets due to the increased reconstruction complexity as the feature count grows.

5 Conclusion

In this paper, we focus on the feature incremental learning problem, for which theoretical analysis is still limited. To gain a deep understanding of the workings of this complex machine learning problem, we first summarize and refine four strategies, i.e., feature tailoring, data adaption, model reuse, and data reconstruction, based on their data application modes. Furthermore, we carry out research on the generalization theory of these four typical strategies in feature incremental scenarios. Specifically, we propose a common procedure, analyze the generalization ability of these four common data application strategies, and make a horizontal comparison among them to derive rigorous and quantitative conclusions. In addition, a series of experiments verifies that the theory in this paper is effective, which is helpful for guiding model design.

Acknowledgments

This work was partially supported by the National Key Research and Development Program (No.
2022ZD0114803), the NSF for Distinguished Young Scholars under Grant No. 62425607, and the Key NSF of China under Grant No. 62136005. Chenping Hou is the corresponding author.

References
[Antos et al., 2002] András Antos, Balázs Kégl, Tamás Linder, and Gábor Lugosi. Data-dependent margin-based generalization bounds for classification. J. Mach. Learn. Res., 3:73–98, 2002.
[Bartlett and Mendelson, 2001] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In Computational Learning Theory, 14th Annual Conference on Computational Learning Theory, COLT, volume 2111 of Lecture Notes in Computer Science, pages 224–240. Springer, 2001.
[Berger et al., 1996] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Comput. Linguistics, 22(1):39–71, 1996.
[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1995.
[Das et al., 2013] Shubhomoy Das, Travis Moore, Weng-Keen Wong, Simone Stumpf, Ian Oberst, Kevin McIntosh, and Margaret M. Burnett. End-user feature labeling: Supervised and semi-supervised approaches based on locally-weighted logistic regression. Artif. Intell., 204:56–74, 2013.
[de Mello and Ponti, 2018] Rodrigo Fernandes de Mello and Moacir Antonelli Ponti. Machine Learning - A Practical Approach on the Statistical Learning Theory. Springer, 2018.
[Dietterich, 2017] Thomas G. Dietterich. Steps toward robust artificial intelligence. AI Mag., 38(3):3–24, 2017.
[Downey et al., 2010] Doug Downey, Oren Etzioni, and Stephen Soderland. Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif. Intell., 174(11):726–748, 2010.
[El-Yaniv and Pechyony, 2007] Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications.
In Learning Theory, 20th Annual Conference on Learning Theory, COLT 2007, San Diego, CA, USA, June 13-15, 2007, Proceedings, volume 4539 of Lecture Notes in Computer Science, pages 157–171. Springer, 2007.
[Gu et al., 2022] Shilin Gu, Yuhua Qian, and Chenping Hou. Incremental feature spaces learning with label scarcity. ACM Trans. Knowl. Discov. Data, 16(6):106:1–106:26, 2022.
[Hahn and BFb, 1976] M. G. Hahn and BFb. Probability in Banach Spaces, 1976.
[He et al., 2021] Haiyun He, Hanshu Yan, and Vincent Y. F. Tan. Information-theoretic generalization bounds for iterative semi-supervised learning. CoRR, abs/2110.00926, 2021.
[Hou and Zhou, 2018] Chenping Hou and Zhi-Hua Zhou. One-pass learning with incremental and decremental features. IEEE Trans. Pattern Anal. Mach. Intell., 40(11):2776–2792, 2018.
[Hou et al., 2019] Chenping Hou, Ling-Li Zeng, and Dewen Hu. Safe classification with augmented features. IEEE Trans. Pattern Anal. Mach. Intell., 41(9):2176–2192, 2019.
[Hou et al., 2021] Bo-Jian Hou, Lijun Zhang, and Zhi-Hua Zhou. Learning with feature evolvable streams. IEEE Trans. Knowl. Data Eng., 33(6):2602–2615, 2021.
[Lei et al., 2019] Yunwen Lei, Ürün Dogan, Ding-Xuan Zhou, and Marius Kloft. Data-dependent generalization bounds for multi-class classification. IEEE Trans. Inf. Theory, 65(5):2995–3021, 2019.
[Li and Liu, 2021] Shaojie Li and Yong Liu. Sharper generalization bounds for clustering. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6392–6402. PMLR, 2021.
[Li et al., 2019] Feijiang Li, Yuhua Qian, Jieting Wang, Chuangyin Dang, and Liping Jing. Clustering ensemble based on sample's stability. Artif. Intell., 273:37–55, 2019.
[Li et al., 2022] Jian Li, Yong Liu, and Weiping Wang. Convolutional spectral kernel learning with generalization guarantees. Artif. Intell., 313:103803, 2022.
[Liu and Chen, 2018] Chao Liu and Di-Rong Chen. Generalization error bound of semi-supervised learning with ℓ1 regularization in sum space. Neurocomputing, 275:1793–1800, 2018.
[Mohri and Medina, 2012] Mehryar Mohri and Andrés Muñoz Medina. New analysis and algorithm for learning with drifting distributions. CoRR, abs/1205.4343, 2012.
[Morvant et al., 2012] Emilie Morvant, Sokol Koço, and Liva Ralaivola. PAC-Bayesian generalization bound on confusion matrix for multi-class classification. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012.
[Sadreddin and Sadaoui, 2021] Armin Sadreddin and Samira Sadaoui. Incremental feature learning for infinite data. CoRR, abs/2108.02932, 2021.
[Weston, 2013] Jason Weston. Statistical learning theory in practice. In Bernhard Schölkopf, Zhiyuan Luo, and Vladimir Vovk, editors, Empirical Inference - Festschrift in Honor of Vladimir N. Vapnik, pages 81–93. Springer, 2013.
[Xu and Zeevi, 2020] Yunbei Xu and Assaf Zeevi. Towards optimal problem dependent generalization error bounds in statistical learning theory. CoRR, abs/2011.06186, 2020.
[Xu et al., 2016] Chang Xu, Dacheng Tao, and Chao Xu. Streaming view learning. CoRR, abs/1604.08291, 2016.
[Yang et al., 2022] Yanyan Yang, Degang Chen, Xiao Zhang, Zhenyan Ji, and Yingjun Zhang. Incremental feature selection by sample selection and feature-based accelerator. Appl. Soft Comput., 121:108800, 2022.
[Ye et al., 2018] Han-Jia Ye, De-Chuan Zhan, Yuan Jiang, and Zhi-Hua Zhou. Rectify heterogeneous models with semantic mapping. In Proceedings of the 35th International Conference on Machine Learning, ICML, volume 80 of Proceedings of Machine Learning Research, pages 1904–1913. PMLR, 2018.
[Zhang et al., 2020] Zhenyu Zhang, Peng Zhao, Yuan Jiang, and Zhi-Hua Zhou. Learning with feature and distribution evolvable streams.
In Proceedings of the 37th International Conference on Machine Learning, ICML, volume 119 of Proceedings of Machine Learning Research, pages 11317–11327. PMLR, 2020.

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)