# linear_bandits_with_partially_observable_features__6f64d432.pdf

Linear Bandits with Partially Observable Features

Wonyoung Kim * 1 Sungwoo Park * 2 Garud Iyengar 3 Assaf Zeevi 3 Min-hwan Oh 2

We study the linear bandit problem that accounts for partially observable features. Without proper handling, unobserved features can lead to linear regret in the decision horizon T, as their influence on rewards is unknown. To tackle this challenge, we propose a novel theoretical framework and an algorithm with sublinear regret guarantees. The core of our algorithm consists of (i) feature augmentation, by appending basis vectors that are orthogonal to the row space of the observed features; and (ii) the introduction of a doubly robust estimator. Our approach achieves a regret bound of e O( p

(d + dh)T), where d is the dimension of the observed features and dh depends on the extent to which the unobserved feature space is contained in the observed one, thereby capturing the intrinsic difficulty of the problem. Notably, our algorithm requires no prior knowledge of the unobserved feature space, which may expand as more features become hidden. Numerical experiments confirm that our algorithm outperforms both noncontextual multi-armed bandits and linear bandit algorithms depending solely on observed features.

1. Introduction

We consider a linear bandit problem where the learning agent has access to only a subset of the features, while the reward is determined using the complete set of features, including both observed and unobserved elements. Conventional linear bandit problems rely on the assumption that the rewards are linear in only observed features, without accounting for the potential presence of unobserved features. However, in many real-world applications, rewards are often affected by the unobserved hence latent features that are not observable by the agent. For example, in online

*Equal contribution 1Chung-Ang University, Seoul, Korea 2Seoul National University, Seoul, Korea 3Columbia University, New York, USA. Correspondence to: Min-hwan Oh <minoh@snu.ac.kr>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

advertising without personalization i.e., when serving general users each advertisement is associated not only with observable features such as its category, format, or display time, but also with unobservable factors such as emotional appeal, creative design quality, and brand familiarity. These latent factors influence users click-through rates, yet they are not directly quantifiable. Similarly, in clinical trials, treatment outcomes depend not only on observable features like dosage and formulation, but also on unobserved factors such as manufacturing variability and potential side effect risks. Hence, accurately incorporating latent features is essential for providing precise recommendations.

To address latent features, Park & Faradonbeh (2022; 2024); Kim et al. (2023a); Zeng et al. (2025) typically assume that the true features follow a specific distribution, such as a Gaussian distribution. Establishing a sublinear regret bound in the decision horizon without such structural assumptions on the latent features remains a significant challenge and has not been accomplished yet. Key challenges in the bandit problem with partially observable features arise from the complete lack of information about the latent features indeed, the agent does not even know whether the features are partially observed. We tackle these challenges by proposing a novel linear bandit algorithm that is agnostic to partial observability. Despite having no knowledge of the unobserved features, our algorithm achieves a tighter regret bound than both linear bandit algorithms that consider only observed features and multi-armed bandit (MAB) algorithms that ignore features entirely. For brevity, we will refer to linear bandit algorithms relying solely on observed features as OLB algorithms henceforth. Specifically, our proposed algorithm achieves a

T-rate regret bound, without requiring any prior knowledge of the unobserved features, where T is the decision horizon.

The key idea of our proposed algorithm can be summarized in the following two procedures: (i) reconstructing feature vectors via feature augmentation to capture the influence of unobserved features on rewards, and (ii) introducing a novel doubly robust (DR) estimator to mitigate information loss due to partial observability. For (i), we decompose the reward into two additive components: one projected onto the row space spanned by the observed features, and the other projected onto its orthogonal complement. The former term maximally captures the effect of observed features, while

Linear Bandits with Partially Observable Features

the latter minimizes the impact of unobserved features. We then augment observed features with an orthonormal basis from the complementary space, which is orthogonal to the row space of observed features. This formulation enables us to reformulate the problem within a conventional linear bandit framework, where the reward function is defined as a dot product of the augmented features and an unknown parameter, without any additional additive term.

However, since these augmented features are not identical to the unobserved features, estimation errors may arise from information loss. To mitigate such errors, we leverage (ii) a DR estimator, which is widely used in the statistical literature for its robustness to errors caused by missing data. These two approaches allow our algorithm to compensate for missing information due to partial observability, improving both estimation accuracy and adaptability to the environment.

Our main contributions are summarized as follows:

We propose a linear bandit problem with partially observable features. Our problem setting is more general and challenging than those in the existing literature on linear bandits with latent features, which often rely on specific structural assumptions governing the relationship between observed and latent features. In contrast, our approach assumes no additional structure for the unobserved features beyond the linearity of the reward function, which is commonly adopted in the linear bandit literature (Section 3).

We introduce a novel estimation strategy by (i) augmenting the features that maximally capture the effect of reward projected onto the observed features, while minimizing the impact of unobserved features (Section 4), and (ii) constructing a DR estimator that mitigates errors from unobserved features. By integrating augmented features with the DR estimator, we guarantee a

t-convergence rate on the rewards for all arms in each round t (Theorem 2).

We propose an algorithm named Robust to Latent Features (Ro LF) for a general linear bandit framework with latent features (Algorithm 1). The algorithm achieves a regret bound of e O( p

(d + dh)T)1 (Theorem 3), where dh is the number of nonzero coefficients associated with the component of the reward projected onto the orthogonal complement of the row space of observed features (Section 4.2). Ro LF requires no prior knowledge or modeling of unobserved features, yet achieves, to the best of our knowledge, a sharper regret bound than OLB and MAB algorithms, as well as existing methods accounting for partial observability or model misspecification within the linear bandit framework.

1 e O( ) is the Big-O notation omitting logarithmic factors.

Our numerical experiments confirm that our algorithm consistently outperforms OLB (Li et al., 2010; Agrawal & Goyal, 2013; Kim & Paik, 2019) and MAB (Lattimore & Szepesvári, 2020) algorithms, validating both its practicality and theoretical guarantees (Section 6).

2. Related Works

While our setting appears similar to prior works on bandit problems with (i) model misspecification and (ii) partial observability, it differs from both lines of research in several aspects, including the nature of the unobserved features and the strategies used to address the problem.

First, our problem setting is more general and challenging than misspecified linear bandit problems, where the reward model deviates from the true reward due to nonlinearity (Lattimore & Szepesvári, 2020) or additive deviation terms (Ghosh et al., 2017; Bogunovic et al., 2021; He et al., 2022). While prior works incorporate cumulative misspecification error into the regret bound, we obtain a regret bound that remains unaffected by such deviations. In particular, Ghosh et al. (2017) employ hypothesis testing for model selection and obtain a regret bound of O(K

T log T) under high misspecification. In contrast, our algorithm achieves a regret bound of O( p

(d + dh)T log T) without such tests, while handling partial observability in a unified framework.

Second, in bandit problems with partially observable features, prior works often rely on structural assumptions. For example, Park & Faradonbeh (2022; 2024); Kim et al. (2023a) assume the true features are drawn from a Gaussian distribution, while Zeng et al. (2025) model them as evolving according to a linear dynamical system with additive Gaussian noise, where observed features are generated via a known linear mapping, also corrupted by Gaussian noise. These methods typically aim to recover the true features via decoders (Park & Faradonbeh, 2022; 2024), Bayesian oracles (Kim et al., 2023a), or Kalman filtering (Zeng et al., 2025). In contrast, our approach makes no structural assumptions about the relationship between observed and unobserved features and does not attempt to recover the latter. Instead, we mitigate the information loss from partial observability by projecting the latent part of the reward using only the observed features. A more detailed and comprehensive literature review is provided in Appendix A.

3. Preliminaries

3.1. Notation

For any n N, let [n] denote the set {1, 2, . . . , n}. For a vector v, we denote its L1, L2 and supremum norms by v 1, v 2, and v , respectively. The L2-norm weighted by a positive definite matrix D is denoted by

Linear Bandits with Partially Observable Features

v D. For two vectors v1 and v2, the inner product is defined as the dot product, i.e., v1, v2 := v 1 v2, and we use both notations interchangeably. For a matrix M, its minimum and maximum eigenvalues are denoted by λmin(M) and λmax(M), respectively. We let R(M) denote the row space of M, i.e., the subspace spanned by the rows of M.

3.2. Problem Formulation

In this section, we outline our problem setting and introduce several key assumptions. Each arm a [K] is associated with a true feature vector za Rdz that determines its rewards. However, the agent can observe only a subset of its elements, with the others remaining unobserved. Specifically, za is defined as follows:

x(1) a , , x(d) a , u(1) a , , u(du) a

For clarity, we highlight the observed components in blue and the unobserved components in red. Note that the dimensions of the latent feature vector, du = dz d, and the true feature vector, dz, are both unknown to the agent. As a result, the agent is unaware of how many features, if any, are unobserved, or even whether partial observability exists at all. This lack of structural information presents a fundamental challenge, as it prevents the agent from explicitly modeling or compensating for the unobserved components when selecting an appropriate learning strategy.

It is worth noting that the setting with fixed observed features2 includes linear bandits with misspecification error (Ghosh et al., 2017; Bogunovic et al., 2021; He et al., 2022) as special cases. In Appendix D, we present a setting with varying observed features and an algorithm that achieves a

T-rate regret bound. Moreover, if latent features were allowed to change arbitrarily over time, the problem would become non-learnable and thus ill-posed. Consequently, assuming fixed features is both natural and well-justified (see Table 1 for comparisons).

The reward associated with arm a is defined as the dot product between its true feature vector za and an unknown parameter θ Rdz, namely

ya,t = za, θ + ϵt a [K], (2)

where at [K] denotes the action selected by the agent at round t. Here, ϵt represents a random noise term that captures the inherent stochasticity in the reward generation process. We make the following assumption on ϵt, which is commonly adopted in bandit problems. Assumption 1 (Sub-Gaussian noise). Let Ft denote the history at round t, represented by a filtration of σ-algebras,

2This assumption is standard in linear bandits with model misspecification (Ghosh et al., 2017; Lattimore et al., 2020), which is a special case of our partially observable feature setting.

Table 1. Summary of problem settings covered in this paper and the corresponding results. Note that if latent features arbitrarily change over time, the problem itself would become non-learnable, making the problem ill-posed (see Appendix D for details).

Observed Features Unobserved Features Learnable? Results

Fixed Fixed Yes Theorem 3 Varying Fixed Yes Theorem 6 Varying Varying No -

e.g., σ(a1, ya1,1, . . . , at 1, yat 1,t 1, at). The reward noise ϵt is conditionally σ-sub-Gaussian, i.e., for all λ R,

E[exp(λϵt)|Ft 1] exp λ2σ2

Since ϵt is sampled after at is observed, ϵt is Ft-measurable. Under Assumption 1, it follows that E[ϵt|Ft 1] = 0, thus E[ya,t|Ft 1] = za, θ . For brevity, we use Et 1[ ] to denote E[ |Ft 1] henceforth. To eliminate issues of scale for analysis, we assume that the expected reward | za, θ | 1 for all a [K], and we do not assume any bound on the norm of za as well.

Let a := argmaxa [K] za, θ denote the optimal action, considering both observed and latent features. Moreover, we denote by z and y ,t the true feature vector and the realized reward associated with the optimal action a , respectively. We evaluate the theoretical performance of our algorithm using cumulative regret, which is defined as the expected sum of the differences between the reward of a and at:

t=1 θ , z zat

t=1 (Et 1[y ,t] Et 1[yat,t]) .

Considering the composition of za defined in Eq. (1), we decompose the parameter vector as θ = [(θ(o) ) , (θ(u) ) ] , where θ(o) Rd and θ(u) Rdu correspond to the parameters associated with the observed and latent features, respectively. Adopting this decomposition of θ , the reward yat,t defined in Eq. (2) can be decomposed into the following three terms:

yat,t = xat, θ(o) + ϵt + uat, θ(u) . (3)

The last term of Eq. (3) corresponds to the inaccessible portion of the reward. This reward model is equivalent to that imposed in the linear bandits with misspecification error (Ghosh et al., 2017; Lattimore et al., 2020). While the regret bound in Lattimore et al. (2020) includes misspecification error that grows linearly with decision horizon, our proposed method (Section 4) addresses this misspecification error and achieves a regret bound that is sublinear in T.

Linear Bandits with Partially Observable Features

Before presenting our method and algorithm, we first establish a lower bound on the regret of linear bandit algorithms that disregard the unobserved portion of the rewards. In particular, the following theorem provides lower bounds for two OLB algorithms: OFUL (Abbasi-yadkori et al., 2011) and Lin TS (Agrawal & Goyal, 2013).

Theorem 1 (Regret lower bound of OFUL and Lin TS ignoring latent features). Under partial observability, there exists a problem instance where both OFUL and Lin TS suffer from cumulative expected regret that grows linearly with T due to their disregard for the unobserved components.

Sketch of proof. Consider a linear bandit problem instance with an action set A := {1, 2}, where d = du = 1, implying that dz = 2. The true feature vectors are given by Z := {[1, 3] , [2, 19/4] } R2, with the agent observing only the first element of each vector, while the second element remains unobserved. In this setting, we take arm 2 to be the optimal action. However, an estimator relying solely on the observed features is inconsistent, leading OFUL and Lin TS to select the suboptimal arm with probability Θ(1), and thereby incur regret that grows linearly in T.

Theorem 1 shows that neglecting the latent portion of the reward in decision-making may cause the agent to fail in identifying the optimal action. The comprehensive proof is deferred to Appendix E.1. While Theorem 1 specifically considers OFUL and Lin TS which are known to achieve the most efficient regret bounds for UCBand Thompson sampling-based policies in the linear bandit framework we additionally present an algorithm-agnostic lower bound based on a different analysis (see Appendix F for details).

4. Robust Estimation for Partially Observable Features

4.1. Feature Vector Augmentation with Orthogonal Projection

In the linear bandit framework, accurate estimation of the unknown reward parameter θ contributes to optimal decisionmaking. However, in our problem setting, the learner lacks access to the latent reward component namely, the last term in Eq. (3). Consequently, without an appropriate mechanism to compensate for the absence of θ(u) , the agent fails to accurately estimate θ , resulting in ineffective learning, as demonstrated in Theorem 1.

That said, minimizing regret that is, selecting the optimal arm does not require recovery of the true reward parameter θ entirely; rather, it suffices to estimate the K expected rewards {z a θ : a [K]}. A straightforward approach is to ignore the features altogether and apply MAB algorithms such as UCB1 (Auer et al., 2002), which are known to achieve a regret bound of e O(

KT). However, these algorithms

tend to suffer higher regret than feature-based algorithms, particularly when the number of arms is significantly larger than the dimension of the feature vectors, i.e., K dz.

To tackle this dilemma, we propose a unified approach that handles partial observability and ensures efficient estimation. Let us define X := (x1, . . . , x K) Rd K as the matrix formed by concatenating the observed part of the true features, and U := (u(u) 1 , . . . , u(u) K ) Rdu K as the matrix formed by concatenating their latent complements for each arm. Without loss of generality, we assume a set of K vectors {x1, . . . , x K} spans Rd.3 We denote by PX := X (XX ) 1X the projection matrix onto the row space of X, denoted by R(X). Then the vector of rewards for all arms, Yt = (y1,t, . . . , y K,t) , is decomposed as:

Yt = X θ(o) + U θ(u) + ϵt1K

= PX X θ(o) + U θ(u)

+ (IK PX) X θ(o) + U θ(u) + ϵt1K

= X θ(o) + (XX ) 1XU θ(u)

+ (IK PX)U θ(u) + ϵt1K, (4)

where the first term corresponds to the reward projected onto R(X), whereas the second term is the reward projected onto R(X) , the subspace of RK orthogonal to R(X). We reparametrize θ(o) + (XX ) 1XU θ(u) as µ(o) , representing the parameter associated with the observed features.

To handle the second term in Eq. (4), we consider a set of row vector bases {b 1 , . . . , b K d} R(X) , where bi RK for i [K d]. Given the set, there exist coefficients µ(u) ,1, . . . , µ(u) ,K d R such that the term can be expressed as a linear combination:

(IK PX)U θ(u) =

i=1 µ(u) ,i bi. (5)

We define the number of nonzero coefficients as:

dh(b 1 , . . . , b K d) := |{i [K d] : µ(u) ,i = 0}|. (6)

Note that dh = 0 for any basis set {b 1 , . . . , b K d} when the latent feature space is completely included in the observed feature space, i.e., R(U) R(X). In this case, (IK PX)U = 0K du, which means that the projected rewards onto R(X) can be linearly expressed by the observed features. If R(U) R(X), on the other hand, then dh = K d for any basis set {b 1 , . . . , b K d}.

3When d > K, we can apply singular value decomposition (SVD) on X to reduce the feature dimension to d K with R(X) = d.

Linear Bandits with Partially Observable Features

Figure 1. Illustration comparing OLB algorithms and our approach in estimating rewards of K = 3 arms. OLB algorithms find estimates within R(X) thus accumulating errors from unobserved features. However, our approach utilizes the projection of the latent reward, b bµ(u) t , onto the orthogonal complement of R(X), thereby enabling reward estimation in RK.

R(X) q2 X bµ(o) t

(a) OLB algorithms: Estimation confined to R(X)

(b) Ours: Projection onto orthogonal complement

In other cases, the quantity dh depends on the choice of the basis {b 1 , . . . , b K d}, and tends to be smaller when a larger portion of R(U) is included within R(X). For any choice of the basis, our algorithm achieves e O( p

(d + dh)T) regret without prior knowledge of dh, which does not exceed the e O(

KT) regret bound achieved by MAB algorithms ignoring features. Further details are provided in Section 5.2.

Similar to the reparametrization of µ(o) , we denote µ(u) := [µ(u) ,1, . . . , µ(u) ,K d] . By concatenating µ(o) and µ(u) , we

define µ := [(µ(o) ) , (µ(u) ) ] RK, so that the reward for each a [K] can be rewritten as follows:

ya,t = e a Y

= e a [X b1 b K d]µ + ϵt

= [xa e a b1 e a b K d]µ + ϵt. (7)

The decomposition in Eq. (7) implies that Eq. (4) takes the form Yt = [X b1 b K d]µ + ϵt1K. Note that ea RK denotes the a-th standard basis vector. With this modification, the rewards are now represented as a linear function of the augmented feature vectors: exa := [x a e a b1 e a b K d] RK, without any misspecification error. A toy example illustrating our strategy is shown in Figure 1.

In terms of regret minimization, while Sup Lin UCB (Chu et al., 2011) achieves a regret bound of e O(

d T), it is often considered impractical as it computes log T distinct batches and estimators, requiring knowledge of T and, more critically, discards a significant portion of samples at each parameter update. To retain the theoretical efficiency while improving practicality, we adopt the doubly robust (DR) estimation framework, which is known to achieve e O(

d T) (Kim et al., 2021; 2023c). Given that exa RK, we propose an efficient algorithm that employs a DR estimator with ridge regularization and obtains a regret bound

KT) (see Appendix C). However, when K > d and du = 0, the regret is the regret is higher than that of OLB algorithms. To address this challenge, we propose a novel estimation strategy in the following section that eliminates the dependence on K in the regret bound.

4.2. Doubly Robust Lasso Estimator

In Eq. (7), the parameter µ is sparse, with its sparsity depending on the number of nonzero elements required to represent the inaccessible portion of the reward projected onto R(X) , namely dh defined in Eq. (6). Recall from Eq. (5) that this projection can be expressed using at most dh basis vectors, implying that µ(u) has at most dh nonzero entries.

Let ˇµL t denote the Lasso estimator of µ using the augmented feature vectors, defined as follows:

ˇµL t := argmin µ

τ=1 (yaτ ,τ ex aτ µ)2

2pt log 2Kt2

where p is the coupling probability used to define the multinomial distribution for pseudo-action sampling (as defined in Eq. (9)), and δ is the confidence parameter of the algorithm. Note that eσ2 max := maxa [K] e a (P

a [K] exaex a )ea, i.e., the largest diagonal entry of the Gram matrix constructed from the augmented feature vectors exa over A.

For the estimator in Eq. (8) to correctly identify the zero entries in µ , the compatibility condition needs to be satisfied (van de Geer & Bühlmann, 2009). Although the compatibility condition does not, in general, require a positive minimum eigenvalue for the Gram matrix, in our setting, λmin(t 1 Pt s=1 exasex as) > 0. Therefore, the compatibility condition is implied without any additional assumption. However, ensuring a sufficiently large minimum eigenvalue typically requires collecting a large number of exploration samples, which in turn increases regret. Achieving this with fewer exploration samples remains a key challenge in the bandit literature, as the minimum eigenvalue affects the convergence rate of the estimator and, consequently, the regret bound (Soare et al., 2014; Kim et al., 2021).

We introduce a doubly robust (DR) estimator that employs the Gram matrix constructed from the augmented feature vectors over the entire action space, Pt s=1 P

a [K] exaex a , rather than only from the chosen actions, Pt s=1 exasex as. Originating in the missing-data literature (Bang & Robins, 2005), DR estimation yields consistent estimators if either the imputation or the observation probability model is correct (Kim et al., 2021). In the bandit setting, only the reward of the chosen arm is observed for each decision round, leaving the other K 1 as missing. Hence, DR estimation imputes these K 1 missing rewards and incorporates all the

Linear Bandits with Partially Observable Features

associated feature vectors into the estimation process. Since the observation probabilities are determined by the policy which is known to the learner the DR estimator remains robust to errors in the reward estimation. While Kim & Paik (2019) proposed a DR Lasso estimator under IID features satisfying the compatibility condition, we propose a novel DR Lasso estimator that does not rely on such assumptions.

We improve the DR estimation by incorporating resampling and coupling methods. For each t, let Et [t] denote the set of exploration rounds such that for any τ Et, the action aτ is sampled uniformly over [K]. The set Et is constructed recursively: starting with E0 = , we define

Et 1 {t} if |Et| Ce log 2Kt2

Et 1 otherwise.

Here, Ce is defined as (8K)3eσ 2 mineσ2 max(1 p) 2, where eσ2 min := λmin(P

a [K] exaex a ) denotes the minimum eigenvalue of the Gram matrix constructed from the augmented feature vectors over the entire action space.

At each round t / Et, the algorithm selects at according to an ϵt-greedy strategy: with probability ϵt = t 1/2, it selects bat = argmaxa [K] ex a bµL t 1; with 1 ϵt, it selects an action uniformly at random from [K] \ {bat}. Given at, a pseudoaction eat is sampled from a multinomial distribution:

ϕat,t := P(eat = at | at) = p,

ϕk,t := P(eat = k | at) = 1 p

for all k [K] \ {at}, where p (1/2, 1) is the coupling probability specified by the algorithm. To couple the policies for the actual action at and the pseudo-action eat, we repeatedly resample both of them until they match.

With the pseudo-actions coupled with the actual actions, we construct unbiased pseudo-rewards for all a [K] as:

eya,t := ex a ˇµL t + I(eat = a)

ya,t ex a ˇµL t , (10)

where ˇµL t is the imputation estimator that fills in the missing rewards of unselected arms at round t, as defined in Eq. (8).

As shown in Eq. (10), the pseudo-reward includes an inverse probability term. Hence, when DR estimation is applied under the ϵt-greedy policy with ϵt = t 1/2, its variance can grow unbounded over time. To mitigate this issue, we propose a resampling and coupling strategy: by coupling the ϵt-greedy policy with the multinomial distribution (Eq. (9)), we ensure that each inverse probability weight ϕ 1 a,t, for a [K], remains bounded by O(K). This strategy effectively stabilizes the variance of the pseudo-rewards.

Moreover, one can show that the resampling succeeds with high probability for each round. Let Mt denote the

event that the pseudo-action eat matches the chosen action at within a specified number of resamples. For a given δ (0, 1), we set the maximum number of resamples to ρt := log((t + 1)2/δ )/ log(1/(1 p)); then Mt occurs with probability at least 1 δ /(t + 1)2. Resampling allows the algorithm to further explore the action space to balance regret minimization with accurate reward estimation.

For a = eat, i.e., an arm that is not selected in the round t, we impute the missing rewards using ex k ˇµL t . For a = eat, in contrast, the term I(eat = a)ya,t/ϕa,t calibrates the predicted reward to ensure the unbiasedness of the pseudorewards for all arms. Given that Eeat[I(eat = a)] = P(eat = a) = ϕa,t, taking the expectation over eat on both sides of Eq. (10) gives Eeat[eya,t] = Et 1[ya,t] = ex a µ for all a [K]. Although the estimate ex a ˇµt may have a large error, it is multiplied by the mean-zero random variable (1 I(eat = a)/ϕa,t), which makes the resulting pseudorewards defined in Eq. (10) robust to errors of ex a ˇµt. The pseudo-rewards can only be computed and thus can only be used when the pseudo-action eat matches the actual action at, that is, when the event Mt occurs. Since Mt occurs with high probability, we are able to compute pseudorewards for almost all rounds.

Incorporating the pseudo-rewards, eya,t, into estimation, we define our DR Lasso estimator as follows:

bµL t := argmin µ

τ=1 I(Mτ) X

eya,τ ex a µ 2

2t log 2Kt2

and note that our estimator uses the Gram matrix aggregated over all actions at each round, Pt τ=1 PK a=1 exaex a , instead of the one built only from the chosen actions, Pt τ=1 exaτ ex aτ . The following theorem guarantees convergence of the estimator, across all arms, to the true parameter, after a sufficient number of exploration rounds. Theorem 2 (Consistency of the DR Lasso estimator). Let dh denote the dimension of the projected latent rewards defined in Eq. (6). Then with probability at least 1 3δ,

max a [K] |ex a (bµL t µ )| 8σeσmax

for all rounds t such that t |Et|.

Although the DR Lasso estimator leverages K-dimensional feature vectors, its error bound depends only logarithmically on K. Such fast convergence is typically achieved under classical regularity conditions, including the compatibility condition and the restrictive minimum eigenvalue condition (van de Geer & Bühlmann, 2009; Bühlmann & van de Geer, 2011). Existing Lasso-based bandit approaches (Kim & Paik, 2019; Bastani & Bayati, 2020; Oh

Linear Bandits with Partially Observable Features

Algorithm 1 Robust to Latent Feature (Ro LF)

1: INPUT: Observed features {xa : a [K]}, coupling probability p (1/2, 1), confidence parameter δ > 0. 2: Initialize bµ0 = 0K, exploration phase E0 = , and exploration factor Ce := (8K)3eσ 2 mineσ2 max(1 p) 2. 3: Find orthogonal basis {b 1 , . . . , b K d} R(X) and construct {exa : a [K]}. 4: for t = 1, . . . , T do 5: if |Et| Ce log(2Kt2/δ) then 6: Randomly sample bat uniformly over [K] and Et = Et 1 {t}. 7: else 8: Compute bat := arg maxa [K] ex a bµL t 1. 9: end if 10: while eat = at and count ρt do 11: Sample at with P(at = bat) = 1 (t 1/2) and P(at = k) = t 1/2/(K 1), k = bat. 12: Sample eat according to Eq. (9). 13: count = count + 1. 14: end while 15: Play at and observe yat,t. 16: if eat = at then 17: Set bµL t := bµL t 1. 18: else 19: Update bµL t following Eq. (11) with eya,t and update ˇµL t following Eq. (8). 20: end if 21: end for

et al., 2021; Ariu et al., 2022; Chakraborty et al., 2023; Lee et al., 2025) generally impose these conditions directly on the feature vectors. Our approach, on the other hand, does not require such assumptions, as the augmented features are orthogonal vectors lying in R(X) . Moreover, their average Gram matrix satisfies λmin(P

a [K] exaex a ) min{1, λmin(P

a [K] xax a )}. Thus, the convergence rate scales only as log K with respect to K.

The consistency is proved by bounding the two components of the error in the pseudo-rewards (Eq. (10)): (i) the noise of the reward and (ii) the error of the imputation estimator ˇµt. Since (i) is sub-Gaussian, it can be bounded using martingale concentration inequalities. For (ii), the imputation error ex a (ˇµL t µ ) is multiplied by the mean-zero random variable (1 I(eat = a)/ϕa,t) and thus the magnitude of the error can be bounded by ˇµL t µ 1/

t. The detailed proof is deferred to Appendix E.2.

5. Proposed Algorithm and Regret Analysis

5.1. Robust to Latent Features (Ro LF) Algorithm

In this section, we present our algorithm Ro LF in Algorithm 1. In the initialization step, given the observed features, our algorithm constructs a set of orthogonal basis

Table 2. An overview of regret bound range of our algorithm, Ro LF, depending on dh [0, K d], the number of nonzero elements to effectively capture the unobserved part of reward projected onto R(X) . Note that e O denotes the big-O notation omitting logarithm factors.

Feature space Regret bound

span(observed) span(latent) e O(

d T) span(observed) span(latent) e O(

KT) otherwise e O( p

vectors {b 1 , . . . , b K d} R(X) to augment each observed feature vector. The algorithm first selects a candidate action bat either uniformly at random or greedily based on the estimated reward and then resamples at and eat until they match or the maximum number of trials allowed for each round, ρt := log((t + 1)2/δ )/ log(1/(1 p)), is reached. Once the resampling phase ends, the algorithm selects at and the corresponding reward yat,t is observed. If at and eat match within ρt, then both the imputation and the main estimators are updated; otherwise, neither is updated.

Our proposed algorithm does not require knowledge of the dimension of the unobserved features du, nor the dimension of the reward component projected onto R(X) . Although we present the algorithm for fixed feature vectors, the algorithm is also applicable to time-varying feature vectors. In such cases, we estimate the bias caused by unobserved features by augmenting the standard basis. For further details, refer to Appendix D.

5.2. Regret Analysis

Theorem 3 (Regret bound for Lasso Ro LF). Let dh denote the number of nonzero coefficients in the representation of the projected latent reward as defined in Eq. (6). Then for any δ (0, 1) and p (1/2, 1), with probability at least 1 6δ, the cumulative regret of Ro LF is bounded by

Reg(T) 2 83K3(1 p) 2 log 2KT 2

2(d + dh)T log 2KT 2

To the best of our knowledge, Theorem 3 provides the first regret bound that converges faster than e O(

KT), specifically for algorithms that account for unobserved features without relying on any structural assumptions. Assuming exa 1 (instead of exa 2 1), eσ2 min and eσ2 max are constant independent of d or K. Note that the number of rounds required for the exploration phase is O(K3 log KT), which grows only logarithmically with the time horizon T. The factor K3 is irreducible, as the algorithm needs information

Linear Bandits with Partially Observable Features

about all K biases from the missing features. By employing the Gram matrix of the augmented feature vectors over the action space A, PK a=1 exaex a , when combined with DR estimation, we reduce the required exploration phase time from O(K4 log KT) to O(K3 log KT), thereby improving the sample complexity by a factor of K.

As demonstrated in Theorem 2, the last term on the righthand side of the regret bound is proportional to d + dh rather than

K, resulting in an overall regret bound of O( p

(d + dh)T log KT). The specific value of dh is determined by the relationship between R(X) and R(U), as discussed in Section 4.1. Table 2 summarizes how the regret bound varies under different relationships between these subspaces. The formal proof of Theorem 3 is deferred to Appendix E.3.

6. Numerical Experiments

6.1. Experimental Setup

In this experiment, we simulate and compare our algorithms, Algorithm 1 and Algorithm 2 (Appendix C), with baseline OLB algorithms: Lin UCB (Li et al., 2010) and Lin TS (Agrawal & Goyal, 2013). These algorithms adopt UCB and Thompson sampling strategies, respectively, assuming a linear reward model based on observed features. We also include DRLasso (Kim & Paik, 2019) since our algorithm employs DR estimation with a Lasso estimator, and UCB(δ) (Lattimore & Szepesvári, 2020), an MAB algorithm that ignores features, to assess the effectiveness of using feature information, even under partial observability.

To provide comprehensive results, we consider two scenarios: one with partial observability and one without. For each scenario, we conduct experiments under three cases, classified by the relationship between the row spaces of the observed and unobserved features, R(X) and R(U), which determines the value of dh: (i) Case 1, the general case where neither R(X) nor R(U) fully contains the other; (ii) Case 2, R(U) R(X), where the row space spanned by the unobserved features lies entirely within that spanned by the observed features, thus dh = 0; (iii) Case 3, R(X) R(U), where the row space spanned by the observed feature space is fully contained within that spanned by the unobserved features, implying dh = K d. Note that in Scenario 2, Case 3 is excluded because R(U) = implies R(X) = , which violates our problem setup.

In the simulation, after constructing the true features za Rdz and observed features xa Rd, we apply singular value decomposition (SVD) to the observed feature matrix X to derive orthogonal basis vectors {b 1 , . . . , b K d} orthogonal to R(X). These vectors are then concatenated with X to form the augmented feature matrix. The reward parameter θ Rdz is sampled from Unif( 1/2, 1/2),

and the rewards are generated following Eq. (2). For the hyperparamters in our algorithms, the coupling probability p and the confidence parameter δ, are set to 0.6 and 10 4, respectively. The total decision horizon is T = 1200. Throughout the experiments, we fix the number of arms at K = 30 and the dimension of the true features at dz = 35, ensuring dz d to cover both scenarios. Further details on the experimental setup are provided in Appendix B.

6.2. Experimental Results

6.2.1. SCENARIO 1

In this scenario, d is set to dz/2 , so that the agent observes only about half of the full feature dimension. From Figure 2, we observe that the baseline OLB algorithms Lin UCB, Lin TS, and DRLasso perform worse than our algorithms in terms of both the level and robustness of cumulative regret. This pattern is consistently observed across all three cases, implying that linear bandit algorithms fail to identify the optimal arm in the presence of unobserved components, as they cannot capture the portion of the reward associated with the unobserved features. However, our algorithms show almost the same performance regardless of the relationship between R(X) and R(U), indicating the robustness to variations in feature observability structure.

Note that for all cases our algorithms Ro LF-Lasso (Algorithm 1) and Ro LF-Ridge (Algorithm 2) exhibit a sharp decline in the growth rate of cumulative regret after a certain number of rounds. This behavior is primarily due to the forced-exploration phase built into the algorithms, which ensures sufficient coverage of the action space in the early stages. Following this phase, the resampling and coupling strategies further enhance the efficiency of the DR estimation, leading to slower regret accumulation over time. This pattern is commonly shown from experimental results of other bandit algorithms employing the forced-exploration strategy (Goldenshluger & Zeevi, 2013; Hao et al., 2020; Chakraborty et al., 2023).

Furthermore, in Case 2 (Figure 2(b)), the cumulative regret grows more slowly than in other cases. This behavior is explained by the relationship R(U) R(X), implying the reward components projected onto R(X) can be fully expressed by the observed features. As a result, the augmented features fully capture the underlying reward structure, enabling faster convergence relative to the other scenarios.

6.2.2. SCENARIO 2

In Scenario 2, we set d = dz = 2K, implying that no latent features remain and allowing us to evaluate our algorithms under the condition d > K. As discussed in Section 4.1, we apply SVD to reduce the dimensionality of the observed features before constructing the augmented feature set. Figure 3 shows that our algorithms perform well even without

Linear Bandits with Partially Observable Features

0 200 400 600 800 1000 1200 Round (t)

Cumulative Regret

Ro LF-Lasso (Ours) Ro LF-Ridge (Ours) DRLasso Lin UCB Lin TS UCB( )

(a) Case 1 (General Case)

0 200 400 600 800 1000 1200 Round (t)

Cumulative Regret

Ro LF-Lasso (Ours) Ro LF-Ridge (Ours) DRLasso Lin UCB Lin TS UCB( )

(b) Case 2 (R(U) R(X))

0 200 400 600 800 1000 1200 Round (t)

Cumulative Regret

Ro LF-Lasso (Ours) Ro LF-Ridge (Ours) DRLasso Lin UCB Lin TS UCB( )

(c) Case 3 (R(X) R(U))

Figure 2. Cumulative regrets of the algorithms with partial observability (Scenario 1).

0 200 400 600 800 1000 1200 Round (t)

Cumulative Regret

Ro LF-Lasso (Ours) Ro LF-Ridge (Ours) DRLasso Lin UCB Lin TS UCB( )

(a) Case 1 (General Case)

0 200 400 600 800 1000 1200 Round (t)

Cumulative Regret

Ro LF-Lasso (Ours) Ro LF-Ridge (Ours) DRLasso Lin UCB Lin TS UCB( )

(b) Case 2 (R(U) R(X))

Figure 3. Cumulative regrets of the algorithms without partial observability (Scenario 2).

unobserved features, both in terms of the level and robustness of the cumulative regret. Moreover, by incorporating dimensionality reduction, our algorithms remain effective even when the feature matrix is not full rank. Lastly, similar to Figure 2, cumulative regret in Case 2 converges slightly more slowly than in Case 1.

Meanwhile, the baseline OLB algorithms continue to struggle in identifying the optimal arm throughout the horizon. For Lin UCB and Lin TS, this behavior may be attributed to the curse of dimensionality, where regret scales linearly with the feature dimension d a well-known limitation of linear bandit algorithms (Oswal et al., 2020; Tran et al., 2024). Additionally, since the true rewards are bounded by 1, the resulting small reward gaps between arms may further hinder the identification of the optimal arm. In the case of DRLasso, similar behavior arises from the fixed feature setting: the algorithm uses an averaged context vector across the action space at each round for parameter estimation. Consequently, under the fixed-feature setting, this strategy becomes effectively equivalent to using a single fixed vector throughout the entire learning horizon, thereby limiting the expressiveness of the estimation.

7. Conclusion

In this work, we addressed the problem of partially observable features within the linear bandit framework. We showed that conventional algorithms that ignore unobserved features may suffer linear regret due to information loss, and introduced Ro LF, a novel algorithm that accounts for latent features using only observed data without requiring prior knowledge. Our algorithm achieves a tighter regret bound than existing methods, and this improvement is supported by our numerical experiments.

For future work, from the perspective that our feature augmentation strategy reformulates the problem as another linear bandit problem without model misspecification, extending this approach to other reward models, e.g., generalized linear models, would be an interesting direction. Additionally, although we assumed the latent reward component to be linear in the unobserved features, we can relax this assumption by viewing the latent reward component as an exogenous factor that interferes with the learning process. This perspective allows for modeling the latent reward using a general function class without structural assumptions.

Linear Bandits with Partially Observable Features

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2022-NR071853; RS-2023-00222663; RS-2025-16070886), by a grant of Korean ARPA-H Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (RS-2024-00512375), by AI-Bio Research Grant through Seoul National University, and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2025-02263754; RS-2021-II211341, Artificial Intelligence Graduate School Program, Chung Ang University). Garud Iyengar s research was partially supported by ONR grant N000142312374 and NSF grant EFMA-2132142.

Abbasi-yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML 99, pp. 3 11. Morgan Kaufmann Publishers Inc., 1999.

Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 127 135. PMLR, 2013.

Ariu, K., Abe, K., and Proutiere, A. Thresholded lasso bandit. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 878 928. PMLR, 2022.

Auer, P. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3:397 422, 2002.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 2002.

Bang, H. and Robins, J. M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61 (4):962 973, 2005.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68 (1):276 294, 2020.

Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. Stochastic linear bandits robust to adversarial attacks. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 991 999. PMLR, 2021.

Bühlmann, P. and van de Geer, S. Statistics for High Dimensional Data: Methods, Theory and Applications. Springer Berlin Heidelberg, 2011.

Chakraborty, S., Roy, S., and Tewari, A. Thompson sampling for high-dimensional sparse linear contextual bandits. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 3979 4008. PMLR, 2023.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. 208 214. PMLR, 2011.

Dani, V., 9, ., Hayes, T., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, pp. 355 366, 2008.

Dimakopoulou, M., Zhou, Z., Athey, S., and Imbens, G. Balanced linear contextual bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3445 3453, 2019.

Ghosh, A., Ray Chowdhury, S., and Gopalan, A. Misspecified linear bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), 2017.

Goldenshluger, A. and Zeevi, A. A linear response bandit problem. Stochastic Systems, 3(1):230 261, 2013.

Hao, B., Lattimore, T., and Wang, M. High-dimensional sparse linear bandits. In Advances in Neural Information Processing Systems, volume 33, pp. 10753 10763. Curran Associates, Inc., 2020.

He, J., Zhou, D., Zhang, T., and Gu, Q. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. In Advances in Neural Information Processing Systems, volume 35, pp. 34614 34625. Curran Associates, Inc., 2022.

Linear Bandits with Partially Observable Features

Kim, G.-S. and Paik, M. C. Doubly-robust lasso bandit. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Kim, J.-H., Yun, S.-Y., Jeong, M., Nam, J., Shin, J., and Combes, R. Contextual linear bandits under noisy features: Towards bayesian oracles. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 1624 1645. PMLR, 2023a.

Kim, W., Kim, G.-S., and Paik, M. C. Doubly robust thompson sampling with linear payoffs. In Advances in Neural Information Processing Systems, volume 34, pp. 15830 15840. Curran Associates, Inc., 2021.

Kim, W., Lee, K., and Paik, M. C. Double doubly robust thompson sampling for generalized linear contextual bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8300 8307, 2023b.

Kim, W., Paik, M. C., and Oh, M.-H. Squeeze all: Novel estimator and self-normalized bound for linear contextual bandits. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 3098 3124. PMLR, 2023c.

Kim, W., Iyengar, G., and Zeevi, A. A doubly robust approach to sparse reinforcement learning. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pp. 2305 2313. PMLR, 2024.

Lai, T. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, pp. 4 22, 1985.

Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.

Lattimore, T., Szepesvari, C., and Weisz, G. Learning with good feature representations in bandits and in RL with a generative model. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5662 5670. PMLR, 2020.

Lee, H., Hwang, T., and Oh, M.-h. Lasso bandit with compatibility condition on optimal arm. In International Conference on Learning Representations, 2025.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW 10, pp. 661 670. Association for Computing Machinery, 2010.

Oh, M.-H., Iyengar, G., and Zeevi, A. Sparsity-agnostic lasso bandit. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8271 8280. PMLR, 2021.

Oswal, U., Bhargava, A., and Nowak, R. Linear bandits with feature feedback. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 5331 5338. AAAI Press, 2020.

Park, H. and Faradonbeh, M. K. S. A regret bound for greedy partially observed stochastic contextual bandits. In Decision Awareness in Reinforcement Learning Workshop at ICML 2022, 2022.

Park, H. and Faradonbeh, M. K. S. Thompson sampling in partially observable contextual bandits. ar Xiv preprint ar Xiv:2402.10289, 2024.

Robbins, H. E. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527 535, 1952.

Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35 (2):395 411, 2010.

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

Tennenholtz, G., Shalit, U., Mannor, S., and Efroni, Y. Bandits with partially observable confounded data. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pp. 430 439. PMLR, 2021.

Tran, N. P., Ta, T. A., Mandal, D., and Tran-Thanh, L. Symmetric linear bandits with hidden symmetry. In Advances in Neural Information Processing Systems, volume 37, pp. 128699 128733. Curran Associates, Inc., 2024.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389 434, 2012.

Tropp, J. A. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn., 8(1 2):1 230, 2015.

van de Geer, S. A. and Bühlmann, P. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360 1392, 2009.

Linear Bandits with Partially Observable Features

Zeng, S., Bhatt, S., Koppel, A., and Ganesh, S. Partially observable contextual bandits with linear payoffs. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.

Linear Bandits with Partially Observable Features

A. Related Works

In bandit problems, the learning agent learns only from the outcomes of chosen actions, leaving unchosen alternatives unknown (Robbins, 1952). This constraint requires a balance between exploring new actions and exploiting actions learned to be good, known as the exploration-exploitation tradeoff. Efficiently managing this tradeoff is crucial for guiding the agent towards the optimal policy. To address this, algorithms based on optimism in the face of uncertainty (Lai & Robbins, 1985) employ the Upper Confidence Bound (UCB) strategy, which encourages the learner to select actions with the highest sum of estimated reward and uncertainty. This approach adaptively balances exploration and exploitation, and has been widely and studied in the context of linear bandits (Abe & Long, 1999; Auer, 2002; Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010). Notable examples include Lin UCB (Li et al., 2010; Chu et al., 2011) and OFUL (Abbasi-yadkori et al., 2011), which are known for their practicality and performance guarantees. However, existing approaches differ from ours in two key aspects: (i) they assume that the learning agent can observe the entire feature vector related to the reward, and (ii) their algorithms have regret that scales linearly with the dimension of the observed feature vector, i.e., e O(d

In contrast, we develop an algorithm that achieves a sublinear regret bound by employing the doubly robust (DR) technique, thereby avoiding the linear dependence on the dimension of the feature vectors. The DR estimation in the framework of linear contextual bandits was first introduced by Kim & Paik (2019) and Dimakopoulou et al. (2019), and subsequent studies improve the regret bound in this problem setting by a factor of

d (Kim et al., 2021; 2023b). A recent application (Kim et al., 2023c) achieves a regret bound of order O( d T log T) under IID features over rounds. However, the extension to non-stochastic or non-IID features remains an open question. To address this issue, we develop a novel analysis that applies the DR estimation to non-stochastic features, achieving a regret bound sublinear with respect to the dimension of the augmented feature vectors. Furthermore, we extend DR estimation to handle sparse parameters, thereby further improving the regret bound to be sublinear in the reduced dimension.

Our problem is more general and challenging than misspecified linear bandits, where the assumed reward model fails to accurately reflect the true reward, such as when the true reward function is non-linear (Lattimore & Szepesvári, 2020), or when a deviation term is added to the reward model (Ghosh et al., 2017; Bogunovic et al., 2021; He et al., 2022). While our work assumes that the misspecified (or inaccessible) portion of the reward is linearly related to certain unobserved features, misspecified linear bandit problems can be reformulated as a special case of our framework. While the regret bounds in Lattimore & Szepesvári (2020), Bogunovic et al. (2021) and He et al. (2022) incorporate the sum of misspecification errors that may accumulate over the decision horizon, our work establishes a regret bound that is sublinear in the decision horizon T and is not affected by misspecification errors. Ghosh et al. (2017) proposed a hypothesis test to decide between using linear bandits or MAB, demonstrating an O(K

T log T) regret bound when the total misspecification error exceeds Ω(d

T). In contrast, our algorithm achieves an O( p

(d + dh)T log T) regret bound without requiring hypothesis tests for misspecification or partial observability.

Last but not least, our problem appears similar to the literature addressing bandit problems with partially observable features (Tennenholtz et al., 2021; Park & Faradonbeh, 2022; 2024; Kim et al., 2023a; Zeng et al., 2025). In particular, Park & Faradonbeh (2022; 2024); Kim et al. (2023a) assume that the true features follow a specific distribution, typically Gaussian. Zeng et al. (2025) further assume that the true features evolve according to a linear dynamical system with additive Gaussian noise. Park & Faradonbeh (2022; 2024) and Zeng et al. (2025) construct the observed features as emissions from the true features via a known linear mapping, also corrupted by additive Gaussian noise, whereas Kim et al. (2023a) first corrupt the true features with Gaussian noise and then generate the observed features by masking elements of the corrupted features following an unknown Bernoulli distribution. In addition, all of these approaches aim to recover the true features: Park & Faradonbeh (2022; 2024) introduce a known decoder mapping from the observed features to the corresponding latent features; Kim et al. (2023a) leverage a Bayesian oracle strategy for estimation; and Zeng et al. (2025) estimate the true features using a Kalman filter. In contrast, our setting imposes no structural assumptions on either the observed or latent features, making the problem more general and challenging than those addressed in the aforementioned works. Furthermore, our approach does not attempt to recover any information related to latent features. Instead, we compensate for the lack of reward information due to unobserved features, in the sense that we project the inaccessible portion of the reward onto the orthogonal complement of the row space spanned by the observed feature vectors.

On the other hand, Tennenholtz et al. (2021) assume that partially observed features are available as an offline dataset, and leverage the dataset to recover up to L dimensions of the d-dimensional true features, where L d is the dimension of the partially observed features. This setting is different from ours in which partial observability arises naturally and no offline access is available. In their framework, the correlation between the observed and unobserved features is used to model the

Linear Bandits with Partially Observable Features

relationship between the estimator based on the observed features and that based on the true features, which requires further estimation of the unknown correlation. In contrast, our method is agnostic to such correlation, which makes our approach more practical. Moreover, we exploit the observed features via feature augmentation and DR estimation, resulting in a faster convergence rate of regret compared to their UCB-based approach.

B. Detailed Experimental Setup

For both scenarios, the features including the true features za, observed features xa, and unobserved features ua are constructed differently based on the relationship between the row space spanned by the observed features, R(X), and the row space spanned by the unobserved features, R(U). In Case 1, the general case, the true features za for each arm a [K] are sampled from N(0, Idz), and the observed features xa are obtained by truncating the first d elements of za, following the definition given in Eq. (1).

In Cases 2 and 3, on the other hand, the features are generated in a way to explicitly reflect the inclusion relationship between R(X) and R(U). Particularly, in Case 2, where R(U) R(X), the observed features xa are sampled from N(0, Id) for each a [K]. Then, we generate a coefficient matrix Cu Rdu d, where each element is sampled from Unif( 1, 1), and construct ua by computing ua = Cuxa. This construction ensures that R(U) lies within R(X). In Case 3, where R(X) R(U), we reverse the process by sampling ua N(0, Idu), generating Cx Rd du from Unif( 1, 1), and we set xa = Cxua, thereby ensuring R(X) R(U). In both Case 2 and Case 3, the true features za are formed by concatenating xa and ua.

After the construction of the features, the orthogonal basis vectors {b 1 , . . . , b K d} are derived via singular value decomposition (SVD) on the observed feature matrix X, ensuring orthogonality to R(X). These basis vectors are linearly concatenated to X to form the augmented feature matrix. The reward parameter θ Rdz is sampled from the uniform distribution Unif( 1/2, 1/2), and the rewards are generated via dot products following the definition Eq. (2). The coupling probability p, a hyperparameter used in the sampling distribution of eat, is set to 0.6 (see Eq. (9)). The confidence parameter δ, which is also a hyperparameter, is set to 10 4, and the total decision horizon is T = 1200.

Throughout the experiments, we fix the number of arms at K = 30 and the dimensionality of the true features dz = 35. Furthermore, to accommodate both partial and full observability, we set dz d. Specifically, in Scenario 1, d is set to dz/2 = 17, indicating that only about half of the full feature space is observable to the agent. In Scenario 2, on the other hand, we set d = 2K = 60 and dz = d, implying that latent features are absent. The setup of the second scenario also allows us to evaluate our algorithm in the case where d > K. For each of the three structural cases (i.e., the relationships between R(X) and R(U)) considered under both scenarios, we conduct five independent trials using different random seeds. The results are presented in terms of the sample mean and one standard deviation of the cumulative regret.

C. Robust to Latent Feature Algorithm with Ridge Estimator

Our Doubly robust (DR) ridge estimator is defined as follows:

τ=1 I(Mτ) X

a [K] exaex a + IK

τ=1 I(Mτ) X

a [K] exaeya,τ

where eya,τ is the DR pseudo reward:

eya,t := ex a ˇµR t + I(eat = a)

ya,t ex a ˇµR t ,

and the imputation estimator ˇµR t is defined as

τ=1 exaτ ex aτ + p IK

τ=1 exaτ yaτ ,τ

The following theorem shows that this Ridge estimator is consistent, meaning it converges to the true parameter µ with high probability as the agent interacts with the environment.

Linear Bandits with Partially Observable Features

Algorithm 2 Robust to Latent Feature with Ridge Estimator (Ro LF-Ridge)

1: INPUT: Observed features {xa : a [K]}, coupling probability p (1/2, 1), confidence parameter δ > 0. 2: Initialize bµ0 = 0K, exploration phase Et = and exploration factor Ce := 32(1 p) 2K2. 3: Find orthogonal basis {b 1 , . . . , b K d} R(X) and construct {exa : a [K]}. 4: for t = 1, . . . , T do 5: if |Et| Ce log(2Kt2/δ) then 6: Randomly sample at uniformly over [K] and Et = Et 1 {t}. 7: else 8: Compute bat := arg maxa [K] ex a bµR t 1. 9: end if 10: while eat = at and count ρt do 11: Sample at with P(at = bat) = 1 (t 1/2) and P(at = k) = t 1/2/(K 1), k = bat. 12: Sample eat according to Eq. (9). 13: count = count + 1. 14: end while 15: Play at and observe yat,t. 16: if eat = at then 17: Set bµR t := bµR t 1. 18: else 19: Update bµR t following Eq. (12) with eya,t and update ˇµR t following Eq. (13). 20: end if 21: end for

Theorem 4 (Consistency of the DR Ridge estimator). For each t, let Et [t] denote an exploration phase such that for each τ Et the action aτ is sampled uniformly over [K]. Assume that µ 1 and exa 1 for all a [K]. Then with probability at least 1 3δ,

max a [K] |ex a (bµR t µ )| 2

K log t + 1

for all rounds t such that |Et| 32(1 p) 2K2 log(2Kt2/δ).

With |Et| = O(K2 log Kt) exploration rounds, the DR Ridge estimator achieves an O( p

K/t) convergence rate over all K rewards. This is possible because the DR pseudo-rewards defined in Eq. (10) impute the missing rewards for all arms a [K] using ex a ˇµt, based on the samples collected during the exploration phase, Et. With this convergence guarantee, we establish a regret bound for Ro LF-Ridge, which is an adaptation of Algorithm 1 using the Ridge estimator.

Theorem 5 (Regret bound for Ridge Ro LF). Suppose µ 1 and za 1 for all a [K]. For δ (0, 1), with probability at least 1 4δ, the cumulative regret of the proposed algorithm using the DR Ridge estimator is bounded by

Reg(T) 32K2

(1 p)2 log 2d T 2

T K 1 + 4δ + 8

The first and second terms come from the distribution of at, which is a combination of the 1 t 1/2-greedy policy and resampling up to ρt := log((t + 1)2/δ)/ log(1/p) trials. The third term is determined by the size of the exploration set, Et, while the last term arises from the estimation error bounded by the DR estimator as described in Theorem 4. The hyperparameter p (1/2, 1) balances the size of the exploration set in the third term and the estimation error in the last term. Overall, the regret is O( KT log T), which shows a significant improvement compared to the regret lower bound in Theorem 1 for any linear bandit algorithm that does not account for unobserved features and unobserved rewards.

D. A Modified Algorithm for Time-Varying Observed Features

In this section, we define the problem of linear bandits with partially observable features under a setting where the observed features vary over time, describe our proposed method, and provide theoretical guarantees.

Linear Bandits with Partially Observable Features

D.1. Problem Formulation

Let x1,t, . . . , x K,t denote the observed features and u1, . . . , u K denote the unobserved features. Now the observed features arbitrarily change over t but the unobserved features are fixed over time. When the algorithm selects an arm at, the reward is

yat,t = xat,t, θ(o) + uat, θ(u) + ϵt,

where ϵt is Sub-Gaussian noise that follows Assumption 1. The expected reward of each arm is stable over time, where MAB algorithms without using features are applicable for achieving a e O(

KT) regret bound. When the observed features vary over time, the expected reward of each arm E[yat,t] = xat,t, θ(o) + uat, θ(u) also arbitrarily changes over time, and MAB algorithms suffer regret linear in T. To our knowledge, there is no other work that addresses this challenging setting.

D.2. Proposed Method: Orthogonal Basis Augmentation

We address the problem by augmenting standard basis e1, . . . , e K in RK to estimate bias caused by the unobserved features. Let exa,t := e a [Xt e1 e K] Rd+K and let a := ua, θ(u) denote the bias that stems from the latent features. Then,

ya,t = x a,t, θ(o) + u a,t, θ(u) + ϵt

= e a [Xt e1 . . . e K], [θ(o) 1 . . . K] + ϵt.

Therefore, applying the Ro LF-Ridge algorithm to the augmented features exa,t := e a [Xt e1 . . . e K] yields the following regret bound.

Theorem 6 (Regret bound for Ridge-Ro LF-V with time-varying observed features). If the observed features are vary over time, then for any δ (0, 1), with probability at least 1 4δ, the cumulative regret of the proposed algorithm Ridge-Ro LF-V using DR Ridge estimator is bounded by

Reg(T) 4δ + 2

T d + K 1 + 32(K + d)2

(1 p)2 log 2(K + d)T 2

The proof follows similar arguments given in Theorem 5 and is therefore omitted. The resulting regret bound achieves a rate of e O( p

(d + K)T), which, to the best of our knowledge, is the first sublinear regret guarantee for partially observable linear bandits (as well as misspecified linear bandits) with arbitrarily time-varying observed features.

E. Missing Proofs

E.1. Proof of Theorem 1

Throughout this paper, we consider a bandit problem where the agent observes only a subset of the reward-generating feature vector and cannot access or estimate the unobserved portion. If the agent uses online decision-making algorithms that rely solely on observed features, as defined in Definition 1, the resulting issue can be interpreted as a model misspecification. Therefore, in this theorem, we present a problem instance where misspecified algorithms, considering only observed features, may incur regret that grows linearly in T.

Following the statement of Theorem 1, we assume that d = du = 1, which means dz = 2. Given the true feature set Z = {[1, 3] , [2, 19/4] }, let the first element of each vector is observed to the agent; while the second element remains unobserved. This results in x1 = x1 = 1, x2 = x2 = 2, u1 = u1 = 3 and u2 = u2 = 19/4. We set the true parameter as θ R2 = [2, 1] , meaning θ(o) = θ(o) = 2 and θ(u) = θ(u) = 1. Using the reward function from Section 3.2 and considering Assumption 1, the expected reward for each arm is given by

γi := E[yi] = z i θ = xiθ(o) + uiθ(u) i {1, 2}.

Plugging the values in, the true mean reward for each arm is directly computed as γ1 = 2 3 = 1 and γ2 = 4 19/4 = 3/4, which satisfies the assumption that its absolute value does not exceed 1 (Section 3.2), and since γ1 < γ2, arm 2 is the

Linear Bandits with Partially Observable Features

optimal action. We further assume that the total learning horizon T > 256σ2 max{log 2, log(1/δ )} for any δ (0, 1/2), where σ is the sub-Gaussian parameter of the reward noise.

For brevity, we denote the latent reward components as g1 := u1θ(u) and g2 := u2θ(u) , yielding g1 = 3 and g2 = 19/4. Since γ2 = 2γ1 and |gi| 3 > 0 for all i {1, 2}, our problem setup satisfies the large deviation criterion in Definition 1 and Theorem 2 of Ghosh et al. (2017), with l = 3 and β = 0. Applying the theorem, it follows that OFUL (Abbasi-yadkori et al., 2011) suffers linear regret in this problem instance, i.e., Ω(T). Motivated by this result, we further show that Lin TS (Agrawal & Goyal, 2013) is similarly affected.

For each round t [T], Lin TS estimates the true parameter using the ridge estimator, given by:

bθt = (X t Xt + λId) 1(X t Yt)

= (X t Xt + λId) 1(X t (Xtθ(o) + gt + ϵt))

= θ(o) λV 1 t θ(o) + V 1 t X t gt + V 1 t X t ϵt, (14)

where Xt := (x a1, . . . , x at) Rt d is a matrix containing features chosen up to round t, Yt := (ya1, . . . , yat) Rt

is a vector of observed rewards, and ϵt := (ϵ1, . . . , ϵt) Rt contains noise attached to each reward. Unlike a typical ridge estimator, here the term gt := (ga1, . . . , gat) Rt, the vector containing the latent portion of observed rewards is introduced due to model misspecification. Note that Vt := (X t Xt + λId) 0.

For this problem instance, since d = 1, Eq. (14) is equivalent to:

bθt = θ(o) θ(o) Pt τ=1 x2aτ + 1 + Pt τ=1 xaτ gaτ Pt τ=1 x2aτ + 1 + Pt τ=1 xaτ ϵτ Pt τ=1 x2aτ + 1 ,

where we assume λ = 1. Note that we also denote bθt and θ(o) by bθt and θ(o) , respectively, since both are scalars. Hence, the estimation error is computed as:

bθt θ(o) = θ(o) Pt τ=1 x2aτ + 1 + Pt τ=1 xaτ gaτ Pt τ=1 x2aτ + 1 + Pt τ=1 xaτ ϵτ Pt τ=1 x2aτ + 1 . (15)

Let N1 and N2 denote the number of times arms 1 and 2 have been played up to round t, respectively. This implies that N1 + N2 = t. Then, for the numerator of the second term, since

τ=1 xaτ gaτ = (g1 + + g1 | {z } N1

+ 2g2 + + 2g2 | {z } N2

) = g1N1 + 2g2N2,

we can observe that

τ=1 xaτ gaτ = g1N1 + 2g2N2

g N1 + 2g N2 = g N1 + 2g(t N1)

= 2gt g N1 gt ( N1 t)

where g = min{g1, g2} = 19/4, which implies that Pt τ=1 xaτ gaτ = Θ(t). For the denominator, Pt τ=1 x2 aτ + 1, which grows at a rate of O(t), implying that the second term of the right-hand side in Eq. (15) is Θ(1), and that bθt is not consistent since it does not converge to θ(o) as t .

For arm 2, which is optimal, to be selected in round t + 1 under Lin TS, the condition x2eθt x1eθt must hold, where eθt N bθt, v2(Pt τ=1 x2 aτ + 1) 1 . Given the assumptions that x1 = 1 and x2 = 2, arm 2 is selected whenever eθt 0.

Linear Bandits with Partially Observable Features

Thus, for arm 1 to be chosen, we require eθt < 0. We will show that the probability of eθt < 0 does not diminish sufficiently to be ignored, even when the agent plays for a sufficiently large amount of time. To clarify, let us define the two events Eeθ := {eθt 0} and Eˆθ := {bθt 0}. We revisit Eq. (14) as follows:

bθt = θ(o) Pt τ=1 x2 aτ Pt τ=1 x2aτ + 1 + Pt τ=1 xaτ gaτ Pt τ=1 x2aτ + 1 + Pt τ=1 xaτ ϵτ Pt τ=1 x2aτ + 1

θ(o) + g1N1 + 2g2N2

N1 + 4N2 + 1 + Pt τ=1 xaτ ϵτ N1 + 4N2 + 1 ( θ(o) > 0)

θ(o) + g1N1 + 2g2N2

N1 + 4N2 + 1 + 2 N1 + 4N2 + 1

τ=1 ϵτ, (16)

where the last inequality holds since Pt τ=1 xaτ ϵτ maxa {1,2} xa Pt τ=1 ϵτ and maxa {1,2} xa = 2. For the second term of Eq. (16), for t 19,

g1N1 + 2g2N2

N1 + 4N2 + 1 = 3N1 19

2 N2 N1 + 4N2 + 1

8 (N1 + 4N2) N1 + 4N2 + 1

8 + 19/8 t + 3N2 + 1

which is upper bounded by 9/4. This bound is followed by:

bθt θ(o) + g1N1 + 2g2N2

N1 + 4N2 + 1 + 2 N1 + 4N2 + 1

4 + 2 t + 3N2 + 1

Thus, we have the following:

P(bθt > 0) P

Since ϵτ is an IID sub-Gaussian random variable for all τ [t], by applying Hoeffding inequality we obtain: P(bθt > 0) exp t/128σ2 . Given this, we now bound the probability of the event Eeθ:

P(Eeθ) = P(Eeθ Eˆθ) + P(Eeθ Ec ˆθ)

= P(Eeθ | Eˆθ) P(Eˆθ) + P(Eeθ | Ec ˆθ) P(Ec ˆθ)

= P(eθt 0 | bθt 0) P(bθt 0) + P(eθt 0 | bθt < 0) P(bθt < 0)

exp t 128σ2

+ P(eθt 0 | bθt < 0). (18)

Since the second term of Eq. (18) is calculated under a Gaussian distribution, thus its value does not exceed 1/2 for all t [T], it follows that P(Ec eθ) 1/2 exp t/128σ2 . Note that the total decision horizon T > 256σ2 log(1/δ ), thus

Linear Bandits with Partially Observable Features

for any t > 128σ2 log(1/δ ), we have P(Ec eθ) 1/2 δ . This implies that for rounds beyond T/2, the suboptimal arm will be played at least (1/2 δ )T/2 times for any δ (0, 1/2), thus incurring

E[Reg Lin TS(T)] (γ2 γ1) 1

For OFUL, we also present another analysis that does not require the assumption that the suboptimal arm is played for initial t rounds, which is taken in Theorem 2 of Ghosh et al. (2017). The optimal arm, arm 2, is selected when

x2bθt + x2 q

1 + Pt 1 τ=1 x2aτ

> x1bθt + x1 q

1 + Pt 1 τ=1 x2aτ

where bθt is the same ridge estimator as in Lin TS. The inequality Eq. (19) is equivalent to bθt > (1 + Pt 1 τ=1 x2 aτ ) 1/2, which implies bθt > 1/

2t. By Eq. (17),

exp t 128σ2

Thus, for t 128σ2 log 2, the probability of selecting arm 2 is less than 1/2 and for T > 256σ2 log 2,

E[Reg OFUL(T)] (γ2 γ1) 1

and the algorithm suffers expected regret linear in T.

E.2. Proof of Theorem 2

Let Vt := Pt τ=1 P

a [K] exaex a . Then

max a [K] |ex a (bµL t µ )| s X

a [K] |ex a (bµL t µ )|2 = t 1/2 bµL t µ Vt.

To use Lemma 3, we prove a bound for Pt τ=1 P

a [K] eya,τ ex a µt exa . By definition of eya,τ,

eya,τ ex a µ exa

1 I(eaτ = a)

exaex a ˇµL t µ + I(eaτ = a)

ya,τ ex a µ exa

1 I(eaτ = a)

exaex a ˇµL t µ

ya,τ ex a µ exa

Linear Bandits with Partially Observable Features

With probability at least 1 δ/(τ + 1)2, the event Mτ happens for all τ 1 and we obtain a pair of matching sample eaτ and aτ. Under Mτ, the second term in Eq. (20) is equal to,

ya,τ ex a µ exa

ya,τ ex a µ exa

τ=1 ϵτ exaτ

Because v = maxi [d] |e i v| for any v Rd,

τ=1 ϵτ exaτ

p max a [K]

τ=1 ϵτe a exaτ

Applying Lemma 1, with probability at least 1 δ/t2,

τ=1 ϵτe a exaτ

max a [K] σ

τ=1 (e a exaτ )2 log 2Kt2

= max a [K] σ

v u u t2e a

τ=1 exaτ ex aτ

ea log 2Kt2

max a [K] σ

2e a Vtea log 2Kt2

2t log 2Kt2

where the last equality follows by the definition eσ2 max = maxa [K] e a (P

a [K] exaex a )ea. Thus,

τ=1 ϵτ exaτ

2t log 2Kt2

Now we turn to the first term in Eq. (20). Define At := Pt τ=1 P

a [K] I(eaτ = a)ϕ 1 a,τ exaex a . Then the first term is rearranged as

1 I(eaτ = a)

exaex a ˇµL t µ

= (Vt At) ˇµL t µ . (22)

Since v = maxi [d] |e i v| for any v Rd, (Vt At) ˇµL t µ = max a [K] |e a (Vt At) ˇµL t µ |

e a (Vt At) A 1/2 t 2

ˇµL t µ At .

Because ˇµL t is a minimizer of Eq. (8), by Lemma 3 and Eq. (21),

ˇµL t µ At 4σeσmax

2t(d + dh) log(2Kt2/δ)

λmin (At) .

By Corollary 1, with ϵ (0, 1) to be determined later, for t 8ϵ 2(1 p) 2K2 log(2Kt2/δ), with probability at least 1 δ/t2, IK V 1/2 t At V 1/2 t 2 ϵ. (23)

Linear Bandits with Partially Observable Features

Eq. (23) implies (1 ϵ)IK V 1/2 t At V 1/2 t and (1 ϵ)Vt At. Thus,

ˇµL t µ At 4σeσmax

2t(d + dh) log(2Kt2/δ)

(1 ϵ)λmin (Vt) = 4σeσmax

2(d + dh) log(2Kt2/δ)

and Eq. (22) is bounded by,

(Vt At) ˇµL t µ max i [K]

e i (Vt At) A 1/2 t 2 4σeσmax

2(d + dh) log(2Kt2/δ)

e i (Vt At) V 1/2 t 2 4σeσmax p(1 ϵ)eσmin

2(d + dh) log 2Kt2

For the first term in Eq. (24), we have that

e i (Vt At) V 1/2 t 2 = max i [K]

e i V1/2 t V 1/2 t (Vt At) V 1/2 t 2

t IK V 1/2 t At V 1/2 t 2 eσmax

where the last inequality holds due to Eq. (23). Combining this result with Eq. (24), we obtain that

(Vt At) ˇµL t µ 4ϵσeσ2 max p(1 ϵ)eσmin

2t(d + dh) log 2Kt2

Setting ϵ = K 1/2eσmin/8eσmax yields that 1 ϵ 1/2 and that

(Vt At) ˇµL t µ σeσmax

2t log 2Kt2

Now we conclude that for t 83K3eσ 2 mineσ2 max(1 p) 2 log(2Kt2/δ),

a [K] (eya,τ ex a µt)exa

2t log 2Kt2

by Lemma 3 and taking a union bound on the both terms in Eq. (20), with probability at least 1 2δ/t2,

bµL t µ Vt 8σeσmax

2t(d + dh) log(2Kt2/δ)

λmin (Vt) = 8σeσmax

2(d + dh) log 2Kt2

which completes the proof.

E.3. Proof of Theorem 3

Since we have shown that the reward defined in Eq. (2) is equivalent to its form obtained via the projection and augmentation strategy in Eq. (7), we henceforth calculate the regret bound based on Eq. (7). Under the assumption that the expected reward is bounded by 1 (Section 3.2), the instantaneous regret Et 1[y ,t] Et 1[yat,t] is bounded above by 2 for any t [T], and that the number of rounds for the exploration phase satisfies |ET | 83K3(1 p) 2 log(2KT 2/δ). Given these bounds, the

Linear Bandits with Partially Observable Features

cumulative regret is bounded as:

t=1 (Et 1[y ,t] Et 1[yat,t])

t ET (Et 1[y ,t] Et 1[yat,t]) + X

t [T ]\ET (Et 1[y ,t] Et 1[yat,t])

2 83K3(1 p) 2 log 2KT 2

t [T ]\ET (Et 1[y ,t] Et 1[yat,t])

= 2 83K3(1 p) 2 log 2KT 2

t [T ]\ET {I(at = bat)(Et 1[y ,t] Et 1[yat,t])}

t [T ]\ET {I(at = bat)(Et 1[y ,t] Et 1[yat,t])} . (25)

We first consider the second term of Eq. (25). By Theorem 2, on the event {at = bat},

Et 1[y ,t] Et 1[yat,t] = ex a µ ex batµ = ex a µ + ex a bµL t 1 ex a bµL t 1 + ex bat bµL t 1 ex bat bµL t 1 ex batµ

ex a (µ bµL t 1) + ex bat(µ bµL t 1) + ex a bµL t 1 ex bat bµL t 1

2 max a [K]

ex a (µ bµL t 1) + ex a bµL t 1 ex bat bµL t 1

2 max a [K]

ex a (µ bµL t 1)

with probability at least 1 2δ/t2 for each t [T] \ ET . Summing over t and applying a union bound, we obtain that, with probability at least 1 4δ,

t [T ]\ET {I(at = bat) (Et 1[y ,t] Et 1[yat,t])} 32σeσmax

2(d + dh)T log 2KT 2

For the last term of Eq. (25), X

t [T ]\ET {I(at = bat) (Et 1[y ,t] Et 1[yat,t])}

t [T ] I(at = bat) P (at = bat) + P (at = bat)

( Et 1[y ,t] Et 1[yat,t] 2)

T K 1 + 4δ, (27)

where the first term in Eq. (27) holds with probability at least 1 δ by Hoeffding s inequality, and the remaining terms follow from Lemma 5. Taking a union bound over Eq. (26), Eq. (27), and Mt, we obtain, with probability at least 1 6δ,

Reg(T) 2 83K3(1 p) 2 log 2KT 2

δ + 4δ + 32σeσmax

2(d + dh)T log 2KT 2

which concludes the proof.

Linear Bandits with Partially Observable Features

E.4. Proof of Theorem 4

Let e Vt := Pt τ=1 I(Mτ) P

a [K] exaex a + IK and Vt := Pt τ=1 P

a [K] exaex a + IK. By definition of bµR t presented in Eq. (12),

ex a (bµR t µ ) = ex a e V 1 t

τ=1 I(Mτ) X

a [K] exa eya,τ ex a µ µ

By definition of the pseudo-rewards,

eya,τ ex a µ = 1 I(eaτ = a)

ex a ˇµR t µ + I(eaτ = a)

Let e At := Pt τ=1 I(Mτ) P

a [K] I(eaτ =a)

ϕa,t exaex a + IK and At := Pt τ=1 P

a [K] I(eaτ =a)

ϕa,t exaex a + IK Then,

ex a (bµR t µ ) = ex a e V 1 t

e Vt e At ˇµR t µ +

τ=1 I(Mτ) X

ϕa,τ exaϵτ µ

By definition of the imputation estimator ˇµt,

τ=1 exaτ ex aτ + p IK

τ=1 exaτ ϵτ pµ

1 ϕaτ ,τ exaτ ex aτ + IK

1 ϕaτ ,τ exaτ ϵτ µ

ϕa,τ exaτ ex aτ + IK

1 pexaτ ϵτ µ

where the second equality holds because ϕaτ ,τ = p and the coupling event t τ=1Mτ. Thus,

ˇµR t µ = A 1 t

1 pexaτ ϵτ µ

Plugging in (28),

ex a (bµR t µ ) = ex a e V 1 t

e Vt e At A 1 t

1 pexaτ ϵτ µ

τ=1 I(Mτ) X

ϕa,τ exaϵτ µ

Under the coupling event, e At = At, e Vt = Vt and P

a [K] ϕ 1 a,τI(eaτ = a)exa = exaτ /p. It follows that

ex a (bµR t µ ) = ex a V 1 t (Vt At) A 1 t + IK t X

1 pexaτ ϵτ µ

= ex a V 1/2 t V1/2 t A 1 t V1/2 t V 1/2 t

1 pexaτ ϵτ µ

Taking absolute value on both sides, by Cauchy-Schwarz inequality,

max a [K] |ex a (bµR t µ )| max a [K] exa V 1 t V1/2 t A 1 t V1/2 t 2

1 pexaτ ϵτ µ

Linear Bandits with Partially Observable Features

Corollary 1 implies that, for t 8ϵ 2(1 p) 2K2 log(2Kt2/δ), with ϵ to be determined later,

IK V 1/2 t At V 1/2 t ϵIK,

with probability at least 1 δ for all t 1. Rearranging terms gives

V1/2 t A 1 t V1/2 t (1 ϵ) 1IK.

Combining this with Eq. (29),

max a [K] |ex a (bµR t µ )| maxa [K] exa V 1 t 1 ϵ

1 pexaτ ϵτ µ

maxa [K] exa V 1 t 1 ϵ

τ=1 exaτ ϵτ

where the last inequality follows from the triangle inequality. Note that Vt is deterministic, both Vt and Pt τ=1 exaτ ex aτ +IK are positive definite, and Vt Pt τ=1 exaτ ex aτ + IK. Then, by Lemma 9 of (Abbasi-yadkori et al., 2011), with probability at least 1 δ,

τ=1 exaτ ϵτ

τ=1 exaτ ϵτ

( Pt τ=1 exaτ ex aτ +IK) 1

2 log det(Pt τ=1 exaτ ex aτ + IK)1/2

log det(Pt τ=1 exaτ ex aτ + IK) δ

Applying Lemma 10 of Abbasi-yadkori et al. (2011) yields

τ=1 exaτ ex aτ + IK

Tr Pt τ=1 exaτ ex aτ + K

t maxa [K] exaτ 2 + K

for all t 1, where the last inequality holds by exaτ 2

K exaτ K. Hence,

τ=1 exaτ ϵτ

K log t + 1

which follows that,

max a [K] |ex a (bµR t µ )| maxa [K] exa V 1 t 1 ϵ

K log t + 1

δ + µ V 1 t

K log t + 1

δ + µ V 1 t

Since µ V 1 t µ 2

K, choosing ϵ = 1/2 completes the proof.

Linear Bandits with Partially Observable Features

E.5. Proof of Theorem 5

Similar to the proof of Theorem 3 (Appendix E.3), the instantaneous regret is bounded above by 2 for any t [T], and the number of rounds for the exploration phase satisfies |ET | 32(1 p) 2K2 log(2d T 2/δ). Consequently, the cumulative regret is bounded above as follows:

Reg(T) 32(1 p) 2K2 log 2d T 2

t [T ]\ET Et 1[y ,t] Et 1[yat,t]

= 32(1 p) 2K2 log 2d T 2

t [T ]\ET {I(at = bat) (Et 1[y ,t] Et 1[yat,t])}

t [T ]\ET {I(at = bat) (Et 1[y ,t] Et 1[yat,t])} . (30)

We first consider the second term in Eq. (30). On the event {at = bat},

Et 1[y ,t] Et 1[yat,t] = ex a µ ex batµ = ex a µ + ex a bµR t 1 ex a bµR t 1 + ex bat bµR t 1 ex bat bµR t 1 ex batµ

ex a (µ bµR t 1) + ex bat(µ bµR t 1) + ex a bµR t 1 ex bat bµR t 1

2 max a [K]

ex a µ bµR t 1 + ex a bµR t 1 ex bat bµR t 1

2 max a [K]

ex a µ bµR t 1

with probability at least 1 3δ, for all t such that t |Et| (Theorem 4). Summing over t gives, with probability at least 1 3δ, X

t [T ]\ET {I(at = bat) (Et 1[y ,t] Et 1[yat,t])} 8

For the last term in Eq. (30), we have that X

t [T ]\ET {I(at = bat) (Et 1[y ,t] Et 1[yat,t])}

t [T ] I(at = bat) P (at = bat) + P (at = bat)

T K 1 + 4δ, (32)

where the first term in Eq. (32) holds with probability at least 1 δ by Hoeffding s inequality, and the remaining terms follow from Lemma 5. Finally, by taking a union bound over Eq. (31) and Eq. (32), we obtain, with probability at least 1 4δ,

Reg(T) 32K2

(1 p)2 log 2d T 2

T K 1 + 4δ + 8

which completes the proof.

F. Algorithm-Agnostic Lower Bound of Regret Ignoring Unobserved Features

In this section, extending our argument in Theorem 1, we show that there exists a problem instance such that linear bandit algorithms relying solely on observed features can incur regret that grows linearly in T. We begin by formally defining such algorithms.

Linear Bandits with Partially Observable Features

Definition 1 (Policy dependent on observed features). For each t [T], let πt : Rd Rt 1 [0, 1] be a policy that maps an observed feature vector x {xa : a [K]}, given past reward observations {yas,s : s [t 1]}, to a probability of selection. Then the policy πt is dependent only on observed features if, for any ya1,1, . . . , yat 1,t 1, it holds that x1 = x2 implies πt(x1|ya1,1, . . . , yat 1,t 1) = πt(x2|ya1,1, . . . , yat 1,t 1).

For instance, the UCB and Thompson sampling-based policies for linear bandits (with observed features), considered in Theorem 1, satisfy Definition 1, as they assign the same selection probability as long as the observed features are the same. In contrast, policies in MAB algorithms (which disregard observed features) may assign different selection probabilities even when the observed features are equal, and thus are not dependent on the observed features. In the theorem below, we particularly provide a lower bound for algorithms that employ policies that are dependent on the observed features.

Theorem 7 (Regret Lower Bound under Policies Dependent on Observed Features). For any algorithm Π := (π1, . . . , πT ) that consists of the policies {πt : t [T]} that are dependent on observed features, there exists a set of features {z1, . . . , z K} and a parameter θ Rdz such that the cumulative regret

RegΠ(T, θ , z1, . . . , z K) T

Proof. We start the proof by providing a detailed account of the scenario described in the theorem. Without loss of generality, we consider the case where K = 3. As stated in the theorem, a represents the index of the optimal action when considering the entire reward, including both observed and latent components. In contrast, ao denotes the index of the optimal action when considering only the observed components. We introduce an additional notation, a , which refers to an action whose observed features are identical to those of a , but with a distinct latent component. Specifically, this implies that a = a and za = za , but xa = xa . By definition of the policy πt that depends on the observed features, πt(xa ) = πt(xa ) and the probability of selecting the optimal arm is πt(xa ) 1/2.

Taking this scenario into account, the observed part of the features associated with a , a , and ao are defined as follows:

2, . . . , 1

2, . . . , 1

2, . . . , 1

Additionally, we define the unobserved feature vectors for actions a , a , and ao as follows:

ua := [1, . . . , 1] , ua := [ 1, . . . , 1] , uao := [ 1, . . . , 1, 1, . . . , 1] ,

where in uao, the number of 1 s and -1 s are equal. This ensures that the scenario aligns with the assumption imposed on the feature vectors throughout this paper. We further define the true parameter as follows:

3d, . . . , 1

3d, 2 3du , . . . , 2 3du

thus it follows that θ(o) = [1/3d, . . . , 1/3d] Rd and θ(u) = [2/3du, . . . , 2/3du] Rdu. Note that it is straightforward to verify that | za, θ | 1, thereby satisfying the assumption on the mean reward (Section 3.2). With this established, we can also observe that the expected rewards for the three actions are defined as:

za , θ = xa , θ(o) + ua , θ(u) = 1

za , θ = xa , θ(o) + ua , θ(u) = 1

zao, θ = xao, θ(o) + uao, θ(u) = 1

respectively, and it is straightforward to verify that za , θ zao, θ = 2/3 > 0 and that za , θ za , θ = 4/3 > 0, which confirms that a is optimal when considering the full feature set.

At each round t [T] for any policy πt satisfying Definition 1, we have

πt(xa |ya1,1, . . . , yat 1,t 1) = πt(xa |ya1,1, . . . , yat 1,t 1),

Linear Bandits with Partially Observable Features

which implies P(at = a ) = P(at = a ) and

P(at = a ) = 1 P(at = a ) P(at = ao) 1 P(at = a ) = 1 P(at = a ),

hence the probability of selecting an optimal arm cannot exceed 1/2. Thus, the expected regret,

RegΠ(T, θ , za , za , zao) 1

t=1 P(at = a ) T

which completes the proof.

G. Technical Lemmas

Lemma 1. (Exponential martingale inequality) If a martingale (Xt; t 0), adapted to filtration Ft, satisfies E[exp(λXt)|Ft 1] exp(λ2σ2 t /2) for some constant σt, for all t, then for any a 0,

P (|XT X0| a) 2 exp

2 PT t=1 σ2 t

Thus, with probability at least 1 δ,

t=1 σ2 t log 2

G.1. A Hoeffding bound for Matrices

Lemma 2. Let {Mτ : τ [t]} be a Rd d-valued stochastic process adapted to the filtration {Fτ : τ [t]}, i.e., Mτ is Fτmeasurable for τ [t]. Suppose that the matrix Mτ is symmetric and the eigenvalues of the difference Mτ E[Mτ |Fτ 1] lie in [ b, b] for some b > 0. Then for x > 0,

τ=1 Mτ E[Mτ |Fτ 1]

Proof. The proof adapts Hoeffding s inequality to a matrix stochastic process, following the argument of Tropp (2012). Let Dτ := Mτ E[Mτ |Fτ 1]. Then, for x > 0,

We bound the first term of Eq. (33); a similar argument applies to the second term. For any v > 0,

Since Pt τ=1 Dτ is a real symmetric matrix,

Linear Bandits with Partially Observable Features

where the last inequality holds since exp(v Pt τ=1 Dτ) has nonnegative eigenvalues. Taking expectations on both sides yields

τ=1 Dτ + log exp(v Dt)

By Lieb s theorem (Tropp, 2015), the map D 7 Tr exp(H + log D) is concave over the set of symmetric positive definite matrices for any symmetric positive definite H. Applying Jensen s inequality, we obtain

τ=1 Dτ + log exp(v Dt)

τ=1 Dτ + log E [exp(v Dt)| Ft 1]

By convexity of evx and Hoeffding s lemma, for all x [ b, b],

2b e vb + x + b

Since the eigenvalues of Dt lie within [ b, b], it follows that

E [exp(v Dt)| Ft 1] E e vb

2b (b Id Dt) + evb

2b (Dt + b Id) Ft 1

= e vb + evb

Now we recursively upper bound Eq. (34) as follows:

τ=1 Dτ + log E [exp(v Dt)| Ft 1]

τ=1 Dτ + (v2b2

τ=1 Dτ + (v2b2

2 )Id + log E [exp(v Dt 1)| Ft 2]

τ=1 Dτ + (2v2b2

Tr exp (tv2b2

= exp tv2b2

= d exp tv2b2

Linear Bandits with Partially Observable Features

Thus we have

d exp vx + tv2b2

Minimizing over v > 0 gives v = x/(tb2) and

which proves the lemma.

G.2. A Bound for the Gram Matrix

The Hoeffding bound for matrices (Lemma 2) implies the following bound for the two Gram matrices At := Pt τ=1 exaτ ex aτ and Vt := Pt τ=1 P

a [K] exaex a Corollary 1. For any ϵ (0, 1) and t 8ϵ 2(1 p) 2K2 log(2Kt2/δ), with probability at least 1 δ/t2, IK V 1/2 t At V 1/2 t 2 ϵ.

Proof. Note that

V 1/2 t At V 1/2 t IK = V 1/2 t

ϕa,τ 1 exaex a

and the martingale difference matrix for each τ [t],

ϕa,τ 1 V 1/2 t exaex a V 1/2 t

V 1/2 t exaex a V 1/2 t 2

1 p + K 2 max a [K]

V 1/2 t exaex a V 1/2 t 2

1 p max a [K] exa 2 V 1 t

Note that the second inequality holds under the assumption p (1/2, 1), which implies ϕ 1 a,τ (K 1)/(1 p), and the last inequality is obtained via the Sherman Morrison formula. By Lemma 2, for x > 0, we have

P V 1/2 t At V 1/2 t IK 2 > x 2K exp (1 p)2tx2

Setting x = ϵ (0, 1), for t 8ϵ 2(1 p) 2K2 log(2Kt2/δ) with probability at least 1 δ/t2, IK V 1/2 t At V 1/2 t 2 ϵ.

G.3. An error bound for the Lasso estimator

Lemma 3 (An error bound for the Lasso estimator with unrestricted minimum eigenvalue). Let xττ [t] [ 1, 1]d denote the covariates, and let yτ = x τ w + eτ for some w Rd and eτ R. For λ > 0, consider the estimator

bwt = argmin w

yτ x τ w 2 + λ w 1.

Linear Bandits with Partially Observable Features

Define S := i [d] : w(i) = 0 and Σt := Pt τ=1 xτx τ . Suppose that Σt has a strictly positive minimum eigenvalue and

that Pt τ=1 eτxτ λ/2. Then,

bwt w Σt 2λ p

λmin (Σt) .

Proof. The proof follows a similar structure to Lemma B.4 in Kim et al. (2024), but we provide a new argument for the unrestricted minimum eigenvalue condition. Let X t := (x1, . . . , xt) [ 1, 1]d t and e t := (e1, . . . , et) Rt. We denote by Xt(j) the j-th column of Xt and by bwt(j) the j-th entry of bwt. By definition of bwt,

Xt ( w bwt) + et 2 2 + λ bwt 1 et 2 2 + λ w 1,

which implies

Xt ( w bwt) 2 2 + λ bwt 1 2 (bwt w) X t et + λ w 1

2 bwt w 1 X t et + λ w 1 λ bwt w 1 + λ w 1,

where the last inequality uses the bound on λ. On the left hand side, by triangle inequality,

i S |bwt(i)| + X

i [d]\ S |bwt(i)|

i S |bwt(i)| X

i S |bwt(i) w(i)| + X

i [d]\ S | w(i)|

i S |bwt(i) w(i)| + X

i [d]\ S |bwt(i)|,

and for the right-hand side, bwt w 1 = X

i S |bwt(i) w(i)| + X

i [d]\ S |bwt(i)|.

Plugging in both sides and rearranging the terms,

Xt ( w bwt) 2 2 2λ X

i S |bwt(i) w(i)|. (35)

Because X t Xt is positive definite,

Xt ( w bwt) 2 2 λmin(X t Xt) X

i S |bwt(i) w(i)|2

λmin(X t Xt) | S|

i S |bwt(i) w(i)|

where the last inequality holds by Cauchy-Schwarz inequality. Plugging in Eq. (35) gives,

Xt ( w bwt) 2 2 2λ X

i S |bwt(i) w(i)|

| S| λmin(Σt) Xt ( w bwt) 2

2λ2| S| λmin(Σt) + 1

2 Xt ( w bwt) 2 2,

where the last inequality uses ab a2/2 + b2/2. Rearranging the terms,

Xt ( w bwt) 2 2 4λ2| S| λmin(Σt),

which proves the result.

Linear Bandits with Partially Observable Features

G.4. Eigenvalue bounds for the Gram matrix.

Lemma 4. For a [K], let exa := [x a , e a p1, , e a p K d] Rd denote augmented features. Then, an eigenvalue of P

a [K] exaex a is in the following interval

a [k] xax a

a [K] xax a

Proof. Let P := (p1, . . . , p K d) RK (K d). Because the columns in P are orthogonal each other and to x1, . . . , x K,

a [K] exaex a = P

a [K] xax a P

a [K] xae a P P

a [K] P eax a P P

a [K] xax a P

a [K] xae a P P

a [K] P eax a IK d

a [K] xax a P

a [K] Xeae a P P a [K] P eae a X IK d

a [K] xax a XP P X IK d

a [K] xax a O O IK d

Thus, for any λ R, det(P

a [K] exaex a λIK) = det(P

a [K] xax a λId)(1 λ)K d. Solving det(P

a [K] xax a λId)(1 λ)K d = 0 gives the eigenvalues and the lemma is proved.

G.5. A Bound for the Probability of Exploration

Lemma 5. For each t, let bat = argmaxa [K] ex a bµt denote the maximizing action based on the estimator bµt. Then the action at chosen by the resampling scheme in Algorithm 1 and Algorithm 2 satisfies,

t=1 P(at = bat) 2

T (K 1) + 2δ.

Proof. Since the algorithm resamples at most ρt times, for a fixed t [T],

P(at = ˆat) = P ({eat = at} {at = ˆat}) + P ({eat = at} {at = ˆat})

P ({eat = at} {at = ˆat}) + P(eat = at)

= P ({eat = at} {at = k}) + P(eat = at) | {z } Failure of resampling

for k = ˆat, (36)

where P(at = k) = t 1/2/(K 1) and P(eat = at) = p, defined by Algorithm 1 and Eq. (9), respectively. For the first term in Eq. (36), by applying the union bound, we obtain

P ({eat = at} {at = k}) = P

m=1 {Resampling success at trial m} {at = k}

m=1 P ({Resampling success at trial m} {at = k})

Linear Bandits with Partially Observable Features

which, combined with Eq. (36), gives

P(at = ˆat) 1

t(K 1) + P(Resampling failure)

By the definition of ρt, the probability that the resampling fails is bounded by δ/(t + 1)2. Thus, the probability of the event {at = bat} is

P(at = ˆat) 1

t(K 1) + δ (t + 1)2

Summing up over t [T] completes the proof.