On the Robustness of Text Vectorizers

Rémi Catellier 1 2, Samuel Vaiter 1 3, Damien Garreau 1 2

A fundamental issue in machine learning is the robustness of the model with respect to changes in the input. In natural language processing, models typically contain a first embedding layer, transforming a sequence of tokens into vector representations. While the robustness with respect to changes of continuous inputs is well-understood, the situation is less clear when considering discrete changes, for instance replacing a word by another in an input sentence. Our work formally proves that popular embedding schemes, such as concatenation, TF-IDF, and Paragraph Vector (a.k.a. doc2vec), exhibit robustness in the Hölder or Lipschitz sense with respect to the Hamming distance. We provide quantitative bounds for these schemes and demonstrate how the constants involved are affected by the length of the document. These findings are exemplified through a series of numerical examples.

1. Introduction
Recent advances in natural language processing (NLP) have exceeded all expectations. In particular, the advent of large language models such as BERT (Devlin et al., 2018) and GPT (Brown et al., 2020) is radically transforming the way we interact with computers. They typically rely on a deep neural network (DNN) architecture and are trained on a variety of tasks such as sentiment analysis, translation, and text summarization. A known issue with DNNs is the existence of adversarial examples: examples modified in order to radically change the output of the model. Initially popularized in the context of image classification (Szegedy et al., 2014), such examples also exist in NLP and a flourishing literature exists on this topic (Zhang et al., 2020). This problem has sparked a tremendous interest in the robustness of models with respect to small changes in the input.

1 Université Côte d'Azur, CNRS, LJAD, France 2 Inria, France 3 CNRS, France. Correspondence to: Damien Garreau. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In this paper, we focus on the robustness of the vectorization step of NLP pipelines: the transformation of the input document into a vector representation. We will consider documents as ordered sequences of tokens, not necessarily corresponding to words. For instance, GPT-2 uses Byte Pair Encoding (Gage, 1994; Sennrich et al., 2016), which relies on tokens corresponding to sub-words. As far as we can tell, there are essentially three main schools of thought when it comes to vectorization:
(i) concatenation of vectors corresponding to each token of the document. These vectors are often called word vectors when the tokens are individual words. They can either be one-hot representations of the tokens, or obtained by a mapping learned from data. A celebrated approach to produce word vectors is word2vec (Mikolov et al., 2013a;b), which transports semantic properties to the embedding space. Many other methods exist, such as GloVe (Pennington et al., 2014), EMF (Li et al., 2015), WordPiece (Wu et al., 2016), FastText (Bojanowski et al., 2017), and ELMo (Peters et al., 2018). Positional information is typically added to the token embeddings.
(ii) TF-IDF (term frequency - inverse document frequency), taking words as tokens and simply considering the frequencies of each individual word in the document.
These frequencies are reweighted by an overall importance term to take into account the lesser importance of frequently appearing words such as articles. This is the historical approach to text vectorization (Luhn, 1957; Jones, 1972). (iii) ad hoc approaches. Notably, Paragraph Vector (also known as doc2vec (Le & Mikolov, 2014)) extends the ideas of word2vec. Although we will focus on doc2vec in this work, we emphasize that there exists other ad hoc approaches, such as skip-thought vectors (Kiros et al., 2015), quick-thought (Logeswaran & Lee, 2018), or universal sentence encoder (Cer et al., 2018). A priori, vectorizers are not designed to be robust to small changes. Even when modifying a single word of the input document, the embedding could change drastically. Thus, we ask the following question: Are text vectorizers provably robust with respect to modifying a small subset of the document? On the Robustness of Text Vectorizers Typical notions of robustness in machine learning deals with continuous input data: changing slightly the observation means that for instance its ℓ2-norm evolves infinitesimally. The challenge of our analysis is the fundamentally discrete nature of text data. Changing a word in a document is usually not innocuous one can think of extreme cases where the meaning of this word is flipped and vectorizers sensitive to the semantics of input documents should capture this phenomenon. Nevertheless, we show that the answer is positive for all vectorizers that we study. Another difficulty is that the mathematical formalization of some of these vectorizers was not the main concern of the community. A necessary first step is thus to give an unequivocal definition of our objects of interest. Contributions. In this paper, we analyze the robustness of vectorizers as their local regularity (Lipschitz, H older) with respect to the Hamming distance (Section 2). We prove: the 1/2-H older continuity of concatenation of token and positional embeddings (Proposition 3.1); the Lipschitz continuity of TF-IDF (Proposition 4.1), and the 1/2-H older continuity of it normalized variant (Proposition 4.2); the Lipschitz continuity of doc2vec (Theorem 5.1). As a necessary step to derive the latter, we make two new mathematical contributions (see Appendix), we propose: a local Lipschitz analysis of the softmax (Theorem H.6); a Gr onwall Bellman Bahouri result (Theorem G.1) needed when casting the doc2vec analysis as an ODE problem. The code for all experiments of the paper is available at https://github.com/dgarreau/ vectorizer-robustness. Related work. (Adversarial examples). A major motivation for studying robustness is its impact on the existence of adversarial examples. In the case of DNNs, robustness often means Lipschitz continuity with respect to the inputs. For instance, one can show that a network having a small Lipschitz constant prevents the existence of small adversarial changes. More precisely, Hein & Andriushchenko (2017) provide a lower bound on the norm of the input manipulation needed to change the classifier decision inversely proportional to the Lipschitz constant of the network. This was later extended by Weng et al. (2018b) to DNNs with Re LU activations. Quantitatively, Weng et al. (2018a) show that fully connected layers have a Lipschitz constant potentially as large as the operator norm of the weight matrix. 
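As a small illustration of this last point (our own sketch, not taken from the paper, with a random matrix standing in for a trained layer), the operator norm in question is the largest singular value of the weight matrix, and it does bound the variation of the layer's output:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))          # weights of a fully connected layer (no bias)
lip = np.linalg.norm(W, ord=2)          # operator norm = largest singular value of W

x, y = rng.normal(size=128), rng.normal(size=128)
# The linear map x -> W @ x is Lipschitz with constant at most ||W||_op:
assert np.linalg.norm(W @ x - W @ y) <= lip * np.linalg.norm(x - y) + 1e-9
```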
From a practical point of view, it has also been noticed that enforcing the Lipschitz constants of the layers to remain low does improve the robustness (Cisse et al., 2017).

(Generalization & interpolation). It is known that robust algorithms generalize better. In particular, Xu & Mannor (2012) derive generalization bounds for generic algorithms depending on their robustness. The definition of robustness here includes Lipschitz-continuous DNNs. More recently, Bubeck & Sellke (2021), extending (Bubeck et al., 2020), showed that in order to train Lipschitz-continuous models, one has to take a large number of parameters.

(Theory of vectorizers). Surprisingly, the robustness of vectorizers has received little attention until now on the theoretical side, and all previous works on robustness assume continuous input. Nevertheless, there exist some theoretical works on similar problems. Most notably, Arora et al. (2016) analyze a large class of word vectorizers and explain how the intriguing alignment properties observed experimentally appear.

Notations. For u ∈ R^p, we denote by ∥u∥ its Euclidean norm. Let g : R × R^d → R be a function. The derivative in the time variable (µ) is denoted by ∂_µ g, whereas ∇g (resp. ∇²g) denotes the Jacobian (resp. the Hessian) of g in the space variable. We let 1 = (1, . . . , 1) denote the all-ones vector. For a matrix R, σ_min(R) is its smallest singular value. For a given set S, |S| is its cardinality.

2. Framework
Let us now present the mathematical framework in which we perform our analysis. We consider tokens from a finite dictionary D, identified with [D] := {1, . . . , D}. A document x built on D is a finite sequence of elements of D, and we write [D]^* for the set of all documents. Thus the central object of our work, a vectorizer, is simply a mapping φ : [D]^* → R^d, where d is the dimension of the embedding. The length of x will be denoted by T(x), and therefore x can be written as (x_1, . . . , x_{T(x)}). The set of all documents over D of length T will be denoted [D]^T ⊆ [D]^*. When there is no ambiguity, we remove the dependency on x from our notation, e.g., T(x) becomes T.

As discussed in the related work, robustness is often synonymous with Lipschitz continuity of the model: the distance between outputs lies within a constant factor of the distance between inputs. As distance between input documents x and x̃ of the same length, we consider the Hamming distance, which is the number of indices at which x_t and x̃_t differ:

d_H(x, x̃) := |{t ∈ [T] : x_t ≠ x̃_t}| .

The distance between outputs will simply be measured by the Euclidean norm in R^d. In short, for a given document length T, what we call Lipschitz continuity of the vectorizer φ can be written as

∀ x, x̃ ∈ [D]^T ,  ∥φ(x) − φ(x̃)∥ ≤ C · d_H(x, x̃) ,  (1)

where C is called the Lipschitz constant. Another way to quantify robustness is to allow for an exponent in Eq. (1):

∀ x, x̃ ∈ [D]^T ,  ∥φ(x) − φ(x̃)∥ ≤ C · d_H(x, x̃)^β ,  (2)

with 1 ≥ β > 0. This is known as Hölder continuity, and coincides with Lipschitz continuity whenever β = 1. While it is known that Lipschitz continuity implies Hölder continuity on the real line when β ≤ 1, this is not the case here, since d_H takes values in N. Thus, in our setting, Lipschitz continuity is a weaker notion of robustness than Hölder continuity.

Often we obtain more precise results, depending explicitly on the set of indices at which the documents differ. To this end, for a given subset S of [T], we define the set of S-close documents B_S(x) of x ∈ [D]^T as

B_S(x) = { x̃ ∈ [D]^T : x̃_i = x_i for i ∉ S } .

Said alternatively, x̃ ∈ B_S(x) if it is obtained by replacing the tokens of x with indices belonging to S by arbitrary tokens in D. We note that B_S(x) is a subset of the Hamming ball of radius |S|. Let us consider for instance the document x = the quick brown fox and the set of perturbed indices S = {2, 3}. Here, x has length T = 4, |S| = 2, and an element of B_S(x) is the document x̃ = the slow blue fox.
3. Warm-up: concatenation
Concatenation embeddings generally proceed by first mapping each token x_t of x to a vector u(x_t, t) ∈ R^d. In a second step, these vector representations are concatenated together to form φ(x). We assume that the representation u(x_t, t) can be written as

u(x_t, t) = [u_e(x_t); u_p(t)] ∈ R^d ,  (3)

where u_e(x_t) ∈ R^{d_e} denotes the vector representation of the individual token, while u_p(t) ∈ R^{d_p} encodes positional information, and we define d := d_e + d_p.

Token embeddings. As noted in the introduction, there are essentially two widespread choices for u_e: either use sparse representations for individual tokens or use dense representations. The first approach is often synonymous with the use of one-hot encodings, hence considering the mapping u_e : j ↦ 1_j as a building brick, where, for any j ∈ D, we define 1_j as the j-th vector of the canonical basis of R^D. This has the advantage of simplicity. One caveat is that, although sparse, one-hot vectors have dimensionality d_e = D, the size of the dictionary. Regarding dense embeddings, as discussed in the introduction, the mapping j ↦ u_e(j) is learned from data and can encompass some semantic properties. In all these examples, u_e(j) typically has dimensionality d_e ≪ D (for instance, gensim takes d_e = 100 in its word2vec implementation).

Positional embeddings. A common choice is to learn positional embeddings, jointly with token embeddings. It is also possible to use deterministic positional embeddings, such as one-hot vectors u_p(t) = 1_t ∈ R^{T_max}, where T_max is a maximal document size, or more complicated functions of t. For instance, the original transformer architecture uses a sinusoidal transformation of t as positional embedding (Vaswani et al., 2017). Further, it is also possible to incorporate additional positional information in the embedding: for instance, BERT incorporates segment position information corresponding to the index of the sentence the token belongs to (Devlin et al., 2018, Figure 2). Finally, one can simply ignore u_p altogether, relying on the order of the u(x_t) to convey the positional information. Let us note that when d_e = d_p, one can add u_e and u_p in Eq. (3) instead of concatenating them, a possibility to which our analysis is robust.

Concatenation. For a given u, the embedding φ(x) of a document x is formed by concatenating the u(x_t, t)s for t ∈ [T]. Formally, if T ≥ T_max, then the concatenation φ(x) of (x_1, . . . , x_T) is defined as

φ(x) := [u(x_1, 1); . . . ; u(x_{T_max}, T_max)] ∈ R^{d·T_max} ,

and if T < T_max, as (zero-padding)

φ(x) := [u(x_1, 1); . . . ; u(x_T, T); 0; . . . ; 0] ∈ R^{d·T_max} .

Since the embedding is explicit in this case, it is straightforward to show the following:

Proposition 3.1 (Robustness of concatenation). Let x ∈ [D]^T, S ⊆ [T], and x̃ ∈ B_S(x). Then

∥φ(x) − φ(x̃)∥ ≤ max_{j≠k} ∥u_e(j) − u_e(k)∥ · √|S| .

In particular, for small perturbations of the input document, concatenation is 1/2-Hölder with respect to the Hamming distance. Closer inspection of the proof reveals that the constant depends only on the perturbed tokens: if the changes made are close from the point of view of u_e, then φ(x) and φ(x̃) remain close.
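To make Proposition 3.1 concrete, here is a small numerical check (our own sketch, using one-hot token and positional embeddings; names and sizes are ours) that the left-hand side never exceeds the bound:

```python
import numpy as np

D, T_MAX = 6, 8          # dictionary size and maximal document length

def u(token, t):
    """One-hot token embedding concatenated with a one-hot positional embedding."""
    u_e = np.eye(D)[token]
    u_p = np.eye(T_MAX)[t]
    return np.concatenate([u_e, u_p])

def phi(x):
    """Concatenation embedding with zero-padding up to T_MAX."""
    vec = np.zeros((T_MAX, D + T_MAX))
    for t, token in enumerate(x[:T_MAX]):
        vec[t] = u(token, t)
    return vec.ravel()

x       = [0, 1, 2, 3]
x_tilde = [0, 4, 5, 3]                      # S = {2, 3}: two tokens replaced
lhs = np.linalg.norm(phi(x) - phi(x_tilde))
bound = np.sqrt(2) * np.sqrt(2)             # max_{j != k} ||u_e(j) - u_e(k)|| = sqrt(2), |S| = 2
print(lhs <= bound + 1e-12)                 # True
```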
4. TF-IDF transform
Let x be a document of length T built on D. In this section, we will assume that tokens correspond to individual words. Forgetting the sequential nature of natural language, one can simply look at the words appearing in x with repetitions: this is informally called a bag-of-words representation. Any given word j ∈ D appears in this representation with multiplicity m_j(x). The TF-IDF transform of x is a vector φ(x) ∈ R^D, with each coordinate of φ(x) corresponding to a word of the dictionary. Component-wise, φ(x) is a product of two terms, the term frequency f_j and the inverse document frequency v_j:

f_j := m_j(x) / T ,   v_j := log( |C| / |{z ∈ C s.t. j ∈ z}| ) ,  (4)

where C is a set of documents. We will assume that v_j > 0. The exact expressions appearing in Eq. (4) can vary depending on the implementation; we use here the most common definitions (in particular, they are the default choices used by scikit-learn (Pedregosa et al., 2011)). The (non-normalized) TF-IDF of x can be written φ(x)_j = f_j v_j for all j ∈ D. Intuitively, one wants to quantify the importance of each word in the document, while ignoring common words appearing in many documents such as articles. Finally, it is common to normalize φ(x), generally using the Euclidean norm. We denote by ϕ(x) := φ(x)/∥φ(x)∥ the normalized TF-IDF of x.

4.1. Robustness results
As we saw in the previous section, the TF-IDF transform of a given document can be given in closed form as a function of the word multiplicities and the given coefficients. This allows a simple analysis, at least in the non-normalized case.

Proposition 4.1 (Robustness of non-normalized TF-IDF). Let x ∈ [D]^T, S ⊆ [T], and x̃ ∈ B_S(x). Let m_max be the maximal word multiplicity in x and v_max be the maximal inverse document frequency over D. Then

∥φ(x) − φ(x̃)∥ ≤ 4 m_max v_max |S| / T .

In other words, non-normalized TF-IDF is Lipschitz continuous for the Hamming distance, with Lipschitz constant inversely proportional to the common length of the documents. In reality, the dependency in T is slightly more complicated, since nothing prevents m_max from being as large as T in pathological cases (when all the words of the document are identical). In any case, we uncover a satisfying fact about TF-IDF: small changes in long documents do not matter much. Taking into account the normalization, we have a similar result:

Proposition 4.2 (Robustness of normalized TF-IDF). Let x ∈ [D]^T. Let v_min be the minimal inverse document frequency associated to the words of x. Let S ⊆ [T] be such that |S| ≤ ∥φ(x)∥ / (4 m_max v_max), and let x̃ ∈ B_S(x). Then

∥ϕ(x) − ϕ(x̃)∥ ≤ 4 m_max^{1/2} v_max^{1/2} D^{1/4} √( |S| / (v_min T) ) .

In plain words, normalized TF-IDF is 1/2-Hölder with respect to the Hamming distance. Again, the constant appearing decreases with the length of the base document. A close inspection of the proof also reveals that the D is actually equal to D(x), the size of the local dictionary.
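A perturbation experiment in the spirit of the next subsection can be set up in a few lines with scikit-learn's TfidfVectorizer, which implements the L2-normalized transform ϕ (a minimal sketch; the toy corpus and the replaced word are ours):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the movie was great and the acting was great",
    "the plot was dull but the photography was nice",
    "a great cast saves a dull plot",
]
vectorizer = TfidfVectorizer(norm="l2").fit(corpus)

doc = corpus[0]
perturbed = "the movie was dull and the acting was great"  # one word replaced (|S| = 1)

phi_x = vectorizer.transform([doc]).toarray()[0]
phi_xt = vectorizer.transform([perturbed]).toarray()[0]
print(np.linalg.norm(phi_x - phi_xt))   # small, as predicted by Proposition 4.2
```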
4.2. Experimental validation
In order to check the accuracy of Proposition 4.2, we ran some numerical experiments. We considered movie reviews from the IMDB dataset as documents, and the TF-IDF implementation from scikit-learn with L2 normalization.

Influence of the document length. In a first set of experiments, we investigated the behavior of ∥ϕ(x) − ϕ(x̃)∥ with respect to the length T of x. To this end, for several documents, we created a sequence of growing documents by considering the first t words of the documents, with t ranging from 5 to T. For each value of t, we replaced 5 words in the intermediary document and repeated this experiment several times. The words to replace were chosen uniformly at random in the document, and the replacements uniformly at random in D, and we estimated the supremum of ∥ϕ(x) − ϕ(x̃)∥ by taking the maximum over these repetitions. Proposition 4.2 predicts that, since |S| is kept constant here, the supremum of ∥ϕ(x) − ϕ(x̃)∥ over all possible replacements should be upper bounded by 1/√T (up to numerical constants). This appears to be empirically true (see Figure 1).

Figure 1. Normalized TF-IDF, influence of T. Documents of increasing length t, 5 random replacements. Proposition 4.2 gives a bound in O(1/√T).

Influence of the number of removals. In a second set of experiments, we looked at the dependency of ∥ϕ(x) − ϕ(x̃)∥ with respect to |S|. This time, keeping x fixed, we gradually increased the number of replaced words from 1 to T. Since T is fixed, Proposition 4.2 predicts that the supremum of ∥ϕ(x) − ϕ(x̃)∥ over all possible replacements should behave at most as √|S|. This also appears to be empirically true, see Figure 2.

Figure 2. Normalized TF-IDF, influence of |S|. For a given document, s words are replaced at random, with s ranging from 1 to T. Proposition 4.2 gives a bound in O(√|S|).

5. Paragraph Vector (doc2vec)
We now turn to the most challenging part of our analysis, doc2vec. On a high level, a token embedding matrix is learned jointly with a document embedding matrix on a corpus, aiming to correctly predict a missing token in a given context. The key difference with other vectorizers is that, at inference time, another minimization problem is solved by the model. Different documents yield different optimization problems, and therefore it is quite challenging to see where the resulting minimizer is located with respect to the original embedding.

5.1. A primer on doc2vec
The key idea underlying Paragraph Vector is neural probabilistic language modeling (Bengio et al., 2000): predict words of a document knowing (i) the context of the missing word in the document, and (ii) some global information about the document, encoded as a vector q ∈ R^d. Thus the key concept is the probability of observing word j at position t given some context c(t) and vector q. This is written informally as P(j | c(t), q), and we describe its exact formulation in the next paragraphs. Two models are proposed in Le & Mikolov (2014): the distributed memory (PVDM) model, similar to the continuous bag-of-words model of Mikolov et al. (2013a), and the distributed bag-of-words (PVDBOW) model, similar to the skip-gram model. We first focus on the PVDM model, PVDBOW being a simplified version thereof, referring to Figure 3 for a visual help.

Local information. For a document x with length T, for any ν < t < T − ν, we define the neighborhood of t as

γ(t) := (t − ν, . . . , t − 1, t + 1, . . . , t + ν) .  (5)

Here, ν is a hyperparameter often called context size (or window size), quantifying the breadth of the context considered by the model. To this neighborhood corresponds the context

c(t) := (x_{t−ν}, . . . , x_{t−1}, x_{t+1}, . . . , x_{t+ν}) .  (6)

Intuitively, c(t) corresponds to the tokens surrounding x_t in the document x. The tokens contained in c(t) are then mapped to their one-hot representations, which are aggregated together. There are two natural ways to do this, either computing the mean (PVDMmean) or the concatenation (PVDMconcat) of these vectors. Thus, at this stage, the local information at index t is summarized as a vector h_t, with

h_t := (1/(2ν)) Σ_{s ∈ γ(t)} 1_{x_s} ∈ R^D

if the average is used, and

h_t := [1_{x_{t−ν}}; . . . ; 1_{x_{t−1}}; 1_{x_{t+1}}; . . . ; 1_{x_{t+ν}}] ∈ R^{2νD}

if concatenation is used (see the bottom layer of Figure 3).
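The construction of c(t) and h_t can be summarized as follows (a sketch under our own naming; one-hot vectors are rows of an identity matrix, and t is assumed to satisfy ν ≤ t < T − ν):

```python
import numpy as np

def context(x, t, nu):
    """Context tokens c(t) for position t (0-indexed) and window size nu."""
    gamma = [s for s in range(t - nu, t + nu + 1) if s != t]
    return [x[s] for s in gamma]

def h_mean(x, t, nu, D):
    """PVDMmean aggregation: average of the one-hot vectors of the context tokens."""
    eye = np.eye(D)
    return np.mean([eye[j] for j in context(x, t, nu)], axis=0)

def h_concat(x, t, nu, D):
    """PVDMconcat aggregation: concatenation of the one-hot vectors, dimension 2*nu*D."""
    eye = np.eye(D)
    return np.concatenate([eye[j] for j in context(x, t, nu)])
```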
Projecting and lifting. This local information is then projected into R^d, with d ≪ D, the embedding space. At this stage, the document vector q ∈ R^d is added to the local representation. This intermediary representation is lifted back to R^D. PVDM relies on two matrices P and R such that each context is mapped to

y_t := R(P h_t + q) = π_t + R q ∈ R^D ,

where π_t := R P h_t ∈ R^D. Here, P has size d × D for PVDMmean, and d × 2νD for PVDMconcat, while R has size D × d. When tokens are words, the columns of P are called word vectors, since they correspond to d-dimensional embeddings of individual words. We refer to the intermediate layers of Figure 3 for a visual help.

Prediction. Finally, the prediction for x_t is encoded as the softmax of y_t, where the softmax σ : R^D → R^D is defined for u ∈ R^D by

σ(u)_j := e^{u_j} / Σ_{k=1}^{D} e^{u_k} ,   1 ≤ j ≤ D .  (7)

In particular, all components of σ(y_t) lie between 0 and 1 and sum to one, and reading coordinate j of σ(y_t) can be interpreted as reading the predicted probability of token j. To summarize, σ(y_t) encodes a discrete distribution over D that depends on the context of x_t and on the document vector q (topmost layer of Figure 3).

Training. Let us call x^(1), . . . , x^(N) the documents in our training set, with lengths T_1, . . . , T_N. To each of these documents corresponds an embedding q^(i) ∈ R^d, which can be seen as a column of a matrix Q ∈ R^{d×N}, each giving rise to y_t^(i). The columns of Q are often referred to as document vectors. The key idea here is to learn P, Q, and R so that the predicted tokens at position t are accurate for all documents.

Figure 3. Overview of the doc2vec vectorizer, PVDM model. For a given document, for each token position t, the model considers the context c(t) of x_t. The one-hot representations of the tokens in c(t) (which are of size D) are either averaged or concatenated, then projected to the embedding layer R^d. At this stage, the document embedding q ∈ R^d is added to this local representation, which is then lifted back to y_t ∈ R^D. Taking a softmax transform of y_t yields a discrete distribution on D, which is compared to the truth (x_t) using cross-entropy. During training, PV minimizes objective (8) to find satisfying token embeddings P, document embedding q, and lifting R. At inference time, P and R are frozen and only q is allowed to vary.

Seeing σ(y_t^(i)) as a discrete probability distribution on D, a natural way to compare it to the ground truth (x_t^(i)) is to compute the cross-entropy between the distribution putting mass one at x_t^(i) and σ(y_t^(i)), that is,

ℓ_t^(i) := −log σ(y_t^(i))_{x_t^(i)} = ψ_{x_t^(i)}(y_t^(i)) ,

where we defined ψ := −log σ coordinate-wise. The optimization problem solved by PV is written

Minimize_{P,Q,R}  Σ_{i=1}^{N} Σ_{t ∈ x^(i)} ψ_{x_t^(i)}(y_t^(i)) ,  (8)

where t ∈ x^(i) means t ranging from ν + 1 to T_i − ν − 1. Problem (8) is solved by stochastic gradient descent, or ADAM (Kingma & Ba, 2015).
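Since both training and inference repeatedly evaluate the softmax of Eq. (7) and the loss ψ = −log σ, here is a small numerically stable sketch (ours), which also illustrates the translation invariance by 1 used later to justify the regularization:

```python
import numpy as np

def softmax(u):
    """Softmax of Eq. (7); shifting by max(u) leaves the result unchanged and avoids overflow."""
    e = np.exp(u - np.max(u))
    return e / e.sum()

def psi(j, u):
    """psi_j(u) = -log softmax(u)_j: the cross-entropy loss when the true token is j."""
    return -np.log(softmax(u)[j])

u = np.array([1.0, -0.5, 2.0])
print(np.allclose(softmax(u), softmax(u + 3.0)))  # True: sigma(u + c*1) = sigma(u)
```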
Inference. Let us describe the embedding of a new document x, assuming that the model was trained on a corpus. The way inference works for the PV model is to keep P and R fixed, and to optimize solely in q ∈ R^d:

Minimize_{q ∈ R^d}  (1/T) Σ_{t ∈ x} ψ_{x_t}(y_t) .  (9)

An important observation is that q ↦ ψ_{x_t}(π_t + Rq) is a convex function, although not strictly (see Appendix). Therefore, a regularization term is often added to Eq. (9), a point which we will clarify in the next section. Also noting that q has only d parameters, solving PV inference (9) efficiently is not too challenging.

The case of PVDBOW. PVDBOW is another model falling under the PV umbrella. In a nutshell, following the idea of the distributed bag-of-words model, PVDBOW works the other way around and uses only the representation of the document to predict tokens. At position t, no local information is taken into account and we put π_t = 0 in that case. The predicted token distribution for the document is encoded as before (as σ(y_t) = σ(Rq)), and its quality is also measured as ψ_{x_t}(y_t) for all tokens in the document, leading to the same optimization problems. To summarize, PVDBOW is a simplified, lightweight version of PVDM, simply obtained by taking π_t = 0 in our framework. In particular, there is no matrix P, which leads to fewer parameters, and thus easier training and inference, a fact which was pointed out by Le & Mikolov (2014). Nevertheless, they recognize that PVDBOW still performs well as an embedding, and recommend considering as an embedding the concatenation of PVDM and PVDBOW.

Hierarchical softmax and negative sampling. In practice, as advocated by Le & Mikolov (2014), two additional expedients are used. First, the softmax is replaced by a hierarchical softmax (Morin & Bengio, 2005). In a nutshell, each call of σ has a computational cost linear in D, which can be as large as 10^5 in practice. A solution is to replace the softmax by a tree-based approximation thereof, whose computation is much faster. Second, following Mikolov et al. (2013a), it is common to incorporate tokens with a negative association to the token to predict when computing ℓ_t, leading to faster training. These two possibilities are non-trivial modifications to the PV model and we do not consider them in our analysis.

5.2. Robustness result
Before stating our robustness result, let us explain why it is challenging and outline the proof technique. As detailed in the previous section, the embedding of a document x of length T is found by solving

q_0 = arg min_{q ∈ R^d} { F(q) + (α/2) ∥q∥² } ,  (10)

where F(q) := (1/T) Σ_{t ∈ x} ψ_{x_t}(π_t + Rq). The regularization term α∥q∥²/2 with α > 0 ensures uniqueness of the solution. Indeed, the softmax is invariant by translation by a vector proportional to 1, and solutions to (9) are not unique. As before, consider x̃, a modified version of x where tokens with indices in S have been replaced by others. The embedding q_1 of x̃ is found by solving

q_1 = arg min_{q ∈ R^d} { G(q) + (α/2) ∥q∥² } ,  (11)

where G(q) := (1/T) Σ_{t ∈ x̃} ψ_{x̃_t}(π̃_t + Rq), and π̃_t is defined analogously to π_t. The main challenge here is that q_0 and q_1 are solutions of distinct optimization problems, which can be quite different if |S| is large.

From discrete to continuous. The solution we propose to connect these two problems is to interpolate smoothly between them. There are many ways to do this, and we settle for the simplest: linear interpolation. More precisely, we define, for all µ ∈ [0, 1] and q ∈ R^d,

Ψ_lin(µ, q) := (1 − µ) F(q) + µ G(q) .  (12)

Subsequently, for all µ ∈ [0, 1], we can solve the following regularized optimization problem:

q(µ) := arg min_{q ∈ R^d} { Ψ_lin(µ, q) + (α/2) ∥q∥² } ,  (13)

giving rise to a continuous trajectory in the embedding space (see Figure 4 for an illustration).
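To give a feel for this interpolation, the sketch below (ours; PVDBOW case so that π_t = 0, with a random frozen R and random toy documents) solves the regularized problem (13) on a grid of µ values with scipy, tracing a discretized version of the trajectory of Figure 4:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
D, d, T, alpha = 20, 5, 30, 0.1
R = rng.normal(size=(D, d))                 # frozen lifting matrix
x = rng.integers(D, size=T)                 # original document (token ids)
x_tilde = x.copy()
x_tilde[:3] = rng.integers(D, size=3)       # |S| = 3 replacements

def doc_loss(q, doc):
    """(1/T) sum_t psi_{x_t}(R q) for the PVDBOW model (pi_t = 0)."""
    y = R @ q
    log_probs = y - np.log(np.exp(y - y.max()).sum()) - y.max()
    return -log_probs[doc].mean()

def q_of_mu(mu):
    obj = lambda q: ((1 - mu) * doc_loss(q, x) + mu * doc_loss(q, x_tilde)
                     + 0.5 * alpha * q @ q)
    return minimize(obj, np.zeros(d)).x

trajectory = [q_of_mu(mu) for mu in np.linspace(0.0, 1.0, 11)]
print(np.linalg.norm(trajectory[-1] - trajectory[0]))   # ||q(1) - q(0)||
```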
One can think of q(µ) as the embedding of a fictitious document traveling between x and x̃ as µ ranges from 0 to 1.

Dynamics of interpolation. This approach is powerful, since it allows us to transform a problem which is discrete in nature (elements of a sum are modified) into a continuous one (a time parameter varies). In particular, the dynamics of µ ↦ q(µ) are described by an ordinary differential equation (ODE). Indeed, for all µ ∈ [0, 1], since q ↦ Ψ_lin(µ, q) + (α/2)∥q∥² is a strongly convex function, q(µ) is the (unique) critical point of q ↦ ∇Ψ_lin(µ, q) + αq, where ∇ denotes the derivative with respect to the space coordinate (q). That is, for all µ ∈ [0, 1],

∇Ψ_lin(µ, q(µ)) + α q(µ) = 0 .

Differentiating, we get that, for all µ ∈ [0, 1],

(∇²Ψ_lin(µ, q(µ)) + α I) q′(µ) + ∂_µ ∇Ψ_lin(µ, q(µ)) = 0 ,  (14)

where ∂_µ denotes the derivative with respect to the time coordinate (µ) and I the identity matrix.

Figure 4. Continuously interpolating between q_0, the embedding of x (in red), and q_1, the embedding of x̃ (in black). Visualization in a 2-dimensional slice of R^d. To each µ ∈ [0, 1] corresponds a solution to (13), appearing here as a point of the trajectory between q_0 and q_1 (solid orange line). Dynamics of this trajectory are described by Eq. (14). Different document perturbations lead to different embeddings and associated trajectories (dotted lines).

Let us set

Φ_lin(µ, q) := −(∇²Ψ_lin(µ, q) + α I)^{−1} ∂_µ ∇Ψ_lin(µ, q) .  (15)

Then, Eq. (14) can be rewritten as q′(µ) = Φ_lin(µ, q(µ)).

Spectrum of the Hessian of the log-softmax. Looking back at the ODE problem, it appears that one needs to understand precisely the behavior of Φ_lin. Intuitively, an ill-behaved function could lead to the explosion of the solution of the ODE, preventing the existence of reasonable bounds on ∥q(µ) − q(0)∥ for large µ. This understanding relies on the control of the smallest positive eigenvalue of ∇²Ψ_lin, denoted λ_1(µ, q). Coming back to the definition of Ψ_lin (Eq. (12)), F, and G, we see that λ_1 is closely related to λ_min, the smallest positive eigenvalue of the Hessian of the log-softmax, for which we have precise results (Lemma H.5 and Theorem H.9).

Grönwall-type result. Once a precise control is achieved on Φ_lin, one may have hoped to use standard Grönwall-type inequalities such as those of Pachpatte (2004) to obtain quantitative bounds on ∥q(1) − q(0)∥. However, in our setting, the growth of Φ_lin prevents us from getting explicit bounds and we had to prove a new result (Theorem G.1), which is actually true in a more general setting than that of doc2vec. Specializing this result, we get:

Theorem 5.1 (Bounded trajectories). Let x ∈ [D]^T, S ⊆ [T], and x̃ ∈ B_S(x). Suppose that R ∈ R^{D×d} is such that σ_min(R) > 0 and Im(R) ⊥ 1. Let µ ↦ q(µ) be the solution of ODE (14). Then, there exist two constants c = c(α) > 0 and L = L(∥q(0)∥) > 0, depending explicitly on P, R, ν, and D, such that, whenever |S|/T ≤ c,

sup_{µ ∈ [0,1]} ∥q(µ) − q(0)∥ ≤ L |S| / T .

Since φ(x) = q(0) and φ(x̃) = q(1), a corollary of Theorem 5.1 is that the doc2vec embedding is Lipschitz continuous with respect to the Hamming distance, with Lipschitz constant at most inversely proportional to the document length. Coming back to our initial question, Theorem 5.1 guarantees that, for documents of reasonable length and small perturbations, doc2vec embeddings cannot vary too greatly. We emphasize that Theorem 5.1 is true for all three doc2vec models.

The key assumption here is that |S| is small enough. We argue that it is only natural to ask so: indeed, if one is allowed to modify every single token of x, this yields a completely different document (although having the same length), which could a priori be embedded anywhere. The other main assumptions concern the matrix R. Experimentally, we observe that σ_min(R) > 0 holds (see Section I.3). Requiring that Im(R) ⊥ 1 is not too restrictive: because of the translation invariance by 1 of the softmax, one can always normalize R by removing the average line from each line. The main limitation of Theorem 5.1 is the dependency of c and L on the problem's parameters. Exact expressions can be found in the Appendix (Theorem F.7).

Figure 5. Influence of the length of the document on the robustness of doc2vec. Five random replacements; from left to right: PVDMmean, PVDMconcat, and PVDBOW.

Figure 6. Influence of the number of words replaced on the robustness of doc2vec. From left to right: PVDMmean, PVDMconcat, and PVDBOW.
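The robustness check of the next subsection can be mimicked with very little code; below is a minimal sketch using gensim's Doc2Vec implementation (the implementation also used for the additional results of Appendix I); the toy corpus, hyperparameters, and variable names are ours:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; the paper uses 10^3 IMDB reviews instead.
corpus = [["the", "movie", "was", "great"] * 10,
          ["the", "plot", "was", "dull"] * 10,
          ["a", "great", "cast", "saves", "a", "dull", "plot"] * 10]
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, dm=1, epochs=50)

doc = corpus[0]
rng = np.random.default_rng(0)
vocab = list(model.wv.index_to_key)
perturbed = list(doc)
for i in rng.choice(len(doc), size=5, replace=False):   # |S| = 5 random replacements
    perturbed[i] = vocab[rng.integers(len(vocab))]

q0 = model.infer_vector(doc)
q1 = model.infer_vector(perturbed)
print(np.linalg.norm(q0 - q1))
```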
5.3. Experimental validation
In order to verify the validity of Theorem 5.1, we ran experiments similar to those presented in Section 4. We considered again movie reviews from the IMDB dataset. As vectorizer, we trained doc2vec models from scratch on a subset of the IMDB dataset (10^3 reviews). The associated dictionary has size D = 18,416: we took tokens as words of the English dictionary. Note that one can also consider sub-word tokens, but in that case replacing a word in the document usually implies replacing several tokens. We chose d = 50 as the dimension of the embedding. We took ν = 5 as the context size parameter.

We present the results of the experiments regarding the influence of the document length in Figure 5. Theorem 5.1 predicts that, since |S| is kept constant here, the supremum of ∥φ(x) − φ(x̃)∥ over all replacements should be upper bounded by 1/T (up to numerical constants). This appears to be empirically true.

We present the results of the experiments regarding the influence of the number of replaced words in Figure 6. Here we took the number of replaced words ranging from ν + 1 to T − ν − 1, to avoid border effects. Since T is fixed, Theorem 5.1 predicts that the supremum of ∥φ(x) − φ(x̃)∥ over all possible replacements should behave at most linearly in |S|. This appears to be empirically true.

We present in the Appendix (Section I) additional results with another implementation, gensim (Řehůřek & Sojka, 2010). In particular, this implementation uses hierarchical softmax. The results are consistent with the behavior presented here.

6. Conclusion
In this paper, we proved that several popular text vectorizers are robust, in the sense that they are either Lipschitz or Hölder continuous with respect to the Hamming distance. Proving this robustness was possible for concatenation and TF-IDF thanks to elementary computations, but required a much more challenging mathematical analysis for doc2vec, requiring two new results (local Lipschitz continuity of the softmax and a new Grönwall-Bellman-Bahouri non-explosion lemma).

Let us outline future research directions. First, we studied the robustness of the true solution of (8) and (9).
In practice, this problem is solved thanks to gradient descent, and it would be interesting to measure the impact of this approximation. A second line of work would consist in obtaining refined results when we put a random model on the distribution of the words of the document, similarly to what is done in (Arora et al., 2016). Acknowledgements This work was supported by the French government under the management of the Agence Nationale de la Recherche, grant agreements Gra Va ANR-18-CE40-0005, NEMATIC ANR-21-CE45-0010, and NIM-ML ANR-21-CE23-000501. D.G. acknowledges the support of EU Horizon 2020 project AI4Media (contract no. 951911) and would like to thank Charbel Yachouchi for his preliminary work. S.V. would like to thank Nicolas Patry for fruitful discussion about NLP embeddings. Agarwal, R. P., Deng, S., and Zhang, W. Generalization of a retarded Gronwall-like inequality and its applications. Appl. Math. Comput., 165(3):599 612, 2005. Alghamdi, W., Hsu, H., Jeong, H., Wang, H., Michalak, P. W., Asoodeh, S., and Calmon, F. P. Beyond adult and compas: Fairness in multi-class prediction. In Neur IPS, 2022. Arnold, V. I. Ordinary Differential Equations. MIT Press, 1978. Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. A latent variable model approach to PMI-based word embeddings. TACL, 4:385 399, 2016. Bailleul, I. and Catellier, R. Non-explosion criteria for rough differential equations driven by unbounded vector fields. Ann. Fac. Sci. Toulouse Math., 29(3):721 759, 2020. Bengio, Y., Ducharme, R., and Vincent, P. A neural probabilistic language model. In Neur IPS, 2000. Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. TACL, 5: 135 146, 2017. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Neur IPS, 2020. Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. In Neur IPS, 2021. Bubeck, S., Eldan, R., Lee, Y. T., and Mikulincer, D. Network size and size of the weights in memorization with two-layers neural networks. In Neur IPS, 2020. Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al. Universal sentence encoder. ar Xiv preprint ar Xiv:1803.11175, 2018. Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017. Dannan, F. M. Integral inequalities of Gronwall-Bellman Bihari type and asymptotic behavior of certain second order nonlinear differential equations. J. Math. Anal. Appl., 108(1):151 164, 1985. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018. Gage, P. A new algorithm for data compression. C Users Journal, 12(2):23 38, 1994. Gao, B. and Pavel, L. On the properties of the softmax function with application in game theory and reinforcement learning. ar Xiv preprint ar Xiv:1704.00805, 2017. Garreau, D. and Mardaoui, D. What does LIME really see in images? In ICML, 2021. On the Robustness of Text Vectorizers Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Neur IPS, 2017. Jones, K. S. A statistical interpretation of term specificity and its application in retrieval. J. Doc., 1972. Kim, Y.-H. 
Gronwall, Bellman and Pachpatte type integral inequalities with applications. Nonlinear Anal. Theory Methods Appl., 71(12):2641 2656, 2009. Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In ICLR, 2015. Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Neur IPS, 2015. Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014. Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X., and Chen, E. Word embedding revisited: A new representation learning and explicit matrix factorization perspective. In AISTAT, 2015. Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In International Conference on Learning Representations, 2018. Luhn, H. P. A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev., 1 (4):309 317, 1957. Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. Efficient estimation of word representations in vector space. ar Xiv preprint ar Xiv:1301.3781, 2013a. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. Neur IPS, 2013b. Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In AISTAT, 2005. Pachpatte, B. G. On some new nonlinear retarded integral inequalities. J. Inequal. Pure Appl. Math, 5(3):80, 2004. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. JMLR, 12: 2825 2830, 2011. Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In EMNLP, 2014. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In NAACL, 2018. ˇReh uˇrek, R. and Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010. Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715 1725. Association for Computational Linguistics, 2016. Steele, J. M. The Cauchy-Schwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, 2004. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In ICLR, 2014. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. Wei, D., Wu, H., Wu, M., Chen, P.-Y., Barrett, C., and Farchi, E. Convex bounds on the softmax function with applications to robustness verification. In AISTATS, 2023. Weng, L., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Daniel, L., Boning, D., and Dhillon, I. Towards fast computation of certified robustness for Re LU networks. In ICML, 2018a. Weng, T.-W., Zhang, H., Chen, P.-Y., Yi, J., Su, D., Gao, Y., Hsieh, C.-J., and Daniel, L. Evaluating the robustness of neural networks: An extreme value theory approach. In ICLR, 2018b. Wu, Y., Schuster, M., Chen, Z., Le, Q. 
V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google s neural machine translation system: Bridging the gap between human and machine translation. ar Xiv preprint ar Xiv:1609.08144, 2016. Xu, H. and Mannor, S. Robustness and generalization. Mach. Learn., 86(3):391 423, 2012. Zhang, W. E., Sheng, Q. Z., Alhazmi, A., and Li, C. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Trans. Intell. Syst. Technol., 11(3):1 41, 2020. On the Robustness of Text Vectorizers A. General organization This Appendix is organized as follows: in Section B (resp. C) we collect the missing proofs for Section 3 (resp. 4) of the main paper. The next five sections are dedicated to the proof of Theorem 5.1: First, in Section D, we formally prove that the dynamics of the interpolation scheme between two minimizers follow an ordinary differential equation (ODE). We actually show a more general result and provide technical conditions on the interpolation Ψ under which we are able to formulate the interpolation between minimization problems as an ODE. Next, in Section E, we derive quantitative bounds for the solution of this ODE. We show how to specialize this result in the doc2vec setting in Section F, proving Theorem 5.1 in the process. The main tool used to obtain these bounds is a general Gr onwall-Bellman-Bahouri type result for ODE with exponentially-growing coefficients. This result (Theorem G.1), as well as all other technical results concerning ODEs, is stated and proved in Section G . In order to specialize our result to the doc2vec setting, we needed a fine-grained study of the (log-)softmax function. In particular, we derive a new bound on the softmax function (Theorem H.9), which is proved in Section H. We conclude this Appendix with additional experimental results supporting our claims in Section I. B. Omitted proofs for concatenation B.1. Proof of Proposition 3.1 By definition of φ and Pythagoras theorem, φ(x) φ( x) 2 = X t S [Tmax] u(xt, t) u( xt, t) 2 . By definition of u (Eq. (3)), one has u(xt, t) u( xt, t) = [ue(xt) u( xt); 0] , (16) and therefore u(xt, t) u( xt, t) 2 = ue(xt) u( xt) 2 . We deduce that φ(x) φ( x) 2 |S [Tmax]| max j =k ue(j) ue(k) 2 . Remark B.1 (Concatenation v.s. sum). Replacing the concatenation by a sum in the definition of u (Eq. (3)) does not change the proof. Indeed, the key step Eq. (16) remains unchanged in that case: the key idea here is that position tokens are the same for words in the same position, and cancel out when forming the difference. C. Omitted proofs for TF-IDF vectorization C.1. Proof of Proposition 4.1 By definition, we can write j=1 fjvj1j = 1 j=1 mjvj1j . Similarly, since x has same length as x, j=1 mjvj1j , where we let mj denote the multiplicity of word j in document x. We deduce that φ(x) φ( x) 2 = 1 j=1 (mj mj)2v2 j . On the Robustness of Text Vectorizers By letting vmax be the maximal inverse document frequency on D, we already see that φ(x) φ( x) 2 v2 max T 2 j=1 (mj mj)2 . In the previous display, only terms such that mj = mj count. Using the inequality between p-norms, we have mj = mj (mj mj)2 mj = mj |mj mj| Now, by the triangle inequality, X mj = mj |mj mj| X mj = mj mj + X mj = mj mj . We notice that these two sums are equal: every removed word has to appear somewhere. Moreover, |{j s.t. mj = mj| 2 |S|, since modifying one word changes at most two multiplicities, and this happens at most |S| times. 
Therefore, we have proved that X mj = mj |mj mj| 4mmax |S| , (17) where we recall that mmax is the maximal multiplicity of words of x. Backtracking, we have φ(x) φ( x) 2 v2 max T 2 16m2 max |S|2 , and we can conclude by simply taking the square root of this last display. C.2. Proof of Proposition 4.2 We notice that ϕ(x) ϕ( x) 2 = 1 + 1 2ϕ(x) ϕ( x) = 2 2 φ(x) φ( x) φ(x) φ( x) . (18) In this last term we recognize the cosine similarity between φ(x) and φ( x). Since we are working under the assumptions of Lemma C.1, we have φ(x) φ( x) φ(x) φ( x) 1 8mmaxvmax |S| Coming back to Eq. (18), we see that ϕ(x) ϕ( x) 2 16mmaxvmax |S| We conclude by using Lemma C.2 and taking the square root. C.3. Auxilliary results We have the following result, key to the proof of Prop. 4.2, and of independent interest: Lemma C.1 (Cosine similarity robustness). Let x be a document. Let S [T] such that |S| φ(x) /(4mmaxvmax) and x BS(x). Then φ(x) φ( x) φ(x) φ( x) 1 8mmaxvmax |S| φ(x) . (19) Proof. By homogeneity, we can multiply numerator and denominator in Eq. (19) by T and deal with multiplicities instead of frequencies in this proof. We first focus on the numerator and write φ(x) φ( x) = φ(x) (φ(x) + φ( x) φ(x)) = φ(x) 2 + j=1 mj( mj mj)v2 j , (20) On the Robustness of Text Vectorizers by definition of φ. Using Cauchy-Schwarz inequality, we find that j=1 mj(mj mj)v2 j s X j (mj mj)2v2 j . In the first part of the right-hand side we recognize φ(x) , and in the second part, the same quantity bounded in the proof of Proposition 4.1. We deduce that X j mj(mj mj)v2 j φ(x) 4mmaxvmax |S| . Coming back to Eq. (20), we have proved that φ(x) φ( x) φ(x) 2 4mmaxvmax φ(x) |S| , which is positive under our assumption. Let us now look into the denominator of Eq. (19). Using the triangle inequality and Proposition 4.1, we write φ( x) φ(x) + 4mmaxvmax |S| . Putting everything together, we have φ(x) φ( x) φ(x) φ( x) φ(x) 2 4mmaxvmax φ(x) |S| φ(x) ( φ(x) + 4mmaxvmax |S|) = 1 u with u := 4mmaxvmax |S| / φ(x) . Again, by our assumption, u (0, 1). It is straightforward to show that (1 u)/(1 + u) 1 2u for all u (0, 1), and we deduce the result. We also have the following: Lemma C.2 (Lower bound on φ(x) ). Let x be a document. Let vmin be the minimum inverse document frequency for words contained in x and D(x) the size of the local dictionary. Then φ(x) Tvmin p Proof. Straightforward from the definitions and the comparison of p-norms. D. Dynamics of interpolation Recall that we are considering, for all µ [0, 1], the following minimization problem: q(µ) := arg min q Rd n Ψlin(µ, q) + α 2 q 2o . (21) In this section, we show that under mild regularity assumptions on Ψ, q is the unique solution of the following ODE: 2Ψlin(µ, q(µ)) + α I q (µ) + µ Ψlin(µ, q(µ)) = 0 . (22) Notation. For any matrix M RA B, let us define the operator norm of M as M op := sup Mv v , v RB \ {0} . For any ρ > 0, we also define Bd(ρ) the open Euclidean ball of center 0 and radius ρ. Finally, for a1, a2 > 0, define a1 a2 := max(a1, a2). We can now state the required assumptions on Ψ. On the Robustness of Text Vectorizers Assumption D.1 (Convexity). Let d 1. We suppose that Ψ C1,2([0, 1] Rd; R) and that, for all (µ, q) [0, 1] Rd, 2Ψ(µ, q) is a positive semi-definite matrix. Since α > 0, A.D.1 this guarantees that q(µ) is uniquely-defined for each µ. Next, we define some quantities related to the local Lipschitz continuity of Ψ and its derivatives. Definition D.2 (Local Lipschitz semi-norms). Let Ψ C1,2([0, 1] Rd; R). 
For all ρ > 0, let us define L1(ρ) := sup µ [0,1] q = q Bd(0,ρ) 2Ψ(µ, q) 2Ψ(µ, q) op q q , L2(ρ) := sup µ [0,1] q = q Bd(0,ρ) µ Ψ(µ, q) µ Ψ(µ, q) and M(ρ) := sup µ [0,1] q Bd(0,ρ) µ Ψ(µ, q) . (24) Our second assumption on Ψ at this stage is that these quantities are all finite. Assumption D.3 (Global Lipschitz continuity). Let Ψ C1,2([0, 1] Rd; R). Suppose that L1(ρ) + L2(ρ) < + and sup ρ>0 M(ρ) < + , where L1(ρ), L2(ρ), and M(ρ) are defined in Eq. (23) and Eq. (24). In this setting, we are able to prove the following result: Theorem D.4 (Equivalence ODE/minimization problem). Assume that Ψ satisfies A.D.1 and A.D.3. Then µ 7 q(µ) is differentiable on [0, 1], and q is the unique solution of Eq. (22). Note that under assumption A.D.1 the matrix 2Ψ(µ, q)+α I is invertible. One can then rewrite Eq. (22) in a more standard form, namely q (µ) = 2Ψ(µ, q(µ)) + α I 1 µ Ψ(µ, q(µ)) . (25) Thus, to study the ODE problem, one needs the regularity properties (local Lipschitz continuity, boundedness...) of the function Φ : (µ, q) [0, 1] Rd 7 Φ(µ, q) := 2Ψ(µ, q) + α I 1 µ Ψ(µ, q) . (26) The interplay between µ Ψ and 2Ψ here is crucial. Indeed, in Section F we will see that when specified in the doc2vec case, the term in µ gives the desired quantity |S| T whereas the term in 2Ψ has to be handled using precise properties on the softmax function. Theorem D.4 is standard in the ODE literature and holds as soon as the quantities appearing in Eq. (25) are well-behaved. More precisely, this is the case c = 0 of Theorem G.1 in Section G. We now simply check that the assumptions of Theorem G.1 are satisfied in the setting of Theorem D.4. This is achieved by Lemma D.5 and Lemma D.6. We start by a result upper bounding the norm of the inverse Hessian. Lemma D.5 (Norm of inverse Hessian). Let Ψ : [0, 1] Rd R. Assume that A.D.1 holds. Then, µ, q [0, 1] Rd, ( 2Ψ(µ, q) + α I) 1 op 1 The proof of Lemma D.5 exploits the fact that 2Ψ is a non-negative symmetric matrix and can be diagonalized in orthonormal basis with non-negative eigenvalues. The regularization of the minimization problem with the addition of the term α 2 q can be translated with the addition of the term α I to the previous Hessian matrix, which then becomes a positive definite symmetric matrix. One then only has to estimate the smallest eigenvalue of the matrix to conclude. Proof. By A.D.1, for all µ [0, 1], q 7 Ψ(µ, q) is convex and, for any µ, q [0, 1] Rd, 2Ψ(µ, q) is a positive semi-definite matrix with non-negative eigenvalues. From these, N0(µ, q) = Rank 2Ψ(µ, q) of them are non-zero, and they can be ranked as 0 < λ1(µ, q) λN0(µ,q)(µ, q) . On the Robustness of Text Vectorizers Moreover, there exists an orthogonal matrix P(µ, q) (meaning that P(µ, q)P(µ, q) = I) such that P(µ, q) 2Ψ(µ, q)P(µ, q) = diag(0, . . . , 0, λ1(µ, q), . . . , λN0(µ, q)) . Furthermore since 2Ψ(µ, q) is a symmetric matrix, its range and its kernel are orthogonal complements, Ker 2Ψ(µ, q) Im( 2Ψ(µ, q)) = Rd and h Im( 2Ψ(µ, q)) if, and only if, P(µ, q)h = (0, . . . , 0, h1, , h N0). Hence P(µ, q) 2Ψ(µ, q) + α I P(µ, q) = diag(α, . . . , α, λ1(µ, q) + α, . . . , λN0(µ, q) + α) , which implies that 2Ψ(µ, q) + α I is an invertible positive definite matrix such that P(µ, q) 2Ψ(µ, q) + α I 1 P(µ, q) = diag 1 α, . . . , 1 α, 1 λ1(µ, q) + α, . . . , 1 λN0(µ, q) + α From the last display, one readily sees that the maximum eigenvalue of 2Ψ(µ, q) + α I 1 is 1/α, proving our claim. The next lemma shows how regularity assumptions on Ψ translate into regularity conditions for Φ. 
Lemma D.6 (Global-Lispchitz continuity of Φ). Let Ψ such that A.D.1 and A.D.3 hold. Then Φ is globally Lipschitz continuous in q uniformly in µ [0, 1]. Moreover, for all ρ > 0, sup µ [0,1] q = q Rd Φ(µ, q) Φ(µ, q) sup ρ>0 L2(ρ) + supρ>0 L1(ρ) supρ>0 M(ρ) The proof of Lemma D.6 relies on the following identity, which is true for any non-negative symmetric matrices A, B Rd d and vectors X, Y Rd: (A + α I) 1X (B + α I) 1Y = (A + α I) 1(A B)(B + α I) 1X + (B + α I) 1(X Y ) . (29) Lemma D.5 allows us to conclude. Proof. Let q, q Bd(0, ρ). Using Eq. (29), we have Φ(µ, q) Φ(µ, q) = ( 2Ψ(µ, q) + αI) 1 ( 2Ψ(µ, q) + αI) 1 µ Ψ(µ, q) ( 2Ψ(µ, q) + αI) 1 ( µ Ψ(µ, q) µ Ψ(µ, q)) = ( 2Ψ(µ, q) + αI) 1 2Ψ(µ, q) 2Ψ(µ, q) ( 2Ψ(µ, q) + αI) 1 µ Ψ(µ, q) (30) ( 2Ψ(µ, q) + αI) 1 ( µ Ψ(µ, q) µ Ψ(µ, q)) . (31) Taking the norm and using Lemma D.5 (in particular Inequality (27)), we have for ρ = q q , Φ(µ, q) Φ(µ, q) 1 α2 2Ψ(µ, q) 2Ψ(µ, q) op µ Ψ(µ, q) α µ Ψ(µ, q) µ Ψ(µ, q) Φ(µ, q) Φ(µ, q) 1 α + L2(ρ) q q . Taking the supremum for µ [0, 1], q = q belonging to Bd(0, ρ) and ρ > 0 yields the claim. We now have all the tools to prove Theorem D.4. On the Robustness of Text Vectorizers Proof of Theorem D.4. Note that in that setting, using Lemma D.5, for all µ [0, 1], q 7 Ψ(µ, q) + α 2 q 2 is a strongly convex function and has a unique minimum, which is also the unique critical point of the gradient q 7 Ψ(µ, q) + αq. Let q0 Rd be such that {q0} = arg min Ψ(0, q) + α Thanks to Lemma D.6, Φ satisfies the hypothesis of Theorem G.1, with α sup ρ>0 M(ρ), b = 1 supρ>0 L1(ρ) supρ>0 M(ρ) α + sup ρ>0 L2(ρ) , and c = 0. (32) Let µ [0, 1] q(µ) be the unique solution up to time 1 to the ODE q (µ) = 2Ψ(µ, q(µ)) + α I 1 µ Ψ(µ, q(µ)) = Φ(µ, q(µ)), q(0) = q0 . According to Theorem G.1 applied to Λ = Φ, it exists and is well-defined up until µ = 1. Remark that when differentiating in µ [0, 1] the function µ 7 Ψ(µ, q(µ)) + αq(µ), we have 2Ψ(µ, q(µ)) + α I q (µ) + µ Ψ(µ, q(µ)) = 2Ψ(µ, q(µ)) + α I (q (µ) Φ(µ, q(µ))) = 0 . Hence Ψ(µ, q(µ)) + αq(µ) = Ψ(0, q(0)) + αq(0) = 0 . Thus, for any µ [0, 1], {q(µ)} = arg min n Ψ(µ, q) + α which is the promised result. Remark D.7 (Crude bounds under mild assumptions). Using the same standard result (condition c = 0 in Theorem G.1) could naturally give us some crude bounds on q(µ) q(0) , relying only on assumptions A.D.1 and A.D.3. More precisely, these bounds would strongly depend on α and improve as α . Namely, using Eq. (32) and Theorem D.4 one have for all µ [0, 1], q(µ) q(0) µ α sup ρ M(ρ) exp 1 α sup ρ>0 L1(ρ) sup ρ M(ρ) + sup ρ>0 L1(ρ) µ . This is not the regime we aim at, since α is a small, fixed regularization constant whose role is simply to ensure that the minimization problem is well-posed. E. Quantitative bounds on the trajectory Let us recall that q is the minimizer of the interpolated problem (21). In the previous section, we have made two assumptions (A.D.1 and A.D.3), guaranteeing that q is well-defined and is the unique solution to the ODE (22). In this section, we show how to obtain quantitative bounds on q(0) q(µ) by studying the ODE (22). To derive these bounds, we now make two additional assumptions on Ψ. The first one is an algebraic assumption which greatly improves the computations. Assumption E.1 (Common kernel). We assume that there exists a fixed subspace E Rd such that dim E = N0 and for all (µ, q) [0, 1] Rd Ker 2Ψ(µ, q) = E , Im( 2Ψ(µ, q)) = E, and µ Ψ(µ, q) E . 
The second one is a refined local-Lipschitz assumption (a quantitative version of A.D.3), which will allow us to use the case c = 0 in the Gronwall-Bahouri-Bellman type result Theorem F.7. Assumption E.2 (quantitative (local)-Lipschitz continuity). Recall L1 and L2 from Definition D.2, and M from Eq. (24). For any µ, q, define λ1(µ, q) the smallest positive eigenvalue of 2Ψ(µ, q). For any ρ > 0, define w 1(ρ) := inf µ [0,1] q Bd(0,ρ) λ1(µ, q) . On the Robustness of Text Vectorizers We assume that there exist positive constants (Γi)i 1,...,2 and non negative constants (γi)i 1,...,2, such that for all ρ > 0, L1(ρ) Γ1 eγ1ρ , L1(ρ) Γ2 eγ2ρ , M(ρ) Γ0 eγ0ρ , and w 1(ρ) 1 Γ 1 e γ 1ρ . Under these stronger assumptions, we can obtain the following: Theorem E.3 (Quantitative bounds on the trajectory). Assume that Ψ satisfies A.D.1, A.E.1, and A.E.2. Suppose furthermore that 4Γ 1(Γ0Γ 1Γ1 + Γ2) < exp 2 (γ 1 + γ0 + γ1) γ2 q0 + Γ 1Γ0 e(γ 1+γ0+γ1) γ2 q0 . (33) Then µ 7 q(µ) is differentiable on [0, 1], it is the unique solution of Eq. (22) and furthermore µ [0, 1], q(µ) q0 2µΓ 1Γ0 e(P2 i= 1 γi) q0 . The proof of Theorem E.3 follows the same path as the proof of Theorem D.4, with analogues of Lemmas D.5 and D.6. The crucial differences come from the fundamental use of A.E.1, which somehow allows us to diagonalize the Hessian 2Ψ for all µ, q, and thus allows is to use estimates on the smallest positive eigenvalue of the Hessian. In practical cases, this assumption will not allow us to use global-Lipchitz estimates. We therefore introduce A.E.2 to deal with that. These two ingredients allow us to use the case c > 0 in the Gr onwall-Bahouri-Bellman type lemma (Theorem G.1). The following Lemma gives an improve bounds for the norm of the inverse of the Hessian, using the algebraic requirement on the Hessian. Its proof is similar to the proof of Lemma D.5, and we only point out how to modify it. Lemma E.4 (Quantitative norm of inverse Hessian). Let Ψ : [0, 1] Rd R. Assume that A.D.1 and A.E.1 hold. Then ( 2Ψ(µ, q) + α I) 1 Im( 2Ψ(µ,q)) op 1 λ1(µ, q) , (34) where f|E denotes the restriction of f to the set E. Proof. Remind that from the proof of Lemma D.5, for all (q, µ) Rd [0, 1], we have P(µ, q) 2Ψ(µ, q) + α I 1 P(µ, q) = diag 1 α, . . . , 1 α, 1 λ1(µ, q) + α, . . . , 1 λN0(µ, q) + α Assuming that A.E.1 holds, we have for all (µ, q) [0, 1] Rd, N0(µ, q) = N0. Restricting to E, we see readily that the largest eigenvalue becomes 1/(α + λ1(µ, q)). Here again, by using the algebraic requirements on Ψ and the local-Lipshcitz bound we are able to derive a local-Lipschitz continuity result for Φ. Here again, the proof is quite similar to the one of Lemma E.5. Lemma E.5 (Local-Lispchitz continuity of Φ). Let Ψ such that A.D.1, and A.E.1 hold. Then Φ is locally-Lipschitz continuous in q uniformly in µ [0, 1]. More precisely, for all q, q Rd and all µ [0, 1]; Φ(µ, q) Φ(µ, q) 1 w 1( q ) L1( q q )M( q ) w 1( q ) + L2( q q ) q q . If additionally A.E.2 holds, we get Φ(µ, q) Φ(µ, q) 2Γ 1(Γ0Γ 1Γ1 + Γ2) e (γ 1+γ0+γ1) γ2 q q q q , (35) and Φ(µ, q) Γ 1Γ0 e(γ 1+γ0) q . On the Robustness of Text Vectorizers Proof. Since Ψ satisfies A.E.1, for all (µ, q) [0, 1] Rd and all q Rd µ Ψ(µ, q) Im( 2Ψ(µ, q)), and we can use we can use the second part of Lemma D.5, namely Inequality (34). Indeed, Eq. (30), in norm, is upper bounded by 1 λ1(µ, q) + αL1( q q ) 1 λ1(µ, q) + αM( q ) q q , while (31) is bounded by 1 λ1(µ, q) + αL2( q q ) q q . 
Summing these last two displays and using the definition of w1 and the bounds of A.E.2 allows us to conclude.

Proof of Theorem E.3. Remark that, thanks to Lemma E.5, Φ satisfies the conditions of Theorem G.1 with a = Γ_{-1}Γ0, b = 2Γ_{-1}( Γ0Γ_{-1}Γ1 + Γ2 ), and c = (γ_{-1} + γ0 + γ1) ∨ γ2. Furthermore, Eq. (33) can be translated into

2b < exp( −2c ( ‖q0‖ + a e^{c‖q0‖} ) ),

which is exactly the condition of application of Theorem G.1. It ensures that there exists a unique solution µ ↦ q(µ) to Eq. (22). Following the proof of Theorem D.4, we conclude easily.

F. Specializing our results for doc2vec

In the previous sections, we have seen that, under some technical assumptions on Ψ, the mapping q is the solution to an ODE, and we proved bounds on ‖q(µ) − q(0)‖ (by means of Theorem E.3). In this section, we check that these assumptions are satisfied for the Ψ occurring when considering doc2vec embeddings, that is, Ψ = Ψlin, where Ψlin is defined by Eq. (12). This is embodied in Theorem F.7, which is Theorem 5.1 with explicit constants. We first prove a useful bound on the norm of πt.

Lemma F.1 (Bound on ‖πt‖). Define Π := 2ν σmax(R) sup_i ‖P_{:,i}‖. Then, for any document x and any position t ∈ x, it holds that ‖πt‖ ≤ Π.

We emphasize that Lemma F.1 is true regardless of the model used (PVDMmean, PVDMconcat, PVDBOW), even though this bound can be strengthened for specific models. Moreover, it only depends on the P and R matrices, which are fixed after training.

Proof. Recall that we defined πt = RPht. For PVDBOW, ht = 0 and there is nothing to prove. Otherwise, let us first write ‖πt‖ = ‖RPht‖ ≤ σmax(R) ‖Pht‖ and focus on ‖Pht‖. Let us assume that we work with PVDMconcat. Since, in that case, ht is the concatenation of 2ν arbitrary one-hot vectors, Pht is the sum of 2ν arbitrary columns of P. Using the triangle inequality, we deduce that ‖Pht‖ is smaller than 2ν times the largest norm of a column of P. When PVDMmean is used, the reasoning is similar: ignoring the 1/(2ν) factor (which we consider to be part of P), the bound is the same.

Since the matrix R appears in all the definitions of the embeddings, one needs some (mild) assumptions on R. The first one ensures that the condition number of R is finite.

Assumption F.2 (Condition number of R). Let R ∈ ℝ^{D×d}. We assume that Im(R) ⊂ 1^⊥, and further that the smallest singular value of R is positive, that is, σmin(R) > 0.

The requirement on the range of R is needed in order to work in the setting of Lemmas H.6 and H.8, and then to use the nice bounds for the (local) Lipschitz constants of the softmax and its Jacobian.

Lemma F.3. Suppose that A.F.2 holds. Then Ψlin satisfies A.D.1.

Proof. Recall that S denotes the set of modified words. Coming back to the definitions of F and G, we see that, when forming the difference F − G, many cancellations happen. To be more precise, replacing a word at position t only modifies πs for s belonging to the neighborhood of t. Thus

G(q) − F(q) = Σ_{t∈E} ( ψ_{x̃t}(π̃t + Rq) − ψ_{xt}(πt + Rq) ),   (36)

where E ⊂ { s ∈ [T] : |s − t| ≤ ν for some t ∈ S }. In particular, there is a numerical constant ℓ > 0 such that |E| ≤ ℓν|S|. From the definition of Ψlin, Eq. (36), and Lemma H.1, we deduce that

∇Ψlin(µ, q) = R^⊤ ( (µ/T) Σ_{t∈E} ( σ(π̃t + Rq) − σ(πt + Rq) − 1_{x̃t} + 1_{xt} ) + (1/T) Σ_{t∈x} ( σ(πt + Rq) − 1_{xt} ) ),

∂µ∇Ψlin(µ, q) = R^⊤ (1/T) Σ_{t∈E} ( σ(π̃t + Rq) − σ(πt + Rq) − 1_{x̃t} + 1_{xt} )
= R^⊤ (1/T) Σ_{t∈E} ( ∫_0^1 ∇σ( πt + u(π̃t − πt) + Rq )(π̃t − πt) du − 1_{x̃t} + 1_{xt} ),

∇²Ψlin(µ, q) = R^⊤ ( (1/T) Σ_{t∈x} ( µ ∇σ(π̃t + Rq) + (1 − µ) ∇σ(πt + Rq) ) ) R,   (37)

with the convention π̃t = πt and x̃t = xt for t ∉ E, and where we recall that ∇σ = diag(σ) − σσ^⊤. Hence ∇²Ψlin(µ, ·) is a symmetric non-negative matrix and Ψlin satisfies A.D.1.
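To make the bound of Lemma F.1 concrete, here is a minimal numerical sketch for a PVDMconcat-style context vector. The matrices R and P are random stand-ins for the trained matrices, and the sizes (D, d, ν) are purely illustrative.

import numpy as np

rng = np.random.default_rng(2)
D, d, nu = 50, 10, 2                     # vocabulary size, embedding dim, half window
R = rng.normal(size=(D, d))
P = rng.normal(size=(d, 2 * nu * D))     # maps the concatenated one-hot context to R^d
# Pi = 2 * nu * sigma_max(R) * sup_i ||P[:, i]||
Pi = 2 * nu * np.linalg.svd(R, compute_uv=False)[0] * np.linalg.norm(P, axis=0).max()
# h_t: concatenation of 2*nu one-hot vectors (one word per context slot)
h = np.zeros(2 * nu * D)
for slot, word in enumerate(rng.integers(D, size=2 * nu)):
    h[slot * D + word] = 1.0
pi_t = R @ (P @ h)
print(np.linalg.norm(pi_t) <= Pi)        # True: ||pi_t|| <= Pi

Since Pht is a sum of 2ν columns of P, the triangle inequality gives exactly the bound Π, whatever the choice of context words.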
Next, we show that Ψlin satisfies A.E.1.

Lemma F.4. Suppose that A.F.2 holds. Then, for all µ ∈ [0, 1] and all q ∈ ℝ^d, Ker( ∇²Ψlin(µ, q) ) = {0}, and Ψlin satisfies A.E.1 with N0 = d. Recall that λ1 denotes the smallest non-zero eigenvalue of the Hessian of Ψlin. Then, for all (µ, q) ∈ [0, 1] × ℝ^d, it holds that

λ1(µ, q) ≥ (1/D) σmin(R)² e^{−2√2 Π} e^{−2√2 σmax(R) ‖q‖}.

Proof. Let us recall from Lemma H.5 the definition of λmin: for z ∈ ℝ^D,

λmin(z) = min( Spec( diag(σ(z)) − σ(z)σ(z)^⊤ ) \ {0} ).

For q, y ∈ ℝ^d, since Ry ∈ 1^⊥ (thanks to A.F.2), the minimax theorem allows us to write (using Eq. (37))

⟨∇²Ψlin(µ, q)y, y⟩ = (µ/T) Σ_{t∈x} ⟨∇σ(π̃t + Rq) Ry, Ry⟩ + ((1−µ)/T) Σ_{t∈x} ⟨∇σ(πt + Rq) Ry, Ry⟩
≥ (1/T) Σ_{t∈x} ( µ λmin(π̃t + Rq) + (1−µ) λmin(πt + Rq) ) ‖Ry‖².

Here we have crucially used A.F.2, and in particular the facts that Im(R) ⊂ 1^⊥ and that πt ∈ 1^⊥, in order to make λmin appear. Let us set σ_(1)(z) := min_{i∈[D]} σi(z). Thanks to Lemma H.5, one has

⟨∇²Ψlin(µ, q)y, y⟩ ≥ (1/T) Σ_{t∈x} ( µ D σ_(1)(π̃t + Rq)² + (1−µ) D σ_(1)(πt + Rq)² ) ‖Ry‖².

Furthermore, thanks to Theorem H.9 (and the bound √(D/(D−1)) ≤ √2), we have

⟨∇²Ψlin(µ, q)y, y⟩ ≥ (1/(TD)) Σ_{t∈x} ( µ e^{−2√2 ‖π̃t + Rq‖} + (1−µ) e^{−2√2 ‖πt + Rq‖} ) σmin(R)² ‖y‖²
≥ (1/D) σmin(R)² e^{−2√2 Π} e^{−2√2 σmax(R) ‖q‖} ‖y‖²,

where we remind that Π is defined in Lemma F.1, and where we used ‖πt + Rq‖ ≤ Π + σmax(R)‖q‖. This implies that Ker( ∇²Ψlin(µ, q) ) = {0}, that Ψlin fulfills A.E.1 with N0 = d, and that

λ1(µ, q) ≥ (1/D) σmin(R)² e^{−2√2 Π} e^{−2√2 σmax(R) ‖q‖}.   (38)

Next, we show that Ψlin satisfies A.E.2.

Lemma F.5 (Local Lipschitz continuity of Ψlin). Suppose that A.F.2 holds. Then Ψlin satisfies A.E.2 with

Γ_{-1} = D e^{2√2 Π} / σmin(R)²,  Γ0 = 4ℓν σmax(R) |S|/T,  Γ1 = 8 e^{6√2 Π} √(D−1) σmax(R)³,  Γ2 = 4ℓν Π e^{4√2 Π} √(D−1) σmax(R)² |S|/T,

and γ_{-1} = 2√2 σmax(R), γ0 = 0, γ1 = 3√2 σmax(R), and γ2 = 2√2 σmax(R).

Proof. We have, for all µ ∈ [0, 1] and all q ∈ ℝ^d,

‖∂µ∇Ψlin(µ, q)‖ ≤ (σmax(R)/T) Σ_{t∈E} ‖σ(π̃t + Rq) − σ(πt + Rq) − 1_{x̃t} + 1_{xt}‖ ≤ 4 σmax(R) |E|/T ≤ 4ℓν σmax(R) |S|/T,

where we have used the fact that ‖σ‖ ≤ 1; this gives the values of Γ0 and γ0. Thanks to Lemma H.6, σ is locally Lipschitz continuous and, thanks to Lemma H.8, ∇σ is also locally Lipschitz continuous. Hence, for q, q′ ∈ ℝ^d,

‖∂µ∇Ψlin(µ, q) − ∂µ∇Ψlin(µ, q′)‖ ≤ (σmax(R)/T) Σ_{t∈E} ‖ ∫_0^1 ( ∇σ(πt + u(π̃t − πt) + Rq) − ∇σ(πt + u(π̃t − πt) + Rq′) )(π̃t − πt) du ‖
≤ 4ℓν Π √(D−1) σmax(R)² e^{2√2(2Π + σmax(R)(‖q‖ ∨ ‖q′‖))} (|S|/T) ‖q − q′‖,

where we have used Lemma H.6 and the bound √(D/(D−1)) ≤ √2; this gives the values of Γ2 and γ2. Finally, let us remark that

‖∇²Ψlin(µ, q) − ∇²Ψlin(µ, q′)‖_op ≤ (µ/T) Σ_{t∈x} ‖R^⊤( ∇σ(π̃t + Rq) − ∇σ(π̃t + Rq′) )R‖_op + ((1−µ)/T) Σ_{t∈x} ‖R^⊤( ∇σ(πt + Rq) − ∇σ(πt + Rq′) )R‖_op
≤ 8 e^{6√2 Π} √(D−1) σmax(R)³ e^{3√2 σmax(R)(‖q‖ ∨ ‖q′‖)} ‖q − q′‖,

which gives the values of Γ1 and γ1. Finally, Eq. (38) directly gives

w1(ρ) ≥ (1/D) σmin(R)² e^{−2√2 Π} e^{−2√2 σmax(R) ρ},

that is, w1(ρ)^{-1} ≤ Γ_{-1} e^{γ_{-1} ρ}, which concludes the proof.

Next, we show that ‖q0‖ is not too large.

Lemma F.6 (Bound on ‖q0‖). Suppose that A.F.2 holds. Then ‖q0‖ ≤ √2 σmax(R)/α.

We illustrate Lemma F.6 in practice in Section I.2. The key idea behind the proof is that the regularization term α prevents q from escaping to infinity.

Proof. Let us recall that

q0 = arg min_{q∈ℝ^d} { (1/T) Σ_{t∈x} ψ_{xt}(πt + Rq) + (α/2)‖q‖² }.

In view of Lemma F.3, q0 is the unique solution of the following equation:

R^⊤ (1/T) Σ_{t∈x} ( σ(πt + Rq) − 1_{xt} ) + αq = 0.   (39)

From Eq. (39), we deduce that

α ‖q0‖ = ‖ R^⊤ (1/T) Σ_{t∈x} ( σ(πt + Rq0) − 1_{xt} ) ‖.

By definition of σmax(R) and the triangle inequality, this is upper bounded by

(σmax(R)/T) Σ_{t∈x} ‖σ(πt + Rq0) − 1_{xt}‖.   (40)

But we notice that, for any q ∈ ℝ^d and t ∈ x,

‖σ(πt + Rq) − 1_{xt}‖² = Σ_{j ≠ xt} σj(πt + Rq)² + ( σ_{xt}(πt + Rq) − 1 )²
= Σ_j σj(πt + Rq)² + 1 − 2σ_{xt}(πt + Rq) ≤ 2,

where we have used the facts that ‖σ‖ ≤ 1 and σi ≥ 0. Hence each term in Eq. (40) is upper bounded by √2. Keeping in mind that the summation over t ∈ x has at most T terms, we deduce the result.
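The elementary inequality used in the proof of Lemma F.6, namely ‖σ(z) − 1_i‖² ≤ 2 for any logits z and any index i, is easy to check numerically. The following sketch (plain NumPy/SciPy, not the paper's code) samples random logits and confirms it.

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(3)
D, worst = 30, 0.0
for _ in range(10000):
    z = rng.normal(scale=5.0, size=D)
    e_i = np.zeros(D)
    e_i[rng.integers(D)] = 1.0               # one-hot vector 1_i for a random index i
    worst = max(worst, float(np.sum((softmax(z) - e_i) ** 2)))
print(worst, worst <= 2.0)                   # the squared distance never exceeds 2

The bound 2 comes from Σ_j σ_j² ≤ 1 together with −2σ_i ≤ 0, which is why the final bound on ‖q0‖ carries a √2 factor.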
We are now ready to apply the case c > 0 of Theorem G.1 to obtain the promised quantitative bounds.

Theorem F.7 (Quantitative bounds for doc2vec embeddings). Let Ψlin be defined as in Eq. (12), and suppose that A.F.2 holds. Define

A := 4ℓν D σmax(R) e^{2√2 Π} / σmin(R)²,
B := 64ℓν D ( σmax(R)²/σmin(R)² ) e^{10√2 Π} ( D√(D−1) σmax(R)²/σmin(R)² + Π √(D−1) ),
C := 7√2 σmax(R).

Suppose that

|S|/T ≤ (1/(2B)) exp( −2(AC + 1) e^{√2 C σmax(R)/α} ).   (41)

Then

sup_{µ∈[0,1]} ‖q(µ) − q0‖ ≤ 2A e^{C‖q0‖} (|S|/T).   (42)

Proof. Remark that Ψlin satisfies A.D.1 (Lemma F.3), A.E.1 (Lemma F.4), and A.E.2 (Lemma F.5). Therefore the assumptions of Theorem E.3 are in force. Let us note that

Γ_{-1}Γ0 = A |S|/T,  2Γ_{-1}( Γ_{-1}Γ1Γ0 + Γ2 ) ≤ B |S|/T,  and  γ_{-1} + γ0 + γ1 + γ2 = C.

Remark that, in this setting, thanks to Lemma F.6, ‖q0‖ ≤ √2 σmax(R)/α. We also have the following straightforward bounds: |S|/T ≤ 1 and C‖q0‖ ≤ e^{C‖q0‖} ≤ e^{√2 C σmax(R)/α}. Using Eq. (41), one necessarily has

2B |S|/T ≤ e^{−2(AC+1) e^{√2 C σmax(R)/α}} ≤ e^{−2C( ‖q0‖ + A (|S|/T) e^{C‖q0‖} )}.

This guarantees Eq. (33), and one can apply Theorem E.3, which yields the desired result.

G. Grönwall-Bahouri-Bellman type result

In this section, we collect all results related to ODEs. In our setting, as seen in Lemma D.6 and in view of A.E.2, the coefficients of Eq. (22) are not globally Lipschitz (although they are locally Lipschitz). Thus, while local existence and uniqueness of solutions to Eq. (22) is a given (small-µ regime), existence up to time 1 and non-explosion of the solutions is much more challenging to establish (large-µ regime). Unfortunately, this is the regime that we are interested in: the local behavior of the ODE at µ = 0 does not tell us anything interesting, since what we aim at is the comparison between the starting point (µ = 0) and the final point (µ = 1) of the dynamics. Our strategy is to use an ad hoc extension of the Grönwall-Bahouri-Bellman lemma to deal with our specific setting. Our approach is inspired by proofs of Grönwall-Bellman-Bahouri type lemmas, see for example Dannan (1985); Agarwal et al. (2005); Kim (2009); Pachpatte (2004). It relies on an explicit integration of the integral inequality which pops up in the computations. Note that, instead of generic local constants L, M, and w1, and in view of Section F, we suppose that all these quantities are locally bounded by exponential functions. Our derivation is very close to that of the Pachpatte inequality (Pachpatte, 2004), but here we keep track of the constants. In doing so, we gain an explicit criterion for non-explosion of the solutions up to time µ = 1. For other applications of non-explosion of the time-one map, one can consult Bailleul & Catellier (2020) and the references therein.

Theorem G.1 (Grönwall-Bahouri-Bellman type inequality). Let Λ : [0, 1] × ℝ^d → ℝ^d be a continuous function, and let a, b > 0 and c ≥ 0 be numerical constants such that, for all q, q′ ∈ ℝ^d,

sup_{µ∈[0,1]} ‖Λ(µ, q)‖ ≤ a e^{c‖q‖},   (43)

and

sup_{µ∈[0,1]} ‖Λ(µ, q) − Λ(µ, q′)‖ ≤ b e^{c(‖q‖ + ‖q′‖)} ‖q − q′‖.   (44)

Let q0 ∈ ℝ^d be such that either c = 0 or

2b < exp( −2c ( ‖q0‖ + a e^{c‖q0‖} ) ).   (45)

Then there exists a unique function q : [0, 1] → ℝ^d such that q(0) = q0 and, for all µ ∈ [0, 1],

q′(µ) = Λ(µ, q(µ)).   (46)

Furthermore, for all µ ∈ [0, 1],

‖q(µ) − q0‖ ≤ 2µ a e^{c‖q0‖}  if c > 0,  and  ‖q(µ) − q0‖ ≤ µ a e^{b}  if c = 0.

Proof. Step 1: existence of the map satisfying (46). Since Λ is locally Lipschitz continuous, thanks to the Cauchy-Lipschitz/Picard-Lindelöf theorem (see Arnold (1978, Chapter 2)), there exists an open interval I of [0, 1] containing 0 and a unique function q : I → ℝ^d solving Eq. (46) with q(0) = q0.
Note that an open interval of [0, 1] which contains 0 is necessarily of the form [0, τ) with 0 < τ ≤ 1 or [0, 1]. Remark also that, thanks to the regularity assumption on Λ, the map µ ↦ Λ(µ, q(µ)) is continuous on I, and for µ ∈ I the following integral equation is satisfied:

q(µ) = q0 + ∫_0^µ Λ(µ̃, q(µ̃)) dµ̃.

Step 2: integral inequality. Taking the norm in the previous display and using the triangle inequality, we see that

‖q(µ) − q0‖ ≤ ∫_0^µ ‖Λ(µ̃, q0)‖ dµ̃ + ∫_0^µ ‖Λ(µ̃, q(µ̃)) − Λ(µ̃, q0)‖ dµ̃.

Using our assumptions on Λ, namely Eqs. (43) and (44), we obtain

‖q(µ) − q0‖ ≤ µ a e^{c‖q0‖} + b ∫_0^µ e^{c(‖q(µ̃)‖ + ‖q0‖)} ‖q(µ̃) − q0‖ dµ̃.

Since ‖q(µ̃)‖ ≤ ‖q0‖ + ‖q(µ̃) − q0‖, we deduce that

‖q(µ) − q0‖ ≤ µ a e^{c‖q0‖} + b e^{2c‖q0‖} ∫_0^µ e^{c‖q(µ̃) − q0‖} ‖q(µ̃) − q0‖ dµ̃.

Let us define, for all µ ∈ I,

Q(µ) := a e^{c‖q0‖} + b e^{2c‖q0‖} (1/µ) ∫_0^µ e^{c‖q(µ̃) − q0‖} ‖q(µ̃) − q0‖ dµ̃  if µ > 0,  and  Q(0) := a e^{c‖q0‖}.

Note that (1/µ) ∫_0^µ e^{c‖q(µ̃) − q0‖} ‖q(µ̃) − q0‖ dµ̃ → e^{c‖q(0) − q0‖} ‖q(0) − q0‖ = 0 as µ → 0, so that Q is continuous at µ = 0. Furthermore, for µ ∈ I \ {0},

Q′(µ) = −(1/µ²) b e^{2c‖q0‖} ∫_0^µ e^{c‖q(µ̃) − q0‖} ‖q(µ̃) − q0‖ dµ̃ + (1/µ) b e^{2c‖q0‖} e^{c‖q(µ) − q0‖} ‖q(µ) − q0‖.   (47)

With this notation in hand, for any µ ∈ I \ {0}, ‖q(µ) − q0‖ ≤ µ Q(µ). Since we restrict our attention to µ ≤ 1, we have

‖q(µ) − q0‖ ≤ Q(µ).   (48)

Step 3: differential inequality. Since x ↦ x e^{cx} is a non-decreasing function, we have, for µ ∈ I \ {0},

Q′(µ) ≤ b e^{2c‖q0‖} ( ‖q(µ) − q0‖ / µ ) e^{c‖q(µ) − q0‖} ≤ b e^{2c‖q0‖} Q(µ) e^{c Q(µ)},   (49)

where we used Eq. (47) for a direct bound on the derivative and Eq. (48) to obtain the last display.

Step 4: Cauchy-Lipschitz setting (c = 0). Suppose for a moment that c = 0, so that we are in the standard setting of the global Cauchy-Lipschitz/Picard-Lindelöf theorem and of the standard Grönwall lemma. We then have I = [0, 1] and, for µ ∈ I,

‖q(µ) − q0‖ ≤ µ Q(µ) ≤ a µ e^{bµ}.

Step 5: Grönwall-Bahouri-Bellman integration (c > 0). Suppose now that c > 0. Let us remark that, for all x ≥ 0, x e^{cx} ≤ (1/c) e^{2cx}, so that

Q′(µ) ≤ (b/c) e^{2c‖q0‖} e^{2c Q(µ)}.   (50)

Multiplying both sides of Eq. (50) by e^{−2c Q(µ)}, one recognizes (up to constants) the derivative of −e^{−2c Q}. Integrating from 0 to µ, we have proved that

e^{−2c a e^{c‖q0‖}} − e^{−2c Q(µ)} ≤ 2 b e^{2c‖q0‖} µ.   (51)

When

e^{−2c a e^{c‖q0‖}} − 2 b e^{2c‖q0‖} µ > 0,   (52)

this yields

e^{c Q(µ)} ≤ ( e^{−2c a e^{c‖q0‖}} − 2 b e^{2c‖q0‖} µ )^{-1/2}.   (53)

Furthermore, whenever Eq. (45) holds (namely, Eq. (52) is true for all µ ∈ [0, 1]), we can take I = [0, 1], since Eq. (53) guarantees that Q does not explode. For µ ∈ I \ {0} satisfying Eq. (52), using the previous bound and Eq. (49), we have

Q′(µ) ≤ b e^{2c‖q0‖} Q(µ) ( e^{−2c a e^{c‖q0‖}} − 2 b e^{2c‖q0‖} µ )^{-1/2}.

Dividing by Q(µ) and integrating, we get

log Q(µ) − log Q(0) ≤ ∫_0^µ b e^{2c‖q0‖} ( e^{−2c a e^{c‖q0‖}} − 2 b e^{2c‖q0‖} µ̃ )^{-1/2} dµ̃.

Therefore, for all µ satisfying Eq. (52),

‖q(µ) − q0‖ ≤ µ Q(µ) ≤ µ a e^{c‖q0‖} exp( ∫_0^µ b e^{2c‖q0‖} ( e^{−2c a e^{c‖q0‖}} − 2 b e^{2c‖q0‖} µ̃ )^{-1/2} dµ̃ )
= µ a e^{c‖q0‖} exp( ( e^{−2c a e^{c‖q0‖}} )^{1/2} − ( e^{−2c a e^{c‖q0‖}} − 2 b e^{2c‖q0‖} µ )^{1/2} )
≤ µ a e^{c‖q0‖} exp( ( 2 b e^{2c‖q0‖} µ )^{1/2} ),   (54)

where we have used that, for 0 ≤ v′ < v, √v − √v′ ≤ √(v − v′). Note that Eq. (54) makes sense since Eq. (52) is satisfied. Finally, one can use Eq. (52) and write

2 b e^{2c‖q0‖} µ ≤ 2 b e^{2c‖q0‖} ≤ e^{−2ac e^{c‖q0‖}} ≤ 1,

which gives ‖q(µ) − q0‖ ≤ 2µ a e^{c‖q0‖}, which is the wanted result.

Remark G.2 (Improving Theorem F.7). There are several open avenues to improve Theorem F.7. One possibility is to keep a finer dependency in µ when bounding ‖q(µ) − q0‖ (namely, keeping the µ factor when deriving Eq. (48)). A second possible improvement is to use a finer inequality than y ≤ e^y when deriving Eq. (50). Unfortunately, in both cases we were unable to integrate these more complicated expressions in a tractable form (in the derivation leading to Eq. (51)).
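To illustrate Theorem G.1, the sketch below integrates q′(µ) = Λ(µ, q(µ)) for a toy Λ built so that the growth condition (43) holds with equality (and (44) with b = ac), and compares ‖q(µ) − q0‖ with the reconstructed bound 2µ a e^{c‖q0‖} of the c > 0 case. The specific Λ, a, c, and q0 below are illustrative choices, not quantities from the paper.

import numpy as np
from scipy.integrate import solve_ivp

a, c = 0.05, 0.1
q0 = np.array([0.5, -0.2])

def Lam(mu, q):
    # ||Lam(mu, q)|| = a * exp(c * ||q||), so Eq. (43) holds with equality
    return a * np.exp(c * np.linalg.norm(q)) * np.array([np.cos(mu), np.sin(mu)])

sol = solve_ivp(Lam, (0.0, 1.0), q0, dense_output=True, rtol=1e-8)
for mu in np.linspace(0.25, 1.0, 4):
    drift = np.linalg.norm(sol.sol(mu) - q0)
    bound = 2 * mu * a * np.exp(c * np.linalg.norm(q0))
    print(f"mu={mu:.2f}  ||q(mu)-q0||={drift:.4f}  bound={bound:.4f}")

With these constants, condition (45) is comfortably satisfied, and the observed drift stays well below the bound, as expected.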
H. Technical results related to the softmax function

In this section, we collect all technical facts related to the softmax function used throughout the proofs. Let us recall that we defined the softmax function from ℝ^D to ℝ^D as σ(x) = (σ1(x), …, σD(x)), where, for all i ∈ [D],

σi(x) = e^{xi} / Σ_{j=1}^{D} e^{xj}.

We also defined, for all x ∈ ℝ^D and all i ∈ [D], ψi(x) = −log(σi(x)).

H.1. Basics on the softmax function

We start by recalling elementary properties of the softmax function.

Lemma H.1 (Softmax derivatives). We have

∂σk(x)/∂xℓ = σk(x)(1 − σk(x)) if k = ℓ, and −σk(x)σℓ(x) otherwise.

In a more concise way, ∇σ(x) = diag(σ(x)) − σ(x)σ(x)^⊤.

A straightforward consequence of Lemma H.1 is the computation of the first derivatives of ψi (these are very standard computations, see for instance Propositions 1 and 2 in Gao & Pavel (2017)).

Lemma H.2 (Gradient of ψi). We have

∂ψi(x)/∂xk = −1 + σk(x) if k = i, and σk(x) otherwise.

In more concise notation, ∇ψi = σ − 1i.

Similarly, we have:

Lemma H.3 (Hessian of ψi). We have

∂²ψi(x)/∂xk∂xℓ = σk(x)(1 − σk(x)) if k = ℓ, and −σk(x)σℓ(x) otherwise.

In more concise notation, ∇²ψi = ∇σ = diag(σ) − σσ^⊤.   (55)

Corollary H.4 (Convexity of log-softmax). For any i ∈ [D], the function ψi is convex.

The proof of this fact relies on the Courant minimax theorem, which characterizes the eigenvalues of a real symmetric matrix, together with the fact that a function whose Hessian is a non-negative symmetric matrix is convex.

Proof. Let x, v ∈ ℝ^D. Since Σ_i σi(x) = 1, we have

⟨∇²ψi(x)v, v⟩ = Σ_k σk(x) v_k² − ( Σ_ℓ σℓ(x) vℓ )² ≥ Σ_k σk(x) v_k² − Σ_k σk(x) v_k² = 0,

where the inequality is Jensen's inequality applied to the probability weights σ(x). Hence, thanks to the Courant minimax principle, all the eigenvalues of ∇²ψi are non-negative. Therefore ∇²ψi is a non-negative symmetric matrix, and ψi is a convex function.

The following result controls the spectrum of the Hessian of ψi, that is, of the Jacobian of the softmax, in terms of the minimal and maximal values of the softmax function.

Lemma H.5 (Spectrum of the softmax Jacobian). For x ∈ ℝ^D, let us define

σ_(1)(x) := min_{i∈[D]} σi(x)  and  σ_(D)(x) := max_{i∈[D]} σi(x),

as well as

λmin(x) := min( Spec(∇σ(x)) \ {0} )  and  λmax(x) := max Spec(∇σ(x)).

Then

D σ_(1)(x)² ≤ λmin(x) ≤ λmax(x) ≤ D σ_(D)(x)².

Proof. According to Lemma H.3,

∇σ(x) = diag(σ(x)) − σ(x)σ(x)^⊤.

This matrix is symmetric and, according to Corollary H.4, its eigenvalues are non-negative real numbers. Since Σ_i σi(x) = 1, one has ∇σ(x)1 = 0, where, as before, 1 = (1, …, 1)^⊤. Since σi(x) ≠ 0 for all i ∈ [D], if v ∈ Ker(∇σ(x)) then necessarily, for all i ∈ [D], vi σi(x) − σi(x) Σ_j σj(x)vj = 0, and v = v1 1. Hence Im(∇σ(x)) = 1^⊥ and Ker(∇σ(x)) = Vect(1). Using the Courant minimax characterization of eigenvalues, we have

λmin(x) = min_{v⊥1, ‖v‖=1} ⟨∇σ(x)v, v⟩  and  λmax(x) = max_{v⊥1, ‖v‖=1} ⟨∇σ(x)v, v⟩.   (56)

Note then that, for v ⊥ 1 (and dropping the x dependency),

v^⊤ ∇σ(x) v = Σ_k σk v_k² − ( Σ_k σk v_k )².

The Cauchy-Schwarz inequality guarantees that this quantity is non-negative, but this is not enough to conclude. We resort to the four-letter identity (Steele (2004, Exercise 3.7); see also Garreau & Mardaoui (2021, Proposition 13)) to write

v^⊤ ∇σ(x) v = (1/2) Σ_{k,ℓ} σk σℓ ( v_k − v_ℓ )².

Since σ_(1)(x) ≤ σk(x) ≤ σ_(D)(x) for all k, and since, for v ⊥ 1 with ‖v‖ = 1, (1/2) Σ_{k,ℓ} (v_k − v_ℓ)² = D‖v‖² − (Σ_k v_k)² = D, we deduce

D σ_(1)(x)² ≤ v^⊤ ∇σ(x) v ≤ D σ_(D)(x)²,

which, combined with Eq. (56), concludes the proof.

Theorem H.9. Let ρ > 0 and D ≥ 2. We have

min_{i∈[D]} inf_{‖x‖≤ρ, Σ_j xj = 0} σi(x) = 1 / ( 1 + (D−1) e^{√(D/(D−1)) ρ} )

and

max_{i∈[D]} sup_{‖x‖≤ρ, Σ_j xj = 0} σi(x) = 1 / ( 1 + (D−1) e^{−√(D/(D−1)) ρ} ).
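The spectrum bounds of Lemma H.5 can be checked numerically in a few lines. The sketch below (plain NumPy/SciPy, not the paper's code) builds the softmax Jacobian at random logits and verifies D σ_(1)² ≤ λmin ≤ λmax ≤ D σ_(D)².

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(4)
D = 8
x = rng.normal(size=D)
s = softmax(x)
J = np.diag(s) - np.outer(s, s)              # Jacobian of the softmax at x (Lemma H.1)
eig = np.sort(np.linalg.eigvalsh(J))
lam_min_pos, lam_max = eig[1], eig[-1]       # eig[0] is (numerically) 0: the kernel is Vect(1)
print(D * s.min() ** 2 <= lam_min_pos <= lam_max <= D * s.max() ** 2)   # True

The zero eigenvalue with eigenvector 1 is exactly the kernel identified in the proof, which is why the smallest nonzero eigenvalue is read off at index 1.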
Remark H.10 (Bounding the softmax). The softmax function is ubiquitous in machine learning, and many bounds can be found in the literature (Wei et al., 2023). Generally, these bounds are pointwise, and not applicable in our case, since we need a global bound on the ball of radius ρ (with the additional constraint Σ_j xj = 0 coming from our algebraic assumption).

Proof. Step 1: the infimum is achieved and is invariant by permutation. For any x ∈ ℝ^D such that ‖x‖ ≤ ρ, σi(x) ∈ (0, 1) for all i ∈ [D]. Furthermore,

∇σi(x) = σi(x) 1i − σi(x) σ(x),

where we remind that (11, …, 1D) is the canonical basis of ℝ^D. Hence ∇σi(x) ≠ 0 and the infimum over the ball is achieved on the sphere. Note that

B0(ρ) := { x ∈ ℝ^D : ‖x‖ = ρ, Σ_j xj = 0 }

is a compact set, and the infimum is a minimum. Consider i0 ∈ [D] and y ∈ B0(ρ) a joint minimizer, that is, such that

σ_{i0}(y) = min_{i∈[D]} min_{x∈B0(ρ)} σi(x).   (59)

Remark that Eq. (59) is invariant by permutation, i.e., for any permutation τ : [D] → [D], we have

σ_{τ(i0)}(τ·y) = σ_{i0}(y) = min_{i∈[D]} min_{x∈B0(ρ)} σi(x),

where τ·y = (y_{τ(i)})_{i∈{1,…,D}}. Hence one can suppose without loss of generality that i0 = 1.

Step 2: except at i0, the coordinates of a minimizer take equal values under z ↦ z e^{−z}. In this setting, since σ1(y) ≤ σi(y) for all i ∈ {2, …, D}, we have y1 ≤ yi. Using the fact that Σ_j yj = 0 and summing the previous inequality over i ∈ [D], one gets y1 ≤ 0. Note that in fact y1 < 0: indeed, if y1 = 0, then yi = 0 for all i ∈ [D] and ‖y‖ = 0 ≠ ρ. We are in the setting of a minimization problem under constraints, namely y solves

minimize σ1(x) subject to ‖x‖² = ρ², ⟨x, 1⟩ = 0.

Using the Lagrange multiplier theorem, there exist α, β ∈ ℝ such that, at the solution y,

∇σ1(y) + α ∇( ‖·‖² − ρ² )(y) + β ∇⟨·, 1⟩(y) = 0,

which translates into

σ1(y) − σ1(y)² + 2αy1 + β = 0,  and  −σ1(y)σi(y) + 2αyi + β = 0 for i ∈ {2, …, D}.

Remark that β = 0 and α ≠ 0. Indeed, summing all these equalities and using Σ yi = 0 and Σ σi = 1, one gets Dβ = 0, hence β = 0. Recall that y1 < 0; since σ1(y) − σ1(y)² + 2αy1 = 0, if α = 0 then σ1(y)(1 − σ1(y)) = 0, which is impossible since σ1(y) ∈ (0, 1). Hence α ≠ 0. We also have, for all i, j ∈ {2, …, D},

σ1(y)σi(y) = 2αyi  and  σ1(y)σj(y) = 2αyj.

Using the fact that α ≠ 0, this implies that yi e^{−yi} = y2 e^{−y2} for all i ∈ {2, …, D}. Combining this with Σ_i yi = 0 and σ1(y) = e^{y1} / Σ_i e^{yi}, a short computation then gives, for all i ∈ {2, …, D},

yi e^{−yi} = y2 e^{−y2} = −y1 e^{−y1} σ1(y) / ( 1 − σ1(y) ).   (60)

Step 3: expression of the minimum in terms of the solutions of z e^{−z} = c. Since y1 < 0, equality (60) implies that yi > 0 for i ∈ {2, …, D}. For any 0 < c < e^{−1}, the equation z e^{−z} = c has exactly two solutions, which we call 0 < y−(c) < 1 < y+(c). For 2 ≤ i ≤ D, each yi is thus equal to y− or y+ evaluated at c = −y1 e^{−y1} σ1(y)/(1 − σ1(y)); let n denote the number of indices i ∈ {2, …, D} such that yi = y−. By definition of n, we necessarily have

σ1(y) = e^{y1} / ( e^{y1} + n e^{y−} + (D−1−n) e^{y+} ).   (61)

Recall that Σ_j yj = 0 and ‖y‖ = ρ, hence

y1 + n y− + (D−1−n) y+ = 0,   (62)
y1² + n y−² + (D−1−n) y+² = ρ².   (63)

When n = D−1, one can solve these equations: y1 = −ρ √((D−1)/D), yj = ρ / √(D(D−1)) for all j ∈ {2, …, D}, and

σ1(y) = 1 / ( 1 + (D−1) e^{√(D/(D−1)) ρ} ).

Since the problem here is symmetric in y− and y+, one can suppose that 1 ≤ n ≤ D−2. Rewriting Eq. (62), we obtain

y+ = −( n/(D−1−n) ) y− − ( 1/(D−1−n) ) y1.

Replacing y+ by the right-hand side of the previous display in Eq. (63), we obtain

( n + n²/(D−1−n) ) y−² + 2 ( n/(D−1−n) ) y1 y− − ( ρ² − y1² ( 1 + 1/(D−1−n) ) ) = 0.

Dividing by n + n²/(D−1−n), we get

y−² + ( 2/(D−1) ) y1 y− − ( (D−1−n)/(n(D−1)) ) ρ² + ( (D−n)/(n(D−1)) ) y1² = 0.

We can see the previous display as a quadratic equation in y−, which we now solve.
There exists ε ∈ {−1, 1} such that

y− = ( −y1 + ε √((D−1−n)/n) Δ(y1) ) / (D−1),

where Δ(y1) := √( (D−1)ρ² − D y1² ). Note that in this setting one necessarily has

−√((D−1)/D) ρ ≤ y1 ≤ 0,   (64)

since we have already seen that the constrained minimization problem has a solution, y− and y+ exist, and 1 ≤ n ≤ D−2. When this condition is not satisfied, necessarily n = D−1 or n = 0, cases which have already been treated. Finally, using the fact that y+ = −( y1 + n y− )/(D−1−n), we get

y+ = ( −y1 − ε √(n/(D−1−n)) Δ(y1) ) / (D−1),

and since y+ > y−, we have −ε √(n/(D−1−n)) > ε √((D−1−n)/n), so that ε = −1, i.e.,

y− = ( −y1 − √((D−1−n)/n) Δ(y1) ) / (D−1),   (65)
y+ = ( −y1 + √(n/(D−1−n)) Δ(y1) ) / (D−1).   (66)

Taking a step back, we have managed to express all coordinates as explicit functions of y1.

Step 4: closed-form expression of the minimum. Replacing y− and y+ in Eq. (61) by the expressions obtained in Eqs. (65) and (66), we are left with the study of the one-dimensional function

g(y1) = e^{y1} / ( e^{y1} + n e^{y−(y1)} + (D−1−n) e^{y+(y1)} )

over the range (64). Using the fact that y ↦ Δ(y) is increasing on ℝ−, together with elementary monotonicity properties of the auxiliary exponential mappings appearing in the denominator, this study shows that the smallest value is obtained at y1 = −√((D−1)/D) ρ, which gives

1 / ( 1 + (D−1) e^{√(D/(D−1)) ρ} ).   (67)

Thus equality in the key bound is reached for y = ( −√((D−1)/D) ρ, ρ/√(D(D−1)), …, ρ/√(D(D−1)) ), with value given by Eq. (67).

Step 5: proof for the maximum. Following the same reasoning as in the previous steps, we show that the maximum is reached, for the coordinate σ1, at the point ( √((D−1)/D) ρ, −ρ/√(D(D−1)), …, −ρ/√(D(D−1)) ), and we get the wanted result.

I. Additional experimental results

In this section we collect additional experimental results.

I.1. Illustration of Theorem 5.1 with another implementation

In Figures 7 and 8, we present a replication of the experiment presented in Section 5.2 of the main paper. This time, we used the gensim implementation of the doc2vec model. The main difference is the use of hierarchical softmax instead of softmax. Despite this difference, the empirical results remain consistent with our theoretical claims and with the experimental results obtained with an ad hoc implementation. We conjecture that the hierarchical softmax has algebraic properties similar to those of the softmax, in particular kernel stability, which would justify conducting the same analysis.

Figure 7. Influence of the length of the document with the gensim implementation of doc2vec. We increase the length of a document by considering the first words of 3 IMDB examples and replace 5 words at random several times for each document length; each panel reports the Euclidean distance to q0 as a function of the length of the document. The dimension of the embedding is d = 50 and the size of the dictionary is D = 23,048.
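For readers wishing to reproduce the flavor of the experiment of Section I.1, the following sketch uses gensim's doc2vec to infer a document embedding, replace a few words at random, re-infer, and measure the Euclidean distance. The tiny toy corpus and all hyperparameters below are illustrative stand-ins, not the IMDB data or the settings used for Figures 7 and 8.

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

rng = np.random.default_rng(7)
corpus_words = [["the", "movie", "was", "great", "and", "fun"],
                ["the", "plot", "was", "dull", "and", "slow"],
                ["acting", "was", "fine", "but", "the", "script", "was", "weak"]]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(corpus_words)]
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1,
                epochs=100, hs=1, negative=0, seed=0)   # hs=1: hierarchical softmax

doc = corpus_words[0]
q0 = model.infer_vector(doc, epochs=200)                # embedding of the original document
vocab = list(model.wv.index_to_key)
for n_repl in (1, 2, 3):
    modified = list(doc)
    for pos in rng.choice(len(doc), size=n_repl, replace=False):
        modified[pos] = vocab[rng.integers(len(vocab))] # random word replacement
    q = model.infer_vector(modified, epochs=200)
    print(n_repl, np.linalg.norm(q - q0))

Since infer_vector is stochastic, the distances fluctuate from run to run; averaging over several inferences gives curves of the kind shown in Figures 7 and 8.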
I.2. Illustration of Lemma F.6

In Figure 9, we illustrate the bound provided by Lemma F.6. We consider the 5 longest examples of the IMDB dataset and create artificial documents of increasing length as before. We observe no asymptotic dependency in T, as predicted by the theoretical result.

Figure 8. Influence of the number of replacements with the gensim implementation of doc2vec, for 3 examples from the IMDB dataset. Each panel reports the Euclidean distance to q0 as a function of the number of replacements. The dimension of the embedding is d = 50 and the size of the dictionary is D = 23,048.

Figure 9. Norm of the original embedding q0 as a function of T, for PVDMmean, PVDMconcat, and PVDBOW.

I.3. Singular values of R

In Figure 10, we empirically check that the singular values of the (learned) R are well-behaved. We consider the matrices from our local model and report the histogram of their singular values in log scale in Figure 10.

Figure 10. Singular values of R, in log scale, for the three models. We observe that σmin(R) > 0.
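The check of Section I.3 amounts to a singular value decomposition of the trained matrix R. The sketch below uses a random Gaussian matrix of the same shape as a stand-in for the learned R (in practice one would load the trained weights instead) and computes the quantities reported in Figure 10.

import numpy as np

rng = np.random.default_rng(8)
D, d = 23048, 50
R = rng.normal(size=(D, d)) / np.sqrt(D)            # stand-in for a trained R of size D x d
sing = np.linalg.svd(R, compute_uv=False)
print(sing.min() > 0)                               # sigma_min(R) > 0, as required by A.F.2
hist, edges = np.histogram(np.log(sing), bins=20)   # log-scale histogram, as in Figure 10
print(hist)

Checking σmin(R) > 0 on the trained matrices is what justifies invoking Assumption F.2 in the experiments.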