Journal of Artificial Intelligence Research 72 (2021) 215-249. Submitted 01/2021; published 10/2021.

A Theoretical Perspective on Hyperdimensional Computing

Anthony Thomas ahthomas@eng.ucsd.edu
Sanjoy Dasgupta dasgupta@eng.ucsd.edu
Tajana Rosing tajana@eng.ucsd.edu
Department of Computer Science, University of California, San Diego, San Diego, CA 92093, USA

Abstract

Hyperdimensional (HD) computing is a set of neurally inspired methods for obtaining high-dimensional, low-precision, distributed representations of data. These representations can be combined with simple, neurally plausible algorithms to effect a variety of information processing tasks. HD computing has recently garnered significant interest from the computer hardware community as an energy-efficient, low-latency, and noise-robust tool for solving learning problems. In this review, we present a unified treatment of the theoretical foundations of HD computing with a focus on the suitability of representations for learning.

1. Introduction

Hyperdimensional (HD) computing is an emerging area at the intersection of computer architecture and theoretical neuroscience (Kanerva, 2009). It is based on the observation that brains are able to perform complex tasks using circuitry that: (1) uses low power, (2) requires low precision, and (3) is highly robust to data corruption. HD computing aims to carry over similar design principles to a new generation of digital devices that are highly energy-efficient, fault tolerant, and well-suited to natural information processing (Rahimi et al., 2018). The wealth of recent work on neural networks also draws its inspiration from the brain, but modern instantiations of these methods have diverged from the desiderata above. The success of these networks has rested upon choices that are not neurally plausible, most notably significant depth and training via backpropagation.
Moreover, from a practical perspective, training these models often requires high precision and substantial amounts of energy. While a large body of literature has sought to ameliorate these issues with neural networks, these efforts have largely been designed to address specific performance limitations. By contrast, the properties above emerge naturally from the basic architecture of HD computing. Hyperdimensional computing focuses on the very simplest neural architectures. Typically, there is a single, static mapping from inputs x to much higher-dimensional neural representations φ(x) living in some space H. All computational tasks are performed in H-space, using simple operations like element-wise additions and dot products. The mapping φ is often taken to be random, and the embeddings have coordinates that have low precision; for instance, they might take values −1 and +1. The entire setup is elementary and lends itself to fast, low-power hardware realizations.

©2021 AI Access Foundation. All rights reserved.

Indeed, a cottage industry has emerged around developing optimized implementations of HD computing based algorithms on hardware accelerators (Imani et al., 2017; Rahimi et al., 2018; Gupta et al., 2018; Schmuck et al., 2019; Salamat et al., 2019; Imani et al., 2019). Broadly speaking, this line of work touts HD computing as an energy-efficient, low-latency, and noise-resilient alternative to conventional realizations of general purpose ML algorithms like support vector machines, multilayer perceptrons, and nearest-neighbor classifiers. While this work has reported impressive performance benefits, there has been relatively little formal treatment of HD computing as a tool for general purpose learning. This review has two broad aims. The first, more modest, goal is to introduce the area of hyperdimensional computing to a machine learning audience.
The second is to develop a particular mathematical framework for understanding and analyzing these models. The recent literature has suggested a variety of different HD architectures that conform to the overall blueprint given above, but differ in many important details. We present a unified treatment of many such architectures that enables their properties to be compared. The most basic types of questions we wish to answer are:

1. How can individual items, sets of items, and sequences of items, be represented and stored in H-space, in a manner that permits reliable decoding?
2. What kinds of noise can be tolerated in H-space?
3. What kinds of structure in the input x-space are preserved by the mapping to H-space?
4. What is the power of linear separators on the φ-representation?

Some of these questions have been introduced in the HD computing literature and studied in isolation (Plate, 2003; Gallant & Okaywe, 2013; Kleyko et al., 2018; Frady et al., 2018). In this work we address these questions formally and in greater generality.

2. Introduction to HD Computing

In the following section we provide an introduction to the fundamentals of HD computing and provide some brief discussion of its antecedents in the neuroscience literature.

2.1 High-Dimensional Representations in Neuroscience

Neuroscience has proven to be a rich source of inspiration for the machine learning community: from the perceptron (Rosenblatt, 1958), which introduced a simple and general-purpose learning algorithm for linear classifiers, to neural networks (Rumelhart et al., 1986), to convolutional architectures inspired by visual cortex (Fukushima, 1980), to sparse coding (Olshausen & Field, 1996) and independent component analysis (Bell & Sejnowski, 1995).
One of the most consequential discoveries from the neuroscience community, underlying much research at the intersection of neuroscience and machine learning, has been the notion of high-dimensional distributed representations as the fundamental data structure for diverse types of information. In the neuroscience context, these representations are also typically sparse. To give a concrete example, the sensory systems of many organisms have a critical component consisting of a transformation from relatively low-dimensional sensory inputs to much higher-dimensional sparse representations.

[Figure 1: The flow of data in HD computing. Data is mapped from the input space to HD-space under an encoding function φ : X → H. HD representations of data are stored in data structures and may be corrupted by noise or hardware failures. HD representations can be used as input for learning algorithms or other information processing tasks and may be decoded to recover the input data.]

These latter representations are then used for subsequent tasks such as recall and learning. In the olfactory system of the fruit fly (Masse et al., 2009; Turner et al., 2008; Wilson, 2013; Caron et al., 2013), the mapping consists of two steps that can be roughly captured as follows:

1. An input x ∈ ℝⁿ is collected via a sensory organ and mapped under a random linear transformation to a point φ(x) ∈ ℝᵈ (d ≫ n) in a high-dimensional space.
2. The coordinates of φ(x) are sparsified by a thresholding operation which just retains the locations of the largest k coordinates.
In the fly, the olfactory input is a roughly 50-dimensional vector (n = 50) corresponding to different types of odor receptor neurons while the sparse representation to which it is mapped is roughly 2,000-dimensional (d = 2000). A similar expand-and-sparsify template is also found in other species, suggesting that this process somehow exposes the information present in the input signal in a way that is amenable to learning by the brain (Stettler & Axel, 2009; Olshausen & Field, 2004; Chacron et al., 2011). The precise mechanisms by which this occurs are still not fully understood, but may have close connections to some of the literature on the theory of neural networks and kernel methods (Cybenko, 1989; Barron, 1993; Rahimi & Recht, 2008).

2.2 HD Computing

The notion of high-dimensional, distributed, data representations has engendered a number of computational models that have collectively come to be known as vector symbolic architectures (VSA) (Levy & Gayler, 2008). In general, VSAs provide a systematic way to generate and manipulate high-dimensional representations of symbols so as to implement cognitive operations like association between related concepts. Notable examples of VSAs include holographic reduced representations (Plate, 1995, 2003), binary spatter codes (Kanerva, 1994, 1995), and matrix binding of additive terms (Gallant & Okaywe, 2013). HD computing can be seen as a successor to these early VSA models, with a strong additional slant towards hardware efficiency. While our treatment focuses primarily on recent work on HD computing, many of our results apply to these earlier VSA models as well. An overview of data-flow in HD computing is given in Figure 1. The first step in HD computing is encoding, which maps a piece of input data to its high-dimensional representation under some function φ : X → H. The nature of φ depends on the type of input and the choice of H.
In this review, we consider inputs consisting of sets, sequences, and structures composed from a finite alphabet as well as vectors in a Euclidean space. The space H is some d-dimensional inner-product space defined over the real numbers or a subset thereof. Work in the literature on both HD computing and traditional neural networks has also explored complex-valued embeddings (Weiss et al., 2016; Parcollet et al., 2019; Zhang et al., 2016). However, we here focus on the more common case of real-valued embeddings. For computational reasons, it is common to restrict H to be defined over integers in a limited range [−b, b]. We emphasize that the dimension of H need not, in general, be greater than that of X. Indeed, in several cases the encoding methods discussed can be used to reduce the dimension of the data. The HD representations of data can be manipulated using simple element-wise operators. Two common and important such operations are bundling and binding. The bundling operator is used to compile a set of elements in H and takes the form of a function ⊕ : H × H → H. The function takes two points in H and returns a third point that is similar to both operands. The binding operator is used to create ordered tuples of points in H and is likewise a function ⊗ : H × H → H. The function takes a pair of points in H as input, and produces a third point dissimilar to both operands. We make these notions more precise in our subsequent discussion of encoding. Given the HD representation φ(S) of a set of items S ⊆ X (produced by bundling the items), we may be interested to query the representation to determine if it contains the encoding of some x ∈ X. To do so, we compute a metric of similarity ρ(φ(x), φ(S)) and declare that the item is present in S if the similarity is greater than some critical value. This process can be used to decode the HD representation so as to recover the original points in X (Plate, 2003; Frady et al., 2018).
We may additionally wish to assert that we can decode reliably even if φ(S) has been corrupted by some noise process. One of our chief aims in this paper is to mathematically characterize sufficient conditions for robust decoding under different noise models and input data types. Beyond simply storing and recalling specific patterns, HD representations may also be used for learning. HD computing is most naturally applicable to classification problems. Suppose we are given some collection of labeled examples S = {(xᵢ, yᵢ)}ᵢ₌₁ᴺ, where xᵢ ∈ X and yᵢ ∈ {cₖ}ₖ₌₁ᴷ is a categorical variable indicating the class label of a particular xᵢ. One simple form of HD classification bundles together the data corresponding to a particular class to generate a prototypical example for the class (Kanerva, 2009; Kleyko et al., 2018; Rahimi et al., 2018):

φ(cₖ) = Σ_{i : yᵢ = cₖ} φ(xᵢ)   (1)

The resulting φ(cₖ) are sometimes quantized to lower precision or sparsified via a thresholding operation. A nice feature of this scheme is that it is extremely simple to implement in an on-line fashion: that is, on streaming data arriving continuously over time (Rahimi et al., 2018). It is common to fine-tune the class prototypes using a few rounds of perceptron training (Imani et al., 2017, 2019b). Given some subsequent piece of query data x_q ∈ X for which we do not know the correct label, we simply return the label of the most similar prototype:

k* = argmax_{k ∈ {1,…,K}} ρ(φ(x_q), φ(cₖ)).

The similarity metric ρ is typically taken to be the dot-product, with the operands normalized if necessary. Thus, on the whole, the scheme is quite similar to classical statistical methods like naive Bayes and Fisher's linear discriminant. In Section 5.2.1, we consider properties of the HD encoding that can make linear models more powerful in HD space than in the original space.
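To make the prototype scheme concrete, the following sketch implements Eq. (1) and the nearest-prototype rule in NumPy. The encoding φ here is a hypothetical stand-in (a random ±1 projection followed by sign quantization) rather than any particular encoder from the literature, and the data, dimensions, and class structure are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_classes = 10_000, 20, 3   # illustrative choices, not from the paper

# Hypothetical encoder phi: a fixed random projection, sign-quantized so that
# the coordinates of phi(x) are low-precision (+1/-1).
W = rng.choice([-1.0, 1.0], size=(d, n))
def phi(x):
    return np.sign(W @ x)

# Toy labeled data: one Gaussian cluster per class.
means = rng.normal(size=(n_classes, n))
X = np.vstack([rng.normal(loc=mu, scale=0.3, size=(50, n)) for mu in means])
y = np.repeat(np.arange(n_classes), 50)

# Eq. (1): each class prototype bundles (sums) the encodings of its examples.
prototypes = np.stack([sum(phi(x) for x in X[y == k]) for k in range(n_classes)])

def classify(xq):
    # Return the label of the most similar prototype under normalized dot-product.
    q = phi(xq)
    sims = (prototypes @ q) / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))

accuracy = np.mean([classify(x) == label for x, label in zip(X, y)])
```

On well-separated toy clusters like these, the bundled prototypes classify the training data nearly perfectly, despite the encoder being random and sign-quantized.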
HD computing and closely related techniques have been applied to a wide variety of practical problems in fields ranging from bio-signal processing (Rahimi et al., 2016; Imani et al., 2017), to natural language processing (Sahlgren, 2005), and robotics (Mitrokhin et al., 2019; Neubert et al., 2019). We are here concerned with a more abstract treatment that focuses on the basic properties of HD computing and will not attempt to survey this literature. The interested reader is referred to (Rahimi et al., 2016; Kleyko et al., 2018) for discussions related to practical aspects of HD computing.

3. Encoding and Decoding Discrete Data

A central object in HD computing is the mapping from inputs to their high-dimensional representations. The design of this mapping, typically referred to as encoding in the literature on HD computing, has been the subject of considerable research. There is a wide range of possible encoding methods. Some of these have been introduced in the HD computing literature and studied in isolation (Plate, 2003; Gallant & Okaywe, 2013; Kleyko et al., 2018). In this review, we present a novel unifying framework in which to study these mappings and to characterize their key properties in a non-asymptotic setting. We first discuss the encoding and decoding of sets in some detail. Many HD encoding procedures for more complex data types such as sequences essentially amount to transforming the data into a set and then applying the standard set-encoding method.

3.1 Finite Sets

Let A = {aᵢ}ᵢ₌₁ᵐ be some finite alphabet of m symbols. Symbols a ∈ A are mapped to H under an encoding function φ : A → H. Our goal in this section is to consider the encoding of sets S whose elements are drawn from A. The HD representation of S is constructed by superimposing the embeddings of the constituent elements using the bundling operator ⊕ : H × H → H. The encoding of S is defined to be φ(S) = ⊕_{a∈S} φ(a).
We first focus on the intuitive setting in which ⊕ is the element-wise sum and then address other forms of bundling. To determine if some a ∈ A is contained in S, we check if the dot product ⟨φ(a), φ(S)⟩ exceeds some fixed threshold. If the codewords {φ(a) : a ∈ A} are orthogonal and have a constant length L, then we have ⟨φ(a), φ(S)⟩ = L²·1(a ∈ S), where 1 is the indicator function which evaluates to one if its argument is true and zero otherwise. However, when the codewords are not perfectly orthogonal, we have ⟨φ(a), φ(S)⟩ = L²·1(a ∈ S) + ϵ, where ϵ is the cross-talk caused by interference between the codewords. In order to decode reliably, we must ensure the contribution of the cross-talk is small and bounded. We formalize this using the notion of incoherence popularized in the sparse coding literature. We define incoherence formally as (Donoho et al., 2005):

Definition 1 Incoherence. For µ ≥ 0, we say φ : A → H is µ-incoherent if for all distinct a, a′ ∈ A, we have |⟨φ(a), φ(a′)⟩| ≤ µL², where L = min_{a∈A} ‖φ(a)‖.

When d ≥ m, it is possible to have codewords that are mutually orthogonal, whereupon µ = 0. In general, we will be interested in results that do not require d ≥ m.

3.1.1 Exact Decoding of Sets

In the following section, we show how the cross-talk can be bounded in terms of the incoherence of φ, and use this to derive a simple threshold rule for exact decoding.

Theorem 2 Let L = min_{a∈A} ‖φ(a)‖ and let the bundling operator be the element-wise sum. To decode whether an element a lies in the set S, we use the rule: declare a ∈ S if and only if ⟨φ(a), φ(S)⟩ ≥ L²/2. This gives perfect decoding for sets of size s if φ is 1/(2s)-incoherent.

Proof Consider some symbol a. Then:

⟨φ(a), φ(S)⟩ = 1(a ∈ S)·⟨φ(a), φ(a)⟩ + Σ_{a′∈S\{a}} ⟨φ(a), φ(a′)⟩

If a ∈ S, then the above is lower bounded by L² − sL²µ, where µ is the incoherence of φ. Otherwise, it is upper bounded by sL²µ. So we decode perfectly if sL²µ < L²/2, or µ < 1/(2s).
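A minimal sketch of this bundle-then-threshold scheme, using random ±1 codewords (introduced formally in the next section) so that L² = d exactly; the alphabet, set, and dimension below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 10_000, 26                 # encoding dimension and alphabet size (illustrative)
alphabet = [chr(ord('a') + i) for i in range(m)]

# Random dense codewords: each phi(a) in {-1,+1}^d, so ||phi(a)||^2 = d exactly.
codebook = {a: rng.choice([-1, 1], size=d) for a in alphabet}

S = {'a', 'd', 'k', 'q', 'z'}
phi_S = sum(codebook[a] for a in S)   # bundling = element-wise sum

# Theorem 2's rule: declare a in S iff <phi(a), phi(S)> >= L^2 / 2 = d / 2.
decoded = {a for a in alphabet if codebook[a] @ phi_S >= d / 2}
```

For a member a ∈ S the dot product concentrates near d, while for a non-member it is pure cross-talk of typical size √(sd), far below the threshold d/2, so the whole set is recovered exactly.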
3.1.2 Random Codebooks

In practice, each φ(a) is usually generated by sampling from some distribution over H or a subset of H (Kanerva, 2009; Kleyko et al., 2018; Rahimi et al., 2018). We typically require that this distribution is factorized so that coordinates of φ(a) are i.i.d. Intuitively, the incoherence condition stipulated in Theorem 2 will hold if dot products between two different codewords are concentrated around zero. Furthermore, we would like it to be the case that this concentration occurs quickly as the encoding dimension is increased. It turns out that a fairly broad family of simple distributions satisfies these properties. As an example, suppose φ(a) is sampled from the uniform distribution over {±1}ᵈ, which we denote φ(a) ~ unif({±1}ᵈ). In this case, L² = d exactly, and a direct application of Hoeffding's inequality and the union bound yields:

P(∃ distinct a, a′ ∈ A s.t. |⟨φ(a), φ(a′)⟩| ≥ µd) ≤ m² exp(−µ²d/2)

(Recall that m = |A|.) Stated another way, with high probability µ = O(√((ln m)/d)), meaning that we can make µ as small as desired by increasing d. In fact, the same basic approach holds for the much broader class of sub-Gaussian distributions, which can be characterized as follows (Wainwright, 2019):

Definition 3 Sub-Gaussian Random Variable. A random variable X ~ P_X is said to be sub-Gaussian if there exists σ ∈ ℝ₊, referred to as the sub-Gaussian parameter, such that: E[exp(λ(X − E[X]))] ≤ exp(σ²λ²/2) for all λ ∈ ℝ.

Intuitively, the tails of a sub-Gaussian random variable decay at least as fast as those of a Gaussian. We say the encoding φ is σ-sub-Gaussian if φ(a) is generated by sampling its d coordinates independently from the same sub-Gaussian distribution with parameter σ. We say φ is centered if the distribution from which it is sampled is of mean zero. In general, we assume φ is centered unless stated otherwise.
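The predicted O(√((ln m)/d)) decay of the incoherence is easy to check empirically; the sketch below samples ±1 codebooks at increasing dimension and records the largest normalized dot product between distinct codewords (the sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 64
results = []
for d in (1_000, 4_000, 16_000):
    C = rng.choice([-1.0, 1.0], size=(m, d))
    G = (C @ C.T) / d                 # normalized pairwise dot products
    np.fill_diagonal(G, 0.0)          # ignore self-similarity
    mu_hat = np.abs(G).max()          # empirical incoherence of the codebook
    results.append((d, mu_hat, np.sqrt(np.log(m) / d)))

for d, mu_hat, pred in results:
    print(f"d={d:6d}  mu_hat={mu_hat:.4f}  sqrt(ln m / d)={pred:.4f}")
```

The empirical incoherence shrinks as d grows and tracks the √((ln m)/d) rate up to a modest constant factor, as the union bound predicts.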
Codewords drawn from a sub-Gaussian distribution have the useful property that their lengths concentrate fairly rapidly around their expected value. This concentration is, in general, worse than sub-Gaussian but well behaved nonetheless. The following result is well known but we reiterate it here as it is useful for our subsequent discussion. A proof is available in the appendix.

Theorem 4 Let φ be centered and σ-sub-Gaussian. Then:

P(∃a ∈ A s.t. |‖φ(a)‖² − E[‖φ(a)‖²]| ≥ t) ≤ 2m exp(−c·min(t²/(σ⁴d), t/σ²))

for some positive absolute constant c.

Like the conventional Gaussian distribution, sub-Gaussianity is preserved under linear transformations. That is, if x = {xᵢ}ᵢ₌₁ⁿ is a sequence of i.i.d. sub-Gaussian random variables and a is an arbitrary vector in ℝⁿ, then ⟨a, x⟩ is sub-Gaussian with parameter σ‖a‖ (Wainwright, 2019). We can obtain a more general version of the previous result about φ ~ unif({±1}ᵈ) which applies to φ(a) sampled from any sub-Gaussian distribution.

Theorem 5 Let φ be σ-sub-Gaussian. Then, for µ > 0,

P(∃ distinct a, a′ ∈ A s.t. |⟨φ(a), φ(a′)⟩| ≥ µL²) ≤ m² exp(−µ²κL²/(2σ²))

where κ = (min_a ‖φ(a)‖²)/(max_a ‖φ(a)‖²).

Proof Fix some a and a′. Treating φ(a) as a fixed vector in ℝᵈ and using the fact that sub-Gaussianity is preserved under linear transformations, we may apply a Chernoff bound for sub-Gaussian random variables (e.g. Prop 2.1 of (Wainwright, 2019)) to obtain:

P(|⟨φ(a), φ(a′)⟩| ≥ µL²) ≤ 2 exp(−µ²L⁴/(2σ²‖φ(a)‖²))

where L_max = max_{a∈A} ‖φ(a)‖. Therefore, taking κ = L²/L²_max, we have:

P(|⟨φ(a), φ(a′)⟩| ≥ µL²) ≤ 2 exp(−µ²κL²/(2σ²))

and the claim follows by applying the union bound over all (m choose 2) < m²/2 pairs of codewords.

We note that, per Theorem 4, κ → 1 as d becomes large. To be concrete and provide useful practical guidance, we here introduce three running examples of codeword distributions.

Dense Binary Codewords. In our first example, the most common in practice in our impression, φ(a) is sampled from the uniform distribution over the d-dimensional unit cube {−1, +1}ᵈ.
This approach is advantageous because it leads to efficient hardware implementations (Imani et al., 2017; Rahimi et al., 2018) and is simple to analyze.

Gaussian Codewords. Our second example consists of codewords sampled from the d-dimensional Gaussian distribution (Plate, 2003). That is, φ(a) ~ N(0_d, σ²I_d), where 0_d is the d-dimensional zero vector. Here, the codewords will not be of exactly the same length. However, Theorem 4 ensures that squared codeword lengths are concentrated around their expected value of σ²d. More formally, for some τ > 0:

P(∃a ∈ A s.t. |‖φ(a)‖² − σ²d| ≥ τσ²d) ≤ 2m exp(−c·min(τ², τ)·d).

In both cases, we can see that to obtain a µ-incoherent codebook with probability 1 − δ, it is sufficient to choose:

d = O((1/µ²) ln(m/δ))

Or, stated another way, we have µ = O(√((ln m)/d)) with high probability. The key point in the two examples above is that the encoding dimension is inversely proportional to µ². Per Theorem 2, to decode correctly it is sufficient to have µ = 1/(2s), meaning that the encoding dimension scales quadratically with the number of elements in the set, but only logarithmically in the alphabet size and probability of error. We will also consider a third example in which the codewords are sparse and binary. However, we defer this for the time being as slightly different encoding methods and analysis techniques are appropriate.

3.1.3 Decoding with Small Probability of Error

The analysis above gives strong uniform bounds showing that, with probability at least 1 − δ over the random choice of the codebook, every subset of size at most s will be correctly decoded. However, this guarantee requires us to impose the unappealing restriction that s = O(√d), which is a significant practical limitation. We here show that we can obtain s = O(d) but with a weaker pointwise guarantee: any arbitrarily chosen set of size at most s will be correctly decoded with probability 1 − δ over the random choice of codewords.
Rather than insist on a hard upper bound on the incoherence of the codebook, we can instead require the milder condition that random sums over dot-products between s codewords are small with high probability. We define this property more formally as follows:

Definition 6 Subset Incoherence. For τ > 0, we say a random mapping φ : A → H satisfies (s, τ, δ)-subset incoherence if, for any S ⊆ A of size at most s, with probability at least 1 − δ over the choice of φ:

max_{a′∈A} | Σ_{a∈S\{a′}} ⟨φ(a), φ(a′)⟩ | ≤ τL²

where L = min_{a∈A} ‖φ(a)‖.

Once again, it turns out that sampling the codewords from a sub-Gaussian distribution can readily be seen to satisfy a subset-incoherence condition with high probability:

Theorem 7 Let φ be σ-sub-Gaussian and fix some S ⊆ A of size s. Then:

P(max_{a′∈A} | Σ_{a∈S\{a′}} ⟨φ(a), φ(a′)⟩ | ≥ τL²) ≤ 2m exp(−κτ²L²/(2sσ²))

where κ and L are as in Theorem 5.

The proof is similar to Theorem 5 and is available in the appendix. As a concrete example, in the practically relevant case that φ ~ unif({±1}ᵈ) the above boils down to:

P(max_{a′∈A} | Σ_{a∈S\{a′}} ⟨φ(a), φ(a′)⟩ | ≥ τd) ≤ 2m exp(−τ²d/(2s))

Stated another way, we have: τ = O(√((s ln m)/d)). Following Theorem 2, in order to ensure correct decoding with high probability, we must simply argue that the codebook satisfies the subset-incoherence property with τ = 1/2, meaning we should choose the encoding dimension to be d = O(s ln m). This method of analysis is similar to that of (Plate, 2003; Gallant & Okaywe, 2013; Frady et al., 2018), who reach the same conclusion vis-à-vis linear scaling using the central limit theorem. However, our formalism is more general and is non-asymptotic.

3.1.4 Comparing Set Representations

We can estimate the size of a set by computing the norm of its encoding, where the precision of the estimate can be bounded in terms of the incoherence of φ. In the following discussion, we make the simplifying assumption that the codewords are all of a constant length L.
Again appealing to Theorem 4, we can see that this assumption is not onerous since the codeword lengths concentrate around their expected value.

Theorem 8 Let S be a set of size s. Then:

s(1 − sµ) ≤ (1/L²)‖φ(S)‖² ≤ s(1 + sµ)

Proof The proof is by direct manipulation:

(1/L²)‖φ(S)‖² = (1/L²)⟨φ(S), φ(S)⟩ = (1/L²) Σ_{a∈S} ⟨φ(a), φ(a)⟩ + (1/L²) Σ_{a≠a′∈S} ⟨φ(a), φ(a′)⟩ ≤ (1/L²)(sL² + s²µL²) = s(1 + sµ).

The other direction is analogous.

Given a pair of sets S, S′ over the same alphabet, we can estimate the size of their intersection and union directly from their encoded representation.

Theorem 9 Let S and S′ be sets of size s and s′ drawn from A and denote their encodings by φ(S) and φ(S′) respectively. Then:

|S ∩ S′| − ss′µ ≤ (1/L²)⟨φ(S), φ(S′)⟩ ≤ |S ∩ S′| + ss′µ

The proof is similar to Theorem 8 and is deferred to the appendix. Noting as well that |S ∪ S′| = |S| + |S′| − |S ∩ S′|, we see that we can estimate the size of the union using the previous theorem. In practice, it may be unnecessary to compute these quantities with a high degree of precision. For instance, it may only be necessary to identify sets with a large intersection-over-union. Provided the definition of large is somewhat loose, we can accept a higher incoherence among the codewords in exchange for reducing the encoding dimension.

3.1.5 Sparse and Low-Precision Encodings

In the previous discussion, we assumed the bundling operator was the element-wise sum. This is a natural choice when the codewords are dense or non-binary. However, the resulting encodings are of unconstrained precision, which may be undesirable from a computational perspective. For the purposes of representing sets of size s, we may truncate φ(S) to lie in the range [−c, c], with negligible loss in accuracy provided c = O(√s). In practice, it is common to quantize the encodings more aggressively to binary precision by thresholding (Kanerva, 1994; Rahimi et al., 2017; Burrello et al., 2018; Imani et al., 2019a).
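The following sketch illustrates Theorems 8 and 9 with random ±1 codewords (an assumption; any sufficiently incoherent codebook would do): set size, intersection, and union are all read off from dot products of the encodings.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 20_000, 26                 # illustrative dimension and alphabet size
alphabet = [chr(ord('a') + i) for i in range(m)]
codebook = {a: rng.choice([-1.0, 1.0], size=d) for a in alphabet}

S1 = {'a', 'b', 'c', 'd', 'e', 'f'}
S2 = {'d', 'e', 'f', 'g', 'h'}
phi1 = sum(codebook[a] for a in S1)
phi2 = sum(codebook[a] for a in S2)

size_hat = (phi1 @ phi1) / d               # Theorem 8: close to |S1| = 6
inter_hat = (phi1 @ phi2) / d              # Theorem 9: close to |S1 n S2| = 3
union_hat = len(S1) + len(S2) - inter_hat  # |S1 u S2| via inclusion-exclusion
```

With L² = d, each estimate is off by at most ss′µ, which at this dimension is a small fraction of an element.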
In other words, we encode as φ(S) = g_t(Σ_{a∈S} φ(a)), where g_t is a thresholding function that is applied coordinate-wise: g_t(x) = 1 if x ≥ t and 0 otherwise.

As a notable special case of the thresholding rule described above, we here consider encoding with sparse codewords. In this case, we assume that a coordinate in a codeword is non-zero with some small probability. In other words, φ(a)ᵢ ~ Bernoulli(p), where p ≪ 1/2. We may then bundle items by taking an element-wise sum of their codewords with threshold t = 1, which is equivalent to taking the element-wise maximum over the codewords. That is, φ(S) = max_{a∈S} φ(a), where the max operator is applied coordinate-wise. Noting that the max is upper bounded by the sum in this setting, the notion of incoherence is a relevant quantity and the analysis of Theorem 2 continues to apply. This encoding procedure is essentially a standard implementation of the popular Bloom filter data structure for representing sets (Bloom, 1970). The conventional Bloom filter differs slightly in that the typical decoding rule is to threshold ⟨φ(a), φ(S)⟩ at ‖φ(a)‖₁. There is a large literature on Bloom filters with applications ranging from networking and database systems to neural coding, and several schemes for generating good codewords have been proposed (Broder & Mitzenmacher, 2004; Pagh et al., 2005; Dasgupta et al., 2018). Using the random coding scheme described here, the optimal value of p can be seen to be (ln 2)/s and, to ensure the probability of a false positive is at most δ, the encoding dimension should be chosen on the order of s ln(1/δ) (Broder & Mitzenmacher, 2004). A practical benefit of Bloom filters is that they have an efficient implementation using hash functions which does not require materializing a codebook as in methods based on random sampling. This may be beneficial when the alphabet size is large enough that storing codewords is not possible.
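A sketch of this sparse scheme, with max-bundling and the conventional Bloom decoding rule; the density p = (ln 2)/s follows the recommendation above, while the dimension and alphabet are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, s = 4_096, 100, 20
p = np.log(2) / s                      # density suggested for Bloom-style codes

# Sparse binary codewords: each coordinate is non-zero with probability p.
codebook = {a: (rng.random(d) < p).astype(int) for a in range(m)}

S = set(range(s))                      # bundle the first s symbols
phi_S = np.maximum.reduce([codebook[a] for a in S])   # bundling = element-wise max

# Conventional Bloom rule: threshold <phi(a), phi(S)> at ||phi(a)||_1, i.e.
# declare membership only if every active coordinate of phi(a) is set in phi_S.
decoded = {a for a in range(m) if codebook[a] @ phi_S >= codebook[a].sum()}
```

At this density roughly half the coordinates of φ(S) are set, so a non-member would need all of its (on average) dp ≈ 142 active coordinates to collide, making false positives vanishingly rare.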
The connections between HD computing and Bloom filters are examined in greater detail in (Kleyko et al., 2019). We remark that this method of encoding is related to an interesting procedure known as context dependent thinning (CDT) which can be used to control the density of binary representations (Rachkovskij, 2001; Kleyko et al., 2018). CDT takes the logical AND of φ(S) and some permutation σ(φ(S)) to obtain the thinned representation φ̃(S) = φ(S) ∧ σ(φ(S)). This process can be repeated until the desired density of φ̃(S) is achieved. A capacity analysis of CDT representations can be found in (Kleyko et al., 2018).

3.2 Robustness to Noise

In this section we explore the noise robustness properties of the encoding methods discussed above using the formalism of incoherence. We consider some unspecified noise process which corrupts the encoding of a set S ⊆ A of size at most s according to φ̃(S) = φ(S) + ∆_S. We say ∆_S is ρ-bounded if:

max_{a∈A} |⟨φ(a), ∆_S⟩| ≤ ρ.

We are interested in understanding the conditions under which we can still decode reliably.

Theorem 10 Suppose S has size s and ∆_S is ρ-bounded. We can correctly decode S using the thresholding rule from Theorem 2 if:

ρ/L² + sµ < 1/2

where L = min_{a∈A} ‖φ(a)‖.

The proof is a straightforward extension of Theorem 2 and is available in the appendix. The practical implication is that there is a tradeoff between the incoherence of the codebook and robustness to noise: a higher incoherence allows for a smaller encoding dimension but at the cost of a tighter constraint on ρ. We can analyze several practically relevant noise models by placing additional restrictions on ∆_S and by considering worst or typical case bounds on ρ. We here consider different forms of noise under constraints on H. Our goal is to understand how the magnitude of noise that can be tolerated scales with the encoding dimension, size s of the encoded set, and size m of the alphabet.
In each setting we consider a passive model in which the noise is sampled randomly from some distribution, and an adversarial model in which the noise is arbitrary and may be designed to maliciously corrupt the encodings. We again appeal to Theorem 4 to justify a simplifying assumption that the codewords are of equal length.

Lemma 11 Sub-Gaussian Codewords. Fix a centered and σ-sub-Gaussian codebook φ whose codewords are of length L. Consider the passive additive white Gaussian noise model ∆_S ~ N(0, σ∆²I_d); that is, each coordinate is corrupted by Gaussian noise with mean zero and variance σ∆². Then, we can correctly decode with probability 1 − δ over random draws of ∆_S provided:

(σ∆/L)·√(2 ln(2m/δ)) + sµ < 1/2

Now consider an adversarial model in which ∆_S is arbitrary save for a constraint on the norm: ‖∆_S‖ ≤ ωL. Then, we can correctly decode provided:

ω + sµ < 1/2

Proof Let us first consider the passive case in which ∆_S ~ N(0, σ∆²I_d). Fix some a ∈ A. Then ⟨φ(a), ∆_S⟩ ~ N(0, σ∆²L²). By a standard tail bound on the Gaussian distribution (Wainwright, 2019) and the union bound, we have:

P(∃a s.t. |⟨φ(a), ∆_S⟩| ≥ ρ) ≤ 2m exp(−ρ²/(2σ∆²L²))

Therefore, with probability 1 − δ, we have that ∆_S is ρ-bounded for ρ = σ∆L√(2 ln(2m/δ)). By Theorem 10 we can decode correctly if:

(σ∆/L)·√(2 ln(2m/δ)) + sµ < 1/2

Now consider the adversarial case in which ‖∆_S‖ ≤ ωL. By the Cauchy-Schwarz inequality, |⟨φ(a), ∆_S⟩| ≤ ωL². Therefore, by Theorem 10, we can decode correctly if:

ω + sµ < 1/2

We again emphasize that, per Theorem 5, µ = O(√((ln m)/d)). Since L = Θ(√d), we can see that we can tolerate noise with σ∆ on the order of √(d/ln m) in the passive case, provided sµ is bounded away from 1/2. We next turn to a notable special case of the above in which the codewords are dense and binary. In this case, we may assume that H is constrained to be integers in the range [−s, s].

Lemma 12 Dense Binary Codewords. Fix a codebook φ such that φ(a) ∈ {−1, +1}ᵈ for each a ∈ A.
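The passive Gaussian model of Lemma 11 is easy to simulate: when d is large relative to σ∆ and s, the noise term ⟨φ(a), ∆_S⟩ stays far below the decoding threshold. All sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, s = 10_000, 26, 5
sigma_noise = 5.0                     # noise level sigma_Delta (illustrative)
codebook = {a: rng.choice([-1.0, 1.0], size=d) for a in range(m)}

S = {0, 5, 10, 15, 20}
phi_S = sum(codebook[a] for a in S)

# Passive AWGN model: every coordinate is corrupted by N(0, sigma_noise^2).
noisy = phi_S + rng.normal(0.0, sigma_noise, size=d)

# Decode with the same rule as in the noiseless case (threshold L^2/2 = d/2).
decoded = {a for a in range(m) if codebook[a] @ noisy >= d / 2}
```

Here ⟨φ(a), ∆_S⟩ has standard deviation σ∆√d = 500, comfortably below the threshold d/2 = 5000, so decoding succeeds even though the per-coordinate noise is several times larger than the ±1 signal.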
Consider a passive noise model in which $\Delta_S \sim \mathrm{unif}(\{-c, \ldots, c\}^d)$; that is, each coordinate is shifted by an integer amount chosen uniformly at random between $-c$ and $c$. Then we can correctly decode with probability $1 - \delta$ provided:
$$c\sqrt{\frac{2\ln(2m/\delta)}{d}} + s\mu < \frac{1}{2}.$$
Now consider an adversarial model in which we assume $\|\Delta_S\|_1 \leq \omega s d$. Then we can decode correctly if:
$$s\omega + s\mu < \frac{1}{2}.$$
A proof is available in the Appendix. We next consider the case of Section 3.1.5, in which the codewords are sparse and binary and the bundling operator is the element-wise maximum. We here assume that $\tilde{\phi}(S) = \phi(S) + \Delta_S$ is truncated so that each coordinate is either 0 or +1.

Lemma 13 (Sparse Binary Codewords). Fix a codebook $\phi$ such that $\phi(a) \in \{0, 1\}^d$, and assume some fraction $p \leq 1/2$ of coordinates are non-zero for each $a \in A$. Consider a passive noise model in which each coordinate of $\Delta_S$ is independently:
$$\Delta_S^{(i)} = \begin{cases} 0 & \text{w.p. } 1 - \theta \\ +1 & \text{w.p. } \theta. \end{cases}$$
Then we can decode correctly with probability $1 - \delta$ provided:
$$\theta + \sqrt{\frac{\ln(2m/\delta)}{2dp}} + s\mu < \frac{1}{2}.$$
Now consider an adversarial model in which $\|\Delta_S\|_1 \leq \omega d$. Then we can decode correctly if $\omega < p\left(\frac{1}{2} - s\mu\right)$.

Proof Consider first the passive noise model. Fix some $\phi(a)$. Then:
$$|\langle \phi(a), \Delta_S \rangle| \leq \sum_{i=1}^d |\phi(a)^{(i)} \Delta_S^{(i)}|.$$
Treating $\phi(a)$ as a fixed vector with $dp$ non-zero entries, the sum is concentrated in the range $dp(\theta \pm \epsilon)$, and so $\rho \leq dp(\theta + \epsilon)$ with high probability. By Chernoff/Hoeffding and the union bound, this holds with probability $1 - \delta$ for $\epsilon = \sqrt{\ln(2m/\delta)/(2dp)}$. The result is obtained by noting that $L^2 = pd$ and applying Theorem 10. For the adversarial case, the result is obtained by again observing that $|\langle \phi(a), \Delta_S \rangle| \leq \|\phi(a)\|_\infty \|\Delta_S\|_1 \leq \omega d$ for any $a \in A$ and applying Theorem 10.

4. Encoding Structures

We are often interested in representing more complex data types, such as objects with multiple attributes or features. In general, we suppose that we observe a set of features $F$ whose values are assumed to lie in some set $A$. Let $\psi : F \to H$ be an embedding of features, and $\phi : A \to H$ be an embedding of values. We associate a feature with its value through use of the binding operator $\otimes : H \times H \to H$ that creates an embedding for a (feature, value) pair.
For a feature $f \in F$ taking on a value $a \in A$, its embedding is constructed as $\psi(f) \otimes \phi(a)$. A data point $x = \{(f_i \in F, x_i \in A)\}_{i=1}^n$ consists of $n$ such pairs. For simplicity, we assume each $x$ possesses all attributes, although our analysis also applies to the case that $x$ possesses only some subset of attributes. The entire embedding for $x$ is constructed as (Plate, 2003):
$$\phi(x) = \sum_{i=1}^n \psi(f_i) \otimes \phi(x_i) \quad (2)$$
As with sets, we would typically like $\phi(x)$ to be decodable, in the sense that we can recover the value associated with a particular feature, and comparable, in the sense that $\langle \phi(x), \phi(x') \rangle$ is reflective of a reasonable notion of similarity between $x$ and $x'$. From a formal perspective, we require the binding operator to satisfy several properties. First, binding should be associative and commutative. That is, for all $a, b, c \in H$, $(a \otimes b) \otimes c = a \otimes (b \otimes c)$, and $a \otimes b = b \otimes a$. Second, there should exist an identity element $I \in H$ such that $I \otimes a = a$ for all $a \in H$. Third, for all $a \in H$, there should exist some $a^{-1} \in H$ such that $a \otimes a^{-1} = I$. These properties are equivalent to stipulating that $H$ be an abelian group under $\otimes$. Furthermore, binding should distribute over bundling. That is, for any $a, b, c \in H$, it should be the case that $a \otimes (b + c) = a \otimes b + a \otimes c$. We here also require that the lengths of bound pairs are bounded, that is to say: $\max_{f \in F, a \in A} \|\psi(f) \otimes \phi(a)\|_2 \leq M$. A natural choice of embedding satisfying these properties is to sample $\psi(f)$ randomly from $\{\pm 1\}^d$ and choose $\otimes$ to be the element-wise product. In this case $\psi(f)$ is its own inverse, that is $\psi(f) \otimes \psi(f) = I$, and binding preserves lengths of codewords. We focus on this case here as it is intuitive, but our analysis generalizes in a straightforward way to any particular implementation satisfying the properties listed above. One can see the bound pairs satisfy various incoherence properties with high probability. For instance, we may declare the binding to be $\mu$-incoherent if:
$$\max_{a \in A} \max_{a' \in A,\, f \in F} |\langle \phi(a), \psi(f) \otimes \phi(a') \rangle| \leq \mu L^2,$$
where $L = \min_{a \in A} \|\phi(a)\|_2$.
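A minimal sketch of this construction (assuming, as in the text, random bipolar codebooks with element-wise-product binding; the record, feature names, and dimensions are invented for illustration). Because $\psi(f)$ is its own inverse, multiplying the encoding by $\psi(f)$ approximately recovers the bound value:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10_000
values = ["red", "round", "small"]
features = ["color", "shape", "size"]

# Random bipolar codebooks for values (phi) and features (psi).
phi = {v: rng.choice([-1.0, 1.0], size=d) for v in values}
psi = {f: rng.choice([-1.0, 1.0], size=d) for f in features}

# Record encoding (Equation 2): bundle the bound (feature, value) pairs.
record = {"color": "red", "shape": "round", "size": "small"}
enc = sum(psi[f] * phi[v] for f, v in record.items())

# Unbind a feature: psi(f) is its own inverse under the element-wise
# product, so psi(f) * enc is phi(record[f]) plus cross-talk terms.
def decode(enc, f):
    unbound = psi[f] * enc
    return max(values, key=lambda v: float(phi[v] @ unbound))

assert decode(enc, "color") == "red"
assert decode(enc, "size") == "small"
```

The dot product with the correct value codeword concentrates around $d$, while the cross-talk terms are $O(\sqrt{d})$, so decoding succeeds with high probability.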
We can extend Theorem 5 to see this property is satisfied with high probability:

Theorem 14 Fix $d, n, m \in \mathbb{Z}^+$ and $\mu \in \mathbb{R}^+$. Let $\phi$ be centered and $\sigma$-sub-Gaussian, let $\otimes$ be the element-wise product, and let $\psi(f) \in \{\pm 1\}^d$. Then:
$$P(\exists\, a, a' \in A,\, f \in F \text{ s.t. } |\langle \phi(a), \phi(a') \otimes \psi(f) \rangle| \geq \mu L^2) \leq nm^2 \exp\left(-\kappa \mu^2 L^2\right),$$
where $L = \min_{a \in A} \|\phi(a)\|_2$ and $\kappa$ is as defined in Theorem 5.

The proof is similar to Theorem 5 and is available in the Appendix. This result is appealing because it means that the incoherence scales only logarithmically with $mn$, which may be large in practice. As a corollary to the previous theorem, we also obtain the following useful incoherence property:
$$P(\exists\, a, a', f \neq f' \text{ s.t. } |\langle \phi(a), (\phi(a') \otimes \psi(f)) \otimes \psi^{-1}(f') \rangle| \geq \mu L^2) \leq m^2 n^2 \exp\left(-\kappa \mu^2 L^2\right), \quad (3)$$
where $\psi^{-1}(f)$ is the inverse of $\psi(f)$ with respect to $\otimes$. This notion of incoherence is useful for decoding representations. Along similar lines:
$$P(\exists\, a, a', f \neq f' \text{ s.t. } |\langle \phi(a) \otimes \psi(f), \phi(a') \otimes \psi(f') \rangle| \geq \mu L^2) \leq m^2 n^2 \exp\left(-\kappa \mu^2 L^2\right).$$
We note that the previous statement refers to symbols associated with different attributes and thus does not require any particular incoherence assumption on the $\phi(a)$.

4.1 Decoding Structures

This representation can be decoded to recover the value associated with a particular feature. To recover the value of the $i$-th feature, we use the following rule:
$$\hat{x}_i = \arg\max_{a \in A} \langle \phi(a), \phi(x) \otimes \psi^{-1}(f_i) \rangle,$$
where $\psi^{-1}(f)$ denotes the group inverse of $\psi(f)$. Since the binding operator is assumed to distribute over bundling, the dot product above expands to:
$$\langle \phi(a), \phi(x_i) \rangle + \sum_{j \neq i} \langle \phi(a), (\phi(x_j) \otimes \psi(f_j)) \otimes \psi^{-1}(f_i) \rangle \begin{cases} \geq L^2(1 - n\mu) & \text{if } x_i = a \\ \leq n L^2 \mu & \text{otherwise,} \end{cases}$$
where the incoherence can be bounded as in Equation 3. Thus $\mu < 1/(2n)$ is a sufficient condition for decodability.

4.2 Comparing Structures

As with sets, we may wish to compare two structures without decoding them.
As one would expect given Theorem 9, this can be achieved by computing the dot product between their encodings:

Theorem 15 Let $x$ and $x'$ be two structures drawn from a common alphabet $F \times A$. Denote their encodings using Equation 2 by $\phi(x)$ and $\phi(x')$. Then, if binding is $\mu$-incoherent:
$$|x \cap x'| - n^2\mu \leq \frac{1}{L^2} \langle \phi(x), \phi(x') \rangle \leq |x \cap x'| + n^2\mu,$$
where $x \cap x'$ is defined to be the set $\{i : x_i = x'_i\}_{i=1}^n$, that is, the features on which $x$ and $x'$ agree.

Proof Expanding:
$$\langle \phi(x), \phi(x') \rangle = \left\langle \sum_{i=1}^n \phi(x_i) \otimes \psi(f_i),\, \sum_{j=1}^n \phi(x'_j) \otimes \psi(f_j) \right\rangle = \sum_{i=1}^n \langle \phi(x_i) \otimes \psi(f_i), \phi(x'_i) \otimes \psi(f_i) \rangle + \sum_{i \neq j} \langle \phi(x_i) \otimes \psi(f_i), \phi(x'_j) \otimes \psi(f_j) \rangle.$$
A term in the first sum is $L^2$ if $x_i = x'_i$ and bounded in magnitude by $L^2\mu$ otherwise. So the expression above is bounded above by $L^2|x \cap x'| + L^2 n^2 \mu$, and the other direction of the inequality is analogous.

As a practical example, in bioinformatics it is common to search for regions of high similarity between a reference and a query genome. Work in (Imani et al., 2018) and (Kim et al., 2020) explored the use of HD computing to accelerate this process by encoding short segments of DNA and estimating similarity on the HD representations.

4.3 Encoding Sequences

Sequences are an important form of structured data. In this case, the feature set is simply the list of positions $\{1, 2, 3, \ldots\}$ in the sequence. In practical applications, we are often interested in streams of data which arrive continuously over time. Typically, real-world processes do not exhibit infinite memory, and we only need to store the $n$ most recent observations at any time. In the streaming setting, we would like to avoid needing to fully re-encode all $n$ data points each time we receive a new sample, as would be the case using the method described above. This motivates the use of shift-based encoding schemes (Kanerva, 2009; Rahimi et al., 2017; Kim et al., 2018). Let $\rho^{(i)}(z)$ denote a cyclic left-shift of the elements of $z$ by $i$ coordinates, and $\rho^{(-i)}(z)$ denote a cyclic right-shift by $i$ coordinates. In other words: $\rho^{(1)}((z_1, z_2, \ldots
, z_{d-1}, z_d)) = (z_2, z_3, \ldots, z_d, z_1)$. In shift-based encoding, a sequence $x = (x_1, \ldots, x_n)$ is represented as:
$$\phi(x) = \sum_{i=1}^n \rho^{(n-i)}(\phi(x_i)),$$
where we take bundling to be the element-wise sum. Now suppose we receive symbol $n+1$ and wish to append it to $\phi(x)$ while removing $\phi(x_1)$. Then we may apply the rule:
$$\rho^{(1)}\left(\phi(x) - \rho^{(n-1)}(\phi(x_1))\right) + \phi(x_{n+1}) = \sum_{i=1}^n \rho^{(n-i)}(\phi(x_{i+1})),$$
where we can additionally note that $\rho$ is a special type of permutation and that permutations distribute over sums. However, in order to decode correctly, each $\phi(a)$ must satisfy an incoherence condition with the $\rho^{(j)}(\phi(a'))$. We can again use the randomly generated nature of the codewords to argue this is the case; however, we must here impose the additional restriction that the $\phi(a)$ be bounded, and accordingly restrict attention to the case $\phi(a) \in \{\pm 1\}^d$.

Theorem 16 Fix $d, m \in \mathbb{Z}^+$, $n < d$, and $\mu \in \mathbb{R}^+$, and let $\phi(a) \in \{\pm 1\}^d$. Then:
$$P(\exists\, a, a' \in A,\, i \neq 0 \text{ s.t. } |\langle \phi(a), \rho^{(i)}(\phi(a')) \rangle| \geq \mu d) \leq nm^2 \exp\left(-\frac{\mu^2 d}{8}\right).$$

Proof Fix some $a, a'$ and $i$. In the case that $a \neq a'$, $\phi(a)$ and $\rho^{(i)}(\phi(a'))$ are mutually independent. However, when $a = a'$, $\phi(a)$ and $\rho^{(i)}(\phi(a))$ only satisfy pairwise independence, and the techniques of Theorem 5 cannot be applied. To resolve this difficulty, let $f(\phi(a)) = \langle \phi(a), \rho^{(i)}(\phi(a)) \rangle$, and denote by $\phi(a)_{\backslash k}$ the vector formed by replacing the $k$-th coordinate in $\phi(a)$ with an arbitrary value in $\{+1, -1\}$. Then $|f(\phi(a)) - f(\phi(a)_{\backslash k})| \leq 4$, and so by the bounded-differences inequality (McDiarmid, 1989):
$$P(|\langle \phi(a), \rho^{(i)}(\phi(a')) \rangle| \geq \mu d) \leq 2 \exp\left(-\frac{\mu^2 d}{8}\right).$$
The result follows by the union bound. Several other related methods for encoding sequential information have been proposed in the literature (Plate, 2003; Gallant & Okaywe, 2013). For an extensive discussion of these approaches, as well as an interesting discussion involving sequences of infinite length, the reader is referred to (Frady et al., 2018).
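The shift-based scheme and its streaming update can be sketched as follows (a toy example with an invented four-letter alphabet; `np.roll` with a negative shift plays the role of the cyclic left-shift $\rho$). The assertion checks that the incremental update agrees exactly with re-encoding the shifted window:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8192, 5                      # encoding dim, window length
alphabet = list("acgt")
phi = {a: rng.choice([-1.0, 1.0], size=d) for a in alphabet}

def encode_window(window):
    """phi(x) = sum_i rho^(n-i)(phi(x_i)), rho = cyclic left-shift."""
    return sum(np.roll(phi[a], -(n - i)) for i, a in enumerate(window, start=1))

stream = list("acgtgca")
window = stream[:n]
enc = encode_window(window)

# Streaming update: remove x_1, shift, and bundle in the new symbol.
new = stream[n]
enc = np.roll(enc - np.roll(phi[window[0]], -(n - 1)), -1) + phi[new]
window = window[1:] + [new]

# The incremental update matches re-encoding the shifted window exactly.
assert np.array_equal(enc, encode_window(window))
```

Because permutations distribute over sums, the update costs one subtraction, one shift, and one addition, independent of the window length $n$.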
4.4 Discussion and Comparison with Prior Work

We conclude our treatment of encoding and decoding discrete data with some brief discussion of our approach and its relation to antecedents in the literature. A key question, addressed both here and by several pieces of prior work, is to bound the magnitude of crosstalk noise in terms of the encoding dimension ($d$), the number of items to encode ($s$), and the alphabet size ($m$). Early analysis in (Plate, 2003; Gallant & Okaywe, 2013; Kleyko et al., 2018) recovers the same asymptotic relationship as we do, but only under specific assumptions about the method used to generate the codewords and particular instantiations of the bundling and binding operators. Work in (Frady et al., 2018) provides a significantly more general treatment which, like ours, aims to abstract away from the particular choice of distribution from which codewords are sampled and from the particular implementation of the bundling and binding operators. Their approach assumes the codewords are generated by sampling each component i.i.d. from some distribution and uses the central limit theorem (CLT) to justify modeling the crosstalk noise by a Gaussian distribution. Error bounds in the non-asymptotic setting are then obtained by applying a Chernoff-style bound to the resulting Gaussian distribution. This approach again recovers the same asymptotic relationship between $d$, $s$, and $m$ as ours, but does not generally yield formal bounds in the non-asymptotic setting. Our approach based on sub-Gaussianity formalizes this analysis in the non-asymptotic setting. Like us, (Frady et al., 2018) also considers the effect of noise on the HD representations, but their treatment is limited to additive white noise, whereas we address both arbitrary additive passive noise and adversarial noise.
In summary, our formalism using the notion of incoherence allows us to decouple the analysis of decoding and noise-robustness from any particular method for generating codewords, and readily yields rigorous bounds in the non-asymptotic setting. Our approach is applicable to a large swath of HD computing and offers more general conditions under which thresholding-based decoding schemes will succeed, and a more general account of the effect of noise, than is available in prior work.

5. Encoding Euclidean Data

One option for encoding Euclidean vectors is to treat them as a special case of the structured data considered in the preceding section. As before, we think of our data as a collection of (feature, value) pairs $x = \{(f_i, x_i)\}_{i=1}^n$, with the important caveat that now $x_i \in \mathbb{R}$. This case is more complex because the feature values may now be continuous, and because the data possesses geometric structure which is typically relevant for downstream tasks and must be preserved by encoding. We here analyze two of the most widely used methods for encoding Euclidean data and discuss general properties of structure-preserving embeddings in the context of HD computing.

5.1 Position-ID Encoding

A widely used method in practice is to quantize the raw signal to a suitably low precision and then apply the structure-encoding method discussed in the previous section (Rachkovskiy et al., 2005a, 2005b; Kleyko et al., 2018; Rahimi et al., 2018). In this approach, we first quantize the support of each feature $f \in F$ into some set of $m$ bins with centroids $a_1 < \cdots < a_m$ and assign each bin a codeword $\phi(a) \in H$. However, instead of requiring the codewords to be incoherent, we now require the correlation between codewords to reflect the distance between the corresponding quantizer bins. In other words, $\langle \phi(a), \phi(a') \rangle$ should be monotonically decreasing in $|a - a'|$. A simple method can be used to generate monotonic codebooks when the codewords are randomly sampled from $\{\pm 1\}^d$ (Rachkovskiy et al., 2005a; Widdows & Cohen, 2015).
Fixing some feature $f$, the codeword for the minimal quantizer bin, $\phi(a_1)$, is generated by sampling randomly from $\{\pm 1\}^d$. To generate the codeword for the second bin, we simply flip some set of $b$ bits in $\phi(a_1)$, where:
$$b = \frac{d}{2(m-1)}.$$
The codeword for the third bin is generated analogously from the second, where we assume the bits to be flipped are sampled such that a bit is flipped at most once. Thus the codewords for the minimal and maximal bins are orthogonal, and the correlation between codewords for intermediate bins is monotonically decreasing in the distance between their corresponding bin centroids. In practice, it seems to be typical to use a single codebook for all features and for the quantizer to be a set of evenly spaced bins over the support of the data. While simple, this approach is likely to have sub-optimal rate when the features are on different scales or are far from the uniform distribution. Encoding then proceeds as follows:
$$\phi(x) = \sum_{i=1}^n \phi(x_i) \otimes \psi(f_i),$$
where, as before, $\psi(f_i) \in \{\pm 1\}^d$ is a vector which encodes the index $i$ of a feature value $x_i$, as in the previous section on encoding sequences; hence the name position-ID encoding. There are several variations on this theme, which are compared empirically in (Kleyko et al., 2018). This general encoding method was analyzed by (Rachkovskiy et al., 2005b) in the specific case of sparse and binary codewords, who show it preserves the L1 distance between points in expectation but do not provide distortion bounds. We here provide such bounds using our formalism of matrix incoherence. We assume that the underlying quantization of the points is sufficiently fine that it contributes only a low-order term that can be ignored.

Theorem 17 Let $x$ and $x'$ be points in $[0, 1]^n$ with encodings $\phi(x)$ and $\phi(x')$ generated using the rule described above. Assume that $\phi$ satisfies $\langle \phi(a), \phi(a') \rangle = d(1 - |a - a'|)$ for all $a, a' \in A$, and let $\psi(f) \in \{\pm 1\}^d$.
Then, for all $x, x'$:
$$2d\left(\|x - x'\|_1 - 2n^2\mu\right) \leq \|\phi(x) - \phi(x')\|_2^2 \leq 2d\left(\|x - x'\|_1 + 2n^2\mu\right).$$
The proof is similar to Theorem 15 and is available in the Appendix. The practical implication of the previous theorem is that the position-ID encoding method preserves the L1 distance between points up to an additive distortion, which can be bounded by the incoherence of the codebook. Per Equation 4, $\mu = O(\sqrt{\ln(mn)/d})$. Therefore, to ensure that $\frac{1}{d}\|\phi(x) - \phi(x')\|_2^2$ approximates $2\|x - x'\|_1$ to within an additive error $\epsilon$, the previous result implies we should choose $d = O\left(\frac{n^4}{\epsilon^2}\ln(nm)\right)$. This can be relaxed to a quadratic dependence on $n$ in exchange for a weaker pointwise bound, but in either case means the encoding method may be problematic when the dimension of the underlying data is high. Noting that $\|\phi(x)\|_2^2 = nd \pm n^2 d\mu$, we can see that the encodings of each point are roughly of equal norm and lie in a ball of radius at most $n\sqrt{d\mu}$, where the exact position depends on the instantiation of the codebook. Thus, we can loosely interpret the encoding procedure as mapping the data into a thin shell around the surface of a high-dimensional sphere.

5.2 Random Projection Encoding

Another popular family of encoding methods embeds the data into $H$ under some random linear map followed by a quantization (Rachkovskij, 2015; Imani et al., 2019c). More formally, for some $x \in \mathbb{R}^n$, these embeddings take the form:
$$\phi(x) = g(\Phi x),$$
where $\Phi \in \mathbb{R}^{d \times n}$ is a matrix whose rows are sampled uniformly at random from the surface of the $n$-dimensional unit sphere, and $g$ is a quantizer, typically the sign function, restricting the embedding to $H$. The embedding matrix $\Phi$ may also be quantized to lower precision. This encoding method has also been studied in the context of kernel approximation, where it is used to approximate the angular kernel (Choromanski et al., 2017), and to construct low-distortion binary embeddings (Jacques et al., 2013; Plan & Vershynin, 2014).
While the following result is well known, we here show that this encoding method preserves angular distance up to an additive distortion, as this fact is important for subsequent analysis.

Theorem 18 Let $S^{n-1} \subset \mathbb{R}^n$ denote the $n$-dimensional unit sphere. Let $\Phi \in \mathbb{R}^{d \times n}$ be a matrix whose rows are sampled uniformly at random from $S^{n-1}$. Let $X$ be a set of points supported on $S^{n-1}$. Denote the embedding of a point by $\phi(x) = \mathrm{sign}(\Phi x)$. Then, for any $x, x' \in X$, with high probability:
$$d\theta - O(\sqrt{d}) \leq d_{\mathrm{ham}}(\phi(x), \phi(x')) \leq d\theta + O(\sqrt{d}),$$
where $d_{\mathrm{ham}}(a, b)$ is the Hamming distance between $a$ and $b$, defined to be the number of coordinates on which $a$ and $b$ differ, and $\theta = \frac{1}{\pi}\cos^{-1}(\langle x, x' \rangle) \in [0, 1]$ is proportional to the angle between $x$ and $x'$.

Proof Let $\Phi^{(i)}$ denote the $i$th row of the matrix $\Phi$. Then the $i$th coordinate in the embedding of $x$ can be written as $\mathrm{sign}(\langle \Phi^{(i)}, x \rangle)$. The probability that the embeddings differ on their $i$th coordinate, that is, that $(\langle \Phi^{(i)}, x \rangle)(\langle \Phi^{(i)}, x' \rangle) < 0$, is exactly $\angle(x, x')/\pi$: the angle (in radians) between $x$ and $x'$ divided by $\pi$. Therefore, the number of coordinates on which $\phi(x)$ and $\phi(x')$ disagree is concentrated in the range $d(\theta \pm \epsilon)$. By Chernoff/Hoeffding, we have that with probability $1 - \delta$:
$$\left|d_{\mathrm{ham}}(\phi(x), \phi(x')) - d\theta\right| \leq \sqrt{\frac{d\ln(2/\delta)}{2}}.$$
Noting that $\langle \phi(x), \phi(x') \rangle = d - 2 d_{\mathrm{ham}}(\phi(x), \phi(x'))$, we obtain the following simple corollary:

Corollary 19 Let $\phi$ and $\theta$ be as defined in Theorem 18. Then, with high probability:
$$d(1 - 2\theta) - O(\sqrt{d}) \leq \langle \phi(x), \phi(x') \rangle \leq d(1 - 2\theta) + O(\sqrt{d}).$$
To obtain a more explicit relationship with the dot product, we can use the first-order approximation $\cos^{-1}(t) \approx \pi/2 - t$ to obtain $\theta \approx \frac{1}{2} - \frac{1}{\pi}\langle x, x' \rangle$, from which we obtain:
$$\langle \phi(x), \phi(x') \rangle \approx \frac{2d}{\pi}\langle x, x' \rangle \pm O(\sqrt{d}).$$
We emphasize that, in comparison to the position-ID method, the distortion in this case does not depend on the dimension of the underlying data, which means this method may be preferable when the data dimension is large.
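A quick numerical check of the relationship $d_{\mathrm{ham}}(\phi(x), \phi(x')) \approx d\theta$ (a sketch under our own choice of dimensions and tolerance; rows of $\Phi$ are drawn uniformly from the sphere by normalizing Gaussian vectors):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 32, 20_000

# Rows of Phi uniform on the unit sphere: normalize Gaussian draws.
Phi = rng.normal(size=(d, n))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)

def encode(x):
    return np.sign(Phi @ x)

# Two random unit vectors and their normalized angle theta in [0, 1].
x = rng.normal(size=n); x /= np.linalg.norm(x)
y = rng.normal(size=n); y /= np.linalg.norm(y)
theta = np.arccos(np.clip(x @ y, -1.0, 1.0)) / np.pi

ham = np.count_nonzero(encode(x) != encode(y))   # Hamming distance

# Theorem-18-style check: d_ham concentrates around d * theta,
# with fluctuations of order sqrt(d).
assert abs(ham / d - theta) < 0.02
```

Each coordinate disagrees independently with probability $\theta$, so the normalized Hamming distance has standard deviation $\sqrt{\theta(1-\theta)/d}$, well inside the tolerance used here.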
5.2.1 Connection with Kernel Approximation

A natural question is whether the encoding procedure described above, which preserves dot products, can be generalized to capture more diverse notions of similarity. We can answer in the affirmative by noting that the random projection encoding method is closely related to the notion of random Fourier features, which have been widely used for kernel approximation (Rahimi & Recht, 2008). The basic idea is to construct an embedding $\phi : \mathbb{R}^n \to \mathbb{R}^d$ such that $\langle \phi(x), \phi(x') \rangle \approx k(x, x')$, where $k$ is a shift-invariant kernel. The construction exploits the fact that the Fourier transform of a shift-invariant kernel $k$ is a probability measure: a well-known result from harmonic analysis known as Bochner's Theorem (Rudin, 1962). The embedding itself is given by $\phi(x) = \frac{1}{\sqrt{d}}\cos(\Phi x + b)$, where the rows of $\Phi$ are sampled from the distribution induced by $k$ and the coordinates of $b$ are sampled uniformly at random from $[0, 2\pi]$. Subsequent work in (Raginsky & Lazebnik, 2009) gave a simple scheme for quantizing the embeddings produced from random Fourier features to binary precision. Their construction yields an embedding $\psi : \mathbb{R}^n \to \{0, 1\}^d$ such that:
$$f_1(k(x, x')) - \epsilon \leq \frac{1}{d} d_{\mathrm{ham}}(\psi(x), \psi(x')) \leq f_2(k(x, x')) + \epsilon,$$
where $f_1, f_2 : \mathbb{R} \to \mathbb{R}$ are independent of the choice of kernel, and $\epsilon$ is a distortion term. The embedding itself is constructed by applying a quantizer $Q_t(x) = \mathrm{sign}(x + t)$ coordinate-wise over the embeddings constructed from random Fourier features. In other words, $\psi(x)_i = \frac{1}{2}(1 + Q_{t_i}(\phi(x)_i))$, where $t_i \sim \mathrm{Unif}[-1, 1]$ and $\phi(x)$ is a random Fourier feature. This connection is highly appealing for HD computing. The quantized random Fourier feature scheme presents a simple recipe for constructing encoding methods meeting the desiderata of HD computing while preserving a rich variety of structure in data.
For instance, shift-invariant kernels preserving the L1 and L2 distances, among many others, can be approximated using the method discussed above. Furthermore, this observation provides a natural point of contact between HD computing and the vast literature on kernel methods, which has produced a wealth of algorithmic and theoretical insights.

5.3 Consequences of Distance Preservation

The encoding methods discussed above are both appealing because they preserve reasonable notions of distance between points in the original data. Distance preservation is a sufficient condition to establish other desirable properties of encodings, namely preservation of neighborhood/cluster structure, robustness to various forms of noise, and, in some cases, preservation of linear separability. We address the first two items here and defer the latter to our discussion of learning on HD representations. We formalize our notion of distance preservation as follows:

Definition 20 (Distance-Preserving Embedding) Let $\delta_X$ be a distance function on $X \subseteq \mathbb{R}^n$ and $\delta_H$ be a distance function on $H$. We say $\phi$ preserves $\delta_X$ under $\delta_H$ if there exist functions $\alpha, \beta : \mathbb{Z}^+ \to \mathbb{R}$ such that $\beta(d)/\alpha(d) \to 0$ as $d \to \infty$, and:
$$\alpha(d)\delta_X(x, x') - \beta(d) \leq \delta_H(\phi(x), \phi(x')) \leq \alpha(d)\delta_X(x, x') + \beta(d) \quad (5)$$
for all $x, x' \in X$.

We typically wish the distance function $\delta_H$ on $H$ to be simple to compute. In practice, it is often taken to be the Euclidean, Hamming, or angular distance. The position-ID method preserves the L1 distance with $\delta_H$ the squared Euclidean distance, $\alpha(d) = 2d$, and $\beta(d) = O(n^2\mu d)$; recall that in the constructions above, $\mu$ scales inversely with $\sqrt{d}$, and thus $\beta(d)/\alpha(d) \to 0$. The signed random-projection method preserves the angular distance with $\alpha(d) = O(d)$, $\beta(d) = O(\sqrt{d})$, and $\delta_H$ the Hamming, angular, or Euclidean distance.

5.3.1 Preservation of Cluster Structure

In general, there is no universally applicable definition of cluster structure.
Indeed, numerous algorithms have been proposed in the literature to target various reasonable notions of what constitutes a cluster in the data. Preservation of a distance function accords naturally with K-means-like algorithms which, given a set of data $X \subseteq \mathbb{R}^n$, compute a set of centroids $C = \{c_i\}_{i=1}^k$ and define the associated clusters as the Voronoi cells of the centroids. We here adopt this notion and say that cluster structure $C$ is preserved if, for any $x \in X$:
$$\arg\min_{c \in C} \delta_X(x, c) = \arg\min_{c \in C} \delta_H(\phi(x), \phi(c)).$$
In other words, the set of points bound to a particular cluster centroid does not change under the encoding. We can restate the above as requiring that, for some point $x$ bound to a cluster centroid $c$, it is the case that:
$$\delta_H(\phi(x), \phi(c)) < \delta_H(\phi(x), \phi(c'))$$
for any $c' \in C \setminus \{c\}$. From Definition 20 we have:
$$\delta_H(\phi(x), \phi(c')) - \delta_H(\phi(x), \phi(c)) \geq \alpha(d)\left(\delta_X(x, c') - \delta_X(x, c)\right) - 2\beta(d)$$
for any $x \in X$ and $c, c' \in C$. Rearranging the expressions above, we can see the desired property will be satisfied if:
$$\frac{\beta(d)}{\alpha(d)} < \min_{x \in X} \min_{c' \neq c(x)} \frac{1}{2}\left(\delta_X(x, c') - \delta_X(x, c(x))\right),$$
where $c(x) = \arg\min_{c \in C} \delta_X(x, c)$ denotes the center in $C$ closest to $x$. A sufficient condition for the existence of some $d$ satisfying this property is that $\alpha(d)$ is monotone increasing and faster growing than $\beta(d)$. This condition is satisfied for both the random projection and position-ID encoding methods.

5.3.2 Noise Robustness

It is also of interest to consider robustness to noise in the context of encoding Euclidean data. Suppose we have a set of points $X \subset \mathbb{R}^n$ and a distance function of interest $\delta_X(\cdot, \cdot)$ which is preserved a la Definition 20. Given an arbitrary point $x \in X$, we consider a noise model which corrupts $\phi(x)$ to $\phi(x) + \Delta$, where $\Delta$ is some unspecified noise process.
Along the lines of Section 3.2, we say $\Delta$ is $\rho$-bounded if:
$$\max_{x \in X} |\langle \phi(x), \Delta \rangle| \leq \rho.$$
Suppose we wish to ensure that the encodings can distinguish between all points at a distance at most $\epsilon_1$ from $x$ and all points at a distance at least $\epsilon_2$. That is:
$$\|\phi(x) + \Delta - \phi(x')\| < \|\phi(x) + \Delta - \phi(x'')\|$$
for all $x' \in X$ such that $\delta_X(x, x') \leq \epsilon_1$ and all $x'' \in X$ such that $\delta_X(x, x'') \geq \epsilon_2$. We say that such an encoding is $(\epsilon_1, \epsilon_2)$-robust.

Theorem 21 Let $\delta_X$ be a distance function on $X \subseteq \mathbb{R}^n$ and suppose $\phi$ is an embedding preserving $\delta_X$ under the squared Euclidean distance on $H$ as described in Definition 20. Suppose $\Delta$ is $\rho$-bounded noise. Then $\phi$ is $(\epsilon_1, \epsilon_2)$-robust if:
$$\rho < \frac{\alpha(d)(\epsilon_2 - \epsilon_1) - 2\beta(d)}{4}.$$

Proof Fix a point $x$ whose encoding is corrupted as $\phi(x) + \Delta$. Then for any $x', x'' \in X$ with $\delta_X(x, x') \leq \epsilon_1$ and $\delta_X(x, x'') \geq \epsilon_2$, we have:
$$\|\phi(x) + \Delta - \phi(x'')\|_2^2 - \|\phi(x) + \Delta - \phi(x')\|_2^2 = \|\phi(x) - \phi(x'')\|_2^2 - \|\phi(x) - \phi(x')\|_2^2 - 2\langle \phi(x''), \Delta \rangle + 2\langle \phi(x'), \Delta \rangle$$
$$\geq \alpha(d)\delta_X(x, x'') - \beta(d) - \alpha(d)\delta_X(x, x') - \beta(d) - 4\rho \geq \alpha(d)(\epsilon_2 - \epsilon_1) - 2\beta(d) - 4\rho > 0,$$
as desired. As before, we may consider passive and adversarial examples.

Additive White Gaussian Noise. First consider the case that $H = \mathbb{R}^d$ and $\Delta \sim N(0, \sigma_\Delta^2 I_d)$; that is, each coordinate of $\Delta$ has a Gaussian distribution with mean zero and variance $\sigma_\Delta^2$. Then, as before, we can note that $\langle \phi(x), \Delta \rangle \sim N(0, \sigma_\Delta^2 \|\phi(x)\|_2^2)$. Then it is very likely (four standard deviations in the tail of the normal distribution) that $\rho < 4L\sigma_\Delta$, where $L = \max_{x \in X} \|\phi(x)\|_2$. So then, we have the desired robustness property if:
$$16 L \sigma_\Delta < \alpha(d)(\epsilon_2 - \epsilon_1) - 2\beta(d).$$
Assuming that $\alpha(d)$ is faster growing in $d$ than $L$ and $\beta(d)$, there will exist some encoding dimension for which we can tolerate any given level of noise. In the case of the random projection encoding scheme described above, $\alpha(d) = O(d)$, $\beta(d) = O(\sqrt{d})$, and $L = \sqrt{d}$ exactly. And so we can tolerate noise on the order of:
$$\sigma_\Delta \lesssim \sqrt{d}(\epsilon_2 - \epsilon_1) - O(1).$$
For the position-ID encoding method, $\alpha(d) = O(d)$, $L = O(\sqrt{nd})$, and $\beta(d) = O(n^2 d\mu)$, and so we can tolerate noise:
$$\sigma_\Delta \lesssim \sqrt{\frac{d}{n}}\left((\epsilon_2 - \epsilon_1) - O(n^2\mu)\right).$$

Adversarial Noise.
We now consider the case that $H = \{\pm 1\}^d$, as in the random-projection encoding method, and $\Delta$ is noise in which some fraction $\omega$ of the $d$ coordinates in $\phi(x)$ is maliciously corrupted by an adversary. Since $\|\Delta\|_1 \leq \omega d$, we have, for any $x \in X$:
$$|\langle \phi(x), \Delta \rangle| \leq \|\phi(x)\|_\infty \|\Delta\|_1 \leq \omega d.$$
So then we can tolerate $\omega$ on the order of:
$$\omega \lesssim \frac{\alpha(d)(\epsilon_2 - \epsilon_1) - 2\beta(d)}{4d}.$$
In the case of the random-projection encoding method this boils down to:
$$\omega \lesssim (\epsilon_2 - \epsilon_1) - O\left(\frac{1}{\sqrt{d}}\right),$$
meaning the total number of coordinates that can be corrupted is $O(d(\epsilon_2 - \epsilon_1))$.

Robustness to Input Noise. A natural question is whether the HD representations also confer any robustness to noise in the input space $X$, rather than the HD space $H$. In general, preservation of distance does not imply any particular robustness to input noise, and the answer to this question depends on the particulars of the encoding method in question. Since a general treatment is difficult to give, we will not pursue this matter in depth at present.

6. Learning on HD Data Representations

We now turn to the question of using HD representations in learning algorithms. Our goal is to clarify in what precise sense the HD encoding process can make learning easier. We study two ways in which this can happen: the encoding process can increase the separation between classes and/or can induce sparsity. Both of these characteristics can be exploited by neurally plausible algorithms to simplify learning. Throughout this discussion, we assume access to a set of $N$ labelled examples $S = \{(x_i, y_i)\}_{i=1}^N$, where $x_i$ lies in $[0, 1]^n$ and $y_i \in C$ is a categorical variable indicating the class label. In general, we are interested in the case that training examples arrive in a streaming, or online, fashion, although our conclusions apply to fixed and finite data as well.
6.1 Learning by Bundling

The simplest approach to learning with HD representations is to bundle together the training examples corresponding to each class into a set of exemplars, often referred to as prototypes, which are then used for classification (Kleyko et al., 2018; Rahimi et al., 2018; Burrello et al., 2018). More formally, as described in Section 2, we construct the prototype $c_k$ for the $k$-th class as:
$$c_k = \sum_{i \,:\, y_i = k} \phi(x_i),$$
and then assign a class label for some query point $x_q$ as:
$$\hat{y} = \arg\max_{k \in C} \langle \phi(x_q), c_k \rangle.$$
This approach bears a strong resemblance to naive Bayes and Fisher's linear discriminant, which are both classic simple statistical procedures for classification (Bishop, 2007). Like these methods, the bundling approach is appealing due to its simplicity. However, it also shares their weaknesses, in that it may fail to separate data that is in fact linearly separable.

6.2 Learning Arbitrary Linear Separators

Linear separability is one of the most basic types of structure that can aid learning. The theory of linear models is well developed, and several simple, neurally plausible algorithms for learning linear separators are known, for instance the Perceptron and Winnow (Rosenblatt, 1958; Littlestone, 1988). Thus, if our data is linearly separable in low-dimensional space, we would like it to remain so after encoding, so that these methods can be applied. We now show formally that preservation of distance is sufficient, under some conditions, to preserve linear separability.

Theorem 22 Let $X$ and $X'$ be two disjoint, closed, and convex sets of points in $\mathbb{R}^n$. Let $p \in X$ and $q \in X'$ be the closest pair of points between the two sets. Suppose $\phi$ preserves the squared L2 distance on $X$ under the squared L2 distance on $H$ in the sense of Definition 20.
Then the function
$$f(x) = \langle \phi(x), \phi(p) - \phi(q) \rangle - \frac{1}{2}\left(\|\phi(p)\|_2^2 - \|\phi(q)\|_2^2\right)$$
is positive for all $x \in X$ and negative for all $x \in X'$ provided:
$$\frac{\beta(d)}{\alpha(d)} < \frac{1}{2}\|p - q\|_2^2.$$

Proof We first observe:
$$\langle \phi(x), \phi(p) - \phi(q) \rangle - \frac{1}{2}\left(\|\phi(p)\|_2^2 - \|\phi(q)\|_2^2\right) = \frac{1}{2}\|\phi(x) - \phi(q)\|_2^2 - \frac{1}{2}\|\phi(x) - \phi(p)\|_2^2.$$
We may then use Definition 20 to obtain:
$$\frac{1}{2}\|\phi(x) - \phi(q)\|_2^2 - \frac{1}{2}\|\phi(x) - \phi(p)\|_2^2 \geq \frac{\alpha(d)}{2}\|x - q\|_2^2 - \frac{\alpha(d)}{2}\|x - p\|_2^2 - \beta(d) = \alpha(d)\left(\langle x, p - q \rangle - \frac{1}{2}\left(\|p\|_2^2 - \|q\|_2^2\right)\right) - \beta(d).$$
By a standard proof of the hyperplane separation theorem (e.g., Section 2.5.1 of (Boyd & Vandenberghe, 2014)), $\langle x, p - q \rangle - \frac{1}{2}(\|p\|_2^2 - \|q\|_2^2) \geq \frac{1}{2}\|p - q\|_2^2$ for any $x \in X$, and thus $f(x) > 0$ if $\frac{\beta(d)}{\alpha(d)} < \frac{1}{2}\|p - q\|_2^2$; the proof for $x \in X'$ is analogous.

A natural question is whether a linear separator on the HD representation can capture a nonlinear decision boundary on the original data. The connection with kernel methods discussed in Section 5.2.1 presents one avenue for rigorously addressing this question. As noted there, the encoding function can sometimes be interpreted as approximating the feature map of a kernel, which in turn can be used to linearize learning problems in some settings (Shawe-Taylor et al., 2004). However, a thorough examination of this question is beyond the scope of the present work.

6.2.1 Learning Sparse Classifiers on Random Projection Encodings

The random projection encoding method can be seen to lead to representations that are sparse, in the sense that a subset of just $k \ll d$ coordinates suffices for determining the class label. This setting accords naturally with the Winnow algorithm (Littlestone, 1988), which is known to make on the order of $k \log d$ mistakes when the target is a linear function of $k \ll d$ variables. This can offer substantially faster convergence than the Perceptron when the margin is small. Curiously, while the Perceptron algorithm is commonly used in the HD community, we are unaware of any work using Winnow for learning.
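For readers unfamiliar with Winnow, the following sketch (our own; it learns a $k$-literal monotone disjunction, the classic setting for Winnow's $O(k \log d)$ mistake bound, with an invented sparse example distribution and the standard doubling/halving update) illustrates the multiplicative-update rule:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 1024, 3
relevant = rng.choice(d, size=k, replace=False)   # hidden relevant coordinates

def target(x):                  # k-literal disjunction of the relevant bits
    return int(x[relevant].any())

# Winnow: multiplicative weight updates, threshold d, weights start at 1.
w = np.ones(d)
mistakes = 0
for _ in range(2000):
    x = (rng.random(d) < 0.05).astype(float)      # sparse binary examples
    y = target(x)
    y_hat = int(w @ x >= d)
    if y_hat != y:
        mistakes += 1
        # Promote active weights on a false negative; demote on a false positive.
        w *= np.where(x > 0, 2.0 if y == 1 else 0.5, 1.0)

# Winnow makes O(k log d) mistakes -- far fewer than the 2000 rounds seen.
assert mistakes <= 8 * k * np.log2(d)
```

The mistake count depends only logarithmically on the ambient dimension $d$, which is exactly the property that makes Winnow attractive for the high-dimensional, sparse regime described above.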
Theorem 23 Let $X$ and $X'$ be two sets of points supported on the $n$-dimensional unit sphere and separated by a unit-norm hyperplane $w$ with margin $\gamma = \min_x |\langle x, w \rangle|$. Let $\Phi \in \mathbb{R}^{d \times n}$ be a matrix whose rows are sampled from the uniform distribution over the $n$-dimensional unit sphere. Define the encoding of a point $x$ by $\phi(x) = \Phi x$. With high probability, $X$ and $X'$ are linearly separable using just $k$ coordinates in the encoded space, provided:
$$d = \Omega\left(k \exp\left(\frac{n}{2k\gamma^2}\right)\right).$$
To prove the theorem we first use the following simple lemma:

Lemma 24 Suppose there exists a row $\Phi^{(i)}$ of the projection matrix such that $\langle \Phi^{(i)}, w \rangle > 1 - \gamma^2/2$. Then $\langle \Phi^{(i)}, x \rangle$ is positive for any $x \in X$ and negative for any $x \in X'$.

Proof The constraint on the dot product of $\Phi^{(i)}$ and $w$ implies $\|\Phi^{(i)} - w\|_2^2 = \|\Phi^{(i)}\|^2 + \|w\|^2 - 2\langle \Phi^{(i)}, w \rangle < \gamma^2$. Thus, for any $x \in X$:
$$\langle \Phi^{(i)}, x \rangle = \langle w, x \rangle + \langle \Phi^{(i)} - w, x \rangle \geq \gamma + \langle \Phi^{(i)} - w, x \rangle \geq \gamma - \|\Phi^{(i)} - w\|_2 > 0.$$
A similar argument shows that $\langle \Phi^{(i)}, x \rangle$ is negative on $X'$.

Unfortunately, the probability of randomly sampling such a direction is tiny: on the order of $\gamma^n$. However, we might instead hope to sample $k$ vectors that are weakly correlated with $w$ and exploit their cumulative effect on $x$. We say a vector $u \in \mathbb{R}^n$ is $\rho$-correlated with $w$ if $\langle u, w \rangle \geq \rho$. We are now in a position to prove the theorem.

Proof For $w \in S^{n-1}$ and $\rho \in (0, 1)$, let $C = \{u \in S^{n-1} : \langle u, w \rangle \geq \rho\}$ denote the spherical cap of vectors $\rho$-correlated with $w$. Suppose we pick vectors $u^{(1)}, \ldots, u^{(k)}$ uniformly at random from $C$. Then, with probability at least 1/2:
$$\frac{\left\langle \sum_j u^{(j)}, w \right\rangle}{\left\|\sum_j u^{(j)}\right\|_2} \geq 1 - \frac{1}{2k\rho^2}. \quad (7)$$
To see this, note that without loss of generality we may assume $w = e_1$, the first standard basis vector of $\mathbb{R}^n$, and write any $u \in \mathbb{R}^n$ as $u = (u_1, u_R)$: the first coordinate and the remaining $n - 1$ coordinates. Now let $N = \langle \sum_j u^{(j)}, w \rangle = \sum_j u^{(j)}_1 \geq k\rho$. Then:
$$\left\|\sum_j u^{(j)}\right\|_2^2 = N^2 + \left\|\sum_j u^{(j)}_R\right\|_2^2 = N^2 + \sum_j \|u^{(j)}_R\|_2^2 + \sum_{i \neq j} \langle u^{(i)}_R, u^{(j)}_R \rangle \leq N^2 + k + \sum_{i \neq j} \langle u^{(i)}_R, u^{(j)}_R \rangle.$$
The last term has a symmetric distribution around zero over random samplings of the $u^{(j)}$.
Thus, with probability 1/2, it is ≤ 0, whereupon

⟨Σⱼ u⁽ʲ⁾, w⟩ / ||Σⱼ u⁽ʲ⁾||₂ ≥ N / √(N² + k) ≥ 1 − k/(2N²) ≥ 1 − 1/(2kρ²).

To ensure the quantity above is at least 1 − γ²/2, we must have:

ρ ≥ 1/(γ√k).

It now remains to compute the probability that a vector Φ⁽ⁱ⁾ sampled uniformly from Sⁿ⁻¹ lies in C, or equivalently, that Φ⁽ⁱ⁾₁ ≥ ρ. Noting that we may simulate a random direction on Sⁿ⁻¹ by sampling z ∼ N(0, Iₙ) and normalizing, we obtain the reasonable approximation Φ⁽ⁱ⁾₁ ∼ N(0, 1/n). Therefore, the probability that Φ⁽ⁱ⁾₁ ≥ ρ is on the order of e^{−nρ²/2}. So we need:

d = Ω(k · exp(n / (2kγ²))).

In summary, the random projection method in tandem with the Winnow algorithm seems well suited to the HD setting, where sparsity can be exploited to simplify learning.

7. Conclusion

To conclude, we lay out several research directions related to HD computing that we believe would be of particular interest to explore further. There are several interesting open problems related to encoding. Our analysis established preservation of only the most basic forms of structure in data. Can encoding procedures satisfying the desiderata of HD computing be designed to capture other forms of structure? The quantized random Fourier feature construction discussed in Section 5 presents one such option, but is applicable only to structure that can be captured by a shift-invariant kernel on a Euclidean space. For instance, can we devise encoding methods that exploit low-dimensional manifold structure in the data, or that are adaptive and can be learned from a particular data set? Several recent works have claimed, based on empirical evidence, that HD computing evinces one-shot learning (Burrello et al., 2018; Imani et al., 2017; Rahimi et al., 2018), in which a single labeled example suffices to learn a generalizable classifier (Thrun, 1996; Lake et al., 2011).
However, this work has focused on settings in which specialized handcrafted features could be extracted, and it is not clear to us that existing encoding procedures would lead to one-shot classifiers absent such outside information. We would be interested to explore whether the HD representation makes one-shot learning easier in any broader sense. We expect this will necessitate more sophisticated encoding procedures that can learn salient properties of a given domain. On this latter point, we see dictionary learning (Olshausen & Field, 1996) as a promising avenue for developing adaptive encoding procedures. Dictionary learning is a well-studied problem that can be solved using online and neurally plausible methods (Arora et al., 2015; Mairal et al., 2010), and it would thus seem a promising way to address the limitations of existing encoding procedures without sacrificing the simplicity and neural plausibility of existing HD-based methods.

Acknowledgements

This work was supported in part by CRISP, one of six centers in JUMP, an SRC program sponsored by DARPA; in part by an SRC Global Research Collaboration grant, GRC TASK 3021.001, GRC TASK 2942.001, and DARPA-PA-19-03-03 Agreement HR00112090036; and by NSF grants 1527034, 1730158, 1826967, 2100237, 2112167, 2052809, 2003279, 1830399, 1911095, and 2028040.

Appendix A. Proofs of Selected Theorems

A.1 Proof of Theorem 4

Proof The result is an immediate consequence of the Hanson-Wright inequality (Hanson & Wright, 1971; Rudelson et al., 2013), which holds that, for x a centered, d-dimensional, σ-sub-Gaussian random vector and A ∈ ℝ^{d×d} an arbitrary square matrix, the quadratic form xᵀAx obeys the following concentration bound:

P(|xᵀAx − E[xᵀAx]| ≥ t) ≤ 2 exp(−c · min(t² / (σ⁴||A||_F²), t / (σ²||A||))),

where c is a positive absolute constant, ||A||_F² = Σ_{i,j} |A_{ij}|² is the squared Frobenius norm, and ||A|| = max_{||x|| ≤ 1} ||Ax|| is the operator norm.
The result follows by taking A to be the d × d identity matrix, in which case xᵀI_d x = ||x||₂², and union bounding over all m symbols in the alphabet.

A.2 Proof of Theorem 7

Proof Fix some a′ ∉ S. As described in Theorem 5, the quantity ⟨φ(a), φ(a′)⟩ is sub-Gaussian with parameter at most L²_max σ², where L_max = max_a ||φ(a)||. Then, again using the fact that sub-Gaussianity is preserved under sums, by Hoeffding's inequality we have:

P(|(1/L²) Σ_{a∈S} ⟨φ(a), φ(a′)⟩| ≥ τ) ≤ 2 exp(−τ²L⁴ / (2sσ²L²_max)) = 2 exp(−κτ²L² / (2sσ²)),

where κ = L²/L²_max. The result follows by union bounding over all m possible a′.

A.3 Proof of Theorem 9

Proof Expanding the dot product between the two representations:

(1/L²) ⟨φ(S), φ(S′)⟩ = (1/L²) Σ_{a∈S∩S′} ⟨φ(a), φ(a)⟩ + (1/L²) Σ_{a∈S} Σ_{a′∈S′∖{a}} ⟨φ(a), φ(a′)⟩ ≤ |S ∩ S′| + ss′µ.

The other direction is analogous.

A.4 Proof of Theorem 10

Proof Consider some symbol a ∈ A. In the event a ∈ S:

⟨φ(a), φ(S) + ε⟩ = ⟨φ(a), φ(S)⟩ + ⟨φ(a), ε⟩ ≥ L² − sL²µ − ρ,

and when a ∉ S:

⟨φ(a), φ(S) + ε⟩ ≤ sL²µ + ρ.

Therefore we can decode correctly if:

ρ/L² + sµ < 1/2.

Proof of Lemma 12

Proof Consider first the case of passive noise. Fix some a ∈ A. Noting that ⟨φ(a), ε⟩ is the sum of d terms bounded in [−c, c], another application of Hoeffding's inequality and the union bound will show:

P(∃ a s.t. |⟨φ(a), ε⟩| ≥ ρ) ≤ 2m exp(−ρ² / (2dc²)).

Therefore, with probability 1 − δ, we have that ε is ρ-bounded for ρ ≤ c√(2d ln(2m/δ)). Noting that L² = d exactly, the result follows by applying Theorem 10.

Now let us consider the adversarial case, in which ||ε||₁ ≤ ωsd. We first observe that |⟨φ(a), ε⟩| ≤ ||φ(a)||_∞ ||ε||₁ ≤ ωsd. Then, applying Theorem 10 with ρ = ωsd and L² = d, we obtain the decoding condition:

ωs + sµ < 1/2,

as claimed.

Proof of Theorem 14

Proof Note first that ||φ(a) ⊙ ψ(f)||₂ = ||φ(a)||₂. Then, fixing a, a′ and f, by Hoeffding's inequality:

P(|⟨φ(a), φ(a′) ⊙ ψ(f)⟩| ≥ µL²) ≤ 2 exp(−L⁴µ² / (2σ²||φ(a′)||₂²)) ≤ 2 exp(−κµ²L² / (2σ²)),

where we have again defined κ = (min_a ||φ(a)||₂²) / (max_{a′} ||φ(a′)||₂²). The result follows by the union bound over all ≤ nm²/2 combinations of a, a′, f.
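The set-representation scheme analyzed in Theorems 9 and 10 can be checked numerically. The sketch below (ours, not from the paper) bundles random bipolar codewords by summation and decodes set membership with a dot-product threshold at d/2; the dimensions, seed, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, s = 10_000, 64, 10        # hypervector dimension, alphabet size, set size

# Random bipolar codebook: one d-dimensional +/-1 hypervector per symbol.
codebook = rng.choice([-1, 1], size=(m, d))

# Encode a set S of s symbols by bundling (summing) their codewords.
S = rng.choice(m, size=s, replace=False)
phi_S = codebook[S].sum(axis=0)

# Decode membership: <phi(a), phi(S)> concentrates near d (the self term,
# L^2 = d exactly) for a in S and near 0 (cross-talk only) otherwise,
# so thresholding at d/2 separates the two cases with high probability.
scores = codebook @ phi_S
decoded = set(np.flatnonzero(scores >= d / 2))

print(decoded == set(S))  # membership recovered exactly, w.h.p.
```

The cross-talk terms have standard deviation on the order of √(sd) ≈ 316 here, far below the d/2 = 5000 threshold, which mirrors the ρ/L² + sµ < 1/2 decoding condition of Theorem 10.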
Proof of Theorem 17

Proof Expanding:

||φ(x) − φ(x′)||₂² = ||φ(x)||₂² + ||φ(x′)||₂² − 2⟨φ(x), φ(x′)⟩.

Note first that ||φ(x)||₂² = nd + ε, where ε is a mean-zero noise term due to cross-talk between the codewords. Neglecting minor errors from the ceiling function, the dot product expands to:

⟨φ(x), φ(x′)⟩ = Σᵢ ⟨φ(xᵢ) ⊙ ψ(fᵢ), φ(x′ᵢ) ⊙ ψ(fᵢ)⟩ + Σ_{i≠j} ⟨φ(xᵢ) ⊙ ψ(fᵢ), φ(x′ⱼ) ⊙ ψ(fⱼ)⟩ = Σᵢ ⟨φ(xᵢ), φ(x′ᵢ)⟩ + ε′ = Σᵢ d(1 − |a(xᵢ) − a(x′ᵢ)|) + ε′ = d(n − ||x − x′||₁) + ε′,

where a(xᵢ) is taken to be the centroid corresponding to xᵢ and ε′ is another noise term due to cross-talk. Putting both together, and noting that ε, ε′ ≤ n²dµ, we have:

2d(||x − x′||₁ − 2n²µ) ≤ ||φ(x) − φ(x′)||₂² ≤ 2d(||x − x′||₁ + 2n²µ),

where the incoherence µ can be bounded as in Equation 4.

References

Arora, S., Ge, R., Ma, T., & Moitra, A. (2015). Simple, efficient, and neural algorithms for sparse coding. Proceedings of The 28th Conference on Learning Theory, COLT 2015, 40, 113-149.

Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930-945.

Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.

Bishop, C. M. (2007). Pattern recognition and machine learning, 5th Edition. Information science and statistics. Springer.

Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426.

Boyd, S. P., & Vandenberghe, L. (2014). Convex Optimization. Cambridge University Press.

Broder, A., & Mitzenmacher, M. (2004). Network applications of bloom filters: A survey. Internet Mathematics, 1(4), 485-509.

Burrello, A., Schindler, K., Benini, L., & Rahimi, A. (2018). One-shot learning for iEEG seizure detection using end-to-end binary operations: Local binary patterns with hyperdimensional computing. In 2018 IEEE Biomedical Circuits and Systems Conference (BioCAS), pp. 1-4. IEEE.

Caron, S.
J., Ruta, V., Abbott, L., & Axel, R. (2013). Random convergence of olfactory inputs in the Drosophila mushroom body. Nature, 497(7447), 113-117.

Chacron, M. J., Longtin, A., & Maler, L. (2011). Efficient computation via sparse coding in electrosensory neural networks. Current Opinion in Neurobiology, 21(5), 752-760.

Choromanski, K. M., Rowland, M., & Weller, A. (2017). The unreasonable effectiveness of structured random orthogonal embeddings. In Advances in Neural Information Processing Systems, pp. 219-228.

Cybenko, G. (1989). Approximations by superpositions of sigmoidal functions. Mathematics of Control, Signals, and Systems, 2(4), 303-314.

Dasgupta, S., Sheehan, T. C., Stevens, C. F., & Navlakha, S. (2018). A neural data structure for novelty detection. Proceedings of the National Academy of Sciences, 115(51), 13093-13098.

Donoho, D. L., Elad, M., & Temlyakov, V. N. (2005). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1), 6-18.

Frady, E. P., Kleyko, D., & Sommer, F. T. (2018). A theory of sequence indexing and working memory in recurrent neural networks. Neural Computation, 30(6), 1449-1513.

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193-202.

Gallant, S. I., & Okaywe, T. W. (2013). Representing objects, relations, and sequences. Neural Computation, 25(8), 2038-2078.

Gupta, S., Imani, M., & Rosing, T. (2018). FELIX: Fast and energy-efficient logic in memory. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1-7. IEEE.

Hanson, D. L., & Wright, F. T. (1971). A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3), 1079-1083.

Imani, M., Kong, D., Rahimi, A., & Rosing, T. (2017).
VoiceHD: Hyperdimensional computing for efficient speech recognition. In 2017 IEEE International Conference on Rebooting Computing (ICRC), pp. 1-8. IEEE.

Imani, M., Messerly, J., Wu, F., Pi, W., & Rosing, T. (2019a). A binary learning framework for hyperdimensional computing. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 126-131. IEEE.

Imani, M., Morris, J., Bosch, S., Shu, H., De Micheli, G., & Rosing, T. (2019b). AdaptHD: Adaptive efficient training for brain-inspired hyperdimensional computing. In 2019 IEEE Biomedical Circuits and Systems Conference (BioCAS), pp. 1-4. IEEE.

Imani, M., Morris, J., Messerly, J., Shu, H., Deng, Y., & Rosing, T. (2019c). BRIC: Locality-based encoding for energy-efficient brain-inspired hyperdimensional computing. In Proceedings of the 56th Annual Design Automation Conference 2019, p. 52. ACM.

Imani, M., Nassar, T., Rahimi, A., & Rosing, T. (2018). HDNA: Energy-efficient DNA sequencing using hyperdimensional computing. In 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 271-274. IEEE.

Imani, M., Rahimi, A., Kong, D., Rosing, T., & Rabaey, J. M. (2017). Exploring hyperdimensional associative memory. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 445-456. IEEE.

Imani, M., Salamat, S., Gupta, S., Huang, J., & Rosing, T. (2019). FACH: FPGA-based acceleration of hyperdimensional computing by reducing computational complexity. In Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 493-498. ACM.

Jacques, L., Laska, J. N., Boufounos, P. T., & Baraniuk, R. G. (2013). Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Transactions on Information Theory, 59(4), 2082-2102.

Kanerva, P. (1994). The spatter code for encoding concepts at many levels. In International Conference on Artificial Neural Networks, pp. 226-229. Springer.

Kanerva, P. (1995).
A family of binary spatter codes. In International Conference on Artificial Neural Networks, Vol. 1, pp. 517-522.

Kanerva, P. (2009). Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2), 139-159.

Kim, Y., Imani, M., Moshiri, N., & Rosing, T. (2020). GenieHD: Efficient DNA pattern matching accelerator using hyperdimensional computing. In 2020 Design, Automation & Test in Europe Conference & Exhibition. IEEE.

Kim, Y., Imani, M., & Rosing, T. S. (2018). Efficient human activity recognition using hyperdimensional computing. In Janowicz, K., Kuhn, W., Cena, F., Haller, A., & Vamvoudakis, K. G. (Eds.), Proceedings of the 8th International Conference on the Internet of Things, IOT 2018, Santa Barbara, CA, USA, October 15-18, 2018, pp. 38:1-38:6. ACM.

Kleyko, D., Rahimi, A., Gayler, R. W., & Osipov, E. (2019). Autoscaling bloom filter: Controlling trade-off between true and false positives. Neural Computing and Applications, 32, 1-10.

Kleyko, D., Rahimi, A., Rachkovskij, D. A., Osipov, E., & Rabaey, J. M. (2018). Classification and recall with binary hyperdimensional computing: Tradeoffs in choice of density and mapping characteristics. IEEE Transactions on Neural Networks and Learning Systems, 29(12), 5880-5898.

Lake, B., Salakhutdinov, R., Gross, J., & Tenenbaum, J. (2011). One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 33.

Levy, S. D., & Gayler, R. (2008). Vector symbolic architectures: A new building material for artificial general intelligence. In Conference on Artificial General Intelligence, pp. 414-418. IOS Press.

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 285-318.

Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2010).
Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(1).

Masse, N. Y., Turner, G. C., & Jefferis, G. S. (2009). Olfactory information processing in Drosophila. Current Biology, 19(16), R700-R713.

McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141(1), 148-188.

Mitrokhin, A., Sutor, P., Fermüller, C., & Aloimonos, Y. (2019). Learning sensorimotor control with neuromorphic sensors: Toward hyperdimensional active perception. Science Robotics, 4(30).

Neubert, P., Schubert, S., & Protzel, P. (2019). An introduction to hyperdimensional computing for robotics. KI - Künstliche Intelligenz, 33(4), 319-330.

Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607-609.

Olshausen, B. A., & Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4), 481-487.

Pagh, A., Pagh, R., & Rao, S. S. (2005). An optimal bloom filter replacement. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 823-829.

Parcollet, T., Ravanelli, M., Morchid, M., Linarès, G., Trabelsi, C., Mori, R. D., & Bengio, Y. (2019). Quaternion recurrent neural networks. In International Conference on Learning Representations (ICLR).

Plan, Y., & Vershynin, R. (2014). Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2), 438-461.

Plate, T. (2003). Holographic Reduced Representation: Distributed Representation for Cognitive Structures. CSLI Lecture Notes Series. CSLI Publications.

Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3), 623-641.

Rachkovskij, D. (2015). Formation of similarity-reflecting binary vectors with random binary projections. Cybernetics and Systems Analysis, 51(2), 313-323.

Rachkovskij, D. A. (2001).
Representation and processing of structures with binary sparse distributed codes. IEEE Transactions on Knowledge and Data Engineering, 13(2), 261-276.

Rachkovskiy, D. A., Slipchenko, S. V., Kussul, E. M., & Baidyk, T. N. (2005a). Sparse binary distributed encoding of scalars. Journal of Automation and Information Sciences, 37(6).

Rachkovskiy, D. A., Slipchenko, S. V., Misuno, I. S., Kussul, E. M., & Baidyk, T. N. (2005b). Sparse binary distributed encoding of numeric vectors. Journal of Automation and Information Sciences, 37(11).

Raginsky, M., & Lazebnik, S. (2009). Locality-sensitive binary codes from shift-invariant kernels. In Advances in Neural Information Processing Systems, pp. 1509-1517.

Rahimi, A., Benatti, S., Kanerva, P., Benini, L., & Rabaey, J. M. (2016). Hyperdimensional biosignal processing: A case study for EMG-based hand gesture recognition. In 2016 IEEE International Conference on Rebooting Computing (ICRC), pp. 1-8. IEEE.

Rahimi, A., Datta, S., Kleyko, D., Frady, E. P., Olshausen, B., Kanerva, P., & Rabaey, J. M. (2017). High-dimensional computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems I: Regular Papers, 64(9), 2508-2521.

Rahimi, A., Kanerva, P., Benini, L., & Rabaey, J. M. (2018). Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals. Proceedings of the IEEE, 107(1), 123-143.

Rahimi, A., Tchouprina, A., Kanerva, P., Millán, J. d. R., & Rabaey, J. M. (2017). Hyperdimensional computing for blind and one-shot classification of EEG error-related potentials. Mobile Networks and Applications, 25, 1-12.

Rahimi, A., & Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177-1184.

Rosenblatt, F. (1958). The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386-408.

Rudelson, M., & Vershynin, R.
(2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18.

Rudin, W. (1962). Fourier Analysis on Groups. John Wiley and Sons, Ltd.

Rumelhart, D., McClelland, J., & the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press.

Sahlgren, M. (2005). An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering.

Salamat, S., Imani, M., Khaleghi, B., & Rosing, T. (2019). F5-HD: Fast flexible FPGA-based framework for refreshing hyperdimensional computing. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 53-62.

Schmuck, M., Benini, L., & Rahimi, A. (2019). Hardware optimizations of dense binary hyperdimensional computing: Rematerialization of hypervectors, binarized bundling, and combinational associative memory. ACM Journal on Emerging Technologies in Computing Systems (JETC), 15(4), 1-25.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.

Stettler, D. D., & Axel, R. (2009). Representations of odor in the piriform cortex. Neuron, 63(6), 854-864.

Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, pp. 640-646.

Turner, G. C., Bazhenov, M., & Laurent, G. (2008). Olfactory representations by Drosophila mushroom body neurons. Journal of Neurophysiology, 99(2), 734-746.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Vol. 48. Cambridge University Press.

Weiss, E., Cheung, B., & Olshausen, B. (2016). A neural architecture for representing and reasoning about spatial relationships. In International Conference on Learning Representations (ICLR).
Widdows, D., & Cohen, T. (2015). Reasoning with vectors: A continuous model for fast robust inference. Logic Journal of the IGPL, 23(2), 141-173.

Wilson, R. I. (2013). Early olfactory processing in Drosophila: Mechanisms and principles. Annual Review of Neuroscience, 36, 217-241.

Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2016). Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with 1/n parameters. In International Conference on Learning Representations (ICLR).