# Text Classification with Born's Rule

Emanuele Guidotti
Institute of Financial Analysis, University of Neuchâtel, Switzerland
emanuele.guidotti@unine.ch

Alfio Ferrara
Department of Computer Science and Data Science Research Center, University of Milan, Italy
alfio.ferrara@unimi.it

Abstract

This paper presents a text classification algorithm inspired by the notion of superposition of states in quantum physics. By regarding text as a superposition of words, we derive the wave function of a document and we compute the transition probability of the document to a target class according to Born's rule. Two complementary implementations are presented. In the first one, wave functions are calculated explicitly. The second implementation embeds the classifier in a neural network architecture. Through analysis of three benchmark datasets, we illustrate several aspects of the proposed method, such as classification performance, explainability, and computational efficiency. These ideas are also applicable to non-textual data.

1 Introduction

In quantum physics, the superposition principle is the idea that a system can be in multiple states at the same time (e.g., a cat is simultaneously alive and dead in Schrödinger's thought experiment [22]). Here we show that text can be formally treated as a quantum system in a superposition of words. Our superposition-of-words model is a bag-of-words model where each word is treated as a quantum state. By exploiting this representation, we develop a supervised classifier based on a key postulate of quantum mechanics, namely the Born rule [4]. The Born rule provides a link between the mathematical formalism of quantum theory and experiment, and as such is almost single-handedly responsible for practically all predictions of quantum physics.¹ In this paper, we represent documents and classes as quantum objects and we compute the probability of a document to collapse into a target class by applying the Born rule.

The paper is structured as follows. Section 2 presents the motivation of this work. Section 3 introduces the notation and some preliminary notions in quantum mechanics. Section 4 develops our classification algorithm. Section 5 embeds the classifier in a neural network architecture. Section 6 presents our empirical results. Finally, Section 7 gives our concluding remarks and discusses extensions to semantics and non-textual data.

To simplify extensions to this work, we implement our classification algorithm in scikit-learn [17] and we embed the classifier in a neural network architecture using PyTorch [16]. All the code is available at https://github.com/eguidotti/bornrule.

¹We refer the reader to [14] for an introduction to the Born rule.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

2 Motivation

In its most general form, the no-free-lunch theorem [23] implies that only prior knowledge makes it possible to generalize from the training examples to novel test examples. As quantum mechanics represents our understanding of nature at the deepest level [13], we wonder whether quantum-inspired machine learning may incorporate a fundamental form of prior knowledge. Our hope is that such knowledge results in more efficient algorithms, with better generalization ability and shorter computational times.
Such algorithms would also be beneficial in terms of explainability and interpretability [8], as their interpretation is immediately inherited from the physical model [18].²

This work should not be confused with quantum computing, which harnesses the properties of quantum states to perform calculations [1], or with quantum machine learning [3], which explores how to devise and implement quantum software that could enable machine learning faster than that of classical computers [6]. We refer to our work as quantum-inspired machine learning, in that we use quantum theory to derive machine learning algorithms that can run on classical computers. Although this is not the first time quantum theory is applied to machine learning (see, e.g., [15]), this is, to our knowledge, the first step towards a new general-purpose algorithm for supervised classification based on a key postulate of quantum mechanics, namely the Born rule. Here we focus on text classification, as the analogy with quantum systems is straightforward.³ However, the formalism we present is general and the implementations we derive can be used in practice as general-purpose classifiers.

²We refer the reader to [19] for an introduction to quantum mechanics and its interpretation.
³We refer the reader to [2, 5, 21, 27, 28] for quantum language models.

3 Preliminaries

Let $x$ be a feature vector with elements $x_j$ for $j = 1, \ldots, J$. Let $y$ be a probability vector with elements $y_k \geq 0$ for $k = 1, \ldots, K$ such that $\sum_k y_k = 1$. Our goal is to learn a function $g$ such that $y = g(x)$. Then, given a test instance $x'$, we predict the probabilities $y' = g(x')$ and select the class $k^* = \arg\max_k y'_k$ for classification. We start by giving some preliminary notions in quantum mechanics.

Wave function. In quantum physics, a system is regarded as a superposition of states $|s\rangle$ and, using Dirac's notation [7], it is represented by a wave function $|\psi\rangle$:

$$|\psi\rangle = \sum_s \psi_s |s\rangle \quad \text{with} \quad \psi_s \in \mathbb{C}. \qquad (1)$$

Born rule. In the Copenhagen interpretation, the modulus squared of the inner product is interpreted as the (unnormalized) probability of the wave function $|\psi\rangle$ collapsing to a new wave function $|\phi\rangle$:

$$P(\psi \to \phi) = |\langle \phi | \psi \rangle|^2 = \Big| \sum_s \bar{\phi}_s \psi_s \Big|^2, \qquad (2)$$

where $\bar{\phi}_s$ denotes the complex conjugate of $\phi_s$. This is known as the Born rule, and it is one of the fundamental postulates of quantum mechanics.

Wave coefficients. From (1) and (2), we notice that the coefficient $|\psi_s|^2$ represents the (unnormalized) probability of the wave function $|\psi\rangle$ to collapse in state $|s\rangle$:

$$P(\psi \to s) = |\langle s | \psi \rangle|^2 = |\psi_s|^2. \qquad (3)$$

4 Classification Algorithm

Let the feature vector $x$ contain only non-negative elements, $x_j \geq 0$ for all $j$ (e.g., word counts or tf-idf weights). We regard $x_j$ as the (unnormalized) probability of the data instance (e.g., document) to collapse in the $j$-th feature (e.g., word). We represent the $j$-th feature as a quantum state $|j\rangle$, and we represent the data instance with a superposition of states $|\psi\rangle = \sum_j \psi_j |j\rangle$. According to (3) we have $x_j = P(\psi \to j) = |\psi_j|^2$, which implies the natural choice $\psi_j = \sqrt{x_j}$:

$$|\psi\rangle = \sum_j \psi_j |j\rangle = \sum_j \sqrt{x_j}\, |j\rangle. \qquad (4)$$

In a similar way, we represent the $k$-th class with a wave function $|\phi^{(k)}\rangle = \sum_j \phi^{(k)}_j |j\rangle$, and we obtain the coefficients $\phi^{(k)}_j$ by setting the transition probability from $|\phi^{(k)}\rangle$ to $|j\rangle$ equal to the conditional probability of feature $j$ given class $k$, which we write as $P_{j|k}$. According to (3) we have $P_{j|k} = P(\phi^{(k)} \to j) = |\phi^{(k)}_j|^2$, which implies the natural choice $\phi^{(k)}_j = \sqrt{P_{j|k}}$:

$$|\phi^{(k)}\rangle = \sum_j \phi^{(k)}_j |j\rangle = \sum_j \sqrt{P_{j|k}}\, |j\rangle. \qquad (5)$$

Finally, we obtain the classification probability $y_k$ by computing the probability of $|\psi\rangle$ to collapse into $|\phi^{(k)}\rangle$. By substituting (4) and (5) in (2), the unnormalized probabilities are

$$u_k = P(\psi \to \phi^{(k)}) = |\langle \phi^{(k)} | \psi \rangle|^2 = \Big( \sum_j \phi^{(k)}_j \psi_j \Big)^2 = \Big( \sum_j \sqrt{P_{j|k}}\, \sqrt{x_j} \Big)^2, \qquad (6)$$

and the normalized probabilities are $y_k = u_k / \sum_{k'} u_{k'}$.
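As a minimal numerical illustration of (4)–(6), the following NumPy sketch computes the class probabilities of a single document from given conditional probabilities $P_{j|k}$; the array names and the random values are placeholders, not part of the original derivation.

```python
import numpy as np

# Illustrative sizes: J features (words), K classes.
J, K = 5, 3

# x[j]: non-negative features of one document (e.g., word counts), used in eq. (4).
x = np.array([2.0, 0.0, 1.0, 3.0, 0.0])

# P_cond[j, k]: conditional probability of feature j given class k, used in eq. (5).
rng = np.random.default_rng(0)
P_cond = rng.random((J, K))
P_cond /= P_cond.sum(axis=0, keepdims=True)   # each column sums to 1 over features

# Born rule, eq. (6): u_k = (sum_j sqrt(P_{j|k}) * sqrt(x_j))^2.
u = (np.sqrt(P_cond) * np.sqrt(x)[:, None]).sum(axis=0) ** 2

# Normalized class probabilities y_k = u_k / sum_k u_k, and the predicted class.
y = u / u.sum()
print(y, y.argmax())
```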
4.1 Training

To obtain the conditional probability $P_{j|k}$ in (6) we proceed as follows. Given a training set $\{(x^{(n)}, y^{(n)})\}_{n=1,\ldots,N}$, we normalize each feature vector $x^{(n)}$ such that it sums up to 1:

$$z^{(n)}_j = \frac{x^{(n)}_j}{\sum_{j'} x^{(n)}_{j'}}. \qquad (7)$$

Then, we compute the conditional probability $P_{j|k}$ from the (unnormalized) joint probability $P_{jk}$:

$$P_{jk} = \sum_n z^{(n)}_j y^{(n)}_k, \qquad P_{j|k} = \frac{P_{jk}}{\sum_{j'} P_{j'k}}. \qquad (8)$$

4.2 Regularization

We observe that if $P_{j|k}$ is constant for $k = 1, \ldots, K$, then the $j$-th addend increases the summation in (6) by the same value for all classes. As $u_k$ is a monotonic transformation of the summation, the $j$-th addend does not alter the ranking of the probabilities $u_k$, and it is therefore irrelevant for the final classification $k^* = \arg\max_k y_k = \arg\max_k u_k$. To regularize the predictions, we shrink the contribution of irrelevant addends towards zero by re-weighting the summation in (6).

Let us rewrite, for ease of notation, the probabilities $P_{j|k}$ as some weights $W_{jk} \geq 0$. Then, $j$ is irrelevant if $W_{jk}$ is constant for all $k$. We normalize $W_{jk}$ such that the weights of the classes $k$ sum up to 1 for each $j$, that is:

$$W_{k|j} = \frac{W_{jk}}{\sum_{k'} W_{jk'}}. \qquad (9)$$

An irrelevant $j$ maximizes the entropy $H_j = -\sum_k W_{k|j} \ln(W_{k|j})$, as $W_{k|j}$ is uniformly distributed across the classes $k$. Thus, we introduce the following weights, which range between 0 (irrelevant $j$ with maximum entropy) and 1 (relevant $j$ with null entropy):

$$\mathcal{H}_j = 1 - \frac{H_j}{H_{\max}} = 1 + \frac{\sum_k W_{k|j} \ln(W_{k|j})}{\ln(K)}. \qquad (10)$$

Finally, we use $\mathcal{H}_j$ in (10) to re-weight the summation in (6), which becomes:

$$u_k = \Big( \sum_j \mathcal{H}_j \sqrt{P_{j|k}}\, \sqrt{x_j} \Big)^2. \qquad (11)$$

4.3 Generalization

To simplify ablation studies, we generalize (11) as follows:

$$u_k = \Big( \sum_j \mathcal{H}_j^h\, W_{jk}^a\, x_j^a \Big)^{\frac{1}{a}} \quad \text{with} \quad W_{jk} = \frac{P_{jk}}{\big(\sum_{j'} P_{j'k}\big)^{b} \big(\sum_{k'} P_{jk'}\big)^{1-b}}, \qquad (12)$$

where $\mathcal{H}_j$ is given in (9)–(10), $P_{jk}$ is given in (8), and $a > 0$, $b \geq 0$, and $h \geq 0$ are the model hyper-parameters. Here, we are mainly interested in the choice $a = 1/2$, $b = 1$, and $h = 1$, which corresponds to the original model in (11). Another special configuration is $a = 1$, $b = 0$, and $h = 0$, which corresponds to $u_k = \sum_j P_{k|j} x_j$, where $P_{k|j}$ is the conditional probability of $k$ given $j$. This configuration offers a natural benchmark for our quantum approach in that it computes $u_k$ according to classical probability theory.

4.4 Explainability

The contribution of the $j$-th feature to the total probability $u_k$ (local explanation) is given by the addend $\mathcal{H}_j^h W_{jk}^a x_j^a$ in (12). Therefore, the most influential feature for the classification $k^*$ is given by $j^* = \arg\max_j \mathcal{H}_j^h W_{jk^*}^a x_j^a$. In general, we use $\mathcal{H}_j^h W_{jk^*}^a x_j^a$ to rank the features by the degree to which they contribute to the classification $k^*$. The explanation at the class level (global explanation) is obtained by investigating the product $\mathcal{H}_j^h W_{jk}^a$ in (12), regardless of the vector $x$. The globally most influential feature for each class $k$ is given by $j^*_k = \arg\max_j \mathcal{H}_j^h W_{jk}^a$. In general, we use $\mathcal{H}_j^h W_{jk}^a$ to rank the features by their global importance with respect to class $k$.
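To make Sections 4.1–4.4 concrete, the following NumPy sketch puts together the training step (7)–(8), the entropy weights (9)–(10), the generalized prediction (12), and the global explanation. It is a simplified illustration under the notation above, not the authors' reference implementation (which is available in the repository linked in Section 1); the function names are ours, and dense arrays are assumed.

```python
import numpy as np

def fit(X, Y):
    """Training, eqs. (7)-(8). X: (N, J) non-negative features;
    Y: (N, K) one-hot (or probabilistic) labels. Returns the joint P_{jk}."""
    Z = X / X.sum(axis=1, keepdims=True)    # eq. (7): each row sums to 1
    return Z.T @ Y                           # eq. (8): P_{jk} = sum_n z_j^(n) y_k^(n)

def global_weights(P_jk, a=0.5, b=1.0, h=1.0):
    """H_j^h * W_{jk}^a from eq. (12); also the global explanation of Section 4.4.
    Assumes every feature occurs at least once in the training data."""
    # Generalized weights W_{jk}: equal to P_{j|k} when b=1 and to P_{k|j} when b=0.
    W = P_jk / (P_jk.sum(axis=0, keepdims=True) ** b *
                P_jk.sum(axis=1, keepdims=True) ** (1 - b))
    # Entropy regularization, eqs. (9)-(10).
    Wkj = W / W.sum(axis=1, keepdims=True)
    Hj = 1 + (Wkj * np.log(np.where(Wkj > 0, Wkj, 1))).sum(axis=1) / np.log(W.shape[1])
    return (Hj ** h)[:, None] * W ** a

def predict_proba(P_jk, X, a=0.5, b=1.0, h=1.0):
    """Prediction, eq. (12); the defaults a=1/2, b=1, h=1 recover eq. (11)."""
    U = ((X ** a) @ global_weights(P_jk, a, b, h)) ** (1 / a)
    return U / U.sum(axis=1, keepdims=True)
```

For instance, `predict_proba(fit(X_train, Y_train), X_test)` would return the class probabilities of eq. (11) for a batch of test documents, where `X_train`, `Y_train`, and `X_test` are placeholders for the actual data.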
4.5 Computational Complexity

In the training phase, the algorithm in (11), and more generally in (12), learns the joint probability $P_{jk}$ in (8) by multiplying the $J \times N$ matrix of elements $z^{(n)}_j$ with the $N \times K$ matrix of elements $y^{(n)}_k$. Entry $P_{jk}$ is given by the inner product of the $j$-th row of the left matrix (which has $N$ entries) and the $k$-th column of the right matrix (which has $N$ entries), so computing it takes time $O(N)$. We do this once per element. Since the output matrix has dimension $J \times K$, there are $O(JK)$ elements to compute and the total work is done in $O(NJK)$. That is, the training time is at most linear in the number of samples ($N$), in the number of features ($J$), and in the number of classes ($K$).

In the prediction phase, we compute $u_k$ (and $y_k$) in (11), and more generally in (12). As these are all elementwise operations on the $J \times K$ matrix of elements $P_{jk}$, the total work is done in time $O(JK)$. That is, the prediction time is at most linear in the number of features ($J$) and in the number of classes ($K$), and it does not depend on the number of training samples ($N$).

Finally, we notice that the computational complexity can be further improved by using sparse matrices, and all the operations involved in the training and prediction phases can be easily parallelized (e.g., on GPUs). Thus, we expect the method to be highly scalable.

5 Neural Architecture

A major limitation of the algorithm presented in Section 4 is that it can be applied only when $x_j \geq 0$. Here we embed the method in a more flexible architecture that admits $x_j \in \mathbb{C}$. Let us assume that a data instance (e.g., document) can be represented as a superposition of some hidden states $|s\rangle$ for $s = 1, \ldots, S$ (e.g., word embeddings). Then, we write its wave function $|\psi\rangle = \sum_s \psi_s |s\rangle$, where the coefficients $\psi_s$ generally depend on the feature vector $x$ (e.g., words). We represent such coefficients with a neural network $\psi_s = \psi_s(x)$ that maps the feature vector $x \in \mathbb{C}^J$ to the vector of wave coefficients $\psi \in \mathbb{C}^S$. Then, we write the wave function of class $k$ as $|\phi^{(k)}\rangle = \sum_s \phi^{(k)}_s |s\rangle$, where the coefficients $\phi^{(k)}_s$ depend on $k$ and $s$, but not on the feature vector $x$. Finally, we use the Born rule in (2) to compute the probability of $|\psi\rangle$ to collapse into $|\phi^{(k)}\rangle$:

$$u_k = P(\psi \to \phi^{(k)}) = |\langle \phi^{(k)} | \psi \rangle|^2 = \Big| \sum_s \bar{\phi}^{(k)}_s \psi_s(x) \Big|^2. \qquad (13)$$

Equation (13) is read as a neural network $u = \sigma(\Phi v)$, where $\Phi$ is a matrix of elements $\Phi_{ks} = \phi^{(k)}_s$, $v = \psi(x)$ is the output of the previous layer, $\Phi v$ denotes the matrix product $\sum_s \Phi_{ks} v_s$, and the activation function $\sigma(\cdot) = |\cdot|^2$ is the modulus squared. Finally, we apply a normalization layer to obtain the probabilities $y_k = u_k / \sum_{k'} u_{k'} \in [0, 1]$. An illustration is given in Figure 1.

Figure 1: Born Layer (BL) architecture.

Our method is similar to the usual approach where the first part of a network is assumed to learn log-probabilities, and the final classification layer converts them into probabilities by applying the softmax function. Here, the first part of the network is assumed to learn the coefficients of a wave function, and the final classification layer converts them into probabilities by applying the Born rule.

5.1 Initialization

We initialize the weights $\Phi_{ks}$ such that the wave function $|\phi^{(k)}\rangle$ has an equal probability to collapse in any state $|s\rangle$. To this end, let us write the complex-valued weights $\Phi_{ks} = \rho_{ks} e^{i\theta_{ks}}$, where $i$ is the imaginary unit, $\rho_{ks} \geq 0$, and $\theta_{ks} \in [0, 2\pi)$. We recall from (3) that the probability of $|\phi^{(k)}\rangle$ to collapse in $|s\rangle$ is $P(\phi^{(k)} \to s) = |\phi^{(k)}_s|^2 = |\Phi_{ks}|^2 = \rho^2_{ks}$. We set $\rho^2_{ks} = \rho^2$ so that the (unnormalized) probability is constant for all $k$ and $s$. To normalize the probability, we set $S\rho^2 = 1$, where $S$ is the number of states (the input dimension of the layer). We obtain $\rho = 1/\sqrt{S}$, which resembles the initialization in [9] and [10]. Finally, we sample $\theta_{ks}$ from a uniform distribution in the interval $[0, 2\pi)$, such that the weights $\Phi_{ks}$ are uniformly distributed on the complex circle (isotropy):

$$\Phi_{ks} = \frac{e^{i\theta_{ks}}}{\sqrt{S}} \quad \text{with} \quad \theta_{ks} \sim U(0, 2\pi). \qquad (14)$$
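A minimal PyTorch sketch of such a layer could look as follows. This is a sketch rather than the authors' implementation: the class name `BornLayer` simply mirrors the paper's terminology, the input $v = \psi(x)$ is assumed to be real-valued, and the complex weights $\Phi$ are stored as separate real and imaginary parts so that the sketch does not depend on complex-tensor support in the optimizer.

```python
import math
import torch
import torch.nn as nn

class BornLayer(nn.Module):
    """Classification layer of eq. (13) with the initialization of eq. (14)."""

    def __init__(self, in_states: int, out_classes: int):
        super().__init__()
        # Eq. (14): Phi_ks = exp(i * theta_ks) / sqrt(S) with theta_ks ~ U(0, 2*pi).
        theta = 2 * math.pi * torch.rand(out_classes, in_states)
        self.re = nn.Parameter(torch.cos(theta) / math.sqrt(in_states))
        self.im = nn.Parameter(torch.sin(theta) / math.sqrt(in_states))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # Born rule, eq. (13): u_k = |sum_s Phi_ks v_s|^2, for real-valued v.
        u = (v @ self.re.T) ** 2 + (v @ self.im.T) ** 2
        # Normalization layer: y_k = u_k / sum_k u_k.
        return u / u.sum(dim=-1, keepdim=True)
```

In a larger network, `v` would be the output of the layers implementing $\psi(x)$; for the special case $\psi_s(x) = x_s$ discussed next, `v` is simply the feature vector $x$ itself.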
When the feature vector $x$ is an (unnormalized) probability vector as in Section 4, then (11) can be written as the neural network in (13) with $S = J$ and $\psi_s(x) = x_s$. In this case, it is interesting to initialize the weights in (13) with the corresponding weights developed in (11), that is $\Phi_{ks} = \mathcal{H}_s \sqrt{P_{s|k}}$.⁴

⁴More precisely, we scale the weights by dividing them by their mean and by the square root of the number of features to mimic (14). This does not alter the model, which is invariant under scaling (see Section 5.2), but it helps to prevent vanishing or exploding gradients during backpropagation.

5.2 Explainability

We notice that the probabilities $y_k$ are invariant under scaling and rotation of the coefficients $\phi^{(k)}_s$ in (13). To show this, we multiply $\phi^{(k)}_s$ by a scaling factor $\rho$ and a phase factor $e^{i\theta}$, that is, we substitute $\phi^{(k)}_s \to \rho e^{i\theta} \phi^{(k)}_s$ in (13). The phase factor $e^{i\theta}$ vanishes when computing the modulus, and the scaling factor $\rho$ vanishes when the probabilities are normalized. The invariance under scaling and rotation implies that the weights $\Phi_{ks}$ have no absolute meaning, and they become meaningful only in relation to each other. To inspect the relations among states $s$ for a given class $k$, it is interesting to visualize the weights $\Phi_{ks}$ in the complex plane (with unlabelled axes). Here, the length of the vectors can be used to rank the states by importance, while the direction of the vectors produces constructive or destructive interference among the states (see Figure 5 for an illustration). For local explanations, $\Phi_{ks} v_s$ is used instead of $\Phi_{ks}$.

6 Empirical Results

We illustrate several aspects of our classifier using three well-established text classification benchmarks: 20Newsgroup⁵, and the R8 and R52 subsets of Reuters-21578⁶. We perform tokenization using the function nltk.word_tokenize⁷ and vectorize the text with TfidfVectorizer⁸. No other text transformation or cleaning procedure is performed (a code sketch of this pipeline is given at the end of Section 6.1). The final datasets are composed of (20Newsgroup) 20 classes, 204,817 words, 11,314 training documents, and 7,532 test documents; (R8) 8 classes, 33,593 words, 5,485 training documents, and 2,189 test documents; (R52) 52 classes, 38,132 words, 6,532 training documents, and 2,568 test documents. All the results are obtained using Python 3.9 on a Google Cloud Virtual Machine equipped with CentOS 7, 12 vCPU Intel Cascade Lake, 85 GB RAM, 1 NVIDIA Tesla A100 GPU, and CUDA 11.5.

⁵http://qwone.com/~jason/20Newsgroups
⁶http://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld
⁷See https://www.nltk.org/book/
⁸See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

6.1 Training time, prediction time, and accuracy score

We compare our methodology against a baseline of six classifiers on the 20Newsgroup dataset. With Born Classifier (BC), we refer to the algorithm presented in Section 4, where weights are computed as in equation (8) and classification probabilities are calculated as in equation (11). The baseline is composed of the algorithms Decision Tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), and Logistic Regression (LR). For all the algorithms in the baseline, we use the corresponding implementation in scikit-learn. All the classifiers are executed on CPU with default parameters.

Figure 2: From left to right: training time, prediction time, and accuracy score on the 20Newsgroup dataset (y-axis) for several classifiers, as a function of the fraction of data used for training (x-axis).

The comparison between BC and the baseline is reported in Figure 2, where we show the training time, prediction time, and accuracy score as a function of the fraction of data used for training in 10 independent executions. Figure 2 shows that BC is fast to train, is fast to predict, can be accelerated on GPU, and achieves the highest accuracy regardless of the size of the dataset. Moreover, as the amount of training data decreases, the accuracy gap between BC and the other classifiers widens.
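For reference, this pipeline could be reproduced along the following lines. The sketch assumes the scikit-learn-compatible `BornClassifier` exposed by the authors' bornrule package (see the repository linked in Section 1 for the actual interface), uses scikit-learn's `fetch_20newsgroups` as a convenient stand-in for the archive in footnote 5, and downloads the `punkt` models required by `nltk.word_tokenize`.

```python
import nltk
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

from bornrule import BornClassifier  # authors' package, assumed to follow the scikit-learn API

nltk.download("punkt")

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Tokenize with nltk.word_tokenize and vectorize with TfidfVectorizer,
# with no further cleaning, as described at the beginning of Section 6.
vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# BC: weights from eq. (8), probabilities from eq. (11); no hyper-parameters to tune.
clf = BornClassifier().fit(X_train, train.target)
print(accuracy_score(test.target, clf.predict(X_test)))
```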
6.2 Imbalanced data

While 20Newsgroup is almost balanced in terms of documents per class, R8 and R52 are not. In particular, the most frequent class in R52 contains 2,840 training samples, while the least frequent class contains only 1 document in the training set. Figure 3 reports the F1-macro score for BC and the baseline on the three datasets, which are increasingly imbalanced. BC outperforms the baseline models, and the performance gap widens for more imbalanced data. The native capability of BC to work with imbalanced data can be traced back to (8), which computes the conditional probability of the features given the classes: dividing the joint probability by the marginal effectively normalizes by the class imbalance.

Figure 3: F1-macro score (y-axis) on 20Newsgroup, R8, and R52, for several classifiers, as a function of the fraction of data used for training (x-axis).

6.3 Hyper-parameters and ablation study

We now compare the overall accuracy score and runtime of BC with the results of the fine-tuned baseline on 20Newsgroup (see Table 1). We then conduct an ablation study on BC (see Figure 4).

| Model | Accuracy (%) | Runtime (s) |
|-------|--------------|-------------|
| DT | 53.9 | 5,583.156 |
| KNN | 55.6 | 640.574 |
| RF | 77.5 | 49,686.936 |
| SVM | 79.4 | 45,639.071 |
| LR | 82.9 | 6,066.966 |
| MNB | 84.1 | 15.320 |
| BC (11) | 87.3 | 0.043 |

Table 1: Accuracy score and runtime for several classifiers on the 20Newsgroup dataset. The runtime is the (CPU) time to optimize the model's hyper-parameters by 5-fold cross-validated grid-search on the training set, plus the time to refit the selected model. The accuracy score is the accuracy achieved by the model on the test set.

Figure 4: Ablation study on the 20Newsgroup dataset. The figure displays the test accuracy scores of the model in (12) for several values of the hyper-parameters $a$, $b$, and $h$. The largest point identifies the configuration $a = 1/2$, $b = 1$, $h = 1$, which corresponds to the original model in (11).

In Table 1, we tune the baseline classifiers via grid-search using 5-fold cross-validation. Except for BC, which needs no tuning, all the other classifiers use between 20 and 50 combinations of hyper-parameters (reported in the replication code). BC still achieves the highest accuracy by a remarkable margin. Furthermore, the overall time required to classify the dataset is several orders of magnitude smaller than that required by the baseline algorithms. This is partly due to the speed of execution of BC (see Figure 2) and partly due to the fact that BC has no hyper-parameters to tune.

The ablation study is reported in Figure 4. Here we display the test accuracy scores of the model corresponding to equation (12) for several values of the hyper-parameters $a$, $b$, and $h$. The configuration $a = 1/2$, $b = 1$, $h = 1$ corresponds to BC in equation (11), and it achieves an accuracy of 87.3%. Selecting the hyper-parameters via cross-validation also achieves a test accuracy of 87.3%, suggesting that our prior generalizes well.
The classical configuration $a = 1$, $b = 0$, $h = 0$, discussed in Section 4.3, achieves an accuracy of 83.3%. Removing the regularization ($h = 0$) from our model leads to a large drop in performance: from 87.3% to 77.7%. On the other hand, adding the regularization ($h = 1$) to the classical configuration does not lead to any performance gain: from 83.3% to 82.0%. These results can be understood by recalling that the classical configuration uses the conditional probability of the classes given the features ($P_{k|j}$), while BC uses the conditional probability of the features given the classes ($P_{j|k}$). As $P_{j|k}$ gives larger weight to more frequent features, BC is dominated by noise when the features that are irrelevant for classification are also the most frequent (as is typical in text), so regularization becomes important to reduce the influence of frequent features that are only slightly skewed. In the classical configuration, instead, regularization is redundant, as $P_{k|j}$ already limits the influence of such features.

6.4 Embedding in a neural network architecture

With Born Layer (BL), we refer to the neural architecture displayed in Figure 1, which takes as input $\psi(x)$ and returns the classification probabilities according to equation (13). Here we choose $\psi_s(x) = x_s$ so that the transformations in equations (11) and (13) are the same and can be directly compared (in general, $\psi(x)$ may be an arbitrary transformation to exploit, e.g., semantics, as discussed in Section 7). BL is initialized as in (14) and is optimized on the L1 loss using Adam [12] with default parameters. We also initialize BL with the weights computed by BC, as discussed at the end of Section 5.1. We refer to this version as BC+BL (a sketch of this training setup is given after Table 2).

We stress-test BC and BL (which only rely on word counts) against a variety of deep learning approaches reported in the literature (which exploit more sophisticated representations of the document). In Table 2, we include cooperative neural networks (CoNN) [20], models that learn distributed representations of entities and documents from a knowledge base (TextEnt) [25], graph convolutional networks (TextGCN) [26], the neural attentive bag-of-entities model (NABoE) [24], and the diversified ensemble neural network (DEns) [29].

We find that BC achieves state-of-the-art accuracy on the 20Newsgroup dataset, and it is up to a million times faster than alternative approaches (0.001 s). This confirms that the weights computed explicitly in (11) are a good prior. BL also matches state-of-the-art performance, but it requires significantly longer computational times (about 34 s). Initializing BL with BC requires only a few training epochs to further improve performance, suggesting that BC+BL can be fine-tuned relatively fast (about 3 s). Finally, we notice that, while BL optimizes complex-valued weights, BC+BL uses real-valued weights. This suggests that the reason for the good performance lies in the transformation (13) itself, rather than in the usage of complex numbers.

| Method | Accuracy 20NG (%) | Accuracy R8 (%) | Accuracy R52 (%) | Runtime 20NG (s) | Runtime R8 (s) | Runtime R52 (s) |
|--------|-------------------|-----------------|------------------|------------------|----------------|-----------------|
| CoNN [20] | 83.7 | N/A | N/A | 120.000 | N/A | N/A |
| TextEnt [25] | 84.5 | 96.7 | N/A | 923.089 | 556.020 | N/A |
| TextGCN [26] | 86.3 | 97.1 | 93.6 | 1206.372 | 109.184 | 186.531 |
| NABoE [24] | 86.8 | 97.1 | N/A | 152.154 | 24.110 | N/A |
| DEns [29] | 87.1 | 97.7 | 94.3 | N/A | N/A | N/A |
| BC (11) | 87.3 | 95.4 | 88.0 | 0.001 | 0.001 | 0.001 |
| BL (1 epoch) | 84.6 | 96.5 | 87.9 | 0.347 | 0.276 | 0.274 |
| BL (10 epochs) | 86.2 | 96.8 | 92.6 | 3.451 | 2.747 | 2.723 |
| BL (100 epochs) | 87.1 | 97.1 | 92.7 | 34.461 | 27.452 | 27.171 |
| BC+BL (1 epoch) | 86.9 | 97.5 | 91.8 | 0.348 | 0.278 | 0.276 |
| BC+BL (10 epochs) | 87.4 | 97.7 | 95.2 | 3.458 | 2.764 | 2.724 |
| BC+BL (100 epochs) | 87.4 | 97.2 | 94.4 | 34.521 | 27.494 | 27.124 |

Table 2: Accuracy score and runtime for several methods on the 20Newsgroup, R8, and R52 datasets. The runtime is the (GPU) time to train the model. The accuracy score is the average score achieved by the model on the test set in 10 independent runs. Standard deviations are below 0.1% and are omitted from the table. The runtime for CoNN is taken from the original paper [20]. The runtimes for the other methods are obtained by running the corresponding code on our hardware infrastructure.
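As a rough reference for the setup just described ($\psi_s(x) = x_s$, L1 loss, Adam with default parameters), a BL training loop could be sketched as follows; the data tensors are random placeholders with illustrative sizes, and the complex weights are stored as real and imaginary parts as in the sketch of Section 5.

```python
import math
import torch

N, J, K = 256, 1000, 20                           # illustrative sizes only
X = torch.rand(N, J)                              # stand-in for the tf-idf features
Y = torch.eye(K)[torch.randint(K, (N,))]          # stand-in for the one-hot labels

# Complex weights Phi of shape (K, J), initialized as in eq. (14).
theta = 2 * math.pi * torch.rand(K, J)
re = torch.nn.Parameter(torch.cos(theta) / math.sqrt(J))
im = torch.nn.Parameter(torch.sin(theta) / math.sqrt(J))

optimizer = torch.optim.Adam([re, im])            # default parameters
for epoch in range(10):
    u = (X @ re.T) ** 2 + (X @ im.T) ** 2         # Born rule, eq. (13)
    y = u / u.sum(dim=1, keepdim=True)            # normalization layer
    loss = torch.nn.functional.l1_loss(y, Y)      # L1 loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

For BC+BL, one would instead initialize `re` with the (rescaled) BC weights $\mathcal{H}_s \sqrt{P_{s|k}}$ and `im` with zeros, as discussed at the end of Section 5.1.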
6.5 Explanation

To exemplify the explanation provided by BC, we extract the global weights described in Section 4.4 and report the 10 most influential words (out of more than 200,000) for a subset of classes in 20Newsgroup. Table 3 shows that the words do not contain stopwords, punctuation, or terms that are too general to be representative of a specific class. The ability of BC to filter out noise can be traced back to the regularization in Section 4.2, which shrinks to zero the weights of words with little discriminatory power.

| # | Baseball | Hockey | Autos | Graphics | Macintosh | Windows | Cryptography |
|---|----------|--------|-------|----------|-----------|---------|--------------|
| 1 | Phillies | NHL | car | polygon | Centris | AX | encryption |
| 2 | Braves | hockey | cars | TIFF | Quadra | Windows | Clipper |
| 3 | pitching | Leafs | eliot | graphics | Apple | 3.1 | clipper |
| 4 | Alomar | team | SHO | 3D | Mac | windows | crypto |
| 5 | Baseball | Devils | automotive | 3DO | Duo | W4WG | NSA |
| 6 | Players | ESPN | Callison | CView | LCIII | cica | escrow |
| 7 | Mets | Wings | Dumbest | POV | LC | font | key |
| 8 | Sox | Pens | rmt6r | cview | C650 | BJ-200 | DES |
| 9 | Cubs | playoffs | Thigpen | tdawson | BMUG | NDIS | Amanda |
| 10 | baseball | playoff | Toyota | MPEG | IIsi | Win | wiretap |

Table 3: Global explanation on 20Newsgroup as described in Section 4.4. The table shows the top 10 features (words) for the classes: Baseball (rec.sport.baseball), Hockey (rec.sport.hockey), Autos (rec.autos), Graphics (comp.graphics), Macintosh (comp.sys.mac.hardware), Windows (comp.os.ms-windows.misc), and Cryptography (sci.crypt).

In Figure 5, we train BL for two epochs and represent the evolution of the complex-valued weights described in Section 5.2. The figure shows the 10 weights with the largest (and smallest) modulus for the class rec.sport.baseball at the end of the training. Starting from weights that are initialized randomly in all directions of the complex circle, BL learns to give larger weights to more relevant features, smaller weights to less relevant features, and to orient the features in different directions. The weights converge in the same direction for words that are specifically related to the concept of baseball. The words hockey and NHL (National Hockey League), which are still relevant for a different class of sports, are associated with a large weight but are also oriented in the opposite direction with respect to the baseball-related words.

Figure 5: The figure shows the 10 most influential features (solid lines) and the 10 least influential features (dotted lines) for the class rec.sport.baseball in the 20Newsgroup dataset, as described in Section 5.2. The evolution of the (complex-valued) weights is represented from left to right.
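A table like Table 3 can be assembled by sorting each class's column of the global importance matrix $\mathcal{H}_j^h W_{jk}^a$ from Section 4.4. The sketch below shows only this ranking step, with random placeholders standing in for the fitted matrix and the vectorizer vocabulary.

```python
import numpy as np

def top_words(G, vocabulary, n=10):
    """Return, for each class k, the n words with the largest global weight G[j, k]."""
    order = np.argsort(-G, axis=0)[:n]                 # (n, K) indices of the top weights
    return [[vocabulary[j] for j in order[:, k]] for k in range(G.shape[1])]

# Illustrative usage with placeholders (in practice, G would be H_j^h * W_{jk}^a
# and `vocabulary` the feature names of the fitted TfidfVectorizer).
rng = np.random.default_rng(0)
vocabulary = np.array([f"word{j}" for j in range(1000)])
G = rng.random((1000, 7))                              # 7 classes, as in Table 3
for k, words in enumerate(top_words(G, vocabulary)):
    print(k, words)
```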
7 Conclusion

By regarding text as a superposition of words, we have introduced a supervised classification algorithm based on a key postulate of quantum mechanics, namely the Born rule. The classifier is self-explainable, admits a closed-form expression (11), and can be embedded in neural network architectures (13). Our method outperforms traditional algorithms on the 20Newsgroup, R8, and R52 datasets, in terms of classification performance and computational time. These results suggest that physical principles can be successfully exploited in machine learning and may open a new class of classification algorithms.

There are several potential improvements and extensions to this work. First, our classifier is derived from the Born rule, and the explanation for its effectiveness from a machine learning point of view remains an open question. Improving the empirical evaluation with thorough experiments would make it possible to assess the limitations of the method and to describe its benefits and pitfalls in various applications and data types. Second, whenever a data instance can be represented with a probability distribution over some space, a wave function can be constructed explicitly and the data may be classified with the algorithm in Section 4. More generally, an arbitrary data instance can be classified with the network in Section 5, where the first part of the network learns the coefficients of a wave function and the final classification layer converts them into probabilities by applying the Born rule. Using the transformation in (13) on top of existing architectures offers a natural way to push our framework beyond a bag-of-words (superposition-of-words) representation of the document, to encode semantics, and to extend the method to non-textual data. For instance, an image may be classified with the network in (13) where $\psi(x)$ is a convolutional layer; this corresponds to regarding the image as a superposition of the features learnt by the convolution. Finally, an interesting research direction would be constructing deep networks that apply the transformation in (13) repeatedly. A deep architecture of this kind would find its biological foundations in the quantum brain hypothesis [11], which suggests that quantum events could play a non-trivial role in neuronal cells and contribute to the extremely high complexity, variability, and computational power of neuronal dynamics.

Acknowledgments

This work was partially supported by the Google Cloud Research Credits program with the award GCP19980904.

References

[1] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.

[2] Ivano Basile and Fabio Tamburini. Towards quantum language models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1840–1849, 2017.

[3] Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549(7671):195–202, 2017.

[4] Max Born. Quantenmechanik der Stoßvorgänge. Zeitschrift für Physik, 38(11):803–827, 1926.
[5] Yiwei Chen, Yu Pan, and Daoyi Dong. Quantum language model with entanglement embedding for question answering. IEEE Transactions on Cybernetics, 2021.

[6] Carlo Ciliberto, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig. Quantum machine learning: A classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2209):20170551, 2018.

[7] Paul Adrien Maurice Dirac. A new notation for quantum mechanics. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 35, pages 416–418. Cambridge University Press, 1939.

[8] Filip Karlo Došilović, Mario Brčić, and Nikica Hlupić. Explainable artificial intelligence: A survey. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics, pages 0210–0215, 2018.

[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[11] Peter Jedlicka. Revisiting the quantum brain hypothesis: Toward quantum (neuro)biology? Frontiers in Molecular Neuroscience, 10:366, 2017.

[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Daniel Kleppner and Roman Jackiw. One hundred years of quantum physics. Science, 289(5481):893–898, 2000.

[14] Nicolaas P Landsman. Born rule and its interpretation. In Compendium of Quantum Physics, pages 64–70. Springer, 2009.

[15] Qiuchi Li, Dimitris Gkoumas, Alessandro Sordoni, Jian-Yun Nie, and Massimo Melucci. Quantum-inspired neural network for conversational emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13270–13278, 2021.

[16] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019.

[17] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830, 2011.

[18] Carlo Rovelli. Relational quantum mechanics. International Journal of Theoretical Physics, 35(8):1637–1678, 1996.

[19] Carlo Rovelli. Helgoland. Prometheus, 2021.

[20] Harsh Shrivastava, Eugene Bart, Bob Price, Hanjun Dai, Bo Dai, and Srinivas Aluru. Cooperative neural networks: Exploiting prior independence structure for improved classification. In Advances in Neural Information Processing Systems, volume 31, pages 4126–4136, 2018.

[21] Alessandro Sordoni, Jian-Yun Nie, and Yoshua Bengio. Modeling term dependencies with quantum language models for IR. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 653–662, 2013.
[22] John D Trimmer. The present situation in quantum mechanics: A translation of Schrödinger's "cat paradox" paper. Proceedings of the American Philosophical Society, pages 323–338, 1980.

[23] David H Wolpert and William G Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.

[24] Ikuya Yamada and Hiroyuki Shindo. Neural attentive bag-of-entities model for text classification. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 563–573, 2019.

[25] Ikuya Yamada, Hiroyuki Shindo, and Yoshiyasu Takefuji. Representation learning of entities and documents from knowledge base descriptions. In Proceedings of the 27th International Conference on Computational Linguistics, pages 190–201, 2018.

[26] Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377, 2019.

[27] Peng Zhang, Jiabin Niu, Zhan Su, Benyou Wang, Liqun Ma, and Dawei Song. End-to-end quantum-like language models with application to question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

[28] Peng Zhang, Zhan Su, Lipeng Zhang, Benyou Wang, and Dawei Song. A quantum many-body wave function inspired language modeling approach. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1303–1312, 2018.

[29] Shaofeng Zhang, Meng Liu, and Junchi Yan. The diversified ensemble neural network. In Advances in Neural Information Processing Systems, volume 33, pages 16001–16011, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 7.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] We don't see any potential negative societal impact other than those that apply to ML in general.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code to reproduce the results is provided in the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Datasets (20NG, R8, R52) and software (scikit-learn, PyTorch) are cited properly.
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]