# Interpretable Tensor Fusion

Saurabh Varshneya¹, Antoine Ledent², Philipp Liznerski¹, Andriy Balinskyy¹, Purvanshi Mehta³, Waleed Mustafa¹ and Marius Kloft¹
¹RPTU Kaiserslautern-Landau, ²Singapore Management University, ³Lica World
{varshneya, liznerski, balinskyy, mustafa, kloft}@cs.uni-kl.de, purvanshi@lica.world, aledent@smu.edu.sg

Conventional machine learning methods are predominantly designed to predict outcomes based on a single data type. However, practical applications may encompass data of diverse types, such as text, images, and audio. We introduce interpretable tensor fusion (InTense), a multimodal learning method for training neural networks to simultaneously learn multimodal data representations and their interpretable fusion. InTense can separately capture both linear combinations and multiplicative interactions of diverse data types, thereby disentangling higher-order interactions from the individual effects of each modality. InTense provides interpretability out of the box by assigning relevance scores to modalities and their associations. The approach is theoretically grounded and yields meaningful relevance scores on multiple synthetic and real-world datasets. Experiments on six real-world datasets show that InTense outperforms existing state-of-the-art interpretable multimodal approaches in terms of accuracy and interpretability.

Full paper with technical appendix and code is available at: https://arxiv.org/abs/2405.04671

1 Introduction

The vast majority of machine learning systems are designed to predict outcomes based on a single data type, or modality. However, in various applications, spanning fields from biology and medicine to engineering and multimedia, multiple modalities are frequently in play [He et al., 2020; Lunghi et al., 2019]. The main challenge in multimodal learning is how to effectively fuse these diverse modalities. The most common approach is to combine the modalities in an additive way [Poria et al., 2015; Chagas et al., 2020]. Such linear combinations suffice in some cases. However, numerous applications necessitate capturing non-linear interactions between modalities. One such instance is sarcasm detection, described in Figure 1 [Hessel and Lee, 2020]. The arguably most popular approach to capture non-linear interactions of modalities is tensor fusion [Zadeh et al., 2017; Tsai et al., 2019; Liang et al., 2021]. The main idea in tensor fusion is to concatenate modalities via tensor products in a neural network.

Figure 1: Left is an excerpt of the MUStARD dataset on sarcasm detection ("Oh my god! You almost gave me a heart attack"), where the proposed InTense method sets a new state-of-the-art. (See Section 4 for details.) A linear combination of modalities fails here because the expressions of happiness and anxiety combine to something neutral rather than sarcasm. To detect sarcasm, the interactions among modalities are crucial. InTense captures these interactions and assigns them interpretable relevance scores, shown in the pie chart. Scores for individual modalities and their interactions are colored green and blue, respectively. InTense reveals that interactions are crucial for successful sarcasm detection.
A substantial drawback of tensor fusion is its inherent lack of interpretability, which can significantly hinder its application in real-world scenarios. Interpretable multimodal models may reveal the relative importance of modalities [Büchel et al., 1998; Hessel and Lee, 2020], unveiling spurious modalities and social biases in the data. Identifying interactions among modalities is the main goal in several application domains. For instance, in statistical genetics, it is crucial to identify the interactions among Single Nucleotide Polymorphisms (SNPs) that contribute to the inheritance of a disease [Behravan et al., 2018; Elgart et al., 2022]. Although some interpretable multimodal methods exist, they are limited to linear combinations or require resource-intensive post hoc algorithms for interpretation.

In this paper, we introduce interpretable tensor fusion (InTense), which jointly learns multimodal neural representations and their interpretable fusion. InTense provides out-of-the-box interpretability by assigning relevance scores to all modalities and their interactions. Our approach is inspired by multiple kernel learning [Kloft et al., 2011], a classic kernel-based approach to interpretable multimodal learning, which we generalize to deep neural networks and term Multiple Neural Learning (MNL). While both MNL and InTense provide relevance scores for the modalities, InTense additionally produces scores for the interactions of modalities. These interaction scores are made possible through a novel interpretation of neural network weight matrices: we show that neural networks tend to favor higher-order tensor products, leading to spurious interpretations (i.e., overstating high-order interactions between modalities). We resolve this issue by deriving a theoretically well-founded normalization approach. In our theoretical analysis, we prove that this normalization produces genuine relevance scores, avoiding spurious interpretations. In extensive experiments, we empirically validate the relevance scores and show that InTense outperforms existing state-of-the-art interpretable multimodal approaches in terms of accuracy and interpretability.

In summary, our contributions are:
- We introduce Multiple Neural Learning (MNL), a theoretically guided adaptation of the established Multiple Kernel Learning algorithm to deep learning.
- We introduce InTense, an extension of MNL and tensor fusion designed to capture non-linear interactions among modalities in an interpretable manner.
- We provide a rigorous theoretical analysis that gives evidence of the correct disentanglement within our fusion framework.
- We validate our approach through extensive experiments, where we match state-of-the-art classification accuracy while providing robust interpretability.

2 Related Work

We now review existing multimodal learning methods that produce interpretability scores for the modalities.

Interpretable Methods for Learning Linear Combinations of Modalities. The vast majority of interpretable multimodal learning methods consider linear combinations of modalities. The arguably most popular instance is Multiple Kernel Learning (MKL), where kernels from different modalities are combined linearly. Here, a weight is learned for each kernel, determining its importance in the resulting linear combination of kernels [Kloft et al., 2011; Rakotomamonjy et al., 2008].
However, the performance of MKL is limited by the quality of the kernels. Finding adequate kernels can be especially problematic for structured high-dimensional data, such as text or images. Addressing this, several authors have studied combining multiple modalities using neural networks in a linear manner [Poria et al., 2015; Chen et al., 2014; Arabacı et al., 2021]. However, in these approaches the representations are learned independently to form base kernels and are later combined in a second step through an SVM or another shallow learning method. Such independently learned representations cannot properly capture modality interactions.

Methods for Learning Non-linear Combinations of Modalities. Hessel and Lee [2020] map neural representations to a space defined by a linear combination of the modalities. While they quantify the overall importance of non-linear interactions, they do not provide scores for individual modality interactions. Tsai et al. [2020] introduce multimodal routing, which is based on dynamic routing [Sabour et al., 2017], to calculate scores for the modality interactions. These scores depend on the similarity of a modality's representation to so-called concept vectors, where one such vector is defined for each label. However, routing does not distinguish between linear and non-linear combinations and is thus misled by partially redundant information in the combinations. Indeed, we show through experiments (see Section 4) that the non-linear combinations learned by routing are incorrectly overestimated. Gat et al. [2021] propose a method to obtain modality relevances by computing differences of accuracies on a test set and a permuted test set. However, this method has limited interpretability and requires multiple forward passes through the trained network to obtain relevance scores. Wörtwein et al. [2022] learn aggregated representations for unimodal, bimodal, and trimodal interactions, respectively. However, their method does not learn fine-grained relevance scores for the various combinations of modalities. Alongside methods offering limited interpretability, there exist methods that non-linearly combine modalities without adding any interpretability [Zhang et al., 2023; Liang et al., 2021; Tan and Bansal, 2019]. In summary, none of these methods learns proper relevance scores for interactions between modalities.

Post-hoc Explanation Methods. There exist several methods for post-hoc explanation of multimodal learning methods [Gat et al., 2021; Chandrasekaran et al., 2018; Park et al., 2018; Kanehira et al., 2019; Cao et al., 2020; Frank et al., 2021]. These methods consist of two steps: first, a multimodal model that is not inherently interpretable is trained; then relevance scores are calculated in hindsight. However, their two-step nature makes these methods challenging to analyze theoretically. Moreover, since the initial model disregards interpretability, it may lead to inherent limitations in the explanatory process. Additionally, these methods come with the added computational burden of producing relevance scores. Another limitation is their applicability, which is confined to specific types of modalities.

3 Methodology

In the following sections, we introduce the several components comprising our approach. First, we review the classical Lp-norm Multiple Kernel Learning (MKL) framework [Kloft et al., 2011], which we extend to Multiple Neural Learning. Subsequently, we propose Interpretable Tensor Fusion (InTense), which captures non-linear modality interactions.
Furthermore, we show how InTense learns disentangled neural representations, thereby computing correct relevance scores.

3.1 Preliminaries

We consider a dataset $\{(x_i, y_i)\}_{i=1}^n$ with labels $y_i \in \{-1, 1\}$. The inputs have $M$ modalities, where $x_i^m \in \mathcal{X}^m$ for $m \in \{1, \dots, M\}$ denotes the $m$th modality of the datapoint $x_i$, and $\mathcal{X}^m$ is the input space associated with modality $m$. In MKL, one considers kernel mixtures of the form $k(u, v) = \sum_{m=1}^M \beta_m k_m(u, v)$, where $k_m(u, v)$ is a base kernel and $\beta_m \geq 0$ for all $m$. Imposing an $L_p$-norm constraint on the vector $\beta \in \mathbb{R}^M$ gives rise to the following classic optimization problem:

$$\min_{\substack{w_1, \dots, w_M,\, b,\, \beta \in \mathbb{R}^M \\ \beta \geq 0,\ \|\beta\|_p \leq 1}} \ \sum_{i=1}^n \ell\Big(\sum_{m=1}^M \beta_m \big\langle w_m, \Psi_m(x_i^m)\big\rangle_{\mathcal{H}_m} + b,\ y_i\Big) + \lambda \sum_{m=1}^M \|w_m\|^2_{\mathcal{H}_m}, \tag{1}$$

where $\ell$ is a loss function and $\Psi_m : \mathcal{X}^m \to \mathcal{H}_m$ are feature maps from the input space $\mathcal{X}^m$ to the Hilbert space $\mathcal{H}_m$ associated with kernel $k_m$, such that for each $m \in \{1, \dots, M\}$ and $u, v \in \mathcal{X}^m$, $k_m(u, v) = \langle \Psi_m(u), \Psi_m(v)\rangle_{\mathcal{H}_m}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}_m}$ denotes the inner product associated with the Hilbert space $\mathcal{H}_m$. The base kernels $k_m$ are assumed to be fixed functions. This is in sharp contrast to Multiple Neural Learning (MNL), introduced in the next section, where each feature map $\Psi_m(x)$ is learned from the data.

3.2 Multiple Neural Learning

In this section, we propose Multiple Neural Learning (MNL), an interpretable method for the linear combination of modalities. In MNL, we train a neural network composed of two components: 1) modality subnetworks that output a neural representation for each modality and 2) a linear fusion layer that combines the representations in an interpretable manner. We define the optimization problem as:

$$\min_{\substack{w_L^1, \dots, w_L^M,\, \beta,\, W^1, \dots, W^M \\ \beta \in \mathbb{R}^M,\ \beta \geq 0,\ \|\beta\|_p \leq 1}} \ \sum_{i=1}^n \ell\Big(\sum_{m=1}^M \beta_m \big\langle w_L^m, f^m(x_i^m)\big\rangle + b,\ y_i\Big) + \Lambda \sum_{m=1}^M \sum_{l=1}^{L} \|w_l^m\|_2^2, \tag{2}$$

where $f^m$ is the $m$th modality's subnetwork, composed of $L-1$ layers with weights $W^m = \{w_1^m, \dots, w_{L-1}^m\}$. A representation of the $i$th datapoint's $m$th modality is obtained as $f^m(x_i^m)$. The fusion layer $L$ with weights $w_L^1, \dots, w_L^M$ learns a linear combination of the modality representations. $\Lambda$ and $p$ ($1 \leq p < \infty$) are hyperparameters, and $\ell(t, y) = -\log\frac{\exp(ty)}{1 + \exp(ty)}$ is the cross-entropy loss function. This setup can also be seen as additive fusion because it represents a linear combination of the modalities with weights $\beta_m$. Notably, $\beta_m$ is a positive weight for the $m$th modality, indicating its relevance score. The vector $\beta$ is optimized simultaneously with the network weights. However, the constraints on $\beta$ make equation 2 more difficult to optimize. The following theorem presents a simplified optimization problem that eliminates $\beta$ from equation 2, along with a method to retrieve $\beta$ from the learned weights.

Theorem 1. The optimization problem in equation 2 is equivalent to the following problem, in which the parameters $\beta$ are no longer present:

$$\min_{\substack{w_L^1, \dots, w_L^M, \\ W^1, \dots, W^M}} \ \sum_{i=1}^n \ell\Big(\sum_{m=1}^M \big\langle w_L^m, f^m(x_i^m)\big\rangle + b,\ y_i\Big) + \Lambda \sum_{m=1}^M \sum_{l=1}^{L-1} \|w_l^m\|_2^2 + \Lambda \Big(\sum_{m=1}^M \|w_L^m\|_2^q\Big)^{2/q}, \tag{3}$$

where $q = \frac{2p}{p+1}$ (and therefore $1 \leq q \leq 2$). The corresponding relevance scores $\beta$ can be recovered after the optimization as:

$$\beta_m = \frac{\|w_L^m\|_2^{2/(p+1)}}{\big(\sum_{m'=1}^M \|w_L^{m'}\|_2^{2p/(p+1)}\big)^{1/p}}. \tag{4}$$

The theorem states that the relevance of a modality in our jointly trained network can be obtained by applying a suitable $p$-norm over the weights of the fusion layer $L$. A detailed proof of the theorem can be found in Appendix A. The central idea is to observe that the parameters $\beta$ can be absorbed into the weights $w_L^m$, pushing $\beta$ into the regularization term. Subsequently, by showing that $\beta$ can be minimized independently of the remaining weights, the optimal value is attained through equation 4. Absorbing $\beta$ into the fusion weights in turn introduces the additional block $L_q$-norm regularization term $\Lambda\big(\sum_{m=1}^M \|w_L^m\|_2^q\big)^{2/q}$.
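For illustration, the following sketch shows how the relevance scores of equation 4 can be read off a trained fusion layer. It is our own minimal example, not the authors' released implementation; the function name, the variable names, and the choice $p = 2$ are assumptions.

```python
import torch

def relevance_scores(fusion_weights, p: float = 2.0) -> torch.Tensor:
    """Recover modality relevance scores beta from the learned fusion-layer
    weight blocks w_L^1, ..., w_L^M, following equation (4):
    beta_m = ||w_L^m||^(2/(p+1)) / (sum_m' ||w_L^m'||^(2p/(p+1)))^(1/p).
    `fusion_weights` holds one weight block (tensor) per modality."""
    norms = torch.stack([w.flatten().norm(p=2) for w in fusion_weights])
    numer = norms ** (2.0 / (p + 1.0))
    denom = (norms ** (2.0 * p / (p + 1.0))).sum() ** (1.0 / p)
    return numer / denom

# Usage sketch with three hypothetical modalities; the recovered beta has unit L_p norm.
betas = relevance_scores([torch.randn(64), torch.randn(32), torch.randn(128)], p=2.0)
print(betas, betas.norm(p=2.0))  # the L_2 norm equals 1 here, since p = 2
```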
Correct Relevance Scores Through Normalization. In preliminary experiments (see Appendix F.1) we found that the relevance scores can be misleading, especially when the network outputs high activation values for some modalities. We address this issue with a proper normalization technique. We propose an adaptation of standard Batch Normalization (batch norm) [Ioffe and Szegedy, 2015], which we call Vector-wise Batch Normalization (VBN). VBN ensures that the $L_2$-norm of the activation values, averaged over a mini-batch, is constant. Let $B$ be a mini-batch of datapoint indices. We define VBN as:

$$\bar f^m(x_i) = \frac{f^m(x_i) - \mu_{B,m}}{\sigma_{B,m}}, \quad \text{where} \tag{5}$$

$$\mu_{B,m} = \frac{1}{|B|}\sum_{i \in B} f^m(x_i), \quad \text{and} \tag{6}$$

$$\sigma^2_{B,m} = \frac{1}{|B|}\sum_{i \in B} \big\|f^m(x_i) - \mu_{B,m}\big\|_2^2. \tag{7}$$

The mean $\mu_{B,m}$ is computed element-wise, as in standard batch norm. However, in equation 7, instead of computing the variance element-wise, we calculate the average of the squared $L_2$ norm of a modality representation across the mini-batch. Unlike batch norm, we do not shift and scale the representations element-wise after the normalization step. Using VBN, the loss in equation 3 changes to:

$$\sum_{i=1}^n \ell\Big(\sum_{m=1}^M \big\langle w_L^m, \bar f^m(x_i^m)\big\rangle + b,\ y_i\Big). \tag{8}$$

Note that VBN is applied after the activation function to obtain $\bar f^m(x^m)$. We found empirically that proper normalization is crucial for MNL to achieve competitive performance.
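A minimal PyTorch sketch of VBN, assuming each modality representation arrives as a (batch, dim) tensor; the epsilon term is our addition for numerical stability and is not part of equations (5)-(7).

```python
import torch

def vbn(h: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Vector-wise Batch Normalization (VBN), equations (5)-(7): the mean is
    removed element-wise over the mini-batch, but a single scalar scale per
    modality (the batch-averaged squared L2 norm) replaces the element-wise
    variance. No learnable shift or scale follows, unlike standard batch norm.
    `h` holds one modality's activations with shape (batch, dim)."""
    mu = h.mean(dim=0, keepdim=True)             # mu_{B,m}: element-wise batch mean
    centered = h - mu
    sigma2 = centered.pow(2).sum(dim=1).mean()   # sigma^2_{B,m} = mean of ||f - mu||_2^2
    return centered / torch.sqrt(sigma2 + eps)
```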
3.3 Interpretable Tensor Fusion

In this section, we propose Interpretable Tensor Fusion (InTense), an extension of MNL that additionally produces scores for interactions (non-linear combinations) of modalities. InTense is based on tensor fusion, which captures multiplicative interactions among modalities by computing a tensor product over the individual modality representations [Zadeh et al., 2017]. InTense operates as follows. For a dataset with $M$ modalities, we consider interactions up to a given order $D$, where $D \leq M$. An order of $D$ implies an interaction among $D$ modalities. A multiplicative interaction of modalities is defined by a subset $I \in \mathcal{I}$, where $\mathcal{I} = \{J \subseteq \{1, \dots, M\} : |J| \leq D\}$, and a tensor product $f^I(x) := f^{I_1} \otimes f^{I_2} \otimes \dots \otimes f^{I_{|I|}}$, where $f^{I_m}$ is the representation of modality $I_m$ and $\otimes$ denotes the tensor product operator. Analogously to equation 2, we obtain a new objective:

$$\min_{\substack{w_L^I,\, \beta_I : I \in \mathcal{I},\ \|\beta\|_p \leq 1,\ \beta_I \geq 0, \\ W^1, \dots, W^M}} \ \sum_{i=1}^n \ell\Big(\sum_{I \in \mathcal{I}} \beta_I \big\langle w_L^I, f^I(x_i)\big\rangle + b,\ y_i\Big) + \Lambda \sum_{m=1}^M \sum_{l=1}^{L} \|w_l^m\|_2^2.$$

This optimization problem can be seen as a special case of MNL, where the multiplicative interactions are treated as separate modalities. Therefore, in combination with equation 8, Theorem 1 computes the relevance scores for all modalities and their interactions.

What Can Go Wrong? In our experiments with synthetic multimodal datasets (Section 4.1), we found that the relevance scores of higher-order interactions are greatly overestimated. Scores can be high even when no true interactions exist in the data. We call this phenomenon higher-order interaction bias. The bias is caused by higher-order tensor products corresponding to very large function classes, which approximately include the function classes corresponding to lower-order tensors as subsets. Indeed, it is possible that a linear combination of the components of a tensor product learns the same functions as a linear combination of the individual-modality representations. For instance, consider two modalities $(m_u, m_v)$ with representation vectors $u, v \in \mathbb{R}^3$. Assume the first component of the learned representations is constant (e.g., 1), i.e., $u = [1, u_2, u_3]^\top$ and $v = [1, v_2, v_3]^\top$. In such a scenario, the linear combination $\alpha_1 u_2 + \alpha_2 v_2$ (for $\alpha_1, \alpha_2 \in \mathbb{R}$) can also be represented as $\alpha_1 (u \otimes v)_{2,1} + \alpha_2 (u \otimes v)_{1,2}$. Using the MNL algorithm, the relevance scores for modalities $m_u$ and $m_v$ are $\alpha_1$ and $\alpha_2$, respectively. However, the relevance score for the interaction $m_u \otimes m_v$ is $\sqrt{\alpha_1^2 + \alpha_2^2}$. Here, the $L_p$-norm regularization with any $p < 2$ will favor the representation $m_u \otimes m_v$. Therefore, if the dimensions of the modality representations $u$ and $v$ are strictly greater than required to represent the ground truth (which is usually the case in modern networks), lower-order functions will preferably be represented inside the higher-order products by learning a constant in the representations of each modality. Our experiments show that a trained network typically exhibits such behavior. We propose a solution to this problem of higher-order interaction bias in the rest of this section.

Correct Bi-modal Interactions. We now address the problem of higher-order interaction bias. The key idea is to introduce a normalization scheme that downweights higher-order interactions. Our normalization scheme is a generalization of the Vector-wise Batch Normalization (VBN) scheme described in equation 5. Let $m_1$ and $m_2$ be two modalities and denote their representations by $f^{m_1}$ and $f^{m_2}$. The representation of the bi-modal interaction is defined as $f^{\{m_1, m_2\}} = f^{m_1} \otimes f^{m_2}$. In this simple bi-modal case, our solution can be summarized as follows: we apply VBN to $f^{m_1}$ and $f^{m_2}$ before taking the product, and finally apply VBN again to the result. Formally, normalize each modality representation according to equation 5 as:

$$\bar f^m(x_i) = \frac{f^m(x_i) - \mathbb{E}\big(f^m(x_i)\big)}{\sqrt{\mathbb{E}\big(\|f^m(x_i) - \mathbb{E}(f^m(x_i))\|_2^2\big)}},$$

then compute the tensor product as $\hat f^{\{m_1, m_2\}}(x_i) = \bar f^{m_1} \otimes \bar f^{m_2}$, and similarly apply VBN to the tensor product to obtain $\bar f^{\{m_1, m_2\}}$. The centering step of the normalization is applied element-wise over a mini-batch. Thus, if a few components were non-zero constants over a mini-batch, they would become zero after the normalization. This ensures that $\bar f^{\{m_1, m_2\}}$ cannot easily access lower-degree information contained in $\bar f^{m_1}$, because elements of $\bar f^{m_2}$ cannot be a non-zero constant, and vice versa. The normalization could seemingly be extended trivially to more than two modalities by applying it iteratively up to an order-$n$ tensor product. However, such an extension may still lead to a higher-order interaction bias. We illustrate why this trivial extension may not work for more than two modalities and later generalize the normalization scheme to any number of modalities.
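A sketch of the bi-modal procedure (VBN, tensor product, VBN again), reusing the `vbn` helper from the sketch above; flattening the outer product into a vector is our own implementation choice.

```python
import torch

def bimodal_fusion(h1: torch.Tensor, h2: torch.Tensor) -> torch.Tensor:
    """Bi-modal InTense-style interaction: VBN each modality, take the tensor
    (outer) product per datapoint, then apply VBN to the product."""
    z1, z2 = vbn(h1), vbn(h2)                   # centering removes constant components
    prod = torch.einsum('bi,bj->bij', z1, z2)   # batched outer product, shape (batch, d1, d2)
    return vbn(prod.flatten(start_dim=1))       # normalize the interaction representation

# Usage sketch: a batch of 8 datapoints with 4- and 3-dimensional representations.
f12 = bimodal_fusion(torch.randn(8, 4), torch.randn(8, 3))  # shape (8, 12)
```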
Generalization Over M-modal Interactions. Extending the aforementioned normalization to cases where $D > 2$ is not straightforward. The complexity arises because, when fusing more than two modalities, the representations of a subset of the modalities can conspire to produce a constant even though each individual modality representation is non-constant. For instance, consider three modalities with one-dimensional representations and apply VBN to the representations $f^1, f^2, f^3$, then to $f^{1,2}$, and finally to $f^{1,2,3} = f^{1,2} \otimes f^3$. It is still possible that the representation $f^1$ is learned in the higher-order tensor product. For instance, assume the components of $f^2$ and $f^3$ are learned to satisfy the following for each datapoint: (1) $f^2 = f^3$; (2) $f^2$ is a Rademacher variable ($\mathbb{P}[f^2 = 1] = \mathbb{P}[f^2 = -1] = 0.5$); and (3) $f^2$ is independent of $f^1$. Since $f^{1,2,3} = f^1 f^2 f^3$ and $f^2 f^3 = 1$ for all datapoints, we actually have $f^{1,2,3} = f^1$, and this seemingly higher-order combination can still recover the first modality.

We address this issue by carefully normalizing the modality features. The key is to prevent the combination of features of one or more modalities from resulting in a constant value. Similar to the bi-modal scenario, we need to ensure that the contribution of any subset of modalities to the larger fusion set is, on average, zero. This guarantees that no constant value other than zero is multiplied by the product of the complement of that subset within the original fusion set. Formally, for each $I \subseteq \{1, 2, \dots, M\}$, the centering step of our batch norm procedure is defined as follows, where we first assume each modality is one-dimensional to simplify the exposition:

$$\hat f^I = \sum_{\ell \geq 0} (-1)^\ell \sum_{\substack{\emptyset \neq S_1, \dots, S_\ell \subseteq I \\ S_1, \dots, S_\ell \text{ disjoint}}} \Bigg(\prod_{m \in I \setminus (\bigcup_k S_k)} f^m\Bigg) \prod_{k \in \{1, \dots, \ell\}} \mathbb{E}\Bigg[\prod_{m \in S_k} f^m\Bigg]. \tag{9}$$

When the modalities are multi-dimensional, the above operation is applied independently to each multi-index component. (In particular, the multi-dimensional case could be expressed with a formula similar to equation 9 with the products replaced by outer tensor products, but this would require a different reordering of the components for each term of the sum.) After performing the centering step for each multi-index component, we perform the generalized normalization step as follows:

$$\bar f^I = \frac{\hat f^I}{\sqrt{\mathbb{E}\,\big\|\hat f^I\big\|_{\mathrm{Fr}}^2}}.$$

While the solution can no longer be easily interpreted as a composition of standard batch norm operations, it is, in fact, possible to show that lower-order fusions cannot be represented by a linear combination of their higher-order counterparts. Theorem 2 formalizes this result.

Theorem 2. The centering step described in equation 9 can be represented as the multivariate polynomial

$$\hat f^I = \sum_{J \subseteq I} G_J \prod_{m \in J} f^m$$

for some real coefficients $G_J$. Furthermore, the expected contribution of a subset of modalities $J$ to the fusion of the set of modalities $I$, where $J \subsetneq I$, is zero. That is, for any $J \subsetneq I$ (including the empty set), we have

$$\mathbb{E}\Bigg[\sum_{K :\, J \subseteq K \subseteq I} G_K \prod_{m \in K \setminus J} f^m\Bigg] = 0.$$

The theorem states that the expected value of the contribution of any subset $J \subsetneq I$ of modalities is zero in the fusion of $I$. Thus, $\bar f^I$ (higher order) cannot learn a linear combination of the $\bar f^J$ (lower order). Appendix B contains a comprehensive proof of the theorem. To make the exposition clearer, we provide as an example the case $I = \{1, 2, 3\}$, where the individual representations $f^m$ are standardized using VBN. In this case, using the notation $\overline{f^1 f^2} = \mathbb{E}(f^1 f^2)$, we have:

$$\hat f^{1,2,3} = f^1 f^2 f^3 - \overline{f^1 f^2}\, f^3 - \overline{f^1 f^3}\, f^2 - \overline{f^2 f^3}\, f^1 - \overline{f^1 f^2 f^3}.$$

A more elaborate centering step, without normalizing the individual modality representations and strictly following equation 9, is described in Appendix B.1. In this section, we introduced Iterative Batch Normalization (IterBN), a normalization scheme addressing higher-order interaction bias in multimodal learning.
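To make the $I = \{1, 2, 3\}$ example concrete, the following sketch performs the centering and rescaling for one-dimensional, VBN-standardized representations, with batch means standing in for the expectations; it is an illustration of the formula above, not the authors' implementation, and the epsilon is our addition.

```python
import torch

def trimodal_centering(z1, z2, z3, eps: float = 1e-5):
    """Centering and rescaling for I = {1, 2, 3} with one-dimensional,
    VBN-standardized representations z1, z2, z3 of shape (batch,).
    Subtracting E[z1 z2] z3, E[z1 z3] z2, E[z2 z3] z1 and E[z1 z2 z3]
    prevents any subset of modalities from contributing a non-zero constant,
    so the trimodal term cannot re-encode lower-order information."""
    e12, e13, e23 = (z1 * z2).mean(), (z1 * z3).mean(), (z2 * z3).mean()
    e123 = (z1 * z2 * z3).mean()
    centered = z1 * z2 * z3 - e12 * z3 - e13 * z2 - e23 * z1 - e123
    return centered / torch.sqrt(centered.pow(2).mean() + eps)

# The Rademacher counterexample above: with z2 = z3 in {-1, +1}, the plain product gives
# z1*z2*z3 == z1, but the centering subtracts E[z2 z3] * z1 = z1 and removes the leak.
z1 = torch.randn(256)
z2 = torch.randint(0, 2, (256,)).float() * 2 - 1   # Rademacher-distributed components
out = trimodal_centering(z1, z2, z2)
```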
In the next section, we show the effectiveness of our method on synthetic and real-world datasets.

4 Experiments

First, we experiment on synthetic data, where we control the amount of relevant information in the modalities and compare InTense's relevance scores to the established ground truth. Second, we compare the predictive performance of InTense with popular multimodal fusion methods on six real-world multimodal datasets.

4.1 Evaluating the Relevance Scores

We created a multimodal dataset where each modality of a datapoint is a sequence of letters chosen randomly from a predefined set. For each datapoint $x$ and modality $m$, an informative subsequence is inserted at a random position with probability $p_m$. We call our dataset SYNTHGENE. More details about it can be found in Appendix C. We perform two experiments to determine the correctness of the relevance scores obtained from InTense. First, we construct a binary classification dataset with labels that ensure the modalities are independent and do not interact. Second, we generate another set of labels that can only be predicted using non-linear interactions among the modalities.

InTense Assigns Correct Relevance Scores to Independent Modalities

In this set of experiments, we create a synthetic dataset with independent modalities (i.e., without interactions). As a baseline, we train one model for each modality and then compare the accuracies of those models with the relevance scores of InTense trained on all modalities together.

Figure 2: An excerpt of three modalities of SYNTHGENE, our self-curated binary classification dataset, where each sequence is made from the set of letters {A, C, G, T}. A positive class-sequence TCG and a negative class-sequence AGC are added according to the probability $p_m$.

Dataset. We create the SYNTHGENE dataset, where for each modality $m$, each datapoint with a positive/negative class label contains a class-specific sequence with probability $p_m$, independently of the other modalities. A high value of $p_m$ indicates that most sequences in modality $m$ contain a class-specific sequence. Thus, the higher the value of $p_m$, the more relevant modality $m$ becomes. The labels for all datapoints are uniformly distributed between the two classes. Figure 2 shows how the probability $p_m$ affects modality relevance. We use 10 modalities, where the informative subsequence is inserted into modalities M2, M4, and M7. No discriminative information is present in the other modalities.
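For concreteness, a small sketch of how such a modality can be generated; the motifs follow Figure 2, while the sequence length, the overwrite-style insertion, and the probabilities in the usage example are illustrative assumptions rather than the exact protocol of Appendix C.

```python
import random

LETTERS = "ACGT"
POS_MOTIF, NEG_MOTIF = "TCG", "AGC"   # class-specific subsequences from Figure 2

def make_modality(label: int, p_m: float, length: int = 20) -> str:
    """Generate one modality of a SYNTHGENE-like datapoint. With probability
    p_m the class-specific motif is inserted at a random position; otherwise
    the sequence carries no class information."""
    seq = [random.choice(LETTERS) for _ in range(length)]
    if random.random() < p_m:
        motif = POS_MOTIF if label == 1 else NEG_MOTIF
        start = random.randrange(length - len(motif) + 1)
        seq[start:start + len(motif)] = motif   # overwrite a random window with the motif
    return "".join(seq)

# Usage sketch: 10 modalities, only M2, M4, and M7 carry class information.
p = {m: 0.0 for m in range(1, 11)}
p.update({2: 0.9, 4: 0.7, 7: 0.5})            # hypothetical insertion probabilities
label = random.choice([0, 1])
x = [make_modality(label, p[m]) for m in range(1, 11)]
```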
Figure 3: The figure shows a high correlation between InTense relevance scores and the accuracies of unimodal models on SYNTHGENE. The modalities M2, M4, and M7 achieve high relevance scores and high accuracy, as they contain class-specific information. The other modalities contain no class-specific information, which leads to very low relevance scores and an accuracy of around 50% (equivalent to random guessing).

Results. Figure 3 shows the relevance scores calculated by InTense on SYNTHGENE. InTense assigns the correct relevance scores, as they align with human intuition: the higher the probability $p_m$, the more informative signal is contained in modality $m$, and the higher the predicted relevance score. We further validate the correctness of InTense's interpretability by comparing it with the accuracies obtained from unimodal models trained on each modality separately. Again, InTense's relevance scores correlate with the unimodal accuracies.

InTense Assigns Correct Relevance Scores to Interacting Modalities

We now turn to a situation where, by design, the label depends on a non-linear interaction among the modalities.

Dataset. We also create the SYNTHGENE-TRI dataset, a trimodal version of SYNTHGENE. This time, however, the informative subsequence is not class-specific. Instead, the label is defined by an exclusive-or (XOR) relationship between the first two modalities (M1 and M2): the label is 0 if both modalities contain the subsequence or neither does, and the label is 1 otherwise (i.e., when exactly one of the modalities contains the subsequence). Note that modality M3 does not contain any informative subsequence and is thus irrelevant. As before, we generate a balanced dataset with 50% of the samples being positive and 50% negative.

Figure 4: Illustration of the relevance scores calculated by the proposed InTense and the MultiRoute baseline when higher-order modality interactions are involved in the ground truth. MultiRoute leads to biased results (blue bars), where the relevance scores are concentrated toward the higher-order interaction $M_{1,2,3}$. In contrast, InTense (orange bars) correctly assigns a high relevance score only to the interaction $M_{1,2}$, which contains all class-specific signals.

Results. Figure 4 shows the results. We observe that the global relevance scores calculated by the MultiRoute baseline [Tsai et al., 2020] are biased toward higher-order interactions. This occurs even when the interactions do not add any useful information and thus should have been discarded by MultiRoute. We further see that the proposed InTense method avoids this bias and correctly assigns a high relevance score solely to the interaction of M1 and M2. This shows that InTense ensures the correctness of relevance scores even when higher-order interactions are involved.

4.2 InTense Performs SOTA in Real-World Applications

We demonstrate the effectiveness of InTense in providing interpretability without compromising predictive performance across a range of real-world applications. To compare performance and ensure reproducibility, we followed the experimental setup (e.g., data preprocessing and encodings of the different modalities) of the MultiBench benchmark [Liang et al., 2021] for all experiments.

Sentiment analysis. In sentiment analysis, also known as opinion mining, the target is to identify the emotional tone or feeling underlying the data. Initially confined to text data, sentiment analysis has evolved to encompass multiple modalities. The task becomes challenging due to the intricate interactions of modalities, which play a significant role in expressing sentiments. Understanding sentiments is crucial in business intelligence, customer feedback analysis, and social media monitoring. To evaluate InTense in sentiment analysis, we employed CMU-MOSEI [Bagher Zadeh et al., 2018], the largest dataset for sentence-level sentiment analysis of real-world online videos, and CMU-MOSI [Zadeh et al., 2016], a collection of annotated opinion video clips.
Humor and Sarcasm detection. Humor detectors identify elements that evoke amusement or comedy, while sarcasm detection aims to discern whether a sentence is presented in a sarcastic or sincere manner. Sarcasm and humor are often situational. Successfully detecting them requires a comprehensive understanding of various information sources, encompassing the utterance, the contextual intricacies of the conversation, and the background of the involved entities. As this information extends beyond textual cues, the challenge lies in learning the complex interactions among the available modalities. To assess our approach's effectiveness on these tasks, we utilized UR-FUNNY [Hasan et al., 2019] for humor detection and MUStARD [Castro et al., 2019] for sarcasm detection.

Layout Design Categorization. Layout design categorization is about classifying graphical user interfaces into predefined categories. Automating this task can support designers in optimizing the arrangement of interactive elements, ensuring the creation of interfaces that are not only visually appealing but also functional and user-centric. Classifiers can, e.g., assign semantic captions to elements, enable smart tutorials, or form the foundation of advanced search engines. For this paper, we considered the ENRICO dataset [Leiva et al., 2020] as an example of layout design categorization. ENRICO comprises 20 design categories and 1460 user interfaces with five modalities, including screenshots, wireframe images, semantic annotations, DOM-like tree structures, and app metadata.

Digit Recognition. We also include results for Audiovision-MNIST (AV-MNIST) [Vielzeuf et al., 2018], a multimodal dataset comprising images of handwritten digits and recordings of spoken digits. Despite its apparent lack of immediate real-world application, the dataset's significance lies in its establishment as a standard multimodal benchmark. It allows us to situate our research within the broader context of previous work [Pérez-Rúa et al., 2019; LeCun et al., 1998].

Baselines. We compare the classification performance of InTense and MNL to the following state-of-the-art interpretable multimodal learning baselines: 1) Multimodal Residual Optimization (MRO) [Wörtwein et al., 2022] and 2) Multimodal Routing (MultiRoute) [Tsai et al., 2020]. Additionally, we consider three non-interpretable baselines: 1) LF-Concat, 2) the Tensor Fusion Network (TFN) [Zadeh et al., 2017], and 3) the Multimodal Transformer (MulT) [Tsai et al., 2019]. These baselines have been identified as leading in the independent comparison conducted by Liang et al. [2021]. The Multimodal Transformer was particularly highlighted for consistently reaching some of the highest accuracy levels.

Results. The results are shown in Table 1.

| Dataset | MultiRoute | MRO | MNL (ours) | InTense (ours) |
|---|---|---|---|---|
| MUStARD | 65.9 | 66.5 | 67.4 | 69.6 |
| CMU-MOSI | 76.8 | 75.8 | 80.8 | 79.7 |
| UR-FUNNY | 63.6 | 63.4 | 63.4 | 65.1 |
| CMU-MOSEI | 80.2 | 79.7 | 80.5 | 81.5 |
| AV-MNIST | 71.8 | 72.0 | 72.4 | 72.8 |
| ENRICO | 46.7 | 49.2 | 47.1 | 50.8 |

Table 1: Accuracies for the different baselines on the test fold. Each experiment is carried out ten times to compute the statistics.

We observe that our proposed models, MNL and InTense, surpass all interpretable baselines in terms of classification accuracy. InTense achieves the highest performance across all datasets except for CMU-MOSI, where MNL excels. CMU-MOSI is the smallest dataset in our analysis, a factor that may contribute positively to MNL's performance.
Compared with non-interpretable multimodal learning methods (see Table 1 in Appendix E), MNL and especially InTense demonstrate impressive performance, almost matching the accuracy of the non-interpretable Multimodal Transformer (MulT). The performance of our proposed models is within a narrow 2% error margin (and frequently much lower) of the MulT baseline.

Figure 5: Relevance scores from InTense for audio (a), vision (v), text (t), and all their possible interactions on MUStARD, CMU-MOSI, UR-FUNNY, and CMU-MOSEI.

Figure 5 shows the interpretable relevance scores that InTense assigns to the various modalities. Notably, in three of the four datasets analyzed, text emerges as the most significant modality. We identify two plausible explanations for this phenomenon. First, several studies have reported a strong correlation of text with sentiment [Gat et al., 2021]. Second, the predominance of the text modality may be attributed to the availability of sophisticated word embeddings obtained from large pre-trained foundation models. However, we find an exception in the interpretability scores for the sarcasm detection dataset (MUStARD). Sarcasm detection requires information from multiple modalities, making sole reliance on one, especially text, insufficient for accuracy.

5 Conclusion

We introduced InTense, a novel interpretable approach to multimodal learning that offers reliable relevance scores for modalities and their interactions. InTense achieves state-of-the-art performance in several challenging applications, from sentiment analysis and humor detection to layout design categorization and multimedia. We proved theoretically and validated empirically that InTense correctly disentangles higher-order interactions from the individual effects of each modality. The full transparency of InTense makes it suitable for future application in safety-critical domains.

Ethical Statement

As an interpretable approach, the proposed methodology naturally aids in making multimodal learning more transparent. By attributing importance scores to different modalities and their interactions, InTense may reveal biases in decision-making and improve trustworthiness. For instance, consider a system tasked with classifying loan suitability. Our approach may expose social biases when relevance scores for certain modalities, such as gender extracted from vision, are disproportionately high. Moreover, unlike existing approaches, InTense has no higher-order interaction bias (see Section 3.3). That is, it does not incorrectly assign large relevance scores to higher-order interactions, which could create the false impression of a social bias. The full transparency of InTense can prevent the deployment of a harmful classification model, contributing to the ethical use of AI in sensitive domains.

Acknowledgements

SV, PL, WM, and MK acknowledge support by the Carl Zeiss Foundation, the DFG awards KL 2698/2-1, KL 2698/5-1, KL 2698/6-1, and KL 2698/7-1, and the BMBF awards 03|B0770E and 01|S21010C.

References

[Arabacı et al., 2021] Mehmet Ali Arabacı, Fatih Ozkan, Elif Surer, Peter Jančovič, and Alptekin Temizel. Multi-modal egocentric activity recognition using multi-kernel learning. Multimedia Tools and Applications, 80(11):16299-16328, 2021.

[Bagher Zadeh et al., 2018] Amir Ali Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236-2246, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[Behravan et al., 2018] Hamid Behravan, Jaana M. Hartikainen, Maria Tengström, Katri Pylkäs, Robert Winqvist, Veli-Matti Kosma, and Arto Mannermaa. Machine learning identifies interacting genetic variants contributing to breast cancer risk: A case study in Finnish cases and controls. Scientific Reports, 8(1):13149, 2018.

[Büchel et al., 1998] Christian Büchel, Cathy Price, and Karl Friston. A multimodal language region in the ventral visual pathway. Nature, 394(6690):274-277, 1998.

[Cao et al., 2020] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VI, pages 565-580. Springer, 2020.

[Castro et al., 2019] Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, and Soujanya Poria. Towards multimodal sarcasm detection, 2019.

[Chagas et al., 2020] Paulo Chagas, Luiz Souza, Ikaro Araújo, Nayze Aldeman, Angelo Duarte, Michele Angelo, Washington LC Dos-Santos, and Luciano Oliveira. Classification of glomerular hypercellularity using convolutional features and support vector machine. Artificial Intelligence in Medicine, 103:101808, 2020.

[Chandrasekaran et al., 2018] Arjun Chandrasekaran, Viraj Prabhu, Deshraj Yadav, Prithvijit Chattopadhyay, and Devi Parikh. Do explanations make VQA models more predictable to a human? arXiv preprint arXiv:1810.12366, 2018.

[Chen et al., 2014] Jun Kai Chen, Zenghai Chen, Zheru Chi, and Hong Fu. Emotion recognition in the wild with feature fusion and multiple kernel learning. In Proceedings of the 16th International Conference on Multimodal Interaction, pages 508-513, 2014.

[Elgart et al., 2022] Michael Elgart, Genevieve Lyons, Santiago Romero-Brufau, Nuzulul Kurniansyah, Jennifer A. Brody, Xiuqing Guo, Henry J. Lin, Laura Raffield, Yan Gao, Han Chen, et al. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Communications Biology, 5(1):856, 2022.

[Frank et al., 2021] Stella Frank, Emanuele Bugliarello, and Desmond Elliott. Vision-and-language or vision-for-language? On cross-modal influence in multimodal transformers. arXiv preprint arXiv:2109.04448, 2021.

[Gat et al., 2021] Itai Gat, Idan Schwartz, and Alex Schwing. Perceptual score: What data modalities does your model perceive? Advances in Neural Information Processing Systems, 34:21630-21643, 2021.

[Hasan et al., 2019] Md Kamrul Hasan, Wasifur Rahman, Amir Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, et al. UR-FUNNY: A multimodal language dataset for understanding humor. arXiv preprint arXiv:1904.06618, 2019.

[He et al., 2020] Zhipeng He, Zina Li, Fuzhou Yang, Lei Wang, Jingcong Li, Chengju Zhou, and Jiahui Pan. Advances in multimodal emotion recognition based on brain-computer interfaces. Brain Sciences, 10(10):687, 2020.

[Hessel and Lee, 2020] Jack Hessel and Lillian Lee. Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 861-877, 2020.

[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456. PMLR, 2015.

[Kanehira et al., 2019] Atsushi Kanehira, Kentaro Takemoto, Sho Inayoshi, and Tatsuya Harada. Multimodal explanations by predicting counterfactuality in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8594-8602, 2019.

[Kloft et al., 2011] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 12:953-997, 2011.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[Leiva et al., 2020] Luis A. Leiva, Asutosh Hota, and Antti Oulasvirta. Enrico: A dataset for topic modeling of mobile UI designs. In 22nd International Conference on Human-Computer Interaction with Mobile Devices and Services, pages 1-4, 2020.

[Liang et al., 2021] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Yufan Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, et al. MultiBench: Multiscale benchmarks for multimodal representation learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.

[Lunghi et al., 2019] Giacomo Lunghi, Raul Marin, Mario Di Castro, Alessandro Masi, and Pedro J. Sanz. Multimodal human-robot interface for accessible remote robotic interventions in hazardous environments. IEEE Access, 7:127290-127319, 2019.

[Park et al., 2018] Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8779-8788, 2018.

[Pérez-Rúa et al., 2019] Juan-Manuel Pérez-Rúa, Valentin Vielzeuf, Stéphane Pateux, Moez Baccouche, and Frédéric Jurie. MFAS: Multimodal fusion architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6966-6975, 2019.

[Poria et al., 2015] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2539-2544, 2015.

[Rakotomamonjy et al., 2008] Alain Rakotomamonjy, Francis Bach, Stéphane Canu, and Yves Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491-2521, 2008.

[Sabour et al., 2017] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. Advances in Neural Information Processing Systems, 30, 2017.

[Tan and Bansal, 2019] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.

[Tsai et al., 2019] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, volume 2019, page 6558. NIH Public Access, 2019.
[Tsai et al., 2020] Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Ruslan Salakhutdinov, and Louis-Philippe Morency. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2020, page 1823. NIH Public Access, 2020.

[Vielzeuf et al., 2018] Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, and Frédéric Jurie. CentralNet: A multilayer approach for multimodal fusion. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

[Wörtwein et al., 2022] Torsten Wörtwein, Lisa Sheeber, Nicholas Allen, Jeffrey Cohn, and Louis-Philippe Morency. Beyond additive fusion: Learning non-additive multimodal interactions. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4681-4696, 2022.

[Zadeh et al., 2016] Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos, 2016.

[Zadeh et al., 2017] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103-1114, 2017.

[Zhang et al., 2023] Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, and Xiangyu Yue. Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802, 2023.