# MuMu: Cooperative Multitask Learning-Based Guided Multimodal Fusion

Md Mofijul Islam, Tariq Iqbal
School of Engineering and Applied Science, University of Virginia
{mi8uu,tiqbal}@virginia.edu

## Abstract

Multimodal sensors (visual, non-visual, and wearable) can provide complementary information to develop robust perception systems for recognizing activities accurately. However, it is challenging to extract robust multimodal representations due to the heterogeneous characteristics of data from multimodal sensors and disparate human activities, especially in the presence of noisy and misaligned sensor data. In this work, we propose a cooperative multitask learning-based guided multimodal fusion approach, MuMu, to extract robust multimodal representations for human activity recognition (HAR). MuMu employs an auxiliary task learning approach to extract features specific to each set of activities with shared characteristics (activity-group). MuMu then utilizes these activity-group-specific features to direct our proposed Guided Multimodal Fusion Approach (GM-Fusion) in extracting complementary multimodal representations, designed as the target task. We evaluated MuMu by comparing its performance to state-of-the-art multimodal HAR approaches on three activity datasets. Our extensive experimental results suggest that MuMu outperforms all the evaluated approaches across all three datasets. Additionally, an ablation study suggests that MuMu significantly outperforms baseline models (p < 0.05) that do not use our guided multimodal fusion. Finally, the robust performance of MuMu on noisy and misaligned sensor data posits that our approach is suitable for HAR in real-world settings.

## Introduction

Understanding human activity ensures effective human-autonomous-system collaboration in various settings, from autonomous vehicles to assistive living to manufacturing (Sabokrou et al. 2019; Iqbal and Riek 2017, 2021; Yasar and Iqbal 2021, 2022; Green et al. 2022b,a). For example, accurate activity recognition could aid collaborative robots in assisting a worker by bringing tools, or aid autonomous vehicles in requesting to take over the controls from a distracted driver to ensure safety (Iqbal et al. 2019; Pakdamanian et al. 2020).

Human activity recognition (HAR) has been extensively studied by utilizing unimodal sensor data, such as visual (Ryoo et al. 2017; Zhang and Parker 2011; Fan et al. 2018), skeleton (Arzani et al. 2017; Ke et al. 2017; Yan, Xiong, and Lin 2018; Iqbal, Rack, and Riek 2016), and wearable sensors (Frank, Kubota, and Riek 2019; Batzianoulis et al. 2017). However, unimodal methods struggle to recognize activities in various real-world scenarios for multiple reasons. First, distinct activities can be mistakenly classified as the same when relying on visual sensors alone (Kong et al. 2019). For example, carrying a light object and carrying a heavy object look similar from visual modalities; however, they produce distinct physical sensor data (Fig. 1-a & b: Gyroscope & Acceleration). Second, unimodal methods may fail to recognize activities when the sensor data is noisy (Fig. 1-c). In these cases, using multiple modalities can compensate for the weaknesses of any particular modality in recognizing an activity.
Several multimodal learning approaches have been proposed to accurately recognize human activities by fusing data from multiple sensors (Feichtenhofer et al. 2019; Kong et al. 2019; Roitberg et al. 2015; Joze et al. 2020; Liu et al. 2019; Perez-Rua et al. 2019; Hasan et al. 2019; Islam and Iqbal 2020). Although these approaches work adequately in many scenarios, some crucial challenges remain in achieving robust recognition performance, particularly when data from multiple sensors are missing or misaligned.

First, disparate activity-groups require different modalities to accurately recognize activities (an activity-group consists of a set of activities that exhibit similar characteristics). For example, Kubota et al. (2019) found that data from a motion capture system helps to recognize gross-motion activities involving arm and leg movements (e.g., walking), whereas data from wearable sensors helps to recognize fine-grained motion activities involving hand or finger movements (e.g., grasping). Thus, if a model can exploit the characteristics of activity-groups while extracting multimodal representations, it can improve HAR performance. Moreover, in many existing datasets, activities are already grouped into categories based on shared characteristics (Kubota et al. 2019; Awad et al. 2018). For example, Kong et al. (2019) grouped human activities into three groups: complex (e.g., carrying), simple (e.g., kicking), and desk (e.g., using a PC). Surprisingly, apart from grouping the activities, these auxiliary activity-group labels have not been utilized in extracting multimodal representations.

Second, most existing multimodal learning approaches assume non-noisy and time-aligned multimodal sensor data during the training and testing phases. These assumptions limit the applicability of the existing approaches in real-world settings, as the presence of misaligned and noisy sensor data is not uncommon due to occlusion and sensor noise (Fig. 1-c). Thus, we need to develop and evaluate multimodal learning approaches in the presence of noisy and misaligned sensor data to ensure their applicability in real-world settings.

Figure 1: (a) Carry-Light (non-noisy data), (b) Carry-Heavy (non-noisy data), and (c) Carry-Heavy (noisy data, except the Orientation sensor). (a) Carry-Light and (b) Carry-Heavy activities have similar visual features but distinct gyroscope and acceleration data. (a & b: bottom row) Our proposed method, MuMu, can prioritize salient modalities (Gyroscope and Acceleration, in this case) while extracting multimodal representations. (c) MuMu can adaptively adjust attention weights when data is noisy. For example, MuMu pays more attention to the non-noisy data (Orientation) than to the noisy data (Gyroscope and Acceleration) or misaligned data (View-1 & 2). (Data samples are drawn from the MMAct dataset (Kong et al. 2019).)

To address the aforementioned challenges, we propose a novel Cooperative Multitask Learning-based Guided Multimodal Fusion Approach (MuMu) for HAR. In MuMu, we have designed two cooperative tasks: an auxiliary and a target task. First, MuMu extracts activity-group-specific features for activity-group recognition (auxiliary task). Second, the activity-group-specific features direct our Guided Multimodal Fusion Approach (GM-Fusion) to extract robust multimodal representations for recognizing activities (target task).
Here, both tasks work cooperatively: the auxiliary task guides the target task to extract complementary multimodal representations appropriately.

We compared the performance of MuMu to several state-of-the-art HAR algorithms on three multimodal activity datasets (MMAct (Kong et al. 2019), UTD-MHAD (Chen, Jafari, and Kehtarnavaz 2015), and UCSD-MIT (Kubota et al. 2019)). The experimental results suggest that MuMu outperforms all the evaluated approaches in all evaluation conditions. MuMu achieved improvements of 4.45% and 3.61% (F1-score) on the MMAct dataset for the cross-subject and cross-session evaluations, respectively, compared to the state-of-the-art approaches. Additionally, MuMu achieved improvements of 6.86% and 2.48% (top-1 accuracy) on the UCSD-MIT and UTD-MHAD datasets in leave-one-subject-out evaluation settings, compared to the evaluated approaches. Furthermore, our qualitative analysis suggests that MuMu can appropriately prioritize the modalities while extracting complementary representations, even in the presence of noisy and misaligned sensor data (Fig. 1). Moreover, our ablation study suggests that MuMu significantly outperforms baseline learning approaches (p < 0.05) that do not use guided fusion.

## Related Work

Multimodal Learning: Several multimodal learning approaches have been developed for various tasks, such as video classification (Feichtenhofer et al. 2019; Xiao et al. 2020), activity recognition (Islam and Iqbal 2021; Long et al. 2018; Joze et al. 2020), and visual question answering (Lu et al. 2019; Li et al. 2019). Some of these approaches have been designed to extract representations from similar types of modalities (Feichtenhofer, Pinz, and Wildes 2016, 2017; Zhang et al. 2018). For example, Simonyan and Zisserman (2014) designed a two-stream CNN-based model to extract spatial and temporal features from visual modalities. Similarly, Feichtenhofer et al. (2019) proposed a two-stream learning model to extract spatial-temporal features by varying the data sampling rate in those streams. Other approaches have focused on extracting representations from heterogeneous modalities (Kong et al. 2019; Samyoun et al. 2022; Joze et al. 2020; Perez-Rua et al. 2019; Münzner et al. 2017; Liu et al. 2019). For example, Long et al. (2018) designed an attention model to extract unimodal features, which were then fused to produce multimodal representations. Some approaches fuse representations at intermediary layers of the model (Feichtenhofer et al. 2019; Joze et al. 2020). For instance, Xiao et al. (2020) used a multi-stream model to fuse representations at the intermediate layers. However, these approaches depend on human experts to determine which layers' representations should be fused. Such manual fusion often introduces bias into the model and produces suboptimal representations.

Multitask Learning: Several multitask learning models have been designed that aim to share knowledge across tasks to improve those tasks' performance (Ruder 2017; Hashimoto et al. 2016; Zhang and Yang 2017; Guo et al. 2018; Vandenhende et al. 2020; Gagné 2019; Zhou et al. 2020a). For example, Standley et al. (2020) proposed a framework where tasks are grouped and learned by exploiting the cooperative and competitive relationships among the tasks.
Similarly, Guo, Lee, and Ulbricht (2020) utilized a tree structure and Gumbel-softmax (Jang, Gu, and Poole 2016) to determine which parts of the network can be shared or branched to maximize parameter sharing and task performance. Primarily, the existing multitask learning approaches aim to maximize the sharing of learning parameters or knowledge among heterogeneous tasks (Crawshaw 2020; Søgaard and Goldberg 2016; Ruder 2017). Additionally, multitask models have been used to learn shared representations (Ruder 2017; Xu et al. 2018; Zhou et al. 2020b; Achille et al. 2019; Zamir et al. 2018). For example, Liu, Johns, and Davison (2019) proposed a multitask attention model for learning task-aware shared representations. Moreover, Sun et al. (2020) designed an algorithm to learn feature-sharing patterns across tasks for maximizing shared representations. The overall goal of these approaches is to compress a multitask model by maximizing the shared representations among competitive tasks. In this work, we have instead designed a cooperative multitask learning approach, where the auxiliary task guides the target task to extract multimodal representations to recognize activities accurately.

## MuMu: Multitask Learning-based Guided Multimodal Fusion Approach

### Problem Formulation

We define a cooperative multitask learning problem, which involves learning the auxiliary and the target tasks cooperatively for multimodal fusion. Similar to the multi-class activity recognition problem, we aim to recognize a set of $K$ activities, $A = (A_1, \ldots, A_K)$, by extracting multimodal representations ($X^c$) from $M$ heterogeneous modalities, $X^r = (X^r_1, \ldots, X^r_M)$ ($r$ stands for raw feature). We have termed this activity recognition ($A_i \in A$) the target task.

Activity datasets define activity-groups in various ways. For example, UCSD-MIT uses human motion to define activity-groups (gross & fine), whereas the MMAct dataset uses the complexity of the activities (complex, simple & desk). As different activity-groups share disparate characteristics, they require different modalities for recognizing activities (Kubota et al. 2019). Thus, we divide the activity set $A$ into $N$ activity-groups $G = (G_1, \ldots, G_N)$. Here, each activity-group $G_i$ consists of $J_i$ unique activities that share similar characteristics, where $G_i = (A^i_1, \ldots, A^i_{J_i})$ and $A^i_j \in A$. We have termed the activity-group recognition ($G_i \in G$) the auxiliary task.

### Approach Overview

Our proposed Cooperative Multitask Learning-based Guided Multimodal Fusion Approach (MuMu) consists of three learning modules (Fig. 2):

- Unimodal Feature Encoder (UFE) encodes modality-specific spatial-temporal features.
- Auxiliary Task Learning (ATL) Module extracts activity-group-specific multimodal representations.
- Target Task Learning (TTL) Module utilizes the activity-group-specific features from the auxiliary task as prior information to appropriately fuse and extract multimodal representations for activity recognition.

Figure 2: MuMu: Cooperative Multitask Learning-based Guided Multimodal Fusion Approach. The Unimodal Feature Encoder encodes unimodal spatial-temporal features. The Auxiliary Task module fuses the unimodal features to extract the activity-group-specific features. The activity-group features guide the Target Task module to fuse and extract complementary multimodal representations by employing a Guided Multimodal Fusion Approach. We have designed a multitask learning loss for end-to-end training.
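To make the composition of these modules concrete, below is a minimal PyTorch-style sketch of the end-to-end forward pass described above. It is illustrative only: class and argument names are our own, the per-modality encoders and the two fusion modules are passed in as generic `nn.Module` objects, and the classifier heads are simplified to single linear layers rather than the exact architectures reported later.

```python
import torch
import torch.nn as nn

class MuMuSketch(nn.Module):
    """Structural sketch of MuMu: UFE -> ATL (auxiliary task) -> TTL (target task)."""

    def __init__(self, unimodal_encoders, sm_fusion, gm_fusion,
                 feat_dim, num_groups, num_activities):
        super().__init__()
        self.encoders = nn.ModuleList(unimodal_encoders)         # one UFE per modality
        self.sm_fusion = sm_fusion                               # ATL fusion (SM-Fusion)
        self.aux_head = nn.Linear(feat_dim, num_groups)          # F^aux: activity-group classifier
        self.gm_fusion = gm_fusion                               # TTL fusion (GM-Fusion)
        self.fuse_proj = nn.Linear(2 * feat_dim, feat_dim)       # W^f over [X^c ; X^aux]
        self.target_head = nn.Linear(feat_dim, num_activities)   # F^t: activity classifier

    def forward(self, modality_inputs):
        # UFE: per-modality spatial-temporal features, stacked as X^u with shape (B, M, D^u)
        x_u = torch.stack([enc(x) for enc, x in zip(self.encoders, modality_inputs)], dim=1)
        # ATL: self-fused auxiliary representation X^aux and activity-group logits y^aux
        x_aux = self.sm_fusion(x_u)
        y_aux = self.aux_head(x_aux)
        # TTL: X^aux guides the multimodal fusion, then the fused features classify activities
        x_c = self.gm_fusion(x_u, x_aux)
        x_f = self.fuse_proj(torch.cat([x_c, x_aux], dim=-1))
        y_t = self.target_head(x_f)
        return y_t, y_aux
```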
### UFE: Unimodal Feature Encoder

We have adopted the Unimodal Feature Encoder (UFE) architecture from the work by Islam and Iqbal (2020). In our implementation, UFE independently encodes data from each modality $m \in M$ in four steps. First, UFE segments the raw data and produces $X^r_m = (x^r_{m,1}, x^r_{m,2}, \ldots, x^r_{m,S_m}) \in \mathbb{R}^{B \times S_m \times D^r_m}$, where $B$ is the batch size, $S_m$ is the segment size, and $D^r_m$ is the raw feature dimension of modality $m$. Second, UFE encodes the spatial features of each segment of modality $m \in M$. Third, UFE utilizes an LSTM to encode unimodal spatial-temporal features. Fourth, a self-attention model is employed to extract salient unimodal features, $X^u = (x^u_1, x^u_2, \ldots, x^u_M) \in \mathbb{R}^{B \times M \times D^u}$, from the spatial-temporal features ($D^u$ is the unimodal feature embedding size). Instead of utilizing the resource-intensive multi-head self-attention model used by Islam and Iqbal (2020), in this work we have adopted a lightweight self-attention model from Long et al. (2018). MuMu uses the unimodal features, $X^u$, in the subsequent learning modules to produce multimodal representations.

### ATL: Auxiliary Task Learning Module

In the auxiliary task learning step, MuMu fuses the unimodal features to extract an activity-group-specific multimodal representation for classifying the activity-groups in two steps:

Self Multimodal Fusion Approach (SM-Fusion): MuMu uses SM-Fusion to extract activity-group-specific salient features. SM-Fusion assigns an attention weight ($\alpha_m$) to each modality for fusing the unimodal features, $X^u$, and extracting the multimodal auxiliary representation, $X^{aux}$. The attention weight, $\alpha_m$, is calculated in the following way:

$$\gamma_m = (W^{aux})^T X^u_m \quad (1)$$

$$\alpha_m = \frac{\exp(\gamma_m)}{\sum_{m' \in M} \exp(\gamma_{m'})} \quad (2)$$

Here, $W^{aux}$ is a learnable parameter. We have utilized a 1D-CNN with a filter size of 1 to calculate $\alpha_m$. Finally, this weight is used to fuse the unimodal features and extract the multimodal auxiliary representation, $X^{aux}$:

$$X^{aux} = \sum_{m \in M} \alpha_m X^u_m \quad (3)$$

Activity-Group Classification: The auxiliary representation, $X^{aux}$, is passed through an auxiliary task learning network, $F^{aux}$, to classify the activity-group:

$$y^{aux} = F^{aux}(X^{aux}) \quad (4)$$
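As a reference point, the following is a small PyTorch sketch of SM-Fusion as described by Eqs. (1)-(3): a 1D convolution with filter size 1 scores each modality, the scores are softmax-normalized across modalities, and the weighted sum gives $X^{aux}$. The tensor shapes and the convolution-based scoring follow the text above; the names and the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMFusion(nn.Module):
    """Self Multimodal Fusion (Eqs. 1-3): score each modality with a filter-size-1
    1D convolution, softmax the scores over modalities, and return the
    attention-weighted sum of the unimodal features."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # W^aux realized as a 1D-CNN with filter size 1 (one scalar score per modality)
        self.score = nn.Conv1d(feat_dim, 1, kernel_size=1)

    def forward(self, x_u: torch.Tensor) -> torch.Tensor:
        # x_u: unimodal features X^u with shape (B, M, D^u)
        gamma = self.score(x_u.transpose(1, 2)).squeeze(1)  # (B, M)   -- Eq. 1
        alpha = F.softmax(gamma, dim=-1)                    # (B, M)   -- Eq. 2
        x_aux = torch.einsum('bm,bmd->bd', alpha, x_u)      # (B, D^u) -- Eq. 3
        return x_aux

# usage: a batch of 8 samples, 4 modalities, 128-dimensional unimodal features
x_u = torch.randn(8, 4, 128)
x_aux = SMFusion(128)(x_u)  # auxiliary multimodal representation X^aux, shape (8, 128)
```

The activity-group classifier $F^{aux}$ of Eq. (4) is then simply a small feed-forward network applied to `x_aux`.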
### TTL: Target Task Learning Module

In MuMu, we have designed a target task to extract multimodal representations and classify activities in two steps. First, MuMu uses the activity-group features from the auxiliary task to direct our proposed Guided Multimodal Fusion Approach (GM-Fusion) in extracting multimodal representations, as the activity-group features can help to prioritize the salient modalities appropriately. Second, MuMu uses the fused representations to classify the activities. In MuMu, the auxiliary and the target tasks work cooperatively to extract complementary multimodal representations for recognizing activities accurately.

Guided Multimodal Fusion Approach (GM-Fusion): GM-Fusion uses the activity-group-specific features from the auxiliary task, $X^{aux}$, as prior information to extract multimodal representations. First, GM-Fusion projects the extracted unimodal features, $X^u$, to produce unimodal key ($K^u$) and value ($V^u$) feature vectors in the following way:

$$K^u = X^u W^K; \quad V^u = X^u W^V \quad (5)$$

Here, $W^K$ and $W^V$ are learnable parameters. These unimodal key and value vectors are used to extract the multimodal representation. Second, GM-Fusion projects the multimodal auxiliary representation, $X^{aux}$, to produce the auxiliary query feature vector ($Q^{aux}$):

$$Q^{aux} = X^{aux} W^Q \quad (6)$$

Here, $W^Q$ is a learnable parameter. This auxiliary query feature vector ($Q^{aux}$) is then used as a prior to extract the complementary multimodal representation, $X^c$, by attending over the unimodal key ($K^u$) and value ($V^u$) feature vectors:

$$\tilde{X}^c = \mathrm{softmax}\!\left(\frac{Q^{aux} (K^u)^T}{\sqrt{D^u}}\right) V^u \quad (7)$$

$$X^c = W^o \tilde{X}^c \quad (8)$$

Here, $W^o$ is a learnable projection parameter.

Activity Classification: The multimodal representation, $X^c$, is concatenated with the activity-group-specific features, $X^{aux}$, and projected to $X^f$, which is passed through a target task learning network, $F^t$, to classify the activities:

$$X^f = W^f [X^c; X^{aux}] \quad (9)$$

$$y^t = F^t(X^f) \quad (10)$$

Here, $W^f$ is a learnable projection parameter.

### Multitask Learning Loss

We have designed a multitask learning loss for end-to-end training of MuMu. This loss is used to train the auxiliary and the target tasks jointly. First, we use a cross-entropy auxiliary loss, $L^{aux}$, to train the auxiliary task for activity-group classification. $L^{aux}$ enforces the auxiliary task branch to learn the activity-group-specific multimodal representations:

$$L^{aux}(y^{aux}, \hat{y}^{aux}) = -\sum_{i=1}^{N} y^{aux}_i \log \hat{y}^{aux}_i \quad (11)$$

Second, we calculate the cross-entropy loss, $L^t$, to train the target task for activity classification. This loss ensures that the target task learns robust multimodal representations for activity recognition:

$$L^t(y^t, \hat{y}^t) = -\sum_{i=1}^{K} y^t_i \log \hat{y}^t_i \quad (12)$$

Finally, the auxiliary and target task losses are combined for end-to-end training of MuMu:

$$\text{loss} = L^t(y^t, \hat{y}^t) + \beta^{aux} L^{aux}(y^{aux}, \hat{y}^{aux}) \quad (13)$$

Here, $\beta^{aux}$ is the weight of the auxiliary task learning loss.
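For illustration, here is a compact PyTorch sketch of the guided fusion step and the joint objective described above. It assumes a single-head, scaled dot-product attention with bias-free linear projections, which is consistent with Eqs. (5)-(8) but not necessarily the exact configuration used in the paper; the loss helper mirrors Eq. (13) with one-hot labels replaced by class indices.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMFusion(nn.Module):
    """Guided Multimodal Fusion (Eqs. 5-8): the auxiliary representation X^aux is
    projected to a query that attends over per-modality key/value projections of
    the unimodal features X^u, producing the guided representation X^c."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.w_k = nn.Linear(feat_dim, feat_dim, bias=False)  # W^K
        self.w_v = nn.Linear(feat_dim, feat_dim, bias=False)  # W^V
        self.w_q = nn.Linear(feat_dim, feat_dim, bias=False)  # W^Q
        self.w_o = nn.Linear(feat_dim, feat_dim, bias=False)  # W^o

    def forward(self, x_u: torch.Tensor, x_aux: torch.Tensor) -> torch.Tensor:
        # x_u: (B, M, D^u) unimodal features; x_aux: (B, D^u) auxiliary representation
        k, v = self.w_k(x_u), self.w_v(x_u)                   # Eq. 5
        q = self.w_q(x_aux).unsqueeze(1)                      # (B, 1, D^u) -- Eq. 6
        attn = F.softmax(q @ k.transpose(1, 2) / math.sqrt(k.size(-1)), dim=-1)  # (B, 1, M) -- Eq. 7
        return self.w_o((attn @ v).squeeze(1))                # (B, D^u)    -- Eq. 8


def mumu_loss(y_t_logits, y_t, y_aux_logits, y_aux, beta_aux: float = 1.0):
    """Joint objective of Eq. 13: target cross-entropy plus a beta^aux-weighted auxiliary cross-entropy."""
    return F.cross_entropy(y_t_logits, y_t) + beta_aux * F.cross_entropy(y_aux_logits, y_aux)
```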
## Experimental Setup

### Datasets

We evaluated the performance of our proposed approach, MuMu, on three multimodal activity datasets: UCSD-MIT (Kubota et al. 2019), UTD-MHAD (Chen, Jafari, and Kehtarnavaz 2015), and MMAct (Kong et al. 2019). The MMAct dataset contains 37 activities categorized into 3 groups: 16 complex (e.g., carrying), 12 simple (e.g., kicking), and 9 desk (e.g., using a PC). The UCSD-MIT dataset contains nine automotive and block assembly activities from 2 groups: 4 gross-motion (e.g., attaching a part) and 5 fine-motion (e.g., palmar grab). UTD-MHAD contains 27 activities categorized into 4 groups: 9 hand gestures (e.g., draw circle), 9 sports (e.g., bowling), 5 daily (e.g., door knock), and 4 training exercises (e.g., squat). Please check the supplementary materials for more details.

| Method | F1-Score (%) |
|---|---|
| SMD (Hinton, Vinyals, and Dean 2015) | 63.89 |
| Student (Kong et al. 2019) | 64.44 |
| Multi-Teachers (Kong et al. 2019) | 62.67 |
| MMD (Kong et al. 2019) | 64.33 |
| MMAD (Kong et al. 2019) | 66.45 |
| HAMLET (Islam and Iqbal 2020) | 69.35 |
| Keyless (Long et al. 2018) | 71.83 |
| MuMu (our method) | 76.28 |

Table 1: Cross-subject performance comparison (F1-score) of multimodal learning methods on the MMAct dataset.

| Method | F1-Score (%) |
|---|---|
| SVM+HOG (Ofli et al. 2013) | 46.52 |
| TSN (RGB) (Wang et al. 2016) | 69.20 |
| TSN (Optical-Flow) (Wang et al. 2016) | 72.57 |
| MMAD (Kong et al. 2019) | 74.58 |
| TSN (Fusion) (Wang et al. 2016) | 77.09 |
| MMAD (Fusion) (Kong et al. 2019) | 78.82 |
| Keyless (Long et al. 2018) | 81.11 |
| HAMLET (Islam and Iqbal 2020) | 83.89 |
| MuMu (our method) | 87.50 |

Table 2: Cross-session performance comparison (F1-score) of multimodal learning methods on the MMAct dataset.

### Learning Architecture Implementation

We segmented the data from the visual modalities (RGB and depth) with a window size of 1 and a stride of 3. For the data from the other sensor modalities, we used a window size of 5 and a stride of 5. To encode segmented spatial features, we used a ResNet-50 model (He et al. 2016) for data from the visual modalities (RGB and depth) and the co-occurrence approach (Li et al. 2018) for data from the other sensor modalities (sEMG, Acceleration, Gyroscope, and Orientation). The unimodal features of each modality are encoded into a 128-dimensional feature embedding. We used two fully connected layers, with ReLU activation after the first layer, for activity-group classification in auxiliary task learning. We used a similar task learning architecture for activity classification in target task learning. For more implementation and training procedure details, please check the supplementary materials.
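For clarity, the sliding-window segmentation step described above can be expressed as follows. Only the window and stride values come from the paper; the function, array shapes, and feature dimensions are illustrative assumptions.

```python
import numpy as np

def segment(stream: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Slice a (T, D) modality stream into (S, window, D) segments with the given stride."""
    starts = range(0, stream.shape[0] - window + 1, stride)
    return np.stack([stream[s:s + window] for s in starts])

# illustrative shapes: 90 RGB frames of flattened spatial features, and
# 300 timesteps of 3-axis accelerometer data
rgb_segments = segment(np.random.randn(90, 2048), window=1, stride=3)  # (30, 1, 2048)
acc_segments = segment(np.random.randn(300, 3), window=5, stride=5)    # (60, 5, 3)
```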
## Results and Discussion

### Comparison with Multimodal Approaches

Results: We evaluated MuMu's performance by comparing it against state-of-the-art HAR approaches on three datasets: MMAct, UTD-MHAD, and UCSD-MIT. For the MMAct dataset, we followed the originally proposed cross-subject and cross-session evaluation settings and report F1-scores (Tables 1 & 2). The results suggest that MuMu outperforms the state-of-the-art approaches in both the cross-subject and cross-session evaluation settings, with improvements of 4.45% and 3.61% in F1-score, respectively. For the UTD-MHAD and UCSD-MIT datasets, we followed leave-one-subject-out cross-validation and report top-1 accuracies (Tables 4 & 3). The results suggest that MuMu outperforms the best-performing baselines with improvements of 6.86% and 2.48% in top-1 accuracy on the UCSD-MIT and UTD-MHAD datasets, respectively.

| Learning Methods | Merge Types | F1-Score (%) |
|---|---|---|
| Non-Attention | SUM | 52.35 |
| Non-Attention | CONCAT | 50.92 |
| HAMLET (Islam and Iqbal 2020) | SUM | 50.04 |
| HAMLET (Islam and Iqbal 2020) | CONCAT | 48.26 |
| Keyless (Long et al. 2018) | SUM | 51.68 |
| Keyless (Long et al. 2018) | CONCAT | 54.48 |
| MuMu (our method) | - | 61.34 |

Table 3: Performance comparison (F1-score) of multimodal learning methods on the UCSD-MIT dataset.

| Method | Accuracy (%) |
|---|---|
| MHAD (Chen, Jafari, and Kehtarnavaz 2015) | 79.10 |
| SOS (Hou et al. 2016) | 86.97 |
| S2DDI (Wang et al. 2017) | 89.04 |
| DCNN (Imran and Kumar 2016) | 91.20 |
| Keyless (Long et al. 2018) | 92.67 |
| MCRL (Liu, Kong, and Jiang 2019) | 93.02 |
| PoseMap (Liu and Yuan 2018) | 94.51 |
| HAMLET (Islam and Iqbal 2020) | 95.12 |
| MuMu (our method) | 97.60 |

Table 4: Performance comparison (top-1 accuracy) of multimodal learning methods on the UTD-MHAD dataset.

Discussion: The experimental results (Tables 1, 2, 3 & 4) suggest that MuMu outperforms all the state-of-the-art approaches in all evaluation conditions. Moreover, the results indicate that attention-based HAR methods (i.e., MuMu, Keyless (Long et al. 2018), and HAMLET (Islam and Iqbal 2020)) outperform non-attention-based methods (i.e., PoseMap (Liu and Yuan 2018) and TSN (Wang et al. 2016)). Unlike MuMu, the other attention-based methods do not consider the activity-group information to extract multimodal representations. In our implementation, MuMu utilizes the activity-group information to extract complementary representations using our Guided Multimodal Fusion approach (GM-Fusion). GM-Fusion allows the prioritization of different modalities based on the activity-group information extracted by the auxiliary task learning module. Thus, the experimental results posit that incorporating activity-group information allows the effective extraction of complementary representations to improve HAR accuracy.

Although the state-of-the-art multimodal HAR approaches show comparatively better performance in cross-session evaluation settings (Tables 2 & 4), the performance degrades in the more challenging cross-subject evaluation conditions for all evaluated baselines (Tables 1 & 3). The performance degrades because the MMAct and UCSD-MIT datasets contain data samples that require the utilization of wearable sensors to recognize activities accurately, and the wearable sensor data vary considerably across subjects (see Fig. 1). To address this challenge, MuMu utilizes activity-group features to guide GM-Fusion to extract salient multimodal representations for recognizing activities accurately. On the other hand, the state-of-the-art approaches fuse unimodal features without considering activity-group information. Additionally, in the cross-subject evaluation conditions, MuMu outperforms the F1-score of the state-of-the-art approaches on the MMAct and UCSD-MIT datasets with improvements of 4.45% and 6.86%, respectively. These performance improvements indicate that MuMu can generate more robust multimodal representations than the other approaches by prioritizing the salient modalities.

### Impact of Supplementary Modalities

To investigate whether additional modalities help to improve the performance of learning models, we evaluated the performance of MuMu and two baseline approaches (Keyless (Long et al. 2018) and HAMLET (Islam and Iqbal 2020)) with various combinations of modalities. We conducted this study on the UTD-MHAD dataset with the RGB, Depth, Skeleton, and Physical sensor modalities. The experimental results suggest that MuMu outperformed the evaluated baselines on all the combinations of modalities tested (see Table 5).

| Learning Methods | R+S | R+S+P | R+D+S+P |
|---|---|---|---|
| Keyless | 90.20 | 92.67 | 83.87 |
| HAMLET | 95.12 | 91.16 | 90.09 |
| MuMu | 96.10 | 97.44 | 97.60 |

Table 5: Performance comparison (accuracy %) of the impact of modality changes on the UTD-MHAD dataset. R: RGB, D: Depth, S: Skeleton, P: Physical sensors.

Results & Discussion: The results in Table 5 suggest that incorporating additional modalities helps MuMu to improve HAR accuracy. However, additional modalities do not always improve the performance of the two baselines. For example, incorporating the depth modality degrades the accuracy of the baseline methods, whereas the HAR accuracy of MuMu improves slightly with this additional modality. The performance of the baselines degrades because additional modalities may not provide salient information to recognize activities accurately. For example, the visual modality may not provide salient information for gesture recognition (e.g., wave, swipe), whereas physical sensors can help recognize those activities accurately. The baselines either concatenate or use self-attention to fuse unimodal features without considering the characteristics of the activity-group, which results in performance degradation with supplementary modalities. However, MuMu uses activity-group information to guide the target task in prioritizing and fusing the additional modalities to extract complementary multimodal representations for recognizing activities accurately. Therefore, it is essential to prioritize the salient modalities to extract robust representations for accurate activity recognition.
### Impact of Noisy Modalities

We conducted both quantitative and qualitative experiments to evaluate the performance of MuMu and three baselines (Non-Attention, HAMLET, and Keyless) in the presence of noisy and misaligned sensor data. We developed the Non-Attention method for evaluation purposes: it extracts unimodal features using a CNN+LSTM model without an attention mechanism, and the extracted unimodal features are concatenated to classify activities. We conducted this study in the cross-subject evaluation setting on the MMAct dataset with visual modalities (View 1 & 2) and non-visual modalities (Gyroscope, Orientation & Acceleration). We randomly dropped raw features from either the visual or the non-visual modalities with 50% probability to introduce noise. The quantitative and qualitative experimental results are presented in Table 6 and Fig. 1, respectively.

| Learning Methods | No Noisy Modality | Noisy Visual | Noisy Non-Visual |
|---|---|---|---|
| Non-Attention | 68.29 | 66.30 | 66.02 |
| HAMLET | 69.35 | 64.10 | 67.57 |
| Keyless | 71.83 | 67.94 | 68.29 |
| MuMu | 76.28 | 74.22 | 73.78 |

Table 6: Performance comparison (F1-score %) of the impact of noisy data on the MMAct dataset. Visual: RGB (View 1 & 2); Non-visual: Gyroscope, Orientation & Acceleration.
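One way to reproduce this noise-injection protocol is sketched below: each raw-feature timestep of the selected modality group is dropped (zeroed) with 50% probability. The exact corruption scheme and the modality key names are our assumptions based on the description above.

```python
import torch

def drop_raw_features(batch: dict, modality_keys, p: float = 0.5) -> dict:
    """Randomly drop (zero out) raw-feature timesteps of the listed modalities with probability p."""
    for key in modality_keys:
        x = batch[key]                                        # (B, T, D) raw features
        keep = (torch.rand(x.shape[0], x.shape[1], 1) > p).to(x.dtype)
        batch[key] = x * keep                                 # dropped timesteps become zeros
    return batch

# usage: corrupt the non-visual modalities (hypothetical keys) for the "Noisy Non-Visual" condition
batch = {'gyro': torch.randn(8, 60, 3), 'orientation': torch.randn(8, 60, 3),
         'acc': torch.randn(8, 60, 3), 'view1': torch.randn(8, 30, 2048)}
noisy_batch = drop_raw_features(batch, ['gyro', 'orientation', 'acc'], p=0.5)
```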
Results & Discussion: The experimental results suggest that MuMu outperforms the evaluated baselines in the presence of noisy data (Table 6). In MuMu, our proposed Guided Multimodal Fusion Approach (GM-Fusion) appropriately prioritizes the modalities and extracts robust multimodal representations from noisy sensor data for accurate activity recognition. The baseline multimodal learning approaches, in contrast, use either non-attention or self-attention based multimodal fusion, which may not effectively extract complementary multimodal representations. Additionally, the qualitative multimodal attention visualization (Fig. 1, bottom row) indicates the same phenomenon: MuMu can prioritize the salient modalities to extract complementary representations from noisy and misaligned sensor data. For example, although the gyroscope and acceleration data provide distinctive features for the carry-heavy activity, MuMu adjusts the multimodal attention weights when we introduce noise in those modalities (Fig. 1, bottom row), paying more attention to the non-noisy modality (Orientation) and less attention to the noisy modalities (Gyroscope and Acceleration), which contributes to better HAR performance on noisy data (Table 6). In Fig. 1 (center row), it can be observed that HAMLET, which uses a self-attention based fusion approach, increases the attention weight on the noisy sensor data (i.e., Acceleration in Fig. 1-c) compared to the attention weight assigned on the non-noisy data samples (Fig. 1-a & b). These qualitative results indicate that self-attention based fusion may not appropriately down-weight noisy sensor data when extracting multimodal representations (Fig. 1, center row), which is also reflected in the quantitative results in Table 6.

### Ablation Study and Significance Analysis

To investigate the importance of the various modules of MuMu, we developed three single-task baseline models by removing the auxiliary task learning branch in MuMu (Fig. 2). The Non-Attention model (B1) does not employ any attention approach in extracting unimodal or fusing multimodal features. The Unimodal Attention model (B2) employs an attention approach to extract unimodal features and concatenates the multimodal features (similar to Keyless (Long et al. 2018)). The Unimodal + Multimodal Attention model (B3) uses an attention approach to extract unimodal and fuse multimodal features (similar to HAMLET (Islam and Iqbal 2020)). We trained and tested these models five times with different initializations of the learning parameters. Finally, we conducted the significance analysis at level α = 0.05 by following the procedure proposed by Dror, Shlomov, and Reichart (2019). We conducted this experimental analysis on the MMAct dataset in the cross-subject evaluation setting.

| Model Type | Learning Models | Average F1-Score | Standard Deviation | Significant Over |
|---|---|---|---|---|
| Single Task | B1 | 68.48% | 1.26 | None |
| Single Task | B2 | 70.52% | 0.98 | B1 & B3 |
| Single Task | B3 | 69.19% | 0.72 | B1 |
| Multitask | MuMu | 75.97% | 0.29 | B1, B2 & B3 |

Table 7: Ablation study of MuMu components on the MMAct dataset. B1: Non-Attention, B2: Unimodal Attention, B3: Unimodal + Multimodal Attention (self-attention based multimodal fusion); MuMu uses guided multimodal fusion. Significance analysis at α = 0.05 (following Dror et al. (2019)).

Results and Discussion: The experimental results in Table 7 suggest that the baseline B3, which uses an attention approach to prioritize the modalities, fails to significantly outperform B2, which uses the attention approach only to extract unimodal features. These results indicate that how a multimodal learning approach fuses the information is crucial to improving HAR performance. Moreover, the experimental results in Table 7 indicate that MuMu significantly outperforms all the baseline models and improves HAR accuracy. The primary difference between MuMu and the baseline models is that MuMu uses activity-group features to guide the target task in extracting multimodal representations. Thus, this experimental analysis indicates that MuMu, with the help of our guided multimodal fusion approach, can appropriately fuse multimodal features to improve HAR accuracy significantly.

### Qualitative Analysis

We conducted two qualitative analyses to evaluate the effectiveness of our guided multimodal fusion approach. First, we visualized the attention weights to evaluate whether MuMu can prioritize the salient modalities (Fig. 1). Second, we visualized t-SNE embeddings of unimodal and multimodal representations obtained using MuMu (Fig. 3, right) and HAMLET with self-attention based fusion (Islam and Iqbal 2020) (Fig. 3, left). We conducted these studies on the MMAct dataset in the cross-subject evaluation setting.

Figure 3: The t-SNE visualization of unimodal and multimodal representations. (Left) HAMLET (self-attention based fusion), (Right) MuMu (guided multimodal fusion).

Attention Visualization: Our experimental analysis (Fig. 1) suggests that appropriately prioritizing the relevant modalities aids HAR performance. The results in Fig. 1-a & b indicate that MuMu can appropriately prioritize the salient modalities (Gyroscope and Acceleration) in extracting complementary representations to distinguish visually similar activities (i.e., carry-light and carry-heavy). Additionally, when the data from these modalities are noisy, MuMu shifts the attention weights to the non-noisy modalities (i.e., visual and orientation) to extract robust representations (Fig. 1). These results indicate that MuMu can adjust attention weights based on the extracted unimodal features to produce complementary representations. On the other hand, the self-attention based fusion approach cannot appropriately prioritize the relevant modalities (Fig. 1), which results in performance degradation (Table 7).

Feature Visualization (t-SNE): In Fig. 3, one can observe that the features obtained from HAMLET are sparsely distributed with fractured clusters, whereas the features obtained from MuMu are more compact and smoothly distributed. Specifically, for the visual modalities, MuMu produces clustered representations, whereas HAMLET produces sparsely distributed representations. This visualization indicates that MuMu can extract non-overlapping, distinctive representations, resulting in improved HAR performance.
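The t-SNE comparison above can be reproduced with standard tooling; the snippet below is a generic sketch using scikit-learn defaults and placeholder data, not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project (N, D) learned representations to 2-D with t-SNE and color points by activity label."""
    emb = TSNE(n_components=2, init='pca', random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap='tab20')
    plt.title(title)
    plt.show()

# placeholder data standing in for fused multimodal representations of MMAct samples
plot_tsne(np.random.randn(500, 128), np.random.randint(0, 37, size=500), 'Fused representations')
```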
## Conclusion

In this work, we have proposed a cooperative multitask learning-based guided multimodal fusion approach, MuMu. MuMu first extracts activity-group features for activity-group recognition (auxiliary task). MuMu then utilizes the activity-group features in the Guided Multimodal Fusion (GM-Fusion) module to extract complementary multimodal representations for HAR (target task). Our extensive experimental results suggest that MuMu outperforms state-of-the-art approaches on three multimodal activity recognition datasets in all evaluation conditions. Additionally, its robust performance on noisy data indicates the applicability of MuMu in real-world settings. Future work will focus on evaluating the performance of MuMu on other multimodal learning tasks, such as human motion prediction, visual-language navigation, and action or video retrieval.

## References

Achille, A.; Lam, M.; Tewari, R.; Ravichandran, A.; Maji, S.; Fowlkes, C. C.; Soatto, S.; and Perona, P. 2019. Task2Vec: Task Embedding for Meta-Learning. In ICCV.

Arzani, M. M.; Fathy, M.; Aghajan, H.; Azirani, A. A.; Raahemifar, K.; and Adeli, E. 2017. Structured prediction with short/long-range dependencies for human activity recognition from depth skeleton data. In IROS.

Awad, G.; Butt, A.; Curtis, K.; Lee, Y.; Fiscus, J.; Godil, A.; Joy, D.; Delgado, A.; Smeaton, A.; Graham, Y.; et al. 2018. TRECVID 2018: Benchmarking video activity detection, video captioning and matching, video storytelling linking and video search. In Proceedings of TRECVID 2018.

Batzianoulis, I.; El-Khoury, S.; Pirondini, E.; Coscia, M.; Micera, S.; and Billard, A. 2017. EMG-based decoding of grasp gestures in reaching-to-grasping motions. RAS.

Chen, C.; Jafari, R.; and Kehtarnavaz, N. 2015. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In 2015 IEEE ICIP, 168-172.

Crawshaw, M. 2020. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv preprint arXiv:2009.09796.

Dror, R.; Shlomov, S.; and Reichart, R. 2019. Deep dominance - how to properly compare deep neural models. In ACL, 2773-2785.

Fan, L.; Huang, W.; Gan, C.; Ermon, S.; Gong, B.; and Huang, J. 2018. End-to-end learning of motion representation for video understanding. In CVPR, 6016-6025.

Feichtenhofer, C.; Fan, H.; Malik, J.; and He, K. 2019. SlowFast Networks for Video Recognition. In CVPR.

Feichtenhofer, C.; Pinz, A.; and Wildes, R. P. 2016. Spatiotemporal Residual Networks for Video Action Recognition. In NeurIPS.

Feichtenhofer, C.; Pinz, A.; and Wildes, R. P. 2017. Spatiotemporal multiplier networks for video action recognition. In CVPR, 4768-4777.

Frank, A. E.; Kubota, A.; and Riek, L. D. 2019. Wearable activity recognition for robust human-robot teaming in safety-critical environments via hybrid neural networks. In IROS, 449-454. IEEE.

Gagné, C. 2019. A Principled Approach for Learning Task Similarity in Multitask Learning. In IJCAI.

Green, H. N.; Islam, M. M.; Ali, S.; and Iqbal, T. 2022a. iSpy a Humorous Robot: Evaluating the Perceptions of Humor Types in a Robot Partner. In AAAI Spring Symposium on Putting AI in the Critical Loop: Assured Trust and Autonomy in Human-Machine Teams.

Green, H. N.; Islam, M. M.; Ali, S.; and Iqbal, T. 2022b. Who's Laughing NAO? Examining Perceptions of Failure in a Humorous Robot Partner. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), 313-322.
Guo, M.; Haque, A.; Huang, D.-A.; Yeung, S.; and Fei-Fei, L. 2018. Dynamic task prioritization for multitask learning. In ECCV, 270-287.

Guo, P.; Lee, C.-Y.; and Ulbricht, D. 2020. Learning to branch for multi-task learning. In International Conference on Machine Learning, 3854-3863. PMLR.

Hasan, M. K.; Rahman, W.; Bagher Zadeh, A.; Zhong, J.; Tanveer, M. I.; Morency, L.-P.; and Hoque, M. E. 2019. UR-FUNNY: A Multimodal Language Dataset for Understanding Humor. In EMNLP-IJCNLP, 2046-2056.

Hashimoto, K.; Xiong, C.; Tsuruoka, Y.; and Socher, R. 2016. A joint many-task model: Growing a neural network for multiple NLP tasks. arXiv preprint arXiv:1611.01587.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. NeurIPS.

Hou, Y.; Li, Z.; Wang, P.; and Li, W. 2016. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology.

Imran, J.; and Kumar, P. 2016. Human action recognition using RGB-D sensor and deep convolutional neural networks. In ICACCI.

Iqbal, T.; Li, S.; Fourie, C.; Hayes, B.; and Shah, J. A. 2019. Fast Online Segmentation of Activities from Partial Trajectories. In ICRA.

Iqbal, T.; Rack, S.; and Riek, L. D. 2016. Movement Coordination in Human-Robot Teams: A Dynamical Systems Approach. IEEE Transactions on Robotics, 32(4): 909-919.

Iqbal, T.; and Riek, L. D. 2017. Human Robot Teaming: Approaches from Joint Action and Dynamical Systems. Humanoid Robotics.

Iqbal, T.; and Riek, L. D. 2021. Temporal Anticipation and Adaptation Methods for Fluent Human-Robot Teaming. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 3736-3743.

Islam, M. M.; and Iqbal, T. 2020. HAMLET: A Hierarchical Multimodal Attention-based Human Activity Recognition Algorithm. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 10285-10292.

Islam, M. M.; and Iqbal, T. 2021. Multi-GAT: A Graphical Attention-based Hierarchical Multimodal Representation Learning Approach for Human Activity Recognition. In IEEE Robotics and Automation Letters (RA-L).

Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.

Joze, H. R. V.; Shaban, A.; Iuzzolino, M. L.; and Koishida, K. 2020. MMTM: Multimodal Transfer Module for CNN Fusion. In CVPR.

Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; and Boussaid, F. 2017. A new representation of skeleton sequences for 3D action recognition. In CVPR, 3288-3297.

Kong, Q.; Wu, Z.; Deng, Z.; Klinkigt, M.; Tong, B.; and Murakami, T. 2019. MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding. In ICCV, 8658-8667.

Kubota, A.; Iqbal, T.; Shah, J. A.; and Riek, L. D. 2019. Activity recognition in manufacturing: The roles of motion capture and sEMG+inertial wearables in detecting fine vs. gross motion. In ICRA.

Li, C.; Zhong, Q.; Xie, D.; and Pu, S. 2018. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. IJCAI.

Li, L. H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; and Chang, K.-W. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. In NeurIPS.

Liu, G.; Qian, J.; Wen, F.; Zhu, X.; Ying, R.; and Liu, P. 2019. Action Recognition Based on 3D Skeleton and RGB Frame Fusion. In IROS, 258-264.

Liu, M.; and Yuan, J. 2018. Recognizing human actions as the evolution of pose estimation maps. In CVPR.

Liu, S.; Johns, E.; and Davison, A. J. 2019. End-To-End Multi-Task Learning With Attention. In CVPR, 1871-1880.

Liu, T.; Kong, J.; and Jiang, M. 2019. RGB-D Action Recognition Using Multimodal Correlative Representation Learning Model. IEEE Sensors Journal, 19(5): 1862-1872.

Long, X.; Gan, C.; De Melo, G.; Liu, X.; Li, Y.; Li, F.; and Wen, S. 2018. Multimodal keyless attention fusion for video classification. In AAAI.

Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS.

Münzner, S.; Schmidt, P.; Reiss, A.; Hanselmann, M.; Stiefelhagen, R.; and Dürichen, R. 2017. CNN-Based Sensor Fusion Techniques for Multimodal Human Activity Recognition. In ACM ISWC, 158-165.

Ofli, F.; Chaudhry, R.; Kurillo, G.; Vidal, R.; and Bajcsy, R. 2013. Berkeley MHAD: A comprehensive multimodal human action database. In WACV, 53-60. IEEE.

Pakdamanian, E.; Sheng, S.; Baee, S.; Heo, S.; Kraus, S.; and Feng, L. 2020. DeepTake: Prediction of Driver Takeover Behavior using Multimodal Data. In CHI.

Perez-Rua, J.-M.; Vielzeuf, V.; Pateux, S.; Baccouche, M.; and Jurie, F. 2019. MFAS: Multimodal Fusion Architecture Search. In CVPR.

Roitberg, A.; Somani, N.; Perzylo, A.; Rickert, M.; and Knoll, A. 2015. Multimodal Human Activity Recognition for Industrial Manufacturing Processes in Robotic Workcells. In ICMI.

Ruder, S. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Ryoo, M. S.; Rothrock, B.; Fleming, C.; and Yang, H. J. 2017. Privacy-Preserving Human Activity Recognition from Extreme Low Resolution. In AAAI.

Sabokrou, M.; Pourreza, M.; Fayyaz, M.; Entezari, R.; Fathy, M.; Gall, J.; and Adeli, E. 2019. AVID: Adversarial Visual Irregularity Detection. In ACCV, 488-505.

Samyoun, S.; Islam, M. M.; Iqbal, I.; and Stankovic, J. 2022. M3Sense: Affect-Agnostic Multitask Representation Learning using Multimodal Wearable Sensors. In ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT).

Simonyan, K.; and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 568-576.

Søgaard, A.; and Goldberg, Y. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In ACL.

Standley, T.; Zamir, A.; Chen, D.; Guibas, L.; Malik, J.; and Savarese, S. 2020. Which tasks should be learned together in multi-task learning? In ICML, 9120-9132. PMLR.

Sun, X.; Panda, R.; Feris, R.; and Saenko, K. 2020. AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning. In NeurIPS, volume 33, 8728-8740.

Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Dai, D.; and Van Gool, L. 2020. Revisiting multi-task learning in the deep learning era. arXiv preprint arXiv:2004.13379.

Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; and Van Gool, L. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV.

Wang, P.; Wang, S.; Gao, Z.; Hou, Y.; and Li, W. 2017. Structured images for RGB-D action recognition. In CVPRW, 1005-1014.

Xiao, F.; Lee, Y. J.; Grauman, K.; Malik, J.; and Feichtenhofer, C. 2020. Audiovisual SlowFast Networks for Video Recognition. arXiv preprint arXiv:2001.08740.

Xu, P.; Madotto, A.; Wu, C.-S.; Park, J. H.; and Fung, P. 2018. Emo2Vec: Learning Generalized Emotion Representation by Multi-task Training. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. ACL.

Yan, S.; Xiong, Y.; and Lin, D. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI.

Yasar, M. S.; and Iqbal, T. 2021. A Scalable Approach to Predict Multi-Agent Motion for Human-Robot Collaboration. In IEEE Robotics and Automation Letters (RA-L).

Yasar, M. S.; and Iqbal, T. 2022. Robots That Can Anticipate and Learn in Human-Robot Teams. In HRI.

Zamir, A. R.; Sax, A.; Shen, W.; Guibas, L. J.; Malik, J.; and Savarese, S. 2018. Taskonomy: Disentangling Task Transfer Learning. In CVPR.

Zhang, H.; and Parker, L. E. 2011. 4-dimensional local spatio-temporal features for human activity recognition. In IROS.

Zhang, S.; Yang, Y.; Xiao, J.; Liu, X.; Yang, Y.; Xie, D.; and Zhuang, Y. 2018. Fusing geometric features for skeleton-based action recognition using multilayer LSTM networks. IEEE Transactions on Multimedia, 20(9): 2330-2343.

Zhang, Y.; and Yang, Q. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.

Zhou, F.; Shui, C.; Abbasi, M.; Robitaille, L.-É.; Wang, B.; and Gagné, C. 2020a. Task Similarity Estimation Through Adversarial Multitask Neural Network. IEEE Transactions on Neural Networks and Learning Systems.

Zhou, L.; Cui, Z.; Xu, C.; Zhang, Z.; Wang, C.; Zhang, T.; and Yang, J. 2020b. Pattern-Structure Diffusion for Multi-Task Learning. In CVPR.