HOI Analysis: Integrating and Decomposing Human-Object Interaction

Yong-Lu Li, Xinpeng Liu, Xiaoqian Wu, Yizhuo Li, Cewu Lu
Shanghai Jiao Tong University
yonglu_li@sjtu.edu.cn, xinpengliu0907@gmail.com, enlighten@sjtu.edu.cn, liyizhuo@sjtu.edu.cn, lucewu@sjtu.edu.cn

Abstract

Human-Object Interaction (HOI) consists of human, object and implicit interaction/verb. Different from previous methods that directly map pixels to HOI semantics, we propose a novel perspective for HOI learning in an analytical manner. In analogy to Harmonic Analysis, whose goal is to study how to represent signals with the superposition of basic waves, we propose HOI Analysis. We argue that coherent HOI can be decomposed into isolated human and object. Meanwhile, isolated human and object can also be integrated into coherent HOI again. Moreover, transformations between human-object pairs with the same HOI can also be approached more easily with integration and decomposition. As a result, the implicit verb is represented in the transformation function space. In light of this, we propose an Integration-Decomposition Network (IDN) to implement the above transformations and achieve state-of-the-art performance on widely-used HOI detection benchmarks. Code is available at https://github.com/DirtyHarryLYL/HAKE-Action-Torch/tree/IDN-(Integrating-Decomposing-Network).

1 Introduction

Human-Object Interaction (HOI) takes up most of human activities. As a composition, HOI consists of three parts: ⟨human, verb, object⟩. To detect HOI, machines need to simultaneously locate human and object and classify the verb [4]. Beyond the direct approach that maps pixels to semantics, in this work we rethink HOI and explore two questions from a novel perspective (Fig. 1): First, as for the inner structure of HOI, how do isolated human and object compose HOI? Second, what is the relationship between two human-object pairs with the same HOI?

For the first question, we may find some clues from psychology. The view of Gestalt psychology is usually summarized in one simple sentence: the whole is more than the sum of its parts [14]. This is also in line with human perception. Baldassano et al. [1] studied the mechanism of how the brain builds HOI representation and concluded that the encoding of HOI is not the simple sum of human and object: a higher-level neural representation exists. Specific brain regions, e.g., the posterior superior temporal sulcus (pSTS), are responsible for integrating isolated human and object into coherent HOI [1]. Hence, to encode HOI, we may need a complex nonlinear transformation (integration) to combine isolated human and object. We argue that a reverse process is equally essential to decompose HOI into isolated human and object (decomposition). Here we use T_I(·) and T_D(·) to indicate the integration and decomposition functions. According to [1], isolated human and object are different from a coherent HOI pair. Therefore, T_I(·) should be able to add the interactive relationship to isolated elements. On the contrary, T_D(·) should eliminate this interactive information.

The first two authors contribute equally. Cewu Lu is the corresponding author, member of Qing Yuan Research Institute and MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China, and Shanghai Qi Zhi Institute.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
Figure 1: Two questions about HOI: (a) HOI inner structure; (b) inter-pair transformation. First, we want to explore its inner structure. Second, the relationship between two human-object pairs with the same HOI is studied.

Through the semantic change before and after the transformations, we can reveal the eigenstructure of HOI that carries the semantics. Considering that the verb is hard to represent explicitly in image space, our transformations are conducted in latent space.

For the second question, directly transforming one human-object pair to another (inter-pair transformation) is difficult. We need to consider not only isolated element differences but also interaction pattern changes. However, with T_I(·) and T_D(·), things are different. We can first decompose HOI pair-i into isolated person-i and object-i and eliminate the interaction semantics. Next, we transform human-i (object-i) to human-j (object-j) (j ≠ i). The last step is to integrate human-j and object-j into pair-j and add the interaction simultaneously.

Interestingly, we find that the above process resembles Harmonic Analysis: to process a signal, we usually use the Fourier Transform (FT) to decompose it into the superposition of basic exponential functions; then we can modulate the exponential functions via very simple transformations such as scalar multiplication; finally, the inverse FT integrates the modulated elements and maps them back to the input space. This elegant property brings a lot of convenience to signal processing. Therefore, we mimic this insight and design our methodology, i.e., HOI Analysis.

To implement HOI Analysis, we propose an Integration-Decomposition Network (IDN). In detail, after extracting features from the human/object boxes and the human-object tight union box, we perform the integration T_I(·) to integrate the isolated human and object into the union in latent space. Decomposition T_D(·) is then performed to decompose the union into isolated human and object instances again. Through these transformations, IDN can learn to represent the interaction/verb with T_I(·) and T_D(·). That is, we first embed verbs in the transformation function space, then learn to add and eliminate interaction semantics and classify interactions during the transformations.

For the inter-pair transformation, we adopt a simple instance-exchange policy. For each human/object, we find its similar instances beforehand as candidates and randomly exchange the original instance with candidates during training. This policy avoids complex transformations such as motion transfer [3]. Hence, we can focus on the learning of T_I(·) and T_D(·). Moreover, the lack of samples for rare HOIs can also be alleviated.

To train IDN, we adopt objectives derived from transformation principles, such as integration validity, decomposition validity and interactiveness validity (detailed in Sec. 3.4). With them, IDN can effectively model the interaction/verb in the transformation function space. Subsequently, IDN can be applied to the HOI detection task by comparing the above validities, and it greatly advances this task.

Our contributions are threefold: (1) Inspired by Harmonic Analysis, we devise HOI Analysis to model the HOI inner structure. (2) A concise Integration-Decomposition Network (IDN) is proposed to conduct the transformations in HOI Analysis. (3) By learning verb representations in the transformation function space, IDN achieves state-of-the-art performance on HOI detection.
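To make the analogy concrete, the decompose / exchange / integrate pipeline described above can be written as a toy composition of functions. This is only a conceptual sketch in Python: the single-layer modules, feature sizes and names are placeholders, not the actual IDN implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the learned transformations in HOI Analysis. Sizes and the
# single-layer form are placeholders, not the paper's architecture.
T_I = nn.Linear(2048, 1024)   # integration:  (f_h ⊕ f_o) -> f_u
T_D = nn.Linear(1024, 2048)   # decomposition: f_u -> (f_h ⊕ f_o)

def inter_pair_transform(f_u_i, f_h_j, f_o_j):
    """Pair-i -> pair-j with the same HOI, via decompose / exchange / integrate."""
    f_h_i, f_o_i = T_D(f_u_i).chunk(2, dim=-1)   # decompose: interaction removed
    # g_h, g_o: here simply replace the isolated instances with those of pair-j
    f_h, f_o = f_h_j, f_o_j
    return T_I(torch.cat([f_h, f_o], dim=-1))    # integrate: interaction added back

f_u_i = torch.randn(1, 1024)
f_h_j, f_o_j = torch.randn(1, 1024), torch.randn(1, 1024)
print(inter_pair_transform(f_u_i, f_h_j, f_o_j).shape)  # torch.Size([1, 1024])
```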
2 Related Work

Human-Object Interaction (HOI) detection [4, 18] is crucial for deeper scene understanding and can facilitate behavior and activity learning [15, 25, 46, 47, 37, 38, 44]. Recently, huge progress has been made in this field, driven by large-scale datasets [18, 4, 5, 15, 25] and deep learning. HOI has been studied for a long time. Previously, most methods [16, 17, 55, 54, 6, 7] adopted hand-crafted features. With the renaissance of neural networks, recent works [8, 32, 27, 12, 45, 19, 49, 42, 4, 13, 41, 24, 28] leverage learned features in an end-to-end paradigm. HO-RCNN [4] utilized a multi-stream model to process human, object and spatial patterns respectively, which is widely followed by subsequent works [12, 27, 49]. Differently, GPNN [42] adopted a graph model to address HOI learning for both images and videos. Instead of directly processing all human-object pairs generated from detection, TIN [27] utilized interactiveness estimation to filter out non-interactive pairs in advance. In terms of modality, Peyre et al. [41] learned a joint space by aligning visual and linguistic features and used word analogy to address unseen HOIs. DJ-RN [24] recovered 3D human and object (location and size) and learned a 2D-3D joint representation. Finally, some works encode HOI with the help of a knowledge base. Based on human part-level semantics, HAKE [25] built a large-scale part-state [30] knowledge base and Activity2Vec for finer-grained action encoding. Xu et al. [53] constructed a knowledge graph from HOI annotations and external sources to advance the learning.

Figure 2: HOI Analysis. T_D(·) and T_I(·) indicate the decomposition and integration. First, we decompose the coherent HOI into isolated human and object. Next, human and object can be integrated into HOI again. Through T_D(·) and T_I(·), we can model the verb in the transformation function space and conduct the inter-pair transformation (IPT) more easily. The red X means it is hard to operate IPT directly. g_h, g_o indicate the inter-human/object transformation functions.

Besides the computer vision community, HOI is also studied in human perception and cognition research. In [1], Baldassano et al. studied how the human brain models HOI given HOI images. Interestingly, besides the brain regions responsible for encoding isolated human or object, certain regions can integrate isolated human and object into a higher-level joint representation. For example, pSTS can coherently model the HOI instead of simply summing isolated human and object information. This phenomenon inspires us to rethink the nature of HOI representation. Thus we propose a novel HOI Analysis method to encode HOI by integration and decomposition.

On the other hand, HOI learning is similar to another compositional problem: attribute-object learning [35, 34, 26]. Attribute-object compositions have many interesting properties such as contextuality, compositionality [34, 35] and symmetry [26]. To learn attribute-object compositions, attributes are treated as primitives equal to objects [34] or as linear/non-linear transformations [35, 26]. Different from attributes expressed on object appearance, verbs in HOIs are more implicit and hard to locate in images. They are a kind of holistic representation of the composed human and object instances.
Thus, we propose several transformation validities to embed and capture the verbs in the transformation function space, instead of utilizing an explicit classifier to classify them [34] or using language priors [35, 26].

3.1 Overview

In an image, human and object can be explicitly seen. However, we can hardly depict which region is the verb. For "hold cup", hold may be obvious and centered on the hand and cup. But for "ride bicycle", most parts of the person and bicycle together represent ride. Hence, vision systems may struggle with diverse interactions, as it is hard to capture the appropriate visual regions. Though the attention mechanism [12] may help, the long-tail distribution of HOI data usually makes it unstable. In this work, instead of directly finding the interaction region and mapping it to semantics [4, 12, 27, 49, 19], we propose a novel learning paradigm, i.e., learning the verb representation via HOI Analysis.

Inspired by the perception study [1], we propose the integration T_I(·) and decomposition T_D(·) functions. HOI naturally consists of human, object and an implicit verb. Thus, we can decompose HOI into basic elements and integrate them again, analogous to Harmonic Analysis. The overview of HOI Analysis is depicted in Fig. 2. As HOI is not the simple sum of isolated human and object [1], different from FT, our transformations are nonequivalent. The key difference lies in the addition and elimination of implicit interactions. We use binary interactiveness [27], which indicates whether human and object are interactive, to monitor these semantic changes. Hence, the interactiveness [27] of isolated human/object is False, while the joint human-object pair has True interactiveness. From the above, T_I(·) should be able to add interaction to isolated instances and make the integrated human-object pair have True interactiveness. On the contrary, T_D(·) should eliminate the interaction between the coherent human-object pair and force their interactiveness to be False. At last, to encode the implicit verbs, we represent them in the transformation function space. A pair of decomposition and integration functions is constructed for each verb and forced to operate the appropriate transformations.

We introduce the feature preparation as follows. First, given an image, we use an object detector [43] to obtain the human/object boxes b_h, b_o. Then, we adopt a COCO [29] pre-trained ResNet-50 [20] to extract human/object RoI pooling features f^a_h, f^a_o from the third ResNet block, where the superscript a indicates visual appearance. For simplicity, we use the tight union box of human and object to represent the coherent HOI pair (union). Notably, a coherent HOI carries the interaction semantics and is more than the sum of the isolated human and object [1], i.e., the incoherent ones. With b_h, b_o, the union box b_u can be easily obtained. The RoI pooling feature of b_u, adopted from the fourth ResNet block, serves as the appearance representation of the coherent HOI (f^a_u). Note that f^a_u is twice the size of f^a_h, f^a_o, since it passes through one more ResNet block.

Second, to encode box locations, we generate location features f^b_h, f^b_o, f^b_u, where the superscript b indicates box location. We follow the box coordinate normalization method of [41] to get the normalized boxes b̂_h, b̂_o. For the union box, we concatenate b̂_h and b̂_o and feed them to an MLP to get f^b_u. For the human/object box, b̂_h or b̂_o is also fed to an MLP to get f^b_h or f^b_o. The size of f^b_h or f^b_o is half the size of f^b_u.
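As a rough illustration of the two preparation steps above (RoI appearance features plus normalized-box location features), the following sketch shows one possible implementation. The feature dimensions, the MLP sizes and the box normalization itself are assumptions for illustration, not the paper's exact choices.

```python
import torch
import torch.nn as nn

def normalize_box(box, img_w, img_h):
    # Placeholder normalization: scale (x1, y1, x2, y2) to [0, 1] by image size.
    # The paper follows the normalization of Peyre et al. [41]; this is only a stand-in.
    x1, y1, x2, y2 = box
    return torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])

# Location-feature MLPs: the union feature f^b_u is built from the concatenated
# normalized human and object boxes and is twice the size of f^b_h / f^b_o.
loc_dim = 128                                                # assumed size of f^b_h / f^b_o
mlp_ho = nn.Sequential(nn.Linear(4, loc_dim), nn.ReLU())     # b̂_h or b̂_o -> f^b_h / f^b_o
mlp_u = nn.Sequential(nn.Linear(8, 2 * loc_dim), nn.ReLU())  # [b̂_h, b̂_o] -> f^b_u

b_h = normalize_box((50, 40, 200, 360), img_w=640, img_h=480)
b_o = normalize_box((180, 200, 320, 330), img_w=640, img_h=480)

f_b_h, f_b_o = mlp_ho(b_h), mlp_ho(b_o)
f_b_u = mlp_u(torch.cat([b_h, b_o]))

# Appearance features f^a_h, f^a_o, f^a_u come from RoI pooling on a COCO pre-trained
# ResNet-50 (third block for human/object, fourth block for the union); here they are
# random placeholders, with f^a_u twice the size of f^a_h / f^a_o.
app_dim = 1024                                               # assumed size of f^a_h / f^a_o
f_a_h, f_a_o, f_a_u = torch.randn(app_dim), torch.randn(app_dim), torch.randn(2 * app_dim)
```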
Third, the location features f^b_u, f^b_h, f^b_o are concatenated with their corresponding appearance features f^a_u, f^a_h, f^a_o, yielding f̂_u, f̂_h, f̂_o. The sizes of f̂_h and f̂_o are again half the size of f̂_u. For convenience, we combine f̂_h and f̂_o as f̂_h ⊕ f̂_o.

Before the transformations, we compress these features via an autoencoder (AE) to reduce the computational burden. The AE takes f̂_u as input and is pre-trained with an input-output reconstruction loss and a verb classification loss (Sec. 3.4). The classification score is denoted as S^{AE}_v. After pre-training, we use the AE to compress f̂_u and f̂_h ⊕ f̂_o into 1024-dimensional f_u (coherent) and f_h ⊕ f_o (isolated) respectively. Finally, we have f_u and f_h ⊕ f_o for integration and decomposition. The ideal transformations are:

T_D(f_u) = f_h ⊕ f_o,  T_I(f_h ⊕ f_o) = f_u,   (1)

where T_D(·), T_I(·) indicate the decomposition and integration functions, and ⊕ indicates a linear operation between isolated human and object features, such as element-wise summation or concatenation. In most cases, concatenation performs better. As for the inter-pair transformation, we use

g_h(f^i_h) = f^j_h,  g_o(f^i_o) = f^j_o,  i ≠ j,   (2)

where f^i_h, f^i_o indicate the features of human/object instances, and g_h(·), g_o(·) are the inter-human/object transformation functions. Because a strict inter-pair transformation like motion transfer [3] is complex and not our main goal, we implement g_h(·) and g_o(·) as simple feature replacement. For human instances, we find substitutional persons with the same HOI according to pose similarity. For object instances, we use objects of the same category and similar sizes as substitutions. All substitutional candidates come from the same dataset (train set) and are randomly sampled during training. From the experiments (Sec. 4.5), we find that this policy performs well and effectively improves the interaction representation learning.

We propose a concise Integration-Decomposition Network (IDN), as shown in Fig. 3. IDN mainly consists of two parts: the first is the integration and decomposition transformations (Sec. 3.2), which construct a loop between the union and human/object features; the second is the inter-pair transformation (Sec. 3.3), which exchanges the human/object instances between pairs with the same HOI. In Sec. 3.4, we introduce the training objectives derived from the transformation principles. With them, IDN learns more effective interaction representations and advances HOI detection (Sec. 3.5).

3.2 Integration and Decomposition

As shown in Fig. 3, IDN constructs a loop consisting of two inverse transformations, integration and decomposition, implemented with MLPs. That is, we represent the verb/interaction in MLP weight space, i.e., transformation function space. For each verb, we adopt a pair of dedicated MLPs as integration and decomposition functions, e.g., T^{v_i}_I(·) and T^{v_i}_D(·) for verb v_i.

Figure 3: The structure of IDN. (a) depicts the feature compressor AE. (b) shows the integration-decomposition loop. For each verb v_i, we adopt the corresponding T^{v_i}_I(·) and T^{v_i}_D(·). The L2 distances d^{v_i}_u, d^{v_i}_{ho} are then used in interaction classification (Sec. 3.5). Notably, the encoded feature f_h ⊕ f_o is the simple combination of isolated human and object and thus not yet integrated with the HOI semantics (Fig. 4).
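A minimal sketch of this design is given below, assuming two-layer MLPs, a 1024-dimensional compressed feature and the 117 verbs of HICO-DET; the hidden sizes and layer counts are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

feat_dim, n_verbs = 1024, 117   # compressed feature size; 117 verbs on HICO-DET

def mlp(in_dim, out_dim, hidden=1024):
    # Assumed two-layer MLP; the released IDN may use a different depth/width.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

# One (T_I, T_D) pair per verb: verbs are represented in the weights of these MLPs.
T_I = nn.ModuleList([mlp(feat_dim, feat_dim) for _ in range(n_verbs)])
T_D = nn.ModuleList([mlp(feat_dim, feat_dim) for _ in range(n_verbs)])

f_ho = torch.randn(8, feat_dim)   # compressed f_h ⊕ f_o for a batch of 8 pairs
f_u_per_verb = [T_I[i](f_ho) for i in range(n_verbs)]               # {f^{v_i}_u} (Eq. 3 below)
f_ho_per_verb = [T_D[i](f_u_per_verb[i]) for i in range(n_verbs)]   # {f^{v_i}_h ⊕ f^{v_i}_o} (Eq. 4 below)
```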
For integration, given a pair of isolated f_h and f_o, {T^{v_i}_I(·)}^n_{i=1} integrates them into n outputs, one per verb:

f^{v_i}_u = T^{v_i}_I(f_h ⊕ f_o),   (3)

where i = 1, 2, ..., n, n is the number of verbs, f^{v_i}_u is the integrated union feature for the i-th verb, and ⊕ indicates concatenation. Through the integration function set {T^{v_i}_I(·)}^n_{i=1}, we get a set of 1024-dimensional integrated union features {f^{v_i}_u}^n_{i=1}. If the original f_u contains the semantics of the i-th verb, it should be close to f^{v_i}_u and far away from the other integrated union features.

Second, the subsequent decomposition proceeds as follows. Given the integrated union feature set {f^{v_i}_u}^n_{i=1}, we use n decomposition functions {T^{v_i}_D(·)}^n_{i=1} to decompose them respectively:

f^{v_i}_h ⊕ f^{v_i}_o = T^{v_i}_D(f^{v_i}_u).   (4)

The decomposition output is also a set of features {f^{v_i}_h ⊕ f^{v_i}_o}^n_{i=1}, where f^{v_i}_h and f^{v_i}_o have the same sizes as f_h and f_o. Similarly, if this human-object pair is performing the i-th interaction, the original input f_h ⊕ f_o should be close to f^{v_i}_h ⊕ f^{v_i}_o and far away from the other {f^{v_j}_h ⊕ f^{v_j}_o}_{j≠i}.

3.3 Inter-Pair Transformation

Inter-Pair Transformation (IPT) is proposed to reveal the inherent nature of the implicit verb, i.e., the shared information between different pairs with the same HOI. Here, we adopt a simple implementation: an instance-exchange policy. For humans, we first use pose estimation [9, 23] to obtain poses and then perform alignment and normalization. In detail, the pelvis keypoints of all persons are aligned, and all head-pelvis distances are scaled to one. Hence, we can find similar persons according to pose similarity, which is calculated as the sum of Euclidean distances between the corresponding keypoints of two persons. To keep the semantics, similar persons should share at least one HOI. Selecting similar objects is simpler: we directly choose objects of the same category. An extra criterion is that we choose objects with similar sizes, using the area ratio between the object box and the paired human box as the measure. Finally, m similar candidates are selected for each human/object. The whole selection is performed within one dataset.

Formally, with instance exchange, Eq. 3 can be rewritten as:

f^{v_i}_u = T^{v_i}_I(g_h(f_h) ⊕ g_o(f_o)) = T^{v_i}_I(f^{k_1}_h ⊕ f^{k_2}_o),   (5)

where k_1, k_2 = 1, 2, ..., m and m is the number of selected similar candidates (here m = 5). {T^{v_i}_I(·)}^n_{i=1} and {T^{v_i}_D(·)}^n_{i=1} should be equally effective before and after instance exchange. During training, we first use Eq. 3 for a certain number of epochs and then replace Eq. 3 with Eq. 5 (Sec. 4.2). When using Eq. 5, we put the original instance and its exchange candidates together and randomly sample from them. Notably, we focus on transformations between pairs with the same HOI. Transformations between different HOIs, which require manipulating the human posture, the human-object spatial configuration and the interactive pattern, are beyond the scope of this paper. For IPT, more sophisticated approaches are also possible, e.g., using motion transfer [3] to adjust the 2D human posture according to another person with the same HOI but a different posture (eating while sitting/standing), recovering 3D HOI [39, 24] and adjusting the 3D pose [22, 10] to generate new images/features, or using language priors to change the classes of interacted objects or HOI compositions [2, 36]. But these are beyond the scope of our main insight, so we leave them to future work.
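The candidate selection for human instances can be sketched as below. The keypoint layout, data format and helper names are assumptions for illustration; the real pipeline relies on the pose estimators cited above [9, 23].

```python
import numpy as np

def normalize_pose(kpts, pelvis_idx=0, head_idx=1):
    """Align the pelvis to the origin and scale the head-pelvis distance to 1.

    kpts: (K, 2) array of 2D keypoints; the indices are placeholders for whatever
    keypoint layout the pose estimator provides.
    """
    kpts = kpts - kpts[pelvis_idx]
    scale = np.linalg.norm(kpts[head_idx]) + 1e-8
    return kpts / scale

def pose_distance(kpts_a, kpts_b):
    # Sum of Euclidean distances between corresponding keypoints.
    return np.linalg.norm(kpts_a - kpts_b, axis=1).sum()

def select_human_candidates(query, pool, m=5):
    """Pick the m most pose-similar persons that share at least one HOI with the query.

    `query` and each pool entry are dicts with 'kpts' (K, 2) and 'hois' (set of HOI ids);
    this dict layout is illustrative, not the released data format.
    """
    q = normalize_pose(query["kpts"])
    scored = [
        (pose_distance(q, normalize_pose(p["kpts"])), idx)
        for idx, p in enumerate(pool)
        if p["hois"] & query["hois"]          # keep the semantics: at least one shared HOI
    ]
    return [idx for _, idx in sorted(scored)[:m]]
```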
3.4 Transformation Principles as Objectives

Before training, we first pre-train the AE to compress the inputs. We feed f̂_u to the encoder and obtain the compressed f_u. Then an MLP takes f_u as input to classify the verbs with Sigmoids (one pair can have multiple HOIs simultaneously) under a cross-entropy loss L^{AE}_{cls}. Meanwhile, f_u is decoded to generate f^{recon}_u, and we construct an MSE reconstruction loss L^{AE}_{recon} between f^{recon}_u and f̂_u. The overall loss of the AE is L^{AE} = L^{AE}_{cls} + L^{AE}_{recon}. After pre-training, the AE is fine-tuned together with the transformation modules. Next, we detail the objectives derived from the transformation principles.

Integration Validity. As aforementioned, we integrate f_h and f_o into the union feature set {f^{v_i}_u}^n_{i=1} for all verbs (Eq. 3 or 5). If integration is able to add the verb semantics, the f^{v_i}_u belonging to the ongoing verb classes should be close to the real f_u. For example, if the coherent f_u contains the semantics of verbs v_p and v_q, then f^{v_p}_u and f^{v_q}_u should be close to f_u, while {f^{v_i}_u}_{i≠p,q} should be far away from f_u. Hence, we construct the distance:

d^{v_i}_u = ||f_u − f^{v_i}_u||_2.   (6)

For n verb classes, we get the distance set {d^{v_i}_u}^n_{i=1}. Following the above principle, if f_u carries the p-th verb semantics, d^{v_p}_u should be small, and vice versa. Therefore, we can directly use the negative distances as the verb classification scores, i.e., S^u_v = {−d^{v_i}_u}^n_{i=1}. S^u_v is then used to build the verb classification loss L^u_{cls} = L^u_{ent} + L^u_{hinge}, where L^u_{ent} is a cross-entropy loss and

L^u_{hinge} = Σ^n_{i=1} [y_i · max(0, d^{v_i} − t^{v_i}_1) + (1 − y_i) · max(0, t^{v_i}_0 − d^{v_i})].

Here y_i = 1 indicates that this pair has verb v_i, and y_i = 0 otherwise. t_0 and t_1 are chosen following a semi-hard mining strategy: t^{v_i}_1 = min_{B^{v_i}_−}(d^{v_i}) and t^{v_i}_0 = max_{B^{v_i}_+}(d^{v_i}), where B^{v_i}_− denotes all the pairs without verb v_i in the current mini-batch, and B^{v_i}_+ denotes all the pairs with verb v_i in the current mini-batch.

Decomposition Validity. This validity constrains the decomposed {f^{v_i}_h ⊕ f^{v_i}_o}^n_{i=1} (Eq. 4). Similar to Eq. 6, we construct n distances between {f^{v_i}_h ⊕ f^{v_i}_o}^n_{i=1} and f_h ⊕ f_o as

d^{v_i}_{ho} = ||(f_h ⊕ f_o) − (f^{v_i}_h ⊕ f^{v_i}_o)||_2   (7)

and obtain {d^{v_i}_{ho}}^n_{i=1}. Again, {d^{v_i}_{ho}}^n_{i=1} should obey the same principle according to the ongoing verbs. Thus, we get the second verb score S^{ho}_v = {−d^{v_i}_{ho}}^n_{i=1} and the verb classification loss L^{ho}_{cls}.

Interactiveness Validity. Interactiveness [27] depicts whether a person and an object are interactive. It is False if and only if the human-object pair does not have any interaction. Due to the "1+1>2" property [1], the interactiveness of an isolated human f_h or object f_o should be False, and so should that of f_h ⊕ f_o. But after we integrate f_h ⊕ f_o into {f^{v_i}_u}^n_{i=1}, its interactiveness should become True. Meanwhile, the original union f_u should also have True interactiveness. We adopt one shared FC-Sigmoid as the binary classifier for f_u, f_h ⊕ f_o and {f^{v_i}_u}^n_{i=1}. The binary label converted from the HOI label is zero if and only if a pair does not have any interaction. Notably, we also tried the interactiveness validity on the decomposed {f^{v_i}_h ⊕ f^{v_i}_o}^n_{i=1} but achieved limited improvement; to keep the model concise, we adopt only the other three effective interactiveness validities hereinafter. Thus, we obtain three binary cross-entropy losses: L^u_{bin}, L^{ho}_{bin}, L^I_{bin}. For clarity, we use a unified L_{bin} = L^u_{bin} + L^{ho}_{bin} + L^I_{bin}. The overall loss of IDN is L = L^u_{cls} + L^{ho}_{cls} + L_{bin}.
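A compact sketch of how the integration validity could be turned into a loss is given below, assuming a multi-label binary cross-entropy for L_ent, batched per-verb distances, and mean (rather than summed) hinge terms; it follows the text above only loosely and is not the released training code.

```python
import torch
import torch.nn.functional as F

def integration_validity_loss(f_u, f_u_per_verb, labels):
    """Integration validity for one mini-batch.

    f_u:          (B, D) real union features
    f_u_per_verb: (B, n, D) integrated features {f^{v_i}_u} from the per-verb T_I
    labels:       (B, n) float multi-hot verb labels y_i
    """
    d_u = torch.norm(f_u.unsqueeze(1) - f_u_per_verb, dim=-1)   # (B, n) distances d^{v_i}_u
    s_u = -d_u                                                  # scores S^u_v = -d^{v_i}_u
    l_ent = F.binary_cross_entropy_with_logits(s_u, labels)     # multi-label cross-entropy

    # Semi-hard thresholds per verb within the mini-batch:
    # t_1 = min distance over negative pairs, t_0 = max distance over positive pairs.
    big = torch.finfo(d_u.dtype).max
    t1 = torch.where(labels == 0, d_u, torch.full_like(d_u, big)).min(dim=0).values
    t0 = torch.where(labels == 1, d_u, torch.full_like(d_u, -big)).max(dim=0).values
    l_hinge = (labels * F.relu(d_u - t1) + (1 - labels) * F.relu(t0 - d_u)).mean()

    return l_ent + l_hinge   # L^u_cls; L^ho_cls is built identically from d^{v_i}_ho

# Interactiveness validity: one shared FC-Sigmoid classifier is applied to f_u,
# f_h ⊕ f_o and every f^{v_i}_u, trained with binary cross-entropy (L_bin).
```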
With the guidance of these principles, IDN can well capture the interaction changes during the transformations. Different from previous methods that aim at encoding the entire HOI representation statically, IDN focuses on dynamically inferring whether an interaction exists within a human-object pair through integration and decomposition. Thus IDN can alleviate the learning difficulty caused by complex and diverse HOI patterns.

3.5 Application: HOI Detection

We further apply IDN to HOI detection, which needs to simultaneously locate human-object pairs and classify the ongoing interactions. For locations, we adopt the detected boxes from a COCO [29] pre-trained Faster R-CNN [43], as well as the object class probability P_o. Then, verb scores can be obtained from Eqs. 6 and 7. S^u_v = {−d^{v_i}_u}^n_{i=1}, S^{ho}_v = {−d^{v_i}_{ho}}^n_{i=1} and the S^{AE}_v obtained from the AE are fed to exponential functions or Sigmoids to generate P^u_v = exp(S^u_v), P^{ho}_v = exp(S^{ho}_v) and P^{AE}_v = Sigmoid(S^{AE}_v). Since the validity losses pull together the features that match the labels and push away the others, we can directly use the three kinds of distances to classify verbs. For example, if f_u contains the i-th verb, d^{v_i}_u = ||f_u − f^{v_i}_u||_2 should be small (the probability should be large); if not, d^{v_i}_u should be large (the probability should be small). The final verb probabilities are acquired via P_v = α(P^u_v + P^{ho}_v + P^{AE}_v), where α = 1/3. For HOI triplets, we obtain the HOI probabilities as P_{HOI} = P_v · P_o for all possible compositions, following the benchmark setting.

4 Experiment

In this section, we first introduce the adopted datasets and metrics (Sec. 4.1) and the implementation (Sec. 4.2). Next, we compare IDN with the state-of-the-art on HICO-DET [4] and V-COCO [18] in Sec. 4.3. As HOI detection metrics [4, 18] require both accurate human/object locations and accurate verb classification, the performance strongly relies on object detection. Hence, we conduct experiments to evaluate IDN with different object detectors. At last, ablation studies are conducted (Sec. 4.5).

4.1 Dataset and Metric

We adopt the widely-used HICO-DET [4] and V-COCO [18]. HICO-DET [4] consists of 47,776 images (38,118 for training and 9,658 for testing) and 600 HOI categories (80 COCO [29] objects and 117 verbs). V-COCO [18] contains 10,346 images (2,533 and 2,867 in the train and validation sets, 4,946 in the test set). Its annotations include 29 verb categories (25 HOIs and 4 body motions) and the same 80 objects as HICO-DET [4]. For HICO-DET, we use mAP following [4]: a true positive needs to contain accurate human and object locations (box IoU with the GT box larger than 0.5) and accurate verb classification. The role mean average precision [18] is used for V-COCO.

4.2 Implementation Details

The encoder of the adopted AE compresses the input feature dimension from 4608 to 4096, then to 1024. The decoder is structured symmetrically to the encoder. For HICO-DET [4], the AE is pre-trained for 4 epochs using SGD with a learning rate of 0.1 and momentum of 0.9, while each batch contains 45 positive and 360 negative pairs. The whole IDN (AE and transformation modules) is first trained without inter-pair transformation (IPT) for 20 epochs using SGD with a learning rate of 2e-2 and momentum of 0.9. Then we fine-tune IDN with IPT for 30 epochs using SGD with a learning rate of 1e-3 and momentum of 0.9. Each batch for the whole IDN contains 15 positive and 120 negative pairs.
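For reference, the inference-time score fusion described in Sec. 3.5 can be sketched as follows. Tensor shapes are placeholders, and the LIS/NIS post-processing mentioned in the testing details below is omitted; this is a rough sketch, not the released inference code.

```python
import torch

def fuse_verb_scores(d_u, d_ho, s_ae, p_obj, alpha=1.0 / 3):
    """Score fusion of Sec. 3.5 (rough sketch; LIS/NIS post-processing omitted).

    d_u, d_ho: (B, n) distances d^{v_i}_u, d^{v_i}_ho from the integration /
               decomposition validities (smaller = more likely).
    s_ae:      (B, n) verb logits from the AE classifier.
    p_obj:     (B, 1) object class probability from the detector.
    """
    p_u = torch.exp(-d_u)            # P^u_v = exp(S^u_v), with S^u_v = -d^{v_i}_u
    p_ho = torch.exp(-d_ho)          # P^ho_v
    p_ae = torch.sigmoid(s_ae)       # P^AE_v
    p_v = alpha * (p_u + p_ho + p_ae)
    return p_v * p_obj               # P_HOI = P_v * P_o per verb-object composition

# Example with placeholder tensors for a batch of 2 pairs and n = 117 verbs.
p_hoi = fuse_verb_scores(torch.rand(2, 117), torch.rand(2, 117),
                         torch.randn(2, 117), torch.rand(2, 1))
```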
For V-COCO [18], the AE is first pre-trained for 60 epochs, and the whole IDN is trained without IPT for 45 epochs using SGD, then fine-tuned with IPT for 20 epochs. The other training parameters are the same as those for HICO-DET. In testing, LIS [27] is adopted with T = 8.3, k = 12.0, ω = 10.0. Following [27], we use NIS [27] in all tests with the default threshold and the interactiveness estimation of f_u. All experiments are conducted on a single NVIDIA Titan Xp GPU.

4.3 Results

Setting. We compare IDN with the state-of-the-art [45, 4, 13, 42, 12, 27, 19, 49, 41, 48, 24, 11, 21, 53, 50, 56, 28, 51, 18] on two benchmarks in Tab. 1 and Tab. 2. For HICO-DET, we follow the settings in [4]: Full (600 HOIs), Rare (138 HOIs) and Non-Rare (462 HOIs) on the Default and Known Object sets. For V-COCO, we evaluate AP_role (24 actions with roles) on Scenario 1 (S1) and Scenario 2 (S2). To purely illustrate the HOI recognition ability without the influence of object detection, we conduct evaluations in Tab. 1 with three kinds of detectors: COCO pre-trained (COCO), pre-trained on COCO and then fine-tuned on the HICO-DET train set (HICO-DET), and GT boxes (GT).

Comparison. With T_I(·) and T_D(·), IDN outperforms previous methods significantly and achieves 23.36 mAP on the Default Full set of HICO-DET [4] with the COCO detector. Moreover, IDN is the first to achieve more than 20 mAP on all three Default sets without additional information. The improvement on the Rare set shows that the dynamically learned interaction representation can greatly alleviate the data deficiency of rare HOIs. With the HICO-DET fine-tuned detector, IDN also shows great improvements and achieves more than 26 mAP.

Method | Detector | Feature | Default Full | Default Rare | Default Non-Rare | Known Obj. Full | Known Obj. Rare | Known Obj. Non-Rare
Shen et al. [45] | COCO | VGG-19 | 6.46 | 4.24 | 7.12 | - | - | -
HO-RCNN [4] | COCO | CaffeNet | 7.81 | 5.37 | 8.54 | 10.41 | 8.94 | 10.85
InteractNet [13] | COCO | ResNet50-FPN | 9.94 | 7.16 | 10.77 | - | - | -
GPNN [42] | COCO | ResNet101 | 13.11 | 9.34 | 14.23 | - | - | -
Xu et al. [53] | COCO | ResNet50 | 14.70 | 13.26 | 15.13 | - | - | -
iCAN [12] | COCO | ResNet50 | 14.84 | 10.45 | 16.15 | 16.26 | 11.33 | 17.73
Wang et al. [50] | COCO | ResNet50 | 16.24 | 11.16 | 17.75 | 17.73 | 12.78 | 19.21
TIN [27] | COCO | ResNet50 | 17.03 | 13.42 | 18.11 | 19.17 | 15.51 | 20.26
No-Frills [19] | COCO | ResNet152 | 17.18 | 12.17 | 18.68 | - | - | -
Zhou et al. [56] | COCO | ResNet50 | 17.35 | 12.78 | 18.71 | - | - | -
PMFNet [49] | COCO | ResNet50-FPN | 17.46 | 15.65 | 18.00 | 20.34 | 17.47 | 21.20
DRG [11] | COCO | ResNet50-FPN | 19.26 | 17.74 | 19.71 | 23.40 | 21.75 | 23.89
Peyre et al. [41] | COCO | ResNet50-FPN | 19.40 | 14.60 | 20.90 | - | - | -
VCL [21] | COCO | ResNet50 | 19.43 | 16.55 | 20.29 | 22.00 | 19.09 | 22.87
VSGNet [48] | COCO | ResNet152 | 19.80 | 16.05 | 20.91 | - | - | -
DJ-RN [24] | COCO | ResNet50 | 21.34 | 18.53 | 22.18 | 23.69 | 20.64 | 24.60
IDN | COCO | ResNet50 | 23.36 | 22.47 | 23.63 | 26.43 | 25.01 | 26.85
PPDM [28] | HICO-DET | Hourglass-104 | 21.73 | 13.78 | 24.10 | 24.58 | 16.65 | 26.84
Bansal et al. [2] | HICO-DET | ResNet101 | 21.96 | 16.43 | 23.62 | - | - | -
TIN [27]^VCL | HICO-DET | ResNet50 | 22.90 | 14.97 | 25.26 | 25.63 | 17.87 | 28.01
TIN [27]^DRG | HICO-DET | ResNet50 | 23.17 | 15.02 | 25.61 | 24.76 | 16.01 | 27.37
VCL [21] | HICO-DET | ResNet50 | 23.63 | 17.21 | 25.55 | 25.98 | 19.12 | 28.03
DRG [11] | HICO-DET | ResNet50-FPN | 24.53 | 19.47 | 26.04 | 27.98 | 23.11 | 29.43
IDN^VCL | HICO-DET | ResNet50 | 24.58 | 20.33 | 25.86 | 27.89 | 23.64 | 29.16
IDN^DRG | HICO-DET | ResNet50 | 26.29 | 22.61 | 27.39 | 28.24 | 24.47 | 29.37
iCAN [12] | GT | ResNet50 | 33.38 | 21.43 | 36.95 | - | - | -
TIN [27] | GT | ResNet50 | 34.26 | 22.90 | 37.65 | - | - | -
Peyre et al. [41] | GT | ResNet50-FPN | 34.35 | 27.57 | 36.38 | - | - | -
IDN | GT | ResNet50 | 43.98 | 40.27 | 45.09 | - | - | -
Table 1: Results on HICO-DET [4].
COCO denotes the COCO pre-trained detector, HICO-DET means that the COCO detector is further fine-tuned on HICO-DET, and GT means the ground-truth human-object box pairs. Superscript DRG or VCL indicates that the HICO-DET fine-tuned detector from DRG [11] or VCL [21] is used.

The gain from 23.36 to 26.29 mAP further reflects the influence of object detection quality. Given GT boxes, the gaps among the other three methods [12, 27, 41] are marginal, but IDN achieves more than 9 mAP improvement on HOI recognition alone. All of this verifies the efficacy of our integration and decomposition. On V-COCO [18], IDN achieves 53.3 mAP on S1 and 60.3 mAP on S2, both significantly outperforming previous methods. Moreover, thanks to its flexibility as a plug-in, we also apply IDN to existing HOI methods. In detail, we apply integration and decomposition to iCAN [12] as a proxy task to enhance its feature learning; the performance improves from 14.84 mAP to 18.98 mAP (HICO-DET Default Full).

Efficiency and Scalability. In IDN, each verb is represented by a pair of MLPs (T^{v_i}_I(·) and T^{v_i}_D(·)). To ensure efficiency, we carefully design the data flow so that IDN can run on a single GPU. All transformations are operated in parallel, and the inference speed is 10.04 FPS (iCAN [12]: 4.90 FPS, TIN [27]: 1.95 FPS, PPDM [28]: 14.08 FPS, PMFNet [49]: 3.95 FPS). For scalability, we also considered an implementation that utilizes a single MLP for all verbs, i.e., conditioned MLP functions f^{v_i}_u = T_I(f_h ⊕ f_o, f^{ID}_{v_i}) and f^{v_i}_h ⊕ f^{v_i}_o = T_D(f^{v_i}_u, f^{ID}_{v_i}), where f^{ID}_{v_i} is the verb indicator (one-hot/Word2Vec [33]/GloVe [40]). For new verbs, we just change the verb indicator instead of adding MLPs. This works similarly to zero-shot learning methods such as TAFE-Net [52] and Nan et al. [36], but performs worse (20.86 mAP, HICO-DET Default Full) than the reported version (23.36 mAP).

4.4 Visualization

To verify the effectiveness of the transformations, we use t-SNE [31] to visualize f_u, f_h ⊕ f_o and T^v_I(f_h ⊕ f_o) for different v in Fig. 4. We find that the integrated T^v_I(f_h ⊕ f_o) is obviously closer to the real union f_u, while the simple linear combination f_h ⊕ f_o cannot represent the interaction information.

Figure 4: Visualizations of f_u, f_h ⊕ f_o and T^v_I(f_h ⊕ f_o).
Figure 5: D^v_1 and D^v_2.

Method | AP_role (S1) | AP_role (S2)
Gupta et al. [18] | 31.8 | -
InteractNet [13] | 40.0 | -
GPNN [42] | 44.0 | -
iCAN [12] | 45.3 | 52.4
Xu et al. [53] | 45.9 | -
Wang et al. [50] | 47.3 | -
TIN [27] | 47.8 | 54.2
IP-Net [51] | 51.0 | -
VSGNet [48] | 51.8 | 57.0
PMFNet [49] | 52.0 | -
IDN | 53.3 | 60.3
Table 2: Results on V-COCO [18].

Method | Default Full | Default Rare | Default Non-Rare | Known Obj. Full | Known Obj. Rare | Known Obj. Non-Rare
IDN | 23.36 | 22.47 | 23.63 | 26.43 | 25.01 | 26.85
AE only | 17.27 | 14.02 | 18.24 | 20.99 | 17.39 | 22.06
T_I only | 21.26 | 19.96 | 21.65 | 24.73 | 23.28 | 25.17
T_D only | 21.05 | 19.21 | 21.60 | 24.51 | 22.56 | 25.10
w/o IPT | 22.63 | 22.16 | 22.77 | 25.76 | 24.66 | 26.09
w/o L^u_cls | 19.98 | 18.02 | 20.57 | 23.24 | 20.55 | 24.04
w/o L^ho_cls | 21.39 | 20.08 | 21.79 | 24.65 | 22.75 | 25.22
w/o L_bin | 22.01 | 20.65 | 22.41 | 25.03 | 23.40 | 25.52
w/o L^AE_recon | 21.07 | 20.11 | 21.36 | 24.22 | 22.50 | 24.74
w/o L^AE_cls | 19.60 | 17.88 | 20.11 | 22.68 | 20.32 | 23.38
Table 3: Ablation studies on HICO-DET [4].

We also analyze the IPT. In detail, we randomly select a pair with verb v and denote its features as f_h, f_o, f_u. Assume there are m other pairs with verb v, whose features are {f^i_h, f^i_o, f^i_u}^m_{i=1}. Then we calculate

D^v_1 = (1/m) Σ^m_{i=1} ||f_u − f^i_u||_2,
D^v_2 = (1/2m) Σ^m_{i=1} (||T^v_I(f_h ⊕ f^i_o) − f^i_u||_2 + ||T^v_I(f^i_h ⊕ f_o) − f^i_u||_2).

Here, D^v_1 is the mean distance from f_u to {f^i_u}^m_{i=1}, and D^v_2 is the mean distance from T^v_I(f_h ⊕ f^i_o) and T^v_I(f^i_h ⊕ f_o) to f^i_u (i = 1, 2, ..., m).
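These two statistics could be computed as in the sketch below, which assumes the per-verb integration function T^v_I from Sec. 3.2 and uses placeholder tensors; it is an illustration rather than the released evaluation code.

```python
import torch

def ipt_statistics(T_I_v, f_h, f_o, f_u, pairs):
    """D^v_1 and D^v_2 for one query pair (f_h, f_o, f_u) and m other same-verb pairs.

    T_I_v: the integration MLP of verb v (maps the concatenated f_h ⊕ f_o to a union feature).
    pairs: list of (f_h_i, f_o_i, f_u_i) feature tuples for the m other pairs.
    """
    d1, d2 = [], []
    for f_h_i, f_o_i, f_u_i in pairs:
        d1.append(torch.norm(f_u - f_u_i))
        # exchange the object (or the human) with pair-i, then integrate with T^v_I
        d2.append(torch.norm(T_I_v(torch.cat([f_h, f_o_i])) - f_u_i))
        d2.append(torch.norm(T_I_v(torch.cat([f_h_i, f_o])) - f_u_i))
    m = len(pairs)
    return sum(d1) / m, sum(d2) / (2 * m)   # D^v_1, D^v_2
```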
If IPT can effectively transform one pair to another by exchanging the human/object, we should have D^v_1 > D^v_2. We compare D^v_1 and D^v_2 for 20 different verbs in Fig. 5. As shown, in most cases D^v_1 is much larger than D^v_2, indicating the effectiveness of IPT.

4.5 Ablation Study

We conduct ablation studies on HICO-DET [4] with the COCO detector. The results are shown in Tab. 3. (1) Modules: The performance of each module is evaluated. T_I, T_D and AE achieve 21.26, 21.05 and 17.27 mAP respectively and show complementary properties. (2) Objectives: During training, we drop each of the three validity objectives in turn. Without any one of them, IDN shows obvious degradation, especially without the integration validity. (3) Inter-Pair Transformation (IPT): IDN without IPT achieves 22.63 mAP, showing the importance of the instance-exchange policy. (4) AE: The AE is pre-trained with a reconstruction loss L^AE_recon and a verb classification loss L^AE_cls. The removal of L^AE_cls hurts the performance more severely, especially on the Rare set, while L^AE_recon also plays an important auxiliary role in boosting the performance. (5) Transformation Order: In practice, we construct a loop (f_h ⊕ f_o to {f^{v_i}_u}^n_{i=1} to {f^{v_i}_h ⊕ f^{v_i}_o}^n_{i=1}) to train IDN with consistency. Using f_u instead of {f^{v_i}_u}^n_{i=1}, i.e., f_h ⊕ f_o to {f^{v_i}_u}^n_{i=1} and f_u to {f^{v_i}_h ⊕ f^{v_i}_o}^n_{i=1}, performs worse (21.77 mAP).

5 Conclusion

In this paper, we propose a novel HOI learning paradigm named HOI Analysis, inspired by Harmonic Analysis, and introduce an Integration-Decomposition Network (IDN) to implement it. With the integration and decomposition between the coherent HOI and the isolated human and object, IDN effectively learns the interaction representation in the transformation function space and outperforms the state-of-the-art on HOI detection with significant improvements.

Broader Impact

In this work, we propose a novel paradigm for Human-Object Interaction detection, which would promote human activity understanding. Our work could be useful for vision applications, such as a health care system in an intelligent hospital. Current activity understanding systems are usually computationally expensive, require high computational resources, and could cost many financial and environmental resources. Considering this, we will release our code and trained models to the community, as part of the effort to alleviate repeated training in future works.

Acknowledgments and Disclosure of Funding

This work is supported in part by the National Key R&D Program of China (No. 2017YFA0700800), the National Natural Science Foundation of China under Grant 61772332, Shanghai Qi Zhi Institute, and SHEITC (2018-RGZN-02046).

References

[1] Christopher Baldassano, Diane M. Beck, and Li Fei-Fei. Human-object interactions are more than the sum of their parts. Cerebral Cortex, 2017.
[2] Ankan Bansal, Sai Saketh Rambhatla, Abhinav Shrivastava, and Rama Chellappa. Detecting human-object interactions via functional generalization. In AAAI, 2020.
[3] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. In ICCV, 2019.
[4] Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In WACV, 2018.
[5] Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. HICO: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.
[6] Vincent Delaitre, Josef Sivic, and Ivan Laptev. Learning person-object interactions for action recognition in still images. In NIPS, 2011.
[7] Chaitanya Desai and Deva Ramanan. Detecting actions, poses, and objects with relational phraselets. In ECCV, 2012.
[8] Hao-Shu Fang, Jinkun Cao, Yu-Wing Tai, and Cewu Lu. Pairwise body-part attention for recognizing human-object interactions. In ECCV, 2018.
[9] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
[10] Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, and Song-Chun Zhu. Learning pose grammar to encode human body configuration for 3D pose estimation. In AAAI, 2018.
[11] Chen Gao, Jiarui Xu, Yuliang Zou, and Jia-Bin Huang. DRG: Dual relation graph for human-object interaction detection. In ECCV, 2020.
[12] Chen Gao, Yuliang Zou, and Jia-Bin Huang. iCAN: Instance-centric attention network for human-object interaction detection. In BMVC, 2018.
[13] Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He. Detecting and recognizing human-object interactions. In CVPR, 2018.
[14] E. B. Goldstein. Cognitive Psychology. Belmont, CA: Thomson Higher Education, 2008.
[15] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
[16] Abhinav Gupta and Larry S. Davis. Objects in action: An approach for combining action understanding and object perception. In CVPR, 2007.
[17] Abhinav Gupta, Aniruddha Kembhavi, and Larry S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. TPAMI, 2009.
[18] Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
[19] Tanmay Gupta, Alexander Schwing, and Derek Hoiem. No-frills human-object interaction detection: Factorization, appearance and layout encodings, and training techniques. In ICCV, 2019.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Zhi Hou, Xiaojiang Peng, Yu Qiao, and Dacheng Tao. Visual compositional learning for human-object interaction detection. arXiv preprint arXiv:2007.12407, 2020.
[22] Jiefeng Li, Can Wang, Wentao Liu, Chen Qian, and Cewu Lu. HMOR: Hierarchical multi-person ordinal relations for monocular multi-person 3D pose estimation. In ECCV, 2020.
[23] Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. In CVPR, 2019.
[24] Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li, and Cewu Lu. Detailed 2D-3D joint representation for human-object interaction. In CVPR, 2020.
[25] Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, and Cewu Lu. PaStaNet: Toward human activity knowledge engine. In CVPR, 2020.
[26] Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. Symmetry and group in attribute-object compositions. In CVPR, 2020.
[27] Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yanfeng Wang, and Cewu Lu. Transferable interactiveness knowledge for human-object interaction detection. In CVPR, 2019.
[28] Yue Liao, Si Liu, Fei Wang, Yanjie Chen, and Jiashi Feng. PPDM: Parallel point detection and matching for real-time human-object interaction detection. In CVPR, 2020.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[30] Cewu Lu, Hao Su, Yonglu Li, Yongyi Lu, Li Yi, Chi-Keung Tang, and Leonidas J. Guibas. Beyond holistic object recognition: Enriching image understanding with part states. In CVPR, 2018.
[31] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
[32] Arun Mallya and Svetlana Lazebnik. Learning models for actions and person-object interactions with transfer to question answering. In ECCV, 2016.
[33] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[34] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition with context. In CVPR, 2017.
[35] Tushar Nagarajan and Kristen Grauman. Attributes as operators: Factorizing unseen attribute-object compositions. In ECCV, 2018.
[36] Zhixiong Nan, Yang Liu, Nanning Zheng, and Song-Chun Zhu. Recognizing unseen attribute-object pair with generative model. In AAAI, 2019.
[37] Bo Pang, Kaiwen Zha, Hanwen Cao, Jiajun Tang, Minghui Yu, and Cewu Lu. Complex sequential understanding through the awareness of spatial and temporal concepts. Nature Machine Intelligence, 2(5):245-253, 2020.
[38] Bo Pang, Kaiwen Zha, Yifan Zhang, and Cewu Lu. Further understanding videos through adverbs: A new video task. In AAAI, 2020.
[39] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[40] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[41] Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. Detecting rare visual relations using analogies. In ICCV, 2019.
[42] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.
[43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[44] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding. In CVPR, 2020.
[45] Liyue Shen, Serena Yeung, Judy Hoffman, Greg Mori, and Fei-Fei Li. Scaling human-object interaction recognition through zero-shot learning. In WACV, 2018.
[46] Jianhua Sun, Qinhong Jiang, and Cewu Lu. Recursive social behavior graph for trajectory prediction. In CVPR, 2020.
[47] Jiajun Tang, Jin Xia, Xinzhi Mu, Bo Pang, and Cewu Lu. Asynchronous interaction aggregation for action detection. In ECCV, 2020.
[48] Oytun Ulutan, ASM Iftekhar, and B. S. Manjunath. VSGNet: Spatial attention network for detecting human object interactions using graph convolutions. In CVPR, 2020.
[49] Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, and Xuming He. Pose-aware multi-level feature network for human object interaction detection. In ICCV, 2019.
[50] Tiancai Wang, Rao Muhammad Anwer, Muhammad Haris Khan, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao, and Jorma Laaksonen. Deep contextual attention for human-object interaction detection. In ICCV, 2019.
[51] Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. Learning human-object interaction detection using interaction points. In CVPR, 2020.
[52] Xin Wang, Fisher Yu, Ruth Wang, Trevor Darrell, and Joseph E. Gonzalez. TAFE-Net: Task-aware feature embeddings for low shot learning. In CVPR, 2019.
[53] Bingjie Xu, Yongkang Wong, Junnan Li, Qi Zhao, and Mohan S. Kankanhalli. Learning to detect human-object interactions with knowledge. In CVPR, 2019.
[54] Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Fei-Fei Li. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
[55] Bangpeng Yao and Fei-Fei Li. Grouplet: A structured image representation for recognizing human and object interactions. In CVPR, 2010.
[56] Penghao Zhou and Mingmin Chi. Relation parsing neural network for human-object interaction detection. In ICCV, 2019.