The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Collaborative, Dynamic and Diversified User Profiling

Shangsong Liang1,2
1School of Data and Computer Science, Sun Yat-sen University, China
2Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China
liangshangsong@gmail.com

Abstract

In this paper, we study the problem of dynamic user profiling in the context of streams of short texts. Previous work on user profiling works with long documents, does not consider collaborative information, and does not diversify the keywords used to profile users' interests. In contrast, we address the problem by proposing a user profiling algorithm (UPA), which consists of two models: the proposed collaborative interest tracking topic model (CITM) and the proposed streaming keyword diversification model (SKDM). UPA first utilizes CITM to collaboratively track each user's and his followees' dynamic interest distributions in the context of streams of short texts, and then utilizes SKDM to obtain top-k relevant and diversified keywords to profile users' interests at a specific point in time. Experiments were conducted on a Twitter dataset, and we found that UPA outperforms state-of-the-art non-dynamic and dynamic user profiling algorithms.

Introduction

Capturing users' dynamic interests underlying their posts on microblogging platforms such as Twitter is important to the success of applications that cater to users of such platforms, such as dynamic user clustering (Liang et al. 2017a; 2017b). In this paper, we study the problem of user profiling for streaming short documents (Balog et al. 2012; Liang 2018; Liang et al. 2018): collaboratively identifying users' dynamic interests and tracking how they evolve over time in the context of streaming short texts. Our goal is to infer users' own and their collaborative topic distributions over time and to dynamically profile their interests with a set of diversified keywords in the context of streaming short texts.

The first user profiling model was proposed in (Balog et al. 2007), where a set of relevant keywords was identified for each user in a static collection of long documents and the dynamics of users' interests were ignored. Recent work recognizes the importance of capturing users' dynamic interests over time, and a number of temporal profiling algorithms have been proposed for streams of long documents. However, previous work on user profiling suffers from the following problems: (1) It works with streams of long documents rather than short documents and assumes that the content of documents is rich enough to infer users' dynamic interests. (2) It ignores collaborative information, such as friends' messages, when inferring users' interests at a specific point in time. (3) It simply retrieves a list of top-k keywords as a user's profile, and these keywords may be semantically similar to each other and thus redundant.

Accordingly, in this paper, to address the aforementioned drawbacks of previous work, we propose a User Profiling Algorithm in the context of streams of short documents, abbreviated as UPA, which is collaborative, dynamic and diversified. UPA consists of two proposed models: a Collaborative Interest Tracking topic Model, abbreviated as CITM, and a Streaming Keyword Diversification Model, abbreviated as SKDM.
The UPA algorithm first utilizes our proposed CITM to track the changes of users' dynamic interests in the context of streams of short documents. It then utilizes our proposed SKDM to produce top-k diversified keywords for profiling users' interests at a specific point in time. Our CITM topic model works with streaming short texts and is a dynamic multinomial Dirichlet mixture topic model that is able to infer and track each user's dynamic interest distributions based not only on the user's own posts but also on collaborative information, i.e., his followees' posts. Our hypothesis in CITM is that accounting for collaborative information is critical, especially for users with limited activity, infrequent short posts, and thus sparse information. To perform the inference of users' interest distributions in streams of short documents, we propose a collapsed Gibbs sampling algorithm. Our SKDM model then works with the users' dynamic interest distributions produced by CITM and aims at retrieving a set of relevant and also diversified keywords for profiling users' interests at time t, such that redundancy among the retrieved keywords is avoided while keywords relevant to the users are still kept.

Our contributions are: (1) We propose a user profiling algorithm, UPA, to address the user profiling task in the context of streams of short documents. (2) We propose a topic model, CITM, that can collaboratively and dynamically track each user's and his followees' interests. (3) We propose a collapsed Gibbs sampling algorithm to infer users' and their followees' interest distributions. (4) We propose a streaming keyword diversification model, SKDM, to diversify the top-k keywords as users' profiling results at time t.

Related Work

User Profiling. User profiling has been gaining attention since the launch of the expert finding task at the TREC 2005 enterprise track (Craswell, de Vries, and Soboroff 2005). Balog and de Rijke (2007) worked with a static corpus of long documents and modeled the profile of a user as a set of relevant keywords. Recent work is aware of the importance of temporal user profiling. Temporal profiling for long documents was first introduced in (Rybak, Balog, and Nørvåg 2014), where topical areas were organized in a predefined taxonomy and interests were represented as a weighted, unchanging tree built from the ACM classification system. A probabilistic model was proposed in (Fang and Godavarthy 2014), where experts' academic publications were used to investigate how personal interests evolve over time. We follow most previous work and retrieve the top-k words as the profile of a user's interests.

Topic Modeling. Topic models provide a suite of algorithms to discover hidden thematic structure in a collection of documents. A topic model takes a set of documents as input and discovers a set of latent topics (recurring themes that are discussed in the collection) and the degree to which each document exhibits those topics (Blei, Ng, and Jordan 2003). Since the well-known topic models PLSI (Probabilistic Latent Semantic Indexing) (Hofmann 1999) and LDA (Latent Dirichlet Allocation) (Blei, Ng, and Jordan 2003) were proposed, topic models with dynamics have been widely studied. These include the Dynamic Topic Model (Blei and Lafferty 2006), the Dynamic Mixture Model (Wei, Sun, and Wang 2007), Topics over Time (Wang and McCallum 2006), the Topic Tracking Model (Iwata et al. 2009), and, more recently, the dynamic Dirichlet multinomial mixture topic model (Liang et al. 2017c), the user expertise tracking topic model (Liang 2018) and the user collaborative interest tracking topic model (Liang, Yilmaz, and Kanoulas 2018).
To our knowledge, none of the existing dynamic topic models has considered the problem of user profiling for short texts that utilizes collaborative information to infer topic distributions.

Problem Formulation

We follow most of the previous work (Balog and de Rijke 2007; Berendsen et al. 2013; Liang et al. 2018) and retrieve the top-k words as the profile of a user. The problem we address in this paper is then: given a set of users and streams of short documents generated by them, track their interests over time and dynamically identify a set of top-k relevant and diversified keywords for each of the users. The dynamic user profiling algorithm is essentially a function $h$ that satisfies:

$h: (\mathcal{D}_t, \mathbf{u}_t) \rightarrow \mathcal{W}_t,$

where $\mathcal{D}_t = \{\ldots, \mathbf{d}_{t-2}, \mathbf{d}_{t-1}, \mathbf{d}_t\}$ represents the stream of short documents generated by the users $\mathbf{u}_t$ up to time $t$, with $\mathbf{d}_t$ being the most recent set of short documents arriving at $t$; $\mathbf{u}_t = \{u_1, u_2, \ldots, u_{|\mathbf{u}_t|}\}$ represents the set of users appearing in the stream up to time $t$, with $u_i$ being the $i$-th user in $\mathbf{u}_t$ and $|\mathbf{u}_t|$ being the total number of users in the user set; and $\mathcal{W}_t = \{\mathbf{w}_{t,u_1}, \mathbf{w}_{t,u_2}, \ldots, \mathbf{w}_{t,u_{|\mathbf{u}_t|}}\}$ represents all users' profiling results at $t$, with $\mathbf{w}_{t,u_i} = \{w_{t,u_i,1}, w_{t,u_i,2}, \ldots, w_{t,u_i,k}\}$ being the profiling result, i.e., the top-$k$ diversified keywords, for user $u_i$ at $t$. We assume that the length of a document $d \in \mathcal{D}_t$ is no more than a predefined small length (e.g., 140 characters in Twitter).

User Profiling Algorithm

In this section, we detail our proposed User Profiling Algorithm (UPA), which consists of the proposed Collaborative Interest Tracking topic Model (CITM) and the proposed Streaming Keyword Diversification Model (SKDM).

Overview

We model users' interests in streams by latent topics. Therefore, the dynamic interests of each user $u \in \mathbf{u}_t$ at time period $t$ can be represented as a multinomial distribution $\theta_{t,u}$ over topics, where $\theta_{t,u} = \{\theta_{t,u,z}\}_{z=1}^{Z}$, with $\theta_{t,u,z}$ being the interest score on topic $z$ for user $u$ at time period $t$ and $Z$ being the total number of latent topics. Similarly, the dynamic interests of each user's followees at $t$ can be represented as a multinomial distribution $\psi_{t,u} = \{\psi_{t,u,z}\}_{z=1}^{Z}$, with $\psi_{t,u,z}$ being the interest score of user $u$'s followees $\mathbf{f}_{t,u}$ as a whole on topic $z$ at $t$. Here, $\mathbf{f}_{t,u}$ denotes all of user $u$'s followees at $t$. Our UPA algorithm consists of two main steps: (1) UPA first utilizes the proposed CITM to capture each user's dynamic interests $\theta_{t,u}$ and his collaborative interests $\psi_{t,u}$. (2) Given the inferred $\theta_{t,u}$ and $\psi_{t,u}$, UPA then utilizes SKDM to identify top-$k$ relevant and diversified keywords for profiling user $u$'s dynamic interests at time period $t$.

Collaborative Interest Tracking Model

Modeling Interests over Time. We closely follow the previous work in (Liang, Yilmaz, and Kanoulas 2018; Liang et al. 2017a), and aim at inferring each user's dynamic interest distribution $\theta_{t,u} = \{\theta_{t,u,z}\}_{z=1}^{Z}$ and his collaborative interest distribution $\psi_{t,u} = \{\psi_{t,u,z}\}_{z=1}^{Z}$ at $t$ in the context of streams of short documents in our CITM. We provide CITM's graphical representation in Fig. 1. To track the dynamics of a user $u$'s interests, we assume that the mean of his current interests $\theta_{t,u}$ at time period $t$ is the same as that at $t-1$, unless newly arrived documents associated with user $u$ are observed in the stream.
With this assumption, and following the previous work on dynamic topic models (Iwata et al. 2010; 2009; Wei, Sun, and Wang 2007), we use the following Dirichlet prior with a set of precision values $\alpha_t = \{\alpha_{t,z}\}_{z=1}^{Z}$, where we let the mean of the current distribution $\theta_{t,u}$ depend on the mean of the previous distribution $\theta_{t-1,u}$:

$P(\theta_{t,u} \mid \theta_{t-1,u}, \alpha_t) \propto \prod_{z=1}^{Z} \theta_{t,u,z}^{\alpha_{t,u,z}\,\theta_{t-1,u,z} - 1},$ (1)

where the precision value $\alpha_{t,z} = \{\alpha_{t,u,z}\}_{u=1}^{|\mathbf{u}_t|}$ represents the persistency of users' interests, i.e., how salient topic $z$ is at time period $t$ in contrast to $t-1$ for the users. As this distribution is a conjugate prior of the multinomial distribution, inference can be performed by Gibbs sampling (Liu 1994).

[Figure 1: Graphical representation of our proposed CITM model. Shaded nodes represent observed variables.]

Similarly, to track the dynamic changes of a user $u$'s collaborative interest distribution $\psi_{t,u}$, we use the following Dirichlet prior with a set of precision values $\beta_t = \{\beta_{t,z}\}_{z=1}^{Z}$, where the mean of the current distribution $\psi_{t,u}$ evolves from that of the previous distribution $\psi_{t-1,u}$:

$P(\psi_{t,u} \mid \psi_{t-1,u}, \beta_t) \propto \prod_{z=1}^{Z} \psi_{t,u,z}^{\beta_{t,u,z}\,\psi_{t-1,u,z} - 1},$ (2)

where the precision value $\beta_{t,z} = \{\beta_{t,u,z}\}_{u=1}^{|\mathbf{u}_t|}$ represents the persistency of users' collaborative interests, i.e., how salient topic $z$ is at time period $t$ in contrast to $t-1$ for the users. In a similar way, to model the dynamic changes of the multinomial distribution of words specific to topic $z$, we assume a Dirichlet prior in which the mean of the current distribution $\phi_{t,z} = \{\phi_{t,z,v}\}_{v=1}^{V}$ evolves from the mean of the previous distribution $\phi_{t-1,z}$:

$P(\phi_{t,z} \mid \phi_{t-1,z}, \gamma_t) \propto \prod_{v=1}^{V} \phi_{t,z,v}^{\gamma_{t,z,v}\,\phi_{t-1,z,v} - 1},$ (3)

where $V$ is the total number of words in the vocabulary $\mathbf{v} = \{v_i\}_{i=1}^{V}$ and $\gamma_t = \{\gamma_{t,v}\}_{v=1}^{V}$, with $\gamma_{t,v} = \{\gamma_{t,z,v}\}_{z=1}^{Z}$ representing the persistency of word $v$ in all topics at time $t$, a measure of how consistently the word belongs to the topics at $t$ compared to $t-1$.

Later in this subsection, we propose a collapsed Gibbs sampling algorithm to infer all users' dynamic interest distributions $\Theta_t = \{\theta_{t,u}\}_{u=1}^{|\mathbf{u}_t|}$, their corresponding dynamic collaborative interest distributions $\Psi_t = \{\psi_{t,u}\}_{u=1}^{|\mathbf{u}_t|}$, and the words' dynamic topic distributions $\Phi_t = \{\phi_{t,z}\}_{z=1}^{Z}$, and we describe our update rules for obtaining the optimal persistency values $\alpha_t$, $\beta_t$ and $\gamma_t$.

Assume that we know all users' interest distributions at time $t-1$, $\Theta_{t-1}$, their collaborative interest distributions at time $t-1$, $\Psi_{t-1}$, and the words' topic distributions, $\Phi_{t-1}$. Then the proposed collaborative interest tracking model is essentially a generative topic model that depends on $\Theta_{t-1}$, $\Psi_{t-1}$ and $\Phi_{t-1}$.

Algorithm 1: Inference for our CITM model at time t.
  Input: Distributions $\Theta_{t-1}$, $\Psi_{t-1}$ and $\Phi_{t-1}$ at $t-1$; initialized $\alpha_t$, $\beta_t$ and $\gamma_t$; number of iterations $N_{iter}$.
  Output: Current distributions $\Theta_t$, $\Psi_t$ and $\Phi_t$.
  1: Initialize topic assignments randomly for all documents in $\mathbf{d}_t$
  2: for iteration = 1 to $N_{iter}$ do
  3:   for user $u$ = 1 to $|\mathbf{u}_t|$ do
  4:     for each document $d \in \mathbf{d}_{t,u}$ do
  5:       Draw $z_{t,u,d}$ from (5)
  6:       Update $m_{t,u,z_{t,u,d}}$, $o_{t,u,z_{t,u,d}}$ and $n_{t,z_{t,u,d},v}$
  7:   Update $\alpha_t$, $\beta_t$ and $\gamma_t$
  8: Compute the posterior estimates $\Theta_t$, $\Psi_t$ and $\Phi_t$.

For initialization, and without loss of generality, we let $\theta_{0,u,z} = 1/Z$, $\psi_{0,u,z} = 1/Z$ and $\phi_{0,z,v} = 1/V$ at $t = 0$. Let $\mathbf{d}_{t,u}$ denote all the short documents posted by user $u$ at time period $t$.
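To make the evolving priors in (1)-(3) concrete, here is a minimal sketch (our own illustration, assuming NumPy; the precision value 50.0 is an arbitrary choice, not the paper's) of drawing a user's interest distribution whose prior mean is tied to the previous time period:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = 20  # number of latent topics, as used in the paper's experiments

# Interest distribution theta_{t-1,u} from the previous time period;
# the paper initializes it uniformly at t = 0.
theta_prev = np.full(Z, 1.0 / Z)

# Precision alpha_{t,u,z} (Eq. (1)): larger values make interests persist,
# keeping theta_t close to theta_{t-1}.
alpha = np.full(Z, 50.0)

# Eq. (1) is a Dirichlet with parameters alpha * theta_prev; with this
# constant alpha its mean is exactly theta_prev, and newly observed
# documents shift the estimate during inference.
theta_t = rng.dirichlet(alpha * theta_prev)
assert abs(theta_t.sum() - 1.0) < 1e-9  # a valid multinomial over topics
```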
The generative process of our model for the documents in the stream at time $t$ is as follows:

i. Draw $Z$ multinomials $\phi_{t,z}$, one for each topic $z$, from a Dirichlet prior distribution with parameters $\gamma_{t,z}\phi_{t-1,z}$;
ii. For each user $u \in \mathbf{u}_t$, draw multinomials $\theta_{t,u}$ and $\psi_{t,u}$ from Dirichlet distributions with priors $\alpha_{t,u}\theta_{t-1,u}$ and $\beta_{t,u}\psi_{t-1,u}$, respectively;
iii. For each document $d \in \mathbf{d}_{t,u}$, draw a topic $z_d$ based on the mixture of $\theta_{t,u}$ and $\psi_{t,u}$, and then, for each word $v_d$ in document $d$: (a) draw the word $v_d$ from the multinomial $\phi_{t,z_d}$.

In the above generative process, given that the documents in the stream are short, and because most short documents are likely to talk about one single topic only (Yin and Wang 2014), we let all the words in the same document $d$ be drawn from the multinomial distribution associated with the same topic $z_d$. See the graphical representation of CITM in Fig. 1.

Interest Distribution Inference. We propose a collapsed Gibbs sampling algorithm, based on the basic collapsed Gibbs sampler (Griffiths and Steyvers 2004; Wallach 2006), to approximately infer the distributions in our CITM topic model. As shown in Fig. 1 and the generative process, we adopt a conjugate (Dirichlet) prior for the multinomial distributions, and thus we can easily integrate out the uncertainty associated with the multinomials $\theta_{t,u}$, $\psi_{t,u}$ and $\phi_{t,z}$. We provide an overview of our proposed collapsed Gibbs sampling algorithm in Algorithm 1, where we denote by $m_{t,u,z}$, $o_{t,u,z}$ and $n_{t,z,v}$ the number of documents assigned to topic $z$ for user $u$, the number of documents assigned to topic $z$ for user $u$'s followees, and the number of times word $v$ is assigned to topic $z$ for user $u$ at $t$, respectively.

In the Gibbs sampling procedure, we need to calculate the conditional distribution $P(z_{t,u,d} \mid \mathbf{z}_{t,\neg(u,d)}, \mathbf{d}_t, \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t)$ at time $t$, where $\mathbf{z}_{t,\neg(u,d)}$ represents the topic assignments for all the documents in $\mathbf{d}_t$ except the document $d \in \mathbf{d}_{t,u}$ associated with user $u$ at $t$, and $z_{t,u,d}$ is the topic assigned to document $d \in \mathbf{d}_{t,u}$. To obtain this conditional distribution used during sampling, we begin with the joint probability of the current document set at time $t$:

$P(\mathbf{z}_t, \mathbf{d}_t \mid \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t)$
$= (1-\lambda)\, P(\mathbf{z}_t, \mathbf{d}_t \mid \Theta_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \gamma_t) + \lambda\, P(\mathbf{z}_t, \mathbf{d}_t \mid \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \beta_t, \gamma_t)$
$= (1-\lambda) \prod_{u} \frac{\Gamma(\sum_z \kappa_2)}{\prod_z \Gamma(\kappa_2)} \frac{\prod_z \Gamma(\kappa_1)}{\Gamma(\sum_z \kappa_1)} \prod_{z} \frac{\Gamma(\sum_v \kappa_b)}{\prod_v \Gamma(\kappa_b)} \frac{\prod_v \Gamma(\kappa_a)}{\Gamma(\sum_v \kappa_a)} + \lambda \prod_{u} \frac{\Gamma(\sum_z \kappa_4)}{\prod_z \Gamma(\kappa_4)} \frac{\prod_z \Gamma(\kappa_3)}{\Gamma(\sum_z \kappa_3)} \prod_{z} \frac{\Gamma(\sum_v \kappa_b)}{\prod_v \Gamma(\kappa_b)} \frac{\prod_v \Gamma(\kappa_a)}{\Gamma(\sum_v \kappa_a)},$ (4)

where $\Gamma(\cdot)$ is the gamma function, $\lambda$ is a free parameter that governs the linear mixture of a user's interests and his followees' interests, and the parameters $\kappa$ are defined as: $\kappa_1 = m_{t,u,z} + \alpha_{t,u,z}\theta$, $\kappa_2 = \alpha_{t,u,z}\theta$, $\kappa_3 = o_{t,u,z} + \beta_{t,u,z}\psi$, $\kappa_4 = \beta_{t,u,z}\psi$, $\kappa_a = n_{t,z,v} + \gamma_{t,z,v}\phi$, and $\kappa_b = \gamma_{t,z,v}\phi$. Here, we let $\theta$, $\psi$ and $\phi$ abbreviate $\theta_{t-1,u,z}$, $\psi_{t-1,u,z}$ and $\phi_{t-1,z,v}$, respectively. Based on the above joint distribution (4) and using the chain rule, we can conveniently obtain the following conditional distribution for the proposed Gibbs sampling (step 5 of Algorithm 1):

$P(z_{t,u,d} = z \mid \mathbf{z}_{t,\neg(u,d)}, \mathbf{d}_t, \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t) \propto \left( (1-\lambda)\, \frac{m_{t,u,z} + \alpha_{t,u,z}\theta - 1}{\sum_{z=1}^{Z}(m_{t,u,z} + \alpha_{t,u,z}\theta) - 1} + \lambda\, \frac{o_{t,u,z} + \beta_{t,u,z}\psi - 1}{\sum_{z=1}^{Z}(o_{t,u,z} + \beta_{t,u,z}\psi) - 1} \right) \cdot \frac{\prod_{v \in d} \prod_{j=1}^{N_{d,v}} (n_{t,z,v,\neg(u,d)} + \gamma_{t,z,v}\phi + j - 1)}{\prod_{i=1}^{N_d} (n_{t,z,\neg(u,d)} + i - 1 + \sum_{v=1}^{V} \gamma_{t,z,v}\phi)},$ (5)

where $N_d$, $N_{d,v}$, $\mathbf{z}_{t,\neg(u,d)}$, $n_{t,z,v,\neg(u,d)}$ and $n_{t,z,\neg(u,d)}$ are the length of document $d$, the number of times word $v$ appears in $d$, the topic assignments for all documents except document $d$ from user $u$ at $t$, the number of times word $v$ is assigned to topic $z$ in all documents except the one from user $u$ at $t$, and the number of documents assigned to $z$ in all documents except the one from user $u$ at $t$, respectively.
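As a rough illustration of step 5 of Algorithm 1, the following sketch (our own simplification, assuming NumPy; the "-1" exclusion terms in (5) are assumed to be folded into the counts passed in) draws a single topic for one short document:

```python
import numpy as np

def sample_topic(m_u, o_u, n_zv, n_z, doc, theta_prev, psi_prev, phi_prev,
                 alpha_u, beta_u, gamma, lam, rng):
    # m_u, o_u: (Z,) document-topic counts for the user / his followees;
    # n_zv: (Z, V) word-topic counts; n_z: (Z,) aggregate counts per topic;
    # doc: list of word ids. All counts already exclude the current document.
    Z = len(m_u)
    user = m_u + alpha_u * theta_prev
    foll = o_u + beta_u * psi_prev
    prior = (1.0 - lam) * user / user.sum() + lam * foll / foll.sum()

    log_like = np.zeros(Z)  # word term of Eq. (5): one topic per document
    for z in range(Z):
        pseudo = gamma[z] * phi_prev[z]  # gamma_{t,z,v} * phi_{t-1,z,v}
        seen = {}                        # occurrences of each word so far in doc
        for i, v in enumerate(doc):
            j = seen.get(v, 0)
            log_like[z] += np.log(n_zv[z, v] + pseudo[v] + j)
            log_like[z] -= np.log(n_z[z] + i + pseudo.sum())
            seen[v] = j + 1

    p = prior * np.exp(log_like - log_like.max())  # stabilize, then normalize
    return int(rng.choice(Z, p=p / p.sum()))
```

In Algorithm 1 this draw is followed by incrementing the counts for the sampled topic (step 6).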
At each iteration during the sampling (steps 2 to 7 of Algorithm 1), the precision parameters $\alpha_t$, $\beta_t$ and $\gamma_t$ can be estimated by maximizing the joint distribution (4). We apply fixed-point iterations to obtain the optimal $\alpha_t$, $\beta_t$ and $\gamma_t$. By applying the two bounds in (Minka 2000), we can derive the following update rules of $\alpha_t$, $\beta_t$ and $\gamma_t$ for maximizing the joint distribution in our fixed-point iterations:

$\alpha_{t,u,z} \leftarrow (1-\lambda)\,\alpha_{t,u,z}\, \frac{\Psi(m_{t,u,z} + \alpha_{t,u,z}\theta) - \Psi(\alpha_{t,u,z}\theta)}{\Psi\left(\sum_{z=1}^{Z} m_{t,u,z} + \alpha_{t,u,z}\theta\right) - \Psi\left(\sum_{z=1}^{Z} \alpha_{t,u,z}\theta\right)},$
$\beta_{t,u,z} \leftarrow \lambda\,\beta_{t,u,z}\, \frac{\Psi(o_{t,u,z} + \beta_{t,u,z}\psi) - \Psi(\beta_{t,u,z}\psi)}{\Psi\left(\sum_{z=1}^{Z} o_{t,u,z} + \beta_{t,u,z}\psi\right) - \Psi\left(\sum_{z=1}^{Z} \beta_{t,u,z}\psi\right)},$ (6)
$\gamma_{t,z,v} \leftarrow \gamma_{t,z,v}\, \frac{\Psi(n_{t,z,v} + \gamma_{t,z,v}\phi) - \Psi(\gamma_{t,z,v}\phi)}{\Psi\left(\sum_{v=1}^{V} n_{t,z,v} + \gamma_{t,z,v}\phi\right) - \Psi\left(\sum_{v=1}^{V} \gamma_{t,z,v}\phi\right)},$

where $\Psi(x) = \partial \log \Gamma(x) / \partial x$ is the digamma function. Our derivations of the update rules for $\alpha_t$, $\beta_t$ and $\gamma_t$ in (6) are analogous to those in (Liang, Yilmaz, and Kanoulas 2018; Liang et al. 2017c; 2017b). After the Gibbs sampling is done, using the fact that the Dirichlet distribution is conjugate to the multinomial distribution, we can conveniently infer each user's interest distribution $\theta_{t,u}$, his collaborative interest distribution $\psi_{t,u}$ and the words' topic distribution $\phi_{t,z}$ at $t$, respectively, as:

$\theta_{t,u,z} = \frac{m_{t,u,z} + \alpha_{t,u,z}}{\sum_{z'=1}^{Z}(m_{t,u,z'} + \alpha_{t,u,z'})}, \quad \psi_{t,u,z} = \frac{o_{t,u,z} + \beta_{t,u,z}}{\sum_{z'=1}^{Z}(o_{t,u,z'} + \beta_{t,u,z'})}, \quad \phi_{t,z,v} = \frac{n_{t,z,v} + \gamma_{t,z,v}}{\sum_{v'=1}^{V}(n_{t,z,v'} + \gamma_{t,z,v'})}.$

Streaming Keyword Diversification Model

After we obtain $\theta_{t,u}$, $\psi_{t,u}$ and $\phi_{t,z}$, inspired by the PM-2 diversification method (Dang and Croft 2012), we closely follow the work in (Liang et al. 2017c; 2018) and propose a streaming keyword diversification model, SKDM (Algorithm 2).

Algorithm 2: SKDM model for generating top-k keywords for collaborative, dynamic, diversified user profiling.
  Input: Current distributions $\Theta_t$ and $\Phi_t$.
  Output: All users' profiling results at time $t$, $\mathcal{W}_t$.
  1: for $u = 1, \ldots, |\mathbf{u}_t|$ do
  2:   $\mathbf{w}_{t,u} \leftarrow \emptyset$  /* $\mathbf{w}_{t,u} \in \mathcal{W}_t$ */
  3:   $e_v \leftarrow \mathbf{v}$  /* candidate keywords: the whole vocabulary */
  4:   for $z = 1, \ldots, Z$ do
  5:     $\delta_{t,u,z} \leftarrow (1-\lambda) P(z \mid t, u) + \lambda P(z \mid t, \mathbf{f}_{t,u})$
  6:     $s_{z|t,u} \leftarrow 0$
  7:   for all positions in the ranked list $\mathbf{w}_{t,u}$ do
  8:     for $z = 1, \ldots, Z$ do
  9:       $q_t[z \mid t, u] \leftarrow \delta_{t,u,z} / (2 s_{z|t,u} + 1)$
  10:    $z^* \leftarrow \arg\max_z q_t[z \mid t, u]$
  11:    $v^* \leftarrow \arg\max_{v \in e_v} \eta_1\, q_t[z^* \mid t, u]\, P(v \mid t, z^*) + \eta_2 \sum_{z \neq z^*} q_t[z \mid t, u]\, P(v \mid t, z) + (1 - \eta_1 - \eta_2)\, \mathrm{tfidf}(v \mid t, u)$
  12:    $\mathbf{w}_{t,u} \leftarrow \mathbf{w}_{t,u} \cup \{v^*\}$  /* append $v^*$ to $\mathbf{w}_{t,u}$ */
  13:    $e_v \leftarrow e_v \setminus \{v^*\}$  /* remove $v^*$ from $e_v$ */
  14:    for $z = 1, \ldots, Z$ do
  15:      $s_{z|t,u} \leftarrow s_{z|t,u} + P(v^* \mid t, z) / \sum_{z'=1}^{Z} P(v^* \mid t, z')$

To generate the top-$k$ diversified keywords for each user $u$ at $t$, SKDM starts with an empty keyword set $\mathbf{w}_{t,u}$ with $k$ empty seats (step 2 of Algorithm 2) and a set of candidate keywords (step 3), $e_v$, which initially contains all the words in the vocabulary, i.e., $e_v = \mathbf{v}$. For each of the seats, it computes the quotient $q_t[z \mid t, u]$ for each topic $z$ given user $u$ at $t$ by the Sainte-Laguë formula (step 9):

$q_t[z \mid t, u] = \frac{\delta_{t,u,z}}{2 s_{z|t,u} + 1},$

where $\delta_{t,u,z}$ is the final probability that user $u$ is interested in topic $z$ at $t$, set to $\delta_{t,u,z} = (1-\lambda) P(z \mid t, u) + \lambda P(z \mid t, \mathbf{f}_{t,u})$ (step 5), and $s_{z|t,u}$ is the number of seats occupied by topic $z$ (at initialization, $s_{z|t,u} = 0$ for all topics, step 6). Here $P(z \mid t, u)$ and $P(z \mid t, \mathbf{f}_{t,u})$ are the probabilities of user $u$'s own interest and his collaborative interest in topic $z$ at $t$, respectively. We can obtain both from our CITM, with $P(z \mid t, u) = \theta_{t,u,z}$ and $P(z \mid t, \mathbf{f}_{t,u}) = \psi_{t,u,z}$, i.e., we have:

$\delta_{t,u} = (1-\lambda)\,\theta_{t,u} + \lambda\,\psi_{t,u},$ (7)

where $\delta_{t,u} = \{\delta_{t,u,z}\}_{z=1}^{Z}$ is user $u$'s final interest distribution, inferred from both his own and his collaborative information at time $t$.
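A small sketch of how the posterior estimates above combine with Eq. (7) into the final interest distribution (our own illustration, assuming NumPy arrays of Gibbs counts):

```python
import numpy as np

def final_interest(m_u, o_u, alpha_u, beta_u, lam):
    # Posterior estimates of theta_{t,u} and psi_{t,u} from the Gibbs counts
    # (m_{t,u,z}, o_{t,u,z}) and precisions, as given above, followed by the
    # linear mixture of Eq. (7). All arrays have shape (Z,).
    theta = (m_u + alpha_u) / (m_u + alpha_u).sum()
    psi = (o_u + beta_u) / (o_u + beta_u).sum()
    return (1.0 - lam) * theta + lam * psi  # delta_{t,u}
```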
According to the Sainte-Laguë method, seats should be awarded to the topic with the largest quotient in order to best maintain the proportionality of the result list. Therefore, our SKDM assigns the current seat to the topic $z^*$ with the largest quotient (step 10). The keyword to fill this seat should be relevant not only to topic $z^*$ but also to the other topics, and should be specific to the user; we therefore obtain the keyword $v^*$ for user $u$'s profile at $t$ as (step 11):

$v^* \leftarrow \arg\max_{v \in e_v} \; \eta_1\, q_t[z^* \mid t, u]\, P(v \mid t, z^*) + \eta_2 \sum_{z \neq z^*} q_t[z \mid t, u]\, P(v \mid t, z) + (1 - \eta_1 - \eta_2)\, \mathrm{tfidf}(v \mid t, u),$

where $0 \le \eta_1, \eta_2 \le 1$ are two free parameters satisfying $0 \le \eta_1 + \eta_2 \le 1$, $P(v \mid t, z)$ is the probability that $v$ is associated with topic $z$ at time $t$ and thus can be set to $P(v \mid t, z) = \phi_{t,z,v}$, and $\mathrm{tfidf}(v \mid t, u)$ is a time-sensitive term frequency-inverse document frequency function for user $u$ at $t$, defined as:

$\mathrm{tfidf}(v \mid t, u) = \mathrm{tf}(v \mid \mathbf{d}_{t,u}) \cdot \mathrm{idf}(v \mid u, \mathbf{d}_t),$ (8)

where $\mathrm{tf}(v \mid \mathbf{d}_{t,u}) = \frac{|\{d \in \mathbf{d}_{t,u} : v \in d\}|}{|\mathbf{d}_{t,u}|}$ is a term frequency function that computes the fraction of the documents in $\mathbf{d}_{t,u}$ that contain the word $v$, and $\mathrm{idf}(v \mid u, \mathbf{d}_t) = \log \frac{|\mathbf{d}_t|}{|\{d \in \mathbf{d}_t : v \in d\}| + \epsilon}$ is an inverse document frequency function, with $\epsilon$ set to 1 to avoid division by zero. According to (8), if $v$ appears frequently in the document set $\mathbf{d}_{t,u}$ generated by user $u$ but not frequently in the document set $\mathbf{d}_t$ generated by all the users, $\mathrm{tfidf}(v \mid t, u)$ returns a high score.

After the word $v^*$ is selected, SKDM adds $v^*$ as a result keyword to $\mathbf{w}_{t,u}$, i.e., $\mathbf{w}_{t,u} \leftarrow \mathbf{w}_{t,u} \cup \{v^*\}$ (step 12), removes it from the candidate word set, i.e., $e_v \leftarrow e_v \setminus \{v^*\}$ (step 13), and increases the number of seats occupied by each topic $z$ by its normalized relevance to $v^*$ (step 15): $s_{z|t,u} \leftarrow s_{z|t,u} + P(v^* \mid t, z) / \sum_{z'=1}^{Z} P(v^* \mid t, z')$. The process (steps 7 to 15) repeats until we obtain $k$ diversified keywords. The order in which a keyword is appended to $\mathbf{w}_{t,u}$ determines its ranking in the profile. After the process is done, we obtain a set of diversified keywords $\mathbf{w}_{t,u}$ that profiles user $u$ at $t$.
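The greedy seat-filling loop of Algorithm 2 (steps 7 to 15) can be sketched as follows; this is our own simplification, assuming NumPy, with `delta` the mixture from Eq. (7), `phi` the (Z, V) matrix of P(v|t, z), and `tfidf_scores` a precomputed vector of the scores from Eq. (8). The default values of eta1 and eta2 are placeholders, not the paper's tuned settings:

```python
import numpy as np

def skdm_topk(delta, phi, tfidf_scores, k, eta1=0.4, eta2=0.3):
    """Greedy PM-2-style keyword selection (steps 7-15 of Algorithm 2)."""
    Z, V = phi.shape
    seats = np.zeros(Z)            # s_{z|t,u}: seats occupied per topic
    candidates = set(range(V))     # e_v: word ids of the whole vocabulary
    profile = []                   # w_{t,u}: ranked keyword ids
    for _ in range(k):
        q = delta / (2.0 * seats + 1.0)  # Sainte-Lague quotients, step 9
        z_star = int(np.argmax(q))       # step 10
        # Step 11: relevance to z_star, to the other topics, and to the user.
        score = (eta1 * q[z_star] * phi[z_star]
                 + eta2 * (q @ phi - q[z_star] * phi[z_star])
                 + (1.0 - eta1 - eta2) * tfidf_scores)
        v_star = max(candidates, key=lambda v: score[v])
        profile.append(v_star)           # step 12
        candidates.remove(v_star)        # step 13
        seats += phi[:, v_star] / phi[:, v_star].sum()  # step 15
    return profile  # ids map back to vocabulary words for the final profile
```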
Experimental Setup

Research Questions. The research questions guiding the remainder of the paper are: (RQ1) How does UPA perform for user profiling compared to state-of-the-art methods? (RQ2) What is the contribution of the proposed interest tracking topic model, CITM, to the overall performance of UPA? (RQ3) What is the contribution of the collaborative information for user profiling? (RQ4) What is the impact of the length of the time intervals, $t_i - t_{i-1}$, in UPA?

Dataset. We work with a dataset collected from Twitter (crawled from https://dev.twitter.com/). It contains 1,375 active, randomly selected users and their tweets posted from the beginning of their registrations up to May 31, 2015. According to the statistics, most of the users are followed by 2 to 50 followers. In total, we have 7.52 million tweets with timestamps, including those from the users' followees. The average length of the tweets is 12 words. We use this dataset as our stream of short documents.

We obtain two categories of ground truths: one for evaluating relevance-oriented performance (RGT) and another for evaluating diversity-oriented performance (DGT). To create the RGT ground truth, we split the dataset into 5 different partitions of time periods: a week, a month, a quarter, half a year and a year, respectively. For each Twitter user at every specific time period, an annotator was asked to generate a ranked list of top-k relevant keywords (k was decided by the annotators) as the user's profile. In total, 68 annotators took part in the labelling, each of them labelling about 5 Twitter users for these 5 different partitions. To create the ground truth for diversity evaluation, DGT, as it is expensive to manually obtain aspects of the keywords from annotators, we cluster the relevant keywords by k-means (MacQueen 1967) on their word embeddings (publicly available from https://nlp.stanford.edu/projects/glove/) into 15 categories (category information is available from http://dmoztools.net). Relevant keywords within a cluster are regarded as being relevant to the same aspect in the DGT ground truth.
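The clustering step of the DGT construction can be sketched as follows (our own illustration, assuming scikit-learn and a hypothetical `embed` lookup from keyword to its GloVe vector):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_keywords(keywords, embed, n_categories=15, seed=0):
    """Group relevant keywords into aspect categories for the DGT
    ground truth by k-means over their word embeddings."""
    X = np.stack([embed[w] for w in keywords])  # (num_keywords, dim)
    labels = KMeans(n_clusters=n_categories, random_state=seed).fit_predict(X)
    # Keywords sharing a cluster label are treated as the same aspect.
    return {w: int(c) for w, c in zip(keywords, labels)}
```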
Baselines. We make comparisons between our UPA and the following state-of-the-art baseline algorithms:
(1) tfidf: simply utilizes (8), i.e., the content of users' documents, to retrieve top-k keywords as profiles for the users.
(2) Predictive Language Model (PLM): models the dynamics of personal interests via a probabilistic language model (Fang and Godavarthy 2014).
(3) Latent Dirichlet Allocation (LDA): infers topic distributions specific to each document via the LDA model (Blei, Ng, and Jordan 2003).
(4) Author Topic model (AuthorT): infers topic distributions specific to each user in a static dataset (Rosen-Zvi et al. 2004).
(5) Dynamic Topic Model (DTM): a dynamic model that utilizes a Gaussian distribution to infer topic distributions of long documents in streams (Blei and Lafferty 2006).
(6) Topics over Time model (ToT): a dynamic model that normalizes the timestamps of long documents in a collection and then infers a topic distribution for each document (Wang and McCallum 2006).
(7) Topic Tracking Model (TTM): a dynamic model that captures the dynamic topic distributions of long documents arriving at time t in streams of long documents (Iwata et al. 2009).
(8) GSDMM: a Gibbs Sampling-based Dirichlet Multinomial Mixture model that assigns one topic to each short document in a static collection (Yin and Wang 2014).

For fair comparisons, the topic model baselines, GSDMM, TTM, ToT, DTM and LDA, use both each user's interests $\theta_{t,u}$ and their collaborative interests for profiling. As these baselines cannot directly infer collaborative interest distributions, we use the average interests of a user's followees as his collaborative interest distribution. Thus, unlike (7), in these baselines we use the mixture $\delta_{t,u} = (1-\lambda)\theta_{t,u} + \lambda \frac{1}{|\mathbf{f}_{t,u}|} \sum_{u' \in \mathbf{f}_{t,u}} \theta_{t,u'}$ to represent each user's final interest distribution, with $\theta_{t,u}$ inferred by the corresponding baseline topic model. The baselines tfidf, PLM and AuthorT are static profiling algorithms, while the others are dynamic. Again, for fair comparisons, UPA and all the other topic models use our SKDM algorithm to obtain the top-k keywords. We set the number of topics $Z = 20$ in all the topic models. For tuning the parameters $\lambda$, $\eta_1$ and $\eta_2$, we use a 70%/20%/10% split for our training, validation and test sets, respectively. The train/validation/test splits are permuted until all users have been chosen once for the test set. We repeat the experiments 10 times and report the average results.

For further analysis of the contribution of the collaborative interests $\psi_{t,u}$ inferred by our CITM model to the profiling, we use another baseline, denoted UPAavg, in which $\delta_{t,u}$ is set to $(1-\lambda)\theta_{t,u} + \lambda \frac{1}{|\mathbf{f}_{t,u}|} \sum_{u' \in \mathbf{f}_{t,u}} \theta_{t,u'}$, with $\theta_{t,u}$ inferred by CITM. Note that we still denote the proposed profiling algorithm using (7) as UPA.

Evaluation Metrics

We use standard relevance-oriented evaluation metrics, Pre@k (Precision at k), NDCG@k (Normalized Discounted Cumulative Gain at k), MRR@k (Mean Reciprocal Rank at k), and MAP@k (Mean Average Precision at k) (Croft, Metzler, and Strohman 2015), and diversity-oriented metrics, Pre-IA@k (Intent-Aware Pre@k) (Agrawal et al. 2009), α-NDCG@k (Clarke et al. 2008), MRR-IA@k (Agrawal et al. 2009), and MAP-IA@k (Agrawal et al. 2009). We also propose semantic versions of the original metrics, denoted Pre-S@k, NDCG-S@k, MRR-S@k, MAP-S@k, Pre-IA-S@k, α-NDCG-S@k, MRR-IA-S@k, and MAP-IA-S@k, respectively. The only difference between the original metrics and the corresponding semantic ones is the way the relevance score of a retrieved keyword $v$ to a ground-truth keyword $v_{gt}$ is computed. For the original metrics, we let the relevance score be 1 if and only if $v = v_{gt}$, and 0 otherwise; for the semantic versions, we let the relevance score be the cosine similarity between the word embedding vectors of $v$ and $v_{gt}$. Since we usually do not choose too many keywords to describe a user's profile, we compute the scores at depth 10, i.e., $k = 10$. For all the metrics, we abbreviate M@k as M, where M is one of the metrics.
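The semantic relevance score used by the -S metrics can be sketched as follows (assuming NumPy and the same hypothetical `embed` lookup from word to embedding vector):

```python
import numpy as np

def semantic_relevance(v, v_gt, embed):
    """Relevance of a retrieved keyword v to a ground-truth keyword v_gt
    in the -S metrics: cosine similarity of their word embeddings."""
    a, b = embed[v], embed[v_gt]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```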
Results and Discussions

In this section, we analyse our experimental results.

Overall Performance. We start by answering research question RQ1. Tables 1 and 2 report the relevance and diversification performance, respectively.

Table 1: Relevance performance of UPA, UPAavg and the baselines using time periods of a month. Statistically significant differences between UPAavg and GSDMM, and between UPA and UPAavg, are marked in the upper right-hand corner of the UPAavg and UPA scores, respectively. Statistical significance is tested using a two-tailed paired t-test at α = .01 and α = .05.

          Pre   NDCG  MRR   MAP   Pre-S  NDCG-S  MRR-S  MAP-S
  tfidf   .254  .229  .375  .135  .409   .392    .853   .203
  PLM     .273  .239  .668  .140  .417   .398    .870   .212
  LDA     .281  .252  .674  .142  .424   .407    .878   .217
  AuthorT .288  .260  .674  .145  .429   .408    .897   .220
  DTM     .295  .270  .694  .153  .436   .419    .883   .226
  TTM     .301  .276  .728  .156  .440   .426    .882   .228
  ToT     .312  .283  .744  .158  .445   .428    .884   .230
  GSDMM   .321  .301  .746  .163  .452   .437    .891   .236
  UPAavg  .367  .361  .840  .195  .483   .468    .939   .262
  UPA     .399  .398  .860  .211  .501   .490    .946   .274

Table 2: Diversification performance of UPA, UPAavg and the baselines using time periods of a month. Notational conventions for statistical significance are as in Table 1.

          Pre-IA  α-NDCG  MRR-IA  MAP-IA  Pre-IA-S  α-NDCG-S  MRR-IA-S  MAP-IA-S
  tfidf   .157    .187    .480    .185    .257      .325      .725      .150
  PLM     .162    .192    .487    .187    .265      .332      .742      .152
  LDA     .171    .203    .493    .192    .272      .338      .744      .155
  AuthorT .174    .205    .505    .195    .276      .343      .748      .157
  DTM     .177    .206    .507    .197    .279      .347      .748      .159
  TTM     .180    .208    .509    .221    .282      .351      .751      .162
  ToT     .182    .213    .513    .225    .290      .355      .754      .170
  GSDMM   .194    .228    .525    .237    .304      .368      .780      .173
  UPAavg  .238    .265    .597    .252    .362      .421      .808      .216
  UPA     .266    .302    .623    .266    .395      .452      .814      .231

The following findings can be observed from Tables 1 and 2: (1) In terms of both relevance and diversity, all the topic model-based profiling algorithms, i.e., UPA, UPAavg, GSDMM, ToT, TTM, DTM, AuthorT and LDA, outperform the traditional algorithms, i.e., PLM and tfidf, which demonstrates that topic modeling does help to profile users' interests. (2) UPA and UPAavg outperform all the baseline models in terms of relevance and diversity on all the metrics, which confirms the effectiveness of the proposed user profiling algorithm for the task. (3) The ordering of the methods, UPA > UPAavg > GSDMM > ToT ≈ TTM ≈ DTM ≈ AuthorT ≈ LDA > PLM > tfidf, is mostly consistent across the two ground truths and on the relevance and diversity evaluation metrics. Here A > B denotes that method A statistically significantly outperforms method B, and A ≈ B denotes that we did not observe a significant difference between A and B. This, once again, confirms that UPA and its averaged version, UPAavg, outperform all the baselines. (4) In most cases, UPA > UPAavg holds, which confirms that the collaborative information inferred by the proposed topic model, CITM, does help to improve the profiling performance.

Additionally, Table 3 shows the top six keywords of an example user's dynamic profile, with time being five quarters from April 2014 to May 2015. As shown in the table, the diversified keywords generated by UPA are semantically closer to those from the ground truth than those generated by the baseline, GSDMM, which again demonstrates the effectiveness of the proposed UPA algorithm.

[Table 3: Top six keywords of an example user's dynamic profile over five quarters from April 2014 to May 2015, listing, for each quarter, the keywords from the DGT ground truth, from GSDMM and from UPA.]

Contribution of CITM. We now turn to answer research question RQ2. Recall that the only difference between our UPA/UPAavg and the baselines is that UPA utilizes our CITM to track users' dynamic interests and then our SKDM to diversify the keywords, whereas the other topic models utilize different topic models to obtain users' interests and then the same SKDM for keyword diversification. As shown in Tables 1 and 2, UPA/UPAavg outperforms all the topic models, i.e., GSDMM, ToT, TTM, DTM, AuthorT and LDA, which illustrates that the proposed topic model, CITM, is indeed effective and contributes significantly to the performance of our user profiling algorithm.
Contribution of Collaborative Interests. Here we turn to answer research question RQ3. We vary the parameter λ that governs how much the collaborative information, $\psi_{t,u}$, is utilized for profiling; a larger λ indicates that more collaborative information is utilized. Fig. 2 shows the performance on the relevance and diversity evaluation metrics (using Precision and Pre-IA as representative metrics only), where we use the best baseline, GSDMM, as a representative. When we increase λ from 0 to 0.6, i.e., give more weight to the collaborative information, the performance of all the models gradually improves, with UPA still outperforming UPAavg and GSDMM. This, again, illustrates that integrating collaborative information into the models helps to improve performance. Moreover, as shown in Fig. 2, UPA, which utilizes the inferred collaborative interests, outperforms UPAavg, which simply utilizes the average of the followees' interests as its collaborative interests; this once again demonstrates that the collaborative interests inferred in UPA are effective.

[Figure 2: Relevance and diversity performance of UPA, UPAavg and GSDMM on the representative metrics, Precision and Pre-IA, with varying values of λ.]

Impact of Time Period Length. Finally, we answer research question RQ4. We compare the performance for different time period lengths, a week, a month, a quarter, half a year and a year, respectively, using the two ground truths, RGT and DGT, on the representative relevance and diversity metrics, Precision and Pre-IA, in Fig. 3. As shown in Fig. 3, UPA and UPAavg beat the baselines for time periods of all lengths, which illustrates that our proposed user profiling algorithm works better than the state-of-the-art ones for dynamic user profiling regardless of period length. The performance of UPA, UPAavg and the best baseline, GSDMM, improves significantly on all the metrics when the period length increases from a week to a quarter, whereas it reaches a plateau as the time period further increases from a quarter to a year. In all cases, UPA and UPAavg significantly outperform the best baseline, GSDMM. These findings illustrate that the performance of the proposed algorithm is robust and maintains significant improvements over the state-of-the-art non-dynamic and dynamic algorithms. In addition, UPA always outperforms UPAavg on all the metrics and all period lengths, which once again illustrates that the collaborative interest distribution inferred by the proposed CITM model helps to enhance user profiling performance.

[Figure 3: Relevance and diversity performance of UPA, UPAavg and GSDMM for time periods of a week, a month, a quarter, half a year, and a year.]

Conclusions

We have studied the problem of collaborative, dynamic and diversified user profiling in the context of streams of short texts. To tackle the problem, we have proposed a streaming profiling algorithm, UPA, that consists of two models: the proposed collaborative interest tracking topic model, CITM, and the proposed streaming keyword diversification model, SKDM.
Our CITM tracks the changes of users' and their followees' interest distributions in streams of short texts, a sequentially organized corpus of short texts, and our SKDM diversifies the top-k keywords for profiling users' dynamic interests. To effectively infer users' and their followees' dynamic interest distributions in our CITM model, we have proposed a collapsed Gibbs sampling algorithm, in which a single topic is assigned to each document during sampling to address the textual sparsity problem. We have conducted experiments on a Twitter dataset. We evaluated the performance of our UPA and the baseline algorithms using two categories of ground truths, on both the original metrics and the proposed semantic versions of the metrics. Experimental results show that our UPA is able to profile users' dynamic interests over time for streams of short texts. In the future, we intend to utilize auxiliary resources, such as the Wikipedia articles that entities in the short documents link to, for further improvement of user profiling.

References

Agrawal, R.; Gollapudi, S.; Halverson, A.; and Ieong, S. 2009. Diversifying search results. In WSDM, 5-14.
Balog, K., and de Rijke, M. 2007. Determining expert profiles (with an application to expert finding). In IJCAI, 2657-2662.
Balog, K.; Bogers, T.; Azzopardi, L.; de Rijke, M.; and van den Bosch, A. 2007. Broad expertise retrieval in sparse data environments. In SIGIR, 551-558.
Balog, K.; Fang, Y.; de Rijke, M.; Serdyukov, P.; and Si, L. 2012. Expertise retrieval. Found. Trends Inf. Retr. 6:127-256.
Berendsen, R.; de Rijke, M.; Balog, K.; Bogers, T.; and van den Bosch, A. 2013. On the assessment of expertise profiles. JASIST 64(10):2024-2044.
Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models. In ICML, 113-120.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3:993-1022.
Clarke, C. L. A.; Kolla, M.; Cormack, G. V.; Vechtomova, O.; Ashkan, A.; Büttcher, S.; and MacKinnon, I. 2008. Novelty and diversity in information retrieval evaluation. In SIGIR, 659-666.
Craswell, N.; de Vries, A. P.; and Soboroff, I. 2005. Overview of the TREC 2005 enterprise track. In TREC 2005, 1-7.
Croft, W. B.; Metzler, D.; and Strohman, T. 2015. Search Engines: Information Retrieval in Practice. Addison-Wesley.
Dang, V., and Croft, W. B. 2012. Diversity by proportionality: An election-based approach to search result diversification. In SIGIR, 65-74.
Fang, Y., and Godavarthy, A. 2014. Modeling the dynamics of personal expertise. In SIGIR, 1107-1110.
Griffiths, T. L., and Steyvers, M. 2004. Finding scientific topics. PNAS 101:5228-5235.
Hofmann, T. 1999. Probabilistic latent semantic indexing. In SIGIR, 50-57.
Iwata, T.; Watanabe, S.; Yamada, T.; and Ueda, N. 2009. Topic tracking model for analyzing consumer purchase behavior. In IJCAI, 1427-1432.
Iwata, T.; Yamada, T.; Sakurai, Y.; and Ueda, N. 2010. Online multiscale dynamic topic models. In KDD, 663-672.
Liang, S.; Ren, Z.; Yilmaz, E.; and Kanoulas, E. 2017a. Collaborative user clustering for short text streams. In AAAI, 3504-3510.
Liang, S.; Ren, Z.; Zhao, Y.; Ma, J.; Yilmaz, E.; and de Rijke, M. 2017b. Inferring dynamic user interests in streams of short texts for user clustering. ACM Trans. Inf. Syst. 36(1):10:1-10:37.
Liang, S.; Yilmaz, E.; Shen, H.; de Rijke, M.; and Croft, W. B. 2017c. Search result diversification in short text streams. ACM Trans. Inf. Syst. 36(1):8:1-8:35.
Liang, S.; Zhang, X.; Ren, Z.; and Kanoulas, E. 2018. Dynamic embeddings for user profiling in Twitter. In KDD, 1764-1773.
Liang, S.; Yilmaz, E.; and Kanoulas, E. 2018. Collaboratively tracking interests for user clustering in streams of short texts. IEEE Transactions on Knowledge and Data Engineering.
Liang, S. 2018. Dynamic user profiling for streams of short texts. In AAAI, 5860-5867.
Liu, J. S. 1994. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Am. Stat. Assoc. 89(427):958-966.
MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297.
Minka, T. 2000. Estimating a Dirichlet distribution. Technical report, MIT.
Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; and Smyth, P. 2004. The author-topic model for authors and documents. In UAI, 487-494.
Rybak, J.; Balog, K.; and Nørvåg, K. 2014. Temporal expertise profiling. In ECIR, 540-546.
Wallach, H. M. 2006. Topic modeling: Beyond bag-of-words. In ICML, 977-984.
Wang, X., and McCallum, A. 2006. Topics over time: A non-Markov continuous-time model of topical trends. In KDD, 424-433.
Wei, X.; Sun, J.; and Wang, X. 2007. Dynamic mixture models for multiple time-series. In IJCAI, 2909-2914.
Yin, J., and Wang, J. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In KDD, 233-242.