The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Collaborative, Dynamic and Diversified User Profiling

Shangsong Liang1,2
1School of Data and Computer Science, Sun Yat-sen University, China
2Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China
liangshangsong@gmail.com

Abstract

In this paper, we study the problem of dynamic user profiling in the context of streams of short texts. Previous work on user profiling works with long documents, does not consider collaborative information, and does not diversify the keywords used to profile users' interests. In contrast, we address the problem by proposing a user profiling algorithm (UPA), which consists of two models: the proposed collaborative interest tracking topic model (CITM) and the proposed streaming keyword diversification model (SKDM). UPA first utilizes CITM to collaboratively track each user's and his followees' dynamic interest distributions in the context of streams of short texts, and then utilizes SKDM to obtain top-k relevant and diversified keywords to profile users' interests at a specific point in time. Experiments were conducted on a Twitter dataset, and we found that UPA outperforms state-of-the-art non-dynamic and dynamic user profiling algorithms.

Introduction

Capturing users' dynamic interests underlying their posts on microblogging platforms such as Twitter is important to the success of applications that cater to users of such platforms, such as dynamic user clustering (Liang et al. 2017a; 2017b). In this paper, we study the problem of user profiling for streaming short documents (Balog et al. 2012; Liang 2018; Liang et al. 2018): collaboratively identifying users' dynamic interests and tracking how they evolve over time in the context of streaming short texts. Our goal is to infer users' own and their collaborative topic distributions over time and to dynamically profile their interests with a set of diversified keywords in the context of streaming short texts.

The first user profiling model was proposed in (Balog et al. 2007), where a set of relevant keywords was identified for each user in a static collection of long documents and the dynamics of users' interests were ignored. Recent work recognizes the importance of capturing users' dynamic interests over time, and a number of temporal profiling algorithms have been proposed for streams of long documents. However, previous work on user profiling suffers from the following problems: (1) It works with streams of long documents rather than short documents and assumes that the content of documents is rich enough to infer users' dynamic interests. (2) It ignores collaborative information, such as friends' messages, when inferring users' interests at a specific point in time. (3) It simply retrieves a list of top-k keywords as a user's profile, and these keywords may be semantically similar to each other and thus redundant.

Accordingly, in this paper, to address the aforementioned drawbacks of previous work, we propose a User Profiling Algorithm in the context of streams of short documents, abbreviated as UPA, which is collaborative, dynamic and diversified. UPA consists of two proposed models: a Collaborative Interest Tracking topic Model, abbreviated as CITM, and a Streaming Keyword Diversification Model, abbreviated as SKDM.
The UPA algorithm first utilizes our proposed CITM to track the changes of users' dynamic interests in the context of streams of short documents. It then utilizes our proposed SKDM to produce top-k diversified keywords for profiling users' interests at a specific point in time. Our CITM topic model works with streaming short texts and is a dynamic multinomial Dirichlet mixture topic model that is able to infer and track each user's dynamic interest distributions based not only on the user's own posts but also on collaborative information, i.e., his followees' posts. Our hypothesis in CITM is that accounting for collaborative information is critical, especially for users with limited activity, infrequent short posts, and thus sparse information. To perform the inference of users' interest distributions in streams of short documents, we propose a collapsed Gibbs sampling algorithm. Our SKDM model then works with the users' dynamic interest distributions produced by CITM and aims at retrieving a set of relevant and also diversified keywords for profiling users' interests at time t, such that redundancy among the retrieved keywords is avoided while keywords relevant to the users are still kept.

Our contributions are: (1) We propose a user profiling algorithm, UPA, to address the user profiling task in the context of streams of short documents. (2) We propose a topic model, CITM, that can collaboratively and dynamically track each user's and his followees' interests. (3) We propose a collapsed Gibbs sampling algorithm to infer users' and their followees' interest distributions. (4) We propose a streaming keyword diversification model, SKDM, to diversify the top-k keywords as users' profiling results at time t.

Related Work

User Profiling. User profiling has been gaining attention since the launch of the expert finding task at the TREC 2005 enterprise track (Craswell, de Vries, and Soboroff 2005). Balog and de Rijke (2007) worked with a static corpus of long documents and modeled the profile of a user as a set of relevant keywords. Recent work is aware of the importance of temporal user profiling. Temporal profiling for long documents was first introduced in (Rybak, Balog, and Nørvåg 2014), where topical areas were organized in a predefined taxonomy and interests were represented as a weighted, unchanging tree built from the ACM classification system. A probabilistic model was proposed in (Fang and Godavarthy 2014), where experts' academic publications were used to investigate how personal interests evolve over time. We follow most previous work and retrieve the top-k words as the profile of a user's interests.

Topic Modeling. Topic models provide a suite of algorithms to discover hidden thematic structure in a collection of documents. A topic model takes a set of documents as input and discovers a set of latent topics (recurring themes that are discussed in the collection) and the degree to which each document exhibits those topics (Blei, Ng, and Jordan 2003). Since the well-known topic models PLSI (Probabilistic Latent Semantic Indexing) (Hofmann 1999) and LDA (Latent Dirichlet Allocation) (Blei, Ng, and Jordan 2003) were proposed, topic models with dynamics have been widely studied. These include the Dynamic Topic Model (Blei and Lafferty 2006), the Dynamic Mixture Model (Wei, Sun, and Wang 2007), Topics over Time (Wang and McCallum 2006), the Topic Tracking Model (Iwata et al. 2009), and, more recently, the dynamic Dirichlet multinomial mixture topic model (Liang et al. 2017c), the user expertise tracking topic model (Liang 2018) and the user collaborative interest tracking topic model (Liang, Yilmaz, and Kanoulas 2018).
To our knowledge, none of the existing dynamic topic models has considered the problem of user profiling for short texts that utilizes collaborative information to infer topic distributions.

Problem Formulation

We follow most of the previous work (Balog and de Rijke 2007; Berendsen et al. 2013; Liang et al. 2018) and retrieve the top-k words as the profile of a user. The problem we address in this paper is then: given a set of users and streams of short documents generated by them, track their interests over time and dynamically identify a set of top-k relevant and diversified keywords for each of the users. The dynamic user profiling algorithm is essentially a function $h$ that satisfies:

$h: (\mathcal{D}_t, \mathbf{u}_t) \rightarrow \mathcal{W}_t,$

where $\mathcal{D}_t = \{\ldots, \mathbf{d}_{t-2}, \mathbf{d}_{t-1}, \mathbf{d}_t\}$ represents the stream of short documents generated by the users $\mathbf{u}_t$ up to time $t$, with $\mathbf{d}_t$ being the most recent set of short documents arriving at $t$; $\mathbf{u}_t = \{u_1, u_2, \ldots, u_{|\mathbf{u}_t|}\}$ represents the set of users appearing in the stream up to time $t$, with $u_i$ being the $i$-th user in $\mathbf{u}_t$ and $|\mathbf{u}_t|$ being the total number of users in the user set; and $\mathcal{W}_t = \{\mathbf{w}_{t,u_1}, \mathbf{w}_{t,u_2}, \ldots, \mathbf{w}_{t,u_{|\mathbf{u}_t|}}\}$ represents all users' profiling results at $t$, with $\mathbf{w}_{t,u_i} = \{w_{t,u_i,1}, w_{t,u_i,2}, \ldots, w_{t,u_i,k}\}$ being the profiling result, i.e., the top-$k$ diversified keywords, for user $u_i$ at $t$. We assume that the length of a document $d \in \mathcal{D}_t$ is no more than a predefined small length (e.g., 140 characters in Twitter).

User Profiling Algorithm

In this section, we detail our proposed User Profiling Algorithm (UPA), which consists of the proposed Collaborative Interest Tracking topic Model (CITM) and the proposed Streaming Keyword Diversification Model (SKDM).

Overview

We model users' interests in streams by latent topics. Therefore, the dynamic interests of each user $u \in \mathbf{u}_t$ at time period $t$ can be represented as a multinomial distribution $\theta_{t,u}$ over topics, where $\theta_{t,u} = \{\theta_{t,u,z}\}_{z=1}^{Z}$, with $\theta_{t,u,z}$ being the interest score on topic $z$ for user $u$ at time period $t$ and $Z$ being the total number of latent topics. Similarly, the dynamic interests of each user's followees at $t$ can be represented as a multinomial distribution $\psi_{t,u} = \{\psi_{t,u,z}\}_{z=1}^{Z}$, with $\psi_{t,u,z}$ being the interest score of user $u$'s followees $\mathbf{f}_{t,u}$ as a whole on topic $z$ at $t$. Here, $\mathbf{f}_{t,u}$ denotes all of user $u$'s followees at $t$. Our UPA algorithm consists of two main steps: (1) UPA first utilizes the proposed CITM to capture each user's dynamic interests $\theta_{t,u}$ and his collaborative interests $\psi_{t,u}$. (2) Given the inferred $\theta_{t,u}$ and $\psi_{t,u}$, UPA then utilizes SKDM to identify top-$k$ relevant and diversified keywords for profiling user $u$'s dynamic interests at time period $t$.

Collaborative Interest Tracking Model

Modeling Interests over Time. We closely follow the previous work in (Liang, Yilmaz, and Kanoulas 2018; Liang et al. 2017a), and aim at inferring each user's dynamic interest distribution $\theta_{t,u} = \{\theta_{t,u,z}\}_{z=1}^{Z}$ and his collaborative interest distribution $\psi_{t,u} = \{\psi_{t,u,z}\}_{z=1}^{Z}$ at $t$ in the context of streams of short documents in our CITM. We provide CITM's graphical representation in Fig. 1. To track the dynamics of a user $u$'s interests, we assume that the mean of his current interests $\theta_{t,u}$ at time period $t$ is the same as that at $t-1$, unless newly arrived documents associated with user $u$ are observed in the stream.
With this assumption, and following the previous work on dynamic topic models (Iwata et al. 2010; 2009; Wei, Sun, and Wang 2007), we use the following Dirichlet prior with a set of precision values $\alpha_t = \{\alpha_{t,z}\}_{z=1}^{Z}$, where we let the mean of the current distribution $\theta_{t,u}$ depend on the mean of the previous distribution $\theta_{t-1,u}$:

$P(\theta_{t,u} \mid \theta_{t-1,u}, \alpha_t) \propto \prod_{z=1}^{Z} \theta_{t,u,z}^{\alpha_{t,u,z}\,\theta_{t-1,u,z} - 1},$ (1)

where the precision value $\alpha_{t,z} = \{\alpha_{t,u,z}\}_{u=1}^{|\mathbf{u}_t|}$ represents the persistency of users' interests, i.e., how salient topic $z$ is at time period $t$ in contrast to $t-1$ for the users. As this distribution is a conjugate prior of the multinomial distribution, inference can be performed by Gibbs sampling (Liu 1994).

[Figure 1: Graphical representation of our proposed CITM model. Shaded nodes represent observed variables.]

Similarly, to track the dynamic changes of a user $u$'s collaborative interest distribution $\psi_{t,u}$, we use the following Dirichlet prior with a set of precision values $\beta_t = \{\beta_{t,z}\}_{z=1}^{Z}$, where the mean of the current distribution $\psi_{t,u}$ evolves from that of the previous distribution $\psi_{t-1,u}$:

$P(\psi_{t,u} \mid \psi_{t-1,u}, \beta_t) \propto \prod_{z=1}^{Z} \psi_{t,u,z}^{\beta_{t,u,z}\,\psi_{t-1,u,z} - 1},$ (2)

where the precision value $\beta_{t,z} = \{\beta_{t,u,z}\}_{u=1}^{|\mathbf{u}_t|}$ represents the persistency of users' collaborative interests, i.e., how salient topic $z$ is at time period $t$ in contrast to $t-1$ for the users. In a similar way, to model the dynamic changes of the multinomial distribution of words specific to topic $z$, we assume a Dirichlet prior in which the mean of the current distribution $\phi_{t,z} = \{\phi_{t,z,v}\}_{v=1}^{V}$ evolves from the mean of the previous distribution $\phi_{t-1,z}$:

$P(\phi_{t,z} \mid \phi_{t-1,z}, \gamma_t) \propto \prod_{v=1}^{V} \phi_{t,z,v}^{\gamma_{t,z,v}\,\phi_{t-1,z,v} - 1},$ (3)

where $V$ is the total number of words in the vocabulary $\mathbf{v} = \{v_i\}_{i=1}^{V}$ and $\gamma_t = \{\gamma_{t,v}\}_{v=1}^{V}$, with $\gamma_{t,v} = \{\gamma_{t,z,v}\}_{z=1}^{Z}$ representing the persistency of word $v$ in all topics at time $t$, a measure of how consistently the word belongs to the topics at $t$ compared to $t-1$.

Later in this subsection, we propose a collapsed Gibbs sampling algorithm to infer all users' dynamic interest distributions $\Theta_t = \{\theta_{t,u}\}_{u=1}^{|\mathbf{u}_t|}$, their corresponding dynamic collaborative interest distributions $\Psi_t = \{\psi_{t,u}\}_{u=1}^{|\mathbf{u}_t|}$, and the words' dynamic topic distributions $\Phi_t = \{\phi_{t,z}\}_{z=1}^{Z}$, and we describe our update rules for obtaining the optimal persistency values $\alpha_t$, $\beta_t$ and $\gamma_t$.

Assume that we know all users' interest distributions at time $t-1$, $\Theta_{t-1}$, their collaborative interest distributions at time $t-1$, $\Psi_{t-1}$, and the words' topic distributions, $\Phi_{t-1}$. Then the proposed collaborative interest tracking model is essentially a generative topic model that depends on $\Theta_{t-1}$, $\Psi_{t-1}$ and $\Phi_{t-1}$.

Algorithm 1: Inference for our CITM model at time t.
  Input: Distributions $\Theta_{t-1}$, $\Psi_{t-1}$ and $\Phi_{t-1}$ at $t-1$; initialized $\alpha_t$, $\beta_t$ and $\gamma_t$; number of iterations $N_{iter}$.
  Output: Current distributions $\Theta_t$, $\Psi_t$ and $\Phi_t$.
  1: Initialize topic assignments randomly for all documents in $\mathbf{d}_t$
  2: for iteration = 1 to $N_{iter}$ do
  3:   for user $u$ = 1 to $|\mathbf{u}_t|$ do
  4:     for each document $d \in \mathbf{d}_{t,u}$ do
  5:       Draw $z_{t,u,d}$ from (5)
  6:       Update $m_{t,u,z_{t,u,d}}$, $o_{t,u,z_{t,u,d}}$ and $n_{t,z_{t,u,d},v}$
  7:   Update $\alpha_t$, $\beta_t$ and $\gamma_t$
  8: Compute the posterior estimates $\Theta_t$, $\Psi_t$ and $\Phi_t$.

For initialization, and without loss of generality, we let $\theta_{0,u,z} = 1/Z$, $\psi_{0,u,z} = 1/Z$ and $\phi_{0,z,v} = 1/V$ at $t = 0$. Let $\mathbf{d}_{t,u}$ denote all the short documents posted by user $u$ at time period $t$.
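To make the evolving priors in (1)-(3) concrete, here is a minimal sketch (our own illustration, assuming NumPy; the precision value 50.0 is an arbitrary choice, not the paper's) of drawing a user's interest distribution whose prior mean is tied to the previous time period:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = 20  # number of latent topics, as used in the paper's experiments

# Interest distribution theta_{t-1,u} from the previous time period;
# the paper initializes it uniformly at t = 0.
theta_prev = np.full(Z, 1.0 / Z)

# Precision alpha_{t,u,z} (Eq. (1)): larger values make interests persist,
# keeping theta_t close to theta_{t-1}.
alpha = np.full(Z, 50.0)

# Eq. (1) is a Dirichlet with parameters alpha * theta_prev; with this
# constant alpha its mean is exactly theta_prev, and newly observed
# documents shift the estimate during inference.
theta_t = rng.dirichlet(alpha * theta_prev)
assert abs(theta_t.sum() - 1.0) < 1e-9  # a valid multinomial over topics
```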
The generative process of our model for the documents in the stream at time $t$ is as follows:

i. Draw $Z$ multinomials $\phi_{t,z}$, one for each topic $z$, from a Dirichlet prior distribution with parameters $\gamma_{t,z}\phi_{t-1,z}$;
ii. For each user $u \in \mathbf{u}_t$, draw multinomials $\theta_{t,u}$ and $\psi_{t,u}$ from Dirichlet distributions with priors $\alpha_{t,u}\theta_{t-1,u}$ and $\beta_{t,u}\psi_{t-1,u}$, respectively;
iii. For each document $d \in \mathbf{d}_{t,u}$, draw a topic $z_d$ based on the mixture of $\theta_{t,u}$ and $\psi_{t,u}$, and then, for each word $v_d$ in document $d$: (a) draw the word $v_d$ from the multinomial $\phi_{t,z_d}$.

In the above generative process, given that the documents in the stream are short, and because most short documents are likely to talk about one single topic only (Yin and Wang 2014), we let all the words in the same document $d$ be drawn from the multinomial distribution associated with the same topic $z_d$. See the graphical representation of CITM in Fig. 1.

Interest Distribution Inference. We propose a collapsed Gibbs sampling algorithm, based on the basic collapsed Gibbs sampler (Griffiths and Steyvers 2004; Wallach 2006), to approximately infer the distributions in our CITM topic model. As shown in Fig. 1 and the generative process, we adopt a conjugate (Dirichlet) prior for the multinomial distributions, and thus we can easily integrate out the uncertainty associated with the multinomials $\theta_{t,u}$, $\psi_{t,u}$ and $\phi_{t,z}$. We provide an overview of our proposed collapsed Gibbs sampling algorithm in Algorithm 1, where we denote by $m_{t,u,z}$, $o_{t,u,z}$ and $n_{t,z,v}$ the number of documents assigned to topic $z$ for user $u$, the number of documents assigned to topic $z$ for user $u$'s followees, and the number of times word $v$ is assigned to topic $z$ for user $u$ at $t$, respectively.

In the Gibbs sampling procedure, we need to calculate the conditional distribution $P(z_{t,u,d} \mid \mathbf{z}_{t,\neg(u,d)}, \mathbf{d}_t, \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t)$ at time $t$, where $\mathbf{z}_{t,\neg(u,d)}$ represents the topic assignments for all the documents in $\mathbf{d}_t$ except the document $d \in \mathbf{d}_{t,u}$ associated with user $u$ at $t$, and $z_{t,u,d}$ is the topic assigned to document $d \in \mathbf{d}_{t,u}$. To obtain this conditional distribution used during sampling, we begin with the joint probability of the current document set at time $t$:

$P(\mathbf{z}_t, \mathbf{d}_t \mid \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t)$
$= (1-\lambda)\, P(\mathbf{z}_t, \mathbf{d}_t \mid \Theta_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \gamma_t) + \lambda\, P(\mathbf{z}_t, \mathbf{d}_t \mid \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \beta_t, \gamma_t)$
$= (1-\lambda) \prod_{u} \frac{\Gamma(\sum_z \kappa_2)}{\prod_z \Gamma(\kappa_2)} \frac{\prod_z \Gamma(\kappa_1)}{\Gamma(\sum_z \kappa_1)} \prod_{z} \frac{\Gamma(\sum_v \kappa_b)}{\prod_v \Gamma(\kappa_b)} \frac{\prod_v \Gamma(\kappa_a)}{\Gamma(\sum_v \kappa_a)} + \lambda \prod_{u} \frac{\Gamma(\sum_z \kappa_4)}{\prod_z \Gamma(\kappa_4)} \frac{\prod_z \Gamma(\kappa_3)}{\Gamma(\sum_z \kappa_3)} \prod_{z} \frac{\Gamma(\sum_v \kappa_b)}{\prod_v \Gamma(\kappa_b)} \frac{\prod_v \Gamma(\kappa_a)}{\Gamma(\sum_v \kappa_a)},$ (4)

where $\Gamma(\cdot)$ is the gamma function, $\lambda$ is a free parameter that governs the linear mixture of a user's interests and his followees' interests, and the parameters $\kappa$ are defined as: $\kappa_1 = m_{t,u,z} + \alpha_{t,u,z}\theta$, $\kappa_2 = \alpha_{t,u,z}\theta$, $\kappa_3 = o_{t,u,z} + \beta_{t,u,z}\psi$, $\kappa_4 = \beta_{t,u,z}\psi$, $\kappa_a = n_{t,z,v} + \gamma_{t,z,v}\phi$, and $\kappa_b = \gamma_{t,z,v}\phi$. Here, we let $\theta$, $\psi$ and $\phi$ abbreviate $\theta_{t-1,u,z}$, $\psi_{t-1,u,z}$ and $\phi_{t-1,z,v}$, respectively. Based on the above joint distribution (4) and using the chain rule, we can conveniently obtain the following conditional distribution for the proposed Gibbs sampling (step 5 of Algorithm 1):

$P(z_{t,u,d} = z \mid \mathbf{z}_{t,\neg(u,d)}, \mathbf{d}_t, \Theta_{t-1}, \Psi_{t-1}, \Phi_{t-1}, \mathbf{u}_t, \alpha_t, \beta_t, \gamma_t) \propto \left( (1-\lambda)\, \frac{m_{t,u,z} + \alpha_{t,u,z}\theta - 1}{\sum_{z=1}^{Z}(m_{t,u,z} + \alpha_{t,u,z}\theta) - 1} + \lambda\, \frac{o_{t,u,z} + \beta_{t,u,z}\psi - 1}{\sum_{z=1}^{Z}(o_{t,u,z} + \beta_{t,u,z}\psi) - 1} \right) \cdot \frac{\prod_{v \in d} \prod_{j=1}^{N_{d,v}} (n_{t,z,v,\neg(u,d)} + \gamma_{t,z,v}\phi + j - 1)}{\prod_{i=1}^{N_d} (n_{t,z,\neg(u,d)} + i - 1 + \sum_{v=1}^{V} \gamma_{t,z,v}\phi)},$ (5)

where $N_d$, $N_{d,v}$, $\mathbf{z}_{t,\neg(u,d)}$, $n_{t,z,v,\neg(u,d)}$ and $n_{t,z,\neg(u,d)}$ are the length of document $d$, the number of times word $v$ appears in $d$, the topic assignments for all documents except document $d$ from user $u$ at $t$, the number of times word $v$ is assigned to topic $z$ in all documents except the one from user $u$ at $t$, and the number of documents assigned to $z$ in all documents except the one from user $u$ at $t$, respectively.
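As a rough illustration of step 5 of Algorithm 1, the following sketch (our own simplification, assuming NumPy; the "-1" exclusion terms in (5) are assumed to be folded into the counts passed in) draws a single topic for one short document:

```python
import numpy as np

def sample_topic(m_u, o_u, n_zv, n_z, doc, theta_prev, psi_prev, phi_prev,
                 alpha_u, beta_u, gamma, lam, rng):
    # m_u, o_u: (Z,) document-topic counts for the user / his followees;
    # n_zv: (Z, V) word-topic counts; n_z: (Z,) aggregate counts per topic;
    # doc: list of word ids. All counts already exclude the current document.
    Z = len(m_u)
    user = m_u + alpha_u * theta_prev
    foll = o_u + beta_u * psi_prev
    prior = (1.0 - lam) * user / user.sum() + lam * foll / foll.sum()

    log_like = np.zeros(Z)  # word term of Eq. (5): one topic per document
    for z in range(Z):
        pseudo = gamma[z] * phi_prev[z]  # gamma_{t,z,v} * phi_{t-1,z,v}
        seen = {}                        # occurrences of each word so far in doc
        for i, v in enumerate(doc):
            j = seen.get(v, 0)
            log_like[z] += np.log(n_zv[z, v] + pseudo[v] + j)
            log_like[z] -= np.log(n_z[z] + i + pseudo.sum())
            seen[v] = j + 1

    p = prior * np.exp(log_like - log_like.max())  # stabilize, then normalize
    return int(rng.choice(Z, p=p / p.sum()))
```

In Algorithm 1 this draw is followed by incrementing the counts for the sampled topic (step 6).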
At each iteration during the sampling (steps 2 to 7 of Algorithm 1), the precision parameters $\alpha_t$, $\beta_t$ and $\gamma_t$ can be estimated by maximizing the joint distribution (4). We apply fixed-point iterations to obtain the optimal $\alpha_t$, $\beta_t$ and $\gamma_t$. By applying the two bounds in (Minka 2000), we can derive the following update rules of $\alpha_t$, $\beta_t$ and $\gamma_t$ for maximizing the joint distribution in our fixed-point iterations:

$\alpha_{t,u,z} \leftarrow (1-\lambda)\,\alpha_{t,u,z}\, \frac{\Psi(m_{t,u,z} + \alpha_{t,u,z}\theta) - \Psi(\alpha_{t,u,z}\theta)}{\Psi\left(\sum_{z=1}^{Z} m_{t,u,z} + \alpha_{t,u,z}\theta\right) - \Psi\left(\sum_{z=1}^{Z} \alpha_{t,u,z}\theta\right)},$
$\beta_{t,u,z} \leftarrow \lambda\,\beta_{t,u,z}\, \frac{\Psi(o_{t,u,z} + \beta_{t,u,z}\psi) - \Psi(\beta_{t,u,z}\psi)}{\Psi\left(\sum_{z=1}^{Z} o_{t,u,z} + \beta_{t,u,z}\psi\right) - \Psi\left(\sum_{z=1}^{Z} \beta_{t,u,z}\psi\right)},$ (6)
$\gamma_{t,z,v} \leftarrow \gamma_{t,z,v}\, \frac{\Psi(n_{t,z,v} + \gamma_{t,z,v}\phi) - \Psi(\gamma_{t,z,v}\phi)}{\Psi\left(\sum_{v=1}^{V} n_{t,z,v} + \gamma_{t,z,v}\phi\right) - \Psi\left(\sum_{v=1}^{V} \gamma_{t,z,v}\phi\right)},$

where $\Psi(x) = \partial \log \Gamma(x) / \partial x$ is the digamma function. Our derivations of the update rules for $\alpha_t$, $\beta_t$ and $\gamma_t$ in (6) are analogous to those in (Liang, Yilmaz, and Kanoulas 2018; Liang et al. 2017c; 2017b). After the Gibbs sampling is done, using the fact that the Dirichlet distribution is conjugate to the multinomial distribution, we can conveniently infer each user's interest distribution $\theta_{t,u}$, his collaborative interest distribution $\psi_{t,u}$ and the words' topic distribution $\phi_{t,z}$ at $t$, respectively, as:

$\theta_{t,u,z} = \frac{m_{t,u,z} + \alpha_{t,u,z}}{\sum_{z'=1}^{Z}(m_{t,u,z'} + \alpha_{t,u,z'})}, \quad \psi_{t,u,z} = \frac{o_{t,u,z} + \beta_{t,u,z}}{\sum_{z'=1}^{Z}(o_{t,u,z'} + \beta_{t,u,z'})}, \quad \phi_{t,z,v} = \frac{n_{t,z,v} + \gamma_{t,z,v}}{\sum_{v'=1}^{V}(n_{t,z,v'} + \gamma_{t,z,v'})}.$

Streaming Keyword Diversification Model

After we obtain $\theta_{t,u}$, $\psi_{t,u}$ and $\phi_{t,z}$, inspired by the PM-2 diversification method (Dang and Croft 2012), we closely follow the work in (Liang et al. 2017c; 2018) and propose a streaming keyword diversification model, SKDM (Algorithm 2).

Algorithm 2: SKDM model for generating top-k keywords for collaborative, dynamic, diversified user profiling.
  Input: Current distributions $\Theta_t$ and $\Phi_t$.
  Output: All users' profiling results at time $t$, $\mathcal{W}_t$.
  1: for $u = 1, \ldots, |\mathbf{u}_t|$ do
  2:   $\mathbf{w}_{t,u} \leftarrow \emptyset$  /* $\mathbf{w}_{t,u} \in \mathcal{W}_t$ */
  3:   $e_v \leftarrow \mathbf{v}$  /* candidate keywords: the whole vocabulary */
  4:   for $z = 1, \ldots, Z$ do
  5:     $\delta_{t,u,z} \leftarrow (1-\lambda) P(z \mid t, u) + \lambda P(z \mid t, \mathbf{f}_{t,u})$
  6:     $s_{z|t,u} \leftarrow 0$
  7:   for all positions in the ranked list $\mathbf{w}_{t,u}$ do
  8:     for $z = 1, \ldots, Z$ do
  9:       $q_t[z \mid t, u] \leftarrow \delta_{t,u,z} / (2 s_{z|t,u} + 1)$
  10:    $z^* \leftarrow \arg\max_z q_t[z \mid t, u]$
  11:    $v^* \leftarrow \arg\max_{v \in e_v} \eta_1\, q_t[z^* \mid t, u]\, P(v \mid t, z^*) + \eta_2 \sum_{z \neq z^*} q_t[z \mid t, u]\, P(v \mid t, z) + (1 - \eta_1 - \eta_2)\, \mathrm{tfidf}(v \mid t, u)$
  12:    $\mathbf{w}_{t,u} \leftarrow \mathbf{w}_{t,u} \cup \{v^*\}$  /* append $v^*$ to $\mathbf{w}_{t,u}$ */
  13:    $e_v \leftarrow e_v \setminus \{v^*\}$  /* remove $v^*$ from $e_v$ */
  14:    for $z = 1, \ldots, Z$ do
  15:      $s_{z|t,u} \leftarrow s_{z|t,u} + P(v^* \mid t, z) / \sum_{z'=1}^{Z} P(v^* \mid t, z')$

To generate the top-$k$ diversified keywords for each user $u$ at $t$, SKDM starts with an empty keyword set $\mathbf{w}_{t,u}$ with $k$ empty seats (step 2 of Algorithm 2) and a set of candidate keywords (step 3), $e_v$, which initially contains all the words in the vocabulary, i.e., $e_v = \mathbf{v}$. For each of the seats, it computes the quotient $q_t[z \mid t, u]$ for each topic $z$ given user $u$ at $t$ by the Sainte-Laguë formula (step 9):

$q_t[z \mid t, u] = \frac{\delta_{t,u,z}}{2 s_{z|t,u} + 1},$

where $\delta_{t,u,z}$ is the final probability that user $u$ is interested in topic $z$ at $t$, set to $\delta_{t,u,z} = (1-\lambda) P(z \mid t, u) + \lambda P(z \mid t, \mathbf{f}_{t,u})$ (step 5), and $s_{z|t,u}$ is the number of seats occupied by topic $z$ (at initialization, $s_{z|t,u} = 0$ for all topics, step 6). Here $P(z \mid t, u)$ and $P(z \mid t, \mathbf{f}_{t,u})$ are the probabilities of user $u$'s own interest and his collaborative interest in topic $z$ at $t$, respectively. We can obtain both from our CITM, with $P(z \mid t, u) = \theta_{t,u,z}$ and $P(z \mid t, \mathbf{f}_{t,u}) = \psi_{t,u,z}$, i.e., we have:

$\delta_{t,u} = (1-\lambda)\,\theta_{t,u} + \lambda\,\psi_{t,u},$ (7)

where $\delta_{t,u} = \{\delta_{t,u,z}\}_{z=1}^{Z}$ is user $u$'s final interest distribution, inferred from both his own and his collaborative information at time $t$.
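A small sketch of how the posterior estimates above combine with Eq. (7) into the final interest distribution (our own illustration, assuming NumPy arrays of Gibbs counts):

```python
import numpy as np

def final_interest(m_u, o_u, alpha_u, beta_u, lam):
    # Posterior estimates of theta_{t,u} and psi_{t,u} from the Gibbs counts
    # (m_{t,u,z}, o_{t,u,z}) and precisions, as given above, followed by the
    # linear mixture of Eq. (7). All arrays have shape (Z,).
    theta = (m_u + alpha_u) / (m_u + alpha_u).sum()
    psi = (o_u + beta_u) / (o_u + beta_u).sum()
    return (1.0 - lam) * theta + lam * psi  # delta_{t,u}
```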
According to the Sainte-Laguë method, seats should be awarded to the topic with the largest quotient in order to best maintain the proportionality of the result list. Therefore, our SKDM assigns the current seat to the topic $z^*$ with the largest quotient (step 10). The keyword to fill this seat should be relevant not only to topic $z^*$ but also to the other topics, and should be specific to the user; we therefore obtain the keyword $v^*$ for user $u$'s profile at $t$ as (step 11):

$v^* \leftarrow \arg\max_{v \in e_v} \; \eta_1\, q_t[z^* \mid t, u]\, P(v \mid t, z^*) + \eta_2 \sum_{z \neq z^*} q_t[z \mid t, u]\, P(v \mid t, z) + (1 - \eta_1 - \eta_2)\, \mathrm{tfidf}(v \mid t, u),$

where $0 \le \eta_1, \eta_2 \le 1$ are two free parameters satisfying $0 \le \eta_1 + \eta_2 \le 1$, $P(v \mid t, z)$ is the probability that $v$ is associated with topic $z$ at time $t$ and thus can be set to $P(v \mid t, z) = \phi_{t,z,v}$, and $\mathrm{tfidf}(v \mid t, u)$ is a time-sensitive term frequency-inverse document frequency function for user $u$ at $t$, defined as:

$\mathrm{tfidf}(v \mid t, u) = \mathrm{tf}(v \mid \mathbf{d}_{t,u}) \cdot \mathrm{idf}(v \mid u, \mathbf{d}_t),$ (8)

where $\mathrm{tf}(v \mid \mathbf{d}_{t,u}) = \frac{|\{d \in \mathbf{d}_{t,u} : v \in d\}|}{|\mathbf{d}_{t,u}|}$ is a term frequency function that computes the fraction of the documents in $\mathbf{d}_{t,u}$ that contain the word $v$, and $\mathrm{idf}(v \mid u, \mathbf{d}_t) = \log \frac{|\mathbf{d}_t|}{|\{d \in \mathbf{d}_t : v \in d\}| + \epsilon}$ is an inverse document frequency function, with $\epsilon$ set to 1 to avoid division by zero. According to (8), if $v$ appears frequently in the document set $\mathbf{d}_{t,u}$ generated by user $u$ but not frequently in the document set $\mathbf{d}_t$ generated by all the users, $\mathrm{tfidf}(v \mid t, u)$ returns a high score.

After the word $v^*$ is selected, SKDM adds $v^*$ as a result keyword to $\mathbf{w}_{t,u}$, i.e., $\mathbf{w}_{t,u} \leftarrow \mathbf{w}_{t,u} \cup \{v^*\}$ (step 12), removes it from the candidate word set, i.e., $e_v \leftarrow e_v \setminus \{v^*\}$ (step 13), and increases the number of seats occupied by each topic $z$ by its normalized relevance to $v^*$ (step 15): $s_{z|t,u} \leftarrow s_{z|t,u} + P(v^* \mid t, z) / \sum_{z'=1}^{Z} P(v^* \mid t, z')$. The process (steps 7 to 15) repeats until we obtain $k$ diversified keywords. The order in which a keyword is appended to $\mathbf{w}_{t,u}$ determines its ranking in the profile. After the process is done, we obtain a set of diversified keywords $\mathbf{w}_{t,u}$ that profiles user $u$ at $t$.
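The greedy seat-filling loop of Algorithm 2 (steps 7 to 15) can be sketched as follows; this is our own simplification, assuming NumPy, with `delta` the mixture from Eq. (7), `phi` the (Z, V) matrix of P(v|t, z), and `tfidf_scores` a precomputed vector of the scores from Eq. (8). The default values of eta1 and eta2 are placeholders, not the paper's tuned settings:

```python
import numpy as np

def skdm_topk(delta, phi, tfidf_scores, k, eta1=0.4, eta2=0.3):
    """Greedy PM-2-style keyword selection (steps 7-15 of Algorithm 2)."""
    Z, V = phi.shape
    seats = np.zeros(Z)            # s_{z|t,u}: seats occupied per topic
    candidates = set(range(V))     # e_v: word ids of the whole vocabulary
    profile = []                   # w_{t,u}: ranked keyword ids
    for _ in range(k):
        q = delta / (2.0 * seats + 1.0)  # Sainte-Lague quotients, step 9
        z_star = int(np.argmax(q))       # step 10
        # Step 11: relevance to z_star, to the other topics, and to the user.
        score = (eta1 * q[z_star] * phi[z_star]
                 + eta2 * (q @ phi - q[z_star] * phi[z_star])
                 + (1.0 - eta1 - eta2) * tfidf_scores)
        v_star = max(candidates, key=lambda v: score[v])
        profile.append(v_star)           # step 12
        candidates.remove(v_star)        # step 13
        seats += phi[:, v_star] / phi[:, v_star].sum()  # step 15
    return profile  # ids map back to vocabulary words for the final profile
```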
Experimental Setup

Research Questions. The research questions guiding the remainder of the paper are: (RQ1) How does UPA perform for user profiling compared to state-of-the-art methods? (RQ2) What is the contribution of the proposed interest tracking topic model, CITM, to the overall performance of UPA? (RQ3) What is the contribution of the collaborative information for user profiling? (RQ4) What is the impact of the length of the time intervals, $t_i - t_{i-1}$, in UPA?

Dataset. We work with a dataset collected from Twitter (crawled from https://dev.twitter.com/). It contains 1,375 active, randomly selected users and their tweets posted from the beginning of their registrations up to May 31, 2015. According to the statistics, most of the users are followed by 2 to 50 followers. In total, we have 7.52 million tweets with timestamps, including those from the users' followees. The average length of the tweets is 12 words. We use this dataset as our stream of short documents.

We obtain two categories of ground truths: one for evaluating relevance-oriented performance (RGT) and another for evaluating diversity-oriented performance (DGT). To create the RGT ground truth, we split the dataset into 5 different partitions of time periods: a week, a month, a quarter, half a year and a year, respectively. For each Twitter user at every specific time period, an annotator was asked to generate a ranked list of top-k relevant keywords (k was decided by the annotators) as the user's profile. In total, 68 annotators took part in the labelling, each of them labelling about 5 Twitter users for these 5 different partitions. To create the ground truth for diversity evaluation, DGT, as it is expensive to manually obtain aspects of the keywords from annotators, we cluster the relevant keywords by k-means (MacQueen 1967) on their word embeddings (publicly available from https://nlp.stanford.edu/projects/glove/) into 15 categories (category information is available from http://dmoztools.net). Relevant keywords within a cluster are regarded as being relevant to the same aspect in the DGT ground truth.
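The clustering step of the DGT construction can be sketched as follows (our own illustration, assuming scikit-learn and a hypothetical `embed` lookup from keyword to its GloVe vector):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_keywords(keywords, embed, n_categories=15, seed=0):
    """Group relevant keywords into aspect categories for the DGT
    ground truth by k-means over their word embeddings."""
    X = np.stack([embed[w] for w in keywords])  # (num_keywords, dim)
    labels = KMeans(n_clusters=n_categories, random_state=seed).fit_predict(X)
    # Keywords sharing a cluster label are treated as the same aspect.
    return {w: int(c) for w, c in zip(keywords, labels)}
```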
Baselines. We make comparisons between our UPA and the following state-of-the-art baseline algorithms:
(1) tfidf: simply utilizes (8), i.e., the content of users' documents, to retrieve top-k keywords as profiles for the users.
(2) Predictive Language Model (PLM): models the dynamics of personal interests via a probabilistic language model (Fang and Godavarthy 2014).
(3) Latent Dirichlet Allocation (LDA): infers topic distributions specific to each document via the LDA model (Blei, Ng, and Jordan 2003).
(4) Author Topic model (AuthorT): infers topic distributions specific to each user in a static dataset (Rosen-Zvi et al. 2004).
(5) Dynamic Topic Model (DTM): a dynamic model that utilizes a Gaussian distribution to infer topic distributions of long documents in streams (Blei and Lafferty 2006).
(6) Topics over Time model (ToT): a dynamic model that normalizes the timestamps of long documents in a collection and then infers a topic distribution for each document (Wang and McCallum 2006).
(7) Topic Tracking Model (TTM): a dynamic model that captures the dynamic topic distributions of long documents arriving at time t in streams of long documents (Iwata et al. 2009).
(8) GSDMM: a Gibbs Sampling-based Dirichlet Multinomial Mixture model that assigns one topic to each short document in a static collection (Yin and Wang 2014).

For fair comparisons, the topic model baselines, GSDMM, TTM, ToT, DTM and LDA, use both each user's interests $\theta_{t,u}$ and their collaborative interests for profiling. As these baselines cannot directly infer collaborative interest distributions, we use the average interests of a user's followees as his collaborative interest distribution. Thus, unlike (7), in these baselines we use the mixture $\delta_{t,u} = (1-\lambda)\theta_{t,u} + \lambda \frac{1}{|\mathbf{f}_{t,u}|} \sum_{u' \in \mathbf{f}_{t,u}} \theta_{t,u'}$ to represent each user's final interest distribution, with $\theta_{t,u}$ inferred by the corresponding baseline topic model. The baselines tfidf, PLM and AuthorT are static profiling algorithms, while the others are dynamic. Again, for fair comparisons, UPA and all the other topic models use our SKDM algorithm to obtain the top-k keywords. We set the number of topics $Z = 20$ in all the topic models. For tuning the parameters $\lambda$, $\eta_1$ and $\eta_2$, we use a 70%/20%/10% split for our training, validation and test sets, respectively. The train/validation/test splits are permuted until all users have been chosen once for the test set. We repeat the experiments 10 times and report the average results.

For further analysis of the contribution of the collaborative interests $\psi_{t,u}$ inferred by our CITM model to the profiling, we use another baseline, denoted UPAavg, in which $\delta_{t,u}$ is set to $(1-\lambda)\theta_{t,u} + \lambda \frac{1}{|\mathbf{f}_{t,u}|} \sum_{u' \in \mathbf{f}_{t,u}} \theta_{t,u'}$, with $\theta_{t,u}$ inferred by CITM. Note that we still denote the proposed profiling algorithm using (7) as UPA.

Evaluation Metrics

We use standard relevance-oriented evaluation metrics, Pre@k (Precision at k), NDCG@k (Normalized Discounted Cumulative Gain at k), MRR@k (Mean Reciprocal Rank at k), and MAP@k (Mean Average Precision at k) (Croft, Metzler, and Strohman 2015), and diversity-oriented metrics, Pre-IA@k (Intent-Aware Pre@k) (Agrawal et al. 2009), α-NDCG@k (Clarke et al. 2008), MRR-IA@k (Agrawal et al. 2009), and MAP-IA@k (Agrawal et al. 2009). We also propose semantic versions of the original metrics, denoted Pre-S@k, NDCG-S@k, MRR-S@k, MAP-S@k, Pre-IA-S@k, α-NDCG-S@k, MRR-IA-S@k, and MAP-IA-S@k, respectively. The only difference between the original metrics and the corresponding semantic ones is the way the relevance score of a retrieved keyword $v$ to a ground-truth keyword $v_{gt}$ is computed. For the original metrics, we let the relevance score be 1 if and only if $v = v_{gt}$, and 0 otherwise; for the semantic versions, we let the relevance score be the cosine similarity between the word embedding vectors of $v$ and $v_{gt}$. Since we usually do not choose too many keywords to describe a user's profile, we compute the scores at depth 10, i.e., $k = 10$. For all the metrics, we abbreviate M@k as M, where M is one of the metrics.
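The semantic relevance score used by the -S metrics can be sketched as follows (assuming NumPy and the same hypothetical `embed` lookup from word to embedding vector):

```python
import numpy as np

def semantic_relevance(v, v_gt, embed):
    """Relevance of a retrieved keyword v to a ground-truth keyword v_gt
    in the -S metrics: cosine similarity of their word embeddings."""
    a, b = embed[v], embed[v_gt]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```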
Results and Discussions

In this section, we analyse our experimental results.

Overall Performance. We start by answering research question RQ1. Tables 1 and 2 report the relevance and diversification performance, respectively.

Table 1: Relevance performance of UPA, UPAavg and the baselines using time periods of a month. Statistically significant differences between UPAavg and GSDMM, and between UPA and UPAavg, are marked in the upper right-hand corner of the UPAavg and UPA scores, respectively. Statistical significance is tested using a two-tailed paired t-test at α = .01 and α = .05.

          Pre   NDCG  MRR   MAP   Pre-S  NDCG-S  MRR-S  MAP-S
  tfidf   .254  .229  .375  .135  .409   .392    .853   .203
  PLM     .273  .239  .668  .140  .417   .398    .870   .212
  LDA     .281  .252  .674  .142  .424   .407    .878   .217
  AuthorT .288  .260  .674  .145  .429   .408    .897   .220
  DTM     .295  .270  .694  .153  .436   .419    .883   .226
  TTM     .301  .276  .728  .156  .440   .426    .882   .228
  ToT     .312  .283  .744  .158  .445   .428    .884   .230
  GSDMM   .321  .301  .746  .163  .452   .437    .891   .236
  UPAavg  .367  .361  .840  .195  .483   .468    .939   .262
  UPA     .399  .398  .860  .211  .501   .490    .946   .274

Table 2: Diversification performance of UPA, UPAavg and the baselines using time periods of a month. Notational conventions for statistical significance are as in Table 1.

          Pre-IA  α-NDCG  MRR-IA  MAP-IA  Pre-IA-S  α-NDCG-S  MRR-IA-S  MAP-IA-S
  tfidf   .157    .187    .480    .185    .257      .325      .725      .150
  PLM     .162    .192    .487    .187    .265      .332      .742      .152
  LDA     .171    .203    .493    .192    .272      .338      .744      .155
  AuthorT .174    .205    .505    .195    .276      .343      .748      .157
  DTM     .177    .206    .507    .197    .279      .347      .748      .159
  TTM     .180    .208    .509    .221    .282      .351      .751      .162
  ToT     .182    .213    .513    .225    .290      .355      .754      .170
  GSDMM   .194    .228    .525    .237    .304      .368      .780      .173
  UPAavg  .238    .265    .597    .252    .362      .421      .808      .216
  UPA     .266    .302    .623    .266    .395      .452      .814      .231

The following findings can be observed from Tables 1 and 2: (1) In terms of both relevance and diversity, all the topic model-based profiling algorithms, i.e., UPA, UPAavg, GSDMM, ToT, TTM, DTM, AuthorT and LDA, outperform the traditional algorithms, i.e., PLM and tfidf, which demonstrates that topic modeling does help to profile users' interests. (2) UPA and UPAavg outperform all the baseline models in terms of relevance and diversity on all the metrics, which confirms the effectiveness of the proposed user profiling algorithm for the task. (3) The ordering of the methods, UPA > UPAavg > GSDMM > ToT ≈ TTM ≈ DTM ≈ AuthorT ≈ LDA > PLM > tfidf, is mostly consistent across the two ground truths and on the relevance and diversity evaluation metrics. Here A > B denotes that method A statistically significantly outperforms method B, and A ≈ B denotes that we did not observe a significant difference between A and B. This, once again, confirms that UPA and its averaged version, UPAavg, outperform all the baselines. (4) In most cases, UPA > UPAavg holds, which confirms that the collaborative information inferred by the proposed topic model, CITM, does help to improve the profiling performance.

Additionally, Table 3 shows the top six keywords of an example user's dynamic profile, with time being five quarters from April 2014 to May 2015. As shown in the table, the diversified keywords generated by UPA are semantically closer to those from the ground truth than those generated by the baseline, GSDMM, which again demonstrates the effectiveness of the proposed UPA algorithm.

[Table 3: Top six keywords of an example user's dynamic profile over five quarters from April 2014 to May 2015, listing, for each quarter, the keywords from the DGT ground truth, from GSDMM and from UPA.]

Contribution of CITM. We now turn to answer research question RQ2. Recall that the only difference between our UPA/UPAavg and the baselines is that UPA utilizes our CITM to track users' dynamic interests and then our SKDM to diversify the keywords, whereas the other topic models utilize different topic models to obtain users' interests and then the same SKDM for keyword diversification. As shown in Tables 1 and 2, UPA/UPAavg outperforms all the topic models, i.e., GSDMM, ToT, TTM, DTM, AuthorT and LDA, which illustrates that the proposed topic model, CITM, is indeed effective and contributes significantly to the performance of our user profiling algorithm.
Contribution of Collaborative Interests. Here we turn to answer research question RQ3. We vary the parameter λ that governs how much the collaborative information, $\psi_{t,u}$, is utilized for profiling; a larger λ indicates that more collaborative information is utilized. Fig. 2 shows the performance on the relevance and diversity evaluation metrics (using Precision and Pre-IA as representative metrics only), where we use the best baseline, GSDMM, as a representative. When we increase λ from 0 to 0.6, i.e., give more weight to the collaborative information, the performance of all the models gradually improves, with UPA still outperforming UPAavg and GSDMM. This, again, illustrates that integrating collaborative information into the models helps to improve performance. Moreover, as shown in Fig. 2, UPA, which utilizes the inferred collaborative interests, outperforms UPAavg, which simply utilizes the average of the followees' interests as its collaborative interests; this once again demonstrates that the collaborative interests inferred in UPA are effective.

[Figure 2: Relevance and diversity performance of UPA, UPAavg and GSDMM on the representative metrics, Precision and Pre-IA, with varying values of λ.]

Impact of Time Period Length. Finally, we answer research question RQ4. We compare the performance for different time period lengths, a week, a month, a quarter, half a year and a year, respectively, using the two ground truths, RGT and DGT, on the representative relevance and diversity metrics, Precision and Pre-IA, in Fig. 3. As shown in Fig. 3, UPA and UPAavg beat the baselines for time periods of all lengths, which illustrates that our proposed user profiling algorithm works better than the state-of-the-art ones for dynamic user profiling regardless of period length. The performance of UPA, UPAavg and the best baseline, GSDMM, improves significantly on all the metrics when the period length increases from a week to a quarter, whereas it reaches a plateau as the time period further increases from a quarter to a year. In all cases, UPA and UPAavg significantly outperform the best baseline, GSDMM. These findings illustrate that the performance of the proposed algorithm is robust and maintains significant improvements over the state-of-the-art non-dynamic and dynamic algorithms. In addition, UPA always outperforms UPAavg on all the metrics and all period lengths, which once again illustrates that the collaborative interest distribution inferred by the proposed CITM model helps to enhance user profiling performance.

[Figure 3: Relevance and diversity performance of UPA, UPAavg and GSDMM for time periods of a week, a month, a quarter, half a year, and a year.]

Conclusions

We have studied the problem of collaborative, dynamic and diversified user profiling in the context of streams of short texts. To tackle the problem, we have proposed a streaming profiling algorithm, UPA, that consists of two models: the proposed collaborative interest tracking topic model, CITM, and the proposed streaming keyword diversification model, SKDM.
Our CITM tracks the changes of users' and their followees' interest distributions in streams of short texts, a sequentially organized corpus of short texts, and our SKDM diversifies the top-k keywords for profiling users' dynamic interests. To effectively infer users' and their followees' dynamic interest distributions in our CITM model, we have proposed a collapsed Gibbs sampling algorithm, in which a single topic is assigned to each document during sampling to address the textual sparsity problem. We have conducted experiments on a Twitter dataset. We evaluated the performance of our UPA and the baseline algorithms using two categories of ground truths, on both the original metrics and the proposed semantic versions of the metrics. Experimental results show that our UPA is able to profile users' dynamic interests over time for streams of short texts. In the future, we intend to utilize auxiliary resources, such as the Wikipedia articles that entities in the short documents link to, for further improvement of user profiling.

References

Agrawal, R.; Gollapudi, S.; Halverson, A.; and Ieong, S. 2009. Diversifying search results. In WSDM, 5-14.
Balog, K., and de Rijke, M. 2007. Determining expert profiles (with an application to expert finding). In IJCAI, 2657-2662.
Balog, K.; Bogers, T.; Azzopardi, L.; de Rijke, M.; and van den Bosch, A. 2007. Broad expertise retrieval in sparse data environments. In SIGIR, 551-558.
Balog, K.; Fang, Y.; de Rijke, M.; Serdyukov, P.; and Si, L. 2012. Expertise retrieval. Found. Trends Inf. Retr. 6:127-256.
Berendsen, R.; de Rijke, M.; Balog, K.; Bogers, T.; and van den Bosch, A. 2013. On the assessment of expertise profiles. JASIST 64(10):2024-2044.
Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models. In ICML, 113-120.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3:993-1022.
Clarke, C. L. A.; Kolla, M.; Cormack, G. V.; Vechtomova, O.; Ashkan, A.; Büttcher, S.; and MacKinnon, I. 2008. Novelty and diversity in information retrieval evaluation. In SIGIR, 659-666.
Craswell, N.; de Vries, A. P.; and Soboroff, I. 2005. Overview of the TREC 2005 enterprise track. In TREC 2005, 1-7.
Croft, W. B.; Metzler, D.; and Strohman, T. 2015. Search Engines: Information Retrieval in Practice. Addison-Wesley.
Dang, V., and Croft, W. B. 2012. Diversity by proportionality: An election-based approach to search result diversification. In SIGIR, 65-74.
Fang, Y., and Godavarthy, A. 2014. Modeling the dynamics of personal expertise. In SIGIR, 1107-1110.
Griffiths, T. L., and Steyvers, M. 2004. Finding scientific topics. PNAS 101:5228-5235.
Hofmann, T. 1999. Probabilistic latent semantic indexing. In SIGIR, 50-57.
Iwata, T.; Watanabe, S.; Yamada, T.; and Ueda, N. 2009. Topic tracking model for analyzing consumer purchase behavior. In IJCAI, 1427-1432.
Iwata, T.; Yamada, T.; Sakurai, Y.; and Ueda, N. 2010. Online multiscale dynamic topic models. In KDD, 663-672.
Liang, S.; Ren, Z.; Yilmaz, E.; and Kanoulas, E. 2017a. Collaborative user clustering for short text streams. In AAAI, 3504-3510.
Liang, S.; Ren, Z.; Zhao, Y.; Ma, J.; Yilmaz, E.; and de Rijke, M. 2017b. Inferring dynamic user interests in streams of short texts for user clustering. ACM Trans. Inf. Syst. 36(1):10:1-10:37.
Liang, S.; Yilmaz, E.; Shen, H.; de Rijke, M.; and Croft, W. B. 2017c. Search result diversification in short text streams. ACM Trans. Inf. Syst. 36(1):8:1-8:35.
Liang, S.; Zhang, X.; Ren, Z.; and Kanoulas, E. 2018. Dynamic embeddings for user profiling in Twitter. In KDD, 1764-1773.
Liang, S.; Yilmaz, E.; and Kanoulas, E. 2018. Collaboratively tracking interests for user clustering in streams of short texts. IEEE Transactions on Knowledge and Data Engineering.
Liang, S. 2018. Dynamic user profiling for streams of short texts. In AAAI, 5860-5867.
Liu, J. S. 1994. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Am. Stat. Assoc. 89(427):958-966.
MacQueen, J. B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297.
Minka, T. 2000. Estimating a Dirichlet distribution. Technical report, MIT.
Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; and Smyth, P. 2004. The author-topic model for authors and documents. In UAI, 487-494.
Rybak, J.; Balog, K.; and Nørvåg, K. 2014. Temporal expertise profiling. In ECIR, 540-546.
Wallach, H. M. 2006. Topic modeling: Beyond bag-of-words. In ICML, 977-984.
Wang, X., and McCallum, A. 2006. Topics over time: A non-Markov continuous-time model of topical trends. In KDD, 424-433.
Wei, X.; Sun, J.; and Wang, X. 2007. Dynamic mixture models for multiple time-series. In IJCAI, 2909-2914.
Yin, J., and Wang, J. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In KDD, 233-242.