# Where and Why Users Check In

Yoon-Sik Cho, Greg Ver Steeg, and Aram Galstyan
USC Information Sciences Institute, Marina del Rey, CA 90292
yoonsik@isi.edu, gregv@isi.edu, galstyan@isi.edu

The emergence of location-based social network (LBSN) services makes it possible to study individuals' mobility patterns at a fine-grained level and to see how they are impacted by social factors. In this study we analyze the check-in patterns in LBSNs and observe significant temporal clustering of check-in activities. We explore how self-reinforcing behaviors, social factors, and exogenous effects contribute to this clustering and introduce a framework to distinguish these effects at the level of individual check-ins for both users and venues. Using check-in data from three major cities, we show not only that our model can improve prediction of future check-ins, but also that disentangling the different factors allows us to infer meaningful properties of different venues.

## 1 Introduction

Human mobility patterns influence human behavior in both routine and profound ways. Any picture of the spread of disease, traffic congestion, or urban crime would be incomplete without understanding the movements of individuals and groups. Even though the behavior of groups of humans has many degrees of freedom, previous works (Gonzalez, Hidalgo, and Barabasi 2008; Rhee et al. 2011) have demonstrated that human mobility exhibits structural regularities. The recent emergence of location-based social network (LBSN) services such as Gowalla and Foursquare has enabled researchers to perform fine-grained analysis of users' mobility patterns and their impact on social interactions.

In LBSN services, users share their current location or the venues they have visited in the past with their friends.
Most LBSNs give unique IDs to different establishments even if they share the same geographical location (i.e., latitude and longitude coordinates); we emphasize this distinction by using the term "venue" rather than "location". Typically, a user checks in to a specific venue by using a smartphone or tablet to choose from a list of venues near their current location, as determined by Wi-Fi or GPS. This information is sent to the LBSN server and shared with their friends. A user can check in to a venue during each visit and is often encouraged to do so through incentives.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The primary LBSN data consist of the check-in history of the users, where each check-in is described by a user ID, a venue ID, and the time of the check-in. In addition, most LBSN services also provide secondary data that describe the underlying social network of the users. Prior research has studied the correlation between individual mobility patterns and social interactions, e.g., by predicting social ties based on similar mobility patterns (Crandall et al. 2010), or, conversely, by predicting the next check-in location of a user based on the recent check-in history within his local network (McGee, Caverlee, and Cheng 2013).

While most prior work has focused on user-based modeling of spatial-temporal LBSN data (Cho, Myers, and Leskovec 2011; Gao, Tang, and Liu 2012a), here we argue that a venue-centric approach is sometimes preferable. For instance, if the goal is to predict future attendance of a particular venue or to measure the impact of an ad campaign on attendance, it is more natural to focus on the check-in dynamics of venues rather than users. While recent work has studied correlations between a venue's characteristics and its popularity (Joseph, Tan, and Carley 2012), the dynamics of venue-specific check-ins have been largely ignored.

We focus on modeling the full temporal dynamics of check-ins from a venue-centric perspective. We observe that check-ins at venues are clustered in time, sometimes exhibiting bursty behavior. We also observe that the average check-in patterns for both users and venues are not static, but change over time. We include three primary mechanisms to describe check-in dynamics: (1) repeated behavior, captured by a self-reinforcing mechanism in which a user is strongly influenced by his recent behavior; (2) social influence, i.e., a visit by a user triggers future visits by his friends; and (3) exogenous effects, which include external events (such as the release of new software for the service or a promotion campaign) that modulate the attendance rates.

Here we are especially interested in assessing social influence on visitation patterns. Toward this goal, we adopt a parametric point process model known as a Hawkes process (Hawkes 1971) to describe check-in dynamics at venues. A Hawkes process is an example of a self-exciting point process in which past events positively influence the likelihood (intensity) of future events. This model allows us to measure the likelihood that a particular (offspring) event was triggered by a past (parent) event, and thus to distinguish the most likely factors contributing to an individual check-in. Combining this information with the known social network structure enables us to estimate the fraction of check-ins that can be plausibly attributed to social influence.

Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence

Beyond the rich explanatory power of the model, we also demonstrate that it predicts future check-in data better than several alternatives. In particular, we consider various baseline point process models and compare them on their ability to capture the temporal dynamics of check-ins.
Finally, we consider each of the three mechanisms in our model separately and demonstrate their validity by distinguishing social and non-social venues and by capturing known exogenous effects such as (external) promotion campaigns. This multi-faceted analysis allows a fine-grained discrimination of different types of venues. While we focus on user/venue dynamics here, the mechanisms we describe are general and could apply to other aspects of human behavior.

### Related Work

There is a growing body of literature on LBSN analysis. Link prediction using geo-coincidences has been studied in (Crandall et al. 2010; Wang et al. 2011; Scellato, Noulas, and Mascolo 2011). Other studies have used social network information to infer user location (Sadilek, Kautz, and Bigham 2012; Gao, Tang, and Liu 2012b) and to predict the next check-in (McGee, Caverlee, and Cheng 2013). Several recent studies have also attempted to cluster users (Joseph, Tan, and Carley 2012) and venues (Cranshaw et al. 2012) based on similar visitation patterns.

Most human activity patterns have bursty dynamics and cannot be adequately described by a homogeneous Poisson process. To describe temporal correlations in social interactions, researchers have used non-homogeneous Poisson processes (NHPP) such as the Cox process (Lando 1998; Perry and Wolfe 2013), as well as Hidden Markov Models (Raghavan et al. 2013). Our approach here is based on the self-exciting Hawkes process (Hawkes 1971), which has previously been used for modeling urban crime (Mohler et al. 2011), inter-gang violence (Cho et al. 2013), and repeated social interactions (Blundell, Heller, and Beck 2012).

## 2 Dataset Description

We use the Gowalla dataset (Cho, Myers, and Leskovec 2011) in this work. Gowalla is a location-based social networking website where users share their locations by checking in. In this dataset, the network consists of 196,591 nodes and 950,327 undirected edges. Between February 2009 and October 2010, there were 6,442,890 check-ins.
We extracted all the check-ins of active users in San Francisco, New York, and Stockholm as representative samples from the western U.S., the eastern U.S., and Europe. Asian cities had little activity and were excluded from the analysis; check-ins from a relatively active city in Asia (Tokyo) were a quarter of those in San Francisco. We collected all activity within a rectangular box of latitude-longitude coordinates around each of the selected cities. We considered only the 20% of users who were most active, to ensure sufficient statistics for parameter estimation. These users accounted for around 80% of the total number of check-ins. This 80-20 rule held across all the cities we examined. Statistics from each city are presented below.

Table 1: Statistics of Check-ins from Three Cities

| City | Number of Check-ins | Number of Venues | Number of Users |
|---|---|---|---|
| San Francisco | 142,972 | 10,751 | 5,989 |
| [city label missing] | 114,777 | 17,062 | 6,205 |
| [city label missing] | 184,485 | 15,753 | 9,320 |

## 3 Model Description

### 3.1 Modeling Temporal Patterns

Figure 1: Temporal pattern of check-ins (SF). Panels: Golden Gate Park; Farmer Browns Little Skillet; Blue Bottle Coffee. Time axis: 2009.9.1 to 2010.1.1.

We treat the check-ins in an LBSN as a marked point process in time, where the mark represents the venue as well as the user for an event at a given time. By separating the events with respect to the venue ID, each venue forms its own point process. We defer analysis of user-specific processes until Sec. 3.3. As shown in Figure 1, clustering is apparent in the three temporal point processes. Thicker lines represent the degree to which an event is explained by previous events (as opposed to background or exogenous effects).
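
The data preparation described above (keeping the most active 20% of users, then splitting the check-ins into one event sequence per venue) can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout `(user_id, venue_id, timestamp)`, the function names, and the toy values are all assumptions.

```python
from collections import Counter, defaultdict

def most_active_users(checkins, fraction=0.2):
    """Return the set of users in the top `fraction` by check-in count."""
    counts = Counter(user for user, _venue, _t in checkins)
    keep = max(1, int(len(counts) * fraction))
    return {user for user, _count in counts.most_common(keep)}

def per_venue_processes(checkins, users):
    """Group check-ins of the selected users into one time-sorted list per venue."""
    processes = defaultdict(list)
    for user, venue, t in checkins:
        if user in users:
            processes[venue].append((t, user))
    return {venue: sorted(events) for venue, events in processes.items()}

# Toy data (made up): (user_id, venue_id, timestamp in hours).
checkins = [
    ("u1", "cafe", 5.0), ("u1", "cafe", 1.0), ("u1", "park", 2.0),
    ("u2", "cafe", 3.0), ("u3", "park", 4.0), ("u4", "park", 6.0),
    ("u5", "cafe", 7.0),
]
active = most_active_users(checkins, fraction=0.2)
venues = per_venue_processes(checkins, active)
```

Each value in `venues` is a time-ordered list of `(timestamp, user)` pairs, so the user mark stays attached to every event of the venue-specific process.
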
The strength of ties was computed using the self-exciting point process known as the Hawkes process (Hawkes 1971), detailed in the next section. The Hawkes process defines a (mark-specific) intensity as a function of history and time. This model has been widely used in applications that show temporal clustering of events, such as shocks and aftershocks in seismology.

#### Hawkes process

Each check-in at a given venue is treated as an event in the venue-specific point process. We assume that the intensity of check-in events involving the venue $v$ at time $t$ is given as follows:

$$\lambda_v(t) = \mu_v + \sum_{p:\, t_p \ldots} \ldots$$

[...] ($\Delta t > 0$):

$$N(t + \Delta t) - N(t) = \int_{t}^{t + \Delta t} \lambda(\tau)\, d\tau. \qquad (4)$$

In our experiment, we focus on predicting daily check-ins, so we set $\Delta t = 24$ hours. The time $t$ is sampled at random such that the history before $t$ includes at least half of the data, so that we have enough history for parameter estimation. For each venue, we repeat the experiment 1,000 times for different random values of $t$ and compare our prediction to the actual number of events. The prediction error is the absolute difference between the actual number of events and our prediction: |true count − predicted count|. The number of predicted events is estimated using Equation 4. We compare the Hawkes process to several baselines, including non-homogeneous Poisson processes (NHPP) and Cox processes.

#### Baseline 1: piecewise-constant NHPP

Check-in data from the three cities show stronger activity during weekends than on weekdays. We separate weekend check-ins from weekday check-ins and estimate a rate parameter for each. The two parameters, $\lambda_{\text{weekend}}$ and $\lambda_{\text{weekday}}$, are constant and can be easily estimated. To predict the number of visits on a given day, we simply use the appropriate rate parameter for weekends or weekdays. As with the Hawkes process, we repeat the experiment 1,000 times for each venue.

#### Baseline 2: NHPP with drift

$$\lambda(t) = at + b \qquad (5)$$

We define the rate function as a linear function of time.
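
For this linear rate, the integral in Equation 4 has a closed form, $a\,(t\,\Delta t + \Delta t^2/2) + b\,\Delta t$, which gives the predicted count directly. The sketch below evaluates it together with the absolute prediction error defined earlier. The parameter values and function names are hypothetical and chosen only for illustration.

```python
def expected_count_linear(a, b, t, dt):
    """Integrate lambda(s) = a*s + b over [t, t + dt] (Equation 4 applied to Equation 5)."""
    return a * (t * dt + 0.5 * dt ** 2) + b * dt

def prediction_error(true_count, predicted_count):
    """Absolute gap between the observed and predicted number of check-ins."""
    return abs(true_count - predicted_count)

# Hypothetical parameters, with time measured in days:
# the rate starts at 0.5 check-ins/day and grows by 0.01 check-ins/day each day.
a, b = 0.01, 0.5
t, dt = 100.0, 1.0  # predict one day (24 hours) starting at day 100
predicted = expected_count_linear(a, b, t, dt)
error = prediction_error(2, predicted)  # suppose 2 check-ins were observed
```
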
At many venues, check-ins became more frequent as time elapsed from the first check-in, because more and more people joined the Gowalla service after its introduction. This intensity function captures the birth of new users well, and we see that this simple intensity function predicts the number of visitors relatively well.

#### Baseline 3: Cox proportional hazard model

A Cox process is a generalization of the Poisson process in which the intensity is itself a stochastic process. The Cox proportional hazard model (Cox 1972) associates covariates that modulate the baseline rate $\lambda_0(t)$ up or down. Often the baseline is a function of time with a non-parametric form, while the covariates enter through a coefficient $\beta$ that is estimated using the partial likelihood. For our experiment, we define $x(t)$ as the number of unique visitors, assuming that the number of unique visitors in the past affects the intensity function.

$$\lambda(t) = \lambda_0(t) \exp(\beta x(t)) \qquad (6)$$

In our experiment, we assume a constant baseline rate and repeat the same experiment for Baseline 3 with the same sampled times and training sets.

#### Baseline 4: Sigmoidal Gaussian Cox process

$$\lambda(s) = \lambda^{\star} \sigma(g(s)) \qquad (7)$$

This is another variant of the Cox process, in which the intensity function is a transformation of a random realization from a Gaussian process. Adams, Murray, and MacKay (2009) proposed this model, which achieves tractable inference on an unknown intensity function. The random intensity function $\lambda(s)$ has an upper bound $\lambda^{\star}$, and a sigmoid function $\sigma$ maps $g(s)$, which is sampled from a Gaussian process, to the intensity. In (Adams, Murray, and MacKay 2009), the sigmoidal Gaussian Cox process (SGCP) inferred intensity functions of simple form in synthetic experiments. We also compare the Hawkes process to the SGCP on the prediction of future events.

#### Error Comparison

We use 360 venues (120 from each of the three cities) and repeat 1,000 predictions for each venue, one per sampled time range between $t$ and $t + \Delta t$.
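
The sampling of test windows can be sketched as below. This is an illustration under one stated assumption: the requirement that the history "includes at least half of the data" is read here as drawing each start time $t$ no earlier than the median event of the venue. The function names and toy event times are hypothetical.

```python
import random
from bisect import bisect_left

def count_in_window(event_times, t, dt):
    """Number of events with t <= time < t + dt; `event_times` must be sorted."""
    return bisect_left(event_times, t + dt) - bisect_left(event_times, t)

def sample_windows(event_times, dt, n_samples, seed=0):
    """Draw window starts at or after the median event, so that at least
    half of the events are available as history (an assumed reading)."""
    rng = random.Random(seed)
    start = event_times[len(event_times) // 2]
    end = event_times[-1] - dt
    return [rng.uniform(start, end) for _ in range(n_samples)]

# Toy event times in days (made up), sorted in increasing order.
event_times = [0.5, 1.2, 3.0, 10.0, 10.4, 12.9, 20.0]
windows = sample_windows(event_times, dt=1.0, n_samples=1000)
true_counts = [count_in_window(event_times, t, 1.0) for t in windows]
```

Each entry of `true_counts` is the observed number of check-ins in one sampled window, which is what a model's predicted count would be compared against.
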
Venues with few check-ins (fewer than 100 during 400 days) or with a short history (less than 200 days from the first check-in to the final check-in) were excluded from this experiment. Because some venues have more frequent check-ins than others, we evenly divide the 360 venues into three groups based on the total number of actual check-ins in the 1,000 test samples, and name them inactive, moderate, and active venues, from fewer to more check-ins. Splitting venues by activity was done on a per-city basis to avoid city-specific bias due to higher average usage. We divide the 1,000 test samples into two groups: those with no check-ins (zero) and those with at least one check-in (non-zero). On average, 70% of test samples from inactive venues, 50% of test samples from moderate venues, and 35% of test samples from active venues fell into the zero group.

Table 2: Performance of Predictions (average prediction error)

| Process | Obs. (each sample) | Inactive venues | Moderate venues | Active venues |
|---|---|---|---|---|
| Hawkes | zero | 0.3202 | 0.3710 | 0.7210 |
| Hawkes | non-zero | 0.7238 | 0.5361 | 0.4712 |
| Baseline 1 | zero | 0.3273 | 0.4455 | 1.0306 |
| Baseline 1 | non-zero | 0.7305 | 0.7029 | 0.5937 |
| Baseline 2 | zero | 0.5318 | 0.6795 | 1.6901 |
| Baseline 2 | non-zero | 0.6011 | 0.5707 | 0.5086 |
| Baseline 3 | zero | 0.7289 | 0.9347 | 2.2185 |
| Baseline 3 | non-zero | 0.6040 | 0.6361 | 0.5660 |
| Baseline 4 | zero | 0.2477 | 0.4037 | 0.9989 |
| Baseline 4 | non-zero | 0.7927 | 0.7331 | 0.5990 |

The results are presented in Table 2. The average prediction error for the zero group is averaged over the number of zero-count test samples, while the average prediction error for the non-zero group is averaged over the total number of check-ins in the test samples. We observe that the Hawkes process clearly outperforms the baselines for venues with moderate and high activity levels. In particular, for those venues the Hawkes process produces more accurate predictions both for the rate of events when they occur and for the absence of events.
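
The two averages reported in Table 2 use different denominators, which the following sketch makes explicit: the zero group is normalized by the number of zero-count samples, and the non-zero group by the total number of observed check-ins. The function name and the toy inputs are assumptions for illustration.

```python
def grouped_errors(true_counts, predicted_counts):
    """Return (zero-group average error, non-zero-group average error)."""
    zero_err, zero_samples = 0.0, 0
    nonzero_err, nonzero_events = 0.0, 0
    for true, pred in zip(true_counts, predicted_counts):
        err = abs(true - pred)
        if true == 0:
            zero_err += err
            zero_samples += 1       # normalize by number of zero samples
        else:
            nonzero_err += err
            nonzero_events += true  # normalize by total observed check-ins
    zero_avg = zero_err / zero_samples if zero_samples else float("nan")
    nonzero_avg = nonzero_err / nonzero_events if nonzero_events else float("nan")
    return zero_avg, nonzero_avg

# Toy example (made-up numbers): four test windows for one venue.
true_counts = [0, 0, 2, 1]
predicted_counts = [0.5, 0.25, 1.0, 1.5]
zero_avg, nonzero_avg = grouped_errors(true_counts, predicted_counts)
```
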
For the inactive venues, we find that Baseline 4 makes more accurate predictions for non-events, while Baseline 2 (and also Baseline 3) makes better predictions of the rate of events when they occur. The former observation can be attributed to the fact that Baseline 4 tends to under-predict, which results in a low prediction error for non-events and the highest prediction error (among all methods) for events. Of all the processes under consideration, the Hawkes process is the only one that captures influence between check-ins, while the other processes only capture fluctuations of rates over time.

### 4.3 Evaluating the Three Factors

We now focus on understanding the relative importance of the three main factors put forward in Section 3.3 that are responsible for temporal clustering. Toward this goal, recall that the (directional) correlation between two check-in events can be measured using Equation 3. Furthermore, by using the existing social network information, we can estimate the relative contribution of each factor by analyzing the identity of the users in those check-ins. Namely, if both check-ins are by the same user, then the event pair contributes to the self-reinforcing behavior. Similarly, when the check-ins are by two users who are connected in the social network, then the events contribute to the social effect. Finally, event pairs that belong to neither of these groups are attributed to exogenous effects.

Since we are interested in differentiating the strength of the effects, we separately add all the pair terms $p^{v}_{i \to j}$ in Equation 3 for the cases when events $i$ and $j$ involve the same person, two people who are friends with each other, or neither of these two. To understand which factor contributes more to the temporal patterns, we define the following scores corresponding to each of the three factors, respectively:

$$\sum_{t_i \ldots} \ldots \qquad \sum_{t_i \ldots} \ldots \qquad \sum_{t_i \ldots} \ldots$$
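
The grouping of pair terms by user relationship can be sketched as follows. This is only an illustration of the unnormalized sums: the pair probabilities would come from Equation 3, which is replaced here by made-up values, and the data layout, names, and the absence of any normalization are assumptions.

```python
def factor_sums(events, pair_probs, friends):
    """Sum pair terms separately by the relationship between the two users.

    events:     list of (time, user) for one venue, indexed by event id
    pair_probs: dict {(i, j): p} with parent event i and offspring event j
    friends:    set of frozenset({u, v}) for each undirected friendship
    """
    sums = {"self": 0.0, "social": 0.0, "exogenous": 0.0}
    for (i, j), p in pair_probs.items():
        user_i, user_j = events[i][1], events[j][1]
        if user_i == user_j:
            sums["self"] += p          # same user: self-reinforcing behavior
        elif frozenset((user_i, user_j)) in friends:
            sums["social"] += p        # friends: social effect
        else:
            sums["exogenous"] += p     # neither: attributed to exogenous effects
    return sums

# Toy venue with four check-ins and hypothetical pair probabilities.
events = [(0.0, "a"), (1.0, "a"), (1.5, "b"), (2.0, "c")]
pair_probs = {(0, 1): 0.5, (0, 2): 0.25, (1, 2): 0.25, (2, 3): 0.125}
friends = {frozenset(("a", "b"))}
sums = factor_sums(events, pair_probs, friends)
```
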