# Evaluating Recommender System Stability with Influence-Guided Fuzzing

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

David Shriver,1 Sebastian Elbaum,1 Matthew B. Dwyer,1 David S. Rosenblum2
1Department of Computer Science, University of Virginia, Charlottesville, VA, USA
2Department of Computer Science, National University of Singapore, Singapore
{dlshriver, selbaum, matthewbdwyer}@virginia.edu, david@comp.nus.edu.sg

## Abstract

Recommender systems help users to find products or services they may like when lacking personal experience or facing an overwhelming set of choices. Since unstable recommendations can lead to distrust, loss of profits, and a poor user experience, it is important to test recommender system stability. In this work, we present an approach based on inferred models of the influence that underlies recommender systems to guide the generation of dataset modifications to assess a recommender's stability. We implement our approach and evaluate it on several recommender algorithms using the MovieLens dataset. We find that influence-guided fuzzing can effectively find small sets of modifications that cause significantly more instability than random approaches.

## Introduction

Recommender systems filter information to help users make decisions when lacking personal experience or knowledge (Resnick and Varian 1997) or when the set of choices is overwhelmingly large (Herlocker et al. 2004). We see them everywhere, from e-commerce sites such as Amazon, news article recommendation at the New York Times, and movie recommendation by Netflix, to guides for venture capital opportunities (Zhao, Zhang, and Wang 2015), safe combinations of pharmaceutical drugs (Chiang et al. 2018), and collision avoidance actions for aircraft (Julian et al. 2016). The robustness of these systems, their ability to make good recommendations in the presence of noise, can have significant impact on the end goals of both providers and users (O'Mahony et al. 2004), such as the profitability of an e-commerce site, or the safety and effectiveness of recommended drug combinations.

In this work, we focus on a particular dimension of robustness called stability, which measures how recommendations change when the system is trained on modified data, regardless of the recommendation quality. Intuitively, a system lacks stability when there exist a number of changes to the dataset that have an effect on the recommendations that is disproportionate to the amount of change. The lack of stability is problematic in that it causes recommendation inconsistencies that can lead to loss of user trust and limit system adoption (Adomavicius and Zhang 2012).

Determining the stability of a system, however, is extremely challenging, due in part to the size of the input space that needs to be explored. As an example, the dataset we use later in the paper has 943 users, 1682 items, and 100,000 ratings on a scale from 1 to 5. A naïve stability assessment approach that simply adds a single random rating to that dataset would likely overestimate the system stability, since finding one addition among the roughly 8 million possible (943 users × 1682 items × 5 rating values) that can cause instability is very unlikely.
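As a rough illustration of the scale involved, the short Python calculation below reproduces the back-of-the-envelope counts quoted above; the variable names are ours and the numbers are simply the dataset statistics from the text.

```python
# Back-of-the-envelope size of the single-addition modification space
# for the MovieLens 100k dataset statistics quoted in the text.
num_users = 943
num_items = 1682
num_rating_values = 5            # integer ratings 1..5
num_existing_ratings = 100_000

user_item_pairs = num_users * num_items                # 1,586,126 pairs
unrated_pairs = user_item_pairs - num_existing_ratings

# Upper bound used in the text (ignores already-rated pairs): ~7.9 million.
print(num_users * num_items * num_rating_values)       # 7,930,630

# Slightly tighter count that only adds ratings where none exist yet.
print(unrated_pairs * num_rating_values)               # 7,430,630
```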
Similarly, an exhaustive search for instability-inducing modifications seems infeasible, especially when considering multiple modifications (i.e., adding k new ratings to our dataset means exploring a space of at least $10^{6k}$ potential modifications) or when considering datasets that can contain hundreds of millions of data points. Furthermore, even if generating all modifications were feasible, the system must then be retrained on each modified dataset, which is often an even more expensive process.

Our approach to this problem is to exploit the assumptions underlying the design of recommender algorithms as to how the relationships among items and users influence recommendation behavior. For instance, a user-based recommender algorithm assumes that influence between users is important and therefore employs similarity measures between users in order to determine influence. Unfortunately, these influence relationships are encoded by the training process in complex, internal data structures that are typically difficult to access and interpret. Nevertheless, we can infer characteristics of the influence relationships learned by the system by analyzing the recommendations it produces and the ratings it was trained on. For instance, we can estimate the influence of a user based on the number of items rated by that user that are recommended to other users.

Based on this insight, our approach first infers a model of influence underlying a recommender system, and then it performs fuzzing on the training dataset in order to assess the system's stability. Fuzzing is a widely-used automated software testing technique that generates invalid, unexpected, or directed random inputs to cost-effectively test the stability of a system (Miller, Fredriksen, and So 1990); a recent survey on fuzzing can be found in (Liang et al. 2018). As part of its output, the approach includes examples of changes and types of changes to the dataset that cause instability. These examples can then be used by developers to improve the design of the recommender algorithm, adjust its parameters, or even discard and replace the algorithm entirely, in an effort to meet the stability requirements for a given dataset. Stability results can also inform model training tasks, such as when additional preprocessing of the data is needed, when regularization needs to be modified, or when to perform retraining during deployment.

Our overall contributions are: (1) We approximate influence relationships learned by recommender systems from the recommendations they produce and their training data. (2) We use these models for heuristic-based, influence-guided fuzzing of recommender system stability, to search for dataset modifications having a disproportionate effect on computed recommendations. (3) We evaluate our approach on several recommender algorithms with the MovieLens dataset. We find that our influence-guided fuzzing heuristics are much more effective than randomly generating modifications. For instance, for one recommender system tested, 10 rating modifications generated with our influence-guided heuristic caused 44% of users to have their top-ranked item removed from their recommendations after retraining, a 40000x increase over random.

## Related Work

O'Mahony et al. describe two aspects of robustness, accuracy and stability (O'Mahony et al. 2004).
Accuracy, in regard to robustness, is a measure of the recommendation quality after changes are made to the dataset, while stability is a measure of how different the recommendations are after a change is made. Gunawardana and Shani consider robustness to be the stability of the recommendations in the presence of fake information (Gunawardana and Shani 2015). Adomavicius and Zhang take a slightly different view of robustness and stability, defining stability as the consistency of recommendations over some period of time under the assumption that any new ratings added to the dataset completely agree with the prior recommendations (Adomavicius and Zhang 2010; 2012). Using this definition of stability, they find that the stability of a recommender system does not necessarily correlate with the accuracy of the system. In this work, we define stability as the consistency of recommendations given any small set of rating modifications, and we relax the assumption of Adomavicius and Zhang that new ratings must align with the past.

In prior work, recommender system robustness was commonly evaluated in the context of attacking recommender systems. Shilling attacks, or profile-injection attacks, add user profiles to a recommender system with crafted sets of item ratings in order to increase or decrease the position of some item in the recommendations of all users (O'Mahony et al. 2004; Lam and Riedl 2004). Gunes et al. provide a comprehensive survey on shilling attacks against recommender systems (Gunes et al. 2014). Our approach differs from adversarial and shilling attacks in a couple of significant ways. First, our fuzzing approach automatically infers influence models from the execution of the recommender system. These models aim to approximate portions of the training dataset that are driving most recommendations. Second, our fuzzing approach uses those models to generate rating modifications, which are much smaller than adding whole new profiles, users, or items to a dataset. Whereas shilling attacks add new user profiles to promote or demote targeted items, our approach generates small sets of rating modifications for existing users and items to provide a general assessment of the stability of a recommender system. Additionally, because shilling attacks add new users to a system, they require adding many ratings for each of those new users. For example, Lam and Riedl introduce between 25 and 100 users with 3,404 ratings each to a dataset of almost 1,000,000 ratings (a change of between 8% and 34%) (Lam and Riedl 2004). In contrast, our goal is to identify small sets of modifications (under 1% change) that cause significant change to the recommendations produced by the system.

## Approach

In this section we present our approach to assessing the stability of recommender systems by fuzzing the dataset used to train the recommender. A diagram of our approach is shown in Figure 1.

Figure 1: Diagram of our approach for generating modified datasets based on inferred influence models.

A recommender system consists of an algorithm A and a dataset D. The algorithm uses the dataset to compute a model of user preferences. The recommender system employs the learned model to compute recommendations for users of the system. Our approach infers an influence model from a dataset D and a set of recommendations R. The inferred influence model is then used for fuzzing D to produce a new dataset D'. Using the modified dataset, we train a new recommender system and generate recommendations R'. The recommendations R and R' are then compared using a differential oracle to assess the instability of the original recommender system.
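To make the workflow concrete, the sketch below wires these steps together as a single function. It is only an illustration of the pipeline in Figure 1: all of the arguments (`train`, `recommend`, `infer_influence`, `heuristic`, `distance`) are hypothetical caller-supplied callables standing in for the components described in this section, not functions from LensKit or any other library.

```python
def assess_stability(train, recommend, infer_influence, heuristic,
                     distance, dataset, num_mods, k=10, delta=0.1):
    """Differential stability check following Figure 1.

    All arguments are caller-supplied callables or data; nothing here is
    tied to a specific recommender library.
    """
    recs = recommend(train(dataset), k)                 # recommendations R from D
    influence = infer_influence(dataset, recs)          # inferred influence model
    modified = heuristic(influence, dataset, num_mods)  # fuzzed dataset D'
    recs_prime = recommend(train(modified), k)          # recommendations R' from D'

    # Differential oracle: average per-user distance between R and R'.
    dists = [distance(recs[u], recs_prime[u]) for u in recs]
    instability = sum(dists) / len(dists)
    return instability < delta, instability             # (stable?, score)
```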
To simplify our presentation, we begin by defining recommender systems in terms of users, items, ratings, and rankings. Then we describe how to approximate recommender systems with inferred models of influence. Next, we explain how we can use these models for influence-guided fuzzing of modifications to a dataset. Finally, we define differential oracles for recommender stability, and discuss the assumptions made by our approach.

### Defining Recommender Systems

Let $U$ be a finite set of users, $I$ a finite set of items, and $R$ a possibly infinite, ordered set of rating values. Without loss of generality, we assume that $R \subseteq \mathbb{Z}_{>0}$ (within some interval $[l, h]$). This framework extends to recommender systems in general. For instance, in a system that recommends pharmaceutical drugs that are safe to take with a given prescription (Chiang et al. 2018), the items to recommend are drugs, the user is a prescription, and the ratings encode whether the prescription included a drug.

A dataset $D$ defines the partial function $\mathit{rating}: U \times I \to R$ that captures how users rate items, with $\mathit{rating}(u, i) = \bot$ if $u$ has not rated $i$. In addition to defining $\mathit{rating}$, $D$ characterizes each $u \in U$ by attributes drawn from the set $A_U$, and each $i \in I$ by attributes drawn from the set $A_I$, through functions $\mathit{attr}: U \to 2^{A_U}$ and $\mathit{attr}: I \to 2^{A_I}$, respectively. A user or item may be characterized by multiple attributes.

Given $D$ and a parameter $k$ for computing top-$k$ rankings, a recommender algorithm $Q$ computes a partial function $\mathit{rank}_{k,Q}: I \times U \to [1, k]$ for $k \le |I|$, where $\mathit{rank}_{k,Q}(i, u) = \bot$ if item $i$ is not ranked for user $u$. For every $u \in U$ there is a number of ranked items, $k_u \le k$, and the projection of $\mathit{rank}$ onto users, $\mathit{rank}_{k,Q}(u): I_{k_u} \to [1, k_u]$, is a bijection, where $I_{k_u} \subseteq I$ and $|I_{k_u}| = k_u$. In what follows, we drop the subscript $Q$ from $\mathit{rank}_{k,Q}$ since $Q$ is typically apparent from the context, and we drop the subscript $k$ since $k$ is typically a parameter to $Q$.

### Inferring Influence

Recommender systems generate recommendations based on some notion of influence between aspects of the dataset, such as users ($U$), items ($I$), or attributes ($A_I$ or $A_U$). For instance, a user-based algorithm computes users that are influential to a given user, based on the similarity of ratings between pairs of users. Users with high similarity in their ratings are considered more influential to each other than users with low similarity in their ratings. In an item-based recommender algorithm, influence occurs between items, and is computed based on the similarity of ratings between items.

Table 1: Influence Model Functions

| Name | Definition |
| --- | --- |
| User | $I^U(u) = \lvert\{u' \mid \mathit{rank}(i, u') \neq \bot \wedge \mathit{rating}(u, i) \neq \bot\}\rvert$ |
| Item | $I^I(i) = \lvert\{u \mid \mathit{rank}(i', u) \neq \bot \wedge \mathit{rating}(u, i) \neq \bot\}\rvert$ |
| Attribute | $I^{A_I}(a) = \lvert\{(u, i) \mid \mathit{rank}(i, u) \neq \bot \wedge a \in \mathit{attr}(i)\}\rvert$ |
| Rating | $I^R(r) = \lvert\{i \mid \mathit{rank}(i, u) \neq \bot \wedge \lvert r - \mathrm{avg}\{\mathit{rating}(u', i)\}\rvert \leq \epsilon\}\rvert$ |

Because influence is used to produce recommendations, we conjecture that we can approximate the influence of various aspects (users, items, ratings, or attributes) of a dataset for a recommender system, based on the data used to train the recommender and its recommendations.
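As one concrete example of how such a score can be computed, the sketch below approximates the user influence model from Table 1; the dict-based data layout (users mapped to rated items and to top-k lists) is an assumption made for illustration and is not the implementation used in the paper.

```python
def user_influence(rated_items, recommendations, u):
    """Approximate I^U(u): the number of other users u' whose top-k
    recommendations contain at least one item that u has rated.

    rated_items:      dict mapping each user to the set of items they rated
    recommendations:  dict mapping each user to the list of their top-k items
    """
    items_rated_by_u = rated_items.get(u, set())
    return sum(
        1
        for u_prime, top_k in recommendations.items()
        if u_prime != u and items_rated_by_u & set(top_k)
    )

# The inferred influence model is then just the score for every user, e.g.:
# influence = {u: user_influence(rated_items, recommendations, u) for u in rated_items}
```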
By observing the relationship between some aspect in the dataset and some aspect in the recommendations, such as whether user u rated an item that is recommended to user u', we can build an approximate model of the overall influence. In general, we approximate influence scores with a function:

$$I^A : A^n \to \mathbb{R}, \quad \text{where } A \in \{U, I, R, A_I, A_U\} \qquad (1)$$

This function maps aspects of the dataset to a real-valued representation of the influence of that aspect. In this work, we focus on influence functions over single types of aspect ($n = 1$); however, richer forms of influence are possible. Using the computed influence scores, we can approximate the internal influence model of a recommender system as a list of aspects and their associated inferred influence scores.

In this work we present four models of influence, which are shown in Table 1. These models of influence capture the four major aspects of a recommender system: users, items, ratings, and attributes. The first column of Table 1 gives a short descriptive name to the influence model, and the second provides a definition in terms of the recommender system elements defined earlier. The superscript of the influence function signifies the type of aspect over which this influence applies. The influence function computes a score for an aspect based on the dataset and the rankings produced by the recommender system of interest. We describe each of these models in more detail below.

User influence measures the impact of a user u on all other users. A user u is considered to have impacted another user u' if user u has provided a rating for an item appearing in the top-k items for user u'. This model of influence captures the intuition that if a user affects the recommendations of many users, then that user likely has high influence.

Item influence measures the impact that an item i has on recommendations, by counting the number of users who are recommended at least one item and have rated item i. This model of influence captures the intuition that items rated by many users are likely to be more influential.

Attribute influence measures the impact of an item attribute on the recommendation of items with that attribute. An attribute a is considered to be influential if many recommended items have attribute a. This model captures the intuition that if many recommended items have a similar attribute, then that attribute must be significant.

Rating influence measures the impact of rating values on recommendation. A rating value r is considered to be more impactful if more recommended items have an average rating within some value $\epsilon$ of r. This model captures the intuition that if many recommended items have similar average rating values, then that rating value is more influential. In other words, items with average ratings near r are more likely to be recommended if r has high influence.

While the notion of influence for users, items, and attributes is intuitive, the notion of rating influence is less so. However, we see in the evaluation that it can be unusually effective.

### Influence-Guided Fuzzing

Using influence models, such as those defined above, we can define fuzzing heuristics which produce a set of modifications to the original dataset that are likely to cause a recommender system to exhibit unstable behavior. We allow three basic types of dataset modifications: add, remove, and change. Add inserts a new rating to the dataset for a user u, item i, and value r if no rating existed for u and i in the original dataset.
Remove deletes an existing rating for a user u and item i from the dataset. Change deletes an existing rating for a user u and item i and inserts a new rating with value r for user u and item i. We define a modification fuzzing heuristic as a function:

$$M : I \times D \times \mathbb{Z} \to 2^M, \quad M \in \{A, R, C\} \qquad (2)$$

where A is a set of additions, R a set of removals, and C a set of changes. As per Equation 2, a heuristic function takes an influence function, a dataset, and a number of modifications to be made, and outputs a set of additions, a set of removals, and a set of changes, or some subset of these.

We define a small sample of possible fuzzing heuristics, shown in the top section of Table 2. These heuristics were chosen by keeping the modification type constant (as the Add modification type) and varying the influence type, and by keeping the influence type constant (as user influence) and varying the type of modification. This is not a complete listing of possible heuristics, but it is designed to cover a variety of influence models, given the limited space available. We discuss the intuition for each of these heuristics below.

Each row of Table 2 is a fuzzing heuristic. The first column of each row provides a short name which we use as an identifier, and the second column provides a description of the heuristic, where the bolded and italicized words specify the modification and influence types, respectively. The influence-guided fuzzing heuristics defined here are given three-letter names based on how they operate. The first letter is based on the type of modification they produce, where A is for add, C is for change, and R is for remove modifications. The second letter is the area of the influence model they select aspects from: an M means that the heuristic chooses the most influential aspect, and an L means it chooses the least influential. The third letter specifies the type of influence used by the heuristic: user influence is specified by a U, item influence by an I, attribute influence by an A, and rating influence by an R.

Table 2: Fuzzing Heuristics

| Name | Description |
| --- | --- |
| AMU | Add a rating with random value to a random item for the most influential user. |
| RMU | Remove a rating from a random item for the most influential user. |
| CMU | Change the rating of a random item to the low value for the most influential user. |
| ALI | Add a rating with a random value to the least influential item. |
| AMR | Add a random rating value to an item with an average rating near the most influential average rating. |
| AMA | Add a rating with low value to a random user for an item with the most attribute influence. |
| ARAND | Add ratings to random users and items. |
| RRAND | Remove random ratings. |
| CRAND | Change random ratings to the low value. |

**AMU.** The AMU fuzzing heuristic adds random ratings to the user with the highest influence score and is defined as:

$$AMU : I^U \times D \times \mathbb{Z} \to 2^A \qquad (3)$$

Items are selected from the set of items not yet rated by that user, and both items and rating values are selected uniformly at random. The intuition behind this heuristic is that by adding ratings to the highest influence user, we may be able to cause a new item to be recommended to many other users, or cause a previously recommended item to stop being recommended for many users.
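A minimal sketch of AMU under the same assumed dict-based data layout as the earlier user-influence sketch follows; the interface (a score dict and a list of allowed rating values) is illustrative rather than the authors' implementation.

```python
import random

def amu(user_influence_scores, rated_items, all_items, rating_values, n):
    """AMU: add n random ratings for the most influential user.

    user_influence_scores: dict mapping user -> inferred I^U score
    rated_items:           dict mapping user -> set of items already rated
    all_items:             set of all items in the dataset
    rating_values:         allowed rating values, e.g. [1, 2, 3, 4, 5]
    Returns a set of (user, item, value) additions.
    """
    u = max(user_influence_scores, key=user_influence_scores.get)
    unrated = list(all_items - rated_items.get(u, set()))
    items = random.sample(unrated, min(n, len(unrated)))
    return {(u, i, random.choice(rating_values)) for i in items}
```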
**RMU.** The RMU fuzzing heuristic removes random ratings from the most influential user and is defined as:

$$RMU : I^U \times D \times \mathbb{Z} \to 2^R \qquad (4)$$

Items are selected uniformly at random from the set of items rated by that user. The intuition behind this heuristic is that by removing ratings from the highest influence user, that user can lose influence, causing the recommendations of other users to change.

**CMU.** The CMU fuzzing heuristic changes random ratings by the user with the highest influence score to have the lowest possible rating value. This heuristic is defined as:

$$CMU : I^U \times D \times \mathbb{Z} \to 2^C \qquad (5)$$

Items are selected uniformly at random from the set of items rated by that user. The intuition behind this heuristic is that by changing ratings of the highest influence user, we may cause a recommended item to no longer be recommended.

**ALI.** The ALI fuzzing heuristic adds random ratings to the item with the lowest influence score. We define the function for this heuristic as:

$$ALI : I^I \times D \times \mathbb{Z} \to 2^A \qquad (6)$$

Users are selected from the set of users that have not rated the least influential item. Both users and rating values are selected uniformly at random. The intuition is that adding ratings to the least influential item may cause it to become influential, and cause recommendations to change.

**AMR.** The AMR fuzzing heuristic adds new random ratings to items with average rating values near the most influential rating value. We define the function for this heuristic as:

$$AMR : I^R \times D \times \mathbb{Z} \to 2^A \qquad (7)$$

Items are selected uniformly at random from the set of items with an average rating within 0.05 of the most influential rating, which selects items very close to the influential rating. We experimented with several values of $\epsilon$, and we chose 0.05 as the value that generally produced the most instability. Users are selected uniformly at random from the set of users that have not rated the selected item. The rating value to add is selected uniformly at random. The intuition behind this heuristic is that by adding random ratings to items with an influential average rating, we can move the average away from the influential rating value to reduce the likelihood that the item will be recommended.

**AMA.** The AMA heuristic adds low-valued ratings to items with the highest aggregate attribute influence and is defined as:

$$AMA : I^{A_I} \times D \times \mathbb{Z} \to 2^A \qquad (8)$$

Because items can have multiple attributes, this heuristic aggregates the influence scores of all attributes of an item by summing their influence scores. We use a slightly modified attribute influence function in order to negatively weight low-influence attributes:

$$I^{A_I}_2(a) = \frac{2\,\lvert\{a' \mid I^{A_I}(a) \geq I^{A_I}(a')\}\rvert}{\lvert A_I \rvert} \qquad (9)$$

The item with the highest aggregate attribute influence is selected for modification. Users are selected uniformly at random from the users who have not rated the selected item. The intuition behind this heuristic is that by adding low ratings to items with many highly influential attributes, we can decrease the influence of those attributes, causing items with those attributes to not be recommended.

### Differential Stability Oracles

To test the stability of recommender systems, we define oracles as Boolean predicates of the form:

$$f(\{d(\mathit{rank}(u), \mathit{rank}'(u)) \mid u \in U\}) < \delta \qquad (10)$$

This predicate ensures that a function f, applied to the set of distances between users' rankings using the original and modified datasets, is below a specified threshold $\delta$. In this work, we consider f to be a function that computes the average distance. We can compute the distance between rankings using a variety of metrics, depending on which types of change we wish to be sensitive to.
Many possible metrics may be used to compute the distance between rankings, covering a range of important aspects of change, including the order of the top-k rankings, the inclusion of items in the top-k recommendations, and the exclusion of important items from a user's top-k. In this work, we introduce the Top Out measure to quantify significant ranking changes. This measure checks whether the top item in the unmodified ranked list has dropped out of the top-k rankings when using the modified dataset. We assume that the top-ranked item is likely to be one of the most difficult to change. Therefore, if this item is not in the ranked list after modifications are added to the dataset, then the recommendations should be considered to have significantly changed. We define this measure as:

$$\mathrm{Top\,Out}(R_u, R'_u) = \begin{cases} 1 & \text{if } r_1 \notin R'_u \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

$R_u$ is the set of items recommended to user u using the recommender trained on the original dataset, and $R'_u$ is the set of items recommended to user u after modifications are made to the dataset. The item $r_1$ is the top-ranked item ($\mathit{rank}(r_1, u) = 1$) in the ranked list of u. This metric is sensitive only to the top-ranked item for a user, which generally has the highest likelihood of being preferred by the user. To change the value of this metric, the top item in the original ranking must not be included in the new top-k ranked list.

### Practical Considerations and Usage

For this approach to be applicable, certain preconditions must be met. First, the developer must have read and write access to the full dataset that was used to train the recommender system under test. This is reasonable, as testing will generally be performed by a developer or a dedicated tester of the system, who will thus have access to this data. Second, we assume that additions, removals, or changes are realistic modifications that can occur to a dataset, which is the case for most recommendation systems that evolve over time. Third, in defining differential oracles for recommender stability, we assume that identifying a threshold of acceptable instability $\delta$ is possible, either by using standard or historical measures. While choosing a threshold value is domain-, application-, and even dataset-dependent, an empirical process can be used to find appropriate values. For instance, a space of threshold values for the initial system can be explored to help developers set expectations, and then that baseline threshold can be used as the system or the dataset evolves.

When those preconditions are met, the approach can provide not only better stability estimates than existing approaches but also concrete dataset changes that may cause significant instability. Historical stability estimates can then be used by developers to assess the evolution of their recommender from a robustness perspective and to pinpoint departures from established trends. Developers can also use the concrete dataset changes to determine how best to adjust the existing algorithm underlying the recommendation system to make it more robust to variations in the dataset. Last, the stability estimates and the dataset changes can guide data cleansing procedures (for example, by increasing or decreasing the impact of certain records or aspects) and assist in the definition of data retraining policies after deployment.
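Before moving to the evaluation, here is a minimal sketch of the Top Out measure and the averaged differential oracle of Equations 10 and 11, assuming each user's recommendations are available as a rank-ordered Python list; this is an illustrative implementation, not the authors' code.

```python
def top_out(original_top_k, modified_top_k):
    """Top Out(R_u, R'_u): 1 if the originally top-ranked item has dropped
    out of the user's top-k list after retraining on the modified dataset."""
    r1 = original_top_k[0]               # top-ranked item in the original list
    return 1 if r1 not in modified_top_k else 0

def oracle(original_recs, modified_recs, delta):
    """Differential stability oracle: average Top Out over all users < delta.

    original_recs / modified_recs: dicts mapping each user to a rank-ordered
    list of recommended items (R and R').
    Returns (is_stable, average_instability).
    """
    distances = [top_out(original_recs[u], modified_recs[u]) for u in original_recs]
    avg = sum(distances) / len(distances)
    return avg < delta, avg
```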
We carried out a study to explore the cost-effectiveness of our fuzzing heuristics, and we also explore how the type of influence and type of modification used by a fuzzing heuristic affects its ability to find instability-inducing modifications. More specifically, we answer the following questions:

RQ1: How effective and efficient are the different influence models in guiding generation of instability-inducing modifications?

RQ2: How can we fuzz a recommender system if the algorithm is a black box and the type of underlying influence is unknown?

## Study Design

We evaluated our approach to fuzzing recommender systems by applying the selected influence-guided fuzzing heuristics to several recommender systems, using a variety of recommender algorithms and a movie ratings dataset. We discuss these choices in more detail below.

We evaluated the fuzzing heuristics for 3 sizes of modification set: 1, 10, and 100. These sizes correspond to changes of 0.001, 0.01, and 0.1 percent of the dataset, respectively, and were chosen to be much smaller than the size of the dataset. With less than 0.1% of the ratings being modified, we would expect the recommendations to exhibit proportionally small amounts of change. For each heuristic, size, and recommender system (described next), we generated 100 modification sets. We then trained each recommender on the modified dataset and generated Top-10 recommendations for every user. The log of the average Top Out metric over the 100 generated modification sets is plotted in Figure 2.

### Recommender Algorithms

For this study we selected four recommendation algorithms. The algorithms were chosen to represent a variety of recommender techniques, ranging from memory-based to model-based, and from content-based to collaborative filtering. An additional criterion that guided our algorithm selection was that a candidate algorithm had to work with the MovieLens dataset (Harper and Konstan 2015), either because it was a part of the LensKit framework or because its adaptation to that framework required only minor data wrangling. We describe each of the four chosen algorithms below.

**User-User.** The User-User algorithm is a memory-based collaborative filtering algorithm introduced by the GroupLens project (Resnick et al. 1994). The algorithm computes user similarity scores based on user rating vectors. In this work, we use the implementation of User-User provided by the LensKit framework (Ekstrand et al. 2011), which computes user similarity using vector cosine similarity (Breese, Heckerman, and Kadie 1998).

**Item-Item.** The Item-Item algorithm is a model-based collaborative filtering algorithm. Similarity scores are precomputed for all pairs of item rating vectors. The predicted value of an item is estimated by aggregating the ratings of the most similar items (Sarwar et al. 2001; Deshpande and Karypis 2004). We again use the Item-Item implementation provided by the LensKit framework (Ekstrand et al. 2011).

**FunkSVD.** FunkSVD is a model-based collaborative filtering algorithm that uses stochastic gradient descent to learn a matrix factorization (Funk 2006). In this work, we use the implementation of FunkSVD provided by the LensKit framework (Ekstrand et al. 2011), which learns 25 latent features.

**LightFM.** LightFM is a hybrid recommender algorithm that uses both ratings and item attributes to build a recommender model (Kula 2015). In our study, we use the Python implementation of the algorithm provided by its author.
It is the only algorithm of the four that explicitly uses item attributes when building a recommender model; the implementation is available at https://github.com/lyst/lightfm. Using the MovieLens dataset, we provide the genres as item attributes.

### MovieLens 100k Dataset

We use the dataset released in 1998 from the MovieLens recommendation system. The dataset is available as a group of tab-separated files, containing 100 thousand integer ratings (from 1 to 5) collected from 943 users over a period of 8 months on 1682 movies. Each user has rated at least 20 items. More details about the data collection process and the dataset itself are available at https://grouplens.org/datasets/movielens/100k/ (Harper and Konstan 2015).

### Treatments

To study the effectiveness of influence-guided fuzzing, we compare the defined heuristics from Table 2 against three random baselines: ARAND, RRAND, and CRAND. The ARAND heuristic adds new ratings to the dataset by randomly selecting users, items, and rating values. The RRAND heuristic removes ratings from the dataset by randomly selecting users and items from the existing ratings. The CRAND heuristic randomly changes ratings in the dataset to have the lowest rating value by randomly selecting users and items. These baselines were chosen to control for the effect of the influence model on the user or item choice. By comparing the influence-based heuristics to the random baseline of the same modification type (e.g., CRAND and CMU), we can evaluate the effectiveness of using influence models to guide fuzzing of recommender systems.

### Differential Oracle

Our evaluation reports the results from using the Top Out distance metric (d = Top Out), and we use a differential oracle that computes the average distance over all users (f = avg).

### Methodology

To evaluate efficiency, we measure how long it takes each heuristic to generate a modification. We generate 100 modification sets of size 1, 10, and 100 for each heuristic and report the average time to generate a modification set of each size. To compare the effectiveness of influence-guided fuzzing heuristics to random fuzzing, we compute the average Top Out instability across 100 generated modification sets for each configuration of recommender and modification set size. We explored modification sets of size 1, 10, and 100, which are small compared to the original dataset; changing 100 ratings would change only 0.1% of the dataset used in this study. We arbitrarily chose to stop at a maximum modification set size of 100. We then compute the ratio of the Top Out instability induced by each heuristic to the instability induced by the random fuzzing heuristic of the same modification type. A ratio greater than 1.0 indicates that a heuristic outperforms random; a value of 10 means that influence-guided fuzzing caused 10 times more instability than random.

### Threats to Validity

The recommendations produced by a system are dependent on both the recommender algorithm and the dataset used to train the system. Because we only evaluate our approach with a single dataset, our results may not generalize to other datasets. That said, the approach is general, and this dataset is commonly used and includes user and item attributes for content-based recommendation.

Figure 2: Ratio of mean Top Out instability for each influence-guided fuzzing heuristic compared to the mean Top Out instability for ARAND using sets of 1, 10, and 100 Add-type modifications (panels for User-User, Item-Item, FunkSVD, and LightFM; y-axis is the Top Out instability ratio on a log scale).
Heuristics that perform significantly better or worse than the baseline (with p < 0.05) are presented in color.

In this work we look at only a small subset of the possible influence functions and modification fuzzing heuristics. There may be other heuristics that generate instability-inducing modifications more effectively, or better approximation functions for computing influence. We do show that heuristics can be used to generate instability-inducing modifications more effectively than random methods. There are many possible metrics for computing the distance between recommendations; in this work, we only report the results of Top Out, since the results of using the Jaccard and AOD measures were similar.

## RQ1: Effectiveness and Efficiency of Fuzzing

### Effectiveness of Heuristics

Overall, influence-guided fuzzing is effective at generating small sets of modifications that induce higher amounts of instability than random, succeeding on 3 of the 4 recommendation algorithms, often providing order-of-magnitude improvements. Figure 2 shows the ratio of the Top Out instability for each influence-guided fuzzing heuristic, compared to its corresponding random baseline. The x-axis contains the heuristic and recommender pairs, and the y-axis represents the ratio on a log scale. In cases where the baseline caused a Top Out instability of 0.0, the ratio could not be computed, so we plot these points at the highest value of the y-axis. We also perform a statistical pairwise comparison of the instability induced by each influence-guided fuzzing heuristic to the instability induced by ARAND using Welch's t-test. Heuristics that perform significantly better or worse than the baseline (with p < 0.05) are presented in color in Figure 2.

For the User-User recommender system, 3 of the influence-guided fuzzing heuristics perform significantly better than random for at least one size of modification set. In fact, for modification sets of size 1, the AMR heuristic induced instability 128x greater than random. Unlike User-User, single modifications were not as effective on FunkSVD, and only performed better than random for AMR. However, sets of 10 and 100 modifications performed significantly better than random when using ALI and AMR. When generating sets of 10 modifications, AMR induced instability over 40000x greater than ARAND. With LightFM, 3 of our influence-guided heuristics (AMU, ALI, and AMR) performed similarly to random. However, AMA performed significantly better than the random baseline for sets of 10 and 100 modifications, causing 1.3x and 3.2x greater instability, respectively. While Add-type modifications tended to be ineffective for Item-Item, the RMU and CMU heuristics outperformed their corresponding baselines, with 6.3x and 13.4x greater instability, respectively.

### Fitting of Influence Model

The effectiveness of a given influence-guided fuzzing heuristic depends on how closely the inferred influence model used by the heuristic approximates the actual influences used by the recommender algorithm. For instance, in Figure 2, we see that the AMA heuristic is generally not effective at fuzzing instability-inducing modifications. This is likely because AMA uses an inferred attribute influence model, while the three algorithms for which it does not perform well do not rely on any attribute data. However, for a recommender that does rely on item attribute information, such as LightFM, the AMA heuristic is effective at generating instability-inducing modifications.
Similarly, the rating-influence-based AMR heuristic is unusually effective at causing significant changes to recommendations for the User-User and FunkSVD recommenders. We believe this is because the influential rating values for these recommenders are on the extreme ends of the rating scale. The most influential rating value for the User-User recommender is 1.0, while the influential rating value for the FunkSVD recommender is 5.0. For the other two recommenders, for which AMR performed poorly, the most influential rating value was closer to the middle of the scale.

### Effectiveness by Modification Type

For three of the recommender systems explored in this work, Add-type modifications were the most effective at influencing change. For these recommenders, at least one heuristic performed significantly better than random for modification sets of size 10 and 100. For example, the AMA heuristic is effective at fuzzing instability-inducing modification sets for the LightFM recommender. For both the User-User and FunkSVD algorithms, the AMR heuristic is effective at generating single modifications that induce significant change in the recommendations. However, Add-type modifications were not effective at inducing instability for the Item-Item recommender.

Remove-type modifications were more effective than random for sets of 100 modifications for User-User, Item-Item, and FunkSVD. This modification type was also effective for the Item-Item recommender when fuzzing sets of 10 modifications.

Overall, Change-type modifications affected the fewest recommender systems, but when they were effective, they caused large amounts of change. Change-type modifications were only effective on two algorithms, Item-Item and FunkSVD, and only for sets of 100 modifications. However, they were able to produce higher levels of instability in the Item-Item recommender than any other modification type. We conjecture that this is because the value of ratings in these systems is more influential than the relationships between users or items.

### Efficiency of Heuristics

We briefly evaluate the efficiency of each heuristic in generating a set of modifications. The mean time (in seconds) over the course of 100 trials for each Add-type heuristic and set size is reported in Table 3. The cost of fuzzing heuristics for Change and Remove modifications was similar.

Table 3: Time (in seconds) to generate a modification set.

| Heuristic / Set Size | 1 | 10 | 100 |
| --- | --- | --- | --- |
| ARAND | 0.539 | 0.608 | 2.020 |
| AMU | 0.600 | 0.673 | 1.692 |
| ALI | 0.607 | 0.801 | 1.809 |
| AMR | 0.458 | 0.580 | 1.582 |
| AMA | 3.126 | 3.053 | 3.753 |

The time to generate modification sets for random and influence-based fuzzing heuristics is comparable for AMU, ALI, and AMR, and within a factor of 6 in the worst case for AMA (AMA takes longer due to the repeated aggregation of the influence of multiple attributes for every item; see Equation 9). But even in the worst case, the timings are practically the same when considering the time it takes to train the recommender system on the new data (over 2 minutes when training on the small MovieLens 100k dataset).

## RQ2: How effective is fuzzing in the absence of algorithm information?

We explore whether influence-guided fuzzing is an effective technique for testing stability when the recommender algorithm under test is a black box and the sorts of influence used by the system are unknown. To accomplish that, we introduce a portfolio approach to generating modifications. We assume a budget of n modifications and a user-specified portfolio of m influence-guided fuzzing heuristics. For each modification in the modification set, we use a round-robin approach to select one of the m fuzzing heuristics to generate the modification.
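One way the round-robin selection could be realized is sketched below; it assumes each heuristic in the portfolio is a callable that produces a single modification from its influence model and the dataset, which is an illustrative interface rather than the authors' implementation.

```python
from itertools import cycle

def portfolio_fuzz(heuristics, influence_models, dataset, budget):
    """Round-robin portfolio: cycle through m heuristics to generate a
    budget of n modifications, without knowing which influence model
    best matches the (black-box) recommender algorithm.

    heuristics:       list of callables, each taking (influence, dataset) and
                      returning one modification (e.g., an add/remove/change)
    influence_models: list of inferred influence models, one per heuristic
    """
    modifications = []
    chooser = cycle(range(len(heuristics)))
    for _ in range(budget):
        idx = next(chooser)
        modifications.append(heuristics[idx](influence_models[idx], dataset))
    return modifications
```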
We evaluate this approach with a portfolio of the four influence-guided fuzzing heuristics studied previously. We call this hybrid heuristic APORTFOLIO. Using budgets of n = 10 and n = 100, we compare the effectiveness of APORTFOLIO to ARAND using each of the four recommenders studied above. (We do not use n = 1 for this approach because it would be equivalent to selecting a single heuristic.) The average Top Out instability for APORTFOLIO is reported in Table 4. Values in this table are bold if they are significantly better than random.

Table 4: Mean Top Out instability for APORTFOLIO.

| Recommender, M | ARAND | APORTFOLIO |
| --- | --- | --- |
| User-User, 10 | 0.002969 | 0.035599 |
| User-User, 100 | 0.031294 | 0.257794 |
| Item-Item, 10 | 0.002375 | 0.000244 |
| Item-Item, 100 | 0.026681 | 0.008653 |
| FunkSVD, 10 | 0.000011 | 0.130721 |
| FunkSVD, 100 | 0.053446 | 0.768982 |
| LightFM, 10 | 0.011474 | 0.012450 |
| LightFM, 100 | 0.010933 | 0.017678 |

For the User-User, FunkSVD, and LightFM recommenders, the portfolio approach is able to find modification sets that cause more instability than ARAND. APORTFOLIO was not effective for the Item-Item recommender system. This is likely because none of the Add-type heuristics performed well on the Item-Item recommender. Overall, using a portfolio of influence-guided fuzzing heuristics is effective at generating modification sets that cause significant changes in recommendations.

## Conclusions and Future Work

We present an approach that uses influence-guided fuzzing to test the stability of recommender systems. We build on the insight that influence models can be inferred from the recommendations produced by a recommender system and the dataset used to train that system. We define a sample of fuzzing heuristics that use inferred influence models to generate modifications to the original dataset that induce instability in the recommendations. To test instability, we define a test oracle that uses a threshold of acceptable instability, measured as the distance between users' recommendations. Our study shows that influence-guided fuzzing is effective at finding small sets of modifications that cause significantly more instability than random approaches. In future work, we will perform a more exhaustive search of this space to identify what features of heuristics and influence models are most effective, for example, using hybrid forms of influence over multiple types of aspects in the dataset, and fuzzing heuristics that take advantage of multiple types of influence.

## Acknowledgments

This work has been supported in part by National Science Foundation awards #1526652 and #1617916, and by A*STAR SERC PSF grant 152120008.

## References

Adomavicius, G., and Zhang, J. 2010. On the stability of recommendation algorithms. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, 47-54. New York, NY, USA: ACM.

Adomavicius, G., and Zhang, J. 2012. Stability of recommendation algorithms. ACM Trans. Inf. Syst. 30(4):23:1-23:31.

Breese, J. S.; Heckerman, D.; and Kadie, C. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 43-52. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Chiang, W.; Shen, L.; Li, L.; and Ning, X. 2018. Drug recommendation toward safe polypharmacy. CoRR abs/1803.03185.

Deshpande, M., and Karypis, G. 2004. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst. 22(1):143-177.

Ekstrand, M. D.; Ludwig, M.; Konstan, J. A.; and Riedl, J. T. 2011. Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. In Proceedings of the Fifth ACM Conference on Recommender Systems, 133-140. New York, NY, USA: ACM.

Funk, S. 2006. Netflix update: Try this at home.

Gunawardana, A., and Shani, G. 2015. Evaluating Recommender Systems. Boston, MA: Springer US. 265-308.

Gunes, I.; Kaleli, C.; Bilge, A.; and Polat, H. 2014. Shilling attacks against recommender systems: A comprehensive survey. Artificial Intelligence Review 42(4):767-799.

Harper, F. M., and Konstan, J. A. 2015. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst. 5(4):19:1-19:19.

Herlocker, J. L.; Konstan, J. A.; Terveen, L. G.; and Riedl, J. T. 2004. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1):5-53.

Julian, K. D.; Lopez, J.; Brush, J. S.; Owen, M. P.; and Kochenderfer, M. J. 2016. Policy compression for aircraft collision avoidance systems. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), 1-10.

Kula, M. 2015. Metadata embeddings for user and item cold-start recommendations. In Bogers, T., and Koolen, M., eds., Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems, volume 1448, 14-21. CEUR-WS.org.

Lam, S. K., and Riedl, J. 2004. Shilling recommender systems for fun and profit. In Proceedings of the 13th International Conference on World Wide Web, WWW '04, 393-402. New York, NY, USA: ACM.

Liang, H.; Pei, X.; Jia, X.; Shen, W.; and Zhang, J. 2018. Fuzzing: State of the art. IEEE Transactions on Reliability 67(3):1199-1218.

Miller, B. P.; Fredriksen, L.; and So, B. 1990. An empirical study of the reliability of UNIX utilities. Commun. ACM 33(12):32-44.

O'Mahony, M.; Hurley, N.; Kushmerick, N.; and Silvestre, G. 2004. Collaborative recommendation: A robustness analysis. ACM Trans. Internet Technol. 4(4):344-377.

Resnick, P., and Varian, H. R. 1997. Recommender systems. Commun. ACM 40(3):56-58.

Resnick, P.; Iacovou, N.; Suchak, M.; Bergstrom, P.; and Riedl, J. 1994. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, 175-186. New York, NY, USA: ACM.

Sarwar, B.; Karypis, G.; Konstan, J.; and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the International Conference on World Wide Web, 285-295. New York, NY, USA: ACM.

Zhao, X.; Zhang, W.; and Wang, J. 2015. Risk-hedged venture capital investment recommendation. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys '15, 75-82. New York, NY, USA: ACM.