# Capturing the Style of Fake News

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Piotr Przybyła, Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland (piotr.przybyla@ipipan.waw.pl)

## Abstract

In this study we aim to explore automatic methods that can detect online documents of low credibility, especially fake news, based on the style they are written in. We show that general-purpose text classifiers, despite seemingly good performance when evaluated simplistically, in fact overfit to the sources of documents in the training data. In order to achieve a truly style-based prediction, we gather a corpus of 103,219 documents from 223 online sources labelled by media experts, devise realistic evaluation scenarios and design two new classifiers: a neural network and a model based on stylometric features. The evaluation shows that the proposed classifiers maintain high accuracy in the case of documents on previously unseen topics (e.g. new events) and from previously unseen sources (e.g. emerging news websites). An analysis of the stylometric model indicates it indeed focuses on sensational and affective vocabulary, known to be typical of fake news.

## Introduction

The problem of fake news and the wider issue of credibility in online media continue to attract considerable attention, not only of their consumers and creators, but also of policy makers and the digital industry. One of the responses of social media sites is to signal the untrustworthiness of certain content, e.g. by marking it as disputed on Facebook (Clayton et al. 2019). Unfortunately, the manual fact-checking involved is too laborious to be applied to every post published on these platforms and elsewhere. That is why we consider text analysis techniques that could automatically assess the credibility of documents published online, which could also be useful for other stakeholders intending to reduce the impact of misinformation, including journalists (Chen, Conroy, and Rubin 2015) and web users (Berghel 2017).

The most straightforward approach to the problem is automatic verification of each claim included in a document. This task, however, faces many challenges, namely the insufficient level of text understanding methods and the limited coverage and currency of knowledge bases. It may seem easier to train a machine learning (ML) model on a collection of online documents, accompanied by expert-assigned labels indicating their credibility, using one of the general-purpose text classification algorithms. As we show in the next section, this has indeed been done, sometimes leading to impressive classification accuracy. The disadvantage of such solutions is that we have no direct control over which features of the document (e.g. word occurrences in a bag-of-words representation) the credibility assessment is based on. Since some features that provide good performance on the training or test data might not be desirable in a real-life application scenario, it does not suffice to know that a classifier makes the right decision in most cases; we would like it to do so for the right reasons. This includes knowing what features are important for a particular decision (interpretability) and making sure they are not specific to the training data (generalisability).
For example, an ML model might learn to recognise the source a given document comes from (using its name appearing in the text) and assign a credibility label based on other documents from the same source, i.e. other articles from the website, seen in the training data. While taking into account the reputation of a source is a heuristic heavily used by humans when assessing information online (Metzger, Flanagin, and Medders 2010) and commonly advised for fake news spotting (Hunt 2016), it may be misleading in an ML context. Fake news websites tend to be short-lived (Allcott and Gentzkow 2017), and such a model would be helpless when new sources replace them. The document topic could be another easily accessible, yet misleading feature. While fake news outlets indeed concentrate around a few current themes that are guaranteed to engage the target audience (Bakir and McStay 2017), these topics will be replaced over time, making a classifier obsolete.

In this study we focus on the style of writing, i.e. the form of a text rather than its meaning (Ray 2015). Since fake news sources usually attempt to attract attention for a short-term financial or political goal (Allcott and Gentzkow 2017) rather than to build a long-term relationship with the reader, they favour informal, sensational, affective language (Bakir and McStay 2017). This indicator of low credibility could be used to build a reliable classifier. Several directions could be pursued to avoid the model being biased by the sources or topics available in the training data. In this study we provide the following contributions:

- We present a textual corpus with 103,219 documents, covering a wide range of topics, from 223 sources labelled for credibility based on studies performed at PolitiFact and Pew Research Center, which is a useful resource for building unbiased classifiers.
- We use the corpus to construct evaluation scenarios measuring the performance of credibility estimation methods more realistically, by applying them to documents from sources and topics that were not available at training time.
- We propose two classifiers, a neural network and a model based on features used in stylometric analysis, and demonstrate that the latter indeed captures the affective language elements.

In order to encourage and facilitate further research, we make the corpus, the evaluation scenarios and the code (for the stylometric and neural classifiers) available online (https://github.com/piotrmp/fakestyle).

## Related work

The problem of fake news has been attracting major attention since the 2016 presidential elections in the US (Allcott and Gentzkow 2017). It has been a subject of research in journalism and political sciences, but much more is needed, especially to assess the widely discussed connections with social media and political polarisation (Tucker et al. 2018).

### Annotated corpora

In the challenge of automatic detection of fake content, textual data annotated with respect to credibility, veracity or related qualities play a crucial role (Torabi Asr and Taboada 2019). One of the most commonly used resources of this kind is Open Sources (https://github.com/BigMcLargeHuge/opensources), a publicly available list of around 1000 web sources with human-assigned credibility labels. This list was used by a web browser plugin, B.S. Detector, whose decisions in turn were used to generate a corpus of 12,999 posts from 244 websites it labelled as fake. The corpus was made available as a Kaggle dataset (https://www.kaggle.com/mrisdal/fake-news). Another collection (https://github.com/several27/FakeNewsCorpus) was built by automatically scraping the domains from Open Sources.
Pathak and Srihari (2019) manually selected around 700 documents from the dataset and labelled them based on the type of misinformation they use. Journalists at BuzzFeed News contributed by assessing the veracity of 2,282 posts published before the 2016 elections by 9 Facebook pages (Silverman 2016). The data was later made available as a corpus (https://zenodo.org/record/1239675). A different approach to creating a corpus was taken by Shu et al. (2018), who explored the claims fact-checked by PolitiFact and GossipCop and automatically retrieved relevant webpages, obtaining 23,921 articles, with the vast majority covering celebrity gossip.

### Credibility assessment

The first studies on recognition of fabricated news focused on machine-generated (Badaskar, Agarwal, and Arora 2008) or satirical (Burfoot and Baldwin 2009) articles. Approaches to what we currently call fake news were hampered by the low amount of data available; e.g. Rubin, Conroy, and Chen (2015) worked on just 144 news items and the recognition performance was not significantly better than chance. Horne and Adali (2017) used datasets with 35 fake news items from a BuzzFeed News article (Silverman 2016) and another 75 items gathered by themselves manually, and made interesting observations on the stylistic cues affecting credibility. However, the prediction performance may not be reliable due to the small data size. Pérez-Rosas et al. (2018) attempted to overcome the lack of data by artificially generating a fake news corpus through crowdsourcing, achieving a classification accuracy of 0.74 on a balanced test set. However, their classifier is trained on fewer than 1000 documents and uses word n-grams, making it prone to overfitting to sources or topics, which is confirmed by weaker results in a cross-topic evaluation scenario (around 0.50-0.60). Another way to collect a sufficient number of manually credibility-labelled documents with limited resources is active learning (Bhattacharjee, Talukder, and Balantrapu 2017).

Rashkin et al. (2017) used a dataset including satire and hoax news stories and the evaluation, performed on previously unseen sources, showed an F-score of 0.65. The classifier is, however, unlikely to be topic-independent due to relying on word tri-grams, which is demonstrated by the presence of keywords related to current topics among the strong features (e.g. syria). Ahmed, Traore, and Saad (2017) reported high accuracy (92% on a balanced test set) using only TF-IDF of word n-grams on the Kaggle dataset. Given how prone this type of features is to overfitting to particular news sources (e.g. through their names appearing in text), and that the evaluation was performed through ordinary cross-validation, it seems unlikely this accuracy would be upheld on new sources. A recent study by Potthast et al. (2018), using the BuzzFeed News corpus, is similar to our work, as they ensured their classifier is style-based by building it on top of stylometric features. However, they argued that accurate identification of fake news was not possible, and instead focused on detecting hyperpartisan media, which consistently follow a right- or left-wing narrative, as opposed to mainstream outlets. Our study has a wider scope, since many sources included in our corpus lack such partisan consistency or do not even focus on politics at all.
To sum up, although there have been several attempts to classify credible and non-credible news publications, they were all limited by the amount of available data, resulting in either low performance or likely overfitting to the topics or sources used during training. We aim to overcome this limitation by gathering a corpus much larger than any previously used for training such models, which allows us to achieve high classification performance while keeping the credibility assessment style-driven.

## Corpus

Since the definition of fake news is still being discussed (Gelfert 2018), within this work we use the notion of credibility. We define a document as non-credible if the source it comes from has been assessed as such by experts. To gather non-credible documents, we use a list of websites (Gillin 2017) prepared by journalists at PolitiFact, a nonprofit fact-checking centre. On the other hand, the most trusted news outlets from a study (Mitchell et al. 2014) by Pew Research Center (PRC), an independent public opinion research unit, are considered credible. Websites of both categories are crawled to obtain documents, which are then converted into plain text and treated as learning cases with the label denoting them as credible (0) or non-credible (1), according to the source they come from.

### Collecting the documents

To obtain documents from non-credible sources, we use the websites labelled in 2017 by PolitiFact as fake news (192 sources) and impostor (49) (Gillin 2017). Unfortunately, less than a quarter of them are still active in 2019, but most websites remain available in the Wayback Machine archives (https://archive.org/web/). Since the list was last updated on 09.11.2017, the latest available snapshot of the main page of each website between 01.01.2017 and 09.11.2017 is selected for crawling. Six of the websites are excluded, since they do not contain any news pieces, but rather discussions, prank content and advice articles. The websites for which no documents except a front page are available in the archive are excluded, too. As for the credible sources, we choose the 21 media outlets that were more commonly trusted than distrusted, according to the survey report (Mitchell et al. 2014) by PRC. We exclude two news aggregators (Google News, Yahoo News), which do not create their own content but link to external sources, and MSNBC, which contains only video materials. In total, this procedure retains 205 non-credible and 18 credible websites.

The sites are crawled by following HTML links, starting from the main page and limiting the path length to 5 and the maximum number of visited links to 10,000. We ignore pages not archived in 2017, duplicates and subpages with text of average line length less than 15 words. This process results in a corpus of 52,790 pages from non-credible and 50,429 from credible sources. Finally, after conversion from HTML to plain text using manually designed heuristic rules, we obtain a textual corpus of 103,219 documents and 117M tokens.
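The page-level filtering described above is simple enough to restate in code. The following is a minimal sketch (not the actual crawling pipeline), assuming plain-text pages and a caller-maintained set of content hashes; `keep_page` and `seen_hashes` are hypothetical names, and the 15-word threshold is taken from the description above.

```python
import hashlib

MIN_AVG_WORDS_PER_LINE = 15  # threshold used when filtering subpages

def keep_page(plain_text: str, seen_hashes: set) -> bool:
    """Return True if a crawled subpage should enter the corpus.

    Drops pages whose non-empty lines contain fewer than 15 words on average
    (navigation menus, link lists, etc.) and exact duplicates of earlier pages.
    """
    lines = [ln for ln in plain_text.splitlines() if ln.strip()]
    if not lines:
        return False
    avg_words = sum(len(ln.split()) for ln in lines) / len(lines)
    if avg_words < MIN_AVG_WORDS_PER_LINE:
        return False
    digest = hashlib.sha1(plain_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False  # duplicate content
    seen_hashes.add(digest)
    return True
```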
### Corpus exploration

The numbers of documents coming from the respective sources in the corpus differ greatly. This is especially true for the non-credible ones, which span from a few large websites (13 have more than 1000 documents) to plenty of very small ones (77 have fewer than 50 documents). The fact that fake news is published by numerous small sources, three quarters of which are unavailable after two years, illustrates the importance of the credibility assessment not relying on a particular source. The credible outlets are larger and closer in size: from 1,227 to 6,019 documents. While the corpus is gathered from US media, the distribution of non-credible content over numerous outlets of much smaller size than the credible sources was also confirmed in an analysis of online media reach in Europe (Fletcher et al. 2018).

In order to model topical differences between sources, we compute a model of 100 topics using LDA (Latent Dirichlet Allocation) (Blei, Ng, and Jordan 2003) implemented in Mallet (McCallum 2002). Next, each document is assigned to the topic it has the strongest association with. Figure 1 shows how many of the documents from credible and non-credible sources are assigned to the largest 15 topics, described by associated keywords.

Figure 1: The largest 15 LDA topics in the corpus, each shown with the six most significant keywords, an identifier and bars illustrating the number of credible and non-credible documents associated with it.

We can see that some themes are far more popular in the non-credible part: the comparisons between the current president and his predecessor and election rival (topics #19 and #70), media coverage (#85), Muslims and immigration (#23 and #11) and health/nutrition (#76). The areas that are more commonly covered by credible sources include cinema (#50) and sports (#5). Some issues popular in both classes are the Russia investigations (#62), crime (#55) and international conflicts in the Middle East and Korea (#17 and #2). This analysis further illustrates the need for a credibility classifier to avoid relying on topics in its decisions. Many of the differences in vocabulary between the source types come from interest in very specific and current themes, such as Hillary Clinton's e-mails, Donald Trump's presidency or illegal immigration through Mexico. While these features may seem discriminative at a certain period of time, a classifier based on them may be unable to perform well in the future, when the media attention turns elsewhere.

## Stylometric classifier

In terms of the general architecture of the stylometric classification, we use a collection of stylistic features followed by linear modelling. While similar approaches have been applied to credibility assessment (Burfoot and Baldwin 2009; Ahmed, Traore, and Saad 2017; Horne and Adali 2017; Rashkin et al. 2017; Pérez-Rosas et al. 2018), in this study we take special care to avoid using features that would allow a classifier to overfit to particular sources and topics. That is why, instead of the popular n-grams of words, we use n-grams of Part of Speech (POS) tags. Another group of tools frequently employed in stylistic analysis are dictionaries, e.g. Linguistic Inquiry and Word Count (LIWC) (Tausczik and Pennebaker 2009), used in fake news detection (Horne and Adali 2017; Rashkin et al. 2017; Pérez-Rosas et al. 2018), or General Inquirer (GI) (Stone et al. 1962), used for hyperpartisan news recognition (Potthast et al. 2018). The weakness of these resources lies in the limited dictionary size, e.g. GI contains 8640 words in 182 categories (http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm). We therefore increase its size by expanding each category with words similar according to a word2vec (Mikolov et al. 2013) representation. Firstly, for each category of size n, we build a logistic regression model of belonging to this category, using all words represented by vectors trained on the Google News corpus (https://code.google.com/archive/p/word2vec/). Then, 4n new words with the highest score are added to the category. Performing this procedure for all 182 categories yields a dictionary with a total size of 34,293 words.
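The expansion step can be reproduced with off-the-shelf tools. Below is a minimal sketch, assuming the pre-trained Google News vectors are loaded via gensim and that `gi_categories` is a hypothetical dictionary mapping each GI category name to its original word list; in practice one would restrict the vocabulary to frequent words, since fitting a model over the full three-million-word vocabulary is slow.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

# Assumed inputs: the Google News vectors and a dict gi_categories
# mapping each of the 182 GI categories to its original word list.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
vocab = list(vectors.key_to_index)      # words that have an embedding
emb = vectors[vocab]                    # (|V|, 300) embedding matrix

def expand_category(category_words, factor=4):
    """Fit a membership model for one category and add 4n similar words."""
    members = {w for w in category_words if w in vectors.key_to_index}
    y = np.array([w in members for w in vocab], dtype=int)
    clf = LogisticRegression(max_iter=1000).fit(emb, y)
    scores = clf.predict_proba(emb)[:, 1]               # affinity to the category
    ranked = [w for _, w in sorted(zip(-scores, vocab)) if w not in members]
    return sorted(members) + ranked[: factor * len(members)]

expanded_gi = {cat: expand_category(words) for cat, words in gi_categories.items()}
```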
### Stylometric features

The documents are preprocessed by Stanford CoreNLP (Manning et al. 2014), including sentence segmentation, tokenisation and POS tagging. This annotation is used to generate the following document features:

- the number of sentences, the average sentence length (in words) and the average word length (in characters),
- the numbers of words matching different letter case schemes (all lower case, all upper case, just the first letter upper case, other), represented as counts normalised by the document length,
- frequencies of POS unigrams, bigrams and trigrams, represented as counts normalised by the document length (if present in at least 5 documents),
- frequencies of words belonging to the 182 word categories in the expanded GI dictionary, represented as counts normalised by the document length.

### Classifier

The dataset includes 103,219 instances described by 39,235 features. We apply a two-stage approach to selecting relevant features: first preliminary filtering, then building a regularised classifier.

At the filtering stage, we use the Pearson correlation with the output variable, which is a common technique for linear classifiers (Guyon and Elisseeff 2003). First, we check whether feature j is present in document i by computing a binary matrix with elements b_{i,j} = 1[x_{i,j} ≠ 0]. Potthast et al. (2018) perform filtering by removing features that occur in less than 2.5% or 10% of documents. We argue this could lead to a loss of information, since a less frequent feature could still be significant, as long as a large majority of the documents it occurs in belong to the same class. Therefore, we take into account the class label y by computing the correlation coefficient and including each feature j such that b_j = [b_{1,j}, b_{2,j}, ...] satisfies |cor(b_j, y)| > 0.05. The number of retained features depends on the training-test split, but we observe it to be always below 5%.

To assess the probability of a document belonging to the non-credible category, a logistic regression model is built. The vastness of the feature space implies a need for regularisation, so we apply the L1 version (LASSO), as implemented in the glmnet package (Friedman, Hastie, and Tibshirani 2010) in R (R Core Team 2013), with the penalty parameter λ selected through cross-validation over the training set. The classifier output is used directly as a non-credibility score taking values between 0 (credible) and 1 (non-credible). When the evaluation demands a discrete output (to compute accuracy), a threshold of 0.5 is applied.
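The released code implements this classifier in R with glmnet; the sketch below is a rough Python equivalent under the assumption that `X` is the document-by-feature count matrix and `y` holds the 0/1 credibility labels, with scikit-learn's L1-penalised logistic regression (and its built-in cross-validated penalty search) standing in for glmnet.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def filter_features(X, y, threshold=0.05):
    """Keep features whose binary presence correlates with the label: |cor(b_j, y)| > 0.05."""
    B = (X != 0).astype(float)                    # b_ij = 1[x_ij != 0]
    Bc = B - B.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Bc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.divide((Bc * yc[:, None]).sum(axis=0), denom,
                  out=np.zeros(B.shape[1]), where=denom > 0)
    return np.abs(r) > threshold                  # boolean mask of retained features

def train_stylometric_model(X, y):
    """L1-penalised (LASSO-style) logistic regression, penalty chosen by cross-validation."""
    mask = filter_features(X, y)
    clf = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5, max_iter=5000)
    clf.fit(X[:, mask], y)
    return mask, clf

# Usage: scores close to 1 mean non-credible; 0.5 is the decision threshold.
# mask, clf = train_stylometric_model(X_train, y_train)
# noncred_score = clf.predict_proba(X_test[:, mask])[:, 1]
```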
The second of the applied solutions, called BiLSTMAvg, is a neural network with an architecture based on elements commonly used in natural language processing, i.e. word embeddings (Mikolov et al. 2013) and bidirectional LSTM (Hochreiter and Schmidhuber 1997). Since LSTM is most commonly employed to represent the meaning of short text fragments, especially sentences, we have decided to add an additional layer that computes the credibility scores (probabilities of the credible and non-credible classes) of an article by averaging the scores of all its sentences. This should also encourage the classifier to seek credibility clues in every sentence rather than just focus on the easy ones (e.g. those mentioning a source name). Specifically, the following layers are included:

- an embedding layer, representing each token using a 300-dimensional word2vec vector trained on Google News,
- two LSTM layers, forward and backward, representing each sentence by two 100-dimensional vectors (the output of the last cell in a sequence),
- a densely-connected layer, reducing the dimensionality to 2 and applying softmax to compute class probabilities,
- an averaging layer, representing each document's class probability scores by averaging the scores of all its sentences.

The neural network is implemented and trained in TensorFlow for 10 epochs, with the sentence length limited to 120 tokens and the document length limited to 50 sentences.
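The description above translates fairly directly into Keras. The following is a minimal sketch of such an architecture (not the released implementation), assuming documents are already tokenised into padded integer arrays of shape (50, 120) and that `embedding_matrix` holds the pre-trained word2vec vectors; padded sentences are included in the average for simplicity.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_SENTS, MAX_TOKENS, EMB_DIM = 50, 120, 300   # limits stated above

def build_bilstm_avg(vocab_size, embedding_matrix):
    """Per-sentence BiLSTM scores averaged into a document-level prediction."""
    # Sentence encoder: embeddings -> forward/backward LSTM -> 2-way softmax.
    sent_in = layers.Input(shape=(MAX_TOKENS,), dtype="int32")
    emb = layers.Embedding(
        vocab_size, EMB_DIM, mask_zero=True,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))(sent_in)
    h = layers.Bidirectional(layers.LSTM(100))(emb)       # two 100-dim last-cell outputs
    sent_scores = layers.Dense(2, activation="softmax")(h)
    sentence_model = models.Model(sent_in, sent_scores)

    # Document model: apply the sentence encoder to each sentence and average.
    doc_in = layers.Input(shape=(MAX_SENTS, MAX_TOKENS), dtype="int32")
    per_sentence = layers.TimeDistributed(sentence_model)(doc_in)   # (batch, 50, 2)
    doc_scores = layers.GlobalAveragePooling1D()(per_sentence)      # average over sentences
    model = models.Model(doc_in, doc_scores)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_bilstm_avg(vocab_size=len(word_index) + 1, embedding_matrix=emb_matrix)
# model.fit(x_train, y_train_onehot, epochs=10, batch_size=32)
```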
## Baseline classifiers

To understand whether general-purpose text classifiers are able to capture document style without overfitting to features indicating a source or topic, and to put the performance of our stylometric and neural solutions in perspective, we also evaluate two baseline models: bag of words and BERT.

Bag of words: This simple model represents documents through frequencies of unigrams, bigrams and trigrams of lemmata (base forms) of words, occurring in at least 200 documents. The feature filtering and logistic regression model construction are performed as in the stylometric classification.

To employ BERT, a commonly used pre-trained language model (Devlin et al. 2018), we take the uncased base version and fine-tune it in a supervised text classification task using the recommended architecture (linear prediction over the output corresponding to the [CLS] element). We use the first 512 tokens of each document and the process is executed in each CV fold independently.

The main evaluation procedure involves running the model construction and prediction in a 5-fold cross-validation (CV) scenario and comparing its output to the true labels. Due to sufficiently balanced classes, we use accuracy instead of precision or recall. Three scenarios are considered:

- plain document-based CV, where folds include completely random documents from across the dataset;
- topic-based CV, where each of the LDA topics, generated as described previously, is assigned to one of the CV folds with all associated documents. This scenario simulates a situation when a test document belongs to a previously unseen topic, e.g. corresponding to a new event;
- source-based CV, where each of the document sources is assigned to one of the CV folds with all its documents. This allows us to measure the performance expected for articles from previously unseen websites.

| Method | doc. CV | topic CV | source CV |
|---|---|---|---|
| Stylometric | 0.9274 | 0.9173 | 0.8097 |
| BiLSTMAvg | 0.8994 | 0.8921 | 0.8250 |
| Bag of words | 0.9913 | 0.9886 | 0.7078 |
| BERT | 0.9976 | 0.9965 | 0.7960 |

Table 1: Classification accuracy of our stylometric and neural classifiers compared to baselines in three evaluation scenarios, simulating, respectively, a new document from known sources and topics, a document from an unknown topic and a document from an unseen source.

Using CV, while increasing the computation time, helps to strengthen the evaluation by including documents from all sources in the test sets. To facilitate comparisons of classification performance with other approaches, the corpus download site includes the assignment of documents to CV folds.
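The topic- and source-based scenarios amount to grouping documents before splitting. The exact fold assignment used here is published on the corpus download site; the sketch below merely illustrates how such folds can be produced with scikit-learn, assuming hypothetical arrays `doc_topics` and `doc_sources` that give each document's dominant LDA topic and its source.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

def assign_folds(n_docs, groups=None, n_splits=5, seed=0):
    """Return a fold id per document: plain document CV when groups is None,
    otherwise all documents sharing a group (topic or source) land in one fold."""
    fold_of = np.empty(n_docs, dtype=int)
    if groups is None:
        splits = KFold(n_splits, shuffle=True, random_state=seed).split(np.arange(n_docs))
    else:
        splits = GroupKFold(n_splits).split(np.arange(n_docs), groups=groups)
    for fold_id, (_, test_idx) in enumerate(splits):
        fold_of[test_idx] = fold_id
    return fold_of

# document_folds = assign_folds(len(y))                        # document-based CV
# topic_folds    = assign_folds(len(y), groups=doc_topics)     # topic-based CV
# source_folds   = assign_folds(len(y), groups=doc_sources)    # source-based CV
```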
Using Table 1, showing classification accuracy, we can clearly distinguish two groups of methods. Firstly, the popular general-purpose classifiers (bag of words and BERT) perform extremely well in document CV, but lose 20-30% when applied in source CV, which indicates they overfit to the sources seen in training. The stylometric method, although noticeably weaker in document CV, proves more resistant to new sources, almost reaching 81%, which could be considered a positive result compared to the most similar previous work (Potthast et al. 2018). Interestingly, even better results are provided by the BiLSTMAvg network, which despite the worst performance in document CV beats everything else in source CV. Topic CV is less of a challenge, as all of the tested methods lose at most 1% in this scenario.

For a better understanding of the stylistic differences between sources, Figure 2 shows a boxplot of the non-credibility scores computed by the stylometric classifier in source-based CV, grouped by sources sorted by mean score, with colour corresponding to the true category.

Figure 2: Predicted non-credibility scores assigned to documents by the stylometric classifier, grouped by sources (credible and non-credible), sorted by the average score of all documents in each source.

We can see the credible (blue) sources mostly below the 0.5 threshold and the non-credible ones (yellow) above it. Nevertheless, the wide range of scores in the large sources means some of the documents are misclassified. We can also notice cases where the mean score places a source in the wrong category. The most striking non-credible examples of that (two leftmost yellow bars) are The Times Mexico (times.com.mx) and Before It's News (beforeitsnews.com). The first one (currently unavailable) was labelled by PolitiFact as an "Imposter site", as it pretends to be a credible medium, a branch of The Times. It has relatively few pages (34 in the corpus), but it mixes made-up stories, e.g. "Leaked Audio: Mexican President Agrees to Pay For Wall", with articles copied from reputable sources, e.g. "Snapchat's Physical Footprint Reveals Core Priority of the Brand", originally from Yahoo Finance. Such instances will be challenging for any content- or style-based classifier. The second problematic source is a large (3,303 documents in the corpus) citizen journalism portal, allowing anonymous users to post content of various kinds, frequently resembling a discussion forum rather than a news outlet. This could render the classification difficult, but the portal includes obviously fake stories, too, e.g. "Worker Says Nanobot Mosquito Killed Woman!". The most challenging cases of credible sources (two rightmost blue bars), whose style resembles the non-credible ones, are Fox News and The Blaze, which have been regarded as the most distrusted and the least known of the credible sources included (Mitchell et al. 2014).

Finally, thanks to the choice of the features and the tendency of regularised regression to include a limited number of them in a model, we are able to visualise some of the motives behind a non-credibility score provided by the stylometric model. This can help us make sure the classifier indeed relies on stylistic elements of a document. While a detailed analysis of the syntactic patterns remains beyond the scope of this work, the GI dictionary features have a clear correspondence in text, which was used to highlight significant words in a fragment of an article (http://freedomdaily.com/trump-just-bombarded-3-dirtydems-major-surprise-will-shut-good/) from the non-credible portion (Figure 3). Specifically, we have highlighted in yellow the words for which the GI features' contribution was above 20.
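The exact contribution measure and the threshold of 20 are specific to the fitted glmnet model, so the snippet below is only a rough illustration of the idea: with a linear model, a word's pull through the dictionary features can be approximated by summing the coefficients of the expanded GI categories it belongs to. `expanded_gi`, `coef_by_category` and the threshold value are hypothetical placeholders.

```python
def gi_word_scores(tokens, coef_by_category, expanded_gi):
    """Approximate each token's pull towards the non-credible class
    via the GI dictionary features of a linear model."""
    scores = {}
    for tok in {t.lower() for t in tokens}:
        scores[tok] = sum(coef for cat, coef in coef_by_category.items()
                          if tok in expanded_gi[cat])
    return scores

def highlight(tokens, scores, threshold):
    """Wrap tokens whose approximate contribution exceeds the threshold."""
    return " ".join(f"[{t}]" if scores.get(t.lower(), 0.0) > threshold else t
                    for t in tokens)

# print(highlight(doc_tokens, gi_word_scores(doc_tokens, coef_by_category, expanded_gi), 20.0))
```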
We can see that the words indicating lower credibility are indeed quite affective, e.g. idiots, outrage, disgrace. Note how the highlighted words are not specific to the topic of the article (except one: collusion), which can explain the good performance in topic-based evaluation. To compare with the bag of words model, the words covered by n-grams with coefficients above 20 are underlined. This baseline approach appears to be less style-driven: while there is some overlap (i.e. idiots), other words clearly indicate the focus on the document topic (i.e. Barack Obama, Trump, Democrats).

Figure 3: Sample document fragment with words indicating low credibility according to the GI dictionary features of the stylometric classifier (highlighted in yellow) and the baseline bag of words model (underlined).

The obtained results clearly emphasise the importance of designing realistic evaluation scenarios when measuring the performance of credibility assessment solutions. A naive evaluation (through a random held-out subset) might suggest that a simple bag of words model can achieve near-perfect accuracy, while in confrontation with documents from an unseen source it performs poorly. To the best of our knowledge, this is the first time this aspect of credibility assessment has been observed and measured. The proposed stylometric classifier, while showing more consistent performance over the evaluation scenarios, still loses over 10% of accuracy on unseen sources. The most straightforward explanation might be that, instead of the general style of fake news, the model captures the styles of individual sources. Another possible reason is the imperfection of the text extraction procedure, which sometimes produces a plain text version that includes not only the actual news content, but also some standard website elements (e.g. encouraging social media sharing or commenting), which are converted to POS n-grams and could be picked up by the classifier. Unfortunately, these fragments differentiate only specific websites, not the general categories we seek to recognise.

The interpretability of the stylometric classifier allows us to verify that it indeed takes into account the affective words. Such interpretability is helpful in making sure a model generalises well (Lipton 2016), but also plays a role in obtaining users' trust (Pieters 2011). Being able to explain why a model has made a certain decision is crucial in the application scenarios relevant for this work. Take political misperceptions as a prominent example (Flynn, Nyhan, and Reifler 2017): providing an alternative explanation for observed events has been shown to be significantly more effective than a simple contradiction (Nyhan and Reifler 2015). Unfortunately, the sentence-level credibility scores provided by BiLSTMAvg are too coarse-grained to be informative, so additional work is necessary to obtain a word-level importance measure. The mechanism of attention (Vaswani et al. 2017) is commonly used in this role, but the validity of its use as an explanation is debatable (Jain and Wallace 2019). Despite the interpretability issues associated with neural networks in general, the obtained results show they are worth considering in this scenario. BiLSTMAvg, despite lacking any explicit style-focused features, avoids overfitting and delivers the best performance on new sources.

The most important limitations of the study come from the basic assumptions we make about credibility. Firstly, that its assessment at the level of a source is inherited by all documents within it. We can expect that not every document from a non-credible source contains false information, just as not every news item from the trusted outlets is perfectly accurate. Whether this understanding of credibility reflects the concept of fake news will depend on which of its definitions we apply: while some rely on the veracity of the provided information, others emphasise being misleading by design (Gelfert 2018). We aim to address this problem in the future by extending the corpus with document-level credibility assessment. Secondly, the dependency between the writing style of a document and its credibility observed in our dataset might not be universal or permanent. While the current misinformation landscape is dominated by obvious profit-driven websites, there may exist some (possibly more in the future) that are on par with real news outlets in terms of quality and style, yet provide misleading content.

## Conclusions

To sum up, the credibility of news articles in our corpus can indeed be estimated based on the style they are written in. However, given how subtle the manifestation of style might be compared to more prominent traits, such as source or topic, special care is needed when collecting a learning sample and designing classification and evaluation procedures, to make sure the theoretical performance translates to a benefit in a social context. The high classification accuracy obtained in the experiments indicates that, despite previous claims that automatic fake news detection based on style does not work in general (Potthast et al. 2018) or may never be possible (Tucker et al. 2018), it is a worthwhile direction of research. We hope that future work in this field will be facilitated by the contributed corpus and evaluation scenarios.

## Acknowledgments

This work was supported by the Polish National Agency for Academic Exchange through a Polish Returns grant, number PPN/PPO/2018/1/00006, and by the Google Cloud Platform through research credits.

## References

Ahmed, H.; Traore, I.; and Saad, S. 2017. Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. In Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments, 127-138. Springer International Publishing.

Allcott, H., and Gentzkow, M. 2017. Social Media and Fake News in the 2016 Election. Journal of Economic Perspectives 31(2):211-236.

Badaskar, S.; Agarwal, S.; and Arora, S. 2008. Identifying Real or Fake Articles: Towards better Language Modeling. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II. Asian Federation of Natural Language Processing.

Bakir, V., and McStay, A. 2017. Fake News and The Economy of Emotions: Problems, causes, solutions. Digital Journalism 6(2):154-175.

Berghel, H. 2017. Lies, Damn Lies, and Fake News. Computer 50(2):80-85.

Bhattacharjee, S. D.; Talukder, A.; and Balantrapu, B. V. 2017. Active learning based news veracity detection with feature weighting and deep-shallow fusion. In Proceedings of the IEEE International Conference on Big Data. IEEE.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993-1022.

Burfoot, C., and Baldwin, T. 2009. Automatic satire detection: are you having a laugh? In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, 161-164. Association for Computational Linguistics.

Chen, Y.; Conroy, N. J.; and Rubin, V. L. 2015. News in an online world: The need for an "automatic crap detector". Proceedings of the Association for Information Science and Technology 52(1).
Clayton, K.; Blair, S.; Busam, J. A.; Forstner, S.; Glance, J.; Green, G.; Kawata, A.; Kovvuri, A.; Martin, J.; Morgan, E.; Sandhu, M.; Sang, R.; Scholz-Bright, R.; Welch, A. T.; Wolff, A. G.; Zhou, A.; and Nyhan, B. 2019. Real Solutions for Fake News? Measuring the Effectiveness of General Warnings and Fact-Check Tags in Reducing Belief in False Stories on Social Media. Political Behavior.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171-4186. Association for Computational Linguistics.

Fletcher, R.; Cornia, A.; Graves, L.; and Nielsen, R. K. 2018. Measuring the reach of fake news and online disinformation in Europe. Technical report, Reuters Institute for the Study of Journalism.

Flynn, D. J.; Nyhan, B.; and Reifler, J. 2017. The Nature and Origins of Misperceptions: Understanding False and Unsupported Beliefs About Politics. Political Psychology 38:127-150.

Friedman, J.; Hastie, T.; and Tibshirani, R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1).

Gelfert, A. 2018. Fake News: A Definition. Informal Logic 38(1):84-117.

Gillin, J. 2017. PolitiFact's guide to fake news websites and what they peddle. PolitiFact.

Guyon, I., and Elisseeff, A. 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3:1157-1182.

Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8):1735-1780.

Horne, B. D., and Adali, S. 2017. This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News. In Proceedings of the 2nd International Workshop on News and Public Opinion at ICWSM. Association for the Advancement of Artificial Intelligence.

Hunt, E. 2016. What is fake news? How to spot it and what you can do to stop it. The Guardian.

Jain, S., and Wallace, B. C. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3543-3556. Minneapolis, Minnesota: Association for Computational Linguistics.

Lipton, Z. C. 2016. The Mythos of Model Interpretability. In Proceedings of the 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016).

Manning, C. D.; Bauer, J.; Finkel, J.; Bethard, S. J.; Surdeanu, M.; and McClosky, D. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics.

McCallum, A. K. 2002. MALLET: A machine learning for language toolkit.

Metzger, M. J.; Flanagin, A. J.; and Medders, R. B. 2010. Social and Heuristic Approaches to Credibility Evaluation Online. Journal of Communication 60(3):413-439.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.

Mitchell, A.; Kiley, J.; Gottfried, J.; and Matsa, K. E. 2014. Political Polarization & Media Habits. Technical report, Pew Research Center.

Nyhan, B., and Reifler, J. 2015. Displacing Misinformation about Events: An Experimental Test of Causal Corrections. Journal of Experimental Political Science 2(1):81-93.
Pathak, A., and Srihari, R. 2019. BREAKING! Presenting Fake News Corpus for Automated Fact Checking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 357-362. Florence, Italy: Association for Computational Linguistics.

Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; and Mihalcea, R. 2018. Automatic Detection of Fake News. In Proceedings of the 27th International Conference on Computational Linguistics, 3391-3401.

Pieters, W. 2011. Explanation and trust: what to tell the user in security and AI? Ethics and Information Technology 13(1):53-64.

Potthast, M.; Kiesel, J.; Reinartz, K.; Bevendorff, J.; and Stein, B. 2018. A Stylometric Inquiry into Hyperpartisan and Fake News. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 231-240. Association for Computational Linguistics.

R Core Team. 2013. R: A Language and Environment for Statistical Computing.

Rashkin, H.; Choi, E.; Jang, J. Y.; Volkova, S.; and Choi, Y. 2017. Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Ray, B. 2015. Style: An Introduction to History, Theory, Research, and Pedagogy. Parlor Press, The WAC Clearinghouse.

Rubin, V. L.; Conroy, N. J.; and Chen, Y. 2015. Towards news verification: Deception detection methods for news discourse. In Proceedings of the Hawaii International Conference on System Sciences.

Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; and Liu, H. 2018. FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media. arXiv:1809.01286.

Silverman, C. 2016. This Analysis Shows How Viral Fake Election News Stories Outperformed Real News On Facebook. BuzzFeed News.

Stone, P. J.; Bales, R. F.; Namenwirth, J. Z.; and Ogilvie, D. M. 1962. The general inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science 7(4):484-498.

Tausczik, Y. R., and Pennebaker, J. W. 2009. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology 29(1):24-54.

Torabi Asr, F., and Taboada, M. 2019. Big Data and quality data for fake news and misinformation detection. Big Data & Society 6(1).

Tucker, J. A.; Guess, A.; Barberá, P.; Vaccari, C.; Siegel, A.; Sanovich, S.; Stukal, D.; and Nyhan, B. 2018. Social Media, Political Polarization, and Political Disinformation: A Review of the Scientific Literature. Technical report, Hewlett Foundation.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, 5998-6008. Curran Associates, Inc.