Clinton vs. Trump 2016: Analyzing and Visualizing Tweets and Sentiments of Hillary Clinton and Donald Trump

Siddharth Grover, Oklahoma State University, Stillwater

ABSTRACT

The United States 2016 presidential campaign has seen an unprecedented amount of media coverage, numerous presidential candidates, and acrimonious debates over wide-ranging topics among candidates of both the Republican and the Democratic Party. Twitter is a dominant social medium through which people understand, express, relate to, and support the policies proposed by their favorite political leaders. In this paper, we analyze the sentiment of the tweets posted by Hillary Clinton and Donald Trump on their Twitter feeds. We also analyze the most frequent policy-related keywords used by these candidates, along with the Twitter handles they most frequently mentioned in their tweets. This paper demonstrates the application of SAS Text Miner and SAS Sentiment Analysis Studio to perform text mining and sentiment analysis on tweets collected during the election season. We found that Donald Trump was more concerned with media coverage: he frequently tweeted at and mentioned media handles and used the social space to talk negatively about other presidential nominees. Trump's tweets also fell short on positive sentiment and carried an overall negative sentiment. Hillary Clinton, on the other hand, used the same space to discuss some of her policies and current events. Though Clinton did not show the same ability as Trump to drive the mainstream media narrative on Twitter, she did generate an overall positive sentiment.

INTRODUCTION

The 2016 US presidential election has been the nation's biggest media fest for the past several months, and the media coverage is poised to increase as we near Election Day. In this election year, Twitter has emerged as yet another battleground.
How this platform is used, how the candidates project themselves, and how they are perceived by the online community will ultimately influence voters on Election Day. This paper follows the Twitter journey of the two presidential candidates by analyzing their views and sentiments through their tweets. We performed text mining and sentiment analysis focused on the following key points to compare the two candidates:

- Most used phrases on Twitter
- Twitter handles the candidates most often tweeted at
- Policy keywords mentioned on Twitter
- Sentiment of Clinton's and Trump's tweets

DATA PREPARATION

We extracted about 200,000 tweets from Twitter's live stream API using mytwitterscraper, an open source real-time Twitter scraper written in Java. The analysis covers April 2016 to June 2016. We concentrated on the @realdonaldtrump and @hillaryclinton Twitter handles and on trending hashtags such as #trump2016 and #clinton2016. We also collected the number of followers, re-tweets, and favorited tweets for both candidates.

Figure 1. Data Preparation: web scraping (mytwitterscraper) of Donald Trump tweets (@realdonaldtrump, #trump2016) and Hillary Clinton tweets (@hillaryclinton, #clinton2016)
METHODOLOGY

The following text mining process flow was implemented:

Figure 2. Text Mining Process Flow

TEXT IMPORT

The text from around 200,000 tweets was extracted and saved as text files using the Text Import node of SAS Enterprise Miner. This node converts different types of files into text files and saves them in the destination folder specified by the user. Many of the tweets were re-tweets or otherwise redundant, so we removed them to retain a wide variety of topics for analysis.

TEXT PARSING

Text mining began by parsing the data to find tokens (terms), part-of-speech tags, entities, and so on. We ignored parts of speech such as prepositions, determiners, and auxiliary verbs, along with numeric values and punctuation, as these carry very little information. The term-by-document frequency matrix generated by this node was helpful in understanding how often each term occurs in the text and how many documents contain it.

Figure 3. SAS Text Parsing Node property settings and Terms Output

TEXT FILTERING

The Text Filter node was used to reduce the number of terms by eliminating the terms with the lowest frequencies in the documents. An English dictionary was used to identify and correct spelling errors. The filter viewer shows all the tweets containing a specific term and allows further drill-down by creating concept links based on those terms. We used the Text Filter node with Inverse Document Frequency as the term weight property. The spell check option was helpful in removing redundant and incorrect terms.
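To illustrate what an inverse document frequency term weight does, the following is a minimal Python sketch rather than the SAS implementation. It assumes one common formulation, log2(N / df) + 1, where N is the number of documents and df is the number of documents containing the term; the paper does not state the exact formula used by the Text Filter node. The function name and the toy tweets are hypothetical.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Return {term: log2(N / df) + 1} for a list of tokenized documents.

    Assumed formulation for illustration only; the tool's own
    definition of the IDF term weight may differ.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return {term: math.log2(n_docs / df) + 1 for term, df in doc_freq.items()}


# Hypothetical toy tweets, already tokenized and lower-cased.
docs = [
    ["jobs", "taxes", "america"],
    ["taxes", "health"],
    ["america", "taxes", "guns"],
    ["health", "education"],
]
weights = idf_weights(docs)

# Terms that appear in few documents receive larger weights.
for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term:10s} {weights[term]:.3f}")
```

Under this weighting, a term that appears in nearly every tweet contributes little, while a term concentrated in a few tweets is emphasized.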
Output 1. Text Filtering Node Spell Check output

Output 2. Most Frequent Terms through Text Filter Node

CONCEPT LINKS

One of the interesting functions of SAS Enterprise Miner is creating concept links in the Interactive Filter Viewer of the Text Filter node. Concept links help in visualizing the association between co-occurring terms in the documents. The width of a line signifies the strength of the association between the terms it connects: a thick line depicts a strong association. We created four concept links to understand the associations between words in our data set based on frequency. We built two concept links around the official Twitter accounts of the two candidates to reflect what they are talking about, and two more around what Twitter users are saying about each candidate.
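The idea behind a concept link can be approximated by counting how often two terms occur in the same document. The Python sketch below is a simplification under that assumption: it uses raw co-occurrence counts as the link strength, whereas the strength measure computed by SAS Enterprise Miner is not described in this paper and may differ. The function names and sample tweets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the documents containing both."""
    pair_counts = Counter()
    for tokens in documents:
        # Sort so that each pair is stored in one canonical order.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def links_for(term, pair_counts, top_n=5):
    """Return the terms most often co-occurring with `term`."""
    linked = Counter()
    for (a, b), count in pair_counts.items():
        if a == term:
            linked[b] = count
        elif b == term:
            linked[a] = count
    return linked.most_common(top_n)


# Hypothetical tokenized tweets.
docs = [
    ["hillary", "iamwithher", "vote"],
    ["hillary", "iamwithher"],
    ["hillary", "campaign", "vote"],
    ["trump2016", "america"],
]
pairs = cooccurrence_counts(docs)
hillary_links = links_for("hillary", pairs)
print(hillary_links)
```

In a plot, each returned pair would become a line from the central term, drawn thicker for larger counts.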
Output 3. Official Hillary Clinton Twitter handle: @hillaryclinton

The above concept link shows the association between terms in the tweets Hillary Clinton made. There is a strong relationship among the terms status, hillaryclinton, supporter, and love, which might imply that she tweets about her love for her supporters. Terms like campaign, money, and voter might imply that she asks voters to vote or to donate money to her campaign.

Output 4. Concept link for #hillary2016

This concept link shows the association between terms in the tweets containing the hashtag #hillary2016. There is a strong relationship between the terms iamwithher and hillary, which might be due to Hillary supporters mentioning the hashtags #iamwithher and #hillary2016 together in the same tweet. Some people tweeting about #hillary2016 might also have been Donald Trump or Bernie Sanders supporters, as terms like trump2016 and bernieorbust also appear.
Output 5. Official Donald Trump Twitter handle: @realdonaldtrump

The concept link shows the terms mentioned in the tweets by Donald Trump. There is a strong relationship between the terms foxnews, cnn, trump2016, and hillaryclinton, as these might be the most mentioned terms in Trump's tweets. This might imply that Trump frequently tweets at media handles and also uses terms like American, gun, and potus.

Output 6. Concept link for #trump2016

The concept link shows the association between terms in the tweets containing the hashtag #trump2016. There is a strong association between the terms gun, realdonaldtrump, neverhillary, and hillaryclinton. This might be due to
Trump supporters tweeting their support for Donald Trump while mentioning terms like neverhillary and hillaryclinton. People tweeting about #trump2016 also mentioned terms like america, want, and cnn.

ANALYSIS OF MOST RECENT TWEETS

We further analyzed the 3,000 most recent original tweets from both Hillary Clinton and Donald Trump.

Most Frequent Words

Figure 3. Most frequent words based on recent 3000 tweets

Most Mentioned Twitter Handles

TRUMP             %      CLINTON             %
@CNN              16%    @POTUS              36%
@FoxNews          14%    @billclinton        20%
@foxandfriends    8%     @realdonaldtrump    16%
@nytimes          7%     @BernieSanders      10%
@JebBush          7%     @HFA                4%

Figure 4. Most mentioned handles in Twitter timeline
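A tally of mentioned handles like the one in Figure 4 can be sketched in a few lines of Python. This is an illustration, not the procedure used in the paper: it assumes each percentage is a handle's share of all mentions, which the paper does not specify, and the function name and sample tweets are hypothetical.

```python
import re
from collections import Counter

# A handle is "@" followed by letters, digits, or underscores.
MENTION_PATTERN = re.compile(r"@(\w+)")


def mention_shares(tweets, top_n=5):
    """Return the top handles with their share of all mentions."""
    counts = Counter()
    for text in tweets:
        # Lower-case so that @CNN and @cnn are counted together.
        counts.update(handle.lower() for handle in MENTION_PATTERN.findall(text))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(handle, count / total) for handle, count in counts.most_common(top_n)]


# Hypothetical sample tweets.
tweets = [
    "Thank you @CNN and @FoxNews!",
    "Watching @cnn tonight.",
    "Great interview on @foxandfriends",
    "No mentions here",
]
shares = mention_shares(tweets)
for handle, share in shares:
    print(f"@{handle:15s} {share:.0%}")
```

A different denominator, such as the number of tweets rather than the number of mentions, would give different percentages.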
Top Policy-Focused Key Words

TRUMP          COUNT    CLINTON        COUNT
Terror         133      Guns           250
Immigration    78       Health         150
Jobs           78       Taxes          83
Taxes          53       Immigration    80
Guns           13       Education      53
Education      13       Foreign        53
Health         8        Vets           28

Figure 5. Most frequently mentioned policy keywords

SENTIMENT ANALYSIS

We performed sentiment analysis on tweets posted from the official Twitter handles of both candidates (@realdonaldtrump and @hillaryclinton). From the data collected for text mining, we extracted two random samples of 5,000 tweets each as modeling data sets. We extracted two additional random samples of 2,000 tweets each to test the results. We used the most recent tweets for data exploration and initial trends before starting the sentiment analysis.

We first built a basic statistical model to find the overall sentiment of the tweets. The statistical model is built from the training tweets, with term frequencies contributing to the weights of the terms, and validation data is used to fine-tune the model for increased accuracy. The statistical model can predict the sentiment of tweets overall but not at a granular level. Feature-level sentiment prediction can only be accomplished using rule-based models.

The rule-based model is more flexible and sophisticated than the statistical model. It allows custom rules to be written alongside the rules learned from the statistical model. To predict the sentiment of the tweets with the rule-based model, we divided the tweets into two bins, positive and negative. For this, we took a random sample of around 2,000 tweets from the entire data set and categorized each as either positive or negative. We considered for modeling only those tweets that were undisputedly coded as positive or negative. Finally, all of the models were scored against the test data to see how well they hold up and predict the overall sentiment expressed by both candidates.
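As a rough illustration of how a rule-based approach assigns tweets to bins, the Python sketch below scores each tweet against two small term lists. It is not the rule syntax of SAS Sentiment Analysis Studio, and the term lists are hypothetical placeholders, not the rules used in this study. Tweets that match neither list, or match both equally, are treated as neutral.

```python
# Hypothetical term lists for illustration only.
POSITIVE_TERMS = {"great", "love", "proud", "thank"}
NEGATIVE_TERMS = {"bad", "failing", "sad", "terrible"}


def classify(tweet):
    """Label a tweet positive, negative, or neutral by counting rule hits."""
    # Split on whitespace and strip surrounding punctuation.
    tokens = [t.strip(".,!?:;\"'").lower() for t in tweet.split()]
    positive_hits = sum(t in POSITIVE_TERMS for t in tokens)
    negative_hits = sum(t in NEGATIVE_TERMS for t in tokens)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical sample tweets.
for text in [
    "Thank you, Ohio! Great crowd.",
    "A terrible, failing policy.",
    "Debate tonight at 9pm",
]:
    print(classify(text), "|", text)
```

A production rule set would also handle negation, phrases, and feature-specific rules, which this sketch leaves out.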
MODEL TESTING

The statistical model has an overall precision of 84% for Trump and 83% for Clinton, whereas the rule-based model has a precision of 92% for Trump and 90% for Clinton on their respective test data sets. We ran different statistical models on the modeling data set, using Smoothed Relative Frequency with Chi Square as the criterion to build the model. The resulting statistical model was then tested on the test data set for overall sentiment accuracy, using a total of 2,000 tweets from the candidates.
           Statistical Model Results        Rule-Based Model Results
           Negative  Positive  Overall      Negative  Positive  Overall
Trump      83%       85%       84%          94%       90%       92%
Clinton    79%       87%       83%          85%       95%       90%

Output 7. Model Testing Results

Results: The rule-based model built for Trump shows 94% precision for positive sentiment and 90% precision for negative sentiment. The model built for Clinton shows 85% precision for positive sentiment and 95% precision for negative sentiment. Overall precision for both rule-based models is 90% or higher.

SENTIMENT ANALYSIS: TRUMP

The overall sentiment associated with Trump is negative. The sentiment distribution is 47% negative, 27% positive, and 25% neutral.

Output 8. Trump Sentiment Analysis Model Output

SENTIMENT ANALYSIS: CLINTON

The overall sentiment associated with Clinton is positive. The sentiment distribution is 37% positive, 35% neutral, and 28% negative.

Output 9. Clinton Sentiment Analysis Model Output
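Per-class figures like those in Output 7 come from comparing model predictions with hand-coded labels. The Python sketch below uses the standard definition of precision, the fraction of tweets assigned to a class that truly belong to it. This is an assumption: the paper does not define the term, and the definition used by the tool may differ. The function name and the labels are hypothetical.

```python
def precision_by_class(actual, predicted):
    """Return {label: precision} for every label the model predicted."""
    result = {}
    for label in sorted(set(predicted)):
        # Actual labels of the tweets the model assigned to this class.
        flagged = [a for a, p in zip(actual, predicted) if p == label]
        result[label] = sum(a == label for a in flagged) / len(flagged)
    return result


# Hypothetical hand-coded labels and model predictions for six tweets.
actual = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos", "neg"]

scores = precision_by_class(actual, predicted)
for label, value in scores.items():
    print(f"{label}: {value:.0%}")
```

The overall figures in Output 7 equal the average of the negative and positive columns, for example (83% + 85%) / 2 = 84% for Trump under the statistical model.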
CONCLUSION

In the world of real-time information, Twitter plays an important role in disseminating news and opinions. To capture the varying political views of the leading presidential candidates, we performed text mining and sentiment analysis on Hillary Clinton's and Donald Trump's tweets from April 2016 to June 2016. We created concept links to understand the association between terms mentioned in the tweets by both candidates and in public tweets about them. We discovered that Donald Trump frequently tweeted at and mentioned media handles, and that his tweets had an overall negative sentiment. Hillary Clinton, on the other hand, used more policy-focused keywords and frequently tweeted toward the political establishment. Clinton fell short on engagement, measured through re-tweets, but generated more positive sentiment overall.

REFERENCES

1. Chakraborty, Goutam, Murali Pagolu, and Satish Garla. November 2013. Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. SAS Institute.
2. SAS Institute Inc. 2014. Getting Started with SAS Text Miner 13.2. Cary, NC: SAS Institute Inc.
3. Grover, Swati, Jeffin Jacob, and Goutam Chakraborty. Analysis of Change in Sentiments towards Chick-fil-A after Dan Cathy's Statement about Same-Sex Marriage Using SAS Text Miner and SAS Sentiment Analysis Studio.

ACKNOWLEDGMENTS

We thank the WUSS 2016 conference committee for giving us an opportunity to present our work. We also thank Dr. Goutam Chakraborty for his continuous support and guidance.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Siddharth Grover
Oklahoma State University
Phone: 407-744-6809
Email: sid.grover@okstate.edu

Siddharth Grover is a Master's student in Business Analytics at Oklahoma State University. He has an MBA degree in Marketing from Xavier University, India. He is working as a graduate teaching assistant at Oklahoma State University.
He interned with BMO Financial Group as an AML Model Management Intern during summer '16. Earlier he worked as a Media Planner for a year at GroupM, India. He has a year's experience in using SAS tools for Predictive Modeling and Data Mining. He is a Base SAS 9 certified professional, SAS Certified Advanced Programmer, and SAS Certified Statistical Business Analyst.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.