Beyond Binary Labels: Political Ideology Prediction of Twitter Users

Beyond Binary Labels: Political Ideology Prediction of Twitter Users
Daniel Preoţiuc-Pietro
Joint work with Ye Liu (NUS), Daniel J Hopkins (Political Science), Lyle Ungar (CS)
2 August 2017

Motivation
User attribute prediction from text is successful:
Age (Rao et al. 2010 ACL)
Gender (Burger et al. 2011 EMNLP)
Location (Eisenstein et al. 2010 EMNLP)
Personality (Schwartz et al. 2013 PLoS One)
Impact (Lampos et al. 2014 EACL)
Political Orientation (Volkova et al. 2014 ACL)
Mental Illness (Coppersmith et al. 2014 ACL)
Occupation (Preoţiuc-Pietro et al. 2015 ACL)
Income (Preoţiuc-Pietro et al. 2015 PLoS One)
... and useful in many applications.

Political Ideology & Text
Hypothesis: the political ideology of a user is disclosed through language use:
partisan political mentions or issues
cultural differences

Political Ideology & Text
Previous CS/NLP research used data sets with user labels identified through:
1. User descriptions
H1: Users are far more likely to be politically engaged

Political Ideology & Text
2. Partisan hashtags
H2: The prediction problem was so far over-simplified

Political Ideology & Text
3. Lists of conservative/liberal users
H3: Neutral users can be identified

Political Ideology & Text
4. Followers of partisan accounts
H4: Differences in language use exist between moderate and extreme users

Data
Political ideology is specific to country and culture
our use case is US politics (similar to all previous work)
the major US ideology spectrum is Conservative-Liberal
seven-point scale

Data
We collect a new data set:
3,938 users (4.8M tweets)
public Twitter handle with >100 posts
Political ideology is reported through an online survey
the only way to obtain unbiased ground truth labels (Flekova et al. 2016 ACL, Carpenter et al. 2016 SPPS)
users additionally reported age, gender and other demographics

Data
Data available at preotiuc.ro
full data for research purposes
aggregate for replicability
Twitter Developer Agreement & Policy VII.A4: "Twitter Content, and information derived from Twitter Content, may not be used by, or knowingly displayed, distributed, or otherwise made available to any entity to target, segment, or profile individuals based on [...] political affiliation or beliefs"
Study approved by the Institutional Review Board (IRB) of the University of Pennsylvania

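Class Distribution
[Figure: bar chart of the number of users per class on the seven-point scale; y-axis from 0 to 1000. Bar labels as extracted: 401, 453, 696, 501, 692, 594, 195.]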

Data
For comparison to previous work, we collect a data set:
13,651 users (25.5M tweets)
users who follow liberal/conservative politicians on Twitter

Hypotheses
H1: Previous studies used users far more likely to be politically engaged
H2: The prediction problem was so far over-simplified
H3: Neutral users can be identified
H4: Differences in language use exist between moderate and extreme users

Engagement
H1: Previous studies used users far more likely to be politically engaged
Manually coded:
Political words (234)
Political NEs: mentions of politician proper names (39)
Media NEs: mentions of political media sources and pundits (20)
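The engagement measure on the following slides is an average percentage of political word usage. A minimal sketch of how such a per-user percentage could be computed is given here. The three tiny lexicons are hypothetical stand-ins (the manually coded lists from the talk are not reproduced), and the tokenizer is a simplification that does not handle multi-word names.

```python
import re

# Hypothetical mini-lexicons for illustration only; the manually coded lists
# from the talk (234 words, 39 politician names, 20 media names) are not shown.
LEXICONS = {
    "political_words": {"senate", "congress", "election", "vote"},
    "politician_names": {"obama", "trump"},
    "media_names": {"maddow", "hannity"},
}


def tokenize(text):
    """Lowercase the text and extract word-like tokens (a simplification)."""
    return re.findall(r"[a-z0-9_']+", text.lower())


def usage_percentages(tweets):
    """Return, per lexicon, the percentage of a user's tokens found in it."""
    tokens = [tok for tweet in tweets for tok in tokenize(tweet)]
    if not tokens:
        return {name: 0.0 for name in LEXICONS}
    return {
        name: 100.0 * sum(tok in words for tok in tokens) / len(tokens)
        for name, words in LEXICONS.items()
    }


# Two toy tweets: 9 tokens in total, 3 of them in the political word list.
example = usage_percentages(["Vote in the election today", "Obama spoke to Congress"])
```

Averaging these per-user percentages within each user group would give one bar per group and term type.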

Engagement
Data set obtained using previous methods
[Figure: stacked bar chart, "Political word usage across user groups"; y-axis: average percentage of political word usage. Values for the two user groups, in plot order: Political Words 2.64, 2.95; Politician Names 0.73, 0.79; Media/Pundit Names 0.11, 0.18.]

Engagement
Our data set
[Figure: stacked bar chart, "Political word usage across user groups"; y-axis: average percentage of political word usage. The two outer bars repeat the data set obtained using previous methods; the seven inner bars are the survey classes. Values in plot order:
Political Words: 2.64, 0.76, 0.55, 0.42, 0.36, 0.46, 0.51, 0.76, 2.95
Politician Names: 0.73, 0.24, 0.14, 0.07, 0.07, 0.09, 0.12, 0.19, 0.79
Media/Pundit Names: 0.11, 0.03, 0.03, 0.02, 0.02, 0.03, 0.03, 0.04, 0.18]

Engagement
Take aways:
3x more political terms for automatically identified users compared to the highest survey-based scores
almost perfectly symmetrical U-shape across all three types of political terms
the difference between 1-2/6-7 is larger than 2-3/5-6

Over-simplification
H2: The prediction problem was so far over-simplified
ROC AUC, Logistic Regression, 10-fold cross-validation:

Task   Topics   Political Terms   Domain Adaptation
CvL    .891     .972              .976
1v7    .785     .785              .789
2v6    .662     .679              .690
3v5    .581     .590              .625
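The scores above are ROC AUC values from 10-fold cross-validation. As a rough illustration of those two ingredients (not the authors' code), the sketch below computes AUC directly from its pairwise-ranking definition and generates simple fold indices. The function names and the round-robin fold assignment are assumptions made for this example. The classifier itself is left out.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive example is scored
    above a random negative one; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; items are dealt to folds round-robin."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Toy example: labels 1 = one class (e.g. 7), 0 = the other (e.g. 1).
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
splits = list(k_fold_indices(25, k=10))
```

An AUC of .5 corresponds to random ranking, which is why the 3v5 scores near .6 indicate a much harder task than CvL.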

Over-simplification
Predicting continuous political leaning (1-7)
Pearson R between predictions and true labels, Linear Regression, 10-fold cross-validation:

Features   Leaning
Unigrams   .294
LIWC       .286
Topics     .300
Emotions   .145
Political  .256
All        .369
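The evaluation metric here is the Pearson correlation between predicted and self-reported leaning. A minimal sketch of that computation follows. The function name and toy numbers are illustrative, not from the talk.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy example: survey labels on the 1-7 scale and made-up model predictions.
true_leaning = [1, 4, 7, 4]
predicted = [2, 4, 5, 5]
r = pearson_r(true_leaning, predicted)
```

In a cross-validated setting, the predictions for each user would come from the fold in which that user was held out.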

Over-simplification
Seven-class classification
[Figure: bar chart of accuracy, 10-fold cross-validation; y-axis from 0% to 30%. Bar values as extracted: 19.60%, 22.20%, 24.20%, 26.20%, 27.60%.]
GR: Logistic regression with Group Lasso regularisation

Neutral Users
H3: Neutral users can be identified
[Figure: two word clouds, one of words associated with either extreme conservative or liberal users and one of words associated with neutral users; legend: correlation strength.]
Correlations are age and gender controlled. Extreme groups are combined using matched age and gender distributions.
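The talk does not say how the age and gender control was implemented. One common way to control a correlation for covariates is a partial correlation, sketched below with the standard recursive formula. This is an assumed technique for illustration, and the numeric coding of gender and all variable names are hypothetical. The sketch does not guard against controls that are perfectly correlated with a variable.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(xs, ys, controls):
    """Correlation of xs and ys after removing the linear effect of each
    control variable, via the recursive partial-correlation formula."""
    if not controls:
        return pearson_r(xs, ys)
    *rest, z = controls
    r_xy = partial_r(xs, ys, rest)
    r_xz = partial_r(xs, z, rest)
    r_yz = partial_r(ys, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data: word usage and group membership both depend on (centred) age,
# but share nothing else, so the raw correlation is spurious.
age = [1, 1, 1, 1, -1, -1, -1, -1]
gender = [1, 1, -1, -1, 1, 1, -1, -1]
noise_a = [1, -1, 1, -1, 1, -1, 1, -1]
noise_b = [1, -1, -1, 1, 1, -1, -1, 1]
word_usage = [a + e for a, e in zip(age, noise_a)]
group_score = [a + e for a, e in zip(age, noise_b)]

raw = pearson_r(word_usage, group_score)
controlled = partial_r(word_usage, group_score, [age, gender])
```

In the toy data the raw correlation is driven entirely by age, so it vanishes once age and gender are controlled for.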

Political Engagement
H3a: There is a separate dimension of political engagement
Combine the classes into a scale: 4, 3&5, 2&6, 1&7
Pearson R between predictions and true labels, Linear Regression, 10-fold cross-validation:

Features   Leaning   Engagement
Unigrams   .294      .165
LIWC       .286      .149
Topics     .300      .169
Emotions   .145      .079
Political  .256      .169
All        .369      .196
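The class combination on this slide folds the seven-point scale around its midpoint. A small sketch of that mapping is shown below. The numeric coding 0 to 3 is an assumption, since the slide only gives the grouping 4, 3&5, 2&6, 1&7.

```python
NEUTRAL = 4  # midpoint of the seven-point ideology scale


def engagement_level(ideology):
    """Fold a 1-7 ideology label into an engagement level:
    4 -> 0, 3 and 5 -> 1, 2 and 6 -> 2, 1 and 7 -> 3 (coding assumed)."""
    if ideology not in range(1, 8):
        raise ValueError("ideology label must be an integer from 1 to 7")
    return abs(ideology - NEUTRAL)


levels = {label: engagement_level(label) for label in range(1, 8)}
```

Because the fold discards the direction of the leaning, the resulting scale can be predicted separately from leaning itself.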

Moderate Users
H4: Differences between moderate and extreme users
[Figure: two word clouds, one of words associated with moderate liberals (5 and 6) and one of words associated with extreme liberals (7); legend: correlation strength, relative frequency.]
Correlations are age and gender controlled

Take Aways
User-level trait acquisition methodologies can generate non-representative samples
Political ideology:
Goes beyond binary classes
The problem was to date over-simplified
New data set available for research
New model to identify political leaning and engagement

Questions? www.preotiuc.ro wwbp.org