Towards Tackling Hate Online Automatically


Nikola Ljubešić (1), Darja Fišer (2,1), Tomaž Erjavec (1)
(1) Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana
(2) Department of Translation, University of Ljubljana

SS22 Colloquium on Intolerant and Abusive Content Online
Auckland, New Zealand, 30 June 2018

Overview

1. The FRENK project
2. Data harvesting (Facebook)
3. Filtering the data by topic (migrants, LGBT)
4. Manual data annotation (PyBossa)
5. Automating the identification process

FRENK

The FRENK Project

- Slovene basic research project: Resources, methods, and tools for the understanding, identification, and classification of various forms of socially unacceptable discourse in the information society (2017-2019)
- Primary project goal: interdisciplinary treatment of the linguistic, sociological, legal and technological dimensions of different forms of socially unacceptable discourse (SUD)
- Partners:
  - Dept. of Knowledge Technologies, Jožef Stefan Institute (lead)
  - Faculty of Arts (linguistics)
  - Faculty of Social Sciences (social sciences)
  - The Peace Institute (law)

State of the art in automated hate speech detection

- Usage of supervised machine learning: the computer is given as many examples of hate speech and non-hate speech as possible, and a classifier is trained on these examples
- To obtain these examples, annotation campaigns have to be run, which require:
  1. Classification schema / typology
  2. Annotation guidelines
  3. Annotator training
- In most (all?) cases these three components are treated ad hoc:
  1. Not well-defined / well-argued typology
  2. No or very basic annotation guidelines
  3. Untrained students (or paper authors?) at disposal used for data annotation
- FRENK tries to address all of the above issues

Harvesting

Harvesting the data from Facebook

- Facebook has the Graph API, so we can communicate with Facebook (data) via computer programs
- Collecting all posts and comments on Facebook pages of three popular daily newspapers (alexa.com)

Page               # of posts   # of comments
24urcom                 8,375         126,983
RTV.SLOVENIJA          12,192          12,998
SiOL.net.Novice        20,257          57,406
Nova24TV                9,848          83,728

Filtering

Filtering the data for topics of interest

- Two topics (targets) of interest:
  - Migrants / Islamophobia
  - LGBT / Homophobia
- We want to (semi-)automate the filtering process
- Application of supervised machine learning:
  - Identify examples of each topic via keyword search (100 posts per topic)
  - Use these exemplary documents to train classifiers for each topic
  - For each post, the classifier predicts whether the post is on the topic of migrants, LGBT, or other
- Results of automatic classification are not perfect, but good enough for pre-filtering the data

Topic      Precision   Recall
Migrants        0.80     0.66
LGBT            0.86     0.53
Other           0.75     0.97
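As a reading aid for the precision and recall figures, the following is a minimal sketch of how per-topic precision and recall can be computed from gold and predicted labels. The function name `precision_recall` and the toy labels are assumptions made for illustration; they are not the FRENK data or code.

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Return {label: (precision, recall)} for every label in gold or predicted."""
    true_positives = Counter()
    predicted_count = Counter(predicted)
    gold_count = Counter(gold)
    for g, p in zip(gold, predicted):
        if g == p:
            true_positives[g] += 1
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        # Precision: share of posts predicted as this topic that really are on it.
        precision = (true_positives[label] / predicted_count[label]
                     if predicted_count[label] else 0.0)
        # Recall: share of posts really on this topic that the classifier found.
        recall = (true_positives[label] / gold_count[label]
                  if gold_count[label] else 0.0)
        scores[label] = (precision, recall)
    return scores


# Toy labels, invented for illustration only.
gold = ["migrants", "migrants", "migrants", "lgbt", "lgbt",
        "other", "other", "other"]
predicted = ["migrants", "migrants", "other", "lgbt", "other",
             "other", "other", "migrants"]

scores = precision_recall(gold, predicted)
for label, (precision, recall) in scores.items():
    print(f"{label:10s} precision={precision:.2f} recall={recall:.2f}")
```

For pre-filtering, high precision on the topic classes means few off-topic posts reach the annotators, while lower recall means some on-topic posts are missed.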

Amount of data after filtering

Page / topic       # of posts   # of comments
24urcom                 8,375         126,983
  Migrants                178          16,849
  LGBT                     17           2,252
SiOL.net.Novice        20,257          57,406
  Migrants                 98           3,205
  LGBT                     12             456
Nova24TV                9,848          83,728
  Migrants                684          23,174
  LGBT                     65           2,037

Annotation

Annotation schema and guidelines: SUD type

Decision tree for SUD type:

Background-based SUD?
  YES: are there elements of violence?
    YES: background, violence
    NO: background, hate
  NO: SUD towards individuals and groups?
    YES: are there elements of violence?
      YES: other, threat
      NO: other, hate
    NO: is the speech unacceptable?
      YES: unacceptable speech
      NO: acceptable speech
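The decision tree can be written down as a small function, which makes the order of the questions explicit. This is only a sketch: the function name `sud_type` and its boolean parameters are hypothetical, and in the project the questions are answered by human annotators, not by code.

```python
def sud_type(background_based, targets_individual_or_group,
             violence, unacceptable):
    """Map an annotator's yes/no answers to a SUD type label.

    All four parameters are hypothetical booleans standing in for the
    annotator's answers to the questions in the decision tree.
    """
    if background_based:
        return "background, violence" if violence else "background, hate"
    if targets_individual_or_group:
        return "other, threat" if violence else "other, hate"
    return "unacceptable speech" if unacceptable else "acceptable speech"


# Answers for a comment that attacks a group by background and calls for violence.
label = sud_type(background_based=True, targets_individual_or_group=False,
                 violence=True, unacceptable=True)
print(label)
```

The earlier questions take priority: once a comment is judged background-based, the later questions about other targets and general unacceptability are no longer asked.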

Annotation schema and guidelines: SUD target

- Migrants / LGBT
- Related to migrants / LGBT
- Journalists or media
- Another commenter
- Other

Annotation in PyBossa - a tool for crowdsourcing

Initial annotation campaign

- Annotators: bachelor and master students from the Faculties of Arts and Social Sciences, University of Ljubljana
- 33 annotators, 16/17 per topic
- Each annotator annotates the same data, giving 16/17 annotations per instance
- Training session, 5 hours
- Annotation guidelines on 8 pages
- Communication via mailing list

Distribution of responses

Migrants
  acceptable                        47.57 %
  background, hate, migrants        23.51 %
  other, hate, commenter             6.19 %
  background, violence, migrants     4.69 %
  other, hate, journalist            4.2 %
  other, hate, other                 2.56 %
  other, hate, related               1.96 %
  background, hate, related          1.83 %

LGBT
  acceptable                        63.77 %
  background, hate, lgbt            17.57 %
  other, hate, commenter             5.44 %
  other, hate, other                 4.22 %
  background, hate, related          2.43 %
  other, hate, related               1.47 %
  unacceptable, no target            0.88 %
  do not know                        0.76 %

Entropy of response distributions

- Entropy: a measure of uncertainty. Lower is better.
- If every annotator gave the same response, the entropy is 0.

[Figure: entropy of response distributions for the Migrants and LGBT topics]
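A minimal sketch of the entropy of one comment's annotations follows. It assumes Shannon entropy with a base-2 logarithm, which the slides do not specify; the function name `response_entropy` and the toy annotations are invented for illustration.

```python
import math
from collections import Counter


def response_entropy(responses):
    """Shannon entropy (in bits) of the labels given to one comment.

    Assumption: base-2 logarithm; the slides do not state the base.
    """
    counts = Counter(responses)
    total = len(responses)
    probabilities = [count / total for count in counts.values()]
    return sum(-p * math.log2(p) for p in probabilities)


# Toy annotations from 16 hypothetical annotators.
full_agreement = ["acceptable"] * 16
even_split = ["acceptable"] * 8 + ["other, hate, commenter"] * 8

print(response_entropy(full_agreement))
print(response_entropy(even_split))
```

Full agreement gives an entropy of 0, and the more evenly the annotators spread over different labels, the higher the value, which is why lower is better.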

Easy examples

acceptable
  If I myself had enough for a decent life, I'd take in or at least help one of our families

background, violence, migrants
  The media show only how they are in need and such... I wonder how many of those that would open their door to them now would help them if they physically or psychologically harassed them... or their relatives... they are not so terribly in need as the media show! They are like the Trojan horse! Seal the borders with a wall and shoot those that come near!

Hard examples

unacceptable, other 5; acceptable 3; background, hate, migrants 2; other, hate, commenter 2; ...
  DON'T EAT SHIT

other, hate, related 5; background, hate, related 3; other, hate, journalist 2; unacceptable, other 2; ...
  We have proof that monkeys are not only in parliament..

Automation

Two current approaches in machine learning

- Traditional methods
  - Linear regression, logistic regression, decision trees, support vector machines...
  - Text representation through manually defined variables, mostly specific words or sequences of words (n-grams)
- Deep learning methods
  - AI hype; drastic improvements in image and audio processing, varying improvements in text processing; data hungry!
  - Text representation through distributed word representations fed into a neural network (matrix multiplications)
  - Each word is represented through a sequence of numbers; the representations of cat and dog are much more similar than those of cat and car
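The two text representations can be illustrated with a short sketch. The helper names `word_ngrams` and `cosine` are hypothetical, and the three-dimensional word vectors are hand-made for illustration; real distributed representations are learned from large corpora and typically have hundreds of dimensions.

```python
import math


def word_ngrams(text, n):
    """Return the word n-grams of a text (traditional, manually defined features)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine(u, v):
    """Cosine similarity between two word vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors, invented for illustration only.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

bigrams = word_ngrams("The media show only", 2)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])

print(bigrams)
print(cat_dog > cat_car)
```

With n-grams, each distinct word sequence is an unrelated feature, whereas with word vectors similar words end up close to each other, so a classifier can generalise from one to the other.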

Two use cases

- GermEval 2018 shared task
  - This year's shared task at the German NLP conference, 20+ teams on board (a lot!)
  - 5,000 training examples
  - Traditional methods: 75% accuracy
  - Deep learning methods: 75% accuracy
- Dataset of deleted comments from a website
  - Croatian, 24sata.hr, obtained from the publisher
  - 500,000 training examples
  - Traditional methods: 85% accuracy
  - Deep learning methods: 95% accuracy

Conclusion

- FRENK: an interdisciplinary project, trying to improve on the problem definition and data annotation deficiencies of current projects
- Data harvesting: easy
- Data selection: medium, but crucial; a question of sample representativeness
- Data annotation: hard, very costly, both in terms of annotator training and the annotation itself (if done properly)
- (Semi-)Automation: possible, but very challenging
  - Accuracy depends on the amount of training data
  - Good results can be expected on a small number of classes
  - Training data very situational, topic- and target-dependent
