The Social Web: Social networks, tagging and what you can learn from them. Kristina Lerman USC Information Sciences Institute

The Social Web
The Social Web is a collection of technologies, practices and services that turn the Web into a platform for users to create and use content in a social environment.
- Authoring tools: blogs
- Collaboration tools: wikis, Wikipedia
- Tagging systems: del.icio.us, Flickr, CiteULike
- Social networking: Facebook, MySpace, Essembly
- Collaborative filtering: Digg, Amazon, Yahoo Answers

Social Web Features
- Users create content: articles, opinions, creative products
- Users annotate content with metadata
  - Tags: freely chosen labels
  - Geo-tags: location information
  - Ratings
- Users create connections
  - Between content and metadata
  - Between content or metadata and users
  - Among users (social networks)
- Users traverse these connections, creating new ones along the way
- Allows users to interact
  - See, discuss, comment
  - Create new links between content, metadata and users

Flickr example
[Screenshot of a Flickr photo page, annotated with: submitter; groups (Quality (PLEASE Read the RULES), Spectacular Nature invited images only); tags (reflections, sky, water, pink, blue, trees, reeds, nature, landscape, seascape, serene, earth, land, QUALITY, specnature); image stats]

User's social networks
User's contacts (friends) and group membership

User's tags
- Tags are keyword-based metadata added to content
  - Help users organize their own data
  - Facilitate searching and browsing for information
- Freely chosen by user

User's favorite images (by other photographers)

Social Web is challenging
- Social Web is enormous and growing rapidly
  - Some popular sites have >1 million users and >1 billion objects
  - 2G/day of authored content
  - 10-15G/day of user-generated content [From Andrew Tomkins, Yahoo! Research]
- Social Web is highly heterogeneous
  - Different languages
  - Different content types
- Social Web is highly dynamic
  - New users and content
  - Links are created and destroyed

Social Web is interesting
- Social Web as a complex dynamical system
  - Collective behavior emerges from actions taken by many users
  - Interesting interactions between users: network-mediated, environment-mediated (e.g., popularity-based)
- Social Web as a knowledge-generating system
  - Users express personal knowledge (e.g., through tags)
  - Tailor information to a user's individual preferences, or combine users' knowledge to create a folksonomy of concepts
- Social Web as a problem-solving system
  - By exposing human activity, Social Web allows users to harness the power of collective intelligence to solve problems
- Lots of data for empirical studies
- Social Web is amenable to analysis
  - Design systems for optimal performance

Outline for the research talk
I study how user-contributed metadata can be used to solve a variety of information processing problems, including information discovery and personalization.
- Dynamics of information spread on networks
  - Social browsing = social networks + recommendation
  - Patterns of information spread on networks are indicative of content quality
  - Mathematical analysis of collaborative decision-making on Digg
- Learning from social tagging
  - Machine learning methods to extract information from tags created by distinct users
  - Better than Google: using Del.icio.us tags to find Web services

Dynamics of information spread on networks with: Dipsy Kapoor, Aram Galstyan

Social news aggregator Digg
- Users submit stories
- Users vote on (digg) stories
  - Digg selects some stories for the front page based on users' votes
- Users create social networks by adding other users as friends
  - Digg provides a Friends Interface to track recent activity of friends
  - See stories friends submitted
  - See stories friends dugg

How the Friends interface works
[Screenshot annotations: submitter; see stories my friends submitted; see stories my friends dugg; fans of submitter; fans of voters]

Top users
- Digg ranks users
  - Based on how many of their stories were promoted to the front page
  - User with the most stories is ranked #1
- Top 1000 users data
  - Usage statistics
    - User rank
    - How many stories the user submitted, dugg, commented on
  - Social networks
    - Friends: outgoing links, A → B := B is a friend of A
    - Reverse friends: incoming links, A → B := A is a fan of B

[Figure: number of fans + 1 (1 to 100,000) versus number of friends + 1 (1 to 1,000), logarithmic axes]

Digg dataset
- Stories: collected by scraping Digg (now available through the API)
  - Front page stories: data about ~200 stories most recently promoted to the front page on June 30, 2006
  - Newly submitted stories: data about ~900 stories most recently submitted on June 30, 2006
- For each story, extracted
  - Submitter's name
  - Title of story
  - Names of the first 215 users to vote on the story
  - Number of votes the story received

Dynamics of votes
[Figure: number of votes (diggs), 0 to 2,500, versus time (min), 0 to 5,000; annotation: story interestingness]

Distribution of votes
[Figure, two panels: ~200 front page stories submitted on June 29-30, 2006; ~30,000 front page stories submitted in 2006 (Wu & Huberman, 2007)]

Dynamics of information spread
- How do producers promote their content?
- How do consumers find interesting new content?
- How do stories become popular on Digg?
  - Social networks play a major role in promoting stories on Digg
  - Patterns of information spread through networks can be used to predict how popular a story will become
  - Mathematical model for collaborative decision-making on Digg

Stories spread through the network
[Figure: distribution of the number of users who can see the story through the Friends Interface]

And receive votes from within the network
[Figure: distribution of the number of in-network votes]
Cascade = number of in-network votes (votes from fans of the previous voters)
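The cascade definition above can be sketched as a short counting routine. This is a minimal illustration, not the authors' code: the function name, the `fans` mapping and the sample users are hypothetical, and it assumes the submitter counts as the first "previous voter" (so the submitter's fans can already see the story).

```python
def count_in_network_votes(submitter, voters, fans):
    """Count votes cast by users who are fans of the submitter or of an earlier voter.

    voters: user names in the order they voted (submitter excluded).
    fans:   hypothetical mapping user -> set of that user's fans.
    """
    # Users who can currently see the story through the Friends interface.
    # Assumption: the submitter is treated as the first voter.
    can_see = set(fans.get(submitter, ()))
    in_network = 0
    for voter in voters:
        if voter in can_see:
            in_network += 1
        # After voting, the story becomes visible to this voter's fans as well.
        can_see |= set(fans.get(voter, ()))
    return in_network


# Made-up example data.
fans = {"alice": {"bob", "carol"}, "bob": {"dave"}}
votes = ["bob", "erin", "dave", "carol"]
print(count_in_network_votes("alice", votes, fans))
```

Applied to the first 10 voters of a story, the same routine would give a count comparable to the v10 feature used in the classification slides, under the assumptions stated above.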

Patterns of network spread But, how the story spreads through the network is different for different stories

Correlation already after the first 10 votes!

Classification: Training
- Decision tree classifier
- Features
  - Number of in-network votes
  - Number of fans of submitter
- Story interestingness
  - Yes if > 500 votes
  - No if < 500 votes
- 10-fold validation on 207 stories
  - Correctly classified 84% of instances
[Figure: learned decision tree; splits on v10 (<=4 / >4, then <=8 / >8) and fans1 (<=85 / >85); leaves yes(130/5), no(18/0), no(29/13), yes(30/8)]

Predicting interestingness
- Predict how interesting a story is based on the first 10 votes
- Test dataset
  - 900 stories submitted on June 30, 2006, but not yet promoted to the front page
  - 48 stories that were submitted by top users (rank<100) and received at least 10 votes
  - Retrieve the final number of votes received by stories
- Classification
  - Correctly classified 36 examples (TP=4, TN=32)
  - 12 errors (FP=11, FN=1)
- Looking at the promoted stories only
  - Digg prediction: 5 of 14 received more than 520 votes (Pr=0.36)
  - Our prediction: 4 of 7 received more than 520 votes (Pr=0.57)
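The figures on this slide can be re-derived with a few lines of arithmetic. The helper names below are illustrative; the counts are the ones quoted in the slide.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(hits, predicted):
    """Fraction of predicted-interesting stories that really were interesting."""
    return hits / predicted


acc = accuracy(tp=4, tn=32, fp=11, fn=1)   # 36 correct out of 48 stories
digg_pr = precision(5, 14)                 # promoted stories, Digg's own selection
our_pr = precision(4, 7)                   # promoted stories, classifier's selection
print(f"accuracy={acc:.2f} digg={digg_pr:.2f} ours={our_pr:.2f}")
# prints: accuracy=0.75 digg=0.36 ours=0.57
```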

Analysis as a tool to study Social Web
- Mathematical analysis can help understand and predict the emergent behavior of collaborative information systems
- Analysis of collective behavior on Digg
  - Dynamics of collective voting
  - Dynamics of user rank
- Analysis can aid the design of Digg
  - Study the choice of the promotion algorithm before it is implemented
  - Study the effect of design choices on system behavior: story timeliness, interestingness, user participation, incentives to join social networks, etc.

Dynamics of collective voting
Model characterizes a story by
- Interestingness r: probability a story will receive a vote when seen by a user
- Visibility
  - Visibility on the upcoming stories page: decreases with time as new stories are submitted
  - Visibility on the front page: decreases with time as new stories are promoted
  - Visibility through the Friends interface: stories friends submitted, stories friends dugg (voted on)

Mathematical model
- Mathematical model describes how the number of votes m(t) changes in time:
  $\Delta m(t) = r\,(v_f + v_u + v_i)\,\Delta t$
- Solve equation
  - Solutions parametrized by S, r
  - Other parameters estimated from data
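A minimal numerical sketch of the vote equation, using forward-Euler steps. The talk does not give the functional forms of the three visibility terms, so the decaying exponentials and all constants below are placeholders chosen only for illustration; `simulate_votes` is a hypothetical helper name.

```python
import math


def simulate_votes(r, visibility_terms, t_end, dt=1.0, m0=0.0):
    """Forward-Euler integration of  dm = r * (sum of visibility terms) * dt.

    visibility_terms: callables v(t); their shapes are assumptions here.
    Returns a list of (time, votes) pairs.
    """
    steps = int(round(t_end / dt))
    m = m0
    history = [(0.0, m)]
    for k in range(steps):
        t = k * dt
        m += r * sum(v(t) for v in visibility_terms) * dt
        history.append((t + dt, m))
    return history


# Hypothetical visibility shapes (NOT from the talk): each decays with time.
v_f = lambda t: 20.0 * math.exp(-t / 1000.0)
v_u = lambda t: 5.0 * math.exp(-t / 60.0)
v_i = lambda t: 1.0 * math.exp(-t / 600.0)

history = simulate_votes(r=0.25, visibility_terms=[v_f, v_u, v_i], t_end=5000)
print(f"votes after {history[-1][0]:.0f} min: {history[-1][1]:.0f}")
```

With decaying visibility the vote count saturates over time, and a larger r scales the whole curve upward.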

Dynamics of votes
[Figure: data versus model]
Lerman, Social Information Processing in Social News Aggregation, Internet Computing (in press), 2007

Exploring the parameter space
[Figure, left: minimum S (num reverse friends, up to 4,000) required for the story to be promoted for a given r, for a fixed promotion threshold; x-axis: interestingness r, 0 to 0.4]
[Figure, right: time taken for a story with r and S to be promoted to the front page, for a fixed promotion threshold; y-axis: promotion time (min), up to 3,000; x-axis: num reverse friends S, 0 to 1,000; curves for r=0.25 and r=0.1]

Dynamics of user influence
- Digg ranked users according to how many front page stories they had
- Model of the dynamics of user influence
  - Number of stories promoted to the front page, F
  - User's social network growth, S
[Figure: user rank (1 to 1,000, logarithmic axis) versus week (0 to 30) for six users: aaaz (user1), digitalgopher (user2), felchdonkey (user3), MrCalifornia (user4), 3monkeys (user5), dirtyfratboy (user6)]

Model of rank dynamics
- Number of stories promoted to the front page, F
  - Number of stories M submitted over Δt = week
  - User's promotion success rate scales with S(t)
  $\Delta F(t) = c\,S(t)\,M\,\Delta t$
- User's social network S grows as
  - Others discover him through new front page stories, ~ΔF
  - Others discover him through the Top Users list, ~g(F)
  $\Delta S(t) = b\,\Delta F(t) + g(F)\,\Delta t$
- Solve equations
  - Estimate b, c, g(F) from data
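The two coupled update rules can be iterated week by week. The sketch below assumes Δt = 1 week and takes g as a caller-supplied function, since its form is not given in the talk. The function name and every numeric value in the example are hypothetical.

```python
def simulate_rank_dynamics(F0, S0, M, b, c, g, weeks, dt=1.0):
    """Iterate  dF = c*S*M*dt  and  dS = b*dF + g(F)*dt  in weekly steps.

    F: stories promoted to the front page, S: size of the user's social network,
    M: stories submitted per week. g is an assumed, caller-supplied function.
    Returns a list of (week, F, S) tuples.
    """
    F, S = F0, S0
    trajectory = [(0, F, S)]
    for week in range(1, weeks + 1):
        dF = c * S * M * dt
        dS = b * dF + g(F) * dt   # uses F from the start of the week
        F += dF
        S += dS
        trajectory.append((week, F, S))
    return trajectory


# Made-up parameters, for illustration only.
traj = simulate_rank_dynamics(F0=0.0, S0=10.0, M=20, b=2.0, c=0.001,
                              g=lambda F: 1.0 if F > 5 else 0.0, weeks=35)
week, F, S = traj[-1]
print(f"week {week}: F={F:.1f}, S={S:.1f}")
```

Because new front page stories grow S and a larger S raises the promotion rate, the two quantities reinforce each other. In this sketch that feedback shows up as accelerating growth.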

Solutions 1
[Figure: F and S versus week (0 to 35), data and model panels, for digitalgopher (user2) and dirtyfratboy (user6)]
Lerman, Dynamics of Collaborative Rating of Information, in KDD/SNA workshop, 2007

Solutions 2
[Figure: F and S versus week (0 to 35), data and model panels, for aaaz (user1) and 3monkeys (user5)]
Lerman, Dynamics of Collaborative Rating of Information, in KDD/SNA workshop, 2007

Solutions 3
[Figure: F and S versus week (0 to 35), data and model panels, for felchdonkey (user3) and MrCalifornia (user4)]
Lerman, Dynamics of Collaborative Rating of Information, in KDD/SNA workshop, 2007

Learning from social tagging with: Anon Plangrasopchok

Metadata
- Metadata ("data about data") is used to facilitate the understanding, characteristics, use and management of data. [source: Wikipedia]
- Terms from a formal taxonomy are used to describe data
  - E.g., the Linnean classification system describes living organisms: Animalia > Arthropoda > Insecta > Orthoptera > Caelifera > Tetrigidae [source: Linnean Classification System]

Semantic Web
- Semantic Web attempted to impose meaning on Web data to improve information access and usability [Berners-Lee & Hendler in Scientific American, 2001]
  - Web content annotated with machine-readable metadata (from a formal taxonomy) to aid automatic information retrieval and integration
- Still unrealized in 2008
  - Too complicated: requires specialized training to be used effectively
  - Costly and time-consuming to produce
  - Variety of specialized ontologies: ontology alignment problem

Tags as alternative metadata
- Tags serve a function similar to that of metadata from a formal taxonomy
  - Tags help users organize their data
  - Facilitate searching and browsing for information
  - Describe the semantics (meaning) of content
- Tags are keywords used to describe content
  - Freely chosen by user
  - No controlled vocabulary or formal taxonomy
- Example tags: Insect, Grasshopper, Australian, Macro, Orthoptera, Brown, On leaf

Tagging and semantics
- Collective tagging of content may lead to an emergent informal classification system
  - Folksonomy: user-generated taxonomy used to categorize and retrieve web content using open-ended labels called tags
- Advantages over formal taxonomies
  - Simpler: easier cognitive process
  - Bottom-up: decentralized, emergent, scalable
  - Dynamic: adapts to changing needs and priorities
- But, since it comes from many different users,
  - Noisy: need tools to extract meaning from data
[source: Wikipedia]

Tagging is simpler than categorization
Categorization = assigning an object a single concept within a taxonomy
Rashmi Sinha 2005

Collective tagging on Delicious
[Screenshot annotations: Web source; popular tags; user tags]

Probabilistic learning
- Given a collection of documents tagged by different users
  - Use machine learning techniques to find hidden or latent topics in a collection of sources
  - Learn a hierarchy of hidden topics = folksonomy?
[Diagram: Sources, Users, Tags -> Probabilistic Model -> Hidden Topics (compressed description)]

Alternative models
- pLSA [Hofmann, in UAI 99]: aggregate all tags from all users as if having only a single user
- MWA [Wu+, in WWW 06] and ITM [Plangrasopchok & Lerman, in IIWeb 07]: take into account individual differences
[Figure: graphical models of pLSA, MWA and ITM]

The price we pay for aggregating data
Keeping track of individual variations in tag usage, rather than aggregating tags over all users, can improve learning
Navarro+ 2006

Evaluation on synthetic data
- Test how models perform when tags are ambiguous
- Created a synthetic data set with tunable noise parameter (tag ambiguity)
  - 40 resources tagged by 200 users
  - 10 topics
  - 25 unique tags
- Evaluate performance of the learning models on synthetic data
  - Find the topic distribution of resources using 3 models
  - Distance_A = distance between resources computed using the actual topic distribution
  - Distance_L = distance between resources computed using the learned topic distribution
  - Δ = Distance_L - Distance_A
- The better the learned topic distribution, the smaller the Δ
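One possible reading of the Δ measure, sketched under stated assumptions: the slide does not name the distance function or say how Δ is aggregated over resource pairs, so the code below assumes Euclidean distance between topic distributions and averages the absolute difference over all pairs. The function names and the toy distributions are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(topic_dists):
    """Euclidean distance between every pair of resources (assumed metric)."""
    return [math.dist(p, q) for p, q in combinations(topic_dists, 2)]


def delta(actual, learned):
    """Mean absolute gap between learned and actual pairwise distances (assumed aggregation)."""
    d_actual = pairwise_distances(actual)
    d_learned = pairwise_distances(learned)
    return sum(abs(l - a) for l, a in zip(d_learned, d_actual)) / len(d_actual)


# Toy topic distributions for three resources over two topics.
actual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
relabeled = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # same structure, topics swapped
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # model learned nothing

print(delta(actual, relabeled), delta(actual, collapsed))
```

Comparing distances between resources, rather than the topic vectors themselves, has the convenient property that a learned model is not penalized for numbering its topics differently from the ground truth.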

Results on synthetic data
[Figure: Δ versus noise level (more noise, more tag ambiguity); annotation: Significant < 0.01]

Information discovery
- Apply the learning framework to real data
  - Leverage user-contributed tags to discover hidden topics in a collection of Web sources
- Find sources that provide some functionality
  - Simpler goal: find sources that provide the same functionality as the seed, e.g., http://flytecomm.com
  - Improve robustness of information integration applications
  - Increase coverage of the applications

Data sets for discovery task
- Seeds
  - flytecomm: flight status
  - geocoder: coordinates of an address
  - wunderground: weather conditions
  - hotels: hotel reservations
  - whitepages: phone book
- Collected data by scraping del.icio.us
  - For each seed, retrieve the 20 popular tags
  - For each tag, retrieve other resources annotated with the same tag
  - For each resource, retrieve all tags added by users
[Diagram: Sources, Users, Tags]

Probabilistic approach
- Find hidden topics in a collection of sources, using a Probabilistic Generative Model
- Compute similarity between seed and source using hidden topics
[Diagram: Sources, Users, Tags -> Probabilistic Model -> Hidden topics (compressed description) -> Compute Source Similarity -> Similar sources (sorted)]
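The last two stages of the pipeline (compute source similarity, then sort) could look roughly like the sketch below. The talk does not say which similarity measure is used, so cosine similarity is an assumption here. The source names and topic vectors are made up, standing in for the output of a trained topic model.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_sources(seed, topics):
    """Return (source, similarity) pairs sorted from most to least similar to the seed."""
    seed_vec = topics[seed]
    scored = [(name, cosine(seed_vec, vec)) for name, vec in topics.items() if name != seed]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical topic distributions p(topic | source).
topics = {
    "seed-flight-status": [0.8, 0.1, 0.1],
    "other-flight-tracker": [0.7, 0.2, 0.1],
    "weather-site": [0.1, 0.8, 0.1],
}
for name, score in rank_sources("seed-flight-status", topics):
    print(f"{name}: {score:.3f}")
```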

Application to resource discovery

Application to resource discovery (delicious data)
Entropy of user interests p(i|u)

dataset      Entropy (SD)
geocoder     1.419 (0.278)
wunderg      1.397 (0.227)
flytecomm    1.285 (0.279)
whitepage    1.157 (0.272)
online-res   0.629 (0.41)

Higher entropy => users seem to be about equally interested in all topics
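For reference, the entropy of a user's interest distribution p(i|u) can be computed as below. The logarithm base used in the talk is not stated, so the natural logarithm is an assumption, and the two example distributions are made up.

```python
import math


def entropy(p):
    """Shannon entropy -sum p_i * ln(p_i); zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


focused = [0.97, 0.01, 0.01, 0.01]   # user interested almost only in one topic
uniform = [0.25, 0.25, 0.25, 0.25]   # user equally interested in all topics
print(f"focused: {entropy(focused):.3f}  uniform: {entropy(uniform):.3f}")
```

The uniform distribution gives the maximum value (ln of the number of topics), which matches the slide's reading that higher entropy means interests are spread evenly.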

Conclusions
- In their everyday use of Social Web sites, users create large quantities of data, which express their knowledge and opinions
  - Content: articles, media content, opinion pieces, etc.
  - Metadata: tags, ratings, discussion, social networks
  - Links between users, content, and metadata
- Social Web enables new problem-solving approaches
  - Collective problem solving: efficient, robust solutions beyond the scope of individual capabilities
  - Social information processing: use knowledge and opinions of others for own information needs

Further reading
- To see more papers on Social Web: http://www.citeulike.org/user/krisl/tag/socialweb
- To see my papers on the Social Web: http://www.citeulike.org/user/krisl/tag/mysocialweb