Cluster Analysis (see also: Segmentation)

Cluster Analysis
- Unsupervised: no target variable for training
- Partition the data into groups (clusters) so that:
  - Observations within a cluster are similar in some sense
  - Observations in different clusters are different in some sense
  - ("In some sense": that's not very specific.)
- There is no one correct answer, though there are good and bad clusters
- No method works best all the time

(Some) Applications of Clustering
- Customer segmentation: groups of customers with similar shopping or buying patterns
- Dimension reduction:
  - Cluster variables together
  - Cluster individuals together and use the cluster variable as a proxy for demographic or behavioral variables
- Image segmentation
- Gather stores with similar characteristics for sales forecasting
- Find related topics in text data
- Find communities in social networks

Methodology: Hard vs. Fuzzy Clustering
- Hard: objects can belong to only one cluster
  - k-means (PROC FASTCLUS)
  - DBSCAN
  - Hierarchical (PROC CLUSTER)
- Fuzzy: objects can belong to more than one cluster (usually with some probability)
  - Gaussian Mixture Models
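
The hard/fuzzy distinction can be illustrated with a minimal sketch. It is written in Python for illustration only (the slides themselves use SAS), and the two one-dimensional Gaussian components and their parameters are made-up values, not taken from the slides. A fuzzy method reports a membership probability for every cluster; a hard method keeps only the most probable one.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical two-component mixture: (weight, mean, standard deviation).
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

def soft_assignment(x):
    """Fuzzy clustering: probability that x belongs to each component."""
    weighted = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

def hard_assignment(x):
    """Hard clustering: index of the single most probable component."""
    probs = soft_assignment(x)
    return probs.index(max(probs))

for x in (0.0, 2.0, 3.5):
    print(x, [round(p, 3) for p in soft_assignment(x)], hard_assignment(x))
```

A point halfway between the two means (x = 2.0 here) gets probability 0.5 for each component, which is exactly the information a hard assignment throws away.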

Methodology: Hierarchical vs. Flat
- Hierarchical: clusters form a tree so you can visually see which clusters are most similar to each other.
  - Agglomerative: points start out as individual clusters, and they are combined until everything is in one cluster.
  - Divisive: all points start in the same cluster, and at each step a cluster is divided into two clusters.
- Flat: clusters are created according to some other process, usually iteratively updating cluster assignments.

Hierarchical Clustering (Agglomerative)
[Figure sequence: ten data points, A through J, are merged one pair of clusters at a time, from the first step through the final (ninth) step, when everything is in one cluster.]
We might have known that we only wanted 3 clusters, in which case we'd stop once we had 3 (here, after the seventh step).

[Figure: the same points with each merge labeled by its level, 1 through 9 (the levels of the dendrogram).]

Resulting Dendrogram
[Figure: dendrogram with leaves A through J and merge levels 1 through 9.]
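
The merging procedure walked through above can be sketched in a few lines. This is an illustrative Python sketch, not the SAS PROC CLUSTER implementation: the points are made up, single linkage is an arbitrary choice for the example, and the brute-force search over all cluster pairs is only suitable for tiny data sets.

```python
import math

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, k, linkage=single_linkage):
    """Merge the two closest clusters until only k clusters remain.

    Returns the final clusters and the merge history (the dendrogram levels).
    """
    clusters = [[p] for p in points]   # every point starts as its own cluster
    history = []
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, history

# Made-up 2-D points forming three loose groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0), (10, 1)]
clusters, history = agglomerate(points, k=3)
for c in clusters:
    print(sorted(c))
```

Calling `agglomerate(points, k=1)` instead runs the merges to the end, and `history` then holds one entry per level of the dendrogram.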

Linkages
Which clusters/points are closest to each other? How do I measure the distance between a point/cluster and a cluster?
- Single Linkage: distance between the closest points in the clusters. (Minimum Spanning Tree)
- Complete Linkage: distance between the farthest points in the clusters.
- Centroid Linkage: distance between the centroids (means) of each cluster.
- Average Linkage: average distance between all pairs of points, one from each cluster.
- Ward's Method: increase in SSE (variance) when clusters are combined.
  - Default in SAS PROC CLUSTER
  - Has been shown to be mathematically similar to centroid linkage
  - Notation: $c_i$ is the centroid of cluster $i$; the data points in cluster $i$ are $x_1, x_2, \dots, x_{N_i}$
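
As a rough illustration of the five linkages, the sketch below computes each one for two small made-up clusters. It is plain Python rather than SAS, and all function names are chosen for this example only. For Ward's method the increase in SSE equals $\frac{N_a N_b}{N_a + N_b}\lVert c_a - c_b \rVert^2$, which is the sense in which it resembles centroid linkage.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of a cluster of points."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)

def single(a, b):
    return min(math.dist(p, q) for p in a for q in b)

def complete(a, b):
    return max(math.dist(p, q) for p in a for q in b)

def average(a, b):
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_linkage(a, b):
    return math.dist(centroid(a), centroid(b))

def ward(a, b):
    """Increase in total SSE caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)

# Two made-up clusters of two points each.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
for f in (single, complete, average, centroid_linkage, ward):
    print(f.__name__, round(f(a, b), 3))
```

For these two clusters the centroids are 4 apart, so the Ward value is (2·2/4)·4² = 16, matching the direct SSE computation.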

Hierarchical Clustering Summary
- Disadvantages
  - Lacks a global objective function: only makes decisions based on local criteria.
  - Merging decisions are final. Once a point is assigned to a cluster, it stays there.
  - Computationally intensive, with large storage requirements; not good for large datasets.
  - Poor performance on noisy or high-dimensional data like text.
- Advantages
  - Lacks a global objective function: no complicated algorithm or problem with local minima.
  - Creates a hierarchy that can help choose the number of clusters and examine how those clusters relate to each other.
  - Can be used in conjunction with other, faster methods.

k-Means Clustering (PROC FASTCLUS in SAS)
- The most popular clustering algorithm
- Tries to minimize the sum of squared distances from each point to its cluster centroid (global objective function)
[Figure: two clusters, $C_1$ and $C_2$, each shown with its data points and its centroid, $c_1$ and $c_2$.]

k-Means Algorithm
1. Start with k seed points
   - Randomly initialized (most software)
   - Determined methodically (SAS PROC FASTCLUS)
2. Assign each data point to the closest seed point.
3. The seed point then represents a cluster of data.
4. Reset seed points to be the centroids of the clusters.
5. Repeat steps 2-4, updating the cluster centroids until they do not change.
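
The steps above can be sketched as follows. This is a minimal Python illustration with random seeding (the "most software" variant), not a reproduction of how PROC FASTCLUS chooses its seeds; the function names, the `seed` and `max_iter` parameters, and the data are assumptions made for the example.

```python
import math
import random

def closest(point, centroids):
    """Index of the centroid nearest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                # step 1: random seed points
    for _ in range(max_iter):
        labels = [closest(p, centroids) for p in points]        # step 2
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]  # step 3
            # step 4: move the seed to the centroid (keep it if the cluster is empty)
            new_centroids.append(mean(members) if members else centroids[i])
        if new_centroids == centroids:               # step 5: stop when stable
            break
        centroids = new_centroids
    labels = [closest(p, centroids) for p in points]
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse

# Made-up data: two tight groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels, sse = kmeans(points, k=2)
print(centroids, labels, round(sse, 3))
```

The returned `sse` is the global objective function that the algorithm tries to minimize; different values of `seed` can lead to different local minima on less clear-cut data.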

k-Means Interactive Demo
http://home.deib.polimi.it/matteucc/clustering/tutorial_html/appletkm.html
(You may have to add the site to your exceptions list on the Java Control Panel to view.)

Choice of Distance Metric
- Most distances, like Euclidean, Manhattan, or Max, will provide similar answers.
- Use cosine distance (really 1 - cos, since cosine measures similarity) for text data. This is called spherical k-means.
- Using Mahalanobis distance is essentially the Expectation-Maximization (EM) method for Gaussian Mixtures.
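
For reference, here is a small Python sketch of the distances named above. The three "documents" are made-up term-count vectors, and `max_distance` is this example's name for what the slide calls Max (also known as Chebyshev distance).

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def max_distance(u, v):
    """Max (Chebyshev) distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """1 - cos(angle between u and v); depends on direction, not length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up term counts: doc_b is doc_a repeated three times, doc_c is unrelated.
doc_a = (1, 2, 0)
doc_b = (3, 6, 0)
doc_c = (0, 1, 4)

for name, f in [("euclidean", euclidean), ("manhattan", manhattan),
                ("max", max_distance), ("cosine", cosine_distance)]:
    print(name, round(f(doc_a, doc_b), 3), round(f(doc_a, doc_c), 3))
```

The cosine distance between `doc_a` and `doc_b` is zero even though their Euclidean distance is not, which is why it suits text: a longer document with the same word proportions is treated as the same topic.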

Determining Number of Clusters (SSE)
- Try the algorithm with k = 1, 2, 3, ...
- Examine the objective function values
- Look for a place where the marginal benefit to the objective function for adding a cluster becomes small

In the example:
- k=1: objective function (SSE) is 902
- k=2: objective function (SSE) is 213
- k=3: objective function (SSE) is 193

[Figure: objective function (axis from 0 to 1000) plotted against k=1 through k=4. Elbow => k=2.]
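
The "marginal benefit" can be made explicit with a few lines of arithmetic. The sketch below is Python for illustration and uses only the three SSE values stated on the slides (the k=4 value appears only in the plot, so it is left out); `marginal_gains` is a name made up for this example.

```python
# SSE values reported on the slides for k = 1, 2, 3.
sse_by_k = {1: 902, 2: 213, 3: 193}

def marginal_gains(sse_by_k):
    """Reduction in SSE obtained by going from the previous k to each k."""
    ks = sorted(sse_by_k)
    return {k: sse_by_k[prev] - sse_by_k[k] for prev, k in zip(ks, ks[1:])}

gains = marginal_gains(sse_by_k)
for k, gain in gains.items():
    print(f"k={k}: SSE drops by {gain} ({gain / sse_by_k[1]:.1%} of the k=1 SSE)")
```

Going from one to two clusters removes 689 units of SSE, while adding a third removes only 20, which is the elbow at k=2.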

k-Means Summary
- Disadvantages
  - Dependent on initialization (initial seeds)
  - Can be sensitive to outliers
    - If this is a problem, consider k-medoids (each cluster center is an actual data point, the medoid, rather than the mean)
  - Have to input the number of clusters
  - Difficulty detecting non-spheroidal (non-globular) clusters
- Advantages
  - Modest time/storage requirements
  - It has been shown that you can terminate the method after a small number of iterations with good results
  - Good for a wide variety of data types

Cluster Validation
How do I know that my clusters are actually clusters?
- Lots of techniques/metrics have been proposed
  - Measure separation between clusters
  - Measure cohesion within clusters
- All have merit; most are difficult to interpret in the context of statistical significance.

Cluster Validation
- To establish statistical significance:
  - Show that you can't do just as well with randomized data (i.e., assume the null hypothesis of no clusters).
  - Simulate ~1000 random data sets, choosing from the distributions or ranges of your variables.
  - Cluster them with the same number of clusters.
  - Record the SSE (k-means objective function) or validity metric of choice.
  - Use this to show that your actual SSE is far better than you could expect to achieve if no clusters exist.
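
A minimal Python sketch of this simulation follows, under several assumptions that are not from the slides: the reference data are drawn uniformly over the observed range of each variable, the compact k-means helper uses a few random restarts, only 200 simulated data sets are used to keep the pure-Python run short, and the test data are made up.

```python
import math
import random

def kmeans_sse(points, k, rng, restarts=3, max_iter=50):
    """Lowest k-means SSE found over a few random restarts."""
    best = math.inf
    for _ in range(restarts):
        centroids = rng.sample(points, k)
        for _ in range(max_iter):
            groups = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
                groups[i].append(p)
            updated = [
                tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
                for i, g in enumerate(groups)
            ]
            if updated == centroids:
                break
            centroids = updated
        sse = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
        best = min(best, sse)
    return best

def simulated_p_value(points, k, n_sim=200, seed=0):
    """Share of no-cluster data sets that cluster at least as well as the real one."""
    rng = random.Random(seed)
    observed = kmeans_sse(points, k, rng)
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    as_good = 0
    for _ in range(n_sim):
        # Null hypothesis: same size and ranges, but no cluster structure.
        fake = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points]
        if kmeans_sse(fake, k, rng) <= observed:
            as_good += 1
    return observed, (as_good + 1) / (n_sim + 1)

# Made-up data: two well-separated blobs of 20 points each.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
data += [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(20)]
observed, p_value = simulated_p_value(data, k=2)
print(round(observed, 2), round(p_value, 4))
```

The `(as_good + 1) / (n_sim + 1)` form keeps the estimated p-value away from exactly zero; any other validity metric could replace the SSE in the comparison.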

Profiling Clusters
Now that we have clusters, how do we describe them?
- Use basic descriptives and hypothesis tests to show differences between clusters
- Use a decision tree to predict cluster
- SAS EM has a segment profiler node
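
As a small example of the "basic descriptives" route, the Python sketch below averages each variable within each cluster. The records and variable meanings are made up for illustration, and `profile` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: (cluster label, annual spend, store visits per month).
rows = [
    (0, 120.0, 2),
    (0, 80.0, 4),
    (1, 900.0, 10),
    (1, 1100.0, 14),
]

def profile(rows):
    """Mean of every variable within each cluster."""
    by_cluster = defaultdict(list)
    for label, *values in rows:
        by_cluster[label].append(values)
    return {label: [mean(col) for col in zip(*vals)]
            for label, vals in sorted(by_cluster.items())}

for label, means in profile(rows).items():
    print(f"cluster {label}: mean spend = {means[0]}, mean visits = {means[1]}")
```

Comparing these per-cluster means (and testing whether they differ) is what turns a cluster number into a description such as "low-spend, occasional visitors".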

Other Types of Clustering (self-study)
- DBSCAN: density-based algorithm designed to find dense areas of points. Capable of identifying noise points which do not belong to any cluster.
- Graph/Network Clustering: spectral clustering and modularity maximization. Covered in Social Network Analysis in Spring.
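
For the self-study reader, here is a compact Python sketch of the DBSCAN idea. It is a brute-force illustration on made-up points, and the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense "core" point) follow common usage rather than anything on the slides.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also core: keep expanding
    return labels

# Made-up data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Unlike k-means, the number of clusters is not an input here; it falls out of the density parameters, and the isolated point is labeled as noise instead of being forced into a cluster.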

Some Explanation of SAS's Clustering Output (SELF-STUDY)
Because it's not exceedingly easy to figure out online!

Cubic Clustering Criterion (CCC)
- Only available in SAS (to my knowledge)
- CCC > 2 means that the clustering is good
- 0 < CCC < 2 means the clustering requires examination
- If CCC is slightly negative, the risk of outliers is low
- If CCC is below about -30, the risk of outliers is high
- Should not be used with single or complete linkage, but with centroid linkage or Ward's method
- Each cluster must have more than 10 observations
Source: Tufféry, Stéphane. Data Mining and Statistics for Decision Making. Wiley, 2011.

Determining Number of Clusters with the Cubic Clustering Criterion (CCC)
- A partition into k clusters is good when we see a dip in CCC for k-1 clusters and a peak for k clusters.
- After k clusters, the CCC should show either a gradual decrease or a gradual rise (the latter happens when more isolated groups or points are present).
Source: Tufféry, Stéphane. Data Mining and Statistics for Decision Making. Wiley, 2011.

Determining Number of Clusters with the Cubic Clustering Criterion (CCC)
[Figure not reproduced. Image source: Tufféry, Stéphane. Data Mining and Statistics for Decision Making. Wiley, 2011.]

Determining Number of Clusters with the Cubic Clustering Criterion (CCC)
WARNING: Do not expect the CCC to be common knowledge outside of the SAS domain.

Overall R-Squared and Pseudo-F
These statistics draw connections between a final clustering and ANOVA.
- Total Sum of Squares (SST)
- Between Group Sum of Squares (SSB)
- Within Group Sum of Squares (SSW)
  - This is the k-means objective previously referred to as SSE.
- SST = SSB + SSW
- Because SST is fixed for a given data set, minimizing SSW => maximizing SSB.
- Overall $R^2 = SSB/SST$
- Pseudo-F: [formula truncated in the source]
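
The decomposition can be checked numerically with a short Python sketch on made-up data. Note one assumption: the pseudo-F formula is cut off in the slide text, so the code uses the standard Calinski-Harabasz form, $F = \frac{SSB/(k-1)}{SSW/(n-k)}$, which may or may not be exactly what the slide displayed.

```python
import math

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def cluster_anova(points, labels):
    """Return SST, SSB, SSW, overall R-squared, and pseudo-F for a clustering."""
    grand = mean_point(points)
    sst = sum(math.dist(p, grand) ** 2 for p in points)
    ssw = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        c = mean_point(members)                 # cluster centroid
        ssw += sum(math.dist(p, c) ** 2 for p in members)
    ssb = sst - ssw                             # from SST = SSB + SSW
    n, k = len(points), len(set(labels))
    r2 = ssb / sst
    # Assumed form (Calinski-Harabasz); the slide's own formula is truncated.
    pseudo_f = (ssb / (k - 1)) / (ssw / (n - k))
    return sst, ssb, ssw, r2, pseudo_f

# Made-up data: two tight groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
sst, ssb, ssw, r2, pseudo_f = cluster_anova(points, labels)
print(round(sst, 3), round(ssb, 3), round(ssw, 3), round(r2, 4), round(pseudo_f, 1))
```

An R-squared near 1 means the cluster centroids alone reproduce almost all of the variation in the data.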

Example: PenDigit Data
- Goal: automatic recognition of handwritten digits
- Digit database of 250 samples from 44 writers
- Subjects wrote digits in random order inside boxes of 500 by 500 tablet pixel resolution
- Spatial resampling to obtain a constant number of regularly spaced points on the trajectory
  - $(x_1, x_2)$ give the first point coordinate
  - $(x_3, x_4)$ give the second point coordinate
  - etc.

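Example: PenDigit Data
    /* k-means with 10 clusters on the 16 coordinate variables; */
    /* the OUT= data set adds a CLUSTER column to each observation */
    proc fastclus data=datasets.pendigittest maxclusters=10 out=clus;
        var x1--x16;
    run;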

Example: PenDigit Data
The first step to creating your own hierarchical dendrogram.

Example: PenDigit Data
    /* one-way ANOVA of x1 on the cluster assignment from PROC FASTCLUS */
    proc glm data=clus;
        class cluster;
        model x1 = cluster;
    run;
    quit;

Example: PenDigit Data
Essentially using the centroids as predictions and then computing R-squared.