Dimension Reduction. Why and How


The Curse of Dimensionality
As the dimensionality (i.e., the number of variables) of a space grows, data points become so spread out that the ideas of distance and density become murky. Let's explore this fact.

The Curse of Dimensionality
[Figure: Some Data in 2 Dimensions]
- Some points are close together; others are far apart.
- Max distance = 30
- Min distance = 0.02
- Max/min = 30/0.02 = 1500: the max distance is 1500 times larger than the min distance.

The Curse of Dimensionality
- Now let's generate those 500 points in 3-space, 4-space, ..., 50-space.
- We'll compute that same metric, the ratio of the maximum distance to the minimum distance.
- See how it changes as the number of dimensions grows.

The Curse: Euclidean Distance
As dimension → ∞, max distance → min distance.
The distribution of distances becomes nearly constant! All the points become nearly equidistant, even though they were randomly generated!
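The experiment described above can be sketched in a few lines of standard-library Python. This is only an illustration, not the code behind the slides: the slides do not say how the 500 points were generated, so sampling uniformly from the unit hypercube is an assumption, and the function name is made up for this sketch.

```python
import itertools
import math
import random


def max_min_distance_ratio(n_points=500, n_dims=2, seed=0):
    """Ratio of the largest to the smallest pairwise Euclidean distance."""
    rng = random.Random(seed)
    # Assumption: points are drawn uniformly from the unit hypercube.
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    distances = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return max(distances) / min(distances)


# The ratio should shrink sharply as the number of dimensions grows.
for d in (2, 3, 4, 10, 50):
    print(f"{d:>2} dimensions: max/min = {max_min_distance_ratio(n_dims=d):.1f}")
```

A ratio near 1 means the nearest and farthest pairs are almost the same distance apart, which is exactly when "nearest neighbor" stops being informative.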

The Curse: Volume of Sphere to Cube
- Here's another one.
- Imagine a sphere that sits perfectly (inscribed) inside of a cube.
- For simplicity, it's a unit cube (side length 1) and a unit-diameter sphere.
[Figure: a sphere inscribed in a cube, in 3 dimensions]

In 3 dimensions:
- Volume of sphere: $\frac{4}{3}\pi(0.5)^3 \approx 0.52$
- Volume of cube: 1
So the sphere takes up just over half of the space.

In d-space:
- Volume of hypersphere: $\frac{\pi^{d/2}}{\Gamma(d/2 + 1)}(0.5)^d$
- Volume of hypercube: 1

As $d \to \infty$, the ratio of the volume of the sphere to the volume of the cube gets closer and closer to 0. It's as if ALL of the volume of the hypercube is contained in the corners (none in the sphere, relatively speaking)!
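To see how quickly the ratio collapses, here is a small sketch that evaluates the standard closed form for the volume of a d-dimensional ball of radius 0.5. The function name is made up for illustration; because the unit hypercube has volume 1, the ball's volume is the ratio itself.

```python
import math


def sphere_to_cube_ratio(d):
    """Volume of the unit-diameter d-ball divided by the unit d-cube's volume (1)."""
    radius = 0.5
    # Standard formula: pi^(d/2) / Gamma(d/2 + 1) * r^d
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d


for d in (2, 3, 5, 10, 20, 50):
    print(f"d = {d:>2}: sphere/cube = {sphere_to_cube_ratio(d):.3e}")
```

For d = 3 this reproduces the value of roughly 0.52 quoted for the ordinary sphere.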

The Curse of Dimensionality
- No distance/similarity metric is immune to the vastness of high-dimensional space.
- One more. Let's look at the distribution (or lack thereof) of cosine similarity.
- Compute the cosine similarity between each pair of points, and divide that similarity by the maximum.

The Curse: Cosine Similarity
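The cosine-similarity version of the experiment can be sketched the same way. Again this is an assumption-laden illustration: the points are drawn uniformly from the unit hypercube (the slides do not specify the distribution), 200 points are used to keep the pure-Python run short, and the function names are hypothetical.

```python
import itertools
import math
import random


def cosine_similarity(p, q):
    """Cosine of the angle between vectors p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))


def min_normalized_cosine(n_points=200, n_dims=2, seed=0):
    """Smallest pairwise cosine similarity after dividing by the largest one."""
    rng = random.Random(seed)
    # Assumption: points are drawn uniformly from the unit hypercube.
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    sims = [cosine_similarity(p, q) for p, q in itertools.combinations(points, 2)]
    return min(sims) / max(sims)


# After normalization the largest value is 1, so a minimum close to 1
# means all pairs look almost equally similar.
for d in (2, 10, 50):
    print(f"{d:>2} dimensions: min normalized cosine = {min_normalized_cosine(n_dims=d):.3f}")
```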

When is this a problem?
- Primarily when using algorithms that rely on distance or similarity
  - Particularly for clustering and k-nearest-neighbor methods
- Secondarily on all models, due to collinearity and a desire for model simplicity.
- Computational/storage complexity can be problematic in all algorithms.

What can we do about it? Dimension Reduction

Dimension Reduction Overview

FEATURE SELECTION
- Choose subset of existing features
  - By their relationship to a target (supervised)
  - By their distribution (unsupervised)

FEATURE EXTRACTION
- Create new features
  - Often linear combinations of existing features (PCA, SVD, NMF)
  - Often chosen to be uncorrelated

Feature Selection
- Removing features manually
  - Redundant (multicollinearity/VIFs)
  - Irrelevant (text-mining stop words)
  - Poor-quality features (>50% missing values)
- Forward/Backward/Stepwise Regression
- Decision Tree
  - Variable Importance Table
  - Can change a little depending on metric
    - Gini/Entropy/Mutual Information/Chi-Square
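As one concrete instance of manual feature removal, the sketch below drops any feature with more than 50% missing values. The data layout (a list of dicts with `None` for missing), the column names, and the function name are all hypothetical choices for this illustration.

```python
def drop_sparse_features(rows, max_missing_fraction=0.5):
    """Keep only columns whose share of missing (None) values is at most the threshold."""
    columns = list(rows[0].keys())
    kept = [
        col for col in columns
        if sum(row[col] is None for row in rows) / len(rows) <= max_missing_fraction
    ]
    return [{col: row[col] for col in kept} for row in rows]


# Hypothetical toy data: "fax" is missing in 3 of 4 rows, so it is dropped.
rows = [
    {"age": 34, "income": 52000, "fax": None},
    {"age": 41, "income": None, "fax": None},
    {"age": 29, "income": 48000, "fax": "555-0100"},
    {"age": 50, "income": 61000, "fax": None},
]
cleaned = drop_sparse_features(rows)
print(list(cleaned[0].keys()))
```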

Feature Extraction: Continuous Variables
- PCA
  - Create a new set of features as linear combinations of your originals
  - These new features are ranked by variance (importance/information)
  - Use the first several PCs in place of the original features
- SVD
  - Same as PCA, except the variance interpretation is no longer valid
  - Common for text mining, since $X^T X$ is related to cosine similarity.
- Factor Analysis
  - The principal components are rotated so that our new features are more interpretable.
  - Occasionally other factor analysis algorithms, like maximum likelihood, are considered.
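To make "linear combinations ranked by variance" concrete, here is a minimal standard-library sketch that finds only the first principal component. It uses power iteration on the sample covariance matrix, which is a technique chosen for this illustration rather than anything the slides prescribe; in practice a library routine would be used. The function name and toy data are hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, explained_fraction) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumption: the all-ones start vector is not orthogonal to the answer.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v (Rayleigh quotient) versus total variance (trace).
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical toy data lying close to the line y = 2x.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
direction, explained = first_principal_component(data)
print(direction, explained)
```

Because the two toy variables are nearly collinear, a single new feature (the score along `direction`) carries almost all of the variance, which is the sense in which the first few PCs can stand in for the original features.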

Feature Extraction: Continuous Variables
- Discretization/Binning
  - While this doesn't reduce the dimensions of your data (it increases them!), it is still a form of feature extraction!
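A small sketch shows why binning increases the number of dimensions: one continuous column becomes one indicator column per bin. Equal-width bins are an assumption made here for simplicity (the slides do not name a binning rule), and the function name and ages are hypothetical.

```python
def equal_width_one_hot(values, n_bins):
    """Bin a continuous variable into equal-width intervals and one-hot encode the bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # Assumption: values are not all identical.
    encoded = []
    for value in values:
        # The maximum falls on the upper edge, so clamp it into the last bin.
        index = min(int((value - lo) / width), n_bins - 1)
        encoded.append([1 if k == index else 0 for k in range(n_bins)])
    return encoded


ages = [18, 22, 35, 41, 58, 63, 79]
for age, row in zip(ages, equal_width_one_hot(ages, 3)):
    print(age, row)
```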

Feature Extraction: Nominal Variables
- Encoding variables with numeric values.

Original Level                       New Value
Negative Checking Account Balance    -100
No checking account                  0
Balance is zero                      0
0<Balance<200                        100
200<Balance<800                      500
Balance>800                          900
Balance>800 and IncomeDD             1000
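In code, this kind of encoding is just a lookup from level to number. The sketch below copies the level labels and values from the slide; the dictionary and function names are hypothetical.

```python
# Level-to-value map taken from the slide's table.
CHECKING_BALANCE_CODES = {
    "Negative Checking Account Balance": -100,
    "No checking account": 0,
    "Balance is zero": 0,
    "0<Balance<200": 100,
    "200<Balance<800": 500,
    "Balance>800": 900,
    "Balance>800 and IncomeDD": 1000,
}


def encode_checking_balance(levels):
    """Replace each nominal level with its numeric code (KeyError on unknown levels)."""
    return [CHECKING_BALANCE_CODES[level] for level in levels]


print(encode_checking_balance(["No checking account", "0<Balance<200", "Balance>800"]))
```

The result is a single numeric column, in contrast to one-hot encoding, which would create one column per level.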

Feature Extraction: Nominal Variables
- Encoding variables with numeric values.
  - If ONE categorical variable has 100 levels, what you really have is ~100 variables.
- Correspondence analysis
  - Method similar to PCA for categorical data.
  - Uses a chi-squared table (contingency table) and chi-squared distance.
  - Can be used to get coordinates of categorical variables in a lower-dimensional space.
  - More often used as an exploratory method, potentially for binning purposes.