Probabilistic Latent Semantic Analysis Hofmann (1999)


Presenter: Mercè Vintró Ricart
February 8, 2016

Outline
- Background
  - Topic models: what are they, and why do we use them?
  - Latent Semantic Analysis (LSA)
- Methodology
  - The aspect model
  - Training the model: the EM algorithm
- Evaluation
  - Perplexity
  - Information retrieval

Background: Topic models
- What is a topic? The subject matter of a text: it captures what the text is about.
- Why do we want to extract topics? They matter for many text mining tasks: search result organization, document clustering, passage segmentation, etc.
- How do we do that? Use topic models to discover hidden topic-based patterns.

Background: Topic models
[Figure: topic examples with the labels Text, Politics, Sport, Technology, Dogs, Wolves, Images]

Background: Latent Semantic Analysis (LSA)
- A technique for extracting and representing the contextual-usage meaning of words.
- Maps high-dimensional count vectors to a lower-dimensional representation:
  1. Write the frequencies as a term-document matrix.
  2. Perform a Singular Value Decomposition (SVD) of the matrix.

Background: Latent Semantic Analysis (LSA)
1. Term-document matrix

Doc 1: I have a fluffy cat.
Doc 2: I see a fluffy dog.

        I   have   a   fluffy   cat   see   dog
Doc 1   1   1      1   1        1     0     0
Doc 2   1   0      1   1        0     1     1
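The counts in the table can be reproduced with a few lines of Python. This is a minimal sketch: the `tokenize` helper and its naive rule (strip the period, split on whitespace, keep case) are assumptions made for this two-sentence example, not something the slides specify.

```python
from collections import Counter

docs = ["I have a fluffy cat.", "I see a fluffy dog."]

def tokenize(text):
    # Naive tokenizer (an assumption): drop periods and split on whitespace.
    return text.replace(".", "").split()

tokens = [tokenize(doc) for doc in docs]

# Vocabulary in order of first appearance, which matches the column order above.
vocab = []
for doc_tokens in tokens:
    for term in doc_tokens:
        if term not in vocab:
            vocab.append(term)

# One row per document, one column per term; each entry is a raw count.
counts = []
for doc_tokens in tokens:
    term_counts = Counter(doc_tokens)
    counts.append([term_counts[term] for term in vocab])

print(vocab)
for row in counts:
    print(row)
```

With rows as documents, as in the table, `counts` is the matrix that the SVD step operates on.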

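Background: Latent Semantic Analysis (LSA)
2. Singular Value Decomposition (SVD)

$N = U \Sigma V^T$, and LSA keeps only the largest singular values: $\tilde{N} = U \tilde{\Sigma} V^T$.
- $U$: orthogonal matrix containing the left singular vectors.
- $V$: orthogonal matrix containing the right singular vectors.
- $\Sigma$: diagonal matrix containing the singular values in descending order, i.e. the square roots of the nonzero eigenvalues of $N N^T$ (equivalently, of $N^T N$).
- $\tilde{N}$: LSA approximation of $N$.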
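To make the decomposition concrete, the sketch below recovers the largest singular value and the rank-1 LSA approximation of the example matrix. It uses plain power iteration rather than a full SVD, purely to stay dependency-free; in practice one would call a library SVD routine. The helper names are hypothetical, and the all-ones starting vector is an assumption that works here because the matrix is non-negative.

```python
import math

# Counts from the two example sentences (rows: Doc 1, Doc 2; columns: terms).
N = [[1, 1, 1, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def leading_singular_triplet(M, n_iters=100):
    """Power iteration on M^T M: returns (u, sigma, v) for the largest singular value."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(n_iters):
        v = matvec(Mt, matvec(M, v))  # v <- M^T M v
        length = norm(v)
        v = [x / length for x in v]   # renormalise to unit length
    u = matvec(M, v)                  # M v = sigma * u
    sigma = norm(u)
    return [x / sigma for x in u], sigma, v

u, sigma, v = leading_singular_triplet(N)

# Rank-1 approximation: keep only the largest singular value.
rank1 = [[sigma * ui * vj for vj in v] for ui in u]

print(round(sigma ** 2, 6))  # leading eigenvalue of N N^T
for row in rank1:
    print([round(x, 3) for x in row])
```

Here $N N^T = \begin{pmatrix} 5 & 3 \\ 3 & 5 \end{pmatrix}$ has eigenvalues 8 and 2, so the largest singular value is $\sqrt{8}$. In the rank-1 approximation both documents get the same row, roughly `[1, 0.5, 1, 1, 0.5, 0.5, 0.5]`: the terms that differ between the two sentences are smoothed to 0.5 in each.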

Background: LSA and topics
- Documents with similar topical content tend to be close in the latent semantic space.
- Documents that share no terms with each other directly, but that each share many terms with a third document, are also similar in the latent semantic space.

Background: From LSA to PLSA
Strengths of LSA
- Fully automatic construction
- Representationally simple
Weaknesses of LSA
- No generative model
- Many ad hoc parameters
- Polysemous words are not handled
PLSA to the rescue!

Methodology: Probabilistic Latent Semantic Analysis (PLSA)
Aspect model
- Latent variable model.
- The data can be expressed in terms of:
  - documents and words (observed variables)
  - topics (latent variables)

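Methodology: Probabilistic Latent Semantic Analysis (PLSA)
Aspect model
- Conditional independence assumption: a document $d$ and a word $w$ are independent given the topic $z$, i.e. $P(d, w \mid z) = P(d \mid z)\, P(w \mid z)$.
- Graphical model representation of the aspect model:
  [Figure: graphical model of the aspect model]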

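Methodology: Probabilistic Latent Semantic Analysis (PLSA)
Aspect model
- Product rule: $P(d, w) = P(d)\, P(w \mid d)$
- Conditional independence assumption: $P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$
- Together: $P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)$, where
  - $P(d)$ is the probability of a document,
  - $P(w \mid z)$ is the probability of a word given a topic,
  - $P(z \mid d)$ is the probability of a topic given a document.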

Methodology: Probabilistic Latent Semantic Analysis (PLSA)
The EM algorithm
- E-step: the posterior probabilities of the latent variables, $P(z \mid d, w)$, are computed.
- M-step: the parameters are updated.
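The two steps can be sketched in plain Python. This is an illustrative sketch, not the presenter's code. It assumes the parameterization with $P(w \mid z)$ and $P(z \mid d)$, a dense count matrix with no empty documents, and a random initialization. The function name `plsa_em`, the toy `counts`, and all defaults are hypothetical.

```python
import math
import random

def plsa_em(counts, n_topics, n_iters=30, seed=0):
    """Fit P(w|z) and P(z|d) to a document-term count matrix with plain EM."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_dist(size):
        raw = [rng.random() + 0.1 for _ in range(size)]
        total = sum(raw)
        return [x / total for x in raw]

    p_w_z = [random_dist(n_words) for _ in range(n_topics)]  # p_w_z[z][w] = P(w|z)
    p_z_d = [random_dist(n_topics) for _ in range(n_docs)]   # p_z_d[d][z] = P(z|d)
    trace = []  # log-likelihood of the parameters before each update

    for _ in range(n_iters):
        acc_w = [[0.0] * n_words for _ in range(n_topics)]
        acc_d = [[0.0] * n_topics for _ in range(n_docs)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(w|z) P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                p_w_d = sum(joint)
                log_lik += n_dw * math.log(p_w_d)
                for z in range(n_topics):
                    expected = n_dw * joint[z] / p_w_d  # n(d,w) * P(z|d,w)
                    acc_w[z][w] += expected
                    acc_d[d][z] += expected
        trace.append(log_lik)
        # M-step: renormalise the expected counts into probabilities.
        for z in range(n_topics):
            mass = sum(acc_w[z]) or 1.0  # guard against a topic with no mass
            p_w_z[z] = [x / mass for x in acc_w[z]]
        for d in range(n_docs):
            mass = sum(acc_d[d])         # equals the document length n(d)
            p_z_d[d] = [x / mass for x in acc_d[d]]
    return p_w_z, p_z_d, trace

# Toy data (hypothetical): documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[2, 2, 0, 0], [3, 1, 0, 0], [0, 0, 2, 2], [0, 0, 1, 3]]
p_w_z, p_z_d, trace = plsa_em(counts, n_topics=2)
print(trace[0], trace[-1])
```

The `trace` list is a cheap sanity check: EM never decreases the log-likelihood, so the values should be non-decreasing from one iteration to the next.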

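Methodology: PLSA, relation to LSA
- The model can be equivalently parameterized by $P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)$.
- The joint probability $P(w, d)$ can be interpreted as a matrix product $P = \hat{U} \hat{\Sigma} \hat{V}^T$, where
  - $\hat{U}$ contains the document probabilities, $P(d \mid z)$,
  - $\hat{\Sigma}$ is the diagonal matrix of the prior probabilities of the topics, $P(z)$,
  - $\hat{V}$ contains the word probabilities, $P(w \mid z)$.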
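The matrix reading of the joint probability can be checked numerically. In this sketch the parameter values are made up for illustration, and `joint_matrix` is a hypothetical helper; it simply evaluates the sum over topics for every document-word pair.

```python
def joint_matrix(p_z, p_d_z, p_w_z):
    """Return P[d][w] = sum_z P(d|z) P(z) P(w|z), i.e. U diag(P(z)) V^T entry by entry."""
    n_topics = len(p_z)
    n_docs, n_words = len(p_d_z[0]), len(p_w_z[0])
    return [[sum(p_d_z[z][d] * p_z[z] * p_w_z[z][w] for z in range(n_topics))
             for w in range(n_words)]
            for d in range(n_docs)]

# Hypothetical parameters: 2 topics, 3 documents, 2 words.
p_z = [0.5, 0.5]                            # P(z)
p_d_z = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]  # p_d_z[z][d] = P(d|z)
p_w_z = [[0.75, 0.25], [0.25, 0.75]]        # p_w_z[z][w] = P(w|z)

P = joint_matrix(p_z, p_d_z, p_w_z)
for row in P:
    print(row)
print(sum(sum(row) for row in P))  # a joint distribution sums to 1
```

Unlike the SVD factors, these factors are non-negative and normalized rather than orthogonal, which is what lets every entry be read as a probability.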

Methodology: PLSA, polysemy
[Table: the most probable word stems for Topic 1 and Topic 2]
- The word stems are the 10 most probable words in the distribution $P(w \mid z)$, in descending order.
- "Segment" is identified as a polysemous word:
  - Topic 1: image region
  - Topic 2: phonetic segment

Methodology: PLSA, some limitations
- The number of parameters grows linearly with the number of training documents.
- The model is prone to overfitting (addressed by tempered EM).
- Not a well-defined generative model (addressed by Latent Dirichlet Allocation).
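Tempered EM changes only the E-step: the likelihood part of the posterior is raised to a power $\beta \le 1$, which flattens the posterior and damps overfitting. The sketch below follows the tempered posterior as given in Hofmann (1999), under the symmetric parameterization with $P(z)$, $P(d \mid z)$ and $P(w \mid z)$. The function name and the toy parameter values are assumptions.

```python
def tempered_posterior(p_z, p_d_z, p_w_z, d, w, beta):
    """P_beta(z|d,w), proportional to P(z) * (P(d|z) * P(w|z)) ** beta."""
    weights = [p_z[z] * (p_d_z[z][d] * p_w_z[z][w]) ** beta
               for z in range(len(p_z))]
    total = sum(weights)
    return [x / total for x in weights]

# Hypothetical parameters: 2 topics, 2 documents, 2 words.
p_z = [0.5, 0.5]
p_d_z = [[0.8, 0.2], [0.2, 0.8]]  # p_d_z[z][d] = P(d|z)
p_w_z = [[0.9, 0.1], [0.1, 0.9]]  # p_w_z[z][w] = P(w|z)

for beta in (1.0, 0.5, 0.0):
    print(beta, tempered_posterior(p_z, p_d_z, p_w_z, d=0, w=0, beta=beta))
```

With $\beta = 1$ this is the ordinary E-step; as $\beta$ decreases the posterior moves toward the prior $P(z)$, which it equals at $\beta = 0$.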

Evaluation: Perplexity
- Goal: compare the predictive performance of PLSA and LSA.
- Perplexity:
  - A measure commonly used in language modelling to assess the generalization performance of a model.
  - A lower perplexity indicates better performance.
- Two data sets are used:
  - MED: information retrieval test collection with 1033 documents
  - LOB: data set with noun-adjective pairs
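Perplexity is the exponential of the negative average log-probability per token: $\exp\left[-\sum_{d,w} n(d, w) \log P(w \mid d) \,/\, \sum_{d,w} n(d, w)\right]$. A minimal sketch follows, assuming the model is supplied as a dense table of $P(w \mid d)$ values. The function name and the toy counts are hypothetical.

```python
import math

def perplexity(counts, p_w_given_d):
    """exp of the negative average log P(w|d) over all observed tokens."""
    log_lik = 0.0
    n_tokens = 0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw > 0:
                log_lik += n_dw * math.log(p_w_given_d[d][w])
                n_tokens += n_dw
    return math.exp(-log_lik / n_tokens)

# Hypothetical held-out counts: 2 documents over a 4-word vocabulary.
counts = [[1, 2, 0, 1], [0, 3, 1, 0]]

# A model that spreads probability uniformly over the 4 words.
uniform = [[0.25] * 4 for _ in counts]
print(perplexity(counts, uniform))
```

A uniform model over $V$ words has perplexity $V$ (here 4), which gives a feel for the scale: a model is useful to the extent that it pushes perplexity below the vocabulary size. A model that assigns zero probability to an observed word makes `math.log` raise an error, which is one reason smoothing matters in practice.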

Evaluation: Perplexity
[Figure: perplexity results on the MED data and the LOB data, with an upper baseline]

Evaluation: Information Retrieval

Summary
- LSA can provide useful semantic insights about documents, but it lacks a sound statistical foundation.
- PLSA is a probabilistic variant of LSA, used to extract topics from a collection of documents.
- The model evaluation shows that PLSA significantly outperforms LSA.
- PLSA is prone to overfitting (mitigated by tempered EM).
- PLSA is not a well-defined generative model.

Thank you! Any questions?