Name Phylogeny. A Generative Model of String Variation. Nicholas Andrews, Jason Eisner and Mark Dredze


Name Phylogeny A Generative Model of String Variation Nicholas Andrews, Jason Eisner and Mark Dredze Department of Computer Science, Johns Hopkins University EMNLP 2012 Thursday, July 12

Outline: Introduction, Generative Model, Mutation Model, Inference, Experiments, Future Work

What's a name phylogeny?
A fragment of a name phylogeny learned by our model:
[Figure: two trees of name variants. One contains Khawaja Gharibnawaz Muinuddin Hasan Chisty; Khwaja Muin al-din Chishti; Khwaja Gharib Nawaz; Khwaja Moinuddin Chishti; Ghareeb Nawaz; Khwaja gharibnawaz; Muinuddin Chishti. The other contains Thomas Ruggles Pynchon, Jr.; Thomas Ruggles Pynchon Jr.; Thomas R. Pynchon, Jr.; Thomas R. Pynchon Jr.; Thomas Pynchon, Jr.; Thomas R. Pynchon; Thomas Pynchon Jr.]
Each edge corresponds to a mutation.

Problem: organizing disorganized collections of strings Barack Obama Sr Mitt Romney President Barack Obama Mitt rommey mitt Barack Obama Barack Barack H. Obama Willard M. Romney Barry barak Obama Romney Mr. Romney President Barrack barack obama Clinton Governor Mitt Romney Hillary Clinton clinton Billy Ms. Clinton will clinton Vice President Clinton Bill Clinton President Bill Clinton Hillary Bill bill Hillary Rodham Clinton William Clinton

Problem: organizing disorganized collections of strings Barack Obama Sr Barack Barack Obama Barack H. Obama Obama Barrack barack obama Barry barak President Barack Obama Hillary Clinton Vice President Clinton Ms. Clinton Hillary Hillary Rodham Clinton President Clinton clinton Mitt Romney Mitt rommey Romney mitt Mr. Romney Willard M. Romney Governor Mitt Romney Billy bill Bill will clinton President Bill Clinton Bill Clinton William Clinton

Challenges
- Name variation: the same entity may have different names, and a good measure of similarity between strings may not be available (this work)
- Disambiguation: different entities may have names in common, requiring the use of context to disambiguate between them
[Figure: the clustered collection of names from the previous slide.]

How does a name phylogeny help?
1. Organizes name variants into connected components (clusters)
   [Figure: the phylogeny fragment, with the Khwaja Moinuddin Chishti variants in one component and the Thomas Pynchon variants in another.]
2. Aligns names as mutations of one another
   [Figure: the same variants, each shown next to the name it was mutated from.]
3. We can estimate a mutation model given a phylogeny, and a mutation model gives a distribution over phylogenies (which suggests EM)


Generative Model
We propose a generative model for string variation explaining the reasons for name variation.
...
x_10001 = Mitt Romney
x_10002 = President Barack Obama
x_10003 = Barack Obama
x_10004 = Secretary of State Hillary Clinton
x_10005 = Hillary Clinton
x_10006 = Barack Obama
x_10007 = Clinton
x_10008 = Obama
...
What are the sources of variation for names?

Copying a previous mention
We can copy a name seen before. Given the mentions x_10001, ..., x_10008 listed earlier, a new mention might be:
x_100001 = Barack Obama
Procedure:
1. Select a previous name mention uniformly at random
2. Decide to copy it with probability 1 − µ

Mutating a previous mention
We can mutate a name seen before. Given the mentions x_10001, ..., x_10008 listed earlier, a new mention might be:
x_100001 = Mitt
Procedure:
1. Select a previous name mention uniformly at random
2. Decide to mutate it with probability µ
3. Sample a mutation from p(· | Mitt Romney)

Generating a new name
We can generate a new name. Given the mentions x_10001, ..., x_10008 listed earlier, a new mention might be:
x_100001 = Joe Biden
Procedure:
1. Select the root ♦ with probability proportional to α (a pseudocount)
2. Sample a new name from p(· | ♦), a character language model

Generative model summary
To generate the next name mention, given the k mentions generated so far:
1. Pick an existing name mention w, each with probability 1/(α + k)
   1.1 Copy w verbatim with probability 1 − µ
   1.2 Mutate w with probability µ
2. Decide to talk about a new entity with probability α/(α + k)
   2.1 Generate a name for it
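The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_mentions`, `toy_new_name`, `toy_mutate`, and `TOY_NAMES` are hypothetical names, and the two toy samplers merely stand in for the character language model and the mutation model.

```python
import random

# Hypothetical stand-in for the character language model p(. | root).
TOY_NAMES = ["Mitt Romney", "President Barack Obama", "Hillary Clinton"]


def toy_new_name(rng):
    """Sample a brand-new name (here: from a fixed toy list)."""
    return rng.choice(TOY_NAMES)


def toy_mutate(name, rng):
    """Hypothetical stand-in for the mutation model: drop one word."""
    words = name.split()
    if len(words) < 2:
        return name
    del words[rng.randrange(len(words))]
    return " ".join(words)


def generate_mentions(n, alpha, mu, new_name, mutate, rng=None):
    """Generate n name mentions following the summary above (alpha > 0)."""
    if rng is None:
        rng = random.Random()
    mentions = []
    for _ in range(n):
        k = len(mentions)
        # New entity with probability alpha / (alpha + k); always true when k == 0.
        if rng.random() < alpha / (alpha + k):
            mentions.append(new_name(rng))
            continue
        # Otherwise each existing mention is picked with probability 1 / (alpha + k).
        w = rng.choice(mentions)
        # Mutate with probability mu, copy verbatim with probability 1 - mu.
        mentions.append(mutate(w, rng) if rng.random() < mu else w)
    return mentions
```

For example, `generate_mentions(20, alpha=1.0, mu=0.1, new_name=toy_new_name, mutate=toy_mutate)` yields a list in which most mentions are verbatim copies of earlier ones, with occasional shortened variants.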

Generative model in action
Mentions generated so far:
x_10001 = Mitt Romney
x_10002 = President Barack Obama
x_10003 = Barack Obama
x_10004 = Secretary of State Hillary Clinton
x_10005 = Hillary Clinton
x_10006 = Barack Obama
x_10007 = Clinton
x_10008 = Obama
Each subsequent frame adds one mention:
x_10009 = Mitt
x_10010 = Barack
x_10011 = Barry
x_10012 = Hillary Clinton
[Figure: the corresponding phylogeny, updated with each new mention.]

A few observations
- The proposed generative model is clearly naive
- No model of discourse or of name structure
- The pseudocount α controls the likelihood of new names
- We assume a low mutation probability µ, so that most names are copied from earlier frequent names


Name variation as mutations
Mutations capture different types of name variation:
1. Transcription errors: Barack → barack
2. Misspellings: Barack → Barrack
3. Abbreviations: Barack Obama → Barack O.
4. Nicknames: Barack → Barry
5. Dropping words: Barack Obama → Barack

Mutation via probabilistic finite-state transducers
The mutation model is a probabilistic finite-state transducer with four character operations: copy, substitute, delete, insert
- Character operations are conditioned on the right input character
- Latent regions of contiguous edits
- Back-off smoothing
Transducer parameters θ determine the probability of being in different regions, and of the different character operations

Example: Mutating a name
Mr. Robert Kennedy → Mr. Bobby Kennedy
Input:  M r . _ R o b e r t _ K e n n e d y $
Output: M r . _ [ B o b b y ] _ K e n n e d y $
The example mutation proceeds step by step:
1. Copy "M r . _", then begin an edit region: M r . _ [
2. 1 substitution operation: (R, B)
3. 2 copy operations: (o, o), (b, b)
4. 3 deletion operations: (e, ε), (r, ε), (t, ε)
5. 2 insertion operations: (ε, b), (ε, y)
6. End of edit region: M r . _ [ B o b b y ]
7. Copy the remainder: _ K e n n e d y $
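To make the four character operations concrete, the Robert → Bobby example can be replayed as an explicit edit script. This is only a deterministic sketch: `apply_edits` and the tuple encoding of operations are hypothetical, and the probabilities, edit regions, and end-of-string symbol of the actual transducer are left out.

```python
def apply_edits(source, edits):
    """Apply a sequence of character operations to `source`.

    Each edit is a tuple: ("copy",), ("sub", c), ("del",) or ("ins", c).
    copy, sub and del consume one input character; ins consumes none.
    """
    out = []
    i = 0  # position in the input string
    for op, *args in edits:
        if op == "copy":
            out.append(source[i])
            i += 1
        elif op == "sub":
            out.append(args[0])
            i += 1
        elif op == "del":
            i += 1
        elif op == "ins":
            out.append(args[0])
        else:
            raise ValueError(f"unknown operation: {op}")
    if i != len(source):
        raise ValueError("edit script did not consume the whole input")
    return "".join(out)


# Edit script for the slide's example.
ROBERT_TO_BOBBY = (
    [("copy",)] * 4                          # "Mr. "
    + [("sub", "B")]                         # (R, B)
    + [("copy",), ("copy",)]                 # (o, o), (b, b)
    + [("del",), ("del",), ("del",)]         # (e, eps), (r, eps), (t, eps)
    + [("ins", "b"), ("ins", "y")]           # (eps, b), (eps, y)
    + [("copy",)] * 8                        # " Kennedy"
)
```

Calling `apply_edits("Mr. Robert Kennedy", ROBERT_TO_BOBBY)` returns `"Mr. Bobby Kennedy"`.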


Inference
Input: an unaligned corpus of names ("bag of words")
- The order in which the tokens were generated is unknown
- No inputs or outputs are known for the mutation model
[Example input: the unorganized collection of names shown earlier (Barack Obama Sr, Mitt Romney, President Barack Obama, ...).]
Output: a distribution over name phylogenies, parametrized by transducer parameters θ

Observed vs unobserved names
Could there be latent forms in the phylogeny?
[Figure: a phylogeny over Khwaja Gharib Nawaz; Khwaja Muin al-din Chishti; Ghareeb Nawaz; Khwaja gharibnawaz; Muinuddin Chishti, with unobserved vertices marked "?".]

Observed vs unobserved names
What we'd like to do:
[Figure: a phylogeny over Khawaja Gharibnawaz Muinuddin Hasan Chisty; Khwaja Muin al-din Chishti; Khwaja Gharib Nawaz; Khwaja Moinuddin Chishti; Ghareeb Nawaz; Khwaja gharibnawaz; Muinuddin Chishti.]
What we actually do:
[Figure: a phylogeny over Khwaja Gharib Nawaz; Khwaja Muin al-din Chishti; Ghareeb Nawaz; Khwaja gharibnawaz; Muinuddin Chishti.]

Type phylogeny vs token phylogeny
The generative model is over tokens (name mentions)
[Figure: token-level phylogeny over Ehud Barak; President Barack Obama; Secretary of State Hillary Clinton; Barak; Barack; Barack Obama (twice); Clinton; Hillary Clinton (twice); Barry (twice); Obama.]
But we do type-level inference for the following reasons:
1. Allows faster inference
2. Allows type-level supervision

Type phylogeny vs token phylogeny
We collapse all copy edges into a single vertex
[Figure: type-level phylogeny in which repeated tokens are merged, e.g. BARRY (2), BARACK OBAMA (2), HILLARY CLINTON (2).]
- The first token in each collapsed vertex is a mutation, and the rest are copies
- Every edge in the phylogeny now corresponds to a mutation
- Approximation: disallow multiple tokens of the same type to be derived from mutations

Scoring phylogenies
The weight of a single phylogeny is the product of the weights of its edges:
$\prod_{y \in Y} \delta(y \mid \mathrm{pa}(y))$
What should the edge weights be?
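As a small illustration of this scoring rule, the product over edges can be accumulated in log space to avoid underflow on large trees. The sketch below assumes a phylogeny stored as a child-to-parent dictionary (with `None` standing for the root) and takes the edge-weight function as an argument; `phylogeny_log_weight` and both conventions are hypothetical choices, not the authors' code.

```python
import math


def phylogeny_log_weight(parent, delta):
    """Log of the product of edge weights delta(child, parent) over all vertices.

    `parent` maps each name to its parent name, or to None for the root.
    `delta(child, parent)` must return a positive edge weight.
    """
    return sum(math.log(delta(child, pa)) for child, pa in parent.items())
```

With a toy weight function that returns 0.5 for edges from the root and 0.1 otherwise, a two-edge tree `{"Barack Obama": None, "Barack": "Barack Obama"}` gets weight 0.5 × 0.1.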

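Edge weights
New names: edges from the root ♦ to a name x:
$\delta(x \mid \diamond) = \alpha \cdot p(x \mid \diamond)$
Mutations: edges from a name x to a name y:
$\delta(y \mid x) = \mu \cdot p(y \mid x) \cdot \frac{n_x}{n_y + 1}$
Approximation: edge weights are not quite edge-factored. We are making an approximation of the form
$\mathbb{E}\left[\prod_y \delta(y \mid \mathrm{pa}(y))\right] \approx \prod_y \mathbb{E}\left[\delta(y \mid \mathrm{pa}(y))\right]$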

Inference via EM
Iterate until convergence:
1. E-step: Given θ, compute a distribution over name phylogenies
2. M-step: Re-estimate transducer parameters θ given marginal edge probabilities. This step sums over alignments for each (x, y) string pair using forward-backward
Each (x, y) pair may be viewed as a training example weighted by the marginal probability of the edge from x to y

E-step: marginalizing over latent variables
The latent variables in the model are:
1. Name phylogeny (spanning tree) relating names as inputs and/or outputs
2. Character alignments from potential input names x to output names y
We use the Matrix-Tree theorem for directed graphs (Tutte, 1984) to efficiently evaluate marginal probabilities:
1. Partition function (sum over phylogenies)
2. Edge marginals
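The two quantities listed above can be obtained from a determinant and an inverse of a Laplacian-style matrix. The following is a minimal sketch of the standard directed Matrix-Tree construction, assuming the root may have several children; `edge_marginals`, `invert_and_det`, and the input layout are hypothetical, and a plain Gauss-Jordan routine is used only to keep the example dependency-free.

```python
def invert_and_det(a):
    """Gauss-Jordan elimination with partial pivoting: returns (inverse, determinant)."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular matrix: no spanning tree has nonzero weight")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det  # a row swap flips the sign of the determinant
        p = aug[col][col]
        det *= p
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det


def edge_marginals(root_w, w):
    """Partition function and edge marginals over directed spanning trees.

    root_w[m] is the weight of the edge root -> m; w[h][m] is the weight of
    the edge h -> m between non-root vertices (diagonal entries are ignored).
    """
    n = len(root_w)
    lap = [[0.0] * n for _ in range(n)]
    for m in range(n):
        # Diagonal: total weight of all edges entering m (including from the root).
        lap[m][m] = root_w[m] + sum(w[h][m] for h in range(n) if h != m)
        for h in range(n):
            if h != m:
                lap[h][m] = -w[h][m]
    inv, z = invert_and_det(lap)  # z is the sum of weights of all spanning trees
    root_marg = [root_w[m] * inv[m][m] for m in range(n)]
    marg = [[0.0 if h == m else w[h][m] * (inv[m][m] - inv[m][h])
             for m in range(n)] for h in range(n)]
    return z, root_marg, marg
```

For every vertex m, the marginals of its incoming edges (`root_marg[m]` plus column m of `marg`) sum to 1, since each spanning tree gives m exactly one parent.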

Speed of inference
Two main slowdowns:
- The complexity of the E-step is dominated by the $O(n^3)$ (for n names) matrix inversion required to compute the edge marginals $c_{xy}$
- The M-step sums over alignments for $O(n^2)$ input-output pairs
Approximation: to speed up inference, we prune edges (set $\delta(y \mid x) = 0$) for names with no trigrams in common
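The pruning rule can be illustrated as follows. The sketch assumes that "trigrams" means case-sensitive character trigrams, which the slide does not specify; `char_trigrams` and `candidate_edges` are hypothetical names.

```python
def char_trigrams(name):
    """Set of character trigrams of a name (assumption: case-sensitive, no padding)."""
    return {name[i:i + 3] for i in range(len(name) - 2)}


def candidate_edges(names):
    """Directed pairs (x, y) that survive pruning, i.e. that share a trigram."""
    tri = {n: char_trigrams(n) for n in names}
    return [(x, y) for x in names for y in names
            if x != y and tri[x] & tri[y]]
```

For `["Barack Obama", "Barack", "Mitt Romney"]` only the two edges between "Barack Obama" and "Barack" remain; every edge touching "Mitt Romney" is pruned. Lowercasing before extracting trigrams would be one way to keep pairs such as "Barack" and "barack" connected, at the cost of fewer pruned edges.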


Data preparation
We used English Wikipedia (2011) to create lists of name variants
1. Wikipedia redirects are human-curated pages to resolve common name variants to the correct page (unambiguously)
2. We use Freebase to restrict to redirects for Person entities
3. We applied some further filters to remove redirects that were clearly not names (e.g. numbers)
4. We use LDC Gigaword to obtain a frequency for each name variant

Sample Wikipedia redirects
- Ho Chi Minh, Ho chi mihn, Ho-Chi Minh, Ho Chih-minh
- Guy Fawkes, Guy fawkes, Guy faux, Guy Falks, Guy Faukes, Guy Fawks, Guy foxe, Guy Falkes
- Nicholas II of Russia, Nikolai Aleksandrovich Romanov, Nicholas Alexandrovich of Russia, Nicolas II
- Bill Gates, Lord Billy, Bill Gates, BillGates, Billy Gates, William Gates III, William H. Gates
- William Shakespeare, William shekspere, William shakspeare, Bill Shakespear
- Bill Clinton, Billll Clinton, William Jefferson Blythe IV, Bill J. Clinton, William J Clinton

Wikipedia as supervision
We use Wikipedia name lists for supervision and evaluation
Treat page redirects as gold mutations of the page title:
- Ho Chi Minh → Ho chi mihn
- Ho Chi Minh → Ho-Chi Minh
- Ho Chi Minh → Ho Chih-minh
Each list of redirects is a cluster of names belonging to the same entity
No ambiguous names (by construction)

Experiment 1: Transducer log-likelihood
Data:
- 1500 entities (roughly 6000 names) for train
- 1500 different entities (roughly 6000 names) for test
Procedure:
At train time
1. Initialize transducer parameters θ using different amounts of supervision (up to 250 entities)
2. Run EM for 10 iterations to re-estimate θ
3. α = 1.0, µ = 0.1
At test time
1. Evaluate log-likelihood of the transducer on all gold pairs from the test set

Experiment 1: Mutation model log-likelihood
[Plot: held-out log-likelihood (y-axis) against EM iteration 0 to 9 (x-axis), one curve per supervision level: sup=0, sup=5, sup=25, sup=100, sup=250.]

Experiment 2: Ranking
Data: same as before
Procedure:
At train time
1. Estimate transducer parameters θ
2. α = 1.0, µ = 0.1
At test time
1. For each Wikipedia person page in the test set, produce a ranking of all test aliases
2. Compute mean reciprocal rank (MRR) over all such rankings
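For reference, mean reciprocal rank can be computed as below. The sketch uses the common definition based on the rank of the first correct item in each ranking; how the experiment treats pages with several correct aliases is not stated on the slide, so that choice, like the names `mean_reciprocal_rank`, `rankings`, and `gold`, is an assumption.

```python
def mean_reciprocal_rank(rankings, gold):
    """MRR over queries.

    `rankings` maps each query (e.g. a page title) to a ranked list of aliases;
    `gold` maps each query to the set of its correct aliases.
    A query whose ranking contains no correct alias contributes 0.
    """
    total = 0.0
    for query, ranked in rankings.items():
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in gold[query]:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

If one page has its first correct alias at rank 1 and another at rank 2, the MRR is (1 + 1/2) / 2 = 0.75.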

Experiment 2: Ranking
[Plot: MRR (y-axis, roughly 0.60 to 0.85) for the systems jwink, lev, sup10, semi10, unsup, sup.]
For each article name in the test corpus, produce a ranking of redirects
The rankings are evaluated using mean reciprocal rank


Future Work
- More sophisticated mutation models
  - Incorporate internal name structure
- Incorporate context in the generative story
- Cross-lingual experiments
  - Each vertex labeled with a language, allowing systematic relationships between languages
- Other potential applications
  - Derivational morphology
  - Paraphrase
  - Transliteration
  - Historical linguistics
  - Bibliographic entry variation

Experiment 3 (preliminary): Precision/Recall
Procedure:
At train time
1. Estimate transducer parameters θ using EM
2. Find the best spanning tree given θ
At test time
1. Attach held-out names to the most likely vertex in the inferred spanning tree
2. Evaluate precision and recall for the connected component
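One plausible reading of the final step is a set comparison between the connected component a held-out name lands in and the gold cluster of that name. The sketch below encodes only that reading; the exact evaluation protocol is not given on the slide, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted component against a gold cluster."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

A predicted component of three names, two of which belong to a gold cluster of four, gives precision 2/3 and recall 1/2.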

Experiment 3 (preliminary): Example attachment
[Figure: the phylogeny fragment over the Khwaja Moinuddin Chishti and Thomas Pynchon variants, with the held-out name "Thomas Ruggles" and candidate attachment edges marked "?".]
Held-out names can attach to any vertex in the tree, including the root
Attachment weights given by edge weights $\delta(y \mid x)$

Experiment 3 (preliminary): Results
[Plot: precision (y-axis) against recall (x-axis), both from 0.0 to 1.0, one curve per supervision level: 0%, 1%, 8%, 24%, 100% supervised.]