COSC-282 Big Data Analytics. Final Exam (Fall 2015) Dec 18, 2015 Duration: 120 minutes


Student Name:

Instructions: This is a closed book exam. Write your name on the first page. Answer all the questions in this exam paper. You must write clearly so that your writing can be recognized. Your answers should be thorough, complete, and relevant. Points will be deducted for irrelevant details. Start from the questions you are more confident with. Then deal with the difficult ones. Use the back of the pages if you need more room to write.

No.  Questions                Points  Your Score
1    Basic Concepts           10
2    Pair RDDs                12
3    Donald Clinton           10
4    Code Interpretation      10
5    Web Graph                18
6    Page Rank                20
7    Regression               10
8    Collaborative Filtering  10
     Total:                   100

Good luck!

Q1. Basic Concepts. [10 points]

The following are short answer questions to test basic concepts learned in the course. Please provide the definition (if there is any) for each concept, a description of the concept that explains why we use it and how we use it, and an example of it. If you are asked to compare two concepts, describe each concept, state their commonalities and differences, and provide examples. Keep your answers short and concise.

(1a) Inverse Document Frequency. [2 points]

(1b) Stop words. [2 points]

(1c) Search vs. IR Evaluation. [2 points]

(1d) Content-based Recommendation vs. Collaborative Filtering. [2 points]

(1e) Supervised Machine Learning vs. Unsupervised Machine Learning. [2 points]

Q2. Pair RDDs. [12 points]

Please write out the output of the following code at the marked locations.

2a)
    val lines = sc.parallelize(List("hello world", "this is a scala program", "to create a pair RDD", "in spark"))
    val pairs = lines.map(x => (x.split(" ")(0), x))
    pairs.filter { case (key, value) => key.length < 3 }.foreach(println)

Location A: What is the output here?

2b)
    val pairs = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
    val pairs1 = pairs.reduceByKey((x, y) => x * y)
    pairs1.foreach(println)

Location B: What is the output here?

    // the code continues from the previous page:
    val pairs2 = pairs.mapValues(x => x + 2)
    pairs2.foreach(println)

Location C: What is the output here?

    // the code continues from the previous page:
    val pairs3 = pairs.map { case (x, y) => (y + 1, x) }
    pairs3.foreach(println)

Location D: What is the output here?

    // the code continues from the previous page:
    val pairs4 = pairs.mapValues(x => (x, 1))
    pairs4.foreach(println)

Location E: What is the output here?

    // the code continues from the previous page:
    val pairs5 = pairs4.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    pairs5.foreach(println)

Location F: What is the output here?
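For readers without a Spark installation, the per-key semantics of the two pair operations used in Q2 can be sketched with plain Scala collections. This is only an approximation under stated assumptions: the helper names `reduceByKey` and `mapValues` below are local stand-ins written for this sketch (not Spark's RDD methods), the data is hypothetical and deliberately different from the exam's, and Spark makes no promise about the order in which `foreach(println)` prints elements.

```scala
object PairOpsSketch {
  // Hypothetical (key, value) pairs; NOT the exam's data.
  val pairs: Seq[(String, Int)] = Seq(("a", 1), ("b", 2), ("a", 3))

  // Local analogue of reduceByKey: combine all values that share a key using f.
  def reduceByKey(ps: Seq[(String, Int)])(f: (Int, Int) => Int): Map[String, Int] =
    ps.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Local analogue of mapValues: transform each value and leave its key untouched.
  def mapValues(ps: Seq[(String, Int)])(f: Int => Int): Seq[(String, Int)] =
    ps.map { case (k, v) => (k, f(v)) }

  def main(args: Array[String]): Unit = {
    println(reduceByKey(pairs)(_ + _)) // one entry per distinct key
    println(mapValues(pairs)(_ * 10))  // same keys, one output pair per input pair
  }
}
```

The sketch separates the two behaviours that the questions rely on: a reduction yields one result per distinct key, while a value mapping keeps every pair and never changes a key.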

Q3. Donald Clinton. [10 points]

Suppose you have two files, one.txt and two.txt.

The content of one.txt is:

Donald John Trump (born June 14, 1946) is an American real estate developer, television personality, business author and political candidate. He is the chairman and president of The Trump Organization, and the founder of Trump Entertainment Resorts.[1] Trump's career, branding efforts, lifestyle and outspoken manner helped make him a celebrity, a status amplified by the success of his NBC reality show, The Apprentice.[2][2] Trump is a son of Fred Trump, a New York City real estate developer.[9] Donald Trump worked for his father's firm, Elizabeth Trump & Son, while attending the Wharton School of the University of Pennsylvania, and officially joined the company in 1968.[10] In 1971, he was given control of the company, renaming it The Trump Organization.[11][12] Trump remains a major figure in American real estate and a celebrity for his prominent media exposures.[13] On June 16, 2015, Trump formally announced his candidacy for president of the United States in the 2016 election, seeking the nomination of the Republican Party.[14][15] Trump's early campaigning drew intense media coverage and saw him rise to high levels of popular support.[16] Since late July 2015, he has consistently been the front-runner in public opinion polls for the Republican Party nomination.[17][18][19]

The content of two.txt is:

Hillary Diane Rodham Clinton (born October 26, 1947) is an American politician who served as the 67th United States Secretary of State under President Barack Obama from 2009 to 2013. The wife of Bill Clinton, the 42nd President of the United States, she was First Lady of the United States during his tenure from 1993 to 2001. She served as a United States Senator from New York from 2001 to 2009. An Illinois native, Hillary Rodham graduated from Wellesley College in 1969, where she became the first student commencement speaker, then earned her J.D. from Yale Law School in 1973. After a stint as a Congressional legal counsel, she moved to Arkansas, marrying Bill Clinton in 1975. She co-founded Arkansas Advocates for Children and Families in 1977, became the first female chair of the Legal Services Corporation in 1978, and was named the first female partner at Rose Law Firm in 1979. The National Law Journal twice listed her as one of the hundred most influential lawyers in America. While First Lady of Arkansas from 1979 to 1981 and 1983 to 1992, she led a task force that reformed Arkansas' education system, while sitting on the board of directors of Wal-Mart, among other corporations. As First Lady of the United States, her major initiative, the Clinton health care plan of 1993, failed to reach a vote in Congress. In 1997 and 1999, she played a leading role in advocating the creation of the State Children's Health Insurance Program, the Adoption and Safe Families Act and the Foster Care Independence Act.

Write a standalone program called DonaldClinton.scala, which prints out the words that appear in both files and their word counts, with the words sorted by their counts in descending order.


Q4. Code Interpretation. [10 Points]

Explain in English what the code does at each marked location. The answers need to be related to the following formula.

[PageRank formula]

where δ is the damping factor, N is the total number of pages in the graph, Γ is the set of sink nodes, α and β are pages, and r is the page rank score. The code is given below.

    val links = // Load RDD of (page title, links) pairs
    var ranks = // Load RDD of (page title, rank) pairs
    for (i <- 0 to ITERATION) {
      val contribs = links.join(ranks).flatMap {              // Location A:
        case (title, (links, rank))                           // Location B:
          => links.map(dest => (dest, rank / links.size))     // Location C:
      }
      ranks = contribs.reduceByKey( _+_ )                     // Location D:
                      .mapValues(0.15 + 0.85 * _ )            // Location E:
    }
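The loop above can be tried out without Spark using ordinary Scala maps. The following is a minimal sketch, not the course's reference solution: the three-page graph is hypothetical, `groupBy` plus a sum stands in for `reduceByKey`, a map lookup stands in for `join`, and the sketch assumes every page has at least one in-link and one out-link, so it does not handle the sink-node set Γ mentioned in the question. The constants 0.15 and 0.85 are copied from the exam code.

```scala
object PageRankSketch {
  // Hypothetical link structure (page -> outgoing links); NOT the exam's graph.
  // Assumption: every page has at least one out-link and one in-link.
  val links: Map[String, Seq[String]] = Map(
    "a" -> Seq("b", "c"),
    "b" -> Seq("c"),
    "c" -> Seq("a")
  )

  // One update step, mirroring the body of the for loop in the exam code.
  def iterate(ranks: Map[String, Double]): Map[String, Double] = {
    // Pair every page with its current rank and emit (destination, share) tuples.
    val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
      case (title, dests) => dests.map(dest => (dest, ranks(title) / dests.size))
    }
    // Sum the shares per destination, then apply the same constants as the exam code.
    contribs.groupBy(_._1).map {
      case (dest, cs) => dest -> (0.15 + 0.85 * cs.map(_._2).sum)
    }
  }

  def main(args: Array[String]): Unit = {
    var ranks: Map[String, Double] = links.keys.map(_ -> 1.0).toMap
    for (_ <- 1 to 10) ranks = iterate(ranks)
    ranks.toSeq.sortBy(_._1).foreach(println)
  }
}
```

Because the update is a pure function from one rank map to the next, a single call to `iterate` can be checked by hand against the formula before running the full loop.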


Q5. Web Graph. [18 Points]

5a) Draw a web graph for the web pages in the following three sites W, H, and M. The graph should contain six nodes w0, w1, w2, h0, h1, and m0. Draw here:

5b) Suppose the site H is down due to power failure. Its web pages disappear from the web. What is the web graph now? Draw it. The graph should contain four nodes w0, w1, w2, and m0.

Given a page α, we define out(α) as the number of out-links from α to other pages (the out-degree) and in(α) as the number of in-links from other pages to α (the in-degree).

5c) In the web graph in 5b, what are out(w0), out(w1), out(w2), and out(m0)?

5d) What are in(w0), in(w1), in(w2), and in(m0) in 5b?
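The definitions of out(α) and in(α) translate directly into counts over an edge list. The sketch below is illustrative only: the four-node graph and the names `edges`, `outDegree`, and `inDegree` are hypothetical and unrelated to the sites W, H, and M.

```scala
object DegreeSketch {
  // Hypothetical directed edges (source -> destination); NOT the exam's web graph.
  val edges: Seq[(String, String)] = Seq(
    ("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")
  )
  val nodes: Seq[String] = Seq("a", "b", "c", "d")

  // out(x): number of links leaving page x.
  def outDegree(x: String): Int = edges.count { case (src, _) => src == x }

  // in(x): number of links arriving at page x.
  def inDegree(x: String): Int = edges.count { case (_, dst) => dst == x }

  def main(args: Array[String]): Unit =
    nodes.foreach(n => println(s"$n: out=${outDegree(n)}, in=${inDegree(n)}"))
}
```

In this hypothetical graph, node c has out-degree 0 and node d has in-degree 0, which shows that the two counts are independent of each other.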

5e) Is there any source node in the web graph in 5b? If yes, which node(s)?

5f) Is there any sink node in the web graph in 5b? If yes, which node(s)?

5g) Construct a follow matrix for the web graph in 5b.

5h) Construct a jump vector for the web graph in 5b, assuming uniform randomness.

5i) Construct a transition matrix for the web graph in 5b. Assume the damping factor is ¾.

Q6. Page Rank. [20 Points]

We can use the following formula to calculate the page rank score for a page α in a web graph.

[PageRank formula]

where δ is the damping factor, N is the total number of pages in the graph, Γ is the set of sink nodes, α and β are pages, and r is the page rank score.

6a) Write down the expressions for the page ranks of pages in the web graph in 5b.

r(w0) =
r(w1) =
r(w2) =
r(m0) =

6b) Write down the expressions for the page ranks of the four pages with δ = 3/4.

r(w0) =
r(w1) =
r(w2) =
r(m0) =

6c) We are going to use fixed point iteration to solve the equations. Assume the initial page rank value for each page is 1. That is, at iteration 1, r(w0) = r(w1) = r(w2) = r(m0) = 1. What are the page rank values at iteration 2?

r(w0) =
r(w1) =
r(w2) =
r(m0) =

Q7. Regression. [10 Points]

7a) Vector is a data structure used in Spark MLlib to store the features for data. Assume the following library is available in your code:

    import org.apache.spark.mllib.linalg.Vectors

Write Spark code to create 5 dense vectors (0.0, 1.0), (-1.0, 0.2), (1.0, 2.5), (3.0, 4.0), and (4.0, 5.0).

7b) The five vectors represent five data points. The data points can be drawn in a two-dimensional plot. Mark the data points using crosses in the following graph.

[Graph: empty axes, x from -2 to 7, y from 1 to 6]

7c) Assume a linear regression model for the data points. Draw the line that best fits all the data points in the graph.

7d) Assume a linear function for the data points:

y = θ0 + θ1 x

Based on the line and the graph, what is your best guess for the weights θ0 and θ1?
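A visual guess for θ0 and θ1 can be cross-checked numerically. The sketch below uses the closed-form ordinary least squares solution for a single feature, which is an assumption of this example; the exam only asks for an estimate read off the graph and does not prescribe a fitting method. The three points are hypothetical and are not the exam's data.

```scala
object LeastSquaresSketch {
  // Closed-form simple linear regression.
  // Returns (theta0, theta1) for the model y = theta0 + theta1 * x.
  // Assumption: the points contain at least two distinct x values.
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n = points.length.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    // Sum of cross-deviations and sum of squared x-deviations.
    val sxy = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val theta1 = sxy / sxx
    val theta0 = meanY - theta1 * meanX
    (theta0, theta1)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical points lying exactly on y = 1 + 2x; NOT the exam's data.
    val pts = Seq((0.0, 1.0), (1.0, 3.0), (2.0, 5.0))
    val (t0, t1) = fit(pts)
    println(s"theta0 = $t0, theta1 = $t1")
  }
}
```

The slope is the ratio of the cross-deviation sum to the squared x-deviation sum, and the intercept follows from the fact that the fitted line passes through the mean point.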

Q8. Collaborative Filtering. [10 points]

Consider three users u1, u2, and u3, and four movies m1, m2, m3, and m4. The users rated the movies using a 4-point scale: -1: bad, 1: fair, 2: good, and 3: great. A rating of 0 means that the user did not rate the movie. The three users' ratings for the four movies are:

u1 = (3, 0, 0, -1)
u2 = (2, -1, 0, 3)
u3 = (3, 0, 3, 1)

8a) Which user has more similar taste to u1 based on cosine similarity, u2 or u3? Show the detailed calculation process.
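Cosine similarity between two rating vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch follows, assuming equal-length, non-zero vectors; the vectors `p`, `q`, and `r` are hypothetical and are not the exam's ratings.

```scala
object CosineSketch {
  // Cosine similarity: dot(a, b) / (|a| * |b|).
  // Assumption: neither vector is all zeros (otherwise the result is NaN).
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical rating vectors; NOT the exam's data.
    val p = Seq(1.0, 0.0, 2.0)
    val q = Seq(2.0, 0.0, 4.0) // points in the same direction as p
    val r = Seq(0.0, 3.0, 0.0) // shares no rated item with p
    println(cosine(p, q))
    println(cosine(p, r))
  }
}
```

Vectors that point in the same direction score close to 1 regardless of their magnitudes, and vectors with no overlapping non-zero entries score 0.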

8b) User u1 has not yet watched movies m2 and m3. Which movie(s) are you going to recommend to user u1, based on the user-based collaborative filtering approach? Justify your answer.
