Student Name: COSC-282 Big Data Analytics Final Exam (Fall 2015) Dec 18, 2015 Duration: 120 minutes Instructions: This is a closed book exam. Write your name on the first page. Answer all the questions in this exam paper. You must write clearly so that your writing can be recognized. Your answers should be thorough, complete, and relevant. Points will be deducted for irrelevant details. Start from the questions you are more confident with. Then deal with the difficult ones. Use the back of the pages if you need more room to write. No. Questions Points Your Score 1 Basic Concepts 10 2 Pair RDDs 12 3 Donald Clinton 10 4 Code Interpretation 10 5 Web Graph 18 6 Page Rank 20 7 Regression 10 8 Collaborative Filtering 10 Total: 100 Good luck! - 1 -
Q1. Basic Concepts. [10 points] The following are short answer questions to test basic concepts learned in the course. Please provide the definition (if there is any) for each concept, description of the concept to explain why we use it, how we use it, and an example of it. If you are asked to compare two concepts, describe each concept, state their commonality and difference, and provide examples. Make your answer short and concise. (1a) Inverse Document Frequency. [2 points] (1b) Stop words. [2 points] (1c) Search vs. IR Evaluation. [2 points] - 2 -
(1d) Content-based Recommendation vs. Collaborative Filtering. [2 points] (1e) Supervised Machine Learning vs. Unsupervised Machine Learning. [2 points] - 3 -
Q2. Pair RDDs. [12 points] Please write out the output for the following codes at the marked locations. 2a) val lines = sc.parallelize(list("hello world", "this is a scala program", "to create a pair RDD", "in spark")) val pairs = lines.map(x => (x.split(" ")(0), x)) pairs.filter {case (key, value) => key.length <3}.foreach(println) Location A: What is the output here? - 4 -
2b) val pairs = sc.parallelize(list((1, 2), (3, 4), (3, 6))) val pairs1 = pairs.reducebykey((x,y) => x*y) pairs1.foreach(println) Location B: What is the output here? - 5 -
// the code continues from the previous page: val pairs2 = pairs.mapvalues(x=>x+2) pairs2.foreach(println) Location C: What is the output here? - 6 -
// the code continues from the previous page: val pairs3 = pairs.map {case (x,y) => (y+1,x)} pairs3.foreach(println) Location D: What is the output here? - 7 -
// the code continues from the previous page: val pairs4 = pairs.mapvalues(x=>(x,1)) pairs4.foreach(println) Location E: What is the output here? - 8 -
// the code continues from the previous page: val pair5 = pairs4.reducebykey((x,y) => (x._1+y._1, x._2 + y._2)) pairs5. foreach(println) Location F: What is the output here? - 9 -
Q3. Donald Clinton. [10 points] Suppose you have two files, one.txt and two.txt. The content of one.txt is: Donald John Trump (born June 14, 1946) is an American real estate developer, television personality, business author and political candidate. He is the chairman and president of The Trump Organization, and the founder of Trump Entertainment Resorts.[1] Trump's career, branding efforts, lifestyle and outspoken manner helped make him a celebrity, a status amplified by the success of his NBC reality show, The Apprentice.[2][2] Trump is a son of Fred Trump, a New York City real estate developer.[9] Donald Trump worked for his father's firm, Elizabeth Trump & Son, while attending the Wharton School of the University of Pennsylvania, and officially joined the company in 1968.[10] In 1971, he was given control of the company, renaming it The Trump Organization.[11][12] Trump remains a major figure in American real estate and a celebrity for his prominent media exposures.[13] On June 16, 2015, Trump formally announced his candidacy for president of the United States in the 2016 election, seeking the nomination of the Republican Party.[14][15] Trump's early campaigning drew intense media coverage and saw him rise to high levels of popular support.[16] Since late July 2015, he has consistently been the front-runner in public opinion polls for the Republican Party nomination.[17][18][19] The content of two.txt is: Hillary Diane Rodham Clinton (born October 26, 1947) is an American politician who served as the 67th United States Secretary of State under President Barack Obama from 2009 to 2013. The wife of Bill Clinton, the 42nd President of the United States, she was First Lady of the United States during his tenure from 1993 to 2001. She served as a United States Senator from New York from 2001 to 2009. An Illinois native, Hillary Rodham graduated from Wellesley College in 1969, where she became the first student commencement speaker, then earned her J.D. from Yale Law School in 1973. After a stint as a Congressional legal counsel, she moved to Arkansas, marrying Bill Clinton in 1975. She co-founded Arkansas Advocates for Children and Families in 1977, became the first female chair of the Legal Services Corporation in 1978, and was named the first female partner at Rose Law Firm in 1979. The National Law Journal twice listed her as one of the hundred most influential lawyers in America. While First Lady of Arkansas from 1979 to 1981 and 1983 to 1992, she led a task force that reformed Arkansas' education system, while sitting on the board of directors of Wal-Mart, among other corporations. As First Lady of the United States, her major initiative, the Clinton health care plan of 1993, failed to reach a vote in Congress. In 1997 and 1999, she played a leading role in advocating the creation of the State Children's Health Insurance Program, the Adoption and Safe Families Act and the Foster Care Independence Act. Write a standalone program called DonaldClinton.scala, which prints out the words that appear in both files and their word counts, with the words sorted by their counts in descending order. - 10 -
[Space for Q3] - 11 -
[Extra space for Q3] - 12 -
Q4. Code Interpretation. [10 Points] Explain in English what the code does at the marked location. The answers need to be related to the following formula. where δ is the damping factor, N is the total number of pages in the graph, Γ is the set of sink nodes, α and β are pages, and r is the page rank score. The codes are in the next page. - 13 -
val links = // Load RDD of (page title, links) pairs var ranks = // Load RDD of (page title, rank) pairs for (i <- 0 to ITERATION) { val contribs = links.join(ranks).flatmap { //Location A: case (title, (links, rank)) //Location B: => links.map(dest => (dest, rank / links.size)) // Location C: } ranks = contribs.reducebykey( _+_ ) // Location D:.mapValues(0.15 + 0.85 * _ ) // Location E: } - 14 -
[Extra space for Q4] - 15 -
Q5. Web Graph. [18 Points] 5a) Draw a web graph for the web pages in the following three sites W, H, and M. The graph should contain six nodes w0, w1, w2, h0, h1, and m0. Draw here: - 16 -
5b) Suppose the site H is down due to power failure. Its web pages disappear from the web. What is the web graph now? Draw it. The graph should contain four nodes w0, w1, w2, and m0. - 17 -
Given a page α, we define out(α) as the number of out-links from α to other pages (the out-degree) and in(α) as the number of in-links from other pages to α (the in-degree). 5c) In the web graph in 5b, what are out(w0), out(w1),out(w2), and out(m0)? 5d) What are in(w0), in(w1),in(w2), and in(m0) in 5b? - 18 -
5e) Is there any source node in the web graph in 5b? If yes, which node(s)? 5f) Is there any sink node in the web graph in 5b? If yes, which node(s)? - 19 -
5g) Construct a follow matrix for the web graph in 5b. - 20 -
5h) Construct a jump vector for the web graph in 5b, assuming uniform randomness. - 21 -
5i) Construct a transition matrix for the web graph in 5b. Assume the damping factor is ¾. - 22 -
Q6. Page Rank. [20 Points] We can use the following formula to calculate the page rank score for a page α in a web graph. where δ is the damping factor, N is the total number of pages in the graph, Γ is the set of sink nodes, α and β are pages, and r is the page rank score. 6a) Write down the expressions for the page ranks of pages in the web graph in 5b. r(w0) = r(w1) = r(w2) = r(m0) = - 23 -
6b) Write down the expressions for the page ranks of the four pages with δ=3/4. r(w0) = r(w1) = r(w2) = r(m0) = - 24 -
6c) We are going to use fixed point iteration to solve the equations. Assume the initial pagerank value for each page is 1. That is, at iteration 1, r(w0) = r(w1) = r(w2) =r(m0) =1. What are the page rank values at iteration 2? r(w0) = r(w1) = r(w2) = r(m0) = - 25 -
[Extra space for Q6c] - 26 -
Q7. Regression. [10 Points] 7a) Vector is a data structure used in Spark MLLib to store the features for data. Assume the following libraries are available in your code: import org.apache.spark.mllib.linalg.vectors Write Spark code to create 5 dense vectors (0.0, 1.0), (-1.0, 0.2), (1.0, 2.5), (3.0, 4.0), and (4.0, 5.0). - 27 -
7b) The five vectors represent five data points. The data points can be drawn in a twodimensional plot. Mark the data points using crosses in the following graph. y 6 5 4 3 2 1-2 -1 0 1 2 3 4 5 6 7 x 7c) Assume a linear regression model for the data points. Draw a line that fits the best to all data points in the graph. - 28 -
7d) Assume a linear function for the data points: y = θ 0 + θ 1 x Based on the line and the graph, what is your best guess for the weights θ 0 and θ 1? - 29 -
Q8. Collaborative Filtering. [10 points] Consider three users u 1, u 2, and u 3, and four movies m 1, m 2, m 3, and m 4. The users rated the movies using a 4-point scale: -1: bad, 1: fair, 2: good, and 3: great. A rating of 0 means that the user did not rate the movie. The three users ratings for the four movies are: u 1 = (3, 0, 0, -1), u 2 = (2, -1, 0, 3), u 3 = (3, 0, 3, 1) 8a) Which user has more similar taste to u 1 based on cosine similarity, u 2 or u 3? Show detailed calculation process. - 30 -
8b) User u 1 has not yet watched movies m 2 and m 3. Which movie(s) are you going to recommend to user u 1, based on the user-based collaborative filtering approach? Justify your answer. - 31 -
[extra space for Q8] - 32 -
[extra space for the paper] - 33 -