Data 100. Lecture 9: Scraping Web Technologies. Slides by: Joseph E. Gonzalez, Deb Nolan

Data 100 Lecture 9: Scraping Web Technologies Slides by: Joseph E. Gonzalez, Deb Nolan deborah_nolan@berkeley.edu hellerstein@berkeley.edu?

Last Week

Visualization Ø Tools and Technologies Ø Matplotlib and seaborn Ø Concepts Ø Length, color, and faceting Ø Kinds of visualizations Ø Bar plots, histograms, rug plots, box plots, violin plots, scatter plots, and kernel density estimators Ø Good vs bad visualizations Ø Smoothing

Kernel Density Estimates and Smoothing

Kernel Density Estimators Ø Inferential statistics estimate properties of the population Ø Draw conclusions beyond the data Descriptive Plot Inferential Plot

Ø Inferential statistics estimate properties of the population Ø Draw conclusions beyond the data Ø Suppose this data was constructed from a random sample of student grades Ø What is the probability that the next student's grade will be between 90 and 93? Ø Probability of 90 < x < 93 = area under the curve between 90 and 93 Ø No data! (annotation on the inferential plot)

Constructing KDEs Ø Non-parametric model Ø Size/complexity of the model depends on the data Ø Estimate at a query point x from the data points x_1, ..., x_n:
$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K_\alpha(x - x_i)$
Ø Gaussian kernel (commonly used → very smooth):
$K_\alpha(r) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{r^2}{2 \alpha^2}\right)$

How do you pick the kernel and bandwidth? Ø Goal: fit unseen data Ø Idea: cross-validation Ø Hide some data Ø Draw the curve Ø Check if the curve fits the hidden data (more on this later)
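The estimator on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's notebook code: the `grades` sample, the bandwidth of 2.0, and the helper names are all made up for the example, and the area is approximated with a simple midpoint rule.

```python
import math

def gaussian_kernel(r, alpha):
    """Gaussian kernel with bandwidth alpha, evaluated at offset r."""
    return math.exp(-r**2 / (2 * alpha**2)) / math.sqrt(2 * math.pi * alpha**2)

def kde(x, data, alpha):
    """Average of the kernels centered on each data point, evaluated at x."""
    return sum(gaussian_kernel(x - xi, alpha) for xi in data) / len(data)

def probability_between(lo, hi, data, alpha, steps=1000):
    """Approximate the area under the KDE on [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(kde(lo + (k + 0.5) * width, data, alpha) for k in range(steps)) * width

grades = [72, 78, 81, 85, 88, 89, 94, 97]  # made-up sample, for illustration only
p_90_93 = probability_between(90, 93, grades, alpha=2.0)
total_area = probability_between(40, 130, grades, alpha=2.0)
print(f"P(90 < x < 93) is roughly {p_90_93:.3f}; total area is roughly {total_area:.3f}")
```

The sample contains no grade between 90 and 93, yet the estimate is positive because nearby points contribute through their kernels. A larger bandwidth spreads each point's contribution further.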

[Figure: KDE of the same data at four bandwidths: 0.01, 0.05, 0.1, and 1.0]

Smoothing a Scatter Plot Ø Descriptive plot: set opacity (alpha) on markers Ø Inferential plot: kernel smoothed fit Ø Weighted combination of all y values:
$\hat{y}(x) = \frac{1}{\sum_{i=1}^{n} w_i(x)} \sum_{i=1}^{n} w_i(x) \, y_i$, with weights $w_i(x) = K_\alpha(x - x_i)$
Ø Here K_\alpha is the kernel with bandwidth \alpha (not the marker opacity)
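The weighted combination above can be sketched as follows. The data points and the bandwidth are invented for illustration, and a Gaussian kernel is assumed because it is the one the earlier slides use.

```python
import math

def gaussian_kernel(r, alpha):
    """Gaussian kernel with bandwidth alpha, evaluated at offset r."""
    return math.exp(-r**2 / (2 * alpha**2)) / math.sqrt(2 * math.pi * alpha**2)

def kernel_smooth(x, xs, ys, alpha):
    """Weighted average of all y values; points near x get larger weights."""
    weights = [gaussian_kernel(x - xi, alpha) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up scatter data
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
fit = [kernel_smooth(x, xs, ys, alpha=0.5) for x in xs]
print([round(v, 2) for v in fit])
```

Because the weights are normalized, every fitted value lies between the smallest and largest observed y.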

Dealing with Big Data (Smoothly) Ø Big n (many rows) Ø Aggregation & smoothing: compute summaries over groups/regions Ø Sliding windows, kernel density smoothing Ø Set transparency or use contour plots to avoid over-plotting Ø Big p (many columns) Ø Faceting: using additional columns to Ø Adjust shape, size, color of plot elements Ø Break data down by auxiliary dimensions (e.g., age, gender, region) Ø Create new hybrid columns that summarize multiple columns Ø Example: total sources of revenue instead of revenue by product

What's Next

This Week Ø Today (Tuesday) Ø Web technologies: getting data from the web Ø Pandas on the Web Ø JSON, XML, and HTML Ø HTTP GET and POST Ø REST APIs, Scraping Ø Thursday Ø Both Fernando and I are out → guest lecturer Sam Lau!! Ø String processing Ø Python String Library Ø Regular Expressions Ø Pandas String Manipulation

Getting Data from the Web Starting Simple with Pandas

Pandas read_html Ø Loads tables from web pages Ø Looks for <table></table> Ø Table needs to be well formatted Ø Returns a list of DataFrames Ø Can load directly from URL Ø Careful! Data changes. Save a copy with your analysis Ø You will often need to do additional transformations to prepare the data Ø Demo!

HTTP Hypertext Transfer Protocol

HTTP Hypertext Transfer Protocol Ø Created at CERN by Tim Berners-Lee in 1989 as part of the World Wide Web Ø Started as a simple request-response protocol used by web servers and browsers to access hypertext Ø Widely used to exchange data and provide services: Ø Access webpages & submit forms Ø Common API to data and services across the internet Ø Foundation of modern REST APIs (more on this soon)

Request Response Protocol: the Request (Client → Server)
Header:
GET /sp18/syllabus.html?a=1 HTTP/1.1
HOST: ds100.org
User-Agent: python-requests/2.18.4
Accept-Encoding: compress, gzip
Accept: */*
Ø First line contains: Ø a method, e.g., GET or POST Ø a URL or path to the document Ø the protocol and its version Ø Remaining header lines Ø Key-value pairs Ø Specify a range of attributes Ø Optional body Ø Send extra parameters & data
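To make the parts of a request concrete, here is a small sketch that splits the request text from this slide into its pieces. `parse_request` is a hypothetical helper written for this example. It assumes well-behaved input with `\r\n` line endings and is not a full HTTP parser.

```python
from urllib.parse import urlsplit, parse_qs

raw_request = (
    "GET /sp18/syllabus.html?a=1 HTTP/1.1\r\n"
    "HOST: ds100.org\r\n"
    "User-Agent: python-requests/2.18.4\r\n"
    "Accept-Encoding: compress, gzip\r\n"
    "Accept: */*\r\n"
    "\r\n"
)

def parse_request(text):
    """Split a raw request into method, path, query, version, headers, and body."""
    head, _, body = text.partition("\r\n\r\n")       # blank line ends the header
    first_line, *header_lines = head.split("\r\n")
    method, target, version = first_line.split(" ")  # e.g. GET, /path?a=1, HTTP/1.1
    headers = dict(line.split(": ", 1) for line in header_lines)
    parts = urlsplit(target)
    return {
        "method": method,
        "path": parts.path,
        "query": parse_qs(parts.query),  # parameters passed in the URI
        "version": version,
        "headers": headers,
        "body": body,
    }

req = parse_request(raw_request)
print(req["method"], req["path"], req["query"])
```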

Request Response Protocol: the Response (Server → Client)
Header:
HTTP/1.1 200 OK
Server: GitHub.com
Date: Mon, 12 Feb 2018 05:41:55 GMT
Last-Modified: Mon, 22 Jan 2018 06:16:48 GMT
Access-Control-Allow-Origin: *
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Body:
<!DOCTYPE html><html lang="en"> <head> <meta charset="utf-8"> <meta http-equiv="x-ua-compatible" content="ie=edge"> <title>ds100</title><meta name="author" content="uc Berkeley"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <link href="/assets/themes/bootstrap/css/bootstrap.min.css">
Ø First line contains the status code Ø Key-value pair lines Ø Data properties Ø Body Ø Returned data Ø HTML/JSON/bytes
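A response has the same shape as a request: a status line, header lines, a blank line, then the body. The sketch below splits a shortened version of the response on this slide. `parse_response` is a hypothetical helper, and the body is left uncompressed (the `Content-Encoding: gzip` header is dropped) so the example stays plain text.

```python
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Server: GitHub.com\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    '<!DOCTYPE html><html lang="en"><head><title>ds100</title></head></html>'
)

def parse_response(text):
    """Split a raw response into status code, reason, headers, and body."""
    head, _, body = text.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)  # reason may contain spaces
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), reason, headers, body

code, reason, headers, body = parse_response(raw_response)
print(code, reason, headers["Content-Type"])
```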

In a Web Browser Response Request

Request Types (Main Types) Ø GET: get information Ø Parameters passed in URI (limited to ~2000 characters) Ø /app/user_info.json?username=mejoeyg&version=now Ø Request body is typically ignored Ø Should not have side-effects (e.g., update user info) Ø Can be cached on the server, in the network, or in the browser (bookmarks) Ø Related requests: HEAD, OPTIONS Ø POST: send information Ø Parameters passed in URI and body Ø May and typically will have side-effects Ø Often used with web forms Ø Related requests: PUT, DELETE
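The difference in where parameters travel can be seen without sending anything over the network. The sketch below only builds request objects with the standard library. The host `example.com` is a placeholder, not a real service from the lecture.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"username": "mejoeyg", "version": "now"}
base = "http://example.com/app/user_info.json"  # placeholder host (assumption)

# GET: parameters are encoded into the URI.
get_req = Request(base + "?" + urlencode(params))

# POST: parameters are encoded into the request body (bytes).
post_req = Request(base, data=urlencode(params).encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to `urllib.request.urlopen` would actually send it. That step is left out here so the example has no side effects.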

Response Status Codes Ø 100s Informational: communication continuing, more input expected from client or server Ø 200s Success: e.g., 200 indicates general success Ø 300s Redirection or Conditional Action: the requested URL is located somewhere else Ø 400s Client Error Ø 404 indicates the document was not found Ø 403 indicates that the server understood the request but refuses to authorize it Ø 500s Internal Server Error or Broken Request: error on the server side
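The first digit of a status code gives its family, so a quick classification needs only integer division. The sketch below uses the standard library's `http.HTTPStatus` for the reason phrases. The `describe` helper and the family labels are written for this example.

```python
from http import HTTPStatus

FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return the code, its standard reason phrase, and its family."""
    family = FAMILIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # not a code the standard library knows about
        phrase = "unrecognized code"
    return f"{code} {phrase} ({family})"

for code in (200, 301, 403, 404, 500):
    print(describe(code))
```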

HTML, XML, and JSON data formats of the web

HTML/XML/JSON Ø Most services will exchange data in HTML, XML, or JSON Ø Why? Ø Descriptive Ø Can maintain meta-data Ø Extensible Ø Organization can change and maintain compatibility Ø Human readable Ø Useful for debugging and provides a common interface Ø Machine readable Ø A wide range of technologies for parsing

JSON: JavaScript Object Notation Ø Recursive datatype Ø Data inside of data Ø A value is one of: Ø A basic type: Ø String Ø Number Ø true/false Ø null Ø An array of values Ø A dictionary (object) of key:value pairs Ø Demo notebook
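A short sketch of the recursive structure, using Python's `json` module. The document below is invented for illustration (it reuses the plant fields from the XML slides), and `depth` is a hypothetical helper that counts how many levels of containers are nested.

```python
import json

text = """
{
  "id": "a",
  "zone": 4,
  "light": "mostly Shady",
  "indoor": false,
  "notes": null,
  "sources": [{"class": "new", "source": 2}]
}
"""
doc = json.loads(text)  # objects become dicts, arrays become lists

def depth(value):
    """Nesting depth: 0 for basic types, 1 + deepest child for containers."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0

print(type(doc).__name__, depth(doc), doc["sources"][0]["class"])
```

JSON `true`/`false` and `null` map to Python `True`/`False` and `None`.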

XML and HTML: eXtensible Markup Language

XML is a standard for semantic, hierarchical representation of data

Syntax: Element / Node Ø The basic unit of XML code is called an element or node Ø Each node has a start tag and an end tag: <zone>4</zone> Ø Here <zone> is the start tag, 4 is the content, and </zone> is the end tag

Syntax: Nesting Ø A node may contain other nodes (children) in addition to plain text content
<plant>
  <zone>4</zone>
  <light>mostly Shady</light>
</plant>
Ø The content between the <plant> start tag and the </plant> end tag consists of two nodes Ø Indentation is not needed; it simply shows the nesting

Syntax: Empty Nodes Ø Nodes may be empty
<plant>
  <zone></zone>
  <light/>
</plant>
Ø The <zone> and <light> nodes are both empty Ø Both formats are acceptable

Syntax: Attributes Ø Nodes may have attributes (and attribute values)
<plant id='a'>
  <zone></zone>
  <light source="2" class="new"/>
</plant>
Ø The attribute named id has a value of a Ø The empty <light> node has two attributes: source and class

Syntax: Comments Ø Comments can appear anywhere Ø Two comments:
<plant>
  <!-- elem with content -->
  <zone>4 <!-- a second comment --></zone>
  <light>mostly Shady</light>
</plant>
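The syntax pieces from these slides (nested nodes, an empty node, attributes, a comment) can be read back with Python's `xml.etree.ElementTree`. This is a minimal sketch using a document assembled from the slide fragments. The lecture's own demo may use different tools.

```python
import xml.etree.ElementTree as ET

xml_text = """
<plant id='a'>
  <!-- elem with content -->
  <zone>4</zone>
  <light source="2" class="new"/>
</plant>
"""
root = ET.fromstring(xml_text)

print(root.tag, root.attrib)             # the root node and its attributes
print([child.tag for child in root])     # child nodes, in document order
print(root.find("zone").text)            # text content of <zone>
print(root.find("light").attrib)         # attributes of the empty <light/> node
```

The default parser discards comments, so only `zone` and `light` appear as children.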

Well-formed XML Ø An element must have both an opening and a closing tag. However, if it is empty, then it can be of the form <tagname/>. Ø Tags must be properly nested: Ø Bad!: <plant><kind></plant></kind> Ø Tag names are case-sensitive Ø No spaces are allowed between < and the tag name. Ø Tag names must begin with a letter or underscore and may contain letters, digits, hyphens, underscores, and periods.

Well-formed XML: Ø All attribute values must appear in quotes, in the form name="value" Ø Isolated markup characters must be specified via entity references: < is specified by &lt; and > is specified by &gt;. Ø All XML documents must have one root node that contains all the other nodes.
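One way to check these rules in practice is to hand the text to a parser and see whether it raises an error. The sketch below uses `xml.etree.ElementTree` for the check and `xml.sax.saxutils.escape` for entity references. `is_well_formed` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text):
    """Return True if the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

examples = {
    "properly nested": "<plant><kind></kind></plant>",
    "badly nested": "<plant><kind></plant></kind>",
    "case mismatch": "<Plant></plant>",
    "unquoted attribute": "<plant id=a/>",
    "two root nodes": "<plant/><plant/>",
}
for label, text in examples.items():
    print(f"{label}: {is_well_formed(text)}")

# Isolated markup characters are replaced by entity references.
print(escape("zone < 5 and zone > 3"))
```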

XHTML: Extensible Hypertext Markup Language Ø HTML is an XML-like structure → it pre-dates XML Ø HTML is often not well-formed, which makes it difficult to parse and locate content Ø Special parsers fix the HTML to make it well-formed Ø Results in even worse HTML Ø XHTML was introduced to bridge HTML and XML Ø Adopted by many webpages Ø Can be easily parsed and queried by XML tools

Example of well-formed XHTML

DOM: Document Object Model Ø Treat XML and HTML as a tree Ø Fits XML and well-formed HTML Ø Visual containment → children Ø Manipulated dynamically using JavaScript Ø The HTML DOM and the actual DOM the browser shows may differ (substantially) Ø Parsing in Python → Selenium + Headless Chrome (out of scope)

Tree terminology Ø There is only one root (AKA document node) in the tree, and all other nodes are contained within it. Ø We think of these other nodes as descendants of the root node. Ø We use the language of a family tree to refer to relationships between nodes. Ø parents, children, siblings, ancestors, descendants Ø The terminal nodes in a tree are also known as leaf nodes. Content always falls in a leaf node.
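The family-tree vocabulary maps directly onto code. The sketch below uses `xml.etree.ElementTree` on a small made-up document. The `<care>` and `<water/>` nodes are invented to give the tree a second level. Note one difference from the DOM: ElementTree stores text content on the element itself rather than in a separate leaf node, so the leaves here are elements.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<plant><zone>4</zone><care><light>mostly Shady</light><water/></care></plant>"
)

def leaves(node):
    """Tags of the terminal nodes (nodes with no children) under node."""
    children = list(node)
    if not children:
        return [node.tag]
    return [tag for child in children for tag in leaves(child)]

# ElementTree has no parent pointer, so build a child -> parent lookup.
parents = {child: parent for parent in root.iter() for child in parent}

light = root.find("care/light")
print("descendants of root:", [n.tag for n in root.iter()][1:])
print("parent of light:", parents[light].tag)
print("siblings of light:", [c.tag for c in parents[light] if c is not light])
print("leaf nodes:", leaves(root))
```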

HTML trees: a few additional rules Ø Typically organized around <div> </div> elements Ø Hyperlinks: <a href="uri">Link Text</a> Ø The id attribute: unique key to identify an HTML node Ø Poorly written HTML → not always unique Ø Older web pages will contain forms:
<form action="/submit_comment.php" method="post">
  <input type="text" name="comment" value="blank" />
  <input type="submit" value="submit" />
</form>
Ø See notebook for demo on working with forms
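As a rough illustration of pulling links and form fields out of a page, the sketch below uses the standard library's `html.parser`. The page text is assembled from the fragments on this slide plus a made-up link, and `LinkAndFormParser` is a hypothetical class written for this example. The lecture notebook may use a different library.

```python
from html.parser import HTMLParser

class LinkAndFormParser(HTMLParser):
    """Collect hyperlink targets and the named inputs of each form."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "form":
            self.forms.append({
                "action": attrs.get("action"),
                "method": attrs.get("method"),
                "inputs": {},
            })
        elif tag == "input" and self.forms and "name" in attrs:
            # Only named inputs are recorded with the enclosing form.
            self.forms[-1]["inputs"][attrs["name"]] = attrs.get("value")

page = """
<div id="main">
  <a href="/sp18/syllabus.html">Syllabus</a>
  <form action="/submit_comment.php" method="post">
    <input type="text" name="comment" value="blank" />
    <input type="submit" value="submit" />
  </form>
</div>
"""
parser = LinkAndFormParser()
parser.feed(page)
print(parser.links)
print(parser.forms)
```

The form's `action` and `method`, together with its named inputs, describe the POST request that submitting the form would send.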

Which files are broken? http://bit.ly/ds100-sp18-xml filec.xml FileA.json FileB.json filed.xml

Next lecture: Regex, starring Sam Lau. We will finish REST and HTTP on Tuesday.