LobbyView: Firm-level Lobbying & Congressional Bills Database

In Song Kim
August 30, 2018

Abstract

A vast literature demonstrates the significance for policymaking of lobbying by special interest groups. Yet, empirical studies of political representation have been limited by the difficulty of observing a direct connection between politicians and interest groups. This article introduces LobbyView, a comprehensive lobbying database that is based on the universe of lobbying reports filed under the Lobbying Disclosure Act of 1995. LobbyView bridges two distinct observable political behaviors with regard to congressional bills: (1) sponsorship by politicians, and (2) reported lobbying by interest groups. It also allows researchers to identify political actors and their lobbying activities based on standardized firm- and industry-level identifiers to facilitate systemic research on lobbying and merging with external datasets. Finally, we develop an API (Application Programming Interface) that enables researchers to bulk download the massive amount of unstructured data in standard formats so that they can conduct further analyses using preferred statistical software.

Financial support from the National Science Foundation is acknowledged (SES-1264090 and SES-1725235).

Author: Associate Professor, Department of Political Science, Massachusetts Institute of Technology, Cambridge, MA, 02139. Email: insong@mit.edu, URL: http://web.mit.edu/insong/www/

1 Introduction

This document currently contains only some technical details of the database. We will soon update this paper to provide a full description of the methods used for constructing the database. Specifically, we will show (1) how we disambiguate interest group names using natural language processing (NLP) as well as collaborative filtering, (2) how the complex relational structure is stored using PostgreSQL and Elasticsearch, and (3) the scalability of our methods for bill number and bill title matching described in Section 2. Moreover, we will conduct several descriptive and statistical analyses to demonstrate the scope and quality of the lobbying database. Until then, we refer readers to the following two papers for an introduction to LobbyView.

References

Kim, In Song. 2017. Political Cleavages within Industry: Firm-level Lobbying for Trade Liberalization. American Political Science Review 111 (1): 1-20.

Kim, In Song, and Dmitriy Kunisky. 2018. Mapping Political Communities: A Statistical Analysis of Lobbying Networks in Legislative Politics. Working paper available at http://web.mit.edu/insong/www/pdf/network.pdf.

2 Identifying Bills and Missing Congress Numbers

Identifying congressional bills in lobbying reports is difficult because bill numbers are repeated across Congresses and often appear in lobbying reports without any Congress number attached. Using the report filing year to guess the Congress often leads to erroneous matches, because reports filed at the beginning of a new Congress tend to include disclosures of lobbying activities from the previous year (and therefore, if a new Congress has begun recently, from the previous Congress as well). For example, consider the following lobbying report filed by Google, Inc. in 2013. It reads:

    Monitor legislation regarding online privacy including Safe Data Act (H.R. 2577, S. 1207) and Do not track proposals (H.R. 654). Monitor any Congressional or Administration efforts to impose privacy laws on search engines. Monitor Spectrum acts (S. 911, H.R. 2482).

Figure 1: First Quarter Report by Google, Inc. in 2013

A naive guess would be that H.R. 2577 refers to a bill from the 113th Congress, because the report was filed in 2013. However, it is clear from the report that this is a bill from the 112th Congress, the SAFE Data Act. We use the following strategies to mitigate this problem and correctly identify Congress numbers under various circumstances.

1. Bill Number Search: We first identify bill numbers (e.g., H.R. 2577 above) using regular expression search in the report text. In the example in Figure 1, our algorithm would identify bill numbers H.R. 2577, S. 1207, H.R. 654, S. 911, and H.R. 2482. Note that all of these bills are from the 112th Congress rather than the 113th.

2. Congress Identification: Given a bill number found in a specific issue text (a section of the lobbying report), we attempt to identify the most likely Congress to which that bill belongs using the text around the bill number. We consider a range of candidate Congresses extending backwards from the Congress containing the year that the lobbying report was filed. By default, we consider three Congresses, namely that Congress and the two preceding it; in the above example, therefore, we would consider the 113th, 112th, and 111th Congresses. We then retrieve the bills having the same number as the given bill from each of these Congresses (omitting the Congresses that do not have a bill of that number), and compute a bag-of-words representation (after a tokenization and stopword filtering pipeline) of each of those bills, producing vectors v_1, ..., v_n representing the n candidate bills. We also compute the same representation of the text around the mention of the bill number in the lobbying report, producing a vector w representing that text. We then choose the Congress number by maximizing the cosine similarity between the v_i and w, choosing the bill with index i^* given by

    i^* = \operatorname*{argmax}_{1 \le i \le n} \frac{v_i \cdot w}{\lVert v_i \rVert \, \lVert w \rVert}.    (1)

If no bill having the same number exists in the entire range of Congresses we consider, we simply guess that the bill comes from the Congress of the year the lobbying report was filed.

3. Congress Propagation: If we successfully find a match for a Congress, it may be propagated to the other bills mentioned in the lobbying report, since lobbying reports are filed quarterly and therefore almost always mention legislation from a single Congress. If different bills in a lobbying report disagree on the best-matching Congress, a majority vote may be taken, but this rarely occurs in practice.

4. Bill Title Search: Bills are sometimes only referred to by titles or alternate names. To account for this, we clean and tokenize the specific issue sections of the lobbying report, and perform a text matching operation against a table of bill titles. For instance, this operation would identify Safe Data Act in our previous example, even if the bill number H.R. 2577 were not mentioned.

5. Bill Range Expansion: It is also common for bills with nearby numbers to be related, and for lobbying reports to refer to ranges of bills when lobbying on all of them at once. For instance, a lobbying report filed by Mattel, Inc. in 2002 contains the following text:

    H.R.3009, Trade Act of 2002.
    Certain miscellaneous tariff bills to suspend the rates of duty on certain toy-related articles (H.R.4182-4186; S.2099-2103).
    WTO market access negotiations for non-agricultural products
    Port and border security measures

Figure 2: Midyear Report by Mattel, Inc. in 2002

Therefore, if we find two bill numbers that are close (by default, we take this to mean that they share the same prefix and their numbers differ by at most 10), then we consider all other bills with numbers in between as also being lobbied in the same report. For instance, the pattern H.R. 4182-4186 in the excerpt shown in Figure 2 would be expanded into bills H.R. 4182, H.R. 4183, H.R. 4184, H.R. 4185, and H.R. 4186, all of which we would consider lobbied on by Mattel, Inc.

3 Technical Details

This document concerns the code in the /trade/code/database directory of our repository, which sets up and provides access to a system of databases (running on SQLite and the Whoosh text indexing library) that store relationships between bills, their lobbiers, and various other related pieces of data.

3.1 Dependencies

The table below summarizes the packages on which the database depends, along with a short description of the functionality each provides (further information and documentation can be found on their PyPI (Python Package Index) pages). The packages are roughly grouped by their function (database, parsing, etc.). For the purposes of actual deployment, the file that manages these packages is database/requirements.

Package          Description

SQLAlchemy       Basic Python bindings and model representations for SQL-type
                 databases.

Elixir           Higher-level abstractions for dealing with SQL-type databases,
                 extending the functionality that SQLAlchemy makes available.

Whoosh           A library for creating full-text index databases that are
                 searchable with reasonable efficiency (if better efficiency is
                 needed in the future, there are non-Python packages that can do
                 a better job). SQL is generally not good for full-text search,
                 hence the need for this package for the bill CRS summaries and
                 lobbying report specific issue texts.

BeautifulSoup    Convenient library for parsing XML and HTML data into Python
                 objects, although when speed becomes an issue there are faster
                 but less convenient (and more code-verbose) alternatives, such
                 as xml.etree in the core Python library.

nltk             The natural language toolkit for Python, providing tools for
                 tokenizing and statistically analyzing English-language texts.

path.py          Convenience tools to make dealing with the filesystem easier
                 from Python.

python-dateutil  Convenience tools for dealing with datetimes and time ranges.

Before going any further, you will need to install all of these packages. This can be done through the REQUIREMENTS file by running the following command in the shell (you will need the pip utility for Python package management):

    > pip install -r REQUIREMENTS

There are also a few subpackages to install for the nltk package. In a Python interpreter, run the following:

    >>> import nltk
    >>> nltk.download()

That will open an interface for downloading extras/packages for nltk. Go to the Corpora tab and install the packages stopwords and wordnet. If no errors arise during this installation procedure, it is safe to proceed to the next steps.
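With the dependencies in place, the bill identification strategies of Section 2 can be illustrated in a few lines. The following is a minimal sketch and not the code in the repository: the regular expression, the helper names (find_bill_numbers, best_congress), and the candidate texts are assumptions made for illustration. It only expands ranges written explicitly as A-B, and plain lowercase word splitting stands in for the nltk tokenization and stopword pipeline.

```python
import math
import re
from collections import Counter

# Hypothetical pattern: chamber prefix ("H.R." or "S."), optional space,
# a number, and an optional "-NNNN" suffix marking a range of bills.
BILL_RE = re.compile(r"\b(H\.R\.|S\.)\s*(\d+)(?:-(\d+))?")


def find_bill_numbers(text, max_gap=10):
    """Return (chamber, number) pairs, expanding ranges such as 4182-4186."""
    bills = []
    for prefix, start, end in BILL_RE.findall(text):
        first = int(start)
        last = int(end) if end else first
        if not 0 <= last - first <= max_gap:
            last = first  # ignore implausibly wide or reversed ranges
        chamber = prefix.replace(".", "")
        bills.extend((chamber, n) for n in range(first, last + 1))
    return bills


def bag_of_words(text):
    """Word counts over lowercase alphabetic tokens (no stopword filtering)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(v, w):
    """Cosine similarity of two Counter vectors; 0.0 if either is empty."""
    dot = sum(count * w[term] for term, count in v.items() if term in w)
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    norm_w = math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0


def best_congress(context, candidates):
    """candidates maps Congress number -> bill text; pick the closest one."""
    w = bag_of_words(context)
    return max(candidates, key=lambda c: cosine(bag_of_words(candidates[c]), w))


report = ("Monitor legislation regarding online privacy including "
          "Safe Data Act (H.R. 2577, S. 1207)")
# Made-up candidate texts for illustration only; not real bill summaries.
candidates = {
    113: "making appropriations for transportation and housing programs",
    112: "safe data act to protect consumer privacy and secure online data",
}
print(find_bill_numbers(report))
print(best_congress(report, candidates))
```

The same find_bill_numbers helper applied to the string "H.R.4182-4186" yields the five House bills from 4182 through 4186, mirroring the range expansion step.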

4 Getting Started

4.1 Configuration

All of the necessary configuration for the database system can be done by modifying the variables in general.py. The variables are listed below along with the effect that they have on the database creation process. Realistically speaking, to get things working locally you should only need to change DATA_DIR to something that is not Della-specific. Everything else should be (more or less) good to go, assuming no large repository reorganizations have happened.

Variable                  Description

TEST_MODE                 Intended as a testing mode for bill detection and
                          related algorithms, but this is not yet complete, so
                          avoid setting this to True before taking a look at
                          the testing code.

CONGRESSES                The range of congresses that all processes will be
                          concerned with (note that in Python, the range(x, y)
                          syntax gives the numbers x, x + 1, ..., y - 1, not
                          including y).

OUTPUT_DIR                The directory for generic outputs of analytics
                          scripts.

ROOT_DIR                  The database directory (these directories are given
                          relative to the trade/code directory).

DATA_DIR                  The directory used for storing databases, which can
                          be totally separate from the code directories. For
                          instance, on Della it is useful to put this under
                          /tigress since it requires lots of storage space.

LOBBY_REPO_DIR            The location of the lobby repository (parallel to the
                          trade repository in the current setup).

CLIENT_NAME_MATCHES_FILE  The file containing the client name filtering matches
                          (i.e. the output of Josh's script).

4.2 Initialization

Installing the databases is (or should be) very easy: go to the trade/code directory and run the command sudo python -m database.setup. This will probably take a very long time to run from scratch, possibly up to several days.
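As an illustration of the configuration described in Section 4.1, a local general.py might set a few of the variables along the following lines. Every value shown is a placeholder assumption (the paths and the Congress range are invented for this example), not the repository's actual configuration, and the remaining variables are left out.

```python
import os

# Placeholder values for illustration; adjust to the local setup.
TEST_MODE = False  # leave off until the testing code has been reviewed

# range(x, y) stops before y, so this covers Congresses 106 through 112.
CONGRESSES = range(106, 113)

# Hypothetical storage location outside the code directories.
DATA_DIR = os.path.join(os.path.expanduser("~"), "lobbyview_data")

print(list(CONGRESSES))
print(DATA_DIR)
```

Writing the range out with list(...) is a quick way to confirm that the upper bound is excluded before starting a setup run that may take days.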

5 Working with the Database

5.1 Starting and Ending Sessions

To get started with a database session, navigate to the trade/code directory and run the following sequence of commands in the Python interpreter:

    >>> import database.general
    >>> from database.bills.models import *
    >>> from database.lda.models import *
    >>> from database.firms.models import *
    >>> database.general.init_db()

When finished, the following command will safely close the database without permanently writing any changes that may have been made to the data:

    >>> database.general.close_db(write=False)

If you are in fact correcting errors in the database or otherwise performing operations whose changes you would like permanently registered, pass the argument write=True instead.

5.2 Basic Database Objects and Relationships

To see what objects are stored in the various databases, look at the models.py files in the directories bills, lda, and firms (the imports described above are what give you access to all of these classes). Each class has some fields, which are available to access on any object of the class. The system is best clarified with an example: consider the case of bill objects, which are represented by the Bill class. This class, like any other, has an id field that is its primary key, i.e. the value of this field uniquely identifies a Bill object. For bills, the id is a string of the form 110_HR7311, where 110 is the number of the congress to which this bill belongs, and HR7311 is the bill number. To retrieve a particular bill by its id, use the following snippet:

    >>> b = Bill.query.get('110_HR7311')

Once this command completes, the object b will have all of the fields listed under Bill in the file bills/models.py. So, for instance, we can get the date that the bill was introduced with b.introduced, get its CRS summary text with b.summary, and so forth. Any field inside Bill that is initialized as Field(ABC), where ABC is some type (possible values include Integer for integer fields, Unicode(L) for a string field of maximum length L, or DateTime for a date/time field), is accessed in this straightforward way.
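Because a bill id encodes both the Congress and the bill number, it can be convenient to build or split ids programmatically. The helpers below are a hypothetical sketch (make_bill_id and split_bill_id are not part of the database package) and assume the underscore-separated form described above.

```python
def make_bill_id(congress, bill_number):
    """Build an id such as '110_HR7311' from a Congress and a bill number.

    Periods and spaces are stripped, so 'H.R. 7311' and 'HR7311' both work.
    """
    cleaned = bill_number.replace(".", "").replace(" ", "").upper()
    return "{}_{}".format(int(congress), cleaned)


def split_bill_id(bill_id):
    """Split an id such as '110_HR7311' into (110, 'HR7311')."""
    congress, bill_number = bill_id.split("_", 1)
    return int(congress), bill_number


bill_id = make_bill_id(110, "H.R. 7311")
print(bill_id)
print(split_bill_id(bill_id))
```

Inside an open session, the resulting string can be passed to Bill.query.get in the same way as a hand-written id.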

Other fields are registered as ManyToMany(ABC), ManyToOne(ABC), or OneToMany(ABC), where ABC is now the name of some other model. These fields contain references to one or more instances of some other model. For instance, in Bill, the field definition

    titles = OneToMany('BillTitle')

indicates that each bill has one or more (hence Many) associated objects of class BillTitle, which are its titles. In BillTitle, we see the field giving the reverse relation,

    bill = ManyToOne('Bill')

which indicates that many BillTitle objects can share the same Bill object (for some intuition, think of a OneToMany field as a "my children" relation, and of a ManyToOne field as a "my parent" relation). Thus, if b is a Bill object, then b.titles will give an iterable (effectively a Python list, for all basic purposes) containing all the titles of b. Conversely, if t is a BillTitle object, then t.bill is the Bill object to which the title belongs.

The last of these more complex relationships is the ManyToMany field, which, as its name suggests, creates a generic relation between two object types (where neither object plays the child or parent role). For example, we see in Bill,

    terms = ManyToMany('Term')

and in Term,

    bills = ManyToMany('Bill')

which means that for a bill b, b.terms gives all of the terms that the bill is classified under, and for a term t, t.bills gives all bills under that term.

5.3 Filtering Operations

The more sophisticated and interesting queries are those that involve not just fetching particular bills or other objects and examining their relationships, but also filtering sets of objects by useful criteria.

Example: filter by columns of each Model

    >>> reports = LobbyingReport.query.\
            filter_by(year=2011)

returns all lobbying reports filed in 2011.

Example: filter by membership in at least one ManyToMany related table

    >>> trade = LobbyingReport.query.\
            filter(LobbyingReport.issues.any(LobbyingIssue.code.\
            in_(['TRADE (DOMESTIC/FOREIGN)'])))
    >>> trade.count()
    52418

returns all lobbying reports that list TRADE (DOMESTIC/FOREIGN) as at least one of the issues lobbied.

5.4 Full-Text Indices

Two types of data items are duplicated in a separate full-text index database to facilitate more efficient searching: the CRS summary text of each bill, and the text of each lobbying report specific issue. The code concerning the creation and access of these indices is found in the files bills/ix_utils.py and lda/ix_utils.py, respectively. The primary methods for accessing these indices are summary_search and issue_search, in those two files respectively. Both take one required argument, called queries_list, which is a list of the queries (as strings) to make to the full-text index. They also have two optional boolean arguments, return_objects and make_phrase, which default to False. Setting return_objects to True will return a collection of Bill or LobbyingSpecificIssue Python objects rather than just their id values. Setting make_phrase to True will turn each query into a phrase that is searched for as a single unit, rather than searching for each word separately (as in the difference between searching Google for red cat running and for "red cat running" in quotes).

In lda/ix_utils.py, there is an additional method exposed for using the index, called get_bill_specific_issues_by_titles. It is a simple special case of issue_search that searches for all of the titles of a particular bill in the specific issues, and is used in the database construction process to find the bills mentioned by title in specific issues.
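The effect of make_phrase can be illustrated without the index itself. The toy matcher below is only a sketch of the two behaviours described above (all words present somewhere, versus all words adjacent and in order). The function names are hypothetical, it does not use Whoosh, and it is not how summary_search or issue_search are implemented.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(text, query, make_phrase=False):
    """Return True if the text matches the query.

    With make_phrase=False every query word must appear somewhere in the
    text; with make_phrase=True the words must appear consecutively and in
    the same order.
    """
    words = tokenize(text)
    terms = tokenize(query)
    if not make_phrase:
        return all(term in words for term in terms)
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


scattered = "The cat was running past a red door."
adjacent = "A red cat running down the street."
print(matches(scattered, "red cat running"))
print(matches(scattered, "red cat running", make_phrase=True))
print(matches(adjacent, "red cat running", make_phrase=True))
```

The first text contains all three words but not as a phrase, so it matches only when make_phrase is left at False; the second text matches in both modes.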
A simple example of using these indices to find bills pertaining to a particular textually distinguished subject (trade-related bills in our case) can be found in analytics/lobbied_bills_data.py. We define a list of queries on our bills in the following way:

    from database.analytics.bill_utils import *

    bill_queries = [
        u'trade barrier',
        u'tariff barrier',
        ...
        u'uruguay round',
        u'harmonized tariff schedule'
    ]

Then, to get the ids of the bills that contain one of these phrases, we do this:

    query_bill_ids = database.bills.ix_utils.summary_search(
        bill_queries,
        make_phrase=True,
        return_objects=False
    )

This returns a list of ids as strings. If we wanted the corresponding Bill objects, we could instead pass the argument return_objects=True. Note that it is important here to use make_phrase=True, since otherwise the query uruguay round would match all bills that contain both the word uruguay and the word round, not necessarily together, which is not what we want.

A simple example of using these indices to find lobbying reports that contain a particular phrase:

    from database.analytics.lda_utils import *

    reports = lda_issue_search(
        ['Free trade agreements with South Korea']
    )

5.5 Calculating Herfindahl Indices for the Industries to Which Clients Belong

We provide a tool to measure the size of each firm in relation to its industry. The Herfindahl index measures the level of competition among firms (clients) within the same industry.

5.5.1 herfindahl.py (in lobby/code/hfcc)

Given a lobbying database containing firm, LDA, and bill information, this script:

1. Pulls firm-level financial and LDA report information from the lobbying database
2. Computes Herfindahl indices
3. Outputs rows with firm information (sorted by industry as identified by NAICS2), the industry Herfindahl index, and lobbying information for the firm.

To run: From lobby/code, type

    python -m hfcc.herfindahl

No additional parameters are needed.

5.5.2 herfindadd.py (in lobby/code/hfcc)

Given the output (csv) files generated by herfindahl.py, this script:

1. Adds an indicator showing whether the firm lobbied on at least one trade issue
2. Adds firm-level Compustat financial data
3. Generates (in addition) new csv files with industry-level information in rows

To run: Ensure the output files from herfindahl.py are in lobby/code. From lobby/code, type

    python -m hfcc.herfindadd

On-screen documentation will detail the additional parameters that are needed.

Example:

    python -m hfcc.herfindadd namerica naics -s 1996 -e 2011

runs the script for the North America files, using NAICS (rather than SIC), starting from 1996 and ending in 2011. (herfindahl.py generates one file per year per classification system, NAICS or SIC.)
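For readers unfamiliar with the measure, the Herfindahl index of an industry is conventionally defined as the sum of the squared market shares of its firms, so it ranges from near 0 (many small firms) to 1 (a single firm). The sketch below computes it per industry from (industry, size) pairs. The function names, the example rows, and the use of a generic size figure as the basis for shares are assumptions for illustration; herfindahl.py may define shares differently.

```python
from collections import defaultdict


def herfindahl(sizes):
    """Sum of squared shares for one industry, given each firm's size."""
    total = float(sum(sizes))
    if total <= 0:
        raise ValueError("total size must be positive")
    return sum((size / total) ** 2 for size in sizes)


def herfindahl_by_industry(rows):
    """rows is an iterable of (industry_code, firm_size) pairs."""
    groups = defaultdict(list)
    for industry, size in rows:
        groups[industry].append(size)
    return {industry: herfindahl(sizes) for industry, sizes in groups.items()}


# Invented example rows: two firms in one industry, one firm in another.
rows = [("31", 60.0), ("31", 40.0), ("52", 10.0)]
for industry, index in sorted(herfindahl_by_industry(rows).items()):
    print(industry, round(index, 4))
```

In this example the two-firm industry has shares of 0.6 and 0.4, giving 0.36 + 0.16 = 0.52, while the single-firm industry has an index of 1.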