Categorical Data Analysis
Jeremy Freese
Department of Sociology
University of Wisconsin-Madison

This syllabus may be subject to additional, presumably minor, revision before or after the course begins. The latest rendition will always be made available at the webpage listed below.

Instructor: Jeremy Freese, University of Wisconsin-Madison, jfreese@ssc.wisc.edu
Teaching Assistant: Jason Beckfield, Indiana University, jbeckfie@indiana.edu
Course webpage: http://www.ssc.wisc.edu/~jfreese/cda.htm

This workshop introduces students to current methods for analyzing categorical data, with a principal focus on regression models for categorical outcomes. We will consider models for binary, ordinal, and nominal outcomes, as well as useful related models for censored and count outcomes. We will discuss the appropriate specification of models, their estimation with statistical software, and their proper and practical interpretation. Computing in the course will primarily use Stata. The course assumes a good working knowledge of the linear regression model for continuous variables, as well as an elementary knowledge of matrix algebra.

Books

This book will serve as our primary reading for much of the course:

Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.

This book would also be really helpful if it were available, but unfortunately it won't be until shortly after our course ends: [1]

Long, J. Scott and Jeremy Freese. 2001. Regression Models for Categorical Outcomes Using Stata. College Station, TX: Stata Press.

[1] Even so, you should still buy it and cite it in everything you ever write.
The following books are also referenced on the syllabus:

Agresti, Alan. 1990. Categorical Data Analysis. New York: John Wiley.
Amemiya, Takeshi. 1985. Advanced Econometrics. Cambridge, MA: Harvard University Press.
Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. Oxford: Oxford University Press.
Christensen, Ronald. 1997. Log-linear Models and Logistic Regression (2nd ed.). New York: Springer.
Fienberg, Stephen E. 1980. The Analysis of Cross-Classified Data (2nd ed.). Cambridge, MA: MIT Press.
Greene, William H. 2000. Econometric Analysis (4th ed.). New York: Prentice Hall.
Hosmer, David W. and Stanley Lemeshow. 2000. Applied Logistic Regression (2nd ed.). New York: Wiley.
King, Gary. 1989. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge: Cambridge University Press.
Powers, Daniel A. and Yu Xie. 2000. Statistical Methods for Categorical Data Analysis. San Diego: Academic Press.

While all of these books have virtues, none are in any way required. Of them, I would recommend buying the Powers and Xie book if one wanted a book that integrated models for grouped data with the regression models from the first part of the course, and I would recommend buying the Agresti book first if one wanted a book specifically for its treatment of contingency table data. The Cameron and Trivedi book is peerless if one is going to do a lot of serious work with count models. The King book occupies a place of obvious importance in the political methodology movement within political science.

The syllabus also makes reference to a few of the famed little green books published by Sage. The full references to these are included in the reading list.

Readings and Schedule

The pace of a course like this tends to depend on sufficiently unpredictable factors, including student participation and reactions, that providing a precise daily schedule seems an exercise in pedagogical delusion.
What follows is a listing of the topics that we might cover, in the order that we will cover them. The reading list is intended less as assigned reading than as an effort to provide both reading for the course and a bibliography of next sources should one want to pursue any of these models in detail. As the course proceeds, I will provide more information about which readings would be the most instructive to do before class meetings.

1. Salutation; overview; Stata preliminaries

McCloskey, Deirdre N. and Stephen T. Ziliak. 1996. "The Standard Error of Regressions." Journal of Economic Literature 34:97-114.

2. Review of the linear regression model and categorical independent variables in the linear regression model

Long, Chapters 1-2
Powers and Xie, Chapter 2

3. Maximum likelihood estimation

Long, Chapter 2.6
Powers and Xie, Appendix B
Eliason, Scott R. 1993. Maximum Likelihood Estimation: Logic and Practice. Newbury Park, CA: Sage.
King, Chapter 4

4. Regression models for censored and truncated data

Long, Chapter 7
King, Chapter 9.1-9.3
Greene, Chapter 20.1-20.3

Application:
Krasno, Jonathan, Donald Green, and Jonathan Cowden. 1994. "The Dynamics of Campaign Fundraising in House Elections." Journal of Politics 56:459-474.

On the Heckman model for sample selection bias:
Heckman, James J. 1979. "Sample Selection Bias as a Specification Error." Econometrica 47:153-161.

5. Binary models: specification and estimation
Long, Chapter 3.1-3.6
Powers and Xie, Chapter 3
King, Chapter 5.1-5.3
Christensen, Chapter 4
Fienberg, Chapter 6
Aldrich, John and Forrest Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage.
Jaccard, James. Interaction Effects in Logistic Regression. Newbury Park, CA: Sage.

On measuring the magnitude of categorical covariates:
Kaufman, Robert. 1996. "Comparing Effects in Dichotomous Logistic Regression: A Variety of Standardized Coefficients." Social Science Quarterly 77:90-109.

On models for rare events:
King, Gary and Langche Zeng. 2001. "Logistic Regression in Rare Events Data." Political Analysis 9.

Interpretation of results:
Long, Chapter 3.7-3.9
Hosmer and Lemeshow, Chapter 5

Skewed logit model:
Nagler, Jonathan. 1994. "Scobit: An Alternative Estimator to Logit and Probit." American Journal of Political Science 38:230-255.

Heteroskedastic probit model:
See discussion in Greene

Applications:
Brooks, Clem and Jeff Manza. 1997. "Social Cleavages and Political Alignments: U.S. Presidential Elections, 1960 to 1992." American Sociological Review 62:937-946.
Bartels, Larry. 2000. "Partisanship and Voting Behavior, 1952-1996." American Journal of Political Science 44:35-50.
Rosenstone, Stephen and John Hansen. 2001. "Solving the Puzzle of Participation in Electoral Politics." Pp. 69-82 in Richard Niemi and Herbert Weisberg (eds.), Controversies in Voting Behavior. Washington, DC: C.Q. Press.

6. Hypothesis testing and measuring goodness of fit

Long, Chapter 4
Cameron and Trivedi, Chapter 5

7. Models for ordered outcomes: specification and estimation

Long, Chapter 5.1-5.3
Powers and Xie, Chapter 6
King, Chapter 5.4

Interpretation, parallel regression assumption, generalized model:
Long, Chapter 5.4-5.7

Applications:
Huckfeldt, Robert. 2001. "The Social Communication of Political Expertise." American Journal of Political Science 45:425-438.
Greeley, Andrew M. and Michael Hout. 1999. "Americans' Increasing Belief in Life after Death: Religious Competition and Acculturation." American Sociological Review 64:813-835. (But if you read this, you should also check out the debate between Stolzenberg and Greeley/Hout in ASR 66(1):146-158.)

Stereotype ordinal regression model:
Anderson, J. A. 1984. "Regression and Ordered Categorical Variables (with discussion)." Journal of the Royal Statistical Society Series B 46:1-30.
Lunt, Mark. 2001. "Stereotype Ordinal Regression." Stata Technical Bulletin 61:12-18.

8. Models for nominal outcomes: specification and estimation

Long, Chapter 6.1-6.5
Powers and Xie, Chapter 7
Hosmer and Lemeshow, Chapter 8.1
Alvarez, R. Michael and Jonathan Nagler. 1998. "When Politics and Models Collide: Estimating Models of Multiparty Elections." American Journal of Political Science 42:55-96.
Gould, William. 2000. "Interpreting Logistic Regression in All Its Forms." Stata Technical Bulletin Reprints 9:257-270.

On the nested logit model:
Amemiya, Chapter 9.3.5 (pp. 300-306)
See also discussion in Greene

Interpretation:
Long, Chapter 6.6-6.10
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. "Making the Most Out of Statistical Analyses: Improving Interpretation and Presentation." American Journal of Political Science 44:341-355.

Applications:
Hao, Lingxin and Mary C. Brinton. 1997. "Productive Activities and Support Systems of Single Mothers." American Journal of Sociology 102:1305-1344.
Brooks, Clem. 2000. "Civil Rights Liberalism and the Suppression of a Republican Political Realignment in the United States, 1972 to 1996." American Sociological Review 65:483-505.

On testing the assumption of the independence of irrelevant alternatives:
Hausman, J. A. and D. McFadden. 1984. "Specification Tests for the Multinomial Logit Model." Econometrica 52:1219-1240.
Small, K. A. and C. Hsiao. 1985. "Multinomial Logit Specification Tests." International Economic Review 26:619-627.
Zhang, Junsen and Saul D. Hoffman. 1993. "Discrete-Choice Logit Models." Sociological Methods and Research 22:193-213.
9. Poisson and negative binomial regression models for count outcomes

Long, Chapter 8
Cameron and Trivedi, Chapters 1-3
King, Chapter 5.5-5.10
King, Gary. 1988. "Statistical Models for Political Science Event Counts: Bias in Conventional Procedures and Evidence for the Exponential Poisson Regression Model." American Journal of Political Science 32:838-863.

Applications:
Kernell, Samuel and Michael McDonald. 1999. "Congress and America's Political Development: The Transformation of the Post Office from Patronage to Service." American Journal of Political Science 43:792-811.
Lewis, David and James Michael Strine. 1996. "What Time Is It? The Use of Power in Four Different Types of Presidential Time." Journal of Politics 58:682-706.
Sampson, Robert J. and John H. Laub. 1996. "Socioeconomic Achievement in the Life Course of Disadvantaged Men: Military Service as a Turning Point, Circa 1940-1965." American Sociological Review 61:347-367.

Some further details on count models:
Cameron and Trivedi, Chapters 4 and 12 (the entire book is tremendous, incidentally)

10. Event-history analysis

The point of this is not to teach you how to do event history analysis, as that would certainly require more time than we can give, and really an entire course. What I hope to do is give an orientation to what an event history or survival analysis problem looks like, when and why you need special models for this kind of data, and how the approach is connected to the Poisson models that we just covered.

Powers and Xie, Chapter 5
Carroll, Glenn R. 1983. "Dynamic Analysis of Discrete Dependent Variables: A Didactic Essay." Quality and Quantity 17:425-460.

A good overall treatment of these models can be found in:
Hosmer, David W. and Stanley Lemeshow. 1999. Applied Survival Analysis: Regression Modeling of Time to Event Data. New York: Wiley.
Applications:
Hannan, Michael T. and Glenn R. Carroll. 1981. "Dynamics of Formal Political Structure: An Event-History Analysis." American Sociological Review 46:19-35.
Warwick, Paul and Stephen T. Easton. 1992. "The Cabinet Stability Controversy: New Perspectives on a Classic Problem." American Journal of Political Science 36:122-146.

11. Contingency table analysis

Note: We are going to spend less time on this than planned in the last rendition of the course, but I couldn't see any reason not to include the whole reading list from last time, as at least a reference for any students who become more interested in the topic.

Introduction and the two-way table:
Powers and Xie, Chapter 4.1-4.4.3
Fienberg, Chapter 2
Agresti, Chapter 2
Knoke, David and Peter J. Burke. 1980. Log-Linear Models. Newbury Park, CA: Sage.

Multiway tables:
Powers and Xie, Chapter 4.6
Agresti, Chapter 5
Fienberg, Chapter 3

Model comparison:
Fienberg, Chapter 4
Agresti, Chapter 7

The Bayes Information Criterion (BIC) statistic:
Raftery, Adrian. 1986. "Choosing Models for Cross-Classifications." American Sociological Review 51:145-146.
Raftery, Adrian E. 1995. "Bayesian Model Selection in Social Research." Sociological Methodology 25:111-163.
Weakliem, David. 1999. "A Critique of the Bayesian Information Criterion for Model Selection." Sociological Methods and Research 27:359-397.
Raftery, Adrian. 1999. "Bayes Factors and BIC: Comment on 'A Critique of the Bayesian Information Criterion for Model Selection'." Sociological Methods and Research 27:411-427.

Models for ordered categories (uniform association, row effects, column effects):
Powers and Xie, Chapter 4.5
Christensen, Chapter 7
Agresti, Chapter 8
Green, J. A. 1988. "Loglinear Analysis of Cross-Classified Ordinal Data: Application in Developmental Research." Child Development 59:1-25.

Square tables:
Powers and Xie, Chapter 4.4.5 and 4.4.6
Agresti, Chapter 10.1-10.5 (Compare to Agresti 11.1-11.2)
Hout, Michael. 1983. Mobility Tables. Newbury Park, CA: Sage.
Sobel, Michael, Michael Hout, and Otis Dudley Duncan. 1985. "Exchange, Structure, and Symmetry in Occupational Mobility." American Journal of Sociology 91:359-372.
Sobel, Michael E. 1988. "Some Models for the Multiway Contingency Table with One-to-One Correspondence among Categories." Sociological Methodology 18:165-191.

12. Propensity-score matching models (for categorical independent variables)

Rosenbaum, P. and D. Rubin. 1984. "Reducing Bias in Observational Studies using Subclassification on the Propensity Score." Journal of the American Statistical Association 79:516-524.
Smith, Herbert L. 1997. "Matching with Multiple Controls to Estimate Treatment Effects in Observational Studies." Sociological Methodology 27:325-353.
Dehejia, Rajeev H. and Sadek Wahba. 1998. "Propensity Score Matching Methods for Non-Experimental Causal Studies." Working Paper 6829, National Bureau of Economic Research.
Imbens, Guido W. 1999. "The Role of the Propensity Score in Estimating Dose-Response Functions." Technical Working Paper 237, National Bureau of Economic Research.