Inferring Directional Migration Propensities from the Migration Propensities of Infants: The United States


WORKING PAPER Inferring Directional Migration Propensities from the Migration Propensities of Infants: The United States Andrei Rogers Bryan Jones February 2007 Population Program POP2007-04

Andrei Rogers, Professor of Geography, and Bryan Jones, graduate student, are both members of the Population Program, Institute of Behavioral Science, University of Colorado, Boulder.

Acknowledgments: This research is supported by grants from the National Science Foundation (SES-0240808) and the National Institute for Child Health and Human Development (HSS-R03HD048561). The authors are grateful to Professor Rick Rogers for his suggestions after a careful reading of the text.

Population Program, Institute of Behavioral Science, Boulder, Colorado 80309-0484

Abstract

Beginning with the 2010 decennial census, the U.S. Census Bureau plans to drop its long-form questionnaire and to replace it with the American Community Survey (ACS). The resulting absence of the larger sample provided by the census count will complicate the measurement and analysis of internal migration flows. Additionally, the strategy of averaging accumulated samples over time will mix changing migration patterns. The migration question will refer to a one-year time interval instead of the five-year interval used in the censuses between 1960 and 2000, complicating historical comparisons and the production of multiregional projections based on five-year age groups. Consequently, students of territorial mobility increasingly will find it necessary to complement or augment possibly inadequate data collected on migration with estimates obtained by means of indirect estimation. This paper expands upon a method, previously tested on American, Mexican, and Indonesian data, that allows one to infer age-specific directional migration propensities at the regional level. The method uses birthplace-specific infant population data to approximate infant migration propensities, and from these infers the migration propensities of all other ages. The method is applied at both a four-region and a nine-division spatial scale.

Keywords: migration, indirect estimation, United States

CONTENTS

1 INTRODUCTION
2 CONTEXT AND DATA
  2.1 Four-region Trends
  2.2 Nine-division Trends
  2.3 Retirement Peaks
  2.4 Data Issues
3 METHOD
  3.1 Age-specific Regularities
  3.2 The Linear Relationship
  3.3 Families of Migration Flows
  3.4 Outlying Observations
  3.5 The Cubic Spline and the Model Schedule Fit
4 ANALYSIS
  4.1 Constrained vs. Unconstrained Regression
  4.2 Estimating Inter-regional Migration Using Observed Data: 1960-2000
  4.3 Dividing Migration Flows into Families, 1960-2000
  4.4 Applying Cubic Spline and Model Schedule Fit to the Data, 1960-2000
  4.5 Applying Confidence Intervals to the Data, 1960-2000
5 PREDICTION
  5.1 Predicting Migration Flows Using Current Period Infant Migration Propensity and Past Period Regression Parameters: 1970-2000
  5.2 Predictions Using the Family Approach, 1970-2000
  5.3 Applying Cubic Spline and Model Schedule Fit to the Data, 1970-2000
  5.4 Applying Confidence Intervals to the Data, 1970-2000
  5.5 The Confidence Interval and Family Approach, 1970-2000
6 SUMMARY OF RESULTS AND CONCLUSION
REFERENCES

1. INTRODUCTION

The 2010 United States national census will not contain a question on internal migration. The U.S. Census Bureau plans to drop its long-form questionnaire and to replace it with the American Community Survey (ACS). This change will provide more timely data, but the ACS samples will be smaller than those used by the census in the past, and the absence of the larger sample provided by the census count will complicate the measurement and analysis of internal migration flows. Moreover, the strategy of averaging accumulated samples over time will mix changing migration patterns. Additionally, the migration question will refer to a one-year time interval instead of the five-year interval used in the censuses between 1960 and 2000, complicating historical comparisons and the production of multiregional projections based on five-year age groups. Consequently, students of territorial mobility increasingly will find it necessary to complement or augment possibly inadequate data collected on migration with estimates obtained by means of indirect estimation (Rogers and Jordan, 2004).

Past studies of migration have identified a very consistent migration profile with respect to age. The model migration schedule (Rogers and Castro, 1981), which captures this profile, reflects the changing migration propensities exhibited by the various age cohorts. However, unlike mortality and fertility schedules, migration schedules as yet have not been used widely to develop adequate techniques for indirectly estimating age-specific migration propensities.

This study expands upon a method, previously tested on American, Mexican, and Indonesian data, that allows one to infer age-specific directional migration propensities at the regional level (Rogers and Jordan, 2004; Rogers et al., 2007). The method uses birthplace-specific infant migration propensities to infer the migration propensities of all other ages. The method was initially applied to United States inter-regional migration data on a four-region scale and was found to be accurate enough for a detailed analysis of the model's performance (Rogers and Jordan, 2004). The method was then applied to Mexican and Indonesian data, to test the model's usefulness in countries with somewhat less accurate inter-regional migration data. This study will again apply the method to United States inter-regional migration data, this time using data from five decennial census periods (1960-2000) and two spatial scales (four regions and nine divisions). Moreover, the study will incorporate a variety of extensions of the original model in an effort to improve its performance. Additionally, the introduction of the nine-division analysis will provide deeper insight into United States inter-regional migration patterns, and a more realistic data set with which to evaluate the model.

Section Two of this paper frames the study in the context of the United States during the last half of the 20th century. Observed trends in the inter-regional migration data are examined over the five periods encompassed in this study: 1955-1960, 1965-1970, 1975-1980, 1985-1990, and 1995-2000. Section Three reviews the infant migration method, first used by Rogers and Jordan (2004), and discusses alterations to the model designed to improve its performance. Section Four contains our analysis of the estimation of age-specific migration profiles for the five census periods. Section Five considers the model's predictive ability, when data from an earlier period are used with infant migration propensities estimated for a current period to predict age-specific migration patterns for that current period. Finally, Section Six summarizes the results and offers a few conclusions.

2. CONTEXT AND DATA

This paper is concerned with the ability of the infant migration model to estimate and predict observed migration in the United States during the latter half of the 20th century. During this period, a few patterns consistently appeared in the regional migration data. The most notable were higher migration propensities out of the Snowbelt regions (Northeast, Midwest) and into the Sunbelt regions (South, West), and the presence of retirement flows involving the increased movement of persons reaching retirement age into regions with warmer weather and amenities considered attractive by the elderly (Figure 1).

Figure 1 about here

2.1 Four-region Trends

Over the five census periods included in this study, the flow from the Northeast Region to the South Region exhibited the highest mean gross migraproduction rate [1] (0.52). The second and third highest mean GMRs were also for flows into the South Region (0.43 for Midwest to South and 0.38 for West to South). Conversely, three of the four lowest mean GMRs were associated with flows into the Northeast (Table 1). Examination of the GMRs of individual flows (as opposed to means across the five censuses) reveals that the 13 largest flows, and 15 of the top 20, were destined for the South Region. On the other hand, 9 of the 10 smallest flows were directed to the Northeast Region (Table 2).

[1] The gross migraproduction rate (GMR) is the sum of the migration rates or probabilities for each single-year age cohort across a population at a given time (i.e., the total area under the migration schedule curve). This rate measures the total level of migration out of a region, and can be used to examine the levels of both total regional out-migration and destination-specific regional out-migration (Rogers, 1995). It is analogous in concept to the widely used gross reproduction rate (GRR), which is used to describe the level of fertility rates or probabilities.

In addition to exhibiting the highest mean GMR, the flow from the Northeast to the South also exhibits the most variation over the five periods (Table 1). Closer examination reveals that the GMR for this flow has increased in each subsequent census period, peaking at 0.62 during the 1995-2000 period (Figure 2c). The age-profile (Figure 2a) reveals a significant increase in migration propensity between 1960 and 2000 for all age cohorts, but particularly for those between the ages of 25 and 65. Also note the clearly defined retirement peak exhibited by this flow. Controlling for level, by scaling the profiles to a common unit area under the curve (Figure 2b), reveals the consistent importance of retirement-age migrants in this flow, although the retirement peak appears to have shifted to an earlier age, from 60 to 55, over the period. The importance of retirees in the Northeast to South flow is not surprising, given the large movement from New York to Florida among persons at retirement age.

Tables 1 and 2 about here

Figures 2a, 2b, and 2c about here

2.2 Nine-division Trends

Further disaggregating the data into nine Census Divisions provides deeper insights into internal migration trends. Interestingly, three of the five flows with the highest mean GMRs at the nine-division aggregation (Table 3) are hidden at the four-region level. Over the five census periods the Mountain to Pacific flow exhibited the highest mean GMR (0.74). The opposite flow, Pacific to Mountain, had the fifth highest mean GMR (0.29). Despite the relatively large difference in mean GMRs, time series data indicate that these flows have been, in fact, converging in terms of level (Figure 3). Flows from the Mid-Atlantic and New England Divisions (the two components of the Northeast Region in the four-region model) to the South Atlantic Division had the second and third highest mean GMRs, respectively (0.47 and 0.39). Included in the highest ten GMRs are all the flows from the major metropolitan areas of the Northeast and Midwest (Boston, New York, Philadelphia, Detroit, Chicago) to the popular corresponding southern destinations in Florida, Georgia, and North Carolina. Five of the ten flows exhibiting the lowest mean GMRs were heading into the New England Division. Incidentally, four of the ten flows with the highest GMRs were hidden at the four-region aggregation, whereas none of the ten flows with the lowest GMRs were hidden. This is not surprising, since it indicates that lower GMRs are associated with larger geographical distances.

An examination of GMRs for individual periods (Table 4) reveals that the four highest GMRs are associated with flows from the Mountain to the Pacific Division, whereas 12 of the 20 highest GMRs were for flows going to the South Atlantic Division.

Conversely, 14 of the 20 lowest GMRs were for flows heading into the New England Division; the four lowest flows originated in the East South Central Division.

Tables 3 and 4 about here

Figure 3 about here

Given the presence of a retirement peak in the migration age-profile of the Northeast to South regional flow in the four-region model, one might expect both the New England and Mid-Atlantic to South Atlantic flows to exhibit similar retirement peaks in the nine-division model. Figures 4a and 5a indicate that this is indeed the case. The GMR from the Mid-Atlantic to the South Atlantic Division is higher [2] than that from the New England Division; however, the age-profiles indicate that in the retirement peak years both flows exhibit nearly equal levels. The difference results from higher age-specific migration propensities during the young adult years in the Mid-Atlantic to South Atlantic flows. Controlling for level (Figures 4b and 5b), we see that the profile is fairly constant for both flows over the five censuses. The exceptions occur in the retirement years for both 1960 flows, which exhibit an irregular pattern (possibly due to data issues), and in the 1990 flow from the New England Division, which exhibits a lower labor peak and a higher retirement peak in comparison to flows from the other censuses.

[2] This is to be expected given that New York is included in the Mid-Atlantic Division.

Figures 4a and 4b and 5a and 5b about here

2.3 Retirement Peaks

Of the twelve inter-regional flows modelled in the four-region study (4 × 4 − 4 = 12 flows), three consistently exhibit retirement peaks: the Northeast to South flow, the Midwest to South flow, and the Midwest to West flow. The levels associated with the peaks vary, but all clearly display propensities that rise between the ages of 55 and 65, dropping again among the older cohorts. Flows from the Northeast and Midwest regions to the South Region contain the most prominent retirement peaks, whereas the Midwest to West peak is somewhat less noticeable.

The nine-division study consists of 72 inter-regional flows (9 × 9 − 9 = 72 flows). Of these, six consistently exhibit retirement peaks: the New England, Mid-Atlantic, and East North Central to South Atlantic flows, as well as the Mid-Atlantic, East North Central, and Pacific to Mountain flows. The three flows into the South Atlantic Division exhibit both higher retirement peaks and less variability across the five censuses. The Pacific to Mountain flow, in general, exhibits a more prominent retirement peak than the other two Mountain-bound flows, but exhibits a great deal of variability.

Because these trends are so apparent in the observed inter-regional migration flows, a successful application of the infant migration model must produce output that reflects them. Therefore much of this study will be concerned with the ability of the model to adequately estimate and predict both the level and the intensity of migration, as well as the presence of retirement peaks. Many of the alterations to the model explained in the next section were made with this in mind.

2.4 Data Issues

Two data issues arose with the nine-division census figures. First, there are several zero values in the migration data that are unlikely to reflect real-life flows. Although the zero values tend to appear in the older age cohorts, and may be the result of very small sample sizes, it is reasonable to expect, given the size of the population of the United States, that there is some inter-regional migration taking place between all divisions and in all five-year age cohorts. However, because we do not wish to arbitrarily create data, we keep the zero values unchanged initially, and address them through extensions to the infant migration model.

Second, we have found several irregular data points that are clearly the result of input errors. For example, in Figure 6, the migration propensity for the 70-74 cohort is much too high to be correct. The reported migration propensity is 0.00825, but for the 65-69 and 75-79 groups propensities are only 0.000975 and 0.000889, respectively. The error is the obvious omission of a zero, so we modify S_ij(70) to be 0.000825, which is much more reasonable. Another example of erroneous data involves the 60-64 cohort in the 1980 flow from the New England to the West South Central Division. The reported value is 0.115, clearly a typing error. We correct this figure to 0.00115 based on the normal age-profile of migration, and on the values immediately preceding and following the value in the flow data. From the five censuses, there are a total of 360 migration flows with 6480 individual data points. We have changed eight values in seven migration flows. Every changed value involved either a typing error or the inclusion or omission of a zero (multiples of ten).

Figure 6 about here

3. METHOD

This study focuses on the indirect estimation of inter-regional migration propensities in the United States by using information on the migration propensities of infants to predict the corresponding propensities for all other age groups. This so-called infant migration method (Rogers et al., 2003; Rogers and Jordan, 2004) relies on observed regularities in empirical age patterns of migration propensities to carry out such an analysis. We employ five strategies: (1) the identification of age-specific regularities, (2) the identification of families of similar migration age-profiles, (3) the use of the seven and eleven parameter model migration schedules to smooth observed data irregularities, (4) the detection and removal of outliers, and (5) the subsequent estimation and prediction of the migration flows from regularities observed in past data.

3.1 Age-specific Regularities

Observed age patterns of migration probabilities generally exhibit strong regularities. The highest probabilities occur in the early adult years, when individuals leave their parental home to attend college, enter the military, marry, or enter the labor force. This is reflected in a labor peak in the prototypical empirical migration schedule (Rogers and Castro, 1981). The lowest probabilities occur in late adolescence and toward the end of the working ages. The migration probabilities of children mirror those of their parents, and because young adults migrate more than older adults, the migration rates of infants exceed those of adolescents. In some instances, particularly in the developed world, the migration probabilities of those reaching retirement age show a sudden increase and exhibit a retirement peak around age 65.

In some instances particular regional migration flows have exhibited irregularities in the age-specific pattern of migration. In such cases it is useful to use an age-specific model migration schedule to smooth out these irregularities. This not only eliminates the irregularities, but also enforces a profile that is consistent with commonly observed data. The complete Rogers-Castro model migration schedule generally has four components: (1) the pre-labor force stage (children), (2) the labor force stage (adults), (3) the post-labor force retirement stage (elderly), and (4) a constant curve. This version of the model can be expressed as:

$$m(x) = N_1(x) + N_2(x) + N_3(x) + c$$

$$m(x) = a_1 \exp(-\alpha_1 x) + a_2 \exp\{-\alpha_2 (x - \mu_2) - \exp[-\lambda_2 (x - \mu_2)]\} + a_3 \exp\{-\alpha_3 (x - \mu_3) - \exp[-\lambda_3 (x - \mu_3)]\} + c \tag{1}$$

where m(x) = migration probability at age x, N_1 = pre-labor force stage (child), N_2 = labor force stage (adult), N_3 = post-labor force stage (elderly), c = constant, a, α, µ, and λ are parameters, and x is age.

In those flows without a retirement hump, the third component in Equation (1) is deleted.
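As an illustration of Equation (1), the schedule can be evaluated directly. The sketch below is only a minimal reading of the formula: the function names and every parameter value are hypothetical, chosen to give a plausible shape, and are not estimates from the census data used in this paper.

```python
import math


def peaked_component(x, a, alpha, mu, lam):
    """Double-exponential term a * exp{-alpha (x - mu) - exp[-lam (x - mu)]}."""
    return a * math.exp(-alpha * (x - mu) - math.exp(-lam * (x - mu)))


def model_schedule(x, a1, alpha1, a2, alpha2, mu2, lam2, c, retirement=None):
    """Equation (1): childhood curve + labor peak (+ retirement peak) + constant.

    `retirement` is an optional (a3, alpha3, mu3, lam3) tuple. Leaving it as
    None drops the third component, which gives the seven-parameter form.
    """
    value = a1 * math.exp(-alpha1 * x)
    value += peaked_component(x, a2, alpha2, mu2, lam2)
    if retirement is not None:
        value += peaked_component(x, *retirement)
    return value + c


# Hypothetical parameter values, for illustration only.
SEVEN = dict(a1=0.02, alpha1=0.10, a2=0.06, alpha2=0.10,
             mu2=20.0, lam2=0.40, c=0.003)
RETIREMENT = (0.01, 0.30, 60.0, 0.50)

for age in range(0, 90, 5):
    m7 = model_schedule(age, **SEVEN)
    m11 = model_schedule(age, retirement=RETIREMENT, **SEVEN)
    print(f"{age:2d}  {m7:.5f}  {m11:.5f}")
```

Passing the retirement parameters as an optional tuple mirrors the text: the seven-parameter and eleven-parameter schedules differ only in whether the third component is present.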

3.2 The Linear Relationship

Rogers and Jordan (2004) demonstrated the relationship between infant migration propensity and the migration propensity for all age groups together. Figure 7 plots infant migration propensities (S_ij(-5)) and migration propensities for all age cohorts (S_ij(+)), as well as the best fitting straight line that results from a bivariate regression in which S_ij(+) is dependent upon S_ij(-5). [3] For the 1995-2000 data, an R² value of 0.88 indicates a strong relationship between the variables, suggesting that S_ij(-5) is a potentially powerful predictor of migration propensity among other age cohorts.

[3] For a formal definition of S_ij(-5) see Rogers (1995), p. 98.

Figure 7 about here

The use of the infant migration propensity as a starting point is advantageous in that, in the absence of reported migration data, its level can be approximated by the birthplace-specific population count of children who are 0-4 years old and residing in region j at the time of the census, and who were born in region i within the past five years, and therefore must have migrated during the immediately preceding 5-year interval. Since they were, on average, born some 2-1/2 years ago, it is unlikely that they moved more than once. Hence, by back-casting their numbers to their region of birth, together with those of all other infants born in the same region, one is able to divide each i-to-j migration number by the total ("surviving-to-census") births in i, to obtain an estimate of each of the infant conditional-on-survival migration probabilities, S_ij(-5).

Observed regularities in patterns of age-specific migration probabilities suggest that information on the probabilities of infant migration also can be linked to the corresponding probabilities in each of the subsequent age groups by means of a regression equation (Rogers and Jordan, 2004). We, therefore, can consider a linear regression that links each age-specific S_ij(x) with S_ij(-5):

$$S_{ij}(x) = a + b\,S_{ij}(-5) + \text{error term} \tag{2}$$
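The two steps described above, approximating S_ij(-5) from birthplace-specific counts and then fitting Equation (2) for one age group, can be sketched as follows. This is a minimal illustration under stated assumptions: the region labels, counts, and observed propensities are all made up, and the helper names `infant_propensities` and `fit_line` are hypothetical rather than taken from the authors' software.

```python
def infant_propensities(children):
    """Estimate S_ij(-5) from birthplace-by-residence counts of 0-4 year olds.

    children[i][j] is the number of children aged 0-4 at the census who were
    born in region i and live in region j (j == i are the stayers).
    """
    propensities = {}
    for origin, row in children.items():
        surviving_births = sum(row.values())  # all survivors born in `origin`
        for destination, count in row.items():
            if destination != origin:
                propensities[(origin, destination)] = count / surviving_births
    return propensities


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Made-up counts for three regions, for illustration only.
children = {
    "A": {"A": 9500, "B": 300, "C": 200},
    "B": {"A": 150, "B": 9700, "C": 150},
    "C": {"A": 400, "B": 100, "C": 9500},
}
infant = infant_propensities(children)

# Made-up observed propensities S_ij(x) for one age group x, same flows.
observed = {("A", "B"): 0.021, ("A", "C"): 0.015, ("B", "A"): 0.011,
            ("B", "C"): 0.012, ("C", "A"): 0.027, ("C", "B"): 0.008}

flows = sorted(infant)
a, b = fit_line([infant[f] for f in flows], [observed[f] for f in flows])
estimated = {f: a + b * infant[f] for f in flows}  # Equation (2) without the error term
print(f"a = {a:.5f}, b = {b:.3f}")
```

In practice one such regression would be fitted for each five-year age group, with every inter-regional flow contributing one point.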

Using this simple linear regression equation, estimated migration propensities for each of the subsequent five-year age cohorts can be determined.

Figures 8a and 8b about here

The ability of this model to predict migration in subsequent time periods depends largely on the consistency of the S_ij(-5) to S_ij(x) relationship over time. A useful method for testing this relationship is simply to plot observed regression parameters. Figures 8a and 8b examine slope coefficients and intercept values resulting from the simple linear regression (Equation 2) applied to the four-region data over the five census periods. In both plots, there is noticeable consistency between the 1960 and 1970 parameters, and again between the 1980, 1990, and 2000 parameters. However, there is a noticeable difference between these two groups in several respects. First, although the slope coefficients for the 1980-2000 data rise between the ages of 50 and 65, they do not do so for the 1960-1970 data. This difference may be due to more sharply defined retirement peaks in the 1980-2000 observed data, but it may also indicate that the simple model more accurately estimates the retirement peaks in those periods. Furthermore, the presence of larger negative intercept values in the 1980-2000 parameters (ages 50-65) increases the likelihood that negative migration propensities might result from the model. In terms of prediction, the clear distinction between 1960-1970 and 1980-2000 will likely result in weaker predictions for 1980 (when using 1970 regression parameters) relative to predictions for the other census periods.

Figures 9a and 9b about here

Figures 9a and 9b examine slope coefficients and intercept values resulting from the application of Equation 2 to the nine-division data over the five census periods. The most striking difference between the four-region and nine-division results is the greater consistency in both slope and intercept values across all periods in the nine-division data.

This is likely the result of the significantly larger number of data points included in the regression equations, which follows from using nine divisions as opposed to four regions. Also noticeable is the lack of a retirement peak bulge in the slope coefficients, suggesting that the simple model may not capture the retirement peak in the six flows that consistently exhibit one at the nine-division aggregation. Again, this is likely the result of an increased number of data points. In the four-region model, three of the twelve flows (25%) exhibited retirement peaks, while in the nine-division model only six of the seventy-two flows (8.3%) exhibited retirement peaks. The increased percentage of flows without retirement peaks in the nine-division model obscures the existence of the peak in those few flows that do exhibit one, increasing the importance of extending the model to account for differences in inter-regional flows, such as adopting families of migration flow profiles.

3.3 Families of Migration Flows

Specific inter-regional migration flows often exhibit characteristics that allow us to separate or define one group of flows in relation to other groups. Each family of flows exhibits the same defining characteristics. For example, the presence or absence of a retirement peak is one such defining characteristic. Another characteristic concerns the location of the labor force peak on the horizontal axis, with some peaks occurring at younger (or older) ages than others. Yet another characteristic considers the ratio of the labor force peak value to the initial infant migration value; this defines the flow to be either a labor dominant or child dominant flow (Rogers and Castro, 1981).

This particular study will only divide migration flows into two families based on the presence or absence of a retirement peak. Because a retirement peak is consistently present in three of the twelve inter-regional flows in the four-region model, and six of the seventy-two inter-regional flows in the nine-division model, such an approach should improve the performance of the infant migration model.

3.4 Outlying Observations

Migration data usually display the common age-specific pattern, but some do not. When the $S_{ij}(x)$ values are plotted against the $S_{ij}(-5)$ value and a regression line is fitted to the resulting scatter of points, several points may fall significantly above or below the regression line. These points, labeled outliers, pull the estimated $S_{ij}(x)$ values away from the common age-specific pattern of observed values, reducing the accuracy of the model. Because these points probably reflect small sample sizes or errors in the data set, removing them should improve the fit of the regression line and yield improved estimated values.

The points can be removed by statistical means, so that their deletion is not arbitrary. To do so, confidence intervals may be used. A 90% confidence interval, for example, defines a range of values that in general will capture the observed data points 90% of the time. To remove outlying data points, an interval first must be obtained such that the predicted value (in this case each $S_{ij}(x)$) lies between a lower and an upper bound a certain percentage of the time. This study applies both 90% and 80% confidence intervals to the data sets. A migration propensity that lies outside the appropriately calculated range is deleted. The following equation is used to determine confidence intervals in this study:

$$\hat{Y}_i \pm t_{n-2,\,\alpha/2}\sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{S_{xx}}} \tag{3}$$

such that if the following is not satisfied:

$$\hat{Y}_i - t_{n-2,\,\alpha/2}\sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{S_{xx}}} < Y_i < \hat{Y}_i + t_{n-2,\,\alpha/2}\sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{S_{xx}}} \tag{4}$$

then the observation is removed from the equation.
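The screening rule of Equations 3 and 4 can be illustrated with a short sketch. It is not the study's own implementation: the function names and the data are hypothetical, the critical value of t is passed in by the caller from a table, and the half-width is multiplied by the residual standard error s, which is the conventional form of the interval but is an assumption here because no such term appears in Equation 3 as printed.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def drop_outliers(xs, ys, t_crit):
    """Keep only the (x, y) pairs that satisfy the interval test of Equation 4.

    t_crit is the tabulated value of t_{n-2, alpha/2}, supplied by the caller.
    ASSUMPTION: the half-width is scaled by the residual standard error s.
    """
    n = len(xs)
    intercept, slope = fit_line(xs, ys)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    fitted = [intercept + slope * x for x in xs]
    s = sqrt(sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - 2))
    kept = []
    for x, y, f in zip(xs, ys, fitted):
        half_width = t_crit * s * sqrt(1.0 / n + (x - x_bar) ** 2 / s_xx)
        if f - half_width < y < f + half_width:
            kept.append((x, y))
    return kept


# Hypothetical data: twelve points on y = 2x, with one value inflated.
xs = list(range(1, 13))
ys = [2 * x for x in xs]
ys[5] = 22  # the point at x = 6 would be 12 on the line
kept = drop_outliers(xs, ys, t_crit=1.812)  # t_{10, 0.05}, i.e. a 90% interval
```

In this toy data set only the inflated point falls outside its interval. The regression would then be re-fitted to the remaining points, as described above.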

3.5 The Cubic Spline and the Model Schedule Fit

In the previous section, a method using confidence intervals to select and remove outlying observations was described. In this section, an alternative method for dealing with outlying observations is presented: fitting the model migration schedule of Equation 1 to observed data, after first using a cubic spline graduation of the observed data and inserting the resulting curves into TableCurve 2D, a commercially available curve fitting software package (Jandel Scientific, 1996). We begin with a cubic spline constructed of third-order polynomials that pass through a set of pre-defined control points. The five-year data points serve as control points in this model, and one-year migration propensities are obtained by graduations using the cubic spline. One-year data are preferred to five-year data because they provide significantly more data points to which the model schedule can be fitted. Using TableCurve 2D, the seven or eleven parameter model migration schedule⁴ is then fitted to the splined data. Five-year data resulting from the model fit are then used in place of the observed data in the simple linear regression described by Equation 2 in Section 3.2.

⁴ The seven parameter model (Equation 1 without the third component) is applied to data without a retirement peak; the eleven parameter model is fitted to those flows with a retirement peak.

4 ANALYSIS

In evaluating the various fits produced by our estimations and predictions, it is convenient to use an established measure of goodness-of-fit. We draw on the widely used mean absolute percentage error (MAPE) statistic, which for a particular flow is:

$$\mathrm{MAPE} = \frac{100}{N}\sum_{x}\frac{\left|\hat{S}_{ij}(x) - S_{ij}(x)\right|}{S_{ij}(x)} \tag{5}$$

where N is the total number of age groups. For all the flows taken together we use:

$$\mathrm{MAPE}_{ij} = \frac{100}{n(n-1)N}\sum_{i=1}^{n}\sum_{j \neq i}\sum_{x}\frac{\left|\hat{S}_{ij}(x) - S_{ij}(x)\right|}{S_{ij}(x)} \tag{6}$$

where n is the number of regions, so that n(n - 1) is the number of inter-regional flows.

4.1 Constrained versus Unconstrained Regression

The simple, unaltered infant migration model was applied to the census data at both spatial scales in two forms: the first allowing for an intercept term (unconstrained), and the second forcing the regression through the origin (constrained). The latter procedure ensures that no negative propensities result from the application of the model. Allowing an intercept term in the four-region estimations resulted in a negative migration propensity in 10 out of 1,020 (0.98%) estimated propensities, and in 7 out of 816 (0.86%) predicted propensities.⁵ Allowing an intercept term in the nine-division estimation resulted in a negative migration propensity in 65 out of 6,120 (1.06%) estimated propensities, and in 60 out of 4,896 (1.23%) predicted propensities. However, constraining the regression by forcing it through the origin produced less accurate results, as expected. Although the MAPE values for estimated flows at the four-region scale (Figures 10a and 11a) were similar in the early periods, they were significantly lower for the unconstrained regression in the later periods. They were lower for all periods at the nine-division scale. Similarly, MAPE values for predicted flows (Figures 10b and 11b) were similar in the first three periods at the four-region scale and in the middle two periods at the nine-division scale, but were significantly better for the nine-division 1970 predictions and for the 2000 predictions at both scales when using unconstrained regression.

⁵ Estimation refers to the use of data from the current time period to generate age-specific migration propensities for that time period. Prediction refers to the use of data from an earlier period, together with infant migration propensities for a current period, to predict age-specific migration patterns for that current period.

Figures 10a and 10b about here

Figures 11a and 11b about here
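Because the MAPE statistic of Equations 5 and 6 underlies every comparison in this section, a minimal sketch of its calculation may be useful. The function names, the dictionary layout and the numbers are illustrative assumptions rather than anything taken from the study; the second function assumes every flow has the same number of age groups, in which case Equation 6 reduces to the average of the flow-level values.

```python
def mape_flow(estimated, observed):
    """Equation 5: mean absolute percentage error over the N age groups of one flow."""
    n_ages = len(observed)
    return 100.0 / n_ages * sum(abs(e - o) / o for e, o in zip(estimated, observed))


def mape_all_flows(estimated, observed):
    """Equation 6: average of the flow-level errors over all n(n - 1) flows.

    Both arguments map an (origin, destination) pair to a list of propensities,
    one per age group. ASSUMPTION: every flow has the same number of age groups.
    """
    return sum(mape_flow(estimated[f], observed[f]) for f in observed) / len(observed)


# Hypothetical propensities for two flows and two age groups.
observed = {("A", "B"): [0.10, 0.20], ("B", "A"): [0.05, 0.10]}
estimated = {("A", "B"): [0.11, 0.18], ("B", "A"): [0.05, 0.12]}
flow_error = mape_flow(estimated[("A", "B")], observed[("A", "B")])
total_error = mape_all_flows(estimated, observed)
```

Because each error is divided by the observed propensity, flows with very low propensities can produce large MAPE values even when the absolute error is small.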

Besides our MAPE measures of goodness-of-fit, we also examined the R² values associated with the regressions and found that they were similar in the 1960 and 1970 runs, but were superior in estimations using unconstrained regression in the later three periods, particularly at the older ages, perhaps because the unconstrained regression model better captures retirement behavior. Further analysis reveals that the slope coefficients associated with unconstrained regression rise during retirement age, again indicating that this alternative specification may better capture retirement peaks. In light of these results, and the seemingly minimal likelihood of obtaining a negative propensity, this study will not constrain the simple linear regression model used in the infant migration estimation procedure. Although the simple unconstrained model did yield a few negative results, alterations in the model, such as dividing migration flows into families and using confidence intervals to eliminate outlying observations, are expected to eliminate most, if not all, of these. However, in the case of persisting negative propensities, the regression applied to the specific age cohort(s) within the specific inter-regional flow exhibiting negative results will be constrained. This methodology allows us to use the most accurate regression technique the vast majority of the time, while ensuring non-negativity in the very rare cases when unconstrained regression yields a negative propensity.

4.2 Estimating Inter-regional Migration Using Observed Data: 1960-2000

Tables 5 and 6 about here

The first test of the infant migration model is to estimate census period age-specific migration propensities using observed data from that period. This tests the ability of the model to replicate existing data.
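The estimation step with its non-negativity safeguard might be sketched as follows for a single age group. This is one possible reading of the rule described in Section 4.1, not the study's code: the function names and data are hypothetical, and the sketch assumes that only the individual estimates that turn out negative are replaced by the through-origin value.

```python
def fit_unconstrained(xs, ys):
    """Least squares fit with an intercept term; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def fit_through_origin(xs, ys):
    """Least squares fit forced through the origin; returns the slope only."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def estimate_age_group(infant, observed):
    """Estimate one age group's propensity for every flow.

    infant[k] and observed[k] are the infant propensity and the observed
    propensity of this age group for flow k.
    ASSUMPTION: the constrained fit replaces only the negative estimates.
    """
    intercept, slope = fit_unconstrained(infant, observed)
    slope_origin = fit_through_origin(infant, observed)
    estimates = []
    for x in infant:
        value = intercept + slope * x
        if value < 0:
            value = slope_origin * x  # fall back to the constrained regression
        estimates.append(value)
    return estimates


# Hypothetical propensities for four flows.
infant = [0.01, 0.02, 0.05, 0.10]
observed = [0.002, 0.004, 0.06, 0.14]
estimates = estimate_age_group(infant, observed)
```

In this example the unconstrained line has a negative intercept, so the flow with the smallest infant propensity would receive a negative estimate; the fallback replaces it with the value from the regression through the origin and leaves the other three estimates unchanged.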
To measure the strength of the relationship between infant migration propensity and migration propensity among other age cohorts, we can look at the R² values resulting from the simple linear regression model. We expect R² values to be higher among the youngest age cohorts, and among young adults, as infant migration takes place with parents, and often with siblings. We would expect the relationship to weaken among the older age groups. Tables 5 and 6 present these values for the 17 age cohorts, across all five censuses, at the four-region and nine-division scales respectively. Immediately, we note the high R² values in the younger age groups in comparison to those in the older age groups, at both spatial scales. With a few exceptions, R² values are similar at our two spatial scales, but are noticeably higher in the later three censuses for the four-region data.

Tables 7 and 8 about here

Table 7 presents the MAPEs associated with inter-regional flows from all five periods (1960-2000) at the four-region scale. The highest errors are associated with the West-to-Northeast and South-to-Midwest flows (all of which are over-estimated), as well as the Midwest-to-Northeast flow (largely due to low propensities). The lowest errors occur in the Midwest-to-South and West-to-Midwest flows, which is curious given the high propensity present in the former and the low propensity in the latter. The MAPE statistics for all inter-regional flows range from a high of 42.10% for the 1965-1970 period to a low of 24.84% for the 1995-2000 period. The model predicts total inter-regional migration quite accurately; however, the MAPE statistics indicate that a larger degree of error exists in specific regional flows.

MAPE values for the nine-division model are presented in Table 8. The MAPE statistics for all inter-regional flows are higher than those from the four-region data for every census period, reflecting the likelihood that a larger number of flows leads to a larger amount of variability. The range of individual values is quite wide, with 14 flows estimated with MAPE statistics of over 100%, and 9 flows estimated with MAPE statistics under 10%. Again the model predicts total inter-regional migration quite accurately, but with a wide variety of errors exhibited by specific regional flows.
Table 9 about here

Because the MAPE table for the nine-division model is quite large (5 x 72 = 360 values), it is helpful to look at aggregate MAPE values for flows into and out of the nine divisions (Table 9).

Interpretation of the aggregate data allows for a clearer picture of the model's performance at the regional scale. For instance, note that the highest MAPE values are associated with flows into the Northeast, Mid-Atlantic, and East North Central regions, as well as out of the South Atlantic region. All of these flows are over-estimated by the model, which is to be expected given the general migration trend of movements out of the so-called snow-belt to the sun-belt. Flows involving the Mountain and Pacific regions were generally the most accurately estimated by the simple model across the five periods.

Figures 12a and 12b about here

As discussed in Section Two, three specific flows consistently exhibit a retirement peak in all five periods at the four-region scale. In all cases, the retirement peak was under-estimated by the model, dramatically so in the case of the flows into the South region (see Figures 12a and 12b). In later census periods, the ability of the model to capture the retirement peak improved, likely a result of larger slope coefficients (see Figure 8a). However, the effect the observed retirement peaks exerted on the regression parameters creates a small estimated retirement peak in the other nine four-region flows, where no observed retirement peak existed.

In the nine-division model, six flows consistently exhibited retirement peaks. Again, the retirement peak was under-estimated in all cases by the model. Unlike the four-region results, however, during no period did the nine-division model begin to pick up a significant retirement peak, even among flows where observed retirement peaks were quite large (e.g., Mid-Atlantic to South Atlantic; see Figures 13a and 13b). This result is likely due to the large number of non-retirement peak flows included in the regression at this spatial scale.
We noted earlier (Figure 9a) that the slope coefficients resulting from the simple model showed no increase among age cohorts where a retirement peak would exist. This is further evidence that the model must be adjusted to separately account for flows that do exhibit retirement peaks.

Figures 13a and 13b about here

Tables 10 and 11 about here

Correspondingly, regional flows that exhibited retirement peaks were consistently under-estimated by the model (see Tables 10 and 11). In most cases, the bulk of the under-estimation occurs in the older age groups. However, all of these flows exhibit fairly high migration levels, and can be considered more popular migratory routes than most of the other flows. Therefore, it is likely that the under-estimation is not solely the result of an under-estimated retirement peak. It is not surprising that flows out of the South are over-estimated in most cases, particularly in the more recent time periods. The South Region is the most popular destination for United States migrants. Because the model makes estimates based upon regression parameters derived from all inter-regional flows, the influence of the less popular routes (e.g., West-to-Northeast) leads to the under-estimation of the more popular routes.

4.3 Dividing Migration Flows into Families, 1960-2000

The first extension to the simple model involves dividing the observed migration flows into families on the basis of the presence or absence of a retirement peak. The simple linear regression was then applied to each family separately, yielding altered regression parameters with which estimations were carried out. Because only three flows were included in the retirement peak family in the four-region model (Northeast-to-South, Midwest-to-South, and Midwest-to-West), observed flows for all five periods were combined into one regression model, so that there would be an adequate number of observations. The nine remaining flows were analyzed both in the aggregate and separately by census period.
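The family extension amounts to fitting one regression per family instead of a single pooled regression. A minimal sketch for one age group follows; the tuple layout, the function names and the numbers are assumptions made for illustration, and how a flow is assigned to a family is taken as given.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def fit_by_family(observations):
    """Fit a separate regression for each family of flows.

    observations is an iterable of (has_retirement_peak, infant_propensity,
    propensity) tuples for one age group. Observations from several census
    periods can simply be pooled in the same iterable.
    """
    grouped = {}
    for has_peak, x, y in observations:
        grouped.setdefault(has_peak, []).append((x, y))
    return {
        has_peak: fit_line([x for x, _ in pts], [y for _, y in pts])
        for has_peak, pts in grouped.items()
    }


# Hypothetical observations for an older age group.
observations = [
    (True, 0.02, 0.05), (True, 0.04, 0.09), (True, 0.06, 0.13),
    (False, 0.02, 0.01), (False, 0.04, 0.02),
    (False, 0.06, 0.03), (False, 0.08, 0.04),
]
parameters = fit_by_family(observations)
pooled = fit_line([x for _, x, _ in observations], [y for _, _, y in observations])
```

In this constructed example the retirement peak family has a much steeper slope than the other family, and the pooled line sits between the two.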
For the nine-division model, flows with observed retirement peaks for the five periods also were initially grouped into one regression model; however, due to the larger number of flows with retirement peaks (in comparison to the four-region model), we were able to further disaggregate these flows by destination after it became apparent that flows into the South Atlantic and Mountain regions exhibited obvious differences in patterns.⁶

Tables 12 and 13 about here

The R² values (Table 12) associated with the four-region regression model improve, ranging from 0.45 to 0.98, compared to a range of 0.13 to 0.97 for the simple model without a family disaggregation. The R² values from the nine-division regression model improve even more, ranging from 0.67 to 0.97, up from a range of 0.32 to 0.97. The MAPE values (Tables 14 and 15) also improve significantly, ranging for the four-region model from a high of 23.69% for the 1965-1970 data to a low of 15.79% for the 1955-1960 period. Propensities for the West-to-Northeast flow (over-estimated) still exhibit the highest errors, whereas the West-to-South flows exhibit the most accurately estimated propensities. The MAPE values for total inter-regional flows for the nine-division model are still higher than those from the four-region model, ranging from a high of 41.99% for 1970 to a low of 27.79% for 1990. Although flows out of the New England, Mid-Atlantic, and East North Central regions still exhibit the highest errors, there is significant improvement over the results from the simple model. Furthermore, flows into the South Atlantic region are the most accurately estimated, indicating improvement in the model's ability to estimate retirement peaks because of the family classification. In conjunction with lower MAPE values, we also note the improved estimation of migration levels in the previously over- and under-predicted flows (Table 16).

Tables 14, 15, and 16 about here

Figures 14a and 14b contain observed and estimated age-profiles for the Northeast-to-South (four-region model) and the Mid-Atlantic to South Atlantic (nine-division model) flows, respectively.
Recall that estimates of these profiles from the simple model failed to capture the retirement peak evident in both observed profiles. Estimations obtained after classifying the flows clearly capture the retirement peak at both spatial scales.⁷ Furthermore, note the dramatic improvement in MAPE values, R² values, and GMR error (Table 17). The family classification enhancement clearly improves the ability of the model to accurately estimate age-profiles.

⁶ The three flows into the South Atlantic Region (from the NE, MA, and ENC regions) generally exhibit much higher GMRs and a higher retirement peak in relation to the rest of the age-profile than do the three flows into the Mountain region (from the MA, ENC, and PAC regions).

⁷ Figure 14b contains two estimated age-profiles, the first derived from a regression model that included all flows exhibiting retirement peaks, the second only flows with retirement peaks going into the South Atlantic region.

Figures 14a and 14b about here

Table 17 about here

4.4 Applying Cubic Spline and Model Schedule Fits to the Data, 1960-2000

We have seen that a clear improvement in estimation results occurs when flows are divided into families. The next extension of the model addresses the existence of outlying or unusual observations in the data. Such observations can result from age misreporting or errors in data compilation, or may simply be accurately observed anomalies. The presence of these data points skews the regression parameters derived by the model and decreases the accuracy of estimation. By applying the cubic spline to observed migration propensities, then fitting the seven or eleven parameter model schedule to the splined data (based upon the absence or presence of a retirement peak⁸), we can smooth out deviations or anomalies that reduce the model's accuracy (Tables 18 and 19).

Tables 18 and 19 about here

Using this extension further improves our results. R² values for the four-region regression model improve to a range of 0.65 to 0.99, and for the nine-division model to a range of 0.70 to 0.97. Total MAPE values (Tables 20 and 21) now range from a high of 21.49% for the 1965-1970 period to a low of 15.21% for the 1955-1960 period for the four-region model, and from a high of 34.73% for the 1965-1970 period to a low of 24.53% for the 1975-1980 period for the nine-division model. MAPE values for all five periods improved over the previous set of estimations. MAPE values for individual flows also fell slightly, with the highest errors in the four-region model still occurring in the West-to-Northeast flow (over-estimated, but with large MAPE values due to very low propensities), and the lowest in the West-to-South flow (under 10% on average). At the nine-division aggregation, propensities for flows into New England, the Mid-Atlantic, and the East North Central region are all still over-estimated, but with significant improvement noted for flows into the Mid-Atlantic region. Also note the rising error associated with propensities for flows out of the East South Central region. These propensities are over-estimated as well, perhaps indicating the growing popularity of this region of the country. The lowest errors continue to be associated with flows into the South Atlantic, and with flows into and out of the Mountain and Pacific regions. These results indicate continued improvement in the estimation of both the level and the profile of migration, in this case due to the smoothing of irregular observed data points.

Tables 20 and 21 about here

Figures 15a and 15b again illustrate the model's ability to capture the retirement peak at both spatial scales. Associated goodness-of-fit measures (Table 22) also indicate continued improvement in the performance of the model.

Figures 15a and 15b about here

Table 22 about here

⁸ Due to the dramatic improvement resulting from family classification, we apply further enhancements of the model to data that are organized by retirement peak family.

4.5 Applying Confidence Intervals to the Data, 1960-2000

A second extension that addresses irregular data points is the identification and removal of outlying observations using confidence intervals. This study uses both 80% and 90% confidence intervals to identify and remove outlying observations, and then reapplies the simple linear regression to the remaining data points (which are still divided into two families). We contrast this approach to

addressing irregular, incomplete, or incorrect data with the approach used in the prior section (Tables 23 and 24).

Tables 23 and 24 about here

The results from this extension also are promising. The R² values associated with the four-region regression model range from 0.70 to 0.99 (a slight improvement over the ones from the fitted data) when using a 90% confidence interval, and increase to a range of 0.75 to 0.99 when an 80% confidence interval is applied. Similar improvements are noted at the nine-division scale, with a range of 0.77 to 0.99 resulting from the use of 90% confidence intervals, and 0.80 to 0.99 from the application of 80% confidence intervals. Significant improvements in R² values are exhibited by the older age cohorts which, not surprisingly, are where the most outlying observations are to be found (Tables 25 and 26).

Tables 25 and 26 about here

At the four-region scale, MAPE values are slightly improved over the previous procedure that smoothed the data, especially when using an 80% confidence interval to remove outliers, ranging from 14.85% (1965-1970) to 12.04% (1985-1990). The highest errors are associated with the Northeast-to-West flow, and the lowest with the Northeast-to-South and Midwest-to-South flows. These results indicate both that the methodology continues to capture the retirement peak in flows where one exists, and that long-distance flows exhibiting lower propensities (Northeast-to-West, West-to-Northeast) are more difficult to estimate. Very encouraging is the accuracy with which flows with retirement peaks were estimated (see Figure 16a).

At the nine-division scale, MAPE values improve significantly over the data smoothing methodology, ranging from a high of 21.38% for the 1970 period to a low of 15.67% for the 2000 period when an 80% confidence interval is applied to the data. Using the 80% confidence interval as opposed to the 90% interval at the nine-division scale resulted in only minimal improvement, the largest of which occurs in the 1970 period (23.01% and 21.38% for 90% and 80%, respectively). Dramatic improvement is noted for individual flows previously exhibiting very high MAPE values (particularly flows into the East North Central region). Propensities for flows into and out of the New England region continue to exhibit the highest errors (despite improvement compared to previous model enhancements), and flows into the South Atlantic region, and into and out of the Pacific and Mountain regions, the lowest. Note the extreme accuracy with which the Mid-Atlantic to South Atlantic flow is estimated (Figure 16b, Table 27).

Figures 16a and 16b about here

Table 27 about here

5 PREDICTION

Having applied several modifications to our model to improve its ability to estimate migration propensities, we now attempt to assess the predictive power of the model. Infant migration propensities derived from a current period were used with regression parameters derived from a previous period (e.g., 10 years earlier) to predict the migration schedule for all ages during the current period. These assessments test the consistency, over time, of the correlations between infant migration propensities and those of all other ages. Such correlations are vital to the success of the model. It is assumed that a change in a particular infant migration propensity is felt in the migration probabilities of all other age cohorts. All of the extensions discussed in the earlier sections have been applied in an attempt to improve the predictive power of the regression equations.

5.1 Predicting Migration Flows Using Current Period Infant Migration Propensities and Past Period Regression Parameters: 1970-2000

There is no new linear regression involved in generating predictions. Instead, we simply use regression parameters (slope and intercept values) from a past period and an observed infant propensity from a current period in an attempt to project the entire age-schedule of migration for that current period. Table 28 examines the MAPE values associated with predicted migration propensities for the periods 1970-2000 at the four-region scale. With the exception of the 1970 period, the MAPE values are surprisingly consistent with those observed in the estimated data.⁹ High errors are still associated with flows out of the West Region into the Northeast, as well as with flows from the South to the Northeast (all over-predicted). The lowest errors are associated with flows from the Midwest to the South, from the West to the Midwest, and from the Northeast to the West. Results for the nine-division model (Table 29) are also very similar to those observed in the estimated data, again with the exception of the 1970 period. MAPE values for total inter-regional flows are lower for the four-region data, again reflecting a larger degree of variability when a larger number of flows are included in the model.

⁹ Past studies generally reported increases in MAPE values when the model was used to predict migration propensities, when compared to the corresponding values for estimating propensities.

Tables 28 and 29 about here

Figures 17a and 17b provide a comparison of the predictions of two high-level flows with retirement peaks at the four-region and nine-division scales. Both flows are predicted for 2000 using the 1990 regression parameters and the 2000 infant migration propensities. Table 30 summarizes key statistics associated with these flows. Note that the 2000 Northeast-to-South flow from the four-region model is actually predicted more accurately using 1990 data (MAPE = 15.25%) than it was estimated using 2000 data (MAPE = 20.56%). Also note that, although under-predicted, a retirement peak was picked up by the model. In comparison, the 2000 Mid-Atlantic to South Atlantic flow from the nine-division model is dramatically under-predicted, and the model anticipates no retirement peak.

Figures 17a and 17b about here

Table 30 about here
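The prediction step of Section 5.1 involves no fitting at all, which a few lines make concrete. The parameter values below are invented for illustration and the function name is hypothetical; the sketch only shows how past-period intercepts and slopes are combined with a current-period infant propensity.

```python
def predict_schedule(past_parameters, current_infant_propensity):
    """Project one flow's age schedule for the current period.

    past_parameters holds one (intercept, slope) pair per age group, taken
    from regressions fitted to an earlier census period.
    """
    return [a + b * current_infant_propensity for a, b in past_parameters]


# Hypothetical 1990 parameters for three age groups and a 2000 infant propensity.
parameters_1990 = [(0.001, 0.9), (0.002, 0.6), (0.0005, 0.3)]
schedule_2000 = predict_schedule(parameters_1990, 0.05)
```

The quality of such a projection depends entirely on how stable the regression parameters are from one census period to the next, which is what the MAPE comparisons in Tables 28 and 29 assess.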