HW 3, Soc 388

Due Wednesday, Oct 24, in class

Late homeworks will generally not be accepted, because I will post answers to my website soon after the homework is due.  If you're stuck, email me or the TA.  If you still can't figure it out, just do the best you can and don't panic.

NOTE: All homeworks should include an edited STATA log.

Previous Reading Assignments: Hout Chapters 1-4; Agresti Ch 1-2, 6.

New Reading Assignment:  Agresti, Ch 3

Once again (since it isn't defined in either text):  BIC = LRT - df*ln(N), where LRT is the goodness-of-fit chi-square, df is the residual degrees of freedom, and N is the sample size of the whole dataset.  The syllabus contains references that define BIC (Raftery 1986) and critique it (Weakliem 1999).  Lower BIC indicates better fit, and BIC < 0 indicates a model that is preferred to the saturated model.
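If you want to check a BIC value by hand, here is a minimal Python sketch of the formula above (Python rather than Stata; the function name `bic` is just an illustration, not something from the course materials). The inputs are the LRT and residual df from the homework table and N = 649,821 couples.

```python
import math

N_COUPLES = 649_821  # sample size of the whole dataset


def bic(lrt, resid_df, n=N_COUPLES):
    """BIC = LRT - df*ln(N), as defined in the handout."""
    return lrt - resid_df * math.log(n)


# Model 1 (constant only): LRT = 4,503,895 on 224 residual df.
print(round(bic(4_503_895, 224)))   # about 4500897

# Model 9b: LRT = 536.5 on 97 residual df; negative, so preferred to saturated.
print(round(bic(536.5, 97), 1))     # about -761.8
```

Because ln(649,821) is about 13.38, each residual degree of freedom is "worth" roughly 13.4 points of chi-square under this criterion.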

Important ideas:  Goodness of fit measures, hypothesis testing, inference across many dimensions, different kinds of controls.

The data are available from my website, as well as from my public folder via ftp (/afs/ir/users/m/r/mrosenfe/public), under the name "70-80-90 MR intermar.dta" (Stata ver 6) or "70-80-90 MR intermar.xls" if you'd rather start with the Excel file and copy it into Stata.

The data have 225 cells and 6 variables.  There are 649,821 couples in the dataset (it's intermarriage data, surprise surprise).  The data consist of married people age 20-29 at the time of the census.  The variables are meth (husband's ethnicity) and feth (wife's ethnicity), with the same 5 categories we have seen before (non Hispanic Black, non Hispanic White, Mexican, Other Hispanic, non Hispanic Other).  There is a variable for census year (70, 80, and 90), and there is a variable for the nativity of each spouse (born in the US vs foreign born).  The dataset includes 3 of the possible 4 combinations of nativity; couples in which both spouses are foreign born are excluded.  The number of cells = 5*5*3*3 = 225.
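To see where the 225 cells come from, here is a small Python sketch that enumerates them. The category labels and the "US"/"FB" codes are placeholders I made up for illustration; the actual codes in the .dta file may differ.

```python
from itertools import product

# Hypothetical labels for the 5 ethnic categories (same list for husbands and wives).
ETHNICITIES = ["Black-NH", "White-NH", "Mexican", "Other Hispanic", "Other-NH"]
YEARS = [70, 80, 90]

# (husband nativity, wife nativity); the both-foreign-born pair is excluded.
NATIVITY_PAIRS = [("US", "US"), ("US", "FB"), ("FB", "US")]

# One cell per combination of meth x feth x year x nativity pair.
cells = list(product(ETHNICITIES, ETHNICITIES, YEARS, NATIVITY_PAIRS))
print(len(cells))  # 225
```

Note that the last factor of 3 is the number of allowed nativity pairs, not a 3-category variable: each spouse's nativity has only 2 categories.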

In the following table, BW is the gender-symmetric Black-White interaction; MOh is the gender-symmetric Mexican-Other Hispanic interaction; ethintdm is the dummy variable that treats all 5 kinds of ethnic endogamy the same; and ethintct is the categorical variable that treats each of the 5 kinds of ethnic endogamy differently.

Fill in the following Table

| Model # | Model Description | Terms in model | Residual df | Goodness of fit Chi-square | Goodness of fit Chi-square P | BIC | ID |
|---|---|---|---|---|---|---|---|
| 1 | Constant only | 1 | 224 | 4,503,895 | 0 | 4500897 | 86.2 |
| 2 | year*meth year*feth | 27 | 198 | 1,579,790 | 0 | 1577140 | 66.4 |
| 3 | year*meth*mgen year*feth*fgen | 57 | 168 | 453,658 | 0 | 451409.4 | 19.7 |
| 4 | year*meth*mgen year*feth*fgen BW, MOh | 59 | 166 | 200,027 | 0 | 197805.2 | 8.5 |
| 5 | year*meth*mgen year*feth*fgen ethintdm | 58 | 167 | 26,839 | 0 | 24603.8 | 2.94 |
| 6 | year*meth*mgen year*feth*fgen ethintct | 62 | 163 | 5,070 | 0 | 2888.3 | 1.02 |
| 7a | year*meth*mgen year*feth*fgen ethintct*@year | 67 | 158 | 4,069 | 0 | 1954.3 | 0.789 |
| 7b | year*meth*mgen year*feth*fgen ethintct*year | 72 | 153 | 3,882 | 0 | 1834.2 | 0.744 |
| 8 | year*meth*mgen year*feth*fgen ethintct*year BW MOh | 74 | 151 | 3,203 | 0 | 1181.9 | 0.687 |
| | My better fitting models: | | | | | | |
| 9a | year*meth*mgen year*feth*fgen ethintct*year BW MOh meth*fgen feth*mgen | 82 | 143 | 2,053 | 0 | 139 | 0.466 |
| 9b | year*meth*mgen*fgen year*feth*fgen*mgen ethintct*year*fgen*mgen BW MOh | 128 | 97 | 536.5 | 0 | -761.8 | 0.133 |
| 9c | year*meth*mgen*fgen year*feth*fgen*mgen ethintct*year*fgen*mgen QS*year, QS*mgen*fgen, BohS*fgen, BWS*year | 156 | 69 | 107.3 | 0.0022 | -816.2 | 0.0514 |

Note: QS here is the full set of 5 off-diagonal, symmetric ethnic interactions (including BW and MOh, see log), BohS is the sex-specific interaction between Black men and Other Hispanic women, and BWS is the sex-specific interaction between Black men and White women.

1) Fill in the above table, models 1-8

2) Does racial endogamy vary significantly between groups?  What is the statistical test that answers that question?

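Yes; Model 6 improves dramatically on the fit of Model 5.  The test is the likelihood ratio test comparing the two nested models: replacing ethintdm with ethintct lowers the goodness-of-fit chi-square from 26,839 to 5,070, a drop of 21,769 on 4 degrees of freedom.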
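The nested-model comparison can be sketched in a few lines of Python. This is only an illustration: the function name `compare_nested` is mine, and to stay within the standard library it checks the change in chi-square against a short hard-coded list of critical values at the 0.001 level rather than computing an exact p-value.

```python
# Upper-tail chi-square critical values at alpha = 0.001, keyed by df.
# Only the df values needed for this homework are listed.
CRIT_001 = {4: 18.467, 5: 20.515, 10: 29.588}


def compare_nested(lrt_small, df_small, lrt_big, df_big):
    """Likelihood ratio test of a smaller model against a larger one that nests it.

    Returns (change in LRT, change in df, significant at the 0.001 level?).
    Raises KeyError if the change in df is not in CRIT_001.
    """
    delta_lrt = lrt_small - lrt_big
    delta_df = df_small - df_big
    return delta_lrt, delta_df, delta_lrt > CRIT_001[delta_df]


# Model 5 (ethintdm) vs Model 6 (ethintct), values from the table.
print(compare_nested(26_839, 167, 5_070, 163))  # (21769, 4, True)
```

A drop of 21,769 against a critical value of about 18.5 is not a close call, which is why the answer can simply say "dramatically."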

3) Does racial endogamy vary significantly over time?  More so for some groups than for others?

Yes; Model 7b improves quite a lot on Model 6 (an improvement of more than 1100 on 10 degrees of freedom).  Black endogamy declines the most (but was the largest to start with), from a log odds ratio of 7.73 in 1970 to 7.73-1.39=6.34 in 1990.  Between 1970 and 1980, 'Oth-NH', a category that includes mostly Asians and Native Americans, declines the most in log odds ratio terms, from 3.186 to 3.186-1.029=2.157.  You could chart the racial endogamy of all 5 groups over time, 70-80-90, and show how all kinds of racial endogamy (here measured jointly with a bunch of controls) decline sharply over time.
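Log odds ratios are easier to interpret once exponentiated. The sketch below (plain Python, variable names are my own) converts the coefficients quoted above into odds ratios and shows that a change of -1.39 in the log odds ratio means the odds ratio shrinks by a factor of exp(1.39), roughly 4.

```python
import math

# Black endogamy, log odds ratios quoted in the answer above.
black_1970 = 7.73
black_1990 = 7.73 - 1.39            # 6.34

# Other non-Hispanic endogamy, 1970 and 1980.
oth_1970 = 3.186
oth_1980 = 3.186 - 1.029            # 2.157

# Factor by which each odds ratio shrinks between the two years.
black_shrink = math.exp(black_1970) / math.exp(black_1990)  # same as exp(1.39)
oth_shrink = math.exp(oth_1970) / math.exp(oth_1980)        # same as exp(1.029)

print(round(black_1990, 2), round(oth_1980, 3))
print(round(black_shrink, 2), round(oth_shrink, 2))
```

Even after a fourfold decline, an odds ratio of exp(6.34) is still in the hundreds, so Black endogamy remains very strong in 1990.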

4) Does US nativity affect racial endogamy? Describe the model(s), and the results you need to answer this question.

The comparison of models 9a and 9b demonstrates a very significant effect of U.S. nativity on racial endogamy, but models 9a-c have a lot of terms in them and that makes interpretation messy. A simple approach would be to take model 7b, and add ethintct*mgen and ethintct*fgen, and look at the interaction terms. In fact what one sees is that US nativity for either or both spouses increases 'other non Hispanic' (mostly Asian) endogamy, white endogamy, and black endogamy quite substantially.

5) Based on models 1-8, which would you say is a more powerful force in the marriage market- racial endogamy or the division between Blacks and Whites?  Why?

Racial endogamy is stronger than the Black-White divide.  In Models 4 and 8, the racial endogamy terms are generally much larger in absolute value (representing stronger changes in the log odds ratio of marriage) than the Black-White interaction.  Furthermore, the racial or ethnic endogamy terms contribute more to the goodness of fit (compare Model 6 to Model 4) than the Black-White term (which is important in its own right, but not quite as important).

6) Which of the models 1-8 fits the best by LRT and by BIC?  Do any of them fit reasonably well?

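Model 8 is the best fitting by both LRT (3,203 on 151 df) and BIC (1181.9), but it's not nearly good enough: the BIC is still well above zero, so the saturated model is still preferred.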

7) What is the difference between treating year as a continuous vs categorical variable in interactions with ethnic endogamy? How do models 7a and 7b differ? How do you interpret this difference?

Since year takes on 3 values in the dataset, ethintct*year adds 5x2=10 terms to the model compared to the base values of ethintct in model 6, i.e. the change from 1970 to 1980 and the change from 1970 to 1990. If we treat year as a continuous variable, ethintct*@year adds 5 terms to the model, because the change over time for each ethintct term is assumed to be linear with time. So there is a 5 df difference between 7a and 7b, depending on whether we assume that ethnic endogamy changes in a linear way over time, or whether we assume the decline in ethnic endogamy over time is non-linear enough that each year needs its own term. Since model 7b fits substantially better than model 7a (a difference in -2LL of almost 200 on 5 df, and model 7b has lower BIC), this tells us that the decline in ethnic endogamy over time is not quite linear.
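The bookkeeping in this answer can be checked with a short Python sketch (variable names are mine; the LRT, df, and term counts are taken from the table). It also uses the fact that, for two nested models, the difference in BIC is just the change in LRT minus the change in df times ln(N).

```python
import math

N_COUPLES = 649_821

TERMS_MODEL_6 = 62      # model 6 already contains the 5 ethintct terms
N_ENDOGAMY_TERMS = 5
N_YEARS = 3

# Categorical year: one extra term per endogamy type per non-reference year.
terms_7b = TERMS_MODEL_6 + N_ENDOGAMY_TERMS * (N_YEARS - 1)
# Linear (continuous) year: one slope per endogamy type.
terms_7a = TERMS_MODEL_6 + N_ENDOGAMY_TERMS * 1

delta_lrt = 4_069 - 3_882          # LRT of 7a minus LRT of 7b
delta_df = 158 - 153               # residual df of 7a minus residual df of 7b

# BIC(7a) - BIC(7b); positive means 7b is preferred by BIC.
delta_bic = delta_lrt - delta_df * math.log(N_COUPLES)

print(terms_7a, terms_7b, delta_lrt, delta_df, round(delta_bic, 1))
```

The five extra terms cost about 67 points of BIC but buy 187 points of chi-square, so the categorical specification wins on both criteria.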

8) Construct a model that fits better (by BIC or LRT) than any of the models 1-8. What have you added to the previous models?

Models 9a-9c fit substantially better than models 1-8. Models 9b and 9c fit well by the BIC, and model 9c approaches a good fit by the LRT, which is not easy to obtain in a large dataset like this. Models 9b and 9c push the data to its limits, so the 'difficult' option speeds up the likelihood maximization considerably. We only start to make real progress in fitting the data when we add the 4-way interactions meth*mgen*fgen*year and feth*mgen*fgen*year and the partial 5-way interaction ethintct*mgen*fgen*year (partial 5-way because ethintct accounts for part of the saturated interaction meth*feth).

9) Now here are some more abstract questions about a hypothetical dataset with 3 variables: A (5 categories) B(4 Categories) and C (3 categories).  Total number of cells is 5*4*3=60.  Fill in the following table.

| Model # | Model Description | Terms in model | Residual df |
|---|---|---|---|
| 1 | A (constant plus 4 terms to fit the 5 categories of A; there's always one excluded category) | 5 | 55 |
| 2 | A,B (add in the 3 terms to fit the 4 categories of B) | 8 | 52 |
| 3 | A*B (Saturated interaction takes the full 5*4 terms to fit the 20 cells of A*B) | 20 | 40 |
| 4 | A*B,C (Adds to model 3 the 2 terms to fit the 3 categories of C) | 22 | 38 |
| 5 | A*B, B*C, A*C (There are a couple of ways of thinking about this one.  This model has the constant, plus the direct effects of A, B, and C (4+3+2), plus the saturated interactions A*B (4*3 terms), B*C (3*2 terms), and A*C (4*2 terms), for a total of 1+9+12+6+8=36 terms) | 36 | 24 |
| 6 | A*B*C (Saturated model has one term for every cell in the dataset) | 60 | 0 |
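The counting rule behind this table can be written as a short Python sketch. The function name `count_terms` and the "A*B" string syntax are my own illustration. It assumes a hierarchical model on a complete table: every interaction implies all of its lower-order terms, and each effect contributes the product of (categories - 1) over the variables involved.

```python
from itertools import combinations
from math import prod


def count_terms(generators, levels):
    """Count parameters in a hierarchical log-linear model.

    generators: highest-order terms, e.g. ["A*B", "C"].
    levels: number of categories per variable, e.g. {"A": 5, "B": 4, "C": 3}.
    """
    effects = {frozenset()}  # the empty set stands for the constant
    for gen in generators:
        variables = gen.split("*")
        # Hierarchy: include every lower-order subset of the interaction.
        for size in range(1, len(variables) + 1):
            for subset in combinations(variables, size):
                effects.add(frozenset(subset))
    # prod over an empty set is 1, which counts the constant once.
    return sum(prod(levels[v] - 1 for v in effect) for effect in effects)


LEVELS = {"A": 5, "B": 4, "C": 3}
N_CELLS = 5 * 4 * 3

for model in (["A"], ["A", "B"], ["A*B"], ["A*B", "C"],
              ["A*B", "B*C", "A*C"], ["A*B*C"]):
    terms = count_terms(model, LEVELS)
    print(model, terms, N_CELLS - terms)  # terms, residual df
```

As a check on my reading of the real data, the same rule gives 27 terms for year*meth year*feth and 57 for year*meth*mgen year*feth*fgen when mgen and fgen are treated as two-category variables, matching models 2 and 3 in the first table. It would not apply directly to models that interact mgen with fgen, since the both-foreign-born combination is missing from the data.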