---------------------------------------------------------------------------------------------------------
log: C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\fifth_class_log.log
log type: text
opened on: 9 Oct 2007, 11:03:30
. set linesize 75
. use "C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\ed_intermar.dta", clear
. *I am going to press forward a bit with model fitting to the educational intermarriage 4x4 dataset.
. *I will refer to models by their numbers in my comprehensive Excel file, which has most of the summary statistics from each model.
. *just for review, the independence model (husband's and wife's education margins only):
. desmat: poisson count hed wed
-------------------------------------------------------------------------------------------------------
Poisson regression
-------------------------------------------------------------------------------------------------------
Dependent variable count
Optimization: ml
Number of observations: 16
Initial log likelihood: -221501.223
Log likelihood: -113882.425
LR chi square: 215237.595
Model degrees of freedom: 6
Pseudo R-squared: 0.486
Prob: 0.000
-------------------------------------------------------------------------------------------------------
nr Effect Coeff s.e.
-------------------------------------------------------------------------------------------------------
count
hed
1 HS 1.072** 0.004
2 Some Col 0.595** 0.005
3 BA+ 0.235** 0.005
wed
4 HS 1.229** 0.004
5 Some Col 0.733** 0.005
6 BA+ 0.142** 0.005
7 _cons 9.187** 0.005
-------------------------------------------------------------------------------------------------------
* p < .05
** p < .01
. poisgof
Goodness-of-fit chi2 = 227578.9
Prob > chi2(9) = 0.0000
. *skipping ahead to M4, full endogamy:
. desmat: poisson count hed wed ed_endog_full
-------------------------------------------------------------------------------------------------------
Poisson regression
-------------------------------------------------------------------------------------------------------
Dependent variable count
Optimization: ml
Number of observations: 16
Initial log likelihood: -221501.223
Log likelihood: -24059.274
LR chi square: 394883.898
Model degrees of freedom: 10
Pseudo R-squared: 0.891
Prob: 0.000
-------------------------------------------------------------------------------------------------------
nr Effect Coeff s.e.
-------------------------------------------------------------------------------------------------------
count
hed
1 HS 1.134** 0.007
2 Some Col 0.819** 0.006
3 BA+ -0.017* 0.007
wed
4 HS 1.372** 0.007
5 Some Col 1.020** 0.007
6 BA+ -0.278** 0.008
ed_endog_full
7 1 1.722** 0.009
8 2 0.676** 0.007
9 3 0.537** 0.008
10 4 2.487** 0.009
11 _cons 8.652** 0.008
-------------------------------------------------------------------------------------------------------
* p < .05
** p < .01
. poisgof
Goodness-of-fit chi2 = 47932.55
Prob > chi2(5) = 0.0000
. *One way to ask whether educational endogamy really matters is to ask whether this model, with 4 terms for the endogamy diagonal, fits MUCH better than the independence model.
. *That gives us a chi-square test with 4 df (9 residual df for independence minus 5 for M4) on the difference in goodness of fit between the two models.
. display chi2tail(4,(227500-47000))
0
. *OK, so the P value for this comparison is zero (to the precision Stata displays). What does that mean substantively?
. *A P value of zero in that last test means we can reject the null hypothesis that the independence model fits as well as the full endogamy model, M4. On the other hand, the P=0 from poisgof means that M4, which we just ran, still has a long way to go to fit the data well.
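. *A minimal sketch of the same test with the unrounded chi-squares, assuming the two poisgof values are copied by hand from the output above:

```stata
* Drop in goodness-of-fit chi-square from independence to M4
display 227578.9 - 47932.55
* Difference in residual df (9 for independence, 5 for M4)
display 9 - 5
* Upper-tail P value for that drop on 4 df
display chi2tail(4, 227578.9 - 47932.55)
```

. *The exact difference is 179646.35 rather than the rounded 180500 used above, but a chi-square that large on 4 df leads to the same conclusion either way.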
. *Let's add a few things, as we did last class, and then push it further.
. table hed wed, contents (mean ed_endog_full)
--------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+
----------+---------------------------------------
<HS | 1 0 0 0
HS | 0 2 0 0
Some Col | 0 0 3 0
BA+ | 0 0 0 4
--------------------------------------------------
. table hed wed, contents (mean ed_diff_3)
--------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+
----------+---------------------------------------
<HS | 0 0 0 1
HS | 0 0 0 0
Some Col | 0 0 0 0
BA+ | 1 0 0 0
--------------------------------------------------
. desmat: poisson count hed wed ed_endog_full ed_diff_3
-------------------------------------------------------------------------------------------------------
Poisson regression
-------------------------------------------------------------------------------------------------------
Dependent variable count
Optimization: ml
Number of observations: 16
Initial log likelihood: -221501.223
Log likelihood: -17940.195
LR chi square: 407122.056
Model degrees of freedom: 11
Pseudo R-squared: 0.919
Prob: 0.000
-------------------------------------------------------------------------------------------------------
nr Effect Coeff s.e.
-------------------------------------------------------------------------------------------------------
count
hed
1 HS 0.942** 0.007
2 Some Col 0.667** 0.007
3 BA+ 0.009 0.007
wed
4 HS 1.132** 0.007
5 Some Col 0.815** 0.007
6 BA+ -0.276** 0.008
ed_endog_full
7 1 1.410** 0.010
8 2 0.796** 0.007
9 3 0.583** 0.007
10 4 2.147** 0.010
ed_diff_3
11 1 -1.947** 0.023
12 _cons 8.964** 0.008
-------------------------------------------------------------------------------------------------------
* p < .05
** p < .01
. poisgof
Goodness-of-fit chi2 = 35694.39
Prob > chi2(4) = 0.0000
. *On one additional degree of freedom, we improved goodness of fit by about 12,000 (from 47932.55 to 35694.39), which is good, but we still have a ways to go.
. *That was M5
. *now let's add a few terms.
. gen byte ed_diff_2=0
. replace ed_diff_2=1 if (hed-wed==2) | (wed-hed==2)
(4 real changes made)
. table hed wed, contents (mean ed_diff_2)
--------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+
----------+---------------------------------------
<HS | 0 0 1 0
HS | 0 0 0 1
Some Col | 1 0 0 0
BA+ | 0 1 0 0
--------------------------------------------------
. desmat: poisson count hed wed ed_endog_full ed_diff_3 ed_diff_2
-------------------------------------------------------------------------------------------------------
Poisson regression
-------------------------------------------------------------------------------------------------------
Dependent variable count
Optimization: ml
Number of observations: 16
Initial log likelihood: -221501.223
Log likelihood: -145.628
LR chi square: 442711.189
Model degrees of freedom: 12
Pseudo R-squared: 0.999
Prob: 0.000
-------------------------------------------------------------------------------------------------------
nr Effect Coeff s.e.
-------------------------------------------------------------------------------------------------------
count
hed
1 HS 0.627** 0.008
2 Some Col 0.355** 0.007
3 BA+ 0.180** 0.008
wed
4 HS 0.817** 0.008
5 Some Col 0.461** 0.007
6 BA+ -0.142** 0.009
ed_endog_full
7 1 0.763** 0.011
8 2 0.779** 0.007
9 3 0.601** 0.008
10 4 1.195** 0.011
ed_diff_3
11 1 -2.749** 0.024
ed_diff_2
12 1 -1.068** 0.006
13 _cons 9.611** 0.009
-------------------------------------------------------------------------------------------------------
* p < .05
** p < .01
. poisgof
Goodness-of-fit chi2 = 105.2568
Prob > chi2(3) = 0.0000
. *first of all, notice that all of a sudden we have a poisgof test that actually comes into range.
. display chi2tail(3, 105)
1.307e-22
. *That's a small P value, meaning we still have some work to do, but on the other hand we're getting a lot closer. Then again, we only have 3 residual df to work with, so getting closer is in a sense not surprising. If we used all 16 terms available to us, we could fit the data exactly.
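. *The residual df reported by poisgof can be checked by hand. A minimal sketch, assuming 16 cells and counting the constant as one fitted term alongside the model df shown in each output:

```stata
* Residual df = cells - (model df + constant)
* Independence model: 6 model df
display 16 - (6 + 1)
* M4: 10 model df
display 16 - (10 + 1)
* M5: 11 model df
display 16 - (11 + 1)
* M6: 12 model df
display 16 - (12 + 1)
```

. *These give 9, 5, 4, and 3, matching the chi2 df in the four poisgof outputs so far.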
. *What do you make of the fact that the ed_endog_full terms are all positive, whereas the ed_diff terms are all negative so far?
. *The coefficients mean that, compared to some (for the moment vague) comparison group, the probability or the odds of being married to someone with the same educational level as you is higher, and the odds of being married to someone 2 or 3 categories away from you is lower, than we would otherwise expect.
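. *One way to make the size of these effects concrete is to exponentiate them. A minimal sketch, assuming the M6 coefficients are copied by hand from the output above; in M6 the only cells not covered by an endogamy or ed_diff term are those where spouses are one category apart, so those cells serve as the comparison here:

```stata
* exp(coefficient) = multiplier on the expected count,
* relative to cells where spouses are one category apart
* ed_endog_full level 1 (both spouses <HS)
display exp(0.763)
* ed_endog_full level 4 (both spouses BA+)
display exp(1.195)
* ed_diff_2 (spouses two categories apart)
display exp(-1.068)
* ed_diff_3 (spouses three categories apart)
display exp(-2.749)
```

. *Multipliers above 1 mean more marriages than the comparison cells would suggest, net of the husband's and wife's education margins; multipliers below 1 mean fewer.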
.
. *Think about the actual data and predicted values of our various models.
. predict P_M6
(option n assumed; predicted number of events)
. *Those are the predicted values of model M6
. *let's go back and see how poorly the independence model and M4 fit the data.
. table hed wed, contents(sum count sum P_independence) row col
------------------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+ Total
----------+-------------------------------------------------
<HS | 32016 33374 8407 988 74785
| 9773.551 33398.43 20349.32 11263.7 74785
|
HS | 28370 137876 43783 8446 218475
| 28552.2 97569.33 59447.98 32905.5 218475
|
Some Col | 7051 48766 61633 18195 135645
| 17727.26 60578.06 36909.58 20430.1 135645
|
BA+ | 984 13794 28635 51224 94637
| 12367.98 42264.19 25751.13 14253.7 94637
|
Total | 68421 233810 142458 78853 523542
| 68421 233810 142458 78853 523542
------------------------------------------------------------
. *The independence model has way too few marriages along the endogamy diagonal, and way too many at the opposite corners.
. *independence does no justice to the data.
. table hed wed, contents(sum count sum P_endogamy_full) row col
------------------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+ Total
----------+-------------------------------------------------
<HS | 32016 33374 8407 988 74785
| 32016 22561.17 15875.39 4332.443 74785
|
HS | 28370 137876 43783 8446 218475
| 17790.29 137876 49342.89 13465.83 218475
|
Some Col | 7051 48766 61633 18195 135645
| 12987.8 51193.47 61633 9830.73 135645
|
BA+ | 984 13794 28635 51224 94637
| 5626.913 22179.36 15606.73 51224 94637
|
Total | 68421 233810 142458 78853 523542
| 68421 233810 142458 78853 523542
------------------------------------------------------------
. *now let's take a look at M6, the model we just ran, with terms for ed_diff_3 and ed_diff_2
. table hed wed, contents(sum count sum P_M6) row col
------------------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+ Total
----------+-------------------------------------------------
<HS | 32016 33374 8407 988 74785
| 32016 33801.72 8138.919 828.3573 74785
|
HS | 28370 137876 43783 8446 218475
| 27942.28 137876 44327.54 8329.189 218475
|
Some Col | 7051 48766 61633 18195 135645
| 7319.081 48221.46 61633 18471.45 135645
|
BA+ | 984 13794 28635 51224 94637
| 1143.643 13910.81 28358.55 51224 94637
|
Total | 68421 233810 142458 78853 523542
| 68421 233810 142458 78853 523542
------------------------------------------------------------
. *M6 was the first model whose poisgof was in the neighborhood: a chi-square of 105 on 3 df.
. *There does seem to be a little bit of a gender disparity in the far corners of the model, where M6 overpredicts the number of <HS educated women married to BA+ men, and underpredicts the number of <HS men married to BA+ women.
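. *To look at just those two corner cells, a minimal sketch, assuming hed and wed are coded as consecutive integers (as the hed-wed arithmetic above implies) and P_M6 is still in memory:

```stata
* Cells where spouses are three categories apart
list hed wed count P_M6 if abs(hed - wed) == 3
```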
. *I'm going to follow the Excel table for a minute, and add a term for M7 which has husbands one level higher than wives...
. gen byte ed_diff1_male=0
. replace ed_diff1_male=1 if hed-wed==1
(3 real changes made)
. table hed wed, contents( mean ed_diff1_male)
--------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+
----------+---------------------------------------
<HS | 0 0 0 0
HS | 1 0 0 0
Some Col | 0 1 0 0
BA+ | 0 0 1 0
--------------------------------------------------
. desmat: poisson count hed wed ed_endog_full ed_diff_3 ed_diff_2 ed_diff1_male
-------------------------------------------------------------------------------------------------------
Poisson regression
-------------------------------------------------------------------------------------------------------
Dependent variable count
Optimization: ml
Number of observations: 16
Initial log likelihood: -221501.223
Log likelihood: -110.210
LR chi square: 442782.026
Model degrees of freedom: 13
Pseudo R-squared: 1.000
Prob: 0.000
-------------------------------------------------------------------------------------------------------
nr Effect Coeff s.e.
-------------------------------------------------------------------------------------------------------
count
hed
1 HS 0.614** 0.008
2 Some Col 0.320** 0.008
3 BA+ 0.136** 0.009
wed
4 HS 0.841** 0.008
5 Some Col 0.494** 0.008
6 BA+ -0.093** 0.010
ed_endog_full
7 1 0.796** 0.011
8 2 0.801** 0.008
9 3 0.637** 0.009
10 4 1.223** 0.011
ed_diff_3
11 1 -2.713** 0.024
ed_diff_2
12 1 -1.036** 0.007
ed_diff1_male
13 1 0.057** 0.007
14 _cons 9.578** 0.010
-------------------------------------------------------------------------------------------------------
* p < .05
** p < .01
. poisgof
Goodness-of-fit chi2 = 34.42006
Prob > chi2(2) = 0.0000
. predict M7
(option n assumed; predicted number of events)
. *let's take a quick look at residuals from M7
. rename M7 P_M7
. *for consistency
. gen M7_residuals= P_M7-count
. table hed wed, contents (sum count sum P_M7 sum M7_residuals) row col
-----------------------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+ Total
----------+------------------------------------------------------
<HS | 32016 33374 8407 988 74785
| 32016 33497.12 8398.321 873.5618 74785
| 0 123.1172 -8.678711 -114.4382 .0002441
|
HS | 28370 137876 43783 8446 218475
| 28246.88 137876 43725.78 8626.337 218475
| -123.1172 0 -57.21875 180.3369 .0009766
|
Some Col | 7051 48766 61633 18195 135645
| 7059.679 48823.22 61633 18129.1 135645
| 8.679199 57.21875 0 -65.89844 -.0004883
|
BA+ | 984 13794 28635 51224 94637
| 1098.438 13613.66 28700.9 51224 94637
| 114.4382 -180.3369 65.89844 0 -.0002441
|
Total | 68421 233810 142458 78853 523542
| 68421 233810 142458 78853 523542
| .0002441 -.0009766 .0009766 .0002441 .0004883
-----------------------------------------------------------------
. *Where does M7 seem not quite to fit?
. display chi2tail(2, 34)
4.140e-08
. *There are several cells that have a difference (predicted minus observed) in the neighborhood of 100 or so. How can we tell which cells are most important?
. *The key is to standardize each residual by taking into account the magnitude of the predicted value in its cell.
. poisgof
Goodness-of-fit chi2 = 34.42006
Prob > chi2(2) = 0.0000
. poisgof, pearson
Goodness-of-fit chi2 = 34.61454
Prob > chi2(2) = 0.0000
. *The Pearson chi-square is going to be especially useful to us, as you will see.
. gen M7_resid_pearson= M7_residuals/( P_M7^.5)
. table hed wed, contents (sum count sum P_M7 sum M7_resid_pearson) row col
-----------------------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+ Total
----------+------------------------------------------------------
<HS | 32016 33374 8407 988 74785
| 32016 33497.12 8398.321 873.5618 74785
| 0 .67269 -.094702 -3.871902 -3.293914
|
HS | 28370 137876 43783 8446 218475
| 28246.88 137876 43725.78 8626.337 218475
| -.7325435 0 -.2736337 1.941652 .935475
|
Some Col | 7051 48766 61633 18195 135645
| 7059.679 48823.22 61633 18129.1 135645
| .1032969 .2589555 0 -.4894259 -.1271735
|
BA+ | 984 13794 28635 51224 94637
| 1098.438 13613.66 28700.9 51224 94637
| 3.452895 -1.5456 .3889801 0 2.296275
|
Total | 68421 233810 142458 78853 523542
| 68421 233810 142458 78853 523542
| 2.823648 -.6139546 .0206444 -2.419675 -.1893376
-----------------------------------------------------------------
. *The cells with the larger Pearson residuals (in absolute value) are the cells where our model seems to fit the worst.
. *The two cells with the most unequal educational attainments are the cells with the largest (in absolute value) Pearson residuals.
. *Another way of looking at Pearson residuals is to square them and then add them up.
. gen M7_pearson_resid_squared= M7_resid_pearson^2
. table hed wed, contents (sum count sum P_M7 sum M7_pearson_resid_squared) row col
------------------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+ Total
----------+-------------------------------------------------
<HS | 32016 33374 8407 988 74785
| 32016 33497.12 8398.321 873.5618 74785
| 0 .4525118 .0089685 14.99162 15.4531
|
HS | 28370 137876 43783 8446 218475
| 28246.88 137876 43725.78 8626.337 218475
| .53662 0 .0748754 3.770013 4.381509
|
Some Col | 7051 48766 61633 18195 135645
| 7059.679 48823.22 61633 18129.1 135645
| .0106702 .067058 0 .2395377 .3172659
|
BA+ | 984 13794 28635 51224 94637
| 1098.438 13613.66 28700.9 51224 94637
| 11.92248 2.38888 .1513055 0 14.46267
|
Total | 68421 233810 142458 78853 523542
| 68421 233810 142458 78853 523542
| 12.46977 2.908449 .2351494 19.00117 34.61454
------------------------------------------------------------
. *The sum over all cells of the squared Pearson residuals is simply the Pearson chi-square statistic, 34.61454 for this model.
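. *The same kind of check works for the default (deviance) statistic from poisgof. A minimal sketch, assuming count and P_M7 are still in memory; M7_dev_contrib is a variable name introduced here for illustration:

```stata
* Pearson chi-square: sum of the squared Pearson residuals
quietly summarize M7_pearson_resid_squared
display r(sum)

* Deviance chi-square: 2 * sum of [count*ln(count/predicted) - (count - predicted)]
gen double M7_dev_contrib = 2*(count*ln(count/P_M7) - (count - P_M7))
quietly summarize M7_dev_contrib
display r(sum)
```

. *The first sum should reproduce the Pearson statistic, and the second should come out close to the 34.42006 from plain poisgof, up to rounding in the stored predicted values.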
. *We had one term for ed_diff_3 to fit these two cells. We need to add a second term to account for the gender disparity.
. gen byte ed_diff_3_male=0
. replace ed_diff_3_male=1 if hed-wed==3
(1 real change made)
. table hed wed, contents(mean ed_diff_3_male)
--------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+
----------+---------------------------------------
<HS | 0 0 0 0
HS | 0 0 0 0
Some Col | 0 0 0 0
BA+ | 1 0 0 0
--------------------------------------------------
. *Once we add this second ed_diff_3 term, we will be fitting those two cells of most unequal education exactly.
. desmat: poisson count hed wed ed_endog_full ed_diff_3 ed_diff_2 ed_diff1_male ed_diff_3_male
-------------------------------------------------------------------------------------------------------
Poisson regression
-------------------------------------------------------------------------------------------------------
Dependent variable count
Optimization: ml
Number of observations: 16
Initial log likelihood: -221501.223
Log likelihood: -95.087
LR chi square: 442812.272
Model degrees of freedom: 14
Pseudo R-squared: 1.000
Prob: 0.000
-------------------------------------------------------------------------------------------------------
nr Effect Coeff s.e.
-------------------------------------------------------------------------------------------------------
count
hed
1 HS 0.618** 0.008
2 Some Col 0.330** 0.008
3 BA+ 0.150** 0.010
wed
4 HS 0.833** 0.008
5 Some Col 0.484** 0.008
6 BA+ -0.110** 0.011
ed_endog_full
7 1 0.789** 0.011
8 2 0.798** 0.008
9 3 0.631** 0.009
10 4 1.220** 0.011
ed_diff_3
11 1 -2.579** 0.033
ed_diff_2
12 1 -1.042** 0.007
ed_diff1_male
13 1 0.047** 0.007
ed_diff_3_male
14 1 -0.264** 0.048
15 _cons 9.585** 0.010
-------------------------------------------------------------------------------------------------------
* p < .05
** p < .01
. poisgof
Goodness-of-fit chi2 = 4.174479
Prob > chi2(1) = 0.0410
. *Check it out. This model actually fits reasonably well: P=0.041 on 1 residual df.
. *We are using 15 out of 16 terms (14 model df plus the constant), so maybe we should not congratulate ourselves too much, but still.
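. *How much did the ed_diff_3_male term buy us? A minimal sketch, assuming the poisgof values for M7 and M8 and the M8 coefficient and standard error are copied by hand from the output above:

```stata
* Likelihood-ratio version: drop in poisgof chi-square from M7 to M8, 1 df
display 34.42006 - 4.174479
display chi2tail(1, 34.42006 - 4.174479)
* Wald version: squared ratio of the coefficient to its standard error
display (-0.264/0.048)^2
```

. *Both versions come to about 30.2 on 1 df, so the two tests agree that the gender asymmetry in the corner cells is statistically significant.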
. predict P_M8
(option n assumed; predicted number of events)
. table hed wed, contents(sum count sum P_M8) row col
------------------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+ Total
----------+-------------------------------------------------
<HS | 32016 33374 8407 988 74785
| 32016 33457.31 8323.688 988 74785
|
HS | 28370 137876 43783 8446 218475
| 28286.69 137876 43783 8529.313 218475
|
Some Col | 7051 48766 61633 18195 135645
| 7134.312 48766 61633 18111.69 135645
|
BA+ | 984 13794 28635 51224 94637
| 984 13710.69 28718.31 51224 94637
|
Total | 68421 233810 142458 78853 523542
| 68421 233810 142458 78853 523542
------------------------------------------------------------
. *One last point:
. *In M8, the ed endogamy terms 1 and 2 seem awfully close. Could we save 1 df by combining them?
. test _x_7-_x_8=0
( 1) [count]_x_7 - [count]_x_8 = 0
chi2( 1) = 0.30
Prob > chi2 = 0.5821
. *These two coefficients are pretty close; we can't reject the null of no difference between them.
. gen byte ed_diff_full_revised= ed_endog_full
. replace ed_diff_full_revised=1 if ed_endog_full==2
(1 real change made)
. table hed wed, contents (mean ed_diff_full_revised)
--------------------------------------------------
husband's | wife's education
education | <HS HS Some Col BA+
----------+---------------------------------------
<HS | 1 0 0 0
HS | 0 1 0 0
Some Col | 0 0 3 0
BA+ | 0 0 0 4
--------------------------------------------------
. *The actual numbers don't matter, because Stata's desmat or xi will generate a dummy variable for each distinct level other than the reference level (here 0). The point is that we now have 3 different levels of ed endogamy rather than 4.
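. *A quick check on the recode, as a minimal sketch, assuming both variables are still in memory:

```stata
* Cross-tabulate the original and revised endogamy codes;
* old levels 1 and 2 should both land in revised level 1
tabulate ed_endog_full ed_diff_full_revised
```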
. *how does it fit?
. desmat: poisson count hed wed ed_diff_full_revised ed_diff_3 ed_diff_2 ed_diff1_male ed_diff_3_male
-------------------------------------------------------------------------------------------------------
Poisson regression
-------------------------------------------------------------------------------------------------------
Dependent variable count
Optimization: ml
Number of observations: 16
Initial log likelihood: -221501.223
Log likelihood: -95.239
LR chi square: 442811.969
Model degrees of freedom: 13
Pseudo R-squared: 1.000
Prob: 0.000
-------------------------------------------------------------------------------------------------------
nr Effect Coeff s.e.
-------------------------------------------------------------------------------------------------------
count
hed
1 HS 0.622** 0.005
2 Some Col 0.331** 0.008
3 BA+ 0.151** 0.009
wed
4 HS 0.837** 0.005
5 Some Col 0.486** 0.007
6 BA+ -0.108** 0.010
ed_diff_full_revised
7 1 0.795** 0.006
8 3 0.632** 0.009
9 4 1.220** 0.011
ed_diff_3
10 1 -2.577** 0.033
ed_diff_2
11 1 -1.041** 0.007
ed_diff1_male
12 1 0.047** 0.007
ed_diff_3_male
13 1 -0.263** 0.048
14 _cons 9.580** 0.006
-------------------------------------------------------------------------------------------------------
* p < .05
** p < .01
. poisgof
Goodness-of-fit chi2 = 4.477238
Prob > chi2(2) = 0.1066
. *That's M9, and it fits very nicely: we cannot reject it at the .05 level (P=0.1066 on 2 residual df).
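. *Comparing M9 with M8 by the difference in poisgof chi-squares gives a likelihood-ratio counterpart to the Wald test above. A minimal sketch, assuming the two values are copied by hand from the output:

```stata
* Cost in fit of merging endogamy levels 1 and 2 (M9 versus M8), 1 df
display 4.477238 - 4.174479
display chi2tail(1, 4.477238 - 4.174479)
```

. *The difference is about 0.30, essentially the same as the chi2(1) of 0.30 from the test command, so both approaches say the merged term costs almost nothing in fit.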
. save "C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\ed_intermar.dta", replace
file C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\ed_intermar.dta saved
. exit, clear