Fifth_class

---------------------------------------------------------------------------------------------------------

log: C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\fifth_class_log.log

log type: text

opened on: 9 Oct 2007, 11:03:30

. set linesize 75

. use "C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\ed_intermar.dta", clear

. *I am going to press forward a bit with model fitting to the educational intermarriage 4x4 dataset.

. *I am going to refer to models as they are numbered in my comprehensive Excel file, which has most of the summary statistics from each model.

. *Just for review, here is the independence model (main effects for husband's and wife's education only):

. desmat: poisson count hed wed

-------------------------------------------------------------------------------------------------------

Poisson regression

-------------------------------------------------------------------------------------------------------

Dependent variable count

Optimization: ml

Number of observations: 16

Initial log likelihood: -221501.223

Log likelihood: -113882.425

LR chi square: 215237.595

Model degrees of freedom: 6

Pseudo R-squared: 0.486

Prob: 0.000

-------------------------------------------------------------------------------------------------------

nr Effect Coeff s.e.

-------------------------------------------------------------------------------------------------------

count

hed

1 HS 1.072** 0.004

2 Some Col 0.595** 0.005

3 BA+ 0.235** 0.005

wed

4 HS 1.229** 0.004

5 Some Col 0.733** 0.005

6 BA+ 0.142** 0.005

7 _cons 9.187** 0.005

-------------------------------------------------------------------------------------------------------

* p < .05

** p < .01

. poisgof

Goodness-of-fit chi2 = 227578.9

Prob > chi2(9) = 0.0000

. *skipping ahead to M4, full endogamy:

. desmat: poisson count hed wed ed_endog_full

-------------------------------------------------------------------------------------------------------

Poisson regression

-------------------------------------------------------------------------------------------------------

Dependent variable count

Optimization: ml

Number of observations: 16

Initial log likelihood: -221501.223

Log likelihood: -24059.274

LR chi square: 394883.898

Model degrees of freedom: 10

Pseudo R-squared: 0.891

Prob: 0.000

-------------------------------------------------------------------------------------------------------

nr Effect Coeff s.e.

-------------------------------------------------------------------------------------------------------

count

hed

1 HS 1.134** 0.007

2 Some Col 0.819** 0.006

3 BA+ -0.017* 0.007

wed

4 HS 1.372** 0.007

5 Some Col 1.020** 0.007

6 BA+ -0.278** 0.008

ed_endog_full

7 1 1.722** 0.009

8 2 0.676** 0.007

9 3 0.537** 0.008

10 4 2.487** 0.009

11 _cons 8.652** 0.008

-------------------------------------------------------------------------------------------------------

* p < .05

** p < .01

. poisgof

Goodness-of-fit chi2 = 47932.55

Prob > chi2(5) = 0.0000

. *One way to ask whether educational endogamy really matters is to ask whether this model, with 4 terms for the endogamy diagonal, fits significantly better than the independence model.

. *That gives us a chi-square test on 4 df (9 residual df for independence minus 5 for M4), applied to the difference in goodness of fit between the two models.

. display chi2tail(4,(227500-47000))

. *OK, so the P value for this comparison is zero. What does that mean substantively?

. *A P value of zero in that last test means we can reject the null hypothesis that the independence model fits as well as the full endogamy model, M4. On the other hand, the P=0 from poisgof means that M4, which we just ran, still has a long way to go to fit the data well.
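
. *The test above used rounded values. As a sketch (not run in this log), the same comparison with the exact poisgof statistics reported above would be:

. *     display 227578.9 - 47932.55

. *     display chi2tail(4, 227578.9 - 47932.55)

. *The difference in goodness of fit is 179646.35 on 4 df, so the conclusion does not change.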

. *Let's add a few things, as we did last class, and then push it further.

. table hed wed, contents (mean ed_endog_full)

--------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+

----------+---------------------------------------

<HS | 1 0 0 0

HS | 0 2 0 0

Some Col | 0 0 3 0

BA+ | 0 0 0 4

--------------------------------------------------

. table hed wed, contents (mean ed_diff_3)

--------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+

----------+---------------------------------------

<HS | 0 0 0 1

HS | 0 0 0 0

Some Col | 0 0 0 0

BA+ | 1 0 0 0

--------------------------------------------------

. desmat: poisson count hed wed ed_endog_full ed_diff_3

-------------------------------------------------------------------------------------------------------

Poisson regression

-------------------------------------------------------------------------------------------------------

Dependent variable count

Optimization: ml

Number of observations: 16

Initial log likelihood: -221501.223

Log likelihood: -17940.195

LR chi square: 407122.056

Model degrees of freedom: 11

Pseudo R-squared: 0.919

Prob: 0.000

-------------------------------------------------------------------------------------------------------

nr Effect Coeff s.e.

-------------------------------------------------------------------------------------------------------

count

hed

1 HS 0.942** 0.007

2 Some Col 0.667** 0.007

3 BA+ 0.009 0.007

wed

4 HS 1.132** 0.007

5 Some Col 0.815** 0.007

6 BA+ -0.276** 0.008

ed_endog_full

7 1 1.410** 0.010

8 2 0.796** 0.007

9 3 0.583** 0.007

10 4 2.147** 0.010

ed_diff_3

11 1 -1.947** 0.023

12 _cons 8.964** 0.008

-------------------------------------------------------------------------------------------------------

* p < .05

** p < .01

. poisgof

Goodness-of-fit chi2 = 35694.39

Prob > chi2(4) = 0.0000

. *On one additional degree of freedom, we improved goodness of fit by about 12,000 (from 47932.55 to 35694.39), which is good, but we still have a ways to go.

. *That was M5.

. *Now let's add another term, for couples whose educations are two categories apart.

. gen byte ed_diff_2=0

. replace ed_diff_2=1 if (hed-wed==2) | (wed-hed==2)

(4 real changes made)
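
. *A more compact way to build the same indicator, as a sketch (not run in this log; ed_diff_2_alt is a hypothetical name, and this assumes hed and wed are coded numerically by category, as the subtraction above implies):

. *     gen byte ed_diff_2_alt = (abs(hed - wed) == 2)

. *     assert ed_diff_2_alt == ed_diff_2

. *The assert line should pass silently if the two versions agree in all 16 cells.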

. table hed wed, contents (mean ed_diff_2)

--------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+

----------+---------------------------------------

<HS | 0 0 1 0

HS | 0 0 0 1

Some Col | 1 0 0 0

BA+ | 0 1 0 0

--------------------------------------------------

. desmat: poisson count hed wed ed_endog_full ed_diff_3 ed_diff_2

-------------------------------------------------------------------------------------------------------

Poisson regression

-------------------------------------------------------------------------------------------------------

Dependent variable count

Optimization: ml

Number of observations: 16

Initial log likelihood: -221501.223

Log likelihood: -145.628

LR chi square: 442711.189

Model degrees of freedom: 12

Pseudo R-squared: 0.999

Prob: 0.000

-------------------------------------------------------------------------------------------------------

nr Effect Coeff s.e.

-------------------------------------------------------------------------------------------------------

count

hed

1 HS 0.627** 0.008

2 Some Col 0.355** 0.007

3 BA+ 0.180** 0.008

wed

4 HS 0.817** 0.008

5 Some Col 0.461** 0.007

6 BA+ -0.142** 0.009

ed_endog_full

7 1 0.763** 0.011

8 2 0.779** 0.007

9 3 0.601** 0.008

10 4 1.195** 0.011

ed_diff_3

11 1 -2.749** 0.024

ed_diff_2

12 1 -1.068** 0.006

13 _cons 9.611** 0.009

-------------------------------------------------------------------------------------------------------

* p < .05

** p < .01

. poisgof

Goodness-of-fit chi2 = 105.2568

Prob > chi2(3) = 0.0000

. *First of all, notice that all of a sudden we have a poisgof chi-square of a manageable size: about 105, rather than the tens of thousands we saw before.

. display chi2tail(3, 105)

1.307e-22

. *That's a small P value, meaning we still have some work to do, but on the other hand we're getting a lot closer. Then again, we only have 3 residual df to work with, so getting closer is in a sense not surprising. If we used all 16 terms available to us, we could fit the data exactly.
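
. *To see the exact-fit claim in action, a sketch (not run in this log) using xi, where i.hed*i.wed expands to both sets of main effects plus all 9 interaction terms, which with the constant makes 16 parameters:

. *     xi: poisson count i.hed*i.wed

. *     poisgof

. *With 0 residual df the goodness-of-fit chi-square should be essentially zero. That model only reproduces the table; it tells us nothing about the pattern of association.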

. *What do you make of the fact that the ed_endogamy terms are all positive, whereas the ed_diff terms are all negative so far?

. *The coefficients mean that compared to some (for the moment vague) comparison group, the probability or the odds of being married to someone with the same educational level as you is higher, and the odds of being married to someone with 2 or 3 categories difference from you is lower than we would otherwise expect.
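
. *One way to put numbers on that, as a sketch (not run in this log): exponentiating a coefficient from a Poisson model gives a multiplicative factor on the expected count. Using the M6 estimates above:

. *     display exp(-1.068)

. *     display exp(-2.749)

. *     display exp(0.779)

. *The first is roughly 0.34 and the second roughly 0.06, so, other terms held equal, cells two categories apart hold about a third of the marriages we would otherwise expect, and cells three categories apart hold about a sixteenth. The third is greater than 1, reflecting the excess of HS-HS marriages.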

. *Think about the actual data and predicted values of our various models.

. predict P_M6

(option n assumed; predicted number of events)

. *Those are the predicted values of model M6

. *let's go back and see how poorly the independence model and M4 fit the data.

. table hed wed, contents(sum P_independence sum P_independence) row col

------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+-------------------------------------------------

<HS | 9773.551 33398.43 20349.32 11263.7 74785

| 9773.551 33398.43 20349.32 11263.7 74785

HS | 28552.2 97569.33 59447.98 32905.5 218475

| 28552.2 97569.33 59447.98 32905.5 218475

Some Col | 17727.26 60578.06 36909.58 20430.1 135645

| 17727.26 60578.06 36909.58 20430.1 135645

BA+ | 12367.98 42264.19 25751.13 14253.7 94637

| 12367.98 42264.19 25751.13 14253.7 94637

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

------------------------------------------------------------

. table hed wed, contents(sum count sum P_independence) row col

------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+-------------------------------------------------

<HS | 32016 33374 8407 988 74785

| 9773.551 33398.43 20349.32 11263.7 74785

HS | 28370 137876 43783 8446 218475

| 28552.2 97569.33 59447.98 32905.5 218475

Some Col | 7051 48766 61633 18195 135645

| 17727.26 60578.06 36909.58 20430.1 135645

BA+ | 984 13794 28635 51224 94637

| 12367.98 42264.19 25751.13 14253.7 94637

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

------------------------------------------------------------

. *The independence model has way too few marriages along the endogamy diagonal, and way too many at the opposite corners.

. *independence does no justice to the data.

. table hed wed, contents(sum count sum P_endogamy_full) row col

------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+-------------------------------------------------

<HS | 32016 33374 8407 988 74785

| 32016 22561.17 15875.39 4332.443 74785

HS | 28370 137876 43783 8446 218475

| 17790.29 137876 49342.89 13465.83 218475

Some Col | 7051 48766 61633 18195 135645

| 12987.8 51193.47 61633 9830.73 135645

BA+ | 984 13794 28635 51224 94637

| 5626.913 22179.36 15606.73 51224 94637

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

------------------------------------------------------------

. *Now let's take a look at M6, the model we just ran, with terms for ed_diff_3 and ed_diff_2.

. table hed wed, contents(sum count sum P_M6) row col

------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+-------------------------------------------------

<HS | 32016 33374 8407 988 74785

| 32016 33801.72 8138.919 828.3573 74785

HS | 28370 137876 43783 8446 218475

| 27942.28 137876 44327.54 8329.189 218475

Some Col | 7051 48766 61633 18195 135645

| 7319.081 48221.46 61633 18471.45 135645

BA+ | 984 13794 28635 51224 94637

| 1143.643 13910.81 28358.55 51224 94637

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

------------------------------------------------------------

. *M6 was the first model whose poisgof was in a reasonable neighborhood: a chi-square of 105 on 3 df.

. *There does seem to be a little bit of a gender disparity in the far corners of the model, where M6 overpredicts the number of <HS educated women married to BA+ men, and underpredicts the number of <HS men married to BA+ women.

. *I'm going to follow the excel table for a minute, and add a term for M7 which has husbands one level higher than wives...

. gen byte ed_diff1_male=0

. replace ed_diff1_male=1 if hed-wed==1

(3 real changes made)

. table hed wed, contents( mean ed_diff1_male)

--------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+

----------+---------------------------------------

<HS | 0 0 0 0

HS | 1 0 0 0

Some Col | 0 1 0 0

BA+ | 0 0 1 0

--------------------------------------------------

. desmat: poisson count hed wed ed_endog_full ed_diff_3 ed_diff_2 ed_diff1_male

-------------------------------------------------------------------------------------------------------

Poisson regression

-------------------------------------------------------------------------------------------------------

Dependent variable count

Optimization: ml

Number of observations: 16

Initial log likelihood: -221501.223

Log likelihood: -110.210

LR chi square: 442782.026

Model degrees of freedom: 13

Pseudo R-squared: 1.000

Prob: 0.000

-------------------------------------------------------------------------------------------------------

nr Effect Coeff s.e.

-------------------------------------------------------------------------------------------------------

count

hed

1 HS 0.614** 0.008

2 Some Col 0.320** 0.008

3 BA+ 0.136** 0.009

wed

4 HS 0.841** 0.008

5 Some Col 0.494** 0.008

6 BA+ -0.093** 0.010

ed_endog_full

7 1 0.796** 0.011

8 2 0.801** 0.008

9 3 0.637** 0.009

10 4 1.223** 0.011

ed_diff_3

11 1 -2.713** 0.024

ed_diff_2

12 1 -1.036** 0.007

ed_diff1_male

13 1 0.057** 0.007

14 _cons 9.578** 0.010

-------------------------------------------------------------------------------------------------------

* p < .05

** p < .01

. poisgof

Goodness-of-fit chi2 = 34.42006

Prob > chi2(2) = 0.0000

. predict M7

(option n assumed; predicted number of events)

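. *Let's take a quick look at residuals from M7. Note that they are computed below as predicted minus observed, so a negative residual means M7 underpredicts that cell.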

. rename M7 P_M7

. *for consistency

. gen M7_residuals= P_M7-count

. table hed wed, contents (sum count sum P_M7 sum M7_residuals) row col

-----------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+------------------------------------------------------

<HS | 32016 33374 8407 988 74785

| 32016 33497.12 8398.321 873.5618 74785

| 0 123.1172 -8.678711 -114.4382 .0002441

HS | 28370 137876 43783 8446 218475

| 28246.88 137876 43725.78 8626.337 218475

| -123.1172 0 -57.21875 180.3369 .0009766

Some Col | 7051 48766 61633 18195 135645

| 7059.679 48823.22 61633 18129.1 135645

| 8.679199 57.21875 0 -65.89844 -.0004883

BA+ | 984 13794 28635 51224 94637

| 1098.438 13613.66 28700.9 51224 94637

| 114.4382 -180.3369 65.89844 0 -.0002441

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

| .0002441 -.0009766 .0009766 .0002441 .0004883

-----------------------------------------------------------------

. *Where does M7 seem not quite to fit?

. display chi2tail(2, 34)

4.140e-08

. *There are several cells that have a difference in the neighborhood of 100 or so. How can we tell which cells are most important?

. *The key is to standardize each residual by the magnitude of the predicted value in that cell: a miss of about 100 matters far more in a cell of about 1,000 than in a cell of about 30,000.

. poisgof

Goodness-of-fit chi2 = 34.42006

Prob > chi2(2) = 0.0000

. poisgof, pearson

Goodness-of-fit chi2 = 34.61454

Prob > chi2(2) = 0.0000

. *The Pearson chi-square is going to be especially useful to us; you will see why.

. gen M7_resid_pearson= M7_residuals/( P_M7^.5)
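
. *A note on sign, as a sketch (not run in this log; M7_resid_pearson_conv is a hypothetical name): the more conventional Pearson residual is observed minus predicted, divided by the square root of predicted, which is just the negative of the variable above:

. *     gen M7_resid_pearson_conv = (count - P_M7)/sqrt(P_M7)

. *The squared residuals, and therefore the Pearson chi-square, are the same under either sign convention.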

. table hed wed, contents (sum count sum P_M7 sum M7_resid_pearson) row col

-----------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+------------------------------------------------------

<HS | 32016 33374 8407 988 74785

| 32016 33497.12 8398.321 873.5618 74785

| 0 .67269 -.094702 -3.871902 -3.293914

HS | 28370 137876 43783 8446 218475

| 28246.88 137876 43725.78 8626.337 218475

| -.7325435 0 -.2736337 1.941652 .935475

Some Col | 7051 48766 61633 18195 135645

| 7059.679 48823.22 61633 18129.1 135645

| .1032969 .2589555 0 -.4894259 -.1271735

BA+ | 984 13794 28635 51224 94637

| 1098.438 13613.66 28700.9 51224 94637

| 3.452895 -1.5456 .3889801 0 2.296275

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

| 2.823648 -.6139546 .0206444 -2.419675 -.1893376

-----------------------------------------------------------------

. *The cells with the larger Pearson residuals (in absolute value) are the cells where our model seems to fit the worst.

. *The two cells with the most unequal educational attainments are the cells with the largest (in absolute value) Pearson residuals.

. *Another way of looking at Pearson residuals is to square them and then add them up.

. gen M7_pearson_resid_squared= M7_resid_pearson^2

. table hed wed, contents (sum count sum P_M7 sum M7_pearson_resid_squared) row col

------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+-------------------------------------------------

<HS | 32016 33374 8407 988 74785

| 32016 33497.12 8398.321 873.5618 74785

| 0 .4525118 .0089685 14.99162 15.4531

HS | 28370 137876 43783 8446 218475

| 28246.88 137876 43725.78 8626.337 218475

| .53662 0 .0748754 3.770013 4.381509

Some Col | 7051 48766 61633 18195 135645

| 7059.679 48823.22 61633 18129.1 135645

| .0106702 .067058 0 .2395377 .3172659

BA+ | 984 13794 28635 51224 94637

| 1098.438 13613.66 28700.9 51224 94637

| 11.92248 2.38888 .1513055 0 14.46267

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

| 12.46977 2.908449 .2351494 19.00117 34.61454

------------------------------------------------------------

. *The sum over all cells of the pearson residual squared is simply the pearson chisquare statistic, 34.61454 for this model.
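
. *To check those sums directly rather than reading them off the table, a sketch (not run in this log; M7_dev_contrib is a hypothetical name). The first pair of commands should reproduce the Pearson chi-square from poisgof, pearson:

. *     quietly summarize M7_pearson_resid_squared

. *     display r(sum)

. *The same idea works for the default poisgof statistic, the deviance, whose cell contributions are 2*(observed*ln(observed/predicted) - (observed - predicted)):

. *     gen M7_dev_contrib = 2*(count*ln(count/P_M7) - (count - P_M7))

. *     quietly summarize M7_dev_contrib

. *     display r(sum)

. *That second sum should match the 34.42006 reported by poisgof for M7.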

. *We had one term, ed_diff_3, to fit these two corner cells. We need to add a second term to account for the gender disparity between them.

. gen byte ed_diff_3_male=0

. replace ed_diff_3_male=1 if hed-wed==3

(1 real change made)

. table hed wed, contents(mean ed_diff_3_male)

--------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+

----------+---------------------------------------

<HS | 0 0 0 0

HS | 0 0 0 0

Some Col | 0 0 0 0

BA+ | 1 0 0 0

--------------------------------------------------

. *Once we add this second ed_diff_3 term, we will be fitting those two cells of most unequal education exactly.

. desmat: poisson count hed wed ed_endog_full ed_diff_3 ed_diff_2 ed_diff1_male ed_diff_3_male

-------------------------------------------------------------------------------------------------------

Poisson regression

-------------------------------------------------------------------------------------------------------

Dependent variable count

Optimization: ml

Number of observations: 16

Initial log likelihood: -221501.223

Log likelihood: -95.087

LR chi square: 442812.272

Model degrees of freedom: 14

Pseudo R-squared: 1.000

Prob: 0.000

-------------------------------------------------------------------------------------------------------

nr Effect Coeff s.e.

-------------------------------------------------------------------------------------------------------

count

hed

1 HS 0.618** 0.008

2 Some Col 0.330** 0.008

3 BA+ 0.150** 0.010

wed

4 HS 0.833** 0.008

5 Some Col 0.484** 0.008

6 BA+ -0.110** 0.011

ed_endog_full

7 1 0.789** 0.011

8 2 0.798** 0.008

9 3 0.631** 0.009

10 4 1.220** 0.011

ed_diff_3

11 1 -2.579** 0.033

ed_diff_2

12 1 -1.042** 0.007

ed_diff1_male

13 1 0.047** 0.007

ed_diff_3_male

14 1 -0.264** 0.048

15 _cons 9.585** 0.010

-------------------------------------------------------------------------------------------------------

* p < .05

** p < .01

. poisgof

Goodness-of-fit chi2 = 4.174479

Prob > chi2(1) = 0.0410

. *Check it out. This model actually fits reasonably well: a chi-square of 4.17 on 1 df, P = 0.04.

. *We are using 15 out of 16 terms, so maybe we should not congratulate ourselves too much, but still.

. predict P_M8

(option n assumed; predicted number of events)

. table hed wed, contents(sum count sum P_M8) row col

------------------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+ Total

----------+-------------------------------------------------

<HS | 32016 33374 8407 988 74785

| 32016 33457.31 8323.688 988 74785

HS | 28370 137876 43783 8446 218475

| 28286.69 137876 43783 8529.313 218475

Some Col | 7051 48766 61633 18195 135645

| 7134.312 48766 61633 18111.69 135645

BA+ | 984 13794 28635 51224 94637

| 984 13710.69 28718.31 51224 94637

Total | 68421 233810 142458 78853 523542

| 68421 233810 142458 78853 523542

------------------------------------------------------------

. *One last point:

. *In M8, the ed endogamy terms 1 and 2 seem awfully close. Could we save 1 df by combining them?

. test _x_7-_x_8=0

( 1) [count]_x_7 - [count]_x_8 = 0

chi2( 1) = 0.30

Prob > chi2 = 0.5821

. *These two coefficients are pretty close; we can't reject the null hypothesis of no difference between them.

. gen byte ed_diff_full_revised= ed_endog_full

. replace ed_diff_full_revised=1 if ed_endog_full==2

(1 real change made)

. table hed wed, contents (mean ed_diff_full_revised)

--------------------------------------------------

husband's | wife's education

education | <HS HS Some Col BA+

----------+---------------------------------------

<HS | 1 0 0 0

HS | 0 1 0 0

Some Col | 0 0 3 0

BA+ | 0 0 0 4

--------------------------------------------------

. *The actual numbers don't matter, because Stata's desmat or xi will generate a dummy variable for every distinct level, but here we have 3 different levels of ed endogamy rather than 4.
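
. *The same collapsing can be done in one step with recode, as a sketch (not run in this log; ed_endog_3lev is a hypothetical name):

. *     recode ed_endog_full (2=1), generate(ed_endog_3lev)

. *That leaves ed_endog_full untouched and puts the HS-HS cell in the same category as the <HS-<HS cell in the new variable.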

. *how does it fit?

. desmat: poisson count hed wed ed_diff_full_revised ed_diff_3 ed_diff_2 ed_diff1_male ed_diff_3_male

-------------------------------------------------------------------------------------------------------

Poisson regression

-------------------------------------------------------------------------------------------------------

Dependent variable count

Optimization: ml

Number of observations: 16

Initial log likelihood: -221501.223

Log likelihood: -95.239

LR chi square: 442811.969

Model degrees of freedom: 13

Pseudo R-squared: 1.000

Prob: 0.000

-------------------------------------------------------------------------------------------------------

nr Effect Coeff s.e.

-------------------------------------------------------------------------------------------------------

count

hed

1 HS 0.622** 0.005

2 Some Col 0.331** 0.008

3 BA+ 0.151** 0.009

wed

4 HS 0.837** 0.005

5 Some Col 0.486** 0.007

6 BA+ -0.108** 0.010

ed_diff_full_revised

7 1 0.795** 0.006

8 3 0.632** 0.009

9 4 1.220** 0.011

ed_diff_3

10 1 -2.577** 0.033

ed_diff_2

11 1 -1.041** 0.007

ed_diff1_male

12 1 0.047** 0.007

ed_diff_3_male

13 1 -0.263** 0.048

14 _cons 9.580** 0.006

-------------------------------------------------------------------------------------------------------

* p < .05

** p < .01

. poisgof

Goodness-of-fit chi2 = 4.477238

Prob > chi2(2) = 0.1066

. *That's model 9 (M9), which fits very nicely: a chi-square of 4.48 on 2 df, P = 0.11.
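
. *As a cross-check on the Wald test we ran before combining the two endogamy terms, a sketch (not run in this log) of the likelihood-ratio version, using the poisgof statistics for M9 and M8:

. *     display 4.477238 - 4.174479

. *     display chi2tail(1, 4.477238 - 4.174479)

. *The difference is 0.302759 on 1 df, essentially the same as the Wald chi-square of 0.30, so the P value should also land close to the 0.58 reported by test. Giving up that 1 df costs us almost nothing in fit.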

. save "C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\ed_intermar.dta", replace

file C:\AAA Miker Files\newer web pages\soc_388_notes\soc_388_2007\ed_intermar.dta saved

. exit, clear