-----------------------------------------------------------------------------------------------------------

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fall_2010_s381_logs\class8.log

  log type:  text

 opened on:  14 Oct 2010, 12:20:01

 

* Here are some housekeeping things I did before class started:

 

. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", clear

 

. gen occ1990_reduced= occ1990 if  occ1990==95|occ1990==125|occ1990==178

(132297 missing values generated)

 

. describe  occ1990

 

              storage  display     value

variable name   type   format      label      variable label

------------------------------------------------------------------------------------------------------

occ1990         int    %78.0g      occ1990lbl

                                              Occupation, 1990 basis

 

. tabulate  occ1990 if occ1990==95|occ1990==125|occ1990==178

 

                 Occupation, 1990 basis |      Freq.     Percent        Cum.

----------------------------------------+-----------------------------------

                      Registered nurses |        966       68.37       68.37

                  Sociology instructors |          6        0.42       68.79

                                Lawyers |        441       31.21      100.00

----------------------------------------+-----------------------------------

                                  Total |      1,413      100.00

 

. tabulate  occ1990 if occ1990==95|occ1990==125|occ1990==178, nolab

 

Occupation, |

 1990 basis |      Freq.     Percent        Cum.

------------+-----------------------------------

         95 |        966       68.37       68.37

        125 |          6        0.42       68.79

        178 |        441       31.21      100.00

------------+-----------------------------------

      Total |      1,413      100.00

 

. label define occ1990_reduced 95 "nurses" 125 "sociologists" 178 "lawyers"

 

. label val  occ1990_reduced occ1990_reduced

 

. tabulate  occ1990_reduced

 

occ1990_redu |

         ced |      Freq.     Percent        Cum.

-------------+-----------------------------------

      nurses |        966       68.37       68.37

sociologists |          6        0.42       68.79

     lawyers |        441       31.21      100.00

-------------+-----------------------------------

       Total |      1,413      100.00

 

. save "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", replace

file C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta saved

 

 

. ttest incwage if occ1990==95| occ1990==125, by(occ1990)

 

Two-sample t test with equal variances

------------------------------------------------------------------------------

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Register |     966    37536.85    702.6892    21839.96    36157.88    38915.83

Sociolog |       6    41508.33    2842.722    6963.219    34200.88    48815.78

---------+--------------------------------------------------------------------

combined |     972    37561.37    698.6046    21780.33    36190.42    38932.32

---------+--------------------------------------------------------------------

    diff |           -3971.481    8923.041               -21482.17    13539.21

------------------------------------------------------------------------------

    diff = mean(Register) - mean(Sociolog)                        t =  -0.4451

Ho: diff = 0                                     degrees of freedom =      970

 

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

 Pr(T < t) = 0.3282         Pr(|T| > |t|) = 0.6564          Pr(T > t) = 0.6718

 

. ttest incwage if occ1990==95| occ1990==125, by(occ1990) unequal

 

Two-sample t test with unequal variances

------------------------------------------------------------------------------

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Register |     966    37536.85    702.6892    21839.96    36157.88    38915.83

Sociolog |       6    41508.33    2842.722    6963.219    34200.88    48815.78

---------+--------------------------------------------------------------------

combined |     972    37561.37    698.6046    21780.33    36190.42    38932.32

---------+--------------------------------------------------------------------

    diff |           -3971.481    2928.283               -11252.58     3309.62

------------------------------------------------------------------------------

    diff = mean(Register) - mean(Sociolog)                        t =  -1.3562

Ho: diff = 0                     Satterthwaite's degrees of freedom =  5.62958

 

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

 Pr(T < t) = 0.1134         Pr(|T| > |t|) = 0.2269          Pr(T > t) = 0.8866

 

* One of the things I talked about a lot in class today was why the df for the unequal-variance t-test can be so different from the df for the equal-variance t-test. In simple terms, the equal-variance t-test pools all the observations, so its df is n1+n2-2 = 970. The unequal-variance t-test builds the standard error of the difference from each group's own standard error, and here the smaller sample dominates: see above how the sociologists' standard error of the mean (2842.7) is nearly the same as the standard error of the difference (2928.3). So you can think of the unequal-variance t-test as relying almost entirely on the smaller sample for the variance of the difference, which is why the df is about 5.6 (close to the 6 sociologists) rather than 970. Note also that the two tests have the same substantive interpretation (no significant difference), so the wild difference in df between the two tests does not change the answer here. See my Excel file and also Stata's documentation on t-tests (either the printed manual or the online PDFs) for Satterthwaite's formula.
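
* As a rough check on the Satterthwaite df, here is a sketch that recomputes it from the group sizes and standard deviations printed in the t-test output above. The scalar names v1, v2 and df_s are my own, and the numbers are typed in by hand, so expect small rounding differences from Stata's 5.62958.

```stata
* s^2/n for each group, using the Obs and Std. Dev. columns above
scalar v1 = 21839.96^2 / 966
scalar v2 = 6963.219^2 / 6
* Standard error of the difference under unequal variances
display "SE of difference = " sqrt(v1 + v2)
* Satterthwaite's approximation to the degrees of freedom
scalar df_s = (v1 + v2)^2 / (v1^2/(966 - 1) + v2^2/(6 - 1))
display "Satterthwaite df = " df_s
* The immediate form of ttest runs the same test from summary statistics
ttesti 966 37536.85 21839.96 6 41508.33 6963.219, unequal
```

* Because v2 is so much larger than v1, the v2^2/(6-1) term dominates the denominator, which is what pulls the df down toward the size of the smaller group.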

 

. regress incwage age if age>25 & age<65

 

      Source |       SS       df       MS              Number of obs =   67639

-------------+------------------------------           F(  1, 67637) =    4.47

       Model |  4.5722e+09     1  4.5722e+09           Prob > F      =  0.0345

    Residual |  6.9210e+13 67637  1.0233e+09           R-squared     =  0.0001

-------------+------------------------------           Adj R-squared =  0.0001

       Total |  6.9214e+13 67638  1.0233e+09           Root MSE      =   31988

 

------------------------------------------------------------------------------

     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

         age |  -25.10229   11.87524    -2.11   0.035    -48.37775   -1.826828

       _cons |   27890.06   526.1078    53.01   0.000     26858.89    28921.23

------------------------------------------------------------------------------

 

* See my Excel file for a graphical example of why a straight line does not fit the relationship between age and income. The relationship looks like a parabola, an upside-down "U", so we need a second-order age term to fit it.
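
* If you would rather look at the curve in Stata than in Excel, one possible sketch is below. It is not what I ran in class; the variable name mean_inc is made up for the illustration, and the fits are drawn through the age-specific means rather than the individual observations, so they will not match the regressions in this log exactly.

```stata
* Work on a temporary copy so the CPS data in memory is untouched
preserve
keep if age>25 & age<65
* One row per year of age, holding the mean wage income at that age
collapse (mean) mean_inc = incwage, by(age)
* Overlay a straight-line fit and a quadratic fit on the age-specific means
twoway (scatter mean_inc age) (lfit mean_inc age) (qfit mean_inc age)
restore
```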

 

. gen age_sq=age^2

 

. regress incwage age  age_sq if age>25 & age<65

 

      Source |       SS       df       MS              Number of obs =   67639

-------------+------------------------------           F(  2, 67636) =  536.56

       Model |  1.0810e+12     2  5.4051e+11           Prob > F      =  0.0000

    Residual |  6.8133e+13 67636  1.0074e+09           R-squared     =  0.0156

-------------+------------------------------           Adj R-squared =  0.0156

       Total |  6.9214e+13 67638  1.0233e+09           Root MSE      =   31739

 

------------------------------------------------------------------------------

     incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

         age |   3276.086   101.6719    32.22   0.000      3076.81    3475.363

      age_sq |  -37.40804   1.144351   -32.69   0.000    -39.65096   -35.16511

       _cons |  -40886.74   2167.744   -18.86   0.000    -45135.51   -36637.96

------------------------------------------------------------------------------

 

* Note that in the first model age is barely significant (p = 0.035), but here both age and age squared are highly significant, and the R-squared of the model has gone up from 0.0001 to 0.0156 (which still leaves plenty of room for improvement).
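
* One way to read the quadratic model is to ask at what age predicted income peaks, which is -b_age/(2*b_age_sq). The sketch below assumes it is run right after the regression of incwage on age and age_sq above, so that _b[] refers to that model; with the coefficients printed above the peak comes out at roughly age 44.

```stata
* Age at which predicted income is highest, from the stored coefficients
display -_b[age] / (2 * _b[age_sq])
* Same quantity with a delta-method standard error and confidence interval
nlcom -_b[age] / (2 * _b[age_sq])
* By hand, from the printed coefficients
display 3276.086 / (2 * 37.40804)
```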

 

. tabulate occ1990_reduced

 

occ1990_redu |

         ced |      Freq.     Percent        Cum.

-------------+-----------------------------------

      nurses |        966       68.37       68.37

sociologists |          6        0.42       68.79

     lawyers |        441       31.21      100.00

-------------+-----------------------------------

       Total |      1,413      100.00

 

. table  occ1990_reduced sex, contents(freq mean incwage) row col

 

----------------------------------------------------

occ1990_redu |                  Sex                

ced          |        Male       Female        Total

-------------+--------------------------------------

      nurses |          62          904          966

             | 48602.45161   36777.9281  37536.85197

             |

sociologists |           2            4            6

             |       39200      42662.5  41508.33333

             |

     lawyers |         308          133          441

             | 80236.42208  59704.73684  74044.32653

             |

       Total |         372        1,041        1,413

             | 74743.46774  39729.70893  48947.76858

----------------------------------------------------

 

* Why we do multiple regression: we want to control for potential confounding variables. In this case, we might worry that the apparent advantage of lawyers over nurses is due to the fact that the lawyers are mostly male and the nurses mostly female. So let's put occupation and sex into the regression at the same time.

 

. desmat: regress incwage  occ1990_reduced sex

---------------------------------------------------------------------------------

   Linear regression

---------------------------------------------------------------------------------

   Dependent variable                                                    incwage

   Number of observations:                                                  1413

   F statistic:                                                           83.696

   Model degrees of freedom:                                                   3

   Residual degrees of freedom:                                             1409

   R-squared:                                                              0.151

   Adjusted R-squared:                                                     0.149

   Root MSE                                                            42234.929

   Prob:                                                                   0.000

---------------------------------------------------------------------------------

nr Effect                                                      Coeff        s.e.

---------------------------------------------------------------------------------

   occ1990_reduced

1    sociologists                                           -604.948   17320.322

2    lawyers                                               25723.531**  3256.452

   sex

3    Female                                               -17003.194**  3422.971

4  _cons                                                   53448.743**  3479.591

---------------------------------------------------------------------------------

*  p < .05

** p < .01

 

* OK, even after accounting for the fact that women make less money than men, lawyers still earn significantly more than nurses (about $25,700 more, holding sex constant). So the lawyer-nurse gap is not just a function of the gender distribution in the two occupations. In fact, if you look at the table above, you see that male lawyers make a lot more than male nurses, and female lawyers make a lot more than female nurses.
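
* desmat is a user-written add-on. If it is not installed, the same additive model can probably be fit with Stata's built-in factor-variable notation, as sketched below. This assumes Stata 11 or later, and that occ1990_reduced and sex are stored as nonnegative integer codes with value labels, so that the lowest codes (nurses and, presumably, Male) become the reference categories as in the desmat output.

```stata
* i. expands each categorical variable into dummies, omitting the lowest code
regress incwage i.occ1990_reduced i.sex
```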

 

. predict m1_oc_gen

(option xb assumed; fitted values)

(132297 missing values generated)

 

* This generates the predicted values for the above model.

 

 

. table  occ1990_reduced sex, contents(freq mean incwage mean  m1_oc_gen) row col

 

----------------------------------------------------

occ1990_redu |                  Sex                

ced          |        Male       Female        Total

-------------+--------------------------------------

      nurses |          62          904          966

             | 48602.45161   36777.9281  37536.85197

             |    53448.74     36445.55     37536.85

             |

sociologists |           2            4            6

             |       39200      42662.5  41508.33333

             |     52843.8      35840.6     41508.33

             |

     lawyers |         308          133          441

             | 80236.42208  59704.73684  74044.32653

             |    79172.27     62169.08     74044.33

             |

       Total |         372        1,041        1,413

             | 74743.46774  39729.70893  48947.76858

             |    74743.47     39729.71     48947.77

----------------------------------------------------

 

* Notice that the predicted values and the actual values in our 3x2=6 cells do not coincide (only the row and column totals match). That is because our model had only 4 terms, and cannot fit the 6 cells exactly. Another way to think about this is that the 3 occupations have different gender income gaps, but our model above allowed for only 1 general gender income gap. If we want to fit all 6 cells exactly, we need to allow the gender gap to vary across occupations.
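
* To see where the predicted cell values come from, you can add up the coefficients of the additive model by hand. The sketch below types in the coefficients printed in the desmat output above; the results should agree with the m1_oc_gen row of the table up to rounding.

```stata
* Male nurses are the reference cell, so their prediction is just _cons
display 53448.743
* Female nurses: _cons + Female
display 53448.743 - 17003.194
* Male sociologists: _cons + sociologists
display 53448.743 - 604.948
* Female lawyers: _cons + lawyers + Female
display 53448.743 + 25723.531 - 17003.194
```

* Every occupation gets the same Female term (-17003.194), which is exactly the "only 1 general gender income gap" restriction described above.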

 

. desmat: regress incwage  occ1990_reduced*sex

---------------------------------------------------------------------------------

   Linear regression

---------------------------------------------------------------------------------

   Dependent variable                                                    incwage

   Number of observations:                                                  1413

   F statistic:                                                           50.578

   Model degrees of freedom:                                                   5

   Residual degrees of freedom:                                             1407

   R-squared:                                                              0.152

   Adjusted R-squared:                                                     0.149

   Root MSE                                                            42237.424

   Prob:                                                                   0.000

---------------------------------------------------------------------------------

nr Effect                                                      Coeff        s.e.

---------------------------------------------------------------------------------

   occ1990_reduced

1    sociologists                                          -9402.452   30344.261

2    lawyers                                               31633.970**  5879.320

   sex

3    Female                                               -11824.524*   5545.056

   occ1990_reduced.sex

4    sociologists.Female                                   15287.024   36996.590

5    lawyers.Female                                        -8707.162    7067.771

6  _cons                                                   48602.452**  5364.158

---------------------------------------------------------------------------------

*  p < .05

** p < .01

 

* Our 2 new terms are not significant, and the adjusted R-squared does not improve (0.149 in both models), but the model does fit all 6 cells perfectly now.
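
* With the interaction terms in the model, adding up the coefficients reproduces the observed cell means. The sketch below does this by hand for two cells, using the coefficients printed above. The last two commands are an assumption on my part (they require Stata 11 or later and integer-coded variables, and they were not run in class): they refit the model with factor-variable notation and jointly test the 2 interaction terms.

```stata
* Female sociologists: _cons + sociologists + Female + sociologists.Female
display 48602.452 - 9402.452 - 11824.524 + 15287.024
* Female lawyers: _cons + lawyers + Female + lawyers.Female
display 48602.452 + 31633.970 - 11824.524 - 8707.162
* Same saturated model in factor-variable notation (## = main effects plus interaction)
regress incwage i.occ1990_reduced##i.sex
* Joint test that both interaction terms are zero
testparm i.occ1990_reduced#i.sex
```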

 

. predict m2

(option xb assumed; fitted values)

(132297 missing values generated)

 

. table  occ1990_reduced sex, contents(freq mean incwage mean  m1_oc_gen mean m2)row col

 

----------------------------------------------------

occ1990_redu |                  Sex                

ced          |        Male       Female        Total

-------------+--------------------------------------

      nurses |          62          904          966

             | 48602.45161   36777.9281  37536.85197

             |    53448.74     36445.55     37536.85

             |    48602.45     36777.93     37536.86

             |

sociologists |           2            4            6

             |       39200      42662.5  41508.33333

             |     52843.8      35840.6     41508.33

             |       39200      42662.5     41508.33

             |

     lawyers |         308          133          441

             | 80236.42208  59704.73684  74044.32653

             |    79172.27     62169.08     74044.33

             |    80236.42     59704.74     74044.33

             |

       Total |         372        1,041        1,413

             | 74743.46774  39729.70893  48947.76858

             |    74743.47     39729.71     48947.77

             |    74743.47     39729.71     48947.77

----------------------------------------------------

 

. log close

      name:  <unnamed>

       log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj

> 3\fall_2010_s381_logs\class8.log

  log type:  text

 closed on:  14 Oct 2010, 16:00:31

---------------------------------------------------------------------------------------------------