-----------------------------------------------------------------------------------

name:  <unnamed>

log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fall_2010_s381_logs\class4.log

log type:  text

opened on:  30 Sep 2010, 14:40:52

. table sex if age>24 & age<35, contents (freq mean yrsed sd yrsed)

-------------------------------------------------

Sex |       Freq.  mean(yrsed)    sd(yrsed)

----------+--------------------------------------

Male |       9,027     13.31212     2.967666

Female |       9,511     13.55657     2.854472

-------------------------------------------------

. table sex if age>24 & age<35, contents (mean yrsed sd yrsed freq)

-------------------------------------------------

Sex | mean(yrsed)    sd(yrsed)        Freq.

----------+--------------------------------------

Male |    13.31212     2.967666        9,027

Female |    13.55657     2.854472        9,511

-------------------------------------------------

* See the Excel file posted on my website, under ttests, for a full account of the hand calculations behind these numbers and statistics.

. ttest yrsed if age>24 & age<35, by(sex) unequal

Two-sample t test with unequal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335

Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394

---------+--------------------------------------------------------------------

combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946

---------+--------------------------------------------------------------------

diff |           -.2444469    .0428057                 -.32835   -.1605438

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female)                              t =  -5.7106

Ho: diff = 0                     Satterthwaite's degrees of freedom =  18383.6

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335

Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394

---------+--------------------------------------------------------------------

combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946

---------+--------------------------------------------------------------------

diff |           -.2444469    .0427623               -.3282649   -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

* A few points to note. First, the default for ttest, and for regression (below), is the equal-variance test. If you want the unequal-variance t-test, you have to ask for it by name with the unequal option. The formulas I presented in the last class were for the unequal-variance t-test, which I find a little more intuitive.

* Second, the degrees of freedom of the two tests differ, but not in a way that matters: the t distribution with 18,383.6 df is so similar to the t distribution with 18,536 df that the results are unaffected.

* Lastly, in this case the standard deviations of the two samples (men's and women's education) are so close that it matters little which assumption we make about the variances of the two groups. Note that the two t-statistics above, -5.7106 and -5.7164, differ only slightly. If the variances of the two samples were very different, especially with unequal group sizes, the equal- and unequal-variance tests could give noticeably different results.
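
* The hand calculation behind both tests can be sketched outside Stata. The Python below is only an illustration, not part of the Stata session: it plugs the group sizes, means, and standard deviations printed above into the textbook unequal-variance (Welch/Satterthwaite) and pooled-variance formulas. The variable names are my own. Because the inputs are rounded, the results should agree with the Stata output only to about the displayed precision.

```python
import math

# Summary statistics copied from the ttest output above (ages 25-34).
n_m, mean_m, sd_m = 9027, 13.31212, 2.967666   # men
n_f, mean_f, sd_f = 9511, 13.55657, 2.854472   # women
diff = mean_m - mean_f

# Unequal-variance test: each group keeps its own variance.
v_m, v_f = sd_m**2 / n_m, sd_f**2 / n_f
se_unequal = math.sqrt(v_m + v_f)
t_unequal = diff / se_unequal
# Satterthwaite's approximation to the degrees of freedom.
df_unequal = (v_m + v_f)**2 / (v_m**2 / (n_m - 1) + v_f**2 / (n_f - 1))

# Equal-variance test: pool the two variances, weighting each by its df.
df_equal = n_m + n_f - 2
var_pooled = ((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / df_equal
se_equal = math.sqrt(var_pooled * (1 / n_m + 1 / n_f))
t_equal = diff / se_equal

print("unequal:", round(se_unequal, 7), round(t_unequal, 4), round(df_unequal, 1))
print("equal:  ", round(se_equal, 7), round(t_equal, 4), df_equal)
```

* The two standard errors differ only in how the group variances are combined, which is why the two t-statistics are so close here.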

. regress yrsed male if age>24 & age<35

Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   32.68

Model |  276.742433     1  276.742433           Prob > F      =  0.0000

Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018

Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2444469   .0427623    -5.72   0.000    -.3282649   -.1606289

_cons |   13.55657   .0298401   454.31   0.000     13.49808    13.61506

------------------------------------------------------------------------------

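* Compare to the equal-variance t-test above: the coefficient on male, its standard error, the t statistic, and the confidence interval are all identical.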
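
* Why the regression matches the equal-variance t-test can be sketched with a few lines of Python (an illustration only, with variable names of my own choosing). With a single 0/1 predictor, the constant is the mean of the omitted group (women), the slope is the difference in group means, and the slope's standard error is the Root MSE times sqrt(1/n1 + 1/n2). The inputs are the rounded numbers printed above, so agreement is approximate.

```python
import math

# Group sizes and means from the ttest output; Root MSE from the regression.
n_m, n_f = 9027, 9511
mean_m, mean_f = 13.31212, 13.55657
root_mse = 2.9101

intercept = mean_f                 # _cons: mean for male == 0
slope = mean_m - mean_f            # coefficient on male
se_slope = root_mse * math.sqrt(1 / n_m + 1 / n_f)
t_slope = slope / se_slope
f_stat = t_slope**2                # with one predictor, F equals t squared

print(round(intercept, 5), round(slope, 5))
print(round(se_slope, 6), round(t_slope, 2), round(f_stat, 2))
```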

. gen monthsed=yrsed*12

(30484 missing values generated)

. ttest monthsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male |    9027    159.7454    .3748215    35.61199    159.0107    160.4802

Female |    9511    162.6788    .3512319    34.25366    161.9903    163.3673

---------+--------------------------------------------------------------------

combined |   18538    161.2504    .2567052    34.95152    160.7472    161.7536

---------+--------------------------------------------------------------------

diff |           -2.933363    .5131471               -3.939178   -1.927547

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

* Note: the t-test is unit free. It doesn't matter whether we measure education in years, months, or microseconds; the resulting t statistic is the same.
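
* The arithmetic behind this is short enough to check by hand. The Python below (an illustration only; variable names are mine) takes the difference and standard error from the yrsed and monthsed equal-variance tests above. Multiplying education by 12 multiplies both the difference and its standard error by 12, so the ratio, which is the t statistic, does not change.

```python
# Difference and standard error from the two equal-variance ttest outputs above.
diff_years, se_years = -0.2444469, 0.0427623
diff_months, se_months = -2.933363, 0.5131471

t_years = diff_years / se_years
t_months = diff_months / se_months

# Both ratios should be very close to 12 (the inputs are rounded).
print(diff_months / diff_years, se_months / se_years)
print(round(t_years, 4), round(t_months, 4))
```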

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335

Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394

---------+--------------------------------------------------------------------

combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946

---------+--------------------------------------------------------------------

diff |           -.2444469    .0427623               -.3282649   -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

. display ttail(18000, 5.7164)

5.526e-09

* The syntax is ttail(df, t-statistic), and the output gives you the remaining right-hand tail probability.

. display 2*ttail(18000, 5.7164)

1.105e-08

* Usually we think about two-tailed tests, which means doubling the one-tail probability, because the t distribution is symmetric and has two equal tails.

. display normal (5.716)

r(111);

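* Stata didn't like the space: a function name must be followed immediately by its opening parenthesis. With the space, Stata reads normal as a variable name and fails with r(111), not found.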

. display normal(5.716)

.99999999

* normal(z) gives you the cumulative standard normal distribution up to z. If we want the probability in the two tails, we subtract from one and double it.

. display 2*(1-normal(5.716))

1.091e-08

. display invnormal(1-.025)

1.959964

* invnormal takes a cumulative (lower-tail) probability and gives you the corresponding Z-score, which is why we pass 1-.025 rather than .025. 1.96 is the key cutoff because if one tail has .025 left, the two-tailed test has .05 probability left in the tails.
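
* The two Normal calculations above can be cross-checked outside Stata. The Python below is only a sketch using the standard library, not a substitute for Stata's normal() and invnormal(). It uses the identity 2*(1 - Phi(z)) = erfc(z/sqrt(2)) for the two-tailed probability, and NormalDist().inv_cdf for the inverse of the cumulative distribution.

```python
import math
from statistics import NormalDist

z = 5.716
std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-tailed probability. erfc avoids the rounding loss in 1 - cdf(z)
# when cdf(z) is extremely close to 1.
p_two_tail = math.erfc(z / math.sqrt(2))

# Z-score that leaves .025 in the upper tail (cumulative probability .975).
z_crit = std_normal.inv_cdf(1 - 0.025)

print(p_two_tail, round(z_crit, 6))
```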

. display invttail(5,.025)

2.5705818

* The t-statistic for the same upper-tail probability is higher, but how much higher depends on the degrees of freedom. When df is small (like 5) the difference is larger.

. display invttail(10000,.025)

1.9602012

* At 10,000 df, the t distribution is practically indistinguishable from the Normal distribution: the cutoff is 1.9602 rather than 1.9600.

. regress yrsed male if age>24 & age<35

Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   32.68

Model |  276.742433     1  276.742433           Prob > F      =  0.0000

Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018

Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2444469   .0427623    -5.72   0.000    -.3282649   -.1606289

_cons |   13.55657   .0298401   454.31   0.000     13.49808    13.61506

------------------------------------------------------------------------------

* First, regress unweighted.

. regress yrsed male if age>24 & age<35 [aweight= perwt_rounded]

(sum of wgt is   3.7786e+07)

Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   25.52

Model |  195.741395     1  195.741395           Prob > F      =  0.0000

Residual |  142186.809 18536  7.67084641           R-squared     =  0.0014

Total |  142382.551 18537   7.6809921           Root MSE      =  2.7696

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2055446   .0406899    -5.05   0.000    -.2853005   -.1257887

_cons |   13.76294   .0285199   482.57   0.000     13.70704    13.81885

------------------------------------------------------------------------------

* Second, regress with aweights, which preserve our sample size (note that the number of observations is the same as in the unweighted example) by rescaling the weights to an average of 1. The coefficient and t statistic are a little different because the weighted data are a little different from the unweighted data. Note that ttest does not accept weights.

. table sex if age>24 & age<35 [aweight= perwt_rounded], contents (mean yrsed sd yrsed freq)

-------------------------------------------------

Sex | mean(yrsed)    sd(yrsed)        Freq.

----------+--------------------------------------

Male |     13.5574     2.819247        9,027

Female |    13.76295     2.720855        9,511

-------------------------------------------------

. regress yrsed male if age>24 & age<35 [fweight= perwt_rounded]

Source |       SS       df       MS              Number of obs =37785945

-------------+------------------------------           F(  1,37785943) =52018.00

Model |  398979.047     1  398979.047           Prob > F      =  0.0000

Residual |   289818910 37785943  7.67001924           R-squared     =  0.0014

Total |   290217889 37785944  7.68057796           Root MSE      =  2.7695

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2055446   .0009012  -228.07   0.000    -.2073109   -.2037782

_cons |   13.76294   .0006317  2.2e+04   0.000     13.76171    13.76418

------------------------------------------------------------------------------

* If we use fweights instead, the number of observations is inflated by a factor of about 2,000, and the t-statistic is inflated by a factor of the square root of 2,000, or about 45. And this is wrong, wrong, wrong, because we don't have 38 million cases; we have 18 thousand cases.
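
* The size of the fweight distortion can be checked with a few lines of arithmetic. The Python below (an illustration only; variable names are mine) compares the observation counts, standard errors, and t statistics printed in the aweight and fweight regressions above. The inputs are rounded, so the ratios agree only approximately.

```python
import math

n_sample = 18538        # actual survey respondents (aweight run)
n_fweight = 37785945    # "observations" reported by the fweight run

inflation = n_fweight / n_sample
expected_ratio = math.sqrt(inflation)

# Standard errors and t statistics for male, from the two outputs above.
se_aw, se_fw = 0.0406899, 0.0009012
t_aw, t_fw = -5.05, -228.07

print(round(inflation), round(expected_ratio, 1))
print(round(se_aw / se_fw, 1), round(t_fw / t_aw, 1))
```

* The standard error shrinks, and the t statistic grows, by about the square root of the inflation in the number of cases, even though the coefficient itself is unchanged.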

. save "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", replace

file C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta saved

. log close

name:  <unnamed>

log:  C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web pages\soc_meth_proj3\fall_2010_s381_logs\class4.log

log type:  text

closed on:  30 Sep 2010, 15:50:00

-----------------------------------------------------------------------------------