. table sex if age>=25 & age<=34, contents (freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------
      Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)
----------+---------------------------------------------------
     Male |       9,027     13.31212     2.967666     .0312351
   Female |       9,511     13.55657     2.854472     .0292693
--------------------------------------------------------------

. display 2.967666/sqrt(9027)

.03123513

* Key point: the standard error of the mean is sd/sqrt(n). The calculation above reproduces the sem(yrsed) value for men in the table.

. ttest yrsed if age>=25 & age<=34, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0427623               -.3282649   -.1606289
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

* Note: there are two kinds of two-sample t-tests, equal variance and unequal variance. Stata uses the equal variance version unless you tell it otherwise with the unequal option (as I do below).

. ttest yrsed if age>=25 & age<=34, by(sex) unequal

Two-sample t test with unequal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0428057                 -.32835   -.1605438
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7106
Ho: diff = 0                     Satterthwaite's degrees of freedom =  18383.6

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

* In the case of men’s and women’s education, the variances and standard deviations of the two groups are so similar that it makes hardly any difference whether you use the equal or the unequal variance t-test. But in some cases in HW2 it will matter. See my Excel file and my PDF file for more information about the t-tests: how they are calculated, how the standard error of the difference (the denominator of the t-statistic) is calculated, and how the df are calculated.
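
* As a rough sketch of those calculations, the two standard errors of the difference and the Satterthwaite degrees of freedom can be rebuilt by hand from the group sizes and standard deviations in the ttest output. The scalar names (n1, sd1, sp2, and so on) are my own placeholders, not anything Stata defines, and they assume no variable in your data shares those names. The results should agree with the .0427623 and .0428057 reported above, up to rounding.

```stata
* Summary statistics copied from the ttest output (1 = Male, 2 = Female)
scalar n1  = 9027
scalar n2  = 9511
scalar sd1 = 2.967666
scalar sd2 = 2.854472

* Equal variance: pool the two variances, then scale by sqrt(1/n1 + 1/n2)
scalar df_pooled = n1 + n2 - 2
scalar sp2       = ((n1-1)*sd1^2 + (n2-1)*sd2^2) / df_pooled
scalar se_pooled = sqrt(sp2 * (1/n1 + 1/n2))
display "equal variance SE = " se_pooled "   df = " df_pooled

* Unequal variance: add the two squared standard errors of the mean
scalar v1       = sd1^2 / n1
scalar v2       = sd2^2 / n2
scalar se_welch = sqrt(v1 + v2)

* Satterthwaite's approximation to the degrees of freedom
scalar df_satt  = (v1 + v2)^2 / (v1^2/(n1-1) + v2^2/(n2-1))
display "unequal variance SE = " se_welch "   df = " df_satt
```

* Dividing the difference in means, -.2444469, by either standard error gives the corresponding t-statistic.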

. display ttail(18536,-5.7164)

.99999999

* Keep in mind that Stata’s ttail function gives you the right-hand (upper) tail probability, Pr(T > t), so if you start with a negative statistic, you get a value very close to 1.

. display (1-ttail(18356,-5.7164))

5.525e-09

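* This gives us the left-hand tail probability, Pr(T < t). Note that the degrees of freedom were typed as 18356 here and in the commands that follow, rather than 18536; with this many degrees of freedom the transposition makes no practical difference.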

. display 2*(1-ttail(18356,-5.7164))

1.105e-08

* And the command above gives us the two-tailed probability (both tails added together), which is what ttest reports as Pr(|T| > |t|).
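
* A possible shortcut, offered as a sketch: because the t distribution is symmetric, taking the absolute value of the statistic before calling ttail gives the two-tailed probability for a statistic of either sign, with no need to subtract from 1. The df and t values are the ones from the equal variance ttest output.

```stata
* Two-tailed p-value: twice the upper-tail area beyond |t|
display 2*ttail(18536, abs(-5.7164))
```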

* Now let’s generate a proper 0-1 dummy variable for gender that we can use as an input in a regression.

. tabulate sex

        Sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |     64,791       48.46       48.46
     Female |     68,919       51.54      100.00
------------+-----------------------------------
      Total |    133,710      100.00

. codebook sex

----------------------------------------------------------------------------------------
sex                                                                                  Sex
----------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  sexlbl

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/133710

            tabulation:  Freq.   Numeric  Label
                         64791         1  Male
                         68919         2  Female

. gen byte female=0

r(110);

* r(110) is Stata’s "already defined" error: a variable named female already existed in the data, so the generate command above did not run. The replace below operates on that existing variable.

. replace female=1 if sex==2

. tabulate sex female

           |        female
       Sex |         0          1 |     Total
-----------+----------------------+----------
      Male |    64,791          0 |    64,791
    Female |         0     68,919 |    68,919
-----------+----------------------+----------
     Total |    64,791     68,919 |   133,710

* When you generate a new variable, it is always important to cross-tabulate it against the original variable to check the coding. Here our new female variable is 0-1, rather than 1-2.
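
* One way to build the dummy in a single step, shown as a sketch rather than the only correct approach. It assumes sex is coded 1 = Male, 2 = Female as in the codebook output; the variable label text is my own wording.

```stata
* Drop any earlier copy of the variable; capture suppresses the error if none exists
capture drop female

* The logical expression (sex == 2) evaluates to 1 for women and 0 for men;
* the if qualifier leaves female missing wherever sex is missing
generate byte female = (sex == 2) if !missing(sex)
label variable female "Female (0-1 dummy)"

* Check the new coding against the original variable, including any missing values
tabulate sex female, missing
```

* Starting with capture drop avoids the r(110) error when the do-file is run a second time.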

. regress yrsed female if age>=25&age<=34

      Source |       SS       df       MS              Number of obs =   18538
-------------+------------------------------           F(  1, 18536) =   32.68
       Model |  276.742433     1  276.742433           Prob > F      =  0.0000
    Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018
       Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101

------------------------------------------------------------------------------
       yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .2444469   .0427623     5.72   0.000     .1606289    .3282649
       _cons |   13.31212   .0306297   434.62   0.000     13.25208    13.37216
------------------------------------------------------------------------------

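* The OLS regression gives us a t-statistic exactly equal in magnitude to the one from the equal variance t-test (5.72 in the regression, -5.7164 in the ttest output), with the same standard error, .0427623. The sign flips only because the regression coefficient is female minus male, while ttest reports male minus female.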
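
* To see how the regression output maps onto the group means, the stored coefficients can be combined by hand. This is a small sketch using Stata’s _b[] and _se[] references after regress; the results should match the male mean, the female mean, and the t-statistic shown above, up to rounding.

```stata
quietly regress yrsed female if age>=25 & age<=34

* Intercept = predicted mean years of education for men (female == 0)
display _b[_cons]

* Intercept + slope = predicted mean for women (female == 1)
display _b[_cons] + _b[female]

* t-statistic on the dummy = coefficient / standard error
display _b[female] / _se[female]
```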

. ttest yrsed if age>=25 & age<=34, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335
  Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394
---------+--------------------------------------------------------------------
combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946
---------+--------------------------------------------------------------------
    diff |           -.2444469    .0427623               -.3282649   -.1606289
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =  -5.7164
Ho: diff = 0                                     degrees of freedom =    18536

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

. display 2*(ttail(18356, 5.7164))

1.105e-08

. display 2*(1-normal(5.7164))

1.088e-08

* Because 18K degrees of freedom is a lot of degrees of freedom, the t and Normal distributions are almost exactly the same at the given statistic value. But notice the 1-normal(): Stata’s normal() function returns the left-hand cumulative probability, while ttail() returns the right-hand tail probability (for an arbitrary reason I cannot guess). Stata help is your guide to syntax.

. display invnormal(1-.025)

1.959964

* 1.96 is the key critical value of the Normal distribution: it leaves 2.5% in each tail. How many degrees of freedom do we need before the t distribution gets close to the Normal distribution, in the sense of yielding the same critical value for a 2.5% single-tail probability?

. display invttail(2, 0.025)

4.3026527

. display invttail(10, 0.025)

2.2281389

. display invttail(25, 0.025)

2.0595386

. display invttail(50, 0.025)

2.0085591

. display invttail(100, 0.025)

1.9839715

. display invttail(1000, 0.025)

1.9623391
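
* The same comparison can be run in one loop, sketched below; the display formats and the extra value of 10000 degrees of freedom are my own choices. In the output above, the t critical value is within about 0.02 of 1.96 by 100 degrees of freedom and within about 0.002 by 1000.

```stata
* Upper 2.5% critical value of the t distribution for increasing degrees of freedom
foreach df of numlist 2 10 25 50 100 1000 10000 {
    display "df = " %6.0f `df' "   t critical = " %9.7f invttail(`df', 0.025)
}

* Normal benchmark for comparison
display "Normal critical = " %9.7f invnormal(0.975)
```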

. log close

name:  <unnamed>