---------------------------------------------------------------------------

name:  <unnamed>

log type:  text

opened on:  24 Jan 2013, 12:00:36

. use "C:\Users\Michael\Desktop\cps_mar_2000_new_unchanged.dta", clear

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335

Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394

---------+--------------------------------------------------------------------

combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946

---------+--------------------------------------------------------------------

diff |           -.2444469    .0427623               -.3282649   -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

. table sex if age>24 & age<35, contents(freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

Male |       9,027     13.31212     2.967666     .0312351

Female |       9,511     13.55657     2.854472     .0292693

--------------------------------------------------------------

* Reviewing the mean, sd, and standard error of yrsed by gender.

. display 2.967666/(sqrt(9027))

.03123513

* Remember that the standard error of the mean is simply sd/sqrt(n).
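
* A minimal check of the same formula for the female group, using the sd and n from the table above; the result should match the tabled sem(yrsed) of .0292693 up to rounding.

. display 2.854472/sqrt(9511)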

. table sex if age>24 & age<35 [aweight=perwt_rounded], contents(freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

Male |       9,027      13.5574     2.819247      .029673

Female |       9,511     13.76295     2.720855     .0278992

--------------------------------------------------------------

* When we apply aweights (analytic weights), the sample size is unchanged because Stata rescales the weights to sum to the number of observations. The weighted data still have a somewhat different mean and sd, and therefore a somewhat different standard error, because the weights put more emphasis on some observations than on others.
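
* A sketch of the same point using summarize, which also accepts aweights (shown for men only, sex==1; not part of the original session). Afterwards r(N) should still be the number of sample rows, r(sum_w) the sum of the weights, and r(sd)/sqrt(r(N)) should be close to the standard error reported by table.

. quietly summarize yrsed [aweight=perwt_rounded] if age>24 & age<35 & sex==1

. display r(N) "   " r(sum_w) "   " r(sd)/sqrt(r(N))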

. table sex if age>24 & age<35 [fweight=perwt_rounded], contents(freq mean yrsed sd yrsed semean yrsed)

--------------------------------------------------------------

Sex |       Freq.  mean(yrsed)    sd(yrsed)   sem(yrsed)

----------+---------------------------------------------------

Male |    1.86e+07      13.5574     2.819091     .0006543

Female |    1.92e+07     13.76295     2.720712     .0006205

--------------------------------------------------------------

* When we apply fweights, we are telling Stata that each observation really stands for as many identical observations as its weight, about 2,000 on average here. This means that our sample size goes up dramatically (by a factor of about 2,000 compared to the aweight version above), while the mean and sd are essentially unchanged. The standard error is reduced by a factor of about sqrt(2000), or roughly 45. Note that the mean and sd are not functions of sample size, but the standard error is.
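
* A quick arithmetic check using only numbers printed in the aweight and fweight tables above (men): the ratio of the two standard errors should be close to the square root of the ratio of the two sample sizes. 1.86e+07 is the rounded frequency shown by table, so agreement is only approximate.

. display .029673/.0006543

. display sqrt(1.86e+07/9027)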

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335

Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394

---------+--------------------------------------------------------------------

combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946

---------+--------------------------------------------------------------------

diff |           -.2444469    .0427623               -.3282649   -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

* Back to our favorite t-test. How do we interpret the t-statistic of -5.7164? What probability do we attach to this statistic? The t-test output reports three probabilities; the middle one above, Pr(|T| > |t|) = 0.0000, corresponds to a two-tailed test. But the output does not show enough digits to quantify the probability, so let's ask Stata to quantify it for us.

. display normal(5.716)

.99999999

* That is the normal left-tail cumulative probability. We want the right tail, and then both tails, so:

. display 2*(1-normal(5.716))

1.091e-08

. display ttail(18536,5.7164)

5.524e-09

* ttail gives the right-hand tail probability, that is, the probability from 5.7164 to infinity, which is the relevant tail for us.

. display 2*ttail(18536,5.7164)

1.105e-08

* and that, roughly 1 in 100 million, is our two-tailed t probability of finding a statistic this large in absolute value by chance if the null hypothesis were true. So we reject the null hypothesis.
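
* Rather than retyping t and the degrees of freedom, a sketch that reads them from the results ttest leaves behind in r(): r(t), r(df_t), and the two-sided p-value r(p). The %12.0g format just asks display for more digits. Not part of the original session.

. quietly ttest yrsed if age>24 & age<35, by(sex)

. display 2*ttail(r(df_t), abs(r(t)))

. display %12.0g r(p)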

* Finding the key critical value of the normal distribution (2.5% in the upper tail):

. display invnormal(1-.025)

1.959964

* and comparing to the t-distribution: as df increases, the t-distribution approaches the normal.

. display invttail(10, .025)

2.2281389

. display invttail(100, .025)

1.9839715

. display invttail(20000, .025)

1.9600826
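
* One more way to see the convergence: the gap between the t and normal critical values at a given df, here 20000 (the choice of df is only illustrative). The gap should shrink toward zero as df grows.

. display invttail(20000, .025) - invnormal(.975)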

. tabulate sex

Sex |      Freq.     Percent        Cum.

------------+-----------------------------------

Male |     64,791       48.46       48.46

Female |     68,919       51.54      100.00

------------+-----------------------------------

Total |    133,710      100.00

. tabulate sex, nolab

Sex |      Freq.     Percent        Cum.

------------+-----------------------------------

1 |     64,791       48.46       48.46

2 |     68,919       51.54      100.00

------------+-----------------------------------

Total |    133,710      100.00

* now I am going to generate a dummy variable for male, to use in regression.

. gen byte male=0

. replace male=1 if sex==1

. tabulate sex male

|         male

Sex |         0          1 |     Total

-----------+----------------------+----------

Male |         0     64,791 |    64,791

Female |    68,919          0 |    68,919

-----------+----------------------+----------

Total |    68,919     64,791 |   133,710
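
* An equivalent one-line way to build the dummy, sketched with a hypothetical variable name male2: the logical expression sex==1 evaluates to 1 or 0, and the if qualifier leaves male2 missing wherever sex is missing. The assert stays silent if the two versions agree, which they should whenever sex has no missing values.

. gen byte male2 = (sex==1) if !missing(sex)

. assert male2 == male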

. ttest yrsed if age>24 & age<35, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male |    9027    13.31212    .0312351    2.967666    13.25089    13.37335

Female |    9511    13.55657    .0292693    2.854472    13.49919    13.61394

---------+--------------------------------------------------------------------

combined |   18538    13.43753    .0213921    2.912627     13.3956    13.47946

---------+--------------------------------------------------------------------

diff |           -.2444469    .0427623               -.3282649   -.1606289

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female)                              t =  -5.7164

Ho: diff = 0                                     degrees of freedom =    18536

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

. regress yrsed male if age>24 & age<35

Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   32.68

Model |  276.742433     1  276.742433           Prob > F      =  0.0000

Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018

Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2444469   .0427623    -5.72   0.000    -.3282649   -.1606289

_cons |   13.55657   .0298401   454.31   0.000     13.49808    13.61506

------------------------------------------------------------------------------

* Note that the t-statistic produced by the (equal-variance) t-test and the t-statistic produced by the regression are the same, and the coefficient on male equals the difference in means. Regression is just a generalization of the t-test.
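
* A sketch of that equivalence using the stored coefficients (not part of the original session): the constant is the female mean, the constant plus the male coefficient should reproduce the male mean from the t-test (13.31212), and the squared t-statistic should reproduce the regression F (32.68).

. quietly regress yrsed male if age>24 & age<35

. display _b[_cons] + _b[male]

. display (_b[male]/_se[male])^2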

. regress yrsed male if age>24 & age<35 [aweight= perwt_rounded]

(sum of wgt is   3.7786e+07)

Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   25.52

Model |  195.741395     1  195.741395           Prob > F      =  0.0000

Residual |  142186.809 18536  7.67084641           R-squared     =  0.0014

Total |  142382.551 18537   7.6809921           Root MSE      =  2.7696

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2055446   .0406899    -5.05   0.000    -.2853005   -.1257887

_cons |   13.76294   .0285199   482.57   0.000     13.70704    13.81885

------------------------------------------------------------------------------

* The aweighted regression yields a different coefficient and t-statistic, but they are of the same order of magnitude. Using aweights is one way of applying the weights while making sure that the standard errors reflect the actual sample size you have.
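
* The aweighted coefficient on male is again a difference in group means, this time the weighted means from the aweighted table earlier in this log. A check by hand, up to rounding in the tabled means:

. display 13.5574 - 13.76295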

. regress yrsed male if age>24 & age<35 [fweight= perwt_rounded]

Source |       SS       df       MS              Number of obs =37785945

-------------+------------------------------           F(  1,37785943) =52018.00

Model |  398979.047     1  398979.047           Prob > F      =  0.0000

Residual |   289818910 37785943  7.67001924           R-squared     =  0.0014

Total |   290217889 37785944  7.68057796           Root MSE      =  2.7695

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2055446   .0009012  -228.07   0.000    -.2073109   -.2037782

_cons |   13.76294   .0006317  2.2e+04   0.000     13.76171    13.76418

------------------------------------------------------------------------------

* The fweighted regression increases the sample size by a factor of about 2,000, and increases the t-statistic by a factor of about sqrt(2000), or roughly 45.
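
* A check using only numbers printed in the aweighted and fweighted regressions above: the ratio of the two t-statistics should be close to the square root of the ratio of the two reported sample sizes.

. display (-228.07)/(-5.05)

. display sqrt(37785945/18538)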

. table  occ1990 if  occ1990==95|  occ1990==125| occ1990==178, contents (freq mean inctot sd inctot)

----------------------------------------------------------------

Occupation, 1990      |

basis                 |        Freq.  mean(inctot)    sd(inctot)

----------------------+-----------------------------------------

Registered nurses |          966    40787.1677      22941.43

Sociology instructors |            6   44363.33333      6497.989

Lawyers |          441   99242.58277      71860.66

----------------------------------------------------------------

. graph box inctot if  occ1990==95|  occ1990==125| occ1990==178, over( occ1990)

* a quick look at how to make box plots.
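
* A sketch of two optional variations (not run in this session): inlist() shortens the chain of | conditions, and graph hbox draws the boxes horizontally, which can leave more room for long occupation labels.

. graph hbox inctot if inlist(occ1990, 95, 125, 178), over(occ1990)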

* one way to do t-tests for differences in some variable between two occupations:

. ttest yrsed if occ1990==95| occ1990==125, by(occ1990)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

---------+--------------------------------------------------------------------

Register |     966    15.54762    .0516706    1.605951    15.44622    15.64902

Sociolog |       6          17           0           0          17          17

---------+--------------------------------------------------------------------

combined |     972    15.55658    .0514811    1.605022    15.45556    15.65761

---------+--------------------------------------------------------------------

diff |           -1.452381    .6559623                -2.73965   -.1651122

------------------------------------------------------------------------------

diff = mean(Register) - mean(Sociolog)                        t =  -2.2141

Ho: diff = 0                                     degrees of freedom =      970

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0135         Pr(|T| > |t|) = 0.0271          Pr(T > t) = 0.9865
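
* Note that the six sociology instructors all report 17 years, so their sd is 0 while the nurses' sd is about 1.6. A sketch of the same comparison without the equal-variance assumption, using ttest's unequal option (not run in this session; with only six instructors any result deserves caution):

. ttest yrsed if inlist(occ1990, 95, 125), by(occ1990) unequal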

* generate a dummy variable for nurses.

. gen nurses=0

. replace nurses=1 if occ1990==95

. regress yrsed nurses if  occ1990==95| occ1990==125

Source |       SS       df       MS              Number of obs =     972

-------------+------------------------------           F(  1,   970) =    4.90

Model |  12.5783363     1  12.5783363           Prob > F      =  0.0271

Residual |  2488.80952   970  2.56578301           R-squared     =  0.0050

Total |  2501.38786   971   2.5760946           Root MSE      =  1.6018

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

nurses |  -1.452381   .6559623    -2.21   0.027     -2.73965   -.1651122

_cons |         17   .6539346    26.00   0.000     15.71671    18.28329

------------------------------------------------------------------------------

. regress yrsed male if age>24 & age<35

Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   32.68

Model |  276.742433     1  276.742433           Prob > F      =  0.0000

Residual |  156979.922 18536  8.46892111           R-squared     =  0.0018

Total |  157256.664 18537  8.48339343           Root MSE      =  2.9101

------------------------------------------------------------------------------

yrsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -.2444469   .0427623    -5.72   0.000    -.3282649   -.1606289

_cons |   13.55657   .0298401   454.31   0.000     13.49808    13.61506

------------------------------------------------------------------------------

* What if we changed the units of our variables? What if instead of years of education, we had months?

. gen monthsed=yrsed*12

(30484 missing values generated)

. regress monthsed male if age>24 & age<35

Source |       SS       df       MS              Number of obs =   18538

-------------+------------------------------           F(  1, 18536) =   32.68

Model |  39850.9104     1  39850.9104           Prob > F      =  0.0000

Residual |  22605108.7 18536  1219.52464           R-squared     =  0.0018

Total |  22644959.6 18537  1221.60865           Root MSE      =  34.922

------------------------------------------------------------------------------

monthsed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

male |  -2.933363   .5131471    -5.72   0.000    -3.939178   -1.927547

_cons |   162.6788   .3580818   454.31   0.000     161.9769    163.3807

------------------------------------------------------------------------------

* note that a change of scale affects our coefficient and standard error (which are in the units of the dependent variable), but the t-statistic is unchanged, because the t-statistic is unit free.
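
* A check by hand using the yrsed regression above: multiplying its coefficient and standard error by 12 should reproduce the monthsed values up to rounding, and their ratio (the t-statistic) is the same either way.

. display -.2444469*12

. display .0427623*12

. display (-.2444469*12)/(.0427623*12)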

. log close

name:  <unnamed>