-----------------------------------------------------------------------------------
log: C:\AAA Miker Files\newer web pages\soc_meth_proj3\section3_2009.log
log type: text
opened on: 10 Feb 2009, 15:25:54
. set mem 200m
Current memory allocation
                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed            1.909M
    set memory         200M     max. data space                 200.000M
    set matsize         400     max. RHS vars in models           1.254M
                                                             -----------
                                                                203.163M
. use "C:\AAA Miker Files\newer web pages\soc_meth_proj3\cps_mar_2000_new.dta", clear
. xi i.occ1990
--Break--
r(1);
*It might seem convenient to generate dummy variables for the entire set of occupational categories, but it is not a good idea: there are something like a thousand different occupational categories, so xi would add that many dummy variables to your dataset, and you might run out of memory. That is why the xi command above was interrupted with Break. It is much better to first create a variable that holds only the 3 occupations you want, perhaps setting all others to missing, and then use that variable to generate the dummies.
. memory
bytes
--------------------------------------------------------------------
Details of set memory usage
overhead (pointers) 534,840 0.26%
data 45,327,690 21.61%
----------------------------
data + overhead 45,862,530 21.87%
free 163,852,662 78.13%
----------------------------
Total allocated 209,715,192 100.00%
--------------------------------------------------------------------
Other memory usage
set maxvar usage 2,001,730
set matsize usage 1,315,200
programs, saved results, etc. 76,822
---------------
Total 3,393,752
-------------------------------------------------------
Grand total 213,108,944
. *xi of occ1990 was going to make a thousand new dummy variables. That is not what we want or need.
. drop _Iocc*
. memory
bytes
--------------------------------------------------------------------
Details of set memory usage
overhead (pointers) 534,840 0.26%
data 15,242,940 7.27%
----------------------------
data + overhead 15,777,780 7.52%
free 193,937,412 92.48%
----------------------------
Total allocated 209,715,192 100.00%
--------------------------------------------------------------------
Other memory usage
set maxvar usage 2,001,730
set matsize usage 1,315,200
programs, saved results, etc. 73,958
---------------
Total 3,390,888
-------------------------------------------------------
Grand total 213,106,080
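*A sketch of the approach recommended earlier for occ1990, not run in this session: keep three occupations, set the rest to missing, and build dummies from the smaller variable. The codes 95, 156, and 178 and the variable name occ3 are placeholders, not values checked against this dataset; substitute the codes for the occupations you want.

```stata
* Copy occ1990 for the three chosen occupations; every other case becomes missing
generate occ3 = occ1990 if inlist(occ1990, 95, 156, 178)
* Check the recode before building dummies
tabulate occ3, missing
* xi now creates only two dummies (one category is omitted as the reference)
xi i.occ3
```

*Cases with occ3 missing get missing dummies, so they drop out of any regression that uses them.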
. tabulate metro
Metropolitan central city |
status | Freq. Percent Cum.
----------------------------+-----------------------------------
Not identifiable | 340 0.25 0.25
Not in metro area | 29,658 22.18 22.44
Central city | 32,481 24.29 46.73
Outside central city | 51,468 38.49 85.22
Central city status unknown | 19,763 14.78 100.00
----------------------------+-----------------------------------
Total | 133,710 100.00
. tabulate metro, nolab
Metropolita |
n central |
city status | Freq. Percent Cum.
------------+-----------------------------------
0 | 340 0.25 0.25
1 | 29,658 22.18 22.44
2 | 32,481 24.29 46.73
3 | 51,468 38.49 85.22
4 | 19,763 14.78 100.00
------------+-----------------------------------
Total | 133,710 100.00
. char metro[omit] 1
*Redoing some things we did in class: making dummy variables for metro and putting them in the regression. The char command above tells xi to omit metro==1.
. xi i.metro
i.metro _Imetro_0-4 (naturally coded; _Imetro_1 omitted)
. regress incwage _Imetro* if age>29 & age<65 & sex==1
Source | SS df MS Number of obs = 29335
-------------+------------------------------ F( 4, 29330) = 190.17
Model | 1.1316e+12 4 2.8291e+11 Prob > F = 0.0000
Residual | 4.3633e+13 29330 1.4877e+09 R-squared = 0.0253
-------------+------------------------------ Adj R-squared = 0.0251
Total | 4.4765e+13 29334 1.5260e+09 Root MSE = 38570
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imetro_0 | 4553.396 4006.316 1.14 0.256 -3299.164 12405.96
_Imetro_2 | 7255.712 667.5305 10.87 0.000 5947.322 8564.102
_Imetro_3 | 16013.39 593.5204 26.98 0.000 14850.06 17176.71
_Imetro_4 | 8368.313 758.1121 11.04 0.000 6882.38 9854.247
_cons | 27189.65 473.7616 57.39 0.000 26261.05 28118.24
------------------------------------------------------------------------------
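. *These are the results we got today in class.
. *metro==1, rural, is the excluded category here. Every other category is compared to rural.
. *The constant is the mean of incwage in the excluded category, and each coefficient is the difference between that category's mean and the excluded category's mean.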
. table metro if age>29 & age<65 & sex==1, contents (mean incwage)
-------------------------------------------
Metropolitan central city |
status | mean(incwage)
----------------------------+--------------
Not identifiable | 31743.04255
Not in metro area | 27189.6465
Central city | 34445.35841
Outside central city | 43203.0348
Central city status unknown | 35557.95997
-------------------------------------------
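*As a check on the reading of the regression: adding each coefficient to the constant should reproduce the category means in this table, for example 27189.65 + 7255.712 = 34445.36 for Central city. A minimal sketch, assuming the unweighted regress above is still the most recent estimation command:

```stata
* Constant plus the Central city coefficient: compare with the table mean for Central city
display _b[_cons] + _b[_Imetro_2]
* lincom gives the same sum along with a standard error and confidence interval
lincom _cons + _Imetro_2
```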
. *One variation is to include the weights
. table metro if age>29 & age<65 & sex==1 [aweight= perwt_rounded], contents (mean incwage)
-------------------------------------------
Metropolitan central city |
status | mean(incwage)
----------------------------+--------------
Not identifiable | 32020.1697
Not in metro area | 27344.17913
Central city | 34517.6849
Outside central city | 43963.55122
Central city status unknown | 35398.57026
-------------------------------------------
. table metro if age>29 & age<65 & sex==1 [aweight= perwt_rounded], contents (mean incwage sd incwage freq)
-------------------------------------------------------------------------
Metropolitan central city |
status | mean(incwage) sd(incwage) Freq.
----------------------------+--------------------------------------------
Not identifiable | 32020.1697 27352.47 94
Not in metro area | 27344.17913 28233.76 6,628
Central city | 34517.6849 38462.56 6,727
Outside central city | 43963.55122 44645.15 11,639
Central city status unknown | 35398.57026 36143.29 4,247
-------------------------------------------------------------------------
.
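. *The thing about aweight is that it adjusts the mean and the sd but not the N: Freq. still reports the unweighted number of cases, as the unweighted table below shows.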
. table metro if age>29 & age<65 & sex==1, contents (mean incwage sd incwage freq)
-------------------------------------------------------------------------
Metropolitan central city |
status | mean(incwage) sd(incwage) Freq.
----------------------------+--------------------------------------------
Not identifiable | 31743.04255 27474.74 94
Not in metro area | 27189.6465 28299.05 6,628
Central city | 34445.35841 38491.83 6,727
Outside central city | 43203.0348 44057.68 11,639
Central city status unknown | 35557.95997 36639.06 4,247
-------------------------------------------------------------------------
. *In theory, the aweighted analysis is better, because the person weights are meant to make the sample represent the population.
. table metro if age>29 & age<65 & sex==1 [aweight= perwt_rounded], contents (mean incwage sd incwage freq)
-------------------------------------------------------------------------
Metropolitan central city |
status | mean(incwage) sd(incwage) Freq.
----------------------------+--------------------------------------------
Not identifiable | 32020.1697 27352.47 94
Not in metro area | 27344.17913 28233.76 6,628
Central city | 34517.6849 38462.56 6,727
Outside central city | 43963.55122 44645.15 11,639
Central city status unknown | 35398.57026 36143.29 4,247
-------------------------------------------------------------------------
. regress incwage _Imetro* if age>29 & age<65 & sex==1 [aweight= perwt_rounded]
(sum of wgt is 6.0783e+07)
Source | SS df MS Number of obs = 29335
-------------+------------------------------ F( 4, 29330) = 191.94
Model | 1.1913e+12 4 2.9784e+11 Prob > F = 0.0000
Residual | 4.5511e+13 29330 1.5517e+09 R-squared = 0.0255
-------------+------------------------------ Adj R-squared = 0.0254
Total | 4.6703e+13 29334 1.5921e+09 Root MSE = 39392
------------------------------------------------------------------------------
incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imetro_0 | 4675.991 4166.002 1.12 0.262 -3489.561 12841.54
_Imetro_2 | 7173.506 713.0054 10.06 0.000 5775.983 8571.028
_Imetro_3 | 16619.37 632.9456 26.26 0.000 15378.77 17859.97
_Imetro_4 | 8054.391 819.5563 9.83 0.000 6448.024 9660.758
_cons | 27344.18 529.7572 51.62 0.000 26305.83 28382.53
------------------------------------------------------------------------------
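. *Standard regression is really a regression of mean values: here the constant, 27344.18, is the weighted mean of incwage for the excluded category in the table above, and each coefficient is a difference between weighted means.
*Rather than use xi, I prefer another dummy variable generator called desmat, a free add-on to Stata.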
. ssc install desmat, replace
checking desmat consistency and verifying not already installed...
the following files will be replaced:
c:\ado\stbplus\d\desmat.ado
installing into c:\ado\stbplus\...
installation complete.
. desmat: regress incwage metro=ind(2) if age>29 & age<65 & sex==1 [aweight= perwt_rounded]
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable incwage
Number of observations: 29335
aweight: perwt_rounded
F statistic: 191.942
Model degrees of freedom: 4
Residual degrees of freedom: 29330
R-squared: 0.026
Adjusted R-squared: 0.025
Root MSE 39391.542
Prob: 0.000
---------------------------------------------------------------------------------
nr Effect Coeff s.e.
---------------------------------------------------------------------------------
metro
1 Not identifiable 4675.991 4166.002
2 Central city 7173.506** 713.005
3 Outside central city 16619.372** 632.946
4 Central city status unknown 8054.391** 819.556
5 _cons 27344.179** 529.757
---------------------------------------------------------------------------------
* p < .05
** p < .01
. *I used desmat to run the regression and make the dummy variables at the same time. metro=ind(2) tells Stata to use the second category of metro (Not in metro area) as the excluded category, so the coefficients match the xi regression above.
. *desmat creates its own dummy variables, naming them _x_1, _x_2, etc.
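*For readers on Stata 11 or later, which postdates this log: factor-variable notation builds the dummies inside the command, so neither xi nor desmat is needed. A sketch, assuming the same variables as above; the ib1. prefix makes metro==1 the base category.

```stata
* Same weighted regression, with metro expanded into dummies by factor-variable notation
regress incwage ib1.metro if age>29 & age<65 & sex==1 [aweight=perwt_rounded]
```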
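*How to calculate p-values from a given t statistic? Let's say the t statistic is 2.5 and the N is 1500. ttail(df, t) returns the upper-tail probability for a t distribution with df degrees of freedom; with an N this large, using 1500 as the degrees of freedom is close enough.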
. display ttail (1500, 2.5)
ttail not found
r(111);
. display ttail(1500, 2.5)
.0062627
. *How to generate p-values from a t statistic:
. *for a two-tailed test, we double the p-value that Stata gives us (which is just the one-sided tail probability).
. display .0062627*2
.0125254
. *a little more than 1%, but less than 5%
. display 2*(ttail(20, 2.5))
.02123355
. *Note that ttail is a function, and Stata functions do not allow a space between the function name and the opening parenthesis.
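*A sketch that generalizes the steps above: take the absolute value so that negative t statistics also work, then double the tail for a two-sided test. The second display assumes it is run right after a regress that includes _Imetro_2, such as the xi regressions earlier in this log; that variable is used only as an example.

```stata
* Two-sided p-value for t = 2.5 with 1500 degrees of freedom
display 2*ttail(1500, abs(2.5))
* Same calculation for one coefficient, using the residual degrees of freedom saved by regress
display 2*ttail(e(df_r), abs(_b[_Imetro_2]/_se[_Imetro_2]))
```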
. clear all
. exit, clear