Rosenfeld's section 3

-----------------------------------------------------------------------------------

log: C:\AAA Miker Files\newer web pages\soc_meth_proj3\section3_2009.log

log type: text

opened on: 10 Feb 2009, 15:25:54

. set mem 200m

Current memory allocation

                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.909M
    set memory         200M     max. data space                200.000M
    set matsize         400     max. RHS vars in models          1.254M
                                                            -----------
                                                               203.163M

. use "C:\AAA Miker Files\newer web pages\soc_meth_proj3\cps_mar_2000_new.dta", clear

. xi i.occ1990

--Break--

r(1);

*It might seem convenient to generate dummy variables for the entire set of occupational categories, but it is not a good idea: there are something like a thousand different occupational categories, so xi would add about a thousand dummy variables to your dataset, and you might run out of memory. It is much better to first create a variable containing only the 3 occupations you want, perhaps setting all others to missing, and then use that variable to generate the dummies.
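
*A minimal sketch of that approach on toy data. The occupation codes 95, 156, and 178 and the variable name occ3 are made up for illustration; substitute the three occ1990 codes you actually need.

```stata
* Toy data standing in for the CPS file (the codes are hypothetical)
clear
input occ1990
95
156
178
200
95
999
end

* Keep only the three occupations of interest; all others become missing
generate occ3 = occ1990 if inlist(occ1990, 95, 156, 178)

* xi now builds dummies for 3 categories instead of hundreds
* (by default the lowest code, here 95, is the omitted category)
xi i.occ3
describe _Iocc3*
```

*Observations with a missing occ3 get missing values on the dummies, so they drop out of any regression that uses them.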

. memory

bytes

--------------------------------------------------------------------

Details of set memory usage

overhead (pointers) 534,840 0.26%

data 45,327,690 21.61%

----------------------------

data + overhead 45,862,530 21.87%

free 163,852,662 78.13%

----------------------------

Total allocated 209,715,192 100.00%

--------------------------------------------------------------------

Other memory usage

set maxvar usage 2,001,730

set matsize usage 1,315,200

programs, saved results, etc. 76,822

---------------

Total 3,393,752

-------------------------------------------------------

Grand total 213,108,944

. *xi of occ1990 was going to make a thousand new dummy variables. That is not what we want or need.

. drop _Iocc*

. memory

bytes

--------------------------------------------------------------------

Details of set memory usage

overhead (pointers) 534,840 0.26%

data 15,242,940 7.27%

----------------------------

data + overhead 15,777,780 7.52%

free 193,937,412 92.48%

----------------------------

Total allocated 209,715,192 100.00%

--------------------------------------------------------------------

Other memory usage

set maxvar usage 2,001,730

set matsize usage 1,315,200

programs, saved results, etc. 73,958

---------------

Total 3,390,888

-------------------------------------------------------

Grand total 213,106,080

. tabulate metro

Metropolitan central city |

status | Freq. Percent Cum.

----------------------------+-----------------------------------

Not identifiable | 340 0.25 0.25

Not in metro area | 29,658 22.18 22.44

Central city | 32,481 24.29 46.73

Outside central city | 51,468 38.49 85.22

Central city status unknown | 19,763 14.78 100.00

----------------------------+-----------------------------------

Total | 133,710 100.00

. tabulate metro, nolab

Metropolita |

n central |

city status | Freq. Percent Cum.

------------+-----------------------------------

0 | 340 0.25 0.25

1 | 29,658 22.18 22.44

2 | 32,481 24.29 46.73

3 | 51,468 38.49 85.22

4 | 19,763 14.78 100.00

------------+-----------------------------------

Total | 133,710 100.00

. char metro[omit] 1

*Redoing some things we did in class: making dummy variables for metro and putting them in the regression. The char command above sets metro==1 as the category that xi will omit.

. xi i.metro

i.metro _Imetro_0-4 (naturally coded; _Imetro_1 omitted)

. regress incwage _Imetro* if age>29 & age<65 & sex==1

Source | SS df MS Number of obs = 29335

-------------+------------------------------ F( 4, 29330) = 190.17

Model | 1.1316e+12 4 2.8291e+11 Prob > F = 0.0000

Residual | 4.3633e+13 29330 1.4877e+09 R-squared = 0.0253

-------------+------------------------------ Adj R-squared = 0.0251

Total | 4.4765e+13 29334 1.5260e+09 Root MSE = 38570

------------------------------------------------------------------------------

incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_Imetro_0 | 4553.396 4006.316 1.14 0.256 -3299.164 12405.96

_Imetro_2 | 7255.712 667.5305 10.87 0.000 5947.322 8564.102

_Imetro_3 | 16013.39 593.5204 26.98 0.000 14850.06 17176.71

_Imetro_4 | 8368.313 758.1121 11.04 0.000 6882.38 9854.247

_cons | 27189.65 473.7616 57.39 0.000 26261.05 28118.24

------------------------------------------------------------------------------

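. *These are the results we got today in class.

. *metro==1 (Not in metro area, i.e. rural) is the excluded category here. Every other category is compared to rural.

. *The constant is the mean of incwage for the excluded category, and each coefficient is the difference between that category's mean and the excluded category's mean.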
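
*A toy illustration of the comments above, using made-up data (the variables y and g are hypothetical, not from the CPS file). With group 2 set as the omitted category, the constant should equal the group 2 mean, and each coefficient should equal that group's mean minus the group 2 mean.

```stata
* Three groups with means 2, 6, and 10
clear
input y g
1 1
3 1
5 2
7 2
9 3
11 3
end

* Make g==2 the omitted (reference) category, then build the dummies
char g[omit] 2
xi i.g
regress y _Ig*

* Constant = mean of the omitted group (6);
* coefficients = group mean minus omitted-group mean (2-6 and 10-6)
display _b[_cons]
display _b[_Ig_1]
display _b[_Ig_3]
```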

. table metro if age>29 & age<65 & sex==1, contents (mean incwage)

-------------------------------------------

Metropolitan central city |

status | mean(incwage)

----------------------------+--------------

Not identifiable | 31743.04255

Not in metro area | 27189.6465

Central city | 34445.35841

Outside central city | 43203.0348

Central city status unknown | 35557.95997

-------------------------------------------
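
*A quick arithmetic check, using only numbers from the regression and the table above: the constant plus each coefficient should reproduce that category's mean, up to rounding in the displayed coefficients.

```stata
* constant + coefficient = category mean (compare with the table output)

* Not identifiable (table: 31743.04255)
display 27189.65 + 4553.396

* Central city (table: 34445.35841)
display 27189.65 + 7255.712

* Outside central city (table: 43203.0348)
display 27189.65 + 16013.39

* Central city status unknown (table: 35557.95997)
display 27189.65 + 8368.313
```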

. *One variation is to include the weights

. table metro if age>29 & age<65 & sex==1 [aweight= perwt_rounded], contents (mean incwage)

-------------------------------------------

Metropolitan central city |

status | mean(incwage)

----------------------------+--------------

Not identifiable | 32020.1697

Not in metro area | 27344.17913

Central city | 34517.6849

Outside central city | 43963.55122

Central city status unknown | 35398.57026

-------------------------------------------

. table metro if age>29 & age<65 & sex==1 [aweight= perwt_rounded], contents (mean incwage sd incwage freq)

-------------------------------------------------------------------------

Metropolitan central city |

status | mean(incwage) sd(incwage) Freq.

----------------------------+--------------------------------------------

Not identifiable | 32020.1697 27352.47 94

Not in metro area | 27344.17913 28233.76 6,628

Central city | 34517.6849 38462.56 6,727

Outside central city | 43963.55122 44645.15 11,639

Central city status unknown | 35398.57026 36143.29 4,247

-------------------------------------------------------------------------

. *The thing about aweight is that it adjusts the mean (and the sd) but not the N: compare the Freq. column above with the unweighted table below.
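
*A toy example of that point (y and w are made-up values, not CPS variables): the aweighted mean moves toward the heavily weighted observation, while the number of observations stays at 2.

```stata
clear
input y w
10 1
20 3
end

* Unweighted: mean = (10 + 20)/2 = 15
summarize y
display r(mean)

* aweighted: mean = (10*1 + 20*3)/(1 + 3) = 17.5, but N is still 2
summarize y [aweight = w]
display r(mean)
display r(N)
```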

. table metro if age>29 & age<65 & sex==1, contents (mean incwage sd incwage freq)

-------------------------------------------------------------------------

Metropolitan central city |

status | mean(incwage) sd(incwage) Freq.

----------------------------+--------------------------------------------

Not identifiable | 31743.04255 27474.74 94

Not in metro area | 27189.6465 28299.05 6,628

Central city | 34445.35841 38491.83 6,727

Outside central city | 43203.0348 44057.68 11,639

Central city status unknown | 35557.95997 36639.06 4,247

-------------------------------------------------------------------------

. *In theory, the aweighted analysis is better, because the person weights are meant to make the sample representative of the population.

. regress incwage _Imetro* if age>29 & age<65 & sex==1 [aweight= perwt_rounded]

(sum of wgt is 6.0783e+07)

Source | SS df MS Number of obs = 29335

-------------+------------------------------ F( 4, 29330) = 191.94

Model | 1.1913e+12 4 2.9784e+11 Prob > F = 0.0000

Residual | 4.5511e+13 29330 1.5517e+09 R-squared = 0.0255

-------------+------------------------------ Adj R-squared = 0.0254

Total | 4.6703e+13 29334 1.5921e+09 Root MSE = 39392

------------------------------------------------------------------------------

incwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

_Imetro_0 | 4675.991 4166.002 1.12 0.262 -3489.561 12841.54

_Imetro_2 | 7173.506 713.0054 10.06 0.000 5775.983 8571.028

_Imetro_3 | 16619.37 632.9456 26.26 0.000 15378.77 17859.97

_Imetro_4 | 8054.391 819.5563 9.83 0.000 6448.024 9660.758

_cons | 27344.18 529.7572 51.62 0.000 26305.83 28382.53

------------------------------------------------------------------------------

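. *Standard regression is really a regression of mean values: with only dummy variables on the right-hand side, the constant (27344.18) is the weighted mean for Not in metro area, and the constant plus a coefficient gives that category's weighted mean (for example, 27344.18 + 16619.37 = 43963.55 for Outside central city).

*Rather than use xi, I prefer another dummy variable generator called desmat, a free add-on to Stata.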

. ssc install desmat, replace

checking desmat consistency and verifying not already installed...

the following files will be replaced:

c:\ado\stbplus\d\desmat.ado

installing into c:\ado\stbplus\...

installation complete.

. desmat: regress incwage metro=ind(2) if age>29 & age<65 & sex==1 [aweight= perwt_rounded]

---------------------------------------------------------------------------------

Linear regression

---------------------------------------------------------------------------------

Dependent variable incwage

Number of observations: 29335

aweight: perwt_rounded

F statistic: 191.942

Model degrees of freedom: 4

Residual degrees of freedom: 29330

R-squared: 0.026

Adjusted R-squared: 0.025

Root MSE 39391.542

Prob: 0.000

---------------------------------------------------------------------------------

nr Effect Coeff s.e.

---------------------------------------------------------------------------------

metro

1 Not identifiable 4675.991 4166.002

2 Central city 7173.506** 713.005

3 Outside central city 16619.372** 632.946

4 Central city status unknown 8054.391** 819.556

5 _cons 27344.179** 529.757

---------------------------------------------------------------------------------

* p < .05

** p < .01

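. *I used desmat to run the regression and to make the dummy variables at the same time. Writing metro=ind(2) tells desmat to use the second category of metro (metro==1, Not in metro area) as the excluded category, so the coefficients match the weighted xi regression above.

. *desmat creates its own dummy variables, naming them _x_1, _x_2, etc.

*How do we calculate a p-value from a given t statistic? Let's say the t statistic is 2.5 and the N is 1500. ttail takes the degrees of freedom first (N minus the number of estimated coefficients, which is close to N when N is large) and the t statistic second.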

. display ttail (1500, 2.5)

ttail not found

r(111);

. display ttail(1500, 2.5)

.0062627

. *How to generate p-values from a t statistic:

. *ttail gives only the one-sided (upper-tail) probability, so for a two-tailed test we double the value Stata gives us.

. display .0062627*2

.0125254

. *a little more than 1%, but less than 5%

. display 2*(ttail(20, 2.5))

.02123355

. *Note that ttail, like other Stata functions that take their arguments in parentheses, does not allow a space between the function name and the opening parenthesis. That is why the first attempt above failed with r(111).
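
*Putting the p-value steps together in one place. The t statistic of -2.5 in the last line is a made-up case, added to show that abs() is needed for negative t statistics, since ttail returns the upper-tail probability.

```stata
* Two-sided p-values for t = 2.5 with 1500 and with 20 degrees of freedom
display 2*ttail(1500, 2.5)
display 2*ttail(20, 2.5)

* For a negative t statistic, take the absolute value first
display 2*ttail(20, abs(-2.5))
```

*The first two lines reproduce the values computed above; the third gives the same answer as the second because the t distribution is symmetric.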

. clear all

. exit, clear