Some housekeeping I have done with the March 2000 CPS file, which you may also want to do with your own CPS extracts:
* We want to use the CPS variable perwt to generate weighted counts for the US population. Unfortunately, the weights have two decimal places and a few of them are negative, while Stata's frequency weights (fweight) must be non-negative integers. So I created a new weight, rounded to the nearest integer, with negative values recoded to zero.
. tabulate race if year==2000 [fweight=perwt]
may not use noninteger frequency weights
r(401);
. summarize perwt
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       perwt |    896445    1522.901     838.8933  -3033.03   14280.64
. gen perwt_rounded=round( perwt)
. replace perwt_rounded=0 if perwt<0
(51 real changes made)
. label var perwt_rounded "integer perwt, negative values recoded to 0"
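If you prefer to keep this step in a do-file, the following is a minimal sketch of the same recode, meant to be run in place of the interactive commands above rather than after them. The variable names are the ones used above; the long storage type and the assert check are additions for illustration and were not part of the original session.

```stata
* Sketch: build an integer frequency weight from perwt
generate long perwt_rounded = round(perwt)
replace perwt_rounded = 0 if perwt < 0
label variable perwt_rounded "integer perwt, negative values recoded to 0"

* Added check: the new weight should be a non-negative whole number
assert perwt_rounded >= 0 & perwt_rounded == floor(perwt_rounded) if !missing(perwt_rounded)

* The weighted tabulation that failed with perwt should now run
tabulate race if year==2000 [fweight=perwt_rounded]
```

Recoding negative weights to zero means those 51 cases drop out of any weighted count, which seems a reasonable trade for being able to use fweight at all.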
* Some of the income variables had very high values that are codes for “missing” rather than real incomes (see the IPUMS documentation). I replaced these codes with the missing value code that Stata understands, the period.
. summarize inctot if year==2000
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      inctot |    133710    248066.9     409591.8    -24998     999999
. summarize inctot if year==2000, detail
                    Total personal income
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0         -24998
 5%            0         -18582
10%         1000         -13300       Obs              133710
25%         9440         -12949       Sum of Wgt.      133710
50%      26097.5                      Mean           248066.9
                        Largest       Std. Dev.      409591.8
75%       100000         999999
90%       999999         999999       Variance       1.68e+11
95%       999999         999999       Skewness       1.281213
99%       999999         999999       Kurtosis       2.662364
. replace inctot=. if inctot==999999
(219626 real changes made, 219626 to missing)
. replace inctot=. if inctot==999998
(53 real changes made, 53 to missing)
. summarize inctot if year==2000
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      inctot |    103226     26011.4     32061.48    -24998     425510
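An alternative sketch for the same cleanup uses mvdecode with Stata's extended missing values, so that the two codes stay distinguishable after recoding. The mapping of 999999 to .a and 999998 to .b is an arbitrary choice made here for illustration, and the command is meant to replace the two replace commands above, not to follow them. Other income variables may use different codes, so check the IPUMS documentation before adding them to the variable list.

```stata
* Sketch: recode the two high "missing" codes in inctot to extended missing values
* (.a and .b are arbitrary choices made here, not an IPUMS convention)
mvdecode inctot, mv(999999=.a \ 999998=.b)

* Count how many cases fell under each code
count if inctot == .a
count if inctot == .b

* summarize skips all missing values, including .a and .b
summarize inctot if year==2000
```

Keeping two separate missing codes costs nothing in later analysis, since Stata commands treat .a and .b like the plain period, but it lets you recover the original distinction if you ever need it.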
* The income variables come with value labels, but continuous variables should not have value labels, and the empty value labels will sometimes confuse graphing commands. So detach the value labels from those variables by giving a period in place of the label name:
. describe incwage
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------
incwage         long    %12.0g     incwagelbl
                                              Wage and salary income
. label val incwage .
. describe incwage
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------
incwage         long    %12.0g                Wage and salary income
. describe inctot
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------
inctot          long    %12.0g     inctotlbl
                                              Total personal income
. label val inctot .
. describe incss
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------
incss           long    %12.0g     incsslbl   Social Security income
. label val incss .
. label val incwelfr .
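The same step can be written as a loop, which is handy if your extract has many income variables. This is a minimal sketch using the four variables named above; substitute whatever continuous variables your own extract contains.

```stata
* Sketch: detach value labels from several continuous income variables at once
foreach v of varlist inctot incwage incss incwelfr {
    label values `v' .
}

* The "value label" column should now be empty for all four variables
describe inctot incwage incss incwelfr
```

Note that this only detaches the labels from the variables; the label definitions themselves (inctotlbl and so on) stay in the dataset.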