. *Hello students. This is a STATA log file. I'll be posting to the web a series
> of STATA logs so that you can learn how to use the program, and how to analyze
> the dataset. All my comments will follow the asterisk, which is how STATA
> knows that this is a comment and not an executable command. I've made the
> STATA commands BOLD, and what follows the commands is STATA's response.
.
. *The first step is to load the data.
.
. *Even before you load the data, you should make sure you have a log file open,
> so that your commands and results are captured.
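. *A minimal sketch of opening a log (not a line from my own session). The file
> name "session1.log" is a hypothetical placeholder. The "text" option asks
> for a plain-text log and "replace" overwrites any existing file of that
> name; both assume a reasonably recent version of STATA. If a log is already
> open, STATA will tell you so and you can skip this step.
. log using "session1.log", text replace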
.
. use "C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\Data and info for project 3\cps y2k full data (new).dta", clear
no room to add more observations
r(901);
. *I tried to open the data (using the menus File>Open; STATA printed out the
> equivalent command in its own command line language). Unfortunately, STATA
> doesn't have enough memory allocated to open the dataset.
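. *One way to check a dataset before loading it (a sketch, not a step I ran in
> this session): "describe using" reads the description from the file on
> disk without loading the data, and "short" limits the output to the summary
> lines such as the number of observations and variables. I believe newer
> versions of STATA manage memory automatically, in which case the r(901)
> error above should not come up at all.
. describe using "C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\Data and info for project 3\cps y2k full data (new).dta", short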
.
. *Because I'm using Windows, I'll set the memory allocation from within STATA.
> If you're using a Mac, and you haven't set the memory allocation yet, you
> need to quit the program and adjust the preferences for STATA.
.
. set mem 20m
(20480k)
. *My commands will follow the periods, and STATA's response will follow the
> commands, without a leading period. STATA says it has allocated 20480K, or
> 20 Megabytes of memory for data. Now I'll try to open the dataset again.
.
. use "C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\Data and info for project 3\cps y2k full data (new).dta", clear
. *Now the dataset has been loaded.
.
. *The next step is to describe the data:
.
. describe
Contains data from C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\D
> ata and info for project 3\cps y2k full data (new).dta
  obs:       133,710
 vars:            39                          18 Apr 2001 17:46
 size:    10,830,510 (48.3% of memory free)
-------------------------------------------------------------------------------
  1. phseq     str5   %9s               household sequence number p2
  2. pernum    byte   %8.0g
  3. age       byte   %8.0g             p15
  4. maritl    byte   %26.0g  marlbl    Marital Status p17
  5. sex       byte   %8.0g   sexnm     p20
  6. vet       byte   %22.0g  vetnm     veteran status p21
  7. hga       byte   %8.0g             Educational Attainment p22
  8. race      byte   %11.0g  racenm    p25
  9. reorigin  byte   %8.0g             Hispanic Origin p27
 10. pppos     str2   %9s               family sequence number within each household p46
 11. hrs1      byte   %8.0g             hours worked last week p76
 12. occ       str3   %9s               Occupation p106
 13. clswkr    byte   %32.0g  cwrknm    sector of worker p109
 14. grswk     int    %9.0g             gross weekly wages p135
 15. unmem     byte   %13.0g  unnm      labor union member p139
 16. lfsr      byte   %28.0g  lfsrnm    labor force status p145
 17. ernval    float  %9.0g             main job last year earnings p228
 18. ssval     long   %12.0g            last year soc security payments p291
 19. pawval    int    %12.0g            last year welfare payments p305
 20. ptotr     str2   %9s               total person income categories p466
 21. penatvty  str3   %9s               country of birth p722
 22. pemntvty  str3   %9s               mother's country of birth p725
 23. pefntvty  str3   %9s               father's country of birth p731
 24. peinusyr  str2   %9s               time of immigration p731
 25. pxnatvty  str2   %9s               allocation flag for country of birth p734
 26. wgt2      int    %9.0g             rounded weight based on p50
 27. ernval2   float  %9.0g             main job earnings, losses recoded to zero
 28. htype     byte   %37.0g  htpnm     household type h25
 29. state     byte   %8.0g             HG-ST60, or simply state of residence h40
 30. hgmsac    str4   %9s               metropolitan area code h44
 31. hpmsasz   byte   %8.0g             metropolitan area size h56
 32. hcccr     byte   %8.0g             residence in central city h58
 33. frelu18   byte   %8.0g             number of kids in fam under 18 f29
 34. povll     byte   %8.0g             ratio of fam income to poverty level f38
 35. fwsval    float  %9.0g             family income f48
 36. famwgt2   int    %8.0g             adjusted family weight f233
 37. yrsed     float  %9.0g             years of education, from hga
 38. citizen   byte   %33.0g  citnm     citizenship p733
 39. health    byte   %11.0g  hlthnm    self reported health status p800
-------------------------------------------------------------------------------
Sorted by:  race
. *What does this description mean? The top of the description gives the
> following information: there are 133,710 observations. In this dataset,
> every observation is a different person, so that means there are 133,710
> individual person records in this dataset. That's a lot of people. There
> are 39 different variables, such as age, race, occupation, and so on, for
> each of the 133,710 people.
.
. *Look at variable number 8, race. The data description says race is stored as
> a numerical type (called a 'byte'), race has descriptive value labels
> (called 'racenm'), and the variable description ('p25') means that the
> 'race' variable corresponds to the person survey portion of the CPS, line
> 25. If you open the CPS documentation, skip to the Data Dictionary, and go
> past the household survey and the family survey to the Person Survey, on
> line 25 you'll find the RACE question.
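. *To see the numeric codes behind those labels (a sketch; these two commands
> were not part of my session): "label list" prints the code-to-label
> mapping stored in racenm, and the "nolabel" option makes tabulate show the
> raw codes instead of the labels.
. label list racenm
. tabulate race, nolabel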
. let's look at the race data right now:
unrecognized command:  let
r(199);
.
. *Let's look at the race data right now (in the previous line I forgot to lead
> with an asterisk, so STATA complained).
.
. tabulate race
        p25 |      Freq.     Percent        Cum.
------------+-----------------------------------
      White |     113475       84.87       84.87
      Black |      13626       10.19       95.06
Amer Indian |       1894        1.42       96.47
      Asian |       4715        3.53      100.00
------------+-----------------------------------
      Total |     133710      100.00
. *What does this mean? Out of the 133,710 people surveyed, 84.87% were White,
> 10.19% were Black, 1.42% were American Indian, and 3.53% were Asian.
.
. *That's nice, of course. The CPS is a nationally representative sample, and
> one of the things you can do is figure out not only how many Blacks and
> Whites there are in the sample, but by extension how many Blacks and Whites
> there are in the US.
.
. tabulate race [fweight=wgt2]
        p25 |      Freq.     Percent        Cum.
------------+-----------------------------------
      White |  224256269       82.07       82.07
      Black |   35370557       12.95       95.02
Amer Indian |    2837831        1.04       96.06
      Asian |   10769164        3.94      100.00
------------+-----------------------------------
      Total |  273233821      100.00
. *Using the [fweight=wgt2] weight expression, STATA applies the CPS individual
> weights to each person. fweight is a frequency weight, which means the wgt2
> variable contains the number of people in the US that each record in the
> dataset is supposed to represent. In this case we see that Whites make up
> 82% of the population, and Blacks are nearly 13% of the population. The
> reason these percentages are different from the percentages of records in
> the dataset is that Blacks are underrepresented in the data, and the
> weights make up for that.
. *Be clear in your interpretations about whether you are using weighted or
> unweighted data. They mean different things, and the results will be
> slightly different.
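. *A quick arithmetic check on the two race tables above, using display as a
> calculator (a sketch; I did not run these lines in the session). The first
> line is the unweighted share of Black respondents, 13,626 out of 133,710
> records, and should match the 10.19 in the first table. The second is the
> weighted share, 35,370,557 out of 273,233,821 people, and should match the
> 12.95 in the second table.
. display %5.2f 100*13626/133710
. display %5.2f 100*35370557/273233821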
.
. *Now let's look at income:
.
. summarize ptotr
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
   ptotr |       0
. *I tried to summarize the income variable ptotr, and STATA reported zero
> observations. The reason is that ptotr is a categorical variable that I
> have stored as a string (see the output after the 'describe' command), and
> summarize only works on numeric variables. In analyzing data, especially
> data that comes from large national surveys where the data is compressed
> and stored in all sorts of ways, you have to be conscious about which are
> the categorical variables, and which are the real numbers. So now I'll use
> a command that is more appropriate for a categorical variable:
.
. tabulate ptotr
total |
person |
income |
categories |
p466 | Freq. Percent Cum.
------------+-----------------------------------
00 | 30484 22.80 22.80
01 | 16437 12.29 35.09
02 | 5077 3.80 38.89
03 | 7066 5.28 44.17
04 | 5857 4.38 48.55
05 | 6566 4.91 53.46
06 | 4866 3.64 57.10
07 | 5660 4.23 61.34
08 | 4135 3.09 64.43
09 | 5081 3.80 68.23
10 | 3384 2.53 70.76
11 | 4346 3.25 74.01
12 | 2738 2.05 76.06
13 | 3966 2.97 79.02
14 | 2089 1.56 80.59
15 | 3087 2.31 82.90
16 | 1682 1.26 84.15
17 | 2774 2.07 86.23
18 | 1356 1.01 87.24
19 | 1780 1.33 88.57
20 | 1115 0.83 89.41
21 | 1790 1.34 90.75
22 | 868 0.65 91.39
23 | 1053 0.79 92.18
24 | 627 0.47 92.65
25 | 1105 0.83 93.48
26 | 573 0.43 93.91
27 | 795 0.59 94.50
28 | 421 0.31 94.82
29 | 686 0.51 95.33
30 | 373 0.28 95.61
31 | 621 0.46 96.07
32 | 327 0.24 96.32
33 | 487 0.36 96.68
34 | 243 0.18 96.86
35 | 297 0.22 97.08
36 | 191 0.14 97.23
37 | 301 0.23 97.45
38 | 142 0.11 97.56
39 | 168 0.13 97.68
40 | 117 0.09 97.77
41 | 2979 2.23 100.00
------------+-----------------------------------
Total | 133710 100.00
. *Okay. This gives you a string of numbers, but what do they mean? Here's where
> you need to turn to the documentation. I have added labels to some of the
> categorical variables, but not all. You need to go back to the
> documentation to interpret this variable. The variable name description
> points you to p466, which means Data Dictionary, Personal Record, line 466.
> There you will find out that, for instance, a value of 20 means that the
> individual had income from all sources in 1999 of $47,500 to $49,999.
.
. *You also will note that all 133,710 persons in the dataset are listed under
> one category or another. You might suspect that not all persons in the
> dataset had income; children usually don't have any income. So you need to
> look at the Data Dictionary and see which category value codes for the
> people who have NO income. In this case, the people whose code is '00' are
> the people who, in survey language, are NOT IN THE UNIVERSE for this
> question. When you analyze data, you need to be aware that sometimes a
> number is really a number (a zero meaning zero income, for instance) and
> sometimes a number means something else (here zero refers to people whose
> income was never calculated).
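. *Before excluding anyone, you can count the out-of-universe records directly
> (a sketch, not part of my session). The count should agree with the
> frequency for category 00 in the table above, 30,484.
. count if ptotr == "00"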
. *Let's look at the total income categories again, excluding the 'out of
> universe' people.
.
. tabulate ptotr if ptotr ~= "00"
total |
person |
income |
categories |
p466 | Freq. Percent Cum.
------------+-----------------------------------
01 | 16437 15.92 15.92
02 | 5077 4.92 20.84
03 | 7066 6.85 27.69
04 | 5857 5.67 33.36
05 | 6566 6.36 39.72
06 | 4866 4.71 44.44
07 | 5660 5.48 49.92
08 | 4135 4.01 53.92
09 | 5081 4.92 58.85
10 | 3384 3.28 62.12
11 | 4346 4.21 66.34
12 | 2738 2.65 68.99
13 | 3966 3.84 72.83
14 | 2089 2.02 74.85
15 | 3087 2.99 77.84
16 | 1682 1.63 79.47
17 | 2774 2.69 82.16
18 | 1356 1.31 83.47
19 | 1780 1.72 85.20
20 | 1115 1.08 86.28
21 | 1790 1.73 88.01
22 | 868 0.84 88.85
23 | 1053 1.02 89.87
24 | 627 0.61 90.48
25 | 1105 1.07 91.55
26 | 573 0.56 92.11
27 | 795 0.77 92.88
28 | 421 0.41 93.28
29 | 686 0.66 93.95
30 | 373 0.36 94.31
31 | 621 0.60 94.91
32 | 327 0.32 95.23
33 | 487 0.47 95.70
34 | 243 0.24 95.94
35 | 297 0.29 96.22
36 | 191 0.19 96.41
37 | 301 0.29 96.70
38 | 142 0.14 96.84
39 | 168 0.16 97.00
40 | 117 0.11 97.11
41 | 2979 2.89 100.00
------------+-----------------------------------
Total | 103226 100.00
. *Note the syntax of that last command. You could read it as "tabulate ptotr if
> ptotr is not equal to 00". The "00" is in quotes because it is a string
> (i.e. character) value rather than a number. The ~= is STATA's way of
> saying 'Not Equal'. Excluding the out of universe folks, we are down to
> 103,226 records. Of those records, the cumulative percentage (last column)
> shows that about half of the people are in categories 1-7, and half are
> spread out among the higher categories. Category 7, according to the data
> dictionary, corresponds to income of $15,000 to $17,499.
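. *The two categories I have quoted so far (20 and 7) are both consistent with
> bands that are $2,500 wide, where category k runs from 2500*(k-1) to
> 2500*k - 1. That pattern is only my inference from those two values, so
> check the Data Dictionary before relying on it, especially for the lowest
> and highest categories. A sketch using display as a calculator with k = 7,
> which should give back 15000 and 17499:
. display 2500*(7-1)
. display 2500*7 - 1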
.
. *Let's see how the income distribution changes if we limit ourselves to
> persons age 30 to 40.
.
. tabulate ptotr if ptotr ~= "00" & age > 29 & age < 41
total |
person |
income |
categories |
p466 | Freq. Percent Cum.
------------+-----------------------------------
01 | 2514 11.15 11.15
02 | 702 3.11 14.26
03 | 1007 4.47 18.73
04 | 792 3.51 22.24
05 | 1175 5.21 27.45
06 | 871 3.86 31.32
07 | 1297 5.75 37.07
08 | 906 4.02 41.09
09 | 1263 5.60 46.69
10 | 803 3.56 50.25
11 | 1225 5.43 55.68
12 | 765 3.39 59.07
13 | 1240 5.50 64.57
14 | 648 2.87 67.45
15 | 935 4.15 71.59
16 | 483 2.14 73.74
17 | 899 3.99 77.72
18 | 397 1.76 79.48
19 | 563 2.50 81.98
20 | 288 1.28 83.26
21 | 542 2.40 85.66
22 | 250 1.11 86.77
23 | 299 1.33 88.10
24 | 188 0.83 88.93
25 | 306 1.36 90.29
26 | 159 0.71 90.99
27 | 231 1.02 92.02
28 | 107 0.47 92.49
29 | 185 0.82 93.31
30 | 83 0.37 93.68
31 | 177 0.78 94.47
32 | 81 0.36 94.82
33 | 127 0.56 95.39
34 | 60 0.27 95.65
35 | 66 0.29 95.95
36 | 44 0.20 96.14
37 | 85 0.38 96.52
38 | 28 0.12 96.64
39 | 44 0.20 96.84
40 | 26 0.12 96.95
41 | 687 3.05 100.00
------------+-----------------------------------
Total | 22548 100.00
. *Well, the distribution has changed. Now the midpoint is around category 10,
> or $22,500 to $24,999.
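. *An equivalent way to write that age restriction (a sketch, not part of my
> session; it assumes your version of STATA has the inrange() function and
> accepts != as a synonym for ~=). inrange(age, 30, 40) is true when age is
> between 30 and 40 inclusive, which for whole-number ages selects the same
> records as age > 29 & age < 41.
. tabulate ptotr if ptotr != "00" & inrange(age, 30, 40)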
.
. *Let's look at an income measure that is easier to interpret, earnval2
.
. summarize earnval2
earnval2 not found
r(111);
. summarize ernval2
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 ernval2 |  133710    15373.05    26884.27         0     362302
. *ernval2 is 1999 main job wages, coded as a real number rather than as a
> category.
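. *The describe output labels ernval2 as "main job earnings, losses recoded to
> zero". A sketch of how you could check that against the original ernval
> (not part of my session): count the negative values, then assert that
> ernval2 is zero wherever ernval is negative. assert prints nothing if the
> condition holds for every record, and stops with an error otherwise.
. count if ernval < 0
. assert ernval2 == 0 if ernval < 0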
.
. Let's look at ernval excluding the non-earners
unrecognized command:  Let
r(199);
. *Let's look at ernval excluding the non-earners (I forgot the asterisk above,
> again)
. summarize ernval2 if ernval >0
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 ernval2 |   71370    28801.05    31102.15         1     362302
. *So the number of people who had main job earnings the previous year (71,370
> people) is about half the total sample.
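. *With a maximum of 362,302 against a mean of 28,801, a few very high earners
> may be pulling the mean up. A sketch (not run in this session): the detail
> option adds percentiles to the summary, including the median (the 50%
> line), which you can compare with the mean.
. summarize ernval2 if ernval > 0, detail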
.
. *Let's see how it differs by gender.
.
. sort sex
. by sex: summarize ernval
-> sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   64791    20511.42    32907.62     -9999     362302
-> sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   68919    10513.39    18354.45     -9999     333564
. *Oops. I used ernval (the CPS's own variable) rather than my created ernval2.
> Here you can see that ernval includes negative incomes (the minimum is
> -9999). Let's limit the data to people with positive incomes:
.
. by sex: summarize ernval if ernval >0
-> sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   37422    35546.98    36598.75         1     362302
-> sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   33948    21364.79    21253.24         1     333564
. *OK. For the people who had jobs in 1999, the men had average main job
> earnings of $35,547 and the women had average main job earnings of $21,365.
. *That's a big difference by gender.
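. *To put a number on that gap (a sketch using display as a calculator, with the
> two means copied from the output above): the ratio of the women's mean to
> the men's mean, which works out to about 0.60.
. display %4.2f 21364.79/35546.98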
.
. *How about racial differences?
.
. sort race
. by race: summarize ernval if ernval > 0
-> race= White
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   61522    29289.83       31726         1     362302
-> race= Black
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    6478    24135.99    23170.03         1     257525
-> race=Amer Indian
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |     864     20402.3    24021.88         1     362302
-> race= Asian
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    2506    31756.48    34032.78         1     284133
. *Now let's look at the race and gender differences.
.
. sort race sex
. by race sex: summarize ernval if ernval >0
-> race= White sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   32761     36264.5    37243.12         1     362302
-> race= White sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   28761    21345.14    21321.52         1     333564
-> race= Black sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    2906    28037.98    26393.53         1     257525
-> race= Black sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    3572    20961.52     19610.2         1     244805
-> race=Amer Indian sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |     455    23529.18    27311.95         1     362302
-> race=Amer Indian sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |     409    16923.76    19170.25         1     284133
-> race= Asian sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    1300    38456.79    39868.02         1     229339
-> race= Asian sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    1206    24533.93    24365.57         1     284133
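. *A more compact way to get the same group figures (a sketch, not part of my
> session): tabulate with the summarize() option reports the mean, standard
> deviation, and frequency of ernval for each race-by-sex cell in a single
> table, and it does not need the data to be sorted first.
. tabulate race sex if ernval > 0, summarize(ernval)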
. *Okay. Now I'm going to quit the program. I haven't added any variables or
> made any changes to the data set (except re-sorting a few times), so I
> don't need to save the changes.
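. *If you opened a log at the start of the session, you can close it explicitly
> before you quit (a sketch; I believe exiting STATA also closes any open
> log, so this is a tidy habit rather than a requirement).
. log close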
. exit, clear