A sample STATA session

*Hello students. This is a STATA Log file. I'll be posting to the web a series of Stata logs so that you can learn how to use the program, and how to analyze the dataset. All my comments will follow the asterisk, which is how STATA knows that this is a comment and not an executable command. I've made the STATA commands BOLD, and what follows the commands is STATA's response.

. *The first step is to load the data.

. *Even before you load the data, you should make sure you have a log file open, so that your commands and results are captured.
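*An aside, not part of the logged session: a minimal sketch of opening and closing a log file. The file name session1.log is made up for illustration; use whatever name and folder suit you.

```stata
* Hypothetical log file name; choose your own path.
log using "session1.log", replace
* ... the commands for your session go here, for example:
describe
* Close the log when you are done so the file is complete.
log close
```

*The replace option overwrites an older log of the same name, so leave it off if you want to keep the earlier file.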

. use "C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\Data and info for project 3\cps y2k full data (new).dta", clear

no room to add more observations

r(901);

. *I tried to open the data (using the menus File>Open; STATA printed out the equivalent command in its own command line language). Unfortunately, STATA doesn't have enough memory allocated to open the dataset.

. *Because I'm using Windows, I'll set the memory allocation from within STATA. If you're using a Mac, and you haven't set the memory allocation yet, you need to quit the program and adjust the preferences for STATA.

. set mem 20m

(20480k)

. *My commands will follow the periods, and STATA's response will follow the commands, without a leading period. STATA says it has allocated 20480K, or 20 Megabytes of memory for data. Now I'll try to open the dataset again.

. use "C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\Data and info for project 3\cps y2k full data (new).dta", clear

. *Now the dataset has been loaded.

. *Next step is to describe the data:

. describe

Contains data from C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\D
> ata and info for project 3\cps y2k full data (new).dta
  obs:       133,710
 vars:            39                          18 Apr 2001 17:46
 size:    10,830,510 (48.3% of memory free)
-------------------------------------------------------------------------------
  1. phseq     str5    %9s                 household sequence number p2
  2. pernum    byte    %8.0g
  3. age       byte    %8.0g               p15
  4. maritl    byte    %26.0g    marlbl    Marital Status p17
  5. sex       byte    %8.0g     sexnm     p20
  6. vet       byte    %22.0g    vetnm     veteran status p21
  7. hga       byte    %8.0g               Educational Attainment p22
  8. race      byte    %11.0g    racenm    p25
  9. reorigin  byte    %8.0g               Hispanic Origin p27
 10. pppos     str2    %9s                 family sequence number within
                                           each household p46
 11. hrs1      byte    %8.0g               hours worked last week p76
 12. occ       str3    %9s                 Occupation p106
 13. clswkr    byte    %32.0g    cwrknm    sector of worker p109
 14. grswk     int     %9.0g               gross weekly wages p135
 15. unmem     byte    %13.0g    unnm      labor union member p139
 16. lfsr      byte    %28.0g    lfsrnm    labor force status p145
 17. ernval    float   %9.0g               main job last year earnings p228
 18. ssval     long    %12.0g              last year soc security payments
                                           p291
 19. pawval    int     %12.0g              last year welfare payments p305
 20. ptotr     str2    %9s                 total person income categories
                                           p466
 21. penatvty  str3    %9s                 country of birth p722
 22. pemntvty  str3    %9s                 mother's country of birth p725
 23. pefntvty  str3    %9s                 father's country of birth p731
 24. peinusyr  str2    %9s                 time of immigration p731
 25. pxnatvty  str2    %9s                 allocation flag for country of
                                           birth p734
 26. wgt2      int     %9.0g               rounded weight based on p50
 27. ernval2   float   %9.0g               main job earnings, losses
                                           recoded to zero
 28. htype     byte    %37.0g    htpnm     household type h25
 29. state     byte    %8.0g               HG-ST60, or simply state of
                                           residence h40
 30. hgmsac    str4    %9s                 metropolitan area code h44
 31. hpmsasz   byte    %8.0g               metropolitan area size h56
 32. hcccr     byte    %8.0g               residence in central city h58
 33. frelu18   byte    %8.0g               number of kids in fam under 18
                                           f29
 34. povll     byte    %8.0g               ratio of fam income to poverty
                                           level f38
 35. fwsval    float   %9.0g               family income f48
 36. famwgt2   int     %8.0g               adjusted family weight f233
 37. yrsed     float   %9.0g               years of education, from hga
 38. citizen   byte    %33.0g    citnm     citizenship p733
 39. health    byte    %11.0g    hlthnm    self reported health status p800
-------------------------------------------------------------------------------
Sorted by: race

. *What does this description mean? The top of the description gives the following information: there are 133,710 observations. In this dataset, every observation is a different person, so there are 133,710 individual person records. That's a lot of people. There are 39 different variables, such as age, race, occupation, and so on, for each of the 133,710 people.

. *Look at variable number 8, race. The data description says race is stored as a numerical type (called a 'byte'), race has descriptive value labels (called 'racenm'), and the variable description ('p25') means that the 'race' variable corresponds to the person-survey portion of the CPS, line 25. If you open the CPS documentation, skip to the Data Dictionary, and go past the household survey and the family survey to the Person Survey, on line 25 you'll find the RACE question.
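*An aside, not part of the logged session: if you want to see the numbers that sit behind the race labels, something like the following sketch should work once the dataset is loaded. It uses only the variable and label names shown in the describe output.

```stata
* List the numeric codes and text attached to the value label racenm.
label list racenm
* Tabulate race showing the underlying numeric codes instead of the labels.
tabulate race, nolabel
* Report the storage type, range, and labels of race in one place.
codebook race
```

*Knowing the numeric codes matters later, because conditions such as "if race == ..." compare against the stored number, not the label text.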

. let's look at the race data right now:

unrecognized command: let

r(199);

. *Let's look at the race data right now (in the previous line I forgot to lead with an asterisk, so STATA complained).

. tabulate race

        p25 |      Freq.     Percent        Cum.
------------+-----------------------------------
      White |     113475       84.87       84.87
      Black |      13626       10.19       95.06
Amer Indian |       1894        1.42       96.47
      Asian |       4715        3.53      100.00
------------+-----------------------------------
      Total |     133710      100.00

. *What does this mean? Out of the 133,710 people surveyed, 84.87% were White, 10.19% were Black, 1.42% were American Indian, and 3.53% were Asian.

. *That's nice, of course. The CPS is a nationally representative sample, and one of the things you can do is figure out not only how many Blacks and Whites there are in the sample, but by extension how many Blacks and Whites there are in the US.

. tabulate race [fweight=wgt2]

        p25 |      Freq.     Percent        Cum.
------------+-----------------------------------
      White |  224256269       82.07       82.07
      Black |   35370557       12.95       95.02
Amer Indian |    2837831        1.04       96.06
      Asian |   10769164        3.94      100.00
------------+-----------------------------------
      Total |  273233821      100.00

. * Using the [fweight=wgt2] weight expression, STATA applies the CPS individual weights to each person. fweight is a frequency weight, which means the wgt2 variable contains the number of people in the US that each record in the dataset is supposed to represent. In this case we see that Whites make up 82% of the population, and Blacks are nearly 13% of the population. The reason these percentages are different from the percentage of records in the dataset is that Blacks are underrepresented in the data (10.19% of records), and the weights make up for that.
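*An aside, not part of the logged session: the percentages in the two race tables are just each row's count divided by the table total. A quick check with STATA's display command, using the counts printed above, should reproduce the 84.87 and 82.07 figures for Whites up to rounding.

```stata
* Unweighted share of White records: 113,475 of 133,710 records.
display 100 * 113475 / 133710
* Weighted share of Whites: 224,256,269 of 273,233,821 weighted persons.
display 100 * 224256269 / 273233821
```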

. *Be clear in your interpretations about whether you are using weighted or unweighted data. They mean different things, and the results will be slightly different.

. *Now let's look at income:

. summarize ptotr

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
   ptotr |       0

. *I tried to summarize the income variable ptotr, and STATA reported zero observations. The reason is that ptotr is a categorical variable that I have stored as a string, and summarize only works on numeric variables. (See the output after the 'describe' command.) In analyzing data, especially data that comes from large national surveys where the data is compressed and stored in all sorts of ways, you have to be conscious about which are the categorical variables, and which are the real numbers. So now I'll use a command that is more appropriate for a categorical variable:
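*An aside, not part of the logged session: if you ever do need a numeric version of a string variable like ptotr, one possible sketch uses the real() function. The name ptotr_num is made up for illustration and is not in the dataset. Remember that the result is still a category code, not a dollar amount.

```stata
* ptotr_num is a hypothetical new variable; real() converts "07" to the number 7.
generate byte ptotr_num = real(ptotr)
* The codes are ordered, so percentiles of the code are interpretable;
* the mean of the codes is not a dollar figure.
summarize ptotr_num if ptotr_num > 0, detail
```

*For a categorical variable, tabulate is usually still the better first look.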

. tabulate ptotr

      total |
     person |
     income |
 categories |
       p466 |      Freq.     Percent        Cum.
------------+-----------------------------------
         00 |      30484       22.80       22.80
         01 |      16437       12.29       35.09
         02 |       5077        3.80       38.89
         03 |       7066        5.28       44.17
         04 |       5857        4.38       48.55
         05 |       6566        4.91       53.46
         06 |       4866        3.64       57.10
         07 |       5660        4.23       61.34
         08 |       4135        3.09       64.43
         09 |       5081        3.80       68.23
         10 |       3384        2.53       70.76
         11 |       4346        3.25       74.01
         12 |       2738        2.05       76.06
         13 |       3966        2.97       79.02
         14 |       2089        1.56       80.59
         15 |       3087        2.31       82.90
         16 |       1682        1.26       84.15
         17 |       2774        2.07       86.23
         18 |       1356        1.01       87.24
         19 |       1780        1.33       88.57
         20 |       1115        0.83       89.41
         21 |       1790        1.34       90.75
         22 |        868        0.65       91.39
         23 |       1053        0.79       92.18
         24 |        627        0.47       92.65
         25 |       1105        0.83       93.48
         26 |        573        0.43       93.91
         27 |        795        0.59       94.50
         28 |        421        0.31       94.82
         29 |        686        0.51       95.33
         30 |        373        0.28       95.61
         31 |        621        0.46       96.07
         32 |        327        0.24       96.32
         33 |        487        0.36       96.68
         34 |        243        0.18       96.86
         35 |        297        0.22       97.08
         36 |        191        0.14       97.23
         37 |        301        0.23       97.45
         38 |        142        0.11       97.56
         39 |        168        0.13       97.68
         40 |        117        0.09       97.77
         41 |       2979        2.23      100.00
------------+-----------------------------------
      Total |     133710      100.00

. *Okay. This gives you a string of numbers, but what do they mean? Here's where you need to turn to the documentation. I have added labels to some of the categorical variables, but not all. You need to go back to the documentation to interpret this variable. The variable name description points you to p466, which means Data Dictionary, Personal Record, line 466. There you will find out that, for instance, a value of 20 means that the individual had income from all sources in 1999 of $47,500 to $49,999.

. *You will also note that all 133,710 persons in the dataset are listed under one category or another. You might suspect that not all persons in the dataset had income; children usually don't have any income. So you need to look at the Data Dictionary and see which category value codes for the people who have NO income. In this case, the people whose code is '00' are the people that, in survey language, are NOT IN THE UNIVERSE for this question. When you analyze data, you need to be aware that sometimes a number is really a number (a zero meaning zero income, for instance) and sometimes a number means something else (here zero refers to people whose income was never calculated).

. * Let's look at the total income categories again, excluding the 'out of universe' people.

. tabulate ptotr if ptotr ~= "00"

      total |
     person |
     income |
 categories |
       p466 |      Freq.     Percent        Cum.
------------+-----------------------------------
         01 |      16437       15.92       15.92
         02 |       5077        4.92       20.84
         03 |       7066        6.85       27.69
         04 |       5857        5.67       33.36
         05 |       6566        6.36       39.72
         06 |       4866        4.71       44.44
         07 |       5660        5.48       49.92
         08 |       4135        4.01       53.92
         09 |       5081        4.92       58.85
         10 |       3384        3.28       62.12
         11 |       4346        4.21       66.34
         12 |       2738        2.65       68.99
         13 |       3966        3.84       72.83
         14 |       2089        2.02       74.85
         15 |       3087        2.99       77.84
         16 |       1682        1.63       79.47
         17 |       2774        2.69       82.16
         18 |       1356        1.31       83.47
         19 |       1780        1.72       85.20
         20 |       1115        1.08       86.28
         21 |       1790        1.73       88.01
         22 |        868        0.84       88.85
         23 |       1053        1.02       89.87
         24 |        627        0.61       90.48
         25 |       1105        1.07       91.55
         26 |        573        0.56       92.11
         27 |        795        0.77       92.88
         28 |        421        0.41       93.28
         29 |        686        0.66       93.95
         30 |        373        0.36       94.31
         31 |        621        0.60       94.91
         32 |        327        0.32       95.23
         33 |        487        0.47       95.70
         34 |        243        0.24       95.94
         35 |        297        0.29       96.22
         36 |        191        0.19       96.41
         37 |        301        0.29       96.70
         38 |        142        0.14       96.84
         39 |        168        0.16       97.00
         40 |        117        0.11       97.11
         41 |       2979        2.89      100.00
------------+-----------------------------------
      Total |     103226      100.00

. *Note the syntax of that last command. You could read it as "tabulate ptotr if ptotr is not equal to 00". The "00" is in quotes because ptotr is a string (i.e. character) variable rather than a numerical variable. The ~= is STATA's way of saying 'Not Equal'. Excluding the out of universe folks, we are down to 103,226 records. Of those records, the cumulative percentage (last column) shows that about half of the people are in categories 1-7, and half are spread out among the higher categories. Category 7, according to the data dictionary, corresponds to income of $15,000 to $17,499.
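*An aside, not part of the logged session: STATA also accepts != as a synonym for ~=, so the same restriction can be written either way. The sketch below also uses count, which reports only how many records satisfy a condition.

```stata
* != means the same thing as ~= (not equal).
tabulate ptotr if ptotr != "00"
* Count the in-universe records without printing the whole table.
count if ptotr != "00"
```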

. *Let's see how the income distribution changes if we limit ourselves to persons age 30 to 40.

. tabulate ptotr if ptotr ~= "00" & age > 29 & age <41

      total |
     person |
     income |
 categories |
       p466 |      Freq.     Percent        Cum.
------------+-----------------------------------
         01 |       2514       11.15       11.15
         02 |        702        3.11       14.26
         03 |       1007        4.47       18.73
         04 |        792        3.51       22.24
         05 |       1175        5.21       27.45
         06 |        871        3.86       31.32
         07 |       1297        5.75       37.07
         08 |        906        4.02       41.09
         09 |       1263        5.60       46.69
         10 |        803        3.56       50.25
         11 |       1225        5.43       55.68
         12 |        765        3.39       59.07
         13 |       1240        5.50       64.57
         14 |        648        2.87       67.45
         15 |        935        4.15       71.59
         16 |        483        2.14       73.74
         17 |        899        3.99       77.72
         18 |        397        1.76       79.48
         19 |        563        2.50       81.98
         20 |        288        1.28       83.26
         21 |        542        2.40       85.66
         22 |        250        1.11       86.77
         23 |        299        1.33       88.10
         24 |        188        0.83       88.93
         25 |        306        1.36       90.29
         26 |        159        0.71       90.99
         27 |        231        1.02       92.02
         28 |        107        0.47       92.49
         29 |        185        0.82       93.31
         30 |         83        0.37       93.68
         31 |        177        0.78       94.47
         32 |         81        0.36       94.82
         33 |        127        0.56       95.39
         34 |         60        0.27       95.65
         35 |         66        0.29       95.95
         36 |         44        0.20       96.14
         37 |         85        0.38       96.52
         38 |         28        0.12       96.64
         39 |         44        0.20       96.84
         40 |         26        0.12       96.95
         41 |        687        3.05      100.00
------------+-----------------------------------
      Total |      22548      100.00

. *Well, the distribution has changed. Now the midpoint is around category 10, or $22,500 to $24,999.
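*An aside, not part of the logged session: the age restriction "age > 29 & age < 41" can also be written with the inrange() function, which some people find easier to read. This is a sketch of an equivalent command, assuming age holds whole numbers as it does here.

```stata
* inrange(age, 30, 40) is true when age is at least 30 and at most 40.
tabulate ptotr if ptotr != "00" & inrange(age, 30, 40)
```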

. *Let's look at an income measure that is easier to interpret, ernval2.

. summarize earnval2

earnval2 not found

r(111);

. summarize ernval2

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 ernval2 |  133710    15373.05    26884.27         0     362302

. *ernval2 is 1999 main job wages, coded as a real number rather than as a category.

. Let's look at ernval excluding the non-earners

unrecognized command: Let

r(199);

. *Let's look at ernval excluding the non-earners (I forgot the asterisk above, again)

. summarize ernval2 if ernval >0

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 ernval2 |   71370    28801.05    31102.15         1     362302

. *So the number of people who had positive main job earnings the previous year (71,370 people) is about half the total sample.

. *Let's see how it differs by gender.

. sort sex

. by sex: summarize ernval

-> sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   64791    20511.42    32907.62     -9999     362302

-> sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   68919    10513.39    18354.45     -9999     333564

. *Oops. I used ernval (the CPS's own variable) rather than my created ernval2. Here you can see from the minimum of -9999 that ernval includes negative values (losses). Let's limit the data to people with positive earnings:
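*An aside, not part of the logged session: the describe output says ernval2 is main job earnings with losses recoded to zero. The sketch below shows one way such a variable could be built and checked. The name ernval2_check is made up for illustration, and this may not be exactly how ernval2 was created.

```stata
* ernval2_check is a hypothetical variable; start from the CPS earnings variable.
generate ernval2_check = ernval
* Recode losses (negative values) to zero.
replace ernval2_check = 0 if ernval < 0
* If the recode matches ernval2, this count should be zero.
count if ernval2 != ernval2_check
```

*Because this adds a variable, you would not want to save the dataset afterward unless you meant to keep it.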

. by sex: summarize ernval if ernval >0

-> sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   37422    35546.98    36598.75         1     362302

-> sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   33948    21364.79    21253.24         1     333564

. *Ok. For the people who had positive main job earnings in 1999, the men had average earnings of $35,547 and the women had average earnings of $21,365.

. *That's a big difference by gender.
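*An aside, not part of the logged session: the sort-then-by pattern can be written in one step with bysort, and tabulate with the summarize() option gives the group means in a single compact table. This is a sketch using the same variables and the same restriction as above.

```stata
* bysort sorts by sex and then runs summarize once per group.
bysort sex: summarize ernval if ernval > 0
* One table of means, standard deviations, and counts of ernval by sex.
tabulate sex if ernval > 0, summarize(ernval)
```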

. *How about racial differences?

. sort race

. by race: summarize ernval if ernval > 0

-> race= White
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   61522    29289.83       31726         1     362302

-> race= Black
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    6478    24135.99    23170.03         1     257525

-> race=Amer Indian
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |     864     20402.3    24021.88         1     362302

-> race= Asian
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    2506    31756.48    34032.78         1     284133

. * Now let's look at the race and gender differences.

. sort race sex

. by race sex: summarize ernval if ernval >0

-> race= White sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   32761     36264.5    37243.12         1     362302

-> race= White sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |   28761    21345.14    21321.52         1     333564

-> race= Black sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    2906    28037.98    26393.53         1     257525

-> race= Black sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    3572    20961.52     19610.2         1     244805

-> race=Amer Indian sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |     455    23529.18    27311.95         1     362302

-> race=Amer Indian sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |     409    16923.76    19170.25         1     284133

-> race= Asian sex= male
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    1300    38456.79    39868.02         1     229339

-> race= Asian sex= female
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  ernval |    1206    24533.93    24365.57         1     284133
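
*An aside, not part of the logged session: eight separate summaries are hard to compare at a glance. A two-way tabulate with the summarize() option is one way to put the group means side by side. This is a sketch using the same variables and restriction as above.

```stata
* Mean of ernval in each race-by-sex cell, for positive earners only.
* nostandard and nofreq suppress the standard deviations and cell counts.
tabulate race sex if ernval > 0, summarize(ernval) nostandard nofreq
```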

. *Okay. Now I'm going to quit the program. I haven't added any variables or made any changes to the data set (except re-sorting a few times), so I don't need to save the changes.

. exit, clear