*Hello students.  This is a STATA Log file.  I'll be posting to the web a series of Stata logs so that you can learn how to use the program, and how to analyze the dataset.  All my comments will follow the asterisk, which is how STATA knows that this is a comment and not an executable command.  I've made the STATA commands BOLD, and what follows the commands is STATA's response.

.

. *The first step is to load the data.

.

. *Even before you load the data, you should make sure you have a log file open, so that your commands and results are captured.
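
. *A minimal sketch of opening and closing a log, not run in this session.  The file name "project3_session1.log" is a made-up example; choose your own name and folder.

```stata
* Open a log file; "replace" overwrites an existing file of the same name
log using "project3_session1.log", replace

* ... your commands go here ...

* Close the log when you are finished so that the file is complete
log close
```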

.

. use "C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\Data and info for project 3\cps y2k full data (new).dta", clear

no room to add more observations

r(901);

 

. *I tried to open the data (using the menus File>Open; STATA printed out the equivalent command in its own command line language).  Unfortunately, STATA doesn't have enough memory allocated to open the dataset.

.

. *Because I'm using Windows, I'll set the memory allocation from within STATA.  If you're using a Mac, and you haven't set the memory allocation yet, you need to quit the program and adjust the preferences for STATA.

.

. set mem 20m

(20480k)

 

. *My commands will follow the periods, and STATA's response will follow the commands, without a leading period.  STATA says it has allocated 20480K, or 20 Megabytes of memory for data.  Now I'll try to open the dataset again.
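
. *A sketch, not run in this session: you can inspect a dataset on disk before loading it, which helps in judging whether it will fit.  The example assumes the file sits in the current working directory.  In Stata 12 and later, memory is managed automatically and the set mem step is no longer needed.

```stata
* Describe a dataset on disk without loading it into memory
describe using "cps y2k full data (new).dta"

* The short option reports only the header information, not the variable list
describe using "cps y2k full data (new).dta", short
```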

.

. use "C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\Data and info for project 3\cps y2k full data (new).dta", clear

 

. *Now the dataset has been loaded.

.

. *Next step is to describe the data:

.

. describe

 

Contains data from C:\AAA Rosie's files\new projects\Intro Soc. Methods Class\D

> ata and info for project 3\cps y2k full data (new).dta

  obs:       133,710                          

 vars:            39                          18 Apr 2001 17:46

 size:    10,830,510 (48.3% of memory free)

-------------------------------------------------------------------------------

   1. phseq     str5   %9s                    household sequence number p2

   2. pernum    byte   %8.0g                 

   3. age       byte   %8.0g                  p15

   4. maritl    byte   %26.0g      marlbl     Marital Status p17

   5. sex       byte   %8.0g       sexnm      p20

   6. vet       byte   %22.0g      vetnm      veteran status p21

   7. hga       byte   %8.0g                  Educational Attainment p22

   8. race      byte   %11.0g      racenm     p25

   9. reorigin  byte   %8.0g                  Hispanic Origin p27

  10. pppos     str2   %9s                    family sequence number within

                                                each household p46

  11. hrs1      byte   %8.0g                  hours worked last week p76

  12. occ       str3   %9s                    Occupation p106

  13. clswkr    byte   %32.0g      cwrknm     sector of worker p109

  14. grswk     int    %9.0g                  gross weekly wages p135

  15. unmem     byte   %13.0g      unnm       labor union member p139

  16. lfsr      byte   %28.0g      lfsrnm     labor force status p145

  17. ernval    float  %9.0g                  main job last year earnings p228

  18. ssval     long   %12.0g                 last year soc security payments

                                                p291

  19. pawval    int    %12.0g                 last year welfare payments p305

  20. ptotr     str2   %9s                    total person income categories

                                                p466

  21. penatvty  str3   %9s                    country of birth p722

  22. pemntvty  str3   %9s                    mother's country of birth p725

  23. pefntvty  str3   %9s                    father's country of birth p731

  24. peinusyr  str2   %9s                    time of immigration p731

  25. pxnatvty  str2   %9s                    allocation flag for country of

                                                birth p734

  26. wgt2      int    %9.0g                  rounded weight based on p50

  27. ernval2   float  %9.0g                  main job earnings, losses

                                                recoded to zero

  28. htype     byte   %37.0g      htpnm      household type h25

  29. state     byte   %8.0g                  HG-ST60, or simply state of

                                                residence h40

  30. hgmsac    str4   %9s                    metropolitan area code h44

  31. hpmsasz   byte   %8.0g                  metropolitan area size h56

  32. hcccr     byte   %8.0g                  residence in central city h58

  33. frelu18   byte   %8.0g                  number of kids in fam under 18

                                                f29

  34. povll     byte   %8.0g                  ratio of fam income to poverty

                                                level f38

  35. fwsval    float  %9.0g                  family income f48

  36. famwgt2   int    %8.0g                  adjusted family weight f233

  37. yrsed     float  %9.0g                  years of education, from hga

  38. citizen   byte   %33.0g      citnm      citizenship p733

  39. health    byte   %11.0g      hlthnm     self reported health status p800

-------------------------------------------------------------------------------

Sorted by:  race 

 

. *What does this description mean?  The top of the description gives the following information: there are 133,710 observations.  In this dataset, every observation is a different person, so that means there are 133,710 individual person records in this dataset.  That's a lot of people.  There are 39 different variables, such as age, race, occupation and so on for each of the 133,710 people.

.

. *Look at variable number 8, race.  The data description says race is stored as a numerical type (called a 'byte'), race has a set of descriptive value labels (called 'racenm'), and the variable description ('p25') means that the 'race' variable corresponds to the person survey portion of the CPS, line 25.  If you open the CPS documentation, and skip to the Data Dictionary, and go past the household survey and the family survey to the Person Survey, on line 25 you'll find the RACE question.
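
. *A sketch, not run in this session, of three standard ways to see how the numeric codes stored in race map onto the labels in racenm.

```stata
* List each numeric code and the text that the value label racenm attaches to it
label list racenm

* Tabulate race showing the underlying numeric codes instead of the labels
tabulate race, nolabel

* Show the storage type, range, and labeled values for this one variable
codebook race
```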

 

. let's look at the race data right now:

unrecognized command:  let

r(199);

 

.

. *let's look at the race data right now (in the previous line I forgot to lead

>  with an asterisk, so STATA complained.)

.

. tabulate race

 

        p25 |      Freq.     Percent        Cum.

------------+-----------------------------------

      White |     113475       84.87       84.87

      Black |      13626       10.19       95.06

Amer Indian |       1894        1.42       96.47

      Asian |       4715        3.53      100.00

------------+-----------------------------------

      Total |     133710      100.00

 

. *What does this mean?  Out of the 133,710 people surveyed, 84.87% were White, 10.19% were Black, 1.42% were American Indian, and 3.53% were Asian.

.

. *That's nice, of course.  The CPS is a nationally representative sample, and one of the things you can do is figure out not only how many Blacks and Whites there are in the sample, but by extension how many Blacks and Whites there are in the US.

.

. tabulate race [fweight=wgt2]

 

        p25 |      Freq.     Percent        Cum.

------------+-----------------------------------

      White |  224256269       82.07       82.07

      Black |   35370557       12.95       95.02

Amer Indian |    2837831        1.04       96.06

      Asian |   10769164        3.94      100.00

------------+-----------------------------------

      Total |  273233821      100.00

 

. * Using the [fweight=wgt2] weight specification, STATA applies the CPS individual weights to each person.  fweight is a frequency weight, which means the wgt2 variable contains the number of people in the US that each record in the dataset is supposed to represent.  In this case we see that Whites make up 82% of the population, and Blacks are nearly 13% of the population.  The reason these percentages are different from the percentage of records in the dataset is that Blacks are underrepresented in the sample relative to the population, and the weights make up for that.

. *Be clear in your interpretations about whether you are using weighted or unweighted data.  They mean different things, and the results will be slightly different.
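
. *A sketch, not run in this session, that reproduces the Black percentages by hand from the frequencies in the two tables above.  The first line should match the 10.19 in the unweighted table and the second the 12.95 in the weighted one.

```stata
* Unweighted share: 13,626 Black records out of 133,710 records
display 100 * 13626 / 133710

* Weighted share: 35,370,557 people out of an estimated 273,233,821
display 100 * 35370557 / 273233821

* Average number of people each record stands for (a bit over 2,000)
display 273233821 / 133710
```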

.

. *Now let's look at income:

.

. summarize ptotr

 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

   ptotr |       0

 

. *I tried to summarize the income variable ptotr, and STATA reported zero observations.  The reason is that ptotr is a categorical variable that I have stored as a string, and summarize only works on numeric variables.  (See the output after the 'describe' command.)  In analyzing data, especially data that comes from large national surveys where the data is compressed and stored in all sorts of ways, you have to be conscious about which are the categorical variables, and which are the real numbers.  So now I'll use a command that is more appropriate for a categorical variable:
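
. *A sketch, not run in this session, of converting the string codes to numbers.  The variable name ptotr_num is hypothetical.  The result is still a category code, so its mean is not an income in dollars.

```stata
* real() converts a string such as "07" to the number 7
generate byte ptotr_num = real(ptotr)

* The numeric copy can now be used in numeric comparisons
tabulate ptotr_num if ptotr_num >= 1 & ptotr_num <= 10

* Drop the helper variable so the dataset is left unchanged
drop ptotr_num
```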

.

. tabulate  ptotr

 

      total |

     person |

     income |

 categories |

       p466 |      Freq.     Percent        Cum.

------------+-----------------------------------

         00 |      30484       22.80       22.80

         01 |      16437       12.29       35.09

         02 |       5077        3.80       38.89

         03 |       7066        5.28       44.17

         04 |       5857        4.38       48.55

         05 |       6566        4.91       53.46

         06 |       4866        3.64       57.10

         07 |       5660        4.23       61.34

         08 |       4135        3.09       64.43

         09 |       5081        3.80       68.23

         10 |       3384        2.53       70.76

         11 |       4346        3.25       74.01

         12 |       2738        2.05       76.06

         13 |       3966        2.97       79.02

         14 |       2089        1.56       80.59

         15 |       3087        2.31       82.90

         16 |       1682        1.26       84.15

         17 |       2774        2.07       86.23

         18 |       1356        1.01       87.24

         19 |       1780        1.33       88.57

         20 |       1115        0.83       89.41

         21 |       1790        1.34       90.75

         22 |        868        0.65       91.39

         23 |       1053        0.79       92.18

         24 |        627        0.47       92.65

         25 |       1105        0.83       93.48

         26 |        573        0.43       93.91

         27 |        795        0.59       94.50

         28 |        421        0.31       94.82

         29 |        686        0.51       95.33

         30 |        373        0.28       95.61

         31 |        621        0.46       96.07

         32 |        327        0.24       96.32

         33 |        487        0.36       96.68

         34 |        243        0.18       96.86

         35 |        297        0.22       97.08

         36 |        191        0.14       97.23

         37 |        301        0.23       97.45

         38 |        142        0.11       97.56

         39 |        168        0.13       97.68

         40 |        117        0.09       97.77

         41 |       2979        2.23      100.00

------------+-----------------------------------

      Total |     133710      100.00

 

. *Okay.  This gives you a string of numbers, but what do they mean?  Here's where you need to turn to the documentation.  I have added labels to some of the categorical variables, but not all.  You need to go back to the documentation to interpret this variable.  The variable name description points you to p466, which means Data Dictionary, Personal Record, line 466.  There you will find out that, for instance, a value of 20 means that the individual had income from all sources in 1999 of $47,500 to $49,999.

.

. *You also will note that all 133,710 persons in the dataset are listed under one category or another.  You might suspect that not all persons in the dataset had income.  Children usually don't have any income.  So you need to look at the Data Dictionary and see which category value codes for the people who have NO Income.  In this case, the people whose code is '00' are the people that, in survey language, are NOT IN THE UNIVERSE for this question.  When you analyze data, you need to be aware that sometimes a number is really a number (a zero meaning zero income, for instance) and sometimes a number means something else (here zero refers to people whose income was never calculated).

. * Let's look at the total income categories again, excluding the 'out of universe' people
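
. *A sketch, not run in this session, of counting the out-of-universe records directly.  The table above shows 30,484 records coded 00, which would leave 133,710 - 30,484 = 103,226 records in the universe.

```stata
* Count the records coded "00" (not in the universe for this question)
count if ptotr == "00"

* Count everyone else; the two counts together should equal the full sample
count if ptotr != "00"
```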

.

. tabulate  ptotr if ptotr ~= "00"

 

      total |

     person |

     income |

 categories |

       p466 |      Freq.     Percent        Cum.

------------+-----------------------------------

         01 |      16437       15.92       15.92

         02 |       5077        4.92       20.84

         03 |       7066        6.85       27.69

         04 |       5857        5.67       33.36

         05 |       6566        6.36       39.72

         06 |       4866        4.71       44.44

         07 |       5660        5.48       49.92

         08 |       4135        4.01       53.92

         09 |       5081        4.92       58.85

         10 |       3384        3.28       62.12

         11 |       4346        4.21       66.34

         12 |       2738        2.65       68.99

         13 |       3966        3.84       72.83

         14 |       2089        2.02       74.85

         15 |       3087        2.99       77.84

         16 |       1682        1.63       79.47

         17 |       2774        2.69       82.16

         18 |       1356        1.31       83.47

         19 |       1780        1.72       85.20

         20 |       1115        1.08       86.28

         21 |       1790        1.73       88.01

         22 |        868        0.84       88.85

         23 |       1053        1.02       89.87

         24 |        627        0.61       90.48

         25 |       1105        1.07       91.55

         26 |        573        0.56       92.11

         27 |        795        0.77       92.88

         28 |        421        0.41       93.28

         29 |        686        0.66       93.95

         30 |        373        0.36       94.31

         31 |        621        0.60       94.91

         32 |        327        0.32       95.23

         33 |        487        0.47       95.70

         34 |        243        0.24       95.94

         35 |        297        0.29       96.22

         36 |        191        0.19       96.41

         37 |        301        0.29       96.70

         38 |        142        0.14       96.84

         39 |        168        0.16       97.00

         40 |        117        0.11       97.11

         41 |       2979        2.89      100.00

------------+-----------------------------------

      Total |     103226      100.00

 

. *Note the syntax of that last command.  You could read it as "tabulate ptotr if ptotr is not equal to 00".  The "00" is in quotes because it is a string (i.e. character) value rather than a number.  The ~= is STATA's way of saying 'Not Equal'.  Excluding the out of universe folks, we are down to 103,226 records.  Of those records, the cumulative percentage (last column) shows that about half of the people are in categories 1-7, and half are spread out among the higher categories.  Category 7, according to the data dictionary, corresponds to income of $15,000 to $17,499.
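
. *A sketch, not run in this session.  STATA also accepts != for 'Not Equal'.  The arithmetic lines assume that each category spans $2,500, which agrees with the dollar ranges quoted in this log for categories 7, 10 and 20; check the Data Dictionary before relying on it, especially for the lowest and highest categories.

```stata
* Same tabulation as above, written with != instead of ~=
tabulate ptotr if ptotr != "00"

* Lower and upper bounds of category 7 under the assumed $2,500 width
display (7 - 1) * 2500
display 7 * 2500 - 1
```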

.

. *Let's see how the income distribution changes if we limit ourselves to persons age 30 to 40

.

. tabulate  ptotr if ptotr ~= "00" & age > 29 & age <41

 

      total |

     person |

     income |

 categories |

       p466 |      Freq.     Percent        Cum.

------------+-----------------------------------

         01 |       2514       11.15       11.15

         02 |        702        3.11       14.26

         03 |       1007        4.47       18.73

         04 |        792        3.51       22.24

         05 |       1175        5.21       27.45

         06 |        871        3.86       31.32

         07 |       1297        5.75       37.07

         08 |        906        4.02       41.09

         09 |       1263        5.60       46.69

         10 |        803        3.56       50.25

         11 |       1225        5.43       55.68

         12 |        765        3.39       59.07

         13 |       1240        5.50       64.57

         14 |        648        2.87       67.45

         15 |        935        4.15       71.59

         16 |        483        2.14       73.74

         17 |        899        3.99       77.72

         18 |        397        1.76       79.48

         19 |        563        2.50       81.98

         20 |        288        1.28       83.26

         21 |        542        2.40       85.66

         22 |        250        1.11       86.77

         23 |        299        1.33       88.10

         24 |        188        0.83       88.93

         25 |        306        1.36       90.29

         26 |        159        0.71       90.99

         27 |        231        1.02       92.02

         28 |        107        0.47       92.49

         29 |        185        0.82       93.31

         30 |         83        0.37       93.68

         31 |        177        0.78       94.47

         32 |         81        0.36       94.82

         33 |        127        0.56       95.39

         34 |         60        0.27       95.65

         35 |         66        0.29       95.95

         36 |         44        0.20       96.14

         37 |         85        0.38       96.52

         38 |         28        0.12       96.64

         39 |         44        0.20       96.84

         40 |         26        0.12       96.95

         41 |        687        3.05      100.00

------------+-----------------------------------

      Total |      22548      100.00

 

. *Well, the distribution has changed.  Now the midpoint is around category 10, or $22,500 to $24,999, up from category 7 for the full in-universe sample.
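
. *A sketch, not run in this session, of the same age restriction written with the inrange() function, which is available in more recent versions of STATA.  inrange(age, 30, 40) is true when age is at least 30 and at most 40.

```stata
* Equivalent to: age > 29 & age < 41
tabulate ptotr if ptotr != "00" & inrange(age, 30, 40)
```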

.

. *Let's look at an income measure that is easier to interpret, ernval2

.

. summarize earnval2

earnval2 not found

r(111);

 

. summarize ernval2

 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

 ernval2 |  133710    15373.05   26884.27          0     362302 

 

. *ernval2 is 1999 main job earnings, coded as a real number rather than as a category.  (My first attempt failed because I misspelled the variable name as earnval2.)

.

. Let's look at ernval excluding the non-earners

unrecognized command:  Let

r(199);

 

. *Let's look at ernval2 excluding the non-earners (I forgot the asterisk above, again)

. summarize ernval2 if ernval >0

 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

 ernval2 |   71370    28801.05   31102.15          1     362302 

 

. *So the number of people who had positive main-job earnings the previous year (71,370) is a little over half the total sample.
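
. *A sketch, not run in this session, of checking that share.  One caution: STATA treats a missing value as larger than any number, so a condition like ernval > 0 would also pick up missing values if there were any.  Adding ernval < . guards against that.

```stata
* Count records with positive, non-missing main-job earnings
count if ernval > 0 & ernval < .

* Share of the full sample with positive earnings, in percent
display 100 * 71370 / 133710
```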

.

. *Let's see how it differs by gender.

.

. sort  sex

 

. by sex: summarize ernval

 

-> sex=    male 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |   64791    20511.42   32907.62      -9999     362302 

 

-> sex=  female 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |   68919    10513.39   18354.45      -9999     333564 

 

 

. *Oops.  I used ernval (the CPS's own variable) rather than my created ernval2.  Here you can see from the minimum of -9999 that ernval includes negative values.  Let's limit the data to people with positive earnings:
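
. *A sketch, not run in this session, of one way a variable like ernval2 could be built from ernval.  This is a guess based on the label "main job earnings, losses recoded to zero", not a record of how ernval2 was actually created.  The name ernval2_sketch is hypothetical.

```stata
* How many records have negative main-job earnings?
count if ernval < 0

* Copy ernval, then set the negative values to zero
generate float ernval2_sketch = ernval
replace ernval2_sketch = 0 if ernval < 0

* If the guess is right, this count of mismatches is zero
count if ernval2_sketch != ernval2

* Drop the helper variable so the dataset is left unchanged
drop ernval2_sketch
```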

.

. by sex: summarize ernval if ernval >0

 

-> sex=    male 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |   37422    35546.98   36598.75          1     362302 

 

-> sex=  female 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |   33948    21364.79   21253.24          1     333564 

 

 

. *Ok.  For the people who had positive main-job earnings in 1999, the men averaged about $35,547 and the women averaged about $21,365.

. *That's a big difference by gender.
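
. *A sketch, not run in this session.  The first line expresses the gap as a ratio of the two means reported above, which comes to roughly 0.60.  The second uses bysort (available from STATA 7 on), which does the sort and the by in a single command.

```stata
* Women's mean earnings as a fraction of men's mean earnings
display 21364.79 / 35546.98

* Sort by sex and summarize within each group in one step
bysort sex: summarize ernval if ernval > 0
```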

.

. *How about racial differences?

.

. sort race

 

. by race: summarize ernval if ernval > 0

 

-> race=      White 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |   61522    29289.83      31726          1     362302 

 

-> race=      Black 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |    6478    24135.99   23170.03          1     257525 

 

-> race=Amer Indian 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |     864     20402.3   24021.88          1     362302 

 

-> race=      Asian 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |    2506    31756.48   34032.78          1     284133 

 

 

. * Now let's look at the race and gender differences.

.

. sort race sex

 

. by race sex: summarize ernval if ernval >0

 

-> race=      White  sex=    male 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |   32761     36264.5   37243.12          1     362302  

 

-> race=      White  sex=  female 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |   28761    21345.14   21321.52          1     333564 

 

-> race=      Black  sex=    male 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |    2906    28037.98   26393.53          1     257525 

 

-> race=      Black  sex=  female 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |    3572    20961.52    19610.2          1     244805 

 

-> race=Amer Indian  sex=    male 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |     455    23529.18   27311.95          1     362302 

 

-> race=Amer Indian  sex=  female 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |     409    16923.76   19170.25          1     284133 

 

-> race=      Asian  sex=    male 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |    1300    38456.79   39868.02          1     229339 

 

-> race=      Asian  sex=  female 

Variable |     Obs        Mean   Std. Dev.       Min        Max

---------+-----------------------------------------------------

  ernval |    1206    24533.93   24365.57          1     284133 
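
. *A sketch, not run in this session, of a more compact way to get the same eight means.  tabulate with the summarize() option lays the race by sex groups out in a single two-way table, and the means option keeps only the mean in each cell.

```stata
* Two-way table of mean main-job earnings by race and sex, positive earners only
tabulate race sex if ernval > 0, summarize(ernval) means
```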

 

 

. *Okay.  Now I'm going to quit the program.  I haven't added any variables or made any changes to the data set (except re-sorting a few times), so I don't need to save the changes.

. exit, clear