Homework 2 [revised 5/5/2019]


Note: Unless otherwise noted (see the new Q4), do these exercises without using weights.



1) Using the 2000 March CPS, compare the 1999 wage and salary earnings (variable incwage) of lawyers (occ1990==178), nurses (occ1990==95), and sociology teachers (occ1990==125).


            a) How many individuals are there in these 3 occupations in the 2000 CPS?

            b) What are the average income, the median income, the 25th percentile, and the 75th percentile of income for each occupation?

            c) How much overlap is there between the incomes of the 3 professions?

            d) Make a box graph of the income distribution of the 3 occupations, side by side. Copy the box plot into MS Word or print it to a file so you can access it and copy it into your homework later.

            e) What is the standard deviation of the income for each group? What is the standard error of the mean for each group? Why is the standard error of the mean for sociology teachers’ incomes relatively large?

            f) Use Excel to calculate the t-statistics for each comparison (Sociologist-Nurse, Sociologist-Lawyer, Nurse-Lawyer). Calculate the T-statistics with unequal variance and the T-statistics with equal variance (for formulas you can apply, see the class Excel file, worksheet “HW 2 ttests”), and show these t-statistics and their associated T-probabilities in a table.

            g) Based on your Excel calculations, was the difference between lawyers’ and nurses’ mean incomes significant in 1999? How about lawyers and sociologists? Nurses and sociologists? Based on your calculations, what is the probability that the income gap between each of the 3 pairs is a result of random sampling variation? Do the equal variance and unequal variance t-tests yield different substantive answers in any of the 3 cases? If so, why?

            h) [NEW for Soc 381 in 2012; Soc 180B can ignore Q1h]: Use either Stata or Excel to generate the two-tailed probabilities, for the Normal distribution, associated with each of the 6 T-tests (3 different comparisons, under both the equal and unequal variance assumptions), and put the Normal probabilities next to the T-probabilities in the table from Q1f. In this case, the T-probabilities are the correct probabilities, and the Normal probabilities are just for comparison. Is there a substantive difference between the Normal and the T probabilities? For which cases are the Normal and T probabilities most similar, for which cases are they most different, and why?

            i) Use Stata to run and report both unequal variance and equal variance 2-sample T-tests on each of the 3 pairs of income comparisons (use the “unequal” option to create the unequal variance t-statistic; equal variance is the default). What do the results show?

            j) Use Stata to run a simple regression of incwage on occ1990 as the only predictor (generate dummy variables for occupation), including only 2 occupations at a time (one dummy variable, one comparison category). You will get an answer exactly the same as one of the T-tests (equal or unequal variance); which T-test is it?

            k) Now, include more than two occupational groups in your regression. Comment on the results of this last regression (be sure to discuss the mean differences, the standard errors, and the significance of the income differences between groups) and compare them to the results of the regression that included only two occupations.
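
The hand calculations in Q1b, Q1e, Q1f, and Q1h can be cross-checked with a short Python sketch like the one below. It is only an illustration, not a substitute for the class Excel worksheet or for Stata: the incomes are made-up numbers (not CPS data), and the function names describe, t_equal, t_unequal, and normal_two_tail are hypothetical names chosen here. The unequal-variance version uses the Satterthwaite degrees of freedom, and quartile conventions vary between packages, so percentiles may differ slightly from what Stata or Excel report.

```python
import math
import statistics

# Hypothetical incomes (NOT CPS data), used only to exercise the formulas.
lawyers = [52000, 61000, 75000, 88000, 120000, 150000]
nurses = [28000, 33000, 36000, 40000, 41000, 45000, 52000]


def describe(x):
    """N, mean, quartiles, SD, and standard error of the mean."""
    n = len(x)
    p25, median, p75 = statistics.quantiles(x, n=4)
    sd = statistics.stdev(x)  # sample SD, n - 1 in the denominator
    return {"n": n, "mean": statistics.mean(x), "p25": p25,
            "median": median, "p75": p75, "sd": sd, "se": sd / math.sqrt(n)}


def t_equal(x, y):
    """Equal-variance (pooled) t-statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / se, n1 + n2 - 2


def t_unequal(x, y):
    """Unequal-variance t-statistic with Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    a = statistics.variance(x) / n1  # squared SE of the first mean
    b = statistics.variance(y) / n2  # squared SE of the second mean
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


def normal_two_tail(z):
    """Two-tailed probability of |Z| >= |z| under the standard Normal."""
    return math.erfc(abs(z) / math.sqrt(2))


for name, group in (("lawyers", lawyers), ("nurses", nurses)):
    print(name, describe(group))
t_eq, df_eq = t_equal(lawyers, nurses)
t_un, df_un = t_unequal(lawyers, nurses)
print("equal variance:", t_eq, df_eq, normal_two_tail(t_eq))
print("unequal variance:", t_un, df_un, normal_two_tail(t_un))
```

The sketch stops at the t-statistic, its degrees of freedom, and the Normal probability; the T-probabilities themselves still come from Excel or Stata, since the Python standard library has no t distribution.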


2)         a) Show that Average(a+bXi)= a+b(Average(Xi)) (easy)

            b) Show that Variance(a+bXi)=b^2(Variance(Xi)). In this proof, do *not* assume the property Var(bX)=b^2Var(X) (and equivalently, don’t assume you know that SD(bX)=|b|SD(X)). Treat a and b as constants, and X as a variable.


For this question, you should use the rules about summation notation and the rules about averages that I show on the first page of my “notes about mean and variance and sample statistics” (that is, the definition of summation notation and items 1 and 2). You should also use the definition of variance, which is item 4 in the same notes.


3) Now create a new variable, incwage2:

gen incwage2=incwage*2


            a) How are the average and standard deviation of incwage2 different from the average and standard deviation of incwage?

            b) Redo the Excel comparisons of income between the three professions, using incwage2 instead of incwage. How are the results different? How are they the same?

            c) Use Stata to generate 2-sample T-test comparisons of earnings between the 3 pairs of occupations. How do the T-statistics compare using incwage2 and incwage?

            d) Run the regressions with occupation predicting income, and answer the same question: does incwage2 give different results than incwage?


4) Redo the regressions of incwage on occupation, comparing sociologists’ incomes to nurses’ incomes, this time using the weights: first with aweight and then with fweight. Now use the weights to calculate the same T-statistic by hand (“by hand” in this context means to generate the weighted N, mean, and SD in Stata, and then copy those values into a copy of Rosenfeld’s Excel sheet t-test calculator). How would you interpret the T-statistic produced by regress using the aweight? How would you interpret the T-statistic produced by regress using the fweight?
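
As a rough illustration of what a "weighted N, mean, and SD" can mean, the Python sketch below treats each weight as a count of identical observations. That frequency-count reading is an assumption made here for illustration only; how Stata's aweight and fweight each treat the weights is what the question asks you to work out. The function name weighted_summary and the numbers are hypothetical (not CPS values).

```python
import math


def weighted_summary(values, weights):
    """Weighted N, mean, and SD, reading each weight as a count of identical cases."""
    n_w = sum(weights)  # weighted N: total number of cases represented
    mean = sum(w * x for x, w in zip(values, weights)) / n_w
    ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    sd = math.sqrt(ss / (n_w - 1))  # sample SD with weighted N - 1 in the denominator
    return n_w, mean, sd


# Hypothetical incomes and weights (NOT CPS values).
incomes = [30000, 42000, 55000]
weights = [2, 1, 3]
print(weighted_summary(incomes, weights))
```

Under this reading, the result is the same as writing each income out as many times as its weight says and summarizing the expanded list without weights.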