HW 4:
Revised Aug 26, 2014
1) Start with Anscombe’s data (available on my website). The data are in Excel format; you can copy them into the Stata data editor. In Stata and in Excel, plot Y_{1} vs X_{1}, then Y_{2} vs X_{2}, and so on, and superimpose the regression line of Y on X (i.e. regress y x) in each scatterplot. In your homework include the Stata versions of 2 graphs, and the Excel versions of the other two. [Note: doing regression lines in Excel requires an add-in module for statistics that you may not have. If you do all the figures for this part of the homework in Stata, that is fine too.]
a) What do you notice about the regression line in all 4 cases? How different are the datasets? Comment on how informative the regression lines are.
b) Indicate on your graphs the point with the largest absolute value residual from the linear regression (retrieve the residuals after regression by predict varname, residual).
c) There are several measures of influence, that is, of which point is the most influential over the slope of the regression line. One such measure, DFbeta, calculates alternative regression slopes by dropping one point at a time, and then reports which point’s absence would change the line’s slope the most. You can retrieve the DFbetas with the command dfbeta after your regression. Indicate the point on each graph with the largest absolute value DFbeta. Why do you think Stata cannot calculate the DFbetas for the model regress y4 x4?
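The leave-one-out idea behind DFbeta can be checked by hand. The sketch below (plain Python rather than Stata, using Anscombe’s first dataset) fits the least-squares line, then refits with each point dropped in turn; the point whose removal moves the slope most is the most influential. Note this is the raw change in slope, not Stata’s standardized DFBETA, so treat it as an illustration of the idea only.

```python
# Anscombe's first dataset (Anscombe, 1973)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def ols_slope_intercept(xs, ys):
    """Least-squares slope and intercept for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = ols_slope_intercept(x, y)  # close to 0.500 and 3.00 for all four Anscombe sets

# Leave-one-out influence: absolute change in slope when each point is dropped
influence = []
for i in range(len(x)):
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    loo_slope, _ = ols_slope_intercept(xs, ys)
    influence.append(abs(loo_slope - slope))

most_influential = max(range(len(x)), key=lambda i: influence[i])
print(f"slope={slope:.3f}, intercept={intercept:.3f}, most influential point index={most_influential}")
```

All four Anscombe datasets return nearly the same fitted line, which is the point of question 1a.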
2) Use the 50state dataset (available on my website, Stata format), which is derived from the very familiar March 2000 CPS.
a) Is there a linear relationship between pct US_born and total income?
b) How would you describe the shape of the relationship between inctot and pct US born?
c) Generate the regression line of inctot on US_born (i.e. regress inctot US_born), and add that line to the graph. Do the regressions separately with and without fweight=
d) Generate the residuals and the DFbetas for the unweighted regression above. Which state has the highest absolute value residual? Which state has the highest absolute value DFbeta? Why do you think the state with the largest (absolute value) residual and the state with the largest (absolute value) DFbeta are different?
e) Make a regression table, where regress inctot US_born [fweight=

                            Model 1      Model 2      Model 3
US_Born pct
Your first control var
Your second control var
Constant
Unweighted N
Adjusted R-square
f) This dataset has only 51 observations (50 states plus DC), so it is obviously a very reduced dataset from our original March 2000 CPS.
g) [New] Now go back to our regular individual level March, 2000 CPS dataset, and make a simple table of the average of inctot for US born adults and immigrant adults. Then run a simple OLS regression on the same age group, with dummy variable US_born as the sole predictor of inctot. Who has more income? Why are these results different from Model 1 in part 2e above?
3) There is no question 3.
4) Question 4 should be done by hand, though you can check your work in Excel or in Stata if you need the confidence boost. On the final you will have to do something like this by hand:
We select 4 numbers at random from a large set. The 4 numbers are 21, 21, 29, 29.
a) What is the Average of the 4 numbers?
b) What is the Variance of the 4 numbers? (use 1/N rather than 1/(N-1) in the Variance formula if you want the numbers to work out most easily)
c) What is the Standard Deviation of the 4 numbers?
d) What is the Standard Error of the Mean?
e) How sure are we that the average (of the large group of numbers these 4 are picked from) is greater than 21? Consult Freedman’s T-statistic table for the answer. What T-statistic value and how many degrees of freedom would this test correspond to? Is this a one- or a two-tailed test?
f) How sure are we that the average of these 4 numbers is greater than 21?
g) Let’s say we want to select more numbers from the large group in order to drive down the Standard Error of the Mean. How many numbers (approximately) would we have to draw from our large set in order to be 95% sure that the average of the whole set was within 1 point of the average that we measured? You can assume that the mean and the variance of our sample won’t change as we gather more measurements.
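The handout says you may check your by-hand work, so here is a quick check of parts (a)–(d) and (g) in Python rather than Excel or Stata. The 1.96 cutoff in (g) is the usual 95% normal value; Freedman’s t-table may give a slightly different required n.

```python
import math

nums = [21, 21, 29, 29]
n = len(nums)

mean = sum(nums) / n                               # (a) average
variance = sum((v - mean) ** 2 for v in nums) / n  # (b) using 1/N, as the hint suggests
sd = math.sqrt(variance)                           # (c) standard deviation
se = sd / math.sqrt(n)                             # (d) standard error of the mean

# (g) solve 1.96 * sd / sqrt(n_new) <= 1 for n_new, holding sd fixed
n_needed = math.ceil((1.96 * sd / 1) ** 2)

print(mean, variance, sd, se, n_needed)
```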
5) Interpretation of a regression table. In this table the dependent variable is being married to a black man, and the population is married white women. While this kind of dichotomous dependent variable would in theory be better served by logistic regression, here we will use regular OLS regression for simplicity. It will probably be easiest to do this problem in Excel.
a) Regression predicting proportion married to black men for married white women (to get percentage, multiply results by 100) in 1940, 1960, 1970, 1980, 1990, and 2000. Year2 is a continuous variable for the actual census year. fSomeColplus is a dichotomous variable for whether the married woman had at least some college education or not.
regress hus_black year2 fSomeColplus if frace==1

      Source |       SS           df        MS       Number of obs =  2446725
-------------+-----------------------------------   F(2, 2446722) =   838.32
       Model |  3.80217899         2   1.9010895    Prob > F      =   0.0000
    Residual |   5548.5404   2446722   .002267745   R-squared     =   0.0007
-------------+-----------------------------------   Adj R-squared =   0.0007
       Total |  5552.34258   2446724   .002269297   Root MSE      =   .04762

------------------------------------------------------------------------------
   hus_black |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       year2 |   .0000586   1.75e-06    33.41   0.000     .0000552    .0000621
fSomeColplus |   .0007976   .0000695    11.48   0.000     .0006614    .0009338
       _cons |   -.113863   .0034622   -32.89   0.000    -.1206487   -.1070773
------------------------------------------------------------------------------
Generate the following table of predicted values, based only on the regression results above:
Predicted percentage married to black men:
Year    wives with at least some college    wives without college education
1940
1960
1970
1980
1990
2000
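The predicted values for part (a) are just the linear prediction from the coefficients in the regression output above: constant + coefficient-times-year + the college coefficient (if any), multiplied by 100 for a percentage. A Python sketch of that arithmetic (coefficients copied from the output; the constant is taken as negative, since its t-statistic and interval bounds only make sense with a minus sign):

```python
# Coefficients from the part (a) regression output (constant assumed negative)
B_YEAR = 0.0000586
B_COLLEGE = 0.0007976
CONSTANT = -0.113863

def predicted_pct(year, some_college):
    """Predicted percent of married white women married to black men."""
    proportion = CONSTANT + B_YEAR * year + B_COLLEGE * (1 if some_college else 0)
    return 100 * proportion

for year in (1940, 1960, 1970, 1980, 1990, 2000):
    print(year, round(predicted_pct(year, True), 4), round(predicted_pct(year, False), 4))
```

Notice that a linear model can return slightly negative predicted proportions for the earliest years, which is one reason OLS is only an approximation here.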
b) Generate the same table of predicted values, based only on the regression results below:
Predicted percentage (remembering that the predicted values from the model will be in terms of proportion, theoretically between 0 and 1) married to black men (this regression replaces continuous year with a categorical year variable, excluding the first census year 1940):
xi: regress hus_black i.year2 fSomeColplus if frace==1
i.year2           _Iyear2_1940-2000    (naturally coded; _Iyear2_1940 omitted)

      Source |       SS           df        MS       Number of obs =  2446725
-------------+-----------------------------------   F(6, 2446718) =   346.20
       Model |  4.70979513         6   .784965855   Prob > F      =   0.0000
    Residual |  5547.63279   2446718   .002267377   R-squared     =   0.0008
-------------+-----------------------------------   Adj R-squared =   0.0008
       Total |  5552.34258   2446724   .002269297   Root MSE      =   .04762

------------------------------------------------------------------------------
   hus_black |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iyear2_1960 |  -.0004723   .0001217    -3.88   0.000    -.0007108   -.0002337
_Iyear2_1970 |  -.0001328   .0001197    -1.11   0.268    -.0003674    .0001019
_Iyear2_1980 |    .001089    .000118     9.23   0.000     .0008578    .0013202
_Iyear2_1990 |   .0016261   .0001185    13.72   0.000     .0013938    .0018583
_Iyear2_2000 |   .0030743   .0001199    25.64   0.000     .0028392    .0033093
fSomeColplus |   .0006729   .0000699     9.63   0.000     .0005359    .0008099
       _cons |   .0010347   .0000934    11.08   0.000     .0008517    .0012178
------------------------------------------------------------------------------
Predicted percentage married to black men:
Year    wives with at least some college    wives without college education
1940
1960
1970
1980
1990
2000
c) Take the predicted values from parts a and b above, and graph them in Excel. Put the college-educated white women and the non-college-educated white women together on the same graph, but plot the predicted values from part (a) and part (b) separately. In Excel, use XY scatter plots, with points connected by lines. Examine the two graphs. Comment on linearity (are the predicted values linear with respect to year?) and additivity (is the difference between college-educated women and non-college-educated women constant across years?).
Question 6 is an Extra Credit Question for Soc 180B/280B, but is required in Soc 381:
6) A brief logistic regression exercise, using our friendly old March 2000 CPS dataset.
a) Run a logistic regression predicting whether the subject is married, for subjects over age 16, using age, age_squared, and race as the predictor variables. The syntax will be as follows:

desmat: logit married @age @age_sq race if age>16

or

xi: logit married age age_sq i.race if age>16

or

logit married age age_sq i.race if age>16

Generate predicted values and summarize them. And don’t forget that the “or” option will give you the exponentiated, or odds ratio, version of the results:

logit married age age_sq i.race if age>16, or
b) Interpret the black coefficient (assuming white is the excluded category for race) in the above logistic regression. What is the 95% confidence interval for the black coefficient in odds ratio terms, and in log odds ratio terms? Does the 95% confidence interval for the odds ratio of the black coefficient include 1 (explain)?
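Converting between the log-odds and odds-ratio scales is just exponentiation, and the endpoints of a log-odds confidence interval exponentiate to the endpoints of the odds-ratio interval. A sketch with a made-up coefficient and standard error (these numbers are hypothetical, not the actual regression results):

```python
import math

# Hypothetical log-odds coefficient and standard error for the black dummy
coef = -0.65
se = 0.04

ci_low, ci_high = coef - 1.96 * se, coef + 1.96 * se       # log-odds 95% CI
or_point = math.exp(coef)                                  # odds ratio
or_low, or_high = math.exp(ci_low), math.exp(ci_high)      # odds-ratio 95% CI

# The OR interval includes 1 exactly when the log-odds interval includes 0
includes_one = or_low < 1 < or_high
print(or_point, or_low, or_high, includes_one)
```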
c) Explain the 5 df Likelihood Ratio Test that Stata produces in the model for question 6(a). What two models are being compared, and what is the conclusion of this Likelihood Ratio Test (in other words, what null hypothesis is being rejected)?
d) Start with the logistic regression model from 6(a); let’s call that Model 1. For Model 2, add one term, a dummy variable for whether the respondent was born in the US. For Model 3, add dummy variables for each category of metropolitan status (using the variable metro). Comment on the Likelihood Ratio Test comparison between Models 1, 2, and 3, indicating the degrees of freedom difference between the models, expected chi-square values for each comparison, what the null hypothesis is, and whether the null hypothesis is rejected. Which model fits best by the Likelihood Ratio Test? Which model fits best by BIC? Generate P-values for the LRT and the BIC comparisons between models.
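The LRT statistic for nested models is 2 times the log-likelihood difference, compared against a chi-square with df equal to the number of added terms, and BIC can be computed as -2·ll + k·ln(N). A stdlib-only sketch (the log-likelihoods below are placeholders, not CPS results, and the chi-square tail function is implemented only for df = 1 and even df):

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability; exact formulas for df=1 and even df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        # For even df the survival function is a Poisson(x/2) partial sum
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    raise NotImplementedError("odd df > 1 not handled in this sketch")

def lrt(ll_reduced, ll_full, df_diff):
    """LRT statistic and p-value for nested models."""
    stat = 2 * (ll_full - ll_reduced)
    return stat, chi2_sf(stat, df_diff)

def bic(ll, n_params, n_obs):
    """Bayesian Information Criterion; smaller is better."""
    return -2 * ll + n_params * math.log(n_obs)

# Placeholder log-likelihoods: Model 2 adds one US-born dummy to Model 1 (df_diff = 1)
stat, p = lrt(ll_reduced=-98765.4, ll_full=-98760.1, df_diff=1)
print(stat, p)
```

For the Model 2 to Model 3 comparison, df_diff would be the number of metro dummy variables added.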
e) Run the same regression as part 6(a) above (i.e. Model 1), but with regular OLS regression (i.e. the familiar Stata function regress) instead of logit regression. Generate the predicted values and summarize them. How are the predicted values different between the OLS and the logistic regression? Comment on the difference in the range of the predicted values between the OLS regression and logistic regression results. Graph the black and white actual and predicted marriage rates by respondent age, for OLS and for logistic models predicting marriage rate.
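The contrast asked about in (e) is mechanical: OLS predictions are an unbounded linear function of the predictors, while logistic predictions pass the same linear index through the inverse-logit, so they always lie strictly between 0 and 1. A toy illustration with made-up coefficients (not the actual CPS estimates):

```python
import math

def inv_logit(z):
    """Map a linear index to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Made-up coefficients for married ~ age + age^2 (illustrative only)
b0, b1, b2 = -8.0, 0.35, -0.0035

ages = range(17, 90)
ols_pred = [b0 + b1 * a + b2 * a * a for a in ages]               # can fall outside [0, 1]
logit_pred = [inv_logit(b0 + b1 * a + b2 * a * a) for a in ages]  # never leaves (0, 1)

print(min(ols_pred), max(ols_pred))
print(min(logit_pred), max(logit_pred))
```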