# Instrumental variables Ed401
> mroz87 = read.table( "http://www-stat.stanford.edu/~rag/stat209/Mroz87.dat", header = T)
> names(mroz87)
[1] "lfp" "hours" "kids5" "kids618" "age" "educ"
[7] "wage" "repwage" "hushrs" "husage" "huseduc" "huswage"
[13] "faminc" "mtr" "motheduc" "fatheduc" "unem" "city"
[19] "exper" "nwifeinc" "wifecoll" "huscoll"
> # Variable definitions (also linked from the main page):
# lfp Dummy variable for labor-force participation.
# hours Wife's hours of work in 1975.
# kids5 Number of children 5 years old or younger.
# kids618 Number of children 6 to 18 years old.
# age Wife's age.
# educ Wife's educational attainment, in years.
# wage Wife's average hourly earnings, in 1975 dollars.
# repwage Wife's wage reported at the time of the 1976 interview.
# hushrs Husband's hours worked in 1975.
# husage Husband's age.
# huseduc Husband's educational attainment, in years.
# huswage Husband's wage, in 1975 dollars.
# faminc Family income, in 1975 dollars.
# mtr Marginal tax rate facing the wife.
# motheduc Wife's mother's educational attainment, in years.
# fatheduc Wife's father's educational attainment, in years.
# unem Unemployment rate in county of residence, in percentage points.
# city Dummy variable = 1 if live in large city, else 0.
# exper Actual years of wife's previous labor market experience.
# nwifeinc Non-wife income.
# wifecoll Dummy variable for wife's college attendance.
# huscoll Dummy variable for husband's college attendance.
> attach(mroz87)
> table(lfp)
lfp
0 1
325 428
> table(wage > 0)
FALSE TRUE
325 428
> detach(mroz87)
> poswage = subset(mroz87, wage > 0) # my new data subset
> poswage$logWage = log( poswage$wage ) # adding logWage to the data set for session
> attach(poswage)
> names(poswage)
[1] "lfp" "hours" "kids5" "kids618" "age" "educ"
[7] "wage" "repwage" "hushrs" "husage" "huseduc" "huswage"
[13] "faminc" "mtr" "motheduc" "fatheduc" "unem" "city"
[19] "exper" "nwifeinc" "wifecoll" "huscoll" "logWage"
> length(logWage)
[1] 428
> table(lfp) #all the women in this subset are in the workforce
lfp
1
428
> # so the poswage data subset is the 428 working women, and appended
> # to that subset is logWage
> # see what you get by taking log(wage)
> logWage[1:10]
[1] 1.21015366 0.32851207 1.51413773 0.09212329 1.52427210 1.55648008
[7] 2.12025954 2.05963416 0.75433635 1.54489939
> wage[1:10]
[1] 3.3540 1.3889 4.5455 1.0965 4.5918 4.7421 8.3333 7.8431 2.1262 4.6875
> log(3.354)
[1] 1.210154
> min(logWage)
[1] -2.054164
> min(wage) #not paid so well
[1] 0.1282
> quantile(wage)
0% 25% 50% 75% 100%
0.12820 2.26260 3.48190 4.97075 25.00000
> table(wage < 1)
FALSE TRUE
411 17
> # so it seems we have a few very poorly paid women (< $1/hr) in this dataset
> # on to fitting the regression (predictor educ) to estimate the returns to education
> lm.posWage = lm(logWage ~ educ)
> summary(lm.posWage)
Call:
lm(formula = logWage ~ educ)
Residuals:
Min 1Q Median 3Q Max
-3.10256 -0.31473 0.06434 0.40081 2.10029
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1852 0.1852 -1.000 0.318
educ 0.1086 0.0144 7.545 2.76e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.68 on 426 degrees of freedom
Multiple R-Squared: 0.1179, Adjusted R-squared: 0.1158
F-statistic: 56.93 on 1 and 426 DF, p-value: 2.761e-13
> # we get a highly significant slope (but not a big Rsq);
> # since the outcome is log(wage), each additional year of educ increases the fitted wage by about 11%
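Because the outcome is in logs, the slope reads as a percent effect. A quick base-R check of the exact percent change implied by the coefficient (the value 0.1086 is copied from the summary above):

```r
b <- 0.1086             # educ slope from the logWage regression above
100 * b                 # approximate percent increase per year of educ
100 * (exp(b) - 1)      # exact percent increase implied by the log model
```

The approximation 100*b works well for small slopes; here the exact value is about 11.5%.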
> cor(logWage,educ)^2 # R-squared for the OLS equation
[1] 0.1178826
> # now the IV fit using fatheduc as instrument (omitted vars concern)
> cor.test(educ, fatheduc) # investigate instrument
Pearson's product-moment correlation
data: educ and fatheduc
t = 9.4255, df = 426, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3337579 0.4908623
sample estimates:
cor
0.4154030
> # significant, fairly strong (moderate at least) correlation between educ and the instrument
# can do IV by hand, use tsls in the sem package, or (to be Stata-phonic) use ivreg in the AER package
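Before calling ivreg, here is a self-contained base-R sketch of the "by hand" route on simulated data; the data-generating process and all names below are invented for illustration, not taken from the Mroz data:

```r
# z instruments x; u is an omitted confounder (think ability)
set.seed(1)
n <- 2000
z <- rnorm(n)                       # instrument, playing the role of fatheduc
u <- rnorm(n)                       # omitted variable
x <- 0.5 * z + u + rnorm(n)         # predictor, confounded by u
y <- 1.0 * x + 2.0 * u + rnorm(n)   # outcome; true slope on x is 1

coef(lm(y ~ x))[2]                  # OLS: biased upward, around 1.9
b_iv <- cov(y, z) / cov(x, z)       # IV slope "by hand"
b_iv                                # close to the true slope 1
xhat <- fitted(lm(x ~ z))           # two-stage version: same slope exactly
coef(lm(y ~ xhat))[2]
```

The cov-ratio and the two-stage slope agree exactly because the first-stage fitted values are a linear function of z.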
> install.packages("AER")
> library(AER) # bring AER into session
> help(ivreg)
ivreg package:AER R Documentation
Instrumental-Variable Regression
Description:
Fit instrumental-variable regression by two-stage least squares.
This is equivalent to direct instrumental-variables estimation
when the number of instruments is equal to the number of
predictors.
Usage:
ivreg(formula, instruments, data, subset, na.action, weights, offset,
contrasts = NULL, model = TRUE, y = TRUE, x = FALSE, ...)
Arguments:
formula, instruments: formula specification(s) of the regression
relationship and the instruments. Either 'instruments' is
missing and 'formula' has three parts as in 'y ~ x1 + x2 | z1
+ z2 + z3' (recommended) or 'formula' is 'y ~ x1 + x2' and
'instruments' is a one-sided formula '~ z1 + z2 + z3' (only
for backward compatibility).
data: an optional data frame containing the variables in the model.
By default the variables are taken from the environment from
which 'ivreg' is called.
subset: an optional vector specifying a subset of observations to be
used in fitting the model.
na.action: a function that indicates what should happen when the data
contain 'NA's. The default is set by the 'na.action' option.
weights: an optional vector of weights to be used in the fitting
process.
offset: an optional offset that can be used to specify an a priori
known component to be included during fitting.
contrasts: an optional list. See the 'contrasts.arg' of
'model.matrix.default'.
model, x, y: logicals. If 'TRUE' the corresponding components of the
fit (the model frame, the model matrices, the response) are
returned.
...: further arguments passed to 'ivreg.fit'.
> ivreg1 = ivreg(logWage ~ educ| fatheduc)
> summary(ivreg1)
Call:
ivreg(formula = logWage ~ educ | fatheduc)
Residuals:
Min 1Q Median 3Q Max
-3.0870 -0.3393 0.0525 0.4042 2.0677
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.44110 0.44610 0.989 0.323
educ 0.05917 0.03514 1.684 0.093 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6894 on 426 degrees of freedom
Multiple R-Squared: 0.09344, Adjusted R-squared: 0.09131
Wald test: 2.835 on 1 and 426 DF, p-value: 0.09294
> # educ slope cut roughly in half, s.e. more than doubled, no longer significant at .05
> # you can show by math that if the omitted variable(s) are positively
> # related to both the predictor and the outcome, OLS will overestimate the slope
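For reference, the algebra being alluded to (notation is mine, not from the source; q stands for the omitted variable, e.g. ability):

```latex
\text{true model: } y=\beta_0+\beta_1 x+\gamma q+\varepsilon,
\qquad \text{omitted variable: } q=\delta_0+\delta_1 x+u
\\
\Rightarrow\; y=(\beta_0+\gamma\delta_0)+(\beta_1+\gamma\delta_1)\,x+(\varepsilon+\gamma u),
\\
\text{so OLS of } y \text{ on } x \text{ alone estimates } \beta_1+\gamma\delta_1>\beta_1
\quad\text{when } \gamma>0 \text{ and } \delta_1>0 .
```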
> cov(logWage, fatheduc)/cov(educ, fatheduc) # IV estimate of slope "by hand"
[1] 0.05917348
> .0144/.4154 # OLS standard error inflated by 1/cor(educ, fatheduc)
[1] 0.03466538
> # approximately matches the reported IV s.e. of 0.03514
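The two hand computations above are instances of the simple-regression IV formulas (my notation: x = educ, z = fatheduc, r_xz their sample correlation):

```latex
\hat\beta_{\mathrm{IV}}=\frac{\widehat{\operatorname{Cov}}(y,z)}{\widehat{\operatorname{Cov}}(x,z)},
\qquad
\operatorname{se}\big(\hat\beta_{\mathrm{IV}}\big)
=\frac{\hat\sigma_{\mathrm{IV}}}{\hat\sigma_{\mathrm{OLS}}}\cdot
\frac{\operatorname{se}\big(\hat\beta_{\mathrm{OLS}}\big)}{\lvert r_{xz}\rvert}
\;\approx\;\frac{\operatorname{se}\big(\hat\beta_{\mathrm{OLS}}\big)}{\lvert r_{xz}\rvert}
```

The last step uses the near-equality of the two residual standard errors (0.6894 vs 0.68), which is why .0144/.4154 = .0347 falls slightly below the reported .03514.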
> # do two stage least squares as 2 stages
> lm.1 = lm(educ ~ fatheduc)
> lm.2 = lm(logWage ~ fitted(lm.1))
> summary(lm.2)
Call:
lm(formula = logWage ~ fitted(lm.1))
Residuals:
Min 1Q Median 3Q Max
-3.21264 -0.37631 0.05632 0.41728 2.06040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.44110 0.46711 0.944 0.346
fitted(lm.1) 0.05917 0.03680 1.608 0.109
Residual standard error: 0.7219 on 426 degrees of freedom
Multiple R-Squared: 0.006034, Adjusted R-squared: 0.003701
F-statistic: 2.586 on 1 and 426 DF, p-value: 0.1086
> # coefficients match the ivreg output exactly; the s.e. (and R-square) differ a bit
> # because the second stage uses the wrong residuals (fitted educ rather than educ)
> cor(logWage,fatheduc)^2 # R-square for the two-stage version of IV
[1] 0.006033851
#### More on R-square: as we saw in the Wooldridge Stata output, and as you can see
# from ivreg in the appendix, the reported R-square from the IV fit is .0934, smaller
# than the OLS R-square of .1179. You can compute it as 1 - SSresiduals/SSoutcome
# for the IV fit.
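That identity can be sanity-checked on any lm fit (self-contained, simulated data); the same recipe applied to resid(ivreg1) and logWage in the session should reproduce the .0934:

```r
# R^2 = 1 - SS(residuals)/SS(outcome about its mean); toy check on an lm fit
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
r2_byhand <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
all.equal(r2_byhand, summary(fit)$r.squared)   # TRUE
# same recipe for the IV fit from the session:
# 1 - sum(resid(ivreg1)^2) / sum((logWage - mean(logWage))^2)
```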