--------------------------------------------------------------------------------------------------

name:  <unnamed>

> log.log

log type:  text

opened on:   9 May 2019, 14:22:18

. *(8 variables, 11 observations pasted into data editor)

* First, I opened the Anscombe dataset in Excel (you can find it on my website, right next to the hw4 assignment: https://web.stanford.edu/~mrosenfe/soc_meth_proj3/Anscombe%27s_data.xls ). Then I copied the data, including the column headers, into the “data editor” tab in Stata, indicated that the first row held the variable names, and clicked OK. The variables (x1, y1, x2, y2, etc.) then showed up in Stata.
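
* If you would rather not copy and paste, here is a minimal sketch that types the data straight into Stata with the input command. The values below are the ones published in Anscombe's 1973 paper; I am assuming the spreadsheet holds the same numbers under the same variable names, so check them against the file.

```stata
clear
* one row per observation, columns in the order named on the input line
input x1 y1 x2 y2 x3 y3 x4 y4
10  8.04 10 9.14 10  7.46  8  6.58
 8  6.95  8 8.14  8  6.77  8  5.76
13  7.58 13 8.74 13 12.74  8  7.71
 9  8.81  9 8.77  9  7.11  8  8.84
11  8.33 11 9.26 11  7.81  8  8.47
14  9.96 14 8.10 14  8.84  8  7.04
 6  7.24  6 6.13  6  6.08  8  5.25
 4  4.26  4 3.10  4  5.39 19 12.50
12 10.84 12 9.13 12  8.15  8  5.56
 7  4.82  7 7.26  7  6.42  8  7.91
 5  5.68  5 4.74  5  5.73  8  6.89
end

* confirm what was read in
describe
```

* After running this, describe should report 8 variables and 11 observations, the same counts noted in this log.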

. *class starts here

. *(8 variables, 11 observations pasted into data editor)

* Run these scatter plots yourself to see what they look like. Tufte had all 4 of these plots on page 2 of his book.

. twoway (scatter y2 x2)

. twoway (scatter y1 x1)

. twoway (scatter y2 x2) (lfit y2 x2)

* The above syntax makes an XY scatter plot of y2 against x2; the second set of parentheses superimposes the best fit line (the regression line) on the same graph.

. regress y2 x2

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.97
       Model |  27.5000024     1  27.5000024           Prob > F      =  0.0022
    Residual |   13.776294     9  1.53069933           R-squared     =  0.6662
       Total |  41.2762964    10  4.12762964           Root MSE      =  1.2372

------------------------------------------------------------------------------
          y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |         .5   .1179638     4.24   0.002     .2331475    .7668526
       _cons |   3.000909   1.125303     2.67   0.026     .4552978     5.54652
------------------------------------------------------------------------------

. regress y1 x1

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =   17.99
       Model |  27.5100011     1  27.5100011           Prob > F      =  0.0022
    Residual |  13.7626904     9  1.52918783           R-squared     =  0.6665
       Total |  41.2726916    10  4.12726916           Root MSE      =  1.2366

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5000909   .1179055     4.24   0.002     .2333701    .7668117
       _cons |   3.000091   1.124747     2.67   0.026     .4557369    5.544445
------------------------------------------------------------------------------

*The key point to note is that the regression line is essentially the same for each numbered pair of X and Y (a slope of about 0.5 and an intercept of about 3.0), but the 4 scatter plots look very different.
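
* To check this for all four pairs at once, here is a short sketch, assuming the variables x1-x4 and y1-y4 are still in memory. It runs each regression quietly and prints only the slope, intercept, and R-squared; the lines for pairs 1 and 2 should agree with the regress output shown in this log.

```stata
* loop over the four Anscombe pairs
forvalues i = 1/4 {
    * fit y on x for this pair without printing the full table
    quietly regress y`i' x`i'
    * _b[] holds the stored coefficients, e(r2) the R-squared
    display "pair `i':  slope = " %6.4f _b[x`i'] "   intercept = " %6.4f _b[_cons] "   R-squared = " %6.4f e(r2)
}
```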

. twoway (scatter y3 x3) (lfit y3 x3)

. twoway (scatter y4 x4) (lfit y4 x4)
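
* The four plots are easier to compare side by side. A minimal sketch, again assuming x1-x4 and y1-y4 are in memory: each graph is stored under a name (panel1 to panel4, names made up for this example), and graph combine then arranges them in one figure.

```stata
* draw one scatter plot with its best fit line per pair and keep it in memory
forvalues i = 1/4 {
    twoway (scatter y`i' x`i') (lfit y`i' x`i'), title("Pair `i'") legend(off) name(panel`i', replace)
}

* put the four stored graphs into a 2 x 2 grid
graph combine panel1 panel2 panel3 panel4, rows(2)
```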

. clear all

*Now on to the 50 state dataset (the 50 states plus DC, so 51 observations), which is also right next to my HW4 assignment, on my class homepage.

. twoway (scatter incwage  NH_White_proportion, mlabel(statefip)) (lfit incwage NH_White_proportion)

. summarize  NH_White_proportion

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
NH_White_p~n |        51    .7626632    .1633623   .2354178   .9835737

. regress incwage  NH_White_proportion

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =    2.14
       Model |  18878316.5     1  18878316.5           Prob > F      =  0.1500
    Residual |   432407199    49  8824636.71           R-squared     =  0.0418
       Total |   451285515    50   9025710.3           Root MSE      =  2970.6

-------------------------------------------------------------------------------------
            incwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
NH_White_proportion |   -3761.36   2571.649    -1.46   0.150    -8929.282    1406.562
              _cons |   22161.23   2004.928    11.05   0.000     18132.18    26190.29
-------------------------------------------------------------------------------------

. predict M1_predicted

(option xb assumed; fitted values)

*One key thing to do after a regression is to generate a new variable holding the regression's predicted values, which we do above with the predict command; we call the new variable “M1_predicted.”
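
* To see what predict is doing, here is a sketch that rebuilds the fitted values by hand from the stored coefficients. It assumes the regression above is still the most recent estimation; M1_byhand is a name invented for this example.

```stata
* fitted value = intercept + slope * x, using the coefficients Stata stored
gen M1_byhand = _b[_cons] + _b[NH_White_proportion]*NH_White_proportion

* the two variables should agree up to rounding in storage
summarize M1_predicted M1_byhand
assert reldif(M1_byhand, M1_predicted) < 1e-6
```

* With the coefficients printed above, this amounts to 22161.23 - 3761.36 * NH_White_proportion for each state.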

. twoway (scatter incwage  NH_White_proportion, mlabel(statefip)) (lfit incwage NH_White_proportion) (connected M1_predicted NH_White_proportion)

*The above syntax takes the 50 state dataset, plots average state income (Y axis) against the non-Hispanic white proportion (X axis), and attaches the state name as a label to each point. Then the lfit line is plotted on top, and the predicted values are plotted on top of that, showing that the predicted values from the regression are the same as the lfit line.

. gen residual=incwage-M1_predicted

* From the predicted values, we generate a variable for the residual, which is actual minus predicted.
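
* Stata can also produce the residuals directly. A sketch, assuming the regress command above is still the latest estimation; M1_resid is a made-up name. The residuals option of predict returns actual minus predicted, so it should match the variable built by hand, up to rounding.

```stata
* residuals straight from the last regression
predict M1_resid, residuals

* compare with the hand-built version; both means should be very close to zero
summarize residual M1_resid
assert abs(residual - M1_resid) < 0.01
```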

. gen abs_residual=abs(residual)

* We care about which points are furthest from the best fit line, so we generate a new variable with the absolute values of the residuals.

. gsort -abs_residual

* The above command sorts our 50 state dataset by the absolute value of each state's residual from the last regression, from largest to smallest.

. dfbeta( NH_White_proportion)

_dfbeta_1: dfbeta(NH_White_proportion)

* The dfbeta command is another post-regression command; it asks Stata to generate dfbetas for specific predictors. Here we have only one predictor, NH_White_proportion. Don’t worry about the units of the dfbetas. The dfbetas tell us which of the 50 states have the most influence on the slope of the regression line. It turns out that while CT and NJ are furthest from the line, DC has the most influence because it is furthest from the other points. You will have to plot the points to see what I mean.
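
* On the units: Stata's dfbeta is scaled. It is the change in the coefficient from including versus excluding one observation, divided by an estimate of the coefficient's standard error. A common rule of thumb (not something used elsewhere in this log) flags observations whose absolute dfbeta exceeds 2/sqrt(n). The sketch below applies that cutoff, assuming _dfbeta_1 exists and the regression above is still the most recent estimation; dfb_cutoff is a name invented for this example.

```stata
* cutoff from the rule of thumb; e(N) is the number of observations regress used
scalar dfb_cutoff = 2/sqrt(e(N))
display "cutoff = " dfb_cutoff

* list only the observations beyond the cutoff
list statefip _dfbeta_1 if abs(_dfbeta_1) > dfb_cutoff
```

* With 51 observations the cutoff is about 0.28. Judging by the dfbeta values listed later in this log, four observations exceed it: District of Columbia, New Mexico, Hawaii, and West Virginia.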

. gen abs_dfbeta=abs( _dfbeta_1)

. list statefip abs_residual residual  abs_dfbeta  _dfbeta_1

+--------------------------------------------------------------------+

|             statefip   abs_re~l    residual   abs_df~a   _dfbeta_1 |

|--------------------------------------------------------------------|

1. |          Connecticut   5617.314    5617.314   .0487289    .0487289 |

2. |           New Jersey   5432.736    5432.736   .1172426   -.1172426 |

3. |           New Mexico   5139.932   -5139.932   .5375599    .5375599 |

4. |              Montana   5024.921   -5024.921   .2145686   -.2145686 |

5. |          Mississippi   5009.293   -5009.293   .2366829    .2366829 |

|--------------------------------------------------------------------|

6. |             Maryland   4840.275    4840.275   .1743481   -.1743481 |

7. |        West Virginia   4798.199   -4798.199   .2921322   -.2921322 |

8. |        Massachusetts   4666.598    4666.598    .098265     .098265 |

9. |             Arkansas   4460.403   -4460.403   .0185422   -.0185422 |

10. |             Colorado   4390.833    4390.833   .0056333    .0056333 |

|--------------------------------------------------------------------|

11. |         North Dakota   4204.203   -4204.203    .192666    -.192666 |

12. |            Minnesota    4187.81     4187.81   .1740437    .1740437 |

13. |               Alaska   3780.312    3780.312   .0506594   -.0506594 |

14. |            Louisiana   3555.554   -3555.554   .1484154    .1484154 |

15. |              Alabama   3416.563   -3416.563   .0482474    .0482474 |

|--------------------------------------------------------------------|

16. | District of Columbia   3381.999    3381.999   .5903278   -.5903278 |

17. |             Michigan   3208.876    3208.876   .0356351    .0356351 |

18. |        New Hampshire   3021.199    3021.199   .1813799    .1813799 |

19. |         South Dakota   2937.162   -2937.162   .1303558   -.1303558 |

20. |             Virginia   2706.413    2706.413   .0332524   -.0332524 |

|--------------------------------------------------------------------|

21. |             Illinois   2687.422    2687.422     .04929     -.04929 |

22. |           Washington   2536.313    2536.313   .0665131    .0665131 |

23. |             Oklahoma   2383.868   -2383.868   .0102179   -.0102179 |

24. |                Idaho   2289.952   -2289.952    .070259    -.070259 |

25. |             Kentucky   2257.391   -2257.391    .075205    -.075205 |

|--------------------------------------------------------------------|

26. |            Wisconsin   2079.558    2079.558   .0656909    .0656909 |

27. |       South Carolina    1986.75    -1986.75    .022419     .022419 |

28. |             Delaware   1970.206    1970.206   .0329438   -.0329438 |

29. |              Wyoming   1860.133   -1860.133   .0957266   -.0957266 |

30. |              Florida   1833.188   -1833.188   .0603009    .0603009 |

|--------------------------------------------------------------------|

31. |              Arizona    1755.73    -1755.73    .062599     .062599 |

32. |               Hawaii   1728.106   -1728.106   .3419163    .3419163 |

33. |         Rhode Island   1562.148    1562.148   .0499289    .0499289 |

34. |             Missouri   1460.693    1460.693   .0429018    .0429018 |

35. |             Nebraska     1231.4     -1231.4   .0429309   -.0429309 |

|--------------------------------------------------------------------|

36. |                 Ohio   1017.316    1017.316    .024228     .024228 |

37. |               Kansas   1001.302   -1001.302    .021283    -.021283 |

38. |             New York   998.9494    998.9494   .0335865   -.0335865 |

39. |              Georgia   939.2884   -939.2884   .0433774    .0433774 |

40. |                Texas    873.475    -873.475   .0654734    .0654734 |

|--------------------------------------------------------------------|

41. |                 Utah   544.7642   -544.7642   .0203697   -.0203697 |

42. |                Maine    536.482    -536.482   .0362305   -.0362305 |

43. |               Nevada   347.9611    347.9611   .0080656   -.0080656 |

44. |              Vermont   328.0741   -328.0741   .0203959   -.0203959 |

45. |           California   294.7873    294.7873   .0240138   -.0240138 |

|--------------------------------------------------------------------|

46. |               Oregon   241.7805    241.7805   .0072873    .0072873 |

47. |         Pennsylvania    230.294    -230.294   .0063625   -.0063625 |

48. |              Indiana   165.3883   -165.3883   .0053708   -.0053708 |

49. |            Tennessee   89.31062    89.31062   .0009869    .0009869 |

50. |       North Carolina   46.82497   -46.82497   .0009033    .0009033 |

|--------------------------------------------------------------------|

51. |                 Iowa   17.83441    17.83441   .0008664    .0008664 |

+--------------------------------------------------------------------+

* Above is the list of states sorted from largest to smallest absolute residual. Below is the list of states sorted from largest to smallest absolute value of dfbeta.

. gsort - abs_dfbeta

. list statefip abs_residual residual  abs_dfbeta  _dfbeta_1

+--------------------------------------------------------------------+

|             statefip   abs_re~l    residual   abs_df~a   _dfbeta_1 |

|--------------------------------------------------------------------|

1. | District of Columbia   3381.999    3381.999   .5903278   -.5903278 |

2. |           New Mexico   5139.932   -5139.932   .5375599    .5375599 |

3. |               Hawaii   1728.106   -1728.106   .3419163    .3419163 |

4. |        West Virginia   4798.199   -4798.199   .2921322   -.2921322 |

5. |          Mississippi   5009.293   -5009.293   .2366829    .2366829 |

|--------------------------------------------------------------------|

6. |              Montana   5024.921   -5024.921   .2145686   -.2145686 |

7. |         North Dakota   4204.203   -4204.203    .192666    -.192666 |

8. |        New Hampshire   3021.199    3021.199   .1813799    .1813799 |

9. |             Maryland   4840.275    4840.275   .1743481   -.1743481 |

10. |            Minnesota    4187.81     4187.81   .1740437    .1740437 |

|--------------------------------------------------------------------|

11. |            Louisiana   3555.554   -3555.554   .1484154    .1484154 |

12. |         South Dakota   2937.162   -2937.162   .1303558   -.1303558 |

13. |           New Jersey   5432.736    5432.736   .1172426   -.1172426 |

14. |        Massachusetts   4666.598    4666.598    .098265     .098265 |

15. |              Wyoming   1860.133   -1860.133   .0957266   -.0957266 |

|--------------------------------------------------------------------|

16. |             Kentucky   2257.391   -2257.391    .075205    -.075205 |

17. |                Idaho   2289.952   -2289.952    .070259    -.070259 |

18. |           Washington   2536.313    2536.313   .0665131    .0665131 |

19. |            Wisconsin   2079.558    2079.558   .0656909    .0656909 |

20. |                Texas    873.475    -873.475   .0654734    .0654734 |

|--------------------------------------------------------------------|

21. |              Arizona    1755.73    -1755.73    .062599     .062599 |

22. |              Florida   1833.188   -1833.188   .0603009    .0603009 |

23. |               Alaska   3780.312    3780.312   .0506594   -.0506594 |

24. |         Rhode Island   1562.148    1562.148   .0499289    .0499289 |

25. |             Illinois   2687.422    2687.422     .04929     -.04929 |

|--------------------------------------------------------------------|

26. |          Connecticut   5617.314    5617.314   .0487289    .0487289 |

27. |              Alabama   3416.563   -3416.563   .0482474    .0482474 |

28. |              Georgia   939.2884   -939.2884   .0433774    .0433774 |

29. |             Nebraska     1231.4     -1231.4   .0429309   -.0429309 |

30. |             Missouri   1460.693    1460.693   .0429018    .0429018 |

|--------------------------------------------------------------------|

31. |                Maine    536.482    -536.482   .0362305   -.0362305 |

32. |             Michigan   3208.876    3208.876   .0356351    .0356351 |

33. |             New York   998.9494    998.9494   .0335865   -.0335865 |

34. |             Virginia   2706.413    2706.413   .0332524   -.0332524 |

35. |             Delaware   1970.206    1970.206   .0329438   -.0329438 |

|--------------------------------------------------------------------|

36. |                 Ohio   1017.316    1017.316    .024228     .024228 |

37. |           California   294.7873    294.7873   .0240138   -.0240138 |

38. |       South Carolina    1986.75    -1986.75    .022419     .022419 |

39. |               Kansas   1001.302   -1001.302    .021283    -.021283 |

40. |              Vermont   328.0741   -328.0741   .0203959   -.0203959 |

|--------------------------------------------------------------------|

41. |                 Utah   544.7642   -544.7642   .0203697   -.0203697 |

42. |             Arkansas   4460.403   -4460.403   .0185422   -.0185422 |

43. |             Oklahoma   2383.868   -2383.868   .0102179   -.0102179 |

44. |               Nevada   347.9611    347.9611   .0080656   -.0080656 |

45. |               Oregon   241.7805    241.7805   .0072873    .0072873 |

|--------------------------------------------------------------------|

46. |         Pennsylvania    230.294    -230.294   .0063625   -.0063625 |

47. |             Colorado   4390.833    4390.833   .0056333    .0056333 |

48. |              Indiana   165.3883   -165.3883   .0053708   -.0053708 |

49. |            Tennessee   89.31062    89.31062   .0009869    .0009869 |

50. |       North Carolina   46.82497   -46.82497   .0009033    .0009033 |

|--------------------------------------------------------------------|

51. |                 Iowa   17.83441    17.83441   .0008664    .0008664 |

+--------------------------------------------------------------------+
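
* One way to see DC's influence directly is to rerun the regression without it. A sketch, assuming the data are still sorted by gsort - abs_dfbeta as above, so that the observation with the largest absolute dfbeta (DC) is in row 1; slope_all and slope_drop1 are names invented for this example.

```stata
* slope from the full sample, as in the regression earlier in this log
quietly regress incwage NH_White_proportion
scalar slope_all = _b[NH_White_proportion]

* same regression, skipping the first observation in the current sort order
quietly regress incwage NH_White_proportion in 2/l
scalar slope_drop1 = _b[NH_White_proportion]

display "slope, all observations:        " slope_all
display "slope, top-dfbeta obs dropped:  " slope_drop1
```

* Because DC's dfbeta is negative, including DC pulls the slope down, so the slope without DC should come out higher (less negative) than the slope with all 51 observations.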

. log close

name:  <unnamed>