---------------------------------------------------------------------------------------------------------------------------
name: <unnamed>
log: C:\Users\mexmi\Documents\newer web pages\soc_meth_proj3\fall_2021_logs\class14.log
log type: text
opened on: 3 Nov 2021, 09:54:35
. *class starts here.
. clear all
* First we opened the Excel file for the Anscombe dataset and copied the data into the Stata Data Editor (find the Data Editor among the icons below the menus, or go to Window > Data Editor).
. *(8 variables, 11 observations pasted into data editor)
. twoway (scatter y2 x2) (lfit y2 x2)
. twoway (scatter y1 x1) (lfit y1 x1)
. twoway (scatter y3 x3) (lfit y3 x3)
* Things to note about the Anscombe data pairs (Yn and Xn): the scatter plots all look different, but the best-fit OLS line, which is what “lfit y3 x3” gives us, looks the same in every graph.
. regress y1 x1
Source | SS df MS Number of obs = 11
-------------+---------------------------------- F(1, 9) = 17.99
Model | 27.5100011 1 27.5100011 Prob > F = 0.0022
Residual | 13.7626904 9 1.52918783 R-squared = 0.6665
-------------+---------------------------------- Adj R-squared = 0.6295
Total | 41.2726916 10 4.12726916 Root MSE = 1.2366
------------------------------------------------------------------------------
y1 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
x1 | .5000909 .1179055 4.24 0.002 .2333701 .7668117
_cons | 3.000091 1.124747 2.67 0.026 .4557369 5.544445
------------------------------------------------------------------------------
. regress y2 x2
Source | SS df MS Number of obs = 11
-------------+---------------------------------- F(1, 9) = 17.97
Model | 27.5000024 1 27.5000024 Prob > F = 0.0022
Residual | 13.776294 9 1.53069933 R-squared = 0.6662
-------------+---------------------------------- Adj R-squared = 0.6292
Total | 41.2762964 10 4.12762964 Root MSE = 1.2372
------------------------------------------------------------------------------
y2 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
x2 | .5 .1179638 4.24 0.002 .2331475 .7668526
_cons | 3.000909 1.125303 2.67 0.026 .4552978 5.54652
------------------------------------------------------------------------------
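* In fact, the fitted lines are nearly identical: in both regressions the slope is about 0.500, the intercept is about 3.00, and the R-squared is about 0.67. They differ only from the fourth significant digit on.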
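* To compare all of the Anscombe pairs at once, a loop like the sketch below may help. It assumes the pasted variables are named x1-x4 and y1-y4; only pairs 1 to 3 appear in this log, so the names x4 and y4 are an assumption.

```stata
* Fit each Anscombe pair and print slope, intercept, and R-squared side by side.
* Assumes variables x1-x4 and y1-y4 are in memory (x4/y4 are assumed names).
forvalues i = 1/4 {
    quietly regress y`i' x`i'
    display "pair `i': slope = " %6.4f _b[x`i'] ///
        "  intercept = " %6.4f _b[_cons]        ///
        "  R-squared = " %6.4f e(r2)
}
```

* If the data were pasted as described, each printed line should show roughly the same slope, intercept, and R-squared, even though the scatter plots differ.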
. clear all
* Now on to our 50-state summary data (51 observations: the 50 states plus the District of Columbia):
. use "C:\Users\mexmi\Documents\current class files\intro soc methods\fifty_state_dataset.dta"
. twoway (scatter incwage NH_White_proportion, mlabel(statefip)) (lfit incwage NH_White_proportion)
* The command above graphs the states by average income and proportion Non-Hispanic White. The best-fit line slopes downward: the whiter the state, the lower the predicted average income.
. summarize NH_White_proportion
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
NH_White_p~n | 51 .7626632 .1633623 .2354178 .9835737
. regress incwage NH_White_proportion
Source | SS df MS Number of obs = 51
-------------+---------------------------------- F(1, 49) = 2.14
Model | 18878316.5 1 18878316.5 Prob > F = 0.1500
Residual | 432407199 49 8824636.71 R-squared = 0.0418
-------------+---------------------------------- Adj R-squared = 0.0223
Total | 451285515 50 9025710.3 Root MSE = 2970.6
-------------------------------------------------------------------------------------
incwage | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------------+----------------------------------------------------------------
NH_White_proportion | -3761.36 2571.649 -1.46 0.150 -8929.282 1406.562
_cons | 22161.23 2004.928 11.05 0.000 18132.18 26190.29
-------------------------------------------------------------------------------------
. predict M1_predicted
(option xb assumed; fitted values)
. gen residual=incwage- M1_predicted
. predict residual_v2, residual
. *two ways to generate the residuals.
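* A quick way to confirm that the two residual variables agree is to summarize their difference. This is a minimal sketch; resid_diff is a hypothetical variable name not used elsewhere in this log.

```stata
* Difference between the hand-made residual and the one from predict.
* resid_diff is a hypothetical name introduced only for this check.
generate double resid_diff = residual - residual_v2
summarize resid_diff
```

* The minimum and maximum of resid_diff should both be zero or very close to it; tiny nonzero values would reflect only float rounding.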
* We generate the residuals and then a new variable holding their absolute value, because we want to know which state is furthest from the line; we don’t care whether it is above or below.
. gen abs_residual=abs(residual)
. gsort -abs_residual
. *gsort puts our data, the 51 observations, in descending order of absolute residual (the minus sign in front of the variable tells Stata to sort from largest to smallest).
* Now we use the dfbeta command to generate dfbetas for the X variable of interest; here there is only one X variable.
. dfbeta ( NH_White_proportion)
Generating DFBETA variable ...
_dfbeta_1: DFBETA NH_White_proportion
. gen abs_dfbeta=abs(_dfbeta_1)
* We want to know which states are most influential on the line, i.e. have the largest dfbeta in absolute value, so we generate a new variable with the absolute value of the dfbeta.
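* One common rule of thumb (not used in class, so treat it as an assumption here) flags a point as influential when its absolute dfbeta exceeds 2/sqrt(n). With n = 51 that cutoff is about 0.28. A sketch of the check:

```stata
* List observations whose |DFBETA| exceeds the 2/sqrt(n) rule-of-thumb cutoff.
* _N is the number of observations in memory (51 here).
list statefip _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(_N) & !missing(_dfbeta_1)
```

* The cutoff is only a screening device; sorting and reading the full list still shows how the influence is distributed.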
* States listed from largest absolute value residual to smallest:
. list statefip abs_residual residual abs_dfbeta _dfbeta_1
+--------------------------------------------------------------------+
| statefip abs_re~l residual abs_df~a _dfbeta_1 |
|--------------------------------------------------------------------|
1. | Connecticut 5617.314 5617.314 .0487289 .0487289 |
2. | New Jersey 5432.736 5432.736 .1172426 -.1172426 |
3. | New Mexico 5139.932 -5139.932 .5375599 .5375599 |
4. | Montana 5024.921 -5024.921 .2145686 -.2145686 |
5. | Mississippi 5009.293 -5009.293 .2366829 .2366829 |
|--------------------------------------------------------------------|
6. | Maryland 4840.275 4840.275 .1743481 -.1743481 |
7. | West Virginia 4798.199 -4798.199 .2921322 -.2921322 |
8. | Massachusetts 4666.598 4666.598 .098265 .098265 |
9. | Arkansas 4460.403 -4460.403 .0185422 -.0185422 |
10. | Colorado 4390.833 4390.833 .0056333 .0056333 |
|--------------------------------------------------------------------|
11. | North Dakota 4204.203 -4204.203 .192666 -.192666 |
12. | Minnesota 4187.81 4187.81 .1740437 .1740437 |
13. | Alaska 3780.312 3780.312 .0506594 -.0506594 |
14. | Louisiana 3555.554 -3555.554 .1484154 .1484154 |
15. | Alabama 3416.563 -3416.563 .0482474 .0482474 |
|--------------------------------------------------------------------|
16. | District of Columbia 3381.999 3381.999 .5903278 -.5903278 |
17. | Michigan 3208.876 3208.876 .0356351 .0356351 |
18. | New Hampshire 3021.199 3021.199 .1813799 .1813799 |
19. | South Dakota 2937.162 -2937.162 .1303558 -.1303558 |
20. | Virginia 2706.413 2706.413 .0332524 -.0332524 |
|--------------------------------------------------------------------|
21. | Illinois 2687.422 2687.422 .04929 -.04929 |
22. | Washington 2536.313 2536.313 .0665131 .0665131 |
23. | Oklahoma 2383.868 -2383.868 .0102179 -.0102179 |
24. | Idaho 2289.952 -2289.952 .070259 -.070259 |
25. | Kentucky 2257.391 -2257.391 .075205 -.075205 |
|--------------------------------------------------------------------|
26. | Wisconsin 2079.558 2079.558 .0656909 .0656909 |
27. | South Carolina 1986.75 -1986.75 .022419 .022419 |
28. | Delaware 1970.206 1970.206 .0329438 -.0329438 |
29. | Wyoming 1860.133 -1860.133 .0957266 -.0957266 |
30. | Florida 1833.188 -1833.188 .0603009 .0603009 |
|--------------------------------------------------------------------|
31. | Arizona 1755.73 -1755.73 .062599 .062599 |
32. | Hawaii 1728.106 -1728.106 .3419163 .3419163 |
33. | Rhode Island 1562.148 1562.148 .0499289 .0499289 |
34. | Missouri 1460.693 1460.693 .0429018 .0429018 |
35. | Nebraska 1231.4 -1231.4 .0429309 -.0429309 |
|--------------------------------------------------------------------|
36. | Ohio 1017.316 1017.316 .024228 .024228 |
37. | Kansas 1001.302 -1001.302 .021283 -.021283 |
38. | New York 998.9494 998.9494 .0335865 -.0335865 |
39. | Georgia 939.2884 -939.2884 .0433774 .0433774 |
40. | Texas 873.475 -873.475 .0654734 .0654734 |
|--------------------------------------------------------------------|
41. | Utah 544.7642 -544.7642 .0203697 -.0203697 |
42. | Maine 536.482 -536.482 .0362305 -.0362305 |
43. | Nevada 347.9611 347.9611 .0080656 -.0080656 |
44. | Vermont 328.0741 -328.0741 .0203959 -.0203959 |
45. | California 294.7873 294.7873 .0240138 -.0240138 |
|--------------------------------------------------------------------|
46. | Oregon 241.7805 241.7805 .0072873 .0072873 |
47. | Pennsylvania 230.294 -230.294 .0063625 -.0063625 |
48. | Indiana 165.3883 -165.3883 .0053708 -.0053708 |
49. | Tennessee 89.31062 89.31062 .0009869 .0009869 |
50. | North Carolina 46.82497 -46.82497 .0009033 .0009033 |
|--------------------------------------------------------------------|
51. | Iowa 17.83441 17.83441 .0008664 .0008664 |
+--------------------------------------------------------------------+
* CT and NJ have the largest residuals but small dfbetas (about 0.05 and 0.12 in absolute value).
* Now re-sort the data from largest to smallest absolute value of dfbeta, and re-list:
. gsort - abs_dfbeta
. list statefip abs_dfbeta _dfbeta_1 abs_residual residual
+--------------------------------------------------------------------+
| statefip abs_df~a _dfbeta_1 abs_re~l residual |
|--------------------------------------------------------------------|
1. | District of Columbia .5903278 -.5903278 3381.999 3381.999 |
2. | New Mexico .5375599 .5375599 5139.932 -5139.932 |
3. | Hawaii .3419163 .3419163 1728.106 -1728.106 |
4. | West Virginia .2921322 -.2921322 4798.199 -4798.199 |
5. | Mississippi .2366829 .2366829 5009.293 -5009.293 |
|--------------------------------------------------------------------|
6. | Montana .2145686 -.2145686 5024.921 -5024.921 |
7. | North Dakota .192666 -.192666 4204.203 -4204.203 |
8. | New Hampshire .1813799 .1813799 3021.199 3021.199 |
9. | Maryland .1743481 -.1743481 4840.275 4840.275 |
10. | Minnesota .1740437 .1740437 4187.81 4187.81 |
|--------------------------------------------------------------------|
11. | Louisiana .1484154 .1484154 3555.554 -3555.554 |
12. | South Dakota .1303558 -.1303558 2937.162 -2937.162 |
13. | New Jersey .1172426 -.1172426 5432.736 5432.736 |
14. | Massachusetts .098265 .098265 4666.598 4666.598 |
15. | Wyoming .0957266 -.0957266 1860.133 -1860.133 |
|--------------------------------------------------------------------|
16. | Kentucky .075205 -.075205 2257.391 -2257.391 |
17. | Idaho .070259 -.070259 2289.952 -2289.952 |
18. | Washington .0665131 .0665131 2536.313 2536.313 |
19. | Wisconsin .0656909 .0656909 2079.558 2079.558 |
20. | Texas .0654734 .0654734 873.475 -873.475 |
|--------------------------------------------------------------------|
21. | Arizona .062599 .062599 1755.73 -1755.73 |
22. | Florida .0603009 .0603009 1833.188 -1833.188 |
23. | Alaska .0506594 -.0506594 3780.312 3780.312 |
24. | Rhode Island .0499289 .0499289 1562.148 1562.148 |
25. | Illinois .04929 -.04929 2687.422 2687.422 |
|--------------------------------------------------------------------|
26. | Connecticut .0487289 .0487289 5617.314 5617.314 |
27. | Alabama .0482474 .0482474 3416.563 -3416.563 |
28. | Georgia .0433774 .0433774 939.2884 -939.2884 |
29. | Nebraska .0429309 -.0429309 1231.4 -1231.4 |
30. | Missouri .0429018 .0429018 1460.693 1460.693 |
|--------------------------------------------------------------------|
31. | Maine .0362305 -.0362305 536.482 -536.482 |
32. | Michigan .0356351 .0356351 3208.876 3208.876 |
33. | New York .0335865 -.0335865 998.9494 998.9494 |
34. | Virginia .0332524 -.0332524 2706.413 2706.413 |
35. | Delaware .0329438 -.0329438 1970.206 1970.206 |
|--------------------------------------------------------------------|
36. | Ohio .024228 .024228 1017.316 1017.316 |
37. | California .0240138 -.0240138 294.7873 294.7873 |
38. | South Carolina .022419 .022419 1986.75 -1986.75 |
39. | Kansas .021283 -.021283 1001.302 -1001.302 |
40. | Vermont .0203959 -.0203959 328.0741 -328.0741 |
|--------------------------------------------------------------------|
41. | Utah .0203697 -.0203697 544.7642 -544.7642 |
42. | Arkansas .0185422 -.0185422 4460.403 -4460.403 |
43. | Oklahoma .0102179 -.0102179 2383.868 -2383.868 |
44. | Nevada .0080656 -.0080656 347.9611 347.9611 |
45. | Oregon .0072873 .0072873 241.7805 241.7805 |
|--------------------------------------------------------------------|
46. | Pennsylvania .0063625 -.0063625 230.294 -230.294 |
47. | Colorado .0056333 .0056333 4390.833 4390.833 |
48. | Indiana .0053708 -.0053708 165.3883 -165.3883 |
49. | Tennessee .0009869 .0009869 89.31062 89.31062 |
50. | North Carolina .0009033 .0009033 46.82497 -46.82497 |
|--------------------------------------------------------------------|
51. | Iowa .0008664 .0008664 17.83441 17.83441 |
+--------------------------------------------------------------------+
* DC, NM, and HI, the 3 observations with the lowest proportion of NH White people, have the most influence on the slope because they are outliers in X (high-leverage points).
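* Being an outlier in X can be measured directly with leverage (the hat value). The sketch below re-runs the regression and lists the five highest-leverage observations; leverage_h is a hypothetical variable name.

```stata
* Leverage measures how far each observation's X value is from the mean of X.
* leverage_h is a hypothetical name introduced for this illustration.
quietly regress incwage NH_White_proportion
predict leverage_h, leverage
gsort -leverage_h
list statefip NH_White_proportion leverage_h in 1/5
```

* With a single X variable, leverage depends only on distance from the mean of X, so the observations with the most extreme NH_White_proportion values should head this list.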
. codebook statefip, tab(60)
-------------------------------------------------------------------------------------------------------
statefip State (FIPS code)
-------------------------------------------------------------------------------------------------------
Type: Numeric (byte)
Label: statefiplbl
Range: [1,56] Units: 1
Unique values: 51 Missing .: 0/51
Tabulation: Freq. Numeric Label
1 1 Alabama
1 2 Alaska
1 4 Arizona
1 5 Arkansas
1 6 California
1 8 Colorado
1 9 Connecticut
1 10 Delaware
1 11 District of Columbia
1 12 Florida
1 13 Georgia
1 15 Hawaii
1 16 Idaho
1 17 Illinois
1 18 Indiana
1 19 Iowa
1 20 Kansas
1 21 Kentucky
1 22 Louisiana
1 23 Maine
1 24 Maryland
1 25 Massachusetts
1 26 Michigan
1 27 Minnesota
1 28 Mississippi
1 29 Missouri
1 30 Montana
1 31 Nebraska
1 32 Nevada
1 33 New Hampshire
1 34 New Jersey
1 35 New Mexico
1 36 New York
1 37 North Carolina
1 38 North Dakota
1 39 Ohio
1 40 Oklahoma
1 41 Oregon
1 42 Pennsylvania
1 44 Rhode Island
1 45 South Carolina
1 46 South Dakota
1 47 Tennessee
1 48 Texas
1 49 Utah
1 50 Vermont
1 51 Virginia
1 53 Washington
1 54 West Virginia
1 55 Wisconsin
1 56 Wyoming
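* The meaning of the dfbetas: start from the original slope and its SE. Each point's dfbeta is how much that point moves the slope, in units of SE: (slope with the point - slope without the point) / SE. For DC the dfbeta was -0.59, the original slope was -3761, and the SE of the slope was about 2572. So including DC pulls the slope down by about 0.59 SE, and dropping DC should raise it by about 0.59 SE. Without DC, this is what we get: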
. regress incwage NH_White_proportion if statefip~=11
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(1, 48) = 0.64
Model | 5576993.22 1 5576993.22 Prob > F = 0.4276
Residual | 418239982 48 8713332.96 R-squared = 0.0132
-------------+---------------------------------- Adj R-squared = -0.0074
Total | 423816975 49 8649326.02 Root MSE = 2951.8
-------------------------------------------------------------------------------------
incwage | Coefficient Std. err. t P>|t| [95% conf. interval]
--------------------+----------------------------------------------------------------
NH_White_proportion | -2252.848 2815.944 -0.80 0.428 -7914.684 3408.987
_cons | 20928.61 2214.384 9.45 0.000 16476.29 25380.93
-------------------------------------------------------------------------------------
. display -3761.36+ (0.59033*2571.7)
-2243.2083
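* My by-hand calculation of the slope without DC based on the dfbeta (-2243) is not exactly the same as the actual slope without DC (-2253), but it is close. The small gap arises because Stata scales each dfbeta using the root MSE from the regression that leaves the point out, while the by-hand calculation uses the SE from the full regression.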
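* The by-hand calculation can be tightened by rescaling the full-sample SE by the ratio of the two root MSEs (2951.8 without DC versus 2970.6 with DC). The sketch below does this with stored results; the scalar names are hypothetical, and it assumes _dfbeta_1 from the earlier dfbeta command is still in memory.

```stata
* Full-sample fit: keep the slope, its SE, and the root MSE.
quietly regress incwage NH_White_proportion
scalar b_full  = _b[NH_White_proportion]
scalar se_full = _se[NH_White_proportion]
scalar s_full  = e(rmse)

* DC's dfbeta (statefip code 11), as stored by the earlier dfbeta command.
quietly summarize _dfbeta_1 if statefip == 11
scalar dfb_dc = r(mean)

* Root MSE from the fit that leaves DC out.
quietly regress incwage NH_White_proportion if statefip != 11
scalar s_drop = e(rmse)

* Slope without DC implied by the dfbeta, with the SE rescaled by s_drop/s_full,
* followed by the slope actually estimated without DC.
display scalar(b_full) - scalar(dfb_dc) * scalar(se_full) * (scalar(s_drop) / scalar(s_full))
display _b[NH_White_proportion]
```

* The two displayed numbers should agree closely with each other and with the -2252.848 reported in the regression table above.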
. log close
name: <unnamed>
log: C:\Users\mexmi\Documents\newer web pages\soc_meth_proj3\fall_2021_logs\class14.log
log type: text
closed on: 3 Nov 2021, 12:50:48
-------------------------------------------------------------------------------------------------------