This project is a full analysis of a moderately large dataset using the tools we have learnt throughout the course.
The project should be done individually. The project is due Sunday March 17 at 11:59 PM.
The data for your final project is based on real estate sales in Ames, Iowa in the years 2006-2010. A description of the dataset can be found here.
I have created a subsample of 2000 cases for you to use after fixing some missing values.
To begin, randomly split the data into two sets of equal size: 1000 cases for selecting a model, and the remaining 1000 for validation and for reporting confidence intervals for the final effects. For reproducibility of your results, pick and store an integer seed to use for the split and for any subsequent randomization, such as in cross-validation. For simplicity, I'll choose the seed 1 here. You need not use the same seed; choose one and set it in the first line of your analysis.
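A minimal sketch of the seeded split in R. The data frame ames below is a synthetic stand-in, since the file name and loading step are not given here, and the names select_idx, ames_select, and ames_valid are illustrative only.

```r
# Set the seed first so the split (and any later randomization) is reproducible.
set.seed(1)

# Synthetic stand-in for the 2000-case subsample; replace with the real data.
# LotArea is a hypothetical predictor used only to give the frame a second column.
ames <- data.frame(
  SalePrice = rnorm(2000, mean = 180000, sd = 50000),
  LotArea   = rnorm(2000, mean = 10000, sd = 3000)
)

# Draw half of the row indices at random for model selection;
# the remaining rows form the validation set.
select_idx  <- sample(nrow(ames), size = nrow(ames) / 2)
ames_select <- ames[select_idx, ]
ames_valid  <- ames[-select_idx, ]
```

Storing the indices (rather than only the two data frames) makes it easy to check later that no case appears in both halves.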
Your task is to build a model to predict SalePrice based on the remaining variables.
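The overall workflow (select a model on one half, then refit it on the other half and report intervals) might be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, x1 and x2 are hypothetical predictors, and comparing two candidate formulas by AIC is just one possible selection rule, not a prescribed one.

```r
set.seed(1)

# Synthetic stand-in data; x1 and x2 are hypothetical predictors.
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$SalePrice <- 180000 + 20000 * dat$x1 + rnorm(n, sd = 10000)

# Random split into a selection half and a validation half.
idx <- sample(n, n / 2)
dat_select <- dat[idx, ]
dat_valid  <- dat[-idx, ]

# Model selection uses only the selection half.
# Here two candidate formulas are compared by AIC (an assumed criterion).
candidates <- list(SalePrice ~ x1, SalePrice ~ x1 + x2)
aics <- sapply(candidates, function(f) AIC(lm(f, data = dat_select)))
chosen <- candidates[[which.min(aics)]]

# Refit the chosen model on the held-out half and compute 95% intervals.
final_fit <- lm(chosen, data = dat_valid)
ci <- confint(final_fit, level = 0.95)
print(ci)
```

Because the validation half played no part in choosing the model, the intervals from final_fit are not distorted by the selection step.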
The final project should be no more than 10 pages. Beware: the data set is large enough that simple stepwise model-building procedures may be very slow.
The project should have the following parts:
The study: In this section, you should give a description of the study underlying your dataset. Possible questions to be answered are the following:
The data: In this section, you should describe the data set and possibly do some exploratory data analysis. For instance:
The models: In this section, you should develop a model for the data that will allow you to answer some of the specific goals of the study. Possible questions to be addressed here are the following:
Results: In this section, you should report the results obtained by fitting the models proposed in the previous section. Emphasis should be placed on clarity, as if the report were a statistical consultant’s report for a nonstatistician. For instance, loads of R output would, in general, not be acceptable. Plots and well-organized tables are good things to have in this section. Possible questions to be addressed here are the following:
Appendix: In this section, you should attach a final, edited copy of the R code used in the analysis. Ideally, there will be comments in the file, i.e. lines beginning with “#”, to clarify what each part of the code is doing.
Acknowledgements: If you consult outside sources that refer to this data set, you should cite these as references, and describe what you used from each source. Sources include material found on the internet, journal articles and books.
There are no right or wrong answers for many of these questions. The goal of the project is to try to mimic the analysis of a real data set that you might come across in your own field of application.