# DATA SCIENCE 101

The course provides a solid introduction to data science, both exposing students to computational tools they can proficiently use to analyze data and exploring the conceptual challenges of inferential reasoning. Each module/week represents a new “data adventure,” analyzing real datasets, exploring different questions, and trying out tools.

There will be three traditional lectures per week and two labs with active student participation. Data analysis and computations will be carried out in R, a language that will be introduced during the course. There is no calculus prerequisite. Lecture notes, datasets, and lab markdowns will be available via the class web page. Additional reading material and references will be made available. There will be weekly homework and an in-class final exam.

Stats101 is a new course, and as such it does not appear in the lists of required classes for majors. The statistics department believes that the materials in Stats101 cover the topics traditionally taught in Stats60. If your major requires Stats60, we invite you to ask your advisors if Stats101 could be accepted as a substitute.

## Topics

- Data science: what is the buzz about?
- Data munging and wrangling
- Numerical summaries of data
- Visualization tools
- Sampling variability and uncertainty of statistical estimates
- Testing statistical hypotheses
- Linear regression and prediction
- High dimensional data and principal component analysis
- Nonparametric statistics (transformations of the data, ranking, etc.)
- Safeguarding reproducibility: the challenges of multiple comparisons and data snooping

## Instructor & TAs

### Instructors for Autumn 2017

### Email list

The course has an email list that reaches all TAs as well as the professors: stats101-aut1718-staff@lists.stanford.edu

**As a general rule, you should send course-related questions to this email list.**

Note that we will not respond to e-mails about the homework due on Friday if they are sent after 6pm on Thursday night. Office hours (which do not require an appointment) are often better for technical questions.

### Office hours

- Tuesday 10:30-12:30 (Jonathan, Sequoia Hall 137)
- Wednesday 10:30-12:30 (Lucy, Sequoia Hall 207)
- Thursday 2:00-4:00 (Peter, Sequoia Hall 202 or by appointment)

## Schedule & Location

M-F 9:30-10:20, 200-205

## Prerequisites and credits

Some familiarity with elementary algebraic notation at the high school level is assumed, but there is no calculus prerequisite. There is no computer science prerequisite either, but students must be willing to engage in computational and data analysis examples using software that will be introduced in class.

Data Science 101 fulfills the following undergraduate requirements: GER: DB-NatSci, WAY-AQR. If you are interested in a major that has Stats60 or Stats190 as a requirement, we encourage you to ask your advisors whether Stats101 can be substituted for it.

## Evaluation

The grade in the class will be determined on the basis of

- weekly homework assignments (60%; the lowest grade will be dropped)
- in-class final exam (40%)

### Homework

- The homework text will be available on the class web page; for solutions, you will log into Canvas.
- You may discuss homework problems with other students and with TAs in office hours, but you have to prepare the solutions yourself.
- As a general rule, we ask students NOT to complete the assignments during TAs’ office hours. This will be easier if students do not use laptops while in TAs’ office hours.
- We do not accept late homework unless there is a documented medical/family emergency or an OAE letter.

## Module materials

### 0. If you want a head start

In this class, we will be using the R language heavily in class notes, examples and lab exercises. R is free and you can install it like any other program on your computer.

- Go to the CRAN website and download it for your Mac or PC.
- Install the free version of the RStudio Desktop Software.
- Go through our install instructions

### 2. Manipulation of Data

- Lecture notes
- Labs and markdowns as in zip
- Reading materials
  - R for data science, especially chapters 3, 7, 8.

### 4. Visualization

- Lecture notes
- Labs and markdowns as in zip
- Reading materials
  - R for data science, especially chapters 1, 5.

### 5. Sampling variability

- Lecture notes
- Labs and markdowns as in zip
- Reading materials
  - Efron and Tibshirani (1993) “An Introduction to the Bootstrap”: Introduction, Accuracy of the sample mean.
  - Stigler (1989) Francis Galton’s Account of the Invention of Correlation

### 8. Principal Component Analysis

- Lecture notes
- Labs and markdowns as in zip
- Reading materials
  - Stack exchange on PCA
  - Stigler (1997) Regression towards the mean, historically considered
  - Handout distributed in class
  - Genes mirror geography within Europe

### 9. Nonparametric statistics and review

- Lecture notes
- Labs and markdowns as in zip

### 10. Reproducibility

- Lecture notes
- Reading materials
  - The Economist (October 2013) How science goes wrong
  - NYT (2014) New Truths That Only One Can See
  - Science isn’t broken, a blog entry from fivethirtyeight.com