The course provides a solid introduction to data science, both exposing students to computational tools they can proficiently use to analyze data and exploring the conceptual challenges of inferential reasoning. Each module/week represents a new “data adventure,” analyzing real datasets, exploring different questions and trying out tools.

There will be three traditional lectures per week and two labs with active student participation. Data analysis and computations will be carried out in R, a language that will be introduced during the course. There is no calculus prerequisite. Lecture notes, datasets, and lab markdowns will be available via the class web-page. Additional reading material and references will be made available. There will be weekly homework and an in-class final exam.

Stats101 is a new course, and as such it does not appear in the lists of required classes for majors. The statistics department believes that the materials in Stats101 cover the topics traditionally taught in Stats60. If your major requires Stats60, we invite you to ask your advisors if Stats101 could be accepted as a substitute.


  1. Data science: what is the buzz about?
  2. Data munging and wrangling
  3. Numerical summaries of data
  4. Visualization tools
  5. Sampling variability and uncertainty of statistical estimates
  6. Testing statistical hypotheses
  7. Linear regression and prediction
  8. High dimensional data and principal component analysis
  9. Nonparametric statistics (transformations of the data, ranking, etc.)
  10. Safeguarding reproducibility: the challenges of multiple comparisons and data snooping

Instructor & TAs

Instructors for Autumn 2017

Teaching Assistants


Email list

The course has an email list that reaches all TAs as well as the professors:

As a general rule, you should send course-related email to this list.

Note that we will not respond to e-mails about the homework due on Friday if they are sent after 6pm on Thursday night. Office hours (which do not require an appointment) are often better for technical questions.

Office hours

  • Tuesday 10:30-12:30 (Jonathan, Sequoia Hall 137)
  • Wednesday 10:30-12:30 (Lucy, Sequoia Hall 207)
  • Thursday 2:00-4:00 (Peter, Sequoia Hall 202 or by appointment)

Schedule & Location

M-F 9:30-10:20, 200-205

Prerequisites and credits

Some familiarity with elementary algebraic notation at the high school level is assumed, but there is no calculus prerequisite. There is no computer science prerequisite either, but students must be willing to engage in computational and data analysis examples using software that will be introduced in class.

Datascience 101 fulfills the following undergraduate requirements: GER: DB-NatSci and WAY-AQR. If you are interested in a major that has Stats60 or Stats190 as a requirement, we encourage you to enquire with your advisors about the possibility of substituting Stats101 for it.


The grade in the class will be determined on the basis of

  • weekly homework assignments (60%) (the worst grade will be dropped)
  • final in-class exam (40%)


  • Text will be available on the class web-page; for solutions, you will log into CANVAS.
  • You may discuss homework problems with other students and with TAs in office hours, but you have to prepare the solutions yourself.
  • As a general rule, we ask students NOT to complete the assignments during TAs’ office hours. This will be easier if students do not use laptops while in TAs’ office hours.
  • We do not accept late homework unless there is a documented medical/family emergency or an OAE letter.

Module materials

0. If you want a head start

In this class, we will be using the R language heavily in class notes, examples and lab exercises. R is free and you can install it like any other program on your computer.

  1. Go to the CRAN website and download it for your Mac or PC.
  2. Install the free version of the RStudio Desktop Software.
  3. Go through our install instructions

2. Manipulation of Data

4. Visualization

5. Sampling variability

6. Inference

7. Prediction

  • Labs and markdowns, available as a zip file

8. Principal Component Analysis

9. Nonparametric statistics and review

10. Reproducibility