Lecture 28: Outro

STATS 60 / STATS 160 / PSYCH 10

Announcements

Additional practice final.
Review sessions
- Skyler: Thursday 1:00 – 2:30 pm in CoDA E401
- Michael: Friday 11:00 am – 12:00 pm in CoDA B40

Today’s lecture#

Overview of the quarter
Three themes for the class
What’s next if you’re stats-curious?

Remembering our journey#

Unit 1: Thinking about scale#

Numbers in statistics#

In statistics and data science, numbers are used to

Describe observations
Quantify how confident we are

But numbers are only meaningful in context.

Is $10 billion a lot of money?

Three questions#

What type of number is this?
What can I compare this number to? Is it large or small compared to other similar values?
What would I have expected this number to be?

Ballpark estimates#

Set up a simple model to compute the quantity approximately by breaking the estimate up into small parts
- How many visitors go on tours at Stanford per year?
  
  \[\frac{\text{# visitors}}{\text{year}} = \frac{\text{# days}}{\text{year}} \times \frac{\text{# tours}}{\text{day}} \times \frac{\text{# visitors}}{\text{tour}}\]
Approximate parts up to a factor of 10
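
A minimal sketch of the tour estimate as arithmetic. All three input values are illustrative guesses, not figures from the lecture; each only needs to be right to within a factor of 10. The last factor is taken to be visitors per tour, so that the units cancel to visitors per year.

```python
# Ballpark estimate: visitors on Stanford tours per year.
# All three inputs are hypothetical guesses chosen for illustration.
days_per_year = 350       # assume tours run on most days of the year
tours_per_day = 10        # assumed number of tours each day
visitors_per_tour = 20    # assumed group size

visitors_per_year = days_per_year * tours_per_day * visitors_per_tour
print(f"Roughly {visitors_per_year:,} visitors per year")
```

Changing any single guess by a factor of 2 changes the answer by the same factor, which is why a ballpark estimate is only trusted to an order of magnitude.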

Cost benefit analysis#

Ballpark estimates can be used to make important decisions.

Unit 2: Exploratory data analysis#

Putting data in context#

It is hard to find insight from looking at raw data.

Data visualization and data summaries let us choose what to communicate and focus on.

Data visualization#

Common graphic representations:

Bar chart and pie chart (categorical variables)
Time series (seeing how a variable changes over time)
Histogram (one quantitative variable)
Scatter plot (two quantitative variables).

Effective data visualization#

Our World in Data: Top of the Charts 2025.

Deceptive data visualization#

Watch out for misleading data visualizations.

Summaries of center and variability#

Summaries of center: what is the one number that best summarizes the data?

Mean, median, mode.

Variability: how similar are the different datapoints in the dataset?

Would you rather be given $\$150$ or flip a coin for $\$300$?
Variance, standard deviation, quantiles and gaps between quantiles.
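
A small sketch of the coin-flip question, assuming the losing side of the coin pays $0 (the slide does not say). It suggests that the two options have the same mean and differ only in variability.

```python
import statistics

# Option A: a guaranteed $150. Option B: a fair coin flip for $300 or $0.
# The $0 losing payoff is an assumption made for this illustration.
sure_thing = [150, 150]   # both equally likely outcomes pay $150
coin_flip = [0, 300]      # two equally likely outcomes

for name, outcomes in [("sure thing", sure_thing), ("coin flip", coin_flip)]:
    mean = statistics.mean(outcomes)
    sd = statistics.pstdev(outcomes)  # SD over the equally likely outcomes
    print(f"{name}: mean = {mean}, standard deviation = {sd}")
```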

Correlation and correlation coefficient#

What is the direction and strength of the association between two quantitative variables?
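
As a rough illustration, the correlation coefficient can be computed directly from its definition. The datasets below are made up; the function name `correlation` is just a label for this sketch.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data for illustration.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 61, 70, 78]
hours_of_tv = [5, 4, 3, 2, 1]

print(correlation(hours_studied, exam_score))   # close to +1: strong, positive
print(correlation(hours_studied, hours_of_tv))  # -1: perfectly linear, negative
```

The sign gives the direction of the association and the magnitude gives its strength.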

Misleading means#

The usefulness of a summary statistic depends on the data!

Outliers/skew

Multi-modal data

Subgroups
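
A quick sketch of the outlier case, using hypothetical incomes (in thousands of dollars). One extreme value drags the mean far from where most of the data sits, while the median barely notices.

```python
import statistics

# Hypothetical annual incomes in thousands of dollars; the last is an outlier.
incomes = [40, 45, 50, 55, 60, 5000]

mean_income = statistics.mean(incomes)      # pulled up to 875 by the outlier
median_income = statistics.median(incomes)  # 52.5, typical of most of the data
print(mean_income, median_income)
```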

Unit 3: Probability#

Probability#

The mathematics of uncertainty.
One of the foundations of statistics.
Probability helped us:
- Assess how likely/unlikely coincidences are.
- Update our beliefs based on new information (conditional probability).
- Generalize findings from data to a broader group of people (hypothesis testing and confidence intervals).

Sample space and outcomes#

The set of possible outcomes is called the sample space.
An event is a collection of some of the possible outcomes.
If all outcomes are equally likely, then the probability of an event is equal to the number of outcomes in the event divided by the total number of possible outcomes.
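
A minimal sketch of the counting rule. The two-dice setup is an example chosen here, not one taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event: the two dice sum to 7.
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# Probability = (# outcomes in the event) / (# possible outcomes).
probability = Fraction(len(event), len(sample_space))
print(probability)  # 1/6
```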

Computing probabilities#

The multiplication rule

Complement rule

Coincidences#

Even if an event is rare, it is likely to happen when there are many opportunities for the rare event to take place.
Examples:
- The birthday paradox
- Winning streaks
- Unit 4: multiple testing!
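
The birthday paradox combines the multiplication rule and the complement rule. The sketch below assumes birthdays are independent and equally likely across 365 days; the function name is hypothetical.

```python
def prob_shared_birthday(n_people, n_days=365):
    """Probability that at least two of n_people share a birthday,
    assuming birthdays are independent and uniform over n_days."""
    prob_all_distinct = 1.0
    for i in range(n_people):
        # Multiplication rule: person i must avoid the i birthdays already taken.
        prob_all_distinct *= (n_days - i) / n_days
    # Complement rule: P(at least one match) = 1 - P(no match).
    return 1 - prob_all_distinct

for n in (10, 23, 50):
    print(n, round(prob_shared_birthday(n), 3))
```

With 23 people there are 253 pairs, so a match that is rare for any one pair becomes more likely than not overall.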

Conditional probability#

Updating probabilities based on partial information
Bayes’ rule
Common mistakes in conditional probability:
- Base rate fallacy: the conditional probability is not informative by itself (librarians vs. farmers)
- $\Pr[A \mid B] \neq \Pr[B \mid A]$ (distracted driving, gateway drugs)
- Failing to condition on important information (OJ Simpson)
- Generalizing from a biased sample or failing to realize you have conditioned (hot guys are jerks, selection bias)
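
A sketch of Bayes' rule on the librarians-vs-farmers example. Every number below is made up for illustration, as is the trait ("shy") being conditioned on.

```python
# Hypothetical inputs for the librarians-vs-farmers example.
p_librarian = 1 / 21             # assumed base rate: 1 librarian per 20 farmers
p_farmer = 1 - p_librarian
p_shy_given_librarian = 0.40     # assumed
p_shy_given_farmer = 0.10        # assumed

# Bayes' rule: P(librarian | shy) = P(shy | librarian) * P(librarian) / P(shy)
p_shy = p_shy_given_librarian * p_librarian + p_shy_given_farmer * p_farmer
p_librarian_given_shy = p_shy_given_librarian * p_librarian / p_shy
print(round(p_librarian_given_shy, 3))
```

Even though shyness is four times more common among librarians in this setup, a shy person is still far more likely to be a farmer, because farmers are so much more numerous.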

Unit 4: Estimates, hypothesis testing and experiments#

Statistical significance#

Statistically significant means that the results are unlikely to have occurred by random chance alone.
A p-value is the probability of finding a result at least as extreme/surprising, if outcomes happened by random chance alone.
The null hypothesis corresponds to “just chance” or “no effect.”
The alternative hypothesis corresponds to “better than chance” or “an effect.”

Computing p-value by simulation#

A simulation shows what the results would have looked like if the null hypothesis were true.
Computing the proportion of repetitions that were at least as extreme as the observed data gives the p-value.
Can be one-sided or two-sided.
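
A minimal sketch of a one-sided simulated p-value, assuming the null hypothesis is "the coin is fair." The data (60 heads in 100 flips) and the function name are hypothetical.

```python
import random

def simulated_p_value(observed_heads, n_flips, n_reps=10_000, seed=0):
    """One-sided p-value: fraction of null simulations with at least
    observed_heads heads in n_flips flips of a fair coin."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_reps):
        # Simulate one experiment under the null hypothesis.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_reps

# Hypothetical data: 60 heads in 100 flips.
print(simulated_p_value(observed_heads=60, n_flips=100))
```

A two-sided version would also count repetitions with 40 or fewer heads, since those are just as far from the expected 50.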

Experiments#

Drawbacks of observational studies (marshmallow experiment)
Correlation vs. causation and confounding/hidden variables
Effect of selection bias
Potential outcomes model
Computing p-values with simulations and permutation tests
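
A sketch of a one-sided permutation test for a difference in means. The outcome data are made up, and the function name is hypothetical; this is one common way to set up the test, not necessarily the exact version used in class.

```python
import random

def permutation_p_value(treatment, control, n_reps=10_000, seed=0):
    """One-sided permutation test for the difference in means
    (treatment minus control)."""
    rng = random.Random(seed)
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    count = 0
    for _ in range(n_reps):
        # Under the null hypothesis the labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = (sum(pooled[:n_treat]) / n_treat
                - sum(pooled[n_treat:]) / (len(pooled) - n_treat))
        if diff >= observed:
            count += 1
    return count / n_reps

# Made-up outcomes from a hypothetical randomized experiment.
treatment = [7, 9, 8, 10, 9, 8]
control = [6, 7, 5, 8, 6, 7]
print(permutation_p_value(treatment, control))
```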

Estimates#

Sample vs. population
Sample size matters for estimation!
- The standard deviation of the sample mean is $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the population standard deviation and $n$ is the sample size.
- To get $10$ times more accurate, you need $100$ times more samples.
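
A simulation sketch of the square-root rule, using rolls of a fair die as a stand-in population (an assumption made here for illustration). It compares the empirical standard deviation of the sample mean with $\frac{\sigma}{\sqrt{n}}$.

```python
import random
import statistics

def sd_of_sample_mean(n, n_reps=2_000, seed=0):
    """Empirical standard deviation of the mean of n fair die rolls."""
    rng = random.Random(seed)
    means = [statistics.fmean([rng.randint(1, 6) for _ in range(n)])
             for _ in range(n_reps)]
    return statistics.pstdev(means)

sigma = statistics.pstdev(range(1, 7))  # SD of a single die roll, about 1.71
for n in (1, 100):
    # Columns: sample size, simulated SD of the mean, sigma / sqrt(n).
    print(n, round(sd_of_sample_mean(n), 3), round(sigma / n ** 0.5, 3))
```

Going from 1 roll to 100 rolls shrinks the spread of the sample mean by about a factor of 10.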

Confidence intervals#

For large samples, the distribution of the sample mean is described by the “normal distribution.”

This can be used to make confidence intervals:

$\hat{\mu}_n \pm \frac{\hat{\sigma}_x}{\sqrt{n}}$ is a 68% confidence interval.
$\hat{\mu}_n \pm 2 \times \frac{\hat{\sigma}_x}{\sqrt{n}}$ is a 95% confidence interval.
$\hat{\mu}_n \pm 3 \times \frac{\hat{\sigma}_x}{\sqrt{n}}$ is a 99.7% confidence interval.

For proportions, $\hat{\sigma}_x = \sqrt{\hat{\pi}_n(1-\hat{\pi}_n)}$ where $\hat{\pi}_n$ is the sample proportion.
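
A minimal sketch of the interval for a proportion, using the formulas above. The poll numbers and the function name are hypothetical.

```python
import math

def proportion_confidence_interval(successes, n, z=2):
    """Approximate confidence interval for a proportion.
    z = 1, 2, 3 correspond to the three intervals listed above."""
    pi_hat = successes / n
    sigma_hat = math.sqrt(pi_hat * (1 - pi_hat))
    margin = z * sigma_hat / math.sqrt(n)
    return pi_hat - margin, pi_hat + margin

# Hypothetical poll: 540 of 1000 respondents say yes.
low, high = proportion_confidence_interval(540, 1000)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The normal approximation behind this interval is only reasonable for large samples.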

Unit 5: Machine Learning and Regression#

Predictions#

Statistics is often concerned with making predictions.

Given an observation $x$, predict the outcome $y$.

$x$ is symptoms/test results, $y$ is diagnosis
$x$ is SAT score, $y$ is first-year GPA
$x$ is weather now, $y$ is weather later

We construct a simple model $f$ so that $f(x) = \hat{y}$, with the goal that $\hat{y}$ is as close to $y$ as possible.

Building models#

It is easier to learn from examples than to build a model by hand.
Use “training” data to build the model and “testing” data to evaluate the model.
Types of prediction problems
- Regression (predicting a quantitative $y$).
- Classification (predicting a categorical $y$).
- Text generation (predicting the next word in a sentence).
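
A sketch of the training/testing workflow for a regression problem, using a least-squares line. The data are synthetic (generated as $y = 2x + 1$ plus noise), and the helper names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mean_squared_error(slope, intercept, xs, ys):
    """Average squared gap between predictions and true outcomes."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data, made up for illustration: y = 2x + 1 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Fit on the first 150 points (training); hold out the last 50 (testing).
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

slope, intercept = fit_line(train_x, train_y)
test_mse = mean_squared_error(slope, intercept, test_x, test_y)
print(slope, intercept, test_mse)
```

Evaluating on held-out points gives an honest picture of how the model does on data it has not seen.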

Examples of models#

Linear and quadratic regression

$k$-nearest neighbors

Markov text generators
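
As one concrete instance, here is a sketch of a word-level Markov text generator. The training text is a tiny made-up sentence, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def train_markov(text):
    """Map each word to the list of words that follow it in the training text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, seed=0):
    """Generate up to `length` words by repeatedly sampling a next word."""
    rng = random.Random(seed)
    output = [start]
    # Stop early if the last word was never followed by anything in training.
    while len(output) < length and followers.get(output[-1]):
        output.append(rng.choice(followers[output[-1]]))
    return " ".join(output)

# Tiny made-up training text.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_markov(corpus)
print(generate(model, start="the", length=8))
```

The generator can only produce word pairs it saw in training, so what it writes depends entirely on the training text.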

Training data is everything!#

Selection bias in training data leads to biased models.
If $x$ is far from all training examples, $f(x)$ is probably not that accurate for predicting $y$.
More (good) data and better coverage improve performance.

tl;dr: three themes#

The three major ideas that I want you to take away from this class.

Theme 1: Insight from simple models#

The world is complicated.

Answering a question exactly is overwhelming and often impossible.

Strategy: construct a simple model of the situation.

At least within the simple model, we have the power to answer questions precisely and often quantitatively.

Theme 1: examples#

Ballpark estimates and cost-benefit analysis.
Hypothesis testing.
Machine learning and prediction.
Decision-making in sports

With great power comes great responsibility.

Know the strengths and limitations of your model.

Theme 2: Conditioning matters#

We might understand an uncertain situation well, but everything can change if we condition!

Common mistakes in conditional probability
Selection bias and sampling bias
- Hot guys are jerks
- Biased estimates and biased ML predictions from biased training data
Multi-modal data affects interpretation of summary statistics
- Male vs. female penguin body mass
- Does generic medical advice apply to you?

Theme 3: Critical thinking is essential#

Once you specify the model, statistics can give precise answers.

Is our model good? Does it fit the situation?

Think critically! Don’t calculate blindly.

Theme 3: examples#

“When means mislead”
- Usefulness of fundamental summary statistics (mean, median, standard deviation) depends on data (outliers, skew)
Correlation vs. causation
- Confounding variables
- Experimental design
Misleading graphs and figures
Multiple testing and $p$-hacking

Thinking critically#

We did a lot of “what does this mean in plain English?” exercises.

Thinking like this is important—I do this in my research and in my daily life constantly.

Even when a concept is formal and/or technical, we can and should try to really understand it.

Feedback#

I want to make STATS 60 great!

Please take a couple of minutes to give some feedback on the course this quarter.

I’m stats-curious. What’s next?#

If you like exploratory data analysis#

DATASCI 112: Principles of Data Science
- Deeper dive into data visualization and data analysis
- More machine learning: how to train and evaluate ML models
- Learn some programming essentials

If you like probability#

STATS 117: Introduction to probability theory
- Dive into probability theory
- Simple discrete models (coinflips, bags of marbles)
- Continuous models (Normal)
STATS 118: Probability theory for statistical inference
- Deeper dive into probability theory
- The theory behind the normal approximation
- Math behind other hypothesis tests

If you like experiments and hypothesis testing#

STATS 191: Introduction to Applied Statistics
- Deeper dive into methods for data analysis and prediction
- Applications to biology and social sciences

After taking probability theory,

STATS 200: Introduction to Theoretical Statistics
- Hypothesis testing
- Estimation and confidence intervals
- Bayesian methods
- Some theory of machine learning

If you like machine learning#

CS 106EA: Exploring artificial intelligence
- Training and evaluating ML models
- How do neural networks work?
- Challenges in ML: over-fitting, bias, distribution shift

After taking MATH 51 and CS 106:

CS 129: Applied Machine Learning
- More ML models:
  - logistic regression
  - support vector machines
  - deep learning
- “Unsupervised learning”: clustering and feature discovery

Thanks for a great quarter!#