url = 'http://web.stanford.edu/class/stats60/data/pearson_lee.csv'
pearson = read.csv(url, header=TRUE)
mother = pearson$mother
hist(mother, breaks=20, probability=TRUE)
abline(v=mean(mother), col='red', lwd=4)
Numerical Summaries
Numeric summaries
In the last lecture, we saw histograms as a useful way to graphically summarize a list of numbers.
In this set of notes, we will look at numeric summaries that boil down the histogram to a set of a few numbers.
In particular, we will look at:
- average or mean;
- standard deviation
- median.
- 25th and 75th percentiles
Reading: Chapter 3 of statsthinking21
Why do we use numeric summaries?
Less information than a histogram
Can be simpler for humans to compare across groups.
Important to use an appropriate numeric summary!
Average
The average of a list of numbers equals their sum, divided by how many there are.
The average is also called the mean.
Example: Compute the average of the sample: [1,4,6,7,8].
The answer is (1+4+6+7+8)/5 = 26/5 = 5.2
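As a quick check of this arithmetic, here is a minimal R sketch. The variable names are chosen only for illustration.

```r
# The sample from the example above
x <- c(1, 4, 6, 7, 8)

# Average by definition: the sum divided by how many entries there are
avg_by_hand <- sum(x) / length(x)

# Built-in equivalent
avg_builtin <- mean(x)

print(avg_by_hand)   # 5.2
print(avg_builtin)   # 5.2
```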
Why use the mean?
Captures some notion of center.
The center is an intuitive way to distinguish different groups of data…
Summation notation
We will sometimes use summation notation, written with the Greek capital letter sigma (\(\Sigma\)).
Call our list \(X=[X_1, \dots, X_n]\).
We often write the sum
\[X_1 + X_2 + \dots + X_n = \sum_{i=1}^n X_i.\]
- The mean of a list \(X = [X_1, \dots, X_n]\) is often written as
\[\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i.\]
Example
The average can be thought of as the “balancing point” of the sample.
Balance and the mean
- Example: Suppose m=6.5; then the deviations summed to the right and to the left are
  - sum([7-6.5, 8-6.5]) = 2
  - sum([6.5-1, 6.5-4, 6.5-6]) = 8.5
- Example: Suppose m=5.2; then
  - sum([6-5.2, 7-5.2, 8-5.2]) = 5.4
  - sum([5.2-4, 5.2-1]) = 5.4
- The sample is balanced at \(m=5.2\).
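The balancing computation above can be sketched in R. The helper `balance` is a hypothetical name introduced here for illustration. It adds up the distances to the right and to the left of a candidate point `m`.

```r
x <- c(1, 4, 6, 7, 8)

# Hypothetical helper: total deviation to the right and to the left of m
balance <- function(x, m) {
  c(right = sum(x[x > m] - m),
    left  = sum(m - x[x < m]))
}

b1 <- balance(x, 6.5)   # right = 2,   left = 8.5 (not balanced)
b2 <- balance(x, 5.2)   # right = 5.4, left = 5.4 (balanced)

print(b1)
print(b2)
```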
Average of [1,1,1,1,4,4]
Average balances the list
- Another way to say that the average balances the list is: the sum of the deviations from average is always zero.
Example: [1,1,1,1,4,4]
- The average is 2.
- Deviations from average are [-1,-1,-1,-1,2,2].
- Sum of deviations from average: 0. This is always true.
\(\Sigma\) notation
- For any list \([X_1, \dots, X_n]\), the list of deviations from the mean is \([X_1-\bar{X}, \dots, X_n-\bar{X}]\).
\[\sum_{i=1}^n (X_i - \bar{X}) = 0.\]
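A small R sketch of this identity, using the two lists from these notes. The helper name `dev_sum` is made up for this illustration, and the result is zero only up to floating-point rounding.

```r
# Hypothetical helper: sum of deviations from the average
dev_sum <- function(x) sum(x - mean(x))

d1 <- dev_sum(c(1, 1, 1, 1, 4, 4))   # deviations are -1, -1, -1, -1, 2, 2
d2 <- dev_sum(c(1, 4, 6, 7, 8))      # deviations from 5.2

print(c(d1, d2))   # both are zero, up to rounding error
```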
Lists with repeated entries
With repeats, we can think of having a “weight” of 4/6 at 1 and 2/6 at 4.
The weights are represented by very skinny bars that are very tall…
The weight at 4 is the area of this skinny bar…
Average is the weighted sum \(4/6*1 + 2/6*4 = 2\).
Deviations from average are \(-1=(1-2)\) at 1 and \(2=(4-2)\) at 4.
Weighted sum of deviations from average: \[4/6 \times (-1) + 2/6 \times 2 = 0.\]
This is how we can compute the average of a histogram.
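The weighted-sum view can be reproduced in R as a sketch. Here the weights are obtained with `table()`, which is one convenient way to count repeats and not necessarily how the figures in these notes were made.

```r
x <- c(1, 1, 1, 1, 4, 4)

values  <- sort(unique(x))                     # 1 and 4
weights <- as.numeric(table(x)) / length(x)    # 4/6 and 2/6

# Average as a weighted sum of the distinct values
weighted_avg <- sum(weights * values)                     # 2

# Weighted sum of deviations from the average
weighted_dev <- sum(weights * (values - weighted_avg))    # 0, up to rounding

print(weighted_avg)
print(weighted_dev)
```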
Average of a histogram
Each bar in a histogram has a midpoint \(M\) and an area \(A\).
Each bar contributes \(M*A\).
The average of the histogram is the sum of \(M*A\), summed over bars…
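As a sketch of this recipe, the R code below builds a histogram with `hist(..., plot = FALSE)` and sums midpoint times area over the bars. The data are simulated and only a hypothetical stand-in for the `mother` heights. The histogram average will generally be close to, but not exactly equal to, the sample mean, because each point is replaced by the midpoint of its bar.

```r
set.seed(60)  # for reproducibility of this illustration only

# Hypothetical stand-in for the mother heights; NOT the Pearson-Lee data
x <- rnorm(1000, mean = 62.5, sd = 2.4)

h <- hist(x, breaks = 20, plot = FALSE)
M <- h$mids                        # midpoint of each bar
A <- h$density * diff(h$breaks)    # area of each bar (the areas sum to 1)

# Average of the histogram: sum of M * A over the bars
hist_avg <- sum(M * A)

print(c(histogram = hist_avg, sample = mean(x)))
```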
California population by age
| Age group | Count | Percentage |
|---|---|---|
| 0-20 | 10000000 | 29% |
| 20-55 | 17500000 | 17500000 / 34000000 = 52% |
| 55-75 | 4500000 | 13% |
| 75+ | 2000000 | 6% |
| Total | 34000000 | 100 % |
Average of mother from a histogram
Math aside
Every bar can be broken up into smaller bars and the same definition applied.
When the bars get shorter and shorter this looks like the midpoint rule for integrating (with \(f\) the histogram) \[ \int_{-\infty}^{\infty} u \cdot f(u) \; du. \]
Median
The median of a histogram is the number with half the area to the left and half the area to the right.
Median of California population from histogram
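- The 0-20 group holds 29% of the area, so the median needs another 21% from the 20-55 group, which holds 52%. The median must therefore be a fraction 0.21/0.52 of the way between 20 and 55: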
\[ 20 + 0.21 / 0.52 * (55-20) = 34.1 \]
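The same interpolation can be written as a short R sketch, assuming the rounded percentages from the table and assuming the area is spread evenly within each age group. The variable names are illustrative only.

```r
left_edges  <- c(0, 20, 55, 75)          # left edge of each age group
right_edges <- c(20, 55, 75, NA)         # the 75+ group has no stated right edge
pct         <- c(0.29, 0.52, 0.13, 0.06) # rounded shares from the table

cum <- cumsum(pct)                       # 0.29, 0.81, 0.94, 1.00

# The median lies in the first group whose cumulative share reaches 50%
k     <- which(cum >= 0.5)[1]            # 2, i.e. the 20-55 group
below <- if (k > 1) cum[k - 1] else 0    # share to the left of that group

# Linear interpolation inside the group
med <- left_edges[k] +
  (0.5 - below) / pct[k] * (right_edges[k] - left_edges[k])

print(round(med, 1))   # 34.1
```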
Median of a list of numbers
Defined similarly to a histogram: put half the data on the left and half the data on the right.
- Sort the numbers from smallest to largest.
- If the length of the list is odd, the median is the middle entry of the sorted values.
- Else, the median is the average of the two middle entries.
Example: median of [1,4,2,9,8]
The sorted values are [1,2,4,8,9].
Since the length of the list is 5, the median is the middle entry of the sorted values. The median is 4.
Example: median of [1,11,3,7,8,3]
- The sorted values are [1,3,3,7,8,11].
- Since the length of the list is 6, the median is the average of the middle entries. The median is (3+7)/2=5.
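The three steps above can be sketched as a small R function. The name `median_by_hand` is hypothetical. R's built-in `median()` gives the same answers on these examples.

```r
# Hypothetical helper implementing the sort / odd / even rule
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd length: the middle entry
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even length: average the two middle entries
  }
}

m_odd  <- median_by_hand(c(1, 4, 2, 9, 8))       # 4
m_even <- median_by_hand(c(1, 11, 3, 7, 8, 3))   # 5

print(c(m_odd, m_even))
```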
Comparing median and average
What is the mean of [3,7,4,11,5]? The median?
What is the mean of [3,37,4,41,5]? The median?
What do these examples tell us about the mean and median?
This tells us that the median is less sensitive to changes away from the center than the mean. Statisticians call this property of the median robustness.
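A short R sketch of the two lists above makes the comparison concrete. The second list is the first with 7 and 11 moved far to the right.

```r
a <- c(3, 7, 4, 11, 5)
b <- c(3, 37, 4, 41, 5)   # same list with 7 -> 37 and 11 -> 41

res <- rbind(a = c(mean = mean(a), median = median(a)),
             b = c(mean = mean(b), median = median(b)))

print(res)
# The mean moves from 6 to 18, while the median stays at 5.
```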
Examples of medians of histograms
When the histogram is roughly symmetric (e.g. mother), the average and the median are nearly equal.
With heavier tails the mean can get further from the median.
W[1:100] = W[1:100] + 4000
(mean_and_median(W / 400) + xlim(0,1) +
labs(title='Histogram of wage/400 after changing 3%'))
- I changed 100 out of 3000 of the points in the data set by adding a large number to them…
Scale or spread of a data set
Both average and median summarize the center or location of a sample.
Is this everything there is to say about a sample?
Suppose we are trying to predict the price of AAPL tomorrow.
We’d like to guess the price exactly, but that is unrealistic.
How much money we make will depend on how accurate our guess is.
Accuracy is not captured by mean or median.
We have 252-1 daily returns, and roughly 50 weekly returns.
The weekly returns (daily averaged over the week) are less variable: we might do better predicting these!
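To illustrate why averaged returns are less variable, here is a sketch with simulated numbers. These are hypothetical values, not real AAPL returns, and the spread is measured with the standard deviation that the next section introduces.

```r
set.seed(60)  # for reproducibility of this illustration only

# Hypothetical daily returns: 250 simulated values, NOT real AAPL data
daily <- rnorm(250, mean = 0, sd = 0.02)

# Group into 50 weeks of 5 trading days (one column per week) and average
weekly <- colMeans(matrix(daily, nrow = 5, ncol = 50))

sd_daily  <- sd(daily)
sd_weekly <- sd(weekly)

print(c(daily = sd_daily, weekly = sd_weekly))
```

Choosing a length that is a multiple of 5 avoids the recycling warning that `matrix()` gives when the data do not fill the matrix evenly.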
Standard deviation
Concept:
- A measure of variation. The larger the SD the more spread out the sample is.
- The SD quantifies how far numbers on a list are away from their average.
- Its units are in the original units of the list.
- It is never negative, and it is zero only when all entries on the list are equal.
- Most entries on the list will be somewhere around one SD away from the average. Very few will be more than two or three SDs away.
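The last rule of thumb can be checked on data. The sketch below uses a simulated bell-shaped sample, which is an assumption. For strongly skewed or heavy-tailed lists the fractions can look different.

```r
set.seed(60)  # for reproducibility of this illustration only

# Hypothetical bell-shaped sample; not data from the course
x <- rnorm(10000, mean = 62.5, sd = 2.4)

# Distance of each entry from the average, measured in SDs
z <- abs(x - mean(x)) / sd(x)

within_1 <- mean(z <= 1)   # fraction within one SD of the average
beyond_2 <- mean(z > 2)    # fraction more than two SDs away
beyond_3 <- mean(z > 3)    # fraction more than three SDs away

print(c(within_1 = within_1, beyond_2 = beyond_2, beyond_3 = beyond_3))
```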
Understanding variation
- In later lectures, we’ll do things like compare two groups by sampling from the groups and then comparing the sample mean or sample median of some measured variable.
We’re often interested in deciding whether (or by how much) these groups differ on this variable.
Understanding the sampling variation is key to assessing our confidence in:
determining whether there is any difference
how accurately we can estimate this difference.
The first task falls under the umbrella of hypothesis testing while the second is uncertainty quantification.
A key observation of statistics: whenever we sample there is some unpredictable variation, BUT we can predict quite accurately how big this variation typically is!
Computing the standard deviation of a list L
Standard deviation in summation notation
Recall our list is \(L=[X_1, \dots, X_n]\).
The computation we just did is
\[ \sqrt{ \frac{1}{n} \sum_{i=1}^n(X_i-\bar{X})^2} \]
The denominator in standard deviation…
- Most software divides the sum of squared deviations by \(n-1\) instead of \(n\):
\[ SD(L) = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n(X_i-\bar{X})^2} \]
- When the sample is large enough, this distinction is not important. R's sd uses the corrected denominator \(n-1\).
Calculation in table form
Example: L=[20,30,25,25]
Step 1: Compute the average
| Entry | Data | Deviation | Deviation\(^2\) |
|---|---|---|---|
| 1 | 20 | | |
| 2 | 30 | | |
| 3 | 25 | | |
| 4 | 25 | | |
| Total | 100 | | |
The average is 100/4 = 25.
Step 2: Compute the deviations and the squared deviations
| Entry | Data | Deviation | Deviation\(^2\) |
|---|---|---|---|
| 1 | 20 | -5 | 25 |
| 2 | 30 | 5 | 25 |
| 3 | 25 | 0 | 0 |
| 4 | 25 | 0 | 0 |
| Total | 100 | (not needed, but always 0) | 50 |
Step 3: Compute the root mean square
The mean of the squared deviations is 50/4 = 12.5, so the root mean square is \(\sqrt{50/4}\approx 3.5\).
Using the corrected denominator, we would report \(\sqrt{50/3} \approx 4.1\).
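The table calculation can be reproduced in R. This sketch follows the three steps above and then compares with the built-in `sd()`, which divides by \(n-1\).

```r
L <- c(20, 30, 25, 25)

# Step 1: the average
avg <- mean(L)            # 25

# Step 2: deviations and squared deviations
dev <- L - avg            # -5, 5, 0, 0
sq  <- dev^2              # 25, 25, 0, 0

# Step 3: root mean square, with each choice of denominator
sd_n  <- sqrt(sum(sq) / length(L))         # sqrt(50/4), about 3.5
sd_n1 <- sqrt(sum(sq) / (length(L) - 1))   # sqrt(50/3), about 4.1

# R's built-in sd() divides by n - 1
sd_builtin <- sd(L)

print(round(c(sd_n = sd_n, sd_n1 = sd_n1, sd_builtin = sd_builtin), 2))
```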
Changing location and scale
How do the average and SD change when we switch units?
Suppose you are told that the average max temperature in Palo Alto for April 1 over the last 20 years is 70F with an SD of 6F.
The rule for converting Fahrenheit to Celsius is \[ C = \frac{5}{9} ( F - 32). \]
What would the average and SD be if you used C (Celsius) instead?
Mean and SD under change of units
- Suppose we have a list of 12 measurements in “old units”
\[F = [F_1, \dots, F_{12}]\]
- We want to convert to new units
\[C = [C_1, \dots, C_{12}]\]
- The transformation of units can be represented as:
\[C = a * F + b\]
- The average transforms like
\[ \bar{C} = a * \bar{F} + b \]
- The SD transforms like:
\[ \text{SD}(C) = |a| * \text{SD}(F) \]
- How about median(C)?
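These rules can be checked numerically. The sketch below uses a made-up list of 12 Fahrenheit temperatures (hypothetical values, not Palo Alto data) and the conversion \(C = \frac{5}{9}(F-32)\), so that \(a = 5/9\) and \(b = -32 \cdot 5/9\). It also checks the median, which follows the same rule as the average under a linear change of units.

```r
# Hypothetical list of 12 temperatures in Fahrenheit (made up for illustration)
temp_f <- c(61, 64, 66, 68, 69, 70, 71, 72, 73, 75, 78, 83)

a <- 5 / 9
b <- -32 * 5 / 9
temp_c <- a * temp_f + b   # same as (5/9) * (temp_f - 32)

# Each pair below should agree
checks <- rbind(
  mean   = c(direct = mean(temp_c),   rule = a * mean(temp_f) + b),
  sd     = c(direct = sd(temp_c),     rule = abs(a) * sd(temp_f)),
  median = c(direct = median(temp_c), rule = a * median(temp_f) + b)
)

print(round(checks, 3))
```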
Other notions of spread
We saw at least two notions of center of a histogram or sample.
Similarly, there is more than one notion of the spread of a histogram or sample.
Quartiles
The first quartile or 25th percentile is defined similarly to the median: it is the point for which 25% of the data is to the left and 75% to the right.
The third quartile or 75th percentile is defined similarly to the median: it is the point for which 75% of the data is to the left and 25% to the right.
Inter-quartile range
- Defined as the difference between the 3rd and 1st quartile.
California population by age
| Age group | Count | Percentage |
|---|---|---|
| 0-20 | 10000000 | 29% |
| 20-55 | 17500000 | 17500000 / 34000000 = 52% |
| 55-75 | 4500000 | 13% |
| 75+ | 2000000 | 6% |
| Total | 34000000 | 100 % |
The first quartile is \[ 20 * 0.25/0.29 \approx 17.2 \]
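The third quartile lies in the 20-55 age group, 6 percentage points below its upper end (81% of the population is younger than 55, and 81% - 75% = 6%):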
\[ 55 - 0.06 / 0.52 * (55 - 20) \approx 51.0 \]
- The inter-quartile range is \(51.0-17.2 \approx 33.8\)
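The quartile calculations can be sketched with one small interpolation function. The name `hist_quantile` is hypothetical, the percentages are the rounded values from the table, and the right edge of 100 for the 75+ group is an assumption that is not used for these two quartiles.

```r
breaks <- c(0, 20, 55, 75, 100)       # 100 is an ASSUMED right edge for the 75+ group
pct    <- c(0.29, 0.52, 0.13, 0.06)   # rounded shares from the table

# Hypothetical helper: point with a share p of the area to its left,
# assuming area is spread evenly within each bar
hist_quantile <- function(p, breaks, pct) {
  cum   <- cumsum(pct)
  k     <- which(cum >= p)[1]             # bar containing the quantile
  below <- if (k > 1) cum[k - 1] else 0   # share to the left of that bar
  breaks[k] + (p - below) / pct[k] * (breaks[k + 1] - breaks[k])
}

q1  <- hist_quantile(0.25, breaks, pct)   # about 17.2
q3  <- hist_quantile(0.75, breaks, pct)   # about 51.0
iqr <- q3 - q1

print(round(c(q1 = q1, q3 = q3, iqr = iqr), 1))
```

Without intermediate rounding the inter-quartile range comes out near 33.7. Subtracting the rounded quartiles gives 33.8.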
Quartiles of California age data
Quartiles and the boxplot
- The boxplot displays the quartiles (and hence the inter-quartile range).
Inter-quartile range and “robustness”
url = 'http://web.stanford.edu/class/stats60/data/income_2024.csv'
income = read.csv(url, header=TRUE)
# using 1000 as RH endpoint
bins = c(0, income$R)
counts = income$counts
prob = counts / sum(counts)
dens = prob / diff(bins) * 100
barplot(dens, diff(bins), space=0, names.arg=income$R, ylab='% / 1000$',
xlab='Income (1000$)')
We saw that the median was less susceptible to changes in extreme values of the data than the mean. What about the inter-quartile range?
Which changes more when we change the last bar in the histogram: mean or median? standard deviation or inter-quartile range?
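One way to explore this question is a small experiment. The sketch below uses simulated incomes (hypothetical, not the income_2024 data) and pushes the 100 largest of 3000 values much further to the right, which mimics changing the last bar of the histogram.

```r
set.seed(60)  # for reproducibility of this illustration only

# Hypothetical incomes (in $1000): 3000 simulated values, NOT the course data
x <- rexp(3000, rate = 1 / 50)

# Add a large amount to the 100 largest values (about 3% of the sample)
y <- x
top <- order(x, decreasing = TRUE)[1:100]
y[top] <- y[top] + 4000

# Hypothetical helper collecting the four summaries
summaries <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), iqr = IQR(v))
}

before <- summaries(x)
after  <- summaries(y)

print(round(rbind(before, after), 1))
```

In this setup the median and inter-quartile range do not move at all, because the changed points were already above the third quartile and stay there, while the mean and SD change a great deal.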
Summary
We saw several numerical summaries of a single feature:
- mean
- median
- sd
- quartiles
Some measure location, others describe spread / variation.
All of these summaries can be computed directly from a histogram except sd.
We’ll see shortly that you can compute sd for a histogram…
Moral: if you know the histogram, you can summarize it well…
SD of a histogram
Alternative form of SD
\[ SD(L) = \sqrt{\left(\frac{1}{n} \sum_{i=1}^nX_i^2\right) - \bar{X}^2} \]
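A quick numerical check of this identity, using the list [20,30,25,25] from the table example. Both forms here use the \(1/n\) denominator, so they are not the same as R's `sd()`.

```r
L <- c(20, 30, 25, 25)

# Definition: root mean square of the deviations from the average
sd_def <- sqrt(mean((L - mean(L))^2))

# Alternative form: mean of the squares minus the square of the mean
sd_alt <- sqrt(mean(L^2) - mean(L)^2)

print(c(definition = sd_def, alternative = sd_alt))   # both equal sqrt(12.5)
```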
SD of a histogram
- For the average of a histogram, each bar (running from \(a\) to \(b\), with height \(h\)) contributed
\[ A * M = h * (b-a) * (b+a)/2 = h * \frac{b^2-a^2}{2} \]
- For the average of the square of a histogram each bar contributes
\[ h * \frac{b^3-a^3}{3} \]
- Use the alternate form…
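Putting the pieces together, here is a sketch of the SD of a histogram. The three-bar histogram is hypothetical and chosen so the arithmetic is easy to follow. The symbols `a`, `b` and `h` denote the left edge, right edge and height of each bar.

```r
# Hypothetical histogram with three bars; the areas sum to 1
a    <- c(0, 1, 3)          # left edge of each bar
b    <- c(1, 3, 6)          # right edge of each bar
area <- c(0.2, 0.5, 0.3)    # area of each bar
h    <- area / (b - a)      # height of each bar

# Average of the histogram: each bar contributes h * (b^2 - a^2) / 2
mean_hist <- sum(h * (b^2 - a^2) / 2)      # 2.45

# Average of the square: each bar contributes h * (b^3 - a^3) / 3
mean_sq_hist <- sum(h * (b^3 - a^3) / 3)

# Alternative form of SD
sd_hist <- sqrt(mean_sq_hist - mean_hist^2)

print(c(mean = mean_hist, sd = sd_hist))
```

The first sum agrees with the midpoint-times-area rule, since \(h (b^2-a^2)/2 = h(b-a)(a+b)/2\).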
Math aside
- The SD of a histogram can be written as (with \(f\) the histogram)
\[ \left(\int_{-\infty}^{\infty} u^2 f(u) \; du - \left( \int_{-\infty}^{\infty} u f(u) \; du \right)^2 \right)^{1/2} \]