Survival analysis¶
Very brief description of data generating mechanism and models.
This is a large part of a full course (EPI262).
Assuming no censoring, we observe \((T_i, X_i)\), with \(T_i\) an event / failure time and \(X_i\) explanatory covariates.
With (right) censoring, we observe \((O_i, \delta_i, X_i)\), where \(O_i\) is a time (either an event time or a censoring time) and \(\delta_i\) denotes whether the time is a true event or not.
Goal: understand \(F_{X_i} = {\cal L}(T_i|X_i)\) where \(T_i\) are the true event times.
References¶
Therneau & Grambsch: Modeling Survival Data
Kalbfleisch & Prentice: The Statistical Analysis of Failure Time Data
Estimands¶
Given a candidate \(F_{X_i}\) and age \(a\) we might be interested in the expected remaining lifetime
\[E_{F_{X_i}}\left[T_i - a \mid T_i > a\right]\]
Or perhaps the median remaining lifetime, i.e. the median of \({\cal L}(T_i - a \mid T_i > a)\).
No natural loss function as in linear / logistic regression.
If we can model / estimate \({\cal L}(T|X)\) then we can derive estimates of above quantities.
Tests of equality of distribution functions are common rather than tests about means / medians.
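As a small illustration (my own sketch, not from the notes), assuming a Weibull candidate law for \({\cal L}(T|X)\), the estimands above can be computed directly:

# hypothetical candidate: T | X ~ Weibull(shape=2, scale=3); condition on age a = 1
shape <- 2; scale <- 3; a <- 1
S <- function(t) pweibull(t, shape, scale, lower.tail=FALSE)
# expected remaining lifetime E[T - a | T > a] = int_a^Inf S(t) dt / S(a)
mean_remaining <- integrate(S, a, Inf)$value / S(a)
# median remaining lifetime m solves S(a + m) = S(a) / 2
median_remaining <- qweibull(1 - S(a) / 2, shape, scale) - a
c(mean = mean_remaining, median = median_remaining)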
Hazard function¶
For a survival distribution \(F\), many quantities of interest can be more simply expressed using the hazard function.
When a density \(f\) exists, we get the hazard rate
\[h(t) = \frac{f(t)}{S(t)}\]
or
\[h(t) = -\frac{d}{dt} \log S(t)\]
with \(S(t) = \bar{F}(t) = 1 - F(t)\) the survival function. The cumulative hazard is
\[H(t) = \int_0^t h(s)\, ds = -\log S(t).\]
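A quick numerical sketch (mine, using a Weibull example) checking that \(f(t)/S(t)\) and \(-\frac{d}{dt}\log S(t)\) agree:

# check h(t) = f(t)/S(t) = -d/dt log S(t) numerically for a Weibull(shape=2, scale=3)
shape <- 2; scale <- 3
t <- seq(0.5, 5, by=0.5)
h1 <- dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail=FALSE)
eps <- 1e-6                              # central-difference derivative of -log S(t)
logS <- function(t) pweibull(t, shape, scale, lower.tail=FALSE, log.p=TRUE)
h2 <- -(logS(t + eps) - logS(t - eps)) / (2 * eps)
max(abs(h1 - h2))                        # should be essentially 0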
Hazard function for discrete laws¶
For discrete distributions supported on \(\{t_i\}\), the hazard is 0 everywhere except at the support points, where
\[h(t_i) = P(T = t_i \mid T \geq t_i).\]
The “natural” cumulative hazard
\[H(t) = \sum_{i: t_i \leq t} h(t_i)\]
does not have the property
\[S(t) = e^{-H(t)}.\]
This does hold for the alternative cumulative hazard
\[\tilde{H}(t) = -\sum_{i: t_i \leq t} \log\left(1 - h(t_i)\right),\]
for which \(e^{-\tilde{H}(t)} = \prod_{i: t_i \leq t} \left(1 - h(t_i)\right) = S(t)\).
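A small sketch (my own, using a geometric law as the discrete example) contrasting the two cumulative hazards:

# T ~ Geometric on {1, 2, ...} with P(T = k) = p (1 - p)^(k - 1), so h(t_k) = p
p <- 0.2
k <- 1:15
haz <- rep(p, length(k))
S_true <- (1 - p)^k                      # S(k) = P(T > k)
H_nat  <- cumsum(haz)                    # "natural" cumulative hazard
H_alt  <- -cumsum(log(1 - haz))          # alternative cumulative hazard
max(abs(exp(-H_nat) - S_true))           # not ~0: exp(-H_nat) differs from S
max(abs(exp(-H_alt) - S_true))           # ~0:     exp(-H_alt) recovers S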
Hazard models¶
Many models in survival analysis model the hazard function (or cumulative version).
The hazard function just has to be non-negative (and its integral must eventually diverge to \(\infty\)…).
Constant hazard¶
If \(T \sim \text{Exp}(\lambda)\) with density
\[f_T(t) = \frac{1}{\lambda} e^{-t/\lambda}, \qquad t \geq 0,\]
then \(S(t) = e^{-t/\lambda}\) and
\[h(t) = \frac{f_T(t)}{S(t)} = \frac{1}{\lambda}\]
and
\[H(t) = \int_0^t h(s)\, ds = \frac{t}{\lambda}.\]
Estimating a hazard rate¶
Nelson-Aalen¶
Non-parametric estimate (assuming no ties in the data)
\[\hat{H}(t) = \sum_{j: T_j \leq t} \frac{1}{|R(T_j)|}\]
where \(R(t) = \left\{j:T_j \geq t \right\}\) is the risk set at time \(t\).
This assumes no censoring – coming up…
library(survival)
events = rexp(100)               # 100 uncensored Exp(1) event times
censor = rep(1, 100)             # every observation is an event
H = basehaz(coxph(Surv(events, censor)~1))   # Breslow / Nelson-Aalen cumulative hazard
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)   # true cumulative hazard H(t) = t
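For comparison, a by-hand version (my own sketch) of the sum \(\sum_{j: T_j \leq t} 1/|R(T_j)|\), reusing the `events` and `H` objects from the chunk above:

# by-hand Nelson-Aalen (no censoring, no ties): increments are 1 / (risk set size)
ts <- sort(events)
n  <- length(ts)
H_hand <- cumsum(1 / (n:1))                        # risk sets have sizes n, n-1, ..., 1
max(abs(H_hand - H$hazard[match(ts, H$time)]))     # compare with basehaz(): essentially 0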
Nelson-Aalen estimator¶
Asymptotic properties can be analyzed using techniques from point processes in which \(H\) is the “cumulative intensity”.
The fluctuation process
\[\sqrt{n}\left(\hat{H}(t) - H(t)\right)\]
can be analyzed using martingale techniques to provide pointwise confidence intervals for \(H\).
Nelson-Aalen variance estimate¶
Standard error
\[\text{SE}(\hat{H}(t)) = \left(\sum_{j: T_j \leq t} \frac{1}{|R(T_j)|^2}\right)^{1/2}\]
Can use the delta method for a CI for \(S(t)\) based on the Breslow estimator of survival \(\hat{S}(t) = e^{-\hat{H}(t)}\); 95% CI for \(S(t)\):
\[\hat{S}(t) \pm 1.96 \cdot \hat{S}(t) \cdot \text{SE}(\hat{H}(t))\]
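A sketch of mine computing these pointwise intervals, reusing the uncensored `events` simulated above:

# pointwise 95% CIs for H(t) and, via the delta method, for S(t) = exp(-H(t))
ts <- sort(events)
n  <- length(ts)
H_hat <- cumsum(1 / (n:1))
se_H  <- sqrt(cumsum(1 / (n:1)^2))       # SE of the Nelson-Aalen estimate
S_hat <- exp(-H_hat)                     # Breslow estimator of survival
se_S  <- S_hat * se_H                    # delta method: Var(S) ~ S^2 Var(H)
head(cbind(time = ts, H = H_hat, H_lower = H_hat - 1.96 * se_H, H_upper = H_hat + 1.96 * se_H,
           S = S_hat, S_lower = S_hat - 1.96 * se_S, S_upper = S_hat + 1.96 * se_S))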
Where does this variance estimate come from?¶
Roughly, partitioning \([0,t]\) finely, set \(s_j = tj/m\) and
\[\hat{H}(s_{j+1}) - \hat{H}(s_j) \approx \frac{\#\left\{i: s_j < T_i \leq s_{j+1}\right\}}{|R(s_j)|}.\]
A reasonable heuristic,
\[\#\left\{i: s_j < T_i \leq s_{j+1}\right\} \,\Big|\, R(s_j) \sim \text{Binomial}\left(|R(s_j)|,\ h(s_j)(s_{j+1} - s_j)\right),\]
leads to the (conditional) variance
\[\text{Var}\left(\hat{H}(s_{j+1}) - \hat{H}(s_j) \,\Big|\, R(s_j)\right) \approx \frac{h(s_j)(s_{j+1} - s_j)}{|R(s_j)|}.\]
Summing all these conditional variances (this uses the martingale property) leads to
\[\text{Var}\left(\hat{H}(t)\right) \approx \int_0^t \frac{h(s)}{|R(s)|}\, ds.\]
We can approximate
\[\int_0^t f(s)\, h(s)\, ds = \int_0^t f(s)\, dH(s) \approx \sum_{j: T_j \leq t} \frac{f(T_j)}{|R(T_j)|}.\]
With \(f(t) = |R(t)|^{-1}\) this yields
\[\widehat{\text{Var}}\left(\hat{H}(t)\right) = \sum_{j: T_j \leq t} \frac{1}{|R(T_j)|^2}.\]
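A quick simulation (my own sketch) checking that \(\sum_{T_j \leq t} |R(T_j)|^{-2}\) tracks the sampling variance of \(\hat{H}(t)\):

# Monte Carlo check of the Nelson-Aalen variance estimate at t0 = 1, Exp(1) data, no censoring
set.seed(1)
n <- 200; t0 <- 1
H_hat <- varest <- numeric(500)
for (b in 1:500) {
  tb <- sort(rexp(n))
  atrisk <- n:1                          # risk set sizes at the ordered event times
  keep <- tb <= t0
  H_hat[b]  <- sum(1 / atrisk[keep])
  varest[b] <- sum(1 / atrisk[keep]^2)
}
c(empirical = var(H_hat), estimated = mean(varest))   # the two should be close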
# a larger sample
events = rexp(1000)
censor = rep(1, 1000)
H = basehaz(coxph(Surv(events, censor)~1))
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)
Models¶
Proportional hazards¶
Models the hazard rate as
\[h(t|X_i) = h_0(t)\, e^{X_i^T\beta}\]
with baseline hazard rate \(h_0\).
In principle, the exponential could be swapped for another link.
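A minimal simulation sketch (mine): data from a proportional hazards model with \(h_0 \equiv 1\), fit with `coxph` (variable names `x_ph`, `t_ph` are my own):

# h(t | X) = h_0(t) exp(X * beta) with h_0 = 1 and beta = 1.5,
# so T | X is exponential with rate exp(1.5 * X)
library(survival)
set.seed(2)
x_ph <- rnorm(1000)
t_ph <- rexp(1000, rate = exp(1.5 * x_ph))
coxph(Surv(t_ph, rep(1, 1000)) ~ x_ph)   # estimated coefficient should be near 1.5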
Cox’s model¶
Allows a nonparametric model for \(h_0\).
An example of a semiparametric model.
We will come back to the Cox model shortly.
Accelerated failure time¶
Assumes \(T_i = \theta(X_i^T\beta)^{-1} \cdot Z_i\), where the law of \(Z_i\) does not depend on any parameters.
Not hard to see this corresponds to a hazard of the form
\[h(t|X_i) = \theta(X_i^T\beta)\, h_0\!\left(\theta(X_i^T\beta)\, t\right)\]
with \(h_0\) the hazard function of \(Z_i\).
Usually a parametric model for \(Z_i\), e.g. Weibull, log-logistic, etc.
Equivalent to
\[\log T_i = -\log \theta(X_i^T\beta) + \log Z_i.\]
Choosing \(\theta(\eta) = \exp(-\eta)\) yields
\[\log T_i = X_i^T\beta + \log Z_i.\]
Just a parametric regression model – the errors are not Gaussian.
summary(survreg(Surv(events, censor)~1, dist='weibull'))
Call:
survreg(formula = Surv(events, censor) ~ 1, dist = "weibull")
Value Std. Error z p
(Intercept) 0.0236 0.0332 0.71 0.48
Log(scale) -0.0030 0.0246 -0.12 0.90
Scale= 0.997
Weibull distribution
Loglik(model)= -1022.3 Loglik(intercept only)= -1022.3
Number of Newton-Raphson Iterations: 6
n= 1000
Let’s generate some data from a correctly specified model.
X = runif(1000)
censor = rep(1,1000)                  # no censoring
events = rexp(1000)^0.5 * exp(2*X-1)  # log(T) = 2*X - 1 + 0.5*log(E), E ~ Exp(1): Weibull AFT, scale 0.5
M = survreg(Surv(events, censor) ~ X, dist='weibull')
summary(M)
Call:
survreg(formula = Surv(events, censor) ~ X, dist = "weibull")
Value Std. Error z p
(Intercept) -1.0567 0.0324 -32.6 <2e-16
X 2.0856 0.0546 38.2 <2e-16
Log(scale) -0.6852 0.0247 -27.7 <2e-16
Scale= 0.504
Weibull distribution
Loglik(model)= -605.7 Loglik(intercept only)= -1011.4
Chisq= 811.31 on 1 degrees of freedom, p= 1.9e-178
Number of Newton-Raphson Iterations: 6
n= 1000
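Since the AFT model is just a log-linear regression with non-Gaussian errors, for these uncensored data an ordinary least-squares fit on the log scale recovers a similar slope (a sketch of mine; OLS is less efficient and its intercept absorbs \(E[0.5\log E] \approx -0.29\)):

# OLS on log(T): slope should be near 2, matching the Weibull AFT fit above
coef(lm(log(events) ~ X))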
Using dist='weibull' estimates a variance in the regression…
M2 = survreg(Surv(events, censor) ~ X, dist='exponential')
summary(M2)
Call:
survreg(formula = Surv(events, censor) ~ X, dist = "exponential")
Value Std. Error z p
(Intercept) -1.1673 0.0636 -18.4 <2e-16
X 2.0651 0.1085 19.0 <2e-16
Scale fixed at 1
Exponential distribution
Loglik(model)= -883.3 Loglik(intercept only)= -1056.4
Chisq= 346.3 on 1 degrees of freedom, p= 2.7e-77
Number of Newton-Raphson Iterations: 4
n= 1000
Censoring¶
Let’s imagine a lab running experiments on development of cancer in mice after exposure to different levels of carcinogen.
Mice are born on day \(B\), exposed to carcinogen \(X\) and their symptoms are detected at time \(D\) yielding event time \(T=D-B\).
Funding for the experiment was only guaranteed up to an end date \(E=\) 3/31/2020. Not all mice had presented symptoms by the end date \(E\).
For those mice who have presented symptoms, we observe the time \(O = D - B = T\).
For the others, we observe \(O = E - B\) and only know that their symptoms would present later than \(O\) in their lifetime.
This is called right-censoring: we only know the time of interest lies in some interval \([O, \infty)\).
With right-censored data, observations are
\[(O_i, \delta_i, X_i), \qquad 1 \leq i \leq n,\]
with \(O_i\) the observed “time” for subject \(i\) and \(\delta_i=1\) denotes the time of interest was observed, \(\delta_i=0\) denotes this observation was right-censored.
Likelihood under censoring¶
In order to fit the model and carry out inference, we usually must assume censoring is non-informative.
For non-informative censoring, the likelihood is
\[L = \prod_{i=1}^n f(O_i|X_i)^{\delta_i}\, S(O_i|X_i)^{1-\delta_i}.\]
Or,
\[L = \prod_{i=1}^n h(O_i|X_i)^{\delta_i}\, e^{-H(O_i|X_i)}.\]
We have used
\[f(t|X) = h(t|X)\, S(t|X) = h(t|X)\, e^{-H(t|X)}.\]
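To see this likelihood in action, a separate toy sketch (mine): for a censored exponential with no covariates, maximizing \(\prod_i f^{\delta_i} S^{1-\delta_i}\) gives the closed form \(\hat\lambda = \sum_i \delta_i / \sum_i O_i\) (events per unit time at risk), which agrees with `survreg` since the exponential AFT has rate \(\exp(-\text{intercept})\):

# hand-coded censored exponential MLE: rate = (number of events) / (total time at risk)
set.seed(3)
t_true <- rexp(500, rate = 2)            # true event times, rate 2
c_time <- rexp(500, rate = 1)            # non-informative censoring times
obs    <- pmin(t_true, c_time)
delta  <- as.numeric(t_true <= c_time)
rate_hat <- sum(delta) / sum(obs)        # maximizes prod f^delta * S^(1-delta)
fit <- survreg(Surv(obs, delta) ~ 1, dist='exponential')
c(by_hand = rate_hat, survreg = exp(-coef(fit)))   # should agree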
Tcens = rexp(length(events)) * 1.5       # non-informative censoring times
observed = pmin(events, Tcens)           # observed time O = min(T, C)
censor = (events <= Tcens)               # delta = 1 if the event was observed
M3 = survreg(Surv(observed, censor) ~ X, dist='weibull')
summary(M3)
Call:
survreg(formula = Surv(observed, censor) ~ X, dist = "weibull")
Value Std. Error z p
(Intercept) -1.0666 0.0392 -27.2 <2e-16
X 2.1073 0.0779 27.0 <2e-16
Log(scale) -0.6881 0.0318 -21.7 <2e-16
Scale= 0.503
Weibull distribution
Loglik(model)= -340.8 Loglik(intercept only)= -610.9
Chisq= 540.15 on 1 degrees of freedom, p= 1.8e-119
Number of Newton-Raphson Iterations: 6
n= 1000
Nelson-Aalen with censoring¶
Can still estimate the cumulative hazard under right-censoring:
\[\hat{H}(t) = \sum_{j: T_j \leq t} \frac{d_j}{n_j}\]
where the sum is over distinct observed event times \(T_j\), and
\(R(t) = \left\{j: O_j \geq t\right\}\) is the risk set
\(d_j\) the number of events at \(T_j\) (1 if no ties possible)
\(n_j = |R(T_j)|\) the number at risk just prior to \(T_j\) (i.e. candidates to fail at \(T_j\))
Nelson-Aalen: variance with censoring¶
Standard error is estimated as
\[\text{SE}(\hat{H}(t)) = \left(\sum_{j: T_j \leq t} \frac{d_j}{n_j^2}\right)^{1/2}\]
Or (asymptotically similar)
\[\text{SE}(\hat{H}(t)) = \left(\sum_{j: T_j \leq t} \frac{(n_j - d_j)\, d_j}{n_j^3}\right)^{1/2}\]
Can similarly use the delta method for a CI for \(S(t)\) based on the Breslow estimator of the survival function \(\hat{S}(t) = e^{-\hat{H}(t)}\).
events = rexp(1000)
Tcens = rexp(1000) * 3
observed = pmin(events, Tcens)
censor = (events < Tcens)
H = basehaz(coxph(Surv(observed, censor) ~ 1))   # Nelson-Aalen / Breslow under censoring
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)                   # truth: H(t) = t
plot(H$time, exp(-H$hazard), type='l')           # Breslow estimate of S(t)
lines(H$time, exp(-H$time), col='red')           # truth: S(t) = exp(-t)
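A by-hand sketch (mine) of \(\hat{H}(t) = \sum_{T_j \leq t} d_j/n_j\) and its standard error, reusing `observed`, `censor` and `H` from the chunk above:

# by-hand Nelson-Aalen with censoring
etimes <- sort(unique(observed[censor]))                        # distinct observed event times
d <- sapply(etimes, function(t) sum(observed == t & censor))    # d_j: events at each time
n_at_risk <- sapply(etimes, function(t) sum(observed >= t))     # n_j: size of risk set
H_hand  <- cumsum(d / n_at_risk)
se_hand <- sqrt(cumsum(d / n_at_risk^2))
max(abs(H_hand - H$hazard[match(etimes, H$time)]))              # compare with basehaz(): essentially 0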
What happens if we ignore censoring?¶
ignore_censor = rep(1, length(censor))
Hbias = basehaz(coxph(Surv(observed, ignore_censor) ~ 1))
plot(Hbias$time, exp(-Hbias$hazard), type='l')
lines(H$time, exp(-H$time), col='red')
Kaplan-Meier¶
A more commonly used estimate of the survival function is the Kaplan-Meier estimator
\[\hat{S}(t) = \prod_{j: T_j \leq t} \left(1 - \frac{d_j}{n_j}\right).\]
Based on the product formula (for a partition \(V = (v_0 = 0 < v_1 < \dots < v_m = \infty)\) of \([0,\infty)\))
\[S(v_k) = \prod_{j=1}^{k} P\left(T > v_j \mid T > v_{j-1}\right)\]
with \(V = (0, T_1, \dots, T_{\sum_i \delta_i}, \infty)\) and
\[\hat{P}\left(T > T_j \mid T > T_{j-1}\right) = 1 - \frac{d_j}{n_j}.\]
Note the Taylor expansion \(e^{-h} \approx 1 - h \dots\), so it is close to Breslow’s estimator \(e^{-\hat{H}(t)}\).
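A by-hand sketch (mine) of the product \(\prod_{T_j \leq t}(1 - d_j/n_j)\), again reusing `observed` and `censor` from above, for comparison with `survfit` below:

# by-hand Kaplan-Meier
etimes <- sort(unique(observed[censor]))
d <- sapply(etimes, function(t) sum(observed == t & censor))
n_at_risk <- sapply(etimes, function(t) sum(observed >= t))
S_KM <- cumprod(1 - d / n_at_risk)
fit <- survfit(Surv(observed, censor) ~ 1)
head(cbind(by_hand = S_KM, survfit = summary(fit, times = etimes)$surv))   # columns should agree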
plot(survfit(Surv(observed, censor) ~ 1), lwd=4)
lines(H$time, exp(-H$hazard), col='yellow', lwd=2)
plot(survfit(Surv(observed, censor) ~ 1), lwd=4, xlim=c(2,3), ylim=c(0,0.2))
lines(H$time, exp(-H$hazard), col='yellow', lwd=2)
plot(survfit(Surv(observed, censor) ~ 1), lwd=4)
lines(H$time, exp(-H$time), col='red', lwd=2) # truth
Bias when we ignore censoring¶
plot(survfit(Surv(observed, rep(1, length(censor))) ~ 1))   # KM treating every observation as an event
F = survfit(Surv(observed, censor) ~ 1)                     # KM accounting for censoring
lines(F$time, F$surv, col='blue')                           # overlay for comparison
lines(H$time, exp(-H$time), col='red')                      # true survival exp(-t)
Comparing populations¶
Suppose we have two populations from which we observe right-censored survival times \((O_{1,i},\delta_{1,i}), 1 \leq i \leq n_1\) and \((O_{2,i}, \delta_{2,i}), 1 \leq i \leq n_2\).
Can we tell if their underlying survival times (some of which we have not observed) come from the same distribution?
Null hypothesis \(H_0: F_1 = F_2\) where \(T_{1,i} \overset{IID}{\sim} F_1\) and \(T_{2,i} \overset{IID}{\sim} F_2\).
Log-rank test¶
Suppose we have a failure time \(T_j\) (could have come from either sample).
We have two risk sets \(R_1(t) = \left\{i: O_{1,i} \geq t\right\}, R_2(t) = \left\{i: O_{2,i} \geq t\right\}\)
The log-rank test models the interval \([T_j, T_j + dt)\) under \(H_0\).
At time \(T_j\), the total (pooled) risk set is \(R(T_j) = R_1(T_j) \cup R_2(T_j)\), with cardinality \(|R(T_j)|\).
Contingency table¶
We construct a \(2 \times 2\) table under the null model:
| Sample | Failed | Survived | Total |
|---|---|---|---|
| 1 | \(d_{1,j}\) | \(\lvert R_1(T_j)\rvert - d_{1,j}\) | \(\lvert R_1(T_j)\rvert\) |
| 2 | \(d_{2,j}\) | \(\lvert R_2(T_j)\rvert - d_{2,j}\) | \(\lvert R_2(T_j)\rvert\) |
| Total | \(d_j = d_{1,j} + d_{2,j}\) | \(\lvert R(T_j)\rvert - d_j\) | \(\lvert R(T_j)\rvert\) |
Under \(H_0\) the \(d_j\) deaths at time \(T_j\) should be distributed at random between the two samples, in proportion to the sizes of their risk sets.
The expected number of deaths under \(H_0\) in Sample 1 at time \(T_j\) is
\[E_{1,j} = d_j \cdot \frac{|R_1(T_j)|}{|R(T_j)|}.\]
The law of \(d_{1,j} \mid |R_1(T_j)|, |R(T_j)|, d_j\) is Hypergeometric, with variance \(V_{1,j}\) say.
Summing over \(j\), the log-rank statistic
\[Z = \frac{\sum_j \left(d_{1,j} - E_{1,j}\right)}{\left(\sum_j V_{1,j}\right)^{1/2}}\]
is approximately \(N(0,1)\) under \(H_0\).
This is the Mantel-Haenszel test applied to this collection of \(2 \times 2\) tables!
brainCancer = read.table(
'https://www.stanford.edu/class/stats305b/data/BrainCancer.csv',
sep=',',
header=TRUE)
names(brainCancer)
- 'sex'
- 'diagnosis'
- 'loc'
- 'ki'
- 'gtv'
- 'stereo'
- 'status'
- 'time'
plot(survfit(Surv(time, status) ~ sex, data=brainCancer), lty=2:3) # Kaplan-Meier does not drop to 0 -- why?
survdiff(Surv(time, status) ~ sex, data=brainCancer)
Call:
survdiff(formula = Surv(time, status) ~ sex, data = brainCancer)
N Observed Expected (O-E)^2/E (O-E)^2/V
sex=Female 45 15 18.5 0.676 1.44
sex=Male 43 20 16.5 0.761 1.44
Chisq= 1.4 on 1 degrees of freedom, p= 0.2
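A sketch (mine) of the by-hand log-rank computation on the same data, which should reproduce, up to rounding, the Chisq reported by survdiff above:

# by-hand log-rank: (sum of observed - expected in sample 1)^2 / (sum of hypergeometric variances)
with(brainCancer, {
  etimes <- sort(unique(time[status == 1]))
  O_E <- V <- 0
  for (t in etimes) {
    atrisk <- time >= t
    n1 <- sum(atrisk & sex == 'Female'); n <- sum(atrisk)
    d1 <- sum(time == t & status == 1 & sex == 'Female')
    d  <- sum(time == t & status == 1)
    O_E <- O_E + d1 - d * n1 / n
    if (n > 1) V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
  }
  O_E^2 / V                               # chi-square statistic on 1 degree of freedom
})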