A very brief description of the data-generating mechanism and models.
This material forms a large part of a full course (EPI262).
Assuming no censoring, we observe: \((T_i, X_i)\) with \(T_i\) an event / failure time and \(X_i\) are explanatory covariates.
With (right) censoring, we observe \((O_i, \delta_i, X_i)\) where \(O_i\) is a time (either an event time or a censoring time) and \(\delta_i\) denotes whether the time is a true event or not.
Goal: understand \(F_{X_i} = {\cal L}(T_i|X_i)\) where \(T_i\) are the true event times.
Therneau & Grambsch: Modeling Survival Data
Kalbfleisch & Prentice: The Statistical Analysis of Failure Time Data
Natural summaries of \(F_{X_i}\) include the mean residual life
\[ E_{F_{X_i}}(T - a \mid T \geq a) \]
and the median residual life
\[ \text{median}_{F_{X_i}}(T - a \mid T \geq a). \]
No natural loss function as in linear / logistic regression.
If we can model / estimate \({\cal L}(T|X)\) then we can derive estimates of above quantities.
Tests of equality of distribution functions are common rather than tests about means / medians.
\[ \begin{aligned} H([t,t+dt)) &= P_F(T < t+dt|T \geq t) \\ \end{aligned} \]
\[ \begin{aligned} H([t,t+dt)) \approx h(t) \; dt \end{aligned} \]
or
\[ \begin{aligned} h(t) &= \frac{f(t)}{S(t)} \\ &= - \frac{d}{dt} \log S(t) \end{aligned} \]
with \(S(t) = \bar{F}(t) = 1 - F(t)\)
\[ H(t) = \int_{[0,t]} h(s) \; ds \]
\[ S(t) = \exp(-H(t)). \]
\[ h(t) = \begin{cases} \frac{P(T=t)}{\sum_{s \geq t} P(T=s)} \leq 1 & \text{if $t$ is in the support} \\ 0 & \text{if $t$ is not in the support} \end{cases} \]
\[ H(t) = \sum_{j:T_j \leq t} h(T_j) \]
\[ S(t) = \exp(-H(t)). \]
\[ \tilde{H}(t) = \sum_{j:T_j \leq t} -\log(1 - h(T_j)) \]
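The relation \(\tilde{H}(t) = \sum_{j:T_j \leq t} -\log(1 - h(T_j))\) inverts exactly in the discrete case: \(\exp(-\tilde{H}(t)) = \prod_{T_j \leq t}(1 - h(T_j)) = P(T > t)\). A quick numerical check (sketched in Python as a language-neutral illustration, with a made-up three-point law):

```python
import math

# a small discrete law on support {1, 2, 3}; the probabilities are illustrative
p = {1: 0.2, 2: 0.5, 3: 0.3}
support = sorted(p)

def hazard(t):
    # h(t) = P(T = t) / P(T >= t)
    tail = sum(p[s] for s in support if s >= t)
    return p.get(t, 0.0) / tail if tail > 0 else 0.0

def H_tilde(t):
    # cumulative hazard using -log(1 - h), exact for discrete laws
    return sum(-math.log(1.0 - hazard(s)) for s in support if s <= t)

def survival(t):
    return sum(p[s] for s in support if s > t)

# exp(-H_tilde(t)) recovers S(t) = P(T > t) exactly
# (skip the last support point, where h = 1 and log(1 - h) blows up)
for t in support[:-1]:
    assert abs(math.exp(-H_tilde(t)) - survival(t)) < 1e-12
```

Note that the simpler cumulative sum \(H(t) = \sum h(T_j)\) only recovers \(S\) approximately when the \(h(T_j)\) are small.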
Many models in survival analysis model the hazard function (or cumulative version).
The hazard function just has to be non-negative (and integrate to \(\infty\) eventually…).
\[ f_T(t) = \frac{1}{\lambda} e^{-t/\lambda} 1_{[0,\infty)}(t) \]
\[ h(t) = \frac{1}{\lambda} \]
and
\[ H(t) = t / \lambda. \]
\[ \hat{H}(t) = \sum_{j:T_j \leq t} \frac{1}{ |\left\{k: T_k \geq T_j\right\}|} = \sum_{j:T_j \leq t} \frac{1}{ |R(T_j)|} \]
where \(R(t) = \left\{j:T_j \geq t \right\}\) is the risk set at time \(t\).
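A from-scratch sketch of this estimator (in Python, as a language-neutral numerical check; it assumes no censoring and no ties, so the risk set at the \(j\)-th smallest time has \(n - j + 1\) members):

```python
import numpy as np

def nelson_aalen(times):
    # sort the (uncensored, untied) event times; at the j-th smallest time,
    # |R(T_j)| = n - j + 1 subjects are still at risk
    times = np.sort(np.asarray(times, dtype=float))
    n = len(times)
    increments = 1.0 / (n - np.arange(n))
    return times, np.cumsum(increments)

t, H = nelson_aalen([2.0, 1.0, 3.0])
# increments are 1/3, 1/2, 1/1, so H = (1/3, 5/6, 11/6)
```

This is the same quantity `basehaz` returns (up to handling of ties and censoring) in the R examples that follow.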
library(survival)
events = rexp(100)
censor = rep(1, 100)
H = basehaz(coxph(Surv(events, censor)~1))
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)
Asymptotic properties can be analyzed using techniques from point processes in which \(H\) is the “cumulative intensity”.
Fluctuation process
\[ (\hat{H}(t) - H(t))_{t \geq 0} \]
can be analyzed using martingale techniques to provide pointwise confidence intervals for \(H\).
\[ \widehat{Var}(\hat{H}(t)) = \sum_{j:T_j \leq t} \frac{1}{|R(T_j)|^2} \]
\[ \exp(-\hat{H}(t)) \pm 1.96 \cdot \exp\left(-\hat{H}(t)\right) \sqrt{\sum_{j:T_j \leq t} \frac{1}{|R(T_j)|^2}} \]
For a partition \(0 = s_0 < s_1 < \dots < s_m = t\),
\[ \widehat{H}(t) = \sum_{j=0}^{m-1} \left[\widehat{H}(s_{j+1}) - \widehat{H}(s_j)\right] \]
and
\[ \widehat{H}(s_{j+1}) - \widehat{H}(s_j) = \frac{\sum_{i=1}^n 1_{[s_j,s_{j+1})}(T_i)}{|R(s_j)|} \]
\[\widehat{H}(s_{j+1}) - \widehat{H}(s_j) \biggl| |R(s_j)| \sim \frac{1}{|R(s_j)|} \text{Poisson}(|R(s_j)| h(s_j) (s_{j+1}-s_j)) \]
leads to (conditional) variance above
\[ \frac{h(s_j) (s_{j+1}-s_j)}{|R(s_j)|} \]
\[ \int_{[0,t]} \frac{h(s)}{|R(s)|} \; ds \]
\[f \mapsto \int_{[0,t]} f(s) h(s)\; ds \approx \int_{[0,t]} f(s) \hat{H}(ds) \]
\[ \int_{[0,t]} f(s) h(s)\; ds \approx \sum_{j:T_j \leq t} f(T_j) \frac{1}{|R(T_j)|} \]
\[ \sum_{j:T_j \leq t} \frac{1}{|R(T_j)|^2}. \]
# a larger sample
events = rexp(1000)
censor = rep(1, 1000)
H = basehaz(coxph(Surv(events, censor)~1))
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)
\[ h_{\beta}(t;x) = \exp(x^T\beta) \cdot h_0(t) \]
with baseline hazard rate \(h_0\).
Allows nonparametric model for \(h_0\).
Example of a semiparametric model.
We will come back to Cox model shortly.
Assumes \(T_i = \theta(X_i^T\beta)^{-1} \cdot Z_i\) where the law of \(Z_i\) does not depend on any parameters.
It is not hard to see this corresponds to a hazard of the form
\[ h_{\beta}(t;x) = \theta(x^T\beta) \cdot h_0(\theta(x^T\beta) t) \]
with \(h_0\) the hazard function of \(Z_i\).
\[ \log(T_i) = -\log(\theta(X_i^T\beta)) + \log Z_i \]
which, with \(\theta(\eta) = e^{-\eta}\), becomes
\[ \log(T_i) = X_i^T\beta + \log Z_i \]
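The log-linear form makes the AFT model look like ordinary regression with a non-Gaussian error. A sketch (in Python, with simulated data matching the `survreg` Weibull simulation used in these notes) checking that least squares on \(\log T\) recovers \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(size=n)
Z = rng.exponential(size=n)
# mimic the survreg simulation: T = Z^{1/2} * exp(2 X - 1),
# so log T = (2 X - 1) + (1/2) log Z is linear in X
T = np.sqrt(Z) * np.exp(2 * X - 1)

# ordinary least squares of log T on X recovers the slope beta = 2;
# the intercept absorbs -1 + E[log Z]/2 (E[log Z] = -Euler's constant for Exp(1))
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, np.log(T), rcond=None)
```

Unlike least squares, `survreg` maximizes the exact likelihood of the error law and, crucially, still works when some \(T_i\) are censored.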
##
## Call:
## survreg(formula = Surv(events, censor) ~ 1, dist = "weibull")
## Value Std. Error z p
## (Intercept) 0.0603 0.0326 1.85 0.065
## Log(scale) -0.0194 0.0248 -0.79 0.432
##
## Scale= 0.981
##
## Weibull distribution
## Loglik(model)= -1052 Loglik(intercept only)= -1052
## Number of Newton-Raphson Iterations: 6
## n= 1000
X = runif(1000)
censor = rep(1,1000)
events = rexp(1000)^0.5 * exp(2*X-1)
M = survreg(Surv(events, censor) ~ X, dist='weibull')
##
## Call:
## survreg(formula = Surv(events, censor) ~ X, dist = "weibull")
## Value Std. Error z p
## (Intercept) -1.0071 0.0315 -32.0 <2e-16
## X 2.0110 0.0540 37.2 <2e-16
## Log(scale) -0.6994 0.0247 -28.4 <2e-16
##
## Scale= 0.497
##
## Weibull distribution
## Loglik(model)= -585.6 Loglik(intercept only)= -982.8
## Chisq= 794.41 on 1 degrees of freedom, p= 8.9e-175
## Number of Newton-Raphson Iterations: 6
## n= 1000
The Weibull model also estimates a scale parameter (`Log(scale)`) in the regression…
##
## Call:
## survreg(formula = Surv(events, censor) ~ X, dist = "exponential")
## Value Std. Error z p
## (Intercept) -1.1134 0.0628 -17.7 <2e-16
## X 1.9812 0.1089 18.2 <2e-16
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -874 Loglik(intercept only)= -1035.5
## Chisq= 323.06 on 1 degrees of freedom, p= 3.1e-72
## Number of Newton-Raphson Iterations: 4
## n= 1000
Let’s imagine a lab running experiments on the development of cancer in mice after exposure to different levels of a carcinogen.
Mice are born on day \(B\), exposed to carcinogen \(X\) and their symptoms are detected at time \(D\) yielding event time \(T=D-B\).
Funding for experiment was only guaranteed up to end date \(E=\)3/31/2020.
Not all mice had symptoms at end date \(E\).
For those mice who have presented symptoms, we observe the event time \(O=D-B=T\).
For the others, we observe \(O=E-B\) and only know that their symptoms could only present later than \(O\) in their lifetime.
This is called right-censoring: we only know the time of interest lies in some interval \([O,\infty)\).
With right-censored data, observations are
\[ (O_i, \delta_i, X_i) \]
with \(O_i\) the observed “time” for subject \(i\) and \(\delta_i=1\) denotes the time of interest was observed, \(\delta_i=0\) denotes this observation was right-censored.
In order to fit model and carry out inference, we usually must assume censoring is non-informative.
For non-informative censoring, the likelihood is
\[ \left(\prod_{i:\delta_i=1} f_{\beta}(O_i;X_i) \right) \left(\prod_{i:\delta_i=0} S_{\beta}(O_i;X_i) \right) \]
\[ \prod_{i} h_{\beta}(O_i;X_i)^{\delta_i} S_{\beta}(O_i;X_i) \]
\[ f(t) = h(t) S(t). \]
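For a constant hazard \(h(t) = \theta\), the censored likelihood \(\prod_i \theta^{\delta_i} e^{-\theta O_i}\) has a closed-form maximizer, which gives a convenient numerical check of the likelihood above (a Python sketch with simulated data; the hedge is that this is the exponential special case, not the general model):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
theta = 2.0                          # true constant hazard
T = rng.exponential(1 / theta, n)    # latent event times
C = rng.exponential(1.0, n)          # non-informative censoring times
O = np.minimum(T, C)
delta = (T <= C).astype(float)

# log-likelihood: sum(delta) * log(theta) - theta * sum(O),
# maximized in closed form at
# theta_hat = (number of events) / (total observed time at risk)
theta_hat = delta.sum() / O.sum()
```

Note that censored observations contribute only through \(S_\theta(O_i) = e^{-\theta O_i}\), i.e. through the denominator `O.sum()`, exactly as the likelihood factorization says.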
Tcens = rexp(length(events)) * 1.5
observed = pmin(events, Tcens)
censor = (events <= Tcens)
M3 = survreg(Surv(observed, censor) ~ X, dist='weibull')
summary(M3)
##
## Call:
## survreg(formula = Surv(observed, censor) ~ X, dist = "weibull")
## Value Std. Error z p
## (Intercept) -1.0243 0.0370 -27.7 <2e-16
## X 2.0303 0.0723 28.1 <2e-16
## Log(scale) -0.7227 0.0305 -23.7 <2e-16
##
## Scale= 0.485
##
## Weibull distribution
## Loglik(model)= -341.6 Loglik(intercept only)= -624.1
## Chisq= 565.11 on 1 degrees of freedom, p= 6.5e-125
## Number of Newton-Raphson Iterations: 6
## n= 1000
\[ \hat{H}(t) = \sum_{j: \delta_j = 1, T_j \leq t} \frac{d_j}{n_j} \]
\(R(t) = \left\{j: O_j \geq t\right\}\) is the risk set
\(d_j\) the number of events at \(T_j\) (1 if no ties possible)
\(n_j = |R(T_j)|\) the number at risk just prior to \(T_j\) (i.e. candidates to fail at \(T_j\))
\[ \widehat{Var}(\hat{H}(t))= \sum_{\delta_j=1, T_j \leq t} \frac{d_j}{n_j (n_j - d_j)} \]
\[ \widehat{Var}(\hat{H}(t))= \sum_{\delta_j=1, T_j \leq t} \frac{d_j}{n_j^2} \]
\[ \hat{S}(t) = \exp(-\hat{H}(t)) \]
events = rexp(1000)
Tcens = rexp(1000) * 3
observed = pmin(events, Tcens)
censor = (events < Tcens)
H = basehaz(coxph(Surv(observed, censor) ~ 1))
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)
ignore_censor = rep(1, length(censor))
Hbias = basehaz(coxph(Surv(observed, ignore_censor) ~ 1))
plot(Hbias$time, exp(-Hbias$hazard), type='l')
lines(H$time, exp(-H$time), col='red')
\[ \hat{S}(t) = \prod_{j:\delta_j=1, T_j \leq t} \left(1 - \frac{d_j}{n_j} \right) \]
\[ P(T > v_i) = \prod_{j=1}^{i} P(T > v_{j} | T > v_{j-1}) \]
with \(V=(0,T_1,\dots,T_{\sum_i \delta_i}, \infty)\) and
\[ \widehat{ P}(T > T_{j} | T > T_{j-1}) = 1 - \frac{d_j}{n_j}. \]
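A from-scratch sketch of the product-limit (Kaplan-Meier) estimator built from these conditional factors (Python, toy data; a censored observation shrinks the risk set but contributes no factor):

```python
import numpy as np

def kaplan_meier(O, delta):
    # product-limit estimate of S at each distinct event time (ties allowed)
    O = np.asarray(O, dtype=float)
    delta = np.asarray(delta, dtype=int)
    S, out = 1.0, []
    for t in np.unique(O[delta == 1]):
        n_t = int(np.sum(O >= t))                    # at risk just before t
        d_t = int(np.sum((O == t) & (delta == 1)))   # events at t
        S *= 1.0 - d_t / n_t
        out.append((t, S))
    return out

# toy data: events at 6 and 10, one observation censored at 7
km = kaplan_meier([6.0, 7.0, 10.0], [1, 0, 1])
# S(6) = 1 - 1/3 = 2/3; S(10) = (2/3) * (1 - 1/1) = 0
```

This is the estimate `survfit` plots in the R code above and below.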
plot(survfit(Surv(observed, censor) ~ 1), lwd=4, xlim=c(2,3), ylim=c(0,0.2))
lines(H$time, exp(-H$hazard), col='yellow', lwd=2)
plot(survfit(Surv(observed, censor) ~ 1), lwd=4)
lines(H$time, exp(-H$time), col='red', lwd=2) # truth
plot(survfit(Surv(observed, rep(1, length(censor))) ~ 1))
F = survfit(Surv(observed, censor) ~ 1)
lines(H$time, exp(-H$time), col='red')
Suppose we have two populations from which we observe right-censored survival times \((O_{1,i},\delta_{1,i}), 1 \leq i \leq n_1\) and \((O_{2,i}, \delta_{2,i}), 1 \leq i \leq n_2\).
Can we tell if their underlying survival times (some of which we have not observed) come from the same distribution?
Null hypothesis \(H_0: F_1 = F_2\) where \(T_{1,i} \overset{IID}{\sim} F_1\) and \(T_{2,i} \overset{IID}{\sim} F_2\).
Suppose we have a failure time \(T_j\) (could have come from either sample).
We have two risk sets \(R_1(t) = \left\{i: O_{1,i} \geq t\right\}, R_2(t) = \left\{i: O_{2,i} \geq t\right\}\)
The log-rank test models the interval \([T_j, T_j+dt)\) under \(H_0\).
At time \(T_j\), the total (pooled) risk set is \(R(T_j) = R_1(T_j) \cup R_2(T_j)\) with cardinality \(|R(T_j)|\).
| Sample | Failed | Survived | Total |
|--------|--------|----------|-------|
| 1 | \(d_{1,j}\) | \(\vert R_1(T_j)\vert - d_{1,j}\) | \(\vert R_1(T_j)\vert\) |
| 2 | \(d_{2,j}\) | \(\vert R_2(T_j)\vert - d_{2,j}\) | \(\vert R_2(T_j)\vert\) |
| Total | \(d_j = d_{1,j} + d_{2,j}\) | \(\vert R(T_j)\vert - d_j\) | \(\vert R(T_j)\vert\) |
Under \(H_0\), the expected number of failures in sample 1 at \(T_j\) is
\[ \frac{d_j \cdot |R_1(T_j)|}{|R(T_j)|} \]
Law of \(d_{1,j} | |R_1(T_j)|, |R(T_j)|, d_j\) is Hypergeometric.
Summing over \(j\)
\[ \frac{\sum_{j: d_j \neq 0} \left(d_{1,j} - \frac{d_j |R_1(T_j)|}{|R(T_j)|}\right)} {\sqrt{\sum_{j: d_j \neq 0} \text{Var}(d_{1,j} \mid |R_1(T_j)|, |R(T_j)|, d_j) }} \overset{H_0}{\Rightarrow} N(0,1) \]
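The hypergeometric mean and variance at each event time make this directly computable. A from-scratch sketch in Python (`logrank_chisq` is a hypothetical helper; R's `survdiff` performs the same computation):

```python
import numpy as np

def logrank_chisq(O1, d1, O2, d2):
    # two-sample log-rank: hypergeometric mean / variance
    # contribution at each distinct event time
    O = np.concatenate([O1, O2])
    d = np.concatenate([d1, d2])
    grp = np.concatenate([np.zeros(len(O1)), np.ones(len(O2))])
    num = var = 0.0
    for t in np.unique(O[d == 1]):
        at_risk = O >= t
        n = at_risk.sum()
        n1 = (at_risk & (grp == 0)).sum()
        dt = ((O == t) & (d == 1)).sum()
        d1t = ((O == t) & (d == 1) & (grp == 0)).sum()
        num += d1t - dt * n1 / n                       # observed - expected
        if n > 1:
            var += dt * (n1 / n) * (1 - n1 / n) * (n - dt) / (n - 1)
    return num ** 2 / var

chi2 = logrank_chisq(np.array([1.0, 2.0]), np.array([1, 1]),
                     np.array([3.0, 4.0]), np.array([1, 1]))
# a hand computation for this tiny example gives 49/17
```

Squaring the \(N(0,1)\) statistic gives the \(\chi^2_1\) form that `survdiff` reports.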
brainCancer = read.table(
'https://www.stanford.edu/class/stats305b/data/BrainCancer.csv',
sep=',',
header=TRUE)
names(brainCancer)
## [1] "sex" "diagnosis" "loc" "ki" "gtv" "stereo"
## [7] "status" "time"
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Female 45 15 18.5 0.676 1.44
## sex=Male 43 20 16.5 0.761 1.44
##
## Chisq= 1.4 on 1 degrees of freedom, p= 0.2
\[ h_{\beta}(t;x) = \exp(x^T\beta) \cdot h_0(t) \]
with baseline hazard rate \(h_0\).
Given a death has occurred at time \(t\), what are the relative chances that individual \(i\) died rather than individual \(k\)?
This should be the ratio \[ \frac{h_{\beta}(t;X_i)}{h_{\beta}(t;X_k)} \propto \exp((X_i-X_k)^T\beta) \]
Baseline hazard \(h_0(t)\) cancels!
\[ \frac{\exp(X_i^T\beta)}{\sum_{j \in R(t)} \exp(X_j^T\beta)} \]
\[ \prod_{i: \delta_i=1} \left(\frac{\exp(X_i^T\beta)}{\sum_{j \in R(T_i)} \exp(X_j^T\beta)} \right) = \prod_{i: \delta_i=1} \left(\frac{\exp(X_i^T\beta)}{\sum_{j: O_j \geq O_i} \exp(X_j^T\beta)} \right) \]
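A minimal sketch of the negative partial log-likelihood (Python, assuming no tied event times; `coxph` maximizes this same criterion in R). A convenient sanity check: at \(\beta=0\) each event contributes \(\log |R(O_i)|\).

```python
import numpy as np

def neg_partial_loglik(beta, X, O, delta):
    # negative Cox partial log-likelihood, assuming no tied event times
    eta = X @ beta
    val = 0.0
    for i in np.where(delta == 1)[0]:
        risk = O >= O[i]                        # risk set at O_i
        val += np.log(np.exp(eta[risk]).sum()) - eta[i]
    return val

# sanity check at beta = 0: risk sets have sizes 3, 2, 1
O = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 1, 1])
X = np.array([[0.5], [-0.2], [0.1]])
val0 = neg_partial_loglik(np.zeros(1), X, O, delta)
# = log 3 + log 2 + log 1 = log 6
```

Note the baseline hazard never appears: only the ordering of the observed times and the risk sets matter.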
When there is no censoring, this really is the conditional likelihood.
Let’s just consider \(n=2\) with no censoring.
Let \(R(T_1,T_2)\) denote the ranking of our two times. Under the Cox model its distribution depends on \(\beta\), more precisely on \(\eta=(\eta_1,\eta_2)=(X_1'\beta,X_2'\beta)\).
Let’s compute (working in density context)
\[ \begin{aligned} P_{\beta}(T_1<T_2) &= \int_0^{\infty} \left[\int_{t_1}^{\infty} e^{\eta_2} h_0(t_2) e^{-e^{\eta_2} H_0(t_2)} \; dt_2 \right] \\ & \qquad \cdot e^{\eta_1} h_0(t_1) e^{-e^{\eta_1} H_0(t_1)} \; dt_1. \end{aligned} \]
\[ \int_{t_1}^{\infty} e^{\eta_2} h_0(t_2) e^{-e^{\eta_2} H_0(t_2)} \; dt_2 = e^{-e^{\eta_2} H_0(t_1)}. \]
\[ \int_0^{\infty} e^{\eta_1} h_0(t_1) e^{-(e^{\eta_1}+e^{\eta_2}) H_0(t_1)} \; dt_1 = \frac{e^{\eta_1}}{e^{\eta_1}+e^{\eta_2}} \]
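The result can be checked by Monte Carlo (a Python sketch; with \(h_0 \equiv 1\) the Cox model makes \(T_i\) exponential with rate \(e^{\eta_i}\), and the \(\eta\) values here are arbitrary choices for illustration):

```python
import numpy as np

# with baseline hazard h0 = 1, T_i ~ Exponential(rate = exp(eta_i));
# the calculation above predicts
# P(T1 < T2) = e^{eta1} / (e^{eta1} + e^{eta2})
rng = np.random.default_rng(2)
eta1, eta2 = 0.7, -0.3
n = 200000
T1 = rng.exponential(np.exp(-eta1), size=n)  # numpy takes scale = 1/rate
T2 = rng.exponential(np.exp(-eta2), size=n)
prop = (T1 < T2).mean()
pred = np.exp(eta1) / (np.exp(eta1) + np.exp(eta2))
```

The simulated proportion matches the softmax-type formula, which is exactly the \(n=2\) partial-likelihood factor.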
brainCancer = read.table(
'https://www.stanford.edu/class/stats305b/data/BrainCancer.csv',
sep=',',
header=TRUE)
library(survival)
M = coxph(Surv(time, status) ~ sex, data=brainCancer)
summary(M) # why no intercept?
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## n= 88, number of events= 35
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.4077 1.5033 0.3420 1.192 0.233
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.503 0.6652 0.769 2.939
##
## Concordance= 0.565 (se = 0.045 )
## Likelihood ratio test= 1.44 on 1 df, p=0.2
## Wald test = 1.42 on 1 df, p=0.2
## Score (logrank) test = 1.44 on 1 df, p=0.2
## sexMale
## 0.3420042
##
## HG glioma LG glioma Meningioma Other
## 22 9 42 14
subs = !is.na(brainCancer$diagnosis)
M = coxph(Surv(time, status) ~ sex, data=brainCancer, subset=subs) # one missing diagnosis
M2 = coxph(Surv(time, status) ~ sex + diagnosis, data=brainCancer, subset=subs)
anova(M, M2)
## Analysis of Deviance Table
## Cox model: response is Surv(time, status)
## Model 1: ~ sex
## Model 2: ~ sex + diagnosis
## loglik Chisq Df P(>|Chi|)
## 1 -136.54
## 2 -125.68 21.728 3 7.432e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## coxph(formula = Surv(time, status) ~ sex + diagnosis, data = brainCancer,
## subset = subs)
##
## n= 87, number of events= 35
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.0353 1.0359 0.3617 0.098 0.9223
## diagnosisLG glioma -1.1108 0.3293 0.5632 -1.972 0.0486 *
## diagnosisMeningioma -1.9822 0.1378 0.4382 -4.524 6.07e-06 ***
## diagnosisOther -1.1891 0.3045 0.5113 -2.325 0.0200 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.0359 0.9653 0.50990 2.1046
## diagnosisLG glioma 0.3293 3.0368 0.10919 0.9931
## diagnosisMeningioma 0.1378 7.2586 0.05837 0.3252
## diagnosisOther 0.3045 3.2842 0.11177 0.8295
##
## Concordance= 0.749 (se = 0.038 )
## Likelihood ratio test= 23.51 on 4 df, p=1e-04
## Wald test = 23.62 on 4 df, p=1e-04
## Score (logrank) test = 29.69 on 4 df, p=6e-06
\[ \sum_{i:\delta_i=1} \left(\frac{\sum_{j: O_j \geq O_i} X_j \exp(X_j^T\beta)}{\sum_{j: O_j \geq O_i} \exp(X_j^T\beta)} - X_i \right) \]
Each first term is an expectation of \(X\) w.r.t. a distribution defined over the risk set with weights \(\exp(X_j^T\beta)\).
Could write
\[ \sum_{i:\delta_i=1} \left( E_{\beta,i}[X] - X_i \right) \]
Suppose \(X\) is a binary indicator (i.e. sex in our brainCancer data) with, say \(X_i=1\) denoting female.
At an event time \(T_i\), \(X_i\) is the indicator that the death was a female death at that time.
Under \(H_0:\beta=0\), \(E_{0,i}[X]\) is the expected number of female deaths at time \(T_i\):
\[ E_{0,i}[X] = \frac{d_i \cdot \sum_{j:O_j \geq O_i} X_j}{ \sum_{j:O_j \geq O_i} 1 } = \frac{d_i \cdot \# R_1(T_i)}{\# R(T_i)} \]
Standard optimality of score / LRT indicates that the log-rank tests will perform well against alternatives that are proportional hazards.
This connection to the Cox score provides a clearer picture of what the (low-dimensional) log-rank test is actually testing.
\[ \frac{\sum_{i:\delta_i=1} W(T_i)(X_i - E_{0,i}[X])}{\left(\sum_{i:\delta_i=1} W(T_i)^2 \text{Var}_{0,i}[X]\right)^{1/2}} \]
\[ \left(\sum_{i:\delta_i=1} W(T_i)(X_i - E_{0,i}[X])\right)^T \left(\sum_{i:\delta_i=1} W(T_i)^2 \text{Var}_{0,i}[X]\right)^{-1} \left(\sum_{i:\delta_i=1} W(T_i)(X_i - E_{0,i}[X])\right) \overset{H_0}{\approx} \chi^2_{K-1}. \]
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Female 45 15 18.5 0.676 1.44
## sex=Male 43 20 16.5 0.761 1.44
##
## Chisq= 1.4 on 1 degrees of freedom, p= 0.2
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## n= 88, number of events= 35
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.4077 1.5033 0.3420 1.192 0.233
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.503 0.6652 0.769 2.939
##
## Concordance= 0.565 (se = 0.045 )
## Likelihood ratio test= 1.44 on 1 df, p=0.2
## Wald test = 1.42 on 1 df, p=0.2
## Score (logrank) test = 1.44 on 1 df, p=0.2
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = brainCancer,
## rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Female 45 11.3 14.4 0.661 1.73
## sex=Male 43 16.0 13.0 0.735 1.73
##
## Chisq= 1.7 on 1 degrees of freedom, p= 0.2
## Call:
## survdiff(formula = Surv(time, status) ~ diagnosis, data = brainCancer)
##
## n=87, 1 observation deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## diagnosis=HG glioma 22 17 5.69 22.4456 27.5457
## diagnosis=LG glioma 9 4 3.75 0.0167 0.0188
## diagnosis=Meningioma 42 9 20.26 6.2583 15.2537
## diagnosis=Other 14 5 5.30 0.0165 0.0196
##
## Chisq= 29.7 on 3 degrees of freedom, p= 2e-06
## Call:
## coxph(formula = Surv(time, status) ~ diagnosis, data = brainCancer)
##
## n= 87, number of events= 35
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## diagnosisLG glioma -1.1072 0.3305 0.5620 -1.970 0.0488 *
## diagnosisMeningioma -1.9937 0.1362 0.4221 -4.723 2.32e-06 ***
## diagnosisOther -1.1872 0.3051 0.5109 -2.324 0.0201 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## diagnosisLG glioma 0.3305 3.026 0.10985 0.9943
## diagnosisMeningioma 0.1362 7.342 0.05955 0.3115
## diagnosisOther 0.3051 3.278 0.11207 0.8305
##
## Concordance= 0.725 (se = 0.044 )
## Likelihood ratio test= 23.5 on 3 df, p=3e-05
## Wald test = 23.61 on 3 df, p=3e-05
## Score (logrank) test = 29.68 on 3 df, p=2e-06
## Call:
## survdiff(formula = Surv(time, status) ~ diagnosis, data = brainCancer,
## rho = 1)
##
## n=87, 1 observation deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## diagnosis=HG glioma 22 14.13 4.73 18.71252 27.5305
## diagnosis=LG glioma 9 2.72 2.88 0.00869 0.0122
## diagnosis=Meningioma 42 6.53 15.41 5.11289 14.8544
## diagnosis=Other 14 3.86 4.23 0.03263 0.0474
##
## Chisq= 29.3 on 3 degrees of freedom, p= 2e-06
\[ \widehat{H}_B(t;\beta) = \sum_{i:\delta_i=1, T_i \leq t} \frac{d_i}{\sum_{j \in R(T_i)} \exp(X_j^T\beta)} \]
\[ S_{\widehat{\beta}}(t;X_{new})=\exp\left(-\exp(X_{new}^T\widehat{\beta}) \cdot \widehat{H}_B(t;\widehat{\beta}) \right) \]
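A sketch of the Breslow estimator (Python; `breslow` is a hypothetical helper, not a package function). A useful check of the formula: at \(\beta = 0\) it reduces to the Nelson-Aalen increments \(d_j/n_j\).

```python
import numpy as np

def breslow(beta, X, O, delta):
    # Breslow baseline cumulative hazard at each distinct event time
    w = np.exp(X @ beta)
    H, out = 0.0, []
    for t in np.unique(O[delta == 1]):
        d_t = np.sum((O == t) & (delta == 1))
        H += d_t / w[O >= t].sum()      # risk set weighted by exp(X' beta)
        out.append((t, H))
    return out

# at beta = 0, weights are all 1 and the increments are d_j / n_j
O = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 0, 1, 1])
X = np.zeros((4, 1))
est = breslow(np.zeros(1), X, O, delta)
# increments 1/4, 1/2, 1/1 at times 1, 3, 4
```

Plugging \(\widehat{H}_B\) into \(\exp(-\exp(X_{new}^T\widehat{\beta})\,\widehat{H}_B(t;\widehat{\beta}))\) then gives the predicted survival curve for a new covariate vector.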
Suppose our likelihood is parameterized by \((\beta,\theta)\) but our interest is focused on \(\beta\).
Let \(\ell\) be the negative log-likelihood. The profile likelihood for parameter \(\beta\) is
\[ \ell_{\text{P}}(\beta) = \inf_{\theta} \ell(\beta,\theta) \]
with minimizer \(\widehat{\beta}_{\text{P}}\)
\[ \widehat{\beta}_{\text{P}} = \widehat{\beta}_{MLE}! \]
It is tempting to say \[ \widehat{\beta}_{\text{P}} - \beta^* \overset{D}{\approx} N\left(0, \nabla^2 \ell_{\text{P}}(\widehat{\beta}_{\text{P}})^{-1}\right) \]
But because \(\ell_{\text{P}}\) is defined by optimization it is not really a (negative log-) likelihood.
There certainly are examples when the above is asymptotically correct (OLS regression profiling out \(\sigma^2\), Cox model, …): in general when \(\theta\) is low-dimensional this will be OK.
Cox case is not low-dimensional…
\[ \ell_{\text{P}}(\beta) = \frac{n}{2} \log(\|Y-X\beta\|^2_2) \]
and
\[ \nabla^2 \ell_{\text{P}}(\widehat{\beta}_{\text{P}}) = \frac{n}{\|Y-X\widehat{\beta}_{\text{P}}\|^2_2} X^TX. \]
\[ \prod_{i=1}^n \left(e^{X_i^T\beta} h_i \right)^{\delta_i} \left(\exp\left(-e^{X_i^T\beta} \sum_{j:O_j \leq O_i} h_j\right) \right) \]
\[ \left(\sum_{i=1}^n \delta_i (X_i^T\beta + \log h_i) \right) - \left( \sum_{i=1}^n e^{X_i^T\beta} \sum_{j:O_j \leq O_i} h_j\right) \]
\[ \left(\sum_{i=1}^n \delta_i (X_i^T\beta + \log h_i) \right) - \left( \sum_{j=1}^n h_j \sum_{i:O_i \geq O_j} e^{X_i^T\beta} \right) \]
\[ \left(\sum_{i=1}^n \delta_i (X_i^T\beta + \log h_i) \right) - \left( \sum_{i=1}^n h_i \sum_{j:O_j \geq O_i} e^{X_j^T\beta} \right) \]
\[ \frac{\partial}{\partial h_i}: \frac{\delta_i}{h_i} - \sum_{j:O_j \geq O_i} e^{X_j^T\beta} \]
\[ h_i(\beta) = \frac{\delta_i}{\sum_{j:O_j \geq O_i} e^{X_j^T\beta}}. \]
Plugging in \(h(\beta)\)
Noting that \(\lim_{\delta \to 0} \delta \log \delta = 0\) (i.e. \(0 \log 0 = 0\)) yields first term
\[ \sum_{i=1}^n \delta_i \left(X_i^T\beta - \log\left(\sum_{j:O_j \geq O_i} e^{X_j^T\beta}\right)\right) \]
while the second term becomes
\[ \left(\sum_{i=1}^n \frac{\delta_i}{\sum_{k:O_k \geq O_i} e^{X_k^T\beta}} \sum_{j:O_j \geq O_i} e^{X_j^T\beta} \right) = \sum_{i=1}^n \delta_i. \]
Martingale residuals:
\[ \delta_i - \exp(X_i'\widehat{\beta}) \cdot \widehat{H}_B(O_i;\widehat{\beta}) \]
Large negative values indicate individuals who survived a long time without failing – outliers?
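A from-scratch sketch of these residuals (Python; `martingale_residuals` is a hypothetical helper combining the formula above with the Breslow estimate). A standard check: the residuals sum to zero; the verification here is done at \(\beta = 0\), where the Breslow increments are \(d_t/n_t\) and the cancellation is exact by counting.

```python
import numpy as np

def martingale_residuals(beta, X, O, delta):
    # delta_i - exp(X_i' beta) * H_B(O_i; beta), with H_B the Breslow estimate
    w = np.exp(X @ beta)
    H = np.zeros(len(O))
    for t in np.unique(O[delta == 1]):
        d_t = np.sum((O == t) & (delta == 1))
        H[O >= t] += d_t / w[O >= t].sum()  # increment reaches all O_i >= t
    return delta - w * H

# residuals on random data, evaluated at beta = 0
rng = np.random.default_rng(3)
O = rng.exponential(size=50)
delta = (rng.uniform(size=50) < 0.7).astype(float)
X = rng.normal(size=(50, 1))
r = martingale_residuals(np.zeros(1), X, O, delta)
```

Each event time's increment \(d_t/n_t\) is charged to exactly \(n_t\) subjects, so \(\sum_i \widehat H_B(O_i) = \sum_t d_t = \sum_i \delta_i\) and the residuals cancel.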
Cox-Snell residuals (should behave roughly like a censored sample of unit exponential variables):
\[ -\log \left(S_{\widehat{\beta}}(O_i;X_i)\right) = \exp(X_i'\widehat{\beta}) \cdot \widehat{H}_B(O_i;\widehat{\beta}) \]
## Call:
## coxph(formula = Surv(time, status) ~ sex + ki + diagnosis + gtv,
## data = brainCancer)
##
## n= 87, number of events= 35
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.14985 1.16166 0.36031 0.416 0.67749
## ki -0.05484 0.94664 0.01832 -2.993 0.00277 **
## diagnosisLG glioma -1.16457 0.31206 0.57321 -2.032 0.04219 *
## diagnosisMeningioma -2.20885 0.10983 0.44137 -5.005 5.6e-07 ***
## diagnosisOther -1.53888 0.21462 0.53680 -2.867 0.00415 **
## gtv 0.04190 1.04279 0.02033 2.061 0.03934 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.1617 0.8608 0.57330 2.3538
## ki 0.9466 1.0564 0.91324 0.9813
## diagnosisLG glioma 0.3121 3.2046 0.10146 0.9597
## diagnosisMeningioma 0.1098 9.1053 0.04624 0.2609
## diagnosisOther 0.2146 4.6594 0.07495 0.6146
## gtv 1.0428 0.9590 1.00205 1.0852
##
## Concordance= 0.782 (se = 0.042 )
## Likelihood ratio test= 40.5 on 6 df, p=4e-07
## Wald test = 38.44 on 6 df, p=9e-07
## Score (logrank) test = 46.21 on 6 df, p=3e-08
cox_snell = pmax(brainCancer$status[subs] - resid(M3), 0)
H = basehaz(coxph(Surv(cox_snell, brainCancer$status[subs]) ~ 1))
plot(H$time, H$hazard)
abline(c(0, 1))
Negative log-likelihood expressible in terms of linear predictor \(\eta=X\beta\) \[ \ell(\eta) = \sum_{i:\delta_i=1} \left[ \log \left(\sum_{j:O_j \geq O_i} \exp(\eta_j) \right) - \eta_i\right] \]
Negative log-likelihood is convex in \(\eta\) (log-sum-exp)
We can (easily) regularize!
\[ \text{minimize}_{\beta} \ell(X\beta) + \lambda \|\beta\|_1. \]
## Loading required package: Matrix
## Loaded glmnet 4.0-2
M = coxph(Surv(time, status) ~ ., data=brainCancer)
D = model.frame(M) # treats diagnosis as continuous
X = model.matrix(M)[,-1]
Y = D[,1]
plot(glmnet(X, Y, family='cox'))
For logistic regression we looked at cross-validated Binomial deviance.
Misclassification error is a good metric when we don’t believe our logistic regression model.
For the Cox model, we have a negative (partial) log-likelihood (which we can take to be \(DEV/2\)). What is the analog of misclassification error for Cox?
A Cox model produces scores \(\widehat{\eta}=X\widehat{\beta}\).
Cox model indicates that higher \(\eta\) should yield earlier failure times.
For scores \(\eta\) and times \(T\), a pair is concordant if \(\eta_i < \eta_j\) and \(T_i > T_j\). But we have censored data…
We can compare pairs \((i,j)\) that are both uncensored (i.e. \(\delta_i=\delta_j=1\)).
If \(O_i > O_j\) and \(\delta_j=1\) then we know \(T_i > T_j\) and can check concordance
If \(O_i < O_j\) and \(\delta_i=0\) then we don’t know whether \(T_i > T_j\)…
Leads to \[ \begin{aligned} C(\eta, O, \delta) &= \frac{\sum_{i,j:\delta_j=1} 1_{\{\eta_i < \eta_j\}} 1_{\{O_i > O_j\}} d_j(\delta)}{\sum_{i,j: \delta_j=1} 1_{\{O_i > O_j\}} d_j(\delta)} \\ &= \frac{\sum_{i,j} 1_{\{\eta_i < \eta_j\}} 1_{\{O_i > O_j\}} d_j(\delta)}{\sum_{i,j} 1_{\{O_i > O_j\}} d_j(\delta)} \\ \end{aligned} \]
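A brute-force sketch of this concordance index (Python; \(O(n^2)\) over pairs, whereas `coxph`'s reported concordance uses an efficient algorithm and handles ties more carefully):

```python
import numpy as np

def concordance(eta, O, delta):
    # a pair (i, j) is usable when the earlier time O_j is an observed event
    # and O_i > O_j; it is concordant when the earlier failure has the
    # higher score (eta_j > eta_i)
    conc = total = 0
    n = len(O)
    for j in range(n):
        if delta[j] != 1:
            continue
        for i in range(n):
            if O[i] > O[j]:
                total += 1
                conc += int(eta[i] < eta[j])
    return conc / total

# scores that reverse-order the times are perfectly concordant
C = concordance(np.array([3.0, 2.0, 1.0]),   # eta
                np.array([1.0, 2.0, 3.0]),   # O
                np.array([1, 1, 1]))         # delta
```

A value of 0.5 corresponds to scores with no ranking information, matching the interpretation of the `Concordance` line in the `coxph` summaries above.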
For a categorical variable, the lines cannot cross and should be some baseline function raised to different powers: in category \(j\) \[ S_j(t) = \exp\left(-c_j H(t)\right) = \left(\exp(-H(t))\right)^{c_j} \]
So if \(c_j > c_k\), then \(S_j(t) \leq S_k(t)\) for all \(t\).
Crossing survival curves \(\implies\) proportional hazards cannot hold.
Difficult to produce such plots for models like M2 with more than one categorical predictor.
Need a way to detect departures from proportional hazards model.
Alternative model, allow \(\beta\) to vary with time: \[ h_{\beta}(t;x) = h_0(t) \exp(X^T\beta(t)) \]
Proportional hazards for feature \(j\) is equivalent to \(\beta(t) \equiv \beta_j\)
Schoenfeld residuals: for each event \(T_i,\delta_i=1\) \[ S_i(\beta) = X_i - E_{\beta,i}[X] \]
Total score \[ \sum_{i:\delta_i=1} S_i(\beta). \]
Scaled Schoenfeld residual \[ S_i^*(\beta) = \text{Var}_{\beta,i}[X]^{-1}S_i(\beta) \]
\(S^*_i\) can be related to Taylor series for \(\beta(t)\) in time dependent alternative…
\[ \sum_{i:\delta_i=1} \ell_i(\beta(T_i)) \]
where
\[ \ell_i(\beta(t)) = \left[X_i^T\beta(t) - \log\left(\sum_{j \in R(t)} \exp(X_j^T\beta(t)) \right) \right] \]
Our likelihood depends only on \(\{\beta(T_i): \delta_i=1\}\) so consider perturbations \[ \beta(T_i) = \widehat{\beta} + \zeta_i \]
Further, each \(\zeta_i\) only shows up in \(\ell_i\), and
\[ \begin{aligned} \ell_i(\widehat{\beta}_{MLE} + \zeta_i) - \ell_i(\widehat{\beta}_{MLE}) &= \nabla \ell_i(\widehat{\beta}_{MLE})^T\zeta_i + \frac{1}{2} \zeta_i^T\nabla^2 \ell_i(\widehat{\beta}_{MLE}) \zeta_i + \dots \\ &= S_i^T\zeta_i - \frac{1}{2} \zeta_i^T\text{Var}_{\widehat{\beta}_{MLE},i}[X]\zeta_i + \dots \end{aligned} \]
\[ \widehat{\zeta}_i = S_i^*(\widehat{\beta}_{MLE}) \]
Formally, these are one-step estimators of \(\zeta_i\) from the MLE under \(H_0\)…
Under null, plot of
\[ (T_i, \widehat{\beta}_{MLE,j} + \widehat{\zeta}_{i,j}), \qquad \delta_i=1, 1 \leq j \leq p \]
should look flat if \(\beta_j(t) \equiv \beta_j\).
## Call:
## coxph(formula = Surv(time, status) ~ sex + diagnosis, data = brainCancer,
## subset = subs)
##
## coef exp(coef) se(coef) z p
## sexMale 0.0353 1.0359 0.3617 0.098 0.9223
## diagnosisLG glioma -1.1108 0.3293 0.5632 -1.972 0.0486
## diagnosisMeningioma -1.9822 0.1378 0.4382 -4.524 6.07e-06
## diagnosisOther -1.1891 0.3045 0.5113 -2.325 0.0200
##
## Likelihood ratio test=23.51 on 4 df, p=0.0001003
## n= 87, number of events= 35
To test \(H_0:\beta_j(t) \equiv \beta_j\) one can test whether the slope is zero in such plots. (Variance of slope estimator?)
Yields one degree of freedom tests.
## chisq df p
## sex 0.671 1 0.41
## diagnosis 1.337 3 0.72
## GLOBAL 1.951 4 0.74
## chisq df p
## sex 0.9708 1 0.32
## ki 0.0547 1 0.81
## diagnosis 1.1849 3 0.76
## gtv 0.1370 1 0.71
## GLOBAL 2.5483 6 0.86