A very brief description of the data-generating mechanism and models.
This material forms a large part of a full course (EPI262).
Assuming no censoring, we observe: \((T_i, X_i)\) with \(T_i\) an event / failure time and \(X_i\) are explanatory covariates.
With (right) censoring, we observe \((O_i, \delta_i, X_i)\) where \(O_i\) is a time (either an event time or a censoring time) and \(\delta_i\) denotes whether the time is a true event or not.
Goal: understand \(F_{X_i} = {\cal L}(T_i|X_i)\) where \(T_i\) are the true event times.
Therneau & Grambsch: Modeling Survival Data
Kalbfleisch & Prentice: The Statistical Analysis of Failure Time Data
Natural summaries of \(F_{X_i}\) include the mean residual life
\[ E_{F_{X_i}}(T - a \mid T \geq a) \]
and the median residual life
\[ \text{median}_{F_{X_i}}(T - a \mid T \geq a). \]
No natural loss function as in linear / logistic regression.
If we can model / estimate \({\cal L}(T|X)\) then we can derive estimates of above quantities.
Tests of equality of distribution functions are common rather than tests about means / medians.
\[ \begin{aligned} H([t,t+dt)) &= P_F(T < t+dt|T \geq t) \\ \end{aligned} \]
\[ \begin{aligned} H([t,t+dt)) \approx h(t) \; dt \end{aligned} \]
or
\[ \begin{aligned} h(t) &= \frac{f(t)}{S(t)} \\ &= - \frac{d}{dt} \log S(t) \end{aligned} \]
with \(S(t) = \bar{F}(t) = 1 - F(t)\)
\[ H(t) = \int_{[0,t]} h(s) \; ds \]
\[ S(t) = \exp(-H(t)). \]
\[ h(t) = \begin{cases} \frac{P(T=t)}{\sum_{s \geq t} P(T=s)} \leq 1 & \text{if $t$ is in the support} \\ 0 & \text{if $t$ is not in the support} \end{cases} \]
\[ H(t) = \sum_{j:T_j \leq t} h(T_j) \]
\[ S(t) = \exp(-H(t)). \]
\[ \tilde{H}(t) = \sum_{j:T_j \leq t} -\log(1 - h(T_j)) \]
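The relation \(\tilde{H}(t) = \sum_{j:T_j \leq t} -\log(1 - h(T_j))\) inverts exactly in the discrete case: \(\exp(-\tilde{H}(t)) = \prod_{T_j \leq t}(1 - h(T_j)) = P(T > t)\). A quick numerical check (sketched in Python as a language-neutral illustration, with a made-up three-point law):

```python
import math

# a small discrete law on support {1, 2, 3}; the probabilities are illustrative
p = {1: 0.2, 2: 0.5, 3: 0.3}
support = sorted(p)

def hazard(t):
    # h(t) = P(T = t) / P(T >= t)
    tail = sum(p[s] for s in support if s >= t)
    return p.get(t, 0.0) / tail if tail > 0 else 0.0

def H_tilde(t):
    # cumulative hazard using -log(1 - h), exact for discrete laws
    return sum(-math.log(1.0 - hazard(s)) for s in support if s <= t)

def survival(t):
    return sum(p[s] for s in support if s > t)

# exp(-H_tilde(t)) recovers S(t) = P(T > t) exactly
# (skip the last support point, where h = 1 and log(1 - h) blows up)
for t in support[:-1]:
    assert abs(math.exp(-H_tilde(t)) - survival(t)) < 1e-12
```

Note that the simpler cumulative sum \(H(t) = \sum h(T_j)\) only recovers \(S\) approximately when the \(h(T_j)\) are small.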
Many models in survival analysis model the hazard function (or cumulative version).
The hazard function just has to be non-negative (and integrate to \(\infty\) eventually…).
\[ f_T(t) = \frac{1}{\lambda} e^{-t/\lambda} 1_{[0,\infty)}(t) \]
\[ h(t) = \frac{1}{\lambda} \]
and
\[ H(t) = t / \lambda. \]
\[ \hat{H}(t) = \sum_{j:T_j \leq t} \frac{1}{ |\left\{k: T_k \geq T_j\right\}|} = \sum_{j:T_j \leq t} \frac{1}{ |R(T_j)|} \]
where \(R(t) = \left\{j:T_j \geq t \right\}\) is the risk set at time \(t\).
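A from-scratch sketch of this estimator (in Python, as a language-neutral numerical check; it assumes no censoring and no ties, so the risk set at the \(j\)-th smallest time has \(n - j + 1\) members):

```python
import numpy as np

def nelson_aalen(times):
    # sort the (uncensored, untied) event times; at the j-th smallest time,
    # |R(T_j)| = n - j + 1 subjects are still at risk
    times = np.sort(np.asarray(times, dtype=float))
    n = len(times)
    increments = 1.0 / (n - np.arange(n))
    return times, np.cumsum(increments)

t, H = nelson_aalen([2.0, 1.0, 3.0])
# increments are 1/3, 1/2, 1/1, so H = (1/3, 5/6, 11/6)
```

This is the same quantity `basehaz` returns (up to handling of ties and censoring) in the R examples that follow.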
library(survival)
events = rexp(100)
censor = rep(1, 100)
H = basehaz(coxph(Surv(events, censor)~1))
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)
Asymptotic properties can be analyzed using techniques from point processes in which \(H\) is the “cumulative intensity”.
Fluctuation process
\[ (\hat{H}(t) - H(t))_{t \geq 0} \]
can be analyzed using martingale techniques to provide pointwise confidence intervals for \(H\).
\[ \widehat{Var}(\hat{H}(t)) = \sum_{j:T_j \leq t} \frac{1}{|R(T_j)|^2} \]
\[ \exp(-\hat{H}(t)) \pm 1.96 \cdot \exp\left(-\hat{H}(t)\right) \sqrt{\sum_{j:T_j \leq t} \frac{1}{|R(T_j)|^2}} \]
For a partition \(0 = s_0 < s_1 < \dots < s_m = t\),
\[ \widehat{H}(t) = \sum_{j=0}^{m-1} \left[\widehat{H}(s_{j+1}) - \widehat{H}(s_j)\right] \]
and
\[ \widehat{H}(s_{j+1}) - \widehat{H}(s_j) = \frac{\sum_{i=1}^n 1_{[s_j,s_{j+1})}(T_i)}{|R(s_j)|} \]
\[\widehat{H}(s_{j+1}) - \widehat{H}(s_j) \biggl| |R(s_j)| \sim \frac{1}{|R(s_j)|} \text{Poisson}(|R(s_j)| h(s_j) (s_{j+1}-s_j)) \]
leads to (conditional) variance above
\[ \frac{h(s_j) (s_{j+1}-s_j)}{|R(s_j)|} \]
\[ \int_{[0,t]} \frac{h(s)}{|R(s)|} \; ds \]
\[f \mapsto \int_{[0,t]} f(s) h(s)\; ds \approx \int_{[0,t]} f(s) \hat{H}(ds) \]
\[ \int_{[0,t]} f(s) h(s)\; ds \approx \sum_{j:T_j \leq t} f(T_j) \frac{1}{|R(T_j)|} \]
\[ \sum_{j:T_j \leq t} \frac{1}{|R(T_j)|^2}. \]
# a larger sample
events = rexp(1000)
censor = rep(1, 1000)
H = basehaz(coxph(Surv(events, censor)~1))
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)
\[ h_{\beta}(t;x) = \exp(x^T\beta) \cdot h_0(t) \]
with baseline hazard rate \(h_0\).
Allows nonparametric model for \(h_0\).
Example of a semiparametric model.
We will come back to Cox model shortly.
Assumes \(T_i = \theta(X_i^T\beta)^{-1} \cdot Z_i\) where the law of \(Z_i\) does not depend on any parameters.
It is not hard to see this corresponds to a hazard of the form
\[ h_{\beta}(t;x) = \theta(x^T\beta) \cdot h_0(\theta(x^T\beta) t) \]
with \(h_0\) the hazard function of \(Z_i\).
\[ \log(T_i) = -\log(\theta(X_i^T\beta)) + \log Z_i \]
which, with \(\theta(\eta) = e^{-\eta}\), becomes
\[ \log(T_i) = X_i^T\beta + \log Z_i \]
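The log-linear form makes the AFT model look like ordinary regression with a non-Gaussian error. A sketch (in Python, with simulated data matching the `survreg` Weibull simulation used in these notes) checking that least squares on \(\log T\) recovers \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(size=n)
Z = rng.exponential(size=n)
# mimic the survreg simulation: T = Z^{1/2} * exp(2 X - 1),
# so log T = (2 X - 1) + (1/2) log Z is linear in X
T = np.sqrt(Z) * np.exp(2 * X - 1)

# ordinary least squares of log T on X recovers the slope beta = 2;
# the intercept absorbs -1 + E[log Z]/2 (E[log Z] = -Euler's constant for Exp(1))
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, np.log(T), rcond=None)
```

Unlike least squares, `survreg` maximizes the exact likelihood of the error law and, crucially, still works when some \(T_i\) are censored.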
##
## Call:
## survreg(formula = Surv(events, censor) ~ 1, dist = "weibull")
## Value Std. Error z p
## (Intercept) 0.0603 0.0326 1.85 0.065
## Log(scale) -0.0194 0.0248 -0.79 0.432
##
## Scale= 0.981
##
## Weibull distribution
## Loglik(model)= -1052 Loglik(intercept only)= -1052
## Number of Newton-Raphson Iterations: 6
## n= 1000
X = runif(1000)
censor = rep(1,1000)
events = rexp(1000)^0.5 * exp(2*X-1)
M = survreg(Surv(events, censor) ~ X, dist='weibull')
##
## Call:
## survreg(formula = Surv(events, censor) ~ X, dist = "weibull")
## Value Std. Error z p
## (Intercept) -1.0071 0.0315 -32.0 <2e-16
## X 2.0110 0.0540 37.2 <2e-16
## Log(scale) -0.6994 0.0247 -28.4 <2e-16
##
## Scale= 0.497
##
## Weibull distribution
## Loglik(model)= -585.6 Loglik(intercept only)= -982.8
## Chisq= 794.41 on 1 degrees of freedom, p= 8.9e-175
## Number of Newton-Raphson Iterations: 6
## n= 1000
The Weibull model also estimates a scale parameter (`Log(scale)`) in the regression…
##
## Call:
## survreg(formula = Surv(events, censor) ~ X, dist = "exponential")
## Value Std. Error z p
## (Intercept) -1.1134 0.0628 -17.7 <2e-16
## X 1.9812 0.1089 18.2 <2e-16
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -874 Loglik(intercept only)= -1035.5
## Chisq= 323.06 on 1 degrees of freedom, p= 3.1e-72
## Number of Newton-Raphson Iterations: 4
## n= 1000
Let’s imagine a lab running experiments on the development of cancer in mice after exposure to different levels of a carcinogen.
Mice are born on day \(B\), exposed to carcinogen \(X\) and their symptoms are detected at time \(D\) yielding event time \(T=D-B\).
Funding for experiment was only guaranteed up to end date \(E=\)3/31/2020.
Not all mice had symptoms at end date \(E\).
For those mice who have presented symptoms, we observe the event time \(O=D-B=T\).
For the others, we observe \(O=E-B\) and only know that their symptoms could only present later than \(O\) in their lifetime.
This is called right-censoring: we only know the time of interest lies in some interval \([O,\infty)\).
With right-censored data, observations are
\[ (O_i, \delta_i, X_i) \]
with \(O_i\) the observed “time” for subject \(i\) and \(\delta_i=1\) denotes the time of interest was observed, \(\delta_i=0\) denotes this observation was right-censored.
In order to fit model and carry out inference, we usually must assume censoring is non-informative.
For non-informative censoring, the likelihood is
\[ \left(\prod_{i:\delta_i=1} f_{\beta}(O_i;X_i) \right) \left(\prod_{i:\delta_i=0} S_{\beta}(O_i;X_i) \right) \]
\[ \prod_{i} h_{\beta}(O_i;X_i)^{\delta_i} S_{\beta}(O_i;X_i) \]
\[ f(t) = h(t) S(t). \]
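For a constant hazard \(h(t) = \theta\), the censored likelihood \(\prod_i \theta^{\delta_i} e^{-\theta O_i}\) has a closed-form maximizer, which gives a convenient numerical check of the likelihood above (a Python sketch with simulated data; the hedge is that this is the exponential special case, not the general model):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
theta = 2.0                          # true constant hazard
T = rng.exponential(1 / theta, n)    # latent event times
C = rng.exponential(1.0, n)          # non-informative censoring times
O = np.minimum(T, C)
delta = (T <= C).astype(float)

# log-likelihood: sum(delta) * log(theta) - theta * sum(O),
# maximized in closed form at
# theta_hat = (number of events) / (total observed time at risk)
theta_hat = delta.sum() / O.sum()
```

Note that censored observations contribute only through \(S_\theta(O_i) = e^{-\theta O_i}\), i.e. through the denominator `O.sum()`, exactly as the likelihood factorization says.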
Tcens = rexp(length(events)) * 1.5
observed = pmin(events, Tcens)
censor = (events <= Tcens)
M3 = survreg(Surv(observed, censor) ~ X, dist='weibull')
summary(M3)
##
## Call:
## survreg(formula = Surv(observed, censor) ~ X, dist = "weibull")
## Value Std. Error z p
## (Intercept) -1.0243 0.0370 -27.7 <2e-16
## X 2.0303 0.0723 28.1 <2e-16
## Log(scale) -0.7227 0.0305 -23.7 <2e-16
##
## Scale= 0.485
##
## Weibull distribution
## Loglik(model)= -341.6 Loglik(intercept only)= -624.1
## Chisq= 565.11 on 1 degrees of freedom, p= 6.5e-125
## Number of Newton-Raphson Iterations: 6
## n= 1000
\[ \hat{H}(t) = \sum_{j: \delta_j = 1, T_j \leq t} \frac{d_j}{n_j} \]
\(R(t) = \left\{j: O_j \geq t\right\}\) is the risk set
\(d_j\) the number of events at \(T_j\) (1 if no ties possible)
\(n_j = |R(T_j)|\) the number at risk just prior to \(T_j\) (i.e. candidates to fail at \(T_j\))
\[ \widehat{Var}(\hat{H}(t))= \sum_{\delta_j=1, T_j \leq t} \frac{d_j}{n_j (n_j - d_j)} \]
\[ \widehat{Var}(\hat{H}(t))= \sum_{\delta_j=1, T_j \leq t} \frac{d_j}{n_j^2} \]
\[ \hat{S}(t) = \exp(-\hat{H}(t)) \]
events = rexp(1000)
Tcens = rexp(1000) * 3
observed = pmin(events, Tcens)
censor = (events < Tcens)
H = basehaz(coxph(Surv(observed, censor) ~ 1))
plot(H$time, H$hazard, type='l')
abline(0, 1, col='red', lty=2)
ignore_censor = rep(1, length(censor))
Hbias = basehaz(coxph(Surv(observed, ignore_censor) ~ 1))
plot(Hbias$time, exp(-Hbias$hazard), type='l')
lines(H$time, exp(-H$time), col='red')
\[ \hat{S}(t) = \prod_{j:\delta_j=1, T_j \leq t} \left(1 - \frac{d_j}{n_j} \right) \]
\[ P(T > v_i) = \prod_{j=1}^{i} P(T > v_{j} | T > v_{j-1}) \]
with \(V=(0,T_1,\dots,T_{\sum_i \delta_i}, \infty)\) and
\[ \widehat{ P}(T > T_{j} | T > T_{j-1}) = 1 - \frac{d_j}{n_j}. \]
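A from-scratch sketch of the product-limit (Kaplan-Meier) estimator built from these conditional factors (Python, toy data; a censored observation shrinks the risk set but contributes no factor):

```python
import numpy as np

def kaplan_meier(O, delta):
    # product-limit estimate of S at each distinct event time (ties allowed)
    O = np.asarray(O, dtype=float)
    delta = np.asarray(delta, dtype=int)
    S, out = 1.0, []
    for t in np.unique(O[delta == 1]):
        n_t = int(np.sum(O >= t))                    # at risk just before t
        d_t = int(np.sum((O == t) & (delta == 1)))   # events at t
        S *= 1.0 - d_t / n_t
        out.append((t, S))
    return out

# toy data: events at 6 and 10, one observation censored at 7
km = kaplan_meier([6.0, 7.0, 10.0], [1, 0, 1])
# S(6) = 1 - 1/3 = 2/3; S(10) = (2/3) * (1 - 1/1) = 0
```

This is the estimate `survfit` plots in the R code above and below.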
plot(survfit(Surv(observed, censor) ~ 1), lwd=4, xlim=c(2,3), ylim=c(0,0.2))
lines(H$time, exp(-H$hazard), col='yellow', lwd=2)
plot(survfit(Surv(observed, censor) ~ 1), lwd=4)
lines(H$time, exp(-H$time), col='red', lwd=2) # truth
plot(survfit(Surv(observed, rep(1, length(censor))) ~ 1))
F = survfit(Surv(observed, censor) ~ 1)
lines(H$time, exp(-H$time), col='red')
Suppose we have two populations from which we observe right-censored survival times \((O_{1,i},\delta_{1,i}), 1 \leq i \leq n_1\) and \((O_{2,i}, \delta_{2,i}), 1 \leq i \leq n_2\).
Can we tell if their underlying survival times (some of which we have not observed) come from the same distribution?
Null hypothesis \(H_0: F_1 = F_2\) where \(T_{1,i} \overset{IID}{\sim} F_1\) and \(T_{2,i} \overset{IID}{\sim} F_2\).
Suppose we have a failure time \(T_j\) (could have come from either sample).
We have two risk sets \(R_1(t) = \left\{i: O_{1,i} \geq t\right\}, R_2(t) = \left\{i: O_{2,i} \geq t\right\}\)
The log-rank test models the interval \([T_j, T_j+dt)\) under \(H_0\).
At time \(T_j\), the total (pooled) risk set is \(R(T_j) = R_1(T_j) \cup R_2(T_j)\) with cardinality \(|R(T_j)|\).
| Sample | Failed | Survived | Total |
|--------|--------|----------|-------|
| 1 | \(d_{1,j}\) | \(\vert R_1(T_j)\vert - d_{1,j}\) | \(\vert R_1(T_j)\vert\) |
| 2 | \(d_{2,j}\) | \(\vert R_2(T_j)\vert - d_{2,j}\) | \(\vert R_2(T_j)\vert\) |
| Total | \(d_j = d_{1,j} + d_{2,j}\) | \(\vert R(T_j)\vert - d_j\) | \(\vert R(T_j)\vert\) |
Under \(H_0\), the expected number of failures in sample 1 at \(T_j\) is
\[ \frac{d_j \cdot |R_1(T_j)|}{|R(T_j)|} \]
Law of \(d_{1,j} | |R_1(T_j)|, |R(T_j)|, d_j\) is Hypergeometric.
Summing over \(j\)
\[ \frac{\sum_{j: d_j \neq 0} \left(d_{1,j} - \frac{d_j |R_1(T_j)|}{|R(T_j)|}\right)} {\sqrt{\sum_{j: d_j \neq 0} \text{Var}(d_{1,j} \mid |R_1(T_j)|, |R(T_j)|, d_j) }} \overset{H_0}{\Rightarrow} N(0,1) \]
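The hypergeometric mean and variance at each event time make this directly computable. A from-scratch sketch in Python (`logrank_chisq` is a hypothetical helper; R's `survdiff` performs the same computation):

```python
import numpy as np

def logrank_chisq(O1, d1, O2, d2):
    # two-sample log-rank: hypergeometric mean / variance
    # contribution at each distinct event time
    O = np.concatenate([O1, O2])
    d = np.concatenate([d1, d2])
    grp = np.concatenate([np.zeros(len(O1)), np.ones(len(O2))])
    num = var = 0.0
    for t in np.unique(O[d == 1]):
        at_risk = O >= t
        n = at_risk.sum()
        n1 = (at_risk & (grp == 0)).sum()
        dt = ((O == t) & (d == 1)).sum()
        d1t = ((O == t) & (d == 1) & (grp == 0)).sum()
        num += d1t - dt * n1 / n                       # observed - expected
        if n > 1:
            var += dt * (n1 / n) * (1 - n1 / n) * (n - dt) / (n - 1)
    return num ** 2 / var

chi2 = logrank_chisq(np.array([1.0, 2.0]), np.array([1, 1]),
                     np.array([3.0, 4.0]), np.array([1, 1]))
# a hand computation for this tiny example gives 49/17
```

Squaring the \(N(0,1)\) statistic gives the \(\chi^2_1\) form that `survdiff` reports.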
brainCancer = read.table(
'https://www.stanford.edu/class/stats305b/data/BrainCancer.csv',
sep=',',
header=TRUE)
names(brainCancer)
## [1] "sex" "diagnosis" "loc" "ki" "gtv" "stereo"
## [7] "status" "time"
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Female 45 15 18.5 0.676 1.44
## sex=Male 43 20 16.5 0.761 1.44
##
## Chisq= 1.4 on 1 degrees of freedom, p= 0.2
\[ h_{\beta}(t;x) = \exp(x^T\beta) \cdot h_0(t) \]
with baseline hazard rate \(h_0\).
Given a death has occurred at time \(t\), what are the relative chances that individual \(i\) died rather than individual \(k\)?
This should be the ratio \[ \frac{h_{\beta}(t;X_i)}{h_{\beta}(t;X_k)} \propto \exp((X_i-X_k)^T\beta) \]
Baseline hazard \(h_0(t)\) cancels!
\[ \frac{\exp(X_i^T\beta)}{\sum_{j \in R(t)} \exp(X_j^T\beta)} \]
\[ \prod_{i: \delta_i=1} \left(\frac{\exp(X_i^T\beta)}{\sum_{j \in R(T_i)} \exp(X_j^T\beta)} \right) = \prod_{i: \delta_i=1} \left(\frac{\exp(X_i^T\beta)}{\sum_{j: O_j \geq O_i} \exp(X_j^T\beta)} \right) \]
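A minimal sketch of the negative partial log-likelihood (Python, assuming no tied event times; `coxph` maximizes this same criterion in R). A convenient sanity check: at \(\beta=0\) each event contributes \(\log |R(O_i)|\).

```python
import numpy as np

def neg_partial_loglik(beta, X, O, delta):
    # negative Cox partial log-likelihood, assuming no tied event times
    eta = X @ beta
    val = 0.0
    for i in np.where(delta == 1)[0]:
        risk = O >= O[i]                        # risk set at O_i
        val += np.log(np.exp(eta[risk]).sum()) - eta[i]
    return val

# sanity check at beta = 0: risk sets have sizes 3, 2, 1
O = np.array([1.0, 2.0, 3.0])
delta = np.array([1, 1, 1])
X = np.array([[0.5], [-0.2], [0.1]])
val0 = neg_partial_loglik(np.zeros(1), X, O, delta)
# = log 3 + log 2 + log 1 = log 6
```

Note the baseline hazard never appears: only the ordering of the observed times and the risk sets matter.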
When there is no censoring, this really is the conditional likelihood.
Let’s just consider \(n=2\) with no censoring.
Let \(R(T_1,T_2)\) denote the ranking of our two times. Under the Cox model its distribution depends on \(\beta\), more precisely on \(\eta=(\eta_1,\eta_2)=(X_1'\beta,X_2'\beta)\).
Let’s compute (working in density context)
\[ \begin{aligned} P_{\beta}(T_1<T_2) &= \int_0^{\infty} \left[\int_{t_1}^{\infty} e^{\eta_2} h_0(t_2) e^{-e^{\eta_2} H_0(t_2)} \; dt_2 \right] \\ & \qquad \cdot e^{\eta_1} h_0(t_1) e^{-e^{\eta_1} H_0(t_1)} \; dt_1. \end{aligned} \]
\[ \int_{t_1}^{\infty} e^{\eta_2} h_0(t_2) e^{-e^{\eta_2} H_0(t_2)} \; dt_2 = e^{-e^{\eta_2} H_0(t_1)}. \]
\[ \int_0^{\infty} e^{\eta_1} h_0(t_1) e^{-(e^{\eta_1}+e^{\eta_2}) H_0(t_1)} \; dt_1 = \frac{e^{\eta_1}}{e^{\eta_1}+e^{\eta_2}} \]
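The result can be checked by Monte Carlo (a Python sketch; with \(h_0 \equiv 1\) the Cox model makes \(T_i\) exponential with rate \(e^{\eta_i}\), and the \(\eta\) values here are arbitrary choices for illustration):

```python
import numpy as np

# with baseline hazard h0 = 1, T_i ~ Exponential(rate = exp(eta_i));
# the calculation above predicts
# P(T1 < T2) = e^{eta1} / (e^{eta1} + e^{eta2})
rng = np.random.default_rng(2)
eta1, eta2 = 0.7, -0.3
n = 200000
T1 = rng.exponential(np.exp(-eta1), size=n)  # numpy takes scale = 1/rate
T2 = rng.exponential(np.exp(-eta2), size=n)
prop = (T1 < T2).mean()
pred = np.exp(eta1) / (np.exp(eta1) + np.exp(eta2))
```

The simulated proportion matches the softmax-type formula, which is exactly the \(n=2\) partial-likelihood factor.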
brainCancer = read.table(
'https://www.stanford.edu/class/stats305b/data/BrainCancer.csv',
sep=',',
header=TRUE)
library(survival)
M = coxph(Surv(time, status) ~ sex, data=brainCancer)
summary(M) # why no intercept?
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## n= 88, number of events= 35
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.4077 1.5033 0.3420 1.192 0.233
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.503 0.6652 0.769 2.939
##
## Concordance= 0.565 (se = 0.045 )
## Likelihood ratio test= 1.44 on 1 df, p=0.2
## Wald test = 1.42 on 1 df, p=0.2
## Score (logrank) test = 1.44 on 1 df, p=0.2
## sexMale
## 0.3420042
##
## HG glioma LG glioma Meningioma Other
## 22 9 42 14
subs = !is.na(brainCancer$diagnosis)
M = coxph(Surv(time, status) ~ sex, data=brainCancer, subset=subs) # one missing diagnosis
M2 = coxph(Surv(time, status) ~ sex + diagnosis, data=brainCancer, subset=subs)
anova(M, M2)
## Analysis of Deviance Table
## Cox model: response is Surv(time, status)
## Model 1: ~ sex
## Model 2: ~ sex + diagnosis
## loglik Chisq Df P(>|Chi|)
## 1 -136.54
## 2 -125.68 21.728 3 7.432e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## coxph(formula = Surv(time, status) ~ sex + diagnosis, data = brainCancer,
## subset = subs)
##
## n= 87, number of events= 35
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.0353 1.0359 0.3617 0.098 0.9223
## diagnosisLG glioma -1.1108 0.3293 0.5632 -1.972 0.0486 *
## diagnosisMeningioma -1.9822 0.1378 0.4382 -4.524 6.07e-06 ***
## diagnosisOther -1.1891 0.3045 0.5113 -2.325 0.0200 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.0359 0.9653 0.50990 2.1046
## diagnosisLG glioma 0.3293 3.0368 0.10919 0.9931
## diagnosisMeningioma 0.1378 7.2586 0.05837 0.3252
## diagnosisOther 0.3045 3.2842 0.11177 0.8295
##
## Concordance= 0.749 (se = 0.038 )
## Likelihood ratio test= 23.51 on 4 df, p=1e-04
## Wald test = 23.62 on 4 df, p=1e-04
## Score (logrank) test = 29.69 on 4 df, p=6e-06
\[ \sum_{i:\delta_i=1} \left(\frac{\sum_{j: O_j \geq O_i} X_j \exp(X_j^T\beta)}{\sum_{j: O_j \geq O_i} \exp(X_j^T\beta)} - X_i \right) \]
Each first term is an expectation of \(X\) w.r.t. a distribution defined over the risk set with weights \(\exp(X_j^T\beta)\).
Could write
\[ \sum_{i:\delta_i=1} \left( E_{\beta,i}[X] - X_i \right) \]
Suppose \(X\) is a binary indicator (i.e. sex in our brainCancer data) with, say \(X_i=1\) denoting female.
At an event time \(T_i\), \(X_i\) is the indicator that the death was a female death at that time.
Under \(H_0:\beta=0\), \(E_{0,i}[X]\) is the expected number of female deaths at time \(T_i\):
\[ E_{0,i}[X] = \frac{d_i \cdot \sum_{j:O_j \geq O_i} X_j}{ \sum_{j:O_j \geq O_i} 1 } = \frac{d_i \cdot \# R_1(T_i)}{\# R(T_i)} \]
Standard optimality of score / LRT indicates that the log-rank tests will perform well against alternatives that are proportional hazards.
This connection to the Cox score provides a clearer picture of what the (low-dimensional) log-rank test is actually testing.
\[ \frac{\sum_{i:\delta_i=1} W(T_i)(X_i - E_{0,i}[X])}{\left(\sum_{i:\delta_i=1} W(T_i)^2 \text{Var}_{0,i}[X]\right)^{1/2}} \]
\[ \left(\sum_{i:\delta_i=1} W(T_i)(X_i - E_{0,i}[X])\right)^T \left(\sum_{i:\delta_i=1} W(T_i)^2 \text{Var}_{0,i}[X]\right)^{-1} \left(\sum_{i:\delta_i=1} W(T_i)(X_i - E_{0,i}[X])\right) \overset{H_0}{\approx} \chi^2_{K-1}. \]
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Female 45 15 18.5 0.676 1.44
## sex=Male 43 20 16.5 0.761 1.44
##
## Chisq= 1.4 on 1 degrees of freedom, p= 0.2
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = brainCancer)
##
## n= 88, number of events= 35
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.4077 1.5033 0.3420 1.192 0.233
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.503 0.6652 0.769 2.939
##
## Concordance= 0.565 (se = 0.045 )
## Likelihood ratio test= 1.44 on 1 df, p=0.2
## Wald test = 1.42 on 1 df, p=0.2
## Score (logrank) test = 1.44 on 1 df, p=0.2
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = brainCancer,
## rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Female 45 11.3 14.4 0.661 1.73
## sex=Male 43 16.0 13.0 0.735 1.73
##
## Chisq= 1.7 on 1 degrees of freedom, p= 0.2
## Call:
## survdiff(formula = Surv(time, status) ~ diagnosis, data = brainCancer)
##
## n=87, 1 observation deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## diagnosis=HG glioma 22 17 5.69 22.4456 27.5457
## diagnosis=LG glioma 9 4 3.75 0.0167 0.0188
## diagnosis=Meningioma 42 9 20.26 6.2583 15.2537
## diagnosis=Other 14 5 5.30 0.0165 0.0196
##
## Chisq= 29.7 on 3 degrees of freedom, p= 2e-06
## Call:
## coxph(formula = Surv(time, status) ~ diagnosis, data = brainCancer)
##
## n= 87, number of events= 35
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## diagnosisLG glioma -1.1072 0.3305 0.5620 -1.970 0.0488 *
## diagnosisMeningioma -1.9937 0.1362 0.4221 -4.723 2.32e-06 ***
## diagnosisOther -1.1872 0.3051 0.5109 -2.324 0.0201 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## diagnosisLG glioma 0.3305 3.026 0.10985 0.9943
## diagnosisMeningioma 0.1362 7.342 0.05955 0.3115
## diagnosisOther 0.3051 3.278 0.11207 0.8305
##
## Concordance= 0.725 (se = 0.044 )
## Likelihood ratio test= 23.5 on 3 df, p=3e-05
## Wald test = 23.61 on 3 df, p=3e-05
## Score (logrank) test = 29.68 on 3 df, p=2e-06
## Call:
## survdiff(formula = Surv(time, status) ~ diagnosis, data = brainCancer,
## rho = 1)
##
## n=87, 1 observation deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## diagnosis=HG glioma 22 14.13 4.73 18.71252 27.5305
## diagnosis=LG glioma 9 2.72 2.88 0.00869 0.0122
## diagnosis=Meningioma 42 6.53 15.41 5.11289 14.8544
## diagnosis=Other 14 3.86 4.23 0.03263 0.0474
##
## Chisq= 29.3 on 3 degrees of freedom, p= 2e-06
\[ \widehat{H}_B(t;\beta) = \sum_{i:\delta_i=1, T_i \leq t} \frac{d_i}{\sum_{j \in R(T_i)} \exp(X_j^T\beta)} \]
\[ S_{\widehat{\beta}}(t;X_{new})=\exp\left(-\exp(X_{new}^T\widehat{\beta}) \cdot \widehat{H}_B(t;\widehat{\beta}) \right) \]
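A sketch of the Breslow estimator (Python; `breslow` is a hypothetical helper, not a package function). A useful check of the formula: at \(\beta = 0\) it reduces to the Nelson-Aalen increments \(d_j/n_j\).

```python
import numpy as np

def breslow(beta, X, O, delta):
    # Breslow baseline cumulative hazard at each distinct event time
    w = np.exp(X @ beta)
    H, out = 0.0, []
    for t in np.unique(O[delta == 1]):
        d_t = np.sum((O == t) & (delta == 1))
        H += d_t / w[O >= t].sum()      # risk set weighted by exp(X' beta)
        out.append((t, H))
    return out

# at beta = 0, weights are all 1 and the increments are d_j / n_j
O = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 0, 1, 1])
X = np.zeros((4, 1))
est = breslow(np.zeros(1), X, O, delta)
# increments 1/4, 1/2, 1/1 at times 1, 3, 4
```

Plugging \(\widehat{H}_B\) into \(\exp(-\exp(X_{new}^T\widehat{\beta})\,\widehat{H}_B(t;\widehat{\beta}))\) then gives the predicted survival curve for a new covariate vector.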
Suppose our likelihood is parameterized by \((\beta,\theta)\) but our interest is focused on \(\beta\).
Let \(\ell\) be the negative log-likelihood. The profile likelihood for parameter \(\beta\) is
\[ \ell_{\text{P}}(\beta) = \inf_{\theta} \ell(\beta,\theta) \]
with minimizer \(\widehat{\beta}_{\text{P}}\)
\[ \widehat{\beta}_{\text{P}} = \widehat{\beta}_{MLE}! \]
It is tempting to say \[ \widehat{\beta}_{\text{P}} - \beta^* \overset{D}{\approx} N\left(0, \nabla^2 \ell_{\text{P}}(\widehat{\beta}_{\text{P}})^{-1}\right) \]
But because \(\ell_{\text{P}}\) is defined by optimization it is not really a (negative log-) likelihood.
There certainly are examples when the above is asymptotically correct (OLS regression profiling out \(\sigma^2\), Cox model, …): in general when \(\theta\) is low-dimensional this will be OK.
Cox case is not low-dimensional…
\[ \ell_{\text{P}}(\beta) = \frac{n}{2} \log(\|Y-X\beta\|^2_2) \]
and
\[ \nabla^2 \ell_{\text{P}}(\widehat{\beta}_{\text{P}}) = \frac{n}{\|Y-X\widehat{\beta}_{\text{P}}\|^2_2} X^TX. \]
\[ \prod_{i=1}^n \left(e^{X_i^T\beta} h_i \right)^{\delta_i} \left(\exp\left(-e^{X_i^T\beta} \sum_{j:O_j \leq O_i} h_j\right) \right) \]
\[ \left(\sum_{i=1}^n \delta_i (X_i^T\beta + \log h_i) \right) - \left( \sum_{i=1}^n e^{X_i^T\beta} \sum_{j:O_j \leq O_i} h_j\right) \]
\[ \left(\sum_{i=1}^n \delta_i (X_i^T\beta + \log h_i) \right) - \left( \sum_{j=1}^n h_j \sum_{i:O_i \geq O_j} e^{X_i^T\beta} \right) \]
\[ \left(\sum_{i=1}^n \delta_i (X_i^T\beta + \log h_i) \right) - \left( \sum_{i=1}^n h_i \sum_{j:O_j \geq O_i} e^{X_j^T\beta} \right) \]
\[ \frac{\partial}{\partial h_i}: \frac{\delta_i}{h_i} - \sum_{j:O_j \geq O_i} e^{X_j^T\beta} \]
\[ h_i(\beta) = \frac{\delta_i}{\sum_{j:O_j \geq O_i} e^{X_j^T\beta}}. \]
Plugging in \(h(\beta)\)
Noting that \(\lim_{\delta \to 0} \delta \log \delta = 0\) (i.e. \(0 \log 0 = 0\)) yields first term
\[ \sum_{i=1}^n \delta_i \left(X_i^T\beta - \log\left(\sum_{j:O_j \geq O_i} e^{X_j^T\beta}\right)\right) \]
while the second term becomes
\[ \left(\sum_{i=1}^n \frac{\delta_i}{\sum_{k:O_k \geq O_i} e^{X_k^T\beta}} \sum_{j:O_j \geq O_i} e^{X_j^T\beta} \right) = \sum_{i=1}^n \delta_i. \]
Martingale residuals:
\[ \delta_i - \exp(X_i'\widehat{\beta}) \cdot \widehat{H}_B(O_i;\widehat{\beta}) \]
Large negative values indicate individuals who survived a long time without failing – outliers?
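A from-scratch sketch of these residuals (Python; `martingale_residuals` is a hypothetical helper combining the formula above with the Breslow estimate). A standard check: the residuals sum to zero; the verification here is done at \(\beta = 0\), where the Breslow increments are \(d_t/n_t\) and the cancellation is exact by counting.

```python
import numpy as np

def martingale_residuals(beta, X, O, delta):
    # delta_i - exp(X_i' beta) * H_B(O_i; beta), with H_B the Breslow estimate
    w = np.exp(X @ beta)
    H = np.zeros(len(O))
    for t in np.unique(O[delta == 1]):
        d_t = np.sum((O == t) & (delta == 1))
        H[O >= t] += d_t / w[O >= t].sum()  # increment reaches all O_i >= t
    return delta - w * H

# residuals on random data, evaluated at beta = 0
rng = np.random.default_rng(3)
O = rng.exponential(size=50)
delta = (rng.uniform(size=50) < 0.7).astype(float)
X = rng.normal(size=(50, 1))
r = martingale_residuals(np.zeros(1), X, O, delta)
```

Each event time's increment \(d_t/n_t\) is charged to exactly \(n_t\) subjects, so \(\sum_i \widehat H_B(O_i) = \sum_t d_t = \sum_i \delta_i\) and the residuals cancel.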
Cox-Snell residuals (should behave roughly like a censored sample of unit exponential variables):
\[ -\log \left(S_{\widehat{\beta}}(O_i;X_i)\right) = \exp(X_i'\widehat{\beta}) \cdot \widehat{H}_B(O_i;\widehat{\beta}) \]
## Call:
## coxph(formula = Surv(time, status) ~ sex + ki + diagnosis + gtv,
## data = brainCancer)
##
## n= 87, number of events= 35
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sexMale 0.14985 1.16166 0.36031 0.416 0.67749
## ki -0.05484 0.94664 0.01832 -2.993 0.00277 **
## diagnosisLG glioma -1.16457 0.31206 0.57321 -2.032 0.04219 *
## diagnosisMeningioma -2.20885 0.10983 0.44137 -5.005 5.6e-07 ***
## diagnosisOther -1.53888 0.21462 0.53680 -2.867 0.00415 **
## gtv 0.04190 1.04279 0.02033 2.061 0.03934 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sexMale 1.1617 0.8608 0.57330 2.3538
## ki 0.9466 1.0564 0.91324 0.9813
## diagnosisLG glioma 0.3121 3.2046 0.10146 0.9597
## diagnosisMeningioma 0.1098 9.1053 0.04624 0.2609
## diagnosisOther 0.2146 4.6594 0.07495 0.6146
## gtv 1.0428 0.9590 1.00205 1.0852
##
## Concordance= 0.782 (se = 0.042 )
## Likelihood ratio test= 40.5 on 6 df, p=4e-07
## Wald test = 38.44 on 6 df, p=9e-07
## Score (logrank) test = 46.21 on 6 df, p=3e-08
cox_snell = pmax(brainCancer$status[subs] - resid(M3), 0)
H = basehaz(coxph(Surv(cox_snell, brainCancer$status[subs]) ~ 1))
plot(H$time, H$hazard)
abline(c(0, 1))
Negative log-likelihood expressible in terms of linear predictor \(\eta=X\beta\) \[ \ell(\eta) = \sum_{i:\delta_i=1} \left[ \log \left(\sum_{j:O_j \geq O_i} \exp(\eta_j) \right) - \eta_i\right] \]
Negative log-likelihood is convex in \(\eta\) (log-sum-exp)
We can (easily) regularize!
\[ \text{minimize}_{\beta} \ell(X\beta) + \lambda \|\beta\|_1. \]
## Loading required package: Matrix
## Loaded glmnet 4.0-2
M = coxph(Surv(time, status) ~ ., data=brainCancer)
D = model.frame(M) # treats diagnosis as continuous
X = model.matrix(M)[,-1]
Y = D[,1]
plot(glmnet(X, Y, family='cox'))
For logistic regression we looked at cross-validated Binomial deviance.
Misclassification error is a good metric when we don’t believe our logistic regression model.
For the Cox model, we have a negative (partial) log-likelihood (which we can take to be \(DEV/2\)). What is the analog of misclassification error for Cox?
A Cox model produces scores \(\widehat{\eta}=X\widehat{\beta}\).
Cox model indicates that higher \(\eta\) should yield earlier failure times.
For scores \(\eta\) and times \(T\), a pair is concordant if \(\eta_i < \eta_j\) and \(T_i > T_j\). But we have censored data…
We can compare pairs \((i,j)\) that are both uncensored (i.e. \(\delta_i=\delta_j=1\)).
If \(O_i > O_j\) and \(\delta_j=1\) then we know \(T_i > T_j\) and can check concordance
If \(O_i < O_j\) and \(\delta_i=0\) then we don’t know whether \(T_i > T_j\)…
Leads to \[ \begin{aligned} C(\eta, O, \delta) &= \frac{\sum_{i,j:\delta_j=1} 1_{\{\eta_i < \eta_j\}} 1_{\{O_i > O_j\}} d_j(\delta)}{\sum_{i,j: \delta_j=1} 1_{\{O_i > O_j\}} d_j(\delta)} \\ &= \frac{\sum_{i,j} 1_{\{\eta_i < \eta_j\}} 1_{\{O_i > O_j\}} d_j(\delta)}{\sum_{i,j} 1_{\{O_i > O_j\}} d_j(\delta)} \\ \end{aligned} \]
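A brute-force sketch of this concordance index (Python; \(O(n^2)\) over pairs, whereas `coxph`'s reported concordance uses an efficient algorithm and handles ties more carefully):

```python
import numpy as np

def concordance(eta, O, delta):
    # a pair (i, j) is usable when the earlier time O_j is an observed event
    # and O_i > O_j; it is concordant when the earlier failure has the
    # higher score (eta_j > eta_i)
    conc = total = 0
    n = len(O)
    for j in range(n):
        if delta[j] != 1:
            continue
        for i in range(n):
            if O[i] > O[j]:
                total += 1
                conc += int(eta[i] < eta[j])
    return conc / total

# scores that reverse-order the times are perfectly concordant
C = concordance(np.array([3.0, 2.0, 1.0]),   # eta
                np.array([1.0, 2.0, 3.0]),   # O
                np.array([1, 1, 1]))         # delta
```

A value of 0.5 corresponds to scores with no ranking information, matching the interpretation of the `Concordance` line in the `coxph` summaries above.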
For a categorical variable, the lines cannot cross and should be some baseline function raised to different powers: in category \(j\) \[ S_j(t) = \exp\left(-c_j H(t)\right) = \left(\exp(-H(t))\right)^{c_j} \]
So if \(c_j > c_k\), then \(S_j(t) \leq S_k(t)\) for all \(t\).
Crossing survival curves \(\implies\) proportional hazards cannot hold.
Difficult to produce such plots for models like M2 with more than one categorical predictor.
Need a way to detect departures from proportional hazards model.
Alternative model, allow \(\beta\) to vary with time: \[ h_{\beta}(t;x) = h_0(t) \exp(X^T\beta(t)) \]
Proportional hazards for feature \(j\) is equivalent to \(\beta(t) \equiv \beta_j\)
Schoenfeld residuals: for each event \(T_i,\delta_i=1\) \[ S_i(\beta) = X_i - E_{\beta,i}[X] \]
Total score \[ \sum_{i:\delta_i=1} S_i(\beta). \]
Scaled Schoenfeld residual \[ S_i^*(\beta) = \text{Var}_{\beta,i}[X]^{-1}S_i(\beta) \]
\(S^*_i\) can be related to Taylor series for \(\beta(t)\) in time dependent alternative…
\[ \sum_{i:\delta_i=1} \ell_i(\beta(T_i)) \]
where
\[ \ell_i(\beta(t)) = \left[X_i^T\beta(t) - \log\left(\sum_{j \in R(t)} \exp(X_j^T\beta(t)) \right) \right] \]
Our likelihood depends only on \(\{\beta(T_i): \delta_i=1\}\) so consider perturbations \[ \beta(T_i) = \widehat{\beta} + \zeta_i \]
Further, each \(\zeta_i\) only shows up in \(\ell_i\), and
\[ \begin{aligned} \ell_i(\widehat{\beta}_{MLE} + \zeta_i) - \ell_i(\widehat{\beta}_{MLE}) &= \nabla \ell_i(\widehat{\beta}_{MLE})^T\zeta_i + \frac{1}{2} \zeta_i^T\nabla^2 \ell_i(\widehat{\beta}_{MLE}) \zeta_i + \dots \\ &= S_i^T\zeta_i - \frac{1}{2} \zeta_i^T\text{Var}_{\widehat{\beta}_{MLE},i}[X]\zeta_i + \dots \end{aligned} \]
\[ \widehat{\zeta}_i = S_i^*(\widehat{\beta}_{MLE}) \]
Formally, these are one-step estimators of \(\zeta_i\) from the MLE under \(H_0\)…
Under null, plot of
\[ (T_i, \widehat{\beta}_{MLE,j} + \widehat{\zeta}_{i,j}), \qquad \delta_i=1, 1 \leq j \leq p \]
should look flat if \(\beta_j(t) \equiv \beta_j\).
## Call:
## coxph(formula = Surv(time, status) ~ sex + diagnosis, data = brainCancer,
## subset = subs)
##
## coef exp(coef) se(coef) z p
## sexMale 0.0353 1.0359 0.3617 0.098 0.9223
## diagnosisLG glioma -1.1108 0.3293 0.5632 -1.972 0.0486
## diagnosisMeningioma -1.9822 0.1378 0.4382 -4.524 6.07e-06
## diagnosisOther -1.1891 0.3045 0.5113 -2.325 0.0200
##
## Likelihood ratio test=23.51 on 4 df, p=0.0001003
## n= 87, number of events= 35
To test \(H_0:\beta_j(t) \equiv \beta_j\) one can test whether the slope is zero in such plots. (Variance of slope estimator?)
Yields one degree of freedom tests.
## chisq df p
## sex 0.671 1 0.41
## diagnosis 1.337 3 0.72
## GLOBAL 1.951 4 0.74
## chisq df p
## sex 0.9708 1 0.32
## ki 0.0547 1 0.81
## diagnosis 1.1849 3 0.76
## gtv 0.1370 1 0.71
## GLOBAL 2.5483 6 0.86