Evaluating Survival Models

May 4, 2019 9 min read

The most frequently used evaluation metric of survival models is the concordance index (c index, c statistic). It is a measure of rank correlation between predicted risk scores $\hat{f}$ and observed time points $y$ that is closely related to Kendall’s τ. It is defined as the ratio of correctly ordered (concordant) pairs to comparable pairs. Two samples $i$ and $j$ are comparable if the sample with lower observed time $y$ experienced an event, i.e., if $y_j > y_i$ and $\delta_i = 1$, where $\delta_i$ is a binary event indicator. A comparable pair $(i, j)$ is concordant if the estimated risk $\hat{f}$ by a survival model is higher for subjects with lower survival time, i.e., $\hat{f}_i >\hat{f}_j \land y_j > y_i$, otherwise the pair is discordant. Harrell’s estimator of the c index is implemented in concordance_index_censored.

While Harrell’s concordance index is easy to interpret and compute, it has some shortcomings:

it has been shown that it is too optimistic with increasing amount of censoring [1],
it is not a useful measure of performance if a specific time range is of primary interest (e.g. predicting death within 2 years).

Since version 0.8, scikit-survival supports an alternative estimator of the concordance index from right-censored survival data, implemented in concordance_index_ipcw, that addresses the first issue.

The second point can be addressed by extending the well known receiver operating characteristic curve (ROC curve) to possibly censored survival times. Given a time point $t$, we can estimate how well a predictive model can distinguishing subjects who will experience an event by time $t$ (sensitivity) from those who will not (specificity). The function cumulative_dynamic_auc implements an estimator of the cumulative/dynamic area under the ROC for a given list of time points.

The first part of this post will illustrate the first issue with simulated survival data, while the second part will focus on the time-dependent area under the ROC applied to data from a real study.

To see the full source code for producing the figures in this post, please see this notebook.

Bias of Harrell’s Concordance Index

Harrell’s concordance index is known to be biased upwards if the amount of censoring in the test data is high [1]. Uno et al. proposed an alternative estimator of the concordance index that behaves better in such situations. In this section, we are going to apply concordance_index_censored and concordance_index_ipcw to synthetic survival data and compare their results.

Simulation Study

We are generating a synthetic biomarker by sampling from a standard normal distribution. For a given hazard ratio, we compute the associated (actual) survival time by drawing from an exponential distribution. The censoring times were generated from a uniform independent distribution $\textrm{Uniform}(0,\gamma)$, where we choose $\gamma$ to produce different amounts of censoring.

Since Uno’s estimator is based on inverse probability of censoring weighting, we need to estimate the probability of being censored at a given time point. This probability needs to be non-zero for all observed time points. Therefore, we restrict the test data to all samples with observed time lower than the maximum event time $\tau$. Usually, one would use the tau argument of concordance_index_ipcw for this, but we apply the selection before to pass identical inputs to concordance_index_censored and concordance_index_ipcw. The estimates of the concordance index are therefore restricted to the interval $[0, \tau]$.

Let us assume a moderate hazard ratio of 2 and generate a small synthetic dataset of 100 samples from which we estimate the concordance index. We repeat this experiment 200 times and plot mean and standard deviation of the difference between the actual (in the absence of censoring) and estimated concordance index.

Since the hazard ratio remains constant and only the amount of censoring changes, we would want an estimator for which the difference between the actual and estimated c to remain approximately constant across simulations.

We can observe that estimates are on average below the actual value, except for the highest amount of censoring, where Harrell’s c begins overestimating the performance (on average).

With such a small dataset, the variance of differences is quite big, so let us increase the amount of data to 1000 and repeat the simulation.

Now we can observe that Harrell’s c begins to overestimate performance starting with approximately 49% censoring while Uno’s c is still underestimating the performance, but is on average very close to the actual performance for large amounts of censoring.

For the final experiment, we double the size of the dataset to 2000 and repeat the analysis.

The trend we observed in the previous simulation is now even more pronounced. Harrell’s c is becoming more and more overconfident in the performance of the synthetic marker with increasing amount of censoring, while Uno’s c remains stable.

In summary, while the difference between concordance_index_ipcw and concordance_index_censored is negligible for small amounts of censoring, when analyzing survival data with moderate to high amounts of censoring, you might want to consider estimating the performance using concordance_index_ipcw instead of concordance_index_censored.

Time-dependent Area under the ROC

The area under the receiver operating characteristics curve (ROC curve) is a popular performance measure for binary classification task. In the medical domain, it is often used to determine how well estimated risk scores can separate diseased patients (cases) from healthy patients (controls). Given a predicted risk score $\hat{f}$, the ROC curve compares the false positive rate (1 - specificity) against the true positive rate (sensitivity) for each possible value of $\hat{f}$.

When extending the ROC curve to continuous outcomes, in particular survival time, a patient’s disease status is typically not fixed and changes over time: at enrollment a subject is usually healthy, but may be diseased at some later time point. Consequently, sensitivity and specificity become time-dependent measures. Here, we consider cumulative cases and dynamic controls at a given time point $t$, which gives rise to the time-dependent cumulative/dynamic ROC at time $t$. Cumulative cases are all individuals that experienced an event prior to or at time $t$ ($t_i \leq t$), whereas dynamic controls are those with $t_i>t$. By computing the area under the cumulative/dynamic ROC at time $t$, we can determine how well a model can distinguish subjects who fail by a given time ($t_i \leq t$) from subjects who fail after this time ($t_i>t$). Hence, it is most relevant if one wants to predict the occurrence of an event in a period up to time $t$ rather than at a specific time point $t$.

The cumulative_dynamic_auc function implements an estimator of the cumulative/dynamic area under the ROC at a given list of time points. To illustrate its use, we are going to use data from a study that investigated to which extent the serum immunoglobulin free light chain (FLC) assay can be used predict overall survival. The dataset has 7874 subjects and 9 features; the endpoint is death, which occurred for 2169 subjects (27.5%).

First, we are loading the data and split it into train and test set to evaluate how well markers generalize.

x, y = load_flchain()

(x_train, x_test,
 y_train, y_test) = train_test_split(x, y, test_size=0.2, random_state=0)

Serum creatinine measurements are missing for some patients, therefore we are just going to impute these values with the mean using scikit-learn’s SimpleImputer.

num_columns = ['age', 'creatinine', 'kappa', 'lambda']

imputer = SimpleImputer().fit(x_train.loc[:, num_columns])
x_train = imputer.transform(x_train.loc[:, num_columns])
x_test = imputer.transform(x_test.loc[:, num_columns])

Similar to Uno’s estimator of the concordance index described above, we need to be a little bit careful when selecting the test data and time points we want to evaluate the ROC at, due to the estimator’s dependence on inverse probability of censoring weighting. First, we are going to check whether the observed time of the test data lies within the observed time range of the training data.

y_events = y_train[y_train['death']]
train_min, train_max = y_events["futime"].min(), y_events["futime"].max()

y_events = y_test[y_test['death']]
test_min, test_max = y_events["futime"].min(), y_events["futime"].max()

assert train_min <= test_min < test_max < train_max, \
    "time range or test data is not within time range of training data."

When choosing the time points to evaluate the ROC at, it is important to remember to choose the last time point such that the probability of being censored after the last time point is non-zero. In the simulation study above, we set the upper bound to the maximum event time, here we use a more conservative approach by setting the upper bound to the 80% percentile of observed time points, because the censoring rate is quite large at 72.5%. Note that this approach would be appropriate for choosing tau of concordance_index_ipcw too.

times = np.percentile(y["futime"], np.linspace(5, 81, 15))
print(times)

[ 470.3        1259.         1998.         2464.82428571 2979.
 3401.         3787.99857143 4051.         4249.         4410.17285714
 4543.         4631.         4695.         4781.         4844.        ]

We begin by considering individual real-valued features as risk scores without actually fitting a survival model. Hence, we obtain an estimate of how well age, creatinine, kappa FLC, and lambda FLC are able to distinguish cases from controls at each time point.

The plot shows the estimated area under the time-dependent ROC at each time point and the average across all time points as dashed line.

We can see that age is overall the most discriminative feature, followed by $\kappa$ and $\lambda$ FLC. That fact that age is the strongest predictor of overall survival in the general population is hardly surprising (we have to die at some point after all). More differences become evident when considering time: the discriminative power of FLC decreases at later time points, while that of age increases. The observation for age again follows common sense. In contrast, FLC seems to be a good predictor of death in the near future, but not so much if it occurs decades later.

Next, we will fit an actual survival model to predict the risk of death from the Veterans’ Administration Lung Cancer Trial. After fitting a Cox proportional hazards model, we want to assess how well the model can distinguish survivors from deceased in weekly intervals, up to 6 months after enrollment.

The plot shows that the model is doing quite well on average with an AUC of ~0.82 (dashed line). However, there is a clear difference in performance between the first and second half of the time range. Performance increases up to about 100 days from enrollment, but quickly drops thereafter. Thus, we can conclude that the model is less effective in predicting death past 100 days.

Conclusion

I hope this post helped you to understand some of the pitfalls when estimating the performance of markers and models from right-censored survival data. We illustrated that Harrell’s estimator of the concordance index is biased when the amount of censoring is high, and that Uno’s estimator is more appropriate in this situation. Finally, we demonstrated that the time-dependent area under the ROC is a very useful tool when we want to predict the occurrence of an event in a period up to time $t$ rather than at a specific time point $t$.

python

Sebastian Pölsterl

AI Researcher

My research interests include machine learning for time-to-event analysis, causal inference and biomedical applications.