scikit-survival 0.10 released

This release of scikit-survival adds two features that are standard in most software for survival analysis, but were missing so far:

  1. CoxPHSurvivalAnalysis now has a ties parameter that lets you choose between Breslow’s and Efron’s likelihood for handling tied event times. Previously, only Breslow’s likelihood was implemented, and it remains the default. If your data contain many tied event times, you can now select Efron’s likelihood with ties="efron", which approximates the exact partial likelihood more closely and therefore tends to give more accurate estimates of the model’s coefficients.
  2. A compare_survival function has been added. It can be used to assess whether the survival functions of two or more groups differ.
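To make the difference between the two options concrete, here is a minimal pure-Python sketch of how a single event time with ties contributes to the partial log-likelihood under each approximation. It is not scikit-survival's implementation: the function name tied_loglik_contribution and the risk scores are made up for illustration, and each risk score stands for $\exp(x^\top \beta)$ of one subject.

```python
import math


def tied_loglik_contribution(event_scores, at_risk_scores, method="breslow"):
    """Partial log-likelihood contribution of one event time with ties.

    event_scores   -- risk scores exp(x'beta) of subjects with an event at this time
    at_risk_scores -- risk scores of ALL subjects still at risk (events included)
    Hypothetical helper for illustration only.
    """
    d = len(event_scores)
    sum_events = sum(event_scores)
    sum_at_risk = sum(at_risk_scores)
    numerator = sum(math.log(r) for r in event_scores)

    if method == "breslow":
        # Every tied event is compared against the full risk set.
        return numerator - d * math.log(sum_at_risk)
    if method == "efron":
        # The k-th tied event sees a risk set from which a fraction k/d
        # of the tied subjects has already been removed.
        return numerator - sum(
            math.log(sum_at_risk - k / d * sum_events) for k in range(d)
        )
    raise ValueError("method must be 'breslow' or 'efron'")


# Made-up example: two tied events among four subjects at risk.
event_scores = [2.0, 1.0]
at_risk_scores = [2.0, 1.0, 1.0, 0.5]
print("Breslow:", tied_loglik_contribution(event_scores, at_risk_scores, "breslow"))
print("Efron:  ", tied_loglik_contribution(event_scores, at_risk_scores, "efron"))
```

With a single event at a time point both expressions coincide. They only diverge when several events share the same time, which is why the choice matters mostly for data with many ties.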

To illustrate the use of compare_survival, let’s consider the Veterans’ Administration Lung Cancer Trial. Here, we are considering the Celltype feature and we want to know whether the tumor type impacts survival. We can visualize the survival function for each subgroup using the Kaplan-Meier estimator.

import matplotlib.pyplot as plt
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.nonparametric import kaplan_meier_estimator

data_x, data_y = load_veterans_lung_cancer()
group_indicator = data_x.loc[:, "Celltype"]
groups = group_indicator.unique()

for group in groups:
    group_y = data_y[group_indicator == group]
    time, surv_prob = kaplan_meier_estimator(
        group_y["Status"],
        group_y["Survival_in_days"])

    plt.step(time, surv_prob, where="post",
             label="Celltype = {}".format(group))
# Configure the axes once, after all groups have been plotted.
plt.xlabel("time $t$")
plt.ylabel("est. probability of survival")
plt.ylim(0, 1)
plt.grid(True)
plt.legend()
Kaplan-Meier estimates of survival function.

The figure indicates that patients with adenocarcinoma (green line) do not survive beyond 200 days, whereas patients with squamous cell lung cancer (blue line) can survive several years. We can test whether this difference is statistically significant with a non-parametric log-rank test. It groups patients by cell type and compares the estimated group-specific hazard rates with the pooled hazard rate. Under the null hypothesis, all groups have the same hazard rate at every time point. The alternative hypothesis is that the hazard rate of at least one group differs from the others at some time point.

from sksurv.compare import compare_survival

chisq, pvalue, stats, covar = compare_survival(
        data_y, group_indicator, return_stats=True)

The resulting test statistic is $\chi^2 = 25.40$, which corresponds to a highly significant P-value of $1.3 \cdot 10^{-5}$. In addition, we can look at group-specific statistics by specifying return_stats=True.
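As a quick sanity check, the P-value can be reproduced approximately from the test statistic alone. The sketch below assumes the usual reference distribution for a log-rank test with four groups, a $\chi^2$ distribution with $4 - 1 = 3$ degrees of freedom, and uses the closed-form tail probability that exists for exactly 3 degrees of freedom. The helper name chi2_sf_3df is made up for illustration, and the input is the rounded statistic from the text, so the result only matches to the reported precision.

```python
import math


def chi2_sf_3df(x):
    """Upper tail probability P(X > x) of a chi-squared variable with 3 degrees of freedom.

    Uses the closed form erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2),
    which is specific to 3 degrees of freedom.
    """
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Rounded test statistic reported in the text.
p_value = chi2_sf_3df(25.40)
print("P-value: {:.1e}".format(p_value))
```

The group-specific statistics obtained with return_stats=True are listed next.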

group      counts  observed  expected  statistic
adeno          27        26     15.69      10.31
large          27        26     34.55      -8.55
smallcell      48        45     30.10      14.90
squamous       35        31     47.65     -16.65

The column counts lists the size of each group and is followed by the number of observed and expected events. The last column statistic is the difference between the observed and expected number of events from which the overall $\chi^2$ statistic is computed.
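To illustrate where the observed and expected columns come from, here is a minimal pure-Python sketch on made-up data. It follows the textbook log-rank construction: at every distinct event time, the expected number of events in a group is the total number of events at that time multiplied by the group's share of the subjects still at risk. The function name logrank_observed_expected and the toy data are hypothetical, and this is not the code used inside compare_survival.

```python
def logrank_observed_expected(times, events, groups):
    """Observed and expected event counts per group (textbook log-rank construction).

    times  -- observed time of each subject
    events -- True if the event occurred, False if the subject was censored
    groups -- group label of each subject
    Hypothetical helper for illustration only.
    """
    labels = sorted(set(groups))
    observed = {g: 0 for g in labels}
    expected = {g: 0.0 for g in labels}

    # Only time points at which at least one event occurred contribute.
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = {g: 0 for g in labels}
        deaths = {g: 0 for g in labels}
        for ti, ei, gi in zip(times, events, groups):
            if ti >= t:  # subject is still under observation at time t
                at_risk[gi] += 1
            if ei and ti == t:  # subject has an event exactly at time t
                deaths[gi] += 1

        n_total = sum(at_risk.values())
        d_total = sum(deaths.values())
        for g in labels:
            observed[g] += deaths[g]
            # Under the null hypothesis, events are spread across groups
            # in proportion to the number of subjects at risk.
            expected[g] += d_total * at_risk[g] / n_total

    return observed, expected


# Made-up toy data: two groups with three subjects each.
times = [1, 2, 3, 2, 4, 5]
events = [True, True, False, True, True, False]
groups = ["A", "A", "A", "B", "B", "B"]

observed, expected = logrank_observed_expected(times, events, groups)
for g in sorted(observed):
    print(g, observed[g], round(expected[g], 2), round(observed[g] - expected[g], 2))
```

The expected counts always sum to the total number of observed events, so the differences in the last column sum to zero, just as they do in the table above.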

Download

The latest version of scikit-survival can be obtained via conda or pip. Pre-built conda packages are available for Linux, OSX and Windows:

 conda install -c sebp scikit-survival

Alternatively, you can install it from source via pip:

 pip install -U scikit-survival
Sebastian Pölsterl
AI Researcher

My research interests include machine learning for time-to-event analysis, causal inference and biomedical applications.