Introduction to Survival Analysis with
scikit-survival

PyCon UK, Cardiff

Sebastian Pölsterl

29 October 2017

Applications

Survival analysis is often used to study:

  • Time to death or recurrence in cancer patients.
  • Time between marriage and divorce.
  • Duration of unemployment.
  • Life span of a machine or device.

What is Survival Analysis?

  • The objective in survival analysis is to establish a connection between covariates/features and the time of an event.
  • But: Parts of the training data can only be partially observed – they are censored.

Censoring

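  • A record is uncensored if an event was observed during the study period:
    • the exact time of the event is known.
  • A record is right censored if the patient remained event-free until the end of the study period:
    • it is unknown whether or when an event occurred afterwards; only a lower bound on the event time is known.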
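
The two cases above can be sketched in plain Python. This is only an illustration: the study end date, the patient records, and the helper to_survival_record are made up for the example and are not part of scikit-survival.

```python
from datetime import date

STUDY_END = date(2017, 12, 31)  # hypothetical end of the study period

# Hypothetical records: (enrolment date, event date or None if no event was seen).
patients = [
    (date(2017, 1, 10), date(2017, 3, 1)),   # event observed
    (date(2017, 2, 1), None),                # still event-free at study end
    (date(2017, 6, 15), date(2017, 9, 30)),  # event observed
]


def to_survival_record(enrolled, event_date, study_end=STUDY_END):
    """Return (event_observed, time_in_days) for one patient."""
    if event_date is not None and event_date <= study_end:
        # Uncensored: the exact time of the event is known.
        return True, (event_date - enrolled).days
    # Right censored: we only know the patient was event-free this long.
    return False, (study_end - enrolled).days


records = [to_survival_record(*p) for p in patients]
for observed, days in records:
    print("event" if observed else "censored", days)
```

The second patient contributes a time of 270+ days in spirit: the recorded value is a lower bound on the true event time, not the event time itself.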

scikit-survival

  • scikit-survival is a module for survival analysis built on top of scikit-learn.
  • Allows easy mix-and-match with scikit-learn classes.
  • It is mainly a tool for research – it originates from the Prostate Cancer DREAM challenge.

Survival Data

Formally, each record consists of

  • a d-dimensional vector x of covariates, and
  • the time t > 0 when an event occurred
  • or the time c > 0 of censoring.

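The observable time y and the event indicator δ are defined as:

$$ y = \min(t, c) = \begin{cases} t & \text{if } \delta = 1 \text{ (event observed)}, \\ c & \text{if } \delta = 0 \text{ (censored)}. \end{cases} $$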

Example: Lung Cancer Dataset

from sksurv.datasets import load_veterans_lung_cancer

data_x, data_y = load_veterans_lung_cancer()
Age  Cell type    Karnofsky score  Months from Diagnosis  Prior therapy?  Treatment   Survival in days  Dead?
69   'squamous'   60               7                      'no'            'standard'  72                True
53   'smallcell'  39               4                      'yes'           'standard'  16                True
57   'adeno'      99               3                      'no'            'test'      83                False

Is the new drug effective?

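import matplotlib.pyplot as plt
from sksurv.nonparametric import kaplan_meier_estimator

# Estimate one survival curve per treatment group.
for group in ("standard", "test"):
    mask = data_x["Treatment"] == group
    # Arguments: event indicator first, then observed time.
    time, surv_prob = kaplan_meier_estimator(
        data_y["Status"][mask],
        data_y["Survival_in_days"][mask])

    plt.step(time, surv_prob, where="post",
             label="Treatment = {}".format(group))

plt.legend()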

Figure: Kaplan-Meier plot
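
To make clear what a Kaplan-Meier estimate computes, here is a minimal pure-Python sketch of the product-limit formula. It is not the scikit-survival implementation; the function name kaplan_meier and the toy data are assumptions made for this example.

```python
def kaplan_meier(events, times):
    """Product-limit estimate of the survival function.

    events: True if the event was observed, False if censored.
    times:  observed time for each record.
    Returns the sorted distinct times and the survival probability at each.
    """
    surv = 1.0
    out_times, out_probs = [], []
    for t in sorted(set(times)):
        # Everyone whose observed time is at least t is still at risk at t.
        at_risk = sum(1 for u in times if u >= t)
        # Only uncensored records at time t count as events.
        n_events = sum(1 for e, u in zip(events, times) if e and u == t)
        surv *= 1.0 - n_events / at_risk
        out_times.append(t)
        out_probs.append(surv)
    return out_times, out_probs


# Toy data: five records, two of them censored.
events = [True, True, False, True, False]
times = [5, 8, 8, 12, 15]
km_times, km_probs = kaplan_meier(events, times)
for t, p in zip(km_times, km_probs):
    print(t, round(p, 3))
```

Censored records lower the number at risk at later times but never cause a drop in the curve themselves, which is how the estimator uses partially observed data.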

Predicting survival curves

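import matplotlib.pyplot as plt
import pandas as pd
from sksurv.preprocessing import OneHotEncoder
from sksurv.linear_model import CoxPHSurvivalAnalysis

# One-hot encode the categorical columns, then fit a Cox model.
encoder = OneHotEncoder()
estimator = CoxPHSurvivalAnalysis()
estimator.fit(encoder.fit_transform(data_x), data_y)

# New patients are encoded with the already fitted encoder.
data_new_raw = pd.DataFrame(…)
data_new = encoder.transform(data_new_raw)

# One predicted step function per new patient.
pred_curves = estimator.predict_survival_function(data_new)
for curve in pred_curves:
    plt.step(curve.x, curve.y, where="post")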

Figure: Cox model
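
As a rough sketch of why a Cox model yields one curve per patient: under the proportional hazards assumption, a patient's survival function is the baseline survival function raised to the power exp(x·β). The baseline values, coefficients, and feature names below are invented for illustration and are not fitted to the lung cancer data; how scikit-survival estimates the baseline is not shown here.

```python
import math

# Hypothetical values for illustration only; not fitted to any data set.
baseline_survival = [(0, 1.0), (30, 0.9), (90, 0.7), (180, 0.5)]  # (time, S0(t))
coefficients = {"age": 0.02, "treatment_test": -0.5}


def predict_survival_curve(features):
    """Return [(time, S(time | features))] using S(t|x) = S0(t) ** exp(x . beta)."""
    risk_score = sum(coefficients[name] * value for name, value in features.items())
    hazard_ratio = math.exp(risk_score)
    return [(t, s0 ** hazard_ratio) for t, s0 in baseline_survival]


curve_standard = predict_survival_curve({"age": 60, "treatment_test": 0})
curve_test = predict_survival_curve({"age": 60, "treatment_test": 1})
for (t, s_std), (_, s_test) in zip(curve_standard, curve_test):
    print(t, round(s_std, 3), round(s_test, 3))
```

Because the assumed coefficient for treatment_test is negative, the second curve lies on or above the first at every time point; the curves differ in level but share the baseline's shape.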

Integration with scikit-learn

from sksurv.datasets import load_breast_cancer
from sksurv.preprocessing import OneHotEncoder
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_breast_cancer()
Xt = OneHotEncoder().fit_transform(X)

cv = KFold(n_splits=5, shuffle=True, random_state=328)
# Fit once on all data to obtain the path of candidate penalties (alphas_).
coxnet = CoxnetSurvivalAnalysis(n_alphas=100,
    l1_ratio=1.0, alpha_min_ratio=0.01).fit(Xt, y)

# Cross-validate each penalty strength separately.
gcv = GridSearchCV(coxnet,
    {"alphas": [[v] for v in coxnet.alphas_]},
    cv=cv).fit(Xt, y)

Figure: Cross-validation scores
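
When no scoring function is passed, GridSearchCV relies on the estimator's own score method, which for scikit-survival models is a concordance index. The sketch below is a simplified pure-Python version of that idea, assuming no tied survival times; the function name and toy data are made up for this example and it is not the library's implementation.

```python
def concordance_index(events, times, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly.

    A pair (i, j) is comparable if record i had an observed event strictly
    before the observed time of record j. It is concordant if the record
    with the earlier event also has the higher predicted risk.
    Simplification: ties in time are not treated specially.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count half
    return concordant / comparable


events = [True, False, True, True]
times = [2, 4, 6, 8]
print(concordance_index(events, times, [0.9, 0.3, 0.5, 0.1]))  # perfect ranking
print(concordance_index(events, times, [0.9, 0.3, 0.1, 0.5]))  # one pair wrong
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ordering, and censored records only take part as the later member of a pair.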

Advanced Methods

scikit-survival includes implementations of more advanced methods:

  • Accelerated Failure Time Model
  • Gradient Boosting
  • Survival Support Vector Machine
  • Ensemble methods

Conclusion

scikit-survival is available for Python 3.4 and later on Linux, OSX, and Windows.

Install via Anaconda:

conda install -c sebp scikit-survival

or via pip:

pip install scikit-survival

Source code: github.com/sebp/scikit-survival