scikit-survival

PyCon UK, Cardiff

Sebastian Pölsterl

29 October 2017

**Survival Analysis** is often used when studying:

- Time of death or recurrence of cancer patients.
- Time between marriage and divorce.
- Duration of unemployment.
- Life span of a machine or device.

The objective in **survival analysis** is to establish a connection between covariates/features and the **time of an event**. **But**: parts of the training data can only be partially observed; they are **censored**.

- A record is **uncensored** if an event was observed during the study period: the exact time is known.
- A record is **right censored** if a patient remained event-free during the study period: it is unknown whether an event occurred afterwards.

**scikit-survival** is a module for survival analysis built on top of scikit-learn.

- Allows easy mix-and-match with scikit-learn classes.
- It is mainly a tool for research – it originates from the Prostate Cancer DREAM challenge.

Formally, each record consists of

- a *d*-dimensional vector *x* of covariates, and
- the time *t* > 0 when an event occurred, or the time *c* > 0 of censoring.

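The observable time *y* is defined as *y* = min(*t*, *c*), together with an event indicator *δ* that is 1 if *y* = *t* (event observed) and 0 if *y* = *c* (censored). In scikit-survival, the veterans' lung cancer data can be loaded in this form: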
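As a minimal sketch of how observable times arise under right censoring, the snippet below uses made-up toy values and a hypothetical helper `observe`. It assumes a simulation setting in which both the event time and the censoring time are known; in real data only the smaller of the two is ever recorded.

```python
# Hypothetical toy records: (event time t, censoring time c), in days.
# In real studies only min(t, c) is observed; both are given here
# purely to illustrate the definition.
records = [(72.0, 100.0), (150.0, 90.0), (16.0, 16.5)]


def observe(t, c):
    """Return (y, delta): observable time and event indicator."""
    delta = t <= c   # True if the event happened before censoring
    y = min(t, c)    # the time that is actually recorded
    return y, delta


observed = [observe(t, c) for t, c in records]
```

The second record is right censored: its observable time is the censoring time, and the event indicator is `False`.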

```
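from sksurv.datasets import load_veterans_lung_cancer

# data_x: covariates; data_y: structured array with the fields
# "Status" (event indicator) and "Survival_in_days" (observed time)
data_x, data_y = load_veterans_lung_cancer()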
```

| Age | Cell type | Karnofsky score | Months from Diagnosis | Prior therapy? | Treatment | Survival in days | Dead? |
|---|---|---|---|---|---|---|---|
| 69 | 'squamous' | 60 | 7 | 'no' | 'standard' | 72 | True |
| 53 | 'smallcell' | 39 | 4 | 'yes' | 'standard' | 16 | True |
| 57 | 'adeno' | 99 | 3 | 'no' | 'test' | 83 | False |

```
import matplotlib.pyplot as plt
from sksurv.nonparametric import kaplan_meier_estimator

# Estimate and plot one survival curve per treatment group
for group in ("standard", "test"):
    mask = data_x["Treatment"] == group
    time, surv_prob = kaplan_meier_estimator(
        data_y["Status"][mask],
        data_y["Survival_in_days"][mask])
    plt.step(time, surv_prob, where="post",
             label="Treatment = {}".format(group))
plt.legend()
```
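To make clear what such an estimate computes, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimator. The function `kaplan_meier` and the toy data are illustrative assumptions; this is not the scikit-survival implementation, it only mirrors the argument order (events first, then times) used above.

```python
def kaplan_meier(events, times):
    """Product-limit estimate of the survival function.

    events: iterable of bools (True = event observed, False = censored)
    times:  iterable of observed times, same length as events
    Returns (unique_times, survival_probabilities).
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv = 1.0
    out_times, out_probs = [], []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        n_events = 0   # events observed exactly at time t
        n_leaving = 0  # all records (events + censored) at time t
        while i < len(pairs) and pairs[i][0] == t:
            n_events += int(pairs[i][1])
            n_leaving += 1
            i += 1
        # Multiply by the conditional probability of surviving past t
        surv *= 1.0 - n_events / n_at_risk
        out_times.append(t)
        out_probs.append(surv)
        # Everyone observed at t leaves the risk set afterwards
        n_at_risk -= n_leaving
    return out_times, out_probs


# Hypothetical toy data: five records, two of them censored
toy_times, toy_surv = kaplan_meier(
    [True, True, False, True, False], [1, 2, 2, 3, 4])
```

Censored records lower the number at risk without causing a drop in the curve, which is how the estimator uses partially observed data instead of discarding it.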

```
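import matplotlib.pyplot as plt
import pandas as pd
from sksurv.preprocessing import OneHotEncoder
from sksurv.linear_model import CoxPHSurvivalAnalysis

# One-hot encode categorical covariates, then fit a Cox model
encoder = OneHotEncoder()
estimator = CoxPHSurvivalAnalysis()
estimator.fit(encoder.fit_transform(data_x), data_y)

# New patients with the same columns as data_x (contents elided)
data_new_raw = pd.DataFrame(…)
data_new = encoder.transform(data_new_raw)

# One predicted survival curve per new patient
pred_curves = estimator.predict_survival_function(data_new)
for curve in pred_curves:
    plt.step(curve.x, curve.y, where="post")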
```
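As a reminder of what the fitted model represents, the standard Cox proportional hazards formulation can be written as follows. This is textbook notation, not taken from the slides: the baseline hazard $h_0$, the baseline survival function $S_0$, and the coefficient vector $\beta$ are symbols introduced here.

```latex
h(t \mid x) = h_0(t) \exp(x^\top \beta),
\qquad
S(t \mid x) = S_0(t)^{\exp(x^\top \beta)}
```

Under this model, covariates scale the baseline hazard by a constant factor over time, so each predicted curve can be read as an estimate of $S(t \mid x)$ for one new record.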

```
from sksurv.datasets import load_breast_cancer
from sksurv.preprocessing import OneHotEncoder
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_breast_cancer()
Xt = OneHotEncoder().fit_transform(X)
cv = KFold(n_splits=5, shuffle=True, random_state=328)

# Fit once on all data to obtain the sequence of alphas;
# l1_ratio=1.0 selects a pure L1 (LASSO) penalty
coxnet = CoxnetSurvivalAnalysis(n_alphas=100,
    l1_ratio=1.0, alpha_min_ratio=0.01).fit(Xt, y)

# Evaluate each alpha on its own with 5-fold cross-validation
gcv = GridSearchCV(coxnet,
    {"alphas": [[v] for v in coxnet.alphas_]},
    cv=cv).fit(Xt, y)
```
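For orientation, the elastic-net penalized Cox objective that such a model minimizes is commonly written as below. This is a sketch in generic notation, with $\mathrm{PL}$ the Cox partial likelihood, $\alpha$ the penalty strength, and $r$ standing for `l1_ratio`; the exact scaling constants used by the library are an assumption here and may differ.

```latex
\hat{\beta} = \arg\min_{\beta} \;
  -\frac{1}{n} \log \mathrm{PL}(\beta)
  + \alpha \left( r \lVert \beta \rVert_1
  + \frac{1 - r}{2} \lVert \beta \rVert_2^2 \right)
```

With $r = 1$, as in the code above, the squared $\ell_2$ term vanishes and only the sparsity-inducing $\ell_1$ penalty remains, so the grid search is effectively choosing how many features the model keeps.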

**scikit-survival** includes implementations of more advanced methods:

- Accelerated Failure Time Model
- Gradient Boosting
- Survival Support Vector Machine
- Ensemble methods

**scikit-survival** is available for Python 3.4 and later on Linux, OSX, and Windows.

Install via Anaconda:

`conda install -c sebp scikit-survival`

or via pip:

`pip install scikit-survival`

Source code: github.com/sebp/scikit-survival