Posts on Sebastian Pölsterl (https://k-d-w.org/post/)

- <a href="https://k-d-w.org/blog/2019/09/scikit-survival-0.10-released/" target="_blank">scikit-survival 0.10 released</a> (Mon, 02 Sep 2019)
<p>This release of <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival</a> adds two features that
are standard in most software for survival analysis, but were missing so far:</p>
<ol>
<li><a href="https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.linear_model.CoxPHSurvivalAnalysis.html#sksurv.linear_model.CoxPHSurvivalAnalysis" target="_blank">CoxPHSurvivalAnalysis</a>
now has a <code>ties</code> parameter that allows you to choose between Breslow’s
and Efron’s likelihood for handling tied event times. Previously, only
Breslow’s likelihood was implemented and it remains the default.
If you have many tied event times in your data, you can now select
Efron’s likelihood with <code>ties="efron"</code> to get better estimates of the
model’s coefficients.</li>
<li>A <a href="https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.compare.compare_survival.html#sksurv.compare.compare_survival" target="_blank">compare_survival</a>
function has been added. It performs a log-rank test to assess whether survival functions of two or more groups differ.</li>
</ol>
- <a href="https://k-d-w.org/blog/2019/07/survival-analysis-for-deep-learning/" target="_blank">Survival Analysis for Deep Learning</a> (Mon, 29 Jul 2019)
<p>Most machine learning algorithms have been developed to perform classification or regression. However, in clinical research we often want to estimate the time to an event, such as death or recurrence of cancer, which leads to a special type of learning task that is distinct from classification and regression. This task is termed <em>survival analysis</em>, but it is also referred to as time-to-event analysis or reliability analysis.
Many machine learning algorithms have been adopted to perform survival analysis:
<a href="https://scholar.google.com/scholar?oi=bibs&cluster=18092275419152143443" target="_blank">Support Vector Machines</a>,
<a href="https://scholar.google.com/scholar?cluster=16319510831191377024" target="_blank">Random Forest</a>,
or <a href="https://scholar.google.com/scholar?cluster=14069073471114367075" target="_blank">Boosting</a>.
It has only been recently that survival analysis entered the era of deep learning, which is the focus of this post.</p>
<p>You will learn how to train a convolutional neural network to predict the time to a (generated) event from MNIST images, using a loss function specific to survival analysis. The <a href="#primer-on-survival-analysis">first part</a> will cover some basic terms and quantities used in survival analysis (feel free to skip this part if you are already familiar with them). In the <a href="#generating-synthetic-survival-data-from-mnist">second part</a>, we will generate synthetic survival data from MNIST images and visualize it. In the <a href="#cox-s-proportional-hazards-model">third part</a>, we will briefly revisit the most popular survival model of them all and learn how it can be used as a loss function for training a neural network.
<a href="#creating-a-convolutional-neural-network-for-survival-analysis-on-mnist">Finally</a>, we put all the pieces together and train a convolutional neural network on MNIST and predict survival functions on the test data.</p>
- <a href="https://k-d-w.org/blog/2019/07/scikit-survival-0.9-released/" target="_blank">scikit-survival 0.9 released</a> (Sat, 27 Jul 2019)
<p>This release of <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival</a> adds support for scikit-learn 0.21 and pandas 0.24, among a couple of other smaller fixes. Please see the <a href="https://scikit-survival.readthedocs.io/en/latest/release_notes.html" target="_blank">release notes</a> for a full list of changes. If you are using scikit-survival in your research, you can now cite it using a <a href="https://zenodo.org/record/3352343" target="_blank">Digital Object Identifier (DOI)</a>.</p>
- <a href="https://k-d-w.org/blog/2019/05/evaluating-survival-models/" target="_blank">Evaluating Survival Models</a> (Sat, 04 May 2019)
<p>The most frequently used evaluation metric of survival models is the concordance index (c-index, c-statistic). It is a measure of rank correlation between predicted risk scores $\hat{f}$ and observed time points $y$ that is closely related to <a href="https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient" target="_blank">Kendall’s τ</a>. It is defined as the ratio of correctly ordered (concordant) pairs to comparable pairs. Two samples $i$ and $j$ are comparable if the sample with the lower observed time $y$ experienced an event, i.e., if $y_j > y_i$ and $\delta_i = 1$, where $\delta_i$ is a binary event indicator. A comparable pair $(i, j)$ is concordant if the risk $\hat{f}$ estimated by a survival model is higher for the subject with the lower survival time, i.e., $\hat{f}_i > \hat{f}_j \land y_j > y_i$; otherwise the pair is discordant. Harrell’s estimator of the c-index is implemented in <a href="https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.metrics.concordance_index_censored.html#sksurv.metrics.concordance_index_censored" target="_blank">concordance_index_censored</a>.</p>
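<p>For intuition, the definition can be translated directly into a few lines of Python. This is a didactic O(n²) sketch that ignores tied survival times; use <code>concordance_index_censored</code> for real evaluations:</p>

```python
import numpy as np

def harrell_cindex(event, time, risk):
    """Harrell's concordance index, following the definition above.

    A pair (i, j) with time[i] < time[j] is comparable if subject i
    experienced an event; it is concordant if risk[i] > risk[j].
    Tied risk scores count as 0.5, as in Harrell's estimator.
    """
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

event = np.array([True, True, False, True])
time = np.array([1.0, 3.0, 4.0, 5.0])
risk = np.array([4.0, 3.0, 2.0, 1.0])  # higher risk, earlier event
print(harrell_cindex(event, time, risk))  # perfectly concordant: 1.0
```

Note that the censored subject at time 4.0 never forms a comparable pair as the earlier member, which is exactly where the optimism discussed below comes from.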
<p>While Harrell’s concordance index is easy to interpret and compute, it has some shortcomings:</p>
<ol>
<li>it has been shown that it is too optimistic with an increasing amount of censoring <a href="https://dx.doi.org/10.1002/sim.4154" target="_blank">[1]</a>,</li>
<li>it is not a useful measure of performance if a specific time range is of primary interest (e.g. predicting death within 2 years).</li>
</ol>
<p>Since version 0.8, <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival</a> supports an alternative estimator of the concordance index for right-censored survival data, implemented in <a href="https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.metrics.concordance_index_ipcw.html#sksurv.metrics.concordance_index_ipcw" target="_blank">concordance_index_ipcw</a>, which addresses the first issue.</p>
<p>The second point can be addressed by extending the well-known receiver operating characteristic curve (ROC curve) to possibly censored survival times. Given a time point $t$, we can estimate how well a predictive model can distinguish subjects who will experience an event by time $t$ (sensitivity) from those who will not (specificity). The function <a href="https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.metrics.cumulative_dynamic_auc.html#sksurv.metrics.cumulative_dynamic_auc" target="_blank">cumulative_dynamic_auc</a> implements an estimator of the cumulative/dynamic area under the ROC curve for a given list of time points.</p>
<p>The <a href="#bias-harrels-cindex">first part</a> of this post will illustrate the first issue with simulated survival data, while the <a href="#timeroc">second part</a> will focus on the time-dependent area under the ROC applied to data from a real study.</p>
- <a href="https://k-d-w.org/blog/2019/05/scikit-survival-0.8-released/" target="_blank">scikit-survival 0.8 released</a> (Wed, 01 May 2019)
<p>This release of <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival 0.8</a> adds some nice enhancements for validating survival models.
Previously, <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival</a> only supported <a href="https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.metrics.concordance_index_censored.html#sksurv.metrics.concordance_index_censored" target="_blank">Harrell’s concordance index</a> to assess the performance of survival models. While it is easy to interpret and compute, it has some shortcomings:</p>
<ol>
<li>it has been shown that it is too optimistic with an increasing amount of censoring<sup><a href="#RefUno2011">1</a></sup>,</li>
<li>it is not a useful measure of performance if a specific time point is of primary interest (e.g. predicting 2-year survival).</li>
</ol>
- <a href="https://k-d-w.org/blog/2019/02/scikit-survival-0.7-released/" target="_blank">scikit-survival 0.7 released</a> (Wed, 27 Feb 2019)
<p>This is a long overdue maintenance release of <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival 0.7</a> that adds compatibility with Python 3.7 and scikit-learn 0.20. For a complete list of changes see the <a href="https://scikit-survival.readthedocs.io/en/latest/release_notes.html" target="_blank">release notes</a>.</p>
- <a href="https://k-d-w.org/blog/2018/10/scikit-survival-0.6.0-released/" target="_blank">scikit-survival 0.6.0 released</a> (Sun, 07 Oct 2018)
<p>Today, I released <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival 0.6.0</a>. This release is long overdue and adds support for NumPy 1.14 and pandas up to 0.23. In addition, the new class <a href="https://scikit-survival.readthedocs.io/en/latest/generated/sksurv.util.Surv.html#sksurv.util.Surv" target="_blank">sksurv.util.Surv</a> makes it easier to construct a structured array from NumPy arrays, lists, or a pandas data frame. The examples below showcase how to create a structured array for the dependent variable.</p>
- <a href="https://k-d-w.org/blog/2018/02/convolutional-autoencoder-as-tensorflow-estimator/" target="_blank">Convolutional Autoencoder as TensorFlow Estimator</a> (Sun, 25 Feb 2018)
<p>In my previous <a href="https://k-d-w.org/node/103" target="_blank">post</a>, I explained how to implement autoencoders as a TensorFlow <code>Estimator</code>. I thought it would be nice to add convolutional autoencoders in addition to the existing fully-connected autoencoder. So that’s what I did. Moreover, I added the option to extract the low-dimensional encoding of the encoder and visualize it in TensorBoard.</p>
<p>The complete source code is available at <a href="https://github.com/sebp/tf_autoencoder" target="_blank">https://github.com/sebp/tf_autoencoder</a>.</p>
<h2 id="why-convolutions">Why convolutions?</h2>
<p>For the fully-connected autoencoder, we reshaped each 28x28 image to a 784-dimensional feature vector. Next, we assigned a separate weight to each edge connecting one of 784 pixels to one of 128 neurons of the first hidden layer, which amounts to 100,352 weights (excluding biases) that need to be learned during training. For the last layer of the decoder, we need another 100,352 weights to reconstruct the full-size image. Considering that the whole autoencoder consists of 222,384 weights, it is obvious that these two layers dominate other layers by a large margin. When using higher resolution images, this imbalance becomes even more dramatic.</p>
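<p>A quick back-of-the-envelope count makes the imbalance concrete (the 3×3 kernel with 32 filters is an illustrative choice, not the exact architecture from the repository):</p>

```python
# one weight per edge between the 784 input pixels and 128 hidden neurons
dense_params = 28 * 28 * 128
print(dense_params)  # 100352

# a 3x3 convolution with 32 filters on a single-channel image shares its
# weights across all spatial locations, so the count does not grow with
# the image resolution
conv_params = 3 * 3 * 1 * 32
print(conv_params)  # 288
```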
- <a href="https://k-d-w.org/blog/2017/12/denoising-autoencoder-as-tensorflow-estimator/" target="_blank">Denoising Autoencoder as TensorFlow Estimator</a> (Fri, 22 Dec 2017)
<p>I recently started to use Google’s deep learning framework TensorFlow. Since version 1.3, TensorFlow includes a <a href="https://www.tensorflow.org/get_started/estimator" target="_blank">high-level interface</a> inspired by scikit-learn. Unfortunately, as of version 1.4, only 3 classification and 3 regression models implementing the <code>Estimator</code> interface are included. To better understand the <code>Estimator</code> interface, the <code>Dataset</code> API, and the components in <a href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim" target="_blank">tf-slim</a>, I started to implement a simple autoencoder and applied it to the well-known MNIST dataset of handwritten digits. This post is about my journey and is split into the following sections:</p>
<ol>
<li><a href="#estimators">Custom Estimators</a></li>
<li><a href="#autoencoder-net">Autoencoder network architecture</a></li>
<li><a href="#autoencoder-model-fn">Autoencoder as TensorFlow Estimator</a></li>
<li><a href="#dataset-api">Using the Dataset API</a></li>
<li><a href="#denoising-autoencoder">Denoising Autoencoder</a></li>
</ol>
<p>I will assume that you are familiar with TensorFlow basics. The full code is available at <a href="https://github.com/sebp/tf_autoencoder" target="_blank">https://github.com/sebp/tf_autoencoder</a>.
A second part on <a href="https://k-d-w.org/node/107" target="_blank">Convolutional Autoencoders</a> is available too.</p>
- <a href="https://k-d-w.org/blog/2017/12/scikit-survival-0.5-released/" target="_blank">scikit-survival 0.5 released</a> (Sat, 09 Dec 2017)
<p>Today, I released a new version of <a href="https://github.com/sebp/scikit-survival" target="_blank">scikit-survival</a>. This release adds support for the latest versions of scikit-learn (0.19) and pandas (0.21). In turn, support for Python 3.4, scikit-learn 0.18, and pandas 0.18 has been dropped.</p>