Posts

scikit-survival 0.8 released

This release of scikit-survival 0.8 adds some nice enhancements for validating survival models. Previously, scikit-survival only supported Harrell’s concordance index to assess the performance of survival models. While it is easy to interpret and compute, it has some shortcomings:

  1. it has been shown that it is too optimistic with increasing amount of censoring 1 ,
  2. it is not a useful measure of performance if a specific time point is of primary interest (e.g. predicting 2 year survival).

scikit-survival 0.7 released

This is a long overdue maintenance release of scikit-survival 0.7 that adds compatibility with Python 3.7 and scikit-learn 0.20. For a complete list of changes see the release notes.

scikit-survival 0.6.0 released

Today, I released scikit-survival 0.6.0. This release is long overdue and adds support for NumPy 1.14 and pandas up to 0.23. In addition, the new class sksurv.util.Surv makes it easier to construct a structured array from NumPy arrays, lists, or a pandas data frame. The examples below showcase how to create a structured array for the dependent variable.

Convolutional Autoencoder as TensorFlow estimator

In my previous post, I explained how to implement autoencoders as TensorFlow Estimator. I thought it would be nice to add convolutional autoencoders in addition to the existing fully-connected autoencoder. So that’s what I did. Moreover, I added the option to extract the low-dimensional encoding of the encoder and visualize it in TensorBoard.

The complete source code is available at https://github.com/sebp/tf_autoencoder.

Why convolutions?

For the fully-connected autoencoder, we reshaped each 28x28 image to a 784-dimensional feature vector. Next, we assigned a separate weight to each edge connecting one of 784 pixels to one of 128 neurons of the first hidden layer, which amounts to 100,352 weights (excluding biases) that need to be learned during training. For the last layer of the decoder, we need another 100,352 weights to reconstruct the full-size image. Considering that the whole autoencoder consists of 222,384 weights, it is obvious that these two layers dominate other layers by a large margin. When using higher resolution images, this imbalance becomes even more dramatic.

Denoising Autoencoder as TensorFlow estimator

I recently started to use Google’s deep learning framework TensorFlow. Since version 1.3, TensorFlow includes a high-level interface inspired by scikit-learn. Unfortunately, as of version 1.4, only 3 different classification and 3 different regression models implementing the Estimator interface are included. To better understand the Estimator interface, Dataset API, and components in tf-slim, I started to implement a simple Autoencoder and applied it to the well-known MNIST dataset of handwritten digits. This post is about my journey and is split in the following sections:

  1. Custom Estimators
  2. Autoencoder network architecture
  3. Autoencoder as TensorFlow Estimator
  4. Using the Dataset API
  5. Denoising Autocendoer

I will assume that you are familiar with TensorFlow basics. The full code is available at https://github.com/sebp/tf_autoencoder. A second part on Convolutional Autoencoders is available too.

scikit-survival 0.5 released

Today, I released a new version of scikit-survival. This release adds support for the latest version of scikit-learn (0.19) and pandas (0.21). In turn, support for Python 3.4, scikit-learn 0.18 and pandas 0.18 has been dropped.

scikit-survival 0.4 released and presented at PyCon UK 2017

I’m pleased to announce that scikit-survival version 0.4 has been released.

This release adds CoxnetSurvivalAnalysis, which implements an efficient algorithm to fit Cox’s proportional hazards model with LASSO, ridge, and elastic net penalty. This allows fitting a Cox model to high-dimensional data and perform feature selection. Moreover, it includes support for Windows with Python 3.5 and later by making the cvxopt package optional.

scikit-survival 0.3 released

Today, I released a new version of scikit-survival, a Python module for survival analysis built on top of scikit-learn.

This release adds predict_survival_function and predict_cumulative_hazard_function to sksurv.linear_model.CoxPHSurvivalAnalysis, which return the survival function and cumulative hazard function using Breslow’s estimator.

Moreover, it fixes a build error on Windows (#3) and adds the sksurv.preprocessing.OneHotEncoder class, which can be used in a scikit-learn pipeline.

Containers in Research

Last week, I attended the Docker Containers for Reproducible Research Workshop hosted by the Software Sustainability Institute. Many talks were addressing how containers can be used in a high performance computing (HPC) environment. Since running the Docker daemon requires root privileges, most administrators are reluctant to allow users running Docker containers in a HPC environment. This issue as been addressed by Singularity, which is an alternative conterization technology that does not require root privileges. The nice thing is that Singularity allows importing existing Docker images, which allows you creating a Singularity container from anything that is on Docker Hub. Although I only used Docker so far, Singularity sounds like a nice technology I would like to explore in the future.

Announcing scikit-survival – a Python library for survival analysis build on top of scikit-learn

I’ve meant to do this release for quite a while now and last week I finally had some time to package everything and update the dependencies. scikit-survival contains the majority of code I developed during my Ph.D.

About Survival Analysis

Survival analysis – also referred to as reliability analysis in engineering – refers to type of problem in statistics where the objective is to establish a connections between a set of measurements (often called features or covariates) and the time to an event. The name survival analysis originates from clinical research: in many clinical studies, one is interested in predicting the time to death, i.e., survival. Broadly speaking, survival analysis is a type of regression problem (one wants to predict a continuous value), but with a twist. Consider a clinical study, which investigates coronary heart disease and has been carried out over a 1 year period as in the figure below.