Algorithms for Large-scale Learning from Heterogeneous Survival Data


Many countries face ever-growing government expenditures for health care, which many seek to lower by employing electronic health records. Electronic health records systematically collect patients’ past and current treatments with the aim of lowering administrative overhead and identifying inadequate treatments. Moreover, access to large collections of clinical data creates an opportunity for clinical research. However, analyzing health records is often very challenging: first, they comprise a large and heterogeneous set of patient data, and second, they consist of variables collected from a wide range of sources, such as medications, allergies, biomarkers, medical images, and genetic markers, each of which offers a different partial view of a patient’s state. Systematic analysis of such data is far beyond human capabilities and calls for machine learning techniques.

This thesis develops machine learning methods for predicting the time to an adverse event based on heterogeneous and high-dimensional health records. First, I introduce an improved training algorithm for the survival support vector machine that builds upon state-of-the-art methods in convex optimization to avoid the high time and space complexity of previous training algorithms. Experimental results on synthetic and real-world data demonstrate that the proposed optimization scheme allows analyzing datasets at least an order of magnitude larger than was feasible with previous techniques. Second, I study dimensionality reduction in a comparative analysis of 19 feature extraction and feature selection methods. Whereas feature selection methods for learning from heterogeneous, high-dimensional feature vectors are well investigated, little work has focused on feature extraction methods for survival analysis. I propose utilizing random survival forests to address two of the main problems encountered with feature extraction methods based on spectral embedding: 1) neighborhood graph construction and 2) out-of-sample extension. Experiments reveal that the proposed solution represents similarities between patients better than the standard Euclidean distance and that feature extraction methods are a valuable alternative to feature selection methods, except when the number of available samples is low (fewer than 500). Finally, I describe heterogeneous survival ensembles, which aggregate a wide range of survival models to leverage the diversity in available models. The success of such a model is evidenced by the fact that it was among the winning methods of the Prostate Cancer DREAM challenge.
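
As a rough illustration of the aggregation idea behind heterogeneous survival ensembles, the sketch below combines the risk scores of several models by averaging their per-patient ranks. This is an assumption made for illustration only: the abstract does not state how the thesis aggregates models, and the function names `rank_scores` and `aggregate_risk` are hypothetical. Rank averaging is used here because different survival models may output scores on incomparable scales.

```python
def rank_scores(scores):
    """Return 1-based ranks of the scores; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the current run of tied scores.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def aggregate_risk(predictions):
    """Average per-model ranks (hypothetical aggregation rule, not the thesis' method).

    predictions: one list of risk scores per model, all for the same patients
    in the same order. A higher aggregated value means higher predicted risk.
    """
    n_patients = len(predictions[0])
    if any(len(p) != n_patients for p in predictions):
        raise ValueError("all models must score the same patients")
    ranked = [rank_scores(p) for p in predictions]
    return [sum(r[i] for r in ranked) / len(ranked) for i in range(n_patients)]


# Two hypothetical models scoring three patients on different scales.
combined = aggregate_risk([[0.1, 0.5, 0.9], [10, 30, 20]])
print(combined)
```

In this toy input, the two models disagree on which of the last two patients is at higher risk, so the averaged ranks place them level with each other and above the first patient.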

Sebastian Pölsterl
Post-Doctoral Researcher

My research interests include machine learning for time-to-event analysis, non-Euclidean data, and biomedical applications.