scikit-survival 0.17 released

This release adds support for scikit-learn 1.0, which includes support for feature names. If you pass a pandas dataframe to fit, the estimator will set a feature_names_in_ attribute containing the feature names. When a dataframe is passed to predict, the estimator checks that the column names are consistent with those passed to fit. The example below illustrates this feature.

For a full list of changes in scikit-survival 0.17.0, please see the release notes.

Installation

Pre-built conda packages are available for Linux, macOS, and Windows via

 conda install -c sebp scikit-survival

Alternatively, scikit-survival can be installed from source following these instructions.

Feature Names Support

Prior to scikit-survival 0.17, you could pass a pandas dataframe to estimators’ fit and predict methods, but the estimator was oblivious to the feature names accessible via the dataframe’s columns attribute. With scikit-survival 0.17, and thanks to scikit-learn 1.0, feature names will be considered when a dataframe is passed.

Let’s illustrate feature names support using the Veteran’s Lung Cancer dataset.

from sksurv.datasets import load_veterans_lung_cancer

X, y = load_veterans_lung_cancer()

X.head(3)
   Age_in_years  Celltype  Karnofsky_score  Months_from_Diagnosis  Prior_therapy  Treatment
0          69.0  squamous             60.0                    7.0             no   standard
1          64.0  squamous             70.0                    5.0            yes   standard
2          38.0  squamous             60.0                    3.0             no   standard

The original data has 6 features, three of which contain strings, which we encode as numeric using OneHotEncoder.

from sksurv.preprocessing import OneHotEncoder

transform = OneHotEncoder()
Xt = transform.fit_transform(X)

Transformers now have a get_feature_names_out() method, which returns the names of the features after the transformation.

transform.get_feature_names_out()
array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell',
       'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis',
       'Prior_therapy=yes', 'Treatment=test'], dtype=object)
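The names above follow a simple pattern: numeric columns keep their name, and each categorical column contributes one column=category entry for every category except the first. The sketch below reproduces that pattern in plain Python. Note that encoded_feature_names is a hypothetical helper, not part of scikit-survival, and the category lists (including adeno as the dropped first cell type) are assumptions inferred from the output above.

```python
def encoded_feature_names(columns):
    """Build output names from a mapping of column name -> category list.

    A value of None marks a numeric column, which keeps its name unchanged.
    """
    names = []
    for column, categories in columns.items():
        if categories is None:
            names.append(column)
        else:
            # The first category acts as the reference level and gets no column.
            names.extend(f"{column}={category}" for category in categories[1:])
    return names


# Assumed category order for the three string-valued columns.
columns = {
    "Age_in_years": None,
    "Celltype": ["adeno", "large", "smallcell", "squamous"],
    "Karnofsky_score": None,
    "Months_from_Diagnosis": None,
    "Prior_therapy": ["no", "yes"],
    "Treatment": ["standard", "test"],
}

names = encoded_feature_names(columns)
print(names)
```

This is only meant to make the naming convention visible; for real data, rely on get_feature_names_out() of the fitted transformer.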

The transformed data returned by OneHotEncoder is again a dataframe, which can be used to fit Cox’s proportional hazards model.

from sksurv.linear_model import CoxPHSurvivalAnalysis

model = CoxPHSurvivalAnalysis().fit(Xt, y)

Since we passed a dataframe, the feature_names_in_ attribute will contain the column names of the dataframe used when calling fit.

model.feature_names_in_
array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell',
       'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis',
       'Prior_therapy=yes', 'Treatment=test'], dtype=object)

These names are used during prediction to check that the data matches the format of the training data. For instance, when passing a raw numpy array instead of a dataframe, a warning will be issued.

pred = model.predict(Xt.values)
UserWarning: X does not have valid feature names, but CoxPHSurvivalAnalysis was fitted with feature names
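If the data only exists as a numpy array, one way to avoid this warning is to wrap the array in a dataframe that carries the fitted column names before calling predict. The sketch below uses a short list as a stand-in for model.feature_names_in_ and the three numeric columns of the rows shown earlier; both are placeholders for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for model.feature_names_in_ (numeric columns only, for brevity).
feature_names_in = ["Age_in_years", "Karnofsky_score", "Months_from_Diagnosis"]

# A raw array whose columns are assumed to be in the same order as during fit.
X_array = np.array([
    [69.0, 60.0, 7.0],
    [64.0, 70.0, 5.0],
    [38.0, 60.0, 3.0],
])

# Attach the names so the estimator can validate them at prediction time.
X_named = pd.DataFrame(X_array, columns=feature_names_in)
print(X_named.columns.tolist())
```

With the fitted model from above, the equivalent call would be pd.DataFrame(Xt.values, columns=model.feature_names_in_). This is only safe when the array's columns really are in the training order, because attaching names does not reorder any values.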

The estimator also checks that the order of the columns matches the order seen during fit.

import pandas as pd

# Move the Age_in_years column from the first to the last position.
X_reordered = pd.concat(
  (Xt.drop("Age_in_years", axis=1), Xt.loc[:, "Age_in_years"]),
  axis=1
)
pred = model.predict(X_reordered)
FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.
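When all columns are present but in a different order, selecting them by the names stored during fit restores the expected layout. A minimal sketch, with a short list standing in for model.feature_names_in_ and a small hand-built dataframe in place of the real data:

```python
import pandas as pd

# Stand-in for model.feature_names_in_ (numeric columns only, for brevity).
feature_names_in = ["Age_in_years", "Karnofsky_score", "Months_from_Diagnosis"]

# Same columns as during fit, but with Age_in_years moved to the end.
X_reordered = pd.DataFrame({
    "Karnofsky_score": [60.0, 70.0, 60.0],
    "Months_from_Diagnosis": [7.0, 5.0, 3.0],
    "Age_in_years": [69.0, 64.0, 38.0],
})

# Selecting by label puts the columns back into the order used during fit.
X_fixed = X_reordered.loc[:, feature_names_in]
print(X_fixed.columns.tolist())
```

For the example above, this would correspond to model.predict(X_reordered.loc[:, model.feature_names_in_]). If a column is missing entirely, the label-based selection raises a KeyError instead of silently continuing.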

For more details on feature names support, have a look at the scikit-learn release highlights.

Sebastian Pölsterl
Post-Doctoral Researcher

My research interests include machine learning for time-to-event analysis, non-Euclidean data, and biomedical applications.
