
In medicine and biology, the availability and dimensionality of multi-omics data are increasing. Initiatives such as The Cancer Genome Atlas (TCGA) have collected and made available data from more than 20,000 patient cases across more than 30 cohorts of cancer. For specific diseases, however, there are far fewer cases (\(n\)) than there are features (\(p\)); this is often referred to as a “big p, little n” problem, \(p > n\). Analyses are further complicated by the right-censored nature of time-to-event data, as there may be cases where death or recurrence occurs after the latest follow-up time. Modern techniques in machine learning (ML) modeling can be effective tools for knowledge and hypothesis generation in this \(p > n\) setting, but intelligent decisions in dimensionality reduction and feature engineering (FE) are crucial to their effectiveness (Shi et al. 2019; Rendleman et al. 2019; Kuhn and Johnson 2020, Chapter 10). To this end, the exploration, tuning, and comparison of FE approaches are necessary. When evaluating FE techniques for modeling, best practice involves cross validation (CV) or another resampling technique to estimate generalization (extra-sample) performance. To provide more reliable holdout (HO)-based model performance estimates, we propose a novel sampling procedure: representative random sampling (RRS). RRS is a special case of continuous bin stratification that minimizes significant relationships between random HO groupings (or CV folds) and a continuous outcome. Monte Carlo simulations used to evaluate RRS on synthetic molecular data indicated that RRS-based HO (RRHO) yields statistically significant reductions in error and bias when compared with standard HO; similarly, more consistent reductions are observed with RRS-based CV. While resampling approaches are the ideal choice for performance estimation with limited data, RRHO can enable more reliable exploratory feature engineering than standard HO.
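The exact RRS procedure is specified later in the paper; as a rough illustration of the underlying idea of continuous bin stratification, the sketch below sorts cases by a continuous outcome, cuts the sorted order into consecutive bins, and scatters each bin's cases across the HO/CV groups so that every group spans the full outcome range. The helper name `representative_assign` is hypothetical, and this simple sorted-binning scheme is only one plausible instantiation, not the paper's definitive method.

```python
import numpy as np

def representative_assign(y, n_groups, seed=None):
    """Assign each case to one of n_groups via continuous bin stratification.

    Cases are sorted by the continuous outcome y, the sorted order is cut
    into consecutive bins of n_groups cases, and group labels are randomly
    permuted within each bin, so every group spans the full range of y.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(y)                      # case indices, sorted by outcome
    groups = np.empty(len(y), dtype=int)
    for start in range(0, len(y), n_groups):
        bin_cases = order[start:start + n_groups]
        groups[bin_cases] = rng.permutation(n_groups)[:len(bin_cases)]
    return groups

# Example: a representative 80/20 holdout (5 groups, group 0 held out)
y = np.random.default_rng(0).lognormal(size=100)       # e.g., survival times
groups = representative_assign(y, n_groups=5, seed=1)
held_out = groups == 0
print(y[held_out].mean(), y[~held_out].mean())         # similar by construction
```

Under such an assignment, a test of association between group labels and \(y\) should rarely reach significance, which is the property RRS targets.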

High-dimensional cancer data can be burdensome to analyze, with complex relationships between molecular measurements, clinical diagnostics, and treatment outcomes. Data-driven computational approaches may be key to identifying relationships with potential clinical or research use. To this end, reliable comparison of feature engineering approaches in their ability to support machine learning survival modeling is crucial.

With the limited number of cases often present in multi-omics datasets (“big p, little n,” or many features, few subjects), a resampling approach such as cross validation (CV) would provide robust model performance estimates at the cost of flexibility in intermediate assessments and exploration of feature engineering approaches. A holdout (HO) estimation approach, however, would permit this flexibility at the expense of reliability.
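The toy comparison below is not drawn from the paper; it merely illustrates this tradeoff with ridge regression on synthetic \(p > n\) data (rather than survival models): `cross_val_score` averages over five train/test rotations, while the single split's estimate depends on which cases happen to be held out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# A small "big p, little n" regression problem (p >> n)
X, y = make_regression(n_samples=60, n_features=500, n_informative=10,
                       noise=10.0, random_state=0)

model = Ridge(alpha=10.0)

# Resampling estimate: 5-fold CV, averaged over folds
cv_score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# Holdout estimate: a single random 80/20 split; cheaper and more flexible
# for iterative FE exploration, but sensitive to the particular split drawn
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
ho_score = model.fit(X_tr, y_tr).score(X_te, y_te)

print(f"CV estimate: {cv_score:.3f}, single-holdout estimate: {ho_score:.3f}")
```

Re-running the split with different `random_state` values makes the holdout estimate's variability visible; this split-to-split sensitivity is the unreliability that RRS aims to mitigate.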
