Big Data Modelling: The Evolution of a Technique to Cluster Cancer Patients

Author:

Giovanni Boniolo


Date of publication: 22 September 2025
Last update: 22 September 2025

Abstract

We provide an overview of how Big Data models in oncology are structured and how patient clustering based on similarity has evolved.

The beginning

Over centuries, medical research has refined techniques for categorizing patients who exhibit the same characteristics, aiming to enhance diagnosis, prognosis, and treatment. In recent decades, this classificatory strategy has been adopted by clinical trials and evidence-based medicine, enabling drug indications and clinical guidelines to be tailored to defined patient subgroups. These groups are considered homogeneous based on specific biomarkers—whether tissue-based, cellular, or molecular. This model of stratification and patient assignment has endured even in the age of molecular medicine (see, e.g., Boniolo & Nathan, 2017), where classification increasingly relies on genomic, proteomic, metabolomic, and other 'omic' data.

However, the sheer complexity of many molecular diseases, cancer foremost among them, the overwhelming scale of molecular and clinical datasets (cf. Leonelli, 2016; Strasser, 2019), and the intrinsic individuality of patients have rendered this biomarker-driven model increasingly inadequate. As a result, an alternative paradigm has emerged: one that pivots from fixed markers to dynamic patterns of similarity.

This epistemological shift is exemplified by the work of Carlos Caldas and his team, pioneers in applying similarity-based cluster analysis to cancer research (Curtis et al., 2012; Ali et al., 2014; Bruna et al., 2016; Pereira et al., 2016; Russnes et al., 2017). Drawing on 997 breast cancer samples sourced from biobanks in the UK and Canada, they conducted comprehensive genomic and transcriptomic profiling on patients who had received standardized treatments and were monitored over a ten-year span. Their analysis identified ten predictive molecular subtypes, known as Integrative Clusters (iClusters or IntClusters). To assess the reliability of these classifications, the same method was applied to two additional datasets: one comprising approximately 1,000 breast cancer samples, the other around 7,500.

As illustrated in Figure 1, this approach allowed direct comparison with existing molecular taxonomies (e.g., PAM50) [1] and clinical outcomes. The resulting integrative classification demonstrated strong correlation with chemotherapy response, establishing a novel and predictive link between molecular stratification, therapeutic intervention, and patient prognosis. These findings were derived from studies involving breast cancer patients who had undergone adjuvant chemotherapy and for whom pathological complete response data were available.


Figure 1: Overview of the Integrative Cluster Subtypes and the Dominating Properties (from Curtis et al. 2012).

 

This success hinged on the ability to group patients by the degree of similarity among their features—an achievement made possible through the application of advanced machine learning techniques. This same principle underlies a broader methodological framework known as Patient Similarity (Brown, 2016; Pai & Bader, 2018; Parimbelli et al., 2018; Sánchez-Valle et al., 2020). This approach simultaneously addresses three critical dimensions:

  1. the massive volume of data generated by ‘-omic’ platforms and imaging technologies across thousands of patients;

  2. the extensive clinical records made accessible through electronic health systems, including diagnoses, lab results, prescriptions, therapies, treatment responses, disease trajectories, and longitudinal follow-up;

  3. contextual data relating to lifestyle and environmental exposures.

The goal is to cluster patients who share similar characteristics across these various strata. When a new patient enters the system, their data—genomic, clinical, metabolomic, and more—are analyzed to identify points of resemblance with existing clusters. Based on these computed similarities, the patient is assigned to a group, which then informs decisions about treatment strategies and prognostic expectations.

Yet this raises a crucial question: what exactly do we mean by similarity in this context? As previously noted, a patient is not grouped merely for possessing specific features in isolation. Rather, they are assigned based on how closely their features align—quantitatively and qualitatively—with those of others in a defined cluster. Still, before turning to a deeper analysis of what constitutes ‘similarity’, it is essential to first examine the role of Big Data modeling in shaping this framework.

Modelling oncological Big Data

The term “Big Data modelling” [2] broadly encompasses a set of methodologies designed to extract actionable insights from vast datasets through fully automated processes. Although the conceptual foundations of these methods were laid in the early 20th century, their practical impact has only been fully realized in recent years—driven by breakthroughs in computational power, data storage infrastructure, and sophisticated analytical techniques.

In the context of biomedicine, and oncology in particular, modelling takes multiple forms (see Benzekry, 2020). At one end of the spectrum are white-box models—also known as mechanistic or hypothesis-driven approaches—which are grounded in a priori knowledge of the biological system and constructed from first principles. These models often employ systems of differential equations in which parameters carry clear physical or physiological significance. A prominent example is the modelling of tumor growth dynamics (Benzekry et al., 2014).
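To make the white-box idea concrete, here is a minimal sketch in Python of one of the classical mechanistic tumour-growth models reviewed by Benzekry et al. (2014), the Gompertz law; the growth rate, carrying capacity, and initial volume used here are illustrative assumptions rather than fitted values.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Gompertz growth law: dV/dt = a * V * ln(K / V), where V is tumour volume,
# a the growth rate and K the carrying capacity. Every parameter has a
# direct physiological meaning, which is what makes the model "white-box".
a, K = 0.3, 1000.0   # illustrative values: 1/day and mm^3 (assumed)
V0 = 1.0             # assumed initial tumour volume in mm^3

def gompertz(t, V):
    # V is a length-1 state vector; return dV/dt with the same shape
    return [a * V[0] * np.log(K / V[0])]

sol = solve_ivp(gompertz, t_span=(0, 60), y0=[V0], t_eval=np.linspace(0, 60, 7))
for t, V in zip(sol.t, sol.y[0]):
    print(f"day {t:5.1f}: volume = {V:8.2f} mm^3")
```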

At the other end lie black-box models, typically data-driven and agnostic to the underlying biological mechanisms. These approaches infer patterns strictly from empirical associations between inputs and outputs, often without yielding insight into the internal logic of the system. Their parameters may lack any physical or biological interpretation. Machine learning and deep learning exemplify this paradigm. While efforts are underway to improve the interpretability of such models by incorporating domain knowledge (see Kelly et al., 2019; AlQuraishi & Sorger, 2021), they are still widely regarded as “black boxes.” This discussion centers specifically on black-box models designed to navigate cancer-related Big Data.

Cancer represents a paradigmatic case of biological complexity—marked by pronounced inter-patient and inter-tissue heterogeneity in molecular characteristics, pathological behavior, therapeutic responsiveness, and survival trajectories. In tandem with this complexity, advances in sequencing technologies have enabled high-resolution molecular profiling across multiple layers of biological information: genome, transcriptome, epigenome, proteome, and beyond. As a result, vast datasets have emerged, including The Cancer Genome Atlas (Tomczak et al., 2015), the Pan-Cancer Analysis of Whole Genomes (Gerstung et al., 2020), and experimental platforms like the Genomics of Drug Sensitivity in Cancer (Iorio et al., 2016) and the Cancer Cell Line Encyclopedia (Ghandi et al., 2019). Machine learning techniques are uniquely positioned to exploit these resources—whether derived from bulk tissue analyses or single-cell data—to disentangle cancer’s intrinsic complexity (Eraslan et al., 2019; Boniolo et al., 2021).

Yet despite the promise of these methods, biomedical Big Data models remain only sporadically integrated into clinical workflows. This disconnect arises in part from the rigorous validation protocols required for preclinical and clinical adoption (Bekisz & Geris, 2020). Just as crucial, however, is the cultural hesitation among clinicians to trust algorithmic tools in bedside decision-making. Moreover, fundamental concerns regarding the interpretability, fairness, and security of these models continue to occupy the center of technical and regulatory debate. Current efforts to develop systems that are explainable (Holzinger et al., 2017), free from algorithmic bias (Chen et al., 2020), and protective of patient privacy (Kaissis et al., 2020) offer a potential path forward.

These innovations could enable the widespread implementation of computational support in diagnostics, prognostics, and treatment planning—particularly within the framework of Molecular Tumor Boards. In such settings, interdisciplinary teams composed of clinicians, molecular biologists, and computational scientists collaborate to interpret a patient’s molecular profile and make informed therapeutic decisions (Kato et al., 2020).

Patient similarity as distance between points

To illustrate the mechanics of Big Data modelling and cluster analysis, one may turn to an instructive example drawn from Brown (2016), which elucidates the precise sense in which two sets of biomedical data may be deemed similar. Within this framework, each patient is mathematically represented as a vector situated in a high-dimensional statistical space, with each dimension corresponding to a distinct biomedical variable—be it molecular, clinical, or otherwise.

Given two such vectors, z and z′, each encoding the biomedical profile of an individual patient, the degree of similarity between them is quantified by computing their distance within this multidimensional space. This distance, denoted d(z,z′), reflects how closely aligned the two patients are in terms of their feature sets. One common metric for this purpose is cosine similarity, which assesses the cosine of the angle between the two vectors, thereby capturing the directional congruence of their respective data profiles regardless of magnitude:

$$\mathrm{sim}(z, z') \;=\; \cos\theta \;=\; \frac{z \cdot z'}{\lVert z \rVert \,\lVert z' \rVert} \;=\; \frac{\sum_{i} z_i \, z'_i}{\sqrt{\sum_{i} z_i^{2}} \; \sqrt{\sum_{i} {z'_i}^{2}}}$$

where z · z′ is the scalar (dot) product of the two vectors, ∥z∥ and ∥z′∥ are the magnitudes of z and z′ respectively, and zi and z′i are the components of the two vectors, each representing a molecular or clinical datum. Thus, similarity is defined in terms of the cosine of the angle between the two vectors. If the vectors (i.e. the patients) are completely dissimilar, they point in opposite directions; the angle between them is 180° and the cosine similarity equals –1. If the vectors (i.e. the patients) are identical, the angle is 0° and the cosine similarity equals 1. In general, given a benchmark dataset (the vector z) associated with a paradigmatic patient, one can assess its similarity to another dataset (the vector z′, associated with the patient to be evaluated) by computing the cosine similarity. Importantly, this similarity reflects how close the respective clinical features are on average.
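As a minimal numerical illustration, the following Python sketch computes the cosine similarity between two hypothetical patient vectors; the five feature values are invented purely for the example.

```python
import numpy as np

# Two hypothetical patient profiles: each component stands for a molecular
# or clinical variable; the numbers are purely illustrative.
z  = np.array([0.8, 1.2, 0.0, 3.1, 0.5])   # benchmark (paradigmatic) patient
z2 = np.array([0.9, 1.0, 0.2, 2.8, 0.4])   # patient to be evaluated

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(z, z2))   # 1 = identical direction, -1 = opposite
```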

When the number of patients far exceeds two, as is common in biomedical Big Data models, cosine similarity becomes less adequate. In such cases, alternative distance measures d(z, z′) between two vectors z and z′ with components zi and z′i, each representing a datum of a given patient, are more appropriate for partitioning the patient population into similarity clusters. For example, variants of the Minkowski distance are used:

$$d(z, z') \;=\; \left( \sum_{i} \lvert z_i - z'_i \rvert^{\,p} \right)^{1/p}$$

where the parameter p determines the type of distance: setting p equal to 1, 2, or letting p → ∞ yields the Manhattan distance, the Euclidean distance, and the Chebyshev distance, respectively. Thus, choosing the notion of distance, i.e., the kind of similarity, means choosing a different way of clustering patients. Likewise, choosing the number of clusters means choosing a different way of clustering patients (see Fig. 3).

https://s3.eu-west-1.amazonaws.com/media.oncopedia.it/subjects/big-data-health/big-data-modelling-the-evolution-of-a-technique-to-cluster-cancer-patients/figure-3.jpg#center

Figure 3: Three different ways (b, c, d) of clustering the same set of points (a), i.e. patients (From Tan et al., 2017, 529)
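Returning to the distance formula above, a short Python sketch with the same two invented patient vectors shows how the choice of p changes the computed distance, and therefore the notion of similarity used to cluster patients.

```python
import numpy as np

# The same two illustrative patient vectors used for the cosine example.
z  = np.array([0.8, 1.2, 0.0, 3.1, 0.5])
z2 = np.array([0.9, 1.0, 0.2, 2.8, 0.4])

def minkowski(u, v, p):
    """Minkowski distance: (sum_i |u_i - v_i|^p)^(1/p)."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

print("Manhattan (p = 1):", minkowski(z, z2, 1))
print("Euclidean (p = 2):", minkowski(z, z2, 2))
print("Chebyshev (p -> inf):", np.max(np.abs(z - z2)))  # limit case
```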

 

To comprehend the conceptual foundations of Big Data modelling in the context of machine learning and artificial intelligence, one might consider a prototypical scenario: the development of an AI-powered Big Data system designed to support cancer detection in medical imaging, such as digital mammography or magnetic resonance imaging. The system’s core objective is to “train” a model to distinguish reliably between normal and pathological tissue, thereby enabling automated diagnostic assessments that are fast, precise, and reproducible.

This training begins with a substantial, annotated dataset composed of medical images that have been labeled by expert radiologists—some designated “cancer,” others “no cancer.” Through repeated exposure to these labeled examples, the model learns to associate distinct visual patterns with diagnostic categories, progressively acquiring the capacity to generalize such associations across novel, unseen cases.

Each time the model errs in classification, it adjusts its internal parameters—the variables that govern its interpretation of visual features. For example, it may learn that dense, asymmetrical masses located in specific anatomical zones are frequently indicative of malignancy. Through iterative refinement, the model tunes its decision boundaries, gradually becoming sensitive to subtle but diagnostically significant morphological cues.
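A schematic Python sketch of this kind of error-driven supervised training is given below. It is only a stand-in for the real pipeline: the feature vectors are synthetic rather than extracted from actual mammograms, and a simple logistic regression replaces whatever model a real system would use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the annotated training set: each row is a vector of
# image-derived features (e.g. mass density, asymmetry); the labels encode
# the radiologists' annotations, 1 = "cancer", 0 = "no cancer".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 0.5, size=(100, 4)),    # "cancer" cases
               rng.normal(0.0, 0.5, size=(100, 4))])   # "no cancer" cases
y = np.array([1] * 100 + [0] * 100)

# Fitting adjusts the model's internal parameters (here, the coefficients of
# a logistic regression) so as to reduce classification errors on the examples.
model = LogisticRegression().fit(X, y)

# The trained model can then generalise to a new, unseen case.
new_case = rng.normal(0.9, 0.5, size=(1, 4))
print(model.predict(new_case), model.predict_proba(new_case))
```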

Upon conclusion of the training phase—during which critical elements such as the similarity metric and the number of clusters are calibrated—the model is ready to be deployed for inference on new patient data. At this stage, the original training cohort of N patients (i.e., the reference population) is abstracted into a high-dimensional dataset that captures clinically meaningful features, including imaging biomarkers, genomic profiles, and laboratory results (see Fig. 4). During training, the model’s learned parameters partition this feature space into discrete regions, effectively clustering patients according to shared phenotypic or molecular traits. This constitutes the model’s initial predictive framework.
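As a rough illustration of this partitioning step, the following Python sketch clusters a synthetic reference cohort with k-means; the clustering algorithm, the number of clusters, and the data are all assumptions made for the example, since the text does not prescribe a specific method.

```python
import numpy as np
from sklearn.cluster import KMeans

# Reference population: N hypothetical patients, each described by a
# high-dimensional feature vector (imaging, genomic, laboratory features).
rng = np.random.default_rng(1)
X_reference = rng.normal(size=(500, 20))   # N = 500 patients, 20 features

# Partition the feature space into k regions of mutually similar patients.
# The number of clusters k is itself a modelling choice (cf. Fig. 3).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_reference)

print(kmeans.labels_[:10])              # cluster membership of the first patients
print(kmeans.cluster_centers_.shape)    # k centroids spanning the feature space
```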

Two methodological elements are especially consequential at this juncture. First, the clustering structure is acutely sensitive to the choice of similarity metric—whether cosine, Manhattan, Euclidean, or Chebyshev—each of which privileges different aspects of the multidimensional feature space. Second, the validation of the model’s output is typically conducted using internal statistical criteria such as accuracy, precision, and sensitivity, often evaluated through cross-validation or held-out test sets. Optimizing these metrics is essential to ensure the model’s predictive robustness. A validated clustering is presumed to represent a biologically or clinically coherent stratification of the patient population, offering a plausible map of disease heterogeneity within the reference dataset.

The second phase of prediction involves applying the trained model to a new, individual patient—the target system. This patient’s data are projected into the existing feature space and assigned to one of the pre-established clusters using the same similarity metric employed during training. This assignment underpins the model’s predictive inference: clinicians may extrapolate diagnostic judgments, anticipate likely disease progression, or identify appropriate therapeutic strategies by examining the clinical profiles of patients belonging to the same cluster. This is the primary predictive output of the model and marks the culmination of the inference pipeline in Big Data modelling rooted in similarity and machine learning (cf. Boniolo, Boniolo & Valente, 2023; Boniolo, Campaner & Carrara, 2023).
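A final sketch, again on synthetic data and with k-means as the assumed clustering method, illustrates this assignment step: the new patient's vector is projected into the same feature space and attached to the nearest pre-established cluster with the same (Euclidean) metric used during training.

```python
import numpy as np
from sklearn.cluster import KMeans

# Reference cohort clustered as in the previous sketch (synthetic data).
rng = np.random.default_rng(1)
X_reference = rng.normal(size=(500, 20))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_reference)

# The target system: a new patient's profile projected into the same space.
new_patient = rng.normal(size=(1, 20))

# Assign the patient to the nearest pre-established centroid, using the same
# Euclidean metric employed during training; cluster peers then inform
# diagnosis, prognosis, and treatment choices for the new case.
distances = np.linalg.norm(kmeans.cluster_centers_ - new_patient, axis=1)
print("assigned to cluster", int(np.argmin(distances)))
print(kmeans.predict(new_patient))      # scikit-learn's equivalent shortcut
```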


Figure 4: The modelling and prediction process in the case of Big Data Models

References

Ali RH et al. 2014. Genome-driven Integrated Classification of Breast Cancer Validated in over 7,500 Samples. Genome Biology 15: 431.

AlQuraishi LM, Sorger PK. 2021. Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms. Nature Methods 18: 1169-1180.

Bekisz S, Geris L. 2020. Cancer modeling: From mechanistic to data-driven approaches, and from fundamental insights to clinical applications. Journal of Computational Science. 46.

Benzekry S et al. 2014. Classical Mathematical Models for Description and Prediction of Experimental Tumor Growth. PLoS Computational Biology 10: e1003800.

Benzekry S. 2020. Artificial Intelligence and Mechanistic Modeling for Clinical Decision Making in Oncology. Clinical Pharmacology and Therapeutics, 108: 471-486.

Boniolo F, Boniolo G, Valente G. 2023. Prediction via Similarity: Biomedical Big Data and the Case of Cancer Models. Philos. Technol. 36. https://doi.org/10.1007/s13347-023-00608-9.

Boniolo F et al. 2021. Artificial intelligence in early drug discovery enabling precision medicine. Expert Opin Drug Discov. 2:1-17.

Boniolo G, Campaner R, Carrara M. 2023. Patient Similarity in the Era of Precision Medicine: A Philosophical Analysis. Erkenntnis, 88: 2911–2932.

Boniolo G, Nathan M.J. eds. 2017. Philosophy of Molecular Medicine. London: Routledge.

Brown SA. 2016. Patient Similarity: Emerging Concepts in Systems and Precision Medicine, Frontiers in Physiology 7: doi:10.3389/fphys.2016.00561.

Bruna A et al. 2016. A Biobank of Breast Cancer Explants with Preserved Intra-tumor Heterogeneity to Screen Anticancer Compounds. Cell 167: 260–74.

Chen IY et al. 2020. Ethical Machine Learning in Healthcare. Annual Review of Biomedical Data Science. 4.

Curtis C et al. 2012. The Genomic and Transcriptomic Architecture of 2,000 Breast Tumours Reveals Novel Subgroups. Nature 486: 346-52.

Durán JM. 2018. Computer Simulations in Science and Engineering, Springer.

Eraslan G et al. 2019. Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics 20:389-403.

Gerstung M et al. 2020. The evolutionary history of 2,658 cancers. Nature, 578: 122-128.

Ghandi M et al. 2019. Next-generation characterization of the cancer cell line encyclopedia. Nature. 569: 503-508.

Holzinger A et al. 2017. What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923.

Iorio F et al. 2016. A landscape of pharmacogenomic interactions in cancer. Cell. 166:740-754.

Kaissis GA et al. 2020. Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence. 2:305-311.

Kato S et al. 2020. Real-world data from a molecular tumor board demonstrates improved outcomes with a precision N-of-One strategy. Nature Communications. 11:1-9.

Kelly CJ et al. 2019. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 17, 195. https://doi.org/10.1186/s12916-019-1426-2.

Leonelli S. 2016. Data-Centric Biology: a Philosophical Study. Chicago University Press.

Luo J et al. 2016. Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII-S31559.

Pai S, Bader GD. 2018. Patient Similarity Networks for Precision Medicine. Journal of Molecular Biology. https://doi.org/10.1016/j.jmb.2018.05.037.

Parimbelli E et al. 2018. Patient Similarity for Precision Medicine: A Systematic Review. Journal of Biomedical Informatics, doi: https://doi.org/10.1016/j.jbi.2018.06.001.

Pereira B et al. 2016. The Somatic Mutation Profiles of 2,433 Breast Cancers Refine their Genomic and Transcriptomic Landscapes. Nature Communications 7: 11479, doi:10.1038/ncomms11479.

Russnes H et al. 2017. Breast Cancer Molecular Stratification: from Intrinsic Subtypes to Integrative Clusters. American Journal of Pathology 187: 2152-62.

Sánchez-Valle J et al. 2020. Interpreting molecular similarity between patients as a determinant of disease comorbidity relationships. Nat Commun 11, 2854. https://doi.org/10.1038/s41467-020-16540-x

Tan P-N et al. 2017. Introduction to Data Mining. Addison-Wesley, Second Edition.

Tomczak K, Czerwińska P, Wiznerowicz M. 2015. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology.