Machine Learning Techniques for Cancer Subtype Discovery and Single-Cell RNA Sequencing Data Analysis


Cancer is an umbrella term that includes a range of disorders, from those that are aggressive and life-threatening to indolent lesions with low or delayed potential for progression to death. After 20 years of cancer screening, the chance of a person being diagnosed with prostate or breast cancer has nearly doubled. However, screening has only marginally reduced the number of patients with advanced disease, suggesting that it has caused substantial harm through excess detection and overdiagnosis. At the same time, 30 to 50% of patients with non-small cell lung cancer (NSCLC) develop recurrence and die after curative resection, suggesting that a subset of patients would have benefited from more aggressive treatments at early stages. Although not routinely recommended as the initial course of treatment, adjuvant and neoadjuvant chemotherapy have been shown to significantly improve the survival of patients with advanced early-stage disease. The ability to prognosticate outcomes would allow us to manage these diseases better: patients whose cancer is likely to advance quickly or recur would receive the necessary treatment. The key challenge is to discover the molecular subtypes of disease and subgroups of patients. To address this challenge, we develop a novel approach named Subtyping via Consensus Factor Analysis (SCFA) that can efficiently remove noisy signals from consistent molecular patterns in order to reliably identify cancer subtypes and accurately predict risk scores of patients. In an extensive analysis of 7,973 samples related to 30 cancers available from The Cancer Genome Atlas (TCGA), we demonstrate that SCFA outperforms state-of-the-art approaches in discovering novel subtypes with significantly different survival profiles. We also demonstrate that SCFA accurately predicts risk scores that strongly correlate with patient survival and vital status. More importantly, the accuracy of subtype discovery and risk prediction improves when more data types are integrated into the analysis.

More recently, advancements in single-cell RNA sequencing (scRNA-seq) have revolutionized our ability to study biological systems at the single-cell level. The widespread use of scRNA-seq across research domains such as cancer, immunology, and virology generates massive amounts of scRNA-seq data each year. However, the analysis of scRNA-seq data poses significant computational challenges due to the increasing number of cells and the presence of technical noise. First, scRNA-seq data is high-dimensional, with thousands of genes representing each cell, which makes the data difficult to visualize and comprehend. Analyzing relationships between thousands of genes and millions of cells, as required for applications such as trajectory inference or gene regulatory network inference, can be computationally demanding and time-consuming. Second, scRNA-seq data is characterized by noise and sparsity, with numerous missing values and outliers. This makes it challenging to identify consistent patterns and trends, potentially leading to false positives or false negatives in the results. Third, technical noise is often introduced during sample preparation and sequencing, stemming from low starting material and amplification procedures. Such noise introduces inconsistencies in the data, hindering comparisons across different experiments.

To address the challenges associated with scRNA-seq data mining, we establish four innovative computational methods that effectively extract biological information from the noisy and massive single-cell data. First, we introduce an analysis framework, named single-cell Decomposition using Hierarchical Autoencoder (scDHA), that reliably extracts representative information of each cell. In one joint framework, the scDHA software package conducts cell segregation through unsupervised learning, dimension reduction and visualization, cell classification, and time-trajectory inference. Second, we develop three novel imputation methods: single-cell Imputation via Sub-space Regression (scISR), single-cell Imputation using Neural Network (scINN), and single-cell Imputation using Residual Network (scIRN). These methods effectively recover missing data caused by dropout events in scRNA-seq data. We validate the performance of the four methods using extensive real-world data, including 43 scRNA-seq datasets with over a million cells. We demonstrate that the proposed methods outperform state-of-the-art techniques in several research sub-fields of scRNA-seq analysis, including cell segregation through unsupervised learning, visualization of transcriptome landscape, cell classification, and pseudo-time inference.

The dissertation is divided into three parts. In the first part, I introduce the significance of molecular subtype discovery and then detail the proposed method, SCFA, for cancer subtyping and risk prediction. In the second part, I provide an overview of scRNA-seq data, together with the opportunities and computational challenges it presents. Next, I describe the four methods we developed for single-cell analysis: scDHA, scISR, scINN, and scIRN. Each method is accompanied by extensive validation and analyses. In the third part, I summarize the dissertation and discuss future research directions that I may pursue.

ProQuest Dissertations and Theses
Duc Tran
Bioinformatics Scientist