Novel Techniques for Single-cell RNA Sequencing Data Imputation and Clustering

Abstract

Advances in single-cell technologies have shifted genomics research from the analysis of bulk tissues toward a comprehensive characterization of individual cells. These approaches reveal the heterogeneity and complexity of cellular systems. By resolving the signatures and functions of distinct cell types, single-cell technologies have not only deepened our understanding of fundamental biological processes but also opened new avenues for disease diagnostics and therapeutic interventions.

The applications of single-cell technologies extend beyond basic research, with significant implications for precision medicine, drug discovery, and regenerative medicine. By capturing the cellular heterogeneity within tumors, these methods have shed light on the mechanisms of tumor evolution, metastasis, and therapy resistance. Additionally, they have facilitated the identification of rare cell populations with specialized functions, such as stem cells and tissue-resident immune cells, which hold great promise for cell-based therapies.

However, one of the major challenges in analyzing single-cell RNA sequencing (scRNA-seq) data is the prevalence of dropouts: instances where a gene's expression is not detected even though the gene is expressed in the cell. Dropouts occur due to technical limitations and can introduce excessive noise into the data, obscuring the true biological signals. As a result, imputation methods are used to estimate missing values and reduce the impact of dropouts on downstream analyses. Furthermore, the high dimensionality of scRNA-seq data makes it harder to partition cell populations effectively. Thus, robust computational approaches are required to overcome these challenges and extract meaningful biological insights from single-cell data.
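
To make the dropout problem concrete, the following is a minimal sketch on toy data, not one of the methods proposed in this thesis. It assumes a small simulated count matrix with two cell groups, zeroes out expressed values at random to mimic dropouts, and then fills zeros with a naive nearest-neighbour average. All function names and parameters (`simulate_counts`, `apply_dropout`, `knn_impute`, `rate`, `k`) are hypothetical and chosen only for illustration.

```python
import random

def simulate_counts(n_cells, n_genes, rng):
    """Toy 'true' expression: two cell groups, each expressing half the genes."""
    half = n_genes // 2
    matrix = []
    for c in range(n_cells):
        group = c % 2
        row = []
        for g in range(n_genes):
            active = (g < half) if group == 0 else (g >= half)
            row.append(rng.randint(5, 15) if active else 0)
        matrix.append(row)
    return matrix

def apply_dropout(matrix, rate, rng):
    """Set each expressed value to zero with probability `rate` (assumed dropout model)."""
    return [[0 if (v > 0 and rng.random() < rate) else v for v in row]
            for row in matrix]

def knn_impute(matrix, k):
    """Replace each zero with the mean of that gene over the k nearest cells."""
    def dist(a, b):
        # Squared Euclidean distance between two cells' observed profiles.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    imputed = []
    for i, row in enumerate(matrix):
        others = [j for j in range(len(matrix)) if j != i]
        neighbours = sorted(others, key=lambda j: dist(row, matrix[j]))[:k]
        new_row = []
        for g, v in enumerate(row):
            if v == 0:
                new_row.append(sum(matrix[j][g] for j in neighbours) / k)
            else:
                new_row.append(v)  # observed non-zero values are left untouched
        imputed.append(new_row)
    return imputed

def count_zeros(matrix):
    return sum(v == 0 for row in matrix for v in row)

rng = random.Random(0)
true_counts = simulate_counts(n_cells=20, n_genes=10, rng=rng)
observed = apply_dropout(true_counts, rate=0.3, rng=rng)
imputed = knn_impute(observed, k=3)
print(count_zeros(true_counts), count_zeros(observed), count_zeros(imputed))
```

A naive scheme like this cannot tell a dropout from a gene that is truly not expressed, so it may also fill in biological zeros when a cell's neighbours belong to a different group. That kind of bias is one reason more careful statistical and learning-based imputation is needed.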

Numerous imputation and clustering methods have been developed specifically to address the unique challenges of scRNA-seq data analysis. These methods aim to reduce the impact of dropouts and high dimensionality, allowing for accurate cell population partitioning and the discovery of meaningful biological insights. While these methods have unquestionably advanced the field of single-cell transcriptomics, they are not without limitations. Some methods are computationally intensive, resulting in scalability issues with large datasets, whereas others may introduce biases or overfit the data, potentially affecting the accuracy of subsequent analyses. Furthermore, the performance of these methods can vary depending on the dataset's complexity and heterogeneity. As a result, ongoing research is required to improve existing methodologies and create new algorithms that address these limitations while retaining robustness and accuracy in scRNA-seq data analysis.

In this work, we propose three imputation approaches that incorporate statistical and deep learning frameworks. These approaches robustly reconstruct the gene expression matrix, mitigating dropout effects and reducing noise. This enhances the recovery of true biological signals from scRNA-seq data and makes better use of the transcriptomic profiles of single cells. In addition, we introduce a clustering method that uses scRNA-seq data to identify cellular subpopulations. The method combines dimensionality reduction and network fusion algorithms to generate a cell similarity graph. This approach accounts for both local and global structure within the data, enabling the discovery of rare and previously unidentified cell populations.

We plan to assess the imputation and clustering methods through rigorous benchmarking against existing state-of-the-art techniques on simulated data and more than 30 real scRNA-seq datasets. We will show that the imputed data generated by our methods can enhance the quality of downstream analyses. We will also demonstrate that our clustering algorithm accurately identifies cell populations and is capable of analyzing large datasets.

In conclusion, this thesis proposes alternative approaches to advance the current state of scRNA-seq data analysis by developing innovative imputation and clustering methods that enable a more comprehensive and accurate characterization of cellular subpopulations. These advancements potentially have broad applicability in diverse research fields, including developmental biology, immunology, and oncology, where understanding cellular heterogeneity is crucial.

Publication
ProQuest Dissertations and Theses
Bang Tran
Assistant Professor

My research interests include single-cell imputation and single-cell analysis.