FLAGSHIP 1
From genes to cell states: controlling cellular programs
While each human cell contains only about 20,000 protein-coding genes, it is the complex regulatory networks among these genes that create the wide variety of cell types and functions. Genetic perturbation is a powerful method for analyzing the function of individual genes and their interactions with one another. Recent technological advances have enabled genome-wide perturbation screens, where individual genes are ablated in single cells, with imaging and sequencing readouts. When paired with appropriate computational paradigms, these advances open a unique window onto cellular programs: they make it possible to catalog all cellular programs in a multi-modal fashion (leveraging imaging and sequencing data), to understand the processes underlying cell state transitions, and to design precise perturbations that control these processes and induce any desired cell state transition.
Towards this, novel theoretical, algorithmic and computational paradigms are needed for:
- integrating diverse, unpaired, multi-modal data to enable the prediction of missing modalities and translation between different cell lines and species;
- guiding the experimental design by predicting the outcome of unseen perturbations and identifying, through active learning, the most informative perturbations to test experimentally;
- identifying regulatory (causal) networks within and among cells.
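As a rough illustration of the active-learning item above, the sketch below ranks untested perturbations by how much a set of models disagrees about their outcome, on the assumption that high disagreement marks an informative experiment. The ensemble, the gene names (GENE_A, GENE_B, GENE_C) and the two-dimensional outcome vectors are all hypothetical, and ensemble variance is only one of many possible acquisition criteria; it is not presented as the method used in this flagship.

```python
from statistics import pvariance


def select_most_informative(ensemble_predictions, k=1):
    """Rank untested perturbations by ensemble disagreement.

    ensemble_predictions maps a perturbation name to a list of predictions,
    one per model in a hypothetical ensemble. Each prediction is a list of
    floats, for example predicted cell-state proportions.
    Returns the k highest-scoring names and the full score table.
    """
    scores = {}
    for name, predictions in ensemble_predictions.items():
        # Transpose so that each tuple holds one output dimension across models,
        # then sum the across-model variance over all dimensions.
        per_dimension = zip(*predictions)
        scores[name] = sum(pvariance(values) for values in per_dimension)
    # Highest disagreement first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return ranked[:k], scores


# Hypothetical predictions from three models for three candidate knockouts.
candidates = {
    "GENE_A": [[0.6, 0.4], [0.6, 0.4], [0.6, 0.4]],    # models agree
    "GENE_B": [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],    # models disagree strongly
    "GENE_C": [[0.7, 0.3], [0.6, 0.4], [0.65, 0.35]],  # mild disagreement
}
chosen, scores = select_most_informative(candidates, k=1)
print(chosen)
```

In a real loop, the chosen perturbation would be tested experimentally, the result added to the training data, and the models refit before the next round of selection.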
We also launched our inaugural machine learning challenge in 2023, the Cancer Immunotherapy Machine Learning Challenge.
To measure our progress on these problems, we have chosen to focus on the following biological challenges aligned with ongoing initiatives at the Broad:
- Determining cell intrinsic factors to induce desired cell state changes ex vivo: In collaboration with Evan Rosen’s, Melina Claussnitzer’s, Paul Blainey’s and Devavrat Shah’s groups, we aim to define the transcription factor combinations that, when perturbed in human adipose-derived mesenchymal stem cells (AMSCs) ex vivo, give rise to the adipocyte subtypes observed in vivo. Leveraging large-scale single-cell RNA-seq/multiome and LipocyteProfiler single-cell imaging data on patient-derived AMSCs throughout the ex vivo differentiation period, we will generate a multi-modal catalog of the cellular programs involved in adipocyte (trans-)differentiation. This catalog will help us understand the paths that cells can take between AMSCs and different adipocyte subtypes.
- Determining cell intrinsic factors to induce desired cell state changes in vivo: In collaboration with Nir Hacohen’s, Ramnik Xavier’s and Munther Dahleh’s groups, we will map the full regulatory landscape of CD8 T cell subtypes/states. To achieve this, we are using single-cell Perturb-seq: single-gene knockouts are introduced in isolated T cells, the cells are re-infused in vivo, and the transcriptomic effects of each knockout are then read out. Our goal is to identify the optimal combination of genetic perturbations to obtain the T cell population ratio needed for the resolution of different diseases, with an initial focus on enhancing cancer immunotherapy.
- Testing generalizability through systematic experiments in cell lines: To support and complement the two projects described above, and to test the generalizability of new algorithms across a wide variety of perturbations and contexts, systematic experiments in cell lines are needed. In collaboration with DepMap, Jason Buenrostro, Xiao Wang, and the Carpenter/Singh lab, we are generating high-content transcriptomic, epigenetic and phenotypic data, collected at the single-cell level and at different time points, on cell lines from the Cancer Cell Line Encyclopedia exposed to single-drug treatments. These data will provide the basis for comprehensively cataloging cellular programs in a multi-modal fashion.
ML competitions: To benchmark the algorithms currently available in the field, we launched our first ML challenge in 2023, the Cancer Immunotherapy Machine Learning Challenge. Participants were asked to develop methods and algorithms that predict the effects of gene perturbations on the distribution of T cell subpopulations and that identify perturbations predicted to shift the T cell composition to a specified ratio. For training, we provided an in-house generated single-cell Perturb-seq dataset. This dataset was obtained by isolating naive T cells, subjecting them to single-gene knockouts for 70 genes, and then re-injecting them into a melanoma mouse model for ten days before sequencing. We have experimentally validated an additional 60 gene knockouts to evaluate the submitted algorithms. Some predictions were promising and biologically relevant, but the results also highlighted the complexity of the problem. While we can achieve measurable progress, solving these prediction problems requires iterative rounds of new experiments and new ML developments by an interdisciplinary team of researchers.
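To make the second task concrete, the sketch below ranks candidate knockouts by how close their predicted T cell composition is to a target ratio. It is a minimal illustration under stated assumptions: the state names (progenitor, effector, exhausted), the gene names, the predicted counts and the use of an L1 distance are all hypothetical choices, and the scoring actually used in the challenge is not described here.

```python
def normalize(counts):
    """Convert per-state cell counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {state: count / total for state, count in counts.items()}


def l1_distance(p, q):
    """Sum of absolute differences between two state distributions."""
    states = set(p) | set(q)
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)


def rank_by_target(predicted_counts, target):
    """Sort perturbations from closest to farthest from the target ratio."""
    distances = {
        gene: l1_distance(normalize(counts), target)
        for gene, counts in predicted_counts.items()
    }
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(distances.items(), key=lambda item: (item[1], item[0]))


# Hypothetical target composition and model-predicted cell counts per knockout.
target = {"progenitor": 0.5, "effector": 0.3, "exhausted": 0.2}
predicted_counts = {
    "GENE_A": {"progenitor": 50, "effector": 30, "exhausted": 20},
    "GENE_B": {"progenitor": 10, "effector": 20, "exhausted": 70},
    "GENE_C": {"progenitor": 40, "effector": 40, "exhausted": 20},
}
ranking = rank_by_target(predicted_counts, target)
for gene, distance in ranking:
    print(f"{gene}: L1 distance to target = {distance:.2f}")
```

Other distances between distributions could be substituted for `l1_distance` without changing the rest of the sketch.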
Much as the prominent ImageNet Challenge did for the problem of object recognition, we aim to expand our first ML challenge to help solve the problem of predicting the effect of genetic perturbations. To this end, we will develop a large-scale genetic perturbation database for both training and testing of ML algorithms. Following the example of CASP, which was instrumental in the protein structure prediction revolution, we will perform new experiments aligned with our flagships every year, thereby providing the means to objectively test and benchmark algorithms for predicting the effect of new perturbations. This Cell Perturbation Prediction Challenge (CPPC) will establish the state of the art in perturbation prediction, identify what progress has been made, and highlight where future research may be most productively focused.