Day 1: Wednesday, April 30
1 -- Tavor Baharav
Statistical Integration of Bulk and Single-Cell Sequencing for Improved TCR Repertoire Analysis
The human adaptive immune system relies on T-cells, which through the combinatorial process of V(D)J recombination generate an estimated 10^11 possible T-cell receptors (TCRs), known as clonotypes. Characterizing the distribution and temporal dynamics of clonotypes within an individual’s TCR repertoire is a critical challenge in immunology, for example in developing neoantigen cancer vaccines. The vast number of T-cells (around 10^10) and clonotypes (10^6 to 10^8) in human adults makes this a difficult task. Sequencing can only measure around 10^4 cells. Single-cell sequencing provides paired alpha and beta chains for each cell, while bulk sequencing offers only marginal counts for each alpha and beta chain comprising a TCR, losing pairing information and complicating clonotype distribution estimation.
In this work we demonstrate that integrating bulk and single-cell data can dramatically improve clonotype estimation. We formulate a joint likelihood and show that computing the Maximum Likelihood Estimator (MLE) is a convex optimization problem, solvable efficiently with standard techniques. On synthetic data, the MLE for 10,000 cells' worth of single-cell data alone attains the same estimation error as the MLE for 1,600 cells' worth of single-cell data combined with 10,000 cells of bulk data. Experimental validation is challenging, as bulk and single-cell sequencing are typically performed independently, so we are collaborating with wet lab researchers to test our approach on new ground truth data. Ongoing work focuses on extending this statistical approach to a new well-based sequencing protocol that has the potential to attain a best-of-both-worlds solution, combining the cost-effectiveness and scalability of bulk sequencing with the pairing capability of single-cell sequencing. Overall, these approaches could significantly improve clonotype estimation by leveraging simple and inexpensive bulk sequencing, marking a first step towards a statistically unified TCR analysis framework.
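As a rough illustration of what such a joint likelihood can look like, the sketch below assumes a simple multinomial model that is not taken from the abstract: paired single-cell counts `S[a, b]` are drawn from a clonotype matrix `P`, while bulk alpha counts `u` and bulk beta counts `v` are drawn from its row and column sums. Under that assumption the log-likelihood is concave in `P`, and a plain EM iteration (used here in place of a generic convex solver) climbs it. All function and variable names are hypothetical.

```python
import numpy as np

def joint_log_likelihood(P, S, u, v):
    """Log-likelihood (up to constants) of paired counts S and bulk marginal counts u, v."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    # Only cells with positive counts contribute, which also avoids 0 * log(0).
    ll = np.sum(S[S > 0] * np.log(P[S > 0]))
    ll += np.sum(u[u > 0] * np.log(r[u > 0]))
    ll += np.sum(v[v > 0] * np.log(c[v > 0]))
    return float(ll)

def em_clonotype_mle(S, u, v, n_iter=200):
    """EM for the assumed model: bulk reads are treated as pairs with one chain unobserved."""
    S, u, v = (np.asarray(x, dtype=float) for x in (S, u, v))
    P = np.full(S.shape, 1.0 / S.size)           # uniform start, strictly positive
    for _ in range(n_iter):
        r, c = P.sum(axis=1), P.sum(axis=0)
        row_w = np.divide(u, r, out=np.zeros_like(r), where=r > 0)
        col_w = np.divide(v, c, out=np.zeros_like(c), where=c > 0)
        # E-step: spread each bulk count over the clonotypes consistent with it.
        N = S + P * row_w[:, None] + P * col_w[None, :]
        # M-step: multinomial MLE on the completed counts.
        P = N / N.sum()
    return P

# Toy input: 2 alpha chains x 2 beta chains, few paired cells, more bulk reads.
S = np.array([[3.0, 0.0], [0.0, 1.0]])
u = np.array([6.0, 2.0])
v = np.array([5.0, 3.0])
P_hat = em_clonotype_mle(S, u, v)
```

The bulk terms only constrain row and column sums, so in this toy model the pairing structure still comes entirely from the single-cell counts; the bulk data sharpens the marginals.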
2 -- Salil Bhate
Deriving genetic codes for molecular phenotypes from first principles
A goal of genetics is to find organizing principles underlying how phenotypic information is encoded in the genome sequence. Modeling covariation between sequences and phenotypic or epigenetic measurements has established genetic codes, such as the amino acid code and cis-regulatory code, which enable predicting phenotype from genotype. We present a formal postulate, Phenotype Sequence Alignment (PSA), from which we instead derive genetic codes from first principles, without sequence-phenotype covariation. PSA asserts that a phenotype produced by an organism is aligned with its genome sequence; that is, the molecular similarity between the phenotype’s constituent structures (for example, between protein residues, cells or tissue regions) is in correspondence with the sequence similarity between the genomic loci that influence them. PSA therefore constrains the genetic encoding of a phenotype when its similarity structure is combinatorially complex relative to the genome sequence. We search for genetic codes using PSA as a constraint, with only phenotypic data and reference genomes, and without covariation or annotations. These codes, derived from first principles, correctly recover empirically established genotype-phenotype associations and evolutionarily conserved codes at the protein, single-cell, tissue and organ scales, in homeostasis and cancer. PSA is thus an organizing principle underlying the relationship between genotype and phenotype across levels of biological organization, which enables algorithms for de novo genetic discovery.
3 -- Avik Biswas
Dissecting the Pathways and Mechanisms of Drug Resistance Evolution in HIV using AI and cryogenic electron microscopy
Drug resistance to antiretroviral therapy remains a pervasive problem in the treatment of HIV/AIDS. We use a novel methodology combining a physics-based machine learning model of HIV protein fitness under drug selection pressure, with kinetic Monte Carlo simulations of sequence evolutionary trajectories, to explore the temporal acquisition of drug-resistance mutations (DRMs) in patients as they arise from the drug-naive population. Our model accurately captures the clinically reported time to acquire DRMs across the major HIV-1 drug target proteins. We find that slow DRMs are contingent on a network of epistatic interactions with accessory mutations that appear only after prolonged drug pressure. For the slowest acquired DRMs, we define the temporal ordering of events along mutational pathways leading to resistance starting from specific molecular clones of HIV. We then provide the mechanistic bases for the preferred pathways to drug resistance by ordering the mutant structures derived by high-resolution cryo-electron microscopy along the predicted trajectories. This work provides a framework for the development of combined computational and structural biology approaches to surveil the response of viral systems in general to external selection pressure from therapeutics, rationalize the mechanisms of resistance, and elucidate potential opportunities for therapeutic interventions.
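To make the kinetic Monte Carlo idea concrete, here is a minimal sketch on a toy binary sequence. The energy function (a Potts-like form with per-site fields and pairwise couplings), the Metropolis-style rate rule, and every name below are assumptions chosen for illustration, not the fitness model or rates used in the work. Pairwise couplings stand in for epistasis: a mutation's rate depends on which other mutations are already present.

```python
import numpy as np

def flip_energy_changes(seq, fields, couplings):
    """Energy change of flipping each site, for E(s) = -h.s - 0.5 * s.J.s with s in {0, 1}.

    Assumes `couplings` is symmetric with a zero diagonal.
    """
    delta = 1 - 2 * seq                    # +1 acquires a mutation, -1 reverts it
    local = fields + couplings @ seq       # field felt by each site given the others
    return -delta * local

def kmc_trajectory(seq, fields, couplings, n_events, rng):
    """Gillespie-style simulation: exponential waiting times, events chosen by rate."""
    seq = seq.copy()
    time, history = 0.0, []
    for _ in range(n_events):
        d_energy = flip_energy_changes(seq, fields, couplings)
        rates = np.exp(-np.maximum(d_energy, 0.0))   # equals min(1, exp(-dE))
        total = rates.sum()
        time += rng.exponential(1.0 / total)         # waiting time until the next event
        site = rng.choice(len(seq), p=rates / total)
        seq[site] = 1 - seq[site]
        history.append((time, int(site)))
    return seq, history

rng = np.random.default_rng(0)
fields = np.array([1.0, -1.0, 0.5, 0.0])
couplings = np.zeros((4, 4))
couplings[1, 2] = couplings[2, 1] = 2.0   # site 1 becomes favorable once site 2 is mutated
final_seq, history = kmc_trajectory(np.zeros(4, dtype=int), fields, couplings, 10, rng)
```

The `history` list records when each site flipped, which is the kind of temporal ordering of mutations the abstract refers to.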
5 -- Xingjian Chen
Generation of cellular images from single-cell gene expression with GenVinci
Cells were first discovered and characterized through imaging hundreds of years ago, while recent advances in single-cell genomics have enabled their systematic cataloging through comprehensive molecular profiling. However, both approaches typically capture only a single modality at a time, requiring substantial effort to obtain multiple views and often resulting in seemingly disparate representations of cellular identity. Recent breakthroughs in generative AI offer a transformative solution by enabling the synthesis of multiple views through non-linear transformations, providing a unified perspective on cellular and tissue biology.
Here, we introduce GenVinci, a transformer-based generative model that learns a universal cell representation to reconstruct morphological information from single-cell and spatial gene expression profiles. GenVinci is pretrained on over 120 million dissociated and spatially resolved single-cell gene expression profiles across more than 80 tissues from both humans and mice. The model is then fine-tuned to generate diverse types of cellular and tissue imaging data, including electron microscopy images, fluorescence microscopy images, neuron morphologies, and histological stains (hematoxylin and eosin (H&E)), from various molecular profiles at different resolutions.
GenVinci generates highly accurate cellular and tissue images that align with ground-truth morphologies, cell types, and pathological annotations. Notably, we demonstrate that GenVinci enables in silico perturbation experiments by generating previously unseen morphologies from perturbed gene expression data. This capability bridges molecular states with imaging phenotypes, facilitating the modeling of cellular behaviors and functions in silico.
Furthermore, we show that GenVinci can be flexibly applied to a range of downstream tasks, including image generation, multi-modal clustering, zero-shot classification, and data augmentation. By unifying different views of cellular and tissue biology, GenVinci significantly reduces the need for multiple experimental measurements, advancing the ultimate goal of virtual cell and tissue simulation.
6 -- Gillian Chu
LAML-Pro: Efficient Maximum Likelihood for Cell Genotype and Lineage Inference
The history of cell divisions relating a single cell to a multicellular tissue or organism is of fundamental importance in developmental biology. Recently, dynamic lineage tracing technologies have dramatically improved the ability to derive cell lineage trees by using genome editing systems (e.g. CRISPR, prime) to induce edits continuously across cell divisions. The edits in each cell are then measured in a large population of cells through single-cell sequencing or imaging technologies.
Inferring the cell lineage tree from the observed measurements remains a challenging computational problem. Current measurement technologies are imperfect, and thus any approach to reconstruct cell lineage trees must perform two separate steps: (1) identify each cell’s sequence of genomic edits, or “genotype”, and (2) apply phylogenetic inference techniques to estimate the cell lineage tree relating the cell genotypes. Nearly all current computational methods for lineage tracing analysis perform these steps sequentially, using statistical models or rule-based heuristics to identify genotypes which are then input into a phylogenetic algorithm to infer cell lineage trees. This two-step approach implicitly assumes that phylogenetic inference algorithms are provided with accurately identified genotypes, but the procedures for identifying cell genotypes are not error-free.
We introduce LAML-Pro, an algorithm which simultaneously infers both the cell genotypes and a maximum likelihood cell lineage tree. Formulating a tree-structured latent variable model, LAML-Pro combines an efficient Expectation-Maximization algorithm with standard tree topology search. We demonstrate that LAML-Pro outperforms state-of-the-art methods in terms of tree topology accuracy under a range of measurement error thresholds on both simulated and real data. We also demonstrate the ability to scale more efficiently than other probabilistic approaches, allowing the practical use of LAML-Pro to infer cell genotypes, tree topologies and parameters such as time-scaled branch lengths on current dynamic lineage tracing datasets.
7 -- Valentin De Bortoli
Towards DNA sequence generation with diffusion models
Generation of synthetic ordinal data, which are common in Electronic Health Records (EHRs), is a long-standing problem in the generative modeling and computational biology communities. Generating such data could have transformative effects on machine-learning-based techniques for biomedical sciences. First, generating synthetic data is crucial to protect patient privacy and benchmark downstream methods on publicly available data. Second, high-quality data is crucial for model training and evaluation.
On the other hand, there now exist powerful generative modeling methods such as diffusion models. Those models generate high-quality synthetic data and yield state-of-the-art results for many modalities. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively "denoises" a Gaussian sample into a sample from the data distribution. While discrete extensions of those models exist, they do not leverage the ordinal structure of the data.
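The Gaussian forward process described above can be sketched in a few lines. The example assumes a variance-preserving process with a linear noise schedule; the schedule and the `beta_min` and `beta_max` values are common defaults chosen here for illustration, not parameters from this work, and the sketch covers only the standard continuous case, not the ordinal extension.

```python
import numpy as np

def forward_noise(x0, t, rng, beta_min=0.1, beta_max=20.0):
    """Sample x_t given x_0 at time t in [0, 1] for a variance-preserving forward process."""
    # Integral from 0 to t of the linear schedule beta(s) = beta_min + s * (beta_max - beta_min).
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    signal = np.exp(-0.5 * int_beta)               # fraction of x_0 that survives
    noise_std = np.sqrt(1.0 - np.exp(-int_beta))   # chosen so signal**2 + noise_std**2 == 1
    eps = rng.standard_normal(x0.shape)
    return signal * x0 + noise_std * eps, eps

rng = np.random.default_rng(0)
x0 = np.array([1.0, 2.0, 3.0])
x_half, eps = forward_noise(x0, 0.5, rng)   # partially corrupted
x_one, _ = forward_noise(x0, 1.0, rng)      # close to pure Gaussian noise
```

A denoising network would be trained to recover `eps` (or the score) from `x_t` and `t`; the reverse process then runs that network from `t = 1` back to `t = 0`.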
In this paper, we show how to extend this technique to structured ordinal discrete spaces and apply those techniques to the TCGA dataset and postmortem data of neurotypical individuals.
8 -- Yasha Ektefaie
Fleming: An AI Agent for Antibiotic Discovery In Mycobacterium Tuberculosis
Antibiotic-resistant tuberculosis (TB) remains one of the world’s most lethal infectious diseases, claiming over 1.5 million lives each year. New strains impervious to existing therapies threaten to reverse decades of progress, underscoring the urgent need for novel antibiotics. Yet the high cost and failure rate of drug development pose formidable hurdles to identifying effective TB therapeutics. Recent advances in artificial intelligence (AI) promise to accelerate antibiotic discovery by predicting molecular inhibitory properties and generating tailored compounds. However, producing an effective antibiotic requires coordinating multiple AI tools to optimize for chemical novelty, TB inhibitory effect, and acceptable absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. Here we introduce Fleming, a multi-agent system that orchestrates four specialized agents (a bacterial inhibition agent, a molecular generation agent, an ADMET evaluator, and an ADMET optimizer) into an integrated platform for TB antibiotic development. Trained on the largest curated TB inhibitor dataset to date (n = 114,933), Fleming precisely identifies promising leads and autonomously designs new chemical entities. Critically, a natural language interface allows Fleming to emulate the reasoning of medicinal chemists, bridging in silico predictions with human oversight to streamline hit-to-lead workflows. By consolidating diverse computational tasks and expert review, Fleming addresses the data, modeling, and validation gaps that have long hampered AI-based antibiotic discovery. Beyond TB, Fleming represents a scalable framework for AI-driven drug discovery, applicable to other resistant pathogens and therapeutic areas. We anticipate that similar multi-agent systems will play a pivotal role in biomedical research, transforming how AI and human expertise converge to combat infectious diseases.
9 -- Wenbin Guo
Generative modeling of DNA methylation states on sequencing reads
DNA methylation is a key epigenetic modification at CpG sites, regulating gene expression and cellular function. Sequencing-based technologies are commonly used to measure DNA methylation, where each read captures the methylation states of CpG sites within a contiguous genomic region. Conventional analyses often treat CpG sites independently, overlooking the intrinsic dependencies between adjacent sites. This simplification reduces the accuracy of methylation data interpretation and hampers insights into underlying biological mechanisms. Consequently, existing methods for simulating DNA methylation states on sequencing reads struggle to preserve data fidelity. To address this, we propose two complementary approaches for modeling DNA methylation states on sequencing reads: a probabilistic method and a deep learning-based method. The first extends the standard Hidden Markov Model (HMM) to a heterogeneous HMM, where state transition probabilities vary as a function of genomic distance, mimicking the observation that adjacent CpG sites with closer spatial proximity exhibit stronger state dependencies. The second approach leverages a bidirectional many-to-many Long Short-Term Memory (BiLSTM) network, which predicts methylation states using site-specific features while implicitly learning spatial dependencies. A compound loss function balances dependency learning and distribution preservation, enabling the BiLSTM model to accurately capture distance-dependent site-site interactions while maintaining the observed marginal distribution of methylation levels. Through simulation and application to Whole Genome Bisulfite Sequencing data, we show that the heterogeneous HMM effectively models distance-dependent dependencies, while the BiLSTM further improves predictive performance and alignment with observed methylation patterns. These approaches enhance our ability to model DNA methylation dynamics and lay the groundwork for next-generation bisulfite sequencing read simulators.
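One simple way to make transition probabilities depend on genomic distance is sketched below. It assumes a two-state chain (0 = unmethylated, 1 = methylated) whose memory of the previous CpG decays exponentially with distance; this functional form and the parameter names `meth_rate` and `decay_bp` are illustrative assumptions, not the parameterization used in the work, and emissions are omitted.

```python
import numpy as np

def transition_matrix(distance, meth_rate=0.7, decay_bp=100.0):
    """2x2 transition matrix between methylation states for CpGs `distance` bp apart."""
    pi = np.array([1.0 - meth_rate, meth_rate])   # long-range (stationary) distribution
    stationary = np.tile(pi, (2, 1))
    memory = np.exp(-distance / decay_bp)         # 1 at distance 0, tends to 0 far away
    # Nearby sites mostly copy the previous state; distant sites revert to pi.
    return stationary + memory * (np.eye(2) - stationary)

def sample_read_states(positions, rng, meth_rate=0.7, decay_bp=100.0):
    """Sample a methylation state for each CpG position on one read."""
    states = [int(rng.random() < meth_rate)]
    for prev_pos, pos in zip(positions[:-1], positions[1:]):
        T = transition_matrix(pos - prev_pos, meth_rate, decay_bp)
        states.append(int(rng.random() < T[states[-1], 1]))
    return states

rng = np.random.default_rng(0)
cpg_positions = [100, 104, 110, 300, 305]   # hypothetical CpG coordinates on one read
states = sample_read_states(cpg_positions, rng)
```

With this form, the three tightly spaced sites at the start of the read are strongly correlated, while the 190 bp gap largely resets the chain.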
10 -- Tinglin Huang
Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching
Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior methods sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction, as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves a relative improvement of over 18% over pathology foundation models.
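For readers unfamiliar with flow matching, the core training target can be written compactly. The sketch below assumes the common straight-line interpolation path between a noise sample `x0` and a data sample `x1` (here a spots-by-genes expression matrix for a whole slide); the path choice, the omission of image conditioning, and all names are assumptions for illustration rather than details of STFlow.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and the velocity along it."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0          # constant along a straight-line path
    return x_t, target_velocity

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a model's predicted velocity and the target."""
    x_t, target = flow_matching_pair(x0, x1, t)
    return float(np.mean((predict_velocity(x_t, t) - target) ** 2))

rng = np.random.default_rng(0)
n_spots, n_genes = 6, 4                          # toy slide
x1 = rng.poisson(3.0, size=(n_spots, n_genes)).astype(float)   # "observed" expression
x0 = rng.standard_normal((n_spots, n_genes))                   # noise sample
zero_model = lambda x_t, t: np.zeros_like(x_t)   # placeholder for a slide-level network
loss = flow_matching_loss(zero_model, x0, x1, t=0.3)
```

Because the whole `x1` matrix moves along one path, a network that sees all spots at once can learn dependencies between spots, which per-spot regression cannot.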
11 -- Akhil Jalan
Rigorous Matrix Completion Methods for Biomedically Inspired Missingness Patterns
We study transfer learning for matrix completion in a Missing Not-at-Random (MNAR) setting that is motivated by biomedical problems. For experimental design problems such as metabolite selection for metabolite balancing experiments, patient selection for companion diagnostics, and site selection for single-cell RNA sequencing, one samples a small set of rows R and columns C of an underlying population matrix, and observes only the entries (i,j) such that i is in R and j is in C. Existing matrix completion methods perform poorly on this imputation task, because their modeling assumptions do not hold for this row/column missingness pattern.
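The row/column missingness pattern is easy to visualize with a small mask. The helper name and the toy dimensions below are hypothetical.

```python
import numpy as np

def row_column_mask(n_rows, n_cols, R, C):
    """Boolean mask that is True exactly at entries (i, j) with i in R and j in C."""
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    mask[np.ix_(R, C)] = True
    return mask

mask = row_column_mask(6, 5, R=[0, 3], C=[1, 2, 4])
observed_fraction = mask.mean()             # 6 observed entries out of 30
fully_missing_rows = (~mask.any(axis=1)).sum()
```

Every row outside R and every column outside C is entirely unobserved, which is quite different from the uniformly scattered missing entries that many standard matrix completion guarantees assume.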
We address this issue by formulating a statistically rigorous matrix completion method that leverages transfer learning. Our method uses a pre-existing dataset for a different population matrix which is related to the target matrix through a heterogeneous feature shift in the latent singular space. Simulations of matrix completion for gene expression microarray data and metabolic network data show promising results, with our method outperforming multiple existing algorithms. We give a statistically rigorous explanation of this performance and outline a broad framework for the applicability of our methods to several classes of biomedical data.
12 -- Khrystofor Khokhlov
Application of foundation models for molecular representation in cancer drug discovery and precision oncology
Drug discovery is a resource-intensive and time-consuming process, often requiring decades of effort and substantial financial investment, with a high risk of failure. Despite advances in high-throughput screening technologies, the size of chemical space presents a significant challenge: it is not feasible to experimentally screen all potential molecules. Our work aims to accelerate drug discovery by leveraging advances in deep learning (DL) models to identify promising hit candidates and improve the prediction of drug response in cancer. We are developing a DL model capable of identifying potentially novel cancer drug chemotypes and reliably predicting drug response on cancer cell line targets. Leveraging recent progress in transformer-based architectures and graph neural networks, we use molecular language models, graph models and cell foundation models to embed both molecular and genomic data into low-dimensional subspaces. To train these models, we use the large-scale drug repurposing and oncology datasets from the PRISM project at the Broad Institute, which enable robust training of machine learning (ML) models. Our pipeline involves training single-target activity prediction models on drug molecular structures for in silico screening of chemical space to search for cancer drug candidates, followed by integrating genomic data to enhance biological context and train a hybrid precision oncology model capable of predicting drug response for novel drug:target pairs. Our results demonstrate that the embedding vector representations produced by the proposed framework outperform existing approaches, enabling more accurate drug response predictions and offering a more efficient means of exploring chemical space.
This work highlights the transformative potential of ML/DL methods in drug discovery, enabling targeted, cost-effective exploration of chemical libraries, and advancing the development of precision oncology treatments.
13 -- Julie Laffy
Expanding the landscape of protein secretion: Uncovering hidden diversity in the secretome
Remote communication between cells in different tissues is mediated by soluble factors that are secreted into the blood. These factors enable tissues to act in coordinated fashion when we undergo any kind of systemic response, but despite their critical role in organismal-level regulation, soluble factors and the communication networks they mediate remain largely understudied. Taking an integrated approach that leverages protein language models, cross-tissue omics data and a revised definition of the secretome, we first identify systematic biases that differentiate soluble factors from the rest of the proteome. Second, we reveal previously unappreciated diversity in regulatory secretome motifs. Third, we propose an approach to map putative inter-organ communication routes. Our work has implications for the identification of new soluble factors and their localisations, paving the way for novel protein design strategies and contributing to our understanding of how tissues communicate in health and disease.
14 -- Alex LeNail, Raleigh Linville, Vanessa Farell, Saul Sauceda, Jaanak Prashar, Gaurav Arya, Preston Ge, Benjamin James, Jonathan Weissman, Manolis Kellis, Myriam Heiman
Computational Design of Transcription Factor Gene Therapies to Reverse Age-Associated Neurodegeneration
Many of the neurodegenerative diseases of old age appear to owe much of their etiology to aging itself. This motivates an investigation into the molecular characteristics of neural aging, as well as the pursuit of interventions which reverse neuronal aging or bolster neuronal resilience to degeneration. We cataloged the molecular differences between young and old neurons with single-cell paired transcriptional and epigenetic assays, forming a multi-omic atlas of brain aging that reveals deficits in mitochondrial biogenesis, proteostasis, and synaptic gene expression. In order to reverse those phenotypes, we targeted endogenous homeostasis-maintenance genetic programs by increasing the dosage of relevant Transcription Factors (TFs). We computationally predicted a library of TFs most likely to engage those genetic programs by integrating multiple lines of evidence, and then produced a library of gene therapies which we administered in a pool to the aged mouse brain. Our single-nucleus RNA-seq readout describes the transcriptional consequences of each TF in each cell type in the aged motor cortex. Our results highlight certain under-studied TFs with promising rejuvenation potential, and benchmark our algorithmic approaches to predict the effects of TF perturbations.
15 -- Bijan Mazaheri
Causality-Guided Disentanglement of Batch and Biology
Large-scale biological datasets are the amalgamation of many experiments or “batches” of data. These batches often have their own unique signal from variations in equipment, conditions, or lab technicians. Batch integration algorithms attempt to produce “corrected” versions of batched datasets that retain biological information while removing batch-specific signals. These approaches over-correct when biological and batch signals are entangled, which occurs whenever batches contain imbalanced populations of cell-types or throughput-limited perturbations.
We propose a new principle for batch integration in entangled settings that is built on causal modeling. The key insight is that batch-signal should be corrected with respect to a task, giving rise to an objective based on conditional erasure of batch signal. For example, studies on cell-type should forgo global batch-uniformity in favor of batch uniformity within each cell-type cluster. Similarly, high-throughput drug screening should aim for per-drug batch-signal erasure. From this insight we develop the “conditional batch independence criterion” (CBIC) for studying observed signals like high-throughput drug screening, and the “local batch independence criterion” (LBIC) for studying unobserved signals like cell-type. The precise requirements for batch integration depend on the causal relationships within the available data as well as the question being studied.
Previous approaches to batch correction are poorly equipped to integrate entangled batch signals with respect to our new principles of CBIC and LBIC. To address this, we propose allowing weighted matching between data from different batches, which relaxes previous objectives into unbalanced optimal transport. These weighted matchings allow groups of cells to be joined with differing (but locally uniform) levels of batch participation, thereby implementing the objectives of CBIC and LBIC. This gives rise to a principled tool for entangled batch correction that couples unbalanced optimal transport with causality-guided signal processing.
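To show what relaxing the matching into unbalanced optimal transport means computationally, the sketch below implements the standard entropic unbalanced Sinkhorn scaling iteration with KL-penalized marginals. It is a generic textbook-style routine under assumed parameters (`eps` for entropic regularization, `rho` for marginal relaxation), not the method proposed in the abstract; the toy masses and costs are hypothetical.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, cost, eps=0.1, rho=1.0, n_iter=500):
    """Entropic unbalanced OT plan between mass vectors a and b.

    Marginal constraints are softened by KL penalties of weight rho,
    so cells may be matched with more or less than their nominal mass.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    K = np.exp(-np.asarray(cost, dtype=float) / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    power = rho / (rho + eps)          # tends to 1 (balanced Sinkhorn) as rho grows
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

# Two batches with different cell-type proportions (toy numbers).
a = np.array([0.5, 0.5])               # batch 1: equal mass on two cell groups
b = np.array([0.25, 0.75])             # batch 2: imbalanced populations
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
plan = unbalanced_sinkhorn(a, b, cost, rho=1.0)
row_mass = plan.sum(axis=1)            # need not equal a exactly
```

With a small `rho`, the plan is free to leave part of an over-represented group unmatched instead of forcing it onto a different cell type, which is the over-correction the abstract describes.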
16 -- Prisha Satwani
A Multi-Task Approach to T-Cell Receptor Representation Learning
The human immune system generates a diverse repertoire of T cell receptors (TCRs) to recognize peptide-MHCs (pMHCs), protein complexes which play a critical role in triggering the immune response. Predicting TCR-pMHC specificity (which TCR binds to which pMHC) is essential for immunotherapy and vaccine development. However, TCR-pMHC interactions are highly cross-reactive, where a single TCR can bind multiple pMHCs and vice versa, making specificity prediction difficult. This challenge is compounded by the scarcity of labeled TCR-pMHC binding data, which hampers model training.
We propose a multi-task fine-tuning framework that enhances TCR representation learning by jointly training on TCR-pMHC binding prediction and T cell type (CD4 vs CD8) classification, a biologically related but more data-abundant task. Our model consists of a shared encoder (a BERT-like TCR representation model) and two side-by-side task-specific classifiers. We alternate training batches from both tasks, allowing simultaneous rather than sequential learning. This shared learning imposes a regularization effect, leading to more generalizable representations and improving generalization to unseen pMHCs.
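The alternating-batch schedule can be sketched independently of any deep learning framework. The generator below is a hypothetical illustration: the task names and batch objects are placeholders, and a real training loop would pass each yielded batch through the shared encoder and the matching task-specific classifier before taking an optimizer step.

```python
from itertools import zip_longest

def alternate_batches(binding_batches, celltype_batches):
    """Yield (task_name, batch) pairs, alternating between the two tasks.

    If one task has more batches than the other, its remaining batches
    are yielded on their own once the shorter task is exhausted.
    """
    for binding, celltype in zip_longest(binding_batches, celltype_batches):
        if binding is not None:
            yield "binding", binding
        if celltype is not None:
            yield "cell_type", celltype

# Placeholder batches; the binding task is the data-scarce one.
schedule = list(alternate_batches(["b1", "b2"], ["c1", "c2", "c3"]))
```

Interleaving at the batch level means the shared encoder receives gradients from both objectives throughout training, rather than fully fitting one task before seeing the other.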
Compared to regular single-task fine-tuning, multi-task learning improves performance in predicting TCR binding to unseen pMHCs. Notably, our results show that T cell type provides an informative signal for TCR specificity, particularly in distinguishing between MHC class I (recognized by CD8 T cells) and MHC class II (recognized by CD4 T cells), a key distinction in immune response.
However, we observe a trade-off: while fine-tuning improves task performance, it alters the model’s broader TCR similarity structure. Future work could explore incorporating different training objectives to balance task-specific adaptation with preserving generalized TCR representations.
Our work demonstrates the potential of integrated, multi-task approaches in computational immunology. Expanding this framework to include more diverse TCR-related tasks could further enhance the model’s capabilities and our understanding of TCR biology.
17 -- Viktoria Schuster
Can sparse autoencoders make sense of biological latent representations?
Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. Here, we explore their potential for decomposing latent representations in complex and high-dimensional biological data, where the underlying variables are often unknown. With the help of simulated data, we find that latent representations can encode observable and directly connected upstream hidden variables in superposition. Superpositions, however, are not identifiable if these generative variables are unknown. SAEs can recover variables from superposition, yielding interpretable features. Applied to single-cell multi-omics data, we show that an SAE can uncover key biological processes. Most importantly, we present a proof of concept for an automated analysis pipeline to link SAE features to biological concepts in order to enable large-scale analysis of single-cell expression models.
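For orientation, a sparse autoencoder in its simplest form is an overcomplete ReLU encoder and a linear decoder trained with a reconstruction loss plus an L1 penalty on the feature activations. The sketch below shows only the forward pass and loss with randomly initialized weights; the dimensions, the `l1_weight` value, and all names are assumptions, and no claim is made about the architecture used in the work.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode latents into non-negative sparse features, then reconstruct the latents."""
    features = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps features non-negative
    recon = features @ W_dec + b_dec
    return features, recon

def sae_loss(x, features, recon, l1_weight=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return float(mse + l1_weight * sparsity)

rng = np.random.default_rng(0)
n_cells, latent_dim, n_features = 8, 16, 64      # overcomplete: 64 features for 16 dims
x = rng.standard_normal((n_cells, latent_dim))   # stand-in for a model's latent embeddings
W_enc = 0.1 * rng.standard_normal((latent_dim, n_features))
W_dec = 0.1 * rng.standard_normal((n_features, latent_dim))
features, recon = sae_forward(x, W_enc, np.zeros(n_features), W_dec, np.zeros(latent_dim))
loss = sae_loss(x, features, recon)
```

The overcomplete feature dimension is what gives the model room to assign variables stored in superposition to separate features.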
18 -- Petar Stojanov
Variation and regulatory mechanisms of the small RNA transcriptome across human tissues
Standard RNA-sequencing protocols exclude small RNAs and thus preclude the study of thousands of small noncoding RNAs with essential roles in the post-transcriptional regulation of gene expression. Here, we present the characterization of small RNAs across 16,814 samples, 47 tissue sites and 978 donors in the GTEx Project. We quantified the expression of a total of 41,458 small RNAs, including microRNAs (miRNAs), Piwi-interacting RNAs (piRNAs), transfer RNAs, small nuclear RNAs, small nucleolar RNAs, Y RNAs, and others. We used supervised classification to identify putative novel RNAs not present in references, and detected 57 novel high-confidence miRNAs. We mapped QTLs in cis and trans, identifying hundreds to thousands of cis-eQTLs for each small RNA species. Among them, we discovered two trans-QTLs for tRNAs, corresponding to splice QTLs in TRMT1 and DTWD1, which alter the base editing activity of their respective target tRNAs. To investigate the propagation of genetic effects on coding genes through miRNAs, we fine-mapped and co-localized SNPs that affect miRNA expression in cis and mRNA expression in trans. Through mediation analysis we confirm that miRNA-mRNA pairs are causally related, which we further corroborate through seed-pairing and conservation analysis. Notably, the tissue specificity of miRNA expression is reflected in the tissue specificity of complex traits co-localizing with miRNA eQTLs. For example, we identify an interaction in the cerebellum between miR-5683 and ODAD1, a gene involved in the motility of glial cells. Furthermore, we found that this shared causal variant colocalized with the GWAS trait of tau levels. In summary, we demonstrate the importance of characterizing the full spectrum of small RNAs, which play critical roles in the regulation of gene expression, including in development and disease.
19 -- Jiageng Wu
Leveraging Large Language Models for Cognitive Function Assessment in Dementia Patients
Cognitive function is a key determinant of daily functioning in dementia patients, and accurate, timely assessments are essential for disease management and intervention. However, conventional assessment methods require substantial resources, limiting their scalability. Large language models (LLMs) represent a promising new approach to predicting cognitive function by leveraging unstructured Electronic Health Record (EHR) data. In this retrospective cohort study, we examined the EHRs of patients aged ≥65 years diagnosed with dementia between 2013 and 2020 at Mass General Brigham. Cognitive function was determined using the Brief Interview for Mental Status from the Minimum Data Set (MDS) or M1700 Cognitive Function from the Outcome and Assessment Information Set (OASIS). Clinical notes from the previous year served as model input. We evaluated multiple state-of-the-art LLMs, including QWEN2.5 (7B and 72B), Llama 3.1 (8B), Llama 3.3 (70B), Ministral-8B, Mistral Large, and DeepSeek-Llama (70B). Three inference strategies were tested: standard mode for direct predictions, chain-of-thought mode for step-by-step reasoning, and summary-based mode for integrated prediction. Among 685 MDS-assessed and 3,630 OASIS-assessed patients, the LLMs demonstrated strong cross-dataset generalizability. Llama-3.3-70B achieved the highest accuracy (66.31%) in OASIS, while QWEN2.5-72B performed best in MDS (45.23%). Larger LLMs generally outperformed smaller ones, though chain-of-thought and summary-based strategies did not consistently improve accuracy despite boosting interpretability. This may stem from limited comprehension of clinical text and the risk of hallucinations. By systematically evaluating multiple LLMs and inference strategies, this work highlights the need for further optimization of LLM reasoning strategies to handle clinical text, minimizing hallucinations and improving reliability.
Integrating LLMs into cognitive assessment workflows could enhance efficiency and accessibility, supporting more effective dementia care and management.
20 -- Liangying Yin
Estimation of causal effects of genes on complex traits using a Bayesian-network-based framework applied to GWAS data
Understanding the relationships between genes and complex traits is vital for elucidating the biological mechanisms behind trait variations and disease onset. Gene-based analyses, especially univariate-based gene analyses, are widely used for the characterization of gene-phenotype relationships. However, they are subject to the influence of confounders. While some genes directly contribute to trait variations, others exert their effects through other genes. Quantifying these direct and indirect effects can enhance our understanding of their contributions.
Here we present a novel machine learning framework to differentiate core genes from peripheral ones and decipher their total and direct causal effects using imputed gene expression data from genome-wide association studies (GWAS) and raw gene expression from GTEx. The method is based on a Bayesian network (BN) approach which produces a directed graph showing the relationship between genes and the phenotype. Since the approach can uncover the overall causal structure, we are able to directly examine the specific roles (i.e., core, peripheral, irrelevant genes) and the consequences (causal effects) of various genes on trait variations or onset of disease. The presented framework allows gene expression and disease trait(s) to be estimated in different samples, greatly improving the applicability of the approach. It is also extendable to decipher the causal network of two or more traits.
We verified the validity of the proposed framework by applying it to various simulated scenarios and 52 traits in the UK Biobank (UKBB). Split-half replication and stability selection analyses were performed to demonstrate the ability of our proposed method to identify causally relevant genes. Overall, our proposed framework provides a way to prioritize genes with direct or indirect causal effects and to estimate the ‘importance’ of such genes.
21 -- Xinhe Zhang
An AI-Cyborg System for Adaptive Intelligent Modulation of Organoid Maturation
Recent advancements in flexible bioelectronics have enabled continuous, long-term stable interrogation and intervention of biological systems. However, effectively utilizing the interrogated data to modulate biological systems to achieve specific biomedical and biological goals remains a challenge. In this study, we introduce an AI-driven bioelectronics system that integrates tissue-like, flexible bioelectronics with cyber learning algorithms to create a long-term, real-time bidirectional bioelectronic interface with optimized adaptive intelligent modulation (BIO-AIM). When integrated with biological systems as an AI-cyborg system, BIO-AIM continuously adapts and optimizes stimulation parameters based on stable cell state mapping, allowing for real-time, closed-loop feedback through tissue-embedded flexible electrode arrays. Applied to human pluripotent stem cell-derived cardiac organoids, BIO-AIM identifies optimized stimulation conditions that accelerate functional maturation. The effectiveness of this approach is validated through enhanced extracellular spike waveforms, increased conduction velocity, and improved sarcomere organization, outperforming both fixed and no stimulation conditions.
22 -- Chase Yakaboski
Multimodal genetic and functional support boosts drug target clinical success for common and rare diseases
Background/Objectives: We propose expanding genetic support for drug targets beyond “direct” evidence (e.g., GWAS) by introducing a new concept of “indirect” support: the propagation of observed genetic evidence from related genes sharing functional annotations. We evaluate how this extension can enhance clinical success and more robustly leverage multimodal genetic evidence to guide target prioritization.
Methods: We introduce PIGEAN, a fully Bayesian model that fuses direct evidence (GWAS, disease gene lists) with multimodal indirect support, integrating 144k+ gene annotations spanning gene set and pathway databases, single-cell expression patterns, and phenotypic data. We applied PIGEAN to 2,125 common and 461 rare diseases that could be mapped to indications from 29,460 drug programs, yielding direct, indirect, and combined probabilistic scores for gene-disease relevance. We assessed the relative success (RS) of 4,725 of these drug programs that included 1,314 targets that had significant direct and/or indirect support.
Results: For common diseases, we confirmed that prioritization by direct genetic support led to success similar to that in previous reports (our RS=2.24 vs. 1.97, 95% CI=1.54–3.25, 68% of programs). Using only indirect support to prioritize targets for rare and common diseases yielded success on par with direct support prioritization (RS=2.41, 95% CI=1.97–2.95) and was applicable to 82% of programs, including 2,384 that had no direct support. Narrowing the scope to drug programs with no known genetic evidence, indirect support had a significant positive correlation with RS for both common and rare diseases (R²=0.79/0.33, p=1.14e-6/0.015; RS=1.82/1.40). Integrating direct and indirect evidence led to the highest improvement in clinical success (RS=2.76, 95% CI=2.23–3.42).
Conclusions: Statistically integrating direct and indirect support boosts the likelihood of clinical success, even when direct evidence is unavailable. We hope PIGEAN offers a new framework for optimizing drug target identification leveraging multimodal genetic evidence.
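The abstract reports relative success (RS) values with 95% confidence intervals but does not spell out the computation. As a rough sketch, assuming RS is the ratio of the success probability of supported programs to that of unsupported programs (the definition used in earlier genetic-support studies, not confirmed here), a ratio and a log-scale (Katz) interval could be computed as below. The function name and the counts in the usage note are hypothetical and are not taken from the study.

```python
from math import exp, sqrt

def relative_success(succ_supported, n_supported, succ_unsupported, n_unsupported, z=1.96):
    """Ratio of success probabilities (supported vs. unsupported programs)
    with an approximate confidence interval computed on the log scale (Katz method).
    Assumes all four counts are positive."""
    p_supported = succ_supported / n_supported
    p_unsupported = succ_unsupported / n_unsupported
    rs = p_supported / p_unsupported
    # Standard error of log(RS) for two independent binomial proportions.
    se_log = sqrt(1 / succ_supported - 1 / n_supported
                  + 1 / succ_unsupported - 1 / n_unsupported)
    return rs, rs * exp(-z * se_log), rs * exp(z * se_log)
```

For instance, with made-up counts of 30 successes out of 200 supported programs and 60 out of 1,000 unsupported programs, `relative_success(30, 200, 60, 1000)` gives an RS of 2.5 with an interval of roughly 1.66 to 3.77.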
Day 2: Thursday, May 1
23 -- Thomas Athey
Cascaded Detector Performance Analysis with an Application to Cell Detection
As both computer vision models and biomedical datasets grow in size, there is an increasing need for efficient inference algorithms. We utilize cascaded detectors to efficiently perform object detection of sparse objects in multiresolution images. Given an object's incidence and a set of detectors at different resolutions with known sensitivities and specificities, we derive the sensitivity and specificity of the associated n-level cascaded detector. We also derive the expected number of detector executions in a two-level cascaded detector and analyze its dependence on various parameters. Finally, we compare one- and two-level random forest-based detectors in a cell detection task in three-dimensional sparsely labeled fluorescent neuron images. We show that the multi-level detector achieves comparable performance in less than half the execution time on multiple datasets. We believe that our work can be extended to detect sparse objects in a variety of biomedical data domains and signal dimensions.
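To make the quantities in this abstract concrete, here is a minimal sketch of how cascade sensitivity, specificity, and expected execution count might be computed. It rests on simplifying assumptions that are ours and not necessarily the authors': detector outcomes are conditionally independent given the true label, a location is reported only if every level fires, and each coarse-level hit triggers a fixed number of fine-level runs (the hypothetical parameter `children_per_hit`).

```python
from math import prod

def cascade_sensitivity(sensitivities):
    """P(every level fires | object present), assuming conditional independence."""
    return prod(sensitivities)

def cascade_specificity(specificities):
    """P(at least one level rejects | object absent), assuming conditional independence."""
    return 1.0 - prod(1.0 - sp for sp in specificities)

def expected_executions_two_level(n_coarse, incidence, sens1, spec1, children_per_hit):
    """Expected detector runs in a two-level cascade: every coarse patch is
    scanned once, and each coarse patch that fires (true or false positive)
    triggers `children_per_hit` fine-level runs."""
    p_fire = incidence * sens1 + (1.0 - incidence) * (1.0 - spec1)
    return n_coarse * (1.0 + p_fire * children_per_hit)
```

Under these assumptions the cascade trades sensitivity (a product of per-level sensitivities) for specificity, and the execution count is dominated by the coarse-level false-positive rate when objects are sparse.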
24 -- Salvatore Benfatto
Rapid Epigenomic Classification of Acute Leukemia
Acute leukemia (AL) is an aggressive form of blood cancer that requires precise molecular classification and urgent treatment. However, standard-of-care diagnostic tests are time and resource intensive and do not capture the full spectrum of AL heterogeneity.
Here, we developed a machine learning framework to rapidly classify AL using nanopore-based genome-wide DNA methylation profiling. We first assembled a comprehensive reference cohort (n=2,540 samples) and defined 38 distinct methylation classes across AL lineages and age groups. Methylation-based classification closely matched lineage classification by standard pathology evaluation in most patients and revealed disease heterogeneity beyond that captured by standard genetic categories. Using this reference, we developed a specialized deep neural network model (MARLIN) for rapid AL classification. MARLIN classification was concordant with pathology diagnoses in 18/19 (94.7%) retrospective cases profiled with nanopore sequencing, including refinement of the diagnosis in 7/19 (36.8%) cases. We further evaluated real-time MARLIN classification during nanopore sequencing in prospective patients with suspected AL, achieving an accurate methylation class prediction in less than two hours from the time of sample receipt.
In summary, we show that epigenetic profiling effectively resolves the biological heterogeneity of AL and is a valid surrogate for many conventional diagnostic assays. Our machine learning- and nanopore-based framework is fast, affordable and easy to implement, making it suitable not only for high-tech laboratories but also for remote settings, and provides a foundation for future developments in molecular AL diagnostics.
25 -- Uthsav Chitra
Mapping the topography of spatial gene expression with interpretable deep learning
Spatially resolved transcriptomics technologies provide high-throughput measurements of gene expression in a tissue slice, but the sparsity of this data complicates analysis of spatial gene expression patterns. We address this issue by deriving a topographic map of a tissue slice (analogous to a map of elevation in a landscape) using a novel quantity called the isodepth. Contours of constant isodepth enclose domains with distinct cell type composition, while gradients of the isodepth indicate spatial directions of maximum change in expression. We develop GASTON, an unsupervised and interpretable deep learning algorithm that simultaneously learns the isodepth, spatial gradients, and piecewise linear expression functions that model both continuous gradients and discontinuous variation in gene expression. GASTON relies on a novel model of spatial gradients parametrized with a conservative, neural gradient field and is broadly applicable in spatial statistics. We show that GASTON more accurately identifies spatial domains and marker genes than existing approaches across several tissues. Moreover, GASTON identifies gradients of neuronal differentiation and firing in the brain; gradients of metabolism and immune activity in the tumor microenvironment; and gradients of calcium and developmental processes in the heart embryo. If time permits, I will also present GASTON-Mix, which improves the spatial domain identification in GASTON by integrating GASTON with a mixture-of-experts (MoE) model.
26 -- Sebastiano Cultrera di Montesano
Hierarchical cross-entropy loss improves atlas-scale single-cell annotation models
Single-cell atlases now catalog millions of cells across diverse tissues, species, and experimental conditions, offering unprecedented resolution into cellular diversity and function. Accurate and robust annotation methods remain a critical first step in translating these large-scale datasets into actionable biological insights. From a machine learning perspective, cell type annotation is a multi-class classification task over a structured label space. Cell types are organized in a hierarchical ontology, where directed edges link broader categories to more specific subtypes, forming a directed acyclic graph (DAG) that encodes biological relationships among labels. Yet, most computational models—including deep learning approaches—optimize a standard cross-entropy loss that assumes flat, mutually exclusive labels, disregarding this structure entirely.
We introduce a hierarchical cross-entropy (HCE) loss that aligns model training with the cell ontology. By redistributing probability mass from specific subtypes to their broader parent types, HCE encourages biologically coherent predictions—particularly in cases of label ambiguity or inconsistent granularity across datasets.
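One way to read the idea above is that the probability assigned to a label is the softmax mass of that label plus all of its descendants in the ontology, so that predicting a valid subtype of a coarse label is not penalized. The NumPy sketch below follows that reading; it is an illustrative assumption rather than the authors' implementation, and the names `descendant_matrix` and `hierarchical_cross_entropy` are hypothetical.

```python
import numpy as np

def descendant_matrix(labels, children):
    """A[i, j] = 1 if label j is label i itself or a descendant of i in the DAG.
    `children` maps a label name to a list of its direct subtypes."""
    index = {name: i for i, name in enumerate(labels)}
    A = np.eye(len(labels))

    def collect(node, seen):
        # Depth-first traversal gathering every descendant of `node`.
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                collect(child, seen)
        return seen

    for name in labels:
        for desc in collect(name, set()):
            A[index[name], index[desc]] = 1.0
    return A

def hierarchical_cross_entropy(logits, targets, A):
    """Mean negative log of the softmax mass on the target label or any of its descendants."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    aggregated = probs @ A.T  # aggregated[n, i] = mass on label i and its subtypes
    return float(-np.log(aggregated[np.arange(len(targets)), targets]).mean())
```

With `A` set to the identity matrix this reduces to the standard flat cross-entropy, which makes the two losses easy to compare on the same logits.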
To evaluate this approach, we trained three models of increasing complexity—a linear classifier, a multilayer perceptron (MLP), and a transformer-based model—on 15.2 million annotated human cells spanning 164 cell types. We then tested these models on 2.6 million newly added cells from 21 held-out studies, all sequenced with the same technology and annotated using a subset of the same labels seen in training. Incorporating HCE improved macro F1-scores by 12–15% across all models, without requiring changes to architecture or additional computational cost.
These gains were consistent across model types and test datasets, and particularly pronounced for cell types embedded in densely connected regions of the ontology. Our results demonstrate the value of structure-aware training and suggest a strategy for guiding future dataset selection to further enhance model generalization.
27 -- Mayank Mangesh Ghogale
Developmental Pattern Formation Occurs via Cue-Driven Cellular Diversification
Pattern formation mechanisms remain poorly understood, particularly for complex, 3-D structures. The sea urchin larval skeleton provides an elegant, morphologically simple model to study developmental pattern formation. In this two-component system, mesodermal cells called PMCs produce the calcium carbonate skeleton in response to patterning cues that are expressed in discrete spatial regions by the adjacent ectodermal cells. Our goal is to understand the temporal and spatial mechanisms for patterning cue reception at time points 15, 18, 21, 24 and 30 hours post fertilization (hpf). To achieve this, we identified and spatially mapped the gene expression trajectories within PMCs using single-cell RNA sequencing (scRNA-seq) data along with single-molecule FISH (sm-FISH) results. We employed Identify Cell states Across Treatments (ICAT), an algorithm we developed, to identify PMC trajectories and marker genes therein. We then mapped the trajectory markers spatially using whole-mount sm-FISH to spatially integrate the scRNA-seq results. The sm-FISH data were analyzed using Napari, ANTs and scikit-image by first manually applying spatial landmarks to each embryo for registration, producing a spatial template, then segmenting the template into subcellular Region Adjacency Graphs (RAGs), and finally mapping the expression data onto the RAGs. The results show that PMC trajectories map to spatially discrete locations and are temporally dynamic. At 18 hpf, each spatial region in the PMC pattern exhibits a unique combination of trajectories, suggesting that the trajectories collectively represent a spatial code for pattern formation.
28 -- Navami Jain
Learning from pre-pandemic data to design and test variant-proof therapeutics
Effective pandemic preparedness relies on predicting immune-evasive viral mutations to enable early detection of concerning variants and design therapeutics that are resilient to future evolution. However, current strategies for viral evolution prediction are not available early in a pandemic and have limited predictive power – experimental approaches require host antibodies and existing computational methods draw heavily from current strain prevalence. Therapeutics have also historically been designed with an eye towards past or circulating variants, not towards future evolution.
To address these challenges, we developed EVEscape[1], a generalizable AI-driven framework that integrates fitness and biophysical information. EVEscape quantifies the escape potential of viruses at scale and is applicable before surveillance sequencing, experimental scans, or 3D structures of antibody complexes are available. We demonstrate that EVEscape, trained on sequences available pre-2020, performs as accurately as high-throughput scans at anticipating pandemic variation for SARS-CoV-2 and is generalizable to other viruses including Influenza, HIV, and understudied viruses with pandemic potential like Lassa and Nipah. While protein language models (PLMs) should improve generalization to viruses with limited data, we recently investigated their current failure modes for viruses, which inform the development of a new viral PLM[2,3].
We showcase EVEscape in critical applications:
(1) Surveillance: Monthly reports flagging SARS-CoV-2 escape variants from their first appearance[1] that we share publicly on evescape.org and with public health organizations.
(2) Vaccine evaluation: Designed panels of antigens that mimic future variants for early, proactive evaluation of the future protection of vaccines and therapeutics[4].
(3) Vaccine design: Designed a nanoparticle-based vaccine capable of eliciting broad, long-lasting protection against diverse coronaviruses, including future variants[5,6], currently in pre-clinical trials.
This three-pronged approach represents a paradigm shift in pandemic preparedness, offering a novel strategy to preemptively address viral families with pandemic potential and bolster global prevention efforts.
References
[1] N. Thadani*, S. Gurev*, P. Notin*...D. Marks. 2023. “Learning from Prepandemic Data to Forecast Viral Escape.” Nature.
[2,3] S. Gurev*, N. Youssef*, N. Jain, and D. Marks. 2024. “Tradeoffs of Alignment-based and Protein Language Models for Predicting Viral Mutation Effects.” NeurIPS MLSB & AIDrugX Workshops.
[4] N. Youssef, S. Gurev…J. Lemieux, J. Luban, M. Seaman, D. Marks. 2024. “Protein Design for Evaluating Vaccines against Future Viral Variation.” bioRxiv.
[5,6] S. Gurev*, N. Youssef*, H. Pierce-Hoffman, and D. Marks. 2024. “Future-Proof Vaccine Design with a Generative Model of Antibody Cross-Reactivity.” ICLR GEMbio & ICML ML4LMS Workshops.
29 -- Yi Huang
Mapping functional genetic variants and disease heritability in human adipocyte villages under metabolic disease-relevant stimuli
The rising prevalence of metabolic diseases represents a global public health concern, with limited understanding of pathophysiological mechanisms hindering the development of effective therapeutic approaches. Genetic variation and its interaction with environmental cues are crucial in the etiology of metabolic diseases, which are strongly linked to adipocyte dysfunction and are highly cell state- and context-specific. Here, we leverage a population-scale biobank (CellGenBank) to conduct pooled natural genetic variation screens in primary human adipocyte villages for single-nucleus transcriptomic and chromatin accessibility profiling under various disease-relevant stimuli. Processing 338k nuclei from 118 donors allowed us to identify key cell states characterized by canonical marker genes, including quiescent (PDGFRA) and proliferative (PDGFRA, CDK1, AURKB) adipose tissue-derived mesenchymal stem cells (AMSCs), structural Wnt-regulated adipose tissue-resident (SWAT; DCN, PLAC9, APOD) cells and adipogenic (ADIPOQ, PLIN1) cells, with distinct transcriptional responses to stimuli. By associating individual polygenic risk scores, we identified correlations between disease risk and shifts in cell state proportions. These states have also been mapped to metabolic disease-relevant traits using single-cell heritability analysis, with body mass index highly enriched in AMSCs, whereas traits informative of metabolic health show strong enrichment in adipogenic cells and a subpopulation of SWAT cells. Furthermore, genome-wide eQTL mapping revealed hundreds to thousands of context- and cell state-specific eQTLs with potential roles in adipocyte function. Collectively, the implementation of the human adipocyte village approach coupled with single-nucleus functional genomics enabled the discovery of genetic mechanisms underlying metabolic diseases, highlighting its high therapeutic potential.
30 -- Aarti Jajoo
Leveraging diverse ancestry data to uncover genotype-driven transcriptomic mechanisms in adult and developing brain for psychiatric disorders
To better understand molecular mechanisms underlying psychiatric disorders, we need improved analytic approaches for integrating large-scale genomic data with biological data representing gene transcription in the human brain. However, one critical component of individual variation - different levels of gene regulation due to genetic ancestry diversity - has not been traditionally incorporated into such analyses. To address this, we leveraged the ancestral diversity of individuals in Genotype-Expression (GEx) reference panels and GWAS to enhance the detection of transcriptome-wide association study (TWAS) signals. To investigate GEx-level choices, we trained GReX (Genetically Regulated gene expression) models using rigorously constructed subsets of a human postmortem cortex GEx panel, generated through downsampling, segregating, and mixing samples of Admixed African (AA) and European (EUR) ancestry, while considering disease status in the subset design. TWAS results were obtained by integrating these GReX models with ancestry-specific GWASs for schizophrenia (SCZ), post-traumatic stress disorder (PTSD), major depressive disorder (MDD), and bipolar disorder (BIP). Ancestry-specific predicted genes were enriched in specialized pathways involving mitochondrial functions, organelle structure, and metabolism, while shared genes demonstrated high concordance (>95%) in predictor SNP weight direction. Shared TWAS signals, obtained by integrating various GReX models with a GWAS, demonstrated high concordance while uncovering novel signals at the gene, pathway, and drug-repurposing levels. Despite lower power due to smaller cohort sizes, AA GWASs enhanced signals and alleviated noise in meta-TWAS analysis when integrated with appropriate GReX models. EUR GReX-specific TWAS pathways included corticosteroid signaling in PTSD, TGF-beta and neurotrophins in MDD, and inflammation and viral life cycle in SCZ. 
AA GReX-specific TWAS pathways included glutamine signaling in PTSD, proline-peptide DNA activity in MDD, and immune cytotoxicity, serotonin, dopaminergic, and phagocytosis pathways in SCZ. Finally, ancestry-specific predicted genes in the developing brain exhibited pathway enrichment similar to that in the adult brain, but showed a higher proportion of shared TWAS signals across ancestries, alongside prominent ancestry-specific signals in specialized developmental and neuronal pathways.
31 -- Anurendra Kumar
WHISPER: Inference of contact-mediated cell-cell signaling
Recent advancements in single-cell transcriptomics have transformed our understanding of cellular communication by elucidating intricate interactions between cells, genes, and signaling pathways. While tools like CellChat effectively infer ligand-receptor (LR)-mediated signaling from scRNA-seq and spatial transcriptomics (ST) datasets, contact-mediated communication, particularly via gap junctions (GJs), remains unexplored. Furthermore, existing methods leveraging spatial information for LR-mediated signaling fail to account for confounding effects from the spatial distribution of cell types, leading to high false-positive rates.
Here, we present WHISPER (Workflow for HIgh-precision Spatial Proximity-mediated cEll-cell inteRactions), a statistical framework that constructs contact networks from ST data to model cell-pair connectivity and infer a tensor representing interactions across cell types and signaling gene pairs (GJ or LR). To enhance scalability, WHISPER derives an analytical null distribution, enabling efficient analysis of large datasets. Applying WHISPER to mouse brain ST data, we demonstrate superior false-positive control compared to existing methods.
WHISPER identifies known contact-mediated interactions, such as astrocyte-oligodendrocyte GJ communication, and reveals novel interactions, including excitatory-astrocyte GJ coupling and excitatory-granule neuron LR signaling in hippocampus. To uncover broader communication patterns, we introduce a latent variable model that detects recurrent interaction patterns (e.g., cell types that interact using combinations of signaling gene pairs), resulting in a robust map of contact-mediated communication. Furthermore, we develop imputation strategies to extend WHISPER’s applicability to imaging-based datasets with limited gene throughput.
Electrical synapses (mediated by GJs) and chemical synapses (mediated by LRs) are known to regulate each other in specific contexts. Our findings reveal a novel crosstalk between LR- and GJ-mediated signaling, particularly enriched in extracellular matrix-related pathways. Finally, we apply WHISPER to Alzheimer’s disease models, performing differential analyses to identify both local and global alterations in cell-cell communication. These results provide key insights into the interplay between electrical and chemical synapses, shedding light on their coordinated role in neural development and disease.
32 -- Vidhi Lalchand
Cancer Drug Response Surface Modelling with Multi-Output Gaussian Processes
Dose-response prediction in cancer is a critical step in assessing the efficacy of drug combinations on cancer cell-lines. The efficacy of a pair of drugs can be expressively modelled through a dose-response surface which outputs the viability score across a spectrum of drug concentrations for each pair of drugs in the training data. Using large in-vitro drug sensitivity screens, the goal is to develop accurate predictive models that can be used to inform treatment decisions by predicting the efficacy of a given drug combination on new cancer cell-lines as well as the effect of unseen drugs on seen cancer cell-lines. Previous work on modelling dose-response surfaces did not scale to large datasets and did not encode cell-lines and drugs; hence, these frameworks could not generalise to novel combinations of drugs and cell-lines. We address this through the use of an upstream deep generative model (DGM) to embed the drugs in a continuous chemical space, enabling viability predictions for unseen drugs. We use a similar approach to encode cancer cell-lines. We demonstrate the performance of our model using an open source high-throughput dataset and show that it is able to efficiently borrow information and model cross-covariance across experimental outputs, where each output encodes the efficacy across a grid of concentrations of a specific drug pair on a fixed cancer cell-line.
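As a rough illustration of what modelling cross-covariance across outputs can mean, the sketch below implements the posterior mean of a textbook multi-output Gaussian process with an intrinsic coregionalization kernel, in which the covariance is the Kronecker product of an output covariance matrix `B` and an RBF kernel over concentration pairs. This is a generic construction chosen for brevity, not the model described in the abstract; the function names, the RBF choice, and the parameters `B`, `noise`, and `lengthscale` are all assumptions.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel between rows of X1 and rows of X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def icm_posterior_mean(X, Y, X_new, B, noise=1e-2, lengthscale=1.0):
    """Posterior mean of a zero-mean multi-output GP with kernel B (x) k_rbf.
    X: (n, d) concentration pairs shared by all outputs.
    Y: (n, m) viability scores, one column per output (e.g. drug pair on a cell-line).
    B: (m, m) positive semi-definite covariance between outputs.
    Returns an (n_new, m) array of predictions."""
    n, m = Y.shape
    K = np.kron(B, rbf(X, X, lengthscale))           # (m*n, m*n), output-major ordering
    K_star = np.kron(B, rbf(X_new, X, lengthscale))  # (m*n_new, m*n)
    y = Y.T.reshape(-1)                              # stack outputs to match the Kronecker ordering
    alpha = np.linalg.solve(K + noise * np.eye(m * n), y)
    return (K_star @ alpha).reshape(m, -1).T
```

Off-diagonal entries of `B` are what let observations of one output inform predictions for another; with a diagonal `B` the outputs decouple into independent single-output GPs.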
33 -- Max Land
3D interchromosomal interactions mediate pathway-level gene regulation and shape the aging phenotype in human skin fibroblasts
Fibroblasts play a critical role in maintaining the extracellular matrix (ECM) integrity by sensing and responding to mechanical signals from the tissue environment. A defining characteristic of aging, the deterioration of the ECM disrupts tissue homeostasis and is implicated in conditions such as fibrosis, cancer, and other aging-related diseases. However, the factors that influence fibroblasts to change state and function during aging remain unknown. Here, we perform comparative statistical analysis of Hi-C data from healthy skin fibroblasts of a young (10 yo) and an old (75 yo) patient. Age-specific analysis of 3D interchromosomal contacts revealed the spatial colocalization of genes involved in ECM, TGF-β signaling, and senescence pathways in young fibroblasts and inflammatory pathways in old fibroblasts. Using transcription factor (TF) enrichment analysis, we identified potential TF regulatory subnetworks (transcription factories) within spatially colocalized genes. Computational analysis of pathway-specific stimulation experiments (e.g., TGF-β, cytokines) between old and young fibroblasts confirmed functional differences in phenotype aligning with the presence of transcription factories. Collectively, our results suggest a pathway-level mechanism of gene regulation in which the 3D chromatin organization of fibroblasts facilitates spatial clustering of pathway genes, poising the cell toward specific phenotypes. This finding provides insights into how the interplay between ECM changes, mechanical signaling, and 3D chromatin directly shapes the aging phenotype. In addition, this gene regulatory mechanism could potentially be targeted to develop novel therapeutic strategies to treat aging and other fibroblast-related diseases.
34 -- Matthew Levine
Integrating Mechanistic RNA Dynamics with Machine Learning for Gene Regulatory Network Inference in Multimodal Single-Cell Data
Multimodal single-cell experiments enable the simultaneous measurement of chromatin state, and nascent and mature mRNA across thousands to millions of cells, providing an unprecedented view of transcriptional processes over development and perturbation. These data offer a unique opportunity for mechanistic inference of how DNA state changes propagate through RNA processing and how regulatory interactions govern transcriptional dynamics.
However, classical models of stochastic RNA dynamics either assume independent gene regulation or scale poorly beyond small, predefined networks. Machine learning approaches, while flexible, often rely on black-box predictions and ad hoc data transformations, limiting mechanistic interpretability.
Here, we introduce a framework that integrates continuous mechanistic models of stochastic RNA dynamics with machine learning to infer gene regulatory networks. This approach enables scalable Bayesian inference of gene regulatory dynamics governing DNA and RNA processing across neural and immune cell differentiation as well as under unseen genetic interventions. Beyond predicting cell fate, our method reveals regulation through transcription factor interactions and binding, and the downstream effects on RNA processing kinetics.
35 -- Dongshunyi Li
Unraveling Tissue Assembly with Hierarchical Transformers
One of the fundamental questions in biology is how cells interact and assemble into tissues, organs and ultimately the entire human body. While numerous studies have advanced our understanding of this process, only recently has it become possible to investigate the rules of tissue organization in a data-driven manner. High-throughput spatial transcriptomics technologies have transformed molecular biology by enabling the measurement of molecular features at scale directly within tissue context. The resulting large-scale, high-dimensional spatial data provide unique opportunities to develop models that learn the principles governing tissue assembly.
We introduce TISU (Tissue-Interaction & Spatial Understanding), a transformer-based model designed to identify the multi-hierarchical spatial organization of cells and the interaction mechanisms among them. Inspired by the concept of compartmentalization in multicellular organisms, TISU captures spatial structures from the cellular level to tissue module level, learning interaction rules within and across these organizational levels. By training on millions of cells, our model develops a systematic and comprehensive understanding of tissue assembly and holds the potential to reconstruct tissues from individual cells in silico.
Through rigorous evaluation, we demonstrate that TISU exhibits generalization capabilities to unseen tissue sections, accurately reconstructing missing cells and spatial modules. Additionally, the attention maps generated by the model reveal biologically meaningful spatial interactions and provide insights into tissue organization. These results highlight the promise of TISU as a powerful tool for advancing our understanding of tissue assembly and organization.
36 -- Tianyu Liu
spRefine Denoises and Imputes Spatial Transcriptomics with a Reference-free Framework Powered by Genomic Language Model
The analysis of spatial transcriptomics is hampered by data quality issues, including high noise levels and unmeasured genes. Because acquiring spatial transcriptomics data costs more than acquiring traditional single-cell transcriptomics data, designing an effective scheme to address these problems is an important research direction. In this manuscript, we propose a deep-learning-based method for jointly denoising and imputing spatial transcriptomic samples with the help of genomic language models, named the spatial refining model (spRefine). We demonstrate that spRefine can generate better cell-level or spot-level representations after denoising and imputation, and that it can also improve the quality of data integration. In addition, spRefine plays a strong role in model pre-training and the discovery of novel biological functions, and we support this point with various downstream applications centered on spRefine with data from different scales. Moreover, spRefine can improve the estimation of spatial ageing clock models and identify novel ageing-associated relationships, which correlate with important biological processes such as loss of neuronal function.
37 -- Minxing Pang
CelloType: a unified model for segmentation and classification of tissue images
Cell segmentation and classification are critical tasks in spatial omics data analysis. Here we introduce CelloType, an end-to-end model designed for cell segmentation and classification for image-based spatial omics data. Unlike the traditional two-stage approach of segmentation followed by classification, CelloType adopts a multitask learning strategy that integrates these tasks, simultaneously enhancing the performance of both. CelloType leverages transformer-based deep learning techniques for improved accuracy in object detection, segmentation and classification. It outperforms existing segmentation methods on a variety of multiplexed fluorescence and spatial transcriptomic images. In terms of cell type classification, CelloType surpasses a model composed of state-of-the-art methods for individual tasks and a high-performance instance segmentation model. Using multiplexed tissue images, we further demonstrate the utility of CelloType for multiscale segmentation and classification of both cellular and noncellular elements in a tissue. The enhanced accuracy and multitask learning ability of CelloType facilitate automated annotation of rapidly growing spatial omics data.
38 -- Yunyi Shen
Multi-marginal Schrödinger Bridges with Iterative Reference Refinement
Practitioners often aim to infer an unobserved population trajectory using sample snapshots at multiple time points. For example, given single-cell sequencing data, scientists would like to learn how gene expression changes over a cell’s life cycle. But sequencing any cell destroys that cell, so we can access data for any particular cell only at a single time point, although we have data across many cells. The deep learning community has recently explored using Schrödinger bridges (SBs) and their extensions in similar settings. However, existing methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic (often set to Brownian motion within SBs). Learning piecewise from adjacent time points can fail to capture long-term dependencies, and practitioners are typically able to specify a model family for the reference dynamic but not the exact values of the parameters within it. We therefore propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a family of reference dynamics, not a single fixed one. We demonstrate the advantages of our method on simulated and real data.
39 -- Shannon Shen
Supporting Biomedical Discovery with LLM Agents for Literature Grounded Search and Reasoning
A key challenge in automated biomedical discovery is ensuring that AI systems make interpretable decisions grounded in comprehensive literature review. To this end, scientists often undertake a series of actions, reading and reasoning over documents from multiple sources and deriving new insights. We build an agent system that conducts long chain-of-thought reasoning with retrieval to answer such complex questions. Specifically, we formalize key characteristics of the complex reasoning traces involved in multi-document reasoning over biomedical literature, develop an interface through which biomedical researchers collaboratively carry out complex reasoning for data collection, and train and evaluate the agent system to perform the retrieval and reasoning.
40 -- Maria Skoularidou
Towards detecting molecular quantitative trait loci on neurotypical population
A vast literature has evolved over the past few decades on the detection of molecular quantitative trait loci (QTLs). Quantitative traits are characteristics that depend on inherited factors and whose intensity is affected by interactions between genes and environmental factors. In the present work we focus on identifying expression QTLs (eQTLs) across chromosomes, using linear methods, based on postmortem data from neurotypical individuals of European origin.
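As an illustration of what a linear eQTL test can look like, the sketch below regresses a gene's expression on genotype dosage (0, 1 or 2 copies of the alternate allele) for a single variant-gene pair. The abstract does not specify the model, covariates or software used, so the function name, the inputs and the single-predictor design are all assumptions made for this example only.

```python
import math

def eqtl_linear_test(dosages, expression):
    """Ordinary least squares of expression on genotype dosage.

    Hypothetical helper for illustration: returns (slope, t_statistic)
    for one variant-gene pair, with no covariates.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched inputs with at least 3 samples")

    mean_g = sum(dosages) / n
    mean_y = sum(expression) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, expression))
    if sxx == 0:
        raise ValueError("genotype has no variation in this sample")

    slope = sxy / sxx                      # estimated effect per allele
    intercept = mean_y - slope * mean_g

    # Residual variance with n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * g)) ** 2
              for g, y in zip(dosages, expression))
    sigma2 = rss / (n - 2)
    if sigma2 == 0:
        raise ValueError("perfect fit: t-statistic is undefined")
    std_err = math.sqrt(sigma2 / sxx)

    return slope, slope / std_err


# Toy data (made up): expression rises by about 0.5 per alternate allele.
slope, t_stat = eqtl_linear_test([0, 0, 1, 1, 2, 2],
                                 [1.0, 1.2, 1.5, 1.7, 2.0, 2.2])
```

In practice such a test is repeated over many variant-gene pairs, typically with covariates and multiple-testing correction, none of which is shown here.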
41 -- Sabina Stefan
AI-based Precision Prognostics and Therapy Personalization for Childhood Brain Tumors
Accurate prediction of long-term patient outcomes remains a critical challenge in oncology, particularly for childhood brain tumors, where therapeutic strategies need to carefully balance the risk of tumor recurrence and death against severe treatment-induced sequelae. To address this challenge, we integrated genome-wide DNA methylation and copy-number profiles with time-to-event survival data from 2,540 patients to develop MANTIS, a precision AI tool for medulloblastoma and ependymoma outcome prediction. Our approach utilizes sparse neural network models to generate individualized survival probabilities over a 10-year period, achieving concordance indices as high as 0.79 for medulloblastoma and 0.75 for ependymoma in independent validation cohorts, exceeding the predictive accuracy of current clinical indicators. Applying explainable AI approaches to our model revealed an underappreciated role of an enhancer in activating the MYC oncogene, defining Group 3 medulloblastoma risk groups. Towards the goal of personalized therapy recommendations, we extended MANTIS to predict causal outcomes of varying craniospinal irradiation doses, utilizing data from a recent randomized clinical trial. Our findings indicate that medulloblastoma patients with a low estimated individual treatment effect (~60% of patients) exhibited no significant survival disadvantage (P=0.683) but maintained significantly higher IQ scores if assigned to a less intensive radiation dose (P=0.037). In conclusion, our study establishes a foundation for precision prognostics and therapy personalization based on genomic profiling and AI for pediatric brain tumor patients and offers a blueprint for similar initiatives in other pediatric and adult cancers.
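For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C for right-censored survival data. It is not the MANTIS evaluation code; the function name and toy inputs are assumptions, tied survival times are simply skipped, and a higher risk score is assumed to mean a shorter expected survival.

```python
from itertools import combinations

def concordance_index(times, events, risk_scores):
    """Harrell's C-index (simplified sketch).

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are ignored in this simplified version
        # 'a' is the patient with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier time is censored: the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk failed first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): four patients, the third one censored.
c = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of comparable patient pairs.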
42 -- Sophie Sun
Diffusion Model for Compositional Representation of Subcellular Structures in Eukaryotic Cell Division
In human cells, abnormalities in the cell cycle are hallmarks of diseases such as cancer. However, understanding how subcellular structures and their respective functions dynamically change in space and time during the cell cycle remains challenging. Live-cell imaging techniques, such as fluorescence microscopy, can analyze the spatiotemporal dynamics of subcellular structures as a function of the cell cycle. However, limitations related to spectral overlap do not allow the simultaneous visualization of more than a few structures within a single cell. This prevents us from transitioning towards integrated models of intracellular organization, whereby complex physical and functional interactions between multiple structures are resolved in space and time.
Here, we present YeastDiff, a computational approach that addresses the limitations of simultaneously monitoring multiple structures within a single cell. YeastDiff is a cell-visualization tool that leverages conditional diffusion models (CDMs) and compositionality to generate high-resolution images of multiple subcellular structures. Trained on millions of live-cell images, each depicting one of ~4000 different yeast proteins at a specific cell cycle stage, YeastDiff learns a joint embedding of both the cell cycle stage and gene identities to guide single- and two-protein image generation. YeastDiff can already generate biologically realistic single-cell images depicting each possible pairwise combination of subcellular structures, and we are currently improving its compositional capabilities to accurately represent spatial interactions between multiple proteins.
Collectively, YeastDiff addresses limitations in conventional fluorescence microscopy, enabling the study of the cell cycle-related dynamics of subcellular structures in an integrated fashion. This computational framework opens new avenues for investigating cell cycle regulation and disease mechanisms.