News

Tracking gene expression changes through cell lineage progression with PORCELAN

A new method for detecting gene-expression patterns linked to lineage progression, providing a powerful tool for studying cell state memory across biological systems.

Tracking gene expression changes through cell lineage progression with PORCELAN

A new method for detecting gene-expression patterns linked to lineage progression, providing a powerful tool for studying cell state memory across biological systems.

PRINT: A Milestone Technology for Understanding Gene Regulation

A novel computational method provides valuable insights into DNA-protein interactions and cis-regulatory element dynamics.

PRINT: A Milestone Technology for Understanding Gene Regulation

A novel computational method provides valuable insights into DNA-protein interactions and cis-regulatory element dynamics.

Friday Fellow Feature: Matthew Levine

Passionate about mathematics and dynamical systems, Schmidt Center Postdoctoral Fellow Matt Levine focuses on collaboration and knowledge sharing as he applies his computational skills towards biomedical discoveries.

Friday Fellow Feature: Matthew Levine

March 26, 2025

Fall 2024 Models, Inference & Algorithms (MIA) Talks – A Complete Recap Fall 2024 Models, Inference & Algorithms (MIA) Talks – A Complete Recap

2025

As spring unfolds and our 2025 Models, Inference & Algorithms (MIA) season gains momentum, we invite you to explore our upcoming April and May talks. Whether you're following along in real time or catching up during spring break, we encourage you to review our playlist of past MIA talks, which we will continue to update. In the meantime, we have compiled a recap of the fall 2024 MIA meetings, complete with links to each talk. Happy watching!

‍

‍September

***Vidhi Lalchand (Schmidt Center) on "Bayesian optimization and its uses in the latent space of generative models"***

1 - We started our fall MIA season with lightning talks from Ph.D. students and postdoctoral fellows across the Schmidt Center (Vidhi Lalchand), Harvard (Randy Ellis), MIT (Bowen Jing, Hannes Stark, Andreas Luttens), and the Whitehead Institute (Henry Kilgore). Their presentations covered a broad spectrum of machine learning applications, including protein subcellular compartmentalization, DNA sequence and protein ensemble generation, and dementia prediction using human biobank data.

‍

‍October

***Brian Cleary (Boston University). Photo: Boston University***

‍2 - Brian Cleary and Aedan Brown, a Ph.D. student in Cleary’s lab, explored evolutionary dynamics in complex genotype-to-phenotype relationships. Using metabolism as a model, they showed that selection at the genetic level can be difficult to pinpoint, while selection on collective modes—preferred directions in phenotype space—provides a clearer picture of evolutionary dynamics and polygenic trait variability.

‍

‍3 - David van Dijk presented three recent methods from his lab based on the idea that biological systems function like languages, where molecular components interact combinatorially, much like words forming sentences. His talk highlighted how large language models (LLMs) are advancing single-cell analysis, following a primer by Syed A. Rizvi, a PhD student in van Dijk’s lab, on LLMs and biological foundation models.

‍

***Noah Trebesch (University of Illinois Urbana-Champaign)***

‍4 - Ashkan Fakharzadeh Ghaan and Noah Trebesch, a postdoctoral fellow in his lab, discussed the use of advanced molecular dynamics simulation techniques to characterize the large-scale conformational transitions in molecular transporters – key drug targets that regulate cellular entry and exit. Their primer introduced the simulation techniques underlying their approach.

‍

***Neriman Tokcan (UMass Boston). Photo: Neriman Tokcan***

5 - Neriman Tokcan presented tensor methods as powerful tools for analyzing high-dimensional multi-omics data. By extending traditional matrices into multi-way arrays, tensor methods capture complex interactions among genomic variables such as genes, samples, conditions, and omics layers. She also introduced Consensus-Zero Inflated Poisson Tensor Factorization (C-ZIPTF), a method developed by her lab to improve the stability of tensor factorization for zero-inflated genomic data – a common phenomenon in genomics.

‍

‍November

***Marnix Medema (Wageningen University)***

6 - Marnix Medema presented the ongoing efforts in his lab and the broader community to use computational and AI-driven methods to map the biosynthetic diversity of microbiome-derived metabolites and uncover their roles in microbe-microbe and host-microbe interactions. On the same topic, Victoria Pascal introduced gutSMASH, a tool for systematically analyzing microbial genomes for known and putative specialized primary metabolic gene clusters, and BiG-MAP, an algorithm for assessing gene cluster abundance and expression by using metagenomic and metatranscriptomic data as input.

‍

***Abraham Gihawi (University of East Anglia). Photo: University of East Anglia***

7 - We concluded the fall season with a presentation by Abraham Gihawi and Steven Salzberg (videos not available), on cancer metagenomics. They explored the challenges of detecting microbiome DNA within tumor samples, addressing issues such as misclassified host sequencing reads and batch correction limitations.

‍

Thank you for supporting the MIA community. We hope this recap makes it easy to revisit the highlights from fall 2024 and we look forward to seeing you at our future events!

Events

March 25, 2025

Bridging the Gap: A Systems Biology Approach to Human Disease Bridging the Gap: A Systems Biology Approach to Human Disease

2025

Systems biology represents a powerful approach for revealing the molecular networks at play in model organisms through iterative cycles of perturbation, measurement, and analysis. However, applying this approach to human systems is inexact, and breaks a traditionally closed loop by introducing model systems that may not accurately capture the spatial and temporal complexity of human biology.

*Former Schmidt Center postodctoral fellow David Fischer*

David Fischer, former Eric and Wendy Schmidt Center postdoctoral fellow (2022–2024) and now an assistant professor at the Medical University of Vienna, leads a new perspective paper in Nature Reviews Genetics. Co-authored with Martin Villanueva (MIT) and Schmidt Center collaborators Peter Winter and Alex Shalek (Broad Institute), the paper presents a conceptual framework for studying human systems that accounts for the complexity of biological scales and helps bridge the "translational distance" between discoveries in human cohorts and model-based experimental validation.

Adapted from an update written by Broad Communications.

‍

Cells

People

Active Learning

Tissues

Proteins

March 7, 2025

A new model for neural circuit formation A new model for neural circuit formation

2025

Traditionally, models of neural circuit development have proposed two phases: genetically driven neural wiring, followed by environmentally driven pruning and refinement. But this is inconsistent with certain innate behaviors; many animals, for example, are capable of complex problem solving just after birth.

***Schmidt Center posdoctoral fellow Dániel Barabási.***

In a perspective piece, Eric and Wendy Schmidt Center postdoctoral fellow Dániel Barabási, along with André Ferreira Castro (University of Cambridge) and Florian Engert (Harvard), propose a new three-system model for neural circuit formation. In the first system, development patterns neural circuits, defining innate understanding. In the second, “Eureka” learning events rapidly update existing knowledge. In the third, synapses are perpetually fine-tuned.

Learn more about how this perspective reshapes our understanding of brain adaptability in Daniel's write-up and read the full paper in Nature Reviews Neuroscience.

Adapted from an update written by Broad Communications.

No items found.

February 19, 2025

Tracking gene expression changes through cell lineage progression with PORCELAN Tracking gene expression changes through cell lineage progression with PORCELAN

2025

Recent advances in barcoding technologies have made it possible to reconstruct a lineage tree of cells while simultaneously capturing their transcriptomic profiles. However, to fully leverage the resolution provided by these lineage-resolved single-cell RNA sequencing (scRNA-seq) datasets, new computational approaches are needed. These methods must address key challenges, such as ensuring that gene expression analysis goes beyond pairwise comparisons between stages and instead captures the full hierarchical structure of lineage trees, allowing for the detection of gene expression patterns that follow or deviate from lineage relationships.

In a new study published in Nature Communications, researchers at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard introduce PORCELAN, a statistical framework that automatically detects gene expression patterns linked to lineage progression. This method provides a systematic way to study how gene expression and cell state memory evolve through cell divisions, offering new insights into processes such as cancer progression.

‍Decoding Gene Expression Through Lineage Trees‍

PORCELAN – short for Permutation, Optimization, and Representation learning-based single Cell gene Expression and Lineage ANalysis – combines representation learning with permutations among leaves in the lineage tree. Using a statistical approach, PORCELAN addresses three questions: How can we jointly capture lineage and gene expression information in cell representations? Which genes best reflect lineage relationships, and in which subtrees is this connection strongest? To what extent does gene expression preserve lineage tree structure across different resolutions?

***Schmidt Center graduate student and first author Hannah Schlueter aimed to create a rigorous and adaptable tool for studying cellular identity.***

The researchers validated PORCELAN using synthetic datasets and applied it to three biological systems with lineage-traced scRNA-seq data: lung cancer progression, mouse embryogenesis, and C. elegans development. In lung cancer, PORCELAN identified tumor cell subpopulations that contributed to metastases and pinpointed key genes associated with these transitions – many of which align with known cancer biomarkers and pathways. In developmental systems, the framework uncovered differences in how gene expression memory is maintained across cell divisions, highlighting contrasts between normal development and cancerous progression. These findings underscore the importance of lineage-resolved approaches in understanding fundamental biological processes.‍

A Flexible Tool for the Future‍

The study was led by Hannah Schlueter, a Schmidt Center graduate student and PhD student at MIT’s Laboratory for Information & Decision Systems (LIDS), in collaboration with corresponding author Caroline Uhler, Director of the Schmidt Center and Andrew (1956) and Erna Viterbi Professor of Engineering at MIT in the Department of Electrical Engineering and Computer Science (EECS) and the Institute for Data, Systems, and Society (IDSS).

“Our goal was to develop a method that is both rigorous and adaptable,” says Schlueter. “Because PORCELAN is modular, it can be applied to different data modalities, including lineage-resolved imaging data, by replacing the simpler tree-likeness score based on local autocorrelation, used for transcriptomic data, with a representation learning-based tree-likeness score. This flexibility makes it a powerful tool for studying how cellular identity is maintained and altered over time.”

As lineage tracing technologies continue to evolve, methods like PORCELAN highlight the critical role of applying statistical techniques to biological research. This approach, which merges computational tools with biological insights, is central to the work at the Schmidt Center. By developing methods that bridge computational models with biological questions, the Schmidt Center aims to drive discoveries that deepen our understanding of cellular biology, disease mechanisms, and potential therapeutic strategies.

Cells

Representation Learning

February 5, 2025

Schmidt Center Winter 2025 Newsletter Schmidt Center Winter 2025 Newsletter

2025

No items found.

January 24, 2025

PRINT: A Milestone Technology for Understanding Gene Regulation PRINT: A Milestone Technology for Understanding Gene Regulation

2025

In a new study published this week by Nature, researchers at the Broad Institute of MIT and Harvard and Harvard University developed a new computational method – PRINT – that identifies DNA-protein interaction footprints from both bulk and single-cell chromatin accessibility data. By applying PRINT, researchers uncovered the organization and dynamics of cis-regulatory elements (CREs) across different scales, providing deeper insights into how genes are regulated, and paving the way for innovative research in both health and disease.

‍Advancing Gene Regulation Studies‍

CREs play a critical role in gene expression by binding to regulatory proteins, such as transcription factors (TFs) and histones, influencing fundamental biological processes. However, current methods for studying CREs at high resolution often rely on bulk data, which can mask the variability and specificity of regulatory elements in individual cells.

In this study, PRINT changes the landscape by analyzing both bulk and single-cell chromatin accessibility data to capture CRE organization at a high resolution. By combining PRINT with a deep learning framework, seq2PRINT, the authors were able to identify the binding dynamics of regulatory proteins across various cell contexts and states, such as cell differentiation and aging. This nuanced view enabled the authors to realize that CREs do not have only two regulation states (open/active or closed/inactive), but that the same CREs can be bound by different sets of TFs across cell types.

By enabling researchers to track changes in gene regulation in rare cell types or during disease progression, PRINT sheds light into gene regulation dynamics at the single-cell level in physiological and pathological conditions.

***Yan Hu, graduate student at the Buenrostro Lab (image credit: Buenrostro Lab)***

“Our study really helped open new opportunities to study how different TFs and nucleosomes combinatorially encode the regulation of gene expression, as well as nominating candidate factors driving diseases,” said co-first author Yan Hu, a graduate student at the Buenrostro Lab at Harvard.

‍A Collaborative Effort‍

The study stems from a collaboration between Hu, Max Horlbeck, MD PhD, a geneticist at Boston Children’s Hospital and postdoctoral fellow at the Buenrostro Lab, and Ruochi Zhang, PhD, a postdoctoral fellow at the Eric and Wendy Schmidt Center at the Broad Institute.

The Buenrostro Lab aims to create sequencing technologies to better understand gene regulation across health and disease.

***Ruochi Zhang, postdoctoral fellow at the Schmidt Center***

The lab’s research heavily aligns with the Schmidt Center’s work, which converges biology and machine learning to drive biological discoveries. By blending advanced computational methods with experimental biology, PRINT represents the kind of innovation that emerges from interdisciplinary collaboration.

“This is a breakthrough that we couldn’t have accomplished alone without bridging biology and AI through this collaboration,” said Zhang. “Biology and AI form a two-way street—the diverse expertise within our team provides different perspectives on the problem, motivates innovative approaches for investigation, and ultimately drives deeper understanding of the questions we’re addressing.”

***Max Horlbeck, geneticist at Boston Children’s Hospital and postdoctoral fellow at the Buenrostro Lab (image credit: Max Horlbeck)***

Hu, Horlbeck, and Zhang co-led the efforts of the research, with contributions from colleagues at the Buenrostro Lab, Wagers Lab, and the Gene Regulation Observatory (GRO) at the Broad Institute. This work was supported by NIH institutes NHLBI, NHGRI, NIGMS, NICHD, and the common fund, Broad Institute, Schmidt Center, the GRO, Wagers Lab, Harvard Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, and the Impact of Genomic Variation on Function Consortium. Read more about their work in Nature.

Cells

Proteins

December 20, 2024

Friday Fellow Feature: Matthew Levine Friday Fellow Feature: Matthew Levine

2024

As dynamic as the systems that he researches, Matt Levine brings his friendliness, mathematical knowledge, and eagerness to collaborate and help others to the Schmidt Center.

After studying biophysics at Columbia University, the New Jersey native found his passion for computational projects as opposed to lab work. He joined a research project on diabetes with David Albers, George Hripcsak, and Lena Mamykina, quickly realizing that he enjoyed tackling complex dynamics problems.

Matt also collaborated with mathematician and his future PhD advisor Andrew Stuart, who showed him the clarifying power of math and encouraged him to pursue a PhD in applied mathematics at Caltech. As Matt figured out that he wanted to return to some of his biological roots, he worked with Michael Elowitz on biological computations during his final year of graduate school, and decided to pursue a fellowship at the intersection of applied math, machine learning, statistics, and biomedicine, leading him to the Schmidt Center in the fall of 2023.

In his research on dynamical systems, Matt finds a balance between combining traditional, physics-based models with modern AI techniques. “It’s easy for mathematicians to work on theoretical models that seem perfect on paper, but fail in practice,” he says. “This is why we need to ground ourselves in real data and questions to ensure that the theories work in real-world scenarios.”

Read on to learn more about Matt’s numerous collaborations, this year’s conference presentations, and travel highlights.

‍

Tell us about your area of research.

A lot of my research centers around dynamical systems, which are systems that evolve over time. My focus is on understanding these systems by learning from the data we collect, which often involves developing models to explain how these systems work or to predict their future behavior.

I think about this problem by blending old-school, knowledge-based approaches with newer techniques. The idea is that we often have a solid understanding of a system before we even start collecting data. Rather than relying solely on modern AI approaches that learn everything from scratch, I believe in using a hybrid style. This means we start with what we know and use that knowledge to fill in the gaps where we don’t have information.

I apply these principles in an application-agnostic setting, or a broad way where the methods are applicable to many different types of problems. However, I also focus on specific applications to make sure that the rubber can meet the road. I test these models on actual problems, observe where they fall short, and analyze why they didn’t work. After I get things to work in that setting, it’ll eventually fail again, creating a loop, and that’s the loop I like to live in.

A lot of the applications that I've worked on have a biomedical focus. For example, I’ve worked on modeling glucose dynamics in people with diabetes, using real-world data from daily self-monitoring to gain insights into their physiology. I’ve also applied these methods to climate science, aiming to improve climate models by using data to calibrate and refine these models.

What’s something you wish more people understood in regards to your research?

I wish that more people would recognize the power of mathematics. It’s a powerful language that can be used as a clear, precise, and efficient communication tool. Even when working with experimentalists, writing down the plan and evaluations in terms of rigorous mathematics is a clarifying exercise. By getting it into such a clear language, math brings up questions that wouldn’t have been asked otherwise, forcing you to look at new things carefully.

*Matt has worked substantially in the biomedical sciences, and enjoys collaborating on impactful applied projects.*

‍

Tell us about one of your collaborations.

I’m excited about a new collaboration with Luca Pinello, who’s exploring cell fate and differentiation as dynamic processes. He’s interested in how cells evolve over time, but instead of watching a continuous "video" of a cell’s entire lifespan, we work with snapshots—random moments in the cell's life. It’s like taking a snapshot of Earth today and trying to understand aging; you get different snapshots at different times but can't connect one person’s past and future.

This challenge raises interesting methodological questions and has led to discussions on various dynamical systems and modeling approaches. Our collaboration has led to new mathematical definitions and innovative methods. I'm excited to continue learning from Luca’s group, while sharing insights from my work – it’s been rewarding working together.

‍

You presented at several major conferences this year.

I went to two exciting conferences in July. The first was ICML (International Conference on Machine Learning) in Vienna, where I gave an oral presentation on a paper I have with a group from Stanford (Emily Fox, Ramesh Johari, Dessi Zaharieva, Bob Junyi Zou) – Hybrid Neural ODE Causal Modeling and an Application to Glycemic Response. We focused on learning dynamics from data collected from people with diabetes. Because the data was noisy and sparse, we had to use a lot of existing knowledge to improve our methods, which led us to develop some new approaches.

Before and after the conference, I also spent time in upper Austria on the weekends with a friend from grad school. It was really peaceful – we hiked, swam in lakes, observed the scenery, watched the cows, ate the sausage, and had a great time.

At the Fourth Symposium on Machine Learning and Dynamical Systems at the Fields Institute in Toronto, I presented work done in collaboration with a group from Caltech (Andrew Stuart, Edoardo Calvello, Nikola Kovachki) during my PhD. We’re exploring how large language models (LLMs) and their Transformer architectures, typically used for sequences like word lists, can be adapted for continuous data such as time series or images. Our new approach, called operator learning, reformulates these models to handle data of varying resolutions. This allows our models to process and understand data consistently, even when resolution changes, like in pathological images with different pixel densities.

‍

*Matt's never bored as long as he has a board to jot down his ideas.*

What’s a memorable experience you had related to your research?

I recently had the opportunity to collaborate with a friend – Iñigo Urteaga – from my Columbia days. Although we weren’t in the same research group, we were close friends and colleagues—he was a postdoc while I was still a pre-doc, before I applied to graduate school.

In the summer of 2023, I visited Iñigo at the Basque Center for Applied Mathematics in Bilbao, Spain. With him starting a new professorship there and me about to begin a postdoc at Broad in the fall of 2023, it was the ideal moment for us to reconnect and tackle a project we had long discussed. Our collaboration focuses on uncertainty quantification in dynamical systems—a field that examines how we can model systems from data and gauge the range of possible models that could explain our observations. This approach helps us understand not just what the data tells us, but how confident we should be about different explanations. We developed a JAX package that’s available on GitHub.

I visited him again last winter, and he’s been a formal visitor at MIT and Broad since October. We’re continuing to work together and he’s had a chance to meet the Schmidt Center fellows and start other collaborations.

Mostly, it was really nice to work in Bilbao. We’d take coffee breaks in cafes, drink our espresso, then have long, Spanish lunches. It was a pleasant, peaceful environment to get work done, making our collaboration particularly rewarding.

‍

*"It's very cathartic for me to spend time in the mountains, with the mountain air, in the middle of nowhere," says Matt.*

Let’s talk about the Oberwolfach Research Institute for Mathematics (MFO).

Coined “Math Camp” by a friend and located in the Black Forest, Germany, the MFO conference program hosts 20-30 researchers for focused seminars and workshops. When I attended, the schedule was relaxed, with lectures, coffee breaks, and group lunches, and every Wednesday we would hike and enjoy Black Forest cake together— it’s a whole tradition.

During one of these hikes, I was talking to a researcher from Youssef Marzouk’s group about challenges I faced with learning differential equations from partially observed data. That conversation sparked an idea about using data assimilation, which turned out to be a breakthrough. I went back to my room, ran some code, and the next day I had a working solution for something I'd been struggling with for over a year during my PhD. I was very excited and shared my findings with the group in an impromptu talk, and this idea eventually contributed to a paper I was working on. I'm still exploring similar concepts today with Youssef’s group, focusing on uncertainty quantification for machine learning in unobserved systems.

‍

What advice would you give to aspiring researchers?

One of the aspects I truly enjoy about academia is the emphasis on collaboration and knowledge sharing. I find a lot of joy in working with others and believe that a friendly, open environment greatly enhances the productivity and satisfaction of everyone involved.

When people ask how I’ve managed to build many collaborations, my answer is simple: being friendly goes a long way. I’ve noticed that many successful researchers share this trait. For instance, George Hripcsak, the former chair of the Department of Biomedical Informatics at Columbia, is a prime example. His success in leading large consortia and managing diverse projects largely comes from people enjoying working with him. He’s great at what he does and he’s approachable. In my experience, being kind, open, and personable often proves to be more effective than just being a research machine. Be friendly!

‍

What are some of your hobbies?

I love skiing – it’s very cathartic for me to spend time in the mountains, with the mountain air, in the middle of nowhere. Last winter I was an adaptive ski instructor, which was a really rewarding experience, as I worked with people who had specific needs, and tailored their skiing lessons accordingly. Whether it involved using specialized equipment or teaching them one-on-one, the goal was to make skiing accessible and enjoyable for everyone. It was definitely the hardest job I ever had, but incredibly fulfilling. I also like playing tennis and music, specifically the piano and guitar. I enjoy doing activities because they help me be present in the moment in a physically engaging way, which is different from my regular job.

‍

*Nothing brings Matt down -- after he falls, he bounces back up, with a smile.*

Some of Matt’s Favorite Things:

Underrated technology: Metronome
Game: Chess
Movie: Office Space
Book: A Visit from the Goon Squad by Jennifer Egan
Snack: Cadbury chocolate
Element on the periodic table: Ununennium
Anything else? I’m very appreciative of my mentors!

People

December 18, 2024

Schmidt Center PhD Fellow Jiaqi Zhang Presented with Broad’s Stuart L. Schreiber Award in Scientific Excellence Schmidt Center PhD Fellow Jiaqi Zhang Presented with Broad’s Stuart L. Schreiber Award in Scientific Excellence

2024

Each year at the Retreat, in recognition of their exceptional contributions to the institute, Broad presents outstanding Broadies with the Broad Institute Excellence and Achievement Awards as well as the Eric S. Lander and Stuart L. Schreiber Awards in Scientific Excellence. This year, the Eric and Wendy Schmidt Center is thrilled to announce that Schmidt Center PhD Fellow Jiaqi Zhang was among one of the winners.

The Eric S. Lander and Stuart L. Schreiber Awards in Scientific Excellence were established in 2019 to recognize outstanding scientists at the Broad who demonstrate, first and foremost, scientific excellence, along with exceptional commitment to promoting women in science — either by their own example or through mentoring and supporting women in their scientific careers. They demonstrate true Broadie spirit and show what makes the Broad so special. These rising stars presented their research during the award ceremony, and were recognized with a commemorative scientific excellence award and an accompanying $1,000 prize. They will also participate in a Broad-sponsored leadership development program and will serve on the award selection committee next year.

*Jiaqi receiving her award at the Broad Retreat.*

Jiaqi Zhang began her academic journey at Peking University, where she earned a bachelor’s degree in statistics. She is now a Ph.D. candidate in the Department of Electrical Engineering and Computer Science at MIT, a researcher in the Schmidt Center within the Uhler Lab at Broad, and co-leader of the Machine Learning subgroup in the Gene Regulation Observatory.

Jiaqi’s research focuses on causal inference and experimental design. She has developed active learning algorithms that enable efficient identification of optimal genetic interventions, which reduces experimental costs and increases precision. Her work has advanced fields such as cancer immunotherapy and cellular reprogramming and has led to numerous first-author publications in top journals. She has proven herself to be an emerging leader in machine learning and computational biology.

A cornerstone of Jiaqi’s contributions has been her work on a large-scale cancer immunotherapy data science challenge last year at the Broad, which attracted more than 1,000 participants globally. She co-designed and analyzed machine learning tasks to identify genetic targets that enhance T-cell efficacy against cancer. Jiaqi is currently in the final stages of preparing a first-author manuscript on the novel perturbations identified through this deep integration of computation and experimental work.

Beyond her scientific contributions, Jiaqi is a dedicated advocate for gender equity in science. She has consistently mentored undergraduate and early-stage graduate students. Her active participation in initiatives like the Women in Data Science Conference and the Rising Star in EECS workshop underscores her dedication to building a supportive and inclusive scientific community, inspiring more women to pursue and excel in STEM fields.

Adapted from an article written by Broad Communications.

People

December 4, 2024

Schmidt Center Postdoctoral Fellow Dániel Barabási Named to Forbes 30 Under 30 List Schmidt Center Postdoctoral Fellow Dániel Barabási Named to Forbes 30 Under 30 List

2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is thrilled to announce that Postdoctoral Fellow Dániel Barabási has been inducted into the Forbes 30 Under 30 Class of 2025 for Science.

With more than 10,000 nominations each year, the Under 30 list, evaluated by Forbes editors and industry expert leaders, recognizes a total of 600 young innovators under the age of 30 from North America, within 20 different industries, who are creating meaningful change with business, culture, and entrepreneurship. The science candidates were evaluated by a panel of judges featuring Sara Seager, professor of physics at MIT; Luna Yu, cofounder and CEO of Genecis; Christina Smolke, cofounder and CEO of Antheia; and Randy Glein, cofounder and partner of DFJ Growth.

Dániel joined the Schmidt Center as a fellow in 2024, collaborating with Professors Xiao Wang and Jason Buenrostro. He holds a PhD from the Harvard Biophysics Graduate Program, where he was advised by Florian Engert (2024), and a bachelor’s degree in physics from the University of Notre Dame, where he worked with Zoltán Toroczkai on distance-based network models of brain connectivity (2013-2017).

By characterizing brain connectivity maps and combining them with high-throughput transcriptomics, Dániel looks to uncover the architectural building blocks of neural circuits, thereby understanding the brain’s maturation, wiring, and function. In time, he aims to apply these tools to study evolutionary differences in brain connectivity between species and investigate how stress dysregulates the brain, with the goal of informing therapeutic interventions.

See the full list of the Forbes 30 Under 30 scientists and the rest of the 2025 class in other categories.

‍

People

November 7, 2024

A causal theory for studying the cause-and-effect relationships of genes A causal theory for studying the cause-and-effect relationships of genes

2024

By studying changes in gene expression, researchers learn how cells function at a molecular level, which could help them understand the development of certain diseases.

But a human has about 20,000 genes that can affect each other in complex ways, so even knowing which groups of genes to target is an enormously complicated problem. Also, genes work together in modules that regulate each other.

Researchers from MIT and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have now developed theoretical foundations for methods that could identify the best way to aggregate genes into related groups so they can efficiently learn the underlying cause-and-effect relationships between many genes.

Importantly, this new method accomplishes this using only observational data. This means researchers don’t need to perform costly, and sometimes infeasible, interventional experiments to obtain the data needed to infer the underlying causal relationships.

In the long run, this technique could help scientists identify potential gene targets to induce certain behavior in a more accurate and efficient manner, potentially enabling them to develop precise treatments for patients.

*Schmidt Center fellow and MIT EECS PhD student Jiaqi Zhang*

“In genomics, it is very important to understand the mechanism underlying cell states. But cells have a multiscale structure, so the level of summarization is very important, too. If you figure out the right way to aggregate the observed data, the information you learn about the system should be more interpretable and useful,” says Schmidt Center Fellow Jiaqi Zhang, MIT graduate student in the Department of Electrical Engineering and Computer Science (EECS), and co-lead author of a paper on this technique.

Zhang is joined on the paper by co-lead author Ryan Welch, also a Schmidt Center fellow and MIT master’s student in engineering; and senior author and Schmidt Center Director Caroline Uhler, Andrew (1956) and Erna Viterbi Professor of Engineering in EECS and the Institute for Data, Systems, and Society (IDSS) at MIT, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS). The research will be presented at the Conference on Neural Information Processing Systems (NeurIPS).

*Schmidt Center fellow and MIT master's student Ryan Welch*

‍Learning from observational data

The problem the researchers set out to tackle involves learning programs of genes. These programs describe which genes function together to regulate other genes in a biological process, such as cell development or differentiation.

Since scientists can’t efficiently study how all 20,000 genes interact, they use a technique called causal disentanglement to learn how to combine related groups of genes into a representation that allows them to efficiently explore cause-and-effect relationships.

In previous work, the researchers demonstrated how this could be done effectively in the presence of interventional data, which are data obtained by perturbing variables in the network.

But it is often expensive to conduct interventional experiments, and there are some scenarios where such experiments are either unethical or the technology is not good enough for the intervention to succeed.

With only observational data, researchers can’t compare genes before and after an intervention to learn how groups of genes function together.

“Most research in causal disentanglement assumes access to interventions, so it was unclear how much information you can disentangle with just observational data,” Zhang says.

The MIT researchers developed a more general approach that uses a machine-learning algorithm to effectively identify and aggregate groups of observed variables, e.g., genes, using only observational data.

They can use this technique to identify causal modules and reconstruct an accurate underlying representation of the cause-and-effect mechanism. “While this research was motivated by the problem of elucidating cellular programs, we first had to develop novel causal theory to understand what could and could not be learned from observational data. With this theory in hand, in future work we can apply our understanding to genetic data and identify gene modules as well as their regulatory relationships,” Uhler says.

*Schmidt Center Director and MIT EECS and IDSS Professor Caroline Uhler*

‍A layerwise representation

Using statistical techniques, the researchers can compute a mathematical function known as the variance for the Jacobian of each variable’s score. Causal variables that don’t affect any subsequent variables should have a variance of zero.

The researchers reconstruct the representation in a layer-by-layer structure, starting by removing the variables in the bottom layer that have a variance of zero. Then they work backward, layer-by-layer, removing the variables with zero variance to determine which variables, or groups of genes, are connected.

“Identifying the variances that are zero quickly becomes a combinatorial objective that is pretty hard to solve, so deriving an efficient algorithm that could solve it was a major challenge,” Zhang says.

In the end, their method outputs an abstracted representation of the observed data with layers of interconnected variables that accurately summarizes the underlying cause-and-effect structure.

Each variable represents an aggregated group of genes that function together, and the relationship between two variables represents how one group of genes regulates another. Their method effectively captures all the information used in determining each layer of variables.

After proving that their technique was theoretically sound, the researchers conducted simulations to show that the algorithm can efficiently disentangle meaningful causal representations using only observational data.

In the future, the researchers want to apply this technique in real-world genetics applications. They also want to explore how their method could provide additional insights in situations where some interventional data are available, or help scientists understand how to design effective genetic interventions. In the future, this method could help researchers more efficiently determine which genes function together in the same program, which could help identify drugs that could target those genes to treat certain diseases.

This research is funded, in part, by the MIT-IBM Watson AI Lab and the U.S. Office of Naval Research.

‍‍Adapted from an article posted on MIT News.

‍

Causal Inference

Cells

September 26, 2024

Schmidt Center Fall 2024 Newsletter Schmidt Center Fall 2024 Newsletter

2024

No items found.

August 20, 2024

#WhyIScience Q&A: A systems biologist develops computational tools to bring scale to cell experiments #WhyIScience Q&A: A systems biologist develops computational tools to bring scale to cell experiments

2024

At first, Eric and Wendy Schmidt Center postdoctoral fellow Yue Qin thought she wanted to become a doctor. She’d always been interested in disease — why people got sick, why some illnesses could send you to a hospital while others could be treated at home. As she grew older, however, she realized she was more interested in learning about the roots of disease and the genes that caused them.

Growing up in Ningbo, a city in eastern China, Qin was strongly influenced by societal expectations that girls were better suited for language arts than math and science. She assumed that her struggles in a high school computer science course were because of her gender, even though she’d had no trouble with math. That all changed in college, at the University of California, San Diego (UCSD), when Qin took an introductory computer science course and fell in love with coding and its logic. She began seriously considering a career in computer science and combining it with her interest in biology.

*Qin loves being at the intersection of computer science and biology because "it opens up new avenues of scientific exploration and pushes the limit of what we are able to do," she says.*

Doing computational biology research as an undergraduate, as well as having supportive research advisors, inspired her to stay in research. After graduating with a bachelor’s degree in bioinformatics, Qin then went on to pursue her PhD at UCSD, where she used computational modeling to study how proteins interact with each other and assemble into a human cell. In January 2023, she joined the Schmidt Center as a postdoctoral fellow in the labs of Paul Blainey and Director Caroline Uhler. At the Broad, she aims to create an in silico cell: a computational model scientists can use to study at scale how external influences such as drug treatments affect cells.

We spoke with Qin about finding her place in science, how computational tools can advance biological research, and what it’s like to do both computational and wet lab experiments in this #WhyIScience Q&A.

What do you like about the intersection of computer science and biology?

What first intrigued me about computer science was that once you understand the logic of code, everything makes sense. You write a code and enter it into a computer and it will just do whatever the code says. Any errors are part of the code itself. Once I understood it, I found I was really in love with this logic. And in biology, you don't always understand what goes wrong in disease, how mistakes in our genome get translated. All those pieces are missing. But biology also has its own form of logic, and I was curious if I could use the logic in computer science to help me understand the logic in biological science.

Studying only biology limits you to a specific type of research, and the same goes for computer science. While the research within each discipline can be quite distinctive, when you merge these different fields, then you have amazing things like [the AI protein-structure prediction algorithm] AlphaFold. Before, biologists haven't been able to study the function of certain proteins because their structures were unknown. Now, with AlphaFold predictions, biologists have so many new hypotheses.

Being at the leading edge of science that intersects with computer science and biology is super exciting. It opens up new avenues of scientific exploration and pushes the limit of what we are able to do.

How are you using computational methods to study biological systems?

My goal is to build biotechnology tools and computational methods that can enable us to create an in silico cell to simulate interventions of treatments so that we can understand and treat disease.

Perturbing genes at scale in a dish is one of the easiest and most cost-effective ways to help us understand the functions of genes and find new therapeutic options. The problem we're really facing is that, for example, knocking out a single gene is not enough to cure cancer. We might need to treat with two drugs or perturb two pathways simultaneously to effectively cure cancer. We have 20,000 genes and if we want to exhaustively explore the effects of perturbing any two genes, that's 200 million options. But people have found that even perturbing just two pathways won’t be enough. The problem is just enormous; it's not solvable purely in the lab. However, by understanding how genes interact with each other using existing knowledge in an in silico cell, we can simulate unseen relationships between genes — even if we haven't seen this perturbation, models can learn from the existing data — to predict what we’d see in a dish.

With the help of machine learning, we really scale down the number of experiments that we have to do and can focus on the path that's most promising for therapeutics.

*Qin with members of Caroline Uhler's lab.*

What is it like to juggle computational and wet lab research?

In undergrad I took both biology and computer science courses. But what I found frustrating was that even with experience in both fields it was still hard to communicate to bench scientists because you need to get into really technical details and I didn’t know them. I thus decided to dive more into the wet lab so that I could bridge the two fields and drive effective collaborations. I decided that during my postdoc I wanted to get training in both.

What are you working on now?

In one project, I've been using cell images from Paul Blainey’s lab, where we perturb the genome using a new biotechnology tool that's currently on bioRxiv called CROPseq-multi, which allows us to look at genetic interactions in the image space in a pooled fashion. That's something that we could never do before because we just didn't have the tools. In the past, scientists have used images mainly in small-scale experiments to validate hypotheses, taking a few under the microscope to see if expected phenotypes show up. But we now finally have the power and the technology to easily generate large image datasets, empowering machine learning to help us understand what changes in cell morphology mean and connect morphology and genetics at a genome-wide scale.

How could an in silico cell improve the treatment of disease?

I wasn’t always aware of the imbalance in access to medical resources. My grandpa in Rizhao [a city in northeastern China] had cancer and even though I work in the field of cancer research, there was nothing I could do to help him. This made me realize that I'm in a privileged setting where I'm surrounded by medical experts, but the amount of medical resources that he had was totally incomparable to the ones I'm exposed to. That really got me thinking about how we could address such disparities, and one approach could be using an in silico cell to simulate different disease contexts using patient information including genomics and give therapeutic options to patients. With such a model, we could alleviate the inequitable access to medical knowledge for patients around the world.

We also need new therapeutics and there are so many diseases where we don’t know the cause and don’t even have a therapeutic option available for patients who are suffering. This research can help us find the direct pathway that we should target in personalized genetic contexts. Hopefully this can inspire new therapeutic developments from pharmaceutical companies.

‍Adapted from an article posted on the Broad site.

People

Cells

July 24, 2024

Schmidt Center Director Caroline Uhler named Andrew (1956) and Erna Viterbi Professor of Engineering Schmidt Center Director Caroline Uhler named Andrew (1956) and Erna Viterbi Professor of Engineering

2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is pleased to announce that Center Director Caroline Uhler has been named the Andrew (1956) and Erna Viterbi Professor of Engineering, effective July 1, 2024 for a five-year term. This MIT School of Engineering Professorship is awarded to an outstanding faculty member who is recognized as a leader and innovator in the field of Electrical Engineering and Computer Science.

Caroline holds BSc degrees in math and biology, an MSc in mathematics, and an MEd in mathematics education from the University of Zurich (years spanning 2004-7), and a PhD in statistics from UC Berkeley (2011). Before joining MIT as a faculty member in 2015, she spent three years as an assistant professor at IST Austria.

She is a professor in the Department of Electrical Engineering and Computer Science (EECS) and the Institute for Data, Systems, and Society (IDSS). She is also affiliated with the Laboratory for Information and Decision Systems (LIDS), the The Statistics and Data Science Center, and the Operations Research Center (ORC). Additionally, Caroline is a core institute member of the Broad Institute.

Caroline’s research focuses on machine learning methods for integrating and translating between vastly different data modalities and inferring causal or regulatory relationships from such data. She is particularly interested in using these methods to gain mechanistic insights into the link between genome packing and regulation in health and disease.

She is an elected member of the International Statistical Institute, and is the recipient of a Simons Investigator Award, a Sloan Research Fellowship, and an NSF Career Award. Recently, she was named a Fellow of the Institute of Mathematical Statistics (IMS), 2024, and a Fellow of the Society for Industrial and Applied Mathematics (SIAM), Class of 2023.

People

Causal Inference

Representation Learning

July 22, 2024

Researchers identify cheap and effective biomarkers for DCIS tumor stage Researchers identify cheap and effective biomarkers for DCIS tumor stage

2024

Ductal carcinoma in situ (DCIS), a pre-invasive tumor, accounts for about 25% of breast cancer diagnoses, a leading cause of cancer death. While doctors generally recommend treatment, they lack the appropriate evidence to reliably decide which tumor will remain benign and which might turn into a life-threatening invasive ductal carcinoma (IDC), resulting in high rates of overtreatment.

The current methods for understanding DCIS progression include manual assessment of nuclear morphology by pathologists, sequencing-based approaches, spatial transcriptomics, and highly multiplexed imaging. However, these methods face challenges due to cost, complexity, and limited information about the tissue microenvironment, which is necessary for accurate DCIS progression assessment.

In a new study published today in Nature Communications, researchers at the Broad Institute of MIT and Harvard and the Paul Scherrer Institute at ETH Zürich in Switzerland have found a simple and effective method of predicting the disease stage of DCIS, which could ultimately lead to more informed recommendations for DCIS breast cancer treatment. Their analysis demonstrates that, without the need of multiple stains or sequencing-based technologies, chromatin imaging provides sufficient information about cell states and tissue organization to accurately predict tumor stages.

*Caroline Uhler says that using machine learning to analyze data can ultimately lead to better disease diagnosis and treatment.*

The study stems from a long-term collaboration combining AI and biology between Caroline Uhler, who directs the Eric and Wendy Schmidt Center at the Broad Institute, and is a Professor in the Department of Electrical Engineering and Computer Science as well as the Institute for Data, Systems, and Society at MIT, and GV Shivashankar, Professor of Mechanogenomics and head of the Laboratory of Nanoscale Biology at the Paul Scherrer Institute.

Shivashankar’s lab is interested in understanding the underlying mechanisms for cell-state transitions and the association with disease states. They aim to improve early disease diagnostics by using multi-disciplinary approaches, such as single-cell imaging, functional genomics, and machine learning, to study the coupling between cell mechanics and genome organization in tissue contexts.

“Building on our previous studies with the Schmidt Center, we’re thrilled that we found a simple way to predict disease stage through the statistics of cell states, and we look forward to seeing how this can be applied to DCIS treatment,” said co-senior author Shivashankar.

*G.V. Shivashankar is currently developing various methods for the diagnosis and prognosis of cancer at PSI (image credit: PSI).*

The study aligns with the Schmidt Center’s goal of fostering a two-way street between biology and machine learning to advance biomedical discoveries and provide insights into how cells work in health and disease.

“As our research on DCIS shows, it’s important to create novel machine learning methods to analyze biomedical data,” said co-senior author Uhler. “Using machine learning to analyze data can lead to more accurate and simpler solutions for important biological questions, ultimately leading to better disease diagnosis and treatment.”

“Collaborating with Shivashankar’s lab, which I have done on several projects, provides me with a unique opportunity to develop computational methods for important biological problems,” said study first author Xinyi Zhang, a Ph.D. student at MIT and the Schmidt Center. “I’m able to see what the real roadblocks and challenges are in the biomedical space and start thinking about what to develop next.”

*Xinyi Zhang has collaborated with Shivashankar’s lab multiple times, develop computational methods for important biological problems (image credit: Jared Charney).*

Using unsupervised representation learning methods, the scientists analyzed 560 samples from 122 patients at 11 stages of DCIS progression from normal to cancerous breast tissues. They identified eight disease-relevant cell states based on nuclear morphology and chromatin organization, and found that all eight cell states exist in all disease stages, but with different abundances.

Based on the learned representations, the researchers then arranged the cell types from healthy to cancerous, finding that the order matched the natural progression of the disease, even though the model wasn't trained directly on disease stages. The study also demonstrated that spatial organization of cells near breast ducts and the co-localization of cell states can better predict disease stage compared to cell state abundance alone. This approach highlighted distinct cell states, their relative abundances, and their spatial neighborhoods, indicating their potential as biomarkers for cancer staging.

Although follow-up clinical trials with longitudinal tracking of DCIS patients are needed, this study demonstrated that high-dimensional AI-inferred features based on simple and cheap chromatin images can provide valuable insights into tumor progression. Uhler noted that this study introduces a new approach to exploring disease progression within a tumor microenvironment, specifically by leveraging machine learning and computational methods to extract meaningful information from complex chromatin images, without the need for extensive staining or sequencing.

By focusing on one of the Schmidt Center’s core missions – developing the foundations of machine learning to understand the programs of life – this study offers simple and cost-effective solutions for disease prognosis and treatment.

Learn more about this study in stories from MIT News and the Paul Scherrer Institute at ETH Zurich.

Representation Learning

Cells

Tissues

July 12, 2024

Machine learning and the microscope Machine learning and the microscope

2024

With recent advances in imaging, genomics and other technologies, the life sciences are awash in data. If a biologist is studying cells taken from the brain tissue of Alzheimer’s patients, for example, there could be any number of characteristics they want to investigate — a cell’s type, the genes it’s expressing, its location within the tissue, or more. However, while cells can now be probed experimentally using different kinds of measurements simultaneously, when it comes to analyzing the data, scientists usually can only work with one type of measurement at a time.

Working with “multimodal” data, as it’s called, requires new computational tools, which is where Xinyi Zhang, graduate fellow at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, comes in.

The fourth-year MIT PhD student is bridging machine learning and biology to understand fundamental biological principles, especially in areas where conventional methods have hit limitations. Working in the lab of MIT Professor and Schmidt Center Director Caroline Uhler in the Department of Electrical Engineering and Computer Science (EECS), the Laboratory for Information and Decision Systems (LIDS), and the Institute for Data, Systems, and Society (IDSS), and collaborating with researchers at the Schmidt Center, Zhang has led multiple efforts to build computational frameworks and principles for understanding the regulatory mechanisms of cells.

***Xinyi says she wants to keep applying her skills to solve the “most challenging questions that we don’t have the tools to answer.”***

“All of these are small steps toward the end goal of trying to answer how cells work, how tissues and organs work, why they have disease, and why they can sometimes be cured and sometimes not,” Zhang says.

The activities Zhang pursues in her down time are no less ambitious. The list of hobbies she has taken up at the Institute include sailing, skiing, ice skating, rock climbing, performing with MIT’s Concert Choir, and flying single-engine planes. (She earned her pilot’s license in November 2022.)

“I guess I like to go to places I’ve never been and do things I haven’t done before,” she says with signature understatement.

Uhler, her advisor, says that Zhang’s quiet humility leads to a surprise “in every conversation.”

“Every time, you learn something like, ‘Okay, so now she’s learning to fly,’” Uhler says. “It’s just amazing. Anything she does, she does for the right reasons. She wants to be good at the things she cares about, which I think is really exciting.”

Zhang first became interested in biology as a high school student in Hangzhou, China. She liked that her teachers couldn’t answer her questions in biology class, which led her to see it as the “most interesting” topic to study.

Her interest in biology eventually turned into an interest in bioengineering. After her parents, who were middle school teachers, suggested studying in the United States, she majored in the latter alongside electrical engineering and computer science as an undergraduate at the University of California at Berkeley.

***Xinyi with other members of Caroline Uhler's lab.***

Zhang was ready to dive straight into MIT’s EECS PhD program after graduating in 2020, but the COVID-19 pandemic delayed her first year. Despite that, in December 2022, she, Uhler, and two other co-authors published a paper in Nature Communications.

The groundwork for the paper was laid by Broad Institute core member Xiao Wang, one of the co-authors. She had previously done work with the Broad Institute in developing a form of spatial cell analysis that combined multiple forms of cell imaging and gene expression for the same cell while also mapping out the cell’s place in the tissue sample it came from — something that had never been done before.

This innovation had many potential applications, including enabling new ways of tracking the progression of various diseases, but there was no way to analyze all the multimodal data the method produced. In came Zhang, who became interested in designing a computational method that could.

The team focused on chromatin staining as their imaging method of choice, which is relatively cheap but still reveals a great deal of information about cells. The next step was integrating the spatial analysis techniques developed by Wang, and to do that, Zhang began designing an autoencoder.

Autoencoders are a type of neural network that typically encodes and shrinks large amounts of high-dimensional data, then expands the transformed data back to its original size. In this case, Zhang’s autoencoder did the reverse, taking the input data and making it higher-dimensional. This allowed them to combine data from different animals and remove technical variations that were not due to meaningful biological differences.

In the paper, they used this technology, abbreviated as STACI, to identify how cells and tissues reveal the progression of Alzheimer’s disease when observed under a number of spatial and imaging techniques. The model can also be used to analyze any number of diseases, Zhang says.

***Xinyi with Schmidt Center fellows.***

Given unlimited time and resources, her dream would be to build a fully complete model of human life. Unfortunately, both time and resources are limited. Her ambition isn’t, however, and she says she wants to keep applying her skills to solve the “most challenging questions that we don’t have the tools to answer.”

She’s currently working on wrapping up a couple of projects, one focused on studying neurodegeneration by analyzing frontal cortex imaging, and another on predicting protein images from protein sequences and chromatin imaging.

“There are still many unanswered questions,” she says. “I want to pick questions that are biologically meaningful, that help us understand things we didn’t know before.”

Adapted from an article posted on the MIT News site.

People

Cells

May 23, 2024

Caroline Uhler, Schmidt Center director, named IMS Fellow Caroline Uhler, Schmidt Center director, named IMS Fellow

2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is pleased to share that Center Director Caroline Uhler has been elected Fellow of the Institute of Mathematical Statistics (IMS). Uhler received the award for interdisciplinary excellence and for merging mathematical statistics and computational biology in innovative and impactful ways.

Uhler is a core institute member of the Broad Institute and a professor in the Department of Electrical Engineering and Computer Science (EECS) and the Institute for Data, Systems, and Society (IDSS) at MIT. She is also a SIAM Fellow, a Sloan Research Fellow, and an elected member of the International Statistical Institute.

Uhler’s research lies at the intersection of machine learning, statistics, and genomics, with a particular focus on causal inference, representation learning, and gene regulation. Her use of probabilistic graphical models and development of scalable algorithms with healthcare applications has enabled her research group to gain insights into causal relationships hidden within massive amounts of data, such as those generated during gene knockout or knockdown experiments.

Caroline Uhler, director of the Schmidt Center

For almost 90 years, the title of IMS Fellow has represented a prestigious honor. Evaluated by a committee of peers, each Fellow has exhibited exceptional mastery in statistical or probabilistic research and/or has showcased remarkable leadership that has left a lasting impact on the field.

‍

Established in 1935, the IMS is a member organization that fosters the development and dissemination of the theory and applications of statistics and probability. The IMS has over 4,700 active members throughout the world, with approximately 10% of the current IMS members earning the fellowship status. The announcement of the 2024 class of IMS Fellows can be viewed here.

Uhler will be honored among the new IMS Fellows at the IMS Presidential Address and Awards Ceremony at the Bernoulli-IMS 11th World Congress in Probability and Statistics on August 12-16, 2024 in Bochum, Germany.

People

Causal Inference

Representation Learning

May 13, 2024

Machine learning method reveals chromosome locations in individual cell nucleus Machine learning method reveals chromosome locations in individual cell nucleus

2024

Researchers from Carnegie Mellon University’s School of Computer Science and the Broad Institute of MIT and Harvard have made a significant advancement toward understanding how the human genome is organized inside a single cell. This knowledge is crucial for analyzing how DNA structure influences gene expression and disease processes.

In a paper published by the journal Nature Methods, Ray and Stephanie Lane Professor of Computational Biology Jian Ma and former Ph.D. students Kyle Xiong and Ruochi Zhang introduce scGHOST, a machine learning method that detects subcompartments — a specific type of 3D genome feature in the cell nucleus — and connects them to gene expression patterns. Zhang is currently a postdoctoral fellow at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

Ruochi Zhang, postodctoral fellow at the Eric and Wendy Schmidt Center

In human cells, chromosomes aren’t arranged linearly but are folded into 3D structures. Researchers are particularly interested in 3D genome subcompartments because they reveal where chromosomes are located spatially inside the nucleus.

“One of the ultimate goals of single-cell biology is to elucidate the connections between cellular structure and function across a wide variety of biological contexts,” Ma said. “In this case, we are exploring how chromosome organization within the nucleus correlates with gene expression.”

While new technologies allow the study of these structures at the single-cell level, poor data quality can hinder precise understanding. scGHOST addresses this problem by using graph-based machine learning to enhance the data, making it easier to pinpoint and identify how chromosomes are spatially organized. scGHOST builds upon the Higashi method and its evolution, Fast Higashi, which focuses on scHi-C embeddings and imputations, that Ma's research group previously developed.

"Graph and hypergraph representation learning are integral to these methods and scGHOST, as they allow for a more nuanced and detailed exploration of the complex interactions within the genome,” said Zhang.

With the ability to accurately identify 3D genome subcompartments, scGHOST adds to the growing array of single-cell analysis tools scientists use to delineate the intricate molecular landscape of complex tissues, such as those in the brain. Ma anticipates that scGHOST could open new avenues to understanding gene regulation in health and disease.

Read more about their work in Nature Methods. Additionally, learn more about this research in a February 8, 2023, Models, Inference and Algorithms talk by Zhang.‍

Adapted from a news story posted on the CMU School of Computer Science’s website.

Cells

Representation Learning

People

April 11, 2024

Researchers introduce new AI tool to help clinicians capture uncertainty in medical images Researchers introduce new AI tool to help clinicians capture uncertainty in medical images

2024

In biomedicine, segmentation involves annotating pixels from an important structure in a medical image, like an organ or cell. Artificial intelligence models can help clinicians by highlighting pixels that may show signs of a certain disease or anomaly.

However, these models typically only provide one answer, while the problem of medical image segmentation is often far from black and white. Five expert human annotators might provide five different segmentations, perhaps disagreeing on the existence or extent of the borders of a nodule in a lung CT image.

“Having options can help in decision-making. Even just seeing that there is uncertainty in a medical image can influence someone’s decisions, so it is important to take this uncertainty into account,” says Marianne Rakic, an MIT computer science PhD candidate and fellow at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

Rakic is lead author of a paper with others at MIT, the Broad Institute, and Massachusetts General Hospital that introduces a new AI tool that can capture the uncertainty in a medical image.

Known as Tyche (named for the Greek divinity of chance), the system provides multiple plausible segmentations that each highlight slightly different areas of a medical image. A user can specify how many options Tyche outputs and select the most appropriate one for their purpose.

Importantly, Tyche can tackle new segmentation tasks without needing to be retrained. Training is a data-intensive process that involves showing a model many examples and requires extensive machine-learning experience.

Because it doesn’t need retraining, Tyche could be easier for clinicians and biomedical researchers to use than some other methods. It could be applied “out of the box” for a variety of tasks, from identifying lesions in a lung X-ray to pinpointing anomalies in a brain MRI.

Ultimately, this system could improve diagnoses or aid in biomedical research by calling attention to potentially crucial information that other AI tools might miss.

“Ambiguity has been understudied. If your model completely misses a nodule that three experts say is there and two experts say is not, that is probably something you should pay attention to,” adds senior author Adrian Dalca, an assistant professor at Harvard Medical School and MGH, and a research scientist in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

Their co-authors include Hallee Wong, a graduate student in electrical engineering and computer science; Jose Javier Gonzalez Ortiz PhD ’23; Beth Cimini, associate director for bioimage analysis at the Broad Institute; and John Guttag, the Dugald C. Jackson Professor of Computer Science and Electrical Engineering. Rakic will present Tyche at the IEEE Conference on Computer Vision and Pattern Recognition, where Tyche has been selected as a highlight.

Addressing ambiguity

AI systems for medical image segmentation typically use neural networks. Loosely based on the human brain, neural networks are machine-learning models comprising many interconnected layers of nodes, or neurons, that process data.

After speaking with collaborators at the Broad Institute and MGH who use these systems, the researchers realized two major issues limit their effectiveness. The models cannot capture uncertainty and they must be retrained for even a slightly different segmentation task.

Some methods try to overcome one pitfall, but tackling both problems with a single solution has proven especially tricky, Rakic says.

“If you want to take ambiguity into account, you often have to use an extremely complicated model. With the method we propose, our goal is to make it easy to use with a relatively small model so that it can make predictions quickly,” she says.

The researchers built Tyche by modifying a straightforward neural network architecture.

A user first feeds Tyche a few examples that show the segmentation task. For instance, examples could include several images of lesions in a heart MRI that have been segmented by different human experts so the model can learn the task and see that there is ambiguity.

The researchers found that just 16 example images, called a “context set,” is enough for the model to make good predictions, but there is no limit to the number of examples one can use. The context set enables Tyche to solve new tasks without retraining.

For Tyche to capture uncertainty, the researchers modified the neural network so it outputs multiple predictions based on one medical image input and the context set. They adjusted the network’s layers so that, as data move from layer to layer, the candidate segmentations produced at each step can “talk” to each other and the examples in the context set.

In this way, the model can ensure that candidate segmentations are all a bit different, but still solve the task.

“It is like rolling dice. If your model can roll a two, three, or four, but doesn’t know you have a two and a four already, then either one might appear again,” she says.

They also modified the training process so it is rewarded by maximizing the quality of its best prediction.

If the user asked for five predictions, at the end they can see all five medical image segmentations Tyche produced, even though one might be better than the others.

The researchers also developed a version of Tyche that can be used with an existing, pretrained model for medical image segmentation. In this case, Tyche enables the model to output multiple candidates by making slight transformations to images.

Better, faster predictions

When the researchers tested Tyche with datasets of annotated medical images, they found that its predictions captured the diversity of human annotators, and that its best predictions were better than any from the baseline models. Tyche also performed faster than most models.

“Outputting multiple candidates and ensuring they are different from one another really gives you an edge,” Rakic says.

The researchers also saw that Tyche could outperform more complex models that have been trained using a large, specialized dataset.

For future work, they plan to try using a more flexible context set, perhaps including text or multiple types of images. In addition, they want to explore methods that could improve Tyche’s worst predictions and enhance the system so it can recommend the best segmentation candidates.

This research is funded, in part, by the National Institutes of Health, the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, and Quanta Computer.

This story was adapted from a piece on MIT News.

Organisms

Representation Learning

March 15, 2024

Schmidt Center director awarded Department of Defense MURI funding Schmidt Center director awarded Department of Defense MURI funding

2024

Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is excited to announce that center director Caroline Uhler has received a Multidisciplinary University Research Initiative (MURI) award from the U.S. Department of Defense.

MURI awards support interdisciplinary teams of researchers in conducting fundamental research on topics deemed critical by the defense department. Uhler and collaborators will use the award to advance optimal intervention design in complex systems — an effort that should enhance decision-making in areas ranging from biomedical to engineering and societal applications.

Uhler, the project’s principal investigator, is a core member of the Broad Institute and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT. She is joined on the award by Alberto Abadie and Devavrat Shah, MIT; Miguel Hernan, Harvard University; John Ioannidis, Stanford University; Mengdi Wang, Princeton University; and Feng Zhang, Broad Institute. The project will involve several graduate students and postdocs.

Uhler and the research team will develop a computational framework for evidence-based decision-making. The researchers plan to host machine learning competitions to test out their methodology in engineering, biological, health, and societal application areas.

Project Title: “Evaluating, Predicting, Optimizing, and Monitoring Hypothetical Interventions in Large Networked Systems”

‍

People

March 13, 2024

How do neural networks learn? A mathematical formula explains how they detect relevant patterns How do neural networks learn? A mathematical formula explains how they detect relevant patterns

2024

Neural networks have been powering breakthroughs in artificial intelligence, including the large language models that are now being used in a wide range of applications, from finance to human resources to healthcare. But these networks remain a black box whose inner workings engineers and scientists struggle to understand. Now, a team led by data and computer scientists at the University of California San Diego has given neural networks the equivalent of an X-ray to uncover how they actually learn.

The researchers found that a formula used in statistical analysis provides a streamlined mathematical description of how neural networks, such as GPT-2, a precursor to ChatGPT, learn relevant patterns in data, known as features. This formula also explains how neural networks use these relevant patterns to make predictions.

“We are trying to understand neural networks from first principles,” said Daniel Beaglehole, a PhD student in the UC San Diego Department of Computer Science and Engineering and co-first author of the study. “With our formula, one can simply interpret which features the network is using to make predictions.”

Adit Radhakrishnan, a postdoctoral fellow at Harvard who worked on the paper as an MIT EECS PhD student funded by the Schmidt Center and co-first author of the study, added: “We showed that neural networks, unlike other machine learning models, automatically implement this formula to identify features most relevant for prediction.”

The team presented their findings in the March 7 issue of the journal Science.

Why does it matter how neural networks make predictions? AI-powered tools are now pervasive in everyday life. Banks use them to approve loans. Hospitals use them to analyze medical data, such as X-rays and MRIs. Companies use them to screen job applicants. But it’s currently difficult to understand the mechanism neural networks use to make decisions and the biases in the training data that might impact this.

“If you don’t understand how neural networks learn, it’s very hard to establish whether neural networks produce reliable, accurate, and appropriate responses,” said Mikhail Belkin, the paper’s corresponding author and a professor at the UC San Diego Halicioglu Data Science Institute. “This is particularly significant given the rapid recent growth of machine learning and neural net technology.”

Former Eric and Wendy Schmidt Center PhD fellow Adit Radhakrishnan's research focuses on advancing the theoretical foundations of machine learning and developing new methods for tackling biomedical problems.

Understanding how neural networks make predictions is especially important in biological applications. In the realm of drug discovery, for example, researchers would not only want a model that accurately predicts drugs that are effective in treating cancer — they also want to discover biological mechanisms that make such drugs effective, explained Radhakrishnan. “By applying our findings to models trained to predict the effect of drugs on cancer cells, we can discover features of cancer cells that make them susceptible to a given drug and then develop new drugs to specifically target those mechanisms,” he said.

The study is part of a larger effort in Belkin’s research group to develop a mathematical theory that explains how neural networks work. “Technology has outpaced theory by a huge amount,” he said. “We need to catch up.”

The team also showed that the statistical formula they used to understand how neural networks learn, known as Average Gradient Outer Product (AGOP), could be applied to improve performance and efficiency in other types of machine learning architectures that do not include neural networks.

“If we understand the underlying mechanisms that drive neural networks, we should be able to build machine learning models that are simpler, more efficient, and more interpretable,” Belkin said. “We hope this will help democratize AI.”

The machine learning systems that Belkin envisions would need less computational power, and therefore less power from the grid, to function. These systems also would be less complex and so easier to understand.

Illustrating the new findings with an example

(Artificial) neural networks are computational tools to learn relationships between data characteristics (i.e. identifying specific objects or faces in an image). One example of a task is determining whether in a new image a person is wearing glasses or not. Machine learning approaches this problem by providing the neural network many example (training) images labeled as images of “a person wearing glasses” or ”a person not wearing glasses.” The neural network learns the relationship between images and their labels, and extracts data patterns, or features, that it needs to focus on to make a determination. One of the reasons AI systems are considered a black box is because it is often difficult to describe mathematically what criteria the systems are actually using to make their predictions, including potential biases. The new work provides a simple mathematical explanation for how the systems are learning these features.

Features are relevant patterns in the data. In the example above, there are a wide range of features that the neural networks learns, and then uses, to determine if in fact a person in a photograph is wearing glasses or not. One feature it would need to pay attention to for this task is the upper part of the face. Other features could be the eye or the nose area where glasses often rest. The network selectively pays attention to the features that it learns are relevant and then discards the other parts of the image, such as the lower part of the face, the hair and so on.

‍Feature learning is the ability to recognize relevant patterns in data and then use those patterns to make predictions. In the glasses example, the network learns to pay attention to the upper part of the face. In the new Science paper, the researchers identified a statistical formula that describes how the neural networks are learning features.

‍Alternative neural network architectures: The researchers went on to show that inserting this formula into computing systems that do not rely on neural networks allowed these systems to learn faster and more efficiently.

“How do I ignore what’s not necessary? Humans are good at this,” said Belkin. “Machines are doing the same thing. Large Language Models, for example, are implementing this ‘selective paying attention’ and we haven’t known how they do it. In our Science paper, we present a mechanism explaining at least some of how the neural nets are ‘selectively paying attention.’”

Study funders included the National Science Foundation and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning. Belkin is part of NSF-funded and UC San Diego-led The Institute for Learning-enabled Optimization at Scale, or TILOS.

‍Paper title: Mechanism for feature learning in neural networks and backpropagation-free machine learning models

Adit Radhakrishnan, Harvard School of Engineering and Applied Sciences and Broad Institute of MIT and Harvard

Daniel Beaglehole and Mikhail Belkin, University of California San Diego

Parthe Pandit: IIT Bombay–Pandit did the work for this paper as a postdoctoral researcher at the UC San Diego Halicioglu Data Science Institute

This story was adapted from a piece UC San Diego Today.

No items found.

March 4, 2024

Student Spotlight: Victory Yinka-Banjo Student Spotlight: Victory Yinka-Banjo

2024

This interview is part of a series of short interviews from the Department of EECS, called Student Spotlights. Each Spotlight features a student answering their choice of questions about themselves and life at MIT. Today’s interviewee, Victory Yinka-Banjo, is a junior majoring in 6-7: Computer Science and Molecular Biology. Yinka-Banjo is Yinka-Banjo keeps a packed schedule; she is a member of the Office of Minority Education (OME) Laureates & Leaders program; a 2024 fellow in the public service-oriented BCAP program; has previously served as Secretary of the African Students’ Association and is now undergraduate president of the MIT Biotech Group; additionally, she is working on a cardiometabolic disease and deep learning project at the Broad Institute as an Eric and Wendy Schmidt Center Funded SuperUROP Scholar; a member of the Ginkgo Bioworks’ Cultivate Fellowship (a program that supports students interested in synthetic biology/biotech); and an ambassador for Leadership Brainery, which equips juniors/leaders of color with the resources needed to prepare for graduate school. Nevertheless, she found time to share a peek into her MIT experience with readers.

What’s your favorite building or room within MIT, and what’s special about it to you?

It has to be the Broad Institute of MIT & Harvard on Ames Street in Kendall Square, where I do my SuperUROP research in Caroline Uhler’s lab. Outside of classes, you’re 90% likely to find me on the newest mezzanine floor (between the 11th and 12th floor), in one of the UROP rooms I share with two other undergrads in the lab. We have standing desks, an amazing coffee/hot chocolate machine, external personal monitors, comfortable sofas – everything really! Not only is it my favorite building, it is also my favorite study spot on campus. In fact, I am there so often that when friends recently planned a birthday surprise for me, they told me they were considering having it at the Broad, since they could count on me being there.

I think the most beautiful thing about this building, apart from the beautiful view of Cambridge we get from being on one of the highest floors, is that when I was applying to MIT from high school, I had fantasized working at the Broad because of the ground-breaking research. To think that it is now a reality makes me appreciate every minute I spend on my floor, whether I am doing actual research or some last-minute studying for a midterm.

Tell me about one interest or hobby you’ve discovered since you came to MIT. (It doesn’t have to be academic!)

I have become pretty involved in the performing arts since I got to MIT! I have acted in two plays run by the Black Theater Guild, which was revived during my freshman year by one of my friends. I played a supporting role in the first play called Nkrumah’s Last Day, which was about Ghana at a time of governance under Kwame Nkrumah (its first president). In the second play, a ghost story/comedy called Shooting the Sheriff, I played one of the lead roles. Both caused me to step way out of my comfort zone and I loved the experiences because of that. I also got to act with some of my close friends who were first-time stage actors as well, so that made it even more fun.

Outside of acting, I also do spoken word/poetry. I have performed at events like the African Students Association Cultural Night, MIT Africa Innovate Conference and Black Womens’ Alliance Banquet. I try to use my pieces to share my experiences both within and beyond MIT, offering the perspective of an international Nigerian student. My favorite piece was called Code Switch, and I used concepts from CS & Biology (especially genetic code switching), to draw parallels with linguistic code-switching, and emphasize the beauty and originality of authenticity. This semester, I’m also a part of MIT Monologues and will be performing a piece called Inheritance, about the beauty of self-love found in affection transferred from a mother.

Are you a re-reader or a re-watcher—and if so, what are your comfort books, shows, or movies?

I don’t watch too many movies, although I used to be obsessed with all parts of High School Musical; and the only book I’ve ever reread is Americanah. I would actually say I am a re-podcaster! My go-to comfort-podcast is this episode, “A Breakthrough Unfolds”, by Google DeepMind. It makes me a little emotional every time I listen. It is such an exemplification of the power of science and its ability to break boundaries that humans formerly thought impossible. As a Computer Science & Biology major, I am particularly interested in these two disciplines’ applications to relevant problems, like the protein-folding problem discussed in the episode, which DeepMind’s solution for has caused massive advances in the biotech industry. It makes me so hopeful for the future of biology, and the ways in which computation can advance human health and precision medicine.

Who’s your favorite artist? (Using the term very broadly; any form of art can qualify!)

When I think of the word ‘artist’, I think of music artists first. There are so many who I love; my favorites also evolve over time. I’m Christian, so I listen to a lot of gospel music. I’m also Nigerian so I listen to a lot of afrobeats. Since last summer, I’ve been obsessed with Limoblaze, who fuses both gospel and afrobeats music! KB, a super talented gospel rapper, is also somewhat tied in ranking with Limo for me right now. His songs are probably ~50% of my workout playlist.

It’s time to get on the shuttle to the first Mars colony, and you can only bring one personal item. What are you going to bring along with you?

Oooh, this is a tough one, but it has to be my brass rat. Ever since I got mine at the end of sophomore year, it’s been nearly impossible for me to take it off. If there’s ever a time I forget to wear it, my finger feels off for the entire day.

Tell me about one conversation that changed the trajectory of your life.

Two specific career-defining moments come to mind. They aren’t quite conversations, but they are talks/lectures that I was deeply inspired by. The first was towards the end of high school when I watched this TEDx Talk about storing data in DNA. At the time, I was getting ready to apply to colleges and I knew that biology and computer science were two things I really liked, but I didn’t really understand the possibilities that could be birthed from them coming together as an interdisciplinary field. The TEDx talk was my eureka moment for computational biology.

The second moment was in my junior Fall during an introductory lecture to “Lab Fundamentals for Bioengineering” by Professor Jacquin Niles. I started the school year with a lot of confusion about my future post-grad, and the relevance of my planned career path to the communities that I care about. Basically, I was unsure about how Computational Biology fit into the context of Nigeria’s problems, especially because my interest in the field is oriented towards molecular biology/medicine, not necessarily public health.

In the US, most research focuses on diseases like cancer and Alzheimer’s, which, while important, are not the most pressing health conditions in tropical regions like Nigeria. When Prof Niles told us about his lab’s dedication to malaria research from a molecular biology standpoint, it was yet another eureka moment. Like yes! Computation and molecular biology can indeed mitigate diseases that affect developing nations like Nigeria–diseases that are understudied, and whose research is underfunded.

Since his talk, I found a renewed sense of purpose. Grad school isn’t the end goal. Using my skills to shine a light on the issues affecting my people that deserve far more attention is the goal. I’m so excited to see how I will use Computational Biology to possibly create the next cure to a commonly neglected tropical disease, or accelerate the diagnosis of one. Whatever it may be, I know that it will be close to home, eventually 🙂

What are you looking forward to about life after graduation? What do you think you’ll miss about MIT?

Thinking about graduating actually makes me sad. I’ve grown to love MIT. The biggest thing I’ll miss, though, is Independent Activities Period (IAP). It is such a unique part of the MIT experience. I’ve done a web development class/competition, research, a data science challenge, a molecular bio crash course, and a deep learning crash course over the past 3 IAPs. It is SUCH an amazing time to try something low stakes, forget about grades, explore Boston, build a robot, travel abroad, do less, go slower, really rejuvenate before the Spring, and embrace MIT’s motto of “mind and hand” by just being creative and explorative. It is such an exemplification of what it means to go here, and I can’t imagine it being the same anywhere else.

That said, I look forward to graduating so I can do more research. My hours spent at the Broad thinking about my UROP are always the quickest hours of my week. I love the rabbit holes my research allows me to explore, and I hope that I find those over and over again as I apply and hopefully get into PhD programs. I look forward to exploring a new city after I graduate too. I wouldn’t mind staying in Cambridge/Boston. I love it here. But I would welcome a chance to be somewhere new and embrace all the people and unique experiences it has to offer. I also hope to work on more passion projects post-grad. I feel like I have this idea in my head that once I graduate from MIT, I’ll have so much more time on my hands (we’ll see how that goes). I hope that I can use that time to work on education projects in Nigeria, which is a space I care a lot about. Generally, I want to make service more integrated in my lifestyle. I hope that post-graduation, I can prioritize doing that even more: making it a norm to lift others as I continue to climb.

Adapted from a profile posted on MIT Electrical Engineering and Computer Science Department's site.

No items found.

February 5, 2024

Data science challenge reveals new research directions for cancer immunotherapy Data science challenge reveals new research directions for cancer immunotherapy

2024

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is delighted to announce the completion of its Cancer Immunotherapy Data Science Grand Challenge.

Participants in the challenge developed algorithms to uncover new ways to modify, or “perturb,” T cells to make them more effective at killing cancer cells. Scientists in the Hacohen Lab at the Broad then tested their predictions in mouse models, making this the first challenge that the Schmidt Center knows of in which new experiments were performed based on the output of machine-learning models developed in the challenge.

While it’s too early to say whether any of the proposed perturbations could prove useful for cancer treatment, the researchers plan to further study some of the identified perturbations and the algorithms that gave rise to them.

The Schmidt Center partnered with Harvard’s Laboratory for Innovation Science (LISH), the MIT Department of Electrical Engineering and Computer Science, Topcoder, Gordian Biotechnology, and Saturn Cloud to run the challenge. More than 1,000 people from around the world registered for the competition.

“We are thrilled that our first data science challenge attracted so many participants, including various machine-learning experts who had not previously worked on biological problems,” said Caroline Uhler, director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT.

Karim Lakhani, founder and co-director of LISH and a professor of business administration at the Harvard Business School, said: “At LISH, we believe that data science challenges can help organizations harness the power of the crowd to answer pressing questions in biology and other fields. We hope this challenge will serve as a case study in how machine-learning experts can collaborate with biologists to improve experimental design.”

Boosting cancer research with machine learning

Cancer immunotherapy seeks to harness the body’s immune system, and most often T cells, to recognize and kill cancer cells while leaving healthy cells alone. In the last decade, there have been many breakthroughs in cancer immunotherapy, yet treatments still only work for some cancer patients some of the time.

“We’re hopeful that challenges like this can help us home in on T-cell-perturbations that could ultimately lead to new therapeutics — and make cancer immunotherapy work for more patients,” said Nir Hacohen, an institute member at Broad, director of the Broad Institute’s Cell Circuits Program, and director of the Center for Cancer Immunology at Mass General Brigham.

Marc Schwartz, a postdoctoral fellow in the Hacohen Lab, previously ran experiments testing the effects of 73 gene knockouts in T cells in mouse models. Because researchers can’t scale mouse model experiments beyond 100 or so genes at a time, it’s not feasible to test out every gene in a particular disease pathway, explained Schwartz.

“That’s why we were excited about the idea of testing a limited number of genes that we think are important and then training an algorithm to learn something that we can't see from that data on our own,” he added.

The overarching data science challenge was divided into three parts that ran as individual data science competitions on Topcoder. In Challenge 1, participants received gene expression data from 66 of the 73 T-cell-gene knockouts from Schwartz’s experiments as training data. They then developed an algorithm that could predict how knocking out the seven “held-out” genes would affect T cells.

Challenge 2 participants used their algorithms from the first challenge to propose new gene knockouts (picking from any of the 20,000 genes in the entire genome) to shift as many T cells as possible into a cancer-fighting state. In Challenge 3, participants proposed a metric for ranking how well a particular gene knockout would bring about this desired shift in T cells.

To make the challenge accessible to participants without a biology background, Orr Ashenberg, associate director of computational biology at the Klarman Cell Observatory of the Broad Institute, produced a 1.5-hour crash course on cancer biology, genetic perturbations, and single-cell sequencing technologies.

Orr Ashenberg, associate director of computational biology at the Broad's Klarman Cell Observatory, delivers a lecture on single-cell sequencing technologies.

The Schmidt Center announced the Challenges 1 and 3 winners last March. The researchers then ran the top-scoring algorithms from Challenge 1 to predict which genes to knock out to mimic two kinds of cancer immunotherapy — CAR T-cell therapy and checkpoint blockade therapy. Next, Schwartz conducted experiments to see how well the proposed gene knockouts performed in a mouse model. To determine the Challenge 2 winners, Schmidt Center research fellow Jiaqi Zhang, who was instrumental in developing the challenge, calculated how well each participant’s algorithm from Challenge 1 predicted the effects of those ~60 gene knockouts.

The winners of Challenge 2 — the final part of the competition — are:

-First place: Brody Langille, Jordan Trajkovski, and Elizabeth Hudson

-Second place: mglettig (username)*

-Third place: Ai Vu Hong, researcher at Genethon, France

-Fourth place: Saket Kunwar, independent researcher, Nepal

-Fifth place: lxastro0 (username)*

-Sixth place: John Gardner, freelance data scientist

-Seventh place: agilsoft (username)*

-Eighth place: Basak Eraslan, postdoctoral researcher holding a joint position at the Regev Lab in Genentech and Kundaje Lab at Stanford University

-Ninth place: Haoyue Dai, Kun Zhang, Ignavier Ng, Yujia Zheng, Xinshuai Dong, and Yewen Fan from Carnegie Mellon University; Petar Stojanov, postdoctoral fellow at the Eric and Wendy Schmidt Center; Gongxu Luo, Mohamed bin Zayed University of Artificial Intelligence; and Biwei Huang, University of California, San Diego

-Ninth place: Liu Xindi, freelance programmer

-Ninth place: Johnson Zhou, Camille Sayoc, and Yi-Cheng Peng, Master’s students of the Faculty of Engineering and IT at the University of Melbourne, Victoria, Australia

The winning teams approached the problem using different deep-learning methods depending on the chosen input features. These features include gene expression and “chromatin accessibility,” the degree to which genetic information encoded in DNA can be accessed and read, measured by ATAC-seq peak counts. Additionally, some of the top-scoring teams incorporated learned representations from variational autoencoders — models that can capture meaningful features from raw data — or graph neural networks constructed based on the gene ontology database.

"We are grateful for the opportunity to participate in this challenge and are excited by the results,” said the first-place team in a prepared statement. “It's not often that you get invited to work on an important problem alongside preeminent scientists who furnish the problem description and data that you need to develop a novel solution — a novel solution that those same scientists can then turn around and validate in their lab.”

Martin Borch Jensen, chief scientific officer of Gordian Biotechnology, said: "Technological advances in sequencing have led to a vast amount of genomics data. As we pile up more and more transcriptomes from every type of cell in the human body, it becomes increasingly valuable to develop ways to understand how gene expression can cause and predict health and disease. I'm very excited for this competition to catalyze more work on this problem.”

Now, researchers at the Schmidt Center will further study the top-scoring algorithms to see if they can combine components from each into an even better predictive tool. The center plans to hold its second data science challenge later this year.

*Editor's note: Usernames were used instead of participant names in cases where the Schmidt Center could not get in touch with winners.

Cells

January 16, 2024

Building a two-way street between cell biology and machine learning Building a two-way street between cell biology and machine learning

2024

In a Comment for Nature Cell Biology, the Eric and Wendy Schmidt Center's director Caroline Uhler discusses how the rise of large-scale datasets in biology positions the field to become a driver of foundational advances in machine learning — and vice versa. Uhler, who is also a full professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT, advocates for new machine learning models that can better integrate different types of biological data and can uncover causal mechanisms in disease, not just associations. She also discusses the need for close collaborations between biologists and computational scientists so that predictive and causal algorithms can be incorporated into experimental design — and outlines some of the challenges, such as distinct cultures and vocabularies, of building those teams.

‍

No items found.

January 10, 2024

Researchers identify new regulators of cellular aging Researchers identify new regulators of cellular aging

2024

As we age, the risk for a wide range of diseases, including cancer and neurodegenerative conditions, increases. But while aging has been extensively studied, scientists don’t have a clear picture of the molecular changes that take place as we get older.

Now, researchers at the Broad Institute of MIT and Harvard and ETH Zürich in Switzerland have found key gene-expression regulators related to cellular aging that are tightly coupled to structural alterations of chromatin — the DNA-protein complex that forms chromosomes. The findings, published last month in Aging Cell, offer new insights into the biology of cellular aging. The research may also provide potential targets for aging reversal.

The study stems from a long-term collaboration between the laboratory of GV Shivashankar at ETH Zürich on the biological side and Caroline Uhler at the Broad Institute on the computational side.

“The explosion of biomedical data presents an exciting opportunity to develop novel machine learning methods to help answer important biological questions,” said study co-senior author Uhler, the director of the Eric and Wendy Schmidt Center at the Broad and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT. “In this work, the availability of large-scale sequencing data from many individuals in different age groups motivated us to develop methods to identify drivers of cellular aging,” she added.

Shivashankar’s lab has long been interested in understanding the relationship between a cell’s microenvironment, the three-dimensional structure of the genome, and gene expression in health and disease. Depending on how DNA is packed inside a cell’s nucleus, it may alter the expression of specific genes, which could in turn result in certain diseases, explained co-senior author Shivashankar, professor of Mechano-Genomics at ETH Zürich and head of the Laboratory of Nanoscale Biology at the Paul Scherrer Institute in Switzerland. “We’re very excited about understanding what may lead to healthy aging as opposed to cancer or neurodegeneration,” he added.

The study also aligns with the Eric and Wendy Schmidt Center’s goal of developing computational approaches for challenging biomedical questions. To this end, the Schmidt Center trains talented undergraduate, master’s, and PhD students as well as postdoctoral fellows with computational backgrounds on how to work with experimental biologists.

“As a graduate student in statistics, working closely with a biological lab allows me to gain a much deeper understanding of the kinds of questions and data that are most interesting to biologists,” said study co-first author Louis Cammarata, a PhD student at Harvard University and the Eric and Wendy Schmidt Center. “I’m able to design more useful computational methods because of this constant communication.”

Drivers of aging

In the nucleus of a cell, DNA coils around proteins to form chromatin. Other proteins bind along chromatin, creating complex three-dimensional structures that leave some genes accessible to transcription and others closed off.

Clockwise from top right: Caroline Uhler, GV Shivashankar (image credit: Paul Scherrer Institute), Louis Cammarata, and Jana Braunger

Uhler, Shivashankar, and their teams analyzed gene expression data from skin cells of 133 individuals aged 1 to 96 years, who were divided in five age groups. The difference in gene expression was particularly prominent when comparing the two oldest groups, which included people aged 61 to 85 years and those aged 86 to 96 years. Differentially expressed genes tended to be involved in biological processes such as immune response and cell proliferation, which play important roles in aging.

Next, the researchers used statistical algorithms to combine these data with information from a database that lists protein-protein interactions. The analysis revealed key age-associated regulators of gene expression, which include transcription factors — proteins that control how other genes are expressed.

“Transcription factors may be post-translationally activated or they may benefit from changes in chromatin organization to activate their target genes at a later time point,” said study co-first author Jana Braunger, a former master’s student at the Eric and Wendy Schmidt Center and current PhD student at the University of Heidelberg.

Gene expression hubs

To analyze the coupling between chromatin organization and changes in gene expression, the researchers used an experimental method called Hi-C, which provides a proximity map of the DNA packing.

Comparing Hi-C data from old and young skin cells revealed that the structure of chromatin changes over time, either drawing apart genes that were close together or bringing together genes that were far apart in young cells.

In the cell’s nucleus, nearby genes are often expressed as a group, Cammarata explained. “There are specific hotspots where different chromosomes come together, along with other molecules that are useful for transcription, and within those hubs, you have active transcription and co-regulation of genes,” he said. “In aging, changes in how DNA is folded influence these hotspots of transcription.”

Mitigating aging

Although more work is needed to determine whether alterations in chromatin structure drive changes in gene expression or vice versa, some of the gene-expression regulators identified in this study could serve as potential targets to mitigate, prevent, or even reverse cellular aging. “Identifying the key transcriptional drivers of cellular aging is crucial to develop interventions for cellular reprogramming and rejuvenation,” Shivashankar said.

Uhler noted that the study is an example of how computational researchers can develop new methods to help answer important biological questions — a core mission of the Eric and Wendy Schmidt Center. “We place great importance on training the next generation of scientists — researchers who are strong on the computational side and understand the biological questions,”she said. “Merging computational science and biology can help us tackle some of medicine’s biggest challenges.”

Cells

December 14, 2023

A new method for genomics analysis doesn’t require reference data A new method for genomics analysis doesn’t require reference data

2023

In 2003, scientists finished sequencing almost all of the three million nucleotide base pairs that make up the human genome. This feat led to an explosion in genomics analysis, which to this day relies on aligning sequencing data to a “reference genome” — a composite made up of DNA samples from different individuals in the same species — for humans and other species.

Now, researchers at Stanford University and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a genomics analysis framework, SPLASH, that directly analyzes raw sequencing samples, eliminating the need for reference data. The method can perform genomic analyses more quickly and with less computing power than traditional methods. SPLASH should prove especially useful for analyzing genomes of understudied or rapidly mutating species.

In a study published earlier this month in Cell, the team showed that the framework can detect different strains of SARS-CoV-2 and find sequence diversity in adaptive immune receptors, among other findings. Kaitlin Chuang and Tavor Baharav, former PhD students at Stanford University, were co-first authors on the paper, and Julia Salzman, associate professor of biomedical science and biochemistry at Stanford, was the lead author. All research was performed in Salzman's group, whose lab combines statistics and genomics.

“A lot of sequencing analysis is done with implicit priors, meaning that your pipeline is only going to identify the one feature that it was designed to find,” said Baharav, who is now an Eric and Wendy Schmidt Center postdoctoral fellow. “With SPLASH, we’ve developed a method for unbiased, reference-free hypothesis generation.”

From alignment- to statistics-first

While genomics has revolutionized both medicine and ecology, its dependence on reference genomes has its limitations. For example, only 5% of mammalian species have had their genomes sequenced — a percentage that drops even further for organisms like bacteria and viruses. Additionally, because the human-reference genome only contains samples from a handful of individuals, it does not reflect global genomic diversity.

Eric and Wendy Schmidt Center postdoctoral fellow Tavor Baharav

Also, traditional genomics analysis aligns samples with references before comparing the samples to each other, discarding outliers. “When you're trying to detect an interesting, novel event, it almost by definition isn't going to align well to the reference,” said Baharav.

To address these and other limitations, researchers in the Salzman Lab at Stanford University came up with a way to analyze raw sequencing data without having to first align it to a reference genome.

Their framework, SPLASH, identifies unchanging "anchor" subsequences in the raw sequencing data that are followed by "target" sequences that vary by sample. SPLASH, which stands for “Statistically Primary aLignment Agnostic Sequence Homing,” uses a new statistical test to determine which stretch of RNA reads exhibit the most variation.

"This work illustrates how interdisciplinary teams with diverse perspectives and skill sets are powerful and needed for scientific progress,” said Salzman. “Initially, the team questioned why such a straightforward approach hadn't been implemented before, but we gradually came to realize that rethinking conventions can sometimes yield simple solutions that could work better than ingrained approaches.”

Unlike traditional methods, which can only detect certain types of genetic variations, the framework can detect a wide variety of variations. SPLASH is also much more computationally efficient than those methods. An updated version of the framework can complete the entire analysis in an hour while using much less computing power than alignment-first approaches.

Detecting viral mutations + microalgae growing on eelgrass

To test the effectiveness of SPLASH, the team used it to perform a range of genomic analyses. In one, they compared nasal swab samples from patients taken at different periods during the COVID-19 pandemic, when different viral strains were dominant. SPLASH was able to identify which anchors had “low p-values” and high effect sizes — indicators of viral mutations. They then mapped these reads to control samples from different COVID strains, determining that almost all of the anchors that SPLASH homed in on were indeed strain-defining mutations.

Eelgrass provides foraging areas and shelter for fish. Adam Obaza/NOAA.

Given that very few species have reference genomes, the team also tested how well SPLASH can detect variations between samples from two species — eelgrass and octopus — with limited reference data available. They compared RNA from eelgrass, a common seagrass, found in the Mediterranean and Norway, finding that almost 6% of targets did not align to eelgrass references. In particular, they noticed that the target sequences for one anchor varied by location and season.

The team theorized that these discrepancies could indicate the presence of different species of diatoms, microalgae that grow on other plants, as the anchor was less abundant in samples taken at night, when diatoms reduce expression of this particular type of gene.

“On its own, SPLASH does not provide immediately interpretable results, but it points researchers to interesting questions that they can investigate further,” said Baharav.

Next steps

Baharav, who completed his PhD in electrical engineering at Stanford earlier this year, is now applying his computational background to cancer research. As white blood cells develop, they shuffle around parts of their genome through a process called “V(D)J recombination.” This genetic reshuffling allows them to produce a huge array of antibodies and T-cell receptors, which they use to recognize and kill millions of microbes.

Cancer researchers like Baharav’s mentor, Rafael Irizarry, chair of the Department of Data Science at Dana-Farber Cancer Institute, want to better understand how V(D)J recombination works to design cancer vaccines. As a Schmidt Center fellow, Baharav is developing a reference-free way to analyze these adaptive immune receptors.

“SPLASH provides an exciting new statistical and computational framework for genomic analysis. I'm looking forward to building on this work to expand the scope of reference-free analysis, allowing researchers to perform unbiased inference on their data,” said Baharav. “As discussed in SPLASH, reference-based methods fall short in analyzing highly diverse genomic regions such as T cell receptors, which I'm looking to change.”

Cells

October 24, 2023

Maria Skoularidou receives Blackwell-Rosenbluth Award Maria Skoularidou receives Blackwell-Rosenbluth Award

2023

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is excited to announce that postdoctoral fellow Maria Skoularidou was awarded the 2023 Blackwell-Rosenbluth Award earlier this month. The Blackwell-Rosenbluth Award is granted to outstanding young researchers in the field of Bayesian statistics.

Skoularidou joined the Eric and Wendy Schmidt Center in September, 2023. She is co-advised by Nikos Daskalakis, director of the Neurogenomics and Translational Bioinformatics Laboratory at McLean Hospital and an associate professor of psychiatry at Harvard Medical School, and Costis Daskalakis, a professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the MIT Computer Science and Artificial Intelligence Laboratory. Her research focuses on developing scalable and efficient computational methods to detect epigenetic effects in diverse trauma and PTSD contexts through employing information from various datasets.

Skoularidou holds a PhD in biostatistics from the University of Cambridge, where she was advised by Sylvia Richardson. Skoularidou has a four-year degree in informatics and a Master’s of Science in statistical science from the Athens University of Economics and Business. She founded (Dis)Ability in AI, a group that supports and advocates for disabled people’s needs at machine learning conferences and other venues, and is on the editorial board of ACM Transactions on Probabilistic Machine Learning.

“Maria has already made impressive contributions to the field of Bayesian inference as well as generative modeling and its applications to biomedical data,” said Caroline Uhler, director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “We’re excited to see what she’ll continue to accomplish as a Schmidt Center fellow.”

People

October 2, 2023

A more effective experimental design for engineering a cell into a new state A more effective experimental design for engineering a cell into a new state

2023

A strategy for cellular reprogramming involves using targeted genetic interventions to engineer a cell into a new state. The technique holds great promise in immunotherapy, for instance, where researchers could reprogram a patient’s T-cells so they are more potent cancer killers. Someday, the approach could also help identify life-saving cancer treatments or regenerative therapies that repair disease-ravaged organs.

But the human body has about 20,000 genes, and a genetic perturbation could be on a combination of genes or on any of the over 1,000 transcription factors that regulate the genes. Because the search space is vast and genetic experiments are costly, scientists often struggle to find the ideal perturbation for their particular application.

Researchers from MIT and Harvard University developed a new, computational approach that can efficiently identify optimal genetic perturbations based on a much smaller number of experiments than traditional methods.

Their algorithmic technique leverages the cause-and-effect relationship between factors in a complex system, such as genome regulation, to prioritize the best intervention in each round of sequential experiments.

The researchers conducted a rigorous theoretical analysis to determine that their technique did, indeed, identify optimal interventions. With that theoretical framework in place, they applied the algorithms to real biological data designed to mimic a cellular reprogramming experiment. Their algorithms were the most efficient and effective.

“Too often, large-scale experiments are designed empirically. A careful causal framework for sequential experimentation may allow identifying optimal interventions with fewer trials, thereby reducing experimental costs,” says co-senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) who is also the director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and Institute for Data, Systems and Society (IDSS).

Joining Uhler on the paper, which appears today in Nature Machine Intelligence, are lead author Jiaqi Zhang, a graduate student and Eric and Wendy Schmidt Center Fellow; co-senior author Themistoklis P. Sapsis, professor of mechanical and ocean engineering at MIT and a member of IDSS; and others at Harvard and MIT.

Active learning

When scientists try to design an effective intervention for a complex system, like in cellular reprogramming, they often perform experiments sequentially. Such settings are ideally suited for the use of a machine-learning approach called active learning. Data samples are collected and used to learn a model of the system that incorporates the knowledge gathered so far. From this model, an acquisition function is designed — an equation that evaluates all potential interventions and picks the best one to test in the next trial.

This process is repeated until an optimal intervention is identified (or resources to fund subsequent experiments run out).

“While there are several generic acquisition functions to sequentially design experiments, these are not effective for problems of such complexity, leading to very slow convergence,” Sapsis explains.

Acquisition functions typically consider correlation between factors, such as which genes are co-expressed. But focusing only on correlation ignores the regulatory relationships or causal structure of the system. For instance, a genetic intervention can only affect the expression of downstream genes, but a correlation-based approach would not be able to distinguish between genes that are upstream or downstream.

“You can learn some of this causal knowledge from the data and use that to design an intervention more efficiently,” Zhang explains.

The MIT and Harvard researchers leveraged this underlying causal structure for their technique. First, they carefully constructed an algorithm so it can only learn models of the system that account for causal relationships.

Then the researchers designed the acquisition function so it automatically evaluates interventions using information on these causal relationships. They crafted this function so it prioritizes the most informative interventions, meaning those most likely to lead to the optimal intervention in subsequent experiments.

“By considering causal models instead of correlation-based models, we can already rule out certain interventions. Then, whenever you get new data, you can learn a more accurate causal model and thereby further shrink the space of interventions,” Uhler explains.

This smaller search space, coupled with the acquisition function’s special focus on the most informative interventions, is what makes their approach so efficient.

The researchers further improved their acquisition function using a technique known as output weighting, inspired by the study of extreme events in complex systems. This method carefully emphasizes interventions that are likely to be closer to the optimal intervention.

“Essentially, we view an optimal intervention as an ‘extreme event’ within the space of all possible, suboptimal interventions and use some of the ideas we have developed for these problems,” Sapsis says.

Enhanced efficiency

They tested their algorithms using real biological data in a simulated cellular reprogramming experiment. For this test, they sought a genetic perturbation that would result in a desired shift in average gene expression. Their acquisition functions consistently identified better interventions than baseline methods through every step in the multi-stage experiment.

“If you cut the experiment off at any stage, ours would still be more efficient than the baselines. This means you could run fewer experiments and get the same or better results,” Zhang says.

The researchers are currently working with experimentalists to apply their technique toward cellular reprogramming in the lab.

Their approach could also be applied to problems outside genomics, such as identifying optimal prices for consumer products or enabling optimal feedback control in fluid mechanics applications.

In the future, they plan to enhance their technique for optimizations beyond those that seek to match a desired mean. In addition, their method assumes that scientists already understand the causal relationships in their system, but future work could explore how to use AI to learn that information, as well.

This work was funded, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the MIT J-Clinic for Machine Learning and Health, the Eric and Wendy Schmidt Center at the Broad Institute, a Simons Investigator Award, the Air Force Office of Scientific Research, and a National Science Foundation Graduate Fellowship.

Adapted from a news story posted on the MIT News website.

Cells

Active Learning

September 10, 2023

New machine learning techniques boost predictions for virtual drug screening with less data New machine learning techniques boost predictions for virtual drug screening with less data

2023

Scientists using machine learning tools to analyze biomedical data often turn to neural network algorithms, but before these models became popular, another simpler type of machine learning algorithm called kernel methods were commonly used. Kernel methods work by first applying straightforward operations to transform data and then training a simple model on the transformed data.

Now, in a new paper recently published in Nature Communications, researchers at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a new way of using kernel methods that could make them more useful for a wider range of applications, such as virtual drug screening. They came up with the first “transfer learning” techniques for kernel methods that can be successfully applied to large-scale datasets. Transfer learning allows researchers to improve machine learning models by training them on one task in a way that enhances their performance on a second task — without having to spend the time and resources training a new model for each new task. In their paper, the team showed how their transfer learning framework allowed them to predict which drugs might be most effective in certain cancer cell lines where little data is available. They did this by transferring from cell lines in which many drugs have already been tested.

“Before our paper, there was no transfer learning method for kernel methods that could scale to the large datasets of most interest in the biomedical field and beyond. We’ve shown for the first time that transfer learning using kernels in these settings is possible and I think that is really exciting,” said Caroline Uhler, the senior author on the paper and a Broad core institute member, co-director of the Schmidt Center at Broad, and a professor in the Department of Electrical Engineering and Computer Science as well as the Institute for Data, Systems, and Society at MIT.

The team’s key innovation was creatively adapting transfer learning methods used in neural network algorithms so that they can be applied to kernel methods. This advance could find uses in other applications.

“Particularly for healthcare and biomedical applications, it's very hard to collect a lot of data for every question of interest. When you have very little data for a certain task but a related task has abundant data, this is exactly a setting where our method is effective,” said Adityanarayanan Radhakrishnan, a co-first author on the study and a Schmidt Center fellow, who worked on this study while completing his PhD as an Eric and Wendy Schmidt Center Fellow in Uhler’s lab at Broad and MIT, and is currently the George F. Carrier Postdoctoral Fellow at Harvard School of Engineering and Applied Sciences.

Transferring knowledge

The research team focused on kernel methods because they found in a previous paper that these performed better than typical neural network models on virtual drug screening tasks. But they wanted to make it possible for researchers to quickly reuse their kernel method algorithms to identify drugs for a wide range of cancer types without having to train a new model for each new type of cancer. They realized that transfer learning techniques are necessary for this, but because existing techniques don’t work well for kernel methods, they had to come up with new ones.

They decided to take inspiration from two transfer learning techniques that work well for neural network models, which they called projection and translation. The team adapted them to work with kernel methods and then tested their approach in a virtual drug screen.

The researchers analyzed performance of their transfer learning algorithms on two massive Broad datasets, one from the Connectivity Map (CMAP) and the other from the Cancer Dependency Map (DepMap). These datasets describe the effects of drugs on cancer cell lines across millions of drug and cell line combinations. The team trained their kernel method algorithms to predict either the genes expressed by a certain cell type after it was treated with a certain drug (using the CMAP dataset), or the proportion of cancer cells that survived after treatment with the same drug (using the DepMap dataset).

The scientists then applied their projection and translation techniques to their model so that it could complete the second task: to predict the effect of the drug on new cancer cell lines that have much less data. The projection transformation corrects the model’s predictions on the second task by recognizing when the prediction errors are falling into categories that can be easily corrected to the right category. And the translation technique fine-tunes the model by applying a correction term that shifts the model’s predictions so that it’s more accurate on the second task.

The team found that their transfer learning techniques allowed their original kernel method to be successfully “transferred” to the second task, without needing to be retrained. Compared to a new model trained only on the second task, the transfer learning techniques greatly boosted the accuracy of their model in predicting the effect of drugs for new cancer cell lines. And on a common machine learning task where the team trained their kernel method algorithms to recognize images, their approach surprisingly boosted the accuracy by up to 10 percent.

Moreover, the researchers were also able to pinpoint exactly how much extra data they would need to collect to increase the performance of the model. Uhler said this could be helpful to scientists trying to decide whether it’s worthwhile to collect more data in the lab. “That's really quite exciting because you can ask ‘how much is it worth for me to have a little bit better performance of my model if I know that we’ll need to collect, say, 10 or 20 percent more data?’” said Uhler.

Beyond drug screening

Two additional advantages of kernel methods are that they provide interpretability as well as a quantification of how uncertain the model is on a given prediction. To take advantage of the interpretability aspect, the research team is working on pinning down the features of a drug that lead their model to predict that it will be effective. In addition, the research team hopes that the uncertainty estimates provided by their kernel approach will be helpful in identifying which new drug and cell line combinations should be screened experimentally for a more effective drug discovery pipeline.

They also have plans to expand their framework to other applications, such as screening cancer genes that tumors heavily depend on for survival and might be targeted with new drugs.

The team adds that their transfer learning approach for kernel methods may also open up other, unexpected applications. Because kernel methods make it easy for scientists to mathematically understand what the model is doing, they can investigate what kinds of biomedical questions will be the best fit to study. “It now gives us a more thorough or deeper understanding of transfer learning and where the power comes from, so that we can analyze which tasks it will actually work for,” said Uhler.

Proteins

Representation Learning

August 31, 2023

Schmidt Center, Helmholtz Munich launch AI and machine learning in genomics collaboration Schmidt Center, Helmholtz Munich launch AI and machine learning in genomics collaboration

2023

Helmholtz Munich and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard today announce the launch of a collaboration to bridge a gap in health research with AI and machine learning.

In the past decade, the field of genomics has accelerated to a point where we can now both measure and perturb biological systems at massive, unprecedented scales, holding huge potential for disease treatment. However, the computational tools needed to take advantage of all this data have not kept pace. By leveraging machine learning methods, the partnership between Helmholtz Munich and the Eric and Wendy Schmidt Center seeks to gain valuable insights into important genomics problems while simultaneously advancing the foundations of machine learning through novel research inspired by genomics questions.

Leading this joint initiative are Caroline Uhler, co-director of the Eric and Wendy Schmidt Center at the Broad Institute, and Fabian Theis, head of the Computational Health Center (CHC) at Helmholtz Munich and Director of Helmholtz AI. Both Caroline Uhler and Fabian Theis have backgrounds in machine learning, statistics, data science, biology, and human biology. “This exchange model between the Broad Institute and Helmholtz Munich will merge our expertise on machine learning and genomics to foster innovative ways to address major challenges in biomedical research,” said Fabian Theis.

The collaboration will encompass a range of activities, including the exchange of graduate students, postdoctoral fellows, and other research staff between the two research centers. These individuals will undertake short research stays, enabling them to benefit from the expertise and resources available at both centers. In addition, the research centers will co-organize workshops and conferences to facilitate knowledge exchange and foster collaboration in the field of AI and genomics.

“Despite an explosion in biological data, the technology sector remains the key driver of machine learning advances today,” said Caroline Uhler. “Both Helmholtz Munich and the Broad Institute are seeking to change that by developing foundations of machine learning that are geared specifically to biological problems, and we’re excited for this collaboration to amplify our efforts.”

No items found.

July 27, 2023

Making machine learning models make sense Making machine learning models make sense

2023

Gemma Moran will never forget how magical it felt to run her very first statistical models on genomics data during her undergraduate summer research project at the University of Sydney. Moran had initially planned to major in pure mathematics but veered away from that path towards a career in applied research after taking a few statistics courses. “I came to realize that I was much more interested in being able to apply math to real world applications and data,” she said.

Now, as a postdoctoral fellow with the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, Moran’s interest in using statistical models to uncover biological patterns that could improve health care has only grown stronger. These days, Moran, who is based at Columbia University’s Data Science Institute, is working to combine the rigorous and intuitive nature of the simple statistical models she first learned about in undergrad with the flexibility and power of today’s modern machine learning algorithms. In September, Moran will launch her own research group to pursue this direction as an assistant professor of statistics at Rutgers University.

Those who work with her are confident that her research has been and will continue to be impactful. “Gemma is a clear thinker, a careful scientist, and a fantastic collaborator to work with and learn from,” said David Blei, Moran’s postdoctoral adviser and a professor of statistics and computer science at Columbia University. “What her algorithms discover is information that we can use to help make better scientific and medical predictions, and use to help further our understanding of biology and genetics.”

As part of a project with Anthony Philippakis, co-director of the Eric and Wendy Schmidt Center and chief data officer of the Broad Institute, Moran has been using a type of machine learning algorithm called a variational autoencoder (VAE) to reveal important connections between disease symptoms that doctors may be missing. Though it’s still in its early stages, this work has the potential to affect clinical care if these algorithms discover new ways to cluster symptoms into one disease versus another — a challenging task that has long relied on doctor’s observations alone.

‍Revealing New Relationships

As a graduate student at the University of Pennsylvania, Moran worked on designing a method to uncover genes that are most relevant in different subtypes of breast cancer. She also developed new theoretical techniques to estimate the uncertainty present in models. During her postdoctoral fellowship in Blei’s lab, Moran developed a new method that allows researchers to better interpret the results that variational autoencoder algorithms spit out. These algorithms are masterful at paring down massive datasets into tiny summaries that contain only the most important aspects of the bigger dataset. The problem, Moran explains, is that it’s very challenging for researchers to understand exactly what parts of the original dataset are captured in the small summaries.

Moran working in her office in Columbia's Data Science Institute

To illustrate the challenge and her new fix, Moran gives the example of a large dataset filled with hundreds and hundreds of movie ratings. To create a meaningful summary with fewer data points, the variational autoencoder algorithm might divide these ratings into categories like horror, comedy, action, and science fiction. While it learns, the algorithm creates connections between the movie titles in the original dataset and its new summary output. But if left to its own devices, the algorithm will create thousands of connections that will be difficult to interpret.

Importantly, by pruning down these connections at certain places in the network until they become sparse, Moran’s new method — named "sparse VAE" — makes it much easier to see what parts of the original data are directly linked to the smaller summary. For example, she could trace back the new “anchor points” to find that the movie “Alien” is only represented in the science fiction category of the summary, but a movie like “Everything Everywhere All At Once” might be represented in the categories of action, comedy, and science fiction. And as an added rare bonus, Moran’s new method successfully achieves a statistical property known as identifiability. This ensures that the model only has one way to interpret it, as long as there are anchor points in the data.

After chatting with Philippakis last year about her new sparse VAE method, the two realized that it could be a great way to unearth previously unknown relationships between health symptoms in ways that would be easy for doctors and health researchers to interpret. Essentially, their project uses machine learning to improve nosology, which is the scientific field of disease classification. Until now, to classify a new disease, doctors have relied on their own expertise and experience to know what symptoms — like blurry vision and increased urination for diabetes — co-occur. They’ve also had to decide how to meaningfully differentiate these symptoms from another group of symptoms that comprise a separate disease. But it’s possible that physicians haven’t noticed some co-occurring symptoms that might tell them more about disease severity or indicate a new subgroup of a disease — or require a new disease label altogether.

“What these machine learning methods are exactly designed to do is find what things travel together, and so in that way, they can help physicians see more things that travel together that they might not have noticed just by observation alone,” said Moran.

Moran stands in front of the Low Memorial Library

Moran and Philippakis are currently applying the sparse VAE method to data from 500,000 patients in the UK Biobank, which is a large patient dataset filled with detailed genetic and health information collected by researchers in the United Kingdom. They hope it may yield surprising correlations between biological signals that could improve the classification of diseases, with the goal of obtaining their first results later this year.

“I’m incredibly excited about where this line of research is headed,” said Anthony Philippakis. “In the same way that Gemma has already shown that her method can identify ‘eigen-movies’ that indicate similar classes of films, there is the opportunity to uncover ‘eigen-phenotypes’ that indicate collections of traits that are correlated with each other.”

New Job, Same Thrill

When Moran starts her own research group this fall at Rutgers University, she will continue her work on improving the interpretability and transparency of powerful machine learning algorithms applied to medical research. Her ultimate goal is to create algorithms that provide the most advantages to the health of society without propagating harmful biases against certain groups. Indeed, Moran sees this problem of bias in machine learning as one of the biggest challenges facing the field over the next ten years.

“It’s a really crazy time to be in machine learning. There are so many developments happening at breakneck speed,” she said. “What worries me is people building these powerful [machine learning] models without necessary checks and balances and transparency and interpretability … especially applied to health care because it's such a critical domain where we could see negative consequences if we're not using these tools responsibly.”

While Moran’s goals and physical locations on opposite sides of the globe have changed across her academic career, the joy she finds in the work has remained constant. “That feeling when you've had an idea and then you code up something that works — it's just very thrilling,” she said. For Moran, that thrill becomes even more meaningful when she’s answering a question that could help actual patients. “At the end of the day, I love math and modeling and thinking about variation and how to think about data, but it's nice to connect it to real world questions.”

Organisms

Representation Learning

June 6, 2023

Yue Qin named to Forbes 30 Under 30 Asia 2023 Yue Qin named to Forbes 30 Under 30 Asia 2023

2023

Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard is excited to announce that postdoctoral fellow Yue Qin was named to the Forbes 30 Under 30 Asia 2023 list this May. The Forbes 30 under 30 lists highlight some of the most successful researchers, leaders, and entrepreneurs around the world.

Qin joined the Eric and Wendy Schmidt Center in January, 2023. She is co-advised by Paul Blainey, a core member of the Broad Institute and an associate professor of biological engineering at MIT, and Caroline Uhler, co-director of the Eric and Wendy Schmidt Center. Qin's research interests lie in understanding how to read out the programs of cells from the genome. Qin uses that knowledge to create in silico cells that simulate the effect of therapeutic interventions in different disease and genetic contexts with the ultimate goal of developing personalized medicine.

Qin holds a PhD in Bioinformatics and Systems Biology and a BSc in Bioinformatics from the University of California San Diego (UCSD). As a graduate student, she was the first author on a 2021 Nature paper that developed a machine learning framework to map the structure of human cells by fusing data from protein imaging and protein biophysical interactions. Qin is a Siebel Scholar and a recipient of an NCI Predoctoral to Postdoctoral Fellow Transition Award (F99/K00) as well as the Chancellor’s Dissertation Medal within the Jacobs School of Engineering at UCSD.

“Yue embodies the type of researcher we’re excited to work with at the Eric and Wendy Schmidt Center,” said Uhler, who is also a core member of the Broad Institute and a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “Her research is a great example of how computation and biology can go hand in hand in an age where the number of possible experiments we could perform has exploded.”

People

Cells

April 28, 2023

Machine learning model finds genetic factors for heart disease Machine learning model finds genetic factors for heart disease

2023

To get an inside look at the heart, cardiologists often use electrocardiograms (ECGs) to trace its electrical activity and magnetic resonance images (MRIs) to map its structure. Because the two types of data reveal different details about the heart, physicians typically study them separately to diagnose heart conditions.

Now, in a paper published in Nature Communications, scientists in the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have developed a machine learning approach that can learn patterns from ECGs and MRIs simultaneously, and based on those patterns, predict characteristics of a patient’s heart. Such a tool, with further development, could one day help doctors better detect and diagnose heart conditions from routine tests such as ECGs.

The researchers also showed that they could analyze ECG recordings, which are easy and cheap to acquire, and generate MRI movies of the same heart, which are much more expensive to capture. And their method could even be used to find new genetic markers of heart disease that existing approaches that look at individual data modalities might miss.

Overall, the team said their technology is a more holistic way to study the heart and its ailments. “It is clear that these two views, ECGs and MRIs, should be integrated because they provide different perspectives on the state of the heart,” said Caroline Uhler, a co-senior author on the study, a Broad core institute member, co-director of the Schmidt Center at Broad, and a professor in the Department of Electrical Engineering and Computer Science as well as the Institute for Data, Systems, and Society at MIT.

"As a field, cardiology is fortunate to have many diagnostic modalities, each providing a different view into cardiac physiology in health and diseases. A challenge we face is that we lack systematic tools for integrating these modalities into a single, coherent picture,” said Anthony Philippakis, a senior co-author on the study and chief data officer at Broad and co-director of the Schmidt Center. “This study represents a first step towards building such a multi-modal characterization."

Model making

To develop their model, the researchers used a machine learning algorithm called an autoencoder, which automatically integrates gigantic swaths of data into a concise representation – a simpler form of the data. The team then used this representation as input for other machine learning models that make specific predictions.

In their study, the team first trained their autoencoder using ECGs and heart MRIs from participants in the UK Biobank. They fed in tens of thousands of ECGs, each paired with MRI images from the same person. The algorithm then created shared representations that captured crucial details from both types of data.

“Once you have these representations, you can use them for many different applications,” said Adityanarayanan Radhakrishnan, a co-first author on the study, an Eric and Wendy Schmidt Center Fellow at the Broad, and a graduate student at MIT in Uhler’s lab. Sam Friedman, a senior machine learning scientist in the Data Sciences Platform at the Broad, is the other co-first author.

One of those applications is predicting heart-related traits. The researchers used the representations created by their autoencoders to build a model that could predict a range of traits, including features of the heart like the weight of the left ventricle, other patient characteristics related to heart function like age, and even heart disorders. Moreover, their model outperformed more standard machine learning approaches, as well as autoencoder algorithms that were trained on just one of the imaging modalities.

“What we showed here is that you get better prediction accuracy if you incorporate multiple types of data,” Uhler said.

Radhakrishnan explained that their model made more accurate predictions because it used representations that had been trained on a much larger dataset. Autoencoders don’t require data that have been labeled by humans, so the team could feed their autoencoder with around 39,000 unlabeled pairs of ECGs and MRI images, rather than just around 5,000 labeled pairs.

The researchers demonstrated another application of their autoencoder: generating new MRI movies. By inputting an individual’s ECG recording into the model — without a paired MRI recording — the model produced the predicted MRI movie for the same person.

With more work, the scientists envision that such technology could potentially allow physicians to learn more about a patient’s heart health from just ECG recordings, which are routinely collected at doctors’ offices.

Broader gene search

With their autoencoder representations, the team realized they could also use them to look for genetic variants associated with heart disease. The traditional method of finding genetic variants for a disease, called a genome-wide association study (GWAS), requires genetic data from individuals that have been labeled with the disease of interest.

But because the team’s autoencoder framework doesn’t require labeled data, they were able to generate representations that reflected the overall state of a patient’s heart. Using these representations and genetic data on the same patients from the UK Biobank, the researchers created a model that looked for genetic variants that impact the state of the heart in more general ways. The model produced a list of variants including many of the known variants related to heart disease and some new ones that can now be investigated further.

Radhakrishnan said that genetic discovery could be the area in which the autoencoder framework, with more data and development, could have the most impact – not just for heart disease, but for any disease. The research team is already working on applying their autoencoder framework to study neurological diseases.

Uhler said this project is a good example of how innovations in biomedical data analysis emerge when machine learning researchers collaborate with biologists and physicians. “An exciting aspect about getting machine learning researchers interested in biomedical questions is that they might come up with a completely new way of looking at a problem.”

Support for the research was provided in part by the Eric and Wendy Schmidt Center at the Broad Institute, the National Science Foundation, the Office of Naval Research, the MIT-IBM Watson AI Lab, a Simons Investigator Award, the National Institutes of Health, and the American Heart Association.

Adapted from a news story posted on the Broad Institute website.

Organisms

Representation Learning

April 28, 2023

A deep (learning) dive into the roots of cancer A deep (learning) dive into the roots of cancer

2023

In a recent grant application to the National Institutes of Health, Petar Stojanov was required, among other things, to describe his “specific aims” as well as his background. It’s doubtful that the NIH reviewers would have considered Stojanov’s research agenda lacking in ambition, given its broad scope: to identify the genetic mutations that cause cancer and figure out how they cause it.

The reviewers, moreover, must have decided he had a credible chance of achieving these goals, or at least making progress toward their realization, as he was informed earlier this year that he had earned a coveted Pathway to Independence (K99) Award. As a result, Stojanov — a current Eric and Wendy Schmidt Center Postdoctoral Fellow at the Broad Institute of MIT and Harvard — will receive up to five years of research support, meaning he can devote himself fully to his scientific inquiries without having to worry about funding.

K99 grants help “outstanding” researchers transition from postdoctoral positions to running their own labs. In this next stage of his career, Stojanov will develop new methods in two types of machine learning: algorithms related to causality and deep generative models.

An early interest in computational biology

In some sense, Stojanov set off on the path that led him to this milestone when he was a high school student in Macedonia. A family friend told him that computational biology was becoming a hot area in science. Stojanov was immediately intrigued, he said, “for the same reason that has brought many people to this field — math and biology were my favorite subjects.” And here was a chance to combine his preferred disciplines into a unified course of study that might lead to an interesting career.

He spent his senior year of high school in Pelham, New York (where he lived with his family friend), as he’d always believed he “would have the best opportunities for innovation in the U.S.” A year later, he enrolled in Bard College, which had no courses, let alone a major, in computational biology. Stojanov stuck to his passion, nevertheless, taking the bulk of his classes in computer science, biology, mathematics, and chemistry. He gained hands-on experience in computational biology through summer research programs at George Washington University and the University of Maryland.

Stojanov on his way to work at the Broad Institute

After graduating from Bard in 2010, he took a job in the laboratory of Gaddy Getz, director of the Broad’s Cancer Genome Computational Analysis Group. That’s where Stojanov got started on the two-pronged research track he’s still pursuing today: First, to figure out which mutations are present in cancerous tissue and, second, to determine which of those mutations actually spur our cells to multiply out of control and drive cancer. The standard approach at the time was to rely on statistical methodology, such as examining whether the number of mutations in a given gene was greater than would be expected from random processes, unrelated to cancer.

Stojanov spent four productive years at the Broad, coauthoring more than a dozen papers — four of which he was a lead author. He didn’t sleep much those days, mainly because he was “hungry for projects and never said no to an opportunity.” Yet, by the end of that tenure, he felt that his work in this area could benefit from additional training in computer science, which would enable him to bring new tools to the kinds of problems he’d been grappling with. In 2014, he entered a PhD program at Carnegie Mellon University, where he immersed himself in machine learning techniques and other emerging approaches in artificial intelligence. Although his graduate research had nothing to do with biology, he recognized that the methods he was learning, combined with statistics, might lead to breakthroughs in his previous cancer investigations.

Bringing ML to bear on cancer research

Stojanov returned to the Broad in 2021 and picked up in the Getz lab where he had left off — this time ready to unleash the full power of AI. Getz was eager to have him back, touting “the unique set of skills that Petar has,” given his prior experience in cancer research and his recently strengthened background in computer science. “And now,” Getz said, “he’s applying his expertise in machine learning to the search for the drivers of cancer.”

Just counting the number of mutations in a gene is not enough to reveal the mechanisms underpinning cancer, Stojanov explained. “That may tell you which mutations are most prevalent, and maybe the most important, but it still doesn’t tell you what they do.” To understand how a mutation affects a gene, you have to look at gene expression, the cellular process by which the information encoded in a gene is used to create proteins.

In his latest work at the Broad, Stojanov is focusing on two variables: gene mutations, which can be gleaned from DNA sequencing data, and gene expression expression (which can be obtained from RNA sequencing data by measuring the amount of RNA, a gene-decoding molecule, in the cell). He then uses a set of machine learning tools called causal inference and discovery algorithms to uncover the “causal relationships” between these two variables – mutations and expression.

“The idea is to show that some aspects of gene expression are the consequences of mutations,” he said.

The only causal relationships he cares about are those associated with cancer. While sorting through DNA and RNA sequencing data from thousands of cancer patients, he’s looking for patterns. In particular, he said, “we might find mutations that influence patients with the same cancer type (or subtype), in the same way.”

Stojanov in his office with colleagues Pinar Eser (center) and Tim Coorens

As an intermediate step, Stojanov relies on a related class of machine learning-based tools, so-called deep generative models, which basically takes abstract (“high-dimensional”) information processed by computers and represents it in a form that is meaningful to humans. If you have mutation and expression data for 20,000 genes, he said, these models offer a way to summarize that vast amount of data in terms of the concepts you’re interested in, such as biological processes or cell subtypes that might be impacted by cancer.

The ultimate goal is to learn as much as possible about this multifaceted disease — how and where it starts and progresses. “To really understand what’s going on,” Stojanov said, “we need an interpretable map that shows which processes are affected by what mutations.”

Existing techniques can only get you so far

Eric and Wendy Schmidt Center co-director Caroline Uhler is excited by the prospect of “getting at the causal genes, which contain the mutations that drive cancer. "Once you have that,” she said, “you’re in a much better position to think about effective therapies. That’s really the promise of this work.”

Stojanov’s current research is, admittedly, at an early stage. He has a solid base of experience to draw on, and he’s picked out a set of tools, in the form of machine learning algorithms, that are poised to advance our knowledge base. The big challenge, Uhler pointed out, is that “existing techniques can only get you so far. Petar has to build on these methods and develop new algorithms in order to solve the important biological questions he plans to address.”

Stojanov is mindful of the hard work ahead and grateful that his burden has been eased by having several years of funding already secured. “This [K99] award gives you the ultimate amount of independence you can have as a postdoc,” he said.

When asked if getting the award is the best thing that could happen to someone in his position, embarking on such an ambitious enterprise, he replied, “Well, it’s certainly up there.”

‍

Cells

Causal Inference

March 30, 2023

A method for designing neural networks optimally suited for certain tasks A method for designing neural networks optimally suited for certain tasks

2023

Neural networks, a type of machine-learning model, are being used to help humans complete a wide variety of tasks, from predicting if someone’s credit score is high enough to qualify for a loan to diagnosing whether a patient has a certain disease. But researchers still have only a limited understanding of how these models work. Whether a given model is optimal for certain task remains an open question.

MIT researchers have found some answers. They conducted an analysis of neural networks and proved that they can be designed so they are “optimal,” meaning they minimize the probability of misclassifying borrowers or patients into the wrong category when the networks are given a lot of labeled training data. To achieve optimality, these networks must be built with a specific architecture.

The researchers discovered that, in certain situations, the building blocks that enable a neural network to be optimal are not the ones developers use in practice. These optimal building blocks, derived through the new analysis, are unconventional and haven’t been considered before, the researchers say.

In a paper published this week in the Proceedings of the National Academy of Sciences, they describe these optimal building blocks, called activation functions, and show how they can be used to design neural networks that achieve better performance on any dataset. The results hold even as the neural networks grow very large. This work could help developers select the correct activation function, enabling them to build neural networks that classify data more accurately in a wide range of application areas, explains senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) and co-director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

“While these are new activation functions that have never been used before, they are simple functions that someone could actually implement for a particular problem. This work really shows the importance of having theoretical proofs. If you go after a principled understanding of these models, that can actually lead you to new activation functions that you would otherwise never have thought of,” says Uhler, who is a core institute member of the Broad Institute, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and Institute for Data, Systems and Society (IDSS).

Joining Uhler on the paper are lead author Adityanarayanan Radhakrishnan, an EECS graduate student and an Eric and Wendy Schmidt Center Fellow, and Mikhail Belkin, a professor in the Halicioğlu Data Science Institute at the University of California at San Diego.

Activation investigation

A neural network is a type of machine-learning model that is loosely based on the human brain. Many layers of interconnected nodes, or neurons, process data. Researchers train a network to complete a task by showing it millions of examples from a dataset.

For instance, a network that has been trained to classify images into categories, say dogs and cats, is given an image that has been encoded as numbers. The network performs a series of complex multiplication operations, layer by layer, until the result is just one number. If that number is positive, the network classifies the image a dog, and if it is negative, a cat.

Activation functions help the network learn complex patterns in the input data. They do this by applying a transformation to the output of one layer before data are sent to the next layer. When researchers build a neural network, they select one activation function to use. They also choose the width of the network (how many neurons are in each layer) and the depth (how many layers are in the network.)

“It turns out that, if you take the standard activation functions that people use in practice, and keep increasing the depth of the network, it gives you really terrible performance. We show that if you design with different activation functions, as you get more data, your network will get better and better,” says Radhakrishnan.

He and his collaborators studied a situation in which a neural network is infinitely deep and wide — which means the network is built by continually adding more layers and more nodes — and is trained to perform classification tasks. In classification, the network learns to place data inputs into separate categories.

“A clean picture”

After conducting a detailed analysis, the researchers determined that there are only three ways this kind of network can learn to classify inputs. One method classifies an input based on the majority of inputs in the training data; if there are more dogs than cats, it will decide every new input is a dog. Another method classifies by choosing the label (dog or cat) of the training data point that most resembles the new input.

The third method classifies a new input based on a weighted average of all the training data points that are similar to it. Their analysis shows that this is the only method of the three that leads to optimal performance. They identified a set of activation functions that always use this optimal classification method.

“That was one of the most surprising things — no matter what you choose for an activation function, it is just going to be one of these three classifiers. We have formulas that will tell you explicitly which of these three it is going to be. It is a very clean picture,” he says.

They tested this theory on a several classification benchmarking tasks and found that it led to improved performance in many cases. Neural network builders could use their formulas to select an activation function that yields improved classification performance, Radhakrishnan says.

In the future, the researchers want to use what they’ve learned to analyze situations where they have a limited amount of data and for networks that are not infinitely wide or deep. They also want to apply this analysis to situations where data do not have labels.

“In deep learning, we want to build theoretically grounded models so we can reliably deploy them in some mission-critical setting. This is a promising approach at getting toward something like that — building architectures in a theoretically grounded way that translates into better results in practice,” he says.

This work was supported, in part, by the National Science Foundation, Office of Naval Research, the MIT-IBM Watson AI Lab, the Eric and Wendy Schmidt Center at the Broad Institute, and a Simons Investigator Award.

Adapted from a news story posted on MIT News.

Representation Learning

March 28, 2023

Machine learning experts from around the world compete to improve cancer immunotherapy Machine learning experts from around the world compete to improve cancer immunotherapy

2023

Marios Gavrielatos had never participated in a machine learning competition when he decided to enter the Eric and Wendy Schmidt Center’s Cancer Immunotherapy Data Science Grand Challenge.

Gavrielatos’ friend and colleague, Konstantinos Kyriakidis, asked him to team up in the competition after learning about it from a promotional video on YouTube.

Despite Gavrielatos’ newcomer status, the pair developed a new deep learning model that won them the first part of the competition last month.

The challenge “helped me develop new computational skills, deep-learning wise,” said Gavrielatos, a bioinformatics master’s student at the National and Kapodistrian University of Athens, adding that because they couldn’t find similar problems online, “we had to develop something new ourselves, which was interesting.”

The Cancer Immunotherapy Data Science Grand Challenge, which ran on Topcoder from January 9 to February 3, aimed to uncover new ways to modify, or “perturb,” T cells to make them more effective at killing cancer cells to ultimately improve cancer treatment.

Top challenge submissions will be tested out in a lab at the Broad Institute of MIT and Harvard later this year.

The Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard partnered with Harvard’s Laboratory for Innovation Science, the MIT Department of Electrical Engineering and Computer Science, Topcoder, Gordian Biotechnology, and Massachusetts General Hospital (MGH) to run the challenge. Over 900 people registered for the first part of the competition — making it Topcoder’s fifth-largest data science challenge to date.

“In biology, we can perform perturbations on a scale that other fields can only dream of, meaning we need to develop novel machine learning methods to best make use of such data and answer biological questions,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT. “We held this data science challenge to direct bright computational minds from around the world to this problem in cancer immunotherapy. And we’re thrilled that we now get to test out some of their proposed perturbations experimentally.”

A great fit for a data science challenge

While chemotherapy and radiation have saved many lives, these treatments have a weak spot: they are not specific enough — meaning they can kill cancerous and healthy cells. The promise of cancer immunotherapy, a newer and effective form of cancer treatment, is that it can harness our immune system to recognize and kill cancer cells while leaving other cells alone in most cases.

Cancer cells have developed a number of ways to evade our immune system. One such strategy is sending signals to T cells to make them exhausted and ineffective at killing cancer cells. That’s why cancer researchers like Nir Hacohen, an institute member at Broad, director of the Broad Institute’s Cell Circuits Program and director of the Center for Cancer Immunology at Mass General Hospital, are investigating whether perturbing certain genes could shift T cells to a cancer-fighting, “effector” state.

“We were excited to develop this data science challenge with the Eric and Wendy Schmidt Center because the T cell exhaustion problem seemed like a great fit for this kind of competition,” said Hacohen. “It was an opportunity to combine our cancer biology and immunology knowledge with the computational and mathematical skills of machine learning experts from all over the world.”

Marc Schwartz, a postdoctoral fellow in the Hacohen Lab, ran experiments testing the effects of 73 gene knockouts in T cells on mice with cancer. Given that it took months to test a fraction of the 20,000 potential gene knockouts — a genetic perturbation that stops a gene from functioning — Broad researchers wanted a way to zero in on the most promising perturbations. Enter machine learning.

The overarching challenge was divided into three parts that ran as individual data science competitions on Topcoder. In Challenge 1, participants received gene expression data from 66 of the 73 T-cell-gene knockouts from Schwartz’s experiments as training data. They then had to develop an algorithm that could predict how knocking out the seven “held-out” genes would affect T cells.

Challenge 2 participants used their algorithms from the first challenge to propose new gene knockouts (picking from any of the 20K genes in the entire genome) to shift as many T cells as possible into a cancer-fighting state. In Challenge 3, participants proposed a metric for ranking how well a particular gene knockout would bring about this desired shift in T cells.

To solve Challenge 1, winners Gavrielatos and Kyriakidis first pared down the single-cell dataset so that it contained only expression information from important genes — that is, genes whose expression changed across different T cell states. The preprocessing of the data is a crucial step to distill the “signal” — or useful information — when working with such noisy data, said Kyriakidis, who has previously won several precision FDA data science challenges.

The pair next trained a deep learning model to predict what portion of T cells would move into an effector, exhausted, or alternate state after a specific gene was knocked out. Initially, they tried to come up with an algorithm using only the training data provided from Schwartz’s experiment. But as they continued working, they realized that incorporating public biomedical databases into their analysis — namely, Reactome, a database of biological pathways in human cells, and STRING, a protein interaction database — could reveal associations between the missing and observed genes.

“The whole process was so rewarding,” said Kyriakidis. “You have to divide the whole problem into smaller parts to try to find the solution to each part and connect the dots.”

Sometimes, simple algorithms are best

The second place winners were three MIT students — including two graduate students from the Laboratory for Information and Decision Systems (LIDS), Yuzhou Gu and Anzo Teh, MIT Institute for Data, Systems, and Society (IDSS) postdoc Yanjun Han, and undergraduate student Brandon Wang. Teh, who is also an Eric and Wendy Schmidt Center PhD fellow, said his advisor, MIT professor Yury Polyanskiy, suggested that he and the other researchers join forces for the challenge.

Anzo Teh, Eric and Wendy Schmidt Center PhD Fellow

Teh, Gu, and Han, have a theoretical and computational background — specifically, information theory — while the undergraduate student, Brandon Wang, has expertise in computational biology.

“I did feel like this challenge was a good way for me to learn how to work on these types of problems because I’m pretty new to the biology field,” said Teh.

Several teams used neural networks to describe the experimental gene expression data, an approach that often requires thousands of parameters to create an effective model. The MIT team, on the other hand, made a simplifying assumption that gene expression could be modeled with a small number of parameters following a Gaussian distribution, or a bell curve.

They then reduced the dimensions of their data from 20,000 to 50 columns using a machine learning technique called “principal component analysis.” The MIT team also incorporated an outside public database on human genes into their model, mapping human gene expression profiles to their missing mouse counterparts. Finally, they used a proven machine learning classification algorithm to determine how the gene expression profiles lined up with T cell states.

“Sometimes simple algorithms can work better than neural networks,” said Teh. The MIT team’s background in information theory, which is the study of organizing and quantifying data, helped them discover what signals in the experimental data to focus their models on.

Peter Novotný, the third place winner and a math professor at the University of Žilina in Slovakia, also took a relatively simple approach to solving Challenge 1. Novotný, a former Topcoder “copilot” who had participated in a NASA asteroid-hunter challenge, among many other competitions, has more of a mathematics than a computer science background. In part through participating in data science challenges, he’s discovered that he enjoys machine learning though.

“And, I also quite like competing,” he said.

For the cancer immunotherapy challenge, Novotný first selected 14 features from the T cell data that quantified how gene expression levels differed between perturbed and unperturbed cells, as the way to represent his training data. Then, he built a model using a common machine learning algorithm — the “random forest” — and predicted the distribution of T cell states for each of the seven withheld genes.

To make the challenge accessible to participants without a biology background, Lightmark Creative and Orr Ashenberg, associate director of computational biology at The Klarman Cell Observatory of the Broad Institute, produced a 1.5-hour crash course on cancer biology, perturbation data, and single-cell sequencing technologies.

“To compete in this contest, you really need to understand what the data is, and without those lectures, it would be quite difficult to understand the problem,” said Novotný.

In addition, Uhler held an IAP course that ran at the same time as the challenge, encouraging MIT students to team up and participate in the competition.

Testing perturbations in the lab

The Eric and Wendy Schmidt Center also announced last month who won the third challenge, in which participants came up with a metric to rank new T cell perturbations.

The winners of that challenge were:

First place: Dariusz Brzeziński and Wojciech Kotlowski from Poznań University of Technology in Poland
Second place: Salil Bhate, MIT, postdoctoral fellow at the Eric and Wendy Schmidt Center
Third place: Irene Bonafonte Pardàs, Artur Szalata, and Benjamin Schubert from Helmholtz Center Munich and Miriam Lyzotte from Mila - Quebec AI Institute

Now, researchers at the Hacohen Lab will run experiments to test how the perturbations proposed in Challenge 2 affect mouse T cells’ cancer-fighting abilities.

“It will be really exciting to see how these computationally identified perturbations actually perform in the lab,” said Uhler. “After all, machine learning cannot replace experiments, but the goal is to work hand in hand with biologists and help prioritize the next experiments to run.”

‍

Cells

Active Learning

January 20, 2023

Researchers develop an AI model that can detect future lung cancer risk Researchers develop an AI model that can detect future lung cancer risk

2023

The name Sybil has its origins in the oracles of Ancient Greece, also known as sibyls: feminine figures who were relied upon to relay divine knowledge of the unseen and the omnipotent past, present, and future. Now, the name has been excavated from antiquity and bestowed on an artificial intelligence tool for lung cancer risk assessment being developed by researchers at MIT's Abdul Latif Jameel Clinic for Machine Learning in Health, Mass General Cancer Center (MGCC), and Chang Gung Memorial Hospital (CGMH).

Lung cancer is the No. 1 deadliest cancer in the world, resulting in 1.7 million deaths worldwide in 2020, killing more people than the next three deadliest cancers combined.

"It’s the biggest cancer killer because it’s relatively common and relatively hard to treat, especially once it has reached an advanced stage,” says Florian Fintelmann, MGCC thoracic interventional radiologist and co-author on the new work. “In this case, it’s important to know that if you detect lung cancer early, the long-term outcome is significantly better. Your five-year survival rate is closer to 70 percent, whereas if you detect it when it’s advanced, the five-year survival rate is just short of 10 percent.”

Although there has been a surge in new therapies introduced to combat lung cancer in recent years, the majority of patients with lung cancer still succumb to the disease. Low-dose computed tomography (LDCT) scans of the lung are currently the most common way patients are screened for lung cancer with the hope of finding it in the earliest stages, when it can still be surgically removed. Sybil takes the screening a step further, analyzing the LDCT image data without the assistance of a radiologist to predict the risk of a patient developing a future lung cancer within six years.

In their new paper published in the Journal of Clinical Oncology, Jameel Clinic, MGCC, and CGMH researchers demonstrated that Sybil obtained C-indices of 0.75, 0.81, and 0.80 over the course of six years from diverse sets of lung LDCT scans taken from the National Lung Cancer Screening Trial (NLST), Mass General Hospital (MGH), and CGMH, respectively — models achieving a C-index score over 0.7 are considered good and over 0.8 is considered strong. The ROC-AUCs for one-year prediction using Sybil scored even higher, ranging from 0.86 to 0.94, with 1.00 being the highest score possible.

Despite its success, the 3D nature of lung CT scans made Sybil a challenge to build. Co-author Peter Mikhael, an MIT PhD student in electrical engineering and computer science, a fellow at the Eric and Wendy Schmidt Center, and an affiliate at the Jameel Clinic and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), likened the process to “trying to find a needle in a haystack.” The imaging data used to train Sybil was largely absent of any signs of cancer because early-stage lung cancer occupies small portions of the lung — just a fraction of the hundreds of thousands of pixels making up each CT scan. Denser portions of lung tissue are known as lung nodules, and while they have the potential to be cancerous, most are not, and can occur from healed infections or airborne irritants.

To ensure that Sybil would be able to accurately assess cancer risk, Fintelmann and his team labeled hundreds of CT scans with visible cancerous tumors that would be used to train Sybil before testing the model on CT scans without discernible signs of cancer.

MIT electrical engineering and computer science PhD student Jeremy Wohlwend, co-author of the paper and Jameel Clinic and CSAIL affiliate, was surprised by how highly Sybil scored despite the lack of any visible cancer. “We found that while we [as humans] couldn’t quite see where the cancer was, the model could still have some predictive power as to which lung would eventually develop cancer,” he recalls. “Knowing [Sybil] was able to highlight which side was the most likely side was really interesting to us.”

Co-author Lecia V. Sequist, a medical oncologist, lung cancer expert, and director of the Center for Innovation in Early Cancer Detection at MGH, says the results the team achieved with Sybil are important “because lung cancer screening is not being deployed to its fullest potential in the U.S. or globally, and Sybil may be able to help us bridge this gap.”

Lung cancer screening programs are underdeveloped in regions of the United States hardest hit by lung cancer due to a variety of factors. These range from stigma against smokers to political and policy landscape factors like Medicaid expansion, which varies from state to state.

Moreover, many patients diagnosed with lung cancer today have either never smoked or are former smokers who quit over 15 ago — traits that make both groups ineligible for lung cancer CT screening in the United States.

“Our training data consisted only of smokers because this was a necessary criterion for enrolling in the NLST,” Mikhael says. “In Taiwan, they screen nonsmokers, so our validation data is expected to contain people who didn’t smoke, and it was exciting to see Sybil generalize well to that population.”

“An exciting next step in the research will be testing Sybil prospectively on people at risk for lung cancer who have not smoked or who quit decades ago,” says Sequist. “I treat such patients every day in my lung cancer clinic and it’s understandably hard for them to reconcile that they would not have been candidates to undergo screening. Perhaps that will change in the future.”

There is a growing population of patients with lung cancer who are categorized as nonsmokers. Women nonsmokers are more likely to be diagnosed with lung cancer than men who are nonsmokers. Globally, over 50 percent of women diagnosed with lung cancer are nonsmokers, compared to 15 to 20 percent of men.

MIT Professor Regina Barzilay, a paper co-author and the Jameel Clinic AI faculty lead, who is also a member of the Koch Institute for Integrative Cancer Research, credits MIT and MGH’s joint efforts on Sybil to Sylvia, the sister to a close friend of Barzilay and one of Sequist’s patients. "Sylvia was young, healthy and athletic — she never smoked,” Barzilay recalls. “When she started coughing, neither her doctors nor her family initially suspected that the cause could be lung cancer. When Sylvia was finally diagnosed and met Dr. Sequist, the disease was too advanced to revert its course. When mourning Sylvia's death, we couldn't stop thinking how many other patients have similar trajectories.”

This work was supported by the Bridge Project, a partnership between the Koch Institute at MIT and the Dana-Farber/Harvard Cancer Center; the MIT Jameel Clinic; Quanta Computer; Stand Up To Cancer; the MGH Center for Innovation in Early Cancer Detection; the Bralower and Landry Families; Upstage Lung Cancer; and the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard. The Cancer Center of Linkou CGMH under Chang Gung Medical Foundation provided assistance with data collection and R. Yang, J. Song and their team (Quanta Computer Inc.) provided technical and computing support for analyzing the CGMH dataset. The authors thank the National Cancer Institute for access to NCI’s data collected by the National Lung Screening Trial, as well as patients who participated in the trial.

Adapted from a news story posted on MIT News.

Organisms

December 7, 2022

New method identifies spatial biomarkers of Alzheimer’s disease progression in animal model New method identifies spatial biomarkers of Alzheimer’s disease progression in animal model

2022

Many diseases affect how cells are spatially organized in tissues, such as in Alzheimer’s disease, where amyloid-β proteins clump together to form plaques in the brain. Studying how cells differ in various regions of tissue could help scientists better understand the key changes that lead to Alzheimer’s and other diseases. But integrating data on gene expression and cell structure and spatial location into the same analysis has proven challenging.

Now, researchers from the Broad Institute of MIT and Harvard and ETH Zürich in Switzerland have developed a computational framework for simultaneously analyzing gene expression, the structure of cell nuclei, and their position in space. STACI (Spatial Transcriptomics combined using Autoencoders with Chromatin Imaging) is the first method that combines these three kinds of data. The findings appeared recently in Nature Communications.

The team, led by Caroline Uhler, the study’s senior author and co-director of the Eric and Wendy Schmidt Center at the Broad, and Xinyi Zhang, first author on the study and a graduate student in Uhler’s lab, developed STACI and applied it to study a mouse model of Alzheimer’s disease.

STACI uses a kind of computational model called a neural network to analyze data generated by a technique called STARmap, which measures the expression of more than two thousand genes and maps their location in intact tissue. STARmap was developed by Xiao Wang, a core institute member at the Broad and co-author on the study.

The team used STACI to analyze brain tissue from the Alzheimer’s mouse model. By studying gene expression and the location of cells in the tissue, the scientists identified a part of the cortex in the mouse brain that was more likely to have significant plaque accumulations. With the help of G. V. Shivashankar, a study author and professor of mechano-genomics at ETH Zürich, the team also found that they could predict plaque size — a marker of disease progression — by analyzing just one feature of cells near the plaques: the structure of chromatin, the complex of DNA and protein that makes up chromosomes. The results suggest that chromatin structure could be a marker of Alzheimer’s disease progression.

“We began by asking how we can integrate these different data modalities,” said Uhler, who is also a core institute member at Broad and professor in the Department of Electrical Engineering and Computer Science at MIT. “What’s really exciting is that now, with STACI, we can begin to ask biological questions to learn more about disease by taking all modalities into account simultaneously.”

Zhang, who is also a fellow at the Schmidt Center, says that STACI is a useful tool for researchers because chromatin imaging is routine in labs and cheaper than measuring the gene expression of cells directly. “This study may provide simple, low-cost avenues for studying which regions of the brain are more affected by disease and for tracking disease progression,” she said.

Cells in space

In previous work, Uhler and Shivashankar showed that they could use computational techniques to analyze single-cell RNA sequencing data along with chromatin images. They collaborated with Wang to incorporate the analysis of cell location data from STARmap and build STACI.

STACI relies on a neural network, which learns patterns from “training” data to predict characteristics of new data. To develop STACI, the researchers trained it to build a map, called a latent space, that groups together cells with similar locations, gene expression, or chromatin structure. They then used STACI to analyze images of chromatin from mouse brain tissue.

From this latent space, the scientists found that the size of plaque deposits is highly correlated with the ratio of heterochromatin to euchromatin, which indicates how densely packed the chromatin is. This relationship suggests DNA packing could be a marker of disease progression.

The team says the connection between chromatin density and plaques suggests new questions in Alzheimer’s research. They hope their findings will spur other groups to investigate the biological relationship between DNA packing and plaque build-up.

Branching out

Brain tissue samples can vary widely in how they are collected and prepared, but the scientists designed STACI to account for this variation. The technique could also be applicable to other spatial data types, such as from Slide-Seq — developed by Fei Chen, Evan Macosko and other colleagues at the Broad — as well as Visium and MERFISH.

Uhler adds that STACI could also help researchers learn more about other diseases, since many have important spatial features. She envisions using the framework to analyze the local microenvironment in cancer, fibrosis or scarring in the lungs or other tissues, as well as developmental processes. As scientists apply STACI to new problems, they’ll likely encounter new analytical challenges, but she thinks this is an opportunity to help the model expand.

“This work shows how biology can be a great inspiration for novel computational questions and developments,” Uhler said. “And that’s really exciting.”

‍

This work was supported in part by the Eric and Wendy Schmidt Center, the Simons Foundation, the Office of Naval Research, the National Institutes of Health, and the National Science Foundation.

Adapted from a news story posted on the Broad Institute website.

‍

Cells

Representation Learning

November 21, 2022

Eric and Wendy Schmidt Center announces data science challenge to harness machine learning for cancer immunotherapy Eric and Wendy Schmidt Center announces data science challenge to harness machine learning for cancer immunotherapy

2022

The immune system is adept at fighting off viral and bacterial infections, but it can also find and attack cancer in the body. Cancer cells, however, are skilled at disarming the immune system’s T cells — allowing tumors to continue growing unabated.

Scientists at the Broad Institute of MIT and Harvard and beyond have been looking for ways to genetically modify T cells to improve their cancer-fighting ability. Now the Eric and Wendy Schmidt Center at the Broad Institute is joining this effort, by holding a data science challenge this winter that will call on machine learning enthusiasts to develop algorithms that identify effective genetic modifications in T cells.

Winners will receive monetary prizes at each stage — and, unlike in most data science challenges, the top-scoring participants will have their submissions experimentally validated. Members of a cancer immunology lab at Broad led by institute member Nir Hacohen will make the top-ranked genetic modifications in T cells in the lab and assess the cells’ cancer-fighting abilities.

The "Cancer Immunotherapy Data Science Grand Challenge" was announced earlier this month at the online coding tournament Topcoder Open, and will run from January 9 to February 3, 2023. The Eric and Wendy Schmidt Center is partnering with Harvard’s Laboratory for Innovation Science, the MIT Department of Electrical Engineering and Computer Science, Topcoder, and Massachusetts General Hospital (MGH) to run the challenge.

“Machine learning experts have largely gone into the fields of big technology and finance. With this challenge, we’re describing an important problem in cancer immunology in a way that is approachable for computational minds — thus hoping to entice more of these experts to the life sciences,” said Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT.

Improving cancer immunotherapy through machine learning

Cancer immunotherapies boost the immune system to fight off cancer in a variety of ways. Scientists have made many breakthroughs in cancer immunotherapy in the last decade, such as the development of several FDA-approved checkpoint blockade and “CAR T” therapies. CAR T treatments involve removing T cells from a cancer patient, genetically engineering them in the lab to target tumors, and then reintroducing them back into the patient. However, these treatments work for only a small number of cancer types and only in some patients.

“We hope that this challenge will allow us to quickly hone in on the most promising perturbations so we can better target our experimental validation.” — Nir Hacohen

To make T cell-based immunotherapies more effective for more patients, scientists are looking for other genetic changes they can introduce in T cells to make them better cancer killers. With the development of genome-editing technologies such as CRISPR in the last decade, researchers can look for those desirable changes by performing large-scale genetic screens to systematically modify or knock out each gene and study the effect of these “perturbations” at the single-cell level.

However, perturbing each of the 20,000 genes in the cell or the several hundred million different combinations of genes in the lab would be too costly and time-consuming. Machine learning can help, by predicting which genetic perturbations might be most effective.

“We hope that this challenge will allow us to quickly hone in on the most promising perturbations so we can better target our experimental validation,” said Hacohen, director of the Broad Institute’s Cell Circuits Program, institute member of the Broad Institute, and director of MGH’s Center for Cancer Immunology. “The predictions from this challenge will provide a crucial step toward making cancer immunotherapy more effective for more patients.”

The Cancer Immunotherapy Data Science Challenge will consist of three parts that will run at the same time. In the first part, participants will use transcriptomic and perturbational data from T cells in mouse tumors to develop algorithms that predict the effect of perturbations that have already been studied in the lab, allowing them to see how well their algorithms work. In part two, they’ll come up with a metric for ranking how well a particular gene knockout would shift T cells to a desired state.

And, third, participants will use their algorithms to propose perturbations that boost T cells’ ability to destroy tumors. The top-scoring participants from part one will have their proposed perturbations experimentally validated.

“Data science challenges like this one draw on the power of the crowd to bring in outside computational and creative machine learning techniques to solve biological problems,” said MarcAntonio Awada, head of research and data science at Harvard’s Institute for Digital, Data, and Design Institute. “In the past, crowdsourcing has led to out-of-the-box approaches and completely novel solutions compared to what experts had come up with.”

Unique learning and data access opportunities

The challenge will run concurrently with an Independent Activities Period course at MIT, which brings together computer science and biology students to collaborate on this problem. “The course provides a great opportunity for MIT students to apply their education and see that what they’re learning in the classroom has a direct impact on answering critical biomedical questions,” said Uhler, who is one of the course’s instructors.

A biology background isn’t necessary to participate. The Eric and Wendy Schmidt Center will provide all challenge participants with an online crash course on cancer immunology and unique features of the large-scale datasets. Interested participants can pre-register now as an individual or as part of a team on Topcoder, which is hosting the challenge on their platform.

Participants will have free access to Saturn Cloud to complete the challenge.

Adapted from a news story posted on the Broad Institute website.

‍

No items found.

May 13, 2022

Workshop sparks new tissue biology and AI research areas and collaborations Workshop sparks new tissue biology and AI research areas and collaborations

2022

Advancing our understanding of tissue biology requires tight collaborations between biologists with driving questions, technologists creating new experimental methods, and computational scientists who are creating new ways of analyzing data. One of the key aims of an April 27 workshop held by the Eric and Wendy Schmidt Center and the Klarman Cell Observatory at the Broad Institute was to explore the interface between these disciplines. Speakers and panelists included researchers at Stanford University, MIT, Harvard University, the Sloan Kettering Institute, UC Berkeley, Princeton University, and the Broad Institute.

The workshop brought together a diverse set of communities to discuss new tissue biology research questions — and new opportunities for collaboration between the biomedical sciences and machine learning.

Caroline Uhler, co-director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and an associate professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT, told workshop attendees during opening remarks that biology has seen an “explosion” of data in recent years. “We now have the opportunity to understand the programs of life, so not just the units (like genes or single cells), but actually the interactions between these units.”

Biological research frontiers

These cellular interactions play a key role in the cancer immunotherapy research shared by keynote speaker Garry Nolan, a professor in the pathology department at Stanford University. His research team develops algorithms to model tissue areas where different groups of cells interact, areas he calls “interface zones,” to gain insights into how cancer remodels its surrounding tissue and evades the body’s immune system. These interface zones are critical as the locus of cellular changes that lead to tumor growth.

“I would urge you, when you're looking at your RNA data sets, to the extent that you can call out these kinds of interface zones, pay special attention to the RNA changes that are occurring there,” said Nolan, adding later: “The boundary space is where the action is.”

Additionally, biologists should reconsider labeling tumors and other features “heterogenous,” which implies that tumors from different patients are too distinct from one another to be compared. “There is an order here that can be extracted,” said Nolan.

“If you ask me, I would say that we'll never be able to fully understand and model cell function if we don't include spatial proteomics data.” — Emma Lundberg

Meanwhile keynote speaker Emma Lundberg, an associate professor of bioengineering at Stanford and co-director of the Human Protein Atlas, outlined how her team has mapped where proteins are located in cells — a process known as "spatial proteomics." Interestingly, over half of proteins can be found in more than one part of the cell, which changes how they function.

“If you ask me, I would say that we'll never be able to fully understand and model cell function if we don't include spatial proteomics data,” said Lundberg.

Panelists also discussed next steps for engineered tissues and artificial organs in disease study and regenerative medicine. Sangeeta Bhatia, a professor of health sciences and of electrical engineering and computer science at MIT’s Koch Institute, said that researchers have been able to engineer artificial tissues and organs that have little structure, like the skin and cartilage, for decades. Now, they're moving onto endocrine tissue, like the pancreas and liver. “Then you start to think about the tissues whose function is dependent on architecture, like the kidney, the lung — that's the next frontier, and I think we are not quite there yet,” she said.

One challenge brought up by Paola Arlotta, a professor of stem cell and regenerative biology at Harvard University, is how to factor genetics into tissue and organ models. One way to do this is to see how cells from different individuals respond to the same kinds of disturbances. If researchers don’t take genetic variability into account, “we’re ignoring a fundamental component of what human disease is,” she said.

Computational and technological challenges

Keynote speaker Dana Pe’er, chair of the Computational and Systems Biology Program at the Sloan Kettering Institute, outlined computational limitations that need to be addressed to answer pressing biological questions. For example, as researchers move from profiling a small section of a tissue to mapping a whole tissue or organ in different samples, they need to be able to map different tissue sections to each other.

“We’re still largely trying to figure out how to process this data, which is hampering our ability to interpret and powerfully utilize the data,” Pe’er said.

Given that there’s not yet a spatial profiling technology that can provide both high resolution and high content information on features like proteins, researchers will often need to combine a spatial profiling method with single cell data.

Barbara Engelhardt, an associate professor of computer science at Princeton University, said taking multiple images from the same type of tissue and aligning them can help researchers better understand cell type variability.

At the end of the second panel, Anthony Philippakis, co-director of the Eric and Wendy Schmidt Center and chief data officer of the Broad Institute, asked panelists whether they had any “recipes for success” to foster collaborations between the two fields.

Bhatia emphasized the importance of having researchers, or research teams, who are “bilingual” — that is, able to understand both experimental and computational biology. "It doesn't work well if you're just the recipient of data and you don't understand the context." Bhatia said. "We have to create these teams where we can really speak both languages."

Starting the conversations needed to build this bilingual proficiency was precisely the goal of the workshop.

‍

Events

Tissues

April 13, 2022

Fellows develop AI methods to design antibodies and virtually screen drugs Fellows develop AI methods to design antibodies and virtually screen drugs

2022

Wengong Jin planned to research language processing for his computer science PhD. But when Jin learned about research on machine-learning for drug discovery at the MIT Computer Science and Artificial Intelligence Laboratory, he told his advisor, Regina Barzilay, that he’d had a change of heart.

“She thought I was jet lagged, because I’d just come over from China and I was proposing a really big switch,” he said.

Jin, now a fellow at the Eric and Wendy Schmidt Center, stayed the course. Six years later, he and a team of researchers have come up with a new kind of model to automatically design antibodies — holding huge potential for immunotherapy.

Meanwhile, another Eric and Wendy Schmidt Center Fellow, PhD candidate Adit Radhakrishnan, recently developed a simple yet powerful method for virtually screening new drug candidates. That framework appears in a study published this April in Proceedings of the National Academy of Sciences.

“A number of research institutes have started using machine learning to answer key questions in biology. But at the Eric and Wendy Schmidt Center, as Jin’s and Radhakrishnan’s research shows, our goal is to also go in the other direction, by using biomedical problems to drive advances in machine-learning,” said Caroline Uhler, co-Director of the Eric and Wendy Schmidt Center, a core member of the Broad Institute, and professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems and Society at MIT.

Game-changer for antibody design

Discovering drugs has traditionally been a labor-intensive process, with researchers toiling away for years to test millions of molecules only to come up with a handful of candidates. Now, researchers like Jin and Radhakrishnan are working to automate that process.

“The idea is that we don't need experts to get a cup of coffee and then work all night trying to figure out a new molecule, but rather, to let the machine do the heavy lifting,” Jin said.

During his PhD, Jin was part of a research team that developed a machine-learning algorithm to speed up antibiotic discovery. The researchers found a new antibiotic that was effective against bacteria that are resistant to multiple drugs. In this instance, the team provided the model with roughly a million possible compounds to sort through.

That left Jin and other researchers wondering: Could they use artificial intelligence to design molecules from scratch?

The answer was yes. Jin and other researchers developed a generative model that designed antibodies — Y-shaped proteins that bind to viruses, bacteria, and other pathogens, activating our bodies’ immune response — that could neutralize the SARS-CoV-2 virus. Their findings were published earlier this year in a paper at the International Conference on Learning Representations.

"The new model can propose in a couple of seconds an antibody that has a high likelihood of working — totally changing the game,” said Jin.

While researchers had worked on generative models for antibody discovery before, those models could only come up with a protein’s amino acid sequence — not its shape. In contrast, the new model, which represents the antibody as a graph, simultaneously designs both the sequence and structure of its binding region. “Whether or not the antibody is the right shape to bind to a virus or other pathogen is crucial to its success,” said Jin.

‍

“The new model can propose in a couple of seconds an antibody that has a high likelihood of working — totally changing the game." — Wengong Jin

‍

"While human experts have methods to generate neutralizing antibodies, it takes time and effort. The task becomes even more challenging when additional properties need to be enforced. As our understanding of disease biology and immune system deepens, the number of such desired characteristics will continue to grow. Computational methods for antibody design are particularly useful to address this challenge,” said Regina Barzilay, the AI faculty lead for the MIT Jameel Clinic for Machine Learning in Health.

And, because so many types of data are structured as networks, the model also represents an advance in the field of machine learning. “It’s an example of how biology proposed a new problem for machine learning to solve,” said Jin.

An old machine-learning method repurposed for virtual drug screening

Adit Radhakrishnan's father had pursued a mathematics education in India prior to immigrating to the U.S. He instilled in his son a love of math, which led the younger Radhakrishnan to pursue a PhD of his own in electrical engineering and computer science at MIT.

Radhakrishnan researches the fundamentals of deep learning — a kind of artificial intelligence modeled after the human brain that processes unstructured data. Understanding why deep learning is successful, and using that knowledge to build novel models for the healthcare and genomic space, underpins much of Radhakrishnan’s research as an Eric and Wendy Schmidt Center fellow.

Over the past few years, deep learning has become widely adopted in biological applications, with researchers increasingly turning to it to screen potential new drugs. In order to perform well on such tasks, researchers use very large deep learning models that often require significant computing power. Moreover, the complexity of this approach makes it hard for scientists to understand why these models make a given prediction, shedding little light on why a proposed drug could work.

To get around the complexities of deep learning, Radhakrishnan and other researchers, including Uhler and Mikhail Belkin, a professor at the Halıcıoğlu Data Science Institute at the University of California, San Diego, turned to an older class of machine learning models: kernel methods. Prior to the recent wave of deep learning, kernel methods were a prominent and computationally simple approach for machine learning tasks. These models have recently become popular again since they can serve as a proxy for using very large deep learning models with much less computational burden.

The team came up with a simple yet highly adaptable kernel framework that was able to predict the effect that a drug has on gene expression, a measure of how cells change in response to a drug. “In contrast to the expertise needed to train large deep learning models to solve a particular problem, it takes about three lines of code to train the kernel method to do the same task,” said Radhakrishnan.

The framework has uses beyond biology; the researchers demonstrated, for example, that it could be used by video streaming providers to predict how a viewer would rank a particular movie they hadn’t yet seen. And the framework allows researchers to gain insights into how more complex deep learning models function.

According to Radhakrishnan, who is not trained as a biologist, the best part of being a fellow at the Eric and Wendy Schmidt Center is that the center puts machine learning experts and biologists in constant conversation with each other.

“You don’t just have computational researchers running their methods on a biology dataset without a biologist in the mix. You can get continuous feedback on: Is this actually useful?” said Radhakrishnan. “So it gives you a much more guided focus on what biological problems are important and what computational methods are missing.”

Proteins