*Denotes co-first authorship. †Denotes co-last authors.
We introduce scGeneScope as the first large-scale, high-quality, treatment-matched multiprofile dataset for single-cell biology, with over 627k scRNA-seq profiles and 716k Cell Painting images from identical chemical treatments across 28 diverse mechanisms of action. In our paper, we present this dataset and challenge the hype around foundation models by providing a realistic testbed for rigorous benchmarking of ML models (from linear to foundation models) for drug discovery.
We co-designed a generative AI assistant with genetics professionals to support genome sequencing analysis for rare disease diagnosis. By identifying key challenges in sensemaking and reanalysis, we developed and prototyped AI features that help synthesize variant evidence and flag cases for reanalysis, ultimately aiming to increase diagnostic yield and reduce time to diagnosis.
We developed a large language model (LLM)-powered framework, EvAgg, to aggregate and synthesize rare disease literature and related content, enabling clinical genomic analysts to review patient cases more rapidly and thoroughly in research settings. EvAgg reduced case review time by 34% (p < 0.002) and significantly increased the throughput of papers, variants, and cases analyzed.
We conducted a qualitative study to identify common challenges and data tasks across the biomedical discovery lifecycle by interviewing professionals from diverse roles in the field. Based on these insights, we proposed seven actionable recommendations to improve data quality, interoperability, and collaboration for precision medicine research.
Here we show that the transcription factor CLAMP doesn't just bind DNA; it also directly binds RNA and spliceosomal proteins through its prion-like domain, linking transcription to sex-specific alternative splicing. By regulating the dynamics of hnRNP splicing condensates, CLAMP ensures precise, sex-dependent splicing outcomes, revealing a new mechanism in which transcription factors act as master organizers of splicing decisions.
This is a user-friendly platform for visualizing and perturbing gene regulatory networks using multi-omics data. It enables researchers to test biological hypotheses in silico and identify molecular candidates for follow-up experiments, without requiring coding expertise.
We discuss the spectrum of machine learning model transparency, from black box to explainable to interpretable, highlighting methods tailored for genomic studies. We focus on how incorporating biological knowledge into model design can improve both predictive performance and scientific insight for precision medicine.
We developed three interactive computational tools to uncover gene regulatory networks from temporal multi-omics data, focusing on transcription factor dynamics and sex-specific regulation. These platforms empower researchers to generate hypotheses, validate findings, and accelerate discovery, bringing us closer to personalized therapeutics.
The scGeneScope code enables benchmarking of treatment response models on our perturbation-paired single-cell RNA-seq and Cell Painting image dataset.
The Evidence Aggregator is a large language model (LLM)-powered framework that aggregates and synthesizes rare disease literature and related content.
time2splice is a method to find temporal and sex-specific alternative splicing from multi-omics data.
TIMEOR is a web server and Dockerized command-line tool to identify gene regulatory networks and assign mechanisms from temporal and multi-omics data.
A fast protein analysis algorithm built on the Dynamic Distributed Dimensional Data Model (D4M, by Dr. Jeremy Kepner). It merges triplestore/NoSQL databases (Accumulo) with associative and distributed array representations of proteomic sequences, using sparse linear algebra for fast genomic big data analysis. Our approach efficiently extracts statistical patterns to relate protein sequences, with the end goal of rapidly identifying novel pathogens.
Property of MIT Lincoln Laboratory
Web-based inventory management system used in many academic departments, mainly chemistry. Users log in and use a phone to scan barcodes for automatic item entry. The application uses the Parse Platform as a relational database to house inventory for DePauw University. This system has been updated by the maintainer Dr. Dave Roberts.
Property of DePauw
Set of Arduino workshop modules and Fritzing diagrams to teach students how to program as part of the Google Computer Science Summer Institute (CSSI).
Property of Google
Online internal system to monitor product batch data. Batch data (such as potency and solubility fluctuations) is extracted from Eli Lilly's Data Mart and Data Warehouse databases and then visualized for the researcher. The system continues to run automatically every day, enabling employees to easily inspect and verify internal processes, saving significant money and time.
Property of Eli Lilly and Elanco