At Precision Med TRI-CON, I joined a panel on “Omics, Data, and AI in Precision Medicine,” where we tackled a big, messy question: What’s keeping AI from delivering on its full potential in precision medicine?
Spoiler alert: it’s not the algorithms. It’s the data. Or more specifically, the glaring lack of clean, curated, context-rich datasets.
AI’s superpower is finding patterns in data, but only when the data’s good. You’ve heard it before: “garbage in, garbage out.” That’s why large language models like ChatGPT sound like they’ve got MFAs in English: they’ve been trained on mountains of carefully curated, high-quality text. The same goes for AlphaFold, which cracked protein structure prediction thanks to roughly five decades of meticulously standardized data from the Protein Data Bank (PDB), which was established in 1971.
Now let’s talk about the wild child of bioinformatics: gene expression data. Also known as chaos in spreadsheet form. Housed in repositories like Gene Expression Omnibus (GEO) and ArrayExpress, these datasets should be a goldmine. Instead, they’re a tangled mess. That’s because gene expression is hyper-sensitive to context. Tissue type. Time of day. Stress level. Treatment. Species. Age. Whether Mercury was in retrograde. (Okay, maybe not that last one, but you get the point.) And yet, much of the data lacks even the most basic metadata. These datasets weren’t built with machine learning in mind. And it shows.
This is a problem because gene expression is central to precision medicine. It holds the key to understanding how diseases develop, how individuals respond to treatment, and how we tailor therapies to patients. But without the “who, what, when, where, and why” behind each data point, even the most sophisticated AI model can’t extract the insights we need. No context, no predictive power.
Right now, we don’t have nearly enough of the right kind of data. Most of what’s out there is incomplete, inconsistent, and barely annotated. It’s like trying to fly a plane with no weather report, no altitude reading, and no idea what kind of aircraft you’re even in. Good luck getting off the ground, let alone landing safely.
And let’s be real: this isn’t a problem one heroic scientist can solve in their spare time. We need a coordinated, community-wide effort (think consortia or public-private collaborations) to intentionally build the datasets that AI actually needs. That means:
- Standardized formats across labs
- Consistent, complete metadata
- Representation of real-world populations
- Centralized, accessible repositories that don’t require a PhD in archaeology to navigate
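To make "consistent, complete metadata" a little more concrete, here is a minimal sketch of a completeness check a repository or lab could run before accepting a dataset. Everything in it is an assumption for illustration: the required field names (`tissue`, `species`, `age`, `treatment`, `collection_time`) are hypothetical and simply echo the context factors mentioned above, and the sample records are made up. It is not the schema of GEO, ArrayExpress, or any real repository.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical required fields, loosely based on the context factors named
# earlier in this post (tissue, time, treatment, species, age).
REQUIRED_FIELDS = ("tissue", "species", "age", "treatment", "collection_time")


def missing_fields(sample: Dict[str, Any],
                   required: Sequence[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required fields that are absent or blank for one sample."""
    missing = []
    for field in required:
        value = sample.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def completeness(samples: Sequence[Dict[str, Any]],
                 required: Sequence[str] = REQUIRED_FIELDS) -> float:
    """Return the fraction of samples with every required field filled in."""
    if not samples:
        return 0.0
    complete = sum(1 for s in samples if not missing_fields(s, required))
    return complete / len(samples)


# Two made-up sample records: one fully annotated, one not.
samples = [
    {"tissue": "liver", "species": "Homo sapiens", "age": 54,
     "treatment": "none", "collection_time": "09:30"},
    {"tissue": "liver", "species": "Homo sapiens", "age": "",
     "treatment": "none"},
]

for i, sample in enumerate(samples):
    print(i, missing_fields(sample))
print(completeness(samples))
```

A check like this only tells you whether a field is filled in, not whether the value is correct or drawn from a shared vocabulary. That second part is where community standards would have to do the heavy lifting.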
All panelists agreed that the path forward should focus on a smart hybrid approach: train foundation models on large, diverse datasets, then fine-tune them with smaller, highly curated subsets tailored to specific populations or diseases. It’s scalable. It’s strategic. And it’s well within reach, if we put in the effort.
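For readers who want to see the shape of that hybrid approach, here is a deliberately tiny sketch. It is a toy under loud assumptions: a one-variable linear model stands in for a foundation model, both datasets are synthetic, and the helper names (`train`, `mse`) are invented for this example. The only point is the workflow: fit on the broad data first, then continue training from those weights on the small curated subset.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mse(data: List[Point], w: float, b: float) -> float:
    """Mean squared error of the model y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data: List[Point], w: float, b: float,
          lr: float = 0.1, steps: int = 200) -> Tuple[float, float]:
    """Plain gradient descent on MSE, starting from the weights passed in."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic stand-in for a large, diverse dataset: y = 2x + 1.
broad = [(i / 49, 2 * (i / 49) + 1) for i in range(50)]
# Synthetic stand-in for a small curated subset with a shifted relationship:
# y = 2.5x + 1.
curated = [(x, 2.5 * x + 1) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

# Step 1: "pretrain" from scratch on the broad data.
w_pre, b_pre = train(broad, 0.0, 0.0, steps=2000)
# Step 2: "fine-tune" by continuing from the pretrained weights on the subset.
w_ft, b_ft = train(curated, w_pre, b_pre, steps=200)

loss_before = mse(curated, w_pre, b_pre)
loss_after = mse(curated, w_ft, b_ft)
print(loss_before, loss_after)
```

In this toy, the error on the curated subset drops after the second step because the model starts from what it learned on the broad data and only has to adjust for the subset's difference. Real foundation models are vastly more complex, but the two-stage pattern is the same.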
But none of this will happen unless we shift our mindset: from treating data as a research byproduct to treating it as the engine that drives AI in life sciences. We have to build it with intention, structure, and scientific rigor.
TL;DR: Don’t Blame the AI
AI is ready. But if we keep feeding it sloppy, under-annotated, half-baked data and expecting brilliance? That’s on us.
Biology’s future is AI-powered, but only if we start building datasets worthy of the tools we’ve created to analyze them.
Because the path forward doesn’t start with better algorithms.
It starts with better data.
And that is something we, as a scientific community, can absolutely fix.