Can Stats be Fair? The Importance of Statistical Methods for Equity in Genomic Data Analysis
- leandra.braeuninger
- May 14
by Leandra Bräuninger and Brieuc Lehmann
Have you ever wondered whether the statistical methods we use actually help—or hurt—health equity? It's a big question, especially in fields like genomics, where complex models and massive datasets shape research, diagnosis, and treatment.
That’s exactly what we set out to explore in our paper, “Methodological opportunities in genomic data analysis to advance health equity”, published today in Nature Reviews Genetics. Over the last two years, we worked as part of an incredible interdisciplinary team spanning statistics, genomics, and bioethics across academia and genomic medicine (props to Genomics England for funding all of this!). We read over 180(!) papers, hosted several workshops, and conducted interviews with international experts to understand how statistical methods can help promote health equity in genomic research.
As a field, genomics recognised the importance of the (lack of) diversity in genomic research a few decades ago: today’s genomic datasets are disproportionately made up of people of European genetic ancestries, and as a result models trained using these datasets are typically less accurate on folks of non-European genetic ancestries. The choice of statistical methods used to analyse this data, however, is often not discussed. And while people often assume that math and models are objective (you know, “the numbers don’t lie”), the truth is, they do—just in subtler, sneakier ways. That’s because quantitative systems inherit the biases of the world they’re built in. And in genomics, that world is very much skewed.
Where does bias creep in?
To get a handle on this, we developed a framework mapping out the stages of a genomic data analysis where bias can creep in. While it’s designed with genomics in mind, the framework is broad enough to shed light on bias in many other data-heavy research fields too, so here’s an overview:
A project typically starts with study design and data acquisition. This includes decisions about what data to collect and who to include, both of which are often limited and biased.
Then there’s data preparation, where we clean, code, and transform the raw inputs. This step, crucial in genomics, can introduce subtle distortions, especially if some groups’ data are less complete or differently formatted.
Next comes model development, where biological assumptions (often based on incomplete or biased knowledge) guide how we build statistical models. Even the way we define what’s "normal" or "expected" can be skewed.
Finally, there’s evaluation: how we measure whether our models are “good enough.” But good for whom? If we don’t check performance across different groups, models might look accurate while quietly failing entire population groups.
All of this plays out within the broader sociopolitical ecosystem, where things like research priorities, funding, and regulation can shape each stage of the data analysis.
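For readers who like to see things in code, here is a minimal sketch of the evaluation point above: a single overall accuracy figure can hide a complete failure in a smaller group. The data, the group labels "A" and "B", and the helper name `accuracy_by_group` are all made up for illustration; this is not a method from the paper.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall_accuracy, {group: accuracy}) for paired labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group


# Hypothetical toy data: 8 samples from group "A", 2 from group "B".
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
groups = ["A"] * 8 + ["B"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)    # 0.8
print(per_group)  # {'A': 1.0, 'B': 0.0}
```

In this toy case the model looks 80% accurate overall, yet every prediction for group "B" is wrong. Accuracy is only one possible metric, and which groups to stratify by is itself a choice that needs care.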
What are statistical methods actually doing for equity?
Once we understand where bias comes from, the next question is: what can statistical methods do about it? In the paper, we explore four main ways that statistical methods can support health equity:
Reducing bias: Trying to mathematically correct for unfairness in the data or analysis.
Boosting statistical power: Making it easier to detect patterns, especially in underrepresented groups.
Assessing genetic variation: Challenging assumptions that everyone in a dataset is genetically similar.
Evaluating fairness: Identifying when models perform worse for certain groups.
For those of you working in genomics, the review includes many, many references to papers that fall into these categories. If that’s not your field, then just know that different tools aim at different targets, and knowing which is doing what is half the battle.
So, what next?
The statistical methods we have available to us today are not perfect. During the workshops, we asked participants: What are the most important methodological gaps with respect to health equity? Based on their responses and the huge pile of papers we read, we came up with a set of recommendations for how to make statistical methods more equity-conscious. Some are genomics-specific, but several apply more broadly. Here we’ve picked out three we think everyone should care about:
Be careful how you label!
When we talk about equity between groups, we have to define what those groups are. This is surprisingly hard and, done badly, can have really negative consequences. Racial categories, for example, are social constructs that are all too commonly used to (falsely or at least inaccurately) characterise biological differences and in turn justify racist views. Instead, categories should precisely reflect the research question. Where possible, continuous measures (like genetic ancestry distances) may be more appropriate than discrete labels.
Support safe, privacy-preserving data sharing!
More diverse data = better models. But sharing sensitive health data, especially across countries or institutions, raises big privacy and legal issues. Statistical methods like federated learning and synthetic data generation are promising ways to combine datasets without putting individuals at risk. These tools are still evolving, but they may prove instrumental in developing tools that perform more equitably.
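To give a flavour of the idea, here is a toy sketch of the simplest kind of federated analysis: each site shares only three summary numbers, and a coordinator combines them into a pooled mean and variance without ever seeing individual records. This is a deliberate simplification and not federated learning proper; the site data and function names are hypothetical, and the sketch offers no formal privacy guarantee (summaries from very small sites can still reveal information about individuals).

```python
def local_summary(values):
    """Run inside each site: only count, sum, and sum of squares leave the site."""
    return len(values), sum(values), sum(v * v for v in values)


def pooled_mean_and_variance(summaries):
    """Run by the coordinator: combine per-site summaries into pooled statistics."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    # Sample variance from sufficient statistics (fine for a toy example;
    # numerically more stable formulas exist for real data).
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Hypothetical measurements held at two separate sites.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

mean, variance = pooled_mean_and_variance(
    [local_summary(site_a), local_summary(site_b)]
)
print(mean, variance)  # 3.0 2.5
```

The result matches what you would get by pooling all five values in one place, which is the point: for some analyses, the combined answer can be recovered from site-level summaries alone.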
Try to understand the social determinants of health!
It’s well established that our background and environment have a significant impact on our health. Yet these factors, known as the social determinants of health, remain largely absent from genomic analyses. We recommend developing methods that integrate social and environmental context into genomic research. This is no easy task: the data can be messy, and the relationships are often complex. But addressing these challenges is essential if we want genomics to be equitable and truly serve diverse populations.
TL;DR – The paper in a nutshell
We explored how statistical methods in genomics can support (or undermine) health equity, developed a framework to spot where bias creeps in, and recommended ways to do better—both in genomics and beyond.
If you’re a researcher, policymaker, method developer, or just a curious nerd, we’d love for you to:
📬 Reach out with your ideas or questions
📊 Tell us what statistical methods you’re using to advance equity
Because building a fairer future with data isn’t just possible—it’s necessary. And it’s something we can all help shape.