March 21, 2022

Addressing biased data in AI as part of working for health equality

Q&A: EQL’s Co-founder and Chief Medical Officer discusses the biases in health data, in support of the United Nations’ International Day for the Elimination of Racial Discrimination

Peter Grinbergs talks about how EQL’s mission to provide affordable access to healthcare by leveraging advances in AI technology means addressing the biases in health data.


Q: EQL uses data to inform and develop products. How does bias in health data appear?

A: In healthcare, the most common source of data bias is samples that fail to represent a patient population appropriately. In considering samples, we must be conscious of how the data was collected, where it was collected and who it was collected from. Data can come from the NHS, be published in a research journal, or be acquired from a trusted source: all valid collection methods, but none of them rules out the potential for significant bias, and that has to be considered. For example, people with low socioeconomic status are less likely to engage with healthcare services, so national data sets are unlikely to represent this population properly. Other examples include health literacy, language barriers and cultural considerations.

When it comes to racial discrimination, while the link may not be direct, it is clear how failing to accommodate language barriers, for example, would lead to an inadequate representation of people who require language support in the data reported by that particular healthcare solution or service. Geographic inequalities in healthcare provision are another good example.

Fundamentally, failure to eliminate bias leads to inaccuracy in a given model. AI models are only as good as the data they consume, and if that data is not representative, the resulting model won’t be either. Furthermore, there is a very real risk that any residual bias compounds over time: if a model fails to represent groups adequately, it becomes less able to identify them, which diminishes their representation further with each iteration.


Q: How does bias in data impact the effectiveness of artificial intelligence technologies?

A: Artificial intelligence relies on reliable, clean data. Reliable means it must be representative of the problem you are trying to solve; it’s no good trying to work out the probability of someone having a frozen shoulder if the data says nothing useful about frozen shoulders. Clean means it must not contain misleading information; it’s no good if your data sample contains examples only from those who had good access to healthcare and made a good recovery. 

Furthermore, AI solutions that use a probabilistic approach rely heavily on things like risk factors which, generally speaking, are closely linked to things like gender, lifestyle and culture. If there is a culture that reports a higher prevalence of diabetes, then poor access to data from that culture or omitting it completely would significantly limit the ability of any AI system to factor in this important consideration. 

Q: How do you know if the data you are using to train your technology is biased?

A: A good starting point is to assume the data does contain bias, then consider what data points have been collected and to what extent racial, cultural, or under-represented groups have been identified. Just because something has been published, or the data was collected by a reliable source, does not automatically mean these things have been considered.
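As an illustrative sketch of this kind of check (not EQL’s actual tooling), one simple approach is to compare each group’s share of a data set against a reference distribution such as census figures and flag groups that fall well short. The group names, field name and tolerance here are all hypothetical.

```python
from collections import Counter

def representation_gaps(records, reference, key="ethnic_group", tolerance=0.5):
    """Flag groups whose share of `records` falls below `tolerance`
    times their share in the `reference` population distribution."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    gaps = {}
    for group, ref_share in reference.items():
        observed = counts.get(group, 0) / total
        if observed < tolerance * ref_share:
            gaps[group] = {"observed": round(observed, 3), "expected": ref_share}
    return gaps

# Hypothetical sample: group B is 5% of the data but 20% of the population
records = [{"ethnic_group": "A"}] * 95 + [{"ethnic_group": "B"}] * 5
reference = {"A": 0.80, "B": 0.20}  # census-style reference shares
gaps = representation_gaps(records, reference)
# gaps == {"B": {"observed": 0.05, "expected": 0.2}}
```

A check like this only surfaces gaps for groups you have already thought to record, which is exactly why the answer above stresses asking what data points were collected in the first place.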


Q: How can you ensure you are using truly representative data?

A: There are good scientific methods to ensure data is representative, such as considering race, culture or socio-economic factors when recruiting participants for a study. That said, study groups for investigations such as randomised controlled trials are by their very nature limited and controlled.

When it comes to population health data, there are methods to mitigate bias, such as selectively ‘boosting’ sample data to enhance under-represented groups artificially. You identify the specific under-represented group or groups, assess how much of the data currently represents each group (e.g. 1% of the data speaks for that group), and then create synthetic data for that group to bring it up to parity (e.g. 5%). There are reliable data sets, for example those produced by census information, that indicate what these ratios should be.
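A minimal sketch of this boosting idea follows, assuming a target share taken from a census-style reference. Real synthetic-data generation is more sophisticated (e.g. SMOTE-style interpolation); here, resampling existing records with replacement stands in for it, and all names and figures are hypothetical.

```python
import random

def boost_group(records, key, group, target_share, seed=0):
    """Oversample `group` (resampling with replacement as a simple
    stand-in for synthetic data generation) until it makes up roughly
    `target_share` of the returned data set."""
    rng = random.Random(seed)
    members = [r for r in records if r[key] == group]
    others = [r for r in records if r[key] != group]
    # Solve n / (n + len(others)) = target_share for the group's count n
    needed = round(target_share * len(others) / (1 - target_share))
    extra = max(0, needed - len(members))
    return records + [rng.choice(members) for _ in range(extra)]

# Hypothetical: group "B" is 1% of the data; boost it to roughly 5%
records = [{"group": "A"}] * 99 + [{"group": "B"}] * 1
boosted = boost_group(records, "group", "B", 0.05)
share = sum(r["group"] == "B" for r in boosted) / len(boosted)
```

Duplicating records in this way raises a group’s weight without adding new information about it, which is why the answer above frames boosting as a complement to, not a substitute for, representative collection.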

Q: What can the health tech community do to address data bias in healthcare? 

A: As a community, we not only have an obligation to address data bias, but also a great opportunity to lead the way in solving the problem. Technology, by its very nature, is unbiased. It has no agenda and as such has the potential to level the playing field. As with any sound scientific approach, it’s important to examine the methodologies at play and subject them to an appropriate level of review, and to open up the debate and learn from the wider community about the techniques and methods available. It is equally important to make the end consumers of such solutions an integral part of the discussion, to accept criticism and feedback, and to learn from the very people these solutions aim to support.

At EQL, we routinely invite users to engage in the design process of our products, while also ensuring that the latest data science approaches are considered at every stage. We have long been part of these discussions and actively participate in and drive the debate. From a technical perspective, we strive to interrogate data and make no assumptions when distilling insights. At its core, this means considering all possible angles on our products, with a specific focus on putting the person’s needs at the heart of the solution.

Q: Is there an argument for regulation of bias in data? 

A: Regulation can be helpful; however, on its own it is unlikely to provide a satisfactory answer. By its very nature, regulation differs from region to region and as such will always fall short of a truly global framework. What’s needed is a benchmark against which AI and machine learning (ML) approaches can be evaluated. It’s an area EQL is passionate about, and it was one of the reasons we lobbied the World Health Organization and the International Telecommunication Union for the creation of the AI for MSK Medicine topic group.

EQL is the founding sponsor of this group, and key staff continue to serve as official topic drivers and collaborators. The topic group is now well established and includes representatives from all over the world who meet regularly to debate this and other important topics.

Find out more about the World Health Organization and the International Telecommunication Union’s AI for MSK Medicine topic group, a community working to build a consensus on regulating AI in MSK.
