Perspective | Medicine

Achieving fairness in medical devices


Science  02 Apr 2021:
Vol. 372, Issue 6537, pp. 30-31
DOI: 10.1126/science.abe9195

The hardware or software that operates medical devices can be biased. A biased device is one that operates in a manner that disadvantages certain demographic groups and so contributes to health inequity; reducing bias is therefore one way to increase fairness in a device's operation. Initiatives to promote fairness are growing rapidly across a range of technical disciplines, but not rapidly enough in medical engineering. Although technology companies have terminated lucrative but biased facial-recognition systems, biased medical devices continue to be sold as commercial products. It is important to address bias in medical devices now. This can be achieved by studying where and how bias arises, and using that understanding to inform mitigation strategies.

Bias in medical devices can be divided into three broad forms (see the figure). A medical device can exhibit physical bias, where physical principles are biased against certain demographics. Once data are collected, computational bias, which pertains to the distribution, processing, and computation of data that are used to operate a device, must be considered. Subsequent implementation in clinical settings can lead to interpretation bias, where clinical staff or other users may interpret device outputs differently based on demographics.

The physical working principle of a medical device is biased when it exhibits an undesirable performance variation across demographic groups. An example of physical bias occurs in the context of optical biosensors that use light to monitor vital signs. A pulse oximeter uses two colors of light (one in near-infrared and the other in visible light) to measure blood oxygenation. Pulse oximetry makes it possible to detect occult hypoxemia, that is, low arterial oxygen saturation that is not detectable from symptoms. However, a recent study found that Black patients had about three times the frequency of occult hypoxemia that went undetected by pulse oximetry (1). Darker skin absorbs these wavelengths of light differently, particularly in the visible range. Because hypoxemia relates to mortality, such a biased medical device could lead to disparate mortality outcomes for Black and dark-skinned patients.
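The standard pulse-oximetry computation can be sketched as a "ratio of ratios" of pulsatile (AC) to baseline (DC) absorbances at the two wavelengths, mapped to an oxygen-saturation estimate through an empirically fit calibration curve. The linear form and the constants below are illustrative assumptions, not any real device's calibration; the point is that the calibration is fit to measured data, and a fit skewed toward lighter-skinned subjects is one route by which the bias described above can enter.

```python
# Hedged sketch of the standard "ratio of ratios" pulse-oximetry
# calculation. Calibration constants a and b are illustrative only;
# real devices fit them empirically, and a fit dominated by
# light-skinned subjects can misestimate saturation for dark skin.

def spo2_estimate(ac_red, dc_red, ac_ir, dc_ir, a=110.0, b=25.0):
    """Estimate blood oxygen saturation (%) from pulsatile (AC) and
    baseline (DC) absorbances at red and near-infrared wavelengths."""
    r = (ac_red / dc_red) / (ac_ir / dc_ir)  # ratio of ratios
    return a - b * r  # linear calibration curve (assumed form)

# A relatively larger red-light pulsatile signal (higher R) maps to
# a lower saturation estimate: R = 0.8 -> 110 - 25 * 0.8 = 90.0
print(spo2_estimate(0.02, 1.0, 0.025, 1.0))  # 90.0
```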

Physical bias is not restricted to skin color. For example, the mechanical design of implants for hip replacement exhibits a potentially troubling gender disparity. The three-dimensional models used to design hip-joint implants sometimes do not account for the distinct bone structure of female hips (2). This could lead to alignment issues and relatively poor outcomes for affected females. This problem was one motivation for the development of gender-specific implants. Fortunately, physical challenges can also be addressed through unexpected technical innovation, such as in the example of the remote plethysmograph. This device measures heart rate through visual changes in skin color. Because color-based cues are biased by skin tone, researchers developed an alternative approach that uses motion cues to estimate heart rate. Because motions are visible on the surface of the skin, the technique is less biased by subsurface melanin content (3). Studying motion cues instead of color cues is thus an exciting technical direction advanced with the goal of promoting fairness.

Measuring fairness

Fairness can be quantified based on ϵ-bias. Fairness is maximized when ϵ = 0, achieving a state of 0-bias.

GRAPHIC: N. DESAI/SCIENCE

Computational workflows are becoming more tightly coupled with devices, which increases the number of entry points where computational bias can invade medical technologies. An aspect of computational bias is dataset bias. Consider the following example from x-ray imaging: Diagnostic algorithms can learn patterns from x-ray imaging datasets of thoracic conditions. However, these imaging datasets often contain a surprising imbalance, where females are underrepresented. For example, despite having a sample size of more than 100,000 images, frequently used chest x-ray databases are ∼60% male and ∼40% female (4). This imbalance worsens the quality of diagnosis for female patients. A solution is to ensure that datasets are balanced. Somewhat unexpectedly, balancing the gender representation to 50% female boosts diagnostic performance not only for females but also for males (4). Despite best efforts, demographic balancing of a dataset might not be possible. This could be due to conditions that present more often in one sex than the other. In such cases where balancing a dataset is truly infeasible, transfer learning can be used as a step toward a longer-term solution (5). Transfer learning could repurpose design parameters from task A (based on a balanced dataset) to task B (with an unbalanced dataset). In the future, it might be possible to balance a dataset using a human digital twin. These are computational models that can be programmed to reflect a desired race, sex, or morphological trait.
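The dataset-balancing step described above can be sketched in a few lines: group the records by a demographic attribute and subsample every group down to the size of the smallest one, so each group is equally represented. The field name "sex" and the 60/40 split mirror the chest x-ray example but are assumptions for illustration; oversampling the smaller group is an equally common alternative.

```python
import random

# Illustrative sketch: balance a labeled dataset by subsampling the
# overrepresented group(s), as in the chest x-ray example above.
# The record schema and the 60/40 split are assumed for illustration.

def balance_by_group(records, key="sex"):
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    n = min(len(g) for g in groups.values())  # size of smallest group
    balanced = []
    for g in groups.values():
        balanced.extend(random.sample(g, n))  # subsample each group to n
    return balanced

# A toy dataset with the ~60% male / ~40% female imbalance cited in (4).
data = [{"sex": "M"}] * 60 + [{"sex": "F"}] * 40
balanced = balance_by_group(data)
counts = {s: sum(1 for r in balanced if r["sex"] == s) for s in ("M", "F")}
print(counts)  # {'M': 40, 'F': 40}
```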

Another form of computational bias is algorithm bias, where the mathematics of data processing disadvantages certain groups. Software algorithms can now process video streams to detect the spontaneous blink rate of a human subject. This is helpful in diagnosing a variety of neurological disorders, including Parkinson's disease (6) and Tourette syndrome (7). Unfortunately, traditional image-processing systems have particular difficulty in detecting blinks for Asian individuals (8). The use of such poorly designed and biased algorithms (9) could produce or exacerbate health disparities between racial groups.

Interpretation bias occurs when a medical device is subject to biased interpretation of its readings. An example of a misinterpreted medical device is the spirometer, which measures lung capacity. The interpretation of spirometry data creates unfairness because certain ethnic groups, such as Black or Asian people, are assumed to have lower lung capacity than white people: 15% lower for Black people and about 5% lower for Asian people. This assumption is based on earlier studies that may have incorrectly estimated innate lung capacity (10). Unfortunately, these “correction factors,” based on questionable assumptions, are applied to the interpretation of spirometer data. For example, before “correction,” a Black person's measured lung capacity might be lower than a white person's. After “correction” to a smaller baseline, a treatment plan would prioritize the white person: because a Black person is expected to have lower lung capacity, their measured capacity must fall much further below a white person's before the reduction is considered a priority.
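The arithmetic behind this effect is simple enough to sketch. The liter values below are hypothetical; only the 15% correction factor comes from the text. The same measurement that reads as clearly reduced against an uncorrected reference reads as "normal" once the reference baseline is scaled down by the race-based factor.

```python
# Illustrative arithmetic (liter values hypothetical): how a race-based
# "correction factor" changes whether a reduced measurement is flagged.

def percent_predicted(measured_l, reference_l, correction=1.0):
    """Measured lung capacity as a percentage of the (possibly
    race-'corrected') reference value."""
    return 100.0 * measured_l / (reference_l * correction)

reference = 5.0   # hypothetical healthy reference capacity, liters
measured = 4.4    # hypothetical patient's measurement, liters

# Uncorrected: 4.4 / 5.0 = 88% of predicted, flagged as reduced.
print(percent_predicted(measured, reference))        # 88.0
# With the 15% "correction" (baseline scaled to 4.25 L), the same
# measurement reads as above predicted and the reduction is hidden.
print(percent_predicted(measured, reference, 0.85))  # ~103.5
```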


CREDIT: LIFE IN VIEW/SCIENCE SOURCE
Bias in medical devices

A device can be biased if its design disadvantages certain groups on the basis of their physical attributes, such as skin color. For example, pulse oximeters (see the photo) detect changes in light passed through skin and are less effective in people with dark skin. Computational techniques are biased if training datasets are not representative of the population. Interpretation of results may be biased according to demographic groups, for example, with the use of “correction factors.”

CREDIT: N. DESAI/SCIENCE

However well intentioned, errors in “correction” for race (or sex) can disadvantage the groups it seeks to protect. In the spirometer example, the device designers conflated a racial group's healthy lung capacity with their average lung capacity. This assumption does not account for socioeconomic distinctions across race: Individuals who live near motorways exhibit reduced lung capacity, and these individuals are often from disadvantaged ethnic groups. The spirometer is just one of several examples of systemic racism in medicine (11).

If our society desires fair medical devices, it must reward a fair approach to innovation. It is inspiring to observe the speed at which the artificial intelligence (AI) community has embraced fairness as a goal. Authors can be encouraged by journals to address the societal implications of their technologies and include a “broader impacts” statement that is considered in peer review. This has already been introduced at an AI journal to encourage consideration of the diversity of potential users of their software (12). Fairness research in AI is increasingly garnering scholarly acclaim. For example, a seminal report highlighted widespread bias in face recognition, finding that darker-skinned females are misclassified at rates up to 34.7%, whereas the maximum error rate for lighter-skinned males is only 0.8% (13). In response to concerns of fairness, action is being taken. For example, Amazon Inc. has recently banned the use of its facial-recognition products by police until bias concerns can be resolved. There is still a long way to go in addressing bias in AI, but some of the lessons learned can be repurposed for medical devices.

A “fairness” statement for the evaluation of studies of medical devices could use the three categories of bias as a rubric: physical bias, computational bias, and interpretation bias. A medical-device study does not need to be perfectly unbiased to be reported. Indeed, it may not always be possible to remove all sources of bias. For example, an oximeter reliant on an optical sensor is likely to remain biased against dark skin (1). The fairness statement can consist of technical explanations for how attempts to mitigate bias failed and suggest technical compensations for disadvantaged groups (e.g., collect additional data points for dark-skinned people). This is consistent with the introduction of “positive biases,” where race-aware and gender-aware methodologies are explicitly designed to counteract negative bias (14).

Additionally, the inclusion of fairness metrics in studies of medical devices could be considered. Choosing the right fairness metric for an algorithm is a quantitatively challenging computer science exercise (15) and can be abstracted here as “ϵ-bias,” where ϵ quantifies the degree of bias across subgroups. Under this abstraction, 0-bias corresponds to a perfectly fair device. Achieving 0-bias on its own is trivial: Simply return a measurement that is consistently useless across demographics. The real problem is to maximize performance while minimizing ϵ-bias. This may present a Pareto trade-off, where maximizing performance and minimizing bias are objectives at odds with each other. A Pareto curve can quantitatively display how changing the device configuration varies the balance between performance and fairness (see the graph). Such analyses might be a useful inclusion in medical-device studies.
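One minimal way to instantiate the ϵ-bias abstraction, assuming a per-subgroup performance measure is available, is to define ϵ as the largest performance gap between any two subgroups. The subgroup labels and accuracy numbers below are invented for illustration; the two configurations show the Pareto tension described above, since neither dominates the other on both objectives.

```python
# A minimal sketch of the ϵ-bias idea: ϵ is the largest gap in a
# per-subgroup performance measure; ϵ = 0 is the 0-bias state.
# Subgroups and accuracy values are invented for illustration.

def epsilon_bias(perf_by_group):
    """Max performance gap across demographic subgroups."""
    vals = list(perf_by_group.values())
    return max(vals) - min(vals)

# Two hypothetical device configurations on the performance/fairness
# Pareto frontier: A has higher mean accuracy but a larger gap,
# B has lower mean accuracy but a smaller gap. Neither dominates.
config_a = {"light skin": 0.95, "dark skin": 0.80}
config_b = {"light skin": 0.88, "dark skin": 0.86}

print(round(epsilon_bias(config_a), 2))  # 0.15
print(round(epsilon_bias(config_b), 2))  # 0.02
```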

Achieving fairness in medical devices is a key piece of the puzzle, but a piece nonetheless. Even if one manages to engineer a fair medical device, it could be used by a clinical provider who has conscious or subconscious bias. And even a fair medical device from an engineering perspective might be inaccessible to a range of demographic groups, owing to socioeconomic reasons. Several open questions remain. What is an acceptable trade-off between device performance and fairness? How can biases that are hard to predict, or hard to observe at scale, be dealt with? Race and sex are also part of human biology. How can positive biases be properly encoded into medical-device design? Diversity and inclusion have gained increasing attention, and the era of fair medical devices is only just beginning.

References and Notes

Acknowledgments: I thank P. Chari, L. Jalilian, K. Kabra, M. Savary, M. Majmudar, and the Engineering 87 class at UCLA for constructive feedback. I am supported by a National Science Foundation CAREER grant (IIS-2046737), Google Faculty Award, and Sony Imaging Young Faculty Award.