Data Correlation and the Simpson Paradox

Summary

Simpson’s paradox is a fascinating statistical phenomenon where the observed relationship between two variables can be reversed when the data is divided into subgroups. For biologists and biochemists, this means that the correlation between two biological variables might change direction when considering an additional categorical variable, such as a specific species or experimental condition. This paradox underscores the importance of careful data analysis to avoid misleading conclusions.

Illustration

The following illustration, taken from the Wikipedia article on Simpson’s paradox, shows how the linear relationship between two variables is reversed once a third variable that groups the observations is taken into account.

Simpson’s paradox animation
Visualization of Simpson’s paradox on data resembling real-world variability, showing that a misjudgment of the true causal relationship can be hard to spot.

The Wikipedia article on Simpson’s paradox provides real-life examples. One illustrates a mistaken inference of gender bias in admissions at the University of California, Berkeley; another concerns treatments for kidney stones.
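
The kidney stone example can be checked with a few lines of arithmetic. The sketch below uses the success counts commonly quoted for that example; they are not given in this article, so treat them as an assumption to be verified against the Wikipedia page. The names `counts`, `rate` and `pooled` are made up for this illustration.

```python
# (successes, patients) per treatment and stone size.
# Assumption: these are the counts usually quoted for the kidney stone
# example; they do not appear in this article.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, patients):
    """Fraction of patients treated successfully."""
    return successes / patients


def pooled(treatment):
    """Success rate for one treatment, ignoring stone size."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return rate(successes, patients)


# Compare the two treatments within each stone size, then overall.
for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size:>5} stones: A = {a:.3f}, B = {b:.3f}")

print(f"  all stones: A = {pooled('A'):.3f}, B = {pooled('B'):.3f}")
```

With these counts, treatment A has the higher success rate within each stone size but the lower rate overall, because A was given far more often to the harder large-stone cases.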

Penguin beak dimensions

A real-life example, the beak measurements of penguins from three different species, has more recently been used to illustrate this issue in the blog entry “Who Is Simpson And What Does His Paradox Mean For Ecologists?”

Penguin beak dimensions as one group
Scatterplot of bill depth in millimeters on the x-axis and bill length in millimeters on the y-axis for penguins found at Palmer Station. A regression line fit to the data has a negative slope.
Penguin beak dimensions for each of the three species
Scatterplot of bill depth in millimeters on the x-axis and bill length in millimeters on the y-axis for penguins found at Palmer Station where the color and shape denote three different species of penguins. The regression lines fit to each species’ data have positive slopes.
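
The reversal described in the two captions above can be reproduced with synthetic data. The sketch below does not use the penguin measurements: the three groups, their centres, the within-group slope and the noise levels are all invented for illustration, and `ols_slope` is a hypothetical helper that computes an ordinary least-squares slope.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented group centres (x, y): the centres themselves lie on a downward trend.
centres = {"A": (2.0, 8.0), "B": (5.0, 5.0), "C": (8.0, 2.0)}
within_slope = 0.8  # assumed positive relationship inside every group
n_per_group = 50

all_x, all_y, group_slopes = [], [], {}
for name, (cx, cy) in centres.items():
    x = [rng.gauss(cx, 1.0) for _ in range(n_per_group)]
    y = [cy + within_slope * (xi - cx) + rng.gauss(0.0, 0.3) for xi in x]
    group_slopes[name] = ols_slope(x, y)  # fit within one group
    all_x.extend(x)
    all_y.extend(y)

pooled_slope = ols_slope(all_x, all_y)  # fit ignoring the grouping

print("per-group slopes:", {k: round(v, 2) for k, v in group_slopes.items()})
print("pooled slope:", round(pooled_slope, 2))
```

Each group is generated with a positive slope, yet the pooled fit comes out negative because the differences between group centres dominate the variation within groups.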

Criticism

Some criticisms of this idea of a “paradox” are presented in the same article:

1. The paradox is not really a paradox but a failure to properly account for confounding variables or to consider causal relationships between variables (Blyth, 1972).

2. The apparent paradox may be an artifact of the specific way the data are stratified or grouped. The phenomenon may disappear or even reverse if the data are stratified differently or if different confounding variables are considered, as explained by the phenomenon of “noncollapsibility” (Greenland, 2021), which occurs when subgroups with high proportions do not make simple averages when combined.
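
Noncollapsibility in the second point can be made concrete with the odds ratio. The sketch below is one possible illustration, not taken from Greenland (2021): the two strata and their risks are invented, the strata are assumed to be equally large, and treatment is assumed to be equally common in both, so there is no confounding by stratum.

```python
def odds_ratio(p_treated, p_untreated):
    """Odds ratio comparing two outcome probabilities."""
    odds_treated = p_treated / (1 - p_treated)
    odds_untreated = p_untreated / (1 - p_untreated)
    return odds_treated / odds_untreated


# Invented risks (treated, untreated) in two equally sized strata.
strata = [(0.8, 0.5), (0.5, 0.2)]

stratum_ors = [odds_ratio(p1, p0) for p1, p0 in strata]

# With equal stratum sizes, the pooled risk is the plain mean of the stratum risks.
pooled_treated = sum(p1 for p1, _ in strata) / len(strata)
pooled_untreated = sum(p0 for _, p0 in strata) / len(strata)
pooled_or = odds_ratio(pooled_treated, pooled_untreated)

print("stratum odds ratios:", [round(v, 2) for v in stratum_ors])
print("pooled odds ratio:", round(pooled_or, 2))
```

Both strata have the same odds ratio, yet the pooled odds ratio is smaller. The pooled value is therefore not a simple average of the stratum values, even though nothing here is confounded.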

References

Blyth, Colin R. (June 1972). “On Simpson’s Paradox and the Sure-Thing Principle”. Journal of the American Statistical Association. 67 (338): 364–366.

Greenland, Sander (2021-11-01). “Noncollapsibility, confounding, and sparse-data bias. Part 2: What should researchers make of persistent controversies about the odds ratio?”. Journal of Clinical Epidemiology. 139: 264–268. doi:10.1016/j.jclinepi.2021.06.004. ISSN 0895-4356. PMID 34119647.

Simpson, Edward H. (1951). “The Interpretation of Interaction in Contingency Tables”. Journal of the Royal Statistical Society, Series B. 13 (2): 238–241. doi:10.1111/j.2517-6161.1951.tb00088.x.

Source