Is p-value hacking reverse statistics?


P-value hacking is the process of manipulating or analyzing data in a way that increases the likelihood of obtaining statistically significant results. I asked 4 different AI chatbots for an explanation with examples.


This story came about when I watched a short video titled “Teach me STATISTICS in half an hour! Seriously.” At the very end there was a passage about p-value hacking, which is a serious issue.

Note: This author has a video dedicated to p-values: “What are p-values?? Seriously.”

The modern Null Hypothesis Significance Testing (NHST) method involves:

  1. Formulating the null hypothesis.
  2. Conducting an experiment and recording data.
  3. Calculating the p-value, representing the probability of obtaining observed results under the assumption that the null hypothesis is true.
  4. Comparing the p-value to a threshold (commonly 0.05): if it falls below the threshold, the results are statistically significant and the null hypothesis is rejected; if it falls above, we fail to reject the null hypothesis.
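As a minimal sketch of step 3, the exact one-sided p-value for a simple coin-flip experiment can be computed directly from the binomial distribution (the counts below are illustrative, not from any study mentioned here):

```python
from math import comb

def binom_p_value(heads, flips):
    """Exact one-sided p-value: the probability of observing `heads`
    or more heads in `flips` tosses, assuming the null hypothesis
    (a fair coin) is true."""
    total = 2 ** flips
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / total

# Observed: 14 heads in 20 flips. Is the coin biased toward heads?
p = binom_p_value(14, 20)
print(round(p, 4))  # 0.0577 -> above 0.05, so the null hypothesis stands
```

With a threshold of 0.05, this result is not statistically significant, even though 14 heads out of 20 looks suspicious at first glance.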

What is p-value hacking?

P-value hacking is the process of manipulating or analyzing data in a way that increases the likelihood of obtaining statistically significant results.

This subject has been addressed in publications:

– Boutron I, Ravaud P. Misrepresentation and distortion of research in biomedical literature. Proc Natl Acad Sci U S A. 2018 Mar 13;115(11):2613-2619. doi: 10.1073/pnas.1710755115. PMID: 29531025; PMCID: PMC5856510.

– Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol. 2015 Mar 13;13(3):e1002106. doi: 10.1371/journal.pbio.1002106. PMID: 25768323; PMCID: PMC4359000.

More references are available in this blog article: P-value hacking – What is this about?

The post “Statistical P-hacking explained” ([Archived]) offers a good overview, including an understanding of how this tool is misused.

I asked multiple AI chatbots about this, see their answers below. It is interesting how they offer different examples, but agree on the overall explanation.

P-value hacking explained by AI

I asked the same question to AI chatbots as an experiment to see how they would respond.
The chatbots were:
– Perplexity, with the llama-2-70b-chat model (default query; see the Wikipedia article)
– ChatGPT (free 3.5 version)
– Google Bard, Gemini version (released December 6th, 2023)

Note: Gemini Bard provides an answer as well as 3 drafts of the final answer. I included Draft #3, which seemed significantly more elaborate.

P-hacking question asked to AI Chatbots:

Please explain in simple, yet accurate words the notion of p-value hacking and provide a real-life, accurate example in the realm of biology or genetics.

Here are the unedited answers from all AI chatbots:

Perplexity llama-2-70b-chat

P-value hacking refers to the practice of manipulating data, statistical methods, or study design to obtain a desired outcome, usually a statistically significant result (p-value less than 0.05) when none exists or to make a statistically significant result appear more significant than it is. This practice is also known as statistical significance chasing or p-hacking.

In the realm of biology or genetics, p-value hacking can occur when researchers are trying to identify a statistically significant association between a specific gene or genetic variant and a particular disease or trait. Here’s an example:

Suppose a researcher is studying the relationship between a genetic variant (let’s call it rs12345) and the risk of developing a particular disease (e.g., cancer). The researcher conducts a study with a sample of 1000 participants, half of whom have the disease and half of whom do not. The researcher finds that the frequency of the rs12345 variant is slightly higher in the group with the disease (20% vs. 15%).

Using a statistical test, the researcher calculates the p-value for the association between the rs12345 variant and the disease. The p-value is 0.15, which means that there is no statistically significant association between the variant and the disease (i.e., the result could be due to chance).

However, the researcher wants to publish a positive result, so they start manipulating the data to make the association appear more significant. They exclude participants who have other genetic variants that might be confounding the association, they adjust for various demographic factors, and they even split the data into subgroups based on age, gender, or other characteristics to see if the association is stronger in any particular group.

After several iterations of data manipulation, the researcher finally obtains a p-value of 0.03, which is below the 0.05 threshold for statistical significance. They then write a paper claiming that the rs12345 variant is associated with an increased risk of developing the disease.

This is an example of p-value hacking, as the researcher manipulated the data and statistical methods to obtain a statistically significant result that is not actually supported by the data. This practice can lead to false positives, overestimation of effects, and an inflated sense of the importance of the findings. It can also contribute to the reproducibility crisis in science, as other researchers may not be able to replicate the results.

It’s worth noting that p-value hacking is not always intentional or malicious. Researchers may be unaware of the biases in their data or statistical methods, or they may feel pressure to publish positive results to advance their careers. However, it’s important for researchers to be aware of these biases and to take steps to mitigate them, such as using robust statistical methods, reporting all results transparently, and avoiding the temptation to manipulate data or statistical methods to obtain a desired outcome.
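The subgroup fishing described in this answer can be simulated. The sketch below is my own illustration (not from the chatbot): it generates data in which a variant is present in 15% of both cases and controls, so there is no real effect, then tests 20 random subgroups with a hand-rolled two-proportion z-test and keeps only the best p-value. A “significant” result still turns up in a large fraction of such null studies:

```python
import math
import random

random.seed(1)

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value comparing two proportions (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * P(Z > |z|)

def best_p_after_fishing(n_subgroups=20, group_size=50, freq=0.15):
    """Variant frequency is 15% in cases AND controls (no real effect).
    Test n_subgroups random subgroups and keep only the smallest p-value."""
    best = 1.0
    for _ in range(n_subgroups):
        cases = sum(random.random() < freq for _ in range(group_size))
        controls = sum(random.random() < freq for _ in range(group_size))
        best = min(best, two_prop_p(cases, group_size, controls, group_size))
    return best

# Fraction of null studies that reach "p < 0.05" after trying 20 subgroups:
hits = sum(best_p_after_fishing() < 0.05 for _ in range(500))
print(hits / 500)
```

Real subgroup analyses are not independent, so this overstates things somewhat, but the direction of the effect is the point: the more analyses you try, the more likely a false positive becomes.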

Understanding P-Value Hacking

P-value hacking, also known as p-hacking, refers to the practice of manipulating or analyzing data in a way that increases the likelihood of obtaining statistically significant results. This can involve selectively choosing which data to include or exclude, running multiple statistical tests until a significant result is found, or changing the analysis after observing the results. Essentially, it involves tweaking the analysis until a desired outcome is achieved, which can lead to misleading or false conclusions.

Real-Life Example in Biology or Genetics

In the realm of biology or genetics, p-value hacking can occur when researchers analyze genetic data to find associations between certain genetic markers and a particular trait or disease. For example, imagine a study aiming to identify genetic variants associated with a specific disease. If the researchers analyze a large number of genetic markers and only report the ones that show a statistically significant association with the disease, they may be engaging in p-value hacking.

By selectively reporting only the statistically significant results, the researchers may be overlooking the overall pattern of the data and potentially drawing false conclusions about the genetic factors influencing the disease. This practice can lead to the publication of misleading or inaccurate findings, which can have significant implications for further research and clinical applications in the field of genetics.

It’s important for researchers to adhere to rigorous statistical practices and transparently report all analyses conducted to avoid the pitfalls of p-value hacking and ensure the reliability and validity of their findings.

If you have further questions or need more examples, feel free to ask!

ChatGPT (3.5)

P-value hacking refers to the practice of repeatedly analyzing data in various ways until a statistically significant result is found, without pre-specifying the analysis plan. This can lead to a false sense of significance, as the more analyses conducted, the higher the chance of finding a seemingly significant result by random chance alone.

Here’s a simplified example in the realm of biology or genetics:

Imagine a researcher is investigating whether a new drug has an effect on a specific biological marker. The researcher collects data and performs multiple statistical tests to see if there’s a significant difference between the drug and control groups. They test various time points, different subgroups, and explore different variables.

If the researcher continues testing until they find a statistically significant result, without adjusting for the multiple comparisons made, they may incorrectly conclude that the drug has a significant effect when, in reality, the result could be due to chance.

In a more concrete example, let’s say the researcher measures the expression of a specific gene in response to the drug at multiple time points. By chance alone, one of these time points might show a significant difference. If the researcher only reports this significant finding and doesn’t account for the multiple comparisons, it can lead to a false discovery.

P-value hacking undermines the reliability of statistical significance by inflating the likelihood of Type I errors (false positives). To mitigate this issue, researchers should pre-specify their analysis plans and, if they conduct multiple tests, use appropriate statistical techniques (such as Bonferroni correction) to account for the increased risk of obtaining false-positive results.
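The Bonferroni correction mentioned above is simple to apply: each raw p-value is compared against the significance threshold divided by the number of tests performed. A minimal sketch (the p-values are made up for illustration):

```python
def bonferroni(p_values, alpha=0.05):
    """Return which tests remain significant after a Bonferroni
    correction: compare each p-value to alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Five time points tested against the control group (illustrative values):
raw = [0.030, 0.200, 0.004, 0.470, 0.090]
print(bonferroni(raw))  # [False, False, True, False, False]
```

Note how the 0.030 result, which looks significant on its own, no longer clears the corrected threshold of 0.05 / 5 = 0.01.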

Bard (Gemini)

Imagine you’re flipping a coin 10 times. You expect roughly half of them to be heads, right? Now, picture someone flipping coins all day, trying to get 6 heads in a row. Eventually, by pure chance, they might get lucky. P-hacking is like that lucky coin flip – manipulating data or analyses until you get a statistically significant result, even if it doesn’t reflect reality.

Here’s an example in biology: Let’s say you’re testing if a new fertilizer increases corn yield. You collect data from several farms, but some yields are just okay. P-hacking could involve:

  • Fishing for results: Trying different statistical tests like a fisherman casting different lines, until you catch a significant one.
  • Data dredging: Diving into your data (like scouring the ocean floor) for any possible trend, even if it wasn’t your initial hypothesis.
  • Dropping observations: Excluding data points that make your result less significant, like throwing back unlucky fish.

The problem is, these practices inflate the likelihood of a false positive – a significant result that doesn’t reflect a real effect. Just like the lucky coin flip, it might be chance, not magic.

This can have serious consequences. Imagine basing agricultural practices on a proven fertilizer that actually works by chance. It could waste resources and even harm the environment.

P-hacking is a critical issue in science, especially in biology and genetics where complex data and experimental setups are common. It’s important to be aware of this and encourage robust research practices to ensure reliable scientific evidence.

Remember, just like trusting only heads in a coin flip, basing conclusions solely on hacked p-values can paint a misleading picture of reality.


Bard (Gemini) Draft #3

P-hacking, in simple terms, is like playing with statistics until you get a winning result. It’s like fiddling with a slot machine until you hit the jackpot, but instead of coins, you’re dealing with scientific data.

Imagine you’re studying whether a new fertilizer increases tomato yield. You measure the tomatoes from many plants, but just looking at the raw data might not tell you if the fertilizer really works. That’s where statistics come in. You use a test that calculates a p-value, which tells you how likely it is that the difference in yields is just due to random chance.

Now, here’s where p-hacking comes in:

  1. Cherry-picking: You might analyze the data multiple times, trying different statistical tests or looking at different subsets of the data. This increases the chances of finding a significant result (low p-value) by chance, even if the fertilizer doesn’t actually make a difference.
  2. HARKing: You might decide which results to report based on whether they’re significant or not. This means you might throw away non-significant results, even if they’re important, and only report the ones that make your fertilizer look good.
  3. P-hacking by design: You might design your experiment in a way that makes it more likely to get a significant result, even if the effect is tiny. This could involve using an unusually small sample size or making assumptions about the data that might not be true.
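The inflation that Draft #3 describes under “cherry-picking” has a simple closed form: for m independent tests of true null hypotheses at level alpha, the chance of at least one false positive is 1 - (1 - alpha)^m. A quick sketch of this (my addition, not part of the chatbot answer):

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n_tests
    independent tests when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20):
    print(n, round(family_wise_error_rate(n), 3))
# 1  -> 0.05
# 5  -> 0.226
# 20 -> 0.642
```

Twenty analysis attempts at the conventional 0.05 threshold give better-than-even odds of a spurious “discovery.”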

Real-life example: A study investigates if a certain gene variant increases the risk of Alzheimer’s disease. They analyze data from hundreds of patients, but they try many different statistical tests and data subsets. Eventually, they find a statistically significant association between the gene and the disease (low p-value). However, they don’t report the non-significant results from other analyses, and they don’t mention the multiple tests they performed. This raises concerns that the result might be due to p-hacking, not a real effect of the gene.

P-hacking can lead to misleading conclusions that get published in scientific journals. This can have serious consequences, like delaying the development of effective treatments or even harming patients based on wrong information.

It’s important to be aware of p-hacking and to critically evaluate scientific studies, especially when they report statistically significant results. Look for transparency in the methods and data analysis, and be wary of studies that don’t report non-significant findings.

Remember, just because a study has a low p-value, it doesn’t mean the result is real or important. P-hacking can create false positives, and we need to be careful not to fall for them.

Top image credits: AI generated image by Pixabay artist TheDigitalArtist.
Video credits: embedded videos were produced by Justin Zeltzer from