In 2015, 270 researchers around the world attempted to replicate 100 studies published in three of the most prestigious psychology journals. The Open Science Collaboration, as the project was called, used the same methods, similar sample sizes, and the same statistical analyses as the original researchers. Only 36% of the findings replicated. Among those that did replicate, the effect sizes were on average half as large as the original studies had reported. In the subfield of social psychology, the replication rate was even worse: approximately 25%. This was not a fringe exercise. The original studies had been peer-reviewed, published in top journals, and cited thousands of times. They were the foundation of textbooks, therapy practices, and policy recommendations. And nearly two-thirds of them could not be reproduced.

## p-Hacking: The Art of Finding Significance

The centerpiece of modern scientific publishing is the p-value -- specifically, the threshold of p < 0.05. A result below that line counts as "statistically significant" and publishable; a result above it usually never sees print. p-hacking is the practice of working an analysis until it clears the bar: testing many outcomes, dropping inconvenient data points, adding covariates, or collecting more participants until p dips under 0.05, then reporting only the version that "worked."
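To make the mechanics concrete, here is a minimal Python sketch of the multiple-comparisons version of p-hacking. Everything in it is simulated noise; the group sizes and the number of outcomes are illustrative assumptions, not figures from any study.

```python
# Toy p-hacking demo: run 20 comparisons on pure noise and keep the "winners".
# All data are simulated; there is no real effect anywhere in this script.
import math
import random

random.seed(42)

def two_sided_p(a, b):
    """Approximate two-sided p-value from a large-sample z-test."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(z) / math.sqrt(2))

n_per_group, n_outcomes = 200, 20
p_values = []
for _ in range(n_outcomes):
    treatment = [random.gauss(0, 1) for _ in range(n_per_group)]
    control = [random.gauss(0, 1) for _ in range(n_per_group)]  # identical distribution
    p_values.append(two_sided_p(treatment, control))

hits = [p for p in p_values if p < 0.05]
print(f"{len(hits)} of {n_outcomes} null comparisons came out 'significant'")
# Publish the hits, never mention the misses, and the record shows a "finding".
```

Run it with different seeds and, on average, about one comparison in twenty crosses the threshold despite there being no effect anywhere.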
## The Cancer Biology Problem

The replication crisis is not limited to psychology. In 2021, the Reproducibility Project: Cancer Biology published its findings after attempting to replicate 53 landmark studies from top cancer journals. The results were devastating:

- Only 5 studies could be fully replicated.
- The project was unable to complete replication attempts for many studies because the original authors could not provide sufficient methodological detail, materials, or data.
- Effect sizes in the successful replications were consistently smaller than in the original papers.

Cancer research drives billions of dollars in drug development. Clinical trials are designed based on preclinical findings. If those findings are unreliable, the entire pipeline is compromised.

## The Stanford Prison Experiment

The Stanford Prison Experiment, conducted by Philip Zimbardo in 1971, is one of the most famous studies in psychology. It supposedly demonstrated that ordinary people, placed in positions of power, will naturally become abusive. The study has been cited over 28,000 times and appears in virtually every introductory psychology textbook. Recent investigations have revealed that the experiment was essentially scripted:

- Zimbardo and his research team actively coached guards to be abusive, contrary to claims that the behavior emerged naturally.
- A participant who appeared to have a breakdown, Douglas Korpi, later acknowledged he was faking his distress so he could leave the study early and prepare for an exam.
- Guards who did not behave abusively were not included in the published results.
- The BBC attempted a partial replication in 2002, and the guards did not become abusive -- they bonded with the prisoners.

The Stanford Prison Experiment is not an example of science discovering something uncomfortable about human nature. It is an example of a researcher producing the results he expected and calling it science.

## Ioannidis: Why Most Findings Are False

In 2005, John Ioannidis published a paper with a title that needs no explanation: "Why Most Published Research Findings Are False." His argument was mathematical:

- Most studies are underpowered (too few participants to reliably detect real effects).
- Most tested effects are small.
- There are many possible relationships to test, but only a few are true.
- Financial and career interests bias research toward positive findings.
- The combination of these factors means that most published findings are false positives.

Ioannidis estimated that in many fields, the probability that a published finding is true is well below 50%. His paper has been cited over 12,000 times and is considered one of the most important methodological papers of the 21st century.
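The arithmetic behind this is compact. Below is a minimal Python sketch of the paper's positive predictive value (PPV) formula, the probability that a "significant" finding is actually true, given the pre-study odds R, the power, and the significance threshold. The formula (ignoring the paper's bias term) is from Ioannidis; the specific inputs are illustrative assumptions, not values from the paper.

```python
# Positive predictive value of a "significant" finding, per Ioannidis (2005),
# with the bias term omitted for simplicity.
def ppv(prior_odds: float, power: float, alpha: float = 0.05) -> float:
    """PPV = power * R / (power * R + alpha), where R is the pre-study odds."""
    return (power * prior_odds) / (power * prior_odds + alpha)

# Illustrative inputs: 1 in 10 tested relationships is true, 50% power.
print(ppv(prior_odds=0.10, power=0.5))  # 0.5  -> a published finding is a coin flip
# A more exploratory field: 1 in 100 relationships true, same power.
print(ppv(prior_odds=0.01, power=0.5))  # ~0.09 -> roughly 9 in 10 findings false
```

Add any bias toward positive results and the numbers only get worse, which is exactly the paper's point.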
## What This Means for "Studies Show"

The phrase "studies show" carries an authority that the underlying research often does not deserve. When you encounter a claim backed by "a study," consider:

- Has it been replicated? If the finding has not been independently reproduced, it is preliminary at best.
- Was it pre-registered? Pre-registration -- publicly posting the study design before data collection -- prevents p-hacking and selective reporting.
- What is the effect size? Statistical significance does not mean practical significance; a tiny effect can be "significant" with a large enough sample (see the simulation after this list).
- Who funded it? Industry-funded studies are significantly more likely to produce results favorable to the sponsor.
- Was the data shared? If the authors refuse to share their data, that is a red flag.
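The effect-size point is easy to demonstrate. A minimal simulation, with made-up numbers: a 0.2-point difference on an IQ-like scale (SD 15, Cohen's d of roughly 0.013) becomes wildly "significant" once the sample is big enough.

```python
# "Significant" is not "meaningful": a 0.2-point gap on an IQ-like scale
# (SD 15) reaches p < 0.05 given a big enough sample. Numbers are invented.
import math
import random

random.seed(0)
n = 200_000
a = [random.gauss(100.0, 15.0) for _ in range(n)]
b = [random.gauss(100.2, 15.0) for _ in range(n)]  # tiny but real difference

mean_a, mean_b = sum(a) / n, sum(b) / n
var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
se = math.sqrt(var_a / n + var_b / n)        # standard error of the difference
z = (mean_b - mean_a) / se
p = math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value

print(f"difference = {mean_b - mean_a:.3f} points, "
      f"d = {(mean_b - mean_a) / 15:.4f}, p = {p:.1e}")
```

The p-value says only that the difference is probably not zero; it says nothing about whether a difference of about one-seventy-fifth of a standard deviation matters to anyone.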
Science remains the best tool humanity has for understanding the world. But the system that produces and publishes science is run by humans with careers, funding, and reputations at stake. Those incentives distort the output. Knowing that is not anti-science. It is the beginning of scientific literacy.

They didn't ask if we wanted to know how fragile the evidence behind "studies show" really is. The replications tell the truth the originals could not.

_- The Department_