Suppose you did the following experiment: You take 20 test subjects and divide them into two groups of ten people each. To the first group you give an experimental drug. To the second group you give a sugar pill. One week later all of the people in the first group are dead and all of the people in the second group are alive and healthy. Is it reasonable to conclude that the experimental drug is dangerous?
Just from the information I have given you, the answer is "no". The reason is that based solely on the information I have given there are many other possible explanations for the results. Here are just a few possibilities:
1. The test subjects died because they were being taken to the test administration area in a bus, and the bus crashed.
2. The groups were not randomly assigned to begin with: the test subjects were all terminally ill, while the control group was young and healthy.
3. The batch of drugs being used was accidentally contaminated by a poisonous compound during manufacture.
4. Both the test and control groups were terminally ill, the drug is safe but ineffective, and it was just random chance that all of the test group died before any of the control group.
I'll leave it as an exercise to come up with others. I've chosen these four alternative hypotheses because each one illustrates a different kind of pitfall that you can fall into when trying to apply the scientific method.
The first alternative seems like it would be easy to dispense with. If it were true there would be ample evidence: newspaper stories, photos of mangled bodies, death certificates. And yet, none of this evidence would actually "prove" that this hypothesis is true. It's possible to both fabricate and hide evidence, and so it is possible that the bus crash did occur even though no direct evidence can be found to show that it did. Likewise, it is possible that the accident did not occur even though one can produce evidence to show that it did. Eventually you have to apply Occam's razor and decide how far down the conspiratorial rabbit hole you are willing to go. The point is: nothing in science is ever absolutely proven. The best you can do is get to the point where all but one of the alternative hypotheses seem implausible to you. Ultimately, the threshold of plausibility is an individual decision.
The second alternative is an example of simple procedural error. It is a rather blatant example, but things just like this happen all the time, usually inadvertently. In fact, this kind of mistake is so common it even has a name: a "confounding factor." Sometimes confounding factors can lead to serendipitous discoveries, like when it was found that copper may contribute to Alzheimer's disease. But more often confounding factors are just that: confounding. You can't tell whether the results you got are due to the influence you were trying to test or to the confound, and you have to go back and redesign your experiment.
The third alternative hypothesis might seem farfetched, but things like this actually happen all the time too, especially in biology. Accidental contamination is like a confounding factor, except that it arises from a procedural error rather than from a mistake in the experimental design. Nowadays biological experiments rely on dozens, hundreds, or in some cases thousands of reagents, and if what's in the bottle isn't what you thought was in the bottle then your results may simply be a reflection of that contamination. (A friend of mine currently working on a biology Ph.D. actually discovered a contaminated reagent in her lab which invalidated many months' worth of work, not just hers, but all of her colleagues' as well. She was not a very popular person for a while.)
The fourth alternative seems the most farfetched of all, but it is not impossible. In fact, we can calculate exactly how likely this outcome would be if this hypothesis were true. (You might want to make an intuitive guess before I tell you the answer.) In the scenario I described it turns out to be quite small indeed: just a little under one in a million (1 in 1048576 to be precise). But it is extremely rare to get results this crisp. Suppose 8 of 10 test subjects had died, but also 3 of 10 control subjects. The odds of that happening by chance are quite a bit higher. Figuring out exactly what those odds are is complicated (and occasionally controversial). The study of how to compute such odds is known as statistics, and now you know why statistics is part and parcel of science: "it happened by chance" is always an alternative hypothesis for any result.
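Both numbers can be checked with a short computation. A minimal sketch: the first figure assumes a model the text doesn't spell out (each of the 20 subjects independently has a 50% chance of dying during the week, which is what reproduces 1 in 1048576), and the messier 8-versus-3 outcome is handled here with a one-sided Fisher's exact test, one standard choice among several:

```python
from math import comb

# Assumed model behind the "1 in 1048576" figure: each of the 20 subjects
# independently has a 50% chance of dying during the week.  The chance that
# all 10 drug recipients die while all 10 controls survive is then:
p_crisp = 0.5 ** 20
print(1 / p_crisp)  # 1048576.0 -- a little under one in a million

# For the messier outcome (8 of 10 test subjects dead vs. 3 of 10 controls):
# given that 11 of the 20 subjects died, how likely is it that a purely
# random split puts 8 or more of the deaths in the test group?
def prob_k_deaths_in_test_group(k, deaths=11, group=10, total=20):
    """Hypergeometric: P(exactly k of the deaths fall in the test group)."""
    return comb(deaths, k) * comb(total - deaths, group - k) / comb(total, group)

p_value = sum(prob_k_deaths_in_test_group(k) for k in range(8, 11))
print(round(p_value, 4))  # 0.0349 -- unlikely, but nowhere near one in a million
```

About a 3.5% chance, which is exactly the kind of borderline number that keeps statisticians employed.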
The possibility of chance results is also the reason why only predictions are considered scientifically valid, and never postdictions. To be considered a valid test of a scientific hypothesis, you have to state the hypothesis you are purporting to test before you do the experiment, and you are not allowed to change your mind afterwards. The reason for this is that if you do enough experiments you will sooner or later get what look like interesting results purely by chance. If you are allowed to go back and cherry-pick just the experiments that gave you the results you were looking for, it is not just likely that you will fool yourself into accepting false results, it is inevitable. (This is why all day-trading schemes eventually fail. If you take historical stock data and feed it to any model with enough parameters you will inevitably find a strategy that would have made money in the past. But this is almost certain to be pure chance, and the model will almost certainly have no predictive power. This is not to say that it won't make successful predictions -- it might, but the odds that it will are no better than a coin flip. In fact, you can generate any number of models that are successful on historical data. Half of them will be successful on future data as well. The trick is, there's no way to know which half until it's too late for that information to be useful. Of course, the same is true of models that *weren't* successful on historical data.)
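The day-trading point is easy to demonstrate in a few lines. A toy sketch with made-up data (no real market prices): the "market" is a fair coin, every "strategy" is a fixed list of random buy/sell guesses, and we cherry-pick whichever strategy did best on the historical record:

```python
import random

random.seed(1)  # arbitrary seed, for repeatability

DAYS, STRATEGIES = 20, 10_000

def up_or_down():
    return random.choice([-1, 1])

# The market is a fair coin, so every strategy is pure noise by construction.
history = [up_or_down() for _ in range(DAYS)]
pool = [[up_or_down() for _ in range(DAYS)] for _ in range(STRATEGIES)]

def profit(strategy, market):
    """Gain one unit for each day the strategy called the move correctly,
    lose one unit for each day it called it wrong."""
    return sum(guess * move for guess, move in zip(strategy, market))

# Cherry-pick the strategy with the best historical track record...
best = max(pool, key=lambda s: profit(s, history))
print("profit on historical data:", profit(best, history))  # looks brilliant

# ...and watch it revert to coin-flipping on data it has never seen:
future = [up_or_down() for _ in range(DAYS)]
print("profit on new data:", profit(best, future))
```

With 10,000 random strategies over 20 days, the historical "winner" typically calls 17 or more of the 20 days correctly, yet its expected profit on fresh data is exactly zero.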
This last phenomenon is actually very common, especially in today's pharmaceutical industry. Creating new drugs is so expensive that the temptation to focus on the experiments that show your drug is safe and effective is nearly impossible to resist (especially when your job description is to produce a good return for your shareholders by any means necessary). So debacles like Vioxx are not just unsurprising, they are all but inevitable.
It's even worse than that. The usual threshold for a publishable result is what statisticians call significance at the 95% level, that is, a result with less than a 5% chance of arising by luck alone if there were no real effect. But even if everything is done perfectly -- if there are no mistakes in the experimental design, no procedural errors, and no cherry-picking -- one in twenty experiments testing an effect that isn't actually there will still clear that bar by chance alone! There are tens, maybe hundreds of thousands of biology studies published every year. Many of them are almost certainly wrong.
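The "one in twenty" phenomenon can be reproduced by simulation. A sketch with arbitrary numbers: run many trials of the drug experiment in which the drug does nothing at all (both groups face the same 30% risk of death, a figure chosen purely for illustration), test each trial at the p < 0.05 level with a one-sided Fisher's exact test, and count how many clear the bar anyway:

```python
import random
from math import comb

random.seed(0)  # arbitrary seed, for repeatability

def one_sided_p(dead_test, dead_ctrl, n=10):
    """One-sided Fisher's exact test: given the total number of deaths,
    the chance that a purely random split puts at least dead_test of
    them in the n-person test group."""
    deaths, total = dead_test + dead_ctrl, 2 * n
    return sum(comb(deaths, k) * comb(total - deaths, n - k)
               for k in range(dead_test, min(n, deaths) + 1)) / comb(total, n)

TRIALS, RISK = 2000, 0.30  # 30% death risk is a made-up number
false_positives = 0
for _ in range(TRIALS):
    # The drug does nothing: both groups draw from the same distribution.
    dead_test = sum(random.random() < RISK for _ in range(10))
    dead_ctrl = sum(random.random() < RISK for _ in range(10))
    if one_sided_p(dead_test, dead_ctrl) < 0.05:
        false_positives += 1

print(f"'significant' results with no real effect: {false_positives}/{TRIALS}")
```

Because Fisher's exact test is conservative with groups this small, the simulated rate comes in somewhat under the nominal 5%, but the point stands: a nonzero fraction of flawlessly conducted null experiments still comes out looking "significant."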
So here's your final exam: suppose I give IQ tests to people around the world and consistently (more or less) find that white people score higher than black people. Is it reasonable to conclude that black people are genetically predisposed to have lower intelligence than white people?