A lot of people wonder what the big deal is about experiments. Why do people care if, say, some particular dietary supplement has been supported by randomized experiments or not? If taking two St. John's wort pills a day helps me out, and my friends say it helps them, why should anyone care what some guy in a lab coat thinks? To answer that question, we need to start by explaining science in a slightly different way than most people are used to.
Most people think about the history of science as a history of gathering ever-more facts, but it is probably better to view the history of science as the development of method. The big issue in science is what does or does not count as evidence to support a given position, and the history of science mostly involves ever-improving methods of gathering and agreeing upon "evidence." There are many accepted ways to gather different types of evidence, depending on the questions you are asking. However, when you are asking about causality, you want to look for well-designed experiments.
Let's stick with the example of St. John's wort (SJw), and think about how to examine the evidence. Whether or not it was ever purely true, we can imagine that at some point in history we simply relied on authority. In that context we might recall a legend in which SJw helped a hero recover, consult a medieval tome that discussed its effects on our humors, or ask a revered sage trained in the herbal arts. Those sources might or might not give advice that turned out to be accurate, but we wouldn't get very far if we followed up with "and how do you know that?" That is, we wouldn't have any evidence for the effectiveness of SJw, just some advice.
With that in mind, we might go out into the world to try to determine whether SJw is effective. We might, for example, go to parts of the world where SJw is regularly used in herbal preparations, and ask people who regularly take it how they feel. Let's say, for the sake of argument, that we find that people who take SJw are happier than those who don't, by some small amount. What do we make of that? Well, it is evidence, but it is not great evidence. Those people live in particular places, they have particular life histories, and there are countless ways in which their lives differ from the lives of whoever we compare them with. It is possible that SJw helps their mood, but it is also possible that any of those other differences is responsible for the results we found.
With that in mind, we might instead find people who have been using SJw for a while, and ask whether their moods changed after they started. Again, let's say they did. This is evidence too, and a bit better, but it has the same basic problem. If you felt bad for a few months, and then felt a bit better for a few months, that is not too surprising. So if we find people who started taking SJw at some point and started feeling better sometime thereafter, the SJw is only one of many possible explanations for that change. Also, if you are only studying long-term SJw users, you might be missing the people who stopped using it because they did not feel better.
We could improve the above scenario in a few obvious ways. We could, for example, find people who felt bad and did not take SJw, and see how often they started feeling better relative to our SJw group. Again, better evidence. But even if we saw a difference, we are back to not being sure about the cause. Maybe people who decide on their own to try SJw are different types of people than those who don't. Maybe they are more optimistic. Maybe they have different social support networks. Etc. We could also improve our study by tracking people from the time they start using SJw, so that we also gather data about people who stop using it because they find it ineffective. Better evidence again, but the same fundamental problem remains. There might be some other difference between the groups that explains both why some people don't stick with the SJw and why they don't feel better during the study period.
Surprisingly, we also can't fix this challenge by trying it out ourselves, or with a group of friends. If we are feeling bad, and we all get together and start using SJw, and then a while later we are feeling a bit better, our evidence is just as suspect. It is still the case that anything else that changed in our world during that same period could be responsible for the change in mood. Imagine, for example, if we started the test during winter, and we all felt a bit better come springtime. How many explanations could you quickly come up with other than our starting SJw in late January?
The studies above gather many sorts of useful information about the world. They yield many sorts of evidence regarding the relation between naturally occurring use of SJw and changes in mood, and that evidence supports many sorts of conclusions. In the end, however, it does not get us what we want. We need better methods of gathering evidence.
So, what happens in a randomized experiment? We might, for example, take 200 people who are feeling bad and flip a coin as each one enters our lab to take part in our study. If the coin lands heads, the participant gets SJw pills to take for three months. If the coin lands tails, the participant gets nothing (or perhaps a placebo pill), and we track how they do for three months as well. Let us say that, matching the studies above, we find that the SJw group has a more positive mood after three months than the "control" group. What has this gained us?
Recall the problem with the "observational" methods above, in which we simply found people already using the supplement: that evidence was suspect because the people who choose to use the supplement might differ from the people who don't. Recall the problem with our informal self-test as well: that evidence was suspect because anything else that changed over the same period could have been responsible for our results. Randomization fixes both problems. The participants' own choices played no role in which group they ended up in, and the world changed in the same way (on average) for people taking the supplement as for those in the control condition.
What does all this gain us? It gives us "causality." It lets us be confident that any difference we find between the two groups was caused by the SJw, because that is the only systematic way in which the two groups differed.**
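To make the balancing effect of randomization concrete, here is a small simulation sketch (Python; the trait name and all numbers are hypothetical, chosen just for illustration). Each of 200 simulated participants gets a hidden trait the experimenters cannot observe, groups are assigned by coin flip, and we check that the groups come out similar on the hidden trait anyway.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical participants: each has a hidden trait (say, optimism)
# that we, as experimenters, never get to see or select on.
participants = [{"optimism": random.gauss(0, 1)} for _ in range(200)]

# Coin-flip assignment: "heads" -> SJw group, "tails" -> control.
sjw_group, control_group = [], []
for p in participants:
    if random.random() < 0.5:  # heads
        sjw_group.append(p)
    else:
        control_group.append(p)

def mean_optimism(group):
    return sum(p["optimism"] for p in group) / len(group)

# Because assignment ignores the hidden trait, the two groups end up
# nearly identical on it (on average), so a later mood difference can
# be attributed to the treatment rather than to who chose what.
print(len(sjw_group), len(control_group))
print(abs(mean_optimism(sjw_group) - mean_optimism(control_group)))
```

The printed gap in average hidden optimism between the groups is small, and it shrinks further as the sample grows; that is the whole trick of randomization.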
As a final methodological note, in studies involving people you can often do even a bit better by running "double-blind" randomized experiments. This matters because there is a lot of evidence that people's expectations can change the outcomes of experiments. In a double-blind study the participants do not know which condition they are in (eliminating the effects of their optimism or disappointment), and the experimenters who interact with the participants do not know which condition any particular person is in (so they cannot exert inadvertent social influence).
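One way to picture the bookkeeping behind blinding: a third party holds the allocation table, and everyone actually running the study sees only opaque bottle codes. A minimal sketch of that arrangement (Python; participant IDs and bottle codes are hypothetical):

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

participants = [f"P{i:03d}" for i in range(8)]

# A third party (not the experimenters) randomizes and pre-packs
# coded bottles; a code carries no information about its contents.
codes = [f"bottle-{n}" for n in range(1000, 1000 + len(participants))]
random.shuffle(codes)

allocation = {}      # secret: sealed until all mood data are collected
blinded_label = {}   # the only thing participants and staff ever see
for pid, code in zip(participants, codes):
    allocation[pid] = "SJw" if random.random() < 0.5 else "placebo"
    blinded_label[pid] = code

# During the study, data are recorded against blinded_label values;
# allocation is opened only at analysis time to compare the groups.
print(blinded_label)
```

Since the bottle codes are shuffled independently of the allocation, nothing a participant or experimenter sees during the study reveals who is in which condition.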
To wrap up: That you think a given dietary supplement helps you is one piece of data. But it is first and foremost a piece of data about what you think. Even if you have gotten better since starting the supplement, that change could have been caused by anything else that changed around the same time, or it could even be the result of the increased positive outlook you have from thinking the supplement will work. Surveying all your friends doesn't change the fundamental inability to determine whether any effects were caused by the supplement. Even gathering extensive data about people who use the supplement and comparing them to people who do not doesn't get you past all the other ways in which those people differ. The best way to get at causality is a randomized experiment, preferably a "double-blind" one. And that is why experiments matter so much in science: they are the method that allows us to answer questions about causal effects in the world around us.
By the way, I picked St. John's wort as the example because the Mayo Clinic has actually stated that there are well-designed experiments supporting its use for treating mild depression. I thought it might be nice to use an example that actually holds up to all this scrutiny.
** We can improve this method a bit more by talking about "statistically significant" differences... but that is a discussion for a different time.