What is Empirical Evidence?
Check out this simple guide to empirical evidence and how to tell high-quality research from poor-quality research, using examples from the auto and education industries.
Empirical evidence is information that researchers generate to help uncover answers to questions that can have significant implications for our society.
Take seatbelts. Prior to their invention, people were killed or maimed in what today we would think of as minor traffic accidents. So smart engineers put their heads together to try to do something about it.
Let’s try tying people down! Let’s change what the steering wheel is made of! Let’s put an exploding bag of air in the steering wheel! (Imagine how crazy that sounded in a pitch meeting.) These all seem like reasonable ideas (well, except that exploding airbag one), so how do we know which one we should do?
The answer is to generate and weigh empirical evidence.
Theory vs. Empirical Evidence
One might have a theory about how something will play out, but what one observes or experiences can be different from what a theory might predict. People want to know the effectiveness of all sorts of things, which means they have to test them.
Social scientists produce empirical evidence in a variety of ways to test theories and measure the ability of A to produce an expected result: B.
Usually, researchers collect data through direct or indirect observation, and they analyze these data to answer empirical questions (questions that can be answered through observation).
Let’s look at our car safety example. Engineers and scientists equipped cars with various safety devices in various configurations, then smashed them into walls, poles and other cars and recorded what happened. Over time, they were able to figure out what types of safety devices worked and which ones didn’t. As it turns out, that whole airbag thing wasn’t so crazy after all.
They didn’t get everything right immediately. For instance, early seatbelts weren’t retractable. Some airbags shot pieces of metal into passengers. But, in fits and starts, auto safety got better, and even though people are driving more and more miles, fewer and fewer are dying on the road.
How Gathering Empirical Evidence in Social Science is Different
Testing the effects of, say, a public policy on a group of people puts us in the territory of social science.
For instance, education research is not the same as automotive research because children (people) aren’t cars (objects). Education, though, can be made better by attempting new things, gathering data on those efforts, rigorously analyzing that data and then weighing all available empirical evidence to see if those new things accomplish what we hope they do.
Unfortunately, the “rigorously analyzing” bit is often missing from education research. In the labs of automobile engineers, great care is taken to change only one bit of the design (a variable) at a time so that each test isolates the individual factor that is making a car more or less safe. OK, for this test, let’s just change the material of the steering wheel and keep everything else the same, so we’ll know if it is the wheel that is hurting people.
Comparing Apples with Apples
In social science and especially in education, trying to isolate variables is challenging, but possible, if researchers can make “apples-to-apples” comparisons.
The best way to get an apples-to-apples comparison is to perform something called a randomized control trial (RCT). You might have heard about these in relation to the testing of medicine. Drug testing uses RCTs all the time.
In an educational RCT, a randomized lottery divides students into two groups. One group receives whatever the educational “treatment” is (a new reading program, a change in approach to discipline, a school voucher, etc.) while the other does not. Researchers compare the results of those two groups and estimate the “treatment” effect. Because chance alone decides who lands in which group, this approach gives us confidence that the observed effect is caused by the intervention rather than by other factors.
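For readers who like to see the logic spelled out, here is a minimal sketch of a lottery-based comparison in Python. Everything in it is made up for illustration: the function name run_lottery_trial, the score scale, the 50/50 lottery and the simulated 5-point effect are all assumptions, not figures from any real study.

```python
import random
import statistics

def run_lottery_trial(n_students=2000, true_effect=5.0, seed=0):
    """Simulate a hypothetical educational RCT and estimate the treatment effect."""
    rng = random.Random(seed)
    treated_scores, control_scores = [], []
    for _ in range(n_students):
        # Baseline score stands in for everything we cannot control
        # (family background, motivation, prior schooling, and so on).
        baseline = rng.gauss(50.0, 10.0)
        # The lottery: a coin flip decides who is offered the treatment.
        if rng.random() < 0.5:
            treated_scores.append(baseline + true_effect)
        else:
            control_scores.append(baseline)
    # Random assignment makes the two groups alike on average, so the
    # difference in mean outcomes estimates the treatment effect.
    estimate = statistics.mean(treated_scores) - statistics.mean(control_scores)
    return estimate, len(treated_scores), len(control_scores)

estimate, n_treated, n_control = run_lottery_trial()
print(f"Treated: {n_treated}, control: {n_control}")
print(f"Estimated effect: {estimate:.2f} points (simulated true effect: 5.00)")
```

The estimate will not land exactly on the simulated effect, because the two groups differ a little by luck of the draw. Larger samples shrink that gap.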
RCTs are not always possible. Sometimes researchers can get close by using quasi-random events that separate kids into two groups. School district boundaries drawn along rivers or creeks can split a community more or less by chance. Birthday cutoffs for preschool place a child born on August 31st in one grade and a child born on September 1st in another, even though there is basically no difference between them. Depending on the exact nature of the event, these can be known as “regression discontinuity” or “instrumental variable” analyses, and they can be useful tools to estimate the effects of a program.
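To make the birthday-cutoff idea concrete, here is a rough sketch of a regression discontinuity estimate on simulated data. The helper names (fit_line, rd_estimate), the 60-day window and the simulated 4-point jump are all hypothetical choices for illustration. The sketch fits a straight line to the children on each side of the cutoff and measures the gap between the two lines at the cutoff itself.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_estimate(days, scores, bandwidth=60):
    """Gap between the two fitted lines at the cutoff, using only nearby children."""
    left = [(d, s) for d, s in zip(days, scores) if -bandwidth <= d < 0]
    right = [(d, s) for d, s in zip(days, scores) if 0 <= d <= bandwidth]
    # Each intercept is the fitted score at d = 0, which is the cutoff.
    left_at_cutoff, _ = fit_line(*zip(*left))
    right_at_cutoff, _ = fit_line(*zip(*right))
    return right_at_cutoff - left_at_cutoff

# Simulated data (an assumption): d is how many days before the cutoff a
# child was born, so d >= 0 means the child made the cutoff.
rng = random.Random(1)
TRUE_JUMP = 4.0
days, scores = [], []
for _ in range(20000):
    d = rng.randint(-180, 180)
    jump_if_eligible = TRUE_JUMP if d >= 0 else 0.0
    days.append(d)
    scores.append(50.0 + 0.02 * d + jump_if_eligible + rng.gauss(0.0, 5.0))

jump = rd_estimate(days, scores)
print(f"Estimated jump at the cutoff: {jump:.2f} (simulated true jump: 4.00)")
```

Only children within 60 days of the cutoff enter the estimate, so the answer describes children near the cutoff rather than everyone.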
If they have data from before and after a treatment, researchers can also follow the individual children who receive it to see how each child’s educational trajectory changes over time. These are known as “fixed effects” analyses.
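A small simulation can show why following the same child over time helps. In this sketch, which rests entirely on made-up assumptions (lower-scoring children are more likely to get the treatment, everyone gains 2 points from one period to the next, and the treatment adds 3 points), a simple comparison of treated and untreated children after the fact is badly misleading. Comparing each child's own change removes that child's fixed starting level, and subtracting the untreated children's change removes the growth everyone shares.

```python
import random
import statistics

rng = random.Random(2)
TRUE_EFFECT = 3.0  # simulated treatment effect (an assumption)
treated_after, untreated_after = [], []
treated_change, untreated_change = [], []

for _ in range(2000):
    # Each child has a fixed underlying level that never changes.
    child_level = rng.gauss(50.0, 10.0)
    # Assumption: lower-scoring children are more likely to be treated.
    treated = child_level + rng.gauss(0.0, 5.0) < 50.0
    before = child_level + rng.gauss(0.0, 3.0)
    after = (child_level + 2.0  # growth shared by every child
             + (TRUE_EFFECT if treated else 0.0)
             + rng.gauss(0.0, 3.0))
    if treated:
        treated_after.append(after)
        treated_change.append(after - before)
    else:
        untreated_after.append(after)
        untreated_change.append(after - before)

# Naive comparison: mixes the treatment effect with who got treated.
naive = statistics.mean(treated_after) - statistics.mean(untreated_after)
# Within-child comparison: each child's fixed level cancels out of the change.
within = statistics.mean(treated_change) - statistics.mean(untreated_change)

print(f"Naive treated-vs-untreated gap: {naive:.2f}")
print(f"Within-child estimate: {within:.2f} (simulated true effect: 3.00)")
```

With only two time periods, this within-child comparison is the simplest form the approach can take. Real analyses typically use more periods and a regression model.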
All three of these—randomized control trials, regression discontinuity analyses and fixed effects analyses—have their drawbacks.
Very few outside events are truly random. If, as regression discontinuity analysis often does, researchers only look at children just above or just below the cutoff, or, as fixed effects analysis often does, researchers look at only those children who switch from one school to another, those children might not be representative of the population. How would an intervention affect kids who are not close to a cutoff or border? Or kids who do not switch schools?
In the SlideShare below, we present empirical evidence based on rigorous research on private school choice programs as an example of how we, as academics and researchers ourselves, identify and characterize the high-quality empirical evidence in a given area of study.
A Couple of Considerations
It’s a lot to wade through, so before you do, we’d like to offer two notes.
First, it is always important to understand the tradeoffs between internal and external validity.
Internal validity refers to how well a study is conducted—it gives us confidence that the effects we observe can be attributed to the intervention or program, not other factors.
For example, when the federal government wanted to know if Washington, D.C.’s school voucher program increased students’ reading and math test scores, researchers took the 2,308 students who applied for the program and randomly assigned 1,387 to get vouchers and 921 not to. They then followed the two groups over time, and when they analyzed the results, they could reasonably conclude that any differences were due to the offer of a voucher, because that is the only thing that was different between the two groups and they were different only because of random chance. This study had high internal validity.
External validity refers to the extent to which we can generalize the findings from a study to other settings.
Let’s think about that same study. The D.C. program was unique. The amount of money that students received, the regulations that participating schools had to agree to, the size of the program, its politically precarious situation and numerous other factors set that program apart from others. On top of that, Washington, D.C. is not representative of the United States as a whole demographically, politically or in really any way we can possibly imagine. As a result, we have to be cautious when we try to generalize the findings. The study has lower external validity.
To combat issues around lower external validity, researchers can collect and analyze empirical evidence on program design to understand its impact. We can also look at multiple studies to see how similar interventions affect students in different settings.
Second, respecting and using research does not mean endorsing technocracy. Research and expertise are incredibly useful. When you get on an airplane or head into surgery, you want the person who is doing the work to be an expert. Empirical evidence can help us know more about the world and be better at what we do. But we should also exercise restraint and humility by recognizing the limits of social science.
Public policy involves weighing tradeoffs, and social science cannot do that weighing for us. Social science can tell us that a program increases reading scores but also increases anxiety and depression in children. Should that program be allowed to continue? Ultimately, that comes down to human judgment and values. That should never be forgotten.
With that, we hope you found this article helpful. Please feel free to reach out with any questions by emailing firstname.lastname@example.org or posting your question in the comments section below.