Research design in practice

TL;DR: Business decisions should be supported by causal evidence. Running experiments is the most reliable way to get at causality, but their costs and benefits need to be balanced.

Before-after comparison and the beauty of difference-in-differences

Consider that we want to understand whether the productivity of a business unit has changed since we introduced a new technology (“AI” or “robot” for simplicity). First, we need to define what change means and relative to what; in other words, we need to clarify the level of comparison. A natural first comparison is the productivity of the same unit before the implementation of AI. Say we compare the unit’s productivity in the 16 weeks before the AI arrives to its productivity in the 16 weeks thereafter. When done with multiple units, we can quantify the average change. We call this research design an event study.
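
To make the calculation concrete, here is a minimal sketch in Python, assuming a hypothetical panel with one row per unit and week and columns named unit, week, productivity, and adoption_week (all names are illustrative, not taken from any specific system):

    import pandas as pd

    def event_study_change(df: pd.DataFrame, window: int = 16) -> float:
        """Average before/after change in productivity across adopting units.

        Assumes columns: unit, week, productivity, adoption_week (illustrative names).
        """
        changes = []
        for _, g in df.groupby("unit"):
            t0 = g["adoption_week"].iloc[0]
            # Mean productivity in the `window` weeks before and after adoption.
            before = g.loc[g["week"].between(t0 - window, t0 - 1), "productivity"].mean()
            after = g.loc[g["week"].between(t0, t0 + window - 1), "productivity"].mean()
            changes.append(after - before)
        return sum(changes) / len(changes)
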

Event study designs are not ideal because something else might have happened at the same time as the introduction of the AI. For example, local management might have changed, workers in the region might have gone on strike, or supply chain issues may have led to underutilization of capacity. Whether or not these things are observable in the data, they pose a problem: we cannot know whether changes in productivity happened because of the robots or because of the other things that happened at the same time. The change that we quantify in an event study design would be attributed entirely to the introduction of robots but is, in reality, the combined effect of robots and everything else that changed.

A research design that can help us in such a situation is the so-called difference-in-differences approach. We add another comparison: in addition to the before/after of the event study, we also compare a group of units that eventually receive robots to a group of units that do not. We can call the former the treatment group and the latter the control group. If the things that happen at the same time as the introduction of the robots also happen in the control group, we can separate the effect of robots from the effect of the other things. We essentially run two event studies: we compare productivity before/after in the treatment group as well as in the control group. When we subtract the before/after difference in the control group from the before/after difference in the treatment group, we remove the changes that are due to the other things. Hence the name difference-in-differences.
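
A minimal sketch of the two-by-two calculation, assuming a hypothetical data set with columns treated (1 for units that eventually receive robots), post (1 for observations after the introduction date), and productivity:

    import pandas as pd

    def diff_in_diff(df: pd.DataFrame) -> float:
        """Two-by-two difference-in-differences estimate.

        Assumes columns: treated (0/1), post (0/1), productivity (illustrative names).
        """
        means = df.groupby(["treated", "post"])["productivity"].mean()
        change_treated = means.loc[(1, 1)] - means.loc[(1, 0)]  # before/after, treatment group
        change_control = means.loc[(0, 1)] - means.loc[(0, 0)]  # before/after, control group
        # Subtracting the control change removes shocks common to both groups.
        return change_treated - change_control

In practice the same estimate is usually obtained from a regression with unit and time fixed effects, which also delivers standard errors, but the two-by-two table conveys the idea.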

The difference-in-differences approach is already much better but still has an important problem. What if units that introduce robots differ from the other units to begin with? For example, robots might be introduced to improve the productivity of less-than-average units, or they might be introduced in units that are already highly productive, for whatever reason. If such units would have developed differently from the control group even without robots, the difference-in-differences no longer tells us whether the difference comes from the robots per se. We could simply be measuring diverging trajectories of different types of units, where one type happens to have robots and the other does not. The tricky bit is that it is perfectly possible that there are no productivity effects of robots at all, even though our difference-in-differences estimate suggests there are.

Natural experiments: regression discontinuity design

What is a potential solution? A good way forward is to make use of so-called natural experiments. For example, imagine that the decision to implement robots in a business unit is made by aggregating unit-level information into a “robot-readiness” score that runs from 0 to 100, and that local managers follow a simple decision rule: implement a robot if the score is higher than 50. Now we compare the productivity of units with a score of 49 to the productivity of units with a score of 51. Why? The differences in underlying circumstances that make one business unit “robot-ready” but not the other can be assumed to be negligible; after all, we are comparing a 49-unit to a 51-unit, and their scores are barely different. Yet one group receives robots and the other does not, so any jump in productivity right at the threshold can be attributed to the robots. This is what is known as a regression discontinuity design.
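
A minimal sketch of this comparison, assuming hypothetical columns score and productivity and an arbitrary bandwidth of two points around the cutoff of 50:

    import pandas as pd

    def rdd_estimate(df: pd.DataFrame, cutoff: float = 50.0, bandwidth: float = 2.0) -> float:
        """Compare mean productivity just above vs. just below the cutoff.

        Units with a score above the cutoff received robots; units at or below did not.
        Assumes columns: score, productivity (illustrative names).
        """
        near = df[(df["score"] - cutoff).abs() <= bandwidth]
        above = near.loc[near["score"] > cutoff, "productivity"].mean()
        below = near.loc[near["score"] <= cutoff, "productivity"].mean()
        # The jump at the cutoff is attributed to the robots.
        return above - below

In practice one would fit separate local regressions on each side of the cutoff rather than comparing raw means, but the logic is the same.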

Randomized experiments or A/B tests

What is the best research design? The best way to find out how much robots – and only robots – affect productivity is to randomly select the units that implement the robots. Randomization solves all of the aforementioned problems, since whether or not a unit receives robots then depends neither on timing nor on its past performance or other characteristics. Randomization is the gold standard in science. In medical research, when we want to find out whether a drug can cure a disease, we give the drug to a random selection of 50% of lab mice and compare their health status to that of the other 50%. Randomization is of course not limited to scientific applications. In fact, randomized experiments are widely applied in practice and are known as A/B tests. In consumer-facing settings, such as user interface design, experimentation is common practice, especially in digital environments. The Harvard Business Review articles “The surprising power of online experiments” and “A refresher on A/B Testing” describe practical implementations in more detail.
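
A minimal sketch of random assignment and the resulting comparison, with hypothetical unit identifiers and column names:

    import random
    import pandas as pd

    def assign_robots(units: list, seed: int = 42) -> dict:
        """Randomly assign half of the units to receive robots."""
        rng = random.Random(seed)
        treated = set(rng.sample(units, k=len(units) // 2))
        return {u: (u in treated) for u in units}

    def average_treatment_effect(df: pd.DataFrame) -> float:
        """Difference in mean productivity between treated and control units.

        Assumes columns: treated (bool), productivity (illustrative names).
        """
        return (df.loc[df["treated"], "productivity"].mean()
                - df.loc[~df["treated"], "productivity"].mean())

Because assignment is random, this simple difference in means is an unbiased estimate of the effect of the robots.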

Implementing an experiment in a non-consumer-facing, non-online environment is of course more challenging, for example because prior expert knowledge might suggest that it makes more sense to implement robots in some units earlier than in others. However, this recreates the problem described above: it becomes very difficult to tell whether units with robots perform differently than units without robots because of the robots, or because of the characteristics that made them more eligible to receive robots in the first place. And the managerial question is, of course: do the benefits of the robots outweigh their costs? We need to determine the benefits precisely to make that calculation correctly.

A practical implementation of a randomized experiment would be to first select a group of units that should receive robots and then randomize the exact rollout timing within that group.
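
A minimal sketch of such a staggered rollout, with a hypothetical list of pre-selected units and candidate rollout weeks:

    import random

    def randomize_rollout(units: list, rollout_weeks: list, seed: int = 7) -> dict:
        """Assign each pre-selected unit a random rollout week.

        Early and late adopters then differ only by chance, so the
        not-yet-treated units serve as a clean control group.
        """
        rng = random.Random(seed)
        return {unit: rng.choice(rollout_weeks) for unit in units}

    # Example: 12 pre-selected units rolled out over weeks 1 to 4.
    schedule = randomize_rollout([f"unit_{i}" for i in range(12)], [1, 2, 3, 4])
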

It is also important to realize that randomization is only needed in a first “experimentation” stage. Data from the experiment can then be used to calibrate and optimize the rollout strategy, so that mistakes are avoided and costs are saved. These cost savings may easily outweigh the direct and indirect costs of the experiment. For example, with data from an experiment it is easy to build counterfactual predictions such as: how much would the productivity of a business unit with given characteristics change if robots were implemented there? These predictions can be built into a decision-support system that contrasts the projected benefits with the costs.
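
A minimal sketch of such a counterfactual prediction, assuming experimental data with hypothetical unit characteristics (here unit_size and baseline_productivity) and a plain linear model chosen purely for illustration:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    FEATURES = ["unit_size", "baseline_productivity"]  # hypothetical characteristics

    def fit_effect_model(df: pd.DataFrame) -> LinearRegression:
        """Fit productivity on unit characteristics plus the randomized robot flag.

        Because the robot flag was randomized, its coefficient has a causal
        interpretation. Assumes columns: FEATURES, robot (0/1), productivity.
        """
        return LinearRegression().fit(df[FEATURES + ["robot"]], df["productivity"])

    def predicted_gain(model: LinearRegression, unit: pd.DataFrame) -> float:
        """Predicted productivity change for a unit if robots were implemented there."""
        with_robot = unit[FEATURES].assign(robot=1)
        without_robot = unit[FEATURES].assign(robot=0)
        return float(model.predict(with_robot)[0] - model.predict(without_robot)[0])

With a plain linear model the predicted gain is the same for every unit; in practice one would add interactions or a more flexible learner to capture differences across units.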

Conclusion

Which method to choose requires balancing advantages and disadvantages.

It is quite difficult to work out the true causal effect of robots on productivity, and a careful research design can solve some of the issues. But the better the method, the higher the cost of implementing it. An event study is essentially costless, but not accurate. A comparative difference-in-differences study is better without much more investment. Exploiting natural experiments takes time, and a field experiment, while leading to the most accurate results, has opportunity costs. The big advantage of the non-experimental methods discussed here is that we can analyze data from the past and do not need to wait for data to be generated by the experiment.

Which research design to choose needs further discussion, taking practical implementation issues into account. A careful balancing of costs and benefits will be necessary.