Experimentation

Introduction

Most software development is not done with experimentation in mind. Instead, most development is carried out as an exercise in craftsmanship. We build software the way we think it should be built, and we hope that it works. Guessing is the basis of this approach, and it is not a good way to build software.

Instead, we need to consciously experiment with our software: come up with an idea, then test whether it actually works. If we want to know whether a feature will be useful, let's design an experiment to find out. When we have a novel approach to some process, let's test that too. Our default should be experimentation, not guessing.

The Opposite of Science

For many years, organizations have been doing what is basically the opposite of science with user data. Storage was so cheap that large companies simply gathered every piece of information they could, then threw hypotheses at it until something stuck. This is a terrible way to experiment: we never know whether a result reflects the hypothesis, a stray correlation, or just random noise.

This is a bit less common due to improved data privacy laws, but it still happens.

Scientific Method

There are four key components to having an experimental mindset when building software:

  • Feedback: The first thing we need to identify is a way to measure the success of our changes. How are we going to tell if our changes are successful? Once we decide on a metric, we need to measure our baseline so that we can compare it to the results after our changes.
  • Hypothesis: We need to have a clear idea of what we are trying to accomplish. What is the change we are making, and why do we think it will be successful? Have a prediction of the outcome before you start.
  • Control the Variables: We need to make sure that we are only changing one thing at a time. This gives us confidence that the change we made is the cause of the results we see.
  • Experiment: Make the changes and measure the results. (A minimal sketch of this loop follows below.)
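To make the loop concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the metric function, the hypothesis, and the numbers are placeholders, not a real framework.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Experiment:
        hypothesis: str               # the prediction, written down before the change
        measure: Callable[[], float]  # the one success metric we agreed on
        baseline: float = float("nan")

        def record_baseline(self) -> None:
            # Feedback: measure before touching anything, so there is
            # something to compare against afterwards.
            self.baseline = self.measure()

        def evaluate(self) -> str:
            # Assumes "higher is better" for the chosen metric.
            after = self.measure()
            verdict = "supported" if after > self.baseline else "not supported"
            return f"{self.hypothesis}: {verdict} (baseline={self.baseline:.3f}, after={after:.3f})"

    def signup_conversion_rate() -> float:
        return 0.042  # stand-in for a real measurement

    exp = Experiment("shorter signup form will raise conversion",
                     measure=signup_conversion_rate)
    exp.record_baseline()
    # ... make exactly one change, then ...
    print(exp.evaluate())

Nothing here is sophisticated; the point is that the hypothesis and the metric are written down before the change is made.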

Measure First

Metrics and measurements are the most important part of the scientific method. If you can't measure the results of your changes, then you can't know whether they were successful. This is why it is so important to have good metrics in place. It is all too common for organizations to make grand predictions about the success of a feature, only to discover that they have no way to measure it.

Control the Variables

In order to be confident that our changes are the cause of the results we see, we need to control the variables. Real data is noisy and complex, and it is easy to see patterns where there are none. By controlling the variables, we can be more confident that the results we see are due to our changes.
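A tiny example of what this looks like in practice. Suppose we want to compare two sort implementations (my_quicksort is invented for illustration): we hold the machine, the input size, and the random seed constant, so the algorithm is the only thing that differs between runs.

    import random
    import time

    def my_quicksort(xs):
        # Naive quicksort, standing in for the candidate implementation.
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        return (my_quicksort([x for x in rest if x < pivot]) + [pivot]
                + my_quicksort([x for x in rest if x >= pivot]))

    def benchmark(sort_fn, seed: int = 42, n: int = 10_000) -> float:
        # A fixed seed means both runs sort identical data: the algorithm
        # is the only variable we allow to change.
        rng = random.Random(seed)
        data = [rng.random() for _ in range(n)]
        start = time.perf_counter()
        sort_fn(data)
        return time.perf_counter() - start

    print("baseline  :", benchmark(sorted))
    print("candidate :", benchmark(my_quicksort))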

Conducting Experiments

We are constantly trying to answer questions about our software.

  • Is this code free from defects?
  • Am I using the most efficient data structure?
  • Will this feature be useful?
  • Will this change improve performance?
  • Will this change improve the user experience?

While some questions might be subjective and up to personal preference, the vast majority can be answered with experiments.

Experiments in Development

Most of your experimentation will take place during development. Even simple steps in development are really tiny experiments. Will the code compile? Will the tests pass? We even make tiny hypotheses about what the code will do before we run it. Most developers are already doing this; they just don't think of it as experimentation.

Bug Fixes

I take a very experiment-driven approach to investigating bugs. I start with an idea about what could be causing the bug, and then I craft an experiment to test that hypothesis. This is usually a simple unit test, but it could also be running the code in a debugger, using a profiler, or even just adding some logging. I'm looking for the simplest way to disprove my hypothesis.
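As a sketch of what that looks like, suppose I suspect a quantity parser mishandles blank input. Both parse_quantity and the bug are invented for illustration:

    import unittest

    def parse_quantity(text: str) -> int:
        # The code under investigation (a stand-in for the real function).
        return int(text.strip() or 0)

    class BlankInputHypothesis(unittest.TestCase):
        def test_blank_input_parses_as_zero(self):
            # Hypothesis: the bug is triggered by all-whitespace input.
            # If this passes, the hypothesis is disproved and I move on to
            # the next candidate cause; if it fails, I've reproduced the
            # bug in the cheapest possible environment.
            self.assertEqual(parse_quantity("   "), 0)

    if __name__ == "__main__":
        unittest.main()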

By taking this approach, I can identify the root cause of issues much more quickly than if I just try to fix where I think the bug is.

Tests as Experiments

Automated tests are the best, and easiest, source of data that we have as developers. Getting to use them is also one of the biggest advantages we have over other engineering disciplines. Our experimentation platform, a computer, is the real environment in almost every way. We can run tests on our code in seconds, and we can run them as many times as we want. This makes it easy for us to gather incredible amounts of valuable data.

Any automated test to validate our code could be considered an experiment. We are testing a hypothesis about how our code should behave. However, the value of that experiment will vary depending on how closely the test is tied to the implementation. The more implementation details the test knows about, the less valuable the test is as an experiment.

Tests that are tightly coupled to the implementation are really just testing that the code does what the code does. They are not testing that the behavior of the code matches what is intended. An experiment should be based on some kind of hypothesis, and "the code doesn't crash" is a pretty lame one.

Better tests focus on the behavior of the code. They validate the interface without caring about the implementation details. These tests provide more value as experiments, and they are much easier to maintain. This is one of the main arguments for TDD: tests written before the code cannot be tightly coupled to the implementation, because the implementation doesn't exist yet. Tests written after the code can still avoid that coupling, but it takes more discipline.
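A small illustration of the difference, using a made-up Cart class:

    class Cart:
        """Hypothetical shopping cart, used only for illustration."""

        def __init__(self):
            self._items = {}  # sku -> quantity; an internal detail

        def add(self, sku: str, qty: int = 1) -> None:
            self._items[sku] = self._items.get(sku, 0) + qty

        def total_items(self) -> int:
            return sum(self._items.values())

    def test_coupled_to_implementation():
        # Breaks if the dict becomes a list, even though the observable
        # behavior is unchanged.
        cart = Cart()
        cart.add("apple")
        assert cart._items == {"apple": 1}

    def test_behavior():
        # States a hypothesis about the interface alone.
        cart = Cart()
        cart.add("apple")
        cart.add("apple", qty=2)
        assert cart.total_items() == 3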

Flaky tests are one of the biggest issues in experimentation. If your tests are not reliable, then you can't trust the results of your experiments. If I have an integration test that fails 10% of the time, then what does a failing test really imply about the changes I made? I certainly wouldn't assume that my changes caused the failure. I'd start by rerunning the test.
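The arithmetic backs up that instinct. With a 10% flake rate, red runs are routine even when the code is fine:

    # Probability that a test which spuriously fails 10% of the time
    # shows at least one failure across N runs of perfectly good code.
    flake_rate = 0.10
    for runs in (1, 5, 10):
        print(runs, f"{1 - (1 - flake_rate) ** runs:.0%}")
    # 1 -> 10%, 5 -> 41%, 10 -> 65%: a single failure tells us almost nothing.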

Experiments After Release

There are a lot of technical questions that are best answered after the code is released. This mostly happens when we can't effectively simulate the production environment during development.

  • Did switching from REST to GraphQL improve performance?
  • Did system stability improve after increasing the pool size?
  • Did the new error handling address the issues we were seeing?

These are all questions that can best be answered after the code is released. We'll conduct some local experiments to try to predict the outcome, but the real test is in production with real data.

In order for these experiments to even be possible, we need to have good logging and metrics in place. We need to be able to measure the performance of our system, the stability, the error rates, etc. Without these metrics, we can't know if our changes are successful. When getting ready to release a change, make sure you have a plan for how you are going to measure the success of that change before you ship it.
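One way to do that is to instrument the code path before shipping the change. A minimal sketch follows; the metric name, the fetch_orders stub, and the plain-logging setup are placeholders for whatever metrics pipeline you actually run.

    import logging
    import time
    from contextlib import contextmanager

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("metrics")

    @contextmanager
    def timed(metric: str):
        # Emit how long the wrapped block took, and whether it failed,
        # so the before/after comparison exists once the change ships.
        start = time.perf_counter()
        try:
            yield
            log.info("%s ok %.1fms", metric, (time.perf_counter() - start) * 1000)
        except Exception:
            log.info("%s error %.1fms", metric, (time.perf_counter() - start) * 1000)
            raise

    def fetch_orders():
        time.sleep(0.05)  # stand-in for the real REST (or GraphQL) call

    with timed("orders.fetch"):
        fetch_orders()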

Experiments in Product

Measuring the success of a feature is a bit more complex. We can't just run a test and see if the feature works. We need to measure the impact of the feature on the users. This is where A/B testing comes in.

A/B Testing

A/B testing is a way to compare two versions of a feature to see which one is more successful. This is done by showing one version to some users, and the other version to other users. We then measure the success of each version and compare the results.

Which users see which version is usually decided randomly. This is important because it helps to control the variables. If we showed one version to users in the US and the other to users in Europe, we would have a hard time knowing whether the results were due to the feature or to the location of the users.

A/B testing is a powerful tool, but it is also easy to misuse. It is important to have a clear hypothesis about what you are testing and a clear measurement of success. It is also important to have a large enough sample size: in many cases the impact of a feature is small, and detecting a small effect with confidence takes a lot of data.
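Two pieces of this are easy to get wrong and worth sketching: stable random assignment, and a sanity check on the result. Hashing the user id is a standard bucketing trick; the experiment name and the numbers below are made up.

    import hashlib
    from math import sqrt

    def assign_variant(user_id: str, experiment: str) -> str:
        # Hashing user id + experiment name gives an assignment that is
        # effectively random across users, stable for any one user, and
        # independent between experiments.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return "B" if int(digest, 16) % 2 else "A"

    # Illustrative results: 10,000 visitors per variant, pooled
    # two-proportion z-test on the conversion rates.
    p_a, p_b = 480 / 10_000, 530 / 10_000
    p = (480 + 530) / 20_000
    z = (p_b - p_a) / sqrt(p * (1 - p) * (2 / 10_000))
    print(f"lift={p_b - p_a:.3%}, z={z:.2f}")  # z ~ 1.61, below the ~1.96 cutoff

Note that even a half-point lift across 20,000 visitors doesn't clear the usual 5% significance bar here; that is exactly the sample-size trap described above.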

Empirical Decision Making

All of these experiments are designed to help you make better decisions. This is only possible if you are willing to change your mind based on the results of the experiments. This is the essence of empirical decision making.

The experiments are pointless if we aren't using them to guide our decisions.

Engineering in general is firmly rooted in the scientific method. Software engineering needs to apply the same rigor to its decision making.

Experimenting with Organizational Changes

Trying to make broad process changes in an organization is tough. It's hard to show the value of a change without implementing it, but it's hard to implement a change without showing the value. Using experiments can really help with this. You can try a change with a single team and see if it works. If it does, then you can expand the change to the rest of the organization. If it doesn't, then you can try something else.

I witnessed this happen as the mobile development teams switched to more modern programming languages. When I started, the mobile teams were using Objective-C and Java. The engineering leadership thought switching to Swift and Kotlin would speed up development.

They ran into a lot of hurdles initially. Developers didn't want to rebuild language fluency. Teams didn't like interrupting their workflow to switch languages. Even the product teams didn't like the idea that features would be delayed during the switch.

To determine if the switch was worth it, they ran a small experiment. They had a single team train in the new languages and rebuild a small subset of the app, then compared the resulting code.

They measured:

  • Code clarity and readability
  • Ease of testing
  • Speed of development
  • Performance of the app
  • Number of bugs
  • Developer satisfaction

The new languages won in every category. Armed with this data, they were able to convince the rest of the organization to switch and persuade the product teams to accept the delay.

Success Rate

We build features because we think they will be useful, but in many cases our ideas fail to improve key metrics. A study at Microsoft found that more than one third of the ideas tested failed to improve the metrics they were designed to improve. Studies at other organizations often reported even lower success rates.

We can't assume our changes will be successful. We need to test them, and act on the results of those tests.

Conclusion

Taking a more scientific approach to software development allows us to learn more quickly and make better decisions. We can't assume that our ideas are good; we need to test them. We need to be willing to change our minds based on the results of those tests.

Consider every change you make to be an experiment. Have a clear hypothesis about what you are trying to accomplish, and a clear measurement of success. Make sure you are controlling the variables, and that you are willing to change your mind based on the results.
