A science-based approach to debugging code
There are a ton of ways to solve any individual problem. In this post we’re going to talk about the one I’ve had the most success with: a science-based approach.
Generally, when a scientist sets out to figure out how the world works, they start by observing the world. Then they make a guess about how it’s working. Then they set up some kind of experiment to collect evidence for or against their guess. Richard Feynman says it with more authority than I ever could:
Now I’m going to discuss how we would look for a new law. In general, we look for a new law by the following process. First, we guess it...Then we compute the consequences of the guess…And then we compare the computation results…with observations to see if it works.
If it disagrees with experiment, it’s wrong. In that simple statement is the key to science. It doesn’t make any difference how beautiful your guess is, it doesn’t matter how smart you are who made the guess, or what his name is…If it disagrees with experiment, it’s wrong. That’s all there is to it.
In the quote above, Feynman was talking about how physicists figure out how the world works. But that same way of thinking can be applied to problems in software. If you don’t believe me, let’s take a look at an example I took from a recent exchange with a colleague.
I was covering for the other team lead when one of the engineers came to me for help.
“I can’t figure this thing out. Can you help?”
“Sure,” I say. “What’s going on?”
He summarizes the problem:
“When reloading the large workspace after auto-arranging the windows, there is a hole where a window should be, and several windows are on top of one another when they shouldn’t be.”
A workspace is simply a collection of windows, their internal state, and their physical position on the screen. A native window is a type of window. Auto-arranging is a way of taking all of the windows on the monitor and forcing them to fill all of the available space on the monitor. Our product manages several different types of windows, and we were only seeing this bug manifest on one particular window type, after triggering auto-arrange.
What follows is a reconstructed transcript of the rest of our conversation:
D: It looks like the bug only presents on native windows. It was reported on the large workspace (40+ windows).
B: Huh. Why do you think that’s happening?
D: Well. When saving a large number of windows to the workspace, we are putting the windows in the wrong place because of a race condition in the code. There is a specific number of windows that makes this bug more likely to happen.
B: Alright. Why do you think that?
D: Well, I did a little experiment.
Before going over the results of his experiment, I want to pause. I’ve gotten a couple of very important bits of information from him. First, he told me his observations. Second, he told me his guess. And next, he’s going to tell me the experiments he ran to confirm/disconfirm his guess. I used this example in the talk I gave to the company. When I was prepping it, I asked for permission to use it. After he said yes, he told me that he’d heard a talk about using the scientific method to debug problems, and he was trying to apply that style to this problem. 👍👍
Now, for the results of his experiment and the rest of our conversation:
Scenario | Result |
---|---|
2 Windows | No bug |
4 Windows | No bug |
16 Windows | No bug |
25 Windows | No bug |
20 Windows | No bug |
21 Windows | BUG |
D: I think the problem is in [this other place in the code that you don’t need to understand to get the point I’m illustrating]. Something about a large number of windows is causing the bounds of the native window to be set incorrectly. It doesn’t happen on Electron windows. It looks like 21 windows is the tipping point.
B: Does it happen when you just auto-arrange a bunch of native windows?
D: No
B: I’m not convinced. I think it's in the workspace management code. When we recreate a workspace, we just take the bounds from storage and pass them off to [the other part of the code that is irrelevant]. If your bug doesn’t happen when you trigger auto-arrange, I doubt it’s in [the other part of the code].
D: Okay, let's look at the data in the workspace.
B: Data looks fine.
B: Wait. There are fractional pixels for some of the windows. That’s strange. Don’t we round those when we move windows?
D: Maybe?
B: Wait. All of your tests were on even numbers, except 21 and 25. I don’t think it’s 21. I think it’s N, where N is a number that doesn’t divide evenly into the pixel width of the monitor. Try 3.
D: It happens with 3.
D: Let me try 5. It happens with 5. Let me try 6. It doesn’t happen with 6. I think we’ve found the bug.
B: :)
So his initial guess was wrong. My initial guess was wrong. But by looking at the evidence for and against our hypotheses, we were able to figure out where the bug was. One of the reasons we were able to solve this problem so quickly is that back in the day I’d seen some weird behavior when we tried to move a window to a fractional position (e.g., x: 283.76). It turns out that the Win32 API expects coordinates to be integers, not floating-point numbers.
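To make the fractional-pixel idea concrete, here is a minimal sketch in Python. It is not our product's auto-arrange code: the function names, the single-row layout, and the 1000-pixel monitor width are all assumptions made for illustration, so it will not reproduce the exact window counts from the table above. It only shows how an even split can produce fractional coordinates, and one way to keep every bound on a whole pixel.

```python
def naive_tiles(total_width: int, count: int) -> list[tuple[float, float]]:
    """Split a row into `count` equal tiles; x and width may be fractional."""
    width = total_width / count
    return [(i * width, width) for i in range(count)]


def integer_tiles(total_width: int, count: int) -> list[tuple[int, int]]:
    """Split a row into whole-pixel tiles with no gaps and no overlaps."""
    # Compute integer edges first, then derive each width from its two edges.
    # Adjacent tiles share an edge, and the last tile ends at total_width.
    edges = [(i * total_width) // count for i in range(count + 1)]
    return [(edges[i], edges[i + 1] - edges[i]) for i in range(count)]


def has_fractional_pixels(tiles) -> bool:
    """Return True if any x or width is not a whole number."""
    return any(float(value) != int(value) for tile in tiles for value in tile)


if __name__ == "__main__":
    MONITOR_WIDTH = 1000  # hypothetical width, not taken from the bug report
    for count in (3, 4):
        fractional = has_fractional_pixels(naive_tiles(MONITOR_WIDTH, count))
        print(count, fractional, integer_tiles(MONITOR_WIDTH, count))
```

Deriving widths from shared integer edges, rather than rounding each x and width separately, is one way to avoid holes and overlaps: the rounding error lands inside a tile instead of between two of them.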
But the other reason we were able to get to the bottom of this is the process he was following before we started talking. We only spoke for maybe 20 minutes. Before that, he had only spent a couple of hours diagnosing the problem. There’s a lot that we can take from this example, but I really want to focus on his process.
First, he observed the problem. Second, he guessed what was causing the bug. Then he came up with an experiment to test his idea. And then he kept iterating. Once he felt like he had enough data, he looked at it and couldn’t make sense of it, so he came to me. Because he was already set up to approach the problem from a scientific angle, we were able to iterate rapidly and we ultimately solved the problem.
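The observe, guess, experiment, iterate loop can be sketched as a tiny harness. This is only an illustration in Python: `run_sweep`, `failing_counts`, and `stand_in_trial` are hypothetical names, and the stand-in trial is a made-up rule (a 1000-pixel width that the window count does not divide evenly), not the real bug. In practice the trial would be whatever manual or automated reproduction step you already have. The point is that sweeping a contiguous range of inputs, rather than hand-picking a few, makes a pattern in the failures easier to spot.

```python
from typing import Callable, Iterable


def run_sweep(
    counts: Iterable[int], reproduces_bug: Callable[[int], bool]
) -> dict[int, bool]:
    """Run one trial per window count and record whether the bug appeared."""
    return {count: reproduces_bug(count) for count in counts}


def failing_counts(results: dict[int, bool]) -> list[int]:
    """Return the window counts where the bug appeared, in ascending order."""
    return sorted(count for count, bug in results.items() if bug)


def stand_in_trial(window_count: int, monitor_width: int = 1000) -> bool:
    # Hypothetical stand-in for a real reproduction step: pretend the bug
    # appears whenever the width does not divide evenly by the window count.
    return monitor_width % window_count != 0


if __name__ == "__main__":
    results = run_sweep(range(2, 9), stand_in_trial)
    print("bug reproduced for:", failing_counts(results))
```

Once the failing inputs are listed side by side, the next guess is usually about what they have in common, which is a new hypothesis to test.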
Using this approach doesn’t guarantee instant or even quick success. You won’t always be right on your first guess. Or your twelfth. But if you keep your mind open to the possibility that you’re wrong and you set yourself up to iterate quickly, this approach will get you to where you need to be. It makes a ton of intuitive sense to me, and it’s been by far the most effective way that I’ve been able to solve problems.
The next several posts will be about some common debugging mistakes that I’ve observed. I’ve tried to make them as generic as possible so that they are useful no matter which approach you use to solve problems. First, we’ll discuss perhaps the most important part of solving a problem: defining it.