A science-based approach to debugging code

There are a ton of ways to solve any individual problem. In this post we’re going to talk about the one I’ve had the most success with: a science-based approach.

Generally, when a scientist goes to figure out how the world works, they start by observing the world. Then they make a guess about how it’s working. Then they set up some kind of experiment to collect evidence for or against their guess. Richard Feynman says it with more authority than I ever could.

Now I’m going to discuss how we would look for a new law. In general, we look for a new law by the following process. First, we guess it...Then we compute the consequences of the guess…And then we compare the computation results…with observations to see if it works.

If it disagrees with experiment, it’s wrong. In that simple statement is the key to science. It doesn’t make any difference how beautiful your guess is, it doesn’t matter how smart you are who made the guess, or what his name is…If it disagrees with experiment, it’s wrong. That’s all there is to it.

In the quote above, Feynman was talking about how physicists figure out how the world works. But that same way of thinking can be applied to problems in software. If you don’t believe me, let’s take a look at an example I took from a recent exchange with a colleague.

I was covering for the other team lead when one of the engineers came to me for help.

“I can’t figure this thing out. Can you help?”

“Sure,” I say. “What’s going on?”

He summarizes the problem:

“When reloading the large workspace after auto-arranging the windows, there is a hole where a window should be, and several windows are on top of one another when they shouldn’t be.”

A workspace is simply a collection of windows, their internal state, and their physical position on the screen. A native window is a type of window. Auto-arranging is a way of taking all of the windows on the monitor and forcing them to fill all of the available space on the monitor. Our product manages several different types of windows, and we were only seeing this bug manifest on one particular window type, after triggering auto-arrange.

What follows is a reconstructed transcript of the rest of our conversation:

D: It looks like the bug only presents on native windows. It was reported on the large workspace (40+ windows).
B: Huh. Why do you think that’s happening?
D: Well. When saving a large number of windows to the workspace, we are putting the windows in the wrong place because of a race condition in the code. There is a specific number of windows that makes this bug more likely to happen.
B: Alright. Why do you think that?
D: Well, I did a little experiment.

Before going over the results of his experiment, I want to pause. I’ve gotten a couple of very important bits of information from him. First, he told me his observations. Second, he told me his guess. And next, he’s going to tell me the experiments he ran to confirm/disconfirm his guess. I used this example in the talk I gave to the company. When I was prepping it, I asked for permission to use it. After he said yes, he told me that he’d heard a talk about using the scientific method to debug problems, and he was trying to apply that style to this problem. 👍👍

Now, for the results of his experiment and the rest of our conversation:

Scenario      Result
2 Windows     No bug
4 Windows     No bug
16 Windows    No bug
25 Windows    No bug
20 Windows    No bug
21 Windows    BUG

D: I think the problem is in [this other place in the code that you don’t need to understand to get the point I’m illustrating]. Something about a large number of windows is causing the bounds of the native window to be set incorrectly. It doesn’t happen on Electron windows. It looks like 21 windows is the tipping point.
B: Does it happen when you just auto-arrange a bunch of native windows?
D: No
B: I’m not convinced. I think it's in the workspace management code. When we recreate a workspace, we just take the bounds from storage and pass them off to [the other part of the code that is irrelevant]. If your bug doesn’t happen when you trigger auto-arrange, I doubt it’s in [the other part of the code].
D: Okay, let's look at the data in the workspace.
B: Data looks fine.
B: Wait. There are fractional pixels for some of the windows. That’s strange. Don’t we round those when we move windows?
D: Maybe?
B: Wait. All of your tests were on even numbers. Except 21. And 25. I don’t think it’s 21. I think it’s N, where N is a number that doesn’t divide evenly into the pixel-width of the monitor. Try 3.
D: It happens with 3.
D: Let me try 5. It happens with 5. Let me try 6. It doesn’t happen with 6. I think we’ve found the bug.
B: :)

So his initial guess was wrong. My initial guess was wrong. But by looking at the evidence for and against our hypotheses, we were able to figure out where the bug was. One of the reasons we were able to solve this problem so quickly is that back in the day I’d seen some weird behavior when we tried to move a window to a fractional place in space (e.g., x: 283.76). It turns out that the Win32 API expects window coordinates to be integers, not floating-point numbers.
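To make the fractional-pixel problem concrete, here is a minimal Python sketch. It is not the product’s code: the function names, the single-row layout, and the 1000-pixel monitor width are all assumptions picked for illustration, and the real auto-arrange logic is almost certainly more involved, so the sketch won’t reproduce the exact window counts from the conversation above. It only shows how a count that doesn’t divide evenly into the width yields fractional bounds, and one possible way to keep every bound an integer.

```python
# Hypothetical sketch: none of these names come from the real product.

def naive_tile_row(monitor_width: int, count: int) -> list[tuple[float, float]]:
    """Split one row into `count` equal windows. Returns (x, width) pairs."""
    width = monitor_width / count  # true division: may be fractional
    return [(i * width, width) for i in range(count)]


def integer_tile_row(monitor_width: int, count: int) -> list[tuple[int, int]]:
    """Same layout, but every x and width is a whole number of pixels.

    Leftover pixels are handed out one at a time to the first windows,
    so the row still fills the full width.
    """
    base, extra = divmod(monitor_width, count)
    bounds = []
    x = 0
    for i in range(count):
        width = base + (1 if i < extra else 0)
        bounds.append((x, width))
        x += width
    return bounds


def has_fractional_bounds(bounds) -> bool:
    """True if any coordinate or size is not a whole number."""
    return any(not float(value).is_integer() for pair in bounds for value in pair)


if __name__ == "__main__":
    assumed_width = 1000  # assumed monitor width in pixels
    for count in (3, 4):
        naive = naive_tile_row(assumed_width, count)
        fixed = integer_tile_row(assumed_width, count)
        print(count, has_fractional_bounds(naive), has_fractional_bounds(fixed))
```

Whether leftover pixels should go to the first windows, the last one, or be dropped entirely is a design choice; the point is only that whatever gets handed to the windowing API is an integer.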

But the other reason we were able to get to the bottom of this is the process he was following before we started talking. We only spoke for maybe 20 minutes. Before that, he only spent a couple of hours diagnosing the problem. There’s a lot that we can take from this example, but I really want to focus on his process.

First, he observed the problem. Second, he guessed what was causing the bug. Then he came up with an experiment to test his idea. And then he kept iterating. Once he felt like he had enough data, he looked at it and couldn’t make sense of it, so he came to me. Because he was already set up to approach the problem from a scientific angle, we were able to iterate rapidly and we ultimately solved the problem.
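The observe, guess, test loop can be written down in a few lines. The Python sketch below is only an illustration: the function names are made up, and encoding the “tipping point” guess as “workspaces with fewer than 21 windows should not show the bug” is my assumption about one consequence of that guess, not something anyone said verbatim. The observations are the window counts and results reported in this post.

```python
from typing import Callable, Optional

# Window count -> was the bug observed? Taken from the results in this post.
INITIAL_RUNS = {2: False, 4: False, 16: False, 25: False, 20: False, 21: True}
FOLLOW_UP_RUNS = {3: True, 5: True, 6: False}

# A guess maps a window count to a prediction: True (bug), False (no bug),
# or None when the guess makes no claim about that count.
Guess = Callable[[int], Optional[bool]]


def contradictions(guess: Guess, observations: dict[int, bool]) -> list[int]:
    """Return the window counts where the guess disagrees with what was seen."""
    found = []
    for count, bug_seen in observations.items():
        predicted = guess(count)
        if predicted is not None and predicted != bug_seen:
            found.append(count)
    return sorted(found)


def tipping_point_guess(count: int) -> Optional[bool]:
    # Assumed reading of "21 is the tipping point": below 21 windows
    # the bug should not appear. No claim is made for 21 and above.
    return False if count < 21 else None


if __name__ == "__main__":
    print(contradictions(tipping_point_guess, INITIAL_RUNS))
    print(contradictions(tipping_point_guess, {**INITIAL_RUNS, **FOLLOW_UP_RUNS}))
```

Writing the guess as a function forces you to say what it predicts, which is what makes it possible for an experiment to prove it wrong.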

Using this approach doesn’t guarantee instant or even quick success. You won’t always be right on your first guess. Or your twelfth. But if you keep your mind open to the possibility that you’re wrong and you set yourself up to iterate quickly, this approach will get you to where you need to be. It makes a ton of intuitive sense to me, and it’s been by far the most effective way that I’ve been able to solve problems.

The next several posts will be about some common debugging mistakes that I’ve observed. I’ve tried to make them as generic as possible so that they are useful no matter which approach you use to solve problems. First, we’ll discuss perhaps the most important part of solving a problem: defining it.

How to Be a Better Debugger - a Series

One of my favorite things to ask interviewees is this: imagine the ideal software engineer. This engineer has 3 discrete skills:

  1. Their ability to communicate with others.

  2. Their ability to debug code.

  3. Their technical expertise and experience.

I then ask the candidate to rank those skills in order of importance. Then I ask them to rank themselves on each of the skills. The way the candidate answers the question tells you a little bit about how self-aware they are; it also tells you what the ideal software engineer looks like in their head. The order doesn’t matter so much as their rationale for the order.

I think the order above is correct. Also, note that I called them skills and not qualities. Skills are things that can be learned and improved. Communication is at the top because of how difficult it is to learn to do well. Yes, you can become a better communicator. But I think the results depend a little more on aptitude than the other two on the list. As for expertise and experience - I think it’s nice to have. But you can gain experience by doing a job poorly. I have experience in woodworking. I’m not a good woodworker. Debugging — debugging is huge. Depending on the maturity level of the project, I’d wager that you spend 35%-75% of your coding time doing some kind of debugging. In an interview, I’m more interested in the person who tells me about an interesting problem that they solved than the person who rambles on about why Redux is clearly the best way to do state management in React apps.

That might be because I’m self-taught. I got into this field because I enjoyed the problems. I never studied computer science or software engineering, so early on in my career I wrote a lot of code that didn’t work. Because of that, I got decent at figuring out why stuff wouldn’t work. But it wasn’t until I got onto my current project that I really honed my skills as a debugger.

I’m currently in a team lead role on a project I helped start. It’s a multi-process JavaScript framework and architecture designed for creating enterprise-grade workflows. It’s very complex. Any given problem can be in the DOM, in the layer above, in the layer below, or somewhere in the communication layer. The product itself is only about 3 years old. We currently have 11 very smart engineers working on it, most of whom have been with us for less than a year. Because the problem space is so big, so complex, and so novel, I often end up serving as pair-programmer, rubber duck, and observer. Over the course of several months, I found myself giving the same or similar advice to different people. At some point, I had an epiphany in the form of a series of thoughts.

“Not everyone solves problems like I do.”
“😱”
“They’d be more efficient if they did.”
“Maybe I should do a talk about debugging.”

A couple of weeks later, I did a talk about debugging for the whole company. I want to be clear - I don’t think that the approach I will outline is the best way to solve every problem. However, if you follow the general approach and avoid some of the pitfalls that we will go over, you will be markedly better at debugging code.

By the end of this series, you should feel confident enough to parachute into a section of code with nothing but a stack trace or a description of what should happen and what is happening. From there, you’ll be able to make a guess, gather data, test your guess, and iterate quickly.

Up next: A Scientificish Approach to Debugging Code.