On debugging, v2

Dear new developer,

I wrote about debugging a few weeks back. I wanted to get more concrete. One time a friend called in about his client. The client was getting doubled orders on their ecommerce site. That is, someone would order five widgets on their site. The system would have some kind of hiccup and there would be two orders, resulting in ten widgets being shipped. It didn’t happen every time. It didn’t have any discerneable pattern. There were no obvious changes to the system that would cause this.

The customers weren’t happy about this. The client wasn’t happy about this. My friend wasn’t happy about this. He had looked around and couldn’t find the issue. He wanted me to take a look.

I had never worked with this ecommerce system before. There was no staging environment. I was debugging in the dark. I didn’t even have a way to submit fake orders that wouldn’t be charged. What I had was server access and SQL database access and a list of the customers that had been double charged.

So I started looking around at the log files (on the command line, with grep and vi and all the other great unix tools) and noticed that something weird had happened. The apache logs indicated that the server was restarted very often. The times when the server was restarted also lined up with the double charges.

I looked at the server cron file and found that someone had added a line that restarted the apache web server regularly. I asked if anyone knew why this was happening; no one did. They didn’t have their system changes under version control, so I was ginger about making changes. But finally I decided to disable the restarting of the server and see if the double orders continued.

They didn’t.

So this was definitely not a hugely complex system, but this is an example of debugging in a live system. Lessons for me:

  • Define the problem
  • Know the finish line
  • Start with what you know
  • Take small steps
  • Notice anything “weird”