The Knight Capital Saga – How to Go Out of Business in 45 Minutes

An excellent blog post on the events that brought Knight Capital to the brink of collapse in just 45 minutes was forwarded to me.  Most of us in the industry have heard the story.  Vendors have used it as an example of why tools and processes are critical, and others have used it as an example of how badly things can go wrong.

With the release of the SEC report, we now have more information than we had before.  Looking back on my career, I can see how some of the events that occurred could quite easily have happened to me or to people I know.  For example:

  1. One server in a cluster not being updated.
  2. Developers reusing a flag that was no longer thought to be in use.
  3. Dead code coming back to life.
  4. Unusual messages from an application being ignored.

As I work for Serena and I’m writing about this, it shouldn’t be a surprise that we have tools that can address some of the issues that led up to Knight Capital’s near-collapse.  Serena Release Automation (SRA) can model every server in an environment, which helps ensure that code is deployed to all of them.  Serena Release Control can model the processes and approvals needed to ensure that all turnovers have been completed.
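
Whatever tool you use, the underlying check is simple: after a deployment, confirm that every server reports the same build.  Here is a minimal Python sketch of that idea.  It assumes each server can report a build identifier in some way; the host names, build identifiers and the `find_stragglers` helper are hypothetical and are not part of SRA.

```python
from typing import Dict, List


def find_stragglers(reported_builds: Dict[str, str], expected_build: str) -> List[str]:
    """Return the hosts whose reported build differs from the expected build."""
    return sorted(
        host for host, build in reported_builds.items() if build != expected_build
    )


# Hypothetical inventory: host name -> build identifier reported by that host.
reported = {
    "app-01": "build-42",
    "app-02": "build-42",
    "app-03": "build-41",  # the server that was missed
}

stragglers = find_stragglers(reported, "build-42")
if stragglers:
    print("Deployment incomplete on:", ", ".join(stragglers))
```

A check like this would typically run as the last step of a deployment and fail the release if the list is not empty.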

The clear message to me, though, is that tools alone can’t solve all release problems.  At Velocity NYC, I attended a great keynote by Zane Lackey of Etsy and Dan Kaminsky, a well-respected security researcher.  They talked about many aspects of security, including zombie code: code in your codebase that you thought was long dead but that is still reachable.  A lot of old code is still active, just waiting for something to trigger it.

As Zane pointed out, the good news is that for web apps, detecting zombie code can be quite easy: you can do it by analyzing logs that you already have.  Plus, once you have good release management processes and tools in place, including deployment automation, it is relatively easy to build such checks into your automated release process.
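
To make the log-analysis idea concrete, here is one way such a check might look in Python.  This is my own sketch, not the method presented in the keynote.  It assumes access logs in the common Apache/NGINX format and a hand-maintained list of the endpoints you believe are live; the function names and sample paths are hypothetical.

```python
import re

# Matches the request path and status code in a Common Log Format line, e.g.
# 1.2.3.4 - - [10/Oct/2013:13:55:36 -0700] "GET /orders HTTP/1.1" 200 2326
REQUEST_RE = re.compile(
    r'"(?:GET|POST|PUT|DELETE|HEAD|PATCH) (\S+) HTTP/[\d.]+" (\d{3})'
)


def served_paths(log_lines):
    """Collect paths (query strings stripped) that were answered without an error."""
    paths = set()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and int(match.group(2)) < 400:
            paths.add(match.group(1).split("?", 1)[0])
    return paths


def zombie_candidates(log_lines, supported_paths):
    """Paths that were served successfully but are not on the list of live endpoints."""
    return sorted(served_paths(log_lines) - set(supported_paths))
```

Anything this returns is only a candidate: it may be zombie code, or it may be an endpoint that is simply missing from your list.  Either way it is worth a look before the next release.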

How does this relate to the Knight Capital incident?  Engineers at Knight Capital repurposed an existing flag, which is something I would not recommend.  A new build that used the repurposed flag was installed on a cluster of servers, but one server was missed.  As a result, every server except one worked as expected.  After some troubleshooting, the code on the working servers was rolled back to the previous build.  That made the servers consistent, but the repurposed flag was still set, so code that should never have run was executed.  The rest is history.
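
A toy illustration of why reusing a flag is risky during a partial rollout follows.  This is not Knight Capital's code; the flag names, code paths and the `route` helper are all invented for the example.

```python
# Each build is modeled as a mapping from flag name to the code path it triggers.
OLD_BUILD = {"FLAG_A": "legacy_path"}         # dead feature still wired to FLAG_A
NEW_BUILD_REUSED = {"FLAG_A": "new_feature"}  # new feature reuses the old flag
NEW_BUILD_FRESH = {"FLAG_B": "new_feature"}   # new feature gets its own flag


def route(build, flag):
    """Return the code path a build takes for a flag; unknown flags are ignored."""
    return build.get(flag, "normal_path")


# Partial rollout: one server still runs the old build.
for name, build in [("updated server", NEW_BUILD_REUSED), ("missed server", OLD_BUILD)]:
    print(name, "with FLAG_A ->", route(build, "FLAG_A"))

# With a fresh flag, the missed server does not recognize it and stays on the normal path.
print("missed server with FLAG_B ->", route(OLD_BUILD, "FLAG_B"))
```

With the reused flag, the missed server wakes up the legacy path.  With a new flag, the same mistake in deployment is far less harmful, because the old build has nothing attached to it.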

There are many lessons to be learned from this, and not leaving unused code in your codebase is one of them.  Another is that a robust release process is a necessary part of keeping your business healthy.
