An excellent blog post on the events that led to Knight Capital’s bankruptcy in just 45 minutes was forwarded to me. Most of us in the industry have heard the story. It has been used either by vendors as an example of why tools and processes are critical or by others as an example of how things can go very wrong.
With the release of the SEC report, we now have more information than we had before. Looking back on my career, I can see how some of the events that occurred could quite easily have happened to me or to other people I know.
As I work for Serena and I’m writing about this, it shouldn’t be a surprise that we have tools that can address some of the issues that led up to Knight Capital’s bankruptcy. Serena Release Automation (SRA) can model every server in an environment, ensuring that code is deployed to all of them. Serena Release Control can model the processes and approvals needed to confirm that every turnover has been completed, and so on.
The clear message to me, though, is that tools alone can’t solve all release problems. At Velocity NYC, I attended a great keynote by Zane Lackey of Etsy and Dan Kaminsky, a well-respected security researcher. They talked about many aspects of security, including zombie code, which is code that you long thought dead in your codebase but is still accessible. There is a lot of old code that is still active, just waiting for something to trigger it.
As pointed out by Zane, the good news is that for web apps, detecting zombie code can be quite easy and done by analyzing logs that you already have. Plus, once you have good release management processes and tools in place, including deployment automation, it is relatively easy to include such checks into your automated release process.
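To make the log-analysis idea concrete, here is a minimal sketch of one way to flag zombie endpoints in a web app: compare the routes the application defines against the routes that actually show up in access logs over some window. The route names, log lines, and log format below are all invented for illustration; real logs (and a real review window) would obviously be larger.

```python
# Sketch: detecting "zombie" endpoints by diffing defined routes against
# routes observed in access logs. All names and data are hypothetical.

import re

# Routes the application currently defines (e.g. scraped from its router).
defined_routes = {"/checkout", "/cart", "/legacy/old_flow", "/search"}

# A few made-up access-log lines covering the review window.
access_log = """\
10.0.0.5 - - [01/Aug/2012:09:30:01] "GET /search HTTP/1.1" 200
10.0.0.7 - - [01/Aug/2012:09:30:02] "POST /cart HTTP/1.1" 200
10.0.0.5 - - [01/Aug/2012:09:30:05] "POST /checkout HTTP/1.1" 200
"""

# Pull the request path out of each log line.
request_re = re.compile(r'"(?:GET|POST|PUT|DELETE) (\S+) HTTP')
seen_routes = set(request_re.findall(access_log))

# Anything defined but never requested is a candidate for removal.
zombie_candidates = defined_routes - seen_routes
print(sorted(zombie_candidates))
```

A check like this slots naturally into an automated release pipeline: if a route has had no traffic since the last release (or several releases), it gets flagged for review rather than silently carried forward.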
How does this relate to the Knight Capital incident? Engineers at Knight Capital repurposed an existing flag, something I would not suggest doing. A new build that used the repurposed flag was installed on a cluster of servers, but one server was missed. As a result, all servers except that one worked as expected. After some troubleshooting, the working servers were rolled back to the previous build. While this made the servers consistent again, the repurposed flag was still set, causing code that should never have run to execute. The rest is history.
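The danger of repurposing a flag can be sketched in a few lines. This is emphatically not Knight Capital’s actual code; the function and flag names are invented. The point is that the same flag value means two different things to two different builds, so during a mixed deployment, or after a rollback that leaves the flag set, the old build takes a path everyone believed was dead:

```python
# Hypothetical sketch of a repurposed flag across two builds.
# Names and logic are invented for illustration only.

def old_build(order, flags):
    # In the old build, "special_mode" activated long-dormant logic
    # that everyone believed was dead.
    if flags.get("special_mode"):
        return "DORMANT_PATH"
    return "NORMAL_PATH"

def new_build(order, flags):
    # The new build repurposed the same flag for a new feature.
    if flags.get("special_mode"):
        return "NEW_FEATURE_PATH"
    return "NORMAL_PATH"

# The flag is turned on for the new feature...
flags = {"special_mode": True}

# ...but on a server still running the old build (one missed during
# deployment, or all of them after a rollback), the dormant code runs:
print(new_build("order-1", flags))  # NEW_FEATURE_PATH
print(old_build("order-1", flags))  # DORMANT_PATH
```

A new, dedicated flag would have left the old build on its normal path no matter what, which is exactly why reusing flags is a bad idea.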
There are many, many lessons to be learned from this (not leaving unused code in your codebase being one of them). Another is that a robust release process is an essential part of keeping your business healthy.
Jonathan Thorpe is Product Marketing Manager for all things DevOps and Continuous Delivery at Serena Software. Previously Jonathan worked as a Systems Analyst at Electric Cloud, specializing in DevOps-related solutions. Jonathan holds a degree in Computing Systems from Nottingham Trent University.
[...] That got me thinking. If others in an organization can see obvious release management issues (often significant), why aren’t they being addressed with more urgency in many organizations we talk to? I’m guessing it’s because release management issues span multiple teams and that makes them hard to solve. It’s easier to ignore the issues until a major problem arises, such as the Knight Capital incident. [...]
[...] gets totally under control. We’ve seen the BART system grind to a halt after a failed update, Knight Capital go bankrupt from a bad release process, and countless other notable failures. While we never want to see a failure that reflects badly on [...]