I was recently reminded of a presentation GoDaddy made way back at the 2013 Flash Memory Summit, in which I first heard the statement: “Failure is not an option — it is a requirement!” That certainly got my attention! It just sounded wrong.
In fact, this expression was used to describe a very pragmatic approach the company’s storage team had devised to determine the exact maximum load that could be supported by any piece of its storage system.
This is key, since, at the time, GoDaddy claimed to be the world’s largest web hosting service, with 11 million users, 54 million registered domains, over 5 million hosting accounts, and a 99.9% uptime guarantee (although the internal goal was 99.999%, or five nines).
The presenters outlined four stages of how validation processes had evolved:
- Stage 0: “If it ain’t broke, don’t fix it.” This is a reactive approach that addresses issues only as they arise, combining a lack of understanding of the workload with a penchant for purchasing to higher specifications than actually required, leading to high costs.
- Stage 1: “Test in production… and pray!” In this scenario, equipment is slowly ramped into production with expansion plans based on vendor specifications. In some cases this results in unexpected failures months after deployment, with highly visible disruptions.
- Stage 2: “Validation with freeware tools.” More sophisticated than the preceding scenarios, this one still has its problems. The available tools, including IOMeter, IOZone, Dbench, Fstress, and others, were designed for far smaller workloads than GoDaddy’s massive systems handle. Not only do they fail to resemble the actual load, but they prove cumbersome to use at this scale.
- Stage 3: “Validation with custom tests.” GoDaddy developed a test the company calls “SwiftTest” that was specially designed for the right kind of validation. The tool validates against full-scale operational loads using a realistic emulation of the company’s production workloads.
Most importantly, though, SwiftTest’s load is ramped up over the course of a few days to find the point where a new component will predictably fail. By deliberately causing these failures, the storage team at GoDaddy can accurately predict the conditions under which new resources will be required, without guesswork. They don’t over-buy, and they reduce storage system failures.
But the key point is that they hammer on a piece of equipment until it breaks, and use that knowledge to plan their resources. The system’s failure is key to this understanding. Failure is a requirement!
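To make the ramp-to-failure idea concrete, here is a minimal sketch in Python of how such a test might work. This is not GoDaddy’s SwiftTest, whose internals were not disclosed in the presentation; the test file, block size, latency objective, and step sizes are all assumptions chosen just to illustrate the technique of raising the offered load until the component misses its service-level objective, then recording the last load level it could sustain.

```python
# ramp_to_failure.py -- hypothetical sketch of a ramp-to-failure load test.
# Not GoDaddy's SwiftTest; all parameters and the workload are assumptions.
import os
import time
import random
import statistics

TARGET_PATH = "/tmp/ramp_test.dat"   # hypothetical file standing in for the device under test
FILE_SIZE = 64 * 1024 * 1024         # 64 MiB of test data
BLOCK_SIZE = 4096                    # 4 KiB random reads
LATENCY_SLO_MS = 5.0                 # assumed p99 latency objective
STEP_IOPS = 500                      # how much load to add at each step
STEP_SECONDS = 10                    # how long to hold each load level

def prepare_target():
    """Create the test file once so the reads have real data to hit."""
    if not os.path.exists(TARGET_PATH) or os.path.getsize(TARGET_PATH) < FILE_SIZE:
        with open(TARGET_PATH, "wb") as f:
            f.write(os.urandom(FILE_SIZE))

def run_step(target_iops):
    """Issue random reads at roughly target_iops for STEP_SECONDS; return p99 latency in ms."""
    latencies = []
    interval = 1.0 / target_iops
    deadline = time.monotonic() + STEP_SECONDS
    with open(TARGET_PATH, "rb", buffering=0) as f:
        while time.monotonic() < deadline:
            offset = random.randrange(0, FILE_SIZE - BLOCK_SIZE)
            start = time.monotonic()
            f.seek(offset)
            f.read(BLOCK_SIZE)
            elapsed = time.monotonic() - start
            latencies.append(elapsed * 1000.0)
            time.sleep(max(0.0, interval - elapsed))  # pace the requests to the target rate
    return statistics.quantiles(latencies, n=100)[98]  # approximate p99

def ramp_to_failure():
    """Raise the offered load step by step until the latency SLO is violated."""
    prepare_target()
    load = STEP_IOPS
    last_good = 0
    while True:
        p99 = run_step(load)
        print(f"{load} IOPS -> p99 latency {p99:.2f} ms")
        if p99 > LATENCY_SLO_MS:
            print(f"SLO violated; last sustainable load was {last_good} IOPS")
            break
        last_good = load
        load += STEP_IOPS

if __name__ == "__main__":
    ramp_to_failure()
```

A production version would of course issue I/O from many clients in parallel, replay a realistic mix of operations rather than uniform random reads, and run the ramp over days rather than minutes, but the underlying idea is the same: push until it breaks, and note where that happens.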
As of the time of the presentation this approach had been successfully applied to multiple parts of the storage system: SSDs, caching & tiering software, commodity hardware, compression, and deduplication systems.
It’s been over five years since the company made this presentation, but the approach is still commendable. Sadly, when The SSD Guy asks system administrators about their workloads, he finds that Stage 0 continues to be the most prevalent by a very wide margin.