We goofed up. It was avoidable.
This is not the first goof-up we have faced in a short contest. We have been there and done that. Especially the starting load during a Cook-Off has been a long-standing problem for us. Sometimes the DB was unable to handle our complicated queries, while at other times we were hit by SYN flood attacks! The result each time used to be the same: contestants pissed off at the start of the contest, soothed by the end by a usually good problem set (for which we can take no credit; that goes to our awesome problem setters).
We had tried to replicate the load at the start and fix things in our dev environment, but something or the other got left out. There used to be a time when we would find an issue in a Cook-Off, fix it, and then wait for the next Cook-Off to find the next one. It went on like that for a while, until we took control of things, changed our application architecture, and moved to a new infrastructure!
A couple of interns joined the team just a few days ago, and I was narrating this whole chain of events to them: how we had to be extremely alert, almost praying for the first 15 minutes that nothing goes wrong. On that one Sunday night every month, the food we ordered would go cold in an eternal wait before we could turn our attention to it.
And how things have changed over the last 6 months, when our biggest pre-contest worry is googling the best restaurants to order the finest food for the 6 of us. 🙂 And then things had to go wrong this very Cook-Off!
Among the changes we made, our servers now run on AWS infrastructure. For the last six months, we have been running each Cook-Off on four c3.2xlarge web instances in front of a MySQL RDS instance, to handle the ever-growing load.
The load has been staying well below 1 on these 8-core machines, so we thought: why not try running with a couple fewer servers this time? This was out of sheer curiosity, as we have mostly been unable to accurately replicate the load and the behaviour of the first 5 minutes of a Cook-Off. The experiment boomeranged, and we all know what ensued. The load shot up to unmanageable proportions, and bringing the two additional servers back and balancing traffic across them took a lot of time.
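For context on the "load well below 1 on 8 cores" figure: the Unix load average roughly counts runnable tasks, so it only means "headroom" relative to the core count. Here is a minimal sketch of that ratio check; it is illustrative only, not our actual monitoring setup, and the paths and formatting are just what a stock Linux box provides.

```shell
#!/bin/sh
# Sketch: compare the 1-minute load average to the number of CPU cores.
# Illustrative only -- not our production monitoring.
cores=$(nproc)
load=$(cut -d ' ' -f 1 /proc/loadavg)

# A ratio under 1.0 means fewer runnable tasks than cores on average --
# but an average hides the burst in the first minutes of a contest.
ratio=$(awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
echo "load=${load} cores=${cores} ratio=${ratio}"
```

A sustained ratio well under 1 looks like spare capacity, but as we learned the hard way, an average says nothing about the spike at minute zero.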
No, we do not take our production setup as casually as this may make it sound. Nor are we taking this incident lightly. The contest has been extended and things are back on track. And we are left embarrassed. This foolhardiness of ours has not only wasted the contestants' time, it has also undone the huge effort of our problem setters. We apologise. It was certainly an avoidable situation.