Why did the CodeChef website crash at the start of June Cook-Off 2014?

We goofed up. It was avoidable.

This is not the first goof-up that we have faced in a short contest. We have been there and done that. The starting load during a Cook-Off, especially, has been a long-standing problem for us. Sometimes the DB was unable to handle our complicated queries, while at other times we were hit by a SYN flood attack! The result each time used to be the same: contestants who were pissed off at the start of the contest were soothed by the end by a usually good problem set (for which the credit goes not to us but to our awesome problem setters).

We had tried to replicate the starting load and fix things in our dev environment, but something or the other always got left out. There used to be a time when we would find an issue in a Cook-Off, fix it, and then wait for the next Cook-Off to find the next one. This went on for a while until we took control of things, changed our application architecture, and moved to a new infrastructure!

Just a few days ago a couple of interns joined the team, and I was narrating this whole chain of events to them: how we had to be extremely alert, almost praying to god for the first 15 minutes each time that nothing would go wrong. On that one Sunday night of every month, the food we had ordered used to go cold in an eternal wait before it could draw our attention.

And how things have changed over the last six months, to the point where all we think about is googling the best restaurants to order the finest food for the six of us. 🙂 And how things had to go wrong this very Cook-Off!

Among the changes we made, we now run our servers on AWS infrastructure. Over the last six months, for our Cook-Offs, we have been running four c3.2xlarge web instances in front of a MySQL RDS instance to handle the ever-growing load.

The load average has been staying well below 1 on these 8-core machines, and hence we thought, why not test with two fewer servers this time? This was out of sheer curiosity, as we have mostly been unable to accurately replicate the load and the behaviour of the first 5 minutes of our Cook-Off. Things boomeranged and we all know what ensued. The load shot up to unmanageable proportions, and balancing it across the two additional servers took a lot of time.
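To make the reasoning behind that experiment concrete, here is a rough back-of-the-envelope sketch in Python. It is not the calculation we actually ran. The function names are hypothetical, the load average of 1.0 is the upper bound mentioned above, and the `burst_factor` of 10 is purely an illustrative assumption, since we could never measure the real start-of-contest spike.

```python
def projected_load_per_server(load_avg, servers_before, servers_after):
    """Naive estimate: the same total work spread evenly over fewer servers."""
    if servers_after <= 0:
        raise ValueError("need at least one server")
    return load_avg * servers_before / servers_after


def has_headroom(load_avg, cores, burst_factor=1.0):
    """True if the load average, scaled by an assumed burst, stays below the core count."""
    return load_avg * burst_factor < cores


# Load average of at most 1.0 on each of four 8-core machines, cut down to two.
steady = projected_load_per_server(1.0, servers_before=4, servers_after=2)
print(steady)                                           # 2.0
print(has_headroom(steady, cores=8))                    # True
# Hypothetical 10x spike in the first minutes of the contest.
print(has_headroom(steady, cores=8, burst_factor=10))   # False
```

On steady-state numbers, two servers look comfortable. The estimate only breaks once a burst is factored in, and the burst is exactly the part we had never managed to replicate.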

No, we do not take our production setup as casually as it may sound. Nor are we taking this lightly. The contest has been extended and things are back on track. And we are left embarrassed. This foolhardiness of ours has not only wasted every contestant's time, it has also undone the huge effort of our problem setters. We apologise. It was certainly an avoidable situation.

Team CodeChef
