Why did the CodeChef website crash at the start of June Cook-Off 2014?


We goofed up. It was avoidable.

This is not the first goof-up we have faced in a short contest. We have been there and done that. The starting load during a Cook-Off in particular has been a long-standing problem for us. Sometimes the DB was unable to handle our complicated queries, while at other times we were hit by SYN flood attacks! The result was the same each time: contestants who were pissed off at the start of the contest were soothed by the end by a usually good problem set (for which we can take no credit; that goes to our awesome problem setters).

We had tried to replicate the starting load and fix things in our dev environment, but something or other always got left out. There was a time when we would find an issue in one Cook-Off, fix it, and then wait for the next Cook-Off to find the next one. That went on for a while, until we took control of things, changed our application architecture, and moved to a new infrastructure!

A couple of interns joined the team just a few days ago, and I was narrating this whole chain of events to them: how we had to be extremely alert and almost pray to god each time, for the first 15 minutes, that nothing would go wrong. On that Sunday night every month, the food we ordered would go cold waiting in vain for our attention.

And how things have changed over the last 6 months, so that now we just google the best restaurants to order the finest food for the 6 of us. 🙂 And then things had to go wrong this very Cook-Off!

Among the changes we made, we now run our servers on AWS infrastructure. Over the last six months, we have been running four c3.2xlarge web instances in front of a MySQL RDS instance for our Cook-Offs to handle the ever-growing load.

The load average has stayed well below 1 on these 8-core machines, so we thought we would try running with two fewer servers this time. This was out of sheer curiosity, as we have mostly been unable to accurately replicate the load and behaviour of the first 5 minutes of a Cook-Off. It backfired, and we all know what ensued. The load shot up to unmanageable levels, and adding the two servers back and rebalancing the load onto them took a lot of time.
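To see why cutting from four servers to two looked safe on paper, here is a rough back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions, not our actual capacity model: it assumes the load balancer splits load evenly across web servers, treats "well below 1" as a hypothetical upper bound of 1.0 per server, and uses the common rule of thumb that a server is roughly saturated once its load average reaches its core count. The helper names `per_server_load` and `surge_headroom` are made up for this example.

```python
CORES_PER_SERVER = 8  # each c3.2xlarge web instance has 8 vCPUs


def per_server_load(total_load: float, servers: int) -> float:
    """Load average per web server, assuming an idealized even split."""
    return total_load / servers


def surge_headroom(total_load: float, servers: int, cores: int = CORES_PER_SERVER) -> float:
    """How many times the observed load can grow before each server's load
    average reaches its core count (a rough rule-of-thumb saturation point)."""
    return cores / per_server_load(total_load, servers)


# Hypothetical figure: treat "well below 1" as at most 1.0 on each of four servers.
observed_total = 1.0 * 4

for n in (4, 2):
    print(n, per_server_load(observed_total, n), surge_headroom(observed_total, n))
```

Under these assumptions, two servers still leave about 4x headroom over the observed load, which is why the change seemed harmless. The catch is that the observed figure never captured the first few minutes of a Cook-Off, so there was no way to know whether the real starting surge fit inside that margin.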

No, we do not take our production setup as casually as it may sound, nor are we taking this incident lightly. The contest has been extended and things are back on track, but we are left embarrassed. This foolhardiness of ours not only wasted every contestant's time, it also undid the huge effort of our problem setters. We apologise. It was certainly an avoidable situation.

Regards,
Anup
Team CodeChef
