tl;dr: We failed to warm the cache servers we had added a day earlier, which allowed most of the traffic to hit the database server and overload it. This led to some users not being shown the problems, and finally, everybody’s Saturday was ruined.
Right before a contest begins, our HQ looks just like a control centre for a rocket launch. All eyes are glued to terminals, monitoring the vital statistics of our contests and making sure everything is functioning smoothly. And just like a rocket launch, we still have a lot to be tense about even after counting down to zero. Sometimes things go wrong even after the contest is well underway.
It’s kind of like being a football goalkeeper. Even after so many successful contests, sometimes all we can remember are the times things didn’t work out. And LTIME58 is another such memory for us and our dedicated community.
Right before the contest began, we went through all the pre-launch checks. Checkers, problems, testing: all ready to go. At about T-30 minutes, we started observing some unusual behaviour. Emails about Pressflow errors started pouring in from users who were very hopeful and excited about the contest. We made the necessary fixes, and by 7:30 PM we believed we had it all under control.
Remember, it’s just like a rocket launch. Sometimes, the strangest of things go wrong at the worst of times. We watched, perplexed, as our DB load climbed to 100%, while our inboxes exploded with messages from users reporting that the problems weren’t visible. We like to pride ourselves on setting up difficult contests, but one with invisible problems would be a touch too difficult, wouldn’t it?
Our initial assumption was that the problems were visible to no one. But when we received some submissions, we realised the issue was not affecting all users. In fact, the service that fetched contest problems had stopped. It was evident that high load had caused this, but we still didn’t know the reason for the unusual load. Going through the logs and request patterns, we ruled out any kind of attack. However, by the time we closed off that possibility, it was already 8:00 PM, thirty minutes past the contest start time. When all our efforts to reduce the DB load proved fruitless, we had to reschedule the contest for the following day. Ultimately, we could not continue when so many of our community members had faced an issue. With an extremely heavy heart, we declared the contest unrated.
CodeChef has, quite literally, always been about learning from problems. As we do with every error, we carried out a complete investigation of our application, system, and database logs and noticed that read and write operations were increasing exponentially while cache misses spiked. As it turns out, we had swapped our cache instances a day earlier. The new cache servers, although they had been serving traffic alongside the old ones for a couple of weeks, had not fully warmed up. Cache misses therefore translated into direct calls to the database. The issue spiralled until it ended up spoiling what would have been a perfectly fun way of spending a Saturday.
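The failure mode above can be sketched in a few lines. This is a minimal, illustrative model of the cache-aside pattern, not CodeChef’s actual code: all class names and keys are made up. It shows why a warmed cache shields the database from traffic, while a cold one forwards every first request straight to it.

```python
# Minimal sketch of cache-aside with warming. All names are
# illustrative; this is not CodeChef's production code.

class Database:
    """Stand-in for the real DB; counts how often it is queried."""
    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def fetch(self, key):
        self.hits += 1  # every call here is expensive in production
        return self.rows[key]


class Cache:
    def __init__(self):
        self.store = {}

    def get_or_load(self, key, db):
        # Cache-aside: on a miss, fall through to the database.
        if key not in self.store:
            self.store[key] = db.fetch(key)
        return self.store[key]

    def warm(self, db, keys):
        # Pre-load the hot keys before traffic arrives, so the first
        # wave of requests never reaches the database.
        for key in keys:
            self.store[key] = db.fetch(key)


keys = [f"problem:{i}" for i in range(5)]
db = Database({k: f"statement-{k}" for k in keys})

# Warm the cache while traffic is still low.
cache = Cache()
cache.warm(db, keys)
hits_after_warming = db.hits

# Simulated contest traffic: 1000 requests, all served from cache.
for k in keys * 200:
    cache.get_or_load(k, db)

assert db.hits == hits_after_warming  # the DB never saw the traffic
```

With a cold cache the same traffic would send one request per key (or, under real concurrency, many simultaneous misses per key) straight to the database, which is the spike we saw in our logs.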
Well, the only reason we can fly around in rockets today is that for a hundred years humans have analysed every failed rocket to make a better one. That’s also what we do here at CodeChef HQ: we learn from our mistakes (and then post about them on the Internet so that you don’t have to make the same mistakes to learn the same thing!). Our sincere apologies to the members of our community who were affected by the outage. We remain committed to bringing you the best and most fun coding experience online and have taken steps to ensure that this doesn’t happen again. We are also thankful for the extended support of our community during such times.
We really look forward to our next contest and hope to see you all there.