On October 4th, a big part of the internet went down. Facebook, Instagram, and WhatsApp were unresponsive for several hours, which in turn caused abnormalities in other services. People called it the Monday Blackout.
Building on ready-made PaaS (Platform-as-a-Service: third-party tools and platforms you can combine into your own product) is awesome. It can cut your time-to-market by as much as half and probably saves you quite a buck while you rush to a quick release. But sometimes it hurts, just like it did on Monday when half the web went out.
Let’s find out what happened and share a couple of battle-proven tips on how to avoid downtime even when things hit you that hard.
What happened?
In short, the DNS service (partially) failed. For a metaphor, imagine that half of the world’s streets went missing from all the maps and nav apps. At the very moment when you’re away on vacation. On your way to the airport. Running out of time. In a country where nobody speaks your language. Ouch.
To give this ouch a more techy scent, let’s add some detail. The root cause of the whole thing was that a considerable share of domain names failed to resolve. This means that the DNS gave no answer when your browser asked how to reach, say, Facebook. The same went for all the third-, fourth-, and twentieth-level subdomains used for third-party apps, all the authentication services, all the ad trackers, and zillions of other services of every kind.
How do you minimize the risk of downtime for your online project?
1. Don’t use third-party solutions to provide critical features
The Monday Blackout is a good reason never to rely on third-party solutions for critical features, and to be skeptical about using them as the single source of any feature. Using one global authentication provider like Google IS convenient, but any downtime on their side would ruin the user experience for your customers. And even if it is up and running, there is always a chance that politics kick in: cases where Middle Eastern or Asian governments sanction global tech corporations are, unfortunately, not unheard of.
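As a rough illustration of not making one provider the single source of a feature, here is a minimal Python sketch of a sign-in flow that falls back to a second option when the first is unreachable. The names `authenticate`, `external_sso`, and `local_password` are hypothetical, the outage is simulated, and the sketch assumes your users also hold credentials in your own system.

```python
from typing import Callable, Optional, Sequence

# A provider takes (login, secret) and returns a user id, or None when the
# credentials are rejected. It raises ConnectionError when it is unreachable.
AuthProvider = Callable[[str, str], Optional[str]]

# Hypothetical in-house credential store, for illustration only.
LOCAL_USERS = {"ann": "s3cret"}


def external_sso(login: str, secret: str) -> Optional[str]:
    """Stand-in for a third-party provider that is currently down."""
    raise ConnectionError("third-party auth provider is unreachable")


def local_password(login: str, secret: str) -> Optional[str]:
    """Stand-in for an authentication path you host yourself."""
    if LOCAL_USERS.get(login) == secret:
        return f"user-{login}"
    return None


def authenticate(login: str, secret: str,
                 providers: Sequence[AuthProvider]) -> Optional[str]:
    """Ask providers in order; the first reachable one decides the outcome."""
    for provider in providers:
        try:
            return provider(login, secret)
        except ConnectionError:
            continue  # this provider is down, try the next one
    raise ConnectionError("no authentication provider is reachable")
```

With the external provider simulated as down, `authenticate("ann", "s3cret", [external_sso, local_password])` still signs the user in through the self-hosted path.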
2. Have native iOS and Android apps for a reliable experience
We suggest building native mobile apps for both platforms instead of relying solely on web apps. While web applications naturally depend on the DNS (the browser has to resolve the URL before it can take you anywhere), mobile software is already on your customer’s phone. That means all the fallback features are at its disposal: for example, the app can store a number of backup domain names, or even plain IP addresses, to reach out to in case of a DNS failure, retaining a level of service even when others fail.
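The fallback idea can be sketched in a few lines of Python: walk through a stored list of backup hosts and use the first one that answers. The host names, the IP address (taken from a range reserved for documentation), and the function names are assumptions made for this example, not values from a real deployment.

```python
import socket
from typing import Callable, Optional, Sequence

# Hypothetical endpoints shipped inside the app: primary domain,
# backup domain, and a plain IP address as the last resort.
BACKUP_ENDPOINTS = ["api.example.com", "api.example.net", "203.0.113.10"]


def default_probe(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures as well as refused or
        # timed-out connections.
        return False


def pick_endpoint(candidates: Sequence[str], port: int = 443,
                  timeout: float = 3.0,
                  probe: Callable[[str, int, float], bool] = default_probe
                  ) -> Optional[str]:
    """Return the first candidate that responds, or None if all fail."""
    for host in candidates:
        if probe(host, port, timeout):
            return host
    return None
```

The `probe` parameter exists so the selection logic can be exercised without a network. Note that connecting to a bare IP address still has to pass TLS certificate checks, which this sketch does not cover.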
3. Use peer-to-peer (p2p) communication
The more the merrier, right? The DNS failure was the root cause of the internet shutting down, but the effect was far broader. When some of the social networks became unreachable, users rushed to their competitors, not all of which were ready to deal with that spike.
That’s why we’re so into WebRTC’s p2p capabilities. Live multimedia is pretty traffic- and resource-intensive, so peer-to-peer communication is a budget saver at all times, and a business saver in cases like this one. Even if a secondary service running somewhere in the cloud becomes unreachable for a while, the key feature stays available, because the spike load is redistributed between the devices that are directly involved in a particular call.
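To make the redistribution concrete, here is a back-of-the-envelope Python sketch that counts media streams for one call. It compares a call where every stream is relayed through a central server with a full peer-to-peer mesh, where each device exchanges streams with every other device directly. Both models are simplifying assumptions for illustration, and the function names are made up for this example.

```python
def relayed_server_streams(participants: int) -> int:
    """Streams a relay server handles when all media passes through it."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    incoming = participants                       # one stream from each device
    outgoing = participants * (participants - 1)  # each stream forwarded to the others
    return incoming + outgoing


def mesh_streams_per_device(participants: int) -> int:
    """Streams each device handles in a full mesh; the server relays none."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    # Each device sends to and receives from every other participant.
    return 2 * (participants - 1)
```

Under these assumptions, the media load on the server in a mesh call is zero regardless of how many calls are running, at the cost of more streams per device as the call grows.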
4. Set up auto-scaling to handle spike load
Another critical thing is scalability. Design your platforms to scale both up and down, using whichever strategy fits: automatic or manual. Either way, architect your high-load solutions so that they respond to traffic spikes without service degradation. For example, we architect, develop, and test ours to meet the target criteria and even exceed them considerably.
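One common way to automate the scale-up and scale-down decision is a proportional rule: grow or shrink the number of instances by the ratio of the observed load to the load you want each instance to carry. The Python sketch below shows that rule with assumed limits; the function name, parameter names, and default bounds are hypothetical and would need tuning for a real system.

```python
import math


def desired_instances(current: int, load_per_instance: float,
                      target_load: float, min_instances: int = 2,
                      max_instances: int = 20) -> int:
    """Return how many instances to run, given the observed average load.

    load_per_instance and target_load use the same unit, for example
    requests per second or CPU percent.
    """
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    # Proportional rule: scale the fleet by observed / target load.
    raw = math.ceil(current * load_per_instance / target_load)
    # Clamp so a spike cannot exceed the budget and a lull cannot
    # shrink the service below a safe minimum.
    return max(min_instances, min(max_instances, raw))
```

For example, four instances averaging 90 units of load against a target of 60 would be scaled to six, while the same four instances at 15 units would shrink to the configured minimum of two.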
5. Keep your code portable
And now, back to PaaS, as it was what made the Monday Blackout a blackout. When users rushed to competing social networks, they certainly reached for different domains. But under the hood, many of those were hosted on the same cloud computing platforms, known for their quality and massive resource pools. And those platforms started to crack under the load, making the internet go down.
That is why we recommend delivering software as code, and not only for the sake of ownership rights. With the code at your disposal, you are independent of any premade setup in a cloud. If your cloud provider goes down, you simply deploy the whole thing in a different availability zone, with a competing provider, or even on-premises, whichever serves you better.
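One habit that helps keep code portable is to read everything provider-specific from the environment instead of hard-coding it, so that redeploying elsewhere means changing settings rather than source. The following Python sketch assumes three hypothetical variables (`APP_DATABASE_URL`, `APP_STORAGE_ENDPOINT`, `APP_REGION`); real projects will have their own list.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; nothing here is tied to a specific provider.
REQUIRED = ("APP_DATABASE_URL", "APP_STORAGE_ENDPOINT", "APP_REGION")


@dataclass(frozen=True)
class Settings:
    database_url: str
    storage_endpoint: str
    region: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast if any is missing."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return Settings(
        database_url=env["APP_DATABASE_URL"],
        storage_endpoint=env["APP_STORAGE_ENDPOINT"],
        region=env["APP_REGION"],
    )
```

Moving to another provider or an on-premises host then comes down to supplying a different set of values for the same three variables.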
So, here’s the summary of why the internet went down.
What happened? A chain reaction.
- DNS failure made a number of massively used resources unreachable
- Their audience rushed to alternative destinations
- Many of those destinations run on the same cloud platforms, and the traffic snowballed onto them
- Many side services also failed, as they used third-party solutions provided by big tech companies affected by the DNS failure
How we mitigate these risks:
- Critical features that do not depend on third parties
- Native mobile apps for less DNS dependency
- P2p to avoid bottlenecks
- Auto-scaling to handle spike load
- Code availability for quick redeployment
Blackouts like this one are definitely a rare occurrence. But when they arrive, you are either struggling to minimize your losses or welcoming the discontented customers of your less reliable competitors. Need a team capable of keeping your business ready for a chance like that? Request a quote.