Hypergrowth in the public cloud: add 5,000 servers a night on AWS
We have all read about the spectacular growth of several collaboration platforms after we started working from home in 2020. Zoom is one example: it grew particularly fast, as underlined by its Q3 2020 results, with revenues up 370% year over year.
For users, it took relatively little effort to sign up, connect and continue working with these tools. But what about the underlying IT effort to make this happen on a global scale? Now we know: Rachel Dines put some numbers on Twitter. They are impressive: Zoom managed to add 5,000 servers a night on AWS in the early days of the pandemic.
Getting to hyperscale in less than one generation
This is a mind-blowing performance. It was voiced this week in an AWS re:Invent session. As a former CIO, I still remember what it took, and how long it took, to add some capacity. You worked diligently with your suppliers and their lengthy supply chains. It required patience and deliberation, from quoting (get those discounts!) to installing (avoid life cycle issues!).
Furthermore, once you had ordered, you crossed your fingers, because there was no way to reverse the commitment you had made. As a result, you would often end up with oversized real estate, especially when launching applications that had no resource consumption history to go by yet. To make things worse, this became the new normal. Resource consumption patterns gradually became less predictable with the exponential growth in data, the need to process more data in real time, and the introduction of new types of users and usage patterns.
Looking back, it feels like an Alice Through The Looking Glass experience now. In other words, the reverse of Zoom’s pace: a capacity to add a mere 8 servers every 5,000-6,000 hours.
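To put a rough number on that contrast, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted above, plus one assumption of mine that is not in the original numbers: that a "night" lasts 8 hours. The names used (NIGHT_HOURS, servers_per_hour) are purely illustrative.

```python
from fractions import Fraction

# Assumption (not from the quoted figures): a "night" is taken to be 8 hours.
NIGHT_HOURS = 8


def servers_per_hour(servers: int, hours: int) -> Fraction:
    """Return the provisioning rate as an exact fraction of servers per hour."""
    return Fraction(servers, hours)


# Zoom's pace as quoted: 5,000 servers a night.
zoom_rate = servers_per_hour(5_000, NIGHT_HOURS)

# The "looking glass" pace: 8 servers every 5,000 to 6,000 hours.
legacy_rate_fast = servers_per_hour(8, 5_000)
legacy_rate_slow = servers_per_hour(8, 6_000)

# How many times faster the cloud pace is, at both ends of the range.
speedup_low = zoom_rate / legacy_rate_fast
speedup_high = zoom_rate / legacy_rate_slow

print(f"Zoom: {float(zoom_rate):,.0f} servers per hour")
print(f"Speed-up: {float(speedup_low):,.0f}x to {float(speedup_high):,.0f}x")
```

Under that 8-hour assumption, the gap works out to somewhere between roughly 390,000 and 470,000 times. A different definition of a night changes the exact figure, but not the order of magnitude.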
'Hyperscale' vendor is no exaggeration
At re:Invent 2016, James Hamilton, AWS VP & Distinguished Engineer and previously with Microsoft and IBM, put up these numbers during his keynote. Already then, his slide read “AWS adds the capacity equivalent of a FORTUNE 500 Enterprise daily.” Every day!? Already in 2015!? Indeed, ‘hyper’ seems an appropriate qualification and not an exaggeration at all.
We know these companies run multibillion-dollar operations and add billions more every year. However, it is difficult to grasp the scale of their operations, because for us this is simply infrastructure ‘as code’.
Everyone can benefit, large and small
When working with these public cloud providers nowadays, one tends to forget how quickly and how profoundly things have changed. Moreover, we may forget how much value these changes bring us every moment of the day.
Solynta, one of our customers, was preparing for significant growth. Albeit on an entirely different scale than Zoom, Solynta understood the benefits of moving to public cloud services. We designed and built a new IT environment based on AWS and Microsoft cloud services and helped them migrate. Their positive reactions after the transition confirmed that the ‘private’ cloud supply chains and operations they were accustomed to have not changed that much since those good old days.
Look beyond 'just' benefitting from scale
The numbers in James Hamilton’s 2016 keynote may be outdated, but the talk is still relevant. Scale is important. Among other things, Hamilton explains the design of ‘AWS Availability Zones’ (AZs) and why these are a crucial element in realizing a high availability infrastructure. AZs are different from Regions. These concepts also remain relevant because other cloud providers use similar or even identical words, but have built their data center infrastructures differently. When thousands of customers rely on your huge infrastructure 24/7, resilience and reliability matter.
This becomes even more important when you ‘move up’ from Infrastructure Services to deploying Platform Services. Microsoft customers in particular have suffered from serious and persistent reliability issues. One of the underlying structural causes is explained in Gartner’s 2020 Magic Quadrant for Cloud Infrastructure and Platform Services:
“Microsoft has the lowest ratio of availability zones to regions of any [sic!] vendor in this Magic Quadrant, and a limited [sic!] set of services support the availability zone model. As a result, Gartner continues to have concerns related to the overall architecture and implementation of Azure, despite resilience-focused engineering efforts and improved service availability metrics during the past year.”
Scale matters, but equally important are the engineering principles with which the foundation has been built. Finally, I can recommend watching the video. It is a pleasure to watch Hamilton talk passionately about these important innovations.