What the Heck Is Hyperscale?
I first heard the term hyperscale around 2013 when I was managing Google’s data center hard drive software team. As hard drive and SATA controller vendors came in to share their road maps, they started referring to the needs of hyperscalers as something different from those of traditional enterprise users. Google’s needs being different from traditional enterprise wasn’t new, but using the term hyperscale to refer to Google and others was. Google didn’t use this term internally. As far as I know, Amazon and Facebook didn’t either. Nor did our needs always overlap. What did the industry mean by calling us hyperscale?
Seven years later, I’m still wondering what hyperscale means. Research papers and marketing materials talk about hyperscalers without ever clarifying what it means. Who qualifies as a hyperscaler? What are the criteria? I set out to find a definition that answers these questions and, finding little, created my own.
Looking for an origin story
Given how broadly the term is used, I naively assumed that someone must have defined it somewhere, yet web searches turned up nothing useful. After a few false starts, I discovered that I could filter Google search results with a date range. This was still far from ideal, as many results showed very old dates despite the contents being obviously more recent. Regardless, I could find candidate results this way and then use the Internet Archive’s Wayback Machine to confirm that the content existed at the date claimed by Google.
Using that method, the earliest use of hyperscale (as related to computing) I’ve been able to find is “Creating a Hyper-efficient Hyper-scale Data Center” in Dell Power Solutions magazine’s February 2008 issue. While the bulk of the article is pitching Dell’s newly formed Data Center Solutions Division, the introduction gives an overview of how cloud and large cluster computing environments differ from traditional environments. The article highlights how these environments focus on maximizing efficiency at every level of data center design, from machines to power and cooling infrastructure. While this is generally true, the article places the emphasis on the solutions used by these environments rather than on when those solutions are appropriate. Does simply choosing to trim unnecessary components from servers and using a hot aisle make you a hyperscaler? I’m inclined to say no. I’m also a bit dubious that Dell understood hyperscale well enough at the time to speak authoritatively. The same article describes these cloud computing environments as containing thousands, or even tens of thousands, of servers. By 2008, Google and Amazon had deployed well over a hundred thousand servers.
Crowdsourcing a definition
With my search for an origin story failing to uncover a definition, I began to wonder if our collective usage of the term hyperscale would uncover some consistent traits that could be formed into a definition. Knowing that my Twitter followers tend to skew toward servers and hyperscale, I posed the question there:
So, uh, what _is_ hyperscale? What are the qualities of a hyperscale design? Asking as someone who worked in Google's server design group for nearly a decade.
— Rick Altherr (@kc8apf) March 31, 2020
I anticipated diverse responses, especially as retweets began to elicit responses from a broader audience. Despite my anticipation, I was dumbfounded by the breadth of attributes and the decisiveness of the responses. Here are a few responses to give a flavor:
A combination of application and infrastructure architecture where the application can scale up or down without limit and with minimal effort thanks to automation that abstracts infrastructure away. You might have a light that says "order more racks" but that's about it.
— Nick 🦇🕸🖤 (@ExplodingLemur) April 1, 2020
I don't have an authoritative answer, but I think this is part of the definition:
“designing one's own compute electronics because at the fleet-size we deploy there's gains to be made”
It probably also includes “designing one's own firmware”, or “... own networking gear” too.
— Luis Bruno 🟡 (@luisbruno) March 31, 2020
I don't think the sheer number of servers makes a hyperscaler though.
The ratio of dedicated technitions & sysadmins per server seems to matter.
— Matt King (@syncsrc) April 1, 2020
A _design_ doesn't make you hyperscale. Scale does.
Proposed criteria could be something like: an organization that represents a material percentage of the deployment of commodity general purpose server processors manufactured per year.
— Matthew S. Wilson (@_msw_) March 31, 2020
Nodes are simpler because hyperscaler can make simplifying homogeneous infra assumptions and/or they don't have to handle their own hw errors.
Machine = dumb data collector. Higher level systems figure out what to do to fix them.
(Gross simplification on my part here, too)
— chris (@hugelgupf) April 1, 2020
I think your answer is in your question... A hyperscale design is one that is worked on by a dedicated (in-house) server design group. (Conversely you aren't hyperscale if you don't design your own servers to _your_ needs)
— Tom Whateley (@twhateley) April 1, 2020
-Redundancy is achieved primarily through software, not hardware
-Use cases are focused, and not general
-Usually pushing the edge in terms of software (OS) support, and not interested in external validation of a range of software
-Can handle single component failure for >days
— Sargun Dhillon (@sargun) April 1, 2020
Less a kind of design than a market segment. Differentiation, raw power, and per-unit support matter less than in enterprise. What matters most is minimizing amortized lifetime nre+capex+opex per unit of compute/storage/bandwidth/etc., especially labor costs.
— Code Worsener (@innuendofunctor) March 31, 2020
A scale 1000 times your current and 100x your dreams of reality.
Often with many subtitle details you can't discern at a distance that one wouldn't believe if they could see the details upfront.
— Julia Kreger (@ashinclouds) March 31, 2020
Again, these definitions mostly focused on quantitative attributes, often the number of servers, or on specific solutions applied to problems perceived to be experienced only by hyperscalers. I was intrigued by the suggestions related to ratios. They hinted at some underlying problem of competing needs. Maybe we could define hyperscale in terms of the constraints rather than the outcomes. That led me to think about the times I’ve heard electrical engineers discuss the term “high-speed signal.”
Inspiration from electrical engineering
I’ve spent a lot of time around electrical engineers, especially those designing computer motherboards. At various times, the topic of what constitutes a high-speed signal has come up. When asked for a description or definition, electrical engineers give a variety of answers depending on their experience. Students and interns often suggest it has to do with the frequency of the signal. Junior engineers will often talk about setup and hold times. While these are all relevant to the concept, none of them is a complete definition.
Prof. Chris Diorio takes a different approach in his CSE467 handout on high-speed signaling. First, he describes the foundational abstractions of digital design:
- Digital interpretation of analog values
- Logic devices as idealized Boolean primitives
- Steady-state abstraction
- Finite-state behavior of sequential systems
High-speed is then defined as the point where those abstractions break down as a consequence of circuit speed. This avoids identifying a specific speed (which will vary with materials and logic technologies), set of problems, or possible solutions. Instead, it’s an observation that there is an inflection point related to speed where the set of problems that need to be considered changes.
Prof. Diorio continues by showing various ways in which those abstractions might break down as speed increases. For each of these potential problems, common solutions are also available. So what does it mean to have a high-speed signal? Something is broken, and now you need to understand what and why so you can figure out which solutions make sense for your situation.
Defining hyperscale
Borrowing Prof. Diorio’s approach, what is the foundational abstraction of business computing? Businesses existed before computing, yet computers have been adopted by nearly every business, which implies that they must provide value greater than their costs. Extending that line of thinking from a business/economics perspective:
- An IT investment must bring a potential revenue increase or opex reduction greater than its anticipated total cost of operation
- Total IT investments must not detract from the primary business focus
Applying Prof. Diorio’s approach, hyperscale is then an inflection point where those rules break down as a consequence of scale of IT deployments. Past that inflection point (aka operating at hyperscale), the needs of the business cannot be met through straightforward purchasing of additional servers and using mainstream administration techniques and tools. Exactly where this point is will vary from business to business and continue to change over time. This implies that at any point in time many more companies are hyperscalers than is commonly believed.
While the exact problems faced by a hyperscale business will also vary, common problems and solutions to those problems have emerged:
- High opex due to # admins per # servers => treat servers as cattle instead of pets, heavy use of automation
- High capex due to off-the-shelf equipment => white box and defeatured equipment
- High opex due to high PUE => improve power and thermal efficiency through whole-building co-design
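To make the two opex items in the list above concrete, here is a minimal sketch in Python. Every figure in it is made up for illustration, and the Fleet class and function names are hypothetical rather than taken from any real tool. The only definition it relies on is that PUE is total facility power divided by IT equipment power. It compares a "pets" fleet (few servers per admin, high PUE) with a "cattle" fleet of the same size (heavy automation, lower PUE).

```python
from dataclasses import dataclass, replace

HOURS_PER_YEAR = 24 * 365


@dataclass
class Fleet:
    servers: int
    servers_per_admin: int      # how many servers one admin can look after
    admin_cost_per_year: float  # fully loaded cost of one admin (made up)
    it_power_kw: float          # power drawn by the IT equipment itself
    pue: float                  # total facility power / IT equipment power
    energy_cost_per_kwh: float  # made-up electricity price


def admin_opex(fleet: Fleet) -> float:
    """Yearly staffing cost implied by the servers-per-admin ratio."""
    # Ceiling division: a partially loaded admin is still a full headcount.
    admins = -(-fleet.servers // fleet.servers_per_admin)
    return admins * fleet.admin_cost_per_year


def energy_opex(fleet: Fleet) -> float:
    """Yearly energy cost, with PUE scaling IT power up to facility power."""
    facility_kw = fleet.it_power_kw * fleet.pue
    return facility_kw * HOURS_PER_YEAR * fleet.energy_cost_per_kwh


# Hypothetical fleet run with mainstream administration techniques.
pets = Fleet(
    servers=100_000,
    servers_per_admin=50,
    admin_cost_per_year=150_000.0,
    it_power_kw=30_000.0,
    pue=2.0,
    energy_cost_per_kwh=0.10,
)

# Same fleet after automation and facility co-design (again, invented numbers).
cattle = replace(pets, servers_per_admin=5_000, pue=1.2)

if __name__ == "__main__":
    for name, fleet in (("pets", pets), ("cattle", cattle)):
        print(name, admin_opex(fleet), energy_opex(fleet))
```

With the number of servers held constant, only the ratio and the PUE change between the two fleets, which is the point of the list: the savings come from how the fleet is operated, not from its size.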
What seems to differentiate hyperscalers is not their scale, but how they have adapted to the new challenges they face. Most striking is that the most prominent hyperscaler businesses today universally have software development as a core competency intrinsically linked to their primary business focus. They have been able to justify an overall larger IT investment by leveraging their existing software development teams to create software tools that let them not only operate at hyperscale but thrive there. That is a topic worthy of its own post.