Chapter 3. Infrastructure Platforms

Different people define the scope of infrastructure in very different ways, often within the broad context of platform engineering, a concept that itself is wide and fuzzy. This chapter defines a scope of infrastructure that is useful for discussing Infrastructure as Code, and puts it into the context of the system as a whole.

Figure 3-1 shows three high-level system layers. In the top layer sit applications, digital products and services that offer business capabilities to customers and other users. Applications are supported by platform services provided in the engineering platform layer. Platform services include those used to host software, like container clusters and virtual servers, as well as operational services like monitoring.

The bottom system layer is the infrastructure platform, which provides the compute, storage, and network resources that underpin the rest of the system. The infrastructure platform may be composed of physical systems, virtual systems, or both. These systems may be physically managed in a data center by the organization that uses them or hosted by a cloud provider. In many cases, an organization uses a combination of one or more cloud providers and data centers.

As with much IT terminology, people use the terms “infrastructure,” “platforms,” and “workloads” in different ways, and may even use them in conflicting ways in different contexts. It’s important to be aware of these differences and to decide which definitions are useful in a given conversation or context. Figure 3-2 maps how these three terms are often used against the system layers and how this book uses the terms.

This book uses infrastructure to describe the resources provided by an infrastructure platform, particularly those that can be defined, provisioned, and configured using infrastructure code. An application developer or solution architect, on the other hand, might call everything below the application layer “infrastructure” (that is, “stuff that isn’t an application”).

It’s also common to describe everything below the application as “the platform.” This book generally refers to the infrastructure platform and the engineering platform as two different things, as shown in these diagrams. In relation to Infrastructure as Code, an infrastructure platform provides resources that infrastructure code assembles to build the platform services that compose the engineering platform.

From the infrastructure point of view, everything that runs above the infrastructure platform, including applications and platform services, is seen as a workload. Many people, often including cloud vendors and platform engineers, use “workloads” to specifically mean applications, the things that provide value most directly to users.

In either definition, workloads usually include business capabilities, which are services that provide capabilities indirectly involved in delivering value to customers, users, and other stakeholders. A business data analytics service is an example of a business capability.

The first section of this chapter discusses infrastructure platforms in more detail. The second section describes how infrastructure provides platform services in the engineering platform. The last section discusses capabilities used to manage the platform elements, like application delivery services and, of special interest to readers of this book, infrastructure delivery services.

What Is a Platform?

Platform is one of those words (like “system” and “service”) that is used in so many ways that the word is nearly meaningless without a more specific qualifier, such as “business,” “developer,” or “data.” And even then terms like “business platform” still feel woolly. To make comprehension even more difficult, different people define and use various platform-related words in different ways; we have no industry-standard definitions to rely on.

The Cloud Native Computing Foundation (CNCF) defines a platform as “an integrated collection of capabilities defined and presented according to the needs of the platform’s users. It is a cross-cutting layer that ensures a consistent experience for acquiring and integrating typical capabilities and services for a broad set of applications and use cases.”¹

Infrastructure Platforms

So far, I’ve described infrastructure resources vaguely as the stuff that infrastructure code assembles. These resources, and the platforms that provide them, are the medium in which we infrastructure coders work. They are the materials that we mold, using our craft to turn characters in a file into the digital foundations that sustain the organizations for which we work.

That may be a flowery way to describe infrastructure platforms. But they are important to what we do. Chapter 4 discusses Infrastructure as Code tools and languages in detail. This section explores the platforms and resources that those tools manage. The following sections in this chapter discuss the purposes that those resources are put to by infrastructure tools.

Figure 3-3 shows the interaction between infrastructure code, infrastructure tools, and an infrastructure platform.

An infrastructure tool like Terraform or Pulumi reads the infrastructure code to determine what infrastructure needs to be provisioned and how to configure it. The tool connects to the infrastructure as a service (IaaS) API of the infrastructure platform, which then makes the appropriate changes to the provisioned infrastructure in the environment.
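
The interaction described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular tool works: FakeIaaSPlatform, plan(), and apply() are hypothetical names invented for this example, and the platform is an in-memory stand-in instead of a real IaaS API. Real tools such as Terraform and Pulumi do considerably more, including state tracking and dependency ordering.

```python
from dataclasses import dataclass, field


@dataclass
class FakeIaaSPlatform:
    """Hypothetical in-memory stand-in for an IaaS API; not a real provider SDK."""
    resources: dict = field(default_factory=dict)

    def list_resources(self) -> dict:
        return dict(self.resources)

    def create(self, name: str, config: dict) -> None:
        self.resources[name] = dict(config)

    def update(self, name: str, config: dict) -> None:
        self.resources[name] = dict(config)

    def delete(self, name: str) -> None:
        del self.resources[name]


def plan(desired: dict, actual: dict) -> list:
    """Compare declared resources with provisioned ones and list the changes needed."""
    actions = []
    for name, config in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: dict, platform: FakeIaaSPlatform) -> list:
    """Call the platform API once for each planned change."""
    actions = plan(desired, platform.list_resources())
    for action, name in actions:
        if action == "create":
            platform.create(name, desired[name])
        elif action == "update":
            platform.update(name, desired[name])
        else:
            platform.delete(name)
    return actions


# The "infrastructure code": a declaration of what should exist.
desired_state = {
    "web-server": {"type": "vm", "size": "small"},
    "app-network": {"type": "network", "cidr": "10.0.0.0/16"},
}

# What is currently provisioned in the environment.
platform = FakeIaaSPlatform(resources={
    "web-server": {"type": "vm", "size": "large"},
    "old-queue": {"type": "queue"},
})

changes = apply(desired_state, platform)
```

Running apply() a second time with the same declaration produces no changes, because the provisioned resources already match what the code declares.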

As mentioned earlier, an infrastructure platform can manage physical systems, virtual systems, or a combination. The physical equipment may be managed by the organization that uses it or by a vendor. The key requirement for managing Infrastructure as Code is that the infrastructure platform is an IaaS platform. An IaaS platform exposes a programmable interface that clients can use to provision and manage resources on demand.

Most IaaS platforms expose a Representational State Transfer (REST) API, often with software development kits (SDKs) for different programming languages.²
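
To make the idea of a programmable interface concrete, here is a rough sketch of what a client request to a REST-style IaaS API might look like, using only the Python standard library. The base URL, path, payload fields, and bearer-token authentication are all assumptions made for illustration; each real platform defines its own endpoints, authentication scheme, and request schema. The example builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical endpoint; real platforms define their own URLs,
# authentication, and request schemas.
API_BASE = "https://iaas.example.com/v1"


def build_create_vm_request(name: str, size: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request asking the platform for a new VM."""
    body = json.dumps({"name": name, "size": size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/virtual-machines",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_vm_request("web-1", "small", token="example-token")
```

An SDK for a given programming language typically wraps requests like this one in functions and types, so that client code doesn't need to assemble HTTP calls by hand.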

There are different types of IaaS platforms, from full-blown public clouds to private clouds, and from commercial vendors to open source platforms. Table 3-1 lists examples of vendors, products, and tools for each type of cloud IaaS platform.

Table 3-1. Examples of IaaS solutions
Type of platform	Providers or products
Public IaaS cloud services	Alibaba Cloud, Amazon Web Services (AWS), Microsoft Azure, Google Cloud, DigitalOcean, Linode (Akamai), Oracle Cloud Infrastructure (OCI), OVHcloud, Scaleway, Vultr
Private IaaS cloud products	Apache CloudStack, OpenStack, VMware vCloud
Bare-metal server-provisioning tools	Crowbar, Cobbler, Foreman, RackN Digital Rebar (see “Creating a Server by Using Network Provisioning” for more)
Public cloud data center offerings	AWS Outposts, Azure Local, Google Cloud Anthos

IaaS providers with the largest infrastructure, the broadest range of services, and the widest global presence are sometimes called hyperscalers. As of late 2024, AWS, Azure, and Google Cloud are the definitive hyperscalers.³

At the basic level, an infrastructure platform provides compute, storage, and networking resources. The platform can provide these resources in different ways. For instance, it may offer compute as physical servers, VMs, container runtimes, or serverless code execution.

Different vendors often package and offer the same types of resources in different ways, or at least give them different names. For example, Amazon S3, Azure Blob Storage, and Google Cloud Storage are all pretty much the same thing. This book sometimes uses generic names that apply to different platforms, such as network address block or virtual local area network (VLAN) rather than virtual private cloud (VPC) or subnet.
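
One way to picture the relationship between generic names and vendor names is as a simple lookup table. The sketch below uses product names that appear in this chapter; the dictionary layout and the function names vendor_name() and generic_name() are illustrative assumptions, not part of any tool.

```python
# Generic resource types mapped to the vendor product names used in this chapter.
# The dictionary layout and function names are illustrative assumptions.
VENDOR_NAMES = {
    "object storage": {
        "AWS": "Amazon S3",
        "Azure": "Azure Blob Storage",
        "Google Cloud": "Google Cloud Storage",
    },
    "block storage": {
        "AWS": "Amazon EBS",
        "Azure": "Azure Managed Disks",
        "Google Cloud": "GCE Persistent Disk",
    },
}


def vendor_name(generic: str, vendor: str) -> str:
    """Look up what a given vendor calls a generic resource type."""
    return VENDOR_NAMES[generic][vendor]


def generic_name(product: str) -> str:
    """Find the generic resource type that a vendor product belongs to."""
    for generic, products in VENDOR_NAMES.items():
        if product in products.values():
            return generic
    raise KeyError(product)
```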

Infrastructure Resources

Three essential types of resources are provided by an infrastructure platform: compute, storage, and networking. Different platforms combine and package these resources in different ways. For example, you may be able to provision a database instance, which combines compute, storage, and networking. Even something as seemingly simple as object storage (like Amazon Simple Storage Service, or S3, buckets) involves not only storage but also networking to allow connections over HTTP and compute to carry out encryption.

The fundamental forms of infrastructure are primitive resources, such as a subnet or a virtual disk volume. Cloud platforms combine infrastructure primitives into composite resources, such as these:

Database as a service (DBaaS)
Containers as a service (CaaS)
Load balancing
DNS
Identity management
Secrets management

Figure 3-4 shows an example of a composite resource composed of primitive resources and exposed by the IaaS platform API.

This example is a container cluster that can be created by calling a single API endpoint, createCluster(). That API call leads the IaaS platform to assemble a collection of primitive resources, including a pool of VMs, storage elements to manage state for the cluster, and networking for connectivity. Most of these primitives are also exposed directly through their own endpoints, including the createVirtualMachine() endpoint shown in the diagram.
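
The relationship between a composite endpoint and the primitives behind it can be sketched as follows. SketchPlatform is a hypothetical class whose method names loosely mirror the createCluster() and createVirtualMachine() endpoints described above; the subnet range, volume size, and VM size are arbitrary values chosen for the example, and a real platform would assemble many more pieces.

```python
import itertools


class SketchPlatform:
    """Hypothetical IaaS API; method names mirror the endpoints named in the text."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.primitives = []  # every primitive resource provisioned so far

    def _provision(self, kind: str, **config) -> dict:
        resource = {"id": f"{kind}-{next(self._ids)}", "kind": kind, **config}
        self.primitives.append(resource)
        return resource

    # Primitive endpoints, which clients can also call directly.
    def create_virtual_machine(self, size: str) -> dict:
        return self._provision("vm", size=size)

    def create_volume(self, size_gb: int) -> dict:
        return self._provision("volume", size_gb=size_gb)

    def create_subnet(self, cidr: str) -> dict:
        return self._provision("subnet", cidr=cidr)

    # Composite endpoint: one call assembles several primitives.
    def create_cluster(self, name: str, node_count: int) -> dict:
        subnet = self.create_subnet("10.0.1.0/24")
        state_volume = self.create_volume(size_gb=20)
        nodes = [self.create_virtual_machine("medium") for _ in range(node_count)]
        return {
            "name": name,
            "subnet": subnet["id"],
            "state_volume": state_volume["id"],
            "nodes": [node["id"] for node in nodes],
        }


platform = SketchPlatform()
cluster = platform.create_cluster("orders", node_count=3)
```

A single call to create_cluster() leaves five primitive resources in platform.primitives: one subnet, one volume, and three VMs.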

The line between a primitive resource and a composite resource is arbitrary, as is the line between a composite infrastructure resource and a platform service such as an API gateway. But it’s a useful distinction.

Similarly, the line between compute, storage, and networking is arbitrary, as even the most primitive resource is usually composed of a combination of other resources.⁴ However, it’s useful to group resources by the essential capability they provide and consider examples of each of them.

Compute resources

Compute resources execute code. At its most elemental, compute is execution time on a physical server CPU core. But most platforms provide compute in different ways. Common compute resources include the following:

VM instances
Physical servers, also called bare metal as a service (BMaaS)
Server clusters, such as AWS Auto Scaling group (ASG), Azure virtual machine scale set, and Google managed instance groups (MIGs)
Container instances, containers as a service (CaaS)
Container clusters as a service (CCaaS), although sometimes also called CaaS; examples include Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE)
Function as a service (FaaS) serverless code runtimes, such as AWS Lambda

The variety of ways to provision and use compute resources creates useful options for designing and implementing applications to use them effectively and efficiently.

Storage resources

Infrastructure platforms also package and provide storage in multiple ways. Typical storage resources include the following:

Block storage: Virtual disk volumes that can be mounted to virtual servers or other compute instances. Examples include Amazon Elastic Block Store (EBS), Azure Managed Disks, OpenStack Cinder, and GCE Persistent Disk.
Object storage: Provides access to files from multiple locations rather than attached to a specific compute instance. Amazon S3, Azure Blob Storage, Google Cloud Storage, and OpenStack Swift are all examples. Object storage is usually cheaper and more reliable than block storage, but with higher latency.
Networked filesystems, shared network volumes: These are usually volumes that can be mounted on multiple compute instances by using standard protocols, such as Network File System (NFS), Andrew File System (AFS), or Server Message Block (SMB) / Common Internet File System (CIFS).⁵
Structured data storage: These are often managed DBaaS offerings. They can be a relational database management system (RDBMS), key-value store, or formatted document stores for JSON or XML content.
Secrets management: Structured data storage with additional features for secrets management such as rotation and fine-grained access management. See “Handling Secrets” for techniques for managing secrets and infrastructure code.

As with compute resources, storage choices vary from simple options that provide raw storage space to more sophisticated options tailored for more narrow use cases.

Network resources

Typical networking constructs and services that an IaaS platform provides include:

Network address blocks, such as VPCs, virtual networks, subnets, and VLANs
DNS service
Traffic routing, gateways (low-level and API level), and proxies
Load balancing
Virtual private networks (VPNs)
Firewall rules
Asynchronous message queues
Caching
Service mesh

The capability of dynamic platforms to provision and change networking on demand, from code, creates interesting opportunities. These opportunities go beyond changing networking more quickly; they also include much safer use of networking.

Part of the safety comes from the ability to quickly and accurately test a networking configuration change before applying it to a critical environment. Beyond this, software-defined networking (SDN) makes it possible to create finer-grained network security constructs than you can do manually. This is especially true with systems where you create and destroy elements dynamically.
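
As a small illustration of testing a networking change before applying it, the following sketch checks a set of firewall rules declared as data. The rule format, the rule names, and the policy that only ports 80 and 443 may be open to every address are assumptions made up for this example; the check relies only on Python's standard ipaddress module.

```python
import ipaddress

# Hypothetical firewall rules as they might be declared in infrastructure code.
FIREWALL_RULES = [
    {"name": "public-https", "port": 443, "source": "0.0.0.0/0"},
    {"name": "admin-ssh", "port": 22, "source": "10.20.0.0/24"},
    {"name": "db-from-app", "port": 5432, "source": "10.20.1.0/24"},
]

# Assumed policy: only these ports may be open to any address.
PUBLIC_PORTS = {80, 443}


def find_violations(rules: list) -> list:
    """Return the names of rules that expose a non-public port to every address."""
    violations = []
    for rule in rules:
        source = ipaddress.ip_network(rule["source"])
        open_to_world = source.prefixlen == 0
        if open_to_world and rule["port"] not in PUBLIC_PORTS:
            violations.append(rule["name"])
    return violations


# A proposed change that would open SSH to the whole internet.
proposed = FIREWALL_RULES + [{"name": "bad-ssh", "port": 22, "source": "0.0.0.0/0"}]
```

A check like find_violations(proposed) could run in a pipeline stage and block the change before it reaches a critical environment.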

The details of networking are outside the scope of this book, so check the documentation for your platform provider, and perhaps a reference such as Craig Hunt’s TCP/IP Network Administration (O’Reilly).

IaaS in the Data Center

People used Infrastructure as Code in data centers long before public IaaS clouds brought it into mainstream IT. Early Infrastructure as Code focused on configuring servers more than networks and storage. Public cloud IaaS made it possible to use infrastructure code for broader infrastructure, which drove interest in ways to offer IaaS on premises.

Earlier, Table 3-1 included some open source and commercial products for building private IaaS platforms in a data center. The bare-metal cloud tools in that table can automate the provisioning of physical servers, either to use directly or as a first step for installing virtualization or IaaS software. Many of these tools are used for installing private IaaS products—for example, automating the process of installing hypervisors onto physical servers.

The major IaaS cloud vendors also offer products for deploying private instances of services in a data center that are compatible with their public cloud offerings. AWS offers Outposts, Azure has Local, and Google provides Anthos. These solutions are not necessarily complete private IaaS solutions, but are intended as stepping stones or complements for running hybrid clouds with their public cloud services.

Some use cases require building and running private cloud infrastructure. Some that I’ve encountered are regulatory, although the availability and acceptance of hyperscalers across smaller regions, and improvements to options for governance and control within these platforms, mean that these use cases are shrinking. Other situations driving private infrastructure involve supporting physical infrastructure such as power networks.

More-dubious rationales for managing private infrastructure involve cost and expertise. Some people compare the cost of public cloud infrastructure with the cost of owning equivalent hardware.⁶ However, building and operating physical compute infrastructure involves more than buying servers. The investment and expertise needed to run a highly available, performant IT estate includes operational support, continually updating and replacing systems and parts, physical security, and increased staff and organizational management. I’ve encountered very few organizations that manage data center infrastructure as effectively, and cost-effectively, as dedicated cloud vendors.

Multicloud

Many organizations end up hosting across multiple platforms. A few terms crop up to describe variations of this:

Hybrid cloud: Hosting applications and services for a system across both private infrastructure and a public cloud service. People often do this because of legacy systems that they can’t easily migrate to a public cloud service (such as services running on mainframes). In other cases, organizations have requirements that public cloud vendors can’t currently meet, such as legal requirements to host data in a country where the vendor doesn’t have a presence.
Polycloud: Using multiple clouds, running different workloads on each. Some organizations do this to deliberately exploit the strengths of each platform. In other cases, the organization has acquired multiple existing cloud workloads, or certain parts of the organization have made different choices of cloud vendors. Having active, in-house capability for working with multiple clouds can shorten the time needed to move workloads from one cloud to another if needed.
Cloud agnostic: Running workloads that can be shifted dynamically among cloud platforms. People often aspire to be cloud agnostic in hopes of avoiding lock-in to one cloud vendor, or to have a failover option if one cloud vendor fails. In practice, this aspiration often leads to enormous investment and lock-in to cloud abstraction software layers, often with little true value.

Although building an abstraction layer to hide cloud implementations creates more problems than it solves, organizations do need to define how they use and interact with cloud platforms. This typically takes the form of an engineering platform.

The Cost of Avoiding Cloud Vendor Lock-in

I would argue that the cost of having a cloud-agnostic estate is not double the cost of using one cloud, but an order of magnitude (10×) higher. Don’t underestimate the effort and cost needed to build and continually maintain portability across clouds. Having an outage of several hours can be expensive, but does that cost justify 10× your cloud budget?⁷

Similarly, many organizations need to be prepared to switch cloud vendors (this is a regulatory requirement in some jurisdictions). However, in most cases having a clear plan to migrate workloads, even where this involves engineering work, is enough to address the risk.
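
The trade-off between outage cost and a 10× cloud budget can be put into rough numbers. The figures below are entirely hypothetical, and the calculation makes the generous assumption that a cloud-agnostic estate avoids the outage completely; substitute your own budget and outage costs.

```python
def break_even_outage_hours(
    annual_cloud_budget: float,
    cost_multiplier: float,
    outage_cost_per_hour: float,
) -> float:
    """Hours of avoided outage per year needed to pay for cross-cloud portability."""
    extra_spend = annual_cloud_budget * (cost_multiplier - 1)
    return extra_spend / outage_cost_per_hour


# Hypothetical figures: a $1M annual cloud budget, the 10x multiplier argued
# above, and an outage that costs the business $50,000 per hour.
hours = break_even_outage_hours(
    annual_cloud_budget=1_000_000,
    cost_multiplier=10,
    outage_cost_per_hour=50_000,
)
```

With these numbers, the extra $9M per year pays for itself only if it prevents 180 hours of outage annually, far more than the several hours considered in this sidebar.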

Engineering Platforms

An engineering platform provides technology capabilities to users inside an organization, who use them to create and operate products for users inside and outside the organization. People sometimes see an engineering platform as a single, unified solution, and many vendors aim to sell them this way. However, a platform is best seen as a collection of services. Different services may be provided by different teams, and in some cases hosted by external vendors as a software as a service (SaaS) solution.

Infrastructure as Code can be viewed as a method for building and providing platform services. I’ll describe various ways infrastructure is used to do this later in this section, after discussing platform services more generally. First, Figure 3-5 shows two groupings of platform services.

Application runtime services host software. These include compute-focused services like virtual servers, container clusters, and serverless runtime services. They also include data services like databases and networking services like event messaging services and web gateways. People working with on-premises estates often refer to this category of services as “middleware.” Application runtime capabilities are at the heart of what we provide through infrastructure code.

Operational services are not strictly required to host the software but are needed to make sure it runs well. Examples include monitoring, observability, security management, disaster recovery, and capacity management.

Platform Services

A platform service is an implementation of a technology capability as a cohesive offering within an engineering platform. Figure 3-6 shows a few of the platform services used by the FoodSpin service.⁸

These services are defined by the capability they provide to the software that runs on them, rather than the details of the technology. “Public Traffic,” for example, may include DNS entries, content distribution network (CDN) services, and network routing. But the service is defined by the fact that it provides connectivity from users on the public internet to an application.

Providing Platform Service Functionality

The functionality of a platform service may be provided in various ways, and infrastructure is used differently in each. For example, a monitoring service might use functionality from a software package deployed onto the organization’s infrastructure, a service provided by the IaaS cloud vendor, or a SaaS monitoring solution. Figure 3-7 shows each of these options.

Infrastructure code is used in different ways for each of these three options:

Packaged software: The platform service functionality is provided by deploying and running a software package. Examples include open source monitoring software like Prometheus, a secrets management service like HashiCorp Vault, or a packaged container cluster service like kOps or Rancher. Infrastructure code is used to provision infrastructure to host the software package as well as to integrate it with other infrastructure and services, such as networking and authorization.
Cloud platform–provided service: Most cloud platform vendors not only provide basic infrastructure resources like virtual servers and network structures but also offer higher-level platform service functionality. Examples include Azure Monitor, AWS Secrets Manager, and GKE. Infrastructure code directly defines, configures, and provisions the services in the IaaS platform. The code also defines integrations with other resources like networking and authorization.
Externally hosted service: Many organizations use services hosted by a SaaS vendor. Examples include Datadog monitoring, Akamai Edge DNS, and Okta identity management. Many SaaS providers have APIs supported by Infrastructure as Code tools, so you can write code to provision, configure, and integrate their services with workloads and services hosted on your infrastructure.

Example Technology Capability Implementations by FoodSpin

The FoodSpin teams have built their systems over nearly 20 years and have provided platform service functionality in a variety of ways. When FoodSpin introduced an API layer for mobile applications and later opened it up to third-party developers, the company deployed and ran the Kong API gateway on an Amazon EKS cluster and an Amazon Relational Database Service (RDS) for PostgreSQL instance. FoodSpin’s infrastructure code pulled Docker images with Kong preinstalled. This is an example of pull-based packaged software for a platform service.

Later, the team decided to migrate to the Amazon API Gateway service. Most of the implementation for FoodSpin’s folks involved writing Terraform code to configure the service, with some work by the application developers to migrate their code. This is an example of functionality provided by the cloud platform.

FoodSpin management decided to offer a chatbot powered by generative AI (GenAI) to help customers create meal orders. The managers considered using a large language model (LLM) service provided by their IaaS platform, but in the end chose a SaaS GenAI vendor. The team deploys an AWS CDK stack that provides connectivity to the GenAI API for a chatbot service written and deployed by the FoodSpin team.

Platform Delivery Services

This chapter has defined a system’s capabilities as belonging to three layers of applications, engineering services, and infrastructure resources. However, a system also needs a set of meta-capabilities that can be used to build, deploy, and manage whatever software, resources, or services implement the system’s capabilities. These meta-capabilities can be called platform delivery services.

Figure 3-8 shows three sets of delivery services, one for each of the system layers.

These services are also sometimes called control planes. They may also be called platforms, as in an application delivery platform, or an infrastructure management platform. Most people would consider them all to be a part of the engineering platform layer of the system:⁹

Application delivery services: Application delivery services are used to develop, test, and deliver changes to software. Examples of specific services include source code and artifact repositories, CI services, CD pipelines, and automated testing tools.
Platform management services: Orchestrating platform services may be done with a combination of services or tools, including developer portals and platform service frameworks.
Infrastructure management services: These services are most interesting for readers of this book, since they include whatever tools and technology are used to manage infrastructure, particularly Infrastructure as Code tools.

As with many models, these three categories aren’t necessarily exclusive. Some services, like a CD pipeline service, could be used to deliver applications, platform services, and infrastructure.

Application Delivery Services

A wide and sprawling landscape of tools and services can automate the build, testing, delivery, and deployment of application software. These include build and pipeline services and application deployment services. Much of the tooling and practices are shared with infrastructure delivery, so more detail can be found in Chapters 16 and 19.

Infrastructure Delivery Services

Your organization may not use the term “infrastructure delivery services toolchain,” but every organization that uses infrastructure code has one. There you’ll find the obvious infrastructure stack tools like Terraform, CDK, and Pulumi; server configuration tools like Chef and Puppet; testing tools like Terratest and Chef InSpec; and infrastructure-specific delivery tools such as Spacelift and env0. The entire third part of this book is devoted to the infrastructure delivery lifecycle, which is the domain of these services.

Platform Management Services

As with the infrastructure delivery toolchain, different organizations may use different names for these services or may not even clearly define them as a group. Here are a few examples of solutions people use for managing platform services:

Platform-as-a-service (PaaS) solutions: Such as Red Hat OpenShift or VMware Tanzu Platform. A PaaS provides a collection of platform service implementations along with the tooling to provision, configure, and manage them.
Platform-building frameworks: Like Kratix and Humanitec. As opposed to a PaaS, these solutions provide tooling for teams to build and manage their own platform services, rather than providing a set of prebuilt services.
Developer portals: Along the lines of Backstage, which people can use to provision platform services (among other things).

A central platform engineering team may use platform management services to provision environments and services. However, there is a strong movement toward implementing solutions that empower other teams to provision, configure, and manage instances of platform services for themselves.

In some cases, a team can use a self-service solution such as a developer portal to manually trigger the provisioning of a platform service instance to use. In other cases, deploying an application automatically triggers the provisioning of a service the application requires. The latter situation requires integration with the application delivery services.
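
The second case, where deploying an application triggers provisioning of the services it needs, can be sketched like this. The application descriptor format, the service names, and the deploy() function are hypothetical; a real implementation would call the platform management and application delivery services rather than record strings.

```python
def deploy(app: dict, provisioned: set) -> list:
    """Provision any required platform services that are missing, then deploy the app."""
    steps = []
    for service in app["requires"]:
        if service not in provisioned:
            provisioned.add(service)
            steps.append(f"provision {service}")
    steps.append(f"deploy {app['name']}")
    return steps


# Hypothetical application descriptor declaring the platform services it needs.
menu_app = {"name": "menu-service", "requires": ["postgres-database", "message-queue"]}

# Platform service instances that already exist in the environment.
existing = {"message-queue"}

first_run = deploy(menu_app, existing)
second_run = deploy(menu_app, existing)
```

On the first run the missing database is provisioned before the deployment; on the second run nothing needs provisioning, so only the deployment step remains.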

Conclusion

The previous two chapters set out the conceptual groundwork for Infrastructure as Code. This chapter set out the context that infrastructure code is used in. As mentioned earlier, the term “infrastructure” can be used broadly to include everything below applications in the layer-based view of a system. This chapter hopefully makes the scope of infrastructure clear and sensible for the purposes of this book.

Having touched on the topic of infrastructure delivery services in this chapter, Chapter 4 gets into the details of Infrastructure as Code tools and languages.

¹ See the “CNCF Platforms White Paper”. For another definition of platform, see “What I Talk About When I Talk About Platforms” by Evan Bottcher.

² The US National Institute of Standards and Technology (NIST) has an excellent definition of cloud computing: “The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer can deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).”

³ Alibaba Cloud is often classed as a hyperscaler as well because of the size of its infrastructure and dominance in Asia. However, it doesn’t yet have as strong a presence outside Asia; it’s an edge case.

⁴ To see how this is true, we can open up a physical network router and find a CPU, RAM, and some kind of persistent storage for its configuration.

⁵ Although not provided by IaaS platforms, distributed filesystems like IPFS and iroh are interesting alternatives.

⁶ 37 Signals CEO David Heinemeier Hansson sparked a popular tech media meme about “cloud repatriation” in his post “We Stand to Save $7M over Five Years from Our Cloud Exit”. Charles Fitzgerald, an analyst, regularly writes about his doubts that this is a meaningful industry trend, such as his post “Platformonomics Repatriation Index – Q4 2022: The Search Continues”.

⁷ See Gregor Hohpe’s article, “Don’t Get Locked Up into Avoiding Lock-in”.

⁸ FoodSpin is the example company described in “Introduction to FoodSpin and Its Strategy”.

⁹ Daniel Bryant’s article “Platform Engineering: Orchestrating Applications, Platforms, and Infrastructure” has a useful platform engineering diagram that offers a different view of the kinds of capabilities of a platform. He uses the terms “application choreography,” “platform orchestration,” and “infrastructure composition” to describe the capabilities of the three delivery services this book defines.

Figure 3-1. An infrastructure platform supports higher system layers

Figure 3-2. Alternative platform layer terminology