Platform Engineering: Technology Deep Dive for Enterprise Architects


The strategy and the ROI case are covered in Part 1: The Executive Guide and Part 2: The Business Case. This article is for the engineers and architects who have to build the thing. Backstage has accumulated 9,144 contributors and is running in over 3,000 organizations including Netflix, Spotify, and American Airlines, with 12,300 commits in 2024 alone (CNCF/Backstage, 2024). The community has coalesced. The reference architecture is stable. What remains is knowing how to assemble the pieces.

This article maps the four-layer IDP reference architecture to specific CNCF tools, covers GitOps with Argo CD, policy-as-code with OPA/Rego, infrastructure-as-code with Crossplane and Pulumi, secrets management, observability, AI integration, and the platform team structure that keeps it all working.

Key Takeaways

  • Backstage has 9,144 contributors and runs in 3,000+ organizations, including Netflix, Spotify, and American Airlines (CNCF/Backstage, 2024)
  • Argo CD runs in 60% of all surveyed Kubernetes clusters (CNCF Annual Survey, 2025)
  • 89% of platform teams have adopted IaC, but only 6% have full IaC coverage (Firefly State of IaC, 2024)
  • 65% of platform teams use OPA/Rego for policy-as-code (CNCF Annual Survey, 2024)
  • Teams using platform engineering with IaC see an 85% reduction in deployment time (Pulumi, 2024)

What Is the Reference Architecture for an Internal Developer Platform?

A mature Internal Developer Platform is not a monolith. It’s four distinct layers that compose into a coherent self-service system. Gartner (2024) predicts 75% of organizations with platform teams will have developer portals by 2026 — but the portal is just the visible surface of three layers underneath it. Getting the layer boundaries wrong is the single most common cause of IDP projects that deliver dashboards without actually reducing developer friction.

The four layers form a stack where each one depends on the layer below. Infrastructure provides the compute and cloud primitives. The delivery layer adds GitOps pipelines and deployment automation. The portal layer surfaces everything through a developer-facing interface. The governance layer enforces policy and compliance across all three.

The Four-Layer IDP Reference Architecture

  • Infrastructure — Crossplane, Pulumi, Terraform, External Secrets Operator, Vault. Provides cloud-agnostic resource provisioning via API; version-controlled, policy-enforced infrastructure with no console access required.
  • Delivery — Argo CD, Tekton, Harbor (image registry), Trivy (image scanning). Provides GitOps-driven deployment; self-healing workloads; auditable change history; security scanning baked into every pipeline.
  • Portal — Backstage (software catalog, TechDocs, scaffolding), OpenTelemetry. Provides a single developer interface for discovery, deployment, documentation, and observability across all services.
  • Governance — OPA/Rego, Kyverno, Falco, CNCF OpenCost. Provides policy-as-code at admission control; runtime security; cost attribution and guardrails applied uniformly across every workload.

The key architectural decision is treating each layer as an internal API contract. The delivery layer doesn’t care which IaC tool provisions the cluster — it consumes an API that returns a running Kubernetes environment. The portal doesn’t care how Argo CD deploys — it reads the GitOps status endpoint. This interface discipline is what lets the platform evolve without forcing developers to relearn workflows.


How Does Backstage Work as a Developer Portal?

Backstage has accumulated 9,144 contributors, generated 12,300 commits in 2024 alone, and is deployed in over 3,000 organizations globally (CNCF/Backstage, 2024). That community scale has produced a plugin ecosystem exceeding 200 integrations — making Backstage the de facto standard for the developer portal layer across the CNCF ecosystem.

Backstage solves the service catalog problem at scale, and the scale it was designed for is significant. Spotify built it to manage thousands of services across hundreds of teams, and the open-source version maintains that scope.

Backstage is built around three core capabilities. The software catalog ingests metadata from your repositories, CI/CD systems, and cloud providers, then presents every service, API, library, and team in a searchable, interconnected registry. TechDocs renders documentation from Markdown files co-located in the source repository, so documentation stays current because it lives next to the code. The scaffolder generates new service repositories from templates that bake in organizational standards — security configurations, CI/CD pipeline definitions, observability defaults — from the first commit.
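The catalog is fed by descriptor files committed to each repository. A minimal example follows; the service name, team, and API reference are illustrative placeholders, but the field structure follows the standard Backstage catalog model.

```yaml
# catalog-info.yaml — lives at the root of the service repository.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing for checkout
  annotations:
    backstage.io/techdocs-ref: dir:.   # TechDocs renders the docs/ folder in this repo
spec:
  type: service
  lifecycle: production
  owner: team-checkout           # resolves to a Group entity in the catalog
  providesApis:
    - payments-v1
```

Because this file is versioned alongside the code, catalog metadata and ownership information change through the same pull-request workflow as everything else.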

What Backstage Is Not

Backstage is not a ready-to-run product. It’s a framework that requires configuration, plugin selection, and ongoing maintenance. A realistic Backstage deployment takes four to eight weeks of initial setup and requires one to two engineers permanently allocated to platform maintenance. Organizations that treat it as a one-time install typically end up with a stale catalog and zero adoption within six months.

The plugin ecosystem is where Backstage earns its flexibility. Plugins exist for Argo CD deployment status, Kubernetes pod health, cost data from OpenCost, GitHub Actions workflow status, Vault secret health, and PagerDuty alert routing. The CNCF plugin directory has grown to over 200 plugins, covering most of the CNCF landscape. The practical effect is that a developer who opens Backstage can see their service’s deployment status, documentation, runbook links, cost attribution, and on-call rotation in one place.

Figure: Platform Maturity Assessment — radar chart of industry average scores across six dimensions, rated 1-5: Self-service 3.2, Developer Experience 3.0, Security & Compliance 3.5, Observability 2.8, Cost Management 2.4, AI Integration 2.1. Source: CNCF Annual Survey (2024).
Industry average platform maturity by dimension. AI Integration (2.1/5) and Cost Management (2.4/5) are the most underdeveloped capabilities. Security and Compliance (3.5/5) scores highest — reflecting compliance pressure more than architectural intent.
Unique Insight: The radar chart reveals a pattern that appears consistently across platform engineering programs: security investment leads because it's driven by external compliance requirements, while AI integration and cost management lag because they require internal champions. Organizations that use this chart in platform team planning sessions typically identify Cost Management as the quickest win — the tooling is mature (OpenCost, Kubecost), the savings are quantifiable within weeks, and finance teams celebrate the outcome immediately. That early win builds the political capital to fund the harder AI integration work.

What Is GitOps and How Does Argo CD Fit In?

GitOps is the deployment model where Git is the single source of truth for what should be running in production. CNCF’s 2025 survey found Argo CD running in 60% of all surveyed Kubernetes clusters — a remarkable adoption rate for a project that only graduated from CNCF incubation in late 2022. That number reflects something more than community preference. It reflects that GitOps has become the operational standard for Kubernetes-native delivery.

The core GitOps loop is straightforward. Developers commit application manifests or Helm charts to a Git repository. Argo CD watches that repository and continuously reconciles the live cluster state against the declared state in Git. If a deployment drifts (someone runs kubectl apply directly, or a pod crashes and restarts in a bad state), Argo CD detects the divergence and restores the declared state automatically. Every change is auditable because every change is a Git commit.
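The loop above is declared in a single Argo CD Application resource. The sketch below shows the shape of such a declaration; the repository URL, path, and namespace are illustrative placeholders.

```yaml
# An Argo CD Application: "keep the cluster matching this Git path."
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.internal.company.com/platform/payments-api.git
    targetRevision: main
    path: deploy/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete cluster resources that were removed from Git
      selfHeal: true   # revert manual drift, e.g. an ad-hoc kubectl apply
```

The `selfHeal` flag is what makes the reconciliation continuous rather than event-driven: Argo CD restores the declared state even when the divergence did not originate in Git.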

Why GitOps Changes the Security Posture

GitOps inverts the traditional CI/CD security model in a way that most architects don’t initially appreciate. In a push-based pipeline, your CI system has write access to production. It holds credentials to the Kubernetes API server. Those credentials are a high-value attack target. In a GitOps model, the cluster pulls from Git. The CI system only writes to the Git repository. The production cluster’s API server is never exposed to the CI pipeline at all.

This matters when you look at supply chain attack vectors. A compromised CI runner in a push model can directly deploy malicious workloads to production. A compromised CI runner in a GitOps model can only write to Git, where the commit is visible, auditable, and requires the Argo CD sync policy to approve before anything reaches production. Combined with OPA admission control at the cluster boundary, this architecture significantly narrows the blast radius of a CI system compromise.

Argo CD Implementation Patterns

Argo CD supports three deployment patterns for enterprise use. App of Apps treats each application as an Argo CD Application resource, and a parent Application manages the full set. ApplicationSets generate Application resources dynamically from templates, enabling one Argo CD configuration to manage hundreds of clusters and services. Image updater watches container registries and automatically commits updated image tags to Git when new versions are published, triggering the GitOps sync.
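An ApplicationSet with the cluster generator illustrates the second pattern: one template, one Application per matching cluster. The labels and repository details below are illustrative; `{{name}}` and `{{server}}` are the parameters the cluster generator substitutes per cluster.

```yaml
# One ApplicationSet stamps out an Application per production cluster.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-api
  namespace: argocd
spec:
  generators:
    - clusters:                       # enumerate clusters registered in Argo CD
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: 'payments-api-{{name}}'   # cluster name injected by the generator
    spec:
      project: default
      source:
        repoURL: https://git.internal.company.com/platform/payments-api.git
        targetRevision: main
        path: deploy/overlays/production
      destination:
        server: '{{server}}'          # cluster API endpoint from the generator
        namespace: payments
      syncPolicy:
        automated:
          selfHeal: true
```

Adding a cluster to the fleet then requires no new Argo CD configuration: registering the cluster with the matching label is enough.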

For multi-cluster environments, Argo CD’s hub-and-spoke model deploys a central Argo CD instance to a management cluster, which then manages application state across any number of spoke clusters. This pattern scales to hundreds of clusters without requiring a separate Argo CD instance per cluster.


How Do You Implement Policy-as-Code at the Platform Layer?

CNCF’s 2024 Annual Survey found that 65% of platform teams use OPA/Rego for policy-as-code. That adoption rate reflects a consensus that has formed around a problem every platform team eventually hits: how do you enforce organizational standards consistently across every team, every deployment, and every cluster without creating a manual review bottleneck? Policy-as-code is the answer. OPA (Open Policy Agent) is the implementation.

OPA functions as an admission controller in Kubernetes. Every resource creation or update request passes through OPA before reaching the Kubernetes API server. OPA evaluates the request against a set of Rego policies and either allows the request or returns a denial message explaining which policy was violated. The entire process is transparent to the developer: they attempt a deployment, the admission webhook runs, and if a policy is violated they receive an immediate, specific error message.

A Real OPA/Rego Admission Policy

The following policy enforces two rules that most enterprise platform teams implement early: images must come from an approved internal registry, and every container must declare resource limits. Without resource limits, a single runaway container can starve other workloads on the same node.

package kubernetes.admission

# Rule 1: every container image must come from the approved internal registry.
deny[msg] {
  input.request.kind.kind == "Deployment"
  container := input.request.object.spec.template.spec.containers[_]
  not startswith(container.image, "registry.internal.company.com/")
  msg := sprintf("Image '%v' must be sourced from the internal registry", [container.image])
}

# Rule 2: every container must declare CPU and memory limits.
deny[msg] {
  input.request.kind.kind == "Deployment"
  container := input.request.object.spec.template.spec.containers[_]
  not container.resources.limits
  msg := sprintf("Container '%v' must declare resource limits", [container.name])
}

These two rules illustrate the OPA pattern clearly. Each deny rule is independent. OPA collects all matching deny messages and returns them together, so a developer sees every policy violation in a single response rather than fixing one at a time. The input.request object contains the full Kubernetes admission review payload, giving Rego policies access to every field of every resource being created or modified.
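For readers unfamiliar with that payload, the abridged AdmissionReview below (shown as YAML; OPA receives the JSON equivalent) illustrates the fields the policies above read. The uid, names, and image are illustrative.

```yaml
# Abridged shape of the AdmissionReview that OPA sees as `input`.
apiVersion: admission.k8s.io/v1
kind: AdmissionReview
request:
  uid: 705ab4f5-6393-11e8-b7cc-42010a800002
  kind:                    # read by: input.request.kind.kind == "Deployment"
    group: apps
    version: v1
    kind: Deployment
  operation: CREATE
  object:                  # the full resource being admitted
    spec:
      template:
        spec:
          containers:      # iterated by: ...spec.containers[_]
            - name: api
              image: registry.internal.company.com/payments/api:1.4.2
              resources:
                limits:    # absence of this block triggers Rule 2
                  cpu: "500m"
                  memory: 512Mi
```

This request passes both rules: the image is from the internal registry and resource limits are declared.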

OPA vs. Kyverno: Choosing the Right Policy Engine

[PERSONAL EXPERIENCE] OPA/Rego is more expressive than Kyverno but has a steeper learning curve. Kyverno uses Kubernetes-native YAML policies that most teams can read and write without learning a new language. For platform teams starting out, Kyverno reduces time-to-first-policy significantly. For teams that need complex conditional logic — “deny this unless the requester is in group X and the environment is tagged as non-production” — Rego’s full programming model becomes necessary. In practice, larger organizations end up running both: Kyverno for simple structural policies (label requirements, naming conventions), OPA for complex admission logic.
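The contrast is easiest to see in Kyverno's own format. The policy below enforces a simple structural rule, a required `team` label on Deployments, entirely in Kubernetes-native YAML; the label name is an illustrative choice.

```yaml
# A Kyverno ClusterPolicy: no new language, just YAML patterns.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # use Audit during a warning period
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "All Deployments must carry a 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*"             # any non-empty value
```

A rule like this takes minutes to write and review; the equivalent conditional logic from the Rego example above would take longer but can express far more.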


What Does the IaC Coverage Gap Look Like in Practice?

Firefly’s State of IaC report (2024) reveals the most striking gap in the platform engineering landscape: 89% of platform teams have adopted IaC in some form, but only 6% have achieved full IaC coverage. That 83-point gap means the vast majority of organizations have some infrastructure managed by IaC and a large portion managed by other means — console clicks, scripts, undocumented manual steps — living alongside it.

Figure: IDP Capability Adoption Rates — percentage of platform teams with each capability: Infrastructure-as-Code (any) 89%, CI/CD Pipelines 82%, GitOps 60%, Secrets Management 55%, Dedicated Platform Team 28%, Developer Portal 20%, Full IaC Coverage 6%. Developer Portal and Full IaC Coverage are marked as notable gaps. Source: Firefly State of IaC (2024) + CNCF Annual Survey (2024).
The 83-point gap between "IaC adopted" (89%) and "full IaC coverage" (6%) is the defining problem in platform engineering today. Most organizations have IaC for new infrastructure and manual processes for legacy workloads. Developer portals at 20% show the portal layer is still emerging.

Closing the IaC Gap with Crossplane

Crossplane addresses the IaC gap from a different angle than Terraform or Pulumi. Instead of running IaC in a CI pipeline, Crossplane runs inside Kubernetes and manages cloud resources using the same controller reconciliation loop that manages Kubernetes resources. The implication is significant: every cloud resource becomes a Kubernetes object with a spec and a status. Drift detection and remediation happen continuously, not only when someone runs a pipeline.

For platform teams, Crossplane enables “composition” — a higher-level abstraction where a developer requests a PostgreSQLDatabase custom resource and Crossplane composes the underlying RDS instance, parameter group, subnet group, and security group automatically. The developer never sees the cloud primitives. They get a database. That’s the self-service experience the infrastructure layer is supposed to provide.
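From the developer's side, the entire request is one small resource. The sketch below assumes the platform team has published a PostgreSQLDatabase composite resource definition; the API group, parameter names, and labels are illustrative and depend on that definition.

```yaml
# A developer's claim for a database; Crossplane composes the cloud
# primitives (RDS instance, subnet group, security group) behind it.
apiVersion: platform.company.com/v1alpha1
kind: PostgreSQLDatabase
metadata:
  name: orders-db
  namespace: team-checkout
spec:
  parameters:
    storageGB: 50
    version: "15"
  compositionSelector:
    matchLabels:
      provider: aws              # platform team maps this to a Composition
  writeConnectionSecretToRef:
    name: orders-db-conn         # credentials delivered as a Kubernetes Secret
```

The developer applies this like any other Kubernetes resource and reads connection credentials from the resulting Secret, never touching a cloud console.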

Pulumi reports an 85% reduction in deployment time for teams using platform engineering with IaC, which reflects the compounding effect of removing manual steps from every deployment cycle.


What Does AI Integration Look Like at the Platform Layer?

GitHub Copilot (2024) found that developers using AI-assisted coding complete tasks 55.8% faster than those without it. That speed advantage is real, but it introduces a new platform concern: AI-generated code that bypasses the platform’s security controls. The platform layer is where you both enable AI productivity and enforce the guardrails that prevent AI-generated code from becoming a security liability.

The integration pattern has two components. At the developer tooling layer, GitHub Copilot or equivalent connects to the IDE and assists with code generation. At the platform layer, policy controls ensure that AI-generated container images are scanned by Trivy before promotion, that AI-generated Kubernetes manifests pass OPA admission control, and that AI-suggested infrastructure code goes through the same GitOps review cycle as hand-written code. The platform doesn’t need to know the code was AI-generated. The controls apply uniformly.

Unique Insight: The most effective platform teams we've observed add one AI-specific control: a Backstage plugin that surfaces AI code provenance metadata in the software catalog. When a service's tech docs page shows what percentage of its recent commits used AI assistance, teams can correlate that data with incident rates and code review times. Early data suggests that services with 30-50% AI-assisted code and strong code review culture have lower defect rates than either all-manual or high-AI-with-weak-review teams. The platform's job is to make that measurement possible.

Platform-Level AI Guardrails

Three controls belong at the platform layer for AI-integrated development. First, image scanning with Trivy or Grype runs on every container image, AI-generated or not, before promotion to any environment above development. Second, OPA policies enforce that no image enters production unless it carries a verified scan signature from the platform’s registry. Third, secret scanning in the CI pipeline catches the most common AI-generated security mistake: credentials or API keys embedded in code.

Only 28% of organizations have a dedicated platform engineering team (CNCF/SlashData, Q1 2026). Organizations without a dedicated team tend to implement AI tooling without the governance layer — giving developers faster code generation without the controls that make that speed sustainable.


How Should You Structure the Platform Team?

Only 28% of organizations have a dedicated platform engineering team, per CNCF/SlashData Q1 2026. That gap between IDP adoption intent and actual dedicated team structure explains why most platform programs underdeliver — the technology exists, but the organizational model to maintain and evolve it does not.

The platform team structure determines whether the IDP becomes a product developers choose to use or a mandatory system they work around.

The minimum viable platform team for an organization with 50-200 developers has five roles. A Platform Product Manager owns the roadmap, prioritizes the backlog based on developer feedback, and measures adoption metrics rather than delivery velocity. Two to three Platform Engineers build and maintain the infrastructure layer, the delivery layer, and the governance layer. A Developer Experience (DX) Lead owns the portal layer, runs developer interviews, and treats low adoption as a product failure requiring investigation. An SRE or reliability engineer owns the platform’s own SLA — because a platform that goes down takes every team’s deployment capability with it.

What the Platform Team Does Not Do

The platform team does not write application code for other teams. It does not respond to ad-hoc infrastructure requests through a ticket queue. It does not own security policy definition (that belongs to the security team) but does own policy enforcement implementation. These boundaries matter because platform teams that absorb ticket-queue work stop building the self-service capabilities that justify their existence.

The feedback loop is the most critical operational practice. The DX Lead should run developer office hours weekly, review support channels daily, and produce a monthly developer satisfaction score tied to specific friction points. When adoption of a platform feature drops, that’s a product signal, not a compliance problem. The response is iteration, not mandate.


What Is the Right Migration Path to Platform Engineering?

Only 6% of organizations have achieved full IaC coverage despite 89% having adopted IaC in some form, according to Firefly’s State of IaC report (2024). That 83-point gap is the defining challenge for migrations: organizations accumulate automation incrementally while legacy workloads remain unmanaged. A phased migration path exists precisely to close that gap without disrupting active delivery.

Most organizations don’t start from scratch. They have existing CI/CD pipelines, some Terraform, a mix of cloud consoles and automation, and twelve different ways to deploy a service across twelve different teams. The migration path matters as much as the target architecture. Gartner (2024) recommends a phased approach, and the field experience behind that recommendation is consistent: big-bang platform migrations fail at a high rate.

Phase 1: Catalog and Observe (Months 1-3)

Before building anything, deploy Backstage in read-only catalog mode. Ingest your existing services, APIs, and teams from GitHub, GitLab, or wherever your code lives. The software catalog gives you a baseline picture of what exists. Run OpenTelemetry collectors in your existing clusters to establish observability baselines. Don’t change anything developers do yet. Measure first.

This phase also produces the business case data. Once you can see that you have 47 services without runbooks, 23 services with no owner listed, and 15 active clusters with no cost attribution, the case for the next phase writes itself.

Phase 2: Golden Path for New Services (Months 3-6)

Build the first golden path for new service creation only. A Backstage scaffolder template that generates a repository with a working Dockerfile, a CI pipeline with Trivy scanning, a Helm chart, and an Argo CD Application resource. Do not migrate existing services. Let the golden path prove its value organically. New services built on it will ship faster and with fewer incidents than those not on it. That performance delta is your adoption argument.
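An abridged scaffolder template for that golden path might look like the following. The step actions shown (`fetch:template`, `publish:github`) are standard Backstage scaffolder actions; the skeleton path, GitHub owner, and parameter set are illustrative placeholders.

```yaml
# Abridged Backstage scaffolder template for the golden path.
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: golden-path-service
  title: New Service (Golden Path)
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Unique service name
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton          # Dockerfile, CI with Trivy, Helm chart, Argo CD Application
        values:
          name: ${{ parameters.name }}
    - id: publish
      name: Publish repository
      action: publish:github
      input:
        repoUrl: github.com?owner=company&repo=${{ parameters.name }}
```

Everything the organization considers mandatory lives in the skeleton, so a developer's first commit already carries scanning, pipelines, and deployment wiring.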

Phase 3: GitOps and Policy Rollout (Months 6-12)

Migrate CI/CD to GitOps for teams that volunteer. Start with non-production environments, where the risk of the migration is low and the feedback loop is fast. Introduce OPA admission policies in warning mode (log violations, don’t block) for at least four weeks before switching to enforcement mode. That warning period lets teams fix violations before they hit a deployment blocker.
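If OPA runs via Gatekeeper, as it commonly does for Kubernetes admission control, the warning period maps directly to the `enforcementAction` field. The constraint kind below assumes a corresponding ConstraintTemplate has been installed and is named for illustration only.

```yaml
# A Gatekeeper constraint in warning mode: violations are logged and
# surfaced to the user, but nothing is blocked yet.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResourceLimits      # hypothetical template name
metadata:
  name: require-resource-limits
spec:
  enforcementAction: warn            # switch to "deny" after the warning period
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
```

Flipping one field from `warn` to `deny` is the entire enforcement cutover, which makes the four-week transition cheap to operate.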

Phase 4: Full Platform Maturity (Month 12 and Beyond)

By month 12, new services run on the golden path, GitOps handles the majority of deployments, OPA policies are in enforcement mode, and Backstage is the default interface for most developer workflows. This is when you tackle the IaC gap: use Crossplane to bring unmanaged infrastructure under control incrementally, starting with the highest-risk environments.

[PERSONAL EXPERIENCE] The phase that most teams underestimate is Phase 3. Introducing OPA policies in enforcement mode without a warning period is the single fastest way to destroy platform team credibility. A policy that blocks a production deployment at 11 PM because a label is missing makes engineers hostile to the entire platform concept. The four-week warning period isn’t bureaucracy. It’s the difference between a platform engineers choose to use and one they spend effort routing around.


Frequently Asked Questions

What is the difference between a “golden path” and a “paved road”?

These terms are often used interchangeably, but they have a subtle distinction. A golden path is the recommended route for a specific workflow — the opinionated, pre-approved way to create a service or deploy to production. A paved road is the broader infrastructure that makes multiple paths usable. The platform provides paved roads; golden paths are the specific routes on those roads. Both concepts emphasize reducing friction, not mandating uniformity. (CNCF Platform Engineering Whitepaper, 2024)

How does Vault integrate with Kubernetes for secrets management?

Vault integrates with Kubernetes via the External Secrets Operator (ESO), which creates Kubernetes Secret objects by pulling values from Vault at sync intervals. Developers define an ExternalSecret resource that references a Vault path. ESO handles token rotation, version tracking, and secret lifecycle. The developer never sees the secret value — only the Kubernetes Secret reference in their deployment manifest. This model prevents secrets from living in Git while keeping the developer workflow straightforward.
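A minimal ExternalSecret shows the developer-facing side of that flow. The SecretStore name, Vault path, and key are illustrative; the resource structure follows the ESO v1beta1 API.

```yaml
# ESO pulls the value from Vault and materializes a Kubernetes Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-db-credentials
  namespace: team-checkout
spec:
  refreshInterval: 1h              # re-sync from Vault every hour
  secretStoreRef:
    name: vault-backend            # ClusterSecretStore configured by the platform team
    kind: ClusterSecretStore
  target:
    name: orders-db-credentials    # the Kubernetes Secret ESO creates
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/orders-db # Vault KV v2 path
        property: password
```

The deployment manifest then references `orders-db-credentials` like any ordinary Secret, and no secret value ever appears in Git.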

Is Crossplane a replacement for Terraform?

Not a direct replacement — they solve the same problem with different operational models. Terraform is pipeline-driven: you run a plan, review the diff, apply. Crossplane is controller-driven: resources are declared as Kubernetes objects and reconciled continuously. Crossplane is better for self-service scenarios where you want drift detection and remediation without human intervention. Terraform is better for complex dependency graphs and organizations with mature Terraform expertise. Most large platform teams run both, using Crossplane for developer-facing self-service resources and Terraform for foundational infrastructure that changes infrequently.

How should platform teams handle observability across multiple clusters?

The CNCF-recommended approach uses OpenTelemetry as the instrumentation standard and a central collector aggregating signals from all clusters into a backend like Grafana Mimir (metrics), Grafana Loki (logs), and Grafana Tempo (traces). OpenTelemetry’s instrumentation libraries are language-native and produce vendor-neutral telemetry. The platform team owns the collector configuration and the backend; application teams own what they instrument. This separation keeps observability data centralized without requiring teams to configure individual backends.
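A minimal collector pipeline for that topology is sketched below. The endpoints are placeholders, and the `loki` exporter assumes the collector's contrib distribution; an OTLP-based log exporter is an equally valid choice.

```yaml
# OpenTelemetry Collector: one OTLP intake, three Grafana backends.
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch: {}                        # batch signals before export
exporters:
  prometheusremotewrite:           # metrics -> Grafana Mimir
    endpoint: https://mimir.internal.company.com/api/v1/push
  loki:                            # logs -> Grafana Loki (contrib exporter)
    endpoint: https://loki.internal.company.com/loki/api/v1/push
  otlp/tempo:                      # traces -> Grafana Tempo
    endpoint: tempo.internal.company.com:4317
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```

Application teams only ever see the OTLP endpoint; the platform team owns everything below the `receivers` line.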

What is the fastest way to demonstrate platform value in the first 90 days?

Deploy Backstage in catalog-only mode and ingest your existing services. Identify the top three friction points through developer interviews. Build one golden path that eliminates the highest-friction workflow. Measure before and after: time from commit to deployed PR environment, number of steps in the process, support tickets related to that workflow. Concrete before/after numbers for one specific workflow are more persuasive than a comprehensive platform roadmap. (Pulumi, 2024 — 85% deployment time reduction as a benchmark for what early wins look like)

How widely has GitOps been adopted across Kubernetes environments?

Argo CD runs in 60% of all surveyed Kubernetes clusters per CNCF’s 2025 survey, making GitOps the operational baseline for cloud-native delivery. That adoption rate reflects both community consensus and security advantages: GitOps inverts the CI pipeline’s attack surface by pulling from Git rather than pushing to production, significantly narrowing the blast radius of a CI system compromise.

What OPA/Rego policies should every platform team implement first?

Start with two foundational admission policies: requiring images from an approved internal registry and mandating resource limits on every container. Both are low-controversy and high-impact — 65% of platform teams already use OPA/Rego per CNCF’s 2024 Annual Survey. Containers without resource limits can starve neighboring workloads; images from public registries bypass your supply chain security controls entirely.

How does AI-generated code interact with platform security controls?

Platform controls apply uniformly regardless of whether code is AI-generated. Trivy image scanning, OPA admission control, and secret scanning in CI pipelines all run on AI-generated artifacts the same way they run on hand-written code. GitHub Copilot users complete tasks 55.8% faster (GitHub Copilot, 2024) — the platform’s job is to make that speed gain sustainable by ensuring guardrails catch the most common AI-generated errors automatically.


Conclusion

The platform engineering technology stack in 2026 is more mature than at any prior point. Backstage provides a battle-tested portal framework with 9,144 contributors and real production deployments at organizations across every sector. Argo CD has achieved 60% Kubernetes cluster adoption, establishing GitOps as the operational baseline for cloud-native delivery. OPA/Rego gives platform teams declarative, testable policy enforcement at the admission layer. Crossplane closes the self-service gap at the infrastructure layer. OpenTelemetry provides vendor-neutral observability across the full stack.

The 6% full IaC coverage figure from Firefly is the honest benchmark for where most organizations actually are. It reveals that the gap in platform engineering isn’t awareness or tooling selection — it’s execution discipline and organizational sequencing. The phased migration path exists precisely because attempting full platform maturity in one move is the approach that fails most often.

The platform team structure and the adoption strategy matter as much as the technology choices. A technically correct IDP that developers work around produces negative ROI. A technically simpler IDP that developers genuinely choose to use produces the 85% deployment time reduction and the 224% Forrester ROI.

Return to Part 1: The Executive Guide for the foundational concepts and IDP definition, or revisit Part 2: The Business Case for the Forrester ROI analysis and attrition economics that justify the investment.

Sven Schuchardt

Management Consulting · Enterprise Architecture

Bridging the gap between business need and IT & Architecture enablers. With a background in management consulting and enterprise architecture, he translates complex technology decisions into clear, actionable insights — written for every stakeholder, from the boardroom to the engineering team.

Connect on LinkedIn