
How to Provision a Production-Ready Autopilot GKE Cluster

In this blog post I share my opinionated approach to provisioning a Kubernetes cluster on the Google Cloud Platform (GCP) using nothing but OpenTofu and Terragrunt.

The principles discussed here are ones I have learned while dealing with production setups at scale.

If you enjoy Kubernetes or want to learn more about GCP, this is for you.

Introduction

I have had the pleasure of working with Kubernetes in the last few years of my professional career.

While provisioning a Kubernetes cluster from scratch may not be the most interesting part of the day-to-day operation of running one, it is one of the most important.

The reason, especially in the context of Google Cloud, is that some of the initial settings and configurations you specify (or don't) in your day-0 operations cannot be changed after the cluster is created1.

Therefore, I took some time to document my steps for provisioning a production-ready Kubernetes cluster from scratch.

Stick around till the end to see all the components it takes.

Prerequisites

NOTE: With direnv, you can keep your local development environment variables to yourself without exposing them to the VCS. Ensure you have .envrc in the root of the repository with the values you need.

Example values:

# <repo-root>/.envrc
export KUBECONFIG="<repo-root>/.kubeconfig"
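
After creating the file, remember to let direnv load it into your shell (assuming direnv is already hooked into your shell):

direnv allow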

Authenticate to Google Cloud

To get started, you first need to be authenticated to Google Cloud.

gcloud auth login

Additionally, to be able to perform TF API calls against Google Cloud, you need the Application Default Credentials (ADC)2.

gcloud auth application-default login
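
If gcloud warns about a missing quota project for the ADC, you can optionally pin it to the project used throughout this post; adjust the project ID to your own:

gcloud auth application-default set-quota-project developer-friendly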

Project Structure

The code is structured in a directory-per-environment manner (truncated for brevity).

tofu/
├── backend.hcl # <- Terraform cloud remote state backend
├── gcp/
│   ├── gcp.hcl # Google Cloud authentication source
│   └── prod/
│       ├── 10-networking/
│       │   ├── main.tf
│       │   └── terragrunt.hcl
│       ├── 20-gke-encryption-key/
│       │   ├── main.tf
│       │   └── terragrunt.hcl
│       └── 30-kubernetes-cluster/
│           ├── main.tf
│           └── terragrunt.hcl
└── modules/
    └── naming/
        └── main.tf

Shared Modules

Naming

We first need to create a unified naming module to be used everywhere:

tofu/modules/naming/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }

  required_version = "< 2"
}
tofu/modules/naming/variables.tf
variable "project" {
  type        = string
  description = "The project name or ID"
  default     = ""
}

variable "environment" {
  type        = string
  description = "The environment (e.g., dev, prod)"
}

variable "resource_type" {
  type        = string
  description = "The type of resource (e.g., bucket, vm)"
}

variable "suffix" {
  type        = string
  description = "An optional suffix for the resource name"
  default     = ""
}
tofu/modules/naming/data.tf
data "google_client_config" "current" {}
tofu/modules/naming/main.tf
locals {
  project        = coalesce(var.project, data.google_client_config.current.project)
  generated_name = join("-", compact([local.project, var.environment, var.resource_type, var.suffix]))
}
tofu/modules/naming/outputs.tf
output "generated_name" {
  value = local.generated_name
}
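
As a quick sanity check, here is a hypothetical invocation of this module and the name it would generate, assuming the provider is authenticated against the developer-friendly project (as configured later in gcp.hcl):

module "naming_example" {
  # adjust the relative path to wherever your stack lives
  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = "gke"
  suffix        = "k8s-cluster"
}

# module.naming_example.generated_name => "developer-friendly-prod-gke-k8s-cluster"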

Terragrunt Root Modules

There are a few shared configurations included in each of the following stacks34:

tofu/backend.hcl
locals {
  workspace = replace(path_relative_to_include(), "/", "-")
  # e.g. gcp-prod-10-networking
}

generate "remote_state" {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    terraform {
      backend "remote" {
        hostname     = "app.terraform.io"
        organization = "developer-friendly-blog"
        workspaces {
          name = "${local.workspace}"
        }
      }
    }
  EOF
}
tofu/gcp/gcp.hcl
generate "gcp" {
  path      = "provider_gcp.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "google" {
      project = "developer-friendly"
      region  = "europe-west4"
    }
  EOF
}

Networking

The next step is to create a dedicated VPC network and avoid using the default VPC provided in every GCP project.

Google also recommends using custom-mode subnets5 instead of auto mode, where a subnet would be created for you in each of the available GCP regions.

Let's do just that.

tofu/gcp/prod/10-networking/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }

  required_version = "< 2"
}
tofu/gcp/prod/10-networking/naming.tf
module "naming" {
  for_each = toset([
    "vpc",
    "subnet",
    "router",
    "nat",
    "firewall",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "networking"
}
tofu/gcp/prod/10-networking/main.tf
resource "google_compute_network" "this" {
  name                    = module.naming["vpc"].generated_name
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "this" {
  name          = module.naming["subnet"].generated_name
  network       = google_compute_network.this.id
  ip_cidr_range = "10.0.0.0/14"

  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }

  private_ip_google_access = true
}

resource "google_compute_router" "this" {
  name    = module.naming["router"].generated_name
  network = google_compute_network.this.id
}

resource "google_compute_router_nat" "this" {
  name                               = module.naming["nat"].generated_name
  router                             = google_compute_router.this.name
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

resource "google_compute_firewall" "this" {
  name    = module.naming["firewall"].generated_name
  network = google_compute_network.this.self_link
  allow {
    protocol = "icmp"
  }
  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = [
    "0.0.0.0/0",
  ]
}
tofu/gcp/prod/10-networking/outputs.tf
output "network_name" {
  value = google_compute_network.this.name
}

output "subnetwork_name" {
  value = google_compute_subnetwork.this.name
}
tofu/gcp/prod/10-networking/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
}

Running this stack:

terragrunt init -upgrade
terragrunt plan -out tfplan
terragrunt apply tfplan
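
If you'd like to verify the result before moving on, the networking resources should now be visible from the CLI; a quick, optional sanity check:

gcloud compute networks list
gcloud compute networks subnets list --filter="region:europe-west4"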

GKE Encryption Key

We then need to create a Customer-Managed Encryption Key (CMEK) for the GKE secrets6.

tofu/gcp/prod/20-gke-encryption-key/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }

  required_version = "< 2"
}
tofu/gcp/prod/20-gke-encryption-key/naming.tf
module "naming" {
  for_each = toset([
    "keyring",
    "cryptokey",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "encryption-key"
}
tofu/gcp/prod/20-gke-encryption-key/main.tf
data "google_client_config" "current" {}

resource "google_kms_key_ring" "this" {
  name     = module.naming["keyring"].generated_name
  location = data.google_client_config.current.region

  lifecycle {
    prevent_destroy = true
  }
}

resource "google_kms_crypto_key" "this" {
  name            = module.naming["cryptokey"].generated_name
  key_ring        = google_kms_key_ring.this.id
  rotation_period = format("%ss", 60 * 60 * 24 * 30) # 30 days

  lifecycle {
    # NOTE: removing the TF resource will NOT delete the key from GCP
    prevent_destroy = true
  }

  labels = {
    env = "prod"
  }
}
tofu/gcp/prod/20-gke-encryption-key/outputs.tf
output "crypto_key_id" {
  value = google_kms_crypto_key.this.id
}
tofu/gcp/prod/20-gke-encryption-key/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
}

Creating this stack is just as before with the three commands mentioned above.
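
Once applied, you can optionally verify the key ring and key from the CLI; the names below are the ones our naming module generates, so adjust them if yours differ:

gcloud kms keyrings list --location europe-west4
gcloud kms keys list \
  --keyring developer-friendly-prod-keyring-encryption-key \
  --location europe-west4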

Kubernetes Cluster

Finally, we will create the cluster in Autopilot mode, which requires the least management and operational overhead over the lifetime of the cluster7.

tofu/gcp/prod/30-kubernetes-cluster/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
    http = {
      source  = "hashicorp/http"
      version = "< 4"
    }
  }

  required_version = "< 2"
}
tofu/gcp/prod/30-kubernetes-cluster/variables.tf
variable "kms_key_id" {
  type     = string
  nullable = false
}

variable "network_name" {
  type     = string
  nullable = false
}

variable "subnetwork_name" {
  type     = string
  nullable = false
}
tofu/gcp/prod/30-kubernetes-cluster/naming.tf
module "naming" {
  for_each = toset([
    "gke",
    "vpc",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "k8s-cluster"
}
tofu/gcp/prod/30-kubernetes-cluster/main.tf
data "google_project" "current" {}
data "http" "my_ip" {
  url = "http://checkip.amazonaws.com/"
}

data "google_iam_policy" "encryptor" {
  binding {
    role = "roles/cloudkms.cryptoKeyEncrypter"

    members = [
      "serviceAccount:service-${data.google_project.current.number}@container-engine-robot.iam.gserviceaccount.com"
    ]
  }

  binding {
    role = "roles/cloudkms.cryptoKeyDecrypter"

    members = [
      "serviceAccount:service-${data.google_project.current.number}@container-engine-robot.iam.gserviceaccount.com"
    ]
  }
}

resource "google_kms_crypto_key_iam_policy" "this" {
  crypto_key_id = var.kms_key_id
  policy_data   = data.google_iam_policy.encryptor.policy_data
}

resource "google_container_cluster" "this" {
  name = module.naming["gke"].generated_name

  enable_autopilot = true

  release_channel {
    channel = "STABLE"
  }

  initial_node_count = 1

  deletion_protection = true

  networking_mode = "VPC_NATIVE"

  datapath_provider = "ADVANCED_DATAPATH"

  network    = var.network_name
  subnetwork = var.subnetwork_name

  ip_allocation_policy {
    cluster_ipv4_cidr_block  = "10.4.0.0/14"
    services_ipv4_cidr_block = "10.8.0.0/16"
  }

  cluster_autoscaling {
    auto_provisioning_defaults {
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform",
      ]
    }
  }

  logging_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "CONTROLLER_MANAGER",
      "SCHEDULER",
      "WORKLOADS",
    ]
  }

  monitoring_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "SCHEDULER",
      "CONTROLLER_MANAGER",
      "STORAGE",
      "HPA",
      "POD",
      "DAEMONSET",
      "DEPLOYMENT",
      "STATEFULSET",
      "KUBELET",
      "CADVISOR",
    ]

    advanced_datapath_observability_config {
      enable_metrics = true
      enable_relay   = true
    }

  }

  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  node_config {
    workload_metadata_config {
      mode = "GCE_METADATA"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  master_authorized_networks_config {
    gcp_public_cidrs_access_enabled = true
    cidr_blocks {
      cidr_block   = format("%s/32", trimspace(data.http.my_ip.response_body))
      display_name = "admin"
    }
  }

  maintenance_policy {
    recurring_window {
      start_time = "2025-01-01T00:00:00Z"
      end_time   = "2025-01-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
    }
  }

  gateway_api_config {
    channel = "CHANNEL_STANDARD"
  }

  enable_cilium_clusterwide_network_policy = true

  enable_l4_ilb_subsetting = true

  secret_manager_config {
    enabled = true
  }

  security_posture_config {
    mode               = "BASIC"
    vulnerability_mode = "VULNERABILITY_BASIC"
  }

  private_cluster_config {
    enable_private_endpoint = false
    enable_private_nodes    = true
  }

  workload_identity_config {
    workload_pool = "${data.google_project.current.project_id}.svc.id.goog"
  }

  identity_service_config {
    enabled = true
  }

  addons_config {
    http_load_balancing {
      disabled = false
    }

    gcp_filestore_csi_driver_config {
      enabled = true
    }

    gcs_fuse_csi_driver_config {
      enabled = true
    }

    gce_persistent_disk_csi_driver_config {
      enabled = true
    }

    gke_backup_agent_config {
      enabled = true
    }

    parallelstore_csi_driver_config {
      enabled = true
    }
  }

  database_encryption {
    state    = "ENCRYPTED"
    key_name = var.kms_key_id
  }

  resource_labels = {
    env = "prod"
  }
}
tofu/gcp/prod/30-kubernetes-cluster/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
  network_name    = dependency.networking.outputs.network_name
  subnetwork_name = dependency.networking.outputs.subnetwork_name

  kms_key_id = dependency.kms_key.outputs.crypto_key_id
}

dependency "kms_key" {
  config_path = "../gke-encryption-key"
}

dependency "networking" {
  config_path = "../networking"
}
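
A side note on the dependency blocks: if you ever need to plan this stack before the upstream stacks have been applied, Terragrunt supports mocked outputs on dependencies. A minimal sketch with placeholder values, only meant for validate/plan:

dependency "networking" {
  config_path = "../10-networking"

  # fake values so `plan` works before the networking stack exists
  mock_outputs = {
    network_name    = "placeholder-vpc"
    subnetwork_name = "placeholder-subnet"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}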

We provision this stack as well and move on to the next step.

Fetch Kubeconfig Credential

Once the cluster is ready, we can use the following CLI command to fetch the credentials for talking to our cluster8.

gcloud container clusters get-credentials \
  developer-friendly-prod-gke-k8s-cluster \
  --region europe-west4 \
  --project developer-friendly
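
To verify that the credentials have landed in the kubeconfig file we pointed KUBECONFIG to earlier, a quick smoke test could be:

kubectl cluster-info
kubectl get nodes

Keep in mind that a freshly created Autopilot cluster may report few or no nodes until actual workloads get scheduled.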

Deploy Sample Helm Application

For this demo, we deploy Valkey from the Bitnami Helm chart9.

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update bitnami
helm install valkey bitnami/valkey --version 2.x
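
To confirm the release came up healthy, something along these lines should do; the label selector follows the usual Bitnami conventions and may vary across chart versions:

helm status valkey
kubectl get pods,svc -l app.kubernetes.io/instance=valkey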

And that's it.

Future Plans

It's good to touch on some of the future improvements we can make to this setup:

  • ✅ Provision a dedicated host and deploy Atlantis to allow team collaboration on TF code10.
  • ✅ Deploy VictoriaMetrics Kubernetes Stack11 for monitoring
  • ✅ Deploy Promtail12 and use VictoriaLogs13 as backend

This list is non-exhaustive. Once your infrastructure grows, more required components come into play, e.g., security, audit, compliance, etc.

This list is only here to give you an idea of what's possible.

Conclusion

GKE Autopilot Cluster

We have seen how to create a Kubernetes cluster in GKE with the least operational overhead.

GKE Autopilot is the equivalent of AWS EKS Auto Mode14.

These clusters may not give you that hacky feeling when dealing with the daily operations of a Kubernetes cluster.

However, since they require so little overhead for the maintenance of the cluster itself, you'd have the opportunity to focus on your core business logic and improve the user experience of your application, instead of chasing and troubleshooting a cumbersome Kubernetes bug.

Until next time, ciao 🤠 & happy coding! 🐧 🦀

FAQ

Why use Terragrunt as an additional wrapper, adding complexity?

Terragrunt provides a thin wrapper around TF code. You'd generally add more tooling and complexity as your stack requires.

In the case of the current stack, we're using the dependency15 graph heavily to make sure dependent stacks are tied together correctly and inputs are passed around dynamically without the need to hardcode any values.

That makes it a viable choice to reduce the long-term chore and operational overhead, e.g., in case of disaster recovery, or just to spin up an identical replica of this platform in another region/account.
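
Thanks to that dependency graph, the whole environment can also be planned or applied in order from the environment root, although the per-stack commands shown earlier work just as well:

cd tofu/gcp/prod
terragrunt run-all plan
terragrunt run-all apply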

If you enjoyed this blog post, consider sharing it with these buttons 👇. Please leave a comment for us at the end, we read & love 'em all. ❣

