
How to Provision a Production-Ready Autopilot GKE Cluster

In this blog post I share my opinionated approach to provisioning a Kubernetes cluster on the Google Cloud Platform (GCP) using nothing but OpenTofu and Terragrunt.

The principles discussed here are ones I have learned while dealing with production setups at scale.

If you enjoy Kubernetes or want to learn more about GCP, this is for you.

Introduction

I have had the pleasure of working with Kubernetes in the last few years of my professional career.

While provisioning a Kubernetes cluster from scratch may not be the most interesting part of the day-to-day operation of running one, it is one of the most important.

The reason, especially in the context of Google Cloud, is that some of the initial settings and configurations you specify (or don't) in your day-0 operations cannot be changed after the cluster is created1.

Therefore, I took some time to document my steps for provisioning a production-ready Kubernetes cluster from scratch.

Stick around till the end to see all the components it takes.

Prerequisites

NOTE: With direnv, you can keep your local development environment variables to yourself without exposing them to the VCS. Ensure you have .envrc in the root of the repository with the values you need.

Example values:

# <repo-root>/.envrc
export KUBECONFIG="<repo-root>/.kubeconfig"
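
After creating the file, remember to let direnv load it into your shell (assuming direnv is already hooked into your shell):

direnv allow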

Authenticate to Google Cloud

To get started, you first need to be authenticated to Google Cloud.

gcloud auth login

Additionally, to be able to perform TF API calls against Google Cloud, you need the Application Default Credentials (ADC)2.

gcloud auth application-default login
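
If gcloud warns about a missing quota project for the ADC, you can optionally pin it to the project used throughout this post; adjust the project ID to your own:

gcloud auth application-default set-quota-project developer-friendly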

Project Structure

The code is structured in a directory-per-environment manner (truncated for brevity).

tofu/
├── backend.hcl # <- Terraform cloud remote state backend
├── gcp/
│   ├── gcp.hcl # Google Cloud authentication source
│   └── prod/
│       ├── 10-networking/
│       │   ├── main.tf
│       │   └── terragrunt.hcl
│       ├── 20-gke-encryption-key/
│       │   ├── main.tf
│       │   └── terragrunt.hcl
│       └── 30-kubernetes-cluster/
│           ├── main.tf
│           └── terragrunt.hcl
└── modules/
    └── naming/
        └── main.tf

Shared Modules

Naming

We first need to create a unified naming module to be used everywhere:

tofu/modules/naming/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }

  required_version = "< 2"
}
tofu/modules/naming/variables.tf
variable "project" {
  type        = string
  description = "The project name or ID"
  default     = ""
}

variable "environment" {
  type        = string
  description = "The environment (e.g., dev, prod)"
}

variable "resource_type" {
  type        = string
  description = "The type of resource (e.g., bucket, vm)"
}

variable "suffix" {
  type        = string
  description = "An optional suffix for the resource name"
  default     = ""
}
tofu/modules/naming/data.tf
data "google_client_config" "current" {}
tofu/modules/naming/main.tf
locals {
  project        = coalesce(var.project, data.google_client_config.current.project)
  generated_name = join("-", compact([local.project, var.environment, var.resource_type, var.suffix]))
}
tofu/modules/naming/outputs.tf
output "generated_name" {
  value = local.generated_name
}
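
As a quick sanity check, here is a hypothetical invocation of this module and the name it would generate, assuming the provider is authenticated against the developer-friendly project (as configured later in gcp.hcl):

module "naming_example" {
  # adjust the relative path to wherever your stack lives
  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = "gke"
  suffix        = "k8s-cluster"
}

# module.naming_example.generated_name => "developer-friendly-prod-gke-k8s-cluster"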

Terragrunt Root Modules

There are a few shared configurations included in each of the following stacks34:

tofu/backend.hcl
locals {
  workspace = replace(path_relative_to_include(), "/", "-")
  # e.g. gcp-prod-10-networking
}

generate "remote_state" {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    terraform {
      backend "remote" {
        hostname     = "app.terraform.io"
        organization = "developer-friendly-blog"
        workspaces {
          name = "${local.workspace}"
        }
      }
    }
  EOF
}
tofu/gcp/gcp.hcl
generate "gcp" {
  path      = "provider_gcp.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "google" {
      project = "developer-friendly"
      region  = "europe-west4"
    }
  EOF
}

Networking

The next step is to create a dedicated VPC network and avoid using the default VPC provided in every GCP project.

Google also recommends using custom-mode subnets5 instead of auto mode, where a subnet would be created for you in each of the available GCP regions.

Let's do just that.

tofu/gcp/prod/10-networking/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }

  required_version = "< 2"
}
tofu/gcp/prod/10-networking/naming.tf
module "naming" {
  for_each = toset([
    "vpc",
    "subnet",
    "router",
    "nat",
    "firewall",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "networking"
}
tofu/gcp/prod/10-networking/main.tf
resource "google_compute_network" "this" {
  name                    = module.naming["vpc"].generated_name
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "this" {
  name          = module.naming["subnet"].generated_name
  network       = google_compute_network.this.id
  ip_cidr_range = "10.0.0.0/14"

  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }

  private_ip_google_access = true
}

resource "google_compute_router" "this" {
  name    = module.naming["router"].generated_name
  network = google_compute_network.this.id
}

resource "google_compute_router_nat" "this" {
  name                               = module.naming["nat"].generated_name
  router                             = google_compute_router.this.name
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

resource "google_compute_firewall" "this" {
  name    = module.naming["firewall"].generated_name
  network = google_compute_network.this.self_link
  allow {
    protocol = "icmp"
  }
  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = [
    "0.0.0.0/0",
  ]
}
tofu/gcp/prod/10-networking/outputs.tf
output "network_name" {
  value = google_compute_network.this.name
}

output "subnetwork_name" {
  value = google_compute_subnetwork.this.name
}
tofu/gcp/prod/10-networking/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
}

Running this stack:

terragrunt init -upgrade
terragrunt plan -out tfplan
terragrunt apply tfplan
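
If you'd like to verify the result before moving on, the networking resources should now be visible from the CLI; a quick, optional sanity check:

gcloud compute networks list
gcloud compute networks subnets list --filter="region:europe-west4"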

GKE Encryption Key

We then need to create a Customer-Managed Encryption Key (CMEK) for the GKE secrets6.

tofu/gcp/prod/20-gke-encryption-key/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }

  required_version = "< 2"
}
tofu/gcp/prod/20-gke-encryption-key/naming.tf
module "naming" {
  for_each = toset([
    "keyring",
    "cryptokey",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "encryption-key"
}
tofu/gcp/prod/20-gke-encryption-key/main.tf
data "google_client_config" "current" {}

resource "google_kms_key_ring" "this" {
  name     = module.naming["keyring"].generated_name
  location = data.google_client_config.current.region

  lifecycle {
    prevent_destroy = true
  }
}

resource "google_kms_crypto_key" "this" {
  name            = module.naming["cryptokey"].generated_name
  key_ring        = google_kms_key_ring.this.id
  rotation_period = format("%ss", 60 * 60 * 24 * 30) # 30 days

  lifecycle {
    # NOTE: removing the TF resource will NOT delete the key from GCP
    prevent_destroy = true
  }

  labels = {
    env = "prod"
  }
}
tofu/gcp/prod/20-gke-encryption-key/outputs.tf
output "crypto_key_id" {
  value = google_kms_crypto_key.this.id
}
tofu/gcp/prod/20-gke-encryption-key/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
}

Creating this stack is just as before with the three commands mentioned above.
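
Once applied, you can optionally verify the key ring and key from the CLI; the names below are the ones our naming module generates, so adjust them if yours differ:

gcloud kms keyrings list --location europe-west4
gcloud kms keys list \
  --keyring developer-friendly-prod-keyring-encryption-key \
  --location europe-west4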

Kubernetes Cluster

Finally, we will create the cluster in Autopilot mode, which requires the least management and operational overhead over the lifetime of the cluster7.

tofu/gcp/prod/30-kubernetes-cluster/versions.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
    http = {
      source  = "hashicorp/http"
      version = "< 4"
    }
  }

  required_version = "< 2"
}
tofu/gcp/prod/30-kubernetes-cluster/variables.tf
variable "kms_key_id" {
  type     = string
  nullable = false
}

variable "network_name" {
  type     = string
  nullable = false
}

variable "subnetwork_name" {
  type     = string
  nullable = false
}
tofu/gcp/prod/30-kubernetes-cluster/naming.tf
module "naming" {
  for_each = toset([
    "gke",
    "vpc",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "k8s-cluster"
}
tofu/gcp/prod/30-kubernetes-cluster/main.tf
data "google_project" "current" {}
data "http" "my_ip" {
  url = "http://checkip.amazonaws.com/"
}

data "google_iam_policy" "encryptor" {
  binding {
    role = "roles/cloudkms.cryptoKeyEncrypter"

    members = [
      "serviceAccount:service-${data.google_project.current.number}@container-engine-robot.iam.gserviceaccount.com"
    ]
  }

  binding {
    role = "roles/cloudkms.cryptoKeyDecrypter"

    members = [
      "serviceAccount:service-${data.google_project.current.number}@container-engine-robot.iam.gserviceaccount.com"
    ]
  }
}

resource "google_kms_crypto_key_iam_policy" "this" {
  crypto_key_id = var.kms_key_id
  policy_data   = data.google_iam_policy.encryptor.policy_data
}

resource "google_container_cluster" "this" {
  name = module.naming["gke"].generated_name

  enable_autopilot = true

  release_channel {
    channel = "STABLE"
  }

  initial_node_count = 1

  deletion_protection = true

  networking_mode = "VPC_NATIVE"

  datapath_provider = "ADVANCED_DATAPATH"

  network    = var.network_name
  subnetwork = var.subnetwork_name

  ip_allocation_policy {
    cluster_ipv4_cidr_block  = "10.4.0.0/14"
    services_ipv4_cidr_block = "10.8.0.0/16"
  }

  cluster_autoscaling {
    auto_provisioning_defaults {
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform",
      ]
    }
  }

  logging_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "CONTROLLER_MANAGER",
      "SCHEDULER",
      "WORKLOADS",
    ]
  }

  monitoring_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "SCHEDULER",
      "CONTROLLER_MANAGER",
      "STORAGE",
      "HPA",
      "POD",
      "DAEMONSET",
      "DEPLOYMENT",
      "STATEFULSET",
      "KUBELET",
      "CADVISOR",
    ]

    advanced_datapath_observability_config {
      enable_metrics = true
      enable_relay   = true
    }

  }

  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  node_config {
    workload_metadata_config {
      mode = "GCE_METADATA"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  master_authorized_networks_config {
    gcp_public_cidrs_access_enabled = true
    cidr_blocks {
      cidr_block   = format("%s/32", trimspace(data.http.my_ip.response_body))
      display_name = "admin"
    }
  }

  maintenance_policy {
    recurring_window {
      start_time = "2025-01-01T00:00:00Z"
      end_time   = "2025-01-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
    }
  }

  gateway_api_config {
    channel = "CHANNEL_STANDARD"
  }

  enable_cilium_clusterwide_network_policy = true

  enable_l4_ilb_subsetting = true

  secret_manager_config {
    enabled = true
  }

  security_posture_config {
    mode               = "BASIC"
    vulnerability_mode = "VULNERABILITY_BASIC"
  }

  private_cluster_config {
    enable_private_endpoint = false
    enable_private_nodes    = true
  }

  workload_identity_config {
    workload_pool = "${data.google_project.current.project_id}.svc.id.goog"
  }

  identity_service_config {
    enabled = true
  }

  addons_config {
    http_load_balancing {
      disabled = false
    }

    gcp_filestore_csi_driver_config {
      enabled = true
    }

    gcs_fuse_csi_driver_config {
      enabled = true
    }

    gce_persistent_disk_csi_driver_config {
      enabled = true
    }

    gke_backup_agent_config {
      enabled = true
    }

    parallelstore_csi_driver_config {
      enabled = true
    }
  }

  database_encryption {
    state    = "ENCRYPTED"
    key_name = var.kms_key_id
  }

  resource_labels = {
    env = "prod"
  }
}
tofu/gcp/prod/30-kubernetes-cluster/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
  network_name    = dependency.networking.outputs.network_name
  subnetwork_name = dependency.networking.outputs.subnetwork_name

  kms_key_id = dependency.kms_key.outputs.crypto_key_id
}

dependency "kms_key" {
  config_path = "../gke-encryption-key"
}

dependency "networking" {
  config_path = "../networking"
}
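
A side note on the dependency blocks: if you ever need to plan this stack before the upstream stacks have been applied, Terragrunt supports mocked outputs on dependencies. A minimal sketch with placeholder values, only meant for validate/plan:

dependency "networking" {
  config_path = "../10-networking"

  # fake values so `plan` works before the networking stack exists
  mock_outputs = {
    network_name    = "placeholder-vpc"
    subnetwork_name = "placeholder-subnet"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}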

We provision this stack as well and move on to the next step.

Fetch Kubeconfig Credential

Once the cluster is ready, we can use the following CLI command to fetch the credentials for talking to our cluster8.

gcloud container clusters get-credentials \
  developer-friendly-prod-gke-k8s-cluster \
  --region europe-west4 \
  --project developer-friendly
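
To verify that the credentials have landed in the kubeconfig file we pointed KUBECONFIG to earlier, a quick smoke test could be:

kubectl cluster-info
kubectl get nodes

Keep in mind that a freshly created Autopilot cluster may report few or no nodes until actual workloads get scheduled.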

Deploy Sample Helm Application

For this demo, we deploy Valkey from the Bitnami Helm chart9.

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update bitnami
helm install valkey bitnami/valkey --version 2.x
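
To confirm the release came up healthy, something along these lines should do; the label selector follows the usual Bitnami conventions and may vary across chart versions:

helm status valkey
kubectl get pods,svc -l app.kubernetes.io/instance=valkey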

And that's it.

Future Plans

It's good to touch on some of the future improvements we can make to this setup:

  • ✅ Provision a dedicated host and deploy Atlantis to allow team collaboration on TF code10.
  • ✅ Deploy VictoriaMetrics Kubernetes Stack11 for monitoring
  • ✅ Deploy Promtail12 and use VictoriaLogs13 as backend

This list is non-exhaustive. Once your infrastructure grows, more required components come into play, e.g., security, audit, compliance, etc.

This list is only here to give you an idea of what's possible.

Conclusion

GKE Autopilot Cluster

We have seen how to create a Kubernetes cluster in GKE with the least operational overhead.

GKE Autopilot is the equivalent of AWS EKS Auto Mode14.

These clusters may not give you that hacky feeling when dealing with the daily operations of a Kubernetes cluster.

However, since they require so little overhead for the maintenance of the cluster itself, you'd have the opportunity to focus on your core business logic and improve the user experience of your application, instead of chasing and troubleshooting a cumbersome Kubernetes bug.

Until next time, ciao 🤠 & happy coding! 🐧 🦀

FAQ

Why use Terragrunt as an additional wrapper, adding complexity?

Terragrunt provides a thin wrapper around TF code. You'd generally add more tooling and complexity as your stack requires.

In the case of the current stack, we're using the dependency15 graph heavily to make sure dependent stacks are tied together correctly and inputs are passed around dynamically without the need to hardcode any values.

That makes it a viable choice to reduce the long-term chore and operational overhead, e.g., in case of disaster recovery, or just to spin up an identical replica of this platform in another region/account.
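
Thanks to that dependency graph, the whole environment can also be planned or applied in order from the environment root, although the per-stack commands shown earlier work just as well:

cd tofu/gcp/prod
terragrunt run-all plan
terragrunt run-all apply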

If you enjoyed this blog post, consider sharing it with these buttons 👇. Please leave a comment for us at the end, we read & love 'em all. ❣

