How to Provision a Production-Ready Autopilot GKE Cluster¶
In this blog post I share my opinionated approach to provisioning a Kubernetes cluster on Google Cloud Platform (GCP) using nothing but OpenTofu, with Terragrunt as a thin wrapper.
The principles discussed here are ones I learned while running production setups at a similar scale.
If you enjoy Kubernetes or want to learn more about GCP, this is for you.
Introduction¶
I have had the pleasure of working with Kubernetes for the last few years of my professional career.
While provisioning a Kubernetes cluster from scratch may not be the most interesting part of day-to-day cluster operations, it is one of the most important.
The reason, especially in the context of Google Cloud, is that some of the settings you specify (or omit) on day 0 cannot be changed after the cluster is created[1].
Therefore, I took some time to document my steps for provisioning a production-ready Kubernetes cluster from scratch.
Stick around till the end to see every component it takes.
Prerequisites¶
- OpenTofu
- Terragrunt
- gcloud CLI
- direnv (optional)
- kubectl
- Helm
NOTE: With direnv, you can load environment variables for your local development environment without exposing them to the VCS. Ensure you have a .envrc file in the root of the repository with the values you need.
Example values:
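As a purely hypothetical sketch, an .envrc for this setup might export the following. The variable names are ones I believe the Google provider (GOOGLE_PROJECT, GOOGLE_REGION), the gcloud CLI (CLOUDSDK_CORE_PROJECT), and the remote state backend (TF_TOKEN_app_terraform_io) read; the values are placeholders borrowed from the provider block later in this post, so adjust them to your own project.

```shell
# .envrc (hypothetical values; direnv loads this file when you enter the directory)
export GOOGLE_PROJECT="developer-friendly"
export GOOGLE_REGION="europe-west4"

# Keep the gcloud CLI pointed at the same project as the provider
export CLOUDSDK_CORE_PROJECT="$GOOGLE_PROJECT"

# API token for the app.terraform.io remote state backend (placeholder, never commit the real one)
export TF_TOKEN_app_terraform_io="replace-me"
```

Remember to run `direnv allow` once after creating or changing the file, and keep .envrc out of version control.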
Authenticate to Google Cloud¶
To get started, you first need to be authenticated to Google Cloud.
Additionally, to be able to make TF API calls to Google Cloud, you need the Application Default Credentials (ADC)[2].
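A minimal sketch of the two logins, wrapped in a hypothetical helper named `gcp_login` (the function name is mine; the two gcloud subcommands are the standard ones for user login and for ADC):

```shell
# Log in twice: once for the gcloud CLI itself, and once to create the
# Application Default Credentials that the Google provider picks up.
gcp_login() {
  gcloud auth login &&
    gcloud auth application-default login
}
```

Running `gcp_login` once per workstation should be enough; both commands open a browser window for the consent flow.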
Project Structure¶
The code is structured with one directory per environment (truncated for brevity).
tofu/
├── backend.hcl # <- Terraform Cloud remote state backend
├── gcp/
│   ├── gcp.hcl # Google Cloud authentication source
│   └── prod/
│       ├── 10-networking/
│       │   ├── main.tf
│       │   └── terragrunt.hcl
│       ├── 20-gke-encryption-key/
│       │   ├── main.tf
│       │   └── terragrunt.hcl
│       └── 30-kubernetes-cluster/
│           ├── main.tf
│           └── terragrunt.hcl
└── modules/
    └── naming/
        └── main.tf
Shared Modules¶
Naming¶
We first need to create a unified naming module to be used everywhere:
# tofu/modules/naming/main.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }
  required_version = "< 2"
}

variable "project" {
  type        = string
  description = "The project name or ID"
  default     = ""
}

variable "environment" {
  type        = string
  description = "The environment (e.g., dev, prod)"
}

variable "resource_type" {
  type        = string
  description = "The type of resource (e.g., bucket, vm)"
}

variable "suffix" {
  type        = string
  description = "An optional suffix for the resource name"
  default     = ""
}

# Needed for the fallback below when var.project is left empty
data "google_client_config" "current" {}

locals {
  project        = coalesce(var.project, data.google_client_config.current.project)
  generated_name = join("-", compact([local.project, var.environment, var.resource_type, var.suffix]))
}

# Consumed by the stacks as module.naming["..."].generated_name
output "generated_name" {
  value = local.generated_name
}
Terragrunt Root Modules¶
A few root configuration files are included in each of the following stacks[3][4]:
# tofu/backend.hcl
locals {
  workspace = replace(path_relative_to_include(), "/", "-")
  # e.g. gcp-prod-10-networking
}

generate "remote_state" {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    terraform {
      backend "remote" {
        hostname     = "app.terraform.io"
        organization = "developer-friendly-blog"
        workspaces {
          name = "${local.workspace}"
        }
      }
    }
  EOF
}

# tofu/gcp/gcp.hcl
generate "gcp" {
  path      = "provider_gcp.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "google" {
      project = "developer-friendly"
      region  = "europe-west4"
    }
  EOF
}
Networking¶
The next step is to create a dedicated VPC network and avoid using the default VPC provided in every GCP project.
Google also recommends using custom-mode subnets[5] instead of auto mode, in which a subnet is created for you in each of the available GCP regions.
Let's do just that.
# tofu/gcp/prod/10-networking/main.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }
  required_version = "< 2"
}

module "naming" {
  for_each = toset([
    "vpc",
    "subnet",
    "router",
    "nat",
    "firewall",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "networking"
}

resource "google_compute_network" "this" {
  name                    = module.naming["vpc"].generated_name
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "this" {
  name          = module.naming["subnet"].generated_name
  network       = google_compute_network.this.id
  ip_cidr_range = "10.0.0.0/14"

  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }

  private_ip_google_access = true
}

resource "google_compute_router" "this" {
  name    = module.naming["router"].generated_name
  network = google_compute_network.this.id
}

resource "google_compute_router_nat" "this" {
  name                               = module.naming["nat"].generated_name
  router                             = google_compute_router.this.name
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

resource "google_compute_firewall" "this" {
  name    = module.naming["firewall"].generated_name
  network = google_compute_network.this.self_link

  allow {
    protocol = "icmp"
  }

  allow {
    protocol = "tcp"
    ports    = ["80", "443"]
  }

  source_ranges = [
    "0.0.0.0/0",
  ]
}

output "network_name" {
  value = google_compute_network.this.name
}

output "subnetwork_name" {
  value = google_compute_subnetwork.this.name
}

# tofu/gcp/prod/10-networking/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
}
Running this stack:
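A minimal sketch of those commands, assuming the usual init, plan, and apply sequence. The helper name `provision_stack` and the plan file name `tfplan` are illustrative and not part of the original setup.

```shell
# Provision a single stack: initialise, write the plan to a file,
# then apply exactly that reviewed plan.
provision_stack() {
  stack_dir="$1"
  (
    cd "$stack_dir" || exit 1
    terragrunt init &&
      terragrunt plan -out tfplan &&
      terragrunt apply tfplan
  )
}
```

For example, `provision_stack tofu/gcp/prod/10-networking` would run the three commands inside the networking stack. The subshell keeps your current directory unchanged.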
GKE Encryption Key¶
We then need to create a Customer-Managed Key (CMK) to encrypt the GKE secrets[6].
# tofu/gcp/prod/20-gke-encryption-key/main.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
  }
  required_version = "< 2"
}

module "naming" {
  for_each = toset([
    "keyring",
    "cryptokey",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "encryption-key"
}

data "google_client_config" "current" {}

resource "google_kms_key_ring" "this" {
  name     = module.naming["keyring"].generated_name
  location = data.google_client_config.current.region

  lifecycle {
    prevent_destroy = true
  }
}

resource "google_kms_crypto_key" "this" {
  name            = module.naming["cryptokey"].generated_name
  key_ring        = google_kms_key_ring.this.id
  rotation_period = format("%ss", 60 * 60 * 24 * 30) # 30 days

  lifecycle {
    # NOTE: removing the TF resource will NOT delete the key from GCP
    prevent_destroy = true
  }

  labels = {
    env = "prod"
  }
}

output "crypto_key_id" {
  value = google_kms_crypto_key.this.id
}

# tofu/gcp/prod/20-gke-encryption-key/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
}
This stack is provisioned the same way as before, with the three commands mentioned above.
Kubernetes Cluster¶
Finally, we will create the cluster in Autopilot mode, which requires the least management and operational overhead over the lifetime of the cluster[7].
# tofu/gcp/prod/30-kubernetes-cluster/main.tf
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "< 7"
    }
    http = {
      source  = "hashicorp/http"
      version = "< 4"
    }
  }
  required_version = "< 2"
}

variable "kms_key_id" {
  type     = string
  nullable = false
}

variable "network_name" {
  type     = string
  nullable = false
}

variable "subnetwork_name" {
  type     = string
  nullable = false
}

module "naming" {
  for_each = toset([
    "gke",
    "vpc",
  ])

  source = "../../../modules/naming"

  environment   = "prod"
  resource_type = each.key
  suffix        = "k8s-cluster"
}

data "google_project" "current" {}

data "http" "my_ip" {
  url = "http://checkip.amazonaws.com/"
}

data "google_iam_policy" "encryptor" {
  binding {
    role = "roles/cloudkms.cryptoKeyEncrypter"
    members = [
      "serviceAccount:service-${data.google_project.current.number}@container-engine-robot.iam.gserviceaccount.com"
    ]
  }

  binding {
    role = "roles/cloudkms.cryptoKeyDecrypter"
    members = [
      "serviceAccount:service-${data.google_project.current.number}@container-engine-robot.iam.gserviceaccount.com"
    ]
  }
}

resource "google_kms_crypto_key_iam_policy" "this" {
  crypto_key_id = var.kms_key_id
  policy_data   = data.google_iam_policy.encryptor.policy_data
}

resource "google_container_cluster" "this" {
  name = module.naming["gke"].generated_name

  enable_autopilot = true

  release_channel {
    channel = "STABLE"
  }

  initial_node_count  = 1
  deletion_protection = true

  networking_mode   = "VPC_NATIVE"
  datapath_provider = "ADVANCED_DATAPATH"
  network           = var.network_name
  subnetwork        = var.subnetwork_name

  ip_allocation_policy {
    cluster_ipv4_cidr_block  = "10.4.0.0/14"
    services_ipv4_cidr_block = "10.8.0.0/16"
  }

  cluster_autoscaling {
    auto_provisioning_defaults {
      oauth_scopes = [
        "https://www.googleapis.com/auth/cloud-platform",
      ]
    }
  }

  logging_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "CONTROLLER_MANAGER",
      "SCHEDULER",
      "WORKLOADS",
    ]
  }

  monitoring_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "SCHEDULER",
      "CONTROLLER_MANAGER",
      "STORAGE",
      "HPA",
      "POD",
      "DAEMONSET",
      "DEPLOYMENT",
      "STATEFULSET",
      "KUBELET",
      "CADVISOR",
    ]

    advanced_datapath_observability_config {
      enable_metrics = true
      enable_relay   = true
    }
  }

  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  node_config {
    workload_metadata_config {
      mode = "GCE_METADATA"
    }

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  master_authorized_networks_config {
    gcp_public_cidrs_access_enabled = true

    cidr_blocks {
      cidr_block   = format("%s/32", trimspace(data.http.my_ip.response_body))
      display_name = "admin"
    }
  }

  maintenance_policy {
    recurring_window {
      start_time = "2025-01-01T00:00:00Z"
      end_time   = "2025-01-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"
    }
  }

  gateway_api_config {
    channel = "CHANNEL_STANDARD"
  }

  enable_cilium_clusterwide_network_policy = true
  enable_l4_ilb_subsetting                 = true

  secret_manager_config {
    enabled = true
  }

  security_posture_config {
    mode               = "BASIC"
    vulnerability_mode = "VULNERABILITY_BASIC"
  }

  private_cluster_config {
    enable_private_endpoint = false
    enable_private_nodes    = true
  }

  workload_identity_config {
    workload_pool = "${data.google_project.current.project_id}.svc.id.goog"
  }

  identity_service_config {
    enabled = true
  }

  addons_config {
    http_load_balancing {
      disabled = false
    }

    gcp_filestore_csi_driver_config {
      enabled = true
    }

    gcs_fuse_csi_driver_config {
      enabled = true
    }

    gce_persistent_disk_csi_driver_config {
      enabled = true
    }

    gke_backup_agent_config {
      enabled = true
    }

    parallelstore_csi_driver_config {
      enabled = true
    }
  }

  database_encryption {
    state    = "ENCRYPTED"
    key_name = var.kms_key_id
  }

  resource_labels = {
    env = "prod"
  }
}
# tofu/gcp/prod/30-kubernetes-cluster/terragrunt.hcl
include "backend" {
  path = find_in_parent_folders("backend.hcl")
}

include "gcp" {
  path = find_in_parent_folders("gcp.hcl")
}

inputs = {
  network_name    = dependency.networking.outputs.network_name
  subnetwork_name = dependency.networking.outputs.subnetwork_name
  kms_key_id      = dependency.kms_key.outputs.crypto_key_id
}

# The paths must match the sibling directory names, including the numeric prefix
dependency "kms_key" {
  config_path = "../20-gke-encryption-key"
}

dependency "networking" {
  config_path = "../10-networking"
}
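Before applying, it can be worth sanity-checking that the three address ranges used in this post (the subnet's primary range, the Pod range, and the Service range) do not overlap. The following is a small sketch in plain shell arithmetic; the helper names are hypothetical, and it assumes each CIDR is written with its network address (as they are here).

```shell
# Convert a dotted-quad IPv4 address to an integer.
# The function body runs in a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Print the first and last address of a CIDR block as integers.
cidr_bounds() {
  base=$(ip_to_int "${1%/*}")
  size=$(( 1 << (32 - ${1#*/}) ))
  echo "$base $(( base + size - 1 ))"
}

# Succeed (exit 0) when the two CIDR blocks share at least one address.
cidrs_overlap() {
  set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Subnet primary range, Pod range, and Service range from this post.
for pair in \
  "10.0.0.0/14 10.4.0.0/14" \
  "10.0.0.0/14 10.8.0.0/16" \
  "10.4.0.0/14 10.8.0.0/16"; do
  set -- $pair
  if cidrs_overlap "$1" "$2"; then
    echo "overlap: $1 $2"
  else
    echo "ok: $1 $2"
  fi
done
```

A /14 spans four consecutive /16 blocks, so 10.0.0.0/14 ends at 10.3.255.255 and 10.4.0.0/14 ends at 10.7.255.255, which leaves 10.8.0.0/16 free for Services.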
We provision this stack as well and we move on to the next step.
Fetch Kubeconfig Credential¶
Once the cluster is ready, we can run the following command to fetch the credentials needed to talk to it[8].
gcloud container clusters get-credentials \
developer-friendly-prod-gke-k8s-cluster \
--region europe-west4 \
--project developer-friendly
Deploy Sample Helm Application¶
For this demo, we deploy Valkey from the Bitnami Helm chart[9].
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update bitnami
helm install valkey bitnami/valkey --version 2.x
And that's it.
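To confirm the release actually came up, a quick check along these lines can help. This is only a sketch: the helper name `check_release` is hypothetical, and the label selector assumes the chart applies the common `app.kubernetes.io/instance` label to its Pods, which you may want to verify for the chart version you installed.

```shell
# Show the Helm release status, then list the Pods that belong to it.
check_release() {
  release="${1:-valkey}"
  helm status "$release" &&
    kubectl get pods -l "app.kubernetes.io/instance=$release"
}
```

Calling `check_release` with no argument inspects the `valkey` release installed above. On Autopilot the Pods may stay Pending for a short while as nodes are provisioned on demand.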
Future Plans¶
It's good to touch on some of the future improvements we can make to this setup:
- Provision a dedicated host and deploy Atlantis to allow team collaboration on TF code[10].
- Deploy VictoriaMetrics Kubernetes Stack[11] for monitoring
- Deploy Promtail[12] and use VictoriaLogs[13] as backend
This list is non-exhaustive. Once your infrastructure grows, more required components come into play, e.g., security, audit, compliance, etc.
This list is only here to give you an idea of what's possible.
Conclusion¶
We have seen how to create a Kubernetes cluster in GKE with the least operational overhead.
GKE Autopilot is the equivalent of Amazon EKS Auto Mode[14].
These clusters may not leave much room for hands-on tinkering in the daily operation of a Kubernetes cluster.
However, since they require so little overhead for the maintenance of the cluster itself, you'd have the opportunity to focus on your core business logic and improve the user experience of your application, instead of chasing and troubleshooting a cumbersome Kubernetes bug.
Until next time, ciao & happy coding!
FAQ¶
Why use Terragrunt as an additional wrapper, with its added complexity?¶
Terragrunt provides a thin wrapper around TF code. You'd generally add more tooling and complexity only as your stack requires it.
In the case of the current stack, we rely heavily on the dependency[15] graph to make sure dependent stacks are tied together correctly and inputs are passed around dynamically without hardcoding any value.
That makes it a viable choice for reducing long-term chores and operational overhead, e.g., in case of disaster recovery, or just to spin up an identical replica of this platform in another region/account.
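As a rough illustration of what the dependency graph buys you, the whole environment can be provisioned in dependency order with a single command. The sketch below uses a hypothetical helper named `provision_env` and the `run-all` subcommand; newer Terragrunt releases spell this `terragrunt run --all apply`, so check which form your version supports.

```shell
# Apply every stack under an environment directory, letting Terragrunt
# order them by their dependency blocks (networking and key before the cluster).
provision_env() {
  env_dir="$1"
  (
    cd "$env_dir" || exit 1
    terragrunt run-all apply
  )
}
```

For example, `provision_env tofu/gcp/prod` would walk the three stacks in this post.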
[1] https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview
[2] https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_reference
[3] https://terragrunt.gruntwork.io/docs/getting-started/overview/#the-include-block
[4] https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#backend
[5] https://cloud.google.com/kubernetes-engine/docs/best-practices/networking#custom-subnet-mode
[6] https://cloud.google.com/kubernetes-engine/docs/how-to/encrypting-secrets
[7] https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview
[8] https://cloud.google.com/sdk/gcloud/reference/container/clusters/get-credentials
[11] https://artifacthub.io/packages/helm/victoriametrics/victoria-metrics-k8s-stack
[14] https://docs.aws.amazon.com/eks/latest/userguide/automode.html
[15] https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#dependency