Kubernetes The Hard Way - Developer Friendly Blog

Kubernetes The Hard Way

You might've solved this challenge way sooner than I attempted it. Still, I always wanted to go through the process as it has many angles and learning the details intrigues me.

This version, however, does not use any cloud provider. Specifically, the things I am using differently from the original challenge are:

  • Vagrant & VirtualBox: For the nodes of the cluster
  • Ansible: For configuring everything until the cluster is ready
  • Cilium: For the network CNI and as a replacement for the kube-proxy

So, here is my story and how I solved the famous "Kubernetes The Hard Way" by the great Kelsey Hightower. Stay tuned if you're interested in the details.

Introduction

Kubernetes the Hard Way is a great exercise for any system administrator to really get into the nitty-gritty of Kubernetes and figure out how the different components work together and what makes it tick.

If you have only used a managed Kubernetes cluster, or used kubeadm to spin one up, this is your chance to really understand the inner workings of Kubernetes. Those tools abstract a lot of the details away from you, which doesn't help if you want to understand how things are implemented.

Objective

The whole point of this exercise is to build a Kubernetes cluster from scratch, downloading the binaries, issuing and passing the certificates to the different components, configuring the network CNI, and finally, having a working Kubernetes cluster.

With that introduction, let's get started.

Prerequisites

First things first, let's make sure all the necessary tools are installed on our system before we start.

Tools

All the tools mentioned below are the latest versions at the time of writing, February 2024.

Tool        Version  Link
Ansible     2.16     https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html
Vagrant     2.4      https://www.vagrantup.com/docs/installation
VirtualBox  7        https://www.virtualbox.org/wiki/Downloads

Alright, with the tools installed, it's time to get our hands dirty and really get into it.

The Vagrantfile

Info

The Vagrantfile configuration language is a Ruby DSL. If you are not a Ruby developer, fret not, as I'm not either. I just know enough to get by.

The initial step is to have three nodes up and running for the Kubernetes cluster, and one for the Load Balancer used in front of the API server. We will be using Vagrant on top of VirtualBox to create all these nodes.

These will be Virtual Machines hosted on your local machine. As such, there is no cloud provider needed in this version of the challenge and all the configurations are done locally.

The configuration for our Vagrantfile looks as below.

Vagrantfile
box = "ubuntu/jammy64"
N = 2

common_script = <<~SHELL
  export DEBIAN_FRONTEND=noninteractive
  sudo apt update
  sudo apt upgrade -y
SHELL

Vagrant.configure("2") do |config|
  config.vm.define "lb" do |node|
    node.vm.box = box
    node.vm.network :private_network, ip: "192.168.56.100", hostname: true
    node.vm.network "forwarded_port", guest: 6443, host: 6443
    node.vm.hostname = "lb.local"
    node.vm.provider "virtualbox" do |vb|
      vb.name = "k8s-the-hard-way-lb"
      vb.memory = "1024"
      vb.cpus = 1
      vb.linked_clone = true
    end

    node.vm.synced_folder "share/dl", "/downloads", create: true

    node.vm.provision "shell", inline: common_script
    node.vm.provision "ansible" do |ansible|
      ansible.verbose = "vv"
      ansible.playbook = "bootstrap.yml"
      ansible.compatibility_mode = "2.0"
    end
  end

  (0..N).each do |machine_id|
    config.vm.define "node#{machine_id}" do |node|
      node.vm.box = box
      node.vm.hostname = "node#{machine_id}.local"
      node.vm.network :private_network, ip: "192.168.56.#{machine_id+2}", hostname: true
      node.vm.provider "virtualbox" do |vb|
        vb.name = "k8s-the-hard-way-node#{machine_id}"
        vb.memory = "1024"
        vb.cpus = 1
        vb.linked_clone = true
      end

      # To hold the downloaded items and survive VM restarts
      node.vm.synced_folder "share/dl", "/downloads", create: true

      node.vm.provision "shell", inline: common_script

      if machine_id == N
        node.vm.provision :ansible do |ansible|
          ansible.limit = "all"
          ansible.verbose = "vv"
          ansible.playbook = "bootstrap.yml"
        end
      end
    end
  end
end

Private Network Configuration

Vagrantfile
    node.vm.network :private_network, ip: "192.168.56.100", hostname: true
      node.vm.network :private_network, ip: "192.168.56.#{machine_id+2}", hostname: true

There are a couple of important notes worth mentioning about this config, highlighted in the snippet above and discussed below.

The network configuration, as you see above, is a private network with hard-coded IP addresses. This is not a hard requirement, but it makes a lot of the upcoming assumptions a lot easier.

Dynamic IP addresses will need more careful handling when it comes to configuring the nodes, their TLS certificates, and how they communicate overall.

And avoiding that kind of craziness in this challenge is a sure way not to go down the rabbit hole of despair 😎.
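
To make the addressing concrete, here is a small Ruby sketch of the same arithmetic the Vagrantfile uses. The helper name node_addresses is made up for illustration; only the 192.168.56. prefix and the machine_id + 2 offset come from the Vagrantfile above.

```ruby
# Hypothetical helper mirroring the Vagrantfile loop: (0..n) is an inclusive
# range, so n = 2 produces three nodes.
def node_addresses(n, prefix: "192.168.56.", offset: 2)
  (0..n).to_h do |machine_id|
    ["node#{machine_id}", "#{prefix}#{machine_id + offset}"]
  end
end

node_addresses(2).each { |name, ip| puts "#{name} -> #{ip}" }
```

These are the same three addresses that show up again in the /etc/hosts entries, the etcd peer list, and the apiserver_ips variable later on.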

Load Balancer Port Forwarding

Vagrantfile
    node.vm.network "forwarded_port", guest: 6443, host: 6443

For some reason, I wasn't able to directly call 192.168.56.100:6443, which is the address pair available for the HAProxy. This is accessible from within the Vagrant VMs, but not from the host machine.

Using firewall techniques such as ufw only made things worse; I was locked out of the VM. I know now that I had to enable SSH access first, but that's behind me now.

With the port-forwarding configured, I was able to call localhost:6443 from my machine and directly get access to the HAProxy.

On Vagrant Networking

In general, I have found many networking issues while working on this challenge. For some reason, the download speed inside the VMs was terrible (I am not the only complainer here if you search through the web). That's the main driver for mounting the same download directory for all the VMs to stop re-downloading every time the Ansible playbook runs.

CPU and Memory Allocation

Vagrantfile
      vb.memory = "1024"
      vb.cpus = 1

While not strictly required, I found it beneficial to restrict the CPU and memory of the VMs. This ensures that no extra resources are being used.

Frankly speaking, they shouldn't even need more than this. This is an empty cluster with just the control-plane components, and no heavy workload is running on it.

Mounting the Download Directory

Vagrantfile
    node.vm.synced_folder "share/dl", "/downloads", create: true

The download directory is mounted to all the VMs to avoid re-downloading the binaries from the internet every time the playbook is running, either due to a machine restart, or simply to start from scratch.

The trick, however, is that the Ansible get_url task, as you'll see shortly, needs the absolute path of the destination file to benefit from this optimization; specifying only a directory will re-download the file every time.
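
The idea behind that optimization can be sketched in a few lines of Ruby. This is not how get_url is implemented; fetch_once is a hypothetical helper that only illustrates why a full file path lets a run skip the download when the file is already sitting in the shared directory.

```ruby
require "tmpdir"

# Hypothetical helper: write the payload produced by the block to dest_file,
# unless a file already exists at that exact path.
def fetch_once(dest_file)
  return :cached if File.file?(dest_file)

  File.write(dest_file, yield)
  :downloaded
end

Dir.mktmpdir do |downloads_dir|
  target = File.join(downloads_dir, "etcd.tar.gz")
  puts fetch_once(target) { "pretend tarball bytes" } # first run downloads
  puts fetch_once(target) { "pretend tarball bytes" } # second run hits the cache
end
```

With only a directory as the destination, there is no fixed file path to check up front, which is roughly why the download happens again.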

Ansible Provisioner

Vagrantfile
    node.vm.provision "ansible" do |ansible|
      ansible.verbose = "vv"
      ansible.playbook = "bootstrap.yml"
      ansible.compatibility_mode = "2.0"
    end
      if machine_id == N
        node.vm.provision :ansible do |ansible|
          ansible.limit = "all"
          ansible.verbose = "vv"
          ansible.playbook = "bootstrap.yml"
        end
      end

The last and most important part of the Vagrantfile is the Ansible provisioner section which, as you can see, applies to both the Load Balancer VM and the three nodes of the Kubernetes cluster.

The difference, however, is that for the Kubernetes nodes we want the playbook to run on all of them at the same time to benefit from Ansible's parallel execution. That is why the provisioner is attached only to the last node (machine_id == N) with ansible.limit = "all": by the time it runs, every VM is already up. The alternative would be to spin up the nodes one by one and run the playbook on each of them, which is less efficient and takes more time.

Ansible Playbook

After provisioning the VMs, it's time to take a look at what the Ansible playbook does to configure the nodes and the Load Balancer.

The main configuration of the entire Kubernetes cluster is done via this playbook and as such, you can expect a hefty amount of configurations to be done here.

First, let's take a look at the playbook itself to get a feeling of what to expect.

bootstrap.yml
- name: Configure the Load Balancer
  hosts: lb
  become: true
  gather_facts: true
  vars_files:
    - vars/apt.yml
    - vars/k8s.yml
    - vars/lb.yml
    - vars/etcd.yml
  roles:
    - prerequisites
    - haproxy
    - etcd-gateway

- name: Bootstrap the Kubernetes Cluster
  hosts:
    - node0
    - node1
    - node2
  become: true
  gather_facts: true
  vars_files:
    - vars/tls.yml
    - vars/apt.yml
    - vars/lb.yml
    - vars/k8s.yml
    - vars/etcd.yml
  environment:
    KUBECONFIG: /var/lib/kubernetes/admin.kubeconfig
  roles:
    - prerequisites
    - role: tls-ca
      run_once: true
    - tls
    - kubeconfig
    - encryption
    - etcd
    - k8s
    - worker
    - role: coredns
      run_once: true
    - role: cilium
      run_once: true

Notice that there are two plays in this playbook, one for the Load Balancer and the other for the Kubernetes nodes. This distinction is important because not all the configurations will be the same for all the VMs. That's the logic behind having different hosts in each.

Another important highlight is that the Load Balancer is being configured first, only because that's the entrypoint for our Kubernetes API server and we need that to be ready before the upstream servers.

Directory Layout

This playbook you see above is at the root of our directory structure, right next to all the roles you see included in the roles section.

To get a better understanding, here's what the directory structure looks like:

Directory Tree
.
├── ansible.cfg
├── bootstrap.yml
├── cilium/
├── coredns/
├── encryption/
├── etcd/
├── etcd-gateway/
├── haproxy/
├── k8s/
├── kubeconfig/
├── prerequisites/
├── tls/
├── tls-ca/
├── Vagrantfile
├── vars/
└── worker/

Besides the playbook itself, the Ansible configuration, the vars directory, and the Vagrantfile, the other pieces are roles, initialized with the ansible-galaxy init <role-name> command and modified as per the specification in the original challenge.

We will take a closer look at each role shortly.

Ansible Configuration

Before jumping into the roles, one last important piece of information is the ansible.cfg file, which holds the modifications we make to the Ansible default behavior.

The content is as below.

ansible.cfg
[defaults]
inventory=.vagrant/provisioners/ansible/inventory/
become=false
log_path=/tmp/ansible.log
gather_facts=false
host_key_checking=false

gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

[inventory]
enable_plugins = 'host_list', 'script', 'auto', 'yaml', 'ini', 'toml'
cache = yes
cache_connection = /tmp/ansible_inventory

Fact caching matters because it speeds up subsequent runs of the playbook: facts gathered once are reused instead of being collected again from every host.

Let's Begin the Real Work

So far, we've only been preparing everything. The remainder of this post will focus on the real challenge itself, creating the pieces and components that make up the Kubernetes cluster, one by one.

For the sake of brevity, and because I don't want this blog post to be long, or worse, broken into multiple parts, I will only highlight the most important tasks, remove duplicates from the discussion, and generally go through the core aspect of the task at hand. You are more than welcome to visit the source code [1] yourself and dig deeper.

Step 0: Prerequisites

Here we will enable IP forwarding, create the necessary directories that will be used by later steps, and optionally, add the DNS records to every node's /etc/hosts file.

The important lines are highlighted in the snippet below.

prerequisites/tasks/main.yml
---
- name: Enable IP Forwarding permanently
  ansible.builtin.copy:
    content: net.ipv4.ip_forward = 1
    dest: /etc/sysctl.d/20-ipforward.conf
    mode: "0644"
  notify: Reload sysctl
- name: Ensure python3-pip package is installed
  ansible.builtin.apt:
    name: python3-pip
    state: present
    update_cache: true
    cache_valid_time: "{{ cache_valid_time }}"
- name: Add current user to adm group
  ansible.builtin.user:
    name: "{{ ansible_user }}"
    groups: adm
    append: true
- name: Ensure CA directory exists
  ansible.builtin.file:
    path: /etc/kubernetes/pki
    state: directory
    owner: root
    group: root
    mode: "0700"
- name: Create private-network DNS records
  ansible.builtin.lineinfile:
    path: /etc/hosts
    line: "{{ item }}"
    state: present
  with_items:
    - 192.168.56.2 node0 etcd-node0
    - 192.168.56.3 node1 etcd-node1
    - 192.168.56.4 node2 etcd-node2
  tags:
    - dns

As you can see, after the sysctl modification, we're notifying our handler for a reload and re-read of the sysctl configurations. The handler definition is as below.

prerequisites/handlers/main.yml
---
- name: Reload sysctl
  ansible.builtin.shell: sysctl --system
  changed_when: false

Ansible Linting

When you're working with Ansible, I highly recommend using ansible-lint [2] as it will help you refine your playbooks much faster during the development phase of your project. It's not just about being "nice" or about "linting" for its own sake. Some of the recommendations are really important for other reasons, such as security and performance.

Step 1: TLS CA certificate

For all the workloads we will be deploying, we will need a CA signing TLS certificates for us. If you're not a TLS expert, just know two main things:

  1. TLS enforces encrypted and secure communication between the parties (client and server in this case, but they can also be peers). You can try it out for yourself and sniff some of the data using Wireshark to see that none of the data is readable. It is only decipherable by the parties involved in the communication.
  2. At least in the case of Kubernetes, TLS certificates are used for authentication and authorization of the different components and users. This will, in effect, mean that if a TLS certificate was signed by the trusted CA of the cluster, and the subject of that certificate has elevated privileges, then that subject can send the corresponding requests to the API server and no further authentication is needed.

The TLS key and certificate generations are a pain in the butt IMO. But, with the power of Ansible, we take a lot of the pain away, as you can see in the snippet below.

tls-ca/tasks/ca.yml
---
- name: Generate an empty temp file for CSR
  ansible.builtin.file:
    path: /tmp/ca.csr
    state: touch
    owner: root
    group: root
    mode: "0400"
  register: temp_file
- name: Generate CA private key
  community.crypto.openssl_privatekey:
    path: /vagrant/share/ca.key
    type: "{{ k8s_privatekey_type }}"
    state: present
- name: Generate CA CSR to provide ALT names and other options
  community.crypto.openssl_csr:
    basicConstraints_critical: true
    basic_constraints:
      - CA:TRUE
    common_name: kubernetes-ca
    keyUsage_critical: true
    key_usage:
      - keyCertSign
      - cRLSign
    path: "{{ temp_file.dest }}"
    privatekey_path: /vagrant/share/ca.key
    state: present
- name: Generate CA certificate
  community.crypto.x509_certificate:
    path: /vagrant/share/ca.crt
    privatekey_path: /vagrant/share/ca.key
    csr_path: "{{ temp_file.dest }}"
    provider: selfsigned
    state: present
- name: Copy cert to kubernetes PKI dir
  ansible.builtin.copy:
    src: "{{ item }}"
    dest: /etc/kubernetes/pki/
    remote_src: true
    owner: root
    group: root
    mode: "0400"
  loop:
    - /vagrant/share/ca.crt
    - /vagrant/share/ca.key

I'm not a TLS expert, but from my understanding, the most important part of the CSR creation is the CA: TRUE flag. I actually don't even know if any of the constraints or usages are needed, used and respected by any tool!

Also, the provider: selfsigned, as self-explanatory as it is, is used to instruct that we're creating a new root CA certificate and not a subordinate one.

Lastly, both the CA key and its certificate live in a shared directory (/vagrant/share), which all the other components use when generating their own certificates, and the last task copies them into /etc/kubernetes/pki/ as well.
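
For readers who want to see the same ingredients outside of Ansible, below is a rough Ruby sketch of a self-signed root CA using the standard openssl library. It is an illustration, not what the community.crypto modules do internally. The RSA 2048 key, the serial number, and the one-year validity are assumptions of mine; the playbook takes its key type from k8s_privatekey_type, which isn't shown here.

```ruby
require "openssl"

# Sketch of a self-signed root CA, similar in spirit to `provider: selfsigned`.
# Assumptions: RSA 2048 key, serial 1, one year of validity.
def build_root_ca(common_name = "kubernetes-ca")
  key = OpenSSL::PKey::RSA.new(2048)
  cert = OpenSSL::X509::Certificate.new
  cert.version = 2                      # the value 2 means X.509 v3
  cert.serial = 1
  cert.subject = OpenSSL::X509::Name.parse("/CN=#{common_name}")
  cert.issuer = cert.subject            # self-signed: issuer equals subject
  cert.public_key = key.public_key
  cert.not_before = Time.now
  cert.not_after = cert.not_before + 365 * 24 * 60 * 60

  extensions = OpenSSL::X509::ExtensionFactory.new
  extensions.subject_certificate = cert
  extensions.issuer_certificate = cert
  cert.add_extension(extensions.create_extension("basicConstraints", "CA:TRUE", true))
  cert.add_extension(extensions.create_extension("keyUsage", "keyCertSign,cRLSign", true))

  cert.sign(key, OpenSSL::Digest.new("SHA256"))
  [key, cert]
end

_key, ca_cert = build_root_ca
puts ca_cert.subject
ca_cert.extensions.each { |ext| puts "#{ext.oid}: #{ext.value} (critical: #{ext.critical?})" }
```

As far as I can tell, the constraints are not purely decorative: standard certificate path validation (RFC 5280) checks the CA flag of every issuing certificate in the chain.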

Etcd CA

We could use the same CA for etcd communications as well, but I decided to separate them out to make sure no component other than the API server and the peers of etcd will be allowed to send any requests to the etcd server.

In the same Ansible role, we also generate a key and certificate for the admin/operator of the cluster. In this case, that'll be me, the person who's provisioning and configuring the cluster.

The idea is that we will not use the TLS certificate of other components to talk to the API server, but rather use the ones explicitly created for this purpose.

Here's what it will look like:

tls-ca/tasks/admin.yml
- name: Generate an empty temp file for CSR
  ansible.builtin.file:
    path: /tmp/admin.csr
    state: touch
    owner: root
    group: root
    mode: "0400"
  register: temp_file
- name: Generate Admin Operator private key
  community.crypto.openssl_privatekey:
    path: /vagrant/share/admin.key
    type: "{{ k8s_privatekey_type }}"
- name: Generate Admin Operator CSR
  community.crypto.openssl_csr:
    path: "{{ temp_file.dest }}"
    privatekey_path: /vagrant/share/admin.key
    common_name: 'admin'
    subject:
      O: 'system:masters'
      OU: 'Kubernetes The Hard Way'
- name: Create Admin Operator TLS certificate using CA key and cert
  community.crypto.x509_certificate:
    path: /vagrant/share/admin.crt
    csr_path: "{{ temp_file.dest }}"
    privatekey_path: /vagrant/share/admin.key
    ownca_path: /vagrant/share/ca.crt
    ownca_privatekey_path: /vagrant/share/ca.key
    provider: ownca
    ownca_not_after: +365d

The subject's O field (line 19 of the snippet) is the group system:masters. This group has the highest privileges inside the Kubernetes cluster: its requests are granted by default, without requiring any RBAC configuration.

As for the certificate creation, lines 26-28 tell the Ansible task to use the ownca provider, i.e., to sign the certificate with our own CA, and we pass the same CA key and certificate we created in the last step.
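
The reason the subject matters so much is that Kubernetes derives the identity of a client certificate from it: the CN becomes the user name and each O becomes a group. Here is a minimal Ruby sketch of that mapping; kubernetes_identity is a hypothetical helper, not an API of any library.

```ruby
require "openssl"

# Kubernetes reads the user name from the certificate's CN and the group
# memberships from its O fields. `kubernetes_identity` is a hypothetical helper.
def kubernetes_identity(subject)
  fields = OpenSSL::X509::Name.parse(subject).to_a # [[field, value, type], ...]
  {
    user: fields.find { |field, _value, _type| field == "CN" }&.fetch(1),
    groups: fields.select { |field, _value, _type| field == "O" }.map { |_f, value, _t| value }
  }
end

p kubernetes_identity("/O=system:masters/OU=Kubernetes The Hard Way/CN=admin")
```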

TLS CA Execution

To wrap this step up, two important things worth mentioning are:

  1. Both of the snippets mentioned here, and the ones not mentioned, are imported into the root of the Ansible role with the following main.yml.
    tls-ca/tasks/main.yml
    - name: CA
      ansible.builtin.import_tasks:
        file: ca.yml
    - name: Etcd CA
      ansible.builtin.import_tasks:
        file: etcd-ca.yml
    - name: Etcd Admin
      ansible.builtin.import_tasks:
        file: etcd-admin.yml
    - name: Create admin operator TLS certificate
      ansible.builtin.import_tasks:
        file: admin.yml
    
  2. You might have noticed in the bootstrap.yml root playbook that this CA role runs only once, on the first host of the Ansible inventory that gets to this point. This ensures we don't consume extra CPU power or overwrite the existing CA key and certificate. Some of our other roles are designed this way too, e.g., the installation of cilium.
    bootstrap.yml
        - role: tls-ca
          run_once: true
    

Step 2: TLS Certificates for Kubernetes Components

The number of certificates we need to generate for the Kubernetes components is eight in total, but we'll cover only one in this discussion: the most important one, the API server certificate.

All the others are similar with a possible minor tweak.

Let's first take a look at what the Ansible role will look like:

tls/tasks/apiserver.yml
- name: Generate API Server private key
  community.crypto.openssl_privatekey:
    path: /etc/kubernetes/pki/kube-apiserver.key
    type: "{{ k8s_privatekey_type }}"
- name: Generate API Server CSR
  community.crypto.openssl_csr:
    basicConstraints_critical: true
    basic_constraints:
      - CA:FALSE
    common_name: kube-apiserver
    extKeyUsage_critical: false
    extended_key_usage:
      - clientAuth
      - serverAuth
    keyUsage:
      - keyEncipherment
      - dataEncipherment
    keyUsage_critical: true
    path: /etc/kubernetes/pki/kube-apiserver.csr
    privatekey_path: /etc/kubernetes/pki/kube-apiserver.key
    subject:
      O: system:kubernetes
      OU: Kubernetes The Hard Way
    subject_alt_name: "{{ lookup('template', 'apiserver-alt-names.yml.j2') | from_yaml }}"
- name: Create API Server TLS certificate using CA key and cert
  community.crypto.x509_certificate:
    path: /etc/kubernetes/pki/kube-apiserver.crt
    csr_path: /etc/kubernetes/pki/kube-apiserver.csr
    privatekey_path: /etc/kubernetes/pki/kube-apiserver.key
    ownca_path: /vagrant/share/ca.crt
    ownca_privatekey_path: /vagrant/share/ca.key
    ownca_not_after: +365d
    provider: ownca

From the highlights in the snippet above, you can see at least three pieces of important information:

  1. This certificate is not a CA certificate: CA:FALSE.
  2. The subject is in the system:kubernetes group. This is really just an identifier and serves no special purpose.
  3. The same properties as in admin.yml were used to generate the TLS certificate, namely the provider and all the ownca_* properties.
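
The apiserver-alt-names.yml.j2 template isn't shown in this post, so the following Ruby sketch is only a guess at the kind of list it renders. The "DNS:" and "IP:" prefixes are the notation the openssl_csr module expects for subject_alt_name. The node and Load Balancer addresses come from this post, while the DNS names are the conventional in-cluster names and are an assumption on my part.

```ruby
# Hypothetical helper: build subject_alt_name entries in the "TYPE:value"
# notation that community.crypto.openssl_csr accepts.
def apiserver_alt_names(ips:, dns_names:)
  dns_names.map { |name| "DNS:#{name}" } + ips.map { |ip| "IP:#{ip}" }
end

alt_names = apiserver_alt_names(
  ips: %w[127.0.0.1 192.168.56.2 192.168.56.3 192.168.56.4 192.168.56.100],
  dns_names: %w[localhost kubernetes kubernetes.default kubernetes.default.svc]
)
puts alt_names
```

Every address or name a client uses to reach the API server has to be in this list, otherwise the TLS handshake fails hostname verification.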

Step 3: KubeConfig Files

In this step, for every component that will talk to the API server, we will create a KubeConfig file, specifying the server address, the CA certificate, and the key and certificate of the client.

The format of the KubeConfig is the same as you have in your filesystem under ~/.kube/config. That, for the purpose of our cluster, will be a Jinja2 template that will take the variables we just mentioned.

Here's what that Jinja2 template will look like:

kubeconfig/templates/kubeconfig.yml.j2
apiVersion: v1
clusters:
  - cluster:
      certificate-authority: {{ ca_cert_path }}
      server: https://{{ kube_apiserver_address }}:{{ kube_apiserver_port }}
    name: {{ kube_context }}
contexts:
  - context:
      cluster: {{ kube_context }}
      user: {{ kube_context }}
    name: {{ kube_context }}
current-context: {{ kube_context }}
kind: Config
preferences: {}
users:
  - name: {{ kube_context }}
    user:
      client-certificate: {{ client_cert_path }}
      client-key: {{ client_key_path }}

With that template, we can generate a KubeConfig for each component. Here is one example, creating the KubeConfig for the Kubelet component.

kubeconfig/tasks/kubelet.yml
---
- name: Generate KubeConfig for kubelet
  ansible.builtin.template:
    src: kubeconfig.yml.j2
    dest: /var/lib/kubernetes/kubelet.kubeconfig
    mode: "0640"
    owner: root
    group: root
  vars:
    kube_apiserver_address: localhost
    kube_apiserver_port: 6443
    client_cert_path: /etc/kubernetes/pki/kubelet.crt
    client_key_path: /etc/kubernetes/pki/kubelet.key

Kubernetes API Server

The current setup is to deploy three API servers, one on each of the three VM nodes. That means that on any node we have localhost access to the control plane, if and only if the key and certificate are passed correctly.

As you notice, the Kubelet is talking to the localhost:6443. A better alternative is to talk to the Load Balancer in case one of the API servers goes down. But, this is an educational setup and not a production one!

The values that are not directly passed with vars property, are being passed by the defaults variables:

kubeconfig/defaults/main.yml
---
ca_cert_path: /etc/kubernetes/pki/ca.crt
kube_context: kubernetes-the-hard-way
kube_apiserver_address: "{{ load_balancer_ip }}"
kube_apiserver_port: "{{ load_balancer_port }}"

They can also be passed from parents of the role, or the files being passed to the playbook.

vars/lb.yml
---
haproxy_version: 2.9
load_balancer_ip: 192.168.56.100
load_balancer_port: 6443

Ansible Variables

As you may have noticed, inside every vars file, we can use the values from other variables. That's one of the many things that make Ansible so powerful!
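
Putting the template, the role defaults, and the task-level vars together, the rendering can be approximated in Ruby as follows. ERB stands in for Jinja2 here, the template is a trimmed copy of kubeconfig.yml.j2, and render_kubeconfig is a hypothetical helper. The one behavior it tries to show is that task-level vars win over role defaults.

```ruby
require "erb"
require "yaml"

# ERB stands in for Jinja2; the template is a trimmed copy of kubeconfig.yml.j2.
def kubeconfig_template
  <<~TEMPLATE
    apiVersion: v1
    kind: Config
    clusters:
      - cluster:
          certificate-authority: <%= ca_cert_path %>
          server: https://<%= kube_apiserver_address %>:<%= kube_apiserver_port %>
        name: <%= kube_context %>
    users:
      - name: <%= kube_context %>
        user:
          client-certificate: <%= client_cert_path %>
          client-key: <%= client_key_path %>
  TEMPLATE
end

# Role defaults first, then task-level vars on top: the later hash wins.
def render_kubeconfig(defaults, task_vars)
  ERB.new(kubeconfig_template).result_with_hash(defaults.merge(task_vars))
end

defaults = {
  ca_cert_path: "/etc/kubernetes/pki/ca.crt",
  kube_context: "kubernetes-the-hard-way",
  kube_apiserver_address: "192.168.56.100", # load_balancer_ip
  kube_apiserver_port: 6443                 # load_balancer_port
}
kubelet_vars = {
  kube_apiserver_address: "localhost",
  client_cert_path: "/etc/kubernetes/pki/kubelet.crt",
  client_key_path: "/etc/kubernetes/pki/kubelet.key"
}
puts render_kubeconfig(defaults, kubelet_vars)
```

The Kubelet's file ends up pointing at localhost even though the default address is the Load Balancer, because the task's vars override the role's defaults.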

Step 4: Encryption Configuration

The objective of this step is to create an encryption key that will be used to encrypt and decrypt the Kubernetes Secrets stored in the etcd database.

For this task, we use one template, and a set of Ansible tasks.

encryption/templates/config.yml.j2
kind: EncryptionConfig
apiVersion: v1
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: {{ key_name }}
              secret: {{ key_secret }}
      - identity: {}

encryption/tasks/main.yml
---
- name: Read the contents of /vagrant/share/encryption-secret
  ansible.builtin.slurp:
    src: '/vagrant/share/encryption-secret'
  register: encryption_secret_file
  failed_when: false
- name: Generate random string
  ansible.builtin.set_fact:
    key_secret: "{{ lookup('ansible.builtin.password', '/dev/null length=32 chars=ascii_letters,digits,special_characters') }}"
  no_log: true
  when: key_secret is not defined
- name: Ensure key_secret is populated
  when: encryption_secret_file.content is not defined
  block:
    - name: Write secret to file
      ansible.builtin.copy:
        content: '{{ key_secret }}'
        dest: '/vagrant/share/encryption-secret'
        mode: '0400'
- name: Read existing key_secret
  ansible.builtin.set_fact:
    key_secret: "{{ encryption_secret_file.content }}"
  no_log: true
  when: encryption_secret_file.content is defined
- name: Create encryption config
  ansible.builtin.template:
    src: config.yml.j2
    dest: /etc/kubernetes/encryption-config.yml
    mode: '0400'
    owner: root
    group: root
  no_log: true

As you see in the task definition, we generate the secret only once and, on all subsequent runs, reuse the file that holds it.

That encryption configuration will later be passed to the cluster for storing encrypted Secret resources.
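
For comparison, the upstream Kubernetes documentation creates the aescbc key as 32 random bytes, base64-encoded. The Ruby sketch below does the same thing; new_encryption_secret is a hypothetical helper and is not part of the playbook.

```ruby
require "securerandom"

# Generate a random 32-byte key and base64-encode it without line breaks.
def new_encryption_secret
  [SecureRandom.random_bytes(32)].pack("m0")
end

secret = new_encryption_secret
puts secret
puts secret.unpack1("m0").bytesize
```

Worth knowing when reading the tasks above: ansible.builtin.slurp returns the file content base64-encoded.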

Step 5: Etcd Cluster

In this step we will download the compiled etcd binary, create the configuration, create the systemd service, issue the certificates for the etcd peers as well as one for the API server talking to the etcd cluster as a client.

The installation looks like below.

etcd/tasks/install.yml
---
- name: Download etcd release tarball
  ansible.builtin.get_url:
    url: "{{ etcd_download_url }}"
    dest: "{{ downloads_dir }}/{{ etcd_download_url | basename }}"
    mode: "0644"
    owner: root
    group: root
    checksum: sha256:{{ etcd_checksum }}
  tags:
    - download
  register: etcd_download
- name: Ensure gzip is installed
  ansible.builtin.package:
    name: gzip
    state: present
- name: Extract etcd from the tarball to /usr/local/bin/
  ansible.builtin.unarchive:
    src: "{{ etcd_download.dest }}"
    dest: /usr/local/bin/
    remote_src: true
    mode: "0755"
    extra_opts:
      - --strip-components=1
      - --wildcards
      - "**/etcd"
      - "**/etcdctl"
      - "**/etcdutl"
  notify: Reload etcd systemd

A few important notes about these tasks:

  1. We specify the dest as an absolute path to the get_url task to avoid re-downloading the file for subsequent runs.
  2. The checksum ensures that we don't get any nasty binary from the internet.
  3. The register for the download step will allow us to use the etcd_download.dest when later trying to extract the tarball.
  4. The tarball may contain more files than we need. We are only interested in extracting the ones we specify in the extra_opts property. Be mindful of the --strip-components and the --wildcards options.
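
The checksum check from the second point boils down to hashing the downloaded file and comparing the result with the published value. A minimal Ruby sketch, with checksum_ok? as a hypothetical helper and a tiny stand-in payload instead of the real tarball:

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper: compare a file's SHA-256 digest with the expected hex string.
def checksum_ok?(path, expected_sha256)
  Digest::SHA256.file(path).hexdigest == expected_sha256.downcase
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "payload")
  File.write(path, "hello")
  puts checksum_ok?(path, "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824")
end
```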

The variables for the above task will look like below:

vars/etcd.yml
---
etcd_initial_cluster: etcd-node0=https://192.168.56.2:2380,etcd-node1=https://192.168.56.3:2380,etcd-node2=https://192.168.56.4:2380
etcd_advertise_ip: "0.0.0.0"
etcd_privatekey_type: Ed25519
etcd_version: v3.5.12
etcd_checksum: f2ff0cb43ce119f55a85012255609b61c64263baea83aa7c8e6846c0938adca5
etcd_download_url: https://github.com/etcd-io/etcd/releases/download/{{ etcd_version }}/etcd-{{ etcd_version }}-linux-amd64.tar.gz
k8s_etcd_servers: https://192.168.56.2:2379,https://192.168.56.3:2379,https://192.168.56.4:2379

Once the installation is done, we can proceed with the configuration as below:

etcd/tasks/configure.yml
- name: Ensure the etcd directories exist
  ansible.builtin.file:
    path: '{{ item }}'
    state: directory
    owner: 'root'
    group: 'root'
    mode: '0750'
  loop:
    - /etc/etcd
- name: Copy CA TLS certificate to /etc/kubernetes/pki/etcd/
  ansible.builtin.copy:
    src: /vagrant/share/etcd-ca.crt
    dest: /etc/kubernetes/pki/etcd/ca.crt
    mode: '0640'
    remote_src: true
  notify: Reload etcd systemd
- name: Create systemd service
  ansible.builtin.template:
    src: systemd.service.j2
    dest: '/etc/systemd/system/etcd.service'
    mode: '0644'
    owner: root
    group: root
  tags:
    - systemd
  notify: Reload etcd systemd
- name: Start etcd service
  ansible.builtin.systemd_service:
    name: etcd.service
    state: started
    enabled: true
    daemon_reload: true

The handler for restarting the etcd is not much different from what we've seen previously. But the systemd Jinja2 template is an interesting one:

etcd/templates/systemd.service.j2
[Unit]
Description=etcd
Documentation=https://github.com/etcd-io/etcd

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
  --advertise-client-urls=https://{{ host_ip }}:2379 \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --client-cert-auth \
  --data-dir=/var/lib/etcd \
  --initial-advertise-peer-urls=https://{{ host_ip }}:2380 \
  --initial-cluster-state={{ etcd_initial_cluster_state }} \
  --initial-cluster-token={{ etcd_initial_cluster_token }} \
  --initial-cluster={{ etcd_initial_cluster }} \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --listen-client-urls=https://{{ bind_address }}:2379 \
  --listen-peer-urls=https://{{ bind_address }}:2380 \
  --log-level info \
  --log-outputs stderr \
  --logger zap \
  --name={{ etcd_peer_name }} \
  --peer-cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --peer-client-cert-auth \
  --peer-key-file=/etc/kubernetes/pki/etcd/server.key \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
LimitNOFILE=40000
Restart=always
RestartSec=5
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

Two of the variables you see in the above template, host_ip and bind_address, are passed from the following vars file:

vars/k8s.yml
---
host_ip: "{{ ansible_facts.all_ipv4_addresses | select('match', '^192.168.56') | first }}"
k8s_version: v1.29.2
apiserver_port: 6443
apiserver_ips:
  - 192.168.56.2
  - 192.168.56.3
  - 192.168.56.4
cilium_version: 1.15.1
k8s_static_pods_dir: /etc/kubernetes/manifests
bind_address: "0.0.0.0"
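
The host_ip expression is the least obvious line in that file: it filters all IPv4 addresses of the machine and keeps the first one on the private network. A rough Ruby equivalent is below. private_network_ip is a hypothetical helper, and the 10.0.2.15 entry is the address the VirtualBox NAT interface typically gets.

```ruby
# Hypothetical helper mirroring `select('match', '^192.168.56') | first`:
# keep the first address on the private network.
def private_network_ip(addresses, prefix = "192.168.56.")
  addresses.find { |address| address.start_with?(prefix) }
end

puts private_network_ip(["10.0.2.15", "192.168.56.3"])
```

The sketch uses a literal prefix rather than a regular expression, so it is slightly stricter than the Jinja2 filter, where the unescaped dots match any character.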

Notice that etcd is instructed to authenticate its requests using the TLS CA. No request is allowed unless its TLS certificate is signed by the trusted CA.

This is achieved by the --client-cert-auth and --trusted-ca-file options for clients of the etcd cluster, and the --peer-client-cert-auth and --peer-trusted-ca-file for the peers of the etcd cluster.

You will also notice that this is a 3-node etcd cluster, and the peers are statically configured by the values given in the vars/etcd.yml file. This is exactly one of the cases where having static IP addresses makes a lot of our assumptions easier and the configurations simpler. One can only imagine what would be required for dynamic environments where DHCP is involved.
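
With static addresses, both long strings in vars/etcd.yml can be derived from a single list of nodes. The Ruby sketch below shows the derivation; the two helper names are made up, and the output should match the etcd_initial_cluster and k8s_etcd_servers values shown earlier.

```ruby
# Hypothetical helpers: derive --initial-cluster and the client endpoints
# from one list of node names and addresses.
def etcd_initial_cluster(nodes, peer_port: 2380)
  nodes.map { |name, ip| "etcd-#{name}=https://#{ip}:#{peer_port}" }.join(",")
end

def etcd_client_endpoints(nodes, client_port: 2379)
  nodes.values.map { |ip| "https://#{ip}:#{client_port}" }.join(",")
end

nodes = { "node0" => "192.168.56.2", "node1" => "192.168.56.3", "node2" => "192.168.56.4" }
puts etcd_initial_cluster(nodes)
puts etcd_client_endpoints(nodes)
```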

Step 6: Kubernetes Components

There are multiple components, as you know, but here's one sample: the Kubernetes API server.

k8s/tasks/install.yml
---
- name: Download the Kubernetes binaries
  ansible.builtin.get_url:
    url: "https://dl.k8s.io/{{ k8s_version }}/kubernetes-server-linux-amd64.tar.gz"
    dest: "{{ downloads_dir }}/kubernetes-server-{{ k8s_version }}-linux-amd64.tar.gz"
    mode: "0444"
    owner: root
    group: root
  tags:
    - download
  register: k8s_download
- name: Extract binaries to system path
  ansible.builtin.unarchive:
    src: "{{ k8s_download.dest }}"
    dest: /usr/local/bin/
    remote_src: true
    owner: root
    group: root
    mode: "0755"
    extra_opts:
      - --strip-components=3
      - kubernetes/server/bin/kube-apiserver
      - kubernetes/server/bin/kube-controller-manager
      - kubernetes/server/bin/kube-scheduler
      - kubernetes/server/bin/kubectl
      - kubernetes/server/bin/kubelet
      - kubernetes/server/bin/kube-proxy
k8s/templates/kube-apiserver.service.j2
[Unit]
Description=Kubernetes API Server
Documentation=https://github.com/kubernetes/kubernetes

[Service]
ExecStart=/usr/local/bin/kube-apiserver \
  --allow-privileged=true \
  --audit-log-maxage=30 \
  --audit-log-maxbackup=3 \
  --audit-log-maxsize=100 \
  --audit-log-path=/var/log/audit.log \
  --authorization-mode=Node,RBAC \
  --bind-address={{ bind_address }} \
  --external-hostname={{ external_hostname }} \
  --client-ca-file=/etc/kubernetes/pki/ca.crt \
  --enable-admission-plugins=NamespaceLifecycle,NodeRestriction,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota \
  --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt \
  --etcd-certfile=/etc/kubernetes/pki/etcd/kube-apiserver.crt \
  --etcd-keyfile=/etc/kubernetes/pki/etcd/kube-apiserver.key \
  --etcd-servers={{ k8s_etcd_servers }} \
  --event-ttl=1h \
  --encryption-provider-config=/etc/kubernetes/encryption-config.yml \
  --kubelet-certificate-authority=/etc/kubernetes/pki/ca.crt \
  --kubelet-client-certificate=/etc/kubernetes/pki/kube-apiserver.crt \
  --kubelet-client-key=/etc/kubernetes/pki/kube-apiserver.key \
  --runtime-config='api/all=true' \
  --service-account-key-file=/etc/kubernetes/pki/serviceaccount.key \
  --service-account-signing-key-file=/etc/kubernetes/pki/serviceaccount.key \
  --service-account-issuer=https://{{ kubernetes_public_ip }}:6443 \
  --service-cluster-ip-range=10.0.0.0/16,fd00:10:96::/112 \
  --service-node-port-range=30000-32767 \
  --tls-cert-file=/etc/kubernetes/pki/kube-apiserver.crt \
  --tls-private-key-file=/etc/kubernetes/pki/kube-apiserver.key \
  --proxy-client-cert-file=/etc/kubernetes/pki/kube-apiserver.crt \
  --proxy-client-key-file=/etc/kubernetes/pki/kube-apiserver.key \
  --peer-advertise-ip={{ bind_address }} \
  --peer-ca-file=/etc/kubernetes/pki/ca.crt \
  --feature-gates=UnknownVersionInteroperabilityProxy=true,StorageVersionAPI=true \
  --v=4
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

And the default values are fetched from the role's own defaults:

k8s/defaults/main.yml
---
k8s_version: "{{ lookup('url', 'https://dl.k8s.io/release/stable.txt').split('\n')[0] }}"
kubernetes_public_ip: "{{ host_ip }}"
cluster_cidr: 10.200.0.0/16
service_cluster_ip_range: 10.32.0.0/24
external_hostname: 192.168.56.100
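
One relationship between these values is easy to overlook: the cluster DNS address given to the kubelets (cluster_dns in the worker role defaults further down) only makes sense if it falls inside the service IP range the API server allocates from. A small sanity check with Python's ipaddress module, using values copied from this post (the function name is made up):

```python
import ipaddress


def dns_ip_in_service_range(cluster_dns, service_range):
    """True when the DNS service IP lies inside the given service range."""
    return ipaddress.ip_address(cluster_dns) in ipaddress.ip_network(service_range)


print(dns_ip_in_service_range("10.32.0.10", "10.32.0.0/24"))  # service_cluster_ip_range default
print(dns_ip_in_service_range("10.32.0.10", "10.0.0.0/16"))   # IPv4 range in the unit file
```

The first check passes and the second does not. Since the unit template above hard-codes --service-cluster-ip-range rather than reading service_cluster_ip_range, it may be worth double-checking which range the cluster actually ends up with.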

You will notice the following points in this systemd setup:

  1. The external hostname is the address of the Load Balancer.
  2. The certificate files for both the Kubernetes API server and the etcd server are passed from the locations where they were previously generated.
  3. The encryption config is read from the file generated in step 4.

What makes this setup HA then?

First of all, there are a couple of places that do not point to the Load Balancer IP address, so this wouldn't count as a complete HA setup. Still, having an LB in place already qualifies this setup as such.

You will not see any peer address configuration in the API server, unlike etcd with its --initial-cluster flag. So you might wonder: how do the different instances of the API server know of one another, and how do they coordinate when multiple requests hit the API server?

The answer to this question does not lie in Kubernetes itself, but in the storage layer: the etcd cluster. At the time of writing, etcd uses the Raft protocol for consensus and coordination between the peers.

And that is what makes the Kubernetes cluster HA, not the API server itself. Each instance will talk to the etcd cluster to understand the state of the cluster and the components inside.
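
Raft needs a majority of members to agree before a write is committed, which is why the cluster size matters. The arithmetic is short enough to sketch (the function names are mine, not etcd's):

```python
def quorum(members):
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for size in (1, 3, 5):
    print(size, quorum(size), tolerated_failures(size))
```

For the 3-node cluster in this post, the quorum is 2, so one etcd member can go down and the API servers can still read and write the cluster state.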

Step 7: Worker Nodes

This is one of the last steps before we have a non-Ready Kubernetes cluster.

The task includes downloading some of the binaries, passing in some of the TLS certificates generated earlier, and starting the systemd services.

worker/tasks/cni-config.yml
- name: Ensure CNI directory exists
  ansible.builtin.file:
    path: /etc/cni/net.d/
    state: directory
    owner: root
    group: root
    mode: "0755"
  # "never" is a special Ansible tag: the task is skipped unless the tag is requested explicitly.
  tags:
    - never
- name: Configure CNI Networking
  ansible.builtin.copy:
    content: |
      {
        "cniVersion": "1.0.0",
        "name": "containerd-net",
        "plugins": [
          {
            "type": "bridge",
            "bridge": "cni0",
            "isGateway": true,
            "ipMasq": true,
            "promiscMode": true,
            "ipam": {
              "type": "host-local",
              "ranges": [
                [{
                  "subnet": "{{ pod_subnet_cidr_v4 }}"
                }]
              ],
              "routes": [
                { "dst": "0.0.0.0/0" }
              ]
            }
          },
          {
            "type": "portmap",
            "capabilities": {"portMappings": true}
          }
        ]
      }
    dest: /etc/cni/net.d/10-containerd-net.conflist
    owner: root
    group: root
    mode: "0640"
  tags:
    - never
- name: Ensure containerd directory exists
  ansible.builtin.file:
    path: /etc/containerd
    state: directory
    owner: root
    group: root
    mode: "0755"
- name: Get containerd default config
  ansible.builtin.command: containerd config default
  changed_when: false
  register: containerd_default_config
  tags:
    - config
- name: Configure containerd
  ansible.builtin.copy:
    content: "{{ containerd_default_config.stdout }}"
    dest: /etc/containerd/config.toml
    owner: root
    group: root
    mode: "0640"
  tags:
    - config
  notify: Restart containerd

The CNI config for containerd is documented in their repository, if you feel curious [3].

Worker Role Default Variables
worker/defaults/main.yml
---
cluster_cidr: 10.200.0.0/16
cluster_dns: 10.32.0.10
cluster_domain: cluster.local
cni_plugins_checksum_url: https://github.com/containernetworking/plugins/releases/download/v1.4.0/cni-plugins-linux-amd64-v1.4.0.tgz.sha256
cni_plugins_url: https://github.com/containernetworking/plugins/releases/download/v1.4.0/cni-plugins-linux-amd64-v1.4.0.tgz
containerd_checksum_url: https://github.com/containerd/containerd/releases/download/v1.7.13/containerd-1.7.13-linux-amd64.tar.gz.sha256sum
containerd_service_url: https://github.com/containerd/containerd/raw/v1.7.13/containerd.service
containerd_url: https://github.com/containerd/containerd/releases/download/v1.7.13/containerd-1.7.13-linux-amd64.tar.gz
crictl_checksum_url: https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz.sha256
crictl_url: https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
kubectl_checksum_url: https://dl.k8s.io/release/{{ k8s_version }}/bin/linux/amd64/kubectl.sha256
kubectl_url: https://dl.k8s.io/release/{{ k8s_version }}/bin/linux/amd64/kubectl
kubelet_checksum_url: https://dl.k8s.io/{{ k8s_version }}/bin/linux/amd64/kubelet.sha256
kubelet_config_path: /var/lib/kubelet/config.yml
kubelet_url: https://dl.k8s.io/release/{{ k8s_version }}/bin/linux/amd64/kubelet
pod_subnet_cidr_v4: 10.88.0.0/16
pod_subnet_cidr_v6: 2001:4860:4860::/64
runc_url: https://github.com/opencontainers/runc/releases/download/v1.1.12/runc.amd64
runc_checksum: aadeef400b8f05645768c1476d1023f7875b78f52c7ff1967a6dbce236b8cbd8
worker/tasks/containerd.yml
- name: Download containerd
  ansible.builtin.get_url:
    url: "{{ containerd_url }}"
    dest: "{{ downloads_dir }}/{{ containerd_url | basename }}"
    checksum: "sha256:{{ lookup('url', containerd_checksum_url) | split | first }}"
    mode: "0444"
  register: containerd_download
  tags:
    - download
- name: Create /tmp/containerd directory
  ansible.builtin.file:
    path: /tmp/containerd
    state: directory
    mode: "0755"
- name: Extract containerd
  ansible.builtin.unarchive:
    src: "{{ containerd_download.dest }}"
    dest: /tmp/containerd
    mode: "0755"
    remote_src: true
- name: Glob files in unarchived bin directory
  ansible.builtin.find:
    paths: /tmp/containerd
    file_type: file
    recurse: true
    mode: "0755"
  register: containerd_bin_files
- name: Install containerd
  ansible.builtin.copy:
    src: "{{ item }}"
    dest: /usr/local/bin/
    owner: root
    group: root
    mode: "0755"
    remote_src: true
  loop: "{{ containerd_bin_files.files | map(attribute='path') | list }}"
- name: Download containerd service
  ansible.builtin.get_url:
    url: "{{ containerd_service_url }}"
    dest: /etc/systemd/system/{{ containerd_service_url | basename }}
    mode: "0644"
    owner: root
    group: root
  tags:
    - download
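
The checksum expression in the download task takes the first whitespace-separated token of the published checksum file and lets get_url compare it with the SHA-256 of the downloaded file. The same check, sketched in Python with stand-in data instead of a real download (the function names are hypothetical):

```python
import hashlib


def expected_digest(checksum_file_text):
    """First whitespace-separated token, like `| split | first` in the task."""
    return checksum_file_text.split()[0]


def verify_sha256(payload, checksum_file_text):
    """Compare the SHA-256 of the payload with the published digest."""
    return hashlib.sha256(payload).hexdigest() == expected_digest(checksum_file_text)


# Stand-in data; a real run would use the tarball and its .sha256sum file.
payload = b"pretend this is a tarball"
checksum_file_text = f"{hashlib.sha256(payload).hexdigest()}  containerd.tar.gz\n"
print(verify_sha256(payload, checksum_file_text))
print(verify_sha256(b"tampered", checksum_file_text))
```

The first call succeeds and the second fails, which is exactly the situation in which the Ansible task would abort instead of installing a corrupted or altered binary.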

Step 8: CoreDNS & Cilium

The last step is straightforward.

We plan to run CoreDNS as a Kubernetes Deployment with affinity, and install Cilium using its CLI.

coredns/tasks/main.yml
---
- name: Remove CoreDNS as static pod
  ansible.builtin.file:
    path: "{{ k8s_static_pods_dir }}/coredns.yml"
    state: absent
- name: Slurp CoreDNS TLS certificate
  ansible.builtin.slurp:
    src: /etc/kubernetes/pki/coredns.crt
  register: coredns_cert
- name: Slurp CoreDNS TLS key
  ansible.builtin.slurp:
    src: /etc/kubernetes/pki/coredns.key
  register: coredns_key
- name: Slurp CoreDNS CA certificate
  ansible.builtin.slurp:
    src: /etc/kubernetes/pki/ca.crt
  register: coredns_ca
- name: Apply CoreDNS manifest
  kubernetes.core.k8s:
    definition: "{{ lookup('template', 'manifests.yml.j2') | from_yaml }}"
    state: present

Notice the slurp tasks: their results will be used to pass the TLS certificates to the CoreDNS instance.
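
The reason slurp fits so well here is that it returns the file content base64-encoded, which is also the encoding a Kubernetes Secret expects under its data field. A small sketch of that round trip, with a placeholder certificate and a helper that only mimics the shape of the slurp result:

```python
import base64


def slurp(raw_bytes):
    """Mimic the shape of a slurp result: file content as base64 text."""
    return {
        "content": base64.b64encode(raw_bytes).decode("ascii"),
        "encoding": "base64",
    }


# Placeholder PEM text; the real tasks read files under /etc/kubernetes/pki.
coredns_cert = slurp(b"-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n")
secret_data = {"tls.crt": coredns_cert["content"]}

# Decoding the Secret value yields the original file again.
print(base64.b64decode(secret_data["tls.crt"]).decode("ascii").splitlines()[0])
```

No extra b64encode filter is needed in the template; the registered content can be dropped into the Secret as it is.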

CoreDNS Kubernetes Manifests
coredns/templates/manifests.yml.j2
apiVersion: v1
kind: List
items:
  - apiVersion: v1
    data:
      Corefile: |
        .:53 {
            errors
            health {
                lameduck 5s
            }
            ready
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                endpoint https://{{ host_ip }}:{{ apiserver_port }}
                tls /cert/coredns.crt /cert/coredns.key /cert/ca.crt
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus :9153
            forward . /etc/resolv.conf {
              max_concurrent 1000
            }
            cache 30
            loop
            reload
            loadbalance
        }
    kind: ConfigMap
    metadata:
      name: coredns-config
      namespace: kube-system
  - apiVersion: v1
    kind: Secret
    metadata:
      name: coredns-tls
      namespace: kube-system
    data:
      tls.crt: "{{ coredns_cert.content }}"
      tls.key: "{{ coredns_key.content }}"
      ca.crt: "{{ coredns_ca.content }}"
  - apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: coredns
      namespace: kube-system
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: coredns
    rules:
      - apiGroups:
          - ""
        resources:
          - endpoints
          - services
          - pods
          - namespaces
        verbs:
          - list
          - watch
      - apiGroups:
          - discovery.k8s.io
        resources:
          - endpointslices
        verbs:
          - list
          - watch
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: coredns
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: coredns
    subjects:
      - kind: ServiceAccount
        name: coredns
        namespace: kube-system
  - apiVersion: v1
    kind: Service
    metadata:
      annotations:
        prometheus.io/port: "9153"
        prometheus.io/scrape: "true"
      labels:
        k8s-app: kube-dns
      name: kube-dns
      namespace: kube-system
    spec:
      ports:
        - name: dns
          port: 53
          protocol: UDP
          targetPort: 53
        - name: dns-tcp
          port: 53
          protocol: TCP
          targetPort: 53
        - name: metrics
          port: 9153
          protocol: TCP
          targetPort: 9153
      selector:
        k8s-app: coredns
      type: ClusterIP
  - apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: coredns
      namespace: kube-system
    spec:
      progressDeadlineSeconds: 600
      replicas: 2
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          k8s-app: coredns
      strategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          labels:
            k8s-app: coredns
        spec:
          containers:
            - args:
                - -conf
                - /etc/coredns/Corefile
              image: coredns/coredns:1.11.1
              imagePullPolicy: IfNotPresent
              livenessProbe:
                failureThreshold: 5
                httpGet:
                  path: /health
                  port: 8080
                  scheme: HTTP
                initialDelaySeconds: 60
                successThreshold: 1
                timeoutSeconds: 5
              name: coredns
              ports:
                - containerPort: 53
                  name: dns
                  protocol: UDP
                - containerPort: 53
                  name: dns-tcp
                  protocol: TCP
                - containerPort: 9153
                  name: metrics
                  protocol: TCP
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8181
                  scheme: HTTP
              resources:
                limits:
                  memory: 170Mi
                requests:
                  cpu: 100m
                  memory: 70Mi
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  add:
                    - NET_BIND_SERVICE
                  drop:
                    - all
                readOnlyRootFilesystem: true
              volumeMounts:
                - mountPath: /etc/coredns
                  name: config-volume
                  readOnly: true
                - mountPath: /cert
                  name: coredns-tls
                  readOnly: true
          dnsPolicy: Default
          nodeSelector:
            kubernetes.io/os: linux
          priorityClassName: system-cluster-critical
          serviceAccountName: coredns
          tolerations:
            - key: CriticalAddonsOnly
              operator: Exists
            - effect: NoSchedule
              key: node-role.kubernetes.io/control-plane
          volumes:
            - configMap:
                items:
                  - key: Corefile
                    path: Corefile
                name: coredns-config
              name: config-volume
            - secret:
                defaultMode: 0444
                items:
                  - key: tls.crt
                    path: coredns.crt
                  - key: tls.key
                    path: coredns.key
                  - key: ca.crt
                    path: ca.crt
                secretName: coredns-tls
              name: coredns-tls

And finally, Cilium.

cilium/tasks/main.yml
- name: Download cilium-cli
  ansible.builtin.get_url:
    url: "{{ cilium_cli_url }}"
    dest: "{{ downloads_dir }}/{{ cilium_cli_url | basename }}"
    owner: root
    group: root
    mode: "0644"
    checksum: "sha256:{{ cilium_cli_checksum }}"
  register: cilium_cli_download
- name: Extract cilium bin to /usr/local/bin
  ansible.builtin.unarchive:
    src: "{{ cilium_cli_download.dest }}"
    dest: /usr/local/bin/
    remote_src: true
    owner: root
    group: root
    mode: "0755"
    extra_opts:
      - cilium
- name: Install cilium
  ansible.builtin.command: cilium install
  failed_when: false
Cilium Role Default Variables
cilium/defaults/main.yml
---
cilium_cli_url: https://github.com/cilium/cilium-cli/releases/download/v0.15.23/cilium-linux-amd64.tar.gz
cilium_cli_checksum: cda3f1c40ae2191a250a7cea9e2c3987eaa81cb657dda54cd8ce25f856c384da

That's it. Believe it or not, the Kubernetes cluster is now ready and if you run the following command, you will see three nodes in the Ready state.

export KUBECONFIG=share/admin.yml # KubeConfig generated in step 3
kubectl get nodes
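
If you would rather check readiness from a script than by eye, the JSON form of the same command (kubectl get nodes -o json) can be inspected. The sketch below works on a trimmed, made-up sample in that shape; the function name is hypothetical.

```python
import json


def not_ready_nodes(node_list):
    """Names of nodes whose Ready condition is anything other than "True"."""
    pending = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(node["metadata"]["name"])
    return pending


# Made-up sample; only the fields used above are included.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "node0"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "node1"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
""")
print(not_ready_nodes(sample))
```

An empty list means every node reports Ready, which is the state you should see once Cilium has finished rolling out.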

How to run it?

If you clone the repository, you would only need vagrant up to build everything from scratch. It will take some time for all the components to be up and ready, but it will set things up without any further manual intervention.

Conclusion

This task took me a lot of time to get right. I had to go through a lot of iterations to make it work. One of the most time-consuming parts was how the etcd cluster was misbehaving, leading to the Kubernetes API server hitting timeout errors and being inaccessible for the rest of the cluster's components.

I learned a lot from this challenge. I learned how to write efficient Ansible playbooks, how to create the right mental model for the target host where Ansible executes a command, how to deal with all those TLS certificates, and overall, how to set up a Kubernetes cluster from scratch.

I couldn't be happier reaching the final result, having spent countless hours debugging and banging my head against the wall.

I recommend that everyone give the challenge a try. You never know how much you don't know about the inner workings of Kubernetes until you try to set it up from scratch.

Thanks for reading so far. I hope you enjoyed the journey as much as I did 🤗.

Source Code

As mentioned before, you can find the source code for this challenge on the GitHub repository [1].

FAQ

Why Cilium?

Cilium has emerged as a cloud-native CNI tool that happens to have a lot of the features and characteristics of a production-grade CNI. Performance, security, and observability are the top ones, to name a few. I have used Linkerd in the past, but I am using Cilium for all of the current and upcoming projects I am working on. It will continue to prove itself as a great CNI for Kubernetes clusters.

Why use Vagrant?

I'm cheap 😁 and I don't want to pay for cloud resources, even for learning purposes. I have an active subscription to O'Reilly and A Cloud Guru and I would've gone for their sandboxes, but I initially started this challenge just with Vagrant and I resisted the urge to change that, even after countless hours were spent on the terrible network performance of the VirtualBox VMs 🤷.