Kubernetes The Hard Way¶
You may have solved this challenge long before I got around to it. Still, I always wanted to go through the process myself, as it has many angles, and learning the details intrigues me.
This version, however, does not use any cloud provider. Specifically, the things I am using differently from the original challenge are:
- Vagrant & VirtualBox: For the nodes of the cluster
- Ansible: For configuring everything until the cluster is ready
- Cilium: For the network CNI and as a replacement for the kube-proxy
So, here is my story and how I solved the famous Kubernetes The Hard Way by the great Kelsey Hightower. Stick around if you're interested in the details.
Introduction¶
Kubernetes the Hard Way is a great exercise for any system administrator who wants to get into the nitty-gritty of Kubernetes, figure out how the different components work together, and understand what makes it tick.
If you have only ever used a managed Kubernetes cluster, or spun one up with kubeadm
, this is your chance to really understand the inner workings of Kubernetes. Those tools abstract a lot of the details away from you, which doesn't help if you have a knack for the implementation details.
Objective¶
The whole point of this exercise is to build a Kubernetes cluster from scratch, downloading the binaries, issuing and passing the certificates to the different components, configuring the network CNI, and finally, having a working Kubernetes cluster.
With that introduction, let's get started.
Prerequisites¶
First things first, let's make sure all the necessary tools are installed on our system before we start.
Tools¶
All the tools mentioned below are the latest versions at the time of writing, February 2024.
Tool | Version | Link |
---|---|---|
Ansible | 2.16 | https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html |
Vagrant | 2.4 | https://www.vagrantup.com/docs/installation |
VirtualBox | 7 | https://www.virtualbox.org/wiki/Downloads |
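If you want to confirm the tools are in place before going further, a quick version check on the host is enough; the exact output will of course differ on your machine:

ansible --version      # expect a 2.16.x release
vagrant --version      # expect a 2.4.x release
VBoxManage --version   # expect a 7.x release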
Alright, with the tools installed, it's time to get our hands dirty and really get into it.
The Vagrantfile¶
Info
The Vagrantfile
configuration language is a Ruby DSL. If you are not a Ruby developer, fret not, as I'm not either. I just know enough to get by.
The initial step is to have three nodes up and running for the Kubernetes cluster, and one for the Load Balancer used in front of the API server. We will be using Vagrant on top of VirtualBox to create all these nodes.
These will be Virtual Machines hosted on your local machine. As such, there is no cloud provider needed in this version of the challenge and all the configurations are done locally.
The configuration of our Vagrantfile is shown below.
box = "ubuntu/jammy64"
N = 2
common_script = <<~SHELL
export DEBIAN_FRONTEND=noninteractive
sudo apt update
sudo apt upgrade -y
SHELL
Vagrant.configure("2") do |config|
config.vm.define "lb" do |node|
node.vm.box = box
node.vm.network :private_network, ip: "192.168.56.100", hostname: true
node.vm.network "forwarded_port", guest: 6443, host: 6443
node.vm.hostname = "lb.local"
node.vm.provider "virtualbox" do |vb|
vb.name = "k8s-the-hard-way-lb"
vb.memory = "1024"
vb.cpus = 1
vb.linked_clone = true
end
node.vm.synced_folder "share/dl", "/downloads", create: true
node.vm.provision "shell", inline: common_script
node.vm.provision "ansible" do |ansible|
ansible.verbose = "vv"
ansible.playbook = "bootstrap.yml"
ansible.compatibility_mode = "2.0"
end
end
(0..N).each do |machine_id|
config.vm.define "node#{machine_id}" do |node|
node.vm.box = box
node.vm.hostname = "node#{machine_id}.local"
node.vm.network :private_network, ip: "192.168.56.#{machine_id+2}", hostname: true
node.vm.provider "virtualbox" do |vb|
vb.name = "k8s-the-hard-way-node#{machine_id}"
vb.memory = "1024"
vb.cpus = 1
vb.linked_clone = true
end
# To hold the downloaded items and survive VM restarts
node.vm.synced_folder "share/dl", "/downloads", create: true
node.vm.provision "shell", inline: common_script
if machine_id == N
node.vm.provision :ansible do |ansible|
ansible.limit = "all"
ansible.verbose = "vv"
ansible.playbook = "bootstrap.yml"
end
end
end
end
end
Private Network Configuration¶
There are a couple of important notes worth mentioning about this config, highlighted in the snippet above and in the following list.
The network configuration, as you can see above, is a private network with hard-coded IP addresses. This is not a hard requirement, but it makes many of the upcoming assumptions much easier.
Dynamic IP addresses would need more careful handling when it comes to configuring the nodes, their TLS certificates, and how they communicate overall.
Avoiding that kind of craziness in this challenge is a sure way not to go down the rabbit hole of despair.
Load Balancer Port Forwarding¶
For some reason, I wasn't able to directly call 192.168.56.100:6443
, the address and port HAProxy listens on. It is reachable from within the Vagrant VMs, but not from the host machine.
Using firewall techniques such as ufw
only made things worse; I locked myself out of the VM. I now know I should have allowed SSH access first, but that's behind me.
With port-forwarding configured, I was able to call localhost:6443
from my machine and reach HAProxy directly.
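A quick way to sanity-check the forwarded port from the host is a TLS probe against the API server through HAProxy. This is a rough sketch, assuming the CA and admin credentials end up in the project's share/ directory as in this setup; adjust paths to your layout, and if your API server certificate doesn't cover localhost, add -k for a throwaway smoke test:

# from the host, through the Vagrant forwarded port and HAProxy
curl --cacert share/ca.crt \
     --cert share/admin.crt \
     --key share/admin.key \
     https://localhost:6443/version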
On Vagrant Networking
In general, I ran into many networking issues while working on this challenge. For some reason, the download speed inside the VMs was terrible (search the web and you'll see I'm not the only one complaining). That's the main reason for mounting the same download directory into all the VMs: to stop re-downloading everything every time the Ansible playbook runs.
CPU and Memory Allocation¶
While not strictly required, I found it beneficial to constrain the CPU and memory usage of the VMs. This ensures that no extra resources are used.
Frankly speaking, they shouldn't even need to go beyond this. This is an empty cluster with just the control-plane components; no heavy workload is running on it.
Mounting the Download Directory¶
The download directory is mounted into all the VMs to avoid re-downloading the binaries from the internet every time the playbook runs, whether after a machine restart or when starting over from scratch.
The trick, however, is that with Ansible's get_url
, as you'll see shortly, you have to specify the absolute path of the destination file to benefit from this optimization; specifying only a directory will re-download the file every time.
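To make the difference concrete, here is a minimal, hypothetical pair of tasks reusing the variables from this setup; only the first one benefits from the file cached in the synced folder:

# Re-uses the file if it already exists at this exact path
- name: Download etcd (cache-friendly)
  ansible.builtin.get_url:
    url: "{{ etcd_download_url }}"
    dest: "{{ downloads_dir }}/{{ etcd_download_url | basename }}"
    mode: "0644"

# With a directory as dest, the file is downloaded again on every run,
# even if the contents end up unchanged
- name: Download etcd (re-downloads every time)
  ansible.builtin.get_url:
    url: "{{ etcd_download_url }}/"
    dest: "{{ downloads_dir }}/"
    mode: "0644"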
Ansible Provisioner¶
node.vm.provision "ansible" do |ansible|
ansible.verbose = "vv"
ansible.playbook = "bootstrap.yml"
ansible.compatibility_mode = "2.0"
end
if machine_id == N
node.vm.provision :ansible do |ansible|
ansible.limit = "all"
ansible.verbose = "vv"
ansible.playbook = "bootstrap.yml"
end
end
The last and most important part of the Vagrantfile
is the Ansible provisioner section which, as you can see, is used for the Load Balancer VM as well as the three nodes of the Kubernetes cluster.
The difference, however, is that for the Kubernetes nodes we want the playbook to run against all of them at the same time, to benefit from Ansible's parallel execution. That's why the provisioner block only fires on the last node (machine_id == N) and sets ansible.limit = "all". The alternative would be to spin up the nodes one by one and run the playbook on each of them, which is less efficient and takes more time.
Ansible Playbook¶
After provisioning the VMs, it's time to take a look at what the Ansible playbook does to configure the nodes and the Load Balancer.
The main configuration of the entire Kubernetes cluster is done via this playbook and, as such, you can expect a hefty amount of configuration to happen here.
First, let's take a look at the playbook itself to get a feeling of what to expect.
- name: Configure the Load Balancer
  hosts: lb
  become: true
  gather_facts: true
  vars_files:
    - vars/apt.yml
    - vars/k8s.yml
    - vars/lb.yml
    - vars/etcd.yml
  roles:
    - prerequisites
    - haproxy
    - etcd-gateway

- name: Bootstrap the Kubernetes Cluster
  hosts:
    - node0
    - node1
    - node2
  become: true
  gather_facts: true
  vars_files:
    - vars/tls.yml
    - vars/apt.yml
    - vars/lb.yml
    - vars/k8s.yml
    - vars/etcd.yml
  environment:
    KUBECONFIG: /var/lib/kubernetes/admin.kubeconfig
  roles:
    - prerequisites
    - role: tls-ca
      run_once: true
    - tls
    - kubeconfig
    - encryption
    - etcd
    - k8s
    - worker
    - role: coredns
      run_once: true
    - role: cilium
      run_once: true
Notice that there are two plays in this playbook: one for the Load Balancer and one for the Kubernetes nodes. This distinction is important because not all the configuration is the same for all the VMs; that's the logic behind having different hosts
in each.
Another important highlight is that the Load Balancer is configured first, simply because it is the entrypoint for our Kubernetes API server and we need it ready before the upstream servers.
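The haproxy role itself is not reproduced in this post, but its core is a TCP frontend on port 6443 that proxies to the three API servers. A rough sketch of what the rendered haproxy.cfg could look like is below; the section names and health-check options are illustrative, not the exact template from the repository:

frontend kube_apiserver
    bind *:6443
    mode tcp
    option tcplog
    default_backend kube_apiserver_nodes

backend kube_apiserver_nodes
    mode tcp
    balance roundrobin
    option tcp-check
    server node0 192.168.56.2:6443 check
    server node1 192.168.56.3:6443 check
    server node2 192.168.56.4:6443 check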
Directory Layout¶
This playbook you see above is at the root of our directory structure, right next to all the roles you see included in the roles
section.
To get a better understanding, here's what the directory structure looks like:
.
├── ansible.cfg
├── bootstrap.yml
├── cilium/
├── coredns/
├── encryption/
├── etcd/
├── etcd-gateway/
├── haproxy/
├── k8s/
├── kubeconfig/
├── prerequisites/
├── tls/
├── tls-ca/
├── Vagrantfile
├── vars/
└── worker/
Besides the playbook itself and the Vagrantfile
, the other pieces are Ansible roles, each initialized with the ansible-galaxy init <role-name>
command and modified as per the specification in the original challenge.
We will take a closer look at each role shortly.
Ansible Configuration¶
Before jumping into the roles, one last important piece of information is the ansible.cfg
file, which holds the modifications we make to Ansible's default behavior.
The content is as below.
[defaults]
inventory=.vagrant/provisioners/ansible/inventory/
become=false
log_path=/tmp/ansible.log
gather_facts=false
host_key_checking=false
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400
[inventory]
enable_plugins = 'host_list', 'script', 'auto', 'yaml', 'ini', 'toml'
cache = yes
cache_connection = /tmp/ansible_inventory
Fact caching is important because it makes subsequent playbook runs noticeably faster.
Let's Begin the Real Work¶
So far, we've only been preparing everything. The remainder of this post will focus on the real challenge itself, creating the pieces and components that make up the Kubernetes cluster, one by one.
For the sake of brevity, and because I don't want this blog post to be too long, or worse, broken into multiple parts, I will only highlight the most important tasks, leave out near-duplicates, and generally stick to the core aspects of the task at hand. You are more than welcome to visit the source code1 and dig deeper for yourself.
Step 0: Prerequisites¶
Here we enable IP forwarding, create the necessary directories used by later steps, and optionally add DNS records for the nodes to every node's /etc/hosts
file.
The important lines are highlighted in the snippet below.
---
- name: Enable IP Forwarding permanently
  ansible.builtin.copy:
    content: net.ipv4.ip_forward = 1
    dest: /etc/sysctl.d/20-ipforward.conf
    mode: "0644"
  notify: Reload sysctl
- name: Ensure python3-pip package is installed
  ansible.builtin.apt:
    name: python3-pip
    state: present
    update_cache: true
    cache_valid_time: "{{ cache_valid_time }}"
- name: Add current user to adm group
  ansible.builtin.user:
    name: "{{ ansible_user }}"
    groups: adm
    append: true
- name: Ensure CA directory exists
  ansible.builtin.file:
    path: /etc/kubernetes/pki
    state: directory
    owner: root
    group: root
    mode: "0700"
- name: Create private-network DNS records
  ansible.builtin.lineinfile:
    path: /etc/hosts
    line: "{{ item }}"
    state: present
  with_items:
    - 192.168.56.2 node0 etcd-node0
    - 192.168.56.3 node1 etcd-node1
    - 192.168.56.4 node2 etcd-node2
  tags:
    - dns
As you can see, after the sysctl
modification we notify our handler to reload and re-read the sysctl
configuration. The handler definition is as below.
---
- name: Reload sysctl
  ansible.builtin.shell: sysctl --system
  changed_when: false
Ansible Linting
When you're working with Ansible, I highly recommend using ansible-lint
2, as it helps you refine your playbooks much faster during development. It's not just about tidiness; some of its recommendations really matter for other reasons, such as security and performance.
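Running it is as simple as pointing it at the playbook from the project root:

# run from the directory containing bootstrap.yml and ansible.cfg
ansible-lint bootstrap.yml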
Step 1: TLS CA certificate¶
For all the workloads we will deploy, we need a CA to sign TLS certificates for us. If you're not a TLS expert, just know two main things:
- TLS enforces encrypted and secure communication between the parties (client and server in this case, but it can also be peers). You can try it out for yourself and sniff some of the traffic with Wireshark to see that none of the data is readable; it is only decipherable by the parties involved in the communication.
- At least in the case of Kubernetes, TLS certificates are used for authentication and authorization of the different components and users. In effect, this means that if a TLS certificate was signed by the trusted CA of the cluster, and the subject of that certificate has elevated privileges, then that subject can send corresponding requests to the API server and no further authentication is needed (see the quick check right after this list).
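To see the identity a certificate carries, openssl can print its subject; as far as the API server is concerned, the CN becomes the username and the O fields become the groups. For example, run against the admin certificate generated later in this post (path assumed from this setup, and output formatting varies by openssl version):

openssl x509 -in share/admin.crt -noout -subject
# e.g. subject=O = system:masters, OU = Kubernetes The Hard Way, CN = admin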
The TLS key and certificate generations are a pain in the butt IMO. But, with the power of Ansible, we take a lot of the pain away, as you can see in the snippet below.
---
- name: Generate an empty temp file for CSR
  ansible.builtin.file:
    path: /tmp/ca.csr
    state: touch
    owner: root
    group: root
    mode: "0400"
  register: temp_file
- name: Generate CA private key
  community.crypto.openssl_privatekey:
    path: /vagrant/share/ca.key
    type: "{{ k8s_privatekey_type }}"
    state: present
- name: Generate CA CSR to provide ALT names and other options
  community.crypto.openssl_csr:
    basicConstraints_critical: true
    basic_constraints:
      - CA:TRUE
    common_name: kubernetes-ca
    keyUsage_critical: true
    key_usage:
      - keyCertSign
      - cRLSign
    path: "{{ temp_file.dest }}"
    privatekey_path: /vagrant/share/ca.key
    state: present
- name: Generate CA certificate
  community.crypto.x509_certificate:
    path: /vagrant/share/ca.crt
    privatekey_path: /vagrant/share/ca.key
    csr_path: "{{ temp_file.dest }}"
    provider: selfsigned
    state: present
- name: Copy cert to kubernetes PKI dir
  ansible.builtin.copy:
    src: "{{ item }}"
    dest: /etc/kubernetes/pki/
    remote_src: true
    owner: root
    group: root
    mode: "0400"
  loop:
    - /vagrant/share/ca.crt
    - /vagrant/share/ca.key
I'm not a TLS expert, but from my understanding, the most important part of the CSR creation is the CA:TRUE
flag. I actually don't even know whether all of the constraints and usages are needed, used, or respected by every tool!
Also, the provider: selfsigned
, self-explanatory as it is, instructs the module to create a new root CA certificate rather than a subordinate one.
Lastly, we copy both the CA key and its certificate to a shared directory that will be used by all the other components when generating their own certificate.
Etcd CA
We could use the same CA for etcd
communications as well, but I decided to separate them out to make sure no component other than the API server and the peers of etcd
will be allowed to send any requests to the etcd
server.
In the same Ansible role, we also generate a key and certificate for the admin/operator of the cluster. In this case, that's me, the person provisioning and configuring the cluster.
The idea is that we will not use the TLS certificates of other components to talk to the API server, but rather the one explicitly created for this purpose.
Here's what it will look like:
- name: Generate an empty temp file for CSR
  ansible.builtin.file:
    path: /tmp/admin.csr
    state: touch
    owner: root
    group: root
    mode: "0400"
  register: temp_file
- name: Generate Admin Operator private key
  community.crypto.openssl_privatekey:
    path: /vagrant/share/admin.key
    type: "{{ k8s_privatekey_type }}"
- name: Generate Admin Operator CSR
  community.crypto.openssl_csr:
    path: "{{ temp_file.dest }}"
    privatekey_path: /vagrant/share/admin.key
    common_name: 'admin'
    subject:
      O: 'system:masters'
      OU: 'Kubernetes The Hard Way'
- name: Create Admin Operator TLS certificate using CA key and cert
  community.crypto.x509_certificate:
    path: /vagrant/share/admin.crt
    csr_path: "{{ temp_file.dest }}"
    privatekey_path: /vagrant/share/admin.key
    ownca_path: /vagrant/share/ca.crt
    ownca_privatekey_path: /vagrant/share/ca.key
    provider: ownca
    ownca_not_after: +365d
The subject
in the CSR task above is the group system:masters
. Inside the Kubernetes cluster, this group has the highest privileges: no RBAC rules are required for its requests, as everything is granted by default.
As for the certificate creation, notice that we use the ownca
provider and pass in the same CA key and certificate we created in the previous step.
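If you want to convince yourself of what system:masters buys you once the cluster is up, ask the API server directly using the admin kubeconfig this setup writes (run as root on one of the nodes):

sudo kubectl --kubeconfig /var/lib/kubernetes/admin.kubeconfig auth can-i '*' '*'
# expected answer: yes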
TLS CA Execution¶
To wrap this step up, two important things worth mentioning are:
- Both the snippets mentioned here and the ones not shown will be imported into the root of the Ansible role by the following tls-ca/tasks/main.yml:

  - name: CA
    ansible.builtin.import_tasks:
      file: ca.yml
  - name: Etcd CA
    ansible.builtin.import_tasks:
      file: etcd-ca.yml
  - name: Etcd Admin
    ansible.builtin.import_tasks:
      file: etcd-admin.yml
  - name: Create admin operator TLS certificate
    ansible.builtin.import_tasks:
      file: admin.yml

- You might have noticed in the bootstrap.yml
root playbook that this CA role runs only once, on the first inventory host that gets to this point. This ensures we don't burn extra CPU or overwrite the existing CA key and certificate. Some of our other roles are designed the same way, e.g., the installation of cilium
is another one of those cases.
Step 2: TLS Certificates for Kubernetes Components¶
We need to generate eight certificates in total for the Kubernetes components, but we'll cover only one here: the most important one, the API server certificate.
All the others are similar, with minor tweaks at most.
Let's first take a look at what the Ansible role will look like:
- name: Generate API Server private key
  community.crypto.openssl_privatekey:
    path: /etc/kubernetes/pki/kube-apiserver.key
    type: "{{ k8s_privatekey_type }}"
- name: Generate API Server CSR
  community.crypto.openssl_csr:
    basicConstraints_critical: true
    basic_constraints:
      - CA:FALSE
    common_name: kube-apiserver
    extKeyUsage_critical: false
    extended_key_usage:
      - clientAuth
      - serverAuth
    keyUsage:
      - keyEncipherment
      - dataEncipherment
    keyUsage_critical: true
    path: /etc/kubernetes/pki/kube-apiserver.csr
    privatekey_path: /etc/kubernetes/pki/kube-apiserver.key
    subject:
      O: system:kubernetes
      OU: Kubernetes The Hard Way
    subject_alt_name: "{{ lookup('template', 'apiserver-alt-names.yml.j2') | from_yaml }}"
- name: Create API Server TLS certificate using CA key and cert
  community.crypto.x509_certificate:
    path: /etc/kubernetes/pki/kube-apiserver.crt
    csr_path: /etc/kubernetes/pki/kube-apiserver.csr
    privatekey_path: /etc/kubernetes/pki/kube-apiserver.key
    ownca_path: /vagrant/share/ca.crt
    ownca_privatekey_path: /vagrant/share/ca.key
    ownca_not_after: +365d
    provider: ownca
From the highlights in the snippet above, you can see at least three important pieces of information:
- This certificate is not for a CA: CA:FALSE.
- The subject is in the system:kubernetes group. This is really just an identifier and serves no special purpose.
- The same properties as in admin.yml were used to generate the TLS certificate, namely the provider and all the ownca_* properties.
Step 3: KubeConfig Files¶
In this step, for every component that will talk to the API server, we will create a KubeConfig file, specifying the server address, the CA certificate, and the key and certificate of the client.
The format of the KubeConfig is the same as you have in your filesystem under ~/.kube/config
. That, for the purpose of our cluster, will be a Jinja2 template that will take the variables we just mentioned.
Here's what that Jinja2 template will look like:
apiVersion: v1
clusters:
  - cluster:
      certificate-authority: {{ ca_cert_path }}
      server: https://{{ kube_apiserver_address }}:{{ kube_apiserver_port }}
    name: {{ kube_context }}
contexts:
  - context:
      cluster: {{ kube_context }}
      user: {{ kube_context }}
    name: {{ kube_context }}
current-context: {{ kube_context }}
kind: Config
preferences: {}
users:
  - name: {{ kube_context }}
    user:
      client-certificate: {{ client_cert_path }}
      client-key: {{ client_key_path }}
And with that template, we can generate a KubeConfig for each component. Here is an example that creates one for the kubelet.
---
- name: Generate KubeConfig for kubelet
  ansible.builtin.template:
    src: kubeconfig.yml.j2
    dest: /var/lib/kubernetes/kubelet.kubeconfig
    mode: "0640"
    owner: root
    group: root
  vars:
    kube_apiserver_address: localhost
    kube_apiserver_port: 6443
    client_cert_path: /etc/kubernetes/pki/kubelet.crt
    client_key_path: /etc/kubernetes/pki/kubelet.key
Kubernetes API Server
The current setup deploys three API servers, one on each of the three VM nodes. That means on any node we have localhost
access to the control plane, provided the key and certificate are passed correctly.
As you'll notice, the kubelet talks to localhost:6443
. A better alternative would be to talk to the Load Balancer, in case one of the API servers goes down. But this is an educational setup, not a production one!
The values that are not passed directly via the vars
property come from the role's default variables:
---
ca_cert_path: /etc/kubernetes/pki/ca.crt
kube_context: kubernetes-the-hard-way
kube_apiserver_address: "{{ load_balancer_ip }}"
kube_apiserver_port: "{{ load_balancer_port }}"
They can also be passed from a parent of the role, or from the vars files given to the playbook.
Ansible Variables
As you may have noticed, inside every vars
file, we can use the values from other variables. That's one of the many things that make Ansible so powerful!
Step 4: Encryption Configuration¶
The objective of this step is to create an encryption key that will be used to encrypt and decrypt the Kubernetes Secrets stored in the etcd
database.
For this task, we use one template, and a set of Ansible tasks.
kind: EncryptionConfig
apiVersion: v1
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: {{ key_name }}
              secret: {{ key_secret }}
      - identity: {}
---
- name: Read the contents of /vagrant/share/encryption-secret
  ansible.builtin.slurp:
    src: '/vagrant/share/encryption-secret'
  register: encryption_secret_file
  failed_when: false
- name: Generate random string
  ansible.builtin.set_fact:
    key_secret: "{{ lookup('ansible.builtin.password', '/dev/null length=32 chars=ascii_letters,digits,special_characters') }}"
  no_log: true
  when: key_secret is not defined
- name: Ensure key_secret is populated
  when: encryption_secret_file.content is not defined
  block:
    - name: Write secret to file
      ansible.builtin.copy:
        content: '{{ key_secret }}'
        dest: '/vagrant/share/encryption-secret'
        mode: '0400'
- name: Read existing key_secret
  ansible.builtin.set_fact:
    key_secret: "{{ encryption_secret_file.content }}"
  no_log: true
  when: encryption_secret_file.content is defined
- name: Create encryption config
  ansible.builtin.template:
    src: config.yml.j2
    dest: /etc/kubernetes/encryption-config.yml
    mode: '0400'
    owner: root
    group: root
  no_log: true
As you can see in the task definition, we only generate the secret once and reuse the file that holds it on all subsequent runs.
That encryption configuration will later be passed to the API server so that Secret resources are stored encrypted in etcd.
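Once the cluster is up, you can verify that Secrets really are encrypted at rest by reading one straight out of etcd. This is a rough check; the certificate paths are the ones this setup writes for etcd, so adjust them (and the kubeconfig) to whatever your roles generated:

# create a throwaway secret
sudo kubectl --kubeconfig /var/lib/kubernetes/admin.kubeconfig \
  create secret generic test-enc --from-literal=mykey=mydata

# read it back directly from etcd; the stored value should start with
# the k8s:enc:aescbc:v1: prefix instead of appearing in plaintext
sudo etcdctl get /registry/secrets/default/test-enc \
  --endpoints https://192.168.56.2:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key | hexdump -C | head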
Step 5: Etcd Cluster¶
In this step we download the compiled etcd
binary, create the configuration and the systemd service, and issue certificates for the etcd
peers, as well as one for the API server to talk to the etcd
cluster as a client.
The installation looks like below.
---
- name: Download etcd release tarball
  ansible.builtin.get_url:
    url: "{{ etcd_download_url }}"
    dest: "{{ downloads_dir }}/{{ etcd_download_url | basename }}"
    mode: "0644"
    owner: root
    group: root
    checksum: sha256:{{ etcd_checksum }}
  tags:
    - download
  register: etcd_download
- name: Ensure gzip is installed
  ansible.builtin.package:
    name: gzip
    state: present
- name: Extract etcd from the tarball to /usr/local/bin/
  ansible.builtin.unarchive:
    src: "{{ etcd_download.dest }}"
    dest: /usr/local/bin/
    remote_src: true
    mode: "0755"
    extra_opts:
      - --strip-components=1
      - --wildcards
      - "**/etcd"
      - "**/etcdctl"
      - "**/etcdutl"
  notify: Reload etcd systemd
A few important notes about this playbook:
- We specify dest as an absolute file path in the get_url task to avoid re-downloading the file on subsequent runs.
- The checksum ensures that we don't get any nasty binary from the internet.
- The register on the download step lets us use etcd_download.dest later when extracting the tarball.
- The tarball may contain more than one file; we are only interested in extracting the ones listed in the extra_opts property. Be mindful of the --strip-components and --wildcards options.
The variables for the above task will look like below:
---
etcd_initial_cluster: etcd-node0=https://192.168.56.2:2380,etcd-node1=https://192.168.56.3:2380,etcd-node2=https://192.168.56.4:2380
etcd_advertise_ip: "0.0.0.0"
etcd_privatekey_type: Ed25519
etcd_version: v3.5.12
etcd_checksum: f2ff0cb43ce119f55a85012255609b61c64263baea83aa7c8e6846c0938adca5
etcd_download_url: https://github.com/etcd-io/etcd/releases/download/{{ etcd_version }}/etcd-{{ etcd_version }}-linux-amd64.tar.gz
k8s_etcd_servers: https://192.168.56.2:2379,https://192.168.56.3:2379,https://192.168.56.4:2379
Once the installation is done, we can proceed with the configuration as below:
- name: Ensure the etcd directories exist
  ansible.builtin.file:
    path: '{{ item }}'
    state: directory
    owner: 'root'
    group: 'root'
    mode: '0750'
  loop:
    - /etc/etcd
- name: Copy CA TLS certificate to /etc/kubernetes/pki/etcd/
  ansible.builtin.copy:
    src: /vagrant/share/etcd-ca.crt
    dest: /etc/kubernetes/pki/etcd/ca.crt
    mode: '0640'
    remote_src: true
  notify: Reload etcd systemd
- name: Create systemd service
  ansible.builtin.template:
    src: systemd.service.j2
    dest: '/etc/systemd/system/etcd.service'
    mode: '0644'
    owner: root
    group: root
  tags:
    - systemd
  notify: Reload etcd systemd
- name: Start etcd service
  ansible.builtin.systemd_service:
    name: etcd.service
    state: started
    enabled: true
    daemon_reload: true
The handler for restarting etcd
is not much different from what we've seen previously, but the systemd Jinja2 template is an interesting one:
[Unit]
Description=etcd
Documentation=https://github.com/etcd-io/etcd
[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
--advertise-client-urls=https://{{ host_ip }}:2379 \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--client-cert-auth \
--data-dir=/var/lib/etcd \
--initial-advertise-peer-urls=https://{{ host_ip }}:2380 \
--initial-cluster-state={{ etcd_initial_cluster_state }} \
--initial-cluster-token={{ etcd_initial_cluster_token }} \
--initial-cluster={{ etcd_initial_cluster }} \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--listen-client-urls=https://{{ bind_address }}:2379 \
--listen-peer-urls=https://{{ bind_address }}:2380 \
--log-level info \
--log-outputs stderr \
--logger zap \
--name={{ etcd_peer_name }} \
--peer-cert-file=/etc/kubernetes/pki/etcd/server.crt \
--peer-client-cert-auth \
--peer-key-file=/etc/kubernetes/pki/etcd/server.key \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
LimitNOFILE=40000
Restart=always
RestartSec=5
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
The two variables you see in the above template are passed from the following vars
file:
---
host_ip: "{{ ansible_facts.all_ipv4_addresses | select('match', '^192.168.56') | first }}"
k8s_version: v1.29.2
apiserver_port: 6443
apiserver_ips:
- 192.168.56.2
- 192.168.56.3
- 192.168.56.4
cilium_version: 1.15.1
k8s_static_pods_dir: /etc/kubernetes/manifests
bind_address: "0.0.0.0"
You will notice that etcd
is instructed to authenticate its requests using the TLS CA: no request is allowed unless its certificate is signed by the trusted CA.
This is achieved with the --client-cert-auth
and --trusted-ca-file
options for clients of the etcd
cluster, and the --peer-client-cert-auth
and --peer-trusted-ca-file
options for the etcd
peers.
You will also notice that this is a 3-node etcd
cluster whose peers are statically configured by the values in the vars/etcd.yml
file. This is exactly one of the cases where static IP addresses make a lot of our assumptions easier and the configuration simpler. One can only imagine what would be required in a dynamic environment where DHCP is involved.
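With the service running on all three nodes, a quick health check from any of them should list all the peers. Something along these lines works, again assuming the certificate paths this setup writes:

sudo etcdctl member list -w table \
  --endpoints https://192.168.56.2:2379,https://192.168.56.3:2379,https://192.168.56.4:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key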
Step 6: Kubernetes Components¶
There are multiple components, as you know, but here's a sample: the Kubernetes API server.
---
- name: Download the Kubernetes binaries
  ansible.builtin.get_url:
    url: "https://dl.k8s.io/{{ k8s_version }}/kubernetes-server-linux-amd64.tar.gz"
    dest: "{{ downloads_dir }}/kubernetes-server-{{ k8s_version }}-linux-amd64.tar.gz"
    mode: "0444"
    owner: root
    group: root
  tags:
    - download
  register: k8s_download
- name: Extract binaries to system path
  ansible.builtin.unarchive:
    src: "{{ k8s_download.dest }}"
    dest: /usr/local/bin/
    remote_src: true
    owner: root
    group: root
    mode: "0755"
    extra_opts:
      - --strip-components=3
      - kubernetes/server/bin/kube-apiserver
      - kubernetes/server/bin/kube-controller-manager
      - kubernetes/server/bin/kube-scheduler
      - kubernetes/server/bin/kubectl
      - kubernetes/server/bin/kubelet
      - kubernetes/server/bin/kube-proxy
[Unit]
Description=Kubernetes API Server
Documentation=https://github.com/kubernetes/kubernetes
[Service]
ExecStart=/usr/local/bin/kube-apiserver \
--allow-privileged=true \
--audit-log-maxage=30 \
--audit-log-maxbackup=3 \
--audit-log-maxsize=100 \
--audit-log-path=/var/log/audit.log \
--authorization-mode=Node,RBAC \
--bind-address={{ bind_address }} \
--external-hostname={{ external_hostname }} \
--client-ca-file=/etc/kubernetes/pki/ca.crt \
--enable-admission-plugins=NamespaceLifecycle,NodeRestriction,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota \
--etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt \
--etcd-certfile=/etc/kubernetes/pki/etcd/kube-apiserver.crt \
--etcd-keyfile=/etc/kubernetes/pki/etcd/kube-apiserver.key \
--etcd-servers={{ k8s_etcd_servers }} \
--event-ttl=1h \
--encryption-provider-config=/etc/kubernetes/encryption-config.yml \
--kubelet-certificate-authority=/etc/kubernetes/pki/ca.crt \
--kubelet-client-certificate=/etc/kubernetes/pki/kube-apiserver.crt \
--kubelet-client-key=/etc/kubernetes/pki/kube-apiserver.key \
--runtime-config='api/all=true' \
--service-account-key-file=/etc/kubernetes/pki/serviceaccount.key \
--service-account-signing-key-file=/etc/kubernetes/pki/serviceaccount.key \
--service-account-issuer=https://{{ kubernetes_public_ip }}:6443 \
--service-cluster-ip-range=10.0.0.0/16,fd00:10:96::/112 \
--service-node-port-range=30000-32767 \
--tls-cert-file=/etc/kubernetes/pki/kube-apiserver.crt \
--tls-private-key-file=/etc/kubernetes/pki/kube-apiserver.key \
--proxy-client-cert-file=/etc/kubernetes/pki/kube-apiserver.crt \
--proxy-client-key-file=/etc/kubernetes/pki/kube-apiserver.key \
--peer-advertise-ip={{ bind_address }} \
--peer-ca-file=/etc/kubernetes/pki/ca.crt \
--feature-gates=UnknownVersionInteroperabilityProxy=true,StorageVersionAPI=true \
--v=4
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
And the default values are fetched from the role's own defaults:
---
k8s_version: "{{ lookup('url', 'https://dl.k8s.io/release/stable.txt').split('\n')[0] }}"
kubernetes_public_ip: "{{ host_ip }}"
cluster_cidr: 10.200.0.0/16
service_cluster_ip_range: 10.32.0.0/24
external_hostname: 192.168.56.100
You will notice the following points in this systemd setup:
- The external hostname is that of the Load Balancer.
- The certificate files for both the Kubernetes API server and the etcd server are passed from the locations generated earlier.
- The encryption config comes from the file generated in step 4.
What makes this setup HA then?
First of all, there are a couple of places that do not point to the Load Balancer IP address, so this doesn't count as a complete HA setup; still, having an LB in place already gets us most of the way there.
You will also not see any peer-address configuration in the API server, as was the case for etcd
with its --initial-cluster
flag, so you might wonder: how do the different instances of the API server know of one another, and how do they coordinate when multiple requests hit them?
The answer does not lie in Kubernetes itself, but in the storage layer, the etcd
cluster. etcd
, at the time of writing, uses the Raft protocol for consensus and coordination between its peers.
That is what makes the Kubernetes cluster HA, not the API server itself. Each API server instance talks to the etcd
cluster to learn the state of the cluster and the components inside it.
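You can actually see this division of labor on the storage side: etcd elects a single Raft leader while every API server instance stays stateless and equal. The same TLS flags as before apply here:

sudo etcdctl endpoint status -w table \
  --endpoints https://192.168.56.2:2379,https://192.168.56.3:2379,https://192.168.56.4:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key
# the IS LEADER column should be true for exactly one member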
Step 7: Worker Nodes¶
This is one of the last steps; once it's done we will have a Kubernetes cluster, albeit one whose nodes are still NotReady until the CNI arrives in the next step.
The task includes downloading some of the binaries, passing in some of the TLS certificates generated earlier, and starting the systemd services.
- name: Ensure CNI directory exists
  ansible.builtin.file:
    path: /etc/cni/net.d/
    state: directory
    owner: root
    group: root
    mode: "0755"
  tags:
    - never
- name: Configure CNI Networking
  ansible.builtin.copy:
    content: |
      {
        "cniVersion": "1.0.0",
        "name": "containerd-net",
        "plugins": [
          {
            "type": "bridge",
            "bridge": "cni0",
            "isGateway": true,
            "ipMasq": true,
            "promiscMode": true,
            "ipam": {
              "type": "host-local",
              "ranges": [
                [{
                  "subnet": "{{ pod_subnet_cidr_v4 }}"
                }]
              ],
              "routes": [
                { "dst": "0.0.0.0/0" }
              ]
            }
          },
          {
            "type": "portmap",
            "capabilities": {"portMappings": true}
          }
        ]
      }
    dest: /etc/cni/net.d/10-containerd-net.conf
    owner: root
    group: root
    mode: "0640"
  tags:
    - never
- name: Ensure containerd directory exists
  ansible.builtin.file:
    path: /etc/containerd
    state: directory
    owner: root
    group: root
    mode: "0755"
- name: Get containerd default config
  ansible.builtin.command: containerd config default
  changed_when: false
  register: containerd_default_config
  tags:
    - config
- name: Configure containerd
  ansible.builtin.copy:
    content: "{{ containerd_default_config.stdout }}"
    dest: /etc/containerd/config.toml
    owner: root
    group: root
    mode: "0640"
  tags:
    - config
  notify: Restart containerd
The CNI config for containerd is documented in their repository if you feel curious3.
Worker Role Default Variables
---
cluster_cidr: 10.200.0.0/16
cluster_dns: 10.32.0.10
cluster_domain: cluster.local
cni_plugins_checksum_url: https://github.com/containernetworking/plugins/releases/download/v1.4.0/cni-plugins-linux-amd64-v1.4.0.tgz.sha256
cni_plugins_url: https://github.com/containernetworking/plugins/releases/download/v1.4.0/cni-plugins-linux-amd64-v1.4.0.tgz
containerd_checksum_url: https://github.com/containerd/containerd/releases/download/v1.7.13/containerd-1.7.13-linux-amd64.tar.gz.sha256sum
containerd_service_url: https://github.com/containerd/containerd/raw/v1.7.13/containerd.service
containerd_url: https://github.com/containerd/containerd/releases/download/v1.7.13/containerd-1.7.13-linux-amd64.tar.gz
crictl_checksum_url: https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz.sha256
crictl_url: https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
kubectl_checksum_url: https://dl.k8s.io/release/{{ k8s_version }}/bin/linux/amd64/kubectl.sha256
kubectl_url: https://dl.k8s.io/release/{{ k8s_version }}/bin/linux/amd64/kubectl
kubelet_checksum_url: https://dl.k8s.io/{{ k8s_version }}/bin/linux/amd64/kubelet.sha256
kubelet_config_path: /var/lib/kubelet/config.yml
kubelet_url: https://dl.k8s.io/release/{{ k8s_version }}/bin/linux/amd64/kubelet
pod_subnet_cidr_v4: 10.88.0.0/16
pod_subnet_cidr_v6: 2001:4860:4860::/64
runc_url: https://github.com/opencontainers/runc/releases/download/v1.1.12/runc.amd64
runc_checksum: aadeef400b8f05645768c1476d1023f7875b78f52c7ff1967a6dbce236b8cbd8
- name: Download containerd
  ansible.builtin.get_url:
    url: "{{ containerd_url }}"
    dest: "{{ downloads_dir }}/{{ containerd_url | basename }}"
    checksum: "sha256:{{ lookup('url', containerd_checksum_url) | split | first }}"
    mode: "0444"
  register: containerd_download
  tags:
    - download
- name: Create /tmp/containerd directory
  ansible.builtin.file:
    path: /tmp/containerd
    state: directory
    mode: "0755"
- name: Extract containerd
  ansible.builtin.unarchive:
    src: "{{ containerd_download.dest }}"
    dest: /tmp/containerd
    mode: "0755"
    remote_src: true
- name: Glob files in unarchived bin directory
  ansible.builtin.find:
    paths: /tmp/containerd
    file_type: file
    recurse: true
    mode: "0755"
  register: containerd_bin_files
- name: Install containerd
  ansible.builtin.copy:
    src: "{{ item }}"
    dest: /usr/local/bin/
    owner: root
    group: root
    mode: "0755"
    remote_src: true
  loop: "{{ containerd_bin_files.files | map(attribute='path') | list }}"
- name: Download containerd service
  ansible.builtin.get_url:
    url: "{{ containerd_service_url }}"
    dest: /etc/systemd/system/{{ containerd_service_url | basename }}
    mode: "0644"
    owner: root
    group: root
  tags:
    - download
Step 8: CoreDNS & Cilium¶
The last step is straightforward.
We run CoreDNS as a Kubernetes Deployment with the appropriate scheduling constraints, and install Cilium using its CLI.
---
- name: Remove CoreDNS as static pod
  ansible.builtin.file:
    path: "{{ k8s_static_pods_dir }}/coredns.yml"
    state: absent
- name: Slurp CoreDNS TLS certificate
  ansible.builtin.slurp:
    src: /etc/kubernetes/pki/coredns.crt
  register: coredns_cert
- name: Slurp CoreDNS TLS key
  ansible.builtin.slurp:
    src: /etc/kubernetes/pki/coredns.key
  register: coredns_key
- name: Slurp CoreDNS CA certificate
  ansible.builtin.slurp:
    src: /etc/kubernetes/pki/ca.crt
  register: coredns_ca
- name: Apply CoreDNS manifest
  kubernetes.core.k8s:
    definition: "{{ lookup('template', 'manifests.yml.j2') | from_yaml }}"
    state: present
Notice the slurp tasks: they are used to pass the TLS certificates into the CoreDNS manifest. Conveniently, slurp returns the file contents base64-encoded, which is exactly the format the Secret's data fields expect.
CoreDNS Kubernetes Manifests
apiVersion: v1
kind: List
items:
  - apiVersion: v1
    data:
      Corefile: |
        .:53 {
            errors
            health {
                lameduck 5s
            }
            ready
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                endpoint https://{{ host_ip }}:{{ apiserver_port }}
                tls /cert/coredns.crt /cert/coredns.key /cert/ca.crt
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
            }
            prometheus :9153
            forward . /etc/resolv.conf {
                max_concurrent 1000
            }
            cache 30
            loop
            reload
            loadbalance
        }
    kind: ConfigMap
    metadata:
      name: coredns-config
      namespace: kube-system
  - apiVersion: v1
    kind: Secret
    metadata:
      name: coredns-tls
      namespace: kube-system
    data:
      tls.crt: "{{ coredns_cert.content }}"
      tls.key: "{{ coredns_key.content }}"
      ca.crt: "{{ coredns_ca.content }}"
  - apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: coredns
      namespace: kube-system
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: coredns
    rules:
      - apiGroups:
          - ""
        resources:
          - endpoints
          - services
          - pods
          - namespaces
        verbs:
          - list
          - watch
      - apiGroups:
          - discovery.k8s.io
        resources:
          - endpointslices
        verbs:
          - list
          - watch
  - apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: coredns
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: coredns
    subjects:
      - kind: ServiceAccount
        name: coredns
        namespace: kube-system
  - apiVersion: v1
    kind: Service
    metadata:
      annotations:
        prometheus.io/port: "9153"
        prometheus.io/scrape: "true"
      labels:
        k8s-app: kube-dns
      name: kube-dns
      namespace: kube-system
    spec:
      ports:
        - name: dns
          port: 53
          protocol: UDP
          targetPort: 53
        - name: dns-tcp
          port: 53
          protocol: TCP
          targetPort: 53
        - name: metrics
          port: 9153
          protocol: TCP
          targetPort: 9153
      selector:
        k8s-app: coredns
      type: ClusterIP
  - apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: coredns
      namespace: kube-system
    spec:
      progressDeadlineSeconds: 600
      replicas: 2
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          k8s-app: coredns
      strategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          labels:
            k8s-app: coredns
        spec:
          containers:
            - args:
                - -conf
                - /etc/coredns/Corefile
              image: coredns/coredns:1.11.1
              imagePullPolicy: IfNotPresent
              livenessProbe:
                failureThreshold: 5
                httpGet:
                  path: /health
                  port: 8080
                  scheme: HTTP
                initialDelaySeconds: 60
                successThreshold: 1
                timeoutSeconds: 5
              name: coredns
              ports:
                - containerPort: 53
                  name: dns
                  protocol: UDP
                - containerPort: 53
                  name: dns-tcp
                  protocol: TCP
                - containerPort: 9153
                  name: metrics
                  protocol: TCP
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8181
                  scheme: HTTP
              resources:
                limits:
                  memory: 170Mi
                requests:
                  cpu: 100m
                  memory: 70Mi
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  add:
                    - NET_BIND_SERVICE
                  drop:
                    - all
                readOnlyRootFilesystem: true
              volumeMounts:
                - mountPath: /etc/coredns
                  name: config-volume
                  readOnly: true
                - mountPath: /cert
                  name: coredns-tls
                  readOnly: true
          dnsPolicy: Default
          nodeSelector:
            kubernetes.io/os: linux
          priorityClassName: system-cluster-critical
          serviceAccountName: coredns
          tolerations:
            - key: CriticalAddonsOnly
              operator: Exists
            - effect: NoSchedule
              key: node-role.kubernetes.io/control-plane
          volumes:
            - configMap:
                items:
                  - key: Corefile
                    path: Corefile
                name: coredns-config
              name: config-volume
            - secret:
                defaultMode: 0444
                items:
                  - key: tls.crt
                    path: coredns.crt
                  - key: tls.key
                    path: coredns.key
                  - key: ca.crt
                    path: ca.crt
                secretName: coredns-tls
              name: coredns-tls
And finally, Cilium.
- name: Download cilium-cli
  ansible.builtin.get_url:
    url: "{{ cilium_cli_url }}"
    dest: "{{ downloads_dir }}/{{ cilium_cli_url | basename }}"
    owner: root
    group: root
    mode: "0644"
    checksum: "sha256:{{ cilium_cli_checksum }}"
  register: cilium_cli_download
- name: Extract cilium bin to /usr/local/bin
  ansible.builtin.unarchive:
    src: "{{ cilium_cli_download.dest }}"
    dest: /usr/local/bin/
    remote_src: true
    owner: root
    group: root
    mode: "0755"
    extra_opts:
      - cilium
- name: Install cilium
  ansible.builtin.command: cilium install
  failed_when: false
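The cilium CLI can also tell you when the agent and operator are actually up. I find a check along these lines useful right after the install task, run as root on the node where the CLI was installed, with the admin kubeconfig this setup generates:

export KUBECONFIG=/var/lib/kubernetes/admin.kubeconfig
cilium status --wait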
Cilium Role Default Variables
That's it. Believe it or not, the Kubernetes cluster is now ready, and if you run the following command, you will see three nodes in the Ready
state.
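The command in question is just a node listing with the admin kubeconfig generated earlier; run it from any of the nodes (the ages in the output will obviously differ):

sudo kubectl --kubeconfig /var/lib/kubernetes/admin.kubeconfig get nodes
# NAME    STATUS   ROLES    AGE   VERSION
# node0   Ready    <none>   ...   v1.29.2
# node1   Ready    <none>   ...   v1.29.2
# node2   Ready    <none>   ...   v1.29.2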
How to run it?¶
If you clone the repository, all you need is a single vagrant up
to build everything from scratch. It takes a while for all the components to come up and become ready, but everything is set up without any further manual intervention.
Conclusion¶
This task took me a long time to get right, and I went through many iterations to make it work. One of the most time-consuming parts was an etcd
cluster that kept misbehaving, which caused the Kubernetes API server to hit timeout errors and become inaccessible to the rest of the cluster's components.
I learned a lot from this challenge. I learned how to write efficient Ansible playbooks, how to create the right mental model for the target host where the Ansible executes a command, how to deal with all those TLS certificates, and overall, how to set up a Kubernetes cluster from scratch.
I couldn't be happier reaching the final result, having spent countless hours debugging and banging my head against the wall.
I recommend everyone give the challenge a try. You never know how much you don't know about the inner workings of Kubernetes until you try to set it up from scratch.
Thanks for reading so far. I hope you enjoyed the journey as much as I did .
Source Code¶
As mentioned before, you can find the source code for this challenge on the GitHub repository1.
FAQ¶
Why Cilium?¶
Cilium has emerged as a cloud-native CNI with many of the features and characteristics of a production-grade CNI; performance, security, and observability are the top ones. I used Linkerd in the past, but I am using Cilium for all my current and upcoming projects, and it continues to prove itself a great CNI for Kubernetes clusters.
Why use Vagrant?¶
I'm cheap and I don't want to pay for cloud resources, even for learning purposes. I have active subscriptions to O'Reilly and A Cloud Guru and could have gone for their sandboxes, but I started this challenge with Vagrant and resisted the urge to change that, even after countless hours were spent fighting the terrible network performance of the VirtualBox VMs.