
[Tutorial] Deploying Enterprise-Grade High-Availability K8S Clusters Based on Ansible

This is a development document intended for developers and AI, reposted from my documentation site. Original address:

The development environment for this article is Linux, with files edited using the micro editor. Please adjust according to your own system environment.

Basic Concepts

About Ansible

Ansible is an agentless automation tool that writes configurations and changes as clear, repeatable tasks. It excels at consistent configuration across multiple hosts and is also suitable for application deployment and batch operations. When used with load balancers, it can break down complex changes into controllable rolling steps.

Ansible is also an excellent fit for deploying and managing HAProxy.


About Kubernetes and RKE2

Kubernetes (K8s) is a container orchestration system responsible for core capabilities such as scheduling, service discovery, rolling updates, and self-healing. Its goal is to standardize the way distributed applications run, making O&M processes more controllable.

RKE2 (also known as RKE Government) is a conformant Kubernetes distribution provided by Rancher. Its defaults lean toward security and compliance, making it well suited to production environments.


About Rocky Linux and SELinux

Rocky Linux is an open-source enterprise-grade operating system aimed at maintaining bug-for-bug compatibility with RHEL. It has a stable lifecycle and is suitable for long-running production clusters.


SELinux is a Mandatory Access Control (MAC) mechanism used to finely restrict the access boundaries of processes and resources. Rocky Linux enables it by default in enforcing mode; it is recommended to configure it according to policies rather than disabling it.


Getting Started

Installing Ansible

Install Ansible (using yay as an example):

yay -S ansible

Run ansible --version to view version information.

yun@yun ~/V/a/yunzaixi-dev (main)> ansible --version
ansible [core 2.20.0]
config file = None
configured module search path = ['/home/yun/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.13/site-packages/ansible
ansible collection location = /home/yun/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.13.7 (main, Aug 16 2025, 15:55:01) [GCC 15.2.1 20250813] (/usr/bin/python)
jinja version = 3.1.6
pyyaml version = 6.0.3 (with libyaml v0.2.5)

Ansible is implemented in Python, so make sure Python is available in your environment before installing it. lablabs.rke2 depends on the netaddr Python package, which must be installed separately; on Arch Linux, use sudo pacman -S python-netaddr.
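To quickly confirm that netaddr is importable from the interpreter Ansible uses, a one-line check can be run (adjust the interpreter path to your environment):

python3 -c "import netaddr; print(netaddr.__version__)"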

Installing Version Management Tools

Install git, gh (using yay as an example):

yay -S git github-cli

Run git version and gh version to view version information.

yun@yun ~/V/a/yunzaixi-dev (main)> git version
git version 2.52.0
yun@yun ~/V/a/yunzaixi-dev (main)> gh version
gh version 2.83.1 (2025-11-13)
https://github.com/cli/cli/releases/tag/v2.83.1

Log in to GitHub:

gh auth login --scopes workflow

Follow the prompts to proceed.

Preparing Cloud Servers

Before everything starts, we need to prepare the cloud servers for deploying the cluster. A minimum viable production-grade HA (control plane + etcd) usually consists of 3 rke2-server nodes (embedded etcd) plus at least one rke2-agent. Therefore, we need at least 4 cloud servers to proceed with the next steps.

For ease of O&M, all servers run Rocky Linux.

Reason for choosing Rocky Linux: it is a free, open-source, enterprise-grade operating system, bug-for-bug compatible with RHEL, and it falls within the RKE2 support matrix.

RKE2 is very lightweight, but has some minimum requirements:

  1. Two RKE2 nodes cannot have the same node name. By default, the node name is taken from the machine's hostname, so the hostnames of the Linux cloud servers must all be unique.
  2. Each cloud server should have at least 2 CPU cores and 4 GB of RAM, and use an SSD for its disk.
  3. Specific firewall ports must be open between the nodes (a hedged firewalld example follows this list).
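As a sketch for item 3, the commonly required RKE2 ports can be opened with firewalld through Ansible. The exact port list depends on your CNI and topology (and may be unnecessary if your cloud provider's security groups already handle it), so verify it against the official RKE2 network requirements; the playbook below also assumes the ansible.posix collection is installed:

- name: Open common RKE2 ports (example; adjust to your CNI and topology)
  hosts: all
  become: true
  tasks:
    - name: Allow RKE2 ports in firewalld
      ansible.posix.firewalld:
        port: "{{ item }}"
        permanent: true
        immediate: true
        state: enabled
      loop:
        - 6443/tcp       # Kubernetes API server
        - 9345/tcp       # RKE2 supervisor API (node registration)
        - 2379-2380/tcp  # etcd client and peer traffic (server nodes)
        - 10250/tcp      # kubelet
        - 8472/udp       # VXLAN overlay (Canal/Flannel; Cilium may use different ports)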

Configuring SSH Config

Add the following to your SSH config (typically ~/.ssh/config), filling in each cloud server's public IP address at HostName:

Host rke2-server1
    HostName <Your Public IP Address 1>
    User root

Host rke2-server2
    HostName <Your Public IP Address 2>
    User root

Host rke2-server3
    HostName <Your Public IP Address 3>
    User root

Host rke2-agent1
    HostName <Your Public IP Address 4>
    User root

Host rke2-agent2
    HostName <Your Public IP Address 5>
    User root

The above code configures SSH aliases for all cloud servers, which greatly simplifies future O&M operations. Next, upload the SSH public key to the target servers:

ssh-copy-id rke2-server1
ssh-copy-id rke2-server2
ssh-copy-id rke2-server3
ssh-copy-id rke2-agent1
ssh-copy-id rke2-agent2

If the servers have been reinstalled before, you may need to remove the stale SSH host key fingerprints first:

ssh-keygen -R rke2-server1
ssh-keygen -R rke2-server2
ssh-keygen -R rke2-server3
ssh-keygen -R rke2-agent1
ssh-keygen -R rke2-agent2

When running ssh-copy-id, follow the prompts (each server's password is requested once).

Once completed, you can log in to all cloud servers without a password:

ssh rke2-server1
ssh rke2-server2
ssh rke2-server3
ssh rke2-agent1
ssh rke2-agent2

After logging in, you may see a warning that the connection is not using a post-quantum key exchange algorithm and could therefore be vulnerable to "store now, decrypt later" attacks in the future (very futuristic); it can be ignored for now.

** WARNING: connection is not using a post-quantum key exchange algorithm. 
** This session may be vulnerable to "store now, decrypt later" attacks.
** The server may need to be upgraded. See https://openssh.com/pq.html
Last failed login: ~~ from ~~ on ssh:notty There were 31 failed login attempts since the last successful login.

Initializing Ansible Project

Initializing Repository

First, create a folder; assume the project name is rke2-ansible.

yun@yun ~/V/a/y/p/ansible (main)> mkdir rke2-ansible
yun@yun ~/V/a/y/p/ansible (main)> ls
rke2-ansible/

Enter the project repository, initialize git, and create a GitHub private repository:

cd rke2-ansible
git init
echo "# rke2-ansible" > README.md
git add .
git commit -m "chore: initial commit"
gh repo create rke2-ansible --private --source=. --remote=origin --push

The following block is optional; it re-adds the newly created repository as a git submodule of the parent repository (note that it deletes the local directory first):

cd ..
rm -rf rke2-ansible/

git submodule add https://github.com/yunzaixi-dev/rke2-ansible.git ./rke2-ansible

Planning Directory Structure

Next, plan the project structure:

mkdir -p inventories/prod \
group_vars \
host_vars \
playbooks \
roles

Create empty files:

touch ansible.cfg \
requirements.yml \
inventories/prod/hosts.yml \
group_vars/all.yml \
group_vars/rke2_servers.yml \
group_vars/rke2_agents.yml \
host_vars/rke2-server1.yml \
playbooks/site.yml \
playbooks/ping.yml \
playbooks/update-packages.yml \
playbooks/set-hostname.yml \
playbooks/disable-ssh-password.yml

The directory structure is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> tree
.
├── ansible.cfg
├── group_vars
│   ├── all.yml
│   ├── rke2_agents.yml
│   └── rke2_servers.yml
├── host_vars
│   └── rke2-server1.yml
├── inventories
│   └── prod
│       └── hosts.yml
├── playbooks
│   ├── disable-ssh-password.yml
│   ├── ping.yml
│   ├── set-hostname.yml
│   ├── site.yml
│   └── update-packages.yml
├── README.md
├── requirements.yml
└── roles

Description of each directory and file:

  • ansible.cfg: Ansible global configuration, specifies inventory and roles_path.
  • requirements.yml: Galaxy dependency list, used to install the lablabs.rke2 role.
  • inventories/prod/hosts.yml: Production environment host inventory and grouping.
  • group_vars/*.yml: Host group variables: shared cluster parameters plus server- and agent-specific settings.
  • host_vars/rke2-server1.yml: Per-host variables, used to mark the first control plane node as the initialization node.
  • playbooks/site.yml: Deployment entry point, including system preparation and RKE2 installation process.
  • playbooks/ping.yml: Connectivity check Playbook, used to verify host reachability.
  • playbooks/update-packages.yml: Batch update Playbook, used to upgrade system software packages.
  • playbooks/set-hostname.yml: Batch-sets hostnames, keeping hyphens and stripping characters that are invalid in hostnames.
  • playbooks/disable-ssh-password.yml: Disable SSH password login, only allow key login.
  • roles/: Directory for roles downloaded by Galaxy.

Installing Galaxy Role

micro requirements.yml :

roles:
  - name: lablabs.rke2
    version: "1.49.1"

lablabs.rke2 is a community-maintained RKE2 Role. GitHub repository address: https://github.com/lablabs/ansible-role-rke2. It encapsulates official installation scripts and service management logic. Pinning it to 1.49.1 ensures the deployment process is reproducible and reduces uncertainty from upstream updates.

Install dependencies:

ansible-galaxy role install -r requirements.yml -p roles
yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-galaxy role install -r requirements.yml -p roles
Starting galaxy role install process
- downloading role 'rke2', owned by lablabs
- downloading role from https://github.com/lablabs/ansible-role-rke2/archive/1.49.1.tar.gz
- extracting lablabs.rke2 to /home/yun/Vaults/admin/yunzaixi-dev/project/ansible/rke2-ansible/roles/lablabs.rke2
- lablabs.rke2 (1.49.1) was installed successfully

Configuring Ansible

micro ansible.cfg (interpreter_python path should be adjusted according to your own situation):

[defaults]
inventory = inventories/prod/hosts.yml
remote_user = root
host_key_checking = False
roles_path = ./roles
forks = 10
timeout = 30
deprecation_warnings = False
stdout_callback = default
result_format = yaml
interpreter_python = /usr/bin/python3
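To confirm which settings actually take effect from this file, ansible-config can print only the values that differ from the defaults:

ansible-config dump --only-changed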

Writing Inventory

micro inventories/prod/hosts.yml :

all:
  children:
    rke2_servers:
      hosts:
        rke2-server1:
        rke2-server2:
        rke2-server3:
    rke2_agents:
      hosts:
        rke2-agent1:
        rke2-agent2:
    rke2_cluster:
      children:
        rke2_servers:
        rke2_agents:

Since the SSH config was set up earlier, the host aliases can be used directly here; there is no need to set ansible_host.
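If you run Ansible from a machine without this SSH config, the same inventory can carry the addresses explicitly via ansible_host instead; for example (a fragment, with a placeholder address):

rke2_servers:
  hosts:
    rke2-server1:
      ansible_host: <Your Public IP Address 1>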

Connectivity Check

micro playbooks/ping.yml :

- name: Ping all hosts
  hosts: all
  gather_facts: false
  tasks:
    - name: Ping
      ansible.builtin.ping:

Execute:

ansible-playbook playbooks/ping.yml

The output is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/ping.yml

PLAY [Ping all hosts] ***********************************************************************

TASK [Ping] *********************************************************************************
ok: [rke2-agent1]
ok: [rke2-agent2]
ok: [rke2-server2]
ok: [rke2-server1]
ok: [rke2-server3]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=1 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Batch Setting Hostnames

Hostnames must not contain underscores (_), so the playbook below derives a clean hostname from each host's inventory alias.

micro playbooks/set-hostname.yml :

- name: Set hostname from SSH alias
  hosts: all
  become: true
  vars:
    raw_hostname: "{{ inventory_hostname | lower }}"
    hostname_from_alias: "{{ raw_hostname | regex_replace('[^a-z0-9-]', '') | regex_replace('^-+', '') | regex_replace('-+$', '') }}"
  tasks:
    - name: Ensure hostname is not empty
      ansible.builtin.assert:
        that:
          - hostname_from_alias | length > 0
        fail_msg: "Derived hostname is empty. Check inventory_hostname: {{ inventory_hostname }}"

    - name: Set hostname
      ansible.builtin.hostname:
        name: "{{ hostname_from_alias }}"

Execute:

ansible-playbook playbooks/set-hostname.yml

The results are as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/set-hostname.yml

PLAY [Set hostname from SSH alias] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server3]
ok: [rke2-server2]
ok: [rke2-server1]
ok: [rke2-agent2]
ok: [rke2-agent1]

TASK [Ensure hostname is not empty] *********************************************************
ok: [rke2-server1] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-server2] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-server3] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-agent1] => {
"changed": false,
"msg": "All assertions passed"
}
ok: [rke2-agent2] => {
"changed": false,
"msg": "All assertions passed"
}

TASK [Set hostname] *************************************************************************
changed: [rke2-agent1]
changed: [rke2-server1]
changed: [rke2-server3]
changed: [rke2-server2]
changed: [rke2-agent2]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
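To spot-check the result, an ad-hoc command can print the static hostname of every host:

ansible all -m ansible.builtin.command -a "hostnamectl --static"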

Disabling SSH Password Login (Optional)

Before executing, confirm that key-based login works on every host to avoid locking yourself out; one way to verify this is shown below.
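The following forces a key-only, non-interactive login against every host and confirms each one succeeds:

for h in rke2-server1 rke2-server2 rke2-server3 rke2-agent1 rke2-agent2; do
    ssh -o PasswordAuthentication=no -o BatchMode=yes "$h" true && echo "$h ok"
done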

micro playbooks/disable-ssh-password.yml :

- name: Disable SSH password authentication
  hosts: all
  become: true
  tasks:
    - name: Write SSH hardening config
      ansible.builtin.copy:
        dest: /etc/ssh/sshd_config.d/99-disable-password.conf
        mode: "0644"
        content: |
          PasswordAuthentication no
          KbdInteractiveAuthentication no
          ChallengeResponseAuthentication no
      notify: Restart sshd

    - name: Validate sshd config
      ansible.builtin.command: sshd -t
      changed_when: false

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted

Execute:

ansible-playbook playbooks/disable-ssh-password.yml

The output is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/disable-ssh-password.yml

PLAY [Disable SSH password authentication] **************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent1]
ok: [rke2-server3]
ok: [rke2-agent2]
ok: [rke2-server1]
ok: [rke2-server2]

TASK [Write SSH hardening config] ***********************************************************
changed: [rke2-server3]
changed: [rke2-agent1]
changed: [rke2-server2]
changed: [rke2-server1]
changed: [rke2-agent2]

TASK [Validate sshd config] *****************************************************************
ok: [rke2-server3]
ok: [rke2-agent1]
ok: [rke2-server2]
ok: [rke2-agent2]
ok: [rke2-server1]

RUNNING HANDLER [Restart sshd] **************************************************************
changed: [rke2-server2]
changed: [rke2-server3]
changed: [rke2-server1]
changed: [rke2-agent2]
changed: [rke2-agent1]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Batch Updating System Packages

This applies when the servers already run Rocky Linux 9 and only the system packages need updating. If no reboot is required, set reboot_after_update to false (it can also be overridden at run time, as shown after the playbook).

micro playbooks/update-packages.yml :

- name: Update Rocky Linux packages
  hosts: all
  become: true
  serial: 1
  vars:
    reboot_after_update: true
  tasks:
    - name: Update package metadata
      ansible.builtin.dnf:
        update_cache: true

    - name: Upgrade all packages
      ansible.builtin.dnf:
        name: "*"
        state: latest

    - name: Remove unneeded packages
      ansible.builtin.dnf:
        autoremove: true

    - name: Clean package cache
      ansible.builtin.command: dnf clean all
      changed_when: false

    - name: Reboot after update (optional)
      ansible.builtin.reboot:
        reboot_timeout: 3600
      when: reboot_after_update
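If a reboot is not desired for a particular run, the reboot_after_update variable can also be overridden from the command line without editing the playbook:

ansible-playbook playbooks/update-packages.yml -e reboot_after_update=false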

Execute:

ansible-playbook playbooks/update-packages.yml

The output is as follows:

yun@yun ~/V/a/y/p/a/rke2-ansible (master)> ansible-playbook playbooks/update-packages.yml 

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server1]

TASK [Update package metadata] **************************************************************
ok: [rke2-server1]

TASK [Upgrade all packages] *****************************************************************
ok: [rke2-server1]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server1]

TASK [Clean package cache] ******************************************************************
ok: [rke2-server1]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server1]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server2]

TASK [Update package metadata] **************************************************************
ok: [rke2-server2]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-server2]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server2]

TASK [Clean package cache] ******************************************************************
ok: [rke2-server2]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server2]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-server3]

TASK [Update package metadata] **************************************************************
ok: [rke2-server3]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-server3]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-server3]

TASK [Clean package cache] ******************************************************************
ok: [rke2-server3]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-server3]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent1]

TASK [Update package metadata] **************************************************************
ok: [rke2-agent1]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-agent1]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-agent1]

TASK [Clean package cache] ******************************************************************
ok: [rke2-agent1]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-agent1]

PLAY [Update Rocky Linux packages] **********************************************************

TASK [Gathering Facts] **********************************************************************
ok: [rke2-agent2]

TASK [Update package metadata] **************************************************************
ok: [rke2-agent2]

TASK [Upgrade all packages] *****************************************************************
changed: [rke2-agent2]

TASK [Remove unneeded packages] *************************************************************
ok: [rke2-agent2]

TASK [Clean package cache] ******************************************************************
ok: [rke2-agent2]

TASK [Reboot after update (optional)] *******************************************************
changed: [rke2-agent2]

PLAY RECAP **********************************************************************************
rke2-agent1 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-agent2 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server1 : ok=6 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server2 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
rke2-server3 : ok=6 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Deploying RKE2

Writing RKE2 Variables

The rke2_config of lablabs.rke2 is a template path (default templates/config.yaml.j2); do not write it as a dictionary. Parameters that need to be written to config.yaml should be placed in rke2_server_options / rke2_agent_options.

micro group_vars/all.yml :

rke2_cluster_group_name: "rke2_cluster"
rke2_servers_group_name: "rke2_servers"
rke2_agents_group_name: "rke2_agents"

rke2_channel: "latest"
rke2_version: "v1.34.2+rke2r1"
rke2_token: "CHANGE_ME"
rke2_api_ip: "<LB or server1>"
rke2_additional_sans:
  - "<LB or server1>"
rke2_selinux: true
rke2_cni:
  - cilium

rke2_token is a shared secret used for cluster registration and must be identical on every node; it can be generated with openssl rand -base64 32. rke2_api_ip is the control plane entry address: if you have an LB/VIP, use its IP or domain name; if not, and each machine has only a single fixed IP, you can use the IP/domain of the first control plane node (e.g. rke2-server1) and add the same value to rke2_additional_sans. Note that this pins the API to a single node, so the control plane entry itself is not highly available; an LB/VIP is recommended for production. Because Rocky Linux ships with SELinux in enforcing mode, be sure to set rke2_selinux: true and make sure container-selinux is installed. When using Cilium, point rke2_cni to cilium.
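As a concrete example, the token can be generated and, if you prefer not to keep it in plain text, encrypted with Ansible Vault; the encrypted output can then be pasted into group_vars/all.yml in place of the CHANGE_ME placeholder (optional, not required by the role):

openssl rand -base64 32
ansible-vault encrypt_string '<generated token>' --name rke2_token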

micro group_vars/rke2_servers.yml :

rke2_server_options:
  - write-kubeconfig-mode: "0644"

micro group_vars/rke2_agents.yml :

rke2_agent_options:
  - node-ip: "{{ ansible_default_ipv4.address }}"

Mark the first control plane as the initialization node, micro host_vars/rke2-server1.yml :

rke2_server_options:
  - write-kubeconfig-mode: "0644"
  - cluster-init: true

Writing Playbook

micro playbooks/site.yml :

- name: Base setup
  hosts: all
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.package:
        name:
          - curl
          - tar
          - socat
          - conntrack
          - iptables
          - container-selinux
        state: present

    - name: Disable swap
      ansible.builtin.command: swapoff -a
      when: ansible_swaptotal_mb | int > 0
      changed_when: false

    - name: Remove swap from fstab
      ansible.builtin.replace:
        path: /etc/fstab
        regexp: '^(.*\sswap\s.*)$'
        replace: '# \1'

    - name: Load br_netfilter
      ansible.builtin.modprobe:
        name: br_netfilter
        state: present

    - name: Enable sysctl for Kubernetes
      ansible.builtin.sysctl:
        name: "{{ item.name }}"
        value: "{{ item.value }}"
        state: present
        reload: true
      loop:
        - { name: net.bridge.bridge-nf-call-iptables, value: 1 }
        - { name: net.bridge.bridge-nf-call-ip6tables, value: 1 }
        - { name: net.ipv4.ip_forward, value: 1 }

- name: RKE2 servers
  hosts: rke2_servers
  become: true
  serial: 1
  roles:
    - role: lablabs.rke2

- name: RKE2 agents
  hosts: rke2_agents
  become: true
  roles:
    - role: lablabs.rke2

Deployment and Verification

Executing Deployment

Perform a syntax check first:

ansible-playbook playbooks/site.yml --syntax-check

Execute deployment:

ansible-playbook playbooks/site.yml
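Once the playbook finishes, a quick ad-hoc check can confirm that the RKE2 services are active on every node (rke2-server on control plane nodes, rke2-agent on workers):

ansible rke2_servers -b -a "systemctl is-active rke2-server"
ansible rke2_agents -b -a "systemctl is-active rke2-agent"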

Getting kubeconfig

Log in to any control plane node and export the kubeconfig:

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o wide

If using kubectl locally, you can copy the kubeconfig:

mkdir -p ~/.kube
scp rke2-server1:/etc/rancher/rke2/rke2.yaml ~/.kube/rke2.yaml
sed -i 's/127.0.0.1/<LB or server1>/g' ~/.kube/rke2.yaml
export KUBECONFIG=~/.kube/rke2.yaml
kubectl get nodes -o wide
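As a further smoke test, listing all pods should show the control plane components and the CNI in Running state:

kubectl get pods -A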

At this point, the deployment of the minimum high-availability RKE2 cluster is complete.