High Availability with Keepalived
Ensuring continuous availability for edge messaging is critical for IoT and Industrial IoT (IIoT) deployments. This guide describes a Primary-Standby High Availability (HA) solution for EMQX Edge using VRRP (Virtual Router Redundancy Protocol) via Keepalived.
By implementing this pattern, you provision a floating Virtual IP (VIP) address. Your MQTT clients and edge applications always connect to this VIP, which Keepalived assigns to the active (Primary) EMQX Edge node. If the primary node experiences a hardware or software failure, Keepalived detects it and moves the VIP to the Standby node within approximately 5 seconds. Clients reconnect to the same address, so failover requires no client-side reconfiguration.
Important Notice
This VRRP and VIP-based architecture is designed for on-premises bare-metal servers, local virtual machines, and private edge networks. Standard VRRP and floating VIPs generally do not work natively in most public cloud VM environments (e.g., AWS, Azure, Google Cloud). Public clouds typically block the multicast/broadcast traffic VRRP relies on and restrict arbitrary MAC/IP address reassignment at the virtual switch layer.
If you are deploying EMQX Edge in a public cloud, use the cloud provider's native load balancer (e.g., AWS NLB, Azure Load Balancer) to route traffic to your nodes instead of Keepalived.
Architecture
        MQTT Clients / Edge Devices
                     │ (Connect strictly to VIP)
                     ▼
          VIP: 192.168.1.100:1883
                     │
      ┌──────────────┴──────────────┐
      │                             │
[Primary Node]                [Standby Node]
EMQX Edge (MASTER)            EMQX Edge (BACKUP)
IP: 192.168.1.10              IP: 192.168.1.11
Keepalived                    Keepalived
      │                             │
      └─────────── VRRP ────────────┘
            (Health Heartbeat)
Keepalived continuously monitors the health of the EMQX Edge service. If the primary node fails to respond to health checks, the standby node is promoted and acquires the VIP.
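The promotion rule can be illustrated with a small sketch. This is not how Keepalived is implemented; it only mirrors the arithmetic of the configuration used later in this guide (base priorities of 100 and 90, and a `weight` of -30 applied while the health check fails). The function names `effective_priority` and `vip_holder` are hypothetical.

```shell
#!/bin/bash
# Illustrative only: shows how a negative vrrp_script weight changes
# which node should hold the VIP. Function names are hypothetical.

# effective_priority BASE WEIGHT HEALTHY(0|1)
effective_priority() {
    local base=$1 weight=$2 healthy=$3
    if [ "$healthy" -eq 1 ]; then
        echo "$base"
    else
        echo $(( base + weight ))   # weight is negative, e.g. -30
    fi
}

# vip_holder PRIMARY_HEALTHY STANDBY_HEALTHY
# Prints "primary" or "standby" depending on the higher effective priority.
vip_holder() {
    local p s
    p=$(effective_priority 100 -30 "$1")
    s=$(effective_priority 90 -30 "$2")
    if [ "$p" -gt "$s" ]; then echo primary; else echo standby; fi
}

vip_holder 1 1   # both healthy: 100 vs 90
vip_holder 0 1   # primary check failing: 70 vs 90
```

With these values, a failing check on the primary drops it to 70, below the standby's 90, which is why the weight must be larger than the gap between the two base priorities.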
Prerequisites
Download and extract EMQX Edge on both nodes. This guide uses version 1.3.0 as an example:
wget https://www.emqx.com/en/downloads/emqx-edge/1.3.0/emqx-edge-1.3.0-linux-amd64.zip
unzip emqx-edge-1.3.0-linux-amd64.zip
Core HA Scripts
Regardless of whether you deploy on bare-metal machines or via Docker, Keepalived requires two core scripts.
Health Check Script
This script verifies whether EMQX Edge is running properly by checking its HTTP API, with a fallback TCP port check.
Create check_emqx_edge.sh:
#!/bin/bash
# Check EMQX Edge health via its HTTP API
# Returns 0 = healthy, non-zero = unhealthy
EMQX_EDGE_HOST="127.0.0.1"
HTTP_PORT="8081"
TIMEOUT=3
# Try the EMQX Edge HTTP health endpoint
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    --max-time "$TIMEOUT" \
    "http://${EMQX_EDGE_HOST}:${HTTP_PORT}/api/v4/")
# 200 means healthy; 401 means auth is required, but the broker is still responding to HTTP requests
if [ "$HTTP_STATUS" = "200" ] || [ "$HTTP_STATUS" = "401" ]; then
    exit 0
fi
# Fallback: check if the MQTT port (1883) is accepting connections
if nc -z -w "$TIMEOUT" "$EMQX_EDGE_HOST" 1883 2>/dev/null; then
    exit 0
fi
# EMQX Edge is down
echo "EMQX Edge health check FAILED (HTTP: $HTTP_STATUS)"
exit 1
State Notification Script
This script logs state transitions, giving you an audit trail of when failovers occur.
Create notify.sh:
#!/bin/bash
STATE=$1
HOSTNAME=$(hostname)
# Include the date so the log remains a usable audit trail across days
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $HOSTNAME → $STATE"
case "$STATE" in
    MASTER)
        echo ">>> This node is now ACTIVE - VIP acquired"
        # Optional: Add webhook triggers or email alerts here
        ;;
    BACKUP)
        echo ">>> This node is now STANDBY"
        ;;
    FAULT)
        echo ">>> This node is FAULTED"
        ;;
esac
Make both scripts executable:
chmod +x check_emqx_edge.sh notify.sh
Deployment on Bare-Metal or Virtual Machines
This section covers deployment directly on physical or virtual machines for maximum performance and direct network access.
Assumptions for this guide:
- Network interface: eth0
- Primary IP: 192.168.1.10
- Standby IP: 192.168.1.11
- Virtual IP (VIP): 192.168.1.100
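A VIP is normally chosen from the same subnet as the nodes that share it, and it must not already be in use. The following sketch checks the first condition for the /24 layout assumed above. `same_slash24` is a hypothetical helper, not part of Keepalived, and it only compares the first three octets.

```shell
#!/bin/bash
# Hypothetical pre-flight helper: succeeds if two IPv4 addresses
# share the same /24 (i.e. their first three octets are equal).
same_slash24() {
    [ "${1%.*}" = "${2%.*}" ]
}

VIP="192.168.1.100"
for node in 192.168.1.10 192.168.1.11; do
    if same_slash24 "$node" "$VIP"; then
        echo "$node and $VIP share a /24"
    else
        echo "WARNING: $node is not in the same /24 as $VIP"
    fi
done
```

If your network uses a different prefix length, this simple string comparison does not apply and a proper subnet calculation is needed.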
Install Keepalived
On both Ubuntu/Debian machines, install Keepalived and the required utilities:
sudo apt-get update
sudo apt-get install -y keepalived curl netcat-openbsd iproute2
Deploy Scripts
Copy the scripts from Core HA Scripts to the Keepalived directory on both machines:
sudo mkdir -p /etc/keepalived
sudo cp check_emqx_edge.sh /etc/keepalived/
sudo cp notify.sh /etc/keepalived/
sudo chmod +x /etc/keepalived/*.sh
Configure the Primary Node
On the primary node (192.168.1.10), create /etc/keepalived/keepalived.conf:
global_defs {
    router_id EMQX_EDGE_PRIMARY
    script_user root
    enable_script_security
}
vrrp_script check_emqx_edge {
    script "/etc/keepalived/check_emqx_edge.sh"
    interval 2    # Check every 2 seconds
    timeout 5     # Timeout after 5 seconds
    fall 2        # Require 2 failures to mark as down
    rise 1        # Require 1 success to mark as up
    weight -30    # Reduce priority by 30 if check fails
}
vrrp_instance EMQX_EDGE_HA {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100    # Higher priority than Standby
    advert_int 1
    preempt
    # Unicast is recommended to avoid multicast blocking
    unicast_src_ip 192.168.1.10
    unicast_peer {
        192.168.1.11
    }
    authentication {
        auth_type PASS
        auth_pass REPLACE_WITH_YOUR_OWN_SHARED_VRRP_PASSWORD
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
    track_script {
        check_emqx_edge
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
Configure the Standby Node
On the standby node (192.168.1.11), create /etc/keepalived/keepalived.conf:
global_defs {
    router_id EMQX_EDGE_STANDBY
    script_user root
    enable_script_security
}
vrrp_script check_emqx_edge {
    script "/etc/keepalived/check_emqx_edge.sh"
    interval 2
    timeout 5
    fall 2
    rise 1
    weight -30
}
vrrp_instance EMQX_EDGE_HA {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90    # Lower priority than Primary
    advert_int 1
    unicast_src_ip 192.168.1.11
    unicast_peer {
        192.168.1.10
    }
    authentication {
        auth_type PASS
        auth_pass REPLACE_WITH_YOUR_OWN_SHARED_VRRP_PASSWORD
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
    track_script {
        check_emqx_edge
    }
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
Start Services
Start EMQX Edge and Keepalived on both machines:
# Start EMQX Edge
cd /path/to/emqx-edge-1.3.0-linux-amd64
./nanomq start -d
# Start Keepalived
sudo systemctl enable keepalived
sudo systemctl start keepalived
Deployment with Docker
Docker provides a way to package EMQX Edge and Keepalived together for easier deployment. Docker environments require specific network privileges (NET_ADMIN) for VIP routing.
Project Structure
Set up your directory as follows:
emqx-edge-ha/
├── docker-compose.yml
├── Dockerfile
├── entrypoint.sh
├── emqx-edge-1.3.0-linux-amd64/
│ └── (extracted EMQX Edge files)
└── HA/
    ├── keepalived-primary.conf
    ├── keepalived-standby.conf
    ├── check_emqx_edge.sh
    └── notify.sh
Place the scripts from Core HA Scripts in the HA/ directory.
Keepalived Configurations for Docker
The Docker deployment uses the subnet 172.22.0.0/24. Create the configuration files in HA/ using the same structure as the bare-metal examples, with the IP values updated as shown below.
HA/keepalived-primary.conf
| Parameter | Value |
|---|---|
| unicast_src_ip | 172.22.0.10 |
| unicast_peer | 172.22.0.11 |
| virtual_ipaddress | 172.22.0.100/24 dev eth0 |
HA/keepalived-standby.conf
| Parameter | Value |
|---|---|
| unicast_src_ip | 172.22.0.11 |
| unicast_peer | 172.22.0.10 |
| virtual_ipaddress | 172.22.0.100/24 dev eth0 |
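Because only the address prefix differs between the two deployments, one possible way to produce the Docker files is to derive them from the bare-metal configurations with a substitution. The sketch below assumes you saved the bare-metal files under the hypothetical names `primary.conf` and `standby.conf`; `to_docker_subnet` is also a hypothetical helper. It rewrites `192.168.1.` to `172.22.0.` and leaves everything else untouched.

```shell
#!/bin/bash
# to_docker_subnet: read a keepalived.conf on stdin and write it to stdout
# with the 192.168.1.x addresses mapped onto the 172.22.0.x Docker subnet.
to_docker_subnet() {
    sed 's/192\.168\.1\./172.22.0./g'
}

# Demonstration on single lines. Real use would redirect whole files, e.g.
#   to_docker_subnet < primary.conf > HA/keepalived-primary.conf
#   to_docker_subnet < standby.conf > HA/keepalived-standby.conf
echo "unicast_src_ip 192.168.1.10" | to_docker_subnet
echo "192.168.1.100/24 dev eth0" | to_docker_subnet
```

Review the generated files against the tables above before using them, since the substitution knows nothing about the meaning of each line.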
Docker Entrypoint
Create entrypoint.sh to start both services inside the container:
#!/bin/bash
set -e
# Clean up stale PID files from unclean shutdowns
if [ -f "/tmp/nanomq/nanomq.pid" ]; then
    rm -f /tmp/nanomq/nanomq.pid
fi
echo "Starting EMQX Edge..."
./nanomq start &
echo "Starting Keepalived..."
keepalived --dont-fork --log-console --log-detail &
wait
Dockerfile
Create a Dockerfile to build the unified image:
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y \
    unzip curl netcat-openbsd iproute2 keepalived \
    && rm -rf /var/lib/apt/lists/*
RUN mkdir -p /opt/emqx-edge
COPY emqx-edge-1.3.0-linux-amd64/. /opt/emqx-edge/
RUN chmod +x /opt/emqx-edge/nanomq
COPY HA/check_emqx_edge.sh /etc/keepalived/check_emqx_edge.sh
COPY HA/notify.sh /etc/keepalived/notify.sh
RUN chmod +x /etc/keepalived/check_emqx_edge.sh /etc/keepalived/notify.sh
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
WORKDIR /opt/emqx-edge
ENTRYPOINT ["/entrypoint.sh"]
Docker Compose
Create docker-compose.yml to orchestrate the two-node cluster:
networks:
  emqx-edge-ha-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.22.0.0/24
services:
  emqx-edge-primary:
    build: .
    container_name: emqx-edge-primary
    hostname: emqx-edge-primary
    cap_add:
      - NET_ADMIN      # Required: lets Keepalived add/remove the VIP
      - NET_BROADCAST  # Required: VRRP advertisements
    networks:
      emqx-edge-ha-net:
        ipv4_address: 172.22.0.10
    volumes:
      - ./HA/keepalived-primary.conf:/etc/keepalived/keepalived.conf
    ports:
      - "1883:1883"
    restart: unless-stopped
  emqx-edge-standby:
    build: .
    container_name: emqx-edge-standby
    hostname: emqx-edge-standby
    cap_add:
      - NET_ADMIN
      - NET_BROADCAST
    networks:
      emqx-edge-ha-net:
        ipv4_address: 172.22.0.11
    volumes:
      - ./HA/keepalived-standby.conf:/etc/keepalived/keepalived.conf
    ports:
      - "1884:1883"  # Exposed on a different host port for debugging
    restart: unless-stopped
Build and start the containers:
cd emqx-edge-ha
docker compose build
docker compose up -d
Test the HA Failover
Follow these steps to verify that failover works correctly.
Verify Initial VIP Assignment
Check which node currently holds the VIP.
For bare-metal:
# Run on the primary node
ip addr show eth0 | grep 192.168.1.100
For Docker:
docker exec emqx-edge-primary ip addr show eth0 | grep 172.22.0.100
docker exec emqx-edge-standby ip addr show eth0 | grep 172.22.0.100
Simulate a Failure
Stop the EMQX Edge process on the primary node to trigger failover.
For bare-metal:
killall nanomq
For Docker:
docker exec emqx-edge-primary pkill nanomq
Observe Failover
Within approximately 5 seconds, check_emqx_edge.sh fails twice in a row on the primary node, Keepalived lowers the primary's priority below the standby's, and the standby node acquires the VIP.
For bare-metal:
# Run on the standby node
ip addr show eth0 | grep 192.168.1.100
For Docker:
docker exec emqx-edge-standby ip addr show eth0 | grep 172.22.0.100
Any active MQTT client connected to the VIP will momentarily disconnect and reconnect as traffic is routed to the standby node.
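Because clients see a short disconnect, they need some form of retry (most MQTT client libraries provide one). The sketch below shows the idea in shell: it retries a probe command until it succeeds or the attempts run out. `wait_until` is a hypothetical helper; the `nc` probe in the usage comment is the same port check used by check_emqx_edge.sh, pointed at the VIP instead of localhost.

```shell
#!/bin/bash
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds and prints the attempt that succeeded.
# Returns 1 if every attempt fails.
wait_until() {
    local max=$1 delay=$2 attempt
    shift 2
    for (( attempt = 1; attempt <= max; attempt++ )); do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        sleep "$delay"
    done
    echo "gave up after $max attempts" >&2
    return 1
}

# Example against the bare-metal VIP (requires nc and a running cluster):
#   wait_until 10 1 nc -z -w 1 192.168.1.100 1883
```

Running the example immediately after stopping EMQX Edge on the primary gives a rough measure of how long the VIP was unreachable from that machine.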
Restore the Primary Node
Restart the EMQX Edge service on the primary node. Because preempt is enabled, the primary node reclaims the VIP once it passes health checks.
For bare-metal:
./nanomq start -d
For Docker:
# -d detaches so the command returns instead of staying attached to the broker
docker exec -d emqx-edge-primary ./nanomq start
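The `grep` checks used in the test steps treat the dots in the address as regular-expression wildcards and match any line that merely contains the address text. For scripted checks, a stricter match may be preferable. The following is a sketch under that assumption; `has_vip` is a hypothetical helper that reads the output of `ip -o -4 addr show` on stdin, and the sample line is illustrative rather than captured from a real system.

```shell
#!/bin/bash
# has_vip VIP
# Succeeds if stdin (output of `ip -o -4 addr show`) lists exactly VIP.
has_vip() {
    local vip=$1
    # -F treats the pattern as a fixed string, so dots are literal.
    # The leading "inet " and trailing "/" bound the address on both sides.
    grep -qF "inet ${vip}/"
}

# Illustrative sample line in the one-line-per-address format of `ip -o`
sample='2: eth0    inet 192.168.1.100/24 scope global secondary eth0'
if echo "$sample" | has_vip 192.168.1.100; then
    echo "VIP present"
fi

# On a real node:
#   ip -o -4 addr show dev eth0 | has_vip 192.168.1.100 && echo "VIP present"
```

After restoring the primary, running the real-node form on both machines should show the VIP on the primary only.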