High Availability with Keepalived

Ensuring continuous availability for edge messaging is critical for IoT and Industrial IoT (IIoT) deployments. This guide describes a Primary-Standby High Availability (HA) solution for EMQX Edge using VRRP (Virtual Router Redundancy Protocol) via Keepalived.

With this pattern, you provision a floating Virtual IP (VIP) address, and your MQTT clients and edge applications always connect to that VIP. Keepalived assigns the VIP to the active (Primary) EMQX Edge node. If the primary node suffers a hardware or software failure, Keepalived detects it and moves the VIP to the Standby node within approximately 5 seconds. Clients then reconnect to the same address without any configuration change.

Important Notice

This VRRP and VIP-based architecture is designed for on-premises bare-metal servers, local virtual machines, and private edge networks. Standard VRRP and floating VIPs generally do not work natively in most public cloud VM environments (e.g., AWS, Azure, Google Cloud). Public clouds typically block the multicast/broadcast traffic VRRP relies on and restrict arbitrary MAC/IP address reassignment at the virtual switch layer.

If you are deploying EMQX Edge in a public cloud, use the cloud provider's native load balancer (e.g., AWS NLB, Azure Load Balancer) to route traffic to your nodes instead of Keepalived.

Architecture

MQTT Clients / Edge Devices
         │ (Connect strictly to VIP)

   VIP: 192.168.1.100:1883

    ┌────┴─────────────────────────┐
    │                              │
[Primary Node]              [Standby Node]
EMQX Edge (MASTER)          EMQX Edge (BACKUP)
IP: 192.168.1.10            IP: 192.168.1.11
Keepalived                  Keepalived
    │                              │
    └─────────── VRRP ─────────────┘
          (Health Heartbeat)

Keepalived on each node continuously monitors the health of its local EMQX Edge service. If the health check fails on the primary node, the primary's VRRP priority drops below the standby's, and the standby node is promoted and acquires the VIP.

Prerequisites

Download and extract EMQX Edge on both nodes. This guide uses version 1.3.0 as an example:

bash
wget https://www.emqx.com/en/downloads/emqx-edge/1.3.0/emqx-edge-1.3.0-linux-amd64.zip
unzip emqx-edge-1.3.0-linux-amd64.zip

Core HA Scripts

Regardless of whether you deploy on bare-metal machines or via Docker, Keepalived requires two core scripts.

Health Check Script

This script verifies whether EMQX Edge is running properly by checking its HTTP API, with a fallback TCP port check.

Create check_emqx_edge.sh:

bash
#!/bin/bash
# Check EMQX Edge health via its HTTP API
# Returns 0 = healthy, non-zero = unhealthy
EMQX_EDGE_HOST="127.0.0.1"
HTTP_PORT="8081"
TIMEOUT=3

# Try the EMQX Edge HTTP health endpoint
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  --max-time $TIMEOUT \
  "http://${EMQX_EDGE_HOST}:${HTTP_PORT}/api/v4/")

# 200 means the API answered normally; 401 means auth is required,
# but the broker is still responding to HTTP requests
if [ "$HTTP_STATUS" = "200" ] || [ "$HTTP_STATUS" = "401" ]; then
  exit 0
fi

# Fallback: check if the MQTT port (1883) is accepting connections
if nc -z -w $TIMEOUT $EMQX_EDGE_HOST 1883 2>/dev/null; then
  exit 0
fi

# EMQX Edge is down
echo "EMQX Edge health check FAILED (HTTP: $HTTP_STATUS)"
exit 1
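
The status-code decision in the script can be exercised without a running broker. The following is a minimal sketch rather than part of the original script: `is_http_alive` is a hypothetical helper name, and it accepts 200 and 401 as the script does. Note that curl reports `000` for `%{http_code}` when no HTTP response was received.

```shell
#!/bin/bash
# Hypothetical helper (not part of EMQX Edge or Keepalived): returns 0 when
# the HTTP status code means the broker answered the request.
is_http_alive() {
  case "$1" in
    200|401) return 0 ;;   # 401: auth required, but the HTTP API is responding
    *)       return 1 ;;   # includes 000, which curl prints when nothing answered
  esac
}

# Example: classify a few status codes without contacting a broker
for code in 200 401 000 500; do
  if is_http_alive "$code"; then
    echo "$code healthy"
  else
    echo "$code unhealthy"
  fi
done
```

Keeping the rule in one function makes it easier to adjust if your deployment should treat other responses as healthy.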

State Notification Script

This script logs state transitions, giving you an audit trail of when failovers occur.

Create notify.sh:

bash
#!/bin/bash
STATE=$1
HOSTNAME=$(hostname)
echo "[$(date '+%H:%M:%S')] $HOSTNAME -> $STATE"
case "$STATE" in
  MASTER)
    echo ">>> This node is now ACTIVE — VIP acquired"
    # Optional: Add webhook triggers or email alerts here
    ;;
  BACKUP)
    echo ">>> This node is now STANDBY"
    ;;
  FAULT)
    echo ">>> This node is FAULTED"
    ;;
esac

Make both scripts executable:

bash
chmod +x check_emqx_edge.sh notify.sh
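
Depending on how Keepalived is launched, text that a notify script writes to standard output may not be retained anywhere. One option is to append each transition to a file. The sketch below assumes a hypothetical helper name (`log_transition`) and a hypothetical default log path; neither comes from EMQX Edge or Keepalived.

```shell
#!/bin/bash
# Hypothetical helper: append one timestamped transition line to a log file.
# The default path is an assumption; use any location writable by script_user.
log_transition() {
  local state="$1"
  local logfile="${2:-/var/log/emqx-edge-ha.log}"
  # uname -n prints the node name, the same value hostname reports
  printf '[%s] %s -> %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$state" >> "$logfile"
}

# Example: record a transition in a scratch file and show it
demo_log="$(mktemp)"
log_transition MASTER "$demo_log"
cat "$demo_log"
```

Calling such a helper from each branch of the `case` statement in notify.sh would give the audit trail a full date as well as a time.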

Deployment on Bare-Metal or Virtual Machines

This section covers deployment directly on physical or virtual machines for maximum performance and direct network access.

Assumptions for this guide:

  • Network interface: eth0
  • Primary IP: 192.168.1.10
  • Standby IP: 192.168.1.11
  • Virtual IP (VIP): 192.168.1.100

Install Keepalived

On both Ubuntu/Debian machines, install Keepalived and the required utilities:

bash
sudo apt-get update
sudo apt-get install -y keepalived curl netcat-openbsd iproute2

Deploy Scripts

Copy the scripts from Core HA Scripts to the Keepalived directory on both machines:

bash
sudo mkdir -p /etc/keepalived
sudo cp check_emqx_edge.sh /etc/keepalived/
sudo cp notify.sh /etc/keepalived/
sudo chmod +x /etc/keepalived/*.sh

Configure the Primary Node

On the primary node (192.168.1.10), create /etc/keepalived/keepalived.conf:

global_defs {
  router_id EMQX_EDGE_PRIMARY
  script_user root
  enable_script_security
}

vrrp_script check_emqx_edge {
  script       "/etc/keepalived/check_emqx_edge.sh"
  interval     2     # Check every 2 seconds
  timeout      5     # Timeout after 5 seconds
  fall         2     # Require 2 failures to mark as down
  rise         1     # Require 1 success to mark as up
  weight       -30   # Reduce priority by 30 if check fails
}

vrrp_instance EMQX_EDGE_HA {
  state             MASTER
  interface         eth0
  virtual_router_id 51
  priority          100  # Higher priority than Standby
  advert_int        1
  preempt

  # Unicast is recommended to avoid multicast blocking
  unicast_src_ip 192.168.1.10
  unicast_peer {
    192.168.1.11
  }

  authentication {
    auth_type PASS
    auth_pass REPLACE_WITH_YOUR_OWN_SHARED_VRRP_PASSWORD  # Keepalived uses only the first 8 characters
  }

  virtual_ipaddress {
    192.168.1.100/24 dev eth0
  }

  track_script {
    check_emqx_edge
  }

  notify_master "/etc/keepalived/notify.sh MASTER"
  notify_backup "/etc/keepalived/notify.sh BACKUP"
  notify_fault  "/etc/keepalived/notify.sh FAULT"
}

Configure the Standby Node

On the standby node (192.168.1.11), create /etc/keepalived/keepalived.conf:

global_defs {
  router_id EMQX_EDGE_STANDBY
  script_user root
  enable_script_security
}

vrrp_script check_emqx_edge {
  script       "/etc/keepalived/check_emqx_edge.sh"
  interval     2
  timeout      5
  fall         2
  rise         1
  weight       -30
}

vrrp_instance EMQX_EDGE_HA {
  state             BACKUP
  interface         eth0
  virtual_router_id 51
  priority          90   # Lower priority than Primary
  advert_int        1

  unicast_src_ip 192.168.1.11
  unicast_peer {
    192.168.1.10
  }

  authentication {
    auth_type PASS
    auth_pass REPLACE_WITH_YOUR_OWN_SHARED_VRRP_PASSWORD  # Must match the primary; only the first 8 characters are used
  }

  virtual_ipaddress {
    192.168.1.100/24 dev eth0
  }

  track_script {
    check_emqx_edge
  }

  notify_master "/etc/keepalived/notify.sh MASTER"
  notify_backup "/etc/keepalived/notify.sh BACKUP"
  notify_fault  "/etc/keepalived/notify.sh FAULT"
}
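
The way the `priority` and `weight` values in the two configurations interact can be checked with simple arithmetic. This is a sketch only: `effective_priority` is a hypothetical helper name, and it models just the negative-weight case used in this guide, where a failing check lowers the node's priority by the weight.

```shell
#!/bin/bash
# Hypothetical helper: effective VRRP priority for a node, given its base
# priority, the track_script weight, and whether the health check passes.
effective_priority() {
  local base="$1" weight="$2" check="$3"
  if [ "$check" = "ok" ]; then
    echo "$base"
  else
    echo $(( base + weight ))   # weight is negative, so the priority drops
  fi
}

# Primary with a failing check versus a healthy standby
primary=$(effective_priority 100 -30 fail)
standby=$(effective_priority 90 -30 ok)
echo "primary=$primary standby=$standby"
if [ "$primary" -lt "$standby" ]; then
  echo "standby takes the VIP"
fi
```

The weight must be larger in magnitude than the gap between the two priorities (10 here), otherwise a failed check on the primary would not hand the VIP to the standby.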

Start Services

Start EMQX Edge and Keepalived on both machines:

bash
# Start EMQX Edge
cd /path/to/emqx-edge-1.3.0-linux-amd64
./nanomq start -d

# Start Keepalived
sudo systemctl enable keepalived
sudo systemctl start keepalived

Deployment with Docker

Docker provides a way to package EMQX Edge and Keepalived together for easier deployment. Docker environments require specific network privileges (NET_ADMIN) for VIP routing.

Project Structure

Set up your directory as follows:

text
emqx-edge-ha/
├── docker-compose.yml
├── Dockerfile
├── entrypoint.sh
├── emqx-edge-1.3.0-linux-amd64/
│   └── (extracted EMQX Edge files)
└── HA/
    ├── keepalived-primary.conf
    ├── keepalived-standby.conf
    ├── check_emqx_edge.sh
    └── notify.sh

Place the scripts from Core HA Scripts in the HA/ directory.

Keepalived Configurations for Docker

The Docker deployment uses the subnet 172.22.0.0/24. Create the configuration files in HA/ using the same structure as the bare-metal examples, with the IP values updated as shown below.

HA/keepalived-primary.conf

| Parameter         | Value                    |
|-------------------|--------------------------|
| unicast_src_ip    | 172.22.0.10              |
| unicast_peer      | 172.22.0.11              |
| virtual_ipaddress | 172.22.0.100/24 dev eth0 |

HA/keepalived-standby.conf

| Parameter         | Value                    |
|-------------------|--------------------------|
| unicast_src_ip    | 172.22.0.11              |
| unicast_peer      | 172.22.0.10              |
| virtual_ipaddress | 172.22.0.100/24 dev eth0 |
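
If the bare-metal configuration files already exist, the Docker variants could be derived by rewriting the address prefix, since the host parts (.10, .11, .100) are the same in both setups. This is a minimal sketch, and `to_docker_conf` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: rewrite the bare-metal 192.168.1.x addresses to the
# Docker subnet 172.22.0.x. Reads a config on stdin, writes the result to stdout.
to_docker_conf() {
  sed 's/192\.168\.1\./172.22.0./g'
}

# Example on a few configuration lines
printf '%s\n' \
  'unicast_src_ip 192.168.1.10' \
  '    192.168.1.11' \
  '    192.168.1.100/24 dev eth0' | to_docker_conf
```

In practice you would redirect a saved bare-metal file into the function and write the result to HA/keepalived-primary.conf or HA/keepalived-standby.conf. Review the output before use, because the interface name inside the container may differ from the host's.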

Docker Entrypoint

Create entrypoint.sh to start both services inside the container:

bash
#!/bin/bash
set -e

# Clean up stale PID files from unclean shutdowns
if [ -f "/tmp/nanomq/nanomq.pid" ]; then
  rm -f /tmp/nanomq/nanomq.pid
fi

echo "Starting EMQX Edge..."
./nanomq start &

echo "Starting Keepalived..."
keepalived --dont-fork --log-console --log-detail &

wait

Dockerfile

Create a Dockerfile to build the unified image:

dockerfile
FROM ubuntu:24.04

RUN apt-get update && apt-get install -y \
    unzip curl netcat-openbsd iproute2 keepalived \
    && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /opt/emqx-edge
COPY emqx-edge-1.3.0-linux-amd64/. /opt/emqx-edge/
RUN chmod +x /opt/emqx-edge/nanomq

COPY HA/check_emqx_edge.sh /etc/keepalived/check_emqx_edge.sh
COPY HA/notify.sh /etc/keepalived/notify.sh
RUN chmod +x /etc/keepalived/check_emqx_edge.sh /etc/keepalived/notify.sh

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

WORKDIR /opt/emqx-edge
ENTRYPOINT ["/entrypoint.sh"]

Docker Compose

Create docker-compose.yml to orchestrate the two-node cluster:

yaml
networks:
  emqx-edge-ha-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.22.0.0/24

services:
  emqx-edge-primary:
    build: .
    container_name: emqx-edge-primary
    hostname: emqx-edge-primary
    cap_add:
      - NET_ADMIN       # Required: lets Keepalived add/remove the VIP
      - NET_BROADCAST   # Required: VRRP advertisements
    networks:
      emqx-edge-ha-net:
        ipv4_address: 172.22.0.10
    volumes:
      - ./HA/keepalived-primary.conf:/etc/keepalived/keepalived.conf
    ports:
      - "1883:1883"
    restart: unless-stopped

  emqx-edge-standby:
    build: .
    container_name: emqx-edge-standby
    hostname: emqx-edge-standby
    cap_add:
      - NET_ADMIN
      - NET_BROADCAST
    networks:
      emqx-edge-ha-net:
        ipv4_address: 172.22.0.11
    volumes:
      - ./HA/keepalived-standby.conf:/etc/keepalived/keepalived.conf
    ports:
      - "1884:1883"     # Exposed on a different host port for debugging
    restart: unless-stopped

Build and start the containers:

bash
cd emqx-edge-ha
docker compose build
docker compose up -d

Test the HA Failover

Follow these steps to verify that failover works correctly.

Verify Initial VIP Assignment

Check which node currently holds the VIP.

For bare-metal:

bash
# Run on the primary node
ip addr show eth0 | grep 192.168.1.100

For Docker:

bash
docker exec emqx-edge-primary ip addr show eth0 | grep 172.22.0.100
docker exec emqx-edge-standby ip addr show eth0 | grep 172.22.0.100

Simulate a Failure

Stop the EMQX Edge process on the primary node to trigger failover.

For bare-metal:

bash
killall nanomq

For Docker:

bash
docker exec emqx-edge-primary pkill nanomq

Observe Failover

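Keepalived on the primary node detects the failure via check_emqx_edge.sh after two consecutive failed checks (run 2 seconds apart) and lowers its priority from 100 to 70. Because that is below the standby's priority of 90, the standby node takes over and acquires the VIP. With the settings in this guide, the whole sequence typically completes within a few seconds, in line with the approximately 5 seconds stated in the introduction.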

For bare-metal:

bash
# Run on the standby node
ip addr show eth0 | grep 192.168.1.100

For Docker:

bash
docker exec emqx-edge-standby ip addr show eth0 | grep 172.22.0.100

Any active MQTT client connected to the VIP will momentarily disconnect and reconnect as traffic is routed to the standby node.
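
To watch for the moment the VIP moves rather than re-running the command by hand, the check can be wrapped in a small helper and polled. The sketch below is illustrative: `vip_present` is a hypothetical helper name, and it expects the one-line-per-address output of `ip -o -4 addr show` on standard input.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the given VIP appears in `ip -o -4 addr show`
# output supplied on stdin. The trailing "/" keeps 192.168.1.10 from matching
# 192.168.1.100.
vip_present() {
  grep -qF "inet $1/"
}

# Example with a captured sample line instead of a live interface
sample='2: eth0    inet 192.168.1.100/24 scope global secondary eth0'
if printf '%s\n' "$sample" | vip_present 192.168.1.100; then
  echo "VIP present"
fi

# On a real standby node (assumes interface eth0, as in this guide):
#   until ip -o -4 addr show dev eth0 | vip_present 192.168.1.100; do sleep 1; done
#   echo "VIP acquired at $(date '+%H:%M:%S')"
```

Starting the polling loop on the standby just before stopping EMQX Edge on the primary gives a rough measurement of the failover time.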

Restore the Primary Node

Restart the EMQX Edge service on the primary node. Because preempt is enabled, the primary node reclaims the VIP once it passes health checks.

For bare-metal:

bash
./nanomq start -d

For Docker:

bash
# -d runs the command detached so the broker keeps running after docker exec returns
docker exec -d emqx-edge-primary ./nanomq start