Broker Health Indicators

This page is a curated reference of the most useful Prometheus metrics for monitoring an EMQX broker. Use it together with Integrate with Prometheus, which covers how to expose and scrape these metrics.

The indicators are organized into four areas:

  1. System: operating system and Erlang VM resources.
  2. Broker: connection and message traffic, plus broker state.
  3. Authentication and Authorization: connect-time identity checks and per-message ACL decisions.
  4. Data Integration: rules, actions, connectors, and bridges.

All metrics are exposed on the EMQX Prometheus endpoints (/api/v5/prometheus/stats, /api/v5/prometheus/auth, and /api/v5/prometheus/data_integration). For endpoint details and mode query parameters, see Integrate with Prometheus.
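
As a rough illustration of what a scrape of these endpoints returns, the sketch below parses a small piece of Prometheus text exposition output into tuples. It is a simplified parser and not a replacement for a real Prometheus client. The sample text, including the `node` label and its value, is hypothetical and was not captured from a real broker.

```python
import re

# A sample line looks like: metric_name{label="value",...} 123.0
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical scrape output for illustration only.
SAMPLE_TEXT = """\
# TYPE emqx_connections_count gauge
emqx_connections_count 42
# TYPE emqx_messages_dropped counter
emqx_messages_dropped{node="emqx@10.0.0.1"} 7
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

In practice Prometheus performs this parsing itself. The sketch is mainly useful for ad hoc checks against a single node.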

Note on collector defaults

The metrics prefixed emqx_ are always exposed. The richer Erlang VM metrics prefixed erlang_vm_ come from the upstream Prometheus Erlang exporter and are disabled by default in EMQX 6.0 and newer. To enable process counts, per-allocator memory, or GC and scheduler breakdowns, set prometheus.collectors.vm_system_info, prometheus.collectors.vm_memory, and prometheus.collectors.vm_statistics to enabled.

System

These are the signals closest to the hardware layer. When the broker is unhealthy, one of them usually moves first.

CPU

| Metric | Description |
| --- | --- |
| emqx_vm_cpu_use | Percent CPU used. |
| emqx_vm_cpu_idle | Percent CPU idle. |

Memory

| Metric | Description |
| --- | --- |
| emqx_vm_total_memory | Total system memory (bytes). |
| emqx_vm_used_memory | Used system memory (bytes). |
| erlang_vm_memory_processes | Per-allocator memory: processes (requires the vm_memory collector to be enabled). |
| erlang_vm_memory_atom | Per-allocator memory: atoms. |
| erlang_vm_memory_binary | Per-allocator memory: binaries. |
| erlang_vm_memory_ets | Per-allocator memory: ETS tables. |
| erlang_vm_memory_code | Per-allocator memory: loaded code. |
| erlang_vm_memory_system | Per-allocator memory: system overhead. |

File Descriptors

| Metric | Description |
| --- | --- |
| emqx_vm_max_fds | Soft FD ulimit for the broker process. |

Erlang Processes and Scheduler Load

| Metric | Description |
| --- | --- |
| emqx_vm_run_queue | Current scheduler run queue length. A sustained non-zero value indicates CPU saturation. |
| emqx_vm_process_messages_in_queues | Sum of all Erlang process mailbox lengths. Large or growing values mean that at least one process cannot keep up with incoming work. |
| erlang_vm_process_count | Current Erlang process count (requires the vm_system_info collector to be enabled). |
| erlang_vm_process_limit | Configured maximum Erlang processes. |

Internal Mailbox Watchdogs

| Metric | Description |
| --- | --- |
| emqx_vm_mnesia_tm_mailbox_size | Mnesia transaction manager mailbox depth. High values indicate transactional contention. |
| emqx_vm_broker_pool_max_mailbox_size | Largest mailbox in the broker dispatch pool. High values indicate subscriber-side backpressure. |

Uptime

| Metric | Description |
| --- | --- |
| emqx_vm_uptime_ms | Broker uptime in milliseconds. A sudden drop to a small value means the node restarted. |
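
The restart signal described above can be sketched as a scan over successive uptime samples: any sample lower than its predecessor marks a restart. The function name and the sample values are hypothetical and only illustrate the idea.

```python
def find_restarts(uptime_samples_ms):
    """Return the indexes of samples where uptime dropped below the previous one."""
    restarts = []
    for i in range(1, len(uptime_samples_ms)):
        if uptime_samples_ms[i] < uptime_samples_ms[i - 1]:
            restarts.append(i)
    return restarts


# Hypothetical emqx_vm_uptime_ms values scraped every 15 seconds.
samples = [600_000, 615_000, 630_000, 4_000, 19_000]
print(find_restarts(samples))
```

Here the fourth sample (index 3) falls from about ten minutes of uptime to a few seconds, which is the pattern to look for.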

Cluster Replication Health (Mria)

| Metric | Description |
| --- | --- |
| emqx_mria_lag | Replication lag per replicant node. |
| emqx_mria_replicants | Replicant count. |
| emqx_mria_bootstrap_time | Time required for the last bootstrap. |
| emqx_mria_message_queue_len | Mria mailbox length. |

Overload Protection

| Metric | Description |
| --- | --- |
| emqx_overload_protection_new_conn | Connections refused due to overload. |
| emqx_overload_protection_gc | Forced garbage collections triggered by overload protection. |
| emqx_overload_protection_hibernation | Process hibernations triggered. |
| emqx_overload_protection_delay_ok | Successful delay applications. |
| emqx_overload_protection_delay_timeout | Delay attempts that timed out. |

Broker

Core operational signals. Watch the rate of message-related counters, and pay particular attention to the dropped series.
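
Because most of the message series are counters, the useful number is how fast they grow, not their absolute value. The sketch below computes a per-second rate between two scrapes and treats a decreasing value as a counter reset. It is a simplified approximation of what PromQL's rate() does, not its exact algorithm, and the function name is hypothetical.

```python
def per_second_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second increase of a counter between two scrapes."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_value < prev_value:
        # Counter went backwards: assume a restart reset it to zero,
        # so everything in curr_value accumulated after the reset.
        increase = curr_value
    else:
        increase = curr_value - prev_value
    return increase / elapsed


print(per_second_rate(1000, 0, 1060, 60))  # 1.0
print(per_second_rate(1000, 0, 30, 60))    # 0.5 (counter reset assumed)
```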

Cluster Topology

| Metric | Description |
| --- | --- |
| emqx_cluster_nodes_running | Running cluster nodes. |
| emqx_cluster_nodes_stopped | Stopped cluster nodes. Alert when this is greater than zero. |
| emqx_conf_sync_txid | Last cluster configuration transaction ID applied. Diverging values across nodes indicate a sync issue. |
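
One way to check emqx_conf_sync_txid for divergence is to compare each node's value with the highest value seen in the cluster. The sketch below assumes the per-node values have already been collected into a dictionary. The node names are hypothetical, and a brief gap while a change propagates is assumed to be normal, so only a gap that persists across scrapes should be treated as a problem.

```python
def lagging_nodes(txid_by_node):
    """Return {node: transactions_behind} for nodes behind the newest txid."""
    if not txid_by_node:
        return {}
    newest = max(txid_by_node.values())
    return {
        node: newest - txid
        for node, txid in txid_by_node.items()
        if txid < newest
    }


# Hypothetical values of emqx_conf_sync_txid, one per node.
observed = {"emqx@n1": 120, "emqx@n2": 120, "emqx@n3": 117}
print(lagging_nodes(observed))
```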

License (Enterprise)

| Metric | Description |
| --- | --- |
| emqx_license_expiry_at | License expiration time (UNIX epoch seconds). |
| emqx_license_issued_at | License issuance time. |
| emqx_license_max_sessions | License session cap. |
| emqx_cert_expiry_at | Listener certificate expiration time. |

Connections, Sessions, and Channels

| Metric | Description |
| --- | --- |
| emqx_connections_count | Current connection count. |
| emqx_connections_max | Peak connection count since boot. |
| emqx_live_connections_count | Currently connected (TCP up) clients. |
| emqx_live_connections_max | Peak live connections. |
| emqx_sessions_count | Active session count (includes persistent sessions whose client is currently disconnected). |
| emqx_sessions_max | Peak session count. |
| emqx_cluster_sessions_count | Cluster-wide session count. |
| emqx_cluster_sessions_max | Peak cluster-wide session count. |
| emqx_channels_count | Channel processes (one per connected client). |
| emqx_channels_max | Peak channel count. |

Subscriptions and Topics

| Metric | Description |
| --- | --- |
| emqx_subscriptions_count | Subscription count. |
| emqx_subscriptions_max | Peak subscriptions. |
| emqx_subscriptions_shared_count | Shared subscriptions. |
| emqx_subscriptions_shared_max | Peak shared subscriptions. |
| emqx_subscribers_count | Subscriber processes. |
| emqx_topics_count | Distinct topic count. |
| emqx_topics_max | Peak topics. |
| emqx_routes_count | Route table size. |
| emqx_routes_max | Peak route table size. |
| emqx_durable_subscriptions_count | Persistent-session subscriptions. |
| emqx_durable_subscriptions_max | Peak persistent-session subscriptions. |

Retained, Delayed, and Banned

| Metric | Description |
| --- | --- |
| emqx_retained_count | Retained message count. |
| emqx_retained_max | Peak retained count. |
| emqx_delayed_count | Delayed-publish queue depth. |
| emqx_delayed_max | Peak delayed queue depth. |
| emqx_banned_count | Banned client / username / IP entries. |

Messages

| Metric | Description |
| --- | --- |
| emqx_messages_received | Application-level messages received from clients. |
| emqx_messages_sent | Application-level messages sent to clients. |
| emqx_messages_publish | PUBLISH packets dispatched. |
| emqx_messages_delivered | Deliveries to subscribers (one published message can produce multiple deliveries). |
| emqx_messages_acked | Acknowledgements received from subscribers. |
| emqx_messages_forward | Cross-node message forwards. |
| emqx_messages_retained | Retained-message events. |
| emqx_messages_delayed | Delayed-publish enqueues. |

Message Drops (the Earliest Sign of Trouble)

| Metric | Description |
| --- | --- |
| emqx_messages_dropped | Total dropped messages. |
| emqx_messages_dropped_expired | Dropped because the message-expiry interval was exceeded. |
| emqx_messages_dropped_no_subscribers | Dropped because no subscriber matched. |
| emqx_messages_dropped_quota_exceeded | Dropped because a per-client quota was hit. |
| emqx_messages_dropped_receive_maximum | Dropped because the subscriber's MQTT v5 receive-maximum quota was hit. |

Per-Subscriber Delivery Drops

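| Metric | Description |
| --- | --- |
| emqx_delivery_dropped | Total deliveries dropped. |
| emqx_delivery_dropped_expired | Dropped because the message expired before delivery. |
| emqx_delivery_dropped_no_local | Dropped because of the MQTT v5 no-local rule. |
| emqx_delivery_dropped_qos | Dropped because the QoS is not supported. |
| emqx_delivery_dropped_queue_full | Dropped because the subscriber's mqueue was full. |
| emqx_delivery_dropped_too_large | Dropped because the message exceeds the subscriber's max packet size. |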

Bytes

| Metric | Description |
| --- | --- |
| emqx_bytes_received | Total bytes received. |
| emqx_bytes_sent | Total bytes sent. |

Packet-Level (for Protocol-Debug Dashboards)

| Metric | Description |
| --- | --- |
| emqx_packets_received | Total packets received. |
| emqx_packets_sent | Total packets sent. |
| emqx_packets_connect | CONNECT packets received. |
| emqx_packets_connack_sent | CONNACK packets sent. |
| emqx_packets_connack_error | CONNACK with a non-zero reason code (most per-client AUTHN failures show here). |
| emqx_packets_disconnect_received | DISCONNECT packets received. |
| emqx_packets_disconnect_sent | DISCONNECT packets sent. |
| emqx_packets_publish_received | PUBLISH packets received. |
| emqx_packets_publish_sent | PUBLISH packets sent. |
| emqx_packets_publish_error | PUBLISH that could not be accepted. |
| emqx_packets_publish_auth_error | PUBLISH denied by authorization. |
| emqx_packets_puback_received | PUBACK packets received (QoS 1). |
| emqx_packets_puback_sent | PUBACK packets sent (QoS 1). |
| emqx_packets_pubrec_received | PUBREC packets received (QoS 2). |
| emqx_packets_pubrec_sent | PUBREC packets sent (QoS 2). |
| emqx_packets_pubrel_received | PUBREL packets received (QoS 2). |
| emqx_packets_pubrel_sent | PUBREL packets sent (QoS 2). |
| emqx_packets_pubcomp_received | PUBCOMP packets received (QoS 2). |
| emqx_packets_pubcomp_sent | PUBCOMP packets sent (QoS 2). |
| emqx_packets_subscribe_received | SUBSCRIBE packets received. |
| emqx_packets_suback_sent | SUBACK packets sent. |
| emqx_packets_subscribe_error | SUBSCRIBE packets that failed. |
| emqx_packets_subscribe_auth_error | SUBSCRIBE packets denied by authorization. |
| emqx_packets_unsubscribe_received | UNSUBSCRIBE packets received. |
| emqx_packets_unsuback_sent | UNSUBACK packets sent. |
| emqx_packets_unsubscribe_error | UNSUBSCRIBE packets that failed. |
| emqx_packets_pingreq_received | PINGREQ packets received. |
| emqx_packets_pingresp_sent | PINGRESP packets sent. |

Client Lifecycle (Hook Trigger Counters)

| Metric | Description |
| --- | --- |
| emqx_client_connect | CONNECT received. |
| emqx_client_connack | CONNACK sent. |
| emqx_client_connected | client.connected hook fired. |
| emqx_client_disconnected | client.disconnected hook fired. |
| emqx_client_disconnected_reason | Disconnect counts labeled by reason. |
| emqx_client_subscribe | Subscribe hook fires. |
| emqx_client_unsubscribe | Unsubscribe hook fires. |

Session Lifecycle

| Metric | Description |
| --- | --- |
| emqx_session_created | Sessions created. |
| emqx_session_resumed | Persistent sessions resumed. |
| emqx_session_takenover | Sessions taken over by a new client. |
| emqx_session_discarded | Sessions discarded (clean start over an existing session). |
| emqx_session_terminated | Sessions terminated. |

Authentication and Authorization

Use these metrics when an HTTP, LDAP, or database backend is in the authentication path, to determine whether the broker or the backend is the slow or failing component.

Connect-Time Authentication Outcomes

| Metric | Description |
| --- | --- |
| emqx_authentication_success | Successful authentication (excluding anonymous). |
| emqx_authentication_success_anonymous | Anonymous pass. |
| emqx_authentication_failure | Authentication failures. |

Authorization Decisions

| Metric | Description |
| --- | --- |
| emqx_authorization_allow | Decisions: allow. |
| emqx_authorization_deny | Decisions: deny. |
| emqx_authorization_nomatch | No matching rule (falls back to the no_match configuration). |
| emqx_authorization_matched_allow | Matched-allow rule fired. |
| emqx_authorization_matched_deny | Matched-deny rule fired. |
| emqx_authorization_cache_hit | Cache hits. |
| emqx_authorization_cache_miss | Cache misses. |
| emqx_authorization_superuser | Superuser-bypass path. |
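
The cache hit and miss counters are most readable as a ratio. The sketch below is a minimal helper that assumes its inputs are increases of emqx_authorization_cache_hit and emqx_authorization_cache_miss over the same time window. The function name and the numbers are hypothetical.

```python
def cache_hit_ratio(hits, misses):
    """Fraction of authorization checks served from cache, or None if idle."""
    total = hits + misses
    if total == 0:
        return None  # no authorization checks in this window
    return hits / total


print(cache_hit_ratio(900, 100))  # 0.9
print(cache_hit_ratio(0, 0))      # None
```

A falling ratio means more checks reach the authorization sources, which matters most when those sources are external backends.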

Authentication Chain Status

| Metric | Description |
| --- | --- |
| emqx_authn_total | Configured authentication providers. |
| emqx_authn_enable | Enabled flag per provider (0 / 1). |
| emqx_authn_status | Resource state per provider. |
| emqx_authn_users_count | User record count per provider (for password, mnesia, or DB-backed providers). |

Authentication Per-Provider Runtime Counters

| Metric | Description |
| --- | --- |
| emqx_authn_success | Successful matches per provider. |
| emqx_authn_failed | Failures per provider. |
| emqx_authn_nomatch | Ignored per provider (chain continues to the next provider). |
| emqx_authn_latency | Backend latency per provider. |

Authorization Source Status

| Metric | Description |
| --- | --- |
| emqx_authz_total | Configured authorization sources. |
| emqx_authz_enable | Enabled flag per source (0 / 1). |
| emqx_authz_status | Resource state per source. |
| emqx_authz_rules_count | Rule record count per source (file, mnesia, or DB-backed). |

Authorization Per-Source Runtime Counters

| Metric | Description |
| --- | --- |
| emqx_authz_allow | Decisions: allow per source. |
| emqx_authz_deny | Decisions: deny per source. |
| emqx_authz_nomatch | Ignored per source (chain continues). |
| emqx_authz_latency | Backend latency per source. |

Built-In DB Sizes

| Metric | Description |
| --- | --- |
| emqx_authn_builtin_record_count | User count in the built-in authentication database. |
| emqx_authz_builtin_record_count | Rule count in the built-in authorization database. |

Data Integration

Traffic enters the rule engine, fans out to actions and connectors, and lands at external systems. Each layer exposes its own counters. Reading them in order shows where messages are being lost.
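
Reading the layers in order can be sketched as a simple funnel: list the counters from the rule engine down to the action result and look for the largest drop between neighbours. This sketch assumes a single rule with a single action, so consecutive counts are directly comparable. With several actions per rule the counts can legitimately grow between stages. The function names and numbers are hypothetical.

```python
def stage_losses(stages):
    """Given ordered (name, count) pairs, return (from, to, lost) per step."""
    losses = []
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        losses.append((prev_name, name, prev_count - count))
    return losses


def worst_stage(stages):
    """Return the step with the largest loss, or None if there are no steps."""
    losses = stage_losses(stages)
    return max(losses, key=lambda item: item[2]) if losses else None


# Hypothetical counter increases over the same time window.
funnel = [
    ("emqx_rule_matched", 1000),
    ("emqx_rule_passed", 990),
    ("emqx_action_matched", 990),
    ("emqx_action_success", 700),
]
print(worst_stage(funnel))
```

In this made-up example the largest loss sits between emqx_action_matched and emqx_action_success, which would point at the action or its downstream system and not at the rule SQL.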

Inventory

| Metric | Description |
| --- | --- |
| emqx_rules_count | Configured rules. |
| emqx_actions_count | Configured actions. |
| emqx_connectors_count | Configured connectors. |
| emqx_schema_registrys_count | Schema-registry entries. |

Per-Resource Status

| Metric | Description |
| --- | --- |
| emqx_rule_enable | Rule enable flag (0 / 1). |
| emqx_action_enable | Action enable flag (0 / 1). |
| emqx_action_status | Action resource state. |
| emqx_connector_enable | Connector enable flag (0 / 1). |
| emqx_connector_status | Connector resource state. |

Rule Engine: Per-Rule Counters

| Metric | Description |
| --- | --- |
| emqx_rule_matched | Messages that matched the rule's FROM clause (topic filter) and triggered the rule. |
| emqx_rule_passed | Messages that passed the rule. |
| emqx_rule_failed | Rule processing failed. |
| emqx_rule_failed_exception | Erlang exception during the rule. |
| emqx_rule_failed_no_result | SQL produced no result. |

Rule Engine: Action Sub-Counters

| Metric | Description |
| --- | --- |
| emqx_rule_actions_total | Action invocations from rules. |
| emqx_rule_actions_success | Action returned success. |
| emqx_rule_actions_failed | Action failed. |
| emqx_rule_actions_failed_unknown | Failure with unknown reason. |
| emqx_rule_actions_failed_out_of_service | Downstream resource unhealthy. |
| emqx_rule_actions_discarded | Action discarded (for example, rate limited). |

Action Throughput

| Metric | Description |
| --- | --- |
| emqx_action_matched | Messages routed to the action. |
| emqx_action_received | Received at the action queue. |
| emqx_action_success | Action call succeeded. |
| emqx_action_failed | Action call failed. |
| emqx_action_late_reply | Response arrived after timeout. |
| emqx_action_retried | Retry attempts. |
| emqx_action_retried_success | Succeeded after retry. |
| emqx_action_retried_failed | Failed after all retries. |

Action Queue and Inflight

| Metric | Description |
| --- | --- |
| emqx_action_inflight | In-flight requests. |
| emqx_action_queuing | Queued (pending dispatch) length. |

Action Drops

A non-zero rate on any of these series indicates that a downstream system is unhealthy or the action is misconfigured.

| Metric | Description |
| --- | --- |
| emqx_action_dropped | Total dropped at the action layer. |
| emqx_action_dropped_queue_full | Queue cap hit. |
| emqx_action_dropped_resource_stopped | Target resource stopped. |
| emqx_action_dropped_resource_not_found | Target resource missing. |
| emqx_action_dropped_expired | Message expired before dispatch. |
| emqx_action_dropped_other | Other reasons. |

Minimal "Broker-is-Sick" Panel

If only a handful of series can fit on a Grafana dashboard, these are the ones that matter most. Most production issues move at least one of them within seconds of the event:

  • rate(emqx_messages_dropped[1m]): non-zero means the broker is refusing or losing work.
  • rate(emqx_action_dropped[1m]): the integration layer is losing work.
  • emqx_cluster_nodes_stopped: greater than zero means a member was lost.
  • rate(emqx_overload_protection_new_conn[1m]): the broker is actively rejecting new connections.
  • rate(emqx_authentication_failure[1m]): a spike usually indicates a backend issue or an attack.
  • emqx_vm_run_queue: sustained above zero means CPU saturation.
  • emqx_vm_process_messages_in_queues: large values indicate process-mailbox backlog.
  • emqx_mria_lag: values above a few seconds mean replication is falling behind.
  • emqx_license_expiry_at - time() (Enterprise): countdown to license expiration.
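
The panel above can also be approximated as a small health check over a snapshot of already-evaluated values. The sketch below covers only the entries whose healthy value is zero, plus the license countdown. The snapshot keys, the function names, and the 30-day license warning window are all hypothetical choices for illustration and are not EMQX defaults.

```python
import time

# Panel entries where any value above zero deserves attention.
# Rates are assumed to be precomputed, for example by Prometheus.
ZERO_IS_HEALTHY = [
    "rate(emqx_messages_dropped[1m])",
    "rate(emqx_action_dropped[1m])",
    "emqx_cluster_nodes_stopped",
    "rate(emqx_overload_protection_new_conn[1m])",
]


def license_days_left(expiry_epoch_seconds, now=None):
    """Days until emqx_license_expiry_at, mirroring expiry - time()."""
    now = time.time() if now is None else now
    return (expiry_epoch_seconds - now) / 86400


def unhealthy_series(snapshot, now=None, license_warn_days=30):
    """Return the panel entries that need attention in this snapshot."""
    findings = [name for name in ZERO_IS_HEALTHY if snapshot.get(name, 0) > 0]
    expiry = snapshot.get("emqx_license_expiry_at")
    if expiry is not None and license_days_left(expiry, now) < license_warn_days:
        findings.append("emqx_license_expiry_at")
    return findings


# Hypothetical snapshot: one stopped node, license expiring in 10 days.
snapshot = {
    "rate(emqx_messages_dropped[1m])": 0.0,
    "emqx_cluster_nodes_stopped": 1,
    "emqx_license_expiry_at": 1_000_000 + 10 * 86400,
}
print(unhealthy_series(snapshot, now=1_000_000))
```

The entries that depend on a sustained condition or a tuned threshold, such as emqx_vm_run_queue and emqx_mria_lag, are left out on purpose because a single snapshot cannot express them.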