I've recently had the time and the drive to dive a little deeper into Infiniband...which is a bit of an understatement considering my understanding prior to going down the 'how do I monitor this' rabbit hole consisted of compiling OpenMPI with Infiniband Support.

Rather than bore all of you with the inconsequential fluff - I'll just get down to the 'here is how I did it' details.

Assumptions

You already have infiniband enabled hosts and a subnet manager running.

You have installed infiniband support for Red Hat/CentOS repos and NOT directly from Mellanox/Intel/INSERT VENDOR HERE.

Tools like ibdiagnet show expected output.

You are familiar with Zabbix Low Level Discovery.

All that said, maybe in another post I'll go over how I got that far...but for now I'm gonna be lazy (running theme for me.)

Caveats

Most of my test gear involved Mellanox cards, so this information may not be entirely accurate if tested against another vendor's cards.

The Zabbix bits are provided as-is and may not work for you.

As always, use documents you find on blogs at your own risk. This worked for me. It may not work for you.

Getting the Information

There will be some files in /sys/class/infiniband full of juicy juicy information.  The ones we are currently concerned with / the ones I chose to monitor are:

### Counters
/sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_xmit_packets
/sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_xmit_data
/sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_rcv_packets
/sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_rcv_data

### Info
/sys/class/infiniband/${DEVICE}/ports/${PORT}/rate
/sys/class/infiniband/${DEVICE}/ports/${PORT}/state
/sys/class/infiniband/${DEVICE}/ports/${PORT}/phys_state

For example, one of my hosts with a Mellanox card that has a single port would be /sys/class/infiniband/mlx5_0/ports/1/ .  In this case, ${DEVICE} is "mlx5_0" and ${PORT} is "1".

After a lot of knowledge base article reading[1], source code reading[2], and general DuckDuckGo-Fu, I was able to determine that the counters for 'xmit_data' and 'rcv_data' are in octets divided by 4, at least for the Mellanox cards.  This is important data when you go to graph this information.

Also through the aforementioned searching/reading, I determined the following values for 'state' and 'phys_state', again - at least for the Mellanox cards[3]:

State (state) Meaning
0 ???
1 Down
2 Initializing
3 Armed
4 Active
Physical State (phys_state) Meaning
0 No state change
1 Sleep
2 Polling
3 Disabled
4 PortConfigurationTraining
5 LinkUp
6 LinkErrorRecovery
7 PhyTest

Zabbix Bits

I've got a few hosts with a few cards in them...so I didn't want to create an item for each of them in Zabbix.  Remember...I'm lazy.

This is where Zabbix Low Level Discovery came in handy.  With low level discovery, I can have Zabbix discover each device.

First, the zabbix user parameter I used:

UserParameter=infiniband.discovery, if [[ -d /sys/class/infiniband ]]; then echo -n '{"data":['; for dev in /sys/class/infiniband/*; do DEVICE=$(basename $dev); for port in /sys/class/infiniband/${DEVICE}/ports/*; do PORT=$(basename $port); echo -n "{\"{#HCA}\": \"${DEVICE}:${PORT}\"},"; done ; done | sed -e 's:\},$:\}:'; echo -n ']}'; fi

The same logic in a bit more readable format:

## only fire if /sys/class/infiniband exists
if [[ -d /sys/class/infiniband ]]; then
  ## open the json output
  echo -n '{"data":['
  ## for each device
  for dev in /sys/class/infiniband/*; do
    DEVICE=$(basename $dev)
    ## For each port in device
    for port in /sys/class/infiniband/${DEVICE}/ports/*; do
      PORT=$(basename $port)
      ## echo the json
      echo -n "{\"{#HCA}\": \"${DEVICE}:${PORT}\"},"
    done
  ## trim off the last comma  
  done | sed -e 's:\},$:\}:'
  ## close the json
  echo -n ']}'
fi

Example Output:

## Raw
{"data":[{"{#HCA}": "mlx5_0:1"},{"{#HCA}": "mlx5_1:1"},{"{#HCA}": "mlx5_2:1"},{"{#HCA}": "mlx5_3:1"}]}

## Pretty Printed
{
  "data": [
    {
      "{#HCA}": "mlx5_0:1"
    },
    {
      "{#HCA}": "mlx5_1:1"
    },
    {
      "{#HCA}": "mlx5_2:1"
    },
    {
      "{#HCA}": "mlx5_3:1"
    }
  ]
}

Along with the LLD rules, I also added some UserParameter configuration scripts to gather the information on the host.  These expect the first argument to be in the format "device:port", then using cut it splits the input and builds the path to the information that will be monitored.

Counters

UserParameter=infiniband.port_xmit_packets[*], DEVICE="$(echo $1 | cut -f1 -d:)"; PORT="$(echo $1 | cut -f2 -d:)"; cat /sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_xmit_packets
UserParameter=infiniband.port_xmit_data[*], DEVICE="$(echo $1 | cut -f1 -d:)"; PORT="$(echo $1 | cut -f2 -d:)"; cat /sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_xmit_data
UserParameter=infiniband.port_rcv_packets[*], DEVICE="$(echo $1 | cut -f1 -d:)"; PORT="$(echo $1 | cut -f2 -d:)"; cat /sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_rcv_packets
UserParameter=infiniband.port_rcv_data[*], DEVICE="$(echo $1 | cut -f1 -d:)"; PORT="$(echo $1 | cut -f2 -d:)"; cat /sys/class/infiniband/${DEVICE}/ports/${PORT}/counters/port_rcv_data

Info

UserParameter=infiniband.rate[*], DEVICE="$(echo $1 | cut -f1 -d:)"; PORT="$(echo $1 | cut -f2 -d:)"; cat /sys/class/infiniband/${DEVICE}/ports/${PORT}/rate
UserParameter=infiniband.state[*], DEVICE="$(echo $1 | cut -f1 -d:)"; PORT="$(echo $1 | cut -f2 -d:)"; cat /sys/class/infiniband/${DEVICE}/ports/${PORT}/state | cut -f1 -d:
UserParameter=infiniband.phys_state[*], DEVICE="$(echo $1 | cut -f1 -d:)"; PORT="$(echo $1 | cut -f2 -d:)"; cat /sys/class/infiniband/${DEVICE}/ports/${PORT}/phys_state | cut -f1 -d:

A git repository with the above Zabbix bits, plus a template for Zabbix can be found on this fancy new GitLab thang.

Lovely Pictures

Eyeballed the graph while doing a sustained qperf test on 2 of my hosts.  Assuming I understand any of this (yikes) I believe I've got things set up correctly.  ~12GB/s transmit plus ~12GB/s receive equals ~24GB/s bi-directional bandwidth (as reported by the test rc_bi_bw).

Sources