Since June 2016, M-Lab has collected high resolution switch telemetry for each M-Lab server and site uplink.
Originally designed to detect switch discards from
server traffic microbursts, we now support the DIScard COllection
(a.k.a. DISCO) dataset as a standard M-Lab BigQuery table:
measurement-lab.base_tables.switch
DISCO Metrics
To see the complete list of DISCO metrics, run the following query:
#standardSQLSELECT metricFROM `measurement-lab.utilization.switch`GROUP BY metricORDER BY metric ASCDISCO metric names use a common structure: switch.<type>.<source>.<direction>.
DISCO collects seven standard metric types: packets sent by broadcast, packets discarded by the switch, packet errors, packets sent by multicast, octet counts (e.g. bytes), pause frames (for Ethernet flow control), and unicast packets.
DISCO collects metrics for two sources: machine local and switch uplink.
The “local” metrics report values from the switch port attached to the
machine collecting the switch data, e.g. hostname = "mlab1.sea02.measurement-lab.org".
So there will be “local” metrics for each machine at a site. The “uplink”
metrics report values from the switch port attached to the uplink router,
outside M-Lab’s administrative domain. Each machine collects these so there
is some built-in redunancy.
Finally, a DISCO metric indicates whether the value is for data received (rx)
by the port or transferred (tx) from the port. Note that the data direction
is relative the the switch port. So, for example, a “download” test will show
up on the switch.unicast.local.rx followed by the switch.unicast.uplink.tx.
All sample.values are counts over the 10 second window starting at the given sample.timestamp.
| Metric Name | Description |
|---|---|
| switch.broadcast.local.rx | Broadcast packets received by the machine switch port. |
| switch.broadcast.local.tx | Broadcast packets transmitted by the machine switch port. |
| switch.broadcast.uplink.rx | Broadcast packets received by the uplink switch port. |
| switch.broadcast.uplink.tx | Broadcast packets transmitted by the uplink switch port. |
| switch.discards.local.rx | Packets discarded while received by the machine switch port. |
| switch.discards.local.tx | Packets discarded while transmitted by the machine switch port. |
| switch.discards.uplink.rx | Packets discarded while received by the uplink switch port. |
| switch.discards.uplink.tx | Packets discarded while transmitted by the uplink switch port. |
| switch.errors.local.rx | Packet errors received by the machine switch port. |
| switch.errors.local.tx | Packet errors transmitted by the machine switch port. |
| switch.errors.uplink.rx | Packet errors received by the uplink switch port. |
| switch.errors.uplink.tx | Packet errors transmitted by the uplink switch port. |
| switch.multicast.local.rx | Multicast packets received by the machine switch port. |
| switch.multicast.local.tx | Multicast packets transmitted by the machine switch port. |
| switch.multicast.uplink.rx | Multicast packets received by the uplink switch port. |
| switch.multicast.uplink.tx | Multicast packets transmitted by the uplink switch port. |
| switch.octets.local.rx | Bytes received by the machine switch port. |
| switch.octets.local.tx | Bytes transmitted by the machine switch port. |
| switch.octets.uplink.rx | Bytes received by the uplink switch port. |
| switch.octets.uplink.tx | Bytes transmitted by the uplink switch port. |
| switch.pause.local.rx | Pause frames received by the machine switch port. |
| switch.pause.local.tx | Pause frames transmitted by the machine switch port. |
| switch.pause.uplink.rx | Pause frames received by the uplink switch port. |
| switch.pause.uplink.tx | Pause frames transmitted by the uplink switch port. |
| switch.unicast.local.rx | Unicast packets received by the machine switch port. |
| switch.unicast.local.tx | Unicast packets transmitted by the machine switch port. |
| switch.unicast.uplink.rx | Unicast packets received by the uplink switch port. |
| switch.unicast.uplink.tx | Unicast packets transmitted by the uplink switch port. |
Example Queries
The switch schema takes advantage of record arrays.
So, queries that use the sample.value or sample.timestamp columns must handle
the array records appropriately.
Daily Average Download Rates
The following query calculates the per-day, per-host average data output rate
from sites in the DFW metro area. The native sample.value is a 10-second
counter delta. So, to calculate the per-second rate, we must divide by 10.
And, the unit of switch.octets.uplink.tx metric is bytes, so we must multiply
by 8 to convert to bits.
#standardSQLSELECT TIMESTAMP_TRUNC(sample.timestamp, DAY) AS day, hostname, metric, 8 * ROUND(AVG(sample.value) / 1e6, 1) / 10 AS mbpsFROM `measurement-lab.utilization.switch`, UNNEST(sample) AS sampleWHERE metric = "switch.octets.uplink.tx" AND hostname LIKE "%mlab1.dfw%"GROUP BY day, hostname, metricORDER BY dayDaily 99th Percentile Discard Rate
The following query calculates the 99th percentile packet discard rate per day.
Because we want a per-second rate, we must divide sample.value by 10. And,
we use APPROX_QUANTILES in part to exclude extreme values caused by erroneous
samples in the raw data.
#standardSQLSELECT TIMESTAMP_TRUNC(sample.timestamp, DAY) AS day, hostname, metric, APPROX_QUANTILES(sample.value / 10, 100)[OFFSET(99)] AS ppsFROM `measurement-lab.utilization.switch`, UNNEST(sample) AS sampleWHERE metric = "switch.discards.uplink.tx" AND hostname LIKE "%mlab1.dfw%"GROUP BY day, hostname, metricORDER BY dayPercentage of Download Traffic Per-machine
Of the data leaving a site, this query roughly estimates the percentage attributed to each machine. The percentage may change, for example, when one machine is offline and traffic is diverted to another.
For simplicity, this query looks at a single site over time. For most of the history, mlab-ns was using the “first online server” policy, so mlab1 sees about 100% of traffic during that time. However after June 1st, 2018, the mlab-ns server selection algorithm was changed to “random online server”, and the ratios approach 1/3.
#standardSQLSELECT day, 100 * mlab1_total/uplink_total as mlab1, 100 * mlab2_total/uplink_total as mlab2, 100 * mlab3_total/uplink_total as mlab3FROM (SELECT TIMESTAMP_TRUNC(sample.timestamp, DAY) AS day, SUM(IF(metric = "switch.octets.local.rx" AND hostname like "%mlab1%", sample.value, 0)) as mlab1_total, SUM(IF(metric = "switch.octets.local.rx" AND hostname like "%mlab2%", sample.value, 0)) as mlab2_total, SUM(IF(metric = "switch.octets.local.rx" AND hostname like "%mlab3%", sample.value, 0)) as mlab3_total, SUM(IF(metric = "switch.octets.uplink.tx" AND hostname like "%mlab1%", sample.value, 0)) as uplink_totalFROM `measurement-lab.utilization.switch`, UNNEST(sample) AS sampleWHERE (metric = "switch.octets.uplink.tx" OR metric = "switch.octets.local.rx") AND hostname LIKE "%yyz%"GROUP BY day)WHERE uplink_total > 0ORDER BY day