I started noticing some gaps in my home-lab monitoring solutions this week but I didn’t really pay much attention to it until I looked at it over time.
My SmartThings Prometheus exporter reaches out to the SmartThings API every 60 seconds and populates the status and metrics for each of my devices. I’m able to make some nice Grafana dashboards that show things like battery levels, contact, motion, & switch status, power usage, temperature, and thermostats. But there’s nothing worse than seeing null values in my data!! UGH!!! (It makes for some seriously ugly graphs and missed alerts – unless you’re connecting null values…)
Because I also use Grafana’s Loki Docker Driver Client, I was quickly able to take a current and historical view of the logs from my SmartThings exporter. It wasn’t looking good:
Something started kicking off a lot errors around the the 19th of February. And what are all these messages?
dial tcp: lookup api.smartthings.com on 127.0.0.11:53: server misbehaving" source="loop.go:59"
Specifically – looking up a server name on port 53 tends to point to my local DNS server. To start troubleshooting – I tried doing some command line DNS lookups first:
# nslookup api.smartthings.com Server: 10.10.3.80 Address: 10.10.3.80#53 Non-authoritative answer: Name: api.smartthings.com Address: 18.104.22.168 Name: api.smartthings.com Address: 22.214.171.124 Name: api.smartthings.com Address: 126.96.36.199 Name: api.smartthings.com Address: 188.8.131.52 Name: api.smartthings.com Address: 184.108.40.206 Name: api.smartthings.com Address: 220.127.116.11 Name: api.smartthings.com Address: 18.104.22.168 Name: api.smartthings.com Address: 22.214.171.124
That looked fine. So I tried running it a few more times just to be sure and I quickly received a different response:
# nslookup api.smartthings.com Server: 10.10.3.80 Address: 10.10.3.80#53 ** server can't find api.smartthings.com: REFUSED
Refused huh? (I shall not be REFUSED!!)
Let me try a local network server:
# nslookup grafana01.tylephony.com Server: 10.10.3.80 Address: 10.10.3.80#53 ** server can't find grafana01.tylephony.com: REFUSED
Since I use and love Pi-hole as my local DNS and DHCP server – I made a quick search of their forums and found this similar issue. But the response didn’t quite make sense in my case: “Pi-hole is not refusing it, your upstream is refusing to provide an answer.”
Pi-hole IS my upstream server!!
A little more digging in the forums revealed a look at the most recent release notes for Pi-hole FTL v5.7 and Web v5.4. This caught my eye:
Inbuilt enhanced Denial-of-Service (DoS) protection
Hence, we decided to implement a customizable rate-limiting into FTL itself. It defaults to the rather conservative limit of allowing no more than 1000 queries in a 60 seconds window for each client. Afterwards, any further queries are replied to with empty replies with the status set to REFUSED. Both the number of queries within the window as well as the window size can be configured by the user. It is important to note that rate-limiting is happening on a per-client basis. Other clients can continue to use FTL while rate-limited clients are short-circuited at the same time.
Because I monitor a lot of my servers with Grafana, Prometheus, and InfluxDB – I ping DNS a lot. Like around 5 million times a day. And I did just update my Pi-hole servers. A look back at my Pi-hole DNS stats shows a bit more of this trend over time both before and after this update.
Since the server that I have running my SmartThings exporter also runs a lot of other exporters and metrics collectors, Pi-hole was actually rate limiting that client.
Thankfully an easy fix:
Rate-limiting can easily be disabled by setting RATE_LIMIT=0/0 in /etc/pihole/pihole-FTL.conf. If I want, say, to set a rate limit of 1 query per hour, the option should look like RATE_LIMIT=1/3600.
So – any lessons learned?
- Read the release notes before deploying any updates? (Nope!)
- Limit the rate interval or number of metrics and exporters to ease the number of DNS queries? (Nope!)
- Set up default error alerting to catch problems sooner? (Soon!!)
Happily – no more gaps…