I started noticing some gaps in my home-lab monitoring this week, but I didn’t pay much attention until I looked at the data over time.

My SmartThings Prometheus exporter reaches out to the SmartThings API every 60 seconds and populates the status and metrics for each of my devices. I’m able to make some nice Grafana dashboards that show things like battery levels, contact, motion, & switch status, power usage, temperature, and thermostats. But there’s nothing worse than seeing null values in my data!! UGH!!! (It makes for some seriously ugly graphs and missed alerts – unless you’re connecting null values…)

Because I also use Grafana’s Loki Docker Driver Client, I was quickly able to take a current and historical view of the logs from my SmartThings exporter. It wasn’t looking good:

Something started kicking off a lot of errors around the 19th of February. And what are all these messages?

dial tcp: lookup api.smartthings.com on 127.0.0.11:53: server misbehaving" source="loop.go:59"

Specifically – 127.0.0.11 is Docker’s embedded DNS resolver, which forwards lookups on to my local DNS server (port 53 being DNS). To start troubleshooting – I tried doing some command line DNS lookups first:

# nslookup api.smartthings.com
Server: 10.10.3.80
Address: 10.10.3.80#53

Non-authoritative answer:
Name: api.smartthings.com
Address: 3.131.74.134
Name: api.smartthings.com
Address: 13.59.226.110
Name: api.smartthings.com
Address: 3.137.134.173
Name: api.smartthings.com
Address: 3.131.168.20
Name: api.smartthings.com
Address: 18.220.168.164
Name: api.smartthings.com
Address: 3.140.46.118
Name: api.smartthings.com
Address: 3.13.133.255
Name: api.smartthings.com
Address: 3.131.79.130

That looked fine. So I tried running it a few more times just to be sure and I quickly received a different response:

# nslookup api.smartthings.com
Server: 10.10.3.80
Address: 10.10.3.80#53

** server can't find api.smartthings.com: REFUSED

Refused huh? (I shall not be REFUSED!!)

Let me try a local network server:

# nslookup grafana01.tylephony.com
Server: 10.10.3.80
Address: 10.10.3.80#53

** server can't find grafana01.tylephony.com: REFUSED

Since I use and love Pi-hole as my local DNS and DHCP server – a quick search of their forums turned up a similar issue. But the response didn’t quite make sense in my case: “Pi-hole is not refusing it, your upstream is refusing to provide an answer.”

Pi-hole IS my upstream server!!

A little more digging in the forums revealed a look at the most recent release notes for Pi-hole FTL v5.7 and Web v5.4. This caught my eye:

Inbuilt enhanced Denial-of-Service (DoS) protection

Hence, we decided to implement a customizable rate-limiting into FTL itself. It defaults to the rather conservative limit of allowing no more than 1000 queries in a 60 seconds window for each client. Afterwards, any further queries are replied to with empty replies with the status set to REFUSED. Both the number of queries within the window as well as the window size can be configured by the user. It is important to note that rate-limiting is happening on a per-client basis. Other clients can continue to use FTL while rate-limited clients are short-circuited at the same time.

Because I monitor a lot of my servers with Grafana, Prometheus, and InfluxDB – I query DNS a lot. Like around 5 million times a day. And I did just update my Pi-hole servers. A look back at my Pi-hole DNS stats shows this trend over time, both before and after the update.
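To put that number up against the new default limit, here’s a rough back-of-the-envelope calculation (assuming queries are spread evenly across the day, and that one busy host generates a third of them – both of those are my own assumptions):

```python
# Rough arithmetic: how does ~5 million DNS queries/day compare
# to Pi-hole's default rate limit of 1000 queries per 60-second window?
queries_per_day = 5_000_000
minutes_per_day = 24 * 60

per_minute_total = queries_per_day / minutes_per_day  # network-wide average
print(f"{per_minute_total:.0f} queries per minute across all clients")

# If one exporter-heavy host generates even a third of that traffic,
# it blows past the per-client 1000/60s default on its own.
one_busy_client = per_minute_total / 3
print(f"~{one_busy_client:.0f} queries/minute from that one host")
```

So a single exporter-heavy client doesn’t need to do anything unusual to trip the limiter.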

Since the server that I have running my SmartThings exporter also runs a lot of other exporters and metrics collectors, Pi-hole was actually rate limiting that client.

Thankfully an easy fix:

Rate-limiting can easily be disabled by setting RATE_LIMIT=0/0 in /etc/pihole/pihole-FTL.conf. If I want, say, to set a rate limit of 1 query per hour, the option should look like RATE_LIMIT=1/3600.
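For my case there’s also a middle ground – raising the per-client ceiling instead of disabling the protection outright. The 10000/60 value below is my own guess at comfortable headroom for a busy exporter host, not a documented recommendation:

```
# /etc/pihole/pihole-FTL.conf
# Allow up to 10000 queries per client in any 60-second window
# (default is 1000/60; 0/0 disables rate-limiting entirely).
RATE_LIMIT=10000/60
```

A `pihole restartdns` afterwards picks up the change.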

So – any lessons learned?

  • Read the release notes before deploying any updates? (Nope!)
  • Limit the rate interval or number of metrics and exporters to ease the number of DNS queries? (Nope!)
  • Set up default error alerting to catch problems sooner? (Soon!!)

Happily – no more gaps…

This started as an easy story with three simple things:

  1. I love IoT gadgets.
  2. I love weather gadgets.
  3. I love charts and graphs.

While perusing online, it’s possible that I stumbled upon a social media advertisement for the WeatherFlow Tempest Weather System (and a 15% off coupon). It’s also quite possible that I made a quick purchase.

WeatherFlow Tempest started out as a Kickstarter project but now ships directly to more than just its backers. The Tempest is essentially a solar powered weather station that collects wind, rain, lightning, temperature, humidity, pressure, and sunlight measurements. It has a wireless hub that collects these data points, pushes them up to the WeatherFlow cloud for processing and weather forecasting, and then provides a personalized weather app. Data can be collected through a couple of different APIs – WeatherFlow’s processed stream of events and metrics (REST API), or locally on my network via the UDP broadcast API.

Yes. Now we’re talking. APIs. Raw data. Wireless. Weather data. All that means “how can I use my data to make some charts and graphs!”

Logs Into Metrics

With any new project – it’s always best to start by seeing what great work has already been shared with the community. I found a great InfluxDB listener from Vince Skahan called weatherflow-udp-listener. I made a first set of Grafana dashboards on top of his system, and it worked out pretty well. But as much as I loved Vince’s efforts – the project isn’t currently supported (though it should work just fine for the foreseeable future…)

What I really wanted was to work with the raw JSON logs being streamed over UDP from the Tempest to the WeatherFlow hub. I needed a way to collect those JSON logs and push them into a logging system that provides long term storage AND a way to turn logs into metrics. What better solution than Grafana’s Loki log aggregation system?

JSON Logs

The WeatherFlow hub broadcasts JSON logs over UDP port 50222 on my local network. Based on the WeatherFlow Tempest UDP Reference, these logs can be broken out by type (these are the main ones, among a few other message types):

obs_st (Observation – Tempest)

{"serial_number":"ST-00028209","type":"obs_st","hub_sn":"HB-00038302","obs":[[1613934241,0.00,1.34,5.45,93,3,988.74,9.39,28.68,7761,0.23,65,0.000000,0,0,0,2.808,1]],"firmware_revision":134}

device_status

{"serial_number":"ST-00028209","type":"device_status","hub_sn":"HB-00038302","timestamp":1613934241,"uptime":1478650,"voltage":2.81,"firmware_revision":134,"rssi":-69,"hub_rssi":-66,"sensor_status":0,"debug":0}

hub_status

{"serial_number":"HB-00038302","type":"hub_status","firmware_revision":"160","uptime":895935,"rssi":-39,"timestamp":1613934260,"reset_flags":"PIN,SFT","seq":89486,"fs":[1,0,15675411,524288],"radio_stats":[25,1,0,3,4248],"mqtt_stats":[119,29]}

rapid_wind

{"serial_number":"ST-00028209","type":"rapid_wind","hub_sn":"HB-00038302","ob":[1613934245,0.67,152]}
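Capturing these broadcasts takes only a few lines of Python. A minimal sketch of the idea (function names are mine; the listener scripts I actually used are covered next):

```python
import json
import socket

def decode_packet(raw: bytes) -> dict:
    """Parse one WeatherFlow UDP broadcast datagram into a dict."""
    return json.loads(raw.decode("utf-8"))

def listen(port: int = 50222) -> None:
    """Print every JSON log the hub broadcasts, one line per datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))  # the hub broadcasts on UDP 50222
    while True:
        raw, _ = sock.recvfrom(4096)
        print(json.dumps(decode_packet(raw)))

# listen()  # blocks forever, emitting one JSON log line per datagram
```

Each printed line is a complete JSON log, ready to pipe into a log shipper on stdout.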

By using a slightly modified weatherflow-listener.py script (from P-Doyle’s Simple-WeatherFlow-Python-Listener) I was able to use the following command to send the broadcast JSON logs into Loki via Promtail.

/usr/bin/stdbuf -oL /usr/bin/python weatherflow-listener.py | /usr/bin/promtail --stdin --client.url http://loki:3100/loki/api/v1/push --client.external-labels=app=weatherflow,hostname=weatherflow

(Using stdbuf here to reduce the STDOUT/STDIN buffer wait…)

Once the logs are in Grafana Loki – I used Loki’s LogQL to crack open some of the JSON arrays into useful metrics:

max(max_over_time({app="weatherflow"} |= "obs_st" | json obs_Air_Temperature="obs[0][7]" | unwrap obs_Air_Temperature | __error__="" [$__interval])) * 9/5 + 32

Using the WeatherFlow UDP API as a guide, I made metrics from each of the index values:

Observation Value Layout

Index   Field                                        Units
0       Time                                         Epoch Seconds
1       Wind Lull (minimum 3 second sample)          m/s
2       Wind Avg (average over report interval)      m/s
3       Wind Gust (maximum 3 second sample)          m/s
4       Wind Direction                               Degrees
5       Wind Sample Interval                         seconds
6       Station Pressure                             MB
7       Air Temperature                              C
8       Relative Humidity                            %
9       Illuminance                                  Lux
10      UV                                           Index
11      Solar Radiation                              W/m^2
12      Precip Accumulated                           mm
13      Precipitation Type                           0 = none, 1 = rain, 2 = hail
14      Lightning Strike Avg Distance                km
15      Lightning Strike Count                       –
16      Battery                                      Volts
17      Report Interval                              Minutes
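That table maps naturally onto the raw obs array. A small sketch – the field names below are my own, and it does the same C-to-F conversion as the LogQL query above:

```python
# Field names for the 18-slot obs_st array, following the WeatherFlow
# UDP reference index order (names are my own shorthand).
OBS_ST_FIELDS = [
    "time_epoch", "wind_lull_ms", "wind_avg_ms", "wind_gust_ms",
    "wind_direction_deg", "wind_sample_interval_s", "station_pressure_mb",
    "air_temperature_c", "relative_humidity_pct", "illuminance_lux",
    "uv_index", "solar_radiation_wm2", "precip_accumulated_mm",
    "precipitation_type", "lightning_avg_distance_km", "lightning_count",
    "battery_volts", "report_interval_min",
]

def parse_obs_st(message: dict) -> dict:
    """Zip the positional obs array into a named dict."""
    return dict(zip(OBS_ST_FIELDS, message["obs"][0]))

def c_to_f(celsius: float) -> float:
    return celsius * 9 / 5 + 32

# Example using the obs_st log line shown earlier:
sample = {"type": "obs_st", "obs": [[1613934241, 0.00, 1.34, 5.45, 93, 3,
          988.74, 9.39, 28.68, 7761, 0.23, 65, 0.0, 0, 0, 0, 2.808, 1]]}
fields = parse_obs_st(sample)
print(fields["air_temperature_c"])                    # 9.39
print(round(c_to_f(fields["air_temperature_c"]), 1))  # 48.9
```

Index 7 is the 9.39 °C value in that sample – the same slot the `obs[0][7]` LogQL expression unwraps.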

Grafana Dashboards

The rest of my efforts were simply turning those metrics into usable information – coming up with an Overview, Today So Far, and a Device Details set of Grafana dashboards.

Like my previous Loki Syslog All-In-One project – I created an All-In-One project for this WeatherFlow collector. Details on installing your own collector, including all of the files you need to download, are over at my WeatherFlow Dashboards AIO GitHub repository. These dashboards are also available in Grafana’s Community Dashboards. With a little help from P-Doyle’s Simple-WeatherFlow-Python-Listener and Promtail – you too can deploy a quick and easy WeatherFlow log collector with Grafana Loki and Grafana dashboards.

If you’re a WeatherFlow fan, I’d love to hear any feedback on how this works for you. If you’d like to share your dashboards, I’d be happy to include them here to share with the community as well!

These dashboards are also part of my Internet facing set of current Grafana dashboards. Enjoy!!

Loki Syslog Overview

I’ll be the first to admit that I’ve always been a metrics person. Charts and graphs through and through. Almost to a fault – I largely ignored logs. That’s not to say I haven’t combed through my fair share of application logs across hundreds of endpoints. Do you remember the days of creating shared NAS exports and just writing out logs until they filled up? (Yeah – me neither… ahem…) But two things have come to light in the last few months that hopefully make this an interesting story to tell. One, I discovered Loki, Grafana’s log aggregation system. And two, I have a handful of home lab servers, an increasingly complex network, and storage devices whose behavior is hard to see all the time. My initial challenge was understanding why my wireless devices were having intermittent network instability, and which (if any) of my wireless access points were having the most issues. But all I had to work with was Syslog.

A Google search for “Syslog Collector” presented me with 342,000 results to start my effort. Most of the attention-grabbing “6 Free Syslog Servers” links turned up Windows utilities, each still pretty limited to just a few hosts at a time. I needed to collect data from more than a dozen systems, and I’m running on Linux and macOS. What I really needed was some Open Source goodness.

This now becomes a tale of how I came to love logs.

And Loki. <3

My first exposure to Loki came during my first days at Grafana Labs. Presented with an amazing way to discover and consume logs alongside Prometheus and Kubernetes microservices – it didn’t immediately occur to me to capture standalone network logs with Loki in the same fashion. And so I set out to see what I could accomplish.

Loki is actually quite easy to deploy as a single binary, either via the command line or in Docker. One of the primary ways to get logs into Loki is Promtail, easily deployed the same way. I jumped into docker-compose – even with Loki’s roots in Prometheus and Kubernetes, I was looking to build out essentially a quick-start standalone Syslog ingester.
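A minimal sketch of what that docker-compose stack might look like – the image names and port mappings here are my assumptions rather than the project’s exact files:

```yaml
# docker-compose.yml – standalone syslog ingest stack (a sketch)
version: "3"
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"            # Loki push/query API
  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
  syslog-ng:
    image: balabit/syslog-ng:latest
    volumes:
      - ./syslog-ng.conf:/etc/syslog-ng/syslog-ng.conf
    ports:
      - "514:514/udp"          # devices send RFC3164 here
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```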

A look through the Loki documentation on configuring Promtail with Syslog made me realize that Promtail only accepts IETF Syslog (RFC5424) – which is also how I found out my devices only speak the older RFC3164 format. Time to look at syslog-ng!!

What’s useful about syslog-ng in my situation is that it can listen for RFC3164 (UDP port 514) and forward on to Promtail as RFC5424 on port 1514. (Many of my devices only output the older style of Syslog…) A few quick configurations were all I needed to get syslog-ng and Promtail talking to each other!

syslog-ng Configuration

# syslog-ng.conf

source s_local {
    internal();
};

source s_network {
    default-network-drivers(
    );
};

destination d_loki {
    syslog("promtail" transport("tcp") port("1514"));
};

log {
    source(s_local);
    source(s_network);
    destination(d_loki);
};

Promtail Configuration


# promtail-config.yml

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:

- job_name: syslog
  syslog:
    listen_address: 0.0.0.0:1514
    idle_timeout: 60s
    label_structured_data: yes
    labels:
      job: "syslog"
  relabel_configs:
    - source_labels: ['__syslog_message_hostname']
      target_label: 'host'

The relabeling in Promtail takes the hostname from each message arriving via syslog-ng and turns it into a host label for Loki to index. Within a few minutes I had all of my hosts streaming Syslog from my network into Loki and explorable within Grafana!
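To smoke-test the chain without waiting on real devices, a few lines of Python can fire an RFC3164-style message at the listener. The tiny in-process UDP receiver below just stands in for syslog-ng so the example is self-contained – against the real setup you’d point the handler at your syslog-ng host on UDP 514:

```python
import logging
import logging.handlers
import socket

# Stand-in for syslog-ng: a bare UDP socket on an ephemeral port.
# For the real pipeline, use address=("your-syslog-ng-host", 514).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2)
port = receiver.getsockname()[1]

# SysLogHandler sends BSD-style (RFC3164-ish) datagrams over UDP.
handler = logging.handlers.SysLogHandler(address=("127.0.0.1", port))
logger = logging.getLogger("wifi-ap-test")
logger.addHandler(handler)
logger.warning("client disconnected: DHCP retry limit reached")

datagram, _ = receiver.recvfrom(1024)
print(datagram)  # e.g. b'<12>client disconnected: DHCP retry limit reached\x00'
```

The `<12>` priority is facility user (1) × 8 + severity warning (4) – handy to recognize when eyeballing raw captures.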

Now – around this same time Loki 2.0 was released. Ward Bekker had just presented to our team some of the launch efforts and dashboard examples he worked on when I heard him say to me…

“Dave – look how easy it is to turn logs into metrics!” ~ Ward Bekker

Ward – you have my attention!! At this point I really expedited my efforts to build a dashboard that paired that easy log gathering with an even easier way to sort, search, filter, and present useful information – dashboards showing all of my device logs.

Within a few minutes I had a working dashboard where I could use either a drop down of pre-defined search terms or a free form search for items in my logs. Then simply apply the “logs to metrics” magic, and I was presenting grouped summaries of counts by wireless access point!

Loki First Dashboard

Oh yeah – my first LogQL query!! Showing the number of logs over time filtered by hostname (host=~"$hostname"), coming from my Syslog Promtail job (job="syslog"), with a free form search string from my Grafana variable ($filter).

count_over_time({host=~"$hostname", job="syslog"} |= "$filter" [$__interval])

With a bit more dashboard usability tweaking I was able to visualize other types of logging from my gateway devices, my server IPMI stats, and NAS details – all available to scroll back through time. And finally – building out alerting for threshold breaching (yes… logs into metrics!! More on alerting in a follow-up post.)

So while this is a pretty simple example of how I got started with Loki and my logging journey – I believe it shows how quick and easy it is to connect Open Source solutions to solve immediate problems, even in a homelab.

I also wanted to share these configurations and what better way to do that than with a kind of “All In One” docker-compose project. So I present to you:

Grafana Loki Syslog All-In-One Project

Loki Syslog AIO

This quick example project allows you to run all of these mentioned services with docker-compose on a Linux server. Point your network devices at (hostname:514) and log into Grafana (hostname:3000), and you’ll be presented with the “Loki Syslog AIO – Overview” dashboard. For those of you who want to see some of the behind-the-scenes details, I’ve included prebuilt performance overview dashboards for each of the main services (Grafana, Loki, MinIO, Docker, and host metrics.) You’ll see dropdown links to the “Performance Overview” at the top of the Loki Syslog AIO – Overview dashboard, including links to get you back to the starting dashboard. If you don’t have Syslog devices immediately available but want to try the dashboard out – I also built an optional Syslog Generator container.

For more setup details and downloads, check out my Grafana Loki Syslog AIO GitHub repository. My example Loki Dashboard is also available in Grafana’s Community Dashboards.

And yes – I did figure out that my dropped connections were related to high DHCP retries and overly aggressive minimum data rate settings. Now I know! Thanks Loki!!

Grafana Loki Icon

“You cannot step into the same river twice, for other waters are continually flowing on.” ~ Heraclitus

I stumbled upon that analogy some time back and I think it’s a good way of talking about change. For me, joining the Solutions Engineering team at Grafana is a continuation of embracing change and enjoying my professional journey. As I look back over my first several months here, I wanted to share a bit of my own journey and a few thoughts on what I’m most looking forward to.

How My Journey Started

  • I’ve been running metrics since the days of Perl scripts and RRD graphs (hats off to Cacti). My Enterprise IT career really began at Citigroup and accelerated as I helped build out an amazing APM team using BMC Patrol and Precise APM.
  • An opportunity to launch my Sales Engineering career, joining HP Software with solutions from their Mercury Interactive acquisition and HP Business Service Management.
  • Some time with BMC Software with their ITSM and Cloud solutions. (Always Remedy Green!)
  • Last three years deep into APM with Cisco AppDynamics.
  • All leading to the opportunity to join Grafana Labs this year and help drive adoption of Grafana’s Observability Stack of metrics, logs and tracing.

Since Arriving At Grafana

Everybody at Grafana deeply believes that data and its visualization should be easy to use, understand, and act upon by all. As an open-source software startup, Grafana delivers on sharing openly and transparently.

Our team is made up of the most energized, thoughtful, and empowered stewards of “All Things Grafana”. But it extends beyond our namesake. It’s compelling open-source solutions to the big problems facing today’s Observability efforts, and we lead our community and customers with consultative guidance on solving real visibility issues.

We Love Visualizing Data!!

I published a quick video showcasing some of the Grafana Dashboards that I built out along my Journey Into Grafana. (With a great excuse to write some new music as well!!) The dashboards include:

(I’ll be publishing some follow-up blog posts on these dashboards and data exporters shortly…)

What Does the Future of Grafana Look Like?

First of all, it’s absolutely the continuing acceleration of compelling and valuable adoption across all of our solutions: Grafana, Loki, and our recently released Tempo. Grafana releases every few months, and I can tell you that we cover all parts of our community open-source platforms as well as our Enterprise products.

Cloud native is accelerating at a ferocious pace. That means delivering platforms that scale to the collection and visualization of high velocity, fine grained observability data – a critical underpinning for everything that we do. It’s about thinking differently and having diverse and inclusive relationships with our teams, community, and customers.

Growth. My journey to Grafana was absolutely based on great culture and the ability to personally deliver amazing and collaborative outcomes across our organization. From Engineering and Marketing to Sales and Solutions Engineering – Grafana has incredible opportunities for great candidates worldwide. Please reach out and be part of our great team!

Thanks!

I’m fortunate to have many great friends, family, colleagues, and customers that have joined me along my incredible journey. I look forward to sharing back with the community that makes Grafana such a respected and loved solution.