Posted on 2019-01-12
In 2015 I changed responsibilities at The Gathering, moving from practical challenges to technical ones, and I quickly took over responsibility for the network management system we use. Over time, it transformed from a handful of hacky Perl scripts to a pretty serious project, now named Gondul.
But I'm getting ahead of myself. Before we can talk about Gondul, I need to explain The Gathering a little bit.
The Gathering is a yearly computer party arranged in Hamar Olympic Arena, better known as Vikingskipet. It attracts 5000 participants every Easter. It can look something like this....
Most of the participants are aged 14 to 20-something, with some outliers in either direction. They arrive on a Wednesday, with computers, gaming consoles, home-made shelves, hopes and dreams. A few days later, on Sunday, they leave with what I hope is a great experience behind them.
Calling this a computer party is somewhat misleading today. It's closer to a festival. For the Norwegian crowd, I like to explain it as "Norway Cup, but with computers". Sure, gaming or computers or cosplay or whatnot might be the backdrop, but it's really about the social stuff, exploring, staying up late, being away from home, etc. I'll spare you further details of advocacy for now...
Except it's all made possible by volunteers. More specifically, about 400 of them.
And since all of the participants and all of the volunteers (the crew) bring a ton of network-connected equipment, we need a network. So out of the 400 or so volunteers, about 30 of us work with the network.
The "Tech" crew is split in two roughly equal-size groups. We've got Tech:Support and Tech:Net.
Tech:Support are the people who really do the heavy lifting. They are seriously under-appreciated, to be honest. Just for the 150-180 access switches and associated wireless access points used directly by participants, these people pull close to 20km of network cables. Probably close to 1000 individual cables. They also do similar work for essentially all switches where an end-user connects, including switches needed for the stage area, switches for the reception area, for a dozen or so well-hidden areas used by everyone from the medic crew, the sponsors, the press lounge, and so on.
And when that's done, they are the people participants come to with all their (hopefully computer-related) troubles. They are the ones who help you when your computer doesn't boot, or replace that switch that just doesn't work - which can be easier said than done when it's stuck in between two shelves... In short, these guys are essential.
The other crew is Tech:Net, which I'm currently part of. We do the higher-level stuff, whatever that means.
We design the network, we configure the core routers, we pull fiber cables all around the place, we negotiate with sponsors for equipment, we troubleshoot, we figure out power-requirements, we fight pigeons, we coordinate with other crews, we set up DNS, DHCP, and so on.
Tech:Net arrives at Vikingskipet on Friday (the event starts on Wednesday), but the last few years we've also spent the weekend prior at Vikingskipet to get some core infrastructure up and running.
The Gathering uses roughly 180 access switches: Juniper EX2200 48-port gigabit switches.
On the "floor", each of these is connected with 3 uplinks to a "distro"-switch. This distro-switch is really three Juniper EX3300s, configured to act as a single logical cluster. We've used 9 of these lately, for a total of 27 EX3300s. Each of them looks roughly like this:
These then all connect up to a core switch. And by "up", I mean both figuratively and literally. We run fiber cables from the distro-switches in the middle of the arena and up to the roof, roughly 40 meters above the floor. We use two 10Gb/s fiber-links per distro-switch, both for capacity and redundancy.
We've used a few different solutions for the core. In the past we used two Juniper QFX5100s in virtual chassis, pictured below:
Last year, though, we switched to a more powerful Juniper MX480, and instead of placing it in the roof, we patched everything down to the NOC...
And then there's a ring-network, and whatnot too. The design for TG18 looked like this:
As should be obvious by now, this takes a little bit of effort to plan and set up. We've gotten better and better at this over the years, and since 2015, Gondul has played a bigger and bigger part in that.
Gondul is an NMS, written for The Gathering and similar events. It's open source, and there are a few other places it's used.
It is designed to be super-simple to use, but with a lot of little magic pieces. I made a few design decisions early on that go contrary to a lot of idio.... I mean contrary to the common way of thinking.
To the end users, who in this case are the members of the Tech crew, Gondul is a single-page application that just gives you a map of the arena with the equipment on it.
And that's just the way we like it.
At this point in time, the only thing you really need to know is that basically everything is OK, except for that one switch. Judging from the legend, things are a bit nasty over there.
Gondul gathers information from three sources at this time. It polls SNMP, it pings the infrastructure continuously and it parses our DHCP logs. Every single bit of information that is gathered is always available in the frontend - in the actual browser, for instant access. And yes, that means there's a couple of MB of data there, and it's updated live.
This is important because Gondul's primary magic lies in providing a generic API that just tells you information, and an advanced frontend that parses that information to make sense of it. But it goes further: it also prioritizes information.
This means that:
For the end-user, it means lightning fast response times both UI-wise and intelligence-wise. I read somewhere about some Swedish play-pretend party that configured Prometheus and were very happy, but to be honest, I'm impatient. If I had to wait TEN SECONDS from the time someone unplugged a switch until my NMS visualized it, then I'd go nuts. No wonder they're a bit slower over there...
Since we ping the infrastructure more than once a second, our response time is probably somewhere around 3-4 seconds tops, but typically faster. That's 3-4 seconds from when you remove a power cable ANYWHERE in the arena until the frontend is showing it.
Unfortunately, SNMP is slow. So we typically poll everything every minute, with some exceptions where we're more aggressive (e.g.: internet border router). So while we'll be fast at detecting downright outages, richer information takes a bit longer.
Over the years, I've added a lot of tiny modules in Gondul. This allows us to solve very specific problems. As an example, we've had problems with event scripts. I'll spare you the details, but I'll explain the core issue.
Every link we have between switches/routers typically uses two or more physical cables, be they copper cables (twisted pair) or fiber optic cables. These are grouped together in what's called a Link Aggregation Group, or LAG. In the Juniper-world, these are named "ae(number)".
But for reasons we do know, but that I don't want to get into, we've had issues where not all physical interfaces that are supposed to be part of a LAG have actually joined the LAG. That meant that, for example, "ae14" could be up, as could the physical interfaces "ge-0/0/1", "ge-0/1/0" and "ge-0/2/0", all of which are supposed to be part of the AE.
Logic suggests that if all the ge-interfaces (gig-ethernet) are up, that's 3*1Gb/s, and the ae should have a link speed of 3Gb/s. When one of them has failed to join the LAG, the ae reports less than that.
This is a condition that Gondul now looks for and alerts on - Gondul iterates over all interfaces and compares physical interface speed to LAG speed. If there's a mismatch, we know that we need to log in and, well, fix the LAG.
If we just looked at either the LAG or the physical interfaces, we wouldn't be able to know this. And adding this took just a few minutes, and pushing it to "prod" was just asking people to reload their browser.
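As a rough illustration of the LAG check described above, here is a minimal sketch in Python. It is not Gondul's actual code: the `Interface` record, the `lag` field naming the intended parent LAG, and `find_lag_mismatches` are all hypothetical names invented for this example, and it assumes speeds are reported in Mb/s.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interface:
    # Hypothetical record; the real data comes from SNMP polling.
    name: str
    speed_mbps: int
    oper_up: bool
    lag: Optional[str] = None  # name of the LAG this port is supposed to join


def find_lag_mismatches(interfaces):
    """Return (lag_name, expected_mbps, reported_mbps) for every LAG whose
    reported speed differs from the sum of its members that are up."""
    by_name = {iface.name: iface for iface in interfaces}

    # Sum the speed of every physical member that is actually up.
    expected = {}
    for iface in interfaces:
        if iface.lag and iface.oper_up:
            expected[iface.lag] = expected.get(iface.lag, 0) + iface.speed_mbps

    mismatches = []
    for lag_name, want in sorted(expected.items()):
        lag = by_name.get(lag_name)
        have = lag.speed_mbps if lag and lag.oper_up else 0
        if have != want:
            mismatches.append((lag_name, want, have))
    return mismatches
```

With three 1Gb/s members up and "ae14" reporting only 2Gb/s, this returns a single mismatch for "ae14", which is the cue to log in and fix the LAG.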
We check for DHCP requests. If we don't see any DHCP requests on a participant-network, something is probably wrong. But then, maybe not?
We used to get a few false positives on the more non-standard switches. To solve this, we check DHCP as usual, but compare it to physical client ports. This means that the number of client ports that are showing link should roughly equal the number of DHCP clients.
We also use the number of client ports to determine when to REMOVE a switch on the last night of the event.
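The comparison between DHCP clients and client ports with link could be sketched like this. This is an illustration only, assuming a simple dict of port name to link state; the function names and the `tolerance` parameter (how far below the port count the DHCP count may fall before we complain) are assumptions of this example, not something Gondul is known to use.

```python
def count_active_client_ports(port_status, uplink_ports):
    """Count ports that show link and are not uplinks.

    port_status: dict of port name -> True if the port has link.
    uplink_ports: set of port names reserved for uplinks.
    """
    return sum(
        1 for name, has_link in port_status.items()
        if has_link and name not in uplink_ports
    )


def dhcp_alert(port_status, uplink_ports, dhcp_clients, tolerance=0.5):
    """Return a warning string if far fewer DHCP clients were seen than
    there are client ports with link, otherwise None."""
    active = count_active_client_ports(port_status, uplink_ports)
    if active == 0:
        # Nobody is plugged in, so a silent DHCP log is expected.
        return None
    if dhcp_clients < active * tolerance:
        return f"{dhcp_clients} DHCP clients but {active} client ports with link"
    return None
```

The point of counting ports first is that an empty switch with no DHCP traffic stays quiet, while a switch full of connected clients and no DHCP requests gets flagged.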
And we also track uplink ports as mentioned. We've established a convention where access switches use ports ge-0/0/44 through ge-0/0/47 for uplinks. But we usually don't pull 4 uplinks. For participants we run 3, for others, even fewer.
Gondul knows this, because operators (Tech:Net) set unused ports to "admin-down", which means that they are off, and this is exposed in SNMP.
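A small sketch of how the admin-down convention can drive the uplink check. The port range comes from the convention above; the dict layout with `admin_up` and `oper_up` keys is made up for this example (in SNMP terms these would roughly correspond to the IF-MIB admin and operational status of each interface).

```python
# Uplink convention from the text: ge-0/0/44 through ge-0/0/47.
UPLINK_PORTS = [f"ge-0/0/{n}" for n in range(44, 48)]


def uplink_problems(ports):
    """Return uplink ports that are enabled but have no link.

    ports: dict of port name -> {"admin_up": bool, "oper_up": bool}
    (a hypothetical shape for the polled data).
    """
    problems = []
    for name in UPLINK_PORTS:
        state = ports.get(name)
        if state is None or not state["admin_up"]:
            # Admin-down means the operator marked the port as unused.
            continue
        if not state["oper_up"]:
            problems.append(name)
    return problems
```

A switch with three uplinks configured and the fourth set to admin-down only raises a problem when one of the three enabled ports loses link.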
Searching seems simple, and I suppose it is. What we do in Gondul is just evaluate a handful of variables when you search, including name (obviously), IP addresses, distro-switch and a few other variables. This makes it easy to figure out what switches are connected to a certain distro immediately.
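To make the idea concrete, a search along these lines might look like the sketch below. The field names (`name`, `mgmt_ip`, `distro`) and the plain substring matching are assumptions for illustration; the real search evaluates a few more variables than shown here.

```python
def matches(switch, query):
    """True if the query is a case-insensitive substring of any searched field."""
    q = query.lower()
    fields = (
        switch.get("name", ""),
        switch.get("mgmt_ip", ""),
        switch.get("distro", ""),
    )
    return any(q in str(field).lower() for field in fields)


def search(switches, query):
    """Return the names of all switches matching the query."""
    return [s["name"] for s in switches if matches(s, query)]
```

Because the distro-switch is just another searched field, typing the name of a distro lists every access switch connected to it.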
Another simple feature we've added is the "oplog" - operator log. This is a simple way for us to log work, and optionally associate it with a "system", or more specifically, a search. It allows us to log stuff like "there were CRC errors on this link, we're working on it" and everyone will see it, both in the overview and if you drill down per switch.
This is an example of a really simple feature that adds a great deal of value. I'm advocating increased use of this for everything, however trivial, so we can keep track of things.
We don't use Gondul just for monitoring. We also use it for provisioning. It started with access switches, but now we also use it for distro switches and the core, and possibly more.
Templating is really simple. Since we have an API that provides all information we have on the infrastructure, I hooked up a small thing that polls the API and compiles Jinja2 templates.
Then our DHCP server is set up to parse "DHCP option 82" and provide switches with URLs back to Gondul. Gondul then de-composes the option-82 part of the URL, which gives the templating engine a variable for which distro switch and port the DHCP request was received from. That in turn allows us to deduce which switch is asking for DHCP and compile a config for it.
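The flow above can be sketched roughly as follows. Several things here are assumptions made for the example: the URL layout (distro and port as query parameters), the host name `gondul.example`, the inventory keyed by (distro, port), and the placeholder config text. It also uses Python's standard `string.Template` as a stand-in for Jinja2, to keep the example free of third-party dependencies.

```python
from string import Template
from urllib.parse import urlparse, parse_qs

# Placeholder template, NOT real switch configuration syntax.
CONFIG_TEMPLATE = Template(
    "hostname: $name\n"
    "mgmt_ip: $mgmt_ip\n"
    "uplink: $distro $port\n"
)


def config_for_request(url, inventory):
    """Work out which switch is asking, based on the distro switch and
    port embedded in the URL, and render a config for it.

    inventory: dict of (distro, port) -> switch record (hypothetical shape).
    """
    params = parse_qs(urlparse(url).query)
    distro = params["distro"][0]
    port = params["port"][0]
    switch = inventory[(distro, port)]
    return CONFIG_TEMPLATE.substitute(
        name=switch["name"],
        mgmt_ip=switch["mgmt_ip"],
        distro=distro,
        port=port,
    )
```

The key design point is that the switch never needs to know who it is: where it is plugged in identifies it.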
Similarly, we have templates for the DNS server-config, etc. This is part of what was originally known as "FAP", though that was a different code base that has been gradually integrated or moved to solve a more general problem.
Despite what it might look like, Gondul is really extremely simple underneath the hood. That's why it works so well.
The collectors (DHCP log tailer, SNMP poller, pinger) are just a few hundred lines of Perl, and they rarely need any attention during the event.
The API is similarly simple: It just exposes information as it knows it and adds as little logic or meaning to it as possible.
A less well known part is the cache layer - this sits between the API and anything using the API. It is a Varnish Cache, with a pretty trivial config. Instead of complex Varnish-logic, the API uses standard HTTP cache headers to control the Varnish cache. But caching goes further than that - the browser caches too, according to the same rules. This means that the frontend-code can be super-stupid: Just poll every API end point every second, because the browser will obey cache rules anyway and not actually send the request to the backend. Granted, I had to implement ETag-aware logic in the frontend, but that's easy.
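The ETag-aware part can be illustrated with a small sketch. It is written in Python rather than as browser code, the class name `EtagCache` and the `fetch` callback signature are invented for this example, and it only shows the general pattern: send `If-None-Match` with the last seen ETag, and treat a 304 response as "nothing changed, reuse what you have".

```python
class EtagCache:
    """Remember the last ETag and body per URL and revalidate on each poll."""

    def __init__(self, fetch):
        # fetch(url, headers) -> (status, response_headers, body)
        # is a hypothetical transport function supplied by the caller.
        self._fetch = fetch
        self._entries = {}  # url -> (etag, body)

    def get(self, url):
        """Return (body, changed). changed is False when the server
        answered 304 Not Modified and the cached body was reused."""
        headers = {}
        cached = self._entries.get(url)
        if cached:
            headers["If-None-Match"] = cached[0]

        status, resp_headers, body = self._fetch(url, headers)
        if status == 304 and cached:
            return cached[1], False

        etag = resp_headers.get("ETag")
        if etag:
            self._entries[url] = (etag, body)
        return body, True
```

With the `changed` flag, a poller can hit every endpoint each second and still skip all the expensive parsing and redrawing when the data is unchanged.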
The frontend is not perfect, but it's getting better. It's modularized, and I'm particularly happy with how the actual map is completely isolated from the rest of the code, as is the data collection and the logic that takes data and makes sense of it.
Adding a new "handler" that takes a new look at the data can be done in 20-50 lines of code. That will immediately integrate with the map and switch summary information.
And this is what we actually do all the time during the event. I adapt the frontend constantly as we learn new things.
The design of Gondul is focused on just that: Make it possible to develop both before and during the event with little to no risk.