NANOG 93 recap summary

The NANOG 93 meeting took place a few days ago. Slide decks for all the talks are on the meeting page and the corresponding videos should appear on the NANOG playlist on YouTube soon. This is a very brief, likely imperfect summary of a few of the talks from the meeting. I don’t review all the talks, because I can’t attend them all and some I just wasn’t sufficiently moved by to say anything useful about. There is also of course more to tell about the so-called “hallway track”, which is where a lot of the more interesting things happen, but I do not detail that here.

Building Community around Network Automation
Scott Robohn
A keynote by a co-founder of the Network Automation Forum (NAF). Automation has become a recurring topic at NANOG over the past few years. Expect was an early mechanism netops used to automate interaction with network devices. First Rancid and later Oxidized became popular tools for backing up network gear configs and pushing out changes. One or more of Ansible, CFEngine, Chef, Puppet, and Salt have become popular remote system administration tools that are widely used by netops. Netbox arrived with a splash to become a popular “source of truth” for network inventory needs. There are now many high-quality tools and frameworks being developed that go beyond tool makers’ first scripts, building off of or combining what has come before. Scott covered some of this history and pondered why network automation isn’t even further along than it is. Scott didn’t provide an answer that satisfied me, but I think we can all proffer a set of reasons. Time, expertise, motivation, good-enough existing tools, management priorities, and so on could all help explain how incongruous different people and organizations are in this area. If you’re interested in network automation, head over to the NAF and join their Slack space, where a lot of what you’d be interested in is being discussed.

gRPC Services under One Roof – Introduction and Practical use cases
Reda Laichi, Saju Salahudeen
The presenters, both affiliated with Nokia, continued the automation theme with a technical overview of gRPC-based tools. Depending on who you ask, the ‘g’ does or does not stand for Google. Google invented and open sourced it, so it probably doesn’t matter. It uses protocol buffers and HTTP/2 for the interface and transport, respectively. If you’re not interested in, nor work with, either of those at a devops level, then you probably don’t care much about this talk. You’re probably only interested in this talk if you’re also interested in OpenConfig for network device management. The presenters went through a suite of existing tools, plus some of their own developed with the Nokia service router (SR) platform in mind. Links to the tools are in the slides.
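
For the curious, the protocol buffer wire format is simple at its core: integers travel as variable-length “varints”, seven data bits per byte. A minimal sketch of that encoding, purely as an illustration (my own toy code, not anything from the talk):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style varint:
    7 data bits per byte, MSB set on all but the final byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Inverse of encode_varint: accumulate 7-bit groups, little-endian."""
    n = shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return n
```

For example, 300 encodes to the two bytes `ac 02`, the canonical example from the protobuf documentation.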

ARIN Update
John Sweeting
The perennial talk given at almost every meeting is the ARIN update. John’s highlights included IPv4 and IPv6 prefix requests. The uptick in IPv6 assignments was attributed to an IPv6 marketing campaign last year. Fees for ASNs have been removed, which seems to have led to an increase in ASN assignments as well. RIPE, by contrast, instituted small fees for ASNs to help minimize waste and abuse. It will be interesting to see whether ARIN ultimately returns to requiring fees as well. ARIN also highlighted its Qualified Facilitators. These are essentially address/ASN resource brokers that can help acquire, lease, or trade resources. It is interesting to see who is on the list as well as who is not. If you get resources from an organization not on that list, you should ask them if they would go through the process. I’d be very interested to see the responses of those who choose not to do so.

Let’s stop using DNS
Jen Linkova
Jen admitted the title of the talk was clickbait, but it was obvious from the abstract this was not going to be as provocative as the title suggested. In a nutshell, Jen is appealing to anyone who has deployed or might deploy IPv6-only networks. The over-reliance on DNS, and in particular mechanisms like DNS64 and the well-known name ipv4only.arpa, should be deprecated. Jen is encouraging the deployment of IETF RFC 8781 - Discovering PREF64 in Router Advertisements for IPv6-only or IPv6-mostly networks. As anyone familiar with hard-coded names such as wpad or isatap should know, ipv4only.arpa usage should generally be avoided; there be dragons.
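
To make the appeal concrete: once a host has learned the PREF64 from a Router Advertisement, it can synthesize the NAT64 mapping itself rather than depending on DNS64. A minimal sketch of that synthesis for the common /96 prefix case (function name is my own, not from the talk):

```python
import ipaddress

def synthesize_nat64(pref64: str, ipv4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in a /96 NAT64 prefix (the simple
    RFC 6052 case with the IPv4 address in the low 32 bits)."""
    net = ipaddress.ip_network(pref64)
    assert net.prefixlen == 96, "sketch handles only the common /96 case"
    v4 = int(ipaddress.IPv4Address(ipv4))
    return ipaddress.IPv6Address(int(net.network_address) | v4)
```

With the well-known prefix, `synthesize_nat64("64:ff9b::/96", "192.0.2.1")` yields `64:ff9b::c000:201`, no DNS64 required.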

DNSSEC-related Outages
John Kristoff
This was my research talk based on some initial results of trying to rigorously evaluate the performance of the DNS where DNSSEC zone signing is deployed. DNSSEC, both the technology and its deployment, has a storied, uneven history. For almost as long as it has been around, which is now approximately two decades, there have been as many advocates as detractors. Despite the legitimate questions and concerns about DNSSEC, little formal evaluation of how well DNSSEC performs has been done. We aim to remedy that. Evaluating the system completely and easily cannot be done. It is too distributed and too much local knowledge is hidden from academic measurement. This is why I started the talk with the joke about the cow as a sphere. We simplify the problem at the expense of some precision, under the pretense that we can get close enough to an exact measure to say something useful. We focus on DNSKEY RRSIG expirations from the perspective of the instrumented resolvers in the SecSpider system. An expiry event in this infrastructure record is a common failure scenario we have good visibility into. We try to understand what is normal and the impact of a DNSSEC-related failure. It is still too early to draw final conclusions, but overall we can say DNSSEC seems to perform fairly well in the aggregate despite some high-profile outages over the years. Many of the outages are not “full” zone availability outages or as impactful as the quantity of outages might suggest. We hope to publish a full report in an academic venue soon. Note, I made a couple of goofs in the introductory material and fixed them in my local copy of the slides. One was that I wildly misquoted the amount of DNS inconsistency a cited paper reported (it is 8%, not 80%!). The other was that the total unique IPv4 address space covered by ROAs was for 2023, not 2024 as might have been implied.
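
For context on the failure mode we measure: every RRSIG carries an expiration timestamp, and once it passes, validating resolvers treat the signature as bogus even though the zone data itself is unchanged. A toy check of that expiry condition (my own illustration, not the SecSpider code):

```python
from datetime import datetime, timezone

def rrsig_expired(expiration, now=None):
    """Check whether an RRSIG expiration field, in the RFC 4034
    presentation format YYYYMMDDHHmmSS (UTC), has passed."""
    exp = datetime.strptime(expiration, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now > exp
```

In practice an operator wants to re-sign well before this returns True for any RRSIG in the zone; the outages we study are largely cases where that didn’t happen.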

Reviving BGP Zombies: New Insights
Iliana Xygkou
Iliana is a PhD student at Georgia Tech and performed this work in collaboration with colleagues from CodeBGP (now part of the ThousandEyes group at Cisco). A BGP zombie is a route that persists in a remote routing table after having been withdrawn by the originator. In other words, a withdrawal gets lost or does not fully propagate. In theory, these stuck routes could live on for a long time, and in practice some have been observed to persist for many months. Surprisingly, some that have seemingly disappeared have returned. The total number of zombies is relatively low, but we might not even know the exact number depending on our vantage point and measurement duration. BGP beacons are routes that are periodically announced and withdrawn. These have been used for a variety of routing system measurements, including zombies. One technique to help with these and other types of measurements is to encode a timestamp within an IPv6 prefix announcement, termed a BGP clock. To help reduce zombies, it is recommended that IETF RFC 9687 - Border Gateway Protocol 4 (BGP-4) Send Hold Timer be implemented by router vendors and widely deployed. This modifies the BGP state machine to tear down a session if a BGP speaker detects that a peer is not processing messages sent to it. The presenter proposed a “Stuck Route Observatory” for ongoing monitoring and alerting.
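
The BGP clock idea can be sketched as packing a timestamp into the bits of the announced prefix, so any observer can read off when the announcement was generated. The bit layout below is my own illustration, not necessarily the scheme the presenter described:

```python
import ipaddress

def bgp_clock_prefix(base, timestamp):
    """Pack a 32-bit Unix timestamp into bits 64..95 of a /64 announced
    under a /32 base prefix (illustrative layout, not the real scheme)."""
    net = ipaddress.ip_network(base)
    assert net.prefixlen == 32, "sketch assumes a /32 base allocation"
    addr = int(net.network_address) | (timestamp << 64)
    return ipaddress.IPv6Network((addr, 64))
```

For example, under a 2001:db8::/32 base, timestamp 0x12345678 produces the announcement 2001:db8:1234:5678::/64; a remote observer seeing the route can recover the timestamp from the prefix alone.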

AI-Powered Network 5.0 - A Paradigm Shift
Yun Freund
The second day opened with the first of two AI-related talks. AI has been discussed in NANOG talks before, but perhaps notably NANOG has not been engulfed by the subject as much as other fields have been. The speaker focused primarily on how AI, and specifically LLMs, can help sort through mounds of complex data to address otherwise difficult problems. Geoff Huston said this all sounded contrary to what the Internet has long advocated, which in general is simplicity, particularly within the network nodes, pushing complexity up and out toward the edge (an imperfect retelling of the e2e argument). I offered my own concern to the presenter. I wondered if, by relying on AI to deal with increasingly complex environments, we would reinforce the very complexity that led us to find AI necessary in the first place. In other words, will good design give way to an increasingly complex mess, just because “AI” will ultimately save us? It is not entirely clear we will get AI-powered networks as envisioned, but maybe there is a role, especially in the early stages of design and deployment, to help reduce complexity and cost? Seems that might be a path worth exploring.

Measuring Starlink Protocol Performance
Geoff Huston
After ARIN, Geoff is as close to having a standing speaker invitation at NANOG as anyone. Geoff has talked about this subject before, but not at NANOG, and I could tell many people got a lot out of it. He noted a prior talk at NANOG that identified TCP with BBR as generally performing far better than TCP with Cubic (two distinct congestion avoidance strategies). Geoff is a fan of widely deploying BBR everywhere. I’m not so sure that is a good idea, but it is certainly worth consideration and debate. Geoff then went into some details of how Starlink functions, which is helpful for understanding why BBR has been shown to significantly outperform Cubic on the platform. I found it interesting that he mentioned ECN as a mechanism that could prove useful in this system as well. I think it could prove useful in a lot of systems, but unfortunately, as far as I can tell, widespread deployment and use of ECN appears to be indefinitely elusive. Starlink is a fascinating system to study and I think there is probably a lot more to come in the years ahead.
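
For anyone wanting to experiment with BBR versus Cubic themselves: on Linux the congestion control algorithm can be selected per socket with the TCP_CONGESTION option. A hedged sketch (Linux-only; it falls back gracefully elsewhere):

```python
import socket

def set_congestion_control(sock, algo="bbr"):
    """Request a TCP congestion control algorithm on a socket (Linux only).
    Returns the algorithm actually in effect, or None if unsupported."""
    TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)  # 13 is the Linux value
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, algo.encode())
        name = sock.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
        return name.split(b"\x00", 1)[0].decode()
    except OSError:
        return None  # algorithm not loaded, or not a Linux kernel
```

Note that "bbr" only works if the kernel module is available; check /proc/sys/net/ipv4/tcp_available_congestion_control to see what your kernel offers.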

Recent Linux Improvements that Impact TCP Throughput: Insights from R&E Networks
Brian Tierney
Brian gave a nice in-depth talk on TCP performance through the lens of high-speed networks in the research and education sector. This was a modern-day TCP tuning talk of the sort routinely seen at Internet2 Joint Tech (now Tech Exchange) meetings. Outside of R&E, this talk might be of interest to those involved in high-speed data center and storage operations. The good news is that the latest Linux kernels continue to improve and perform fairly well without additional modification. There are some practical considerations the general network audience might be interested in. For example, they highlight how poorly SCP/SFTP performs in the WAN due to its limited built-in buffer size. See their Say no to scp/sftp page for details.
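
The arithmetic behind much of this tuning is the bandwidth-delay product: the pipe only fills if the window (and buffers) can cover bandwidth × RTT. A quick sketch of both directions of that calculation:

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return int(bandwidth_bps * rtt_seconds / 8)

def max_throughput_bps(window_bytes, rtt_seconds):
    """Throughput ceiling imposed by a fixed window over a given RTT."""
    return window_bytes * 8 / rtt_seconds
```

For example, a 10 Gb/s path at 80 ms RTT needs about 100 MB in flight, while a fixed 2 MB application window on the same path caps out around 200 Mb/s, which is roughly the scp/sftp problem in a nutshell.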

IPD: Detecting Traffic Ingress Points at ISPs
Ingmar Poese
Ingmar presented an interesting research talk on measuring ingress traffic from a large ISP’s customers. The environment consists of a few thousand edge routers, not always with BGP running, but with flows exported to a collector. In other words, lots of data from which to measure and analyze. They were able to group and associate flows based on source addresses, ranking the ingress points across the network. Some behaviors such as asymmetric routing and bogus source addresses could complicate the analysis, but it appears they were able to detect and classify the vast majority of traffic. There is an IPD GitHub repo for the reference implementation.
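
The core of the approach, grouping flow records and ranking ingress points by volume, can be sketched with a hypothetical record format (the real implementation, with all the prefix aggregation logic, is in the linked repo):

```python
from collections import Counter

def rank_ingress(flows):
    """Rank ingress routers by total bytes observed.
    Each flow is a (router_id, src_addr, nbytes) tuple; this record
    format is my own simplification, not IPD's actual schema."""
    totals = Counter()
    for router, _src, nbytes in flows:
        totals[router] += nbytes
    return totals.most_common()  # [(router_id, total_bytes), ...] descending
```

The hard part the paper addresses is what this sketch glosses over: deciding which source prefixes legitimately belong behind which ingress point despite asymmetry and spoofing.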

Interplanetary Internet: Where no DNS query has gone before…
Scott Johnson
Scott presented an update on work progressing to bring Internet protocols, or some version of them, to outer space. Popularized by Vint Cerf, the idea of IP connectivity throughout the solar system has been a subject of thinking, research, and design for a number of years. One of the obvious challenges is the transmission delay inherent in such vastly larger distances. Many assumptions and some protocols will not work well in this environment. DNS is one such problem area. Scott described a modification to the DNS where each world would operate its own root. He suggested each world would have the same TLDs, but they could contain different data. Why this obvious name collision scenario is acceptable doesn’t make sense to me, but I assume I’m missing something. The use of gateways presumably provides some isolation and separation of service between worlds, but why do this at all, I’m not sure. The good news is, I doubt we’re going to have to worry about name collisions any time soon. It appears much of the network would operate in a mode we’ve seen before: store-and-forward. Quick, get those X.25 network people out of retirement! SMTP was one example of an Internet application that can adapt reasonably well to this environment.

The MTU Manifesto
Mencken Davidson
Unfortunately Mencken had 67 pages to get through in 30 minutes. He tried, but he went over by about 15 minutes and still rushed through the material. Despite that, Mencken’s material was good and many found it interesting. For better or worse, the 1500-byte Ethernet payload limitation has posed a challenge for many netops for years. Jumbo frames have been around a long time, and history is strewn with presentations about gaining performance benefits from them, with mixed success. Path MTU discovery has long been a challenge in some environments, but more so recently as various tunnel and overlay mechanisms are employed. Mencken covered a number of these scenarios and expressed his own frustrations at dealing with them. One of Mencken’s demands was to formalize jumbo frames. This seems unlikely to ever happen IMO, and we’re probably stuck with the current situation until the day comes, if it ever does, when something replaces what we know as 802.3. I’m not holding my breath. Mencken also wants to see more signaling between layers so they can inform one another of MTU sizes. I’m not so sure this will work either, but Linux seems to do some of this already. We might just as well ask tunneling and overlay protocols to do their own fragmentation, but that is unlikely to gain a lot of fans. It seems to me Mencken has contributed a valuable summary of the issues and some possible responses, but there is no easy way around the combined complexity of layer 2, tunnels/overlays, and the autonomy of intermediate devices that makes PMTU such a hard problem.
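
The overhead arithmetic that drives many of these PMTU headaches is trivially simple, which is part of the frustration. A sketch using commonly cited per-encapsulation overheads (IPv4 outer headers; exact numbers vary with options):

```python
# Commonly cited encapsulation overheads in bytes, assuming IPv4 outer headers
OVERHEAD = {
    "gre": 20 + 4,             # outer IPv4 + basic GRE header
    "vxlan": 20 + 8 + 8 + 14,  # outer IPv4 + UDP + VXLAN + inner Ethernet
}

def inner_mtu(link_mtu, encaps):
    """Effective IP MTU left for the payload after stacking encapsulations."""
    return link_mtu - sum(OVERHEAD[e] for e in encaps)
```

GRE over a 1500-byte link leaves the familiar 1476 bytes; VXLAN leaves 1450. Stack a couple of overlays and the shortfall grows, while the endpoints frequently still assume 1500.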

30 Years of BGP
Geoff Huston
A lightning talk by Geoff on the state of BGP. He highlighted the major trends in the BGP routing table, including periods of growth and when things have seemingly slowed. According to current data, IPv4 route table growth has slowed noticeably. IPv6 has been growing more rapidly, but its growth pattern also changed recently at the same time as IPv4’s. Geoff attributes this to the global pandemic. We’ll need a few more years to see if the IPv4 and IPv6 growth trends diverge or continue along the same trajectories. I’d guess, or maybe hope, we’d see IPv6 eventually outpacing IPv4, but I wouldn’t bet my life on it. For the small number of people who weren’t already aware, follow Geoff’s BGP monitoring work here.

SMTP Distributed Monitoring in-the-wild
John Kristoff
I gave a lightning talk on what we (Dataplane.org) see for SMTP traffic across our 500+ sensor network. This was primarily a short talk told in plots. Our sensors do not advertise themselves in any way. That is, we do not assign A, AAAA, or MX RRs to them. Whatever SMTP traffic they see is unsolicited. First I showed the distribution of TCP source ports we saw from SMTP connections. The vast majority of connections utilize the dynamic/private port range (49152-65535). Then I showed what one or a set of SMTP scanners look like in plot form. We map the destination IP address to an integer and use that as the y-axis value. This shows how spread out across the address space the activity is, and can tell us whether scanning is occurring over the entire Internet address space or not. I showed one example where a scanning host was locked onto essentially three destination addresses for days before switching to an all-address-space scan. We showed what popular scanning services from Shadowserver and Censys look like in our plots, which helps demonstrate their overall frequency and breadth of activity. I was then able to highlight what looked like maintenance or an outage in Censys activity by examining data across the address space and time.
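
The y-axis mapping is nothing fancier than the integer value of the address, which keeps nearby prefixes near each other on the plot:

```python
import ipaddress

def addr_to_y(addr):
    """Map an IPv4 address to its integer value for use as a plot
    y-coordinate, so the full y-range spans the whole address space."""
    return int(ipaddress.IPv4Address(addr))
```

A scanner sweeping the whole Internet then shows up as points scattered across the full y-range, while one fixated on a few targets draws flat horizontal lines.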

Hackathon Wrap-up
Aaron Atac
Aaron gave an overview of the hackathon, which was essentially a capture-the-flag style competition. The hackathon team used Containerlab for all the challenges and set up various mock environments to work in. Challenges involved debugging network connectivity issues, using a network inventory system, using a REST API, working with GraphQL, and an RPKI-related reachability issue. I had every intention of participating, as did at least one of my colleagues, but the problem we ran into was lack of time. The challenge ran from Monday through Thursday at 6pm. I mostly attended talks Monday and then the social event, which went to 10pm. Tuesday I was also attending most of the talks, so I just didn’t make time for the event. There ought to be ways to make it easier to participate, or maybe I just need to free up more time, but I’m not sure that will work with the current schedule. I hope it continues and more people can do it. Participation was low, but it is a worthwhile event that should attract more attention.