The NANOG 93 meeting took place a few days ago. Slide decks for all the talks are on the meeting page and the corresponding videos should appear on the NANOG playlist on YouTube soon. This is a very brief, likely imperfect summary of a few of the talks from the meeting. I don’t review all the talks, because I can’t attend them all and some just didn’t move me enough to say anything useful about them. There is also, of course, more to tell about the so-called “hallway track”, where a lot of the more interesting things happen, but I do not detail those here.
Building Community around Network Automation
Scott Robohn
A keynote by a co-founder of the Network Automation Forum
(NAF). Automation has become a
recurring topic at NANOG over the past few years.
Expect was an early mechanism
netops used to automate interaction with network devices. First
Rancid and later
Oxidized have become popular tools
for backing up network gear configs and pushing out changes. One or
more of Ansible,
CFEngine, Chef,
Puppet, and Salt have
become popular tools for remote system administration that are widely
used by netops. Netbox
arrived with a splash to become a popular “source-of-truth” for network
inventory needs. There are now many high-quality tools and frameworks
being developed that go beyond tool makers’ first scripts, building off
of or combining what has come before. Scott covers some of this history
and ponders why network automation isn’t even further along than it is.
Scott didn’t provide an answer that satisfied me, but I think we can
all proffer a set of reasons. Time, expertise, motivation, good-enough
existing tools, management priorities, and so on could all help explain
how varied different people and organizations are in this area. If
you’re interested in network automation, head over to the NAF and join
their Slack space, where a lot of what you’d be interested in is being
discussed.
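To give a flavor of the Expect-style interaction Scott referenced, here is a minimal sketch using Python’s pexpect module; the hostname, credentials, prompt patterns, and command are placeholders of my own, not anything from the talk.

```python
# Minimal Expect-style session automation with pexpect (placeholder values).
import pexpect

child = pexpect.spawn("ssh admin@router1.example.net", timeout=30)
child.expect("assword:")             # wait for the password prompt
child.sendline("s3cret")             # placeholder credential
child.expect(r"router1[>#]")         # wait for the device prompt
child.sendline("show running-config")
child.expect(r"router1[>#]")
print(child.before.decode())         # output captured before the next prompt
child.sendline("exit")
```

Rancid, for instance, is built around Expect-based login scripts (clogin and friends) doing this same kind of prompt-and-response dance.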
gRPC Services under One Roof – Introduction and Practical use cases
Reda Laichi, Saju Salahudeen
The presenters, both affiliated with Nokia, continued the automation
theme with a technical overview of gRPC-based tools. Depending on who you
ask the ‘g’ does or does not stand for Google. Google invented and open
sourced it so it probably doesn’t matter. It uses protocol buffers and
HTTP/2 for the interface and transport respectively. If you’re not
really interested in nor work with either of those at a devops level, then
you probably don’t care much about this talk. And even then, you’re probably
only interested in this talk if you’re also interested in OpenConfig for
network device management. The presenters went through a suite of
available existing tools, plus some of their own developed with the
Nokia service router (SR) platform in mind. Links to the tools are in the
slides.
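As a rough illustration of the gRPC/gNMI workflow the tools in the talk build on, here is a sketch using the third-party pygnmi library; the target address, port, credentials, and OpenConfig path are placeholder assumptions, not anything specific from the presentation.

```python
# Sketch: gNMI Get over gRPC using pygnmi (all connection details are placeholders).
from pygnmi.client import gNMIclient

target = ("192.0.2.1", 57400)  # device gNMI endpoint; gRPC over HTTP/2 underneath

with gNMIclient(target=target, username="admin", password="admin", insecure=True) as gc:
    # Request interface state modeled with OpenConfig YANG; the payload is
    # serialized as protocol buffers by the gRPC machinery.
    result = gc.get(path=["openconfig-interfaces:interfaces"])
    print(result)
```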
ARIN Update
John Sweeting
The perennial talk given at almost every meeting is the ARIN update.
John’s highlights include IPv4 and IPv6 prefix requests. The uptick in
IPv6 assignments was attributed to an IPv6 marketing campaign last year.
Fees for ASNs have been removed, which seems to have led to an increase
in ASN assignments as well. RIPE, by contrast, instituted some small ASN
fees to help minimize waste and abuse. It will be interesting to see whether
ARIN will ultimately return to requiring fees as well. ARIN also highlighted
its Qualified Facilitators.
These are essentially address/ASN resource brokers that can help
acquire, lease, or trade resources. It is interesting to see who is on
the list as who is not. If you get resources from an organization not
on that list, you should ask them if they would go through the process.
I’d be very interested to see the responses of those who choose not to
do so.
Let’s stop using DNS
Jen Linkova
Jen admitted the title of the talk was clickbait, but it was obvious
from the abstract this was not going to be as provocative as the title
suggested. In a nutshell, Jen is appealing to anyone who has or who
might deploy IPv6-only networks. The over-reliance on DNS, and in
particular on mechanisms like DNS64 and the well-known name
ipv4only.arpa, should be deprecated. Jen is encouraging the deployment
of IETF RFC 8781 - Discovering PREF64 in Router
Advertisements for
IPv6-only or IPv6-mostly networks. As anyone familiar with
hard-coded name prefixes or suffixes such as wpad or isatap should know,
ipv4only.arpa usage should generally be avoided; there be dragons.
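For context, here is a rough dnspython sketch of the RFC 7050-style heuristic that PREF64 in Router Advertisements is meant to displace: query ipv4only.arpa for AAAA records and infer the NAT64 prefix from where the well-known IPv4 addresses land, assuming the common 96-bit prefix.

```python
# Legacy DNS64 prefix discovery via ipv4only.arpa (RFC 7050 style), sketched
# with dnspython; assumes the typical 96-bit NAT64 prefix. On a network
# without DNS64 the query returns no AAAA and dns.resolver raises NoAnswer.
import ipaddress
import dns.resolver

WELL_KNOWN = {"192.0.0.170", "192.0.0.171"}  # addresses ipv4only.arpa resolves to

answers = dns.resolver.resolve("ipv4only.arpa", "AAAA")
for rr in answers:
    addr = ipaddress.IPv6Address(rr.address)
    # With a /96 prefix the IPv4 address is embedded in the low 32 bits.
    embedded = ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)
    if str(embedded) in WELL_KNOWN:
        prefix = ipaddress.IPv6Network((int(addr) & ~0xFFFFFFFF, 96))
        print("NAT64 prefix appears to be", prefix)
```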
DNSSEC-related Outages
John Kristoff
This was my research talk based on some initial results of trying to
rigorously evaluate the performance of the DNS where DNSSEC zone signing
is being deployed. DNSSEC the tech and deployment have a storied,
uneven history. For almost as long as it has been around, which is now
approximately two decades, there have been as many advocates as
detractors. Despite the legitimate questions and concerns about DNSSEC,
little formal evaluation of how well DNSSEC performs has been done. We
aim to remedy that. Evaluating the system completely and easily cannot
be done. It is too distributed and too much local knowledge is hidden
from academic measurement. This is why I started the talk with the joke
about the cow as a sphere. We simplify the problem at the expense of
some precision, under the pretense that we can get close enough to an
exact measure to say something useful. We focus on DNSKEY RRSIG
expirations from the perspective of instrumented resolvers in the
SecSpider system. An expiry event on this
infrastructure record is a common failure scenario we have good visibility
into. We try to understand what is normal and the impact of a
DNSSEC-related failure. It is still early to make final conclusions,
but overall we can say DNSSEC seems to perform fairly well in the
aggregate despite some high-profile outages over the years. Many of the
outages are not “full” zone availability outages or as impacting as the
quantity of outages might suggest. We hope to publish a full report in
an upcoming academic venue soon. Note, I made a couple of goofs in the
introductory material and I fixed the big one in my local copy of the
slides. One was that I wildly misquoted the amount of DNS inconsistency
a paper cited reported (it is 8% not 80%!). And the total unique IPv4
address space covered by ROAs was for 2023 not 2024 as might have been
implied.
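For a sense of the signal we key on, below is a simplified dnspython sketch that fetches a zone’s DNSKEY RRset with DNSSEC records and reports how close the covering RRSIG is to expiring; the zone name and resolver address are placeholders, and this is not the actual SecSpider measurement code.

```python
# Check how long until a zone's DNSKEY RRSIG expires (placeholder zone/resolver).
import time
import dns.message
import dns.query
import dns.rdatatype

query = dns.message.make_query("example.com", dns.rdatatype.DNSKEY, want_dnssec=True)
response = dns.query.tcp(query, "8.8.8.8", timeout=10)

for rrset in response.answer:
    if rrset.rdtype == dns.rdatatype.RRSIG:
        for rrsig in rrset:
            remaining = rrsig.expiration - int(time.time())
            print(f"RRSIG by key tag {rrsig.key_tag} expires in {remaining} seconds")
```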
Reviving BGP Zombies: New Insights
Iliana Xygkou
Iliana is a PhD student at Georgia Tech and performed this work in
collaboration with colleagues from CodeBGP (now part of the ThousandEyes
group at Cisco). A BGP zombie is a route that persists in a remote
routing table after having been withdrawn by the originator. In other
words, a withdraw update gets lost or does not fully propagate. In
theory, these stuck routes could live on for a long time and in practice
some have been observed to persist for many months. Surprisingly some
that have seemingly disappeared have returned. The total number of
zombies is relatively low, but we might not even know the exact number
depending on our vantage point and measurement duration. BGP beacons
are routes that are periodically announced and withdrawn. These have
been used for a variety of routing system measurements, including
zombies. One technique to help with these and other types of
measurements is to encode within an IPv6 prefix announcement a
timestamp, termed a BGP clock. To help reduce zombies it is recommended
that IETF RFC 9687 - Border Gateway Protocol 4 (BGP-4) Send Hold
Timer be implemented by
router vendors and widely deployed. This modifies the BGP state machine
to tear down a session if a BGP speaker detects that a peer is not
processing messages sent to it. The presenter proposed a “Stuck Route
Observatory” for ongoing monitoring and alerting.
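To make the BGP clock idea a bit more concrete, here is an illustrative sketch of packing a timestamp into the bits of an announced IPv6 prefix; the parent block and bit layout are hypothetical and not the encoding the actual beacons use.

```python
# Illustrative "BGP clock": embed a 32-bit timestamp in an announced /64.
import time
import ipaddress

PARENT = ipaddress.IPv6Network("2001:db8::/32")  # hypothetical parent block

def timestamped_prefix(ts: int) -> ipaddress.IPv6Network:
    # Put the timestamp in the 32 bits just below the /32, yielding a /64
    # whose network portion observers can decode from route collector feeds.
    addr = int(PARENT.network_address) | ((ts & 0xFFFFFFFF) << 64)
    return ipaddress.IPv6Network((addr, 64))

def decode_timestamp(prefix: ipaddress.IPv6Network) -> int:
    return (int(prefix.network_address) >> 64) & 0xFFFFFFFF

p = timestamped_prefix(int(time.time()))
print(p, "encodes", decode_timestamp(p))
```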
AI-Powered Network 5.0 - A Paradigm Shift
Yun Freund
The second day opened with the first of two AI-related
talks. AI has been discussed in NANOG talks before, but perhaps notably
NANOG has not been engulfed by the subject as much as other fields might
be. The speaker focused primarily on how AI, and specifically LLMs, can
help sort through mounds of complex data to address otherwise difficult
problems. Geoff Huston said this all sounded contrary to what the
Internet has long advocated for, which in general is simplicity,
particularly within the network nodes, pushing complexity up and out
towards the edge (an imperfect retelling of the e2e argument). I
offered my own concern to the presenter. I wondered if by relying on AI
to deal with increasingly complex environments we would reinforce the
very complexity that made AI seem necessary in the first place. In other
words, will good design give way to an increasingly complex mess, just
because “AI” will ultimately save us? It is not entirely clear we will
get AI-powered networks as envisioned, but maybe there is a role,
especially in the early stages of design and deployment, to help
reduce complexity and cost. That seems like a path worth exploring.
Measuring Starlink Protocol Performance
Geoff Huston
After ARIN, Geoff is as close as anyone to having a standing speaker
invitation at NANOG. Geoff has talked about this subject before, but not at
NANOG and I could tell many people got a lot out of it. He noted a prior
talk at NANOG that identified TCP with BBR generally performed far
better than TCP with Cubic (two distinct congestion control
strategies). Geoff is a fan of widely deploying BBR
everywhere. I’m not so sure that is a good idea, but it is certainly
worth consideration and debate. Geoff then went into some details of
how Starlink functions, which is helpful to understand why BBR has been
shown to significantly outperform Cubic on the platform. I found it
interesting he mentioned ECN usage as a mechanism that could prove
useful in this system as well. I think it could prove useful in a lot
of systems, but unfortunately, as far as I can tell, widespread
deployment and use of ECN appears to be indefinitely elusive. Starlink
is a fascinating system to study and I think there is probably a lot
more to come in the years ahead.
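As a small aside on BBR, a Linux application can opt a single connection into it without changing system-wide defaults, assuming the kernel module is available; a minimal sketch with a placeholder destination:

```python
# Per-socket congestion control selection on Linux (placeholder destination).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_CONGESTION is Linux-specific; this raises OSError if bbr is unavailable.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))
sock.connect(("192.0.2.10", 80))  # this connection now uses BBR
```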
Recent Linux Improvements that Impact TCP Throughput: Insights from R&E Networks
Brian Tierney
Brian gave a nice in-depth talk on TCP performance through the lens of
high-speed networks in the research and education sector. This was a
modern-day version of the TCP tuning talks routinely seen
at Internet2 Joint Tech (now Tech Exchange) meetings. Outside of R&E
this talk might be of interest to those involved in high-speed data
center and storage operations. The good news is that the latest Linux
kernels continue to improve and perform fairly well without additional
modification. There are some practical considerations the general
network audience might be interested in. For example, they highlight
how poorly SCP/SFTP performs in the WAN due to its limited built-in
buffer size. See their Say no to
scp/sftp page
for details.
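A quick bandwidth-delay product calculation illustrates why a small, fixed transfer buffer like the one scp/sftp carries caps throughput on long fat pipes; the link speed, RTT, and buffer size below are illustrative numbers, not figures from the talk.

```python
# Back-of-the-envelope BDP and buffer-limited throughput (illustrative numbers).
link_bps = 10e9          # 10 Gbit/s path
rtt_s = 0.080            # 80 ms round-trip time
bdp_bytes = link_bps / 8 * rtt_s
print(f"BDP: {bdp_bytes / 1e6:.0f} MB must be in flight to fill the pipe")

buffer_bytes = 2 * 1024 * 1024   # a ~2 MB application window
max_bps = buffer_bytes * 8 / rtt_s
print(f"A {buffer_bytes // (1024 * 1024)} MB buffer caps throughput near {max_bps / 1e6:.0f} Mbit/s")
```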
IPD: Detecting Traffic Ingress Points at ISPs
Ingmar Poese
Ingmar presented an interesting research talk measuring ingress traffic
from a large ISP’s customers. The environment consists of a few
thousand edge routers, not always running BGP, but with flows exported to
a collector. In other words, lots of data from which to measure and
analyze. They were able to group and associate flows based on source
addresses, ranking the ingress points across the network. Some
behaviors such as asymmetric routing and bogus source addresses could
complicate the analysis, but it appears overall they were able to detect
and classify the vast majority of traffic. There is an IPD GitHub
repo for the reference implementation.
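As a toy sketch of the grouping approach as I understood it, one can aggregate exported flow records by ingress router and source prefix and then rank by volume; the field names, sample records, and /24 aggregation below are my own illustrative assumptions, not the IPD implementation.

```python
# Toy flow aggregation: rank ingress points by bytes per (router, source /24).
import ipaddress
from collections import Counter

flows = [
    {"router": "edge1", "src": "198.51.100.7", "bytes": 1200},
    {"router": "edge1", "src": "198.51.100.9", "bytes": 800},
    {"router": "edge2", "src": "203.0.113.20", "bytes": 500},
]

ingress = Counter()
for f in flows:
    prefix = ipaddress.ip_network(f["src"] + "/24", strict=False)
    ingress[(f["router"], prefix)] += f["bytes"]

for (router, prefix), nbytes in ingress.most_common():
    print(f"{router} {prefix} {nbytes} bytes")
```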
Interplanetary Internet: Where no DNS query has gone before…
Scott Johnson
Scott presented an update on work progressing to bring Internet
protocols, or some version of them, to outer space. Popularized by Vint
Cerf, the idea of IP connectivity throughout the solar system has been a
subject of thinking, research, and design for a number of years. One of
the obvious challenges is the transmission delay inherent in such vastly
larger distances. Many assumptions and some protocols will not work
well in this environment. DNS is one such problem area. Scott
described a modification to the DNS where each world would operate its
own root. He suggested each world would have the same TLDs, but they
could contain different data. Why this obvious name collision scenario
is acceptable makes no sense to me, but I assume I’m missing
something. The use of gateways presumably provides some isolation and
separation of service between worlds, but I’m not sure why to do this at
all. The good news is, I doubt we’re going to have to worry about name
collisions any time soon. It appears much of the network would operate
in a mode we’ve seen before, store-and-forward. Quick, get those X.25
network people out of retirement! SMTP was one of the example Internet
applications that can adapt reasonably well in this environment.
The MTU Manifesto
Mencken Davidson
Unfortunately Mencken had 67 pages to get through in 30 minutes. He
tried, but he went over by about 15 minutes and still rushed through the
material. Despite that, Mencken’s material was good and many found it
interesting. For better or worse, the 1500-byte Ethernet payload
limitation has posed a challenge for many netops for years. Jumbo
frames have been around a long time and history is strewn with
presentations about gaining performance benefits from them with mixed
success. Path MTU has long been a challenge in some environments, but
more so recently as various tunnel and overlay mechanisms are
employed. Mencken covered a number of these scenarios and expressed his
own frustrations at dealing with them. One of Mencken’s demands was to
formalize jumbo frames. This seems unlikely to ever happen IMO and
we’re probably stuck with the current situation until the day comes, if
it ever does, when something replaces what we know as 802.3. I’m not
holding my breath. Mencken also wants to see
more signaling between layers so they can inform one another of MTU
sizes. I’m not so sure this will work either, but Linux seems to do
some of this already. We might just as well ask tunneling and overlay
protocols to do their own fragmentation, but that is unlikely to gain a
lot of fans. It seems to me Mencken has contributed a valuable summary
of issues and some possible reactions, but it appears there is no easy
way around the combined complexity of layer 2, tunnels/overlays, and the
autonomy of intermediate devices, which keeps PMTU from being an easily
solved challenge.
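To illustrate the squeeze tunnels and overlays put on a 1500-byte path, here is a simple calculation subtracting commonly cited encapsulation overheads (IPv4 outer headers assumed) from the payload; the specific encapsulations chosen are my examples, not necessarily Mencken’s.

```python
# Usable TCP payload after common encapsulations on a 1500-byte path MTU.
PATH_MTU = 1500

overheads = {
    "plain IPv4 + TCP": 20 + 20,
    "GRE over IPv4": 20 + 4 + 20 + 20,                   # outer IP + GRE + inner IP + TCP
    "VXLAN over IPv4/UDP": 20 + 8 + 8 + 14 + 20 + 20,    # outer IP + UDP + VXLAN + inner Ethernet + inner IP + TCP
}

for name, overhead in overheads.items():
    print(f"{name}: {PATH_MTU - overhead} bytes of TCP payload per packet")
```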
30 Years of BGP
Geoff Huston
A lightning talk by Geoff on the state of BGP. He highlighted the major
trends in the BGP routing table including times of growth and when
things have seemingly slowed. According to current data, IPv4 route
table growth has slowed noticeably. IPv6 has been growing more rapidly,
but its growth pattern also changed recently at the same time as
IPv4’s. Geoff attributes this to the global pandemic. We’ll need a few
more years to see if the IPv4 and IPv6 growth trends diverge or continue
along the same trajectories. I’d guess, or maybe hope, we’d see IPv6
eventually outpacing IPv4, but I wouldn’t bet my life on it. For the
small number of people who weren’t already aware, follow Geoff’s BGP
monitoring work here.
SMTP Distributed Monitoring in-the-wild
John Kristoff
I gave a lightning talk on what we (Dataplane.org) see for SMTP traffic
across our 500+ sensor network. This was primarily a short talk told in plots.
Our sensors do not advertise themselves in any way. That is, we do not
assign A, AAAA, or MX RRs to them. Whatever SMTP traffic they see is
unsolicited. First I showed the distribution of TCP source ports we saw
from SMTP connections. The vast majority of connections utilize the
dynamic/private port range (49152-65535). Then I showed what one or a
set of SMTP scanners look like in plot form. We map the destination IP
address to an integer and use that as the y-axis value. This allows us to
show how spread out across the address space the traffic is. This can tell us if
scanning is occurring over the entire Internet address space or not. I
showed one example where a scanning host was locked onto essentially three
destination addresses for days before switching to an all-address-space
scan. We show what popular scanning services from
Shadowserver and
Censys look like in our plots, which helps
demonstrate their overall frequency and breadth of activity. I was then
able to highlight what looked like maintenance or an outage in Censys
activity by examining data across the address space and time.
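The destination-address-to-integer mapping behind those plots is trivial to reproduce; here it is in Python with placeholder sample addresses.

```python
# Map destination IP addresses to integers for use as scatter-plot y-values.
import ipaddress

def ip_to_y(addr: str) -> int:
    return int(ipaddress.ip_address(addr))

for addr in ["192.0.2.1", "198.51.100.45", "203.0.113.200"]:
    print(addr, "->", ip_to_y(addr))
```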
Hackathon Wrap-up
Aaron Atac
Aaron gave an overview of the hackathon, which was essentially a
capture-the-flag style competition. The hackathon team used
Containerlab for all the challenges and
set up various mock environments to work in. Challenges
involved debugging network connectivity issues, using a network
inventory system, using a REST API, working with GraphQL, and an
RPKI-related reachability issue. I had every intention of participating
as did at least one of my colleagues, but the problem we ran into was
lack of time. The challenge ran from Monday through Thursday 6pm. I
mostly attended talks Monday and then the social event which went to
10pm. Tuesday I was also attending most of the talks so I just didn’t
make time for the event. There ought to be ways to make it easier to
participate, or maybe I just need to free up more time to do it, but I’m not
sure it’ll work in the current schedule. I hope it continues and more
people can do it. Participation was low, but it is a worthwhile event
that should attract more attention.