2024-11-25
When have you seen a vendor or developer publish benchmarks where their product or project was represented as being on par with (or behind) its direct competitors? There must be examples of this, but I’m struggling to come up with any. Regardless, why would anyone choose to do this? I hope this article clarifies why we are publishing these results and helps clear up some very common misconceptions about performance.
The first section describes the how and why, but if you want to skip to the results, feel free. I would encourage you to refer to the earlier sections if you want clarification on this rather complex topic.
We started developing Nebula in early 2017, and open sourced it nearly 3 years later, at the end of 2019. Because we built Nebula within Slack, a company that was growing quickly, we were forced to consider scale and reliability from the beginning.
You might be surprised to learn that we’ve also been benchmarking Nebula against similar software since <checks notes> October of 2018, according to this git commit:
(the commit is real, the email address, less so)
These benchmarks have been a valuable method of validating major changes we’ve made to Nebula over the years, but have also helped us see where we stand compared to our peers. As the space has evolved, we’ve seen results improve for nearly all of the offerings we test against. For our own purposes, we also benchmark Nebula against older versions of Nebula to ensure we catch and resolve things like memory leaks or unexpected CPU use between versions. When your software connects the infrastructure of a service millions of people depend on, it is important to do performance regression testing. Consistency and predictability in resource use and performance are things we value.
Despite the fact we’ve been doing this for years, there is no good public version of data like this. In the cases where benchmarks seem to exist, they are generally dreadfully unscientific, or they are point-in-time snapshots, as opposed to an ongoing effort. There are smart people improving mesh VPN offerings all the time, but no one is tracking their progress as it happens. We aim to change that by opening our benchmarking methodology to public review and contribution.
I recommend you read the paper “Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing” from Centrum Wiskunde & Informatica (CWI), authored by Mark Raasveldt, Pedro Holanda, Tim Gubner & Hannes Mühleisen. It is absolutely brilliant and worth your time. The title mentions databases, which are their particular expertise, but the paper itself makes excellent points that are applicable to all benchmarking. Figure 1 is so accurate that it made me laugh out loud.
Figure 1: Figure 1 from the paper “Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing”
We’ve put a lot of thought into how to make our testing useful and fair over the years. Here are some of the most important guidelines we follow in our testing:
Be as objective as possible. Although we are quite fond of Nebula, we aim to be as objective as possible regarding these results. For years these tests have only been used internally, so there is no incentive to manipulate or distort the data. This has always been for our benefit and will continue to be an extremely valuable part of our testing process.
Buy some hardware. In the beginning, we made the same mistake as everyone else, by trying to get meaningful and reproducible benchmark numbers by spinning up hosts at various cloud providers. To be fair, this may be useful to test changes to your codebase, if you accept the very real caveats to such testing. It is a terrible way to compare software from different vendors, if your goal is accuracy. We have extensive experience running Nebula at a massive scale on hosts at various cloud providers, and the only consistent thing we’ve seen is the inconsistency of the results. A few years ago, we purchased five very boring Dell desktop computers, with relatively boring i7-10700 CPUs, and we installed 10 gigabit network interface cards in each of them, connected to a switch that cost more than any of the computers. Testing is done by netbooting fresh OSes every single time we run our benchmarks. We generally boot the latest LTS release of Ubuntu before every round of testing. Our results are repeatable and consistent, and we run multiple rounds of the same test to validate the results.
Detune the hardware so you aren’t fighting thermal issues. CPUs and cooling solutions can have a lot of variability, so if the chip fab was having a bad day, or your fan has a blade with an invisible crack causing turbulence, it is very possible for two “identical” boxes to perform differently once you reach the top end. To remove this variable, we have disabled some of the speed states that can result in inconsistent performance. Hyperthreading is also disabled on these hosts (see the sketch after these guidelines).
Test multiple streams between multiple hosts. When evaluating mesh VPN software, you should be transmitting and receiving traffic from all of them concurrently. There are a surprising number of new and different performance characteristics that emerge once you involve multiple hosts. Often, when someone posts benchmarks, you’ll see them spin up two hosts and use a single iperf3 stream between them, but that is of limited value. For one thing, iperf3 occasionally uses a full core, and iperf3 itself becomes the bottleneck, which can make for nonsensical results. (Author’s opinion: iperf3 is a remarkably easy to use and useful tool, which is a blessing and a curse.) If you are using a mesh VPN, you probably have more than two hosts communicating at any given time. Honestly, if you only care about a point-to-point connection, use whatever you like. Wireguard is great. IPsec exists. OpenVPN isn’t even that bad these days.
Compare functionally equivalent things. Historically, we benchmarked Nebula not only against mesh VPN software, such as ZeroTier and Tinc, but also against classical VPNs, such as Wireguard and OpenVPN. There were fewer offerings in the space then, but that has changed significantly in the past few years. The goals of Nebula and Wireguard are quite different, and Wireguard doesn’t concern itself with things like ACLs or host identity. This article and subset of our testing is purposefully limited to things functionally equivalent to Nebula, to avoid inevitable lengthy explanations about the differences between dissimilar software.
Level the playing field. Every single one of the mesh VPN options tested here uses a different default MTU. There is nothing wrong with this, but the only way to meaningfully compare performance between these offerings is to set a lowest common denominator packet size. The applications you use determine how much data they send at a given moment, not you, so it is unlikely they will always send packets that take full advantage of a large MTU. We’ve gone back and forth between an effective MTU of 1240 and 1040 various times over the years, by employing MSS clamping within iperf3 (an MSS of 1200 plus 40 bytes of TCP/IP headers gives an effective MTU of 1240). As you’ll see in the results, the most relevant metric is generally packets per second. As you scale up and down through various MTU options, the peak number of packets per second remains the bottleneck. Most networking hardware vendors speak in these terms, and ultimately, the number of packets you can transmit and receive in a particular timeframe is the only thing that matters.
Never comingle the software you are testing on a host. I have witnessed some absurdly bad results due to sloppy testing. Most mesh VPNs continuously try to discover new paths between peers. If you happen to run two of them at once, and don’t tell them to exclude each other as a viable path, you can end up sending, for instance, Nebula traffic through a ZeroTier tunnel. I’ve accidentally done this myself, in fact. Additionally, some mesh VPNs modify routing tables, add iptables rules, and do all manner of things to the host system that may affect the next test. These are variables that should be eliminated.
Learn to tune everything, not just your thing. Over the years, we’ve invested a lot of time in understanding the performance characteristics of everything we test. There is usually no incentive to learn how to make your competitor’s software perform well, but in our case, we want to know if it is happening and learn from those results. A suite of tests where you tune your thing and ignore everything else is dubious.
Have fun. Just kidding. This has been a lot of hard work over the years. It is interesting, but we (I) wouldn’t say it is particularly fun.
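To illustrate the hardware detuning guideline above, here is a minimal sketch of the kinds of knobs involved on a modern Intel Linux host. These are standard sysfs controls, not necessarily the exact set of changes we apply, and the first path assumes the intel_pstate frequency driver is in use:

# Disable turbo boost so clock speeds stay predictable (intel_pstate driver)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Disable hyperthreading (SMT) for the current boot
echo off | sudo tee /sys/devices/system/cpu/smt/control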
For this first public release of data, we’ve chosen to test the most popular mesh VPN options we’re aware of, so we will be comparing Nebula (in AES mode), Netmaker, Tailscale, and ZeroTier (note, this list is intentionally in alphabetical order, as additional confirmation of our commitment to fairness). There is an extremely important caveat to consider when comparing these options, due to its performance implications. Only Nebula and Tailscale directly implement stateful packet filtering. You can use either of them without applying additional rules on the virtual network interfaces associated with their mesh overlay. ZeroTier’s stateless firewall is more limited in capability, but a discussion of the merits of stateful vs stateless packet filtering is out of scope for this writing.
Netmaker has something called “ACLs”, but in reality, they can only prevent entire hosts from talking to each other. They cannot be used to specify particular ports or protocols. Netmaker recommends that you use iptables or similar for fine-grained control. It might be assumed that iptables is fast enough to be effectively “free”, but this is absolutely not the case. Even a single stateful conntrack rule on the INPUT chain can impact performance to the tune of about 10% in our testing. We decided not to use such a rule when testing Netmaker here, despite the fact that most large deployments would and should use fine-grained filtering of traffic between hosts.
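For reference, the single stateful rule described above looks something like the following (a sketch only; a real deployment would pair it with default-drop policies and finer-grained rules):

# One conntrack match on INPUT is enough to engage connection tracking,
# which cost roughly 10% of throughput in our testing
sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT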
A case might be made for us to include (insert another project here), but most others are still based on Wireguard or Wireguard-go. We will consider requests to add other projects based on Wireguard if folks are willing to send us evidence of another option performing substantially differently than a similar option we have tested here.
These benchmarks are meant to be used as an ongoing record of performance. We will settle into a cadence for publishing/comparing things, once we get a feel for the demand for this information. This is a time-intensive task, so there will be limits, but the configurations, test parameters and command lines, and raw results from the tests will be made available on GitHub every time we do a round of testing. If the authors or users of any of these projects would like to help us further tune and refine our testing, we will gladly integrate any reasonable changes and benchmark new versions when possible.
We’ve done several different tests over the years but have distilled this down to just three primary tests that allow us to usefully compare different mesh VPN options. We’ll describe the testing method used, and then show visualizations of the various results along with our interpretation of the data and any caveats.
Description: A single host transmits data to the other four hosts simultaneously with no predetermined rate limit for ten minutes. This test intentionally focuses on the transmit side of each option, so we can determine if there are any asymmetrical bottlenecks.
Method used: iperf3 [host2,3,4,5] -M 1200 -t 600
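Expanded, that shorthand means one MSS-clamped iperf3 client per destination, all started at the same time from host 1. A rough sketch of the fan-out (hostnames are placeholders, and this is not necessarily the exact wrapper we use):

# Launch one clamped ten-minute stream to each peer, then wait for all of them
for host in host2 host3 host4 host5; do
  iperf3 -c "$host" -M 1200 -t 600 --logfile "tx_${host}.log" &
done
wait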
Figure 2: Transmit – Aggregate Throughput (Mbps)
This graph shows that three of the four options, Nebula, Netmaker, and Tailscale, can reach throughput that matches the limits of the underlying hardware, nearly 10 gigabits per second (Gbps). The fourth option, ZeroTier, is single threaded, meaning it cannot take advantage of hosts with a large number of CPU cores available, unlike the others. This results in ZeroTier’s performance being significantly limited, compared to the others. The Tailscale result is a bit more variable than the other two at the top, and you can see various short drops and slightly inconsistent performance, which is +/- ~900 Mbps over the course of the testing.
Note: The strange drop in ZeroTier does happen briefly on every transmit-only test we’ve done, though at different times. We have not yet determined the cause of these temporary throughput drops.
Figure 3: Transmit – Memory Use (MB)
This is the total memory used by processes of three of the four options. Tailscale memory use is highly variable during our testing, and appears to be related to their efforts to coalesce packets for segmentation offloading. Some of this memory might be recovered through garbage collection after the tests, but this is also out of scope for this writing. We have seen memory use exceed 1GB during our tests, and the variability has been difficult to isolate. The memory results here are from the best case run we’ve recorded (where Tailscale used the least memory, compared to other runs).
Nebula and ZeroTier are extremely consistent in memory use, with almost no notable changes throughout testing. Nebula averages 27 megabytes of memory used, and ZeroTier averages 10 megabytes used.
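For clarity, “memory used by processes” refers to the userspace processes of each option, sampled over the course of a run (something like resident set size). A rough sketch of how that can be collected, though not necessarily the exact tooling behind these graphs, with an illustrative process name:

# Print the resident set size (in KB) of the nebula process once per second
while sleep 1; do
  ps -C nebula -o rss=
done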
Note: Because Netmaker on Linux uses the Wireguard kernel module, it is not possible to meaningfully collect data on its memory use, but it is generally efficient and consistent, from external observation.
Figure 4: Transmit – Mbps / CPU
This graph shows the relationship between throughput and CPU resources. You can see that ZeroTier and Nebula are quite similar here, with ZeroTier being a bit more variable. Nebula scales very linearly with additional CPUs. The Tailscale result appears significantly better here, thanks to their use of various Linux segmentation offloading mechanisms. This allows them to use fewer syscalls in their packet processing path. It should be noted, however, that this does increase CPU use by the kernel when dealing with these ‘superpackets’, so while segmentation offloading is certainly an efficiency boost, it is not as drastic as this makes it appear, when accounting for total system resources. Regardless, segmentation offloading is impressive and is what allowed Tailscale’s Linux performance to catch up to Nebula’s throughput numbers in early 2023. See note 1 at the end of this article for some important caveats regarding non-Linux platforms.
Note: Netmaker is again not included because it is hard to quantify kernel thread CPU use. It should be noted that it is quite similar to the others, and uses significant resources at these speeds.
Description: Four hosts transmit data to an individual host as fast as possible for ten minutes. This test intentionally focuses on the receive side of each option, so we can determine if there are any asymmetrical bottlenecks.
Method used: iperf3 [host2,3,4,5] -M 1200 -t 600 -R
Figure 5: Receive – Aggregate Throughput (Mbps)
This graph shows that two of the four options, Netmaker and Tailscale, can reach throughput that matches the limits of the underlying hardware, nearly 10 Gbps. Compare the line for Netmaker with Figure 2 and you’ll see that on the receive side, Netmaker’s line is no longer flat. This is because it is CPU limited on the receive side, just like all of the other options. This is related to the receive side of kernel Wireguard having a different processing path style, which becomes a bottleneck. You can see that Nebula has fallen behind in this test, and is consistently about 900 Mbit/s behind the leader. As before, ZeroTier is held back by its inability to use multiple CPU cores for packet processing.
Figure 6: Receive – Memory Use (MB)
This is again the total memory used by processes of three of the four options, but for the receive side. Tailscale memory use continues to be highly variable during our testing, though a bit less so when receiving. Once again, some of Tailscale’s memory use might be recovered through garbage collection after the tests, but this is also out of scope for this writing. The memory results here are again from the best case run we’ve recorded (where Tailscale used the least memory, compared to other runs).
Nebula and ZeroTier are extremely consistent in memory use, and again here we see no notable changes throughout testing. Nebula again averages 27 megabytes of memory used, and ZeroTier averages 10 megabytes used.
Note: Because Netmaker on Linux uses the Wireguard kernel module, it is not possible to meaningfully collect data on its memory use, but it is generally efficient and consistent, from external observation.
Figure 7: Receive – Mbps / CPU
Similar to Figure 4, this graph shows the relationship between throughput and CPU resources. You can see that ZeroTier is quite similar to before, but Nebula appears to be more efficient. While this is true, it is similar to the reason Tailscale appears more efficient in Figure 4: in this case, Nebula has long implemented the recvmmsg syscall, which shifts the burden slightly toward the kernel, though perhaps less substantially than segmentation offloading (further testing is needed to confirm this).
The Tailscale result again appears significantly better here, thanks to their use of various Linux segmentation offloading mechanisms, which again shift some of the burden of packet processing overhead into the kernel.
Note: Netmaker is again not included because it is hard to quantify kernel thread CPU use. It should be noted that it is quite similar to the others, and uses significant resources at these speeds.
Description: A single host transmits and receives data to/from the other four hosts simultaneously with no predetermined rate limit for ten minutes. This test intentionally combines the send and receive streams, so we can determine if there are any bottlenecks when a box is sending and receiving at its limit.
Method used: iperf3 [host2,3,4,5] -M 1200 -t 600
and iperf3 [host2,3,4,5] -M 1200 -t 600 -R
are run simultaneously on host 1
Figure 8: Bidirectional – Aggregate Throughput (Mbps)
This graph shows independent lines for the simultaneous send/receive traffic, which are added to a total throughput number. It should be noted that the maximum achievable number here is between 19 and 20 Gbps, and you can see that none of the options achieve this performance. Netmaker achieves a total send/receive throughput average of ~13 Gbps, followed by Nebula and Tailscale roughly tied at an average of ~9.6 Gbps. ZeroTier again comes in behind the rest, but you’ll notice that it does have the ability to handle sending and receiving independently, and sees an improvement over the directional tests, averaging ~3 Gbps.
Note: The strange drop in ZeroTier does not seem to happen during bidirectional tests.
Figure 9: Bidirectional – Memory Use (MB)
The results are consistent with previous memory use, with Nebula and ZeroTier using a consistent amount, and Tailscale being more variable and using significantly more memory to process packets.
Note: Because Netmaker on Linux uses the Wireguard kernel module, it is not possible to meaningfully collect data on its memory use, but it is generally efficient and consistent, from external observation.
Figure 10: Bidirectional – Mbps / CPU
As in Figure 4 and Figure 7, this graph shows the relationship between throughput and CPU resources. You can see that ZeroTier is quite similar to before, as is Nebula. This test ends up being significantly more transmit heavy, so the results tend to represent the transmit side more prominently. Tailscale again appears more efficient, but with the caveat that the kernel is doing more work.
Note: Netmaker is again not included because it is hard to quantify kernel thread CPU use. It should be noted that it is quite similar to the others, and uses significant resources at these speeds.
There is no single “best” solution. Nebula, Netmaker, and Tailscale can realistically achieve performance that saturates a 10 Gbps network in a single direction on modern-ish CPUs, and tend to have quite similar profiles regarding total CPU use. Tailscale consistently uses significantly more memory than the rest of the options tested. (Note: Historically Tailscale was pretty far behind, but segmentation offloading has allowed them to achieve much, much better performance, despite the high overhead of their complex internal packet handling paths.)
As noted above, only Nebula and Tailscale have stateful packet filtering, which is an important consideration here. If folks would like an updated version of this test showing the impact of iptables rules on Netmaker, please let us know. For now, it was safer to give Netmaker a slight unfair advantage than to try to explain the reason we might have added seemingly unnecessary iptables rules.
If your network is gigabit, ZeroTier is just as capable as the rest, with the lowest memory use of the three measured. It is quite efficient, but held back by its lack of multithreading.
Finally, I guess we have to start sourcing 40 gigabit ethernet hardware, because the underlying network is now the limit in some tests.
A sincere “thank you” for taking the time to read this relatively dense performance exploration. If you read this hoping for a simple answer regarding the most performant mesh, I’m sure you are sorely disappointed. In fairness, we did telegraph that this would be the case in the title of the article. The GitHub repository with the raw data and configurations will be available soon, and we are happy to run the tests again with reasonable suggestions for tweaking performance.
Bonus notes that didn’t fit anywhere in particular, but are included because perhaps someone will find them interesting:
Most of the performance optimizations implemented by these projects only affect Linux. Things like segmentation offload are not as commonly available on Windows and macOS. While it is not in scope here, we have data that shows Nebula is significantly more efficient than wireguard-go (which both Netmaker and Tailscale use on non-Linux platforms), and if folks care to see this data, we may write a followup article.
Depending on the underlying network, you can use higher MTU values to increase total throughput significantly. In AWS the default MTU is 9001, so Nebula’s MTU is 8600. But another important detail: when you leave an AZ, your MTU drops to 1500. If you send a large Nebula packet, it will become fragmented. AWS (and others) rate limit fragmented packets aggressively, so watch out for that. Nebula has a built-in capability where different network ranges can use different MTUs, which allows you to take advantage of large MTUs within an AZ but default to smaller ones elsewhere (see the sketch at the end of these notes).
We have excluded Tinc from the results, simply because it is rarely competitive, performance-wise. I respect Guus Sliepen and the work folks have done on Tinc over the years, and I was a happy user well into the mid-2010s, but it is not competitive on performance. Additionally, it has some internal routing features that are novel and not replicated by any of the other options.
The Raspberry Pi 5 is the first Pi that supports AES instructions, and Nebula in AES mode easily saturates gigabit without using more than a single core.
It can be hard (or impossible) to evaluate kernel based options directly, since kernel threads are not as easy to measure as userspace processes. This is why we do not have memory use numbers for kernel Wireguard (Netmaker). We could probably get somewhat close by indirect observation, but we aren’t confident in that being accurate enough for our purposes.
I own a Lightning Ethernet adapter because I wanted to remove the variability of WiFi from the testing on iOS, but the Lightning Ethernet adapter has much more variability than WiFi. So, that’s fun.
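As a sketch of the per-range MTU capability mentioned in the AWS note above: Nebula’s config lets the tun section carry a default MTU plus per-route overrides, roughly like this (the route and values are illustrative, mirroring the AWS example, not a recommendation):

tun:
  mtu: 1300        # conservative default for traffic that may leave the AZ
  routes:
    - mtu: 8600    # larger frames for ranges known to stay inside the AZ
      route: 10.0.0.0/16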