path: root/sys/netinet
Commit message | Author | Age | Files | Lines
* Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain | Andrew Gallatin | 2020-12-19 | 4 | -21/+108
  In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, it will likely be reading memory allocated on the NUMA domain associated with the TCP connection.
  This change provides a new socket option, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain.) This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases.
  This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA, allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross-domain traffic on the memory controller.
  Reviewed by: jhb, bz (earlier version), bcr (man page)
  Tested by: gonzo
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D21636
  Notes: svn path=/head/; revision=368819
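The filtering and fallback described in this commit can be modelled in a few lines of userland C. This is only a sketch of the selection rule as the message states it, not the kernel code: `pick_listener()`, its parameters, and the way the hash is applied are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the selection rule: prefer listeners bound to the
 * connection's NUMA domain, and fall back to hashing across every listener
 * when no listener is bound to that domain.
 * Returns an index into listener_domain[].
 */
size_t
pick_listener(const int *listener_domain, size_t nlisteners,
    int conn_domain, uint32_t hash)
{
    size_t i, local = 0, want;

    assert(nlisteners > 0);

    /* Count the listeners that asked for this domain. */
    for (i = 0; i < nlisteners; i++)
        if (listener_domain[i] == conn_domain)
            local++;

    /* Fallback: nobody serves this domain, so hash over all listeners. */
    if (local == 0)
        return (hash % nlisteners);

    /* Pick the (hash % local)-th listener within the domain. */
    want = hash % local;
    for (i = 0; i < nlisteners; i++) {
        if (listener_domain[i] != conn_domain)
            continue;
        if (want == 0)
            return (i);
        want--;
    }
    return (0); /* not reached */
}
```

With four workers split evenly across two domains, a connection arriving on domain 1 is only ever handed to the two domain-1 workers, while a connection on a domain with no workers is spread over all four.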
* Harden the handling of outgoing streams in case of a restart or INIT | Michael Tuexen | 2020-12-13 | 1 | -3/+6
  collision. This avoids an out-of-bounds access in case the peer can break the cookie signature. Thanks to Felix Wilhelm from Google for reporting the issue.
  MFC after: 1 week
  Notes: svn path=/head/; revision=368622
* Clean up more resources of an existing SCTP association in case of | Michael Tuexen | 2020-12-12 | 1 | -1/+56
  a restart. This fixes a use-after-free scenario, which was reported by Felix Wilhelm from Google in case a peer is able to modify the cookie. However, this can also be triggered by an association restart under some specific conditions.
  MFC after: 1 week
  Notes: svn path=/head/; revision=368593
* Add TCP feature Proportional Rate Reduction (PRR) - RFC6937 | Richard Scheffenegger | 2020-12-04 | 2 | -7/+131
  PRR improves loss recovery and avoids RTOs in a wide range of scenarios (ACK thinning) over regular SACK loss recovery.
  PRR is disabled by default; enable it with net.inet.tcp.do_prr = 1. Performance may be impeded by token bucket rate policers at the bottleneck, where net.inet.tcp.do_prr_conservate = 1 should be enabled in addition.
  Submitted by: Aris Angelogiannopoulos
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D18892
  Notes: svn path=/head/; revision=368327
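For readers unfamiliar with PRR, the per-ACK computation from RFC 6937 is compact enough to sketch. The following is a userland illustration that follows the RFC's pseudocode and variable names; it is not the code added by this commit, and `struct prr_state` and `prr_on_ack()` are hypothetical names. The `conservative` flag selects the RFC's conservative reduction bound, which is presumably what the second sysctl above controls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-recovery PRR state; field names follow RFC 6937, not the kernel. */
struct prr_state {
    int64_t ssthresh;      /* target window after the reduction, bytes */
    int64_t recover_fs;    /* RecoverFS: flight size at recovery start */
    int64_t prr_delivered; /* bytes delivered during recovery */
    int64_t prr_out;       /* bytes sent during recovery */
};

/*
 * Process one ACK: account for 'delivered' newly delivered bytes and
 * return how many bytes may be sent in response (sndcnt in RFC 6937).
 * The caller adds whatever it actually sends to prr_out.
 */
int64_t
prr_on_ack(struct prr_state *s, int64_t delivered, int64_t pipe,
    int64_t mss, bool conservative)
{
    int64_t limit, sndcnt;

    assert(s->recover_fs > 0);
    s->prr_delivered += delivered;

    if (pipe > s->ssthresh) {
        /* Proportional part: CEIL(prr_delivered * ssthresh / RecoverFS). */
        sndcnt = (s->prr_delivered * s->ssthresh + s->recover_fs - 1) /
            s->recover_fs - s->prr_out;
    } else {
        /* Conservative reduction bound (PRR-CRB) by default. */
        limit = s->prr_delivered - s->prr_out;
        if (!conservative) {
            /* Slow-start reduction bound (PRR-SSRB). */
            if (limit < delivered)
                limit = delivered;
            limit += mss;
        }
        sndcnt = s->ssthresh - pipe;
        if (sndcnt > limit)
            sndcnt = limit;
    }
    return (sndcnt > 0 ? sndcnt : 0);
}
```

While the pipe is above ssthresh, each ACK releases roughly ssthresh/RecoverFS bytes per delivered byte, so the window shrinks gradually instead of stalling. Once the pipe drops below ssthresh, the conservative bound never sends more than has been delivered, which is the gentler behaviour behind a policer.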
* Remove RADIX_MPATH config option. | Alexander V. Chernikov | 2020-11-29 | 1 | -4/+0
  ROUTE_MPATH is the new config option controlling the new multipath routing implementation. Remove the last pieces of RADIX_MPATH-related code and the config option.
  Reviewed by: glebius
  Differential Revision: https://reviews.freebsd.org/D27244
  Notes: svn path=/head/; revision=368164
* Refactor fib4/fib6 functions. | Alexander V. Chernikov | 2020-11-29 | 2 | -42/+83
  No functional changes.
  * Make the lookup path of fib<4|6>_lookup_debugnet() separate functions (fib<4|6>_lookup_rt()). These will be used in the control plane code requiring unlocked radix operations and the actual prefix pointer.
  * Make the lookup part of fib<4|6>_check_urpf() separate functions. This change simplifies the switch to alternative lookup implementations, which helps the introduction of algorithmic lookups.
  * While here, use static initializers for IPv4/IPv6 keys.
  Differential Revision: https://reviews.freebsd.org/D27405
  Notes: svn path=/head/; revision=368147
* Fix two occurrences of a typo in a comment introduced in r367530. | Michael Tuexen | 2020-11-23 | 2 | -2/+2
  Reported by: lstewart@
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D27148
  Notes: svn path=/head/; revision=367946
* Refactor rib iterator functions. | Alexander V. Chernikov | 2020-11-22 | 1 | -1/+1
  * Make the rib_walk() order of arguments consistent with the rest of the RIB API
  * Add rib_walk_ext(), allowing a callback to be executed before/after iteration.
  * Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del
  * Rename rt_foreach_fib_walk -> rib_foreach_table_walk
  * Move rib_foreach_table_walk{_del} to route/route_helpers.c
  * Slightly refactor rib_foreach_table_walk{_del} to make the implementation consistent and prepare for upcoming iterator optimizations.
  Differential Revision: https://reviews.freebsd.org/D27219
  Notes: svn path=/head/; revision=367941
* Fix an issue I introduced in r367530: tcp_twcheck() can be called | Michael Tuexen | 2020-11-20 | 1 | -10/+13
  with to == NULL for SYN segments. So don't assume to != NULL. Thanks to jhb@ for reporting and suggesting a fix.
  PR: 250499
  MFC after: 1 week
  X-MFC-with: r367530
  Sponsored by: Netflix, Inc.
  Notes: svn path=/head/; revision=367891
* ip_fastfwd: style(9) tidy for r367628 | Ed Maste | 2020-11-13 | 2 | -5/+6
  Discussed with: gnn
  MFC with: r367628
  Notes: svn path=/head/; revision=367645
* Followup pointed out by ae@ | George V. Neville-Neil | 2020-11-13 | 1 | -1/+5
  Notes: svn path=/head/; revision=367635
* An earlier commit effectively turned off the fast forwarding path | George V. Neville-Neil | 2020-11-12 | 3 | -5/+64
  due to its lack of support for ICMP redirects. The following commit adds redirects to the fastforward path, again allowing for decent forwarding performance in the kernel.
  Reviewed by: ae, melifaro
  Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")
  Notes: svn path=/head/; revision=367628
* RFC 7323 specifies that: | Michael Tuexen | 2020-11-09 | 5 | -46/+98
  * TCP segments without timestamps should be dropped when support for the timestamp option has been negotiated.
  * TCP segments with timestamps should be processed normally if support for the timestamp option has not been negotiated.
  This patch enforces the above.
  PR: 250499
  Reviewed by: gnn, rrs
  MFC after: 1 week
  Sponsored by: Netflix, Inc
  Differential Revision: https://reviews.freebsd.org/D27148
  Notes: svn path=/head/; revision=367530
* Fix a potential use-after-free bug introduced in | Michael Tuexen | 2020-11-09 | 1 | -3/+3
  https://svnweb.freebsd.org/changeset/base/363046
  Thanks to Taylor Brandstetter for finding this issue using fuzz testing and reporting it in https://github.com/sctplab/usrsctp/issues/547
  Notes: svn path=/head/; revision=367520
* igmp: convert igmpstat to use PCPU counters | Mitchell Horne | 2020-11-08 | 2 | -21/+31
  Currently there is no locking done to protect this structure. It is likely okay due to the low-volume nature of IGMP, but allows for the possibility of underflow. This appears to be one of the only holdouts of the conversion to counter(9) which was done for most protocol stat structures around 2013.
  This also updates the visibility of this stats structure so that it can be consumed from elsewhere in the kernel, consistent with the vast majority of VNET_PCPUSTAT structures.
  Reviewed by: kp
  Sponsored by: NetApp, Inc.
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D27023
  Notes: svn path=/head/; revision=367493
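The idea behind per-CPU counters can be shown with a small userland model: every CPU increments its own slot, so writers never contend on a shared word, and a reader sums the slots. This is a sketch of the concept only. It does not use the kernel's counter(9) interface, and `NCPU`, `struct pcpu_counter`, and both functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 4 /* hypothetical CPU count for the sketch */

/* One slot per CPU, so an increment never touches another CPU's slot. */
struct pcpu_counter {
    uint64_t slot[NCPU];
};

/* Called on the CPU that observed the event. */
void
pcpu_counter_add(struct pcpu_counter *c, unsigned int cpu, uint64_t inc)
{
    assert(cpu < NCPU);
    c->slot[cpu] += inc;
}

/* Readers sum all slots to obtain the statistic. */
uint64_t
pcpu_counter_fetch(const struct pcpu_counter *c)
{
    uint64_t sum = 0;
    unsigned int i;

    for (i = 0; i < NCPU; i++)
        sum += c->slot[i];
    return (sum);
}
```

The trade-off is that a read is a loop rather than a single load, which suits statistics that are updated often and read rarely.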
* Prevent premature SACK block transmission during loss recovery | Richard Scheffenegger | 2020-11-08 | 6 | -31/+74
  Under specific conditions, a window update can be sent with outdated SACK information. Some clients react to this by subsequently delaying loss recovery, making TCP perform very poorly.
  Reported by: chengc_netapp.com
  Reviewed by: rrs, jtl
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D24237
  Notes: svn path=/head/; revision=367492
* Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc(). | John Baldwin | 2020-10-29 | 2 | -28/+13
  This gives a more uniform API for send tag life cycle management.
  Reviewed by: gallatin, hselasky
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D27000
  Notes: svn path=/head/; revision=367151
* Call m_snd_tag_rele() to free send tags. | John Baldwin | 2020-10-29 | 3 | -20/+6
  Send tags are refcounted and if_snd_tag_free() is called by m_snd_tag_rele() when the last reference is dropped on a send tag.
  Reviewed by: gallatin, hselasky
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D26995
  Notes: svn path=/head/; revision=367148
* Remove an extra if_ref(). | John Baldwin | 2020-10-29 | 1 | -1/+0
  In r348254, if_snd_tag_alloc() routines were changed to bump the ifp refcount via m_snd_tag_init(). This function wasn't in the tree at the time and wasn't updated for the new semantics, so was still doing a separate bump after if_snd_tag_alloc() returned.
  Reviewed by: gallatin
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D26999
  Notes: svn path=/head/; revision=367147
* Store the new send tag in the right place. | John Baldwin | 2020-10-29 | 1 | -1/+1
  r350501 added the 'st' parameter, but did not pass it down to if_snd_tag_alloc().
  Reviewed by: gallatin
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D26997
  Notes: svn path=/head/; revision=367146
* Support hardware rate limiting (pacing) with TLS offload. | John Baldwin | 2020-10-29 | 1 | -9/+63
  - Add a new send tag type for a send tag that supports both rate limiting (packet pacing) and TLS offload (mostly similar to D22669 but adds a separate structure when allocating the new tag type).
  - When allocating a send tag for TLS offload, check to see if the connection already has a pacing rate. If so, allocate a tag that supports both rate limiting and TLS offload rather than a plain TLS offload tag.
  - When setting an initial rate on an existing ifnet KTLS connection, set the rate in the TCP control block inp and then reset the TLS send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit send tag. This allocates the TLS send tag asynchronously from a task queue, so the TLS rate limit tag alloc is always sleepable.
  - When modifying a rate on a connection using KTLS, look for a TLS send tag. If the send tag is only a plain TLS send tag, assume we failed to allocate a TLS ratelimit tag (either during the TCP_TXTLS_ENABLE socket option, or during the send tag reset triggered by ktls_output_eagain) and ignore the new rate. If the send tag is a ratelimit TLS send tag, change the rate on the TLS tag and leave the inp tag alone.
  - Lock the inp lock when setting sb_tls_info for a socket send buffer so that the routines in tcp_ratelimit can safely dereference the pointer without needing to grab the socket buffer lock.
  - Add an IFCAP_TXTLS_RTLMT capability flag and associated administrative controls in ifconfig(8). TLS rate limit tags are only allocated if this capability is enabled. Note that TLS offload (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].
  Reviewed by: gallatin, hselasky
  Relnotes: yes
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D26691
  Notes: svn path=/head/; revision=367123
* Save the current TCP pacing rate in t_pacing_rate. | John Baldwin | 2020-10-29 | 3 | -0/+11
  Reviewed by: gallatin, gnn
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D26875
  Notes: svn path=/head/; revision=367122
* TCP Cubic: improve reaction to (and rollback from) RTO | Richard Scheffenegger | 2020-10-24 | 1 | -28/+42
  1. Fix a compliance issue in CUBIC RTO handling according to RFC8312 section 4.7
  2. Add CUBIC CC_RTO_ERR handling
  Submitted by: chengc_netapp.com
  Reviewed by: rrs, tuexen, rscheff
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D26808
  Notes: svn path=/head/; revision=367008
* tcp: move cwnd and ssthresh updates into cc modules | Richard Scheffenegger | 2020-10-24 | 5 | -8/+24
  This will pave the way for setting ssthresh differently in TCP CUBIC, according to RFC8312 section 4.7.
  No functional change, only code movement.
  Submitted by: chengc_netapp.com
  Reviewed by: rrs, tuexen, rscheff
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D26807
  Notes: svn path=/head/; revision=367007
* icmp6: Count packets dropped due to an invalid hop limit | Mark Johnston | 2020-10-19 | 1 | -0/+2
  Pad the icmp6stat structure so that we can add more counters in the future without breaking compatibility again, last done in r358620. Annotate the rarely executed error paths with __predict_false while here.
  Reviewed by: bz, melifaro
  Sponsored by: NetApp, Inc.
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D26578
  Notes: svn path=/head/; revision=366842
* Implement flowid calculation for outbound connections to balance | Alexander V. Chernikov | 2020-10-18 | 7 | -23/+125
  connections over multiple paths.
  Multipath routing relies on mbuf flowid data for both transit and outbound traffic. Current code fills mbuf flowid from inp_flowid for connection-oriented sockets. However, inp_flowid is currently not calculated for outbound connections.
  This change creates simple hashing functions and starts calculating hashes for TCP, UDP/UDP-Lite and raw IP if multipath routes are present in the system.
  Reviewed by: glebius (previous version), ae
  Differential Revision: https://reviews.freebsd.org/D26523
  Notes: svn path=/head/; revision=366813
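To illustrate what "calculating a flowid" means, the sketch below hashes an IPv4 connection tuple into a 32-bit value and maps it onto a path. The commit does not say which hash function the kernel uses; FNV-1a is swapped in here purely because it is short, and `flowid_v4()` and `pick_path()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over a byte range; chosen for brevity, not taken from the commit. */
static uint32_t
fnv1a(uint32_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0) {
        h ^= *p++;
        h *= 16777619u; /* FNV prime */
    }
    return (h);
}

/* Hash the IPv4 addresses, ports and protocol into a flow id. */
uint32_t
flowid_v4(uint32_t laddr, uint32_t faddr, uint16_t lport, uint16_t fport,
    uint8_t proto)
{
    uint32_t h = 2166136261u; /* FNV offset basis */

    h = fnv1a(h, &laddr, sizeof(laddr));
    h = fnv1a(h, &faddr, sizeof(faddr));
    h = fnv1a(h, &lport, sizeof(lport));
    h = fnv1a(h, &fport, sizeof(fport));
    h = fnv1a(h, &proto, sizeof(proto));
    return (h);
}

/* Map a flow id onto one of npaths paths. */
unsigned int
pick_path(uint32_t flowid, unsigned int npaths)
{
    assert(npaths > 0);
    return (flowid % npaths);
}
```

Because the hash depends only on the tuple, every packet of one connection takes the same path, while different connections spread across the available paths.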
* Simplify NET_EPOCH_EXIT in inp_join_group(). | Alexander V. Chernikov | 2020-10-18 | 1 | -1/+3
  Suggested by: kib
  Notes: svn path=/head/; revision=366807
* Fix sleepq_add panic happening with too wide net epoch in mcast control. | Alexander V. Chernikov | 2020-10-17 | 1 | -11/+21
  PR: 250413
  Reported by: Christopher Hall <hsw at bitmark.com>
  Reviewed by: ae
  Differential Revision: https://reviews.freebsd.org/D26827
  Notes: svn path=/head/; revision=366795
* Improve the handling of cookie life times. | Michael Tuexen | 2020-10-16 | 5 | -27/+44
  The staleness reported in an error cause is in us, not ms. Enforce limits on the life time via sysctl and socket options consistently. Update the description of the sysctl variable to use the right unit. Also do some minor cleanups.
  This also fixes an integer overflow issue if the peer can modify the cookie. This was reported by Felix Weinrank by fuzz testing the userland stack and in https://oss-fuzz.com/testcase-detail/4800394024452096
  MFC after: 3 days
  Notes: svn path=/head/; revision=366750
* Implement SIOCGIFALIAS. | Andrey V. Elsukov | 2020-10-14 | 1 | -0/+60
  It is a lightweight way to check if an IPv4 address exists.
  Submitted by: Roy Marples
  Reviewed by: gnn, melifaro
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D26636
  Notes: svn path=/head/; revision=366695
* Join the AllHosts multicast group again when adding an existing IPv4 address. | Andrey V. Elsukov | 2020-10-13 | 1 | -1/+2
  When the SIOCAIFADDR ioctl configures an IPv4 address that already exists, it removes the old ifaddr. When this IPv4 address is the only one configured on the interface, this also leads to leaving the AllHosts multicast group. Then the address is added again, but due to the bug, this doesn't lead to joining the AllHosts multicast group.
  Submitted by: yannis.planus_alstomgroup.com
  Reviewed by: gnn
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D26757
  Notes: svn path=/head/; revision=366682
* ip_mroute: fix the viftable export sysctl | Bjoern A. Zeeb | 2020-10-11 | 1 | -7/+24
  It seems that in r354857 I got more than one thing wrong. Convert the SYSCTL_OPAQUE to a SYSCTL_PROC to properly export the per-vnet viftable array, which these days is allocated and no longer static. This fixes a problem with netstat -g which would show bogus information for the IPv4 Virtual Interface Table.
  PR: 246626
  Reported by: Ozkan KIRIK (ozkan.kirik gmail.com)
  MFC after: 3 days
  Notes: svn path=/head/; revision=366623
* Stop sending tiny new data segments during SACK recovery | Richard Scheffenegger | 2020-10-09 | 2 | -4/+4
  Consider the currently in-use TCP options when calculating the amount of new data to be injected during SACK loss recovery.
  That addresses the effect that very small (new) segments could be injected on partial ACKs while still performing a SACK loss recovery.
  Reported by: Liang Tian
  Reviewed by: tuexen, chengc_netapp.com
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D26446
  Notes: svn path=/head/; revision=366570
* Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow. | Richard Scheffenegger | 2020-10-09 | 3 | -0/+52
  This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt) option IP(V6)_VLAN_PCP, which can be set to -1 (interface default), or explicitly to any priority between 0 and 7.
  Note that for untagged traffic, explicitly adding a priority will insert a special 802.1Q VLAN header with VLAN ID = 0 to carry the priority setting.
  Reviewed by: gallatin, rrs
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D26409
  Notes: svn path=/head/; revision=366569
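The "special 802.1Q header with VLAN ID = 0" is easiest to understand from the layout of the tag's 16-bit Tag Control Information field: three priority bits, one drop-eligible bit, and a 12-bit VLAN ID. The helper below is a sketch of that bit layout only; `vlan_tci()` is a hypothetical name and says nothing about how the kernel builds the tag.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build the 16-bit 802.1Q Tag Control Information field:
 * PCP in bits 15-13, DEI in bit 12 (left clear), VLAN ID in bits 11-0.
 */
uint16_t
vlan_tci(int pcp, uint16_t vid)
{
    assert(pcp >= 0 && pcp <= 7);
    assert(vid <= 0x0fff);
    return ((uint16_t)((pcp << 13) | vid));
}
```

A priority of 5 on otherwise untagged traffic therefore gives a TCI of 0xA000: the priority bits are set and the VLAN ID is zero, which marks the frame as priority-tagged rather than as a member of a VLAN. The option value -1 (interface default) is outside the range this helper accepts.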
* Extend netstat to display TCP stack and detailed congestion state (2) | Richard Scheffenegger | 2020-10-09 | 2 | -1/+14
  Extend netstat to display TCP stack and detailed congestion state.
  Add the "-c" option, used to show detailed per-connection congestion control state for TCP sessions.
  This is one summary patch, which adds the relevant variables into xtcpcb. As previous "spare" space is used, these changes are ABI compatible.
  Reviewed by: tuexen
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D26518
  Notes: svn path=/head/; revision=366567
* Minor cleanups. | Michael Tuexen | 2020-10-07 | 2 | -4/+3
  MFC after: 3 days
  Notes: svn path=/head/; revision=366517
* Check if_capenable, not if_capabilities when enabling rate limiting. | John Baldwin | 2020-10-06 | 1 | -2/+2
  if_capabilities is a read-only mask of supported capabilities. if_capenable is a mask under administrative control via ifconfig(8).
  Reviewed by: gallatin
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D26690
  Notes: svn path=/head/; revision=366492
* Reset delayed SACK state when restarting an SCTP association. | Michael Tuexen | 2020-10-06 | 1 | -5/+2
  MFC after: 3 days
  Notes: svn path=/head/; revision=366489
* Ensure variables are initialized before used. | Michael Tuexen | 2020-10-06 | 2 | -1/+4
  MFC after: 3 days
  Notes: svn path=/head/; revision=366483
* Remove dead stores reported by clang static code analysis | Michael Tuexen | 2020-10-06 | 4 | -12/+4
  MFC after: 3 days
  Notes: svn path=/head/; revision=366482
* Cleanup, no functional change intended. | Michael Tuexen | 2020-10-06 | 1 | -34/+18
  MFC after: 3 days
  Notes: svn path=/head/; revision=366480
* Whitespace changes. | Michael Tuexen | 2020-10-06 | 1 | -3/+2
  MFC after: 3 days
  Notes: svn path=/head/; revision=366474
* Use __func__ instead of __FUNCTION__ for consistency. | Michael Tuexen | 2020-10-04 | 2 | -2/+2
  MFC after: 3 days
  Notes: svn path=/head/; revision=366426
* Cleanup, no functional change intended. | Michael Tuexen | 2020-10-04 | 1 | -30/+22
  MFC after: 3 days
  Notes: svn path=/head/; revision=366425
* Introduce scalable route multipath. | Alexander V. Chernikov | 2020-10-03 | 4 | -59/+20
  This change is based on the nexthop objects landed in D24232.
  The change introduces the concept of nexthop groups. Each group contains the collection of nexthops with their relative weights and a dataplane-optimized structure to enable efficient nexthop selection.
  Similar to the nexthops, nexthop groups are immutable. The dataplane part gets compiled during group creation and is basically an array of nexthop pointers, compiled w.r.t. their weights.
  With this change, the `rt_nhop` field of `struct rtentry` contains either a nexthop or a nexthop group. They are distinguished by the presence of the NHF_MULTIPATH flag. All dataplane lookup functions return a pointer to the nexthop object, leaving nexthop group details inside the routing subsystem.
  User-visible changes:
  The change is intended to be backward-compatible: all non-mpath operations should work as before with ROUTE_MPATH and net.route.multipath=1.
  All routes now come with a weight; the default weight is 1, the maximum is 2^24-1.
  The current maximum multipath group width is statically set to 64. This will become sysctl-tunable in the followup changes.
  Using functionality:
  * Recompile kernel with ROUTE_MPATH
  * set net.route.multipath to 1
    route add -6 2001:db8::/32 2001:db8::2 -weight 10
    route add -6 2001:db8::/32 2001:db8::3 -weight 20
    netstat -6On
    Nexthop groups data
    Internet6:
    GrpIdx  NhIdx  Weight  Slots  Gateway      Netif  Refcnt
    1       -----  ------  -----  -----------  -----  1
            13     10      1      2001:db8::2  vlan2
            14     20      2      2001:db8::3  vlan2
  Next steps:
  * Land outbound hashing for locally-originated routes (D26523).
  * Fix net/bird multipath (net/frr seems to work fine)
  * Add ROUTE_MPATH to GENERIC
  * Set net.route.multipath=1 by default
  Tested by: olivier
  Reviewed by: glebius
  Relnotes: yes
  Differential Revision: https://reviews.freebsd.org/D26449
  Notes: svn path=/head/; revision=366390
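The "array of nexthop pointers, compiled w.r.t. their weights" can be pictured with a small model. The sketch below reduces the weights by their greatest common divisor and repeats each nexthop index that many times, which reproduces the 1 and 2 slots shown for weights 10 and 20 in the netstat output. Whether the kernel uses exactly this reduction is an assumption; `nhgrp_compile()`, `nhgrp_select()`, and `MAX_SLOTS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 64 /* illustrative cap for the sketch */

static uint32_t
gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t t = a % b;

        a = b;
        b = t;
    }
    return (a);
}

/*
 * Fill slots[] with nexthop indices, each repeated in proportion to its
 * weight (weights are first divided by their common divisor).
 * Returns the number of slots used, or 0 if they do not fit.
 */
size_t
nhgrp_compile(const uint32_t *weight, size_t n, uint8_t *slots)
{
    uint32_t g = 0;
    size_t i, used = 0;

    for (i = 0; i < n; i++) {
        assert(weight[i] > 0);
        g = gcd_u32(g, weight[i]);
    }
    for (i = 0; i < n; i++) {
        uint32_t k = weight[i] / g;

        if (k > MAX_SLOTS - used)
            return (0);
        while (k-- > 0)
            slots[used++] = (uint8_t)i;
    }
    return (used);
}

/* Dataplane selection is then a single array lookup. */
uint8_t
nhgrp_select(const uint8_t *slots, size_t nslots, uint32_t flowid)
{
    assert(nslots > 0);
    return (slots[flowid % nslots]);
}
```

Doing the weighting once at group creation keeps the per-packet cost to a modulo and an index, which fits the immutability noted in the commit message: a changed weight means compiling a new group, not editing a live array.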
* Improve the input validation and processing of cookies. | Michael Tuexen | 2020-09-29 | 2 | -16/+14
  This avoids setting the association in an inconsistent state, which could result in a use-after-free situation. This can be triggered by a malicious peer, if the peer can modify the cookie without the local endpoint recognizing it. Thanks to Ned Williamson for reporting the issue.
  MFC after: 3 days
  Notes: svn path=/head/; revision=366248
* Minor cleanup. | Michael Tuexen | 2020-09-28 | 1 | -1/+1
  MFC after: 3 days
  Notes: svn path=/head/; revision=366226
* Cleanup, no functional change intended. | Michael Tuexen | 2020-09-27 | 1 | -3/+1
  MFC after: 3 days
  Notes: svn path=/head/; revision=366199
* Improve the handling of receiving unordered and unreliable user | Michael Tuexen | 2020-09-27 | 1 | -1/+3
  messages using DATA chunks. Don't use fsn_included when not being sure that it is set to an appropriate value. If the default is used, which is -1, this can result in SCTP associations not making any user visible progress.
  Thanks to Yutaka Takeda for reporting this issue for the userland stack in https://github.com/pion/sctp/issues/138.
  MFC after: 3 days
  Notes: svn path=/head/; revision=366198
* TCP: send full initial window when timestamps are in use | Richard Scheffenegger | 2020-09-25 | 3 | -7/+22
  The fastpath in tcp_output tries to send out full segments, and avoids sending partial segments, by comparing against the static t_maxseg variable. That value does not consider TCP options like timestamps, while the initial window calculation is using the correct dynamic tcp_maxseg() function.
  Due to this interaction, the last, full-size segment is considered too short and not sent out immediately.
  Reviewed by: tuexen
  MFC after: 2 weeks
  Sponsored by: NetApp, Inc.
  Differential Revision: https://reviews.freebsd.org/D26478
  Notes: svn path=/head/; revision=366150
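The mismatch is simple arithmetic. The timestamp option occupies 12 bytes of each segment once padded, so with a t_maxseg of 1460 a "full" segment carries 1448 bytes of payload, and comparing that length against 1460 wrongly classifies it as partial. The sketch below shows the corrected comparison in userland form. `effective_maxseg()` and `is_full_segment()` are hypothetical helpers, not the kernel's tcp_maxseg(), and only the timestamp option is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TS_OPTLEN 12 /* timestamp option padded to 12 bytes */

/* Payload that fits in one segment once the options in use are deducted. */
uint32_t
effective_maxseg(uint32_t t_maxseg, bool timestamps)
{
    uint32_t optlen = timestamps ? TS_OPTLEN : 0;

    assert(t_maxseg > optlen);
    return (t_maxseg - optlen);
}

/* A segment counts as full when it carries the effective maximum payload. */
bool
is_full_segment(uint32_t len, uint32_t t_maxseg, bool timestamps)
{
    return (len >= effective_maxseg(t_maxseg, timestamps));
}
```

With timestamps enabled, a 1448-byte segment now passes the "full segment" test, so the last segment of the initial window no longer waits behind the partial-segment logic.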