path: root/sys/netinet/tcp_subr.c
Commit message | Author | Age | Files | Lines
* Enhance our RFC1948 implementation to perform better in some pathological | Mike Silbersack | 2004-04-20 | 1 | -2/+53
| TIME_WAIT recycling cases I was able to generate with http testing tools.
|
| In short, as the old algorithm relied on ticks to create the time offset component of an ISN, two connections with the exact same host, port pair that were generated between timer ticks would have the exact same sequence number. As a result, the second connection would fail to pass the TIME_WAIT check on the server side, and the SYN would never be acknowledged.
|
| I've "fixed" this by adding random positive increments to the time component between clock ticks so that ISNs will *always* be increasing, no matter how quickly the port is recycled.
|
| Except in such contrived benchmarking situations, this problem should never come up in normal usage... until networks get faster.
|
| No MFC planned, 4.x is missing other optimizations that are needed to even create the situation in which such quick port recycling will occur.
|
| Notes: svn path=/head/; revision=128452
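The random-increment idea in this commit message can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the committed code: the `isn_clock` structure, the `ISN_BYTES_PER_TICK` rate and the `ISN_MAX_JITTER` bound are hypothetical, and `rand()` stands in for whatever randomness source the kernel actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical constants chosen for illustration only. */
#define ISN_BYTES_PER_TICK 10000u /* assumed advance per clock tick */
#define ISN_MAX_JITTER     64u    /* assumed upper bound on the random step */

/* Hypothetical state for the time component of an ISN. */
struct isn_clock {
	uint32_t last_tick; /* tick counter seen on the previous call */
	uint32_t offset;    /* running time component */
};

/*
 * Return the time component for a new ISN.  The tick-driven part only
 * moves when the clock ticks, so a random increment of at least 1 is
 * added on every call; two calls within one tick still get different,
 * increasing values (modulo 32-bit wraparound).
 */
static uint32_t
isn_time_component(struct isn_clock *c, uint32_t now_ticks)
{
	assert(c != NULL);

	c->offset += (now_ticks - c->last_tick) * ISN_BYTES_PER_TICK;
	c->last_tick = now_ticks;
	c->offset += 1u + (uint32_t)rand() % ISN_MAX_JITTER;
	return (c->offset);
}
```

Calling `isn_time_component()` twice with the same `now_ticks` value models two connections created between timer ticks; the second result is always ahead of the first, which is the property the TIME_WAIT check needs.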
* Remove advertising clause from University of California Regent'sWarner Losh2004-04-071-4/+0
| | | | | | | | | | license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson Notes: svn path=/head/; revision=128019
* Two missed in previous commit -- compare pointer with NULL rather thanRobert Watson2004-04-051-2/+2
| | | | | | | using it as a boolean. Notes: svn path=/head/; revision=127871
* Prefer NULL to 0 when checking pointer values as integers or booleans.Robert Watson2004-04-051-19/+20
| | | | Notes: svn path=/head/; revision=127870
* Remove now unneeded arguments to tcp_twrespond() -- so and msrc. TheseRobert Watson2004-02-281-10/+2
| | | | | | | | | were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam Notes: svn path=/head/; revision=126351
* Split the mlock() kernel code into two parts, mlock(), which unpacksDon Lewis2004-02-261-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms Notes: svn path=/head/; revision=126253
* Convert the tcp segment reassembly queue to UMA and limit the maximum | Andre Oppermann | 2004-02-24 | 1 | -2/+7
| amount of segments it will hold.
|
| The following tunables and sysctls control the behaviour of the tcp segment reassembly queue:
|
|  net.inet.tcp.reass.maxsegments (loader tunable) specifies the maximum number of segments all tcp reassembly queues can hold (defaults to 1/16 of nmbclusters).
|
|  net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48).
|
|  net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues.
|
|  net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached.
|
| Tested by: bms, silby
| Reviewed by: bms, silby
|
| Notes: svn path=/head/; revision=126193
* Fixed ucred structure leak.Pawel Jakub Dawidek2004-02-191-0/+2
| | | | | | | | | Approved by: scottl (mentor) PR: 54163 MFC after: 3 days Notes: svn path=/head/; revision=126002
* Final brucification pass. Spell types consistently (u_int). Remove bogusBruce M Simpson2004-02-141-1/+1
| casts. Remove unnecessary parentheses.
|
| Submitted by: bde
|
| Notes: svn path=/head/; revision=125819
* Brucification.Bruce M Simpson2004-02-131-10/+14
| | | | | | | Submitted by: bde Notes: svn path=/head/; revision=125783
* supported IPV6_RECVPATHMTU socket option.Hajimu UMEMOTO2004-02-131-2/+2
| | | | | | | Obtained from: KAME Notes: svn path=/head/; revision=125776
* Update the prototype for tcpsignature_apply() to reflect the spelling ofBruce M Simpson2004-02-121-2/+2
| | | | | | | | | the types used by m_apply()'s callback function, f, as documented in mbuf(9). Noticed by: njl Notes: svn path=/head/; revision=125742
* style(9) pass; whitespace and comments.Bruce M Simpson2004-02-121-17/+22
| | | | | | | Submitted by: njl Notes: svn path=/head/; revision=125741
* Initial import of RFC 2385 (TCP-MD5) digest support.Bruce M Simpson2004-02-111-0/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net Notes: svn path=/head/; revision=125680
* Limiters and sanity checks for TCP MSS (maximum segment size) | Andre Oppermann | 2004-01-08 | 1 | -0/+24
| resource exhaustion attacks.
|
| For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU.
|
| The resource exhaustion works in two ways:
|
|  o during tcp connection setup the advertised local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one.
|
|    For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limits of network components along the path.
|
|    This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP).
|
|    We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets.
|
|  o the local host is receiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may choose to do what is described in the first attack and send the data in packets with a TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process.
|
|    For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches).
|
|    This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW servers where large POSTs can be made.
|
|    We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is reset and dropped.
|
| MITRE CVE: CAN-2004-0002
| Reviewed by: sam (mentor)
| MFC after: 1 day
|
| Notes: svn path=/head/; revision=124258
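The figures quoted in this commit message can be reproduced with a little arithmetic, sketched below. The 52-byte overhead (20 bytes IPv4, 20 bytes TCP, 12 bytes of TCP options) is an assumption; it is the value that yields both the 1448-byte and the 12-byte payloads, and a 53-byte packet (that overhead plus one payload byte) yields the 4716 packets per second. The `mss_overload()` function is a hypothetical rendering of the inbound check described in prose, not the code in tcp_input().

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed: 20 bytes IPv4 + 20 bytes TCP + 12 bytes of TCP options. */
#define HDR_OVERHEAD 52

/* TCP payload that fits in one packet for a given MTU. */
static int
payload_for_mtu(int mtu)
{
	assert(mtu > HDR_OVERHEAD);
	return (mtu - HDR_OVERHEAD);
}

/* Whole packets per second at a given bit rate and on-wire packet size. */
static long
packets_per_second(long bits_per_second, int packet_bytes)
{
	assert(packet_bytes > 0);
	return (bits_per_second / 8 / packet_bytes);
}

/*
 * Hypothetical form of the inbound check: over a one-second window, flag
 * the connection when the packet rate exceeds minmssoverload while the
 * average payload per packet is below minmss.
 */
static bool
mss_overload(long bytes_in_window, long packets_in_window,
    long minmss, long minmssoverload)
{
	if (packets_in_window <= minmssoverload)
		return (false);
	return (bytes_in_window / packets_in_window < minmss);
}
```

With these helpers, `payload_for_mtu(1500)` is 1448 and `payload_for_mtu(64)` is 12, a ratio of roughly 120, and `packets_per_second(2000000, 53)` is 4716. A sender delivering 4716 one-byte segments in a second trips `mss_overload()` with the defaults of 256 and 1000, while a sender delivering 2000 full-sized segments does not.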
* If path mtu discovery is enabled set the DF bit in all cases weAndre Oppermann2004-01-081-0/+4
| | | | | | | | | | | send packets on a tcp connection. PR: kern/60889 Tested by: Richard Wendland <richard@wendland.org.uk> Approved by: re (scottl) Notes: svn path=/head/; revision=124248
* Enable the following TCP options by default to give them more exposure: | Andre Oppermann | 2004-01-06 | 1 | -1/+1
|
|  rfc3042   Limited retransmit
|  rfc3390   Increasing TCP's initial congestion Window
|  inflight  TCP inflight bandwidth limiting
|
| All my production servers have them enabled and there have been no issues. I am confident about having them on by default and it gives us better overall TCP performance.
|
| Reviewed by: sam (mentor)
|
| Notes: svn path=/head/; revision=124199
* Fix some becuase -> because typos.John Baldwin2003-12-171-1/+1
| | | | | | | Reported by: Marco Wertejuk <wertejuk@mwcis.com> Notes: svn path=/head/; revision=123608
* Switch TCP over to using the inpcb label when responding in timedRobert Watson2003-12-171-4/+1
| | | | | | | | | | | | | | | | | | | wait, rather than the socket label. This avoids reaching up to the socket layer during connection close, which requires locking changes. To do this, introduce MAC Framework entry point mac_create_mbuf_from_inpcb(), which is called from tcp_twrespond() instead of calling mac_create_mbuf_from_socket() or mac_create_mbuf_netlayer(). Introduce MAC Policy entry point mpo_create_mbuf_from_inpcb(), and implementations for various policies, which generally just copy label data from the inpcb to the mbuf. Assert the inpcb lock in the entry point since we require consistency for the inpcb label reference. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories Notes: svn path=/head/; revision=123607
* Make sure all uses of stack allocated struct route's are properlyAndre Oppermann2003-11-261-2/+2
| | | | | | | | | | | zeroed. Doing a bzero on the entire struct route is not more expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst. Reviewed by: sam (mentor) Approved by: re (scottl) Notes: svn path=/head/; revision=122996
* Introduce tcp_hostcache and remove the tcp specific metrics fromAndre Oppermann2003-11-201-223/+125
| | | | | | | | | | | | | | | | | | | | | | | | | | the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl) Notes: svn path=/head/; revision=122922
* o correct locking problem: the inpcb must be held across tcp_respondSam Leffler2003-11-081-15/+20
| o add assertions in tcp_respond to validate inpcb locking assumptions
| o use local variable instead of chasing pointers in tcp_respond
|
| Supported by: FreeBSD Foundation
|
| Notes: svn path=/head/; revision=122327
* Add an additional check to the tcp_twrecycleable function; I hadMike Silbersack2003-11-021-3/+16
| previously only considered the send sequence space. Unfortunately, some OSes (Windows) still use a random positive increments scheme for their syn-ack ISNs, so I must consider receive sequence space as well.
|
| The value of 250000 bytes / second for Microsoft's ISN rate of increase was determined by testing with an XP machine.
|
| Notes: svn path=/head/; revision=121884
* - Add a new function tcp_twrecycleable, which tells us if the ISN whichMike Silbersack2003-11-011-0/+19
| we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled.
|
| - Have in_pcblookup_local use tcp_twrecycleable to determine if time_wait sockets which are hogging local ports can be safely freed.
|
| This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.
|
| Notes: svn path=/head/; revision=121850
* Reduce the number of tcp time_wait structs to maxsockets / 5; this ensuresMike Silbersack2003-10-241-1/+1
| | | | | | | | | | | | | | | that at most 20% of sockets can be in time_wait at one time, ensuring that time_wait sockets do not starve real connections from inpcb structures. No implementation change is needed, jlemon already implemented a nice LRU-ish algorithm for tcp_tw structure recycling. This should reduce the need for sysadmins to lower the default msl on busy servers. Notes: svn path=/head/; revision=121453
* Change all SYSCTLS which are readonly and have a related TUNABLEMike Silbersack2003-10-211-1/+1
| | | | | | | | from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide more useful error messages. Notes: svn path=/head/; revision=121307
* Fix a bunch of off-by-one errors in the range checking code.Ruslan Ermilov2003-09-111-2/+2
| | | | Notes: svn path=/head/; revision=119995
* Introduce two new MAC Framework and MAC policy entry points:Robert Watson2003-08-211-3/+3
|
|  mac_reflect_mbuf_icmp()
|  mac_reflect_mbuf_tcp()
|
| These entry points permit MAC policies to do "update in place" changes to the labels on ICMP and TCP mbuf headers when an ICMP or TCP response is generated to a packet outside of the context of an existing socket. For example, in response to a ping or a RST packet to a SYN on a closed port.
|
| Obtained from: TrustedBSD Project
| Sponsored by: DARPA, Network Associates Laboratories
|
| Notes: svn path=/head/; revision=119245
* Correct a bug introduced with reduced TCP state handling; makeRobert Watson2003-05-071-3/+18
| sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to.
|
| Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works around this--eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios but working through this proved to complicate things too much.
|
| Approved by: re (scottl)
| Reviewed by: hsu
| Obtained from: TrustedBSD Project
| Sponsored by: DARPA, Network Associates Laboratories
|
| Notes: svn path=/head/; revision=114794
* Remove a potential panic condition introduced by reduced TCP waitRobert Watson2003-04-101-5/+15
| state. Those changes attempted to work around the changed invariant that inp->in_socket was sometimes now NULL, but the logic wasn't quite right, meaning that inp->in_socket would be dereferenced by cr_canseesocket() if security.bsd.see_other_uids, jail, or MAC were in use. Attempt to clarify and correct the logic.
|
| Note: the work-around originally introduced with the reduced TCP wait state handling to use cr_cansee() instead of cr_canseesocket() in this case isn't really right, although it "Does the right thing" for most of the cases in the base system. We'll need to address this at some point in the future.
|
| Pointed out by: dcs
| Obtained from: TrustedBSD Project
| Sponsored by: DARPA, Network Associates Laboratories
|
| Notes: svn path=/head/; revision=113345
* Remove a panic(); if the zone allocator can't provide more timewaitJonathan Lemon2003-03-081-20/+19
| | | | | | | | | | structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs Notes: svn path=/head/; revision=112009
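The "reuse the oldest one" fallback can be illustrated with a small fixed-size ring. This is only a sketch of the idea: the pool size, the structure names and the `conn_id` field are hypothetical, the sketch never frees entries, and the kernel keeps its timewait structures in a zone with timer-driven expiry rather than in an array.

```c
#include <assert.h>
#include <stddef.h>

#define TW_POOL_SIZE 4 /* hypothetical; stands in for the zone limit */

struct tw_entry {
	int conn_id; /* placeholder for the real per-connection state */
};

struct tw_pool {
	struct tw_entry slots[TW_POOL_SIZE];
	size_t oldest; /* index of the longest-lived entry */
	size_t count;  /* entries in use */
};

/*
 * Hand out a slot for a new TIME_WAIT connection.  When the pool is full
 * the oldest entry is overwritten instead of failing; *evicted is set to
 * the connection id that lost its slot, or -1 if nothing was evicted.
 */
static struct tw_entry *
tw_alloc(struct tw_pool *p, int conn_id, int *evicted)
{
	struct tw_entry *e;

	assert(p != NULL && evicted != NULL);
	if (p->count < TW_POOL_SIZE) {
		e = &p->slots[(p->oldest + p->count) % TW_POOL_SIZE];
		p->count++;
		*evicted = -1;
	} else {
		e = &p->slots[p->oldest];
		*evicted = e->conn_id;
		p->oldest = (p->oldest + 1) % TW_POOL_SIZE;
	}
	e->conn_id = conn_id;
	return (e);
}
```

Allocation never fails in this scheme, which is the property that lets the panic() be removed: once the pool is full, each new connection simply evicts the entry that has been in TIME_WAIT the longest.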
* More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).Dag-Erling Smørgrav2003-03-021-1/+1
| | | | Notes: svn path=/head/; revision=111748
* When generating a TCP response to a connection, not only test if theRobert Watson2003-02-251-1/+1
| | | | | | | | | | | | | tcpcb is NULL, but also its connected inpcb, since we now allow elements of a TCP connection to hang around after other state, such as the socket, has been recycled. Tested by: dcs Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories Notes: svn path=/head/; revision=111483
* - m = m_gethdr(M_NOWAIT, MT_HEADER);Poul-Henning Kamp2003-02-211-1/+1
| | | | | | | | | + m = m_gethdr(M_DONTWAIT, MT_HEADER); 'nuff said. Notes: svn path=/head/; revision=111231
* Unbreak non-IPV6 compilation.Jonathan Lemon2003-02-191-4/+10
| | | | | | | | Caught by: phk Sponsored by: DARPA, NAI Labs Notes: svn path=/head/; revision=111153
* Add a TCP TIMEWAIT state which uses less space than a fullblown TCPJonathan Lemon2003-02-191-48/+268
| | | | | | | | | | | control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs Notes: svn path=/head/; revision=111145
* Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so theJonathan Lemon2003-02-191-35/+32
| | | | | | | | | | | routine does not require a tcpcb to operate. Since we no longer keep template mbufs around, move pseudo checksum out of this routine, and merge it with the length update. Sponsored by: DARPA, NAI Labs Notes: svn path=/head/; revision=111144
* Back out M_* changes, per decision of the TRB.Warner Losh2003-02-191-4/+4
| | | | | | | Approved by: trb Notes: svn path=/head/; revision=111119
* Take advantage of pre-existing lock-free synchronization and type stable memoryJeffrey Hsu2003-02-151-4/+3
| | | | | | | to avoid acquiring SMP locks during expensive copyout process. Notes: svn path=/head/; revision=110896
* Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.Alfred Perlstein2003-01-211-4/+4
| | | | | | | Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT. Notes: svn path=/head/; revision=109623
* Validate inp to prevent a use after free. | Jeffrey Hsu | 2002-12-24 | 1 | -1/+2
| | | | Notes: svn path=/head/; revision=108265
* Change tcp.inflight_min from 1024 to a production default of 6144. CreateMatthew Dillon2002-12-141-4/+14
| | | | | | | | | | a sysctl for the stabilization value for the bandwidth delay product (inflight) algorithm and document it. MFC after: 3 days Notes: svn path=/head/; revision=107881
* Fix two instances of variant struct definitions in sys/netinet:Poul-Henning Kamp2002-10-201-4/+4
| | | | | | | | | | | | | | | | | Remove the never completed _IP_VHL version, it has not caught on anywhere and it would make us incompatible with other BSD netstacks to retain this version. Add a CTASSERT protecting sizeof(struct ip) == 20. Don't let the size of struct ipq depend on the IPDIVERT option. This is a functional no-op commit. Approved by: re Notes: svn path=/head/; revision=105586
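A compile-time size check of the kind mentioned here can be written in standard C11 with `static_assert`. The structure below is an illustrative stand-in with hypothetical field names, not the kernel's `struct ip`, and `static_assert` is the C11 counterpart of the kernel's CTASSERT macro rather than the macro itself.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative IPv4 header layout; not the kernel's struct ip. */
struct ipv4_hdr {
	uint8_t  ver_ihl; /* version and header length, 4 bits each */
	uint8_t  tos;     /* type of service */
	uint16_t len;     /* total length */
	uint16_t id;      /* identification */
	uint16_t off;     /* flags and fragment offset */
	uint8_t  ttl;     /* time to live */
	uint8_t  proto;   /* protocol */
	uint16_t sum;     /* header checksum */
	uint32_t src;     /* source address */
	uint32_t dst;     /* destination address */
};

/* Fails the build, not the running system, if the layout ever changes. */
static_assert(sizeof(struct ipv4_hdr) == 20, "IPv4 header must be 20 bytes");
```

Putting the check next to the definition means that a change which alters the structure size, for example one introduced under a compile-time option, is caught when the file is compiled.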
* Tie new "Fast IPsec" code into the build. This involves the usualSam Leffler2002-10-161-0/+8
| configuration stuff as well as conditional code in the IPv4 and IPv6 areas. Everything is conditional on FAST_IPSEC which is mutually exclusive with IPSEC (KAME IPsec implementation).
|
| As noted previously, don't use FAST_IPSEC with INET6 at the moment.
|
| Reviewed by: KAME, rwatson
| Approved by: silence
| Supported by: Vernier Networks
|
| Notes: svn path=/head/; revision=105199
* Replace aux mbufs with packet tags:Sam Leffler2002-10-161-8/+3
|
|  o instead of a list of mbufs use a list of m_tag structures a la openbsd
|  o for netgraph et al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie
|  o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines
|  o rewrite KAME use of aux mbufs in terms of packet tags
|  o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets
|  o bump __FreeBSD_version so code can be conditionalized
|  o fixup ipfilter's call to ip_output based on __FreeBSD_version
|
| Reviewed by: julian, luigi (silent), -arch, -net, darren
| Approved by: julian, silence from everyone else
| Obtained from: openbsd (mostly)
| MFC after: 1 month
|
| Notes: svn path=/head/; revision=105194
* turn off debugging by default if bandwidth delay product limiting isMatthew Dillon2002-10-101-1/+1
| | | | | | | turned on (it is already off in -stable). Notes: svn path=/head/; revision=104825
* Correct bug in t_bw_rtttime rollover, #undef USERTTMatthew Dillon2002-08-241-1/+5
| | | | Notes: svn path=/head/; revision=102368
* Implement TCP bandwidth delay product window limiting, similar to (butMatthew Dillon2002-08-171-0/+158
| not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'.
|
|  net.inet.tcp.inflight_enable   enable algorithm (defaults to 0=off)
|  net.inet.tcp.inflight_debug    debugging (defaults to 1=on)
|  net.inet.tcp.inflight_min      minimum window limit
|  net.inet.tcp.inflight_max      maximum window limit
|
| MFC after: 1 week
|
| Notes: svn path=/head/; revision=102017
* Document the undocumented assumption that at least one of the PCBRobert Watson2002-08-011-0/+2
| pointer and incoming mbuf pointer will be non-NULL in tcp_respond(). This is relied on by the MAC code for correctness, as well as existing code.
|
| Obtained from: TrustedBSD Project
| Sponsored by: DARPA, NAI Labs
|
| Notes: svn path=/head/; revision=101137
* Introduce support for Mandatory Access Control and extensibleRobert Watson2002-07-311-0/+17
| | | | | | | | | | | | | | | | | | | | | kernel access control. Instrument the TCP socket code for packet generation and delivery: label outgoing mbufs with the label of the socket, and check socket and mbuf labels before permitting delivery to a socket. Assign labels to newly accepted connections when the syncache/cookie code has done its business. Also set peer labels as convenient. Currently, MAC policies cannot influence the PCB matching algorithm, so cannot implement polyinstantiation. Note that there is at least one case where a PCB is not available due to the TCP packet not being associated with any socket, so we don't label in that case, but need to handle it in a special manner. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs Notes: svn path=/head/; revision=101106