| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
| |
The regressions in aio(4) and kernel RPC aren't a 5 minute problem.
This reverts commit d80a97def9a1db6f07f5d2e68f7ad62b27918947.
This reverts commit d1cbb17a873c787a527316bbb27551e97d5ad30c.
This reverts commit fb8a8333b481cc4256d0b3f0b5b4feaa4594e01f.
|
|
|
|
|
|
|
| |
Jumping to cleanup routines will work on uninitialized stack mc.
Fixes: d80a97def9a1db6f07f5d2e68f7ad62b27918947
Reported-by: syzbot+4adf0b37849ea7723586@syzkaller.appspotmail.com
|
|
|
|
|
|
|
| |
If there is nothing to prepend, don't try STAILQ_INSERT_HEAD().
Fixes: d80a97def9a1db6f07f5d2e68f7ad62b27918947
Reported-by: syzbot+bb7f3d07c79b5faf8de8@syzkaller.appspotmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Provide protocol specific pr_sosend and pr_soreceive for PF_UNIX
SOCK_STREAM sockets and implement SOCK_SEQPACKET sockets as an extension
of SOCK_STREAM. The change meets three goals: get rid of unix(4) specific
stuff in the generic socket code, provide a faster and robust unix/stream
sockets and bring unix/seqpacket much closer to specification. Highlights
follow:
- The send buffer now is truly bypassed. Previously it was always empty,
but the send(2) still needed to acquire its lock and do a variety of
tricks to be woken up in the right time while sleeping on it. Now the
only two things we care about in the send buffer is the I/O sx(9) lock
that serializes operations and value of so_snd.sb_hiwat, which we can read
without obtaining a lock. The sleep of a send(2) happens on the mutex of
the receive buffer of the peer. A bulk send/recv of data with large
socket buffers will make both syscalls just bounce between owning the
receive buffer lock and copyin(9)/copyout(9), no other locks would be
involved.
- The implementation uses new mchain structure to manipulate mbuf chains.
Note that this required converting to mchain two functions that are shared
with unix/dgram: unp_internalize() and unp_addsockcred() as well as adding
a new shared one uipc_process_kernel_mbuf(). This induces some non-
functional changes in the unix/dgram code as well. There is a space for
improvement here, as right now it is a mix of mchain and manually managed
mbuf chains.
- unix/seqpacket previously marked as PR_ADDR & PR_ATOMIC and thus treated
as a datagram socket by the generic socket code, now becomes a true stream
socket with record markers.
- unix/stream loses the sendfile(2) support. This can be brought back,
but requires some work. Let's first see if there is any interest in this
feature, except purely academical.
Reviewed by: markj, tuexen
Differential Revision: https://reviews.freebsd.org/D44151
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When calling VOP_CREATE(), uipc_bindat() reuses the componentname
object from the preceding lookup operation, which is likely to specify
LK_SHARED. Furthermore, the VOP_CREATE() interface technically only
requires the newly-created vnode to be returned with a shared lock.
However, the socket layer requires the new vnode to be locked exclusive
and asserts to that effect.
In most cases, this is not a practical concern because most if not
all base-layer filesystems (certainly FFS, ZFS, and msdosfs at least)
always return the vnode locked exclusive regardless of the lock flags.
However, it is an issue for unionfs which uses cn_lkflags to determine
how the new unionfs wrapper vnode should be locked. While it would
be easy enough to work around this issue within unionfs itself, it
seems better for the socket layer to be explicit about its locking
requirements when issuing VOP_CREATE().
Reviewed by: kib, olce
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D44047
|
|
|
|
|
|
|
|
|
|
| |
That was lost in transition from one-for-all soshutdown() to protocol
specific methods. Only protocols that listen(2) were affected. This is
not a documented or specified feature, but some software relies on it. At
least the FreeSWITCH telephony software uses this behavior on
PF_INET/SOCK_STREAM.
Fixes: 5bba2728079ed4da33f727dbc2b6ae1de02ba897
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This socket option was added in 6a2989fd54a9 together with LOCAL_CREDS.
Both options originate from NetBSD. The LOCAL_CREDS seems to be used by
some software and is covered by our test suite.
The main problem with LOCAL_CONNWAIT is that it doesn't work as
documented. A basic test shows that connect(2) indeed blocks, but
accept(2) on the other side does not wake it up. Indeed, I don't see what
code in the accept(2) path would go into the peer socket of a unix/stream
listener's child and would make wakeup(&so->so_timeo). I tried the test
even on a FreeBSD 6.4-RELEASE and it produced the same results as on
CURRENT.
The other thing that puzzles me is why that option would be useful even if
it worked? Because on unix/stream you can send(2) immediately after
connect(2) and that would put data on the peer receive buffer even before
listener had done accept(2). In other words, one side can do connect(2)
then send(2), only after the remote side would make accept(2) and the
remote would see the data sent before the accept(2). Again this
undocumented feature of unix(4) is present on all versions from FreeBSD 6
to CURRENT.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D43708
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is a legitimate case to use sendmsg(2) to send control only, with zero
bytes of data and then recvmsg(2) them with zero length iov, receiving
control only. This sendmsg(2)+recmsg(2) would leave a zero length mbuf on
the top of the socket buffer. If you now try to repeat this combo again,
your recvmsg(2) would not return control data, because it sits behind an
MT_DATA mbuf and you have provided zero length uio_resid. IMHO, best
strategy to deal with zero length buffers in a chain is to not put them
there in the first place. Thus, solve this right in uipc_send() instead
of touching soreceive_generic().
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D43733
|
|
|
|
|
|
|
|
| |
unp_dispose() is called on shutdown(2) and thus needs to acquire socket
I/O receive sx(9) to synchronize with read(2) that may read top of the
buffer without socket buffer mutex. Notice in the last chunk of the diff
that the function used to be called with the lock already acquired in the
past.
|
| |
|
|
|
|
|
|
|
|
| |
Passing file descriptors (rights) via sockets is a feature specific to
PF_UNIX only, so fully isolate the logic into uipc_usrreq.c.
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43414
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self documented, as protocol specific method would execute only the code
that is relevant to that protocol and nothing else. This also fixes a
couple recent regressions and reduces risk of future regressions. The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one. Particularly for SCTP this change streamlines
a lot of code.
Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied almost to every protocol, but that
will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
connect(2), can SO_LINGER or can accept(2).
There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9). Those protocols partially supported
shutdown(2) returning EOPNOTSUP for SHUT_WR/SHUT_RDWR, now they fully lost
shutdown(2) support. I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43413
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is important for wpa_supplicant operation on a crowded network.
Note: we actually need an API to increase maximum datagram size on a
socket. Previously SO_SNDBUF magically acted like that, but that was
an undocumented "feature".
Also move the comment to the proper line. Previously it was the receive
buffer that imposed the limit. Now notion of buffer size and maximum
datagram are separate.
Reviewed by: bz, tuexen, karels
Differential Revision: https://reviews.freebsd.org/D42830
PR: 274990
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Just like it was done for accept(2) in cfb1e92912b4, use same approach
for two simplier syscalls that return socket addresses. Although,
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls trimming code size.
Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D42694
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.
While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
required length in case if provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as they do. Linux also does that. Update tests accordingly.
Reviewed by: rscheff, tuexen, zlei, dchagin
Differential Revision: https://reviews.freebsd.org/D42635
|
|
|
|
|
|
|
|
| |
Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.
Sponsored by: Netflix
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Running the test suite yields:
lock order reversal:
1st 0xfffff80004bc6700 unp (unp, sleep mutex) @ sys/kern/uipc_usrreq.c:390
2nd 0xffffffff81a94b30 unp_link_rwlock (unp_link_rwlock, rw) @ sys/kern/uipc_usrreq.c:2934
lock order unp -> unp_link_rwlock attempted at:
0xffffffff80bc216e at witness_checkorder+0xbbe
0xffffffff80b493a5 at _rw_wlock_cookie+0x65
0xffffffff80c0a8e2 at unp_discard+0x22
0xffffffff80c0a888 at unp_freerights+0x38
0xffffffff80c09fdd at unp_scan+0x9d
0xffffffff80c0f9a7 at uipc_sosend_dgram+0x727
0xffffffff80c00a79 at sousrsend+0x79
0xffffffff80c072d0 at kern_sendit+0x1c0
0xffffffff80c074d7 at sendit+0xb7
0xffffffff80c076f3 at sys_sendmsg+0x63
0xffffffff8104d957 at amd64_syscall+0x6b7
0xffffffff8101f9eb at fast_syscall_common+0xf8
This happens when uipc_sosend_dgram() discards a control message because
the receive socket buffer is full. The overflow handling frees
internalized file references in the socket buffer before freeing mbufs.
It does this with socket PCBs locked, leading to the LOR. Defer
handling of file references until the PCBs are unlocked.
Reviewed by: glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D41884
|
|
|
|
| |
Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
|
|
|
|
|
|
|
|
| |
Reported by: syzbot+c2da2dbae5fe006556bc@syzkaller.appspotmail.com
Reported by: syzbot+b4d6b093b1d78bfa859b@syzkaller.appspotmail.com
Fixes: e8f6e5b2d969 ("unix: Fix locking in uipc_peeraddr()")
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After the locking protocol changed in commit 75a67bf3d00d ("AF_UNIX:
make unix socket locking finer grained"), uipc_peeraddr() was not
updated accordingly.
The link lock now only protects global socket lists. The PCB lock is
used to protect the link between connected PCBs, so use that. Remove an
old comment which appears to be noting that unp_conn is not set for
connected SOCK_DGRAM sockets (in one direction anyway).
Reviewed by: glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39855
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This API change led to unexpected consequences with Go runtime. The
Go runtime emulates blocking sockets over non-blocking sockets and
for that uses available event dispatcher on the target OS, which is
kevent(2) if availabe, with OS independent layer on top. It expects
that if whatever O_NONBLOCK socket returned ever EAGAIN, then it is
supposed to be reported as writable by the event dispatcher. kevent(2)
would never report a unix/dgram socket, since they never change their
state, they always are writeable. The expectations of Go are not
literally specified by SUS, however they are in its spirit. The SUS
specifies EAGAIN for send(2) as "The socket's file descriptor is marked
O_NONBLOCK and the requested operation would block" [1]. This doesn't
apply to FreeBSD unix/dgram socket, it never blocks on send(2).
Thus, changing API trying to mimic Linux was a mistake. But what about
the problem we tried to fix? Discussed that with Max Dounin of nginx,
and we agreed that the log bomb described shall be fixed on nginx side,
and it actually isn't specific to FreeBSD, may happen with nginx on any
non-Linux system with a certain configuration.
[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/send.html
This reverts commit 65572cade35a93add2168a7a012f808ac499218b.
|
|
|
|
| |
for the new implementation of PF_UNIX/SOCK_DGRAM
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36542
|
|
|
|
|
|
| |
That's a legitimate scenario, although unlikely.
Reported by: https://syzkaller.appspot.com/bug?extid=6e8be1ec8d77578a3df4
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d18ef), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860dcd).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee54ea).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232
|
|
|
|
|
| |
instead of using historic PRU_ flags that are now not used by anything
rather than TCP debugging.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg().
Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation.
Linux extended it a while (~15+ years) ago to act as input flag,
resulting in returning the full packet size regarless of the input
buffer size.
It's a (relatively) popular pattern to do recvmsg( MSG_PEEK | MSG_TRUNC) to get the
packet size, allocate the buffer and issue another call to fetch the packet.
In particular, it's popular in userland netlink code, which is the primary driving factor of this change.
This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users).
PR: kern/176322
Reviewed by: pauamma(doc)
Differential Revision: https://reviews.freebsd.org/D35909
MFC after: 1 month
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of returning EMSGSIZE pass the error code from fdallocn() directly
to userland. That would be EMFILE, which makes much more sense. This
error code is not listed in the specification[1], but the specification
doesn't cover such edge case at all. Meanwhile the specification lists
EMSGSIZE as the error code for invalid value of msg_iovlen, and FreeBSD
follows that, see sys_recmsg(). Differentiating these two cases will make
a developer/admin life much easier when debugging.
[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/recvmsg.html
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35640
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections. A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API. Until now all of these
connections shared the same receive socket buffer of the bound
socket. This made the socket vulnerable to overflow attack.
See 240d5a9b1ce for a historical attempt to workaround the problem.
This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem. The new behavior will
optimize seldom writers over frequent writers. See added test case
scenarios and code comments for more detailed description of the
new behavior.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35303
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35302
|
|
|
|
|
|
|
|
|
|
|
| |
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs. The later would be added to
sb_mbcnt. Calculating this value at allocation time allows to save
on extra traversal of the mbuf chain.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35301
|
|
|
|
|
|
|
|
|
|
|
| |
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.
Generic socket implementation is reduced down to one STAILQ and very
little code.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35300
|
|
|
|
|
|
|
|
|
| |
Use this new version in unix/dgram socket when sending to a target
address. This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35298
|
|
|
|
|
|
|
|
| |
This allows to remove one M_NOWAIT allocation and also makes it
more clear what's going on.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35297
|
|
|
|
|
|
|
|
|
|
| |
With this second step PF_UNIX/SOCK_DGRAM has protocol specific
implementation. This gives some possibility performance
optimizations. However, it still operates on the same struct
socket as all other sockets do.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35296
|
|
|
|
|
|
|
| |
Just remove one level of indentation as the case clause always match.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35295
|
|
|
|
|
|
|
|
| |
Remove the dead code. The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35294
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is first step towards splitting classic BSD socket
implementation into separate classes. The first to be
split is PF_UNIX/SOCK_DGRAM as it has most differencies
to SOCK_STREAM sockets and to PF_INET sockets.
Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send. The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case. There is one important exception, though,
the sendfile(2) code will call pru_send directly. But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick. We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35293
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make it a complex, but a single for(;;) statement. The previous cycle
with some loop logic in the beginning and some loop logic at the end
was confusing. Both me and markj@ were misleaded to a conclusion that
some checks are unnecessary, while they actually were necessary.
While here, handle an edge case found by Mark, when on 64-bit platform
an incorrect message from userland would underflow length counter, but
return without any error. Provide a test case for such message.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35375
|
|
|
|
| |
Matches the meaning of the variable and sysctl node name.
|
|
|
|
|
| |
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35318
|
|
|
|
|
| |
Pointy hat to: glebius
Fixes: 4682ac697ce9b306d11e03a628d1ac07f4b540c8
|
|
|
|
|
|
|
|
| |
In this function we always work with mbufs that we previously
created ourselves in unp_internalize(). They must be valid.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35319
|
|
|
|
|
|
|
|
|
| |
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size. Return same error code, we are returning for an
oversized control from sockargs().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35317
|
|
|
|
| |
No functional change.
|
| |
|
|
|
|
| |
Fixes: 1f32cef47189403e9e70b1893c731c68b97b964e
|
|
|
|
|
|
|
|
| |
This change is also a preparation for further optimization to
allow locked return on success.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35182
|