path: root/sys/kern
Commit message | Author | Age | Files | Lines
* Optionally bind ktls threads to NUMA domains | Andrew Gallatin | 2020-12-19 | 1 | -3/+39
  When ktls_bind_thread is 2, we pick a ktls worker thread that is bound to the same domain as the TCP connection associated with the socket. We use roughly the same code as netinet/tcp_hpts.c to do this. This allows crypto to run on the same domain as the TCP connection is associated with. Assuming TCP_REUSPORT_LB_NUMA (D21636) is in place & in use, this ensures that the crypto source and destination buffers are local to the same NUMA domain as we're running crypto on.

  This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces cross-domain traffic from over 37% down to about 13% as measured by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.

  Reviewed by: jhb
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D21648
  Notes: svn path=/head/; revision=368818
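As a rough illustration of the domain-aware worker selection described in r368818, the sketch below picks a worker CPU from the connection's NUMA domain and falls back to all CPUs when the domain is unknown. It is not the kernel's ktls code: `struct domain_workers`, `pick_worker_cpu()`, the table sizes, and the use of a flow id for spreading are all assumptions made for the example.

```c
#include <stddef.h>

#define MAX_CPUS_PER_DOMAIN 64	/* hypothetical table size */

/* Hypothetical per-domain table of CPUs that have a bound worker thread. */
struct domain_workers {
	int count;			/* number of workers in this domain */
	int cpu[MAX_CPUS_PER_DOMAIN];	/* CPU id each worker is bound to */
};

/*
 * Pick a worker CPU for a connection.  If the connection's domain is known
 * and has workers, spread connections over that domain's workers by flow id.
 * Otherwise fall back to spreading over all CPUs.
 */
static int
pick_worker_cpu(const struct domain_workers *dw, int ndomains,
    int conn_domain, unsigned int flowid, int ncpus)
{
	if (conn_domain >= 0 && conn_domain < ndomains &&
	    dw[conn_domain].count > 0) {
		unsigned int n = (unsigned int)dw[conn_domain].count;

		return (dw[conn_domain].cpu[flowid % n]);
	}
	return ((int)(flowid % (unsigned int)ncpus));
}
```

Keeping the choice a pure function of the domain and flow id means a given connection always lands on the same worker, which is the property that keeps its buffers domain-local.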
* kern: cpuset: allow jails to modify child jails' roots | Kyle Evans | 2020-12-19 | 1 | -5/+20
  This partially lifts a restriction imposed by r191639 ("Prevent a superuser inside a jail from modifying the dedicated root cpuset of that jail") that's perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still cannot modify their own cpuset, but they can modify child jails' roots to further restrict them or widen them back to the modifying jails' own mask.

  As a side effect of this, the system root may once again widen the mask of jails as long as they're still using a subset of the parent jails' mask. This was previously prevented by the fact that cpuset_getroot of a root set will return that root, rather than the root's parent -- cpuset_modify uses cpuset_getroot since it was introduced in r327895, previously it was just validating against set->cs_parent which allowed the system root to widen jail masks.

  Reviewed by: jamie
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D27352
  Notes: svn path=/head/; revision=368779
* Add ELF flag to disable ASLR stack gap. | Konstantin Belousov | 2020-12-18 | 2 | -4/+12
  Also centralize and unify checks to enable ASLR stack gap in a new helper exec_stackgap().

  PR: 239873
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Notes: svn path=/head/; revision=368772
* Use a template assembly file for firmware object files. | John Baldwin | 2020-12-17 | 1 | -0/+49
  Similar to r366897, this uses the .incbin directive to pull in a firmware file's contents into a .fwo file. The same scheme for computing symbol names from the filename is used as before to maximize compatibility and not require rebuilding existing .fwo files for NO_CLEAN builds.

  Using ld -o binary requires extra hacks in linkers to either specify ABI options (e.g. soft- vs hard-float) or to ignore ABI incompatibilities when linking certain objects (e.g. object files with only data). Using the compiler driver avoids the need for these hacks as the compiler driver is able to set all the appropriate ABI options.

  Reviewed by: imp, markj
  Obtained from: CheriBSD
  Sponsored by: DARPA
  Differential Revision: https://reviews.freebsd.org/D27579
  Notes: svn path=/head/; revision=368739
* Fix a race in tty_signal_sessleader() with unlocked read of s_leader. | Konstantin Belousov | 2020-12-17 | 1 | -2/+9
  Since we do not own the session lock, a parallel killjobc() might reset s_leader to NULL after we checked it. Read s_leader only once and ensure that the compiler is not allowed to reload it.

  While there, make access to t_session somewhat prettier by using a local variable.

  PR: 251915
  Submitted by: Jakub Piecuch <j.piecuch96@gmail.com>
  MFC after: 1 week
  Notes: svn path=/head/; revision=368735
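The "read once" pattern from r368735 can be sketched in portable C11 as below. This is only an illustration: the structures are stand-ins, and the kernel uses its own primitives rather than `<stdatomic.h>`. The sketch shows only how to stop the compiler from reloading the pointer between the NULL check and the use; it says nothing about keeping the pointed-to process alive.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in structures; the real kernel types are much larger. */
struct proc {
	int pid;
};

struct session {
	_Atomic(struct proc *) s_leader;	/* may be cleared concurrently */
};

/*
 * Return the leader's pid, or -1 if there is no leader.  The pointer is
 * loaded exactly once into a local, so the NULL check and the dereference
 * are guaranteed to operate on the same value.
 */
static int
session_leader_pid(struct session *s)
{
	struct proc *leader;

	leader = atomic_load_explicit(&s->s_leader, memory_order_relaxed);
	if (leader == NULL)
		return (-1);
	return (leader->pid);
}
```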
* fd: reimplement close_range to avoid spurious relocking | Mateusz Guzik | 2020-12-17 | 1 | -25/+30
  Notes: svn path=/head/; revision=368732
* audit: rework AUDIT_SYSCLOSE | Mateusz Guzik | 2020-12-17 | 1 | -12/+15
  This in particular avoids spurious lookups on close.

  Notes: svn path=/head/; revision=368731
* fd: refactor closefp in preparation for close_range rework | Mateusz Guzik | 2020-12-17 | 1 | -21/+43
  Notes: svn path=/head/; revision=368730
* fd: remove redundant saturation check from fget_unlocked_seq | Mateusz Guzik | 2020-12-16 | 1 | -7/+0
  refcount_acquire_if_not_zero returns true on saturation. The case of 0 is handled by looping again, after which the originally found pointer will no longer be there.

  Noted by: kib
  Notes: svn path=/head/; revision=368703
* uipc: disable prediction in unp_pcb_lock_peer | Mateusz Guzik | 2020-12-13 | 1 | -1/+1
  The branch is not very predictable one way or the other, at least during buildkernel where it only correctly matched 57% of calls.

  Notes: svn path=/head/; revision=368617
* cache: fix ups bad predicts | Mateusz Guzik | 2020-12-13 | 1 | -3/+9
  - last level fallback normally sees CREATE; the code should be optimized to not get there for said case
  - fast path commonly fails with ENOENT

  Notes: svn path=/head/; revision=368615
* vfs: correctly predict last fdrop on failed open | Mateusz Guzik | 2020-12-13 | 1 | -1/+1
  Arguably since the count is guaranteed to be 1 the code should be modified to avoid the work.

  Notes: svn path=/head/; revision=368614
* Fix TDP_WAKEUP/thr_wake(curthread->td_tid) after r366428. | Konstantin Belousov | 2020-12-13 | 1 | -3/+1
  Reported by: arichardson
  Reviewed by: arichardson, markj
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D27597
  Notes: svn path=/head/; revision=368613
* Correct indent. | Konstantin Belousov | 2020-12-13 | 1 | -1/+1
  Sponsored by: The FreeBSD Foundation
  Notes: svn path=/head/; revision=368612
* fd: fix fdrop prediction when closing a fd | Mateusz Guzik | 2020-12-13 | 1 | -1/+1
  Most of the time this is the last reference, contrary to typical fdrop use.

  Notes: svn path=/head/; revision=368609
* cache_fplookup: quiet gcc -Wreturn-type | Ryan Libby | 2020-12-11 | 1 | -0/+1
  Reviewed by: markj, mjg
  Sponsored by: Dell EMC Isilon
  Differential Revision: https://reviews.freebsd.org/D27555
  Notes: svn path=/head/; revision=368562
* fd: make serialization in fdescfree_fds conditional on hold count | Mateusz Guzik | 2020-12-10 | 1 | -3/+7
  p_fd nullification in fdescfree serializes against new threads transitioning the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can safely assume there is nobody else using the table. Losing the race and observing > 1 is harmless.

  Reviewed by: markj
  Differential Revision: https://reviews.freebsd.org/D27522
  Notes: svn path=/head/; revision=368516
* Plug a race between fd table teardown and several loops | Mark Johnston | 2020-12-09 | 1 | -0/+12
  To export information from fd tables we have several loops which do this:

      FILDESC_SLOCK(fdp);
      for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++)
              <export info for fd i>;
      FILDESC_SUNLOCK(fdp);

  Before r367777, fdescfree() acquired the fd table exclusive lock between decrementing fdp->fd_refcount and freeing table entries. This serialized with the loop above, so the file at descriptor i would remain valid until the lock is dropped. Now there is no serialization, so the loops may race with teardown of file descriptor tables.

  Acquire the exclusive fdtable lock after releasing the final table reference to provide a barrier synchronizing with these loops.

  Reported by: pho
  Reviewed by: kib (previous version), mjg
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D27513
  Notes: svn path=/head/; revision=368486
* Use refcount_load(9) to load fd table reference counts | Mark Johnston | 2020-12-09 | 1 | -9/+14
  No functional change intended.

  Reviewed by: kib, mjg
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D27512
  Notes: svn path=/head/; revision=368485
* cpuset_set{affinity,domain}: do not allow empty masks | Kyle Evans | 2020-12-08 | 1 | -5/+14
  cpuset_modify() would not currently catch this, because it only checks that the new mask is a subset of the root set and circumvents the EDEADLK check in cpuset_testupdate().

  This change both directly validates the mask coming in since we can trivially detect an empty mask, and it updates cpuset_testupdate to catch stuff like this going forward by always ensuring we don't end up with an empty mask.

  The check_mask argument has been renamed because the 'check' verbiage does not imply to me that it's actually doing a different operation. We're either augmenting the existing mask, or we are replacing it entirely.

  Reported by: syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com
  Discussed with: andrew
  Reviewed by: andrew, markj
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D27511
  Notes: svn path=/head/; revision=368462
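The gap described in r368462 is easy to see with plain bitmasks: an empty mask is trivially a subset of any root mask, so a subset check alone accepts it. The sketch below is illustrative only. `validate_mask()` is a hypothetical helper over a 64-bit mask rather than the kernel's cpuset_t, and the specific error codes chosen (EDEADLK for an empty mask, EINVAL for a non-subset) are assumptions based on the commit message, not a copy of the kernel logic.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate a requested CPU mask against the root (parent) mask.
 * Returns 0 if acceptable, otherwise an errno value.
 */
static int
validate_mask(uint64_t root, uint64_t newmask)
{
	/* An empty mask would leave the thread with no CPU to run on. */
	if (newmask == 0)
		return (EDEADLK);
	/* Every requested CPU must also be present in the root mask. */
	if ((newmask & ~root) != 0)
		return (EINVAL);
	return (0);
}
```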
* kern: cpuset: resolve race between cpuset_lookup/cpuset_rel | Kyle Evans | 2020-12-08 | 1 | -2/+10
  The race plays out like so between threads A and B:

  1. A ref's cpuset 10
  2. B does a lookup of cpuset 10, grabs the cpuset lock and searches cpuset_ids
  3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock while B is still searching and not yet ref'd
  4. B ref's cpuset 10 and drops the cpuset lock
  5. A proceeds to free the cpuset out from underneath B

  Resolve the race by only releasing the last reference under the cpuset lock. Thread A now picks up the spinlock and observes that the cpuset has been revived, returning immediately for B to deal with later.

  Reported by: syzbot+92dff413e201164c796b@syzkaller.appspotmail.com
  Reviewed by: markj
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D27498
  Notes: svn path=/head/; revision=368461
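The fix in r368461 follows a general pattern: references other than the last may be dropped without the lock, but the final 1 -> 0 transition happens only while holding the lock that lookups also take. A minimal C11 sketch of that pattern follows. The names (`struct obj`, `obj_lookup_ref()`, `obj_rel()`), the single-object "list", and the toy spinlock are all assumptions for illustration, not the cpuset code.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int refs;
	bool on_list;		/* protected by list_lock */
};

/* Toy spinlock standing in for the lock that protects the id list. */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;

static void
list_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&list_lock,
	    memory_order_acquire))
		;	/* spin */
}

static void
list_lock_release(void)
{
	atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/* Lookup: a new reference is only ever taken under the list lock. */
static bool
obj_lookup_ref(struct obj *o)
{
	bool found;

	list_lock_acquire();
	found = o->on_list;
	if (found)
		atomic_fetch_add(&o->refs, 1);
	list_lock_release();
	return (found);
}

/* Drop a reference; returns true if the caller must free the object. */
static bool
obj_rel(struct obj *o)
{
	int old = atomic_load(&o->refs);

	/* Fast path: not the last reference, no lock needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(&o->refs, &old, old - 1))
			return (false);
	}
	/* Possibly the last reference: decide under the lock. */
	list_lock_acquire();
	if (atomic_fetch_sub(&o->refs, 1) != 1) {
		/* A concurrent lookup revived the object; leave it alone. */
		list_lock_release();
		return (false);
	}
	o->on_list = false;	/* no lookup can find it from here on */
	list_lock_release();
	return (true);
}
```

Because a lookup can only add its reference while holding the lock, the releasing thread either sees the revived count and backs off, or unlinks the object before any lookup can find it.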
* kern: cpuset: plug a unr leak | Kyle Evans | 2020-12-08 | 1 | -0/+5
  cpuset_rel_defer() is supposed to be functionally equivalent to cpuset_rel() but with anything that might sleep deferred until cpuset_rel_complete -- this setup is used specifically for cpuset_setproc.

  Add in the missing unr free to match cpuset_rel. This fixes a leak that was observed when I wrote a small userland application to try and debug another issue, which effectively did:

      cpuset(&newid);
      cpuset(&scratch);

  newid gets leaked when scratch is created; it's off the list, so there's no mechanism for anything else to relinquish it. A more realistic reproducer would likely be a process that inherits some cpuset that it's the only ref for, but it creates a new one to modify. Alternatively, administratively reassigning a process' cpuset that it's the last ref for will have the same effect.

  Discovered through D27498.

  MFC after: 1 week
  Notes: svn path=/head/; revision=368460
* vfs: add cleanup on error missed in r368375 | Mateusz Guzik | 2020-12-06 | 1 | -0/+1
  Noted by: jrtc27
  Notes: svn path=/head/; revision=368395
* vfs: factor buffer allocation/copyin out of namei | Mateusz Guzik | 2020-12-06 | 1 | -23/+38
  Notes: svn path=/head/; revision=368375
* vfs: keep bad ops on vnode reclaim | Mateusz Guzik | 2020-12-05 | 1 | -6/+0
  They were only modified to accommodate a redundant assertion.

  This runs into problems as lockless lookup can still try to use the vnode and crash instead of getting an error.

  The bug was only present in kernels with INVARIANTS.

  Reported by: kevans
  Notes: svn path=/head/; revision=368360
* Add kern_ntp_adjtime(9). | Konstantin Belousov | 2020-12-04 | 1 | -52/+64
  Reviewed by: brooks, cy
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D27471
  Notes: svn path=/head/; revision=368342
* kern: soclose: don't sleep on SO_LINGER w/ timeout=0 | Kyle Evans | 2020-12-04 | 1 | -1/+2
  This is a valid scenario that's handled in the various protocol layers where it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it indicates we should immediately drop the connection, it makes little sense to sleep on it.

  This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this could result in the thread hanging until a signal interrupts it if the protocol does not mark the socket as disconnected for whatever reason.

  Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com
  Reviewed by: glebius, markj
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D27407
  Notes: svn path=/head/; revision=368326
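For context, the userland configuration that r368326 deals with is SO_LINGER enabled with a zero timeout, which asks for the connection to be dropped immediately on close. A minimal sketch of setting it with the standard sockets API follows; the helper name `set_linger_zero()` is made up for the example.

```c
#include <string.h>
#include <sys/socket.h>

/*
 * Enable SO_LINGER with a zero timeout on a socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt(2).
 */
static int
set_linger_zero(int fd)
{
	struct linger l;

	memset(&l, 0, sizeof(l));
	l.l_onoff = 1;		/* lingering enabled ... */
	l.l_linger = 0;		/* ... with a timeout of zero seconds */
	return (setsockopt(fd, SOL_SOCKET, SO_LINGER, &l, sizeof(l)));
}
```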
* Always use 64-bit physical addresses for dump_avail[] in minidumps | Mark Johnston | 2020-12-03 | 1 | -1/+1
  As of r365978, minidumps include a copy of dump_avail[]. This is an array of vm_paddr_t ranges. libkvm walks the array assuming that sizeof(vm_paddr_t) is equal to the platform "word size", but that's not correct on some platforms. For instance, i386 uses a 64-bit vm_paddr_t.

  Fix the problem by always dumping 64-bit addresses. On platforms where vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate dump_avail[] to an array of uint64_t ranges. With this change, libkvm no longer needs to maintain a notion of the target word size, so get rid of it.

  This is a no-op on platforms where sizeof(vm_paddr_t) == 8.

  Reviewed by: alc, kib
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D27082
  Notes: svn path=/head/; revision=368307
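The translation step mentioned in r368307 amounts to widening each 32-bit address to 64 bits before it is written out. The sketch below is a hypothetical stand-alone version: `widen_ranges()` is not a kernel function, and it assumes the array holds (start, end) pairs terminated by an entry whose end is zero.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy 32-bit (start, end) pairs from src into dst as 64-bit values.
 * src is assumed to be terminated by a pair whose end is 0.  dst has room
 * for dstlen entries; a zero terminator pair is written after the copied
 * pairs.  Returns the number of entries copied, excluding the terminator.
 */
static size_t
widen_ranges(const uint32_t *src, uint64_t *dst, size_t dstlen)
{
	size_t i;

	/* Keep room for the pair being copied plus the terminator pair. */
	for (i = 0; i + 4 <= dstlen && src[i + 1] != 0; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
	}
	if (dstlen >= i + 2) {
		dst[i] = 0;
		dst[i + 1] = 0;
	}
	return (i);
}
```

With a fixed 64-bit on-disk layout, a reader never has to guess the width of the dumping platform's physical address type.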
* Add support for hw.physmem tunable for ARM/ARM64/RISC-V platforms | Oleksandr Tymoshenko | 2020-12-03 | 1 | -5/+32
  The hw.physmem tunable limits the amount of physical memory available to the system. It's handled in machdep files for x86 and PowerPC. This patch adds the required logic to the consolidated physmem management interface that is used by ARM, ARM64, and RISC-V.

  Submitted by: Klara, Inc.
  Reviewed by: mhorne
  Sponsored by: Ampere Computing
  Differential Revision: https://reviews.freebsd.org/D27152
  Notes: svn path=/head/; revision=368293
* select: make sure there are no wakeup attempts after selfdfree returns | Mateusz Guzik | 2020-12-02 | 1 | -8/+17
  Prior to the patch returning selfdfree could still be racing against doselwakeup which set sf_si = NULL and now locks stp to wake up the other thread.

  A sufficiently unlucky pair can end up going all the way down to freeing select-related structures before the lock/wakeup/unlock finishes.

  This started manifesting itself as crashes since select data started getting freed in r367714.

  Notes: svn path=/head/; revision=368271
* lio_listio(2): send signal even if number of jobs is zero. | Konstantin Belousov | 2020-12-01 | 1 | -5/+7
  Right now, if lio registered zero jobs, the syscall frees the lio job structure, cleaning up the queued ksi. As a result, the realtime signal is dequeued and never delivered.

  Fix it by allowing sendsig() to copy ksi when job count is zero.

  PR: 220398
  Reported and reviewed by: asomers
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D27421
  Notes: svn path=/head/; revision=368265
* vfs_aio.c: style. | Konstantin Belousov | 2020-12-01 | 1 | -9/+8
  Mostly re-wrap conditions to split after binary ops.

  Reviewed by: asomers
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D27421
  Notes: svn path=/head/; revision=368264
* vfs_aio.c: correct comment. | Konstantin Belousov | 2020-12-01 | 1 | -2/+2
  Reviewed by: asomers
  Sponsored by: The FreeBSD Foundation
  MFC after: 1 week
  Differential revision: https://reviews.freebsd.org/D27421
  Notes: svn path=/head/; revision=368262
* vmem: Revert r364744 | Mark Johnston | 2020-12-01 | 1 | -5/+1
  A pair of bugs are believed to have caused the hangs described in the commit log message for r364744:

  1. uma_reclaim() could trigger reclamation of the reserve of boundary tags used to avoid deadlock. This was fixed by r366840.
  2. The loop in vmem_xalloc() would in some cases try to allocate more boundary tags than the expected upper bound of BT_MAXALLOC. The reserve is sized based on the value BT_MAXALLOC, so this behaviour could deplete the reserve without guaranteeing a successful allocation, resulting in a hang. This was fixed by r366838.

  PR: 248008
  Tested by: rmacklem
  Notes: svn path=/head/; revision=368236
* Move inner loop logic out of sysctl_sysctl_next_ls(). | Alexander V. Chernikov | 2020-11-30 | 1 | -89/+130
  Refactor sysctl_sysctl_next_ls():

  * Move huge inner loop out of sysctl_sysctl_next_ls() into a separate non-recursive function, returning the next step to be taken.
  * Update resulting node oid parts only on successful lookup
  * Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno, slightly simplifying logic

  Reviewed by: freqlabs
  Differential Revision: https://reviews.freebsd.org/D27029
  Notes: svn path=/head/; revision=368199
* vt: if loader did pass the font via metadata, use it | Toomas Soome | 2020-11-30 | 1 | -0/+1
  The built-in 8x16 font may be far too small at large framebuffer resolutions. To improve readability, use the loader-provided font.

  Notes: svn path=/head/; revision=368184
* Add VT driver for VBE framebuffer device | Toomas Soome | 2020-11-30 | 1 | -0/+8
  Implement vt_vbefb to support the VESA BIOS Extensions (VBE) framebuffer with VT.

  vt_vbefb is based on vt_efifb and assumes similar data for initialization. It uses MODINFOMD_VBE_FB to identify the structure vbe_fb in kernel metadata.

  struct vbe_fb is populated by the boot loader and is passed to the kernel via the metadata payload.

  Differential Revision: https://reviews.freebsd.org/D27373
  Notes: svn path=/head/; revision=368168
* Import kernel WireGuard support | Matt Macy | 2020-11-29 | 1 | -0/+13
  Data path largely shared with the OpenBSD implementation by Matt Dunwoodie <ncon@nconroy.net>

  Reviewed by: grehan@freebsd.org
  MFC after: 1 month
  Sponsored by: Rubicon LLC, (Netgate)
  Differential Revision: https://reviews.freebsd.org/D26137
  Notes: svn path=/head/; revision=368163
* bio aio: Destroy ephemeral mapping before unwiring page. | Konstantin Belousov | 2020-11-29 | 1 | -4/+3
  Apparently some architectures, like ppc in its hashed page tables variants, account mappings by pmap_qenter() in the response from pmap_is_page_mapped().

  While there, eliminate useless userp variable.

  Noted and reviewed by: alc (previous version)
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D27409
  Notes: svn path=/head/; revision=368142
* Remove alignment requirements for KVA buffer mapping. | Alexander Motin | 2020-11-29 | 1 | -23/+5
  After r368124, pbuf_zone has an extra page to handle this particular case.

  Notes: svn path=/head/; revision=368138
* Make MAXPHYS tunable. Bump MAXPHYS to 1M. | Konstantin Belousov | 2020-11-28 | 8 | -57/+131
  Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.

  Make the b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and size b_pages[] for pbufs to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (*). Overall, we save a significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to the desired high value.

  Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except the place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get a private MAXPHYS-like constant; their conversion is out of scope for this work.

  Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis were either submitted by, or based on changes by, mav.

  Suggested by: mav (*)
  Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
  Tested by: pho
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D27225
  Notes: svn path=/head/; revision=368124
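The "+1" in r368124 follows from simple page arithmetic: a buffer of maxphys bytes that does not start on a page boundary touches one more page than an aligned one. The sketch below illustrates this together with a flexible array member sized at allocation time. The 4096-byte page size, `struct buf_sketch`, and the helper names are assumptions for the example and not the kernel's struct buf.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the example */

/* Stand-in for a buffer with a flexible page array. */
struct buf_sketch {
	size_t npages;		/* capacity of pages[] */
	void *pages[];		/* flexible array member */
};

/* Number of pages touched by the byte range [off, off + len), len > 0. */
static size_t
pages_spanned(size_t off, size_t len)
{
	return ((off + len - 1) / SKETCH_PAGE_SIZE - off / SKETCH_PAGE_SIZE + 1);
}

/* Page slots needed for any buffer of up to maxphys bytes, aligned or not. */
static size_t
pbuf_npages(size_t maxphys)
{
	return (maxphys / SKETCH_PAGE_SIZE + 1);
}

/* Allocate a buffer whose page array is sized at run time. */
static struct buf_sketch *
alloc_pbuf(size_t maxphys)
{
	size_t n = pbuf_npages(maxphys);
	struct buf_sketch *bp;

	bp = calloc(1, sizeof(*bp) + n * sizeof(bp->pages[0]));
	if (bp != NULL)
		bp->npages = n;
	return (bp);
}
```

With a 1 MiB maxphys and 4 KiB pages, an aligned buffer spans 256 pages while one starting 100 bytes into a page spans 257, which is the case the extra slot covers.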
* kern: cpuset: drop the lock to allocate domainsets | Kyle Evans | 2020-11-28 | 1 | -8/+13
  Restructure the loop a little bit to make it a little more clear how it really operates: we never allocate any domains at the beginning of the first iteration, and it will run until we've satisfied the amount we need or we encounter an error.

  The lock is now taken outside of the loop to make stuff inside the loop easier to evaluate w.r.t. locking.

  This fixes it to not try and allocate any domains for the freelist under the spinlock, which would have happened before if we needed any new domains.

  Reported by: syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com
  Reviewed by: markj
  MFC after: 3 days
  Differential Revision: https://reviews.freebsd.org/D27371
  Notes: svn path=/head/; revision=368116
* callout(9): Remove some leftover APM BIOS support | Mark Johnston | 2020-11-27 | 1 | -65/+0
  This code is obsolete since r366546.

  Reviewed by: imp
  Sponsored by: The FreeBSD Foundation
  Differential Revision: https://reviews.freebsd.org/D27267
  Notes: svn path=/head/; revision=368112
* vn_read_from_obj(): fix handling of doomed vnodes. | Konstantin Belousov | 2020-11-26 | 1 | -3/+3
  There is no reason why vp->v_object cannot be NULL. If it is, it's fine, handle it by delegating to VOP_READ().

  Tested by: pho
  Reviewed by: markj, mjg
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D27327
  Notes: svn path=/head/; revision=368076
* More careful handling of the mount failure. | Konstantin Belousov | 2020-11-26 | 1 | -4/+21
  - VFS_UNMOUNT() requires vn_start_write() around it [*].
  - call VFS_PURGE() before unmount.
  - do not destroy mp if cleanup unmount did not succeed.
  - set MNTK_UNMOUNT, and indicate forced unmount with MNTK_UNMOUNTF for VFS_UNMOUNT() in cleanup.

  PR: 251320 [*]
  Reported by: Tong Zhang <ztong0001@gmail.com>
  Reviewed by: markj, mjg
  Discussed with: rmacklem
  Sponsored by: The FreeBSD Foundation
  Differential revision: https://reviews.freebsd.org/D27327
  Notes: svn path=/head/; revision=368075
* Make max ticks for pause in vn_lock_pair() adjustable at runtime. | Konstantin Belousov | 2020-11-26 | 2 | -1/+11
  Reduce default value from hz / 10 to hz / 100.

  Reviewed by: markj
  Tested by: pho
  Sponsored by: The FreeBSD Foundation
  Notes: svn path=/head/; revision=368073
* thread: staticize thread_reap and move td_allocdomain | Mateusz Guzik | 2020-11-26 | 1 | -2/+3
  thread_init is a much better fit as the value is constant after initialization.

  Notes: svn path=/head/; revision=368048
* pipe: follow up cleanup to previous | Mateusz Guzik | 2020-11-25 | 1 | -8/+8
  The committed patch was incomplete.

  - add back missing goto retry, noted by jhb
  - 'if (error)' -> 'if (error != 0)'
  - consistently do:

        if (error != 0)
            break;
        continue;

    instead of:

        if (error != 0)
            break;
        else
            continue;

  This adds some 'continue' uses which are not needed, but line up with the rest of pipe_write.

  Notes: svn path=/head/; revision=368039
* pipe: drop spurious pipeunlock/pipelock cycle on write | Mateusz Guzik | 2020-11-25 | 1 | -16/+5
  Notes: svn path=/head/; revision=368038
* kern: cpuset: properly rebase when attaching to a jail | Kyle Evans | 2020-11-25 | 1 | -21/+100
  The current logic is a fine choice for a system administrator modifying process cpusets or a process creating a new cpuset(2), but not ideal for processes attaching to a jail.

  Currently, when a process attaches to a jail, it does exactly what any other process does and loses any mask it might have applied in the process of doing so because cpuset_setproc() is entirely based around the assumption that non-anonymous cpusets in the process can be replaced with the new parent set.

  This approach slightly improves the jail attach integration by modifying cpuset_setproc() callers to indicate if they should rebase their cpuset to the indicated set or not (i.e. cpuset_setproc_update_set).

  If we're rebasing and the process currently has a cpuset assigned that is not the containing jail's root set, then we will now create a new base set for it hanging off the jail's root with the existing mask applied instead of using the jail's root set as the new base set.

  Note that the common case will be that the process doesn't have a cpuset within the jail root, but the system root can freely assign a cpuset from a jail to a process outside of the jail with no restriction. We assume that that may have happened or that it could happen due to a race when we drop the proc lock, so we must recheck both within the loop to gather up sufficient freed cpusets and after the loop.

  To recap, here's how it worked before in all cases:

      0     4 <-- jail              0      4 <-- jail / process
      |                             |
      1                     ->      1
      |
      3 <-- process

  Here's how it works now:

      0     4 <-- jail              0       4 <-- jail
      |                             |       |
      1                     ->      1       5 <-- process
      |
      3 <-- process

  or

      0     4 <-- jail              0       4 <-- jail / process
      |                             |
      1 <-- process         ->      1

  More importantly, in both cases, the attaching process still retains the mask it had prior to attaching or the attach fails with EDEADLK if it's left with no CPUs to run on or the domain policy is incompatible.

  The author of this patch considers this almost a security feature, because a MAC policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's restricted to some subset of available CPUs the ability to attach to a jail, which might lift the user's restrictions if they attach to a jail with a wider mask.

  In most cases, it's anticipated that admins will use this to be able to, for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`, and avoid the need for contortions to spawn a command inside a jail with a more limited cpuset than the jail.

  Reviewed by: jamie
  MFC after: 1 month (maybe)
  Differential Revision: https://reviews.freebsd.org/D27298
  Notes: svn path=/head/; revision=368011