Diffstat (limited to 'usr.sbin/nfsd')
-rw-r--r--  usr.sbin/nfsd/Makefile            8
-rw-r--r--  usr.sbin/nfsd/Makefile.depend    19
-rw-r--r--  usr.sbin/nfsd/nfsd.8            396
-rw-r--r--  usr.sbin/nfsd/nfsd.c           1400
-rw-r--r--  usr.sbin/nfsd/nfsv4.4           379
-rw-r--r--  usr.sbin/nfsd/pnfs.4            228
-rw-r--r--  usr.sbin/nfsd/pnfsserver.4      444
-rw-r--r--  usr.sbin/nfsd/stablerestart.5    94
8 files changed, 2968 insertions, 0 deletions
diff --git a/usr.sbin/nfsd/Makefile b/usr.sbin/nfsd/Makefile
new file mode 100644
index 000000000000..b6bd9a28e651
--- /dev/null
+++ b/usr.sbin/nfsd/Makefile
@@ -0,0 +1,8 @@
+PACKAGE= nfs
+
+PROG= nfsd
+MAN= nfsd.8 nfsv4.4 stablerestart.5 pnfs.4 pnfsserver.4
+
+LIBADD= util
+
+.include <bsd.prog.mk>
diff --git a/usr.sbin/nfsd/Makefile.depend b/usr.sbin/nfsd/Makefile.depend
new file mode 100644
index 000000000000..7e5c47e39608
--- /dev/null
+++ b/usr.sbin/nfsd/Makefile.depend
@@ -0,0 +1,19 @@
+# Autogenerated - do NOT edit!
+
+DIRDEPS = \
+ include \
+ include/arpa \
+ include/rpc \
+ include/rpcsvc \
+ include/xlocale \
+ lib/${CSU_DIR} \
+ lib/libc \
+ lib/libcompiler_rt \
+ lib/libutil \
+
+
+.include <dirdeps.mk>
+
+.if ${DEP_RELDIR} == ${_DEP_RELDIR}
+# local dependencies - needed for -jN in clean tree
+.endif
diff --git a/usr.sbin/nfsd/nfsd.8 b/usr.sbin/nfsd/nfsd.8
new file mode 100644
index 000000000000..2e5724dbce33
--- /dev/null
+++ b/usr.sbin/nfsd/nfsd.8
@@ -0,0 +1,396 @@
+.\" Copyright (c) 1989, 1991, 1993
+.\" The Regents of the University of California. All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd May 30, 2025
+.Dt NFSD 8
+.Os
+.Sh NAME
+.Nm nfsd
+.Nd remote
+NFS server
+.Sh SYNOPSIS
+.Nm
+.Op Fl arduteN
+.Op Fl n Ar num_servers
+.Op Fl h Ar bindip
+.Op Fl p Ar pnfs_setup
+.Op Fl m Ar mirror_level
+.Op Fl P Ar pidfile
+.Op Fl V Ar virtual_hostname
+.Op Fl Fl maxthreads Ar max_threads
+.Op Fl Fl minthreads Ar min_threads
+.Sh DESCRIPTION
+The
+.Nm
+utility runs on a server machine to service NFS requests from client machines.
+At least one
+.Nm
+must be running for a machine to operate as a server.
+.Pp
+Unless otherwise specified, eight servers per CPU for UDP transport are
+started.
+.Pp
+When
+.Nm
+is run in an appropriately configured vnet jail, the server is restricted
+to TCP transport and no pNFS service.
+Therefore, the
+.Fl t
+option must be specified and none of the
+.Fl u ,
+.Fl p
+and
+.Fl m
+options can be specified when run in a vnet jail.
+See
+.Xr jail 8
+for more information.
+.Pp
+The following options are available:
+.Bl -tag -width Ds
+.It Fl r
+Register the NFS service with
+.Xr rpcbind 8
+without creating any servers.
+This option can be used along with the
+.Fl u
+or
+.Fl t
+options to re-register NFS if the rpcbind server is restarted.
+.It Fl d
+Unregister the NFS service with
+.Xr rpcbind 8
+without creating any servers.
+.It Fl P Ar pidfile
+Specify an alternative location of the file where the main process PID will
+be stored.
+The default location is
+.Pa /var/run/nfsd.pid .
+.It Fl V Ar virtual_hostname
+Specifies a hostname to be used as a principal name, instead of
+the default hostname.
+.It Fl n Ar threads
+This option is deprecated and is limited to a maximum of 256 threads.
+The options
+.Fl Fl maxthreads
+and
+.Fl Fl minthreads
+should now be used.
+The
+.Ar threads
+argument for
+.Fl Fl minthreads
+and
+.Fl Fl maxthreads
+may be set to the same value to avoid dynamic
+changes to the number of threads.
+.It Fl Fl maxthreads Ar threads
+Specifies the maximum number of servers that will be kept around to service
+requests.
+.It Fl Fl minthreads Ar threads
+Specifies the minimum number of servers that will be kept around to service
+requests.
+.It Fl h Ar bindip
+Specifies which IP address or hostname to bind to on the local host.
+This option is recommended when a host has multiple interfaces.
+Multiple
+.Fl h
+options may be specified.
+.It Fl a
+Specifies that nfsd should bind to the wildcard IP address.
+This is the default if no
+.Fl h
+options are given.
+It may also be specified in addition to any
+.Fl h
+options given.
+Note that NFS/UDP does not operate properly when bound to the wildcard IP
+address, whether you use
+.Fl a
+or omit
+.Fl h .
+.It Fl p Ar pnfs_setup
+Enables pNFS support in the server and specifies the information that the
+daemon needs to start it.
+This option can only be used on one server and specifies that this server
+will be the MetaData Server (MDS) for the pNFS service.
+This can only be done if there is at least one
+.Fx
+system configured
+as a Data Server (DS) for it to use.
+.Pp
+The
+.Ar pnfs_setup
+string is a set of fields separated by ',' characters.
+Each of these fields specifies one DS.
+It consists of a server hostname, followed by a ':'
+and the directory path where the DS's data storage file system is mounted on
+this MDS server.
+This can optionally be followed by a '#' and the mds_path, which is the
+directory path for an exported file system on this MDS.
+If this is specified, it means that this DS is to be used to store data
+files for this mds_path file system only.
+If this optional component does not exist, the DS will be used to store data
+files for all exported MDS file systems.
+The DS storage file systems must be mounted on this system before the
+.Nm
+is started with this option specified.
+.br
+For example:
+.sp
+nfsv4-data0:/data0,nfsv4-data1:/data1
+.sp
+would specify two DS servers called nfsv4-data0 and nfsv4-data1 that comprise
+the data storage component of the pNFS service.
+These two DSs would be used to store data files for all exported file systems
+on this MDS.
+The directories
+.Dq /data0
+and
+.Dq /data1
+are where the data storage servers' exported
+storage directories are mounted on this system (which will act as the MDS).
+.br
+Whereas, for the example:
+.sp
+nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export2
+.sp
+would specify two DSs as above, however nfsv4-data0 will be used to store
+data files for
+.Dq /export1
+and nfsv4-data1 will be used to store data files for
+.Dq /export2 .
+.sp
+When using IPv6 addresses for DSs
+be wary of using link local addresses.
+The IPv6 address for the DS is sent to the client and there is no scope
+zone in it.
+As such, a link local address may not work for a pNFS client to DS
+TCP connection.
+When parsed,
+.Nm
+will only use a link local address if it is the only address returned by
+.Xr getaddrinfo 3
+for the DS hostname.
+.It Fl m Ar mirror_level
+This option is only meaningful when used with the
+.Fl p
+option.
+It specifies the
+.Dq mirror_level ,
+which defines how many of the DSs will
+have a copy of a file's data storage file.
+The default of one implies no mirroring of data storage files on the DSs.
+The
+.Dq mirror_level
+would normally be set to 2 to enable mirroring, but
+can be as high as NFSDEV_MAXMIRRORS.
+There must be at least
+.Dq mirror_level
+DSs for each exported file system on the MDS, as specified in the
+.Fl p
+option.
+This implies that, for the above example using "#/export1" and "#/export2",
+mirroring cannot be done.
+There would need to be two DS entries for each of "#/export1" and "#/export2"
+in order to support a
+.Dq mirror_level
+of two.
+.Pp
+If mirroring is enabled, the server must use the Flexible File
+layout.
+If mirroring is not enabled, the server will use the File layout
+by default, but this default can be changed to the Flexible File layout if the
+.Xr sysctl 8
+vfs.nfsd.default_flexfile
+is set non-zero.
+.It Fl t
+Serve TCP NFS clients.
+.It Fl u
+Serve UDP NFS clients.
+.It Fl e
+Ignored; included for backward compatibility.
+.It Fl N
+Cause
+.Nm
+to execute in the foreground instead of in daemon mode.
+.El
+.Pp
+For example,
+.Dq Li "nfsd -u -t --minthreads 6 --maxthreads 6"
+serves UDP and TCP transports using six kernel threads (servers).
+.Pp
+For a system dedicated to servicing NFS RPCs, the number of
+threads (servers) should be sufficient to handle the peak
+client RPC load.
+For systems that perform other services, the number of
+threads (servers) may need to be limited, so that resources
+are available for these other services.
+.Pp
+The
+.Nm
+utility listens for service requests at the port indicated in the
+NFS server specification; see
+.%T "Network File System Protocol Specification" ,
+RFC1094,
+.%T "NFS: Network File System Version 3 Protocol Specification" ,
+RFC1813,
+.%T "Network File System (NFS) Version 4 Protocol" ,
+RFC7530,
+.%T "Network File System (NFS) Version 4 Minor Version 1 Protocol" ,
+RFC5661,
+.%T "Network File System (NFS) Version 4 Minor Version 2 Protocol" ,
+RFC7862,
+.%T "File System Extended Attributes in NFSv4" ,
+RFC8276 and
+.%T "Parallel NFS (pNFS) Flexible File Layout" ,
+RFC8435.
+.Pp
+If
+.Nm
+detects that
+NFS is not loaded in the running kernel, it will attempt
+to load a loadable kernel module containing NFS support using
+.Xr kldload 2 .
+If this fails, or no NFS KLD is available,
+.Nm
+will exit with an error.
+.Pp
+If
+.Nm
+is to be run on a host with multiple interfaces or interface aliases, use
+of the
+.Fl h
+option is recommended.
+If you do not use this option, NFS may not respond to
+UDP packets from the same IP address to which they were sent.
+Use of this option
+is also recommended when securing NFS exports on a firewalling machine such
+that the NFS sockets can only be accessed by the inside interface.
+The
+.Nm ipfw
+utility
+would then be used to block NFS-related packets that come in on the outside
+interface.
+.Pp
+If the server has stopped servicing clients and has generated a console message
+like
+.Dq Li "nfsd server cache flooded..." ,
+the value for vfs.nfsd.tcphighwater needs to be increased.
+This should allow the server to again handle requests without a reboot.
+Also, you may want to consider decreasing the value for
+vfs.nfsd.tcpcachetimeo to several minutes (in seconds) instead of 12 hours
+when this occurs.
+.Pp
+Unfortunately, making vfs.nfsd.tcphighwater too large can result in the mbuf
+limit being reached, as indicated by a console message
+like
+.Dq Li "kern.ipc.nmbufs limit reached" .
+If you cannot find values for the above
+.Xr sysctl 8
+variables that work, you can disable the DRC cache for TCP by setting
+vfs.nfsd.cachetcp to 0.
+.Pp
+The
+.Nm
+utility has to be terminated with
+.Dv SIGUSR1
+and cannot be killed with
+.Dv SIGTERM
+or
+.Dv SIGQUIT .
+The
+.Nm
+utility needs to ignore these signals in order to stay alive as long
+as possible during a shutdown, otherwise loopback mounts will
+not be able to unmount.
+If you have to kill
+.Nm
+just do a
+.Dq Li "kill -USR1 <PID of master nfsd>"
+.Sh EXIT STATUS
+.Ex -std
+.Sh SEE ALSO
+.Xr nfsstat 1 ,
+.Xr kldload 2 ,
+.Xr nfssvc 2 ,
+.Xr nfsv4 4 ,
+.Xr pnfs 4 ,
+.Xr pnfsserver 4 ,
+.Xr exports 5 ,
+.Xr stablerestart 5 ,
+.Xr gssd 8 ,
+.Xr ipfw 8 ,
+.Xr jail 8 ,
+.Xr mountd 8 ,
+.Xr nfsiod 8 ,
+.Xr nfsrevoke 8 ,
+.Xr nfsuserd 8 ,
+.Xr rpcbind 8
+.Sh HISTORY
+The
+.Nm
+utility first appeared in
+.Bx 4.4 .
+.Sh BUGS
+If
+.Nm
+is started when
+.Xr gssd 8
+is not running, it will service AUTH_SYS requests only.
+To fix the problem you must kill
+.Nm
+and then restart it, after the
+.Xr gssd 8
+is running.
+.Pp
+For a Flexible File Layout pNFS server,
+if there are Linux clients doing NFSv4.1 or NFSv4.2 mounts, those
+clients might need the
+.Xr sysctl 8
+vfs.nfsd.flexlinuxhack
+to be set to one on the MDS as a workaround.
+.Pp
+Linux 5.n kernels appear to have been patched such that this
+.Xr sysctl 8
+does not need to be set.
+.Pp
+For NFSv4.2, a Copy operation can take a long time to complete.
+If there is a concurrent ExchangeID or DelegReturn operation
+which requires the exclusive lock on all NFSv4 state, this can
+result in a
+.Dq stall
+of the
+.Nm
+server.
+If your storage is on ZFS without block cloning enabled,
+setting the
+.Xr sysctl 8
+.Va vfs.zfs.dmu_offset_next_sync
+to 0 can often avoid this problem.
+It is also possible to set the
+.Xr sysctl 8
+.Va vfs.nfsd.maxcopyrange
+to 10-100 megabytes to try to reduce Copy operation times.
+As a last resort, setting
+.Xr sysctl 8
+.Va vfs.nfsd.maxcopyrange
+to 0 disables the Copy operation.
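The DRC tuning and shutdown advice in nfsd.8 above can be condensed into a short administrative sketch. The sysctl names and the SIGUSR1 requirement come from the manual page text; the numeric values are hypothetical illustrations, not recommendations, and must be chosen for the actual workload:

```shell
# If the console logs "nfsd server cache flooded...", raise the DRC
# high-water mark and shorten the TCP cache timeout (value in seconds).
sysctl vfs.nfsd.tcphighwater=100000   # hypothetical value; tune per workload
sysctl vfs.nfsd.tcpcachetimeo=300     # a few minutes instead of 12 hours

# If "kern.ipc.nmbufs limit reached" then appears, as a last resort
# disable the DRC for TCP entirely.
sysctl vfs.nfsd.cachetcp=0

# nfsd ignores SIGTERM/SIGQUIT; stop the master daemon with SIGUSR1.
kill -USR1 "$(cat /var/run/nfsd.pid)"
```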
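The pnfs_setup discussion in nfsd.8 above can likewise be sketched as a startup sequence. This is a hypothetical sketch only: the DS hostnames and mount points echo the manual page's own examples, and the mount options (NFS version, etc.) are elided; the one hard requirement stated by the text is that the DS storage file systems be mounted on the MDS before nfsd is started:

```shell
# Mount each DS's exported storage file system on the MDS first
# (hypothetical DS names, matching the manual page's examples).
mount -t nfs nfsv4-data0:/ /data0
mount -t nfs nfsv4-data1:/ /data1

# Start the MDS with both DSs storing data files for all exports,
# mirrored across the two DSs (-m 2 requires at least two DSs per
# exported file system).
nfsd -u -t -p 'nfsv4-data0:/data0,nfsv4-data1:/data1' -m 2
```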
diff --git a/usr.sbin/nfsd/nfsd.c b/usr.sbin/nfsd/nfsd.c
new file mode 100644
index 000000000000..94c30ae6dee1
--- /dev/null
+++ b/usr.sbin/nfsd/nfsd.c
@@ -0,0 +1,1400 @@
+/*-
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 1989, 1993, 1994
+ * The Regents of the University of California. All rights reserved.
+ *
+ * This code is derived from software contributed to Berkeley by
+ * Rick Macklem at The University of Guelph.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <sys/param.h>
+#include <sys/syslog.h>
+#include <sys/wait.h>
+#include <sys/mount.h>
+#include <sys/fcntl.h>
+#include <sys/linker.h>
+#include <sys/module.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/sysctl.h>
+#include <sys/ucred.h>
+
+#include <rpc/rpc.h>
+#include <rpc/pmap_clnt.h>
+#include <rpcsvc/nfs_prot.h>
+
+#include <netdb.h>
+#include <arpa/inet.h>
+#include <nfs/nfssvc.h>
+
+#include <fs/nfs/nfsproto.h>
+#include <fs/nfs/nfskpiport.h>
+#include <fs/nfs/nfs.h>
+
+#include <err.h>
+#include <errno.h>
+#include <libutil.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sysexits.h>
+
+#include <getopt.h>
+
+static int debug = 0;
+static int nofork = 0;
+
+#define DEFAULT_PIDFILE "/var/run/nfsd.pid"
+#define NFSD_STABLERESTART "/var/db/nfs-stablerestart"
+#define NFSD_STABLEBACKUP "/var/db/nfs-stablerestart.bak"
+#define MAXNFSDCNT 256
+#define DEFNFSDCNT 4
+#define NFS_VER2 2
+#define NFS_VER3 3
+#define NFS_VER4 4
+static pid_t children[MAXNFSDCNT]; /* PIDs of children */
+static pid_t masterpid; /* PID of master/parent */
+static struct pidfh *masterpidfh = NULL; /* pidfh of master/parent */
+static int nfsdcnt; /* number of children */
+static int nfsdcnt_set;
+static int minthreads;
+static int maxthreads;
+static int nfssvc_nfsd; /* Set to correct NFSSVC_xxx flag */
+static int stablefd = -1; /* Fd for the stable restart file */
+static int backupfd; /* Fd for the backup stable restart file */
+static const char *getopt_shortopts;
+static const char *getopt_usage;
+static int nfs_minvers = NFS_VER2;
+
+static int minthreads_set;
+static int maxthreads_set;
+
+static struct option longopts[] = {
+ { "debug", no_argument, &debug, 1 },
+ { "minthreads", required_argument, &minthreads_set, 1 },
+ { "maxthreads", required_argument, &maxthreads_set, 1 },
+ { "pnfs", required_argument, NULL, 'p' },
+ { "mirror", required_argument, NULL, 'm' },
+ { NULL, 0, NULL, 0}
+};
+
+static void cleanup(int);
+static void child_cleanup(int);
+static void killchildren(void);
+static void nfsd_exit(int);
+static void nonfs(int);
+static void reapchild(int);
+static int setbindhost(struct addrinfo **ia, const char *bindhost,
+ struct addrinfo hints);
+static void start_server(int, struct nfsd_nfsd_args *, const char *vhost);
+static void unregistration(void);
+static void usage(void);
+static void open_stable(int *, int *);
+static void copy_stable(int, int);
+static void backup_stable(int);
+static void set_nfsdcnt(int);
+static void parse_dsserver(const char *, struct nfsd_nfsd_args *);
+
+/*
+ * NFS server daemon; mostly just a user context for nfssvc().
+ *
+ * 1 - do file descriptor and signal cleanup
+ * 2 - fork the nfsd(s)
+ * 3 - create server socket(s)
+ * 4 - register socket with rpcbind
+ *
+ * For connectionless protocols, just pass the socket into the kernel via
+ * nfssvc().
+ * For connection based sockets, loop doing accepts.  When you get a new
+ * socket from accept, pass the msgsock into the kernel via nfssvc().
+ * The arguments are:
+ *	-r - reregister with rpcbind
+ *	-d - unregister with rpcbind
+ *	-t - support TCP NFS clients
+ *	-u - support UDP NFS clients
+ *	-e - ignored; retained for backward compatibility
+ *	-p - enable a pNFS service
+ *	-m - set the mirroring level for a pNFS service
+ * followed by "n", the number of nfsds to fork off.
+ */
+int
+main(int argc, char **argv)
+{
+ struct nfsd_addsock_args addsockargs;
+ struct addrinfo *ai_udp, *ai_tcp, *ai_udp6, *ai_tcp6, hints;
+ struct netconfig *nconf_udp, *nconf_tcp, *nconf_udp6, *nconf_tcp6;
+ struct netbuf nb_udp, nb_tcp, nb_udp6, nb_tcp6;
+ struct sockaddr_storage peer;
+ fd_set ready, sockbits;
+ int ch, connect_type_cnt, i, maxsock, msgsock;
+ socklen_t len;
+ int on = 1, unregister, reregister, sock;
+ int tcp6sock, ip6flag, tcpflag, tcpsock;
+ int udpflag, ecode, error, s;
+ int bindhostc, bindanyflag, rpcbreg, rpcbregcnt;
+ int nfssvc_addsock;
+ int jailed, longindex = 0;
+ size_t jailed_size, nfs_minvers_size;
+ const char *lopt;
+ char **bindhost = NULL;
+ const char *pidfile_path = DEFAULT_PIDFILE;
+ pid_t pid, otherpid;
+ struct nfsd_nfsd_args nfsdargs;
+ const char *vhostname = NULL;
+
+ nfsdargs.mirrorcnt = 1;
+ nfsdargs.addr = NULL;
+ nfsdargs.addrlen = 0;
+ nfsdcnt = DEFNFSDCNT;
+ unregister = reregister = tcpflag = maxsock = 0;
+ bindanyflag = udpflag = connect_type_cnt = bindhostc = 0;
+ getopt_shortopts = "ah:n:rdtuep:m:V:NP:";
+ getopt_usage =
+ "usage:\n"
+ " nfsd [-ardtueN] [-h bindip]\n"
+ " [-n numservers] [--minthreads #] [--maxthreads #]\n"
+ " [-p/--pnfs dsserver0:/dsserver0-mounted-on-dir,...,"
+ "dsserverN:/dsserverN-mounted-on-dir] [-m mirrorlevel]\n"
+ " [-P pidfile ] [-V virtual_hostname]\n";
+ while ((ch = getopt_long(argc, argv, getopt_shortopts, longopts,
+ &longindex)) != -1)
+ switch (ch) {
+ case 'V':
+ if (strlen(optarg) <= MAXHOSTNAMELEN)
+ vhostname = optarg;
+ else
+ warnx("Virtual host name (%s) is too long",
+ optarg);
+ break;
+ case 'a':
+ bindanyflag = 1;
+ break;
+ case 'n':
+ set_nfsdcnt(atoi(optarg));
+ break;
+ case 'h':
+ bindhostc++;
+ bindhost = realloc(bindhost,sizeof(char *)*bindhostc);
+ if (bindhost == NULL)
+ errx(1, "Out of memory");
+ bindhost[bindhostc-1] = strdup(optarg);
+ if (bindhost[bindhostc-1] == NULL)
+ errx(1, "Out of memory");
+ break;
+ case 'r':
+ reregister = 1;
+ break;
+ case 'd':
+ unregister = 1;
+ break;
+ case 't':
+ tcpflag = 1;
+ break;
+ case 'u':
+ udpflag = 1;
+ break;
+ case 'e':
+ /* now a no-op, since this is the default */
+ break;
+ case 'p':
+ /* Parse out the DS server host names and mount pts. */
+ parse_dsserver(optarg, &nfsdargs);
+ break;
+ case 'm':
+ /* Set the mirror level for a pNFS service. */
+ i = atoi(optarg);
+ if (i < 2 || i > NFSDEV_MAXMIRRORS)
+ errx(1, "Mirror level out of range 2<-->%d",
+ NFSDEV_MAXMIRRORS);
+ nfsdargs.mirrorcnt = i;
+ break;
+ case 'N':
+ nofork = 1;
+ break;
+ case 'P':
+ pidfile_path = optarg;
+ break;
+ case 0:
+ lopt = longopts[longindex].name;
+ if (!strcmp(lopt, "minthreads")) {
+ minthreads = atoi(optarg);
+ } else if (!strcmp(lopt, "maxthreads")) {
+ maxthreads = atoi(optarg);
+ }
+ break;
+ default:
+ case '?':
+ usage();
+ }
+ if (!tcpflag && !udpflag)
+ udpflag = 1;
+ argv += optind;
+ argc -= optind;
+ if (minthreads_set && maxthreads_set && minthreads > maxthreads)
+ errx(EX_USAGE,
+ "error: minthreads(%d) can't be greater than "
+ "maxthreads(%d)", minthreads, maxthreads);
+
+ /*
+ * XXX
+ * Backward compatibility, trailing number is the count of daemons.
+ */
+ if (argc > 1)
+ usage();
+ if (argc == 1)
+ set_nfsdcnt(atoi(argv[0]));
+
+	/*
+	 * Try to load the "nfsd" module if it is not already
+	 * present in the kernel.
+	 */
+ if (modfind("nfsd") < 0) {
+ /* Not present in kernel, try loading it */
+ if (kldload("nfsd") < 0 || modfind("nfsd") < 0)
+ errx(1, "NFS server is not available");
+ }
+
+ ip6flag = 1;
+ s = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP);
+ if (s == -1) {
+ if (errno != EPROTONOSUPPORT && errno != EAFNOSUPPORT)
+ err(1, "socket");
+ ip6flag = 0;
+ } else if (getnetconfigent("udp6") == NULL ||
+ getnetconfigent("tcp6") == NULL) {
+ ip6flag = 0;
+ }
+ if (s != -1)
+ close(s);
+
+ if (bindhostc == 0 || bindanyflag) {
+ bindhostc++;
+ bindhost = realloc(bindhost,sizeof(char *)*bindhostc);
+ if (bindhost == NULL)
+ errx(1, "Out of memory");
+ bindhost[bindhostc-1] = strdup("*");
+ if (bindhost[bindhostc-1] == NULL)
+ errx(1, "Out of memory");
+ }
+
+ if (unregister) {
+ /*
+ * Unregister before setting nfs_minvers, in case the
+ * value of vfs.nfsd.server_min_nfsvers has changed
+ * since registering with rpcbind.
+ */
+ unregistration();
+ exit (0);
+ }
+
+ nfs_minvers_size = sizeof(nfs_minvers);
+ error = sysctlbyname("vfs.nfsd.server_min_nfsvers", &nfs_minvers,
+ &nfs_minvers_size, NULL, 0);
+ if (error != 0 || nfs_minvers < NFS_VER2 || nfs_minvers > NFS_VER4) {
+ warnx("sysctlbyname(vfs.nfsd.server_min_nfsvers) failed,"
+ " defaulting to NFSv2");
+ nfs_minvers = NFS_VER2;
+ }
+
+ if (reregister) {
+ if (udpflag) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET;
+ hints.ai_socktype = SOCK_DGRAM;
+ hints.ai_protocol = IPPROTO_UDP;
+ ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp);
+ if (ecode != 0)
+				errx(1, "getaddrinfo udp: %s", gai_strerror(ecode));
+ nconf_udp = getnetconfigent("udp");
+ if (nconf_udp == NULL)
+ err(1, "getnetconfigent udp failed");
+ nb_udp.buf = ai_udp->ai_addr;
+ nb_udp.len = nb_udp.maxlen = ai_udp->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp,
+ &nb_udp))
+ err(1, "rpcb_set udp failed");
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp,
+ &nb_udp))
+ err(1, "rpcb_set udp failed");
+ freeaddrinfo(ai_udp);
+ }
+ if (udpflag && ip6flag) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET6;
+ hints.ai_socktype = SOCK_DGRAM;
+ hints.ai_protocol = IPPROTO_UDP;
+ ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp6);
+ if (ecode != 0)
+				errx(1, "getaddrinfo udp6: %s", gai_strerror(ecode));
+ nconf_udp6 = getnetconfigent("udp6");
+ if (nconf_udp6 == NULL)
+ err(1, "getnetconfigent udp6 failed");
+ nb_udp6.buf = ai_udp6->ai_addr;
+ nb_udp6.len = nb_udp6.maxlen = ai_udp6->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp6,
+ &nb_udp6))
+ err(1, "rpcb_set udp6 failed");
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp6,
+ &nb_udp6))
+ err(1, "rpcb_set udp6 failed");
+ freeaddrinfo(ai_udp6);
+ }
+ if (tcpflag) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_protocol = IPPROTO_TCP;
+ ecode = getaddrinfo(NULL, "nfs", &hints, &ai_tcp);
+ if (ecode != 0)
+				errx(1, "getaddrinfo tcp: %s", gai_strerror(ecode));
+ nconf_tcp = getnetconfigent("tcp");
+ if (nconf_tcp == NULL)
+ err(1, "getnetconfigent tcp failed");
+ nb_tcp.buf = ai_tcp->ai_addr;
+ nb_tcp.len = nb_tcp.maxlen = ai_tcp->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp,
+ &nb_tcp))
+ err(1, "rpcb_set tcp failed");
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp,
+ &nb_tcp))
+ err(1, "rpcb_set tcp failed");
+ freeaddrinfo(ai_tcp);
+ }
+ if (tcpflag && ip6flag) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET6;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_protocol = IPPROTO_TCP;
+ ecode = getaddrinfo(NULL, "nfs", &hints, &ai_tcp6);
+ if (ecode != 0)
+				errx(1, "getaddrinfo tcp6: %s", gai_strerror(ecode));
+ nconf_tcp6 = getnetconfigent("tcp6");
+ if (nconf_tcp6 == NULL)
+ err(1, "getnetconfigent tcp6 failed");
+ nb_tcp6.buf = ai_tcp6->ai_addr;
+ nb_tcp6.len = nb_tcp6.maxlen = ai_tcp6->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp6,
+ &nb_tcp6))
+ err(1, "rpcb_set tcp6 failed");
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp6,
+ &nb_tcp6))
+ err(1, "rpcb_set tcp6 failed");
+ freeaddrinfo(ai_tcp6);
+ }
+ exit (0);
+ }
+
+ if (pidfile_path != NULL) {
+ masterpidfh = pidfile_open(pidfile_path, 0600, &otherpid);
+ if (masterpidfh == NULL) {
+ if (errno == EEXIST)
+ errx(1, "daemon already running, pid: %jd.",
+ (intmax_t)otherpid);
+ warn("cannot open pid file");
+ }
+ }
+ if (debug == 0 && nofork == 0) {
+ daemon(0, 0);
+ (void)signal(SIGHUP, SIG_IGN);
+ (void)signal(SIGINT, SIG_IGN);
+ /*
+ * nfsd sits in the kernel most of the time. It needs
+ * to ignore SIGTERM/SIGQUIT in order to stay alive as long
+ * as possible during a shutdown, otherwise loopback
+ * mounts will not be able to unmount.
+ */
+ (void)signal(SIGTERM, SIG_IGN);
+ (void)signal(SIGQUIT, SIG_IGN);
+ }
+ (void)signal(SIGSYS, nonfs);
+ (void)signal(SIGCHLD, reapchild);
+ (void)signal(SIGUSR2, backup_stable);
+
+ openlog("nfsd", LOG_PID | (debug ? LOG_PERROR : 0), LOG_DAEMON);
+
+ if (masterpidfh != NULL && pidfile_write(masterpidfh) != 0)
+ syslog(LOG_ERR, "pidfile_write(): %m");
+
+ /*
+ * For V4, we open the stablerestart file and call nfssvc()
+ * to get it loaded. This is done before the daemons do the
+ * regular nfssvc() call to service NFS requests.
+ * (This way the file remains open until the last nfsd is killed
+ * off.)
+ * It and the backup copy will be created as empty files
+ * the first time this nfsd is started and should never be
+ * deleted/replaced if at all possible. It should live on a
+ * local, non-volatile storage device that does not do hardware
+ * level write-back caching. (See SCSI doc for more information
+ * on how to prevent write-back caching on SCSI disks.)
+ */
+ open_stable(&stablefd, &backupfd);
+ if (stablefd < 0) {
+ syslog(LOG_ERR, "Can't open %s: %m\n", NFSD_STABLERESTART);
+ exit(1);
+ }
+ /* This system call will fail for old kernels, but that's ok. */
+ nfssvc(NFSSVC_BACKUPSTABLE, NULL);
+ if (nfssvc(NFSSVC_STABLERESTART, (caddr_t)&stablefd) < 0) {
+ if (errno == EPERM) {
+ jailed = 0;
+ jailed_size = sizeof(jailed);
+ sysctlbyname("security.jail.jailed", &jailed,
+ &jailed_size, NULL, 0);
+ if (jailed != 0)
+ syslog(LOG_ERR, "nfssvc stablerestart failed: "
+ "allow.nfsd might not be configured");
+ else
+ syslog(LOG_ERR, "nfssvc stablerestart failed");
+ } else if (errno == ENXIO)
+ syslog(LOG_ERR, "nfssvc stablerestart failed: is nfsd "
+ "already running?");
+ else
+ syslog(LOG_ERR, "Can't read stable storage file: %m\n");
+ exit(1);
+ }
+ nfssvc_addsock = NFSSVC_NFSDADDSOCK;
+ nfssvc_nfsd = NFSSVC_NFSDNFSD | NFSSVC_NEWSTRUCT;
+
+ if (tcpflag) {
+ /*
+ * For TCP mode, we fork once to start the first
+ * kernel nfsd thread. The kernel will add more
+ * threads as needed.
+ */
+ masterpid = getpid();
+ pid = fork();
+ if (pid == -1) {
+ syslog(LOG_ERR, "fork: %m");
+ nfsd_exit(1);
+ }
+ if (pid) {
+ children[0] = pid;
+ } else {
+ pidfile_close(masterpidfh);
+ (void)signal(SIGUSR1, child_cleanup);
+ setproctitle("server");
+ start_server(0, &nfsdargs, vhostname);
+ }
+ }
+
+ (void)signal(SIGUSR1, cleanup);
+ FD_ZERO(&sockbits);
+
+ rpcbregcnt = 0;
+ /* Set up the socket for udp and rpcb register it. */
+ if (udpflag) {
+ rpcbreg = 0;
+ for (i = 0; i < bindhostc; i++) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET;
+ hints.ai_socktype = SOCK_DGRAM;
+ hints.ai_protocol = IPPROTO_UDP;
+ if (setbindhost(&ai_udp, bindhost[i], hints) == 0) {
+ rpcbreg = 1;
+ rpcbregcnt++;
+ if ((sock = socket(ai_udp->ai_family,
+ ai_udp->ai_socktype,
+ ai_udp->ai_protocol)) < 0) {
+ syslog(LOG_ERR,
+ "can't create udp socket");
+ nfsd_exit(1);
+ }
+ if (bind(sock, ai_udp->ai_addr,
+ ai_udp->ai_addrlen) < 0) {
+ syslog(LOG_ERR,
+ "can't bind udp addr %s: %m",
+ bindhost[i]);
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_udp);
+ addsockargs.sock = sock;
+ addsockargs.name = NULL;
+ addsockargs.namelen = 0;
+ if (nfssvc(nfssvc_addsock, &addsockargs) < 0) {
+					syslog(LOG_ERR, "can't add UDP socket");
+ nfsd_exit(1);
+ }
+ (void)close(sock);
+ }
+ }
+ if (rpcbreg == 1) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET;
+ hints.ai_socktype = SOCK_DGRAM;
+ hints.ai_protocol = IPPROTO_UDP;
+ ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp);
+ if (ecode != 0) {
+ syslog(LOG_ERR, "getaddrinfo udp: %s",
+ gai_strerror(ecode));
+ nfsd_exit(1);
+ }
+ nconf_udp = getnetconfigent("udp");
+ if (nconf_udp == NULL) {
+ syslog(LOG_ERR, "getnetconfigent udp failed");
+ nfsd_exit(1);
+ }
+ nb_udp.buf = ai_udp->ai_addr;
+ nb_udp.len = nb_udp.maxlen = ai_udp->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp,
+ &nb_udp)) {
+ syslog(LOG_ERR, "rpcb_set udp failed");
+ nfsd_exit(1);
+ }
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp,
+ &nb_udp)) {
+ syslog(LOG_ERR, "rpcb_set udp failed");
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_udp);
+ }
+ }
+
+ /* Set up the socket for udp6 and rpcb register it. */
+ if (udpflag && ip6flag) {
+ rpcbreg = 0;
+ for (i = 0; i < bindhostc; i++) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET6;
+ hints.ai_socktype = SOCK_DGRAM;
+ hints.ai_protocol = IPPROTO_UDP;
+ if (setbindhost(&ai_udp6, bindhost[i], hints) == 0) {
+ rpcbreg = 1;
+ rpcbregcnt++;
+ if ((sock = socket(ai_udp6->ai_family,
+ ai_udp6->ai_socktype,
+ ai_udp6->ai_protocol)) < 0) {
+ syslog(LOG_ERR,
+ "can't create udp6 socket");
+ nfsd_exit(1);
+ }
+ if (setsockopt(sock, IPPROTO_IPV6, IPV6_V6ONLY,
+ &on, sizeof on) < 0) {
+ syslog(LOG_ERR,
+ "can't set v6-only binding for "
+ "udp6 socket: %m");
+ nfsd_exit(1);
+ }
+ if (bind(sock, ai_udp6->ai_addr,
+ ai_udp6->ai_addrlen) < 0) {
+ syslog(LOG_ERR,
+ "can't bind udp6 addr %s: %m",
+ bindhost[i]);
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_udp6);
+ addsockargs.sock = sock;
+ addsockargs.name = NULL;
+ addsockargs.namelen = 0;
+ if (nfssvc(nfssvc_addsock, &addsockargs) < 0) {
+ syslog(LOG_ERR,
+ "can't add UDP6 socket");
+ nfsd_exit(1);
+ }
+ (void)close(sock);
+ }
+ }
+ if (rpcbreg == 1) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET6;
+ hints.ai_socktype = SOCK_DGRAM;
+ hints.ai_protocol = IPPROTO_UDP;
+ ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp6);
+ if (ecode != 0) {
+ syslog(LOG_ERR, "getaddrinfo udp6: %s",
+ gai_strerror(ecode));
+ nfsd_exit(1);
+ }
+ nconf_udp6 = getnetconfigent("udp6");
+ if (nconf_udp6 == NULL) {
+ syslog(LOG_ERR, "getnetconfigent udp6 failed");
+ nfsd_exit(1);
+ }
+ nb_udp6.buf = ai_udp6->ai_addr;
+ nb_udp6.len = nb_udp6.maxlen = ai_udp6->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp6,
+ &nb_udp6)) {
+ syslog(LOG_ERR,
+ "rpcb_set udp6 failed");
+ nfsd_exit(1);
+ }
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp6,
+ &nb_udp6)) {
+ syslog(LOG_ERR,
+ "rpcb_set udp6 failed");
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_udp6);
+ }
+ }
+
+ /* Set up the socket for tcp and rpcb register it. */
+ if (tcpflag) {
+ rpcbreg = 0;
+ for (i = 0; i < bindhostc; i++) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_protocol = IPPROTO_TCP;
+ if (setbindhost(&ai_tcp, bindhost[i], hints) == 0) {
+ rpcbreg = 1;
+ rpcbregcnt++;
+ if ((tcpsock = socket(AF_INET, SOCK_STREAM,
+ 0)) < 0) {
+ syslog(LOG_ERR,
+ "can't create tcp socket");
+ nfsd_exit(1);
+ }
+ if (setsockopt(tcpsock, SOL_SOCKET,
+ SO_REUSEADDR,
+ (char *)&on, sizeof(on)) < 0)
+ syslog(LOG_ERR,
+ "setsockopt SO_REUSEADDR: %m");
+ if (bind(tcpsock, ai_tcp->ai_addr,
+ ai_tcp->ai_addrlen) < 0) {
+ syslog(LOG_ERR,
+ "can't bind tcp addr %s: %m",
+ bindhost[i]);
+ nfsd_exit(1);
+ }
+ if (listen(tcpsock, -1) < 0) {
+ syslog(LOG_ERR, "listen failed");
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_tcp);
+ FD_SET(tcpsock, &sockbits);
+ maxsock = tcpsock;
+ connect_type_cnt++;
+ }
+ }
+ if (rpcbreg == 1) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_protocol = IPPROTO_TCP;
+ ecode = getaddrinfo(NULL, "nfs", &hints,
+ &ai_tcp);
+ if (ecode != 0) {
+ syslog(LOG_ERR, "getaddrinfo tcp: %s",
+ gai_strerror(ecode));
+ nfsd_exit(1);
+ }
+ nconf_tcp = getnetconfigent("tcp");
+ if (nconf_tcp == NULL) {
+ syslog(LOG_ERR, "getnetconfigent tcp failed");
+ nfsd_exit(1);
+ }
+ nb_tcp.buf = ai_tcp->ai_addr;
+ nb_tcp.len = nb_tcp.maxlen = ai_tcp->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp,
+ &nb_tcp)) {
+ syslog(LOG_ERR, "rpcb_set tcp failed");
+ nfsd_exit(1);
+ }
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp,
+ &nb_tcp)) {
+ syslog(LOG_ERR, "rpcb_set tcp failed");
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_tcp);
+ }
+ }
+
+ /* Set up the socket for tcp6 and rpcb register it. */
+ if (tcpflag && ip6flag) {
+ rpcbreg = 0;
+ for (i = 0; i < bindhostc; i++) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET6;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_protocol = IPPROTO_TCP;
+ if (setbindhost(&ai_tcp6, bindhost[i], hints) == 0) {
+ rpcbreg = 1;
+ rpcbregcnt++;
+ if ((tcp6sock = socket(ai_tcp6->ai_family,
+ ai_tcp6->ai_socktype,
+ ai_tcp6->ai_protocol)) < 0) {
+ syslog(LOG_ERR,
+ "can't create tcp6 socket");
+ nfsd_exit(1);
+ }
+ if (setsockopt(tcp6sock, SOL_SOCKET,
+ SO_REUSEADDR,
+ (char *)&on, sizeof(on)) < 0)
+ syslog(LOG_ERR,
+ "setsockopt SO_REUSEADDR: %m");
+ if (setsockopt(tcp6sock, IPPROTO_IPV6,
+ IPV6_V6ONLY, &on, sizeof on) < 0) {
+ syslog(LOG_ERR,
+ "can't set v6-only binding for tcp6 "
+ "socket: %m");
+ nfsd_exit(1);
+ }
+ if (bind(tcp6sock, ai_tcp6->ai_addr,
+ ai_tcp6->ai_addrlen) < 0) {
+ syslog(LOG_ERR,
+ "can't bind tcp6 addr %s: %m",
+ bindhost[i]);
+ nfsd_exit(1);
+ }
+ if (listen(tcp6sock, -1) < 0) {
+ syslog(LOG_ERR, "listen failed");
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_tcp6);
+ FD_SET(tcp6sock, &sockbits);
+ if (maxsock < tcp6sock)
+ maxsock = tcp6sock;
+ connect_type_cnt++;
+ }
+ }
+ if (rpcbreg == 1) {
+ memset(&hints, 0, sizeof hints);
+ hints.ai_flags = AI_PASSIVE;
+ hints.ai_family = AF_INET6;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_protocol = IPPROTO_TCP;
+ ecode = getaddrinfo(NULL, "nfs", &hints, &ai_tcp6);
+ if (ecode != 0) {
+ syslog(LOG_ERR, "getaddrinfo tcp6: %s",
+ gai_strerror(ecode));
+ nfsd_exit(1);
+ }
+ nconf_tcp6 = getnetconfigent("tcp6");
+ if (nconf_tcp6 == NULL) {
+ syslog(LOG_ERR, "getnetconfigent tcp6 failed");
+ nfsd_exit(1);
+ }
+ nb_tcp6.buf = ai_tcp6->ai_addr;
+ nb_tcp6.len = nb_tcp6.maxlen = ai_tcp6->ai_addrlen;
+ if (nfs_minvers == NFS_VER2)
+ if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp6,
+ &nb_tcp6)) {
+ syslog(LOG_ERR, "rpcb_set tcp6 failed");
+ nfsd_exit(1);
+ }
+ if (nfs_minvers <= NFS_VER3)
+ if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp6,
+ &nb_tcp6)) {
+ syslog(LOG_ERR, "rpcb_set tcp6 failed");
+ nfsd_exit(1);
+ }
+ freeaddrinfo(ai_tcp6);
+ }
+ }
+
+ if (rpcbregcnt == 0) {
+ syslog(LOG_ERR, "rpcb_set() failed, nothing to do");
+ nfsd_exit(1);
+ }
+
+ if (tcpflag && connect_type_cnt == 0) {
+ syslog(LOG_ERR, "tcp connects == 0, nothing to do");
+ nfsd_exit(1);
+ }
+
+ setproctitle("master");
+ /*
+ * We always want a master to have a clean way to shut nfsd down
+ * (with unregistration): if the master is killed, it unregisters and
+ * kills all children. If we run for UDP only (and so do not have to
+ * loop waiting for accept), we instead make the parent
+ * a "server" too. start_server will not return.
+ */
+ if (!tcpflag)
+ start_server(1, &nfsdargs, vhostname);
+
+ /*
+ * Loop forever accepting connections and passing the sockets
+ * into the kernel for the mounts.
+ */
+ for (;;) {
+ ready = sockbits;
+ if (connect_type_cnt > 1) {
+ if (select(maxsock + 1,
+ &ready, NULL, NULL, NULL) < 1) {
+ error = errno;
+ if (error == EINTR)
+ continue;
+ syslog(LOG_ERR, "select failed: %m");
+ nfsd_exit(1);
+ }
+ }
+ for (tcpsock = 0; tcpsock <= maxsock; tcpsock++) {
+ if (FD_ISSET(tcpsock, &ready)) {
+ len = sizeof(peer);
+ if ((msgsock = accept(tcpsock,
+ (struct sockaddr *)&peer, &len)) < 0) {
+ error = errno;
+ syslog(LOG_ERR, "accept failed: %m");
+ if (error == ECONNABORTED ||
+ error == EINTR)
+ continue;
+ nfsd_exit(1);
+ }
+ if (setsockopt(msgsock, SOL_SOCKET,
+ SO_KEEPALIVE, (char *)&on, sizeof(on)) < 0)
+ syslog(LOG_ERR,
+ "setsockopt SO_KEEPALIVE: %m");
+ addsockargs.sock = msgsock;
+ addsockargs.name = (caddr_t)&peer;
+ addsockargs.namelen = len;
+ nfssvc(nfssvc_addsock, &addsockargs);
+ (void)close(msgsock);
+ }
+ }
+ }
+}
+
+static int
+setbindhost(struct addrinfo **ai, const char *bindhost, struct addrinfo hints)
+{
+ int ecode;
+ u_int32_t host_addr[4]; /* IPv4 or IPv6 */
+ const char *hostptr;
+
+ if (bindhost == NULL || strcmp("*", bindhost) == 0)
+ hostptr = NULL;
+ else
+ hostptr = bindhost;
+
+ if (hostptr != NULL) {
+ switch (hints.ai_family) {
+ case AF_INET:
+ if (inet_pton(AF_INET, hostptr, host_addr) == 1) {
+ hints.ai_flags = AI_NUMERICHOST;
+ } else {
+ if (inet_pton(AF_INET6, hostptr,
+ host_addr) == 1)
+ return (1);
+ }
+ break;
+ case AF_INET6:
+ if (inet_pton(AF_INET6, hostptr, host_addr) == 1) {
+ hints.ai_flags = AI_NUMERICHOST;
+ } else {
+ if (inet_pton(AF_INET, hostptr,
+ host_addr) == 1)
+ return (1);
+ }
+ break;
+ default:
+ break;
+ }
+ }
+
+ ecode = getaddrinfo(hostptr, "nfs", &hints, ai);
+ if (ecode != 0) {
+ syslog(LOG_ERR, "getaddrinfo %s: %s",
+ bindhost == NULL ? "*" : bindhost,
+ gai_strerror(ecode));
+ return (1);
+ }
+ return (0);
+}
+
+static void
+set_nfsdcnt(int proposed)
+{
+
+ if (proposed < 1) {
+ warnx("nfsd count too low %d; reset to %d", proposed,
+ DEFNFSDCNT);
+ nfsdcnt = DEFNFSDCNT;
+ } else if (proposed > MAXNFSDCNT) {
+ warnx("nfsd count too high %d; truncated to %d", proposed,
+ MAXNFSDCNT);
+ nfsdcnt = MAXNFSDCNT;
+ } else
+ nfsdcnt = proposed;
+ nfsdcnt_set = 1;
+}
+
+static void
+usage(void)
+{
+ (void)fprintf(stderr, "%s", getopt_usage);
+ exit(1);
+}
+
+static void
+nonfs(__unused int signo)
+{
+ syslog(LOG_ERR, "missing system call: NFS not available");
+}
+
+static void
+reapchild(__unused int signo)
+{
+ pid_t pid;
+ int i;
+
+ while ((pid = wait3(NULL, WNOHANG, NULL)) > 0) {
+ for (i = 0; i < nfsdcnt; i++)
+ if (pid == children[i])
+ children[i] = -1;
+ }
+}
+
+static void
+unregistration(void)
+{
+ if ((nfs_minvers == NFS_VER2 && !rpcb_unset(NFS_PROGRAM, 2, NULL)) ||
+ (nfs_minvers <= NFS_VER3 && !rpcb_unset(NFS_PROGRAM, 3, NULL)))
+ syslog(LOG_ERR, "rpcb_unset failed");
+}
+
+static void
+killchildren(void)
+{
+ int i;
+
+ for (i = 0; i < nfsdcnt; i++) {
+ if (children[i] > 0)
+ kill(children[i], SIGKILL);
+ }
+}
+
+/*
+ * Cleanup master after SIGUSR1.
+ */
+static void
+cleanup(__unused int signo)
+{
+ nfsd_exit(0);
+}
+
+/*
+ * Cleanup child after SIGUSR1.
+ */
+static void
+child_cleanup(__unused int signo)
+{
+ exit(0);
+}
+
+static void
+nfsd_exit(int status)
+{
+ killchildren();
+ unregistration();
+ if (masterpidfh != NULL)
+ pidfile_remove(masterpidfh);
+ exit(status);
+}
+
+static int
+get_tuned_nfsdcount(void)
+{
+ int ncpu, error, tuned_nfsdcnt;
+ size_t ncpu_size;
+
+ ncpu_size = sizeof(ncpu);
+ error = sysctlbyname("hw.ncpu", &ncpu, &ncpu_size, NULL, 0);
+ if (error) {
+ warnx("sysctlbyname(hw.ncpu) failed, defaulting to %d nfs servers",
+ DEFNFSDCNT);
+ tuned_nfsdcnt = DEFNFSDCNT;
+ } else {
+ tuned_nfsdcnt = ncpu * 8;
+ }
+ return tuned_nfsdcnt;
+}
+
+static void
+start_server(int master, struct nfsd_nfsd_args *nfsdargp, const char *vhost)
+{
+ char principal[MAXHOSTNAMELEN + 5];
+ int status, error;
+ char hostname[MAXHOSTNAMELEN + 1], *cp;
+ struct addrinfo *aip, hints;
+
+ status = 0;
+ if (vhost == NULL)
+ gethostname(hostname, sizeof (hostname));
+ else
+ strlcpy(hostname, vhost, sizeof (hostname));
+ snprintf(principal, sizeof (principal), "nfs@%s", hostname);
+ if ((cp = strchr(hostname, '.')) == NULL ||
+ *(cp + 1) == '\0') {
+ /* If not fully qualified, try getaddrinfo() */
+ memset((void *)&hints, 0, sizeof (hints));
+ hints.ai_flags = AI_CANONNAME;
+ error = getaddrinfo(hostname, NULL, &hints, &aip);
+ if (error == 0) {
+ if (aip->ai_canonname != NULL &&
+ (cp = strchr(aip->ai_canonname, '.')) !=
+ NULL && *(cp + 1) != '\0')
+ snprintf(principal, sizeof (principal),
+ "nfs@%s", aip->ai_canonname);
+ freeaddrinfo(aip);
+ }
+ }
+ nfsdargp->principal = principal;
+
+ if (nfsdcnt_set)
+ nfsdargp->minthreads = nfsdargp->maxthreads = nfsdcnt;
+ else {
+ nfsdargp->minthreads = minthreads_set ? minthreads : get_tuned_nfsdcount();
+ nfsdargp->maxthreads = maxthreads_set ? maxthreads : nfsdargp->minthreads;
+ if (nfsdargp->maxthreads < nfsdargp->minthreads)
+ nfsdargp->maxthreads = nfsdargp->minthreads;
+ }
+ error = nfssvc(nfssvc_nfsd, nfsdargp);
+ if (error < 0 && errno == EAUTH) {
+ /*
+ * This indicates that it could not register the
+ * rpcsec_gss credentials, usually because the
+ * gssd daemon isn't running.
+ * (only the experimental server with nfsv4)
+ */
+ syslog(LOG_ERR, "No gssd, using AUTH_SYS only");
+ principal[0] = '\0';
+ error = nfssvc(nfssvc_nfsd, nfsdargp);
+ }
+ if (error < 0) {
+ if (errno == ENXIO) {
+ syslog(LOG_ERR, "Bad -p option, cannot run");
+ if (masterpid != 0 && master == 0)
+ kill(masterpid, SIGUSR1);
+ } else
+ syslog(LOG_ERR, "nfssvc: %m");
+ status = 1;
+ }
+ if (master)
+ nfsd_exit(status);
+ else
+ exit(status);
+}
+
+/*
+ * Open the stable restart file and return the file descriptor for it.
+ */
+static void
+open_stable(int *stable_fdp, int *backup_fdp)
+{
+ int stable_fd, backup_fd = -1, ret;
+ struct stat st, backup_st;
+
+ /* Open and stat the stable restart file. */
+ stable_fd = open(NFSD_STABLERESTART, O_RDWR, 0);
+ if (stable_fd < 0)
+ stable_fd = open(NFSD_STABLERESTART, O_RDWR | O_CREAT, 0600);
+ if (stable_fd >= 0) {
+ ret = fstat(stable_fd, &st);
+ if (ret < 0) {
+ close(stable_fd);
+ stable_fd = -1;
+ }
+ }
+
+ /* Open and stat the backup stable restart file. */
+ if (stable_fd >= 0) {
+ backup_fd = open(NFSD_STABLEBACKUP, O_RDWR, 0);
+ if (backup_fd < 0)
+ backup_fd = open(NFSD_STABLEBACKUP, O_RDWR | O_CREAT,
+ 0600);
+ if (backup_fd >= 0) {
+ ret = fstat(backup_fd, &backup_st);
+ if (ret < 0) {
+ close(backup_fd);
+ backup_fd = -1;
+ }
+ }
+ if (backup_fd < 0) {
+ close(stable_fd);
+ stable_fd = -1;
+ }
+ }
+
+ *stable_fdp = stable_fd;
+ *backup_fdp = backup_fd;
+ if (stable_fd < 0)
+ return;
+
+ /* Sync up the 2 files, as required. */
+ if (st.st_size > 0)
+ copy_stable(stable_fd, backup_fd);
+ else if (backup_st.st_size > 0)
+ copy_stable(backup_fd, stable_fd);
+}
+
+/*
+ * Copy the stable restart file to the backup or vice versa.
+ */
+static void
+copy_stable(int from_fd, int to_fd)
+{
+ int cnt, ret;
+ static char buf[1024];
+
+ ret = lseek(from_fd, (off_t)0, SEEK_SET);
+ if (ret >= 0)
+ ret = lseek(to_fd, (off_t)0, SEEK_SET);
+ if (ret >= 0)
+ ret = ftruncate(to_fd, (off_t)0);
+ if (ret >= 0)
+ do {
+ cnt = read(from_fd, buf, 1024);
+ if (cnt > 0)
+ ret = write(to_fd, buf, cnt);
+ else if (cnt < 0)
+ ret = cnt;
+ } while (cnt > 0 && ret >= 0);
+ if (ret >= 0)
+ ret = fsync(to_fd);
+ if (ret < 0)
+ syslog(LOG_ERR, "stable restart copy failure: %m");
+}
+
+/*
+ * Back up the stable restart file when indicated by the kernel.
+ */
+static void
+backup_stable(__unused int signo)
+{
+
+ if (stablefd >= 0)
+ copy_stable(stablefd, backupfd);
+}
+
+/*
+ * Parse the pNFS string and extract the DS servers and ports numbers.
+ */
+static void
+parse_dsserver(const char *optionarg, struct nfsd_nfsd_args *nfsdargp)
+{
+ char *cp, *cp2, *dsaddr, *dshost, *dspath, *dsvol, nfsprt[9];
+ char *mdspath, *mdsp, ip6[INET6_ADDRSTRLEN];
+ const char *ad;
+ int ecode;
+ u_int adsiz, dsaddrcnt, dshostcnt, dspathcnt, hostsiz, pathsiz;
+ u_int mdspathcnt;
+ size_t dsaddrsiz, dshostsiz, dspathsiz, nfsprtsiz, mdspathsiz;
+ struct addrinfo hints, *ai_tcp, *res;
+ struct sockaddr_in sin;
+ struct sockaddr_in6 sin6;
+
+ cp = strdup(optionarg);
+ if (cp == NULL)
+ errx(1, "Out of memory");
+
+ /* Now, do the host names. */
+ dspathsiz = 1024;
+ dspathcnt = 0;
+ dspath = malloc(dspathsiz);
+ if (dspath == NULL)
+ errx(1, "Out of memory");
+ dshostsiz = 1024;
+ dshostcnt = 0;
+ dshost = malloc(dshostsiz);
+ if (dshost == NULL)
+ errx(1, "Out of memory");
+ dsaddrsiz = 1024;
+ dsaddrcnt = 0;
+ dsaddr = malloc(dsaddrsiz);
+ if (dsaddr == NULL)
+ errx(1, "Out of memory");
+ mdspathsiz = 1024;
+ mdspathcnt = 0;
+ mdspath = malloc(mdspathsiz);
+ if (mdspath == NULL)
+ errx(1, "Out of memory");
+
+ /* Put the NFS port# in "." form. */
+ snprintf(nfsprt, 9, ".%d.%d", 2049 >> 8, 2049 & 0xff);
+ nfsprtsiz = strlen(nfsprt);
+
+ ai_tcp = NULL;
+ /* Loop around for each DS server name. */
+ do {
+ cp2 = strchr(cp, ',');
+ if (cp2 != NULL) {
+ /* Not the last DS in the list. */
+ *cp2++ = '\0';
+ if (*cp2 == '\0')
+ usage();
+ }
+
+ dsvol = strchr(cp, ':');
+ if (dsvol == NULL || *(dsvol + 1) == '\0')
+ usage();
+ *dsvol++ = '\0';
+
+ /* Optional path for MDS file system to be stored on DS. */
+ mdsp = strchr(dsvol, '#');
+ if (mdsp != NULL) {
+ if (*(mdsp + 1) == '\0' || mdsp <= dsvol)
+ usage();
+ *mdsp++ = '\0';
+ }
+
+ /* Append this pathname to dspath. */
+ pathsiz = strlen(dsvol);
+ if (dspathcnt + pathsiz + 1 > dspathsiz) {
+ dspathsiz *= 2;
+ dspath = realloc(dspath, dspathsiz);
+ if (dspath == NULL)
+ errx(1, "Out of memory");
+ }
+ strcpy(&dspath[dspathcnt], dsvol);
+ dspathcnt += pathsiz + 1;
+
+ /* Append this pathname to mdspath. */
+ if (mdsp != NULL)
+ pathsiz = strlen(mdsp);
+ else
+ pathsiz = 0;
+ if (mdspathcnt + pathsiz + 1 > mdspathsiz) {
+ mdspathsiz *= 2;
+ mdspath = realloc(mdspath, mdspathsiz);
+ if (mdspath == NULL)
+ errx(1, "Out of memory");
+ }
+ if (mdsp != NULL)
+ strcpy(&mdspath[mdspathcnt], mdsp);
+ else
+ mdspath[mdspathcnt] = '\0';
+ mdspathcnt += pathsiz + 1;
+
+ if (ai_tcp != NULL)
+ freeaddrinfo(ai_tcp);
+
+ /* Get the fully qualified domain name and IP address. */
+ memset(&hints, 0, sizeof(hints));
+ hints.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
+ hints.ai_family = PF_UNSPEC;
+ hints.ai_socktype = SOCK_STREAM;
+ hints.ai_protocol = IPPROTO_TCP;
+ ecode = getaddrinfo(cp, NULL, &hints, &ai_tcp);
+ if (ecode != 0)
+ errx(1, "getaddrinfo pnfs: %s %s", cp,
+ gai_strerror(ecode));
+ ad = NULL;
+ for (res = ai_tcp; res != NULL; res = res->ai_next) {
+ if (res->ai_addr->sa_family == AF_INET) {
+ if (res->ai_addrlen < sizeof(sin))
+ err(1, "getaddrinfo() returned "
+ "undersized IPv4 address");
+ /*
+ * Mips cares about sockaddr_in alignment,
+ * so copy the address.
+ */
+ memcpy(&sin, res->ai_addr, sizeof(sin));
+ ad = inet_ntoa(sin.sin_addr);
+ break;
+ } else if (res->ai_family == AF_INET6) {
+ if (res->ai_addrlen < sizeof(sin6))
+ err(1, "getaddrinfo() returned "
+ "undersized IPv6 address");
+ /*
+ * Mips cares about sockaddr_in6 alignment,
+ * so copy the address.
+ */
+ memcpy(&sin6, res->ai_addr, sizeof(sin6));
+ ad = inet_ntop(AF_INET6, &sin6.sin6_addr, ip6,
+ sizeof(ip6));
+
+ /*
+ * XXX
+ * Since a link local address will only
+ * work if the client and DS are in the
+ * same scope zone, only use it if it is
+ * the only address.
+ */
+ if (ad != NULL &&
+ !IN6_IS_ADDR_LINKLOCAL(&sin6.sin6_addr))
+ break;
+ }
+ }
+ if (ad == NULL)
+ errx(1, "No IP address for %s", cp);
+
+ /* Append this address to dsaddr. */
+ adsiz = strlen(ad);
+ if (dsaddrcnt + adsiz + nfsprtsiz + 1 > dsaddrsiz) {
+ dsaddrsiz *= 2;
+ dsaddr = realloc(dsaddr, dsaddrsiz);
+ if (dsaddr == NULL)
+ errx(1, "Out of memory");
+ }
+ strcpy(&dsaddr[dsaddrcnt], ad);
+ strcat(&dsaddr[dsaddrcnt], nfsprt);
+ dsaddrcnt += adsiz + nfsprtsiz + 1;
+
+ /* Append this hostname to dshost. */
+ hostsiz = strlen(ai_tcp->ai_canonname);
+ if (dshostcnt + hostsiz + 1 > dshostsiz) {
+ dshostsiz *= 2;
+ dshost = realloc(dshost, dshostsiz);
+ if (dshost == NULL)
+ errx(1, "Out of memory");
+ }
+ strcpy(&dshost[dshostcnt], ai_tcp->ai_canonname);
+ dshostcnt += hostsiz + 1;
+
+ cp = cp2;
+ } while (cp != NULL);
+
+ nfsdargp->addr = dsaddr;
+ nfsdargp->addrlen = dsaddrcnt;
+ nfsdargp->dnshost = dshost;
+ nfsdargp->dnshostlen = dshostcnt;
+ nfsdargp->dspath = dspath;
+ nfsdargp->dspathlen = dspathcnt;
+ nfsdargp->mdspath = mdspath;
+ nfsdargp->mdspathlen = mdspathcnt;
+ freeaddrinfo(ai_tcp);
+}
+
diff --git a/usr.sbin/nfsd/nfsv4.4 b/usr.sbin/nfsd/nfsv4.4
new file mode 100644
index 000000000000..e96e507e23ad
--- /dev/null
+++ b/usr.sbin/nfsd/nfsv4.4
@@ -0,0 +1,379 @@
+.\" Copyright (c) 2009 Rick Macklem, University of Guelph
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd January 8, 2024
+.Dt NFSV4 4
+.Os
+.Sh NAME
+.Nm NFSv4
+.Nd NFS Version 4 Protocol
+.Sh DESCRIPTION
+The NFS client and server provide support for the
+.Tn NFSv4
+specification; see
+.%T "Network File System (NFS) Version 4 Protocol RFC 7530" ,
+.%T "Network File System (NFS) Version 4 Minor Version 1 Protocol RFC 5661" ,
+.%T "Network File System (NFS) Version 4 Minor Version 2 Protocol RFC 7862" ,
+.%T "File System Extended Attributes in NFSv4 RFC 8276" and
+.%T "Parallel NFS (pNFS) Flexible File Layout RFC 8435" .
+The protocol is somewhat similar to NFS Version 3, but differs in significant
+ways.
+It uses a single compound RPC that concatenates operations together.
+Each of these operations is similar to the RPCs of NFS Version 3.
+The operations in the compound are performed in order, until one of
+them fails (returns an error) and then the RPC terminates at that point.
+.Pp
+It has
+integrated locking support, which implies that the server is no longer
+stateless.
+As such, the
+.Nm
+server remains in recovery mode for a grace period (always greater than the
+lease duration the server uses) after a reboot.
+During this grace period, clients may recover state but not perform other
+open/lock state changing operations.
+To provide for correct recovery semantics, a small file described by
+.Xr stablerestart 5
+is used by the server during the recovery phase.
+If this file is missing or empty, there is a backup copy maintained by
+.Xr nfsd 8
+that will be used.
+If either file is missing, they will be created by the
+.Xr nfsd 8 .
+If both the file and the backup copy are empty,
+the server will start without providing a grace period
+for recovery.
+Note that recovery only occurs when the server
+machine is rebooted, not when the
+.Xr nfsd 8
+daemons are just restarted.
+.Pp
+It provides several optional features not present in NFS Version 3:
+.sp
+.Bd -literal -offset indent -compact
+- NFS Version 4 ACLs
+- Referrals, which redirect subtrees to other servers
+ (not yet implemented)
+- Delegations, which allow a client to operate on a file locally
+- pNFS, where I/O operations are separated from Metadata operations
+And for NFSv4.2 only
+- User namespace extended attributes
+- lseek(SEEK_DATA/SEEK_HOLE)
+- File copying done locally on the server for copy_file_range(2)
+- posix_fallocate(2)
+- posix_fadvise(POSIX_FADV_WILLNEED/POSIX_FADV_DONTNEED)
+.Ed
+.Pp
+The
+.Nm
+protocol does not use a separate mount protocol and assumes that the
+server provides a single file system tree structure, rooted at the point
+in the local file system tree specified by one or more
+.sp 1
+.Bd -literal -offset indent -compact
+V4: <rootdir> [-sec=secflavors] [host(s) or net]
+.Ed
+.sp 1
+line(s) in the
+.Xr exports 5
+file.
+(See
+.Xr exports 5
+for details.)
+The
+.Xr nfsd 8
+allows a limited subset of operations to be performed on non-exported subtrees
+of the local file system, so that traversal of the tree to the exported
+subtrees is possible.
+As such, the ``<rootdir>'' can be in a non-exported file system.
+The exception is ZFS, which checks exports and, as such, all ZFS file systems
+below the ``<rootdir>'' must be exported.
+However,
+the entire tree that is rooted at that point must be in local file systems
+that are of types that can be NFS exported.
+Since the
+.Nm
+file system is rooted at ``<rootdir>'', setting this to anything other
+than ``/'' will result in clients being required to use different mount
+paths for
+.Nm
+than for NFS Version 2 or 3.
+Unlike NFS Version 2 and 3, Version 4 allows a client mount to span across
+multiple server file systems, although not all clients are capable of doing
+this.
+.Pp
+.Nm
+uses strings for users and groups instead of numbers.
+On the wire, these strings can either have the numbers in the string or
+take the form:
+.sp
+.Bd -literal -offset indent -compact
+<user>@<dns.domain>
+.Ed
+.sp
+where ``<dns.domain>'' is not the same as the DNS domain used
+for host name lookups, but is usually set to the same string.
+Most systems set this ``<dns.domain>''
+to the domain name part of the machine's
+.Xr hostname 1
+by default.
+However, this can normally be overridden by a command line
+option or configuration file for the daemon used to do the name<->number
+mapping.
+Under
+.Fx ,
+the mapping daemon is called
+.Xr nfsuserd 8
+and has a command line option that overrides the domain component of the
+machine's hostname.
+For use of this form of string on
+.Nm ,
+either client or server, this daemon must be running.
+.Pp
+The form where the numbers are in the strings can only be used for AUTH_SYS.
+To configure your systems this way, the
+.Xr nfsuserd 8
+daemon does not need to be running on the server, but the following sysctls
+need to be set to 1 on the server.
+.sp
+.Bd -literal -offset indent -compact
+vfs.nfs.enable_uidtostring
+vfs.nfsd.enable_stringtouid
+.Ed
+.sp
+On the client, the sysctl
+.sp
+.Bd -literal -offset indent -compact
+vfs.nfs.enable_uidtostring
+.Ed
+.sp
+must be set to 1 and the
+.Xr nfsuserd 8
+daemon does not need to be running.
+.Pp
+If these strings are not configured correctly, ``ls -l'' will typically
+report a lot of ``nobody'' and ``nogroup'' ownerships.
+.Pp
+Although uid/gid numbers are no longer used in the
+.Nm
+protocol except optionally in the above strings, they will still be in the RPC
+authentication fields when using AUTH_SYS (sec=sys), which is the default.
+As such, in this case both the user/group name and number spaces must
+be consistent between the client and server.
+.Pp
+However, if you run
+.Nm
+with RPCSEC_GSS (sec=krb5, krb5i, krb5p), only names and KerberosV tickets
+will go on the wire.
+.Sh SERVER SETUP
+To set up the NFS server that supports
+.Nm ,
+you will need to set the variables in
+.Xr rc.conf 5
+as follows:
+.sp
+.Bd -literal -offset indent -compact
+nfs_server_enable="YES"
+nfsv4_server_enable="YES"
+.Ed
+.sp
+plus
+.sp
+.Bd -literal -offset indent -compact
+nfsuserd_enable="YES"
+.Ed
+.sp
+if the server is using the ``<user>@<domain>'' form of user/group strings or
+is using the ``-manage-gids'' option for
+.Xr nfsuserd 8 .
+.Pp
+In addition, you can set:
+.sp
+.Bd -literal -offset indent -compact
+nfsv4_server_only="YES"
+.Ed
+.sp
+to disable support for NFSv2 and NFSv3.
+.Pp
+You will also need to add at least one ``V4:'' line to the
+.Xr exports 5
+file for
+.Nm
+to work.
+.Pp
+If the file systems you are exporting are only being accessed via
+.Nm
+there are a couple of
+.Xr sysctl 8
+variables that you can change, which might improve performance.
+.Bl -tag -width Ds
+.It Cm vfs.nfsd.issue_delegations
+when set non-zero, allows the server to issue Open Delegations to
+clients.
+These delegations permit the client to manipulate the file
+locally on the client.
+Unfortunately, at this time, client use of
+delegations is limited, so performance gains may not be observed.
+This can only be enabled when the file systems being exported to
+.Nm
+clients are not being accessed locally on the server and, if being
+accessed via NFS Version 2 or 3 clients, these clients cannot be
+using the NLM.
+.It Cm vfs.nfsd.enable_locallocks
+can be set to 0 to disable acquisition of local byte range locks.
+Disabling local locking can only be done if neither local accesses
+to the exported file systems nor the NLM is operating on them.
+.El
+.sp
+Note that Samba server access would be considered ``local access'' for the above
+discussion.
+.Pp
+To build a kernel with the NFS server that supports
+.Nm
+linked into it, the
+.sp
+.Bd -literal -offset indent -compact
+options NFSD
+.Ed
+.sp
+must be specified in the kernel's
+.Xr config 5
+file.
+.Sh CLIENT MOUNTS
+To do an
+.Nm
+mount, specify the ``nfsv4'' option on the
+.Xr mount_nfs 8
+command line.
+This will force use of the client that supports
+.Nm
+plus set ``tcp'' and
+.Nm .
+.Pp
+The
+.Xr nfsuserd 8
+must be running if name<->uid/gid mapping is being used, as above.
+Also, since an
+.Nm
+mount uses the host uuid to identify the client uniquely to the server,
+you cannot safely do an
+.Nm
+mount when
+.sp
+.Bd -literal -offset indent -compact
+hostid_enable="NO"
+.Ed
+.sp
+is set in
+.Xr rc.conf 5 .
+.sp
+If the
+.Nm
+server that is being mounted on supports delegations, you can start the
+.Xr nfscbd 8
+daemon to handle client side callbacks.
+This will occur if
+.sp
+.Bd -literal -offset indent -compact
+nfsuserd_enable="YES" <-- If name<->uid/gid mapping is being used.
+nfscbd_enable="YES"
+.Ed
+.sp
+are set in
+.Xr rc.conf 5 .
+.sp
+Without a functioning callback path, a server will never issue Delegations
+to a client.
+.sp
+For NFSv4.0, by default, the callback address will be set to the IP address
+acquired via
+.Fn rtalloc
+in the kernel and port# 7745.
+To override the default port#, a command line option for
+.Xr nfscbd 8
+can be used.
+.sp
+To get callbacks to work when behind a NAT gateway, a port for the callback
+service will need to be set up on the NAT gateway and then the address
+of the NAT gateway (host IP plus port#) will need to be set by assigning the
+.Xr sysctl 8
+variable vfs.nfs.callback_addr to a string of the form:
+.sp
+N.N.N.N.N.N
+.sp
+where the first 4 Ns are the host IP address and the last two are the
+port# in network byte order (all decimal #s in the range 0-255).
+.Pp
+For NFSv4.1 and NFSv4.2, the callback path (called a backchannel) uses the
+same TCP connection as the mount, so none of the above applies and callbacks
+should work through gateways without any issues.
+.Pp
+To build a kernel with the client that supports
+.Nm
+linked into it, the option
+.sp
+.Bd -literal -offset indent -compact
+options NFSCL
+.Ed
+.sp
+must be specified in the kernel's
+.Xr config 5
+file.
+.Pp
+Options can be specified for the
+.Xr nfsuserd 8
+and
+.Xr nfscbd 8
+daemons at boot time via the ``nfsuserd_flags'' and ``nfscbd_flags''
+.Xr rc.conf 5
+variables.
+.Pp
+NFSv4 mount(s) against exported volume(s) on the same host are not recommended,
+since this can result in a hung NFS server.
+It occurs when an nfsd thread tries to do an NFSv4
+.Fn VOP_RECLAIM
+/ Close RPC as part of acquiring a new vnode.
+If all other nfsd threads are blocked waiting for lock(s) held by this nfsd
+thread, then there is no nfsd thread to service the Close RPC.
+.Sh FILES
+.Bl -tag -width /var/db/nfs-stablerestart.bak -compact
+.It Pa /var/db/nfs-stablerestart
+NFS V4 stable restart file
+.It Pa /var/db/nfs-stablerestart.bak
+backup copy of the file
+.El
+.Sh SEE ALSO
+.Xr stablerestart 5 ,
+.Xr mountd 8 ,
+.Xr nfscbd 8 ,
+.Xr nfsd 8 ,
+.Xr nfsdumpstate 8 ,
+.Xr nfsrevoke 8 ,
+.Xr nfsuserd 8
+.Sh BUGS
+At this time, there is no recall of delegations for local file system
+operations.
+As such, delegations should only be enabled for file systems
+that are being used solely as NFS export volumes and are not being accessed
+via local system calls nor services such as Samba.
diff --git a/usr.sbin/nfsd/pnfs.4 b/usr.sbin/nfsd/pnfs.4
new file mode 100644
index 000000000000..babd221a6d5a
--- /dev/null
+++ b/usr.sbin/nfsd/pnfs.4
@@ -0,0 +1,228 @@
+.\" Copyright (c) 2017 Rick Macklem
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd December 20, 2019
+.Dt PNFS 4
+.Os
+.Sh NAME
+.Nm pNFS
+.Nd NFS Version 4.1 and 4.2 Parallel NFS Protocol
+.Sh DESCRIPTION
+The NFSv4.1 and NFSv4.2 client and server provide support for the
+.Tn pNFS
+specification; see
+.%T "Network File System (NFS) Version 4 Minor Version 1 Protocol RFC 5661" ,
+.%T "Network File System (NFS) Version 4 Minor Version 2 Protocol RFC 7862" and
+.%T "Parallel NFS (pNFS) Flexible File Layout RFC 8435" .
+A pNFS service separates Read/Write operations from all other NFSv4.1 and
+NFSv4.2 operations, which are referred to as Metadata operations.
+The Read/Write operations are performed directly on the Data Server (DS)
+where the file's data resides, bypassing the NFS server.
+All other file operations are performed on the NFS server, which is referred to
+as a Metadata Server (MDS).
+NFS clients that do not support
+.Tn pNFS
+perform Read/Write operations on the MDS, which acts as a proxy for the
+appropriate DS(s).
+.Pp
+The NFSv4.1 and NFSv4.2 protocols provide two pieces of information to pNFS
+aware clients that allow them to perform Read/Write operations directly on
+the DS.
+.Pp
+The first is DeviceInfo, which is static information defining the DS
+server.
+The critical piece of information in DeviceInfo for the layout types
+supported by
+.Fx
+is the IP address that is used to perform RPCs on the DS.
+It also indicates which version of NFS the DS supports, I/O size and other
+layout specific information.
+In the DeviceInfo, there is a DeviceID which, for the
+.Fx
+server, is unique to the DS configuration
+and changes whenever the
+.Xr nfsd 8
+daemon is restarted or the server is rebooted.
+.Pp
+The second is the layout, which is per file and references the DeviceInfo
+to use via the DeviceID.
+It is for a byte range of a file and is either Read or Read/Write.
+For the
+.Fx
+server, a layout covers all bytes of a file.
+A layout may be recalled by the MDS using a LayoutRecall callback.
+When a client returns a layout via the LayoutReturn operation it can
+indicate that error(s) were encountered while doing I/O on the DS,
+at least for certain layout types such as the Flexible File Layout.
+.Pp
+The
+.Fx
+client and server support two layout types.
+.Pp
+The File Layout is described in RFC5661 and uses the NFSv4.1 or NFSv4.2 protocol
+to perform I/O on the DS.
+It does not support client aware DS mirroring and, as such,
+the
+.Fx
+server only provides File Layout support for non-mirrored
+configurations.
+.Pp
+The Flexible File Layout allows the use of the NFSv3, NFSv4.0, NFSv4.1 or
+NFSv4.2 protocol to perform I/O on the DS and does support client aware
+mirroring.
+As such, the
+.Fx
+server uses Flexible File Layout layouts for the
+mirrored DS configurations.
+The
+.Fx
+server supports the
+.Dq tightly coupled
+variant and all DSs allow use of the
+NFSv4.2 or NFSv4.1 protocol for I/O operations.
+Clients that support the Flexible File Layout will do writes and commits
+to all DS mirrors in the mirror set.
+.Pp
+A
+.Fx
+pNFS service consists of a single MDS server plus one or more
+DS servers, all of which are
+.Fx
+systems.
+For a non-mirrored configuration, the
+.Fx
+server will issue File Layout
+layouts by default.
+However that default can be set to the Flexible File Layout by setting the
+.Xr sysctl 8
+sysctl
+.Dq vfs.nfsd.default_flexfile
+to one.
+Mirrored server configurations will only issue Flexible File Layouts.
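+For example, this
+.Xr sysctl.conf 5
+line would make a non-mirrored server issue Flexible File Layouts:
+.Bd -literal -offset indent
+vfs.nfsd.default_flexfile=1
+.Ed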
+.Tn pNFS
+clients mount the MDS as they would a single NFS server.
+.Pp
+A
+.Fx
+.Tn pNFS
+client must be running the
+.Xr nfscbd 8
+daemon and use the mount options
+.Dq nfsv4,minorversion=2,pnfs
+or
+.Dq nfsv4,minorversion=1,pnfs .
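+.Pp
+For example, such a mount might be done as follows, where the MDS host
+name and mount point are hypothetical:
+.Bd -literal -offset indent
+# mount -t nfs -o nfsv4,minorversion=2,pnfs mds-host:/ /mnt
+.Ed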
+.Pp
+When files are created, the MDS creates a file tree identical to what a
+single NFS server creates, except that all the regular (VREG) files will
+be empty.
+As such, if you look at the exported tree directly on the MDS server
+(not via an NFS mount), the files will all be of size zero.
+Each of these files will also have two extended attributes in the system
+attribute name space:
+.Bd -literal -offset indent
+pnfsd.dsfile - This extended attribute stores the information that the
+ MDS needs to find the data file on a DS(s) for this file.
+pnfsd.dsattr - This extended attribute stores the Size, AccessTime,
+ ModifyTime, Change and SpaceUsed attributes for the file.
+.Ed
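+.Pp
+These extended attributes can be inspected on the MDS via
+.Xr getextattr 8 .
+For example, for a hypothetical file named file1:
+.Bd -literal -offset indent
+# getextattr -qx system pnfsd.dsfile file1
+.Ed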
+.Pp
+For each regular (VREG) file, the MDS creates a data file on one
+(or on N of them for the mirrored case, where N is the mirror_level)
+of the DS(s) where the file's data will be stored.
+The name of this file is
+the file handle of the file on the MDS in hexadecimal at time of file creation.
+The data file will have the same file ownership, mode and NFSv4 ACL
+(if ACLs are enabled for the file system) as the file on the MDS, so that
+permission checking can be done on the DS.
+This is referred to as
+.Dq tightly coupled
+for the Flexible File Layout.
+.Pp
+For
+.Tn pNFS
+aware clients, the service generates File Layout
+or Flexible File Layout
+layouts and associated DeviceInfo.
+For non-pNFS aware NFS clients, the pNFS service appears just like a normal
+NFS service.
+For these clients, the MDS will perform I/O operations on the
+appropriate DS(s), acting as a proxy.
+This is also true for NFSv3 and NFSv4.0 mounts, since these are always non-pNFS
+aware.
+.Pp
+It is possible to assign a DS to an MDS exported file system so that it will
+store data for files on the MDS exported file system.
+If a DS is not assigned to an MDS exported file system, it will store data
+for files on all exported file systems on the MDS.
+.Pp
+If mirroring is enabled, the pNFS service will continue to function when
+DS(s) have failed, so long as there is at least one DS still operational
+that stores data for files on all of the MDS exported file systems.
+After a disabled mirrored DS is repaired, it is possible to recover the DS
+as a mirror while the pNFS service continues to function.
+.Pp
+See
+.Xr pnfsserver 4
+for information on how to set up a
+.Fx
+pNFS service.
+.Sh SEE ALSO
+.Xr nfsv4 4 ,
+.Xr pnfsserver 4 ,
+.Xr exports 5 ,
+.Xr fstab 5 ,
+.Xr rc.conf 5 ,
+.Xr nfscbd 8 ,
+.Xr nfsd 8 ,
+.Xr nfsuserd 8 ,
+.Xr pnfsdscopymr 8 ,
+.Xr pnfsdsfile 8 ,
+.Xr pnfsdskill 8
+.Sh BUGS
+Linux kernel versions prior to 4.12 only support NFSv3 DSs in their client
+and will do all I/O through the MDS.
+For Linux 4.12 kernels, support for NFSv4.1 DSs was added, but I have seen
+Linux client crashes when testing this client.
+For Linux 4.17-rc2 kernels, I have not seen client crashes during testing,
+but it only supports the
+.Dq loosely coupled
+variant.
+To make it work correctly when mounting the
+.Fx
+server, you must
+set the sysctl
+.Dq vfs.nfsd.flexlinuxhack
+to one so that it works around
+the Linux client driver's limitations.
+Without this sysctl being set, there will be access errors, since the Linux
+client will use the authenticator in the layout (uid=999, gid=999) and not
+the authenticator specified in the RPC header.
+.Pp
+Linux 5.n kernels appear to be patched so that they use the authenticator
+in the RPC header and, as such, the above sysctl should not need to be set.
+.Pp
+Since the MDS cannot be mirrored, it is a single point of failure just
+as a non
+.Tn pNFS
+server is.
diff --git a/usr.sbin/nfsd/pnfsserver.4 b/usr.sbin/nfsd/pnfsserver.4
new file mode 100644
index 000000000000..7a2ddc4e85c0
--- /dev/null
+++ b/usr.sbin/nfsd/pnfsserver.4
@@ -0,0 +1,444 @@
+.\" Copyright (c) 2018 Rick Macklem
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd December 20, 2019
+.Dt PNFSSERVER 4
+.Os
+.Sh NAME
+.Nm pNFSserver
+.Nd NFS Version 4.1 and 4.2 Parallel NFS Protocol Server
+.Sh DESCRIPTION
+A set of
+.Fx
+servers may be configured to provide a
+.Xr pnfs 4
+service.
+One
+.Fx
+system needs to be configured as a MetaData Server (MDS) and
+at least one additional
+.Fx
+system needs to be configured as one or
+more Data Servers (DSs).
+.Pp
+These
+.Fx
+systems are configured to be NFSv4.1 and NFSv4.2
+servers, see
+.Xr nfsd 8
+and
+.Xr exports 5
+if you are not familiar with configuring an NFSv4.n server.
+All DS(s) and the MDS should support NFSv4.2 as well as NFSv4.1.
+Mixing an MDS that supports NFSv4.2 with any DS(s) that do not support
+NFSv4.2 will not work correctly.
+As such, all DS(s) must be upgraded from
+.Fx 12
+to
+.Fx 13
+before upgrading the MDS.
+.Sh DS server configuration
+The DS(s) need to be configured as NFSv4.1 and NFSv4.2 server(s),
+with a top level exported
+directory used for storage of data files.
+This directory must be owned by
+.Dq root
+and would normally have a mode of
+.Dq 700 .
+Within this directory there needs to be additional directories named
+ds0,...,dsN (where N is 19 by default) also owned by
+.Dq root
+with mode
+.Dq 700 .
+These are the directories where the data files are stored.
+The following command can be run by root when in the top level exported
+directory to create these subdirectories.
+.Bd -literal -offset indent
+jot -w ds 20 0 | xargs mkdir -m 700
+.Ed
+.sp
+Note that
+.Dq 20
+is the default and can be set to a larger value on the MDS as shown below.
+.sp
+The top level exported directory used for storage of data files must be
+exported to the MDS with the
+.Dq maproot=root sec=sys
+export options so that the MDS can create entries in these subdirectories.
+It must also be exported to all pNFS aware clients, but these clients do
+not require the
+.Dq maproot=root
+export option and this directory should be exported to them with the same
+options as used by the MDS to export file system(s) to the clients.
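+For example, the
+.Xr exports 5
+file on a DS might include lines like the following, where the MDS host
+name and client network are hypothetical:
+.Bd -literal -offset indent
+/ds -maproot=root -sec=sys mds-host
+/ds -sec=sys -network 192.168.1.0 -mask 255.255.255.0
+.Ed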
+.Pp
+It is possible to have multiple DSs on the same
+.Fx
+system, but each
+of these DSs must have a separate top level exported directory used for storage
+of data files and each
+of these DSs must be mountable via a separate IP address.
+Alias addresses can be set on the DS server system for a network
+interface via
+.Xr ifconfig 8
+to create these different IP addresses.
+Multiple DSs on the same server may be useful when data for different file systems
+on the MDS are being stored on different file system volumes on the
+.Fx
+DS system.
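+.Pp
+For example, an alias address for a second DS on the same system might be
+configured as follows, where the interface name and address are hypothetical:
+.Bd -literal -offset indent
+# ifconfig em0 alias 192.168.1.51/32
+.Ed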
+.Sh MDS server configuration
+The MDS must be a separate
+.Fx
+system from the
+.Fx
+DS system(s) and
+NFS clients.
+It is configured as an NFSv4.1 and NFSv4.2 server with
+file system(s) exported to clients.
+However, the
+.Dq -p
+command line argument for
+.Xr nfsd 8
+is used to indicate that it is running as the MDS for a pNFS server.
+.Pp
+The DS(s) must all be mounted on the MDS using the following mount options:
+.Bd -literal -offset indent
+nfsv4,minorversion=2,soft,retrans=2
+.Ed
+.sp
+so that they can be defined as DSs in the
+.Dq -p
+option.
+Normally these mounts would be entered in the
+.Xr fstab 5
+on the MDS.
+For example, if there are four DSs named nfsv4-data[0-3], the
+.Xr fstab 5
+lines might look like:
+.Bd -literal -offset indent
+nfsv4-data0:/ /data0 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+nfsv4-data1:/ /data1 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+nfsv4-data2:/ /data2 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+nfsv4-data3:/ /data3 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+.Ed
+.sp
+The
+.Xr nfsd 8
+command line option
+.Dq -p
+indicates that the NFS server is a pNFS MDS and specifies what
+DSs are to be used.
+.br
+For the above
+.Xr fstab 5
+example, the
+.Xr nfsd 8
+nfs_server_flags line in your
+.Xr rc.conf 5
+might look like:
+.Bd -literal -offset indent
+nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3"
+.Ed
+.sp
+This example specifies that the data files should be distributed over the
+four DSs and File layouts will be issued to pNFS enabled clients.
+If issuing Flexible File layouts is desired for this case, setting the sysctl
+.Dq vfs.nfsd.default_flexfile
+non-zero in your
+.Xr sysctl.conf 5
+file will make the
+.Nm
+do that.
+.br
+Alternately, this variant of
+.Dq nfs_server_flags
+will specify that two way mirroring is to be done, via the
+.Dq -m
+command line option.
+.Bd -literal -offset indent
+nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3 -m 2"
+.Ed
+.sp
+With two way mirroring, the data file for each exported file on the MDS
+will be stored on two of the DSs.
+When mirroring is enabled, the server will always issue Flexible File layouts.
+.Pp
+It is also possible to specify which DSs are to be used to store data files for
+specific exported file systems on the MDS.
+For example, if the MDS has exported two file systems
+.Dq /export1
+and
+.Dq /export2
+to clients, the following variant of
+.Dq nfs_server_flags
+will specify that data files for
+.Dq /export1
+will be stored on nfsv4-data0 and nfsv4-data1, whereas the data files for
+.Dq /export2
+will be stored on nfsv4-data2 and nfsv4-data3.
+.Bd -literal -offset indent
+nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export1,nfsv4-data2:/data2#/export2,nfsv4-data3:/data3#/export2"
+.Ed
+.sp
+This can be used by system administrators to control where data files are
+stored and might be useful for control of storage use.
+For this case, it may be convenient to co-locate more than one of the DSs
+on the same
+.Fx
+server, using separate file systems on the DS system
+for storage of the respective DS's data files.
+If mirroring is desired for this case, the
+.Dq -m
+option also needs to be specified.
+There must be enough DSs assigned to each exported file system on the MDS
+to support the level of mirroring.
+The above example would be fine for two way mirroring, but four way mirroring
+would not work, since there are only two DSs assigned to each exported file
+system on the MDS.
+.Pp
+The number of subdirectories in each DS is defined by the
+.Dq vfs.nfsd.dsdirsize
+sysctl on the MDS.
+This value can be increased from the default of 20, but only when the
+.Xr nfsd 8
+is not running and after the additional ds20,... subdirectories have been
+created on all the DSs.
+For a service that will store a large number of files, this sysctl should be
+set much larger, to keep the number of entries in a subdirectory from
+getting too large.
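+.Pp
+For example, to prepare each DS for a value of 100 (a hypothetical choice),
+the additional subdirectories ds20,...,ds99 could be created by running the
+following as root in the DS's top level exported directory:
+.Bd -literal -offset indent
+# jot -w ds 80 20 | xargs mkdir -m 700
+.Ed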
+.Sh Client mounts
+Once operational, NFSv4.1 or NFSv4.2
+.Fx
+client mounts
+done with the
+.Dq pnfs
+option should do I/O directly on the DSs.
+The clients mounting the MDS must be running the
+.Xr nfscbd 8
+daemon for pNFS to work.
+Set
+.Bd -literal -offset indent
+nfscbd_enable="YES"
+.Ed
+.sp
+in the
+.Xr rc.conf 5
+on these clients.
+Non-pNFS aware clients or NFSv3 mounts will do all I/O RPCs on the MDS,
+which acts as a proxy for the appropriate DS(s).
+.Sh Backing up a pNFS service
+Since the data is separated from the metadata, the simple way to back up
+a pNFS service is to do so from an NFS client that has the service mounted
+on it.
+If you back up the MDS exported file system(s) on the MDS, you must do it
+in such a way that the
+.Dq system
+namespace extended attributes get backed up.
+.Sh Handling of failed mirrored DSs
+When a mirrored DS fails, it can be disabled in one of three ways:
+.sp
+1 - The MDS detects a problem when trying to do proxy
+operations on the DS.
+This can take a couple of minutes
+after the DS failure or network partitioning occurs.
+.sp
+2 - A pNFS client can report an I/O error that occurred for a DS to the MDS in
+the arguments for a LayoutReturn operation.
+.sp
+3 - The system administrator can perform the
+.Xr pnfsdskill 8
+command on the MDS
+to disable it.
+If the
+.Xr pnfsdskill 8
+command fails with ENXIO
+(Device not configured), that normally means the DS was already
+disabled via #1 or #2.
+Since doing this is harmless, once a system administrator knows that
+there is a problem with a mirrored DS, doing the command is recommended.
+.sp
+Once a system administrator knows that a mirrored DS has malfunctioned
+or has been network partitioned, they should do the following as root/su
+on the MDS:
+.Bd -literal -offset indent
+# pnfsdskill <mounted-on-path-of-DS>
+# umount -N <mounted-on-path-of-DS>
+.Ed
+.sp
+Note that the <mounted-on-path-of-DS> must be the exact mounted-on path
+string used when the DS was mounted on the MDS.
+.Pp
+Once the mirrored DS has been disabled, the pNFS service should continue to
+function, but file updates will only happen on the DS(s) that have not been disabled.
+Assuming two way mirroring, updates for files stored on the disabled DS
+will only happen on the other DS of the pair listed in the
+.Dq pnfsd.dsfile
+extended attribute for the file on the MDS.
+.Pp
+The next step is to clear the IP address in the
+.Dq pnfsd.dsfile
+extended attribute on all files on the MDS for the failed DS.
+This is done so that, when the disabled DS is repaired and brought back online,
+the data files on this DS will not be used, since they may be out of date.
+The command that clears the IP address is
+.Xr pnfsdsfile 8
+with the
+.Dq -r
+option.
+For example:
+.Bd -literal -offset indent
+# pnfsdsfile -r nfsv4-data3 yyy.c
+yyy.c: nfsv4-data2.home.rick ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000 0.0.0.0 ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
+.Ed
+.sp
+replaces nfsv4-data3 with an IPv4 address of 0.0.0.0, so that nfsv4-data3
+will not get used.
+.Pp
+Normally this will be called within a
+.Xr find 1
+command for all regular
+files in the exported directory tree and must be done on the MDS.
+When used with
+.Xr find 1 ,
+you will probably also want the
+.Dq -q
+option so that it won't spit out the results for every file.
+If the disabled/repaired DS is nfsv4-data3, the commands done on the MDS
+would be:
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdsfile -q -r nfsv4-data3 {} \;
+.Ed
+.sp
+There is a problem with the above command if the file found by
+.Xr find 1
+is renamed or unlinked before the
+.Xr pnfsdsfile 8
+command is done on it.
+This should normally generate an error message.
+A simple unlink is harmless,
+but a link/unlink or rename might result in the file not having been processed
+under its new name.
+To check that all files have their IP addresses set to 0.0.0.0, these
+commands can be used (assuming the
+.Xr sh 1
+shell):
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdsfile {} \; | sed "/nfsv4-data3/!d"
+.Ed
+.sp
+Any line(s) printed require the
+.Xr pnfsdsfile 8
+with
+.Dq -r
+to be done again.
+Once this is done, the replaced/repaired DS can be brought back online.
+It should have empty ds0,...,dsN directories under the top level exported
+directory for storage of data files just like it did when first set up.
+Mount it on the MDS exactly as you did before disabling it.
+For the nfsv4-data3 example, the command would be:
+.Bd -literal -offset indent
+# mount -t nfs -o nfsv4,minorversion=2,soft,retrans=2 nfsv4-data3:/ /data3
+.Ed
+.sp
+Then restart the nfsd to re-enable the DS.
+.Bd -literal -offset indent
+# /etc/rc.d/nfsd restart
+.Ed
+.sp
+Now, new files can be stored on nfsv4-data3,
+but files with the IP address zeroed out on the MDS will not yet use the
+repaired DS (nfsv4-data3).
+The next step is to go through the exported file tree on the MDS and,
+for each of the
+files with an IPv4 address of 0.0.0.0 in its extended attribute, copy the file
+data to the repaired DS and re-enable use of this mirror for it.
+This command for copying the file data for one MDS file is
+.Xr pnfsdscopymr 8
+and it will also normally be used in a
+.Xr find 1 .
+For the example case, the commands on the MDS would be:
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdscopymr -r /data3 {} \;
+.Ed
+.sp
+When this completes, the recovery should be complete or at least nearly so.
+As noted above, if a link/unlink or rename occurs on a file name while the
+above
+.Xr find 1
+is in progress, it may not get copied.
+To check for any file(s) not yet copied, the commands are:
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdsfile {} \; | sed "/0\.0\.0\.0/!d"
+.Ed
+.sp
+If this command prints out any file name(s), these files must
+have the
+.Xr pnfsdscopymr 8
+command done on them to complete the recovery.
+.Bd -literal -offset indent
+# pnfsdscopymr -r /data3 <file-path-reported>
+.Ed
+.sp
+If this command fails with the error
+.br
+.Dq pnfsdscopymr: Copymr failed for file <path>: Device not configured
+.br
+repeatedly, this may be caused by a Read/Write layout that has not
+been returned.
+The only way to get rid of such a layout is to restart the
+.Xr nfsd 8 .
+.sp
+All of these commands are designed to be
+done while the pNFS service is running and can be re-run safely.
+.Pp
+For a more detailed discussion of the setup and management of a pNFS service
+see:
+.Bd -literal -offset indent
+https://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
+.Ed
+.sp
+.Sh SEE ALSO
+.Xr nfsv4 4 ,
+.Xr pnfs 4 ,
+.Xr exports 5 ,
+.Xr fstab 5 ,
+.Xr rc.conf 5 ,
+.Xr sysctl.conf 5 ,
+.Xr nfscbd 8 ,
+.Xr nfsd 8 ,
+.Xr nfsuserd 8 ,
+.Xr pnfsdscopymr 8 ,
+.Xr pnfsdsfile 8 ,
+.Xr pnfsdskill 8
+.Sh HISTORY
+The
+.Nm
+service first appeared in
+.Fx 12.0 .
+.Sh BUGS
+Since the MDS cannot be mirrored, it is a single point of failure just
+as a non
+.Tn pNFS
+server is.
+For non-mirrored configurations, all
+.Fx
+systems used in the service
+are single points of failure.
diff --git a/usr.sbin/nfsd/stablerestart.5 b/usr.sbin/nfsd/stablerestart.5
new file mode 100644
index 000000000000..0d93b0487f09
--- /dev/null
+++ b/usr.sbin/nfsd/stablerestart.5
@@ -0,0 +1,94 @@
+.\" Copyright (c) 2009 Rick Macklem, University of Guelph
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd April 10, 2011
+.Dt STABLERESTART 5
+.Os
+.Sh NAME
+.Nm nfs-stablerestart
+.Nd restart information for the
+.Tn NFSv4
+server
+.Sh SYNOPSIS
+.Nm nfs-stablerestart
+.Sh DESCRIPTION
+The
+.Nm
+file holds information that allows the
+.Tn NFSv4
+server to restart without always returning the NFSERR_NOGRACE error, as described in the
+.Tn NFSv4
+server specification; see
+.%T "Network File System (NFS) Version 4 Protocol RFC 3530, Section 8.6.3" .
+.Pp
+The first record in the file, as defined by struct nfsf_rec in
+/usr/include/fs/nfs/nfsrvstate.h, holds the lease duration of the
+last incarnation of the server and the number of boot times that follows.
+Following this are the number of previous boot times listed in the
+first record.
+The lease duration is used to set the grace period.
+The boot times
+are used to avoid the unlikely occurrence of a boot time being reused,
+due to a TOD clock going backwards.
+This record, with this boot time added to the previous boot times,
+is re-written at the end of the grace period.
+.Pp
+The rest of the file consists of appended records, as defined by
+struct nfst_rec in /usr/include/fs/nfs/nfsrvstate.h, which are used to
+represent one of two things.
+There are records which indicate that a
+client successfully acquired state and records which indicate that a
+client's state was revoked.
+State revoke records indicate that state information
+for a client was discarded, due to lease expiry and an otherwise
+conflicting open or lock request being made by a different client.
+These records can be used to determine if clients might have done either of the
+edge conditions.
+.Pp
+If a client might have done either edge condition or this file is
+empty or corrupted, the server returns NFSERR_NOGRACE for any reclaim
+request from the client.
+.Pp
+For correct operation of the server, it must be ensured that the file
+is written to stable storage by the time a write op with IO_SYNC specified has returned.
+This might require hardware level caching to be disabled for
+a local disk drive that holds the file, or similar.
+.Sh FILES
+.Bl -tag -width /var/db/nfs-stablerestart.bak -compact
+.It Pa /var/db/nfs-stablerestart
+NFSv4 stable restart file
+.It Pa /var/db/nfs-stablerestart.bak
+backup copy of the file
+.El
+.Sh SEE ALSO
+.Xr nfsv4 4 ,
+.Xr nfsd 8
+.Sh BUGS
+If the file is empty, the NFSv4 server has no choice but to return
+NFSERR_NOGRACE for all reclaim requests.
+Although correct, this is a highly undesirable occurrence, so the file should not be lost if
+at all possible.
+The backup copy of the file is maintained and used by the
+.Xr nfsd 8
+to minimize the risk of this occurring.
+To move the file, you must edit the nfsd sources and recompile it.
+This was done to discourage accidental relocation of the file.