Diffstat (limited to 'usr.sbin/nfsd')
 -rw-r--r--  usr.sbin/nfsd/Makefile         |    8
 -rw-r--r--  usr.sbin/nfsd/Makefile.depend  |   19
 -rw-r--r--  usr.sbin/nfsd/nfsd.8           |  396
 -rw-r--r--  usr.sbin/nfsd/nfsd.c           | 1400
 -rw-r--r--  usr.sbin/nfsd/nfsv4.4          |  379
 -rw-r--r--  usr.sbin/nfsd/pnfs.4           |  228
 -rw-r--r--  usr.sbin/nfsd/pnfsserver.4     |  444
 -rw-r--r--  usr.sbin/nfsd/stablerestart.5  |   94
8 files changed, 2968 insertions(+), 0 deletions(-)
diff --git a/usr.sbin/nfsd/Makefile b/usr.sbin/nfsd/Makefile new file mode 100644 index 000000000000..b6bd9a28e651 --- /dev/null +++ b/usr.sbin/nfsd/Makefile @@ -0,0 +1,8 @@ +PACKAGE= nfs + +PROG= nfsd +MAN= nfsd.8 nfsv4.4 stablerestart.5 pnfs.4 pnfsserver.4 + +LIBADD= util + +.include <bsd.prog.mk> diff --git a/usr.sbin/nfsd/Makefile.depend b/usr.sbin/nfsd/Makefile.depend new file mode 100644 index 000000000000..7e5c47e39608 --- /dev/null +++ b/usr.sbin/nfsd/Makefile.depend @@ -0,0 +1,19 @@ +# Autogenerated - do NOT edit! + +DIRDEPS = \ + include \ + include/arpa \ + include/rpc \ + include/rpcsvc \ + include/xlocale \ + lib/${CSU_DIR} \ + lib/libc \ + lib/libcompiler_rt \ + lib/libutil \ + + +.include <dirdeps.mk> + +.if ${DEP_RELDIR} == ${_DEP_RELDIR} +# local dependencies - needed for -jN in clean tree +.endif diff --git a/usr.sbin/nfsd/nfsd.8 b/usr.sbin/nfsd/nfsd.8 new file mode 100644 index 000000000000..2e5724dbce33 --- /dev/null +++ b/usr.sbin/nfsd/nfsd.8 @@ -0,0 +1,396 @@ +.\" Copyright (c) 1989, 1991, 1993 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.Dd May 30, 2025 +.Dt NFSD 8 +.Os +.Sh NAME +.Nm nfsd +.Nd remote +NFS server +.Sh SYNOPSIS +.Nm +.Op Fl arduteN +.Op Fl n Ar num_servers +.Op Fl h Ar bindip +.Op Fl p Ar pnfs_setup +.Op Fl m Ar mirror_level +.Op Fl P Ar pidfile +.Op Fl V Ar virtual_hostname +.Op Fl Fl maxthreads Ar max_threads +.Op Fl Fl minthreads Ar min_threads +.Sh DESCRIPTION +The +.Nm +utility runs on a server machine to service NFS requests from client machines. +At least one +.Nm +must be running for a machine to operate as a server. +.Pp +Unless otherwise specified, eight servers per CPU for UDP transport are +started. +.Pp +When +.Nm +is run in an appropriately configured vnet jail, the server is restricted +to TCP transport and no pNFS service. +Therefore, the +.Fl t +option must be specified and none of the +.Fl u , +.Fl p +and +.Fl m +options can be specified when run in a vnet jail. +See +.Xr jail 8 +for more information. 
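As a minimal sketch of the vnet-jail case described above (thread counts are arbitrary; the jail itself must additionally be created with the allow.nfsd parameter, per the nfssvc error handling later in this commit):

    # Inside an appropriately configured vnet jail: TCP-only service,
    # with the -u, -p and -m options omitted as the text requires
    nfsd -t --minthreads 4 --maxthreads 32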
+.Pp +The following options are available: +.Bl -tag -width Ds +.It Fl r +Register the NFS service with +.Xr rpcbind 8 +without creating any servers. +This option can be used along with the +.Fl u +or +.Fl t +options to re-register NFS if the rpcbind server is restarted. +.It Fl d +Unregister the NFS service with +.Xr rpcbind 8 +without creating any servers. +.It Fl P Ar pidfile +Specify alternative location of a file where main process PID will be stored. +The default location is +.Pa /var/run/nfsd.pid . +.It Fl V Ar virtual_hostname +Specifies a hostname to be used as a principal name, instead of +the default hostname. +.It Fl n Ar threads +This option is deprecated and is limited to a maximum of 256 threads. +The options +.Fl Fl maxthreads +and +.Fl Fl minthreads +should now be used. +The +.Ar threads +argument for +.Fl Fl minthreads +and +.Fl Fl maxthreads +may be set to the same value to avoid dynamic +changes to the number of threads. +.It Fl Fl maxthreads Ar threads +Specifies the maximum servers that will be kept around to service requests. +.It Fl Fl minthreads Ar threads +Specifies the minimum servers that will be kept around to service requests. +.It Fl h Ar bindip +Specifies which IP address or hostname to bind to on the local host. +This option is recommended when a host has multiple interfaces. +Multiple +.Fl h +options may be specified. +.It Fl a +Specifies that nfsd should bind to the wildcard IP address. +This is the default if no +.Fl h +options are given. +It may also be specified in addition to any +.Fl h +options given. +Note that NFS/UDP does not operate properly when +bound to the wildcard IP address whether you use -a or do not use -h. +.It Fl p Ar pnfs_setup +Enables pNFS support in the server and specifies the information that the +daemon needs to start it. +This option can only be used on one server and specifies that this server +will be the MetaData Server (MDS) for the pNFS service. +This can only be done if there is at least one +.Fx +system configured +as a Data Server (DS) for it to use. +.Pp +The +.Ar pnfs_setup +string is a set of fields separated by ',' characters: +Each of these fields specifies one DS. +It consists of a server hostname, followed by a ':' +and the directory path where the DS's data storage file system is mounted on +this MDS server. +This can optionally be followed by a '#' and the mds_path, which is the +directory path for an exported file system on this MDS. +If this is specified, it means that this DS is to be used to store data +files for this mds_path file system only. +If this optional component does not exist, the DS will be used to store data +files for all exported MDS file systems. +The DS storage file systems must be mounted on this system before the +.Nm +is started with this option specified. +.br +For example: +.sp +nfsv4-data0:/data0,nfsv4-data1:/data1 +.sp +would specify two DS servers called nfsv4-data0 and nfsv4-data1 that comprise +the data storage component of the pNFS service. +These two DSs would be used to store data files for all exported file systems +on this MDS. +The directories +.Dq /data0 +and +.Dq /data1 +are where the data storage servers exported +storage directories are mounted on this system (which will act as the MDS). +.br +Whereas, for the example: +.sp +nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export2 +.sp +would specify two DSs as above, however nfsv4-data0 will be used to store +data files for +.Dq /export1 +and nfsv4-data1 will be used to store data files for +.Dq /export2 . 
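A usage sketch built from the two examples above (the host names and mount points are the illustrative ones from this page; the DS storage file systems must already be mounted at /data0 and /data1 on the MDS):

    # All exported MDS file systems striped over both DSs:
    nfsd -t -p nfsv4-data0:/data0,nfsv4-data1:/data1
    # Each DS dedicated to one exported MDS file system
    # (quoting keeps '#' safe from the shell):
    nfsd -t -p 'nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export2'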
+.sp +When using IPv6 addresses for DSs +be wary of using link local addresses. +The IPv6 address for the DS is sent to the client and there is no scope +zone in it. +As such, a link local address may not work for a pNFS client to DS +TCP connection. +When parsed, +.Nm +will only use a link local address if it is the only address returned by +.Xr getaddrinfo 3 +for the DS hostname. +.It Fl m Ar mirror_level +This option is only meaningful when used with the +.Fl p +option. +It specifies the +.Dq mirror_level , +which defines how many of the DSs will +have a copy of a file's data storage file. +The default of one implies no mirroring of data storage files on the DSs. +The +.Dq mirror_level +would normally be set to 2 to enable mirroring, but +can be as high as NFSDEV_MAXMIRRORS. +There must be at least +.Dq mirror_level +DSs for each exported file system on the MDS, as specified in the +.Fl p +option. +This implies that, for the above example using "#/export1" and "#/export2", +mirroring cannot be done. +There would need to be two DS entries for each of "#/export1" and "#/export2" +in order to support a +.Dq mirror_level +of two. +.Pp +If mirroring is enabled, the server must use the Flexible File +layout. +If mirroring is not enabled, the server will use the File layout +by default, but this default can be changed to the Flexible File layout if the +.Xr sysctl 8 +vfs.nfsd.default_flexfile +is set non-zero. +.It Fl t +Serve TCP NFS clients. +.It Fl u +Serve UDP NFS clients. +.It Fl e +Ignored; included for backward compatibility. +.It Fl N +Cause +.Nm +to execute in the foreground instead of in daemon mode. +.El +.Pp +For example, +.Dq Li "nfsd -u -t --minthreads 6 --maxthreads 6" +serves UDP and TCP transports using six kernel threads (servers). +.Pp +For a system dedicated to servicing NFS RPCs, the number of +threads (servers) should be sufficient to handle the peak +client RPC load. +For systems that perform other services, the number of +threads (servers) may need to be limited, so that resources +are available for these other services. +.Pp +The +.Nm +utility listens for service requests at the port indicated in the +NFS server specification; see +.%T "Network File System Protocol Specification" , +RFC1094, +.%T "NFS: Network File System Version 3 Protocol Specification" , +RFC1813, +.%T "Network File System (NFS) Version 4 Protocol" , +RFC7530, +.%T "Network File System (NFS) Version 4 Minor Version 1 Protocol" , +RFC5661, +.%T "Network File System (NFS) Version 4 Minor Version 2 Protocol" , +RFC7862, +.%T "File System Extended Attributes in NFSv4" , +RFC8276 and +.%T "Parallel NFS (pNFS) Flexible File Layout" , +RFC8435. +.Pp +If +.Nm +detects that +NFS is not loaded in the running kernel, it will attempt +to load a loadable kernel module containing NFS support using +.Xr kldload 2 . +If this fails, or no NFS KLD is available, +.Nm +will exit with an error. +.Pp +If +.Nm +is to be run on a host with multiple interfaces or interface aliases, use +of the +.Fl h +option is recommended. +If you do not use the option NFS may not respond to +UDP packets from the same IP address they were sent to. +Use of this option +is also recommended when securing NFS exports on a firewalling machine such +that the NFS sockets can only be accessed by the inside interface. +The +.Nm ipfw +utility +would then be used to block NFS-related packets that come in on the outside +interface. 
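A minimal sketch of the firewalling arrangement just described, assuming an inside address of 192.0.2.1 and an outside interface named em0 (both illustrative, not from this commit):

    # Bind the NFS service to the inside interface only
    nfsd -t -h 192.0.2.1
    # Drop NFS traffic (port 2049) arriving on the outside interface
    ipfw add deny tcp from any to any 2049 in recv em0
    ipfw add deny udp from any to any 2049 in recv em0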
+.Pp +If the server has stopped servicing clients and has generated a console message +like +.Dq Li "nfsd server cache flooded..." , +the value for vfs.nfsd.tcphighwater needs to be increased. +This should allow the server to again handle requests without a reboot. +Also, you may want to consider decreasing the value for +vfs.nfsd.tcpcachetimeo to several minutes (in seconds) instead of 12 hours +when this occurs. +.Pp +Unfortunately making vfs.nfsd.tcphighwater too large can result in the mbuf +limit being reached, as indicated by a console message +like +.Dq Li "kern.ipc.nmbufs limit reached" . +If you cannot find values of the above +.Nm sysctl +values that work, you can disable the DRC cache for TCP by setting +vfs.nfsd.cachetcp to 0. +.Pp +The +.Nm +utility has to be terminated with +.Dv SIGUSR1 +and cannot be killed with +.Dv SIGTERM +or +.Dv SIGQUIT . +The +.Nm +utility needs to ignore these signals in order to stay alive as long +as possible during a shutdown, otherwise loopback mounts will +not be able to unmount. +If you have to kill +.Nm +just do a +.Dq Li "kill -USR1 <PID of master nfsd>" +.Sh EXIT STATUS +.Ex -std +.Sh SEE ALSO +.Xr nfsstat 1 , +.Xr kldload 2 , +.Xr nfssvc 2 , +.Xr nfsv4 4 , +.Xr pnfs 4 , +.Xr pnfsserver 4 , +.Xr exports 5 , +.Xr stablerestart 5 , +.Xr gssd 8 , +.Xr ipfw 8 , +.Xr jail 8 , +.Xr mountd 8 , +.Xr nfsiod 8 , +.Xr nfsrevoke 8 , +.Xr nfsuserd 8 , +.Xr rpcbind 8 +.Sh HISTORY +The +.Nm +utility first appeared in +.Bx 4.4 . +.Sh BUGS +If +.Nm +is started when +.Xr gssd 8 +is not running, it will service AUTH_SYS requests only. +To fix the problem you must kill +.Nm +and then restart it, after the +.Xr gssd 8 +is running. +.Pp +For a Flexible File Layout pNFS server, +if there are Linux clients doing NFSv4.1 or NFSv4.2 mounts, those +clients might need the +.Xr sysctl 8 +vfs.nfsd.flexlinuxhack +to be set to one on the MDS as a workaround. +.Pp +Linux 5.n kernels appear to have been patched such that this +.Xr sysctl 8 +does not need to be set. +.Pp +For NFSv4.2, a Copy operation can take a long time to complete. +If there is a concurrent ExchangeID or DelegReturn operation +which requires the exclusive lock on all NFSv4 state, this can +result in a +.Dq stall +of the +.Nm +server. +If your storage is on ZFS without block cloning enabled, +setting the +.Xr sysctl 8 +.Va vfs.zfs.dmu_offset_next_sync +to 0 can often avoid this problem. +It is also possible to set the +.Xr sysctl 8 +.Va vfs.nfsd.maxcopyrange +to 10-100 megabytes to try and reduce Copy operation times. +As a last resort, setting +.Xr sysctl 8 +.Va vfs.nfsd.maxcopyrange +to 0 disables the Copy operation. diff --git a/usr.sbin/nfsd/nfsd.c b/usr.sbin/nfsd/nfsd.c new file mode 100644 index 000000000000..94c30ae6dee1 --- /dev/null +++ b/usr.sbin/nfsd/nfsd.c @@ -0,0 +1,1400 @@ +/*- + * SPDX-License-Identifier: BSD-3-Clause + * + * Copyright (c) 1989, 1993, 1994 + * The Regents of the University of California. All rights reserved. + * + * This code is derived from software contributed to Berkeley by + * Rick Macklem at The University of Guelph. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include <sys/param.h> +#include <sys/syslog.h> +#include <sys/wait.h> +#include <sys/mount.h> +#include <sys/fcntl.h> +#include <sys/linker.h> +#include <sys/module.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/sysctl.h> +#include <sys/ucred.h> + +#include <rpc/rpc.h> +#include <rpc/pmap_clnt.h> +#include <rpcsvc/nfs_prot.h> + +#include <netdb.h> +#include <arpa/inet.h> +#include <nfs/nfssvc.h> + +#include <fs/nfs/nfsproto.h> +#include <fs/nfs/nfskpiport.h> +#include <fs/nfs/nfs.h> + +#include <err.h> +#include <errno.h> +#include <libutil.h> +#include <signal.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <sysexits.h> + +#include <getopt.h> + +static int debug = 0; +static int nofork = 0; + +#define DEFAULT_PIDFILE "/var/run/nfsd.pid" +#define NFSD_STABLERESTART "/var/db/nfs-stablerestart" +#define NFSD_STABLEBACKUP "/var/db/nfs-stablerestart.bak" +#define MAXNFSDCNT 256 +#define DEFNFSDCNT 4 +#define NFS_VER2 2 +#define NFS_VER3 3 +#define NFS_VER4 4 +static pid_t children[MAXNFSDCNT]; /* PIDs of children */ +static pid_t masterpid; /* PID of master/parent */ +static struct pidfh *masterpidfh = NULL; /* pidfh of master/parent */ +static int nfsdcnt; /* number of children */ +static int nfsdcnt_set; +static int minthreads; +static int maxthreads; +static int nfssvc_nfsd; /* Set to correct NFSSVC_xxx flag */ +static int stablefd = -1; /* Fd for the stable restart file */ +static int backupfd; /* Fd for the backup stable restart file */ +static const char *getopt_shortopts; +static const char *getopt_usage; +static int nfs_minvers = NFS_VER2; + +static int minthreads_set; +static int maxthreads_set; + +static struct option longopts[] = { + { "debug", no_argument, &debug, 1 }, + { "minthreads", required_argument, &minthreads_set, 1 }, + { "maxthreads", required_argument, &maxthreads_set, 1 }, + { "pnfs", required_argument, NULL, 'p' }, + { "mirror", required_argument, NULL, 'm' }, + { NULL, 0, NULL, 0} +}; + +static void cleanup(int); +static void child_cleanup(int); +static void killchildren(void); +static void nfsd_exit(int); +static void nonfs(int); +static void reapchild(int); +static int setbindhost(struct addrinfo **ia, const char *bindhost, + struct addrinfo hints); +static void start_server(int, struct nfsd_nfsd_args *, const char 
*vhost); +static void unregistration(void); +static void usage(void); +static void open_stable(int *, int *); +static void copy_stable(int, int); +static void backup_stable(int); +static void set_nfsdcnt(int); +static void parse_dsserver(const char *, struct nfsd_nfsd_args *); + +/* + * Nfs server daemon mostly just a user context for nfssvc() + * + * 1 - do file descriptor and signal cleanup + * 2 - fork the nfsd(s) + * 3 - create server socket(s) + * 4 - register socket with rpcbind + * + * For connectionless protocols, just pass the socket into the kernel via. + * nfssvc(). + * For connection based sockets, loop doing accepts. When you get a new + * socket from accept, pass the msgsock into the kernel via. nfssvc(). + * The arguments are: + * -r - reregister with rpcbind + * -d - unregister with rpcbind + * -t - support tcp nfs clients + * -u - support udp nfs clients + * -e - forces it to run a server that supports nfsv4 + * -p - enable a pNFS service + * -m - set the mirroring level for a pNFS service + * followed by "n" which is the number of nfsds' to fork off + */ +int +main(int argc, char **argv) +{ + struct nfsd_addsock_args addsockargs; + struct addrinfo *ai_udp, *ai_tcp, *ai_udp6, *ai_tcp6, hints; + struct netconfig *nconf_udp, *nconf_tcp, *nconf_udp6, *nconf_tcp6; + struct netbuf nb_udp, nb_tcp, nb_udp6, nb_tcp6; + struct sockaddr_storage peer; + fd_set ready, sockbits; + int ch, connect_type_cnt, i, maxsock, msgsock; + socklen_t len; + int on = 1, unregister, reregister, sock; + int tcp6sock, ip6flag, tcpflag, tcpsock; + int udpflag, ecode, error, s; + int bindhostc, bindanyflag, rpcbreg, rpcbregcnt; + int nfssvc_addsock; + int jailed, longindex = 0; + size_t jailed_size, nfs_minvers_size; + const char *lopt; + char **bindhost = NULL; + const char *pidfile_path = DEFAULT_PIDFILE; + pid_t pid, otherpid; + struct nfsd_nfsd_args nfsdargs; + const char *vhostname = NULL; + + nfsdargs.mirrorcnt = 1; + nfsdargs.addr = NULL; + nfsdargs.addrlen = 0; + nfsdcnt = DEFNFSDCNT; + unregister = reregister = tcpflag = maxsock = 0; + bindanyflag = udpflag = connect_type_cnt = bindhostc = 0; + getopt_shortopts = "ah:n:rdtuep:m:V:NP:"; + getopt_usage = + "usage:\n" + " nfsd [-ardtueN] [-h bindip]\n" + " [-n numservers] [--minthreads #] [--maxthreads #]\n" + " [-p/--pnfs dsserver0:/dsserver0-mounted-on-dir,...," + "dsserverN:/dsserverN-mounted-on-dir] [-m mirrorlevel]\n" + " [-P pidfile ] [-V virtual_hostname]\n"; + while ((ch = getopt_long(argc, argv, getopt_shortopts, longopts, + &longindex)) != -1) + switch (ch) { + case 'V': + if (strlen(optarg) <= MAXHOSTNAMELEN) + vhostname = optarg; + else + warnx("Virtual host name (%s) is too long", + optarg); + break; + case 'a': + bindanyflag = 1; + break; + case 'n': + set_nfsdcnt(atoi(optarg)); + break; + case 'h': + bindhostc++; + bindhost = realloc(bindhost,sizeof(char *)*bindhostc); + if (bindhost == NULL) + errx(1, "Out of memory"); + bindhost[bindhostc-1] = strdup(optarg); + if (bindhost[bindhostc-1] == NULL) + errx(1, "Out of memory"); + break; + case 'r': + reregister = 1; + break; + case 'd': + unregister = 1; + break; + case 't': + tcpflag = 1; + break; + case 'u': + udpflag = 1; + break; + case 'e': + /* now a no-op, since this is the default */ + break; + case 'p': + /* Parse out the DS server host names and mount pts. */ + parse_dsserver(optarg, &nfsdargs); + break; + case 'm': + /* Set the mirror level for a pNFS service. 
*/ + i = atoi(optarg); + if (i < 2 || i > NFSDEV_MAXMIRRORS) + errx(1, "Mirror level out of range 2<-->%d", + NFSDEV_MAXMIRRORS); + nfsdargs.mirrorcnt = i; + break; + case 'N': + nofork = 1; + break; + case 'P': + pidfile_path = optarg; + break; + case 0: + lopt = longopts[longindex].name; + if (!strcmp(lopt, "minthreads")) { + minthreads = atoi(optarg); + } else if (!strcmp(lopt, "maxthreads")) { + maxthreads = atoi(optarg); + } + break; + default: + case '?': + usage(); + } + if (!tcpflag && !udpflag) + udpflag = 1; + argv += optind; + argc -= optind; + if (minthreads_set && maxthreads_set && minthreads > maxthreads) + errx(EX_USAGE, + "error: minthreads(%d) can't be greater than " + "maxthreads(%d)", minthreads, maxthreads); + + /* + * XXX + * Backward compatibility, trailing number is the count of daemons. + */ + if (argc > 1) + usage(); + if (argc == 1) + set_nfsdcnt(atoi(argv[0])); + + /* + * Unless the "-o" option was specified, try and run "nfsd". + * If "-o" was specified, try and run "nfsserver". + */ + if (modfind("nfsd") < 0) { + /* Not present in kernel, try loading it */ + if (kldload("nfsd") < 0 || modfind("nfsd") < 0) + errx(1, "NFS server is not available"); + } + + ip6flag = 1; + s = socket(AF_INET6, SOCK_DGRAM, IPPROTO_UDP); + if (s == -1) { + if (errno != EPROTONOSUPPORT && errno != EAFNOSUPPORT) + err(1, "socket"); + ip6flag = 0; + } else if (getnetconfigent("udp6") == NULL || + getnetconfigent("tcp6") == NULL) { + ip6flag = 0; + } + if (s != -1) + close(s); + + if (bindhostc == 0 || bindanyflag) { + bindhostc++; + bindhost = realloc(bindhost,sizeof(char *)*bindhostc); + if (bindhost == NULL) + errx(1, "Out of memory"); + bindhost[bindhostc-1] = strdup("*"); + if (bindhost[bindhostc-1] == NULL) + errx(1, "Out of memory"); + } + + if (unregister) { + /* + * Unregister before setting nfs_minvers, in case the + * value of vfs.nfsd.server_min_nfsvers has changed + * since registering with rpcbind. 
+ */ + unregistration(); + exit (0); + } + + nfs_minvers_size = sizeof(nfs_minvers); + error = sysctlbyname("vfs.nfsd.server_min_nfsvers", &nfs_minvers, + &nfs_minvers_size, NULL, 0); + if (error != 0 || nfs_minvers < NFS_VER2 || nfs_minvers > NFS_VER4) { + warnx("sysctlbyname(vfs.nfsd.server_min_nfsvers) failed," + " defaulting to NFSv2"); + nfs_minvers = NFS_VER2; + } + + if (reregister) { + if (udpflag) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET; + hints.ai_socktype = SOCK_DGRAM; + hints.ai_protocol = IPPROTO_UDP; + ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp); + if (ecode != 0) + err(1, "getaddrinfo udp: %s", gai_strerror(ecode)); + nconf_udp = getnetconfigent("udp"); + if (nconf_udp == NULL) + err(1, "getnetconfigent udp failed"); + nb_udp.buf = ai_udp->ai_addr; + nb_udp.len = nb_udp.maxlen = ai_udp->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp, + &nb_udp)) + err(1, "rpcb_set udp failed"); + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp, + &nb_udp)) + err(1, "rpcb_set udp failed"); + freeaddrinfo(ai_udp); + } + if (udpflag && ip6flag) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET6; + hints.ai_socktype = SOCK_DGRAM; + hints.ai_protocol = IPPROTO_UDP; + ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp6); + if (ecode != 0) + err(1, "getaddrinfo udp6: %s", gai_strerror(ecode)); + nconf_udp6 = getnetconfigent("udp6"); + if (nconf_udp6 == NULL) + err(1, "getnetconfigent udp6 failed"); + nb_udp6.buf = ai_udp6->ai_addr; + nb_udp6.len = nb_udp6.maxlen = ai_udp6->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp6, + &nb_udp6)) + err(1, "rpcb_set udp6 failed"); + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp6, + &nb_udp6)) + err(1, "rpcb_set udp6 failed"); + freeaddrinfo(ai_udp6); + } + if (tcpflag) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET; + hints.ai_socktype = SOCK_STREAM; + hints.ai_protocol = IPPROTO_TCP; + ecode = getaddrinfo(NULL, "nfs", &hints, &ai_tcp); + if (ecode != 0) + err(1, "getaddrinfo tcp: %s", gai_strerror(ecode)); + nconf_tcp = getnetconfigent("tcp"); + if (nconf_tcp == NULL) + err(1, "getnetconfigent tcp failed"); + nb_tcp.buf = ai_tcp->ai_addr; + nb_tcp.len = nb_tcp.maxlen = ai_tcp->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp, + &nb_tcp)) + err(1, "rpcb_set tcp failed"); + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp, + &nb_tcp)) + err(1, "rpcb_set tcp failed"); + freeaddrinfo(ai_tcp); + } + if (tcpflag && ip6flag) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET6; + hints.ai_socktype = SOCK_STREAM; + hints.ai_protocol = IPPROTO_TCP; + ecode = getaddrinfo(NULL, "nfs", &hints, &ai_tcp6); + if (ecode != 0) + err(1, "getaddrinfo tcp6: %s", gai_strerror(ecode)); + nconf_tcp6 = getnetconfigent("tcp6"); + if (nconf_tcp6 == NULL) + err(1, "getnetconfigent tcp6 failed"); + nb_tcp6.buf = ai_tcp6->ai_addr; + nb_tcp6.len = nb_tcp6.maxlen = ai_tcp6->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp6, + &nb_tcp6)) + err(1, "rpcb_set tcp6 failed"); + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp6, + &nb_tcp6)) + err(1, "rpcb_set tcp6 failed"); + freeaddrinfo(ai_tcp6); + } + exit (0); + } + + if (pidfile_path != NULL) { + masterpidfh = 
pidfile_open(pidfile_path, 0600, &otherpid); + if (masterpidfh == NULL) { + if (errno == EEXIST) + errx(1, "daemon already running, pid: %jd.", + (intmax_t)otherpid); + warn("cannot open pid file"); + } + } + if (debug == 0 && nofork == 0) { + daemon(0, 0); + (void)signal(SIGHUP, SIG_IGN); + (void)signal(SIGINT, SIG_IGN); + /* + * nfsd sits in the kernel most of the time. It needs + * to ignore SIGTERM/SIGQUIT in order to stay alive as long + * as possible during a shutdown, otherwise loopback + * mounts will not be able to unmount. + */ + (void)signal(SIGTERM, SIG_IGN); + (void)signal(SIGQUIT, SIG_IGN); + } + (void)signal(SIGSYS, nonfs); + (void)signal(SIGCHLD, reapchild); + (void)signal(SIGUSR2, backup_stable); + + openlog("nfsd", LOG_PID | (debug ? LOG_PERROR : 0), LOG_DAEMON); + + if (masterpidfh != NULL && pidfile_write(masterpidfh) != 0) + syslog(LOG_ERR, "pidfile_write(): %m"); + + /* + * For V4, we open the stablerestart file and call nfssvc() + * to get it loaded. This is done before the daemons do the + * regular nfssvc() call to service NFS requests. + * (This way the file remains open until the last nfsd is killed + * off.) + * It and the backup copy will be created as empty files + * the first time this nfsd is started and should never be + * deleted/replaced if at all possible. It should live on a + * local, non-volatile storage device that does not do hardware + * level write-back caching. (See SCSI doc for more information + * on how to prevent write-back caching on SCSI disks.) + */ + open_stable(&stablefd, &backupfd); + if (stablefd < 0) { + syslog(LOG_ERR, "Can't open %s: %m\n", NFSD_STABLERESTART); + exit(1); + } + /* This system call will fail for old kernels, but that's ok. */ + nfssvc(NFSSVC_BACKUPSTABLE, NULL); + if (nfssvc(NFSSVC_STABLERESTART, (caddr_t)&stablefd) < 0) { + if (errno == EPERM) { + jailed = 0; + jailed_size = sizeof(jailed); + sysctlbyname("security.jail.jailed", &jailed, + &jailed_size, NULL, 0); + if (jailed != 0) + syslog(LOG_ERR, "nfssvc stablerestart failed: " + "allow.nfsd might not be configured"); + else + syslog(LOG_ERR, "nfssvc stablerestart failed"); + } else if (errno == ENXIO) + syslog(LOG_ERR, "nfssvc stablerestart failed: is nfsd " + "already running?"); + else + syslog(LOG_ERR, "Can't read stable storage file: %m\n"); + exit(1); + } + nfssvc_addsock = NFSSVC_NFSDADDSOCK; + nfssvc_nfsd = NFSSVC_NFSDNFSD | NFSSVC_NEWSTRUCT; + + if (tcpflag) { + /* + * For TCP mode, we fork once to start the first + * kernel nfsd thread. The kernel will add more + * threads as needed. + */ + masterpid = getpid(); + pid = fork(); + if (pid == -1) { + syslog(LOG_ERR, "fork: %m"); + nfsd_exit(1); + } + if (pid) { + children[0] = pid; + } else { + pidfile_close(masterpidfh); + (void)signal(SIGUSR1, child_cleanup); + setproctitle("server"); + start_server(0, &nfsdargs, vhostname); + } + } + + (void)signal(SIGUSR1, cleanup); + FD_ZERO(&sockbits); + + rpcbregcnt = 0; + /* Set up the socket for udp and rpcb register it. 
*/ + if (udpflag) { + rpcbreg = 0; + for (i = 0; i < bindhostc; i++) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET; + hints.ai_socktype = SOCK_DGRAM; + hints.ai_protocol = IPPROTO_UDP; + if (setbindhost(&ai_udp, bindhost[i], hints) == 0) { + rpcbreg = 1; + rpcbregcnt++; + if ((sock = socket(ai_udp->ai_family, + ai_udp->ai_socktype, + ai_udp->ai_protocol)) < 0) { + syslog(LOG_ERR, + "can't create udp socket"); + nfsd_exit(1); + } + if (bind(sock, ai_udp->ai_addr, + ai_udp->ai_addrlen) < 0) { + syslog(LOG_ERR, + "can't bind udp addr %s: %m", + bindhost[i]); + nfsd_exit(1); + } + freeaddrinfo(ai_udp); + addsockargs.sock = sock; + addsockargs.name = NULL; + addsockargs.namelen = 0; + if (nfssvc(nfssvc_addsock, &addsockargs) < 0) { + syslog(LOG_ERR, "can't Add UDP socket"); + nfsd_exit(1); + } + (void)close(sock); + } + } + if (rpcbreg == 1) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET; + hints.ai_socktype = SOCK_DGRAM; + hints.ai_protocol = IPPROTO_UDP; + ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp); + if (ecode != 0) { + syslog(LOG_ERR, "getaddrinfo udp: %s", + gai_strerror(ecode)); + nfsd_exit(1); + } + nconf_udp = getnetconfigent("udp"); + if (nconf_udp == NULL) { + syslog(LOG_ERR, "getnetconfigent udp failed"); + nfsd_exit(1); + } + nb_udp.buf = ai_udp->ai_addr; + nb_udp.len = nb_udp.maxlen = ai_udp->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp, + &nb_udp)) { + syslog(LOG_ERR, "rpcb_set udp failed"); + nfsd_exit(1); + } + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp, + &nb_udp)) { + syslog(LOG_ERR, "rpcb_set udp failed"); + nfsd_exit(1); + } + freeaddrinfo(ai_udp); + } + } + + /* Set up the socket for udp6 and rpcb register it. 
*/ + if (udpflag && ip6flag) { + rpcbreg = 0; + for (i = 0; i < bindhostc; i++) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET6; + hints.ai_socktype = SOCK_DGRAM; + hints.ai_protocol = IPPROTO_UDP; + if (setbindhost(&ai_udp6, bindhost[i], hints) == 0) { + rpcbreg = 1; + rpcbregcnt++; + if ((sock = socket(ai_udp6->ai_family, + ai_udp6->ai_socktype, + ai_udp6->ai_protocol)) < 0) { + syslog(LOG_ERR, + "can't create udp6 socket"); + nfsd_exit(1); + } + if (setsockopt(sock, IPPROTO_IPV6, IPV6_V6ONLY, + &on, sizeof on) < 0) { + syslog(LOG_ERR, + "can't set v6-only binding for " + "udp6 socket: %m"); + nfsd_exit(1); + } + if (bind(sock, ai_udp6->ai_addr, + ai_udp6->ai_addrlen) < 0) { + syslog(LOG_ERR, + "can't bind udp6 addr %s: %m", + bindhost[i]); + nfsd_exit(1); + } + freeaddrinfo(ai_udp6); + addsockargs.sock = sock; + addsockargs.name = NULL; + addsockargs.namelen = 0; + if (nfssvc(nfssvc_addsock, &addsockargs) < 0) { + syslog(LOG_ERR, + "can't add UDP6 socket"); + nfsd_exit(1); + } + (void)close(sock); + } + } + if (rpcbreg == 1) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET6; + hints.ai_socktype = SOCK_DGRAM; + hints.ai_protocol = IPPROTO_UDP; + ecode = getaddrinfo(NULL, "nfs", &hints, &ai_udp6); + if (ecode != 0) { + syslog(LOG_ERR, "getaddrinfo udp6: %s", + gai_strerror(ecode)); + nfsd_exit(1); + } + nconf_udp6 = getnetconfigent("udp6"); + if (nconf_udp6 == NULL) { + syslog(LOG_ERR, "getnetconfigent udp6 failed"); + nfsd_exit(1); + } + nb_udp6.buf = ai_udp6->ai_addr; + nb_udp6.len = nb_udp6.maxlen = ai_udp6->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_udp6, + &nb_udp6)) { + syslog(LOG_ERR, + "rpcb_set udp6 failed"); + nfsd_exit(1); + } + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_udp6, + &nb_udp6)) { + syslog(LOG_ERR, + "rpcb_set udp6 failed"); + nfsd_exit(1); + } + freeaddrinfo(ai_udp6); + } + } + + /* Set up the socket for tcp and rpcb register it. 
*/ + if (tcpflag) { + rpcbreg = 0; + for (i = 0; i < bindhostc; i++) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET; + hints.ai_socktype = SOCK_STREAM; + hints.ai_protocol = IPPROTO_TCP; + if (setbindhost(&ai_tcp, bindhost[i], hints) == 0) { + rpcbreg = 1; + rpcbregcnt++; + if ((tcpsock = socket(AF_INET, SOCK_STREAM, + 0)) < 0) { + syslog(LOG_ERR, + "can't create tcp socket"); + nfsd_exit(1); + } + if (setsockopt(tcpsock, SOL_SOCKET, + SO_REUSEADDR, + (char *)&on, sizeof(on)) < 0) + syslog(LOG_ERR, + "setsockopt SO_REUSEADDR: %m"); + if (bind(tcpsock, ai_tcp->ai_addr, + ai_tcp->ai_addrlen) < 0) { + syslog(LOG_ERR, + "can't bind tcp addr %s: %m", + bindhost[i]); + nfsd_exit(1); + } + if (listen(tcpsock, -1) < 0) { + syslog(LOG_ERR, "listen failed"); + nfsd_exit(1); + } + freeaddrinfo(ai_tcp); + FD_SET(tcpsock, &sockbits); + maxsock = tcpsock; + connect_type_cnt++; + } + } + if (rpcbreg == 1) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET; + hints.ai_socktype = SOCK_STREAM; + hints.ai_protocol = IPPROTO_TCP; + ecode = getaddrinfo(NULL, "nfs", &hints, + &ai_tcp); + if (ecode != 0) { + syslog(LOG_ERR, "getaddrinfo tcp: %s", + gai_strerror(ecode)); + nfsd_exit(1); + } + nconf_tcp = getnetconfigent("tcp"); + if (nconf_tcp == NULL) { + syslog(LOG_ERR, "getnetconfigent tcp failed"); + nfsd_exit(1); + } + nb_tcp.buf = ai_tcp->ai_addr; + nb_tcp.len = nb_tcp.maxlen = ai_tcp->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp, + &nb_tcp)) { + syslog(LOG_ERR, "rpcb_set tcp failed"); + nfsd_exit(1); + } + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp, + &nb_tcp)) { + syslog(LOG_ERR, "rpcb_set tcp failed"); + nfsd_exit(1); + } + freeaddrinfo(ai_tcp); + } + } + + /* Set up the socket for tcp6 and rpcb register it. 
*/ + if (tcpflag && ip6flag) { + rpcbreg = 0; + for (i = 0; i < bindhostc; i++) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET6; + hints.ai_socktype = SOCK_STREAM; + hints.ai_protocol = IPPROTO_TCP; + if (setbindhost(&ai_tcp6, bindhost[i], hints) == 0) { + rpcbreg = 1; + rpcbregcnt++; + if ((tcp6sock = socket(ai_tcp6->ai_family, + ai_tcp6->ai_socktype, + ai_tcp6->ai_protocol)) < 0) { + syslog(LOG_ERR, + "can't create tcp6 socket"); + nfsd_exit(1); + } + if (setsockopt(tcp6sock, SOL_SOCKET, + SO_REUSEADDR, + (char *)&on, sizeof(on)) < 0) + syslog(LOG_ERR, + "setsockopt SO_REUSEADDR: %m"); + if (setsockopt(tcp6sock, IPPROTO_IPV6, + IPV6_V6ONLY, &on, sizeof on) < 0) { + syslog(LOG_ERR, + "can't set v6-only binding for tcp6 " + "socket: %m"); + nfsd_exit(1); + } + if (bind(tcp6sock, ai_tcp6->ai_addr, + ai_tcp6->ai_addrlen) < 0) { + syslog(LOG_ERR, + "can't bind tcp6 addr %s: %m", + bindhost[i]); + nfsd_exit(1); + } + if (listen(tcp6sock, -1) < 0) { + syslog(LOG_ERR, "listen failed"); + nfsd_exit(1); + } + freeaddrinfo(ai_tcp6); + FD_SET(tcp6sock, &sockbits); + if (maxsock < tcp6sock) + maxsock = tcp6sock; + connect_type_cnt++; + } + } + if (rpcbreg == 1) { + memset(&hints, 0, sizeof hints); + hints.ai_flags = AI_PASSIVE; + hints.ai_family = AF_INET6; + hints.ai_socktype = SOCK_STREAM; + hints.ai_protocol = IPPROTO_TCP; + ecode = getaddrinfo(NULL, "nfs", &hints, &ai_tcp6); + if (ecode != 0) { + syslog(LOG_ERR, "getaddrinfo tcp6: %s", + gai_strerror(ecode)); + nfsd_exit(1); + } + nconf_tcp6 = getnetconfigent("tcp6"); + if (nconf_tcp6 == NULL) { + syslog(LOG_ERR, "getnetconfigent tcp6 failed"); + nfsd_exit(1); + } + nb_tcp6.buf = ai_tcp6->ai_addr; + nb_tcp6.len = nb_tcp6.maxlen = ai_tcp6->ai_addrlen; + if (nfs_minvers == NFS_VER2) + if (!rpcb_set(NFS_PROGRAM, 2, nconf_tcp6, + &nb_tcp6)) { + syslog(LOG_ERR, "rpcb_set tcp6 failed"); + nfsd_exit(1); + } + if (nfs_minvers <= NFS_VER3) + if (!rpcb_set(NFS_PROGRAM, 3, nconf_tcp6, + &nb_tcp6)) { + syslog(LOG_ERR, "rpcb_set tcp6 failed"); + nfsd_exit(1); + } + freeaddrinfo(ai_tcp6); + } + } + + if (rpcbregcnt == 0) { + syslog(LOG_ERR, "rpcb_set() failed, nothing to do: %m"); + nfsd_exit(1); + } + + if (tcpflag && connect_type_cnt == 0) { + syslog(LOG_ERR, "tcp connects == 0, nothing to do: %m"); + nfsd_exit(1); + } + + setproctitle("master"); + /* + * We always want a master to have a clean way to shut nfsd down + * (with unregistration): if the master is killed, it unregisters and + * kills all children. If we run for UDP only (and so do not have to + * loop waiting for accept), we instead make the parent + * a "server" too. start_server will not return. + */ + if (!tcpflag) + start_server(1, &nfsdargs, vhostname); + + /* + * Loop forever accepting connections and passing the sockets + * into the kernel for the mounts. 
+ */ + for (;;) { + ready = sockbits; + if (connect_type_cnt > 1) { + if (select(maxsock + 1, + &ready, NULL, NULL, NULL) < 1) { + error = errno; + if (error == EINTR) + continue; + syslog(LOG_ERR, "select failed: %m"); + nfsd_exit(1); + } + } + for (tcpsock = 0; tcpsock <= maxsock; tcpsock++) { + if (FD_ISSET(tcpsock, &ready)) { + len = sizeof(peer); + if ((msgsock = accept(tcpsock, + (struct sockaddr *)&peer, &len)) < 0) { + error = errno; + syslog(LOG_ERR, "accept failed: %m"); + if (error == ECONNABORTED || + error == EINTR) + continue; + nfsd_exit(1); + } + if (setsockopt(msgsock, SOL_SOCKET, + SO_KEEPALIVE, (char *)&on, sizeof(on)) < 0) + syslog(LOG_ERR, + "setsockopt SO_KEEPALIVE: %m"); + addsockargs.sock = msgsock; + addsockargs.name = (caddr_t)&peer; + addsockargs.namelen = len; + nfssvc(nfssvc_addsock, &addsockargs); + (void)close(msgsock); + } + } + } +} + +static int +setbindhost(struct addrinfo **ai, const char *bindhost, struct addrinfo hints) +{ + int ecode; + u_int32_t host_addr[4]; /* IPv4 or IPv6 */ + const char *hostptr; + + if (bindhost == NULL || strcmp("*", bindhost) == 0) + hostptr = NULL; + else + hostptr = bindhost; + + if (hostptr != NULL) { + switch (hints.ai_family) { + case AF_INET: + if (inet_pton(AF_INET, hostptr, host_addr) == 1) { + hints.ai_flags = AI_NUMERICHOST; + } else { + if (inet_pton(AF_INET6, hostptr, + host_addr) == 1) + return (1); + } + break; + case AF_INET6: + if (inet_pton(AF_INET6, hostptr, host_addr) == 1) { + hints.ai_flags = AI_NUMERICHOST; + } else { + if (inet_pton(AF_INET, hostptr, + host_addr) == 1) + return (1); + } + break; + default: + break; + } + } + + ecode = getaddrinfo(hostptr, "nfs", &hints, ai); + if (ecode != 0) { + syslog(LOG_ERR, "getaddrinfo %s: %s", bindhost, + gai_strerror(ecode)); + return (1); + } + return (0); +} + +static void +set_nfsdcnt(int proposed) +{ + + if (proposed < 1) { + warnx("nfsd count too low %d; reset to %d", proposed, + DEFNFSDCNT); + nfsdcnt = DEFNFSDCNT; + } else if (proposed > MAXNFSDCNT) { + warnx("nfsd count too high %d; truncated to %d", proposed, + MAXNFSDCNT); + nfsdcnt = MAXNFSDCNT; + } else + nfsdcnt = proposed; + nfsdcnt_set = 1; +} + +static void +usage(void) +{ + (void)fprintf(stderr, "%s", getopt_usage); + exit(1); +} + +static void +nonfs(__unused int signo) +{ + syslog(LOG_ERR, "missing system call: NFS not available"); +} + +static void +reapchild(__unused int signo) +{ + pid_t pid; + int i; + + while ((pid = wait3(NULL, WNOHANG, NULL)) > 0) { + for (i = 0; i < nfsdcnt; i++) + if (pid == children[i]) + children[i] = -1; + } +} + +static void +unregistration(void) +{ + if ((nfs_minvers == NFS_VER2 && !rpcb_unset(NFS_PROGRAM, 2, NULL)) || + (nfs_minvers <= NFS_VER3 && !rpcb_unset(NFS_PROGRAM, 3, NULL))) + syslog(LOG_ERR, "rpcb_unset failed"); +} + +static void +killchildren(void) +{ + int i; + + for (i = 0; i < nfsdcnt; i++) { + if (children[i] > 0) + kill(children[i], SIGKILL); + } +} + +/* + * Cleanup master after SIGUSR1. + */ +static void +cleanup(__unused int signo) +{ + nfsd_exit(0); +} + +/* + * Cleanup child after SIGUSR1. 
+ */ +static void +child_cleanup(__unused int signo) +{ + exit(0); +} + +static void +nfsd_exit(int status) +{ + killchildren(); + unregistration(); + if (masterpidfh != NULL) + pidfile_remove(masterpidfh); + exit(status); +} + +static int +get_tuned_nfsdcount(void) +{ + int ncpu, error, tuned_nfsdcnt; + size_t ncpu_size; + + ncpu_size = sizeof(ncpu); + error = sysctlbyname("hw.ncpu", &ncpu, &ncpu_size, NULL, 0); + if (error) { + warnx("sysctlbyname(hw.ncpu) failed defaulting to %d nfs servers", + DEFNFSDCNT); + tuned_nfsdcnt = DEFNFSDCNT; + } else { + tuned_nfsdcnt = ncpu * 8; + } + return tuned_nfsdcnt; +} + +static void +start_server(int master, struct nfsd_nfsd_args *nfsdargp, const char *vhost) +{ + char principal[MAXHOSTNAMELEN + 5]; + int status, error; + char hostname[MAXHOSTNAMELEN + 1], *cp; + struct addrinfo *aip, hints; + + status = 0; + if (vhost == NULL) + gethostname(hostname, sizeof (hostname)); + else + strlcpy(hostname, vhost, sizeof (hostname)); + snprintf(principal, sizeof (principal), "nfs@%s", hostname); + if ((cp = strchr(hostname, '.')) == NULL || + *(cp + 1) == '\0') { + /* If not fully qualified, try getaddrinfo() */ + memset((void *)&hints, 0, sizeof (hints)); + hints.ai_flags = AI_CANONNAME; + error = getaddrinfo(hostname, NULL, &hints, &aip); + if (error == 0) { + if (aip->ai_canonname != NULL && + (cp = strchr(aip->ai_canonname, '.')) != + NULL && *(cp + 1) != '\0') + snprintf(principal, sizeof (principal), + "nfs@%s", aip->ai_canonname); + freeaddrinfo(aip); + } + } + nfsdargp->principal = principal; + + if (nfsdcnt_set) + nfsdargp->minthreads = nfsdargp->maxthreads = nfsdcnt; + else { + nfsdargp->minthreads = minthreads_set ? minthreads : get_tuned_nfsdcount(); + nfsdargp->maxthreads = maxthreads_set ? maxthreads : nfsdargp->minthreads; + if (nfsdargp->maxthreads < nfsdargp->minthreads) + nfsdargp->maxthreads = nfsdargp->minthreads; + } + error = nfssvc(nfssvc_nfsd, nfsdargp); + if (error < 0 && errno == EAUTH) { + /* + * This indicates that it could not register the + * rpcsec_gss credentials, usually because the + * gssd daemon isn't running. + * (only the experimental server with nfsv4) + */ + syslog(LOG_ERR, "No gssd, using AUTH_SYS only"); + principal[0] = '\0'; + error = nfssvc(nfssvc_nfsd, nfsdargp); + } + if (error < 0) { + if (errno == ENXIO) { + syslog(LOG_ERR, "Bad -p option, cannot run"); + if (masterpid != 0 && master == 0) + kill(masterpid, SIGUSR1); + } else + syslog(LOG_ERR, "nfssvc: %m"); + status = 1; + } + if (master) + nfsd_exit(status); + else + exit(status); +} + +/* + * Open the stable restart file and return the file descriptor for it. + */ +static void +open_stable(int *stable_fdp, int *backup_fdp) +{ + int stable_fd, backup_fd = -1, ret; + struct stat st, backup_st; + + /* Open and stat the stable restart file. */ + stable_fd = open(NFSD_STABLERESTART, O_RDWR, 0); + if (stable_fd < 0) + stable_fd = open(NFSD_STABLERESTART, O_RDWR | O_CREAT, 0600); + if (stable_fd >= 0) { + ret = fstat(stable_fd, &st); + if (ret < 0) { + close(stable_fd); + stable_fd = -1; + } + } + + /* Open and stat the backup stable restart file. 
*/ + if (stable_fd >= 0) { + backup_fd = open(NFSD_STABLEBACKUP, O_RDWR, 0); + if (backup_fd < 0) + backup_fd = open(NFSD_STABLEBACKUP, O_RDWR | O_CREAT, + 0600); + if (backup_fd >= 0) { + ret = fstat(backup_fd, &backup_st); + if (ret < 0) { + close(backup_fd); + backup_fd = -1; + } + } + if (backup_fd < 0) { + close(stable_fd); + stable_fd = -1; + } + } + + *stable_fdp = stable_fd; + *backup_fdp = backup_fd; + if (stable_fd < 0) + return; + + /* Sync up the 2 files, as required. */ + if (st.st_size > 0) + copy_stable(stable_fd, backup_fd); + else if (backup_st.st_size > 0) + copy_stable(backup_fd, stable_fd); +} + +/* + * Copy the stable restart file to the backup or vice versa. + */ +static void +copy_stable(int from_fd, int to_fd) +{ + int cnt, ret; + static char buf[1024]; + + ret = lseek(from_fd, (off_t)0, SEEK_SET); + if (ret >= 0) + ret = lseek(to_fd, (off_t)0, SEEK_SET); + if (ret >= 0) + ret = ftruncate(to_fd, (off_t)0); + if (ret >= 0) + do { + cnt = read(from_fd, buf, 1024); + if (cnt > 0) + ret = write(to_fd, buf, cnt); + else if (cnt < 0) + ret = cnt; + } while (cnt > 0 && ret >= 0); + if (ret >= 0) + ret = fsync(to_fd); + if (ret < 0) + syslog(LOG_ERR, "stable restart copy failure: %m"); +} + +/* + * Back up the stable restart file when indicated by the kernel. + */ +static void +backup_stable(__unused int signo) +{ + + if (stablefd >= 0) + copy_stable(stablefd, backupfd); +} + +/* + * Parse the pNFS string and extract the DS servers and ports numbers. + */ +static void +parse_dsserver(const char *optionarg, struct nfsd_nfsd_args *nfsdargp) +{ + char *cp, *cp2, *dsaddr, *dshost, *dspath, *dsvol, nfsprt[9]; + char *mdspath, *mdsp, ip6[INET6_ADDRSTRLEN]; + const char *ad; + int ecode; + u_int adsiz, dsaddrcnt, dshostcnt, dspathcnt, hostsiz, pathsiz; + u_int mdspathcnt; + size_t dsaddrsiz, dshostsiz, dspathsiz, nfsprtsiz, mdspathsiz; + struct addrinfo hints, *ai_tcp, *res; + struct sockaddr_in sin; + struct sockaddr_in6 sin6; + + cp = strdup(optionarg); + if (cp == NULL) + errx(1, "Out of memory"); + + /* Now, do the host names. */ + dspathsiz = 1024; + dspathcnt = 0; + dspath = malloc(dspathsiz); + if (dspath == NULL) + errx(1, "Out of memory"); + dshostsiz = 1024; + dshostcnt = 0; + dshost = malloc(dshostsiz); + if (dshost == NULL) + errx(1, "Out of memory"); + dsaddrsiz = 1024; + dsaddrcnt = 0; + dsaddr = malloc(dsaddrsiz); + if (dsaddr == NULL) + errx(1, "Out of memory"); + mdspathsiz = 1024; + mdspathcnt = 0; + mdspath = malloc(mdspathsiz); + if (mdspath == NULL) + errx(1, "Out of memory"); + + /* Put the NFS port# in "." form. */ + snprintf(nfsprt, 9, ".%d.%d", 2049 >> 8, 2049 & 0xff); + nfsprtsiz = strlen(nfsprt); + + ai_tcp = NULL; + /* Loop around for each DS server name. */ + do { + cp2 = strchr(cp, ','); + if (cp2 != NULL) { + /* Not the last DS in the list. */ + *cp2++ = '\0'; + if (*cp2 == '\0') + usage(); + } + + dsvol = strchr(cp, ':'); + if (dsvol == NULL || *(dsvol + 1) == '\0') + usage(); + *dsvol++ = '\0'; + + /* Optional path for MDS file system to be stored on DS. */ + mdsp = strchr(dsvol, '#'); + if (mdsp != NULL) { + if (*(mdsp + 1) == '\0' || mdsp <= dsvol) + usage(); + *mdsp++ = '\0'; + } + + /* Append this pathname to dspath. */ + pathsiz = strlen(dsvol); + if (dspathcnt + pathsiz + 1 > dspathsiz) { + dspathsiz *= 2; + dspath = realloc(dspath, dspathsiz); + if (dspath == NULL) + errx(1, "Out of memory"); + } + strcpy(&dspath[dspathcnt], dsvol); + dspathcnt += pathsiz + 1; + + /* Append this pathname to mdspath. 
*/ + if (mdsp != NULL) + pathsiz = strlen(mdsp); + else + pathsiz = 0; + if (mdspathcnt + pathsiz + 1 > mdspathsiz) { + mdspathsiz *= 2; + mdspath = realloc(mdspath, mdspathsiz); + if (mdspath == NULL) + errx(1, "Out of memory"); + } + if (mdsp != NULL) + strcpy(&mdspath[mdspathcnt], mdsp); + else + mdspath[mdspathcnt] = '\0'; + mdspathcnt += pathsiz + 1; + + if (ai_tcp != NULL) + freeaddrinfo(ai_tcp); + + /* Get the fully qualified domain name and IP address. */ + memset(&hints, 0, sizeof(hints)); + hints.ai_flags = AI_CANONNAME | AI_ADDRCONFIG; + hints.ai_family = PF_UNSPEC; + hints.ai_socktype = SOCK_STREAM; + hints.ai_protocol = IPPROTO_TCP; + ecode = getaddrinfo(cp, NULL, &hints, &ai_tcp); + if (ecode != 0) + err(1, "getaddrinfo pnfs: %s %s", cp, + gai_strerror(ecode)); + ad = NULL; + for (res = ai_tcp; res != NULL; res = res->ai_next) { + if (res->ai_addr->sa_family == AF_INET) { + if (res->ai_addrlen < sizeof(sin)) + err(1, "getaddrinfo() returned " + "undersized IPv4 address"); + /* + * Mips cares about sockaddr_in alignment, + * so copy the address. + */ + memcpy(&sin, res->ai_addr, sizeof(sin)); + ad = inet_ntoa(sin.sin_addr); + break; + } else if (res->ai_family == AF_INET6) { + if (res->ai_addrlen < sizeof(sin6)) + err(1, "getaddrinfo() returned " + "undersized IPv6 address"); + /* + * Mips cares about sockaddr_in6 alignment, + * so copy the address. + */ + memcpy(&sin6, res->ai_addr, sizeof(sin6)); + ad = inet_ntop(AF_INET6, &sin6.sin6_addr, ip6, + sizeof(ip6)); + + /* + * XXX + * Since a link local address will only + * work if the client and DS are in the + * same scope zone, only use it if it is + * the only address. + */ + if (ad != NULL && + !IN6_IS_ADDR_LINKLOCAL(&sin6.sin6_addr)) + break; + } + } + if (ad == NULL) + err(1, "No IP address for %s", cp); + + /* Append this address to dsaddr. */ + adsiz = strlen(ad); + if (dsaddrcnt + adsiz + nfsprtsiz + 1 > dsaddrsiz) { + dsaddrsiz *= 2; + dsaddr = realloc(dsaddr, dsaddrsiz); + if (dsaddr == NULL) + errx(1, "Out of memory"); + } + strcpy(&dsaddr[dsaddrcnt], ad); + strcat(&dsaddr[dsaddrcnt], nfsprt); + dsaddrcnt += adsiz + nfsprtsiz + 1; + + /* Append this hostname to dshost. */ + hostsiz = strlen(ai_tcp->ai_canonname); + if (dshostcnt + hostsiz + 1 > dshostsiz) { + dshostsiz *= 2; + dshost = realloc(dshost, dshostsiz); + if (dshost == NULL) + errx(1, "Out of memory"); + } + strcpy(&dshost[dshostcnt], ai_tcp->ai_canonname); + dshostcnt += hostsiz + 1; + + cp = cp2; + } while (cp != NULL); + + nfsdargp->addr = dsaddr; + nfsdargp->addrlen = dsaddrcnt; + nfsdargp->dnshost = dshost; + nfsdargp->dnshostlen = dshostcnt; + nfsdargp->dspath = dspath; + nfsdargp->dspathlen = dspathcnt; + nfsdargp->mdspath = mdspath; + nfsdargp->mdspathlen = mdspathcnt; + freeaddrinfo(ai_tcp); +} + diff --git a/usr.sbin/nfsd/nfsv4.4 b/usr.sbin/nfsd/nfsv4.4 new file mode 100644 index 000000000000..e96e507e23ad --- /dev/null +++ b/usr.sbin/nfsd/nfsv4.4 @@ -0,0 +1,379 @@ +.\" Copyright (c) 2009 Rick Macklem, University of Guelph +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. 
Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.Dd January 8, 2024 +.Dt NFSV4 4 +.Os +.Sh NAME +.Nm NFSv4 +.Nd NFS Version 4 Protocol +.Sh DESCRIPTION +The NFS client and server provides support for the +.Tn NFSv4 +specification; see +.%T "Network File System (NFS) Version 4 Protocol RFC 7530" , +.%T "Network File System (NFS) Version 4 Minor Version 1 Protocol RFC 5661" , +.%T "Network File System (NFS) Version 4 Minor Version 2 Protocol RFC 7862" , +.%T "File System Extended Attributes in NFSv4 RFC 8276" and +.%T "Parallel NFS (pNFS) Flexible File Layout RFC 8435" . +The protocol is somewhat similar to NFS Version 3, but differs in significant +ways. +It uses a single compound RPC that concatenates operations to-gether. +Each of these operations are similar to the RPCs of NFS Version 3. +The operations in the compound are performed in order, until one of +them fails (returns an error) and then the RPC terminates at that point. +.Pp +It has +integrated locking support, which implies that the server is no longer +stateless. +As such, the +.Nm +server remains in recovery mode for a grace period (always greater than the +lease duration the server uses) after a reboot. +During this grace period, clients may recover state but not perform other +open/lock state changing operations. +To provide for correct recovery semantics, a small file described by +.Xr stablerestart 5 +is used by the server during the recovery phase. +If this file is missing or empty, there is a backup copy maintained by +.Xr nfsd 8 +that will be used. +If either file is missing, they will be created by the +.Xr nfsd 8 . +If both the file and the backup copy are empty, +it will result in the server starting without providing a grace period +for recovery. +Note that recovery only occurs when the server +machine is rebooted, not when the +.Xr nfsd 8 +are just restarted. 
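As a concrete illustration, the stable restart file and the backup copy maintained by nfsd(8) are the paths listed in the FILES section of this page; if both are emptied or removed, the next reboot provides no recovery grace period:

    # Inspect the NFSv4 stable restart state (paths from stablerestart(5))
    ls -l /var/db/nfs-stablerestart /var/db/nfs-stablerestart.bak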
+.Pp +It provides several optional features not present in NFS Version 3: +.sp +.Bd -literal -offset indent -compact +- NFS Version 4 ACLs +- Referrals, which redirect subtrees to other servers + (not yet implemented) +- Delegations, which allow a client to operate on a file locally +- pNFS, where I/O operations are separated from Metadata operations +And for NFSv4.2 only +- User namespace extended attributes +- lseek(SEEK_DATA/SEEK_HOLE) +- File copying done locally on the server for copy_file_range(2) +- posix_fallocate(2) +- posix_fadvise(POSIX_FADV_WILLNEED/POSIX_FADV_DONTNEED) +.Ed +.Pp +The +.Nm +protocol does not use a separate mount protocol and assumes that the +server provides a single file system tree structure, rooted at the point +in the local file system tree specified by one or more +.sp 1 +.Bd -literal -offset indent -compact +V4: <rootdir> [-sec=secflavors] [host(s) or net] +.Ed +.sp 1 +line(s) in the +.Xr exports 5 +file. +(See +.Xr exports 5 +for details.) +The +.Xr nfsd 8 +allows a limited subset of operations to be performed on non-exported subtrees +of the local file system, so that traversal of the tree to the exported +subtrees is possible. +As such, the ``<rootdir>'' can be in a non-exported file system. +The exception is ZFS, which checks exports and, as such, all ZFS file systems +below the ``<rootdir>'' must be exported. +However, +the entire tree that is rooted at that point must be in local file systems +that are of types that can be NFS exported. +Since the +.Nm +file system is rooted at ``<rootdir>'', setting this to anything other +than ``/'' will result in clients being required to use different mount +paths for +.Nm +than for NFS Version 2 or 3. +Unlike NFS Version 2 and 3, Version 4 allows a client mount to span across +multiple server file systems, although not all clients are capable of doing +this. +.Pp +.Nm +uses strings for users and groups instead of numbers. +On the wire, these strings can either have the numbers in the string or +take the form: +.sp +.Bd -literal -offset indent -compact +<user>@<dns.domain> +.Ed +.sp +where ``<dns.domain>'' is not the same as the DNS domain used +for host name lookups, but is usually set to the same string. +Most systems set this ``<dns.domain>'' +to the domain name part of the machine's +.Xr hostname 1 +by default. +However, this can normally be overridden by a command line +option or configuration file for the daemon used to do the name<->number +mapping. +Under +.Fx , +the mapping daemon is called +.Xr nfsuserd 8 +and has a command line option that overrides the domain component of the +machine's hostname. +For use of this form of string on +.Nm , +either client or server, this daemon must be running. +.Pp +The form where the numbers are in the strings can only be used for AUTH_SYS. +To configure your systems this way, the +.Xr nfsuserd 8 +daemon does not need to be running on the server, but the following sysctls +need to be set to 1 on the server. +.sp +.Bd -literal -offset indent -compact +vfs.nfs.enable_uidtostring +vfs.nfsd.enable_stringtouid +.Ed +.sp +On the client, the sysctl +.sp +.Bd -literal -offset indent -compact +vfs.nfs.enable_uidtostring +.Ed +.sp +must be set to 1 and the +.Xr nfsuserd 8 +daemon does not need to be running. +.Pp +If these strings are not configured correctly, ``ls -l'' will typically +report a lot of ``nobody'' and ``nogroup'' ownerships. 
+.Pp
+Although uid/gid numbers are no longer used in the
+.Nm
+protocol except optionally in the above strings, they will still be in the RPC
+authentication fields when using AUTH_SYS (sec=sys), which is the default.
+As such, in this case both the user/group name and number spaces must
+be consistent between the client and server.
+.Pp
+However, if you run
+.Nm
+with RPCSEC_GSS (sec=krb5, krb5i, krb5p), only names and KerberosV tickets
+will go on the wire.
+.Sh SERVER SETUP
+To set up the NFS server that supports
+.Nm ,
+you will need to set the variables in
+.Xr rc.conf 5
+as follows:
+.sp
+.Bd -literal -offset indent -compact
+nfs_server_enable="YES"
+nfsv4_server_enable="YES"
+.Ed
+.sp
+plus
+.sp
+.Bd -literal -offset indent -compact
+nfsuserd_enable="YES"
+.Ed
+.sp
+if the server is using the ``<user>@<domain>'' form of user/group strings or
+is using the ``-manage-gids'' option for
+.Xr nfsuserd 8 .
+.Pp
+In addition, you can set:
+.sp
+.Bd -literal -offset indent -compact
+nfsv4_server_only="YES"
+.Ed
+.sp
+to disable support for NFSv2 and NFSv3.
+.Pp
+You will also need to add at least one ``V4:'' line to the
+.Xr exports 5
+file for
+.Nm
+to work.
+.Pp
+If the file systems you are exporting are only being accessed via
+.Nm ,
+there are a couple of
+.Xr sysctl 8
+variables that you can change, which might improve performance.
+.Bl -tag -width Ds
+.It Cm vfs.nfsd.issue_delegations
+when set non-zero, allows the server to issue Open Delegations to
+clients.
+These delegations permit the client to manipulate the file
+locally.
+Unfortunately, at this time, client use of
+delegations is limited, so performance gains may not be observed.
+This can only be enabled when the file systems being exported to
+.Nm
+clients are not being accessed locally on the server and, if being
+accessed via NFS Version 2 or 3 clients, these clients cannot be
+using the NLM.
+.It Cm vfs.nfsd.enable_locallocks
+can be set to 0 to disable acquisition of local byte range locks.
+Disabling local locking can only be done if neither local accesses
+to the exported file systems nor the NLM is operating on them.
+.El
+.sp
+Note that Samba server access would be considered ``local access'' for the above
+discussion.
+.Pp
+To build a kernel with the NFS server that supports
+.Nm
+linked into it, the
+.sp
+.Bd -literal -offset indent -compact
+options NFSD
+.Ed
+.sp
+must be specified in the kernel's
+.Xr config 5
+file.
+.Sh CLIENT MOUNTS
+To do an
+.Nm
+mount, specify the ``nfsv4'' option on the
+.Xr mount_nfs 8
+command line.
+This forces use of the client that supports
+.Nm
+and also sets the ``tcp'' option.
+.Pp
+The
+.Xr nfsuserd 8
+daemon must be running if name<->uid/gid mapping is being used, as above.
+Also, since an
+.Nm
+mount uses the host uuid to identify the client uniquely to the server,
+you cannot safely do an
+.Nm
+mount when
+.sp
+.Bd -literal -offset indent -compact
+hostid_enable="NO"
+.Ed
+.sp
+is set in
+.Xr rc.conf 5 .
+.sp
+If the
+.Nm
+server that is being mounted on supports delegations, you can start the
+.Xr nfscbd 8
+daemon to handle client side callbacks.
+This will occur if
+.sp
+.Bd -literal -offset indent -compact
+nfsuserd_enable="YES" <-- If name<->uid/gid mapping is being used.
+nfscbd_enable="YES"
+.Ed
+.sp
+are set in
+.Xr rc.conf 5 .
+.sp
+Without a functioning callback path, a server will never issue Delegations
+to a client.
+.sp
+For NFSv4.0, by default, the callback address will be set to the IP address
+acquired via
+.Fn rtalloc
+in the kernel and port# 7745.
+To override the default port#, a command line option for
+.Xr nfscbd 8
+can be used.
+.sp
+To get callbacks to work when behind a NAT gateway, a port for the callback
+service will need to be set up on the NAT gateway and then the address
+of the NAT gateway (host IP plus port#) will need to be set by assigning the
+.Xr sysctl 8
+variable vfs.nfs.callback_addr to a string of the form:
+.sp
+N.N.N.N.N.N
+.sp
+where the first 4 Ns are the host IP address and the last two are the
+port# in network byte order (all decimal #s in the range 0-255).
+.Pp
+For NFSv4.1 and NFSv4.2, the callback path (called a backchannel) uses the
+same TCP connection as the mount, so none of the above applies, and the
+backchannel should work through gateways without any issues.
+.Pp
+To build a kernel with the client that supports
+.Nm
+linked into it, the option
+.sp
+.Bd -literal -offset indent -compact
+options NFSCL
+.Ed
+.sp
+must be specified in the kernel's
+.Xr config 5
+file.
+.Pp
+Options can be specified for the
+.Xr nfsuserd 8
+and
+.Xr nfscbd 8
+daemons at boot time via the ``nfsuserd_flags'' and ``nfscbd_flags''
+.Xr rc.conf 5
+variables.
+.Pp
+NFSv4 mount(s) against exported volume(s) on the same host are not recommended,
+since this can result in a hung NFS server.
+The hang occurs when an nfsd thread tries to do an NFSv4
+.Fn VOP_RECLAIM
+/ Close RPC as part of acquiring a new vnode.
+If all other nfsd threads are blocked waiting for lock(s) held by this nfsd
+thread, then there is no nfsd thread to service the Close RPC.
+.Sh FILES
+.Bl -tag -width /var/db/nfs-stablerestart.bak -compact
+.It Pa /var/db/nfs-stablerestart
+NFS V4 stable restart file
+.It Pa /var/db/nfs-stablerestart.bak
+backup copy of the file
+.El
+.Sh SEE ALSO
+.Xr stablerestart 5 ,
+.Xr mountd 8 ,
+.Xr nfscbd 8 ,
+.Xr nfsd 8 ,
+.Xr nfsdumpstate 8 ,
+.Xr nfsrevoke 8 ,
+.Xr nfsuserd 8
+.Sh BUGS
+At this time, there is no recall of delegations for local file system
+operations.
+As such, delegations should only be enabled for file systems
+that are being used solely as NFS export volumes and are not being accessed
+via local system calls or services such as Samba.
diff --git a/usr.sbin/nfsd/pnfs.4 b/usr.sbin/nfsd/pnfs.4
new file mode 100644
index 000000000000..babd221a6d5a
--- /dev/null
+++ b/usr.sbin/nfsd/pnfs.4
@@ -0,0 +1,228 @@
+.\" Copyright (c) 2017 Rick Macklem
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd December 20, 2019
+.Dt PNFS 4
+.Os
+.Sh NAME
+.Nm pNFS
+.Nd NFS Version 4.1 and 4.2 Parallel NFS Protocol
+.Sh DESCRIPTION
+The NFSv4.1 and NFSv4.2 client and server provide support for the
+.Tn pNFS
+specification; see
+.%T "Network File System (NFS) Version 4 Minor Version 1 Protocol RFC 5661" ,
+.%T "Network File System (NFS) Version 4 Minor Version 2 Protocol RFC 7862" and
+.%T "Parallel NFS (pNFS) Flexible File Layout RFC 8435" .
+A pNFS service separates Read/Write operations from all other NFSv4.1 and
+NFSv4.2 operations, which are referred to as Metadata operations.
+The Read/Write operations are performed directly on the Data Server (DS)
+where the file's data resides, bypassing the NFS server.
+All other file operations are performed on the NFS server, which is referred to
+as a Metadata Server (MDS).
+NFS clients that do not support
+.Tn pNFS
+perform Read/Write operations on the MDS, which acts as a proxy for the
+appropriate DS(s).
+.Pp
+The NFSv4.1 and NFSv4.2 protocols provide two pieces of information to pNFS
+aware clients that allow them to perform Read/Write operations directly on
+the DS.
+.Pp
+The first is DeviceInfo, which is static information defining the DS
+server.
+The critical piece of information in DeviceInfo for the layout types
+supported by
+.Fx
+is the IP address that is used to perform RPCs on the DS.
+It also indicates which version of NFS the DS supports, the I/O size and other
+layout-specific information.
+In the DeviceInfo, there is a DeviceID which, for the
+.Fx
+server,
+is unique to the DS configuration
+and changes whenever the
+.Xr nfsd 8
+daemon is restarted or the server is rebooted.
+.Pp
+The second is the layout, which is per file and references the DeviceInfo
+to use via the DeviceID.
+It is for a byte range of a file and is either Read or Read/Write.
+For the
+.Fx
+server, a layout covers all bytes of a file.
+A layout may be recalled by the MDS using a LayoutRecall callback.
+When a client returns a layout via the LayoutReturn operation, it can
+indicate that error(s) were encountered while doing I/O on the DS,
+at least for certain layout types such as the Flexible File Layout.
+.Pp
+The
+.Fx
+client and server support two layout types.
+.Pp
+The File Layout is described in RFC 5661 and uses the NFSv4.1 or NFSv4.2 protocol
+to perform I/O on the DS.
+It does not support client-aware DS mirroring and, as such,
+the
+.Fx
+server only provides File Layout support for non-mirrored
+configurations.
+.Pp
+The Flexible File Layout allows the use of the NFSv3, NFSv4.0, NFSv4.1 or
+NFSv4.2 protocol to perform I/O on the DS and does support client-aware
+mirroring.
+As such, the
+.Fx
+server uses Flexible File Layout layouts for the
+mirrored DS configurations.
+The
+.Fx
+server supports the
+.Dq tightly coupled
+variant, and all DSs allow use of the
+NFSv4.2 or NFSv4.1 protocol for I/O operations.
+Clients that support the Flexible File Layout will do writes and commits
+to all DS mirrors in the mirror set.
+.Pp
+A
+.Fx
+pNFS service consists of a single MDS server plus one or more
+DS servers, all of which are
+.Fx
+systems.
+For a non-mirrored configuration, the
+.Fx
+server will issue File Layout
+layouts by default.
+However, that default can be set to the Flexible File Layout by setting the
+.Xr sysctl 8
+variable
+.Dq vfs.nfsd.default_flexfile
+to one.
+Mirrored server configurations will only issue Flexible File Layouts.
+.Tn pNFS
+clients mount the MDS as they would a single NFS server.
+.Pp
+A
+.Fx
+.Tn pNFS
+client must be running the
+.Xr nfscbd 8
+daemon and use the mount options
+.Dq nfsv4,minorversion=2,pnfs
+or
+.Dq nfsv4,minorversion=1,pnfs .
+.Pp
+When files are created, the MDS creates a file tree identical to what a
+single NFS server creates, except that all the regular (VREG) files will
+be empty.
+As such, if you look at the exported tree directly
+on the MDS server (not via an NFS mount), the files will all be of size zero.
+Each of these files will also have two extended attributes in the system
+attribute name space:
+.Bd -literal -offset indent
+pnfsd.dsfile - This extended attribute stores the information that the
+ MDS needs to find the data file on a DS(s) for this file.
+pnfsd.dsattr - This extended attribute stores the Size, AccessTime,
+ ModifyTime, Change and SpaceUsed attributes for the file.
+.Ed
+.Pp
+For each regular (VREG) file, the MDS creates a data file on one
+(or on N of them for the mirrored case, where N is the mirror_level)
+of the DS(s) where the file's data will be stored.
+The name of this file is
+the file handle of the file on the MDS, in hexadecimal, at the time of file
+creation.
+The data file will have the same file ownership, mode and NFSv4 ACL
+(if ACLs are enabled for the file system) as the file on the MDS, so that
+permission checking can be done on the DS.
+This is referred to as
+.Dq tightly coupled
+for the Flexible File Layout.
+.Pp
+For
+.Tn pNFS
+aware clients, the service generates File Layout
+or Flexible File Layout
+layouts and associated DeviceInfo.
+For non-pNFS aware NFS clients, the pNFS service appears just like a normal
+NFS service.
+For the non-pNFS aware client, the MDS will perform I/O operations on the
+appropriate DS(s), acting as
+a proxy for the non-pNFS aware client.
+This is also true for NFSv3 and NFSv4.0 mounts, since these are always non-pNFS
+aware.
+.Pp
+It is possible to assign a DS to an MDS exported file system so that it will
+store data for files on the MDS exported file system.
+If a DS is not assigned to an MDS exported file system, it will store data
+for files on all exported file systems on the MDS.
+.Pp
+If mirroring is enabled, the pNFS service will continue to function when
+DS(s) have failed, so long as there is at least one DS still operational
+that stores data for files on all of the MDS exported file systems.
+After a disabled mirrored DS is repaired, it is possible to recover the DS
+as a mirror while the pNFS service continues to function.
+.Pp
+See
+.Xr pnfsserver 4
+for information on how to set up a
+.Fx
+pNFS service.
+.Sh SEE ALSO
+.Xr nfsv4 4 ,
+.Xr pnfsserver 4 ,
+.Xr exports 5 ,
+.Xr fstab 5 ,
+.Xr rc.conf 5 ,
+.Xr nfscbd 8 ,
+.Xr nfsd 8 ,
+.Xr nfsuserd 8 ,
+.Xr pnfsdscopymr 8 ,
+.Xr pnfsdsfile 8 ,
+.Xr pnfsdskill 8
+.Sh BUGS
+Linux kernel versions prior to 4.12 only support NFSv3 DSs in the client
+and will do all I/O through the MDS.
+For Linux 4.12 kernels, support for NFSv4.1 DSs was added, but I have seen
+Linux client crashes when testing this client.
+For Linux 4.17-rc2 kernels, I have not seen client crashes during testing,
+but it only supports the
+.Dq loosely coupled
+variant.
+To make it work correctly when mounting the
+.Fx
+server, you must
+set the sysctl
+.Dq vfs.nfsd.flexlinuxhack
+to one so that the server works around
+the Linux client driver's limitations.
+Without this sysctl being set, there will be access errors, since the Linux
+client will use the authenticator in the layout (uid=999, gid=999) and not
+the authenticator specified in the RPC header.
+.Pp
+Linux 5.n kernels appear to be patched so that they use the authenticator
+in the RPC header and, as such, the above sysctl should not need to be set.
+.Pp
+Since the MDS cannot be mirrored, it is a single point of failure just
+as a non
+.Tn pNFS
+server is.
diff --git a/usr.sbin/nfsd/pnfsserver.4 b/usr.sbin/nfsd/pnfsserver.4
new file mode 100644
index 000000000000..7a2ddc4e85c0
--- /dev/null
+++ b/usr.sbin/nfsd/pnfsserver.4
@@ -0,0 +1,444 @@
+.\" Copyright (c) 2018 Rick Macklem
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd December 20, 2019
+.Dt PNFSSERVER 4
+.Os
+.Sh NAME
+.Nm pNFSserver
+.Nd NFS Version 4.1 and 4.2 Parallel NFS Protocol Server
+.Sh DESCRIPTION
+A set of
+.Fx
+servers may be configured to provide a
+.Xr pnfs 4
+service.
+One
+.Fx
+system needs to be configured as a MetaData Server (MDS) and
+at least one additional
+.Fx
+system needs to be configured as one or
+more Data Servers (DSs).
+.Pp
+These
+.Fx
+systems are configured to be NFSv4.1 and NFSv4.2
+servers; see
+.Xr nfsd 8
+and
+.Xr exports 5
+if you are not familiar with configuring an NFSv4.n server.
+All DS(s) and the MDS should support NFSv4.2 as well as NFSv4.1.
+Mixing an MDS that supports NFSv4.2 with any DS(s) that do not support
+NFSv4.2 will not work correctly.
+As such, all DS(s) must be upgraded from
+.Fx 12
+to
+.Fx 13
+before upgrading the MDS.
+.Sh DS server configuration
+The DS(s) need to be configured as NFSv4.1 and NFSv4.2 server(s),
+with a top level exported
+directory used for storage of data files.
+This directory must be owned by
+.Dq root
+and would normally have a mode of
+.Dq 700 .
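+For example, assuming
+.Pa /ds
+is the top level exported directory (a path used here only for
+illustration), it could be created with:
+.Bd -literal -offset indent
+mkdir -m 700 /ds
+.Ed
+.sp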
+Within this directory there needs to be additional directories named
+ds0,...,dsN (where N is 19 by default), also owned by
+.Dq root
+with mode
+.Dq 700 .
+These are the directories where the data files are stored.
+The following command can be run by root when in the top level exported
+directory to create these subdirectories.
+.Bd -literal -offset indent
+jot -w ds 20 0 | xargs mkdir -m 700
+.Ed
+.sp
+Note that
+.Dq 20
+is the default and can be set to a larger value on the MDS as shown below.
+.sp
+The top level exported directory used for storage of data files must be
+exported to the MDS with the
+.Dq maproot=root sec=sys
+export options so that the MDS can create entries in these subdirectories.
+It must also be exported to all pNFS aware clients, but these clients do
+not require the
+.Dq maproot=root
+export option and this directory should be exported to them with the same
+options as used by the MDS to export file system(s) to the clients.
+.Pp
+It is possible to have multiple DSs on the same
+.Fx
+system, but each
+of these DSs must have a separate top level exported directory used for storage
+of data files and each
+of these DSs must be mountable via a separate IP address.
+Alias addresses can be set on the DS server system for a network
+interface via
+.Xr ifconfig 8
+to create these different IP addresses.
+Multiple DSs on the same server may be useful when data for different file systems
+on the MDS are being stored on different file system volumes on the
+.Fx
+DS system.
+.Sh MDS server configuration
+The MDS must be a separate
+.Fx
+system from the
+.Fx
+DS system(s) and
+NFS clients.
+It is configured as an NFSv4.1 and NFSv4.2 server with
+file system(s) exported to clients.
+However, the
+.Dq -p
+command line argument for
+.Xr nfsd 8
+is used to indicate that it is running as the MDS for a pNFS server.
+.Pp
+The DS(s) must all be mounted on the MDS using the following mount options:
+.Bd -literal -offset indent
+nfsv4,minorversion=2,soft,retrans=2
+.Ed
+.sp
+so that they can be defined as DSs in the
+.Dq -p
+option.
+Normally these mounts would be entered in the
+.Xr fstab 5
+file on the MDS.
+For example, if there are four DSs named nfsv4-data[0-3], the
+.Xr fstab 5
+lines might look like:
+.Bd -literal -offset indent
+nfsv4-data0:/ /data0 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+nfsv4-data1:/ /data1 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+nfsv4-data2:/ /data2 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+nfsv4-data3:/ /data3 nfs rw,nfsv4,minorversion=2,soft,retrans=2 0 0
+.Ed
+.sp
+The
+.Xr nfsd 8
+command line option
+.Dq -p
+indicates that the NFS server is a pNFS MDS and specifies what
+DSs are to be used.
+.br
+For the above
+.Xr fstab 5
+example, the
+.Xr nfsd 8
+.Dq nfs_server_flags
+line in your
+.Xr rc.conf 5
+might look like:
+.Bd -literal -offset indent
+nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3"
+.Ed
+.sp
+This example specifies that the data files should be distributed over the
+four DSs and that File layouts will be issued to pNFS enabled clients.
+If issuing Flexible File layouts is desired for this case, setting the sysctl
+.Dq vfs.nfsd.default_flexfile
+non-zero in your
+.Xr sysctl.conf 5
+file will make the
+.Nm
+do that.
+.br
+Alternatively, this variant of
+.Dq nfs_server_flags
+will specify that two-way mirroring is to be done, via the
+.Dq -m
+command line option.
+.Bd -literal -offset indent
+nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3 -m 2"
+.Ed
+.sp
+With two-way mirroring, the data file for each exported file on the MDS
+will be stored on two of the DSs.
+When mirroring is enabled, the server will always issue Flexible File layouts.
+.Pp
+It is also possible to specify which DSs are to be used to store data files for
+specific exported file systems on the MDS.
+For example, if the MDS has exported two file systems
+.Dq /export1
+and
+.Dq /export2
+to clients, the following variant of
+.Dq nfs_server_flags
+will specify that data files for
+.Dq /export1
+will be stored on nfsv4-data0 and nfsv4-data1, whereas the data files for
+.Dq /export2
+will be stored on nfsv4-data2 and nfsv4-data3.
+.Bd -literal -offset indent
+nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export1,nfsv4-data2:/data2#/export2,nfsv4-data3:/data3#/export2"
+.Ed
+.sp
+This can be used by system administrators to control where data files are
+stored and might be useful for control of storage use.
+For this case, it may be convenient to co-locate more than one of the DSs
+on the same
+.Fx
+server, using separate file systems on the DS system
+for storage of the respective DS's data files.
+If mirroring is desired for this case, the
+.Dq -m
+option also needs to be specified.
+There must be enough DSs assigned to each exported file system on the MDS
+to support the level of mirroring.
+The above example would be fine for two-way mirroring, but four-way mirroring
+would not work, since there are only two DSs assigned to each exported file
+system on the MDS.
+.Pp
+The number of subdirectories in each DS is defined by the
+.Dq vfs.nfs.dsdirsize
+sysctl on the MDS.
+This value can be increased from the default of 20, but only when the
+.Xr nfsd 8
+is not running and after the additional ds20,... subdirectories have been
+created on all the DSs.
+For a service that will store a large number of files, this sysctl should be
+set much larger, to keep the number of entries in a subdirectory from
+getting too large.
+.Sh Client mounts
+Once operational, NFSv4.1 or NFSv4.2
+.Fx
+client mounts
+done with the
+.Dq pnfs
+option should do I/O directly on the DSs.
+The clients mounting the MDS must be running the
+.Xr nfscbd 8
+daemon for pNFS to work.
+Set
+.Bd -literal -offset indent
+nfscbd_enable="YES"
+.Ed
+.sp
+in the
+.Xr rc.conf 5
+on these clients.
+Non-pNFS aware clients or NFSv3 mounts will do all I/O RPCs on the MDS,
+which acts as a proxy for the appropriate DS(s).
+.Sh Backing up a pNFS service
+Since the data is separated from the metadata, the simplest way to back up
+a pNFS service is to do so from an NFS client that has the service mounted
+on it.
+If you back up the MDS exported file system(s) on the MDS, you must do it
+in such a way that the
+.Dq system
+namespace extended attributes get backed up.
+.Sh Handling of failed mirrored DSs
+When a mirrored DS fails, it can be disabled in one of three ways:
+.sp
+1 - The MDS detects a problem when trying to do proxy
+operations on the DS.
+This can take a couple of minutes
+after the DS failure or network partitioning occurs.
+.sp
+2 - A pNFS client can report an I/O error that occurred for a DS to the MDS in
+the arguments for a LayoutReturn operation.
+.sp
+3 - The system administrator can run the pnfsdskill(8) command on the MDS
+to disable it.
+If the system administrator runs pnfsdskill(8) and it fails with ENXIO
+(Device not configured), that normally means the DS was already
+disabled via #1 or #2.
+Since running the command is harmless, once a system administrator knows that
+there is a problem with a mirrored DS, doing so is recommended.
+.sp
+Once a system administrator knows that a mirrored DS has malfunctioned
+or has been network partitioned, they should do the following as root/su
+on the MDS:
+.Bd -literal -offset indent
+# pnfsdskill <mounted-on-path-of-DS>
+# umount -N <mounted-on-path-of-DS>
+.Ed
+.sp
+Note that the <mounted-on-path-of-DS> must be the exact mounted-on path
+string used when the DS was mounted on the MDS.
+.Pp
+Once the mirrored DS has been disabled, the pNFS service should continue to
+function, but file updates will only happen on the DS(s) that have not been disabled.
+Assuming two-way mirroring, that implies updates will only happen on the
+remaining operational DS of the pair listed in the
+.Dq pnfsd.dsfile
+extended attribute on the MDS, for files stored on the disabled DS.
+.Pp
+The next step is to clear the IP address in the
+.Dq pnfsd.dsfile
+extended attribute on all files on the MDS for the failed DS.
+This is done so that, when the disabled DS is repaired and brought back online,
+the data files on this DS will not be used, since they may be out of date.
+The command that clears the IP address is
+.Xr pnfsdsfile 8
+with the
+.Dq -r
+option.
+For example:
+.Bd -literal -offset indent
+# pnfsdsfile -r nfsv4-data3 yyy.c
+yyy.c: nfsv4-data2.home.rick ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000 0.0.0.0 ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
+.Ed
+.sp
+replaces nfsv4-data3 with an IPv4 address of 0.0.0.0, so that nfsv4-data3
+will not get used.
+.Pp
+Normally this will be called within a
+.Xr find 1
+command for all regular
+files in the exported directory tree and must be done on the MDS.
+When used with
+.Xr find 1 ,
+you will probably also want the
+.Dq -q
+option so that it does not print the results for every file.
+If the disabled/repaired DS is nfsv4-data3, the commands done on the MDS
+would be:
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdsfile -q -r nfsv4-data3 {} \;
+.Ed
+.sp
+There is a problem with the above command if the file found by
+.Xr find 1
+is renamed or unlinked before the
+.Xr pnfsdsfile 8
+command is done on it.
+This should normally generate an error message.
+A simple unlink is harmless,
+but a link/unlink or rename might result in the file not having been processed
+under its new name.
+To check that all files have their IP addresses set to 0.0.0.0, these
+commands can be used (assuming the
+.Xr sh 1
+shell):
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdsfile {} \; | sed "/nfsv4-data3/!d"
+.Ed
+.sp
+Any line(s) printed require the
+.Xr pnfsdsfile 8
+command with
+.Dq -r
+to be run again.
+Once this is done, the replaced/repaired DS can be brought back online.
+It should have empty ds0,...,dsN directories under the top level exported
+directory for storage of data files just like it did when first set up.
+Mount it on the MDS exactly as you did before disabling it.
+For the nfsv4-data3 example, the command would be:
+.Bd -literal -offset indent
+# mount -t nfs -o nfsv4,minorversion=2,soft,retrans=2 nfsv4-data3:/ /data3
+.Ed
+.sp
+Then restart the nfsd to re-enable the DS.
+.Bd -literal -offset indent
+# /etc/rc.d/nfsd restart
+.Ed
+.sp
+Now, new files can be stored on nfsv4-data3,
+but files with the IP address zeroed out on the MDS will not yet use the
+repaired DS (nfsv4-data3).
+The next step is to go through the exported file tree on the MDS and,
+for each of the
+files with an IPv4 address of 0.0.0.0 in its extended attribute, copy the file
+data to the repaired DS and re-enable use of this mirror for it.
+The command for copying the file data for one MDS file is
+.Xr pnfsdscopymr 8 ,
+and it will also normally be used in a
+.Xr find 1
+command.
+For the example case, the commands on the MDS would be:
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdscopymr -r /data3 {} \;
+.Ed
+.sp
+When this completes, the recovery should be complete or at least nearly so.
+As noted above, if a link/unlink or rename occurs on a file name while the
+above
+.Xr find 1
+is in progress, it may not get copied.
+To check for any file(s) not yet copied, the commands are:
+.Bd -literal -offset indent
+# cd <top-level-exported-dir>
+# find . -type f -exec pnfsdsfile {} \; | sed "/0\.0\.0\.0/!d"
+.Ed
+.sp
+If this command prints out any file name(s), these files must
+have the
+.Xr pnfsdscopymr 8
+command run on them to complete the recovery.
+.Bd -literal -offset indent
+# pnfsdscopymr -r /data3 <file-path-reported>
+.Ed
+.sp
+If this command fails with the error
+.br
+.Dq pnfsdscopymr: Copymr failed for file <path>: Device not configured
+.br
+repeatedly, this may be caused by a Read/Write layout that has not
+been returned.
+The only way to get rid of such a layout is to restart the
+.Xr nfsd 8 .
+.sp
+All of these commands are designed to be
+done while the pNFS service is running and can be re-run safely.
+.Pp
+For a more detailed discussion of the setup and management of a pNFS service,
+see:
+.Bd -literal -offset indent
+https://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
+.Ed
+.sp
+.Sh SEE ALSO
+.Xr nfsv4 4 ,
+.Xr pnfs 4 ,
+.Xr exports 5 ,
+.Xr fstab 5 ,
+.Xr rc.conf 5 ,
+.Xr sysctl.conf 5 ,
+.Xr nfscbd 8 ,
+.Xr nfsd 8 ,
+.Xr nfsuserd 8 ,
+.Xr pnfsdscopymr 8 ,
+.Xr pnfsdsfile 8 ,
+.Xr pnfsdskill 8
+.Sh HISTORY
+The
+.Nm
+service first appeared in
+.Fx 12.0 .
+.Sh BUGS
+Since the MDS cannot be mirrored, it is a single point of failure just
+as a non
+.Tn pNFS
+server is.
+For non-mirrored configurations, all
+.Fx
+systems used in the service
+are single points of failure.
diff --git a/usr.sbin/nfsd/stablerestart.5 b/usr.sbin/nfsd/stablerestart.5
new file mode 100644
index 000000000000..0d93b0487f09
--- /dev/null
+++ b/usr.sbin/nfsd/stablerestart.5
@@ -0,0 +1,94 @@
+.\" Copyright (c) 2009 Rick Macklem, University of Guelph
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.Dd April 10, 2011
+.Dt STABLERESTART 5
+.Os
+.Sh NAME
+.Nm nfs-stablerestart
+.Nd restart information for the
+.Tn NFSv4
+server
+.Sh SYNOPSIS
+.Nm nfs-stablerestart
+.Sh DESCRIPTION
+The
+.Nm
+file holds information that allows the
+.Tn NFSv4
+server to restart without always returning the NFSERR_NOGRACE error, as described in the
+.Tn NFSv4
+server specification; see
+.%T "Network File System (NFS) Version 4 Protocol RFC 3530, Section 8.6.3" .
+.Pp
+The first record in the file, as defined by struct nfsf_rec in
+/usr/include/fs/nfs/nfsrvstate.h, holds the lease duration of the
+last incarnation of the server and the number of boot times that follow.
+Following this record are the previous boot times, the count of which is
+given in the first record.
+The lease duration is used to set the grace period.
+The boot times
+are used to avoid the unlikely occurrence of a boot time being reused,
+due to a TOD clock going backwards.
+This record, along with the previous boot times plus the current boot time,
+is rewritten at the end of the grace period.
+.Pp
+The rest of the file consists of appended records, as defined by
+struct nfst_rec in /usr/include/fs/nfs/nfsrvstate.h, which are used to
+represent one of two things.
+There are records which indicate that a
+client successfully acquired state and records that indicate a client's state was revoked.
+State revoke records indicate that state information
+for a client was discarded, due to lease expiry and an otherwise
+conflicting open or lock request being made by a different client.
+These records can be used to determine if clients might have encountered
+either of the edge conditions.
+.Pp
+If a client might have encountered either edge condition or this file is
+empty or corrupted, the server returns NFSERR_NOGRACE for any reclaim
+request from the client.
+.Pp
+For correct operation of the server, it must be ensured that the file
+is written to stable storage by the time a write op with IO_SYNC specified has returned.
+This might require hardware level write caching to be disabled for
+a local disk drive that holds the file, or similar.
+.Sh FILES
+.Bl -tag -width /var/db/nfs-stablerestart.bak -compact
+.It Pa /var/db/nfs-stablerestart
+NFSv4 stable restart file
+.It Pa /var/db/nfs-stablerestart.bak
+backup copy of the file
+.El
+.Sh SEE ALSO
+.Xr nfsv4 4 ,
+.Xr nfsd 8
+.Sh BUGS
+If the file is empty, the NFSv4 server has no choice but to return
+NFSERR_NOGRACE for all reclaim requests.
+Although correct, this is a highly undesirable occurrence, so the file should not be lost if
+at all possible.
+The backup copy of the file is maintained and used by the
+.Xr nfsd 8
+to minimize the risk of this occurring.
+To move the file, you must edit the nfsd sources and recompile them.
+This was done to discourage accidental relocation of the file.
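+.Sh EXAMPLES
+Assuming the layout described above, where the first record holds the lease
+duration followed by the number of boot times as two 32-bit unsigned
+integers in host byte order (an assumption that should be verified against
+struct nfsf_rec in /usr/include/fs/nfs/nfsrvstate.h), the first record can
+be inspected with:
+.Bd -literal -offset indent
+od -N 8 -t u4 /var/db/nfs-stablerestart
+.Ed
+.sp
+The two values printed would then be the lease duration in seconds and the
+count of boot times that follow.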
