path: root/sys/dev/nvme
Commit message | Author | Age | Files | Lines
...
* In all the places that we use the polled for completion interface, except crash | Warner Losh | 2019-09-02 | 3 | -16/+16
  dump support code, move the while loop into an inline function. These aren't done in the fast path, so if the compiler chooses not to inline, any performance hit is tiny.
  Notes: svn path=/head/; revision=351705
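
The refactoring described in this entry can be pictured with a small userland sketch. It is not the driver's code: `struct poll_status`, `completion_poll`, and the `fake_queue` helper are hypothetical names, and C11 atomics stand in for the kernel's atomic primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical completion status; the completion callback sets done to 1. */
struct poll_status {
	atomic_int done;
};

/* Hypothetical hook that performs one pass over a completion queue. */
typedef void (*process_fn)(void *queue);

/*
 * The polling loop, written once as an inline helper instead of being
 * repeated at every call site. It keeps processing completions until the
 * callback marks the request as done.
 */
static inline void
completion_poll(struct poll_status *status, process_fn process, void *queue)
{
	assert(status != NULL && process != NULL);
	while (atomic_load_explicit(&status->done, memory_order_acquire) == 0)
		process(queue);
}

/* Fake queue used only to exercise the helper: completes after N passes. */
struct fake_queue {
	struct poll_status *status;
	int passes_left;
	int passes_done;
};

static void
fake_process(void *arg)
{
	struct fake_queue *q = arg;

	q->passes_done++;
	if (--q->passes_left == 0)
		atomic_store_explicit(&q->status->done, 1, memory_order_release);
}
```

Callers outside the fast path would invoke `completion_poll()` right after submitting a request, so whether the compiler inlines it makes little difference.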
* Add a brief comment explaining why we can return ETIMEDOUT from the call to the | Warner Losh | 2019-09-02 | 1 | -0/+8
  polled interface. Normally this would have the potential to corrupt stack memory because the completion routines would run after we return. In this case, however, we're doing a dump so it's safe for reasons explained in the comment.
  Notes: svn path=/head/; revision=351704
* It turns out the duplication is only mostly harmless. | Warner Losh | 2019-08-23 | 3 | -4/+16
  While it worked with the kernel, it wasn't working with the loader. It failed to handle dependencies correctly. The reason for that is that we never created an nvme module with the DRIVER_MODULE, but instead a nvme_pci and nvme_ahci module. Create a real nvme module that nvd can be dependent on so it can import the nvme symbols it needs from there.
  Arguably, nvd should just be a simple child of nvme, but transitioning to that (and winning that argument given why it was done this way) is beyond the scope of this change.
  Reviewed by: jhb@
  Differential Revision: https://reviews.freebsd.org/D21382
  Notes: svn path=/head/; revision=351447
* When we have errors resetting the device before we allocate the | Warner Losh | 2019-08-22 | 1 | -6/+8
  queues, don't try to tear them down in the ctrlr_destroy path. Otherwise, we dereference queue structures that are NULL and we trap.
  This fix is incomplete: we leak IRQ and MSI resources when this happens. That's preferable to a crash but still should be fixed.
  Notes: svn path=/head/; revision=351411
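
The guard this entry describes amounts to checking that the queues exist before tearing them down. A minimal sketch, with hypothetical structure and function names that do not come from the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the controller and its I/O queue pairs. */
struct qpair {
	int id;
};

struct ctrlr {
	struct qpair *ioq;	/* NULL until the queues have been allocated */
	size_t num_io_queues;
};

/*
 * Tear down the I/O queues and return how many were released. If an early
 * reset failure left the queues unallocated, return without touching them
 * rather than dereferencing a NULL pointer.
 */
static size_t
ctrlr_destroy_queues(struct ctrlr *c)
{
	size_t released;

	assert(c != NULL);
	if (c->ioq == NULL)
		return (0);
	released = c->num_io_queues;
	free(c->ioq);
	c->ioq = NULL;
	c->num_io_queues = 0;
	return (released);
}
```

The sketch covers only the queue memory; the IRQ and MSI resources that the entry says are still leaked are outside its scope.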
* We need to define version 1 of nvme, not nvme_foo. Otherwise nvd won't | Warner Losh | 2019-08-22 | 2 | -2/+4
  load and people who pull in nvme/nvd from modules can't load nvd.ko since it depends on nvme, not nvme_foo. The duplicate doesn't matter since kldxref properly handles that case.
  Notes: svn path=/head/; revision=351406
* Move releasing of resources to later | Warner Losh | 2019-08-22 | 1 | -1/+3
  Turn off bus master after we detach the device (to match the prior order). Release MSI after we're done detaching and have turned off all the interrupts. Otherwise this may cause problems as other threads race nvme_detach. This more closely matches the old order.
  Reviewed by: mav@
  Notes: svn path=/head/; revision=351403
* Remove stray line that was duplicated. | Warner Losh | 2019-08-22 | 1 | -1/+0
  Noticed by: rpokala@
  Notes: svn path=/head/; revision=351376
* Create an AHCI attachment for nvme. | Warner Losh | 2019-08-21 | 1 | -0/+127
  Intel has created RST, and many laptops from vendors like Lenovo and Asus use it. It's a mechanism for creating multiple boot devices under Windows. It effectively hides the nvme drive inside of the ahci controller. The details are supposed to be a trade secret. However, there's a reverse engineered Linux driver, and this implements similar operations to allow nvme drives to attach. The ahci driver attaches nvme children that proxy the remapped resources to the child. nvme_ahci is just like nvme_pci, except it doesn't do the PCI specific things. That's moved into ahci where appropriate.
  When the nvme drive is remapped, MSI-x interrupts aren't forwarded (the Linux driver doesn't know how to use this either). INTx interrupts are used instead. This is suboptimal, but usually sufficient for the laptops these parts are in.
  This is based loosely on https://www.spinics.net/lists/linux-ide/msg53364.html submitted to, but not accepted by, Linux. It was written by Dan Williams. These changes were written from scratch by Olivier Houchard.
  Submitted by: cognet@ (Olivier Houchard)
  Notes: svn path=/head/; revision=351356
* Separate the pci attachment from the rest of nvme | Warner Losh | 2019-08-21 | 4 | -303/+346
  Nvme drives can be attached in a number of different ways. Separate out the PCI attachment so that we can have other attachment types, like ahci and various types of NVMeoF.
  Submitted by: cognet@
  Notes: svn path=/head/; revision=351355
* Improve NVMe hot unplug handling. | Alexander Motin | 2019-08-21 | 2 | -20/+41
  If a device is unplugged from the system (CSTS register reads return 0xffffffff), it makes no sense to send any more recovery requests or expect any responses back. If there is a detach call in such a state, just stop all activity and free resources. If there is no detach call (hot-plug is not supported), rely on normal timeout handling, but when it triggers a controller reset, do not wait for the impossible; report failure quickly.
  MFC after: 2 weeks
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=351352
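
The unplug check in this entry rests on a single observation: reads from a removed device return all ones. A minimal sketch of that decision, assuming a hypothetical `fake_ctrlr` structure with a cached CSTS value (none of these names are taken from the driver):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A register read from a device that is no longer present returns all ones. */
#define CSTS_ALL_ONES	0xffffffffu

/* Hypothetical controller state, reduced to what the sketch needs. */
struct fake_ctrlr {
	uint32_t csts;			/* last value read from CSTS */
	bool is_failed;			/* set once the controller is given up on */
	unsigned resets_attempted;	/* number of resets started */
};

static bool
ctrlr_is_gone(const struct fake_ctrlr *c)
{
	assert(c != NULL);
	return (c->csts == CSTS_ALL_ONES);
}

/*
 * Timeout handling: a controller that is still present gets a reset, while
 * one that has been unplugged is failed immediately instead of waiting for
 * a response that can never arrive.
 */
static void
ctrlr_handle_timeout(struct fake_ctrlr *c)
{
	if (ctrlr_is_gone(c)) {
		c->is_failed = true;
		return;
	}
	c->resets_attempted++;
}
```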
* Formalize NVMe controller consumer life cycle. | Alexander Motin | 2019-08-21 | 1 | -9/+23
  This fixes possible double call of fail_fn, for example on hot removal. It also allows ctrlr_fn to safely return NULL cookie in case of failure and not get useless ns_fn or fail_fn call with NULL cookie later.
  MFC after: 2 weeks
  Notes: svn path=/head/; revision=351320
* Report NOIOB and NPWG fields as stripe size. | Alexander Motin | 2019-08-14 | 2 | -24/+30
  Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace Preferred Write Granularity added in 1.4 allow upper layers to align I/Os for improved SSD performance and endurance.
  I don't have hardware reporting those yet, but NPWG could probably be reported by bhyve.
  MFC after: 2 weeks
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=351028
* Add `nvmecontrol resv` to handle NVMe reservations. | Alexander Motin | 2019-08-05 | 1 | -0/+80
  NVMe reservations are quite similar to SCSI persistent reservations and can be used in clustered setups with shared multiport storage.
  MFC after: 10 days
  Relnotes: yes
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=350599
* Add more random bits from NVMe 1.4. | Alexander Motin | 2019-08-03 | 3 | -28/+82
  MFC after: 2 weeks
  Notes: svn path=/head/; revision=350553
* Decode a few more NVMe log pages. | Alexander Motin | 2019-08-02 | 2 | -0/+122
  In particular: Changed Namespace List, Commands Supported and Effects, Reservation Notification, Sanitize Status.
  Add a few new arguments to `nvmecontrol log` subcommand.
  MFC after: 2 weeks
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=350541
* Fix typo in r350529. | Alexander Motin | 2019-08-02 | 1 | -1/+1
  MFC after: 2 weeks
  Notes: svn path=/head/; revision=350530
* Add more new fields and values from NVMe 1.4. | Alexander Motin | 2019-08-02 | 2 | -7/+70
  MFC after: 2 weeks
  Notes: svn path=/head/; revision=350529
* Add IOCTL to translate nvdX into nvmeY and NSID. | Alexander Motin | 2019-08-01 | 3 | -0/+32
  While very useful by itself, it also makes `nvmecontrol` not depend on hardcoded device name parsing, which in turn makes it simple to take nvdX (and potentially any other) device names as arguments. Also, the added IOCTL bypass from nvdX to the respective nvmeYnsZ makes them interchangeable for management purposes.
  MFC after: 2 weeks
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=350523
* Add some new fields and bits from NVMe 1.4. | Alexander Motin | 2019-07-29 | 1 | -8/+146
  MFC after: 2 weeks
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=350399
* Widen the type for to. | Warner Losh | 2019-07-25 | 1 | -1/+1
  The timeout field in the CAPS register is defined to be 8 bits, so its type was uint8_t. We recently started adding 1 to it to cope with rogue devices that listed 0 timeout time (which is impossible). However, in so doing, other devices that list 0xff (for a 2 minute timeout) were broken when adding 1 overflowed. Widen the type to be uint32_t like its source register to avoid the issue.
  Reported by: bapt@
  Notes: svn path=/head/; revision=350333
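
The overflow described in this entry can be reproduced in a few lines. This is a minimal sketch rather than the driver's code: the function names are hypothetical, and the multiplication by 500 reflects the NVMe specification's definition of the CAP.TO field in units of 500 ms (which is why 0xff corresponds to roughly two minutes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pre-fix behaviour: the incremented value is kept in 8 bits. */
static uint32_t
ready_timeout_ms_narrow(uint8_t cap_to)
{
	uint8_t to = (uint8_t)(cap_to + 1);	/* 0xff + 1 wraps around to 0 */

	return (to * 500u);
}

/* Hypothetical post-fix behaviour: the incremented value is kept in 32 bits. */
static uint32_t
ready_timeout_ms_wide(uint8_t cap_to)
{
	uint32_t to = (uint32_t)cap_to + 1;	/* 0xff + 1 is 256, no wrap */

	assert(to >= 1 && to <= 256);
	return (to * 500u);
}
```

With the narrow type, a device reporting 0xff ends up with a timeout of zero, while a device reporting 0 gets 500 ms in both versions, so only the wide variant handles both ends of the range.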
* Keep track of the number of commands that exhaust their retry limit. | Warner Losh | 2019-07-19 | 3 | -3/+31
  While we print failure messages on the console, sometimes logs are lost or overwhelmed. Keeping a count of how many times we've failed retriable commands helps gauge the magnitude of the problem.
  Notes: svn path=/head/; revision=350147
* Keep track of the number of retried commands. | Warner Losh | 2019-07-19 | 3 | -0/+27
  Retried commands can indicate a performance degradation of an nvme drive. Keep track of the number of retries and report it out via sysctl, just like number of commands and interrupts.
  Notes: svn path=/head/; revision=350146
* Use sysctl + CTLRWTUN for hw.nvme.verbose_cmd_dump. | Warner Losh | 2019-07-19 | 3 | -3/+5
  Also convert it to a bool. While the rest of the driver isn't yet bool clean, this will help.
  Reviewed by: cem@
  Differential Revision: https://reviews.freebsd.org/D20988
  Notes: svn path=/head/; revision=350120
* Provide new tunable hw.nvme.verbose_cmd_dump | Warner Losh | 2019-07-18 | 3 | -0/+14
  The nvme driver dumps only the most relevant details about a command when it fails. However, there are times this is not sufficient (such as debugging weird issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1 in loader.conf will enable more complete debugging information about each command that fails.
  Reviewed by: rpokala
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D20988
  Notes: svn path=/head/; revision=350118
* Provide macros to extract the sub-fields of the CAP_LO and CAP_HI registers. | Warner Losh | 2019-07-18 | 2 | -4/+20
  These macros make places where we extract these easier to read. The shift and mask stuff is also a bit tedious and error prone. Start with the CAP_LO and CAP_HI registers since their scope is somewhat constrained. This is a style change only, no functional changes.
  Reviewed by: chuck
  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D20979
  Notes: svn path=/head/; revision=350114
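
A sketch of what such extraction macros can look like follows. The macro and structure names are illustrative assumptions and are not copied from the driver's headers; the bit positions follow the NVMe specification's layout of the 64-bit CAP register, split here into its low and high 32-bit halves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative shift/mask pairs for a few CAP sub-fields. */
#define CAP_LO_MQES_SHIFT	0	/* CAP bits 15:0, max queue entries */
#define CAP_LO_MQES_MASK	0xffffu
#define CAP_LO_TO_SHIFT		24	/* CAP bits 31:24, timeout */
#define CAP_LO_TO_MASK		0xffu
#define CAP_HI_DSTRD_SHIFT	0	/* CAP bits 35:32, doorbell stride */
#define CAP_HI_DSTRD_MASK	0xfu
#define CAP_HI_MPSMIN_SHIFT	16	/* CAP bits 51:48, min page size */
#define CAP_HI_MPSMIN_MASK	0xfu

/* One accessor per field keeps the shift and mask out of the call sites. */
#define CAP_LO_MQES(x)	 (((x) >> CAP_LO_MQES_SHIFT) & CAP_LO_MQES_MASK)
#define CAP_LO_TO(x)	 (((x) >> CAP_LO_TO_SHIFT) & CAP_LO_TO_MASK)
#define CAP_HI_DSTRD(x)	 (((x) >> CAP_HI_DSTRD_SHIFT) & CAP_HI_DSTRD_MASK)
#define CAP_HI_MPSMIN(x) (((x) >> CAP_HI_MPSMIN_SHIFT) & CAP_HI_MPSMIN_MASK)

/* Hypothetical container for the decoded values. */
struct cap_fields {
	uint32_t mqes;
	uint32_t to;
	uint32_t dstrd;
	uint32_t mpsmin;
};

static void
cap_decode(uint32_t cap_lo, uint32_t cap_hi, struct cap_fields *out)
{
	assert(out != NULL);
	out->mqes = CAP_LO_MQES(cap_lo);
	out->to = CAP_LO_TO(cap_lo);
	out->dstrd = CAP_HI_DSTRD(cap_hi);
	out->mpsmin = CAP_HI_MPSMIN(cap_hi);
}
```

Compared with open-coded `(cap_lo >> 24) & 0xff` expressions, the named accessors make each call site say which field it reads.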
* Remove now-obsolete comment. | Warner Losh | 2019-07-17 | 1 | -2/+1
  Notes: svn path=/head/; revision=350094
* Assume that the timeout value from the capacity is 1-based | Warner Losh | 2019-07-16 | 1 | -1/+1
  Neither the 1.3 nor the 1.4 standard says this number is 1-based, but adding 1 costs little and cheaply copes with those NVMe drives that report '0' in this field. This is consistent with what the Linux driver does as well.
  Notes: svn path=/head/; revision=350068
* Fix nda(4) PCIe link status output | Chuck Tuffli | 2019-06-07 | 1 | -6/+10
  Differentiate between PCI Express Endpoint devices and Root Complex Integrated Endpoints in the nda driver. The Link Status and Capability registers are not valid for Integrated Endpoints and should not be displayed. The bhyve emulated NVMe device will advertise as being an Integrated Endpoint.
  Reviewed by: imp
  Approved by: imp (mentor)
  Differential Revision: https://reviews.freebsd.org/D20282
  Notes: svn path=/head/; revision=348786
* Since a fatal trap can happen at arbitrary times, don't panic when the | Warner Losh | 2019-06-01 | 1 | -13/+66
  completions are not in a consistent state. Cope with the different places the normal I/O completion polling thread can be interrupted and then re-entered during a kernel panic + dump.
  Reviewed by: jhb and markj (both prior versions)
  Differential Revision: https://reviews.freebsd.org/D20478
  Notes: svn path=/head/; revision=348495
* rename nvme_ctrlr_destroy_qpair to nvme_ctrlr_destroy_qpairs | Warner Losh | 2019-05-08 | 1 | -19/+24
  Maintain symmetry with nvme_ctrlr_create_qpairs, making it easier to match init/uninit scenarios.
  Signed-off-by: John Meneghini <johnm@netapp.com>
  Submitted by: Michael Hordijk <hordijk@netapp.com>
  Reviewed by: imp
  Differential Revision: https://reviews.freebsd.org/D19781
  Notes: svn path=/head/; revision=347369
* Decode Deallocate Logical Block Features. | Alexander Motin | 2019-05-05 | 1 | -0/+14
  MFC after: 1 week
  Notes: svn path=/head/; revision=347158
* Don't print all the I/O we abort on a reset, unless we're out of | Warner Losh | 2019-03-09 | 3 | -18/+20
  retries.
  When resetting the controller, we abort I/O. Prior to this fix, we printed a ton of abort messages for I/O that we're going to retry. This imparts no useful information. Stop printing them unless our retry count is exhausted. Clarify code for when we don't retry, and remove useless arg to a routine that's always called with it as 'true'.
  All the other debug is still printed (including multiple reset messages if we have multiple timeouts before the taskqueue runs the actual reset) so that we know when we reset.
  Reviewed by: jimharris@, chuck@
  Differential Revision: https://reviews.freebsd.org/D19431
  Notes: svn path=/head/; revision=344955
* Add ABORTED_BY_REQUEST to the list of things we look at DNR bit and tell why to comment (code already does this) | Warner Losh | 2019-03-03 | 1 | -1/+2
  Notes: svn path=/head/; revision=344736
* Unconditionally support unmapped BIOs. This was another shim for | Warner Losh | 2019-02-27 | 4 | -34/+1
  supporting older kernels. However, all supported versions of FreeBSD have unmapped I/Os (as do several that have gone EOL), remove it. It's unlikely the driver would work on the older kernels anyway at this point.
  Notes: svn path=/head/; revision=344642
* Remove #ifdef code to support FreeBSD versions that haven't been | Warner Losh | 2019-02-27 | 3 | -33/+0
  supported in years. A number of changes have been made to the driver that likely wouldn't work on those older versions that aren't properly ifdef'd and it's project policy to GC such code once it is stale.
  Notes: svn path=/head/; revision=344640
* Regularize the Netflix copyright | Warner Losh | 2019-02-04 | 1 | -1/+1
  Use recent best practices for Copyright form at the top of the license:
  1. Remove all the All Rights Reserved clauses on our stuff. Where we piggybacked others, use a separate line to make things clear.
  2. Use "Netflix, Inc." everywhere.
  3. Use a single line for the copyright for grep friendliness.
  4. Use date ranges in all places for our stuff.
  Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
  Notes: svn path=/head/; revision=343755
* Allocate pager bufs from UMA instead of 80-ish mutex protected linked list. | Gleb Smirnoff | 2019-01-15 | 1 | -2/+2
  o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many pbufs are we going to have set. In various subsystems that are going to utilize pbufs create private zones via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(), and sets a limit on created zone. After startup preallocate pbufs according to requirements of all pbuf zones.
    Subsystems that used to have a private limit with old allocator now have private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS, swap, vnode pager.
    The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9), aio(4). They should have their private limits, but changing that is out of scope of this commit.
  o Fetch tunable value of kern.nswbuf from init_param2() and while here move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only this option. Default values aren't touched by this commit, but they probably should be reviewed with respect to modern hardware.
  This change removes a tight bottleneck from sendfile(2) operation, that uses pbufs in vnode pager. Other pagers also would benefit from faster allocation.
  Together with: gallatin
  Tested by: pho
  Notes: svn path=/head/; revision=343030
* Add NVMe drive to NOIOB quirk list | Chuck Tuffli | 2019-01-08 | 1 | -0/+1
  Dell-branded Intel P4600 NVMe drives benefit from NVMe 1.3's NOIOB feature. Unfortunately just like Intel DC P4500s, they don't advertise themselves as benefiting from this...
  This change adds P4600s to the existing list of old drives which benefit from striping.
  PR: 233969
  Submitted by: David Fugate <dave.fugate@gmail.com>
  Reviewed by: imp, mav
  Approved by: imp (mentor)
  MFC after: 1 week
  Differential Revision: https://reviews.freebsd.org/D18772
  Notes: svn path=/head/; revision=342862
* Add descriptions to NVMe interrupts. | Alexander Motin | 2018-12-26 | 1 | -0/+7
  MFC after: 1 month
  Notes: svn path=/head/; revision=342546
* Remove CAM SIM lock from NVMe SIM. | Alexander Motin | 2018-12-24 | 1 | -11/+1
  CAM does not require SIM lock since FreeBSD 10.4, and NVMe code never required it at all, using per-queue locks instead. This formally allows parallel request submission in CAM mode as much as single per-device and per-queue locks of CAM allow.
  MFC after: 1 month
  Notes: svn path=/head/; revision=342399
* nda(4) fix check for Dataset Management support | Chuck Tuffli | 2018-12-13 | 2 | -5/+8
  In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support BIO_DELETE) if the drive supports Dataset Management. There are reports that without this check, VMWare Workstation does not work reliably.
  Fix is to check the ONCS field in the NVMe Controller Data structure for support. This check previously existed but did not survive the big-endian changes.
  Reported by: yuripv@yuripv.net
  Reviewed by: imp, mav, jimharris
  Approved by: imp (mentor)
  MFC after: 2 weeks
  Differential Revision: https://reviews.freebsd.org/D18493
  Notes: svn path=/head/; revision=342046
* Even though they are reserved, cdw2 and cdw3 can be set via nvme-cli | Warner Losh | 2018-12-07 | 1 | -0/+2
  (and soon nvmecontrol). Go ahead and copy them into rsvd2 and rsvd3.
  Sponsored by: Netflix
  Notes: svn path=/head/; revision=341710
* Remove do-nothing nvme_modevent. | Warner Losh | 2018-11-16 | 1 | -30/+1
  nvme_modevent no longer does anything interesting, remove it.
  Sponsored by: Netflix
  Notes: svn path=/head/; revision=340481
* Use atomic_load_acq_int() here too to poll done, ala r328521 | Warner Losh | 2018-11-13 | 1 | -3/+3
  Notes: svn path=/head/; revision=340412
* Put a workaround in for command timeout malfunctioning | Warner Losh | 2018-10-26 | 2 | -1/+22
  At least one NVMe drive has a bug that makes the Command Time Out PCIe feature unreliable. The workaround is to disable this feature. The driver wouldn't deal correctly with a timeout anyway. Only do this for drives that are known bad.
  Sponsored by: Netflix, Inc
  Differential Revision: https://reviews.freebsd.org/D17708
  Notes: svn path=/head/; revision=339775
* Make NVMe compatible with the original API | Chuck Tuffli | 2018-08-22 | 6 | -36/+26
  The original NVMe API used bit-fields to represent fields in data structures defined by the specification (e.g. the op-code in the command data structure). The implementation targeted x86_64 processors and defined the bit fields for little endian dwords (i.e. 32 bits).
  This approach does not work as-is for big endian architectures and was changed to use a combination of bit shifts and masks to support PowerPC. Unfortunately, this changed the NVMe API and forces #ifdef's based on the OS revision level in user space code.
  This change reverts to something that looks like the original API, but it uses bytes instead of bit-fields inside the packed command structure. As a bonus, this works as-is for both big and little endian CPU architectures.
  Bump __FreeBSD_version to 1200081 due to API change
  Reviewed by: imp, kbowling, smh, mav
  Approved by: imp (mentor)
  Differential Revision: https://reviews.freebsd.org/D16404
  Notes: svn path=/head/; revision=338182
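
The byte-oriented layout this entry describes can be illustrated with the first command dword. The sketch below is an assumption-laden illustration, not the driver's definition: the structure and helper names are hypothetical, and the field positions (opcode in byte 0, fused-operation bits in byte 1, command identifier in bytes 2 and 3) follow the NVMe specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical view of command dword 0 using whole bytes instead of
 * bit-fields. Byte-sized members occupy the same offsets on big and little
 * endian CPUs, which bit-fields packed into a 32-bit word do not.
 */
struct cmd_dw0 {
	uint8_t opc;	/* opcode, byte 0 */
	uint8_t fuse;	/* fused operation bits, byte 1 */
	uint16_t cid;	/* command identifier, stored little endian */
};
static_assert(sizeof(struct cmd_dw0) == 4, "dword 0 must be 4 bytes");

/* Portable host-to-little-endian conversion for a 16-bit value. */
static inline uint16_t
host_to_le16(uint16_t v)
{
	const uint8_t bytes[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, bytes, sizeof(out));
	return (out);
}

static void
cmd_dw0_init(struct cmd_dw0 *dw0, uint8_t opc, uint16_t cid)
{
	assert(dw0 != NULL);
	dw0->opc = opc;
	dw0->fuse = 0;
	dw0->cid = host_to_le16(cid);
}
```

Only the multi-byte member needs an explicit byte-order conversion; the single-byte members are written directly, so user space code can fill in the opcode the same way on every architecture.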
* nvme(4): Add bus_dmamap_sync() at the end of the request path | Justin Hibbits | 2018-08-03 | 1 | -1/+18
  Summary: Some architectures, in this case powerpc64, need explicit synchronization barriers vs device accesses.
  Prior to this change, when running 'make buildworld -j72' on a 18-core (72-thread) POWER9, I would see controller resets often. With this change, I don't see these reset messages, though another tester still does, for yet to be determined reasons, so this may not be a complete fix. Additionally, I see a ~5-10% speed up in buildworld times, likely due to not needing to reset the controller.
  Reviewed By: jimharris
  Differential Revision: https://reviews.freebsd.org/D16570
  Notes: svn path=/head/; revision=337273
* Refactor NVMe CAM integration. | Alexander Motin | 2018-05-25 | 6 | -101/+116
  - Remove layering violation, when NVMe SIM code accessed CAM internal device structures to set pointers on controller and namespace data. Instead make NVMe XPT probe fetch the data directly from hardware.
  - Cleanup NVMe SIM code, fixing support for multiple namespaces per controller (reporting them as LUNs) and adding controller detach support and run-time namespace change notifications.
  - Add initial support for namespace change async events. So far only in CAM mode, but it allows run-time namespace arrival and departure.
  - Add missing nvme_notify_fail_consumers() call on controller detach. Together with previous changes this allows NVMe device detach/unplug.
  Non-CAM mode still requires a lot of love to stay on par, but at least CAM mode code should not stay in the way so much, becoming much more self-sufficient.
  Reviewed by: imp
  MFC after: 1 month
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=334200
* Remove the 'All Rights Reserved' clause from some of the stuff I've | Warner Losh | 2018-05-09 | 1 | -1/+0
  done for Netflix, since I'm in the neighborhood.
  Notes: svn path=/head/; revision=333434
* Fix LOR between controller and queue locks. | Alexander Motin | 2018-05-02 | 1 | -16/+11
  Admin pass-through requests took the controller lock before the queue lock, but in case of request submission to a failed controller, the controller lock was taken after the queue lock. Fix that by reducing the lock scopes and switching to mtx_pool locks to track pass-through request completion.
  Sponsored by: iXsystems, Inc.
  Notes: svn path=/head/; revision=333180