src/include/sys, branch zfs-0.6.0-rc1

FreeBSD source tree https://cgit-dev.freebsd.org/src/atom?h=zfs-0.6.0-rc1 2011-02-18T17:31:25Z

Merge branch 'zpl'
2011-02-18T17:31:25Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-18T17:31:25Z urn:sha1:5d0265c0dd54d798a35babe587ad5138392fe807

Add API to wait for pending commit callbacks
2011-02-16T19:20:06Z Ricardo M. Correia ricardo.correia@oracle.com 2011-01-21T22:35:41Z urn:sha1:54a179e7b80413bd48cd2cd259110fb493d0215e
This adds an API to wait for pending commit callbacks of already-synced transactions to finish processing. This is needed by the DMU-OSD in Lustre during device finalization, when some callbacks may not yet have been called; this leads to non-zero reference count errors. See lustre.org bug 23931.

Linux 2.6.36 compat, sops->evict_inode()
2011-02-11T21:47:51Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-11T21:46:10Z urn:sha1:2c395def2763ccc7a549d297f7f11bd304caaeae
The new preferred interface for evicting an inode from the inode cache is the ->evict_inode() callback. It replaces both the ->delete_inode() and ->clear_inode() callbacks which were previously used for this.

Linux 2.6.35 compat, fops->fsync()
2011-02-11T17:05:51Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-11T16:58:55Z urn:sha1:7268e1bec8478639b7a1047e02ab931f30bc2f92
The fsync() callback in the file_operations structure used to take 3 arguments. The callback now takes only 2 arguments because the dentry argument was determined to be unused by all consumers. To handle this a compatibility prototype was added to ensure the right prototype is used. Our implementation never used the dentry argument either, so it's just a matter of using the right prototype.

Linux 2.6.35 compat, const struct xattr_handler
2011-02-11T00:29:00Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-11T00:16:52Z urn:sha1:777d4af89137907adc91377327505f40c296035d
The const keyword was added to the 'struct xattr_handler' in the generic Linux super_block structure.
To handle this we define an appropriate xattr_handler_t typedef which can be used. This was the preferred solution because it keeps the code clean and readable.

Use 'noop' IO Scheduler
2011-02-10T17:27:22Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-07T21:54:59Z urn:sha1:6839eed23e3c9d85cf0de767be32af0759e5bf2d
Initial testing has shown that the right IO scheduler to use under Linux is noop. This strikes the ideal balance by allowing the zfs elevator to do all request ordering and prioritization, while allowing the Linux elevator to do the maximum front/back merging allowed by the physical device. This yields the largest possible requests for the device with the lowest total overhead. While 'noop' should be right for your system, you can choose a different IO scheduler with the 'zfs_vdev_scheduler' option. You may set this value to any of the standard Linux schedulers: noop, cfq, deadline, anticipatory. In addition, if you choose 'none' zfs will not attempt to change the IO scheduler for the block device.

Add mmap(2) support
2011-02-10T17:27:21Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-03T18:34:05Z urn:sha1:c0d35759c5ab1abaa6b72062cc4ecd0d86628de8
It's worth taking a moment to describe how mmap is implemented for zfs because it differs considerably from other Linux filesystems. However, this issue is handled the same way under OpenSolaris. The issue is that by design zfs bypasses the Linux page cache and leaves all caching up to the ARC. This has been shown to work well for the common read(2)/write(2) case. However, mmap(2) is a problem because it relies on being tightly integrated with the page cache. To handle this we cache mmap'ed files twice, once in the ARC and a second time in the page cache. The code is careful to keep both copies synchronized. When a file with an mmap'ed region is written to using write(2), both the data in the ARC and existing pages in the page cache are updated.
For a read(2), data will be read first from the page cache and then from the ARC if needed. Neither a write(2) nor a read(2) will ever result in new pages being added to the page cache. New pages are added to the page cache only via .readpage(), which is called when the vfs needs to read a page off disk to back the virtual memory region. These pages may be modified without notifying the ARC and will be written out periodically via .writepage(). This will occur due to either a sync or the usual page aging behavior. Note that because a read(2) of an mmap'ed file will always check the page cache first, correct data will still be returned even when the ARC is out of date.

While this implementation ensures correct behavior it does have some drawbacks. The most obvious is that it increases the required memory footprint when accessing mmap'ed files. It also adds complexity to the code that keeps both caches synchronized.

Longer term it may be possible to cleanly resolve this wart by mapping page cache pages directly on to the ARC buffers. The Linux address space operations are flexible enough to allow selection of which pages back a particular index. The trick would be working out the details of which subsystem is in charge: the ARC, the page cache, or both. It may also prove helpful to move the ARC buffers to a scatter-gather list rather than a vmalloc'ed region.

Additionally, zfs_write/read_common() were used in the readpage and writepage hooks because it was fairly easy. However, it would be better to update zfs_fillpage and zfs_putapage to be Linux friendly and use them instead.

Add Hooks for Linux File Operations
2011-02-10T17:27:21Z Brian Behlendorf behlendorf1@llnl.gov 2011-01-26T20:03:58Z urn:sha1:1efb473f8919c5f195e127136b79c6d3b1eb1c81
The Linux specific file operations have all been located in the file zpl_file.c. These functions primarily rely on the reworked zfs_* functions to do their job.
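The layering can be pictured with a small userspace sketch. Everything in it is an assumption made for illustration: sketch_zfs_read and sketch_zpl_read are hypothetical names, not functions taken from zpl_file.c. It shows only the general shape of a Linux-facing hook that delegates to an inner function reporting failure as a positive errno, while the hook itself returns a byte count or a negative errno.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a reworked zfs_* function. It follows the
 * Solaris convention: 0 on success, a positive errno value on failure,
 * with the byte count returned through an out parameter.
 */
static int
sketch_zfs_read(const char *file, size_t filelen, char *buf, size_t len,
    size_t off, size_t *nread)
{
	if (file == NULL || buf == NULL)
		return (EFAULT);
	if (off > filelen)
		return (EINVAL);

	/* Copy at most len bytes, stopping at the end of the file. */
	*nread = (len < filelen - off) ? len : filelen - off;
	memcpy(buf, file + off, *nread);
	return (0);
}

/*
 * Hypothetical zpl_* style hook. It follows the Linux convention: the
 * byte count on success, a negative errno value on failure.
 */
static long
sketch_zpl_read(const char *file, size_t filelen, char *buf, size_t len,
    size_t off)
{
	size_t nread = 0;
	int error;

	error = sketch_zfs_read(file, filelen, buf, len, off, &nread);
	if (error != 0)
		return (-(long)error);

	return ((long)nread);
}
```

Keeping the sign flip in the thin outer function means the inner function can stay close to its Solaris form, which is the division of labor the commit message describes.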
They are also responsible for converting the possible Solaris style error codes to negative Linux errors. This first zpl_* commit also includes a common zpl.h header with minimal entries to register the Linux specific hooks. It also adds all the new zpl_* files to the Makefile.in. This is not a standalone commit; it requires the following zpl_* commits.

Add zp->z_is_zvol flag
2011-02-10T17:27:21Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-08T19:29:50Z urn:sha1:3c4988c83e4f278cd6c8076f6cdb8e4858d05840
A new flag is required for the zfs_rlock code to determine if it is operating on a zvol or a zpl dataset. This used to be keyed off the zp->z_vnode, which was a hack to begin with, but with the removal of vnodes we needed a dedicated flag.

Prototype/structure update for Linux
2011-02-10T17:27:21Z Brian Behlendorf behlendorf1@llnl.gov 2011-02-08T19:16:06Z urn:sha1:3558fd73b5d863304102f6745c26e0b592aca60a
I apologize in advance that too many things ended up in this commit. While it could be separated into a whole series of commits, teasing that all apart now would take considerable time and I'm not sure there's much merit in it. As such I'll just summarize the intent of the changes which are all (or partly) in this commit. Broadly the intent is to remove as much Solaris specific code as possible and replace it with native Linux equivalents. More specifically:

1) Replace all instances of zfsvfs_t with zfs_sb_t. While the type is largely the same, calling it private super block data rather than a zfsvfs is more consistent with how Linux names this. While non-critical, it makes the code easier to read when you're thinking in Linux friendly VFS terms.

2) Replace vnode_t with struct inode. The Linux VFS doesn't have the notion of a vnode and there's absolutely no good reason to create one. There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it complicates the code, and it will likely lead to bugs because there's a good chance it will be out of date. The code has been updated to remove all need for this type.

3) Replace all vtype_t's with umode types. Along with this shift all uses of types to mode bits. The Solaris code would pass a vtype which is redundant with the Linux mode. Just update all the code to use the Linux mode macros and remove this redundancy.

4) Remove use of vn_* helpers and replace where needed with inode helpers. The big example here is creating iput_aync to replace vn_rele_async. Other vn helpers will be addressed as needed but they should not be emulated. They are a Solaris VFS'ism and should simply be replaced with Linux equivalents.

5) Update znode alloc/free code. Under Linux it's common to embed the inode specific data with the inode itself. This removes the need for an extra memory allocation. In zfs this information is called a znode and it now embeds the inode with it. Allocators have been updated accordingly.

6) Minimal integration with the vfs flags for setting up the super block and handling mount options has been added. This code will need to be refined but functionally it's all there.

This will be the first and last of these too large to review commits.
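
The embedding described in item 5 can be sketched in userspace C. This is only an illustration under stated assumptions: the structure and function names below (sketch_inode, sketch_znode, sketch_itoz and so on) are hypothetical stand-ins, not the types or allocators from this commit. The point it shows is that one allocation holds both objects, and a pointer to the embedded inode is enough to recover the enclosing znode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct inode (userspace only). */
struct sketch_inode {
	unsigned long i_ino;
	unsigned int i_mode;
};

/* Hypothetical znode that embeds its inode instead of pointing at one. */
struct sketch_znode {
	uint64_t z_id;
	int z_is_zvol;
	struct sketch_inode z_inode;	/* embedded, not a pointer */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define	SKETCH_CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* znode -> inode: take the address of the embedded member. */
static struct sketch_inode *
sketch_ztoi(struct sketch_znode *zp)
{
	return (&zp->z_inode);
}

/* inode -> znode: step back by the member's offset. */
static struct sketch_znode *
sketch_itoz(struct sketch_inode *ip)
{
	return (SKETCH_CONTAINER_OF(ip, struct sketch_znode, z_inode));
}

/* A single allocation covers both the znode and its inode. */
static struct sketch_inode *
sketch_alloc_inode(uint64_t id)
{
	struct sketch_znode *zp = calloc(1, sizeof (*zp));

	if (zp == NULL)
		return (NULL);
	zp->z_id = id;
	zp->z_inode.i_ino = (unsigned long)id;
	return (sketch_ztoi(zp));
}

/* Freeing the inode frees the whole znode that contains it. */
static void
sketch_destroy_inode(struct sketch_inode *ip)
{
	assert(ip != NULL);
	free(sketch_itoz(ip));
}
```

A caller that only ever sees the inode pointer, as VFS-style code would, can still reach the filesystem private fields through sketch_itoz() without a second allocation or a back pointer.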