diff options
Diffstat (limited to 'subversion/libsvn_fs_fs/structure-indexes')
| -rw-r--r-- | subversion/libsvn_fs_fs/structure-indexes | 352 | 
1 files changed, 352 insertions, 0 deletions
| diff --git a/subversion/libsvn_fs_fs/structure-indexes b/subversion/libsvn_fs_fs/structure-indexes new file mode 100644 index 000000000000..25490c7ca8f0 --- /dev/null +++ b/subversion/libsvn_fs_fs/structure-indexes @@ -0,0 +1,352 @@ +This file describes the design, data model, and storage formats of FSFS +index data. + + +Design +====== + +Each pack and each rev file using logical addressing contains exactly +two index sections.  One, the log-to-phys index, maps the (rev, item_index) +pairs to absolute file offsets.  The other, phys-to-log, is a reverse +index that gives basic information on any file location.  This is enough +to read and cache any data without traversing DAGs. + +Rev and pack files are immutable, so the same is true for index data. +During a transaction or while packing a file, a proto index file gets +written (actually, one log-to-phys and one phys-to-log).  Its format is +a simple concatenation of runtime structs and as such, an implementation +detail subject to change.  A proto index basically aggregates all the +information that must later be transformed into the final index. + + +General design concerns +----------------------- + +In Subversion, there is no limit to the size of a revision; even practical +limits are in the order of millions of changes at least.  Index data for +these would be multiple megabytes in size with pack file indexes possibly +approaching 1 GB.  To ensure we still get roughly O(1) access time, we +need a hierarchical data structure. + +Therefore, the indexes will start with a header containing an array of +references to sub-sections or pages.  The length of these pages varies +but is limited to a size configurable in fsfs.conf. + +Finally, it is assumed that whole pages can be cached efficiently and +with a high cache hit rate.  So, although a page may have a thousand or +more entries, the access time is still amortized O(1) in many scenarios. + + +Items and item types +-------------------- + +The index implementation treats item_index and item type as simple ints, +except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED. +Since they have been defined as 0, the code may test for "used" etc. +by simply comparing with 0. + +See section "addressing modes" in structure to a list of item types +and pre-defined item_index values. + + +Encoding +-------- + +The final index data format is tuned for space and decoding efficiency. +Indexes are stored as a sequence of variable integers.  The encoding is +as follows: + +* Unsigned integers are stored in little endian order with a variable +  length 7b/8b encoding.  If most significant bit a byte has been set, +  the next byte has also belongs to the same value. + +  0x00 .. 0x7f    -> 0x00 .. 0x7f               ( 7 bits stored in  8 bits) +  0x80 .. 0xff    -> 0x80 0x01 .. 0xff 0x01     (14 bits stored in 16 bits) +  0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f     (14 bits stored in 16 bits) +  0x100000000     -> 0x80 0x80 0x80 0x80 0x10   (35 bits stored in 40 bits) + +  Technically, we can represent integers of arbitrary lengths.  Currently, +  we only generate and parse up to 64 bits. + +* Signed integers are mapped onto the unsigned value space as follows: + +  x >= 0 ->  2 * x +  x < 0  -> -2 * x - 1 + +  Again, we can represent arbitrary length numbers that way but the code +  is currently restricted to 64 bits. + +Most data is unsigned by nature but will be stored differentially using +signed integers. + + +Encoding in proto-index files +----------------------------- + +These have a much simpler encoding.  Throughout the files, all records have +the same length (but different between L2P and P2L).  All records contain +unsigned 64 bit integers only, stored in little endian byte order. + + +Log-to-phys index +================= + +This index has to map (rev, item_index) -> offset.  It assumes that the +item_index values per revision are dense and start at 0.  There may be +unused item_index values, though; the data structure simply gets less +space-efficient when the more sparse the value space gets. + + +Index data model +---------------- + +hierarchy: + +  header -> per-revision info -> page -> offset + +  There is one entry per revision in the header.  Per revision there are +  one or more pages (exclusive to that revision) containing up to a known, +  fixed limit of entries (= page size).  The total access path is: + +  pages = header->pages[revision]; +  offsets = page = pages[item_index / page_size]; +  offset = offsets[item_index % page_size]; + +  Different log-to-phys indexes in the same repository may have different +  page sizes but within any given index, the page size is the same and +  immutable. + +header: + +  <first revision>   ... first revision covered by this index +  <revision count>   ... number of revision covered by this index +  <page size>        ... maximum number of entries per page +  <page table index> ... array, for each revision containing the index in +                         <page table> of the first page that belongs to +                         this revision.  This has <revision count>+1 +                         entries to terminate the last revision. +  <page table>       ... array of page headers.  It has +                         <page table index>[<revision count>] entries. + +page table: + +  <offset>           ... absolute position of the page contents within the +                         index +  <entry count>      ... number of offset entries in the page. +                         Must match <header>.<page size> unless this is +                         the last page for the respective revision. +  <size>             ... length in bytes of the on-disk page description. +                         Note that this is redundant with the <offset>. + +page: + +  <entry count>      ... number of offset entries in the page. +                         Must match <header>.<page size> unless this is +                         the last page for the respective revision. +                         Redundant with <page table>.<entry count> +  <offsets>          ... array of absolute file positions within the rev / +                         pack file.  This has <entry count> entries. + +                          +Index on-disk format +-------------------- + +  index := "L2P-INDEX\n" header revisions pages offsets + +  header := u(<header>.<first revision>) \ +            u(<header>.<page size>) \ +            u(<header>.<revision count>) \ +            u(s(<header>.<page table>)) + +  revisions := u(  <header>.<page table index>[k+1] +                 - <header>.<page table index>[k]), +               for k in 0 .. <header>.<revision count>-1 + +  pages := u(<header>.<page table>[k].<size>) \ +           u(<header>.<page table>[k].<entry count>), +           for k in 0 .. s(<header>.<page table>)-1 + +  offsets := page(k), +             for k in 0 .. s(<header>.<page table>)-1 + +  page(k) := i(<header>.<page table>[k].<offsets>[0]) \ +             i(  <header>.<page table>[k].<offsets>[l] \ +               - <header>.<page table>[k].<offsets>[l - 1]), +             for l in 1 .. s(<header>.<page table>[k].<entry count>)-1 + +  u(x) ... unsigned int x in 7b/8b encoding +  i(x) ... signed int x in 7b/8b encoding +  s(x) ... number of entries in array x + + +Proto index file format +----------------------- + +The index will be created from a "revision-less" proto index file +containing (<offset><item_index>) pairs only. + +All mappings belonging to the same revision must be written in one go but +there is no restriction on the order of those entries.  To signify that +a new revision begins, a (0, 0) mapping must be written.  A (0, 0) entry +at the beginning of the file is optional and will be ignored. + +  <bof>         /* begin of proto index file for revision r and following */ +  (0, 0)        /* mark start of revision r, optional for first rev */ +  (off, item)*  /* zero to many mappings in random order */ +  (0, 0)        /* mark start of revision r + 1 */ +  (off, item)*  /* zero to many mappings in random order */ +  (0, 0)        /* mark start of revision r + 2 */ +  (off, item)*  /* zero to many mappings in random order */ +  ... +  <eof>         /* end of file. */ + +All entries are pairs of 64 bit unsigned integers in little endian order. + + +Phys-to-log index +================= + +This index has to map offset -> (rev, item_index, type, len, checksum). + + +Index data model +---------------- + +hierarchy: + +  header -> page -> item info + +  Logically, the index splits up index rev / pack file into pages of a +  fixed size.  That page size may differ from the FS's block size.  The +  index will describe each rev / pack file page with one index page. + +  page = header->pages[offset % page_size]; +  item info = binary_search(page.data, offset) + +  Note that the current implementation will not return any data if the +  offset is does not match any actual item start.  To simplify the lookup, +  the last index page will have an "unused item" entry for the section +  behind EOF.  Holes aren't allowed as well, i.e. every byte of the rev / +  pack is expected to be covered by the index. + +  Also, there may be items stretching across page borders or even over +  multiple pages.  The data model solves this issue by storing the item +  descriptions as a "primary array" and then representing the pages as +  ranges within that array.  Thus multiple pages may reference the same +  item description. + +header: + +  <first revision>   ... first revision covered by this index +  <file size>        ... size of the rev / pack file in bytes +  <page size>        ... number of bytes in the rev / pack file covered by +                         each index page +  <page count>       ... number of pages +  <offsets>          ... array of page offsets, i.e. locations the page +                         data within the index. +                         This array has <page count> + 1 entries. + +page: + +  <entries>          ... array of item descriptions, ordered by offset. +                         First and last entry may cross page boundaries. + +entry: + +  <offset>           ... absolute position in the pack / rev file +  <size>             ... on-disk size of the item in the pack / rev file +  <type>             ... item type +  <FNV checksum>     ... modified 32 bit FNV-1a checksum of that section +                         of the pack / rev file (see below). 0 for empty +                         or zero-length items +  <revision>         ... revision that this item belongs to +  <item_index>       ... item_index within that revision + + +Index on-disk format +-------------------- + +  index := "P2L-INDEX\n" header pages items + +  header := u(<header>.<first revision>) \ +            u(<header>.<file size>) \ +            u(<header>.<page size>) \ +            u(<header>.<page count>) + +  pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]), +           for k in 0 .. <header>.<page count> -1 + +  items := u(<items in page k>[0].<offset>) \ +           u(<items in page k>[l].<size>) \ +           i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \ +           i(  <items in page k>[l].<revision> +             - <items in page k>[l-1].<revision>), \ +           u(FNV checksum) +           for l in 0 .. s(<items in page k>)-1, +           for k in 0 .. <header>.<page count>-1 + +  u(x) ... unsigned int x in 7b/8b encoding +  i(x) ... signed int x in 7b/8b encoding +  s(x) ... number of entries in collection x +  c(x) ... compound function := x.<item_index> * 8 + x.<type> + +  Access to negative indexes gives a 0 value. + +  <Items in page k> are in strict ascending offset order.  Items that +  started after the begin of a given page and overlap with the next page +  will not be stored in the start page.  The runtime representation will +  duplicate items overlapping page boundaries; the on-disk representation +  will not.  Thus, pages inside large items will have zero entries on disk. + + +Proto index file format +----------------------- + +The index will be created from a proto index file containing simple +instances of svn_fs_fs__p2l_entry_t with the following element order: + +  item offset               as uint64 +  item size                 as uint64 +  item type                 as uint64 +  modified FNV1a checksum   as uint64 +  revision                  as uint64, with SVN_INVALID_REVNUM mapped to 0 +                                       and revisions >= 0 stored as rev+1 +  item index                as uint64 + +All values are stored in little endian order. + +Page table and header information, except start revision and page size, +can easily be derived from that information. + +All entries must be written in strict offset order.  Overlapping entries +are not allowed; zero-length items are. + +In transactions, the final revision number may not be known when writing +the proto index file (e.g. while still writing the proto rev file).  Items +with revision set to SVN_INVALID_REVNUM will therefore be automatically +updated when creating the final index.  This is possible in conjunction +with rev files but not for pack files. + + +FNV checksum +------------ + +FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/ +For performance reasons we use a modified version: + +* split the input byte stream [b0 .. bN] into 4 sub-streams of equal +  length and up to 3 remnants: + +  [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant] + +* calculate 32 bit FNV-1a checksums for the 4 substreams: + +  h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..]) + +* combine the big endian representation of these checksums plus the +  remnant of the original stream into a 12 to 15 byte long intermediate + +  [i0 .. iK], 12 <= K+1 <= 15 + +* FNV checksum = fnv_1a([i0 .. iK]) in big endian representation + | 
