1 files changed, 352 insertions, 0 deletions
diff --git a/subversion/libsvn_fs_fs/structure-indexes b/subversion/libsvn_fs_fs/structure-indexes
new file mode 100644
index 000000000000..25490c7ca8f0
--- /dev/null
+++ b/subversion/libsvn_fs_fs/structure-indexes
@@ -0,0 +1,352 @@
+This file describes the design, data model, and storage formats of FSFS
+index data.
+
+
+Design
+======
+
+Each pack and each rev file using logical addressing contains exactly
+two index sections.  One, the log-to-phys index, maps the (rev, item_index)
+pairs to absolute file offsets.  The other, phys-to-log, is a reverse
+index that gives basic information on any file location.  This is enough
+to read and cache any data without traversing DAGs.
+
+Rev and pack files are immutable, so the same is true for index data.
+During a transaction or while packing a file, a proto index file gets
+written (actually, one log-to-phys and one phys-to-log).  Its format is
+a simple concatenation of runtime structs and as such, an implementation
+detail subject to change.  A proto index basically aggregates all the
+information that must later be transformed into the final index.
+
+
+General design concerns
+-----------------------
+
+In Subversion, there is no limit to the size of a revision; even practical
+limits are in the order of millions of changes at least.  Index data for
+these would be multiple megabytes in size with pack file indexes possibly
+approaching 1 GB.  To ensure we still get roughly O(1) access time, we
+need a hierarchical data structure.
+
+Therefore, the indexes will start with a header containing an array of
+references to sub-sections or pages.  The length of these pages varies
+but is limited to a size configurable in fsfs.conf.
+
+Finally, it is assumed that whole pages can be cached efficiently and
+with a high cache hit rate.  So, although a page may have a thousand or
+more entries, the access time is still amortized O(1) in many scenarios.
+
+
+Items and item types
+--------------------
+
+The index implementation treats item_index and item type as simple ints,
+except for SVN_FS_FS__ITEM_INDEX_UNUSED and SVN_FS_FS__ITEM_TYPE_UNUSED.
+Since they have been defined as 0, the code may test for "used" etc.
+by simply comparing with 0.
+
+See section "addressing modes" in structure to a list of item types
+and pre-defined item_index values.
+
+
+Encoding
+--------
+
+The final index data format is tuned for space and decoding efficiency.
+Indexes are stored as a sequence of variable integers.  The encoding is
+as follows:
+
+* Unsigned integers are stored in little endian order with a variable
+  length 7b/8b encoding.  If most significant bit a byte has been set,
+  the next byte has also belongs to the same value.
+
+  0x00 .. 0x7f    -> 0x00 .. 0x7f               ( 7 bits stored in  8 bits)
+  0x80 .. 0xff    -> 0x80 0x01 .. 0xff 0x01     (14 bits stored in 16 bits)
+  0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f     (14 bits stored in 16 bits)
+  0x100000000     -> 0x80 0x80 0x80 0x80 0x10   (35 bits stored in 40 bits)
+
+  Technically, we can represent integers of arbitrary lengths.  Currently,
+  we only generate and parse up to 64 bits.
+
+* Signed integers are mapped onto the unsigned value space as follows:
+
+  x >= 0 ->  2 * x
+  x < 0  -> -2 * x - 1
+
+  Again, we can represent arbitrary length numbers that way but the code
+  is currently restricted to 64 bits.
+
+Most data is unsigned by nature but will be stored differentially using
+signed integers.
+
+
+Encoding in proto-index files
+-----------------------------
+
+These have a much simpler encoding.  Throughout the files, all records have
+the same length (but different between L2P and P2L).  All records contain
+unsigned 64 bit integers only, stored in little endian byte order.
+
+
+Log-to-phys index
+=================
+
+This index has to map (rev, item_index) -> offset.  It assumes that the
+item_index values per revision are dense and start at 0.  There may be
+unused item_index values, though; the data structure simply gets less
+space-efficient when the more sparse the value space gets.
+
+
+Index data model
+----------------
+
+hierarchy:
+
+  header -> per-revision info -> page -> offset
+
+  There is one entry per revision in the header.  Per revision there are
+  one or more pages (exclusive to that revision) containing up to a known,
+  fixed limit of entries (= page size).  The total access path is:
+
+  pages = header->pages[revision];
+  offsets = page = pages[item_index / page_size];
+  offset = offsets[item_index % page_size];
+
+  Different log-to-phys indexes in the same repository may have different
+  page sizes but within any given index, the page size is the same and
+  immutable.
+
+header:
+
+  <first revision>   ... first revision covered by this index
+  <revision count>   ... number of revision covered by this index
+  <page size>        ... maximum number of entries per page
+  <page table index> ... array, for each revision containing the index in
+                         <page table> of the first page that belongs to
+                         this revision.  This has <revision count>+1
+                         entries to terminate the last revision.
+  <page table>       ... array of page headers.  It has
+                         <page table index>[<revision count>] entries.
+
+page table:
+
+  <offset>           ... absolute position of the page contents within the
+                         index
+  <entry count>      ... number of offset entries in the page.
+                         Must match <header>.<page size> unless this is
+                         the last page for the respective revision.
+  <size>             ... length in bytes of the on-disk page description.
+                         Note that this is redundant with the <offset>.
+
+page:
+
+  <entry count>      ... number of offset entries in the page.
+                         Must match <header>.<page size> unless this is
+                         the last page for the respective revision.
+                         Redundant with <page table>.<entry count>
+  <offsets>          ... array of absolute file positions within the rev /
+                         pack file.  This has <entry count> entries.
+
+                         
+Index on-disk format
+--------------------
+
+  index := "L2P-INDEX\n" header revisions pages offsets
+
+  header := u(<header>.<first revision>) \
+            u(<header>.<page size>) \
+            u(<header>.<revision count>) \
+            u(s(<header>.<page table>))
+
+  revisions := u(  <header>.<page table index>[k+1]
+                 - <header>.<page table index>[k]),
+               for k in 0 .. <header>.<revision count>-1
+
+  pages := u(<header>.<page table>[k].<size>) \
+           u(<header>.<page table>[k].<entry count>),
+           for k in 0 .. s(<header>.<page table>)-1
+
+  offsets := page(k),
+             for k in 0 .. s(<header>.<page table>)-1
+
+  page(k) := i(<header>.<page table>[k].<offsets>[0]) \
+             i(  <header>.<page table>[k].<offsets>[l] \
+               - <header>.<page table>[k].<offsets>[l - 1]),
+             for l in 1 .. s(<header>.<page table>[k].<entry count>)-1
+
+  u(x) ... unsigned int x in 7b/8b encoding
+  i(x) ... signed int x in 7b/8b encoding
+  s(x) ... number of entries in array x
+
+
+Proto index file format
+-----------------------
+
+The index will be created from a "revision-less" proto index file
+containing (<offset><item_index>) pairs only.
+
+All mappings belonging to the same revision must be written in one go but
+there is no restriction on the order of those entries.  To signify that
+a new revision begins, a (0, 0) mapping must be written.  A (0, 0) entry
+at the beginning of the file is optional and will be ignored.
+
+  <bof>         /* begin of proto index file for revision r and following */
+  (0, 0)        /* mark start of revision r, optional for first rev */
+  (off, item)*  /* zero to many mappings in random order */
+  (0, 0)        /* mark start of revision r + 1 */
+  (off, item)*  /* zero to many mappings in random order */
+  (0, 0)        /* mark start of revision r + 2 */
+  (off, item)*  /* zero to many mappings in random order */
+  ...
+  <eof>         /* end of file. */
+
+All entries are pairs of 64 bit unsigned integers in little endian order.
+
+
+Phys-to-log index
+=================
+
+This index has to map offset -> (rev, item_index, type, len, checksum).
+
+
+Index data model
+----------------
+
+hierarchy:
+
+  header -> page -> item info
+
+  Logically, the index splits up index rev / pack file into pages of a
+  fixed size.  That page size may differ from the FS's block size.  The
+  index will describe each rev / pack file page with one index page.
+
+  page = header->pages[offset % page_size];
+  item info = binary_search(page.data, offset)
+
+  Note that the current implementation will not return any data if the
+  offset is does not match any actual item start.  To simplify the lookup,
+  the last index page will have an "unused item" entry for the section
+  behind EOF.  Holes aren't allowed as well, i.e. every byte of the rev /
+  pack is expected to be covered by the index.
+
+  Also, there may be items stretching across page borders or even over
+  multiple pages.  The data model solves this issue by storing the item
+  descriptions as a "primary array" and then representing the pages as
+  ranges within that array.  Thus multiple pages may reference the same
+  item description.
+
+header:
+
+  <first revision>   ... first revision covered by this index
+  <file size>        ... size of the rev / pack file in bytes
+  <page size>        ... number of bytes in the rev / pack file covered by
+                         each index page
+  <page count>       ... number of pages
+  <offsets>          ... array of page offsets, i.e. locations the page
+                         data within the index.
+                         This array has <page count> + 1 entries.
+
+page:
+
+  <entries>          ... array of item descriptions, ordered by offset.
+                         First and last entry may cross page boundaries.
+
+entry:
+
+  <offset>           ... absolute position in the pack / rev file
+  <size>             ... on-disk size of the item in the pack / rev file
+  <type>             ... item type
+  <FNV checksum>     ... modified 32 bit FNV-1a checksum of that section
+                         of the pack / rev file (see below). 0 for empty
+                         or zero-length items
+  <revision>         ... revision that this item belongs to
+  <item_index>       ... item_index within that revision
+
+
+Index on-disk format
+--------------------
+
+  index := "P2L-INDEX\n" header pages items
+
+  header := u(<header>.<first revision>) \
+            u(<header>.<file size>) \
+            u(<header>.<page size>) \
+            u(<header>.<page count>)
+
+  pages := u(<header>.<offsets>[k+1] - <header>.<offsets>[k]),
+           for k in 0 .. <header>.<page count> -1
+
+  items := u(<items in page k>[0].<offset>) \
+           u(<items in page k>[l].<size>) \
+           i(c(<items in page k>[l]) - c(<items of page k>[l-1])) \
+           i(  <items in page k>[l].<revision>
+             - <items in page k>[l-1].<revision>), \
+           u(FNV checksum)
+           for l in 0 .. s(<items in page k>)-1,
+           for k in 0 .. <header>.<page count>-1
+
+  u(x) ... unsigned int x in 7b/8b encoding
+  i(x) ... signed int x in 7b/8b encoding
+  s(x) ... number of entries in collection x
+  c(x) ... compound function := x.<item_index> * 8 + x.<type>
+
+  Access to negative indexes gives a 0 value.
+
+  <Items in page k> are in strict ascending offset order.  Items that
+  started after the begin of a given page and overlap with the next page
+  will not be stored in the start page.  The runtime representation will
+  duplicate items overlapping page boundaries; the on-disk representation
+  will not.  Thus, pages inside large items will have zero entries on disk.
+
+
+Proto index file format
+-----------------------
+
+The index will be created from a proto index file containing simple
+instances of svn_fs_fs__p2l_entry_t with the following element order:
+
+  item offset               as uint64
+  item size                 as uint64
+  item type                 as uint64
+  modified FNV1a checksum   as uint64
+  revision                  as uint64, with SVN_INVALID_REVNUM mapped to 0
+                                       and revisions >= 0 stored as rev+1
+  item index                as uint64
+
+All values are stored in little endian order.
+
+Page table and header information, except start revision and page size,
+can easily be derived from that information.
+
+All entries must be written in strict offset order.  Overlapping entries
+are not allowed; zero-length items are.
+
+In transactions, the final revision number may not be known when writing
+the proto index file (e.g. while still writing the proto rev file).  Items
+with revision set to SVN_INVALID_REVNUM will therefore be automatically
+updated when creating the final index.  This is possible in conjunction
+with rev files but not for pack files.
+
+
+FNV checksum
+------------
+
+FNV-1a can be found here: http://www.isthe.com/chongo/tech/comp/fnv/
+For performance reasons we use a modified version:
+
+* split the input byte stream [b0 .. bN] into 4 sub-streams of equal
+  length and up to 3 remnants:
+
+  [b0 b4 b8 ..], [b1 b5 b9 ..], [b2 b6 b10 ..], [b3 b7 b11 ..], [remnant]
+
+* calculate 32 bit FNV-1a checksums for the 4 substreams:
+
+  h0 = fnv_1a([b0 b4 b8 ..]), ..., h3 = fnv_1a([b3 b7 b11 ..])
+
+* combine the big endian representation of these checksums plus the
+  remnant of the original stream into a 12 to 15 byte long intermediate
+
+  [i0 .. iK], 12 <= K+1 <= 15
+
+* FNV checksum = fnv_1a([i0 .. iK]) in big endian representation
+