Diffstat (limited to 'docs')
-rw-r--r--  docs/AMDGPUUsage.rst                     2010
-rw-r--r--  docs/BitCodeFormat.rst                     22
-rw-r--r--  docs/Bugpoint.rst                           5
-rw-r--r--  docs/CFIVerify.rst                         91
-rw-r--r--  docs/CMake.rst                              9
-rw-r--r--  docs/CMakeLists.txt                        12
-rw-r--r--  docs/CMakePrimer.rst                       25
-rw-r--r--  docs/CodeGenerator.rst                     11
-rw-r--r--  docs/CodingStandards.rst                   57
-rw-r--r--  docs/CommandGuide/FileCheck.rst            16
-rw-r--r--  docs/CommandGuide/dsymutil.rst             89
-rw-r--r--  docs/CommandGuide/index.rst                 2
-rw-r--r--  docs/CommandGuide/llc.rst                   8
-rw-r--r--  docs/CommandGuide/lli.rst                  12
-rw-r--r--  docs/CommandGuide/llvm-cov.rst             48
-rw-r--r--  docs/CommandGuide/llvm-dwarfdump.rst      134
-rw-r--r--  docs/CommandGuide/llvm-pdbutil.rst        585
-rw-r--r--  docs/CommandLine.rst                       12
-rw-r--r--  docs/CompilerWriterInfo.rst                 2
-rw-r--r--  docs/CoverageMappingFormat.rst             10
-rw-r--r--  docs/DeveloperPolicy.rst                    2
-rw-r--r--  docs/ExceptionHandling.rst                  4
-rw-r--r--  docs/FuzzingLLVM.rst                      274
-rw-r--r--  docs/GettingStarted.rst                   101
-rw-r--r--  docs/GettingStartedVS.rst                   5
-rw-r--r--  docs/GlobalISel.rst                        85
-rw-r--r--  docs/HowToCrossCompileBuiltinsOnArm.rst   201
-rw-r--r--  docs/HowToReleaseLLVM.rst                  22
-rw-r--r--  docs/LangRef.rst                          493
-rw-r--r--  docs/Lexicon.rst                            5
-rw-r--r--  docs/LibFuzzer.rst                        221
-rw-r--r--  docs/MIRLangRef.rst                       186
-rw-r--r--  docs/NVPTXUsage.rst                         8
-rw-r--r--  docs/ProgrammersManual.rst                 15
-rw-r--r--  docs/Proposals/VectorizationPlan.rst       69
-rw-r--r--  docs/ReleaseNotes.rst                     242
-rw-r--r--  docs/ScudoHardenedAllocator.rst            13
-rw-r--r--  docs/SourceLevelDebugging.rst              73
-rw-r--r--  docs/Statepoints.rst                       12
-rw-r--r--  docs/TypeMetadata.rst                       4
-rw-r--r--  docs/WritingAnLLVMBackend.rst              42
-rw-r--r--  docs/WritingAnLLVMPass.rst                  4
-rw-r--r--  docs/XRay.rst                              65
-rw-r--r--  docs/XRayExample.rst                       72
-rw-r--r--  docs/XRayFDRFormat.rst                    401
-rw-r--r--  docs/YamlIO.rst                             2
-rw-r--r--  docs/conf.py                                4
-rw-r--r--  docs/index.rst                             21
-rw-r--r--  docs/tutorial/BuildingAJIT1.rst           234
-rw-r--r--  docs/tutorial/BuildingAJIT2.rst            77
-rw-r--r--  docs/tutorial/BuildingAJIT3.rst            33
-rw-r--r--  docs/tutorial/BuildingAJIT4.rst             2
-rw-r--r--  docs/tutorial/BuildingAJIT5.rst             4
53 files changed, 4479 insertions, 1677 deletions
diff --git a/docs/AMDGPUUsage.rst b/docs/AMDGPUUsage.rst
index 41c7ecba527f..439089348fff 100644
--- a/docs/AMDGPUUsage.rst
+++ b/docs/AMDGPUUsage.rst
@@ -23,50 +23,55 @@ Target Triples
Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:
- .. table:: AMDGPU Target Triples
- :name: amdgpu-target-triples-table
-
- ============ ======== ========= ===========
- Architecture Vendor OS Environment
- ============ ======== ========= ===========
- r600 amd <empty> <empty>
- amdgcn amd <empty> <empty>
- amdgcn amd amdhsa <empty>
- amdgcn amd amdhsa opencl
- amdgcn amd amdhsa amdgizcl
- amdgcn amd amdhsa amdgiz
- amdgcn amd amdhsa hcc
- ============ ======== ========= ===========
-
-``r600-amd--``
- Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
- the MESA runtime.
-
-``amdgcn-amd--``
- Supports AMD GPUs GCN 6 onwards for graphics and compute shaders executed on
- the MESA runtime.
-
-``amdgcn-amd-amdhsa-``
- Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
- compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
-
-``amdgcn-amd-amdhsa-opencl``
- Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
- [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
- :ref:`amdgpu-opencl`.
-
-``amdgcn-amd-amdhsa-amdgizcl``
- Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
- is used (see :ref:`amdgpu-address-spaces`).
-
-``amdgcn-amd-amdhsa-amdgiz``
- Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
- used (see :ref:`amdgpu-address-spaces`).
-
-``amdgcn-amd-amdhsa-hcc``
- Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
- executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
- :ref:`amdgpu-hcc`.
+ .. table:: AMDGPU Architectures
+ :name: amdgpu-architecture-table
+
+ ============ ==============================================================
+ Architecture Description
+ ============ ==============================================================
+ ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
+ ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
+ ============ ==============================================================
+
+ .. table:: AMDGPU Vendors
+ :name: amdgpu-vendor-table
+
+ ============ ==============================================================
+ Vendor Description
+ ============ ==============================================================
+ ``amd`` Can be used for all AMD GPU usage.
+ ``mesa3d`` Can be used if the OS is ``mesa3d``.
+ ============ ==============================================================
+
+ .. table:: AMDGPU Operating Systems
+ :name: amdgpu-os-table
+
+ ============== ============================================================
+ OS Description
+ ============== ============================================================
+ *<empty>* Defaults to the *unknown* OS.
+ ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
+ such as AMD's ROCm [AMD-ROCm]_.
+ ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
+ runtime.
+ ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
+ runtime.
+ ============== ============================================================
+
+ .. table:: AMDGPU Environments
+ :name: amdgpu-environment-table
+
+ ============ ==============================================================
+ Environment Description
+ ============ ==============================================================
+ *<empty>* Defaults to ``opencl``.
+ ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
+ ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
+ used (see :ref:`amdgpu-address-spaces`).
+ ``amdgiz`` Same as ``opencl`` except a different address space mapping is
+ used (see :ref:`amdgpu-address-spaces`).
+ ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
+ ============ ==============================================================
.. _amdgpu-processors:
@@ -77,132 +82,182 @@ Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
names from both the *Processor* and *Alternative Processor* can be used.
.. table:: AMDGPU Processors
- :name: amdgpu-processors-table
-
- ========== =========== ============ ===== ======= ==================
- Processor Alternative Target dGPU/ Runtime Example
- Processor Triple APU Support Products
- Architecture
- ========== =========== ============ ===== ======= ==================
- **R600** [AMD-R6xx]_
- --------------------------------------------------------------------
- r600 r600 dGPU
- r630 r600 dGPU
- rs880 r600 dGPU
- rv670 r600 dGPU
- **R700** [AMD-R7xx]_
- --------------------------------------------------------------------
- rv710 r600 dGPU
- rv730 r600 dGPU
- rv770 r600 dGPU
- **Evergreen** [AMD-Evergreen]_
- --------------------------------------------------------------------
- cedar r600 dGPU
- redwood r600 dGPU
- sumo r600 dGPU
- juniper r600 dGPU
- cypress r600 dGPU
- **Northern Islands** [AMD-Cayman-Trinity]_
- --------------------------------------------------------------------
- barts r600 dGPU
- turks r600 dGPU
- caicos r600 dGPU
- cayman r600 dGPU
- **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
- --------------------------------------------------------------------
- gfx600 - SI amdgcn dGPU
- - tahiti
- gfx601 - pitcairn amdgcn dGPU
- - verde
- - oland
- - hainan
- **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
- --------------------------------------------------------------------
- gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
- - Radeon HD 8770
- - R7 260
- - R7 260X
- \ - kaveri amdgcn APU - A6-7000
- - A6 Pro-7050B
- - A8-7100
- - A8 Pro-7150B
- - A10-7300
- - A10 Pro-7350B
- - FX-7500
- - A8-7200P
- - A10-7400P
- - FX-7600P
- gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
- - FirePro W9100
- - FirePro S9150
- - FirePro S9170
- gfx702 dGPU ROCm - Radeon R9 290
- - Radeon R9 290x
- - Radeon R390
- - Radeon R390x
- gfx703 - kabini amdgcn APU - E1-2100
- - mullins - E1-2200
- - E1-2500
- - E2-3000
- - E2-3800
- - A4-5000
- - A4-5100
- - A6-5200
- - A4 Pro-3340B
- **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
- --------------------------------------------------------------------
- gfx800 - iceland amdgcn dGPU - FirePro S7150
- - FirePro S7100
- - FirePro W7100
- - Radeon R285
- - Radeon R9 380
- - Radeon R9 385
- - Mobile FirePro
- M7170
- gfx801 - carrizo amdgcn APU - A6-8500P
- - Pro A6-8500B
- - A8-8600P
- - Pro A8-8600B
- - FX-8800P
- - Pro A12-8800B
- \ amdgcn APU ROCm - A10-8700P
- - Pro A10-8700B
- - A10-8780P
- \ amdgcn APU - A10-9600P
- - A10-9630P
- - A12-9700P
- - A12-9730P
- - FX-9800P
- - FX-9830P
- \ amdgcn APU - E2-9010
- - A6-9210
- - A9-9410
- gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
- gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
- - Radeon R9 Fury
- - Radeon R9 FuryX
- - Radeon Pro Duo
- - FirePro S9300x2
- \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
- - Radeon RX 480
- \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
- gfx804 amdgcn dGPU Same as gfx803
- gfx810 - stoney amdgcn APU
- **GCN GFX9**
- --------------------------------------------------------------------
- gfx900 amdgcn dGPU - Radeon Vega Frontier Edition
- gfx901 amdgcn dGPU ROCm Same as gfx900
- except XNACK is
- enabled
- gfx902 amdgcn APU *TBA*
-
- .. TODO
- Add product
- names.
- gfx903 amdgcn APU Same as gfx902
- except XNACK is
- enabled
- ========== =========== ============ ===== ======= ==================
+ :name: amdgpu-processor-table
+
+ =========== =============== ============ ===== ========= ======= ==================
+ Processor Alternative Target dGPU/ Target ROCm Example
+ Processor Triple APU Features Support Products
+ Architecture Supported
+ [Default]
+ =========== =============== ============ ===== ========= ======= ==================
+ **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
+ -----------------------------------------------------------------------------------
+ ``r600`` ``r600`` dGPU
+ ``r630`` ``r600`` dGPU
+ ``rs880`` ``r600`` dGPU
+ ``rv670`` ``r600`` dGPU
+ **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
+ -----------------------------------------------------------------------------------
+ ``rv710`` ``r600`` dGPU
+ ``rv730`` ``r600`` dGPU
+ ``rv770`` ``r600`` dGPU
+ **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
+ -----------------------------------------------------------------------------------
+ ``cedar`` ``r600`` dGPU
+ ``redwood`` ``r600`` dGPU
+ ``sumo`` ``r600`` dGPU
+ ``juniper`` ``r600`` dGPU
+ ``cypress`` ``r600`` dGPU
+ **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
+ -----------------------------------------------------------------------------------
+ ``barts`` ``r600`` dGPU
+ ``turks`` ``r600`` dGPU
+ ``caicos`` ``r600`` dGPU
+ ``cayman`` ``r600`` dGPU
+ **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
+ -----------------------------------------------------------------------------------
+ ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
+ ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
+ - ``verde``
+ - ``oland``
+ - ``hainan``
+ **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
+ -----------------------------------------------------------------------------------
+ ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
+ - A6 Pro-7050B
+ - A8-7100
+ - A8 Pro-7150B
+ - A10-7300
+ - A10 Pro-7350B
+ - FX-7500
+ - A8-7200P
+ - A10-7400P
+ - FX-7600P
+ ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
+ - FirePro W9100
+ - FirePro S9150
+ - FirePro S9170
+ ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
+ - Radeon R9 290x
+ - Radeon R390
+ - Radeon R390x
+ ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
+ - ``mullins`` - E1-2200
+ - E1-2500
+ - E2-3000
+ - E2-3800
+ - A4-5000
+ - A4-5100
+ - A6-5200
+ - A4 Pro-3340B
+ ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
+ - Radeon HD 8770
+ - R7 260
+ - R7 260X
+ **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
+ -----------------------------------------------------------------------------------
+ ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
+ [on] - Pro A6-8500B
+ - A8-8600P
+ - Pro A8-8600B
+ - FX-8800P
+ - Pro A12-8800B
+ \ ``amdgcn`` APU - xnack ROCm - A10-8700P
+ [on] - Pro A10-8700B
+ - A10-8780P
+ \ ``amdgcn`` APU - xnack - A10-9600P
+ [on] - A10-9630P
+ - A12-9700P
+ - A12-9730P
+ - FX-9800P
+ - FX-9830P
+ \ ``amdgcn`` APU - xnack - E2-9010
+ [on] - A6-9210
+ - A9-9410
+ ``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
+ - ``iceland`` [off] - FirePro S7100
+ - FirePro W7100
+ - Radeon R285
+ - Radeon R9 380
+ - Radeon R9 385
+ - Mobile FirePro
+ M7170
+ ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
+ [off] - Radeon R9 Fury
+ - Radeon R9 FuryX
+ - Radeon Pro Duo
+ - FirePro S9300x2
+ - Radeon Instinct MI8
+ \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
+ [off] - Radeon RX 480
+ - Radeon Instinct MI6
+ \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
+ [off]
+ ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
+ [on]
+ **GCN GFX9** [AMD-GCN-GFX9]_
+ -----------------------------------------------------------------------------------
+ ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
+ [off] Frontier Edition
+ - Radeon RX Vega 56
+ - Radeon RX Vega 64
+ - Radeon RX Vega 64
+ Liquid
+ - Radeon Instinct MI25
+ ``gfx902`` ``amdgcn`` APU - xnack *TBA*
+ [on]
+ .. TODO
+ Add product
+ names.
+ =========== =============== ============ ===== ========= ======= ==================
+
+.. _amdgpu-target-features:
+
+Target Features
+---------------
+
+Target features control how code is generated to support certain
+processor specific features. Not all target features are supported by
+all processors. The runtime must ensure that the features supported by
+the device used to execute the code match the features enabled when
+generating the code. A mismatch of features may result in incorrect
+execution, or a reduction in performance.
+
+The target features supported by each processor, and the default value
+used if not specified explicitly, is listed in
+:ref:`amdgpu-processor-table`.
+
+Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
+target features.
+
+For example:
+
+``-mxnack``
+ Enable the ``xnack`` feature.
+``-mno-xnack``
+ Disable the ``xnack`` feature.
+
+ .. table:: AMDGPU Target Features
+ :name: amdgpu-target-feature-table
+
+ ============== ==================================================
+ Target Feature Description
+ ============== ==================================================
+ -m[no-]xnack Enable/disable generating code that has
+ memory clauses that are compatible with
+ having XNACK replay enabled.
+
+ This is used for demand paging and page
+ migration. If XNACK replay is enabled in
+ the device, then if a page fault occurs
+ the code may execute incorrectly if the
+ ``xnack`` feature is not enabled. Executing
+ code that has the feature enabled on a
+ device that does not have XNACK replay
+ enabled will execute correctly, but may
+ be less performant than code with the
+ feature disabled.
+ ============== ==================================================
.. _amdgpu-address-spaces:
@@ -261,14 +316,14 @@ The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
-table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
+table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
This is different to the OpenCL [OpenCL]_ memory model which does not have scope
inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.
- .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
- :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
+ .. table:: AMDHSA LLVM Sync Scopes
+ :name: amdgpu-amdhsa-llvm-sync-scopes-table
================ ==========================================================
LLVM Sync Scope Description
@@ -352,47 +407,78 @@ The AMDGPU backend uses the following ELF header:
.. table:: AMDGPU ELF Header
:name: amdgpu-elf-header-table
- ========================== =========================
+ ========================== ===============================
Field Value
- ========================== =========================
+ ========================== ===============================
``e_ident[EI_CLASS]`` ``ELFCLASS64``
``e_ident[EI_DATA]`` ``ELFDATA2LSB``
- ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``
- ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
- ``e_type`` ``ET_REL`` or ``ET_DYN``
+ ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
+ - ``ELFOSABI_AMDGPU_HSA``
+ - ``ELFOSABI_AMDGPU_PAL``
+ - ``ELFOSABI_AMDGPU_MESA3D``
+ ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
+ - ``ELFABIVERSION_AMDGPU_PAL``
+ - ``ELFABIVERSION_AMDGPU_MESA3D``
+ ``e_type`` - ``ET_REL``
+ - ``ET_DYN``
``e_machine`` ``EM_AMDGPU``
``e_entry`` 0
- ``e_flags`` 0
- ========================== =========================
+ ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
+ ========================== ===============================
..
.. table:: AMDGPU ELF Header Enumeration Values
:name: amdgpu-elf-header-enumeration-values-table
- ============================ =====
- Name Value
- ============================ =====
- ``EM_AMDGPU`` 224
- ``ELFOSABI_AMDGPU_HSA`` 64
- ``ELFABIVERSION_AMDGPU_HSA`` 1
- ============================ =====
+ =============================== =====
+ Name Value
+ =============================== =====
+ ``EM_AMDGPU`` 224
+ ``ELFOSABI_NONE`` 0
+ ``ELFOSABI_AMDGPU_HSA`` 64
+ ``ELFOSABI_AMDGPU_PAL`` 65
+ ``ELFOSABI_AMDGPU_MESA3D`` 66
+ ``ELFABIVERSION_AMDGPU_HSA`` 1
+ ``ELFABIVERSION_AMDGPU_PAL`` 0
+ ``ELFABIVERSION_AMDGPU_MESA3D`` 0
+ =============================== =====
``e_ident[EI_CLASS]``
- The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
- applications.
+ The ELF class is:
+
+ * ``ELFCLASS32`` for ``r600`` architecture.
+
+ * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
+ bit applications.
``e_ident[EI_DATA]``
- All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
+ All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
``e_ident[EI_OSABI]``
- The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
- specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
+ One of the following AMD GPU architecture specific OS ABIs
+ (see :ref:`amdgpu-os-table`):
+
+ * ``ELFOSABI_NONE`` for *unknown* OS.
+
+ * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
+
+ * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
+
+  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
``e_ident[EI_ABIVERSION]``
- The AMD GPU architecture specific OS ABI version of
- ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
- ABI to which the code object conforms.
+ The ABI version of the AMD GPU architecture specific OS ABI to which the code
+ object conforms:
+
+ * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
+ runtime ABI.
+
+ * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
+ runtime ABI.
+
+ * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
+ 3D runtime ABI.
``e_type``
Can be one of the following values:
@@ -408,17 +494,80 @@ The AMDGPU backend uses the following ELF header:
The AMD HSA runtime loader requires a ``ET_DYN`` code object.
``e_machine``
- The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
- architecture family. The specific member is specified in the
- ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
- :ref:`amdgpu-note-records`).
+ The value ``EM_AMDGPU`` is used for the machine for all processors supported
+ by the ``r600`` and ``amdgcn`` architectures (see
+ :ref:`amdgpu-processor-table`). The specific processor is specified in the
+ ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
+ :ref:`amdgpu-elf-header-e_flags-table`).
``e_entry``
The entry point is 0 as the entry points for individual kernels must be
selected in order to invoke them through AQL packets.
``e_flags``
- The value is 0 as no flags are used.
+ The AMDGPU backend uses the following ELF header flags:
+
+ .. table:: AMDGPU ELF Header ``e_flags``
+ :name: amdgpu-elf-header-e_flags-table
+
+ ================================= ========== =============================
+ Name Value Description
+ ================================= ========== =============================
+ **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
+ -------------------------------------------- -----------------------------
+ ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
+ mask for
+ ``EF_AMDGPU_MACH_xxx`` values
+ defined in
+ :ref:`amdgpu-ef-amdgpu-mach-table`.
+ ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
+ target feature is
+ enabled for all code
+ contained in the code object.
+ See
+ :ref:`amdgpu-target-features`.
+ ================================= ========== =============================
+
+ .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
+ :name: amdgpu-ef-amdgpu-mach-table
+
+ ================================= ========== =============================
+ Name Value Description (see
+ :ref:`amdgpu-processor-table`)
+ ================================= ========== =============================
+ ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
+ ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
+ ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
+ ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
+ ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
+ ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
+ ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
+ ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
+ ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
+ ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
+ ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
+ ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
+ ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
+ ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
+ ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
+ ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
+ ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
+ *reserved* 17-31 Reserved for ``r600``
+ architecture processors.
+ ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
+ ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
+ ================================= ========== =============================
Sections
--------
@@ -456,7 +605,7 @@ if needed.
The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
DWARF produced by the AMDGPU backend.
-``.dynamic``, ``.dynstr``, ``.dynstr``, ``.hash``
+``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
The standard sections used by a dynamic loader.
``.note``
@@ -484,15 +633,15 @@ if needed.
Note Records
------------
-As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
-the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
-minimal zero byte padding must be generated to ensure the ``desc`` field size is
-a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
-be at least 4 to indicate at least 8 byte alignment.
+As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
+be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
+aligned. In addition, minimal zero byte padding must be generated to ensure the
+``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
+``.note`` section must be at least 4 to indicate at least 4 byte alignment.
The AMDGPU backend code object uses the following ELF note records in the
``.note`` section. The *Description* column specifies the layout of the note
-record’s ``desc`` field. All fields are consecutive bytes. Note records with
+record's ``desc`` field. All fields are consecutive bytes. Note records with
variable size strings have a corresponding ``*_size`` field that specifies the
number of bytes, including the terminating null character, in the string. The
string(s) come immediately after the preceding fields.
@@ -502,92 +651,249 @@ Additional note records can be present.
.. table:: AMDGPU ELF Note Records
:name: amdgpu-elf-note-records-table
- ===== ========================== ==========================================
- Name Type Description
- ===== ========================== ==========================================
- "AMD" ``NT_AMD_AMDGPU_METADATA`` <metadata null terminated string>
- "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
- ===== ========================== ==========================================
+ ===== ============================== ======================================
+ Name Type Description
+ ===== ============================== ======================================
+ "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
+ ===== ============================== ======================================
..
.. table:: AMDGPU ELF Note Record Enumeration Values
:name: amdgpu-elf-note-record-enumeration-values-table
- ============================= =====
- Name Value
- ============================= =====
- *reserved* 0-9
- ``NT_AMD_AMDGPU_METADATA`` 10
- ``NT_AMD_AMDGPU_ISA`` 11
- ============================= =====
+ ============================== =====
+ Name Value
+ ============================== =====
+ *reserved* 0-9
+ ``NT_AMD_AMDGPU_HSA_METADATA`` 10
+ *reserved* 11
+ ============================== =====
+
+``NT_AMD_AMDGPU_HSA_METADATA``
+ Specifies extensible metadata associated with the code objects executed on HSA
+ [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
+ the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
+ :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
+ object metadata string.
-``NT_AMD_AMDGPU_ISA``
- Specifies the instruction set architecture used by the machine code contained
- in the code object.
+.. _amdgpu-symbols:
- This note record is required for code objects containing machine code for
- processors matching the ``amdgcn`` architecture in table
- :ref:`amdgpu-processors`.
+Symbols
+-------
- The null terminated string has the following syntax:
+Symbols include the following:
- *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
+ .. table:: AMDGPU ELF Symbols
+ :name: amdgpu-elf-symbols-table
- where:
+ ===================== ============== ============= ==================
+ Name Type Section Description
+ ===================== ============== ============= ==================
+ *link-name* ``STT_OBJECT`` - ``.data`` Global variable
+ - ``.rodata``
+ - ``.bss``
+ *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
+ *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
+ ===================== ============== ============= ==================
- *architecture*
- The architecture from table :ref:`amdgpu-target-triples-table`.
+Global variable
+ Global variables both used and defined by the compilation unit.
- This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
- :ref:`amdgpu-target-triples`).
+ If the symbol is defined in the compilation unit then it is allocated in the
+  appropriate section according to whether it has initialized data or is
+  readonly.
- *vendor*
- The vendor from table :ref:`amdgpu-target-triples-table`.
+  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
+ will resolve relocations using the definition provided by another code object
+ or explicitly defined by the runtime.
- For the AMDGPU backend this is always ``amd``.
+ All global symbols, whether defined in the compilation unit or external, are
+ accessed by the machine code indirectly through a GOT table entry. This
+ allows them to be preemptable. The GOT table is only supported when the target
+ triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
- *os*
- The OS from table :ref:`amdgpu-target-triples-table`.
+ .. TODO
+ Add description of linked shared object symbols. Seems undefined symbols
+ are marked as STT_NOTYPE.
- *environment*
- An environment from table :ref:`amdgpu-target-triples-table`, or blank if
- the environment has no affect on the execution of the code object.
+Kernel descriptor
+ Every HSA kernel has an associated kernel descriptor. It is the address of the
+ kernel descriptor that is used in the AQL dispatch packet used to invoke the
+ kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
+ defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
- For the AMDGPU backend this is currently always blank.
- *processor*
- The processor from table :ref:`amdgpu-processors-table`.
+Kernel entry point
+ Every HSA kernel also has a symbol for its machine code entry point.
- For example:
+.. _amdgpu-relocation-records:
- ``amdgcn-amd-amdhsa--gfx901``
+Relocation Records
+------------------
-``NT_AMD_AMDGPU_METADATA``
- Specifies extensible metadata associated with the code object. See
- :ref:`amdgpu-code-object-metadata` for the syntax of the code object metadata
- string.
+The AMDGPU backend generates ``Elf64_Rela`` relocation records. The following
+relocatable fields are supported:
- This note record is required and must contain the minimum information
- necessary to support the ROCM kernel queries. For example, the segment sizes
- needed in a dispatch packet. In addition, a high level language runtime may
- require other information to be included. For example, the AMD OpenCL runtime
- records kernel argument information.
+``word32``
+ This specifies a 32-bit field occupying 4 bytes with arbitrary byte
+ alignment. These values use the same byte order as other word values in the
+ AMD GPU architecture.
- .. TODO
- Is the string null terminated? It probably should not if YAML allows it to
- contain null characters, otherwise it should be.
+``word64``
+ This specifies a 64-bit field occupying 8 bytes with arbitrary byte
+ alignment. These values use the same byte order as other word values in the
+ AMD GPU architecture.
-.. _amdgpu-code-object-metadata:
+The following notations are used to specify the relocation calculations:
+
+**A**
+ Represents the addend used to compute the value of the relocatable field.
+
+**G**
+ Represents the offset into the global offset table at which the relocation
+ entry's symbol will reside during execution.
+
+**GOT**
+ Represents the address of the global offset table.
+
+**P**
+ Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
+ of the storage unit being relocated (computed using ``r_offset``).
+
+**S**
+ Represents the value of the symbol whose index resides in the relocation
+ entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
+
+**B**
+ Represents the base address of a loaded executable or shared object which is
+ the difference between the ELF address and the actual load address. Relocations
+ using this are only valid in executable or shared objects.
+
+The following relocation types are supported:
+
+ .. table:: AMDGPU ELF Relocation Records
+ :name: amdgpu-elf-relocation-records-table
+
+ ========================== ===== ========== ==============================
+ Relocation Type Value Field Calculation
+ ========================== ===== ========== ==============================
+ ``R_AMDGPU_NONE`` 0 *none* *none*
+ ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
+ ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
+ ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
+ ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
+ ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
+ ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
+ ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
+ ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
+ ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
+ ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
+ ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
+ *reserved* 12
+ ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
+ ========================== ===== ========== ==============================
+
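The calculations in the table above are plain integer arithmetic over the notations defined earlier. As an illustration only (not part of any AMDGPU tooling), the split LO/HI and GOT-relative forms can be sketched as:

```python
# Illustrative sketch of the relocation calculations in the table above.
# S, A, P, G and GOT are 64-bit values as defined in the notation list.

MASK32 = 0xFFFFFFFF


def r_amdgpu_abs32_lo(S, A):
    # (S + A) & 0xFFFFFFFF
    return (S + A) & MASK32


def r_amdgpu_abs32_hi(S, A):
    # (S + A) >> 32, truncated to the 32-bit word32 field
    return ((S + A) >> 32) & MASK32


def r_amdgpu_rel32_lo(S, A, P):
    # (S + A - P) & 0xFFFFFFFF; Python's & gives the two's-complement
    # low 32 bits for negative displacements
    return (S + A - P) & MASK32


def r_amdgpu_rel32_hi(S, A, P):
    # (S + A - P) >> 32, truncated to 32 bits
    return ((S + A - P) >> 32) & MASK32


def r_amdgpu_gotpcrel(G, GOT, A, P):
    # G + GOT + A - P, truncated to the 32-bit field
    return (G + GOT + A - P) & MASK32
```

The 64-bit forms (``R_AMDGPU_ABS64``, ``R_AMDGPU_REL64``, ``R_AMDGPU_RELATIVE64``) are the same sums without the truncation.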
+.. _amdgpu-dwarf:
+
+DWARF
+-----
+
+Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
+information that maps the code object executable code and data to the source
+language constructs, and can be used by tools such as debuggers and profilers.
+
+Address Space Mapping
+~~~~~~~~~~~~~~~~~~~~~
+
+The following address space mapping is used:
+
+ .. table:: AMDGPU DWARF Address Space Mapping
+ :name: amdgpu-dwarf-address-space-mapping-table
+
+ =================== =================
+ DWARF Address Space Memory Space
+ =================== =================
+ 1 Private (Scratch)
+ 2 Local (group/LDS)
+ *omitted* Global
+ *omitted* Constant
+ *omitted* Generic (Flat)
+ *not supported* Region (GDS)
+ =================== =================
+
+See :ref:`amdgpu-address-spaces` for information on the memory space terminology
+used in the table.
+
+An ``address_class`` attribute is generated on pointer type DIEs to specify the
+DWARF address space of the value of the pointer when it is in the *private* or
+*local* address space. Otherwise the attribute is omitted.
+
+An ``XDEREF`` operation is generated in location list expressions for variables
+that are allocated in the *private* or *local* address space. Otherwise no
+``XDEREF`` operation is generated.
+
+Register Mapping
+~~~~~~~~~~~~~~~~
+
+*This section is WIP.*
+
+.. TODO
+ Define DWARF register enumeration.
+
+ If want to present a wavefront state then should expose vector registers as
+ 64 wide (rather than per work-item view that LLVM uses). Either as separate
+ registers, or a 64x4 byte single register. In either case use a new LANE op
+  (akin to XDEREF) to select the current lane usage in a location
+ expression. This would also allow scalar register spilling to vector register
+ lanes to be expressed (currently no debug information is being generated for
+ spilling). If choose a wide single register approach then use LANE in
+ conjunction with PIECE operation to select the dword part of the register for
+ the current lane. If the separate register approach then use LANE to select
+ the register.
+
+Source Text
+~~~~~~~~~~~
+
+*This section is WIP.*
+
+.. TODO
+ DWARF extension to include runtime generated source text.
+
+.. _amdgpu-code-conventions:
+
+Code Conventions
+================
+
+This section provides code conventions used for each supported target triple OS
+(see :ref:`amdgpu-target-triples`).
+
+AMDHSA
+------
+
+This section provides code conventions used when the target triple OS is
+``amdhsa`` (see :ref:`amdgpu-target-triples`).
+
+.. _amdgpu-amdhsa-hsa-code-object-metadata:
Code Object Metadata
---------------------
+~~~~~~~~~~~~~~~~~~~~
-The code object metadata is specified by the ``NT_AMD_AMDHSA_METADATA`` note
-record (see :ref:`amdgpu-note-records`).
+The code object metadata specifies extensible metadata associated with the code
+objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
+[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
+(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
+``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
+information necessary to support the ROCm kernel queries. For example, the
+segment sizes needed in a dispatch packet. In addition, a high-level language
+runtime may require other information to be included. For example, the AMD
+OpenCL runtime records kernel argument information.
The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).
+.. TODO
+ Is the string null terminated? It probably should not if YAML allows it to
+ contain null characters, otherwise it should be.
+
The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
referenced tables.
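As a purely illustrative sketch, such a YAML document might look as follows; the ``Kernels`` and ``Name`` keys and all values shown here are hypothetical, and the mapping tables in this section remain the authoritative definition of the format:

```yaml
# Hypothetical example only; consult the mapping tables for the
# authoritative keys and value types.
Kernels:
  - Name:          example_kernel   # hypothetical kernel name
    Args:
      - ValueKind: GlobalBuffer     # key defined in the argument mapping table
        ValueType: F32              # illustrative value
```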
@@ -667,7 +973,7 @@ non-AMD key names should be prefixed by "*vendor-name*.".
See
:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
for the mapping definition.
- "Arguments" sequence of Sequence of mappings of the
+ "Args" sequence of Sequence of mappings of the
mapping kernel arguments. See
:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
for the definition of the mapping.
@@ -675,10 +981,6 @@ non-AMD key names should be prefixed by "*vendor-name*.".
the kernel code. See
:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
for the mapping definition.
- "DebugProps" mapping Mapping of properties related to
- the kernel debugging. See
- :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
- for the mapping definition.
================= ============== ========= ================================
..
@@ -708,6 +1010,16 @@ non-AMD key names should be prefixed by "*vendor-name*.".
Corresponds to the OpenCL
``vec_type_hint`` attribute.
+
+ "RuntimeHandle" string The external symbol name
+ associated with a kernel.
+ OpenCL runtime allocates a
+ global buffer for the symbol
+ and saves the kernel's address
+ to it, which is used for
+ device side enqueueing. Only
+ available for device side
+ enqueued kernels.
=================== ============== ========= ==============================
..
@@ -800,10 +1112,10 @@ non-AMD key names should be prefixed by "*vendor-name*.".
passed in the kernarg.
"HiddenCompletionAction"
- *TBD*
-
- .. TODO
- Add description.
+ A global address space pointer
+ to help link enqueued kernels into
+ the ancestor tree for determining
+ when the parent kernel has finished.
"ValueType" string Required Kernel argument value type. Only
present if "ValueKind" is
@@ -867,7 +1179,7 @@ non-AMD key names should be prefixed by "*vendor-name*.".
.. TODO
Does this apply to
GlobalBuffer?
- "ActualAcc" string The actual memory accesses
+ "ActualAccQual" string The actual memory accesses
performed by the kernel on the
kernel argument. Only present if
"ValueKind" is "GlobalBuffer",
@@ -936,9 +1248,9 @@ non-AMD key names should be prefixed by "*vendor-name*.".
private address space
memory required for a
work-item in
- bytes. If
- IsDynamicCallstack
- is 1 then additional
+ bytes. If the kernel
+ uses a dynamic call
+ stack then additional
space must be added
to this value for the
call stack.
@@ -949,7 +1261,7 @@ non-AMD key names should be prefixed by "*vendor-name*.".
be a power of 2.
"WavefrontSize" integer Required Wavefront size. Must
be a power of 2.
- "NumSGPRs" integer Number of scalar
+ "NumSGPRs" integer Required Number of scalar
registers used by a
wavefront for
GFX6-GFX9. This
@@ -965,228 +1277,42 @@ non-AMD key names should be prefixed by "*vendor-name*.".
rounded up to the
allocation
granularity.
- "NumVGPRs" integer Number of vector
+ "NumVGPRs" integer Required Number of vector
registers used by
each work-item for
GFX6-GFX9
- "MaxFlatWorkgroupSize" integer Maximum flat
+ "MaxFlatWorkGroupSize" integer Required Maximum flat
work-group size
supported by the
kernel in work-items.
- "IsDynamicCallStack" boolean Indicates if the
- generated machine
- code is using a
- dynamically sized
- call stack.
- "IsXNACKEnabled" boolean Indicates if the
- generated machine
- code is capable of
- supporting XNACK.
+ Must be >=1 and
+ consistent with any
+ non-0 values in
+ FixedWorkGroupSize.
+ "FixedWorkGroupSize" sequence of Corresponds to the
+ 3 integers dispatch work-group
+ size X, Y, Z. If
+ omitted, defaults to
+ 0, 0, 0. If an
+ element is non-0 then
+ the kernel must only
+ be launched with a
+                                           matching work-group
+                                           size.
+ "NumSpilledSGPRs" integer Number of stores from
+ a scalar register to
+ a register allocator
+ created spill
+ location.
+ "NumSpilledVGPRs" integer Number of stores from
+ a vector register to
+ a register allocator
+ created spill
+ location.
============================ ============== ========= =====================
..
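The stated constraints tying "MaxFlatWorkGroupSize" to "FixedWorkGroupSize" can be sketched as a small validation helper. This is an illustration only, not part of any runtime; the interpretation of "consistent" as "a fully pinned work-group size must not exceed the flat maximum" is an assumption:

```python
def check_work_group_metadata(max_flat, fixed=(0, 0, 0)):
    """Sketch of the stated MaxFlatWorkGroupSize/FixedWorkGroupSize rules.

    max_flat: value of "MaxFlatWorkGroupSize"; must be >= 1.
    fixed: value of "FixedWorkGroupSize"; defaults to (0, 0, 0) when omitted.
    """
    if max_flat < 1:
        return False
    # A non-0 element pins the dispatch work-group size in that dimension.
    # If all three dimensions are pinned, their product is the flat
    # work-group size and cannot exceed the declared maximum (assumed
    # meaning of "consistent").
    if all(d != 0 for d in fixed):
        if fixed[0] * fixed[1] * fixed[2] > max_flat:
            return False
    return True
```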
- .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
- :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
-
- =================================== ============== ========= ==============
- String Key Value Type Required? Description
- =================================== ============== ========= ==============
- "DebuggerABIVersion" string
- "ReservedNumVGPRs" integer
- "ReservedFirstVGPR" integer
- "PrivateSegmentBufferSGPR" integer
- "WavefrontPrivateSegmentOffsetSGPR" integer
- =================================== ============== ========= ==============
-
-.. TODO
- Plan to remove the debug properties metadata.
-
-.. _amdgpu-symbols:
-
-Symbols
--------
-
-Symbols include the following:
-
- .. table:: AMDGPU ELF Symbols
- :name: amdgpu-elf-symbols-table
-
- ===================== ============== ============= ==================
- Name Type Section Description
- ===================== ============== ============= ==================
- *link-name* ``STT_OBJECT`` - ``.data`` Global variable
- - ``.rodata``
- - ``.bss``
- *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
- *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
- ===================== ============== ============= ==================
-
-Global variable
- Global variables both used and defined by the compilation unit.
-
- If the symbol is defined in the compilation unit then it is allocated in the
- appropriate section according to if it has initialized data or is readonly.
-
- If the symbol is external then its section is ``STN_UNDEF`` and the loader
- will resolve relocations using the definition provided by another code object
- or explicitly defined by the runtime.
-
- All global symbols, whether defined in the compilation unit or external, are
- accessed by the machine code indirectly through a GOT table entry. This
- allows them to be preemptable. The GOT table is only supported when the target
- triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
-
- .. TODO
- Add description of linked shared object symbols. Seems undefined symbols
- are marked as STT_NOTYPE.
-
-Kernel descriptor
- Every HSA kernel has an associated kernel descriptor. It is the address of the
- kernel descriptor that is used in the AQL dispatch packet used to invoke the
- kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
- defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
-
-Kernel entry point
- Every HSA kernel also has a symbol for its machine code entry point.
-
-.. _amdgpu-relocation-records:
-
-Relocation Records
-------------------
-
-AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
-relocatable fields are:
-
-``word32``
- This specifies a 32-bit field occupying 4 bytes with arbitrary byte
- alignment. These values use the same byte order as other word values in the
- AMD GPU architecture.
-
-``word64``
- This specifies a 64-bit field occupying 8 bytes with arbitrary byte
- alignment. These values use the same byte order as other word values in the
- AMD GPU architecture.
-
-Following notations are used for specifying relocation calculations:
-
-**A**
- Represents the addend used to compute the value of the relocatable field.
-
-**G**
- Represents the offset into the global offset table at which the relocation
- entry’s symbol will reside during execution.
-
-**GOT**
- Represents the address of the global offset table.
-
-**P**
- Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
- of the storage unit being relocated (computed using ``r_offset``).
-
-**S**
- Represents the value of the symbol whose index resides in the relocation
- entry.
-
-The following relocation types are supported:
-
- .. table:: AMDGPU ELF Relocation Records
- :name: amdgpu-elf-relocation-records-table
-
- ========================== ===== ========== ==============================
- Relocation Type Value Field Calculation
- ========================== ===== ========== ==============================
- ``R_AMDGPU_NONE`` 0 *none* *none*
- ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
- ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
- ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
- ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
- ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
- ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
- ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
- ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
- ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
- ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
- ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
- ========================== ===== ========== ==============================
-
-.. _amdgpu-dwarf:
-
-DWARF
------
-
-Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
-information that maps the code object executable code and data to the source
-language constructs. It can be used by tools such as debuggers and profilers.
-
-Address Space Mapping
-~~~~~~~~~~~~~~~~~~~~~
-
-The following address space mapping is used:
-
- .. table:: AMDGPU DWARF Address Space Mapping
- :name: amdgpu-dwarf-address-space-mapping-table
-
- =================== =================
- DWARF Address Space Memory Space
- =================== =================
- 1 Private (Scratch)
- 2 Local (group/LDS)
- *omitted* Global
- *omitted* Constant
- *omitted* Generic (Flat)
- *not supported* Region (GDS)
- =================== =================
-
-See :ref:`amdgpu-address-spaces` for infomration on the memory space terminology
-used in the table.
-
-An ``address_class`` attribute is generated on pointer type DIEs to specify the
-DWARF address space of the value of the pointer when it is in the *private* or
-*local* address space. Otherwise the attribute is omitted.
-
-An ``XDEREF`` operation is generated in location list expressions for variables
-that are allocated in the *private* and *local* address space. Otherwise no
-``XDREF`` is omitted.
-
-Register Mapping
-~~~~~~~~~~~~~~~~
-
-*This section is WIP.*
-
-.. TODO
- Define DWARF register enumeration.
-
- If want to present a wavefront state then should expose vector registers as
- 64 wide (rather than per work-item view that LLVM uses). Either as separate
- registers, or a 64x4 byte single register. In either case use a new LANE op
- (akin to XDREF) to select the current lane usage in a location
- expression. This would also allow scalar register spilling to vector register
- lanes to be expressed (currently no debug information is being generated for
- spilling). If choose a wide single register approach then use LANE in
- conjunction with PIECE operation to select the dword part of the register for
- the current lane. If the separate register approach then use LANE to select
- the register.
-
-Source Text
-~~~~~~~~~~~
-
-*This section is WIP.*
-
-.. TODO
- DWARF extension to include runtime generated source text.
-
-.. _amdgpu-code-conventions:
-
-Code Conventions
-================
-
-AMDHSA
-------
-
-This section provides code conventions used when the target triple OS is
-``amdhsa`` (see :ref:`amdgpu-target-triples`).
-
Kernel Dispatch
~~~~~~~~~~~~~~~
@@ -1220,7 +1346,7 @@ CPU host program, or from an HSA kernel executing on a GPU.
for a memory region with the kernarg property for the kernel agent that will
execute the kernel. It must be at least 16 byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
- allocation. The layout is defined in the *HSA Programmer’s Language Reference*
+ allocation. The layout is defined in the *HSA Programmer's Language Reference*
[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
memory in the same way constant memory is accessed. (Note that the HSA
specification allows an implementation to copy the kernel argument contents to
@@ -1237,7 +1363,7 @@ CPU host program, or from an HSA kernel executing on a GPU.
such as grid and work-group size, together with information from the code
object about the kernel, such as segment sizes. The ROCm runtime queries on
the kernel symbol can be used to obtain the code object values which are
- recorded in the :ref:`amdgpu-code-object-metadata`.
+ recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
@@ -1331,8 +1457,8 @@ registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
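Because the aperture base is 2^32 aligned and the aperture size is 2^32 bytes, the flat-to-segment conversion reduces to keeping the low 32 bits of the flat address. A minimal sketch, assuming the base alignment stated above (the helper names are illustrative):

```python
APERTURE_SIZE = 1 << 32  # 2^32-byte apertures in 64-bit address mode


def flat_to_segment_offset(flat_addr, aperture_base):
    """Return the segment offset for a flat address inside an aperture.

    Illustration only: assumes aperture_base is 2^32 aligned, as stated
    for the SRC_SHARED_BASE/LIMIT and SRC_PRIVATE_BASE/LIMIT apertures.
    """
    assert aperture_base % APERTURE_SIZE == 0
    assert aperture_base <= flat_addr < aperture_base + APERTURE_SIZE
    # With a 2^32-aligned base, subtracting the base is the same as
    # masking to the low 32 bits.
    return flat_addr & (APERTURE_SIZE - 1)


def segment_to_flat(offset, aperture_base):
    """Inverse conversion: rebase a segment offset into the flat space."""
    return aperture_base + offset
```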
-HSA Image and Samplers
-~~~~~~~~~~~~~~~~~~~~~~
+Image and Samplers
+~~~~~~~~~~~~~~~~~~
Image and sample handles created by the ROCm runtime are 64 bit addresses of a
hardware 32 byte V# and 48 byte S# object respectively. In order to support the
@@ -1343,17 +1469,17 @@ representation.
HSA Signals
~~~~~~~~~~~
-Signal handles created by the ROCm runtime are 64 bit addresses of a structure
-allocated in memory accessible from both the CPU and GPU. The structure is
-defined by the ROCm runtime and subject to change between releases (see
-[AMD-ROCm-github]_).
+HSA signal handles created by the ROCm runtime are 64 bit addresses of a
+structure allocated in memory accessible from both the CPU and GPU. The
+structure is defined by the ROCm runtime and subject to change between releases
+(see [AMD-ROCm-github]_).
.. _amdgpu-amdhsa-hsa-aql-queue:
HSA AQL Queue
~~~~~~~~~~~~~
-The AQL queue structure is defined by the ROCm runtime and subject to change
+The HSA AQL queue structure is defined by the ROCm runtime and subject to change
between releases (see [AMD-ROCm-github]_). For some processors it contains
fields needed to implement certain language features such as the flat address
aperture bases. It also contains fields used by CP such as managing the
@@ -1376,10 +1502,10 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
.. table:: Kernel Descriptor for GFX6-GFX9
:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
- ======= ======= =============================== ===========================
+ ======= ======= =============================== ============================
Bits Size Field Name Description
- ======= ======= =============================== ===========================
- 31:0 4 bytes group_segment_fixed_size The amount of fixed local
+ ======= ======= =============================== ============================
+ 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
address space memory
required for a work-group
in bytes. This does not
@@ -1388,7 +1514,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
space memory that may be
added when the kernel is
dispatched.
- 63:32 4 bytes private_segment_fixed_size The amount of fixed
+ 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
private address space
memory required for a
work-item in bytes. If
@@ -1396,42 +1522,55 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
then additional space must
be added to this value for
the call stack.
- 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
- size supported by the
- kernel in work-items.
- 96 1 bit is_dynamic_call_stack Indicates if the generated
- machine code is using a
- dynamically sized call
- stack.
- 97 1 bit is_xnack_enabled Indicates if the generated
- machine code is capable of
- suppoting XNACK.
- 127:98 30 bits Reserved. Must be 0.
- 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
+ 127:64 8 bytes Reserved, must be 0.
+ 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
negative) from base
address of kernel
descriptor to kernel's
entry point instruction
which must be 256 byte
aligned.
- 383:192 24 Reserved. Must be 0.
+ 223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
+ size supported by the
+ kernel in work-items. If
+ an exact work-group size
+ is required then must be
+ omitted or 0 and
+ ReqdWorkGroupSize* must
+ be set to non-0.
+ 239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then
+ the kernel
+ must be executed with the
+ specified work-group size
+ for X.
+ 255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then
+ the kernel
+ must be executed with the
+ specified work-group size
+ for Y.
+ 271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then
+ the kernel
+ must be executed with the
+ specified work-group size
+ for Z.
+ 383:272 14 Reserved, must be 0.
bytes
- 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
+ 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
program settings used by
CP to set up
``COMPUTE_PGM_RSRC1``
configuration
register. See
- :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
- 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
+ :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
+ 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
program settings used by
CP to set up
``COMPUTE_PGM_RSRC2``
configuration
register. See
:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
- 448 1 bit enable_sgpr_private_segment Enable the setup of the
- _buffer SGPR user data registers
+ 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
+ SGPR user data registers
(see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
@@ -1442,55 +1581,57 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
Any requests beyond 16
will be ignored.
- 449 1 bit enable_sgpr_dispatch_ptr *see above*
- 450 1 bit enable_sgpr_queue_ptr *see above*
- 451 1 bit enable_sgpr_kernarg_segment_ptr *see above*
- 452 1 bit enable_sgpr_dispatch_id *see above*
- 453 1 bit enable_sgpr_flat_scratch_init *see above*
- 454 1 bit enable_sgpr_private_segment *see above*
- _size
- 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
- _count_X should always be 0.
- 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
- _count_Y should always be 0.
- 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
- _count_Z should always be 0.
- 463:458 6 bits Reserved. Must be 0.
- 511:464 4 Reserved. Must be 0.
+ 449 1 bit EnableSGPRDispatchPtr *see above*
+ 450 1 bit EnableSGPRQueuePtr *see above*
+ 451 1 bit EnableSGPRKernargSegmentPtr *see above*
+ 452 1 bit EnableSGPRDispatchID *see above*
+ 453 1 bit EnableSGPRFlatScratchInit *see above*
+ 454 1 bit EnableSGPRPrivateSegmentSize *see above*
+ 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
+ should always be 0.
+ 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
+ should always be 0.
+ 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
+ should always be 0.
+ 463:458 6 bits Reserved, must be 0.
+ 511:464 6 Reserved, must be 0.
bytes
512 **Total size 64 bytes.**
- ======= ===================================================================
+ ======= ====================================================================
..
.. table:: compute_pgm_rsrc1 for GFX6-GFX9
- :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
+ :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
======= ======= =============================== ===========================================================================
Bits Size Field Name Description
======= ======= =============================== ===========================================================================
- 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
+ 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
used by each work-item,
granularity is device
specific:
- GFX6-9
- roundup((max-vgpg + 1)
- / 4) - 1
+ GFX6-GFX9
+ - max_vgpr 1..256
+                                                     - roundup((max_vgpr + 1)
+ / 4) - 1
Used by CP to set up
``COMPUTE_PGM_RSRC1.VGPRS``.
- 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
+ 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
used by a wavefront,
granularity is device
specific:
- GFX6-8
- roundup((max-sgpg + 1)
- / 8) - 1
+ GFX6-GFX8
+ - max_sgpr 1..112
+                                                     - roundup((max_sgpr + 1)
+ / 8) - 1
GFX9
- roundup((max-sgpg + 1)
- / 16) - 1
+ - max_sgpr 1..112
+                                                       - roundup((max_sgpg + 1)
+ / 16) - 1
Includes the special SGPRs
for VCC, Flat Scratch (for
@@ -1502,7 +1643,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC1.SGPRS``.
- 11:10 2 bits priority Must be 0.
+ 11:10 2 bits PRIORITY Must be 0.
Start executing wavefront
at the specified priority.
@@ -1510,7 +1651,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.PRIORITY``.
- 13:12 2 bits float_mode_round_32 Wavefront starts execution
+ 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
with specified rounding
mode for single (32
bit) floating point
@@ -1523,7 +1664,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
- 15:14 2 bits float_mode_round_16_64 Wavefront starts execution
+ 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
with specified rounding
denorm mode for half/double (16
and 64 bit) floating point
@@ -1536,7 +1677,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
- 17:16 2 bits float_mode_denorm_32 Wavefront starts execution
+ 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
with specified denorm mode
for single (32
bit) floating point
@@ -1549,7 +1690,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
- 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
+ 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
with specified denorm mode
for half/double (16
and 64 bit) floating point
@@ -1562,7 +1703,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
- 20 1 bit priv Must be 0.
+ 20 1 bit PRIV Must be 0.
Start executing wavefront
in privilege trap handler
@@ -1571,10 +1712,10 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.PRIV``.
- 21 1 bit enable_dx10_clamp Wavefront starts execution
+ 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
with DX10 clamp mode
enabled. Used by the vector
- ALU to force DX-10 style
+ ALU to force DX10 style
treatment of NaN's (when
set, clamp NaN to zero,
otherwise pass NaN
@@ -1582,7 +1723,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
- 22 1 bit debug_mode Must be 0.
+ 22 1 bit DEBUG_MODE Must be 0.
Start executing wavefront
in single step mode.
@@ -1590,7 +1731,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
- 23 1 bit enable_ieee_mode Wavefront starts execution
+ 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
with IEEE mode
enabled. Floating point
opcodes that support
@@ -1605,7 +1746,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC1.IEEE_MODE``.
- 24 1 bit bulky Must be 0.
+ 24 1 bit BULKY Must be 0.
Only one work-group allowed
to execute on a compute
@@ -1614,7 +1755,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.BULKY``.
- 25 1 bit cdbg_user Must be 0.
+ 25 1 bit CDBG_USER Must be 0.
Flag that can be used to
control debugging code.
@@ -1622,7 +1763,25 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.CDBG_USER``.
- 31:26 6 bits Reserved. Must be 0.
+ 26 1 bit FP16_OVFL GFX6-GFX8
+ Reserved, must be 0.
+ GFX9
+ Wavefront starts execution
+ with specified fp16 overflow
+ mode.
+
+ - If 0, fp16 overflow generates
+ +/-INF values.
+ - If 1, fp16 overflow that is the
+                                               result of a +/-INF input value
+ or divide by 0 produces a +/-INF,
+ otherwise clamps computed
+ overflow to +/-MAX_FP16 as
+ appropriate.
+
+ Used by CP to set up
+ ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
+ 31:27 5 bits Reserved, must be 0.
32 **Total size 4 bytes**
======= ===================================================================================================================
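The granularity formulas in GRANULATED_WORKITEM_VGPR_COUNT and GRANULATED_WAVEFRONT_SGPR_COUNT above translate directly into integer arithmetic. A sketch for illustration, where ``max_vgpr``/``max_sgpr`` are the values named in those formulas:

```python
def granulated_vgpr_count(max_vgpr):
    """GFX6-GFX9: roundup((max_vgpr + 1) / 4) - 1.

    -(-n // d) is integer ceiling division of n by d.
    """
    return -(-(max_vgpr + 1) // 4) - 1


def granulated_sgpr_count(max_sgpr, gfx9=False):
    """GFX6-GFX8: roundup((max_sgpr + 1) / 8) - 1.
    GFX9:       roundup((max_sgpr + 1) / 16) - 1.
    """
    granularity = 16 if gfx9 else 8
    return -(-(max_sgpr + 1) // granularity) - 1
```

CP uses the resulting values to populate ``COMPUTE_PGM_RSRC1.VGPRS`` and ``COMPUTE_PGM_RSRC1.SGPRS``.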
@@ -1634,14 +1793,14 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
======= ======= =============================== ===========================================================================
Bits Size Field Name Description
======= ======= =============================== ===========================================================================
- 0 1 bit enable_sgpr_private_segment Enable the setup of the
- _wave_offset SGPR wave scratch offset
+ 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
+ _WAVE_OFFSET SGPR wave scratch offset
system register (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
Used by CP to set up
``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
- 5:1 5 bits user_sgpr_count The total number of SGPR
+ 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
user data registers
requested. This number must
match the number of user
@@ -1649,7 +1808,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC2.USER_SGPR``.
- 6 1 bit enable_trap_handler Set to 1 if code contains a
+ 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
TRAP instruction which
requires a trap handler to
be enabled.
@@ -1660,7 +1819,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
installed a trap handler
regardless of the setting
of this field.
- 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
+ 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
system SGPR register for
the work-group id in the X
dimension (see
@@ -1668,7 +1827,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC2.TGID_X_EN``.
- 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
+ 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
system SGPR register for
the work-group id in the Y
dimension (see
@@ -1676,7 +1835,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
- 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
+ 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
system SGPR register for
the work-group id in the Z
dimension (see
@@ -1684,14 +1843,14 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
- 10 1 bit enable_sgpr_workgroup_info Enable the setup of the
+ 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
system SGPR register for
work-group information (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
Used by CP to set up
``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
- 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
+ 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
VGPR system registers used
for the work-item ID.
:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
@@ -1699,7 +1858,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
Used by CP to set up
``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
- 13 1 bit enable_exception_address_watch Must be 0.
+ 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Wavefront starts execution
with address watch
@@ -1715,7 +1874,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
according to what the
runtime requests.
- 14 1 bit enable_exception_memory Must be 0.
+ 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Wavefront starts execution
with memory violation
@@ -1734,7 +1893,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
according to what the
runtime requests.
- 23:15 9 bits granulated_lds_size Must be 0.
+ 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
CP uses the rounded value
from the dispatch packet,
@@ -1755,8 +1914,8 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
GFX7-GFX9:
roundup(lds-size / (128 * 4))
- 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
- _invalid_operation with specified exceptions
+ 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
+ _INVALID_OPERATION with specified exceptions
enabled.
Used by CP to set up
@@ -1765,21 +1924,21 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
IEEE 754 FP Invalid
Operation
- 25 1 bit enable_exception_fp_denormal FP Denormal one or more
- _source input operands is a
+ 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
+ _SOURCE input operands is a
denormal number
- 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
- _division_by_zero Zero
- 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
- _overflow
- 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
- _underflow
- 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
- _inexact
- 30 1 bit enable_exception_int_divide_by Integer Division by Zero
- _zero (rcp_iflag_f32 instruction
+ 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
+ _DIVISION_BY_ZERO Zero
+     27      1 bit  ENABLE_EXCEPTION_IEEE_754_FP   IEEE 754 FP Overflow
+ _OVERFLOW
+ 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
+ _UNDERFLOW
+ 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
+ _INEXACT
+ 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
+ _ZERO (rcp_iflag_f32 instruction
only)
- 31 1 bit Reserved. Must be 0.
+ 31 1 bit Reserved, must be 0.
32 **Total size 4 bytes.**
======= ===================================================================================================================
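As an aside for tool writers: the bit ranges in the table above decode with
plain shift-and-mask arithmetic. The sketch below (Python; ``extract_bits``
and the sample word are illustrative, not part of any AMD API) checks a few
of the fields:

```python
# Decode selected COMPUTE_PGM_RSRC2-style bit fields from the table above.
# The helper and the composed sample word are illustrative only.

def extract_bits(word, hi, lo):
    """Return bits hi:lo (inclusive) of a 32 bit word."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)

# Compose a sample word: USER_SGPR_COUNT = 6 (bits 5:1),
# ENABLE_SGPR_WORKGROUP_ID_X = 1 (bit 7),
# ENABLE_VGPR_WORKITEM_ID = 2 (bits 12:11).
word = (6 << 1) | (1 << 7) | (2 << 11)

assert extract_bits(word, 5, 1) == 6     # USER_SGPR_COUNT
assert extract_bits(word, 7, 7) == 1     # ENABLE_SGPR_WORKGROUP_ID_X
assert extract_bits(word, 12, 11) == 2   # ENABLE_VGPR_WORKITEM_ID
```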
@@ -1788,45 +1947,46 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
.. table:: Floating Point Rounding Mode Enumeration Values
:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
- ===================================== ===== ===============================
- Enumeration Name Value Description
- ===================================== ===== ===============================
- AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
- AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
- AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
- AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
- ===================================== ===== ===============================
+ ====================================== ===== ==============================
+ Enumeration Name Value Description
+ ====================================== ===== ==============================
+ AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
+ AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
+ AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
+ AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
+ ====================================== ===== ==============================
..
.. table:: Floating Point Denorm Mode Enumeration Values
:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
- ===================================== ===== ===============================
- Enumeration Name Value Description
- ===================================== ===== ===============================
- AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
- Denorms
- AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
- AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
- AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
- ===================================== ===== ===============================
+ ====================================== ===== ==============================
+ Enumeration Name Value Description
+ ====================================== ===== ==============================
+ AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
+ Denorms
+ AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
+ AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
+ AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
+ ====================================== ===== ==============================
..
.. table:: System VGPR Work-Item ID Enumeration Values
:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
- ===================================== ===== ===============================
- Enumeration Name Value Description
- ===================================== ===== ===============================
- AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
- AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
- dimensions ID.
- AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
- dimensions ID.
- AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
- ===================================== ===== ===============================
+ ======================================== ===== ============================
+ Enumeration Name Value Description
+ ======================================== ===== ============================
+ AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
+ ID.
+ AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
+ dimensions ID.
+ AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
+ dimensions ID.
+ AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
+ ======================================== ===== ============================
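A small sketch (Python; the VGPR-count mapping is inferred from the table
above for illustration, not normative text) of how the two bit
ENABLE_VGPR_WORKITEM_ID field selects which work-item ID VGPRs are set up:

```python
# Enumeration values from the System VGPR Work-Item ID table above.
AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         = 0  # X only
AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       = 1  # X and Y
AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     = 2  # X, Y and Z
AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED = 3  # undefined

def workitem_id_vgpr_count(enable_vgpr_workitem_id):
    """Work-item ID VGPRs initialized for a field value (illustrative)."""
    counts = {0: 1, 1: 2, 2: 3}  # X; X,Y; X,Y,Z
    if enable_vgpr_workitem_id not in counts:
        raise ValueError("undefined ENABLE_VGPR_WORKITEM_ID encoding")
    return counts[enable_vgpr_workitem_id]

assert workitem_id_vgpr_count(AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z) == 3
```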
.. _amdgpu-amdhsa-initial-kernel-execution-state:
@@ -1901,60 +2061,85 @@ SGPR register initial state is defined in
for scratch for the queue
executing the kernel
dispatch. CP obtains this
- from the runtime.
-
- This is the same offset used
- in computing the Scratch
- Segment Buffer base
- address. The value of
- Scratch Wave Offset must be
- added by the kernel machine
- code and moved to SGPRn-4
- for use as the FLAT SCRATCH
- BASE in flat memory
- instructions.
+ from the runtime. (The
+ Scratch Segment Buffer base
+ address is
+ ``SH_HIDDEN_PRIVATE_BASE_VIMID``
+ plus this offset.) The value
+ of Scratch Wave Offset must
+ be added to this offset by
+ the kernel machine code,
+ right shifted by 8, and
+ moved to the FLAT_SCRATCH_HI
+ SGPR register.
+ FLAT_SCRATCH_HI corresponds
+ to SGPRn-4 on GFX7, and
+ SGPRn-6 on GFX8 (where SGPRn
+ is the highest numbered SGPR
+ allocated to the wave).
+ FLAT_SCRATCH_HI is
+ multiplied by 256 (as it is
+ in units of 256 bytes) and
+ added to
+ ``SH_HIDDEN_PRIVATE_BASE_VIMID``
+ to calculate the per wave
+ FLAT SCRATCH BASE in flat
+ memory instructions that
+ access the scratch
+                                       aperture.
The second SGPR is 32 bit
byte size of a single
- work-item’s scratch memory
- usage. This is directly
- loaded from the kernel
- dispatch packet Private
- Segment Byte Size and
- rounded up to a multiple of
- DWORD.
-
- The kernel code must move to
- SGPRn-3 for use as the FLAT
- SCRATCH SIZE in flat memory
+ work-item's scratch memory
+ usage. CP obtains this from
+ the runtime, and it is
+ always a multiple of DWORD.
+ CP checks that the value in
+ the kernel dispatch packet
+ Private Segment Byte Size is
+ not larger, and requests the
+ runtime to increase the
+ queue's scratch size if
+ necessary. The kernel code
+ must move it to
+ FLAT_SCRATCH_LO which is
+ SGPRn-3 on GFX7 and SGPRn-5
+ on GFX8. FLAT_SCRATCH_LO is
+ used as the FLAT SCRATCH
+ SIZE in flat memory
instructions. Having CP load
it once avoids loading it at
the beginning of every
wavefront.
GFX9
- This is the 64 bit base
- address of the per SPI
- scratch backing memory
- managed by SPI for the queue
- executing the kernel
- dispatch. CP obtains this
- from the runtime (and
+ This is the
+ 64 bit base address of the
+ per SPI scratch backing
+ memory managed by SPI for
+ the queue executing the
+ kernel dispatch. CP obtains
+ this from the runtime (and
divides it if there are
multiple Shader Arrays each
with its own SPI). The value
of Scratch Wave Offset must
be added by the kernel
- machine code and moved to
- SGPRn-4 and SGPRn-3 for use
- as the FLAT SCRATCH BASE in
- flat memory instructions.
+ machine code and the result
+ moved to the FLAT_SCRATCH
+ SGPR which is SGPRn-6 and
+ SGPRn-5. It is used as the
+ FLAT SCRATCH BASE in flat
+ memory instructions.
then Private Segment Size 1 The 32 bit byte size of a
- (enable_sgpr_private single work-item’s scratch
- _segment_size) memory allocation. This is the
- value from the kernel dispatch
- packet Private Segment Byte
- Size rounded up by CP to a
- multiple of DWORD.
+       (enable_sgpr_private            single work-item's scratch
+       _segment_size)                  memory allocation. This is the
+                                       value from the kernel dispatch
+                                       packet Private Segment Byte
+                                       Size rounded up by CP to a
+                                       multiple of DWORD.
Having CP load it once avoids
loading it at the beginning of
@@ -2006,7 +2191,7 @@ SGPR register initial state is defined in
then Work-Group Id Z 1 32 bit work-group id in Z
(enable_sgpr_workgroup_id dimension of grid for
_Z) wavefront.
- then Work-Group Info 1 {first_wave, 14’b0000,
+ then Work-Group Info 1 {first_wave, 14'b0000,
(enable_sgpr_workgroup ordered_append_term[10:0],
_info) threadgroup_size_in_waves[5:0]}
then Scratch Wave Offset 1 32 bit byte offset from base
@@ -2066,7 +2251,7 @@ Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
value to the hardware required SGPRn-3 and SGPRn-4 respectively.
The global segment can be accessed either using buffer instructions (GFX6 which
-has V# 64 bit address support), flat instructions (GFX7-9), or global
+has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
instructions (GFX9).
If buffer operations are used then the compiler can generate a V# with the
@@ -2112,7 +2297,7 @@ Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
GFX6
Flat scratch is not supported.
-GFX7-8
+GFX7-GFX8
1. The low word of Flat Scratch Init is 32 bit byte offset from
``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
being managed by SPI for the queue executing the kernel dispatch. This is
@@ -2127,6 +2312,7 @@ GFX7-8
DWORD. Having CP load it once avoids loading it at the beginning of every
wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
SIZE.
+
GFX9
The Flat Scratch Init is the 64 bit address of the base of scratch backing
memory being managed by SPI for the queue executing the kernel dispatch. The
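The GFX7-GFX8 flat scratch arithmetic described above (add the wave's scratch
offset, shift right by 8 since FLAT_SCRATCH_HI is in units of 256 bytes, then
let hardware scale it back up and add ``SH_HIDDEN_PRIVATE_BASE_VIMID``) can be
modelled numerically; every concrete value below is invented for illustration:

```python
# Model of the GFX7-GFX8 FLAT SCRATCH BASE computation; sample values invented.
SH_HIDDEN_PRIVATE_BASE_VIMID = 0x100000  # hypothetical hidden base address
scratch_base_offset = 0x20000            # byte offset of scratch backing memory
scratch_wave_offset = 0x400              # this wave's byte offset (256 aligned)

# Kernel prolog: add the wave offset and shift right by 8 (256 byte units).
flat_scratch_hi = (scratch_base_offset + scratch_wave_offset) >> 8

# Hardware: multiply by 256 and add the hidden base to form FLAT SCRATCH BASE.
flat_scratch_base = SH_HIDDEN_PRIVATE_BASE_VIMID + flat_scratch_hi * 256

# Round-trips exactly because both offsets are 256 byte aligned.
assert flat_scratch_base == (SH_HIDDEN_PRIVATE_BASE_VIMID
                             + scratch_base_offset + scratch_wave_offset)
```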
@@ -2144,9 +2330,6 @@ This section describes the mapping of LLVM memory model onto AMDGPU machine code
.. TODO
Update when implementation complete.
- Support more relaxed OpenCL memory model to be controlled by environment
- component of target triple.
-
The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.
@@ -2163,19 +2346,23 @@ additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.
-The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
-OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
-relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
-memory model which has separate happens-before relations for the global and
-local address spaces, and only a fence specifying both global and local address
-space joins the relationships. Since the LLVM ``memfence`` instruction does not
-allow an address space to be specified the OpenCL fence has to convervatively
-assume both local and global address space was specified. However, optimizations
-can often be done to eliminate the additional ``s_waitcnt``instructions when
-there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
-instructions. The code sequences in the table indicate what can be omitted for
-the OpenCL memory. The target triple environment is used to determine if the
-source language is OpenCL (see :ref:`amdgpu-opencl`).
+The AMDGPU backend supports the following memory models:
+
+ HSA Memory Model [HSA]_
+ The HSA memory model uses a single happens-before relation for all address
+ spaces (see :ref:`amdgpu-address-spaces`).
+ OpenCL Memory Model [OpenCL]_
+    The OpenCL memory model has separate happens-before relations for the
+    global and local address spaces. Only a fence specifying both the global
+    and local address spaces, together with seq_cst instructions, joins the
+    relations. Since the LLVM ``fence`` instruction does not allow an address
+    space to be specified, the OpenCL fence has to conservatively assume that
+    both the local and global address spaces were specified. However,
+    optimizations can often be
+ done to eliminate the additional ``s_waitcnt`` instructions when there are
+ no intervening memory instructions which access the corresponding address
+ space. The code sequences in the table indicate what can be omitted for the
+ OpenCL memory. The target triple environment is used to determine if the
+ source language is OpenCL (see :ref:`amdgpu-opencl`).
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.
@@ -2204,14 +2391,14 @@ For GFX6-GFX9:
same wavefront.
* The vector memory operations are performed as wavefront wide operations and
completion is reported to a wavefront in execution order. The exception is
- that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
+ that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
vector memory order if they access LDS memory, and out of LDS operation order
if they access global memory.
-* The vector memory operations access a vector L1 cache shared by all wavefronts
- on a CU. Therefore, no special action is required for coherence between
- wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
- coherence between waves executing in different work-groups as they may be
- executing on different CUs.
+* The vector memory operations access a single vector L1 cache shared by all
+  SIMDs on a CU. Therefore, no special action is required for coherence between the
+ lanes of a single wavefront, or for coherence between wavefronts in the same
+ work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
+ executing in different work-groups as they may be executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
scalar operations are used in a restricted way so do not impact the memory
@@ -2231,7 +2418,7 @@ For GFX6-GFX9:
* The L2 cache can be kept coherent with other agents on some targets, or ranges
of virtual addresses can be set up to bypass it to ensure system coherence.
-Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
+Private address space uses ``buffer_load/store`` with the scratch V# (GFX6-GFX8),
or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
memory, atomic memory orderings are not meaningful and all accesses are treated
as non-atomic.
@@ -2275,45 +2462,62 @@ future wave that uses the same scratch area, or a function call that creates a
frame at the same address, respectively. There is no need for a ``s_dcache_inv``
as all scalar writes are write-before-read in the same thread.
-Scratch backing memory (which is used for the private address space) is accessed
-with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
-is only accessed by a single thread, and is always write-before-read,
-there is never a need to invalidate these entries from the L1 cache. Hence all
-cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
-lines.
+Scratch backing memory (which is used for the private address space)
+is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
+address space is only accessed by a single thread, and is always
+write-before-read, there is never a need to invalidate these entries from the L1
+cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
+volatile cache lines.
On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
-to invalidate the L2 cache. This also causes it to be treated as non-volatile
-and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
-coherent) and so the L2 cache will coherent with the CPU and other agents.
+to invalidate the L2 cache. This also causes it to be treated as
+non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
+(cache coherent) and so the L2 cache will be coherent with the CPU and other
+agents.
.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
- ============ ============ ============== ========== =======================
+ ============ ============ ============== ========== ===============================
LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
Ordering Sync Scope Address
Space
- ============ ============ ============== ========== =======================
+ ============ ============ ============== ========== ===============================
**Non-Atomic**
- ---------------------------------------------------------------------------
- load *none* *none* - global non-volatile
- - generic 1. buffer/global/flat_load
- volatile
+ -----------------------------------------------------------------------------------
+ load *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_load
+ - constant
+ - volatile & !nontemporal
+
1. buffer/global/flat_load
glc=1
+
+ - nontemporal
+
+ 1. buffer/global/flat_load
+ glc=1 slc=1
+
load *none* *none* - local 1. ds_load
- store *none* *none* - global 1. buffer/global/flat_store
+ store *none* *none* - global - !nontemporal
- generic
+ - private 1. buffer/global/flat_store
+ - constant
+ - nontemporal
+
+                                                     1. buffer/global/flat_store
+ glc=1 slc=1
+
store *none* *none* - local 1. ds_store
**Unordered Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic unordered *any* *any* *Same as non-atomic*.
store atomic unordered *any* *any* *Same as non-atomic*.
atomicrmw unordered *any* *any* *Same as monotonic
atomic*.
**Monotonic Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic monotonic - singlethread - global 1. buffer/global/flat_load
- wavefront - generic
- workgroup
@@ -2339,16 +2543,15 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront
- workgroup
**Acquire Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
- wavefront - local
- generic
- load atomic acquire - workgroup - global 1. buffer/global_load
- load atomic acquire - workgroup - local 1. ds/flat_load
- - generic 2. s_waitcnt lgkmcnt(0)
+ load atomic acquire - workgroup - global 1. buffer/global/flat_load
+ load atomic acquire - workgroup - local 1. ds_load
+ 2. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -2361,8 +2564,23 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
+ load atomic acquire - workgroup - generic 1. flat_load
+ 2. s_waitcnt lgkmcnt(0)
- load atomic acquire - agent - global 1. buffer/global_load
+ - If OpenCL, omit.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than the load
+ atomic value being
+ acquired.
+ load atomic acquire - agent - global 1. buffer/global/flat_load
- system glc=1
2. s_waitcnt vmcnt(0)
@@ -2415,12 +2633,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
- atomicrmw acquire - workgroup - global 1. buffer/global_atomic
- atomicrmw acquire - workgroup - local 1. ds/flat_atomic
- - generic 2. waitcnt lgkmcnt(0)
+ atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
+ atomicrmw acquire - workgroup - local 1. ds_atomic
+ 2. waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -2434,7 +2651,24 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw value
being acquired.
- atomicrmw acquire - agent - global 1. buffer/global_atomic
+ atomicrmw acquire - workgroup - generic 1. flat_atomic
+ 2. waitcnt lgkmcnt(0)
+
+ - If OpenCL, omit.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than the
+ atomicrmw value
+ being acquired.
+
+ atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
- system 2. s_waitcnt vmcnt(0)
- Must happen before
@@ -2491,9 +2725,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
- not generic, omit
- waitcnt. However,
- since LLVM
+ not generic, omit.
+ - However, since LLVM
currently has no
address space on
the fence need to
@@ -2532,14 +2765,14 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
value read by the
fence-paired-atomic.
- fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
- However, since LLVM
+ - However, since LLVM
currently has no
address space on
the fence need to
@@ -2571,7 +2804,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- s_waitcnt lgkmcnt(0)
must happen after
any preceding
- group/generic load
+ local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
@@ -2598,8 +2831,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer_wbinvl1_vol
- - Must happen before
- any following global/generic
+ - Must happen before any
+ following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
@@ -2609,14 +2842,13 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
global data.
**Release Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
- wavefront - local
- generic
store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- - generic
- - If OpenCL, omit
- waitcnt.
+
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -2636,8 +2868,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_store
store atomic release - workgroup - local 1. ds_store
- store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
- - system - generic lgkmcnt(0)
+ store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
+
+ - If OpenCL, omit.
+ - Must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ store.
+ - Ensures that all
+ memory operations
+ to local have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 2. flat_store
+ store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -2669,7 +2922,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
store.
- Ensures that all
memory operations
- to global have
+ to memory have
completed before
performing the
store that is being
@@ -2680,9 +2933,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront - local
- generic
atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- - generic
- - If OpenCL, omit
- waitcnt.
+
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -2702,8 +2954,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_atomic
atomicrmw release - workgroup - local 1. ds_atomic
- atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
- - system - generic lgkmcnt(0)
+ atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
+
+ - If OpenCL, omit.
+ - Must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ to local have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 2. flat_atomic
+ atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -2741,23 +3014,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
the atomicrmw that
is being released.
- 2. buffer/global/ds/flat_atomic*
+ 2. buffer/global/ds/flat_atomic
fence release - singlethread *none* *none*
- wavefront
fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
- If OpenCL and
address space is
- not generic, omit
- waitcnt. However,
- since LLVM
+ not generic, omit.
+ - However, since LLVM
currently has no
address space on
the fence need to
conservatively
- always generate
- (see comment for
- previous fence).
+ always generate. If
+ fence had an
+ address space then
+ set to address
+ space of OpenCL
+ fence flag, or to
+ generic if both
+ local and global
+ flags are
+ specified.
- Must happen after
any preceding
local/generic
@@ -2782,21 +3061,32 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
following
fence-paired-atomic.
- fence release - agent *none* 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
- However, since LLVM
+ - If OpenCL and
+ address space is
+ local, omit
+ vmcnt(0).
+ - However, since LLVM
currently has no
address space on
the fence need to
conservatively
- always generate
- (see comment for
- previous fence).
+ always generate. If
+ fence had an
+ address space then
+ set to address
+ space of OpenCL
+ fence flag, or to
+ generic if both
+ local and global
+ flags are
+ specified.
- Could be split into
separate s_waitcnt
vmcnt(0) and
@@ -2832,21 +3122,20 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
fence-paired-atomic).
- Ensures that all
memory operations
- to global have
+ have
completed before
performing the
following
fence-paired-atomic.
**Acquire-Release Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -2864,12 +3153,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
- 2. buffer/global_atomic
+ 2. buffer/global/flat_atomic
atomicrmw acq_rel - workgroup - local 1. ds_atomic
2. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -2885,8 +3173,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen after
any preceding
local/generic
@@ -2907,8 +3194,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. flat_atomic
3. s_waitcnt lgkmcnt(0)
- - If OpenCL, omit
- waitcnt.
+ - If OpenCL, omit.
- Must happen before
any following
global/generic
@@ -2921,8 +3207,9 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
- atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+
+ atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -2960,7 +3247,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
- 2. buffer/global_atomic
+ 2. buffer/global/flat_atomic
3. s_waitcnt vmcnt(0)
- Must happen before
@@ -2984,8 +3271,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
will not see stale
global data.
- atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@@ -3056,8 +3343,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
- not generic, omit
- waitcnt. However,
+ not generic, omit.
+ - However,
since LLVM
currently has no
address space on
@@ -3095,8 +3382,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic)
- has completed
+ acquire-fence-paired-atomic
+ ) has completed
before following
global memory
operations. This
@@ -3116,19 +3403,19 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic).
- This satisfies the
+ release-fence-paired-atomic
+ ). This satisfies the
requirements of
release.
- fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
- - system lgkmcnt(0)
+ fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
+ - system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
- However, since LLVM
+ - However, since LLVM
currently has no
address space on
the fence need to
@@ -3173,8 +3460,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic)
- has completed
+ acquire-fence-paired-atomic
+ ) has completed
before invalidating
the cache. This
satisfies the
@@ -3194,8 +3481,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
- fence-paired-atomic).
- This satisfies the
+ release-fence-paired-atomic
+ ). This satisfies the
requirements of
release.
@@ -3216,13 +3503,103 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
acquire.
**Sequential Consistent Atomic**
- ---------------------------------------------------------------------------
+ -----------------------------------------------------------------------------------
load atomic seq_cst - singlethread - global *Same as corresponding
- - wavefront - local load atomic acquire*.
- - workgroup - generic
- load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
- - system - local
- - generic - Must happen after
+ - wavefront - local load atomic acquire,
+ - generic except must generate
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
+ - generic
+ - Must
+ happen after
+ preceding
+ global/generic load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own s_waitcnt
+ lgkmcnt(0) and so do
+ not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent local
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ waitcnt of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ completing out of
+ order.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - local *Same as corresponding
+ load atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
+ - system - generic vmcnt(0)
+
+ - Could be split into
+ separate s_waitcnt
+ vmcnt(0)
+ and s_waitcnt
+ lgkmcnt(0) to allow
+ them to be
+ independently moved
+ according to the
+ following rules.
+ - waitcnt lgkmcnt(0)
+ must happen after
+ preceding
+ global/generic load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own s_waitcnt
+ lgkmcnt(0) and so do
+ not need to be
+ considered.)
+ - waitcnt vmcnt(0)
+ must happen after
preceding
global/generic load
atomic/store
@@ -3250,7 +3627,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
prevents reordering
a seq_cst store
followed by a
- seq_cst load (Note
+ seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
@@ -3259,7 +3636,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
followed by a store
release is
prevented by the
- waitcnt vmcnt(0) of
+ waitcnt of
the release, but
there is nothing
preventing a store
@@ -3271,24 +3648,36 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. *Following
instructions same as
corresponding load
- atomic acquire*.
-
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
store atomic seq_cst - singlethread - global *Same as corresponding
- - wavefront - local store atomic release*.
- - workgroup - generic
+ - wavefront - local store atomic release,
+ - workgroup - generic except must generate
+ all instructions even
+ for OpenCL.*
store atomic seq_cst - agent - global *Same as corresponding
- - system - generic store atomic release*.
+ - system - generic store atomic release,
+ except must generate
+ all instructions even
+ for OpenCL.*
atomicrmw seq_cst - singlethread - global *Same as corresponding
- - wavefront - local atomicrmw acq_rel*.
- - workgroup - generic
+ - wavefront - local atomicrmw acq_rel,
+ - workgroup - generic except must generate
+ all instructions even
+ for OpenCL.*
atomicrmw seq_cst - agent - global *Same as corresponding
- - system - generic atomicrmw acq_rel*.
+ - system - generic atomicrmw acq_rel,
+ except must generate
+ all instructions even
+ for OpenCL.*
fence seq_cst - singlethread *none* *Same as corresponding
- - wavefront fence acq_rel*.
- - workgroup
- - agent
- - system
- ============ ============ ============== ========== =======================
+ - wavefront fence acq_rel,
+ - workgroup except must generate
+ - agent all instructions even
+ - system for OpenCL.*
+ ============ ============ ============== ========== ===============================
The memory order also adds the single thread optimization constraints defined in
table
@@ -3359,8 +3748,11 @@ the ``s_trap`` instruction with the following usage:
debugger ``s_trap 0xff`` Reserved for debugger.
=================== =============== =============== =======================
-Non-AMDHSA
-----------
+Unspecified OS
+--------------
+
+This section provides code conventions used when the target triple OS is
+empty (see :ref:`amdgpu-target-triples`).
Trap Handler ABI
~~~~~~~~~~~~~~~~
@@ -3395,7 +3787,8 @@ When the language is OpenCL the following differences occur:
1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend adds additional arguments to the kernel.
-3. Additional metadata is generated (:ref:`amdgpu-code-object-metadata`).
+3. Additional metadata is generated
+ (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
.. TODO
Specify what affect this has. Hidden arguments added. Additional metadata
@@ -3420,12 +3813,12 @@ Assembler
---------
AMDGPU backend has LLVM-MC based assembler which is currently in development.
-It supports AMDGCN GFX6-GFX8.
+It supports AMDGCN GFX6-GFX9.
This section describes general syntax for instructions and operands. For more
information about instructions, their semantics and supported combinations of
operands, refer to one of instruction set architecture manuals
-[AMD-Souther-Islands]_ [AMD-Sea-Islands]_ [AMD-Volcanic-Islands]_.
+[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
An instruction has the following syntax (register operands are normally
comma-separated while extra operands are space-separated):
@@ -3694,7 +4087,7 @@ used. The default value for all keys is 0, with the following exceptions:
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
- *private_segment_alignment* default to 4. Note that alignments are specified
+ *private_segment_alignment* default to 4. Note that alignments are specified
as a power of two, so a value of **n** means an alignment of 2^ **n**.
The *.amd_kernel_code_t* directive must be placed immediately after the
@@ -3741,13 +4134,14 @@ Here is an example of a minimal amd_kernel_code_t specification:
Additional Documentation
========================
-.. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
-.. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
-.. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
-.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
-.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
-.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
-.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
+.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
+.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
+.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
+.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
+.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
+.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
+.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
+.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
@@ -3755,7 +4149,7 @@ Additional Documentation
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
-.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
+.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__
diff --git a/docs/BitCodeFormat.rst b/docs/BitCodeFormat.rst
index 6ee3842c8d90..429c945e7120 100644
--- a/docs/BitCodeFormat.rst
+++ b/docs/BitCodeFormat.rst
@@ -681,7 +681,7 @@ for each library name referenced.
MODULE_CODE_GLOBALVAR Record
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-``[GLOBALVAR, strtab offset, strtab size, pointer type, isconst, initid, linkage, alignment, section, visibility, threadlocal, unnamed_addr, externally_initialized, dllstorageclass, comdat]``
+``[GLOBALVAR, strtab offset, strtab size, pointer type, isconst, initid, linkage, alignment, section, visibility, threadlocal, unnamed_addr, externally_initialized, dllstorageclass, comdat, attributes, preemptionspecifier]``
The ``GLOBALVAR`` record (code 7) marks the declaration or definition of a
global variable. The operand fields are:
@@ -761,12 +761,21 @@ global variable. The operand fields are:
* *comdat*: An encoding of the COMDAT of this function
+* *attributes*: If nonzero, the 1-based index into the table of AttributeLists.
+
+.. _bcpreemptionspecifier:
+
+* *preemptionspecifier*: If present, an encoding of the runtime preemption specifier of this variable:
+
+ * ``dso_preemptable``: code 0
+ * ``dso_local``: code 1
+
.. _FUNCTION:
MODULE_CODE_FUNCTION Record
^^^^^^^^^^^^^^^^^^^^^^^^^^^
-``[FUNCTION, strtab offset, strtab size, type, callingconv, isproto, linkage, paramattr, alignment, section, visibility, gc, prologuedata, dllstorageclass, comdat, prefixdata, personalityfn]``
+``[FUNCTION, strtab offset, strtab size, type, callingconv, isproto, linkage, paramattr, alignment, section, visibility, gc, prologuedata, dllstorageclass, comdat, prefixdata, personalityfn, preemptionspecifier]``
The ``FUNCTION`` record (code 8) marks the declaration or definition of a
function. The operand fields are:
@@ -828,10 +837,12 @@ function. The operand fields are:
* *personalityfn*: If non-zero, the value index of the personality function for this function,
plus 1.
+* *preemptionspecifier*: If present, an encoding of the :ref:`runtime preemption specifier<bcpreemptionspecifier>` of this function.
+
MODULE_CODE_ALIAS Record
^^^^^^^^^^^^^^^^^^^^^^^^
-``[ALIAS, strtab offset, strtab size, alias type, aliasee val#, linkage, visibility, dllstorageclass, threadlocal, unnamed_addr]``
+``[ALIAS, strtab offset, strtab size, alias type, aliasee val#, linkage, visibility, dllstorageclass, threadlocal, unnamed_addr, preemptionspecifier]``
The ``ALIAS`` record (code 9) marks the definition of an alias. The operand
fields are
@@ -856,6 +867,8 @@ fields are
* *unnamed_addr*: If present, an encoding of the
:ref:`unnamed_addr<bcunnamedaddr>` attribute of this alias
+* *preemptionspecifier*: If present, an encoding of the :ref:`runtime preemption specifier<bcpreemptionspecifier>` of this alias.
+
.. _MODULE_CODE_GCNAME:
MODULE_CODE_GCNAME Record
@@ -1039,6 +1052,9 @@ The integer codes are mapped to well-known attributes as follows.
* code 50: ``inaccessiblememonly_or_argmemonly``
* code 51: ``allocsize(<EltSizeParam>[, <NumEltsParam>])``
* code 52: ``writeonly``
+* code 53: ``speculatable``
+* code 54: ``strictfp``
+* code 55: ``sanitize_hwaddress``
.. note::
The ``allocsize`` attribute has a special encoding for its arguments. Its two
diff --git a/docs/Bugpoint.rst b/docs/Bugpoint.rst
index 6bd7ff99564f..27732e0fffbd 100644
--- a/docs/Bugpoint.rst
+++ b/docs/Bugpoint.rst
@@ -151,6 +151,11 @@ non-obvious ways. Here are some hints and tips:
optimizations to be randomized and applied to the program. This process will
repeat until a bug is found or the user kills ``bugpoint``.
+* ``bugpoint`` can produce IR which contains long names. Run ``opt
+ -metarenamer`` over the IR to rename everything using easy-to-read,
+ metasyntactic names. Alternatively, run ``opt -strip -instnamer`` to rename
+ everything with very short (often purely numeric) names.
+
What to do when bugpoint isn't enough
=====================================
diff --git a/docs/CFIVerify.rst b/docs/CFIVerify.rst
new file mode 100644
index 000000000000..7424d01c90b5
--- /dev/null
+++ b/docs/CFIVerify.rst
@@ -0,0 +1,91 @@
+==============================================
+Control Flow Verification Tool Design Document
+==============================================
+
+.. contents::
+ :local:
+
+Objective
+=========
+
+This document provides an overview of an external tool to verify the protection
+mechanisms implemented by Clang's *Control Flow Integrity* (CFI) schemes
+(``-fsanitize=cfi``). This tool, provided a binary or DSO, should infer whether
+indirect control flow operations are protected by CFI, and should output these
+results in a human-readable form.
+
+This tool should also be added as part of Clang's continuous integration testing
+framework, where modifications to the compiler ensure that CFI protection
+schemes are still present in the final binary.
+
+Location
+========
+
+This tool will be present as a part of the LLVM toolchain, and will reside in
+the "/llvm/tools/llvm-cfi-verify" directory, relative to the LLVM trunk. It will
+be tested in two ways:
+
+- Unit tests to validate code sections, present in "/llvm/unittests/llvm-cfi-
+ verify".
+- Integration tests, present in "/llvm/tools/clang/test/LLVMCFIVerify". These
+  integration tests run as part of Clang's continuous integration framework,
+  ensuring that compiler changes which reduce CFI coverage of indirect control
+  flow instructions are identified.
+
+Background
+==========
+
+This tool will continuously validate that CFI directives are properly
+implemented around all indirect control flows by analysing the output machine
+code. The analysis of machine code is important as it ensures that any bugs
+present in the linker or compiler do not subvert CFI protections in the final
+shipped binary.
+
+Unprotected indirect control flow instructions will be flagged for manual
+review. These unexpected control flows may simply have not been accounted for in
+the compiler implementation of CFI (e.g. indirect jumps to facilitate switch
+statements may not be fully protected).
+
+It may be possible in the future to extend this tool to flag unnecessary CFI
+directives (e.g. CFI directives around a static call to a non-polymorphic base
+type). This type of directive has no security implications, but may present
+performance impacts.
+
+Design Ideas
+============
+
+This tool will disassemble binaries and DSOs from their machine code format and
+analyse the disassembled machine code. The tool will inspect virtual calls and
+indirect function calls. This tool will also inspect indirect jumps, as inlined
+functions and jump tables should also be subject to CFI protections. Non-virtual
+calls (``-fsanitize=cfi-nvcall``) and cast checks (``-fsanitize=cfi-*cast*``)
+are not implemented due to a lack of information provided by the bytecode.
+
+The tool would operate by searching for indirect control flow instructions in
+the disassembly. A control flow graph would be generated from a small buffer of
+the instructions surrounding the 'target' control flow instruction. If the
+target instruction is branched-to, the fallthrough of the branch should be the
+CFI trap (on x86, this is a ``ud2`` instruction). If the target instruction is
+the fallthrough (i.e. immediately succeeds) of a conditional jump, the
+conditional jump target should be the CFI trap. If an indirect control flow
+instruction does not conform to one of these formats, the target will be noted
+as being CFI-unprotected.
+
+Note that in the second case outlined above (where the target instruction is the
+fallthrough of a conditional jump), if the target represents a vcall that takes
+arguments, these arguments may be pushed to the stack after the branch but
+before the target instruction. In these cases, a secondary 'spill graph' is
+constructed, to ensure the register argument used by the indirect jump/call is
+not spilled from the stack at any point in the interim period. If there are no
+spills that affect the target register, the target is marked as CFI-protected.
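
For illustration, the second pattern described above (the protected instruction is the fallthrough of a conditional branch whose taken edge reaches the trap) might look roughly like the following x86-64 sketch. The check instructions shown are hypothetical; the real CFI check sequence depends on the scheme Clang emits:

```asm
  ; hypothetical CFI check guarding an indirect call
  cmp   rax, r11        ; range/identity check on the call target (hypothetical)
  jne   .Lcfi_trap      ; taken edge of the conditional jump reaches the trap
  call  rax             ; protected instruction: fallthrough of the branch
  ...
.Lcfi_trap:
  ud2                   ; the CFI trap instruction on x86
```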
+
+Other Design Notes
+~~~~~~~~~~~~~~~~~~
+
+Only machine code sections that are marked as executable will be subject to this
+analysis. Non-executable sections do not require analysis as any execution
+present in these sections has already violated the control flow integrity.
+
+Suitable extensions may be made at a later date to include analysis for indirect
+control flow operations across DSO boundaries. Currently, these CFI features are
+only experimental with an unstable ABI, making them unsuitable for analysis.
diff --git a/docs/CMake.rst b/docs/CMake.rst
index b6ebf37adc92..05edec64da33 100644
--- a/docs/CMake.rst
+++ b/docs/CMake.rst
@@ -224,6 +224,10 @@ LLVM-specific variables
Generate build targets for the LLVM tools. Defaults to ON. You can use this
option to disable the generation of build targets for the LLVM tools.
+**LLVM_INSTALL_BINUTILS_SYMLINKS**:BOOL
+ Install symlinks from the binutils tool names to the corresponding LLVM tools.
+ For example, ar will be symlinked to llvm-ar.
+
**LLVM_BUILD_EXAMPLES**:BOOL
Build LLVM examples. Defaults to OFF. Targets for building each example are
generated in any case. See documentation for *LLVM_BUILD_TOOLS* above for more
@@ -542,6 +546,11 @@ LLVM-specific variables
reverse order. This is useful for uncovering non-determinism caused by
iteration of unordered containers.
+**LLVM_BUILD_INSTRUMENTED_COVERAGE**:BOOL
+ If enabled, `source-based code coverage
+ <http://clang.llvm.org/docs/SourceBasedCodeCoverage.html>`_ instrumentation
+ is enabled while building LLVM.
+
CMake Caches
============
diff --git a/docs/CMakeLists.txt b/docs/CMakeLists.txt
index 4437610146c4..0f2681e0cd86 100644
--- a/docs/CMakeLists.txt
+++ b/docs/CMakeLists.txt
@@ -3,7 +3,7 @@ if (DOXYGEN_FOUND)
if (LLVM_ENABLE_DOXYGEN)
set(abs_top_srcdir ${CMAKE_CURRENT_SOURCE_DIR})
set(abs_top_builddir ${CMAKE_CURRENT_BINARY_DIR})
-
+
if (HAVE_DOT)
set(DOT ${LLVM_PATH_DOT})
endif()
@@ -21,20 +21,20 @@ if (LLVM_ENABLE_DOXYGEN)
set(enable_external_search "NO")
set(extra_search_mappings "")
endif()
-
+
# If asked, configure doxygen for the creation of a Qt Compressed Help file.
option(LLVM_ENABLE_DOXYGEN_QT_HELP
"Generate a Qt Compressed Help file." OFF)
if (LLVM_ENABLE_DOXYGEN_QT_HELP)
set(LLVM_DOXYGEN_QCH_FILENAME "org.llvm.qch" CACHE STRING
"Filename of the Qt Compressed help file")
- set(LLVM_DOXYGEN_QHP_NAMESPACE "org.llvm" CACHE STRING
+ set(LLVM_DOXYGEN_QHP_NAMESPACE "org.llvm" CACHE STRING
"Namespace under which the intermediate Qt Help Project file lives")
set(LLVM_DOXYGEN_QHP_CUST_FILTER_NAME "${PACKAGE_STRING}" CACHE STRING
"See http://qt-project.org/doc/qt-4.8/qthelpproject.html#custom-filters")
set(LLVM_DOXYGEN_QHP_CUST_FILTER_ATTRS "${PACKAGE_NAME},${PACKAGE_VERSION}" CACHE STRING
"See http://qt-project.org/doc/qt-4.8/qthelpproject.html#filter-attributes")
- find_program(LLVM_DOXYGEN_QHELPGENERATOR_PATH qhelpgenerator
+ find_program(LLVM_DOXYGEN_QHELPGENERATOR_PATH qhelpgenerator
DOC "Path to the qhelpgenerator binary")
if (NOT LLVM_DOXYGEN_QHELPGENERATOR_PATH)
message(FATAL_ERROR "Failed to find qhelpgenerator binary")
@@ -55,7 +55,7 @@ if (LLVM_ENABLE_DOXYGEN)
set(llvm_doxygen_qhp_cust_filter_name "")
set(llvm_doxygen_qhp_cust_filter_attrs "")
endif()
-
+
option(LLVM_DOXYGEN_SVG
"Use svg instead of png files for doxygen graphs." OFF)
if (LLVM_DOXYGEN_SVG)
@@ -112,6 +112,8 @@ if (LLVM_ENABLE_SPHINX)
if (${SPHINX_OUTPUT_MAN})
add_sphinx_target(man llvm)
+ add_sphinx_target(man llvm-dwarfdump)
+ add_sphinx_target(man dsymutil)
endif()
endif()
diff --git a/docs/CMakePrimer.rst b/docs/CMakePrimer.rst
index c29d627ee62c..72ebffa5bdd6 100644
--- a/docs/CMakePrimer.rst
+++ b/docs/CMakePrimer.rst
@@ -167,10 +167,9 @@ Other Types
Variables that are cached or specified on the command line can have types
associated with them. The variable's type is used by CMake's UI tool to display
-the right input field. The variable's type generally doesn't impact evaluation.
-One of the few examples is PATH variables, which CMake does have some special
-handling for. You can read more about the special handling in `CMake's set
-documentation
+the right input field. A variable's type generally doesn't impact evaluation;
+however, CMake does have special handling for some variables, such as PATH.
+You can read more about the special handling in `CMake's set documentation
<https://cmake.org/cmake/help/v3.5/command/set.html#set-cache-entry>`_.
Scope
@@ -203,7 +202,7 @@ Control Flow
============
CMake features the same basic control flow constructs you would expect in any
-scripting language, but there are a few quarks because, as with everything in
+scripting language, but there are a few quirks because, as with everything in
CMake, control flow constructs are commands.
If, ElseIf, Else
@@ -334,21 +333,23 @@ When defining a CMake command handling arguments is very useful. The examples
in this section will all use the CMake ``function`` block, but this all applies
to the ``macro`` block as well.
-CMake commands can have named arguments, but all commands are implicitly
-variable argument. If the command has named arguments they are required and must
-be specified at every call site. Below is a trivial example of providing a
-wrapper function for CMake's built in function ``add_dependencies``.
+CMake commands can have named arguments that are required at every call site. In
+addition, all commands will implicitly accept a variable number of extra
+arguments (in C parlance, all commands are varargs functions). When a command is
+invoked with extra arguments (beyond the named ones), CMake will store the full
+list of arguments (both named and unnamed) in a list named ``ARGV``, and the
+sublist of unnamed arguments in ``ARGN``. Below is a trivial example of
+providing a wrapper function for CMake's built in function ``add_dependencies``.
.. code-block:: cmake
function(add_deps target)
- add_dependencies(${target} ${ARGV})
+ add_dependencies(${target} ${ARGN})
endfunction()
This example defines a new function named ``add_deps`` which takes a required first
argument, and just calls another function passing through the first argument and
-all trailing arguments. When variable arguments are present CMake defines them
-in a list named ``ARGV``, and the count of the arguments is defined in ``ARGN``.
+all trailing arguments.
CMake provides a module ``CMakeParseArguments`` which provides an implementation
of advanced argument parsing. We use this all over LLVM, and it is recommended
diff --git a/docs/CodeGenerator.rst b/docs/CodeGenerator.rst
index bcdc72283566..5c0fb064959e 100644
--- a/docs/CodeGenerator.rst
+++ b/docs/CodeGenerator.rst
@@ -1578,6 +1578,17 @@ which lowers MCInst's into machine code bytes and relocations. This is
important if you want to support direct .o file emission, or would like to
implement an assembler for your target.
+Emitting function stack size information
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A section containing metadata on function stack sizes will be emitted when
+``TargetLoweringObjectFile::StackSizesSection`` is not null, and
+``TargetOptions::EmitStackSizeSection`` is set (-stack-size-section). The
+section will contain an array of pairs of function symbol references (8 byte)
+and stack sizes (unsigned LEB128). The stack size values only include the space
+allocated in the function prologue. Functions with dynamic stack allocations are
+not included.
+
VLIW Packetizer
---------------
diff --git a/docs/CodingStandards.rst b/docs/CodingStandards.rst
index fa41198755fd..231c034be19d 100644
--- a/docs/CodingStandards.rst
+++ b/docs/CodingStandards.rst
@@ -203,7 +203,7 @@ this means are `Effective Go`_ and `Go Code Review Comments`_.
https://golang.org/doc/effective_go.html
.. _Go Code Review Comments:
- https://code.google.com/p/go-wiki/wiki/CodeReviewComments
+ https://github.com/golang/go/wiki/CodeReviewComments
Mechanical Source Issues
========================
@@ -811,6 +811,21 @@ As a rule of thumb, use ``auto &`` unless you need to copy the result, and use
for (const auto *Ptr : Container) { observe(*Ptr); }
for (auto *Ptr : Container) { Ptr->change(); }
+Beware of non-determinism due to ordering of pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In general, there is no relative ordering among pointers. As a result,
+when unordered containers like sets and maps are used with pointer keys
+the iteration order is undefined. Hence, iterating such containers may
+result in non-deterministic code generation. While the generated code
+might not necessarily be "wrong code", this non-determinism might result
+in unexpected runtime crashes or simply hard-to-reproduce bugs on the
+customer side, making them harder to debug and fix.
+
+As a rule of thumb, when an ordered result is expected, remember to
+sort an unordered container before iterating over it, or use ordered
+containers like vector/MapVector/SetVector if you want to iterate
+pointer keys.
+
Style Issues
============
@@ -941,8 +956,8 @@ loops. A silly example is something like this:
.. code-block:: c++
- for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II != E; ++II) {
- if (BinaryOperator *BO = dyn_cast<BinaryOperator>(II)) {
+ for (Instruction &I : BB) {
+ if (auto *BO = dyn_cast<BinaryOperator>(&I)) {
Value *LHS = BO->getOperand(0);
Value *RHS = BO->getOperand(1);
if (LHS != RHS) {
@@ -961,8 +976,8 @@ It is strongly preferred to structure the loop like this:
.. code-block:: c++
- for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II != E; ++II) {
- BinaryOperator *BO = dyn_cast<BinaryOperator>(II);
+ for (Instruction &I : BB) {
+ auto *BO = dyn_cast<BinaryOperator>(&I);
if (!BO) continue;
Value *LHS = BO->getOperand(0);
@@ -1232,6 +1247,12 @@ builds), ``llvm_unreachable`` becomes a hint to compilers to skip generating
code for this branch. If the compiler does not support this, it will fall back
to the "abort" implementation.
+Neither assertions nor ``llvm_unreachable`` will abort the program on a release
+build. If the error condition can be triggered by user input then the
+recoverable error mechanism described in :doc:`ProgrammersManual` should be
+used instead. In cases where this is not practical, ``report_fatal_error`` may
+be used.
+
Another issue is that values used only by assertions will produce an "unused
value" warning when assertions are disabled. For example, this code will warn:
@@ -1316,19 +1337,31 @@ that the enum expression may take any representable value, not just those of
individual enumerators. To suppress this warning, use ``llvm_unreachable`` after
the switch.
+Use range-based ``for`` loops wherever possible
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The introduction of range-based ``for`` loops in C++11 means that explicit
+manipulation of iterators is rarely necessary. We use range-based ``for``
+loops wherever possible for all newly added code. For example:
+
+.. code-block:: c++
+
+ BasicBlock *BB = ...
+ for (Instruction &I : *BB)
+ ... use I ...
+
Don't evaluate ``end()`` every time through a loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Because C++ doesn't have a standard "``foreach``" loop (though it can be
-emulated with macros and may be coming in C++'0x) we end up writing a lot of
-loops that manually iterate from begin to end on a variety of containers or
-through other data structures. One common mistake is to write a loop in this
-style:
+In cases where range-based ``for`` loops can't be used and it is necessary
+to write an explicit iterator-based loop, pay close attention to whether
+``end()`` is re-evaluated on each loop iteration. One common mistake is to
+write a loop in this style:
.. code-block:: c++
BasicBlock *BB = ...
- for (BasicBlock::iterator I = BB->begin(); I != BB->end(); ++I)
+ for (auto I = BB->begin(); I != BB->end(); ++I)
... use I ...
The problem with this construct is that it evaluates "``BB->end()``" every time
@@ -1339,7 +1372,7 @@ convenient way to do this is like so:
.. code-block:: c++
BasicBlock *BB = ...
- for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ++I)
+ for (auto I = BB->begin(), E = BB->end(); I != E; ++I)
... use I ...
The observant may quickly point out that these two loops may have different
diff --git a/docs/CommandGuide/FileCheck.rst b/docs/CommandGuide/FileCheck.rst
index 8830c394b212..9078f65e01c5 100644
--- a/docs/CommandGuide/FileCheck.rst
+++ b/docs/CommandGuide/FileCheck.rst
@@ -86,6 +86,11 @@ OPTIONS
All other variables get undefined after each encountered ``CHECK-LABEL``.
+.. option:: -D<VAR=VALUE>
+
+ Sets a FileCheck variable ``VAR`` with value ``VALUE`` that can be used in
+ ``CHECK:`` lines.
+
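For instance, a test could parameterize its expected output with a variable supplied on the command line (the variable name and file names here are illustrative):

```text
; RUN: FileCheck -DEXPECTED=10 -input-file=%t %s
; CHECK: Result: [[EXPECTED]]
```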
.. option:: -version
Show the version number of this program.
@@ -397,10 +402,11 @@ All FileCheck directives take a pattern to match.
For most uses of FileCheck, fixed string matching is perfectly sufficient. For
some things, a more flexible form of matching is desired. To support this,
FileCheck allows you to specify regular expressions in matching strings,
-surrounded by double braces: ``{{yourregex}}``. Because we want to use fixed
-string matching for a majority of what we do, FileCheck has been designed to
-support mixing and matching fixed string matching with regular expressions.
-This allows you to write things like this:
+surrounded by double braces: ``{{yourregex}}``. FileCheck implements a POSIX
+regular expression matcher; it supports Extended POSIX regular expressions
+(ERE). Because we want to use fixed string matching for a majority of what we
+do, FileCheck has been designed to support mixing and matching fixed string
+matching with regular expressions. This allows you to write things like this:
.. code-block:: llvm
@@ -434,7 +440,7 @@ The first check line matches a regex ``%[a-z]+`` and captures it into the
variable ``REGISTER``. The second line verifies that whatever is in
``REGISTER`` occurs later in the file after an "``andw``". :program:`FileCheck`
variable references are always contained in ``[[ ]]`` pairs, and their names can
-be formed with the regex ``[a-zA-Z][a-zA-Z0-9]*``. If a colon follows the name,
+be formed with the regex ``[a-zA-Z_][a-zA-Z0-9_]*``. If a colon follows the name,
then it is a definition of the variable; otherwise, it is a use.
:program:`FileCheck` variables can be defined multiple times, and uses always
diff --git a/docs/CommandGuide/dsymutil.rst b/docs/CommandGuide/dsymutil.rst
new file mode 100644
index 000000000000..a29bc3c295c7
--- /dev/null
+++ b/docs/CommandGuide/dsymutil.rst
@@ -0,0 +1,89 @@
+dsymutil - manipulate archived DWARF debug symbol files
+=======================================================
+
+SYNOPSIS
+--------
+
+| :program:`dsymutil` [*options*] *executable*
+
+DESCRIPTION
+-----------
+
+:program:`dsymutil` links the DWARF debug information found in the object files
+for an executable *executable* by using the debug symbol information contained
+in its symbol table. By default, the linked debug information is placed in a
+``.dSYM`` bundle with the same name as the executable.
+
+OPTIONS
+-------
+.. option:: --arch=<arch>
+
+ Link DWARF debug information only for the specified CPU architecture types.
+ Architectures may be specified by name. This option can be specified
+ multiple times, once for each desired architecture. By default, all CPU
+ architectures are linked, and any architecture that cannot be properly
+ linked causes :program:`dsymutil` to return an error.
+
+.. option:: --dump-debug-map
+
+ Dump the *executable*'s debug-map (the list of the object files containing the
+ debug information) in YAML format and exit. No DWARF link will take place.
+
+.. option:: -f, --flat
+
+ Produce a flat dSYM file. A ``.dwarf`` extension will be appended to the
+ executable name unless the output file is specified using the ``-o`` option.
+
+.. option:: --no-odr
+
+ Do not use ODR (One Definition Rule) for uniquing C++ types.
+
+.. option:: --no-output
+
+ Do the link in memory, but do not emit the result file.
+
+.. option:: --no-swiftmodule-timestamp
+
+ Don't check the timestamp for swiftmodule files.
+
+.. option:: -j <n>, --num-threads=<n>
+
+ Specifies the maximum number (``n``) of simultaneous threads to use when
+ linking multiple architectures.
+
+.. option:: -o <filename>
+
+ Specifies an alternate output path for the dSYM bundle. The default dSYM
+ bundle path is created by appending ``.dSYM`` to the executable name.
+
+.. option:: --oso-prepend-path=<path>
+
+ Specifies a ``path`` to prepend to all debug symbol object file paths.
+
+.. option:: -s, --symtab
+
+ Dumps the symbol table found in *executable* or object file(s) and exits.
+
+.. option:: -v, --verbose
+
+ Display verbose information when linking.
+
+.. option:: --version
+
+ Display the version of the tool.
+
+.. option:: -y
+
+ Treat *executable* as a YAML debug-map rather than an executable.
+
+EXIT STATUS
+-----------
+
+:program:`dsymutil` returns 0 if the DWARF debug information was linked
+successfully. Otherwise, it returns 1.
+
+SEE ALSO
+--------
+
+:manpage:`llvm-dwarfdump(1)`
diff --git a/docs/CommandGuide/index.rst b/docs/CommandGuide/index.rst
index 46db57f1c845..805df00c1738 100644
--- a/docs/CommandGuide/index.rst
+++ b/docs/CommandGuide/index.rst
@@ -30,6 +30,7 @@ Basic Commands
llvm-stress
llvm-symbolizer
llvm-dwarfdump
+ dsymutil
Debugging Tools
~~~~~~~~~~~~~~~
@@ -51,4 +52,5 @@ Developer Tools
tblgen
lit
llvm-build
+ llvm-pdbutil
llvm-readobj
diff --git a/docs/CommandGuide/llc.rst b/docs/CommandGuide/llc.rst
index 5094259f9f95..95945e68d13f 100644
--- a/docs/CommandGuide/llc.rst
+++ b/docs/CommandGuide/llc.rst
@@ -132,6 +132,14 @@ End-user Options
Specify which EABI version should conform to. Valid EABI versions are *gnu*,
*4* and *5*. Default value (*default*) depends on the triple.
+.. option:: -stack-size-section
+
+ Emit the .stack_sizes section which contains stack size metadata. The section
+ contains an array of pairs of function symbol references (8 byte) and stack
+ sizes (unsigned LEB128). The stack size values only include the space allocated
+ in the function prologue. Functions with dynamic stack allocations are not
+ included.
+
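As an illustration of the layout described above, the following is a sketch in Python of how such a section could be decoded. The 8-byte little-endian symbol reference and the sample symbol value are assumptions for a 64-bit little-endian target, not part of the option's specification:

```python
import struct

def decode_uleb128(data, offset=0):
    """Decode one unsigned LEB128 value; return (value, next_offset)."""
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

def parse_stack_sizes(section):
    """Yield (symbol_value, stack_size) pairs from raw .stack_sizes bytes."""
    offset = 0
    while offset < len(section):
        # 8-byte function symbol reference (little-endian assumed here).
        (sym,) = struct.unpack_from("<Q", section, offset)
        offset += 8
        # Unsigned LEB128 stack size follows the symbol reference.
        size, offset = decode_uleb128(section, offset)
        yield sym, size

# Hypothetical entry: symbol value 0x401000, stack size 624485 (0xE5 0x8E 0x26).
blob = struct.pack("<Q", 0x401000) + bytes([0xE5, 0x8E, 0x26])
print(list(parse_stack_sizes(blob)))  # [(4198400, 624485)]
```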
Tuning/Configuration Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/CommandGuide/lli.rst b/docs/CommandGuide/lli.rst
index 9da13ee47e0e..58481073d069 100644
--- a/docs/CommandGuide/lli.rst
+++ b/docs/CommandGuide/lli.rst
@@ -122,7 +122,7 @@ CODE GENERATION OPTIONS
Choose the code model from:
- .. code-block:: perl
+ .. code-block:: text
default: Target default code model
small: Small code model
@@ -154,7 +154,7 @@ CODE GENERATION OPTIONS
Instruction schedulers available (before register allocation):
- .. code-block:: perl
+ .. code-block:: text
=default: Best scheduler for the target
=none: No scheduling: breadth first sequencing
@@ -168,7 +168,7 @@ CODE GENERATION OPTIONS
Register allocator to use (default=linearscan)
- .. code-block:: perl
+ .. code-block:: text
=bigblock: Big-block register allocator
=linearscan: linear scan register allocator =local - local register allocator
@@ -178,7 +178,7 @@ CODE GENERATION OPTIONS
Choose relocation model from:
- .. code-block:: perl
+ .. code-block:: text
=default: Target default relocation model
=static: Non-relocatable code =pic - Fully relocatable, position independent code
@@ -188,7 +188,7 @@ CODE GENERATION OPTIONS
Spiller to use (default=local)
- .. code-block:: perl
+ .. code-block:: text
=simple: simple spiller
=local: local spiller
@@ -197,7 +197,7 @@ CODE GENERATION OPTIONS
Choose style of code to emit from X86 backend:
- .. code-block:: perl
+ .. code-block:: text
=att: Emit AT&T-style assembly
=intel: Emit Intel-style assembly
diff --git a/docs/CommandGuide/llvm-cov.rst b/docs/CommandGuide/llvm-cov.rst
index 47db8d04e0b2..478ba0fb15e0 100644
--- a/docs/CommandGuide/llvm-cov.rst
+++ b/docs/CommandGuide/llvm-cov.rst
@@ -195,44 +195,53 @@ OPTIONS
.. option:: -show-line-counts
- Show the execution counts for each line. This is enabled by default, unless
- another ``-show`` option is used.
+ Show the execution counts for each line. Defaults to true, unless another
+ ``-show`` option is used.
.. option:: -show-expansions
Expand inclusions, such as preprocessor macros or textual inclusions, inline
- in the display of the source file.
+ in the display of the source file. Defaults to false.
.. option:: -show-instantiations
For source regions that are instantiated multiple times, such as templates in
``C++``, show each instantiation separately as well as the combined summary.
+ Defaults to true.
.. option:: -show-regions
Show the execution counts for each region by displaying a caret that points to
- the character where the region starts.
+ the character where the region starts. Defaults to false.
.. option:: -show-line-counts-or-regions
Show the execution counts for each line if there is only one region on the
line, but show the individual regions if there are multiple on the line.
+ Defaults to false.
-.. option:: -use-color[=VALUE]
+.. option:: -use-color
Enable or disable color output. By default this is autodetected.
-.. option:: -arch=<name>
+.. option:: -arch=[*NAMES*]
- If the covered binary is a universal binary, select the architecture to use.
- It is an error to specify an architecture that is not included in the
- universal binary or to use an architecture that does not match a
- non-universal binary.
+ Specify a list of architectures such that the Nth entry in the list
+ corresponds to the Nth specified binary. If the covered object is a universal
+ binary, this specifies the architecture to use. It is an error to specify an
+ architecture that is not included in the universal binary or to use an
+ architecture that does not match a non-universal binary.
.. option:: -name=<NAME>
Show code coverage only for functions with the given name.
+.. option:: -name-whitelist=<FILE>
+
+ Show code coverage only for functions listed in the given file. Each line in
+ the file should start with `whitelist_fun:`, immediately followed by the name
+ of the function to accept. This name can be a wildcard expression.
+
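A whitelist file is a plain text file with one entry per line; the function names below are illustrative:

```text
whitelist_fun:main
whitelist_fun:*Visitor*
```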
.. option:: -name-regex=<PATTERN>
Show code coverage only for functions that match the given regular expression.
@@ -288,6 +297,12 @@ OPTIONS
Show code coverage only for functions with region coverage less than the given
threshold.
+.. option:: -path-equivalence=<from>,<to>
+
+ Map the paths in the coverage data to local source file paths. This allows you
+ to generate the coverage data on one machine, and then use llvm-cov on a
+ different machine where you have the same files on a different path.
+
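Conceptually, the remapping is a prefix substitution on each recorded path. The sketch below illustrates the idea only (the paths are hypothetical, and this is not llvm-cov's implementation):

```python
def remap_path(path, from_prefix, to_prefix):
    """Mimic -path-equivalence=<from>,<to>: rewrite a leading path prefix."""
    if path.startswith(from_prefix):
        return to_prefix + path[len(from_prefix):]
    return path

# Coverage data recorded on a build machine, inspected on a local checkout.
print(remap_path("/build/llvm/lib/Foo.cpp", "/build/llvm", "/home/user/llvm"))
# /home/user/llvm/lib/Foo.cpp
```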
.. program:: llvm-cov report
.. _llvm-cov-report:
@@ -330,7 +345,11 @@ OPTIONS
.. option:: -show-functions
- Show coverage summaries for each function.
+ Show coverage summaries for each function. Defaults to false.
+
+.. option:: -show-instantiation-summary
+
+ Show statistics for all function instantiations. Defaults to false.
.. program:: llvm-cov export
@@ -363,3 +382,10 @@ OPTIONS
It is an error to specify an architecture that is not included in the
universal binary or to use an architecture that does not match a
non-universal binary.
+
+.. option:: -summary-only
+
+ Export only summary information for each file in the coverage data. This mode
+ will not export coverage information for smaller units such as individual
+ functions or regions. The result is the same as that produced by the
+ :program:`llvm-cov report` command, but presented in JSON format rather
+ than text.
diff --git a/docs/CommandGuide/llvm-dwarfdump.rst b/docs/CommandGuide/llvm-dwarfdump.rst
index 30c18adb7713..a3b62664cbe5 100644
--- a/docs/CommandGuide/llvm-dwarfdump.rst
+++ b/docs/CommandGuide/llvm-dwarfdump.rst
@@ -1,30 +1,142 @@
-llvm-dwarfdump - print contents of DWARF sections
-=================================================
+llvm-dwarfdump - dump and verify DWARF debug information
+========================================================
SYNOPSIS
--------
-:program:`llvm-dwarfdump` [*options*] [*filenames...*]
+:program:`llvm-dwarfdump` [*options*] [*filename ...*]
DESCRIPTION
-----------
-:program:`llvm-dwarfdump` parses DWARF sections in the object files
-and prints their contents in human-readable form.
+:program:`llvm-dwarfdump` parses DWARF sections in object files,
+archives, and ``.dSYM`` bundles and prints their contents in
+human-readable form. Only the ``.debug_info`` section is printed unless one of
+the section-specific options or :option:`--all` is specified.
OPTIONS
-------
-.. option:: -debug-dump=section
+.. option:: -a, --all
- Specify the DWARF section to dump.
- For example, use ``abbrev`` to dump the contents of ``.debug_abbrev`` section,
- ``loc.dwo`` to dump the contents of ``.debug_loc.dwo`` etc.
- See ``llvm-dwarfdump --help`` for the complete list of supported sections.
- Use ``all`` to dump all DWARF sections. It is the default.
+ Disassemble all supported DWARF sections.
+
+.. option:: --arch=<arch>
+
+ Dump DWARF debug information for the specified CPU architecture.
+ Architectures may be specified by name or by number. This
+ option can be specified multiple times, once for each desired
+ architecture. All CPU architectures will be printed by
+ default.
+
+.. option:: -c, --show-children
+
+ Show a debug info entry's children when using
+ the :option:`--debug-info`, :option:`--find`,
+ and :option:`--name` options.
+
+.. option:: -f <name>, --find=<name>
+
+ Search for the exact text <name> in the accelerator tables
+ and print the matching debug information entries.
+ When there are no accelerator tables, or the name of the DIE
+ you are looking for is not found in the accelerator tables,
+ try using the slower but more complete :option:`--name` option.
+
+.. option:: -F, --show-form
+
+ Show DWARF form types after the DWARF attribute types.
+
+.. option:: -h, --help
+
+ Show help and usage for this command.
+
+.. option:: -i, --ignore-case
+
+ Ignore case distinctions when searching for entries by name
+ or by regular expression.
+
+.. option:: -n <pattern>, --name=<pattern>
+
+ Find and print all debug info entries whose name
+ (`DW_AT_name` attribute) matches the exact text in
+ <pattern>. Use the :option:`--regex` option to have
+ <pattern> become a regular expression for more flexible
+ pattern matching.
+
+.. option:: --lookup=<address>
+
+ Look up <address> in the debug information and print out the file,
+ function, block, and line table details.
+
+.. option:: -o <path>, --out-file=<path>
+
+ Redirect output to a file specified by <path>.
+
+.. option:: -p, --show-parents
+
+ Show a debug info entry's parent objects when using the
+ :option:`--debug-info`, :option:`--find`, and
+ :option:`--name` options.
+
+.. option:: -r <n>, --recurse-depth=<n>
+
+ Only recurse to a maximum depth of <n> when dumping debug info
+ entries.
+
+.. option:: --statistics
+
+ Collect debug info quality metrics and print the results
+ as machine-readable single-line JSON output.
+
+.. option:: -x, --regex
+
+ Treat any <pattern> strings as regular expressions when searching
+ instead of just as an exact string match.
+
+.. option:: -u, --uuid
+
+ Show the UUID for each architecture.
+
+.. option:: --diff
+
+ Dump the output in a format that is more friendly for comparing
+ DWARF output from two different files.
+
+.. option:: -v, --verbose
+
+ Display verbose information when dumping. This can help to debug
+ DWARF issues.
+
+.. option:: --verify
+
+ Verify the structure of the DWARF information by verifying the
+ compile unit chains, DIE relationships graph, address
+ ranges, and more.
+
+.. option:: --version
+
+ Display the version of the tool.
+
+.. option:: --debug-abbrev, --debug-aranges, --debug-cu-index, --debug-frame [=<offset>], --debug-gnu-pubnames, --debug-gnu-pubtypes, --debug-info [=<offset>], --debug-line [=<offset>], --debug-loc [=<offset>], --debug-macro, --debug-pubnames, --debug-pubtypes, --debug-ranges, --debug-str, --debug-str-offsets, --debug-tu-index, --debug-types, --eh-frame, --gdb-index, --apple-names, --apple-types, --apple-namespaces, --apple-objc
+
+ Dump the specified DWARF section by name. Only the
+ `.debug_info` section is shown by default. Some entries
+ support adding an `=<offset>` as a way to provide an
+ optional offset of the exact entry to dump within the
+ respective section. When an offset is provided, only the
+ entry at that offset will be dumped; otherwise, the entire
+ section will be dumped. Children of items at a specific
+ offset can be dumped by also using the
+ :option:`--show-children` option where applicable.
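For example, a single debug info entry and its children could be dumped like this (the offset and file name are illustrative):

```text
$ llvm-dwarfdump --debug-info=0x0000002d --show-children a.out
```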
EXIT STATUS
-----------
:program:`llvm-dwarfdump` returns 0 if the input files were parsed and dumped
successfully. Otherwise, it returns 1.
+
+SEE ALSO
+--------
+
+:manpage:`dsymutil(1)`
diff --git a/docs/CommandGuide/llvm-pdbutil.rst b/docs/CommandGuide/llvm-pdbutil.rst
new file mode 100644
index 000000000000..29d487e0e740
--- /dev/null
+++ b/docs/CommandGuide/llvm-pdbutil.rst
@@ -0,0 +1,585 @@
+llvm-pdbutil - PDB File forensics and diagnostics
+=================================================
+
+.. contents::
+ :local:
+
+Synopsis
+--------
+
+:program:`llvm-pdbutil` [*subcommand*] [*options*]
+
+Description
+-----------
+
+Display types, symbols, CodeView records, and other information from a
+PDB file, as well as manipulate and create PDB files. :program:`llvm-pdbutil`
+is normally used by FileCheck-based tests to test LLVM's PDB reading and
+writing functionality, but can also be used for general PDB file investigation
+and forensics, or as a replacement for cvdump.
+
+Subcommands
+-----------
+
+:program:`llvm-pdbutil` is separated into several subcommands each tailored to
+a different purpose. A brief summary of each command follows, with more detail
+in the sections that follow.
+
+ * :ref:`pretty_subcommand` - Dump symbol and type information in a format that
+ tries to look as much like the original source code as possible.
+ * :ref:`dump_subcommand` - Dump low level types and structures from the PDB
+ file, including CodeView records, hash tables, PDB streams, etc.
+ * :ref:`bytes_subcommand` - Dump data from the PDB file's streams, records,
+ types, symbols, etc as raw bytes.
+ * :ref:`yaml2pdb_subcommand` - Given a yaml description of a PDB file, produce
+ a valid PDB file that matches that description.
+ * :ref:`pdb2yaml_subcommand` - For a given PDB file, produce a YAML
+ description of some or all of the file in a way that the PDB can be
+ reconstructed.
+ * :ref:`merge_subcommand` - Given two PDBs, produce a third PDB that is the
+ result of merging the two input PDBs.
+
+.. _pretty_subcommand:
+
+pretty
+~~~~~~
+
+.. program:: llvm-pdbutil pretty
+
+.. important::
+ The **pretty** subcommand is built on the Windows DIA SDK, and as such is not
+ supported on non-Windows platforms.
+
+USAGE: :program:`llvm-pdbutil` pretty [*options*] <input PDB file>
+
+Summary
+^^^^^^^
+
+The *pretty* subcommand displays a very high level representation of your
+program's debug info. Since it is built on the Windows DIA SDK, which is the
+standard API through which Windows tools and debuggers query debug
+information, it presents a more authoritative view of how a debugger will
+interpret your debug information than a mode that displays low-level CodeView
+records.
+
+Options
+^^^^^^^
+
+Filtering and Sorting Options
++++++++++++++++++++++++++++++
+
+.. note::
+ *exclude* filters take priority over *include* filters. So if a filter
+ matches both an include and an exclude rule, then it is excluded.
+
+.. option:: -exclude-compilands=<string>
+
+ When dumping compilands, compiland source-file contributions, or per-compiland
+ symbols, this option instructs **llvm-pdbutil** to omit any compilands that
+ match the specified regular expression.
+
+.. option:: -exclude-symbols=<string>
+
+ When dumping global, public, or per-compiland symbols, this option instructs
+ **llvm-pdbutil** to omit any symbols that match the specified regular
+ expression.
+
+.. option:: -exclude-types=<string>
+
+ When dumping types, this option instructs **llvm-pdbutil** to omit any types
+ that match the specified regular expression.
+
+.. option:: -include-compilands=<string>
+
+ When dumping compilands, compiland source-file contributions, or per-compiland
+ symbols, limit the initial search to only those compilands that match the
+ specified regular expression.
+
+.. option:: -include-symbols=<string>
+
+ When dumping global, public, or per-compiland symbols, limit the initial
+ search to only those symbols that match the specified regular expression.
+
+.. option:: -include-types=<string>
+
+ When dumping types, limit the initial search to only those types that match
+ the specified regular expression.
+
+.. option:: -min-class-padding=<uint>
+
+ Only display types that have at least the specified amount of alignment
+ padding, accounting for padding in base classes and aggregate field members.
+
+.. option:: -min-class-padding-imm=<uint>
+
+ Only display types that have at least the specified amount of alignment
+ padding, ignoring padding in base classes and aggregate field members.
+
+.. option:: -min-type-size=<uint>
+
+ Only display types T where sizeof(T) is greater than or equal to the specified
+ amount.
+
+.. option:: -no-compiler-generated
+
+ Don't show compiler-generated types and symbols.
+
+.. option:: -no-enum-definitions
+
+ When dumping an enum, don't show the full enum (e.g. the individual enumerator
+ values).
+
+.. option:: -no-system-libs
+
+ Don't show symbols from system libraries.
+
+Symbol Type Options
++++++++++++++++++++
+.. option:: -all
+
+ Implies all other options in this category.
+
+.. option:: -class-definitions=<format>
+
+ Displays class definitions in the specified format.
+
+ .. code-block:: text
+
+ =all - Display all class members including data, constants, typedefs, functions, etc (default)
+ =layout - Only display members that contribute to class size.
+ =none - Don't display class definitions (e.g. only display the name and base list)
+
+.. option:: -class-order
+
+ Displays classes in the specified order.
+
+ .. code-block:: text
+
+ =none - Undefined / no particular sort order (default)
+ =name - Sort classes by name
+ =size - Sort classes by size
+ =padding - Sort classes by amount of padding
+ =padding-pct - Sort classes by percentage of space consumed by padding
+ =padding-imm - Sort classes by amount of immediate padding
+ =padding-pct-imm - Sort classes by percentage of space consumed by immediate padding
+
+.. option:: -class-recurse-depth=<uint>
+
+ When dumping class definitions, stop after recursing the specified number of times. The
+ default is 0, which means no limit.
+
+.. option:: -classes
+
+ Display classes
+
+.. option:: -compilands
+
+ Display compilands (e.g. object files)
+
+.. option:: -enums
+
+ Display enums
+
+.. option:: -externals
+
+ Dump external (e.g. exported) symbols
+
+.. option:: -globals
+
+ Dump global symbols
+
+.. option:: -lines
+
+ Dump the mappings between source lines and code addresses.
+
+.. option:: -module-syms
+
+ Display symbols (variables, functions, etc.) for each compiland
+
+.. option:: -sym-types=<types>
+
+ Type of symbols to dump when -globals, -externals, or -module-syms is
+ specified. (default all)
+
+ .. code-block:: text
+
+ =thunks - Display thunk symbols
+ =data - Display data symbols
+ =funcs - Display function symbols
+ =all - Display all symbols (default)
+
+.. option:: -symbol-order=<order>
+
+ For symbols dumped via the -module-syms, -globals, or -externals options, sort
+ the results in specified order.
+
+ .. code-block:: text
+
+ =none - Undefined / no particular sort order
+ =name - Sort symbols by name
+ =size - Sort symbols by size
+
+.. option:: -typedefs
+
+ Display typedef types
+
+.. option:: -types
+
+ Display all types (implies -classes, -enums, -typedefs)
+
+Other Options
++++++++++++++
+
+.. option:: -color-output
+
+ Force color output on or off. By default, color is used if outputting to a
+ terminal.
+
+.. option:: -load-address=<uint>
+
+ When displaying relative virtual addresses, assume the process is loaded at the
+ given address and display what would be the absolute address.
+
+.. _dump_subcommand:
+
+dump
+~~~~
+
+USAGE: :program:`llvm-pdbutil` dump [*options*] <input PDB file>
+
+.. program:: llvm-pdbutil dump
+
+Summary
+^^^^^^^
+
+The **dump** subcommand displays low level information about the structure of a
+PDB file. It is used heavily by LLVM's testing infrastructure, but can also be
+used for PDB forensics. It serves a role similar to that of Microsoft's
+`cvdump` tool.
+
+.. note::
+ The **dump** subcommand exposes internal details of the file format. As
+ such, the reader should be familiar with :doc:`/PDB/index` before using this
+ command.
+
+Options
+^^^^^^^
+
+MSF Container Options
++++++++++++++++++++++
+
+.. option:: -streams
+
+ Dump a summary of all of the streams in the PDB file.
+
+.. option:: -stream-blocks
+
+ In conjunction with :option:`-streams`, add information to the output about
+ what blocks the specified stream occupies.
+
+.. option:: -summary
+
+ Dump MSF and PDB header information.
+
+Module & File Options
++++++++++++++++++++++
+
+.. option:: -modi=<uint>
+
+ For all options that dump information from each module/compiland, limit to
+ the specified module.
+
+.. option:: -files
+
+ Dump the source files that contribute to each displayed module.
+
+.. option:: -il
+
+ Dump inlinee line information (DEBUG_S_INLINEELINES CodeView subsection)
+
+.. option:: -l
+
+ Dump line information (DEBUG_S_LINES CodeView subsection)
+
+.. option:: -modules
+
+ Dump compiland information
+
+.. option:: -xme
+
+ Dump cross module exports (DEBUG_S_CROSSSCOPEEXPORTS CodeView subsection)
+
+.. option:: -xmi
+
+ Dump cross module imports (DEBUG_S_CROSSSCOPEIMPORTS CodeView subsection)
+
+Symbol Options
+++++++++++++++
+
+.. option:: -globals
+
+ Dump global symbol records.
+
+.. option:: -global-extras
+
+ Dump additional information about the globals, such as hash buckets and hash
+ values.
+
+.. option:: -publics
+
+ Dump public symbol records.
+
+.. option:: -public-extras
+
+ Dump additional information about the publics, such as hash buckets and hash
+ values.
+
+.. option:: -symbols
+
+ Dump symbols (functions, variables, etc.) for each module dumped.
+
+.. option:: -sym-data
+
+ For each symbol record dumped as a result of the :option:`-symbols` option,
+ display the full bytes of the record in binary as well.
+
+Type Record Options
++++++++++++++++++++
+
+.. option:: -types
+
+ Dump CodeView type records from TPI stream
+
+.. option:: -type-extras
+
+ Dump additional information from the TPI stream, such as hashes and the type
+ index offsets array.
+
+.. option:: -type-data
+
+ For each type record dumped, display the full bytes of the record in binary as
+ well.
+
+.. option:: -type-index=<uint>
+
+ Only dump types with the specified type index.
+
+.. option:: -ids
+
+ Dump CodeView type records from IPI stream.
+
+.. option:: -id-extras
+
+ Dump additional information from the IPI stream, such as hashes and the type
+ index offsets array.
+
+.. option:: -id-data
+
+ For each ID record dumped, display the full bytes of the record in binary as
+ well.
+
+.. option:: -id-index=<uint>
+
+ Only dump ID records with the specified hexadecimal type index.
+
+.. option:: -dependents
+
+ When used in conjunction with :option:`-type-index` or :option:`-id-index`,
+ dumps the entire dependency graph for the specified index instead of just the
+ single record with the specified index. For example, if type index 0x4000 is
+ a function whose return type has index 0x3000, and you specify
+ `-type-index=0x4000 -dependents`, then this would dump both records (as well
+ as any other dependents in the tree).
+
+Miscellaneous Options
++++++++++++++++++++++
+
+.. option:: -all
+
+ Implies most other options.
+
+.. option:: -section-contribs
+
+ Dump section contributions.
+
+.. option:: -section-headers
+
+ Dump image section headers.
+
+.. option:: -section-map
+
+ Dump section map.
+
+.. option:: -string-table
+
+ Dump PDB string table.
+
+.. _bytes_subcommand:
+
+bytes
+~~~~~
+
+USAGE: :program:`llvm-pdbutil` bytes [*options*] <input PDB file>
+
+.. program:: llvm-pdbutil bytes
+
+Summary
+^^^^^^^
+
+Like the **dump** subcommand, the **bytes** subcommand displays low level
+information about the structure of a PDB file, but it is used for even deeper
+forensics. The **bytes** subcommand finds various structures in a PDB file
+based on the command line options specified, and dumps them in hex. Someone
+working on support for emitting PDBs would use this heavily, for example, to
+compare one PDB against another PDB to ensure byte-for-byte compatibility. It
+is not enough to simply compare the bytes of an entire file or an entire
+stream, because it is perfectly fine for the same structure to exist at different
+locations in two different PDBs, and "finding" the structure is half the battle.
+
+Options
+^^^^^^^
+
+MSF File Options
+++++++++++++++++
+
+.. option:: -block-range=<start[-end]>
+
+ Dump binary data from specified range of MSF file blocks.
+
+.. option:: -byte-range=<start[-end]>
+
+ Dump binary data from specified range of bytes in the file.
+
+.. option:: -fpm
+
+ Dump the MSF free page map.
+
+.. option:: -stream-data=<string>
+
+ Dump binary data from the specified streams. Format is SN[:Start][@Size].
+ For example, `-stream-data=7:3@12` dumps 12 bytes from stream 7, starting
+ at offset 3 in the stream.
+
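The ``SN[:Start][@Size]`` grammar can be parsed along these lines; this is only a sketch of the format described above, not llvm-pdbutil's own implementation:

```python
import re

# Grammar from the option description: SN[:Start][@Size]
SPEC = re.compile(r"^(?P<sn>\d+)(?::(?P<start>\d+))?(?:@(?P<size>\d+))?$")

def parse_stream_spec(spec):
    """Split a -stream-data argument into (stream, start, size).

    start defaults to 0; a size of None means "to the end of the stream".
    """
    m = SPEC.match(spec)
    if not m:
        raise ValueError("bad stream spec: " + spec)
    return (int(m.group("sn")),
            int(m.group("start") or 0),
            int(m.group("size")) if m.group("size") else None)

print(parse_stream_spec("7:3@12"))  # (7, 3, 12)
```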
+PDB Stream Options
+++++++++++++++++++
+
+.. option:: -name-map
+
+ Dump bytes of PDB Name Map
+
+DBI Stream Options
+++++++++++++++++++
+
+.. option:: -ec
+
+ Dump the edit and continue map substream of the DBI stream.
+
+.. option:: -files
+
+ Dump the file info substream of the DBI stream.
+
+.. option:: -modi
+
+ Dump the modi substream of the DBI stream.
+
+.. option:: -sc
+
+ Dump section contributions substream of the DBI stream.
+
+.. option:: -sm
+
+ Dump the section map from the DBI stream.
+
+.. option:: -type-server
+
+ Dump the type server map from the DBI stream.
+
+Module Options
+++++++++++++++
+
+.. option:: -mod=<uint>
+
+ Limit all options in this category to the specified module index. By default,
+ options in this category will dump bytes from all modules.
+
+.. option:: -chunks
+
+ Dump the bytes of each module's C13 debug subsection.
+
+.. option:: -split-chunks
+
+ When specified with :option:`-chunks`, split the C13 debug subsection into a
+ separate chunk for each subsection type, and dump them separately.
+
+.. option:: -syms
+
+ Dump the symbol record substream from each module.
+
+Type Record Options
++++++++++++++++++++
+
+.. option:: -id=<uint>
+
+ Dump the record from the IPI stream with the given type index.
+
+.. option:: -type=<uint>
+
+ Dump the record from the TPI stream with the given type index.
+
+.. _pdb2yaml_subcommand:
+
+pdb2yaml
+~~~~~~~~
+
+USAGE: :program:`llvm-pdbutil` pdb2yaml [*options*] <input PDB file>
+
+.. program:: llvm-pdbutil pdb2yaml
+
+Summary
+^^^^^^^
+
+Options
+^^^^^^^
+
+.. _yaml2pdb_subcommand:
+
+yaml2pdb
+~~~~~~~~
+
+USAGE: :program:`llvm-pdbutil` yaml2pdb [*options*] <input YAML file>
+
+.. program:: llvm-pdbutil yaml2pdb
+
+Summary
+^^^^^^^
+
+Generate a PDB file from a YAML description. The YAML syntax is not described
+here. Instead, use :ref:`llvm-pdbutil pdb2yaml <pdb2yaml_subcommand>` and
+examine the output for an example starting point.
+
+Options
+^^^^^^^
+
+.. option:: -pdb=<file-name>
+
+ Write the resulting PDB to the specified file.
+
+.. _merge_subcommand:
+
+merge
+~~~~~
+
+USAGE: :program:`llvm-pdbutil` merge [*options*] <input PDB file 1> <input PDB file 2>
+
+.. program:: llvm-pdbutil merge
+
+Summary
+^^^^^^^
+
+Merge two PDB files into a single file.
+
+Options
+^^^^^^^
+
+.. option:: -pdb=<file-name>
+
+ Write the resulting PDB to the specified file.
diff --git a/docs/CommandLine.rst b/docs/CommandLine.rst
index a660949881a4..5d2a39d45a17 100644
--- a/docs/CommandLine.rst
+++ b/docs/CommandLine.rst
@@ -1251,9 +1251,7 @@ Unices have a relatively low limit on command-line length. It is therefore
customary to use the so-called 'response files' to circumvent this
restriction. These files are mentioned on the command-line (using the "@file")
syntax. The program reads these files and inserts the contents into argv,
-thereby working around the command-line length limits. Response files are
-enabled by an optional fourth argument to `cl::ParseEnvironmentOptions`_ and
-`cl::ParseCommandLineOptions`_.
+thereby working around the command-line length limits.
Top-Level Classes and Functions
-------------------------------
@@ -1324,8 +1322,7 @@ option variables once ``argc`` and ``argv`` are available.
The ``cl::ParseCommandLineOptions`` function requires two parameters (``argc``
and ``argv``), but may also take an optional third parameter which holds
-`additional extra text`_ to emit when the ``-help`` option is invoked, and a
-fourth boolean parameter that enables `response files`_.
+`additional extra text`_ to emit when the ``-help`` option is invoked.
.. _cl::ParseEnvironmentOptions:
@@ -1340,9 +1337,8 @@ command line option variables just like `cl::ParseCommandLineOptions`_ does.
It takes four parameters: the name of the program (since ``argv`` may not be
available, it can't just look in ``argv[0]``), the name of the environment
-variable to examine, the optional `additional extra text`_ to emit when the
-``-help`` option is invoked, and the boolean switch that controls whether
-`response files`_ should be read.
+variable to examine, and the optional `additional extra text`_ to emit when the
+``-help`` option is invoked.
``cl::ParseEnvironmentOptions`` will break the environment variable's value up
into words and then process them using `cl::ParseCommandLineOptions`_.
diff --git a/docs/CompilerWriterInfo.rst b/docs/CompilerWriterInfo.rst
index 24375fb70d4e..60f102472c63 100644
--- a/docs/CompilerWriterInfo.rst
+++ b/docs/CompilerWriterInfo.rst
@@ -119,7 +119,7 @@ ABI
===
* `System V Application Binary Interface <http://www.sco.com/developers/gabi/latest/contents.html>`_
-* `Itanium C++ ABI <http://mentorembedded.github.io/cxx-abi/>`_
+* `Itanium C++ ABI <http://itanium-cxx-abi.github.io/cxx-abi/>`_
Linux
-----
diff --git a/docs/CoverageMappingFormat.rst b/docs/CoverageMappingFormat.rst
index 46cc9d10f319..30b11fe2f31d 100644
--- a/docs/CoverageMappingFormat.rst
+++ b/docs/CoverageMappingFormat.rst
@@ -258,7 +258,7 @@ The coverage mapping variable generated by Clang has 3 fields:
i32 2, ; The number of function records
i32 20, ; The length of the string that contains the encoded translation unit filenames
i32 20, ; The length of the string that contains the encoded coverage mapping data
- i32 1, ; Coverage mapping format version
+ i32 2, ; Coverage mapping format version
},
[2 x { i64, i32, i64 }] [ ; Function records
{ i64, i32, i64 } {
@@ -274,6 +274,8 @@ The coverage mapping variable generated by Clang has 3 fields:
[40 x i8] c"..." ; Encoded data (dissected later)
}, section "__llvm_covmap", align 8
+The current version of the format is version 3. The only difference from version 2 is that a special encoding for column end locations was introduced to indicate gap regions.
+
The function record layout has evolved since version 1. In version 1, the function record for *foo* is defined as follows:
.. code-block:: llvm
@@ -296,7 +298,7 @@ The coverage mapping header has the following fields:
* The length of the string in the third field of *__llvm_coverage_mapping* that contains the encoded coverage mapping data.
-* The format version. The current version is 2 (encoded as a 1).
+* The format version. The current version is 3 (encoded as a 2).
.. _function records:
@@ -602,4 +604,6 @@ The source range record contains the following fields:
* *numLines*: The difference between the ending line and the starting line
of the current mapping region.
-* *columnEnd*: The ending column of the mapping region.
+* *columnEnd*: The ending column of the mapping region. If the high bit is set,
+ the current mapping region is a gap area. A count for a gap area is only used
+ as the line execution count if there are no other regions on a line.
diff --git a/docs/DeveloperPolicy.rst b/docs/DeveloperPolicy.rst
index 97e057234379..9bd50f142b96 100644
--- a/docs/DeveloperPolicy.rst
+++ b/docs/DeveloperPolicy.rst
@@ -188,7 +188,7 @@ problem, we have a notion of an 'owner' for a piece of the code. The sole
responsibility of a code owner is to ensure that a commit to their area of the
code is appropriately reviewed, either by themself or by someone else. The list
of current code owners can be found in the file
-`CODE_OWNERS.TXT <http://llvm.org/klaus/llvm/blob/master/CODE_OWNERS.TXT>`_
+`CODE_OWNERS.TXT <http://git.llvm.org/klaus/llvm/blob/master/CODE_OWNERS.TXT>`_
in the root of the LLVM source tree.
Note that code ownership is completely different than reviewers: anyone can
diff --git a/docs/ExceptionHandling.rst b/docs/ExceptionHandling.rst
index a44fb92794c9..ff8b6456c2bc 100644
--- a/docs/ExceptionHandling.rst
+++ b/docs/ExceptionHandling.rst
@@ -32,13 +32,13 @@ execution of an application.
A more complete description of the Itanium ABI exception handling runtime
support can be found at `Itanium C++ ABI: Exception Handling
-<http://mentorembedded.github.com/cxx-abi/abi-eh.html>`_. A description of the
+<http://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html>`_. A description of the
exception frame format can be found at `Exception Frames
<http://refspecs.linuxfoundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/ehframechpt.html>`_,
with details of the DWARF 4 specification at `DWARF 4 Standard
<http://dwarfstd.org/Dwarf4Std.php>`_. A description for the C++ exception
table formats can be found at `Exception Handling Tables
-<http://mentorembedded.github.com/cxx-abi/exceptions.pdf>`_.
+<http://itanium-cxx-abi.github.io/cxx-abi/exceptions.pdf>`_.
Setjmp/Longjmp Exception Handling
---------------------------------
diff --git a/docs/FuzzingLLVM.rst b/docs/FuzzingLLVM.rst
new file mode 100644
index 000000000000..b3cf719f275b
--- /dev/null
+++ b/docs/FuzzingLLVM.rst
@@ -0,0 +1,274 @@
+================================
+Fuzzing LLVM libraries and tools
+================================
+
+.. contents::
+ :local:
+ :depth: 2
+
+Introduction
+============
+
+The LLVM tree includes a number of fuzzers for various components. These are
+built on top of :doc:`LibFuzzer <LibFuzzer>`.
+
+
+Available Fuzzers
+=================
+
+clang-fuzzer
+------------
+
+A |generic fuzzer| that tries to compile textual input as C++ code. Some of the
+bugs this fuzzer has reported are `on bugzilla`__ and `on OSS Fuzz's
+tracker`__.
+
+__ https://llvm.org/pr23057
+__ https://bugs.chromium.org/p/oss-fuzz/issues/list?q=proj-llvm+clang-fuzzer
+
+clang-proto-fuzzer
+------------------
+
+A |protobuf fuzzer| that compiles valid C++ programs generated from a protobuf
+class that describes a subset of the C++ language.
+
+This fuzzer accepts clang command line options after `ignore_remaining_args=1`.
+For example, the following command will fuzz clang with a higher optimization
+level:
+
+.. code-block:: shell
+
+ % bin/clang-proto-fuzzer <corpus-dir> -ignore_remaining_args=1 -O3
+
+clang-format-fuzzer
+-------------------
+
+A |generic fuzzer| that runs clang-format_ on C++ text fragments. Some of the
+bugs this fuzzer has reported are `on bugzilla`__
+and `on OSS Fuzz's tracker`__.
+
+.. _clang-format: https://clang.llvm.org/docs/ClangFormat.html
+__ https://llvm.org/pr23052
+__ https://bugs.chromium.org/p/oss-fuzz/issues/list?q=proj-llvm+clang-format-fuzzer
+
+llvm-as-fuzzer
+--------------
+
+A |generic fuzzer| that tries to parse text as :doc:`LLVM assembly <LangRef>`.
+Some of the bugs this fuzzer has reported are `on bugzilla`__.
+
+__ https://llvm.org/pr24639
+
+llvm-dwarfdump-fuzzer
+---------------------
+
+A |generic fuzzer| that interprets inputs as object files and runs
+:doc:`llvm-dwarfdump <CommandGuide/llvm-dwarfdump>` on them. Some of the bugs
+this fuzzer has reported are `on OSS Fuzz's tracker`__
+
+__ https://bugs.chromium.org/p/oss-fuzz/issues/list?q=proj-llvm+llvm-dwarfdump-fuzzer
+
+llvm-demangle-fuzzer
+---------------------
+
+A |generic fuzzer| for the Itanium demangler used in various LLVM tools. We've
+fuzzed ``__cxa_demangle`` to death; why not fuzz LLVM's implementation of the
+same function?
+
+llvm-isel-fuzzer
+----------------
+
+An |LLVM IR fuzzer| aimed at finding bugs in instruction selection.
+
+This fuzzer accepts flags after `ignore_remaining_args=1`. The flags match
+those of :doc:`llc <CommandGuide/llc>` and the triple is required. For example,
+the following command would fuzz AArch64 with :doc:`GlobalISel`:
+
+.. code-block:: shell
+
+ % bin/llvm-isel-fuzzer <corpus-dir> -ignore_remaining_args=1 -mtriple aarch64 -global-isel -O0
+
+Some flags can also be specified in the binary name itself in order to support
+OSS Fuzz, which has trouble with required arguments. To do this, you can copy
+or move ``llvm-isel-fuzzer`` to ``llvm-isel-fuzzer--x-y-z``, separating options
+from the binary name using "--". The valid options are architecture names
+(``aarch64``, ``x86_64``), optimization levels (``O0``, ``O2``), or specific
+keywords, like ``gisel`` for enabling global instruction selection. In this
+mode, the same example could be run like so:
+
+.. code-block:: shell
+
+ % bin/llvm-isel-fuzzer--aarch64-O0-gisel <corpus-dir>
+
+llvm-opt-fuzzer
+---------------
+
+An |LLVM IR fuzzer| aimed at finding bugs in optimization passes.
+
+It receives an optimization pipeline and runs it for each fuzzer input.
+
+The interface of this fuzzer almost directly mirrors that of
+``llvm-isel-fuzzer``. Both the ``mtriple`` and ``passes`` arguments are
+required. Passes are specified in a format suitable for the new pass manager.
+
+.. code-block:: shell
+
+ % bin/llvm-opt-fuzzer <corpus-dir> -ignore_remaining_args=1 -mtriple x86_64 -passes instcombine
+
+As with ``llvm-isel-fuzzer``, arguments for some predefined configurations
+might be embedded directly into the binary name:
+
+.. code-block:: shell
+
+ % bin/llvm-opt-fuzzer--x86_64-instcombine <corpus-dir>
+
+llvm-mc-assemble-fuzzer
+-----------------------
+
+A |generic fuzzer| that fuzzes the MC layer's assemblers by treating inputs as
+target specific assembly.
+
+Note that this fuzzer has an unusual command line interface which is not fully
+compatible with all of libFuzzer's features. Fuzzer arguments must be passed
+after ``--fuzzer-args``, and any ``llc`` flags must use two dashes. For
+example, to fuzz the AArch64 assembler you might use the following command:
+
+.. code-block:: console
+
+ llvm-mc-fuzzer --triple=aarch64-linux-gnu --fuzzer-args -max_len=4
+
+This scheme will likely change in the future.
+
+llvm-mc-disassemble-fuzzer
+--------------------------
+
+A |generic fuzzer| that fuzzes the MC layer's disassemblers by treating inputs
+as assembled binary data.
+
+Note that this fuzzer has an unusual command line interface which is not fully
+compatible with all of libFuzzer's features. See the notes above about
+``llvm-mc-assemble-fuzzer`` for details.
+
+
+.. |generic fuzzer| replace:: :ref:`generic fuzzer <fuzzing-llvm-generic>`
+.. |protobuf fuzzer|
+ replace:: :ref:`libprotobuf-mutator based fuzzer <fuzzing-llvm-protobuf>`
+.. |LLVM IR fuzzer|
+ replace:: :ref:`structured LLVM IR fuzzer <fuzzing-llvm-ir>`
+
+
+Mutators and Input Generators
+=============================
+
+The inputs for a fuzz target are generated via random mutations of a
+:ref:`corpus <libfuzzer-corpus>`. There are a few options for the kinds of
+mutations that a fuzzer in LLVM might want.
+
+.. _fuzzing-llvm-generic:
+
+Generic Random Fuzzing
+----------------------
+
+The most basic form of input mutation is to use the built-in mutators of
+LibFuzzer. These simply treat the input corpus as a bag of bits and make random
+mutations. This type of fuzzer is good for stressing the surface layers of a
+program, and well suited to testing things like lexers, parsers, or binary
+protocols.
+
+Some of the in-tree fuzzers that use this type of mutator are `clang-fuzzer`_,
+`clang-format-fuzzer`_, `llvm-as-fuzzer`_, `llvm-dwarfdump-fuzzer`_,
+`llvm-mc-assemble-fuzzer`_, and `llvm-mc-disassemble-fuzzer`_.
+
+.. _fuzzing-llvm-protobuf:
+
+Structured Fuzzing using ``libprotobuf-mutator``
+------------------------------------------------
+
+We can use libprotobuf-mutator_ in order to perform structured fuzzing and
+stress deeper layers of programs. This works by defining a protobuf class that
+translates arbitrary data into structurally interesting input. Specifically, we
+use this to work with a subset of the C++ language and perform mutations that
+produce valid C++ programs in order to exercise parts of clang that are more
+interesting than parser error handling.
+
+To build this kind of fuzzer you need `protobuf`_ and its dependencies
+installed, and you need to specify some extra flags when configuring the build
+with :doc:`CMake <CMake>`. For example, `clang-proto-fuzzer`_ can be enabled by
+adding ``-DCLANG_ENABLE_PROTO_FUZZER=ON`` to the flags described in
+:ref:`building-fuzzers`.
+
+The only in-tree fuzzer that uses ``libprotobuf-mutator`` today is
+`clang-proto-fuzzer`_.
+
+.. _libprotobuf-mutator: https://github.com/google/libprotobuf-mutator
+.. _protobuf: https://github.com/google/protobuf
+
+.. _fuzzing-llvm-ir:
+
+Structured Fuzzing of LLVM IR
+-----------------------------
+
+We also use a more direct form of structured fuzzing for fuzzers that take
+:doc:`LLVM IR <LangRef>` as input. This is achieved through the ``FuzzMutate``
+library, which was `discussed at EuroLLVM 2017`_.
+
+The ``FuzzMutate`` library is used to structurally fuzz backends in
+`llvm-isel-fuzzer`_.
+
+.. _discussed at EuroLLVM 2017: https://www.youtube.com/watch?v=UBbQ_s6hNgg
+
+
+Building and Running
+====================
+
+.. _building-fuzzers:
+
+Configuring LLVM to Build Fuzzers
+---------------------------------
+
+Fuzzers will be built and linked to libFuzzer by default as long as you build
+LLVM with sanitizer coverage enabled. You would typically also enable at least
+one sanitizer to find bugs faster. The most common way to build the fuzzers is
+by adding the following two flags to your CMake invocation:
+``-DLLVM_USE_SANITIZER=Address -DLLVM_USE_SANITIZE_COVERAGE=On``.
+
+.. note:: If you have ``compiler-rt`` checked out in an LLVM tree when building
+ with sanitizers, you'll want to specify ``-DLLVM_BUILD_RUNTIME=Off``
+ to avoid building the sanitizers themselves with sanitizers enabled.
+
+Continuously Running and Finding Bugs
+-------------------------------------
+
+There used to be a public buildbot running LLVM fuzzers continuously, and while
+it did find issues, it didn't have a good way to report problems in an
+actionable way. Because of this, we're moving towards using `OSS Fuzz`_
+instead.
+
+You can browse the `LLVM project issue list`_ for the bugs found by
+`LLVM on OSS Fuzz`_. These are also mailed to the `llvm-bugs mailing
+list`_.
+
+.. _OSS Fuzz: https://github.com/google/oss-fuzz
+.. _LLVM project issue list:
+ https://bugs.chromium.org/p/oss-fuzz/issues/list?q=Proj-llvm
+.. _LLVM on OSS Fuzz:
+ https://github.com/google/oss-fuzz/blob/master/projects/llvm
+.. _llvm-bugs mailing list:
+ http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
+
+
+Utilities for Writing Fuzzers
+=============================
+
+There are some utilities available for writing fuzzers in LLVM.
+
+Some helpers for handling the command line interface are available in
+``include/llvm/FuzzMutate/FuzzerCLI.h``, including functions to parse command
+line options in a consistent way and to implement standalone main functions so
+your fuzzer can be built and tested when not built against libFuzzer.
+
+There is also some handling of the CMake config for fuzzers: you should use
+the ``add_llvm_fuzzer`` function to set up fuzzer targets. It works similarly
+to functions such as ``add_llvm_tool``, but it takes care of linking to
+LibFuzzer when appropriate and can be passed the ``DUMMY_MAIN`` argument to
+enable standalone testing.
diff --git a/docs/GettingStarted.rst b/docs/GettingStarted.rst
index 0cb415ad764e..ed2e936d1360 100644
--- a/docs/GettingStarted.rst
+++ b/docs/GettingStarted.rst
@@ -52,6 +52,12 @@ Here's the short story for getting up and running quickly with LLVM:
* ``cd llvm/tools``
* ``svn co http://llvm.org/svn/llvm-project/cfe/trunk clang``
+#. Checkout Extra Clang Tools **[Optional]**:
+
+ * ``cd where-you-want-llvm-to-live``
+ * ``cd llvm/tools/clang/tools``
+ * ``svn co http://llvm.org/svn/llvm-project/clang-tools-extra/trunk extra``
+
#. Checkout LLD linker **[Optional]**:
* ``cd where-you-want-llvm-to-live``
@@ -91,9 +97,9 @@ Here's the short story for getting up and running quickly with LLVM:
#. Configure and build LLVM and Clang:
- *Warning:* Make sure you've checked out *all of* the source code
+ *Warning:* Make sure you've checked out *all of* the source code
before trying to configure with cmake. cmake does not pick up newly
- added source directories in incremental builds.
+ added source directories in incremental builds.
The build uses `CMake <CMake.html>`_. LLVM requires CMake 3.4.3 to build. It
is generally recommended to use a recent CMake, especially if you're
@@ -137,8 +143,8 @@ Here's the short story for getting up and running quickly with LLVM:
* CMake will generate build targets for each tool and library, and most
LLVM sub-projects generate their own ``check-<project>`` target.
- * Running a serial build will be *slow*. Make sure you run a
- parallel build; for ``make``, use ``make -j``.
+ * Running a serial build will be *slow*. Make sure you run a
+ parallel build; for ``make``, use ``make -j``.
* For more information see `CMake <CMake.html>`_
@@ -146,7 +152,7 @@ Here's the short story for getting up and running quickly with LLVM:
`below`_.
Consult the `Getting Started with LLVM`_ section for detailed information on
-configuring and compiling LLVM. Go to `Directory Layout`_ to learn about the
+configuring and compiling LLVM. Go to `Directory Layout`_ to learn about the
layout of the source code tree.
Requirements
@@ -191,10 +197,10 @@ Windows x64 x86-64 Visual Studio
Note that Debug builds require a lot of time and disk space. An LLVM-only build
will need about 1-3 GB of space. A full build of LLVM and Clang will need around
15-20 GB of disk space. The exact space requirements will vary by system. (It
-is so large because of all the debugging information and the fact that the
-libraries are statically linked into multiple tools).
+is so large because of all the debugging information and the fact that the
+libraries are statically linked into multiple tools).
-If you you are space-constrained, you can build only selected tools or only
+If you are space-constrained, you can build only selected tools or only
selected targets. The Release build requires considerably less space.
The LLVM suite *may* compile on other platforms, but it is not guaranteed to do
@@ -460,34 +466,13 @@ populate it with the LLVM source code, Makefiles, test directories, and local
copies of documentation files.
If you want to get a specific release (as opposed to the most recent revision),
-you can checkout it from the '``tags``' directory (instead of '``trunk``'). The
+you can check it out from the '``tags``' directory (instead of '``trunk``'). The
following releases are located in the following subdirectories of the '``tags``'
directory:
-* Release 3.4: **RELEASE_34/final**
-* Release 3.3: **RELEASE_33/final**
-* Release 3.2: **RELEASE_32/final**
-* Release 3.1: **RELEASE_31/final**
-* Release 3.0: **RELEASE_30/final**
-* Release 2.9: **RELEASE_29/final**
-* Release 2.8: **RELEASE_28**
-* Release 2.7: **RELEASE_27**
-* Release 2.6: **RELEASE_26**
-* Release 2.5: **RELEASE_25**
-* Release 2.4: **RELEASE_24**
-* Release 2.3: **RELEASE_23**
-* Release 2.2: **RELEASE_22**
-* Release 2.1: **RELEASE_21**
-* Release 2.0: **RELEASE_20**
-* Release 1.9: **RELEASE_19**
-* Release 1.8: **RELEASE_18**
-* Release 1.7: **RELEASE_17**
-* Release 1.6: **RELEASE_16**
-* Release 1.5: **RELEASE_15**
-* Release 1.4: **RELEASE_14**
-* Release 1.3: **RELEASE_13**
-* Release 1.2: **RELEASE_12**
-* Release 1.1: **RELEASE_11**
+* Release 3.5.0 and later: **RELEASE_350/final** and so on
+* Release 2.9 through 3.4: **RELEASE_29/final** and so on
+* Release 1.1 through 2.8: **RELEASE_11** and so on
* Release 1.0: **RELEASE_1**
If you would like to get the LLVM test suite (a separate package as of 1.4), you
@@ -512,43 +497,43 @@ clone of LLVM via:
.. code-block:: console
- % git clone http://llvm.org/git/llvm.git
+ % git clone https://git.llvm.org/git/llvm.git/
If you want to check out clang too, run:
.. code-block:: console
% cd llvm/tools
- % git clone http://llvm.org/git/clang.git
+ % git clone https://git.llvm.org/git/clang.git/
If you want to check out compiler-rt (required to build the sanitizers), run:
.. code-block:: console
% cd llvm/projects
- % git clone http://llvm.org/git/compiler-rt.git
+ % git clone https://git.llvm.org/git/compiler-rt.git/
If you want to check out libomp (required for OpenMP support), run:
.. code-block:: console
% cd llvm/projects
- % git clone http://llvm.org/git/openmp.git
+ % git clone https://git.llvm.org/git/openmp.git/
If you want to check out libcxx and libcxxabi (optional), run:
.. code-block:: console
% cd llvm/projects
- % git clone http://llvm.org/git/libcxx.git
- % git clone http://llvm.org/git/libcxxabi.git
+ % git clone https://git.llvm.org/git/libcxx.git/
+ % git clone https://git.llvm.org/git/libcxxabi.git/
If you want to check out the Test Suite Source Code (optional), run:
.. code-block:: console
% cd llvm/projects
- % git clone http://llvm.org/git/test-suite.git
+ % git clone https://git.llvm.org/git/test-suite.git/
Since the upstream repository is in Subversion, you should use ``git
pull --rebase`` instead of ``git pull`` to avoid generating a non-linear history
@@ -622,7 +607,7 @@ To set up clone from which you can submit code using ``git-svn``, run:
.. code-block:: console
- % git clone http://llvm.org/git/llvm.git
+ % git clone https://git.llvm.org/git/llvm.git/
% cd llvm
% git svn init https://llvm.org/svn/llvm-project/llvm/trunk --username=<username>
% git config svn-remote.svn.fetch :refs/remotes/origin/master
@@ -630,7 +615,7 @@ To set up clone from which you can submit code using ``git-svn``, run:
# If you have clang too:
% cd tools
- % git clone http://llvm.org/git/clang.git
+ % git clone https://git.llvm.org/git/clang.git/
% cd clang
% git svn init https://llvm.org/svn/llvm-project/cfe/trunk --username=<username>
% git config svn-remote.svn.fetch :refs/remotes/origin/master
@@ -1010,7 +995,7 @@ Directory Layout
================
One useful source of information about the LLVM source base is the LLVM `doxygen
-<http://www.doxygen.org/>`_ documentation available at
+<http://www.doxygen.org/>`_ documentation available at
`<http://llvm.org/doxygen/>`_. The following is a brief introduction to code
layout:
@@ -1026,13 +1011,13 @@ Public header files exported from the LLVM library. The three main subdirectorie
``llvm/include/llvm``
- All LLVM-specific header files, and subdirectories for different portions of
+ All LLVM-specific header files, and subdirectories for different portions of
LLVM: ``Analysis``, ``CodeGen``, ``Target``, ``Transforms``, etc...
``llvm/include/llvm/Support``
- Generic support libraries provided with LLVM but not necessarily specific to
- LLVM. For example, some C++ STL utilities and a Command Line option processing
+ Generic support libraries provided with LLVM but not necessarily specific to
+ LLVM. For example, some C++ STL utilities and a Command Line option processing
library store header files here.
``llvm/include/llvm/Config``
@@ -1045,12 +1030,12 @@ Public header files exported from the LLVM library. The three main subdirectorie
``llvm/lib``
------------
-Most source files are here. By putting code in libraries, LLVM makes it easy to
+Most source files are here. By putting code in libraries, LLVM makes it easy to
share code among the `tools`_.
``llvm/lib/IR/``
- Core LLVM source files that implement core classes like Instruction and
+ Core LLVM source files that implement core classes like Instruction and
BasicBlock.
``llvm/lib/AsmParser/``
@@ -1063,23 +1048,23 @@ share code among the `tools`_.
``llvm/lib/Analysis/``
- A variety of program analyses, such as Call Graphs, Induction Variables,
+ A variety of program analyses, such as Call Graphs, Induction Variables,
Natural Loop Identification, etc.
``llvm/lib/Transforms/``
- IR-to-IR program transformations, such as Aggressive Dead Code Elimination,
- Sparse Conditional Constant Propagation, Inlining, Loop Invariant Code Motion,
+ IR-to-IR program transformations, such as Aggressive Dead Code Elimination,
+ Sparse Conditional Constant Propagation, Inlining, Loop Invariant Code Motion,
Dead Global Elimination, and many others.
``llvm/lib/Target/``
- Files describing target architectures for code generation. For example,
+ Files describing target architectures for code generation. For example,
``llvm/lib/Target/X86`` holds the X86 machine description.
``llvm/lib/CodeGen/``
- The major parts of the code generator: Instruction Selector, Instruction
+ The major parts of the code generator: Instruction Selector, Instruction
Scheduling, and Register Allocation.
``llvm/lib/MC/``
@@ -1088,7 +1073,7 @@ share code among the `tools`_.
``llvm/lib/ExecutionEngine/``
- Libraries for directly executing bitcode at runtime in interpreted and
+ Libraries for directly executing bitcode at runtime in interpreted and
JIT-compiled scenarios.
``llvm/lib/Support/``
@@ -1099,7 +1084,7 @@ share code among the `tools`_.
``llvm/projects``
-----------------
-Projects not strictly part of LLVM but shipped with LLVM. This is also the
+Projects not strictly part of LLVM but shipped with LLVM. This is also the
directory for creating your own LLVM-based projects which leverage the LLVM
build system.
@@ -1112,8 +1097,8 @@ are intended to run quickly and cover a lot of territory without being exhaustiv
``test-suite``
--------------
-A comprehensive correctness, performance, and benchmarking test suite for LLVM.
-Comes in a separate Subversion module because not every LLVM user is interested
+A comprehensive correctness, performance, and benchmarking test suite for LLVM.
+Comes in a separate Subversion module because not every LLVM user is interested
in such a comprehensive suite. For details see the :doc:`Testing Guide
<TestingGuide>` document.
@@ -1194,7 +1179,7 @@ because they are code generators for parts of the infrastructure.
``emacs/``
- Emacs and XEmacs syntax highlighting for LLVM assembly files and TableGen
+ Emacs and XEmacs syntax highlighting for LLVM assembly files and TableGen
description files. See the ``README`` for information on using them.
``getsrcs.sh``
diff --git a/docs/GettingStartedVS.rst b/docs/GettingStartedVS.rst
index 50f7aa123c55..a4ff2b822fc3 100644
--- a/docs/GettingStartedVS.rst
+++ b/docs/GettingStartedVS.rst
@@ -76,6 +76,11 @@ Here's the short story for getting up and running quickly with LLVM:
* With anonymous Subversion access:
+ *Note:* some regression tests require Unix-style line endings (``\n``). To
+ pass all regression tests, please add two lines *enable-auto-props = yes*
+ and *\* = svn:mime-type=application/octet-stream* to
+ ``C:\Users\<username>\AppData\Roaming\Subversion\config``.
+
1. ``cd <where-you-want-llvm-to-live>``
2. ``svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm``
3. ``cd llvm``
diff --git a/docs/GlobalISel.rst b/docs/GlobalISel.rst
index 176bd4edea33..8746685491c7 100644
--- a/docs/GlobalISel.rst
+++ b/docs/GlobalISel.rst
@@ -304,10 +304,13 @@ As opposed to SelectionDAG, there are no legalization phases. In particular,
Legalization is iterative, and all state is contained in GMIR. To maintain the
validity of the intermediate code, instructions are introduced:
-* ``G_SEQUENCE`` --- concatenate multiple registers into a single wider
- register.
+* ``G_MERGE_VALUES`` --- concatenate multiple registers of the same
+ size into a single wider register.
-* ``G_EXTRACT`` --- extract multiple registers (as contiguous sequences of bits)
+* ``G_UNMERGE_VALUES`` --- extract multiple registers of the same size
+ from a single wider register.
+
+* ``G_EXTRACT`` --- extract a register (as a contiguous sequence of bits)
from a single wider register.
As they are expected to be temporary byproducts of the legalization process,
@@ -500,16 +503,69 @@ The simple API consists of:
This target-provided method is responsible for mutating (or replacing) a
possibly-generic MI into a fully target-specific equivalent.
It is also responsible for doing the necessary constraining of gvregs into the
-appropriate register classes.
+appropriate register classes as well as passing through COPY instructions to
+the register allocator.
The ``InstructionSelector`` can fold other instructions into the selected MI,
by walking the use-def chain of the vreg operands.
As GlobalISel is Global, this folding can occur across basic blocks.
-``TODO``:
-Currently, the Select pass is implemented with hand-written c++, similar to
-FastISel, rather than backed by tblgen'erated pattern-matching.
-We intend to eventually reuse SelectionDAG patterns.
+SelectionDAG Rule Imports
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+TableGen will import SelectionDAG rules and provide the following function to
+execute them:
+
+ .. code-block:: c++
+
+ bool selectImpl(MachineInstr &MI)
+
+The ``--stats`` option can be used to determine what proportion of rules were
+successfully imported. The easiest way to use this is to copy the
+``-gen-globalisel`` tablegen command from ``ninja -v`` and modify it.
+
+Similarly, the ``--warn-on-skipped-patterns`` option can be used to obtain the
+reasons that rules weren't imported. This can be used to focus on the most
+important rejection reasons.
+
+PatLeaf Predicates
+^^^^^^^^^^^^^^^^^^
+
+PatLeafs cannot be imported because their C++ is implemented in terms of
+``SDNode`` objects. PatLeafs that handle immediate predicates should be
+replaced by ``ImmLeaf``, ``IntImmLeaf``, or ``FPImmLeaf`` as appropriate.
+
+There's no standard answer for other PatLeafs. Some standard predicates have
+been baked into TableGen, but this should not generally be done.
+
+Custom SDNodes
+^^^^^^^^^^^^^^
+
+Custom SDNodes should be mapped to Target Pseudos using ``GINodeEquiv``. This
+will cause the instruction selector to import them, but you will also need to
+ensure the target pseudo is introduced to the MIR before the instruction
+selector. Any preceding pass is suitable, but the legalizer will be a
+particularly common choice.
+
+ComplexPatterns
+^^^^^^^^^^^^^^^
+
+ComplexPatterns cannot be imported because their C++ is implemented in terms of
+``SDNode`` objects. GlobalISel versions should be defined with
+``GIComplexOperandMatcher`` and mapped to ComplexPattern with
+``GIComplexPatternEquiv``.
+
+The following predicates are useful for porting ComplexPattern:
+
+* ``isBaseWithConstantOffset()`` --- check for base+offset structures.
+* ``isOperandImmEqual()`` --- check for a particular constant.
+* ``isObviouslySafeToFold()`` --- check for reasons an instruction can't be
+  sunk and folded into another.
+
+There are some important points for the C++ implementation:
+
+* Don't modify the MIR in the predicate.
+* Renderer lambdas should capture by value to avoid use-after-free; they will
+  be used after the predicate returns.
+* Only create instructions in a renderer lambda. GlobalISel won't clean up
+  things you create but don't use.
.. _maintainability:
@@ -633,5 +689,14 @@ Additionally:
* ``TargetPassConfig`` --- create the passes constituting the pipeline,
including additional passes not included in the :ref:`pipeline`.
-* ``GISelAccessor`` --- setup the various subtarget-provided classes, with a
- graceful fallback to no-op when GlobalISel isn't enabled.
+
+.. _other_resources:
+
+Resources
+=========
+
+* `Global Instruction Selection - A Proposal by Quentin Colombet @LLVMDevMeeting 2015 <https://www.youtube.com/watch?v=F6GGbYtae3g>`_
+* `Global Instruction Selection - Status by Quentin Colombet, Ahmed Bougacha, and Tim Northover @LLVMDevMeeting 2016 <https://www.youtube.com/watch?v=6tfb344A7w8>`_
+* `GlobalISel - LLVM's Latest Instruction Selection Framework by Diana Picus @FOSDEM17 <https://www.youtube.com/watch?v=d6dF6E4BPeU>`_
+* GlobalISel: Past, Present, and Future by Quentin Colombet and Ahmed Bougacha @LLVMDevMeeting 2017
+* Head First into GlobalISel by Daniel Sanders, Aditya Nandakumar, and Justin Bogner @LLVMDevMeeting 2017
diff --git a/docs/HowToCrossCompileBuiltinsOnArm.rst b/docs/HowToCrossCompileBuiltinsOnArm.rst
new file mode 100644
index 000000000000..4b4d563a5a96
--- /dev/null
+++ b/docs/HowToCrossCompileBuiltinsOnArm.rst
@@ -0,0 +1,201 @@
+===================================================================
+How to Cross Compile Compiler-rt Builtins For Arm
+===================================================================
+
+Introduction
+============
+
+This document contains information about building and testing the builtins part
+of compiler-rt for an Arm target, from an x86_64 Linux machine.
+
+While this document concentrates on Arm and Linux the general principles should
+apply to other targets supported by compiler-rt. Further contributions for other
+targets are welcome.
+
+The instructions in this document depend on libraries and programs external to
+LLVM, there are many ways to install and configure these dependencies so you
+may need to adapt the instructions here to fit your own local situation.
+
+Prerequisites
+=============
+
+In this use case we'll be using CMake on a Debian-based Linux system,
+cross-compiling from an x86_64 host to a hard-float Armv7-A target. We'll be
+using as many of the LLVM tools as we can, but it is possible to use GNU
+equivalents.
+
+ * A build of LLVM/clang for the llvm-tools and ``llvm-config``
+ * The ``qemu-arm`` user mode emulator
+ * An arm-linux-gnueabihf sysroot
+
+See https://compiler-rt.llvm.org/ for more information about the dependencies
+on clang and LLVM.
+
+``qemu-arm`` should be available as a package for your Linux distribution.
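+
+On a Debian-based distribution, for example, something like the following
+should provide it (the package name may vary between distributions):
+
+.. code-block:: console
+
+  $ sudo apt-get install qemu-user
+  $ qemu-arm -version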
+
+The most complicated of the prerequisites to satisfy is the arm-linux-gnueabihf
+sysroot. The :doc:`HowToCrossCompileLLVM` document has information about how to
+use a Linux distribution's multiarch support to fulfill the dependencies for
+building LLVM. Alternatively, as building and testing just the compiler-rt
+builtins requires fewer dependencies than LLVM, it is possible to use the
+Linaro arm-linux-gnueabihf gcc installation as our sysroot.
+
+Building compiler-rt builtins for Arm
+=====================================
+We will be doing a standalone build of compiler-rt using the following cmake
+options.
+
+* ``path/to/llvm/projects/compiler-rt``
+* ``-DCOMPILER_RT_BUILD_BUILTINS=ON``
+* ``-DCOMPILER_RT_BUILD_SANITIZERS=OFF``
+* ``-DCOMPILER_RT_BUILD_XRAY=OFF``
+* ``-DCOMPILER_RT_BUILD_LIBFUZZER=OFF``
+* ``-DCOMPILER_RT_BUILD_PROFILE=OFF``
+* ``-DCMAKE_C_COMPILER=/path/to/clang``
+* ``-DCMAKE_AR=/path/to/llvm-ar``
+* ``-DCMAKE_NM=/path/to/llvm-nm``
+* ``-DCMAKE_RANLIB=/path/to/llvm-ranlib``
+* ``-DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=lld"``
+* ``-DCMAKE_C_COMPILER_TARGET="arm-linux-gnueabihf"``
+* ``-DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON``
+* ``-DLLVM_CONFIG_PATH=/path/to/llvm-config``
+* ``-DCMAKE_C_FLAGS="build-c-flags"``
+
+The build-c-flags need to be sufficient to pass the CMake compiler check and
+to compile compiler-rt. When using a GCC 7 Linaro arm-linux-gnueabihf
+installation the following flags are needed:
+
+* ``--target=arm-linux-gnueabihf``
+* ``--march=armv7a``
+* ``--gcc-toolchain=/path/to/dir/toolchain``
+* ``--sysroot=/path/to/toolchain/arm-linux-gnueabihf/libc``
+
+Depending on how your sysroot is laid out, you may not need ``--gcc-toolchain``.
+For example, if you have added armhf as an architecture using your Linux
+distribution's multiarch support then you should be able to use ``--sysroot=/``.
+
+Once cmake has completed, the builtins can be built with ``ninja builtins``.
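+
+Putting the options above together, an illustrative configure-and-build
+sequence might look like the following (all ``/path/to/...`` values are
+placeholders for your own installation):
+
+.. code-block:: console
+
+  $ mkdir build-builtins && cd build-builtins
+  $ cmake path/to/llvm/projects/compiler-rt -G Ninja \
+      -DCOMPILER_RT_BUILD_BUILTINS=ON \
+      -DCOMPILER_RT_BUILD_SANITIZERS=OFF \
+      -DCOMPILER_RT_BUILD_XRAY=OFF \
+      -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
+      -DCOMPILER_RT_BUILD_PROFILE=OFF \
+      -DCMAKE_C_COMPILER=/path/to/clang \
+      -DCMAKE_AR=/path/to/llvm-ar \
+      -DCMAKE_NM=/path/to/llvm-nm \
+      -DCMAKE_RANLIB=/path/to/llvm-ranlib \
+      -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=lld" \
+      -DCMAKE_C_COMPILER_TARGET="arm-linux-gnueabihf" \
+      -DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON \
+      -DLLVM_CONFIG_PATH=/path/to/llvm-config \
+      -DCMAKE_C_FLAGS="--target=arm-linux-gnueabihf --march=armv7a --gcc-toolchain=/path/to/dir/toolchain --sysroot=/path/to/toolchain/arm-linux-gnueabihf/libc"
+  $ ninja builtins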
+
+Testing compiler-rt builtins using qemu-arm
+===========================================
+To test the builtins library we need to add a few more cmake flags to enable
+testing and to set up the compiler and flags for the test cases. We must also
+tell cmake that we wish to run the tests on ``qemu-arm``.
+
+* ``-DCOMPILER_RT_EMULATOR="qemu-arm -L /path/to/armhf/sysroot"``
+* ``-DCOMPILER_RT_INCLUDE_TESTS=ON``
+* ``-DCOMPILER_RT_TEST_COMPILER="/path/to/clang"``
+* ``-DCOMPILER_RT_TEST_COMPILER_CFLAGS="test-c-flags"``
+
+The ``/path/to/armhf/sysroot`` should be the same as the one passed to
+``--sysroot`` in the "build-c-flags".
+
+The "test-c-flags" can be the same as the "build-c-flags", with the addition
+of ``-fuse-ld=lld`` if you wish to use lld to link the tests.
+
+Once cmake has completed, the tests can be built and run using
+``ninja check-builtins``.
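+
+The ``COMPILER_RT_EMULATOR`` value is effectively prepended to each test run
+command, so a compiled test binary ends up being executed roughly as follows
+(paths are illustrative):
+
+.. code-block:: console
+
+  $ qemu-arm -L /path/to/armhf/sysroot ./test-binary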
+
+Modifications for other Targets
+===============================
+
+Arm Soft-Float Target
+---------------------
+The instructions for the Arm hard-float target can be used for the soft-float
+target by substituting soft-float equivalents for the sysroot and target. The
+target to use is:
+
+* ``-DCMAKE_C_COMPILER_TARGET=arm-linux-gnueabi``
+
+Depending on whether you want to use floating point instructions or not you
+may need extra c-flags such as ``-mfloat-abi=softfp`` for use of floating-point
+instructions, and ``-mfloat-abi=soft -mfpu=none`` for software floating-point
+emulation.
+
+AArch64 Target
+--------------
+The instructions for Arm can be used for AArch64 by substituting AArch64
+equivalents for the sysroot, emulator and target.
+
+* ``-DCMAKE_C_COMPILER_TARGET=aarch64-linux-gnu``
+* ``-DCOMPILER_RT_EMULATOR="qemu-aarch64 -L /path/to/aarch64/sysroot"``
+
+The ``CMAKE_C_FLAGS`` and ``COMPILER_RT_TEST_COMPILER_CFLAGS`` may also need:
+``"--sysroot=/path/to/aarch64/sysroot --gcc-toolchain=/path/to/gcc-toolchain"``
+
+Armv6-m, Armv7-m and Armv7E-M targets
+-------------------------------------
+If you wish to build, but not test compiler-rt for Armv6-M, Armv7-M or Armv7E-M
+then the easiest way is to use the BaremetalARM.cmake recipe in
+clang/cmake/caches.
+
+You will need a bare metal sysroot such as that provided by the GNU ARM
+Embedded toolchain.
+
+The libraries can be built with the cmake options:
+
+* ``-DBAREMETAL_ARMV6M_SYSROOT=/path/to/bare/metal/sysroot``
+* ``-DBAREMETAL_ARMV7M_SYSROOT=/path/to/bare/metal/sysroot``
+* ``-DBAREMETAL_ARMV7EM_SYSROOT=/path/to/bare/metal/sysroot``
+* ``-C /path/to/llvm/source/tools/clang/cmake/caches/BaremetalARM.cmake``
+
+**Note** that for the recipe to work the compiler-rt source must be checked out
+into the directory ``llvm/runtimes`` and not ``llvm/projects``.
+
+Building and testing the libraries using a method similar to Armv7-A is
+possible but more difficult. The main problems are:
+
+* There isn't a ``qemu-arm`` user-mode emulator for bare-metal systems. ``qemu-system-arm`` can be used but this is significantly more difficult to set up.
+* The targets used to compile compiler-rt have the suffix ``-none-eabi``. This uses the BareMetal driver in clang and by default won't find the libraries needed to pass the cmake compiler check.
+
+As the Armv6-M, Armv7-M and Armv7E-M builds of compiler-rt only use instructions
+that are supported on Armv7-A we can still get most of the value of running the
+tests using the same ``qemu-arm`` that we used for Armv7-A by building and
+running the test cases for Armv7-A but using the builtins compiled for
+Armv6-M, Armv7-M or Armv7E-M. This will not catch instructions that are
+supported on Armv7-A but not Armv6-M, Armv7-M and Armv7E-M.
+
+To get the cmake compile test to pass, the libraries needed to successfully
+link the test application will need to be manually added to ``CMAKE_C_FLAGS``.
+Alternatively, if you are using version 3.6 or above of cmake, you can use
+``CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY`` to skip the link step.
+
+* ``-DCMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY``
+* ``-DCOMPILER_RT_OS_DIR="baremetal"``
+* ``-DCOMPILER_RT_BUILD_BUILTINS=ON``
+* ``-DCOMPILER_RT_BUILD_SANITIZERS=OFF``
+* ``-DCOMPILER_RT_BUILD_XRAY=OFF``
+* ``-DCOMPILER_RT_BUILD_LIBFUZZER=OFF``
+* ``-DCOMPILER_RT_BUILD_PROFILE=OFF``
+* ``-DCMAKE_C_COMPILER=${host_install_dir}/bin/clang``
+* ``-DCMAKE_C_COMPILER_TARGET="your *-none-eabi target"``
+* ``-DCMAKE_AR=/path/to/llvm-ar``
+* ``-DCMAKE_NM=/path/to/llvm-nm``
+* ``-DCMAKE_RANLIB=/path/to/llvm-ranlib``
+* ``-DCOMPILER_RT_BAREMETAL_BUILD=ON``
+* ``-DCOMPILER_RT_DEFAULT_TARGET_ONLY=ON``
+* ``-DLLVM_CONFIG_PATH=/path/to/llvm-config``
+* ``-DCMAKE_C_FLAGS="build-c-flags"``
+* ``-DCMAKE_ASM_FLAGS="${arm_cflags}"``
+* ``-DCOMPILER_RT_EMULATOR="qemu-arm -L /path/to/armv7-A/sysroot"``
+* ``-DCOMPILER_RT_INCLUDE_TESTS=ON``
+* ``-DCOMPILER_RT_TEST_COMPILER="/path/to/clang"``
+* ``-DCOMPILER_RT_TEST_COMPILER_CFLAGS="test-c-flags"``
+
+The Armv6-M builtins will use the soft-float ABI. When compiling the tests for
+Armv7-A we must include ``"-mthumb -mfloat-abi=soft -mfpu=none"`` in the
+test-c-flags. We must use an Armv7-A soft-float ABI sysroot for ``qemu-arm``.
+
+Unfortunately, at the time of writing, the Armv7-M and Armv7E-M builds of
+compiler-rt will always include assembly files containing floating point
+instructions. This means that building for a CPU without a floating point unit
+requires something like removing the ``arm_Thumb1_VFPv2_SOURCES`` from the
+``arm_Thumb1_SOURCES`` in ``builtins/CMakeLists.txt``. The float-abi of the
+compiler-rt library must be matched by the float-abi of the Armv7-A sysroot
+used by ``qemu-arm``.
+
+Depending on the linker used for the test cases you may encounter BuildAttribute
+mismatches between the M-profile objects from compiler-rt and the A-profile
+objects from the tests. The lld linker does not check the BuildAttributes so it
+can be used to link the tests by adding ``-fuse-ld=lld`` to the
+``COMPILER_RT_TEST_COMPILER_CFLAGS``.
diff --git a/docs/HowToReleaseLLVM.rst b/docs/HowToReleaseLLVM.rst
index 5ea6d49cf480..ec3362e97b6f 100644
--- a/docs/HowToReleaseLLVM.rst
+++ b/docs/HowToReleaseLLVM.rst
@@ -256,6 +256,28 @@ If a bug can't be reproduced, or stops being a blocker, it should be removed
from the Meta and its priority decreased to *normal*. Debugging can continue,
but on trunk.
+Merge Requests
+--------------
+
+You can use any of the following methods to request that a revision from trunk
+be merged into a release branch:
+
+#. Use the ``utils/release/merge-request.sh`` script which will automatically
+ file a bug_ requesting that the patch be merged. e.g. To request revision
+ 12345 be merged into the branch for the 5.0.1 release:
+ ``llvm.src/utils/release/merge-request.sh -stable-version 5.0 -r 12345 -user bugzilla@example.com``
+
+#. Manually file a bug_ with the subject: "Merge r12345 into the X.Y branch",
+ enter the commit(s) that you want merged in the "Fixed by Commit(s)" and mark
+ it as a blocker of the current release bug. Release bugs are given aliases
+ in the form of release-x.y.z, so to mark a bug as a blocker for the 5.0.1
+ release, just enter release-5.0.1 in the "Blocks" field.
+
+#. Reply to the commit email on llvm-commits for the revision to merge and cc
+ the release manager.
+
+.. _bug: https://bugs.llvm.org/
+
Release Patch Rules
-------------------
diff --git a/docs/LangRef.rst b/docs/LangRef.rst
index 5c65864e901e..589786255af7 100644
--- a/docs/LangRef.rst
+++ b/docs/LangRef.rst
@@ -527,6 +527,24 @@ the alias is accessed. It will not have any effect in the aliasee.
For platforms without linker support of ELF TLS model, the -femulated-tls
flag can be used to generate GCC compatible emulated TLS code.
+.. _runtime_preemption_model:
+
+Runtime Preemption Specifiers
+-----------------------------
+
+Global variables, functions and aliases may have an optional runtime preemption
+specifier. If a preemption specifier isn't given explicitly, then a
+symbol is assumed to be ``dso_preemptable``.
+
+``dso_preemptable``
+ Indicates that the function or variable may be replaced by a symbol from
+ outside the linkage unit at runtime.
+
+``dso_local``
+ The compiler may assume that a function or variable marked as ``dso_local``
+ will resolve to a symbol within the same linkage unit. Direct access will
+ be generated even if the definition is not within this compilation unit.
+
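+For example, a variable and a function that the compiler may assume resolve
+within the linkage unit can be written as:
+
+.. code-block:: llvm
+
+    @local_var = dso_local global i32 0
+
+    define dso_local i32 @local_func() {
+      ret i32 0
+    }
+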
.. _namedtypes:
Structure Types
@@ -579,7 +597,9 @@ Global variables in other translation units can also be declared, in which
case they don't have an initializer.
Either global variable definitions or declarations may have an explicit section
-to be placed in and may have an optional explicit alignment specified.
+to be placed in and may have an optional explicit alignment specified. If there
+is a mismatch between the explicit or inferred section information for the
+variable declaration and its definition the resulting behavior is undefined.
A variable may be defined as a global ``constant``, which indicates that
the contents of the variable will **never** be modified (enabling better
@@ -622,6 +642,12 @@ target supports it, it will emit globals to the section specified.
Additionally, the global can placed in a comdat if the target has the necessary
support.
+External declarations may have an explicit section specified. Section
+information is retained in LLVM IR for targets that make use of this
+information. Attaching section information to an external declaration is an
+assertion that its definition is located in the specified section. If the
+definition is located in a different section, the behavior is undefined.
+
By default, global initializers are optimized by assuming that global
variables defined within the module are not modified from their
initial values before the start of the global initializer. This is
@@ -642,6 +668,7 @@ iterate over them as an array, alignment padding would break this
iteration. The maximum alignment is ``1 << 29``.
Globals can also have a :ref:`DLL storage class <dllstorageclass>`,
+an optional :ref:`runtime preemption specifier <runtime_preemption_model>`,
an optional :ref:`global attributes <glattrs>` and
an optional list of attached :ref:`metadata <metadata>`.
@@ -650,7 +677,8 @@ Variables and aliases can have a
Syntax::
- @<GlobalVarName> = [Linkage] [Visibility] [DLLStorageClass] [ThreadLocal]
+ @<GlobalVarName> = [Linkage] [PreemptionSpecifier] [Visibility]
+ [DLLStorageClass] [ThreadLocal]
[(unnamed_addr|local_unnamed_addr)] [AddrSpace]
[ExternallyInitialized]
<global | constant> <Type> [<InitializerConstant>]
@@ -683,7 +711,8 @@ Functions
---------
LLVM function definitions consist of the "``define``" keyword, an
-optional :ref:`linkage type <linkage>`, an optional :ref:`visibility
+optional :ref:`linkage type <linkage>`, an optional :ref:`runtime preemption
+specifier <runtime_preemption_model>`, an optional :ref:`visibility
style <visibility>`, an optional :ref:`DLL storage class <dllstorageclass>`,
an optional :ref:`calling convention <callingconv>`,
an optional ``unnamed_addr`` attribute, a return type, an optional
@@ -742,7 +771,7 @@ not be significant within the module.
Syntax::
- define [linkage] [visibility] [DLLStorageClass]
+ define [linkage] [PreemptionSpecifier] [visibility] [DLLStorageClass]
[cconv] [ret attrs]
<ResultType> @<FunctionName> ([argument list])
[(unnamed_addr|local_unnamed_addr)] [fn Attrs] [section "name"]
@@ -769,12 +798,13 @@ Aliases have a name and an aliasee that is either a global value or a
constant expression.
Aliases may have an optional :ref:`linkage type <linkage>`, an optional
+:ref:`runtime preemption specifier <runtime_preemption_model>`, an optional
:ref:`visibility style <visibility>`, an optional :ref:`DLL storage class
<dllstorageclass>` and an optional :ref:`tls model <tls_model>`.
Syntax::
- @<Name> = [Linkage] [Visibility] [DLLStorageClass] [ThreadLocal] [(unnamed_addr|local_unnamed_addr)] alias <AliaseeTy>, <AliaseeTy>* @<Aliasee>
+ @<Name> = [Linkage] [PreemptionSpecifier] [Visibility] [DLLStorageClass] [ThreadLocal] [(unnamed_addr|local_unnamed_addr)] alias <AliaseeTy>, <AliaseeTy>* @<Aliasee>
The linkage must be one of ``private``, ``internal``, ``linkonce``, ``weak``,
``linkonce_odr``, ``weak_odr``, ``external``. Note that some system linkers
@@ -1381,6 +1411,9 @@ example:
``naked``
This attribute disables prologue / epilogue emission for the
function. This can have very system-specific consequences.
+``no-jump-tables``
+ When this attribute is set to true, the jump tables and lookup tables that
+ can be generated from a switch case lowering are disabled.
``nobuiltin``
This indicates that the callee function at a call site is not recognized as
a built-in function. LLVM will retain the original call and not replace it
@@ -1564,6 +1597,10 @@ example:
``sanitize_thread``
This attribute indicates that ThreadSanitizer checks
(dynamic thread safety analysis) are enabled for this function.
+``sanitize_hwaddress``
+ This attribute indicates that HWAddressSanitizer checks
+ (dynamic address safety analysis based on tagged pointers) are enabled for
+ this function.
``speculatable``
This function attribute indicates that the function does not have any
effects besides calculating its result and does not have undefined behavior.
@@ -1641,6 +1678,12 @@ example:
If a function that has an ``sspstrong`` attribute is inlined into a
function that doesn't have an ``sspstrong`` attribute, then the
resulting function will have an ``sspstrong`` attribute.
+``strictfp``
+ This attribute indicates that the function was called from a scope that
+ requires strict floating point semantics. LLVM will not attempt any
+ optimizations that require assumptions about the floating point rounding
+ mode or that might alter the state of floating point status flags that
+ might otherwise be set or cleared by calling this function.
``"thunk"``
This attribute indicates that the function will delegate to some other
function with a tail call. The prototype of a thunk should not be used for
@@ -2016,8 +2059,11 @@ to the following rules:
A pointer value is *based* on another pointer value according to the
following rules:
-- A pointer value formed from a ``getelementptr`` operation is *based*
- on the second value operand of the ``getelementptr``.
+- A pointer value formed from a scalar ``getelementptr`` operation is *based* on
+ the pointer-typed operand of the ``getelementptr``.
+- The pointer in lane *l* of the result of a vector ``getelementptr`` operation
+ is *based* on the pointer in lane *l* of the vector-of-pointers-typed operand
+ of the ``getelementptr``.
- The result value of a ``bitcast`` is *based* on the operand of the
``bitcast``.
- A pointer value formed by an ``inttoptr`` is *based* on all pointer
@@ -2230,11 +2276,11 @@ seq\_cst total orderings of other operations that are not marked
Fast-Math Flags
---------------
-LLVM IR floating-point binary ops (:ref:`fadd <i_fadd>`,
+LLVM IR floating-point operations (:ref:`fadd <i_fadd>`,
:ref:`fsub <i_fsub>`, :ref:`fmul <i_fmul>`, :ref:`fdiv <i_fdiv>`,
:ref:`frem <i_frem>`, :ref:`fcmp <i_fcmp>`) and :ref:`call <i_call>`
-instructions have the following flags that can be set to enable
-otherwise unsafe floating point transformations.
+may use the following flags to enable otherwise unsafe
+floating-point transformations.
``nnan``
No NaNs - Allow optimizations to assume the arguments and result are not
@@ -2258,10 +2304,17 @@ otherwise unsafe floating point transformations.
Allow floating-point contraction (e.g. fusing a multiply followed by an
addition into a fused multiply-and-add).
+``afn``
+ Approximate functions - Allow substitution of approximate calculations for
+ functions (sin, log, sqrt, etc). See floating-point intrinsic definitions
+ for places where this can apply to LLVM's intrinsic math functions.
+
+``reassoc``
+ Allow reassociation transformations for floating-point instructions.
+ This may dramatically change results in floating point.
+
``fast``
- Fast - Allow algebraically equivalent transformations that may
- dramatically change results in floating point (e.g. reassociate). This
- flag implies all the others.
+ This flag implies all of the others.
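+
+For example, an individual flag or the ``fast`` umbrella flag can be written
+directly on a floating-point instruction:
+
+.. code-block:: llvm
+
+    %0 = fadd reassoc float %a, %b   ; reassociation allowed on this add
+    %1 = fmul fast float %c, %d      ; all fast-math flags are set
+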
.. _uselistorder:
@@ -3142,14 +3195,11 @@ that does not have side effects (e.g. load and call are not supported).
The following is the syntax for constant expressions:
``trunc (CST to TYPE)``
- Truncate a constant to another type. The bit size of CST must be
- larger than the bit size of TYPE. Both types must be integers.
+ Perform the :ref:`trunc operation <i_trunc>` on constants.
``zext (CST to TYPE)``
- Zero extend a constant to another type. The bit size of CST must be
- smaller than the bit size of TYPE. Both types must be integers.
+ Perform the :ref:`zext operation <i_zext>` on constants.
``sext (CST to TYPE)``
- Sign extend a constant to another type. The bit size of CST must be
- smaller than the bit size of TYPE. Both types must be integers.
+ Perform the :ref:`sext operation <i_sext>` on constants.
``fptrunc (CST to TYPE)``
Truncate a floating point constant to another floating point type.
The size of CST must be larger than the size of TYPE. Both types
@@ -3183,19 +3233,14 @@ The following is the syntax for constant expressions:
be scalars, or vectors of the same number of elements. If the value
won't fit in the floating point type, the results are undefined.
``ptrtoint (CST to TYPE)``
- Convert a pointer typed constant to the corresponding integer
- constant. ``TYPE`` must be an integer type. ``CST`` must be of
- pointer type. The ``CST`` value is zero extended, truncated, or
- unchanged to make it fit in ``TYPE``.
+ Perform the :ref:`ptrtoint operation <i_ptrtoint>` on constants.
``inttoptr (CST to TYPE)``
- Convert an integer constant to a pointer constant. TYPE must be a
- pointer type. CST must be of integer type. The CST value is zero
- extended, truncated, or unchanged to make it fit in a pointer size.
+ Perform the :ref:`inttoptr operation <i_inttoptr>` on constants.
This one is *really* dangerous!
``bitcast (CST to TYPE)``
- Convert a constant, CST, to another TYPE. The constraints of the
- operands are the same as those for the :ref:`bitcast
- instruction <i_bitcast>`.
+ Convert a constant, CST, to another TYPE.
+ The constraints of the operands are the same as those for the
+ :ref:`bitcast instruction <i_bitcast>`.
``addrspacecast (CST to TYPE)``
Convert a constant pointer or constant vector of pointer, CST, to another
TYPE in a different address space. The constraints of the operands are the
@@ -3208,9 +3253,9 @@ The following is the syntax for constant expressions:
``select (COND, VAL1, VAL2)``
Perform the :ref:`select operation <i_select>` on constants.
``icmp COND (VAL1, VAL2)``
- Performs the :ref:`icmp operation <i_icmp>` on constants.
+ Perform the :ref:`icmp operation <i_icmp>` on constants.
``fcmp COND (VAL1, VAL2)``
- Performs the :ref:`fcmp operation <i_fcmp>` on constants.
+ Perform the :ref:`fcmp operation <i_fcmp>` on constants.
``extractelement (VAL, IDX)``
Perform the :ref:`extractelement operation <i_extractelement>` on
constants.
@@ -4013,12 +4058,12 @@ example:
!foo = !{!4, !3}
-Metadata can be used as function arguments. Here ``llvm.dbg.value``
-function is using two metadata arguments:
+Metadata can be used as function arguments. Here the ``llvm.dbg.value``
+intrinsic is using three metadata arguments:
.. code-block:: llvm
- call void @llvm.dbg.value(metadata !24, i64 0, metadata !25)
+ call void @llvm.dbg.value(metadata !24, metadata !25, metadata !26)
Metadata can be attached to an instruction. Here metadata ``!21`` is attached
to the ``add`` instruction using the ``!dbg`` identifier:
@@ -4465,7 +4510,7 @@ source variable. DIExpressions also follow this model: A DIExpression that
doesn't have a trailing ``DW_OP_stack_value`` will describe an *address* when
combined with a concrete location.
-.. code-block:: llvm
+.. code-block:: text
!0 = !DIExpression(DW_OP_deref)
!1 = !DIExpression(DW_OP_plus_uconst, 3)
@@ -4605,13 +4650,13 @@ As a concrete example, the type descriptor graph for the following program
int i; // offset 0
float f; // offset 4
};
-
+
struct Outer {
float f; // offset 0
double d; // offset 4
struct Inner inner_a; // offset 12
};
-
+
void f(struct Outer* outer, struct Inner* inner, float* f, int* i, char* c) {
outer->f = 0; // tag0: (OuterStructTy, FloatScalarTy, 0)
outer->inner_a.i = 0; // tag1: (OuterStructTy, IntScalarTy, 12)
@@ -4858,6 +4903,23 @@ Example (assuming 64-bit pointers):
!0 = !{ i64 0, i64 256 }
!1 = !{ i64 -1, i64 -1 }
+'``callees``' Metadata
+^^^^^^^^^^^^^^^^^^^^^^
+
+``callees`` metadata may be attached to indirect call sites. If ``callees``
+metadata is attached to a call site, and any callee is not among the set of
+functions provided by the metadata, the behavior is undefined. The intent of
+this metadata is to facilitate optimizations such as indirect-call promotion.
+For example, in the code below, the call instruction may only target the
+``add`` or ``sub`` functions:
+
+.. code-block:: llvm
+
+ %result = call i64 %binop(i64 %x, i64 %y), !callees !0
+
+ ...
+ !0 = !{i64 (i64, i64)* @add, i64 (i64, i64)* @sub}
+
'``unpredictable``' Metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -5143,14 +5205,37 @@ the loop identifier metadata node directly:
!1 = !{!1} ; an identifier for the inner loop
!2 = !{!2} ; an identifier for the outer loop
+'``irr_loop``' Metadata
+^^^^^^^^^^^^^^^^^^^^^^^
+
+``irr_loop`` metadata may be attached to the terminator instruction of a basic
+block that's an irreducible loop header (note that an irreducible loop has more
+than one header basic block.) If ``irr_loop`` metadata is attached to the
+terminator instruction of a basic block that is not really an irreducible loop
+header, the behavior is undefined. The intent of this metadata is to improve the
+accuracy of the block frequency propagation. For example, in the code below, the
+block ``header0`` may have a loop header weight (relative to the other headers of
+the irreducible loop) of 100:
+
+.. code-block:: llvm
+
+ header0:
+ ...
+ br i1 %cmp, label %t1, label %t2, !irr_loop !0
+
+ ...
+ !0 = !{"loop_header_weight", i64 100}
+
+Irreducible loop header weights are typically based on profile data.
+
'``invariant.group``' Metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``invariant.group`` metadata may be attached to ``load``/``store`` instructions.
-The existence of the ``invariant.group`` metadata on the instruction tells
-the optimizer that every ``load`` and ``store`` to the same pointer operand
-within the same invariant group can be assumed to load or store the same
-value (but see the ``llvm.invariant.group.barrier`` intrinsic which affects
+The existence of the ``invariant.group`` metadata on the instruction tells
+the optimizer that every ``load`` and ``store`` to the same pointer operand
+within the same invariant group can be assumed to load or store the same
+value (but see the ``llvm.invariant.group.barrier`` intrinsic which affects
when two pointers are considered the same). Pointers returned by bitcast or
getelementptr with only zero indices are considered the same.
@@ -5163,26 +5248,26 @@ Examples:
%ptr = alloca i8
store i8 42, i8* %ptr, !invariant.group !0
call void @foo(i8* %ptr)
-
+
%a = load i8, i8* %ptr, !invariant.group !0 ; Can assume that value under %ptr didn't change
call void @foo(i8* %ptr)
%b = load i8, i8* %ptr, !invariant.group !1 ; Can't assume anything, because group changed
-
- %newPtr = call i8* @getPointer(i8* %ptr)
+
+ %newPtr = call i8* @getPointer(i8* %ptr)
%c = load i8, i8* %newPtr, !invariant.group !0 ; Can't assume anything, because we only have information about %ptr
-
+
%unknownValue = load i8, i8* @unknownPtr
store i8 %unknownValue, i8* %ptr, !invariant.group !0 ; Can assume that %unknownValue == 42
-
+
call void @foo(i8* %ptr)
%newPtr2 = call i8* @llvm.invariant.group.barrier(i8* %ptr)
%d = load i8, i8* %newPtr2, !invariant.group !0 ; Can't step through invariant.group.barrier to get value of %ptr
-
+
...
declare void @foo(i8*)
declare i8* @getPointer(i8*)
declare i8* @llvm.invariant.group.barrier(i8*)
-
+
!0 = !{!"magic ptr"}
!1 = !{!"other ptr"}
@@ -5191,7 +5276,7 @@ another based on aliasing information. This is because invariant.group is tied
to the SSA value of the pointer operand.
.. code-block:: llvm
-
+
%v = load i8, i8* %x, !invariant.group !0
; if %x mustalias %y then we can replace the above instruction with
%v = load i8, i8* %y
@@ -5221,7 +5306,7 @@ It does not have any effect on non-ELF targets.
Example:
-.. code-block:: llvm
+.. code-block:: text
$a = comdat any
@a = global i32 1, comdat $a
@@ -6649,9 +6734,9 @@ remainder.
Note that unsigned integer remainder and signed integer remainder are
distinct operations; for signed integer remainder, use '``srem``'.
-
+
Taking the remainder of a division by zero is undefined behavior.
-For vectors, if any element of the divisor is zero, the operation has
+For vectors, if any element of the divisor is zero, the operation has
undefined behavior.
Example:
@@ -6703,7 +6788,7 @@ Note that signed integer remainder and unsigned integer remainder are
distinct operations; for unsigned integer remainder, use '``urem``'.
Taking the remainder of a division by zero is undefined behavior.
-For vectors, if any element of the divisor is zero, the operation has
+For vectors, if any element of the divisor is zero, the operation has
undefined behavior.
Overflow also leads to undefined behavior; this is a rare case, but can
occur, for example, by taking the remainder of a 32-bit division of
@@ -6746,10 +6831,12 @@ Both arguments must have identical types.
Semantics:
""""""""""
-This instruction returns the *remainder* of a division. The remainder
-has the same sign as the dividend. This instruction can also take any
-number of :ref:`fast-math flags <fastmath>`, which are optimization hints
-to enable otherwise unsafe floating point optimizations:
+Return the same value as a libm '``fmod``' function but without trapping or
+setting ``errno``.
+
+The remainder has the same sign as the dividend. This instruction can also
+take any number of :ref:`fast-math flags <fastmath>`, which are optimization
+hints to enable otherwise unsafe floating-point optimizations:
Example:
""""""""
@@ -7576,7 +7663,7 @@ be reused in the cache. The code generator may select special
instructions to save cache bandwidth, such as the ``MOVNT`` instruction on
x86.
-The optional ``!invariant.group`` metadata must reference a
+The optional ``!invariant.group`` metadata must reference a
single metadata name ``<index>``. See ``invariant.group`` metadata.
Semantics:
@@ -7650,7 +7737,7 @@ A ``fence`` instruction can also take an optional
Example:
""""""""
-.. code-block:: llvm
+.. code-block:: text
fence acquire ; yields void
fence syncscope("singlethread") seq_cst ; yields void
@@ -7682,10 +7769,10 @@ There are three arguments to the '``cmpxchg``' instruction: an address
to operate on, a value to compare to the value currently be at that
address, and a new value to place at that address if the compared values
are equal. The type of '<cmp>' must be an integer or pointer type whose
-bit width is a power of two greater than or equal to eight and less
+bit width is a power of two greater than or equal to eight and less
than or equal to a target-specific size limit. '<cmp>' and '<new>' must
-have the same type, and the type of '<pointer>' must be a pointer to
-that type. If the ``cmpxchg`` is marked as ``volatile``, then the
+have the same type, and the type of '<pointer>' must be a pointer to
+that type. If the ``cmpxchg`` is marked as ``volatile``, then the
optimizer is not allowed to modify the number or order of execution of
this ``cmpxchg`` with other :ref:`volatile operations <volatile>`.
@@ -7705,9 +7792,9 @@ Semantics:
""""""""""
The contents of memory at the location specified by the '``<pointer>``' operand
-is read and compared to '``<cmp>``'; if the read value is the equal, the
-'``<new>``' is written. The original value at the location is returned, together
-with a flag indicating success (true) or failure (false).
+is read and compared to '``<cmp>``'; if the values are equal, '``<new>``' is
+written to the location. The original value at the location is returned,
+together with a flag indicating success (true) or failure (false).
If the cmpxchg operation is marked as ``weak`` then a spurious failure is
permitted: the operation may not write ``<new>`` even if the comparison
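The value/flag pair that ``cmpxchg`` yields maps directly onto C++'s ``compare_exchange`` family; as a rough illustration (the wrapper function name here is hypothetical, not part of any API):

```cpp
#include <atomic>
#include <cassert>

// C++ analogue of the IR's cmpxchg instruction: atomically compare the value
// at `loc` with `expected`; if they are equal, store `desired`. The original
// value is reported back through `expected`, and the return value is the
// success flag -- mirroring the { value, i1 } pair that cmpxchg yields.
bool cmpxchg_demo(std::atomic<int> &loc, int &expected, int desired) {
  // compare_exchange_weak would correspond to the `weak` marker, which
  // permits spurious failure and is normally used inside a retry loop.
  return loc.compare_exchange_strong(expected, desired);
}
```

On failure, ``expected`` is overwritten with the value actually read, just as the IR instruction returns the original value at the location.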
@@ -8039,6 +8126,8 @@ The instructions in this category are the conversion instructions
(casting) which all take a single operand and a type. They perform
various bit conversions on the operand.
+.. _i_trunc:
+
'``trunc .. to``' Instruction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -8081,6 +8170,8 @@ Example:
%Z = trunc i32 122 to i1 ; yields i1:false
%W = trunc <2 x i16> <i16 8, i16 7> to <2 x i8> ; yields <i8 8, i8 7>
+.. _i_zext:
+
'``zext .. to``' Instruction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -8121,6 +8212,8 @@ Example:
%Y = zext i1 true to i32 ; yields i32:1
%Z = zext <2 x i16> <i16 8, i16 7> to <2 x i32> ; yields <i32 8, i32 7>
+.. _i_sext:
+
'``sext .. to``' Instruction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -8973,7 +9066,7 @@ This instruction requires several arguments:
``tail`` or ``musttail`` markers to the call. It is used to prevent tail
call optimization from being performed on the call.
-#. The optional ``fast-math flags`` marker indicates that the call has one or more
+#. The optional ``fast-math flags`` marker indicates that the call has one or more
:ref:`fast-math flags <fastmath>`, which are optimization hints to enable
otherwise unsafe floating-point optimizations. Fast-math flags are only valid
for calls that return a floating-point scalar or vector type.
@@ -10138,7 +10231,7 @@ argument to specify the step of the increment.
Arguments:
""""""""""
The first four arguments are the same as '``llvm.instrprof_increment``'
-instrinsic.
+intrinsic.
The last argument specifies the value of the increment of the counter variable.
@@ -10403,7 +10496,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.sqrt`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10417,20 +10510,22 @@ all types however.
Overview:
"""""""""
-The '``llvm.sqrt``' intrinsics return the square root of the specified value,
-returning the same value as the libm '``sqrt``' functions would, but without
-trapping or setting ``errno``.
+The '``llvm.sqrt``' intrinsics return the square root of the specified value.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the square root of the operand if it is a nonnegative
-floating point number.
+Return the same value as a corresponding libm '``sqrt``' function but without
+trapping or setting ``errno``. For types specified by IEEE-754, the result
+matches a conforming libm implementation.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.powi.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10477,7 +10572,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.sin`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10496,14 +10591,16 @@ The '``llvm.sin.*``' intrinsics return the sine of the operand.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the sine of the specified operand, returning the
-same values as the libm ``sin`` functions would, and handles error
-conditions in the same way.
+Return the same value as a corresponding libm '``sin``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.cos.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10512,7 +10609,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.cos`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10531,14 +10628,16 @@ The '``llvm.cos.*``' intrinsics return the cosine of the operand.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the cosine of the specified operand, returning the
-same values as the libm ``cos`` functions would, and handles error
-conditions in the same way.
+Return the same value as a corresponding libm '``cos``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.pow.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10547,7 +10646,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.pow`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10567,15 +10666,16 @@ specified (positive or negative) power.
Arguments:
""""""""""
-The second argument is a floating point power, and the first is a value
-to raise to that power.
+The arguments and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the first value raised to the second power,
-returning the same values as the libm ``pow`` functions would, and
-handles error conditions in the same way.
+Return the same value as a corresponding libm '``pow``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.exp.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10584,7 +10684,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.exp`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10604,13 +10704,16 @@ value.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the same values as the libm ``exp`` functions
-would, and handles error conditions in the same way.
+Return the same value as a corresponding libm '``exp``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.exp2.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10619,7 +10722,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.exp2`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10639,13 +10742,16 @@ specified value.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the same values as the libm ``exp2`` functions
-would, and handles error conditions in the same way.
+Return the same value as a corresponding libm '``exp2``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.log.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10654,7 +10760,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.log`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10674,13 +10780,16 @@ value.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the same values as the libm ``log`` functions
-would, and handles error conditions in the same way.
+Return the same value as a corresponding libm '``log``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.log10.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10689,7 +10798,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.log10`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10709,13 +10818,16 @@ specified value.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the same values as the libm ``log10`` functions
-would, and handles error conditions in the same way.
+Return the same value as a corresponding libm '``log10``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.log2.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10724,7 +10836,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.log2`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10744,13 +10856,16 @@ value.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same type.
+The argument and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the same values as the libm ``log2`` functions
-would, and handles error conditions in the same way.
+Return the same value as a corresponding libm '``log2``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.fma.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -10759,7 +10874,7 @@ Syntax:
"""""""
This is an overloaded intrinsic. You can use ``llvm.fma`` on any
-floating point or vector of floating point type. Not all targets support
+floating-point or vector of floating-point type. Not all targets support
all types however.
::
@@ -10773,20 +10888,21 @@ all types however.
Overview:
"""""""""
-The '``llvm.fma.*``' intrinsics perform the fused multiply-add
-operation.
+The '``llvm.fma.*``' intrinsics perform the fused multiply-add operation.
Arguments:
""""""""""
-The argument and return value are floating point numbers of the same
-type.
+The arguments and return value are floating-point numbers of the same type.
Semantics:
""""""""""
-This function returns the same values as the libm ``fma`` functions
-would, and does not set errno.
+Return the same value as a corresponding libm '``fma``' function but without
+trapping or setting ``errno``.
+
+When specified with the fast-math-flag 'afn', the result may be approximated
+using a less accurate calculation.
'``llvm.fabs.*``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -12242,7 +12358,7 @@ Debugger Intrinsics
The LLVM debugger intrinsics (which all start with ``llvm.dbg.``
prefix), are described in the `LLVM Source Level
-Debugging <SourceLevelDebugging.html#format_common_intrinsics>`_
+Debugging <SourceLevelDebugging.html#format-common-intrinsics>`_
document.
Exception Handling Intrinsics
@@ -12250,7 +12366,7 @@ Exception Handling Intrinsics
The LLVM exception handling intrinsics (which all start with
``llvm.eh.`` prefix), are described in the `LLVM Exception
-Handling <ExceptionHandling.html#format_common_intrinsics>`_ document.
+Handling <ExceptionHandling.html#format-common-intrinsics>`_ document.
.. _int_trampoline:
@@ -12707,15 +12823,18 @@ This intrinsic indicates that the memory is mutable again.
Syntax:
"""""""
+This is an overloaded intrinsic. The memory object can belong to any address
+space. The returned pointer must belong to the same address space as the
+argument.
::
- declare i8* @llvm.invariant.group.barrier(i8* <ptr>)
+ declare i8* @llvm.invariant.group.barrier.p0i8(i8* <ptr>)
Overview:
"""""""""
-The '``llvm.invariant.group.barrier``' intrinsic can be used when an invariant
+The '``llvm.invariant.group.barrier``' intrinsic can be used when an invariant
established by invariant.group metadata no longer holds, to obtain a new pointer
value that does not carry the invariant information.
@@ -12729,7 +12848,7 @@ the pointer to the memory for which the ``invariant.group`` no longer holds.
Semantics:
""""""""""
-Returns another pointer that aliases its argument but which is considered different
+Returns another pointer that aliases its argument but which is considered different
for the purposes of ``load``/``store`` ``invariant.group`` metadata.
Constrained Floating Point Intrinsics
@@ -12807,7 +12926,7 @@ strictly preserve the floating point exception semantics of the original code.
Any FP exception that would have been raised by the original code must be raised
by the transformed code, and the transformed code must not raise any FP
exceptions that would not have been raised by the original code. This is the
-exception behavior argument that will be used if the code being compiled reads
+exception behavior argument that will be used if the code being compiled reads
the FP exception status flags, but this mode can also be used with code that
unmasks FP exceptions.
@@ -12825,7 +12944,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.fadd(<type> <op1>, <type> <op2>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -12862,7 +12981,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.fsub(<type> <op1>, <type> <op2>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -12899,7 +13018,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.fmul(<type> <op1>, <type> <op2>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -12936,7 +13055,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.fdiv(<type> <op1>, <type> <op2>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -12973,7 +13092,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.frem(<type> <op1>, <type> <op2>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13002,8 +13121,43 @@ Semantics:
The value produced is the floating point remainder from the division of the two
value operands and has the same type as the operands. The remainder has the
-same sign as the dividend.
+same sign as the dividend.
+
+'``llvm.experimental.constrained.fma``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+::
+
+ declare <type>
+ @llvm.experimental.constrained.fma(<type> <op1>, <type> <op2>, <type> <op3>,
+ metadata <rounding mode>,
+ metadata <exception behavior>)
+
+Overview:
+"""""""""
+
+The '``llvm.experimental.constrained.fma``' intrinsic returns the result of a
+fused multiply-add operation on its operands.
+
+Arguments:
+""""""""""
+The first three arguments to the '``llvm.experimental.constrained.fma``'
+intrinsic must be :ref:`floating-point <t_floating>` or :ref:`vector
+<t_vector>` of floating-point values. All arguments must have identical types.
+
+The fourth and fifth arguments specify the rounding mode and exception behavior
+as described above.
+
+Semantics:
+""""""""""
+
+The result produced is the product of the first two operands added to the
+third operand, computed with infinite precision and then rounded to the
+target precision.
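The single final rounding is what distinguishes a fused multiply-add from a separate multiply and add. A sketch using C++'s ``std::fma`` (the helper name is illustrative): with ``a = 1 + 2^-52``, the exact square ``a*a = 1 + 2^-51 + 2^-104`` loses its ``2^-104`` term when the multiply is rounded on its own, while the fused form preserves it.

```cpp
#include <cmath>
#include <cassert>

// std::fma performs the same single-rounding fused multiply-add that the
// constrained fma intrinsic describes: a*b + c is computed with infinite
// intermediate precision and rounded exactly once at the end.
double rounding_error_of_square(double a) {
  double p = a * a;          // product rounded to double precision
  return std::fma(a, a, -p); // exact a*a - p, i.e. the rounding error of p
}
```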
Constrained libm-equivalent Intrinsics
--------------------------------------
@@ -13027,7 +13181,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.sqrt(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13064,7 +13218,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.pow(<type> <op1>, <type> <op2>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13101,7 +13255,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.powi(<type> <op1>, i32 <op2>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13140,7 +13294,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.sin(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13176,7 +13330,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.cos(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13212,7 +13366,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.exp(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13247,7 +13401,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.exp2(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13283,7 +13437,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.log(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13319,7 +13473,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.log10(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13354,7 +13508,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.log2(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13389,7 +13543,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.rint(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13428,7 +13582,7 @@ Syntax:
::
- declare <type>
+ declare <type>
@llvm.experimental.constrained.nearbyint(<type> <op1>,
metadata <rounding mode>,
metadata <exception behavior>)
@@ -13574,6 +13728,27 @@ with arbitrary strings. This can be useful for special purpose
optimizations that want to look for these annotations. These have no
other defined use; they are ignored by code generation and optimization.
+'``llvm.codeview.annotation``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+This annotation emits a label at its program point and an associated
+``S_ANNOTATION`` codeview record with some additional string metadata. This is
+used to implement MSVC's ``__annotation`` intrinsic. It is marked
+``noduplicate``, so calls to this intrinsic prevent inlining and should be
+considered expensive.
+
+::
+
+ declare void @llvm.codeview.annotation(metadata)
+
+Arguments:
+""""""""""
+
+The argument should be an MDTuple containing any number of MDStrings.
+
'``llvm.trap``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -14101,6 +14276,36 @@ not overflow at link time under the medium code model if ``x`` is an
a constant initializer folded into a function body. This intrinsic can be
used to avoid the possibility of overflows when loading from such a constant.
+'``llvm.sideeffect``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+::
+
+ declare void @llvm.sideeffect() inaccessiblememonly nounwind
+
+Overview:
+"""""""""
+
+The ``llvm.sideeffect`` intrinsic doesn't perform any operation. Optimizers
+treat it as having side effects, so it can be inserted into a loop to
+indicate that the loop shouldn't be assumed to terminate (which could
+potentially lead to the loop being optimized away entirely), even if it's
+an infinite loop with no other side effects.
+
+Arguments:
+""""""""""
+
+None.
+
+Semantics:
+""""""""""
+
+This intrinsic actually does nothing, but optimizers must assume that it
+has externally observable side effects.
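At the C/C++ source level, a rough analogue of this intrinsic (not the intrinsic itself) is an empty ``volatile`` asm statement, which Clang and GCC likewise treat as having unknown side effects; a sketch with an illustrative helper:

```cpp
#include <cassert>

// Spin for n iterations; the empty volatile asm plays the same role as
// llvm.sideeffect: the optimizer must assume it has observable side effects,
// so the otherwise effect-free loop is neither assumed to terminate nor
// deleted outright.
long spin(long n) {
  long i = 0;
  while (i < n) {
    asm volatile("");  // compiler barrier that emits no instructions
    ++i;
  }
  return i;
}
```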
+
Stack Map Intrinsics
--------------------
@@ -14168,7 +14373,7 @@ The '``llvm.memcpy.element.unordered.atomic.*``' intrinsic copies ``len`` bytes
memory from the source location to the destination location. These locations are not
allowed to overlap. The memory copy is performed as a sequence of load/store operations
where each access is guaranteed to be a multiple of ``element_size`` bytes wide and
-aligned at an ``element_size`` boundary.
+aligned at an ``element_size`` boundary.
The order of the copy is unspecified. The same value may be read from the source
buffer many times, but only one write is issued to the destination buffer per
@@ -14243,7 +14448,7 @@ The '``llvm.memmove.element.unordered.atomic.*``' intrinsic copies ``len`` bytes
of memory from the source location to the destination location. These locations
are allowed to overlap. The memory copy is performed as a sequence of load/store
operations where each access is guaranteed to be a multiple of ``element_size``
-bytes wide and aligned at an ``element_size`` boundary.
+bytes wide and aligned at an ``element_size`` boundary.
The order of the copy is unspecified. The same value may be read from the source
buffer many times, but only one write is issued to the destination buffer per
@@ -14318,7 +14523,7 @@ Semantics:
The '``llvm.memset.element.unordered.atomic.*``' intrinsic sets the ``len`` bytes of
memory starting at the destination location to the given ``value``. The memory is
set with a sequence of store operations where each access is guaranteed to be a
-multiple of ``element_size`` bytes wide and aligned at an ``element_size`` boundary.
+multiple of ``element_size`` bytes wide and aligned at an ``element_size`` boundary.
The order of the assignment is unspecified. Only one write is issued to the
destination buffer per element. It is well defined to have concurrent reads and
diff --git a/docs/Lexicon.rst b/docs/Lexicon.rst
index ce7ed318fe4b..0021bf8e00b1 100644
--- a/docs/Lexicon.rst
+++ b/docs/Lexicon.rst
@@ -109,6 +109,11 @@ G
Garbage Collection. The practice of using reachability analysis instead of
explicit memory management to reclaim unused memory.
+**GEP**
+ ``GetElementPtr``. An LLVM IR instruction that is used to get the address
+ of a subelement of an aggregate data structure. It is documented in detail
+ `here <http://llvm.org/docs/GetElementPtr.html>`_.
+
**GVN**
Global Value Numbering. GVN is a pass that partitions values computed by a
function into congruence classes. Values ending up in the same congruence
diff --git a/docs/LibFuzzer.rst b/docs/LibFuzzer.rst
index 0f0b0e2e6fbd..7a105e5ed129 100644
--- a/docs/LibFuzzer.rst
+++ b/docs/LibFuzzer.rst
@@ -24,28 +24,9 @@ Versions
========
LibFuzzer is under active development so you will need the current
-(or at least a very recent) version of the Clang compiler.
+(or at least a very recent) version of the Clang compiler (see `building Clang from trunk`_).
-(If `building Clang from trunk`_ is too time-consuming or difficult, then
-the Clang binaries that the Chromium developers build are likely to be
-fairly recent:
-
-.. code-block:: console
-
- mkdir TMP_CLANG
- cd TMP_CLANG
- git clone https://chromium.googlesource.com/chromium/src/tools/clang
- cd ..
- TMP_CLANG/clang/scripts/update.py
-
-This installs the Clang binary as
-``./third_party/llvm-build/Release+Asserts/bin/clang``)
-
-The libFuzzer code resides in the LLVM repository, and requires a recent Clang
-compiler to build (and is used to `fuzz various parts of LLVM itself`_).
-However the fuzzer itself does not (and should not) depend on any part of LLVM
-infrastructure and can be used for other projects without requiring the rest
-of LLVM.
+Refer to https://releases.llvm.org/5.0.0/docs/LibFuzzer.html for documentation on the older version.
Getting Started
@@ -90,40 +71,33 @@ Some important things to remember about fuzz targets:
Fuzzer Usage
------------
-Very recent versions of Clang (> April 20 2017) include libFuzzer,
-and no installation is necessary.
-In order to fuzz your binary, use the `-fsanitize=fuzzer` flag during the compilation::
-
- clang -fsanitize=fuzzer,address mytarget.c
+Recent versions of Clang (starting from 6.0) include libFuzzer, and no extra installation is necessary.
-Otherwise, build the libFuzzer library as a static archive, without any sanitizer
-options. Note that the libFuzzer library contains the ``main()`` function:
+In order to build your fuzzer binary, use the `-fsanitize=fuzzer` flag during the
+compilation and linking. In most cases you may want to combine libFuzzer with
+AddressSanitizer_ (ASAN), UndefinedBehaviorSanitizer_ (UBSAN), or both::
-.. code-block:: console
+ clang -g -O1 -fsanitize=fuzzer mytarget.c # Builds the fuzz target w/o sanitizers
+ clang -g -O1 -fsanitize=fuzzer,address mytarget.c # Builds the fuzz target with ASAN
+ clang -g -O1 -fsanitize=fuzzer,signed-integer-overflow mytarget.c # Builds the fuzz target with a part of UBSAN
- svn co http://llvm.org/svn/llvm-project/llvm/trunk/lib/Fuzzer # or git clone https://chromium.googlesource.com/chromium/llvm-project/llvm/lib/Fuzzer
- ./Fuzzer/build.sh # Produces libFuzzer.a
+This will perform the necessary instrumentation and link in the libFuzzer library.
+Note that ``-fsanitize=fuzzer`` links in libFuzzer's ``main()`` symbol.
-Then build the fuzzing target function and the library under test using
-the SanitizerCoverage_ option, which instruments the code so that the fuzzer
-can retrieve code coverage information (to guide the fuzzing). Linking with
-the libFuzzer code then gives a fuzzer executable.
+When modifying ``CFLAGS`` of a large project that also compiles executables
+requiring their own ``main`` symbol, it may be desirable to request just the
+instrumentation without linking::
-You should also enable one or more of the *sanitizers*, which help to expose
-latent bugs by making incorrect behavior generate errors at runtime:
+ clang -fsanitize=fuzzer-no-link mytarget.c
- - AddressSanitizer_ (ASAN) detects memory access errors. Use `-fsanitize=address`.
- - UndefinedBehaviorSanitizer_ (UBSAN) detects the use of various features of C/C++ that are explicitly
- listed as resulting in undefined behavior. Use `-fsanitize=undefined -fno-sanitize-recover=undefined`
- or any individual UBSAN check, e.g. `-fsanitize=signed-integer-overflow -fno-sanitize-recover=undefined`.
- You may combine ASAN and UBSAN in one build.
- - MemorySanitizer_ (MSAN) detects uninitialized reads: code whose behavior relies on memory
- contents that have not been initialized to a specific value. Use `-fsanitize=memory`.
- MSAN can not be combined with other sanirizers and should be used as a seprate build.
+Then libFuzzer can be linked to the desired driver by passing in
+``-fsanitize=fuzzer`` during the linking stage.
-Finally, link with ``libFuzzer.a``::
+Using MemorySanitizer_ (MSAN) with libFuzzer is possible too, but tricky.
+The exact details are out of scope; we expect to simplify this in future
+versions.
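The target being compiled above is just a function with a fixed signature that libFuzzer calls repeatedly with mutated inputs; a minimal sketch (the crash condition is purely illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cassert>

// A minimal fuzz target. When built with
//   clang++ -g -O1 -fsanitize=fuzzer,address fuzz_me.cpp
// libFuzzer supplies main() and drives this function with mutated inputs.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  // Hypothetical bug: the coverage feedback lets the fuzzer discover the
  // magic prefix one byte at a time.
  if (size >= 4 && data[0] == 'F' && data[1] == 'U' &&
      data[2] == 'Z' && data[3] == 'Z')
    __builtin_trap();
  return 0;  // Non-zero return values are reserved for future use.
}
```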
- clang -fsanitize-coverage=trace-pc-guard -fsanitize=address your_lib.cc fuzz_target.cc libFuzzer.a -o my_fuzzer
+.. _libfuzzer-corpus:
Corpus
------
@@ -161,7 +135,6 @@ Only the inputs that trigger new coverage will be added to the first corpus.
./my_fuzzer -merge=1 CURRENT_CORPUS_DIR NEW_POTENTIALLY_INTERESTING_INPUTS_DIR
-
Running
-------
@@ -208,6 +181,33 @@ running with ``-jobs=30`` on a 12-core machine would run 6 workers by default,
with each worker averaging 5 bugs by completion of the entire process.
+Resuming merge
+--------------
+
+Merging large corpora may be time-consuming, and it is often desirable to do it
+on preemptible VMs, where the process may be killed at any time.
+In order to seamlessly resume the merge, use the ``-merge_control_file`` flag
+and use ``killall -SIGUSR1 /path/to/fuzzer/binary`` to stop the merge gracefully. Example:
+
+.. code-block:: console
+
+ % rm -f SomeLocalPath
+ % ./my_fuzzer CORPUS1 CORPUS2 -merge=1 -merge_control_file=SomeLocalPath
+ ...
+ MERGE-INNER: using the control file 'SomeLocalPath'
+ ...
+ # While this is running, do `killall -SIGUSR1 my_fuzzer` in another console
+ ==9015== INFO: libFuzzer: exiting as requested
+
+ # This will leave the file SomeLocalPath with the partial state of the merge.
+ # Now, you can continue the merge by executing the same command. The merge
+ # will continue from where it has been interrupted.
+ % ./my_fuzzer CORPUS1 CORPUS2 -merge=1 -merge_control_file=SomeLocalPath
+ ...
+ MERGE-OUTER: non-empty control file provided: 'SomeLocalPath'
+ MERGE-OUTER: control file ok, 32 files total, first not processed file 20
+ ...
+
Options
=======
@@ -246,6 +246,10 @@ The most important command line options are:
the process is treated as a failure case.
The limit is checked in a separate thread every second.
If running w/o ASAN/MSAN, you may use 'ulimit -v' instead.
+``-malloc_limit_mb``
+  If non-zero, the fuzzer will exit if the target tries to allocate this
+  number of megabytes with a single malloc call.
+  If zero (the default), the same limit as for ``rss_limit_mb`` is applied.
``-timeout_exitcode``
Exit code (default 77) used if libFuzzer reports a timeout.
``-error_exitcode``
@@ -257,6 +261,10 @@ The most important command line options are:
If set to 1, any corpus inputs from the 2nd, 3rd etc. corpus directories
that trigger new code coverage will be merged into the first corpus
directory. Defaults to 0. This flag can be used to minimize a corpus.
+``-merge_control_file``
+  Specify a control file used for the merge process.
+ If a merge process gets killed it tries to leave this file in a state
+ suitable for resuming the merge. By default a temporary file will be used.
``-minimize_crash``
If 1, minimizes the provided crash input.
Use with -runs=N or -max_total_time=N to limit the number of attempts.
@@ -278,6 +286,9 @@ The most important command line options are:
``-use_counters``
Use `coverage counters`_ to generate approximate counts of how often code
blocks are hit; defaults to 1.
+``-reduce_inputs``
+ Try to reduce the size of inputs while preserving their full feature sets;
+ defaults to 1.
``-use_value_profile``
Use `value profile`_ to guide corpus expansion; defaults to 0.
``-only_ascii``
@@ -305,10 +316,6 @@ The most important command line options are:
- 1 : close ``stdout``
- 2 : close ``stderr``
- 3 : close both ``stdout`` and ``stderr``.
-``-print_coverage``
- If 1, print coverage information as text at exit.
-``-dump_coverage``
- If 1, dump coverage information as a .sancov file at exit.
For the full list of flags run the fuzzer binary with ``-help=1``.
@@ -345,6 +352,9 @@ possible event codes are:
``NEW``
The fuzzer has created a test input that covers new areas of the code
under test. This input will be saved to the primary corpus directory.
+``REDUCE``
+ The fuzzer has found a better (smaller) input that triggers previously
+ discovered features (set ``-reduce_inputs=0`` to disable).
``pulse``
The fuzzer has generated 2\ :sup:`n` inputs (generated periodically to reassure
the user that the fuzzer is still working).
@@ -465,7 +475,7 @@ Tracing CMP instructions
------------------------
With an additional compiler flag ``-fsanitize-coverage=trace-cmp``
-(see SanitizerCoverageTraceDataFlow_)
+(on by default as part of ``-fsanitize=fuzzer``, see SanitizerCoverageTraceDataFlow_)
libFuzzer will intercept CMP instructions and guide mutations based
on the arguments of intercepted CMP instructions. This may slow down
the fuzzing but is very likely to improve the results.
@@ -473,7 +483,6 @@ the fuzzing but is very likely to improve the results.
Value Profile
-------------
-*EXPERIMENTAL*.
With ``-fsanitize-coverage=trace-cmp``
and extra run-time flag ``-use_value_profile=1`` the fuzzer will
collect value profiles for the parameters of compare instructions
@@ -535,7 +544,7 @@ Periodically restart both fuzzers so that they can use each other's findings.
Currently, there is no simple way to run both fuzzing engines in parallel while sharing the same corpus dir.
You may also use AFL on your target function ``LLVMFuzzerTestOneInput``:
-see an example `here <https://github.com/llvm-mirror/llvm/blob/master/lib/Fuzzer/afl/afl_driver.cpp>`__.
+see an example `here <https://github.com/llvm-mirror/compiler-rt/tree/master/lib/fuzzer/afl>`__.
How good is my fuzzer?
----------------------
@@ -543,28 +552,12 @@ How good is my fuzzer?
Once you implement your target function ``LLVMFuzzerTestOneInput`` and fuzz it to death,
you will want to know whether the function or the corpus can be improved further.
One easy to use metric is, of course, code coverage.
-You can get the coverage for your corpus like this:
-.. code-block:: console
-
- ./fuzzer CORPUS_DIR -runs=0 -print_coverage=1
+We recommend using
+`Clang Coverage <http://clang.llvm.org/docs/SourceBasedCodeCoverage.html>`_
+to visualize and study your code coverage
+(`example <https://github.com/google/fuzzer-test-suite/blob/master/tutorial/libFuzzerTutorial.md#visualizing-coverage>`_).
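A typical workflow is sketched below; the flags are the standard Clang coverage options, while the source and binary names are illustrative:

```console
# Build the target with coverage instrumentation in addition to fuzzing.
clang++ -g -fsanitize=fuzzer -fprofile-instr-generate -fcoverage-mapping \
    my_target.cc -o my_fuzzer
# Run the existing corpus once, without fuzzing, to produce a raw profile.
./my_fuzzer -runs=0 CORPUS_DIR
# Index the raw profile and render an annotated source listing.
llvm-profdata merge -sparse default.profraw -o default.profdata
llvm-cov show ./my_fuzzer -instr-profile=default.profdata
```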
-This will run all tests in the CORPUS_DIR but will not perform any fuzzing.
-At the end of the process it will print text describing what code has been covered and what hasn't.
-
-Alternatively, use
-
-.. code-block:: console
-
- ./fuzzer CORPUS_DIR -runs=0 -dump_coverage=1
-
-which will dump a ``.sancov`` file with coverage information.
-See SanitizerCoverage_ for details on querying the file using the ``sancov`` tool.
-
-You may also use other ways to visualize coverage,
-e.g. using `Clang coverage <http://clang.llvm.org/docs/SourceBasedCodeCoverage.html>`_,
-but those will require
-you to rebuild the code with different compiler flags.
User-supplied mutators
----------------------
@@ -621,75 +614,17 @@ you will eventually run out of RAM (see the ``-rss_limit_mb`` flag).
Developing libFuzzer
====================
-Building libFuzzer as a part of LLVM project and running its test requires
-fresh clang as the host compiler and special CMake configuration:
+LibFuzzer is built as part of the LLVM project by default on macOS and Linux.
+Users of other operating systems can explicitly request compilation using the
+``-DLIBFUZZER_ENABLE=YES`` flag.
+Tests are run using the ``check-fuzzer`` target from a build directory
+that was configured with the ``-DLIBFUZZER_ENABLE_TESTS=ON`` flag:
.. code-block:: console
- cmake -GNinja -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DLLVM_USE_SANITIZER=Address -DLLVM_USE_SANITIZE_COVERAGE=YES -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON /path/to/llvm
ninja check-fuzzer
-Fuzzing components of LLVM
-==========================
-.. contents::
- :local:
- :depth: 1
-
-To build any of the LLVM fuzz targets use the build instructions above.
-
-clang-format-fuzzer
--------------------
-The inputs are random pieces of C++-like text.
-
-.. code-block:: console
-
- ninja clang-format-fuzzer
- mkdir CORPUS_DIR
- ./bin/clang-format-fuzzer CORPUS_DIR
-
-Optionally build other kinds of binaries (ASan+Debug, MSan, UBSan, etc).
-
-Tracking bug: https://llvm.org/bugs/show_bug.cgi?id=23052
-
-clang-fuzzer
-------------
-
-The behavior is very similar to ``clang-format-fuzzer``.
-
-Tracking bug: https://llvm.org/bugs/show_bug.cgi?id=23057
-
-llvm-as-fuzzer
---------------
-
-Tracking bug: https://llvm.org/bugs/show_bug.cgi?id=24639
-
-llvm-mc-fuzzer
---------------
-
-This tool fuzzes the MC layer. Currently it is only able to fuzz the
-disassembler but it is hoped that assembly, and round-trip verification will be
-added in future.
-
-When run in dissassembly mode, the inputs are opcodes to be disassembled. The
-fuzzer will consume as many instructions as possible and will stop when it
-finds an invalid instruction or runs out of data.
-
-Please note that the command line interface differs slightly from that of other
-fuzzers. The fuzzer arguments should follow ``--fuzzer-args`` and should have
-a single dash, while other arguments control the operation mode and target in a
-similar manner to ``llvm-mc`` and should have two dashes. For example:
-
-.. code-block:: console
-
- llvm-mc-fuzzer --triple=aarch64-linux-gnu --disassemble --fuzzer-args -max_len=4 -jobs=10
-
-Buildbot
---------
-
-A buildbot continuously runs the above fuzzers for LLVM components, with results
-shown at http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fuzzer .
-
FAQ
=========================
@@ -748,6 +683,8 @@ network, crypto.
Trophies
========
+* Thousands of bugs found on OSS-Fuzz: https://opensource.googleblog.com/2017/05/oss-fuzz-five-months-later-and.html
+
* GLIBC: https://sourceware.org/glibc/wiki/FuzzingLibc
* MUSL LIBC: `[1] <http://git.musl-libc.org/cgit/musl/commit/?id=39dfd58417ef642307d90306e1c7e50aaec5a35c>`__ `[2] <http://www.openwall.com/lists/oss-security/2015/03/30/3>`__
@@ -774,6 +711,8 @@ Trophies
* `Linux Kernel's BPF verifier <https://github.com/iovisor/bpf-fuzzer>`_
+* `Linux Kernel's Crypto code <https://www.spinics.net/lists/stable/msg199712.html>`_
+
* Capstone: `[1] <https://github.com/aquynh/capstone/issues/600>`__ `[2] <https://github.com/aquynh/capstone/commit/6b88d1d51eadf7175a8f8a11b690684443b11359>`__
* file: `[1] <http://bugs.gw.com/view.php?id=550>`__ `[2] <http://bugs.gw.com/view.php?id=551>`__ `[3] <http://bugs.gw.com/view.php?id=553>`__ `[4] <http://bugs.gw.com/view.php?id=554>`__
@@ -792,6 +731,8 @@ Trophies
* `Wireshark <https://bugs.wireshark.org/bugzilla/buglist.cgi?bug_status=UNCONFIRMED&bug_status=CONFIRMED&bug_status=IN_PROGRESS&bug_status=INCOMPLETE&bug_status=RESOLVED&bug_status=VERIFIED&f0=OP&f1=OP&f2=product&f3=component&f4=alias&f5=short_desc&f7=content&f8=CP&f9=CP&j1=OR&o2=substring&o3=substring&o4=substring&o5=substring&o6=substring&o7=matches&order=bug_id%20DESC&query_format=advanced&v2=libfuzzer&v3=libfuzzer&v4=libfuzzer&v5=libfuzzer&v6=libfuzzer&v7=%22libfuzzer%22>`_
+* `QEMU <https://researchcenter.paloaltonetworks.com/2017/09/unit42-palo-alto-networks-discovers-new-qemu-vulnerability/>`_
+
.. _pcre2: http://www.pcre.org/
.. _AFL: http://lcamtuf.coredump.cx/afl/
.. _Radamsa: https://github.com/aoh/radamsa
@@ -800,7 +741,7 @@ Trophies
.. _AddressSanitizer: http://clang.llvm.org/docs/AddressSanitizer.html
.. _LeakSanitizer: http://clang.llvm.org/docs/LeakSanitizer.html
.. _Heartbleed: http://en.wikipedia.org/wiki/Heartbleed
-.. _FuzzerInterface.h: https://github.com/llvm-mirror/llvm/blob/master/lib/Fuzzer/FuzzerInterface.h
+.. _FuzzerInterface.h: https://github.com/llvm-mirror/compiler-rt/blob/master/lib/fuzzer/FuzzerInterface.h
.. _3.7.0: http://llvm.org/releases/3.7.0/docs/LibFuzzer.html
.. _building Clang from trunk: http://clang.llvm.org/get_started.html
.. _MemorySanitizer: http://clang.llvm.org/docs/MemorySanitizer.html
@@ -809,4 +750,4 @@ Trophies
.. _`value profile`: #value-profile
.. _`caller-callee pairs`: http://clang.llvm.org/docs/SanitizerCoverage.html#caller-callee-coverage
.. _BoringSSL: https://boringssl.googlesource.com/boringssl/
-.. _`fuzz various parts of LLVM itself`: `Fuzzing components of LLVM`_
+
diff --git a/docs/MIRLangRef.rst b/docs/MIRLangRef.rst
index b4ca8f2347a7..f170c7210879 100644
--- a/docs/MIRLangRef.rst
+++ b/docs/MIRLangRef.rst
@@ -121,6 +121,8 @@ Tests are more accessible and future proof when simplified:
contains dummy functions (see above). The .mir loader will create the
IR functions automatically in this case.
+.. _limitations:
+
Limitations
-----------
@@ -238,6 +240,8 @@ in the block's definition:
The block's name should be identical to the name of the IR block that this
machine block is based on.
+.. _block-references:
+
Block References
^^^^^^^^^^^^^^^^
@@ -246,13 +250,25 @@ blocks are referenced using the following syntax:
.. code-block:: text
- %bb.<id>[.<name>]
+ %bb.<id>
-Examples:
+Example:
.. code-block:: llvm
%bb.0
+
+The following syntax is also supported, but the former syntax is preferred for
+block references:
+
+.. code-block:: text
+
+ %bb.<id>[.<name>]
+
+Example:
+
+.. code-block:: llvm
+
%bb.1.then
Successors
@@ -418,7 +434,40 @@ immediate machine operand ``-42``:
%eax = MOV32ri -42
-.. TODO: Describe the CIMM (Rare) and FPIMM immediate operands.
+An immediate operand is also used to represent a subregister index when the
+machine instruction has one of the following opcodes:
+
+- ``EXTRACT_SUBREG``
+
+- ``INSERT_SUBREG``
+
+- ``REG_SEQUENCE``
+
+- ``SUBREG_TO_REG``
+
+When the opcode is one of these, the machine operand is printed according to
+the target.
+
+For example, in ``AArch64RegisterInfo.td``:
+
+.. code-block:: text
+
+ def sub_32 : SubRegIndex<32>;
+
+If the third operand is an immediate with the value ``15`` (a target-dependent
+value), then, based on the instruction's opcode and the operand's index, the
+operand will be printed as ``%subreg.sub_32``:
+
+.. code-block:: text
+
+ %1:gpr64 = SUBREG_TO_REG 0, %0, %subreg.sub_32
+
+For integers wider than 64 bits, we use a special machine operand, ``MO_CImmediate``,
+which stores the immediate in a ``ConstantInt`` using an ``APInt`` (LLVM's
+arbitrary precision integers).
+
+.. TODO: Describe the FPIMM immediate operands.
.. _register-operands:
@@ -484,6 +533,9 @@ corresponding internal ``llvm::RegState`` representation:
* - ``debug-use``
- ``RegState::Debug``
+ * - ``renamable``
+ - ``RegState::Renamable``
+
.. _subregister-indices:
Subregister Indices
@@ -501,6 +553,53 @@ lower bits from the 32-bit virtual register 0 to the 8-bit virtual register 1:
The names of the subregister indices are target specific, and are typically
defined in the target's ``*RegisterInfo.td`` file.
+Constant Pool Indices
+^^^^^^^^^^^^^^^^^^^^^
+
+A constant pool index (CPI) operand is printed using its index in the
+function's ``MachineConstantPool`` and an offset.
+
+For example, a CPI with the index 1 and offset 8:
+
+.. code-block:: text
+
+ %1:gr64 = MOV64ri %const.1 + 8
+
+For a CPI with the index 0 and offset -12:
+
+.. code-block:: text
+
+ %1:gr64 = MOV64ri %const.0 - 12
+
+A constant pool entry is bound to an LLVM IR ``Constant`` or a target-specific
+``MachineConstantPoolValue``. When serializing all the function's constants,
+the following format is used:
+
+.. code-block:: text
+
+ constants:
+ - id: <index>
+ value: <value>
+ alignment: <alignment>
+ isTargetSpecific: <target-specific>
+
+where ``<index>`` is a 32-bit unsigned integer, ``<value>`` is an `LLVM IR Constant
+<https://www.llvm.org/docs/LangRef.html#constants>`_, ``<alignment>`` is a 32-bit
+unsigned integer, and ``<target-specific>`` is either true or false.
+
+Example:
+
+.. code-block:: text
+
+ constants:
+ - id: 0
+ value: 'double 3.250000e+00'
+ alignment: 8
+ - id: 1
+ value: 'g-(LPC0+8)'
+ alignment: 4
+ isTargetSpecific: true
+
Global Value Operands
^^^^^^^^^^^^^^^^^^^^^
@@ -520,24 +619,91 @@ If the identifier doesn't match the regular expression
The unnamed global values are represented using an unsigned numeric value with
the '@' prefix, like in the following examples: ``@0``, ``@989``.
+Target-dependent Index Operands
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A target index operand is a target-specific index and an offset. It is printed
+using a target-specific name and a positive or negative offset.
+
+For example, ``amdgpu-constdata-start`` is associated with the index ``0`` in
+the AMDGPU backend, so a target index operand with the index 0 and the offset
+8 is printed as:
+
+.. code-block:: text
+
+ %sgpr2 = S_ADD_U32 _, target-index(amdgpu-constdata-start) + 8, implicit-def _, implicit-def _
+
+Jump-table Index Operands
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A jump-table index operand with the index 0 is printed as follows:
+
+.. code-block:: text
+
+ tBR_JTr killed %r0, %jump-table.0
+
+A machine jump-table entry contains a list of ``MachineBasicBlocks``. When
+serializing all the function's jump-table entries, the following format is
+used:
+
+.. code-block:: text
+
+ jumpTable:
+ kind: <kind>
+ entries:
+ - id: <index>
+ blocks: [ <bbreference>, <bbreference>, ... ]
+
+where ``<kind>`` describes how the jump table is represented and emitted (plain
+address, relocations, PIC, etc.), each ``<index>`` is a 32-bit unsigned
+integer, and ``blocks`` contains a list of :ref:`machine basic block
+references <block-references>`.
+
+Example:
+
+.. code-block:: text
+
+ jumpTable:
+ kind: inline
+ entries:
+ - id: 0
+ blocks: [ '%bb.3', '%bb.9', '%bb.4.d3' ]
+ - id: 1
+ blocks: [ '%bb.7', '%bb.7', '%bb.4.d3', '%bb.5' ]
+
+External Symbol Operands
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An external symbol operand is represented using an identifier with the ``$``
+prefix. The identifier is surrounded with double quotes and escaped if it
+contains any special or non-printable characters.
+
+Example:
+
+.. code-block:: text
+
+ CALL64pcrel32 $__stack_chk_fail, csr_64, implicit %rsp, implicit-def %rsp
+
+MCSymbol Operands
+^^^^^^^^^^^^^^^^^
+
+An MCSymbol operand holds a pointer to an ``MCSymbol``. For the limitations
+of this operand in MIR, see :ref:`limitations <limitations>`.
+
+The syntax is:
+
+.. code-block:: text
+
+ EH_LABEL <mcsymbol Ltmp1>
+
.. TODO: Describe the parsers default behaviour when optional YAML attributes
are missing.
.. TODO: Describe the syntax for the bundled instructions.
.. TODO: Describe the syntax for virtual register YAML definitions.
.. TODO: Describe the machine function's YAML flag attributes.
-.. TODO: Describe the syntax for the external symbol and register
- mask machine operands.
+.. TODO: Describe the syntax for the register mask machine operands.
.. TODO: Describe the frame information YAML mapping.
.. TODO: Describe the syntax of the stack object machine operands and their
YAML definitions.
-.. TODO: Describe the syntax of the constant pool machine operands and their
- YAML definitions.
-.. TODO: Describe the syntax of the jump table machine operands and their
- YAML definitions.
.. TODO: Describe the syntax of the block address machine operands.
.. TODO: Describe the syntax of the CFI index machine operands.
.. TODO: Describe the syntax of the metadata machine operands, and the
instructions debug location attribute.
-.. TODO: Describe the syntax of the target index machine operands.
.. TODO: Describe the syntax of the register live out machine operands.
.. TODO: Describe the syntax of the machine memory operands.
diff --git a/docs/NVPTXUsage.rst b/docs/NVPTXUsage.rst
index 159fe078653c..38222afbc63a 100644
--- a/docs/NVPTXUsage.rst
+++ b/docs/NVPTXUsage.rst
@@ -499,7 +499,7 @@ The output we get from ``llc`` (as of LLVM 3.4):
.reg .s32 %r<2>;
.reg .s64 %rl<8>;
- // BB#0: // %entry
+ // %bb.0: // %entry
ld.param.u64 %rl1, [kernel_param_0];
mov.u32 %r1, %tid.x;
mul.wide.s32 %rl2, %r1, 4;
@@ -897,7 +897,7 @@ This gives us the following PTX (excerpt):
.reg .s32 %r<21>;
.reg .s64 %rl<8>;
- // BB#0: // %entry
+ // %bb.0: // %entry
ld.param.u64 %rl2, [kernel_param_0];
mov.u32 %r3, %tid.x;
ld.param.u64 %rl3, [kernel_param_1];
@@ -921,7 +921,7 @@ This gives us the following PTX (excerpt):
abs.f32 %f4, %f1;
setp.gtu.f32 %p4, %f4, 0f7F800000;
@%p4 bra BB0_4;
- // BB#3: // %__nv_isnanf.exit5.i
+ // %bb.3: // %__nv_isnanf.exit5.i
abs.f32 %f5, %f2;
setp.le.f32 %p5, %f5, 0f7F800000;
@%p5 bra BB0_5;
@@ -953,7 +953,7 @@ This gives us the following PTX (excerpt):
selp.f32 %f110, 0f7F800000, %f99, %p16;
setp.eq.f32 %p17, %f110, 0f7F800000;
@%p17 bra BB0_28;
- // BB#27:
+ // %bb.27:
fma.rn.f32 %f110, %f110, %f108, %f110;
BB0_28: // %__internal_accurate_powf.exit.i
setp.lt.f32 %p18, %f1, 0f00000000;
diff --git a/docs/ProgrammersManual.rst b/docs/ProgrammersManual.rst
index d115a9cf6de8..719d3997594e 100644
--- a/docs/ProgrammersManual.rst
+++ b/docs/ProgrammersManual.rst
@@ -441,6 +441,15 @@ the program where they can be handled appropriately. Handling the error may be
as simple as reporting the issue to the user, or it may involve attempts at
recovery.
+.. note::
+
+ While it would be ideal to use this error handling scheme throughout
+ LLVM, there are places where this hasn't been practical to apply. In
+ situations where you absolutely must emit a non-programmatic error and
+ the ``Error`` model isn't workable you can call ``report_fatal_error``,
+ which will call installed error handlers, print a message, and exit the
+ program.
+
Recoverable errors are modeled using LLVM's ``Error`` scheme. This scheme
represents errors using function return values, similar to classic C integer
error codes, or C++'s ``std::error_code``. However, the ``Error`` class is
@@ -486,7 +495,7 @@ that inherits from the ErrorInfo utility, E.g.:
Error printFormattedFile(StringRef Path) {
if (<check for valid format>)
- return make_error<InvalidObjectFile>(Path);
+ return make_error<BadFileFormat>(Path);
// print file contents.
return Error::success();
}
@@ -1224,7 +1233,7 @@ Define your DebugCounter like this:
.. code-block:: c++
DEBUG_COUNTER(DeleteAnInstruction, "passname-delete-instruction",
- "Controls which instructions get delete").
+ "Controls which instructions get delete");
The ``DEBUG_COUNTER`` macro defines a static variable, whose name
is specified by the first argument. The name of the counter
@@ -2105,7 +2114,7 @@ is stored in the same allocation as the Value of a pair).
StringMap also provides query methods that take byte ranges, so it only ever
copies a string if a value is inserted into the table.
-StringMap iteratation order, however, is not guaranteed to be deterministic, so
+StringMap iteration order, however, is not guaranteed to be deterministic, so
any uses which require that should instead use a std::map.
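For instance, a minimal generic-C++ illustration of the difference (not LLVM's ``StringMap`` itself; the helper name is invented for the example): ``std::map`` always iterates in sorted key order, so the result below is deterministic across runs and platforms, which a hashed container does not guarantee.

```cpp
#include <map>
#include <string>

// Returns the keys of M in iteration order. For std::map this is always
// sorted key order; a hash-based map (like StringMap) makes no such
// promise, so its iteration order can vary between runs.
std::string keysInOrder(const std::map<std::string, int> &M) {
  std::string Keys;
  for (const auto &KV : M) {
    if (!Keys.empty())
      Keys += ' ';
    Keys += KV.first;
  }
  return Keys;
}
```

Calling ``keysInOrder({{"zebra", 1}, {"apple", 2}, {"mango", 3}})`` yields ``"apple mango zebra"`` regardless of insertion order.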
.. _dss_indexmap:
diff --git a/docs/Proposals/VectorizationPlan.rst b/docs/Proposals/VectorizationPlan.rst
index aed8e3d2b793..f9700d177d23 100644
--- a/docs/Proposals/VectorizationPlan.rst
+++ b/docs/Proposals/VectorizationPlan.rst
@@ -82,8 +82,14 @@ The design of VPlan follows several high-level guidelines:
replicated VF*UF times to handle scalarized and predicated instructions.
Innerloops are also modelled as SESE regions.
-Low-level Design
-================
+7. Support instruction-level analysis and transformation, as part of Planning
+ Step 2.b: During vectorization instructions may need to be traversed, moved,
+ replaced by other instructions or be created. For example, vector idiom
+ detection and formation involves searching for and optimizing instruction
+ patterns.
+
+Definitions
+===========
The low-level design of VPlan comprises the following classes.
:LoopVectorizationPlanner:
@@ -139,11 +145,64 @@ The low-level design of VPlan comprises of the following classes.
instructions; e.g., cloned once, replicated multiple times or widened
according to selected VF.
+:VPValue:
+ The base of VPlan's def-use relations class hierarchy. When instantiated, it
+ models a constant or a live-in Value in VPlan. It has users, which are of type
+ VPUser, but no operands.
+
+:VPUser:
+ A VPValue representing a general vertex in the def-use graph of VPlan. It has
+ operands which are of type VPValue. When instantiated, it represents a
+ live-out Instruction that exists outside VPlan. VPUser is similar in some
+ aspects to LLVM's User class.
+
+:VPInstruction:
+ A VPInstruction is both a VPRecipe and a VPUser. It models a single
+ VPlan-level instruction to be generated if the VPlan is executed, including
+ its opcode and possibly additional characteristics. It is the basis for
+ writing instruction-level analyses and optimizations in VPlan as creating,
+ replacing or moving VPInstructions record both def-use and scheduling
+ decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic
+ operations that enrich the Vectorizer's semantics.
+
:VPTransformState:
Stores information used for generating output IR, passed from
LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
additional information down to VPBlocks and VPRecipes.
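The def-use portion of this hierarchy can be pictured as a rough sketch in plain C++, mirroring the names above; the real classes carry far more state and also participate in the VPRecipe hierarchy, which is omitted here.

```cpp
#include <vector>

// Hedged sketch only: illustrates the def-use relations described in
// the text, not the actual LLVM class definitions.
struct VPUser;

struct VPValue {
  std::vector<VPUser *> Users;  // consumers of this value
};

struct VPUser : VPValue {
  std::vector<VPValue *> Operands;  // values this vertex consumes
  void addOperand(VPValue *V) {
    Operands.push_back(V);     // record the use ...
    V->Users.push_back(this);  // ... and the def-use back edge
  }
};

// Models a single VPlan-level instruction: both a value and a user,
// carrying an opcode for the operation it represents.
struct VPInstruction : VPUser {
  unsigned Opcode;
  explicit VPInstruction(unsigned Op) : Opcode(Op) {}
};
```

Creating a ``VPInstruction`` and wiring an operand records both directions of the def-use relation, which is what makes instruction-level analyses expressible directly on the plan.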
+The Planning Process and VPlan Roadmap
+======================================
+
+Transforming the Loop Vectorizer to use VPlan follows a staged approach. First,
+VPlan is used to record the final vectorization decisions, and to execute them:
+the Hierarchical CFG models the planned control-flow, and Recipes capture
+decisions taken inside basic-blocks. Next, VPlan will be used also as the basis
+for taking these decisions, effectively turning them into a series of
+VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
+itself including cost-based analyses for making these decisions, to fully
+support compositional and iterative decision making.
+
+Some decisions are local to an instruction in the loop, such as whether to widen
+it into a vector instruction or replicate it, keeping the generated instructions
+in place. Other decisions, however, involve moving instructions, replacing them
+with other instructions, and/or introducing new instructions. For example, a
+cast may sink past a later instruction and be widened to handle first-order
+recurrence; an interleave group of strided gathers or scatters may effectively
+move to one place where they are replaced with shuffles and a common wide vector
+load or store; new instructions may be introduced to compute masks, shuffle the
+elements of vectors, and pack scalar values into vectors or vice-versa.
+
+In order for VPlan to support making instruction-level decisions and analyses,
+it needs to model the relevant instructions along with their def/use relations.
+This too follows a staged approach: first, the new instructions that compute
+masks are modeled as VPInstructions, along with their induced def/use subgraph.
+This effectively models masks in VPlan, facilitating VPlan-based predication.
+Next, the logic embedded within each Recipe for generating its instructions at
+VPlan execution time will instead take part in the planning process by modeling
+them as VPInstructions. Finally, only logic that applies to instructions as a
+group will remain in Recipes, such as interleave groups and potentially other
+idiom groups having synergistic cost.
+
Related LLVM components
-----------------------
1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
@@ -152,6 +211,9 @@ Related LLVM components
2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
Polly [7]_.
+3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
+ the Loop Vectorizer and extend it to handle outer loops [8,9]_.
+
References
----------
.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
@@ -180,3 +242,6 @@ References
.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
European LLVM Developers' Meeting 2017.
+
+.. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
+ Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.
diff --git a/docs/ReleaseNotes.rst b/docs/ReleaseNotes.rst
index 4e91eea2cbc8..41b9cf92d767 100644
--- a/docs/ReleaseNotes.rst
+++ b/docs/ReleaseNotes.rst
@@ -1,10 +1,15 @@
========================
-LLVM 5.0.0 Release Notes
+LLVM 6.0.0 Release Notes
========================
.. contents::
:local:
+.. warning::
+ These are in-progress notes for the upcoming LLVM 6 release.
+ Release notes for previous releases can be found on
+ `the Download Page <http://releases.llvm.org/download.html>`_.
+
Introduction
============
@@ -21,244 +26,97 @@ have questions or comments, the `LLVM Developer's Mailing List
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ is a good place to send
them.
+Note that if you are reading this file from a Subversion checkout or the main
+LLVM web page, this document applies to the *next* release, not the current
+one. To see the release notes for a specific release, please see the `releases
+page <http://llvm.org/releases/>`_.
+
Non-comprehensive list of changes in this release
=================================================
+.. NOTE
+ For small 1-3 sentence descriptions, just add an entry at the end of
+ this list. If your description won't fit comfortably in one bullet
+ point (e.g. maybe you would like to give an example of the
+ functionality, or simply have a lot to talk about), see the `NOTE` below
+ for adding a new subsection.
-* LLVM's ``WeakVH`` has been renamed to ``WeakTrackingVH`` and a new ``WeakVH``
- has been introduced. The new ``WeakVH`` nulls itself out on deletion, but
- does not track values across RAUW.
-
-* A new library named ``BinaryFormat`` has been created which holds a collection
- of code which previously lived in ``Support``. This includes the
- ``file_magic`` structure and ``identify_magic`` functions, as well as all the
- structure and type definitions for DWARF, ELF, COFF, WASM, and MachO file
- formats.
+* The ``Redirects`` argument of ``llvm::sys::ExecuteAndWait`` and
+ ``llvm::sys::ExecuteNoWait`` was changed to an ``ArrayRef`` of optional
+ ``StringRef``'s to make it safer and more convenient to use.
-* The tool ``llvm-pdbdump`` has been renamed ``llvm-pdbutil`` to better reflect
- its nature as a general purpose PDB manipulation / diagnostics tool that does
- more than just dumping contents.
+* The backend name was added to the Target Registry to allow run-time
+ information to be fed back into TableGen. Out-of-tree targets will need to add
+ the name used in the `def X : Target` definition to the call to
+ `RegisterTarget`.
-* The ``BBVectorize`` pass has been removed. It was fully replaced and no
- longer used back in 2014 but we didn't get around to removing it. Now it is
- gone. The SLP vectorizer is the suggested non-loop vectorization pass.
+* The ``Debugify`` pass was added to ``opt`` to facilitate testing of debug
+ info preservation. This pass attaches synthetic ``DILocations`` and
+ ``DIVariables`` to the instructions in a ``Module``. The ``CheckDebugify``
+ pass determines how much of the metadata is lost.
-* A new tool opt-viewer.py has been added to visualize optimization remarks in
- HTML. The tool processes the YAML files produced by clang with the
- -fsave-optimization-record option.
+* Note..
-* A new CMake macro ``LLVM_REVERSE_ITERATION`` has been added. If enabled, all
- supported unordered LLVM containers would be iterated in reverse order. This
- is useful for uncovering non-determinism caused by iteration of unordered
- containers. Currently, it supports reverse iteration of SmallPtrSet and
- DenseMap.
+.. NOTE
+ If you would like to document a larger change, then you can add a
+ subsection about it right here. You can copy the following boilerplate
+ and un-indent it (the indentation causes it to be inside this comment).
-* A new tool ``llvm-dlltool`` has been added to create short import libraries
- from GNU style definition files. The tool utilizes the PE COFF SPEC Import
- Library Format and PE COFF Auxiliary Weak Externals Format to achieve
- compatibility with LLD and MSVC LINK.
+ Special New Feature
+ -------------------
+ Makes programs 10x faster by doing Special New Thing.
Changes to the LLVM IR
----------------------
-* The datalayout string may now indicate an address space to use for
- the pointer type of ``alloca`` rather than the default of 0.
-
-* Added ``speculatable`` attribute indicating a function which has no
- side-effects which could inhibit hoisting of calls.
-
-Changes to the Arm Targets
+Changes to the ARM Backend
--------------------------
-During this release the AArch64 target has:
-
-* A much improved Global ISel at O0.
-* Support for ARMv8.1 8.2 and 8.3 instructions.
-* New scheduler information for ThunderX2.
-* Some SVE type changes but not much more than that.
-* Made instruction fusion more aggressive, resulting in speedups
- for code making use of AArch64 AES instructions. AES fusion has been
- enabled for most Cortex-A cores and the AArch64MacroFusion pass was moved
- to the generic MacroFusion pass.
-* Added preferred function alignments for most Cortex-A cores.
-* OpenMP "offload-to-self" base support.
-
-During this release the ARM target has:
-
-* Improved, but still mostly broken, Global ISel.
-* Scheduling models update, new schedule for Cortex-A57.
-* Hardware breakpoint support in LLDB.
-* New assembler error handling, with spelling corrections and multiple
- suggestions on how to fix problems.
-* Improved mixed ARM/Thumb code generation. Some cases in which wrong
- relocations were emitted have been fixed.
-* Added initial support for mixed ARM/Thumb link-time optimization, using the
- thumb-mode target feature.
+ During this release ...
+
Changes to the MIPS Target
--------------------------
-* The microMIPS64R6 backend is deprecated and will be removed in the next
- release.
-
-* The MIPS backend now directly supports vector types for arguments and return
- values (previously this required ABI specific LLVM IR).
-
-* Added documentation for how the MIPS backend handles address lowering.
-
-* Added a GCC compatible option -m(no-)madd4 to control the generation of four
- operand multiply addition/subtraction instructions.
-
-* Added basic support for the XRay instrumentation system.
-
-* Added support for more assembly aliases and macros.
-
-* Added support for the ``micromips`` and ``nomicromips`` function attributes
- which control micromips code generation on a per function basis.
-
-* Added the ``long-calls`` feature for non-pic environments. This feature is
- used where the callee is out of range of the caller using a standard call
- sequence. It must be enabled specifically.
-
-* Added support for performing microMIPS code generation via function
- attributes.
-
-* Added experimental support for the static relocation model for the N64 ABI.
-
-* Added partial support for the MT ASE.
-
-* Added basic support for code size reduction for microMIPS.
-
-* Fixed numerous bugs including: multi-precision arithmetic support, various
- vectorization bugs, debug information for thread local variables, debug
- sections lacking the correct flags, crashing when disassembling sections
- whose size is not a multiple of two or four.
+ During this release ...
Changes to the PowerPC Target
-----------------------------
-* Additional support and exploitation of POWER ISA 3.0: vabsdub, vabsduh,
- vabsduw, modsw, moduw, modsd, modud, lxv, stxv, vextublx, vextubrx, vextuhlx,
- vextuhrx, vextuwlx, vextuwrx, vextsb2w, vextsb2d, vextsh2w, vextsh2d, and
- vextsw2d
-
-* Implemented Optimal Code Sequences from The PowerPC Compiler Writer's Guide.
-
-* Enable -fomit-frame-pointer by default.
-
-* Improved handling of bit reverse intrinsic.
-
-* Improved handling of memcpy and memcmp functions.
-
-* Improved handling of branches with static branch hints.
-
-* Improved codegen for atomic load_acquire.
-
-* Improved block placement during code layout
-
-* Many improvements to instruction selection and code generation
-
+ During this release ...
Changes to the X86 Target
-------------------------
-* Added initial AMD Ryzen (znver1) scheduler support.
-
-* Added support for Intel Goldmont CPUs.
-
-* Add support for avx512vpopcntdq instructions.
-
-* Added heuristics to convert CMOV into branches when it may be profitable.
-
-* More aggressive inlining of memcmp calls.
-
-* Improve vXi64 shuffles on 32-bit targets.
-
-* Improved use of PMOVMSKB for any_of/all_of comparision reductions.
-
-* Improved Silvermont, Sandybridge, and Jaguar (btver2) schedulers.
-
-* Improved support for AVX512 vector rotations.
-
-* Added support for AMD Lightweight Profiling (LWP) instructions.
-
-* Avoid using slow LEA instructions.
-
-* Use alternative sequences for multiply by constant.
-
-* Improved lowering of strided shuffles.
-
-* Improved the AVX512 cost model used by the vectorizer.
-
-* Fix scalar code performance when AVX512 is enabled by making i1's illegal.
-
-* Fixed many inline assembly bugs.
-
-* Preliminary support for tracing NetBSD processes and core files with a single
- thread in LLDB.
+ During this release ...
Changes to the AMDGPU Target
-----------------------------
-* Initial gfx9 support
+ During this release ...
Changes to the AVR Target
-----------------------------
-This release consists mainly of bugfixes and implementations of features
-required for compiling basic Rust programs.
-
-* Enable the branch relaxation pass so that we don't crash on large
- stack load/stores
-
-* Add support for lowering bit-rotations to the native ``ror`` and ``rol``
- instructions
+ During this release ...
-* Fix bug where function pointers were treated as pointers to RAM and not
- pointers to program memory
-
-* Fix broken code generation for shift-by-variable expressions
+Changes to the OCaml bindings
+-----------------------------
-* Support zero-sized types in argument lists; this is impossible in C,
- but possible in Rust
+ During this release ...
Changes to the C API
--------------------
-* Deprecated the ``LLVMAddBBVectorizePass`` interface since the ``BBVectorize``
- pass has been removed. It is now a no-op and will be removed in the next
- release. Use ``LLVMAddSLPVectorizePass`` instead to get the supported SLP
- vectorizer.
+ During this release ...
-External Open Source Projects Using LLVM 5
+External Open Source Projects Using LLVM 6
==========================================
-Zig Programming Language
-------------------------
-
-`Zig <http://ziglang.org>`_ is an open-source programming language designed
-for robustness, optimality, and clarity. It integrates closely with C and is
-intended to eventually take the place of C. It uses LLVM to produce highly
-optimized native code and to cross-compile for any target out of the box. Zig
-is in alpha; with a beta release expected in September.
-
-LDC - the LLVM-based D compiler
--------------------------------
-
-`D <http://dlang.org>`_ is a language with C-like syntax and static typing. It
-pragmatically combines efficiency, control, and modeling power, with safety and
-programmer productivity. D supports powerful concepts like Compile-Time Function
-Execution (CTFE) and Template Meta-Programming, provides an innovative approach
-to concurrency and offers many classical paradigms.
-
-`LDC <http://wiki.dlang.org/LDC>`_ uses the frontend from the reference compiler
-combined with LLVM as backend to produce efficient native code. LDC targets
-x86/x86_64 systems like Linux, OS X, FreeBSD and Windows and also Linux on ARM
-and PowerPC (32/64 bit). Ports to other architectures like AArch64 and MIPS64
-are underway.
+* A project...
Additional Information
diff --git a/docs/ScudoHardenedAllocator.rst b/docs/ScudoHardenedAllocator.rst
index e00c8324e55a..562a39144829 100644
--- a/docs/ScudoHardenedAllocator.rst
+++ b/docs/ScudoHardenedAllocator.rst
@@ -126,14 +126,14 @@ For example, using the environment variable:
.. code::
- SCUDO_OPTIONS="DeleteSizeMismatch=1:QuarantineSizeMb=16" ./a.out
+ SCUDO_OPTIONS="DeleteSizeMismatch=1:QuarantineSizeKb=64" ./a.out
Or using the function:
-.. code::
+.. code:: cpp
extern "C" const char *__scudo_default_options() {
- return "DeleteSizeMismatch=1:QuarantineSizeMb=16";
+ return "DeleteSizeMismatch=1:QuarantineSizeKb=64";
}
@@ -142,11 +142,14 @@ The following options are available:
+-----------------------------+----------------+----------------+------------------------------------------------+
| Option | 64-bit default | 32-bit default | Description |
+-----------------------------+----------------+----------------+------------------------------------------------+
-| QuarantineSizeMb | 64 | 16 | The size (in Mb) of quarantine used to delay |
+| QuarantineSizeKb | 256 | 64 | The size (in Kb) of quarantine used to delay |
| | | | the actual deallocation of chunks. Lower value |
| | | | may reduce memory usage but decrease the |
| | | | effectiveness of the mitigation; a negative |
-| | | | value will fallback to a default of 64Mb. |
+| | | | value will fall back to the defaults. |
++-----------------------------+----------------+----------------+------------------------------------------------+
+| QuarantineChunksUpToSize | 2048 | 512 | Size (in bytes) up to which chunks can be |
+| | | | quarantined. |
+-----------------------------+----------------+----------------+------------------------------------------------+
| ThreadLocalQuarantineSizeKb | 1024 | 256 | The size (in Kb) of per-thread cache use to |
| | | | offload the global quarantine. Lower value may |
diff --git a/docs/SourceLevelDebugging.rst b/docs/SourceLevelDebugging.rst
index a9f5c3a08147..103c6e0365ba 100644
--- a/docs/SourceLevelDebugging.rst
+++ b/docs/SourceLevelDebugging.rst
@@ -171,35 +171,64 @@ Debugger intrinsic functions
----------------------------
LLVM uses several intrinsic functions (name prefixed with "``llvm.dbg``") to
-provide debug information at various points in generated code.
+track source local variables through optimization and code generation.
-``llvm.dbg.declare``
+``llvm.dbg.addr``
^^^^^^^^^^^^^^^^^^^^
.. code-block:: llvm
- void @llvm.dbg.declare(metadata, metadata, metadata)
+ void @llvm.dbg.addr(metadata, metadata, metadata)
-This intrinsic provides information about a local element (e.g., variable). The
-first argument is metadata holding the alloca for the variable. The second
-argument is a `local variable <LangRef.html#dilocalvariable>`_ containing a
-description of the variable. The third argument is a `complex expression
-<LangRef.html#diexpression>`_. An `llvm.dbg.declare` instrinsic describes the
-*location* of a source variable.
+This intrinsic provides information about a local element (e.g., variable).
+The first argument is metadata holding the address of the variable, typically a
+static alloca in the function entry block. The second argument is a
+`local variable <LangRef.html#dilocalvariable>`_ containing a description of
+the variable. The third argument is a `complex expression
+<LangRef.html#diexpression>`_. An `llvm.dbg.addr` intrinsic describes the
+*address* of a source variable.
-.. code-block:: llvm
+.. code-block:: text
%i.addr = alloca i32, align 4
- call void @llvm.dbg.declare(metadata i32* %i.addr, metadata !1, metadata !2), !dbg !3
+ call void @llvm.dbg.addr(metadata i32* %i.addr, metadata !1,
+ metadata !DIExpression()), !dbg !2
!1 = !DILocalVariable(name: "i", ...) ; int i
- !2 = !DIExpression()
- !3 = !DILocation(...)
+ !2 = !DILocation(...)
...
%buffer = alloca [256 x i8], align 8
; The address of i is buffer+64.
- call void @llvm.dbg.declare(metadata [256 x i8]* %buffer, metadata !1, metadata !2)
- !1 = !DILocalVariable(name: "i", ...) ; int i
- !2 = !DIExpression(DW_OP_plus, 64)
+ call void @llvm.dbg.addr(metadata [256 x i8]* %buffer, metadata !3,
+ metadata !DIExpression(DW_OP_plus, 64)), !dbg !4
+ !3 = !DILocalVariable(name: "i", ...) ; int i
+ !4 = !DILocation(...)
+
+A frontend should generate exactly one call to ``llvm.dbg.addr`` at the point
+of declaration of a source variable. Optimization passes that fully promote the
+variable from memory to SSA values will replace this call with possibly
+multiple calls to `llvm.dbg.value`. Passes that delete stores are effectively
+partial promotion, and they will insert a mix of calls to ``llvm.dbg.value``
+and ``llvm.dbg.addr`` to track the source variable value when it is available.
+After optimization, there may be multiple calls to ``llvm.dbg.addr`` describing
+the program points where the variable lives in memory. All calls for the same
+concrete source variable must agree on the memory location.
+
+
+``llvm.dbg.declare``
+^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: llvm
+
+ void @llvm.dbg.declare(metadata, metadata, metadata)
+
+This intrinsic is identical to `llvm.dbg.addr`, except that there can only be
+one call to `llvm.dbg.declare` for a given concrete `local variable
+<LangRef.html#dilocalvariable>`_. It is not control-dependent, meaning that if
+a call to `llvm.dbg.declare` exists and has a valid location argument, that
+address is considered to be the true home of the variable across its entire
+lifetime. This makes it hard for optimizations to preserve accurate debug info
+in the presence of ``llvm.dbg.declare``, so we are transitioning away from it,
+and we plan to deprecate it in future LLVM releases.
``llvm.dbg.value``
@@ -207,14 +236,13 @@ description of the variable. The third argument is a `complex expression
.. code-block:: llvm
- void @llvm.dbg.value(metadata, i64, metadata, metadata)
+ void @llvm.dbg.value(metadata, metadata, metadata)
This intrinsic provides information when a user source variable is set to a new
value. The first argument is the new value (wrapped as metadata). The second
-argument is the offset in the user source variable where the new value is
-written. The third argument is a `local variable
-<LangRef.html#dilocalvariable>`_ containing a description of the variable. The
-fourth argument is a `complex expression <LangRef.html#diexpression>`_.
+argument is a `local variable <LangRef.html#dilocalvariable>`_ containing a
+description of the variable. The third argument is a `complex expression
+<LangRef.html#diexpression>`_.
Object lifetimes and scoping
============================
@@ -243,6 +271,9 @@ following C fragment, for example:
8. X = Y;
9. }
+.. FIXME: Update the following example to use llvm.dbg.addr once that is the
+ default in clang.
+
Compiled to LLVM, this function would be represented like this:
.. code-block:: text
diff --git a/docs/Statepoints.rst b/docs/Statepoints.rst
index 73e09ae8b620..6ed4e46b73bc 100644
--- a/docs/Statepoints.rst
+++ b/docs/Statepoints.rst
@@ -172,7 +172,7 @@ SSA value ``%obj.relocated`` which represents the potentially changed value of
``%obj`` after the safepoint and update any following uses appropriately. The
resulting relocation sequence is:
-.. code-block:: text
+.. code-block:: llvm
define i8 addrspace(1)* @test1(i8 addrspace(1)* %obj)
gc "statepoint-example" {
@@ -273,7 +273,7 @@ afterwards.
If we extend our previous example to include a pointless derived pointer,
we get:
-.. code-block:: text
+.. code-block:: llvm
define i8 addrspace(1)* @test1(i8 addrspace(1)* %obj)
gc "statepoint-example" {
@@ -319,7 +319,7 @@ Let's assume a hypothetical GC--somewhat unimaginatively named "hypothetical-gc"
--that requires that a TLS variable must be written to before and after a call
to unmanaged code. The resulting relocation sequence is:
-.. code-block:: text
+.. code-block:: llvm
@flag = thread_local global i32 0, align 4
@@ -702,7 +702,7 @@ whitelist or use one of the predefined ones.
As an example, given this code:
-.. code-block:: text
+.. code-block:: llvm
define i8 addrspace(1)* @test1(i8 addrspace(1)* %obj)
gc "statepoint-example" {
@@ -712,7 +712,7 @@ As an example, given this code:
The pass would produce this IR:
-.. code-block:: text
+.. code-block:: llvm
define i8 addrspace(1)* @test1(i8 addrspace(1)* %obj)
gc "statepoint-example" {
@@ -789,7 +789,7 @@ As an example, given input IR of the following:
This pass would produce the following IR:
-.. code-block:: text
+.. code-block:: llvm
define void @test() gc "statepoint-example" {
call void @do_safepoint()
diff --git a/docs/TypeMetadata.rst b/docs/TypeMetadata.rst
index 98d58b71a6d3..0382d0cbd44b 100644
--- a/docs/TypeMetadata.rst
+++ b/docs/TypeMetadata.rst
@@ -128,7 +128,7 @@ address points of the vtables of A, B and D respectively). If we then load
an address from that pointer, we know that the address can only be one of
``&A::f``, ``&B::f`` or ``&D::f``.
-.. _address point: https://mentorembedded.github.io/cxx-abi/abi.html#vtable-general
+.. _address point: https://itanium-cxx-abi.github.io/cxx-abi/abi.html#vtable-general
Testing Addresses For Type Membership
=====================================
@@ -223,4 +223,4 @@ efficiently to minimize the sizes of the underlying bitsets.
ret void
}
-.. _GlobalLayoutBuilder: http://llvm.org/klaus/llvm/blob/master/include/llvm/Transforms/IPO/LowerTypeTests.h
+.. _GlobalLayoutBuilder: http://git.llvm.org/klaus/llvm/blob/master/include/llvm/Transforms/IPO/LowerTypeTests.h
diff --git a/docs/WritingAnLLVMBackend.rst b/docs/WritingAnLLVMBackend.rst
index 0fc7382f788b..5f34c70540b4 100644
--- a/docs/WritingAnLLVMBackend.rst
+++ b/docs/WritingAnLLVMBackend.rst
@@ -1008,10 +1008,50 @@ Instruction Scheduling
----------------------
Instruction itineraries can be queried using MCInstrDesc::getSchedClass(). The
-value can be named by an enumemation in llvm::XXX::Sched namespace generated
+value can be named by an enumeration in llvm::XXX::Sched namespace generated
by TableGen in XXXGenInstrInfo.inc. The names of the schedule classes are
the same as provided in XXXSchedule.td plus a default NoItinerary class.
+The schedule models are generated by TableGen by the SubtargetEmitter,
+using the ``CodeGenSchedModels`` class. This is distinct from the itinerary
+method of specifying machine resource use. The tool ``utils/schedcover.py``
+can be used to determine which instructions have been covered by the
+schedule model description and which haven't. The first step is to use the
+instructions below to create an output file. Then run ``schedcover.py`` on the
+output file:
+
+.. code-block:: shell
+
+ $ <src>/utils/schedcover.py <build>/lib/Target/AArch64/tblGenSubtarget.with
+ instruction, default, CortexA53Model, CortexA57Model, CycloneModel, ExynosM1Model, FalkorModel, KryoModel, ThunderX2T99Model, ThunderXT8XModel
+ ABSv16i8, WriteV, , , CyWriteV3, M1WriteNMISC1, FalkorWr_2VXVY_2cyc, KryoWrite_2cyc_XY_XY_150ln, ,
+ ABSv1i64, WriteV, , , CyWriteV3, M1WriteNMISC1, FalkorWr_1VXVY_2cyc, KryoWrite_2cyc_XY_noRSV_67ln, ,
+ ...
+
+To capture the debug output from generating a schedule model, change to the
+appropriate target directory and use the following command with the
+``subtarget-emitter`` debug option:
+
+.. code-block:: shell
+
+ $ <build>/bin/llvm-tblgen -debug-only=subtarget-emitter -gen-subtarget \
+ -I <src>/lib/Target/<target> -I <src>/include \
+ -I <src>/lib/Target <src>/lib/Target/<target>/<target>.td \
+ -o <build>/lib/Target/<target>/<target>GenSubtargetInfo.inc.tmp \
+ > tblGenSubtarget.dbg 2>&1
+
+Where ``<build>`` is the build directory, ``<src>`` is the source directory,
+and ``<target>`` is the name of the target.
+To double check that the above command is what is needed, one can capture the
+exact TableGen command from a build by using:
+
+.. code-block:: shell
+
+ $ VERBOSE=1 make ...
+
+and search for ``llvm-tblgen`` commands in the output.
+
+
Instruction Relation Mapping
----------------------------
diff --git a/docs/WritingAnLLVMPass.rst b/docs/WritingAnLLVMPass.rst
index 54b3630e655f..41f400740e84 100644
--- a/docs/WritingAnLLVMPass.rst
+++ b/docs/WritingAnLLVMPass.rst
@@ -1032,7 +1032,7 @@ implementation for the interface.
Pass Statistics
===============
-The `Statistic <http://llvm.org/doxygen/Statistic_8h-source.html>`_ class is
+The `Statistic <http://llvm.org/doxygen/Statistic_8h_source.html>`_ class is
designed to be an easy way to expose various success metrics from passes.
These statistics are printed at the end of a run, when the :option:`-stats`
command line option is enabled on the command line. See the :ref:`Statistics
@@ -1043,7 +1043,7 @@ section <Statistic>` in the Programmer's Manual for details.
What PassManager does
---------------------
-The `PassManager <http://llvm.org/doxygen/PassManager_8h-source.html>`_ `class
+The `PassManager <http://llvm.org/doxygen/PassManager_8h_source.html>`_ `class
<http://llvm.org/doxygen/classllvm_1_1PassManager.html>`_ takes a list of
passes, ensures their :ref:`prerequisites <writing-an-llvm-pass-interaction>`
are set up correctly, and then schedules passes to run efficiently. All of the
diff --git a/docs/XRay.rst b/docs/XRay.rst
index d61e4e6d9955..ebf025678305 100644
--- a/docs/XRay.rst
+++ b/docs/XRay.rst
@@ -75,11 +75,11 @@ GCC-style attributes or C++11-style attributes.
.. code-block:: c++
- [[clang::xray_always_intrument]] void always_instrumented();
+ [[clang::xray_always_instrument]] void always_instrumented();
[[clang::xray_never_instrument]] void never_instrumented();
- void alt_always_instrumented() __attribute__((xray_always_intrument));
+ void alt_always_instrumented() __attribute__((xray_always_instrument));
void alt_never_instrumented() __attribute__((xray_never_instrument));
@@ -143,17 +143,30 @@ variable, where we list down the options and their defaults below.
| | | | instrumentation points |
| | | | before main. |
+-------------------+-----------------+---------------+------------------------+
-| xray_naive_log | ``bool`` | ``true`` | Whether to install |
-| | | | the naive log |
-| | | | implementation. |
+| xray_mode | ``const char*`` | ``""`` | Default mode to |
+| | | | install and initialize |
+| | | | before ``main``. |
+-------------------+-----------------+---------------+------------------------+
| xray_logfile_base | ``const char*`` | ``xray-log.`` | Filename base for the |
| | | | XRay logfile. |
+-------------------+-----------------+---------------+------------------------+
-| xray_fdr_log | ``bool`` | ``false`` | Whether to install the |
-| | | | Flight Data Recorder |
+| xray_naive_log | ``bool`` | ``false`` | **DEPRECATED:** Use |
+| | | | xray_mode=xray-basic |
+| | | | instead. Whether to |
+| | | | install the basic log |
+| | | | implementation. |
++-------------------+-----------------+---------------+------------------------+
+| xray_fdr_log | ``bool`` | ``false`` | **DEPRECATED:** Use |
+| | | | xray_mode=xray-fdr |
+| | | | instead. Whether to |
+| | | | install the Flight |
+| | | | Data Recorder |
| | | | (FDR) mode. |
+-------------------+-----------------+---------------+------------------------+
+| verbosity | ``int`` | ``0`` | Runtime verbosity |
+| | | | level. |
++-------------------+-----------------+---------------+------------------------+
If you choose to not use the default logging implementation that comes with the
@@ -196,6 +209,9 @@ on your application, you may set the ``xray_fdr_log`` option to ``true`` in the
``XRAY_OPTIONS`` environment variable (while also optionally setting the
``xray_naive_log`` to ``false``).
+When the buffers are flushed to disk, the result is a binary trace format
+described by `XRay FDR format <XRayFDRFormat.html>`_.
+
When FDR mode is on, it will keep writing and recycling memory buffers until
the logging implementation is finalized -- at which point it can be flushed and
re-initialised later. To do this programmatically, we follow the workflow
@@ -238,6 +254,14 @@ following API:
- ``__xray_set_log_impl(...)``: This function takes a struct of type
``XRayLogImpl``, which is defined in ``xray/xray_log_interface.h``, part of
the XRay compiler-rt installation.
+- ``__xray_log_register_mode(...)``: Register a logging implementation against
+ a string Mode. The implementation is an instance of ``XRayLogImpl`` defined
+ in ``xray/xray_log_interface.h``.
+- ``__xray_log_select_mode(...)``: Select the mode to install, associated with
+ a string Mode. Only implementations registered with
+ ``__xray_log_register_mode(...)`` can be chosen with this function. When
+ successful, has the same effects as calling ``__xray_set_log_impl(...)`` with
+ the registered logging implementation.
- ``__xray_log_init(...)``: This function allows for initializing and
re-initializing an installed logging implementation. See
``xray/xray_log_interface.h`` for details, part of the XRay compiler-rt
@@ -255,10 +279,15 @@ supports the following subcommands:
- ``account``: Performs basic function call accounting statistics with various
options for sorting, and output formats (supports CSV, YAML, and
console-friendly TEXT).
-- ``convert``: Converts an XRay log file from one format to another. Currently
- only converts to YAML.
+- ``convert``: Converts an XRay log file from one format to another. We can
+ convert from binary XRay traces (both naive and FDR mode) to YAML,
+ `flame-graph <https://github.com/brendangregg/FlameGraph>`_ friendly text
+ formats, as well as `Chrome Trace Viewer (catapult)
+ <https://github.com/catapult-project/catapult>`_ formats.
- ``graph``: Generates a DOT graph of the function call relationships between
functions found in an XRay trace.
+- ``stack``: Reconstructs function call stacks from a timeline of function
+ calls in an XRay trace.
These subcommands use various library components found as part of the XRay
libraries, distributed with the LLVM distribution. These are:
@@ -271,7 +300,7 @@ libraries, distributed with the LLVM distribution. These are:
associated with edges and vertices.
- ``llvm/XRay/InstrumentationMap.h``: A convenient tool for analyzing the
instrumentation map in XRay-instrumented object files and binaries. The
- ``extract`` subcommand uses this particular library.
+ ``extract`` and ``stack`` subcommands use this particular library.
Future Work
===========
@@ -279,13 +308,17 @@ Future Work
There are a number of ongoing efforts for expanding the toolset building around
the XRay instrumentation system.
-Trace Analysis
---------------
-
-We have more subcommands and modes that we're thinking of developing, in the
-following forms:
+Trace Analysis Tools
+--------------------
-- ``stack``: Reconstruct the function call stacks in a timeline.
+- Work is in progress to integrate with or develop tools to visualize findings
+ from an XRay trace. Particularly, the ``stack`` tool is being expanded to
+ output formats that allow graphing and exploring the duration of time in each
+ call stack.
+- With a large instrumented binary, the size of generated XRay traces can
+ quickly become unwieldy. We are working on integrating pruning techniques and
+ heuristics for the analysis tools to sift through the traces and surface only
+ relevant information.
More Platforms
--------------
diff --git a/docs/XRayExample.rst b/docs/XRayExample.rst
index 5dfb0bfaf298..f8e7d943fedd 100644
--- a/docs/XRayExample.rst
+++ b/docs/XRayExample.rst
@@ -60,7 +60,7 @@ to enable XRay at application start. To do this, XRay checks the
$ ./bin/llc input.ll
# We need to set the XRAY_OPTIONS to enable some features.
- $ XRAY_OPTIONS="patch_premain=true" ./bin/llc input.ll
+ $ XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1" ./bin/llc input.ll
==69819==XRay: Log file in 'xray-log.llc.m35qPB'
At this point we now have an XRay trace we can start analysing.
@@ -100,7 +100,7 @@ this data in a spreadsheet, we can output the results as CSV using the
If we want to get a textual representation of the raw trace we can use the
``llvm-xray convert`` tool to get YAML output. The first few lines of that
-ouput for an example trace would look like the following:
+output for an example trace would look like the following:
::
@@ -195,6 +195,70 @@ Given the above two files we can re-build by providing those two files as
arguments to clang as ``-fxray-always-instrument=always-instrument.txt`` or
``-fxray-never-instrument=never-instrument.txt``.
+The XRay stack tool
+-------------------
+
+Given a trace, and optionally an instrumentation map, the ``llvm-xray stack``
+command can be used to analyze a call stack graph constructed from the function
+call timeline.
+
+The simplest way to use the command is simply to output the top stacks by call
+count and time spent.
+
+::
+
+ $ llvm-xray stack xray-log.llc.5rqxkU -instr_map ./bin/llc
+
+ Unique Stacks: 3069
+ Top 10 Stacks by leaf sum:
+
+ Sum: 9633790
+ lvl function count sum
+ #0 main 1 58421550
+ #1 compileModule(char**, llvm::LLVMContext&) 1 51440360
+ #2 llvm::legacy::PassManagerImpl::run(llvm::Module&) 1 40535375
+ #3 llvm::FPPassManager::runOnModule(llvm::Module&) 2 39337525
+ #4 llvm::FPPassManager::runOnFunction(llvm::Function&) 6 39331465
+ #5 llvm::PMDataManager::verifyPreservedAnalysis(llvm::Pass*) 399 16628590
+ #6 llvm::PMTopLevelManager::findAnalysisPass(void const*) 4584 15155600
+ #7 llvm::PMDataManager::findAnalysisPass(void const*, bool) 32088 9633790
+
+ ..etc..
+
+In the default mode, identical stacks on different threads are independently
+aggregated. In a multithreaded program, you may end up having identical call
+stacks fill your list of top calls.
+
+To address this, you may specify the ``-aggregate-threads`` or
+``-per-thread-stacks`` flags. ``-per-thread-stacks`` treats the thread id as an
+implicit root in each call stack tree, while ``-aggregate-threads`` combines
+identical stacks from all threads.
+
+Flame Graph Generation
+----------------------
+
+The ``llvm-xray stack`` tool may also be used to generate flamegraphs for
+visualizing your instrumented invocations. The tool does not generate the graphs
+themselves, but instead generates a format that can be used with Brendan Gregg's
+FlameGraph tool, currently available on `github
+<https://github.com/brendangregg/FlameGraph>`_.
+
+To generate output for a flamegraph, a few more options are necessary.
+
+- ``-all-stacks`` - Emits all of the stacks instead of just the top stacks.
+- ``-stack-format`` - Choose the flamegraph output format 'flame'.
+- ``-aggregation-type`` - Choose the metric to graph.
+
+You may pipe the command output directly to the flamegraph tool to obtain an
+svg file.
+
+::
+
+ $ llvm-xray stack xray-log.llc.5rqxkU -instr_map ./bin/llc -stack-format=flame -aggregation-type=time -all-stacks | \
+ /path/to/FlameGraph/flamegraph.pl > flamegraph.svg
+
+If you open the svg in a browser, mouse events allow exploring the call stacks.
+
Further Exploration
-------------------
@@ -211,11 +275,11 @@ application.
#include <iostream>
#include <thread>
- [[clang::xray_always_intrument]] void f() {
+ [[clang::xray_always_instrument]] void f() {
std::cerr << '.';
}
- [[clang::xray_always_intrument]] void g() {
+ [[clang::xray_always_instrument]] void g() {
for (int i = 0; i < 1 << 10; ++i) {
std::cerr << '-';
}
diff --git a/docs/XRayFDRFormat.rst b/docs/XRayFDRFormat.rst
new file mode 100644
index 000000000000..f7942bc212df
--- /dev/null
+++ b/docs/XRayFDRFormat.rst
@@ -0,0 +1,401 @@
+======================================
+XRay Flight Data Recorder Trace Format
+======================================
+
+:Version: 1 as of 2017-07-20
+
+.. contents::
+ :local:
+
+
+Introduction
+============
+
+When gathering XRay traces in Flight Data Recorder mode, each thread of an
+application will claim buffers to fill with trace data, which at some point
+is finalized and flushed.
+
+A goal of the profiler is to minimize overhead, so the flushed data directly
+corresponds to the buffer.
+
+This document describes the format of a trace file.
+
+
+General
+=======
+
+Each trace file corresponds to a sequence of events in a particular thread.
+
+The file has a header followed by a sequence of discriminated record types.
+
+The endianness of byte fields matches the endianness of the platform which
+produced the trace file.
+
+
+Header Section
+==============
+
+A trace file begins with a 32 byte header.
+
++-------------------+-----------------+----------------------------------------+
+| Field | Size (bytes) | Description |
++===================+=================+========================================+
+| version | ``2`` | Anticipates versioned readers. This |
+| | | document describes the format when |
+| | | version == 1 |
++-------------------+-----------------+----------------------------------------+
+| type | ``2`` | An enumeration encoding the type of |
+| | | trace. Flight Data Recorder mode |
+| | | traces have type == 1 |
++-------------------+-----------------+----------------------------------------+
+| bitfield | ``4`` | Holds parameters that are not aligned |
+| | | to bytes. Further described below. |
++-------------------+-----------------+----------------------------------------+
+| cycle_frequency | ``8`` | The frequency in hertz of the CPU |
+| | | oscillator used to measure duration of |
+| | | events in ticks. |
++-------------------+-----------------+----------------------------------------+
+| buffer_size | ``8`` | The size in bytes of the data portion |
+| | | of the trace following the header. |
++-------------------+-----------------+----------------------------------------+
+| reserved | ``8`` | Reserved for future use. |
++-------------------+-----------------+----------------------------------------+
+
+The bitfield parameter of the file header is composed of the following fields.
+
++-------------------+----------------+-----------------------------------------+
+| Field | Size (bits) | Description |
++===================+================+=========================================+
+| constant_tsc | ``1`` | Whether the platform's timestamp |
+| | | counter used to record ticks between |
+| | | events ticks at a constant frequency |
+| | | despite CPU frequency changes. |
+| | | 0 == non-constant. 1 == constant. |
++-------------------+----------------+-----------------------------------------+
+| nonstop_tsc | ``1`` | Whether the tsc continues to count |
+| | | even when the CPU is in a low power |
+| | | state. 0 == stop. 1 == non-stop. |
++-------------------+----------------+-----------------------------------------+
+| reserved | ``30`` | Not meaningful. |
++-------------------+----------------+-----------------------------------------+
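As a concrete illustration of the header and bitfield layouts above, here is a minimal Python sketch that unpacks the 32-byte header. It assumes a little-endian trace; real traces use the producing platform's endianness. The function name and returned field names are illustrative, not part of any XRay API.

```python
import struct

# Hypothetical reader for the 32-byte header described above, assuming a
# little-endian trace ("<"). Field names mirror the tables above and are
# not an XRay API.
def parse_fdr_header(raw: bytes) -> dict:
    version, trace_type, bitfield, cycle_frequency, buffer_size, _reserved = \
        struct.unpack("<HHIQQQ", raw[:32])
    return {
        "version": version,            # this document describes version == 1
        "type": trace_type,            # FDR mode traces have type == 1
        "constant_tsc": bitfield & 0x1,        # bit 0 of the bitfield
        "nonstop_tsc": (bitfield >> 1) & 0x1,  # bit 1 of the bitfield
        "cycle_frequency": cycle_frequency,    # CPU oscillator frequency, Hz
        "buffer_size": buffer_size,            # size of the data section
    }

# Build a synthetic header and round-trip it through the parser.
example = struct.pack("<HHIQQQ", 1, 1, 0b11, 2_600_000_000, 4096, 0)
print(parse_fdr_header(example))
```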
+
+
+Data Section
+============
+
+Following the header in a trace is a data section with size matching the
+buffer_size field in the header.
+
+The data section is a stream of elements of different types.
+
+There are a few categories of data in the sequence.
+
+- ``Function Records``: Function Records contain the timing of entry into and
+ exit from function execution. Function Records have 8 bytes each.
+
+- ``Metadata Records``: Metadata records serve many purposes. Mostly, they
+ capture information that may be too costly to record for each function, but
+ that is required to contextualize the fine-grained timings. They also are used
+ as markers for user-defined Event Data payloads. Metadata records have 16
+ bytes each.
+
+- ``Event Data``: Free form data may be associated with events that are traced
+ by the binary and encode data defined by a handler function. Event data is
+ always preceded with a marker record which indicates how large it is.
+
+- ``Function Arguments``: The arguments to some functions are included in the
+ trace. These are either pointer addresses or primitives that are read and
+ logged independently of their types in a high level language. To the tracer,
+ they are all simply numbers. Function Records that have attached arguments
+ will indicate their presence on the function entry record. We only support
+ logging contiguous function argument sequences starting with argument zero,
+ which will be the "this" pointer for member function invocations. For example,
+ we don't support logging the first and third argument.
+
+A reader of the memory format must maintain a state machine. The format makes no
+attempt to pad for alignment, and it is not seekable.
+
+
+Function Records
+----------------
+
+Function Records have an 8 byte layout. This layout encodes information to
+reconstruct a call stack of instrumented functions and their durations.
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size (bits) | Description |
++===============+==============+===============================================+
+| discriminant | ``1`` | Indicates whether a reader should read a |
+| | | Function or Metadata record. Set to ``0`` for |
+| | | Function records. |
++---------------+--------------+-----------------------------------------------+
+| action | ``3`` | Specifies whether the function is being |
+| | | entered, exited, or is a non-standard entry |
+| | | or exit produced by optimizations. |
++---------------+--------------+-----------------------------------------------+
+| function_id | ``28`` | A numeric ID for the function. Resolved to a |
+| | | name via the xray instrumentation map. The |
+| | | instrumentation map is built by xray at |
+| | | compile time into an object file and pairs |
+| | | the function ids to addresses. It is used for |
+| | | patching and as a lookup into the binary's |
+| | | symbols to obtain names. |
++---------------+--------------+-----------------------------------------------+
+| tsc_delta | ``32`` | The number of ticks of the timestamp counter |
+| | | since a previous record recorded a delta or |
+| | | other TSC resetting event. |
++---------------+--------------+-----------------------------------------------+
+
+On little-endian machines, the bitfields are ordered from least significant bit
+to most significant bit. A reader can read an 8 bit value and apply the mask
+``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned
+shift right by 4 bits to obtain the function_id field.
+
+On big-endian machines, the bitfields are written in order from most significant
+bit to least significant bit. A reader would read an 8 bit value and unsigned
+shift right by 7 bits for the discriminant. The function_id field could be
+obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``.
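The masking and shifting described above can be sketched as follows for a little-endian host. Note that ``decodeFunctionRecord`` and ``FunctionRecordFields`` are illustrative names for this sketch, not part of the XRay runtime:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative decoder for the first four bytes of a Function record.
// Assumes a little-endian host, per the bitfield ordering described above.
struct FunctionRecordFields {
  uint8_t Discriminant; // 1 bit
  uint8_t Action;       // 3 bits
  uint32_t FunctionId;  // 28 bits
};

FunctionRecordFields decodeFunctionRecord(const uint8_t *Bytes) {
  uint32_t First;
  std::memcpy(&First, Bytes, sizeof(First)); // avoid unaligned reads
  FunctionRecordFields F;
  F.Discriminant = First & 0x01;  // low bit selects Function vs Metadata
  F.Action = (First >> 1) & 0x07; // next 3 bits
  F.FunctionId = First >> 4;      // remaining 28 bits
  return F;
}
```

The tsc_delta field would then be read from the following 4 bytes in the same manner.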
+
+Function action types are as follows.
+
++---------------+--------------+-----------------------------------------------+
+| Type | Number | Description |
++===============+==============+===============================================+
+| Entry | ``0`` | Typical function entry. |
++---------------+--------------+-----------------------------------------------+
+| Exit | ``1`` | Typical function exit. |
++---------------+--------------+-----------------------------------------------+
+| Tail_Exit | ``2`` | An exit from a function due to tail call |
+| | | optimization. |
++---------------+--------------+-----------------------------------------------+
+| Entry_Args | ``3`` | A function entry that records arguments. |
++---------------+--------------+-----------------------------------------------+
+
+Entry_Args records do not contain the arguments themselves. Instead, metadata
+records for each of the logged args follow the function record in the stream.
+
+
+Metadata Records
+----------------
+
+Interspersed throughout the buffer are 16 byte Metadata records. For typically
+instrumented binaries, they will be sparser than Function records, and they
+provide a fuller picture of the binary execution state.
+
+The layout of a Metadata record depends in part on its record kind, but all
+Metadata records share a common structure.
+
+The same bitfield rules described for Function records apply to the first byte
+of Metadata records. Within this byte, little-endian machines use lsb-to-msb
+ordering and big-endian machines use msb-to-lsb ordering.
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size | Description |
++===============+==============+===============================================+
+| discriminant | ``1 bit`` | Indicates whether a reader should read a |
+| | | Function or Metadata record. Set to ``1`` for |
+| | | Metadata records. |
++---------------+--------------+-----------------------------------------------+
+| record_kind | ``7 bits`` | The type of Metadata record. |
++---------------+--------------+-----------------------------------------------+
+| data | ``15 bytes`` | A data field used differently for each record |
+| | | type. |
++---------------+--------------+-----------------------------------------------+
+
+Here is a table of the enumerated record kinds.
+
++--------+---------------------------+
+| Number | Type |
++========+===========================+
+| 0 | NewBuffer |
++--------+---------------------------+
+| 1 | EndOfBuffer |
++--------+---------------------------+
+| 2 | NewCPUId |
++--------+---------------------------+
+| 3 | TSCWrap |
++--------+---------------------------+
+| 4 | WallTimeMarker |
++--------+---------------------------+
+| 5 | CustomEventMarker |
++--------+---------------------------+
+| 6 | CallArgument |
++--------+---------------------------+
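A reader might mirror the table above with an enumeration. This is an illustrative definition for a hypothetical reader, not the one used by the XRay runtime:

```cpp
#include <cassert>
#include <cstdint>

// Record kinds stored in the 7-bit record_kind field of a Metadata record,
// numbered as in the table above.
enum class MetadataRecordKind : uint8_t {
  NewBuffer = 0,
  EndOfBuffer = 1,
  NewCPUId = 2,
  TSCWrap = 3,
  WallTimeMarker = 4,
  CustomEventMarker = 5,
  CallArgument = 6,
};
```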
+
+
+NewBuffer Records
+-----------------
+
+Each buffer begins with a NewBuffer record immediately after the header.
+It records the thread ID of the thread that the trace belongs to.
+
+Its data segment is as follows.
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size (bytes) | Description |
++===============+==============+===============================================+
+| thread_Id | ``2`` | Thread ID for buffer. |
++---------------+--------------+-----------------------------------------------+
+| reserved | ``13`` | Unused. |
++---------------+--------------+-----------------------------------------------+
+
+
+WallClockTime Records
+---------------------
+
+Following the NewBuffer record, each buffer records an absolute time as a frame
+of reference for the durations recorded by timestamp counter deltas.
+
+Its data segment is as follows.
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size (bytes) | Description |
++===============+==============+===============================================+
+| seconds | ``8`` | Seconds on absolute timescale. The starting |
+| | | point is unspecified and depends on the |
+| | | implementation and platform configured by the |
+| | | tracer. |
++---------------+--------------+-----------------------------------------------+
+| microseconds | ``4`` | The microsecond component of the time. |
++---------------+--------------+-----------------------------------------------+
+| reserved | ``3`` | Unused. |
++---------------+--------------+-----------------------------------------------+
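A reader will typically fold the two fields above into a single timestamp. A minimal sketch (the helper name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Combine the seconds and microseconds fields of a WallClockTime record
// into a single microsecond-resolution timestamp.
uint64_t wallTimeToMicroseconds(uint64_t Seconds, uint32_t Microseconds) {
  return Seconds * 1000000ULL + Microseconds;
}
```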
+
+
+NewCpuId Records
+----------------
+
+Each function entry invokes a routine to determine what CPU is executing.
+Typically, this is done with the ``rdtscp`` instruction, which reads the
+timestamp counter at the same time.
+
+If the tracing detects that the execution has switched CPUs or if this is the
+first instrumented entry point, the tracer will output a NewCpuId record.
+
+Its data segment is as follows.
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size (bytes) | Description |
++===============+==============+===============================================+
+| cpu_id | ``2`` | CPU Id. |
++---------------+--------------+-----------------------------------------------+
+| absolute_tsc | ``8`` | The absolute value of the timestamp counter. |
++---------------+--------------+-----------------------------------------------+
+| reserved | ``5`` | Unused. |
++---------------+--------------+-----------------------------------------------+
+
+
+TSCWrap Records
+---------------
+
+Since each function record uses a 32 bit value to represent the number of ticks
+of the timestamp counter since the last reference, it is possible for this value
+to overflow, particularly for sparsely instrumented binaries.
+
+When this delta would not fit into a 32 bit representation, a reference absolute
+timestamp counter record is written in the form of a TSCWrap record.
+
+Its data segment is as follows.
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size (bytes) | Description |
++===============+==============+===============================================+
+| absolute_tsc | ``8`` | Timestamp counter value. |
++---------------+--------------+-----------------------------------------------+
+| reserved | ``7`` | Unused. |
++---------------+--------------+-----------------------------------------------+
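Putting the delta and absolute-TSC records together, a reader can maintain the current timestamp as a running value: NewCpuId and TSCWrap records reset it, and each Function record adds its 32 bit delta. A sketch, with illustrative names:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative reader state: tracks the current absolute timestamp counter
// while walking a buffer.
struct TscState {
  uint64_t CurrentTsc = 0;

  // Called for NewCpuId and TSCWrap records, which carry an absolute TSC.
  void reset(uint64_t AbsoluteTsc) { CurrentTsc = AbsoluteTsc; }

  // Called for each Function record; returns the reconstructed absolute TSC
  // for that record.
  uint64_t apply(uint32_t TscDelta) {
    CurrentTsc += TscDelta;
    return CurrentTsc;
  }
};
```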
+
+
+CallArgument Records
+--------------------
+
+Immediately following an Entry_Args type function record, there may be one or
+more CallArgument records that contain the traced function's parameter values.
+
+The order of the CallArgument record sequence corresponds one-to-one with the
+order of the function parameters.
+
+CallArgument data segment:
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size (bytes) | Description |
++===============+==============+===============================================+
+| argument | ``8`` | Numeric argument (may be pointer address). |
++---------------+--------------+-----------------------------------------------+
+| reserved | ``7`` | Unused. |
++---------------+--------------+-----------------------------------------------+
+
+
+CustomEventMarker Records
+-------------------------
+
+XRay provides the feature of logging custom events. This may be leveraged to
+record tracing info for RPCs or similar application-specific trace data.
+
+Custom Events themselves are an unstructured (application defined) segment of
+memory with arbitrary size within the buffer. They are preceded by
+CustomEventMarkers to indicate their presence and size.
+
+CustomEventMarker data segment:
+
++---------------+--------------+-----------------------------------------------+
+| Field | Size (bytes) | Description |
++===============+==============+===============================================+
+| event_size    | ``4``        | Size of the event data that follows.          |
++---------------+--------------+-----------------------------------------------+
+| absolute_tsc | ``8`` | A timestamp counter of the event. |
++---------------+--------------+-----------------------------------------------+
+| reserved | ``3`` | Unused. |
++---------------+--------------+-----------------------------------------------+
+
+
+EndOfBuffer Records
+-------------------
+
+An EndOfBuffer record indicates that there is no more trace data in this
+buffer. The reader is expected to seek past the remainder of the buffer_size
+recorded in the header before the start of the buffer, and look for either
+another header or EOF.
+
+
+Format Grammar and Invariants
+=============================
+
+Not all sequences of Metadata records and Function records are valid data. A
+sequence should be parsed as a state machine. The expectations for a valid
+format can be expressed as a context-free grammar.
+
+This is an attempt to explain the format with statements in EBNF format.
+
+- Format := Header ThreadBuffer* EOF
+
+- ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End
+
+- BodySequence := NewCPUId | TSCWrap | Function | CustomEvent
+
+- Function := (Function_Entry_Args CallArgument*) | Function_Other_Type
+
+- CustomEvent := CustomEventMarker CustomEventUnstructuredMemory
+
+- End := EndOfBuffer RemainingBufferSizeToSkip
+
+
+Function Record Order
+---------------------
+
+There are a few clarifications that may help understand what is expected of
+Function records.
+
+- Functions with an Exit are expected to have a corresponding Entry or
+ Entry_Args function record precede them in the trace.
+
+- Tail_Exit Function records record the Function ID of the function whose
+ return address the program counter will jump to. In other words, the final
+ function that would be popped off of the call stack if tail call optimization
+ was not used.
+
+- Not all functions marked for instrumentation are necessarily in the trace. The
+ tracer uses heuristics to preserve the trace for non-trivial functions.
+
+- Not every entry must have a traced Exit or Tail Exit. The buffer may run out
+ of space, or the program may request that the tracer finalize and return the
+ buffer before an instrumented function exits.
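The pairing rules above suggest a simple stack-based call-stack reconstruction. The following is a sketch only; a real reader must also handle Tail_Exit records and entries left unmatched when a buffer ends:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative call-stack reconstruction: push on Entry/Entry_Args, pop on
// Exit. Entries still on the stack at end-of-buffer are tolerated, per the
// invariants above.
struct CallStackBuilder {
  std::vector<uint32_t> Stack;

  void onEntry(uint32_t FuncId) { Stack.push_back(FuncId); }

  // Returns false if the Exit has no matching Entry, which indicates an
  // invalid trace.
  bool onExit(uint32_t FuncId) {
    if (Stack.empty() || Stack.back() != FuncId)
      return false;
    Stack.pop_back();
    return true;
  }
};
```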
diff --git a/docs/YamlIO.rst b/docs/YamlIO.rst
index 0b728ed8ec1e..4c07820b6f99 100644
--- a/docs/YamlIO.rst
+++ b/docs/YamlIO.rst
@@ -466,7 +466,7 @@ looks like:
return StringRef();
}
// Determine if this scalar needs quotes.
- static bool mustQuote(StringRef) { return true; }
+ static QuotingType mustQuote(StringRef) { return QuotingType::Single; }
};
Block Scalars
diff --git a/docs/conf.py b/docs/conf.py
index e7c18da48ebe..92eb9813ecf9 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -48,9 +48,9 @@ copyright = u'2003-%d, LLVM Project' % date.today().year
# built documents.
#
# The short version.
-version = '5'
+version = '6'
# The full version, including alpha/beta/rc tags.
-release = '5'
+release = '6'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
diff --git a/docs/index.rst b/docs/index.rst
index 5bc2368def51..47c2f0473931 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,6 +1,11 @@
Overview
========
+.. warning::
+
+ If you are using a released version of LLVM, see `the download page
+ <http://llvm.org/releases/>`_ to find your documentation.
+
The LLVM compiler infrastructure supports a wide range of projects, from
industrial strength compilers to specialized JIT applications to small
research projects.
@@ -63,6 +68,7 @@ representation.
CMakePrimer
AdvancedBuilds
HowToBuildOnARM
+ HowToCrossCompileBuiltinsOnArm
HowToCrossCompileLLVM
CommandGuide/index
GettingStarted
@@ -100,6 +106,9 @@ representation.
:doc:`HowToBuildOnARM`
Notes on building and testing LLVM/Clang on ARM.
+:doc:`HowToCrossCompileBuiltinsOnArm`
+ Notes on cross-building and testing the compiler-rt builtins for Arm.
+
:doc:`HowToCrossCompileLLVM`
Notes on cross-building and testing LLVM/Clang.
@@ -154,7 +163,7 @@ representation.
misunderstood instruction.
:doc:`Frontend/PerformanceTips`
- A collection of tips for frontend authors on how to generate IR
+ A collection of tips for frontend authors on how to generate IR
which LLVM is able to effectively optimize.
:doc:`Docker`
@@ -178,6 +187,7 @@ For developers of applications which use LLVM as a library.
ProgrammersManual
Extensions
LibFuzzer
+ FuzzingLLVM
ScudoHardenedAllocator
OptBisect
@@ -211,7 +221,6 @@ For developers of applications which use LLVM as a library.
`Doxygen generated documentation <http://llvm.org/doxygen/>`_
(`classes <http://llvm.org/doxygen/inherits.html>`_)
- (`tarball <http://llvm.org/doxygen/doxygen.tar.gz>`_)
`Documentation for Go bindings <http://godoc.org/llvm.org/llvm/bindings/go/llvm>`_
@@ -224,6 +233,9 @@ For developers of applications which use LLVM as a library.
:doc:`LibFuzzer`
A library for writing in-process guided fuzzers.
+:doc:`FuzzingLLVM`
+ Information on writing and using Fuzzers to find bugs in LLVM.
+
:doc:`ScudoHardenedAllocator`
A library that implements a security-hardened `malloc()`.
@@ -275,7 +287,9 @@ For API clients and LLVM developers.
GlobalISel
XRay
XRayExample
+ XRayFDRFormat
PDB/index
+ CFIVerify
:doc:`WritingAnLLVMPass`
Information on how to write LLVM transformations and analyses.
@@ -406,6 +420,9 @@ For API clients and LLVM developers.
:doc:`The Microsoft PDB File Format <PDB/index>`
A detailed description of the Microsoft PDB (Program Database) file format.
+:doc:`CFIVerify`
+ A description of the verification tool for Control Flow Integrity.
+
Development Process Documentation
=================================
diff --git a/docs/tutorial/BuildingAJIT1.rst b/docs/tutorial/BuildingAJIT1.rst
index 88f7aa5abbc7..9d7f50477836 100644
--- a/docs/tutorial/BuildingAJIT1.rst
+++ b/docs/tutorial/BuildingAJIT1.rst
@@ -75,8 +75,7 @@ will look like:
std::unique_ptr<Module> M = buildModule();
JIT J;
Handle H = J.addModule(*M);
- int (*Main)(int, char*[]) =
- (int(*)(int, char*[])J.findSymbol("main").getAddress();
+ int (*Main)(int, char*[]) = (int(*)(int, char*[]))J.getSymbolAddress("main");
int Result = Main();
J.removeModule(H);
@@ -111,14 +110,24 @@ usual include guards and #includes [2]_, we get to the definition of our class:
#ifndef LLVM_EXECUTIONENGINE_ORC_KALEIDOSCOPEJIT_H
#define LLVM_EXECUTIONENGINE_ORC_KALEIDOSCOPEJIT_H
+ #include "llvm/ADT/STLExtras.h"
#include "llvm/ExecutionEngine/ExecutionEngine.h"
+ #include "llvm/ExecutionEngine/JITSymbol.h"
#include "llvm/ExecutionEngine/RTDyldMemoryManager.h"
+ #include "llvm/ExecutionEngine/SectionMemoryManager.h"
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/LambdaResolver.h"
- #include "llvm/ExecutionEngine/Orc/ObjectLinkingLayer.h"
+ #include "llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h"
+ #include "llvm/IR/DataLayout.h"
#include "llvm/IR/Mangler.h"
#include "llvm/Support/DynamicLibrary.h"
+ #include "llvm/Support/raw_ostream.h"
+ #include "llvm/Target/TargetMachine.h"
+ #include <algorithm>
+ #include <memory>
+ #include <string>
+ #include <vector>
namespace llvm {
namespace orc {
@@ -127,38 +136,39 @@ usual include guards and #includes [2]_, we get to the definition of our class:
private:
std::unique_ptr<TargetMachine> TM;
const DataLayout DL;
- ObjectLinkingLayer<> ObjectLayer;
- IRCompileLayer<decltype(ObjectLayer)> CompileLayer;
+ RTDyldObjectLinkingLayer ObjectLayer;
+ IRCompileLayer<decltype(ObjectLayer), SimpleCompiler> CompileLayer;
public:
- typedef decltype(CompileLayer)::ModuleSetHandleT ModuleHandleT;
+ using ModuleHandle = decltype(CompileLayer)::ModuleHandleT;
-Our class begins with four members: A TargetMachine, TM, which will be used
-to build our LLVM compiler instance; A DataLayout, DL, which will be used for
+Our class begins with four members: A TargetMachine, TM, which will be used to
+build our LLVM compiler instance; A DataLayout, DL, which will be used for
symbol mangling (more on that later), and two ORC *layers*: an
-ObjectLinkingLayer and a IRCompileLayer. We'll be talking more about layers in
-the next chapter, but for now you can think of them as analogous to LLVM
+RTDyldObjectLinkingLayer and a CompileLayer. We'll be talking more about layers
+in the next chapter, but for now you can think of them as analogous to LLVM
Passes: they wrap up useful JIT utilities behind an easy to compose interface.
-The first layer, ObjectLinkingLayer, is the foundation of our JIT: it takes
-in-memory object files produced by a compiler and links them on the fly to make
-them executable. This JIT-on-top-of-a-linker design was introduced in MCJIT,
-however the linker was hidden inside the MCJIT class. In ORC we expose the
-linker so that clients can access and configure it directly if they need to. In
-this tutorial our ObjectLinkingLayer will just be used to support the next layer
-in our stack: the IRCompileLayer, which will be responsible for taking LLVM IR,
-compiling it, and passing the resulting in-memory object files down to the
-object linking layer below.
+The first layer, ObjectLayer, is the foundation of our JIT: it takes in-memory
+object files produced by a compiler and links them on the fly to make them
+executable. This JIT-on-top-of-a-linker design was introduced in MCJIT, however
+the linker was hidden inside the MCJIT class. In ORC we expose the linker so
+that clients can access and configure it directly if they need to. In this
+tutorial our ObjectLayer will just be used to support the next layer in our
+stack: the CompileLayer, which will be responsible for taking LLVM IR, compiling
+it, and passing the resulting in-memory object files down to the object linking
+layer below.
That's it for member variables, after that we have a single typedef:
-ModuleHandleT. This is the handle type that will be returned from our JIT's
+ModuleHandle. This is the handle type that will be returned from our JIT's
addModule method, and can be passed to the removeModule method to remove a
module. The IRCompileLayer class already provides a convenient handle type
-(IRCompileLayer::ModuleSetHandleT), so we just alias our ModuleHandleT to this.
+(IRCompileLayer::ModuleHandleT), so we just alias our ModuleHandle to this.
.. code-block:: c++
KaleidoscopeJIT()
: TM(EngineBuilder().selectTarget()), DL(TM->createDataLayout()),
+ ObjectLayer([]() { return std::make_shared<SectionMemoryManager>(); }),
CompileLayer(ObjectLayer, SimpleCompiler(*TM)) {
llvm::sys::DynamicLibrary::LoadLibraryPermanently(nullptr);
}
@@ -166,17 +176,22 @@ module. The IRCompileLayer class already provides a convenient handle type
TargetMachine &getTargetMachine() { return *TM; }
Next up we have our class constructor. We begin by initializing TM using the
-EngineBuilder::selectTarget helper method, which constructs a TargetMachine for
-the current process. Next we use our newly created TargetMachine to initialize
-DL, our DataLayout. Then we initialize our IRCompileLayer. Our IRCompile layer
-needs two things: (1) A reference to our object linking layer, and (2) a
-compiler instance to use to perform the actual compilation from IR to object
-files. We use the off-the-shelf SimpleCompiler instance for now. Finally, in
-the body of the constructor, we call the DynamicLibrary::LoadLibraryPermanently
-method with a nullptr argument. Normally the LoadLibraryPermanently method is
-called with the path of a dynamic library to load, but when passed a null
-pointer it will 'load' the host process itself, making its exported symbols
-available for execution.
+EngineBuilder::selectTarget helper method which constructs a TargetMachine for
+the current process. Then we use our newly created TargetMachine to initialize
+DL, our DataLayout. After that we need to initialize our ObjectLayer. The
+ObjectLayer requires a function object that will build a JIT memory manager for
+each module that is added (a JIT memory manager manages memory allocations,
+memory permissions, and registration of exception handlers for JIT'd code). For
+this we use a lambda that returns a SectionMemoryManager, an off-the-shelf
+utility that provides all the basic memory management functionality required for
+this chapter. Next we initialize our CompileLayer. The CompileLayer needs two
+things: (1) A reference to our object layer, and (2) a compiler instance to use
+to perform the actual compilation from IR to object files. We use the
+off-the-shelf SimpleCompiler instance for now. Finally, in the body of the
+constructor, we call the DynamicLibrary::LoadLibraryPermanently method with a
+nullptr argument. Normally the LoadLibraryPermanently method is called with the
+path of a dynamic library to load, but when passed a null pointer it will 'load'
+the host process itself, making its exported symbols available for execution.
.. code-block:: c++
@@ -191,48 +206,36 @@ available for execution.
return Sym;
return JITSymbol(nullptr);
},
- [](const std::string &S) {
+ [](const std::string &Name) {
if (auto SymAddr =
RTDyldMemoryManager::getSymbolAddressInProcess(Name))
return JITSymbol(SymAddr, JITSymbolFlags::Exported);
return JITSymbol(nullptr);
});
- // Build a singleton module set to hold our module.
- std::vector<std::unique_ptr<Module>> Ms;
- Ms.push_back(std::move(M));
-
// Add the set to the JIT with the resolver we created above and a newly
// created SectionMemoryManager.
- return CompileLayer.addModuleSet(std::move(Ms),
- make_unique<SectionMemoryManager>(),
- std::move(Resolver));
+ return cantFail(CompileLayer.addModule(std::move(M),
+ std::move(Resolver)));
}
Now we come to the first of our JIT API methods: addModule. This method is
responsible for adding IR to the JIT and making it available for execution. In
this initial implementation of our JIT we will make our modules "available for
-execution" by adding them straight to the IRCompileLayer, which will
-immediately compile them. In later chapters we will teach our JIT to be lazier
-and instead add the Modules to a "pending" list to be compiled if and when they
-are first executed.
-
-To add our module to the IRCompileLayer we need to supply two auxiliary objects
-(as well as the module itself): a memory manager and a symbol resolver. The
-memory manager will be responsible for managing the memory allocated to JIT'd
-machine code, setting memory permissions, and registering exception handling
-tables (if the JIT'd code uses exceptions). For our memory manager we will use
-the SectionMemoryManager class: another off-the-shelf utility that provides all
-the basic functionality we need. The second auxiliary class, the symbol
-resolver, is more interesting for us. It exists to tell the JIT where to look
-when it encounters an *external symbol* in the module we are adding. External
+execution" by adding them straight to the CompileLayer, which will immediately
+compile them. In later chapters we will teach our JIT to defer compilation
+of individual functions until they're actually called.
+
+To add our module to the CompileLayer we need to supply both the module and a
+symbol resolver. The symbol resolver is responsible for supplying the JIT with
+an address for each *external symbol* in the module we are adding. External
symbols are any symbol not defined within the module itself, including calls to
functions outside the JIT and calls to functions defined in other modules that
-have already been added to the JIT. It may seem as though modules added to the
-JIT should "know about one another" by default, but since we would still have to
+have already been added to the JIT. (It may seem as though modules added to the
+JIT should know about one another by default, but since we would still have to
supply a symbol resolver for references to code outside the JIT it turns out to
-be easier to just re-use this one mechanism for all symbol resolution. This has
-the added benefit that the user has full control over the symbol resolution
+be easier to re-use this one mechanism for all symbol resolution.) This has the
+added benefit that the user has full control over the symbol resolution
process. Should we search for definitions within the JIT first, then fall back
on external definitions? Or should we prefer external definitions where
available and only JIT code if we don't already have an available
@@ -263,12 +266,13 @@ symbol definition via either of these paths, the JIT will refuse to accept our
module, returning a "symbol not found" error.
Now that we've built our symbol resolver, we're ready to add our module to the
-JIT. We do this by calling the CompileLayer's addModuleSet method [4]_. Since
-we only have a single Module and addModuleSet expects a collection, we will
-create a vector of modules and add our module as the only member. Since we
-have already typedef'd our ModuleHandleT type to be the same as the
-CompileLayer's handle type, we can return the handle from addModuleSet
-directly from our addModule method.
+JIT. We do this by calling the CompileLayer's addModule method. The addModule
+method returns an ``Expected<CompileLayer::ModuleHandle>``, since in more
+advanced JIT configurations it could fail. In our basic configuration we know
+that it will always succeed so we use the cantFail utility to assert that no
+error occurred, and extract the handle value. Since we have already typedef'd
+our ModuleHandle type to be the same as the CompileLayer's handle type, we can
+return the unwrapped handle directly.
.. code-block:: c++
@@ -279,19 +283,29 @@ directly from our addModule method.
return CompileLayer.findSymbol(MangledNameStream.str(), true);
}
+ JITTargetAddress getSymbolAddress(const std::string Name) {
+ return cantFail(findSymbol(Name).getAddress());
+ }
+
void removeModule(ModuleHandle H) {
- CompileLayer.removeModuleSet(H);
+ cantFail(CompileLayer.removeModule(H));
}
Now that we can add code to our JIT, we need a way to find the symbols we've
-added to it. To do that we call the findSymbol method on our IRCompileLayer,
-but with a twist: We have to *mangle* the name of the symbol we're searching
-for first. The reason for this is that the ORC JIT components use mangled
-symbols internally the same way a static compiler and linker would, rather
-than using plain IR symbol names. The kind of mangling will depend on the
-DataLayout, which in turn depends on the target platform. To allow us to
-remain portable and search based on the un-mangled name, we just re-produce
-this mangling ourselves.
+added to it. To do that we call the findSymbol method on our CompileLayer, but
+with a twist: We have to *mangle* the name of the symbol we're searching for
+first. The ORC JIT components use mangled symbols internally the same way a
+static compiler and linker would, rather than using plain IR symbol names. This
+allows JIT'd code to interoperate easily with precompiled code in the
+application or shared libraries. The kind of mangling will depend on the
+DataLayout, which in turn depends on the target platform. To allow us to remain
+portable and search based on the un-mangled name, we just re-produce this
+mangling ourselves.
+
+Next we have a convenience function, getSymbolAddress, which returns the address
+of a given symbol. Like CompileLayer's addModule function, JITSymbol's getAddress
+function is allowed to fail [4]_, however we know that it will not in our simple
+example, so we wrap it in a call to cantFail.
We now come to the last method in our JIT API: removeModule. This method is
responsible for destructing the MemoryManager and SymbolResolver that were
@@ -302,7 +316,10 @@ treated as a duplicate definition when the next top-level expression is
entered. It is generally good to free any module that you know you won't need
to call further, just to free up the resources dedicated to it. However, you
don't strictly need to do this: All resources will be cleaned up when your
-JIT class is destructed, if they haven't been freed before then.
+JIT class is destructed, if they haven't been freed before then. Like
+``CompileLayer::addModule`` and ``JITSymbol::getAddress``, removeModule may
+fail in general but will never fail in our example, so we wrap it in a call to
+cantFail.
This brings us to the end of Chapter 1 of Building a JIT. You now have a basic
but fully functioning JIT stack that you can use to take LLVM IR and make it
@@ -321,7 +338,7 @@ example, use:
.. code-block:: bash
# Compile
- clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orc native` -O3 -o toy
+ clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orcjit native` -O3 -o toy
# Run
./toy
@@ -337,36 +354,45 @@ Here is the code:
left as an exercise for the reader. (The KaleidoscopeJIT.h used in the
original tutorials will be a helpful reference).
-.. [2] +-----------------------+-----------------------------------------------+
- | File | Reason for inclusion |
- +=======================+===============================================+
- | ExecutionEngine.h | Access to the EngineBuilder::selectTarget |
- | | method. |
- +-----------------------+-----------------------------------------------+
- | | Access to the |
- | RTDyldMemoryManager.h | RTDyldMemoryManager::getSymbolAddressInProcess|
- | | method. |
- +-----------------------+-----------------------------------------------+
- | CompileUtils.h | Provides the SimpleCompiler class. |
- +-----------------------+-----------------------------------------------+
- | IRCompileLayer.h | Provides the IRCompileLayer class. |
- +-----------------------+-----------------------------------------------+
- | | Access the createLambdaResolver function, |
- | LambdaResolver.h | which provides easy construction of symbol |
- | | resolvers. |
- +-----------------------+-----------------------------------------------+
- | ObjectLinkingLayer.h | Provides the ObjectLinkingLayer class. |
- +-----------------------+-----------------------------------------------+
- | Mangler.h | Provides the Mangler class for platform |
- | | specific name-mangling. |
- +-----------------------+-----------------------------------------------+
- | DynamicLibrary.h | Provides the DynamicLibrary class, which |
- | | makes symbols in the host process searchable. |
- +-----------------------+-----------------------------------------------+
+.. [2] +-----------------------------+-----------------------------------------------+
+ | File | Reason for inclusion |
+ +=============================+===============================================+
+ | STLExtras.h | LLVM utilities that are useful when working |
+ | | with the STL. |
+ +-----------------------------+-----------------------------------------------+
+ | ExecutionEngine.h | Access to the EngineBuilder::selectTarget |
+ | | method. |
+ +-----------------------------+-----------------------------------------------+
+ | | Access to the |
+ | RTDyldMemoryManager.h | RTDyldMemoryManager::getSymbolAddressInProcess|
+ | | method. |
+ +-----------------------------+-----------------------------------------------+
+ | CompileUtils.h | Provides the SimpleCompiler class. |
+ +-----------------------------+-----------------------------------------------+
+ | IRCompileLayer.h | Provides the IRCompileLayer class. |
+ +-----------------------------+-----------------------------------------------+
+ | | Access the createLambdaResolver function, |
+ | LambdaResolver.h | which provides easy construction of symbol |
+ | | resolvers. |
+ +-----------------------------+-----------------------------------------------+
+ | RTDyldObjectLinkingLayer.h | Provides the RTDyldObjectLinkingLayer class. |
+ +-----------------------------+-----------------------------------------------+
+ | Mangler.h | Provides the Mangler class for platform |
+ | | specific name-mangling. |
+ +-----------------------------+-----------------------------------------------+
+ | DynamicLibrary.h | Provides the DynamicLibrary class, which |
+ | | makes symbols in the host process searchable. |
+ +-----------------------------+-----------------------------------------------+
+ | | A fast output stream class. We use the |
+ | raw_ostream.h | raw_string_ostream subclass for symbol |
+       |                             | mangling.                                     |
+ +-----------------------------+-----------------------------------------------+
+ | TargetMachine.h | LLVM target machine description class. |
+ +-----------------------------+-----------------------------------------------+
.. [3] Actually they don't have to be lambdas, any object with a call operator
will do, including plain old functions or std::functions.
-.. [4] ORC layers accept sets of Modules, rather than individual ones, so that
- all Modules in the set could be co-located by the memory manager, though
- this feature is not yet implemented.
+.. [4] ``JITSymbol::getAddress`` will force the JIT to compile the definition of
+       the symbol if it hasn't already been compiled, and since the compilation
+       process can fail, getAddress must be able to return this failure.
diff --git a/docs/tutorial/BuildingAJIT2.rst b/docs/tutorial/BuildingAJIT2.rst
index 2f22bdad6c14..f1861033cc79 100644
--- a/docs/tutorial/BuildingAJIT2.rst
+++ b/docs/tutorial/BuildingAJIT2.rst
@@ -46,7 +46,7 @@ Chapter 1 and compose an ORC *IRTransformLayer* on top. We will look at how the
IRTransformLayer works in more detail below, but the interface is simple: the
constructor for this layer takes a reference to the layer below (as all layers
do) plus an *IR optimization function* that it will apply to each Module that
-is added via addModuleSet:
+is added via addModule:
.. code-block:: c++
@@ -54,19 +54,20 @@ is added via addModuleSet:
private:
std::unique_ptr<TargetMachine> TM;
const DataLayout DL;
- ObjectLinkingLayer<> ObjectLayer;
+ RTDyldObjectLinkingLayer<> ObjectLayer;
IRCompileLayer<decltype(ObjectLayer)> CompileLayer;
- typedef std::function<std::unique_ptr<Module>(std::unique_ptr<Module>)>
- OptimizeFunction;
+ using OptimizeFunction =
+ std::function<std::shared_ptr<Module>(std::shared_ptr<Module>)>;
IRTransformLayer<decltype(CompileLayer), OptimizeFunction> OptimizeLayer;
public:
- typedef decltype(OptimizeLayer)::ModuleSetHandleT ModuleHandle;
+ using ModuleHandle = decltype(OptimizeLayer)::ModuleHandleT;
KaleidoscopeJIT()
: TM(EngineBuilder().selectTarget()), DL(TM->createDataLayout()),
+ ObjectLayer([]() { return std::make_shared<SectionMemoryManager>(); }),
CompileLayer(ObjectLayer, SimpleCompiler(*TM)),
OptimizeLayer(CompileLayer,
-                  [this](std::unique_ptr<Module> M) {
+                  [this](std::shared_ptr<Module> M) {
@@ -101,9 +102,8 @@ define below.
.. code-block:: c++
// ...
- return OptimizeLayer.addModuleSet(std::move(Ms),
- make_unique<SectionMemoryManager>(),
- std::move(Resolver));
+ return cantFail(OptimizeLayer.addModule(std::move(M),
+ std::move(Resolver)));
// ...
.. code-block:: c++
@@ -115,17 +115,17 @@ define below.
.. code-block:: c++
// ...
- OptimizeLayer.removeModuleSet(H);
+ cantFail(OptimizeLayer.removeModule(H));
// ...
Next we need to replace references to 'CompileLayer' with references to
OptimizeLayer in our key methods: addModule, findSymbol, and removeModule. In
addModule we need to be careful to replace both references: the findSymbol call
-inside our resolver, and the call through to addModuleSet.
+inside our resolver, and the call through to addModule.
.. code-block:: c++
- std::unique_ptr<Module> optimizeModule(std::unique_ptr<Module> M) {
+ std::shared_ptr<Module> optimizeModule(std::shared_ptr<Module> M) {
// Create a function pass manager.
auto FPM = llvm::make_unique<legacy::FunctionPassManager>(M.get());
@@ -166,37 +166,30 @@ implementations of the layer concept that can be devised:
template <typename BaseLayerT, typename TransformFtor>
class IRTransformLayer {
public:
- typedef typename BaseLayerT::ModuleSetHandleT ModuleSetHandleT;
+ using ModuleHandleT = typename BaseLayerT::ModuleHandleT;
IRTransformLayer(BaseLayerT &BaseLayer,
TransformFtor Transform = TransformFtor())
: BaseLayer(BaseLayer), Transform(std::move(Transform)) {}
- template <typename ModuleSetT, typename MemoryManagerPtrT,
- typename SymbolResolverPtrT>
- ModuleSetHandleT addModuleSet(ModuleSetT Ms,
- MemoryManagerPtrT MemMgr,
- SymbolResolverPtrT Resolver) {
-
- for (auto I = Ms.begin(), E = Ms.end(); I != E; ++I)
- *I = Transform(std::move(*I));
-
- return BaseLayer.addModuleSet(std::move(Ms), std::move(MemMgr),
- std::move(Resolver));
+ Expected<ModuleHandleT>
+ addModule(std::shared_ptr<Module> M,
+ std::shared_ptr<JITSymbolResolver> Resolver) {
+ return BaseLayer.addModule(Transform(std::move(M)), std::move(Resolver));
}
- void removeModuleSet(ModuleSetHandleT H) { BaseLayer.removeModuleSet(H); }
+ void removeModule(ModuleHandleT H) { BaseLayer.removeModule(H); }
JITSymbol findSymbol(const std::string &Name, bool ExportedSymbolsOnly) {
return BaseLayer.findSymbol(Name, ExportedSymbolsOnly);
}
- JITSymbol findSymbolIn(ModuleSetHandleT H, const std::string &Name,
+ JITSymbol findSymbolIn(ModuleHandleT H, const std::string &Name,
bool ExportedSymbolsOnly) {
return BaseLayer.findSymbolIn(H, Name, ExportedSymbolsOnly);
}
- void emitAndFinalize(ModuleSetHandleT H) {
+ void emitAndFinalize(ModuleHandleT H) {
BaseLayer.emitAndFinalize(H);
}
@@ -215,14 +208,14 @@ comments. It is a template class with two template arguments: ``BaseLayerT`` and
``TransformFtor`` that provide the type of the base layer and the type of the
"transform functor" (in our case a std::function) respectively. This class is
concerned with two very simple jobs: (1) Running every IR Module that is added
-with addModuleSet through the transform functor, and (2) conforming to the ORC
+with addModule through the transform functor, and (2) conforming to the ORC
layer interface. The interface consists of one typedef and five methods:
+------------------+-----------------------------------------------------------+
| Interface | Description |
+==================+===========================================================+
| | Provides a handle that can be used to identify a module |
-| ModuleSetHandleT | set when calling findSymbolIn, removeModuleSet, or |
+| ModuleHandleT | set when calling findSymbolIn, removeModule, or |
| | emitAndFinalize. |
+------------------+-----------------------------------------------------------+
| | Takes a given set of Modules and makes them "available |
@@ -231,28 +224,28 @@ layer interface. The interface consists of one typedef and five methods:
| | the address of the symbols should be read/writable (for |
| | data symbols), or executable (for function symbols) after |
| | JITSymbol::getAddress() is called. Note: This means that |
-| addModuleSet | addModuleSet doesn't have to compile (or do any other |
+| addModule | addModule doesn't have to compile (or do any other |
| | work) up-front. It *can*, like IRCompileLayer, act |
| | eagerly, but it can also simply record the module and |
| | take no further action until somebody calls |
| | JITSymbol::getAddress(). In IRTransformLayer's case |
-| | addModuleSet eagerly applies the transform functor to |
+| | addModule eagerly applies the transform functor to |
| | each module in the set, then passes the resulting set |
| | of mutated modules down to the layer below. |
+------------------+-----------------------------------------------------------+
| | Removes a set of modules from the JIT. Code or data |
-| removeModuleSet | defined in these modules will no longer be available, and |
+| removeModule | defined in these modules will no longer be available, and |
| | the memory holding the JIT'd definitions will be freed. |
+------------------+-----------------------------------------------------------+
| | Searches for the named symbol in all modules that have |
-| | previously been added via addModuleSet (and not yet |
-| findSymbol | removed by a call to removeModuleSet). In |
+| | previously been added via addModule (and not yet |
+| findSymbol | removed by a call to removeModule). In |
| | IRTransformLayer we just pass the query on to the layer |
| | below. In our REPL this is our default way to search for |
| | function definitions. |
+------------------+-----------------------------------------------------------+
| | Searches for the named symbol in the module set indicated |
-| | by the given ModuleSetHandleT. This is just an optimized |
+| | by the given ModuleHandleT. This is just an optimized |
| | search, better for lookup-speed when you know exactly |
| | a symbol definition should be found. In IRTransformLayer |
| findSymbolIn | we just pass this query on to the layer below. In our |
@@ -262,7 +255,7 @@ layer interface. The interface consists of one typedef and five methods:
| | we just added. |
+------------------+-----------------------------------------------------------+
| | Forces all of the actions required to make the code and |
-| | data in a module set (represented by a ModuleSetHandleT) |
+| | data in a module set (represented by a ModuleHandleT) |
| | accessible. Behaves as if some symbol in the set had been |
| | searched for and JITSymbol::getSymbolAddress called. This |
| emitAndFinalize | is rarely needed, but can be useful when dealing with |
@@ -276,11 +269,11 @@ wrinkles like emitAndFinalize for performance), similar to the basic JIT API
operations we identified in Chapter 1. Conforming to the layer concept allows
classes to compose neatly by implementing their behaviors in terms of these
same operations, carried out on the layer below. For example, an eager layer
-(like IRTransformLayer) can implement addModuleSet by running each module in the
+(like IRTransformLayer) can implement addModule by running each module in the
set through its transform up-front and immediately passing the result to the
-layer below. A lazy layer, by contrast, could implement addModuleSet by
+layer below. A lazy layer, by contrast, could implement addModule by
squirreling away the modules, doing no other up-front work, but applying the
-transform (and calling addModuleSet on the layer below) when the client calls
+transform (and calling addModule on the layer below) when the client calls
findSymbol instead. The JIT'd program behavior will be the same either way, but
these choices will have different performance characteristics: Doing work
eagerly means the JIT takes longer up-front, but proceeds smoothly once this is
@@ -319,7 +312,7 @@ IRTransformLayer added to enable optimization. To build this example, use:
.. code-block:: bash
# Compile
- clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orc native` -O3 -o toy
+ clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orcjit native` -O3 -o toy
# Run
./toy
@@ -329,8 +322,8 @@ Here is the code:
:language: c++
.. [1] When we add our top-level expression to the JIT, any calls to functions
- that we defined earlier will appear to the ObjectLinkingLayer as
- external symbols. The ObjectLinkingLayer will call the SymbolResolver
- that we defined in addModuleSet, which in turn calls findSymbol on the
+ that we defined earlier will appear to the RTDyldObjectLinkingLayer as
+ external symbols. The RTDyldObjectLinkingLayer will call the SymbolResolver
+ that we defined in addModule, which in turn calls findSymbol on the
OptimizeLayer, at which point even a lazy transform layer will have to
do its work.
diff --git a/docs/tutorial/BuildingAJIT3.rst b/docs/tutorial/BuildingAJIT3.rst
index 071e92c74541..9c4e59fe1176 100644
--- a/docs/tutorial/BuildingAJIT3.rst
+++ b/docs/tutorial/BuildingAJIT3.rst
@@ -21,7 +21,7 @@ Lazy Compilation
When we add a module to the KaleidoscopeJIT class from Chapter 2 it is
immediately optimized, compiled and linked for us by the IRTransformLayer,
-IRCompileLayer and ObjectLinkingLayer respectively. This scheme, where all the
+IRCompileLayer and RTDyldObjectLinkingLayer respectively. This scheme, where all the
work to make a Module executable is done up front, is simple to understand and
its performance characteristics are easy to reason about. However, it will lead
to very high startup times if the amount of code to be compiled is large, and
@@ -33,7 +33,7 @@ the ORC APIs provide us with a layer to lazily compile LLVM IR:
*CompileOnDemandLayer*.
The CompileOnDemandLayer class conforms to the layer interface described in
-Chapter 2, but its addModuleSet method behaves quite differently from the layers
+Chapter 2, but its addModule method behaves quite differently from the layers
we have seen so far: rather than doing any work up front, it just scans the
Modules being added and arranges for each function in them to be compiled the
first time it is called. To do this, the CompileOnDemandLayer creates two small
@@ -73,21 +73,22 @@ lazy compilation. We just need a few changes to the source:
private:
std::unique_ptr<TargetMachine> TM;
const DataLayout DL;
- std::unique_ptr<JITCompileCallbackManager> CompileCallbackManager;
- ObjectLinkingLayer<> ObjectLayer;
- IRCompileLayer<decltype(ObjectLayer)> CompileLayer;
+ RTDyldObjectLinkingLayer ObjectLayer;
+ IRCompileLayer<decltype(ObjectLayer), SimpleCompiler> CompileLayer;
- typedef std::function<std::unique_ptr<Module>(std::unique_ptr<Module>)>
- OptimizeFunction;
+ using OptimizeFunction =
+ std::function<std::shared_ptr<Module>(std::shared_ptr<Module>)>;
IRTransformLayer<decltype(CompileLayer), OptimizeFunction> OptimizeLayer;
+
+ std::unique_ptr<JITCompileCallbackManager> CompileCallbackManager;
CompileOnDemandLayer<decltype(OptimizeLayer)> CODLayer;
public:
- typedef decltype(CODLayer)::ModuleSetHandleT ModuleHandle;
+ using ModuleHandle = decltype(CODLayer)::ModuleHandleT;
First we need to include the CompileOnDemandLayer.h header, then add two new
-members: a std::unique_ptr<CompileCallbackManager> and a CompileOnDemandLayer,
+members: a std::unique_ptr<JITCompileCallbackManager> and a CompileOnDemandLayer,
to our class. The CompileCallbackManager member is used by the CompileOnDemandLayer
to create the compile callback needed for each function.
@@ -95,9 +96,10 @@ to create the compile callback needed for each function.
KaleidoscopeJIT()
: TM(EngineBuilder().selectTarget()), DL(TM->createDataLayout()),
+ ObjectLayer([]() { return std::make_shared<SectionMemoryManager>(); }),
CompileLayer(ObjectLayer, SimpleCompiler(*TM)),
OptimizeLayer(CompileLayer,
- [this](std::unique_ptr<Module> M) {
+ [this](std::shared_ptr<Module> M) {
return optimizeModule(std::move(M));
}),
CompileCallbackManager(
@@ -133,7 +135,7 @@ our CompileCallbackManager. Finally, we need to supply an "indirect stubs
manager builder": a utility function that constructs IndirectStubManagers, which
are in turn used to build the stubs for the functions in each module. The
CompileOnDemandLayer will call the indirect stub manager builder once for each
-call to addModuleSet, and use the resulting indirect stubs manager to create
+call to addModule, and use the resulting indirect stubs manager to create
stubs for all functions in all modules in the set. If/when the module set is
removed from the JIT the indirect stubs manager will be deleted, freeing any
memory allocated to the stubs. We supply this function by using the
@@ -144,9 +146,8 @@ createLocalIndirectStubsManagerBuilder utility.
// ...
if (auto Sym = CODLayer.findSymbol(Name, false))
// ...
- return CODLayer.addModuleSet(std::move(Ms),
- make_unique<SectionMemoryManager>(),
- std::move(Resolver));
+    return cantFail(CODLayer.addModule(std::move(M),
+                                       std::move(Resolver)));
// ...
// ...
@@ -154,7 +155,7 @@ createLocalIndirectStubsManagerBuilder utility.
// ...
// ...
- CODLayer.removeModuleSet(H);
+    cantFail(CODLayer.removeModule(H));
// ...
Finally, we need to replace the references to OptimizeLayer in our addModule,
@@ -173,7 +174,7 @@ layer added to enable lazy function-at-a-time compilation. To build this example
.. code-block:: bash
# Compile
- clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orc native` -O3 -o toy
+ clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orcjit native` -O3 -o toy
# Run
./toy
diff --git a/docs/tutorial/BuildingAJIT4.rst b/docs/tutorial/BuildingAJIT4.rst
index 39d9198a85c3..3d3f81e43858 100644
--- a/docs/tutorial/BuildingAJIT4.rst
+++ b/docs/tutorial/BuildingAJIT4.rst
@@ -36,7 +36,7 @@ Kaleidoscope ASTs. To build this example, use:
.. code-block:: bash
# Compile
- clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orc native` -O3 -o toy
+ clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orcjit native` -O3 -o toy
# Run
./toy
diff --git a/docs/tutorial/BuildingAJIT5.rst b/docs/tutorial/BuildingAJIT5.rst
index 94ea92ce5ad2..0fda8610efbf 100644
--- a/docs/tutorial/BuildingAJIT5.rst
+++ b/docs/tutorial/BuildingAJIT5.rst
@@ -40,8 +40,10 @@ Kaleidoscope ASTs. To build this example, use:
.. code-block:: bash
# Compile
- clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orc native` -O3 -o toy
+ clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orcjit native` -O3 -o toy
+ clang++ -g Server/server.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core orcjit native` -O3 -o toy-server
# Run
+ ./toy-server &
./toy
Here is the code for the modified KaleidoscopeJIT: