 docs/ReleaseNotes.rst                                     |  78
 include/llvm/Transforms/Scalar/Reassociate.h              |   2
 lib/Analysis/ScalarEvolution.cpp                          |   4
 lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp          |  15
 lib/Target/PowerPC/PPCISelLowering.cpp                    |   9
 lib/Transforms/Scalar/Reassociate.cpp                     |  64
 lib/Transforms/Utils/CloneFunction.cpp                    |   6
 lib/Transforms/Vectorize/SLPVectorizer.cpp                |  85
 test/Analysis/ScalarEvolution/flags-from-poison.ll        |  49
 test/CodeGen/AArch64/ldst-opt.ll                          | 103
 test/CodeGen/AArch64/ldst-paired-aliasing.ll              |  47
 test/CodeGen/PowerPC/ppc64-sibcall.ll                     |  12
 test/Transforms/Inline/inline_constprop.ll                |  26
 test/Transforms/Reassociate/prev_insts_canonicalized.ll   |  57
 test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll |   6
 test/Transforms/Reassociate/xor_reassoc.ll                |   4
 test/Transforms/SLPVectorizer/AArch64/gather-root.ll      |  87
 17 files changed, 475 insertions(+), 179 deletions(-)
diff --git a/docs/ReleaseNotes.rst b/docs/ReleaseNotes.rst
index dc76617a87b2..757434a02ce4 100644
--- a/docs/ReleaseNotes.rst
+++ b/docs/ReleaseNotes.rst
@@ -5,12 +5,6 @@ LLVM 3.9 Release Notes
.. contents::
:local:
-.. warning::
- These are in-progress notes for the upcoming LLVM 3.9 release. You may
- prefer the `LLVM 3.8 Release Notes <http://llvm.org/releases/3.8.0/docs
- /ReleaseNotes.html>`_.
-
-
Introduction
============
@@ -26,11 +20,6 @@ have questions or comments, the `LLVM Developer's Mailing List
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ is a good place to send
them.
-Note that if you are reading this file from a Subversion checkout or the main
-LLVM web page, this document applies to the *next* release, not the current
-one. To see the release notes for a specific release, please see the `releases
-page <http://llvm.org/releases/>`_.
-
Non-comprehensive list of changes in this release
=================================================
* The LLVMContext gains a new runtime check (see
@@ -45,10 +34,10 @@ Non-comprehensive list of changes in this release
please see the documentation on :doc:`CMake`. For information about the CMake
language there is also a :doc:`CMakePrimer` document available.
-* .. note about C API functions LLVMParseBitcode,
- LLVMParseBitcodeInContext, LLVMGetBitcodeModuleInContext and
- LLVMGetBitcodeModule having been removed. LLVMGetTargetMachineData has been
- removed (use LLVMGetDataLayout instead).
+* The C API functions LLVMParseBitcode,
+ LLVMParseBitcodeInContext, LLVMGetBitcodeModuleInContext and
+ LLVMGetBitcodeModule have been removed. LLVMGetTargetMachineData has also
+ been removed (use LLVMGetDataLayout instead).
* The C API function LLVMLinkModules has been removed.
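For the bitcode reader functions noted above, the usual replacements are the ``2``-suffixed C API variants, which report errors through the context's diagnostic handler instead of an out-parameter message string (likewise ``LLVMLinkModules2`` for module linking). A minimal sketch, assuming ``Buf`` already holds bitcode; this is illustrative and not part of the patch:

    #include "llvm-c/BitReader.h"

    // Returns the parsed module, or nullptr if parsing failed; failure details
    // are delivered to the installed diagnostic handler.
    static LLVMModuleRef parseBitcodeOrNull(LLVMMemoryBufferRef Buf) {
      LLVMModuleRef M;
      if (LLVMParseBitcode2(Buf, &M))
        return nullptr;
      return M;
    }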
@@ -68,52 +57,35 @@ Non-comprehensive list of changes in this release
iterator to the next instruction instead of ``void``. Targets that previously
did ``MBB.erase(I); return;`` now probably want ``return MBB.erase(I);``.
-* ``SelectionDAGISel::Select`` now returns ``void``. Out of tree targets will
+* ``SelectionDAGISel::Select`` now returns ``void``. Out-of-tree targets will
need to be updated to replace the argument node and remove any dead nodes in
cases where they currently return an ``SDNode *`` from this interface.
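A rough migration sketch for such a target; ``FooDAGToDAGISel`` and ``Foo::ADDrr`` are placeholders for a hypothetical out-of-tree ``SelectionDAGISel`` subclass and its opcode, not real in-tree names:

    // Before: SDNode *FooDAGToDAGISel::Select(SDNode *N) { ...; return New; }
    // After: Select returns void and the target replaces the node itself.
    void FooDAGToDAGISel::Select(SDNode *N) {
      SDLoc DL(N);
      if (N->getOpcode() == ISD::ADD) {
        SDNode *New = CurDAG->getMachineNode(Foo::ADDrr, DL, N->getValueType(0),
                                             N->getOperand(0), N->getOperand(1));
        ReplaceNode(N, New); // replaces all uses of N and removes dead nodes
        return;
      }
      SelectCode(N); // fall back to the generated matcher
    }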
-* Raised the minimum required CMake version to 3.4.3.
-
* Added the MemorySSA analysis, which hopes to replace MemoryDependenceAnalysis.
It should provide higher-quality results than MemDep, and be algorithmically
faster than MemDep. Currently, GVNHoist (which is off by default) makes use of
MemorySSA.
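As a rough usage sketch (the helper name is illustrative; header location as of 3.9), a pass holding a ``MemorySSA`` result can walk from a memory instruction to the access that defines what it may read or write:

    #include "llvm/Transforms/Utils/MemorySSA.h"
    using namespace llvm;

    // Returns the defining access for a load/store (possibly the special
    // live-on-entry access), or nullptr if I does not touch memory.
    static MemoryAccess *definingAccessFor(MemorySSA &MSSA, Instruction &I) {
      if (auto *MA = dyn_cast_or_null<MemoryUseOrDef>(MSSA.getMemoryAccess(&I)))
        return MA->getDefiningAccess();
      return nullptr;
    }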
-.. NOTE
- For small 1-3 sentence descriptions, just add an entry at the end of
- this list. If your description won't fit comfortably in one bullet
- point (e.g. maybe you would like to give an example of the
- functionality, or simply have a lot to talk about), see the `NOTE` below
- for adding a new subsection.
-
-* ... next change ...
-
-.. NOTE
- If you would like to document a larger change, then you can add a
- subsection about it right here. You can copy the following boilerplate
- and un-indent it (the indentation causes it to be inside this comment).
-
- Special New Feature
- -------------------
-
- Makes programs 10x faster by doing Special New Thing.
+* The minimum density for lowering switches with jump tables has been reduced
+ from 40% to 10% for functions that are not marked ``optsize`` (that is, not
+ compiled with ``-Os``).
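For a rough sense of the numbers (illustrative only): a switch with 10 cases spread over a range of about 100 values has a density near 10%, so code shaped like the following is now eligible for a jump table at -O2/-O3, while functions marked ``optsize`` keep the old 40% threshold:

    // Roughly 10 cases over a ~100-value range, i.e. about 10% density.
    int classify(int v) {
      switch (v) {
      case 0:  return 1;
      case 11: return 2;
      case 22: return 3;
      case 33: return 4;
      case 44: return 5;
      case 55: return 6;
      case 66: return 7;
      case 77: return 8;
      case 88: return 9;
      case 99: return 10;
      default: return 0;
      }
    }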
GCC ABI Tag
-----------
-Recently, many of the Linux distributions (ex. `Fedora <http://developerblog.redhat.com/2015/02/10/gcc-5-in-fedora/>`_,
+Recently, many of the Linux distributions (e.g. `Fedora <http://developerblog.redhat.com/2015/02/10/gcc-5-in-fedora/>`_,
`Debian <https://wiki.debian.org/GCC5>`_, `Ubuntu <https://wiki.ubuntu.com/GCC5>`_)
have moved on to use the new `GCC ABI <https://gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Attributes.html>`_
to work around `C++11 incompatibilities in libstdc++ <https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html>`_.
This caused `incompatibility problems <https://gcc.gnu.org/ml/gcc-patches/2015-04/msg00153.html>`_
-with other compilers (ex. Clang), which needed to be fixed, but due to the
+with other compilers (e.g. Clang), which needed to be fixed, but due to the
experimental nature of GCC's own implementation, it took a long time for it to
-land in LLVM (`here <https://reviews.llvm.org/D18035>`_ and
-`here <https://reviews.llvm.org/D17567>`_), not in time for the 3.8 release.
+land in LLVM (`D18035 <https://reviews.llvm.org/D18035>`_ and
+`D17567 <https://reviews.llvm.org/D17567>`_), not in time for the 3.8 release.
-Those patches are now present in the 3.9.0 release and should be working on the
+Those patches are now present in the 3.9.0 release and should be working in the
majority of cases, as they have been tested thoroughly. However, some bugs were
-`filled in GCC <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71712>`_ and have not
+`filed in GCC <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71712>`_ and have not
yet been fixed, so there may be corner cases not covered by either GCC or Clang.
Bug fixes to those problems should be reported in Bugzilla (either LLVM or GCC),
and patches to LLVM's trunk are very likely to be back-ported to future 3.9.x
@@ -131,6 +103,10 @@ Changes to the LLVM IR
``llvm.masked.gather`` and ``llvm.masked.scatter`` were introduced to the
LLVM IR to allow selective memory access for vector data types.
+* The new ``notail`` attribute prevents optimization passes from adding ``tail``
+ or ``musttail`` markers to a call. It is used to prevent tail call
+ optimization from being performed on the call.
+
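In textual IR the marker appears as ``notail call ...``. For front ends or passes that set it programmatically, a minimal sketch (assuming an existing ``llvm::CallInst``):

    #include "llvm/IR/Instructions.h"

    // Prevent later passes from turning this particular call into a tail call.
    static void markNoTail(llvm::CallInst *CI) {
      CI->setTailCallKind(llvm::CallInst::TCK_NoTail);
    }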
Changes to LLVM's IPO model
---------------------------
@@ -145,7 +121,7 @@ Support for ThinLTO
-------------------
LLVM now supports ThinLTO compilation, which can be invoked by compiling
-and linking with -flto=thin. The gold linker plugin, as well as linkers
+and linking with ``-flto=thin``. The gold linker plugin, as well as linkers
that use the new ThinLTO API in libLTO (like ld64), will transparently
execute the ThinLTO backends in parallel threads.
For more information on ThinLTO and the LLVM implementation, see the
@@ -238,7 +214,7 @@ fixes:**
Changes to the PowerPC Target
-----------------------------
- Moved some optimizations from O3 to O2 (D18562)
+* Moved some optimizations from O3 to O2 (D18562)
* Enable sibling call optimization on ppc64 ELFv1/ELFv2 abi
@@ -266,18 +242,6 @@ Changes to the AMDGPU Target
* Mesa 11.0.x is no longer supported
-Changes to the OCaml bindings
------------------------------
-
- During this release ...
-
-Support for attribute 'notail' has been added
----------------------------------------------
-
-This marker prevents optimization passes from adding 'tail' or
-'musttail' markers to a call. It is used to prevent tail call
-optimization from being performed on the call.
-
External Open Source Projects Using LLVM 3.9
============================================
@@ -285,8 +249,6 @@ An exciting aspect of LLVM is that it is used as an enabling technology for
a lot of other language and tools projects. This section lists some of the
projects that have already been updated to work with LLVM 3.9.
-* A project
-
LDC - the LLVM-based D compiler
-------------------------------
diff --git a/include/llvm/Transforms/Scalar/Reassociate.h b/include/llvm/Transforms/Scalar/Reassociate.h
index 7b68b4489306..2f56b9398778 100644
--- a/include/llvm/Transforms/Scalar/Reassociate.h
+++ b/include/llvm/Transforms/Scalar/Reassociate.h
@@ -65,7 +65,7 @@ public:
PreservedAnalyses run(Function &F, FunctionAnalysisManager &);
private:
- void BuildRankMap(Function &F, ReversePostOrderTraversal<Function *> &RPOT);
+ void BuildRankMap(Function &F);
unsigned getRank(Value *V);
void canonicalizeOperands(Instruction *I);
void ReassociateExpression(BinaryOperator *I);
diff --git a/lib/Analysis/ScalarEvolution.cpp b/lib/Analysis/ScalarEvolution.cpp
index 2abbf3480358..e42a4b574d90 100644
--- a/lib/Analysis/ScalarEvolution.cpp
+++ b/lib/Analysis/ScalarEvolution.cpp
@@ -4822,6 +4822,10 @@ bool ScalarEvolution::isSCEVExprNeverPoison(const Instruction *I) {
// from different loops, so that we know which loop to prove that I is
// executed in.
for (unsigned OpIndex = 0; OpIndex < I->getNumOperands(); ++OpIndex) {
+ // I could be an extractvalue from a call to an overflow intrinsic.
+ // TODO: We can do better here in some cases.
+ if (!isSCEVable(I->getOperand(OpIndex)->getType()))
+ return false;
const SCEV *Op = getSCEV(I->getOperand(OpIndex));
if (auto *AddRec = dyn_cast<SCEVAddRecExpr>(Op)) {
bool AllOtherOpsLoopInvariant = true;
diff --git a/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp b/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
index dca13fc49414..dd2ea6a9dbd6 100644
--- a/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
+++ b/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
@@ -1258,8 +1258,11 @@ AArch64LoadStoreOpt::findMatchingInsn(MachineBasicBlock::iterator I,
if (MIIsUnscaled) {
// If the unscaled offset isn't a multiple of the MemSize, we can't
// pair the operations together: bail and keep looking.
- if (MIOffset % MemSize)
+ if (MIOffset % MemSize) {
+ trackRegDefsUses(MI, ModifiedRegs, UsedRegs, TRI);
+ MemInsns.push_back(&MI);
continue;
+ }
MIOffset /= MemSize;
} else {
MIOffset *= MemSize;
@@ -1424,9 +1427,6 @@ bool AArch64LoadStoreOpt::isMatchingUpdateInsn(MachineInstr &MemMI,
default:
break;
case AArch64::SUBXri:
- // Negate the offset for a SUB instruction.
- Offset *= -1;
- // FALLTHROUGH
case AArch64::ADDXri:
// Make sure it's a vanilla immediate operand, not a relocation or
// anything else we can't handle.
@@ -1444,6 +1444,9 @@ bool AArch64LoadStoreOpt::isMatchingUpdateInsn(MachineInstr &MemMI,
bool IsPairedInsn = isPairedLdSt(MemMI);
int UpdateOffset = MI.getOperand(2).getImm();
+ if (MI.getOpcode() == AArch64::SUBXri)
+ UpdateOffset = -UpdateOffset;
+
// For non-paired load/store instructions, the immediate must fit in a
// signed 9-bit integer.
if (!IsPairedInsn && (UpdateOffset > 255 || UpdateOffset < -256))
@@ -1458,13 +1461,13 @@ bool AArch64LoadStoreOpt::isMatchingUpdateInsn(MachineInstr &MemMI,
break;
int ScaledOffset = UpdateOffset / Scale;
- if (ScaledOffset > 64 || ScaledOffset < -64)
+ if (ScaledOffset > 63 || ScaledOffset < -64)
break;
}
// If we have a non-zero Offset, we check that it matches the amount
// we're adding to the register.
- if (!Offset || Offset == MI.getOperand(2).getImm())
+ if (!Offset || Offset == UpdateOffset)
return true;
break;
}
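For context on the ``> 63 || ... < -64`` bounds above: paired pre/post-indexed loads and stores encode their offset as a signed 7-bit immediate scaled by the access size, so the scaled value must lie in [-64, 63]. A small standalone sketch of the same range check (illustrative, not the pass code):

    // True if UpdateOffset (in bytes) is encodable as the scaled, signed 7-bit
    // immediate used by paired pre/post-indexed load/store forms.
    static bool fitsPairedIndexImm(int UpdateOffset, int Scale) {
      if (Scale == 0 || UpdateOffset % Scale != 0)
        return false;
      int Scaled = UpdateOffset / Scale;
      return Scaled >= -64 && Scaled <= 63;
    }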
diff --git a/lib/Target/PowerPC/PPCISelLowering.cpp b/lib/Target/PowerPC/PPCISelLowering.cpp
index 6e3c830a8243..3d06de804200 100644
--- a/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -4033,11 +4033,18 @@ PPCTargetLowering::IsEligibleForTailCallOptimization_64SVR4(
if (CalleeCC != CallingConv::Fast && CalleeCC != CallingConv::C)
return false;
- // Functions containing by val parameters are not supported.
+ // A caller that contains any byval parameter is not supported.
if (std::any_of(Ins.begin(), Ins.end(),
[](const ISD::InputArg& IA) { return IA.Flags.isByVal(); }))
return false;
+ // A callee that contains any byval parameter is not supported either.
+ // Note: This is a quick workaround, because in some cases (e.g. when the
+ // caller's stack size > the callee's stack size) we are still able to apply
+ // sibling call optimization. See: https://reviews.llvm.org/D23441#513574
+ if (any_of(Outs, [](const ISD::OutputArg& OA) { return OA.Flags.isByVal(); }))
+ return false;
+
// No TCO/SCO on indirect call because Caller have to restore its TOC
if (!isFunctionGlobalAddress(Callee) &&
!isa<ExternalSymbolSDNode>(Callee))
diff --git a/lib/Transforms/Scalar/Reassociate.cpp b/lib/Transforms/Scalar/Reassociate.cpp
index b930a8fb7e99..e42e2c6d8aa4 100644
--- a/lib/Transforms/Scalar/Reassociate.cpp
+++ b/lib/Transforms/Scalar/Reassociate.cpp
@@ -145,8 +145,7 @@ static BinaryOperator *isReassociableOp(Value *V, unsigned Opcode1,
return nullptr;
}
-void ReassociatePass::BuildRankMap(
- Function &F, ReversePostOrderTraversal<Function *> &RPOT) {
+void ReassociatePass::BuildRankMap(Function &F) {
unsigned i = 2;
// Assign distinct ranks to function arguments.
@@ -155,6 +154,7 @@ void ReassociatePass::BuildRankMap(
DEBUG(dbgs() << "Calculated Rank[" << I->getName() << "] = " << i << "\n");
}
+ ReversePostOrderTraversal<Function *> RPOT(&F);
for (BasicBlock *BB : RPOT) {
unsigned BBRank = RankMap[BB] = ++i << 16;
@@ -2172,28 +2172,13 @@ void ReassociatePass::ReassociateExpression(BinaryOperator *I) {
}
PreservedAnalyses ReassociatePass::run(Function &F, FunctionAnalysisManager &) {
- // Reassociate needs for each instruction to have its operands already
- // processed, so we first perform a RPOT of the basic blocks so that
- // when we process a basic block, all its dominators have been processed
- // before.
- ReversePostOrderTraversal<Function *> RPOT(&F);
- BuildRankMap(F, RPOT);
+ // Calculate the rank map for F.
+ BuildRankMap(F);
MadeChange = false;
- for (BasicBlock *BI : RPOT) {
- // Use a worklist to keep track of which instructions have been processed
- // (and which insts won't be optimized again) so when redoing insts,
- // optimize insts rightaway which won't be processed later.
- SmallSet<Instruction *, 8> Worklist;
-
- // Insert all instructions in the BB
- for (Instruction &I : *BI)
- Worklist.insert(&I);
-
+ for (Function::iterator BI = F.begin(), BE = F.end(); BI != BE; ++BI) {
// Optimize every instruction in the basic block.
- for (BasicBlock::iterator II = BI->begin(), IE = BI->end(); II != IE;) {
- // This instruction has been processed.
- Worklist.erase(&*II);
+ for (BasicBlock::iterator II = BI->begin(), IE = BI->end(); II != IE;)
if (isInstructionTriviallyDead(&*II)) {
EraseInst(&*II++);
} else {
@@ -2202,22 +2187,27 @@ PreservedAnalyses ReassociatePass::run(Function &F, FunctionAnalysisManager &) {
++II;
}
- // If the above optimizations produced new instructions to optimize or
- // made modifications which need to be redone, do them now if they won't
- // be handled later.
- while (!RedoInsts.empty()) {
- Instruction *I = RedoInsts.pop_back_val();
- // Process instructions that won't be processed later, either
- // inside the block itself or in another basic block (based on rank),
- // since these will be processed later.
- if ((I->getParent() != BI || !Worklist.count(I)) &&
- RankMap[I->getParent()] <= RankMap[BI]) {
- if (isInstructionTriviallyDead(I))
- EraseInst(I);
- else
- OptimizeInst(I);
- }
- }
+ // Make a copy of all the instructions to be redone so we can remove dead
+ // instructions.
+ SetVector<AssertingVH<Instruction>> ToRedo(RedoInsts);
+ // Iterate over all instructions to be reevaluated and remove trivially dead
+ // instructions. If any operand of the trivially dead instruction becomes
+ // dead mark it for deletion as well. Continue this process until all
+ // trivially dead instructions have been removed.
+ while (!ToRedo.empty()) {
+ Instruction *I = ToRedo.pop_back_val();
+ if (isInstructionTriviallyDead(I))
+ RecursivelyEraseDeadInsts(I, ToRedo);
+ }
+
+ // Now that we have removed dead instructions, we can reoptimize the
+ // remaining instructions.
+ while (!RedoInsts.empty()) {
+ Instruction *I = RedoInsts.pop_back_val();
+ if (isInstructionTriviallyDead(I))
+ EraseInst(I);
+ else
+ OptimizeInst(I);
}
}
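The body of ``RecursivelyEraseDeadInsts`` is not shown in this hunk; conceptually it follows the usual worklist pattern of erasing a dead instruction and re-queuing its operands. A rough standalone sketch of that pattern (names and container types are illustrative, not the pass's actual implementation):

    #include "llvm/ADT/SmallVector.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/Support/Casting.h"

    // Erase a trivially dead instruction and queue its instruction operands,
    // since removing their user may in turn make them trivially dead.
    static void eraseAndQueueOperands(llvm::Instruction *I,
                                      llvm::SmallVectorImpl<llvm::Instruction *> &Worklist) {
      for (llvm::Value *Op : I->operands())
        if (auto *OpI = llvm::dyn_cast<llvm::Instruction>(Op))
          Worklist.push_back(OpI);
      I->eraseFromParent();
    }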
diff --git a/lib/Transforms/Utils/CloneFunction.cpp b/lib/Transforms/Utils/CloneFunction.cpp
index 4f1052d81433..4d33e22fecfb 100644
--- a/lib/Transforms/Utils/CloneFunction.cpp
+++ b/lib/Transforms/Utils/CloneFunction.cpp
@@ -566,6 +566,12 @@ void llvm::CloneAndPruneIntoFromInst(Function *NewFunc, const Function *OldFunc,
if (!I)
continue;
+ // Skip over non-intrinsic callsites; we don't want to remove any nodes from
+ // the CGSCC.
+ CallSite CS = CallSite(I);
+ if (CS && CS.getCalledFunction() && !CS.getCalledFunction()->isIntrinsic())
+ continue;
+
// See if this instruction simplifies.
Value *SimpleV = SimplifyInstruction(I, DL);
if (!SimpleV)
diff --git a/lib/Transforms/Vectorize/SLPVectorizer.cpp b/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 8a3c4d14fecb..fb94f7924ee0 100644
--- a/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -82,8 +82,13 @@ static cl::opt<int> MinVectorRegSizeOption(
"slp-min-reg-size", cl::init(128), cl::Hidden,
cl::desc("Attempt to vectorize for this register size in bits"));
-// FIXME: Set this via cl::opt to allow overriding.
-static const unsigned RecursionMaxDepth = 12;
+static cl::opt<unsigned> RecursionMaxDepth(
+ "slp-recursion-max-depth", cl::init(12), cl::Hidden,
+ cl::desc("Limit the recursion depth when building a vectorizable tree"));
+
+static cl::opt<unsigned> MinTreeSize(
+ "slp-min-tree-size", cl::init(3), cl::Hidden,
+ cl::desc("Only vectorize small trees if they are fully vectorizable"));
// Limit the number of alias checks. The limit is chosen so that
// it has no negative effect on the llvm benchmarks.
@@ -1842,7 +1847,7 @@ int BoUpSLP::getTreeCost() {
VectorizableTree.size() << ".\n");
// We only vectorize tiny trees if it is fully vectorizable.
- if (VectorizableTree.size() < 3 && !isFullyVectorizableTinyTree()) {
+ if (VectorizableTree.size() < MinTreeSize && !isFullyVectorizableTinyTree()) {
if (VectorizableTree.empty()) {
assert(!ExternalUses.size() && "We should not have any external users");
}
@@ -2124,11 +2129,61 @@ void BoUpSLP::reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
}
void BoUpSLP::setInsertPointAfterBundle(ArrayRef<Value *> VL) {
- Instruction *VL0 = cast<Instruction>(VL[0]);
- BasicBlock::iterator NextInst(VL0);
- ++NextInst;
- Builder.SetInsertPoint(VL0->getParent(), NextInst);
- Builder.SetCurrentDebugLocation(VL0->getDebugLoc());
+
+ // Get the basic block this bundle is in. All instructions in the bundle
+ // should be in this block.
+ auto *Front = cast<Instruction>(VL.front());
+ auto *BB = Front->getParent();
+ assert(all_of(make_range(VL.begin(), VL.end()), [&](Value *V) -> bool {
+ return cast<Instruction>(V)->getParent() == BB;
+ }));
+
+ // The last instruction in the bundle in program order.
+ Instruction *LastInst = nullptr;
+
+ // Find the last instruction. The common case should be that BB has been
+ // scheduled, and the last instruction is VL.back(). So we start with
+ // VL.back() and iterate over schedule data until we reach the end of the
+ // bundle. The end of the bundle is marked by null ScheduleData.
+ if (BlocksSchedules.count(BB)) {
+ auto *Bundle = BlocksSchedules[BB]->getScheduleData(VL.back());
+ if (Bundle && Bundle->isPartOfBundle())
+ for (; Bundle; Bundle = Bundle->NextInBundle)
+ LastInst = Bundle->Inst;
+ }
+
+ // LastInst can still be null at this point if there's either not an entry
+ // for BB in BlocksSchedules or there's no ScheduleData available for
+ // VL.back(). This can be the case if buildTree_rec aborts for various
+ // reasons (e.g., the maximum recursion depth is reached, the maximum region
+ // size is reached, etc.). ScheduleData is initialized in the scheduling
+ // "dry-run".
+ //
+ // If this happens, we can still find the last instruction by brute force. We
+ // iterate forwards from Front (inclusive) until we either see all
+ // instructions in the bundle or reach the end of the block. If Front is the
+ // last instruction in program order, LastInst will be set to Front, and we
+ // will visit all the remaining instructions in the block.
+ //
+ // One of the reasons we exit early from buildTree_rec is to place an upper
+ // bound on compile-time. Thus, taking an additional compile-time hit here is
+ // not ideal. However, this should be exceedingly rare since it requires that
+ // we both exit early from buildTree_rec and that the bundle be out-of-order
+ // (causing us to iterate all the way to the end of the block).
+ if (!LastInst) {
+ SmallPtrSet<Value *, 16> Bundle(VL.begin(), VL.end());
+ for (auto &I : make_range(BasicBlock::iterator(Front), BB->end())) {
+ if (Bundle.erase(&I))
+ LastInst = &I;
+ if (Bundle.empty())
+ break;
+ }
+ }
+
+ // Set the insertion point after the last instruction in the bundle. Set the
+ // debug location to Front.
+ Builder.SetInsertPoint(BB, next(BasicBlock::iterator(LastInst)));
+ Builder.SetCurrentDebugLocation(Front->getDebugLoc());
}
Value *BoUpSLP::Gather(ArrayRef<Value *> VL, VectorType *Ty) {
@@ -2206,7 +2261,9 @@ Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
if (E->NeedToGather) {
setInsertPointAfterBundle(E->Scalars);
- return Gather(E->Scalars, VecTy);
+ auto *V = Gather(E->Scalars, VecTy);
+ E->VectorizedValue = V;
+ return V;
}
unsigned Opcode = getSameOpcode(E->Scalars);
@@ -2253,7 +2310,10 @@ Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
E->VectorizedValue = V;
return V;
}
- return Gather(E->Scalars, VecTy);
+ setInsertPointAfterBundle(E->Scalars);
+ auto *V = Gather(E->Scalars, VecTy);
+ E->VectorizedValue = V;
+ return V;
}
case Instruction::ExtractValue: {
if (canReuseExtract(E->Scalars, Instruction::ExtractValue)) {
@@ -2265,7 +2325,10 @@ Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
E->VectorizedValue = V;
return propagateMetadata(V, E->Scalars);
}
- return Gather(E->Scalars, VecTy);
+ setInsertPointAfterBundle(E->Scalars);
+ auto *V = Gather(E->Scalars, VecTy);
+ E->VectorizedValue = V;
+ return V;
}
case Instruction::ZExt:
case Instruction::SExt:
diff --git a/test/Analysis/ScalarEvolution/flags-from-poison.ll b/test/Analysis/ScalarEvolution/flags-from-poison.ll
index 2fcb4c038f20..8e73fe4fd54c 100644
--- a/test/Analysis/ScalarEvolution/flags-from-poison.ll
+++ b/test/Analysis/ScalarEvolution/flags-from-poison.ll
@@ -688,3 +688,52 @@ outer.be:
exit:
ret void
}
+
+
+; PR28932: Don't assert on non-SCEV-able value %2.
+%struct.anon = type { i8* }
+@a = common global %struct.anon* null, align 8
+@b = common global i32 0, align 4
+declare { i32, i1 } @llvm.ssub.with.overflow.i32(i32, i32)
+declare void @llvm.trap()
+define i32 @pr28932() {
+entry:
+ %.pre = load %struct.anon*, %struct.anon** @a, align 8
+ %.pre7 = load i32, i32* @b, align 4
+ br label %for.cond
+
+for.cond: ; preds = %cont6, %entry
+ %0 = phi i32 [ %3, %cont6 ], [ %.pre7, %entry ]
+ %1 = phi %struct.anon* [ %.ph, %cont6 ], [ %.pre, %entry ]
+ %tobool = icmp eq %struct.anon* %1, null
+ %2 = tail call { i32, i1 } @llvm.ssub.with.overflow.i32(i32 %0, i32 1)
+ %3 = extractvalue { i32, i1 } %2, 0
+ %4 = extractvalue { i32, i1 } %2, 1
+ %idxprom = sext i32 %3 to i64
+ %5 = getelementptr inbounds %struct.anon, %struct.anon* %1, i64 0, i32 0
+ %6 = load i8*, i8** %5, align 8
+ %7 = getelementptr inbounds i8, i8* %6, i64 %idxprom
+ %8 = load i8, i8* %7, align 1
+ br i1 %tobool, label %if.else, label %if.then
+
+if.then: ; preds = %for.cond
+ br i1 %4, label %trap, label %cont6
+
+trap: ; preds = %if.else, %if.then
+ tail call void @llvm.trap()
+ unreachable
+
+if.else: ; preds = %for.cond
+ br i1 %4, label %trap, label %cont1
+
+cont1: ; preds = %if.else
+ %conv5 = sext i8 %8 to i64
+ %9 = inttoptr i64 %conv5 to %struct.anon*
+ store %struct.anon* %9, %struct.anon** @a, align 8
+ br label %cont6
+
+cont6: ; preds = %cont1, %if.then
+ %.ph = phi %struct.anon* [ %9, %cont1 ], [ %1, %if.then ]
+ store i32 %3, i32* @b, align 4
+ br label %for.cond
+}
diff --git a/test/CodeGen/AArch64/ldst-opt.ll b/test/CodeGen/AArch64/ldst-opt.ll
index d2133213f186..a7b6399b7cd1 100644
--- a/test/CodeGen/AArch64/ldst-opt.ll
+++ b/test/CodeGen/AArch64/ldst-opt.ll
@@ -1,4 +1,4 @@
-; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-atomic-cfg-tidy=0 -verify-machineinstrs -o - %s | FileCheck %s
+; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -o - %s | FileCheck %s
; This file contains tests for the AArch64 load/store optimizer.
@@ -1232,3 +1232,104 @@ for.body:
end:
ret void
}
+
+define void @post-indexed-sub-doubleword-offset-min(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-sub-doubleword-offset-min
+; CHECK: ldr x{{[0-9]+}}, [x{{[0-9]+}}], #-256
+; CHECK: str x{{[0-9]+}}, [x{{[0-9]+}}], #-256
+ br label %for.body
+for.body:
+ %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+ %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+ %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+ %gep1 = getelementptr i64, i64* %phi1, i64 1
+ %load1 = load i64, i64* %gep1
+ %gep2 = getelementptr i64, i64* %phi2, i64 1
+ store i64 %load1, i64* %gep2
+ %load2 = load i64, i64* %phi1
+ store i64 %load2, i64* %phi2
+ %dec.i = add nsw i64 %i, -1
+ %gep3 = getelementptr i64, i64* %phi2, i64 -32
+ %gep4 = getelementptr i64, i64* %phi1, i64 -32
+ %cond = icmp sgt i64 %dec.i, 0
+ br i1 %cond, label %for.body, label %end
+end:
+ ret void
+}
+
+define void @post-indexed-doubleword-offset-out-of-range(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-doubleword-offset-out-of-range
+; CHECK: ldr x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #256
+; CHECK: str x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #256
+
+ br label %for.body
+for.body:
+ %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+ %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+ %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+ %gep1 = getelementptr i64, i64* %phi1, i64 1
+ %load1 = load i64, i64* %gep1
+ %gep2 = getelementptr i64, i64* %phi2, i64 1
+ store i64 %load1, i64* %gep2
+ %load2 = load i64, i64* %phi1
+ store i64 %load2, i64* %phi2
+ %dec.i = add nsw i64 %i, -1
+ %gep3 = getelementptr i64, i64* %phi2, i64 32
+ %gep4 = getelementptr i64, i64* %phi1, i64 32
+ %cond = icmp sgt i64 %dec.i, 0
+ br i1 %cond, label %for.body, label %end
+end:
+ ret void
+}
+
+define void @post-indexed-paired-min-offset(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-paired-min-offset
+; CHECK: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512
+; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512
+ br label %for.body
+for.body:
+ %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+ %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+ %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+ %gep1 = getelementptr i64, i64* %phi1, i64 1
+ %load1 = load i64, i64* %gep1
+ %gep2 = getelementptr i64, i64* %phi2, i64 1
+ %load2 = load i64, i64* %phi1
+ store i64 %load1, i64* %gep2
+ store i64 %load2, i64* %phi2
+ %dec.i = add nsw i64 %i, -1
+ %gep3 = getelementptr i64, i64* %phi2, i64 -64
+ %gep4 = getelementptr i64, i64* %phi1, i64 -64
+ %cond = icmp sgt i64 %dec.i, 0
+ br i1 %cond, label %for.body, label %end
+end:
+ ret void
+}
+
+define void @post-indexed-paired-offset-out-of-range(i64* %a, i64* %b, i64 %count) nounwind {
+; CHECK-LABEL: post-indexed-paired-offset-out-of-range
+; CHECK: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #512
+; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}]
+; CHECK: add x{{[0-9]+}}, x{{[0-9]+}}, #512
+ br label %for.body
+for.body:
+ %phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
+ %phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
+ %i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
+ %gep1 = getelementptr i64, i64* %phi1, i64 1
+ %load1 = load i64, i64* %phi1
+ %gep2 = getelementptr i64, i64* %phi2, i64 1
+ %load2 = load i64, i64* %gep1
+ store i64 %load1, i64* %gep2
+ store i64 %load2, i64* %phi2
+ %dec.i = add nsw i64 %i, -1
+ %gep3 = getelementptr i64, i64* %phi2, i64 64
+ %gep4 = getelementptr i64, i64* %phi1, i64 64
+ %cond = icmp sgt i64 %dec.i, 0
+ br i1 %cond, label %for.body, label %end
+end:
+ ret void
+}
diff --git a/test/CodeGen/AArch64/ldst-paired-aliasing.ll b/test/CodeGen/AArch64/ldst-paired-aliasing.ll
new file mode 100644
index 000000000000..035e911b3c76
--- /dev/null
+++ b/test/CodeGen/AArch64/ldst-paired-aliasing.ll
@@ -0,0 +1,47 @@
+; RUN: llc -mcpu cortex-a53 < %s | FileCheck %s
+target datalayout = "e-m:e-i64:64-i128:128-n8:16:32:64-S128"
+target triple = "aarch64--linux-gnu"
+
+declare void @f(i8*, i8*)
+declare void @f2(i8*, i8*)
+declare void @_Z5setupv()
+declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) #3
+
+define i32 @main() local_unnamed_addr #1 {
+; Make sure the stores happen in the correct order (the exact instructions could change).
+; CHECK-LABEL: main:
+; CHECK: str q0, [sp, #48]
+; CHECK: ldr w8, [sp, #48]
+; CHECK: stur q1, [sp, #72]
+; CHECK: str q0, [sp, #64]
+; CHECK: str w9, [sp, #80]
+
+for.body.lr.ph.i.i.i.i.i.i63:
+ %b1 = alloca [10 x i32], align 16
+ %x0 = bitcast [10 x i32]* %b1 to i8*
+ %b2 = alloca [10 x i32], align 16
+ %x1 = bitcast [10 x i32]* %b2 to i8*
+ tail call void @_Z5setupv()
+ %x2 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 6
+ %x3 = bitcast i32* %x2 to i8*
+ call void @llvm.memset.p0i8.i64(i8* %x3, i8 0, i64 16, i32 8, i1 false)
+ %arraydecay2 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 0
+ %x4 = bitcast [10 x i32]* %b1 to <4 x i32>*
+ store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %x4, align 16
+ %incdec.ptr.i7.i.i.i.i.i.i64.3 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 4
+ %x5 = bitcast i32* %incdec.ptr.i7.i.i.i.i.i.i64.3 to <4 x i32>*
+ store <4 x i32> <i32 1, i32 1, i32 1, i32 1>, <4 x i32>* %x5, align 16
+ %incdec.ptr.i7.i.i.i.i.i.i64.7 = getelementptr inbounds [10 x i32], [10 x i32]* %b1, i64 0, i64 8
+ store i32 1, i32* %incdec.ptr.i7.i.i.i.i.i.i64.7, align 16
+ %x6 = load i32, i32* %arraydecay2, align 16
+ %cmp6 = icmp eq i32 %x6, 1
+ br i1 %cmp6, label %for.inc, label %if.then
+
+for.inc:
+ call void @f(i8* %x0, i8* %x1)
+ ret i32 0
+
+if.then:
+ call void @f2(i8* %x0, i8* %x1)
+ ret i32 0
+}
diff --git a/test/CodeGen/PowerPC/ppc64-sibcall.ll b/test/CodeGen/PowerPC/ppc64-sibcall.ll
index e4fe5468e942..474e845a5ba5 100644
--- a/test/CodeGen/PowerPC/ppc64-sibcall.ll
+++ b/test/CodeGen/PowerPC/ppc64-sibcall.ll
@@ -189,3 +189,15 @@ define void @w_caller(i8* %ptr) {
; CHECK-SCO-LABEL: w_caller:
; CHECK-SCO: bl w_callee
}
+
+%struct.byvalTest = type { [8 x i8] }
+@byval = common global %struct.byvalTest zeroinitializer
+
+define void @byval_callee(%struct.byvalTest* byval %ptr) { ret void }
+define void @byval_caller() {
+ tail call void @byval_callee(%struct.byvalTest* byval @byval)
+ ret void
+
+; CHECK-SCO-LABEL: byval_caller:
+; CHECK-SCO: bl byval_callee
+}
diff --git a/test/Transforms/Inline/inline_constprop.ll b/test/Transforms/Inline/inline_constprop.ll
index ab9e90ca27c6..c13e023017a0 100644
--- a/test/Transforms/Inline/inline_constprop.ll
+++ b/test/Transforms/Inline/inline_constprop.ll
@@ -299,8 +299,8 @@ entry:
}
; CHECK-LABEL: define i32 @PR28802(
-; CHECK: call i32 @PR28802.external(i32 0)
-; CHECK: ret i32 0
+; CHECK: %[[call:.*]] = call i32 @PR28802.external(i32 0)
+; CHECK: ret i32 %[[call]]
define internal i32 @PR28848.callee(i32 %p2, i1 %c) {
entry:
@@ -322,3 +322,25 @@ entry:
}
; CHECK-LABEL: define i32 @PR28848(
; CHECK: ret i32 0
+
+define internal void @callee7(i16 %param1, i16 %param2) {
+entry:
+ br label %bb
+
+bb:
+ %phi = phi i16 [ %param2, %entry ]
+ %add = add i16 %phi, %param1
+ ret void
+}
+
+declare i16 @caller7.external(i16 returned)
+
+define void @caller7() {
+bb1:
+ %call = call i16 @caller7.external(i16 1)
+ call void @callee7(i16 0, i16 %call)
+ ret void
+}
+; CHECK-LABEL: define void @caller7(
+; CHECK: %call = call i16 @caller7.external(i16 1)
+; CHECK-NEXT: ret void
diff --git a/test/Transforms/Reassociate/prev_insts_canonicalized.ll b/test/Transforms/Reassociate/prev_insts_canonicalized.ll
deleted file mode 100644
index 649761e57c9a..000000000000
--- a/test/Transforms/Reassociate/prev_insts_canonicalized.ll
+++ /dev/null
@@ -1,57 +0,0 @@
-; RUN: opt < %s -reassociate -S | FileCheck %s
-
-; These tests make sure that before processing insts
-; any previous instructions are already canonicalized.
-define i32 @foo(i32 %in) {
-; CHECK-LABEL: @foo
-; CHECK-NEXT: %factor = mul i32 %in, -4
-; CHECK-NEXT: %factor1 = mul i32 %in, 2
-; CHECK-NEXT: %_3 = add i32 %factor, 1
-; CHECK-NEXT: %_5 = add i32 %_3, %factor1
-; CHECK-NEXT: ret i32 %_5
- %_0 = add i32 %in, 1
- %_1 = mul i32 %in, -2
- %_2 = add i32 %_0, %_1
- %_3 = add i32 %_1, %_2
- %_4 = add i32 %_3, 1
- %_5 = add i32 %in, %_3
- ret i32 %_5
-}
-
-; CHECK-LABEL: @foo1
-define void @foo1(float %in, i1 %cmp) {
-wrapper_entry:
- br label %foo1
-
-for.body:
- %0 = fadd float %in1, %in1
- br label %foo1
-
-foo1:
- %_0 = fmul fast float %in, -3.000000e+00
- %_1 = fmul fast float %_0, 3.000000e+00
- %in1 = fadd fast float -3.000000e+00, %_1
- %in1use = fadd fast float %in1, %in1
- br label %for.body
-
-
-}
-
-; CHECK-LABEL: @foo2
-define void @foo2(float %in, i1 %cmp) {
-wrapper_entry:
- br label %for.body
-
-for.body:
-; If the operands of the phi are sheduled for processing before
-; foo1 is processed, the invariant of reassociate are not preserved
- %unused = phi float [%in1, %foo1], [undef, %wrapper_entry]
- br label %foo1
-
-foo1:
- %_0 = fmul fast float %in, -3.000000e+00
- %_1 = fmul fast float %_0, 3.000000e+00
- %in1 = fadd fast float -3.000000e+00, %_1
- %in1use = fadd fast float %in1, %in1
- br label %for.body
-}
diff --git a/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll b/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll
index 7d82ef7e7a2f..c2cdffce61e4 100644
--- a/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll
+++ b/test/Transforms/Reassociate/reassoc-intermediate-fnegs.ll
@@ -1,8 +1,8 @@
; RUN: opt < %s -reassociate -S | FileCheck %s
; CHECK-LABEL: faddsubAssoc1
-; CHECK: [[TMP1:%.*]] = fsub fast half 0xH8000, %a
-; CHECK: [[TMP2:%.*]] = fadd fast half %b, [[TMP1]]
-; CHECK: fmul fast half [[TMP2]], 0xH4500
+; CHECK: [[TMP1:%tmp.*]] = fmul fast half %a, 0xH4500
+; CHECK: [[TMP2:%tmp.*]] = fmul fast half %b, 0xH4500
+; CHECK: fsub fast half [[TMP2]], [[TMP1]]
; CHECK: ret
; Input is A op (B op C)
define half @faddsubAssoc1(half %a, half %b) {
diff --git a/test/Transforms/Reassociate/xor_reassoc.ll b/test/Transforms/Reassociate/xor_reassoc.ll
index a22689805fb5..0bed6f358808 100644
--- a/test/Transforms/Reassociate/xor_reassoc.ll
+++ b/test/Transforms/Reassociate/xor_reassoc.ll
@@ -88,8 +88,8 @@ define i32 @xor_special2(i32 %x, i32 %y) {
%xor1 = xor i32 %xor, %and
ret i32 %xor1
; CHECK-LABEL: @xor_special2(
-; CHECK: %xor = xor i32 %y, 123
-; CHECK: %xor1 = xor i32 %xor, %x
+; CHECK: %xor = xor i32 %x, 123
+; CHECK: %xor1 = xor i32 %xor, %y
; CHECK: ret i32 %xor1
}
diff --git a/test/Transforms/SLPVectorizer/AArch64/gather-root.ll b/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
new file mode 100644
index 000000000000..aefd015cce27
--- /dev/null
+++ b/test/Transforms/SLPVectorizer/AArch64/gather-root.ll
@@ -0,0 +1,87 @@
+; RUN: opt < %s -slp-vectorizer -S | FileCheck %s --check-prefix=DEFAULT
+; RUN: opt < %s -slp-schedule-budget=0 -slp-min-tree-size=0 -slp-threshold=-30 -slp-vectorizer -S | FileCheck %s --check-prefix=GATHER
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64--linux-gnu"
+
+@a = common global [80 x i8] zeroinitializer, align 16
+
+; DEFAULT-LABEL: @PR28330(
+; DEFAULT: %tmp17 = phi i32 [ %tmp34, %for.body ], [ 0, %entry ]
+; DEFAULT: %[[S0:.+]] = select <8 x i1> %1, <8 x i32> <i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720, i32 -720>, <8 x i32> <i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80, i32 -80>
+; DEFAULT: %[[R0:.+]] = shufflevector <8 x i32> %[[S0]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
+; DEFAULT: %[[R1:.+]] = add <8 x i32> %[[S0]], %[[R0]]
+; DEFAULT: %[[R2:.+]] = shufflevector <8 x i32> %[[R1]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; DEFAULT: %[[R3:.+]] = add <8 x i32> %[[R1]], %[[R2]]
+; DEFAULT: %[[R4:.+]] = shufflevector <8 x i32> %[[R3]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; DEFAULT: %[[R5:.+]] = add <8 x i32> %[[R3]], %[[R4]]
+; DEFAULT: %[[R6:.+]] = extractelement <8 x i32> %[[R5]], i32 0
+; DEFAULT: %tmp34 = add i32 %[[R6]], %tmp17
+;
+; GATHER-LABEL: @PR28330(
+; GATHER: %tmp17 = phi i32 [ %tmp34, %for.body ], [ 0, %entry ]
+; GATHER: %tmp19 = select i1 %tmp1, i32 -720, i32 -80
+; GATHER: %tmp21 = select i1 %tmp3, i32 -720, i32 -80
+; GATHER: %tmp23 = select i1 %tmp5, i32 -720, i32 -80
+; GATHER: %tmp25 = select i1 %tmp7, i32 -720, i32 -80
+; GATHER: %tmp27 = select i1 %tmp9, i32 -720, i32 -80
+; GATHER: %tmp29 = select i1 %tmp11, i32 -720, i32 -80
+; GATHER: %tmp31 = select i1 %tmp13, i32 -720, i32 -80
+; GATHER: %tmp33 = select i1 %tmp15, i32 -720, i32 -80
+; GATHER: %[[I0:.+]] = insertelement <8 x i32> undef, i32 %tmp19, i32 0
+; GATHER: %[[I1:.+]] = insertelement <8 x i32> %[[I0]], i32 %tmp21, i32 1
+; GATHER: %[[I2:.+]] = insertelement <8 x i32> %[[I1]], i32 %tmp23, i32 2
+; GATHER: %[[I3:.+]] = insertelement <8 x i32> %[[I2]], i32 %tmp25, i32 3
+; GATHER: %[[I4:.+]] = insertelement <8 x i32> %[[I3]], i32 %tmp27, i32 4
+; GATHER: %[[I5:.+]] = insertelement <8 x i32> %[[I4]], i32 %tmp29, i32 5
+; GATHER: %[[I6:.+]] = insertelement <8 x i32> %[[I5]], i32 %tmp31, i32 6
+; GATHER: %[[I7:.+]] = insertelement <8 x i32> %[[I6]], i32 %tmp33, i32 7
+; GATHER: %[[R0:.+]] = shufflevector <8 x i32> %[[I7]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
+; GATHER: %[[R1:.+]] = add <8 x i32> %[[I7]], %[[R0]]
+; GATHER: %[[R2:.+]] = shufflevector <8 x i32> %[[R1]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; GATHER: %[[R3:.+]] = add <8 x i32> %[[R1]], %[[R2]]
+; GATHER: %[[R4:.+]] = shufflevector <8 x i32> %[[R3]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+; GATHER: %[[R5:.+]] = add <8 x i32> %[[R3]], %[[R4]]
+; GATHER: %[[R6:.+]] = extractelement <8 x i32> %[[R5]], i32 0
+; GATHER: %tmp34 = add i32 %[[R6]], %tmp17
+
+define void @PR28330(i32 %n) {
+entry:
+ %tmp0 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
+ %tmp1 = icmp eq i8 %tmp0, 0
+ %tmp2 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 2), align 2
+ %tmp3 = icmp eq i8 %tmp2, 0
+ %tmp4 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1
+ %tmp5 = icmp eq i8 %tmp4, 0
+ %tmp6 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4
+ %tmp7 = icmp eq i8 %tmp6, 0
+ %tmp8 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
+ %tmp9 = icmp eq i8 %tmp8, 0
+ %tmp10 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
+ %tmp11 = icmp eq i8 %tmp10, 0
+ %tmp12 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
+ %tmp13 = icmp eq i8 %tmp12, 0
+ %tmp14 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
+ %tmp15 = icmp eq i8 %tmp14, 0
+ br label %for.body
+
+for.body:
+ %tmp17 = phi i32 [ %tmp34, %for.body ], [ 0, %entry ]
+ %tmp19 = select i1 %tmp1, i32 -720, i32 -80
+ %tmp20 = add i32 %tmp17, %tmp19
+ %tmp21 = select i1 %tmp3, i32 -720, i32 -80
+ %tmp22 = add i32 %tmp20, %tmp21
+ %tmp23 = select i1 %tmp5, i32 -720, i32 -80
+ %tmp24 = add i32 %tmp22, %tmp23
+ %tmp25 = select i1 %tmp7, i32 -720, i32 -80
+ %tmp26 = add i32 %tmp24, %tmp25
+ %tmp27 = select i1 %tmp9, i32 -720, i32 -80
+ %tmp28 = add i32 %tmp26, %tmp27
+ %tmp29 = select i1 %tmp11, i32 -720, i32 -80
+ %tmp30 = add i32 %tmp28, %tmp29
+ %tmp31 = select i1 %tmp13, i32 -720, i32 -80
+ %tmp32 = add i32 %tmp30, %tmp31
+ %tmp33 = select i1 %tmp15, i32 -720, i32 -80
+ %tmp34 = add i32 %tmp32, %tmp33
+ br label %for.body
+}