Diffstat (limited to 'docs/CompileCudaWithLLVM.rst')
 -rw-r--r--  docs/CompileCudaWithLLVM.rst | 133
 1 file changed, 112 insertions(+), 21 deletions(-)
diff --git a/docs/CompileCudaWithLLVM.rst b/docs/CompileCudaWithLLVM.rst
index a981ffe1e8f52..f57839cec9615 100644
--- a/docs/CompileCudaWithLLVM.rst
+++ b/docs/CompileCudaWithLLVM.rst
@@ -18,9 +18,11 @@ familiarity with CUDA. Information about CUDA programming can be found in the
How to Build LLVM with CUDA Support
===================================
-Below is a quick summary of downloading and building LLVM. Consult the `Getting
-Started <http://llvm.org/docs/GettingStarted.html>`_ page for more details on
-setting up LLVM.
+CUDA support is still in development and works best with the trunk version
+of LLVM. Below is a quick summary of downloading and building the trunk
+version. Consult the `Getting Started
+<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
+up LLVM.
#. Checkout LLVM
@@ -51,7 +53,7 @@ How to Compile CUDA C/C++ with LLVM
===================================
We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
-CUDA installation Guide
+CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.
@@ -60,8 +62,6 @@ which multiplies a ``float`` array by a ``float`` scalar (AXPY).
.. code-block:: c++
- #include <helper_cuda.h> // for checkCudaErrors
-
#include <iostream>
__global__ void axpy(float a, float* x, float* y) {
@@ -78,25 +78,25 @@ which multiplies a ``float`` array by a ``float`` scalar (AXPY).
// Copy input data to device.
float* device_x;
float* device_y;
- checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
- checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
- checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
- cudaMemcpyHostToDevice));
+ cudaMalloc(&device_x, kDataLen * sizeof(float));
+ cudaMalloc(&device_y, kDataLen * sizeof(float));
+ cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
+ cudaMemcpyHostToDevice);
// Launch the kernel.
axpy<<<1, kDataLen>>>(a, device_x, device_y);
// Copy output data to host.
- checkCudaErrors(cudaDeviceSynchronize());
- checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
- cudaMemcpyDeviceToHost));
+ cudaDeviceSynchronize();
+ cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
+ cudaMemcpyDeviceToHost);
// Print the results.
for (int i = 0; i < kDataLen; ++i) {
std::cout << "y[" << i << "] = " << host_y[i] << "\n";
}
- checkCudaErrors(cudaDeviceReset());
+ cudaDeviceReset();
return 0;
}
@@ -104,16 +104,89 @@ The command line for compilation is similar to what you would use for C++.
.. code-block:: console
- $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
+ $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
+ -L<CUDA install path>/<lib64 or lib> \
+ -lcudart_static -ldl -lrt -pthread
$ ./axpy
y[0] = 2
y[1] = 4
y[2] = 6
y[3] = 8
-Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
-samples installed for this example. ``<CUDA install path>`` is the root
-directory where you installed CUDA SDK, typically ``/usr/local/cuda``.
+``<CUDA install path>`` is the root directory where you installed the CUDA
+SDK, typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability
+of your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you
+want to run your program on a GPU with compute capability 3.5, you should
+specify ``--cuda-gpu-arch=sm_35``.
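+
+For instance, a complete invocation for such a GPU, assuming the SDK is
+installed at ``/usr/local/cuda`` with 64-bit libraries:
+
+.. code-block:: console
+
+  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
+      -L/usr/local/cuda/lib64 \
+      -lcudart_static -ldl -lrt -pthread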
+
+Detecting clang vs NVCC
+=======================
+
+Although clang's CUDA implementation is largely compatible with NVCC's, you may
+still want to detect when you're compiling CUDA code specifically with clang.
+
+This is tricky, because NVCC may invoke clang as part of its own compilation
+process! For example, NVCC uses the host compiler's preprocessor when
+compiling device code, and that host compiler may in fact be clang.
+
+When clang is actually compiling CUDA code -- rather than being used as a
+subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
+defined only in device mode (but will be defined if NVCC is using clang as a
+preprocessor). So you can use the following incantations to detect clang CUDA
+compilation, in host and device modes:
+
+.. code-block:: c++
+
+ #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
+ // clang compiling CUDA code, host mode.
+ #endif
+
+ #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
+ // clang compiling CUDA code, device mode.
+ #endif
+
+Both clang and NVCC define ``__CUDACC__`` during CUDA compilation. You can
+detect NVCC specifically by looking for ``__NVCC__``.
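+
+A compact sketch combining these macros (the branch layout is our own
+illustration, not prescribed by either compiler):
+
+.. code-block:: c++
+
+  #if defined(__NVCC__)
+  // Compiled by NVCC (which may still invoke clang as a host preprocessor).
+  #elif defined(__clang__) && defined(__CUDA__)
+  // Compiled by clang acting as the CUDA compiler itself.
+  #endif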
+
+Flags that control numerical code
+=================================
+
+If you're using GPUs, you probably care about making numerical code run fast.
+GPU hardware allows for more control over numerical operations than most CPUs,
+but this results in more compiler options for you to juggle.
+
+Flags you may wish to tweak include:
+
+* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device
+ when compiling CUDA). Controls whether the compiler emits fused
+ multiply-add operations.
+
+ * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
+ and add instructions.
+ * ``on``: fuse multiplies and adds within a single statement, but never
+ across statements (C11 semantics). Prevent ptxas from fusing other
+ multiplies and adds.
+ * ``fast``: fuse multiplies and adds wherever profitable, even across
+ statements. Doesn't prevent ptxas from fusing additional multiplies and
+ adds.
+
+ Fused multiply-add instructions can be much faster than the unfused
+ equivalents, but because the intermediate result in an fma is not rounded,
+ this flag can affect numerical code.
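+
+ For example, in the kernel sketch below (our own illustration),
+ ``-ffp-contract=fast`` lets the compiler fuse the multiply and add:
+
+ .. code-block:: c++
+
+   __global__ void madd(float a, float* x, float* y) {
+     int i = threadIdx.x;
+     // With -ffp-contract=fast this may compile to a single fma
+     // instruction; the intermediate product a * x[i] is not rounded.
+     y[i] = a * x[i] + y[i];
+   }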
+
+* ``-fcuda-flush-denormals-to-zero`` (default: off). When this is enabled,
+ floating point operations may flush `denormal
+ <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
+ Operations on denormal numbers are often much slower than the same operations
+ on normal numbers.
+
+* ``-fcuda-approx-transcendentals`` (default: off). When this is enabled, the
+ compiler may emit calls to faster, approximate versions of transcendental
+ functions, instead of using the slower, fully IEEE-compliant versions. For
+ example, this flag allows clang to emit the ptx ``sin.approx.f32``
+ instruction.
+
+ This is implied by ``-ffast-math``.
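+
+ For instance, in a sketch like the following (our own example), clang may
+ lower the ``sinf`` call to the approximate instruction:
+
+ .. code-block:: c++
+
+   __global__ void fast_sin(float* x) {
+     // With -fcuda-approx-transcendentals (or -ffast-math), this call
+     // may be lowered to the ptx sin.approx.f32 instruction.
+     x[threadIdx.x] = sinf(x[threadIdx.x]);
+   }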
Optimizations
=============
@@ -134,10 +207,9 @@ customizable target-independent optimization pipeline.
straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
* **Inferring memory spaces**. `This optimization
- <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
+ <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
infers the memory space of an address so that the backend can emit faster
- special loads and stores from it. Details can be found in the `design
- document for memory space inference <https://goo.gl/5wH2Ct>`_.
+ special loads and stores from it.
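+
+ A hedged illustration (our own sketch, not from the pass itself): ``p``
+ below has generic pointer type, but once the pass infers that it always
+ points into shared memory, the backend can emit the faster shared-memory
+ loads and stores for it.
+
+ .. code-block:: c++
+
+   __global__ void reverse64(float* out) {
+     __shared__ float buf[64];
+     float* p = buf;                           // generic float*
+     p[threadIdx.x] = out[threadIdx.x];        // may become st.shared
+     __syncthreads();
+     out[threadIdx.x] = p[63 - threadIdx.x];   // may become ld.shared
+   }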
* **Aggressive loop unrolling and function inlining**. Loop unrolling and
function inlining need to be more aggressive for GPUs than for CPUs because
@@ -167,3 +239,22 @@ customizable target-independent optimization pipeline.
32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
divides in our benchmarks have a divisor and dividend which fit in 32-bits at
runtime. This optimization provides a fast path for this common case.
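+
+A hedged sketch of the idea behind this fast path (our own illustration,
+not the compiler's actual output):
+
+.. code-block:: c++
+
+  __device__ unsigned long long div64(unsigned long long a,
+                                      unsigned long long b) {
+    // If dividend and divisor both fit in 32 bits at runtime, take the
+    // much cheaper 32-bit divide path; otherwise do the 64-bit divide.
+    if (((a | b) >> 32) == 0)
+      return (unsigned)a / (unsigned)b;
+    return a / b;
+  }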
+
+Publication
+===========
+
+| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
+| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
+| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
+| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
+
+Tutorial
+========
+
+`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_
+
+Obtaining Help
+==============
+
+To obtain help on LLVM in general and its CUDA support, see `the LLVM
+community <http://llvm.org/docs/#mailing-lists>`_.