Diffstat (limited to 'docs/CompileCudaWithLLVM.rst')
-rw-r--r--  docs/CompileCudaWithLLVM.rst  133
1 file changed, 112 insertions(+), 21 deletions(-)
diff --git a/docs/CompileCudaWithLLVM.rst b/docs/CompileCudaWithLLVM.rst
index a981ffe1e8f52..f57839cec9615 100644
--- a/docs/CompileCudaWithLLVM.rst
+++ b/docs/CompileCudaWithLLVM.rst
@@ -18,9 +18,11 @@ familiarity with CUDA. Information about CUDA programming can be found in the
 How to Build LLVM with CUDA Support
 ===================================
 
-Below is a quick summary of downloading and building LLVM. Consult the `Getting
-Started <http://llvm.org/docs/GettingStarted.html>`_ page for more details on
-setting up LLVM.
+CUDA support is still in development and works best with the trunk version
+of LLVM. Below is a quick summary of downloading and building the trunk
+version. Consult the `Getting Started
+<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
+up LLVM.
 
 #. Checkout LLVM
 
@@ -51,7 +53,7 @@ How to Compile CUDA C/C++ with LLVM
 ===================================
 
 We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
-CUDA installation Guide
+CUDA installation guide
 <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
 you have not.
 
@@ -60,8 +62,6 @@ which multiplies a ``float`` array by a ``float`` scalar (AXPY).
 
 .. code-block:: c++
 
-  #include <helper_cuda.h> // for checkCudaErrors
-
   #include <iostream>
 
   __global__ void axpy(float a, float* x, float* y) {
@@ -78,25 +78,25 @@ which multiplies a ``float`` array by a ``float`` scalar (AXPY).
     // Copy input data to device.
     float* device_x;
    float* device_y;
-    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
-    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
-    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
-                               cudaMemcpyHostToDevice));
+    cudaMalloc(&device_x, kDataLen * sizeof(float));
+    cudaMalloc(&device_y, kDataLen * sizeof(float));
+    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
+               cudaMemcpyHostToDevice);
 
     // Launch the kernel.
     axpy<<<1, kDataLen>>>(a, device_x, device_y);
 
     // Copy output data to host.
-    checkCudaErrors(cudaDeviceSynchronize());
-    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
-                               cudaMemcpyDeviceToHost));
+    cudaDeviceSynchronize();
+    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
+               cudaMemcpyDeviceToHost);
 
     // Print the results.
     for (int i = 0; i < kDataLen; ++i) {
       std::cout << "y[" << i << "] = " << host_y[i] << "\n";
     }
 
-    checkCudaErrors(cudaDeviceReset());
+    cudaDeviceReset();
     return 0;
   }
 
@@ -104,16 +104,89 @@ The command line for compilation is similar to what you would use for C++.
 
 .. code-block:: console
 
-  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
+  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
+      -L<CUDA install path>/<lib64 or lib> \
+      -lcudart_static -ldl -lrt -pthread
   $ ./axpy
   y[0] = 2
   y[1] = 4
   y[2] = 6
   y[3] = 8
 
-Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
-samples installed for this example. ``<CUDA install path>`` is the root
-directory where you installed CUDA SDK, typically ``/usr/local/cuda``.
+``<CUDA install path>`` is the root directory where you installed the CUDA
+SDK, typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability
+of your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you
+want to run your program on a GPU with compute capability 3.5, you should
+specify ``--cuda-gpu-arch=sm_35``.
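+
+Putting those together, a concrete invocation might look like this (a sketch
+assuming the typical ``/usr/local/cuda`` install path and an sm_35 GPU;
+adjust both for your system):
+
+.. code-block:: console
+
+  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
+      -L/usr/local/cuda/lib64 \
+      -lcudart_static -ldl -lrt -pthread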
+
+Detecting clang vs NVCC
+=======================
+
+Although clang's CUDA implementation is largely compatible with NVCC's, you
+may still want to detect when you're compiling CUDA code specifically with
+clang.
+
+This is tricky, because NVCC may invoke clang as part of its own compilation
+process! For example, NVCC uses the host compiler's preprocessor when
+compiling device code, and that host compiler may in fact be clang.
+
+When clang is actually compiling CUDA code -- rather than being used as a
+subtool of NVCC -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
+defined only in device mode (but will be defined if NVCC is using clang as a
+preprocessor). So you can use the following incantations to detect clang CUDA
+compilation, in host and device modes:
+
+.. code-block:: c++
+
+  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
+  // clang compiling CUDA code, host mode.
+  #endif
+
+  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
+  // clang compiling CUDA code, device mode.
+  #endif
+
+Both clang and NVCC define ``__CUDACC__`` during CUDA compilation. You can
+detect NVCC specifically by looking for ``__NVCC__``.
+
+Flags that control numerical code
+=================================
+
+If you're using GPUs, you probably care about making numerical code run fast.
+GPU hardware allows for more control over numerical operations than most
+CPUs, but this results in more compiler options for you to juggle.
+
+Flags you may wish to tweak include the following; a sample invocation
+combining them appears after the list.
+
+* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device
+  when compiling CUDA). Controls whether the compiler emits fused
+  multiply-add operations.
+
+  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
+    and add instructions.
+  * ``on``: fuse multiplies and adds within a single statement, but never
+    across statements (C11 semantics). Prevent ptxas from fusing other
+    multiplies and adds.
+  * ``fast``: fuse multiplies and adds wherever profitable, even across
+    statements. Doesn't prevent ptxas from fusing additional multiplies and
+    adds.
+
+  Fused multiply-add instructions can be much faster than the unfused
+  equivalents, but because the intermediate result in an fma is not rounded,
+  this flag can affect numerical code.
+
+* ``-fcuda-flush-denormals-to-zero`` (default: off). When this is enabled,
+  floating point operations may flush `denormal
+  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to
+  0. Operations on denormal numbers are often much slower than the same
+  operations on normal numbers.
+
+* ``-fcuda-approx-transcendentals`` (default: off). When this is enabled, the
+  compiler may emit calls to faster, approximate versions of transcendental
+  functions, instead of using the slower, fully IEEE-compliant versions. For
+  example, this flag allows clang to emit the PTX ``sin.approx.f32``
+  instruction.
+
+  This is implied by ``-ffast-math``.
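+
+For instance, a speed-oriented build might combine the flags above (an
+assumed invocation shown for illustration, reusing the sm_35 example; all
+three flags are optional and independent):
+
+.. code-block:: console
+
+  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
+      -ffp-contract=fast -fcuda-flush-denormals-to-zero \
+      -fcuda-approx-transcendentals \
+      -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread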
 
 Optimizations
 =============
@@ -134,10 +207,9 @@ customizable target-independent optimization pipeline.
   straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
 
 * **Inferring memory spaces**. `This optimization
-  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
+  <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
   infers the memory space of an address so that the backend can emit faster
-  special loads and stores from it. Details can be found in the `design
-  document for memory space inference <https://goo.gl/5wH2Ct>`_.
+  special loads and stores from it.
 
 * **Aggressive loop unrolling and function inlining**. Loop unrolling and
   function inlining need to be more aggressive for GPUs than for CPUs because
@@ -167,3 +239,22 @@ customizable target-independent optimization pipeline.
 32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
 divides in our benchmarks have a divisor and dividend which fit in 32 bits at
 runtime. This optimization provides a fast path for this common case.
+
+Publication
+===========
+
+| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
+| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
+| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
+| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
+
+Tutorial
+========
+
+`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_
+
+Obtaining Help
+==============
+
+To obtain help on LLVM in general and its CUDA support, see `the LLVM
+community <http://llvm.org/docs/#mailing-lists>`_.