path: root/docs/CommandGuide/llvm-mca.rst
authorDimitry Andric <dim@FreeBSD.org>2018-07-28 10:51:19 +0000
committerDimitry Andric <dim@FreeBSD.org>2018-07-28 10:51:19 +0000
commiteb11fae6d08f479c0799db45860a98af528fa6e7 (patch)
tree44d492a50c8c1a7eb8e2d17ea3360ec4d066f042 /docs/CommandGuide/llvm-mca.rst
parentb8a2042aa938069e862750553db0e4d82d25822c (diff)
Diffstat (limited to 'docs/CommandGuide/llvm-mca.rst')
-rw-r--r--docs/CommandGuide/llvm-mca.rst551
1 files changed, 551 insertions, 0 deletions
diff --git a/docs/CommandGuide/llvm-mca.rst b/docs/CommandGuide/llvm-mca.rst
new file mode 100644
index 000000000000..dd2320b15ffb
--- /dev/null
+++ b/docs/CommandGuide/llvm-mca.rst
@@ -0,0 +1,551 @@
+llvm-mca - LLVM Machine Code Analyzer
+=====================================
+
+SYNOPSIS
+--------
+
+:program:`llvm-mca` [*options*] [input]
+
+DESCRIPTION
+-----------
+
+:program:`llvm-mca` is a performance analysis tool that uses information
+available in LLVM (e.g. scheduling models) to statically measure the performance
+of machine code on a specific CPU.
+
+Performance is measured in terms of throughput as well as processor resource
+consumption. The tool currently works for processors with an out-of-order
+backend, for which there is a scheduling model available in LLVM.
+
+The main goal of this tool is not just to predict the performance of the code
+when run on the target, but also to help with diagnosing potential performance
+issues.
+
+Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle
+(IPC), as well as hardware resource pressure. The analysis and reporting style
+were inspired by the IACA tool from Intel.
+
+:program:`llvm-mca` allows the usage of special code comments to mark regions of
+the assembly code to be analyzed. A comment starting with the substring
+``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment starting with
+the substring ``LLVM-MCA-END`` marks the end of a code region. For example:
+
+.. code-block:: none
+
+ # LLVM-MCA-BEGIN My Code Region
+ ...
+ # LLVM-MCA-END
+
+Multiple regions can be specified provided that they do not overlap. A code
+region can have an optional description. If no user-defined region is specified,
+then :program:`llvm-mca` assumes a default region which contains every
+instruction in the input file. Every region is analyzed in isolation, and the
+final performance report is the union of all the reports generated for every
+code region.
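+
+The region rules above can be illustrated with a small Python sketch. It is
+not how :program:`llvm-mca` parses its input: ``split_regions`` is a
+hypothetical helper, it assumes that ``#`` starts a comment (as in the x86
+examples in this document), it treats every other non-blank line as an
+instruction, and it assumes that an unterminated region is an error.
+
+.. code-block:: python
+
+ def split_regions(asm_text):
+     """Group instruction lines by LLVM-MCA-BEGIN / LLVM-MCA-END comments."""
+     regions = []     # finished regions as (description, [instruction lines])
+     current = None   # region being collected, or None outside any region
+     everything = []  # all instruction lines, used for the default region
+     for raw in asm_text.splitlines():
+         line = raw.strip()
+         if not line:
+             continue
+         if line.startswith("#"):
+             comment = line[1:].strip()
+             if comment.startswith("LLVM-MCA-BEGIN"):
+                 if current is not None:
+                     raise ValueError("regions must not overlap")
+                 # Whatever follows the marker is the optional description.
+                 description = comment[len("LLVM-MCA-BEGIN"):].strip()
+                 current = (description, [])
+             elif comment.startswith("LLVM-MCA-END"):
+                 if current is None:
+                     raise ValueError("LLVM-MCA-END without LLVM-MCA-BEGIN")
+                 regions.append(current)
+                 current = None
+             continue
+         everything.append(line)
+         if current is not None:
+             current[1].append(line)
+     if current is not None:
+         raise ValueError("unterminated region (assumption of this sketch)")
+     # No user-defined region: one default region with every instruction.
+     return regions if regions else [("", everything)]
+
+With one marked region, only the instructions between the two comments are
+returned; with no markers, the single default region contains every line.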
+
+Inline assembly directives may be used from source code to annotate the
+assembly text:
+
+.. code-block:: c++
+
+ int foo(int a, int b) {
+   __asm volatile("# LLVM-MCA-BEGIN foo");
+   a += 42;
+   __asm volatile("# LLVM-MCA-END");
+   a *= b;
+   return a;
+ }
+
+So for example, you can compile code with clang, output assembly, and pipe it
+directly into llvm-mca for analysis:
+
+.. code-block:: bash
+
+ $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2
+
+Or for Intel syntax:
+
+.. code-block:: bash
+
+ $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2
+
+OPTIONS
+-------
+
+If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
+input. Otherwise, it will read from the specified filename.
+
+If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
+to standard output if the input is from standard input. If the :option:`-o`
+option specifies "``-``", then the output will also be sent to standard output.
+
+
+.. option:: -help
+
+ Print a summary of command line options.
+
+.. option:: -mtriple=<target triple>
+
+ Specify a target triple string.
+
+.. option:: -march=<arch>
+
+ Specify the architecture for which to analyze the code. It defaults to the
+ host default target.
+
+.. option:: -mcpu=<cpuname>
+
+ Specify the processor for which to analyze the code. By default, the cpu name
+ is autodetected from the host.
+
+.. option:: -output-asm-variant=<variant id>
+
+ Specify the output assembly variant for the report generated by the tool.
+ On x86, possible values are [0, 1]. A value of 0 (respectively 1) for this flag
+ enables the AT&T (respectively Intel) assembly format for the code printed out
+ by the tool in the analysis report.
+
+.. option:: -dispatch=<width>
+
+ Specify a different dispatch width for the processor. The dispatch width
+ defaults to the 'IssueWidth' field in the processor scheduling model. If width
+ is zero, then the default dispatch width is used.
+
+.. option:: -register-file-size=<size>
+
+ Specify the size of the register file. When specified, this flag limits how
+ many temporary registers are available for register renaming purposes. A value
+ of zero for this flag means "unlimited number of temporary registers".
+
+.. option:: -iterations=<number of iterations>
+
+ Specify the number of iterations to run. If this flag is set to 0, then the
+ tool sets the number of iterations to a default value (i.e. 100).
+
+.. option:: -noalias=<bool>
+
+ If set, the tool assumes that loads and stores don't alias. This is the
+ default behavior.
+
+.. option:: -lqueue=<load queue size>
+
+ Specify the size of the load queue in the load/store unit emulated by the tool.
+ By default, the tool assumes an unbounded number of entries in the load queue.
+ A value of zero for this flag is ignored, and the default load queue size is
+ used instead.
+
+.. option:: -squeue=<store queue size>
+
+ Specify the size of the store queue in the load/store unit emulated by the
+ tool. By default, the tool assumes an unbounded number of entries in the store
+ queue. A value of zero for this flag is ignored, and the default store queue
+ size is used instead.
+
+.. option:: -timeline
+
+ Enable the timeline view.
+
+.. option:: -timeline-max-iterations=<iterations>
+
+ Limit the number of iterations to print in the timeline view. By default, the
+ timeline view prints information for up to 10 iterations.
+
+.. option:: -timeline-max-cycles=<cycles>
+
+ Limit the number of cycles in the timeline view. By default, the number of
+ cycles is set to 80.
+
+.. option:: -resource-pressure
+
+ Enable the resource pressure view. This is enabled by default.
+
+.. option:: -register-file-stats
+
+ Enable register file usage statistics.
+
+.. option:: -dispatch-stats
+
+ Enable extra dispatch statistics. This view collects and analyzes instruction
+ dispatch events, as well as static/dynamic dispatch stall events. This view
+ is disabled by default.
+
+.. option:: -scheduler-stats
+
+ Enable extra scheduler statistics. This view collects and analyzes instruction
+ issue events. This view is disabled by default.
+
+.. option:: -retire-stats
+
+ Enable extra retire control unit statistics. This view is disabled by default.
+
+.. option:: -instruction-info
+
+ Enable the instruction info view. This is enabled by default.
+
+.. option:: -all-stats
+
+ Print all hardware statistics. This enables extra statistics related to the
+ dispatch logic, the hardware schedulers, the register file(s), and the retire
+ control unit. This option is disabled by default.
+
+.. option:: -all-views
+
+ Enable all the views.
+
+.. option:: -instruction-tables
+
+ Print resource pressure information based on the static information
+ available from the processor model. This differs from the resource pressure
+ view because it doesn't require that the code is simulated. It instead prints
+ the theoretical uniform distribution of resource pressure for every
+ instruction in sequence.
+
+
+EXIT STATUS
+-----------
+
+:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
+to standard error, and the tool returns 1.
+
+HOW MCA WORKS
+-------------
+
+MCA takes assembly code as input. The assembly code is parsed into a sequence
+of MCInst with the help of the existing LLVM target assembly parsers. The
+parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
+a performance report.
+
+The Pipeline module simulates the execution of the machine code sequence in a
+loop of iterations (default is 100). During this process, the pipeline collects
+a number of execution related statistics. At the end of this process, the
+pipeline generates and prints a report from the collected statistics.
+
+Here is an example of a performance report generated by MCA for a dot-product
+of two packed float vectors of four elements. The analysis is conducted for
+target x86, cpu btver2. The report below can be produced with the following
+command, using the example located at
+``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
+
+.. code-block:: bash
+
+ $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
+
+.. code-block:: none
+
+ Iterations: 300
+ Instructions: 900
+ Total Cycles: 610
+ Dispatch Width: 2
+ IPC: 1.48
+ Block RThroughput: 2.0
+
+
+ Instruction Info:
+ [1]: #uOps
+ [2]: Latency
+ [3]: RThroughput
+ [4]: MayLoad
+ [5]: MayStore
+ [6]: HasSideEffects (U)
+
+ [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
+  1      2     1.00                        vmulps %xmm0, %xmm1, %xmm2
+  1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3
+  1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4
+
+
+ Resources:
+ [0] - JALU0
+ [1] - JALU1
+ [2] - JDiv
+ [3] - JFPA
+ [4] - JFPM
+ [5] - JFPU0
+ [6] - JFPU1
+ [7] - JLAGU
+ [8] - JMul
+ [9] - JSAGU
+ [10] - JSTC
+ [11] - JVALU0
+ [12] - JVALU1
+ [13] - JVIMUL
+
+
+ Resource pressure per iteration:
+ [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
+  -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -
+
+ Resource pressure by instruction:
+ [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
+  -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps %xmm0, %xmm1, %xmm2
+  -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm2, %xmm2, %xmm3
+  -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm3, %xmm3, %xmm4
+
+According to this report, the dot-product kernel has been executed 300 times,
+for a total of 900 dynamically executed instructions.
+
+The report is structured in three main sections. The first section collects a
+few performance numbers; the goal of this section is to give a very quick
+overview of the performance throughput. In this example, the two important
+performance indicators are the predicted total number of cycles, and the IPC.
+IPC is probably the most important throughput indicator. A big delta between
+the Dispatch Width and the computed IPC is an indicator of potential
+performance issues.
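+
+As a worked check of the summary numbers, the Python snippet below recomputes
+the instruction count and the IPC from the values printed in the report. It
+assumes that IPC is simply the number of instructions divided by the number of
+cycles, which is consistent with the figures shown (900 / 610); the variable
+names are illustrative only.
+
+.. code-block:: python
+
+ iterations = 300
+ instructions_per_iteration = 3   # vmulps + two vhaddps
+ total_cycles = 610
+ dispatch_width = 2
+
+ instructions = iterations * instructions_per_iteration
+ ipc = instructions / total_cycles
+ # Distance between the measured IPC and the dispatch width.
+ gap = dispatch_width - ipc
+
+ print(f"Instructions: {instructions}")       # Instructions: 900
+ print(f"IPC: {ipc:.2f}")                     # IPC: 1.48
+ print(f"Gap to dispatch width: {gap:.2f}")   # Gap to dispatch width: 0.52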
+
+The second section of the report shows the latency and reciprocal
+throughput of every instruction in the sequence. That section also reports
+extra information related to the number of micro opcodes, and opcode properties
+(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
+
+The third section is the *Resource pressure view*. This view reports
+the average number of resource cycles consumed every iteration by instructions
+for every processor resource unit available on the target. Information is
+structured in two tables. The first table reports the number of resource cycles
+spent on average every iteration. The second table correlates the resource
+cycles to the machine instruction in the sequence. For example, every iteration
+of the instruction vmulps always executes on resource unit [6]
+(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
+per iteration. Note that on AMD Jaguar, vector floating-point multiply can
+only be issued to pipeline JFPU1, while horizontal floating-point additions can
+only be issued to pipeline JFPU0.
+
+The resource pressure view helps with identifying bottlenecks caused by high
+usage of specific hardware resources. Situations with resource pressure mainly
+concentrated on a few resources should, in general, be avoided. Ideally,
+pressure should be uniformly distributed between multiple resources.
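+
+The relationship between the two tables can be sketched in Python: summing the
+per-instruction rows reproduces the per-iteration row, and the largest totals
+point at the most heavily used resources. The dictionary below is transcribed
+by hand from the report above; it is not produced by any :program:`llvm-mca`
+API.
+
+.. code-block:: python
+
+ # Resource cycles per instruction, copied from "Resource pressure by
+ # instruction" (resources shown as "-" are omitted).
+ by_instruction = [
+     ("vmulps", {"JFPM": 1.00, "JFPU1": 1.00}),
+     ("vhaddps", {"JFPA": 1.00, "JFPU0": 1.00}),
+     ("vhaddps", {"JFPA": 1.00, "JFPU0": 1.00}),
+ ]
+
+ per_iteration = {}
+ for _, usage in by_instruction:
+     for resource, cycles in usage.items():
+         per_iteration[resource] = per_iteration.get(resource, 0.0) + cycles
+
+ busiest = max(per_iteration.values())
+ hot = sorted(r for r, c in per_iteration.items() if c == busiest)
+
+ print(per_iteration)
+ print(hot)   # ['JFPA', 'JFPU0']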
+
+Timeline View
+^^^^^^^^^^^^^
+MCA's timeline view produces a detailed report of each instruction's state
+transitions through an instruction pipeline. This view is enabled by the
+command line option ``-timeline``. As instructions transition through the
+various stages of the pipeline, their states are depicted in the view report.
+These states are represented by the following characters:
+
+* D : Instruction dispatched.
+* e : Instruction executing.
+* E : Instruction executed.
+* R : Instruction retired.
+* = : Instruction already dispatched, waiting to be executed.
+* \- : Instruction executed, waiting to be retired.
+
+Below is the timeline view for a subset of the dot-product example located in
+``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
+MCA using the following command:
+
+.. code-block:: bash
+
+ $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
+
+.. code-block:: none
+
+ Timeline view:
+                     012345
+ Index     0123456789
+
+ [0,0]     DeeER.    .    .   vmulps %xmm0, %xmm1, %xmm2
+ [0,1]     D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3
+ [0,2]     .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4
+ [1,0]     .DeeE-----R    .   vmulps %xmm0, %xmm1, %xmm2
+ [1,1]     . D=eeeE---R   .   vhaddps %xmm2, %xmm2, %xmm3
+ [1,2]     . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4
+ [2,0]     .  DeeE-----R  .   vmulps %xmm0, %xmm1, %xmm2
+ [2,1]     .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3
+ [2,2]     .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4
+
+
+ Average Wait times (based on the timeline view):
+ [0]: Executions
+ [1]: Average time spent waiting in a scheduler's queue
+ [2]: Average time spent waiting in a scheduler's queue while ready
+ [3]: Average time elapsed from WB until retire stage
+
+       [0]    [1]    [2]    [3]
+ 0.     3     1.0    1.0    3.3       vmulps %xmm0, %xmm1, %xmm2
+ 1.     3     3.3    0.7    1.0       vhaddps %xmm2, %xmm2, %xmm3
+ 2.     3     5.7    0.0    0.0       vhaddps %xmm3, %xmm3, %xmm4
+
+The timeline view is interesting because it shows instruction state changes
+during execution. It also gives an idea of how MCA processes instructions
+executed on the target, and how their timing information might be calculated.
+
+The timeline view is structured in two tables. The first table shows
+instructions changing state over time (measured in cycles); the second table
+(named *Average Wait times*) reports useful timing statistics, which should
+help diagnose performance bottlenecks caused by long data dependencies and
+sub-optimal usage of hardware resources.
+
+An instruction in the timeline view is identified by a pair of indices, where
+the first index identifies an iteration, and the second index is the
+instruction index (i.e., where it appears in the code sequence). Since this
+example was generated using 3 iterations (``-iterations=3``), the iteration
+indices range from 0 to 2 inclusive.
+
+Excluding the first and last columns, the remaining columns are in cycles.
+Cycles are numbered sequentially starting from 0.
+
+From the example output above, we know the following:
+
+* Instruction [1,0] was dispatched at cycle 1.
+* Instruction [1,0] started executing at cycle 2.
+* Instruction [1,0] reached the write back stage at cycle 4.
+* Instruction [1,0] was retired at cycle 10.
+
+Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
+scheduler's queue for the operands to become available. By the time vmulps is
+dispatched, operands are already available, and pipeline JFPU1 is ready to
+serve another instruction. So the instruction can be immediately issued on the
+JFPU1 pipeline. That is demonstrated by the fact that the instruction only
+spent 1cy in the scheduler's queue.
+
+There is a gap of 5 cycles between the write-back stage and the retire event.
+That is because instructions must retire in program order, so [1,0] has to wait
+for [0,2] to be retired first (i.e., it has to wait until cycle 10).
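+
+The reading of row [1,0] given above can be reproduced mechanically. The
+Python sketch below assumes one character per cycle, with the first character
+of the row corresponding to cycle 0; ``decode_row`` is a hypothetical helper
+and the row string is copied by hand from the timeline.
+
+.. code-block:: python
+
+ def decode_row(cells):
+     """Return the cycle of each main event in one timeline row."""
+     return {
+         "dispatched": cells.index("D"),
+         "started_executing": cells.index("e"),
+         "write_back": cells.index("E"),
+         "retired": cells.index("R"),
+     }
+
+ row_1_0 = ".DeeE-----R    ."   # vmulps from iteration #1
+ events = decode_row(row_1_0)
+ # Each '-' is one cycle spent executed but not yet retired.
+ waiting_to_retire = row_1_0.count("-")
+
+ print(events)
+ print(waiting_to_retire)   # 5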
+
+In the example, all instructions are in a RAW (Read After Write) dependency
+chain. Register %xmm2 written by vmulps is immediately used by the first
+vhaddps, and register %xmm3 written by the first vhaddps is used by the second
+vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
+Parallelism).
+
+In the dot-product example, there are anti-dependencies introduced by
+instructions from different iterations. However, those dependencies can be
+removed at register renaming stage (at the cost of allocating register aliases,
+and therefore consuming temporary registers).
+
+Table *Average Wait times* helps diagnose performance issues that are caused by
+the presence of long latency instructions and potentially long data dependencies
+which may limit the ILP. Note that MCA, by default, assumes at least 1cy
+between the dispatch event and the issue event.
+
+When the performance is limited by data dependencies and/or long latency
+instructions, the number of cycles spent while in the *ready* state is expected
+to be very small when compared with the total number of cycles spent in the
+scheduler's queue. The difference between the two counters is a good indicator
+of how large an impact data dependencies had on the execution of the
+instructions. When performance is mostly limited by the lack of hardware
+resources, the delta between the two counters is small. However, the number of
+cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
+especially when compared to other low latency instructions.
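+
+For the dot-product example, columns [1] and [3] of *Average Wait times* can
+be recomputed from the timeline characters. The Python sketch below is an
+illustration under two assumptions: each ``=`` counts as one cycle in the
+scheduler's queue, plus the one cycle between dispatch and issue noted above,
+and each ``-`` counts as one cycle between write-back and retire. With those
+assumptions the results match the table for this example. Column [2] cannot be
+derived from the timeline characters alone.
+
+.. code-block:: python
+
+ # Timeline rows in program order, copied by hand (one character per cycle).
+ timeline = [
+     "DeeER.    .    .",
+     "D==eeeER  .    .",
+     ".D====eeeER    .",
+     ".DeeE-----R    .",
+     ". D=eeeE---R   .",
+     ". D====eeeER   .",
+     ".  DeeE-----R  .",
+     ".  D====eeeER  .",
+     ".   D======eeeER",
+ ]
+ kernel_size = 3   # instructions per iteration
+
+ averages = []
+ for index in range(kernel_size):
+     rows = timeline[index::kernel_size]   # same instruction, every iteration
+     in_queue = sum(1 + row.count("=") for row in rows) / len(rows)
+     until_retire = sum(row.count("-") for row in rows) / len(rows)
+     averages.append((round(in_queue, 1), round(until_retire, 1)))
+
+ print(averages)   # [(1.0, 3.3), (3.3, 1.0), (5.7, 0.0)]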
+
+Extra Statistics to Further Diagnose Performance Issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The ``-all-stats`` command line option enables extra statistics and performance
+counters for the dispatch logic, the reorder buffer, the retire control unit,
+and the register file.
+
+Below is an example of ``-all-stats`` output generated by MCA for the
+dot-product example discussed in the previous sections.
+
+.. code-block:: none
+
+ Dynamic Dispatch Stall Cycles:
+ RAT - Register unavailable: 0
+ RCU - Retire tokens unavailable: 0
+ SCHEDQ - Scheduler full: 272
+ LQ - Load queue full: 0
+ SQ - Store queue full: 0
+ GROUP - Static restrictions on the dispatch group: 0
+
+
+ Dispatch Logic - number of cycles where we saw N instructions dispatched:
+ [# dispatched], [# cycles]
+ 0, 24 (3.9%)
+ 1, 272 (44.6%)
+ 2, 314 (51.5%)
+
+
+ Schedulers - number of cycles where we saw N instructions issued:
+ [# issued], [# cycles]
+ 0, 7 (1.1%)
+ 1, 306 (50.2%)
+ 2, 297 (48.7%)
+
+
+ Scheduler's queue usage:
+ JALU01, 0/20
+ JFPU01, 18/18
+ JLSAGU, 0/12
+
+
+ Retire Control Unit - number of cycles where we saw N instructions retired:
+ [# retired], [# cycles]
+ 0, 109 (17.9%)
+ 1, 102 (16.7%)
+ 2, 399 (65.4%)
+
+
+ Register File statistics:
+ Total number of mappings created: 900
+ Max number of mappings used: 35
+
+ * Register File #1 -- JFpuPRF:
+ Number of physical registers: 72
+ Total number of mappings created: 900
+ Max number of mappings used: 35
+
+ * Register File #2 -- JIntegerPRF:
+ Number of physical registers: 64
+ Total number of mappings created: 0
+ Max number of mappings used: 0
+
+If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
+SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
+logic is unable to dispatch a group of two instructions because the scheduler's
+queue is full.
+
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able
+to dispatch two instructions 51.5% of the time. The dispatch group was limited
+to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
+dispatch statistics are displayed by either using the command option
+``-all-stats`` or ``-dispatch-stats``.
+
+The next table, *Schedulers*, presents a histogram of the number of
+instructions issued per cycle. In this case, of the 610 simulated cycles, a
+single instruction was issued in 306 cycles (50.2%), two instructions were
+issued in 297 cycles (48.7%), and there were 7 cycles where no instructions
+were issued.
+
+The *Scheduler's queue usage* table shows the maximum number of buffer
+entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
+reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
+three schedulers:
+
+* JALU01 - A scheduler for ALU instructions.
+* JFPU01 - A scheduler for floating point operations.
+* JLSAGU - A scheduler for address generation.
+
+The dot-product is a kernel of three floating point instructions (a vector
+multiply followed by two horizontal adds). That explains why only the floating
+point scheduler appears to be used.
+
+A full scheduler queue is either caused by data dependency chains or by a
+sub-optimal usage of hardware resources. Sometimes, resource pressure can be
+mitigated by rewriting the kernel using different instructions that consume
+different scheduler resources. Schedulers with a small queue are less resilient
+to bottlenecks caused by the presence of long data dependencies.
+The scheduler statistics are displayed by
+using the command option ``-all-stats`` or ``-scheduler-stats``.
+
+The next table, *Retire Control Unit*, presents a histogram of the number of
+instructions retired per cycle. In this case, of the 610 simulated cycles, two
+instructions were retired during the same cycle 399 times (65.4%) and there
+were 109 cycles where no instructions were retired. The retire statistics are
+displayed by using the command option ``-all-stats`` or ``-retire-stats``.
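+
+The dispatch, scheduler, and retire histograms can be cross-checked against
+the summary numbers. In the Python sketch below, the counts are transcribed by
+hand from the ``-all-stats`` output above. For each histogram, the cycle counts
+should add up to the 610 simulated cycles, and the counts weighted by the
+number of instructions should add up to the 900 simulated instructions.
+
+.. code-block:: python
+
+ total_cycles = 610
+ histograms = {
+     # instructions per cycle -> number of cycles
+     "dispatched": {0: 24, 1: 272, 2: 314},
+     "issued": {0: 7, 1: 306, 2: 297},
+     "retired": {0: 109, 1: 102, 2: 399},
+ }
+
+ summary = {}
+ for name, hist in histograms.items():
+     cycles = sum(hist.values())
+     instructions = sum(n * count for n, count in hist.items())
+     shares = {n: round(100 * count / total_cycles, 1)
+               for n, count in hist.items()}
+     summary[name] = (cycles, instructions, shares)
+     print(name, cycles, instructions, shares)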
+
+The last table presented is *Register File statistics*. Each physical register
+file (PRF) used by the pipeline is presented in this table. In the case of AMD
+Jaguar, there are two register files, one for floating-point registers
+(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
+the 900 instructions processed, there were 900 mappings created. Since this
+dot-product example utilized only floating point registers, the JFpuPRF was
+responsible for creating the 900 mappings. However, we see that the pipeline
+only used a maximum of 35 of 72 available register slots at any given time. We
+can conclude that the floating point PRF was the only register file used for
+the example, and that it was never resource constrained. The register file
+statistics are displayed by using the command option ``-all-stats`` or
+``-register-file-stats``.
+
+In this example, we can conclude that the IPC is mostly limited by data
+dependencies, and not by resource pressure.