| author | Dimitry Andric <dim@FreeBSD.org> | 2018-07-28 10:51:19 +0000 | 
|---|---|---|
| committer | Dimitry Andric <dim@FreeBSD.org> | 2018-07-28 10:51:19 +0000 | 
| commit | eb11fae6d08f479c0799db45860a98af528fa6e7 (patch) | |
| tree | 44d492a50c8c1a7eb8e2d17ea3360ec4d066f042 /docs/CommandGuide/llvm-mca.rst | |
| parent | b8a2042aa938069e862750553db0e4d82d25822c (diff) | |
Diffstat (limited to 'docs/CommandGuide/llvm-mca.rst')
| -rw-r--r-- | docs/CommandGuide/llvm-mca.rst | 551 | 
1 files changed, 551 insertions, 0 deletions
| diff --git a/docs/CommandGuide/llvm-mca.rst b/docs/CommandGuide/llvm-mca.rst new file mode 100644 index 000000000000..dd2320b15ffb --- /dev/null +++ b/docs/CommandGuide/llvm-mca.rst @@ -0,0 +1,551 @@ +llvm-mca - LLVM Machine Code Analyzer +===================================== + +SYNOPSIS +-------- + +:program:`llvm-mca` [*options*] [input] + +DESCRIPTION +----------- + +:program:`llvm-mca` is a performance analysis tool that uses information +available in LLVM (e.g. scheduling models) to statically measure the performance +of machine code in a specific CPU. + +Performance is measured in terms of throughput as well as processor resource +consumption. The tool currently works for processors with an out-of-order +backend, for which there is a scheduling model available in LLVM. + +The main goal of this tool is not just to predict the performance of the code +when run on the target, but also help with diagnosing potential performance +issues. + +Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle +(IPC), as well as hardware resource pressure. The analysis and reporting style +were inspired by the IACA tool from Intel. + +:program:`llvm-mca` allows the usage of special code comments to mark regions of +the assembly code to be analyzed.  A comment starting with substring +``LLVM-MCA-BEGIN`` marks the beginning of a code region. A comment starting with +substring ``LLVM-MCA-END`` marks the end of a code region.  For example: + +.. code-block:: none + +  # LLVM-MCA-BEGIN My Code Region +    ... +  # LLVM-MCA-END + +Multiple regions can be specified provided that they do not overlap.  A code +region can have an optional description. If no user-defined region is specified, +then :program:`llvm-mca` assumes a default region which contains every +instruction in the input file.  Every region is analyzed in isolation, and the +final performance report is the union of all the reports generated for every +code region. 
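The region markers described above can be illustrated with a short Python sketch. This is not how llvm-mca parses its input: `split_regions` is a hypothetical helper that assumes `#` starts a comment (as in the examples here) and treats every non-blank line without a marker as an instruction. Rejecting unbalanced markers is a choice made for the sketch, not documented tool behavior.

```python
def split_regions(asm_text):
    """Split assembly text into (description, instructions) regions.

    Hypothetical helper: assumes '#' starts a comment and that every
    non-blank line without a marker is an instruction.
    """
    regions = []
    everything = []     # feeds the default region when no marker is present
    current = None      # region being collected, or None outside a region
    saw_marker = False

    for line in asm_text.splitlines():
        _, hash_found, comment = line.partition("#")
        comment = comment.strip()
        if hash_found and comment.startswith("LLVM-MCA-BEGIN"):
            if current is not None:
                raise ValueError("code regions must not overlap")
            saw_marker = True
            description = comment[len("LLVM-MCA-BEGIN"):].strip()
            current = (description, [])
        elif hash_found and comment.startswith("LLVM-MCA-END"):
            if current is None:
                raise ValueError("LLVM-MCA-END without LLVM-MCA-BEGIN")
            regions.append(current)
            current = None
        elif line.strip():
            everything.append(line.strip())
            if current is not None:
                current[1].append(line.strip())

    if current is not None:
        raise ValueError("LLVM-MCA-BEGIN without LLVM-MCA-END")
    # With no user-defined region, the whole input forms one default region.
    return regions if saw_marker else [("", everything)]
```

Each returned pair would correspond to one region that is analyzed in isolation; input without markers yields a single region with an empty description.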
+ +Inline assembly directives may be used from source code to annotate the  +assembly text: + +.. code-block:: c++ + +  int foo(int a, int b) { +    __asm volatile("# LLVM-MCA-BEGIN foo"); +    a += 42; +    __asm volatile("# LLVM-MCA-END"); +    a *= b; +    return a; +  } + +So for example, you can compile code with clang, output assembly, and pipe it +directly into llvm-mca for analysis: + +.. code-block:: bash + +  $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2 + +Or for Intel syntax: + +.. code-block:: bash + +  $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2 + +OPTIONS +------- + +If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard +input. Otherwise, it will read from the specified filename. + +If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output +to standard output if the input is from standard input.  If the :option:`-o` +option specifies "``-``", then the output will also be sent to standard output. + + +.. option:: -help + + Print a summary of command line options. + +.. option:: -mtriple=<target triple> + + Specify a target triple string. + +.. option:: -march=<arch> + + Specify the architecture for which to analyze the code. It defaults to the + host default target. + +.. option:: -mcpu=<cpuname> + +  Specify the processor for which to analyze the code.  By default, the cpu name +  is autodetected from the host. + +.. option:: -output-asm-variant=<variant id> + + Specify the output assembly variant for the report generated by the tool. + On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly + format, and a value of 1 selects the Intel format, for the code printed out + by the tool in the analysis report. + +.. option:: -dispatch=<width> + + Specify a different dispatch width for the processor. The dispatch width + defaults to field 'IssueWidth' in the processor scheduling model.  
If width is + zero, then the default dispatch width is used. + +.. option:: -register-file-size=<size> + + Specify the size of the register file. When specified, this flag limits how + many temporary registers are available for register renaming purposes. A value + of zero for this flag means "unlimited number of temporary registers". + +.. option:: -iterations=<number of iterations> + + Specify the number of iterations to run. If this flag is set to 0, then the + tool sets the number of iterations to a default value (i.e. 100). + +.. option:: -noalias=<bool> + +  If set, the tool assumes that loads and stores don't alias. This is the +  default behavior. + +.. option:: -lqueue=<load queue size> + +  Specify the size of the load queue in the load/store unit emulated by the tool. +  By default, the tool assumes an unbounded number of entries in the load queue. +  A value of zero for this flag is ignored, and the default load queue size is +  used instead.  + +.. option:: -squeue=<store queue size> + +  Specify the size of the store queue in the load/store unit emulated by the +  tool. By default, the tool assumes an unbounded number of entries in the store +  queue. A value of zero for this flag is ignored, and the default store queue +  size is used instead. + +.. option:: -timeline + +  Enable the timeline view. + +.. option:: -timeline-max-iterations=<iterations> + +  Limit the number of iterations to print in the timeline view. By default, the +  timeline view prints information for up to 10 iterations. + +.. option:: -timeline-max-cycles=<cycles> + +  Limit the number of cycles in the timeline view. By default, the number of +  cycles is set to 80. + +.. option:: -resource-pressure + +  Enable the resource pressure view. This is enabled by default. + +.. option:: -register-file-stats + +  Enable register file usage statistics. + +.. option:: -dispatch-stats + +  Enable extra dispatch statistics. 
This view collects and analyzes instruction +  dispatch events, as well as static/dynamic dispatch stall events. This view +  is disabled by default. + +.. option:: -scheduler-stats + +  Enable extra scheduler statistics. This view collects and analyzes instruction +  issue events. This view is disabled by default. + +.. option:: -retire-stats + +  Enable extra retire control unit statistics. This view is disabled by default. + +.. option:: -instruction-info + +  Enable the instruction info view. This is enabled by default. + +.. option:: -all-stats + +  Print all hardware statistics. This enables extra statistics related to the +  dispatch logic, the hardware schedulers, the register file(s), and the retire +  control unit. This option is disabled by default. + +.. option:: -all-views + +  Enable all the views. + +.. option:: -instruction-tables + +  Prints resource pressure information based on the static information +  available from the processor model. This differs from the resource pressure +  view because it doesn't require that the code is simulated. It instead prints +  the theoretical uniform distribution of resource pressure for every +  instruction in sequence. + + +EXIT STATUS +----------- + +:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed +to standard error, and the tool returns 1. + +HOW MCA WORKS +------------- + +MCA takes assembly code as input. The assembly code is parsed into a sequence +of MCInst with the help of the existing LLVM target assembly parsers. The +parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate +a performance report. + +The Pipeline module simulates the execution of the machine code sequence in a +loop of iterations (default is 100). During this process, the pipeline collects +a number of execution related statistics. At the end of this process, the +pipeline generates and prints a report from the collected statistics. 
+ +Here is an example of a performance report generated by MCA for a dot-product +of two packed float vectors of four elements. The analysis is conducted for +target x86, cpu btver2.  The following result can be produced via the following +command using the example located at +``test/tools/llvm-mca/X86/BtVer2/dot-product.s``: + +.. code-block:: bash + +  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s + +.. code-block:: none + +  Iterations:        300 +  Instructions:      900 +  Total Cycles:      610 +  Dispatch Width:    2 +  IPC:               1.48 +  Block RThroughput: 2.0 + + +  Instruction Info: +  [1]: #uOps +  [2]: Latency +  [3]: RThroughput +  [4]: MayLoad +  [5]: MayStore +  [6]: HasSideEffects (U) + +  [1]    [2]    [3]    [4]    [5]    [6]    Instructions: +   1      2     1.00                        vmulps	%xmm0, %xmm1, %xmm2 +   1      3     1.00                        vhaddps	%xmm2, %xmm2, %xmm3 +   1      3     1.00                        vhaddps	%xmm3, %xmm3, %xmm4 + + +  Resources: +  [0]   - JALU0 +  [1]   - JALU1 +  [2]   - JDiv +  [3]   - JFPA +  [4]   - JFPM +  [5]   - JFPU0 +  [6]   - JFPU1 +  [7]   - JLAGU +  [8]   - JMul +  [9]   - JSAGU +  [10]  - JSTC +  [11]  - JVALU0 +  [12]  - JVALU1 +  [13]  - JVIMUL + + +  Resource pressure per iteration: +  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13] +   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      - + +  Resource pressure by instruction: +  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions: +   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps	%xmm0, %xmm1, %xmm2 +   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps	%xmm2, %xmm2, %xmm3 +   -      -      -     1.00    -     1.00    -      -      
-      -      -      -      -      -     vhaddps	%xmm3, %xmm3, %xmm4 + +According to this report, the dot-product kernel has been executed 300 times, +for a total of 900 dynamically executed instructions. + +The report is structured in three main sections.  The first section collects a +few performance numbers; the goal of this section is to give a very quick +overview of the performance throughput. In this example, the two important +performance indicators are the predicted total number of cycles, and the IPC. +IPC is probably the most important throughput indicator. A big delta between +the Dispatch Width and the computed IPC is an indicator of potential +performance issues. + +The second section of the report shows the latency and reciprocal +throughput of every instruction in the sequence. That section also reports +extra information related to the number of micro opcodes, and opcode properties +(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects'). + +The third section is the *Resource pressure view*.  This view reports +the average number of resource cycles consumed every iteration by instructions +for every processor resource unit available on the target.  Information is +structured in two tables. The first table reports the number of resource cycles +spent on average every iteration. The second table correlates the resource +cycles to the machine instruction in the sequence. For example, every iteration +of the instruction vmulps always executes on resource unit [6] +(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle +per iteration.  Note that on AMD Jaguar, vector floating-point multiply can +only be issued to pipeline JFPU1, while horizontal floating-point additions can +only be issued to pipeline JFPU0. + +The resource pressure view helps with identifying bottlenecks caused by high +usage of specific hardware resources.  Situations with resource pressure mainly +concentrated on a few resources should, in general, be avoided.  
Ideally, +pressure should be uniformly distributed between multiple resources. + +Timeline View +^^^^^^^^^^^^^ +MCA's timeline view produces a detailed report of each instruction's state +transitions through an instruction pipeline.  This view is enabled by the +command line option ``-timeline``.  As instructions transition through the +various stages of the pipeline, their states are depicted in the view report. +These states are represented by the following characters: + +* D : Instruction dispatched. +* e : Instruction executing. +* E : Instruction executed. +* R : Instruction retired. +* = : Instruction already dispatched, waiting to be executed. +* \- : Instruction executed, waiting to be retired. + +Below is the timeline view for a subset of the dot-product example located in +``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by +MCA using the following command: + +.. code-block:: bash + +  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s + +.. code-block:: none + +  Timeline view: +                      012345 +  Index     0123456789 + +  [0,0]     DeeER.    .    .   vmulps	%xmm0, %xmm1, %xmm2 +  [0,1]     D==eeeER  .    .   vhaddps	%xmm2, %xmm2, %xmm3 +  [0,2]     .D====eeeER    .   vhaddps	%xmm3, %xmm3, %xmm4 +  [1,0]     .DeeE-----R    .   vmulps	%xmm0, %xmm1, %xmm2 +  [1,1]     . D=eeeE---R   .   vhaddps	%xmm2, %xmm2, %xmm3 +  [1,2]     . D====eeeER   .   vhaddps	%xmm3, %xmm3, %xmm4 +  [2,0]     .  DeeE-----R  .   vmulps	%xmm0, %xmm1, %xmm2 +  [2,1]     .  D====eeeER  .   vhaddps	%xmm2, %xmm2, %xmm3 +  [2,2]     .   D======eeeER   vhaddps	%xmm3, %xmm3, %xmm4 + + +  Average Wait times (based on the timeline view): +  [0]: Executions +  [1]: Average time spent waiting in a scheduler's queue +  [2]: Average time spent waiting in a scheduler's queue while ready +  [3]: Average time elapsed from WB until retire stage + +        [0]    [1]    [2]    [3] +  0.     
3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2 +  1.     3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3 +  2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4 + +The timeline view is interesting because it shows instruction state changes +during execution.  It also gives an idea of how MCA processes instructions +executed on the target, and how their timing information might be calculated. + +The timeline view is structured in two tables.  The first table shows +instructions changing state over time (measured in cycles); the second table +(named *Average Wait times*) reports useful timing statistics, which should +help diagnose performance bottlenecks caused by long data dependencies and +sub-optimal usage of hardware resources. + +An instruction in the timeline view is identified by a pair of indices, where +the first index identifies an iteration, and the second index is the +instruction index (i.e., where it appears in the code sequence).  Since this +example was generated using 3 iterations: ``-iterations=3``, the iteration +indices range from 0-2 inclusively. + +Excluding the first and last column, the remaining columns are in cycles. +Cycles are numbered sequentially starting from 0. + +From the example output above, we know the following: + +* Instruction [1,0] was dispatched at cycle 1. +* Instruction [1,0] started executing at cycle 2. +* Instruction [1,0] reached the write back stage at cycle 4. +* Instruction [1,0] was retired at cycle 10. + +Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the +scheduler's queue for the operands to become available. By the time vmulps is +dispatched, operands are already available, and pipeline JFPU1 is ready to +serve another instruction.  So the instruction can be immediately issued on the +JFPU1 pipeline. That is demonstrated by the fact that the instruction only +spent 1cy in the scheduler's queue. 
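The reading of instruction [1,0] given above can be reproduced mechanically. The following Python sketch is only an illustration of the timeline notation: `decode_timeline_row` is a hypothetical helper, and it assumes that position 0 of the row string corresponds to cycle 0.

```python
def decode_timeline_row(row):
    """Return the cycle of each key event in one timeline row.

    Assumes one character per cycle, with position 0 being cycle 0.
    """
    return {
        "dispatched": row.index("D"),      # 'D': instruction dispatched
        "execute_start": row.index("e"),   # first 'e': execution begins
        "write_back": row.index("E"),      # 'E': instruction executed
        "retired": row.index("R"),         # 'R': instruction retired
    }

# Row of instruction [1,0] from the timeline view above.
events = decode_timeline_row(".DeeE-----R")
print(events)
# {'dispatched': 1, 'execute_start': 2, 'write_back': 4, 'retired': 10}
```

The distance between `dispatched` and `execute_start` is the single cycle that the instruction spends in the scheduler's queue.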
+ +There is a gap of 5 cycles between the write-back stage and the retire event. +That is because instructions must retire in program order, so [1,0] has to wait +for [0,2] to be retired first (i.e., it has to wait until cycle 10). + +In the example, all instructions are in a RAW (Read After Write) dependency +chain.  Register %xmm2 written by vmulps is immediately used by the first +vhaddps, and register %xmm3 written by the first vhaddps is used by the second +vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level +Parallelism). + +In the dot-product example, there are anti-dependencies introduced by +instructions from different iterations.  However, those dependencies can be +removed at register renaming stage (at the cost of allocating register aliases, +and therefore consuming temporary registers). + +Table *Average Wait times* helps diagnose performance issues that are caused by +the presence of long latency instructions and potentially long data dependencies +which may limit the ILP.  Note that MCA, by default, assumes at least 1cy +between the dispatch event and the issue event. + +When the performance is limited by data dependencies and/or long latency +instructions, the number of cycles spent while in the *ready* state is expected +to be very small when compared with the total number of cycles spent in the +scheduler's queue.  The difference between the two counters is a good indicator +of how large of an impact data dependencies had on the execution of the +instructions.  When performance is mostly limited by the lack of hardware +resources, the delta between the two counters is small.  However, the number of +cycles spent in the queue tends to be larger (i.e., more than 1-3cy), +especially when compared to other low latency instructions. 
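Columns [1] and [3] of the *Average Wait times* table can be recomputed from the timeline rows, which may help when checking how the two views relate. This is a sketch under assumptions: `average_waits` is a hypothetical helper, queue time is taken as the distance from ``D`` to the first ``e``, and the wait before retirement is taken as the number of ``-`` characters. Column [2] (time spent waiting while ready) cannot be read from the timeline characters and is not computed.

```python
def average_waits(rows):
    """Average queue time and write-back-to-retire time over timeline rows."""
    queue = [row.index("e") - row.index("D") for row in rows]
    retire = [row.count("-") for row in rows]
    return sum(queue) / len(rows), sum(retire) / len(rows)

# Rows of the three vmulps iterations ([0,0], [1,0], [2,0]) with the leading
# padding removed; only distances within a row matter here.
vmulps_rows = ["DeeER", "DeeE-----R", "DeeE-----R"]
queue_avg, retire_avg = average_waits(vmulps_rows)
print(f"{queue_avg:.1f} {retire_avg:.1f}")  # 1.0 3.3
```

The printed values match columns [1] and [3] reported for vmulps in the table above.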
+ +Extra Statistics to Further Diagnose Performance Issues +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The ``-all-stats`` command line option enables extra statistics and performance +counters for the dispatch logic, the reorder buffer, the retire control unit, +and the register file. + +Below is an example of ``-all-stats`` output generated by MCA for the +dot-product example discussed in the previous sections. + +.. code-block:: none + +  Dynamic Dispatch Stall Cycles: +  RAT     - Register unavailable:                      0 +  RCU     - Retire tokens unavailable:                 0 +  SCHEDQ  - Scheduler full:                            272 +  LQ      - Load queue full:                           0 +  SQ      - Store queue full:                          0 +  GROUP   - Static restrictions on the dispatch group: 0 + + +  Dispatch Logic - number of cycles where we saw N instructions dispatched: +  [# dispatched], [# cycles] +   0,              24  (3.9%) +   1,              272  (44.6%) +   2,              314  (51.5%) + + +  Schedulers - number of cycles where we saw N instructions issued: +  [# issued], [# cycles] +   0,          7  (1.1%) +   1,          306  (50.2%) +   2,          297  (48.7%) + + +  Scheduler's queue usage: +  JALU01,  0/20 +  JFPU01,  18/18 +  JLSAGU,  0/12 + + +  Retire Control Unit - number of cycles where we saw N instructions retired: +  [# retired], [# cycles] +   0,           109  (17.9%) +   1,           102  (16.7%) +   2,           399  (65.4%) + + +  Register File statistics: +  Total number of mappings created:    900 +  Max number of mappings used:         35 + +  *  Register File #1 -- JFpuPRF: +     Number of physical registers:     72 +     Total number of mappings created: 900 +     Max number of mappings used:      35 + +  *  Register File #2 -- JIntegerPRF: +     Number of physical registers:     64 +     Total number of mappings created: 0 +     Max number of mappings used:      0 + +If we look at the *Dynamic 
Dispatch Stall Cycles* table, we see the counter for +SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch +logic is unable to dispatch a group of two instructions because the scheduler's +queue is full. + +Looking at the *Dispatch Logic* table, we see that the pipeline was only able +to dispatch two instructions 51.5% of the time.  The dispatch group was limited +to one instruction 44.6% of the cycles, which corresponds to 272 cycles.  The +dispatch statistics are displayed by either using the command option +``-all-stats`` or ``-dispatch-stats``. + +The next table, *Schedulers*, presents a histogram of the number of cycles in +which N instructions were issued.  In this case, of the 610 simulated cycles, +exactly one instruction was issued in 306 cycles (50.2%), and there were 7 +cycles where no instructions were issued. + +The *Scheduler's queue usage* table shows the maximum number of buffer +entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01 +reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements +three schedulers: + +* JALU01 - A scheduler for ALU instructions. +* JFPU01 - A scheduler for floating point operations. +* JLSAGU - A scheduler for address generation. + +The dot-product is a kernel of three floating point instructions (a vector +multiply followed by two horizontal adds).  That explains why only the floating +point scheduler appears to be used. + +A full scheduler queue is either caused by data dependency chains or by a +sub-optimal usage of hardware resources.  Sometimes, resource pressure can be +mitigated by rewriting the kernel using different instructions that consume +different scheduler resources.  Schedulers with a small queue are less resilient +to bottlenecks caused by the presence of long data dependencies. +The scheduler statistics are displayed by +using the command option ``-all-stats`` or ``-scheduler-stats``. 
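The percentages in the *Dispatch Logic* and *Schedulers* histograms follow directly from the cycle counts. The Python sketch below recomputes them; `summarize_histogram` is a hypothetical helper, and the input dictionaries are copied from the ``-all-stats`` output above.

```python
def summarize_histogram(cycles_by_count):
    """Summarize a {instructions per cycle: number of cycles} histogram."""
    total_cycles = sum(cycles_by_count.values())
    total_instructions = sum(n * c for n, c in cycles_by_count.items())
    shares = {
        n: round(100.0 * c / total_cycles, 1)
        for n, c in cycles_by_count.items()
    }
    return total_cycles, total_instructions, shares

dispatch = {0: 24, 1: 272, 2: 314}   # Dispatch Logic table
issue = {0: 7, 1: 306, 2: 297}       # Schedulers table

print(summarize_histogram(dispatch))  # (610, 900, {0: 3.9, 1: 44.6, 2: 51.5})
print(summarize_histogram(issue))     # (610, 900, {0: 1.1, 1: 50.2, 2: 48.7})
```

Both histograms account for all 610 simulated cycles and all 900 instructions, which is a quick consistency check when reading these tables.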
+ +The next table, *Retire Control Unit*, presents a histogram of the number of +cycles in which N instructions were retired.  In this case, of the 610 +simulated cycles, two instructions were retired during the same cycle 399 +times (65.4%) and there were 109 cycles where no instructions were retired. +The retire statistics are displayed by using the command option +``-all-stats`` or ``-retire-stats``. + +The last table presented is *Register File statistics*.  Each physical register +file (PRF) used by the pipeline is presented in this table.  In the case of AMD +Jaguar, there are two register files, one for floating-point registers +(JFpuPRF) and one for integer registers (JIntegerPRF).  The table shows that of +the 900 instructions processed, there were 900 mappings created.  Since this +dot-product example utilized only floating point registers, the JFpuPRF was +responsible for creating the 900 mappings.  However, we see that the pipeline +only used a maximum of 35 of 72 available register slots at any given time. We +can conclude that the floating point PRF was the only register file used for +the example, and that it was never resource constrained.  The register file +statistics are displayed by using the command option ``-all-stats`` or +``-register-file-stats``. + +In this example, we can conclude that the IPC is mostly limited by data +dependencies, and not by resource pressure. | 
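The headline numbers used in this conclusion can be recomputed from the values shown in the reports. The following Python sketch is illustrative only; `ipc` and `utilization` are hypothetical helper names, and the inputs are copied from the dot-product example.

```python
def ipc(instructions, cycles):
    """Instructions per cycle over the whole simulation."""
    return instructions / cycles

def utilization(used, available):
    """Fraction of a resource that was in use at its peak."""
    return used / available

dot_product_ipc = ipc(900, 610)   # Instructions / Total Cycles
print(f"IPC: {dot_product_ipc:.2f}")                                # IPC: 1.48
print(f"Share of dispatch width: {dot_product_ipc / 2:.0%}")        # 74%
print(f"Peak JFpuPRF usage: {utilization(35, 72):.0%}")             # 49%
```

The IPC stays well below the dispatch width of 2 while less than half of the floating point register file is ever in use, which is consistent with data dependencies being the limiting factor.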
