2 files changed, 1389 insertions, 0 deletions
diff --git a/usr.bin/clang/llvm-mca/Makefile b/usr.bin/clang/llvm-mca/Makefile
new file mode 100644
index 000000000000..274b7a43e5fe
--- /dev/null
+++ b/usr.bin/clang/llvm-mca/Makefile
@@ -0,0 +1,21 @@
+PROG_CXX=	llvm-mca
+
+SRCDIR=		llvm/tools/llvm-mca
+SRCS+=		CodeRegion.cpp
+SRCS+=		CodeRegionGenerator.cpp
+SRCS+=		PipelinePrinter.cpp
+SRCS+=		Views/BottleneckAnalysis.cpp
+SRCS+=		Views/DispatchStatistics.cpp
+SRCS+=		Views/InstructionInfoView.cpp
+SRCS+=		Views/InstructionView.cpp
+SRCS+=		Views/RegisterFileStatistics.cpp
+SRCS+=		Views/ResourcePressureView.cpp
+SRCS+=		Views/RetireControlUnitStatistics.cpp
+SRCS+=		Views/SchedulerStatistics.cpp
+SRCS+=		Views/SummaryView.cpp
+SRCS+=		Views/TimelineView.cpp
+SRCS+=		llvm-mca.cpp
+
+CFLAGS+=	-I${LLVM_BASE}/${SRCDIR}
+
+.include "../llvm.prog.mk"
diff --git a/usr.bin/clang/llvm-mca/llvm-mca.1 b/usr.bin/clang/llvm-mca/llvm-mca.1
new file mode 100644
index 000000000000..7c30c5e95336
--- /dev/null
+++ b/usr.bin/clang/llvm-mca/llvm-mca.1
@@ -0,0 +1,1368 @@
+.\" Man page generated from reStructuredText.
+.
+.
+.nr rst2man-indent-level 0
+.
+.de1 rstReportMargin
+\\$1 \\n[an-margin]
+level \\n[rst2man-indent-level]
+level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
+-
+\\n[rst2man-indent0]
+\\n[rst2man-indent1]
+\\n[rst2man-indent2]
+..
+.de1 INDENT
+.\" .rstReportMargin pre:
+. RS \\$1
+. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
+. nr rst2man-indent-level +1
+.\" .rstReportMargin post:
+..
+.de UNINDENT
+. RE
+.\" indent \\n[an-margin]
+.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
+.nr rst2man-indent-level -1
+.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
+.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
+..
+.TH "LLVM-MCA" "1" "2023-05-24" "16" "LLVM"
+.SH NAME
+llvm-mca \- LLVM Machine Code Analyzer
+.SH SYNOPSIS
+.sp
+\fBllvm\-mca\fP [\fIoptions\fP] [input]
+.SH DESCRIPTION
+.sp
+\fBllvm\-mca\fP is a performance analysis tool that uses information
+available in LLVM (e.g. scheduling models) to statically measure the performance
+of machine code on a specific CPU.
+.sp
+Performance is measured in terms of throughput as well as processor resource
+consumption. The tool currently works for processors with a backend for which
+there is a scheduling model available in LLVM.
+.sp
+The main goal of this tool is not just to predict the performance of the code
+when run on the target, but also to help diagnose potential performance
+issues.
+.sp
+Given an assembly code sequence, \fBllvm\-mca\fP estimates the Instructions
+Per Cycle (IPC), as well as hardware resource pressure. The analysis and
+reporting style were inspired by the IACA tool from Intel.
+.sp
+For example, you can compile code with clang, output assembly, and pipe it
+directly into \fBllvm\-mca\fP for analysis:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-S \-o \- | llvm\-mca \-mcpu=btver2
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Or for Intel syntax:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-mllvm \-x86\-asm\-syntax=intel \-S \-o \- | llvm\-mca \-mcpu=btver2
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+(\fBllvm\-mca\fP detects Intel syntax by the presence of an \fI\&.intel_syntax\fP
+directive at the beginning of the input.  By default its output syntax matches
+that of its input.)
+.sp
+Scheduling models are not just used to compute instruction latencies and
+throughput, but also to understand what processor resources are available
+and how to simulate them.
+.sp
+By design, the quality of the analysis conducted by \fBllvm\-mca\fP is
+inevitably affected by the quality of the scheduling models in LLVM.
+.sp
+If you see that the performance report is not accurate for a processor,
+please \fI\%file a bug\fP
+against the appropriate backend.
+.SH OPTIONS
+.sp
+If \fBinput\fP is \(dq\fB\-\fP\(dq or omitted, \fBllvm\-mca\fP reads from standard
+input. Otherwise, it will read from the specified filename.
+.sp
+If the \fI\%\-o\fP option is omitted, then \fBllvm\-mca\fP will send its output
+to standard output if the input is from standard input.  If the \fI\%\-o\fP
+option specifies \(dq\fB\-\fP\(dq, then the output will also be sent to standard output.
+.INDENT 0.0
+.TP
+.B \-help
+Print a summary of command line options.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-o <filename>
+Use \fB<filename>\fP as the output filename. See the summary above for more
+details.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-mtriple=<target triple>
+Specify a target triple string.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-march=<arch>
+Specify the architecture for which to analyze the code. It defaults to the
+host default target.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-mcpu=<cpuname>
+Specify the processor for which to analyze the code.  By default, the cpu name
+is autodetected from the host.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-output\-asm\-variant=<variant id>
+Specify the output assembly variant for the report generated by the tool.
+On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
+format, and a value of 1 selects the Intel format, for the code printed out by
+the tool in the analysis report.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-print\-imm\-hex
+Prefer hex format for numeric literals in the output assembly printed as part
+of the report.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-dispatch=<width>
+Specify a different dispatch width for the processor. The dispatch width
+defaults to field \(aqIssueWidth\(aq in the processor scheduling model.  If width is
+zero, then the default dispatch width is used.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-register\-file\-size=<size>
+Specify the size of the register file. When specified, this flag limits how
+many physical registers are available for register renaming purposes. A value
+of zero for this flag means \(dqunlimited number of physical registers\(dq.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-iterations=<number of iterations>
+Specify the number of iterations to run. If this flag is set to 0, then the
+tool sets the number of iterations to a default value (i.e. 100).
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-noalias=<bool>
+If set, the tool assumes that loads and stores don\(aqt alias. This is the
+default behavior.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-lqueue=<load queue size>
+Specify the size of the load queue in the load/store unit emulated by the tool.
+By default, the tool assumes an unbounded number of entries in the load queue.
+A value of zero for this flag is ignored, and the default load queue size is
+used instead.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-squeue=<store queue size>
+Specify the size of the store queue in the load/store unit emulated by the
+tool. By default, the tool assumes an unbounded number of entries in the store
+queue. A value of zero for this flag is ignored, and the default store queue
+size is used instead.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-timeline
+Enable the timeline view.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-timeline\-max\-iterations=<iterations>
+Limit the number of iterations to print in the timeline view. By default, the
+timeline view prints information for up to 10 iterations.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-timeline\-max\-cycles=<cycles>
+Limit the number of cycles in the timeline view, or use 0 for no limit. By
+default, the number of cycles is set to 80.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-resource\-pressure
+Enable the resource pressure view. This is enabled by default.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-register\-file\-stats
+Enable register file usage statistics.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-dispatch\-stats
+Enable extra dispatch statistics. This view collects and analyzes instruction
+dispatch events, as well as static/dynamic dispatch stall events. This view
+is disabled by default.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-scheduler\-stats
+Enable extra scheduler statistics. This view collects and analyzes instruction
+issue events. This view is disabled by default.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-retire\-stats
+Enable extra retire control unit statistics. This view is disabled by default.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-instruction\-info
+Enable the instruction info view. This is enabled by default.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-show\-encoding
+Enable the printing of instruction encodings within the instruction info view.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-show\-barriers
+Enable the printing of LoadBarrier and StoreBarrier flags within the
+instruction info view.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-all\-stats
+Print all hardware statistics. This enables extra statistics related to the
+dispatch logic, the hardware schedulers, the register file(s), and the retire
+control unit. This option is disabled by default.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-all\-views
+Enable all the views.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-instruction\-tables
+Prints resource pressure information based on the static information
+available from the processor model. This differs from the resource pressure
+view because it doesn\(aqt require that the code is simulated. It instead prints
+the theoretical uniform distribution of resource pressure for every
+instruction in sequence.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-bottleneck\-analysis
+Print information about bottlenecks that affect the throughput. This analysis
+can be expensive, and it is disabled by default. Bottlenecks are highlighted
+in the summary view. Bottleneck analysis is currently not supported for
+processors with an in\-order backend.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-json
+Print the requested views in valid JSON format. The instructions and the
+processor resources are printed as members of special top level JSON objects.
+The individual views refer to them by index. However, not all views are
+currently supported. For example, the report from the bottleneck analysis is
+not printed out in JSON. All the default views are currently supported.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-disable\-cb
+Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
+than using the target specific implementation. The generic classes never
+detect any custom hazards or make any post processing modifications to
+instructions.
+.UNINDENT
+.INDENT 0.0
+.TP
+.B \-disable\-im
+Force usage of the generic InstrumentManager rather than using the target
+specific implementation. The generic class creates Instruments that provide
+no extra information, and InstrumentManager never overrides the default
+schedule class for a given instruction.
+.UNINDENT
+.SH EXIT STATUS
+.sp
+\fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed
+to standard error, and the tool returns 1.
+.SH USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
+.sp
+\fBllvm\-mca\fP allows for the optional usage of special code comments to
+mark regions of the assembly code to be analyzed.  A comment starting with
+substring \fBLLVM\-MCA\-BEGIN\fP marks the beginning of an analysis region. A
+comment starting with substring \fBLLVM\-MCA\-END\fP marks the end of a region.
+For example:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# LLVM\-MCA\-BEGIN
+  ...
+# LLVM\-MCA\-END
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+If no user\-defined region is specified, then \fBllvm\-mca\fP assumes a
+default region which contains every instruction in the input file.  Every region
+is analyzed in isolation, and the final performance report is the union of all
+the reports generated for every analysis region.
+.sp
+Analysis regions can have names. For example:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# LLVM\-MCA\-BEGIN A simple example
+  add %eax, %eax
+# LLVM\-MCA\-END
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The code from the example above defines a region named \(dqA simple example\(dq with a
+single instruction in it. Note how the region name doesn\(aqt have to be repeated
+in the \fBLLVM\-MCA\-END\fP directive. In the absence of overlapping regions,
+an anonymous \fBLLVM\-MCA\-END\fP directive always ends the currently active user
+defined region.
+.sp
+Example of nesting regions:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# LLVM\-MCA\-BEGIN foo
+  add %eax, %edx
+# LLVM\-MCA\-BEGIN bar
+  sub %eax, %edx
+# LLVM\-MCA\-END bar
+# LLVM\-MCA\-END foo
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Example of overlapping regions:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# LLVM\-MCA\-BEGIN foo
+  add %eax, %edx
+# LLVM\-MCA\-BEGIN bar
+  sub %eax, %edx
+# LLVM\-MCA\-END foo
+  add %eax, %edx
+# LLVM\-MCA\-END bar
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Note that multiple anonymous regions cannot overlap. Also, overlapping regions
+cannot have the same name.
+.sp
+There is no support for marking regions from high\-level source code, like C or
+C++. As a workaround, inline assembly directives may be used:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+int foo(int a, int b) {
+  __asm volatile(\(dq# LLVM\-MCA\-BEGIN foo\(dq:::\(dqmemory\(dq);
+  a += 42;
+  __asm volatile(\(dq# LLVM\-MCA\-END\(dq:::\(dqmemory\(dq);
+  a *= b;
+  return a;
+}
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+However, this interferes with optimizations like loop vectorization and may have
+an impact on the code generated. This is because the \fB__asm\fP statements are
+seen as real code having important side effects, which limits how the code
+around them can be transformed. If users want to make use of inline assembly
+to emit markers, then the recommendation is to always verify that the output
+assembly is equivalent to the assembly generated in the absence of markers.
+The \fI\%Clang options to emit optimization reports\fP
+can also help in detecting missed optimizations.
+.SH INSTRUMENT REGIONS
+.sp
+An InstrumentRegion describes a region of assembly code guarded by
+special LLVM\-MCA comment directives.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# LLVM\-MCA\-<INSTRUMENT_TYPE> <data>
+  ...  ## asm
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+where \fIINSTRUMENT_TYPE\fP is a type defined by the target and expects
+to use \fIdata\fP\&.
+.sp
+A comment starting with substring \fILLVM\-MCA\-<INSTRUMENT_TYPE>\fP
+brings data into scope for llvm\-mca to use in its analysis for
+all following instructions.
+.sp
+If a comment with the same \fIINSTRUMENT_TYPE\fP is found later in the
+instruction list, then the original InstrumentRegion will be
+automatically ended, and a new InstrumentRegion will begin.
+.sp
+If there are comments with different \fIINSTRUMENT_TYPE\fP values,
+then both data sets remain available. In contrast with an AnalysisRegion,
+an InstrumentRegion does not need a comment to end the region.
+.sp
+Comments that are prefixed with \fILLVM\-MCA\-\fP but do not correspond to
+a valid \fIINSTRUMENT_TYPE\fP for the target cause an error, except for
+\fIBEGIN\fP and \fIEND\fP, since those correspond to AnalysisRegions. Comments
+that do not start with \fILLVM\-MCA\-\fP are ignored by \fBllvm\-mca\fP\&.
+.sp
+An instruction (an MCInst) is added to an InstrumentRegion R only
+if its location is in range [R.RangeStart, R.RangeEnd].
+.sp
+On RISCV targets, vector instructions have different behaviour depending
+on the LMUL. Code can be instrumented with a comment that takes the
+following form:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# LLVM\-MCA\-RISCV\-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The RISCV InstrumentManager will override the schedule class for vector
+instructions to use the scheduling behaviour of its pseudo\-instruction
+which is LMUL dependent. It makes sense to place RISCV instrument
+comments directly after \fIvset{i}vl{i}\fP instructions, although
+they can be placed anywhere in the program.
+.sp
+Example of a program with no call to \fIvset{i}vl{i}\fP:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+# LLVM\-MCA\-RISCV\-LMUL M2
+vadd.vv v2, v2, v2
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Example of a program with a call to \fIvset{i}vl{i}\fP:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+vsetvli zero, a0, e8, m1, tu, mu
+# LLVM\-MCA\-RISCV\-LMUL M1
+vadd.vv v2, v2, v2
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Example of a program with multiple calls to \fIvset{i}vl{i}\fP:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+vsetvli zero, a0, e8, m1, tu, mu
+# LLVM\-MCA\-RISCV\-LMUL M1
+vadd.vv v2, v2, v2
+vsetvli zero, a0, e8, m8, tu, mu
+# LLVM\-MCA\-RISCV\-LMUL M8
+vadd.vv v2, v2, v2
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+Example of a program with calls to \fIvsetvl\fP:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+vsetvl rd, rs1, rs2
+# LLVM\-MCA\-RISCV\-LMUL M1
+vadd.vv v12, v12, v12
+vsetvl rd, rs1, rs2
+# LLVM\-MCA\-RISCV\-LMUL M4
+vadd.vv v12, v12, v12
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.SH HOW LLVM-MCA WORKS
+.sp
+\fBllvm\-mca\fP takes assembly code as input. The assembly code is parsed
+into a sequence of MCInst with the help of the existing LLVM target assembly
+parsers. The parsed sequence of MCInst is then analyzed by a \fBPipeline\fP module
+to generate a performance report.
+.sp
+The Pipeline module simulates the execution of the machine code sequence in a
+loop of iterations (default is 100). During this process, the pipeline collects
+a number of execution related statistics. At the end of this process, the
+pipeline generates and prints a report from the collected statistics.
+.sp
+Here is an example of a performance report generated by the tool for a
+dot\-product of two packed float vectors of four elements. The analysis is
+conducted for target x86, cpu btver2.  The result below can be produced with
+the following command, using the example located at
+\fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=300 dot\-product.s
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+Iterations:        300
+Instructions:      900
+Total Cycles:      610
+Total uOps:        900
+
+Dispatch Width:    2
+uOps Per Cycle:    1.48
+IPC:               1.48
+Block RThroughput: 2.0
+
+
+Instruction Info:
+[1]: #uOps
+[2]: Latency
+[3]: RThroughput
+[4]: MayLoad
+[5]: MayStore
+[6]: HasSideEffects (U)
+
+[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
+ 1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
+ 1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
+ 1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4
+
+
+Resources:
+[0]   \- JALU0
+[1]   \- JALU1
+[2]   \- JDiv
+[3]   \- JFPA
+[4]   \- JFPM
+[5]   \- JFPU0
+[6]   \- JFPU1
+[7]   \- JLAGU
+[8]   \- JMul
+[9]   \- JSAGU
+[10]  \- JSTC
+[11]  \- JVALU0
+[12]  \- JVALU1
+[13]  \- JVIMUL
+
+
+Resource pressure per iteration:
+[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
+ \-      \-      \-     2.00   1.00   2.00   1.00    \-      \-      \-      \-      \-      \-      \-
+
+Resource pressure by instruction:
+[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
+ \-      \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-     vmulps      %xmm0, %xmm1, %xmm2
+ \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-      \-     vhaddps     %xmm2, %xmm2, %xmm3
+ \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-      \-     vhaddps     %xmm3, %xmm3, %xmm4
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+According to this report, the dot\-product kernel has been executed 300 times,
+for a total of 900 simulated instructions. The total number of simulated micro
+opcodes (uOps) is also 900.
+.sp
+The report is structured in three main sections.  The first section collects a
+few performance numbers; the goal of this section is to give a very quick
+overview of the performance throughput. Important performance indicators are
+\fBIPC\fP, \fBuOps Per Cycle\fP, and  \fBBlock RThroughput\fP (Block Reciprocal
+Throughput).
+.sp
+Field \fIDispatchWidth\fP is the maximum number of micro opcodes that are dispatched
+to the out\-of\-order backend every simulated cycle. For processors with an
+in\-order backend, \fIDispatchWidth\fP is the maximum number of micro opcodes issued
+to the backend every simulated cycle.
+.sp
+IPC is computed by dividing the total number of simulated instructions by the
+total number of cycles.
+.sp
+Field \fIBlock RThroughput\fP is the reciprocal of the block throughput. Block
+throughput is a theoretical quantity computed as the maximum number of blocks
+(i.e. iterations) that can be executed per simulated clock cycle in the absence
+of loop\-carried dependencies. Block throughput is bounded from above by the
+dispatch rate and the availability of hardware resources.
+.sp
+In the absence of loop\-carried data dependencies, the observed IPC tends to a
+theoretical maximum which can be computed by dividing the number of instructions
+of a single iteration by the \fIBlock RThroughput\fP\&.
+.sp
+Field \(aquOps Per Cycle\(aq is computed by dividing the total number of simulated micro
+opcodes by the total number of cycles. A delta between Dispatch Width and this
+field is an indicator of a performance issue. In the absence of loop\-carried
+data dependencies, the observed \(aquOps Per Cycle\(aq should tend to a theoretical
+maximum throughput which can be computed by dividing the number of uOps of a
+single iteration by the \fIBlock RThroughput\fP\&.
+.sp
+Field \fIuOps Per Cycle\fP is bounded from above by the dispatch width. That is
+because the dispatch width limits the maximum size of a dispatch group. Both IPC
+and \(aquOps Per Cycle\(aq are limited by the amount of hardware parallelism. The
+availability of hardware resources affects the resource pressure distribution,
+and it limits the number of instructions that can be executed in parallel every
+cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
+Cycle (computed by dividing the number of uOps of a single iteration by the
+\fIBlock RThroughput\fP) is an indicator of a performance bottleneck caused by the
+lack of hardware resources.
+In general, the lower the Block RThroughput, the better.
+.sp
+In this example, \fBuOps per iteration/Block RThroughput\fP is 1.50. Since there
+are no loop\-carried dependencies, the observed \fIuOps Per Cycle\fP is expected to
+approach 1.50 when the number of iterations tends to infinity. The delta between
+the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
+an indicator of a performance bottleneck caused by the lack of hardware
+resources, and the \fIResource pressure view\fP can help to identify the problematic
+resource usage.
+.sp
+The second section of the report is the \fIinstruction info view\fP\&. It shows the
+latency and reciprocal throughput of every instruction in the sequence. It also
+reports extra information related to the number of micro opcodes, and opcode
+properties (i.e., \(aqMayLoad\(aq, \(aqMayStore\(aq, and \(aqHasSideEffects\(aq).
+.sp
+Field \fIRThroughput\fP is the reciprocal of the instruction throughput. Throughput
+is computed as the maximum number of instructions of the same type that can be
+executed per clock cycle in the absence of operand dependencies. In this
+example, the reciprocal throughput of a vector float multiply is 1
+cycle/instruction.  That is because the FP multiplier JFPM is only available
+from pipeline JFPU1.
+.sp
+Instruction encodings are displayed within the instruction info view when flag
+\fI\-show\-encoding\fP is specified.
+.sp
+Below is an example of \fI\-show\-encoding\fP output for the dot\-product kernel:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+Instruction Info:
+[1]: #uOps
+[2]: Latency
+[3]: RThroughput
+[4]: MayLoad
+[5]: MayStore
+[6]: HasSideEffects (U)
+[7]: Encoding Size
+
+[1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
+ 1      2     1.00                         4     c5 f0 59 d0                   vmulps %xmm0, %xmm1, %xmm2
+ 1      4     1.00                         4     c5 eb 7c da                   vhaddps        %xmm2, %xmm2, %xmm3
+ 1      4     1.00                         4     c5 e3 7c e3                   vhaddps        %xmm3, %xmm3, %xmm4
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The \fIEncoding Size\fP column shows the size in bytes of instructions.  The
+\fIEncodings\fP column shows the actual instruction encodings (byte sequences in
+hex).
+.sp
+The third section is the \fIResource pressure view\fP\&.  This view reports
+the average number of resource cycles consumed every iteration by instructions
+for every processor resource unit available on the target.  Information is
+structured in two tables. The first table reports the number of resource cycles
+spent on average every iteration. The second table correlates the resource
+cycles to the machine instruction in the sequence. For example, every iteration
+of the instruction vmulps always executes on resource unit [6]
+(JFPU1 \- floating point pipeline #1), consuming an average of 1 resource cycle
+per iteration.  Note that on AMD Jaguar, vector floating\-point multiply can
+only be issued to pipeline JFPU1, while horizontal floating\-point additions can
+only be issued to pipeline JFPU0.
+.sp
+The resource pressure view helps with identifying bottlenecks caused by high
+usage of specific hardware resources.  Situations with resource pressure mainly
+concentrated on a few resources should, in general, be avoided.  Ideally,
+pressure should be uniformly distributed between multiple resources.
+.SS Timeline View
+.sp
+The timeline view produces a detailed report of each instruction\(aqs state
+transitions through an instruction pipeline.  This view is enabled by the
+command line option \fB\-timeline\fP\&.  As instructions transition through the
+various stages of the pipeline, their states are depicted in the view report.
+These states are represented by the following characters:
+.INDENT 0.0
+.IP \(bu 2
+D : Instruction dispatched.
+.IP \(bu 2
+e : Instruction executing.
+.IP \(bu 2
+E : Instruction executed.
+.IP \(bu 2
+R : Instruction retired.
+.IP \(bu 2
+= : Instruction already dispatched, waiting to be executed.
+.IP \(bu 2
+\- : Instruction executed, waiting to be retired.
+.UNINDENT
+.sp
+Below is the timeline view for a subset of the dot\-product example located in
+\fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP and processed by
+\fBllvm\-mca\fP using the following command:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=3 \-timeline dot\-product.s
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+Timeline view:
+                    012345
+Index     0123456789
+
+[0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
+[0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
+[0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
+[1,0]     .DeeE\-\-\-\-\-R    .   vmulps   %xmm0, %xmm1, %xmm2
+[1,1]     . D=eeeE\-\-\-R   .   vhaddps  %xmm2, %xmm2, %xmm3
+[1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
+[2,0]     .  DeeE\-\-\-\-\-R  .   vmulps   %xmm0, %xmm1, %xmm2
+[2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
+[2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4
+
+
+Average Wait times (based on the timeline view):
+[0]: Executions
+[1]: Average time spent waiting in a scheduler\(aqs queue
+[2]: Average time spent waiting in a scheduler\(aqs queue while ready
+[3]: Average time elapsed from WB until retire stage
+
+      [0]    [1]    [2]    [3]
+0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
+1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
+2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
+       3     3.3    0.5    1.4       <total>
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The timeline view is interesting because it shows instruction state changes
+during execution.  It also gives an idea of how the tool processes instructions
+executed on the target, and how their timing information might be calculated.
+.sp
+The timeline view is structured in two tables.  The first table shows
+instructions changing state over time (measured in cycles); the second table
+(named \fIAverage Wait times\fP) reports useful timing statistics, which should
+help diagnose performance bottlenecks caused by long data dependencies and
+sub\-optimal usage of hardware resources.
+.sp
+An instruction in the timeline view is identified by a pair of indices, where
+the first index identifies an iteration, and the second index is the
+instruction index (i.e., where it appears in the code sequence).  Since this
+example was generated using 3 iterations: \fB\-iterations=3\fP, the iteration
+indices range from 0 to 2 inclusive.
+.sp
+Excluding the first and last column, the remaining columns are in cycles.
+Cycles are numbered sequentially starting from 0.
+.sp
+From the example output above, we know the following:
+.INDENT 0.0
+.IP \(bu 2
+Instruction [1,0] was dispatched at cycle 1.
+.IP \(bu 2
+Instruction [1,0] started executing at cycle 2.
+.IP \(bu 2
+Instruction [1,0] reached the write back stage at cycle 4.
+.IP \(bu 2
+Instruction [1,0] was retired at cycle 10.
+.UNINDENT
+.sp
+Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
+scheduler\(aqs queue for the operands to become available. By the time vmulps is
+dispatched, operands are already available, and pipeline JFPU1 is ready to
+serve another instruction.  So the instruction can be immediately issued on the
+JFPU1 pipeline. That is demonstrated by the fact that the instruction only
+spent 1cy in the scheduler\(aqs queue.
+.sp
+There is a gap of 5 cycles between the write\-back stage and the retire event.
+That is because instructions must retire in program order, so [1,0] has to wait
+for [0,2] to be retired first (i.e., it has to wait until cycle 10).
+.sp
+In the example, all instructions are in a RAW (Read After Write) dependency
+chain.  Register %xmm2 written by vmulps is immediately used by the first
+vhaddps, and register %xmm3 written by the first vhaddps is used by the second
+vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
+Parallelism).
+.sp
+In the dot\-product example, there are anti\-dependencies introduced by
+instructions from different iterations.  However, those dependencies can be
+removed at register renaming stage (at the cost of allocating register aliases,
+and therefore consuming physical registers).
+.sp
+Table \fIAverage Wait times\fP helps diagnose performance issues that are caused by
+the presence of long latency instructions and potentially long data dependencies
+which may limit the ILP. The last row, \fB<total>\fP, shows a global average over all
+instructions measured. Note that \fBllvm\-mca\fP, by default, assumes at
+least 1cy between the dispatch event and the issue event.
+.sp
+When the performance is limited by data dependencies and/or long latency
+instructions, the number of cycles spent while in the \fIready\fP state is expected
+to be very small when compared with the total number of cycles spent in the
+scheduler\(aqs queue.  The difference between the two counters is a good indicator
+of how large an impact data dependencies had on the execution of the
+instructions.  When performance is mostly limited by the lack of hardware
+resources, the delta between the two counters is small.  However, the number of
+cycles spent in the queue tends to be larger (i.e., more than 1\-3cy),
+especially when compared to other low latency instructions.
+.SS Bottleneck Analysis
+.sp
+The \fB\-bottleneck\-analysis\fP command line option enables the analysis of
+performance bottlenecks.
+.sp
+This analysis is potentially expensive. It attempts to correlate increases in
+backend pressure (caused by pipeline resource pressure and data dependencies) to
+dynamic dispatch stalls.
+.sp
+Below is an example of \fB\-bottleneck\-analysis\fP output generated by
+\fBllvm\-mca\fP for 500 iterations of the dot\-product example on btver2.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+Cycles with backend pressure increase [ 48.07% ]
+Throughput Bottlenecks:
+  Resource Pressure       [ 47.77% ]
+  \- JFPA  [ 47.77% ]
+  \- JFPU0  [ 47.77% ]
+  Data Dependencies:      [ 0.30% ]
+  \- Register Dependencies [ 0.30% ]
+  \- Memory Dependencies   [ 0.00% ]
+
+Critical sequence based on the simulation:
+
+              Instruction                         Dependency Information
+ +\-\-\-\-< 2.    vhaddps %xmm3, %xmm3, %xmm4
+ |
+ |    < loop carried >
+ |
+ |      0.    vmulps  %xmm0, %xmm1, %xmm2
+ +\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
+ +\-\-\-\-> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
+ |
+ |    < loop carried >
+ |
+ +\-\-\-\-> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+According to the analysis, throughput is limited by resource pressure and not by
+data dependencies.  The analysis observed increases in backend pressure during
+48.07% of the simulated run. Almost all of those pressure increase events were
+caused by contention on processor resources JFPA/JFPU0.
+.sp
+The \fIcritical sequence\fP is the most expensive sequence of instructions according
+to the simulation. It is annotated to provide extra information about critical
+register dependencies and resource interferences between instructions.
+.sp
+Instructions from the critical sequence are expected to significantly impact
+performance. By construction, the accuracy of this analysis is strongly
+dependent on the simulation and (as always) on the quality of the processor
+model in LLVM.
+.sp
+Bottleneck analysis is currently not supported for processors with an in\-order
+backend.
+.SS Extra Statistics to Further Diagnose Performance Issues
+.sp
+The \fB\-all\-stats\fP command line option enables extra statistics and performance
+counters for the dispatch logic, the reorder buffer, the retire control unit,
+and the register file.
+.sp
+Below is an example of \fB\-all\-stats\fP output generated by  \fBllvm\-mca\fP
+for 300 iterations of the dot\-product example discussed in the previous
+sections.
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+Dynamic Dispatch Stall Cycles:
+RAT     \- Register unavailable:                      0
+RCU     \- Retire tokens unavailable:                 0
+SCHEDQ  \- Scheduler full:                            272  (44.6%)
+LQ      \- Load queue full:                           0
+SQ      \- Store queue full:                          0
+GROUP   \- Static restrictions on the dispatch group: 0
+
+
+Dispatch Logic \- number of cycles where we saw N micro opcodes dispatched:
+[# dispatched], [# cycles]
+ 0,              24  (3.9%)
+ 1,              272  (44.6%)
+ 2,              314  (51.5%)
+
+
+Schedulers \- number of cycles where we saw N micro opcodes issued:
+[# issued], [# cycles]
+ 0,          7  (1.1%)
+ 1,          306  (50.2%)
+ 2,          297  (48.7%)
+
+Scheduler\(aqs queue usage:
+[1] Resource name.
+[2] Average number of used buffer entries.
+[3] Maximum number of used buffer entries.
+[4] Total number of buffer entries.
+
+ [1]            [2]        [3]        [4]
+JALU01           0          0          20
+JFPU01           17         18         18
+JLSAGU           0          0          12
+
+
+Retire Control Unit \- number of cycles where we saw N instructions retired:
+[# retired], [# cycles]
+ 0,           109  (17.9%)
+ 1,           102  (16.7%)
+ 2,           399  (65.4%)
+
+Total ROB Entries:                64
+Max Used ROB Entries:             35  ( 54.7% )
+Average Used ROB Entries per cy:  32  ( 50.0% )
+
+
+Register File statistics:
+Total number of mappings created:    900
+Max number of mappings used:         35
+
+*  Register File #1 \-\- JFpuPRF:
+   Number of physical registers:     72
+   Total number of mappings created: 900
+   Max number of mappings used:      35
+
+*  Register File #2 \-\- JIntegerPRF:
+   Number of physical registers:     64
+   Total number of mappings created: 0
+   Max number of mappings used:      0
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+If we look at the \fIDynamic Dispatch Stall Cycles\fP table, we see the counter for
+SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
+logic is unable to dispatch a full group because the scheduler\(aqs queue is full.
+.sp
+Looking at the \fIDispatch Logic\fP table, we see that the pipeline was only able to
+dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
+one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
+dispatch statistics are displayed by either using the command option
+\fB\-all\-stats\fP or \fB\-dispatch\-stats\fP\&.
+.sp
+The next table, \fISchedulers\fP, is a histogram of the number of micro opcodes
+issued per cycle. In this case, of the 610 simulated cycles, a single micro
+opcode was issued in 306 cycles (50.2%), and there were 7 cycles in which no
+micro opcodes were issued.
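+.sp
+The percentages in the \fIDispatch Logic\fP and \fISchedulers\fP tables can be
+reproduced from the raw cycle counts.  The following Python sketch is only an
+illustration: the variable and function names are hypothetical, and the counts
+are copied from the example output above.

```python
# Cycle counts copied from the "Dispatch Logic" and "Schedulers" tables
# in the example output. Keys are the number of micro opcodes per cycle.
dispatched = {0: 24, 1: 272, 2: 314}
issued = {0: 7, 1: 306, 2: 297}


def percentages(histogram):
    """Return each bucket's share of all cycles, rounded to one decimal."""
    total = sum(histogram.values())
    return {n: round(100.0 * cycles / total, 1)
            for n, cycles in histogram.items()}


# Both histograms cover the same simulated run, so their totals agree.
total_cycles = sum(dispatched.values())
dispatch_pct = percentages(dispatched)
issue_pct = percentages(issued)
```

+.sp
+Both histograms sum to the same number of simulated cycles, which is a quick
+consistency check when reading the report.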
+.sp
+The \fIScheduler\(aqs queue usage\fP table shows the average and maximum number of
+buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
+reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
+three schedulers:
+.INDENT 0.0
+.IP \(bu 2
+JALU01 \- A scheduler for ALU instructions.
+.IP \(bu 2
+JFPU01 \- A scheduler for floating point operations.
+.IP \(bu 2
+JLSAGU \- A scheduler for address generation.
+.UNINDENT
+.sp
+The dot\-product is a kernel of three floating point instructions (a vector
+multiply followed by two horizontal adds).  That explains why only the floating
+point scheduler appears to be used.
+.sp
+A full scheduler queue is either caused by data dependency chains or by a
+sub\-optimal usage of hardware resources.  Sometimes, resource pressure can be
+mitigated by rewriting the kernel using different instructions that consume
+different scheduler resources.  Schedulers with a small queue are less resilient
+to bottlenecks caused by the presence of long data dependencies.  The scheduler
+statistics are displayed by using the command option \fB\-all\-stats\fP or
+\fB\-scheduler\-stats\fP\&.
+.sp
+The next table, \fIRetire Control Unit\fP, is a histogram of the number of
+instructions retired per cycle.  In
+this case, of the 610 simulated cycles, two instructions were retired during the
+same cycle 399 times (65.4%) and there were 109 cycles where no instructions
+were retired.  The retire statistics are displayed by using the command option
+\fB\-all\-stats\fP or \fB\-retire\-stats\fP\&.
+.sp
+The last table presented is \fIRegister File statistics\fP\&.  Each physical register
+file (PRF) used by the pipeline is presented in this table.  In the case of AMD
+Jaguar, there are two register files, one for floating\-point registers (JFpuPRF)
+and one for integer registers (JIntegerPRF).  The table shows that of the 900
+instructions processed, there were 900 mappings created.  Since this dot\-product
+example utilized only floating point registers, the JFpuPRF was responsible for
+creating the 900 mappings.  However, we see that the pipeline only used a
+maximum of 35 of 72 available register slots at any given time. We can conclude
+that the floating point PRF was the only register file used for the example, and
+that it was never resource constrained.  The register file statistics are
+displayed by using the command option \fB\-all\-stats\fP or
+\fB\-register\-file\-stats\fP\&.
+.sp
+In this example, we can conclude that the IPC is mostly limited by data
+dependencies, and not by resource pressure.
+.SS Instruction Flow
+.sp
+This section describes the instruction flow through the default pipeline of
+\fBllvm\-mca\fP, as well as the functional units involved in the process.
+.sp
+The default pipeline implements the following sequence of stages used to
+process instructions.
+.INDENT 0.0
+.IP \(bu 2
+Dispatch (Instruction is dispatched to the schedulers).
+.IP \(bu 2
+Issue (Instruction is issued to the processor pipelines).
+.IP \(bu 2
+Write Back (Instruction is executed, and results are written back).
+.IP \(bu 2
+Retire (Instruction is retired; writes are architecturally committed).
+.UNINDENT
+.sp
+The in\-order pipeline implements the following sequence of stages:
+.INDENT 0.0
+.IP \(bu 2
+InOrderIssue (Instruction is issued to the processor pipelines).
+.IP \(bu 2
+Retire (Instruction is retired; writes are architecturally committed).
+.UNINDENT
+.sp
+\fBllvm\-mca\fP assumes that instructions have all been decoded and placed
+into a queue before the simulation starts. Therefore, the instruction fetch and
+decode stages are not modeled. Performance bottlenecks in the frontend are not
+diagnosed. Also, \fBllvm\-mca\fP does not model branch prediction.
+.SS Instruction Dispatch
+.sp
+During the dispatch stage, instructions are picked in program order from a
+queue of already decoded instructions, and dispatched in groups to the
+simulated hardware schedulers.
+.sp
+The size of a dispatch group depends on the availability of the simulated
+hardware resources.  The processor dispatch width defaults to the value
+of the \fBIssueWidth\fP in LLVM\(aqs scheduling model.
+.sp
+An instruction can be dispatched if:
+.INDENT 0.0
+.IP \(bu 2
+The size of the dispatch group is smaller than the processor\(aqs dispatch width.
+.IP \(bu 2
+There are enough entries in the reorder buffer.
+.IP \(bu 2
+There are enough physical registers to do register renaming.
+.IP \(bu 2
+The schedulers are not full.
+.UNINDENT
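+.sp
+The four conditions above can be read as a single predicate.  The following
+Python sketch is a simplified illustration and not the \fBllvm\-mca\fP
+implementation: every name in it is hypothetical, and it assumes one scheduler
+and one register file.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    # All field names are hypothetical; they mirror the four conditions
    # listed in the text (group size, reorder buffer, registers, schedulers).
    dispatch_width: int
    group_size: int
    free_rob_entries: int
    free_physical_registers: int
    free_scheduler_entries: int


def can_dispatch(state, micro_ops, registers_needed):
    """Return True if one more instruction fits in the current group.

    micro_ops: reorder buffer entries the instruction would consume.
    registers_needed: physical registers needed for register renaming.
    """
    return (
        state.group_size < state.dispatch_width
        and state.free_rob_entries >= micro_ops
        and state.free_physical_registers >= registers_needed
        and state.free_scheduler_entries > 0
    )


# A full scheduler blocks dispatch even when everything else is available.
full_scheduler = DispatchState(2, 0, 64, 72, 0)
blocked = not can_dispatch(full_scheduler, micro_ops=1, registers_needed=1)
```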
+.sp
+Scheduling models can optionally specify which register files are available on
+the processor. \fBllvm\-mca\fP uses that information to initialize register
+file descriptors.  Users can limit the number of physical registers that are
+globally available for register renaming by using the command option
+\fB\-register\-file\-size\fP\&.  A value of zero for this option means \fIunbounded\fP\&. By
+knowing how many registers are available for renaming, the tool can predict
+dispatch stalls caused by the lack of physical registers.
+.sp
+The number of reorder buffer entries consumed by an instruction depends on the
+number of micro\-opcodes specified for that instruction by the target scheduling
+model.  The reorder buffer is responsible for tracking the progress of
+instructions that are \(dqin\-flight\(dq, and retiring them in program order.  The
+number of entries in the reorder buffer defaults to the value specified by field
+\fIMicroOpBufferSize\fP in the target scheduling model.
+.sp
+Instructions that are dispatched to the schedulers consume scheduler buffer
+entries. \fBllvm\-mca\fP queries the scheduling model to determine the set
+of buffered resources consumed by an instruction.  Buffered resources are
+treated like scheduler resources.
+.SS Instruction Issue
+.sp
+Each processor scheduler implements a buffer of instructions.  An instruction
+has to wait in the scheduler\(aqs buffer until input register operands become
+available.  Only at that point does the instruction become eligible for
+execution, and it may then be issued (potentially out\-of\-order).
+Instruction latencies are computed by \fBllvm\-mca\fP with the help of the
+scheduling model.
+.sp
+\fBllvm\-mca\fP\(aqs scheduler is designed to simulate multiple processor
+schedulers.  The scheduler is responsible for tracking data dependencies, and
+dynamically selecting which processor resources are consumed by instructions.
+It delegates the management of processor resource units and resource groups to a
+resource manager.  The resource manager is responsible for selecting resource
+units that are consumed by instructions.  For example, if an instruction
+consumes 1cy of a resource group, the resource manager selects one of the
+available units from the group; by default, the resource manager uses a
+round\-robin selector to guarantee that resource usage is uniformly distributed
+between all units of a group.
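+.sp
+A round\-robin selector over the units of a resource group can be sketched as
+follows.  This Python example only illustrates the idea under simplifying
+assumptions; the class, its methods, and the unit names U0 and U1 are
+hypothetical and do not correspond to \fBllvm\-mca\fP internals.

```python
class RoundRobinGroup:
    """Hypothetical resource group that hands out its units in rotation."""

    def __init__(self, units):
        self.units = list(units)
        # Cycle at which each unit becomes free again.
        self.busy_until = {unit: 0 for unit in self.units}
        # Index of the unit to try first on the next request.
        self._next = 0

    def select(self, now, cycles):
        """Reserve a free unit for `cycles` cycles starting at `now`.

        Returns the chosen unit, or None if every unit is busy.
        """
        count = len(self.units)
        for offset in range(count):
            index = (self._next + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= now:
                self.busy_until[unit] = now + cycles
                # Start the next search after the unit just used, so that
                # usage is spread uniformly across the group.
                self._next = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["U0", "U1"])
picks = [group.select(now=0, cycles=1) for _ in range(3)]
```

+.sp
+In this sketch the third request in the same cycle finds both units busy and
+gets no unit, which corresponds to the instruction waiting for a later cycle.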
+.sp
+\fBllvm\-mca\fP\(aqs scheduler internally groups instructions into three sets:
+.INDENT 0.0
+.IP \(bu 2
+WaitSet: a set of instructions whose operands are not ready.
+.IP \(bu 2
+ReadySet: a set of instructions ready to execute.
+.IP \(bu 2
+IssuedSet: a set of instructions executing.
+.UNINDENT
+.sp
+Depending on operand availability, instructions that are dispatched to the
+scheduler are either placed into the WaitSet or into the ReadySet.
+.sp
+Every cycle, the scheduler checks if instructions can be moved from the WaitSet
+to the ReadySet, and if instructions from the ReadySet can be issued to the
+underlying pipelines. The algorithm prioritizes older instructions over younger
+instructions.
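+.sp
+The movement of instructions between the three sets can be sketched with a
+small cycle loop.  This Python example is a toy model under stated assumptions:
+it ignores dispatch, resource pressure, and the minimum delay between dispatch
+and issue, and the function name, its parameters, and the latencies used
+below are hypothetical.

```python
def simulate(instructions, issue_width):
    """Return the issue cycle of each instruction.

    instructions: list of (deps, latency) in program order, where deps is
    the set of indices of older instructions whose results are needed.
    """
    wait_set = set(range(len(instructions)))
    ready_set = set()
    issued_set = {}  # index -> cycle at which the result is available
    done = set()
    issue_cycle = {}
    cycle = 0
    while len(done) < len(instructions):
        # Instructions that finished executing leave the IssuedSet.
        for index in [i for i, end in issued_set.items() if end <= cycle]:
            del issued_set[index]
            done.add(index)
        # WaitSet -> ReadySet once all operands are available.
        for index in sorted(wait_set):
            if instructions[index][0] <= done:
                wait_set.remove(index)
                ready_set.add(index)
        # ReadySet -> IssuedSet, older instructions first.
        for index in sorted(ready_set)[:issue_width]:
            ready_set.remove(index)
            issued_set[index] = cycle + instructions[index][1]
            issue_cycle[index] = cycle
        cycle += 1
    return issue_cycle


# A three-instruction RAW chain, loosely modelled on the dot-product kernel.
chain = [(set(), 2), ({0}, 3), ({1}, 3)]
chain_issue = simulate(chain, issue_width=2)

# Three independent instructions compete for two issue slots per cycle.
independent = [(set(), 1), (set(), 1), (set(), 1)]
independent_issue = simulate(independent, issue_width=2)
```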
+.SS Write\-Back and Retire Stage
+.sp
+Issued instructions are moved from the ReadySet to the IssuedSet.  There,
+instructions wait until they reach the write\-back stage.  At that point, they
+get removed from the queue and the retire control unit is notified.
+.sp
+When instructions are executed, the retire control unit flags them as
+\(dqready to retire.\(dq
+.sp
+Instructions are retired in program order.  The register file is notified of the
+retirement so that it can free the physical registers that were allocated for
+the instruction during the register renaming stage.
+.SS Load/Store Unit and Memory Consistency Model
+.sp
+To simulate an out\-of\-order execution of memory operations, \fBllvm\-mca\fP
+utilizes a simulated load/store unit (LSUnit) to simulate the speculative
+execution of loads and stores.
+.sp
+Each load (or store) consumes an entry in the load (or store) queue. Users can
+specify flags \fB\-lqueue\fP and \fB\-squeue\fP to limit the number of entries in the
+load and store queues respectively. The queues are unbounded by default.
+.sp
+The LSUnit implements a relaxed consistency model for memory loads and stores.
+The rules are:
+.INDENT 0.0
+.IP 1. 3
+A younger load is allowed to pass an older load only if there are no
+intervening stores or barriers between the two loads.
+.IP 2. 3
+A younger load is allowed to pass an older store provided that the load does
+not alias with the store.
+.IP 3. 3
+A younger store is not allowed to pass an older store.
+.IP 4. 3
+A younger store is not allowed to pass an older load.
+.UNINDENT
+.sp
+By default, the LSUnit optimistically assumes that loads do not alias with
+store operations (\fB\-noalias=true\fP).  Under this assumption, younger loads are
+always allowed to pass older stores.  Essentially, the LSUnit does not attempt
+to run any alias analysis to predict when loads and stores do not alias with
+each other.
+.sp
+Note that, in the case of write\-combining memory, rule 3 could be relaxed to
+allow reordering of non\-aliasing store operations.  That being said, at the
+moment, there is no way to further relax the memory model (\fB\-noalias\fP is the
+only option).  Essentially, there is no option to specify a different memory
+type (e.g., write\-back, write\-combining, write\-through, etc.) and consequently
+to weaken, or strengthen, the memory model.
+.sp
+Other limitations are:
+.INDENT 0.0
+.IP \(bu 2
+The LSUnit does not know when store\-to\-load forwarding may occur.
+.IP \(bu 2
+The LSUnit does not know anything about cache hierarchy and memory types.
+.IP \(bu 2
+The LSUnit does not know how to identify serializing operations and memory
+fences.
+.UNINDENT
+.sp
+The LSUnit does not attempt to predict if a load or store hits or misses the L1
+cache.  It only knows if an instruction \(dqMayLoad\(dq and/or \(dqMayStore.\(dq  For
+loads, the scheduling model provides an \(dqoptimistic\(dq load\-to\-use latency (which
+usually matches the load\-to\-use latency for when there is a hit in the L1D).
+.sp
+\fBllvm\-mca\fP does not (on its own) know about serializing operations or
+memory\-barrier like instructions.  The LSUnit used to conservatively use an
+instruction\(aqs \(dqMayLoad\(dq, \(dqMayStore\(dq, and unmodeled side effects flags to
+determine whether an instruction should be treated as a memory\-barrier. This was
+inaccurate in general and was changed so that now each instruction has an
+IsAStoreBarrier and IsALoadBarrier flag. These flags are mca specific and
+default to false for every instruction. If any instruction should have either of
+these flags set, it should be done within the target\(aqs InstrPostProcess class.
+For an example, look at the \fIX86InstrPostProcess::postProcessInstruction\fP method
+within \fIllvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp\fP\&.
+.sp
+A load/store barrier consumes one entry of the load/store queue.  A load/store
+barrier enforces ordering of loads/stores.  A younger load cannot pass a load
+barrier.  Also, a younger store cannot pass a store barrier.  A younger load
+has to wait for the memory/load barrier to execute.  A load/store barrier is
+\(dqexecuted\(dq when it becomes the oldest entry in the load/store queue(s). That
+also means, by construction, all of the older loads/stores have been executed.
+.sp
+In conclusion, the full set of load/store consistency rules is:
+.INDENT 0.0
+.IP 1. 3
+A store may not pass a previous store.
+.IP 2. 3
+A store may not pass a previous load (regardless of \fB\-noalias\fP).
+.IP 3. 3
+A store has to wait until an older store barrier is fully executed.
+.IP 4. 3
+A load may pass a previous load.
+.IP 5. 3
+A load may not pass a previous store unless \fB\-noalias\fP is set.
+.IP 6. 3
+A load has to wait until an older load barrier is fully executed.
+.UNINDENT
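+.sp
+The six rules above can be encoded as a pairwise check.  The following Python
+sketch is an illustration only: the function and the operation labels are
+hypothetical, and it looks at one younger/older pair in isolation, so it does
+not model operations that sit between the two.

```python
# Fixed outcomes for (younger, older) pairs, taken from rules 1-4 and 6.
# The operation labels are hypothetical names used only in this sketch.
FIXED_RULES = {
    ("store", "store"): False,          # rule 1
    ("store", "load"): False,           # rule 2, regardless of noalias
    ("store", "store_barrier"): False,  # rule 3
    ("load", "load"): True,             # rule 4
    ("load", "load_barrier"): False,    # rule 6
}


def may_pass(younger, older, noalias=True):
    """Return True if `younger` may execute before `older`.

    noalias mirrors the -noalias option, which the text says is true by
    default. Pairs not covered by the six rules raise ValueError.
    """
    if (younger, older) == ("load", "store"):
        return noalias                  # rule 5
    try:
        return FIXED_RULES[(younger, older)]
    except KeyError:
        raise ValueError("pair not covered by the six rules") from None


default_load_over_store = may_pass("load", "store")
strict_load_over_store = may_pass("load", "store", noalias=False)
```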
+.SS In\-order Issue and Execute
+.sp
+In\-order processors are modelled as a single \fBInOrderIssueStage\fP stage. It
+bypasses Dispatch, Scheduler and Load/Store unit. Instructions are issued as
+soon as their operand registers are available and resource requirements are
+met. Multiple instructions can be issued in one cycle according to the value of
+the \fBIssueWidth\fP parameter in LLVM\(aqs scheduling model.
+.sp
+Once issued, an instruction is moved to the \fBIssuedInst\fP set until it is ready to
+retire. \fBllvm\-mca\fP ensures that writes are committed in\-order. However,
+an instruction is allowed to commit writes and retire out\-of\-order if the
+\fBRetireOOO\fP property is true for at least one of its writes.
+.SS Custom Behaviour
+.sp
+Due to certain instructions not being expressed perfectly within their
+scheduling model, \fBllvm\-mca\fP isn\(aqt always able to simulate them
+perfectly. Modifying the scheduling model isn\(aqt always a viable
+option though (maybe because the instruction is modeled incorrectly on
+purpose or the instruction\(aqs behaviour is quite complex). The
+CustomBehaviour class can be used in these cases to enforce proper
+instruction modeling (often by customizing data dependencies and detecting
+hazards that \fBllvm\-mca\fP has no way of knowing about).
+.sp
+\fBllvm\-mca\fP comes with one generic and multiple target specific
+CustomBehaviour classes. The generic class will be used if the \fB\-disable\-cb\fP
+flag is used or if a target specific CustomBehaviour class doesn\(aqt exist for
+that target. (The generic class does nothing.) Currently, the CustomBehaviour
+class is only a part of the in\-order pipeline, but there are plans to add it
+to the out\-of\-order pipeline in the future.
+.sp
+CustomBehaviour\(aqs main method is \fIcheckCustomHazard()\fP which uses the
+current instruction and a list of all instructions still executing within
+the pipeline to determine if the current instruction should be dispatched.
+As output, the method returns an integer representing the number of cycles
+that the current instruction must stall for.  A value of 0 means no stall, and
+the value can be an underestimate if the exact number is not known.
+.sp
+If you\(aqd like to add a CustomBehaviour class for a target that doesn\(aqt
+already have one, refer to an existing implementation to see how to set it
+up. The classes are implemented within the target specific backend (for
+example \fI/llvm/lib/Target/AMDGPU/MCA/\fP) so that they can access backend symbols.
+.SS Instrument Manager
+.sp
+On certain architectures, the scheduling information for certain instructions
+does not contain all of the information required to identify the most precise
+schedule class. For example, data that can have an impact on scheduling can
+be stored in CSR registers.
+.sp
+One example of this is on RISCV, where values in registers such as \fIvtype\fP
+and \fIvl\fP change the scheduling behaviour of vector instructions. Since MCA
+does not keep track of the values in registers, instrument comments can
+be used to specify these values.
+.sp
+InstrumentManager\(aqs main function is \fIgetSchedClassID()\fP which has access
+to the MCInst and all of the instruments that are active for that MCInst.
+This function can use the instruments to override the schedule class of
+the MCInst.
+.sp
+On RISCV, instrument comments containing LMUL information are used
+by \fIgetSchedClassID()\fP to map a vector instruction and the active
+LMUL to the scheduling class of the pseudo\-instruction that describes
+that base instruction and the active LMUL.
+.SS Custom Views
+.sp
+\fBllvm\-mca\fP comes with several Views such as the Timeline View and
+Summary View. These Views are generic and can work with most (if not all)
+targets. If you wish to add a new View to \fBllvm\-mca\fP and it does not
+require any backend functionality that is not already exposed through MC layer
+classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
+\fI/tools/llvm\-mca/Views/\fP directory. However, if your new View is target specific
+AND requires unexposed backend symbols or functionality, you can define it in
+the \fI/lib/Target/<TargetName>/MCA/\fP directory.
+.sp
+To enable this target specific View, you will have to use this target\(aqs
+CustomBehaviour class to override the \fICustomBehaviour::getViews()\fP methods.
+There are 3 variations of these methods based on where you want your View to
+appear in the output: \fIgetStartViews()\fP, \fIgetPostInstrInfoViews()\fP, and
+\fIgetEndViews()\fP\&. These methods return a vector of Views so you will want to
+return a vector containing all of the target specific Views for the target in
+question.
+.sp
+Because these target specific (and backend dependent) Views require the
+\fICustomBehaviour::getViews()\fP variants, these Views will not be enabled if
+the \fB\-disable\-cb\fP flag is used.
+.sp
+Enabling these custom Views does not affect the non\-custom (generic) Views.
+Continue to use the usual command line arguments to enable / disable those
+Views.
+.SH AUTHOR
+Maintained by the LLVM Team (https://llvm.org/).
+.SH COPYRIGHT
+2003-2023, LLVM Project
+.\" Generated by docutils manpage writer.
+.