RFC: Uniformity Analysis for Irreducible Control Flow

Uniformity analysis is a generalization of divergence analysis to
include irreducible control flow:

  1. The proposed spec presents a notion of "maximal convergence" that
     captures the existing convention of converging threads at the
     headers of natural loops.

  2. Maximal convergence is then extended to irreducible cycles. The
     identity of irreducible cycles is determined by the choices made
     in a depth-first traversal of the control flow graph. Uniformity
     analysis uses criteria that depend only on closed paths, not on
     cycles, to determine maximal convergence. This makes it a
     conservative analysis that is independent of the effect of DFS on
     CycleInfo.

  3. The analysis is implemented as a template that can be
     instantiated for both LLVM IR and Machine IR.

Validation:
  - passes existing tests for divergence analysis
  - passes new tests with irreducible control flow
  - passes equivalent tests in MIR and GMIR

Based on concepts originally outlined by
Nicolai Haehnle <nicolai.haehnle@amd.com>

With contributions from Ruiling Song <ruiling.song@amd.com> and
Jay Foad <jay.foad@amd.com>.

Support for GMIR and lit tests for GMIR/MIR added by
Yashwant Singh <yashwant.singh@amd.com>.

Differential Revision: https://reviews.llvm.org/D130746
Sameer Sahasrabuddhe 2022-12-20 06:49:30 +05:30
parent 3ebc6bee6b
commit 475ce4c200
81 changed files with 5921 additions and 110 deletions

llvm/docs/ConvergenceAndUniformity.rst

@@ -0,0 +1,697 @@
==========================
Convergence And Uniformity
==========================
.. contents::
:local:
Introduction
============
Some parallel environments execute threads in groups that allow
communication within the group using special primitives called
*convergent* operations. The outcome of a convergent operation is
sensitive to the set of threads that executes it "together", i.e.,
convergently.
A value is said to be *uniform* across a set of threads if it is the
same across those threads, and *divergent* otherwise. Correspondingly,
a branch is said to be a uniform branch if its condition is uniform,
and it is a divergent branch otherwise.
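For example, in a hypothetical GPU-style kernel fragment (a sketch, not
taken from this patch; ``tid`` is the thread's id within its group and is
assumed divergent, while the kernel argument ``n`` is assumed uniform):

.. code-block:: c++

   // Hypothetical kernel fragment illustrating the definitions above.
   void kernel(int tid, int n) {
     int u = n * 2;   // uniform: every thread computes the same value
     int d = tid & 1; // divergent: differs across threads in the group
     if (n > 0) {     // uniform branch: all threads take the same edge
       // ...
     }
     if (d) {         // divergent branch: threads take different edges
       // ...
     }
   }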
Whether threads are *converged* or not depends on the paths they take
through the control flow graph. Threads take different outgoing edges
at a *divergent branch*. Divergent branches constrain
program transforms such as changing the CFG or moving a convergent
operation to a different point of the CFG. Performing these
transformations across a divergent branch can change the sets of
threads that execute convergent operations convergently. While these
constraints are out of scope for this document, the described
*uniformity analysis* allows these transformations to identify
uniform branches where these constraints do not hold.
Convergence and
uniformity are inter-dependent: When threads diverge at a divergent
branch, they may later *reconverge* at a common program point.
Subsequent operations are performed convergently, but the inputs may
be non-uniform, thus producing divergent outputs.
Uniformity is also useful by itself on targets that execute threads in
groups with shared execution resources (e.g. waves, warps, or
subgroups):
- Uniform outputs can potentially be computed or stored on shared
resources.
- These targets must "linearize" a divergent branch to ensure that
each side of the branch is followed by the corresponding threads in
the same group. But linearization is unnecessary at uniform
branches, since the whole group of threads follows either one side
of the branch or the other.
This document presents a definition of convergence that is reasonable
for real targets and is compatible with the currently implicit
semantics of convergent operations in LLVM IR. This is accompanied by
a *uniformity analysis* that extends the existing divergence analysis
[DivergenceSPMD]_ to cover irreducible control-flow.
.. [DivergenceSPMD] Julian Rosemann, Simon Moll, and Sebastian
Hack. 2021. An Abstract Interpretation for SPMD Divergence on
Reducible Control Flow Graphs. Proc. ACM Program. Lang. 5, POPL,
Article 31 (January 2021), 35 pages.
https://doi.org/10.1145/3434312
Terminology
===========
Cycles
Described in :ref:`cycle-terminology`.
Closed path
Described in :ref:`cycle-closed-path`.
Disjoint paths
Two paths in a CFG are said to be disjoint if the only nodes common
to both are the start node, the end node, or both.
Join node
A join node of a branch is a node reachable along two disjoint paths
starting from that branch.
Diverged path
A diverged path is a path that starts from a divergent branch and
either reaches a join node of the branch or reaches the end of the
function without passing through any join node of the branch.
Threads and Dynamic Instances
=============================
Each occurrence of an instruction in the program source is called a
*static instance*. When a thread executes a program, each execution of
a static instance produces a distinct *dynamic instance* of that
instruction.
Each thread produces a unique sequence of dynamic instances:
- The sequence is generated along branch decisions and loop
traversals.
- It starts with a dynamic instance of a "first" instruction.
- It continues with dynamic instances of successive "next"
instructions.
Threads are independent; some targets may choose to execute them in
groups in order to share resources when possible.
.. figure:: convergence-natural-loop.png
:name: convergence-natural-loop
.. table::
:name: convergence-thread-example
:align: left
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread 1 | Entry1 | H1 | B1 | L1 | H3 | | L3 | | | | Exit |
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread 2 | Entry2 | H2 | | L2 | H4 | B2 | L4 | H5 | B3 | L5 | Exit |
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
In the above table, each row is a different thread, listing the
dynamic instances produced by that thread from left to right. Each
thread executes the same program that starts with an ``Entry`` node
and ends with an ``Exit`` node, but different threads may take
different paths through the control flow of the program. The columns
are numbered merely for convenience, and empty cells have no special
meaning. Dynamic instances listed in the same column are converged.
.. _convergence-definition:
Convergence
===========
*Converged-with* is a transitive symmetric relation over dynamic
instances produced by *different threads* for the *same static
instance*. Informally, two threads that produce converged dynamic
instances are said to be *converged*, and they are said to execute
that static instance *convergently*, at that point in the execution.
*Convergence order* is a strict partial order over dynamic instances
that is defined as the transitive closure of:
1. If dynamic instance ``P`` is executed strictly before ``Q`` in the
same thread, then ``P`` is *convergence-before* ``Q``.
2. If dynamic instance ``P`` is executed strictly before ``Q1`` in the
same thread, and ``Q1`` is *converged-with* ``Q2``, then ``P`` is
*convergence-before* ``Q2``.
3. If dynamic instance ``P1`` is *converged-with* ``P2``, and ``P2``
is executed strictly before ``Q`` in the same thread, then ``P1``
is *convergence-before* ``Q``.
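Equivalently, as a sketch in relational form: write ``->`` for "is
executed strictly before in the same thread", ``~`` for *converged-with*,
``∘`` for left-to-right relational composition, and a superscript plus for
transitive closure. Then:

.. math::

   \prec \;=\; \big( \rightarrow \;\cup\; (\rightarrow \circ \sim) \;\cup\; (\sim \circ \rightarrow) \big)^{+}

The three terms of the union correspond to rules 1, 2 and 3 above.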
.. table::
:name: convergence-order-example
:align: left
+----------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+----------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| Thread 1 | Entry | ... | | | | S2 | T | ... | Exit |
+----------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| Thread 2 | Entry | ... | | Q2 | R | S1 | | ... | Exit |
+----------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| Thread 3 | Entry | ... | P | Q1 | | | | ... | |
+----------+-------+-----+-----+-----+-----+-----+-----+-----+------+
The above table shows partial sequences of dynamic instances from
different threads. Dynamic instances in the same column are assumed
to be converged (i.e., related to each other in the converged-with
relation). The resulting convergence order includes the edges ``P ->
Q2``, ``Q1 -> R``, ``P -> R``, ``P -> T``, etc.
The fact that *convergence-before* is a strict partial order is a
constraint on the *converged-with* relation. It is trivially satisfied
if different dynamic instances are never converged. It is also
trivially satisfied for all known implementations for which
convergence plays some role. Aside from the strict partial convergence
order, there are currently no additional constraints on the
*converged-with* relation imposed in LLVM IR.
.. _convergence-note-convergence:
.. note::
1. The ``convergent`` attribute on convergent operations does
constrain changes to ``converged-with``, but it is expressed in
terms of control flow and does not explicitly deal with thread
convergence.
2. The convergence-before relation is not
directly observable. Program transforms are in general free to
change the order of instructions, even though that obviously
changes the convergence-before relation.
3. Converged dynamic instances need not be executed at the same
time or even on the same resource. Converged dynamic instances
of a convergent operation may appear to do so but that is an
implementation detail. The fact that ``P`` is convergence-before
``Q`` does not automatically imply that ``P`` happens-before
``Q`` in a memory model sense.
4. **Future work:** Providing convergence-related guarantees to
compiler frontends enables some powerful optimization techniques
that can be used by programmers or by high-level program
transforms. Constraints on the ``converged-with`` relation may
be added eventually as part of the definition of LLVM
IR, so that guarantees can be made that frontends can rely on.
For a proposal on how this might work, see `D85603
<https://reviews.llvm.org/D85603>`_.
.. _convergence-maximal:
Maximal Convergence
-------------------
This section defines a constraint that may be used to
produce a *maximal converged-with* relation without violating the
strict *convergence-before* order. This maximal converged-with
relation is reasonable for real targets and is compatible with
convergent operations.
The maximal converged-with relation is defined in terms of cycle
headers, which are not unique to a given CFG. Each cycle hierarchy for
the same CFG results in a different maximal converged-with relation.
**Maximal converged-with:**
Dynamic instances ``X1`` and ``X2`` produced by different threads
for the same static instance ``X`` are converged in the maximal
converged-with relation if and only if for every cycle ``C`` with
header ``H`` that contains ``X``:
- every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in
the respective thread is convergence-before ``X2``, and,
- every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in
the respective thread is convergence-before ``X1``,
- without assuming that ``X1`` is converged with ``X2``.
.. note::
For brevity, the rest of the document restricts the term
*converged* to mean "related under the maximal converged-with
relation for the given cycle hierarchy".
Maximal convergence can now be demonstrated in the earlier example as follows:
.. table::
:align: left
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread 1 | Entry1 | H1 | B1 | L1 | H3 | | L3 | | | | Exit |
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread 2 | Entry2 | H2 | | L2 | H4 | B2 | L4 | H5 | B3 | L5 | Exit |
+----------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
- ``Entry1`` and ``Entry2`` are converged.
- ``H1`` and ``H2`` are converged.
- ``B1`` and ``B2`` are not converged due to ``H4`` which is not
convergence-before ``B1``.
- ``H3`` and ``H4`` are converged.
- ``H3`` is not converged with ``H5`` due to ``H4`` which is not
convergence-before ``H3``.
- ``L1`` and ``L2`` are converged.
- ``L3`` and ``L4`` are converged.
- ``L3`` is not converged with ``L5`` due to ``H5`` which is not
convergence-before ``L3``.
.. _convergence-cycle-headers:
Dependence on Cycle Headers
----------------------------
Contradictions in convergence order are possible only between two
nodes that are inside some cycle. The dynamic instances of such nodes
may be interleaved in the same thread, and this interleaving may be
different for different threads.
When a thread executes a node ``X`` once and then executes it again,
it must have followed a closed path in the CFG that includes ``X``.
Such a path must pass through the header of at least one cycle --- the
smallest cycle that includes the entire closed path. In a given
thread, two dynamic instances of ``X`` are either separated by the
execution of at least one cycle header, or ``X`` itself is a cycle
header.
In reducible cycles (natural loops), each execution of the header is
equivalent to the start of a new iteration of the cycle. But this
analogy breaks down in the presence of explicit constraints on the
converged-with relation, such as those described in :ref:`future
work<convergence-note-convergence>`. Instead, cycle headers should be
treated as implicit *points of convergence* in a maximal
converged-with relation.
Consider a sequence of nested cycles ``C1``, ``C2``, ..., ``Ck`` such
that ``C1`` is the outermost cycle and ``Ck`` is the innermost cycle,
with headers ``H1``, ``H2``, ..., ``Hk`` respectively. When a thread
enters the cycle ``Ck``, any of the following is possible:
1. The thread directly entered cycle ``Ck`` without having executed
any of the headers ``H1`` to ``Hk``.
2. The thread executed some or all of the nested headers one or more
times.
The maximal converged-with relation captures the following intuition
about cycles:
1. When two threads enter a top-level cycle ``C1``, they execute
converged dynamic instances of every node that is a :ref:`child
<cycle-parent-block>` of ``C1``.
2. When two threads enter a nested cycle ``Ck``, they execute
converged dynamic instances of every node that is a child of
``Ck``, until either thread exits ``Ck``, if and only if they
executed converged dynamic instances of the last nested header that
either thread encountered.
Note that when a thread exits a nested cycle ``Ck``, it must follow
a closed path outside ``Ck`` to reenter it. This requires executing
the header of some outer cycle, as described earlier.
Consider two dynamic instances ``X1`` and ``X2`` produced by threads ``T1``
and ``T2`` for a node ``X`` that is a child of nested cycle ``Ck``.
Maximal convergence relates ``X1`` and ``X2`` as follows:
1. If neither thread executed any header from ``H1`` to ``Hk``, then
``X1`` and ``X2`` are converged.
2. Otherwise, if there are no converged dynamic instances ``Q1`` and
``Q2`` of any header ``Q`` from ``H1`` to ``Hk`` (where ``Q`` is
possibly the same as ``X``), such that ``Q1`` precedes ``X1`` and
``Q2`` precedes ``X2`` in the respective threads, then ``X1`` and
``X2`` are not converged.
3. Otherwise, consider the pair ``Q1`` and ``Q2`` of converged dynamic
instances of a header ``Q`` from ``H1`` to ``Hk`` that occur most
recently before ``X1`` and ``X2`` in the respective threads. Then
``X1`` and ``X2`` are converged if and only if there is no dynamic
instance of any header from ``H1`` to ``Hk`` that occurs between
``Q1`` and ``X1`` in thread ``T1``, or between ``Q2`` and ``X2`` in
thread ``T2``. In other words, ``Q1`` and ``Q2`` represent the last
point of convergence, with no other header being executed before
executing ``X``.
**Example:**
.. figure:: convergence-both-diverged-nested.png
:name: convergence-both-diverged-nested
The above figure shows two nested irreducible cycles with headers
``R`` and ``S``. The nodes ``Entry`` and ``Q`` have divergent
branches. The table below shows the convergence between three threads
taking different paths through the CFG. Dynamic instances listed in
the same column are converged.
.. table::
:align: left
+---------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| Thread1 | Entry | P1 | Q1 | S1 | P3 | Q3 | R1 | S2 | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| Thread2 | Entry | P2 | Q2 | | | | R2 | S3 | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+------+
| Thread3 | Entry | | | | | | R3 | S4 | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+------+
- ``P2`` and ``P3`` are not converged due to ``S1``
- ``Q2`` and ``Q3`` are not converged due to ``S1``
- ``S1`` and ``S3`` are not converged due to ``R2``
- ``S1`` and ``S4`` are not converged due to ``R3``
Informally, ``T1`` and ``T2`` execute the inner cycle a different
number of times, without executing the header of the outer cycle. All
threads converge in the outer cycle when they first execute the header
of the outer cycle.
.. _convergence-uniformity:
Uniformity
==========
1. The output of two converged dynamic instances is uniform if and
only if it compares equal for those two dynamic instances.
2. The output of a static instance ``X`` is uniform *for a given set
of threads* if and only if it is uniform for every pair of
converged dynamic instances of ``X`` produced by those threads.
A non-uniform value is said to be *divergent*.
For a set ``S`` of threads, the uniformity of each output of a static
instance is determined as follows:
1. The semantics of the instruction may specify the output to be
uniform.
2. Otherwise, if it is a PHI node, its output is uniform if and only
if for every pair of converged dynamic instances produced by all
threads in ``S``:
a. Both instances choose the same output from converged
dynamic instances, and,
b. That output is uniform for all threads in ``S``.
3. Otherwise, the output is uniform if and only if the input
operands are uniform for all threads in ``S``.
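A minimal sketch of these rules as a per-instruction predicate; the types
and helper queries (``Inst``, ``semanticUniformity``, the PHI helpers) are
hypothetical stand-ins for the analysis state, and ``Kind`` mirrors the
``InstructionUniformity`` enum introduced elsewhere in this patch:

.. code-block:: c++

   // Sketch only: evaluate the three uniformity rules for one instruction.
   bool isOutputUniform(const Inst &I) {
     // Rule 1: the instruction's semantics may fix the answer outright.
     Kind K = semanticUniformity(I);
     if (K == Kind::AlwaysUniform)
       return true;
     if (K == Kind::NeverUniform)
       return false;
     // Rule 2: a PHI is uniform only if converged dynamic instances
     // choose the same incoming value, and that value is itself uniform.
     if (I.isPhi())
       return convergedThreadsChooseSameInput(I) && chosenInputIsUniform(I);
     // Rule 3: otherwise, uniform iff every operand is uniform.
     for (const auto &Op : I.operands())
       if (!isUniform(Op))
         return false;
     return true;
   }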
Divergent Cycle Exits
---------------------
When a divergent branch occurs inside a cycle, it is possible that a
diverged path continues to an exit of the cycle. This is called a
divergent cycle exit. If the cycle is irreducible, the diverged path
may re-enter and eventually reach a join within the cycle. Such a join
should be examined for the :ref:`diverged entry
<convergence-diverged-entry>` criterion.
Nodes along the diverged path that lie outside the cycle experience
*temporal divergence*, when two threads executing convergently inside
the cycle produce uniform values, but exit the cycle along the same
divergent path after executing the header a different number of times
(informally, on different iterations of the cycle). For a node ``N``
inside the cycle, the outputs may be uniform for the two threads, but
any use ``U`` outside the cycle receives a value from non-converged
dynamic instances of ``N``. An output of ``U`` may be divergent,
depending on the semantics of the instruction.
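A small illustration of temporal divergence (hypothetical kernel code, not
from this patch; ``tid`` is the divergent thread id, ``n`` a uniform bound):

.. code-block:: c++

   // Sketch: v is uniform inside the loop but divergent at the return.
   int kernel(int tid, int n) {
     int v = 0;
     for (int i = 0; i < n; ++i) {
       v = 2 * i;      // uniform: converged threads in the same
                       // iteration compute the same value
       if (i == tid)   // divergent branch leading to the cycle exit
         break;
     }
     // Threads exit on different iterations, so this use receives values
     // from non-converged dynamic instances of the assignment to v.
     return v;         // divergent
   }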
Static Uniformity Analysis
==========================
Irreducible control flow results in different cycle hierarchies
depending on the choice of headers during depth-first traversal. As a
result, a static analysis cannot always determine the convergence of
nodes in irreducible cycles, and any uniformity analysis is limited to
those static instances whose convergence is independent of the cycle
hierarchy:
**m-converged static instances:**
A static instance ``X`` is *m-converged* for a given CFG if and only
if the maximal converged-with relation for its dynamic instances is
the same in every cycle hierarchy that can be constructed for that CFG.
.. note::
In other words, two dynamic instances ``X1`` and ``X2`` of an
m-converged static instance ``X`` are converged in some cycle
hierarchy if and only if they are also converged in every other
cycle hierarchy for the same CFG.
As noted earlier, for brevity, we restrict the term *converged* to
mean "related under the maximal converged-with relation for a given
cycle hierarchy".
Each node ``X`` in a given CFG is reported to be m-converged if and
only if:
1. ``X`` is a :ref:`top-level<cycle-toplevel-block>` node, in which
case, there are no cycle headers to influence the convergence of
``X``.
2. Otherwise, if ``X`` is inside a cycle, then every cycle that
contains ``X`` satisfies the following necessary conditions:
a. Every divergent branch inside the cycle satisfies the
:ref:`diverged entry criterion<convergence-diverged-entry>`, and,
b. There are no :ref:`diverged paths reaching the
cycle<convergence-diverged-outside>` from a divergent branch
outside it.
.. note::
A reducible cycle :ref:`trivially satisfies
<convergence-reducible-cycle>` the above conditions. In particular,
if the whole CFG is reducible, then all nodes in the CFG are
m-converged.
If a static instance is not m-converged, then every output is assumed
to be divergent. Otherwise, for an m-converged static instance, the
uniformity of each output is determined using the criteria
:ref:`described earlier <convergence-uniformity>`. The discovery of
divergent outputs may cause their uses (including branches) to also
become divergent. The analysis propagates this divergence until a
fixed point is reached.
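The propagation is an ordinary worklist fixed point. A minimal sketch with
hypothetical helpers (in this patch the real implementation is the
``GenericUniformityAnalysisImpl`` template, whose ``pushUsers`` methods
appear later in the diff):

.. code-block:: c++

   // Sketch: propagate divergence from its sources to a fixed point.
   std::vector<const Inst *> Worklist = divergenceSources();
   while (!Worklist.empty()) {
     const Inst *I = Worklist.back();
     Worklist.pop_back();
     for (const Inst *User : usersOf(*I)) {
       if (isAlwaysUniform(*User))
         continue;                 // a semantic override stops propagation
       if (markDivergent(*User))   // returns true only on the first marking
         Worklist.push_back(User); // its users must be re-examined
     }
   }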
The convergence inferred using these criteria is a safe subset of the
maximal converged-with relation for any cycle hierarchy. In
particular, it is sufficient to determine if a static instance is
m-converged for a given cycle hierarchy ``T``, even if that fact is
not detected when examining some other cycle hierarchy ``T'``.
This property allows compiler transforms to use the uniformity
analysis without being affected by DFS choices made in the underlying
cycle analysis. When two transforms use different instances of the
uniformity analysis for the same CFG, a "divergent value" result in
one analysis instance cannot contradict a "uniform value" result in
the other.
Generic transforms such as SimplifyCFG, CSE, and loop transforms
commonly change the program in ways that change the maximal
converged-with relations. This also means that a value that was
previously uniform can become divergent after such a transform.
Uniformity has to be recomputed after such transforms.
Divergent Branch inside a Cycle
-------------------------------
.. figure:: convergence-divergent-inside.png
:name: convergence-divergent-inside
The above figure shows a divergent branch ``Q`` inside an irreducible
cyclic region. When two threads diverge at ``Q``, the convergence of
dynamic instances within the cyclic region depends on the cycle
hierarchy chosen:
1. In an implementation that detects a single cycle ``C`` with header
``P``, convergence inside the cycle is determined by ``P``.
2. In an implementation that detects two nested cycles with headers
``R`` and ``S``, convergence inside those cycles is determined by
their respective headers.
.. _convergence-diverged-entry:
A conservative approach would be to simply report all nodes inside
irreducible cycles as having divergent outputs. But it is desirable to
recognize m-converged nodes in the CFG in order to maximize
uniformity. This section describes one such pattern of nodes derived
from *closed paths*, which are a property of the CFG and do not depend
on the cycle hierarchy.
**Diverged Entry Criterion:**
The dynamic instances of all the nodes in a closed path ``P`` are
m-converged only if for every divergent branch ``B`` and its
join node ``J`` that lie on ``P``, there is no entry to ``P`` which
lies on a diverged path from ``B`` to ``J``.
.. figure:: convergence-closed-path.png
:name: convergence-closed-path
Consider the closed path ``P -> Q -> R -> S`` in the above figure.
``P`` and ``R`` are :ref:`entries to the closed
path<cycle-closed-path>`. ``Q`` is a divergent branch and ``S`` is a
join for that branch, with diverged paths ``Q -> R -> S`` and ``Q ->
S``.
- If a diverged entry ``R`` exists, then in some cycle hierarchy,
``R`` is the header of the smallest cycle ``C`` containing the
closed path and a :ref:`child cycle<cycle-definition>` ``C'``
exists in the set ``C - R``, containing both branch ``Q`` and join
``S``. When threads diverge at ``Q``, one subset ``M`` continues
inside cycle ``C'``, while the complement ``N`` exits ``C'`` and
reaches ``R``. Dynamic instances of ``S`` executed by threads in set
``M`` are not converged with those executed in set ``N`` due to the
presence of ``R``. Informally, threads that diverge at ``Q``
reconverge in the same iteration of the outer cycle ``C``, but they
may have executed the inner cycle ``C'`` differently.
.. table::
:align: left
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread1 | Entry | P1 | Q1 | | | | R1 | S1 | P3 | ... | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread2 | Entry | P2 | Q2 | S2 | P4 | Q4 | R2 | S4 | | | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
In the table above, ``S2`` is not converged with ``S1`` due to ``R1``.
|
- If ``R`` does not exist, or if any node other than ``R`` is the
header of ``C``, then no such child cycle ``C'`` is detected.
Threads that diverge at ``Q`` execute converged dynamic instances of
``S`` since they do not encounter the cycle header on any path from
``Q`` to ``S``. Informally, threads that diverge at ``Q``
reconverge at ``S`` in the same iteration of ``C``.
.. table::
:align: left
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread1 | Entry | P1 | Q1 | R1 | S1 | P3 | Q3 | R3 | S3 | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread2 | Entry | P2 | Q2 | | S2 | P4 | Q4 | R2 | S4 | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+------+
|
.. note::
In general, the cycle ``C`` in the above statements is not
expected to be the same cycle for different headers. Cycles and
their headers are tightly coupled; for different headers in the
same outermost cycle, the child cycles detected may be different.
The property relevant to the above examples is that for every
closed path, there is a cycle ``C`` that contains the path and
whose header is on that path.
The diverged entry criterion must be checked for every closed path
passing through a divergent branch ``B`` and its join ``J``. Since
:ref:`every closed path passes through the header of some
cycle<cycle-closed-path-header>`, this amounts to checking every cycle
``C`` that contains ``B`` and ``J``. When the header of ``C``
dominates the join ``J``, there can be no entry to any path from the
header to ``J``, which includes any diverged path from ``B`` to ``J``.
This is also true for any closed paths passing through the header of
an outer cycle that contains ``C``.
Thus, the diverged entry criterion can be conservatively simplified
as follows:
For a divergent branch ``B`` and its join node ``J``, the nodes in a
cycle ``C`` that contains both ``B`` and ``J`` are m-converged only
if:
- ``B`` strictly dominates ``J``, or,
- The header ``H`` of ``C`` strictly dominates ``J``, or,
- Recursively, there is a cycle ``C'`` inside ``C`` that satisfies the
same condition.
When ``J`` is the same as ``H`` or ``B``, the trivial dominance is
insufficient to make any statement about entries to diverged paths.
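A sketch of this simplified check as a recursive walk over the cycle
hierarchy; ``Block``, ``Cycle`` and ``strictlyDominates`` are hypothetical
stand-ins, and the patch implements the equivalent logic inside the
analysis template:

.. code-block:: c++

   // Sketch: conservative diverged-entry check for a divergent branch B
   // with join J, examined for a cycle C containing both. Note that
   // strict dominance already excludes the cases J == B and J == H.
   bool divergedEntryCriterion(const Block *B, const Block *J,
                               const Cycle *C) {
     if (strictlyDominates(B, J))
       return true; // no entry can lie on a diverged path from B to J
     if (strictlyDominates(C->getHeader(), J))
       return true; // no entry to any path from the header to J
     for (const Cycle *Inner : C->children())
       if (Inner->contains(B) && Inner->contains(J) &&
           divergedEntryCriterion(B, J, Inner))
         return true; // a nested cycle satisfies the same condition
     return false;    // conservatively report not m-converged
   }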
.. _convergence-diverged-outside:
Diverged Paths reaching a Cycle
-------------------------------
.. figure:: convergence-divergent-outside.png
:name: convergence-divergent-outside
The figure shows two cycle hierarchies with a divergent branch in
``Entry`` instead of ``Q``. For two threads that enter the closed path
``P -> Q -> R -> S`` at ``P`` and ``R`` respectively, the convergence
of dynamic instances generated along the path depends on whether ``P``
or ``R`` is the header.
- Convergence when ``P`` is the header.
.. table::
:align: left
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread1 | Entry | | | | P1 | Q1 | R1 | S1 | P3 | Q3 | | S3 | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread2 | Entry | | R2 | S2 | P2 | Q2 | | S5 | P4 | Q4 | R3 | S4 | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|
- Convergence when ``R`` is the header.
.. table::
:align: left
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread1 | Entry | | P1 | Q1 | R1 | S1 | P3 | Q3 | S3 | | | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Thread2 | Entry | | | | R2 | S2 | P2 | Q2 | S4 | P4 | ... | Exit |
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|
Thus, when diverged paths reach different entries of an irreducible
cycle from outside the cycle, the static analysis conservatively
reports every node in the cycle as not m-converged.
.. _convergence-reducible-cycle:
Reducible Cycle
---------------
If ``C`` is a reducible cycle with header ``H``, then in any DFS,
``H`` :ref:`must be the header of some cycle<cycle-reducible-headers>`
``C'`` that contains ``C``. Independent of the DFS, there is no entry
to the subgraph ``C`` other than ``H`` itself. Thus, we have the
following:
1. The diverged entry criterion is trivially satisfied for a divergent
branch and its join, where both are inside subgraph ``C``.
2. When diverged paths reach the subgraph ``C`` from outside, their
convergence is always determined by the same header ``H``.
Clearly, this can be determined only in a cycle hierarchy ``T`` where
``C`` is detected as a reducible cycle. No such conclusion can be made
in a different cycle hierarchy ``T'`` where ``C`` is part of a larger
cycle ``C'`` with the same header, but this does not contradict the
conclusion in ``T``.

llvm/docs/CycleTerminology.rst

@@ -7,6 +7,8 @@ LLVM Cycle Terminology
.. contents::
:local:
.. _cycle-definition:
Cycles
======
@@ -53,6 +55,11 @@ A cycle C is said to be the *parent* of a basic block B if B occurs in
C but not in any child cycle of C. Then B is also said to be a *child*
of cycle C.
.. _cycle-toplevel-block:
A block B is said to be a *top-level block* if it is not the child of
any cycle.
.. _cycle-sibling:
A basic block or cycle X is a *sibling* of another basic block or
@@ -194,6 +201,9 @@ A *closed path* in a CFG is a connected sequence of nodes and edges in
the CFG whose start and end nodes are the same, and whose remaining
(inner) nodes are distinct.
An *entry* to a closed path ``P`` is a node on ``P`` that is reachable
from the function entry without passing through any other node on ``P``.
1. If a node D dominates one or more nodes in a closed path P and P
does not contain D, then D dominates every node in P.
@@ -225,3 +235,31 @@ the CFG whose start and end nodes are the same, and whose remaining
are the same cycle, or one of them is nested inside the other.
Hence there is always a cycle that contains U1 and U2 but neither
of D1 and D2.
.. _cycle-closed-path-header:
4. In any cycle hierarchy, the header ``H`` of the smallest cycle
``C`` containing a closed path ``P`` itself lies on ``P``.
**Proof:** If ``H`` is not in ``P``, then there is a smaller cycle
``C'`` in the set ``C - H`` containing ``P``, thus contradicting
the claim that ``C`` is the smallest such cycle.
.. _cycle-reducible-headers:
Reducible Cycle Headers
=======================
Although the cycle hierarchy depends on the DFS chosen, reducible
cycles satisfy the following invariant:
If a reducible cycle ``C`` with header ``H`` is discovered in any
DFS, then there exists a cycle ``C'`` in every DFS with header
``H``, that contains ``C``.
**Proof:** For a closed path ``P`` in ``C`` that passes through ``H``,
every cycle hierarchy has a smallest cycle ``C'`` containing ``P`` and
whose header is in ``P``. Since ``H`` is the only entry to ``P``,
``H`` must be the header of ``C'``. Since headers uniquely define
cycles, ``C'`` contains every such closed path ``P``, and hence ``C'``
contains ``C``.

llvm/docs/Reference.rst

@@ -15,6 +15,7 @@ LLVM and API reference documentation.
BranchWeightMetadata
Bugpoint
CommandGuide/index
ConvergenceAndUniformity
Coroutines
DependenceGraphs/index
ExceptionHandling
@@ -219,3 +220,7 @@ Additional Topics
:doc:`YamlIO`
A reference guide for using LLVM's YAML I/O library.
:doc:`ConvergenceAndUniformity`
A description of uniformity analysis in the presence of irreducible
control flow, and its implementation.

Binary files not shown: five new PNG figures under llvm/docs (the
convergence-*.png images referenced above), 23, 24, 48, 48 and 17 KiB.

llvm/include/llvm/ADT/GenericCycleInfo.h

@@ -105,7 +105,9 @@ public:
}
/// \brief Return whether \p Block is an entry block of the cycle.
bool isEntry(BlockT *Block) const { return is_contained(Entries, Block); }
bool isEntry(const BlockT *Block) const {
return is_contained(Entries, Block);
}
/// \brief Return whether \p Block is contained in the cycle.
bool contains(const BlockT *Block) const {

llvm/include/llvm/ADT/GenericUniformityImpl.h (file diff suppressed because it is too large)

llvm/include/llvm/ADT/GenericUniformityInfo.h

@@ -0,0 +1,79 @@
//===- GenericUniformityInfo.h ---------------------------*- C++ -*--------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_ADT_GENERICUNIFORMITYINFO_H
#define LLVM_ADT_GENERICUNIFORMITYINFO_H
// #include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/GenericCycleInfo.h"
// #include "llvm/ADT/SmallPtrSet.h"
// #include "llvm/ADT/Uniformity.h"
// #include "llvm/Analysis/LegacyDivergenceAnalysis.h"
#include "llvm/Support/raw_ostream.h"
namespace llvm {
class TargetTransformInfo;
template <typename ContextT> class GenericUniformityAnalysisImpl;
template <typename ContextT> class GenericUniformityInfo {
public:
using BlockT = typename ContextT::BlockT;
using FunctionT = typename ContextT::FunctionT;
using ValueRefT = typename ContextT::ValueRefT;
using ConstValueRefT = typename ContextT::ConstValueRefT;
using InstructionT = typename ContextT::InstructionT;
using DominatorTreeT = typename ContextT::DominatorTreeT;
using ThisT = GenericUniformityInfo<ContextT>;
using CycleInfoT = GenericCycleInfo<ContextT>;
using CycleT = typename CycleInfoT::CycleT;
GenericUniformityInfo(FunctionT &F, const DominatorTreeT &DT,
const CycleInfoT &CI,
const TargetTransformInfo *TTI = nullptr);
GenericUniformityInfo() = default;
GenericUniformityInfo(GenericUniformityInfo &&) = default;
GenericUniformityInfo &operator=(GenericUniformityInfo &&) = default;
/// Whether any divergence was detected.
bool hasDivergence() const;
/// The GPU kernel this analysis result is for
const FunctionT &getFunction() const { return *F; }
/// Whether \p V is divergent at its definition.
bool isDivergent(ConstValueRefT V) const;
/// Whether \p V is uniform/non-divergent.
bool isUniform(ConstValueRefT V) const { return !isDivergent(V); }
bool hasDivergentTerminator(const BlockT &B);
void print(raw_ostream &Out) const;
private:
using ImplT = GenericUniformityAnalysisImpl<ContextT>;
struct ImplDeleter {
void operator()(GenericUniformityAnalysisImpl<ContextT> *Impl);
};
FunctionT *F;
std::unique_ptr<ImplT, ImplDeleter> DA;
GenericUniformityInfo(const GenericUniformityInfo &) = delete;
GenericUniformityInfo &operator=(const GenericUniformityInfo &) = delete;
};
} // namespace llvm
#endif // LLVM_ADT_GENERICUNIFORMITYINFO_H
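As an illustration of the intended use of this interface (a sketch, not
code from this patch), a transform might consult the LLVM IR instantiation
declared in UniformityAnalysis.h below before rewriting control flow:

// Sketch: guard a CFG transform on branch uniformity.
void tryTransform(const UniformityInfo &UI, const BranchInst &Br) {
  if (!Br.isConditional())
    return;
  if (UI.isUniform(Br.getCondition())) {
    // All threads in a group take the same side of this branch, so the
    // linearization constraints for divergent branches do not apply.
  }
}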

llvm/include/llvm/ADT/Uniformity.h

@@ -0,0 +1,33 @@
//===- Uniformity.h --------------------------------------*- C++ -*--------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_ADT_UNIFORMITY_H
#define LLVM_ADT_UNIFORMITY_H
namespace llvm {
/// Enum describing how instructions behave with respect to uniformity and
/// divergence, to answer the question: if the same instruction is executed by
/// two threads in a convergent set of threads, will its result value(s) be
/// uniform, i.e. the same on both threads?
enum class InstructionUniformity {
/// The result values are uniform if and only if all operands are uniform.
Default,
/// The result values are always uniform.
AlwaysUniform,
/// The result values can never be assumed to be uniform.
NeverUniform
};
} // namespace llvm
#endif // LLVM_ADT_UNIFORMITY_H

llvm/include/llvm/Analysis/UniformityAnalysis.h

@@ -0,0 +1,78 @@
//===- UniformityAnalysis.h ---------------------*- C++ -*-----------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
/// \file
/// \brief Convergence info and convergence-aware uniform info for LLVM IR
///
/// This differs from traditional divergence analysis by taking convergence
/// intrinsics into account.
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_ANALYSIS_UNIFORMITYANALYSIS_H
#define LLVM_ANALYSIS_UNIFORMITYANALYSIS_H
#include "llvm/ADT/GenericUniformityInfo.h"
#include "llvm/Analysis/CycleAnalysis.h"
namespace llvm {
extern template class GenericUniformityInfo<SSAContext>;
using UniformityInfo = GenericUniformityInfo<SSAContext>;
/// Analysis pass which computes \ref UniformityInfo.
class UniformityInfoAnalysis
: public AnalysisInfoMixin<UniformityInfoAnalysis> {
friend AnalysisInfoMixin<UniformityInfoAnalysis>;
static AnalysisKey Key;
public:
/// Provide the result typedef for this analysis pass.
using Result = UniformityInfo;
/// Run the analysis pass over a function and produce uniformity info.
UniformityInfo run(Function &F, FunctionAnalysisManager &);
// TODO: verify analysis
};
/// Printer pass for the \c UniformityInfo.
class UniformityInfoPrinterPass
: public PassInfoMixin<UniformityInfoPrinterPass> {
raw_ostream &OS;
public:
explicit UniformityInfoPrinterPass(raw_ostream &OS);
PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
};
/// Legacy analysis pass which computes a \ref UniformityInfo.
class UniformityInfoWrapperPass : public FunctionPass {
Function *m_function = nullptr;
UniformityInfo m_uniformityInfo;
public:
static char ID;
UniformityInfoWrapperPass();
UniformityInfo &getUniformityInfo() { return m_uniformityInfo; }
const UniformityInfo &getUniformityInfo() const { return m_uniformityInfo; }
bool runOnFunction(Function &F) override;
void getAnalysisUsage(AnalysisUsage &AU) const override;
void releaseMemory() override;
void print(raw_ostream &OS, const Module *M = nullptr) const override;
// TODO: verify analysis
};
} // namespace llvm
#endif // LLVM_ANALYSIS_UNIFORMITYANALYSIS_H
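For example, a new-pass-manager pass could consume this analysis like any
other function analysis (a sketch; MyPass is a hypothetical pass):

PreservedAnalyses MyPass::run(Function &F, FunctionAnalysisManager &FAM) {
  const UniformityInfo &UI = FAM.getResult<UniformityInfoAnalysis>(F);
  if (UI.hasDivergence()) {
    // e.g. bail out of a transform that is only valid when the branches
    // it rewrites are uniform.
    return PreservedAnalyses::all();
  }
  // ... transform F ...
  return PreservedAnalyses::all();
}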

llvm/include/llvm/CodeGen/MachineCycleAnalysis.h

@@ -15,9 +15,8 @@
#define LLVM_CODEGEN_MACHINECYCLEANALYSIS_H
#include "llvm/ADT/GenericCycleInfo.h"
#include "llvm/CodeGen/MachineSSAContext.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/InitializePasses.h"
#include "llvm/CodeGen/MachineSSAContext.h"
namespace llvm {

llvm/include/llvm/CodeGen/MachinePassRegistry.def

@@ -205,4 +205,6 @@ DUMMY_MACHINE_FUNCTION_PASS("reset-machine-function", ResetMachineFunctionPass,
DUMMY_MACHINE_FUNCTION_PASS("machineverifier", MachineVerifierPass, ())
DUMMY_MACHINE_FUNCTION_PASS("print-machine-cycles", MachineCycleInfoPrinterPass, ())
DUMMY_MACHINE_FUNCTION_PASS("machine-sanmd", MachineSanitizerBinaryMetadata, ())
DUMMY_MACHINE_FUNCTION_PASS("machine-uniformity", MachineUniformityInfoWrapperPass, ())
DUMMY_MACHINE_FUNCTION_PASS("print-machine-uniformity", MachineUniformityInfoPrinterPass, ())
#undef DUMMY_MACHINE_FUNCTION_PASS

llvm/include/llvm/CodeGen/MachineSSAContext.h

@@ -26,10 +26,17 @@ class Register;
template <typename _FunctionT> class GenericSSAContext;
template <typename, bool> class DominatorTreeBase;
inline auto successors(MachineBasicBlock *BB) { return BB->successors(); }
inline auto predecessors(MachineBasicBlock *BB) { return BB->predecessors(); }
inline unsigned succ_size(MachineBasicBlock *BB) { return BB->succ_size(); }
inline unsigned pred_size(MachineBasicBlock *BB) { return BB->pred_size(); }
inline auto successors(const MachineBasicBlock *BB) { return BB->successors(); }
inline auto predecessors(const MachineBasicBlock *BB) {
return BB->predecessors();
}
inline unsigned succ_size(const MachineBasicBlock *BB) {
return BB->succ_size();
}
inline unsigned pred_size(const MachineBasicBlock *BB) {
return BB->pred_size();
}
inline auto instrs(const MachineBasicBlock &BB) { return BB.instrs(); }
template <> class GenericSSAContext<MachineFunction> {
const MachineRegisterInfo *RegInfo = nullptr;
@@ -40,15 +47,25 @@ public:
using FunctionT = MachineFunction;
using InstructionT = MachineInstr;
using ValueRefT = Register;
using ConstValueRefT = Register;
static const Register ValueRefNull;
using DominatorTreeT = DominatorTreeBase<BlockT, false>;
static MachineBasicBlock *getEntryBlock(MachineFunction &F);
void setFunction(MachineFunction &Fn);
MachineFunction *getFunction() const { return MF; }
Printable print(MachineBasicBlock *Block) const;
Printable print(MachineInstr *Inst) const;
static MachineBasicBlock *getEntryBlock(MachineFunction &F);
static void appendBlockDefs(SmallVectorImpl<Register> &defs,
const MachineBasicBlock &block);
static void appendBlockTerms(SmallVectorImpl<MachineInstr *> &terms,
MachineBasicBlock &block);
static void appendBlockTerms(SmallVectorImpl<const MachineInstr *> &terms,
const MachineBasicBlock &block);
MachineBasicBlock *getDefBlock(Register) const;
static bool isConstantValuePhi(const MachineInstr &Phi);
Printable print(const MachineBasicBlock *Block) const;
Printable print(const MachineInstr *Inst) const;
Printable print(Register Value) const;
};

llvm/include/llvm/CodeGen/MachineUniformityAnalysis.h

@@ -0,0 +1,35 @@
//===- MachineUniformityAnalysis.h ---------------------------*- C++ -*----===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
/// \file
/// \brief Uniformity analysis for Machine IR
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_CODEGEN_MACHINEUNIFORMITYANALYSIS_H
#define LLVM_CODEGEN_MACHINEUNIFORMITYANALYSIS_H
#include "llvm/ADT/GenericUniformityInfo.h"
#include "llvm/CodeGen/MachineCycleAnalysis.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineSSAContext.h"
namespace llvm {
extern template class GenericUniformityInfo<MachineSSAContext>;
using MachineUniformityInfo = GenericUniformityInfo<MachineSSAContext>;
/// \brief Compute the uniform information of a Machine IR function.
MachineUniformityInfo
computeMachineUniformityInfo(MachineFunction &F,
const MachineCycleInfo &cycleInfo,
const MachineDomTree &domTree);
} // namespace llvm
#endif // LLVM_CODEGEN_MACHINEUNIFORMITYANALYSIS_H
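A sketch of calling this helper from a legacy machine function pass
(MyMIRPass is hypothetical; the MIR must still be in SSA form, and the
pass must declare MachineCycleInfoWrapperPass and MachineDominatorTree in
its getAnalysisUsage):

bool MyMIRPass::runOnMachineFunction(MachineFunction &MF) {
  const MachineCycleInfo &CI =
      getAnalysis<MachineCycleInfoWrapperPass>().getCycleInfo();
  const MachineDomTree &DT = getAnalysis<MachineDominatorTree>().getBase();
  MachineUniformityInfo UI = computeMachineUniformityInfo(MF, CI, DT);
  return false; // analysis only; MF is not modified
}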

llvm/include/llvm/CodeGen/TargetInstrInfo.h

@@ -16,6 +16,8 @@
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseMapInfo.h"
#include "llvm/ADT/None.h"
#include "llvm/ADT/Uniformity.h"
#include "llvm/CodeGen/MIRFormatter.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
@@ -2053,6 +2055,12 @@
return MI.getOperand(0);
}
/// Return the uniformity behavior of the given instruction.
virtual InstructionUniformity
getInstructionUniformity(const MachineInstr &MI) const {
return InstructionUniformity::Default;
}
private:
mutable std::unique_ptr<MIRFormatter> Formatter;
unsigned CallFrameSetupOpcode, CallFrameDestroyOpcode;
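A target would override this hook to classify its own opcodes; a minimal
hypothetical sketch (the target and opcode names are invented for
illustration):

InstructionUniformity
MyTargetInstrInfo::getInstructionUniformity(const MachineInstr &MI) const {
  switch (MI.getOpcode()) {
  case MyTarget::READ_LANE_ID: // produces a per-lane value
    return InstructionUniformity::NeverUniform;
  case MyTarget::READ_SCALAR:  // one value shared by the whole group
    return InstructionUniformity::AlwaysUniform;
  default:
    return InstructionUniformity::Default; // uniform iff operands are
  }
}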

llvm/include/llvm/IR/SSAContext.h

@@ -15,15 +15,24 @@
#ifndef LLVM_IR_SSACONTEXT_H
#define LLVM_IR_SSACONTEXT_H
#include "llvm/ADT/GenericSSAContext.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/ModuleSlotTracker.h"
#include "llvm/Support/Printable.h"
#include <memory>
namespace llvm {
class BasicBlock;
class Function;
class Instruction;
class Value;
template <typename> class SmallVectorImpl;
template <typename, bool> class DominatorTreeBase;
template <typename _FunctionT> class GenericSSAContext;
inline auto instrs(const BasicBlock &BB) {
return llvm::make_range(BB.begin(), BB.end());
}
template <> class GenericSSAContext<Function> {
Function *F;
@@ -33,16 +42,33 @@ public:
using FunctionT = Function;
using InstructionT = Instruction;
using ValueRefT = Value *;
using ConstValueRefT = const Value *;
static Value *ValueRefNull;
using DominatorTreeT = DominatorTreeBase<BlockT, false>;
static BasicBlock *getEntryBlock(Function &F);
void setFunction(Function &Fn);
Function *getFunction() const { return F; }
Printable print(BasicBlock *Block) const;
Printable print(Instruction *Inst) const;
Printable print(Value *Value) const;
static BasicBlock *getEntryBlock(Function &F);
static const BasicBlock *getEntryBlock(const Function &F);
static void appendBlockDefs(SmallVectorImpl<Value *> &defs,
BasicBlock &block);
static void appendBlockDefs(SmallVectorImpl<const Value *> &defs,
const BasicBlock &block);
static void appendBlockTerms(SmallVectorImpl<Instruction *> &terms,
BasicBlock &block);
static void appendBlockTerms(SmallVectorImpl<const Instruction *> &terms,
const BasicBlock &block);
static bool comesBefore(const Instruction *lhs, const Instruction *rhs);
static bool isConstantValuePhi(const Instruction &Instr);
const BasicBlock *getDefBlock(const Value *value) const;
Printable print(const BasicBlock *Block) const;
Printable print(const Instruction *Inst) const;
Printable print(const Value *Value) const;
};
using SSAContext = GenericSSAContext<Function>;

llvm/include/llvm/InitializePasses.h

@@ -279,6 +279,8 @@ void initializeMachineSanitizerBinaryMetadataPass(PassRegistry &);
void initializeMachineSchedulerPass(PassRegistry&);
void initializeMachineSinkingPass(PassRegistry&);
void initializeMachineTraceMetricsPass(PassRegistry&);
void initializeMachineUniformityInfoPrinterPassPass(PassRegistry &);
void initializeMachineUniformityAnalysisPassPass(PassRegistry &);
void initializeMachineVerifierPassPass(PassRegistry&);
void initializeMemCpyOptLegacyPassPass(PassRegistry&);
void initializeMemDepPrinterPass(PassRegistry&);
@@ -402,6 +404,7 @@ void initializeTLSVariableHoistLegacyPassPass(PassRegistry &);
void initializeTwoAddressInstructionPassPass(PassRegistry&);
void initializeTypeBasedAAWrapperPassPass(PassRegistry&);
void initializeTypePromotionPass(PassRegistry&);
void initializeUniformityInfoWrapperPassPass(PassRegistry &);
void initializeUnifyFunctionExitNodesLegacyPassPass(PassRegistry &);
void initializeUnifyLoopExitsLegacyPassPass(PassRegistry &);
void initializeUnpackMachineBundlesPass(PassRegistry&);

llvm/lib/Analysis/CMakeLists.txt

@@ -142,6 +142,7 @@ add_llvm_component_library(LLVMAnalysis
TrainingLogger.cpp
TypeBasedAliasAnalysis.cpp
TypeMetadataUtils.cpp
UniformityAnalysis.cpp
ScopedNoAliasAA.cpp
ValueLattice.cpp
ValueLatticeUtils.cpp

llvm/lib/Analysis/UniformityAnalysis.cpp

@@ -0,0 +1,159 @@
//===- UniformityAnalysis.cpp ---------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#include "llvm/Analysis/UniformityAnalysis.h"
#include "llvm/ADT/GenericUniformityImpl.h"
#include "llvm/Analysis/CycleAnalysis.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/InitializePasses.h"
using namespace llvm;
template <>
bool llvm::GenericUniformityAnalysisImpl<SSAContext>::hasDivergentDefs(
const Instruction &I) const {
return isDivergent((const Value *)&I);
}
template <>
bool llvm::GenericUniformityAnalysisImpl<SSAContext>::markDefsDivergent(
const Instruction &Instr, bool AllDefsDivergent) {
return markDivergent(&Instr);
}
template <> void llvm::GenericUniformityAnalysisImpl<SSAContext>::initialize() {
for (auto &I : instructions(F)) {
if (TTI->isSourceOfDivergence(&I)) {
assert(!I.isTerminator());
markDivergent(I);
} else if (TTI->isAlwaysUniform(&I)) {
addUniformOverride(I);
}
}
for (auto &Arg : F.args()) {
if (TTI->isSourceOfDivergence(&Arg)) {
markDivergent(&Arg);
}
}
}
template <>
void llvm::GenericUniformityAnalysisImpl<SSAContext>::pushUsers(
const Value *V) {
for (const auto *User : V->users()) {
const auto *UserInstr = dyn_cast<const Instruction>(User);
if (!UserInstr)
continue;
if (isAlwaysUniform(*UserInstr))
continue;
if (markDivergent(*UserInstr)) {
Worklist.push_back(UserInstr);
}
}
}
template <>
void llvm::GenericUniformityAnalysisImpl<SSAContext>::pushUsers(
const Instruction &Instr) {
assert(!isAlwaysUniform(Instr));
if (Instr.isTerminator())
return;
pushUsers(cast<Value>(&Instr));
}
template <>
bool llvm::GenericUniformityAnalysisImpl<SSAContext>::usesValueFromCycle(
const Instruction &I, const Cycle &DefCycle) const {
if (isAlwaysUniform(I))
return false;
for (const Use &U : I.operands()) {
if (auto *I = dyn_cast<Instruction>(&U)) {
if (DefCycle.contains(I->getParent()))
return true;
}
}
return false;
}
// This ensures explicit instantiation of
// GenericUniformityAnalysisImpl::ImplDeleter::operator()
template class llvm::GenericUniformityInfo<SSAContext>;
//===----------------------------------------------------------------------===//
// UniformityInfoAnalysis and related pass implementations
//===----------------------------------------------------------------------===//
llvm::UniformityInfo UniformityInfoAnalysis::run(Function &F,
FunctionAnalysisManager &FAM) {
auto &DT = FAM.getResult<DominatorTreeAnalysis>(F);
auto &TTI = FAM.getResult<TargetIRAnalysis>(F);
auto &CI = FAM.getResult<CycleAnalysis>(F);
return UniformityInfo{F, DT, CI, &TTI};
}
AnalysisKey UniformityInfoAnalysis::Key;
UniformityInfoPrinterPass::UniformityInfoPrinterPass(raw_ostream &OS)
: OS(OS) {}
PreservedAnalyses UniformityInfoPrinterPass::run(Function &F,
FunctionAnalysisManager &AM) {
OS << "UniformityInfo for function '" << F.getName() << "':\n";
AM.getResult<UniformityInfoAnalysis>(F).print(OS);
return PreservedAnalyses::all();
}
//===----------------------------------------------------------------------===//
// UniformityInfoWrapperPass Implementation
//===----------------------------------------------------------------------===//
char UniformityInfoWrapperPass::ID = 0;
UniformityInfoWrapperPass::UniformityInfoWrapperPass() : FunctionPass(ID) {
initializeUniformityInfoWrapperPassPass(*PassRegistry::getPassRegistry());
}
INITIALIZE_PASS_BEGIN(UniformityInfoWrapperPass, "uniforminfo",
"Uniform Info Analysis", true, true)
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_DEPENDENCY(CycleInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
INITIALIZE_PASS_END(UniformityInfoWrapperPass, "uniforminfo",
"Uniform Info Analysis", true, true)
void UniformityInfoWrapperPass::getAnalysisUsage(AnalysisUsage &AU) const {
AU.setPreservesAll();
AU.addRequired<DominatorTreeWrapperPass>();
AU.addRequired<CycleInfoWrapperPass>();
AU.addRequired<TargetTransformInfoWrapperPass>();
}
bool UniformityInfoWrapperPass::runOnFunction(Function &F) {
auto &cycleInfo = getAnalysis<CycleInfoWrapperPass>().getResult();
auto &domTree = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
auto &targetTransformInfo =
getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
m_function = &F;
m_uniformityInfo =
UniformityInfo{F, domTree, cycleInfo, &targetTransformInfo};
return false;
}
void UniformityInfoWrapperPass::print(raw_ostream &OS, const Module *) const {
OS << "UniformityInfo for function '" << m_function->getName() << "':\n";
}
void UniformityInfoWrapperPass::releaseMemory() {
m_uniformityInfo = UniformityInfo{};
m_function = nullptr;
}

llvm/lib/CodeGen/CMakeLists.txt

@@ -142,6 +142,7 @@ add_llvm_component_library(LLVMCodeGen
MachineSSAUpdater.cpp
MachineStripDebug.cpp
MachineTraceMetrics.cpp
MachineUniformityAnalysis.cpp
MachineVerifier.cpp
MIRFSDiscriminator.cpp
MIRSampleProfile.cpp

llvm/lib/CodeGen/CodeGen.cpp

@@ -92,6 +92,8 @@ void llvm::initializeCodeGen(PassRegistry &Registry) {
initializeMachineRegionInfoPassPass(Registry);
initializeMachineSchedulerPass(Registry);
initializeMachineSinkingPass(Registry);
initializeMachineUniformityAnalysisPassPass(Registry);
initializeMachineUniformityInfoPrinterPassPass(Registry);
initializeMachineVerifierPassPass(Registry);
initializeObjCARCContractLegacyPassPass(Registry);
initializeOptimizePHIsPass(Registry);

llvm/lib/CodeGen/MachineCycleAnalysis.cpp

@@ -9,8 +9,10 @@
#include "llvm/CodeGen/MachineCycleAnalysis.h"
#include "llvm/ADT/GenericCycleImpl.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/MachineSSAContext.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/InitializePasses.h"
using namespace llvm;

llvm/lib/CodeGen/MachineSSAContext.cpp

@@ -21,20 +21,52 @@
using namespace llvm;
MachineBasicBlock *MachineSSAContext::getEntryBlock(MachineFunction &F) {
return &F.front();
}
const Register MachineSSAContext::ValueRefNull{};
void MachineSSAContext::setFunction(MachineFunction &Fn) {
MF = &Fn;
RegInfo = &MF->getRegInfo();
}
Printable MachineSSAContext::print(MachineBasicBlock *Block) const {
MachineBasicBlock *MachineSSAContext::getEntryBlock(MachineFunction &F) {
return &F.front();
}
void MachineSSAContext::appendBlockTerms(
SmallVectorImpl<const MachineInstr *> &terms,
const MachineBasicBlock &block) {
for (auto &T : block.terminators())
terms.push_back(&T);
}
void MachineSSAContext::appendBlockDefs(SmallVectorImpl<Register> &defs,
const MachineBasicBlock &block) {
for (const MachineInstr &instr : block.instrs()) {
for (const MachineOperand &op : instr.operands()) {
if (op.isReg() && op.isDef())
defs.push_back(op.getReg());
}
}
}
/// Get the defining block of a value.
MachineBasicBlock *MachineSSAContext::getDefBlock(Register value) const {
if (!value)
return nullptr;
return RegInfo->getVRegDef(value)->getParent();
}
bool MachineSSAContext::isConstantValuePhi(const MachineInstr &Phi) {
return Phi.isConstantValuePHI();
}
Printable MachineSSAContext::print(const MachineBasicBlock *Block) const {
if (!Block)
return Printable([](raw_ostream &Out) { Out << "<nullptr>"; });
return Printable([Block](raw_ostream &Out) { Block->printName(Out); });
}
Printable MachineSSAContext::print(MachineInstr *I) const {
Printable MachineSSAContext::print(const MachineInstr *I) const {
return Printable([I](raw_ostream &Out) { I->print(Out); });
}

llvm/lib/CodeGen/MachineUniformityAnalysis.cpp

@@ -0,0 +1,222 @@
//===- MachineUniformityAnalysis.cpp --------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#include "llvm/CodeGen/MachineUniformityAnalysis.h"
#include "llvm/ADT/GenericUniformityImpl.h"
#include "llvm/CodeGen/MachineCycleAnalysis.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/MachineSSAContext.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/InitializePasses.h"
using namespace llvm;
template <>
bool llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::hasDivergentDefs(
const MachineInstr &I) const {
for (auto &op : I.operands()) {
if (!op.isReg() || !op.isDef())
continue;
if (isDivergent(op.getReg()))
return true;
}
return false;
}
template <>
bool llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::markDefsDivergent(
const MachineInstr &Instr, bool AllDefsDivergent) {
bool insertedDivergent = false;
const auto &MRI = F.getRegInfo();
const auto &TRI = *MRI.getTargetRegisterInfo();
for (auto &op : Instr.operands()) {
if (!op.isReg() || !op.isDef())
continue;
if (!op.getReg().isVirtual())
continue;
assert(!op.getSubReg());
if (!AllDefsDivergent) {
auto *RC = MRI.getRegClassOrNull(op.getReg());
if (RC && !TRI.isDivergentRegClass(RC))
continue;
}
insertedDivergent |= markDivergent(op.getReg());
}
return insertedDivergent;
}
template <>
void llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::initialize() {
const auto &InstrInfo = *F.getSubtarget().getInstrInfo();
for (const MachineBasicBlock &block : F) {
for (const MachineInstr &instr : block) {
auto uniformity = InstrInfo.getInstructionUniformity(instr);
if (uniformity == InstructionUniformity::AlwaysUniform) {
addUniformOverride(instr);
continue;
}
if (uniformity == InstructionUniformity::NeverUniform) {
markDefsDivergent(instr, /* AllDefsDivergent = */ false);
}
}
}
}
template <>
void llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::pushUsers(
Register Reg) {
const auto &RegInfo = F.getRegInfo();
for (MachineInstr &UserInstr : RegInfo.use_instructions(Reg)) {
if (isAlwaysUniform(UserInstr))
continue;
if (markDivergent(UserInstr))
Worklist.push_back(&UserInstr);
}
}
template <>
void llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::pushUsers(
const MachineInstr &Instr) {
assert(!isAlwaysUniform(Instr));
if (Instr.isTerminator())
return;
for (const MachineOperand &op : Instr.operands()) {
if (op.isReg() && op.isDef() && op.getReg().isVirtual())
pushUsers(op.getReg());
}
}
template <>
bool llvm::GenericUniformityAnalysisImpl<MachineSSAContext>::usesValueFromCycle(
const MachineInstr &I, const MachineCycle &DefCycle) const {
assert(!isAlwaysUniform(I));
for (auto &Op : I.operands()) {
if (!Op.isReg() || !Op.readsReg())
continue;
auto Reg = Op.getReg();
assert(Reg.isVirtual());
auto *Def = F.getRegInfo().getVRegDef(Reg);
if (DefCycle.contains(Def->getParent()))
return true;
}
return false;
}
// This ensures explicit instantiation of
// GenericUniformityAnalysisImpl::ImplDeleter::operator()
template class llvm::GenericUniformityInfo<MachineSSAContext>;
MachineUniformityInfo
llvm::computeMachineUniformityInfo(MachineFunction &F,
const MachineCycleInfo &cycleInfo,
const MachineDomTree &domTree) {
auto &MRI = F.getRegInfo();
assert(MRI.isSSA() && "Expected to be run on SSA form!");
return MachineUniformityInfo(F, domTree, cycleInfo);
}
namespace {
/// Legacy analysis pass which computes a \ref MachineUniformityInfo.
class MachineUniformityAnalysisPass : public MachineFunctionPass {
MachineUniformityInfo UI;
public:
static char ID;
MachineUniformityAnalysisPass();
MachineUniformityInfo &getUniformityInfo() { return UI; }
const MachineUniformityInfo &getUniformityInfo() const { return UI; }
bool runOnMachineFunction(MachineFunction &F) override;
void getAnalysisUsage(AnalysisUsage &AU) const override;
void print(raw_ostream &OS, const Module *M = nullptr) const override;
// TODO: verify analysis
};
class MachineUniformityInfoPrinterPass : public MachineFunctionPass {
public:
static char ID;
MachineUniformityInfoPrinterPass();
bool runOnMachineFunction(MachineFunction &F) override;
void getAnalysisUsage(AnalysisUsage &AU) const override;
};
} // namespace
char MachineUniformityAnalysisPass::ID = 0;
MachineUniformityAnalysisPass::MachineUniformityAnalysisPass()
: MachineFunctionPass(ID) {
initializeMachineUniformityAnalysisPassPass(*PassRegistry::getPassRegistry());
}
INITIALIZE_PASS_BEGIN(MachineUniformityAnalysisPass, "machine-uniformity",
"Machine Uniformity Info Analysis", true, true)
INITIALIZE_PASS_DEPENDENCY(MachineCycleInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTree)
INITIALIZE_PASS_END(MachineUniformityAnalysisPass, "machine-uniformity",
"Machine Uniformity Info Analysis", true, true)
void MachineUniformityAnalysisPass::getAnalysisUsage(AnalysisUsage &AU) const {
AU.setPreservesAll();
AU.addRequired<MachineCycleInfoWrapperPass>();
AU.addRequired<MachineDominatorTree>();
MachineFunctionPass::getAnalysisUsage(AU);
}
bool MachineUniformityAnalysisPass::runOnMachineFunction(MachineFunction &MF) {
auto &DomTree = getAnalysis<MachineDominatorTree>().getBase();
auto &CI = getAnalysis<MachineCycleInfoWrapperPass>().getCycleInfo();
UI = computeMachineUniformityInfo(MF, CI, DomTree);
return false;
}
void MachineUniformityAnalysisPass::print(raw_ostream &OS,
const Module *) const {
OS << "MachineUniformityInfo for function: " << UI.getFunction().getName()
<< "\n";
UI.print(OS);
}
char MachineUniformityInfoPrinterPass::ID = 0;
MachineUniformityInfoPrinterPass::MachineUniformityInfoPrinterPass()
: MachineFunctionPass(ID) {
initializeMachineUniformityInfoPrinterPassPass(
*PassRegistry::getPassRegistry());
}
INITIALIZE_PASS_BEGIN(MachineUniformityInfoPrinterPass,
"print-machine-uniformity",
"Print Machine Uniformity Info Analysis", true, true)
INITIALIZE_PASS_DEPENDENCY(MachineUniformityAnalysisPass)
INITIALIZE_PASS_END(MachineUniformityInfoPrinterPass,
"print-machine-uniformity",
"Print Machine Uniformity Info Analysis", true, true)
void MachineUniformityInfoPrinterPass::getAnalysisUsage(
AnalysisUsage &AU) const {
AU.setPreservesAll();
AU.addRequired<MachineUniformityAnalysisPass>();
MachineFunctionPass::getAnalysisUsage(AU);
}
bool MachineUniformityInfoPrinterPass::runOnMachineFunction(
MachineFunction &F) {
auto &UI = getAnalysis<MachineUniformityAnalysisPass>();
UI.print(errs());
return false;
}
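
For downstream users, the result can be recomputed on demand through the public helper shown above. The following is a minimal sketch of a consumer pass; ExamplePass is illustrative and not part of this change, and it assumes MachineUniformityInfo exposes isDivergent(Register), mirroring the internal queries used by the implementation:

#include "llvm/CodeGen/MachineCycleAnalysis.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineUniformityAnalysis.h"
using namespace llvm;

namespace {
// Illustrative consumer: recompute uniformity the same way the legacy pass
// above does, then inspect divergence of virtual-register definitions.
class ExamplePass : public MachineFunctionPass {
public:
  static char ID;
  ExamplePass() : MachineFunctionPass(ID) {}
  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.setPreservesAll();
    AU.addRequired<MachineCycleInfoWrapperPass>();
    AU.addRequired<MachineDominatorTree>();
    MachineFunctionPass::getAnalysisUsage(AU);
  }
  bool runOnMachineFunction(MachineFunction &MF) override {
    auto &CI = getAnalysis<MachineCycleInfoWrapperPass>().getCycleInfo();
    auto &DomTree = getAnalysis<MachineDominatorTree>().getBase();
    MachineUniformityInfo UI = computeMachineUniformityInfo(MF, CI, DomTree);
    for (const MachineBasicBlock &MBB : MF)
      for (const MachineInstr &MI : MBB)
        for (const MachineOperand &Op : MI.operands())
          if (Op.isReg() && Op.isDef() && Op.getReg().isVirtual() &&
              UI.isDivergent(Op.getReg())) {
            // A divergent def needs per-lane storage (e.g. a VGPR on AMDGPU).
          }
    return false;
  }
};
} // namespace
char ExamplePass::ID = 0;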


@ -13,30 +13,85 @@
//===----------------------------------------------------------------------===//
#include "llvm/IR/SSAContext.h"
#include "llvm/IR/Argument.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/ModuleSlotTracker.h"
#include "llvm/IR/Value.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;
Value *SSAContext::ValueRefNull = nullptr;
void SSAContext::setFunction(Function &Fn) { F = &Fn; }
BasicBlock *SSAContext::getEntryBlock(Function &F) {
return &F.getEntryBlock();
}
const BasicBlock *SSAContext::getEntryBlock(const Function &F) {
return &F.getEntryBlock();
}
void SSAContext::appendBlockDefs(SmallVectorImpl<Value *> &defs,
BasicBlock &block) {
for (auto &instr : block.instructionsWithoutDebug(/*SkipPseudoOp=*/true)) {
if (instr.isTerminator())
break;
if (instr.getType()->isVoidTy())
continue;
auto *def = &instr;
defs.push_back(def);
}
}
void SSAContext::appendBlockDefs(SmallVectorImpl<const Value *> &defs,
const BasicBlock &block) {
for (auto &instr : block) {
if (instr.isTerminator())
break;
defs.push_back(&instr);
}
}
void SSAContext::appendBlockTerms(SmallVectorImpl<Instruction *> &terms,
BasicBlock &block) {
terms.push_back(block.getTerminator());
}
void SSAContext::appendBlockTerms(SmallVectorImpl<const Instruction *> &terms,
const BasicBlock &block) {
terms.push_back(block.getTerminator());
}
const BasicBlock *SSAContext::getDefBlock(const Value *value) const {
if (const auto *instruction = dyn_cast<Instruction>(value))
return instruction->getParent();
return nullptr;
}
bool SSAContext::comesBefore(const Instruction *lhs, const Instruction *rhs) {
return lhs->comesBefore(rhs);
}
bool SSAContext::isConstantValuePhi(const Instruction &Instr) {
if (auto *Phi = dyn_cast<PHINode>(&Instr))
return Phi->hasConstantValue();
return false;
}
Printable SSAContext::print(const Value *V) const {
return Printable([V](raw_ostream &Out) { V->print(Out); });
}
Printable SSAContext::print(const Instruction *Inst) const {
return print(cast<Value>(Inst));
}
Printable SSAContext::print(const BasicBlock *BB) const {
if (!BB)
return Printable([](raw_ostream &Out) { Out << "<nullptr>"; });
if (BB->hasName())
return Printable([BB](raw_ostream &Out) { Out << BB->getName(); });


@ -71,6 +71,7 @@
#include "llvm/Analysis/TargetLibraryInfo.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/TypeBasedAliasAnalysis.h"
#include "llvm/Analysis/UniformityAnalysis.h"
#include "llvm/IR/DebugInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/PassManager.h"


@ -235,6 +235,7 @@ FUNCTION_ANALYSIS("targetir",
FUNCTION_ANALYSIS("verify", VerifierAnalysis())
FUNCTION_ANALYSIS("pass-instrumentation", PassInstrumentationAnalysis(PIC))
FUNCTION_ANALYSIS("divergence", DivergenceAnalysis())
FUNCTION_ANALYSIS("uniformity", UniformityInfoAnalysis())
#ifndef FUNCTION_ALIAS_ANALYSIS
#define FUNCTION_ALIAS_ANALYSIS(NAME, CREATE_PASS) \
@ -362,6 +363,7 @@ FUNCTION_PASS("print-cfg-sccs", CFGSCCPrinterPass(dbgs()))
FUNCTION_PASS("print-predicateinfo", PredicateInfoPrinterPass(dbgs()))
FUNCTION_PASS("print-mustexecute", MustExecutePrinterPass(dbgs()))
FUNCTION_PASS("print-memderefs", MemDerefPrinterPass(dbgs()))
FUNCTION_PASS("print<uniformity>", UniformityInfoPrinterPass(dbgs()))
FUNCTION_PASS("reassociate", ReassociatePass())
FUNCTION_PASS("redundant-dbg-inst-elim", RedundantDbgInstEliminationPass())
FUNCTION_PASS("reg2mem", RegToMemPass())


@ -130,6 +130,13 @@ enum : uint64_t {
// Is a WMMA instruction.
IsWMMA = UINT64_C(1) << 59,
// Is source of divergence.
//
// Note: There is no corresponding SIInstrInfo::IsSourceOfDivergence method
// by design, since this flag only covers opcodes that are _always_ divergent.
// Use SIInstrInfo::getInstructionUniformity for a more complete analysis.
IsSourceOfDivergence = UINT64_C(1) << 60
};
// v_cmp_class_* etc. use a 10-bit mask for what operation is checked.


@ -3156,9 +3156,9 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
return false;
unsigned NewOpc =
IsFMA ? (IsF32 ? AMDGPU::V_FMAMK_F32
: ST.hasTrue16BitInsts() ? AMDGPU::V_FMAMK_F16_t16
: AMDGPU::V_FMAMK_F16)
: (IsF32 ? AMDGPU::V_MADMK_F32 : AMDGPU::V_MADMK_F16);
if (pseudoToMCOpcode(NewOpc) == -1)
return false;
@ -3236,9 +3236,9 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
}
unsigned NewOpc =
IsFMA ? (IsF32 ? AMDGPU::V_FMAAK_F32
: ST.hasTrue16BitInsts() ? AMDGPU::V_FMAAK_F16_t16
: AMDGPU::V_FMAAK_F16)
: (IsF32 ? AMDGPU::V_MADAK_F32 : AMDGPU::V_MADAK_F16);
if (pseudoToMCOpcode(NewOpc) == -1)
return false;
@ -8395,6 +8395,125 @@ unsigned SIInstrInfo::getInstrLatency(const InstrItineraryData *ItinData,
return SchedModel.computeInstrLatency(&MI);
}
InstructionUniformity
SIInstrInfo::getGenericInstructionUniformity(const MachineInstr &MI) const {
unsigned opcode = MI.getOpcode();
if (opcode == AMDGPU::G_INTRINSIC ||
opcode == AMDGPU::G_INTRINSIC_W_SIDE_EFFECTS) {
return AMDGPU::isIntrinsicSourceOfDivergence(MI.getIntrinsicID())
? InstructionUniformity::NeverUniform
: InstructionUniformity::AlwaysUniform;
}
// Loads from the private and flat address spaces are divergent, because
// threads can execute the load instruction with the same inputs and get
// different results.
//
// All other loads are not divergent, because if threads issue loads with the
// same arguments, they will always get the same result.
if (opcode == AMDGPU::G_LOAD) {
if (MI.memoperands_empty())
return InstructionUniformity::NeverUniform; // conservative assumption
if (llvm::any_of(MI.memoperands(), [](const MachineMemOperand *mmo) {
return mmo->getAddrSpace() == AMDGPUAS::PRIVATE_ADDRESS ||
mmo->getAddrSpace() == AMDGPUAS::FLAT_ADDRESS;
})) {
// At least one MMO in a non-global address space.
return InstructionUniformity::NeverUniform;
}
return InstructionUniformity::Default;
}
if (SIInstrInfo::isGenericAtomicRMWOpcode(opcode) ||
opcode == AMDGPU::G_ATOMIC_CMPXCHG ||
opcode == AMDGPU::G_ATOMIC_CMPXCHG_WITH_SUCCESS) {
return InstructionUniformity::NeverUniform;
}
return InstructionUniformity::Default;
}
InstructionUniformity
SIInstrInfo::getInstructionUniformity(const MachineInstr &MI) const {
if (MI.getDesc().TSFlags & SIInstrFlags::IsSourceOfDivergence)
return InstructionUniformity::NeverUniform;
// Atomics are divergent because they are executed sequentially: when an
// atomic operation refers to the same address in each thread, each thread
// after the first sees the value written by the previous thread as the
// original value.
if (isAtomic(MI))
return InstructionUniformity::NeverUniform;
// Loads from the private and flat address spaces are divergent, because
// threads can execute the load instruction with the same inputs and get
// different results.
if (isFLAT(MI) && MI.mayLoad()) {
if (MI.memoperands_empty())
return InstructionUniformity::NeverUniform; // conservative assumption
if (llvm::any_of(MI.memoperands(), [](const MachineMemOperand *mmo) {
return mmo->getAddrSpace() == AMDGPUAS::PRIVATE_ADDRESS ||
mmo->getAddrSpace() == AMDGPUAS::FLAT_ADDRESS;
})) {
// At least one MMO in a non-global address space.
return InstructionUniformity::NeverUniform;
}
return InstructionUniformity::Default;
}
unsigned opcode = MI.getOpcode();
if (opcode == AMDGPU::COPY) {
const MachineOperand &srcOp = MI.getOperand(1);
if (srcOp.isReg() && srcOp.getReg().isPhysical()) {
const TargetRegisterClass *regClass = RI.getPhysRegClass(srcOp.getReg());
return RI.isSGPRClass(regClass) ? InstructionUniformity::AlwaysUniform
: InstructionUniformity::NeverUniform;
}
return InstructionUniformity::Default;
}
if (opcode == AMDGPU::INLINEASM || opcode == AMDGPU::INLINEASM_BR) {
const MachineRegisterInfo &MRI = MI.getParent()->getParent()->getRegInfo();
for (auto &op : MI.operands()) {
if (!op.isReg() || !op.isDef())
continue;
auto *RC = MRI.getRegClass(op.getReg());
if (!RC || RI.isDivergentRegClass(RC))
return InstructionUniformity::NeverUniform;
}
return InstructionUniformity::AlwaysUniform;
}
if (opcode == AMDGPU::V_READLANE_B32 || opcode == AMDGPU::V_READFIRSTLANE_B32)
return InstructionUniformity::AlwaysUniform;
if (opcode == AMDGPU::V_WRITELANE_B32)
return InstructionUniformity::NeverUniform;
// GMIR handling
if (SIInstrInfo::isGenericOpcode(opcode))
return SIInstrInfo::getGenericInstructionUniformity(MI);
// Handling physical $vgpr reads
for (auto srcOp : MI.operands()) {
if (srcOp.isReg() && srcOp.getReg().isPhysical()) {
const TargetRegisterClass *regClass = RI.getPhysRegClass(srcOp.getReg());
if (RI.isVGPRClass(regClass))
return InstructionUniformity::NeverUniform;
}
}
// TODO: The uniformity check conditions above can be rearranged for better
// readability.
// TODO: amdgcn.{ballot, [if]cmp} should be AlwaysUniform, but they are
// currently turned into no-op COPYs by SelectionDAG ISel and are
// therefore no longer recognizable.
return InstructionUniformity::Default;
}
unsigned SIInstrInfo::getDSShaderTypeValue(const MachineFunction &MF) {
switch (MF.getFunction().getCallingConv()) {
case CallingConv::AMDGPU_PS:


@ -1166,6 +1166,12 @@ public:
const MachineInstr &MI,
unsigned *PredCost = nullptr) const override;
InstructionUniformity
getInstructionUniformity(const MachineInstr &MI) const override final;
InstructionUniformity
getGenericInstructionUniformity(const MachineInstr &MI) const;
const MIRFormatter *getMIRFormatter() const override {
if (!Formatter.get())
Formatter = std::make_unique<AMDGPUMIRFormatter>();


@ -0,0 +1,130 @@
# NOTE: This file is a Generic MIR translation of the test/Analysis/DivergenceAnalysis/AMDGPU/always_uniform.ll test file
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
---
name: readfirstlane
body: |
bb.1:
; CHECK-LABEL: MachineUniformityInfo for function: readfirstlane
; CHECK: DIVERGENT: %{{[0-9]+}}
; CHECK-SAME:llvm.amdgcn.workitem.id.x
; CHECK-NOT: DIVERGENT: {{.*}}llvm.amdgcn.readfirstlane
%6:_(p1) = G_IMPLICIT_DEF
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%5:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.readfirstlane), %4(s32)
G_STORE %5(s32), %6(p1) :: (store (s32) into `i32 addrspace(1)* undef`, addrspace 1)
S_ENDPGM 0
...
---
name: icmp
body: |
bb.1:
liveins: $sgpr4_sgpr5
; CHECK-LABEL: MachineUniformityInfo for function: icmp
; CHECK-NEXT: ALL VALUES UNIFORM
%3:_(p4) = COPY $sgpr4_sgpr5
%13:_(s32) = G_CONSTANT i32 0
%7:_(p4) = G_INTRINSIC intrinsic(@llvm.amdgcn.kernarg.segment.ptr)
%8:_(s32) = G_LOAD %7(p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
%9:_(s64) = G_CONSTANT i64 8
%10:_(p4) = G_PTR_ADD %7, %9(s64)
%11:_(p1) = G_LOAD %10(p4) :: (dereferenceable invariant load (p1), addrspace 4)
%12:_(s64) = G_INTRINSIC intrinsic(@llvm.amdgcn.icmp), %8(s32), %13(s32), 33
G_STORE %12(s64), %11(p1) :: (volatile store (s64) , addrspace 1)
S_ENDPGM 0
...
---
name: fcmp
body: |
bb.1:
liveins: $sgpr4_sgpr5
; CHECK-LABEL: MachineUniformityInfo for function: fcmp
; CHECK-NEXT: ALL VALUES UNIFORM
%3:_(p4) = COPY $sgpr4_sgpr5
%10:_(s32) = G_CONSTANT i32 0
%12:_(s32) = G_CONSTANT i32 1
%16:_(p1) = G_IMPLICIT_DEF
%7:_(p4) = G_INTRINSIC intrinsic(@llvm.amdgcn.kernarg.segment.ptr)
%8:_(<2 x s32>) = G_LOAD %7(p4) :: (dereferenceable invariant load (<2 x s32>) , align 16, addrspace 4)
%9:_(s32) = G_EXTRACT_VECTOR_ELT %8(<2 x s32>), %10(s32)
%11:_(s32) = G_EXTRACT_VECTOR_ELT %8(<2 x s32>), %12(s32)
%13:_(s64) = G_CONSTANT i64 4
%14:_(p4) = G_PTR_ADD %7, %13(s64)
%15:_(s64) = G_INTRINSIC intrinsic(@llvm.amdgcn.fcmp), %9(s32), %11(s32), 33
G_STORE %15(s64), %16(p1) :: (volatile store (s64) into `i64 addrspace(1)* undef`, addrspace 1)
S_ENDPGM 0
...
---
name: ballot
body: |
bb.1:
liveins: $sgpr4_sgpr5
; CHECK-LABEL: MachineUniformityInfo for function: ballot
; CHECK-NEXT: ALL VALUES UNIFORM
%2:_(p4) = COPY $sgpr4_sgpr5
%10:_(p1) = G_IMPLICIT_DEF
%6:_(p4) = G_INTRINSIC intrinsic(@llvm.amdgcn.kernarg.segment.ptr)
%7:_(s32) = G_LOAD %6(p4) :: (dereferenceable invariant load (s32), align 16, addrspace 4)
%8:_(s1) = G_TRUNC %7(s32)
%9:_(s64) = G_INTRINSIC intrinsic(@llvm.amdgcn.ballot), %8(s1)
G_STORE %9(s64), %10(p1) :: (volatile store (s64) into `i64 addrspace(1)* undef`, addrspace 1)
S_ENDPGM 0
...
---
name: asm_sgpr
registers:
- { id: 0, class: _, preferred-register: '' }
- { id: 1, class: sreg_32, preferred-register: '' }
- { id: 2, class: vgpr_32, preferred-register: '' }
- { id: 3, class: _, preferred-register: '' }
body: |
bb.0:
liveins: $vgpr0
; CHECK-LABEL: MachineUniformityInfo for function: asm_sgpr
; CHECK-NOT: DIVERGENT: %1
%0:_(s32) = COPY $vgpr0
%2:vgpr_32 = COPY %0(s32)
INLINEASM &"; def $0, $1", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %1, 1835017 /* reguse:VGPR_32 */, %2
%3:_(s32) = COPY %1
$vgpr0 = COPY %3(s32)
SI_RETURN implicit $vgpr0
...
# FIXME: The inline asm below should be divergent.
---
name: asm_mixed_sgpr_vgpr
registers:
- { id: 0, class: _, preferred-register: '' }
- { id: 1, class: sreg_32, preferred-register: '' }
- { id: 2, class: vgpr_32, preferred-register: '' }
- { id: 3, class: vgpr_32, preferred-register: '' }
- { id: 4, class: _, preferred-register: '' }
- { id: 5, class: _, preferred-register: '' }
- { id: 6, class: _, preferred-register: '' }
liveins: []
frameInfo:
body: |
bb.0:
liveins: $vgpr0
; CHECK-LABEL: MachineUniformityInfo for function: asm_mixed_sgpr_vgpr
; CHECK: DIVERGENT: %0:
; CHECK: DIVERGENT: %3:
; CHECK: DIVERGENT: %2:
; CHECK: DIVERGENT: %5:
%0:_(s32) = COPY $vgpr0
%6:_(p1) = G_IMPLICIT_DEF
%3:vgpr_32 = COPY %0(s32)
INLINEASM &"; def $0, $1, $2", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %1, 1835018 /* regdef:VGPR_32 */, def %2, 1835017 /* reguse:VGPR_32 */, %3
%4:_(s32) = COPY %1
%5:_(s32) = COPY %2
G_STORE %5(s32), %6(p1) :: (store (s32) into `i32 addrspace(1)* undef`, addrspace 1)
SI_RETURN
...


@ -0,0 +1,55 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# readlane and readfirstlane are always uniform
---
name: readlane
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: readlane
; CHECK-NEXT: ALL VALUES UNIFORM
%0:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%2:sgpr_32 = V_READFIRSTLANE_B32 %0, implicit $exec
%3:sgpr_32 = V_READLANE_B32 %1, 0, implicit $exec
$sgpr0 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
$sgpr1 = V_READLANE_B32 $vgpr1, $sgpr0, implicit $exec
S_ENDPGM 0
...
# Readlane with physical register as operand
---
name: readlane2
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: readlane2
; CHECK-NEXT: ALL VALUES UNIFORM
%0:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%4:sgpr_32 = V_READLANE_B32 $vgpr0, 0, implicit $exec
$sgpr0 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
$sgpr1 = V_READLANE_B32 $vgpr1, $sgpr0, implicit $exec
%5:sgpr_32 = V_READFIRSTLANE_B32 $vgpr1, implicit $exec
S_ENDPGM 0
...
# A COPY whose source operand is an SGPR is uniform
---
name: sgprcopy
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: sgprcopy
; CHECK-NEXT: ALL VALUES UNIFORM
liveins: $sgpr0,$sgpr1,$vgpr0
%0:sgpr_32 = COPY $sgpr0
%1:vgpr_32 = COPY $sgpr1
S_ENDPGM 0
...


@ -0,0 +1,109 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
---
name: test1
tracksRegLiveness: true
body: |
bb.1:
%2:_(s32) = IMPLICIT_DEF
%3:_(s32) = IMPLICIT_DEF
%0:_(p0) = G_MERGE_VALUES %2(s32), %3(s32)
%1:_(s32) = IMPLICIT_DEF
; CHECK: DIVERGENT
; CHECK-SAME: G_ATOMICRMW_XCHG
%4:_(s32) = G_ATOMICRMW_XCHG %0(p0), %1 :: (load store seq_cst (s32))
; CHECK: DIVERGENT
; CHECK-SAME: G_ATOMIC_CMPXCHG_WITH_SUCCESS
%5:_(s32), %6:_(s1) = G_ATOMIC_CMPXCHG_WITH_SUCCESS %0(p0), %1, %2 :: (load store seq_cst seq_cst (s32) )
$vgpr0 = COPY %4(s32)
SI_RETURN implicit $vgpr0
...
---
name: test_atomic_inc_dec
tracksRegLiveness: true
body: |
bb.1:
%2:_(s32) = IMPLICIT_DEF
%3:_(s32) = IMPLICIT_DEF
%0:_(p1) = G_MERGE_VALUES %2(s32), %3(s32)
%1:_(s32) = IMPLICIT_DEF
%5:_(s64) = IMPLICIT_DEF
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s32) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.inc)
%4:_(s32) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.inc), %0(p1), %1(s32), 0, 0, 0 :: (load store (s32) )
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.inc)
%6:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.inc), %0(p1), %5(s64), 0, 0, 0 :: (load store (s64) )
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s32) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.dec)
%7:_(s32) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.dec), %0(p1), %1(s32), 0, 0, 0 :: (load store (s32) )
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.dec)
%8:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.atomic.dec), %0(p1), %5(s64), 0, 0, 0 :: (load store (s64) )
$vgpr0 = COPY %4(s32)
SI_RETURN implicit $vgpr0
...
---
name: test_atomics
tracksRegLiveness: true
body: |
bb.1:
%2:_(s32) = IMPLICIT_DEF
%3:_(s32) = IMPLICIT_DEF
%0:_(p1) = G_MERGE_VALUES %2(s32), %3(s32)
%1:_(s32) = IMPLICIT_DEF
%5:_(s32) = IMPLICIT_DEF
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s32) = G_ATOMICRMW_ADD
%4:_(s32) = G_ATOMICRMW_ADD %2, %3
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s32) = G_ATOMICRMW_SUB
%6:_(s32) = G_ATOMICRMW_SUB %1, %5
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s32) = G_ATOMICRMW_AND
%7:_(s32) = G_ATOMICRMW_AND %2, %3
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s32) = G_ATOMICRMW_NAND
%8:_(s32) = G_ATOMICRMW_NAND %1, %5
; CHECK: DIVERGENT: %{{[0-9]}}: %{{[0-9]}}:_(s32) = G_ATOMICRMW_OR
%9:_(s32) = G_ATOMICRMW_OR %2, %3
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_XOR
%10:_(s32) = G_ATOMICRMW_XOR %1, %5
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_MAX
%11:_(s32) = G_ATOMICRMW_MAX %2, %3
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_MIN
%12:_(s32) = G_ATOMICRMW_MIN %1, %5
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_UMAX
%13:_(s32) = G_ATOMICRMW_UMAX %2, %3
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_UMIN
%14:_(s32) = G_ATOMICRMW_UMIN %1, %5
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_FADD
%15:_(s32) = G_ATOMICRMW_FADD %2, %3
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_FSUB
%16:_(s32) = G_ATOMICRMW_FSUB %1, %5
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_FMAX
%17:_(s32) = G_ATOMICRMW_FMAX %2, %3
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_ATOMICRMW_FMIN
%18:_(s32) = G_ATOMICRMW_FMIN %1, %5
$vgpr0 = COPY %4(s32)
SI_RETURN implicit $vgpr0
...


@ -0,0 +1,132 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
---
name: test1
tracksRegLiveness: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: test1
%2:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%0:vgpr_32 = IMPLICIT_DEF
%3:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
%5:vreg_64 = COPY %3
%6:vreg_64 = COPY %3
; CHECK: DIVERGENT
; CHECK-SAME: FLAT_ATOMIC_SWAP_RTN
%4:vgpr_32 = FLAT_ATOMIC_SWAP_RTN killed %5, %2, 0, 1, implicit $exec, implicit $flat_scr :: (load store seq_cst (s32))
; CHECK: DIVERGENT
; CHECK-SAME: FLAT_ATOMIC_SWAP_RTN
%7:vgpr_32 = FLAT_ATOMIC_SWAP_RTN killed %6, %2, 0, 1, implicit $exec, implicit $flat_scr ; No memoperands
$vgpr0 = COPY %4
SI_RETURN implicit $vgpr0
...
---
name: test2
tracksRegLiveness: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: test2
%3:vgpr_32 = IMPLICIT_DEF
%2:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%0:vgpr_32 = IMPLICIT_DEF
%4:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
%5:sreg_64 = REG_SEQUENCE %3, %subreg.sub0, %2, %subreg.sub1
%7:vreg_64 = COPY %4
%8:vreg_64 = COPY %5
; CHECK: DIVERGENT
; CHECK-SAME: FLAT_ATOMIC_CMPSWAP_RTN
%6:vgpr_32 = FLAT_ATOMIC_CMPSWAP_RTN killed %7, killed %8, 0, 1, implicit $exec, implicit $flat_scr :: (load store seq_cst seq_cst (s32))
%9:sreg_64_xexec = V_CMP_EQ_U32_e64 %6, %2, implicit $exec
%10:vgpr_32 = V_CNDMASK_B32_e64 0, 0, 0, 1, killed %9, implicit $exec
$vgpr0 = COPY %6
$vgpr1 = COPY %10
SI_RETURN implicit $vgpr0, implicit $vgpr1
...
---
name: atomic_inc
tracksRegLiveness: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: atomic_inc
%2:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%0:vgpr_32 = IMPLICIT_DEF
%3:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
%5:vreg_64 = COPY %3
; CHECK: DIVERGENT
; CHECK-SAME: GLOBAL_ATOMIC_INC_RTN
%4:vgpr_32 = GLOBAL_ATOMIC_INC_RTN killed %5, %2, 0, 1, implicit $exec :: (load store (s32), addrspace 1)
$vgpr0 = COPY %4
SI_RETURN implicit $vgpr0
...
---
name: atomic_inc_64
tracksRegLiveness: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: atomic_inc_64
%3:vgpr_32 = IMPLICIT_DEF
%2:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%0:vgpr_32 = IMPLICIT_DEF
%4:sreg_64 = REG_SEQUENCE %2, %subreg.sub0, %3, %subreg.sub1
%5:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
%7:vreg_64 = COPY %5
%8:vreg_64 = COPY %4
; CHECK: DIVERGENT
; CHECK-SAME: GLOBAL_ATOMIC_INC_X2_RTN
%6:vreg_64 = GLOBAL_ATOMIC_INC_X2_RTN killed %7, killed %8, 0, 1, implicit $exec :: (load store (s64), addrspace 1)
%9:vgpr_32 = COPY %6.sub1
%10:vgpr_32 = COPY %6.sub0
$vgpr0 = COPY %10
$vgpr1 = COPY %9
SI_RETURN implicit $vgpr0, implicit $vgpr1
...
---
name: atomic_dec
tracksRegLiveness: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: atomic_dec
%2:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%0:vgpr_32 = IMPLICIT_DEF
%3:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
%5:vreg_64 = COPY %3
; CHECK: DIVERGENT
; CHECK-SAME: GLOBAL_ATOMIC_DEC_RTN
%4:vgpr_32 = GLOBAL_ATOMIC_DEC_RTN killed %5, %2, 0, 1, implicit $exec :: (load store (s32), addrspace 1)
$vgpr0 = COPY %4
SI_RETURN implicit $vgpr0
...
---
name: atomic_dec_64
tracksRegLiveness: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: atomic_dec_64
%3:vgpr_32 = IMPLICIT_DEF
%2:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%0:vgpr_32 = IMPLICIT_DEF
%4:sreg_64 = REG_SEQUENCE %2, %subreg.sub0, %3, %subreg.sub1
%5:sreg_64 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1
%7:vreg_64 = COPY %5
%8:vreg_64 = COPY %4
; CHECK: DIVERGENT
; CHECK-SAME: GLOBAL_ATOMIC_DEC_X2_RTN
%6:vreg_64 = GLOBAL_ATOMIC_DEC_X2_RTN killed %7, killed %8, 0, 1, implicit $exec :: (load store (s64), addrspace 1)
%9:vgpr_32 = COPY %6.sub1
%10:vgpr_32 = COPY %6.sub0
$vgpr0 = COPY %10
$vgpr1 = COPY %9
SI_RETURN implicit $vgpr0, implicit $vgpr1
...


@ -0,0 +1,79 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# CHECK-LABEL: MachineUniformityInfo for function: hidden_diverge
# CHECK-LABEL: BLOCK bb.0
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_ICMP intpred(slt)
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_XOR %{{[0-9]*}}:_, %{{[0-9]*}}:_
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1), %{{[0-9]*}}:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.if)
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1), %{{[0-9]*}}:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.if)
# CHECK: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.1
# CHECK: DIVERGENT: G_BR %bb.2
# CHECK-LABEL: BLOCK bb.1
# CHECK-LABEL: BLOCK bb.2
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.1, %{{[0-9]*}}:_(s32), %bb.0
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_PHI %{{[0-9]*}}:_(s1), %bb.1, %{{[0-9]*}}:_(s1), %bb.0
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1), %{{[0-9]*}}:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.if)
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1), %{{[0-9]*}}:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.if)
# CHECK: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.3
# CHECK: DIVERGENT: G_BR %bb.4
# CHECK-LABEL: BLOCK bb.3
# CHECK-LABEL: BLOCK bb.4
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.2, %{{[0-9]*}}:_(s32), %bb.3
---
name: hidden_diverge
tracksRegLiveness: true
body: |
bb.1:
successors: %bb.2(0x40000000), %bb.3(0x40000000)
liveins: $sgpr4_sgpr5
%4:_(p4) = COPY $sgpr4_sgpr5
%15:_(s32) = G_CONSTANT i32 0
%17:_(s1) = G_CONSTANT i1 true
%23:_(s32) = G_CONSTANT i32 1
%30:_(s32) = G_CONSTANT i32 2
%32:_(p1) = G_IMPLICIT_DEF
%33:_(s32) = G_IMPLICIT_DEF
%8:_(p4) = G_INTRINSIC intrinsic(@llvm.amdgcn.kernarg.segment.ptr)
%9:_(<3 x s32>) = G_LOAD %8(p4) :: (dereferenceable invariant load (<3 x s32>), align 16, addrspace 4)
%10:_(s64) = G_CONSTANT i64 4
%11:_(p4) = G_PTR_ADD %8, %10(s64)
%12:_(s64) = G_CONSTANT i64 8
%13:_(p4) = G_PTR_ADD %8, %12(s64)
%14:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%16:_(s1) = G_ICMP intpred(slt), %14(s32), %15
%18:_(s1) = G_XOR %16, %17
%19:_(s1), %20:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.if), %16(s1)
G_BRCOND %19(s1), %bb.2
G_BR %bb.3
bb.2:
successors: %bb.3(0x80000000)
%21:_(s32) = G_EXTRACT_VECTOR_ELT %9(<3 x s32>), %15(s32)
%22:_(s32) = G_EXTRACT_VECTOR_ELT %9(<3 x s32>), %23(s32)
%24:_(s1) = G_ICMP intpred(slt), %21(s32), %15
bb.3:
successors: %bb.4(0x40000000), %bb.5(0x40000000)
%25:_(s32) = G_PHI %22(s32), %bb.2, %33(s32), %bb.1
%26:_(s1) = G_PHI %24(s1), %bb.2, %18(s1), %bb.1
G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.end.cf), %20(s64)
%27:_(s1), %28:_(s64) = G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.if), %26(s1)
G_BRCOND %27(s1), %bb.4
G_BR %bb.5
bb.4:
successors: %bb.5(0x80000000)
%29:_(s32) = G_EXTRACT_VECTOR_ELT %9(<3 x s32>), %30(s32)
bb.5:
%31:_(s32) = G_PHI %25(s32), %bb.3, %29(s32), %bb.4
G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.end.cf), %28(s64)
G_STORE %31(s32), %32(p1) :: (volatile store (s32) into `i32 addrspace(1)* undef`, addrspace 1)
S_ENDPGM 0
...


@ -0,0 +1,79 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
---
# CHECK-LABEL: MachineUniformityInfo for function: hidden_diverge
# CHECK-LABEL: BLOCK bb.0
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:vgpr_32(s32) = COPY $vgpr0
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:sreg_64 = V_CMP_GT_I32_e64
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:sreg_64 = V_CMP_LT_I32_e64
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:vreg_1 = COPY
# CHECK: DIVERGENT: %{{[0-9]*}}:sreg_64 = SI_IF
# CHECK: DIVERGENT: S_BRANCH %bb.1
# CHECK-LABEL: BLOCK bb.2
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:sreg_32 = PHI %{{[0-9]*}}:sreg_32, %bb.0, %{{[0-9]*}}:sreg_32, %bb.1
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:vreg_1 = PHI %{{[0-9]*}}:vreg_1, %bb.0, %{{[0-9]*}}:sreg_64, %bb.1
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:sreg_64 = COPY %{{[0-9]*}}:vreg_1
# CHECK: DIVERGENT: %{{[0-9]*}}:sreg_64 = SI_IF %{{[0-9]*}}:sreg_64, %bb.4
# CHECK: DIVERGENT: S_BRANCH %bb.3
# CHECK-LABEL: BLOCK bb.3
# CHECK-LABEL: BLOCK bb.4
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:vgpr_32 = PHI %{{[0-9]*}}:sreg_32, %bb.2, %{{[0-9]*}}:sreg_32, %bb.3
name: hidden_diverge
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.1(0x40000000), %bb.2(0x40000000)
liveins: $vgpr0, $sgpr0_sgpr1
%11:sgpr_64(p4) = COPY $sgpr0_sgpr1
%10:vgpr_32(s32) = COPY $vgpr0
%15:sreg_64_xexec = S_LOAD_DWORDX2_IMM %11(p4), 36, 0
%16:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %11(p4), 44, 0
%17:sreg_32 = COPY %15.sub1
%18:sreg_32 = COPY %15.sub0
%19:sgpr_96 = REG_SEQUENCE killed %18, %subreg.sub0, killed %17, %subreg.sub1, killed %16, %subreg.sub2
%0:sgpr_96 = COPY %19
%20:sreg_32 = S_MOV_B32 -1
%21:sreg_64 = V_CMP_GT_I32_e64 %10(s32), killed %20, implicit $exec
%22:sreg_32 = S_MOV_B32 0
%23:sreg_64 = V_CMP_LT_I32_e64 %10(s32), killed %22, implicit $exec
%1:vreg_1 = COPY %21
%14:sreg_32 = IMPLICIT_DEF
%2:sreg_64 = SI_IF killed %23, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
S_BRANCH %bb.1
bb.1:
successors: %bb.2(0x80000000)
%24:sreg_32 = COPY %0.sub0
%3:sreg_32 = COPY %0.sub1
%25:sreg_32 = S_MOV_B32 0
S_CMP_LT_I32 killed %24, killed %25, implicit-def $scc
%26:sreg_64 = COPY $scc
%4:sreg_64 = COPY %26
bb.2:
successors: %bb.3(0x40000000), %bb.4(0x40000000)
%5:sreg_32 = PHI %14, %bb.0, %3, %bb.1
%6:vreg_1 = PHI %1, %bb.0, %4, %bb.1
SI_END_CF %2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
%27:sreg_64 = COPY %6
%7:sreg_64 = SI_IF %27, %bb.4, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
S_BRANCH %bb.3
bb.3:
successors: %bb.4(0x80000000)
%8:sreg_32 = COPY %0.sub2
bb.4:
%9:vgpr_32 = PHI %5, %bb.2, %8, %bb.3
SI_END_CF %7, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
%28:sreg_64 = IMPLICIT_DEF
%29:vreg_64 = COPY %28
GLOBAL_STORE_DWORD killed %29, %9, 0, 0, implicit $exec
S_ENDPGM 0
...


@ -0,0 +1,59 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# CHECK-LABEL: MachineUniformityInfo for function: basic
# CHECK-NEXT: CYCLES ASSSUMED DIVERGENT:
# CHECK-NEXT: depth=1: entries(bb.1 bb.3) bb.2
# CHECK-LABEL: BLOCK bb.1
# CHECK: DIVERGENT
# CHECK: DIVERGENT
# CHECK-LABEL: BLOCK bb.2
# CHECK: DIVERGENT
# CHECK: DIVERGENT
# CHECK-LABEL: BLOCK bb.3
# CHECK: DIVERGENT
# CHECK: DIVERGENT
# CHECK-LABEL: BLOCK bb.4
# CHECK-NOT: DIVERGENT
---
name: basic
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.3, %bb.1
%0:_(s32) = G_IMPLICIT_DEF
%1:_(s32) = G_CONSTANT i32 0
%2:_(s32) = G_IMPLICIT_DEF
%3:_(s32) = G_CONSTANT i32 1
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%6:_(s1) = G_ICMP intpred(slt), %1(s32), %0(s32) ;uniform condition
%7:_(s1) = G_ICMP intpred(eq), %4(s32), %0 ;divergent condition
G_BRCOND %7(s1), %bb.3
G_BR %bb.1
bb.1:
successors: %bb.2
%8:_(s32) = G_PHI %0(s32), %bb.0, %2(s32), %bb.3
%9:_(s32) = G_ADD %8(s32), %3(s32)
G_BR %bb.2
bb.2:
successors: %bb.3, %bb.4
%13:_(s32) = G_ADD %2(s32), %3(s32)
%10:_(s32) = G_ADD %8(s32), %3(s32)
G_BRCOND %6(s1), %bb.3
G_BR %bb.4
bb.3:
successors: %bb.1
%11:_(s32) = G_PHI %13(s32), %bb.2, %0(s32), %bb.0
%12:_(s32) = G_ADD %11(s32), %3(s32)
G_BR %bb.1
bb.4:
%14:_(s32) = G_ADD %2(s32), %3(s32)
S_ENDPGM 0
...


@ -0,0 +1,118 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# CHECK-LABEL: MachineUniformityInfo for function: divergent_cycle_1
# CHECK-NEXT: CYCLES ASSSUMED DIVERGENT:
# CHECK-NEXT: depth=1: entries(bb.3 bb.1) bb.4 bb.2
# CHECK-NEXT: CYCLES WITH DIVERGENT EXIT:
# CHECK-NEXT: depth=2: entries(bb.4 bb.1) bb.2
# CHECK-NEXT: depth=1: entries(bb.3 bb.1) bb.4 bb.2
---
name: divergent_cycle_1
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.1, %bb.3
%0:_(s32) = G_CONSTANT i32 0
%1:_(s32) = G_CONSTANT i32 1
%2:_(s32) = G_IMPLICIT_DEF
%3:_(s32) = G_IMPLICIT_DEF
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%6:_(s1) = G_ICMP intpred(slt), %2(s32), %0(s32) ;uniform condition
%7:_(s1) = G_ICMP intpred(eq), %4(s32), %0 ;divergent condition
G_BRCOND %6(s1), %bb.1
G_BR %bb.3
bb.1:
successors: %bb.2
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.0, %{{[0-9]*}}:_(s32), %bb.4
%8:_(s32) = G_PHI %2(s32), %bb.0, %3(s32), %bb.4
%9:_(s32) = G_ADD %3(s32), %1(s32)
G_BR %bb.2
bb.2:
successors: %bb.3, %bb.4
%13:_(s32) = G_ADD %3(s32), %1(s32)
G_BRCOND %7(s1), %bb.4
G_BR %bb.3
bb.3:
successors: %bb.4
%14:_(s32) = G_ADD %3(s32), %1(s32)
G_BR %bb.4
bb.4:
successors: %bb.5, %bb.1
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.2, %{{[0-9]*}}:_(s32), %bb.3
%15:_(s32) = G_PHI %13(s32), %bb.2, %14(s32), %bb.3
%16:_(s32) = G_ADD %3(s32), %1(s32)
G_BRCOND %6(s1), %bb.5
G_BR %bb.1
bb.5:
%17:_(s32) = G_ADD %3(s32), %1(s32)
S_ENDPGM 0
...
# CHECK-LABEL: MachineUniformityInfo for function: uniform_cycle_1
---
name: uniform_cycle_1
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.1, %bb.5
%0:_(s32) = G_CONSTANT i32 0
%1:_(s32) = G_CONSTANT i32 1
%2:_(s32) = G_IMPLICIT_DEF
%3:_(s32) = G_IMPLICIT_DEF
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%6:_(s1) = G_ICMP intpred(slt), %2(s32), %0(s32) ;uniform condition
%7:_(s1) = G_ICMP intpred(eq), %4(s32), %0 ;divergent condition
G_BRCOND %6(s1), %bb.1
G_BR %bb.5
bb.1:
successors: %bb.2
; CHECK-NOT: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.0, %{{[0-9]*}}:_(s32), %bb.5
%8:_(s32) = G_PHI %2(s32), %bb.0, %3(s32), %bb.5
%9:_(s32) = G_ADD %3(s32), %1(s32)
G_BR %bb.2
bb.2:
successors: %bb.3, %bb.4
%13:_(s32) = G_ADD %3(s32), %1(s32)
G_BRCOND %7(s1), %bb.4
G_BR %bb.3
bb.3:
successors: %bb.4
%14:_(s32) = G_ADD %3(s32), %1(s32)
G_BR %bb.4
bb.4:
successors: %bb.6, %bb.5
; CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.2, %{{[0-9]*}}:_(s32), %bb.3
%15:_(s32) = G_PHI %13(s32), %bb.2, %14(s32), %bb.3
%16:_(s32) = G_ADD %3(s32), %1(s32)
G_BRCOND %6(s1), %bb.6
G_BR %bb.5
bb.5:
successors: %bb.1
%18:_(s32) = G_ADD %3(s32), %1(s32)
G_BR %bb.1
bb.6:
%17:_(s32) = G_ADD %3(s32), %1(s32)
S_ENDPGM 0
...


@ -0,0 +1,58 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# CHECK-LABEL: MachineUniformityInfo for function: basic
# CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
# CHECK: CYCLES WITH DIVERGENT EXIT:
# CHECK: depth=1: entries(bb.1 bb.3) bb.2
---
name: basic
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.3, %bb.1
%0:_(s32) = G_CONSTANT i32 0
%1:_(s32) = G_CONSTANT i32 1
%2:_(s32) = G_IMPLICIT_DEF
%3:_(s32) = G_IMPLICIT_DEF
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%6:_(s1) = G_ICMP intpred(slt), %2(s32), %0(s32) ;uniform condition
%7:_(s1) = G_ICMP intpred(eq), %4(s32), %0 ;divergent condition
G_BRCOND %6(s1), %bb.3
G_BR %bb.1
bb.1:
successors: %bb.2
%8:_(s32) = G_PHI %2(s32), %bb.0, %3(s32), %bb.3
%10:_(s32) = G_PHI %2(s32), %bb.0, %16(s32), %bb.3
%9:_(s32) = G_ADD %3(s32), %1(s32)
G_BR %bb.2
bb.2:
successors: %bb.3, %bb.4
%13:_(s32) = G_ADD %3(s32), %1(s32)
%14:_(s32) = G_ADD %10(s32), %1(s32)
%15:_(s32) = G_ADD %10(s32), %1(s32)
G_BRCOND %7(s1), %bb.3
G_BR %bb.4
bb.3:
successors: %bb.1
%16:_(s32) = G_PHI %13(s32), %bb.2, %2(s32), %bb.0
%17:_(s32) = G_ADD %3(s32), %1(s32)
G_BR %bb.1
bb.4:
; CHECK-LABEL: bb.4
; CHECK: DIVERGENT:
; CHECK: DIVERGENT:
; CHECK-NOT: DIVERGENT:
%18:_(s32) = G_ADD %8(s32), %3(s32)
%19:_(s32) = G_ADD %8(s32), %3(s32)
%20:_(s32) = G_ADD %3(s32), %1(s32)
S_ENDPGM 0
...


@ -0,0 +1,53 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# CHECK-LABEL: MachineUniformityInfo for function: irreducible
# CHECK: CYCLES ASSSUMED DIVERGENT:
# CHECK: depth=1: entries(bb.2 bb.1) bb.3 bb.5 bb.4
# CHECK: CYCLES WITH DIVERGENT EXIT:
# CHECK: depth=1: entries(bb.2 bb.1) bb.3 bb.5 bb.4
# CHECK: depth=2: entries(bb.3 bb.1) bb.5 bb.4
---
name: irreducible
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
successors: %bb.1, %bb.2
liveins: $vgpr0, $vgpr1, $vgpr2, $sgpr4_sgpr5, $sgpr6_sgpr7, $sgpr8_sgpr9, $sgpr10_sgpr11, $sgpr14, $sgpr15, $sgpr16
%0:sreg_32 = IMPLICIT_DEF
%2:vgpr_32 = COPY $vgpr0
%3:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
S_CMP_EQ_U32 %0, 0, implicit-def $scc
S_CBRANCH_SCC1 %bb.1, implicit $scc
S_BRANCH %bb.2
bb.1:
%28:vgpr_32 = PHI %3, %bb.0, %49, %bb.5
%29:vgpr_32 = V_ADD_U32_e64 %28, 1, 0, implicit $exec
S_BRANCH %bb.3
bb.2:
%38:vgpr_32 = PHI %3, %bb.0, %49, %bb.4
%39:vgpr_32 = V_ADD_U32_e64 %38, 2, 0, implicit $exec
bb.3:
%49:vgpr_32 = PHI %29, %bb.1, %39, %bb.2
bb.4:
successors: %bb.2, %bb.5
%50:vgpr_32 = V_AND_B32_e32 3, %2, implicit $exec
%51:sreg_64 = V_CMP_EQ_U32_e64 %50, 2, implicit $exec
%52:sreg_64 = SI_IF killed %51:sreg_64, %bb.2, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
bb.5:
successors: %bb.1, %bb.6
%61:sreg_64 = V_CMP_EQ_U32_e64 %50, 1, implicit $exec
%62:sreg_64 = SI_IF killed %61:sreg_64, %bb.1, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
bb.6:
S_ENDPGM 0
...


@ -0,0 +1,89 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# bb0(div)
# / \
# bb1 <-> bb2
# |
# bb3
# CHECK-LABEL: MachineUniformityInfo for function: cycle_diverge_enter
# CHECK-NEXT: CYCLES ASSSUMED DIVERGENT:
# CHECK-NEXT: depth=1: entries(bb.2 bb.1)
# CHECK-NEXT: CYCLES WITH DIVERGENT EXIT:
# CHECK-NEXT: depth=1: entries(bb.2 bb.1)
---
name: cycle_diverge_enter
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.1, %bb.2
%0:_(s32) = G_IMPLICIT_DEF
%1:_(s32) = G_CONSTANT i32 0
%2:_(s32) = G_IMPLICIT_DEF
%3:_(s32) = G_CONSTANT i32 1
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%6:_(s1) = G_ICMP intpred(slt), %4(s32), %1 ; DIVERGENT CONDITION
%7:_(s1) = G_ICMP intpred(slt), %2(s32), %1 ; UNIFORM CONDITION
G_BRCOND %6(s1), %bb.1 ; divergent branch
G_BR %bb.2
bb.1:
successors: %bb.2
%8:_(s32) = G_PHI %1(s32), %bb.0, %0(s32), %bb.2
G_BR %bb.2
bb.2:
successors: %bb.1, %bb.3
%9:_(s32) = G_PHI %2(s32), %bb.1, %3(s32), %bb.0
%10:_(s1) = G_ICMP intpred(eq), %9(s32), %1(s32)
G_BRCOND %10(s1), %bb.3 ; divergent branch
G_BR %bb.1
bb.3:
%11:_(s32), %12:_(s1) = G_UADDO %9, %3
S_ENDPGM 0
...
# CHECK-LABEL: MachineUniformityInfo for function: cycle_diverge_exit
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32), %{{[0-9]*}}:_(s1) = G_UADDO %8:_, %{{[0-9]*}}:_
# bb0
# / \
# bb1 <-> bb2(div)
# |
# bb3
---
name: cycle_diverge_exit
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.1, %bb.2
%0:_(s32) = G_IMPLICIT_DEF
%1:_(s32) = G_CONSTANT i32 0
%2:_(s32) = G_IMPLICIT_DEF
%3:_(s32) = G_CONSTANT i32 1
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%6:_(s1) = G_ICMP intpred(slt), %4(s32), %1 ; DIVERGENT CONDITION
%7:_(s1) = G_ICMP intpred(slt), %2(s32), %1 ; UNIFORM CONDITION
G_BRCOND %7(s1), %bb.1 ; uniform branch
G_BR %bb.2
bb.1:
successors: %bb.2
%8:_(s32) = G_PHI %1(s32), %bb.0, %0(s32), %bb.2
G_BR %bb.2
bb.2:
successors: %bb.1, %bb.3
%9:_(s32) = G_PHI %2(s32), %bb.1, %3(s32), %bb.0
%10:_(s1) = G_ICMP intpred(sgt), %9(s32), %1(s32)
G_BRCOND %6(s1), %bb.3 ; divergent branch
G_BR %bb.1
bb.3:
%11:_(s32), %12:_(s1) = G_UADDO %9, %3
S_ENDPGM 0
...


@ -0,0 +1,57 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# CHECK-LABEL: MachineUniformityInfo for function: test
# CHECK-LABEL: BLOCK bb.0
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_ICMP intpred(eq), %{{[0-9]*}}:_(s32), %{{[0-9]*}}:_
# CHECK-LABEL: BLOCK bb.1
# CHECK-NOT: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.0, %{{[0-9]*}}:_(s32), %bb.2
# CHECK: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.3
# CHECK-LABEL: BLOCK bb.2
# CHECK-NOT: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32), %{{[0-9]*}}:_(s1) = G_UADDO_
# CHECK-NOT: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.3
# CHECK-LABEL: BLOCK bb.3
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_PHI %{{[0-9]*}}:_(s1), %bb.1, %{{[0-9]*}}:_(s1), %bb.2
# CHECK: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.4
# CHECK: DIVERGENT: G_BR %bb.5
---
name: test
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.1
%2:_(s1) = G_CONSTANT i1 true
%3:_(s1) = G_CONSTANT i1 false
%1:_(s32) = G_CONSTANT i32 0
%20:_(s32) = G_CONSTANT i32 7
%5:_(s32) = G_CONSTANT i32 -1
%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%6:_(s1) = G_ICMP intpred(eq), %4(s32), %5
bb.1:
successors: %bb.2, %bb.3
%8:_(s32) = G_PHI %20(s32), %bb.0, %21(s32), %bb.2
G_BRCOND %6(s1), %bb.3 ; Entrance to loop is divergent
bb.2:
successors: %bb.3, %bb.1
%21:_(s32), %22:_(s1) = G_UADDO %8, %5
%23:_(s1) = G_ICMP intpred(eq), %21(s32), %1
G_BRCOND %23(s1), %bb.3
G_BR %bb.1
bb.3:
%31:_(s1) = G_PHI %2(s1), %bb.1, %3(s1), %bb.2
G_BRCOND %31(s1), %bb.4
G_BR %bb.5
bb.4:
G_BR %bb.5
bb.5:
S_ENDPGM 0
...


@ -0,0 +1,36 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
---
name: loads
tracksRegLiveness: true
body: |
bb.1.entry:
%1:_(p0) = G_IMPLICIT_DEF
%4:_(p1) = G_IMPLICIT_DEF
%6:_(p5) = G_IMPLICIT_DEF
; Atomic load
; CHECK: DIVERGENT
; CHECK-SAME: G_LOAD
%0:_(s32) = G_LOAD %1(p0) :: (load seq_cst (s32) from `ptr undef`)
; Flat load
; CHECK: DIVERGENT
; CHECK-SAME: G_LOAD
%2:_(s32) = G_LOAD %1(p0) :: (load (s32) from `ptr undef`)
; Global load
; CHECK-NOT: DIVERGENT
%3:_(s32) = G_LOAD %4(p1) :: (load (s32) from `ptr addrspace(1) undef`, addrspace 1)
; Private load
; CHECK: DIVERGENT
; CHECK-SAME: G_LOAD
%5:_(s32) = G_LOAD %6(p5) :: (volatile load (s32) from `ptr addrspace(5) undef`, addrspace 5)
G_STORE %2(s32), %4(p1) :: (volatile store (s32) into `ptr addrspace(1) undef`, addrspace 1)
G_STORE %3(s32), %4(p1) :: (volatile store (s32) into `ptr addrspace(1) undef`, addrspace 1)
G_STORE %5(s32), %4(p1) :: (volatile store (s32) into `ptr addrspace(1) undef`, addrspace 1)
G_STORE %0(s32), %4(p1) :: (volatile store (s32) into `ptr addrspace(1) undef`, addrspace 1)
SI_RETURN
...


@ -0,0 +1,138 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# Loads from the flat address space are not uniform
---
name: flatloads
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: flatloads
; CHECK: DIVERGENT: %1
; CHECK-NOT: DIVERGENT: %2
%0:vreg_64 = IMPLICIT_DEF
%1:vgpr_32(s32) = FLAT_LOAD_DWORD %0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
%2:vgpr_32(s32) = FLAT_LOAD_DWORD %0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32), addrspace 1)
%3:sreg_32 = V_READFIRSTLANE_B32 %1(s32), implicit $exec
S_ENDPGM 0
...
# Loads from the scratch address space are not uniform
---
name: scratchloads
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: scratchloads
; CHECK: DIVERGENT: %1
%0:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
%1:vgpr_32 = SCRATCH_LOAD_DWORD %0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32), addrspace 5)
S_ENDPGM 0
...
# Global load
---
name: globalloads
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: globalloads
; CHECK: DIVERGENT: %2
; CHECK-NOT: DIVERGENT: %3
%0:vreg_64 = IMPLICIT_DEF
%1:vreg_64 = IMPLICIT_DEF
%2:vgpr_32(s32) = GLOBAL_LOAD_DWORD %0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
%3:vreg_64 = GLOBAL_LOAD_DWORDX2 %1, 0, 0, implicit $exec :: (load (s64), addrspace 1)
%4:sreg_32 = V_READFIRSTLANE_B32 %2(s32), implicit $exec
S_ENDPGM 0
...
# FIXME: The DS_READ_ADDTID instruction might be incorrectly marked uniform.
---
name: dsreads
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: dsreads
; CHECK-NEXT: ALL VALUES UNIFORM
%0:vreg_64 = IMPLICIT_DEF
$m0 = S_MOV_B32 0
%1:vgpr_32 = DS_READ_ADDTID_B32 0, 0, implicit $m0, implicit $exec
S_ENDPGM 0
...
# copy source == $sgpr => uniform, $vgpr => divergent
---
name: sgprcopy
tracksRegLiveness: true
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: sgprcopy
; CHECK: DIVERGENT: %2
liveins: $sgpr0,$sgpr1,$vgpr0
%0:sgpr_32 = COPY $sgpr0
%1:vgpr_32 = COPY $sgpr1
%2:vgpr_32 = COPY $vgpr0
S_ENDPGM 0
...
# writelane is not uniform
---
name: writelane
machineFunctionInfo:
isEntryFunction: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: writelane
; CHECK: DIVERGENT: %4
; CHECK: DIVERGENT: %5
%0:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = IMPLICIT_DEF
%2:sgpr_32 = V_READFIRSTLANE_B32 %0, implicit $exec
%3:sgpr_32 = V_READLANE_B32 %1, 0, implicit $exec
$sgpr0 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
$sgpr1 = V_READLANE_B32 $vgpr1, $sgpr0, implicit $exec
%4:vgpr_32 = V_WRITELANE_B32 0, 0, %0, implicit $exec
%5:sreg_64 = V_CMP_EQ_U32_e64 %0, %4, implicit $exec
S_CBRANCH_VCCZ %bb.1, implicit $vcc
bb.1:
%16:vgpr_32 = IMPLICIT_DEF
S_ENDPGM 0
...
# Directly reading a physical VGPR is not uniform
---
name: physicalreg
tracksRegLiveness: true
body: |
bb.0:
; CHECK-LABEL: MachineUniformityInfo for function: physicalreg
; CHECK: DIVERGENT: %0
; CHECK: DIVERGENT: %1
; CHECK: DIVERGENT: %2
; CHECK: DIVERGENT: %3
; CHECK: DIVERGENT: %4
; CHECK-NOT: DIVERGENT
; CHECK: DIVERGENT: %5
liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
%0:vgpr_32 = COPY $vgpr0
%1:vgpr_32 = COPY $vgpr1
%2:vgpr_32 = V_AND_B32_e32 %1, $vgpr3, implicit $exec
%3:vgpr_32 = V_ADD_U32_e32 $vgpr2, $vgpr3, implicit $exec
%4:vgpr_32 = V_SUB_CO_U32_e32 $vgpr2, $vgpr3, implicit $exec, implicit-def $vcc
%5:vgpr_32 = V_AND_B32_e32 $vgpr4, $vgpr5, implicit $exec
S_ENDPGM 0
...


@ -0,0 +1,76 @@
# RUN: llc -mtriple=amdgcn-- -run-pass=print-machine-uniformity -o - %s 2>&1 | FileCheck %s
# CHECK-LABEL: MachineUniformityInfo for function: hidden_loop_diverge
# CHECK-LABEL: BLOCK bb.0
# CHECK-NOT: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_ICMP intpred(slt), %{{[0-9]*}}:_(s32), %{{[0-9]*}}:_
# CHECK-NOT: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.3
# CHECK-NOT: DIVERGENT: G_BR %bb.1
# CHECK-LABEL: BLOCK bb.1
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_ICMP intpred(slt), %{{[0-9]*}}:_(s32), %{{[0-9]*}}:_
# CHECK: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.3
# CHECK: DIVERGENT: G_BR %bb.2
# CHECK-LABEL: BLOCK bb.2
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s1) = G_ICMP intpred(sgt), %{{[0-9]*}}:_(s32), %{{[0-9]*}}:_
# CHECK: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.4
# CHECK: DIVERGENT: G_BR %bb.1
# CHECK-LABEL: BLOCK bb.3
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.0, %{{[0-9]*}}:_(s32), %bb.1
# CHECK-NOT: DIVERGENT: G_BRCOND %{{[0-9]*}}:_(s1), %bb.4
# CHECK-NOT: DIVERGENT: G_BR %bb.5
# CHECK-LABEL: BLOCK bb.4
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.3, %{{[0-9]*}}:_(s32), %bb.2
# CHECK-LABEL: BLOCK bb.5
# CHECK: DIVERGENT: %{{[0-9]*}}: %{{[0-9]*}}:_(s32) = G_PHI %{{[0-9]*}}:_(s32), %bb.3, %{{[0-9]*}}:_(s32), %bb.4
---
name: hidden_loop_diverge
tracksRegLiveness: true
body: |
bb.0:
successors: %bb.3, %bb.1
liveins: $sgpr4_sgpr5
%0:_(s32) = G_IMPLICIT_DEF
%20:_(s32) = G_IMPLICIT_DEF
%21:_(s32) = G_CONSTANT i32 42
%22:_(s32) = G_IMPLICIT_DEF
%1:_(s32) = G_CONSTANT i32 0
%2:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)
%3:_(s1) = G_ICMP intpred(slt), %0(s32), %1
G_BRCOND %3(s1), %bb.3 ; Uniform branch
G_BR %bb.1
bb.1:
successors: %bb.3, %bb.2
%4:_(s32) = G_PHI %1(s32), %bb.0, %7(s32), %bb.2
%5:_(s1) = G_ICMP intpred(slt), %1(s32), %2(s32)
G_BRCOND %5(s1), %bb.3
G_BR %bb.2
bb.2:
successors: %bb.4, %bb.1
%6:_(s32) = G_CONSTANT i32 1
%7:_(s32) = G_ADD %6(s32), %4(s32)
%8:_(s1) = G_ICMP intpred(sgt), %2(s32), %1(s32)
G_BRCOND %8(s1), %bb.4
G_BR %bb.1
bb.3:
successors: %bb.4, %bb.5
%9:_(s32) = G_PHI %20(s32), %bb.0, %4(s32), %bb.1 ; Temporal divergent phi
G_BRCOND %3(s1), %bb.4
G_BR %bb.5
bb.4:
successors: %bb.5
%10:_(s32) = G_PHI %21(s32), %bb.3, %22(s32), %bb.2
G_BR %bb.5
bb.5:
%11:_(s32) = G_PHI %20(s32), %bb.3, %22(s32), %bb.4
S_ENDPGM 0
...


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK-LABEL: for function 'readfirstlane':
define amdgpu_kernel void @readfirstlane() {
@ -39,7 +40,7 @@ define i32 @asm_sgpr(i32 %divergent) {
ret i32 %sgpr
}
; CHECK-LABEL: for function 'asm_mixed_sgpr_vgpr':
; CHECK: DIVERGENT: %asm = call { i32, i32 } asm "; def $0, $1, $2", "=s,=v,v"(i32 %divergent)
; CHECK-NEXT: {{^[ \t]+}}%sgpr = extractvalue { i32, i32 } %asm, 0
; CHECK-NEXT: DIVERGENT: %vgpr = extractvalue { i32, i32 } %asm, 1


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-- -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: DIVERGENT: %orig = atomicrmw xchg ptr %ptr, i32 %val seq_cst
define amdgpu_kernel void @test1(ptr %ptr, i32 %val) #0 {


@ -1,12 +1,13 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
declare i32 @gf2(i32)
declare i32 @gf1(i32)
define void @tw1(ptr addrspace(4) noalias nocapture readonly %A, ptr addrspace(4) noalias nocapture %B) local_unnamed_addr #2 {
; CHECK: for function 'tw1':
; CHECK-DAG: DIVERGENT: ptr addrspace(4) %A
; CHECK-DAG: DIVERGENT: ptr addrspace(4) %B
entry:
; CHECK: DIVERGENT: %call = tail call i32 @gf2(i32 0) #0
; CHECK: DIVERGENT: %cmp = icmp ult i32 %call, 16


@ -1,8 +1,9 @@
; RUN: opt -mtriple amdgcn-mesa-mesa3d -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-mesa-mesa3d -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; Tests control flow intrinsics that should be treated as uniform
; CHECK: for function 'test_if_break':
; CHECK: DIVERGENT: %cond = icmp eq i32 %arg0, 0
; CHECK-NOT: DIVERGENT
; CHECK: ret void
@ -14,7 +15,7 @@ entry:
ret void
}
; CHECK: for function 'test_if':
; CHECK: DIVERGENT: %cond = icmp eq i32 %arg0, 0
; CHECK-NEXT: DIVERGENT: %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
; CHECK-NEXT: DIVERGENT: %if.bool = extractvalue { i1, i64 } %if, 0
@ -33,7 +34,7 @@ entry:
}
; The result should still be treated as divergent, even with a uniform source.
; CHECK: for function 'test_if_uniform':
; CHECK-NOT: DIVERGENT
; CHECK: DIVERGENT: %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cond)
; CHECK-NEXT: DIVERGENT: %if.bool = extractvalue { i1, i64 } %if, 0
@ -51,7 +52,7 @@ entry:
ret void
}
; CHECK: for function 'test_loop_uniform':
; CHECK: DIVERGENT: %loop = call i1 @llvm.amdgcn.loop.i64(i64 %mask)
define amdgpu_ps void @test_loop_uniform(i64 inreg %mask) {
entry:
@ -61,7 +62,7 @@ entry:
ret void
}
; CHECK: Divergence Analysis' for function 'test_else':
; CHECK: for function 'test_else':
; CHECK: DIVERGENT: %else = call { i1, i64 } @llvm.amdgcn.else.i64.i64(i64 %mask)
; CHECK: DIVERGENT: %else.bool = extractvalue { i1, i64 } %else, 0
; CHECK: {{^[ \t]+}}%else.mask = extractvalue { i1, i64 } %else, 1
@ -77,7 +78,7 @@ entry:
}
; This case is probably always broken
; CHECK: Divergence Analysis' for function 'test_else_divergent_mask':
; CHECK: for function 'test_else_divergent_mask':
; CHECK: DIVERGENT: %if = call { i1, i64 } @llvm.amdgcn.else.i64.i64(i64 %mask)
; CHECK-NEXT: DIVERGENT: %if.bool = extractvalue { i1, i64 } %if, 0
; CHECK-NOT: DIVERGENT


@ -1,11 +1,13 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
define amdgpu_kernel void @hidden_diverge(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: 'Divergence Analysis' for function 'hidden_diverge'
; CHECK-LABEL: for function 'hidden_diverge'
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.var = icmp slt i32 %tid, 0
br i1 %cond.var, label %B, label %C ; divergent
; CHECK: DIVERGENT: %cond.var =
; CHECK: DIVERGENT: br i1 %cond.var,
B:
%cond.uni = icmp slt i32 %n, 0
@ -22,7 +24,7 @@ merge:
}
define amdgpu_kernel void @hidden_loop_ipd(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: 'Divergence Analysis' for function 'hidden_loop_ipd'
; CHECK-LABEL: for function 'hidden_loop_ipd'
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.var = icmp slt i32 %tid, 0


@ -1,9 +1,10 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; divergent loop (H<header><exiting to X>, B<exiting to Y>)
; the divergent join point in %exit is obscured by uniform control joining in %X
define amdgpu_kernel void @hidden_loop_diverge(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'hidden_loop_diverge'
; CHECK-LABEL: for function 'hidden_loop_diverge':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -45,7 +46,7 @@ exit:
; divergent loop (H<header><exiting to X>, B<exiting to Y>)
; the phi nodes in X and Y don't actually receive divergent values
define amdgpu_kernel void @unobserved_loop_diverge(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'unobserved_loop_diverge':
; CHECK-LABEL: for function 'unobserved_loop_diverge':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -86,7 +87,7 @@ exit:
; the inner loop has no exit to top level.
; the outer loop becomes divergent as its exiting branch in C is control-dependent on the inner loop's divergent loop exit in D.
define amdgpu_kernel void @hidden_nestedloop_diverge(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'hidden_nestedloop_diverge':
; CHECK-LABEL: for function 'hidden_nestedloop_diverge':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -137,7 +138,7 @@ exit:
; the outer loop has no immediately divergent exiting edge.
; the inner exiting edge is exiting to top-level through the outer loop causing both to become divergent.
define amdgpu_kernel void @hidden_doublebreak_diverge(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'hidden_doublebreak_diverge':
; CHECK-LABEL: for function 'hidden_doublebreak_diverge':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -179,7 +180,7 @@ Y:
; divergent loop (G<header>, L<exiting to D>) contained inside a uniform loop (H<header>, B, G, L , D<exiting to x>)
define amdgpu_kernel void @hidden_containedloop_diverge(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'hidden_containedloop_diverge':
; CHECK-LABEL: for function 'hidden_containedloop_diverge':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.


@ -1,50 +1,52 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -mcpu=tahiti -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -mcpu=gfx908 -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -mcpu=tahiti -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -mcpu=gfx908 -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; Make sure nothing crashes on targets with or without AGPRs
; CHECK: Divergence Analysis' for function 'inline_asm_1_sgpr_virtreg_output':
; CHECK-LABEL: for function 'inline_asm_1_sgpr_virtreg_output':
; CHECK-NOT: DIVERGENT
define i32 @inline_asm_1_sgpr_virtreg_output() {
%sgpr = call i32 asm "s_mov_b32 $0, 0", "=s"()
ret i32 %sgpr
}
; CHECK: Divergence Analysis' for function 'inline_asm_1_sgpr_physreg_output':
; CHECK-LABEL: for function 'inline_asm_1_sgpr_physreg_output':
; CHECK-NOT: DIVERGENT
define i32 @inline_asm_1_sgpr_physreg_output() {
%sgpr = call i32 asm "s_mov_b32 s0, 0", "={s0}"()
ret i32 %sgpr
}
; CHECK: Divergence Analysis' for function 'inline_asm_1_vgpr_virtreg_output':
; CHECK-LABEL: for function 'inline_asm_1_vgpr_virtreg_output':
; CHECK: DIVERGENT: %vgpr = call i32 asm "v_mov_b32 $0, 0", "=v"()
define i32 @inline_asm_1_vgpr_virtreg_output() {
%vgpr = call i32 asm "v_mov_b32 $0, 0", "=v"()
ret i32 %vgpr
}
; CHECK: Divergence Analysis' for function 'inline_asm_1_vgpr_physreg_output':
; CHECK-LABEL: for function 'inline_asm_1_vgpr_physreg_output':
; CHECK: DIVERGENT: %vgpr = call i32 asm "v_mov_b32 v0, 0", "={v0}"()
define i32 @inline_asm_1_vgpr_physreg_output() {
%vgpr = call i32 asm "v_mov_b32 v0, 0", "={v0}"()
ret i32 %vgpr
}
; CHECK: Divergence Analysis' for function 'inline_asm_1_agpr_virtreg_output':
; CHECK-LABEL: for function 'inline_asm_1_agpr_virtreg_output':
; CHECK: DIVERGENT: %vgpr = call i32 asm "; def $0", "=a"()
define i32 @inline_asm_1_agpr_virtreg_output() {
%vgpr = call i32 asm "; def $0", "=a"()
ret i32 %vgpr
}
; CHECK: Divergence Analysis' for function 'inline_asm_1_agpr_physreg_output':
; CHECK-LABEL: for function 'inline_asm_1_agpr_physreg_output':
; CHECK: DIVERGENT: %vgpr = call i32 asm "; def a0", "={a0}"()
define i32 @inline_asm_1_agpr_physreg_output() {
%vgpr = call i32 asm "; def a0", "={a0}"()
ret i32 %vgpr
}
; CHECK: Divergence Analysis' for function 'inline_asm_2_sgpr_virtreg_output':
; CHECK-LABEL: for function 'inline_asm_2_sgpr_virtreg_output':
; CHECK-NOT: DIVERGENT
define void @inline_asm_2_sgpr_virtreg_output() {
%asm = call { i32, i32 } asm "; def $0, $1", "=s,=s"()
@ -56,7 +58,7 @@ define void @inline_asm_2_sgpr_virtreg_output() {
}
; One output is SGPR, one is VGPR. Infer divergent for the aggregate, but uniform on the SGPR extract
; CHECK: Divergence Analysis' for function 'inline_asm_sgpr_vgpr_virtreg_output':
; CHECK-LABEL: for function 'inline_asm_sgpr_vgpr_virtreg_output':
; CHECK: DIVERGENT: %asm = call { i32, i32 } asm "; def $0, $1", "=s,=v"()
; CHECK-NEXT: {{^[ \t]+}}%sgpr = extractvalue { i32, i32 } %asm, 0
; CHECK-NEXT: DIVERGENT: %vgpr = extractvalue { i32, i32 } %asm, 1
@ -69,7 +71,7 @@ define void @inline_asm_sgpr_vgpr_virtreg_output() {
ret void
}
; CHECK: Divergence Analysis' for function 'inline_asm_vgpr_sgpr_virtreg_output':
; CHECK-LABEL: for function 'inline_asm_vgpr_sgpr_virtreg_output':
; CHECK: DIVERGENT: %asm = call { i32, i32 } asm "; def $0, $1", "=v,=s"()
; CHECK-NEXT: DIVERGENT: %vgpr = extractvalue { i32, i32 } %asm, 0
; CHECK-NEXT: {{^[ \t]+}}%sgpr = extractvalue { i32, i32 } %asm, 1
@ -83,7 +85,7 @@ define void @inline_asm_vgpr_sgpr_virtreg_output() {
}
; Have an extra output constraint
; CHECK: Divergence Analysis' for function 'multi_sgpr_inline_asm_output_input_constraint':
; CHECK-LABEL: for function 'multi_sgpr_inline_asm_output_input_constraint':
; CHECK-NOT: DIVERGENT
define void @multi_sgpr_inline_asm_output_input_constraint() {
%asm = call { i32, i32 } asm "; def $0, $1", "=s,=s,s"(i32 1234)
@ -94,7 +96,7 @@ define void @multi_sgpr_inline_asm_output_input_constraint() {
ret void
}
; CHECK: Divergence Analysis' for function 'inline_asm_vgpr_sgpr_virtreg_output_input_constraint':
; CHECK-LABEL: for function 'inline_asm_vgpr_sgpr_virtreg_output_input_constraint':
; CHECK: DIVERGENT: %asm = call { i32, i32 } asm "; def $0, $1", "=v,=s,v"(i32 1234)
; CHECK-NEXT: DIVERGENT: %vgpr = extractvalue { i32, i32 } %asm, 0
; CHECK-NEXT: {{^[ \t]+}}%sgpr = extractvalue { i32, i32 } %asm, 1


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-- -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: for function 'interp_p1_f16'
; CHECK: DIVERGENT: %p1 = call float @llvm.amdgcn.interp.p1.f16


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-- -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: DIVERGENT: %swizzle = call i32 @llvm.amdgcn.ds.swizzle(i32 %src, i32 100) #0
define amdgpu_kernel void @ds_swizzle(ptr addrspace(1) %out, i32 %src) #0 {


@ -0,0 +1,81 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
; CHECK-LABEL: UniformityInfo for function 'basic':
; CHECK: CYCLES ASSSUMED DIVERGENT:
; CHECK: depth=1: entries(P T) Q
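; The irreducible cycle with entries P and T is entered under the
; divergent branch on %cond.div, so the whole cycle is assumed
; divergent and every value defined inside it is conservatively
; marked divergent, including plain arithmetic on uniform operands.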
define amdgpu_kernel void @basic(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br i1 %cond.div, label %T, label %P
P:
; CHECK: DIVERGENT: %pp.phi =
; CHECK: DIVERGENT: %pp =
%pp.phi = phi i32 [ %a, %entry], [ %b, %T ]
%pp = add i32 %b, 1
br label %Q
Q:
; CHECK: DIVERGENT: %qq =
; CHECK: DIVERGENT: %qq.div =
%qq = add i32 %b, 1
%qq.div = add i32 %pp.phi, 1
br i1 %cond.uni, label %T, label %exit
T:
; CHECK: DIVERGENT: %t.phi =
; CHECK: DIVERGENT: %tt =
%t.phi = phi i32 [ %qq, %Q ], [ %a, %entry ]
%tt = add i32 %b, 1
br label %P
exit:
; CHECK-NOT: DIVERGENT: %ee =
%ee = add i32 %b, 1
ret void
}
; CHECK-LABEL: UniformityInfo for function 'nested':
; CHECK: CYCLES ASSSUMED DIVERGENT:
; CHECK: depth=1: entries(P T) Q A C B
define amdgpu_kernel void @nested(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br i1 %cond.div, label %T, label %P
P:
%pp.phi = phi i32 [ %a, %entry], [ %b, %T ]
%pp = add i32 %b, 1
br i1 %cond.uni, label %B, label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.uni, label %T, label %exit
A:
%aa = add i32 %b, 1
br label %B
B:
%bb = add i32 %b, 1
br label %C
C:
%cc = add i32 %b, 1
br i1 %cond.uni, label %Q, label %A
T:
%t.phi = phi i32 [ %qq, %Q ], [ %a, %entry ]
%tt = add i32 %b, 1
br i1 %cond.uni, label %A, label %P
exit:
%ee = add i32 %b, 1
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0


@ -0,0 +1,80 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
define amdgpu_kernel void @divergent_cycle_1(i32 %a, i32 %b, i32 %c) {
; CHECK-LABEL: UniformityInfo for function 'divergent_cycle_1':
; CHECK: CYCLES ASSSUMED DIVERGENT:
; CHECK: depth=1: entries(R P) S Q
; CHECK: CYCLES WITH DIVERGENT EXIT:
; CHECK: depth=2: entries(S P) Q
; CHECK: depth=1: entries(R P) S Q
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br i1 %cond.uni, label %P, label %R
P:
; CHECK: DIVERGENT: %pp.phi =
%pp.phi = phi i32 [ %a, %entry], [ %b, %S ]
%pp = add i32 %b, 1
br label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
; CHECK: DIVERGENT: %s.phi =
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %b, 1
br i1 %cond.uni, label %exit, label %P
exit:
%ee = add i32 %b, 1
ret void
}
define amdgpu_kernel void @uniform_cycle_1(i32 %a, i32 %b, i32 %c) {
; CHECK-LABEL: UniformityInfo for function 'uniform_cycle_1':
; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
; CHECK-NOT: CYCLES WITH DIVERGENT EXIT:
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br i1 %cond.uni, label %P, label %T
P:
; CHECK-NOT: DIVERGENT: %pp.phi = phi i32
%pp.phi = phi i32 [ %a, %entry], [ %b, %T ]
%pp = add i32 %b, 1
br label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
; CHECK: DIVERGENT: %s.phi =
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %b, 1
br i1 %cond.uni, label %exit, label %T
T:
%tt = add i32 %b, 1
br label %P
exit:
%ee = add i32 %b, 1
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0


@ -0,0 +1,240 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
; These tests have identical control flow graphs with slight changes
; that affect cycle-info. There is a minor functional difference in
; the branch conditions, but that is not relevant to the tests.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; The inner cycle has a header (P) that dominates the join, hence
;; both cycles are reported as converged.
;;
;; CHECK-LABEL: UniformityInfo for function 'headers_b_p':
;; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
;; CHECK-NOT: CYCLES WITH DIVERGENT EXIT:
define amdgpu_kernel void @headers_b_p(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
%a.div = add i32 %tid, %a
br i1 %cond.uni, label %B, label %A
A:
br label %B
B:
br i1 %cond.uni, label %C, label %D
C:
br i1 %cond.uni, label %T, label %P
P:
%pp.phi = phi i32 [ %a, %C], [ %b, %T ]
%pp = add i32 %b, 1
br i1 %cond.uni, label %R, label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %pp.phi, 1
br i1 %cond.uni, label %D, label %T
D:
br i1 %cond.uni, label %exit, label %A
T:
%tt.phi = phi i32 [ %ss, %S ], [ %a, %C ]
%tt = add i32 %b, 1
br label %P
exit:
%ee = add i32 %b, 1
ret void
}
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; Same as previous, but the outer cycle has a different header (A).
;; The inner cycle has a header (P) that dominates the join, hence
;; both cycles are reported as converged.
;;
;; CHECK-LABEL: UniformityInfo for function 'headers_a_p':
;; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
;; CHECK-NOT: CYCLES WITH DIVERGENT EXIT:
define amdgpu_kernel void @headers_a_p(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
%a.div = add i32 %tid, %a
br i1 %cond.uni, label %B, label %A
A:
br label %B
B:
br i1 %cond.uni, label %C, label %D
C:
br i1 %cond.uni, label %T, label %P
P:
%pp.phi = phi i32 [ %a, %C], [ %b, %T ]
%pp = add i32 %b, 1
br i1 %cond.uni, label %R, label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %pp.phi, 1
br i1 %cond.uni, label %D, label %T
D:
br i1 %cond.uni, label %exit, label %A
T:
%tt.phi = phi i32 [ %ss, %S ], [ %a, %C ]
%tt = add i32 %b, 1
br label %P
exit:
%ee = add i32 %b, 1
ret void
}
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; The inner cycle has a header (T) that does not dominate the join.
;; The outer cycle has a header (B) that dominates the join. Hence
;; only the inner cycle is reported as diverged.
;;
;; CHECK-LABEL: UniformityInfo for function 'headers_b_t':
;; CHECK: CYCLES ASSSUMED DIVERGENT:
;; CHECK: depth=2: entries(T P) S Q R
;; CHECK: CYCLES WITH DIVERGENT EXIT:
;; CHECK: depth=1: entries(B A) D T S Q P R C
define amdgpu_kernel void @headers_b_t(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
%a.div = add i32 %tid, %a
br i1 %cond.uni, label %A, label %B
A:
br label %B
B:
br i1 %cond.uni, label %C, label %D
C:
br i1 %cond.uni, label %P, label %T
P:
%pp.phi = phi i32 [ %a, %C], [ %b, %T ]
%pp = add i32 %b, 1
br i1 %cond.uni, label %R, label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %pp.phi, 1
br i1 %cond.uni, label %D, label %T
D:
br i1 %cond.uni, label %exit, label %A
T:
%tt.phi = phi i32 [ %ss, %S ], [ %a, %C ]
%tt = add i32 %b, 1
br label %P
exit:
%ee = add i32 %b, 1
ret void
}
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; The cycles have headers (A and T) that do not dominate the join.
;; Hence the outermost cycle is reported as diverged.
;;
;; CHECK-LABEL: UniformityInfo for function 'headers_a_t':
;; CHECK: CYCLES ASSSUMED DIVERGENT:
;; CHECK: depth=1: entries(A B) D T S Q P R C
;; CHECK-NOT: CYCLES WITH DIVERGENT EXIT:
define amdgpu_kernel void @headers_a_t(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
%a.div = add i32 %tid, %a
br i1 %cond.uni, label %B, label %A
A:
br label %B
B:
br i1 %cond.uni, label %C, label %D
C:
br i1 %cond.uni, label %P, label %T
P:
%pp.phi = phi i32 [ %a, %C], [ %b, %T ]
%pp = add i32 %b, 1
br i1 %cond.uni, label %R, label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %pp.phi, 1
br i1 %cond.uni, label %D, label %T
D:
br i1 %cond.uni, label %exit, label %A
T:
%tt.phi = phi i32 [ %ss, %S ], [ %a, %C ]
%tt = add i32 %b, 1
br label %P
exit:
%ee = add i32 %b, 1
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0


@ -0,0 +1,103 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
; These tests have identical control flow graphs with slight changes
; that affect cycle-info. There is a minor functional difference in
; the branch conditions, but that is not relevant to the tests.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; The cycle has a header (T) that does not dominate the join, hence
;; the entire cycle is reported as diverged.
;;
;; CHECK-LABEL: UniformityInfo for function 't_header':
;; CHECK: CYCLES ASSSUMED DIVERGENT:
;; CHECK: depth=1: entries(T P) S Q R
define amdgpu_kernel void @t_header(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
%a.div = add i32 %tid, %a
br i1 %cond.uni, label %P, label %T
P:
; CHECK: DIVERGENT: %pp.phi =
%pp.phi = phi i32 [ %a, %entry], [ %b, %T ]
%pp = add i32 %b, 1
br i1 %cond.uni, label %R, label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
; CHECK: DIVERGENT: %s.phi =
; CHECK: DIVERGENT: %ss =
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %pp.phi, 1
br i1 %cond.uni, label %exit, label %T
T:
; CHECK: DIVERGENT: %tt.phi =
%tt.phi = phi i32 [ %ss, %S ], [ %a, %entry ]
%tt = add i32 %b, 1
br label %P
exit:
%ee = add i32 %b, 1
ret void
}
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; The cycle has a header (P) that dominates the join, hence
;; the cycle is reported as converged.
;;
;; CHECK-LABEL: UniformityInfo for function 'p_header':
;; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
define amdgpu_kernel void @p_header(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br i1 %cond.uni, label %T, label %P
P:
; CHECK-NOT: DIVERGENT: %pp.phi = phi i32
%pp.phi = phi i32 [ %a, %entry], [ %b, %T ]
%pp = add i32 %b, 1
br i1 %cond.uni, label %R, label %Q
Q:
%qq = add i32 %b, 1
br i1 %cond.div, label %S, label %R
R:
%rr = add i32 %b, 1
br label %S
S:
; CHECK: DIVERGENT: %s.phi =
; CHECK-NOT: DIVERGENT: %ss = add i32
%s.phi = phi i32 [ %qq, %Q ], [ %rr, %R ]
%ss = add i32 %pp.phi, 1
br i1 %cond.uni, label %exit, label %T
T:
; CHECK-NOT: DIVERGENT: %tt.phi = phi i32
%tt.phi = phi i32 [ %ss, %S ], [ %a, %entry ]
%tt = add i32 %b, 1
br label %P
exit:
%ee = add i32 %b, 1
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0


@ -0,0 +1,139 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
; CHECK-LABEL: UniformityInfo for function 'basic':
; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
; CHECK: CYCLES WITH DIVERGENT EXIT:
; CHECK: depth=1: entries(P T) Q
define amdgpu_kernel void @basic(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br i1 %cond.uni, label %T, label %P
P:
%pp.phi.1 = phi i32 [ %a, %entry], [ %b, %T ]
%pp.phi.2 = phi i32 [ %a, %entry], [ %tt.phi, %T ]
%pp = add i32 %b, 1
br label %Q
Q:
%qq = add i32 %b, 1
%qq.div.1 = add i32 %pp.phi.2, 1
%qq.div.2 = add i32 %pp.phi.2, 1
br i1 %cond.div, label %T, label %exit
T:
%tt.phi = phi i32 [ %qq, %Q ], [ %a, %entry ]
%tt = add i32 %b, 1
br label %P
exit:
; CHECK: DIVERGENT: %ee.1 =
; CHECK: DIVERGENT: %xx.2 =
; CHECK-NOT: DIVERGENT: %ee.3 =
%ee.1 = add i32 %pp.phi.1, 1
%xx.2 = add i32 %pp.phi.2, 1
%ee.3 = add i32 %b, 1
ret void
}
; CHECK-LABEL: UniformityInfo for function 'outer_reducible':
; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
; CHECK: CYCLES WITH DIVERGENT EXIT:
; CHECK: depth=1: entries(H) P T R Q
define amdgpu_kernel void @outer_reducible(i32 %a, i32 %b, i32 %c) {
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br label %H
H:
br i1 %cond.uni, label %T, label %P
P:
%pp.phi.1 = phi i32 [ %a, %H], [ %b, %T ]
%pp.phi.2 = phi i32 [ %a, %H], [ %tt.phi, %T ]
%pp = add i32 %b, 1
br label %Q
Q:
%qq = add i32 %b, 1
%qq.div.1 = add i32 %pp.phi.2, 1
%qq.div.2 = add i32 %pp.phi.2, 1
br i1 %cond.div, label %R, label %exit
R:
br i1 %cond.uni, label %T, label %H
T:
%tt.phi = phi i32 [ %qq, %R ], [ %a, %H ]
%tt = add i32 %b, 1
br label %P
exit:
; CHECK: DIVERGENT: %ee.1 =
; CHECK: DIVERGENT: %xx.2 =
; CHECK-NOT: DIVERGENT: %ee.3 =
%ee.1 = add i32 %pp.phi.1, 1
%xx.2 = add i32 %pp.phi.2, 1
%ee.3 = add i32 %b, 1
ret void
}
; entry(div)
; | \
; H -> B
; ^ /|
; \--C |
; \|
; X
;
; This has a divergent cycle due to the external divergent branch, but
; there are no divergent exits. Hence a use at X is not divergent
; unless the def itself is divergent.
;
; CHECK-LABEL: UniformityInfo for function 'no_divergent_exit':
; CHECK: CYCLES ASSSUMED DIVERGENT:
; CHECK: depth=1: entries(H B) C
; CHECK-NOT: CYCLES WITH DIVERGENT EXIT:
define amdgpu_kernel void @no_divergent_exit(i32 %n, i32 %a, i32 %b) #0 {
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %div.cond, label %B, label %H
H: ; preds = %C, %entry
; CHECK: DIVERGENT: %div.merge.h =
%div.merge.h = phi i32 [ 0, %entry ], [ %b, %C ]
br label %B
B: ; preds = %H, %entry
; CHECK: DIVERGENT: %div.merge.b =
%div.merge.b = phi i32 [ %a, %H ], [ 1, %entry ]
; CHECK-NOT: DIVERGENT: %bb =
%bb = add i32 %a, 1
; CHECK-NOT: DIVERGENT: br i1 %uni.cond, label %X, label %C
br i1 %uni.cond, label %X, label %C
C: ; preds = %B
; CHECK-NOT: DIVERGENT: %cc =
%cc = add i32 %a, 1
; CHECK-NOT: DIVERGENT: br i1 %uni.cond, label %X, label %H
br i1 %uni.cond, label %X, label %H
; CHECK-LABEL: BLOCK X
X: ; preds = %C, %B
; CHECK: DIVERGENT: %uni.merge.x =
%uni.merge.x = phi i32 [ %bb, %B ], [%cc, %C ]
; CHECK: DIVERGENT: %div.merge.x =
%div.merge.x = phi i32 [ %div.merge.b, %B ], [%cc, %C ]
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0
attributes #0 = { nounwind readnone }


@ -1,13 +1,5 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<divergence>' -disable-output 2>&1 | FileCheck %s
; NOTE: The new pass manager does not fall back on legacy divergence
; analysis even when the function contains an irreducible loop. The
; (new) divergence analysis conservatively reports all values as
; divergent. This test does not check for this conservative
; behaviour. Instead, it only checks for the values that are known to
; be divergent according to the legacy analysis.
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
; This test contains an unstructured loop.
; +-------------- entry ----------------+
@ -21,21 +13,27 @@
; |
; V
; if (i3 == 5) // divergent
; because sync dependent on (tid / i3).
define i32 @unstructured_loop(i1 %entry_cond) {
; CHECK-LABEL: Divergence Analysis' for function 'unstructured_loop'
; CHECK-LABEL: for function 'unstructured_loop'
; CHECK: DIVERGENT: i1 %entry_cond
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
br i1 %entry_cond, label %loop_entry_1, label %loop_entry_2
loop_entry_1:
; CHECK: DIVERGENT: %i1 =
%i1 = phi i32 [ 0, %entry ], [ %i3, %loop_latch ]
%j1 = add i32 %i1, 1
br label %loop_body
loop_entry_2:
; CHECK: DIVERGENT: %i2 =
%i2 = phi i32 [ 0, %entry ], [ %i3, %loop_latch ]
%j2 = add i32 %i2, 2
br label %loop_body
loop_body:
; CHECK: DIVERGENT: %i3 =
%i3 = phi i32 [ %j1, %loop_entry_1 ], [ %j2, %loop_entry_2 ]
br label %loop_latch
loop_latch:
@ -43,9 +41,10 @@ loop_latch:
switch i32 %div, label %branch [ i32 1, label %loop_entry_1
i32 2, label %loop_entry_2 ]
branch:
; CHECK: DIVERGENT: %cmp =
; CHECK: DIVERGENT: br i1 %cmp,
%cmp = icmp eq i32 %i3, 5
br i1 %cmp, label %then, label %else
; CHECK: DIVERGENT: br i1 %cmp,
then:
ret i32 0
else:


@ -0,0 +1,574 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
define amdgpu_kernel void @cycle_diverge_enter(i32 %n, i32 %a, i32 %b) #0 {
; entry(div)
; / \
; H <-> B
; |
; X
; CHECK-LABEL: for function 'cycle_diverge_enter':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
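; The divergent branch in the entry sends threads to two different
; entries (H and B) of the irreducible cycle, so the phis at both
; entries merge values from diverged threads and are divergent, as is
; any use of them outside the cycle.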
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %div.cond, label %B, label %H ; divergent branch
H:
%div.merge.h = phi i32 [ 0, %entry ], [ %b, %B ]
br label %B
; CHECK: DIVERGENT: %div.merge.h
B:
%div.merge.b = phi i32 [ %a, %H ], [1, %entry ]
%div.cond.b = icmp sgt i32 %div.merge.b, 0
%div.b.inc = add i32 %b, 1
br i1 %div.cond, label %X, label %H ; divergent branch
; CHECK: DIVERGENT: %div.merge.b
X:
%div.use = add i32 %div.merge.b, 1
ret void
; CHECK: DIVERGENT: %div.use =
}
define amdgpu_kernel void @cycle_diverge_exit(i32 %n, i32 %a, i32 %b) #0 {
; entry
; / \
; H <-> B(div)
; |
; X
;
; CHECK-LABEL: for function 'cycle_diverge_exit':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
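; Inside the cycle every value is uniform, but the exit at B is
; divergent: threads leave the cycle on different iterations, so a use
; of %uni.merge.b outside the cycle (%div.use in X) observes values
; from different iterations and is temporally divergent.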
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %B, label %H
H:
%uni.merge.h = phi i32 [ 0, %entry ], [ %b, %B ]
br label %B
B:
%uni.merge.b = phi i32 [ %a, %H ], [1, %entry ]
%uni.cond.b = icmp sgt i32 %uni.merge.b, 0
%uni.b.inc = add i32 %b, 1
br i1 %div.cond, label %X, label %H ; divergent branch
X:
%div.use = add i32 %uni.merge.b, 1
ret void
; CHECK: DIVERGENT: %div.use =
}
define amdgpu_kernel void @cycle_reentrance(i32 %n, i32 %a, i32 %b) #0 {
; For this case, threads entering the cycle from C take C->D->H. At
; H, diverged threads may continue looping in the cycle (H-B-D) until
; all threads exit it, causing temporal divergence on the exiting edge
; H->C. We currently do not analyze this kind of inner-cycle temporal
; divergence; instead, we conservatively mark all values in the cycle
; as divergent.
; entry--\
; | |
; ---> H(div)|
; | / \ /
; | B C<--
; ^ \ /
; \----D
; |
; X
; CHECK-LABEL: for function 'cycle_reentrance':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %H, label %C
H:
%div.merge.h = phi i32 [ 0, %entry ], [ %b, %D ]
br i1 %div.cond, label %B, label %C ; divergent branch
B:
%div.inc.b = add i32 %div.merge.h, 1
; CHECK: DIVERGENT: %div.inc.b
br label %D
C:
%div.merge.c = phi i32 [0, %entry], [%div.merge.h, %H]
%div.inc.c = add i32 %div.merge.c, 2
; CHECK: DIVERGENT: %div.inc.c
br label %D
D:
%div.merge.d = phi i32 [ %div.inc.b, %B ], [ %div.inc.c, %C ]
; CHECK: DIVERGENT: %div.merge.d
br i1 %uni.cond, label %X, label %H
X:
ret void
}
define amdgpu_kernel void @cycle_reentrance2(i32 %n, i32 %a, i32 %b) #0 {
; This is mostly the same as cycle_reentrance; the only difference is
; the successor order, and thus a different DFS visiting order. This
; just makes sure that uniformity analysis is correct under a
; different DFS order.
; entry--\
; | |
; ---> H(div)|
; | / \ /
; | B C<--
; ^ \ /
; \----D
; |
; X
; CHECK-LABEL: for function 'cycle_reentrance2':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %C, label %H
H:
%div.merge.h = phi i32 [ 0, %entry ], [ %b, %D ]
br i1 %div.cond, label %B, label %C ; divergent branch
B:
%div.inc.b = add i32 %div.merge.h, 1
; CHECK: DIVERGENT: %div.inc.b
br label %D
C:
%div.merge.c = phi i32 [0, %entry], [%div.merge.h, %H]
%div.inc.c = add i32 %div.merge.c, 2
; CHECK: DIVERGENT: %div.inc.c
br label %D
D:
%div.merge.d = phi i32 [ %div.inc.b, %B ], [ %div.inc.c, %C ]
; CHECK: DIVERGENT: %div.merge.d
br i1 %uni.cond, label %X, label %H
X:
ret void
}
define amdgpu_kernel void @cycle_join_dominated_by_diverge(i32 %n, i32 %a, i32 %b) #0 {
; the join-node D is dominated by diverge point H2
; entry
; | |
; --> H1 |
; | \|
; | H2(div)
; | / \
; | B C
; ^ \ /
; \------D
; |
; X
; CHECK-LABEL: for function 'cycle_join_dominated_by_diverge':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
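; Because the join D is dominated by H2, the block with the divergent
; branch, the divergence is contained: only the phi at D is divergent,
; and the cycle itself remains uniform.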
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %H1, label %H2
H1:
%uni.merge.h1 = phi i32 [ 0, %entry ], [ %b, %D ]
br label %H2
H2:
%uni.merge.h2 = phi i32 [ 0, %entry ], [ %b, %H1 ]
br i1 %div.cond, label %B, label %C ; divergent branch
B:
%uni.inc.b = add i32 %uni.merge.h2, 1
br label %D
C:
%uni.inc.c = add i32 %uni.merge.h2, 2
br label %D
D:
%div.merge.d = phi i32 [ %uni.inc.b, %B ], [ %uni.inc.c, %C ]
; CHECK: DIVERGENT: %div.merge.d
br i1 %uni.cond, label %X, label %H1
X:
ret void
}
define amdgpu_kernel void @cycle_join_dominated_by_entry(i32 %n, i32 %a, i32 %b) #0 {
; the join-node D is dominated by cycle entry H2
; entry
; | |
; --> H1 |
; | \|
; | H2 -----
; | | |
; | A(div) |
; | / \ v
; | B C /
; ^ \ / /
; \------D <-/
; |
; X
; CHECK-LABEL: for function 'cycle_join_dominated_by_entry':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %H1, label %H2
H1:
%uni.merge.h1 = phi i32 [ 0, %entry ], [ %b, %D ]
br label %H2
H2:
%uni.merge.h2 = phi i32 [ 0, %entry ], [ %b, %H1 ]
br i1 %uni.cond, label %A, label %D
A:
br i1 %div.cond, label %B, label %C ; divergent branch
B:
%uni.inc.b = add i32 %uni.merge.h2, 1
br label %D
C:
%uni.inc.c = add i32 %uni.merge.h2, 2
br label %D
D:
%div.merge.d = phi i32 [ %uni.inc.b, %B ], [ %uni.inc.c, %C ], [%uni.merge.h2, %H2]
; CHECK: DIVERGENT: %div.merge.d
br i1 %uni.cond, label %X, label %H1
X:
ret void
}
define amdgpu_kernel void @cycle_join_not_dominated(i32 %n, i32 %a, i32 %b) #0 {
; If H is the header, sync label propagation may stop at the join node
; D. But D is dominated neither by C, the block where the divergence
; starts, nor by any of the entries (H/C). So we conservatively mark
; all values in the cycle as divergent for now.
; entry
; | |
; ---> H |
; | / \ v
; | B<--C(div)
; ^ \ /
; \----D
; |
; X
; CHECK-LABEL: for function 'cycle_join_not_dominated':
; CHECK-NOT: DIVERGENT: %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %C, label %H
H:
%div.merge.h = phi i32 [ 0, %entry ], [ %b, %D ]
br i1 %uni.cond, label %B, label %C
B:
%div.merge.b = phi i32 [ 0, %H ], [ %b, %C ]
%div.inc.b = add i32 %div.merge.b, 1
; CHECK: DIVERGENT: %div.inc.b
br label %D
C:
%div.merge.c = phi i32 [0, %entry], [%div.merge.h, %H]
%div.inc.c = add i32 %div.merge.c, 2
; CHECK: DIVERGENT: %div.inc.c
br i1 %div.cond, label %D, label %B ; divergent branch
D:
%div.merge.d = phi i32 [ %div.inc.b, %B ], [ %div.inc.c, %C ]
; CHECK: DIVERGENT: %div.merge.d
br i1 %uni.cond, label %X, label %H
X:
ret void
}
define amdgpu_kernel void @cycle_join_not_dominated2(i32 %n, i32 %a, i32 %b) #0 {
; This is mostly the same as cycle_join_not_dominated; the only
; difference is the DFS visiting order, so the cycle analysis result
; differs.
; entry
; | |
; ---> H |
; | / \ v
; | B<--C(div)
; ^ \ /
; \----D
; |
; X
; CHECK-LABEL: for function 'cycle_join_not_dominated2':
; CHECK-NOT: DIVERGENT: %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %H, label %C
H:
%div.merge.h = phi i32 [ 0, %entry ], [ %b, %D ]
br i1 %uni.cond, label %B, label %C
B:
%div.merge.b = phi i32 [ 0, %H ], [ %b, %C ]
%div.inc.b = add i32 %div.merge.b, 1
; CHECK: DIVERGENT: %div.inc.b
br label %D
C:
%div.merge.c = phi i32 [0, %entry], [%div.merge.h, %H]
%div.inc.c = add i32 %div.merge.c, 2
; CHECK: DIVERGENT: %div.inc.c
br i1 %div.cond, label %D, label %B ; divergent branch
D:
%div.merge.d = phi i32 [ %div.inc.b, %B ], [ %div.inc.c, %C ]
; CHECK: DIVERGENT: %div.merge.d
br i1 %uni.cond, label %X, label %H
X:
ret void
}
define amdgpu_kernel void @natural_loop_two_backedges(i32 %n, i32 %a, i32 %b) #0 {
; FIXME: the uni.merge.h can be viewed as uniform.
; CHECK-LABEL: for function 'natural_loop_two_backedges':
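; Both backedges B->H and C->H carry the same value %uni.inc from the
; same iteration, so %uni.merge.h could be proven uniform; presumably
; the analysis is conservative here because the divergent branch in B
; chooses between the two backedges.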
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br label %H
H:
%uni.merge.h = phi i32 [ 0, %entry ], [ %uni.inc, %B ], [ %uni.inc, %C]
%uni.inc = add i32 %uni.merge.h, 1
br label %B
B:
br i1 %div.cond, label %C, label %H
C:
br i1 %uni.cond, label %X, label %H
X:
ret void
}
define amdgpu_kernel void @natural_loop_two_backedges2(i32 %n, i32 %a, i32 %b) #0 {
; FIXME: the uni.merge.h can be viewed as uniform.
; CHECK-LABEL: for function 'natural_loop_two_backedges2':
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br label %H
H:
%uni.merge.h = phi i32 [ 0, %entry ], [ %uni.inc, %B ], [ %uni.inc, %C]
%uni.inc = add i32 %uni.merge.h, 1
br i1 %uni.cond, label %B, label %D
B:
br i1 %div.cond, label %C, label %H
C:
br label %H
D:
br i1 %uni.cond, label %B, label %X
X:
ret void
}
define amdgpu_kernel void @cycle_enter_nested(i32 %n, i32 %a, i32 %b) #0 {
;
; entry(div)
; | \
; --> H1 |
; / | |
; | -> H2 |
; | | | /
; | \--B <--
; ^ |
; \----C
; |
; X
; CHECK-LABEL: for function 'cycle_enter_nested':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %div.cond, label %B, label %H1
H1:
%div.merge.h1 = phi i32 [ 1, %entry ], [ %b, %C ]
br label %H2
; CHECK: DIVERGENT: %div.merge.h1
H2:
%div.merge.h2 = phi i32 [ 2, %B ], [ %a, %H1 ]
; CHECK: DIVERGENT: %div.merge.h2
br label %B
B:
%div.merge.b = phi i32 [0, %entry], [%a, %H2]
; CHECK: DIVERGENT: %div.merge.b
br i1 %uni.cond, label %C, label %H2
C:
br i1 %uni.cond, label %X, label %H1
X:
ret void
}
define amdgpu_kernel void @cycle_inner_exit_enter(i32 %n, i32 %a, i32 %b) #0 {
; entry
; / \
; E1-> A-> E2
; ^ | \
; | E3-> E4
; | ^ /
; | \ /
; C <----B
; |
; X
; CHECK-LABEL: for function 'cycle_inner_exit_enter':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %E2, label %E1
E1:
%div.merge.e1 = phi i32 [ 1, %entry ], [ %b, %C ]
br label %A
; CHECK: DIVERGENT: %div.merge.e1
A:
br i1 %uni.cond, label %E2, label %E3
E2:
%div.merge.e2 = phi i32 [ 2, %entry ], [ %a, %A ]
; CHECK: DIVERGENT: %div.merge.e2
br label %E4
E3:
%div.merge.e3 = phi i32 [ 0, %A ], [ %b, %B ]
; CHECK: DIVERGENT: %div.merge.e3
br label %E4
E4:
%div.merge.e4 = phi i32 [ 0, %E2 ], [ %a, %E3 ]
; CHECK: DIVERGENT: %div.merge.e4
br label %B
B:
br i1 %div.cond, label %C, label %E3
C:
br i1 %uni.cond, label %X, label %E1
X:
ret void
}
define amdgpu_kernel void @cycle_inner_exit_enter2(i32 %n, i32 %a, i32 %b) #0 {
; This case is almost the same as cycle_inner_exit_enter, with only different
; dfs visiting order, thus different cycle hierarchy.
; entry
; / \
; E1-> A-> E2
; ^ | \
; | E3-> E4
; | ^ /
; | \ /
; C <----B
; |
; X
; CHECK-LABEL: for function 'cycle_inner_exit_enter2':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
entry:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%uni.cond = icmp slt i32 %a, 0
br i1 %uni.cond, label %E1, label %E2
E1:
%div.merge.e1 = phi i32 [ 1, %entry ], [ %b, %C ]
br label %A
; CHECK: DIVERGENT: %div.merge.e1
A:
br i1 %uni.cond, label %E2, label %E3
E2:
%div.merge.e2 = phi i32 [ 2, %entry ], [ %a, %A ]
; CHECK: DIVERGENT: %div.merge.e2
br label %E4
E3:
%div.merge.e3 = phi i32 [ 0, %A ], [ %b, %B ]
; CHECK: DIVERGENT: %div.merge.e3
br label %E4
E4:
%div.merge.e4 = phi i32 [ 0, %E2 ], [ %a, %E3 ]
; CHECK: DIVERGENT: %div.merge.e4
br label %B
B:
br i1 %div.cond, label %C, label %E3
C:
br i1 %uni.cond, label %X, label %E1
X:
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0
attributes #0 = { nounwind readnone }


@ -0,0 +1,225 @@
; RUN: opt %s -mtriple amdgcn-- -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
;
; Entry
; |
; v
; -------->H---------
; | | |
; | v |
; | --->T---- |
; | | | |
; | | V |
; S<---R P <---
; ^ ^ |
; | | Div |
; | --- Q <--
; | |
; | v
; -------- U
; |
; v
; Exit
;
; The divergent branch at Q exits an irreducible cycle with entries
; T and P, nested inside a reducible cycle with header H. R is
; assigned label R, which reaches P. S is a join node with label S. If
; this is propagated to P via H, then P is incorrectly recognized as a
; join, making the inner cycle divergent. P is always executed
; convergently -- either by threads that reconverged at header H, or
; by threads that are still executing the inner cycle. Thus, any PHI
; at P should not be marked divergent.
define amdgpu_kernel void @nested_irreducible(i32 %a, i32 %b, i32 %c) {
; CHECK-LABEL: UniformityInfo for function 'nested_irreducible':
; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
; CHECK: CYCLES WITH DIVERGENT EXIT:
; CHECK: depth=2: entries(P T) R Q
; CHECK: depth=1: entries(H) S P T R Q U
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br label %H
H:
br i1 %cond.uni, label %T, label %P
P:
; CHECK-LABEL: BLOCK P
; CHECK-NOT: DIVERGENT: %pp.phi =
; CHECK-NOT: DIVERGENT: %pp =
%pp.phi = phi i32 [ %a, %H], [ %b, %T ]
%pp = add i32 %b, 1
br label %Q
Q:
; CHECK-LABEL: BLOCK Q
; CHECK-NOT: DIVERGENT: %qq =
; CHECK-NOT: DIVERGENT: %qq.uni =
%qq = add i32 %b, 1
%qq.uni = add i32 %pp.phi, 1
br i1 %cond.div, label %R, label %U
R:
br i1 %cond.uni, label %S, label %T
T:
; CHECK-LABEL: BLOCK T
; CHECK-NOT: DIVERGENT: %tt.phi =
; CHECK-NOT: DIVERGENT: %tt =
%tt.phi = phi i32 [ %qq, %R ], [ %a, %H ]
%tt = add i32 %b, 1
br label %P
S:
; CHECK-LABEL: BLOCK S
; CHECK: DIVERGENT: %ss.phi =
; CHECK-NOT: DIVERGENT: %ss =
%ss.phi = phi i32 [ %qq.uni, %U ], [ %a, %R ]
%ss = add i32 %b, 1
br label %H
U:
br i1 %cond.uni, label %S, label %exit
exit:
; CHECK: DIVERGENT: %ee.div =
; CHECK-NOT: DIVERGENT: %ee =
%ee.div = add i32 %qq.uni, 1
%ee = add i32 %b, 1
ret void
}
;
; Entry
; |
; v
; -->-------->H---------
; | ^ | |
; | | | |
; | | | |
; | | | |
; | | v V
; | R<-------T-->U--->P
; | Div |
; | |
; ----------- Q <-------
; |
; v
; Exit
;
; This is a reducible cycle with a divergent branch at T. Disjoint
; paths eventually join at the header H, which is assigned label H.
; Node P is assigned label U. If the header label were propagated to
; P, it would be incorrectly recognized as a join. P is always executed
; convergently -- either by threads that reconverged at header H, or
; by threads that diverged at T (and eventually reconverged at H).
; Thus, any PHI at P should not be marked divergent.
define amdgpu_kernel void @header_label_1(i32 %a, i32 %b, i32 %c) {
; CHECK-LABEL: UniformityInfo for function 'header_label_1':
; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
; CHECK: CYCLES WITH DIVERGENT EXIT:
; CHECK: depth=1: entries(H) Q P U T R
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br label %H
H:
br i1 %cond.uni, label %T, label %P
P:
; CHECK-LABEL: BLOCK P
; CHECK-NOT: DIVERGENT: %pp.phi =
; CHECK-NOT: DIVERGENT: %pp =
%pp.phi = phi i32 [ %a, %H], [ %b, %U ]
%pp = add i32 %b, 1
br label %Q
Q:
; CHECK-LABEL: BLOCK Q
; CHECK-NOT: DIVERGENT: %qq =
; CHECK-NOT: DIVERGENT: %qq.uni =
%qq = add i32 %b, 1
%qq.uni = add i32 %pp.phi, 1
br i1 %cond.uni, label %exit, label %H
R:
br label %H
T:
br i1 %cond.div, label %R, label %U
U:
br label %P
exit:
; CHECK-LABEL: BLOCK exit
; CHECK: DIVERGENT: %ee.div =
; CHECK-NOT: DIVERGENT: %ee =
%ee.div = add i32 %qq.uni, 1
%ee = add i32 %b, 1
ret void
}
; entry
; |
; --> H1
; | | \
; | | H2(div)
; | \ / \
; | B C
; ^ \ /
; \------D
; |
; X
;
; This is a reducible cycle with a divergent branch at H2. Disjoint
; paths eventually join at D, which is assigned label D. Node B is
; assigned label B. If the label D were propagated to B through the
; header H1, B would be incorrectly recognized as a join. B is always
; executed convergently -- either by threads that reconverged at
; header H1, or by threads that diverged at H2 (and eventually
; reconverged at H1). Thus, any PHI at B should not be marked
; divergent.
define amdgpu_kernel void @header_label_2(i32 %a, i32 %b, i32 %c) {
; CHECK-LABEL: UniformityInfo for function 'header_label_2':
; CHECK-NOT: CYCLES ASSSUMED DIVERGENT:
; CHECK-NOT: CYCLES WITH DIVERGENT EXIT:
entry:
%cond.uni = icmp slt i32 %a, 0
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%cond.div = icmp slt i32 %tid, 0
br label %H1
H1:
br i1 %cond.uni, label %B, label %H2
H2:
br i1 %cond.div, label %B, label %C
B:
; CHECK-LABEL: BLOCK B
; CHECK-NOT: DIVERGENT: %bb.phi =
%bb.phi = phi i32 [%a, %H1], [%b, %H2]
br label %D
C:
br label %D
D:
; CHECK-LABEL: BLOCK D
; CHECK: DIVERGENT: %dd.phi =
%dd.phi = phi i32 [%a, %B], [%b, %C]
br i1 %cond.uni, label %exit, label %H1
exit:
%ee.1 = add i32 %dd.phi, 1
%ee.2 = add i32 %b, 1
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0


@ -1,6 +1,6 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: bb3:
; CHECK: DIVERGENT: %Guard.bb4 = phi i1 [ true, %bb1 ], [ false, %bb2 ]
; CHECK: DIVERGENT: br i1 %Guard.bb4, label %bb4, label %bb5


@ -0,0 +1,44 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: DIVERGENT: %phi.h = phi i32 [ 0, %entry ], [ %inc, %C ], [ %inc, %D ], [ %inc, %E ]
; CHECK: DIVERGENT: %tid = call i32 @llvm.amdgcn.workitem.id.x()
; CHECK: DIVERGENT: %div.cond = icmp slt i32 %tid, 0
; CHECK: DIVERGENT: %inc = add i32 %phi.h, 1
; CHECK: DIVERGENT: br i1 %div.cond, label %C, label %D
define void @nested_loop_extension() {
entry:
%anchor = call token @llvm.experimental.convergence.anchor()
br label %A
A:
%phi.h = phi i32 [ 0, %entry ], [ %inc, %C ], [ %inc, %D ], [ %inc, %E ]
br label %B
B:
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%div.cond = icmp slt i32 %tid, 0
%inc = add i32 %phi.h, 1
br i1 %div.cond, label %C, label %D
C:
br i1 undef, label %A, label %E
D:
br i1 undef, label %A, label %E
E:
%b = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
br i1 undef, label %A, label %F
F:
ret void
}
declare i32 @llvm.amdgcn.workitem.id.x() #0
declare token @llvm.experimental.convergence.anchor()
declare token @llvm.experimental.convergence.loop()
attributes #0 = { nounwind readnone }


@ -1,18 +1,19 @@
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-- -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK-LABEL: Divergence Analysis' for function 'test_amdgpu_ps':
; CHECK: DIVERGENT: ptr addrspace(4) %arg0
; CHECK-LABEL: for function 'test_amdgpu_ps':
; CHECK-DAG: DIVERGENT: ptr addrspace(4) %arg0
; CHECK-DAG: DIVERGENT: <2 x i32> %arg3
; CHECK-DAG: DIVERGENT: <3 x i32> %arg4
; CHECK-DAG: DIVERGENT: float %arg5
; CHECK-DAG: DIVERGENT: i32 %arg6
; CHECK-NOT: DIVERGENT
; CHECK: DIVERGENT: <2 x i32> %arg3
; CHECK: DIVERGENT: <3 x i32> %arg4
; CHECK: DIVERGENT: float %arg5
; CHECK: DIVERGENT: i32 %arg6
define amdgpu_ps void @test_amdgpu_ps(ptr addrspace(4) byref([4 x <16 x i8>]) %arg0, float inreg %arg1, i32 inreg %arg2, <2 x i32> %arg3, <3 x i32> %arg4, float %arg5, i32 %arg6) #0 {
ret void
}
; CHECK-LABEL: Divergence Analysis' for function 'test_amdgpu_kernel':
; CHECK-LABEL: for function 'test_amdgpu_kernel':
; CHECK-NOT: %arg0
; CHECK-NOT: %arg1
; CHECK-NOT: %arg2
@ -24,7 +25,7 @@ define amdgpu_kernel void @test_amdgpu_kernel(ptr addrspace(4) byref([4 x <16 x
ret void
}
; CHECK-LABEL: Divergence Analysis' for function 'test_c':
; CHECK-LABEL: for function 'test_c':
; CHECK: DIVERGENT:
; CHECK: DIVERGENT:
; CHECK: DIVERGENT:


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-mesa-mesa3d -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-mesa-mesa3d -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
;CHECK: DIVERGENT: %orig = call i32 @llvm.amdgcn.buffer.atomic.swap.i32(
define float @buffer_atomic_swap(<4 x i32> inreg %rsrc, i32 inreg %data) #0 {


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-mesa-mesa3d -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-mesa-mesa3d -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
;CHECK: DIVERGENT: %orig = call i32 @llvm.amdgcn.image.atomic.swap.1d.i32.i32(
define float @image_atomic_swap(<8 x i32> inreg %rsrc, i32 inreg %addr, i32 inreg %data) #0 {


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-- -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: DIVERGENT: %tmp5 = getelementptr inbounds float, ptr addrspace(1) %arg, i64 %tmp2
; CHECK: DIVERGENT: %tmp10 = load volatile float, ptr addrspace(1) %tmp5, align 4


@ -1,14 +1,18 @@
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s --check-prefixes=CHECK,LOOPDA
; RUN: opt -mtriple amdgcn-- -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s --check-prefixes=CHECK,CYCLEDA
; CHECK-LABEL: 'test1':
; CHECK-NEXT: DIVERGENT: i32 %bound
; CHECK: {{^ *}}%counter =
; CHECK: DIVERGENT: i32 %bound
; CYCLEDA: DIVERGENT: %counter =
; LOOPDA: {{^ *}} %counter =
; CHECK-NEXT: DIVERGENT: %break = icmp sge i32 %counter, %bound
; CHECK-NEXT: DIVERGENT: br i1 %break, label %footer, label %body
; CHECK: {{^ *}}%counter.next =
; CHECK: {{^ *}}%counter.footer =
; CHECK: DIVERGENT: br i1 %break, label %end, label %header
; CYCLEDA: DIVERGENT: %counter.next =
; CYCLEDA: DIVERGENT: %counter.footer =
; LOOPDA: {{^ *}}%counter.next =
; LOOPDA: {{^ *}}%counter.footer =
; Note: %counter is not divergent!
define amdgpu_ps void @test1(i32 %bound) {
entry:
br label %header


@ -1,8 +1,7 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: bb6:
; CHECK: DIVERGENT: %.126.i355.i = phi i1 [ false, %bb5 ], [ true, %bb4 ]
; CHECK: DIVERGENT: br i1 %.126.i355.i, label %bb7, label %bb8
; Function Attrs: nounwind readnone speculatable
declare i32 @llvm.amdgcn.workitem.id.x() #0


@ -1,8 +1,9 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; temporal-divergent use of value carried by divergent loop
define amdgpu_kernel void @temporal_diverge(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'temporal_diverge':
; CHECK-LABEL: for function 'temporal_diverge':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -26,7 +27,7 @@ X:
; temporal-divergent use of value carried by divergent loop inside a top-level loop
define amdgpu_kernel void @temporal_diverge_inloop(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'temporal_diverge_inloop':
; CHECK-LABEL: for function 'temporal_diverge_inloop':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -58,7 +59,7 @@ Y:
; temporal-uniform use of a value; definition and users are carried by a surrounding divergent loop
define amdgpu_kernel void @temporal_uniform_indivloop(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'temporal_uniform_indivloop':
; CHECK-LABEL: for function 'temporal_uniform_indivloop':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -90,7 +91,7 @@ Y:
; temporal-divergent use of value carried by divergent loop, user is inside sibling loop
define amdgpu_kernel void @temporal_diverge_loopuser(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'temporal_diverge_loopuser':
; CHECK-LABEL: for function 'temporal_diverge_loopuser':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.
@ -120,7 +121,7 @@ Y:
; temporal-divergent use of value carried by divergent loop, user is inside sibling loop, defs and use are carried by a uniform loop
define amdgpu_kernel void @temporal_diverge_loopuser_nested(i32 %n, i32 %a, i32 %b) #0 {
; CHECK-LABEL: Divergence Analysis' for function 'temporal_diverge_loopuser_nested':
; CHECK-LABEL: for function 'temporal_diverge_loopuser_nested':
; CHECK-NOT: DIVERGENT: %uni.
; CHECK-NOT: DIVERGENT: br i1 %uni.


@ -1,6 +1,6 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: bb2:
; CHECK-NOT: DIVERGENT: %Guard.bb2 = phi i1 [ true, %bb1 ], [ false, %bb0 ]
; Function Attrs: nounwind readnone speculatable


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-- -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-- -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
; CHECK: DIVERGENT: %tmp = cmpxchg volatile
define amdgpu_kernel void @unreachable_loop(i32 %tidx) #0 {


@ -1,4 +1,5 @@
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<divergence>' -disable-output %s 2>&1 | FileCheck %s
; RUN: opt -mtriple amdgcn-unknown-amdhsa -passes='print<uniformity>' -disable-output %s 2>&1 | FileCheck %s
declare i32 @llvm.amdgcn.workitem.id.x() #0
declare i32 @llvm.amdgcn.workitem.id.y() #0


@ -1,14 +1,16 @@
; RUN: opt %s -passes='print<divergence>' -disable-output 2>&1 | FileCheck %s
; RUN: opt %s -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
define i32 @daorder(i32 %n) {
; CHECK-LABEL: Divergence Analysis' for function 'daorder'
; CHECK-LABEL: for function 'daorder'
entry:
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%cond = icmp slt i32 %tid, 0
br i1 %cond, label %A, label %B ; divergent
; CHECK: DIVERGENT: %cond =
; CHECK: DIVERGENT: br i1 %cond,
A:
%defAtA = add i32 %n, 1 ; uniform


@ -1,15 +1,17 @@
; RUN: opt %s -passes='print<divergence>' -disable-output 2>&1 | FileCheck %s
; RUN: opt %s -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; return (n < 0 ? a + threadIdx.x : b + threadIdx.x)
define i32 @no_diverge(i32 %n, i32 %a, i32 %b) {
; CHECK-LABEL: Divergence Analysis' for function 'no_diverge'
; CHECK-LABEL: for function 'no_diverge'
entry:
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%cond = icmp slt i32 %n, 0
br i1 %cond, label %then, label %else ; uniform
; CHECK-NOT: DIVERGENT: %cond =
; CHECK-NOT: DIVERGENT: br i1 %cond,
then:
%a1 = add i32 %a, %tid
@ -27,11 +29,12 @@ merge:
; c = b;
; return c; // c is divergent: sync dependent
define i32 @sync(i32 %a, i32 %b) {
; CHECK-LABEL: Divergence Analysis' for function 'sync'
; CHECK-LABEL: for function 'sync'
bb1:
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()
%cond = icmp slt i32 %tid, 5
br i1 %cond, label %bb2, label %bb3
; CHECK: DIVERGENT: %cond =
; CHECK: DIVERGENT: br i1 %cond,
bb2:
br label %bb3
@ -48,11 +51,12 @@ bb3:
; // c here is divergent because it is sync dependent on threadIdx.x >= 5
; return c;
define i32 @mixed(i32 %n, i32 %a, i32 %b) {
; CHECK-LABEL: Divergence Analysis' for function 'mixed'
; CHECK-LABEL: for function 'mixed'
bb1:
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.z()
%cond = icmp slt i32 %tid, 5
br i1 %cond, label %bb6, label %bb2
; CHECK: DIVERGENT: %cond =
; CHECK: DIVERGENT: br i1 %cond,
bb2:
%cond2 = icmp slt i32 %n, 0
@ -73,13 +77,14 @@ bb6:
; We conservatively treat all parameters of a __device__ function as divergent.
define i32 @device(i32 %n, i32 %a, i32 %b) {
; CHECK-LABEL: Divergence Analysis' for function 'device'
; CHECK: DIVERGENT: i32 %n
; CHECK: DIVERGENT: i32 %a
; CHECK: DIVERGENT: i32 %b
; CHECK-LABEL: for function 'device'
; CHECK-DAG: DIVERGENT: i32 %n
; CHECK-DAG: DIVERGENT: i32 %a
; CHECK-DAG: DIVERGENT: i32 %b
entry:
%cond = icmp slt i32 %n, 0
br i1 %cond, label %then, label %else
; CHECK: DIVERGENT: %cond =
; CHECK: DIVERGENT: br i1 %cond,
then:
br label %merge
@ -98,7 +103,7 @@ merge:
;
; The i defined in the loop is used outside.
define i32 @loop() {
; CHECK-LABEL: Divergence Analysis' for function 'loop'
; CHECK-LABEL: for function 'loop'
entry:
%laneid = call i32 @llvm.nvvm.read.ptx.sreg.laneid()
br label %loop
@ -111,6 +116,7 @@ loop:
loop_exit:
%cond = icmp eq i32 %i, 10
br i1 %cond, label %then, label %else
; CHECK: DIVERGENT: %cond =
; CHECK: DIVERGENT: br i1 %cond,
then:
ret i32 0
@ -120,7 +126,7 @@ else:
; Same as @loop, but the loop is in the LCSSA form.
define i32 @lcssa() {
; CHECK-LABEL: Divergence Analysis' for function 'lcssa'
; CHECK-LABEL: for function 'lcssa'
entry:
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
br label %loop
@ -135,6 +141,7 @@ loop_exit:
; CHECK: DIVERGENT: %i.lcssa =
%cond = icmp eq i32 %i.lcssa, 10
br i1 %cond, label %then, label %else
; CHECK: DIVERGENT: %cond =
; CHECK: DIVERGENT: br i1 %cond,
then:
ret i32 0
@ -144,6 +151,7 @@ else:
; Verifies sync-dependence is computed correctly in the absence of loops.
define i32 @sync_no_loop(i32 %arg) {
; CHECK-LABEL: for function 'sync_no_loop'
entry:
%0 = add i32 %arg, 1
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()


@ -1,18 +1,21 @@
; RUN: opt %s -passes='print<divergence>' -disable-output 2>&1 | FileCheck %s
; RUN: opt %s -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
define i32 @hidden_diverge(i32 %n, i32 %a, i32 %b) {
; CHECK-LABEL: Divergence Analysis' for function 'hidden_diverge'
; CHECK-LABEL: for function 'hidden_diverge'
entry:
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%cond.var = icmp slt i32 %tid, 0
br i1 %cond.var, label %B, label %C ; divergent
; CHECK: DIVERGENT: %cond.var =
; CHECK: DIVERGENT: br i1 %cond.var,
B:
%cond.uni = icmp slt i32 %n, 0
br i1 %cond.uni, label %C, label %merge ; uniform
; CHECK-NOT: DIVERGENT: %cond.uni =
; CHECK-NOT: DIVERGENT: br i1 %cond.uni,
C:
%phi.var.hidden = phi i32 [ 1, %entry ], [ 2, %B ]


@ -1,4 +1,5 @@
; RUN: opt %s -passes='print<divergence>' -disable-output 2>&1 | FileCheck %s
; RUN: opt %s -passes='print<uniformity>' -disable-output 2>&1 | FileCheck %s
; NOTE: The new pass manager does not fall back on legacy divergence
; analysis even when the function contains an irreducible loop. The
@ -24,7 +25,7 @@ target triple = "nvptx64-nvidia-cuda"
; if (i3 == 5) // divergent
; because sync dependent on (tid / i3).
define i32 @unstructured_loop(i1 %entry_cond) {
; CHECK-LABEL: Divergence Analysis' for function 'unstructured_loop'
; CHECK-LABEL: for function 'unstructured_loop'
entry:
%tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
br i1 %entry_cond, label %loop_entry_1, label %loop_entry_2
@ -46,6 +47,7 @@ loop_latch:
branch:
%cmp = icmp eq i32 %i3, 5
br i1 %cmp, label %then, label %else
; CHECK: DIVERGENT: %cmp =
; CHECK: DIVERGENT: br i1 %cmp,
then:
ret i32 0