I really needed this, like, factually, yesterday,
when verifying dependency breaking idioms for AMD Zen 3 scheduler model.
Consider the following example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error: ''
info: ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...
```
What does it tell us?
So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle?
That doesn't seem right. That's even less than there are pipes supporting this type of op.
Now, second example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
Now that's just worse. Due to the looping, the throughput completely plummeted,
and now we can only do a single instruction/cycle!?
That's not great.
And final example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
So if we merge the previous two approaches, do duplicate this single-instruction snippet 1000x
(loop-body-size/instruction count in snippet), and run a loop with 1000 iterations
over that duplicated/unrolled snippet, the measured throughput goes through the roof,
up to 5.9 instructions/cycle, which finally tells us that this idiom is zero-cycle!
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D102522
55 lines
1.7 KiB
C++
55 lines
1.7 KiB
C++
//===-- SnippetRepetitor.h --------------------------------------*- C++ -*-===//
|
|
//
|
|
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
|
// See https://llvm.org/LICENSE.txt for license information.
|
|
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
//
|
|
//===----------------------------------------------------------------------===//
|
|
///
|
|
/// \file
|
|
/// Defines helpers to fill functions with repetitions of a snippet.
|
|
///
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
#ifndef LLVM_TOOLS_LLVM_EXEGESIS_FUNCTIONFILLER_H
|
|
#define LLVM_TOOLS_LLVM_EXEGESIS_FUNCTIONFILLER_H
|
|
|
|
#include "Assembler.h"
|
|
#include "BenchmarkResult.h"
|
|
#include "LlvmState.h"
|
|
#include "llvm/ADT/BitVector.h"
|
|
#include "llvm/CodeGen/MachineFunction.h"
|
|
#include "llvm/MC/MCInst.h"
|
|
#include "llvm/MC/MCInstrInfo.h"
|
|
#include "llvm/Object/Binary.h"
|
|
|
|
namespace llvm {
|
|
namespace exegesis {
|
|
|
|
class SnippetRepetitor {
|
|
public:
|
|
static std::unique_ptr<const SnippetRepetitor>
|
|
Create(InstructionBenchmark::RepetitionModeE Mode, const LLVMState &State);
|
|
|
|
virtual ~SnippetRepetitor();
|
|
|
|
// Returns the set of registers that are reserved by the repetitor.
|
|
virtual BitVector getReservedRegs() const = 0;
|
|
|
|
// Returns a functor that repeats `Instructions` so that the function executes
|
|
// at least `MinInstructions` instructions.
|
|
virtual FillFunction Repeat(ArrayRef<MCInst> Instructions,
|
|
unsigned MinInstructions,
|
|
unsigned LoopBodySize) const = 0;
|
|
|
|
explicit SnippetRepetitor(const LLVMState &State) : State(State) {}
|
|
|
|
protected:
|
|
const LLVMState &State;
|
|
};
|
|
|
|
} // namespace exegesis
|
|
} // namespace llvm
|
|
|
|
#endif
|