Roman Lebedev 78eaff2ef8
[llvm-exegesis] Loop unrolling for loop snippet repetitor mode
I really needed this, like, factually, yesterday,
when verifying dependency breaking idioms for AMD Zen 3 scheduler model.

Consider the following example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error:           ''
info:            ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...

```
What does it tell us?
So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle?
That doesn't seem right. That's even less than there are pipes supporting this type of op.

Now, second example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
Now that's just worse. Due to the looping, the throughput completely plummeted,
and now we can only do a single instruction/cycle!?

That's not great.
And final example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```

So if we merge the previous two approaches, do duplicate this single-instruction snippet 1000x
(loop-body-size/instruction count in snippet), and run a loop with 1000 iterations
over that duplicated/unrolled snippet, the measured throughput goes through the roof,
up to 5.9 instructions/cycle, which finally tells us that this idiom is zero-cycle!

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D102522
2021-05-25 12:08:27 +03:00

95 lines
3.0 KiB
C++

//===-- BenchmarkRunner.h ---------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
///
/// \file
/// Defines the abstract BenchmarkRunner class for measuring a certain execution
/// property of instructions (e.g. latency).
///
//===----------------------------------------------------------------------===//
#ifndef LLVM_TOOLS_LLVM_EXEGESIS_BENCHMARKRUNNER_H
#define LLVM_TOOLS_LLVM_EXEGESIS_BENCHMARKRUNNER_H
#include "Assembler.h"
#include "BenchmarkCode.h"
#include "BenchmarkResult.h"
#include "LlvmState.h"
#include "MCInstrDescView.h"
#include "SnippetRepetitor.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCInst.h"
#include "llvm/Support/Error.h"
#include <cstdlib>
#include <memory>
#include <vector>
namespace llvm {
namespace exegesis {
// Common code for all benchmark modes.
class BenchmarkRunner {
public:
explicit BenchmarkRunner(const LLVMState &State,
InstructionBenchmark::ModeE Mode);
virtual ~BenchmarkRunner();
Expected<InstructionBenchmark>
runConfiguration(const BenchmarkCode &Configuration, unsigned NumRepetitions,
unsigned LoopUnrollFactor,
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
bool DumpObjectToDisk) const;
// Scratch space to run instructions that touch memory.
struct ScratchSpace {
static constexpr const size_t kAlignment = 1024;
static constexpr const size_t kSize = 1 << 20; // 1MB.
ScratchSpace()
: UnalignedPtr(std::make_unique<char[]>(kSize + kAlignment)),
AlignedPtr(
UnalignedPtr.get() + kAlignment -
(reinterpret_cast<intptr_t>(UnalignedPtr.get()) % kAlignment)) {}
char *ptr() const { return AlignedPtr; }
void clear() { std::memset(ptr(), 0, kSize); }
private:
const std::unique_ptr<char[]> UnalignedPtr;
char *const AlignedPtr;
};
// A helper to measure counters while executing a function in a sandboxed
// context.
class FunctionExecutor {
public:
virtual ~FunctionExecutor();
// FIXME deprecate this.
virtual Expected<int64_t> runAndMeasure(const char *Counters) const = 0;
virtual Expected<llvm::SmallVector<int64_t, 4>>
runAndSample(const char *Counters) const = 0;
};
protected:
const LLVMState &State;
const InstructionBenchmark::ModeE Mode;
private:
virtual Expected<std::vector<BenchmarkMeasure>>
runMeasurements(const FunctionExecutor &Executor) const = 0;
Expected<std::string> writeObjectFile(const BenchmarkCode &Configuration,
const FillFunction &Fill) const;
const std::unique_ptr<ScratchSpace> Scratch;
};
} // namespace exegesis
} // namespace llvm
#endif // LLVM_TOOLS_LLVM_EXEGESIS_BENCHMARKRUNNER_H