Removed all the caching maps (BB, Inst) in `Embedder` as we don't want
to cache embeddings in general. Our earlier experiments on Symbolic
embeddings show recomputation of embeddings is cheaper than cache
lookups.
On the other hand, Flow-Aware embeddings benefit from instruction-level
caching, because the embedding of an instruction depends on the
embeddings of other instructions in the function. So, instruction
embedding caching is retained only for Flow-Aware computation. This
also necessitates an `invalidate` method that clears the cache when
transformations make the embeddings invalid.
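
A minimal sketch of the caching idea described above, assuming a toy one-dimensional embedding and an acyclic dependency graph. `Inst`, `FlowAwareSketch`, and `numComputed` are hypothetical names for illustration, not the actual IR2Vec types:

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; not the actual IR2Vec classes.
using Embedding = std::vector<double>;

struct Inst {
  int OpcodeWeight;               // stand-in for the vocabulary entry
  std::vector<const Inst *> Defs; // instructions this one depends on
};

class FlowAwareSketch {
  // Instruction-level cache: filled lazily, cleared by invalidate().
  mutable std::unordered_map<const Inst *, Embedding> InstVecMap;
  mutable unsigned NumComputed = 0;

public:
  // Returns the cached embedding, computing it (and those of its
  // dependencies) on a cache miss. Assumes the dependencies are acyclic.
  const Embedding &get(const Inst &I) const {
    auto It = InstVecMap.find(&I);
    if (It != InstVecMap.end())
      return It->second;
    Embedding E{static_cast<double>(I.OpcodeWeight)};
    for (const Inst *D : I.Defs) {
      assert(D && "null dependency");
      E[0] += 0.5 * get(*D)[0]; // reuse the embeddings of dependencies
    }
    ++NumComputed;
    return InstVecMap.emplace(&I, std::move(E)).first->second;
  }

  // Must be called after a transformation changes the instructions.
  void invalidate() { InstVecMap.clear(); }

  unsigned numComputed() const { return NumComputed; }
};
```

A repeated `get` on the same instruction hits the cache; after `invalidate()`, the next `get` recomputes the instruction and everything it depends on.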
Refactored the IR2Vec vocabulary and introduced an IR-agnostic (semantics-agnostic) `VocabStorage`:
- `Vocabulary` *has-a* `VocabStorage`
- `Vocabulary` deals with LLVM IR specific entities. This allows parts of the logic to be reused efficiently for MIR.
- Storage uses a section-based approach instead of a flat vector, improving organization and access patterns.
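
The split above can be sketched roughly as follows. This is an illustrative layout only: `SectionedStorage`, `IRVocabSketch`, and the section numbering are assumptions, not the actual `VocabStorage` or `Vocabulary` interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Embedding = std::vector<double>;

// IR-agnostic storage: only knows about numbered sections of embeddings.
class SectionedStorage {
  std::vector<std::vector<Embedding>> Sections;

public:
  SectionedStorage(std::vector<std::vector<Embedding>> S)
      : Sections(std::move(S)) {}

  std::size_t numSections() const { return Sections.size(); }

  // Total number of entries across all sections.
  std::size_t size() const {
    std::size_t N = 0;
    for (const auto &S : Sections)
      N += S.size();
    return N;
  }

  const Embedding &at(std::size_t Section, std::size_t Local) const {
    assert(Section < Sections.size() && Local < Sections[Section].size());
    return Sections[Section][Local];
  }
};

// IR-specific wrapper (has-a storage): gives meaning to each section.
class IRVocabSketch {
  enum Section : std::size_t { Opcodes = 0, Types = 1 };
  SectionedStorage Storage;

public:
  IRVocabSketch(SectionedStorage S) : Storage(std::move(S)) {}
  const Embedding &opcode(std::size_t Opc) const {
    return Storage.at(Opcodes, Opc);
  }
  const Embedding &type(std::size_t Ty) const { return Storage.at(Types, Ty); }
  std::size_t size() const { return Storage.size(); }
};
```

A second wrapper for another IR could reuse `SectionedStorage` unchanged and define its own section meanings.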
Comparison predicates (equal, not equal, greater than, etc.) provide important semantic information about program behavior. Previously, IR2Vec only captured that a comparison was happening but not what kind of comparison it was. This PR extends the IR2Vec vocabulary to include comparison predicates (ICmp and FCmp) as part of the embedding space.
Following are the changes:
1. Expand the vocabulary slot layout to include predicate entries after opcodes, types, and operands
2. Add methods to handle predicate embedding lookups and conversions
3. Update the embedder implementations to include predicate information when processing CmpInst instructions
4. Update test files to include the new predicate entries in the vocabulary
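
As a rough illustration of the slot layout in item 1, the sketch below places predicate entries after the opcode, type, and operand sections of a flat vocabulary. The section sizes, the `Pred` enum, and the unweighted sum in `embedCmp` are assumptions made for the example and do not reflect the real vocabulary:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical section sizes; the real vocabulary is much larger.
constexpr std::size_t NumOpcodes = 4;
constexpr std::size_t NumTypes = 3;
constexpr std::size_t NumOperandKinds = 2;
constexpr std::size_t NumPredicates = 3;
constexpr std::size_t VocabSize =
    NumOpcodes + NumTypes + NumOperandKinds + NumPredicates;

// Hypothetical subset of comparison predicates.
enum class Pred : std::size_t { EQ = 0, NE = 1, GT = 2 };

// Predicates live after opcodes, types, and operands.
constexpr std::size_t predicateSlot(Pred P) {
  return NumOpcodes + NumTypes + NumOperandKinds + static_cast<std::size_t>(P);
}

// Toy 1-D "embedding" of a compare: opcode entry plus predicate entry.
double embedCmp(const std::vector<double> &Vocab, std::size_t Opcode, Pred P) {
  assert(Vocab.size() == VocabSize && Opcode < NumOpcodes);
  return Vocab[Opcode] + Vocab[predicateSlot(P)];
}
```

With this layout, two compares that share an opcode but differ in predicate no longer map to the same value.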
(Tracking issues: #141817, #141833)
Refactor IR2Vec vocabulary to use canonical type IDs, improving the embedding representation for LLVM IR types.
The previous implementation used raw Type::TypeID values directly in the vocabulary, which led to redundant entries (e.g., all float variants mapped to "FloatTy" but had separate slots). This change improves the vocabulary by:
1. Making the type representation more consistent by properly canonicalizing types
2. Reducing vocabulary size by eliminating redundant entries
3. Improving the embedding quality by ensuring similar types share the same representation
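
The canonicalization can be pictured with the following sketch, in which several raw type IDs collapse into one canonical slot. `RawTy`, `CanonTy`, and `typeSlot` are hypothetical names, and the type list is a small assumed subset:

```cpp
#include <cstddef>

// Hypothetical raw type IDs (a small subset, for illustration only).
enum class RawTy { Half, BFloat, Float, Double, Integer, Pointer };

// Canonical IDs: one vocabulary slot per canonical type.
enum class CanonTy : std::size_t { FloatTy, IntegerTy, PointerTy };
constexpr std::size_t NumCanonTypes = 3;

inline CanonTy canonicalize(RawTy T) {
  switch (T) {
  case RawTy::Half:
  case RawTy::BFloat:
  case RawTy::Float:
  case RawTy::Double:
    return CanonTy::FloatTy; // all float variants share one entry
  case RawTy::Integer:
    return CanonTy::IntegerTy;
  case RawTy::Pointer:
    return CanonTy::PointerTy;
  }
  return CanonTy::IntegerTy; // not reached for valid enumerators
}

// Slot index within the type section of the vocabulary.
inline std::size_t typeSlot(RawTy T) {
  return static_cast<std::size_t>(canonicalize(T));
}
```

Here six raw IDs need only three slots, which is the kind of redundancy the change removes.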
(Tracking issue - #141817)
Consolidate IR2Vec option categories to use a single shared category across the library and tool.
With this change, the cl options defined in IR2Vec.cpp are visible in the llvm-ir2vec tool. This is necessary because the tool uses the same options.
Add helper methods to IR2Vec's Vocabulary class for numeric ID mapping and vocabulary size calculation. These APIs will be useful in triplet generation for the `llvm-ir2vec` tool (see #149214).
(Tracking issue - #141817)
Refactored IR2Vec vocabulary handling to improve code organization and error handling. This would help in upcoming PRs related to the IR2Vec tool.
(Tracking issue - #141817)
This patch fixes:
  llvm/lib/Analysis/IR2Vec.cpp:280:3: error: default label in switch
  which covers all enumeration values
  [-Werror,-Wcovered-switch-default]
This PR restructures the vocabulary.
* String-based lookups are removed. The vocabulary is changed from a map to a vector. (#141832)
* Grouped all the vocabulary related methods under a single class - `ir2vec::Vocabulary`. This replaces `IR2VecVocabResult`.
* `ir2vec::Vocabulary` effectively abstracts out the _layout_ and other internal details of the vector structure. It exposes the necessary APIs for accessing the vocabulary.
These changes ensure that _all_ known opcodes and types are present in the vocabulary. We have retained the original operands. This can be extended going forward.
(Tracking issue - #141817)
This PR adds out-of-place arithmetic operators (`+`, `-`, `*`) to the `Embedding` class in IR2Vec, complementing the existing in-place operators (`+=`, `-=`, `*=`).
Tests have been added to verify the functionality of these new operators.
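
A minimal sketch of how out-of-place operators can be layered on the in-place ones. This is a simplified stand-in class written for illustration, not the actual `Embedding` implementation, and the member set is an assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Simplified stand-in for an embedding wrapper around std::vector<double>.
class Embedding {
  std::vector<double> Data;

public:
  Embedding(std::initializer_list<double> IL) : Data(IL) {}

  std::size_t size() const { return Data.size(); }
  double operator[](std::size_t I) const { return Data[I]; }

  // In-place operators modify the left-hand side.
  Embedding &operator+=(const Embedding &RHS) {
    assert(size() == RHS.size() && "dimension mismatch");
    for (std::size_t I = 0; I < size(); ++I)
      Data[I] += RHS.Data[I];
    return *this;
  }
  Embedding &operator-=(const Embedding &RHS) {
    assert(size() == RHS.size() && "dimension mismatch");
    for (std::size_t I = 0; I < size(); ++I)
      Data[I] -= RHS.Data[I];
    return *this;
  }
  Embedding &operator*=(double Factor) {
    for (double &D : Data)
      D *= Factor;
    return *this;
  }

  // Out-of-place operators copy, then reuse the in-place versions.
  Embedding operator+(const Embedding &RHS) const {
    Embedding Result(*this);
    Result += RHS;
    return Result;
  }
  Embedding operator-(const Embedding &RHS) const {
    Embedding Result(*this);
    Result -= RHS;
    return Result;
  }
  Embedding operator*(double Factor) const {
    Embedding Result(*this);
    Result *= Factor;
    return Result;
  }

  bool operator==(const Embedding &RHS) const { return Data == RHS.Data; }
};
```

Defining `+` in terms of `+=` keeps the two forms consistent and leaves both operands unchanged.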
(Tracking issue - #141817)
The code following `llvm_unreachable` is optimized out in Release builds. As a result, `Embedder::create` does not seem to return `nullptr`, which breaks the `CreateInvalidMode` test. Hence, `llvm_unreachable` is removed.
This change simplifies the API by removing the error handling complexity.
- Changed `Embedder::create()` to return `std::unique_ptr<Embedder>` directly instead of `Expected<std::unique_ptr<Embedder>>`
- Updated documentation and tests to reflect the new API
- Added death test for invalid IR2Vec kind in debug mode
- In release mode, simply returns nullptr for invalid kinds instead of creating an error
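
The release-mode behaviour described above can be sketched as below. The classes are simplified stand-ins; `name()` and the enumerator names are assumptions for the example, and the debug-mode death path is deliberately left out:

```cpp
#include <memory>
#include <string>

// Hypothetical embedding kinds, for illustration only.
enum class IR2VecKind { Symbolic, FlowAware };

struct Embedder {
  virtual ~Embedder() = default;
  virtual std::string name() const = 0;
  // Returns the embedder directly; nullptr signals an invalid kind.
  static std::unique_ptr<Embedder> create(IR2VecKind Mode);
};

struct SymbolicEmbedder : Embedder {
  std::string name() const override { return "symbolic"; }
};

struct FlowAwareEmbedder : Embedder {
  std::string name() const override { return "flow-aware"; }
};

std::unique_ptr<Embedder> Embedder::create(IR2VecKind Mode) {
  switch (Mode) {
  case IR2VecKind::Symbolic:
    return std::make_unique<SymbolicEmbedder>();
  case IR2VecKind::FlowAware:
    return std::make_unique<FlowAwareEmbedder>();
  }
  // Reached only for an out-of-range kind; callers check for nullptr.
  return nullptr;
}
```

Callers then test the returned pointer instead of unwrapping an `Expected`.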
(Tracking issue - #141817)
This patch fixes:
  llvm/lib/Analysis/IR2Vec.cpp:296:2: error: extra ';' outside of a
  function is incompatible with C++98
  [-Werror,-Wc++98-compat-extra-semi]
Scale opcodes, types, and args once in `IR2VecVocabAnalysis` so that we avoid scaling each time embeddings are computed. This PR refactors the vocabulary to explicitly define three sections (Opcodes, Types, and Arguments) used for computing embeddings.
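
A small sketch of the "scale once" idea, assuming one-dimensional entries and a plain sum as the combination rule. `Weights`, `Vocab3`, `scaleOnce`, and `embedInst` are hypothetical names for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-section weights.
struct Weights {
  double Opc, Type, Arg;
};

// Vocabulary with three explicit sections (toy 1-D entries).
struct Vocab3 {
  std::vector<double> Opcodes, Types, Args;
};

// Apply the section weights once, when the vocabulary is built.
void scaleOnce(Vocab3 &V, const Weights &W) {
  for (double &E : V.Opcodes)
    E *= W.Opc;
  for (double &E : V.Types)
    E *= W.Type;
  for (double &E : V.Args)
    E *= W.Arg;
}

// Embedding computation then only adds pre-scaled entries.
double embedInst(const Vocab3 &V, std::size_t Opc, std::size_t Ty,
                 std::size_t Arg) {
  assert(Opc < V.Opcodes.size() && Ty < V.Types.size() && Arg < V.Args.size());
  return V.Opcodes[Opc] + V.Types[Ty] + V.Args[Arg];
}
```

The multiplications move out of the per-instruction path, so each embedding computation is reduced to lookups and additions.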
(Tracking issue - #141817 ; partly fixes - #141832)
Consider only BBs that are reachable from the entry block when computing the embeddings. Similarly, debug instructions are skipped.
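
The filtering can be sketched on a toy CFG as follows. `Inst`, `Block`, `Function`, and both helper functions are hypothetical stand-ins, with block 0 assumed to be the entry block:

```cpp
#include <cstddef>
#include <vector>

// Toy CFG types, for illustration only.
struct Inst {
  bool IsDebug;
};

struct Block {
  std::vector<std::size_t> Succs; // indices of successor blocks
  std::vector<Inst> Insts;
};

using Function = std::vector<Block>; // block 0 is the entry block

// Marks every block reachable from the entry with an iterative DFS.
std::vector<bool> reachableFromEntry(const Function &F) {
  std::vector<bool> Seen(F.size(), false);
  if (F.empty())
    return Seen;
  std::vector<std::size_t> Worklist;
  Worklist.push_back(0);
  Seen[0] = true;
  while (!Worklist.empty()) {
    std::size_t B = Worklist.back();
    Worklist.pop_back();
    for (std::size_t S : F[B].Succs) {
      if (!Seen[S]) {
        Seen[S] = true;
        Worklist.push_back(S);
      }
    }
  }
  return Seen;
}

// Counts the instructions that would contribute to the embedding:
// those in reachable blocks that are not debug instructions.
std::size_t countEmbeddedInsts(const Function &F) {
  std::vector<bool> Live = reachableFromEntry(F);
  std::size_t N = 0;
  for (std::size_t B = 0; B < F.size(); ++B) {
    if (!Live[B])
      continue;
    for (const Inst &I : F[B].Insts)
      if (!I.IsDebug)
        ++N;
  }
  return N;
}
```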
(Tracking issue - #141817)
This PR changes some asserts in Vocab to hard checks that emit an error, and exposes flags and a constructor to help in unit tests.
(Tracking issue - #141817)
Currently `Embedding` is `std::vector<double>`. This PR makes it a data type wrapped around `std::vector<double>` to overload basic arithmetic operators and expose comparison operations. It _simplifies_ the usage here and in the passes where operations on `Embedding` would be performed.
(Tracking issue - #141817)
This PR removes the necessity to know the dimension of the embeddings while invoking `Embedder::Create`. Having the `Dimension` parameter introduces complexities in downstream consumers.
(Tracking issue - #141817)
This PR exposes interfaces to compute embeddings at BB level. This would be necessary for delta patching the embeddings in MLInliner (#141836).
(Tracking issue - #141817)
Currently, users have to invoke two APIs: `computeEmbeddings()` followed
by getters to access the embeddings. This PR refactors the code to
reduce this *stateful* access of APIs. Users can now directly invoke
getters; internally, the getters compute the embeddings.
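
One way to picture the getter-driven flow is the sketch below, where the getter triggers the computation on first use. `LazyEmbedderSketch` is a hypothetical class with a toy scalar "embedding"; remembering the result in a flag is an assumption of this example, not something the PR description states:

```cpp
#include <utility>
#include <vector>

// Hypothetical embedder whose getter computes on demand.
class LazyEmbedderSketch {
  std::vector<double> InstValues;
  mutable double FuncValue = 0.0;
  mutable bool Computed = false;
  mutable unsigned NumComputations = 0;

  // Private: callers no longer invoke this step themselves.
  void computeEmbeddings() const {
    FuncValue = 0.0;
    for (double V : InstValues)
      FuncValue += V;
    Computed = true;
    ++NumComputations;
  }

public:
  explicit LazyEmbedderSketch(std::vector<double> Values)
      : InstValues(std::move(Values)) {}

  // The getter is the only entry point; it computes if needed.
  double getFunctionValue() const {
    if (!Computed)
      computeEmbeddings();
    return FuncValue;
  }

  unsigned numComputations() const { return NumComputations; }
};
```

Because there is no separate "compute first" step, a caller cannot read stale or uninitialized results by calling the APIs in the wrong order.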
These are identified by misc-include-cleaner. I've filtered out those
that break builds. Also, I'm staying away from llvm-config.h,
config.h, and Compiler.h, which likely cause platform- or
compiler-specific build failures.