
llvm-ir2vec - IR2Vec Embedding Generation Tool
==============================================

.. program:: llvm-ir2vec

SYNOPSIS
--------

:program:`llvm-ir2vec` [*subcommand*] [*options*]

DESCRIPTION
-----------

:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
generates IR2Vec embeddings for LLVM IR and supports triplet generation
for vocabulary training. The tool provides three main subcommands:

1. **triplets**: Generates numeric triplets in train2id format for vocabulary
   training from LLVM IR.

2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
   training.

3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
   at different granularity levels (instruction, basic block, or function).

The tool is designed to facilitate machine learning applications that work with
LLVM IR by converting the IR into numerical representations that can be used by
ML models. The `triplets` subcommand generates numeric IDs directly instead of string
triplets, streamlining the training data preparation workflow.

.. note::

   For information about using IR2Vec programmatically within LLVM passes and
   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
   section in the MLGO documentation.

OPERATION MODES
---------------

The `triplets` and `entities` subcommands prepare the vocabulary and training
data for knowledge graph embeddings. The `embeddings` subcommand generates
embeddings from LLVM IR using a pre-trained vocabulary.

The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
by modeling the relationships between opcodes, types, and operands as a knowledge
graph. For this purpose, the `triplets` and `entities` subcommands generate
triplets and entity mappings in the standard format used for knowledge graph
embedding training (see
<https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>
for details).

See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how
these two subcommands are used to generate the triplets and entity mappings.

Triplet Generation
~~~~~~~~~~~~~~~~~~

With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
are generated in the standard format used for knowledge graph embedding training.
The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
infrastructure, eliminating the need for string-to-ID preprocessing.

Usage:

.. code-block:: bash

   llvm-ir2vec triplets input.bc -o triplets_train2id.txt

Entity Mapping Generation
~~~~~~~~~~~~~~~~~~~~~~~~~

With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
supported by IR2Vec in the standard format used for knowledge graph embedding
training. This subcommand outputs all supported entities (opcodes, types, and
operands) with their corresponding numeric IDs. The output does not depend on
any particular LLVM IR file, so no input file is given.

Usage:

.. code-block:: bash

   llvm-ir2vec entities -o entity2id.txt

Embedding Generation
~~~~~~~~~~~~~~~~~~~~

With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
generate numerical embeddings for LLVM IR at different levels of granularity.

Example Usage:

.. code-block:: bash

   llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt

OPTIONS
-------

Global options:

.. option:: -o <filename>

   Specify the output filename. Use ``-`` to write to standard output (default).

.. option:: --help

   Print a summary of command line options.

Subcommand-specific options:

**embeddings** subcommand:

.. option:: <input-file>

   The input LLVM IR or bitcode file to process. This positional argument is
   required for the `embeddings` subcommand.

.. option:: --level=<level>

   Specify the embedding generation level. Valid values are:

   * ``inst`` - Generate instruction-level embeddings
   * ``bb`` - Generate basic block-level embeddings
   * ``func`` - Generate function-level embeddings (default)

.. option:: --function=<name>

   Process only the specified function instead of all functions in the module.

.. option:: --ir2vec-vocab-path=<path>

   Specify the path to the vocabulary file (required for embedding generation).
   The vocabulary file should be in JSON format and contain the trained
   vocabulary for embedding generation. See `llvm/lib/Analysis/models`
   for pre-trained vocabulary files.

.. option:: --ir2vec-opc-weight=<weight>

   Specify the weight for opcode embeddings (default: 1.0). This controls
   the relative importance of instruction opcodes in the final embedding.

.. option:: --ir2vec-type-weight=<weight>

   Specify the weight for type embeddings (default: 0.5). This controls
   the relative importance of type information in the final embedding.

.. option:: --ir2vec-arg-weight=<weight>

   Specify the weight for argument embeddings (default: 0.2). This controls
   the relative importance of operand information in the final embedding.

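As a rough illustration of how the `embeddings` options above combine, the
following Python sketch assembles a command line from them. The helper name
``build_embeddings_command`` is hypothetical and not part of LLVM; it uses only
the options documented above and assumes the argument order shown in the
example usage.

.. code-block:: python

   VALID_LEVELS = ("inst", "bb", "func")

   def build_embeddings_command(input_file, vocab_path, level="func",
                                function=None, output="-"):
       """Return an argv list for the embeddings subcommand (illustrative only)."""
       if level not in VALID_LEVELS:
           raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
       cmd = [
           "llvm-ir2vec",
           "embeddings",
           f"--ir2vec-vocab-path={vocab_path}",
           f"--level={level}",
       ]
       # --function is optional; without it all functions are processed.
       if function is not None:
           cmd.append(f"--function={function}")
       cmd += [input_file, "-o", output]
       return cmd

   # File and function names here are placeholders.
   cmd = build_embeddings_command("input.bc", "vocab.json",
                                  level="bb", function="main")

The resulting list can be passed to a process launcher such as
``subprocess.run`` without shell quoting concerns.
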
**triplets** subcommand:

.. option:: <input-file>

   The input LLVM IR or bitcode file to process. This positional argument is
   required for the `triplets` subcommand.

**entities** subcommand:

No subcommand-specific options.

OUTPUT FORMAT
-------------

Triplet Mode Output
~~~~~~~~~~~~~~~~~~~

The `triplets` subcommand writes numeric triplets in train2id format, preceded
by a metadata header. The format is:

.. code-block:: text

   MAX_RELATIONS=<max_relations_count>
   <head_entity_id> <tail_entity_id> <relation_id>
   <head_entity_id> <tail_entity_id> <relation_id>
   ...

Each line after the metadata header represents one instruction relationship,
with numeric IDs for the head entity, tail entity, and relation, in that order.
The metadata header (MAX_RELATIONS) provides the relation count for
post-processing and training setup.

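As a minimal sketch of consuming this output, the following Python helper parses
text in the layout shown above. The function name ``parse_triplets`` and the
sample values are hypothetical and are not produced by :program:`llvm-ir2vec`;
the column order (head, tail, relation) is taken from the format block above.

.. code-block:: python

   def parse_triplets(text):
       """Parse triplets output into (max_relations, [(head, tail, relation)])."""
       lines = [line.strip() for line in text.splitlines() if line.strip()]
       if not lines or not lines[0].startswith("MAX_RELATIONS="):
           raise ValueError("missing MAX_RELATIONS header")
       max_relations = int(lines[0].split("=", 1)[1])
       triplets = []
       for line in lines[1:]:
           # Each data line holds exactly three whitespace-separated integers.
           head, tail, relation = (int(field) for field in line.split())
           triplets.append((head, tail, relation))
       return max_relations, triplets

   # Hypothetical sample data, not real tool output.
   sample = "MAX_RELATIONS=3\n0 7 0\n0 12 1\n"
   max_relations, triplets = parse_triplets(sample)
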
Entity Mode Output
~~~~~~~~~~~~~~~~~~

The `entities` subcommand writes the entity mapping in the following format:

.. code-block:: text

   <total_entities>
   <entity_string> <numeric_id>
   <entity_string> <numeric_id>
   ...

The first line contains the total number of entities, followed by one entity
mapping per line with tab-separated entity string and numeric ID.

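A short Python sketch of reading such a file is given below, assuming the
tab-separated layout described above. The name ``parse_entities`` and the
sample entities and IDs are hypothetical; the real IDs are whatever the
`entities` subcommand emits.

.. code-block:: python

   def parse_entities(text):
       """Parse entity2id-style text into a dict of entity string -> numeric ID."""
       lines = text.splitlines()
       total = int(lines[0].strip())
       mapping = {}
       for line in lines[1:]:
           if not line.strip():
               continue
           # Split on the last tab so the entity string itself is kept intact.
           name, numeric_id = line.rsplit("\t", 1)
           mapping[name] = int(numeric_id)
       if len(mapping) != total:
           raise ValueError(f"expected {total} entities, found {len(mapping)}")
       return mapping

   # Hypothetical sample data, not real tool output.
   sample_entities = "2\nadd\t0\nsub\t1\n"
   entities = parse_entities(sample_entities)

Checking the parsed count against the first line is a cheap way to detect a
truncated file before training.
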
Embedding Mode Output
~~~~~~~~~~~~~~~~~~~~~

In embedding mode, the output format depends on the specified level:

* **Function Level**: One embedding vector per function
* **Basic Block Level**: One embedding vector per basic block, grouped by function
* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function

Each embedding is represented as a floating point vector.

EXIT STATUS
-----------

:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.

Common failure cases include:

* Invalid or missing input file
* Missing or invalid vocabulary file (`embeddings` subcommand)
* Specified function not found in the module
* Invalid command line options

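A script that drives the tool can rely on this convention. The sketch below
shows one way to check the exit status from Python; ``run_tool`` is a
hypothetical helper, and a stand-in command is used so the example runs without
:program:`llvm-ir2vec` installed.

.. code-block:: python

   import subprocess
   import sys

   def run_tool(argv):
       """Run a command and return (exit_code, stderr_text)."""
       completed = subprocess.run(argv, capture_output=True, text=True)
       return completed.returncode, completed.stderr

   # Stand-in for a failing invocation; in practice argv would be something
   # like ["llvm-ir2vec", "entities", "-o", "entity2id.txt"].
   code, stderr_text = run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
   failed = code != 0
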
SEE ALSO
--------

:doc:`../MLGO`

For more information about the IR2Vec algorithm and approach, see:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.