Add support for non-allocatable module-level CUDA managed variables
using pointer indirection through a companion global in
__nv_managed_data__. The CUDA runtime populates this pointer with the
unified memory address via __cudaRegisterManagedVar and
__cudaInitModule.
- Create a .managed.ptr companion global in the __nv_managed_data__
section and register it with _FortranACUFRegisterManagedVariable
- Call __cudaInitModule once after all variables are registered, only
when non-allocatable managed globals are present, to populate managed
pointers
- Annotate managed globals in gpu.module with nvvm.managed for PTX
.attribute(.managed) generation
- Suppress cuf.data_transfer for assignments to/from non-allocatable
module managed variables, since cudaMemcpy would target the shadow
address rather than the actual unified memory
- Preserve cuf.data_transfer for device_var = managed_var assignments
where explicit transfer is still required
Note: This PR depends on
[#189751](https://github.com/llvm/llvm-project/pull/189751) (MLIR:
nvvm.managed attribute).
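The transfer rules in the last two bullets can be modeled as a small decision function. This is an illustrative sketch only, not the flang implementation: the `Var` record, its fields, and `needs_data_transfer` are hypothetical names. Treating every assignment that involves a device variable as needing a transfer, and the final fallback rule, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Hypothetical model of a variable; not a flang data structure."""
    name: str
    attr: str = "host"        # "host", "device", or "managed"
    allocatable: bool = False
    module_level: bool = True


def is_static_managed(v: Var) -> bool:
    # Non-allocatable module-level managed variable (accessed through
    # the companion pointer described above).
    return v.attr == "managed" and v.module_level and not v.allocatable


def needs_data_transfer(lhs: Var, rhs: Var) -> bool:
    # Assumption: generalizes the "device_var = managed_var" case, so any
    # assignment involving device memory keeps its explicit transfer.
    if lhs.attr == "device" or rhs.attr == "device":
        return True
    # Assignments to/from a static managed variable: no transfer.
    if is_static_managed(lhs) or is_static_managed(rhs):
        return False
    # Assumption: everything else transfers when non-host data is involved.
    return lhs.attr != "host" or rhs.attr != "host"
```

With this model, `managed = host` and `host = managed` skip the transfer, while `device = managed` keeps it.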
Currently, we do not check the module for requires directives, which
means we miss them and do not set them on the OpenMP module.
In addition, because symbols are currently checked on a first-come,
first-served basis, there are certain file layouts in which the compiler
would miss that a user had specified requires somewhere in the module.
This is partially but not fully avoided by the Semantics layer pushing
the requires onto the topmost PFT symbol, as it is entirely possible to
create a legal Fortran program with two or more of these (e.g. a module
and main program in one file, or standalone functions intermixed with
modules or a main program). Some examples of this are shown in the added
Fortran test. This PR resolves the issue by gathering all of the
relevant symbols and processing them.
Also removed gathering from BlockDataUnit, as I don't think these
symbols ever get the requires applied.
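The difference between the two strategies can be sketched as follows. This is a simplified model, not the lowering code: `Unit`, `first_come_first_served`, and `gather_all` are hypothetical names, and modeling the old behaviour as "inspect only the first unit" is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical stand-in for a top-level program unit symbol."""
    name: str
    requires: frozenset = frozenset()


def first_come_first_served(units):
    # Assumed model of the old behaviour: only the first unit is consulted.
    return set(units[0].requires) if units else set()


def gather_all(units):
    # Model of the new behaviour: union the clauses from every unit.
    flags = set()
    for unit in units:
        flags |= unit.requires
    return flags


units = [
    Unit("module m"),
    Unit("subroutine standalone", frozenset({"unified_shared_memory"})),
    Unit("program main"),
]
```

Here the clause sits on the second unit, so the first strategy returns an empty set while `gather_all(units)` finds it.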
Update Flang Extension doc to remove note about a warning that was
removed in a previous PR (PR #178088). It is an oversight that this doc
change was not made in that previous PR. The oversight was only recently
discovered and has led to this PR.
Fix a dangling if condition in allOtherUsesAreSafeForAssociate that
caused isBeforeInBlock to be called even when instructions were known to
not be in the same block, triggering an assertion. The condition is now
an else if, and braces were added.
---------
Co-authored-by: Yebin Chon <ychon@nvidia.com>
This PR extends flang's alias analysis so it can reason about values
that originate from OpenACC data and privatization operations, including
values passed through block arguments.
This patch fixes lowering of elemental character MIN/MAX in HLFIR.
Previously, these cases could hit a lowering-time TODO
`ElementalIntrinsicCallBuilder::computeDynamicCharacterResultLength` and
abort. This change computes the character result length as the maximum
length of the present actual arguments, allowing valid elemental
character MIN/MAX calls to lower successfully.
Added regression coverage for elemental character MIN/MAX, including
differing-length arguments.
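The length rule is simple enough to state as code. The sketch below is a model of the rule only, not the HLFIR lowering: `minmax_result_length` and `char_max` are hypothetical helpers, `None` stands for an absent optional argument, and comparing blank-padded Python strings assumes an ASCII-like collating sequence.

```python
from typing import Optional, Sequence


def minmax_result_length(arg_lengths: Sequence[Optional[int]]) -> int:
    """Result length is the maximum length over the present arguments."""
    present = [n for n in arg_lengths if n is not None]
    if not present:
        raise ValueError("at least one argument must be present")
    return max(present)


def char_max(*args: Optional[str]) -> str:
    """Model of character MAX: blank-pad to the result length, then compare."""
    length = minmax_result_length(
        [None if a is None else len(a) for a in args]
    )
    padded = [a.ljust(length) for a in args if a is not None]
    return max(padded)
```

For example, `char_max("zz", "abcd")` yields a result of length 4, because the shorter argument is padded to the longest present argument.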
Co-authored-by: Sairudra More <moresair@pe31.hpc.amslabs.hpecorp.net>
Use shift instead of sliceLb only when the array_coor has an explicit
slice (indicesAreFortran case). When the slice comes from an embox,
the indices are 1-based section indices and must subtract 1.
Add support for non-allocatable module-level CUDA managed variables
using pointer indirection through a companion global in
__nv_managed_data__. The CUDA runtime populates this pointer with the
unified memory address via __cudaRegisterManagedVar and
__cudaInitModule.
1. Create a .managed.ptr companion global in the __nv_managed_data__
section and register it with _FortranACUFRegisterManagedVariable
(CUFAddConstructor.cpp)
2. Call __cudaInitModule after registration to populate the managed
pointer (registration.cpp)
3. Annotate managed globals in gpu.module with nvvm.managed for PTX
.attribute(.managed) generation (cuda-code-gen.mlir)
4. Suppress cuf.data_transfer for assignments to/from non-allocatable
module managed variables, since cudaMemcpy would target the shadow
address rather than the actual unified memory (tools.h)
5. Preserve cuf.data_transfer for device_var = managed_var assignments
where explicit transfer is still required
Lower the iterator modifier on depend clause to omp.iterator. Iterated
depend objects emit `!omp.iterated<!llvm.ptr>` by using
`getDataOperandBaseAddr` to generate base address and
`genIteratorCoordinate` to get the addr+offset. The non-iterated objects
in depend clause still use existing lowering path.
This patch is part of feature work for #188061.
Assisted with copilot.
Extend the depend clause to support `!omp.iterated<Ty>` handles
alongside plain depend vars, so the IR can represent both forms.
Assisted with copilot
This is part of feature work for
https://github.com/llvm/llvm-project/issues/188061
When the same name is USE-associated with two or more distinct ultimate
symbols, and they are not both generic procedure interfaces, it's not an
error unless the name is actually referenced in the scope. But when the
scope is itself a module or submodule, our module files don't preserve
the error for later diagnosis -- instead, the UseErrorDetails symbol
that serves as a "poison pill" in case of later use is discarded when
the module file is generated. So emit additional USE statements to the
module file so that a UseErrorDetails symbol is created anew when the
module file is read.
Linear iteration variables were being treated as private. This fixes
one of the issues reported in #170784.
The regression reported in #188536 occurred because
LinearClauseProcessor was rewriting all basic blocks whose names
contained a given substring, including those that were not part of the
translated SIMD region.
This didn't cause problems before because linear variables were always
privatized, which doesn't happen with this change.
The issue is fixed by rewriting only the basic blocks that correspond to
the omp.simd operation.
Per OpenACC spec 2.5.4, branching out of `parallel`/`serial`/`kernels`
constructs is not allowed. Add a GOTO check to `NoBranchingEnforce` that
collects labels within the construct block and flags GOTOs targeting
labels outside. In-region GOTOs are allowed.
The check applies only to compute constructs (`parallel`, `serial`,
`kernels`), not to data constructs where GOTO out is valid.
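The label-collection approach can be sketched on a toy statement tree. This is an assumption-laden model rather than the `NoBranchingEnforce` code: `Stmt`, `walk`, and `escaping_gotos` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Stmt:
    """Toy statement: may carry a label, a GOTO target, and nested statements."""
    label: Optional[int] = None
    goto: Optional[int] = None
    body: List["Stmt"] = field(default_factory=list)


def walk(stmts):
    # Pre-order traversal over the statements and everything nested in them.
    for s in stmts:
        yield s
        yield from walk(s.body)


def escaping_gotos(construct_body):
    # Pass 1: collect every label defined inside the construct.
    labels = {s.label for s in walk(construct_body) if s.label is not None}
    # Pass 2: report GOTO targets that are not among those labels.
    return [
        s.goto
        for s in walk(construct_body)
        if s.goto is not None and s.goto not in labels
    ]
```

A GOTO to a label defined anywhere inside the construct is accepted; only targets outside the collected set are reported.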
Add a new acc::VariableInfoAttr attribute that can be extended and implemented by
language dialects to carry language-specific information about variables that is
not reflected in the MLIR type system and is needed in the implementation
of the init/copy/destroy APIs.
A new genPrivateVariableInfo API is added to the MappableTypeInterface to generate
such attribute from an mlir::Value for the host variable.
The use case and motivation is the Fortran OPTIONAL attribute. This patch adds
a new fir::OpenACCFortranVariableInfoAtt that implements the acc::VariableInfoAttr
to carry the OPTIONAL information around.
When a `GoTo` inside an ACC region (`acc.loop`, `acc.data`,
`acc.parallel`, etc.) targets a label outside that region, the lowering
generated an illegal cross-region `cf.br`. This caused MLIR verification
failures or stack overflows in `runRegionDCE`'s recursive
`propagateLiveness`.
This patch addresses the issue with a generalized approach:
- Add `genOpenACCRegionExitBranch` helper that detects cross-region
branches from any ACC region op and generates the appropriate terminator
(`acc.yield` for compute/loop ops, `acc.terminator` for data ops). The
helper verifies that `parentOp` is an ACC operation, so it does not
interfere with branches inside `scf.execute_region` or other non-ACC
regions.
- In `genBranch`, when a cross-region exit from an ACC region is
detected, store a unique exit ID into a selector variable and generate
the region terminator. After the ACC op, a jump table dispatches to the
correct target based on the selector. This correctly handles GOTOs that
skip intermediate code between the loop end and the target label.
- Emit a TODO diagnostic for GOTOs that cross multiple nested ACC region
boundaries.
- Fix `acc.data` creation when the construct has no data clauses but
contains unstructured control flow: skip the early return in
`genACCDataOp` so the `acc.data` region is created and blocks are
properly managed.
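The selector-plus-jump-table scheme from the second bullet can be modeled as below. This is a sketch of the idea only: `lower_region`, `dispatch`, `NO_EXIT`, and the tuple encoding of statements are hypothetical and do not correspond to the generated IR.

```python
NO_EXIT = 0  # selector value meaning "the region ended normally"


def lower_region(stmts, exit_ids):
    """Run the region body. A GOTO that leaves the region stores its exit ID
    in the selector and ends the region, standing in for the terminator."""
    selector = NO_EXIT
    executed = []
    for kind, arg in stmts:
        if kind == "goto" and arg in exit_ids:
            selector = exit_ids[arg]
            break
        executed.append(arg)
    return selector, executed


def dispatch(selector, exit_ids, fallthrough):
    """After the region op: the jump table maps the selector to a target."""
    table = {exit_id: label for label, exit_id in exit_ids.items()}
    return table.get(selector, fallthrough)
```

Usage: with `exit_ids = {100: 1, 200: 2}`, a body containing `("goto", 200)` leaves the region with selector 2, and `dispatch` then continues at label 200 instead of falling through to the code right after the region.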
Select type lowering was keeping scalar selectors as descriptors inside
TYPE IS for derived types, leading to a declare using a fir.box.
This is not the canonical representation for such variables, which can
be tracked with a simple pointer. The code that remaps variables
appearing in data operations during lowering did not expect a
fir.declare to be emitted with a fir.box for such an entity (an assert
was hit in the added OpenACC test).
Align the lowering of derived type scalar selector with the handling of
intrinsic selector. While doing this, simplify the logic by using and
adding fir::BaseBoxAddr helpers to ensure that attributes such as
VOLATILE are correctly propagated (they matter more than keeping the
fir.ptr/fir.heap type that is not relevant for the selector that does
not have the POINTER/ALLOCATABLE attributes).
In OpenACC semantic checking, filter out symbols with MiscDetails, which
include construct names, scope names, complex part designators, type
parameter inquiries, etc.
When an ArrayRefParameter (or OptionalArrayRefParameter) appears in a
non-last position within a struct() assembly format directive, the
printed output is ambiguous: the comma-separated array elements are
indistinguishable from the struct-level commas separating key-value
pairs.
Fix this by wrapping such parameters in square brackets in both the
generated printer and parser. The printer emits '[' before and ']' after
the array value; the parser calls parseLSquare()/parseRSquare() around
the FieldParser call. Parameters with a custom printer or parser are
unaffected (the user controls the format in that case).
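The ambiguity and the bracketed form can be shown with a toy printer. This is an illustrative model, not the TableGen-generated code: `print_struct` and its `<key = value>` layout are assumptions chosen for the example.

```python
def print_struct(fields, bracket_non_last_arrays):
    """Print (key, value) pairs; list values stand in for array parameters."""
    parts = []
    last = len(fields) - 1
    for index, (key, value) in enumerate(fields):
        if isinstance(value, list):
            text = ", ".join(str(v) for v in value)
            # Only an array in a non-last position needs delimiters.
            if bracket_non_last_arrays and index != last:
                text = f"[{text}]"
        else:
            text = str(value)
        parts.append(f"{key} = {text}")
    return "<" + ", ".join(parts) + ">"


fields = [("dims", [1, 2, 3]), ("rank", 3)]
```

Without brackets the output is `<dims = 1, 2, 3, rank = 3>`, where a parser cannot tell an element separator from a field separator. With brackets it becomes `<dims = [1, 2, 3], rank = 3>`.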
Fixes #156623
Assisted-by: Claude Code
This is a follow-up fix for commit 0f5e9bee.
Only write effects to thread-local memory should be considered safe to
parallelize in workshare lowering, not reads. When both reads and writes
were safe, the cascading effect in moveToSingle could cause entire
SingleRegions to become fully parallelized, eliminating the omp.single
and its implicit barrier. This removed synchronization points needed to
keep threads coordinated inside sequential loops containing workshared
operations, causing race conditions in forall-workshare patterns.
This was exposed by the Fujitsu Test Suite and made the following tests
regress:
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0031.test
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0013.test
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0030.test
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0014.test
Updates #143330
Due to a somewhat recent change, IntOrFpInduction recipes have
associated VPIRFlags. The VPlanUnroll logic for WidenInduction recipes
predates this change, and computes incomplete wrap-flags: update it to
simply use the flags on IntOrFpInduction recipes; PointerInduction
recipes have no associated flags, and indeed, no flags should be used.
The CUFComputeSharedMemoryOffsetsAndSize pass used getOps to find
cuf.shared_memory operations, which only searches direct children of
gpu.func. When cuf.shared_memory ops are nested inside scf.parallel
(e.g. from reduction lowering), they are missed and never receive
offset/isStatic attributes, causing a "cuf.shared_memory must have an
offset for code gen" assertion later. Switch to walk to find
cuf.shared_memory ops at any nesting depth.
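The difference between searching direct children and walking the whole body can be seen on a toy operation tree. The `Op` class and both helpers are hypothetical stand-ins for illustration and are not the MLIR API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    name: str
    children: List["Op"] = field(default_factory=list)


def direct_children(op: Op, name: str) -> List[Op]:
    # Looks only one level down, like the original search described above.
    return [c for c in op.children if c.name == name]


def walk(op: Op, name: str) -> List[Op]:
    # Recursive traversal: finds matching ops at any nesting depth.
    found = []
    for child in op.children:
        if child.name == name:
            found.append(child)
        found.extend(walk(child, name))
    return found


func = Op("gpu.func", [
    Op("cuf.shared_memory"),
    Op("scf.parallel", [Op("cuf.shared_memory")]),
])
```

On this tree the direct-children search finds one `cuf.shared_memory` op and misses the one nested in `scf.parallel`, while the walk finds both.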
Fix lowering of `!$omp declare reduction` for intrinsic operators
applied to user-defined derived types (e.g., `+` on `type(t)`).
Previously, this hit a TODO in
`ReductionProcessor::getReductionInitValue` because the code tried to
compute an init value for a non-predefined type, when it should instead
use the initializer region from the `DeclareReductionOp`.
This fixes the issue #176278: [Flang][OpenMP] Compilation error when
type-list in declare reduction directive is derived type name.
The root cause was a naming mismatch: `genOMP` for
`OpenMPDeclareReductionConstruct` used a raw operator string (e.g.,
"Add") as the reduction name, while `processReductionArguments` at the
use site computed a canonical name via `getReductionName` (e.g.,
"add_reduction_byref_rec__QFTt"). The `lookupSymbol` in
`createDeclareReductionHelper` never found the already-created op, so it
fell through to `createDeclareReduction` which called
`getReductionInitValue` with the derived type and hit the TODO.
The fix has three parts:
1. Consistent names: In `genOMP` for `OpenMPDeclareReductionConstruct`,
compute the reduction name using the same `getReductionName` scheme that
`processReductionArguments` uses, so both sites produce identical symbol
names. For intrinsic operators, this maps through `ReductionIdentifier`
to get the canonical name. For user-defined named reductions, the raw
symbol name is used directly, matching the existing custom-reduction
lookup path.
2. Reuse reduction: In `processReductionArguments`, when an intrinsic
operator reduction is requested, check whether a user-defined declare
reduction already exists under that canonical name before attempting to
create a new one. If found, reuse it. This avoids calling
`createDeclareReduction` (and thus `getReductionInitValue`) for types
that have user-provided initializers.
3. Reference semantics: Change `doReductionByRef` to return true for
derived types. Previously it returned false for both trivial and derived
types, treating derived types as by-val. This is incorrect for
user-defined combiners that operate on components via side-effects
(e.g., `omp_out%x = omp_out%x + omp_in%x`): the combiner mutates
`omp_out` in place and doesn't produce a whole-struct value, so
`convertExprToValue` returns the component type (`i32`) rather than the
struct type, causing a type mismatch in the `omp.yield`. By-ref is the
correct model: the combiner stores into the lhs reference and yields it.
The combiner callback in `processReductionCombiner` is also updated to
handle the by-ref derived-type case: when the combiner result type
doesn't match the element type (as happens with component-level
assignments), the store is skipped since the assignment already wrote
into omp_out as a side-effect, and only the lhs reference is yielded.
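The naming mismatch and the reuse step can be sketched with a dictionary standing in for the symbol table. Everything here is hypothetical: `canonical_name` only reproduces the shape of the example name quoted above and is not the real `getReductionName` scheme, and `use_site_lookup` is an invented helper.

```python
def canonical_name(op: str, type_tag: str, by_ref: bool) -> str:
    # Hypothetical reconstruction shaped to match the example name above.
    suffix = "_byref" if by_ref else ""
    return f"{op}_reduction{suffix}_{type_tag}"


def use_site_lookup(symbols: dict, op: str, type_tag: str, by_ref: bool):
    name = canonical_name(op, type_tag, by_ref)
    if name in symbols:
        # Reuse the user-defined declare reduction registered under this name.
        return symbols[name]
    # Stand-in for falling through to the init-value TODO.
    raise NotImplementedError(f"no predefined init value for {type_tag}")


# Before: the declaration was registered under the raw operator string.
old_table = {"Add": "user initializer"}
# After: the declaration uses the same canonical name as the use site.
new_table = {canonical_name("add", "rec__QFTt", True): "user initializer"}
```

Looking up `("add", "rec__QFTt", True)` succeeds against `new_table` and raises against `old_table`, which mirrors the failure described above.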
Test updates:
- Update declare-reduction-intrinsic-op.f90 from a negative test
(checking for the TODO error) to a positive test checking the generated
MLIR.
- Update omp-declare-reduction-derivedtype.f90 CHECK lines to match the
reference semantics fix: the `declare_reduction` now has type
`!fir.ref<...>` with a `byref_element_type` attribute, an alloc region,
a two-argument init region, and a combiner that stores into the lhs and
yields the reference. The function body checks for initme and mycombine
are unchanged in substance but use literal type names instead of a regex
capture to avoid greedy matching issues with nested angle brackets.
Remaining work: declare reduction without an initializer clause is not
yet supported. I plan to address that subsequently.
Assisted-by: Claude Opus 4.6.
Note: Relied on LLM (Claude Opus 4.6) to help navigate the Flang APIs
and assist with the corresponding boilerplate code & tests updates; in
particular: in order to get the aforementioned consistent naming, in
`ReductionProcessor::getReductionName` I had to get rid of
`parser::DefinedOperator::EnumToString` and instead introduce
`getRedIdFromParserIntrOp` (which does the conversion manually; just to
make sure I haven't missed anything: is there no existing conversion
function? AFAICT, there is none, but I might've missed it). In any case,
feedback welcome!
---------
Co-authored-by: Matt P. Dziubinski <matt-p.dziubinski@hpe.com>
Refactor how `func.func` discardable attributes are handled in the
Func-to-LLVM conversion. Instead of ad hoc checks for linkage and
readnone followed by a simple filter, the pass now generically processes
inherent attributes from LLVMFuncOp.
Attributes that correspond to inherent `llvm.func` ODS names can be
attached as `llvm.<name>` on `func.func` and are stripped to `<name>`
when building `LLVM::LLVMFuncOp`, so LLVM-specific knobs stay namespaced
on the source op but land on the right inherent slots on `llvm.func`.
Other discardable attributes continue to be propagated as-is.
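The attribute handling described above amounts to a prefix-stripping pass over a dictionary. The sketch below is a simplified model, not the conversion pass: `convert_attrs` is a hypothetical helper, and the set of inherent names passed in is an illustrative placeholder, not the real ODS list.

```python
def convert_attrs(func_attrs: dict, inherent_names: set) -> dict:
    """Map `llvm.<name>` to `<name>` when <name> is an inherent attribute of
    the target op; propagate every other attribute unchanged."""
    result = {}
    for key, value in func_attrs.items():
        prefix, _, name = key.partition(".")
        if prefix == "llvm" and name in inherent_names:
            result[name] = value      # lands on the inherent slot
        else:
            result[key] = value       # discardable attribute, kept as-is
    return result
```

For example, with `inherent_names = {"linkage"}`, an input of `{"llvm.linkage": "internal", "my.tag": 1}` becomes `{"linkage": "internal", "my.tag": 1}`.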
Fixes #175959
Fixes #181464
Assisted-by: CLion code completion, GPT 5.3-Codex
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Add support for the `-fdebug-info-for-profiling` option in flang.
- When `-fdebug-info-for-profiling` is passed, the compiler sets the
`DebugInfoForProfiling` flag and triggers the `AddDiscriminatorsPass`.
This pass inserts additional debug metadata, specifically discriminator
values, into the IR to improve profiling precision.
- Additionally, the `-add-debug-info` pass has been updated to emit an
extra field, `debugInfoForProfiling: true`, inside the generated
DICompileUnit metadata node.
This patch moves the OpenMP offloading module attributes handling from
flang to mlir so that it can be reused in ClangIR as well.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Michael Kruse <github@meinersbur.de>
Use fir.bitcast in FIR-to-MemRef casts so bit patterns are preserved
(e.g. TRANSFER), while keeping fir.convert for memref/reference
marshaling and non-bitcast-compatible cases.
Check if the code associated with a nest or sequence construct is well
formed. Emit diagnostic messages if not.
Make a clearer separation for checks of loop-nest-associated and
loop-sequence-associated constructs.
Unify structure of some of the more common messages.
Issue: https://github.com/llvm/llvm-project/issues/185287