Use shift instead of sliceLb only when the array_coor has an explicit
slice (indicesAreFortran case). When the slice comes from an embox,
the indices are 1-based section indices and must subtract 1.
When an ArrayRefParameter (or OptionalArrayRefParameter) appears in a
non-last position within a struct() assembly format directive, the
printed output is ambiguous: the comma-separated array elements are
indistinguishable from the struct-level commas separating key-value
pairs.
Fix this by wrapping such parameters in square brackets in both the
generated printer and parser. The printer emits '[' before and ']' after
the array value; the parser calls parseLSquare()/parseRSquare() around
the FieldParser call. Parameters with a custom printer or parser are
unaffected (the user controls the format in that case).
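To illustrate the ambiguity, here is a minimal Python sketch of a struct-style printer. It is not the code mlir-tblgen generates; `print_struct` and its `bracket_arrays` flag are hypothetical names used only for this illustration.

```python
def print_struct(fields, bracket_arrays):
    """Print key-value pairs separated by commas, like a struct() directive.

    `fields` maps parameter names to a scalar or a list (the list stands in
    for an ArrayRefParameter). Hypothetical helper, for illustration only.
    """
    parts = []
    for name, value in fields.items():
        if isinstance(value, list):
            body = ", ".join(str(v) for v in value)
            if bracket_arrays:
                # Delimit the array so its commas cannot be confused
                # with the struct-level separators.
                body = "[" + body + "]"
        else:
            body = str(value)
        parts.append(f"{name} = {body}")
    return ", ".join(parts)


fields = {"dims": [1, 2], "rank": 3}
# Without brackets, the comma inside the array looks like a struct-level comma.
ambiguous = print_struct(fields, bracket_arrays=False)
# With brackets, the extent of the array is explicit.
bracketed = print_struct(fields, bracket_arrays=True)
print(ambiguous)
print(bracketed)
```

In the unbracketed form a parser that has just read `1` cannot tell from the following comma alone whether another array element or the next key follows.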
Fixes #156623
Assisted-by: Claude Code
This is a follow-up fix for commit 0f5e9bee.
Only write effects to thread-local memory should be considered safe to
parallelize in workshare lowering, not reads. When both reads and writes
were safe, the cascading effect in moveToSingle could cause entire
SingleRegions to become fully parallelized, eliminating the omp.single
and its implicit barrier. This removed synchronization points needed to
keep threads coordinated inside sequential loops containing workshared
operations, causing race conditions in forall-workshare patterns.
This was exposed by the Fujitsu Test Suite and made the following tests
regress:
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0031.test
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0013.test
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0030.test
FAIL: test-suite :: Fujitsu/Fortran/0398/Fujitsu-Fortran-0398_0014.test
Updates #143330
Refactor how `func.func` discardable attributes are handled in the
Func-to-LLVM conversion. Instead of ad hoc checks for linkage and
readnone followed by a simple filter, the pass now generically processes
inherent attributes from LLVMFuncOp.
Attributes that correspond to inherent `llvm.func` ODS names can be
attached as `llvm.<name>` on `func.func` and are stripped to `<name>`
when building `LLVM::LLVMFuncOp`, so LLVM-specific knobs stay namespaced
on the source op but land on the right inherent slots on `llvm.func`.
Other discardable attributes continue to be propagated as-is.
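The attribute handling described above can be sketched roughly as follows. This is a Python model rather than the actual C++ conversion; the function name and the contents of `inherent_names` are assumptions made for the example.

```python
def convert_func_attrs(discardable, inherent_names):
    """Split func.func attributes into llvm.func inherent and discardable ones.

    `inherent_names` stands in for the set of inherent `llvm.func` ODS
    attribute names; its contents below are hypothetical.
    """
    prefix = "llvm."
    inherent, passthrough = {}, {}
    for name, value in discardable.items():
        stripped = name[len(prefix):] if name.startswith(prefix) else None
        if stripped in inherent_names:
            # `llvm.<name>` lands on the inherent `<name>` slot.
            inherent[stripped] = value
        else:
            # Everything else is propagated unchanged.
            passthrough[name] = value
    return inherent, passthrough


inherent, passthrough = convert_func_attrs(
    {"llvm.linkage": "internal", "my.tag": 1, "llvm.not_inherent": True},
    inherent_names={"linkage"},
)
print(inherent, passthrough)
```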
Fixes #175959
Fixes #181464
Assisted-by: CLion code completion, GPT 5.3-Codex
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Use fir.bitcast in FIR-to-MemRef casts so bit patterns are preserved
(e.g. TRANSFER), while keeping fir.convert for memref/reference
marshaling and non-bitcast-compatible cases.
Previously, 32-bit types (integer, real, logical, complex) were printed
without the (kind=4) suffix in DWARF debug type names, while other sizes
always included the kind suffix. This inconsistency is now removed by
always appending (kind=X) to all basic type names, making the format
uniform across all type sizes.
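As a rough illustration of the naming change, the sketch below models the old and new behavior in Python. The helper name, the `always_suffix` flag, and the assumption that the old special case keyed on kind 4 are mine, not taken from the patch.

```python
def debug_type_name(base, kind, always_suffix=True):
    """Build a DWARF basic type name such as 'integer(kind=4)'.

    With always_suffix=False this models the previous behavior, assumed
    here to drop the suffix only for kind 4.
    """
    if always_suffix or kind != 4:
        return f"{base}(kind={kind})"
    return base


old_name = debug_type_name("integer", 4, always_suffix=False)
new_name = debug_type_name("integer", 4)
print(old_name, new_name)
```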
Fixes https://github.com/llvm/llvm-project/issues/119478.
When fir.array_coor carries an explicit shape_shift (non-default lower
bounds) and an explicit slice, the indices are Fortran indices rather
than 1-based section indices. The FIRToMemRef pass was unconditionally
subtracting 1 from sliced indices, which is only correct for 1-based
section indices (the embox-with-embedded-slice case).
For shape_shift + explicit slice, the correct adjustment is to subtract
the slice lower bound instead of 1. This produces proper 0-based memref
indices.
This pattern arises after the FIR inliner canonicalizes
fir.embox(shape_shift, slice) + fir.array_coor(box) into a single
fir.array_coor with explicit shape_shift and slice operands, where the
indices become Fortran indices.
Without this fix, arrays with non-default lower bounds (e.g., A(0:N) or
A(-1:N)) produce negative memref indices, writing before the array
allocation and causing a segfault.
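The index arithmetic can be illustrated with a small Python sketch. It assumes the simplest case, where the slice starts at the array's own lower bound, so the slice lower bound and the array lower bound coincide; the function names are hypothetical.

```python
def old_memref_index(index):
    # Previous behavior: always treat the index as a 1-based section index.
    return index - 1


def memref_index(fortran_index, slice_lower_bound):
    # Fixed behavior for shape_shift + explicit slice: subtract the slice
    # lower bound so the first element maps to 0.
    return fortran_index - slice_lower_bound


# First element of A(0:N) and of A(-1:N), sliced from the lower bound.
for lb in (0, -1):
    print(lb, old_memref_index(lb), memref_index(lb, lb))
```

For both arrays the old computation yields a negative index for the first element, while subtracting the lower bound yields 0.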
LZ: processor-dependent (default, flang prints leading zero); LZS:
suppress the optional leading zero before the decimal point; LZP: print
the optional leading zero before the decimal point. Changes span the
source parser, compile-time format validator, runtime format processing,
and runtime output formatting. Includes semantic test (io18.f90) and
documentation updates.
Extends `do concurrent` device support by emitting compiler-generated
declare mapper ops for live-ins whose types are record types and have
allocatable members.
In some cases, a `fir.use_stmt` operation can end up in an offload
region, for example in an acc routine. Make sure we can validate the
symbols associated with the `fir.use_stmt` operation.
This fixes a bug in USM mode where the `close` map type modifier was
attached to some `map.info.op`'s corresponding to user-defined type
members while the parent type instance itself is not marked as `close`.
This fix ensures that if a parent record type map does not have the
'close' flag, it is cleared from its members as well, maintaining
consistency.
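The consistency rule can be sketched as below. This is a simplified Python model using sets of flag names; the real fix operates on map info operations and their map type flags, and `normalize_close` is a hypothetical name.

```python
def normalize_close(parent_flags, member_flags_list):
    """Clear 'close' from members when the parent map does not carry it."""
    if "close" in parent_flags:
        return [set(flags) for flags in member_flags_list]
    return [set(flags) - {"close"} for flags in member_flags_list]


# Parent without `close`: the flag is removed from its members.
members = normalize_close({"to"}, [{"to", "close"}, {"tofrom"}])
print(members)
```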
Gemini was used to create the tests, which were derived from a
reproducer I was working with to debug the issue. The AI-generated test
code was reviewed line-by-line by me.
Assisted-by: Gemini <gemini@google.com>
We have to materialize `fir.box` before adding a `fir.convert` to a
memref type. Otherwise we get:
`'fir.convert' op invalid type conversion'!fir.box<!fir.array<?xi32>>' /
'memref<?xi32, strided<[?], offset: ?>>'`
This is a fix for two problems that caused a crash:
1. Thread-local variables sometimes are required to be parallelized.
Added a special case to handle this in
`LowerWorkshare.cpp:isSafeToParallelize`.
2. Race condition caused by a `nowait` added to the `omp.workshare` if
it is the last operation in a block. This allowed multiple threads to
execute the `omp.workshare` region concurrently. Since
_FortranAPushValue modifies a shared stack, this concurrent access
causes a crash. Disable the addition of `nowait` and rely on the
implicit barrier at the end of the `omp.workshare` region.
Fixes #143330
Pass
----
Add the `acc-recipe-materialization` pass, which materializes OpenACC
privatization, firstprivate and reduction recipes by inlining their
init, copy, combiner, and destroy regions into the operation for the
construct. The pass runs on acc.parallel, acc.serial, acc.kernels, and
acc.loop.
- Firstprivate: Inserts acc.firstprivate_map so the initial value is
available on the device, then clones the recipe init and copy regions
into the construct and replaces uses with the materialized alloca.
Optional destroy region is cloned before the region terminator.
- Private: Clones the recipe init region into the construct (at region
entry or at the loop op for acc.loop private). Replaces uses of the
recipe result with the materialized alloca. Optional destroy region is
cloned before the region terminator.
- Reduction: Creates acc.reduction_init (init region inlined) and
acc.reduction_combine_region (combiner region inlined). All uses of the
reduction in the region are updated to the reduction init result.
New operations
--------------
- acc.reduction_init: Allocates and initializes a private reduction
variable from a recipe. Takes the original reduction variable and
reduction_operator; has a single region that must yield one value (the
private storage) via acc.yield. Used by the pass to materialize
acc.reduction_recipe init regions inside the compute construct.
- acc.reduction_combine_region: Combines the private reduction value
with the shared reduction variable. Takes the shared and private
memrefs; has a single region (the recipe combiner) terminated by
acc.yield with no operands. Used by the pass to materialize the
reduction recipe combiner.
Both ops implement RegionBranchOpInterface. acc.yield is updated to
allow terminating ReductionInitOp and ReductionCombineRegionOp regions.
Supporting changes
------------------
- OpenACCUtilsLoop: Factor cloneACCRegionInto out of the existing
loop-conversion helper so the pass can clone recipe regions with
optional result replacement; loop conversion now calls the shared
helper.
- Flang: Add ReductionInitOpFortranObjectViewModel
(FortranObjectViewOpInterface) for acc.reduction_init and register it in
OpenACC extensions.
Tests
-----
- MLIR: acc-recipe-materialization-{firstprivate,private,reduction,
kernel-private,parallel}.mlir (memref dialect).
- Flang: acc-recipe-materialization-{firstprivate,firstprivate-derived,
private,reduction,kernel-private,parallel}.fir; firstprivate test has a
second RUN with -acc-optimize-firstprivate-map.
---------
Co-authored-by: Scott Manley <rscottmanley@gmail.com>
Enable Flang to match Clang behavior for command-line recording in DWARF
producer strings when using -grecord-command-line.
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
Remove the special-case that handled `fir.array_coor` with a
block-argument base by converting the element ref result (!fir.ref<i32>
-> memref<i32>) and leaving fir.array_coor alive.
Instead, we now always convert the base (!fir.ref<!fir.array<...>> ->
memref<...>) and compute the memref indices from the fir.array_coor
operands, so loads/stores become memref.load/store base[indices] and
fir.array_coor can be erased when it’s only used by memory ops.
When a target region is placed inside a constant false condition (e.g.,
`if (.false.)`), the dead code gets eliminated on the host side,
removing the `omp.target` operation entirely. However, the device-side
compilation pipeline is unaware of this elimination and attempts to
generate kernel code. Since the host never created offload metadata for
the eliminated target, the device-side kernel function lacks the
"kernel" attribute, causing `OpenMPOpt` to fail with an assertion when
it expects all outlined kernels to have this attribute. The problem can
be seen with the following code:
```fortran
program cele
implicit none
real :: V
integer :: i
if (.false.) then
!$omp target teams distribute parallel do
do i = 1, 5
V = V * 2
end do
!$omp end target teams distribute parallel do
end if
end program
```
It currently fails with the following assertion:
```
Assertion `omp::isOpenMPKernel(*Kernel) && "Expected kernel function!"' failed.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp:4291
```
This PR adds `DeleteUnreachableTargetsPass` that identifies `omp.target`
operations in unreachable code blocks and removes them.
This patch splits up the `vscaleRange` pass-option from the
`VScaleAttrPass` into `vscaleMin` and `vscaleMax` respectively, since a
`std::pair<>` cannot be used as a cli-option and crashes when running
`flang -march=rv64gcv -O3 file.f90 -mmlir -debug`.
Since the options can now be set individually I added some error
checking following the semantics described in the langref
https://llvm.org/docs/LangRef.html#function-attributes.
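A minimal sketch of the kind of check this implies is shown below. It reflects my reading of the `vscale_range` description in the LangRef (minimum greater than 0, maximum of 0 meaning unbounded); the function name and messages are hypothetical, and the checks in the pass itself may differ.

```python
def check_vscale_range(vscale_min, vscale_max):
    """Return an error string, or None if the pair looks acceptable."""
    if vscale_min <= 0:
        return "vscale minimum must be greater than 0"
    if vscale_max != 0 and vscale_max < vscale_min:
        return "vscale maximum must be 0 (unbounded) or at least the minimum"
    return None


for pair in [(1, 16), (2, 0), (0, 4), (8, 4)]:
    print(pair, check_vscale_range(*pair))
```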
I also added tests since there were none for only this pass before.
This change makes `fir.field_index` a Pure operation, and
adds support for the `ConditionallySpeculatable` interface to
`fir.coordinate_of`. The test demonstrates how this affects
Flang LICM.
This pass optimizes acc.firstprivate_map operations generated during
OpenACC recipe materialization when acc.firstprivate is materialized
into the mapping and a private allocation inside the region. The
optimization applies to scalar variables of trivial types (integers,
reals, logicals) as long as they are not optional.
The pass hoists loads from the firstprivate variable to before the
compute region, converting the firstprivate copy to a pass-by-value
pattern. This eliminates the need to copy the firstprivate variable at
runtime, since only its value is needed for initializing private copies.
This patch updates the function filtering OpenMP pass intended to remove
host functions from the MLIR module created by Flang lowering when
targeting an OpenMP target device.
Host functions holding target regions must be kept, so that the target
regions within them can be translated for the device. The issue is that
non-target operations inside these functions cannot be discarded because
some of them hold information that is also relevant during target device
codegen. Specifically, mapping information resides outside of
`omp.target` regions.
This patch updates the previous behavior where all host operations were
preserved to then ignore all of those that are not actually needed by
target device codegen. This, in practice, means only keeping target
regions and mapping information needed by the device. Arguments for some
of these remaining operations are replaced by placeholder allocations
and `fir.undefined`, since they are only actually defined inside of the
target regions themselves.
As a result, this set of changes makes it possible to later simplify
target device codegen, as it is no longer necessary to handle host
operations differently to avoid issues.
This change moves the declare of result storage alloca before the call
so that alias analysis can revert to linking fir.declare to the first
dominating dummy_scope instead of the dominating one.
This is only relevant when MLIR inlining is enabled and is the first
step toward fixing an issue, exposed by recent TBAA changes that placed
target data in its own tree, with the result storage of a TARGET result.
After inlining, the usages of the result storage inside the callee and
after the call ended up being placed in different nodes (target and
non-target) of the same TBAA tree (for the dominating function).
The fact that both nodes are placed in the same tree stems from
https://github.com/llvm/llvm-project/pull/146006 that fixed another TBAA
issue related to MLIR inlining and function result where the function
result was placed into the wrong TBAA tree, which with nested inlining
could end up being the tree of a callee where the result storage was a
dummy, causing the TBAA to wrongly report that any access to the result
storage inside the nested callee did not alias with any access after the
call.
By moving the declare before the call that will be inlined, this patch
will allow reverting #146006 and fixing both issues: the TBAA emitted
for usages of the result storage after the call will always be placed in
a different TBAA tree than any usages of the result storage inside the
callee.
This patch implements `OperationMoveOpInterface::canMoveOutOf()`
method for `acc.loop`, such that even Pure operations are not hoisted
by LICM if any of their operands are referenced in the data operands
of `acc.loop`. Related to #175108.
This patch implements `ConditionallySpeculatable` interface for some
FIR operations (`embox`, `rebox`, `box_addr`, `box_dims` and `convert`).
It also adds `Pure` trait for `fir.shape`, `fir.shapeshift`,
`fir.shift` and `fir.slice`.
I could have split this into multiple patches, but the changes
are better tested together on real apps, and the amount of affected
code is small.
There are more `NoMemoryEffect` operations for which I am planning
to do the same in future PRs.
Support `cuf.device_address` the same way as `fir.address_of`.
This implementation implies that the host address and the device
address `MustAlias` (as shown in the new test). This should be
conservatively correct as long as `MustAlias` does not allow
to assume that the actual addresses are the same (that is what
LLVM documentation implies, I believe).
It is probably worth adding an operation interface to handle
`fir::AddrOfOp` and `cuf::DeviceAddressOp` in FIR AliasAnalysis,
but for the initial implementation I hardcoded the checks.
I also removed the call to `fir::valueHasFirAttribute`, which performs
on-demand SymbolTable lookups that may be costly, and added
SymbolTable caching in FIR AliasAnalysis object. Anyway,
`fir::valueHasFirAttribute` does not work for `cuf::DeviceAddressOp`.
This PR makes a couple of minor tweaks to the lowering for
declare_mapper operations:
1) Add declare_mapper operations to the list of global operations that
have optimisation passes executed on them. This is primarily to keep it
in line with other global operations that contain regions. It prevents
oddities where we embed FIR/HLFIR into the mapper that needs to be
lowered before being converted to LLVM-IR. One example that springs to
mind is if we ever decide to remove the single block condition on the
operation to allow conditional checks for mapped data.
2) Add a CodeGenOpenMP.cpp conversion for DeclareMapperOp to make sure
we convert the return type correctly from a BoxType to a struct type
rather than an opaque pointer when lowering. Currently, I've left out
the block argument types from being converted as they're wrapped in a
fir.ref and would be opaque pointers in either case.
So some minor additions to keep declare_mapper a little more in line
with the rest of the OpenMP operations.
Moves a `todo` to check for the current level of support for by-ref
reductions to the `FunctionFiltering` pass. This guarantees that the
check does not trigger when the same module is compiled twice: on the
CPU and on the GPU.
Add a verification pass that checks live-in values and symbol references
within offload regions are legal for the target execution model.
When code is offloaded to a device (e.g., GPU), not all values and
symbols from the host context are directly accessible. Data must be
explicitly mapped via OpenACC data clauses (copyin, create, present
etc.), declared with device attributes, or be trivial scalars that can
be passed by value. Similarly, symbol references to globals must have
proper `declare` attributes or device-resident data attributes.
This pass walks operations implementing `OffloadRegionOpInterface`,
which includes OpenACC compute constructs (`acc.parallel`,
`acc.kernels`, `acc.serial`) as well as GPU operations like
`gpu.launch`. For each region, it uses liveness analysis to identify
values flowing into the region and checks their validity using the
`OpenACCSupport` analysis.
Key features:
- Validates live-in values against OpenACC data mapping requirements
- Validates symbol references for device accessibility
- Supports soft-check mode for diagnostic-only verification
- Configurable device_type for target-specific behavior
This patch uses the fir.use_stmt operations to generate correct debug
metadata for use statements when `only` and `=>` are used. The debug flow
is changed a bit where we process the module globals first so that we
have the global variables when we start to process `fir.use_stmt`.
Fixes #160923.
In #173438 I added a FIR specific loop invariant code motion pass.
During the review, Tom pointed out certain limitations about OpenMP
dialect operations that should be taken into consideration during
transformations such as LICM:
https://github.com/llvm/llvm-project/pull/173438#discussion_r2657612148
I also found issues with hoisting operations out of `acc.loop`
operations in certain conditions (see the added test in `licm.fir`).
I am proposing a new operation interface that will allow controlling the
movement of operations during MLIR transformations. In particular, I
propose two methods (there might be more):
* op.canMoveOutOf(cand) - returns true, if it is allowed to move 'cand'
operation out of 'op'.
* op.canMoveFromDescendant(descendant, cand) - returns true, if it is
allowed to move 'cand' out of 'descendant' and into 'op'.
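To make the intended use of `canMoveOutOf` concrete, here is a small Python model of a LICM-style query. The classes, the `hoistable` helper, and the policy of the restricted loop (refusing candidates whose operands appear among its data operands) are all hypothetical and chosen only for illustration; they are not the FIR interface definitions.

```python
class Op:
    def __init__(self, name, operands=()):
        self.name = name
        self.operands = tuple(operands)

    def can_move_out_of(self, cand):
        # Default: the op places no extra restriction on hoisting.
        return True


class RestrictedLoop(Op):
    """Hypothetical loop op that pins values used by its data operands."""

    def __init__(self, name, data_operands):
        super().__init__(name)
        self.data_operands = set(data_operands)

    def can_move_out_of(self, cand):
        return not any(v in self.data_operands for v in cand.operands)


def hoistable(loop, cand, is_invariant):
    # A transformation asks the enclosing op before moving `cand` out of it.
    return is_invariant and loop.can_move_out_of(cand)


loop = RestrictedLoop("loop", data_operands={"%a"})
print(hoistable(loop, Op("use_a", ["%a"]), True))
print(hoistable(loop, Op("use_b", ["%b"]), True))
```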
I used the new interface to get rid of explicit OpenMP interface checks
in Flang's LICM, and I also used it for the `acc.loop` operation (though
I provided a conservative initial implementation).
The new interface is part of FIR dialect, but I think it would better
fit into the core MLIR set of interfaces so that the checks that I make
in Flang's LICM are actually done in
`mlir::moveLoopInvariantCode`. Moreover, other code movement
transformations that may appear in MLIR may also need to use such an
interface.
I would like to get some feedback on whether it is reasonable to move
the interface to core MLIR.
Add comprehensive APIs to detect device-resident data across OpenACC
type and operation interfaces. This enables passes to identify data that
is already on the device (e.g., CUF device/managed/constant memory, GPU
address spaces) and handle it appropriately.
New interface methods:
- PointerLikeType::isDeviceData(Value): Returns true if the pointer
points to device data.
- MappableType::isDeviceData(Value): Returns true if the variable
represents device data.
- GlobalVariableOpInterface::isDeviceData(): Returns true if the global
variable is device data.
New utilities in OpenACCUtils:
- acc::isDeviceValue(Value): Checks if a value represents device data by
querying type interfaces, PartialEntityAccessOpInterface for base
entities, and AddressOfGlobalOpInterface for global symbols.
- acc::isValidValueUse(Value, Region): Checks if a value is legal in an
OpenACC region by verifying it comes from a data operation, is only used
by private clauses, or is device data.
Updated isValidSymbolUse to check
GlobalVariableOpInterface::isDeviceData() for symbols referencing
device-resident globals.
FIR implementations check for CUF data attributes (device, managed,
constant, shared, unified) on operations, block arguments, and globals.
The implementation traces through fir.rebox, fir.embox, fir.declare,
hlfir.declare, and fir.address_of to find the underlying data source.
Memref implementations check for gpu::AddressSpaceAttr on the memref
type.
Updated ACCImplicitData to use acc::isDeviceValue for generating
acc.deviceptr clauses for device-resident data instead of
copyin/copyout.
Updated OpenACCSupport::isValidValueUse to fall back to the new
acc::isValidValueUse utility.
Add support for array sections in private, firstprivate, and reduction.
Key changes:
- Change the related data operation result type to return the same type
as the array base (same type as the acc variable input in the
operation), while it was the type of the section before. This allows
remapping the base to the result value (to use the data operation result
as the base when generating addressing inside the compute region).
- The generatePrivateInit implementation of FIROpenACCTypeInterfaces is
modified to allocate storage only for the section, and to return the
mock base address (that is the address of the allocation minus the
offset/lower bound of the privatized section).
- The code generating the copy and combiner region is moved from
OpenACC.cpp to FIROpenACCTypeInterfaces.cpp via the addition of new
generateCopy and generateCombiner interface in the
MappableTypeInterface. This allows sharing all the addressing helper
with generatePrivateInit, and will allow late generation of all recipes
with Fortran.
- Update generatePrivateDestroy to deallocate the beginning of the
section if any.
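The mock base address mentioned in the list above can be illustrated with plain address arithmetic. The sketch assumes a contiguous one-dimensional array with unit stride; the function names and the concrete addresses are made up for the example.

```python
def mock_base(alloc_addr, section_lb, base_lb, elem_size):
    # Address the whole array would start at if the section's storage
    # were part of it: allocation address minus the section's offset.
    return alloc_addr - (section_lb - base_lb) * elem_size


def element_addr(base_addr, index, base_lb, elem_size):
    # Ordinary addressing relative to the (mock) base.
    return base_addr + (index - base_lb) * elem_size


# Privatize A(10:20) of A(1:100) with 4-byte elements; only the section
# is allocated, at hypothetical address 1000.
base = mock_base(alloc_addr=1000, section_lb=10, base_lb=1, elem_size=4)
print(base, element_addr(base, 10, 1, 4), element_addr(base, 20, 1, 4))
```

Addressing through the mock base with the original indices lands inside the section's allocation, so code in the compute region can keep using the base's indexing unchanged.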
In the process, the generatePrivateInit implementation is
modified so that it is more uniform to make it easier to deal with the
section. This also allowed removing runtime calls when initializing the
private for array reduction.
This patch introduces FIRToMemRef, a lowering pass that converts FIR
memory operations to the MemRef dialect, including support for slices,
shifts, and descriptor-style access patterns. To support partial
lowering, where FIR and MemRef types can coexist, we extend the handling
of fir.convert to correctly marshal between FIR reference-like types and
MemRef descriptors. The patch also factors the type conversion logic
into a reusable FIRToMemRefTypeConverter, which centralizes the rules
for converting FIR types (e.g. !fir.ref, !fir.box, sequences, logicals)
to their corresponding memref types, and is used throughout the new
pass.
---------
Co-authored-by: Scott Manley <rscottmanley@gmail.com>
Co-authored-by: jeanPerier <jean.perier.polytechnique@gmail.com>
Reserve "-funsafe-cray-pointers" (with "f") for the driver. In the
fir-alias-analysis use "-unsafe-cray-pointers" (without "f").
This prevents conflicts with how certain kinds of tools use the "unsafe
Cray pointers" options.
The following test was triggering a runtime crash **on the host before
launching the kernel**:
```fortran
program test_omp_target_map_bug_v5
implicit none
type nested_type
real, allocatable :: alloc_field(:)
end type nested_type
type nesting_type
integer :: int_field
type(nested_type) :: derived_field
end type nesting_type
type(nesting_type) :: config
allocate(config%derived_field%alloc_field(1))
!$OMP TARGET ENTER DATA MAP(TO:config, config%derived_field%alloc_field)
!$OMP TARGET
config%derived_field%alloc_field(1) = 1.0
!$OMP END TARGET
deallocate(config%derived_field%alloc_field)
end program test_omp_target_map_bug_v5
```
In particular, the runtime produced a segmentation fault when the
test was compiled with any optimization level > 0; when compiled with
-O0, the sample ran fine.
After debugging the runtime, it turned out the crash was happening at
the point where the runtime calls the default mapper emitted by the
compiler for `nesting_type`; in particular at this point in the runtime:
c62cd2877c/offload/libomptarget/omptarget.cpp (L307).
Bisecting the optimization pipeline using `-mllvm -opt-bisect-limit=N`,
the first pass that triggered the issue on `O1` was the `instcombine`
pass. Debugging this further, the issue narrows down to canonicalizing
`getelementptr` instructions from using struct types (in this case the
`nesting_type` in the sample above) to using addressing bytes (`i8`). In
particular, you would see something like the first snippet below in `O0`,
and something like the second after `instcombine`:
```llvm
define internal void @.omp_mapper._QQFnesting_type_omp_default_mapper(ptr noundef %0, ptr noundef %1, ptr noundef %2, i64 noundef %3, i64 noundef %4, ptr noundef %5) #6 {
entry:
%6 = udiv exact i64 %3, 56
%7 = getelementptr %_QFTnesting_type, ptr %2, i64 %6
....
}
```
```llvm
define internal void @.omp_mapper._QQFnesting_type_omp_default_mapper(ptr noundef %0, ptr noundef %1, ptr noundef %2, i64 noundef %3, i64 noundef %4, ptr noundef %5) #6 {
entry:
%6 = getelementptr i8, ptr %2, i64 %3
....
}
```
The `udiv exact` instruction emitted by the OMP IR Builder (see:
c62cd2877c/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (L9154))
allows `instcombine` to assume that `%3` is divisible by the struct size
(here `56`) and, therefore, replaces the result of the division with
a direct GEP on `i8` rather than the struct type.
However, the runtime was calling
`@.omp_mapper._QQFnesting_type_omp_default_mapper` not with `56` (the
proper struct size) but with `48`!
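The mismatch can be checked with a few lines of arithmetic. The Python sketch below models the byte offsets produced by the two GEP forms; the helper names are hypothetical, and the truncating division stands in for what the unoptimized code computes in practice when the `exact` precondition does not hold (in LLVM IR the result of `udiv exact` is poison in that case).

```python
STRUCT_SIZE = 56  # size of nesting_type in the example above


def struct_gep_offset(size_in_bytes):
    # O0 form: divide by the struct size, then index by whole structs.
    count = size_in_bytes // STRUCT_SIZE
    return count * STRUCT_SIZE


def byte_gep_offset(size_in_bytes):
    # instcombine form: index directly in bytes.
    return size_in_bytes


def exact_precondition_holds(size_in_bytes):
    # `udiv exact` requires the dividend to be a multiple of the divisor.
    return size_in_bytes % STRUCT_SIZE == 0


for size in (56, 48):
    print(size, exact_precondition_holds(size),
          struct_gep_offset(size), byte_gep_offset(size))
```

With `56` both forms agree; with `48` the precondition is violated and the two forms compute different end pointers (offset 0 versus offset 48), which is consistent with the crash appearing only once `instcombine` runs.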
Debugging this further, I found that the size of `omp.map.info`
operation to which the default mapper is attached computes the value of
`48` because we set the map to partial (see:
c62cd2877c/flang/lib/Optimizer/OpenMP/MapInfoFinalization.cpp (L1146)
and
c62cd2877c/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (L4501-L4512)).
However, I think this is incorrect since the emitted mapper (and
user-defined mappers in general) are defined on the whole struct type
and should never be marked as partial. Hence, the fix in this PR.
This allows speculating recursively speculatable operations
containing `fir.result`. Note that making it Pure does not allow
speculating `fir.result` itself from its containing operation,
since it is a terminator.
The new pass allows hoisting some `fir.load` operations early
in MLIR. For example, many descriptor loads might be hoisted
out of the loops, though it does not make much difference
in performance, because LLVM is able to optimize such loads
(which are lowered as `llvm.memcpy` into temporary descriptors),
given that proper TBAA information is generated by Flang.
Further hoisting improvements are possible in [HL]FIR LICM,
e.g. getting proper mod-ref results for Fortran runtime calls
may allow hoisting loads from global variables, which LLVM
cannot do due to lack of alias information.
This patch also contains improvements for FIR mod-ref analysis:
We may recurse into `HasRecursiveMemoryEffects` operations and
use `getModRef` recursively to get more precise results for
regions with `fir.call` operations.
This patch also modifies `AliasAnalysis` to set the instantiation
point for cases where the tracked data is accessed through a load
from `!fir.ref<!fir.box<>>`: without this change the mod-ref
analysis was not able to recognize user pointer/allocatable variables.
LICM (#173438) may insert new operations at the beginning of
`fir.do_concurrent`'s block and they cannot always be hoisted
to the alloca-block of the parent operation. This patch
only moves `fir.alloca`s into the alloca-block, and moves
all other operations right before fir.do_concurrent.