LegalizeLaunchFuncOpPattern previously rejected gpu.launch_func ops with
more than one async dependency. This change removes that limitation by
synchronizing additional dependencies onto the primary stream using
CUDA/HIP events, following the same approach already used in
ConvertWaitAsyncOpToGpuRuntimeCallPattern for gpu.wait async.

For each async dependency after the first:
- If it is a stream (produced by mgpuStreamCreate), create an event,
record it on that stream, wait for it on the primary stream, then
destroy the event.
- If it is already an event, wait for it directly on the primary stream
and destroy it.
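
The call order described above can be sketched as a small host-side simulation. This is only an illustration, not the code emitted by the pattern: `Handle`, `callLog`, and the `eventCreate`/`eventRecord`/`streamWaitEvent`/`eventDestroy` helpers are hypothetical stand-ins that log calls instead of touching a GPU, and the sketch assumes the first dependency is the primary stream.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handle type: a stream or an event, identified by an integer.
struct Handle {
  enum Kind { Stream, Event } kind;
  int id;
};

// Log of simulated runtime calls, in the order they were issued.
static std::vector<std::string> callLog;
static int nextEventId = 0;

static std::string name(const Handle &h) {
  return (h.kind == Handle::Stream ? "s" : "e") + std::to_string(h.id);
}

// Stand-ins for the runtime wrappers (assumed shapes, logging only).
static Handle eventCreate() {
  Handle e{Handle::Event, nextEventId++};
  callLog.push_back("create " + name(e));
  return e;
}
static void eventRecord(const Handle &e, const Handle &s) {
  callLog.push_back("record " + name(e) + " on " + name(s));
}
static void streamWaitEvent(const Handle &s, const Handle &e) {
  callLog.push_back("wait " + name(e) + " on " + name(s));
}
static void eventDestroy(const Handle &e) {
  callLog.push_back("destroy " + name(e));
}

// Synchronize every dependency after the first onto the primary stream.
void syncAdditionalDeps(const std::vector<Handle> &deps) {
  // Assumption for this sketch: deps[0] is the primary stream.
  assert(!deps.empty() && deps[0].kind == Handle::Stream);
  const Handle &primary = deps[0];
  for (std::size_t i = 1; i < deps.size(); ++i) {
    if (deps[i].kind == Handle::Stream) {
      // Stream dependency: bridge it to the primary stream via a new event.
      Handle event = eventCreate();
      eventRecord(event, deps[i]);
      streamWaitEvent(primary, event);
      eventDestroy(event);
    } else {
      // Event dependency: wait on it directly, then destroy it.
      streamWaitEvent(primary, deps[i]);
      eventDestroy(deps[i]);
    }
  }
}
```

For dependencies `{s0, s1, e7}`, the log shows a temporary event being created, recorded on `s1`, waited on by `s0`, and destroyed, followed by a direct wait on `e7` and its destruction.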

Fixes #156984
Assisted-by: Claude Code