[MLIR][NVVM] Improve inline_ptx, add readwrite support (#154358)
Key Features
1. Multiple SSA returns – no struct packing/unpacking required.
2. Automatic struct unpacking – values are directly usable.
3. Readable register mapping
* {$rwN} → read-write
* {$roN} → read-only
* {$woN} → write-only
4. Full read-write support (+ modifier).
5. Simplified operand specification – avoids cryptic
"=r,=r,=f,=f,f,f,0,1" constraints.
6. Predicate support: PTX `@p` predication support
IR Example:
```
%wo0, %wo1 = nvvm.inline_ptx """
.reg .pred p;
setp.ge.s32 p, {$r0}, {$r1};
selp.s32 {$rw0}, {$r0}, {$r1}, p;
selp.s32 {$rw1}, {$r0}, {$r1}, p;
selp.s32 {$w0}, {$r0}, {$r1}, p;
selp.s32 {$w1}, {$r0}, {$r1}, p;
""" ro(%a, %b : f32, f32) rw(%c, %d : i32, i32) -> f32, f32
```
After lowering
```
%0 = llvm.inline_asm has_side_effects asm_dialect = att
"{
.reg .pred p;\
setp.ge.s32 p, $4, $5; \
selp.s32 $0, $4, $5, p;\
selp.s32 $1, $4, $5, p;\
selp.s32 $2, $4, $5, p;\
selp.s32 $3, $4, $5, p;\
}"
"=r,=r,=f,=f,f,f,0,1"
%c500_i32, %c400_i32, %cst, %cst_0
: (i32, i32, f32, f32)
-> !llvm.struct<(i32, i32, f32, f32)>
%1 = llvm.extractvalue %0 : !llvm.struct<(i32, i32, f32, f32)>
%2 = llvm.extractvalue %0 : !llvm.struct<(i32, i32, f32, f32)>
%3 = llvm.extractvalue %0 : !llvm.struct<(i32, i32, f32, f32)>
%4 = llvm.extractvalue %0 : !llvm.struct<(i32, i32, f32, f32)>
// Unpacked result from nvvm.inline_ptx
%5 = arith.addi %1, %2 : i32
// read only
%6 = arith.addf %cst, %cst_0 : f32
// write only
%7 = arith.addf %3, %4 : f32
```