Yaxun (Sam) Liu f93fcde52b
[offload-arch] Fix amdgpu-arch crash on Windows with ROCm 7.1 (#167695)
The tool crashed on Windows with ROCm 7.1 for two reasons. First, it
called hipDeviceGet, which it should not have used; the call worked
before only by accident and was undefined behavior. Second, the
hipDeviceProp_t struct layout changed between HIP versions, moving the
gcnArchName offset from 396 to 1160 bytes and breaking the ABI.

The fix removes hipDeviceGet and queries properties directly by device
index. It defines separate struct layouts for R0600 (HIP 6.x+) and R0000
(legacy) to handle the different memory layouts correctly.
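
A minimal sketch of why the two layouts have to be kept apart. Only the
396 and 1160 byte offsets come from this message; the struct names, the
padding fields, the 256-byte name buffer, and both helpers are
hypothetical stand-ins, not the definitions used by the tool.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the legacy (R0000) layout. Padding replaces
// the real fields that precede gcnArchName; the buffer size is assumed.
struct DevicePropR0000 {
  char Padding[396];
  char GcnArchName[256];
};

// Hypothetical stand-in for the R0600 layout, same assumptions as above.
struct DevicePropR0600 {
  char Padding[1160];
  char GcnArchName[256];
};

static_assert(offsetof(DevicePropR0000, GcnArchName) == 396,
              "legacy layout keeps the name at byte 396");
static_assert(offsetof(DevicePropR0600, GcnArchName) == 1160,
              "R0600 layout keeps the name at byte 1160");

// Reads the architecture name without running past the end of the
// buffer, even if the runtime left it unterminated.
template <typename PropT> std::string archNameOf(const PropT &Props) {
  const char *Begin = Props.GcnArchName;
  const char *End = Begin + sizeof(Props.GcnArchName);
  return std::string(Begin, std::find(Begin, End, '\0'));
}

// Simulates the mismatch: views the leading bytes of an R0600 buffer
// through the legacy layout, as a caller built against old headers would.
inline DevicePropR0000 viewAsLegacy(const DevicePropR0600 &New) {
  static_assert(sizeof(DevicePropR0000) <= sizeof(DevicePropR0600),
                "legacy view must fit inside the R0600 buffer");
  DevicePropR0000 Old;
  std::memcpy(&Old, &New, sizeof(Old));
  return Old;
}
```

Under these assumptions, a name written at byte 1160 is invisible to a
reader that looks at byte 396, which is the kind of silent mismatch the
separate layouts are meant to prevent.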

An automatic API fallback mechanism tries R0600, then R0000, then the
unversioned API until one succeeds, ensuring compatibility across
different HIP runtime versions. A new --hip-api-version option allows
manually selecting the API version when needed.
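
The fallback order described above could be sketched roughly as
follows. This is an illustration only: ApiCandidate and
queryArchWithFallback are hypothetical names, the candidates are plain
callables rather than symbols looked up in the HIP runtime, and a return
value of 0 is assumed to mean success.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of one entry point to try. An empty Query
// models a symbol that the loaded runtime does not export.
struct ApiCandidate {
  std::string Label; // e.g. "R0600", "R0000", "unversioned"
  std::function<int(int DeviceId, std::string &ArchOut)> Query;
};

// Tries each candidate in order and stops at the first one that
// succeeds. Returns false if every candidate is missing or fails.
inline bool queryArchWithFallback(const std::vector<ApiCandidate> &Candidates,
                                  int DeviceId, std::string &ArchOut,
                                  std::string &UsedLabel) {
  assert(DeviceId >= 0 && "device index must be non-negative");
  for (const ApiCandidate &C : Candidates) {
    if (!C.Query)
      continue; // entry point not available in this runtime
    std::string Arch;
    if (C.Query(DeviceId, Arch) != 0)
      continue; // call failed, fall through to the next candidate
    ArchOut = Arch;
    UsedLabel = C.Label;
    return true;
  }
  return false;
}
```

Passing the candidates as an ordered list keeps the priority explicit,
and a manual override such as --hip-api-version could be modeled by
passing a single-element list.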

Additional improvements include better error reporting through
hipGetErrorString, verbose logging throughout the detection process, and
runtime version detection through hipRuntimeGetVersion when it is
available. The versioned API functions provide a stable ABI across HIP
versions.

Fixes: SWDEV-564272
2025-11-13 19:03:21 -05:00