The genericStateMachine call uses synchronize::thread, which is expected
to be implemented using a workgroup-level barrier.
Currently, a condition keeps some threads from entering the state
machine. It exists because on some other architectures a race condition
can occur if threads in the same warp as the main thread reach the barrier.
On Intel GPUs, however, all threads must reach the barrier for it to
complete; otherwise the threads in the state machine never make
progress.
This PR moves the condition into an architecture-dependent config so it
can work correctly for both kinds of hardware.
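
As a rough illustration of the idea (not the code in this PR), the
architecture-dependent decision could be sketched as below. All names
here (WarpLevelBarrierArch, WorkgroupBarrierArch,
AllThreadsMustReachBarrier, shouldEnterStateMachine) are hypothetical,
and the guard is assumed to be a thread-id comparison against the
number of possible worker threads.

```cpp
#include <cstdint>

// Hypothetical per-architecture configs; the names are illustrative only.
// Assumed: on this kind of hardware, threads sharing a warp with the main
// thread can race at the barrier, so the guard must stay in place.
struct WarpLevelBarrierArch {
  static constexpr bool AllThreadsMustReachBarrier = false;
};

// Assumed: the barrier completes only once every thread in the workgroup
// has reached it (the Intel GPU behavior described above).
struct WorkgroupBarrierArch {
  static constexpr bool AllThreadsMustReachBarrier = true;
};

// Decide whether a thread enters the state machine. MaxWorkerThreads is
// assumed to exclude the threads that share a warp with the main thread.
template <typename Arch>
constexpr bool shouldEnterStateMachine(uint32_t ThreadId,
                                       uint32_t MaxWorkerThreads) {
  return Arch::AllThreadsMustReachBarrier || ThreadId < MaxWorkerThreads;
}
```

With this shape, the guard is kept where it prevents the race and
dropped where it would leave the barrier incomplete, without changing
the call site.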