This commit updates the lowering of all-reduce operations to annotate
the generated barriers with `memfence [#gpu.address_space<workgroup>]`
so that these barriers do not force unrelated global memory operations
to complete. It similarly sets up the warp synchronization function in
the vector distribute tests, since they also only read and write shared
memory.
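
As a rough illustration of the annotated form (a hand-written sketch, not
output copied from the lowering; the function name, buffer shape, and
operands are hypothetical), a barrier that only needs to order workgroup
memory accesses might look like this:

```mlir
// Hypothetical example: the store and load touch only workgroup (shared)
// memory, so the barrier fences just that address space.
func.func @workgroup_only_sync(
    %buf: memref<32xf32, #gpu.address_space<workgroup>>,
    %val: f32, %idx: index) -> f32 {
  memref.store %val, %buf[%idx]
      : memref<32xf32, #gpu.address_space<workgroup>>
  // Synchronize the workgroup and fence workgroup memory only; unrelated
  // global memory operations are not forced to complete here.
  gpu.barrier memfence [#gpu.address_space<workgroup>]
  %res = memref.load %buf[%idx]
      : memref<32xf32, #gpu.address_space<workgroup>>
  return %res : f32
}
```
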
In addition, this commit adds convenience builders for gpu.barrier that
fence either on a given address space or on the address space of a
provided memref.