`shard.allgather` concatenates along a specified gather axis, whereas
`mpi.allgather` always concatenates along the first dimension, and there
is no MPI operation that gathers along an arbitrary axis.
Hence, if the gather axis is not 0, we create a temporary buffer, gather
into it along the first dimension, and then copy from that buffer into
the final output along the specified gather axis. This is far from
ideal, since it costs an extra allocation and an extra copy.
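
For illustration only, here is a minimal Python sketch of the
temporary-buffer approach described above. It is not the lowering
itself: `allgather_dim0` and `gather_along_axis` are hypothetical names,
the ranks are simulated with a list of flat row-major buffers, and only
2-D shards are handled.

```python
def allgather_dim0(shards):
    """Stand-in for mpi.allgather: concatenate flat per-rank buffers in rank order."""
    out = []
    for shard in shards:
        out.extend(shard)
    return out


def gather_along_axis(shards, shape, axis):
    """Gather 2-D row-major shards (one per simulated rank) along `axis`.

    Returns the flat row-major result and its shape.
    """
    rows, cols = shape
    nranks = len(shards)
    # Step 1: gather along the first dimension into a temporary buffer,
    # which is logically laid out as [nranks, rows, cols].
    tmp = allgather_dim0(shards)
    if axis == 0:
        # The gathered layout already is the final layout.
        return tmp, (nranks * rows, cols)
    if axis != 1:
        raise ValueError("this sketch only handles 2-D shards")
    # Step 2: copy each rank's block into its column range of the output.
    out_cols = nranks * cols
    out = [None] * (rows * out_cols)
    for k in range(nranks):
        for r in range(rows):
            for c in range(cols):
                out[r * out_cols + k * cols + c] = tmp[(k * rows + r) * cols + c]
    return out, (rows, out_cols)
```

With two simulated ranks holding `[[1, 2], [3, 4]]` and
`[[5, 6], [7, 8]]`, gathering along axis 1 yields the rows
`[1, 2, 5, 6]` and `[3, 4, 7, 8]`; the second step is the extra copy
that gathering along axis 0 avoids.
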
Along the way, this change also:
- fixes the computation of the memref size in mpitollvm
- adds a simple canonicalization pattern for comm_size to make
  debugging easier
- adds more tests