As with v4i8 types, we should not use PerformEXTRACTCombine for v8i8 types.
Make v4i8 a legal type and plumb through lowering of relevant instructions.
Some of our critical code paths depend on efficient byte extraction from data loaded as integers. By default, LLVM extracts bytes by storing to and loading from the stack, which is very inefficient on a GPU.
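
For illustration only, here is a minimal host-side C++ sketch of the two extraction strategies described above. The function names are hypothetical and are not part of LLVM or of this change. The sketch only contrasts a register-only shift-and-mask extraction with one that round-trips through memory, which is roughly what the stack-based lowering amounts to.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: extract byte `idx` (0 = least significant) from a
// 32-bit word using a shift and a mask. Everything stays in registers.
inline std::uint8_t extract_byte_shift(std::uint32_t word, unsigned idx) {
    return static_cast<std::uint8_t>((word >> (8u * (idx & 3u))) & 0xFFu);
}

// Hypothetical helper: extract the same byte by copying the word into a
// local byte array and indexing it. This imitates the store-to-stack,
// load-from-stack pattern. It agrees with the shift version only on
// little-endian targets.
inline std::uint8_t extract_byte_memory(std::uint32_t word, unsigned idx) {
    std::uint8_t bytes[4];
    std::memcpy(bytes, &word, sizeof(bytes));
    return bytes[idx & 3u];
}
```

A host compiler will usually optimize both forms to the same code; the contrast matters when a backend cannot see through the memory round trip and has to emit real local-memory stores and loads.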