- Allow the `FMod` template to use different computational types, and make it work for 80-bit `long double`.
- Switch to `uint64_t` as the intermediate computational type for `float`, which significantly reduces the latency of `fmodf` when the exponent difference is large.
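
To illustrate why a wider intermediate type helps, here is a minimal sketch, not the actual `FMod` template. The function name `fmod_sketch` and the use of `std::frexp`/`std::ldexp` are assumptions made for illustration only. A `float` significand fits in 24 bits, so a 64-bit accumulator can absorb up to 40 bits of exponent difference per modulo step in this sketch, where a 32-bit accumulator would allow only 8.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical illustration only; not the real FMod implementation.
float fmod_sketch(float x, float y) {
  // Invalid cases: NaN operands, infinite x, or zero y.
  if (std::isnan(x) || std::isnan(y) || std::isinf(x) || y == 0.0f)
    return std::numeric_limits<float>::quiet_NaN();
  float ax = std::fabs(x), ay = std::fabs(y);
  // Covers infinite y as well: |x| < |y| means x is already the remainder.
  if (ax < ay)
    return x;

  // Decompose into integer significands below 2^24:
  //   ax = mx * 2^(ex - 24), ay = my * 2^(ey - 24).
  int ex = 0, ey = 0;
  auto mx = static_cast<std::uint64_t>(std::ldexp(std::frexp(ax, &ex), 24));
  auto my = static_cast<std::uint64_t>(std::ldexp(std::frexp(ay, &ey), 24));

  int diff = ex - ey; // non-negative because ax >= ay
  std::uint64_t r = mx;
  while (diff > 0) {
    // r < 2^24, so shifting by up to 40 bits cannot overflow 64 bits.
    int shift = diff < 40 ? diff : 40;
    r = (r << shift) % my;
    diff -= shift;
  }
  r %= my; // handles the case where the exponents were equal

  // Scale back to y's exponent and restore the sign of x.
  float result = std::ldexp(static_cast<float>(r), ey - 24);
  return std::copysign(result, x);
}
```

For example, `fmod_sketch(5.5f, 2.0f)` yields `1.5f`, and an input pair such as `(1e30f, 3.0f)` needs only a few loop iterations despite the large exponent gap.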