Implementing expm1 function for double precision based on exp function
algorithm:
- Reduced x = log2(e) * (hi + mid1 + mid2) + lo, where:
* hi is an integer
* mid1 * 2^-6 is an integer
* mid2 * 2^-12 is an integer
* |lo| < 2^-13 + 2^-30
- Then exp(x) - 1 = 2^hi * 2^mid1 * 2^mid2 * exp(lo) - 1 ~ 2^hi *
(2^mid1 * 2^mid2 * (1 + lo * P(lo)) - 2^(-hi) )
- We evaluate fast pass with P(lo) is a degree-3 Taylor polynomial of
(e^lo - 1) / lo in double precision
- If the Ziv accuracy test fails, we use degree-6 Taylor polynomial of
(e^lo - 1) / lo in double double precision
- If the Ziv accuracy test still fails, we re-evaluate everything in
128-bit precision.
Summary:
Currently, we use the RPC server to respond to different ports which
each contain a request from some client thread wishing to do work on the
server. This scan starts at zero and continues until its checked all
ports at which point it resets. If we find an active port, we service it
and then restart the search.
This is bad for two reasons. First, it means that we will always bias
the lower ports. If a thread grabs a high port it will be stuck for a
very long time until all the other work is done. Second, it means that
the `handle_server` function can technically run indefinitely as long as
the client is always pushing new work. Because the OpenMP implementation
uses the user thread to service the kernel, this means that it could be
stalled with another asyncrhonous device's kernels.
This patch addresses this by making the server restart at the next port
over. This means we will always do a full scan of the ports before
quitting.
Summary:
This variable needs a reserved name starting with `__`. It was
mistakenly changed with a mass replace. It happened to work because the
tests still picked up the associated symbol, but it just became a bad
name because it's not reserved anymore.
The name __llvm_libc was mass-replaced with LIBC_NAMESPACE which ended
up changing the "__llvm_libc" prefix of the delete operator linkage names to
"LIBC_NAMESPACE". This change corrects it by changing the namespace prefix
to "__llvm_libc_<version info>".
Other libraries dependent on these libraries will automatically inherit
those compile options. This change in particular affects the compile
option "-DLIBC_COPT_STDIO_USE_SYSTEM_FILE".
This patch enables the compilation of libc for rv32 by unifying the
current rv64 and rv32 implementation into a single rv implementation.
We updated the cmake file to match the new riscv32 arch and force
LIBC_TARGET_ARCHITECTURE to be "riscv" whenever we find "riscv32" or
"riscv64". This is required as LIBC_TARGET_ARCHITECTURE is used in the
path for several platform specific implementations.
Reviewed By: michaelrj
Differential Revision: https://reviews.llvm.org/D148797
printf_core.parser is not yet updated to use the printf config options. It
does not use them currently anyway and the corresponding parser_test
should be updated to respect the config options.
Summary:
This patch adds the necessary entrypoints to handle the `fseek`,
`fflush`, and `ftell` functions. These are all very straightfoward, we
simply make RPC calls to the associated function on the other end.
Implementing it this way allows us to more or less borrow the state of
the stream from the server as we intentionally maintain no internal
state on the GPU device. However, this does not implement the `errno`
functinality so that must be ignored.
When creating the new scanf reader design, I forgot to add back the
calls to flockfile and funlockfile in vfscanf_internal. This patch fixes
that, and also changes the system file version to use the normal
variants since ungetc_unlocked isn't always available.
Summary:
Normally, the implementation of `puts` simply writes a second newline
charcter after printing the first string. However, because the GPU does
everything in batches of the SIMT group size, this will end up with very
poor output where you get the strings printed and then 1-64 newline
characters all in a row. Optimizations like to turn `printf` calls into
`puts` so it's a good idea to make this produce the expected output.
The least invasive way I could do this was to add a new opcode. It's a
little bloated, but it avoids an unneccessary and slow send operation to
configure this.
In a previous patch, the printf writer was rewritten to use a single
writer class with a buffer and a callback hook. This patch refactors
scanf's reader to match conceptually.
Summary:
The parser class for stdio currently accepts different argument
providers. In-tree this is only used for a fuzzer test, however, the
proposed implementation of the GPU handling of printf / scanf will
require custom argument handlers. This makes the current approach of
using a preprocessor macro messier. This path proposed folding this
logic into a template instantiation. The downside to this is that
because the implementation of the parser class is placed into an
implementation file we need to manually instantiate the needed templates
which will slightly bloat binary size. Alternatively we could remove the
implementation file, or key off of the `libc` external packaging macro
so it is not present in the installed version.
Summary:
Previously, the `fread` operation was wrong in cases when we read less
data than was requested. That is, if we tried to read N bytes while the
file was in EOF, it would still copy N bytes of garbage. This is fixed
by only copying over the sizes we got from locally opening it rather
than just using the provided size.
Additionally, this patch simplifies the interface. The output functions
have special variants for writing to stdout / stderr. This is primarily
an optimization for these common cases so we can avoid sending the
stream as an argument which has a high delay. Because for input, we
already need to start with a `send` to tell the server how much data to
read, it costs us nothing to send the file along with it so this is
redundant. Re-use the file encoding scheme from the other
implementations, the one that stores the stream type in the LSBs of the
FILE pointer.
Two major off-by-one errors are fixed in this patch. The first is in
float_to_string.h with length_for_num, which wasn't accounting for the
implicit leading bit when calculating the length of a number, causing
a missing digit on 80 bit float max. The other off-by-one is the
ryu_long_double_constants.h (a.k.a the Mega Table) not having any
entries for the last POW10_OFFSET in POW10_SPLIT. This was also found on
80 bit float max. Finally, the integer calculation mode was using a
slightly too short integer, again on 80 bit float max, not accounting
for the mantissa width. All of these are fixed in this patch.
Summary:
This patch removes the `rpc_reset` function. This was previously used to
initialize the RPC client on the device by setting up the pointers to
communicate with the server. The purpose of this was to make it easier
to initialize the device for testing. However, this prevented us from
enforcing an invariant that the buffers are all read-only from the
client side.
The expected way to initialize the server is now to copy it from the
host runtime. This will allow us to maintain that the RPC client is in
the constant address space on the GPU, potentially through inference,
and improving caching behaviour.
Summary:
This was mistakenly using the opcode for `ferror` which wasn't noticed
because tests using this weren't yet activated. This patch fixes this
mistake.
The list of printf copts available in config.json wasn't working because
the printf_core subdirectory was included before the printf_copts
variable was defined, making it effectively nothing for the printf
internals. Additionally, the tests weren't respecting the flags so they
would cause the tests to fail. This patch reorders the cmake in src and
adds flag handling in test.
Summary:
This patch implements the `fgets`, `getc`, `fgetc`, and `getchar`
functions on the GPU. Their implementations are straightforward enough.
One thing worth noting is that the implementation of `fgets` will be
extremely slow due to the high latency to read a single char. A faster
solution would be to make a new RPC call to call `fgets` (due to the
special rule that newline or null breaks the stream). But this is left
out because performance isn't the primary concern here.
The two decimal float printing styles are similar, but different in how
they end. For simplicity of writing I initially gave them different
"write_last_block" functions. This patch unifies them into one function.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D158036
This patch adds the long double table option for printf into the new
configuration scheme. This allows it to be set for most targets but
unset for baremetal.
This patch changes the size of time_t to be an int64_t. This still
follows the POSIX standard which only requires time_t to be an integer.
Making time_t a 64-bit integer also fixes two cases in 32 bits platforms
that use SYS_clock_nanosleep_time64 and SYS_clock_gettime64, as the name
of these calls implies, they require a 64-bit time_t. For instance, in rv32,
the 32-bit version of these syscalls is not available.
We also follow glibc here, where time_t is still a 32-bit integer in
arm32.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D159125
Summary:
This patch improves the implementation of the standard `rand()` function
by implementing it in terms of the xorshift64star pRNG as described in
https://en.wikipedia.org/wiki/Xorshift#xorshift*. This is a good,
general purpose random number generator that is sufficient for most
applications that do not require an extremely long period. This patch
also correctly initializes the seed to be `1` as described by the
standard. We also increase the `RAND_MAX` value to be `INT_MAX` as the
standard only specifies that it can be larger than 32768.
Summary:
We currently call the GPU routine to terminate the current thread in
three separate locations .This should be wrapped into a helper function
to simplify the implementation.
Summary:
This patch implements fwrite, putc, putchar, and fputc on the GPU. These
are very straightforward, the main difference for the GPU implementation
is that we are currently ignoring `errno`. This patch also introduces a
minimal smoke test for `putc` that is an exact copy of the `puts` test
except we print the string char by char. This also modifies the `fopen`
test to use `fwrite` to mirror its use of `fread` so that it is tested
as well.
Summary:
The GPU uses separate implementations to perform file IO. This is all
done through the RPC interface and we kept it minimal such that we could
treat a `stdin`, `stdout`, or `stderr` handle from the CPU correctly on
the GPU. The RPC implementation uses different opcodes for whether or
not we are using one of the standard streams. This is so we do not need
to initialize anything to access the CPU's standard stream, because the
server knows that it should print to `stdout` if it gets the `STDOUT`
variant of the opcode. It also saves us an RPC call, which are expensive
relatively speaking. This patch simply cleans up this interface to make
them all use a common function. This is done in preparation to implement
some more file IO functions like getc or putc.
Similar to D159208, this patch unifies the calls to a syscall, in this
patch it is the syscall SYS_clock_gettime/SYS_clock_gettime64.
This patch also fixes calls to SYS_clock_gettime64 by creating a
timespec64 object, passing it to the syscall and rewriting the timespec
given by the caller with timespec64 object's contents. This fixes cases
where timespec has a 4 bytes long time_t member, but SYS_clock_gettime
is not available (e.g., rv32).
The %p format wasn't correctly passing along flags and modifiers to the
integer conversion behind the scenes. This patch fixes that behavior, as
well as changing the nullptr behavior to be a string conversion behind
the scenes.
Reviewed By: lntue, jhuber6
Differential Revision: https://reviews.llvm.org/D159458
These functions were implemented by simply calling their `__builtin_*`
equivalents.
The builtins were resolving to the libc functions back again. This patch
adds explicit
vendor versions for these functions to avoid the recursion.
This patch fixes the overflow check in update_from_seconds, used by
gmtime, gmtime_r and mktime.
In update_from_seconds, total_seconds is a int64_t and the previous
overflow check for when sizeof(time_t) == 4 would check if it was <
0x80000000 and > 0x7FFFFFFF, however, this check would cause the
following issues:
1. Valid negative numbers would be discarded, e.g., -1 is
0xffffffffffffffff as a int64_t, outside the range of the overflow
check.
2. Some valid positive numbers would be discarded because the hex
constants were being implicitly converted to int64_t, e.g., 0x80000000
would be implicitly converted to 2147483648, instead of -2147483648.
The fix for both cases was to static_cast total_seconds and the
constants to time_t if sizeof(time_t) == 4. The behaviour is not changed
in systems with sizeof(time_t) == 8.
---------
Signed-off-by: Mikhail R. Gadelha <mikhail@igalia.com>