OpenCL requires constant string arguments to be in a particular address space,
so OpenCL sources can't use the regular `__nvvm_reflect()`.
Allow NVVMReflect pass to accept an Open_CL specific variant with a constant
string in a non-default address space.
Differential Revision: https://reviews.llvm.org/D139213
DAGCombiner replaces (load const_addr1) directly chained with (store
(val, const_addr2)) with val if address space stripped const_addr1 ==
const_addr2. The patch fixes the issue by checking address spaces as
well. However, it might makes sense to not to chain together side
effects that belong to different address spaces in the first place and
make SelectionDAG::root address space aware.
Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.
The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.
Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).
Differential Revision: https://reviews.llvm.org/D135462
This patch adds lowering for function calls with variadic number of
arguments as well as enables support for the following
instructions/intrinsics:
- va_arg
- va_start
- va_end
- va_copy
Note that this patch doesn't intent to include clang's support for
variadic functions for CUDA.
According to the docs:
PTX version 6.0 supports passing unsized array parameter to a
function which can be used to implement variadic functions. [0]
The last parameter in the parameter list may be a .param array of
type .b8 with no size specified. It is used to pass an arbitrary
number of parameters to the function packed into a single array
object.
When calling a function with such an unsized last argument, the last
argument may be omitted from the call instruction if no parameter is
passed through it. Accesses to this array parameter must be within
the bounds of the array. The result of an access is undefined if no
array was passed, or if the access was outside the bounds of the
actual array being passed. [1]
Note that aggregates passed by value as variadic arguments are not
currently supported.
[0] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#variadic-functions
[1] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-and-function-directives-func
Differential Revision: https://reviews.llvm.org/D138531
Over the past day or so, i've took a large swing at our tests,
and reduced the number of tests that were still using the old syntax
from ~1800 to just 200.
Left to handle: (as it is seen in this patch)
* Transforms/LSR
* Transforms/CGP
* Transforms/TypePromotion
* Transforms/HardwareLoops
* Analysis/*
* some misc.
I think this is the right point to start actively refusing
to honor the old syntax, except for the old tests,
to prevent the old syntax from creeping back in.
Thus, let's add temporary default-off flag,
and if it is not passed refuse to accept old syntax.
The tests that still need porting are annotated with this flag.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D139647
Alignment of function arguments can be increased only if we can do
this for all call sites. Therefore we do not increase it for external
functions, and now we skip functions that have address taken, to avoid
any issues with functions pointers.
Differential Revision: https://reviews.llvm.org/D135708
When --nvptx-short-ptr is set, local pointers are stored as 32-bit on
nvptx64 target.
Before this patch, arguments for a function declaration were always
emitted as b64 regardless of their address space, but they were set as
b32 for the corresponding call instruction:
.extern .func test
(
.param .b64 test_param_0
)
[...]
.param .b32 param0;
st.param.b32 [param0+0], %r1;
call.uni test, (param0);
This is not supported:
ptxas: Type of argument does not match formal parameter
'test_param_0'
Now short pointers in a function declaration are emitted as b32 if
--nvptx-short-ptr is set.
Differential Revision: https://reviews.llvm.org/D135674
Recent Clang changes expose _bf16 types for SSE2-enabled host compilations and
that makes those types visible furing GPU-side compilation, where it currently
fails with Sema complaining that __bf16 is not supported.
Considering that __bf16 is a storage-only type, enabling it for NVPTX if it's
enabled on the host should pose no issues, correctness-wise.
Recent NVIDIA GPUs have introduced bf16 support, so we'll likely grow better
support for __bf16 on NVPTX going forward.
Differential Revision: https://reviews.llvm.org/D136311
`getFunctionParamOptimizedAlign` was being passed a null function
argument when getting the callee of a bitcasted function symbol. This is
because `CallBase::getCalledFunction` does not look through bitcasts.
There is already code to handle this case in
`NVPTXTargetLowering::getArgumentAlignment`, which is now hoisted into
an NVPTX util.
The alignment computation now gracefully handles computing alignment of
virtual functions with a check for null.
Before this patch the code in printScalarConstant was unable to handle
nested constant expressions like (gep (addrspacecast ptr)) and crashed
with:
LLVM ERROR: Unsupported expression in static initializer:
addrspacecast ([4 x i8] addrspace(1)* @ga to [4 x i8]*)
We can use lowerConstantForGV instead which is a customized version of
lowerConstant that supports generic() and nested expressions.
Differential Revision: https://reviews.llvm.org/D127878
1) Fixed a typo in PTXAS_EXECUTABLE CMake variable (PXTAS -> PTXAS).
2) Version check was implemented incorrectly,
now version (major, minor) is converted to int for comparison.
3) ptxas -arch argument was incorrect (or missing) in 3 tests.
Differential Revision: https://reviews.llvm.org/D127866
The second argument of `NVPTXFrameLowering::emitPrologue(MachineFunction &MF, MachineBasicBlock &MBB)` is the first MBB of the MF. In that function, it assumes the first MBB always contains instructions, so it gets the first instruction by MachineInstr *MI = &MBB.front();. However, with the reproducer/test case attached, all instructions in the first MBB is cleared in a previous pass for stack coloring. As a consequence, MBB.front() triggers the assertion that the first node is actually a sentinel node. Hence we are using MachineBasicBlock::iterator to iterate over MBB.
Fix#52623.
Differential Revision: https://reviews.llvm.org/D132663
This is the ultimate fallback code if UADDO isn't supported.
If the target uses 0/1 we used one compare, but if the target doesn't
use 0/1 we emitted two compares. Regardless of boolean constants we
should only need to check that the Result is less than one of the
original operands. So we only need one compare.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D133708
In order to convert to mulwide.s32, we compute the 2nd operand as MulWide.32 $r, (1 << 31).
(1 << 31) is interpreted as a negative number, and is not equivalent to the original instruction.
The code `int64_t r = (int64_t)a << 31;` incorrectly compiled to `mul.wide.s32 %rd7, %r1, -2147483648;`
Reviewed By: jchlanda
Differential Revision: https://reviews.llvm.org/D132516
Today llc will crash when attempting to use non-power-of-two integer types as
function arguments or returns. This patch enables passing non standard integer
values in functions by promoting them before store and truncating after load.
The main motivation of implementing this change is that rust casts small structs
(less than pointer size) into an integer of the same size. As an example, if a
struct contains three u8 then it will be passed as an i24. This patch is a step
towards enabling rust compilation to ptx while retaining the target independent
optimizations.
More context can be found in https://github.com/llvm/llvm-project/issues/55764
Differential Revision: https://reviews.llvm.org/D129291
The original patch revealed an issue of reading incorrect values on BE hosts.
That is now changed to use `endian::read32le()` and `endian::read64le()`.
Original commit message:
The current implementation assumes that all pointers used in the
initialization of an aggregate are aligned according to the pointer size
of the target; that might not be so if the object is packed. In that
case, an array of .u8 should be used and pointers should be decorated
with the mask() operator.
The operator was introduced in PTX ISA 7.1, so an error is issued if the
case is detected for an earlier version.
Differential Revision: https://reviews.llvm.org/D127504
The current implementation assumes that all pointers used in the
initialization of an aggregate are aligned according to the pointer size
of the target; that might not be so if the object is packed. In that
case, an array of .u8 should be used and pointers should be decorated
with the mask() operator.
The operator was introduced in PTX ISA 7.1, so an error is issued if the
case is detected for an earlier version.
Differential Revision: https://reviews.llvm.org/D127504
A global variable may have the same name as a label, and ptxas does not accept it.
Prefix labels with $L__ to fix this.
Reviewed By: MaskRay, tra
Differential Revision: https://reviews.llvm.org/D119669
PTX supports those instructions for i64 starting from 4.3.
The patch also marks corresponding DAG nodes legal for both i32 and i64.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D124698
ptxas is a proprietary compiler from Nvidia that can compile PTX to
machine code (SASS). It has a lot of diagnostics to catch errors
in PTX, which can be used to verify PTX output from llc.
Set -DPXTAS_EXECUTABLE=/path/to/ptxas CMake option to enable it.
If this option is not set, then ptxas is substituted to true which
effectively disables all ptxas RUN lines.
LLVM_PTXAS_EXECUTABLE environment variable takes precedence over
the CMake option, and allows to override ptxas executable that is used for LIT
without complete re-configuration.
Differential Revision: https://reviews.llvm.org/D121727
Some generic tests are not supported by the nvptx now. Moreover, they
are no plans to fix the tested features in nvptx. So, suggest to mark
them as UNSUPPORTED
Differential Revision: https://reviews.llvm.org/D123928
Make sure NVPTX backend can handle bitcasting between `float` and `<2 x half>` types.
This was discovered through: https://github.com/intel/llvm/issues/5969
I'm not suggesting that such bitcasts make much sense, but it feels like the compiler should not hard crash on them.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D124171
Opaque pointers are enabled by default since D123300, so test IR should be
regenerated correspondingly.
Differential Revision: https://reviews.llvm.org/D123842
ptxas fails to parse such syntax:
mov.u64 %rd1, ($str);
fatal : Parsing error near '$str': syntax error
A new MCAsmInfo option was added because InParens parameter of
MCExpr::print is not sufficient to disable parens
completely. MCExpr::print resets it to false for a recursive call in
case of unary or binary expressions.
Targets that require parens around identifiers that start with '$'
should always pass MCAsmInfo to MCExpr::print.
Therefore 'operator<<(raw_ostream &, MCExpr&)' should be avoided
because it calls MCExpr::print with nullptr MAI.
Differential Revision: https://reviews.llvm.org/D123702
ptxas fails to parse such syntax:
mov.u64 %rd1, ($str);
fatal : Parsing error near '$str': syntax error
A new MCAsmInfo option was added because InParens parameter of
MCExpr::print is not sufficient to disable parens
completely. MCExpr::print resets it to false for a recursive call in
case of unary or binary expressions.
Differential Revision: https://reviews.llvm.org/D123702
The second parameter should be a multiple of the warp size (32).
PTX ISA spec, s9.7.12.1. Parallel Synchronization and Communication
Instructions: bar, barrier
barrier.sync{.aligned} a{, b};
Operand b specifies the number of threads participating in the
barrier. If no thread count is specified, all threads in the CTA
participate in the barrier. When specifying a thread count, the value
must be a multiple of the warp size.
Differential Revision: https://reviews.llvm.org/D123470
PTX ISA spec, s5.4.8. Variable Attribute Directive: .attribute
PTX ISA Notes
Introduced in PTX ISA version 4.0.
Target ISA Notes
.managed attribute requires sm_30 or higher.
Differential Revision: https://reviews.llvm.org/D123040
PTX ISA spec, s9.7.8.6. Data Movement and Conversion Instructions:
shfl.sync
PTX ISA Notes
Introduced in PTX ISA version 6.0.
Target ISA Notes
Requires sm_30 or higher.
Differential Revision: https://reviews.llvm.org/D123039
PTX ISA spec, s9.7.12.4. Parallel Synchronization and Communication
Instructions: atom
Target ISA Notes
64-bit atom.{and,or,xor,min,max} require sm_32 or higher.
Differential Revision: https://reviews.llvm.org/D123038