Comments
ap...@google.com #2
One design decision worth elaborating on is whether or not to rename InstIntrinsicCall to InstIntrinsics. With InstIntrinsicCall no longer derived from InstCall (in order to get rid of the Target parameter), it can never get lowered into a call (at least not a call for which the target is unknown to the backend). In high-level languages, intrinsics look like function calls, but at Subzero's IR level they look more like instructions. Hence dropping the "Call" from the name would make sense. Alternatives discussed with Antonio include:
- Keep the InstCall target operand separate from the Srcs operand list, so the goal of aligning the source operands of load/store-like intrinsics with those of regular load/store instructions is still achieved. Unfortunately this causes complications for liveness analysis, since the target operand of a regular call can be a function pointer produced by other instructions, which then needs to be considered live. We could teach Subzero to treat the target operand as an additional source operand, but that's a bit risky and might require more than liveness analysis to be updated.
- Move the target operand to the last position of Srcs[]. This works without affecting things like liveness analysis, but is still a bit risky since we must ensure nothing looks for the target operand at index 0. It would also still be an unused, undefined value for intrinsics.
- Add virtual methods for accessing load/store operands. Currently there are no virtual methods on Inst, and since we process many thousands of these, making virtual calls could have a performance impact. Also note that Inst already uses a Kind enum to perform its own RTTI (a minimal sketch of that pattern follows this list).
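For context, here is a minimal sketch of that Kind-enum RTTI pattern. This is not Subzero's actual declarations (the enum values and class bodies are made up for illustration); it only shows how type queries can dispatch on an enum instead of through a vtable:

    // Each Inst subclass records its InstKind; queries compare getKind()
    // rather than going through virtual dispatch, so Inst can stay
    // virtual-free on the hot path.
    class Inst {
    public:
      enum InstKind { Alloca, Call, IntrinsicCall, Load, Store /* ... */ };
      InstKind getKind() const { return Kind; }
    protected:
      explicit Inst(InstKind Kind) : Kind(Kind) {}
    private:
      const InstKind Kind;
    };

    class InstLoad : public Inst {
    public:
      InstLoad() : Inst(Load) {}
      // Used by isa<>/cast<>-style checks: a plain enum comparison.
      static bool classof(const Inst *I) { return I->getKind() == Load; }
    };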
It's worth noting that to LLVM, intrinsics are definitely considered functions, invoked through the regular call instruction. Despite Subzero's design clearly having been inspired by LLVM, it uses a separate class for representing intrinsics. InstIntrinsicCall also takes an Intrinsics::IntrinsicInfo argument at construction, whereas with LLVM that information is part of the call target.
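To illustrate the LLVM side, here is a rough sketch using the LLVM C++ API as I remember it (not code from Subzero; the helper name is made up): emitting an intrinsic such as llvm.ctpop.i32 just means declaring a Function and creating an ordinary CallInst for it.

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Intrinsics.h"
    #include "llvm/IR/Module.h"

    // The intrinsic's identity lives entirely in the callee Function; the
    // emitted instruction is a plain CallInst.
    llvm::CallInst *emitCtpop(llvm::Module &M, llvm::IRBuilder<> &Builder,
                              llvm::Value *I32Val) {
      llvm::Function *Ctpop = llvm::Intrinsic::getDeclaration(
          &M, llvm::Intrinsic::ctpop, {Builder.getInt32Ty()});
      return Builder.CreateCall(Ctpop, {I32Val});
    }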
Hence I intend to move forward with renaming it to InstIntrinsics, and with adding some comments to clarify that it represents an extension instruction whose exact functionality only a few parts of the compiler need to know about.
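Very roughly, and purely as a sketch of the direction rather than a final interface (it assumes Subzero's existing Inst, Cfg, Operand, Variable, and Intrinsics types, and the member names are placeholders), the renamed class could look something like this:

    // No call target operand: the intrinsic's arguments live directly in
    // Srcs[], and the IntrinsicInfo is still supplied at construction.
    class InstIntrinsics : public Inst {
    public:
      static InstIntrinsics *create(Cfg *Func, SizeT NumArgs, Variable *Dest,
                                    const Intrinsics::IntrinsicInfo &Info);
      void addArg(Operand *Arg);  // appends to Srcs[]
      Intrinsics::IntrinsicInfo getIntrinsicInfo() const { return Info; }
      // Same Kind-enum RTTI as the rest of the Inst hierarchy; the kind
      // value itself might get renamed along with the class.
      static bool classof(const Inst *I) { return I->getKind() == IntrinsicCall; }

    private:
      InstIntrinsics(Cfg *Func, SizeT NumArgs, Variable *Dest,
                     const Intrinsics::IntrinsicInfo &Info);
      const Intrinsics::IntrinsicInfo Info;
    };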
Description
While investigating this bug, I discovered that when using the Subzero backend targeting Windows x86 (32-bit), enabling optimization level Om1 ("minus one") produces invalid code for many of the ReactorUnitTests, typically resulting in access violations. In particular, the following ReactorUnitTests cases crash: Sample, Uninitialized, Branching, MulHigh, LargeStack, Call_ArgsMixed, Fibonacci, Coroutines_Parameters, Coroutines_Parallel, Intrinsics_Scatter, Intrinsics_Gather, ExtractFromRValue, Multithreaded_Coroutine.
After much investigation (gorily detailed below), I determined that this is related to Subzero not sorting and combining allocas when targeting Win32 - that is, it calls Func->processAllocas(SortAndCombineAllocas) with SortAndCombineAllocas == false. This only happens for optimization level Om1, not O2, which sets SortAndCombineAllocas = true. For now, we can work around this bug by also setting SortAndCombineAllocas = true for Om1 so that this mode can be used to help find optimization bugs, but we should try to figure out the real problem here.
Gory Details
Focusing on the Uninitialized case, we can compare the generated code on Win32 vs Linux32:
On Linux, ebp is used to access locals, while on Windows, esp is used. This comes down to Traits::X86_STACK_ALIGNMENT_BYTES, which is 4 for Windows and 16 for Linux. The function TargetX86Base::needsStackPointerAlignment() is used to determine whether to use esp or ebp.
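If I recall correctly, the check boils down to something like the following (paraphrased, not a verbatim quote of the source):

    // Use the (re-aligned) stack pointer to address stack variables whenever
    // the OS ABI's stack alignment is smaller than what we need for vectors,
    // presumably because ebp-relative offsets into a re-aligned area would
    // not be compile-time constants.
    bool needsStackPointerAlignment() const override {
      return Traits::X86_STACK_ALIGNMENT_BYTES < 16;  // 4 on Win32, 16 on Linux x86-32
    }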
Indeed, forcing Traits::X86_STACK_ALIGNMENT_BYTES to 16 on Windows ends up generating the same code as on Linux, and now the test no longer crashes.
I think it must be happening in TargetX86Base<TraitsType>::addProlog() when it calls assignVarStackSlots() with the last parameter, UsesFramePointer, set to IsEbpBasedFrame && !needsStackPointerAlignment(), so false on Windows but true on Linux.
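For reference, the call in addProlog() looks roughly like this (argument names reproduced from memory and possibly not exact):

    // The last argument is the UsesFramePointer flag: it decides whether
    // each spilled variable gets an ebp-relative or esp-relative offset.
    assignVarStackSlots(SortedSpilledVariables, SpillAreaPaddingBytes,
                        SpillAreaSizeBytes, GlobalsAndSubsequentPaddingSize,
                        IsEbpBasedFrame && !needsStackPointerAlignment());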
I think I understand the problem. In TargetX86Base<TraitsType>::lowerAlloca, the local variable UseFramePointer gets set to true for Win32 (and Linux32) because hasFramePointer() is true.
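The relevant logic in lowerAlloca, paraphrased from memory (names and exact terms may be slightly off):

    // An alloca is addressed through the frame pointer if the function
    // already needs one, if the alloca is over-aligned, if its frame offset
    // isn't known yet, or at Om1.
    const bool OverAligned = Alignment > Traits::X86_STACK_ALIGNMENT_BYTES;
    const bool OptM1 = Func->getOptLevel() == Opt_m1;
    const bool AllocaWithKnownOffset = Instr->getKnownFrameOffset();
    const bool UseFramePointer =
        hasFramePointer() || OverAligned || !AllocaWithKnownOffset || OptM1;
    if (UseFramePointer)
      setHasFramePointer();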
Further down in lowerAlloca, if UseFramePointer is true, it emits a sub esp,<size of variable>. I believe the purpose of this is to offset the stack pointer in case this function calls another, so that the callee doesn't step over its stack (but I'm not sure). Thus, for each alloca, we end up emitting one of these sub esp adjustments.
Note that this offsets esp as we emit allocas. After the allocas are emitted for the function, the body of the function is generated, and when stack variables are referenced, it uses either frame pointer (ebp) - offset or stack pointer (esp) + offset. The offset for each variable is a fixed value for the body of the function. Now, when the frame pointer is used, the offsets from it are naturally correct because the base pointer never changes. However, if the stack pointer is used, the fact that the stack pointer moves in between instructions that use it to reference stack variables means the fixed offsets become invalid.
The problem is that [esp+24] is supposed to be the fixed offset to variable A, but then we offset esp, making this offset no longer valid for the rest of the body.
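To make the arithmetic concrete, here is a tiny standalone model of the failure (illustrative only - this is not Subzero code, and the numbers are made up):

    // The body addresses variable A through a fixed esp-relative offset
    // computed up front, but a later per-alloca "sub esp, N" moves esp,
    // so the fixed offset no longer points at A.
    #include <cassert>

    int main() {
      int esp = 0x1000;                // esp right after the prologue
      const int OffsetA = 24;          // fixed esp-relative offset chosen for A
      const int AddrA = esp + OffsetA; // where the body expects A to live
      esp -= 32;                       // a later alloca lowers to "sub esp, 32"
      assert(esp + OffsetA != AddrA);  // [esp+24] now points somewhere else
      return 0;
    }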
Note that when using the frame pointer, we don't have this problem: the fact that esp changes doesn't matter, as we always use ebp to access stack variables.
So now, the real problem: the fact that UseFramePointer was set to true in lowerAlloca assumed that the frame pointer would be used to reference stack variables. However, in TargetX86Base<TraitsType>::addProlog(), when it calls assignVarStackSlots(), which determines whether to assign a frame-based or stack-based offset to each variable, the last argument, UsesFramePointer, is set to IsEbpBasedFrame && !needsStackPointerAlignment(). The !needsStackPointerAlignment() here means that, despite the fact that we were supposed to use the frame pointer for stack offsets, we instead use the stack pointer. This is incongruent with the initial assumption in lowerAlloca described above.
To work around this, we would want UseFramePointer to be false in lowerAlloca. For that to happen, though, all stack variables must already have known offsets, and the way to ensure that is to make sure the allocas are sorted and combined. In TargetX86Base<TraitsType>::translateOm1(), we have:
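(The snippet below is paraphrased from memory rather than quoted verbatim; the surrounding function is longer.)

    // Om1 skips sorting/combining allocas:
    constexpr bool SortAndCombineAllocas = false;  // translateO2() passes true
    Func->processAllocas(SortAndCombineAllocas);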
If we change SortAndCombineAllocas to true, then we no longer use the frame pointer, and everything works. In fact, this is what TargetX86Base<TraitsType>::translateO2() does. With this change, the generated code no longer emits anything that moves esp except once at the top of the function, and it is able to use esp to access local variables correctly.
Of course, this is just a quick fix, and it means we are now optimizing allocas in Om1. I'm still not exactly sure what the real problem is - in particular, I don't know why sub esp,<value> is emitted for stack vars with unknown offsets when ebp is going to be used.