Comments
ap...@google.com #2
One design decision worth elaborating on is whether or not to rename InstIntrinsicCall to InstIntrinsics. With InstIntrinsicCall no longer derived from InstCall (in order to get rid of the Target parameter), it can never get lowered into a call (at least not a call for which the target is unknown to the backend). In high-level languages, intrinsics look like function calls, but at Subzero's IR level they look more like instructions. Hence dropping the "Call" from the name would make sense. Alternatives discussed with Antonio include:
- Keep the InstCall target operand separate from the Srcs operand list, so the goal of aligning the source operands of load/store-like intrinsics with those of regular load/store instructions is still achieved. Unfortunately this causes complications for liveness analysis, since the target operand of a regular call can be a function pointer produced by other instructions, which then needs to be considered live. We could teach Subzero to treat the target operand as an additional source operand, but that's a bit risky and might require more than liveness analysis to be updated.
- Move the target operand to the last position of Srcs[]. This works without affecting things like liveness analysis, but is still a bit risky since we must ensure nothing looks for the target operand at index 0. It would also still be an unused, undefined value for intrinsics.
- Add virtual methods for accessing load/store operands. Currently there are no virtual methods on Inst, and since we process many thousands of these, making virtual calls could have a performance impact. Also note that Inst already uses a Kind enum to perform its own RTTI (a minimal sketch of that pattern follows this list).
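For context, here is a minimal sketch of that Kind-enum RTTI pattern. This is not Subzero's actual declarations (the enum values and class bodies are made up for illustration); it only shows how type queries can dispatch on an enum instead of through a vtable:

    // Each Inst subclass records its InstKind; queries compare getKind()
    // rather than going through virtual dispatch, so Inst can stay
    // virtual-free on the hot path.
    class Inst {
    public:
      enum InstKind { Alloca, Call, IntrinsicCall, Load, Store /* ... */ };
      InstKind getKind() const { return Kind; }
    protected:
      explicit Inst(InstKind Kind) : Kind(Kind) {}
    private:
      const InstKind Kind;
    };

    class InstLoad : public Inst {
    public:
      InstLoad() : Inst(Load) {}
      // Used by isa<>/cast<>-style checks: a plain enum comparison.
      static bool classof(const Inst *I) { return I->getKind() == Load; }
    };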
It's worth noting that to LLVM, intrinsics are definitely considered functions, invoked through the regular call instruction. Despite Subzero's design clearly having been inspired by LLVM, it uses a separate class for representing intrinsics. InstIntrinsicCall also takes an Intrinsics::IntrinsicInfo argument at construction, whereas with LLVM that information is part of the call target.
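To illustrate the LLVM side, here is a rough sketch using the LLVM C++ API as I remember it (not code from Subzero; the helper name is made up): emitting an intrinsic such as llvm.ctpop.i32 just means declaring a Function and creating an ordinary CallInst for it.

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Intrinsics.h"
    #include "llvm/IR/Module.h"

    // The intrinsic's identity lives entirely in the callee Function; the
    // emitted instruction is a plain CallInst.
    llvm::CallInst *emitCtpop(llvm::Module &M, llvm::IRBuilder<> &Builder,
                              llvm::Value *I32Val) {
      llvm::Function *Ctpop = llvm::Intrinsic::getDeclaration(
          &M, llvm::Intrinsic::ctpop, {Builder.getInt32Ty()});
      return Builder.CreateCall(Ctpop, {I32Val});
    }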
Hence I intend to move forward with renaming it to InstIntrinsics, and with adding some comments to clarify that it represents an extension instruction whose exact functionality only a few parts of the compiler need to know about.
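Very roughly, and purely as a sketch of the direction rather than a final interface (it assumes Subzero's existing Inst, Cfg, Operand, Variable, and Intrinsics types, and the member names are placeholders), the renamed class could look something like this:

    // No call target operand: the intrinsic's arguments live directly in
    // Srcs[], and the IntrinsicInfo is still supplied at construction.
    class InstIntrinsics : public Inst {
    public:
      static InstIntrinsics *create(Cfg *Func, SizeT NumArgs, Variable *Dest,
                                    const Intrinsics::IntrinsicInfo &Info);
      void addArg(Operand *Arg);  // appends to Srcs[]
      Intrinsics::IntrinsicInfo getIntrinsicInfo() const { return Info; }
      // Same Kind-enum RTTI as the rest of the Inst hierarchy; the kind
      // value itself might get renamed along with the class.
      static bool classof(const Inst *I) { return I->getKind() == IntrinsicCall; }

    private:
      InstIntrinsics(Cfg *Func, SizeT NumArgs, Variable *Dest,
                     const Intrinsics::IntrinsicInfo &Info);
      const Intrinsics::IntrinsicInfo Info;
    };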
Description
While investigating this bug, I discovered that when using the Subzero backend targeting Windows x86 (32-bit), enabling optimization level Om1 ("minus one") produces invalid code for many of the ReactorUnitTests, typically resulting in access violations. In particular, the following ReactorUnitTests cases crash: Sample, Uninitialized, Branching, MulHigh, LargeStack, Call_ArgsMixed, Fibonacci, Coroutines_Parameters, Coroutines_Parallel, Intrinsics_Scatter, Intrinsics_Gather, ExtractFromRValue, Multithreaded_Coroutine.
After much investigation (gorily detailed below), I determined that this is related to Subzero not sorting and combining allocas when targeting Win32 - that is, it calls Func->processAllocas(SortAndCombineAllocas) with SortAndCombineAllocas == false. This only happens for optimization level Om1, not O2, which sets SortAndCombineAllocas = true. For now, we can work around this bug by also setting SortAndCombineAllocas = true for Om1 so that this mode can be used to help find optimization bugs, but we should try to figure out the real problem here.
Gory Details
Focusing on the Uninitialized case, we can compare the generated code on Win32 vs Linux32:
On Linux, ebp is used to access locals, while on Windows, esp is used. This comes down to Traits::X86_STACK_ALIGNMENT_BYTES, which is 4 for Windows and 16 for Linux. The function TargetX86Base::needsStackPointerAlignment() is used to determine whether to use esp or ebp.
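If I recall correctly, the check boils down to something like the following (paraphrased, not a verbatim quote of the source):

    // Use the (re-aligned) stack pointer to address stack variables whenever
    // the OS ABI's stack alignment is smaller than what we need for vectors,
    // presumably because ebp-relative offsets into a re-aligned area would
    // not be compile-time constants.
    bool needsStackPointerAlignment() const override {
      return Traits::X86_STACK_ALIGNMENT_BYTES < 16;  // 4 on Win32, 16 on Linux x86-32
    }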
Indeed, forcing Traits::X86_STACK_ALIGNMENT_BYTES to 16 on Windows ends up generating the same code as on Linux, and now the test no longer crashes.
I think it must be happening in TargetX86Base<TraitsType>::addProlog() when it calls assignVarStackSlots() with the last parameter, UsesFramePointer, set to IsEbpBasedFrame && !needsStackPointerAlignment(), so false on Windows but true on Linux.
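For reference, the call in addProlog() looks roughly like this (argument names reproduced from memory and possibly not exact):

    // The last argument is the UsesFramePointer flag: it decides whether
    // each spilled variable gets an ebp-relative or esp-relative offset.
    assignVarStackSlots(SortedSpilledVariables, SpillAreaPaddingBytes,
                        SpillAreaSizeBytes, GlobalsAndSubsequentPaddingSize,
                        IsEbpBasedFrame && !needsStackPointerAlignment());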
I think I understand the problem. In TargetX86Base<TraitsType>::lowerAlloca, the local variable UseFramePointer gets set to true for Win32 (and Linux32) because hasFramePointer() is true.
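The relevant logic in lowerAlloca, paraphrased from memory (names and exact terms may be slightly off):

    // An alloca is addressed through the frame pointer if the function
    // already needs one, if the alloca is over-aligned, if its frame offset
    // isn't known yet, or at Om1.
    const bool OverAligned = Alignment > Traits::X86_STACK_ALIGNMENT_BYTES;
    const bool OptM1 = Func->getOptLevel() == Opt_m1;
    const bool AllocaWithKnownOffset = Instr->getKnownFrameOffset();
    const bool UseFramePointer =
        hasFramePointer() || OverAligned || !AllocaWithKnownOffset || OptM1;
    if (UseFramePointer)
      setHasFramePointer();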
Further down in lowerAlloca, if UseFramePointer is true, it emits a sub esp,<size of variable>. I believe the purpose of this is to offset the stack pointer in case this function calls another, so that the callee doesn't step over its stack (but I'm not sure). Thus, for each alloca, we end up emitting one of these sub esp adjustments.
Note that this offsets esp as we emit allocas. After the allocas are emitted for the function, the body of the function is generated, and when stack variables are referenced, it uses either frame pointer (ebp) - offset or stack pointer (esp) + offset. The offset for each variable is a fixed value for the body of the function. Now, when the frame pointer is used, the offsets from it are naturally correct because the base pointer never changes. However, if the stack pointer is used, the fact that the stack pointer moves in between instructions that use it to reference stack variables means the fixed offsets become invalid.
The problem is that [esp+24] is supposed to be the fixed offset to variable A, but then we offset esp, making this offset no longer valid for the rest of the body.
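To make the arithmetic concrete, here is a tiny standalone model of the failure (illustrative only - this is not Subzero code, and the numbers are made up):

    // The body addresses variable A through a fixed esp-relative offset
    // computed up front, but a later per-alloca "sub esp, N" moves esp,
    // so the fixed offset no longer points at A.
    #include <cassert>

    int main() {
      int esp = 0x1000;                // esp right after the prologue
      const int OffsetA = 24;          // fixed esp-relative offset chosen for A
      const int AddrA = esp + OffsetA; // where the body expects A to live
      esp -= 32;                       // a later alloca lowers to "sub esp, 32"
      assert(esp + OffsetA != AddrA);  // [esp+24] now points somewhere else
      return 0;
    }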
Note that when using the frame pointer, we don't have this problem: the fact that esp changes doesn't matter, as we always use ebp to access stack variables.
So now, the real problem: the fact that UseFramePointer was set to true in lowerAlloca assumed that the frame pointer would be used to reference stack variables. However, in TargetX86Base<TraitsType>::addProlog(), when it calls assignVarStackSlots(), which determines whether to assign a frame-based or stack-based offset to each variable, the last argument, UsesFramePointer, is set to IsEbpBasedFrame && !needsStackPointerAlignment(). The !needsStackPointerAlignment() here means that, despite the fact that we were supposed to use the frame pointer for stack offsets, we instead use the stack pointer. This is incongruent with the initial assumption in lowerAlloca described above.
To work around this, we would want UseFramePointer to be false in lowerAlloca. For that to happen, though, all stack variables must already have known offsets, and the way to ensure that is to make sure the allocas are sorted and combined. In TargetX86Base<TraitsType>::translateOm1(), we have:
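(The snippet below is paraphrased from memory rather than quoted verbatim; the surrounding function is longer.)

    // Om1 skips sorting/combining allocas:
    constexpr bool SortAndCombineAllocas = false;  // translateO2() passes true
    Func->processAllocas(SortAndCombineAllocas);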
If we change SortAndCombineAllocas to true, then we no longer use the frame pointer, and everything works. In fact, this is what TargetX86Base<TraitsType>::translateO2() does. With this change, the generated code no longer emits anything that moves esp except once at the top of the function, and it is able to use esp to access local variables correctly.
Of course, this is just a quick fix, and it means we are now optimizing allocas in Om1. I'm still not exactly sure what the real problem is - in particular, I don't know why sub esp,<value> is emitted for stack vars with unknown offsets when ebp is going to be used.