The Go low-level calling convention on x86-64
=============================================

:Author: Raphael ‘kena’ Poss
:Date: July 2018
:modified: 2018-09-03
:slug: go-calling-convention-x86-64
:category: Programming
:tags: golang, compilers, analysis, programming languages, c++
:series: Go low-level code analysis

.. contents::

----

.. note:: The latest version of this document can be found online at
   https://dr-knz.net/go-calling-convention-x86-64.html.
   Alternate formats: `Source `_, `PDF `_.

Introduction
------------

This article analyzes how the Go compiler generates code for function
calls, argument passing and exception handling on x86-64 targets.

This expressly does not analyze how the Go compiler lays out data in
memory (other than function arguments and return values), how escape
analysis works, what the code generator must do to accommodate the
asynchronous garbage collector, and how the handling of goroutines
impacts code generation.

All tests below are performed using the ``freebsd/amd64`` target of go
1.10.3. The assembly listings are produced with ``go tool objdump``
and use the `Go assembler syntax`__.

.. __: https://golang.org/doc/asm

.. note:: `A followup analysis <{filename}go-calling-convention-x86-64-2020.rst>`__
   is also available which revisits the findings with Go 1.15.
   Spoiler alert: only a few things have changed. The 2018 findings
   remain largely valid.

Argument and return values, call sequence
-----------------------------------------

Arguments and return value
~~~~~~~~~~~~~~~~~~~~~~~~~~

How does Go pass arguments to functions and return results?

Let us look at the simplest function:

.. code-block:: go

   func EmptyFunc() { }

This compiles to:

.. code-block:: asm

   EmptyFunc:
     0x480630  c3  RET

Now, with a return value:

.. code-block:: go

   func FuncConst() int { return 123 }

This compiles to:

.. code-block:: asm

   FuncConst:
     0x480630  48c74424087b000000  MOVQ $0x7b, 0x8(SP)
     0x480639  c3                  RET

So return values are passed via memory, on the stack, not in registers
as in most `standard x86-64 calling conventions`__ for natively
compiled languages.

.. __: https://en.wikipedia.org/wiki/X86_calling_conventions

Compare the output from a C or C++ compiler:

.. code-block:: asm

   FuncConst:
     movl $123, %eax
     retq

This passes the return value in a register.

How do simple arguments get passed in Go?

.. code-block:: go

   // Note: subtracting z so we know which argument is which.
   func FuncAdd(x,y,z int) int { return x + y - z }

This compiles to:

.. code-block:: asm

   FuncAdd:
     0x480630  488b442408  MOVQ 0x8(SP), AX    ; get arg x
     0x480635  488b4c2410  MOVQ 0x10(SP), CX   ; get arg y
     0x48063a  4801c8      ADDQ CX, AX         ; %ax <- x + y
     0x48063d  488b4c2418  MOVQ 0x18(SP), CX   ; get arg z
     0x480642  4801c8      SUBQ CX, AX         ; %ax <- x + y - z
     0x480645  4889442420  MOVQ AX, 0x20(SP)   ; return x+y-z
     0x48064a  c3          RET

So arguments are passed via memory, on the stack, not in registers as
in other languages. We also see that the arguments are at the top of
the stack, and the return value slot underneath that.

Compare the output from a C or C++ compiler:

.. code-block:: asm

   FuncAdd:
     leal (%rdi,%rsi), %eax
     subl %edx, %eax
     retq

This passes the arguments in registers. The exact number depends on
the calling convention, but for freebsd/amd64 up to 6 arguments are
passed in registers, the rest on the stack.

Note that there is an open proposal to implement register passing in
Go at https://github.com/golang/go/issues/18597. This proposal has not
yet been accepted.
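For readers who want to reproduce these listings, here is a minimal
sketch of the kind of test program used throughout (the file name,
package layout and symbol filter are mine, not taken from the original
build):

.. code-block:: go

   // demo.go: a small program whose disassembly shows the calling
   // convention. Build and disassemble with, for example:
   //
   //    go build -o demo demo.go
   //    go tool objdump -s 'FuncAdd|FuncConst' demo
   //
   // The //go:noinline directives keep the compiler from inlining the
   // small functions away, so their call sequences stay visible.
   package main

   //go:noinline
   func FuncConst() int { return 123 }

   //go:noinline
   func FuncAdd(x, y, z int) int { return x + y - z }

   func main() { println(FuncAdd(1, 2, 3) + FuncConst()) }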
Call sequence: how a function gets called ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ How does a function like ``FuncAdd`` above get called? .. code-block:: go func DoCallAdd() int { return FuncAdd(1, 2, 3) } This gives: .. code-block:: asm 0x480650 64488b0c25f8ffffff MOVQ FS:0xfffffff8, CX 0x480659 483b6110 CMPQ 0x10(CX), SP 0x48065d 7641 JBE 0x4806a0 0x48065f 4883ec28 SUBQ $0x28, SP 0x480663 48896c2420 MOVQ BP, 0x20(SP) 0x480668 488d6c2420 LEAQ 0x20(SP), BP 0x48066d 48c7042401000000 MOVQ $0x1, 0(SP) 0x480675 48c744240802000000 MOVQ $0x2, 0x8(SP) 0x48067e 48c744241003000000 MOVQ $0x3, 0x10(SP) 0x480687 e8a4ffffff CALL src.FuncAdd(SB) 0x48068c 488b442418 MOVQ 0x18(SP), AX 0x480691 4889442430 MOVQ AX, 0x30(SP) 0x480696 488b6c2420 MOVQ 0x20(SP), BP 0x48069b 4883c428 ADDQ $0x28, SP 0x48069f c3 RET 0x4806a0 e80ba5fcff CALL runtime.morestack_noctxt(SB) 0x4806a5 eba9 JMP src.DoCallAdd(SB) Woah, what is going on? At the center of the function we see what we wanted to see: .. code-block:: asm 0x48066d 48c7042401000000 MOVQ $0x1, 0(SP) ; set arg x 0x480675 48c744240802000000 MOVQ $0x2, 0x8(SP) ; set arg y 0x48067e 48c744241003000000 MOVQ $0x3, 0x10(SP) ; set arg z 0x480687 e8a4ffffff CALL src.FuncAdd(SB) ; call 0x48068c 488b442418 MOVQ 0x18(SP), AX ; get return value of FuncAdd 0x480691 4889442430 MOVQ AX, 0x30(SP) ; set return value of DoCallAdd The arguments are pushed into the stack before the call, and after the call the return value is retrieved from the callee frame and copied to the caller frame. So far, so good. However, now we know that arguments are passed on the stack, this means that any function that calls other functions now must ensure there is some stack space to pass arguments to its callees. This is what we see here: .. code-block:: asm ; Before the call: make space for callee. 0x48065f 4883ec28 SUBQ $0x28, SP ; After the call: restore stack pointer. 0x48069b 4883c428 ADDQ $0x28, SP Now, what is the remaining stuff? Because Go has exceptions (“panics”) it must preserve the ability of the runtime system to unwind the stack. So in every activation record it must store the difference between the stack pointer on entry and the stack pointer for callees. This is the “frame pointer” which is stored in this calling convention in the BP register. That is why we see: .. code-block:: asm ; Store the frame pointer of the caller into a known location in ; the current activation record. 0x480663 48896c2420 MOVQ BP, 0x20(SP) ; Store the address of the copy of the parent frame pointer ; into the new frame pointer. 0x480668 488d6c2420 LEAQ 0x20(SP), BP This maintains the invariant of the calling convention that BP always points to a linked list of frame pointers, where each successive value of BP is 32 bytes beyond the value of the stack pointer in the current frame (SP+0x20). This way the stack can always be successfully unwound. Finally, what about the last bit of code? .. code-block:: asm 0x480650 64488b0c25f8ffffff MOVQ FS:0xfffffff8, CX 0x480659 483b6110 CMPQ 0x10(CX), SP 0x48065d 7641 JBE 0x4806a0 ... 0x4806a0 e80ba5fcff CALL runtime.morestack_noctxt(SB) 0x4806a5 eba9 JMP src.DoCallAdd(SB) The Go runtime implements “tiny stacks” as an optimization: a goroutine always starts with a very small stack so that a running go program can have many “small” goroutines active at the same time. However that means that on the standard tiny stack it is not really possible to call many functions recursively. 
Therefore, in Go, every function that needs an activation record on
the stack needs first to check whether the current goroutine stack is
large enough for this. It does this by comparing the current value of
the stack pointer to the low water mark of the current goroutine,
stored at offset 16 (0x10) of the goroutine struct, which itself can
always be found at address FS:0xfffffff8.

Compare how ``DoCallAdd`` works in C or C++:

.. code-block:: asm

   DoCallAdd:
     movl $3, %edx
     movl $2, %esi
     movl $1, %edi
     jmp  FuncAdd

This passes the arguments in registers, then transfers control to the
callee with a ``jmp`` — a tail call. This is valid because the return
value of ``FuncAdd`` becomes the return value of ``DoCallAdd``.

What of the stack pointer? The function ``DoCallAdd`` cannot tell us
much in C because, in contrast to Go, it does not have any variables
on the stack and thus does not need an activation record. In general
(and that is valid for Go too), if there is no need for an activation
record, there is no need to set up / adjust the stack pointer.

So how would a C/C++ compiler handle an activation record? We can
force one like this:

.. code-block:: c

   void other(int *x);
   int DoCallAddX() {
      int x = 123;
      other(&x);
      return x;
   }

Gives us:

.. code-block:: asm

   DoCallAddX:
     subq  $24, %rsp        ; make space
     leaq  12(%rsp), %rdi   ; allocate x at address rsp+12
     movl  $123, 12(%rsp)   ; store 123 into x
     call  other            ; call other(&x)
     movl  12(%rsp), %eax   ; load value from x
     addq  $24, %rsp        ; restore stack pointer
     ret

So ``%rsp`` gets adjusted upon function entry and restored in the
epilogue. No surprise. But is there? What of exception handling?

Aside: exceptions in C/C++
~~~~~~~~~~~~~~~~~~~~~~~~~~

The assembly above was generated with a C/C++ compiler that *does*
support exceptions. In general, the compiler cannot assume that a
callee won't throw an exception. Yet we did not see anything about
saving the stack pointer and/or setting up a frame pointer in the
generated code above. So how does the C/C++ runtime handle stack
unwinding?

There are fundamentally two ways to implement exception propagation in
an ABI (Application Binary Interface):

- “dynamic registration”, with frame pointers in each activation
  record, organized as a linked list. This makes stack unwinding fast
  at the expense of having to set up the frame pointer in each
  function that calls other functions. This is also simpler to
  implement.

- “table-driven”, where the compiler and assembler create data
  structures *alongside* the program code to indicate which addresses
  of code correspond to which sizes of activation records. This is
  called “Call Frame Information” (CFI) data in e.g. the GNU tool
  chain. When an exception is generated, the data in this table is
  loaded to determine how to unwind. This makes exception propagation
  slower but the general case faster.

In general, a language where exceptions are common and used for
control flow will adopt dynamic registration, whereas a language where
exceptions are rare will adopt table-driven unwinding to ensure the
common case is more efficient. The latter choice is extremely common
for C/C++ compilers.

Interestingly, the Go language designers recommend *against* using
exceptions (“panics”) for control flow, so one would expect the
language to fall in the second category and the compiler to implement
table-driven unwinding as well. Yet the Go compiler still uses dynamic
registration. Maybe the table-driven approach was not used because it
is more complex to implement?
More reading:

- JL Schilling - `Optimizing away C++ exception handling`__ - ACM
  SIGPLAN Notices, 1998

.. __: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.8337&rep=rep1&type=pdf

Callee-save registers—or not
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Are there callee-save registers in Go? Can the Go compiler expect that
a callee will avoid using some registers, i.e. that they won't be
clobbered unless strictly needed?

In other languages, this optimization enables a function that calls
another function to keep “important” values in registers and avoid
pushing its temporary variables to the stack (which would force the
creation of an activation record on the stack).

Let's try:

.. code-block:: go

   func Intermediate() int {
      x := Other()
      x += Other()
      return x
   }

Is there a callee-save register for the Go compiler to store ``x`` in?
Let's check:

.. code-block:: asm

   Intermediate:
     [...]
     0x4806dd  e8ceffffff  CALL src.Other(SB)
     0x4806e2  488b0424    MOVQ 0(SP), AX
     0x4806e6  4889442408  MOVQ AX, 0x8(SP)
     0x4806eb  e8c0ffffff  CALL src.Other(SB)
     0x4806f0  488b442408  MOVQ 0x8(SP), AX
     0x4806f5  48030424    ADDQ 0(SP), AX
     0x4806f9  4889442420  MOVQ AX, 0x20(SP)
     [...]

So, no. The Go compiler always spills the temporaries to the stack
during calls.

What does the C/C++ compiler do for this? Let's see:

.. code-block:: asm

   Intermediate:
     pushq %rbx           ; save %rbx from caller
     xorl  %eax, %eax
     call  other
     movl  %eax, %ebx     ; use callee-save for intermediate result
     xorl  %eax, %eax
     call  other
     addl  %ebx, %eax     ; use callee-save again
     popq  %rbx           ; restore callee-save for caller
     ret

Most C/C++ calling conventions have a number of callee-save registers
for intermediate results. On this platform, this includes at least
``%rbx``.

The cost of pointers and interfaces
-----------------------------------

Go implements both pointer types (e.g. ``*int``) and interface types
with vtables (comparable to classes containing virtual methods in
C++). How are they implemented in the calling convention?

Pointers use just one word
~~~~~~~~~~~~~~~~~~~~~~~~~~

Looking at the following code:

.. code-block:: go

   func UsePtr(x *int) int { return *x }

The generated code:

.. code-block:: asm

   UsePtr:
     0x480630  488b442408  MOVQ 0x8(SP), AX   ; load x
     0x480635  488b00      MOVQ 0(AX), AX     ; load *x
     0x480638  4889442410  MOVQ AX, 0x10(SP)  ; return *x
     0x48063d  c3          RET

So a pointer is the same size as an ``int`` and uses just one word
slot in the argument struct. Ditto for return values:

.. code-block:: go

   var x int
   func RetPtr() *int { return &x }
   func NilPtr() *int { return nil }

This gives us:

.. code-block:: asm

   RetPtr:
     0x480650  488d0581010c00      LEAQ src.x(SB), AX  ; compute &x
     0x480657  4889442408          MOVQ AX, 0x8(SP)    ; return &x
     0x48065c  c3                  RET
   NilPtr:
     0x480660  48c744240800000000  MOVQ $0x0, 0x8(SP)  ; return 0
     0x480669  c3                  RET

Interfaces use two words
~~~~~~~~~~~~~~~~~~~~~~~~

Considering the following code:

.. code-block:: go

   type Foo interface{ foo() }

   func InterfaceNil() Foo { return nil }

The compiler generates the following:

.. code-block:: asm

   InterfaceNil:
     0x4805b0  0f57c0      XORPS X0, X0
     0x4805b3  0f11442408  MOVUPS X0, 0x8(SP)
     0x4805b8  c3          RET

So an interface value is bigger. The pseudo-register ``X0`` in the Go
pseudo-assembly is really the x86 ``%xmm0``, a full 16-byte (128 bit)
register.

We can confirm that by looking at a function that simply forwards an
interface argument as a return value:

.. code-block:: go

   func InterfacePass(x Foo) Foo { return x }

This gives us:

..
code-block:: asm InterfacePass: 0x4805b0 488b442408 MOVQ 0x8(SP), AX 0x4805b5 4889442418 MOVQ AX, 0x18(SP) 0x4805ba 488b442410 MOVQ 0x10(SP), AX 0x4805bf 4889442420 MOVQ AX, 0x20(SP) 0x4805c4 c3 RET Although there is just 1 argument and return value, the compiler has to copy two words. Interface “values” are really a pointer to a vtable and a value combined together. Strings and slices use two and three words ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Next to pointers (one word) and interface values (two words) the Go compiler also has special layouts for two other things: - the special type ``string`` is implemented as two words: a word containing the length, and a word to the start of the string. This supports computing ``len()`` and slicing in constant time. - all slice types (``[]T``) are implemented using 3 words: a length, a capacity and a pointer to the first element. This supports computing ``len()``, ``cap()`` and slicing in constant time. The reason why ``string`` values do not need a capacity is that ``string`` is an immutable type in Go. Constructing interface values ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. note:: The information in this section was collected with Go 1.10. It remains largely unchanged in Go 1.15. The minor differences are reviewed in `this followup analysis `__. Constructing a non-nil interface value requires storing the vtable pointer alongside the value. In most real world cases the vtable part is known statically (because the type being cast to the interface type is known statically). We'll ignore the conversions from one interface type to another here. For the value part, Go has multiple implementation strategies based on the actual type of value. The most common case, an interface implemented by a pointer type, looks like this: .. code-block:: go // Define the interface. type Foo interface{ foo() } // Define a struct type implementing the interface by pointer. type foo struct{ x int } func (*foo) foo() {} // Define a global variable so we don't use the heap allocator. var x foo // Make an interface value. func MakeInterface1() Foo { return &x } This gives us: .. code-block:: asm MakeInterface1: 0x4805c0 488d05d9010400 LEAQ go.itab.*src.foo,src.Foo(SB), AX 0x4805c7 4889442408 MOVQ AX, 0x8(SP) 0x4805cc 488d0505f20b00 LEAQ src.x(SB), AX 0x4805d3 4889442410 MOVQ AX, 0x10(SP) 0x4805d8 c3 RET Just as predicted: address of vtable in the first word, pointer to the struct in the second word. No surprise. Things become a bit more expensive if the struct implements the interface by value: .. code-block:: go // Define a struct type implementing the interface by value. type bar struct{ x int } func (bar) foo() {} // Define a global variable so we don't use the heap allocator. var y bar // Make an interface value. func MakeInterface2() Foo { return y } This gives us: .. 
code-block:: asm

   MakeInterface2:
     0x4805c0  64488b0c25f8ffffff  MOVQ FS:0xfffffff8, CX
     0x4805c9  483b6110            CMPQ 0x10(CX), SP
     0x4805cd  7648                JBE 0x480617
     0x4805cf  4883ec28            SUBQ $0x28, SP
     0x4805d3  48896c2420          MOVQ BP, 0x20(SP)
     0x4805d8  488d6c2420          LEAQ 0x20(SP), BP
     0x4805dd  488d053c020400      LEAQ go.itab.src.bar,src.Foo(SB), AX
     0x4805e4  48890424            MOVQ AX, 0(SP)
     0x4805e8  488d05e9f10b00      LEAQ src.x(SB), AX
     0x4805ef  4889442408          MOVQ AX, 0x8(SP)
     0x4805f4  e8e7b2f8ff          CALL runtime.convT2I64(SB)
     0x4805f9  488b442410          MOVQ 0x10(SP), AX
     0x4805fe  488b4c2418          MOVQ 0x18(SP), CX
     0x480603  4889442430          MOVQ AX, 0x30(SP)
     0x480608  48894c2438          MOVQ CX, 0x38(SP)
     0x48060d  488b6c2420          MOVQ 0x20(SP), BP
     0x480612  4883c428            ADDQ $0x28, SP
     0x480616  c3                  RET
     0x480617  e814a5fcff          CALL runtime.morestack_noctxt(SB)
     0x48061c  eba2                JMP github.com/knz/go-panic/src.MakeInterface2(SB)

Holy Moly. What just went on?

The function suddenly became much larger because it is now making a
call to another function ``runtime.convT2I64``. As per the previous
sections, as soon as there is a callee, the caller must set up an
activation record, so we see 1) a check that the stack is large
enough, 2) an adjustment of the stack pointer, and 3) preservation of
the frame pointer for stack unwinding during exceptions.

This explains the prologue and epilogue, so the “meat” that remains,
taking this into account, is this:

.. code-block:: asm

   0x4805dd  488d053c020400  LEAQ go.itab.src.bar,src.Foo(SB), AX
   0x4805e4  48890424        MOVQ AX, 0(SP)
   0x4805e8  488d05e9f10b00  LEAQ src.x(SB), AX
   0x4805ef  4889442408      MOVQ AX, 0x8(SP)
   0x4805f4  e8e7b2f8ff      CALL runtime.convT2I64(SB)
   0x4805f9  488b442410      MOVQ 0x10(SP), AX
   0x4805fe  488b4c2418      MOVQ 0x18(SP), CX
   0x480603  4889442430      MOVQ AX, 0x30(SP)
   0x480608  48894c2438      MOVQ CX, 0x38(SP)

What this does is perform the regular Go call
``runtime.convT2I64(&bar_foo_vtable, y)`` and return its result, which
is an interface value and thus takes two words.

What does this function do?

.. code-block:: go

   func convT2I64(tab *itab, elem unsafe.Pointer) (i iface) {
        t := tab._type
        // [...]
        var x unsafe.Pointer
        // [...]
        x = mallocgc(8, t, false)
        *(*uint64)(x) = *(*uint64)(elem)
        // [...]
        i.tab = tab
        i.data = x
        return
   }

What this really does is call the heap allocator and allocate a slot
in memory to store a copy of the value provided; a pointer to that
heap-allocated slot is stored in the interface value.

In other words, in general, types that implement interfaces by value
will mandate a trip to the heap allocator every time a value of that
type is turned into an interface value.

As a special case, if the value provided is the “zero value” for the
type implementing the interface, the heap allocation is avoided and a
special “address to the zero value” is used instead to construct the
interface reference. This is checked by ``convT2I64`` in the code I
elided above::

   if *(*uint64)(elem) == 0 {
        x = unsafe.Pointer(&zeroVal[0])
   } else {
        x = mallocgc(8, t, false)
        *(*uint64)(x) = *(*uint64)(elem)
   }

This is correct because the function ``convT2I64`` is only used for
64-bit types that implement the interface. This is true of the
``struct`` that I defined above, which contains just one 64-bit field.

There are many such ``convT2I`` functions for various type layouts
that may implement the interface, for example:

- ``convT2I16``, ``convT2I32``, ``convT2I64`` for “small” types;
- ``convT2Istring``, ``convT2Islice`` for ``string`` and slice types;
- ``convT2Inoptr`` for structs that do not contain pointers;
- ``convT2I`` for the general case.
All of them except for the general cases ``convT2Inoptr`` and ``convT2I`` will attempt to avoid the heap allocator if the value is the zero value. Nevertheless, in all these cases the caller that is constructing an interface value must check its stack size and set up an activation record, because it is making a call. So, in general, types that implement interfaces by value cause overhead when they are converted into the interface type. Interfaces for empty structs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There is just one, not-too-exciting super-special case: *empty structs*. These can implement the interface by value without overhead: .. code-block:: go type empty struct{} func (empty) foo() {} var x empty func MakeInterface3() Foo { return x } This gives us: .. code-block:: asm MakeInterface3: 0x4805c0 488d0539020400 LEAQ go.itab.src.empty,src.Foo(SB), AX 0x4805c7 4889442408 MOVQ AX, 0x8(SP) 0x4805cc 488d05edf20b00 LEAQ runtime.zerobase(SB), AX 0x4805d3 4889442410 MOVQ AX, 0x10(SP) 0x4805d8 c3 RET The “value part” of interface values for empty structs is always ``&runtime.zerobase`` and can be computed without a call and thus without overhead. ``error`` is an interface type ------------------------------ Compare the following two functions: .. code-block:: go func Simple1() int { return 123 } func Simple2() (int, error) { return 123, nil } And their generated code: .. code-block:: asm Simple1: 0x4805b0 48c74424087b000000 MOVQ $0x7b, 0x8(SP) 0x4805b9 c3 RET Simple2: 0x4805c0 48c74424087b000000 MOVQ $0x7b, 0x8(SP) 0x4805c9 0f57c0 XORPS X0, X0 0x4805cc 0f11442410 MOVUPS X0, 0x10(SP) 0x4805d1 c3 RET What we see here is that ``error`` being an interface, the function returning ``error`` must set up two extra words of return value. In the ``nil`` case this is still straightforward (at the expense of 16 bytes of extra zero data). It is also still pretty straightforward if the error object was pre-allocated. For example: .. code-block:: go var errDivByZero = errors.New("can't divide by zero") func Compute(x, y float64) (float64, error) { if y == 0 { return 0, errDivByZero } return x / y, nil } Compiling to: .. code-block:: asm Compute: 0x4805e0 f20f10442410 MOVSD_XMM 0x10(SP), X0 ; load y into X0 if y == 0 { 0x4805e6 0f57c9 XORPS X1, X1 ; compute float64(0) 0x4805e9 660f2ec1 UCOMISD X1, X0 ; is y == 0? 0x4805ed 7521 JNE 0x480610 ; no: go to return x/y 0x4805ef 7a1f JP 0x480610 ; no: go to return x/y return 0, errDivByZero 0x4805f1 488b05402c0a00 MOVQ src.errDivByZero+8(SB), AX 0x4805f8 488b0d312c0a00 MOVQ src.errDivByZero(SB), CX 0x4805ff f20f114c2418 MOVSD_XMM X1, 0x18(SP) 0x480605 48894c2420 MOVQ CX, 0x20(SP) 0x48060a 4889442428 MOVQ AX, 0x28(SP) 0x48060f c3 RET return x / y, nil 0x480610 f20f104c2408 MOVSD_XMM 0x8(SP), X1 ; load x into X1 0x480616 f20f5ec8 DIVSD X0, X1 ; compute x / y 0x48061a f20f114c2418 MOVSD_XMM X1, 0x18(SP) ; return x / y 0x480620 0f57c0 XORPS X0, X0 ; compute error(nil) 0x480623 0f11442420 MOVUPS X0, 0x20(SP) ; return error(nil) 0x480628 c3 RET Common case: computed errors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The simple case where error objects are pre-allocated is handled efficiently, but in real world code the error text is usually computed to include some contextual information, for example: .. 
code-block:: go func Compute(x, y float64) (float64, error) { if y == 0 { return 0, fmt.Errorf("can't divide %f by zero", x) } return x / y, nil } At the moment we organize the function this way, we are paying the price of a call to another function: setting up an activation record, frame pointer, checking the stack size, etc. Even on the “hot” path where the error does not occur. This makes the relatively “simple” function ``Compute``, where the crux of the computation is just 1 instruction, ``divsd``, extremely large: .. code-block:: asm Compute: ; [... stack size check, SP and BP set up elided ...] 0x482201 f20f10442468 MOVSD_XMM 0x68(SP), X0 ; load y if y == 0 { ; like before 0x482207 0f57c9 XORPS X1, X1 ; compute float64(0) 0x48220a 660f2ec1 UCOMISD X1, X0 ; is y == 0? 0x48220e 0f85a7000000 JNE 0x4822bb ; no: go to return x/y 0x482214 0f8aa1000000 JP 0x4822bb ; no: go to return x/y return 0, fmt.Errorf("can't divide %f by zero", x) 0x48221a f20f10442460 MOVSD_XMM 0x60(SP), X0 ; load x ; The following code allocates a special struct using ; runtime.convT2E64 to pass the variable arguments to ; fmt.Errorf. The struct contains the value of x. 0x482220 f20f11442438 MOVSD_XMM X0, 0x38(SP) 0x482226 0f57c0 XORPS X0, X0 0x482229 0f11442440 MOVUPS X0, 0x40(SP) 0x48222e 488d05cbf80000 LEAQ 0xf8cb(IP), AX 0x482235 48890424 MOVQ AX, 0(SP) 0x482239 488d442438 LEAQ 0x38(SP), AX 0x48223e 4889442408 MOVQ AX, 0x8(SP) 0x482243 e83895f8ff CALL runtime.convT2E64(SB) 0x482248 488b442410 MOVQ 0x10(SP), AX 0x48224d 488b4c2418 MOVQ 0x18(SP), CX ; The varargs struct is saved for later on the stack. 0x482252 4889442440 MOVQ AX, 0x40(SP) 0x482257 48894c2448 MOVQ CX, 0x48(SP) ; The constant string "can't divide..." is passed in the argument list of fmt.Errorf. 0x48225c 488d05111e0300 LEAQ 0x31e11(IP), AX 0x482263 48890424 MOVQ AX, 0(SP) 0x482267 48c744240817000000 MOVQ $0x17, 0x8(SP) ; A slice object is created to point to the vararg struct and given ; as argument to fmt.Errorf. 0x482270 488d442440 LEAQ 0x40(SP), AX 0x482275 4889442410 MOVQ AX, 0x10(SP) 0x48227a 48c744241801000000 MOVQ $0x1, 0x18(SP) 0x482283 48c744242001000000 MOVQ $0x1, 0x20(SP) 0x48228c e8cf81ffff CALL fmt.Errorf(SB) ; The result value of fmt.Errorf is retrieved. 0x482291 488b442428 MOVQ 0x28(SP), AX 0x482296 488b4c2430 MOVQ 0x30(SP), CX ; return float64(0) as first return value: 0x48229b 0f57c0 XORPS X0, X0 0x48229e f20f11442470 MOVSD_XMM X0, 0x70(SP) ; return the result of fmt.Errorf as 2nd return value: 0x4822a4 4889442478 MOVQ AX, 0x78(SP) 0x4822a9 48898c2480000000 MOVQ CX, 0x80(SP) ; [ ... restore BP/SP ... ] 0x4822ba c3 RET return x / y, nil ; same as before 0x4822bb f20f104c2460 MOVSD_XMM 0x60(SP), X1 ; load x into X1 0x4822c1 f20f5ec8 DIVSD X0, X1 ; compute x / y 0x4822c5 f20f114c2470 MOVSD_XMM X1, 0x70(SP) ; return x / y 0x4822cb 0f57c0 XORPS X0, X0 ; compute error(nil) 0x4822ce 0f11442478 MOVUPS X0, 0x78(SP) ; return error(nil) ; [ ... restore BP/SP ... ] 0x4822dc c3 RET So what are we learning here? - ``fmt.Errorf`` (like all vararg functions in Go) get additional argument passing code: the arguments are stored in the caller's activation record, and a slice object is given as argument to the vararg-accepting callee. - this price is paid on the cold path to any real-world function that allocates error objects dynamically when an error is encountered. 
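To relate the listing back to the source level, here is a hedged,
hand-written Go sketch of what the compiler effectively does for the
``fmt.Errorf`` call above (the helper name is mine; normally all of
this is generated code, not something the programmer writes):

.. code-block:: go

   package src

   import "fmt"

   // computeError spells out the call to fmt.Errorf above step by
   // step: the single vararg x is boxed into an interface{} value
   // (the runtime.convT2E64 call in the listing, which may hit the
   // heap allocator), a one-element []interface{} is built in the
   // caller's frame, and that slice is what the vararg-accepting
   // fmt.Errorf actually receives.
   func computeError(x float64) error {
      boxed := interface{}(x)      // boxing; runtime.convT2E64 above
      args := []interface{}{boxed} // the vararg slice on the stack
      return fmt.Errorf("can't divide %f by zero", args...)
   }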
We are not considering here the cost of running ``fmt.Errorf`` itself, which usually has to go to the heap allocator multiple times because it does not know in advance how long the computed string will be. .. note:: A more thorough review of how vararg calls work is available in `this followup analysis `__. Common case: testing for errors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The other common case is when a caller checks the error returned by a callee, like this: .. code-block:: go func Caller() (int, error) { v, err := Callee() if err != nil { return -1, err } return v + 1, nil } This gives us: .. code-block:: asm Caller: ; [... stack size check, SP and BP set up elided ...] v, err := Callee() 0x48061d e8beffffff CALL src.Callee(SB) 0x480622 488b442410 MOVQ 0x10(SP), AX ; retrieve return value 0x480627 488b0c24 MOVQ 0(SP), CX ; load error vtable 0x48062b 488b542408 MOVQ 0x8(SP), DX ; load error value if err != nil { 0x480630 4885d2 TESTQ DX, DX ; is the value part nil? 0x480633 741d JE 0x480652 ; yes, go to v+1 below return -1, err 0x480635 48c7442428ffffffff MOVQ $-0x1, 0x28(SP) ; return -1 0x48063e 4889542430 MOVQ DX, 0x30(SP) ; return err.vtable 0x480643 4889442438 MOVQ AX, 0x38(SP) ; return err.value ; [ ... restore BP/SP ... ] 0x480651 c3 RET return v + 1, nil 0x480652 488d4101 LEAQ 0x1(CX), AX ; compute v + 1 0x480656 4889442428 MOVQ AX, 0x28(SP) ; return v + 1 0x48065b 0f57c0 XORPS X0, X0 ; compute error(nil) 0x48065e 0f11442430 MOVUPS X0, 0x30(SP) ; return error(nil) ; [ ... restore BP/SP ... ] 0x48066c c3 RET So any time a caller needs to check the error return of a callee, there are 2 instructions to retrieve the error value, 2 instructions to test whether it is ``nil``, and in the “hot” path where there is no error two more instruction on every return path to return ``error(nil)``. For reference (we'll consider that again below), if there was no error to check/propagate the function becomes much simpler: .. code-block:: asm Caller: ; [... stack size check, SP and BP set up elided ...] 0x48060d e8ceffffff CALL github.com/knz/go-panic/src.Callee2(SB) 0x480612 488b0424 MOVQ 0(SP), AX ; retrieve return value 0x480616 48ffc0 INCQ AX ; compute v + 1 0x480619 4889442418 MOVQ AX, 0x18(SP) ; return v + 1 ; [ ... restore BP/SP ... ] 0x480627 c3 RET (No extra instructions, no extra branch.) Implementation of ``defer`` --------------------------- .. note:: The mechanism presented in this section is still used as of Go 1.15. However, Go 1.15 has an optimization that is enabled in a common case which simplifies the mechanism further. We will see how that works in `this followup analysis `__. Go provides a feature to register, from the body of a function, a list of callback functions that are *guaranteed* to be called when the call terminates, even during exception propagation. (This is useful e.g. to ensure that resources are freed and mutexes unlocked regardless of what happens with one of the callees.) How does this work? Let's consider the simple example: .. code-block:: go func Defer1() int { defer f(); return 123 } This compiles to: .. code-block:: asm Defer1: ; [... stack size check, SP and BP set up elided ...] ; Prepare the return value 0. This is set in memory because ; (theoretically, albeit not in this particular example) the deferred ; function can access the return value and may do so before it was ; set by the remainder of the function body. 
0x48208d  48c744242000000000  MOVQ $0x0, 0x20(SP)
   ; Prepare the defer by calling runtime.deferproc(0, &f)
   0x482096  c7042400000000      MOVL $0x0, 0(SP)
   0x48209d  488d05f46e0300      LEAQ 0x36ef4(IP), AX
   0x4820a4  4889442408          MOVQ AX, 0x8(SP)
   0x4820a9  e8822afaff          CALL runtime.deferproc(SB)
   ; Special check of the return value of runtime.deferproc.
   ; In the common case, deferproc returns 0.
   ; If a panic is generated by the function body (or one of the callees),
   ; and the defer function catches the panic with `recover`, then
   ; control will return again from `deferproc` with value 1.
   0x4820ae  85c0                TESTL AX, AX
   0x4820b0  7519                JNE 0x4820cb   ; has a panic been caught?
   ; Prepare the return value 123.
   0x4820b2  48c74424207b000000  MOVQ $0x7b, 0x20(SP)
   0x4820bb  90                  NOPL
   ; Ensure the defers are run.
   0x4820bc  e84f33faff          CALL runtime.deferreturn(SB)
   ; [ ... restore BP/SP ... ]
   0x4820ca  c3                  RET
   ; We've caught a panic. We're still running the defers.
   0x4820cb  90                  NOPL
   0x4820cc  e83f33faff          CALL runtime.deferreturn(SB)
   ; [ ... restore BP/SP ... ]
   0x4820da  c3                  RET

How to read this:

- the code generated for a function that contains ``defer`` always
  contains calls to ``deferproc`` and ``deferreturn`` and thus needs
  an activation record, and thus a stack size check and frame pointer
  setup.

- if a function contains ``defer`` there will be a call to
  ``deferreturn`` on every return path.

- the actual callback is not stored in the activation record of the
  function; instead what ``deferproc`` does (internally) is store the
  callback in a linked list from the goroutine's header struct.
  ``deferreturn`` runs and pops the entries from that linked list.

The code is generated this way regardless of whether the deferred
function contains ``recover()``, see below.

Deferred closures
~~~~~~~~~~~~~~~~~

In real-world uses, the deferred function is actually a closure that
has access to the enclosing function's local variables. For example:

.. code-block:: go

   func Defer2() (res int) {
      defer func() { res = 123 }()
      return -1
   }

This compiles to:

.. code-block:: asm

   Defer2:
     ; [... stack size check, SP and BP set up elided ...]
     ; Store the zero value as return value.
     0x48208d  48c744242800000000  MOVQ $0x0, 0x28(SP)
     ; Store the frame pointer of Defer2 for use by the deferred closure.
     0x482096  488d442428          LEAQ 0x28(SP), AX
     0x48209b  4889442410          MOVQ AX, 0x10(SP)
     ; Call runtime.deferproc(8, &Defer2.func1)
     ; Where Defer2.func1 is the code generated for the closure, see below.
     ; The closure takes an implicit argument, which is the frame
     ; pointer of the enclosing function, where it can peek
     ; at the enclosing function's local variables.
     0x4820a0  c7042408000000      MOVL $0x8, 0(SP)
     0x4820a7  488d05b26e0300      LEAQ 0x36eb2(IP), AX
     0x4820ae  4889442408          MOVQ AX, 0x8(SP)
     0x4820b3  e8782afaff          CALL runtime.deferproc(SB)
     ; Are we recovering from a panic?
     0x4820b8  85c0                TESTL AX, AX
     0x4820ba  7519                JNE 0x4820d5
     ; Common path.
     ; Set -1 as return value.
     0x4820bc  48c744242800000000  MOVQ $-1, 0x28(SP)
     0x4820c5  90                  NOPL
     ; Run the defers.
     0x4820c6  e84533faff          CALL runtime.deferreturn(SB)
     ; [ ... restore BP/SP ... ]
     0x4820d4  c3                  RET
     ; Recovering from a panic.
     0x4820d5  90                  NOPL
     0x4820d6  e83533faff          CALL runtime.deferreturn(SB)
     ; [ ... restore BP/SP ... ]
     0x4820e4  c3                  RET

   Defer2.func1:
     ; Load the frame pointer of the enclosing function.
     MOVQ 0x8(SP), AX
     ; Store the new value into the return value slot of the
     ; enclosing function's frame.
     MOVQ $123, (AX)
     RET

So a closure gets compiled as an anonymous function which receives a
pointer to the enclosing frame as an implicit first argument.
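To make this concrete, here is a hedged, hand-written Go equivalent of
``Defer2`` and its closure (the names ``Defer2Sketch`` and
``defer2Closure`` are mine, not the compiler's):

.. code-block:: go

   package src

   // defer2Closure is the hand-written counterpart of the generated
   // Defer2.func1: the closure body becomes a plain function, and the
   // captured variable (the named return value res) is reached through
   // a pointer passed as its only, normally implicit, argument.
   func defer2Closure(res *int) {
      *res = 123
   }

   // Defer2Sketch behaves like Defer2 above: the deferred call receives
   // the address of res, so it can overwrite the return value after the
   // function body has set it to -1. The function therefore returns 123.
   func Defer2Sketch() (res int) {
      defer defer2Closure(&res)
      return -1
   }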
Every non-local variable accessed in the closure is marked to force a
spill in the enclosing function, to ensure it is allocated on the
stack and not in a register. Since return values and arguments are
always on the stack anyway, using them in closures comes at no
additional overhead. This would be different for other variables,
which could otherwise avoid a stack allocation.

.. note:: This section focuses specifically on *deferred* closures.
   This gives the Go compiler the guarantee that the closure itself
   *does not escape*. If the closure did escape, then additional
   machinery would kick in to allocate the closure on the heap
   together with the variables it needs to access from the enclosing
   function.

Implementation of ``panic``
---------------------------

Using ``panic()`` in a function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A function that uses ``panic()`` without computing anything
(including, for now, not computing any object as exception) looks like
this:

.. code-block:: go

   func Panic1() { panic(nil) }

   var x int
   func Panic2() { panic(&x) }

This gives us:

.. code-block:: asm

   Panic1:
     ; [... stack size check, SP and BP set up elided ...]
     0x4805fd  0f57c0          XORPS X0, X0
     0x480600  0f110424        MOVUPS X0, 0(SP)
     0x480604  e8f747faff      CALL runtime.gopanic(SB)
     0x480609  0f0b            UD2

   Panic2:
     ; [... stack size check, SP and BP set up elided ...]
     0x4806dd  488d05bcaf0000  LEAQ 0xafbc(IP), AX
     0x4806e4  48890424        MOVQ AX, 0(SP)
     0x4806e8  488d05f1f00b00  LEAQ src.x(SB), AX
     0x4806ef  4889442408      MOVQ AX, 0x8(SP)
     0x4806f4  e80747faff      CALL runtime.gopanic(SB)
     0x4806f9  0f0b            UD2

What is going on? Using ``panic()`` in the body of a function
translates in any case to a call to ``runtime.gopanic()``. Therefore
in any case the function needs to check its stack size and set up an
activation record, like every other function that calls anything.

Then for the call to ``runtime.gopanic()``: this function takes a
single argument of type ``interface{}``. So the caller that invokes
``panic()`` must create an interface value with whatever object/value
it wants to use as exception.

- In ``Panic2()`` the regular vtable for ``interface{}`` is used and
  the address of ``x`` is passed as interface value.

- In ``Panic1()`` the Go compiler uses another special-case
  optimization: ``interface{}(nil)`` is implemented using a zero
  vtable.

So really, from the perspective of generated code, using ``panic()``
in the body of a function looks very much like any other function
call, except it is actually simpler: the compiler knows that
``runtime.gopanic()`` does not return and thus does not need to
generate instructions to return to the caller after the call to
``gopanic``.

Finally, if the function needs to create/allocate an object to throw
as exception, the code to prepare this object (initialization,
allocation, etc.) will be added just as usual.

Exceptions for intermediate functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The code generated for a function that calls another function that
*may* throw an exception does not handle anything specially: it sets
up an activation record and prepares the frame pointer as usual. This
price of setting up the frame pointer is paid any time another
function is called, irrespective of whether it will throw an exception
or not.

Therefore, exception propagation in Go is cheaper than the testing and
propagation of ``error`` results.
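As an illustration of why this is the case, here is a minimal sketch
of such an intermediate function (``mayPanic`` is a hypothetical
callee, not taken from the original article):

.. code-block:: go

   package src

   // mayPanic is a hypothetical callee that may throw an exception.
   //go:noinline
   func mayPanic(x int) int {
      if x < 0 {
         panic("negative input")
      }
      return 2 * x
   }

   // Intermediate2 contains no code related to panics at all: it only
   // pays the usual cost of having a callee (stack size check, frame
   // pointer), whether or not an exception ever propagates through it.
   func Intermediate2(x int) int {
      return mayPanic(x) + 1
   }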
Catching exceptions: ``defer`` + ``recover``
--------------------------------------------

As of Go 1.10 the language does not provide a simple-to-use control
structure like ``try``-``catch``. Instead, it provides a special
pseudo-function called ``recover()``.

When the author of a function ``foo()`` wishes to catch an exception
generated in ``foo()`` or one of its callees, the code must be
structured as follows:

- a separate function or closure (other than ``foo``) must contain a
  call to ``recover()``;
- a call to that separate function must be deferred from ``foo``.

Low-level mechanism
~~~~~~~~~~~~~~~~~~~

We can look at the mechanism by compiling the following:

.. code-block:: go

   func Recovering(r *int) {
      // The pseudo-function recover() returns nil by default, except when
      // called in a deferred activation, in which case it catches the
      // exception object, stops stack unwinding and returns the exception
      // object as its return value.
      if recover() != nil {
         *r = 123
      }
   }

   func TryCatch() (res int) {
      defer Recovering(&res)

      // call a function that may throw an exception.
      f()

      // Regular path: return -1
      res = -1
      return
   }

In this example, the ``TryCatch`` function is compiled like the
functions ``Defer1``/``Defer2`` of the previous section, so it is not
detailed further. The interesting part is ``Recovering``:

.. code-block:: asm

   Recovering:
     ; [... stack size check, SP and BP set up elided ...]
     ; Call runtime.gorecover(), giving it the address of Recovering's
     ; activation record as argument.
     0x48208d  488d442428      LEAQ 0x28(SP), AX
     0x482092  48890424        MOVQ AX, 0(SP)
     0x482096  e8653efaff      CALL runtime.gorecover(SB)
     ; Check the return value.
     0x48209b  488b442408      MOVQ 0x8(SP), AX
     0x4820a0  4885c0          TESTQ AX, AX   ; is it nil?
     0x4820a3  740c            JE 0x4820b1    ; yes, go to the return path below.
     ; Retrieve the argument r
     0x4820a5  488b442428      MOVQ 0x28(SP), AX
     ; Set *r = 123
     0x4820aa  48c7007b000000  MOVQ $0x7b, 0(AX)
     0x4820b1                  ; [ ... restore BP/SP ... ]
     0x4820ba  c3              RET

Because using the pseudo-function ``recover()`` compiles to a function
call, the ``Recovering`` function needs its own activation record,
thus stack size check, frame pointer, etc.

What the ``gorecover()`` function internally does, in turn, is to
check whether an exception propagation is in progress. If there is, it
stops the propagation and returns the panic object. If there is not,
it simply returns ``nil``.

(To “stop the propagation” it sets a flag in the panic object /
goroutine struct. This is subsequently picked up by the unwind
mechanism when the deferred function terminates. See the source code
in ``src/runtime/panic.go`` for details.)

Cost of ``defer`` + ``recover``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A function that wishes to catch an exception needs to ``defer`` the
other function that will actually do the catch. This incurs the cost
of ``defer`` always, even when the exception does not occur:

- setting up an activation record (checking the stack size, adjusting
  the stack pointer, setting the frame pointer, etc.) because there
  will be a call in any case.
- setting up the deferred call in the goroutine header struct at the
  beginning.
- performing the deferred call on every return path.

The first cost is only overhead if the function catching the exception
did not otherwise contain function calls and could have avoided
allocating an activation record. For example, a “small” function that
merely accesses some existing structs and may only panic due to
e.g. a nil pointer dereference, would see that cost as overhead.
The other two costs are relatively low:

- setting up the deferred call does not need heap allocation. Its code
  path is relatively short. The main price it pays is accessing the
  goroutine header struct and a couple of conditional branches.

- running the deferred calls however incurs:

  - the price of walking the deferred call list (a few memory accesses
    but no conditional branch, so fairly innocuous);
  - running the body of the actual deferred functions. This in turn
    costs the overhead of setting up and tearing down *their*
    activation record (because they're probably calling other
    functions, e.g. when they use ``recover``) even when there is no
    exception to catch.

We will look at empirical measurements in a separate article.

An interesting question: ``error`` vs ``panic``?
------------------------------------------------

What is cheaper: handling exceptions via ``panic`` / ``recover``, or
passing and testing error results with ``if err := ...; err != nil {
return err }``?

The analysis above so far reveals:

- at the point an exception/error is generated:

  - in both cases the function that generates an exception/error
    usually needs an activation record (stack size check, etc.)

    - for ``panic``, because of the call to ``runtime.gopanic``;
    - for both errors and panics, because of the call to
      ``fmt.Errorf`` or ``fmt.Sprintf`` to create a contextual error
      object.

  - in the specific case of a function that does not call other
    functions, and only returns pre-defined error objects, using
    ``panic`` will incur an activation record whereas the function
    with an error return will not need it.

  - throwing an exception with ``panic`` usually results in less code,
    because the compiler does not generate a return path.

  To summarise, overall, the two approaches for the function(s) where
  exceptions/errors occur have similar costs.

- in “leaf” functions that never produce exceptions/errors but must
  implement an interface type where other implementors of the
  interface *may* produce exceptions/errors, handling
  exceptions/errors with ``panic`` is *always cheaper*. This is
  because the leaf function will contain neither ``panic`` nor the
  initialization of the extra ``nil`` return value.

- at the point an exception/error is propagated without change, the
  ``panic``-based handling is *always cheaper*:

  - it moves fewer result values around from callee to caller;
  - it does not contain a test of the error return and the
    accompanying conditional branch;
  - there is less code overall, so less pressure on the I-cache.

- at the point an exception/error is caught and conditionally handled,
  then the ``panic``-based handling is *always more expensive* because
  it incurs the cost of ``defer`` and an extra activation record (for
  the deferred closure/function) which the error-based handling does
  not require.

So in short, this is not a clear-cut case: ``panic``-based exception
handling is nearly always cheaper for the tree of callees, but more
expensive for the code that catches the exception.

Using ``panic`` over error returns is thus only advantageous if there
is enough computation in the call tree to offset the cost of setting
up the catch environment. This is true in particular:

- when the call tree where errors can be generated is guaranteed to
  always be deep/complex enough that the savings of the
  ``panic``-based handling will be noticeable.

- when the call tree is invoked multiple times and the catch
  environment can be set up just once for all the calls, as in the
  sketch below.
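Here is a hedged sketch of that pattern (not from the original
article; the names are illustrative): the catch environment is set up
once at the boundary, and the whole call tree below it signals
failures with ``panic``:

.. code-block:: go

   package src

   import "fmt"

   // work stands in for a deep call tree; on invalid input it panics
   // instead of returning an error, so none of the intermediate frames
   // carry error-checking code or extra error return values.
   func work(depth int) int {
      if depth < 0 {
         panic("invalid depth")
      }
      if depth == 0 {
         return 0
      }
      return 1 + work(depth-1)
   }

   // Run sets up the catch environment exactly once for the whole tree:
   // the deferred closure converts a panic back into an error at the
   // boundary, so callers keep the idiomatic error-based API.
   func Run(depth int) (n int, err error) {
      defer func() {
         if r := recover(); r != nil {
            err = fmt.Errorf("computation failed: %v", r)
         }
      }()
      return work(depth), nil
   }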
I aim to complement this article with a later experiment to verify
this hypothesis empirically.

Differences with ``gccgo``
--------------------------

.. important:: (Erratum as of January 2020): The following
   observations were made in 2018 with all optimizations disabled.
   This is unfair to ``gccgo``, as GCC's optimization engine is quite
   capable. Once enabled with ``-O``, ``gccgo`` is in fact able to use
   registers to pass arguments. I might revisit this topic in a
   followup article.

The GNU Compiler Collection now also contains a Go compiler, called
``gccgo``. In contrast to ``6g`` (the original Go compiler) this
compiler tries to mimic the native calling convention. This brings
*potential* performance benefits:

- arguments are passed in registers when possible.
- the first return value is passed in a register.
- callee-save registers are used for temporaries when possible.

However these benefits are not actually realized, because ``gccgo``
(as of GCC 8.2) also has the following problems:

- it disables many standard GCC optimizations, like register reloading
  and (some forms of) temporary variable elimination, and thus causes
  many more spills to memory than necessary.

- because the temporary variables spill to the stack nearly always,
  that means nearly every function needs an activation record (not
  just those that call other functions or have many local variables)
  and thus always needs to check the stack size upfront.

These two limitations together make the code generated by ``gccgo``
unacceptably long and memory-heavy overall. For example, the simple
``FuncAdd`` from the beginning of this document compiles with
``gccgo`` to:

.. code-block:: asm

   FuncAdd:
     3b91: 64 48 3b 24 25 70 00    cmp    %fs:0x70,%rsp   ; is the stack large enough?
     3b98: 00 00
     3b9a: 73 12                   jae    3bae            ; yes, go below
     3b9c: 41 ba 08 00 00 00       mov    $0x8,%r10d      ; call __morestack
     3ba2: 41 bb 00 00 00 00       mov    $0x0,%r11d
     3ba8: e8 36 10 00 00          callq  4be3 <__morestack>
     ; The following `retq` instruction on the return path to
     ; __morestack is not actually executed: `__morestack` is a standard
     ; GCC facility (not specific to Go) which auto-magically
     ; returns to the *next* instruction after its return address.
     3bad: c3                      retq
     ; Main function body.
     ; Start by preparing the frame pointer.
     3bae: 55                      push   %rbp
     3baf: 48 89 e5                mov    %rsp,%rbp
     ; Store the arguments x, y, z into temporaries on the stack.
     3bb2: 48 89 7d e8             mov    %rdi,-0x18(%rbp)
     3bb6: 48 89 75 e0             mov    %rsi,-0x20(%rbp)
     3bba: 48 89 55 d8             mov    %rdx,-0x28(%rbp)
     ; Store zero (the default value) into a temporary variable
     ; holding the return value at BP-8.
     3bbe: 48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
     3bc5: 00
     ; Re-load the arguments x and y from the stack.
     3bc6: 48 8b 55 e8             mov    -0x18(%rbp),%rdx
     3bca: 48 8b 45 e0             mov    -0x20(%rbp),%rax
     ; Compute x + y.
     3bce: 48 01 d0                add    %rdx,%rax
     ; Re-load z from the stack and compute x + y - z.
     3bd1: 48 2b 45 d8             sub    -0x28(%rbp),%rax
     ; Store the result value into the temporary variable
     ; for the return value.
     3bd5: 48 89 45 f8             mov    %rax,-0x8(%rbp)
     ; Re-load the return value from the temporary variable into
     ; a register.
     3bd9: 48 8b 45 f8             mov    -0x8(%rbp),%rax
     ; Restore the frame pointer, return.
     3bdd: 5d                      pop    %rbp
     3bde: c3                      retq

This is very sad. GCC for other languages than Go is perfectly able to
eliminate temporary variables. The following code would be just as
correct:

.. code-block:: asm

   FuncAdd:
     leaq (%rdi,%rsi), %rax
     subq %rdx, %rax
     retq

(Disclaimer: these limitations may be lifted in a later version of
``gccgo``.)
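For completeness, a minimal sketch of how one might reproduce this
comparison (the file name is mine; exact output will vary with the
compiler versions used):

.. code-block:: go

   // funcadd.go: compile with both toolchains and compare the
   // generated assembly, for example with:
   //
   //    go tool compile -S funcadd.go   // the standard Go compiler
   //    gccgo -S funcadd.go             // gccgo, optimizations disabled
   //    gccgo -O -S funcadd.go          // gccgo with optimizations (see the erratum above)
   //
   package funcadd

   // FuncAdd is the same example as at the beginning of this article.
   func FuncAdd(x, y, z int) int { return x + y - z }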
Summary of observations
-----------------------

The low-level calling convention used by the Go compiler on x86-64
targets is memory-heavy: arguments and return values are always passed
on the stack. This can be contrasted with code generation by compilers
for other languages (C/C++, Rust, etc) where registers are used when
possible for arguments and return values.

The Go compiler uses dynamic registration (with a linked list of frame
pointers) to prepare activation records for stack unwinding. This
incurs a stack setup overhead on any function that calls other
functions, even in the common case where stack unwinding does not
occur. This can be contrasted with other languages that consider
exceptions uncommon and implement table-driven unwinding, with no
stack setup overhead on the common path.

Arguments and return values incur the standard memory costs of data
types in Go. Scalar and struct types passed by value occupy their size
on the stack. String and interface values use two words, slices use
three.

Because ``error`` is an interface type, it occupies two words.
Building an ``error`` value to return is usually more expensive than
other values because in most cases this incurs a call to a
vararg-accepting function (e.g. ``fmt.Errorf``). The call sequence for
vararg-accepting functions is the same as for functions accepting
slices as arguments, but the caller must also prepare the slice's
contents on the stack to contain (a copy of) the argument values.

Go implements ``defer``, a feature similar to ``finally`` in other
languages. This is done by registering a callback in the current
lightweight thread (“goroutine”) at the beginning and executing the
registered callbacks on every return path. This mechanism does not
require heap allocation but incurs a small overhead on the control
path.

Exceptions are thrown with ``panic()`` and caught with ``defer`` and
``recover()``. Throwing the panic compiles down to a regular call to
an internal function of the run-time system. That internal function is
then responsible for stack unwinding. The compiler knows that
``panic()`` does not return and thus skips generating code for a
return path. The mechanism to catch exceptions is fully hidden inside
the pseudo-function ``recover()`` and does not require special
handling by the code generator. Code generation makes no distinction
between functions that may throw exceptions and those that are
guaranteed never to throw.

The calling convention suggests there is a non-trivial trade-off
between handling exceptional situations with ``panic`` vs. using
``error`` return values and checking them at every intermediate step
of a call stack. This trade-off remains to be analyzed empirically in
particular applications.

.. The GCC-based ``gccgo`` compiler attempts to use a completely
.. different, potentially more efficient register-based calling
.. convention. Sadly, it fails to generate more efficient code overall
.. because it does not eliminate temporary variables on the stack, like
.. GCC does for other languages and the original Go compiler does for Go.

Next in the series:

- `Measuring argument passing in Go and C++`__

.. __: https://dr-knz.net/measuring-argument-passing-in-go-and-cpp.html

- `Measuring multiple return values in Go and C++`__

.. __: https://dr-knz.net/measuring-multiple-return-values-in-go-and-cpp.html

- `Measuring errors vs. exceptions in Go and C++`__

..
__: https://dr-knz.net/measuring-errors-vs-exceptions-in-go-and-cpp.html - `The Go low-level calling convention on x86-64 - New in 2020 and Go 1.15`__ .. __: https://dr-knz.net/go-calling-convention-x86-64-2020.html - `Errors vs exceptions in Go and C++ in 2020`__ .. __: https://dr-knz.net/go-errors-vs-exceptions-2020.html Further reading --------------- - JL Schilling — `Optimizing away C++ exception handling`__ — ACM SIGPLAN Notices, 1998 .. __: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.8337&rep=rep1&type=pdf - David Chase — `proposal: cmd/compile: define register-based calling convention`__, January 2017, last accessed July 2018 .. __: https://github.com/golang/go/issues/18597 - Steven Huang — `Golang Calling Convention`__ (Chinese), August 2017, last accessed July 2018 .. __: https://particle.cafe/blog/golang-calling-convention.html - The Go Programming Language — `Effective Go`__, last accessed July 2018 .. __: https://golang.org/doc/effective_go.html - Wikipedia — `x64 Calling Conventions`__, last accessed July 2018 .. __: https://en.wikipedia.org/wiki/X86_calling_conventions - Wikipedia — `Exception handling implementation`__, last accessed July 2018 .. __: https://en.wikipedia.org/wiki/Exception_handling#Exception_handling_implementation - Wikipedia — `Structure of Call Stacks`__, last accessed July 2018 .. __: https://en.wikipedia.org/wiki/Call_stack#Structure