
Compile time marshalling

In one of my posts about managed/unmanaged interop in C# (P/Invoke), I left you with the promise of answering a few questions, namely: can we manually create our own marshalling stubs in C# (at compile time), and can they be faster than the runtime generated ones ?

A bit of background

It’s funny that when I raised these questions back in March, I was still unaware of .NET Native and ASP.NET vNext, which were announced by Microsoft in the following months. The main idea behind these initiatives is to speed up .NET code, especially its startup time, on resource constrained systems (mobile, cloud).
For instance, while traditionally on desktop systems intermediate language (IL) in .NET assemblies is compiled to machine code at runtime by the Just-In-Time Compiler (JIT), .NET Native moves this step to compile time. While this has several advantages, a direct consequence of the lack of runtime IL compilation is that we can’t generate and run IL code on the fly anymore. Even though not much user code uses this, the framework itself critically depends on this feature for interop marshalling stub generation. Since it is no longer available in .NET Native, this phase had to be moved to compile time as well. In fact, this step, called Marshalling and Code Generation (MCG), is one of the elements of the .NET Native toolchain. By the way, .NET Native isn’t the first project to do compile time marshalling. For example, it has been used for a long time in the SharpDX project.

The basic concepts are always the same: generate code which marshals the input arguments and return values, and wrap it around a calli IL instruction. Since the C# compiler (at the time of writing) never emits a calli instruction, the actual call always has to be implemented in IL directly (or the compiler has to be extended, which recently became possible with Roslyn). Where the desktop .NET runtime (CLR) emits the whole marshalling stub in IL, the MCG generated code is C#, so it requires a separate call to an IL method with the calli implementation. If you drill down far enough in the generated sources for a .NET Native project, in the end you’ll find something like this (all other classes/methods omitted for brevity):

internal unsafe static partial class Interop
{
    private static partial class McgNative
    {
        internal static partial class Intrinsics
        {
            internal static T StdCall<T>(IntPtr pfn, void* arg0, int arg1)
            {
                // This method is implemented elsewhere in the toolchain
                return default(T);
            }
        }
    }
}

Note the giveaway comment ‘this method is implemented elsewhere in the toolchain’, which you can read as ‘this is as far as we can go with C#’, and which indicates that some other tool in the .NET Native chain will emit the real body for the method.

DIY compile time marshalling

So what would the .NET Native ‘implemented elsewhere’ source look like, or: how can we do our own marshalling ? To call a native function which expects an integer argument (like the Sleep function I used in previous posts), we would first need to create an IL calli implementation which takes the address of the native callsite and the integer argument:

.assembly extern mscorlib {}
.assembly CalliImpl { .ver 0:0:0:0 }
.module CalliImpl.dll

.class public CalliHelpers
{
    .method public static void Action_uint32(native int, unsigned int32) cil managed
    {
        ldarg.1
        ldarg.0
        calli unmanaged stdcall void(int32)
        ret
    }
}

If we feed it the address of the Sleep function in kernel32 (using LoadLibrary and GetProcAddress, which we ironically invoke through P/Invoke…), we can see the CalliHelpers.Action_uint32 method on the managed stack instead of the familiar DomainBoundILStubClass. In other words, compile time marshalling in action:

Child SP IP Call Site
00f2f264 77a9d4bc [InlinedCallFrame: 00f2f264]
00f2f260 010b03e4 CalliHelpers.Action_uint32(IntPtr, UInt32)
00f2f290 010b013b TestPInvoke.Program.Main(System.String[])
00f2f428 63c92652 [GCFrame: 00f2f428]

This ‘hello world’ example is nice but ideally you would like to use well tested code. Therefore, I wanted to try and leverage the MCG from .NET Native, but it turned out to be a bit more work than I anticipated as you need to somehow inject the actual IL calli stubs to make the calls work. So perhaps in a future blog.
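As an aside that postdates this post: C# 9 added function pointer types (delegate* unmanaged), and invoking one is compiled to a calli instruction, so the same Sleep call can nowadays be sketched without a hand-written IL helper. What follows is a minimal sketch under a few assumptions: a Windows machine, a project compiled with unsafe code enabled (AllowUnsafeBlocks), and the LoadLibrary/GetProcAddress declarations shown here, which are my own and not taken from the original test program.

```csharp
using System;
using System.Runtime.InteropServices;

static unsafe class Program
{
    // Assumed declarations for illustration; resolved through ordinary P/Invoke.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr LoadLibrary(string lpFileName);

    [DllImport("kernel32.dll", CharSet = CharSet.Ansi, ExactSpelling = true, SetLastError = true)]
    static extern IntPtr GetProcAddress(IntPtr hModule, string lpProcName);

    static void Main()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            Console.WriteLine("kernel32.dll is only available on Windows.");
            return;
        }

        // Look up the native address of Sleep, as described in the text.
        IntPtr kernel32 = LoadLibrary("kernel32.dll");
        IntPtr sleepAddress = kernel32 == IntPtr.Zero
            ? IntPtr.Zero
            : GetProcAddress(kernel32, "Sleep");
        if (sleepAddress == IntPtr.Zero)
            throw new InvalidOperationException("Could not resolve kernel32!Sleep.");

        // Treat the address as a typed unmanaged function pointer.
        // The invocation below is what the compiler turns into calli.
        var sleep = (delegate* unmanaged[Stdcall]<uint, void>)sleepAddress;
        sleep(500);

        Console.WriteLine("Returned from Sleep(500).");
    }
}
```

The argument here is a plain uint, so there is nothing to marshal. For strings, structs with non-blittable fields and the like, the conversion code that the runtime stub would normally provide has to be written by hand around the call.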

What about C++ interop ?

There seems to be a lot of confusion around this type of interop: some claim it to be faster, some slower. In reality it can be both depending on what you do. The C++ compiler understands both types of code (native and managed), and with it comes its main selling point: not speed but type safety. Where in C# the developer has to provide the P/Invoke signature, including calling convention and marshalling of the arguments and return values, the C++ compiler knows this already from the native header files. Therefore, in C++/CLI you simply include the header and if necessary (you are in a managed section) the compiler does the P/Invoke for you implicitly.

#include <windows.h>

using namespace System;

int main(array<System::String ^> ^args)
{
    Console::WriteLine(L"Press any key...");
    while (!Console::KeyAvailable)
    {
        Sleep(500);
    }
    return 0;
}

Sleep is an unmanaged function included from Windows.h, and invoked from a managed code body. From the managed stack in WinDbg you can see how it works:

00e3f16c 00fa2065 DomainBoundILStubClass.IL_STUB_PInvoke(UInt32)
00e3f170 00fa1fcc [InlinedCallFrame: 00e3f170] .Sleep(UInt32)
00e3f1b4 00fa1fcc .main(System.String[])
00e3f1c8 00fa1cff .mainCRTStartupStrArray(System.String[])

As you can see, there is again a marshalling stub, as in C#. It is, however, generated without developer intervention. This alone should be reason enough to use C++/CLI in heavy interop scenarios, but there are more advantages. For instance, the C++ compiler can optimize away multiple dependent calls across the interop boundary, making the whole thing faster, or can P/Invoke to native C++ class instance functions, something entirely impossible in C#. Moreover, apart from depending on external native code, it allows you to create ‘mixed mode’ or IJW (It Just Works) assemblies, which contain native code as well as the usual managed code in a self contained unit.
Despite all this, the P/Invoke offered by C++/CLI still leverages the runtime stub generation mechanism, and therefore, it’s not intrinsically faster than explicit P/Invoke.

Word of warning

Let me end with this: the aim of this post is to offer insight into the black box called interop, not to promote DIY marshalling. If you find yourself in need of creating your own (compile time) marshalling stubs for faster interop, chances are you are doing something wrong. Especially for enterprise/web development it’s not very likely the interop itself is the bottleneck. Therefore, focussing on improving the interop scenario yourself, instead of letting the .NET framework team worry about it, is very, very likely a case of premature optimization. However, for game/datacenter/scientific scenarios, you can end up in situations where you want to use every CPU cycle efficiently, and perhaps after reading this post you’ll have a better idea of where to look.


PInvoke: beyond the magic

Ever run into problems passing data between unmanaged code and managed code ? Or just curious what really happens when you slap that [DllImport] on a method ? This post is for you: below I’ll shine some light inside the black box that’s called Platform Invoke.

Let’s start with a very minimal console app that has a call to an unmanaged Win32 function:

using System;
using System.Runtime.InteropServices;

namespace TestPInvoke
{
    class Program
    {
        [DllImport("kernel32.dll")]
        static extern void Sleep(uint dwMilliseconds);

        static void Main(string[] args)
        {
            Console.WriteLine("Press any key...");

            while (!Console.KeyAvailable)
            {
                Sleep(1000);
            }
        }
    }
}

Nothing exciting going on there: just the console polling for a keypress, and sleeping the thread for 1 second after every poll. The important thing of course is the way in which we sleep the thread, which is with PInvoke instead of using the usual mscorlib System.Threading.Thread.Sleep(Int32).

Now let’s run it under WinDbg + SOS, and see if we can find out what happens. The managed stack while sleeping looks like this:

Child SP IP       Call Site
00ebee24 0108013d DomainBoundILStubClass.IL_STUB_PInvoke(UInt32)
00ebee28 0108008e [InlinedCallFrame: 00ebee28] TestPInvoke.Program.Sleep(UInt32)
00ebee6c 0108008e TestPInvoke.Program.Main(System.String[])

On the bottom is the entrypoint. The next frame on the stack is just an information frame telling us the call to Program.Sleep was inlined in Main (notice the same IP). The next frame is more interesting: as the last frame on the managed stack this must be our marshalling stub.

We can dump the MethodDescriptor of the Program.Main and DomainBoundILStubClass.IL_STUB_PInvoke methods for comparison, which gives us:

0:000> !IP2MD 0108008e
MethodDesc: 00fc37c8
Method Name: TestPInvoke.Program.Main(System.String[])
Class: 00fc12a8
MethodTable: 00fc37e4
mdToken: 06000002
Module: 00fc2ed4
IsJitted: yes
CodeAddr: 01080050

and

0:000> !IP2MD 0108013d
MethodDesc: 00fc38f0
Method Name: DomainBoundILStubClass.IL_STUB_PInvoke(UInt32)
Class: 00fc385c
MethodTable: 00fc38b0
mdToken: 06000000
Module: 00fc2ed4
IsJitted: yes
CodeAddr: 010800c0

This tells us both methods are originally IL code, and they are JIT compiled. For the Main method we knew this of course, and for the PInvoke stub it can’t be a surprise either given the class and method names. So let’s dump out the IL:

0:000> !DumpIL 00fc37c8
IL_0001: ldstr "Press any key..."
IL_0006: call System.Console::WriteLine
IL_000c: br.s IL_001b
IL_000f: ldc.i4 1000
IL_0014: call TestPInvoke.Program::Sleep
IL_001b: call System.Console::get_KeyAvailable
IL_0020: ldc.i4.0
IL_0021: ceq
IL_0025: brtrue.s IL_010e
IL_0027: ret

No surprises there. Next the stub:

0:000> !DumpIL 00fc38f0
error decoding IL

OK, that’s weird. The metadata tells us we have an IL compiled method, the JITted code is there:

0:000> !u 010800c0
Normal JIT generated code
DomainBoundILStubClass.IL_STUB_PInvoke(UInt32)
(actual code left out)

but where is the IL body?

In fact, it turns out since .NET v4.0, all interop stubs are generated at runtime in IL and JIT compiled for the relevant architecture. Note this runtime IL has a clear difference with the IL emitted in runtime assemblies (for instance the ones generated for XML serialization), as the interop stubs aren’t contained in a runtime generated assembly or module. Instead, the module token is spoofed to be identical to the calling frame’s module (you can check this above). Likewise, there is only runtime data for these methods, and looking up its class info gives:

!DumpClass 00fc385c
Class Name: 
mdToken: 02000000
File: C:\dev\voidcall\Profiler\ProfilerNext\TestPInvoke\TestPInvoke\bin\Debug\TestPInvoke.exe
Parent Class: 00000000
Module: 00fc2ed4
Method Table: 00fc38b0
Total Method Slots: 0
Class Attributes: 101
Transparency: Critical

This containing class – DomainBoundILStubClass – is some weird thing as well: it doesn’t inherit anything (not even System.Object), the name isn’t filled in, and there are no method slots, even though we know there is at least one method in this class, namely the one we just followed to get to it. So probably this class is just a construct for keeping integrity in the CLR internal data structures.

So there really seems to be no good way to get the IL of those stubs. The CLR team realized this as well and decided to publish the generated IL as ETW events. The ILStub Diagnostics tool can be used to intercept them. If we do this for our test program we see the following (formatted for readability):

// Managed Signature: void(uint32)
// Native Signature: unmanaged stdcall void(int32)
.maxstack 3
.locals (int32 A,int32 B)
// Initialize
    call native int [mscorlib] System.StubHelpers.StubHelpers::GetStubContext()
    call void [mscorlib] System.StubHelpers.StubHelpers::DemandPermission(native int)
// Marshal
    ldc.i4 0x0
    stloc.0    
IL_0010:
    ldarg.0
    stloc.1    
// CallMethod
    ldloc.1
    call native int [mscorlib] System.StubHelpers.StubHelpers::GetStubContext()
    ldc.i4 0x14
    add
    ldind.i
    ldind.i
    calli unmanaged stdcall void(int32) //actual unmanaged method call
// Unmarshal (nothing in this case)
// Return
    ret

The (un)marshalling isn’t very interesting in this case (int32 in and nothing out). To make it more clear for those who don’t use IL daily, I used ILAsm to compile this method body into a dll and used ILSpy to view it in decompiled C#:

static void ILStub_PInvoke(int A)
{
    //initialize
    StubHelpers.DemandPermission(StubHelpers.GetStubContext());
    //CallMethod
    calli(void(int32), A, *(*(StubHelpers.GetStubContext() + 20))); //not actual C#, but more readable anyway
}

The call to the unmanaged method is done with a calli instruction, which is a strongly typed call to an unmanaged callsite. The first parameter (not on the stack but encoded in IL) is the signature of the callsite [void(int32)], followed by (on the stack) the argument (in this case A), and finally the unmanaged function pointer (which is reached through the pointer stored at offset 20 of the context returned from StubHelpers.GetStubContext()).

So what magic takes place in StubHelpers.GetStubContext() ?

The answer will come naturally if we take for example a simple program that has 2 PInvoke methods with the same input and output arguments:

[DllImport("kernel32.dll")]
static extern void ExitThread(uint dwExitCode);

[DllImport("kernel32.dll")]
static extern void Sleep(uint dwMilliseconds);

If I let the CLR generate an IL stub for both methods, I have exactly the same input and output marshalling, and even the unmanaged function call signature (not address) is the same.

That seems a bit of a waste, so how could one optimize this ?

Indeed, we would save on basically everything we care about (RAM, JIT compilation) by just generating one IL stub for every unique input+output argument signature, and injecting that stub with the unmanaged address it needs to call.

This is exactly how it works: when the CLR encounters a PInvoke method, it pushes a frame on the stack (InlinedCallFrame) with info about – among other things – the unmanaged function address just before calling the actual IL stub.

The stub in turn requests this information through StubHelpers.GetStubContext() (aka ‘gimme my callframe’), and calls into the unmanaged function.
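The sharing scheme can be modelled in a few lines of ordinary C#. This is only a conceptual sketch: the StubContext and StubCache types and the signature string are hypothetical names invented for illustration, a managed delegate stands in for the native function address, and nothing here reflects the CLR’s actual data structures.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-method context: what differs between two P/Invoke methods
// that share a signature. The delegate stands in for the native address.
sealed class StubContext
{
    public StubContext(string name, Action<uint> target)
    {
        Name = name;
        Target = target;
    }

    public string Name { get; }
    public Action<uint> Target { get; }
}

// Hypothetical cache: one stub per unique signature.
static class StubCache
{
    static readonly Dictionary<string, Action<StubContext, uint>> Stubs =
        new Dictionary<string, Action<StubContext, uint>>();

    public static int StubsGenerated { get; private set; }

    public static Action<StubContext, uint> GetStub(string signature)
    {
        if (!Stubs.TryGetValue(signature, out var stub))
        {
            // The expensive part (IL generation + JIT in the real runtime)
            // happens only once per signature.
            StubsGenerated++;
            stub = (context, arg) => context.Target(arg);
            Stubs[signature] = stub;
        }
        return stub;
    }
}

static class Program
{
    static void Main()
    {
        var calls = new List<string>();
        var sleep = new StubContext("Sleep", ms => calls.Add($"Sleep({ms})"));
        var exitThread = new StubContext("ExitThread", code => calls.Add($"ExitThread({code})"));

        // Same signature, so both lookups return the same stub;
        // only the context passed in differs.
        var stubA = StubCache.GetStub("void(uint32)");
        var stubB = StubCache.GetStub("void(uint32)");
        stubA(sleep, 500);
        stubB(exitThread, 0);

        Console.WriteLine(string.Join(", ", calls));                       // Sleep(500), ExitThread(0)
        Console.WriteLine($"Stubs generated: {StubCache.StubsGenerated}"); // Stubs generated: 1
    }
}
```

In the real runtime the context is not an explicit argument. It is handed to the stub out of band, which is what the rest of this post digs into.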

To see this in action, consider the code:

using System;
using System.Runtime.InteropServices;

namespace TestPInvoke
{
    class Program
    {
        [DllImport("kernel32.dll")]
        static extern void Sleep(uint dwMilliseconds);

        [DllImport("kernel32.dll", EntryPoint = "Sleep")]
        static extern void SleepAgain(uint dwMilliseconds);

        static void Main(string[] args)
        {
            Console.WriteLine("Press any key...");

            while (!Console.KeyAvailable)
            {
                Sleep(500);
                SleepAgain(500);
            }
        }
    }
}

I’ll run this from WinDbg+SOS, here’s the disassembly of the calls to Sleep and SleepAgain in main:

mov     ecx,1F4h
call    0042c04c (TestPInvoke.Program.Sleep(UInt32), mdToken: 06000001)
mov     ecx,1F4h
call    0042c058 (TestPInvoke.Program.SleepAgain(UInt32), mdToken: 06000002)

You see the calls to Sleep and SleepAgain are pointing to different addresses. If we dump the unmanaged code at these locations we have:

!u 0042c04c (Sleep)
Unmanaged code
mov     eax,42379Ch
jmp     006100d0 (DomainBoundILStubClass.IL_STUB_PInvoke(UInt32))

!u 0042c058 (SleepAgain)
Unmanaged code
mov     eax,4237C8h
jmp     006100d0 (DomainBoundILStubClass.IL_STUB_PInvoke(UInt32))

Indeed, we see in a few lines that some different value is loaded into eax, before jumping to the same address (the IL stub). Since the value in eax is the only thing separating the two, this must be a pointer to our call frame.

So let’s consider these as memory addresses and check what’s there:

dd 42379Ch
0042379c  63000001 20ea0005 00000000 00192385
004237ac  001925ec 00423808 0042c010 00000000

dd 4237C8h
004237c8  630b0002 20ea0006 00000000 00192385
004237d8  001925ec 00423810 0042c01c 00000000

Now remember the offset in the calli instruction above ? The unmanaged call was to a pointer reference at offset 20 (14h) in our stub context. Or in plain words: take the value at offset 20 in the call frame (00423808 and 00423810 in the two dumps above), and dereference it. This gives us:

00423808 => 7747cf49 (KERNEL32!SleepStub)
00423810 => 7747cf49 (same)
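The same arithmetic can be replayed as a small worked example. The sketch below models the 32-bit process memory as a dictionary from address to DWORD, filled with the values from the dumps above. The names Memory, TargetOffset and ResolveTarget are hypothetical helpers for illustration, not runtime APIs.

```csharp
using System;
using System.Collections.Generic;

static class Program
{
    // Only the DWORDs that matter, copied from the dd output and the
    // dereferenced values shown in the text.
    static readonly Dictionary<uint, uint> Memory = new Dictionary<uint, uint>
    {
        [0x004237b0] = 0x00423808, // Sleep context:      0x0042379c + 0x14
        [0x004237dc] = 0x00423810, // SleepAgain context: 0x004237c8 + 0x14
        [0x00423808] = 0x7747cf49, // KERNEL32!SleepStub
        [0x00423810] = 0x7747cf49, // KERNEL32!SleepStub
    };

    const uint TargetOffset = 0x14; // the "ldc.i4 0x14; add" in the stub

    public static uint ResolveTarget(uint stubContext)
    {
        uint slot = Memory[stubContext + TargetOffset]; // first ldind.i
        return Memory[slot];                            // second ldind.i
    }

    static void Main()
    {
        Console.WriteLine($"0x{ResolveTarget(0x0042379c):X8}"); // 0x7747CF49
        Console.WriteLine($"0x{ResolveTarget(0x004237c8):X8}"); // 0x7747CF49
    }
}
```

Both contexts resolve to the same native address, which is what we expect given that SleepAgain is declared with EntryPoint = "Sleep".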

And there we have it, PInvoke demystified.

In a next post I’ll address the following questions:

  • can we manually create our own marshalling stubs in C# (at compile time) ?
  • can it be faster than the runtime generated one ?
  • what about the reverse case (unmanaged code calling us) ?