Ruurd
Ruurd

Jun 26, 2014 6 min read

Compile time marshalling

In one of my posts about managed/unmanaged interop in C# (P/Invoke), I left you with the promise of answering a few questions, namely: can we manually create our own marshalling stubs in C# (at compile time), and can they be faster than the runtime generated ones ?

A bit of background

It’s funny that when I raised these questions back in March, I was still unaware of .NET Native and ASP vNext which were announced by Microsoft in the following months. The main idea behind these initiatives is to speed up especially the startup time of .NET code on resource constrained systems (mobile, cloud). For instance, while traditionally on desktop systems intermediate language (IL) in .NET assemblies is compiled to machine code at runtime by the Just-In-Time Compiler (JIT), .NET Native moves this step to compile time. While this has several advantages, a direct consequence of the lack of runtime IL compilation is that we can’t generate and run IL code on the fly anymore. Even though not much user code uses this, the framework itself critically depends on this feature for interop marshalling stub generation. Since it is no longer available in .NET Native, this phase had to be moved to compile time as well. In fact, this step - called Marshalling and Code Generation (MCG) is one of the elements of the .NET Native toolchain. By the way, .NET Native isn’t the first project which has done compile time marshalling. For example, it has been used for a long time in the DXSharp project.

The basic concepts are always the same: generate code which marshals the input arguments and return values, and wrap it around a calli IL instruction. Since the C# compiler will never emit a calli instruction, this actual call will always have to be implemented in IL directly (or the compiler will have to be extended, recently possible with Roslyn). Where the desktop .NET runtime (CLR) emits the whole marshalling stub in IL, the MCG generated code is C# so it requires a seperate call to an IL method with the calli implementation. If you drill down far enough in the generated sources for a .NET Native project, in the end you’ll find something like this (all other classes/methods omitted for brevity):

internal unsafe static partial class Interop
{
    private static partial class McgNative
    {
        internal static partial class Intrinsics
        {
            internal static T StdCall(IntPtr pfn, void* arg0, int arg1)
            {
                // This method is implemented elsewhere in the toolchain
                return default(T);
            }
        }
    }
}

Note the giveaway comment ’this method is implemented elsewhere in the toolchain’, which you can read as ’this is as far as we can go with C#’, and which indicates that some other tool in the .NET Native chain will emit the real body for the method.

DIY compile time marshalling

So what would the .NET Native ‘implemented elsewhere’ source look like, or: how can we do our own marshalling ? To call a native function which expects an integer argument (like the Sleep function I used in previous posts), first we would need to create an IL calli implementation which takes the address of the native callsite  and the integer argument:

.assembly extern mscorlib {}
.assembly CalliImpl { .ver 0:0:0:0 }
.module CalliImpl.dll

.class public CalliHelpers
{
    .method public static void Action_uint32(native int, unsigned int32) cil managed
    {
        ldarg.1
        ldarg.0
        calli unmanaged stdcall void(int32)
        ret
    }
}

If we feed it the address of the Sleep function in kernel32 (using LoadLibrary and GetProcAddress, which we ironically invoke through P/Invoke…), we can see the CalliHelper method on the managed stack instead of the familiar DomainBoundILStubClass. In other words, compile time marshalling in action:

Child SP IP Call Site
00f2f264 77a9d4bc [InlinedCallFrame: 00f2f264]
00f2f260 010b03e4 CalliHelpers.Action_uint32(IntPtr, UInt32)
00f2f290 010b013b TestPInvoke.Program.Main(System.String[])
00f2f428 63c92652 [GCFrame: 00f2f428]

This ‘hello world’ example is nice but ideally you would like to use well tested code. Therefore, I wanted to try and leverage the MCG from .NET Native, but it turned out to be a bit more work than I anticipated as you need to somehow inject the actual IL calli stubs to make the calls work. So perhaps in a future blog.

What about C++ interop ?

There seems to be a lot of confusion around this type of interop: some claim it to be faster, some slower. In reality it can be both depending on what you do. The C++ compiler understands both types of code (native and managed), and with it comes its main selling point: not speed but type safety. Where in C# the developer has to provide the P/Invoke signature, including calling convention and marshalling of the arguments and return values, the C++ compiler knows this already from the native header files. Therefore, in C++/CLI you simply include the header and if necessary (you are in a managed section) the compiler does the P/Invoke for you implicitly.

#include

using namespace System;

int main(array ^args)
{
    Console::WriteLine(L"Press any key...");
    while (!Console::KeyAvailable)
    {
        Sleep(500);
    }
    return 0;
}

Sleep is an unmanaged function included from Windows.h, and invoked from a managed code body. From the managed stack in WinDbg you can see how it works:

00e3f16c 00fa2065 DomainBoundILStubClass.IL_STUB_PInvoke(UInt32)
00e3f170 00fa1fcc [InlinedCallFrame: 00e3f170] .Sleep(UInt32)
00e3f1b4 00fa1fcc .main(System.String[])
00e3f1c8 00fa1cff .mainCRTStartupStrArray(System.String[])

As you can see, there is again a marshalling stub, as in C#, it is however generated without developer intervention. This alone should be reason enough to use C++/CLI in heavy interop scenarios, but there are more advantages. For instance, the C++ compiler can optimize away multiple dependent calls across the interop boundary, making the whole thing faster, or can P/Invoke to native C++ class instance functions, something entirely impossible in C#. It moreover allows you to apart from depending on external native code, create ‘mixed mode’ or IJW (It Just Works) assemblies which contain native code as well as the usual managed code in a self contained unit. Despite all this, the P/Invoke offered by C++/CLI still leverages the runtime stub generation mechanism, and therefore, it’s not intrinsically faster than explicit P/Invoke.

Word of warning

Let me end with this: the aim of this post is to offer an insight in the black box called interop, not as a promotion for DIY marshalling. If you find yourself in need of creating your own (compile time) marshalling stubs for faster interop, chances are you are doing something wrong. Especially for enterprise/web development it’s not very likely the interop itself is the bottleneck. Therefore, focussing on improving the interop scenario yourself - instead of letting the .NET framework team worry about it - is very, very likely a case of premature optimization. However, for game/datacenter/scientific scenarios, you can end up in situations where you want to use every CPU cycle efficiently, and perhaps after reading this post you’ll have a better idea of where to look.