Down the rabbit hole: JIT Cache

Some weeks ago I was working on a heavyweight .NET web application and got annoyed by the fact that I needed to wait forever to see changes in my code at work. I don’t have a slow machine, and while the initial compilation of the .NET/C# code was pretty fast, loading the application – dominated by Just-In-Time (JIT) compilation – took forever. It routinely took 30-60 seconds to come up, and most of that time was spent loading images and JITting them.

That the JIT can take time is a well known fact, and there are some ways to alleviate the burden. For one there is the Native Image Generator (NGen), a tool shipped with the .NET framework since v1.1. It allows you to pre-JIT entire assemblies and install them in a system cache for later use. Other more recent developments are the .NET Native toolchain (MSFT) and SharpLang / C# Native (community), which leverage the C++ compiler instead of the JIT compiler to compile .NET (store) apps directly to native images.

However great these techniques are, they are designed with the idea ‘pay once, profit forever’ in mind. A good idea for production releases, but they won’t solve my problem; if I change a single statement in my code and rebuild using these tools, it will increase the time I have to wait instead of reduce it due to the more extensive Common Intermediate Language (CIL) to native compilation.

The idea

An alternative for the scenario described above would be to keep a system global cache of JITted code (methods), only invoking the actual JIT for code that isn’t in the cache yet.
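To make the idea concrete, here is a minimal sketch of such a lookup in portable C++. Everything in it is an assumption for illustration: the `JitCache` class, keying on the raw IL bytes of a method, and the `compile` callback that stands in for the real JIT. It is an in-process toy, not the system global cache described above, and it ignores the hard parts (validity tracking, relocating native code, persistence).

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical cache: maps a method's IL bytes to its compiled native bytes.
class JitCache
{
public:
    using Compiler = std::function<Bytes(const Bytes&)>;

    explicit JitCache(Compiler compile) : compile_(std::move(compile)) {}

    // Returns cached native code, invoking the (expensive) compiler only on a miss.
    const Bytes& GetOrCompile(const Bytes& il)
    {
        auto it = cache_.find(il);
        if (it != cache_.end())
        {
            ++hits_;
            return it->second;
        }
        ++misses_;
        return cache_.emplace(il, compile_(il)).first->second;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    Compiler compile_;
    std::map<Bytes, Bytes> cache_; // a real cache would use a cheaper key
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

Keying on the full IL bytes keeps the sketch free of hash collisions; whether IL alone is a sufficient key for real JIT output is exactly the open question about the cache.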

Requirements

  • Interceptor: a mechanism to hook calls from the Common Language Runtime (CLR) to the JIT compiler. We need this to be able to introduce our own ‘business’ logic (caching mechanism) in this channel.
  • Injector: a mechanism to load the Interceptor into a remote .NET process. We need this to hook every starting .NET program: most JIT compilation is done during startup, so loading the interception code should take place at that time for maximum profit.
  • Cache: the actual business logic. A smart cache keeping track of already JITted code and validity.

Note: in this article I’m going to discuss the first 2 which are the technical challenge. The actual cache will be a topic of a future article, and in all honesty I’m not sure it’s going to work. The awesome involved in the first two items was worth the effort already.

 

Interceptor

In the desktop version of .NET, the CLR and JIT are two separate libraries – clr.dll and clrjit.dll/protojit.dll respectively – which get loaded in every .NET process. I started from the very simple assumption that the CLR calls into some function export of the clrjit library. When I checked out the public exports of clrjit though, there are only 2:

[Image: the public exports of clrjit.dll]

I called getJit and got back something that was likely a pointer to some C++ class, but I was unable to figure out what to do with it or to work out its methods and their required arguments, so my best option was googling for ‘jit’, ‘getJit’, ‘hook’ etc.

I found a brilliant article by Daniel Pistelli, who identified the returned value from clrjit!getJit as an implementation of Rotor’s ICorJitCompiler. The Rotor project (officially: SSCLI) is the closest thing we non-MSFT people have to the sources of the native parts of the .NET ecosystem (runtime, JIT, GC). However, MSFT made it very clear it was only a demonstration project: it wasn’t the actual .NET source. Moreover, the latest release is from 2006. In his article, Daniel found that he could use the Rotor source headers to work with the production .NET version of the JIT: the vtable in the .NET desktop implementation is more extensive, but the first entries are identical.

For the full details I’ll refer you to his article, but once operational this is enough for us to intercept and wrap requests for compilation to the JIT compiler with our own function with the signature:

int __stdcall compileMethod(ULONG_PTR classthis, ICorJitInfo *comp, CORINFO_METHOD_INFO *info, unsigned flags, BYTE **nativeEntry, ULONG  *nativeSizeOfCode)
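The interception itself boils down to swapping one function pointer in a vtable and forwarding to the original. The sketch below shows that pattern with a hand-rolled vtable in portable C++. `FakeJit`, `JitVTable` and the simplified `compileMethod` signature are stand-ins invented for illustration, not the real CLR types; patching the actual table additionally requires making its memory writable, which is covered in Daniel's article.

```cpp
struct FakeJit;

// Simplified stand-in for the first slot of the JIT interface's vtable.
using CompileFn = int (*)(FakeJit* self, const char* methodName);

struct JitVTable
{
    CompileFn compileMethod;
};

struct FakeJit
{
    JitVTable* vtbl; // first member, like a compiler-generated vtable pointer
    int compiled;    // counts how often the "real" JIT ran
};

// The "real" compiler we want to wrap.
static int RealCompile(FakeJit* self, const char* /*methodName*/)
{
    ++self->compiled;
    return 0;
}

static CompileFn g_original = nullptr; // saved original slot
static int g_intercepted = 0;          // number of calls seen by the hook

// Our wrapper: custom logic first (e.g. a cache lookup), then forward.
static int HookedCompile(FakeJit* self, const char* methodName)
{
    ++g_intercepted;
    return g_original(self, methodName);
}

// Save the original entry and redirect the slot to our wrapper.
static void InstallHook(FakeJit* jit)
{
    g_original = jit->vtbl->compileMethod;
    jit->vtbl->compileMethod = &HookedCompile;
}
```

After `InstallHook`, every call made through the table reaches `HookedCompile` first, which is the place where caching logic would go.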

Injector

Most JIT compilation takes place during process start, so to not miss anything, we have to find a way to hook the JIT at a very early stage. There are two methods I explored. For the first, I self-hosted the CLR. Using the unmanaged hosting APIs it’s possible to write a native application in which you have much more control over process execution. For instance, you can first load the runtime, which will automatically load the JIT compiler as well, insert your custom hook next, and only then start executing managed code. This will ensure you don’t miss a bit.

An example trace from a self-hosted CLR trivial console app with JIT hooking:

[Image: JIT hooking log from the self-hosted console app]

Note that the JIT is hooked before managed execution starts.

However, this method has a downside, namely that it will only work for processes we start explicitly using our native loader. Any .NET executable started in the regular way will escape our hooking code. What we really want is to load our hook at process start in every .NET process. For this we need a couple of things:

  1. process start notifications – to detect a new process start
  2. remote code injection – to load our hooking code into the newly started process

It turns out both are possible, but to do so we have to dive into the domains of kernel mode and remote code injection. For fun, try entering those keywords in a search engine and see how many references to ‘black hat’, ‘malicious’, ‘rootkit’, ‘security exploits’ etc. you find. Clearly, the methods I want to use appeal to a whole different kind of audience as well.

Anyway, I still want it so down the rabbit hole we go.

1. Process start notifications

We can register a routine for receiving process start/exit notifications by calling PsSetCreateProcessNotifyRoutine. This function is part of the kernel mode APIs, and to access them we have to write a kernel driver. When you download and install the Windows Driver Kit, which integrates with Visual Studio, you get standard templates for writing drivers, and I strongly advise you to use them. Writing a driver from scratch is not especially hard, but it is very unforgiving: any bug will bugcheck the machine, because your ‘process’ is the kernel itself, so it is better to start from a tested scaffold. I tested the driver in a fresh VM, which showed good foresight, as I fried it a couple of times, leaving it completely unbootable.

Anyway, back to the code. To register the notification routine we have to call PsSetCreateProcessNotifyRoutine during Driver Entry:

HANDLE hNotifyEvent;
PKEVENT NotifyEvent = NULL;
unsigned long lastPid;

VOID NotifyRoutine(_In_ HANDLE parentId, _In_ HANDLE processId, _In_ BOOLEAN Create)
{
    UNREFERENCED_PARAMETER(parentId);

    // Process IDs arrive as HANDLE values; convert them before printing or storing.
    if (Create)
    {
        DbgPrint("Execution detected. PID: %lu\n", HandleToULong(processId));

        if (NotifyEvent != NULL)
        {
            lastPid = HandleToULong(processId);
            KeSetEvent(NotifyEvent, 0, FALSE);
        }
    }
    else
    {
        DbgPrint("Termination detected. PID: %lu\n", HandleToULong(processId));
    }
}

VOID OnUnload(IN PDRIVER_OBJECT DriverObject)
{
    UNREFERENCED_PARAMETER(DriverObject);

    // remove notify callback
    PsSetCreateProcessNotifyRoutine(NotifyRoutine, TRUE);
}

NTSTATUS DriverEntry(_In_ PDRIVER_OBJECT DriverObject, _In_ PUNICODE_STRING RegistryPath)
{
    NTSTATUS status;

    // Create an event object to signal when a process started.
    DECLARE_CONST_UNICODE_STRING(NotifyEventName, L"\\NotifyEvent");
    NotifyEvent = IoCreateSynchronizationEvent((PUNICODE_STRING)&NotifyEventName, &hNotifyEvent);
    if (NotifyEvent == NULL)
        return STATUS_UNSUCCESSFUL;
    KeClearEvent(NotifyEvent);

    // boiler plate code omitted

    status = WdfDriverCreate(DriverObject, RegistryPath, &attributes, &config, WDF_NO_HANDLE);
    if (!NT_SUCCESS(status))
        return status;

    status = PsSetCreateProcessNotifyRoutine(NotifyRoutine, FALSE);
    if (!NT_SUCCESS(status))
        return status;

    DriverObject->DriverUnload = OnUnload; // omitting this will hang the system
    return STATUS_SUCCESS;
}

You can see we also register a kernel event object which will be signaled every time a process start notification is received, as we will need this later.

Once process start notifications were working, I explored ways to also do part 2 (remote code injection) from kernel mode, but decided against it for two reasons. First, the kernel mode APIs, while offering very powerful low level access to the machine and OS, are very limited (you cannot access the regular Win32 APIs), so it’s much easier and faster to develop in user mode. Second, I got bored of restoring yet another fried VM.

1b. Getting notifications to user mode

So I needed an always-running component in user mode which communicates with the kernel mode driver: the ideal use case for a Windows service. By default, kernel device objects aren’t accessible from user mode. To access one you have to expose it in the (kernel) object directory as a ‘DOS device’. Adding a symbolic link like \DosDevices\InterceptorDriver to the actual device object – using WdfDeviceCreateSymbolicLink – is sufficient to access it by name in user mode (full path: \\.\InterceptorDriver).

Just open it like a file:

HANDLE hDevice = CreateFileW(
    L"\\\\.\\InterceptorDriver", // driver to open
    0,                           // no access to driver
    FILE_SHARE_READ | FILE_SHARE_WRITE, // share mode
    NULL,                        // default security attributes
    OPEN_EXISTING,               // disposition
    0,                           // file attributes
    NULL);                       // do not copy file attributes

For the actual communication the preferred way is using IOCTL: in user mode you can send an IO control code to the driver:

DWORD pId;
DWORD junk = 0;
BOOL bResult = DeviceIoControl(hDevice, // device to be queried
               IOCTL_PROCESS_NOTIFYNEW, // operation to perform
               NULL, 0,                 // no input buffer
               &pId, sizeof(pId),       // output buffer
               &junk,                   // # bytes returned
               (LPOVERLAPPED)NULL);     // synchronous I/O
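The definition of IOCTL_PROCESS_NOTIFYNEW is not shown here. In a WDK project such a code is normally built with the CTL_CODE macro in a header shared by the driver and the service. The sketch below reproduces that macro's bit layout in portable C++ so you can see what the number encodes; the device type, function number 0x800, buffered transfer method and access value are assumptions picked for illustration, not necessarily the values used in the project.

```cpp
#include <cstdint>

// Same bit layout as the CTL_CODE macro from the Windows headers:
// device type in bits 16-31, required access in bits 14-15,
// function number in bits 2-13, transfer method in bits 0-1.
constexpr std::uint32_t CtlCode(std::uint32_t deviceType, std::uint32_t function,
                                std::uint32_t method, std::uint32_t access)
{
    return (deviceType << 16) | (access << 14) | (function << 2) | method;
}

constexpr std::uint32_t kFileDeviceUnknown = 0x22; // FILE_DEVICE_UNKNOWN
constexpr std::uint32_t kMethodBuffered = 0;       // METHOD_BUFFERED
constexpr std::uint32_t kFileAnyAccess = 0;        // FILE_ANY_ACCESS

// Hypothetical definition; function numbers from 0x800 up are available for custom codes.
constexpr std::uint32_t kIoctlProcessNotifyNew =
    CtlCode(kFileDeviceUnknown, 0x800, kMethodBuffered, kFileAnyAccess);

static_assert(kIoctlProcessNotifyNew == 0x222000, "unexpected IOCTL encoding");
```

If the code is indeed defined with FILE_ANY_ACCESS, that would explain why opening the device with a desired access of 0 is enough for the DeviceIoControl call to go through.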

The driver itself has to handle the code:

VOID InterceptorDriverEvtIoDeviceControl(_In_ WDFQUEUE Queue, _In_ WDFREQUEST Request, _In_ size_t OutputBufferLength, _In_ size_t InputBufferLength, _In_ ULONG IoControlCode)
{
    NTSTATUS status = STATUS_SUCCESS;
    size_t bytesReturned = 0;
    switch (IoControlCode)
    {
    case IOCTL_PROCESS_NOTIFYNEW:
    {
        if (NotifyEvent == NULL)
            break;

        // Set a finite timeout to allow service shutdown (else thread is stuck in kernel mode).
        LARGE_INTEGER timeOut;
        timeOut.QuadPart = -10000 * 1000; // 100 ns units.
        status = KeWaitForSingleObject(NotifyEvent, Executive, KernelMode, FALSE, &timeOut);

        if (status == STATUS_SUCCESS)
        {
            unsigned long * buffer;
            if (NT_SUCCESS(WdfRequestRetrieveOutputBuffer(Request, sizeof(lastPid), &buffer, NULL)))
            {
                *buffer = lastPid;
                bytesReturned = sizeof(lastPid);
            }
        }
        break;
    }
    default:
        break;
    }
    WdfRequestCompleteWithInformation(Request, status, bytesReturned);
    return;
}

and in the driver IO queue setup register this method:

NTSTATUS InterceptorDriverQueueInitialize(_In_ WDFDEVICE Device)
{
    WDFQUEUE queue;
    NTSTATUS status;
    WDF_IO_QUEUE_CONFIG queueConfig;

    PAGED_CODE();

    WDF_IO_QUEUE_CONFIG_INIT_DEFAULT_QUEUE(&queueConfig,WdfIoQueueDispatchParallel);
    queueConfig.EvtIoDeviceControl = InterceptorDriverEvtIoDeviceControl;
    queueConfig.EvtIoStop = InterceptorDriverEvtIoStop;

    status = WdfIoQueueCreate(Device,&queueConfig,WDF_NO_OBJECT_ATTRIBUTES,&queue);
    if( !NT_SUCCESS(status) )
    {
        TraceEvents(TRACE_LEVEL_ERROR, TRACE_QUEUE, "WdfIoQueueCreate failed %!STATUS!", status);
        return status;
    }
    return status;
}

The mechanism we have here is a sort of ‘long-polling’ of the kernel driver: the service sends an IOCTL code to the driver, and the driver blocks the thread on an event which is signaled every time a process is started. Only then does the thread return to user mode, with the ID of the new process in its output buffer. To allow for Windows service shutdown, it’s advisable to wait for the event with a timeout (and poll again if the wait returned due to this timeout); otherwise the thread will be stuck in kernel mode until one more process starts, making service shutdown impossible.
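The same wait-with-timeout pattern can be modelled entirely in user mode, which makes the control flow easier to see. The following is a portable C++ analogy, not the service code: `ProcessNotifier` plays the role of the driver's event plus `lastPid`, and `PollLoop` plays the role of the service thread. All names and the 50 ms timeout are made up for this sketch.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class ProcessNotifier
{
public:
    // "Driver" side: remember the PID and signal the waiter.
    void Notify(std::uint32_t pid)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            lastPid_ = pid;
            signaled_ = true;
        }
        cv_.notify_one();
    }

    // "Service" side: block until signaled or until the timeout expires.
    // Returns false on timeout so the caller can check for shutdown.
    bool Wait(std::chrono::milliseconds timeout, std::uint32_t& pid)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!cv_.wait_for(lock, timeout, [this] { return signaled_; }))
            return false;
        signaled_ = false; // auto-reset, like a synchronization event
        pid = lastPid_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::uint32_t lastPid_ = 0;
    bool signaled_ = false;
};

// Poll until asked to stop; a timed-out wait simply loops and re-checks the flag.
template <typename Handler>
void PollLoop(ProcessNotifier& notifier, const std::atomic<bool>& stop, Handler onProcess)
{
    std::uint32_t pid = 0;
    while (!stop.load())
    {
        if (notifier.Wait(std::chrono::milliseconds(50), pid))
            onProcess(pid);
    }
}
```

Like the driver code, this keeps only the most recent PID, so a burst of process starts between two polls can overwrite earlier ones; a queue would be needed to avoid losing notifications.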

2. Remote code injection

We are back in user mode now, and we can run code once a process starts. The next step is to somehow load our JIT hooking code in every new (.NET) process, and make it start executing. There are a couple of ways in which you can do this, and most involve hacks around CreateRemoteThread. This Win32 function allows a process to start a thread in the address space of another process. The challenge is how to get the process to load our hooking code. There are 2 approaches which both require writing into the remote process memory before calling CreateRemoteThread:

  • write the hooking code directly in the remote process, and call CreateRemoteThread with an entry point in this memory
  • compile our hooking code to a dll, and only write the dll name to the remote process memory. Then call CreateRemoteThread with the address of kernel32!LoadLibrary with its argument pointing to the name

As I want to be able to hook the JIT in 32 as well as 64 bit processes, I have to compile 2 versions of the hooking code anyway. For the sake of code modularity and separation of concerns I opted for the second way, so the simple recipe I took is:

  • A. Write a dll which on load executes the hooking code, and compile it in 2 flavors (32/64 bit).
  • B. In the Windows service, on process start notification, use CreateRemoteThread + LoadLibrary to load the correct flavor of the dll in the target.

A. Auto executing library

This is quite easy, but you have to beware the dragons. A dll has a DllMain entry point with signature:

BOOL APIENTRY DllMain( HMODULE hModule, DWORD ul_reason_for_call, LPVOID lpReserved); 

This entry point is called when the dll is loaded or unloaded, or on thread creation/destruction; the reason is specified in ul_reason_for_call. The thing to beware of is written in the Remarks section of the documentation: “Access to the entry point is serialized by the system on a process-wide basis. Threads in DllMain hold the loader lock so no additional DLLs can be dynamically loaded or initialized.”. In other words: you cannot load a library in code that runs in the context of DllMain.

Why is this a problem for us ? The hooking code has to query the .NET shim library (mscoree.dll) to find out if and which .NET runtimes are loaded in the process. Since there is no a priori way to know for sure the shim library is already loaded when we try to get a module handle, our hooking code may trigger a module load and so a deadlock.

The fix is easy: just start a new thread in the DllMain entrypoint and make that thread query the shim library. This thread will start execution outside the current loader lock.

B. CreateRemoteThread + LoadLibrary

I will skip over most details here as the technique is described in great detail in various articles; however, there are some things to beware of when cross injecting from the 64 bit service into a 32 bit process. The steps in the ‘regular’ procedure are:

  1. Get access to the process
  2. Write memory with name of hooking Dll
  3. Start remote thread with entrypoint kernel32!LoadLibrary
  4. Free memory

Most of these are straightforward, but there is a problem in cross injecting in step 3, and more specifically in finding the exact address to call.

When injecting in a same bitness architecture, this is easy as we can use a trick: kernel32 is loaded into every process at the same virtual address. This address can change, but only at reboot. Using this trick, we can:

  1. Get the module handle (virtual address) of the kernel32 module in the injecting process – it will be identical in the remote process
  2. Call kernel32!GetProcAddress to find the address of LoadLibrary

When injecting cross bitness, we have 2 problems: the kernel32 loading address is different for 64 and 32 bit, and we cannot use kernel32!GetProcAddress on our 64 bit kernel32 module to find the address in the 32 bit one. To fix this, for this scenario I replaced the steps above with:

  1. Use PSAPI and call EnumProcessModulesEx on the target process, with the explicit LIST_MODULES_32BIT option (there are also 64 bit modules in a 32 bit process, go figure), get their names (GetModuleBaseName) to find kernel32, and when found get the module address from GetModuleInformation
  2. Use ImageHlp’s MapAndLoad and extract the header information from the PE header of the 32 bit kernel32. Find the export directory and together with the name directory find the RVA of LoadLibrary ourselves (Note: the RVAs in the PE are the in memory RVAs. On disk layout of a PE is different, you can use the section info header to correlate the two). Add this to the number from step 1 to find the VA of kernel32!LoadLibrary
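The address arithmetic in step 2 can be sketched in portable C++. The `Section` struct below carries only the fields of a PE section header that matter for the translation, and the function names are made up for this example; it illustrates the arithmetic, not the parsing code itself.

```cpp
#include <cstdint>
#include <vector>

// Subset of the PE section header needed to translate addresses.
struct Section
{
    std::uint32_t virtualAddress;   // RVA of the section once loaded in memory
    std::uint32_t virtualSize;      // size of the section in memory
    std::uint32_t pointerToRawData; // file offset of the section's data on disk
    std::uint32_t sizeOfRawData;    // size of the section's data on disk
};

// Translate an in-memory RVA to an offset in the file on disk.
// Returns false if no section has file data backing that RVA.
inline bool RvaToFileOffset(const std::vector<Section>& sections,
                            std::uint32_t rva, std::uint32_t& offset)
{
    for (const Section& s : sections)
    {
        if (rva >= s.virtualAddress && rva - s.virtualAddress < s.sizeOfRawData)
        {
            offset = rva - s.virtualAddress + s.pointerToRawData;
            return true;
        }
    }
    return false;
}

// The address to call in the target: module base from step 1 plus the export's RVA.
inline std::uint64_t RemoteAddress(std::uint64_t moduleBase, std::uint32_t exportRva)
{
    return moduleBase + exportRva;
}
```

`RvaToFileOffset` is what you need while walking the export directory in the mapped file; `RemoteAddress` is the final step that yields the entry point to pass to CreateRemoteThread.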

Working setup

A DbgView of the loading and injection in both flavors of .NET processes (32 and 64 bit):

[Image: DbgView output showing successful injection into 32 and 64 bit .NET processes]

 

Note: I strive to put the full code out there eventually. But it may take some time.

Compile time marshalling

In one of my posts about managed/unmanaged interop in C# (P/Invoke), I left you with the promise of answering a few questions, namely: can we manually create our own marshalling stubs in C# (at compile time), and can they be faster than the runtime generated ones ?

A bit of background

It’s funny that when I raised these questions back in March, I was still unaware of .NET Native and ASP.NET vNext, which were announced by Microsoft in the following months. The main idea behind these initiatives is to speed up especially the startup time of .NET code on resource constrained systems (mobile, cloud).
For instance, while traditionally on desktop systems intermediate language (IL) in .NET assemblies is compiled to machine code at runtime by the Just-In-Time Compiler (JIT), .NET Native moves this step to compile time. While this has several advantages, a direct consequence of the lack of runtime IL compilation is that we can’t generate and run IL code on the fly anymore. Even though not much user code uses this, the framework itself critically depends on this feature for interop marshalling stub generation. Since it is no longer available in .NET Native, this phase had to be moved to compile time as well. In fact, this step – called Marshalling and Code Generation (MCG) – is one of the elements of the .NET Native toolchain. By the way, .NET Native isn’t the first project which has done compile time marshalling. For example, it has been used for a long time in the SharpDX project.

The basic concepts are always the same: generate code which marshals the input arguments and return values, and wrap it around a calli IL instruction. Since the C# compiler will never emit a calli instruction, this actual call will always have to be implemented in IL directly (or the compiler will have to be extended, recently possible with Roslyn). Where the desktop .NET runtime (CLR) emits the whole marshalling stub in IL, the MCG generated code is C# so it requires a separate call to an IL method with the calli implementation. If you drill down far enough in the generated sources for a .NET Native project, in the end you’ll find something like this (all other classes/methods omitted for brevity):

internal unsafe static partial class Interop
{
    private static partial class McgNative
    {
        internal static partial class Intrinsics
        {
            internal static T StdCall<T>(IntPtr pfn, void* arg0, int arg1)
            {
                // This method is implemented elsewhere in the toolchain
                return default(T);
            }
        }
    }
}

Note the giveaway comment ‘this method is implemented elsewhere in the toolchain’, which you can read as ‘this is as far as we can go with C#’, and which indicates that some other tool in the .NET Native chain will emit the real body for the method.

DIY compile time marshalling

So what would the .NET Native ‘implemented elsewhere’ source look like, or: how can we do our own marshalling ? To call a native function which expects an integer argument (like the Sleep function I used in previous posts), first we would need to create an IL calli implementation which takes the address of the native callsite and the integer argument:

.assembly extern mscorlib {}
.assembly CalliImpl { .ver 0:0:0:0 }
.module CalliImpl.dll

.class public CalliHelpers
{
    .method public static void Action_uint32(native int, unsigned int32) cil managed
    {
        ldarg.1
        ldarg.0
        calli unmanaged stdcall void(int32)
        ret
    }
}

If we feed it the address of the Sleep function in kernel32 (using LoadLibrary and GetProcAddress, which we ironically invoke through P/Invoke…), we can see the CalliHelpers method on the managed stack instead of the familiar DomainBoundILStubClass. In other words, compile time marshalling in action:

Child SP IP Call Site
00f2f264 77a9d4bc [InlinedCallFrame: 00f2f264]
00f2f260 010b03e4 CalliHelpers.Action_uint32(IntPtr, UInt32)
00f2f290 010b013b TestPInvoke.Program.Main(System.String[])
00f2f428 63c92652 [GCFrame: 00f2f428]

This ‘hello world’ example is nice but ideally you would like to use well tested code. Therefore, I wanted to try and leverage the MCG from .NET Native, but it turned out to be a bit more work than I anticipated as you need to somehow inject the actual IL calli stubs to make the calls work. So perhaps in a future blog.

What about C++ interop ?

There seems to be a lot of confusion around this type of interop: some claim it to be faster, some slower. In reality it can be both depending on what you do. The C++ compiler understands both types of code (native and managed), and with it comes its main selling point: not speed but type safety. Where in C# the developer has to provide the P/Invoke signature, including calling convention and marshalling of the arguments and return values, the C++ compiler knows this already from the native header files. Therefore, in C++/CLI you simply include the header and if necessary (you are in a managed section) the compiler does the P/Invoke for you implicitly.

#include <windows.h>

using namespace System;

int main(array<System::String ^> ^args)
{
    Console::WriteLine(L"Press any key...");
    while (!Console::KeyAvailable)
    {
        Sleep(500);
    }
    return 0;
}

Sleep is an unmanaged function included from Windows.h, and invoked from a managed code body. From the managed stack in WinDbg you can see how it works:

00e3f16c 00fa2065 DomainBoundILStubClass.IL_STUB_PInvoke(UInt32)
00e3f170 00fa1fcc [InlinedCallFrame: 00e3f170] .Sleep(UInt32)
00e3f1b4 00fa1fcc .main(System.String[])
00e3f1c8 00fa1cff .mainCRTStartupStrArray(System.String[])

As you can see, there is again a marshalling stub, as in C#; it is, however, generated without developer intervention. This alone should be reason enough to use C++/CLI in heavy interop scenarios, but there are more advantages. For instance, the C++ compiler can optimize away multiple dependent calls across the interop boundary, making the whole thing faster, or can P/Invoke to native C++ class instance functions, something entirely impossible in C#. Moreover, apart from depending on external native code, it allows you to create ‘mixed mode’ or IJW (It Just Works) assemblies which contain native code as well as the usual managed code in a self-contained unit.
Despite all this, the P/Invoke offered by C++/CLI still leverages the runtime stub generation mechanism, and therefore, it’s not intrinsically faster than explicit P/Invoke.

Word of warning

Let me end with this: the aim of this post is to offer an insight into the black box called interop, not to promote DIY marshalling. If you find yourself in need of creating your own (compile time) marshalling stubs for faster interop, chances are you are doing something wrong. Especially for enterprise/web development it’s not very likely the interop itself is the bottleneck. Therefore, focussing on improving the interop scenario yourself – instead of letting the .NET framework team worry about it – is very, very likely a case of premature optimization. However, for game/datacenter/scientific scenarios, you can end up in situations where you want to use every CPU cycle efficiently, and perhaps after reading this post you’ll have a better idea of where to look.